Stream: python-dev
Topic: data funnel
Matt Long (Mar 16 2020 at 16:02):
I have been playing around with daisy chaining notebooks in a workflow:
https://github.com/matt-long/funnel
This example tries to demonstrate some interesting things that might be possible if we can call notebooks with parameters—and possibly get return arguments, perhaps as data catalog-ish type objects.
I was not able to install `papermill` in my environment due to conflicts and don't have time to investigate further, so the notebooks are not actually being called. I don't think `papermill` supports return arguments (yet?).
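For reference, a minimal sketch of what parameterized execution with `papermill` looks like, using the notebook names from this repo (the output filename and the parameter value are placeholders; `papermill` only writes an executed copy of the notebook, it does not return values to the caller):

```python
import papermill as pm

# Execute a dependency notebook with parameters; the executed copy,
# including cell outputs, is written to a new notebook file.
pm.execute_notebook(
    "_pop_region_mask.ipynb",                # input notebook
    "_pop_region_mask-output.ipynb",         # hypothetical executed copy
    parameters={"grid_name": "POP_gx1v6"},   # values injected into the "parameters" cell
)
```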
Also, `intake-esm` may be broken with the latest commits (I am using the head of master). I can't get my notebook to load the data, but it was working over the weekend.
The idea here is that `Southern-Ocean-Surface-Fields.ipynb` depends on `_cesm-le-data.ipynb`, which depends in turn on `_pop_region_mask.ipynb`.
I find that this is an effective way to develop a project.
In each of the `_*.ipynb` notebooks, I can develop and validate a particular workflow, processing data into a form ready for plotting, then caching that data locally. These processing notebooks are helpful to document low-level details of a calculation.
This also enables me to focus more effectively on the plotting/final analysis notebook with the time-consuming processing done. Eventually, this final notebook could support a web-app for visualization, the "backend" supplied by the other notebooks.
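A minimal sketch of this compute-or-load caching pattern, assuming the processed data are written to a local Zarr store (the cache path and the processing function are placeholders, not code from the repo):

```python
import os
import xarray as xr

def run_expensive_processing():
    # Placeholder for the time-consuming processing done in a "_*.ipynb" notebook.
    return xr.Dataset({"sst": ("time", [0.0, 1.0, 2.0])})

cache_path = "data/southern-ocean-surface-fields.zarr"  # hypothetical local cache

if os.path.exists(cache_path):
    # The processing notebook already ran: just load its cached product.
    ds = xr.open_zarr(cache_path)
else:
    ds = run_expensive_processing()
    os.makedirs("data", exist_ok=True)
    ds.to_zarr(cache_path, mode="w")  # cache locally for the plotting notebook
```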
Also, a notebook like `_cesm-le-data.ipynb` could be developed into a more general backend supporting CESM-LE analysis. We might imagine curating notebooks like these that people can incorporate into their own workflows. By exposing the processing code, we can enable off-piste activities more effectively than trying to package everything and handle all use cases.
@Anderson Banihirwe, perhaps you can take a look at the `intake-esm` issue?
I would like to share this more broadly if we can get it working. Feedback and contributions are welcome. I don't think I'll have much more time to work on this for a while; I have to shift focus to a proposal.
Kevin Paul (Mar 17 2020 at 16:37):
So, I'm taking a look at this this morning, and I have some thoughts.
Our current "recommendation" for users is to use Jupyter Notebooks for "interactive" analysis, and Python scripting for "batch" analysis. In this workflow, your example would be structured like so:
`Southern-Ocean-Surface-Fields.ipynb` depends on `_cesm-le-data.py` (note, not a Notebook), which depends in turn on `_pop_region_mask.py` (again, not a Notebook).
Do I understand this correctly?
The difference comes in the development of the "batch"-style scripts (`_cesm-le-data.py` and `_pop_region_mask.py`), instead of the Notebook dependencies. These scripts could have started as Jupyter Notebooks, and you could have exported them to an executable script directly from JupyterLab (i.e., "File" --> "Export Notebook As ..." --> "Export Notebook to Executable Script").
However, there are advantages to writing the "batch" operations as Notebooks, rather than scripts. Namely, Notebooks can contain "diagnostic" cells that you can look at later to see how the operation progressed. Yes, you could accomplish the same thing with "print statements" in a script, but the Notebook allows you to contain the output with the input, so you can easily access the diagnostic information later. It never gets separated or lost (as a print statement would if it was only sent to stdout).
You are also suggesting the concept of modularity, which is another "best practice" in development, splitting the pieces of a workflow into independent "modules". This makes it easier to share commonly used "modules" with others, instead of having to reinvent the wheel.
So, altogether, I think that this is great! It is clear from your Notebooks, though, that you are also hoping to be able to call the dependency Notebooks. That functionality is missing in its current form, and I agree that it is critical to getting this to work "right."
Here's what I think this all needs in order to be of the most use to other users:
1. We definitely need the ability to call/run other Notebooks from within a Notebook. But not "necessarily." In other words, we need to be able to check if the "product" of a Notebook (i.e., some file... or, perhaps, an object in memory... hard to do across independent processes) has been produced and is "up-to-date" with the Notebook itself. If the product exists and is up-to-date, then load the product instead of running. The run functionality could, potentially, be handled by `papermill`, but I have not had time to investigate this. (See the sketch after this list.)
2. Make the Notebooks and their "products" cacheable in a common place so that other people can easily use them in their own Notebooks. They should be searchable and well documented, so other people can easily find them and choose whether they "trust" them, or not. That is, there needs to be a "vetting" procedure.
NBGallery provides some of these features, namely the searchability and part of the vetting (i.e., rankings) features. But it's not quite there, yet. Personally, because NBGallery doesn't work with JupyterHub (that is, it is a separate service independent of JupyterHub), I've talked with @Anderson Banihirwe about making the searching and ranking capabilities of NBGallery hook into JupyterLab via a JupyterLab Extension. The notebook+product caching would need to be something we would have to create.
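A minimal sketch of the "run only if stale" idea in item 1, assuming the product is a local Zarr store and that `papermill` is available (the file names, product path, and mtime-based staleness check are all placeholders, not a proposed standard):

```python
import os
import papermill as pm
import xarray as xr

notebook = "_cesm-le-data.ipynb"
product = "data/cesm-le-data.zarr"  # hypothetical cached product of the notebook

def product_is_stale(notebook_path, product_path):
    # Crude staleness check: product missing, or notebook modified more recently.
    if not os.path.exists(product_path):
        return True
    return os.path.getmtime(notebook_path) > os.path.getmtime(product_path)

if product_is_stale(notebook, product):
    # Re-run the dependency notebook; it is responsible for writing `product`.
    pm.execute_notebook(notebook, "_cesm-le-data-output.ipynb")

ds = xr.open_zarr(product)  # load the (now up-to-date) product
```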
Matt Long (Mar 17 2020 at 16:59):
Thanks @Kevin Paul. Yes you have it right.
I agree, we could be developing modules for these dependencies, rather than notebooks. But the advantage of notebooks is that they offer an ability to showcase aspects of the calculation.
> 1. We definitely need the ability to call/run other Notebooks from within a Notebook. But not "necessarily." In other words, we need to be able to check if the "product" of a Notebook (i.e., some file... or, perhaps, an object in memory... hard to do across independent processes) has been produced and is "up-to-date" with the Notebook itself.
Yes! I have been thinking about using `xpersist` or similar functionality for this. This could be embedded in the dependency notebook itself. `papermill` seems like a work in progress. It does permit parameterizing and calling notebooks. It does not support return arguments. We could easily hack something based on messaging files...
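One way such a file-based hand-off could look (purely a sketch of the "messaging files" hack, not an existing `papermill` feature; the file name and product paths are made up):

```python
import json

# In the dependency notebook (e.g. _cesm-le-data.ipynb): after writing its
# products, "return" their locations by writing a small sidecar file.
returns = {"surface_fields": "data/cesm-le-surface-fields.zarr"}  # hypothetical paths
with open("_cesm-le-data.returns.json", "w") as f:
    json.dump(returns, f)

# In the calling notebook: read the sidecar to discover what was produced.
with open("_cesm-le-data.returns.json") as f:
    products = json.load(f)
```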
> 2. Make the Notebooks and their "products" cacheable in a common place so that other people can easily use them in their own Notebooks.
Yes. A starting place might be to build dataset specific notebooks. For instance, you might imagine interacting with the CESM-LE through a community-developed notebook that encapsulates many common first-cut operations (dim reductions, derived variables, etc) applied appropriately to these data.
Anderson Banihirwe (Mar 17 2020 at 17:01):
> The difference comes in the development of the "batch"-style scripts (`_cesm-le-data.py` and `_pop_region_mask.py`), instead of the Notebook dependencies. These scripts could have started as Jupyter Notebooks, and you could have exported them to an executable script directly from JupyterLab (i.e., "File" --> "Export Notebook As ..." --> "Export Notebook to Executable Script").
In some cases you may even leave the notebooks as-is and import them as modules via this package called `importnb`:
```
In [1]: from importnb import Notebook

In [3]: with Notebook():
   ...:     import _pop_region_mask  # This is the _pop_region_mask.ipynb notebook
   ...:
Cannot write to data cache '/glade/p/cesmdata/cseg'. Will not be able to download remote data files.
Use environment variable 'CESMDATAROOT' to specify another directory.
------------------------------
Writing /glade/work/abanihi/devel/gists/matt-long/funnel/notebooks/data/region-mask-POP_gx1v6-krill-ToE.zarr
<xarray.Dataset>
Dimensions:      (nlat: 384, nlon: 320, region: 3)
Coordinates:
  * region       (region) <U14 'Southern Ocean' 'WAP & Atlantic' 'Indo-Pacific'
Dimensions without coordinates: nlat, nlon
Data variables:
    masked_area  (region, nlat, nlon) float64 nan nan nan nan ... nan nan nan

In [4]: _pop_region_mask.grid_name
Out[4]: 'POP_gx1v6'

In [5]: _pop_region_mask.masked_area
Out[5]:
<xarray.DataArray 'masked_area' (region: 3, nlat: 384, nlon: 320)>
array([[[ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [1.52530781e+13, 1.52530781e+13, 1.52530781e+13, ...,  nan,  nan,  nan],
        ...,
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]],

       [[ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [1.52530781e+13, 1.52530781e+13, 1.52530781e+13, ...,  nan,  nan,  nan],
        ...,
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]],

       [[ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,  nan,  nan,  nan],
        ...,
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]]])
Coordinates:
  * region   (region) <U14 'Southern Ocean' 'WAP & Atlantic' 'Indo-Pacific'
Dimensions without coordinates: nlat, nlon
```
Kevin Paul (Mar 17 2020 at 17:07):
That's a cool feature! That definitely makes the clunky "convert to executable script" step nicer.
Kevin Paul (Mar 17 2020 at 17:53):
@Matt Long Your points are correct.
`xpersist` goes part of the way there, but I am thinking of something more closely tied to the Notebook itself. That is, the Notebook and its "products" need to be tied together very closely, in a way that makes it possible to correctly determine if the products need re-generation or not. One way to do this is to record a hash (like a git hash) in the metadata of the Notebook itself, and store the same hash in the attributes of the "product". Then, you would update the Notebook hash every time the Notebook was "updated."
This, now, introduces the concept of Notebook "updating," which naturally is different from just saving the Notebook file. Maybe this is actually "committing" rather than "saving", like a git commit. That might be a nice feature to add, and could possibly be added to JupyterLab via an extension ("Commit Notebook"). This would be a bit like NBGallery, then, with the ability to "commit" a Notebook to the "repository". Then the actual git commit hash could be stored in the product attributes.
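As an illustration of the hashing idea only (the notebook-metadata and "commit" machinery discussed above would still need to be designed), one could hash a Notebook's code cells and compare that hash against an attribute stored on the product; the attribute name and product path below are hypothetical:

```python
import hashlib
import nbformat
import xarray as xr

def notebook_hash(path):
    # Hash the source of all code cells, ignoring outputs and execution counts.
    nb = nbformat.read(path, as_version=4)
    source = "\n".join(cell.source for cell in nb.cells if cell.cell_type == "code")
    return hashlib.sha256(source.encode()).hexdigest()

def product_is_current(product_path, notebook_path):
    # A product is "current" if it carries the hash of the notebook that produced it.
    ds = xr.open_zarr(product_path)
    return ds.attrs.get("notebook_sha256") == notebook_hash(notebook_path)

# In the dependency notebook, stamp the product when writing it, e.g.:
#   ds.attrs["notebook_sha256"] = notebook_hash("_cesm-le-data.ipynb")
#   ds.to_zarr("data/cesm-le-data.zarr", mode="w")
```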
There's a lot here to flesh out, including things like:
- Marking cells in a Notebook so that their output can be "referenced" in another Notebook. @Anderson Banihirwe has already shown how to import a Notebook into another Notebook, but I don't know what the Notebook "namespace" looks like. So, this might be done already. Not sure.
- Storing separate I/O output (i.e., output file paths) in the Notebook in a referenceable form. `xpersist` might accomplish this on its own. But now `xpersist`ed datasets and variables created in a referenced Notebook need to be "findable" (somehow) from the referencing Notebook. Might need something like a docstring standard for Notebooks themselves. (Relates to the previous bullet, actually.)
- Notebooks could be "committed" to a common "repository" (using git language here, but it might not need to be git), but the Notebook products should also be "committable" to a common "repository". In the case of "products," though, the "repository" starts looking more like a "catalog," so this might hook best into `intake` (see the sketch below).
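For the catalog idea in the last bullet, a hedged sketch of what consuming a cached Notebook product through `intake` might look like (the catalog file and entry name are hypothetical):

```python
import intake

# Open a (hypothetical) catalog that lists products written by dependency notebooks.
cat = intake.open_catalog("funnel-products.yml")

# Load one product lazily as an xarray/dask-backed dataset.
ds = cat["cesm_le_surface_fields"].to_dask()
```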
...And there's probably more. We should set up a meeting to flesh out this workflow so we know what we all can actually work on to make this real.
Matt Long (Mar 17 2020 at 17:58):
I agree, a meeting would be useful for brainstorming.
Seth McGinnis (Mar 26 2020 at 22:24):
Sounds appealing to me.
An important conclusion that we came to back when we were working on this kind of idea in the form of the Capstone project: provenance and versioning need to be threaded through this entire process and captured automatically. You need to be able to start from the end product and be able to determine that the Nth step of the workflow was performed by version X of module Y. That's essential for scientific reproducibility, for data and software citation, and for being able to determine whether the results of a given analysis are affected by a bug somebody recently found.
In a serial / on-prem environment, you can accomplish this by using netcdf as the 'pipeline' format between modules; if every module reads netcdf as input, writes netcdf as output, and is well-behaved in terms of adding an appropriate entry to the "history" metadata attribute, then you have a system that automatically captures the provenance of the workflow*. I'm not sure if that really works in a parallel / cloud / zarr environment, but hopefully it's useful as a starting point for thinking about the issue.
(* For linear workflows, at least; there were some open questions about how to handle convergent workflows with multiple inputs...)
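A minimal sketch of the "well-behaved history attribute" convention described above, using xarray (the module name, version string, and file paths are placeholders):

```python
import datetime
import xarray as xr

def append_history(ds, entry):
    # Prepend a timestamped provenance entry to the CF-style "history" attribute,
    # so each module in the pipeline leaves a record of what it did.
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    previous = ds.attrs.get("history", "")
    ds.attrs["history"] = f"{stamp}: {entry}\n{previous}".strip()
    return ds

# Example usage inside one module of the pipeline (names are made up):
#   ds = xr.open_dataset("input.nc")
#   ds = append_history(ds, "_pop_region_mask v0.1: computed regional masks")
#   ds.to_netcdf("output.nc")
```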