Stream: python-dev
Topic: data funnel
Matt Long (Mar 16 2020 at 16:02):
I have been playing around with daisy chaining notebooks in a workflow:
https://github.com/matt-long/funnel
This example tries to demonstrate some interesting things that might be possible if we can call notebooks with parameters—and possibly get return arguments, perhaps as data catalog-ish type objects.
I was not able to install `papermill` in my environment due to conflicts and don't have time to investigate further, so the notebooks are not actually being called. I don't think `papermill` supports return arguments (yet?).
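For reference, a minimal sketch of what parameterized execution with `papermill` looks like, using the notebook names from this repo (the output filename and the parameter value are placeholders; `papermill` only writes an executed copy of the notebook, it does not return values to the caller):

```python
import papermill as pm

# Execute a dependency notebook with parameters; the executed copy,
# including cell outputs, is written to a new notebook file.
pm.execute_notebook(
    "_pop_region_mask.ipynb",                # input notebook
    "_pop_region_mask-output.ipynb",         # hypothetical executed copy
    parameters={"grid_name": "POP_gx1v6"},   # values injected into the "parameters" cell
)
```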
Also, `intake-esm` may be broken with the latest commits (I am using the head of master). I can't get my notebook to load the data, but it was working over the weekend.
The idea here is that `Southern-Ocean-Surface-Fields.ipynb` depends on `_cesm-le-data.ipynb`, which depends in turn on `_pop_region_mask.ipynb`.
I find that this is an effective way to develop a project.
In each of the `_*.ipynb` notebooks, I can develop and validate a particular workflow, processing data into a form ready for plotting, then caching that data locally. These processing notebooks are helpful to document low-level details of a calculation.
This also enables me to focus more effectively on the plotting/final analysis notebook with the time-consuming processing done. Eventually, this final notebook could support a web-app for visualization, the "backend" supplied by the other notebooks.
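A minimal sketch of this compute-or-load caching pattern, assuming the processed data are written to a local Zarr store (the cache path and the processing function are placeholders, not code from the repo):

```python
import os
import xarray as xr

def run_expensive_processing():
    # Placeholder for the time-consuming processing done in a "_*.ipynb" notebook.
    return xr.Dataset({"sst": ("time", [0.0, 1.0, 2.0])})

cache_path = "data/southern-ocean-surface-fields.zarr"  # hypothetical local cache

if os.path.exists(cache_path):
    # The processing notebook already ran: just load its cached product.
    ds = xr.open_zarr(cache_path)
else:
    ds = run_expensive_processing()
    os.makedirs("data", exist_ok=True)
    ds.to_zarr(cache_path, mode="w")  # cache locally for the plotting notebook
```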
Also, a notebook like `_cesm-le-data.ipynb` could be developed into a more general backend supporting CESM-LE analysis. We might imagine curating notebooks like these that people can incorporate into their own workflows. By exposing the processing code, we can enable off-piste activities more effectively than trying to package everything and handle all use cases.
@Anderson Banihirwe, perhaps you can take a look at the `intake-esm` issue?
I would like to share this more broadly if we can get it working. Feedback and contributions are welcome. I don't think I'll have much more time to work on this for a while; I have to shift focus to a proposal.
Kevin Paul (Mar 17 2020 at 16:37):
So, I'm taking a look at this this morning, and I have some thoughts.
Our current "recommendation" for users is to use Jupyter Notebooks for "interactive" analysis, and Python scripting for "batch" analysis. In this workflow, your example would be structured like so:
`Southern-Ocean-Surface-Fields.ipynb` depends on `_cesm-le-data.py` (note, not a Notebook), which depends in turn on `_pop_region_mask.py` (again, not a Notebook).
Do I understand this correctly?
The difference comes in the development of the "batch"-style scripts (`_cesm-le-data.py` and `_pop_region_mask.py`), instead of the Notebook dependencies. These scripts could have started as Jupyter Notebooks, and you could have exported them to an executable script directly from JupyterLab (i.e., "File" --> "Export Notebook As ..." --> "Export Notebook to Executable Script").
However, there are advantages to writing the "batch" operations as Notebooks, rather than scripts. Namely, Notebooks can contain "diagnostic" cells that you can look at later to see how the operation progressed. Yes, you could accomplish the same thing with "print statements" in a script, but the Notebook allows you to contain the output with the input, so you can easily access the diagnostic information later. It never gets separated or lost (as a print statement would if it was only sent to stdout).
You are also suggesting the concept of modularity, which is another "best practice" in development, splitting the pieces of a workflow into independent "modules". This makes it easier to share commonly used "modules" with others, instead of having to reinvent the wheel.
So, altogether, I think that this is great! It is clear from your Notebooks, though, that you are also hoping to be able to call the dependency Notebooks. That functionality is missing in its current form, and I agree that it is critical to getting this to work "right."
Here's what I think this all needs in order to be of the most use to other users:
1. We definitely need the ability to call/run other Notebooks from within a Notebook. But not "necessarily." In other words, we need to be able to check if the "product" of a Notebook (i.e., some file... or, perhaps, an object in memory... hard to do across independent processes) has been produced and is "up-to-date" with the Notebook itself. If the product exists and is up-to-date, then load the product instead of running. The run functionality could, potentially, be handled by `papermill`, but I have not had time to investigate this. (See the sketch after this list.)
2. Make the Notebooks and their "products" cacheable in a common place so that other people can easily use them in their own Notebooks. They should be searchable and well documented, so other people can easily find them and choose whether they "trust" them, or not. That is, there needs to be a "vetting" procedure.
NBGallery provides some of these features, namely the searchability and part of the vetting (i.e., rankings) features. But it's not quite there, yet. Personally, because NBGallery doesn't work with JupyterHub (that is, it is a separate service independent of JupyterHub), I've talked with @Anderson Banihirwe about making the searching and ranking capabilities of NBGallery hook into JupyterLab via a JupyterLab Extension. The notebook+product caching would need to be something we would have to create.
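A minimal sketch of the "run only if stale" idea in item 1, assuming the product is a local Zarr store and that `papermill` is available (the file names, product path, and mtime-based staleness check are all placeholders, not a proposed standard):

```python
import os
import papermill as pm
import xarray as xr

notebook = "_cesm-le-data.ipynb"
product = "data/cesm-le-data.zarr"  # hypothetical cached product of the notebook

def product_is_stale(notebook_path, product_path):
    # Crude staleness check: product missing, or notebook modified more recently.
    if not os.path.exists(product_path):
        return True
    return os.path.getmtime(notebook_path) > os.path.getmtime(product_path)

if product_is_stale(notebook, product):
    # Re-run the dependency notebook; it is responsible for writing `product`.
    pm.execute_notebook(notebook, "_cesm-le-data-output.ipynb")

ds = xr.open_zarr(product)  # load the (now up-to-date) product
```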
Matt Long (Mar 17 2020 at 16:59):
Thanks @Kevin Paul. Yes you have it right.
I agree, we could be developing modules for these dependencies, rather than notebooks. But the advantage of notebooks is that they offer an ability to showcase aspects of the calculation.
> 1. We definitely need the ability to call/run other Notebooks from within a Notebook. But not "necessarily." In other words, we need to be able to check if the "product" of a Notebook (i.e., some file... or, perhaps, an object in memory... hard to do across independent processes) has been produced and is "up-to-date" with the Notebook itself.
Yes! I have been thinking about using `xpersist` or similar functionality for this. This could be embedded in the dependency notebook itself. `papermill` seems like a work in progress. It does permit parameterizing and calling notebooks. It does not support return arguments. We could easily hack something based on messaging files...
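One way such a file-based hand-off could look (purely a sketch of the "messaging files" hack, not an existing `papermill` feature; the file name and product paths are made up):

```python
import json

# In the dependency notebook (e.g. _cesm-le-data.ipynb): after writing its
# products, "return" their locations by writing a small sidecar file.
returns = {"surface_fields": "data/cesm-le-surface-fields.zarr"}  # hypothetical paths
with open("_cesm-le-data.returns.json", "w") as f:
    json.dump(returns, f)

# In the calling notebook: read the sidecar to discover what was produced.
with open("_cesm-le-data.returns.json") as f:
    products = json.load(f)
```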
> 2. Make the Notebooks and their "products" cacheable in a common place so that other people can easily use them in their own Notebooks.
Yes. A starting place might be to build dataset specific notebooks. For instance, you might imagine interacting with the CESM-LE through a community-developed notebook that encapsulates many common first-cut operations (dim reductions, derived variables, etc) applied appropriately to these data.
Anderson Banihirwe (Mar 17 2020 at 17:01):
> The difference comes in the development of the "batch"-style scripts (`_cesm-le-data.py` and `_pop_region_mask.py`), instead of the Notebook dependencies. These scripts could have started as Jupyter Notebooks, and you could have exported them to an executable script directly from JupyterLab (i.e., "File" --> "Export Notebook As ..." --> "Export Notebook to Executable Script").
In some cases you may even leave the notebooks as-is and import them as modules via this package called `importnb`:
```
In [1]: from importnb import Notebook

In [3]: with Notebook():
   ...:     import _pop_region_mask  # This is the _pop_region_mask.ipynb notebook
   ...:
Cannot write to data cache '/glade/p/cesmdata/cseg'. Will not be able to download remote data files.
Use environment variable 'CESMDATAROOT' to specify another directory.
------------------------------
Writing /glade/work/abanihi/devel/gists/matt-long/funnel/notebooks/data/region-mask-POP_gx1v6-krill-ToE.zarr
<xarray.Dataset>
Dimensions:      (nlat: 384, nlon: 320, region: 3)
Coordinates:
  * region       (region) <U14 'Southern Ocean' 'WAP & Atlantic' 'Indo-Pacific'
Dimensions without coordinates: nlat, nlon
Data variables:
    masked_area  (region, nlat, nlon) float64 nan nan nan nan ... nan nan nan

In [4]: _pop_region_mask.grid_name
Out[4]: 'POP_gx1v6'

In [5]: _pop_region_mask.masked_area
Out[5]:
<xarray.DataArray 'masked_area' (region: 3, nlat: 384, nlon: 320)>
array([[[ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [1.52530781e+13, 1.52530781e+13, 1.52530781e+13, ...,  nan,  nan,  nan],
        ...,
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]],

       [[ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [1.52530781e+13, 1.52530781e+13, 1.52530781e+13, ...,  nan,  nan,  nan],
        ...,
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]],

       [[ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,  nan,  nan,  nan],
        ...,
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]]])
Coordinates:
  * region   (region) <U14 'Southern Ocean' 'WAP & Atlantic' 'Indo-Pacific'
Dimensions without coordinates: nlat, nlon
```
Kevin Paul (Mar 17 2020 at 17:07):
That's a cool feature! That definitely makes the clunky "convert to executable script" step nicer.
Kevin Paul (Mar 17 2020 at 17:53):
@Matt Long Your points are correct.
`xpersist` goes part of the way there, but I am thinking of something more closely tied to the Notebook itself. That is, the Notebook and its "products" need to be tied together very closely, in a way that makes it possible to correctly determine if the products need re-generation or not. One way to do this is to record a hash (like a git hash) in the metadata of the Notebook itself, and store the same hash in the attributes of the "product". Then, you would update the Notebook hash every time the Notebook was "updated."
This, now, introduces the concept of Notebook "updating," which naturally is different from just saving the Notebook file. Maybe this is actually "committing" rather than "saving", like a git commit. That might be a nice feature to add, and could possibly be added to JupyterLab via an extension ("Commit Notebook"). This would be a bit like NBGallery, then, with the ability to "commit" a Notebook to the "repository". Then the actual git commit hash could be stored in the product attributes.
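As an illustration of the hashing idea only (the notebook-metadata and "commit" machinery discussed above would still need to be designed), one could hash a Notebook's code cells and compare that hash against an attribute stored on the product; the attribute name and product path below are hypothetical:

```python
import hashlib
import nbformat
import xarray as xr

def notebook_hash(path):
    # Hash the source of all code cells, ignoring outputs and execution counts.
    nb = nbformat.read(path, as_version=4)
    source = "\n".join(cell.source for cell in nb.cells if cell.cell_type == "code")
    return hashlib.sha256(source.encode()).hexdigest()

def product_is_current(product_path, notebook_path):
    # A product is "current" if it carries the hash of the notebook that produced it.
    ds = xr.open_zarr(product_path)
    return ds.attrs.get("notebook_sha256") == notebook_hash(notebook_path)

# In the dependency notebook, stamp the product when writing it, e.g.:
#   ds.attrs["notebook_sha256"] = notebook_hash("_cesm-le-data.ipynb")
#   ds.to_zarr("data/cesm-le-data.zarr", mode="w")
```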
There's a lot here to flesh out, including things like:
- Marking cells in a Notebook so that their output can be "referenced" in another Notebook. @Anderson Banihirwe has already shown how to import a Notebook into another Notebook, but I don't know what the Notebook "namespace" looks like. So, this might be done already. Not sure.
- Storing separate I/O output (i.e., output file paths) in the Notebook in a referenceable form. `xpersist` might accomplish this on its own. But now `xpersist`ed datasets and variables created in a referenced Notebook need to be "findable" (somehow) from the referencing Notebook. Might need something like a docstring standard for Notebooks themselves. (Relates to the previous bullet, actually.)
- Notebooks could be "committed" to a common "repository" (using git language here, but it might not need to be git), but the Notebook products should also be "committable" to a common "repository". In the case of "products," though, the "repository" starts looking more like a "catalog," so this might hook best into `intake` (see the sketch below).
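For the catalog idea in the last bullet, a hedged sketch of what consuming a cached Notebook product through `intake` might look like (the catalog file and entry name are hypothetical):

```python
import intake

# Open a (hypothetical) catalog that lists products written by dependency notebooks.
cat = intake.open_catalog("funnel-products.yml")

# Load one product lazily as an xarray/dask-backed dataset.
ds = cat["cesm_le_surface_fields"].to_dask()
```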
...And there's probably more. We should set up a meeting to flesh out this workflow so we know what we all can actually work on to make this real.
Matt Long (Mar 17 2020 at 17:58):
I agree, a meeting would be useful for brainstorming.
Seth McGinnis (Mar 26 2020 at 22:24):
Sounds appealing to me.
An important conclusion that we came to back when we were working on this kind of idea in the form of the Capstone project: provenance and versioning need to be threaded through this entire process and captured automatically. You need to be able to start from the end product and be able to determine that the Nth step of the workflow was performed by version X of module Y. That's essential for scientific reproducibility, for data and software citation, and for being able to determine whether the results of a given analysis are affected by a bug somebody recently found.
In a serial / on-prem environment, you can accomplish this by using netcdf as the 'pipeline' format between modules; if every module reads netcdf as input, writes netcdf as output, and is well-behaved in terms of adding an appropriate entry to the "history" metadata attribute, then you have a system that automatically captures the provenance of the workflow*. I'm not sure if that really works in a parallel / cloud / zarr environment, but hopefully it's useful as a starting point for thinking about the issue.
(* For linear workflows, at least; there were some open questions about how to handle convergent workflows with multiple inputs...)
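A minimal sketch of the "well-behaved history attribute" convention described above, using xarray (the module name, version string, and file paths are placeholders):

```python
import datetime
import xarray as xr

def append_history(ds, entry):
    # Prepend a timestamped provenance entry to the CF-style "history" attribute,
    # so each module in the pipeline leaves a record of what it did.
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    previous = ds.attrs.get("history", "")
    ds.attrs["history"] = f"{stamp}: {entry}\n{previous}".strip()
    return ds

# Example usage inside one module of the pipeline (names are made up):
#   ds = xr.open_dataset("input.nc")
#   ds = append_history(ds, "_pop_region_mask v0.1: computed regional masks")
#   ds.to_netcdf("output.nc")
```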