A group of us (@Matt Long, @Anderson Banihirwe, and I, among others) are looking into the feasibility of having CESM generate an intake-esm catalog with every run. This would allow a CESM diagnostic package to rely on intake-esm to read data into Datasets and abstract away the need to know what data is in what file in the archive.
This will likely change the intake-esm interface quite a bit - currently, open_esm_datastore() expects a path to a json file as an argument; that file points to a csv file (which actually contains the data, such as file paths and variable names) and also tells intake-esm how to interpret that data. We hope to standardize the information in the json file, and eventually update the open_esm_datastore() API such that it expects a path to the csv file and another argument, such as spec="cesm", that replaces the need for the json file.
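To make the contrast concrete, here is a rough sketch (the file names are placeholders, and the spec argument is the proposed addition, not an existing option):

```python
import intake

# Current interface: the json file describes how to interpret the csv it points to.
col = intake.open_esm_datastore("cesm-catalog.json")

# Proposed interface (hypothetical): pass the csv directly, with a spec name
# standing in for the information currently carried by the json file.
col = intake.open_esm_datastore("cesm-catalog.csv", spec="cesm")
```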
We are still in the early stages of this project, and encourage any feedback, concerns, or assistance you want to offer. I suspect this will be my focus in #hack-projects for the next several months.
Please include @Sheri Mickelson in these discussions.
In general, I am in favor of this. Just be sure to keep intake-esm general, and not CESM-specific. If you can encode the "cesm" spec in a way that makes it easier for other models to add their own spec (or, even better, make the spec a separate package independent of intake-esm or any of its other core packages... so the spec is a plug-in), that's ideal.
It should be an esm spec.
We hope to standardize the information in the json file, and eventually update the open_esm_datastore() API such that it expects a path to the csv file and another argument such as spec="cesm" that replaces the need
Today intake-esm allows instantiating an instance of the esm_datastore class without using a JSON file. All you have to do is pass in a data frame (content of the CSV) and a dictionary (content of the JSON file) as follows:

```python
import pandas as pd
import intake_esm

esmcol_obj = pd.DataFrame()  # content of the CSV
esmcol_data = {...}          # content of the JSON file
col = intake_esm.core.esm_datastore(esmcol_obj, esmcol_data=esmcol_data)
```
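Continuing that snippet, here is an illustrative (not required) example of what the two objects might contain; the column names and the file path are placeholders, and the dictionary keys follow the ESM collection spec that intake-esm normally reads from the JSON file:

```python
esmcol_obj = pd.DataFrame(
    {
        "component": ["ocn"],
        "variable": ["TEMP"],
        "path": ["/placeholder/archive/case/ocn/hist/case.pop.h.0001-01.nc"],
    }
)
esmcol_data = {
    "esmcat_version": "0.1.0",
    "id": "my-cesm-catalog",
    "attributes": [{"column_name": "component"}, {"column_name": "variable"}],
    "assets": {"column_name": "path", "format": "netcdf"},
}
```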
I would recommend investigating this approach before trying out the "spec" plugin approach...
I like the idea of having the intake-esm catalog generated as part of the model run. I would also like to see the CSV file contain more information in order to enhance semantic interoperability and data discovery. Can we find a time for me to do a brief demo of a Notebook that uses an enhanced Intake-ESM CSV file to illustrate at least part of the concept?
Can we find a time for me to do a brief demo of a Notebook that uses an enhanced Intake-ESM CSV file to illustrate at least part of the concept?
I'd love to see this in action. Is this notebook publicly available?
The Notebook works in JupyterLab on my laptop, but fails in the Pangeo JupyterHub because of some dependency issues that I haven't had time to debug.
Are you folks having a regularly-scheduled call that I could join to do a brief demo from my laptop, or should I set up a separate event?
I don't think there is anything regularly scheduled except for the Friday half-day hacks in the afternoon. But I'd really like to see this demo, and I don't work on Friday. So, maybe we could schedule something special? I'd like the whole Xdev team to join.
@Kevin Paul Please feel free to look at my Google Calendar and propose a time next week that works for the Xdev group. Note that I am currently riding out the lockdown in Maryland on eastern time zone.
Will do.
Invite sent. Let me know if that doesn't work.
Is the proposed interface a reformulation of what's already specified in existing catalog structures? If so, it would be straightforward to port other catalogs, like CMIP6, to this format as well. It would be a good test of the interface's generality.
Is the proposed interface a reformulation of what's already specified in existing catalog structures
It's my understanding that the focus here is CESM catalogs.
Meaning that generality is not a concern right now?
Generality should be a concern.
Meaning that generality is not a concern right now?
generality in catalog generation or catalog usage?
The code that we publish and other people install should be general. If it can be used to generate specific catalogs for something like a CESM run, then that's great.
I was thinking catalog usage; maybe that is not correct?
I’ve been thinking more about this, and I grow increasingly concerned about developing technology to deal with CESM issues. In my mind, many of the problems people have with ESM data stem from non-standardization, and the “need” to make modifications to intake-ESM to accommodate CESM needs is one of those issues. I believe that we should not be spending time developing solutions to make the tools work with CESM data, but spending more time making CESM data “standardized.”
Now, I know that most CESM developers say that CESM is standardized because it is CF compliant. But CF is not a standard. It is a set of conventions. The only thing we have close to a standard for ESM data is CMIP. So, I think we should be spending our effort getting CESM to output CMIP-compliant data.
My $0.02.
@Kevin Paul, thanks for your perspective.
I take exception to your statement
I grow increasingly concerned about developing technology to deal with CESM issues
We develop CESM and are in urgent need of frameworks for effective analysis. In my view, our primary objectives are to develop general tools that work with CESM. Conventions are great, but we also need a general framework to provide work-arounds when they inevitably fall short. There is no such thing as full CMIP-compliance. We output numerous variables that aren't even defined in CMIP.
I didn’t mean to offend. And perhaps I should not have used the word “issues” because that implies “problems”.
I believe the shortfalls you are referring to primarily come from not having a fully defined standard. And as you point out, CMIP is not fully defined. Rather, it is the closest thing we have. I think we should be involved in fully defining that standard. And perhaps modifications to CESM could lead the way. However, it is generally believed that properly defined standards, and rigid conformance to those standards, reduce the pitfalls to which you allude.
Workarounds are always needed in a pinch. But I fear that the workaround becomes the practice, and it then becomes the “approved solution.” Also, workarounds should be minimal effort, and I am concerned about the level of effort being put into workarounds.
I believe that we should not be spending time developing solutions to make the tools work with CESM data, but spending more time making CESM data “standardized.”
Having some form of standardization across the CESM components would help a lot with the catalog generation (at least). I am not familiar with the CESM run workflow, but it is my understanding that a lot of the information we are having to assemble in order to build the catalog (see this comment by @Sheri Mickelson) consists of attributes that could be persisted in the global attributes of the model output.
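As a rough sketch of what I mean (the attribute names here are hypothetical; CESM history files may or may not carry them today):

```python
import glob

import pandas as pd
import xarray as xr

rows = []
for path in sorted(glob.glob("archive/case/ocn/hist/*.nc")):  # placeholder path
    with xr.open_dataset(path) as ds:
        # If the model persisted these in its global attributes, the catalog
        # builder could read them directly instead of parsing file names.
        rows.append(
            {
                "case": ds.attrs.get("case_id"),        # hypothetical attribute
                "component": ds.attrs.get("component"), # hypothetical attribute
                "variables": ",".join(ds.data_vars),
                "path": path,
            }
        )

pd.DataFrame(rows).to_csv("catalog.csv", index=False)
```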
Yes. I believe @Anderson Banihirwe’s point is that we might need to work on both sides of the problem: both developing general tools that operate on standardized data and modifying the model output to better conform to those standards.
In many ways this is the same issue as deciding what new features should go into the upstream codebase vs into a dependency or new package. In this case, the upstream codebase is CESM.
In many ways this is the same issue as deciding what new features should go into the upstream codebase vs into a dependency or new package. In this case, the upstream codebase is CESM.
Does CESM have an existing framework/process for proposing new features across all components or does each component have its own policy?
we might need to work on both sides of the problem: both developing general tools that operate on standardized data and modifying the model output to better conform to those standards
Absolutely.
@Kevin Paul, no offense taken. (I took "exception"!)
I think we need to focus on discretizing the elements of a workflow into the right Python "objects".
Intake-esm should have a relatively narrow focus, which is to bridge the gap between the vagaries of file systems and files, enabling a semantic API for getting an xarray.Dataset from collections of files. That's it.
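That narrow focus is already visible in the current API; a minimal sketch, assuming a catalog is on hand (the path and the search terms are placeholders):

```python
import intake

# Open a catalog, query it semantically, and get back xarray Datasets.
col = intake.open_esm_datastore("cesm-catalog.json")
subset = col.search(component="ocn", variable="TEMP")
dsets = subset.to_dataset_dict()  # {key: xarray.Dataset}
```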
@Michael Levy started this thread with a very simple objective: get CESM to write intake-esm catalog files. No changes to the model's output data are needed to accomplish this. We just need to generate a table of the model output.
I am starting to make notes on a general integration framework as a design doc for a package called integral here: https://hackmd.io/@matt-long/ryfGpDSsL/edit
I welcome contributions. This package would aim to integrate intake-esm and xpersist to provide an extensible framework for operating on datasets with full provenance tracking and a natural way to accommodate peculiarities in a general way.
CESM scientists are going to write code to analyze CESM. Integral will aim to provide a general scheme for applying this code in the context of routine model diagnostics and ad hoc, exploratory work.
Intake-esm should have a relatively narrow focus, which is to bridge the gap between the vagaries of file systems and files, enabling a semantic API for getting an xarray.Dataset from collections of files. That's it.
Agreed.
Michael Levy started this thread with a very simple objective: get CESM to write intake-esm catalog files. No changes to the model's output data are needed to accomplish this. We just need to generate a table of the model output.
...But maybe some changes to model output might make this easier to accomplish? In a way less prone to future errors? Those are questions, not statements. I'm curious. Perhaps @Michael Levy would have something to say on that.
I am starting to make notes on a general integration framework as a design doc for a package called integral here: https://hackmd.io/@matt-long/ryfGpDSsL/edit
I will definitely take a look! I'm actually really interested in this.
CESM scientists are going to write code to analyze CESM. Integral will aim to provide a general scheme for applying this code in the context of routine model diagnostics and ad hoc, exploratory work.
:+1:
@Michael Levy started this thread with a very simple objective: get CESM to write intake-esm catalog files. No changes to the model's output data are needed to accomplish this. We just need to generate a table of the model output.
I think a prototype to accomplish this is worth it (as a short-term solution). I echo @Kevin Paul's comments on modifying the model output to better conform to some standards as a long-term solution.
...Well, and keep in mind that we might need to actually develop some standards along the way. :smiley:
I'll jump into the current conversation in a minute, but first a commercial break: #hack-projects > generating intake catalog for CESM output
...Well, and keep in mind that we might need to actually develop some standards along the way. :smiley:
That's why I was asking about this
Does CESM have an existing framework/process for proposing new features across all components or does each component have its own policy?
I'll let you have the last word on standards, I'll just note that we can't solve all the problems at once. We should have lofty aspirations and, with these in mind, follow the most direct path to extensible solutions.
Part of me thinks that discussing whether or not we need more information in CESM history files is a little orthogonal to the goal of having the CESM workflow generate the catalog. It would definitely help in generating catalogs after the fact, but for the proposed idea of either updating the timeseries generation step in CESM postprocessing or the short-term archiver, both of those tools will have access to the XML files in $CASEROOT that already contain the information we'd want to put in history files.
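For example, those tools could pull case metadata straight out of $CASEROOT with CIME's xmlquery; a minimal sketch, assuming the standard xmlquery script in the case root (the helper function and the caseroot path are placeholders of mine):

```python
import subprocess

def query_case(caseroot, variable):
    # CIME's xmlquery with --value prints just the value of an XML variable.
    result = subprocess.run(
        ["./xmlquery", "--value", variable],
        cwd=caseroot, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

case_name = query_case("/path/to/caseroot", "CASE")  # placeholder caseroot
```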
I completely agree.
@Michael Levy: Ah! So this is a proposal to add/amend the CESM post-processing code? (I.e., this is actually changing CESM codebase?)
@Kevin Paul, the object is to get the model to write intake catalog files. By hook or by crook. I would suggest that this is a relatively autonomous tool that is called by post-processing hooks in the run infrastructure.
And if the goal is to distribute the catalog with the archive (and introduce a mechanism for combining catalogs for ensembles), then I think the only use for generating a catalog from CESM output will be to catalog old runs, which won't have the proposed metadata anyway.
Ah! So this is a proposal to add/amend the CESM post-processing code? (I.e., this is actually changing CESM codebase?)
@Kevin Paul Yes! Sorry if this was unclear, but the bulk of this work will be updating CESM (I think either post-processing or the archiver, which might technically be in CIME). I proposed changing the intake-esm interface to require a single csv file rather than a csv file AND a json file because I've found that I generate the same json file for every CESM catalog, except they all point to different csv files; that sort of redundancy led me to think that the intake-esm interface could be improved once everything is standardized, but I don't think I did a good job of separating that particular wishlist item from the project description as a whole.
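To illustrate the redundancy: the json files I generate differ essentially only in the csv path they point to, so they could be stamped out by something like this (a sketch; the column names are the ones I happen to use, not anything intake-esm requires):

```python
import json

def write_esmcol_json(csv_path, json_path, catalog_id="cesm-catalog"):
    # Everything except catalog_file is identical from one CESM catalog to
    # the next -- exactly the boilerplate a spec="cesm" argument could replace.
    spec = {
        "esmcat_version": "0.1.0",
        "id": catalog_id,
        "catalog_file": csv_path,
        "attributes": [
            {"column_name": "component"},
            {"column_name": "stream"},
            {"column_name": "variable"},
        ],
        "assets": {"column_name": "path", "format": "netcdf"},
    }
    with open(json_path, "w") as f:
        json.dump(spec, f, indent=2)
```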
@Michael Levy @Matt Long I think I'm getting it. And no worries about communication. That's the beauty of Zulip! What you are proposing sounds laudable. ...I have no idea how to do it. :smiley: I'll leave that to the experts.
@Matt Long makes a good point -- my "Yes!" response is more based on a blueprint that mostly exists in my head and will likely change, but CESM will either generate the catalog itself or call an independent tool that is CESM-aware and can pull data out of the case root.
But I think everyone here agrees that intake-esm / intake-esm-datastore will not be updated to generate these catalogs... there might be suggestions on API improvements resulting from the project, but they will rightfully remain model-agnostic.
I think I'm finally getting it. :smile:
Now, I know that most CESM developers say that CESM is standardized because it is CF compliant. But CF is not a standard. It is a set of conventions. The only thing we have close to a standard for ESM data is CMIP. So, I think we should be spending our effort getting CESM to output CMIP-compliant data.
Much of the Earth science community uses and is familiar with the CF Conventions, whereas only CMIP people are familiar with the CMIP conventions, so I would argue that CF is "more standard." As far as I know, neither has been officially adopted by a standards-developing organization, although CF-NetCDF3 was adopted by the Open Geospatial Consortium in ~2010 (thanks to the work of Ben Domenico at Unidata).
I am curious which aspects of CESM should be made more standard, and whether some aspects are covered only by CF or only by CMIP, or are covered by both in a contradictory fashion.
Personally, I find it unfortunate that:
- The variable names are different in CESM/CMIP and CESM/LENS.
- CF Standard Names are buried in the metadata rather than exposed at the file/object level.
- In LENS, there are a number of gratuitous incompatibilities requiring a human, e.g.:
At least some of these things can be made more standard at the Intake-ESM level, as I will demonstrate on Wed Mar 27 at 09:00 MDT.
@Jeff de La Beaujardiere , delayed response here...
Regarding fields having different short names but the same long name, it's not clear what behavior you would like to see.
Do you want 1) different long names, or 2) the same short names? I see pros and cons to either alternative.