A group of us (@Matt Long, @Anderson Banihirwe, and I, among others) are looking into the feasibility of having CESM generate an intake-esm catalog with every run. This would allow a CESM diagnostic package to rely on intake-esm to read data into Datasets and abstract away the need to know what data is in what file in the archive.
This will likely change the intake-esm interface quite a bit - currently, open_esm_datastore() expects a path to a json file as an argument; that file points to a csv file (which actually contains the data, such as file paths and variable names) and also tells intake-esm how to interpret that data. We hope to standardize the information in the json file, and eventually update the open_esm_datastore() API such that it expects a path to the csv file and another argument, such as spec="cesm", that replaces the need for the json file.
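To make the contrast concrete, here is a rough sketch (the file names are placeholders, and the spec argument is the proposed addition, not an existing option):

```python
import intake

# Current interface: the json file describes how to interpret the csv it points to.
col = intake.open_esm_datastore("cesm-catalog.json")

# Proposed interface (hypothetical): pass the csv directly, with a spec name
# standing in for the information currently carried by the json file.
col = intake.open_esm_datastore("cesm-catalog.csv", spec="cesm")
```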
We are still in the early stages of this project, and encourage any feedback, concerns, or assistance you want to offer. I suspect this will be my focus in #hack-projects for the next several months.
Please include @Sheri Mickelson in these discussions.
In general, I am in favor of this. Just be sure to keep intake-esm general, and not CESM-specific. If you can encode the "cesm" spec in a way that makes it easier for other models to add their own spec (or, even better, make the spec a separate package independent of intake-esm or any of its other core packages... so the spec is a plug-in), that's ideal.
It should be an esm spec.
We hope to standardize the information in the json file, and eventually update the open_esm_datastore() API such that it expects a path to the csv file and another argument such as spec="cesm" that replaces the need
Today intake-esm allows instantiating an instance of the esm_datastore class without using a JSON file. All you have to do is pass in a data frame (content of the CSV) and a dictionary (content of the JSON file) as follows:

```python
import pandas as pd
import intake_esm

esmcol_obj = pd.DataFrame()  # content of the CSV
esmcol_data = {...}          # content of the JSON file
col = intake_esm.core.esm_datastore(esmcol_obj, esmcol_data=esmcol_data)
```
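Continuing that snippet, here is an illustrative (not required) example of what the two objects might contain; the column names and the file path are placeholders, and the dictionary keys follow the ESM collection spec that intake-esm normally reads from the JSON file:

```python
esmcol_obj = pd.DataFrame(
    {
        "component": ["ocn"],
        "variable": ["TEMP"],
        "path": ["/placeholder/archive/case/ocn/hist/case.pop.h.0001-01.nc"],
    }
)
esmcol_data = {
    "esmcat_version": "0.1.0",
    "id": "my-cesm-catalog",
    "attributes": [{"column_name": "component"}, {"column_name": "variable"}],
    "assets": {"column_name": "path", "format": "netcdf"},
}
```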
I would recommend investigating this approach before trying out the "spec" plugin approach...
I like the idea of having the intake-esm catalog generated as part of the model run. I would also like to see the CSV file contain more information in order to enhance semantic interoperability and data discovery. Can we find a time for me to do a brief demo of a Notebook that uses an enhanced Intake-ESM CSV file to illustrate at least part of the concept?
Can we find a time for me to do a brief demo of a Notebook that uses an enhanced Intake-ESM CSV file to illustrate at least part of the concept?
I'd love to see this in action. Is this notebook publicly available?
The Notebook works in JupyterLab on my laptop, but fails in the Pangeo JupyterHub because of some dependency issues that I haven't had time to debug.
Are you folks having a regularly-scheduled call that I could join to do a brief demo from my laptop, or should I set up a separate event?
I don't think there is anything regularly scheduled except for the Friday half-day hacks in the afternoon. But I'd really like to see this demo, and I don't work on Friday. So, maybe we could schedule something special? I'd like the whole Xdev team to join.
@Kevin Paul Please feel free to look at my Google Calendar and propose a time next week that works for the Xdev group. Note that I am currently riding out the lockdown in Maryland on eastern time zone.
Will do.
Invite sent. Let me know if that doesn't work.
Is the proposed interface a reformulation of what's already specified in existing catalog structures? If so, it would be straightforward to port other catalogs, like CMIP6, to this format as well. It would be a good test of the interface's generality.
Is the proposed interface a reformulation of what's already specified in existing catalog structures
It's my understanding that the focus here is CESM catalogs.
Meaning that generality is not a concern right now?
Generality should be a concern.
Meaning that generality is not a concern right now?
generality in catalog generation or catalog usage?
The code that we publish and other people install should be general. If it can be used to generate specific catalogs for something like a CESM run, then that's great.
I was thinking catalog usage; maybe that is not correct?
I’ve been thinking more about this, and I grow increasingly concerned about developing technology to deal with CESM issues. In my mind, many of the problems people have with ESM data stem from non-standardization, and the “need” to make modifications to intake-ESM to accommodate CESM needs is one of those issues. I believe that we should not be spending time developing solutions to make the tools work with CESM data, but spending more time making CESM data “standardized.”
Now, I know that most CESM developers say that CESM is standardized because it is CF compliant. But CF is not a standard. It is a set of conventions. The only thing we have close to a standard for ESM data is CMIP. So, I think we should be spending our effort getting CESM to output CMIP-compliant data.
My $0.02.
@Kevin Paul, thanks for your perspective.
I take exception to your statement
I grow increasingly concerned about developing technology to deal with CESM issues
We develop CESM and are in urgent need of frameworks for effective analysis. In my view, our primary objectives are to develop general tools that work with CESM. Conventions are great, but we also need a general framework to provide work-arounds when they inevitably fall short. There is no such thing as full CMIP-compliance. We output numerous variables that aren't even defined in CMIP.
I didn’t mean to offend. And perhaps I should not have used the word “issues” because that implies “problems”.
I believe the shortfalls you are referring to primarily come from not having a fully defined standard. And as you point out, CMIP is not fully defined. Rather, it is the closest thing we have. I think we should be involved in fully defining that standard. And perhaps modifications to CESM could lead the way. However, it is generally believed that properly defined standards, and rigid conformance to those standards, reduce the pitfalls to which you allude.
Workarounds are always needed in a pinch. But I fear that the workaround becomes the practice, and it then becomes the “approved solution.” Also, workarounds should be minimal effort, and I am concerned about the level of effort being put into workarounds.
I believe that we should not be spending time developing solutions to make the tools work with CESM data, but spending more time making CESM data “standardized.”
Having some form of standardization across the CESM components would help a lot with the catalog generation (at least). I am not familiar with the CESM run workflow, but it is my understanding that a lot of the information we are having to assemble in order to build the catalog (see this comment by @Sheri Mickelson) consists of attributes that could be persisted in the global attributes of the model output.
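As a rough sketch of what I mean (the attribute names here are hypothetical; CESM history files may or may not carry them today):

```python
import glob

import pandas as pd
import xarray as xr

rows = []
for path in sorted(glob.glob("archive/case/ocn/hist/*.nc")):  # placeholder path
    with xr.open_dataset(path) as ds:
        # If the model persisted these in its global attributes, the catalog
        # builder could read them directly instead of parsing file names.
        rows.append(
            {
                "case": ds.attrs.get("case_id"),        # hypothetical attribute
                "component": ds.attrs.get("component"), # hypothetical attribute
                "variables": ",".join(ds.data_vars),
                "path": path,
            }
        )

pd.DataFrame(rows).to_csv("catalog.csv", index=False)
```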
Yes. I believe @Anderson Banihirwe’s point is that we might need to work on both sides of the problem: both developing general tools that operate on standardized data and modifying the model output to better conform to those standards.
In many ways this is the same issue as deciding what new features should go into the upstream codebase vs into a dependency or new package. In this case, the upstream codebase is CESM.
In many ways this is the same issue as deciding what new features should go into the upstream codebase vs into a dependency or new package. In this case, the upstream codebase is CESM.
Does CESM have an existing framework/process for proposing new features across all components or does each component have its own policy?
we might need to work on both sides of the problem: both developing general tools that operate on standardized data and modifying the model output to better conform to those standards
Absolutely.
@Kevin Paul, no offense taken. (I took "exception"!)
I think we need to focus on discretizing the elements of a workflow into the right Python "objects".
Intake-esm should have a relatively narrow focus, which is to bridge the gap between the vagaries of file systems and files, enabling a semantic API for getting an xarray.Dataset from collections of files. That's it.
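That narrow focus is already visible in the current API; a minimal sketch, assuming a catalog is on hand (the path and the search terms are placeholders):

```python
import intake

# Open a catalog, query it semantically, and get back xarray Datasets.
col = intake.open_esm_datastore("cesm-catalog.json")
subset = col.search(component="ocn", variable="TEMP")
dsets = subset.to_dataset_dict()  # {key: xarray.Dataset}
```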
@Michael Levy started this thread with a very simple objective: get CESM to write intake-esm catalog files. No changes to the model's output data are needed to accomplish this. We just need to generate a table of the model output.
I am starting to make notes on a general integration framework as a design doc for a package called integral here: https://hackmd.io/@matt-long/ryfGpDSsL/edit
I welcome contributions. This package would aim to integrate intake-esm and xpersist to provide an extensible framework for operating on datasets with full provenance tracking and a natural way to accommodate peculiarities in a general way.
CESM scientists are going to write code to analyze CESM. Integral will aim to provide a general scheme for applying this code in the context of routine model diagnostics and ad hoc, exploratory work.
Intake-esm should have a relatively narrow focus, which is to bridge the gap between the vagaries of file systems and files, enabling a semantic API for getting an xarray.Dataset from collections of files. That's it.
Agreed.
Michael Levy started this thread with a very simple objective: get CESM to write intake-esm catalog files. No changes to the model's output data are needed to accomplish this. We just need to generate a table of the model output.
...But maybe some changes to model output might make this easier to accomplish? In a way less prone to future errors? Those are questions, not statements. I'm curious. Perhaps @Michael Levy would have something to say on that.
I am starting to make notes on a general integration framework as a design doc for a package called integral here: https://hackmd.io/@matt-long/ryfGpDSsL/edit
I will definitely take a look! I'm actually really interested in this.
CESM scientists are going to write code to analyze CESM. Integral will aim to provide a general scheme for applying this code in the context of routine model diagnostics and ad hoc, exploratory work.
:+1:
@Michael Levy started this thread with a very simple objective: get CESM to write intake-esm catalog files. No changes to the model's output data are needed to accomplish this. We just need to generate a table of the model output.
I think a prototype to accomplish this is worth it (as a short-term solution). I echo @Kevin Paul's comments on modifying the model output to better conform to some standards as a long-term solution.
...Well, and keep in mind that we might need to actually develop some standards along the way. :smiley:
I'll jump into the current conversation in a minute, but first a commercial break: #hack-projects > generating intake catalog for CESM output
...Well, and keep in mind that we might need to actually develop some standards along the way. :smiley:
That's why I was asking about this
Does CESM have an existing framework/process for proposing new features across all components or does each component have its own policy?
I'll let you have the last word on standards, I'll just note that we can't solve all the problems at once. We should have lofty aspirations and, with these in mind, follow the most direct path to extensible solutions.
Part of me thinks that discussing whether or not we need more information in CESM history files is a little orthogonal to the goal of having the CESM workflow generate the catalog. It would definitely help in generating catalogs after the fact, but for the proposed idea of either updating the timeseries generation step in CESM postprocessing or the short-term archiver, both of those tools will have access to the XML files in $CASEROOT that already contain the information we'd want to put in history files.
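For example, those tools could pull case metadata straight out of $CASEROOT with CIME's xmlquery; a minimal sketch, assuming the standard xmlquery script in the case root (the helper function and the caseroot path are placeholders of mine):

```python
import subprocess

def query_case(caseroot, variable):
    # CIME's xmlquery with --value prints just the value of an XML variable.
    result = subprocess.run(
        ["./xmlquery", "--value", variable],
        cwd=caseroot, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

case_name = query_case("/path/to/caseroot", "CASE")  # placeholder caseroot
```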
I completely agree.
@Michael Levy: Ah! So this is a proposal to add/amend the CESM post-processing code? (I.e., this is actually changing CESM codebase?)
@Kevin Paul, the object is to get the model to write intake catalog files. By hook or by crook. I would suggest that this is a relatively autonomous tool that is called by post-processing hooks in the run infrastructure.
And if the goal is to distribute the catalog with the archive (and introduce a mechanism for combining catalogs for ensembles), then I think the only use for generating a catalog from CESM output will be to catalog old runs, which won't have the proposed metadata anyway.
Ah! So this is a proposal to add/amend the CESM post-processing code? (I.e., this is actually changing CESM codebase?)
@Kevin Paul Yes! Sorry if this was unclear, but the bulk of this work will be updating CESM (I think either post-processing or the archiver, which might technically be in CIME). I proposed changing the intake-esm interface to require a single csv file rather than a csv file AND a json file because I've found that I generate the same json file for every CESM catalog, except they all point to different csv files; that sort of redundancy led me to think that the intake-esm interface could be improved once everything is standardized, but I don't think I did a good job of separating that particular wishlist item from the project description as a whole.
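To illustrate the redundancy: the json files I generate differ essentially only in the csv path they point to, so they could be stamped out by something like this (a sketch; the column names are the ones I happen to use, not anything intake-esm requires):

```python
import json

def write_esmcol_json(csv_path, json_path, catalog_id="cesm-catalog"):
    # Everything except catalog_file is identical from one CESM catalog to
    # the next -- exactly the boilerplate a spec="cesm" argument could replace.
    spec = {
        "esmcat_version": "0.1.0",
        "id": catalog_id,
        "catalog_file": csv_path,
        "attributes": [
            {"column_name": "component"},
            {"column_name": "stream"},
            {"column_name": "variable"},
        ],
        "assets": {"column_name": "path", "format": "netcdf"},
    }
    with open(json_path, "w") as f:
        json.dump(spec, f, indent=2)
```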
@Michael Levy @Matt Long I think I'm getting it. And no worries about communication. That's the beauty of Zulip! What you are proposing sounds laudable. ...I have no idea how to do it. :smiley: I'll leave that to the experts.
@Matt Long makes a good point -- my "Yes!" response is more based on a blueprint that mostly exists in my head and will likely change, but CESM will either generate the catalog itself or call an independent tool that is CESM-aware and can pull data out of the case root.
But I think everyone here agrees that intake-esm / intake-esm-datastore will not be updated to generate these catalogs... there might be suggestions on API improvements resulting from the project, but they will rightfully remain model-agnostic.
I think I'm finally getting it. :smile:
Now, I know that most CESM developers say that CESM is standardized because it is CF compliant. But CF is not a standard. It is a set of conventions. The only thing we have close to a standard for ESM data is CMIP. So, I think we should be spending our effort getting CESM to output CMIP-compliant data.
Much of the Earth science community uses and is familiar with the CF Conventions, whereas only CMIP people are familiar with the CMIP conventions, so I would argue that CF is "more standard." As far as I know, neither has been officially adopted by a standards-developing organization, although CF-NetCDF3 was adopted by the Open Geospatial Consortium in ~2010 (thanks to the work of Ben Domenico at Unidata).
I am curious which aspects of CESM should be made more standard, and whether some aspects are covered only by CF or only by CMIP, or are covered by both in a contradictory fashion.
Personally, I find it unfortunate that:
- The variable names are different in CESM/CMIP and CESM/LENS.
- CF Standard Names are buried in the metadata rather than exposed at the file/object level.
- In LENS, there are a number of gratuitous incompatibilities requiring a human, e.g.:
At least some of these things can be made more standard at the Intake-ESM level, as I will demonstrate on Wed Mar 27 at 09:00 MDT.
@Jeff de La Beaujardiere , delayed response here...
Regarding fields having different short names but the same long name, it's not clear what behavior you would like to see.
Do you want 1) different long names, or 2) the same short names? I see pros and cons to either alternative.