generating intake catalog for CESM output · hack-projects

Stream: hack-projects

Topic: generating intake catalog for CESM output

Michael Levy (May 22 2020 at 19:02):

A group of us are just starting to work on an update to the CESM workflow to include a "generate intake-esm catalog" step, targeting CESM 2.3. Some background is available in #python-dev > workflow

@Anderson Banihirwe and I are going to meet via zoom at 2:30 today to work on a more detailed plan for the project; if you're interested, let me know and I'll add you to the meeting.

Michael Levy (May 23 2020 at 00:00):

@Anderson Banihirwe and I have a tentative plan to move forward. Our vision is a tool that generates a csv file for intake-esm from the contents of the short-term archive for a single CESM run. We are starting with the requirement that pyReshaper has already archived time series, and will add a tool to handle archived time slice files soon after. A wrapper layer will then send intake-esm (1) a DataFrame built from the csv file, with the relative file paths listed in the csv converted to absolute paths, and (2) a dictionary containing what is currently in the JSON files for the CESM non-CMOR catalogs. At some point we will also add a tool to generate a catalog of an ensemble of runs by stitching together individual catalogs.
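As a rough illustration of the path-handling half of that wrapper, here is a minimal sketch. It assumes the csv has a `path` column and that relative entries are relative to some known archive root. The function name `load_catalog_rows` and its arguments are hypothetical, and the sketch uses only the standard library; the resulting list of dicts could be handed to `pandas.DataFrame(rows)` before being passed on to intake-esm.

```python
import csv
import io
from pathlib import Path


def load_catalog_rows(csv_text, archive_root):
    """Parse catalog csv text and make every 'path' entry absolute.

    archive_root is the directory the relative paths are assumed to be
    relative to (for example the run's short-term archive). This is an
    assumption of the sketch, not something the plan has settled.
    """
    root = Path(archive_root)
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = Path(row["path"])
        # Leave already-absolute paths alone; anchor relative ones at root.
        row["path"] = str(path if path.is_absolute() else root / path)
        rows.append(row)
    return rows
```

Keeping the csv itself relative means the archive can be moved or shared without regenerating the catalog; only the root passed to the wrapper changes.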

I have created https://github.com/NCAR/CESM_catalog for this purpose, but am happy to rename the tool. @Kevin Paul I may bug you on Tuesday for advice on Xdev-ing the repo; I need to add the group to the admin list, and I think we want the xdev bot helping manage issue tickets.

Action items for this coming week:

I'll generate 1 year of output from a B1850 compset and have pyReshaper convert it to time series (I thought I had done this but the reshaping job failed)
I'll write a Python script that takes $CASEROOT as an input and uses xmlquery to read CASE and DOUT_S_ROOT into memory.
Anderson will take my script and use it to generate a csv file containing the following columns:
- case [strictly speaking this isn't needed, but it could prove useful]
- component
- stream
- variable
- date_range
- path
- ctrl_branch_year
- ctrl_case
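
The two scripting steps above might look roughly like the sketch below. It makes several assumptions: `xmlquery` is run from inside $CASEROOT with its `--value` flag to print a bare value, time series files sit under `<DOUT_S_ROOT>/<component>/...`, and file names follow `<case>.<stream>.<variable>.<date_range>.nc`. The function names and the `runner` argument are hypothetical, and the `ctrl_branch_year` and `ctrl_case` columns are left out because the plan does not say where they come from.

```python
import subprocess
from pathlib import Path


def xmlquery(caseroot, name, runner=subprocess.run):
    """Return one case variable by running ./xmlquery inside caseroot.

    runner is injectable so the function can be exercised without a real case.
    """
    result = runner(
        ["./xmlquery", "--value", name],
        cwd=str(caseroot),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def parse_tseries_path(path, case, dout_s_root):
    """Split one time series file path into catalog columns.

    Assumed layout (not confirmed in this thread):
    <DOUT_S_ROOT>/<component>/.../<case>.<stream>.<variable>.<date_range>.nc
    """
    path = Path(path)
    rel = path.relative_to(dout_s_root)
    prefix = case + "."
    if not path.name.startswith(prefix) or path.suffix != ".nc":
        raise ValueError(f"unexpected file name: {path.name}")
    # Streams may contain dots (e.g. "pop.h.nday1"), so split from the right.
    remainder = path.name[len(prefix):-len(".nc")]
    stream, variable, date_range = remainder.rsplit(".", 2)
    return {
        "case": case,
        "component": rel.parts[0],
        "stream": stream,
        "variable": variable,
        "date_range": date_range,
        "path": str(rel),  # stored relative to DOUT_S_ROOT
    }
```

A driver would call `xmlquery(caseroot, "CASE")` and `xmlquery(caseroot, "DOUT_S_ROOT")`, walk the archive for `*.nc` files, and write the resulting dicts out with `csv.DictWriter`.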

We also talked about the idea of adding long_name to the column list; once intake-esm supports regular expression searches, that could be an easy way for someone unfamiliar with CESM variable naming conventions to find the data they are looking for. It might also be useful to have a tool that generates the csv.gz and json pair to be read in by intake-esm... e.g. if you are sharing the catalog with someone who isn't familiar with CESM_catalog.
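
For the csv.gz and json pair, a minimal sketch could look like the following. The function name `write_catalog_pair` is hypothetical, and the JSON keys are illustrative only: they loosely follow the ESM collection description that intake-esm reads, but a real file needs more than this (a spec version and aggregation rules, for instance), so the exact fields should be checked against the intake-esm documentation.

```python
import csv
import gzip
import json
from pathlib import Path


def write_catalog_pair(rows, out_dir, name):
    """Write <name>.csv.gz and a matching <name>.json into out_dir.

    rows is a non-empty list of dicts that all share the same keys,
    including a 'path' column.
    """
    out_dir = Path(out_dir)
    csv_path = out_dir / f"{name}.csv.gz"
    json_path = out_dir / f"{name}.json"
    columns = list(rows[0])

    # Text-mode gzip lets csv.DictWriter write straight into the archive.
    with gzip.open(csv_path, "wt", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)

    # Illustrative description only; key names are an assumption.
    description = {
        "id": name,
        "description": f"Catalog for {name}",
        "catalog_file": csv_path.name,
        "attributes": [{"column_name": c} for c in columns if c != "path"],
        "assets": {"column_name": "path", "format": "netcdf"},
    }
    json_path.write_text(json.dumps(description, indent=2))
    return csv_path, json_path
```

Someone receiving the pair would only need the two files side by side, without installing CESM_catalog.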

Michael Levy (May 23 2020 at 00:00):

The tool to generate an ensemble would be pretty straightforward -- read a YAML file that is something like

experiment1:
  case1:
    catalog_path: [path to catalog]
    member_id: ###
  case2:
    catalog_path: [path to catalog]
    member_id: ###
experiment2:
  case3:
    catalog_path: [path to catalog]
    member_id: ###

Each individual catalog would be read in, the experiment and member_id columns would be added, and the ctrl_case column would be replaced with ctrl_experiment and ctrl_member_id (assuming ctrl_case is also a member of the ensemble). Then all the individual catalogs would be concatenated into one giant file (we'd need to replace relative path names with absolute ones at this stage).
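
A minimal sketch of that stitching step, under a few assumptions: `config` is the dict obtained by parsing a YAML file shaped like the one above, `load_rows` is a hypothetical callable that returns a list of row dicts for a given catalog path, and relative paths in each catalog are taken to be relative to the directory holding that catalog file. None of these choices is fixed by the plan.

```python
from pathlib import Path


def stitch_ensemble(config, load_rows):
    """Combine per-case catalogs into one list of rows.

    config: {experiment: {case: {"catalog_path": ..., "member_id": ...}}}
    load_rows: callable mapping a catalog path to a list of row dicts.
    """
    # First pass: record which (experiment, member_id) each case maps to,
    # so ctrl_case can be translated even if the control appears later.
    members = {
        case: (experiment, info["member_id"])
        for experiment, cases in config.items()
        for case, info in cases.items()
    }

    combined = []
    for experiment, cases in config.items():
        for case, info in cases.items():
            # Assumption: relative paths are relative to the catalog's directory.
            root = Path(info["catalog_path"]).parent
            for row in load_rows(info["catalog_path"]):
                row = dict(row)  # copy so the input catalog is not modified
                row["experiment"] = experiment
                row["member_id"] = info["member_id"]
                # Replace ctrl_case; both fields are None when the control
                # case is missing or is not part of this ensemble.
                ctrl = members.get(row.pop("ctrl_case", None), (None, None))
                row["ctrl_experiment"], row["ctrl_member_id"] = ctrl
                row["path"] = str(root / row["path"])
                combined.append(row)
    return combined
```

The combined rows can then be written to a single csv in the same way as an individual catalog.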

Anderson Banihirwe (May 23 2020 at 00:04):

Ccing @Sheri Mickelson

Last updated: May 16 2025 at 17:14 UTC