Stream: python-questions

Topic: intake-esm unique constraints?


Brian Bonnlander (Jan 20 2021 at 17:57):

@Anderson Banihirwe Does intake-esm require any columns in an intake catalog to have unique values? I am considering building a catalog where the path to a Zarr store will appear more than once, as a way of describing the list of ensemble members for that store. This list varies across the different Zarr stores described by the catalog.

Anderson Banihirwe (Jan 20 2021 at 21:25):

> Does intake-esm require any columns in an intake catalog to have unique values?

You can have duplicate values in intake-esm catalog columns.

> I am considering building a catalog where the path to a Zarr store will appear more than once, as a way of describing the list of ensemble members for that store.

Do you mind elaborating on this with a small example (for instance what will a single row look like)?

Brian Bonnlander (Jan 20 2021 at 21:42):

> Do you mind elaborating on this with a small example (for instance what will a single row look like)?

OK, I am imagining that the first few lines of the catalog CSV file would look like the following, with the same path for all lines, and differences only in the RCM or Driver columns, because the same Zarr store contains all these different simulation runs:

,path,variable,scenario,driver,rcm,frequency,grid,biascorrection,common,longname,units,member_id
0,s3://ncar/na-cordex/tasmin.hist.day.NAM-22i.raw.zarr,tasmin,hist,MPI-ESM-MR,CRCM5-UQAM,day,NAM-22i,raw,common,Daily Minimum Near-Surface Air Temperature,K,MPI-ESM-MR.CRCM5-UQAM
1,s3://ncar/na-cordex/tasmin.hist.day.NAM-22i.raw.zarr,tasmin,hist,GEMatm-Can,CRCM5-UQAM,day,NAM-22i,raw,common,Daily Minimum Near-Surface Air Temperature,K,GEMatm-Can.CRCM5-UQAM
2,s3://ncar/na-cordex/tasmin.hist.day.NAM-22i.raw.zarr,tasmin,hist,CNRM-CM5,CRCM5-OUR,day,NAM-22i,raw,common,Daily Minimum Near-Surface Air Temperature,K,CNRM-CM5.CRCM5-OUR
3,s3://ncar/na-cordex/tasmin.hist.day.NAM-22i.raw.zarr,tasmin,hist,GFDL-ESM2M,CRCM5-OUR,day,NAM-22i,raw,common,Daily Minimum Near-Surface Air Temperature,K,GFDL-ESM2M.CRCM5-OUR
4,s3://ncar/na-cordex/tasmin.hist.day.NAM-22i.raw.zarr,tasmin,hist,HadGEM2-ES,RegCM4,day,NAM-22i,raw,common,Daily Minimum Near-Surface Air Temperature,K,HadGEM2-ES.RegCM4

Brian Bonnlander (Jan 20 2021 at 21:53):

So essentially the catalog rows describe individual simulation runs, rather than individual Zarr stores. Multiple runs would map to a single Zarr store. I am hoping this doesn't break any of the intake-esm search behavior...it sounds like it would not.
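To make the "one row per simulation run, many rows per Zarr store" layout concrete, here is a minimal sketch that uses only the Python standard library, not intake-esm itself. It reads a few abbreviated rows modeled on the CSV above (most columns are dropped for brevity) and groups the `member_id` values by store path. The function name `members_by_store` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Abbreviated rows modeled on the catalog above; only a few columns are kept.
CATALOG_CSV = """path,variable,driver,rcm,member_id
s3://ncar/na-cordex/tasmin.hist.day.NAM-22i.raw.zarr,tasmin,MPI-ESM-MR,CRCM5-UQAM,MPI-ESM-MR.CRCM5-UQAM
s3://ncar/na-cordex/tasmin.hist.day.NAM-22i.raw.zarr,tasmin,GEMatm-Can,CRCM5-UQAM,GEMatm-Can.CRCM5-UQAM
s3://ncar/na-cordex/tasmin.hist.day.NAM-22i.raw.zarr,tasmin,CNRM-CM5,CRCM5-OUR,CNRM-CM5.CRCM5-OUR
"""


def members_by_store(csv_text):
    """Map each Zarr store path to the member_ids of the rows that point at it."""
    members = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        members[row["path"]].append(row["member_id"])
    return dict(members)


stores = members_by_store(CATALOG_CSV)
for path, member_ids in stores.items():
    # The same path appears on every row, so it ends up with several members.
    print(path, member_ids)
```

Nothing in this grouping depends on the path being unique, which is consistent with the answer above that duplicate values in a column are allowed.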

Anderson Banihirwe (Jan 20 2021 at 21:53):

Thank you for the clarification! Have you had a chance to look at the multi-variable example here: https://intake-esm.readthedocs.io/en/latest/user-guide/multi-variable-assets.html

I am asking because I think you may avoid the entry duplication by putting iterables (list, tuple, etc.) in the RCM or Driver columns. I imagine something along these lines would work (I combine the first two rows into a single entry):

0,s3://ncar/na-cordex/tasmin.hist.day.NAM-22i.raw.zarr,tasmin,hist,"('MPI-ESM-MR', 'GEMatm-Can')",CRCM5-UQAM,day,NAM-22i,raw,common,Daily Minimum Near-Surface Air Temperature,K,"('MPI-ESM-MR.CRCM5-UQAM', 'GEMatm-Can.CRCM5-UQAM')"
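A combined row like this can be generated rather than typed by hand. The following is a minimal standard-library sketch, not part of intake-esm, that collapses per-run rows sharing a path and RCM into a single row whose `driver` and `member_id` cells hold the `repr()` of a tuple. The helper name `collapse` and the choice of grouping columns are assumptions made for illustration.

```python
import csv
import io

STORE = "s3://ncar/na-cordex/tasmin.hist.day.NAM-22i.raw.zarr"

# One dict per simulation run, abbreviated to the columns that matter here.
rows = [
    {"path": STORE, "driver": "MPI-ESM-MR", "rcm": "CRCM5-UQAM",
     "member_id": "MPI-ESM-MR.CRCM5-UQAM"},
    {"path": STORE, "driver": "GEMatm-Can", "rcm": "CRCM5-UQAM",
     "member_id": "GEMatm-Can.CRCM5-UQAM"},
]


def collapse(rows, key_cols=("path", "rcm"), list_cols=("driver", "member_id")):
    """Merge rows that agree on key_cols; list_cols become tuple literals."""
    grouped = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        entry = grouped.setdefault(key, {c: [] for c in list_cols})
        for c in list_cols:
            entry[c].append(row[c])
    merged_rows = []
    for key, entry in grouped.items():
        merged = dict(zip(key_cols, key))
        for c in list_cols:
            # repr() of a tuple of strings is a literal that
            # ast.literal_eval can turn back into a tuple later.
            merged[c] = repr(tuple(entry[c]))
        merged_rows.append(merged)
    return merged_rows


collapsed = collapse(rows)

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["path", "driver", "rcm", "member_id"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(collapsed)  # csv quotes the cells that contain commas
csv_text = buffer.getvalue()
print(csv_text)
```

Letting the `csv` module do the quoting avoids two easy mistakes: stray spaces before the opening double quote, and tuple elements that are not themselves quoted strings. Either one would keep the cell from being parsed back into a tuple.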

Brian Bonnlander (Jan 20 2021 at 21:55):

Aha, that is good to know. We could design the catalog that way also. Would the search and discovery behavior change depending on the choice of catalog implementation?

Anderson Banihirwe (Jan 20 2021 at 21:59):

The search should work regardless of the types of items in a row or column. The only caveat is that we have to tell pandas how to parse these "special columns":

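import ast
import intake
# Parse the tuple-valued columns back into Python objects with ast.literal_eval;
# without the converters, pandas would read them as plain strings.
# Note: newer intake-esm releases name this argument read_csv_kwargs.
col = intake.open_esm_datastore(
    "my-catalog.json",
    csv_kwargs={"converters": {"driver": ast.literal_eval, "member_id": ast.literal_eval}},
)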
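To illustrate what the converters do and why a search can treat scalar and iterable cells alike, here is a rough standard-library stand-in. It is not how intake-esm or pandas implement this. The store names `store-a.zarr` and `store-b.zarr` and the helpers `load_rows`, `matches`, and `search` are all hypothetical.

```python
import ast
import csv
import io

# Hypothetical two-row catalog: "driver" holds tuple literals, "rcm" plain strings.
CSV_TEXT = """path,driver,rcm
store-a.zarr,"('MPI-ESM-MR', 'GEMatm-Can')",CRCM5-UQAM
store-b.zarr,"('CNRM-CM5', 'GFDL-ESM2M')",CRCM5-OUR
"""

# Same shape as the "converters" mapping passed through to pandas above.
CONVERTERS = {"driver": ast.literal_eval}


def load_rows(csv_text, converters):
    """Read CSV rows and apply a per-column converter to the raw cell text."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, convert in converters.items():
            row[column] = convert(row[column])
        rows.append(row)
    return rows


def matches(cell, wanted):
    """A cell matches if it equals the value or, for iterables, contains it."""
    if isinstance(cell, (list, tuple, set)):
        return wanted in cell
    return cell == wanted


def search(rows, **query):
    """Return the rows for which every queried column matches."""
    return [
        row for row in rows
        if all(matches(row[col], val) for col, val in query.items())
    ]


rows = load_rows(CSV_TEXT, CONVERTERS)
hits = search(rows, driver="CNRM-CM5")
print([row["path"] for row in hits])
```

Without the converter, the `driver` cell would stay the string `"('CNRM-CM5', 'GFDL-ESM2M')"`, and an equality test against `"CNRM-CM5"` would never match.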

Last updated: Jan 30 2022 at 12:01 UTC