Stream: python-questions

Topic: Add intake to NCAR maintained NPL environment please


Anna-Lena Deppenmeier (Jun 21 2023 at 23:11):

Hi all, I am not sure who to contact about this, but could we please add intake to the maintained NPL environment(s)? It seems a fairly standard package if you are dealing with CESM output.

Katie Dagon (Jun 22 2023 at 16:17):

Great idea @Anna-Lena Deppenmeier ! There is some information here on the NPL update schedule, and I'm thinking that submitting a research computing help ticket is probably the best way to get intake on their radar and into the next released version.

Anna-Lena Deppenmeier (Jun 22 2023 at 16:28):

Thanks for the pointers @Katie Dagon, I'll go ahead and submit a ticket!

Deepak Cherian (Jun 22 2023 at 18:54):

Just checked with Ben Kirk, and tickets (help.ucar.edu) are the best way of requesting new packages at the moment.

Negin Sobhani (Jun 22 2023 at 20:07):

We are adding intake and intake-esm to NPL, but for now the best way is to create a new ticket with RC and pursue it. I created such a ticket this morning.

In the future we are going to update this to a GitHub repository, so new requests can come through GitHub issues. (Not implemented yet.)

Deepak Cherian (Jun 22 2023 at 20:11):

In the future we are going to update this to a GitHub repository, so new requests can come through GitHub issues.

That would be perfect!

Negin Sobhani (Jun 22 2023 at 21:27):

We have added intake, intake-esm, intake-xarray, and flox to the base environment. @Anna-Lena Deppenmeier and @Deepak Cherian

Deepak Cherian (Jun 22 2023 at 21:28):

Thanks @Negin Sobhani !

Anna-Lena Deppenmeier (Jun 22 2023 at 21:46):

That's fantastic, thanks @Negin Sobhani !!

Anna-Lena Deppenmeier (Jun 23 2023 at 18:37):

Hi @Negin Sobhani I can open a new ticket but wanted to make mention of this here -- intake is hanging , i.e. not working with the NPL environment right now. I tested with @Kristen Krumhardt and for her it's the same, the NPL environment doesn't work but her own environment is able to process the cells fairly quickly. I can point you to the yml she installed from (and I am trying to install from right now) if it's of any use.

Negin Sobhani (Jun 26 2023 at 20:12):

Hello @Anna-Lena Deppenmeier, thanks for raising this issue. I have just tested npl-2023a (the latest NPL flavor) and it seems that the intake package works. Can you please let me know which environment you are using for this?

import intake
catalog_url = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(catalog_url)
col

Anna-Lena Deppenmeier (Jun 26 2023 at 20:15):

Hi @Negin Sobhani, this also works for me; the line that hangs comes after creating the catalogue:

%%time
catalog = intake.open_esm_datastore(
    '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json')
catalog.df.experiment.unique()

works,

%%time
var = ['TEMP','SALT']
# get the historical
subset_hist = catalog.search(component='ocn',
                        variable=var,
                        experiment='historical',
                        forcing_variant='cmip6')

works, but

%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dsets = subset_hist.to_dataset_dict(preprocess=preprocess)

hangs. It does seem to actually open the data (I can see Dask processes moving), but the cell never completes. Note that I tried different definitions for the preprocess function and it didn't work with either, yet the same command runs fine in my personal environment.
Thanks for looking into it!
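For context, the preprocess argument to to_dataset_dict is a user-defined function that intake-esm applies to each dataset after opening it and before aggregation. The actual function used here is not shown in the thread; a minimal hypothetical example might look like:

```python
def preprocess(ds):
    # Hypothetical example of an intake-esm preprocess hook: trim each
    # dataset to the upper 10 vertical levels (z_t) before aggregation.
    # The actual function used in this thread is not shown.
    return ds.isel(z_t=slice(0, 10))
```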

Negin Sobhani (Jun 26 2023 at 20:21):

@Anna-Lena Deppenmeier This looks like an issue with Dask rather than intake. But please let me take a closer look at this.

Anna-Lena Deppenmeier (Jun 26 2023 at 20:22):

Oh yeah that makes sense! There have been other issues with dask too, it would be great to get to the bottom of this. Thanks!

Negin Sobhani (Jun 26 2023 at 22:10):

Hello @Anna-Lena Deppenmeier, interestingly, when I explore other catalogs this environment works without any issues with Dask. But with this catalog, I am experiencing an issue similar to what you report: the Dask workers stay idle.

Here is an example that works fine with npl-2023a + dask:

import intake
import dask
from dask_jobqueue import PBSCluster
from dask.distributed import Client

# Create a PBS cluster object
cluster = PBSCluster(
    job_name = 'dask-wk23-hpc',
    cores = 1,
    memory = '4GiB',
    processes = 1,
    local_directory = '/local_scratch/pbs.$PBS_JOBID/dask/spill',
    resource_spec = 'select=1:ncpus=1:mem=4GB',
    queue = 'casper',
    walltime = '30:00',
    interface = 'ib0'
)


catalog_url = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(catalog_url)
col

client = Client(cluster)
client

cluster.scale(4)

client.wait_for_workers(4)

col_subset = col.search(frequency=["daily", "monthly"], component="atm", variable="TREFHT",
                        experiment=["20C", "RCP85", "HIST"])

col_subset

dsets = col_subset.to_dataset_dict()
dsets

So I think the problem is either with the catalog, or some other package that is required for transforming the data is missing. Can you please confirm that the catalog works with another environment? I am still looking into what is causing this issue.

Anna-Lena Deppenmeier (Jun 26 2023 at 22:34):

Thanks @Negin Sobhani, yes, this catalogue works with other environments. I could point you to an example environment file that I know it works with, if that's helpful?

Negin Sobhani (Jun 27 2023 at 00:17):

Yes @Anna-Lena Deppenmeier, can you please point me to the exact code and environment where this works?

Negin Sobhani (Jun 27 2023 at 00:25):

When I tried reading in a subset of the data, it worked without any issue with two workers in npl-2023a.
Please feel free to try it:

import intake
import dask

catalog_url = '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json'
col = intake.open_esm_datastore(catalog_url)
col

# get the historical
col_subset = col.search(component="ocn", variable="TEMP",
                        experiment="historical", forcing_variant='cmip6', member_id='r1i1001p1f1')

with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dsets = col_subset.to_dataset_dict()

Anna-Lena Deppenmeier (Jun 27 2023 at 15:00):

Hi @Negin Sobhani, can you do the same for two variables? I don't know how to modify the syntax in the block of code above for two variables -- will it be variable=['TEMP', 'SALT']? Also, do you think it works better with fewer workers (you mention two)? I will need more than two workers to eventually do anything with the dataset, but I guess I could load it with 2 and then cluster.scale(12)?

Anna-Lena Deppenmeier (Jun 27 2023 at 15:03):

Also, do you still need the environment, since it's working for you now?

Negin Sobhani (Jun 28 2023 at 20:26):

@Anna-Lena Deppenmeier, if you use the list (i.e. ['TEMP','SALT']) instead of 'TEMP', this should work too.
I have tested the following (adding SALT) and it worked fine. I am using 8 Dask workers; overall I usually set the number of Dask workers based on the memory/CPU needs of the notebook.

Here is my code:

import intake
import dask

catalog_url = '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json'
col = intake.open_esm_datastore(catalog_url)
# get the historical
col_subset = col.search(component="ocn", variable=['TEMP','SALT'],
                        experiment="historical", forcing_variant='cmip6', member_id='r1i1001p1f1')
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dsets = col_subset.to_dataset_dict()

Please note that I am still reading in only a subset (2%) of the data (see the member_id argument) and it takes a few minutes to complete.
I am wondering if you can point me to the environment you used that read in the data quickly.
Thanks!
Negin

Anna-Lena Deppenmeier (Jun 28 2023 at 20:41):

Thanks @Negin Sobhani ! The environment file is here: /glade/u/home/deppenme/analysis6_versions.yml

Negin Sobhani (Jun 29 2023 at 19:16):

Hello @Anna-Lena Deppenmeier , Thanks for sending the environment.
I did several different tests with this catalogue, and it seems that although the Dask workers are not hung, they are extremely slow. I cannot see high CPU or memory usage on any of the workers, but, for example, reading in half of your data took ~32 minutes on 60+ workers.

Can you please confirm how much memory you typically request for the main server you start up (if it's a login server, you'd get 4 GB)? Although I don't think this is causing the issue for you, as I have already tested with higher memory and did not see any improvement.

I am currently testing your environment to see if I can reproduce your result of loading the dataset quickly, as you mentioned earlier. If that is not the case, we can rule out the npl environment as the source of this issue and look elsewhere to diagnose it.

Anna-Lena Deppenmeier (Jun 29 2023 at 19:18):

Hi Negin, I request a large amount of memory when I log in, something between 40 and 100 GB, and I log in via JupyterHub on the Casper PBS system rather than the login node.

Negin Sobhani (Jun 29 2023 at 19:20):

Good to know, thanks for confirming. We suspect this issue is probably different from what @Holly Olivarez is experiencing.

Anna-Lena Deppenmeier (Jun 29 2023 at 19:20):

ok

Negin Sobhani (Jun 29 2023 at 19:33):

I can confirm that your environment loads half the dataset with the same number of workers in 8 seconds!

The issue might be due to some incompatibility between the intake and dask versions. For reference, I am putting the versions from both environments here:

npl-2023a:

dask                      2022.10.0
intake                    0.7.0
intake-esm                2023.6.14
xarray                    2023.1.0

analysis6:

dask                      2021.11.2
intake                    0.6.7
intake-esm                2021.8.17
xarray                    0.20.2
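Every one of the four listed packages differs between the two environments, which is consistent with the version-incompatibility hypothesis. A trivial sketch comparing the two listings (versions copied verbatim from above):

```python
# Versions copied from the two environment listings above.
npl_2023a = {"dask": "2022.10.0", "intake": "0.7.0",
             "intake-esm": "2023.6.14", "xarray": "2023.1.0"}
analysis6 = {"dask": "2021.11.2", "intake": "0.6.7",
             "intake-esm": "2021.8.17", "xarray": "0.20.2"}

# Packages whose versions differ between the two environments.
differing = sorted(pkg for pkg in npl_2023a
                   if npl_2023a[pkg] != analysis6.get(pkg))
print(differing)  # → ['dask', 'intake', 'intake-esm', 'xarray']
```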

Last updated: May 16 2025 at 17:14 UTC