Stream: python-questions

Topic: Add intake to NCAR maintained NPL environment please


Anna-Lena Deppenmeier (Jun 21 2023 at 23:11):

Hi all, I am not sure who to contact about this, but could we please add intake to the maintained NPL environment(s)? It seems a fairly standard package if you are dealing with CESM output.

Katie Dagon (Jun 22 2023 at 16:17):

Great idea @Anna-Lena Deppenmeier ! There is some information here on the NPL update schedule, and I'm thinking that submitting a research computing help ticket is probably the best way to get intake on their radar and into the next released version.

Anna-Lena Deppenmeier (Jun 22 2023 at 16:28):

Thanks for the pointers @Katie Dagon, I'll go ahead and submit a ticket!

Deepak Cherian (Jun 22 2023 at 18:54):

Just checked with Ben Kirk, and tickets (help.ucar.edu) are the best way of requesting new packages at the moment.

Negin Sobhani (Jun 22 2023 at 20:07):

We are adding intake and intake-esm to NPL, but for now the best way is to create a new ticket with RC and pursue it. I created such a ticket this morning.

In the future we are going to update this to a GitHub repository, so new requests can come through GitHub issues. (Not implemented yet.)

Deepak Cherian (Jun 22 2023 at 20:11):

In the future we are going to update this to a GitHub repository, so new requests can come through GitHub issues.

That would be perfect!

Negin Sobhani (Jun 22 2023 at 21:27):

We have added intake, intake-esm, intake-xarray, and flox to the base environment. @Anna-Lena Deppenmeier and @Deepak Cherian

Deepak Cherian (Jun 22 2023 at 21:28):

Thanks @Negin Sobhani !

Anna-Lena Deppenmeier (Jun 22 2023 at 21:46):

That's fantastic, thanks @Negin Sobhani !!

Anna-Lena Deppenmeier (Jun 23 2023 at 18:37):

Hi @Negin Sobhani I can open a new ticket but wanted to make mention of this here -- intake is hanging , i.e. not working with the NPL environment right now. I tested with @Kristen Krumhardt and for her it's the same, the NPL environment doesn't work but her own environment is able to process the cells fairly quickly. I can point you to the yml she installed from (and I am trying to install from right now) if it's of any use.

Negin Sobhani (Jun 26 2023 at 20:12):

Hello @Anna-Lena Deppenmeier, thanks for raising this issue. I have just tested npl-2023a (the latest NPL flavor) and it seems that the intake package works. Can you please let me know which environment you are using for this?

import intake
catalog_url = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(catalog_url)
col

Anna-Lena Deppenmeier (Jun 26 2023 at 20:15):

Hi @Negin Sobhani, this also works for me; the line that hangs comes after creating the catalogue:

%%time
catalog = intake.open_esm_datastore(
    '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json')
catalog.df.experiment.unique()

works,

%%time
var = ['TEMP','SALT']
# get the historical
subset_hist = catalog.search(component='ocn',
                        variable=var,
                        experiment='historical',
                        forcing_variant='cmip6')

works, but

%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dsets = subset_hist.to_dataset_dict(preprocess=preprocess)

hangs. It does seem to actually open the data (I can see Dask processes moving), but the cell never completes. Note that I tried different definitions for the preprocess function and it didn't work with either, yet the same command runs fine in my personal environment.
Thanks for looking into it!
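For context, the preprocess argument to to_dataset_dict is a user-defined function that intake-esm applies to each dataset after opening it and before aggregation. The actual function used here is not shown in the thread; a minimal hypothetical example might look like:

```python
def preprocess(ds):
    # Hypothetical example of an intake-esm preprocess hook: trim each
    # dataset to the upper 10 vertical levels (z_t) before aggregation.
    # The actual function used in this thread is not shown.
    return ds.isel(z_t=slice(0, 10))
```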

Negin Sobhani (Jun 26 2023 at 20:21):

@Anna-Lena Deppenmeier This looks like an issue with Dask rather than intake. But please let me take a closer look at this.

Anna-Lena Deppenmeier (Jun 26 2023 at 20:22):

Oh yeah that makes sense! There have been other issues with dask too, it would be great to get to the bottom of this. Thanks!

Negin Sobhani (Jun 26 2023 at 22:10):

Hello @Anna-Lena Deppenmeier, interestingly, when I explore other catalogs this environment works without any issues with Dask. But with this catalog, I am experiencing an issue similar to what you report: the Dask workers stay idle.

Here is an example that works fine with npl-2023a + dask:

import intake
import dask
from dask_jobqueue import PBSCluster
from dask.distributed import Client

# Create a PBS cluster object
cluster = PBSCluster(
    job_name = 'dask-wk23-hpc',
    cores = 1,
    memory = '4GiB',
    processes = 1,
    local_directory = '/local_scratch/pbs.$PBS_JOBID/dask/spill',
    resource_spec = 'select=1:ncpus=1:mem=4GB',
    queue = 'casper',
    walltime = '30:00',
    interface = 'ib0'
)


catalog_url = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(catalog_url)
col

client = Client(cluster)
client

cluster.scale(4)

client.wait_for_workers(4)

col_subset = col.search(frequency=["daily", "monthly"], component="atm", variable="TREFHT",
                        experiment=["20C", "RCP85", "HIST"])

col_subset

dsets = col_subset.to_dataset_dict()
dsets

So I think the problem is either with the catalog, or some other package that is required for transforming the data is missing. Can you please confirm that the catalog works with another environment? I am still looking into what is causing this issue.

Anna-Lena Deppenmeier (Jun 26 2023 at 22:34):

Thanks @Negin Sobhani, yes, this catalogue works with other environments. I could point you to an example environment file that I know it works with, if that's helpful?

Negin Sobhani (Jun 27 2023 at 00:17):

Yes @Anna-Lena Deppenmeier, can you please point me to the exact code and environment where this works?

Negin Sobhani (Jun 27 2023 at 00:25):

When I tried reading in a subset of the data, it worked without any issue with two workers in npl-2023a.
Please feel free to try it:

import intake
import dask

catalog_url = '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json'
col = intake.open_esm_datastore(catalog_url)
col

# get the historical
col_subset = col.search(component="ocn", variable="TEMP",
                        experiment="historical", forcing_variant='cmip6', member_id='r1i1001p1f1')

with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dsets = col_subset.to_dataset_dict()

Anna-Lena Deppenmeier (Jun 27 2023 at 15:00):

Hi @Negin Sobhani, can you do the same for two variables? I don't know how to modify the syntax in the block of code above for two variables -- will it be variable=['TEMP', 'SALT']? Also, do you think it works better with fewer workers (you mention two)? I will need more than two workers to eventually do anything with the dataset, but I guess I could load it with 2 and then cluster.scale(12)?

Anna-Lena Deppenmeier (Jun 27 2023 at 15:03):

Also, do you still need the environment, since it's working for you now?

Negin Sobhani (Jun 28 2023 at 20:26):

@Anna-Lena Deppenmeier, if you use the list (i.e. ['TEMP','SALT']) instead of 'TEMP', this should work too.
I have tested the following (adding SALT) and it worked fine. I am using 8 Dask workers; overall I usually set the number of Dask workers based on the memory/CPU needs of the notebook.

Here is my code:

import intake
import dask

catalog_url = '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json'
col = intake.open_esm_datastore(catalog_url)
# get the historical
col_subset = col.search(component="ocn", variable=['TEMP','SALT'],
                        experiment="historical", forcing_variant='cmip6', member_id='r1i1001p1f1')
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    dsets = col_subset.to_dataset_dict()

Please note that I am still reading in only a subset (2%) of the data (see the member_id argument) and it takes a few minutes to complete.
I am wondering if you can point me to the environment you used that read in the data quickly.
Thanks!
Negin

Anna-Lena Deppenmeier (Jun 28 2023 at 20:41):

Thanks @Negin Sobhani ! The environment file is here: /glade/u/home/deppenme/analysis6_versions.yml

Negin Sobhani (Jun 29 2023 at 19:16):

Hello @Anna-Lena Deppenmeier , Thanks for sending the environment.
I did several different tests with this catalogue, and it seems that although the Dask workers are not hung, they are extremely slow. I cannot see high CPU or memory usage on any of the workers, but, for example, reading in half of your data took ~32 minutes on 60+ workers.

Can you please confirm how much memory you typically request for the main server you start up (if it's a login server, you'd get 4 GB)? Although I don't think this is causing the issue for you, as I have already tested with higher memory and did not see any improvement.

I am currently testing your environment to see if I can reproduce your result of loading the dataset quickly, as you mentioned earlier. If that is not the case, we can rule out the npl environment as the source of this issue and look elsewhere to diagnose it.

Anna-Lena Deppenmeier (Jun 29 2023 at 19:18):

Hi Negin, I request a large amount of memory when I log in, something between 40 and 100 GB, and I log in via JupyterHub on the Casper PBS system rather than the login node.

Negin Sobhani (Jun 29 2023 at 19:20):

Good to know, thanks for confirming. We suspect this issue is probably different from what @Holly Olivarez is experiencing.

Anna-Lena Deppenmeier (Jun 29 2023 at 19:20):

ok

Negin Sobhani (Jun 29 2023 at 19:33):

I can confirm that your environment loads half the dataset with the same number of workers in 8 seconds!

The issue might be due to some incompatibility between the intake and dask versions. For reference, I am putting the versions from both environments here:

npl-2023a:

dask                      2022.10.0
intake                    0.7.0
intake-esm                2023.6.14
xarray                    2023.1.0

analysis6:

dask                      2021.11.2
intake                    0.6.7
intake-esm                2021.8.17
xarray                    0.20.2
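Every one of the four listed packages differs between the two environments, which is consistent with the version-incompatibility hypothesis. A trivial sketch comparing the two listings (versions copied verbatim from above):

```python
# Versions copied from the two environment listings above.
npl_2023a = {"dask": "2022.10.0", "intake": "0.7.0",
             "intake-esm": "2023.6.14", "xarray": "2023.1.0"}
analysis6 = {"dask": "2021.11.2", "intake": "0.6.7",
             "intake-esm": "2021.8.17", "xarray": "0.20.2"}

# Packages whose versions differ between the two environments.
differing = sorted(pkg for pkg in npl_2023a
                   if npl_2023a[pkg] != analysis6.get(pkg))
print(differing)  # → ['dask', 'intake', 'intake-esm', 'xarray']
```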

Last updated: May 16 2025 at 17:14 UTC