Hi all, I am not sure who to contact about this, but could we please add intake to the maintained NPL environment(s)? It seems like a fairly standard package if you are dealing with CESM output.
Great idea @Anna-Lena Deppenmeier! There is some information here on the NPL update schedule, and I'm thinking that submitting a research computing help ticket is probably the best way to get intake on their radar and into the next released version.
Thanks for the pointers @Katie Dagon, I'll go ahead and submit a ticket!
Just checked with Ben Kirk, and tickets (help.ucar.edu) are the best way to request new packages at the moment.
We are adding intake and intake-esm to NPL, but the best way for now is to create a new ticket with RC and pursue that. I created a ticket on this this morning.
In the future we are going to move this to a GitHub repository, so new requests can come through GitHub issues. (But that's not implemented yet.)
That would be perfect!
We have added intake, intake-esm, intake-xarray, and flox to the base environment. @Anna-Lena Deppenmeier and @Deepak Cherian
Thanks @Negin Sobhani !
That's fantastic, thanks @Negin Sobhani !!
Hi @Negin Sobhani, I can open a new ticket but wanted to mention it here -- intake is hanging, i.e. not working with the NPL environment right now. I tested with @Kristen Krumhardt and it's the same for her: the NPL environment doesn't work, but her own environment is able to process the cells fairly quickly. I can point you to the yml she installed from (and that I am trying to install from right now) if it's of any use.
Hello @Anna-Lena Deppenmeier, thanks for raising this issue. I have just tested npl-2023a (which is the latest NPL flavor) and it seems like the intake package works. Can you please let me know which environment you use for this?
import intake
catalog_url = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(catalog_url)
col
Hi @Negin Sobhani, this also works for me. The line that hangs comes after creating a catalogue:
%%time
catalog = intake.open_esm_datastore(
'/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json')
catalog.df.experiment.unique()
works,
%%time
var = ['TEMP','SALT']
# get the historical
subset_hist = catalog.search(component='ocn',
variable=var,
experiment='historical',
forcing_variant='cmip6')
works, but
%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
dsets = subset_hist.to_dataset_dict(preprocess=preprocess)
hangs. It does seem to actually open the data (I can see Dask processes moving), but the cell never completes. Note that I tried different definitions of preprocess and none of them worked, yet the same command runs fine in my personal environment.
Thanks for looking into it!
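(The preprocess function itself was never shown in the thread. For context, intake-esm calls it on each xarray Dataset before combining them; a minimal hypothetical sketch, with the TEMP/SALT variable names taken from the search above, might look like this:)

```python
def preprocess(ds):
    """Trim each dataset before intake-esm combines them.

    Hypothetical sketch -- the actual preprocess used in the thread was
    not shown. Keeping only the variables of interest reduces what Dask
    has to concatenate when to_dataset_dict(preprocess=preprocess) runs.
    """
    return ds[["TEMP", "SALT"]]
```

It would then be passed exactly as in the cell above: subset_hist.to_dataset_dict(preprocess=preprocess).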
@Anna-Lena Deppenmeier This looks like an issue with Dask rather than intake. But please let me take a closer look at this.
Oh yeah that makes sense! There have been other issues with dask too, it would be great to get to the bottom of this. Thanks!
Hello @Anna-Lena Deppenmeier, interestingly, when I explore other catalogs this environment works without any issues with Dask. But with this catalog I am experiencing an issue similar to what you report, and the Dask workers stay idle.
Here is an example that works fine with npl-2023a + Dask:
import intake
import dask
from dask_jobqueue import PBSCluster
from dask.distributed import Client
# Create a PBS cluster object
cluster = PBSCluster(
job_name = 'dask-wk23-hpc',
cores = 1,
memory = '4GiB',
processes = 1,
local_directory = '/local_scratch/pbs.$PBS_JOBID/dask/spill',
resource_spec = 'select=1:ncpus=1:mem=4GB',
queue = 'casper',
walltime = '30:00',
interface = 'ib0'
)
catalog_url = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(catalog_url)
col
client = Client(cluster)
client
cluster.scale(4)
client.wait_for_workers(4)
col_subset = col.search(frequency=["daily", "monthly"], component="atm", variable="TREFHT",
experiment=["20C", "RCP85", "HIST"])
col_subset
dsets = col_subset.to_dataset_dict()
dsets
So I think the problem is either with the catalog or with some other package that is missing and required for transforming the data. Can you please confirm that the catalog works with another environment? I am still looking into what is causing this issue.
Thanks @Negin Sobhani, yes, this catalogue works with other environments. I could point you to an example environment file that I know works, if that's helpful?
Yes, @Anna-Lena Deppenmeier, can you please point me to the exact code and environment where this works?
When I tried reading in a subset of the data, it worked without any issue with two workers in npl-2023a. Please feel free to try it:
catalog_url = '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json'
col = intake.open_esm_datastore(catalog_url)
col
# get the historical
col_subset = col.search(component="ocn", variable="TEMP",
experiment="historical", forcing_variant='cmip6',member_id='r1i1001p1f1')
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
dsets = col_subset.to_dataset_dict()
Hi @Negin Sobhani, can you do the same for two variables? I don't know how to modify the syntax in the block of code above for two variables; will it be variable=['TEMP', 'SALT']? Also, do you think it works better with fewer workers (you mention two)? I will eventually need more than two workers to do anything with the dataset, but I guess I could load it with 2 and then cluster.scale(12)?
Also, do you still need the environment, since it's working for you now?
@Anna-Lena Deppenmeier, if you use the list (i.e. ['TEMP','SALT']) instead of 'TEMP', this should work too.
I have tested the following (adding SALT) and it worked fine. I am using 8 Dask workers; in general I set the number of workers based on the memory/CPU needs of a notebook.
Here is my code:
catalog_url = '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json'
col = intake.open_esm_datastore(catalog_url)
# get the historical
col_subset = col.search(component="ocn", variable=['TEMP','SALT'],
experiment="historical", forcing_variant='cmip6',member_id='r1i1001p1f1')
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
dsets = col_subset.to_dataset_dict()
Please note that I am still reading in a subset (2%) of the data (see the member_id argument), and it takes a few minutes to complete.
I am wondering if you can point me to the environment you used that reads in the data quickly.
Thanks!
Negin
Thanks @Negin Sobhani ! The environment file is here: /glade/u/home/deppenme/analysis6_versions.yml
Hello @Anna-Lena Deppenmeier , Thanks for sending the environment.
I ran several different tests with this catalogue, and it seems that although the Dask workers are not hung, they are extremely slow. I cannot see high CPU or memory usage on any worker. For example, reading in half of your data took ~32 minutes on 60+ workers.
Can you please confirm how much memory you typically request for the main server you start up (if it's a login server, you'd get 4 GB)? I don't think this is causing the issue for you, though, since I have already tested with higher memory and did not see any improvement.
I am currently testing your environment to see if I can reproduce your result of loading the dataset quickly, as you mentioned earlier. If that is not the case, we can rule out the npl environment as the source of this issue and look at diagnosing it further.
Hi Negin, I request a large amount of memory when I log in, somewhere between 40 and 100 GB, and I log in via JupyterHub on the Casper PBS system rather than on a login node.
Good to know. Thanks for confirming that. It suggests this issue is probably different from what @Holly Olivarez is experiencing.
ok
I can confirm that your environment loads half the dataset on the same number of workers in 8 seconds!
The issue might be due to some incompatibility between the intake and dask versions. For reference, here are the versions in both environments:
npl-2023a:
dask 2022.10.0
intake 0.7.0
intake-esm 2023.6.14
xarray 2023.1.0
analysis6:
dask 2021.11.2
intake 0.6.7
intake-esm 2021.8.17
xarray 0.20.2
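(For anyone comparing environments, version lists like the ones above can be generated from inside any environment using only the Python standard library; the package names here are simply the ones discussed in this thread:)

```python
from importlib.metadata import version, PackageNotFoundError

def package_versions(names):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Print a report in the same shape as the lists above
for name, ver in package_versions(["dask", "intake", "intake-esm", "xarray"]).items():
    print(f"{name:12s} {ver or '(not installed)'}")
```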
Last updated: May 16 2025 at 17:14 UTC