Stream: dask

Topic: Reading multiple large files

Mira Berdahl (Dec 08 2023 at 22:26):

Hi,
I'm trying to read in some large, high resolution (0.1deg), ocean (POP) output.
The code below has worked without issue on 1deg ocean output, but the 0.1deg files are much bigger, and with them it stalls and then crashes. Can anyone help me optimize it so it runs more efficiently? I'm running on JupyterHub.

import numpy as np
import xarray as xr
import cartopy.crs as ccrs
import nc_time_axis
import cmocean
import zarr
import numpy.ma as ma
import pop_tools

import matplotlib.pyplot as plt
from distributed import Client
from ncar_jobqueue import NCARCluster

# on cheyenne

cluster = NCARCluster(project='ULNL0002', memory="125GB", walltime='1:00:00', cores=4, processes=4, resource_spec='select=1:ncpus=4:mem=125GB')
cluster.scale(32)
#cluster.adapt(minimum_jobs=1, maximum_jobs=5)
client = Client(cluster)
cluster

#############################################################################################################
from glob import glob

################### READ 0.1degree data ######################################################
ddir = '/glade/campaign/collections/cmip/CMIP6/CESM-HR/BHIST/HR/b.e13.BHISTC5.ne120_t12.cesm-ihesp-sehires38-1850-2005.001/ocn/proc/tseries/month_1/'
dfiles = sorted(glob(ddir + '*.TEMP.*.nc')) # use sorted to make sure the files are in order for concatenation

# a tried and true method from Anderson.

mfds = xr.open_mfdataset(dfiles, combine='by_coords', parallel=True, chunks={'time': 6}, data_vars=['TEMP', 'time_bound'], decode_times=False)
# fixmonth is a helper defined elsewhere (not shown here)
mfds = xr.decode_cf(fixmonth(mfds))
mfds
############################################################################################

David John Gagne (Dec 11 2023 at 16:26):

If you are loading a larger amount of data, I would increase the number of CPUs and memory you are requesting so that you can take advantage of parallelism and larger nodes. If you are running on casper, you can ask for up to 384 GB and 36 CPUs per node.

If that doesn't work, then I would suggest performing analysis in a way that does not require opening all the files at once. You may have to write a more manual loop instead of using open_mfdataset.
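
A manual loop along those lines might look something like the sketch below. It is only an illustration under some assumptions: the function names (reduce_files_one_by_one, surface_temp) are made up for this example, and the reduction (keeping only the top level of TEMP along a vertical dimension assumed to be called z_t) is a placeholder for whatever analysis is actually needed. The idea is to open one file at a time, shrink it, and concatenate only the small results.

```python
import xarray as xr


def reduce_files_one_by_one(paths, reduce_fn, open_fn=xr.open_dataset,
                            concat_dim="time", **open_kwargs):
    """Open each file in turn, reduce it, and concatenate the reduced pieces.

    Hypothetical helper (not part of xarray). Only one full file is open at
    any time, so peak memory is roughly one file plus the reduced results.
    """
    pieces = []
    for path in paths:
        # The context manager closes the file once the block exits.
        with open_fn(path, **open_kwargs) as ds:
            # .load() pulls the reduced result into memory before the close.
            pieces.append(reduce_fn(ds).load())
    return xr.concat(pieces, dim=concat_dim)


def surface_temp(ds):
    # Placeholder reduction (an assumption for illustration): keep only the
    # top model level of TEMP, assuming the vertical dimension is "z_t".
    return ds["TEMP"].isel(z_t=0)
```

With the file list from the original post, this could be called as reduce_files_one_by_one(dfiles, surface_temp, decode_times=False), with any time fixing and decoding applied to the combined result afterwards. Whether this beats open_mfdataset depends on how much the per-file reduction shrinks the data.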

Last updated: May 16 2025 at 17:14 UTC