Stream: dask

Topic: optimizing compression for fast reading


Trude Eidhammer (Aug 25 2021 at 17:27):

Hi
We have a 250-member ensemble with time series. The time series are created from CESM outputs and compressed (time chunk = 1). This is not efficient for reading the files, and I am wondering if anyone has scripts for compressing files that can also be read efficiently.

Trude Eidhammer (Aug 25 2021 at 17:56):

FYI, I am compressing the files with: nccopy -d1 -c time/1,lat/$lath,lon/$lonh $fname1 $fname2
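
Since the slow reads come from the time/1 chunking, one way to get files that are both compressed and quick to read back is to rewrite them with the full time dimension in a single chunk. Below is a minimal sketch using xarray rather than the nccopy command above; it assumes xarray and netCDF4 are installed, and the file names, spatial chunk size of 32, and compression level are placeholder assumptions.

import xarray as xr

ds = xr.open_dataset("member_001.nc")  # hypothetical single-member time-series file

encoding = {}
for name, da in ds.data_vars.items():
    if "time" not in da.dims:
        continue
    # one chunk spanning the whole time axis; modest chunks in the other dims
    chunks = tuple(da.sizes[d] if d == "time" else min(da.sizes[d], 32) for d in da.dims)
    encoding[name] = {"zlib": True, "complevel": 1, "chunksizes": chunks}

ds.to_netcdf("member_001_rechunked.nc", encoding=encoding)

Reading a full time series from the rewritten file then touches one chunk per variable instead of one chunk per time step.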

Maria Frediani (Sep 10 2021 at 17:02):

Trude, in the past I saved the variables I was interested in using NumPy's native compressed format (.npz). I didn't use dask at all, but maybe this suits your needs because it loads quickly and reduces the data size.
To save the array compressed:
np.savez_compressed(fileout, ifile=filename, var=varname, data=mydata)
To load it:
mydata = np.load(filein)['data']
Let me know if that works for you and I can give you my scripts.
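
For reference, a self-contained version of this round trip might look like the sketch below. It is not Maria's script: the netCDF4 read step and the file and variable names are assumptions, and note that np.savez_compressed writes a .npz archive rather than a plain .npy file.

import numpy as np
from netCDF4 import Dataset

filename, varname = "member_001.nc", "TS"   # hypothetical input file and variable

# pull one variable out of the netCDF file
with Dataset(filename) as nc:
    mydata = np.asarray(nc.variables[varname][:])   # mask, if any, is dropped

# save it compressed alongside its provenance
np.savez_compressed("TS_member_001.npz", ifile=filename, var=varname, data=mydata)

# load it back later
mydata = np.load("TS_member_001.npz")["data"]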

James McCreight (Sep 10 2021 at 17:09):

Trude Eidhammer said:

Hi
We have a 250-member ensemble with time series. The time series are created from CESM outputs and compressed (time chunk = 1). This is not efficient for reading the files, and I am wondering if anyone has scripts for compressing files that can also be read efficiently.

Hi Trude, I have some recipes for "rechunking" using the rechunker package and appending to zarr files. The input can be netcdf, but the output of my recipes is zarr. There is a postprocessing step that can convert back to netcdf. I'm happy to discuss offline but wanted to post here in case such recipes are of interest to others.

James
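
For anyone who wants to experiment before getting those recipes, a minimal rechunker sketch along these lines might look like the following. It is not James's recipe: it assumes rechunker, xarray, zarr, and dask are installed, and the file names, variable name TS, chunk sizes, and memory limit are placeholders.

import xarray as xr
from rechunker import rechunk

ds = xr.open_mfdataset("member_*.nc")   # hypothetical CESM time-series files

target_chunks = {
    "TS": {"time": ds.sizes["time"], "lat": 32, "lon": 32},  # full time series per chunk
    "time": None,  # leave the coordinates as they are
    "lat": None,
    "lon": None,
}

plan = rechunk(
    ds[["TS"]],                    # rechunker also accepts xarray Datasets as input
    target_chunks,
    max_mem="2GB",
    target_store="TS_rechunked.zarr",
    temp_store="TS_tmp.zarr",
)
plan.execute()

The zarr output can be opened lazily with xr.open_zarr("TS_rechunked.zarr") and, if needed, written back to netcdf with .to_netcdf(), which is the postprocessing step James mentions.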

