I'm getting this error when trying to plot a figure. The same plotting lines work with other data; as far as I can tell, only the xarray dataset I regridded with xesmf throws it. Searching online suggests doing what the error message says and creating a Jupyter server config file to raise the data rate limit, but I'm not sure exactly what that does, or whether I even can, so I'm hesitant to try. Here's the error message:
msgpack.exceptions.ExtraData: unpack(b) received extra data.
IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.
Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)
and the single line of plotting that caused the problem:
dsgr.oxy.isel(z_t=0).plot()
edit: the error also pops up when I run dsgr.compute(), so the regridded data array must be too big; I'll try smaller chunks in dask
Alright, well, the chunk size doesn't seem to matter. I'm regridding a single selected variable that has already been averaged across time, so it's pretty reduced: 60 depths, going from a lat/lon of 145 x 360 to 384 x 320. The dask and xesmf GitHub issues I'm reading through suggest this shouldn't be a problem, and my attempts to generate and edit a config file to raise the rate limit manually don't seem to work. Not sure what else to try.
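(For context, the change the error message suggests would look roughly like the sketch below. This assumes a jupyter_server-based setup and the 1.0e10 value is just an illustrative large number; note that raising the limit only lets more output stream to the browser, it doesn't shrink whatever is producing all that output.)

# ~/.jupyter/jupyter_server_config.py, created with `jupyter server --generate-config`
c.ServerApp.iopub_data_rate_limit = 1.0e10  # bytes/sec; the default is 1.0e6
c.ServerApp.rate_limit_window = 3.0         # seconds over which the rate is measured
# or equivalently, pass it when launching: jupyter lab --ServerApp.iopub_data_rate_limit=1e10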
Emma Hoffman said:
as far as I can tell just my xarray dataset I regridded with xesmf throws this
combined with
edit: the error also pops up when I run dsgr.compute() so the regridded data array must be too big, will try smaller chunks in dask
Makes me wonder if you're running out of memory in the regrid step. Have you remapped other variables from this 145x360 grid to the 384x320 grid? Another possibility would be running out of memory in the time-mean step: are you reading in multiple time levels and computing the time mean in your notebook? dask will do that lazily, so it might be good to try calling .compute() on the source-grid data before remapping it. (How many time levels are you reading in, and what dask resources are you requesting to parallelize the computation?)
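(A rough sketch of that compute-before-remap idea; the variable names, file paths, and the bilinear method are placeholders, not taken from the actual notebook.)

import xarray as xr
import xesmf as xe

# Placeholder paths/names; xESMF looks for 'lat'/'lon' coordinates by default,
# so curvilinear grids may need their coordinates renamed first.
ds_src = xr.open_mfdataset('/path/to/source/*.O2.*.nc', chunks={'time': 4})
ds_dst = xr.open_dataset('/path/to/target_grid.nc')

# Reduce on the source grid and force the result into memory *before* remapping,
# so the regridder starts from a small concrete array rather than a lazy graph.
o2_mean = ds_src.O2.mean('time').compute()

regridder = xe.Regridder(ds_src, ds_dst, method='bilinear', periodic=True)
o2_regridded = regridder(o2_mean)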
@Michael Levy
I did run compute on the time mean and saved it as a netCDF file, then read that back in to use in the regridding, so I know it's not coming from that step, at least. I just tried temperature instead of my desired oxygen and the regridding worked fine: both compute() and plot() ran without errors and a plot came out. Hmm.
For dask, I was asking for 1 node for 20 minutes, with 100GB and 30 workers. The dashboard never even started to load, though; it didn't overload and kill workers or anything, as far as I can tell.
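(For what it's worth, an explicit version of that request with dask_jobqueue might look like the sketch below. PBSCluster and the queue/account names are assumptions about the system, not taken from the notebook; the dashboard link is one way to check that the scheduler actually came up.)

from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Placeholder queue/account; adjust to whatever the local scheduler expects.
cluster = PBSCluster(
    cores=30,
    processes=30,           # 30 single-threaded workers per job
    memory='100GB',
    walltime='00:20:00',
    queue='casper',         # placeholder
    account='PROJECT_CODE'  # placeholder
)
cluster.scale(jobs=1)       # one node's worth of workers
client = Client(cluster)
print(client.dashboard_link)  # served by the scheduler; open it to watch whether workers arrive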
Just tried oxygen again, and it seems to work? Plots are loading, at least. But when I subtract the regridded data from the data originally on that grid (which is the original goal), I get the IOPub data rate error when I run compute or plot on the result. No idea what's going on.
Weird. Can you dump the regridded oxygen to netCDF and then read it back in? It definitely seems like a memory / resource limitation, but given your description there isn't really a resource-limited step. If you point me to your notebook, I'm happy to take a look and try to run it myself
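(Roughly like this, assuming dsgr is the regridded dataset from the notebook; the output path is a placeholder.)

import xarray as xr

# dsgr is assumed to be the regridded dataset; load it into memory first so
# to_netcdf writes plain numpy arrays instead of walking the dask graph.
dsgr_loaded = dsgr.compute()
dsgr_loaded.to_netcdf('/glade/derecho/scratch/USER/oxy_regridded.nc')  # placeholder path

# Read it back in and try the same plot on the round-tripped data.
dsgr_check = xr.open_dataset('/glade/derecho/scratch/USER/oxy_regridded.nc')
dsgr_check.oxy.isel(z_t=0).plot()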
@Michael Levy
Thank you so much for your help. This is the path to the notebook I'm running:
u/home/emmah/1_Variability_Ventilation_O2/exploration_setup.ipynb
When I tried to save the regridded file, it failed to serialize with a "0-dim memory has no length" type error.
I started to poke around this afternoon, but couldn't find a conda environment to run it in (I tried cloning Yassir's TPAC_CO2 environment, but that's missing geopy). What environment are you using?
So sorry about that; I just updated the environment folder in the 1_Variability_Ventilation_O2 folder.
I was able to install your environment and play with your notebook some. I don't have any definitive answers, but I do have a few suggestions:
C=CLSTR(1,"00:30",100,30)

/glade/derecho/scratch/yeddebba/FOSI/LR/g.e22.GOMIPECOIAF_JRA-1p4-2018.TL319_g17.4p2z.002branch.pop.h.O2.195801-202112.nc is a netCDF4 file, and it is already written in chunks - each time level is its own chunk:

$ ncdump -hs /glade/derecho/scratch/yeddebba/FOSI/LR/g.e22.GOMIPECOIAF_JRA-1p4-2018.TL319_g17.4p2z.002branch.pop.h.O2.195801-202112.nc | grep O2
netcdf g.e22.GOMIPECOIAF_JRA-1p4-2018.TL319_g17.4p2z.002branch.pop.h.O2.195801-202112 {
float O2(time, z_t, nlat, nlon) ;
O2:long_name = "Dissolved Oxygen" ;
O2:units = "mmol/m^3" ;
O2:coordinates = "TLONG TLAT z_t " ;
O2:grid_loc = "3111" ;
O2:cell_methods = "time: mean" ;
O2:_FillValue = 9.96921e+36f ;
O2:missing_value = 9.96921e+36f ;
O2:_Storage = "chunked" ;
O2:_ChunkSizes = 1, 60, 384, 320 ;
O2:_Shuffle = "true" ;
O2:_DeflateLevel = 1 ;
O2:_Endianness = "little" ;
:history = "Wed Jun 21 08:40:54 2023: ncks -O -4 -L 1 /glade/scratch/kristenk/archive/g.e22.GOMIPECOIAF_JRA-1p4-2018.TL319_g17.4p2z.002branch/ocn/proc/tseries/month_1/g.e22.GOMIPECOIAF_JRA-1p4-2018.TL319_g17.4p2z.002branch.pop.h.O2.195801-202112.nc /glade/scratch/kristenk/archive/g.e22.GOMIPECOIAF_JRA-1p4-2018.TL319_g17.4p2z.002branch/ocn/proc/tseries/month_1/g.e22.GOMIPECOIAF_JRA-1p4-2018.TL319_g17.4p2z.002branch.pop.h.O2.195801-202112.nc\nnone" ;
I think that changing from chunks={'nlon':60,'nlat':60,'z_t':30} to chunks={'time':4} will avoid some re-chunking. I know the goal is to average over all the time values, so it seems beneficial to have each chunk contain all the time levels, but in my experience rechunking tends to be expensive and should be done sparingly.
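(Concretely, the open/mean step could look like the sketch below; the path and file pattern are taken from the plotting cell further down, and chunks={'time': 4} just groups a few of the file's native 1 x 60 x 384 x 320 chunks per dask task.)

import xarray as xr

dircsl = '/glade/derecho/scratch/yeddebba/FOSI/LR/'
# Chunk only along time, matching the on-disk layout, so no rechunking is needed.
dsl = xr.open_mfdataset(dircsl + '*.O2.*.nc', chunks={'time': 4})
o2_mean = dsl.O2.mean('time')   # still lazy at this point
o2_mean = o2_mean.compute()     # single pass over the files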
Also, it might be worthwhile to create your own copy of /glade/work/yeddebba/GOBAI_O2/GOBAI-O2-v2.1.nc with the longitudes sorted correctly -- that would be more efficient than using sortby every time you read the file (there's a rough sketch of that at the end of this message). Alternatively (or additionally), you could read each file once and then do the subsetting from memory -- e.g. your initial plotting cell could change to
dircsl='/glade/derecho/scratch/yeddebba/FOSI/LR/'
dsl=xr.open_mfdataset(dircsl+'*.O2.*.nc')
-dsl_150= xr.open_mfdataset(dircsl+'*.O2.*.nc',chunks={'z_t':20,'time':100}).sel(z_t=15000,method='nearest')
-dsl_300= xr.open_mfdataset(dircsl+'*.O2.*.nc',chunks={'z_t':20,'time':100}).sel(z_t=30000,method='nearest')
-dsl_500= xr.open_mfdataset(dircsl+'*.O2.*.nc',chunks={'z_t':20,'time':100}).sel(z_t=50000,method='nearest')
+dsl_150=dsl.sel(z_t=15000,method='nearest')
+dsl_300=dsl.sel(z_t=30000,method='nearest')
+dsl_500=dsl.sel(z_t=50000,method='nearest')
Or maybe dsl is already in memory from earlier in the notebook. If you want to chat about it, I'm happy to hop on a Zoom call or Google Meet this afternoon (sometime between 2p and 4p Mountain Time?) or next week.
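(And for the sortby point above, writing a sorted copy once could look like this sketch; the coordinate name 'lon' and the output path are assumptions about the GOBAI file.)

import xarray as xr

gobai = xr.open_dataset('/glade/work/yeddebba/GOBAI_O2/GOBAI-O2-v2.1.nc')
# 'lon' is assumed to be the longitude coordinate name in this file.
gobai_sorted = gobai.sortby('lon')
gobai_sorted.to_netcdf('/glade/work/USER/GOBAI_O2/GOBAI-O2-v2.1.lon_sorted.nc')  # placeholder path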