multiprocessing with matplotlib · python-questions

I am attempting to parallelize three calls to pcolormesh using the Python multiprocessing module. Here is some pseudo-code:

import multiprocessing
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

def setup_ax(label):
  ax = plt.subplot(111,projection=ccrs.LambertConformal(central_longitude=-97.5,central_latitude=38.5),label=label)
  ax.add_feature(cfeature.COASTLINE.with_scale('50m'), linewidth=0.5)
  ax.add_feature(cfeature.STATES, linewidth=0.5)
  ax.add_feature(cfeature.BORDERS, linewidth=0.5)
  return(ax)

def plot_comp(ds,ax,field,minval,cmap,norm):
  p = ax.pcolormesh(ds.lon0,ds.lat0,ds[field].max(dim='z0').where(ds[field].max(dim='z0')>minval),transform=ccrs.PlateCarree(),cmap=cmap,norm=norm)
  return(p)

fig = plt.figure(1,figsize=(22,15))
ax1 = setup_ax('ax1')
ax2 = setup_ax('ax2')
ax3 = setup_ax('ax3')

ax = [ax1,ax2,ax3]
fn = ['f1','f2','f3']
mv = [0.0,0.0,0.0]
cm = [col1,col2,col3]
nm = [norm1,norm2,norm3]

mp = multiprocessing.Pool(max(multiprocessing.cpu_count()-2,1))
results = mp.starmap(plot_comp,[(ds,a,f,m,c,n) for a,f,m,c,n in tuple(zip(ax,fn,mv,cm,nm))])

Traceback (most recent call last):
  File "plot_mdv64_field.py", line 136, in <module>
    results = mp.starmap(plot_comp,[(fileData,a,f,m,c,n) for a,f,m,c,n in tuple(zip(ax,fn,mv,cm,nm))])
  File "/home/dadriaan/.conda/envs/icicle/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/dadriaan/.conda/envs/icicle/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[<matplotlib.collections.QuadMesh object at 0x7f7c5879ba90>]'. Reason: 'AttributeError("Can't pickle local object 'GeoAxes._pcolormesh_patched.<locals>.<lambda>'")'

I can only imagine this has something to do with figures or subplots, but I'm not quite sure in what way. I would expect results to just be a list of three objects returned from pcolormesh in plot_comp(), but I must be missing something.
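The "Can't pickle local object" part of the traceback can be reproduced without matplotlib or cartopy. A multiprocessing pool sends each worker's return value back to the parent by pickling it, and pickle cannot serialize a lambda defined inside another function. The sketch below is only an illustration of that failure mode: FakeMesh, make_mesh, and can_pickle are hypothetical names, not part of any library.

```python
import pickle


class FakeMesh:
    """Hypothetical stand-in for the QuadMesh that pcolormesh returns."""

    def __init__(self, callback):
        self.callback = callback


def make_mesh():
    # A lambda defined inside a function is a "local object": pickle stores
    # functions by qualified name and cannot look this one up again.
    return FakeMesh(callback=lambda value: value)


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


mesh = make_mesh()
print(can_pickle(mesh))      # False: the object holds a local lambda
print(can_pickle("f1.png"))  # True: a plain string, such as an output path
```

Under this reading, the pcolormesh call itself succeeds in the worker and the error only appears when the result is shipped back to the parent process.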

Matt Long (Mar 14 2022 at 17:17):

@Daniel Adriaansen, I am not exactly sure what's going wrong, but in previous instances I've found that matplotlib is not thread-safe. I have used dask.delayed to successfully parallelize plotting. For example, see here.

Kevin Paul (Mar 14 2022 at 19:21):

@Daniel Adriaansen: @Matt Long's suggestion of using Dask for parallelism is a good one. Can I ask if this is "new code" that you have written and it is failing? Or is this an old script that used to work but now does not?

Daniel Adriaansen (Mar 16 2022 at 15:59):

Thank you @Matt Long and @Kevin Paul! To answer Kevin's question- this is new code that I wrote and is failing. Are you mostly curious from a version standpoint (i.e. new versions breaking my old code)? Or is there something else that might be at play here?

I have not used Dask before, and frankly misunderstood it as only useful for parallelizing problems with specific Pythonic data containers/objects like ndarray/DataArray and DataFrames. It turns out that Dask can be used by itself and this opens up a whole new world. From the Dask documentation here https://examples.dask.org/delayed.html, I see "Systems like Dask.dataframe are built with Dask.delayed. If you have a problem that is paralellizable, but isn’t as simple as just a big array or a big dataframe, then dask.delayed may be the right choice for you."

Brian Bonnlander (Mar 16 2022 at 16:52):

Hi Daniel, note that matplotlib is constrained to have one process/thread produce the plot itself. Dask is best used to parallelize the data processing steps for the plot, but the process of constructing the plot itself cannot be easily parallelized.

Daniel Adriaansen (Mar 16 2022 at 17:04):

Thanks @Brian Bonnlander. @Matt Long's example above shows using dask.delayed for constructing the plot itself, presumably in parallel. Again, I am very new to Dask so I may not fully grasp what is going on in the example. Is using dask.delayed useful for calls to things like contourf and pcolormesh?

Brian Bonnlander (Mar 16 2022 at 19:50):

Hi Daniel, I just looked at the example and it seems related to data processing, not plotting. I could be wrong, but everything I've read suggests that contourf and pcolormesh are non-parallelizable. Producing the data values for these plots is parallelizable using Dask, however.

Daniel Adriaansen (Mar 16 2022 at 20:09):

OK thanks- I think I understand. What I am trying to do is get a single Python script to call three instances of pcolormesh simultaneously for three different plots (not the same plot). This would be equivalent to running three separate Python scripts at the same time to call pcolormesh. Separate resources are used for each, but I just want to do it from a single script. I'm not actually interested in parallelizing the work that's done within pcolormesh, but rather in running multiple calls to it simultaneously on the same piece of hardware.

Brian Bonnlander (Mar 16 2022 at 20:43):

Ah, I see. It may be possible to do that, as long as the plots are completely distinct and not combined as separate subplots. Again though, this is more based on what I've read.

Daniel Adriaansen (Mar 17 2022 at 19:43):

After much tinkering (including with dask a bit), I was ultimately successful with my original approach using multiprocessing:

import multiprocessing
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

def setup_ax(label):
  ax = plt.subplot(111,projection=ccrs.LambertConformal(central_longitude=-97.5,central_latitude=38.5),label=label)
  ax.add_feature(cfeature.COASTLINE.with_scale('50m'), linewidth=0.5)
  ax.add_feature(cfeature.STATES, linewidth=0.5)
  ax.add_feature(cfeature.BORDERS, linewidth=0.5)
  return(ax)

def plot_comp(ds,field,minval,cmap,norm):
  ax = setup_ax(field)
  p = ax.pcolormesh(ds.lon0,ds.lat0,ds[field].max(dim='z0').where(ds[field].max(dim='z0')>minval),transform=ccrs.PlateCarree(),cmap=cmap,norm=norm)
  # one output file per field
  fig.savefig(field+'.png')

# New figure
fig = plt.figure(1,figsize=(22,15))

# Items for parallelizing
fn = ['f1','f2','f3']
mv = [0.0,0.0,0.0]
cm = [col1,col2,col3]
nm = [norm1,norm2,norm3]

mp = multiprocessing.Pool(max(multiprocessing.cpu_count()-2,1))
mp.starmap(plot_comp,[(ds[fn],f,m,c,n) for f,m,c,n in tuple(zip(fn,mv,cm,nm))])

The major change I think was defining a new axis and saving the figure within plot_comp(). Thus, nothing is returned from multiprocessing in this instance. Timing within Python shows roughly a 40-50% speedup taking this approach for three fields.
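The pattern described here (each worker builds its own axes, saves the image, and returns nothing that needs pickling) can be sketched in a self-contained form. This is a minimal illustration and not the script above: it assumes the non-interactive Agg backend, uses synthetic data in place of the dataset, leaves out cartopy, and render_field is a hypothetical name.

```python
import multiprocessing
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np


def render_field(name, scale, out_dir):
    """Build one figure from scratch, save it, and return only the file path."""
    x = np.linspace(-3.0, 3.0, 60)
    y = np.linspace(-2.0, 2.0, 40)
    xx, yy = np.meshgrid(x, y)
    data = scale * np.exp(-(xx**2 + yy**2))  # synthetic placeholder data

    # The figure and axes are created inside the worker, never passed in.
    fig, ax = plt.subplots(figsize=(6, 4))
    mesh = ax.pcolormesh(xx, yy, data, shading="auto")
    fig.colorbar(mesh, ax=ax)
    ax.set_title(name)

    path = os.path.join(out_dir, name + ".png")
    fig.savefig(path)
    plt.close(fig)  # release the figure in this worker
    return path  # a plain string pickles cleanly


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    jobs = [("f1", 1.0, out_dir), ("f2", 2.0, out_dir), ("f3", 3.0, out_dir)]
    with multiprocessing.Pool(processes=3) as pool:
        paths = pool.starmap(render_field, jobs)
    print(paths)
```

Returning the path rather than the QuadMesh keeps the worker's result trivially picklable, and the `if __name__ == "__main__":` guard keeps the pool setup from re-running if worker processes re-import the script.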

Deepak Cherian (Mar 18 2022 at 02:07):

Nice @Daniel Adriaansen . This would make a great blogpost for the ESDS blog if you're up for contributing: https://ncar.github.io/esds/

Daniel Adriaansen (Mar 21 2022 at 16:19):

Thanks for the opportunity! I'd be happy to contribute this as an example. What's the best way to coordinate? Feel free to email me directly (I believe my email is visible in my profile, but if not reply here and I will send it directly).

Deepak Cherian (Mar 22 2022 at 14:30):

A pull request here would be best: https://github.com/NCAR/esds .

Stream: python-questions

Topic: multiprocessing with matplotlib

Daniel Adriaansen (Mar 14 2022 at 15:50):