Hi
I have the following script
/glade/u/home/lamar/Python/CMIP6_analysis/PM2.5/health_ssp.py that goes through a (very) long list of CMIP6 files and computes health impacts from PM2.5. I don't want to change the main functions GEMM and MORBIDITY, so the only speed-up I can think of would be to parallelize over the files themselves. Thoughts?
Seems like a good candidate for dask: eliminate the for loop, write a wrapper function that reads and processes a single file, then have each dask worker call that function. I'm not sure how to implement this, but I am interested in the solution to this problem.
A simple implementation could look something like this:
import dask

def process_one_file(file_in):
    # read the file and do the analysis here (GEMM, MORBIDITY, etc.)
    ...

delayed_objs = []
for f in file_list:
    # dask.delayed wraps the call so it is built lazily; nothing runs yet
    delayed_objs.append(dask.delayed(process_one_file)(f))

results = dask.compute(*delayed_objs)  # execute all the file tasks in parallel
You can also use a decorator:
@dask.delayed
def process_one_file(file_in):
    # read the file and do the analysis here
    ...

delayed_objs = []
for f in file_list:
    delayed_objs.append(process_one_file(f))  # already lazy because of the decorator

results = dask.compute(*delayed_objs)
One common pitfall is creating too many tasks; see:
https://docs.dask.org/en/latest/delayed-best-practices.html#avoid-too-many-tasks
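If the file list is long enough that one task per file turns into scheduler overhead, one way to follow that advice is to batch the files so each task processes several of them. A minimal sketch, assuming the undecorated process_one_file wrapper from the first example; process_batch and batch_size are illustrative names, not anything from the original script:

import dask

def process_batch(files_in):
    # one task handles a whole chunk of files, keeping the total task count low
    return [process_one_file(f) for f in files_in]

batch_size = 50  # illustrative; tune so the number of tasks stays modest
batches = [file_list[i:i + batch_size]
           for i in range(0, len(file_list), batch_size)]

delayed_objs = [dask.delayed(process_batch)(b) for b in batches]
results = dask.compute(*delayed_objs)  # a tuple of per-batch result lists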
This is great. Do I need to specify multiple MPI tasks when requesting a cluster?
Update: I tried setting the number of CPUs to 8 and MPI tasks to 8, and it seems to work.
You may want to configure a dask cluster; you can use ncar_jobqueue to do this:
https://github.com/NCAR/ncar-jobqueue
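A minimal sketch of that setup, assuming NCARCluster picks up the queue and resource defaults from the ncar-jobqueue configuration file; the job count here is illustrative:

import dask
from dask.distributed import Client
from ncar_jobqueue import NCARCluster

cluster = NCARCluster()        # queue/resource settings come from the ncar-jobqueue config
cluster.scale(jobs=4)          # illustrative: ask the scheduler for 4 worker jobs
client = Client(cluster)       # once a client exists, dask.compute runs on the cluster

results = dask.compute(*delayed_objs)  # the delayed objects built in the snippets above

With this approach the cluster object submits the worker batch jobs itself, so the driver script should not need to be launched with multiple MPI tasks.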