Stream: python-questions

Topic: speeding up a script


Jean-Francois Lamarque (Oct 26 2021 at 23:14):

Hi

I have the following script, /glade/u/home/lamar/Python/CMIP6_analysis/PM2.5/health_ssp.py, which goes through a (very) long list of files (from CMIP6) and computes health impacts from PM2.5. I don't want to change the main functions, GEMM and MORBIDITY, so the only speedup I can think of is to parallelize over the files themselves. Thoughts?

Stephen Yeager (Oct 27 2021 at 16:43):

Seems like a good candidate for dask. Eliminate the for loop, perhaps write a wrapper function that reads & processes files, then have each dask worker call the function. I'm not sure how to implement this, but am interested in the solution to this problem.

Matt Long (Oct 27 2021 at 17:42):

A simple implementation could look something like this:

import dask

def process_one_file(file_in):
    # do the analysis for a single file here and return the result
    ...

delayed_objs = []
for f in file_list:
    delayed_objs.append(
        dask.delayed(process_one_file)(f)
    )

results = dask.compute(*delayed_objs)

You can also use a decorator:

@dask.delayed
def process_one_file(file_in):
    # do the analysis for a single file here and return the result
    ...

delayed_objs = []
for f in file_list:
    delayed_objs.append(process_one_file(f))

results = dask.compute(*delayed_objs)

One common pitfall to avoid is creating too many tasks; see:
https://docs.dask.org/en/latest/delayed-best-practices.html#avoid-too-many-tasks
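If the CMIP6 file list runs to many thousands of entries, one delayed call per file can itself swamp the scheduler. A minimal sketch of the batching idea from that page, reusing process_one_file and file_list from above (the batch size of 100 is an arbitrary placeholder to tune):

import dask

def process_batch(files_in):
    # run the per-file analysis over a whole batch of files in one task
    return [process_one_file(f) for f in files_in]

batch_size = 100  # placeholder; pick a value that keeps the total task count modest
delayed_objs = []
for i in range(0, len(file_list), batch_size):
    delayed_objs.append(dask.delayed(process_batch)(file_list[i:i + batch_size]))

results = dask.compute(*delayed_objs)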

Jean-Francois Lamarque (Nov 01 2021 at 19:12):

This is great. Do I need to specify multiple MPI tasks when requesting a cluster?

Update: I tried it with the number of CPUs set to 8 and the number of MPI tasks set to 8, and it seems to work.

Matt Long (Nov 01 2021 at 19:52):

You may want to configure a dask cluster; you can use ncar_jobqueue to do this:
https://github.com/NCAR/ncar-jobqueue
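A minimal sketch of what that could look like, assuming ncar-jobqueue's NCARCluster with its machine-specific resource defaults (the worker count of 8 is a placeholder):

import dask
from dask.distributed import Client
from ncar_jobqueue import NCARCluster

cluster = NCARCluster()   # resource defaults come from the ncar-jobqueue config
cluster.scale(8)          # placeholder: request 8 workers from the batch system
client = Client(cluster)  # route dask.compute through this cluster

results = dask.compute(*delayed_objs)  # delayed_objs built as in the examples above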

