Stream: python-questions
Topic: speeding up a script
Jean-Francois Lamarque (Oct 26 2021 at 23:14):
Hi
I have the following script
/glade/u/home/lamar/Python/CMIP6_analysis/PM2.5/health_ssp.py that goes through a (very) long list of CMIP6 files and computes health impacts from PM2.5. I don't want to change the main functions, GEMM and MORBIDITY, so the only speed-up I can think of is to parallelize over the files themselves. Thoughts?
Stephen Yeager (Oct 27 2021 at 16:43):
Seems like a good candidate for dask. Eliminate the for loop: perhaps write a wrapper function that reads and processes a file, then have each dask worker call that function. I'm not sure how to implement this, but am interested in the solution to this problem.
Matt Long (Oct 27 2021 at 17:42):
A simple implementation could look something like this:
import dask

def process_one_file(file_in):
    # do the analysis here and return the result
    ...

# build one lazy (delayed) task per file
delayed_objs = []
for f in file_list:
    delayed_objs.append(
        dask.delayed(process_one_file)(f)
    )

# run all the tasks in parallel and gather the results
results = dask.compute(*delayed_objs)
You can also use a decorator:
@dask.delayed
def process_one_file(file_in):
    # do the analysis here and return the result
    ...

# calling the decorated function builds a task instead of running it
delayed_objs = []
for f in file_list:
    delayed_objs.append(process_one_file(f))

results = dask.compute(*delayed_objs)
A common pitfall to watch out for is creating too many small tasks; see here:
https://docs.dask.org/en/latest/delayed-best-practices.html#avoid-too-many-tasks
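If the file list is very long, one way to follow that advice (a sketch, not from the thread above; batch_size and process_batch are illustrative names) is to have each task process a batch of files rather than a single one:

import dask

def process_batch(files):
    # process several files inside one task to keep the task count manageable
    return [process_one_file(f) for f in files]

batch_size = 50  # illustrative; tune to the number and size of your files
batches = [file_list[i:i + batch_size]
           for i in range(0, len(file_list), batch_size)]

delayed_objs = [dask.delayed(process_batch)(b) for b in batches]
results = dask.compute(*delayed_objs)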
Jean-Francois Lamarque (Nov 01 2021 at 19:12):
This is great. Do I need to specify multiple MPI tasks when requesting a cluster?
Update: I tried setting the number of CPUs to 8 and the number of MPI tasks to 8, and it seems to work.
Matt Long (Nov 01 2021 at 19:52):
You may want to configure a dask cluster; you can use ncar_jobqueue to do this:
https://github.com/NCAR/ncar-jobqueue
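A minimal sketch of what that might look like (the worker count below is a placeholder, and the queue/project/resource settings are assumed to come from your ncar-jobqueue configuration rather than from this thread):

import dask
from dask.distributed import Client
from ncar_jobqueue import NCARCluster

# resource requests (queue, project, cores, memory) come from your
# ncar-jobqueue configuration, or can be passed as keyword arguments
cluster = NCARCluster()
cluster.scale(8)          # placeholder: number of dask workers to launch
client = Client(cluster)  # once connected, dask.compute() uses these workers

# delayed_objs as built in the earlier examples
results = dask.compute(*delayed_objs)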
Last updated: Jan 30 2022 at 12:01 UTC