Hi
I have the following script
/glade/u/home/lamar/Python/CMIP6_analysis/PM2.5/health_ssp.py that goes through a (very) long list of CMIP6 files and computes health impacts from PM2.5. I don't want to change the main functions GEMM and MORBIDITY, so the only speed-up I can think of would be to parallelize over the files themselves. Thoughts?
Seems like a good candidate for dask: eliminate the for loop, write a wrapper function that reads and processes a single file, then have each dask worker call that function. I'm not sure how to implement this, but I am interested in the solution to this problem.
A simple implementation could look something like this:
import dask

def process_one_file(file_in):
    # read the file and do the analysis here (GEMM, MORBIDITY, etc.)
    ...

delayed_objs = []
for f in file_list:
    # dask.delayed wraps the call so it is built lazily; nothing runs yet
    delayed_objs.append(dask.delayed(process_one_file)(f))

results = dask.compute(*delayed_objs)  # execute all the file tasks in parallel
You can also use a decorator:
@dask.delayed
def process_one_file(file_in):
    # read the file and do the analysis here
    ...

delayed_objs = []
for f in file_list:
    delayed_objs.append(process_one_file(f))  # already lazy because of the decorator

results = dask.compute(*delayed_objs)
One common pitfall is creating too many tasks; see:
https://docs.dask.org/en/latest/delayed-best-practices.html#avoid-too-many-tasks
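If the file list is long enough that one task per file turns into scheduler overhead, one way to follow that advice is to batch the files so each task processes several of them. A minimal sketch, assuming the undecorated process_one_file wrapper from the first example; process_batch and batch_size are illustrative names, not anything from the original script:

import dask

def process_batch(files_in):
    # one task handles a whole chunk of files, keeping the total task count low
    return [process_one_file(f) for f in files_in]

batch_size = 50  # illustrative; tune so the number of tasks stays modest
batches = [file_list[i:i + batch_size]
           for i in range(0, len(file_list), batch_size)]

delayed_objs = [dask.delayed(process_batch)(b) for b in batches]
results = dask.compute(*delayed_objs)  # a tuple of per-batch result lists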
This is great. Do I need to specify multiple MPI tasks when requesting a cluster?
Update: I tried setting the number of CPUs to 8 and MPI tasks to 8, and it seems to work.
You may want to configure a dask cluster; you can use ncar_jobqueue to do this:
https://github.com/NCAR/ncar-jobqueue
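A minimal sketch of that setup, assuming NCARCluster picks up the queue and resource defaults from the ncar-jobqueue configuration file; the job count here is illustrative:

import dask
from dask.distributed import Client
from ncar_jobqueue import NCARCluster

cluster = NCARCluster()        # queue/resource settings come from the ncar-jobqueue config
cluster.scale(jobs=4)          # illustrative: ask the scheduler for 4 worker jobs
client = Client(cluster)       # once a client exists, dask.compute runs on the cluster

results = dask.compute(*delayed_objs)  # the delayed objects built in the snippets above

With this approach the cluster object submits the worker batch jobs itself, so the driver script should not need to be launched with multiple MPI tasks.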