# Dask Introduction

Dask is a solution that enables you to scale Python libraries. It mimics the APIs of popular scientific libraries such as NumPy, pandas, and xarray, which provides an easier path to parallel processing without requiring you to refactor your code.
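For example, here is a minimal sketch of the NumPy-like interface (the array shape and chunk sizes are arbitrary illustrative choices):

```python
import dask.array as da

# A Dask array mimics the NumPy API but is split into chunks
# that can be processed in parallel.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Operations build a lazy task graph; .compute() runs it in parallel.
print(x.mean().compute())
```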

There are 3 components to parallel processing with Dask: the client, the scheduler, and the workers.

The Client is best envisioned as the application that communicates with Dask. In Python applications this is handled when you define `client = Client(CLUSTER_TYPE)`. A Dask cluster consists of a single scheduler that manages the execution of tasks on workers. The `CLUSTER_TYPE` can be defined in a number of different ways, as sketched after the list below.

- You can spin up a `LocalCluster`, a cluster running on the same hardware as the application and sharing its available resources, directly in Python with `dask.distributed`.

- You can spin up a dedicated Dask cluster with its own resources with `dask.gateway`.

- You can also use `dask_jobqueue` to connect to HPC job schedulers such as Slurm and PBS, which provision resources for you.
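As a rough sketch of these three options (the worker counts, cores, memory, and walltime values are illustrative assumptions, not recommendations):

```python
from dask.distributed import Client, LocalCluster

# Option 1: LocalCluster runs on the same machine as the application.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)  # sizes are illustrative
client = Client(cluster)

# Option 2: dask.gateway provisions a dedicated cluster with its own resources.
# from dask_gateway import Gateway
# gateway = Gateway()        # the gateway address usually comes from deployment config
# cluster = gateway.new_cluster()
# cluster.scale(4)           # illustrative worker count
# client = cluster.get_client()

# Option 3: dask_jobqueue submits workers through an HPC scheduler such as Slurm.
# from dask_jobqueue import SLURMCluster
# cluster = SLURMCluster(cores=8, memory="16GB", walltime="01:00:00")  # illustrative
# cluster.scale(jobs=2)      # request workers via two Slurm jobs
# client = Client(cluster)
```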

The `dask.distributed` `Client` can also be used to connect to an existing cluster. For example, you can deploy a Dask scheduler and workers to Kubernetes without using a Python function to create the cluster. In that case the `Client` is configured to connect to the scheduler either by IP address or by the Kubernetes (k8s) Service name, with the latter being preferred because we can set the Service name but not the IP.
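A minimal sketch of connecting to such a pre-deployed scheduler (the Service name and namespace below are hypothetical; 8786 is the scheduler's default port):

```python
from dask.distributed import Client

# Preferred: connect via the Kubernetes Service name, which stays stable.
# (Hypothetical Service "dask-scheduler" in namespace "dask".)
client = Client("tcp://dask-scheduler.dask.svc.cluster.local:8786")

# Also possible, but fragile, since the scheduler pod's IP can change:
# client = Client("tcp://10.0.0.15:8786")

print(client.dashboard_link)  # quick check that the connection works
```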