Posts tagged dask

Debugging dask workflows: Detrending

Detrending - subtracting a trend, commonly a linear fit, from the data - along the time dimension is a common workflow in the climate sciences.

Here’s an example

../../../_images/e25cf220f15e78e7aa8d54f8e9d18a52e56c57a24b0f30bbd7bc7410543329f9.png

Read more ...


Sparse arrays and the CESM land model component

An underappreciated feature of Xarray + Dask is the ability to plug in different array types. Usually we work with Xarray wrapping a Dask array which in turn uses NumPy arrays for each block; or just Xarray wrapping NumPy arrays directly. NumPy arrays are dense in-memory arrays. Other array types exist:

sparse for sparse arrays

../../../_images/c4d9bf482f609b5c8823a94f55faa35a854a944a89bdc967aff406eb134a52c5.png

Read more ...


ESDS Update November 2021

November was an active month! There were a couple of ESDS Forum talks, a variety of answered Python questions during office hours, and a Python tutorial!

Check out the following ESDS update for the month of November 2021.

november-2021-office-hours

Read more ...


ESDS Update October 2021

October has been an active month! There were a variety of talks, a variety of answered Python questions during office hours, and a Python tutorial!

Check out the following ESDS update for the month of October 2021.

october-2021-office-hours

Read more ...


Reading WRF data into Xarray and Visualizing the Output using hvPlot

The typical data workflow within the Python ecosystem when working with Weather Research and Forecasting (WRF) data is to use the wrf-python package! Traditionally, it can be difficult to utilize the xarray data model with WRF data, due to a few challenges:

WRF data not being CF-compliant (which makes it hard for xarray to properly construct the dataset out of the box using xr.open_dataset)

xarray-wrf-gif

Read more ...


Benchmarking Performance of History vs. Timeseries Files

In this example, we will look at how long reading data from the Community Earth System Model (CESM), applying calculations, and visualizing the output takes using the following packages:

ecgtools

Read more ...


Scaling Python with Dask Class Takeaways

This week, I had the opportunity to attend the Scaling Python with Dask class offered by Coiled. The class provided an overview of a variety of topics, including:

Parellelizing Python Code

NCARCluster view

Read more ...


Dask Distributed Summit 2021 Takeaways

Cloud optimized datasets help improve speed of analysis

Entire 4 TB datasets open up in a few seconds

Read more ...


How to Use xarray.map_blocks for Vertical Interpolation of a 3D Field

Within this example, we cover how to use xarray.map_blocks to calculate the mixed-layer depth within the CESM POP model output.

This calculation is “embarassingly parallel” such that each calculation is done within a single a column. The calculation should be easily computed within each column across the model domain. This is where map_blocks can be used to improve the performance of this metric.

MLD example plot

Read more ...


NCAR-Jobqueue

Last week, we added posts detailing how to configure Dask using the new PBS scheduler on Casper. In this week’s example, we provide an example of the recent updates to ncar-jobqueue, added by Anderson Banihirwe, which allow users to easily configure dask on Casper without having to add many extra steps.

You must update the package to use the newest updates. You can update using conda!

Read more ...


An Example of Using Intake-ESM

This past week, NCAR CISL updated the Casper Node to use PBS instead of Slurm for scheduling jobs. This led a post in which an example of spinning up dask clusters on the new configuration. This was also an opportunity to dig into dask, and try applying it to a sample task, specifically looking at ecosystem variables in the CESM-LE dataset, using notebooks included in Matt Long’s krill-cesm-le repository, modified by Kristen Krumhardt.

Here, we spin up our dask cluster. At first, running this notebook resulted in a killed worker error. After further expection, we noticed that additional resources would be needed to read in the notebook since the data are so large (on the order of ~1-2 TB). Increasing the individual worker to a higher amount (ex. 256 GB) solved the issue. Scale up to as many workers as you think are neccessary for the calculation (this may take some trial and error).

Read more ...


Using Dask on the New Casper PBS Scheduler

Casper will complete a transition from Slurm to the PBS Pro workload manager on April 7, 2021. This has implications for how to spin up a Dask cluster, including via the NCAR Jupyterhub.

Below is an example script suitable for the new configuration using the PBSCluster function from dask_jobqueue. Note that the ncar_jobqueue package requires updating to work with the new configuration.

Read more ...


Writing multiple netCDF files in parallel with xarray and dask

A typical computation workflow with xarray consists of:

reading one or more netCDF files into an xarray dataset backed by dask using xr.open_mfdataset() or xr.open_dataset(chunks=...),

../../../_images/68a7c16c9445f3a38f0c52d82c39b90ee5c24aa7b18fdc82136c224079f446da.png

Read more ...