Benchmarking Performance of History vs. Timeseries Files with ecgtools, Intake-ESM, and Dask

In this example, we measure how long it takes to read data from the Community Earth System Model (CESM), apply calculations, and visualize the output using the packages named in the title: ecgtools, Intake-ESM, and Dask.

We want to find out whether these operations run faster on the history files output by the model or on the timeseries files generated from those history files. Our hypothesis is that performance should be substantially better when reading from timeseries files, but let’s take a look…

We use CESM data on the GLADE filesystem, from a case which includes both history and timeseries files on disk.
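Before getting to the real data, here is a minimal sketch of how each step could be timed with the standard library. The `timed` helper and the `timings` dictionary are hypothetical names introduced only for illustration, and the `sum(...)` calls are placeholders standing in for the actual read, compute, and plot steps.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Placeholder workloads; in practice these would be the history and
# timeseries workflows being compared.
with timed("history", timings):
    sum(range(1_000_000))

with timed("timeseries", timings):
    sum(range(1_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

When the work inside the block is lazy, as it usually is with Dask-backed arrays, the computation has to be triggered inside the `with` block (for example with `.compute()` or by rendering the plot); otherwise only the graph construction is measured.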

Imports

Installing packages via conda-forge

As of this week, ecgtools is available via conda-forge, which is very exciting! You can install the packages used here with the following command:

conda install -c conda-forge ecgtools ncar-jobqueue distributed intake-esm pandas

We will also install hvPlot from the pyviz channel to help with visualization:

conda install -c pyviz hvplot
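Once the install commands have run, a quick check like the one below can confirm that the packages are importable before starting the notebook. This is only a sketch: `missing_packages` is a hypothetical helper, and the names in `REQUIRED` are the import names we assume correspond to the packages installed in this section (note `intake_esm` and `ncar_jobqueue` use underscores rather than hyphens).

```python
from importlib.util import find_spec

# Import names assumed to correspond to the conda packages installed above.
REQUIRED = [
    "ecgtools",
    "ncar_jobqueue",
    "distributed",
    "intake_esm",
    "pandas",
    "hvplot",
]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the modules can be located in the current environment; it does not verify versions.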
# Standard-library imports
import ast
import time
import warnings

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

import holoviews as hv
import hvplot
import hvplot.pandas
import intake
import pandas as pd
from dask.distributed import performance_report
from distributed import Client
from ecgtools import Builder
from ecgtools.parsers.cesm import parse_cesm_history, parse_cesm_timeseries
from IPython.display import HTML
from ncar_jobqueue import NCARCluster

hv.extension('bokeh')