Introduction - OSDF usage examples

Welcome to the OSDF Examples repository! This repository provides example notebooks and scripts that demonstrate how to access data via the Open Science Data Federation (OSDF) using PelicanFS. All the notebooks show how to stream geoscience data into your workflows and perform an interesting calculation or visualization.

A short primer on OSDF and PelicanFS

If accessing scientific data still feels like “download a giant archive, then analyze it locally,” OSDF is the alternative. The Open Science Data Federation is an NSF-funded content-distribution layer for science: it sits in front of existing repositories and streams data over HTTPS to wherever your code is running.

Two pieces of jargon are worth knowing:

Origin — a server that connects an existing data repository to the federation. For example, NCAR runs on-prem OSDF origins that expose datasets from GDEX (NCAR’s Geoscience Data Exchange — 17 PB across 1600+ datasets on POSIX storage). The origin is a separate piece of hardware that talks to GDEX’s storage; the two are co-located but conceptually distinct.
Cache — a server that holds temporary copies of frequently-requested objects close to where computation happens. For instance, NCAR also runs an on-prem cache so Casper/Derecho users get fast access to data from any OSDF origin (not just NCAR’s).

You don’t have to think about origins and caches when you read data; the Pelican packages handle this transparently. In this repository, we use PelicanFS, the Pelican Python client. It is an FSSpec implementation, so it plugs into anything that already speaks FSSpec: xarray, intake, intake-esm, pandas. Two URL schemes appear throughout this book:

| Scheme | Format | Used for |
|---|---|---|
| `osdf` | `osdf:///<namespace-path>` | OSDF data (note the three slashes) |
| `pelican` | `pelican://<federation-host>/<namespace-path>` | Other Pelican federations |
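
To make the difference between the two schemes concrete, here is a minimal sketch that splits either kind of URL into a federation host and a namespace path using only the standard library. It assumes that `osdf:///` is shorthand for the OSDF federation at `osg-htc.org`; the function name `split_pelican_url` and the dataset ID `d000000` are hypothetical, and PelicanFS does its own parsing internally.

```python
from urllib.parse import urlsplit

# Assumption: the OSDF federation is discovered at this host.
OSDF_FEDERATION_HOST = "osg-htc.org"


def split_pelican_url(url: str) -> tuple[str, str]:
    """Return (federation_host, namespace_path) for an osdf:// or pelican:// URL."""
    parts = urlsplit(url)
    if parts.scheme == "osdf":
        # osdf URLs carry no host, hence the three slashes.
        if parts.netloc:
            raise ValueError(f"osdf URLs take no host; use three slashes: {url!r}")
        return OSDF_FEDERATION_HOST, parts.path
    if parts.scheme == "pelican":
        # pelican URLs name the federation host explicitly.
        if not parts.netloc:
            raise ValueError(f"pelican URLs need a federation host: {url!r}")
        return parts.netloc, parts.path
    raise ValueError(f"unsupported scheme: {parts.scheme!r}")


# Hypothetical dataset ID, for illustration only.
host, path = split_pelican_url("osdf:///ncar/gdex/d000000")
```

An `osdf://` URL written with only two slashes would make the first path segment look like a host, which is why the sketch rejects it.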

Common namespaces in this book:

osdf:///ncar/gdex/<dataset_id> — NCAR/GDEX datasets via the NCAR origin.
osdf:///aws-opendata/us-west-2/... and .../us-west-1/... — AWS Open Data (CMIP6, CESM2 LENS, HRRR, etc.) via the AWS origin.

A typical xarray + zarr call looks like:

```python
import xarray as xr
ds = xr.open_zarr("osdf:///aws-opendata/us-west-2/cmip6-pds/.../...")
```

For a deeper introduction with executable examples, see Project Pythia’s OSDF Cookbook — its first chapters cover the OSDF concept and PelicanFS usage in detail. To learn how NCAR integrated OSDF with its data infrastructure, see Integration of OSDF with NCAR’s data infrastructure: Interim Project Report (Oct 2025).

Find a notebook

The collection is organized by data origin rather than a fixed list of notebooks, so it scales as new examples are added. Use whichever entry point matches what you have:

You have NCAR HPC access (Casper/Derecho). Browse the GDEX / NCAR Data Origin section — those notebooks stream GDEX data through NCAR’s on-prem OSDF origin and run on Casper.
You want to run on a laptop or in the cloud. Look for notebooks tagged platform:laptop or platform:jetstream2. The NDC Pathfinder workflows are a good starting point.
You want to compare OSDF performance. See the benchmark notebooks under the NDC section.
Brand new and just want to see something work. Open simple_aws_example.ipynb.

The full tagged index lives in the Notebook Gallery.

How notebooks are tagged

Every notebook carries a faceted set of tags in its frontmatter so users can filter by axis (compute platform, data origin, dataset, task, level). The facets are:

| Facet | Examples |
|---|---|
| `origin:` | `aws`, `ncar-posix`, `ncar-object-store` |
| `platform:` | `casper`, `stampede3`, `jetstream2`, `ospool`, `laptop` |
| `dataset:` | `cesm`, `cmip6`, `era5`, `conus404`, `na-cordex`, `hrrr`, `dart`, `jra3q`, `hadisst` |
| `task:` | `bias-correction`, `climatology`, `ml`, `benchmark`, `visualization`, `ecs` |
| `level:` | `beginner`, `intermediate`, `advanced` |

NCAR has two OSDF origins. The first, ncar-posix, serves POSIX storage under the namespace osdf:///ncar/gdex/... (older notebooks may use osdf:///ncar/rda/..., the same origin under its previous name). The second, ncar-object-store, serves the namespace osdf:///ncar-gdex/... from NCAR’s object storage, currently called Boreas.
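
The relationship between namespaces and `origin:` tag values can be written down as a small lookup. The sketch below is illustrative only: the prefixes come from the text above, the mapping of `/aws-opendata/` to the `aws` tag is an inference from the facet table, and the names `ORIGIN_BY_PREFIX` and `origin_tag` are hypothetical rather than part of any Pelican package.

```python
from typing import Optional

# Namespace prefix -> origin: tag value, as described in the text.
ORIGIN_BY_PREFIX = {
    "/ncar/gdex/": "ncar-posix",
    "/ncar/rda/": "ncar-posix",         # previous name of the same origin
    "/ncar-gdex/": "ncar-object-store",
    "/aws-opendata/": "aws",            # assumption: inferred from the facet table
}


def origin_tag(url: str) -> Optional[str]:
    """Return the origin: tag value for an osdf:/// URL, or None if unrecognized."""
    scheme = "osdf://"
    if not url.startswith(scheme + "/"):
        return None
    path = url[len(scheme):]  # keeps the leading slash of the namespace path
    for prefix, tag in ORIGIN_BY_PREFIX.items():
        if path.startswith(prefix):
            return tag
    return None
```

Because `/ncar/gdex/` and `/ncar-gdex/` differ by a single character, matching on the full prefix including both slashes avoids confusing the two origins.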

Searching by tag

Tags are full-text indexed by the book’s search bar (the magnifying-glass icon at the top of every page, or press / on your keyboard). Type a tag value to find every notebook that carries it. For example:

Type platform:casper to list every notebook tested on NCAR Casper.
Type dataset:cmip6 to find all CMIP6 examples.
Type task:bias-correction to find every bias-correction workflow.
Combine with a free-text term — era5 precip narrows to ERA5 precipitation notebooks.
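
The queries above behave roughly like an "all terms must match" filter over each notebook's tags and text. The sketch below emulates that idea on an in-memory list; it is not how the book's search bar is implemented, and the notebook titles, the `NOTEBOOKS` list, and the `search` function are hypothetical.

```python
# Hypothetical notebook records; real tags live in each notebook's frontmatter.
NOTEBOOKS = [
    {
        "title": "ERA5 precipitation climatology",
        "tags": ["dataset:era5", "task:climatology", "platform:casper"],
    },
    {
        "title": "CMIP6 bias correction",
        "tags": ["dataset:cmip6", "task:bias-correction", "platform:laptop"],
    },
]


def search(query: str, notebooks=NOTEBOOKS) -> list[str]:
    """Return titles of notebooks whose title or tags contain every query term."""
    terms = query.lower().split()
    hits = []
    for nb in notebooks:
        # Search over the title and all tags as one lowercase string.
        text = " ".join([nb["title"], *nb["tags"]]).lower()
        if all(term in text for term in terms):
            hits.append(nb["title"])
    return hits
```

With these records, `search("platform:casper")` and `search("era5 precip")` both return only the ERA5 notebook.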

Each visible tag pill on a notebook page (or in the gallery) is also a clickable link into the Tag Index, where you can see every notebook that shares that tag in one place.

A note on platform: tags. Most notebooks are designed to run on a user’s own machine via a Dask LocalCluster, and only opt into PBS/Slurm when a flag is set. The platform: tag therefore documents where the notebook has been verified to run, not the only place it can run. A notebook tagged platform:casper was tested on Casper using PBS; flip the cluster switch in the notebook (e.g. USE_PBS_SCHEDULER = False) and the same notebook runs locally.
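
The switch described above might look like the following sketch. `USE_PBS_SCHEDULER` is the flag named in the text; the helper names `cluster_spec` and `make_cluster` and all resource values are hypothetical, and individual notebooks may wire this up differently. `LocalCluster` comes from `dask.distributed` and `PBSCluster` from the separate `dask_jobqueue` package.

```python
import importlib

USE_PBS_SCHEDULER = False  # set to True on a PBS system such as Casper


def cluster_spec(use_pbs: bool) -> tuple[str, str, dict]:
    """Choose (module name, class name, keyword arguments) for the Dask cluster."""
    if use_pbs:
        # Hypothetical resources; real values depend on your allocation and queue.
        kwargs = {"cores": 1, "memory": "4GiB", "queue": "casper", "walltime": "01:00:00"}
        return "dask_jobqueue", "PBSCluster", kwargs
    # Local default: a small cluster on the current machine.
    return "dask.distributed", "LocalCluster", {"n_workers": 2, "threads_per_worker": 1}


def make_cluster(use_pbs: bool = USE_PBS_SCHEDULER):
    """Import the chosen cluster class lazily and instantiate it."""
    module_name, class_name, kwargs = cluster_spec(use_pbs)
    # Lazy import: dask_jobqueue is only needed when the PBS branch is taken.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)(**kwargs)
```

Importing the cluster class lazily means a laptop user does not need `dask_jobqueue` installed unless the flag is set to `True`.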

For the full taxonomy and conventions, see CONTRIBUTING.md.

How is the repository organized?

This repository is organized into sections based mainly on the origin the data comes from and the computational platform the notebooks run on.

NCAR HPC workflows (Casper) — notebooks executed on NCAR’s HPC system.
- GDEX / NCAR Data Origin — GDEX data streamed via NCAR’s on-prem OSDF origin.
- Other Data Origins — data streamed from origins like AWS Open Data.
- ML Workflows — machine-learning workflows.
Other Computational Platforms — workflows executed on other HPC and cloud computing platforms.
NDC Workflows — workflows developed as part of the National Discovery Cloud (NDC) Pathfinder initiative.
Scripts — Python scripts and any content that is not a Jupyter notebook.

Access methods

Some notebooks use intake/intake-ESM catalogs in conjunction with PelicanFS to stream data. Others use PelicanFS directly to load data into xarray.

Repository structure

docs/ — introductory markdown files for each section and the notebook gallery.
notebooks/ — all computational workflows archived as Jupyter notebooks.
scripts/ — Python scripts and any content that is not a Jupyter notebook.