Skip to article frontmatterSkip to article content

Access NOAA’s SONAR data hosted on AWS via OSDF protocol

Overview

Import package, define parameters and functions

# Display output of plots directly in Notebook
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

import intake
import numpy as np
import pandas as pd
import xarray as xr
import s3fs
import seaborn as sns
import re
import nest_asyncio
nest_asyncio.apply()
from pelicanfs.core import PelicanFileSystem, PelicanMap,OSDFFileSystem 
import fsspec.implementations.http as fshttp
import dask 
from dask_jobqueue import PBSCluster
from dask.distributed import Client
from dask.distributed import performance_report

Use OSDF protocol to access data

Set up osdf url to use with PelicanFS

  • We should one of the two pelicanFS FSSpec protocols (‘osdf’ or ‘pelican’) instead of the https protocol.
  • So, the urls will look like: osdf_discovery_url + namespace prefix/aws_region/bucket_name/path to file or object
  • osdf_discovery_url = osdf:///, namespace_prefix = aws-opendata
  • aws_region = us-east-1, bucket_name = ‘noaa-wcsd-zarr-pds’
  • zarr_store = ‘level_2/Bell_M._Shimada/SH1507/EK60/SH1507.zarr’
  • An example url:
  • osdf:///us-east-1/aws-opendata/noaa-wcsd-zarr-pds/evel_2/Bell_M._Shimada/SH1507/EK60/SH1507.zarr
%%time
osdf_fs = OSDFFileSystem() # OSDFFileSystem is already aware of the osdf discovery url
CPU times: user 1.45 ms, sys: 0 ns, total: 1.45 ms
Wall time: 1.48 ms
%%time
# Compose url using s3 protocol
bucket_name = 'noaa-wcsd-zarr-pds'

ship_name   = 'Bell_M._Shimada'
cruise_name = 'SH1507'
sensor_name = 'EK60'

zarr_store = f'{cruise_name}.zarr'

# How an s3 path would look 
s3_zarr_store_path = f"{bucket_name}/level_2/{ship_name}/{cruise_name}/{sensor_name}/{zarr_store}"
print(s3_zarr_store_path)
noaa-wcsd-zarr-pds/level_2/Bell_M._Shimada/SH1507/EK60/SH1507.zarr
CPU times: user 54 μs, sys: 0 ns, total: 54 μs
Wall time: 56 μs
# Compose url using OSDF protocol
namespace_prefix = '/aws-opendata' #In this current version of pelfs, this slash is necessary
aws_region       = 'us-east-1'
osdf_path        = f"{namespace_prefix}/{aws_region}/{s3_zarr_store_path}"
print(osdf_path)

# This just prepends the osdf_disovery_url to the osdf_path
osdf_zarr_store_path = PelicanMap(osdf_path, osdf_fs)
/aws-opendata/us-east-1/noaa-wcsd-zarr-pds/level_2/Bell_M._Shimada/SH1507/EK60/SH1507.zarr

Opening a Zarr store with Xarray

We will open the S3 Zarr store using Xarray (the “consolidated” parameter just defines whether we are interested in reading the metadata in a consolidated manner, “None” is the desired value).

# %%time
# # First use s3
# s3_file_system = s3fs.S3FileSystem(anon=True)
# store          = s3fs.S3Map(root=s3_zarr_store_path, s3=s3_file_system, check=False)
# cruise         = xr.open_zarr(store=store, consolidated=None)
%%time
# Now use OSDF protocol: 
cruise_osdf = xr.open_zarr(osdf_zarr_store_path,consolidated = None)
cruise_osdf
Loading...

Load data and plot echogram

Subsetting on one hour worth of data near an area of interest. We’ll define select_times from 19:00 to 20:00 on July 19th 2015.

# roughly covering the raw files from SaKe2015-D20150719-T193412.raw to SaKe2015-D20150719-T195842.raw
start_time = np.datetime64('2015-07-19T19:00:00')
end_time = np.datetime64('2015-07-19T20:00:00')

select_times = (cruise_osdf.time > start_time) & (cruise_osdf.time < end_time)

We can then use this slice of select_times to select the Sv values for the specific subset of time. Additional subsetting can also be added to the query, for example we can additionally subset by defining a selected frequency. The example below selects 18 kHz for the time interval of interest and store this in Sv_18 (drop=True means that the coordinate labels that do not meet the condition are not just masked but also dropped from the result)

Sv_38 = cruise_osdf.sel(frequency=38000).where((select_times), drop=True).Sv
print(Sv_38)
<xarray.DataArray 'Sv' (depth: 2800, time: 2959)> Size: 33MB
dask.array<where, shape=(2800, 2959), dtype=float32, chunksize=(512, 512), chunktype=numpy.ndarray>
Coordinates:
  * depth      (depth) float64 22kB nan 0.25 0.5 0.75 ... 699.5 699.8 700.0
    frequency  float32 4B 3.8e+04
  * time       (time) datetime64[ns] 24kB 2015-07-19T19:08:39.063712768 ... 2...
Attributes:
    long_name:  Volume backscattering strength (Sv re 1 m-1)
    units:      dB
%%time
Sv_38.dropna(dim='depth').plot(yincrease=False, cmap='jet', vmin=-80, vmax=-30)

plt.title(f"Sv 38 kHz")
CPU times: user 838 ms, sys: 146 ms, total: 984 ms
Wall time: 1.4 s
<Figure size 640x480 with 2 Axes>

The data in the plot above shows a detected bottom as a ship crosses above a smooth ridge. Beneath the seabed are two harmonics we would like to omit from the plot. We can begin to do that by just plotting data between 0 and 500 meters of depth.

start_depth = 0
end_depth = 500

select_depths = (cruise_osdf.depth > start_depth) & (cruise_osdf.depth < end_depth)
Sv_38_subset = Sv_38.where((select_depths), drop=True)

Now when we plot the data it is subsetted to less than 500 meters of depth.

%%time
Sv_38_subset.plot(yincrease=False, cmap='jet', vmin=-80, vmax=-30)
plt.title(f"Sv 38 kHz")
CPU times: user 607 ms, sys: 48.9 ms, total: 656 ms
Wall time: 816 ms
<Figure size 640x480 with 2 Axes>

We can also select the corresponding data for 120 and 200 kHz and store them in seperate variables

Sv_18 = cruise_osdf.sel(frequency=18000).where((select_times) & (select_depths), drop=True).Sv
Sv_120 = cruise_osdf.sel(frequency=120000).where((select_times) & (select_depths), drop=True).Sv
Sv_200 = cruise_osdf.sel(frequency=200000).where((select_times) & (select_depths), drop=True).Sv

For comparison, let’s see the 120 kHz echogram with the same depth bounds.

Sv_120_subset = Sv_120.where((select_depths), drop=True)
%%time
Sv_120_subset.plot(yincrease=False, cmap='jet', vmin=-80, vmax=-30)
plt.title(f"Sv 120 kHz")
CPU times: user 587 ms, sys: 68.2 ms, total: 656 ms
Wall time: 775 ms
<Figure size 640x480 with 2 Axes>