VAPOR Data Collection

VDC File Structure

Before attempting to convert your data to a VDC, a basic understanding of the VDC file structure is helpful. The components of a VDC are distributed across several disk files. All metadata for a VDC are stored in a single .vdf file. A .vdf file is encoded in XML and can easily be browsed, or even edited, with widely available XML editors and browsers (most web browsers will read and display XML).

Field data are stored in netCDF data files as coefficients output from a user-directed wavelet transformation process. Two types of VDC exist: type I and type II.

VDC Type I

In a type I data collection, each netCDF file contains the wavelet coefficients associated with a single wavelet transformation pass applied to a single field variable at a single time step. For example, applying two transformation passes to the variable vx from the first time step in a data collection would generate three netCDF files with the extensions .nc0, .nc1, and .nc2. The first file, .nc0, contains the wavelet coefficients necessary to reconstruct vx at its coarsest approximation level (1/4 the original grid resolution along each coordinate axis). The .nc1 file provides the wavelet coefficients necessary to reconstruct vx at 1/2 the original resolution, and the .nc2 file provides those needed to reconstruct it at full resolution.
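The relationship between the number of transformation passes, the file extensions, and the available resolutions can be sketched in a few lines of Python. This is only an illustration: the function name type1_levels is hypothetical and not part of VAPOR, and rounding up when a dimension is not evenly divisible is an assumption, not documented VAPOR behavior.

```python
def type1_levels(dims, num_transforms):
    """List (extension, fraction of full resolution, approximate grid dims)
    for each approximation level of a type I collection.

    Hypothetical helper for illustration only; not part of VAPOR.
    """
    levels = []
    for level in range(num_transforms + 1):
        # Each transformation pass halves the resolution along every axis.
        shrink = 2 ** (num_transforms - level)
        # Assumption: round up when a dimension is not divisible by shrink.
        approx = tuple(-(-d // shrink) for d in dims)
        levels.append((f".nc{level}", 1 / shrink, approx))
    return levels


levels = type1_levels((512, 512, 256), num_transforms=2)
for ext, fraction, approx in levels:
    print(ext, fraction, approx)
```

With two passes on a 512x512x256 grid, the sketch lists .nc0 at 1/4 resolution, .nc1 at 1/2 resolution, and .nc2 at full resolution, matching the vx example above.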

VDC Type II

A type II data collection is more complicated than a type I collection, but offers higher quality compression for a given storage budget. In a type II collection, field data again undergo a wavelet transformation. The resulting wavelet coefficients are then sorted into a small number of ordered groups. The original data can be exactly reconstructed (up to floating point round-off) from the coefficients in all the groups, or an approximation of the data can be generated from a subset of the groups. As with type I collections, the lowest-order group, which carries the most information, is stored in a netCDF file with the extension .nc0. The group carrying the next most information is stored in a netCDF file with the extension .nc1, and so on.
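The idea of sorting coefficients into ordered groups can be illustrated with a small, self-contained Python sketch. It is not VAPOR's implementation: the orthonormal Haar wavelet is used only as a simple stand-in, the signal is one-dimensional, the equal-sized groups are an assumption, and every function name is hypothetical.

```python
import math


def haar_forward(values):
    """Full orthonormal Haar transform (length must be a power of two)."""
    coeffs = list(values)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diff = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = avg + diff
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Invert haar_forward."""
    values = list(coeffs)
    n = 1
    while n < len(values):
        merged = []
        for a, d in zip(values[:n], values[n:2 * n]):
            merged += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
        values[:2 * n] = merged
        n *= 2
    return values


def group_by_magnitude(coeffs, num_groups):
    """Split coefficient indices into ordered groups, largest magnitudes first.

    Assumption: equal-sized groups; a real collection may size them differently.
    """
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    size = -(-len(order) // num_groups)
    return [order[g * size:(g + 1) * size] for g in range(num_groups)]


def reconstruct(coeffs, groups, groups_kept):
    """Rebuild the signal from the first groups_kept groups; the rest are zeroed."""
    kept = [0.0] * len(coeffs)
    for group in groups[:groups_kept]:
        for i in group:
            kept[i] = coeffs[i]
    return haar_inverse(kept)


def l2_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
coeffs = haar_forward(signal)
groups = group_by_magnitude(coeffs, num_groups=4)
errors = [l2_error(signal, reconstruct(coeffs, groups, k)) for k in range(1, 5)]
print(errors)
```

Each additional group can only lower the L2 error, and using all four groups recovers the signal up to floating point round-off. Group 0 plays the role of the .nc0 file, group 1 of the .nc1 file, and so on.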

VDC Type I or II?

If there are missing values in your dataset and you want to use more than one refinement level, you should use VDC type II, because VAPOR does not support missing values at higher refinement levels. This matters for all of the ocean models supported by VAPOR (ROMS, POP, MOM).

There are tradeoffs when deciding whether to use a type I or type II VDC encoding. Each available approximation in a type I VDC corresponds to a level in a hierarchical representation of the sampling grid. Sample values at successively coarser grids are constructed by averaging neighboring grid points from parent grids. Because it contains fewer grid points, a coarsened grid requires less space (both on disk and in memory), less I/O bandwidth, and less computation when any data processing operators are applied. A type II collection examines each wavelet coefficient and groups it based on its information content in terms of the L2 error norm. Thus, for a given number of wavelet coefficients, the type II representation is guaranteed to provide the highest quality reconstruction based on the L2 error. However, type II collections require storage overhead to address the wavelet coefficients. Hence, a type II VDC will be larger than a type I VDC if all coefficients are kept.
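The grid hierarchy behind a type I collection can be pictured with a one-dimensional sketch in which each coarser grid is built by averaging pairs of neighboring samples from its parent. The pairwise average and the function names are illustrative assumptions; the actual wavelet filters used by VAPOR may weight neighbors differently.

```python
def coarsen(samples):
    """Average neighboring pairs to build the next coarser grid.

    Assumption: the sample count is even; a trailing odd sample is dropped.
    """
    return [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples) - 1, 2)]


def hierarchy(samples, num_transforms):
    """Return the approximations ordered from coarsest to finest."""
    levels = [list(samples)]
    for _ in range(num_transforms):
        levels.append(coarsen(levels[-1]))
    return levels[::-1]


grids = hierarchy([1, 3, 5, 7, 2, 4, 6, 8], num_transforms=2)
for level, grid in enumerate(grids):
    print(level, len(grid), grid)
```

Each coarser level holds half as many samples along the axis, which is where the savings in storage, I/O bandwidth, and computation come from.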

A VDC, type I or II, is considered valid even if finer approximation levels are missing. In the above examples, the user may choose to store the .nc2 coefficients offline in order to save space. Furthermore, a VDC is valid even if entire time steps or variables are not present on disk. The goal of supporting incomplete VDCs is to give the user the flexibility needed to manage very large data sets. Hence, a minimal valid VDC consists only of a metadata .vdf file and no field data.

Generating a VDC

The process of creating a VDC is straightforward and will be explained in detail in subsequent chapters:

  1. Generate a .vdf file defining the number and name of variables, number of time steps, and resolution of each volume in a data set, as well as the number of wavelet transformations to apply.
  2. Translate raw data volumes into wavelet-transformed coefficients.

The first step is performed once for a VDC. The number of variables, the number of time steps, and the resolution are all determined by the data itself. The number of wavelet transforms is a user option that determines how many field data approximations, and at what resolutions, will be available for subsequently transformed data. Specifying a value of zero implies no transformations, and the data will only be available at full resolution. A value of one implies a single transformation; the data will be available at full and half resolution, and so on. Step two may be performed repeatedly, as and when needed, until the VDC is fully populated as defined by the associated .vdf file.

VDC Metadata (.vdf file)

Associated with each VDC is a single metadata file describing the contents of the VDC. A listing of the essential elements of a .vdf file appears below:

Dimensions: This element defines the dimensions of the 3D rectilinear grid for all sampled data in the VDC. All variables across all time steps must be sampled on the same grid.

Extents: This element defines the coordinates in a user defined coordinate system of the smallest bounding box enclosing the sampled grid. For rectilinear data, the most common case, the extents are the coordinates of the first and last grid point in the data set. Accurate setting of user extents is essential for proper operation of some analysis tasks, such as flow visualization.
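To make the role of the extents concrete, the sketch below maps a grid index to user coordinates by interpolating between the first and last grid points. It assumes uniform spacing along each axis and an extents tuple ordered as (xmin, ymin, zmin, xmax, ymax, zmax); both are assumptions made for illustration, and the function names are hypothetical.

```python
def grid_coordinate(index, dim, extent_min, extent_max):
    """User coordinate of one grid index along one axis (uniform spacing assumed)."""
    if dim == 1:
        return extent_min
    return extent_min + (extent_max - extent_min) * index / (dim - 1)


def grid_point(ijk, dims, extents):
    """User coordinates of grid point (i, j, k).

    Assumed extents layout: (xmin, ymin, zmin, xmax, ymax, zmax).
    """
    mins, maxs = extents[:3], extents[3:]
    return tuple(
        grid_coordinate(i, d, lo, hi)
        for i, d, lo, hi in zip(ijk, dims, mins, maxs)
    )


dims = (5, 5, 3)
extents = (0.0, 0.0, 0.0, 1.0, 2.0, 4.0)
print(grid_point((0, 0, 0), dims, extents))
print(grid_point((2, 1, 1), dims, extents))
print(grid_point((4, 4, 2), dims, extents))
```

The first and last grid points land exactly on the extents, which is why inaccurate extents distort any analysis that depends on physical distances, such as flow visualization.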

VariableNames: This element provides a textual name for each of the variables stored in the VDC.

UserTime: The user time specifies the time, in user coordinates, to be associated with each time step. Accurate user times are essential for some VAPOR analysis operations such as particle advection.

Num Transforms (VDC Type I): This element specifies the depth of the multi-resolution hierarchy for VDC type I data.

Compression Ratio (VDC Type II): This element specifies the compression ratios available for VDC type II data.

The sections that follow describe the various tools available for importing and exporting raw data to/from a VDC.