The PyReshaper User’s Manual¶
What is it?¶
The PyReshaper is a tool for converting time-slice (or history-file or synoptically) formatted NetCDF files into time-series (or single-field) format.
Requirements¶
The PyReshaper explicitly depends upon the following Python packages:
PyNIO (v1.5+) or netCDF4-python (v1.2+)
ASAPPyTools (v0.4+)
These packages imply a dependency on the NumPy (v1.4+) and mpi4py (v1.3+) packages, and the libraries NetCDF and MPI/MPI-2.
No thorough testing has been done to show whether earlier versions of these dependencies will work with the PyReshaper. The versions listed have been shown to work, and it is assumed that later versions will continue to work.
How can I install it?¶
The easiest way of obtaining the PyReshaper is from the Python Package
Index (PyPI), using pip
:
$ pip install [--user] PyReshaper
Alternatively, you can download the source from GitHub and install with
setuptools
:
$ git clone https://github.com/NCAR/PyReshaper
$ cd PyReshaper
$ python setup.py install [--user]
This will download and install the most recent stable version of the source code. If the most recent version of the non-stable source is desired, you may switch to the development branch before installing.
$ git checkout devel
When installing, the --user
option to either pip
or setup.py
will install the PyReshaper in the user’s private workspace, as defined
by the system on which the user is installing. This is useful if you don’t
have permissions to install system-wide software.
Before Using the PyReshaper¶
If you installed the PyReshaper using pip
, then you should be ready to
go. However, if you using the --user
option, the local install directories
using by pip
may not be in your paths.
First, you must add the installation site-packages directory to your
PYTHONPATH
. If you installed with the --user
option, this means
adding the $HOME/.local/lib/python2.X/site-packages
(on Linux) directory
to your PYTHONPATH
. If you specified a different --prefix
option,
then you must point to that prefix directory. For bash users, this is
done with the following command.
$ export PYTHONPATH=$PYTHONPATH:$PREFIX/lib/python2.X/site-packages
where the $PREFIX
is the root installation directory used when
installing the PyReshaper package ($HOME/.local/
if using the
--user
option on Linux), and the value of X
will correspond to the
version of Python used to install the PyReshaper package.
If you want to use the command-line interface to the PyReshaper, you
must also add the PyReshaper executables directory to your PATH
.
Like for the PYTHONPATH
, this can be done with the following
command.
$ export PATH=$PATH:$PREFIX/bin
How do I use it?¶
Some General Concepts¶
Before we describe the various ways you can use the PyReshaper, we must describe more about what, precisely, the PyReshaper is designed to do.
As we’ve already mentioned, the PyReshaper is designed to convert a set of NetCDF files from time-slice (i.e., synoptic or history-file) format to time-series (or single-field) format, either in serial or parallel. Time-slice files contain all of the model variables in one file, but typically only span 1 or a few time-steps per file. Time-series files nominally contain only 1 single time-dependent variable spanning many time-steps, but they can additionally contain metadata used to describe the single-field variable contained by the file.
In serial, the PyReshaper will write each time-series variable to its own file in sequence. In parallel, time-series variables will be written simultaneously across the MPI processes allocated for the job.
There are a number of assumptions that the PyReshaper makes regarding the time-slice (input) data, which we list below.
Each time-slice NetCDF file has multiple time-dependent variables inside it, but can have many time-independent variables inside it, as well.
Each time-slice NetCDF file contains data for times that do not overlap with each other. (That is, each time-slice NetCDF file can contain data spanning a number of simulation time steps. However, the span of time contained in one time slice cannot overlap the span of time in another time-slice.) If the time-slices overlap, an error will be given and execution will stop.
Every time-slice NetCDF file contains the same time-dependent variables, just at differing times.
Similarly, there are a number of assumptions made about the time-series (output) data produced by the PyReshaper conversion process. The variables written to the output data can be time-series variables or metadata variables. Time-series variables are written to one output file only. Metadata variables are written to all output files.
By default, every time-dependent variable will be assumed to be a time-series variable (i.e., written to its own time-series NetCDF file).
Every time-independent variable that appears in the time-slice files will be assumed to be a metadata variable (i.e., written to every time-series file).
Users can explicitly specify any number of time-dependent variables as metadata variables (e.g., such as
time
itself).Every time-series file written by the PyReshaper will span the total range of time spanned by all time-slice files specified.
Every time-series file will be named with the same prefix and suffix, according to the rule:
time_series_filename = prefix + variable_name + suffix
where the variable_name is the name of the time-series variable associated with that time-series file.
It is important to understand the implications of the last assumption on the list above. Namely, it is important to note what this assumption means in terms of NetCDF file-naming conventions. It is common for the file-name to contain information that pertains to the time-sampling frequency of the data in the file, or the range of time spanned by the time-series file, or any number of other things. To conform to such naming conventions, it may be required that the total set of time-slice files that the user wishes to convert to time-series be given to the PyReshaper in multiple subsets, running the PyReshaper independently on each subset of time-slice files. Throughout this manual, we will refer to such “subsets” as streams. As such, every single PyReshaper operation is designed to act on a single stream.
Using the PyReshaper from the Unix Command-Line¶
While the most flexible way of using the PyReshaper is from within
Python, as described above, the easiest way to use the PyReshaper is usually
to run the PyReshaper command-line utilities. In this section, we describe
how to use the command-line utilities s2smake
and s2srun
, which
provide command-line interfaces (CLI) to the PyReshaper. (These scripts
will be installed in the $PREFIX/bin
directory, where PREFIX
is the
installation root directory. If you installed PyReshaper with the --user
flag, you may need to add this directpry to your path.)
The s2smake
utility is designed to generate a Specifier object file
(specfile) that contains a specification of the PyReshaper job.
The s2srun
utility is then used to run the PyReshaper with the newly
generated specfile.
Below is an example of how to use the PyReshaper’s s2smake
utility,
with all options and parameters specified on the command line.
$ s2smake \
--netcdf_format="netcdf4" \
--compression_level=1 \
--output_prefix="/path/to/outfile_prefix." \
--output_suffix=".000101-001012.nc" \
-m "time" -m "time_bounds" \
--specfile=example.s2s \
/path/to/infiles/*.nc
In this example, you will note that we have specified each
time-dependent metadata variable name with its own -m
option. (In
this case, there are only 2, time
and time_bounds
.) We have also
specified the list of input (time-slice) files using a wildcard, which
the Unix shell fills in with a list of all filenames that match this glob
pattern. In this case, we are specifying all files with the .nc
file
extension in the directory /path/to/infiles
. These command-line options
and arguments specify all of the same input needed to run the PyReshaper.
Running this command will save this PyReshaper specfile in a file called
example.s2s
.
When using glob patterns, it is important to understand that the shell
expands these glob patterns out into the full list of matching filenames
before running the s2smake
command. On many systems, the length of
a shell command is limited to a fixed number of characters, and it is possible
for the glob pattern to expand to a length that makes the command too long
for the shell to execute! If this is the case, you may contain your glob
pattern in quotation marks (i.e., "/path/to/infiles/*.nc"
instead of
/path/to/infiles/*.nc
). The s2smake
command will then expand the
glob pattern internally, allowing you to avoid the command-line character
limit of the system.
With the specfile created and saved using the s2smake
utility,
we can run the PyReshaper with this specfile using the s2srun
utility,
with all options and parameters specified on the command line.
$ s2srun --serial --verbosity=2 example.s2s
The example above shows the execution, in serial, of the PyReshaper job
specified by the example.s2s
specfile with a verbosity
level of 2.
For parallel operation, one must launch the s2srun
script from
the appropriate MPI launcher. On the NCAR Yellowstone system
(yellowstone.ucar.edu
), for example, this is done with the following
command.
$ mpirun.lsf s2srun --verbosity=3 example.s2s
In the above example, this will launch the s2srun
script into
the MPI environment already created by either a request for an
interactive session or from an LSF submission script.
Arguments to the s2smake
Script¶
The arguments to the s2smake
utility are as follows.
--backend BACKEND
(-b BACKEND
): I/O backend to be used when reading or writing from NetCDF files. The parameterBACKEND
can be one of'Nio'
or'netCDF4'
, indicating PyNIO or netCDF4-python, respectively. The default value is'netCDF4'
.--compression_level C
(-c C
): NetCDF compression level, when using the netcdf4 file format, whereC
is an integer between 0 and 9, with 0 indicating no compression at all and 9 indicating the highest level of compression. The default compression level is 1.--netcdf_format NCFORMAT
(-f NCFORMAT
): NetCDF file format to be used for all output files, whereNCFORMAT
can be'netcdf'
,'netcdf4'
, or'netcdf4c'
, indicating NetCDF3 Classic format, NetCDF4 Classic format, or NetCDF4 Classic format with forced compression level 1. The default file format is'netcdf4'
.--metadata VNAME
(-m VNAME
): Indicate that the variableVNAME
should be treated as metadata, and written to all output files. There may be more than one--metadata
(or-m
) options given, each one being added to a list.--meta1d
(-1
): This flag forces all 1D time-variant variables to be treated as metadata. These variables need not be added explicitly to the list of metadata variables (i.e., with the--metadata
or-m
argument). These variables will be added to the list when the PyReshaper runs.--specfile SPECFILE
(-o SPECFILE
): The name of the specfile to write, containing the specification of the PyReshaper job. The default specfile name is'input.s2s'
.--output_prefix PREFIX
(-p PREFIX
): A string specifying the prefix to be given to all output filenames. The output file will be named according to the rule:output_prefix + variable_name + output_suffix
The default output filename prefix is
'tseries.'
.--output_suffix SUFFIX
(-s SUFFIX
): A string specifying the suffix to be given to all output filenames. The output file will be named according to the rule:output_prefix + variable_name + output_suffix
The default output filename suffix is
'.nc'
.--time_series VNAME
: Indicate that only the namedVNAME
variables should be treated as time-series variables and extracted into their own time-series files. This option works like the--metadata
option, in that multiple occurrences of this option can be used to extract out only the time-series variables given. If any variable names are given to both the--metadata
and--time_series
options, then the variable will be treated as metadata. If the--time_series
option is not used, then all time-dependent variables that are not specified to be metadata (i.e., with the--metadata
option) will be treated as time-series variables and given their own output file. NOTE: If you use this option, data can be left untransformed from time-slice to time-series output! DO NOT DELETE YOUR OLD TIME-SLICE FILES!
Each input file should be listed in sequence, space separated, on the command line to the utility, nominally after all other options have been specified.
Arguments to the s2srun
Script¶
While the basic options shown in the previous examples above are sufficient for most purposes, additional options are available.
--chunk NAME,SIZE
(-c NAME,SIZE
): This command-line option can be used to specify a maximum read/write chunk-sizeSIZE
along a given named dimensionNAME
. Multiple--chunk
options can be given to specify chunk-sizes along multiple dimensions. This option determines the size of the data “chunk” read from a single input file (and then written to an output file). If the chunk-size is greater than the given dimension size, then the entire dimension will be read at once. If the chunk-size is less than the given dimension size, then all variables that depend on that dimension will be read in multiple parts, each “chunk” being written before the next is read. This can be important to control memory use. By default, chunking is done over the unlimited dimension with a chunk-size of 1.--limit L
(-l L
): This command-line option can be used to set theoutput_limit
argument of the PyReshaperconvert()
function, described below. This can be used when testing to only output the firstL
files. The default value is 0, which indicates no limit (normal operation).--write_mode M
(-m M
): This command-line option can be used to set thewmode
output file write-mode parameter of thecreate_reshaper()
function, described below. The default write mode is'w'
, which indicates normal writing, which will error if the output files already exists (i.e., no overwriting). Other options are'o'
to overwrite existing output files,'s'
to skip existing output files,'a'
to append to existing output files.--serial
(-s
): If this flag is used, it will run the PyReshaper in serial mode. By default, it will run PyReshaper in parallel mode.--verbosity V
(-v V
): Sets the verbosity level for standard output from the PyReshaper. A level of 0 means no output, and a value of 1 or more means increasingly more output. The default verbosity level is 1.
Nominally, the last argument given to the s2srun
utility should be the name
of the specfile to run.
Using the PyReshaper from within Python¶
Obviously, one of the advantages of writing the PyReshaper in Python is that it is easy to import features (modules) of the PyReshaper into your own Python code, as you might link your own software tools to an external third-party library. The library API for the PyReshaper is designed to be simple and light-weight, making it easy to use in your own Python tools or scripts.
Below, we show an example of how to use the PyReshaper from within Python to convert a stream from time-slice format to time-series format.
from pyreshaper import specification, reshaper
# Create a Specifier object
specifier = specification.create_specifier()
# Specify the input needed to perform the PyReshaper conversion
specifier.input_file_list = [ "/path/to/infile1.nc", "/path/to/infile2.nc", ...]
specifier.netcdf_format = "netcdf4"
specifier.compression_level = 1
specifier.output_file_prefix = "/path/to/outfile_prefix."
specifier.output_file_suffix = ".000101-001012.nc"
specifier.time_variant_metadata = ["time", "time_bounds"]
specifier.exclude_list = ['HKSAT','ZLAKE']
# Create the PyReshaper object
rshpr = reshaper.create_reshaper(specifier,
serial=False,
verbosity=1,
wmode='s')
# Run the conversion (slice-to-series) process
rshpr.convert()
# Print timing diagnostics
rshpr.print_diagnostics()
In the above example, it is important to understand the input given to the PyReshaper. Namely, all of the input for this single stream is contained by a single instantiation of a Specifier object (the code for which is defined in the specification module). We will describe each attribute of the Specifier object below.
Specifier Object Attributes¶
input_file_list
: This specifies a list of input (time-slice) file paths that all conform to the input file assumptions (described above). The list of input files need not be time-ordered, as the PyReshaper will order them appropriately. (This means that this list can easily be generated by using filename globs.)
In the example above, each file path is full and absolute, for safety’s sake.
netcdf_format
: This is a string specifying what NetCDF format will be used to write the output (time-series) files. Acceptable options fornetcdf_format
are:"netcdf"
for NetCDF3 format,"netcdf4"
for NetCDF4 Classic format, and"netcdf4c"
for NetCDF4 Classic with level-1 compression.compression_level
: This is an integer specifying the level of compression to use when writing the output files. This can be a number from 0 to 9, where 0 means no compression (default) and 9 mean the highest level of compression. This is overridden when the"netcdf4c"
format is used, where it is forced to be 1.
In the above example, NetCDF4 Classic format is used for the output files,
with level-1 compression. The "netcdf4c"
option can be used as a
short-hand notation for this combination of netcdf_format
and
compression_level
options.
output_file_prefix
: This is a string specifying the common output (time-series) filename prefix. It is assumed that each time-series file will be named according to the rule:filename = output_file_prefix + variable_name + output_file_suffix
output_file_suffix
: This is a string specifying the common output (time-series) filename suffix. It is assumed that each time-series file will be named according to the above rule.
It is important to understand, as in the example above, that the prefix can include the full, absolute path information for the output (time-series) files.
time_variant_metadata
: This specifies a list of variable names corresponding to variables that should be written to every output (time-series) NetCDF file. Nominally, this should specify only the time-variant (time-dependent) variables that should not be treated as time-series variables (i.e., treated as metadata), since all time-invariant (time-independent) variables will be treat as metadata automatically.assume_1d_time_variant_metadata
: If set toTrue
, this indicates that all 1D time-variant variables (i.e., variables that only depend upontime
) should be added to the list oftime_variant_metadata
when the Reshaper runs. The default for this option isFalse
.time_series
: If set to a list of string variable names, only these variable names will be transformed into time-series format. This is equivalent to the--time_series
option to thes2smake
utility. NOTE: Setting this attribute can leave data untransformed from time-slice to time-series format! DO NOT DELETE YOUR OLD TIME-SLICE FILES!backend
: This specifies which I/O backend to use for reading and writing NetCDF files. The default backend is'netCDF4'
, but the user can alternatively specify'Nio'
to use PyNIO.exclude_list
: If set to a list of string time invariant variable names, these variables will not be included in each of the timeseries files. NOTE: Setting this attribute can leave data untransformed from time-slice to time-series format! DO NOT DELETE YOUR OLD TIME-SLICE FILES!
Specifier Object Methods¶
In addition to the attributes above, the Specifier objects have some useful methods that can be called.
validate()
: Calling this function validates the attributes of the Specifier, making sure their types and values appear correct.write(filename)
: Calling this function with the argumentfilename
will write the specfile matching the Specifier.
Specfiles¶
Specfiles are simply pickled Specifier objects written to a file. To
create a specfile, one can simply call the Specifier’s write()
method,
described above, or one can explicitly pickle the Specifier directly, as
shown below.
import pickle
# Assume "spec" is an existing Specifier instance
pickle.dump(spec, open("specfile.s2s", "wb"))
This is equivalent to the call spec.write('specfile.s2s')
.
A specfile can be read with the following Python code.
import pickle
spec = pickle.load( open("specfile.s2s", "rb") )
Arguments to the create_reshaper()
Function¶
In the example above, the PyReshaper object (rshpr) is created by
passing the single Specifier instance to the factory function
create_reshaper()
. This function returns a PyReshaper object that has
the functions convert()
and print_diagnostics()
that perform the
time-slice to time-series conversion step and print useful timing
diagnostics, respectively.
In addition to the Specifier instance, the create_reshaper()
function
takes the following parameters.
serial
: This is a boolean flag, which can beTrue
orFalse
, indicating whether the PyReshaperconvert()
step should be done in serial (True
) or parallel (False
). By default, parallel operation is assumed if this parameter is not specified.verbosity
: This is an integer parameter that specifies what level of output to produce (tostdout
) during theconvert()
step. A verbosity level of0
means that no output will be produced, while an increasing vebosity level producing more and more output. Currently, a level of2
produces the most output possible.verbosity = 0
: This means that no output will be produced unless specifically requested (i.e., by calling theprint_diagnostics()
function).verbosity = 1
: This means that only output that would be produced by the head rank of a parallel process will be generated.verbosity = 2
: This means that all output from all processors will be generated, but any output that is the same on all processors will only be generated once.
wmode
: This is a single-character string that can be used to set the write mode of the PyReshaper. By default, the PyReshaper will not overwrite existing output files, if they exist. In normal operation, this means the PyReshaper will error (and stop execution) if output files are already present. This behavior can be controlled with thewmode
parameter. Thewmode
parameter can be set to any of the following.wmode = 'w'
: This indicates that normal write operation is to be performed. That is, the PyReshaper will error and stop execution if it finds output files that already exist. This is the default setting.wmode = 's'
: This indicates that the PyReshaper should skip generating time-series files for output files that already exist. No check is done to see if the output files are correct.wmode = 'o'
: This indicates that the PyReshaper should overwrite existing output files, if present. In this mode, the existing output files will be deleted before running the PyReshaper operation.wmode = 'a'
: This indicates that the PyReshaper should append to existing output files, if present. In this mode, it is assumed that the existing output files have the correct format before appending new data to them.
simplecomm
: This option allows the user to pass anASAPPyTools
SimpleComm
instance to the PyReshaper, instead of having the PyReshaper create its own internally. TheSimpleComm
object is the simple MPI communication object used by the PyReshaper to handle its MPI communication. By default, the PyReshaper will create its own SimpleComm that uses the MPICOMM_WORLD
communicator for communication. However, the user may create their ownSimpleComm
object and force the PyReshaper to use it by setting this option equal to the user-createdSimpleComm
instance.
Arguments to the convert()
Function¶
While not shown in the above examples, there are named arguments that can
be passed to the convert()
function of the Reshaper object.
output_limit
: This argument sets an integer limit on the number of time-series files generated during theconvert()
operation (per MPI process). This can be useful for debugging purposes, as it can greatly reduce the length of time consumed in theconvert()
function. A value of0
indicates no limit, or all output files will be generated.chunks
: This argument sets a dictionary of dimension names to chunk-sizes. This is equivalent to the--chunk
command-line option tos2srun
. This option determines the size of the data “chunk” read from a single input file (and then written to an output file) along each given dimension. If a chunk-size is greater than the given dimension size, then the entire dimension will be read at once. If a chunk-size is less than the given dimension size, then all variables that depend on that dimension will be read in multiple parts, each “chunk” being written before the next is read. This can be important to control memory use. By default, thechunks
parameter is equal toNone
, which means chunking is done over the unlimited dimension with a chunk-size of 1.
Obtaining Best Performance with the PyReshaper¶
While the PyReshaper can be run in either serial or parallel, best performance is almost always achieved by running in parallel. Understanding how the PyReshaper operates, however, is important to knowing how to get the best performance.
Of critical importance to understanding this, one must appreciate the fact that the PyReshaper only parallelizes over time-series (output) variables. Or, in other words, it parallelizes over output files, since each time-series variable is written to its own file. Thus, the maximum amount of parallelism in the PyReshaper equal to the number of time-series variables in the input dataset. If 10 time-series variables exist in the input dataset, then the maximum performance will be achieved by running the job with 10 MPI processes.
Unfortunately, that is not all that needs to be appreciated, because there are many factors that can impact performance.
I/O Nodes¶
Similar limitations usually apply to I/O (reading/writing data) operations, of which the PyReshaper is one. The PyReshaper does very little computation on the CPU, and almost all of its operation time is dominated by I/O. Unfortunately, most systems have serial I/O from all MPI processes on the same CPU (or node). Hence, while a multicore CPU can efficiently execute as many MPI processes as cores on the CPU for computation, this may not be true for I/O. To prevent overloading the node’s I/O capabilities, it may be necessary to run fewer PyReshaper processes per node than there are available cores.
This is a parameter that is hard to get a feel for, so it is best to see how performance varies on the system you are using. In general, though, using the maximum number of processes per node will saturate the I/O capabilities of the node, so using fewer processes per node may improve conversion speeds.