PIO  2.5.4
testpio: a regression and benchmarking code

The testpio directory, included with the release package, tests both the accuracy and performance of reading and writing data using the pio library.

The testpio directory contains three Perl scripts that you can use to build and run the testpio.F90 code.

  • testpio_build.pl - builds the pio, timing, and testpio libraries and executables
  • testpio_bench.pl - sets up, builds, and runs a user-specified set of test suites using testpio and generates log files with benchmarking information
  • testpio_run.pl - sets up, builds, and runs a user-specified set of test suites using testpio

Additional C shell script wrappers are packaged with the testpio suite to allow environment customization of the three Perl scripts listed above. The following help information describes in more detail how the testpio code works.

The tests are controlled via a namelist. Sample namelist files are located in the testpio/namelists directory, which contains a set of general namelists as well as specific namelists that set up a computational decomposition and an IO decomposition. The computational decomposition should be set up to duplicate a realistic model data decomposition. The IO decomposition is an intermediate decomposition that provides compatibility between a relatively arbitrary computational decomposition and the MPI-IO, netcdf, pnetcdf, or other IO layers. Depending on the IO methods used, only certain IO decompositions are valid. In general, the IO decomposition is not specified by the user and is set internally, but in some cases it can be specified explicitly, and it impacts IO performance.

The namelist input file is called "testpio_in". The first namelist block, io_nml, contains some general settings:

namelist io_nml
casename string, user defined test case name
nx_global integer, global size of "x" dimension
ny_global integer, global size of "y" dimension
nz_global integer, global size of "z" dimension
ioFMT string, type and i/o method of data file ("bin","pnc","snc"), binary, pnetcdf, or serial netcdf
rearr string, type of rearranging to be done ("none","mct","box","boxauto")
nprocsIO integer, number of IO processors; used only when rearr is not "none". If rearr is "none", the IO decomposition will be the computational decomposition
base integer, base pe associated with nprocsIO striding
stride integer, the stride of io pes across the global pe set. A stride=-1 directs PIO to calculate the stride automatically.
num_aggregator integer, mpi-io number of aggregators, only used if no pio rearranging is done
dir string, directory to write output data, this must exist before the model starts up
num_iodofs integer, tests either the 1dof or 2dof init decomp interfaces (1,2)
maxiter integer, the number of trials for the test
DebugLevel integer, sets the debug level (0,1,2,3)
compdof_input string, setting of the compDOF ('namelist' or a filename)
compdof_output string, whether the compDOF is saved to disk ('none' or a filename)
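
For reference, an io_nml block in testpio_in might look like the following sketch; the values are illustrative only, not a recommended configuration:

 &io_nml
   casename       = 'test01'    ! user defined name for this run
   nx_global      = 360         ! global grid sizes
   ny_global      = 180
   nz_global      = 1
   ioFMT          = 'pnc'       ! write with pnetcdf
   rearr          = 'box'       ! box rearrangement
   nprocsIO       = 4           ! number of IO tasks
   base           = 0
   stride         = -1          ! let PIO compute the stride
   num_aggregator = 1
   dir            = './'        ! must exist before the run
   num_iodofs     = 1
   maxiter        = 10
   DebugLevel     = 0
   compdof_input  = 'namelist'
   compdof_output = 'none'
 /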

Notes:

  • the "mct" rearr option is not currently available
  • if rearr is set to "none", then the computational decomposition is also going to be used as the IO decomposition. The computation decomposition must therefore be suited to the underlying I/O methods.
  • if rearr is set to "box", then pio is going to generate an internal IO decomposition automatically and pio will rearrange to that decomp.
  • num_aggregator is used with mpi-io and no pio rearranging. mpi-io is only used with binary data.
  • the nprocsIO, base, and stride implementation has some special options (sketched in code after this list)
    • if nprocsIO > 0 and stride > 0, then use input values
    • if nprocsIO > 0 and stride <= 0, then stride=(npes-base)/nprocsIO
    • if nprocsIO <= 0 and stride > 0, then nprocsIO=(npes-base)/stride
    • if nprocsIO <= 0 and stride <= 0, then nprocsIO=npes, base=0, stride=1
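
The same rules expressed as a Fortran sketch; the routine name and interface are hypothetical, and this mirrors the list above rather than the actual testpio source:

 subroutine set_io_procs(npes, nprocsIO, base, stride)
   ! Illustrative restatement of the nprocsIO/base/stride rules;
   ! not the actual testpio implementation.
   integer, intent(in)    :: npes
   integer, intent(inout) :: nprocsIO, base, stride
   if (nprocsIO > 0 .and. stride > 0) then
     ! use the input values as given
   else if (nprocsIO > 0) then      ! stride <= 0
     stride = (npes - base) / nprocsIO
   else if (stride > 0) then        ! nprocsIO <= 0
     nprocsIO = (npes - base) / stride
   else                             ! nprocsIO <= 0 and stride <= 0
     nprocsIO = npes
     base     = 0
     stride   = 1
   end if
 end subroutine set_io_procs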

Two other namelist blocks exist to describe the computational and IO decompositions: compdof_nml and iodof_nml. These namelist blocks are identical in use.

namelist compdof_nml - or - iodof_nml
nblksppe integer, sets the number of blocks desired per pe; the default is one block per pe for automatic decomposition. Increasing this increases the flexibility of decompositions.
grdorder string, sets the gridcell ordering within the block ("xyz","xzy","yxz","yzx","zxy","zyx")
grddecomp string, sets up the block size with gdx, gdy, and gdz, see below, ("x","y","z","xy","xye","xz","xze","yz","yze", "xyz","xyze","setblk")
gdx integer, "x" size of block
gdy integer, "y" size of block
gdz integer, "z" size of block
blkorder string, sets the block ordering within the domain ("xyz","xzy","yxz","yzx","zxy","zyx")
blkdecomp1 string, sets up the block / processor layout within the domain with bdx, bdy, and bdz, see below. ("x","y","z","xy","xye","xz","xze","yz","yze","xyz","xyze", "setblk","cont1d","cont1dm")
blkdecomp2 string, provides an additional option applied to the block decomp after blkdecomp1 is computed ("","ysym2","ysym4")
bdx integer, "x" numbers of contiguous blocks
bdy integer, "y" numbers of contiguous blocks
bdz integer, "z" numbers of contiguous blocks
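
These settings appear in testpio_in as a namelist block. As a sketch, the first decomposition example at the end of this section would be written as follows (compdof_nml shown; iodof_nml takes the same form):

 &compdof_nml
   nblksppe   = 1
   grdorder   = 'xyz'
   grddecomp  = 'xy'
   gdx        = 0
   gdy        = 0
   gdz        = 0
   blkorder   = 'xyz'
   blkdecomp1 = 'xy'
   blkdecomp2 = ''
   bdx        = 0
   bdy        = 0
   bdz        = 0
 /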

A description of the decomposition implementation and some examples are provided below.

Testpio writes several outputs: summary information to stdout, data files to the namelists directory, and a netcdf file summarizing the decompositions. The key output, including the timing information, is written to stdout. The netcdf file, called gdecomp.nc, records both the block and task ids for each gridcell as computed by the decompositions. Finally, foo.* data files are written by testpio using the specified IO methods.

Currently, the timing information is limited to the high-level pio read/write calls, which generally include copy and rearrange overhead as well as actual I/O time. Additional timers will be added in the future.

The test script is called testpio_run.pl; it uses the hostname function to determine the platform. New platforms can be added by editing the files build_defaults.xml and Utils.pm. If more than one configuration should be tested on a single platform, you can provide two hostnames in this file and specify the name to test via the --host option to testpio_run.pl.

There are several testpio_in files for the pio test suite, each testing something specific. In general, these tests include:

  • sn = serial netcdf and no rearrangement
  • sb = serial netcdf and box rearrangement
  • pn = parallel netcdf and no rearrangement
  • pb = parallel netcdf and box rearrangement
  • bn = binary I/O and no rearrangement
  • bb = binary I/O and box rearrangement

The test number (01, etc) is consistent across I/O methods:

  • 01 = all data on root pe, only root pe active in I/O
  • 02 = simple 2d xy decomp across all pes with all pes active in I/O
  • 03 = all data on root pe, all pes active in I/O
  • 04 = simple 2d xy decomp with yxz ordering and stride=4 pes active in I/O
  • 05 = 2d xy decomp with 4 blocks/pe, yxz ordering, xy block decomp, and stride=4 pes active in I/O
  • 06 = 3d xy decomp with 4 blocks/pe, yxz ordering, xy block decomp, and stride=4 pes active in I/O
  • 07 = 3d xyz decomp with 16 blocks/pe, yxz ordering, xyz block decomp with block yzx ordering and stride=4 pes active in I/O
  • 08 = 2d xy decomp with 4 blocks/pe, yxz grid ordering, yxz block ordering, and cont1d block decomp

The rd01 and wr01 tests are distinct and test writing, reading, and use of DOF data via pio methods.

PIO can use several backend libraries, including netcdf, pnetcdf, and mpi-io. For each library used, a compile-time cpp flag is defined (e.g., _USEMPIIO), and the test suite builds and tests the model for several combinations of these cpp flags (a minimal sketch of the flag mechanism follows the list below):

  • snet = serial netcdf only
  • pnet = parallel netcdf only
  • mpiio = mpiio only
  • all = everything on
  • ant = everything on but timing disabled
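
For instance, backend-specific code in the source is guarded by the corresponding cpp flag, so it is compiled only when that backend is enabled; a minimal sketch:

 #ifdef _USEMPIIO
    ! mpi-io specific code path: compiled only when mpi-io support
    ! was enabled at build time via the _USEMPIIO cpp flag
 #endif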

Decomposition

The decomposition implementation supports the decomposition of a general 3 dimensional "nx * ny * nz" grid into multiple blocks of gridcells which are then ordered and assigned to processors.
In general, blocks in the decomposition are rectangular, "gdx * gdy * gdz" and the same size, although some blocks around the edges of the domain may be smaller if the decomposition is uneven.
Both gridcells within the block and blocks within the domain can be ordered in any of the possible dimension hierarchies, such as "xyz" where the first dimension is the fastest.

The gdx, gdy, and gdz inputs allow the user to specify the size in any dimension, and the grddecomp input specifies which dimensions are to be further optimized. In general, automatic decomposition generation of 3 dimensional grids can be done in any possible combination of dimensions (x, y, z, xy, xz, yz, or xyz), with the other dimensions having a fixed block size. The automatic generation of the decomposition is based upon an internal algorithm that tries to determine the most "square" blocks, with an additional constraint of minimizing the maximum number of gridcells across processors. If evenly divided grids are desired, adding "e" to the grddecomp value specifies that the grid decomposition must be evenly divided. The setblk option uses the prescribed gdx, gdy, and gdz inputs without further automation.
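For example, with nx_global=6, ny_global=4, npes=4, nblksppe=1, and grddecomp="xy", the algorithm yields four 3x2 blocks, one per processor, as shown in the first decomposition example below.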

The blkdecomp1 input works fundamentally the same way as grddecomp in mapping blocks to processors, but has a few additional options. "cont1d" (contiguous 1d) unwraps the blocks in the order specified by the blkorder input and then decomposes that "1d" list of blocks onto processors by grouping contiguous blocks together and allocating each group to a processor. The number of contiguous blocks allocated to a processor is the maximum of the bdx, bdy, and bdz inputs. Contiguous blocks are allocated to each processor in turn, round robin, until all blocks are allocated. "cont1dm" does basically the same thing, except the number of contiguous blocks is set automatically such that each processor receives only one set of contiguous blocks. The ysym2 and ysym4 blkdecomp2 options modify the original block layout such that the tasks assigned to the blocks are 2-way or 4-way symmetric in the y axis.
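As an illustrative walk-through of "cont1d" (assumed values, not testpio output): with 16 blocks on 4 processors and bdx=2 (bdy=bdz=0), pairs of consecutive blocks are handed out round robin, so P1 receives blocks 1-2 and 9-10, P2 receives 3-4 and 11-12, P3 receives 5-6 and 13-14, and P4 receives 7-8 and 15-16. With "cont1dm", the group size would instead be computed as 16/4 = 4, so each processor receives a single contiguous set of four blocks.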

The decomposition tool is extremely flexible, but arbitrary inputs will not always yield valid decompositions. If a valid decomposition cannot be computed based on the global grid size, number of pes, number of blocks desired, and decomposition options, the model will stop.

As indicated above, the IO decomposition must be suited to the IO methods, so decompositions are even further limited by those constraints. The testpio tool provides limited checking about whether the IO decomposition is valid for the IO method used. Since the IO output is written in "xyz" order, it's likely the best IO performance will be achieved with both grdorder and blkorder set to "xyz" for the IO decomposition.

Also note that in all cases, regardless of the decomposition, the global gridcell numbering and ordering in the output file is assumed to be "xyz" and defined as a single block. The numbering scheme in the examples below demonstrates how the namelist input relates back to the grid numbering on the local computational decomposition.

Some decomposition examples:

  • "B" is the block number
  • "P" is the processor (1:npes) the block is associated with numbers are the local gridcell numbering within the block if the local dimensions are unrolled.

Standard xyz ordering, 2d decomp (note: blkdecomp plays no role since there is 1 block per pe)

 nx_global  6       
 ny_global  4
 nz_global  1           ______________________________
 npes       4          |B3  P3        |B4  P4         |
 nblksppe   1          |              |               |
 grdorder   "xyz"      |              |               |
 grddecomp  "xy"       |              |               |
 gdx        0          |              |               |
 gdy        0          |--------------+---------------|
 gdz        0          |B1  P1        |B2  P2         |
 blkorder   "xyz"      |  4    5   6  |  4   5   6    |
 blkdecomp1 "xy"       |              |               |
 blkdecomp2 ""         |              |               |
 bdx        0          |  1    2   3  |  1   2   3    |
 bdy        0          |______________|_______________|
 bdz        0

Same as above but with yxz ordering, 2d decomp (note: blkdecomp plays no role since there is 1 block per pe)

 nx_global  6       
 ny_global  4
 nz_global  1           _____________________________
 npes       4          |B2  P2        |B4  P4        |
 nblksppe   1          |              |              |
 grdorder   "yxz"      |              |              |
 grddecomp  "xy"       |              |              |
 gdx        0          |              |              |
 gdy        0          |--------------+--------------|
 gdz        0          |B1  P1        |B3  P3        |
 blkorder   "yxz"      |  2    4   6  |  2   4   6   |
 blkdecomp1 "xy"       |              |              |
 blkdecomp2 ""         |              |              |
 bdx        0          |  1    3   5  |  1   3   5   |
 bdy        0          |______________|______________|
 bdz        0

xyz grid ordering, 1d x decomp (notes: blkdecomp plays no role since there is 1 block per pe; blkorder plays no role since it is a 1d decomp)

 nx_global  8       
 ny_global  4
 nz_global  1           _____________________________________
 npes       4          |B1  P1  |B2  P2   |B3  P3  |B4  P4   |
 nblksppe   1          | 7   8  |  7   8  |        |         |
 grdorder   "xyz"      |        |         |        |         |
 grddecomp  "x"        |        |         |        |         |
 gdx        0          | 5   6  |  5   6  |        |         |
 gdy        0          |        |         |        |         |
 gdz        0          |        |         |        |         |
 blkorder   "yxz"      | 3   4  |  3   4  |        |         |
 blkdecomp1 "xy"       |        |         |        |         |
 blkdecomp2 ""         |        |         |        |         |
 bdx        0          | 1   2  |  1   2  |        |         |
 bdy        0          |________|_________|________|_________|
 bdz        0

yxz block ordering, 2d grid decomp, 2d block decomp, 4 block per pe

 nx_global  8       
 ny_global  4
 nz_global  1           _____________________________________
 npes       4          |B4  P2  |B8  P2   |B12  P4 |B16  P4  |
 nblksppe   4          |        |         |        |         |
 grdorder   "xyz"      |--------+---------+--------+---------|
 grddecomp  "xy"       |B3  P2  |B7  P2   |B11  P4 |B15  P4  |
 gdx        0          |        |         |        |         |
 gdy        0          |--------+---------+--------+---------|
 gdz        0          |B2  P1  |B6  P1   |B10  P3 |B14  P3  |
 blkorder   "yxz"      |        |         |        |         |
 blkdecomp1 "xy"       |--------+---------+--------+---------|
 blkdecomp2 ""         |B1  P1  |B5  P1   |B9   P3 |B13  P3  |
 bdx        0          | 1   2  | 1   2   |        |         |
 bdy        0          |________|_________|________|_________|
 bdz        0