module: stats#

stats.apply_to_phases_in_place(func)#

Decorator to apply a function to both ‘prior’ and ‘posterior’ phases and modify the DataFrame in place.

The decorated function should accept the DataFrame as its first argument and ‘phase’ as its second, as in diag_stats(df, phase).
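A minimal sketch of how a decorator of this kind could work, not the library's implementation. It follows the (df, phase) argument order used by the signatures in this module. The name apply_to_phases_in_place_sketch, the example function add_double, and the rule that a phase is skipped when its ensemble-mean column is missing are all assumptions made for illustration.

```python
import functools

import pandas as pd

PHASES = ("prior", "posterior")


def apply_to_phases_in_place_sketch(func):
    """Hypothetical decorator: call func(df, phase) once per available phase."""

    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        for phase in PHASES:
            # Assumption: skip a phase whose ensemble-mean column is absent.
            if f"{phase}_ensemble_mean" in df.columns:
                func(df, phase, *args, **kwargs)
        # The DataFrame is modified in place, so nothing is returned.
        return None

    return wrapper


@apply_to_phases_in_place_sketch
def add_double(df, phase):
    # Illustrative only: add one new column per phase.
    df[f"{phase}_double"] = 2 * df[f"{phase}_ensemble_mean"]


df = pd.DataFrame({"prior_ensemble_mean": [1.0, 2.0]})
result = add_double(df)  # called without 'phase'; the wrapper supplies it
```

Here only the ‘prior’ column exists, so only ‘prior_double’ is added and the call returns None.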

stats.apply_to_phases_by_type_return_df(func)#

Decorator to apply a function to both ‘prior’ and ‘posterior’ phases and return a new DataFrame.

The decorated function should accept the DataFrame as its first argument and ‘phase’ as its second, as in grand_statistics(df, phase), and return a DataFrame.
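The returning variant can be sketched in the same spirit. This is an illustration under stated assumptions rather than the library's code: the per-phase results are assumed to be merged on whatever columns they share (here ‘type’), and the names apply_to_phases_return_df_sketch and mean_by_type are hypothetical.

```python
import functools

import pandas as pd


def apply_to_phases_return_df_sketch(func):
    """Hypothetical decorator: run func per phase and merge the results."""

    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        prior = func(df, "prior", *args, **kwargs)
        posterior = func(df, "posterior", *args, **kwargs)
        # Assumption: pd.merge joins on the columns common to both results.
        return pd.merge(prior, posterior)

    return wrapper


@apply_to_phases_return_df_sketch
def mean_by_type(df, phase):
    # Illustrative only: one row per observation type for the given phase.
    col = f"{phase}_ensemble_mean"
    return df.groupby("type", as_index=False)[col].mean()


obs = pd.DataFrame(
    {
        "type": ["A", "A", "B"],
        "prior_ensemble_mean": [1.0, 3.0, 5.0],
        "posterior_ensemble_mean": [2.0, 4.0, 6.0],
    }
)
merged = mean_by_type(obs)  # one row per type, one column per phase
```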

stats.apply_to_phases_by_obs(func)#

Decorator to apply a function to both ‘prior’ and ‘posterior’ phases and return a new DataFrame.

The decorated function should accept the DataFrame as its first argument and ‘phase’ as its second, as in calculate_rank(df, phase), and return a DataFrame.

stats.calculate_rank(df, phase)#

Calculate the rank of observations within an ensemble.

Note

This function is decorated with @apply_to_phases_by_obs, which modifies its usage. You should call it as calculate_rank(df), and the decorator will automatically apply the function to all relevant phases (‘prior’ and ‘posterior’).

This function takes a DataFrame containing ensemble predictions and observed values, adds sampling noise to the ensemble predictions, and calculates the rank of the observed value within the perturbed ensemble for each observation. The rank indicates the position of the observed value within the sorted ensemble values, with 1 being the lowest. If the observed value is larger than the largest ensemble member, its rank is set to the ensemble size plus one.

Parameters:

df (pd.DataFrame) – A DataFrame containing the observed values, the ensemble predictions, and the observation type.

Returns:

DataFrame containing columns for ‘rank’ and observation ‘type’.
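The rank definition above can be sketched for a single observation as follows. This is a simplified illustration, not the library's code: the helper name rank_in_ensemble is hypothetical, and drawing the sampling noise from a zero-mean Gaussian whose variance equals the observation error variance is an assumption.

```python
import numpy as np


def rank_in_ensemble(observation, ensemble, obs_err_var, rng):
    """Hypothetical helper: rank of one observation in a perturbed ensemble."""
    members = np.asarray(ensemble, dtype=float)
    # Assumption: sampling noise is Gaussian with variance obs_err_var.
    noise = rng.normal(0.0, np.sqrt(obs_err_var), size=members.shape)
    perturbed = np.sort(members + noise)
    # Number of perturbed members strictly below the observation, plus one.
    return int(np.searchsorted(perturbed, observation, side="left")) + 1


rng = np.random.default_rng(0)
members = [3.0, 1.0, 2.0]
# With zero error variance the noise vanishes, so the ranks are deterministic.
rank_low = rank_in_ensemble(0.5, members, 0.0, rng)
rank_mid = rank_in_ensemble(2.5, members, 0.0, rng)
rank_high = rank_in_ensemble(9.0, members, 0.0, rng)
```

An observation below every member gets rank 1, and one above every member gets the ensemble size plus one.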

stats.mean_then_sqrt(x)#

Calculates the mean of an array-like object and then takes the square root of the result.

Parameters:

x (array-like) – An array-like object (such as a list or a pandas Series). The elements should be numeric.

Returns:

The square root of the mean of the input array.

Return type:

float

Raises:

  • TypeError – If the input is not an array-like object containing numeric values.

  • ValueError – If the input array is empty.
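A minimal sketch of the behaviour described above, written with NumPy. The name mean_then_sqrt_sketch and the exact checks used to raise the two exceptions are assumptions; the library's own implementation may differ.

```python
import numpy as np


def mean_then_sqrt_sketch(x):
    """Hypothetical re-implementation: square root of the mean of x."""
    values = np.asarray(x)
    if not np.issubdtype(values.dtype, np.number):
        raise TypeError("input must contain numeric values")
    if values.size == 0:
        raise ValueError("input is empty")
    return float(np.sqrt(values.mean()))


# Applied to squared errors, this gives a root mean square error.
squared_errors = [1.0, 9.0, 4.0, 2.0]
rmse = mean_then_sqrt_sketch(squared_errors)
```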

stats.diag_stats(df, phase)#

Calculate diagnostic statistics for a given phase and add them to the DataFrame.

Note

This function is decorated with @apply_to_phases_in_place, which modifies its usage. You should call it as diag_stats(df), and the decorator will automatically apply the function to all relevant phases (‘prior’ and ‘posterior’) modifying the DataFrame in place.

Parameters:

df (pandas.DataFrame) –

The input DataFrame containing observation data and ensemble statistics. The DataFrame must include the following columns:

  • ‘observation’: The actual observation values.

  • ‘obs_err_var’: The variance of the observation error.

  • ‘prior_ensemble_mean’ and/or ‘posterior_ensemble_mean’: The mean of the ensemble.

  • ‘prior_ensemble_spread’ and/or ‘posterior_ensemble_spread’: The spread of the ensemble.

Returns:

The function modifies the DataFrame in place by adding the following columns:
  • ‘prior_sq_err’ and/or ‘posterior_sq_err’: The squared error for the ‘prior’ and ‘posterior’ phases.

  • ‘prior_bias’ and/or ‘posterior_bias’: The bias for the ‘prior’ and ‘posterior’ phases.

  • ‘prior_totalvar’ and/or ‘posterior_totalvar’: The total variance for the ‘prior’ and ‘posterior’ phases.

Return type:

None

Notes

  • Spread is the standard deviation of the ensemble.

  • The function modifies the input DataFrame by adding new columns for the calculated statistics.
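As a rough illustration of the columns listed above, the following sketch computes them for one phase. The formulas are assumptions rather than a copy of the library source: bias is taken as ensemble mean minus observation, squared error as the square of that difference, and total variance as the observation error variance plus the squared spread. The helper name add_diag_columns is hypothetical.

```python
import pandas as pd


def add_diag_columns(df, phase):
    """Hypothetical helper: add per-observation diagnostics for one phase."""
    # Assumed sign convention: ensemble mean minus observation.
    diff = df[f"{phase}_ensemble_mean"] - df["observation"]
    df[f"{phase}_bias"] = diff
    df[f"{phase}_sq_err"] = diff**2
    # Spread is a standard deviation, so it is squared before being added
    # to the observation error variance.
    df[f"{phase}_totalvar"] = df["obs_err_var"] + df[f"{phase}_ensemble_spread"] ** 2


obs = pd.DataFrame(
    {
        "observation": [10.0, 5.0],
        "obs_err_var": [1.0, 0.5],
        "prior_ensemble_mean": [12.0, 4.0],
        "prior_ensemble_spread": [2.0, 1.0],
    }
)
add_diag_columns(obs, "prior")  # modifies obs in place
```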

stats.bin_by_layer(df, levels, verticalUnit='pressure (Pa)')#

Bin observations by vertical layers and add ‘vlevels’ and ‘midpoint’ columns to the DataFrame.

This function bins the observations in the DataFrame based on the specified vertical levels and adds two new columns: ‘vlevels’, which represents the categorized vertical levels, and ‘midpoint’, which represents the midpoint of each vertical level bin. Only observations (rows) with the specified vertical unit are binned.

Parameters:
  • df (pandas.DataFrame) –

    The input DataFrame containing observation data. The DataFrame must include the following columns:

    • ‘vertical’: The vertical coordinate values of the observations.

    • ‘vert_unit’: The unit of the vertical coordinate values.

  • levels (list) – A list of bin edges for the vertical levels.

  • verticalUnit (str, optional) – The unit of the vertical axis (e.g., ‘pressure (Pa)’). Default is ‘pressure (Pa)’.

Returns:

The input DataFrame with additional columns for the binned vertical levels and their midpoints:
  • ‘vlevels’: The categorized vertical levels.

  • ‘midpoint’: The midpoint of each vertical level bin.

Return type:

pandas.DataFrame

Notes

  • The function modifies the input DataFrame by adding ‘vlevels’ and ‘midpoint’ columns.

  • The ‘midpoint’ values are calculated as the midpoint of each vertical level bin, halfway between its lower and upper edges.
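The binning described above can be sketched with pandas.cut. This is an illustration under assumptions, not the library's implementation: the name bin_by_layer_sketch is hypothetical, bins are taken as right-closed intervals (the pandas.cut default), and rows with a different vertical unit are left with missing ‘vlevels’ and ‘midpoint’ values.

```python
import numpy as np
import pandas as pd


def bin_by_layer_sketch(df, levels, vertical_unit="pressure (Pa)"):
    """Hypothetical helper: add 'vlevels' and 'midpoint' columns to df."""
    edges = np.asarray(levels, dtype=float)
    mids = (edges[:-1] + edges[1:]) / 2.0  # midpoint of each bin
    # Hide rows with other vertical units so pd.cut leaves them unbinned.
    matching = df["vert_unit"] == vertical_unit
    binned = pd.cut(df["vertical"].where(matching), bins=edges)
    codes = binned.cat.codes.to_numpy()  # -1 marks rows that were not binned
    df["vlevels"] = binned
    df["midpoint"] = np.where(codes >= 0, mids[codes], np.nan)
    return df


layers = pd.DataFrame(
    {
        "vertical": [100000.0, 50000.0, 500.0],
        "vert_unit": ["pressure (Pa)", "pressure (Pa)", "height (m)"],
    }
)
bin_by_layer_sketch(layers, levels=[0, 40000, 80000, 110000])
```

The third row uses a different unit, so it receives no bin and no midpoint.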

stats.bin_by_time(df, time_value)#

Bin observations by time and add ‘time_bin’ and ‘time_bin_midpoint’ columns to the DataFrame. The first bin starts 1 second before the minimum time value, so the minimum time is included in the first bin. The last bin is inclusive of the maximum time value.

Parameters:
  • df (pd.DataFrame) – The input DataFrame containing a ‘time’ column.

  • time_value (str) – The width of each time bin (e.g., ‘3600S’ for 1 hour).

Returns:

The function modifies the DataFrame in place by adding ‘time_bin’ and ‘time_bin_midpoint’ columns.

Return type:

None
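A sketch of the time binning described above, assuming right-closed bins built with pandas.date_range and pandas.cut. The name bin_by_time_sketch is hypothetical and the way the bin edges are extended past the maximum time is an assumption; the library may construct its edges differently.

```python
import pandas as pd


def bin_by_time_sketch(df, time_value):
    """Hypothetical helper: add 'time_bin' and 'time_bin_midpoint' in place."""
    width = pd.Timedelta(time_value)
    # Start one second early so the minimum time falls inside the first bin.
    start = df["time"].min() - pd.Timedelta(seconds=1)
    # Extend past the maximum so the last bin contains the maximum time.
    edges = pd.date_range(start=start, end=df["time"].max() + width, freq=width)
    mids = edges[:-1] + (edges[1:] - edges[:-1]) / 2
    binned = pd.cut(df["time"], bins=edges)
    df["time_bin"] = binned
    # Look up each row's bin midpoint through its categorical code.
    df["time_bin_midpoint"] = pd.Series(
        mids[binned.cat.codes.to_numpy()], index=df.index
    )


frame = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2024-01-01 00:00:00", "2024-01-01 00:30:00", "2024-01-01 01:30:00"]
        )
    }
)
bin_by_time_sketch(frame, "1h")
```

With one-hour bins, the first two rows share a bin that starts at 23:59:59 on the previous day, and the third row falls in the next bin.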

stats.grand_statistics(df, phase)#

Calculate grand statistics (RMSE, bias, total spread) for each observation type and phase.

This function assumes that diagnostic statistics (such as squared error, bias, and total variance) have already been computed by diag_stats() and are present in the DataFrame. It groups the data by observation type and computes the root mean square error (RMSE), mean bias, and total spread for the specified phase.

Note

This function is decorated with @apply_to_phases_by_type_return_df, which modifies its usage. You should call it as grand_statistics(df), and the decorator will automatically apply the function to all relevant phases (‘prior’ and ‘posterior’) and return a merged DataFrame.

Parameters:

df (pandas.DataFrame) – The input DataFrame containing diagnostic statistics for observations.

Returns:

A DataFrame with columns:
  • ‘type’: The observation type.

  • ‘{phase}_rmse’: The root mean square error for the phase.

  • ‘{phase}_bias’: The mean bias for the phase.

  • ‘{phase}_totalspread’: The total spread for the phase.

Return type:

pandas.DataFrame
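The aggregation for a single phase can be sketched as a pandas groupby. This is a hedged illustration, not the library's code: the name grand_statistics_sketch is hypothetical, and taking the total spread as the square root of the mean total variance is an assumption (it mirrors how RMSE is derived from the squared error). Grouping by additional keys such as ‘midpoint’ or ‘time_bin_midpoint’ would follow the same pattern.

```python
import numpy as np
import pandas as pd


def grand_statistics_sketch(df, phase):
    """Hypothetical helper: per-type RMSE, bias and total spread for one phase."""

    def root_mean(series):
        return np.sqrt(series.mean())

    return (
        df.groupby("type")
        .agg(
            **{
                f"{phase}_rmse": (f"{phase}_sq_err", root_mean),
                f"{phase}_bias": (f"{phase}_bias", "mean"),
                # Assumption: total spread = sqrt(mean(total variance)).
                f"{phase}_totalspread": (f"{phase}_totalvar", root_mean),
            }
        )
        .reset_index()
    )


diagnostics = pd.DataFrame(
    {
        "type": ["A", "A", "B"],
        "prior_sq_err": [1.0, 7.0, 9.0],
        "prior_bias": [1.0, 3.0, -3.0],
        "prior_totalvar": [5.0, 13.0, 16.0],
    }
)
summary = grand_statistics_sketch(diagnostics, "prior")
```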

stats.layer_statistics(df, phase)#

Calculate statistics (RMSE, bias, total spread) for each observation type and vertical layer.

This function assumes that diagnostic statistics (such as squared error, bias, and total variance) have already been computed with diag_stats() and are present in the DataFrame. It groups the data by vertical layer midpoint and observation type, and computes the root mean square error (RMSE), mean bias, and total spread for the specified phase for each vertical layer.

Note

This function is decorated with @apply_to_phases_by_type_return_df, which modifies its usage. You should call it as layer_statistics(df), and the decorator will automatically apply the function to all relevant phases (‘prior’ and ‘posterior’) and return a merged DataFrame.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame containing diagnostic statistics for observations.

  • phase (str) – The phase for which to calculate the statistics (‘prior’ or ‘posterior’). The decorator supplies this argument, so callers do not pass it.

Returns:

A DataFrame with columns:
  • ‘midpoint’: The midpoint of the vertical layer.

  • ‘type’: The observation type.

  • ‘{phase}_rmse’: The root mean square error for the phase.

  • ‘{phase}_bias’: The mean bias for the phase.

  • ‘{phase}_totalspread’: The total spread for the phase.

  • ‘vert_unit’: The vertical unit.

  • ‘vlevels’: The categorized vertical level.

Return type:

pandas.DataFrame

stats.time_statistics(df, phase)#

Calculate time-based statistics (RMSE, bias, total spread) for each observation type and time bin.

This function assumes that diagnostic statistics (such as squared error, bias, and total variance) have already been computed by diag_stats() and are present in the DataFrame. It groups the data by time bin midpoint and observation type, and computes the root mean square error (RMSE), mean bias, and total spread for the specified phase for each time bin.

Note

This function is decorated with @apply_to_phases_by_type_return_df. You should call it as time_statistics(df), and the decorator will automatically apply the function to all relevant phases (‘prior’ and ‘posterior’) and return a merged DataFrame.

Parameters:
  • df (pandas.DataFrame) – The input DataFrame containing diagnostic statistics for observations.

  • phase (str) – The phase for which to calculate the statistics (‘prior’ or ‘posterior’). The decorator supplies this argument, so callers do not pass it.

Returns:

A DataFrame with columns:
  • ‘time_bin_midpoint’: The midpoint of the time bin.

  • ‘type’: The observation type.

  • ‘{phase}_rmse’: The root mean square error for the phase.

  • ‘{phase}_bias’: The mean bias for the phase.

  • ‘{phase}_totalspread’: The total spread for the phase.

  • ‘time_bin’: The time bin interval.

  • ‘time’: The first time value in the bin.

Return type:

pandas.DataFrame

stats.possible_vs_used(df)#

Calculates the count of possible vs. used observations by type.

This function takes a DataFrame containing observation data, including a ‘type’ column for the observation type and an ‘observation’ column. The number of possible observations (‘possible’) is the count of all observations of each type, and the number of used observations (‘used’) is the count of assimilated observations, as selected by select_used_qcs(). The result is a DataFrame listing each observation type with its possible and used counts.

Returns:

A DataFrame with three columns: ‘type’, ‘possible’, and ‘used’. ‘type’ is the observation type, ‘possible’ is the count of all observations of that type, and ‘used’ is the count of observations of that type that passed quality control checks.

Return type:

pd.DataFrame
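The counting can be sketched as a single groupby. This is an illustration, not the library's implementation: the name possible_vs_used_sketch is hypothetical, and it assumes the quality control flag is stored in a ‘DART_quality_control’ column and that flags 0 and 2 mark used observations, as described for select_used_qcs and possible_vs_used_by_time in this module.

```python
import pandas as pd

# Flags treated as "used", following the select_used_qcs description.
USED_QC_FLAGS = (0, 2)


def possible_vs_used_sketch(df):
    """Hypothetical helper: count possible and used observations per type."""
    flagged = df.assign(used=df["DART_quality_control"].isin(USED_QC_FLAGS))
    return (
        flagged.groupby("type")
        .agg(possible=("observation", "count"), used=("used", "sum"))
        .reset_index()
    )


records = pd.DataFrame(
    {
        "type": ["A", "A", "B", "B", "B"],
        "observation": [1.0, 2.0, 3.0, 4.0, 5.0],
        "DART_quality_control": [0, 7, 2, 0, 4],
    }
)
counts = possible_vs_used_sketch(records)
```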

stats.possible_vs_used_by_layer(df)#

Calculates the count of possible vs. used observations by type and vertical level.

stats.select_used_qcs(df)#

Select rows from the DataFrame where the observation was used. Includes observations for which the posterior forward observation operators failed.

Returns:

A DataFrame containing only the rows with a DART quality control flag of 0 or 2.

Return type:

pandas.DataFrame

stats.possible_vs_used_by_time(df)#

Calculates the count of possible vs. used observations by type and time bin.

Parameters:

df (pd.DataFrame) –

The input DataFrame containing observation data. The DataFrame must include:

  • ‘type’: The observation type.

  • ‘time_bin_midpoint’: The midpoint of the time bin.

  • ‘observation’: The observation values.

  • ‘DART_quality_control’: The quality control flag.

Returns:

A DataFrame with the following columns:
  • ‘time_bin_midpoint’: The midpoint of the time bin.

  • ‘type’: The observation type.

  • ‘possible’: The count of all observations in the time bin.

  • ‘used’: The count of observations in the time bin that passed quality control checks.

Return type:

pd.DataFrame