module: stats
- stats.apply_to_phases_in_place(func)
Decorator to apply a function to both ‘prior’ and ‘posterior’ phases and modify the DataFrame in place.
The decorated function should accept ‘phase’ as its first argument.
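A decorator of this kind can be sketched as follows. This is a minimal illustration, not the module's implementation: the names `apply_to_phases_in_place_sketch` and `add_bias` are hypothetical, and the argument order passed to the wrapped function (`df`, then `phase`) is an assumption taken from the signatures listed in this module, such as `diag_stats(df, phase)`.

```python
from functools import wraps

import pandas as pd


def apply_to_phases_in_place_sketch(func):
    """Hypothetical stand-in: call func once per phase and return nothing."""

    @wraps(func)
    def wrapper(df, *args, **kwargs):
        for phase in ("prior", "posterior"):
            # Assumption: the wrapped function takes (df, phase, ...).
            func(df, phase, *args, **kwargs)
        return None

    return wrapper


@apply_to_phases_in_place_sketch
def add_bias(df, phase):
    # Hypothetical example function: adds one new column per phase.
    df[f"{phase}_bias"] = df[f"{phase}_ensemble_mean"] - df["observation"]


df = pd.DataFrame(
    {
        "observation": [1.0, 2.0],
        "prior_ensemble_mean": [1.5, 1.0],
        "posterior_ensemble_mean": [1.25, 1.5],
    }
)
result = add_bias(df)  # df gains 'prior_bias' and 'posterior_bias'
```

The caller passes only the DataFrame; the wrapper supplies the phase name on each pass, so one function body serves both phases.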
- stats.apply_to_phases_by_type_return_df(func)
Decorator to apply a function to both ‘prior’ and ‘posterior’ phases and return a new DataFrame.
The decorated function should accept ‘phase’ as its first argument and return a DataFrame.
- stats.apply_to_phases_by_obs(func)
Decorator to apply a function to both ‘prior’ and ‘posterior’ phases and return a new DataFrame.
The decorated function should accept ‘phase’ as its first argument and return a DataFrame.
- stats.calculate_rank(df, phase)
Calculate the rank of observations within an ensemble.
This function takes a DataFrame containing ensemble predictions and observed values, adds sampling noise to the ensemble predictions, and calculates the rank of the observed value within the perturbed ensemble for each observation. The rank indicates the position of the observed value within the sorted ensemble values, with 1 being the lowest. If the observed value is larger than the largest ensemble member, its rank is set to the ensemble size plus one.
- Parameters:
df (pd.DataFrame) – A DataFrame containing the ensemble predictions, the observed values, and the observation type.
phase (str) – The phase for which to calculate the statistics (‘prior’ or ‘posterior’)
- Returns:
DataFrame containing columns for ‘rank’ and observation ‘type’.
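The ranking rule described above can be sketched for a single observation. This is an illustration under stated assumptions, not the module's code: the helper name `rank_of_observation` is hypothetical, and the sampling noise is assumed to be Gaussian with variance equal to the observation error variance.

```python
import numpy as np


def rank_of_observation(ensemble, observation, obs_err_var, rng):
    """Hypothetical helper: rank of one observation in a perturbed ensemble."""
    members = np.asarray(ensemble, dtype=float)
    # Assumption: noise is drawn from N(0, obs_err_var) for each member.
    noise = rng.normal(0.0, np.sqrt(obs_err_var), size=members.size)
    perturbed = np.sort(members + noise)
    # Count the members strictly below the observation, then add 1 so that
    # ranks run from 1 to ensemble size + 1.
    return int(np.searchsorted(perturbed, observation, side="left")) + 1


rng = np.random.default_rng(seed=0)
members = [1.0, 2.0, 3.0, 4.0]
# Zero error variance switches the noise off, which keeps the example deterministic.
rank_mid = rank_of_observation(members, 2.5, obs_err_var=0.0, rng=rng)
rank_top = rank_of_observation(members, 10.0, obs_err_var=0.0, rng=rng)
rank_low = rank_of_observation(members, 0.0, obs_err_var=0.0, rng=rng)
```

With four members there are five possible ranks; an observation above every member gets rank 5, matching the "ensemble size plus one" rule.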
- stats.mean_then_sqrt(x)
Calculates the mean of an array-like object and then takes the square root of the result.
- Parameters:
x (array-like) – An array-like object (such as a list or a pandas Series). The elements should be numeric.
- Returns:
The square root of the mean of the input array.
- Return type:
float
- Raises:
TypeError – If the input is not an array-like object containing numeric values.
ValueError – If the input array is empty.
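The operation is small enough to sketch directly. The name `mean_then_sqrt_sketch` is hypothetical, and the sketch checks only for empty input; it does not reproduce the module's own type checks.

```python
import numpy as np


def mean_then_sqrt_sketch(x):
    """Hypothetical stand-in: square root of the mean of x."""
    values = np.asarray(x, dtype=float)
    if values.size == 0:
        raise ValueError("input is empty")
    return float(np.sqrt(np.mean(values)))


# Applied to squared errors, this gives a root-mean-square error.
squared_errors = [9.0, 16.0, 0.0, 11.0]
rmse = mean_then_sqrt_sketch(squared_errors)
```

The order matters: taking the square root after averaging gives a root-mean-square value, which differs from the mean of the individual square roots.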
- stats.diag_stats(df, phase)
Calculate diagnostic statistics for a given phase and add them to the DataFrame.
- Parameters:
df (pandas.DataFrame) – The input DataFrame containing observation data and ensemble statistics. The DataFrame must include the following columns:
‘observation’: The actual observation values.
‘obs_err_var’: The variance of the observation error.
‘prior_ensemble_mean’ and/or ‘posterior_ensemble_mean’: The mean of the ensemble.
‘prior_ensemble_spread’ and/or ‘posterior_ensemble_spread’: The spread of the ensemble.
phase (str) – The phase for which to calculate the statistics (‘prior’ or ‘posterior’)
- Returns:
- The function modifies the DataFrame in place by adding the following columns:
‘prior_sq_err’ and/or ‘posterior_sq_err’: The squared error for the ‘prior’ and ‘posterior’ phases.
‘prior_bias’ and/or ‘posterior_bias’: The bias for the ‘prior’ and ‘posterior’ phases.
‘prior_totalvar’ and/or ‘posterior_totalvar’: The total variance for the ‘prior’ and ‘posterior’ phases.
- Return type:
None
Notes
Spread is the standard deviation of the ensemble.
The function modifies the input DataFrame by adding new columns for the calculated statistics.
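The three added columns can be sketched for one phase as below. The formulas are assumptions, since this page does not state them: bias is taken as ensemble mean minus observation, and total variance as observation error variance plus squared spread (spread being a standard deviation). The name `diag_stats_sketch` is hypothetical.

```python
import pandas as pd


def diag_stats_sketch(df, phase):
    """Hypothetical stand-in: add diagnostic columns for one phase in place."""
    mean = df[f"{phase}_ensemble_mean"]
    spread = df[f"{phase}_ensemble_spread"]
    # Assumed sign convention: ensemble mean minus observation.
    df[f"{phase}_bias"] = mean - df["observation"]
    df[f"{phase}_sq_err"] = (mean - df["observation"]) ** 2
    # Assumed definition: spread is a standard deviation, so square it
    # before adding the observation error variance.
    df[f"{phase}_totalvar"] = df["obs_err_var"] + spread**2


df = pd.DataFrame(
    {
        "observation": [1.0, 2.0],
        "obs_err_var": [0.25, 1.0],
        "prior_ensemble_mean": [1.5, 1.0],
        "prior_ensemble_spread": [0.5, 2.0],
    }
)
diag_stats_sketch(df, "prior")
```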
- stats.bin_by_layer(df, levels, verticalUnit='pressure (Pa)')
Bin observations by vertical layers and add ‘vlevels’ and ‘midpoint’ columns to the DataFrame.
This function bins the observations in the DataFrame based on the specified vertical levels and adds two new columns: ‘vlevels’, which represents the categorized vertical levels, and ‘midpoint’, which represents the midpoint of each vertical level bin. Only observations (rows) with the specified vertical unit are binned.
- Parameters:
df (pandas.DataFrame) – The input DataFrame containing observation data. The DataFrame must include the following columns:
‘vertical’: The vertical coordinate values of the observations.
‘vert_unit’: The unit of the vertical coordinate values.
levels (list) – A list of bin edges for the vertical levels.
verticalUnit (str, optional) – The unit of the vertical axis (e.g., ‘pressure (Pa)’). Default is ‘pressure (Pa)’.
- Returns:
- The input DataFrame with additional columns for the binned vertical levels and their midpoints:
‘vlevels’: The categorized vertical levels.
‘midpoint’: The midpoint of each vertical level bin.
- Return type:
pandas.DataFrame
Notes
The function modifies the input DataFrame by adding ‘vlevels’ and ‘midpoint’ columns.
The ‘midpoint’ value is the midpoint of each vertical level bin, halfway between its lower and upper edges.
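The binning can be sketched with `pandas.cut`. This is an illustration only: the name `bin_by_layer_sketch` is hypothetical, the bin edges are assumed to be increasing, and rows with a different vertical unit are assumed to receive missing values in both new columns.

```python
import numpy as np
import pandas as pd


def bin_by_layer_sketch(df, levels, verticalUnit="pressure (Pa)"):
    """Hypothetical stand-in: add 'vlevels' and 'midpoint' columns."""
    edges = np.asarray(levels, dtype=float)
    # Hide rows in other units from the binning step (they become NaN).
    vertical = df["vertical"].where(df["vert_unit"] == verticalUnit)
    df["vlevels"] = pd.cut(vertical, bins=edges)
    # Midpoint of each bin, halfway between its lower and upper edges.
    mids = (edges[:-1] + edges[1:]) / 2
    codes = df["vlevels"].cat.codes.to_numpy()  # -1 marks unbinned rows
    df["midpoint"] = np.where(codes >= 0, mids[codes], np.nan)
    return df


df = pd.DataFrame(
    {
        "vertical": [95000.0, 60000.0, 1500.0],
        "vert_unit": ["pressure (Pa)", "pressure (Pa)", "height (m)"],
    }
)
df = bin_by_layer_sketch(df, levels=[0, 50000, 70000, 100000])
```

The third row is numerically inside the lowest bin, but it is skipped because its unit is not the requested one.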
- stats.bin_by_time(df, time_value)
Bin observations by time and add ‘time_bin’ and ‘time_bin_midpoint’ columns to the DataFrame. The first bin starts 1 second before the minimum time value, so the minimum time is included in the first bin. The last bin is inclusive of the maximum time value.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing a ‘time’ column.
time_value (str) – The width of each time bin (e.g., ‘3600S’ for 1 hour).
- Returns:
The function modifies the DataFrame in place by adding ‘time_bin’ and ‘time_bin_midpoint’ columns.
- Return type:
None
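The bin layout described for `bin_by_time` can be sketched as below. The name `bin_by_time_sketch` is hypothetical and the construction of the edges is an assumption; only the one-second offset and the inclusive last bin come from the description. The example uses the lowercase alias ‘3600s’, since recent pandas releases deprecate the uppercase ‘S’.

```python
import math

import pandas as pd


def bin_by_time_sketch(df, time_value):
    """Hypothetical stand-in: add 'time_bin' and 'time_bin_midpoint' in place."""
    width = pd.Timedelta(time_value)
    # Start 1 second early so the minimum time falls inside the first
    # (left-open, right-closed) bin.
    start = df["time"].min() - pd.Timedelta(seconds=1)
    n_bins = math.ceil((df["time"].max() - start) / width)
    edges = pd.date_range(start=start, periods=n_bins + 1, freq=width)
    df["time_bin"] = pd.cut(df["time"], bins=edges)
    # Midpoint of each bin is its left edge plus half the bin width.
    mids = edges[:-1] + width / 2
    df["time_bin_midpoint"] = mids[df["time_bin"].cat.codes.to_numpy()]


df = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2024-01-01 00:00:00", "2024-01-01 00:30:00", "2024-01-01 01:30:00"]
        )
    }
)
bin_by_time_sketch(df, "3600s")
```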
- stats.time_statistics(df, phase)
Calculate time-based statistics for a given phase and return a new DataFrame.
- Parameters:
df (pandas.DataFrame) – The input DataFrame containing observation data and ensemble statistics.
phase (str) – The phase for which to calculate the statistics (‘prior’ or ‘posterior’).
- Returns:
A DataFrame containing time-based statistics for the specified phase.
- Return type:
pandas.DataFrame
- stats.possible_vs_used(df)
Calculates the count of possible vs. used observations by type.
This function takes a DataFrame containing observation data, including a ‘type’ column for the observation type and an ‘observation’ column. The number of used observations (‘used’) is the total number of assimilated observations (as determined by the select_used_qcs function). The result is a DataFrame with each observation type, the count of possible observations, and the count of used observations.
- Returns:
A DataFrame with three columns: ‘type’, ‘possible’, and ‘used’. ‘type’ is the observation type, ‘possible’ is the count of all observations of that type, and ‘used’ is the count of observations of that type that passed quality control checks.
- Return type:
pd.DataFrame
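The counting can be sketched with a pair of group-by operations. This is an illustration, not the module's code: the name `possible_vs_used_sketch` is hypothetical, and the ‘DART_quality_control’ column and the used flags 0 and 2 are taken from the other entries on this page.

```python
import pandas as pd


def possible_vs_used_sketch(df):
    """Hypothetical stand-in: count possible and used observations per type."""
    possible = df.groupby("type")["observation"].count()
    # Assumption: 'used' means a DART quality control flag of 0 or 2.
    used_rows = df[df["DART_quality_control"].isin([0, 2])]
    used = used_rows.groupby("type")["observation"].count()
    # Types with no used observations get a count of 0 rather than dropping out.
    used = used.reindex(possible.index, fill_value=0)
    return pd.DataFrame(
        {
            "type": possible.index.to_numpy(),
            "possible": possible.to_numpy(),
            "used": used.to_numpy(),
        }
    )


obs = pd.DataFrame(
    {
        "type": ["A", "A", "A", "B", "B"],
        "observation": [1.0, 2.0, 3.0, 4.0, 5.0],
        "DART_quality_control": [0, 2, 7, 4, 5],
    }
)
counts = possible_vs_used_sketch(obs)
```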
- stats.possible_vs_used_by_layer(df)
Calculates the count of possible vs. used observations by type and vertical level.
- stats.select_used_qcs(df)
Select rows from the DataFrame where the observation was used. Includes observations for which the posterior forward observation operators failed.
- Returns:
A DataFrame containing only the rows with a DART quality control flag of 0 or 2.
- Return type:
pandas.DataFrame
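The selection amounts to a row filter on the quality control flag. In this sketch the name `select_used_qcs_sketch` is hypothetical, and the column name ‘DART_quality_control’ is taken from the possible_vs_used_by_time entry on this page.

```python
import pandas as pd


def select_used_qcs_sketch(df):
    """Hypothetical stand-in: keep rows whose DART QC flag is 0 or 2."""
    return df[df["DART_quality_control"].isin([0, 2])].copy()


obs = pd.DataFrame(
    {
        "observation": [1.0, 2.0, 3.0, 4.0],
        "DART_quality_control": [0, 2, 4, 7],
    }
)
used = select_used_qcs_sketch(obs)
```

Returning a copy keeps later edits to the filtered rows from affecting the original DataFrame.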
- stats.possible_vs_used_by_time(df)
Calculates the count of possible vs. used observations by type and time bin.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing observation data. The DataFrame must include:
‘type’: The observation type.
‘time_bin_midpoint’: The midpoint of the time bin.
‘observation’: The observation values.
‘DART_quality_control’: The quality control flag.
- Returns:
- A DataFrame with the following columns:
‘time_bin_midpoint’: The midpoint of the time bin.
‘type’: The observation type.
‘possible’: The count of all observations in the time bin.
‘used’: The count of observations in the time bin that passed quality control checks.
- Return type:
pd.DataFrame