Grand Statistics
This example demonstrates how to compute grand statistics for an observation sequence. For an explanation of how the statistics are calculated, see the Statistics guide.
Import the obs_sequence module and the statistics module.
import pydartdiags.obs_sequence.obs_sequence as obsq
from pydartdiags.stats import stats
Choose an obs_seq file to read.
This example uses "obs_seq.final.ascii.medium", a small obs_seq file that ships in the data directory of the pyDARTdiags package, so we import os to build the path to the file.
import os
data_dir = os.path.join(os.getcwd(), "../..", "data")
data_file = os.path.join(data_dir, "obs_seq.final.ascii.medium")
Read the obs_seq file into an obs_seq object.
obs_seq = obsq.ObsSequence(data_file)
# Select observations that were used in the assimilation.
used_obs = obs_seq.select_used_qcs()
used_obs is a DataFrame containing only the observations with a QC value of 0.
Its columns are the same as those of the original obs_seq DataFrame.
used_obs.columns
Index(['obs_num', 'observation', 'prior_ensemble_mean',
'prior_ensemble_spread', 'Data_QC', 'DART_quality_control',
'linked_list', 'longitude', 'latitude', 'vertical', 'vert_unit', 'type',
'metadata', 'external_FO', 'seconds', 'days', 'time', 'obs_err_var'],
dtype='object')
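As a rough illustration of the filtering described above, the sketch below keeps only rows whose DART_quality_control value is 0. The toy DataFrame and its values are invented for illustration, and the boolean mask is an assumption about what the selection amounts to; the selection logic inside select_used_qcs may differ from this.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the example file.
toy = pd.DataFrame({
    "obs_num": [1, 2, 3, 4],
    "DART_quality_control": [0, 7, 0, 4],
})

# Assumption: "used" means a DART quality control value of 0,
# as stated in the text above.
toy_used = toy[toy["DART_quality_control"] == 0].copy()

print(toy_used["obs_num"].tolist())
```

Taking a copy keeps later in-place column additions from touching the original frame.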
Now we calculate the statistics required for DART diagnostics.
stats.diag_stats(used_obs)
The statistics are calculated for each row in the dataframe, and the results are stored in new columns.
used_obs.columns
Index(['obs_num', 'observation', 'prior_ensemble_mean',
'prior_ensemble_spread', 'Data_QC', 'DART_quality_control',
'linked_list', 'longitude', 'latitude', 'vertical', 'vert_unit', 'type',
'metadata', 'external_FO', 'seconds', 'days', 'time', 'obs_err_var',
'prior_sq_err', 'prior_bias', 'prior_totalvar'],
dtype='object')
Use the built-in help function to learn more about diag_stats, including which statistics it calculates.
help(stats.diag_stats)
Help on function diag_stats in module pydartdiags.stats.stats:
diag_stats(df, phase)
Calculate diagnostic statistics for a given phase and add them to the DataFrame.
Note:
This function is decorated with @apply_to_phases_in_place, which modifies its usage.
You should call it as diag_stats(df), and the decorator will automatically apply the
function to all relevant phases ('prior' and 'posterior') modifying the DataFrame
in place.
Args:
df (pandas.DataFrame): The input DataFrame containing observation data and ensemble statistics.
The DataFrame must include the following columns:
- 'observation': The actual observation values.
- 'obs_err_var': The variance of the observation error.
- 'prior_ensemble_mean' and/or 'posterior_ensemble_mean': The mean of the ensemble.
- 'prior_ensemble_spread' and/or 'posterior_ensemble_spread': The spread of the ensemble.
Returns:
None: The function modifies the DataFrame in place by adding the following columns:
- 'prior_sq_err' and/or 'posterior_sq_err': The square error for the 'prior' and 'posterior' phases.
- 'prior_bias' and/or 'posterior_bias': The bias for the 'prior' and 'posterior' phases.
- 'prior_totalvar' and/or 'posterior_totalvar': The total variance for the 'prior' and 'posterior' phases.
Notes:
- Spread is the standard deviation of the ensemble.
- The function modifies the input DataFrame by adding new columns for the calculated statistics.
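The three per-row columns named in the help text can be sketched with plain pandas. This is a minimal sketch under stated assumptions, not the pyDARTdiags implementation: it assumes bias is the ensemble mean minus the observation, squared error is the square of that difference, and total variance is the squared spread plus the observation error variance. The toy values are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "observation": [10.0, 12.0],
    "prior_ensemble_mean": [11.0, 11.5],
    "prior_ensemble_spread": [2.0, 1.0],
    "obs_err_var": [1.0, 0.5],
})

# Assumed definitions (the sign convention for bias is an assumption).
toy["prior_bias"] = toy["prior_ensemble_mean"] - toy["observation"]
toy["prior_sq_err"] = toy["prior_bias"] ** 2
# Spread is a standard deviation, so it is squared before being
# added to the observation error variance.
toy["prior_totalvar"] = toy["prior_ensemble_spread"] ** 2 + toy["obs_err_var"]

print(toy[["prior_sq_err", "prior_bias", "prior_totalvar"]])
```

The same pattern would apply to the posterior columns when they are present.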
Summarize the grand statistics, which are the statistics aggregated over all observations of each observation type.
stats.grand_statistics(used_obs)
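To make the aggregation step concrete, the sketch below groups per-row statistics by observation type and reduces them to one row per type. It is an assumption-laden illustration, not the body of grand_statistics: the output column names (prior_rmse, prior_totalspread) are hypothetical, the RMSE and total spread are assumed to be square roots of the group means, and the input values are invented.

```python
import pandas as pd

# Toy per-row statistics, invented for illustration only.
rows = pd.DataFrame({
    "type": ["TYPE_A", "TYPE_A", "TYPE_B"],
    "prior_sq_err": [1.0, 9.0, 4.0],
    "prior_bias": [1.0, -3.0, 2.0],
    "prior_totalvar": [4.0, 4.0, 9.0],
})

# Mean of each per-row statistic within each observation type.
means = rows.groupby("type")[
    ["prior_sq_err", "prior_bias", "prior_totalvar"]
].mean()

# Hypothetical output columns; the library's names may differ.
grand = pd.DataFrame({
    "prior_rmse": means["prior_sq_err"] ** 0.5,
    "prior_bias": means["prior_bias"],
    "prior_totalspread": means["prior_totalvar"] ** 0.5,
})

print(grand)
```

Averaging squared quantities first and taking the square root afterwards keeps the RMSE and total spread in the same units as the observations.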