Finding Duplicates in an Observation Sequence
This example shows how to find duplicates in an observation sequence, and how to remove them.
Import the obs_sequence module
import os
import pydartdiags.obs_sequence.obs_sequence as obsq
Read in the observation sequence file. In this example we’ll use a real obs_seq file, the NCEP+ACARS.201303_6H.obs_seq2013030306 file that comes with the pyDARTdiags package. This is 6 hours of observations from March 3, 2013.
data_dir = os.path.join(os.getcwd(), "../..", "data")
data_file = os.path.join(data_dir, "NCEP+ACARS.201303_6H.obs_seq2013030306")
obs_seq = obsq.ObsSequence(data_file)
How many observations are in the sequence?
num_obs = len(obs_seq.df)
print(f"Number of observations: {num_obs}")
Number of observations: 317038
How many of each type of observation are in the sequence?
obs_seq.df.groupby('type')['type'].count()
type
ACARS_TEMPERATURE 21276
ACARS_U_WIND_COMPONENT 21672
ACARS_V_WIND_COMPONENT 21672
AIRCRAFT_TEMPERATURE 21576
AIRCRAFT_U_WIND_COMPONENT 21833
AIRCRAFT_V_WIND_COMPONENT 21833
LAND_SFC_ALTIMETER 19781
MARINE_SFC_ALTIMETER 8234
MARINE_SFC_SPECIFIC_HUMIDITY 2337
MARINE_SFC_TEMPERATURE 5838
MARINE_SFC_U_WIND_COMPONENT 4985
MARINE_SFC_V_WIND_COMPONENT 4985
RADIOSONDE_SPECIFIC_HUMIDITY 361
RADIOSONDE_SURFACE_ALTIMETER 16
RADIOSONDE_TEMPERATURE 655
RADIOSONDE_U_WIND_COMPONENT 2194
RADIOSONDE_V_WIND_COMPONENT 2194
SAT_U_WIND_COMPONENT 67798
SAT_V_WIND_COMPONENT 67798
Name: type, dtype: int64
How many duplicates are there in the sequence? Let’s pick the columns to compare when deciding whether an observation is a duplicate. In this case we’ll use latitude, longitude, vertical, time, observation, and type. The pandas ‘duplicated’ method flags every row that repeats an earlier row on those columns, so the first occurrence of each observation is not counted as a duplicate.
columns_to_compare = ['latitude', 'longitude', 'vertical', 'time', 'observation', 'type']
num_dups = obs_seq.df.duplicated(subset=columns_to_compare).sum()
print(f"Number of duplicates: {num_dups}")
Number of duplicates: 1933
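To make the counting rule concrete, here is a minimal sketch on a toy DataFrame. The data and the names toy, cols, later_copies and all_copies are made up for illustration and are not part of pyDARTdiags. It assumes only pandas and contrasts the default keep="first" with keep=False.

```python
import pandas as pd

# Toy stand-in for obs_seq.df; the values are invented for illustration.
toy = pd.DataFrame({
    "latitude": [10.0, 10.0, 10.0, 20.0],
    "longitude": [50.0, 50.0, 50.0, 60.0],
    "observation": [1.5, 1.5, 1.5, 2.5],
    "type": ["A", "A", "A", "B"],
})
cols = ["latitude", "longitude", "observation", "type"]

# Default keep="first": only the second and later copies are flagged.
later_copies = toy.duplicated(subset=cols)

# keep=False: every member of a duplicated group is flagged.
all_copies = toy.duplicated(subset=cols, keep=False)

n_later = int(later_copies.sum())
n_all = int(all_copies.sum())
print(f"flagged with keep='first': {n_later}")
print(f"flagged with keep=False:   {n_all}")
```

With three identical rows, the default flags two of them, which is the number of rows that removing duplicates would drop.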
Let’s see how many duplicates there are for each type of observation, again using the ‘duplicated’ method, this time on the rows of each type.
for obs_type in sorted(obs_seq.types.values()):
    selected_rows = obs_seq.df[obs_seq.df['type'] == obs_type]
    print(f"duplicates in {obs_type}: ", selected_rows[columns_to_compare].duplicated().sum())
duplicates in ACARS_TEMPERATURE: 15
duplicates in ACARS_U_WIND_COMPONENT: 16
duplicates in ACARS_V_WIND_COMPONENT: 16
duplicates in AIRCRAFT_TEMPERATURE: 68
duplicates in AIRCRAFT_U_WIND_COMPONENT: 69
duplicates in AIRCRAFT_V_WIND_COMPONENT: 69
duplicates in LAND_SFC_ALTIMETER: 616
duplicates in MARINE_SFC_ALTIMETER: 217
duplicates in MARINE_SFC_SPECIFIC_HUMIDITY: 123
duplicates in MARINE_SFC_TEMPERATURE: 200
duplicates in MARINE_SFC_U_WIND_COMPONENT: 187
duplicates in MARINE_SFC_V_WIND_COMPONENT: 186
duplicates in RADIOSONDE_SPECIFIC_HUMIDITY: 0
duplicates in RADIOSONDE_SURFACE_ALTIMETER: 0
duplicates in RADIOSONDE_TEMPERATURE: 0
duplicates in RADIOSONDE_U_WIND_COMPONENT: 0
duplicates in RADIOSONDE_V_WIND_COMPONENT: 0
duplicates in SAT_U_WIND_COMPONENT: 69
duplicates in SAT_V_WIND_COMPONENT: 82
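Because ‘type’ is one of the compared columns, the same per-type counts can probably be obtained without a loop by grouping a single duplicate mask by type. The following is a sketch on invented toy data, not the package's own method; toy, cols, dup_mask and per_type are illustrative names.

```python
import pandas as pd

# Toy stand-in for obs_seq.df; the values are invented for illustration.
toy = pd.DataFrame({
    "latitude": [10.0, 10.0, 20.0, 20.0, 30.0],
    "observation": [1.5, 1.5, 2.5, 2.5, 3.5],
    "type": ["A", "A", "B", "B", "B"],
})
cols = ["latitude", "observation", "type"]

# One mask for the whole frame. Rows of different types can never match
# each other because "type" is part of the comparison.
dup_mask = toy.duplicated(subset=cols)

# Sum the boolean mask within each type to count duplicates per type.
per_type = dup_mask.groupby(toy["type"]).sum()
print(per_type)
```

The per-type counts add up to the overall duplicate count, just as the counts listed above add up to 1933.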
Let’s look at the duplicates in the ‘LAND_SFC_ALTIMETER’ type. We’re sorting by latitude to make it easier to see the duplicates.
selected_rows = obs_seq.df[obs_seq.df['type'] == 'LAND_SFC_ALTIMETER']
duplicate_mask = selected_rows[columns_to_compare].duplicated(keep=False)
duplicate_rows = selected_rows[duplicate_mask]
duplicate_rows.sort_values(by='latitude')
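Sorting shows the duplicated rows side by side. To see how many copies each duplicated observation has, one option is to group by the comparison columns and count the group sizes. This is a sketch on invented toy data; toy, cols, copies and repeated are illustrative names. Note that groupby drops rows with NaN keys by default, whereas ‘duplicated’ treats NaN values as equal, so the two can disagree on data with missing values.

```python
import pandas as pd

# Hypothetical toy data standing in for the LAND_SFC_ALTIMETER rows.
toy = pd.DataFrame({
    "latitude": [35.0, 35.0, 35.0, 40.0, 40.0, 45.0],
    "longitude": [100.0, 100.0, 100.0, 110.0, 110.0, 120.0],
    "observation": [1013.0, 1013.0, 1013.0, 1009.5, 1009.5, 1001.0],
})
cols = ["latitude", "longitude", "observation"]

# Number of rows sharing each combination of the comparison columns.
copies = toy.groupby(cols).size()

# Keep only the combinations that occur more than once.
repeated = copies[copies > 1]
print(repeated)
```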
Let’s remove all the duplicates from the DataFrame. By default, ‘drop_duplicates’ keeps the first occurrence of each duplicated observation.
obs_seq.df.drop_duplicates(subset=columns_to_compare, inplace=True)
The number of observations has been reduced by the number of duplicates: 317038 - 1933 = 315105.
print(f"Original number of observations: {num_obs}")
print(f"Number of duplicate observations: {num_dups}")
print(f"Number of observations after removing duplicates: {len(obs_seq.df)}")
Original number of observations: 317038
Number of duplicate observations: 1933
Number of observations after removing duplicates: 315105
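If the original DataFrame should stay untouched, a non-inplace variant is possible. The sketch below uses invented toy data and illustrative names (toy, cols, deduped, n_removed). It drops duplicates into a new DataFrame, resets the index, and checks that no duplicates remain.

```python
import pandas as pd

# Toy stand-in for obs_seq.df; the values are invented for illustration.
toy = pd.DataFrame({
    "latitude": [10.0, 10.0, 20.0, 20.0, 30.0],
    "observation": [1.5, 1.5, 2.5, 2.5, 3.5],
    "type": ["A", "A", "B", "B", "B"],
})
cols = ["latitude", "observation", "type"]

n_dups = int(toy.duplicated(subset=cols).sum())

# Return a new DataFrame instead of modifying toy in place, and renumber
# the rows so the index has no gaps.
deduped = toy.drop_duplicates(subset=cols).reset_index(drop=True)

n_removed = len(toy) - len(deduped)
remaining = int(deduped.duplicated(subset=cols).sum())
print(f"removed {n_removed} rows, {remaining} duplicates remain")
```

Whether resetting the index is appropriate depends on whether later steps rely on the original row labels.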