Finding Duplicates in an Observation Sequence

This example shows how to find duplicates in an observation sequence, and how to remove them.

Import the obs_sequence module

import os
import pydartdiags.obs_sequence.obs_sequence as obsq

Read in the observation sequence file. In this example we'll use a real obs_seq file, NCEP+ACARS.201303_6H.obs_seq2013030306, which ships with the pyDARTdiags package and contains 6 hours of observations from March 3, 2013.

data_dir = os.path.join(os.getcwd(), "../..", "data")
data_file = os.path.join(data_dir, "NCEP+ACARS.201303_6H.obs_seq2013030306")

obs_seq = obsq.ObsSequence(data_file)

How many observations are in the sequence?

num_obs = len(obs_seq.df)
print(f"Number of observations: {num_obs}")
Number of observations: 317038

How many of each type of observation are in the sequence?

obs_seq.df.groupby('type')['type'].count()
type
ACARS_TEMPERATURE               21276
ACARS_U_WIND_COMPONENT          21672
ACARS_V_WIND_COMPONENT          21672
AIRCRAFT_TEMPERATURE            21576
AIRCRAFT_U_WIND_COMPONENT       21833
AIRCRAFT_V_WIND_COMPONENT       21833
LAND_SFC_ALTIMETER              19781
MARINE_SFC_ALTIMETER             8234
MARINE_SFC_SPECIFIC_HUMIDITY     2337
MARINE_SFC_TEMPERATURE           5838
MARINE_SFC_U_WIND_COMPONENT      4985
MARINE_SFC_V_WIND_COMPONENT      4985
RADIOSONDE_SPECIFIC_HUMIDITY      361
RADIOSONDE_SURFACE_ALTIMETER       16
RADIOSONDE_TEMPERATURE            655
RADIOSONDE_U_WIND_COMPONENT      2194
RADIOSONDE_V_WIND_COMPONENT      2194
SAT_U_WIND_COMPONENT            67798
SAT_V_WIND_COMPONENT            67798
Name: type, dtype: int64
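
As an aside, the same per-type counts can be obtained a little more concisely with plain pandas value_counts; sorting the index reproduces the alphabetical ordering above. This is just an alternative, not a pyDARTdiags-specific method.

# equivalent per-type counts; sort_index() orders the result by type name
obs_seq.df['type'].value_counts().sort_index()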

How many duplicates are there in the sequence? First, let's pick the columns to compare when deciding whether two observations are duplicates; in this case we'll use latitude, longitude, vertical, time, observation, and type. We can then use pandas' duplicated method to count the duplicates.

columns_to_compare = ['latitude', 'longitude', 'vertical', 'time', 'observation', 'type']
num_dups = obs_seq.df.duplicated(subset=columns_to_compare).sum()
print(f"Number of duplicates: {num_dups}")
Number of duplicates: 1933
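
Note that duplicated() leaves the first occurrence of each group unflagged by default (keep='first'), so the count above is the number of surplus copies. To count every row that belongs to a duplicate group, including the first occurrences, you can pass keep=False; a minimal sketch using the same columns_to_compare:

# keep=False flags every member of a duplicate group, not just the repeats
num_involved = obs_seq.df.duplicated(subset=columns_to_compare, keep=False).sum()
print(f"Rows involved in duplicate groups: {num_involved}")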

Let's see how many duplicates there are for each type of observation, applying duplicated to each type in turn. A vectorized alternative is sketched after the output below.

for obs_type in sorted(obs_seq.types.values()):
    selected_rows = obs_seq.df[obs_seq.df['type'] == obs_type]
    print(f"duplicates in {obs_type}: ", selected_rows[columns_to_compare].duplicated().sum())
duplicates in ACARS_TEMPERATURE:  15
duplicates in ACARS_U_WIND_COMPONENT:  16
duplicates in ACARS_V_WIND_COMPONENT:  16
duplicates in AIRCRAFT_TEMPERATURE:  68
duplicates in AIRCRAFT_U_WIND_COMPONENT:  69
duplicates in AIRCRAFT_V_WIND_COMPONENT:  69
duplicates in LAND_SFC_ALTIMETER:  616
duplicates in MARINE_SFC_ALTIMETER:  217
duplicates in MARINE_SFC_SPECIFIC_HUMIDITY:  123
duplicates in MARINE_SFC_TEMPERATURE:  200
duplicates in MARINE_SFC_U_WIND_COMPONENT:  187
duplicates in MARINE_SFC_V_WIND_COMPONENT:  186
duplicates in RADIOSONDE_SPECIFIC_HUMIDITY:  0
duplicates in RADIOSONDE_SURFACE_ALTIMETER:  0
duplicates in RADIOSONDE_TEMPERATURE:  0
duplicates in RADIOSONDE_U_WIND_COMPONENT:  0
duplicates in RADIOSONDE_V_WIND_COMPONENT:  0
duplicates in SAT_U_WIND_COMPONENT:  69
duplicates in SAT_V_WIND_COMPONENT:  82
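
Because 'type' is one of the comparison columns, a row can only duplicate another row of the same type, so the per-type counts can also be computed in one vectorized step instead of a loop. Types with no duplicates simply won't appear in the result. A sketch:

# surplus (duplicated) rows, counted per observation type
dup_rows = obs_seq.df[obs_seq.df.duplicated(subset=columns_to_compare)]
dup_rows.groupby('type')['type'].count()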

Let's look at the duplicates of type LAND_SFC_ALTIMETER, sorting by latitude so the duplicate pairs sit next to each other.

selected_rows = obs_seq.df[obs_seq.df['type'] == 'LAND_SFC_ALTIMETER']
duplicate_mask = selected_rows[columns_to_compare].duplicated(keep=False)
duplicate_rows = selected_rows[duplicate_mask]

duplicate_rows.sort_values(by='latitude')
(index) | obs_num | observation | NCEP_QC_index | linked_list      | longitude | latitude | vertical | vert_unit   | type               | metadata | external_FO | seconds | days   | time                | obs_err_var
313408  | 313409  | 1016.865887 | 15.0          | 313408 313410 -1 | 40.33     | -22.32   | 13.0     | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
313866  | 313867  | 1016.865887 | 15.0          | 313866 313868 -1 | 40.33     | -22.32   | 13.0     | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
313865  | 313866  | 1014.502434 | 15.0          | 313865 313867 -1 | 42.70     | -17.05   | 10.0     | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
313407  | 313408  | 1014.502434 | 15.0          | 313407 313409 -1 | 42.70     | -17.05   | 10.0     | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
310087  | 310088  | 1013.941448 | 15.0          | 310087 310089 -1 | 45.28     | -12.80   | 7.0      | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
...     | ...     | ...         | ...           | ...              | ...       | ...      | ...      | ...         | ...                | ...      | ...         | ...     | ...    | ...                 | ...
270665  | 270666  | 1009.972271 | 15.0          | 270665 270667 -1 | 26.65     | 67.37    | 179.0    | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 31140   | 150541 | 2013-03-03 08:39:00 | 1.0
311873  | 311874  | 1010.826543 | 15.0          | 311873 311875 -1 | 27.42     | 68.62    | 148.0    | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
309527  | 309528  | 1010.826543 | 15.0          | 309527 309529 -1 | 27.42     | 68.62    | 148.0    | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
274040  | 274041  | 1010.755425 | 15.0          | 274039 274042 -1 | 27.03     | 69.75    | 101.0    | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 31140   | 150541 | 2013-03-03 08:39:00 | 1.0
270664  | 270665  | 1010.755425 | 15.0          | 270664 270666 -1 | 27.03     | 69.75    | 101.0    | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 31140   | 150541 | 2013-03-03 08:39:00 | 1.0

1232 rows × 15 columns
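
For the pairs shown here, the rows differ only in their obs_num and linked_list entries, which record where each copy sits in the original file. You can check this for any pair by transposing two rows and comparing them column by column; a quick sanity check using the duplicate_rows DataFrame from above:

# take the first duplicate pair (the two rows at latitude -22.32 above)
pair = duplicate_rows.sort_values(by='latitude').iloc[:2]
# transpose so the two rows become columns, making differences easy to scan
print(pair.T)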



Let's remove all the duplicates from the DataFrame.

obs_seq.df.drop_duplicates(subset=columns_to_compare, inplace=True)
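
An equivalent form avoids inplace=True and resets the index so the row labels are contiguous again; keep='first' is the default, so one copy of each duplicate group is retained. A sketch:

# same result without inplace=True; drop=True discards the old index labels
obs_seq.df = obs_seq.df.drop_duplicates(subset=columns_to_compare).reset_index(drop=True)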

The number of observations has been reduced by the number of duplicates:

print(f"Original number of observations: {num_obs}")
print(f"Number of duplicate observations: {num_dups}")
print(f"Number of observations after removing duplicates: {len(obs_seq.df)}")
Original number of observations: 317038
Number of duplicate observations: 1933
Number of observations after removing duplicates: 315105
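
If you want to keep the deduplicated sequence, you can write it back out to a new obs_seq file. The sketch below assumes ObsSequence provides a write_obs_seq method that takes an output path; check the pyDARTdiags documentation for the exact writer name and signature. The output filename is just an example.

# hypothetical writer call; verify the method name in the pyDARTdiags docs
output_file = os.path.join(data_dir, "NCEP+ACARS.201303_6H.obs_seq2013030306.dedup")
obs_seq.write_obs_seq(output_file)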

Total running time of the script: (0 minutes 4.104 seconds)
