Finding Duplicates in an Observation Sequence

This example shows how to find duplicates in an observation sequence, and how to remove them.

Import the obs_sequence module

import os
import pydartdiags.obs_sequence.obs_sequence as obsq

Read in the observation sequence file. In this example we'll use a real obs_seq file, NCEP+ACARS.201303_6H.obs_seq2013030306, which ships with the pyDARTdiags package and contains 6 hours of observations from March 3, 2013.

data_dir = os.path.join(os.getcwd(), "../..", "data")
data_file = os.path.join(data_dir, "NCEP+ACARS.201303_6H.obs_seq2013030306")

obs_seq = obsq.ObsSequence(data_file)

How many observations are in the sequence?

num_obs = len(obs_seq.df)
print(f"Number of observations: {num_obs}")
Number of observations: 317038

How many of each type of observation are in the sequence?

obs_seq.df.groupby('type')['type'].count()
type
ACARS_TEMPERATURE               21276
ACARS_U_WIND_COMPONENT          21672
ACARS_V_WIND_COMPONENT          21672
AIRCRAFT_TEMPERATURE            21576
AIRCRAFT_U_WIND_COMPONENT       21833
AIRCRAFT_V_WIND_COMPONENT       21833
LAND_SFC_ALTIMETER              19781
MARINE_SFC_ALTIMETER             8234
MARINE_SFC_SPECIFIC_HUMIDITY     2337
MARINE_SFC_TEMPERATURE           5838
MARINE_SFC_U_WIND_COMPONENT      4985
MARINE_SFC_V_WIND_COMPONENT      4985
RADIOSONDE_SPECIFIC_HUMIDITY      361
RADIOSONDE_SURFACE_ALTIMETER       16
RADIOSONDE_TEMPERATURE            655
RADIOSONDE_U_WIND_COMPONENT      2194
RADIOSONDE_V_WIND_COMPONENT      2194
SAT_U_WIND_COMPONENT            67798
SAT_V_WIND_COMPONENT            67798
Name: type, dtype: int64
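
As an aside, the same per-type counts can be obtained a little more concisely with plain pandas value_counts; sorting the index reproduces the alphabetical ordering above. This is just an alternative, not a pyDARTdiags-specific method.

# equivalent per-type counts; sort_index() orders the result by type name
obs_seq.df['type'].value_counts().sort_index()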

How many duplicates are there in the sequence? First, let's pick the columns to compare when deciding whether two observations are duplicates; in this case we'll use latitude, longitude, vertical, time, observation, and type. We can then use pandas' duplicated method to count the duplicates.

columns_to_compare = ['latitude', 'longitude', 'vertical', 'time', 'observation', 'type']
num_dups = obs_seq.df.duplicated(subset=columns_to_compare).sum()
print(f"Number of duplicates: {num_dups}")
Number of duplicates: 1933
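
Note that duplicated() leaves the first occurrence of each group unflagged by default (keep='first'), so the count above is the number of surplus copies. To count every row that belongs to a duplicate group, including the first occurrences, you can pass keep=False; a minimal sketch using the same columns_to_compare:

# keep=False flags every member of a duplicate group, not just the repeats
num_involved = obs_seq.df.duplicated(subset=columns_to_compare, keep=False).sum()
print(f"Rows involved in duplicate groups: {num_involved}")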

Let's see how many duplicates there are for each type of observation, applying duplicated to each type in turn. A vectorized alternative is sketched after the output below.

for obs_type in sorted(obs_seq.types.values()):
    selected_rows = obs_seq.df[obs_seq.df['type'] == obs_type]
    print(f"duplicates in {obs_type}: ", selected_rows[columns_to_compare].duplicated().sum())
duplicates in ACARS_TEMPERATURE:  15
duplicates in ACARS_U_WIND_COMPONENT:  16
duplicates in ACARS_V_WIND_COMPONENT:  16
duplicates in AIRCRAFT_TEMPERATURE:  68
duplicates in AIRCRAFT_U_WIND_COMPONENT:  69
duplicates in AIRCRAFT_V_WIND_COMPONENT:  69
duplicates in LAND_SFC_ALTIMETER:  616
duplicates in MARINE_SFC_ALTIMETER:  217
duplicates in MARINE_SFC_SPECIFIC_HUMIDITY:  123
duplicates in MARINE_SFC_TEMPERATURE:  200
duplicates in MARINE_SFC_U_WIND_COMPONENT:  187
duplicates in MARINE_SFC_V_WIND_COMPONENT:  186
duplicates in RADIOSONDE_SPECIFIC_HUMIDITY:  0
duplicates in RADIOSONDE_SURFACE_ALTIMETER:  0
duplicates in RADIOSONDE_TEMPERATURE:  0
duplicates in RADIOSONDE_U_WIND_COMPONENT:  0
duplicates in RADIOSONDE_V_WIND_COMPONENT:  0
duplicates in SAT_U_WIND_COMPONENT:  69
duplicates in SAT_V_WIND_COMPONENT:  82
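
Because 'type' is one of the comparison columns, a row can only duplicate another row of the same type, so the per-type counts can also be computed in one vectorized step instead of a loop. Types with no duplicates simply won't appear in the result. A sketch:

# surplus (duplicated) rows, counted per observation type
dup_rows = obs_seq.df[obs_seq.df.duplicated(subset=columns_to_compare)]
dup_rows.groupby('type')['type'].count()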

Let's look at the duplicates of type LAND_SFC_ALTIMETER, sorting by latitude so the duplicate pairs sit next to each other.

selected_rows = obs_seq.df[obs_seq.df['type'] == 'LAND_SFC_ALTIMETER']
duplicate_mask = selected_rows[columns_to_compare].duplicated(keep=False)
duplicate_rows = selected_rows[duplicate_mask]

duplicate_rows.sort_values(by='latitude')
(index) | obs_num | observation | NCEP_QC_index | linked_list      | longitude | latitude | vertical | vert_unit   | type               | metadata | external_FO | seconds | days   | time                | obs_err_var
313408  | 313409  | 1016.865887 | 15.0          | 313408 313410 -1 | 40.33     | -22.32   | 13.0     | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
313866  | 313867  | 1016.865887 | 15.0          | 313866 313868 -1 | 40.33     | -22.32   | 13.0     | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
313865  | 313866  | 1014.502434 | 15.0          | 313865 313867 -1 | 42.70     | -17.05   | 10.0     | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
313407  | 313408  | 1014.502434 | 15.0          | 313407 313409 -1 | 42.70     | -17.05   | 10.0     | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
310087  | 310088  | 1013.941448 | 15.0          | 310087 310089 -1 | 45.28     | -12.80   | 7.0      | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
...     | ...     | ...         | ...           | ...              | ...       | ...      | ...      | ...         | ...                | ...      | ...         | ...     | ...    | ...                 | ...
270665  | 270666  | 1009.972271 | 15.0          | 270665 270667 -1 | 26.65     | 67.37    | 179.0    | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 31140   | 150541 | 2013-03-03 08:39:00 | 1.0
311873  | 311874  | 1010.826543 | 15.0          | 311873 311875 -1 | 27.42     | 68.62    | 148.0    | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
309527  | 309528  | 1010.826543 | 15.0          | 309527 309529 -1 | 27.42     | 68.62    | 148.0    | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 32400   | 150541 | 2013-03-03 09:00:00 | 1.0
274040  | 274041  | 1010.755425 | 15.0          | 274039 274042 -1 | 27.03     | 69.75    | 101.0    | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 31140   | 150541 | 2013-03-03 08:39:00 | 1.0
270664  | 270665  | 1010.755425 | 15.0          | 270664 270666 -1 | 27.03     | 69.75    | 101.0    | surface (m) | LAND_SFC_ALTIMETER | []       | []          | 31140   | 150541 | 2013-03-03 08:39:00 | 1.0

1232 rows × 15 columns
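
For the pairs shown here, the rows differ only in their obs_num and linked_list entries, which record where each copy sits in the original file. You can check this for any pair by transposing two rows and comparing them column by column; a quick sanity check using the duplicate_rows DataFrame from above:

# take the first duplicate pair (the two rows at latitude -22.32 above)
pair = duplicate_rows.sort_values(by='latitude').iloc[:2]
# transpose so the two rows become columns, making differences easy to scan
print(pair.T)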



Let's remove all the duplicates from the DataFrame.

obs_seq.df.drop_duplicates(subset=columns_to_compare, inplace=True)
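
An equivalent form avoids inplace=True and resets the index so the row labels are contiguous again; keep='first' is the default, so one copy of each duplicate group is retained. A sketch:

# same result without inplace=True; drop=True discards the old index labels
obs_seq.df = obs_seq.df.drop_duplicates(subset=columns_to_compare).reset_index(drop=True)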

The number of observations has been reduced by the number of duplicates:

print(f"Original number of observations: {num_obs}")
print(f"Number of duplicate observations: {num_dups}")
print(f"Number of observations after removing duplicates: {len(obs_seq.df)}")
Original number of observations: 317038
Number of duplicate observations: 1933
Number of observations after removing duplicates: 315105
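
If you want to keep the deduplicated sequence, you can write it back out to a new obs_seq file. The sketch below assumes ObsSequence provides a write_obs_seq method that takes an output path; check the pyDARTdiags documentation for the exact writer name and signature. The output filename is just an example.

# hypothetical writer call; verify the method name in the pyDARTdiags docs
output_file = os.path.join(data_dir, "NCEP+ACARS.201303_6H.obs_seq2013030306.dedup")
obs_seq.write_obs_seq(output_file)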

Total running time of the script: (0 minutes 4.104 seconds)
