Removing Duplicate Rows from a Pandas DataFrame

The Pandas library provides the drop_duplicates() method for removing duplicate rows from a DataFrame. It keeps only the distinct rows (or the rows that are distinct within a chosen subset of columns), which helps keep downstream analysis accurate.

drop_duplicates(subset=None, keep="first", inplace=False)
  1. subset: a column label or sequence of column labels to consider when identifying duplicates; by default all columns are used.
  2. keep: {'first', 'last', False}, default 'first'

     Value     Description
     'first'   Drop duplicates except for the first occurrence.
     'last'    Drop duplicates except for the last occurrence.
     False     Drop all duplicates.
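
For a quick feel for the three keep options, here is a minimal sketch on a throwaway one-column DataFrame (demo is just an illustrative name):

import pandas as pd

demo = pd.DataFrame({'x': [1, 1, 2]})

demo.drop_duplicates(keep='first')   # keeps rows 0 and 2
demo.drop_duplicates(keep='last')    # keeps rows 1 and 2
demo.drop_duplicates(keep=False)     # keeps only row 2, since x == 1 is duplicated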

drop_duplicates() method

The drop_duplicates() method scans the DataFrame, keeps a single occurrence of each duplicated row (or drops them all, depending on keep), and returns the result as a new DataFrame unless inplace=True is passed. What remains is a set of unique rows that will not skew counts or aggregations.

Let's create a DataFrame:

import pandas as pd

df = pd.DataFrame()
df['A'] = [1, 1, 2, 2, 3, 4, 4, 4, 5]
df['B'] = [10, 20, 30, 40, 10, 30, 10, 40, 20]
df

   A   B
0  1  10
1  1  20
2  2  30
3  2  40
4  3  10
5  4  30
6  4  10
7  4  40
8  5  20

Drop all rows that have duplicate values in column "A"

df.drop_duplicates(subset="A", keep=False)

   A   B
4  3  10
8  5  20

The same result can be achieved with DataFrame.groupby(), by keeping only the groups that contain a single row:

df.groupby(["A"]).filter(lambda df:df.shape[0] == 1)
A B 4 3 10 8 5 20

Through the subset parameter, drop_duplicates() lets you choose which columns define uniqueness, so the deduplication can be tailored to the columns that actually matter for your analysis.
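
For example, uniqueness can be judged on a single column even when the other columns differ. Here is a small sketch with a hypothetical events table (the column and value names are invented for illustration):

import pandas as pd

events = pd.DataFrame({'user': ['anna', 'anna', 'ben'],
                       'ts': ['09:00', '09:05', '09:10']})

# Rows count as duplicates when 'user' matches; the differing 'ts' is ignored.
events.drop_duplicates(subset='user')

   user     ts
0  anna  09:00
2   ben  09:10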

Drop duplicates except for the first occurrence

df.drop_duplicates(subset="A", keep="first")

   A   B
0  1  10
2  2  30
4  3  10
5  4  30
8  5  20

Drop duplicates except for the last occurrence

df.drop_duplicates(subset="A", keep="last")

   A   B
1  1  20
3  2  40
4  3  10
7  4  40
8  5  20

Drop duplicates based on multiple columns

df.drop_duplicates(subset=['A', 'B'], keep=False)
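
In this particular DataFrame every (A, B) pair is already unique, so the call above returns all nine rows unchanged. To see the effect, first append a fully duplicated row (a small sketch; df2 is just an illustrative name):

# Append a copy of row 0 so that the pair (1, 10) occurs twice.
df2 = pd.concat([df, df.iloc[[0]]], ignore_index=True)
df2.drop_duplicates(subset=['A', 'B'], keep=False)   # both (1, 10) rows are dropped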

Keeping the row with the highest value

Remove duplicates by column A, keeping the row with the highest value in column B. Sorting by B in descending order first means the occurrence that drop_duplicates() keeps (the first one) is the row with the largest B; sort_index() then restores the original row order.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10
7  4  40
8  5  20

The same result can be achieved with DataFrame.groupby(), using idxmax() to locate the row with the largest B within each group:

df.loc[df.groupby('A')['B'].idxmax()]

   A   B
1  1  20
3  2  40
4  3  10
7  4  40
8  5  20

drop_duplicates() is also fast. It is built on pandas' hash-based, vectorized internals rather than Python-level loops, so it scales well to large DataFrames and leaves your time free for the actual analysis.
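
A rough way to check this yourself is the standard timeit module; the synthetic DataFrame below is arbitrary, and actual timings depend on your machine and pandas version:

import timeit

import numpy as np
import pandas as pd

# One million rows with heavily repeated values in column 'A'.
big = pd.DataFrame({'A': np.random.randint(0, 1000, 1_000_000),
                    'B': np.random.rand(1_000_000)})

print(timeit.timeit(lambda: big.drop_duplicates(subset='A'), number=10))
print(timeit.timeit(lambda: big.groupby('A').head(1), number=10))   # equivalent keep='first' dedup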

Find duplicate values in a specific column

df.A.duplicated()
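
Because duplicated() returns a Boolean Series, it can also serve as a mask to view the duplicated rows themselves:

df[df.A.duplicated(keep=False)]   # every row whose value in A occurs more than once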

Count duplicate values in a specific column

df.A.duplicated().sum()   # 4 in this DataFrame (rows 1, 3, 6 and 7)

Count duplicate rows in a DataFrame

df.duplicated().sum()   # 0 here: every full row is unique

Count duplicate rows based on certain column(s)

df.duplicated(subset=['A', 'B']).sum()   # also 0: every (A, B) pair is unique
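
To see which values are behind such counts, one option is to filter value_counts() down to the entries that occur more than once (a small sketch):

counts = df['A'].value_counts()
counts[counts > 1]   # A values 4, 1 and 2 each appear more than once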

Conclusion

The Pandas drop_duplicates() method is a small but essential part of the data analysis toolkit. With subset and keep you control exactly what counts as a duplicate and which occurrence survives, making it straightforward to keep DataFrames clean and analysis results trustworthy.