Removing Duplicate Rows from a Pandas DataFrame

Pandas drop_duplicates() returns a DataFrame with duplicate rows removed, optionally considering only a subset of columns.
drop_duplicates(subset=None, keep="first", inplace=False)
  1. subset: a single column label or list of column labels to consider; defaults to all columns.
  2. keep: {'first', 'last', False}, default 'first'
     'first' : drop duplicates except for the first occurrence.
     'last'  : drop duplicates except for the last occurrence.
     False   : drop all duplicates.
  3. inplace: if True, remove duplicates in place and return None.
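The inplace parameter from the signature is easy to get wrong: it mutates the DataFrame and returns None. A minimal sketch on a small hypothetical frame:

```python
import pandas as pd

# a tiny frame with one fully duplicated row
df = pd.DataFrame({'A': [1, 1, 2], 'B': [10, 10, 30]})

# inplace=True mutates df and returns None instead of a new DataFrame
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2: the duplicate row was removed from df itself
```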

Let's create a DataFrame:

import pandas as pd

df = pd.DataFrame()
df['A'] = [1, 1, 2, 2, 3, 4, 4, 4, 5]
df['B'] = [10, 20, 30, 40, 10, 30, 10, 40, 20]
df

   A   B
0  1  10
1  1  20
2  2  30
3  2  40
4  3  10
5  4  30
6  4  10
7  4  40
8  5  20

Drop all duplicate values from column "A"

df.drop_duplicates(subset='A', keep=False)

   A   B
4  3  10
8  5  20
The same result can be achieved with DataFrame.groupby():
df.groupby(['A']).filter(lambda g: g.shape[0] == 1)

   A   B
4  3  10
8  5  20

Drop duplicates except for the first occurrence

df.drop_duplicates(subset='A', keep='first')

   A   B
0  1  10
2  2  30
4  3  10
5  4  30
8  5  20

Drop duplicates except for the last occurrence

df.drop_duplicates(subset='A', keep='last')

   A   B
1  1  20
3  2  40
4  3  10
7  4  40
8  5  20

Drop duplicates based on multiple columns

df.drop_duplicates(subset=['A','B'], keep=False)

In this DataFrame every (A, B) pair is unique, so no rows are dropped and the full DataFrame comes back.
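To see multi-column deduplication actually remove something, here is a small hypothetical frame (not the df above) in which the pair (1, 10) occurs twice:

```python
import pandas as pd

# hypothetical frame where the full (A, B) pair (1, 10) appears twice
df2 = pd.DataFrame({'A': [1, 1, 2], 'B': [10, 10, 30]})

# keep=False drops every copy of the repeated (A, B) pair
out = df2.drop_duplicates(subset=['A', 'B'], keep=False)
print(out)
#    A   B
# 2  2  30
```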

Keeping the row with the highest value

Remove duplicates by column A, keeping the row with the highest value in column B.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10
7  4  40
8  5  20
The same result can be achieved with DataFrame.groupby() and idxmax():
df.loc[df.groupby('A')['B'].idxmax()]

   A   B
1  1  20
3  2  40
4  3  10
7  4  40
8  5  20

Find duplicate rows on a specific column

df.A.duplicated()
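duplicated() returns a boolean mask, so it can also be used to select the duplicate rows themselves; with keep=False every occurrence of a repeated value is marked (df here is the frame defined earlier):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 4, 4, 4, 5],
                   'B': [10, 20, 30, 40, 10, 30, 10, 40, 20]})

# keep=False marks every occurrence of a repeated value in A
dupes = df[df.A.duplicated(keep=False)]
print(dupes.index.tolist())  # [0, 1, 2, 3, 5, 6, 7] -- rows 4 and 8 are unique in A
```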

Count duplicate rows on a specific column

df.A.duplicated().sum()

Count duplicate rows in a DataFrame

df.duplicated().sum()

Count duplicate rows on certain column(s)

df.duplicated(subset=['A', 'B']).sum()