Removing Duplicate rows from Pandas DataFrame
Pandas drop_duplicates() returns only the dataframe's unique values, optionally only considering certain columns.
drop_duplicates(subset=None, keep="first", inplace=False)
- subset: Subset takes a column or list of column label.
- keep : {'first', 'last', False}, default 'first'
Parameter | Description |
---|---|
first | Drop duplicates except for the first occurrence. |
last | Drop duplicates except for the last occurrence. |
False | Drop all duplicates. |
Lets create a DataFrame..
df = pd.DataFrame()
df['A'] = [1, 1, 2,2,3,4,4,4,5]
df['B'] = [10, 20, 30,40,10,30,10,40,20]
df
A B
0 1 10
1 1 20
2 2 30
3 2 40
4 3 10
5 4 30
6 4 10
7 4 40
8 5 20
Drop all duplicate values from column "A"
df.drop_duplicates(subset ="A",keep = False)
A B
4 3 10
8 5 20
The same result you can achieved with DataFrame.groupby()
df.groupby(["A"]).filter(lambda df:df.shape[0] == 1)
A B
4 3 10
8 5 20
Drop duplicates except for the first occurrence
df.drop_duplicates(subset ="A",keep = 'first')
0 1 10
2 2 30
4 3 10
5 4 30
8 5 20
Drop duplicates except for the last occurrence
df.drop_duplicates(subset ="A",keep = 'last')
A B
1 1 20
3 2 40
4 3 10
7 4 40
8 5 20
Drop duplicates based on multiple columns
df.drop_duplicates(subset=['A','B'], keep=False)
Keeping the row with the highest value
Remove duplicates by columns A and keeping the row with the highest value in column B
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
A B
1 1 20
3 2 40
4 3 10
7 4 40
8 5 20
The same result you can achieved with DataFrame.groupby()
df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
A B
1 1 20
3 2 40
4 3 10
7 4 40
8 5 20
Find duplicate rows on a specific column?
df.A.duplicated()
Count duplicate rows on a specific column
df.A.duplicated().sum()
Count duplicate rows in a DataFrame
df.duplicated().sum()
Count duplicate rows on certain column(s)
df.duplicated(subset=['A', 'B']).sum()
Related Topics
- Pandas DataFrame: GroupBy Examples
- Pandas DataFrame Aggregation and Grouping
- How to Sort Pandas DataFrame
- Pandas DataFrame: query() function
- How to Replace NaN Values With Zeros in Pandas DataFrame
- How to read CSV File using Pandas DataFrame.read_csv()
- How to Convert Pandas DataFrame to NumPy Array
- How to shuffle a DataFrame rows
- Import multiple csv files into one pandas DataFrame
- Create new column in DataFrame based on the existing columns
- New Pandas dataframe column based on if-else condition
- How to Convert a Dictionary to Pandas DataFrame
- Rename Pandas columns/index names (labels)