Pandas dataframe.groupby()

The groupby() method of a Pandas DataFrame groups rows that share the same values in one or more columns and reduces each group to a summary row, which makes aggregation and analysis straightforward. It answers questions such as "how many Apples does Steve have?" and returns the result in a compact table.

With groupby(), you can split a dataset by specific criteria and turn an unwieldy table into a structured summary. This is especially useful when you work with large datasets and want insight into specific subsets of the data.

Let's create a DataFrame with some values.

import pandas as pd
import numpy as np

df = pd.DataFrame()
df['Name'] = ['Doe', 'Doe', 'Mike', 'Steve', 'Doe', 'Doe', 'Mike', 'Doe', 'Mike']
df['Fruit'] = ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Orange', 'Grapes', 'Grapes']
df['Count'] = [20, 10, 20, 30, 10, 40, 50, 10, 30]
df

    Name   Fruit  Count
0    Doe   Apple     20
1    Doe   Apple     10
2   Mike   Apple     20
3  Steve  Orange     30
4    Doe  Orange     10
5    Doe  Orange     40
6   Mike  Orange     50
7    Doe  Grapes     10
8   Mike  Grapes     30

The table lists three people (Doe, Mike and Steve) who hold different kinds of fruit (Apple, Orange and Grapes). You can summarize this table with the DataFrame groupby() method.


Pandas dataframe.groupby() examples

Let's use a groupby operation to get the total count of each fruit for each person.

df.groupby(['Name','Fruit']).sum()
              Count
Name  Fruit
Doe   Apple      30
      Grapes     10
      Orange     50
Mike  Apple      20
      Grapes     30
      Orange     50
Steve Orange     30

groupby() works with many aggregate functions, from basic statistics (e.g., sum, count, mean, min, max) to custom operations. Applied to the grouped data, these functions produce per-group summaries and reveal patterns that are hard to see in the raw rows.
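As a minimal sketch of combining several aggregate functions in one call, the example below uses named aggregation on the same data. The output column names (total, average, largest, spread) and the lambda used as a custom operation are illustrative choices, not part of the original example.

```python
import pandas as pd

# Same data as the DataFrame built earlier in this article.
df = pd.DataFrame({
    'Name': ['Doe', 'Doe', 'Mike', 'Steve', 'Doe', 'Doe', 'Mike', 'Doe', 'Mike'],
    'Fruit': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Orange', 'Grapes', 'Grapes'],
    'Count': [20, 10, 20, 30, 10, 40, 50, 10, 30],
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregate function).
summary = df.groupby('Name').agg(
    total=('Count', 'sum'),
    average=('Count', 'mean'),
    largest=('Count', 'max'),
    # Custom operation (illustrative): range of the counts within each group.
    spread=('Count', lambda s: s.max() - s.min()),
)
print(summary)
```

Each row of summary describes one person, so several statistics can be compared side by side without running groupby() once per function.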

Apply reset_index()

df.groupby(['Name','Fruit'])['Count'].sum().reset_index()
    Name   Fruit  Count
0    Doe   Apple     30
1    Doe  Grapes     10
2    Doe  Orange     50
3   Mike   Apple     20
4   Mike  Grapes     30
5   Mike  Orange     50
6  Steve  Orange     30

Changing the order of the groupby columns changes how the result is nested:

df.groupby(['Fruit','Name']).sum()
              Count
Fruit  Name
Apple  Doe       30
       Mike      20
Grapes Doe       10
       Mike      30
Orange Doe       50
       Mike      50
       Steve     30
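The two groupings hold the same totals and differ only in how the index levels are nested. The following sketch checks this by swapping the index levels of one result and comparing it with the other. The variable names are illustrative.

```python
import pandas as pd

# Same data as the DataFrame built earlier in this article.
df = pd.DataFrame({
    'Name': ['Doe', 'Doe', 'Mike', 'Steve', 'Doe', 'Doe', 'Mike', 'Doe', 'Mike'],
    'Fruit': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Orange', 'Grapes', 'Grapes'],
    'Count': [20, 10, 20, 30, 10, 40, 50, 10, 30],
})

by_name = df.groupby(['Name', 'Fruit'])['Count'].sum()
by_fruit = df.groupby(['Fruit', 'Name'])['Count'].sum()

# Swap the two index levels of the Name-first result and re-sort it,
# so that it is laid out like the Fruit-first result.
reordered = by_name.swaplevel().sort_index()

print(list(reordered.index) == list(by_fruit.index))
print(list(reordered) == list(by_fruit))
```

Both comparisons should print True, since only the nesting order of the keys changes and not the aggregated values.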

Grouping by one or more columns lets you examine the data along several dimensions, which supports exploratory data analysis (EDA) and helps reveal relationships between columns.

Pivot Table

You can use pivot() to arrange the grouped data in a grid, with one row per name and one column per fruit.

df.groupby(['Name','Fruit'], as_index=False).sum().pivot(index='Name', columns='Fruit', values='Count').fillna(0)
Fruit  Apple  Grapes  Orange
Name
Doe     30.0    10.0    50.0
Mike    20.0    30.0    50.0
Steve    0.0     0.0    30.0
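As an alternative to chaining groupby() and pivot(), the same grid can be sketched with pivot_table(), which groups and aggregates in one step. The variable name grid is illustrative. fill_value=0 plays the role of fillna(0) above.

```python
import pandas as pd

# Same data as the DataFrame built earlier in this article.
df = pd.DataFrame({
    'Name': ['Doe', 'Doe', 'Mike', 'Steve', 'Doe', 'Doe', 'Mike', 'Doe', 'Mike'],
    'Fruit': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Orange', 'Grapes', 'Grapes'],
    'Count': [20, 10, 20, 30, 10, 40, 50, 10, 30],
})

# One row per name, one column per fruit, cells hold the summed counts.
grid = df.pivot_table(
    index='Name',
    columns='Fruit',
    values='Count',
    aggfunc='sum',
    fill_value=0,  # combinations that never occur (e.g. Steve/Apple) become 0
)
print(grid)
```

pivot_table() is convenient when the source rows contain duplicates for a Name/Fruit pair, because plain pivot() raises an error on duplicate index/column combinations unless the data is aggregated first.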

Find the total count of fruits by person

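Select the Count column before summing so that the text column Fruit is left out of the aggregation.

df.groupby(['Name'])[['Count']].sum()
       Count
Name
Doe       90
Mike     100
Steve     30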

Find the total count of each fruit

df.groupby(['Fruit'])[['Count']].sum()
        Count
Fruit
Apple      50
Grapes     40
Orange    130

How many rows does the table contain for each fruit?

df.groupby(['Fruit']).size().reset_index()
    Fruit  0
0   Apple  3
1  Grapes  2
2  Orange  4
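In the result above, the column holding the row counts is labeled 0 because size() returns an unnamed Series. A small sketch of one way to give it a readable label is to pass name to reset_index(). The label Rows is an illustrative choice.

```python
import pandas as pd

# Same data as the DataFrame built earlier in this article.
df = pd.DataFrame({
    'Name': ['Doe', 'Doe', 'Mike', 'Steve', 'Doe', 'Doe', 'Mike', 'Doe', 'Mike'],
    'Fruit': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Orange', 'Grapes', 'Grapes'],
    'Count': [20, 10, 20, 30, 10, 40, 50, 10, 30],
})

# size() counts the rows in each group; name= labels the resulting column.
rows_per_fruit = df.groupby('Fruit').size().reset_index(name='Rows')
print(rows_per_fruit)
```

Note that size() counts rows, while sum() on the Count column adds up the quantities, so the two answer different questions.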

Conclusion

The groupby() method is an essential tool for data manipulation and exploration in Pandas. Combined with aggregate functions, reset_index() and pivot(), it turns raw rows into compact summaries that are easy to read, compare and communicate.