pandas - Python Data Analysis Library

By: Rajesh P.S.

Pandas is an open-source library that provides tools for data analysis, cleaning, exploration, and manipulation in Python. Its main data structures are Series, a one-dimensional labeled array, and DataFrame, a two-dimensional table with row and column indexes that is akin to R's data frame. Developers can import data from various file formats and then clean, reshape, summarize, group, and merge datasets.

Importing the pandas module

>>> import pandas


When code makes many calls to pandas, repeatedly writing pandas.x() is cumbersome and hurts readability. By convention, the library is imported under the abbreviated alias pd, so the same calls can be written more concisely as pd.x().

>>> import pandas as pd

Get your data into a DataFrame

There are several ways to create a pandas DataFrame from a standard Python data structure.

Pandas DataFrame from Python List

>>> import pandas as pd
>>> lstColors = ['red', 'blue', 'green']
>>> df = pd.DataFrame(lstColors)
>>> print(df)
       0
0    red
1   blue
2  green
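By default the single column above is labeled 0. As a minimal sketch, the column can be given a name through the columns argument; the name 'Color' is chosen here purely for illustration.

```python
import pandas as pd

colors = ['red', 'blue', 'green']

# 'Color' is an illustrative column name, not part of the original example
df_colors = pd.DataFrame(colors, columns=['Color'])

print(df_colors.columns.tolist())
print(df_colors.shape)
```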

Pandas DataFrame from Python Dictionary

>>> import pandas as pd
>>> data = {
...     "Name": ['John', 'Doe', 'Gates'],
...     "Age": [34, 52, 25],
...     "Grade": ['B', 'A', 'B']
... }
>>> # load data into a DataFrame object:
>>> df = pd.DataFrame(data)
>>> print(df)
    Name  Age Grade
0   John   34     B
1    Doe   52     A
2  Gates   25     B

Working with DataFrame Columns and Rows

Select Columns from DataFrame

From the DataFrame, select only the Name and Grade columns.

>>> print(df[['Name', 'Grade']])
    Name Grade
0   John     B
1    Doe     A
2  Gates     B
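The double brackets matter here. As a small sketch using the same sample data, a single column name in single brackets gives a Series, while a list of names gives a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ['John', 'Doe', 'Gates'],
    "Age": [34, 52, 25],
    "Grade": ['B', 'A', 'B'],
})

# One column name in single brackets returns a Series
names = df['Name']

# A list of column names returns a DataFrame, even for a single column
subset = df[['Name', 'Grade']]

print(type(names).__name__)
print(type(subset).__name__)
```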

Select Rows from DataFrame

A pandas DataFrame provides the loc indexer, written with square brackets rather than parentheses, to return one or more rows by index label.

>>> print(df.loc[1])
Name     Doe
Age       52
Grade      A
Name: 1, dtype: object

Select Multiple rows from DataFrame

>>> print(df.loc[[0, 2]])
    Name  Age Grade
0   John   34     B
2  Gates   25     B
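Besides explicit labels, loc also accepts a boolean condition. The following is a sketch on the same sample data; the threshold of 30 is an arbitrary value picked for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ['John', 'Doe', 'Gates'],
    "Age": [34, 52, 25],
    "Grade": ['B', 'A', 'B'],
})

# Keep only the rows where the condition is True
# (the cutoff of 30 is an illustrative assumption)
older = df.loc[df['Age'] > 30]

print(older)
```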

Adding Named Indexes

When creating a DataFrame, you can supply your own index labels through the index argument.

>>> import pandas as pd
>>> data = {
...     "Name": ['John', 'Doe', 'Gates'],
...     "Age": [34, 52, 25],
...     "Grade": ['B', 'A', 'B']
... }
>>> df = pd.DataFrame(data, index=['Student-1', 'Student-2', 'Student-3'])
>>> print(df)
            Name  Age Grade
Student-1   John   34     B
Student-2    Doe   52     A
Student-3  Gates   25     B

Retrieve data using Named Index

>>> print(df.loc["Student-2"])
Name     Doe
Age       52
Grade      A
Name: Student-2, dtype: object
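Once the index holds custom labels, loc looks rows up by label, while the companion iloc indexer looks them up by integer position. A brief sketch of the difference on the same data:

```python
import pandas as pd

data = {
    "Name": ['John', 'Doe', 'Gates'],
    "Age": [34, 52, 25],
    "Grade": ['B', 'A', 'B'],
}
df = pd.DataFrame(data, index=['Student-1', 'Student-2', 'Student-3'])

# Label-based lookup: the row whose index label is 'Student-2'
by_label = df.loc['Student-2']

# Position-based lookup: the second row (positions start at 0)
by_position = df.iloc[1]

# Both lookups refer to the same row
print(by_label.equals(by_position))
```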

Dataframe from numpy ndarray

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(low=100, high=999, size=(10, 4)))
>>> df
     0    1    2    3
0  935  842  850  327
1  232  149  306  615
2  602  943  729  686
3  894  460  563  221
4  223  529  905  486
5  386  961  100  451
6  801  852  692  887
7  922  491  325  186
8  678  942  386  152
9  286  764  359  708
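The values above differ on every run because they are random. One possible variant, sketched here, seeds NumPy's Generator for repeatable values and names the columns. The seed 42 and the column names A to D are illustrative assumptions. As with randint, the high bound is exclusive.

```python
import numpy as np
import pandas as pd

# Seeded generator so the same values come back on every run
# (the seed value 42 is an arbitrary choice)
rng = np.random.default_rng(seed=42)

# Integers from 100 up to, but not including, 999
values = rng.integers(low=100, high=999, size=(10, 4))

# Column names are illustrative; without them pandas uses 0..3
df_rand = pd.DataFrame(values, columns=['A', 'B', 'C', 'D'])

print(df_rand.shape)
```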

View the first or last N rows

The DataFrame head() method returns the first 5 rows by default.
The DataFrame tail() method returns the last 5 rows by default.

>>> df.head()
     0    1    2    3
0  935  842  850  327
1  232  149  306  615
2  602  943  729  686
3  894  460  563  221
4  223  529  905  486

You can pass the number of rows as an argument.

>>> df.tail(2)
     0    1    2    3
8  678  942  386  152
9  286  764  359  708

Loading Data from files

Pandas provides various functions, such as read_csv for comma-separated values, read_excel for Microsoft Excel spreadsheets, and read_fwf for fixed-width formatted text, to efficiently read data from external files. These functions facilitate the process of importing data into pandas DataFrames, enabling seamless data analysis and manipulation within the Python environment.

import pandas as pd
df = pd.read_csv('your-data.csv')

Example

import pandas as pd
df = pd.read_csv('https://static.lib.virginia.edu/statlab/materials/data/VDH-COVID-19-PublicUseDataset-EventDate.csv')
df.head()
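To try read_csv without a file on disk or network access, the CSV text can be wrapped in an in-memory buffer. This is a sketch only; the three sample rows are reused from the dictionary example and are not part of the dataset linked above.

```python
import io

import pandas as pd

# Small CSV held in a string (sample rows are illustrative)
csv_text = """Name,Age,Grade
John,34,B
Doe,52,A
Gates,25,B
"""

# read_csv accepts any file-like object, not only a path or URL
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.dtypes)
```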

Saving a DataFrame

Read data and save the DataFrame to a CSV file.

import pandas as pd
df = pd.read_csv('https://static.lib.virginia.edu/statlab/materials/data/VDH-COVID-19-PublicUseDataset-EventDate.csv')
df.to_csv('d:/data.csv', encoding='utf-8')
print('done')
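By default to_csv also writes the row index, and reading the file back turns it into an extra column named "Unnamed: 0". The sketch below shows the effect and how index=False avoids it. It writes to in-memory buffers instead of d:/data.csv so it can run anywhere, and the two-row frame is illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ['John', 'Doe'], "Age": [34, 52]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index)

print(reread_with_index.columns.tolist())
print(reread_without_index.columns.tolist())
```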

Find columns data types

>>> import pandas as pd
>>> df = pd.read_csv('data.csv')
>>> print(df.dtypes)
Unnamed: 0                     int64
Event Date                    object
Health Planning Region        object
Case Status                   object
Number of Cases                int64
Number of Hospitalizations     int64
Number of Deaths               int64
dtype: object

Statistical Summary of Data

The pandas describe() method outputs a brief statistical summary of the numeric columns in the data, including descriptive statistics of central tendency and dispersion.
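As a minimal sketch, here is describe() applied to the small Name/Age/Grade sample from earlier rather than to the CSV file. Only the numeric Age column appears in the result, with rows for count, mean, std, min, the quartiles, and max.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ['John', 'Doe', 'Gates'],
    "Age": [34, 52, 25],
    "Grade": ['B', 'A', 'B'],
})

# Non-numeric columns (Name, Grade) are skipped by default
summary = df.describe()

print(summary)
```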

Copy DataFrame to another DataFrame

import pandas as pd
df = pd.read_csv('data.csv')
dfc = df.copy()
dfc.head()
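copy() makes a deep copy by default, so changes to the copy do not affect the original. A short sketch on an illustrative two-row frame (the replacement value 99 is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"Name": ['John', 'Doe'], "Age": [34, 52]})

# Independent copy of both data and index
dfc = df.copy()

# Modify the copy only
dfc.loc[0, 'Age'] = 99

print(df.loc[0, 'Age'])
print(dfc.loc[0, 'Age'])
```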

Count rows in a DataFrame

>>> import pandas as pd
>>> df = pd.read_csv('data.csv')
>>> df.count()
Unnamed: 0                    2338
Event Date                    2338
Health Planning Region        2338
Case Status                   2338
Number of Cases               2338
Number of Hospitalizations    2338
Number of Deaths              2338
dtype: int64
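Note that count() reports the number of non-missing values in each column, which equals the row count only when nothing is missing. The sketch below uses an illustrative frame with one missing Age to contrast it with len(df) and df.shape[0], which give the number of rows directly.

```python
import numpy as np
import pandas as pd

# One Age value is deliberately missing (illustrative data)
df = pd.DataFrame({
    "Name": ['John', 'Doe', 'Gates'],
    "Age": [34, np.nan, 25],
})

# Number of rows, regardless of missing values
print(len(df))
print(df.shape[0])

# Non-missing values per column
print(df.count())
```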

Conclusion

Pandas is a Python Data Analysis Library that offers robust tools for data manipulation, exploration, and analysis. It provides functions like read_csv, read_excel, and read_fwf to efficiently read data from various external file formats, allowing users to work with data in a convenient and intuitive manner.
