pandas - Python Data Analysis Library

Pandas is an open-source software library for analysing, cleaning, exploring, and manipulating data, built on top of the Python programming language. The main data structures in Pandas are the Series and the DataFrame (similar to R's data frame). A Pandas Series is a one-dimensional labelled array: a sequence of values paired with an index. All the data in a Series is of the same data type. A Pandas DataFrame is a two-dimensional, tabular data structure with row and column indexes. Each column of a DataFrame is a Series object. The pandas module allows developers to import data from various file formats (CSV, JSON, SQL, XLS, etc.) and perform data manipulation operations, including cleaning and reshaping the data, summarizing observations, grouping data, and merging multiple datasets.
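The relationship between the two structures described above can be sketched in a few lines. This is a minimal illustration, not part of the original tutorial; the variable names and sample values are made up for the example.

```python
import pandas as pd

# A Series: one-dimensional, labelled, with a single data type.
ages = pd.Series([34, 52, 25], index=["a", "b", "c"], name="Age")
print(ages.dtype)

# A DataFrame: a table whose columns share one row index.
df = pd.DataFrame({"Name": ["John", "Doe", "Gates"], "Age": [34, 52, 25]})

# Selecting a single column gives back a Series.
age_column = df["Age"]
print(isinstance(age_column, pd.Series))  # True
```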

Importing Pandas DataFrame module

>>> import pandas
If your code makes many calls to pandas, writing pandas.x() over and over becomes tedious. Instead, it is common practice to import the library under the short name pd.
>>> import pandas as pd

Get your data into a DataFrame

There are several ways to create a Pandas DataFrame from a standard Python data structure.

Pandas DataFrame from Python List

>>> import pandas as pd
>>> lstColors = ['red','blue','green']
>>> df=pd.DataFrame(lstColors)
>>> print(df)
       0
0    red
1   blue
2  green

Pandas DataFrame from Python Dictionary

>>> import pandas as pd
>>>
>>> data = {
...     "Name": ['John', 'Doe', 'Gates'],
...     "Age": [34, 52, 25],
...     "Grade": ['B','A','B']
... }
>>>
>>> #load data into a DataFrame object:
>>>
>>> df=pd.DataFrame(data)
>>> print(df)
    Name  Age Grade
0   John   34     B
1    Doe   52     A
2  Gates   25     B

Working with DataFrame Columns and Rows

Select Columns from DataFrame

To select only the Name and Grade columns from the DataFrame, pass a list of column names:
>>> print(df[['Name', 'Grade']])
    Name Grade
0   John     B
1    Doe     A
2  Gates     B

Select Rows from DataFrame

A Pandas DataFrame uses the loc indexer, written with square brackets, to return one or more specified row(s) by label.
>>> print(df.loc[1])
Name     Doe
Age       52
Grade      A
Name: 1, dtype: object

Select Multiple rows from DataFrame

>>> print(df.loc[[0,2]])
    Name  Age Grade
0   John   34     B
2  Gates   25     B

Adding Named Indexes

You can name your own row indexes by passing the index argument when creating a DataFrame.
>>> import pandas as pd
>>>
>>> data = {
...     "Name": ['John', 'Doe', 'Gates'],
...     "Age": [34, 52, 25],
...     "Grade": ['B','A','B']
... }
>>>
>>> df=pd.DataFrame(data,index=['Student-1','Student-2','Student-3'])
>>> print(df)
            Name  Age Grade
Student-1   John   34     B
Student-2    Doe   52     A
Student-3  Gates   25     B

Retrieve data using Named Index

>>> print(df.loc["Student-2"])
Name     Doe
Age       52
Grade      A
Name: Student-2, dtype: object
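Label-based selection can also take a column label alongside the row label. The sketch below rebuilds the same student DataFrame and shows a few variations. The use of iloc for position-based access is an addition not covered in the text above, included only for contrast.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Doe", "Gates"], "Age": [34, 52, 25], "Grade": ["B", "A", "B"]},
    index=["Student-1", "Student-2", "Student-3"],
)

# A row label plus a column label returns a single value.
age = df.loc["Student-2", "Age"]
print(age)  # 52

# Lists of labels return a smaller DataFrame.
subset = df.loc[["Student-1", "Student-3"], ["Name", "Grade"]]
print(subset)

# iloc selects by integer position, regardless of the index labels.
first_row = df.iloc[0]
print(first_row["Name"])  # John
```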

Dataframe from numpy ndarray

>>> import numpy as np
>>> import pandas as pd
>>> df=pd.DataFrame(np.random.randint(low=100,high=999,size=(10,4)))
>>> df
     0    1    2    3
0  935  842  850  327
1  232  149  306  615
2  602  943  729  686
3  894  460  563  221
4  223  529  905  486
5  386  961  100  451
6  801  852  692  887
7  922  491  325  186
8  678  942  386  152
9  286  764  359  708

View the first or last N rows

  1. The DataFrame head() method returns the first 5 rows by default
  2. The DataFrame tail() method returns the last 5 rows by default
>>> df.head()
     0    1    2    3
0  935  842  850  327
1  232  149  306  615
2  602  943  729  686
3  894  460  563  221
4  223  529  905  486

You can pass the number of rows as an argument:

>>> df.tail(2)
     0    1    2    3
8  678  942  386  152
9  286  764  359  708

Loading Data from files

Functions such as read_csv (for comma-separated values), read_excel (for Microsoft Excel spreadsheets), and read_fwf (for fixed-width formatted text) read data from external files into a DataFrame.
import pandas as pd
df = pd.read_csv('your-data.csv')
Example:
import pandas as pd
df = pd.read_csv('https://static.lib.virginia.edu/statlab/materials/data/VDH-COVID-19-PublicUseDataset-EventDate.csv')
df.head()
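To try read_csv without a file or network access, the function also accepts a file-like object. The sketch below is a minimal illustration: the two column names are borrowed from the dataset above, but the rows are invented sample values, and parse_dates is an optional extra that is not used in the original examples.

```python
import io

import pandas as pd

# Invented sample rows; only the column names come from the real dataset.
csv_text = """Event Date,Number of Cases
2020-03-01,3
2020-03-02,5
"""

# read_csv accepts a path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Event Date"])

print(df.dtypes)
print(df.head())
```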


Saving a DataFrame

Read the data, then save the DataFrame to a CSV file with to_csv(). By default, the row index is written as an extra first column.
import pandas as pd
df = pd.read_csv('https://static.lib.virginia.edu/statlab/materials/data/VDH-COVID-19-PublicUseDataset-EventDate.csv')
df.to_csv('d:/data.csv', encoding='utf-8')
print('done')
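When a file saved this way is read back, the written index appears as a column named "Unnamed: 0". The following sketch, using a small made-up DataFrame and in-memory text instead of a file on disk, shows the difference that index=False makes. Whether to keep the index depends on whether it carries meaning for your data.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["John", "Doe"], "Age": [34, 52]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = df.to_csv()                # row index written as first column
without_index = df.to_csv(index=False)  # data columns only

reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))     # ['Unnamed: 0', 'Name', 'Age']
print(list(reloaded_without.columns))  # ['Name', 'Age']
```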

Find columns data types

>>> import pandas as pd
>>> df = pd.read_csv('data.csv')
>>> print(df.dtypes)
Unnamed: 0                     int64
Event Date                    object
Health Planning Region        object
Case Status                   object
Number of Cases                int64
Number of Hospitalizations     int64
Number of Deaths               int64
dtype: object

Statistical Summary of Data

The Pandas describe() method outputs a brief statistical summary of the numeric columns in the data, including descriptive statistics of central tendency and dispersion.
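A small example of what describe() returns, using a made-up three-row DataFrame in place of the CSV data. The result is itself a DataFrame, so individual statistics can be looked up by label.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Doe", "Gates"], "Age": [34, 52, 25]})

# Only the numeric column (Age) is summarized by default.
summary = df.describe()
print(summary)

# Rows of the summary are labelled count, mean, std, min, 25%, 50%, 75%, max.
mean_age = summary.loc["mean", "Age"]
print(mean_age)  # 37.0
```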

Copy DataFrame to another DataFrame

import pandas as pd
df = pd.read_csv('data.csv')
dfc = df.copy()
dfc.head()
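The reason to call copy() rather than assign the DataFrame to a second name is that plain assignment only creates another reference to the same object. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"Age": [34, 52, 25]})

alias = df       # second name for the same DataFrame
dfc = df.copy()  # independent copy of the data

# Changing the copy leaves the original untouched.
dfc.loc[0, "Age"] = 99
print(df.loc[0, "Age"])  # 34

# Changing through the alias changes the original.
alias.loc[0, "Age"] = 1
print(df.loc[0, "Age"])  # 1
```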

Count rows in a DataFrame

>>> import pandas as pd
>>> df = pd.read_csv('data.csv')
>>> df.count()
Unnamed: 0                    2338
Event Date                    2338
Health Planning Region        2338
Case Status                   2338
Number of Cases               2338
Number of Hospitalizations    2338
Number of Deaths              2338
dtype: int64
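Note that count() reports the number of non-missing values in each column, which equals the row count only when a column has no gaps. The sketch below, using a made-up DataFrame with one missing value, contrasts it with len() and shape, which report the row count directly.

```python
import numpy as np
import pandas as pd

# One Age value is deliberately missing.
df = pd.DataFrame({"Name": ["John", "Doe", "Gates"], "Age": [34, np.nan, 25]})

n_rows = len(df)        # number of rows
dimensions = df.shape   # (rows, columns)
non_null = df.count()   # non-missing values per column

print(n_rows)           # 3
print(dimensions)       # (3, 2)
print(non_null["Age"])  # 2
```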