Importing Data with pandas.read_csv()

By: Rajesh P.S.

Pandas' read_csv() function is a powerful and widely-used method for reading data from CSV (Comma Separated Values) files and creating a DataFrame, a two-dimensional tabular data structure in Python. This method simplifies the process of importing data from CSV files, which are a common format for storing structured data. Let's explore how to use read_csv() in detail with examples:

Basic Usage

Suppose we have a CSV file named "data.csv" with the following content:

Name,Age,Salary
Alfred,25,50000
William,30,45000
Nick,22,60000

We can use the read_csv() function to read this file and create a DataFrame:

import pandas as pd

# Reading the CSV file
df = pd.read_csv('data.csv')
print(df)

Output:

      Name  Age  Salary
0   Alfred   25   50000
1  William   30   45000
2     Nick   22   60000

In this example, the read_csv() function reads the CSV file and creates a DataFrame with three columns: 'Name', 'Age', and 'Salary'.
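To try this without creating a file, the same rows can be fed to read_csv() from memory. The sketch below assumes an in-memory io.StringIO buffer as a stand-in for data.csv; the variable name csv_text is only illustrative.

```python
import io

import pandas as pd

# In-memory stand-in for data.csv (same rows as the example above)
csv_text = "Name,Age,Salary\nAlfred,25,50000\nWilliam,30,45000\nNick,22,60000\n"

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels taken from the header row
print(df.dtypes)         # type inferred for each column
```

Inspecting shape, columns, and dtypes right after loading is a quick way to confirm that the header row and the column types were picked up as expected.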

Customizing Parameters

The read_csv() function provides various parameters to customize the import process. For instance, we can specify a custom separator (e.g., tab-delimited) and select specific columns to read:

Suppose we have a tab-delimited file named "data.txt" with the following content:

Name	Age	Salary
Alfred	25	50000
William	30	45000
Nick	22	60000

We can use the read_csv() function with custom parameters:

import pandas as pd

# Reading the tab-delimited file, selecting only 'Name' and 'Salary' columns
df = pd.read_csv('data.txt', delimiter='\t', usecols=['Name', 'Salary'])
print(df)

Output:

      Name  Salary
0   Alfred   50000
1  William   45000
2     Nick   60000

In this example, we use the delimiter parameter to specify that the file is tab-delimited, and the usecols parameter to read only the 'Name' and 'Salary' columns, resulting in a DataFrame with only those columns.
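As a small self-contained variation, the sketch below reads the same tab-delimited rows from an in-memory buffer (an assumption made so the example runs without data.txt) and shows that usecols also accepts column positions instead of names. Note that sep and delimiter are two names for the same parameter.

```python
import io

import pandas as pd

# In-memory stand-in for data.txt; \t is the tab separator
tsv_text = "Name\tAge\tSalary\nAlfred\t25\t50000\nWilliam\t30\t45000\nNick\t22\t60000\n"

# Select columns by name ...
by_name = pd.read_csv(io.StringIO(tsv_text), sep="\t", usecols=["Name", "Salary"])

# ... or by zero-based position
by_position = pd.read_csv(io.StringIO(tsv_text), sep="\t", usecols=[0, 2])

print(by_name.equals(by_position))
```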

Handling Missing Values

CSV files may contain missing or NaN values. The read_csv() function can handle these missing values using various parameters, such as na_values, keep_default_na, and na_filter.

Suppose we have a CSV file named "data_with_nan.csv" with the following content:

Name,Age,Salary
Alfred,25,50000
William,,45000
Nick,22,

We can use the read_csv() function to handle missing values:

import pandas as pd

# Reading the CSV file, handling missing values
df = pd.read_csv('data_with_nan.csv', na_values=[''])
print(df)

Output:

      Name   Age   Salary
0   Alfred  25.0  50000.0
1  William   NaN  45000.0
2     Nick  22.0      NaN

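In this example, na_values=[''] tells pandas to treat empty strings as NaN. Empty fields are already treated as NaN by default, so the argument is redundant here; its main use is to list additional markers, such as a placeholder string, that should also count as missing. The resulting DataFrame contains NaN for the missing entries in the 'Age' and 'Salary' columns, and both columns are read as floats because NaN cannot be stored in a default integer column.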
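The following sketch illustrates how na_values and na_filter interact. It assumes a hypothetical placeholder string 'unknown' in the Salary column (not part of the file above) and an in-memory buffer in place of a real file.

```python
import io

import pandas as pd

# 'unknown' is a made-up placeholder used only for this illustration
csv_text = "Name,Age,Salary\nAlfred,25,50000\nWilliam,,45000\nNick,22,unknown\n"

# Empty fields become NaN by default; na_values adds 'unknown' as an extra marker
df = pd.read_csv(io.StringIO(csv_text), na_values=["unknown"])
print(df.isna().sum())

# na_filter=False switches missing-value detection off entirely,
# so empty fields and 'unknown' are kept as plain text
raw = pd.read_csv(io.StringIO(csv_text), na_filter=False)
print(raw.isna().sum())
```

Turning off na_filter is mainly useful for files known to contain no missing values, where skipping the detection step can speed up parsing.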

Other read_csv() Parameters

The read_csv() function offers many more parameters, such as header, index_col, dtype, parse_dates, and skiprows, to further customize the import process according to the structure and requirements of the CSV file.

Specifying Delimiter

pd.read_csv('data.csv', sep='\t')

Reading specific Columns only

pd.read_csv('data.csv', usecols=['Name', 'Age'])

Read CSV without headers

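pd.read_csv('data.csv', header=None)

With header=None, pandas treats the first row as data rather than as column names and labels the columns with integers (0, 1, 2, ...). Use it when the file has no header row.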
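A minimal sketch of reading a header-less file, assuming an in-memory buffer and illustrative column names supplied through the names parameter:

```python
import io

import pandas as pd

# A file with no header row: every line is data
headerless = "Alfred,25,50000\nWilliam,30,45000\nNick,22,60000\n"

# header=None keeps the first line as data; columns get integer labels
df_default = pd.read_csv(io.StringIO(headerless), header=None)
print(list(df_default.columns))

# names= supplies the column labels explicitly
df_named = pd.read_csv(
    io.StringIO(headerless), header=None, names=["Name", "Age", "Salary"]
)
print(df_named)
```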

Skiprows

skiprows specifies the number of lines to skip at the start of the file. It also accepts a list of zero-based line numbers to skip.

df = pd.read_csv('data.csv', skiprows=3)
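As an illustration, the sketch below assumes a file that starts with three lines of export notes (invented for this example) before the real header row, and skips them in two equivalent ways.

```python
import io

import pandas as pd

# Three preamble lines precede the actual header (hypothetical content)
text = (
    "# exported by HR system\n"
    "# rows: 2\n"
    "# currency: USD\n"
    "Name,Age\n"
    "Alfred,25\n"
    "Nick,22\n"
)

# Skip the first three lines by count
by_count = pd.read_csv(io.StringIO(text), skiprows=3)

# Skip the same lines by listing their zero-based line numbers
by_index = pd.read_csv(io.StringIO(text), skiprows=[0, 1, 2])

print(by_count)
```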

Use a specific encoding (e.g. 'utf-8' )

pd.read_csv('data.csv', encoding='utf-8')

Parsing date columns

pd.read_csv('data.csv', parse_dates=['date'])
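A short sketch of what parse_dates changes, assuming a column literally named 'date' that holds ISO-formatted dates (the sample values are made up):

```python
import io

import pandas as pd

csv_text = "date,value\n2024-01-15,10\n2024-02-20,12\n"

# Without parse_dates the column would be read as text;
# with it, the column becomes a datetime column
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df.dtypes)
print(df["date"].dt.month.tolist())  # datetime accessors are now available
```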

Specify dtype

import numpy as np
df = pd.read_csv('data.csv', usecols=['Height'], dtype=np.float32)
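When a file has several columns, dtype can also be given as a dictionary that maps column names to types, so that only the listed columns are affected. A minimal sketch, assuming a made-up two-column sample:

```python
import io

import numpy as np
import pandas as pd

csv_text = "Name,Height\nAlfred,1.80\nNick,1.75\n"

# Only 'Height' is forced to float32; 'Name' is still inferred
df = pd.read_csv(io.StringIO(csv_text), dtype={"Height": np.float32})

print(df.dtypes)
```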

Multi-character separator

By default, pandas read_csv() uses the C parser engine for high performance. The C engine can only handle single-character separators (plus the special case '\s+'). If your CSV uses a multi-character or regular-expression separator, switch to the 'python' engine.

pd.read_csv('data.csv', sep=r'\s*\\s*', engine='python')
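The sketch below shows two multi-character separators on invented sample data: a literal '::' and a regular expression that matches '||' surrounded by optional whitespace. Both separators are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Literal two-character separator
colon_text = "Name::Age::Salary\nAlfred::25::50000\nNick::22::60000\n"
colons = pd.read_csv(io.StringIO(colon_text), sep="::", engine="python")
print(colons)

# Regular-expression separator: '||' with optional surrounding whitespace
pipe_text = "Name || Age\nAlfred || 25\nNick || 22\n"
piped = pd.read_csv(io.StringIO(pipe_text), sep=r"\s*\|\|\s*", engine="python")
print(piped)
```

Passing engine='python' explicitly avoids the warning pandas would otherwise emit when it falls back from the C engine on its own.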

UnicodeDecodeError while read_csv()

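UnicodeDecodeError occurs when the data was stored in one encoding but is read with a different, incompatible one (read_csv() assumes UTF-8 by default). The most reliable fix is to pass the file's actual encoding through the encoding parameter; 'latin-1' below is only an example, so substitute the encoding your file really uses. Switching to engine='python' is sometimes suggested and may happen to work in some setups, but it does not address the underlying mismatch.

pd.read_csv('data.csv', encoding='latin-1')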
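To see the mismatch in isolation, the sketch below builds a small Latin-1 encoded byte buffer (an assumption standing in for a real file), lets the default UTF-8 read fail, and then reads it again with the matching encoding.

```python
import io

import pandas as pd

# Bytes as they would be stored by a program that writes Latin-1
raw = "name,city\nJosé,Málaga\n".encode("latin-1")

try:
    pd.read_csv(io.BytesIO(raw))  # read_csv assumes UTF-8 by default
except UnicodeDecodeError as exc:
    print("default read failed:", exc.reason)

# Passing the encoding the data was actually written with fixes the read
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(df)
```

If the real encoding is unknown, trying the encodings commonly produced by the tool that exported the file is usually a more dependable approach than changing the parser engine.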

"Unnamed: 0" while read_csv()

An "Unnamed: 0" column appears when a DataFrame with an unnamed index is saved to CSV and then read back: the index is written as a first column without a header, and read_csv() gives it a placeholder name. To avoid this, pass index_col=[0] to read_csv() so that the first column is read as the index.

pd.read_csv('data.csv', index_col=[0])

Instead of having to fix this issue while reading, you can also fix this issue when writing by using:

df.to_csv('data.csv', index=False)
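The round trip can be reproduced in memory. This sketch assumes a tiny illustrative DataFrame and uses the fact that to_csv() returns the CSV text when no path is given.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Alfred", "Nick"], "Age": [25, 22]})

# Default to_csv() writes the index as a first column with an empty header
with_index = df.to_csv()

naive = pd.read_csv(io.StringIO(with_index))
print(list(naive.columns))  # includes the placeholder column name

# Fix while reading: treat the first column as the index
fixed = pd.read_csv(io.StringIO(with_index), index_col=0)

# Fix while writing: do not write the index at all
clean = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(fixed.columns), list(clean.columns))
```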

Error tokenizing data while read_csv()

In most cases, this error is caused by (1) a delimiter that does not match the data, or (2) rows whose number of fields does not match the header. A first thing to try:

pd.read_csv('data.csv', sep='your_delimiter', header=None)

Replace 'your_delimiter' with the file's actual separator. The header=None argument tells pandas that your source data has no row for headers/column titles.

pd.read_csv('data.csv', on_bad_lines='skip')

The code above causes the offending lines to be skipped. The on_bad_lines parameter was introduced in pandas 1.3 and replaces error_bad_lines and warn_bad_lines, which were removed in pandas 2.0.

To get information about the rows causing the error, use on_bad_lines='warn', which skips each bad line and emits a warning for it:

pd.read_csv('data.csv', on_bad_lines='warn')
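A small sketch of the failure and of skipping the bad row, assuming pandas 1.3 or later (where on_bad_lines is available) and an invented sample in which one row has an extra field:

```python
import io

import pandas as pd

# The 'William' row has three fields, but the header declares two
csv_text = "Name,Age\nAlfred,25\nWilliam,30,extra\nNick,22\n"

try:
    pd.read_csv(io.StringIO(csv_text))
except pd.errors.ParserError as exc:
    print("parse failed:", exc)

# Skip rows that have too many fields instead of raising
df = pd.read_csv(io.StringIO(csv_text), on_bad_lines="skip")
print(df)
```

Skipping silently discards data, so it is worth checking how many rows were dropped before relying on the result.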

FileNotFoundError

In most cases, the path string is the problem. On Windows, the backslash (\) is an escape character in ordinary Python strings, so put r before the opening quote of the path:

pd.read_csv(r'D:\Users\Desktop\data.csv')

Here the r prefix marks a raw string, in which backslashes are taken literally.

Another way is to write \\ in the string to escape each \:

pd.read_csv('C:\\Users\\mylab\\Desktop\\data.csv')
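The two spellings produce the same string, which the following sketch checks. It also uses pathlib.PureWindowsPath from the standard library (an addition not used in the article) to split the path without touching the file system.

```python
from pathlib import PureWindowsPath

# Raw string: backslashes are taken literally
raw_path = r'C:\Users\mylab\Desktop\data.csv'

# Ordinary string: each backslash is escaped with a second one
escaped_path = 'C:\\Users\\mylab\\Desktop\\data.csv'

print(raw_path == escaped_path)

# PureWindowsPath parses Windows-style paths on any operating system
path = PureWindowsPath(raw_path)
print(path.name, path.parent)
```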

MemoryError

Memory errors happen a lot with Python on the 32-bit Windows version, because a 32-bit process only gets 2 GB of memory to work with by default.

One remedy is the dtype option of pandas.read_csv(), which tells pandas what types exist inside your CSV data so that it does not have to guess.

For example, specifying dtype={'age': int} tells pandas that age should be interpreted as an integer. The memory savings come from choosing types that are smaller than the ones pandas would infer, such as a narrower integer or float type.

pd.read_csv('data.csv', dtype={'age': int})

If pandas instead warns about mixed types in a column, low_memory=False makes it infer each column's type from the whole file in one pass. Note that this option does not reduce memory use:

pd.read_csv('data.csv', sep='\t', low_memory=False)
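To make the effect of dtype on memory concrete, the sketch below generates an invented two-column data set in memory and compares the footprint of the default types with narrower ones. The column names and the chosen types (int8, int32) are assumptions that fit this sample only.

```python
import io

import numpy as np
import pandas as pd

# 10,000 generated rows: 'age' stays below 90, 'id' stays below 10,000
rows = "\n".join(f"{i % 90},{i}" for i in range(10_000))
csv_text = "age,id\n" + rows + "\n"

# Default inference stores both integer columns as 64-bit integers
default = pd.read_csv(io.StringIO(csv_text))

# Narrower types that still hold every value in this sample
compact = pd.read_csv(
    io.StringIO(csv_text), dtype={"age": np.int8, "id": np.int32}
)

default_bytes = int(default.memory_usage(index=False).sum())
compact_bytes = int(compact.memory_usage(index=False).sum())
print(default_bytes, compact_bytes)
```

A narrower type is only safe when every value in the column fits into it, so check the value range of the real data before choosing one.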

Conclusion

The read_csv() function in Pandas simplifies the process of reading data from CSV files and creating DataFrames. Its flexibility and numerous parameters allow data professionals to handle various CSV file formats, customize the import process, and address missing values efficiently, paving the way for seamless data analysis and manipulation.
