Importing Data with pandas.read_csv()
Pandas' read_csv() function is a powerful and widely-used method for reading data from CSV (Comma Separated Values) files and creating a DataFrame, a two-dimensional tabular data structure in Python. This method simplifies the process of importing data from CSV files, which are a common format for storing structured data. Let's explore how to use read_csv() in detail with examples:
Basic Usage
Suppose we have a CSV file named "data.csv" with the following content:
We can use the read_csv() function to read this file and create a DataFrame:
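The file contents are not reproduced here, so the sketch below assumes a few hypothetical rows with the columns Name, Age, and Salary, and feeds them to read_csv() through an in-memory buffer. With a real file on disk the call would simply be pd.read_csv("data.csv").

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv; the rows are illustrative only.
csv_text = """Name,Age,Salary
Alice,30,50000
Bob,25,42000
Carol,35,61000
"""

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.columns.tolist())
```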
In this example, the read_csv() function reads the CSV file and creates a DataFrame with three columns: 'Name', 'Age', and 'Salary'.
Customizing Parameters
The read_csv() function provides various parameters to customize the import process. For instance, we can specify a custom separator (e.g., tab-delimited) and select specific columns to read:
Suppose we have a tab-delimited file named "data.txt" with the following content:
We can use the read_csv() function with custom parameters:
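A minimal sketch of that call, again using made-up rows in place of data.txt (the values are assumptions, only the column names come from the text):

```python
import io

import pandas as pd

# Hypothetical stand-in for the tab-delimited data.txt.
tsv_text = "Name\tAge\tSalary\nAlice\t30\t50000\nBob\t25\t42000\n"

# delimiter="\t" splits on tabs; usecols keeps only the listed columns.
df = pd.read_csv(
    io.StringIO(tsv_text),
    delimiter="\t",
    usecols=["Name", "Salary"],
)

print(df)
```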
In this example, we use the delimiter parameter to specify that the file is tab-delimited, and the usecols parameter to read only the 'Name' and 'Salary' columns, resulting in a DataFrame with only those columns.
Handling Missing Values
CSV files may contain missing or NaN values. The read_csv() function can handle these missing values using various parameters, such as na_values, keep_default_na, and na_filter.
Suppose we have a CSV file named "data_with_nan.csv" with the following content:
We can use the read_csv() function to handle missing values:
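The sketch below assumes a small file in which one Age field is empty and one Salary field holds the custom marker "missing" (both the rows and the marker are hypothetical). Empty fields become NaN without any extra arguments, while na_values adds the custom marker to the list of strings treated as missing.

```python
import io

import pandas as pd

# Hypothetical stand-in for data_with_nan.csv.
csv_text = """Name,Age,Salary
Alice,30,50000
Bob,,42000
Carol,35,missing
"""

# The empty Age field is NaN by default; "missing" needs na_values.
df = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])

print(df)
print(df.isna().sum())
```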
In this example, the na_values parameter specifies additional strings that should be treated as NaN values. Empty fields are already treated as NaN by default, so na_values is only needed for custom markers. As a result, the DataFrame contains NaN values for the missing entries in the 'Age' and 'Salary' columns.
More read_csv() Parameters
The read_csv() function offers many more parameters, such as header, index_col, dtype, parse_dates, and skiprows, to further customize the import process according to the structure and requirements of the CSV file.
Specifying Delimiter
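As a small illustration, the sep parameter (delimiter is an alias for it) tells read_csv() which character separates fields. The semicolon-separated rows below are hypothetical.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data.
text = "Name;Age\nAlice;30\nBob;25\n"

df = pd.read_csv(io.StringIO(text), sep=";")
print(df)
```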
Reading Specific Columns Only
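The usecols parameter accepts either column names or zero-based column positions. The sketch below uses positions on made-up data; passing ["Name", "Salary"] would select the same columns.

```python
import io

import pandas as pd

# Hypothetical data with three columns.
text = "Name,Age,Salary\nAlice,30,50000\nBob,25,42000\n"

# Keep only the first and third columns.
df = pd.read_csv(io.StringIO(text), usecols=[0, 2])
print(df)
```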
Read CSV without headers
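When a file has no header row, header=None stops pandas from using the first data line as column names. The columns are then numbered 0, 1, and so on, unless names supplies labels. The rows and labels below are hypothetical.

```python
import io

import pandas as pd

# Hypothetical data with no header row.
text = "Alice,30\nBob,25\n"

# Columns get the integer labels 0 and 1.
df = pd.read_csv(io.StringIO(text), header=None)

# Optionally supply column labels at the same time.
named = pd.read_csv(io.StringIO(text), header=None, names=["Name", "Age"])

print(df)
print(named)
```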
Use header=1 to skip the first row and use the 2nd row as headers (header=None, by contrast, means the file has no header row at all)
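The header parameter takes the zero-based number of the row that holds the column names, so header=1 selects the second line and discards the line before it. A short sketch on hypothetical data:

```python
import io

import pandas as pd

# Hypothetical file whose real column names sit on the second line.
text = "col_a,col_b\nName,Age\nAlice,30\nBob,25\n"

# Row 1 (the second line) becomes the header; row 0 is dropped.
df = pd.read_csv(io.StringIO(text), header=1)
print(df)
```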
Skiprows
skiprows allows you to specify the number of lines to skip at the start of the file.
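For example, assuming a hypothetical export that starts with two comment lines before the header:

```python
import io

import pandas as pd

# Hypothetical file with two preamble lines before the header row.
text = "# exported by some tool\n# second preamble line\nName,Age\nAlice,30\nBob,25\n"

# Skip the first two lines; the third line is then used as the header.
df = pd.read_csv(io.StringIO(text), skiprows=2)
print(df)
```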
Use a specific encoding (e.g. 'utf-8' )
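The encoding parameter names the codec used to decode the file's bytes. The sketch below builds a small UTF-8 byte buffer (hypothetical content) and reads it back; with a file on disk, the same argument would be passed alongside the path.

```python
import io

import pandas as pd

# Hypothetical UTF-8 encoded bytes containing non-ASCII characters.
raw = "Name,City\nZoë,Kraków\n".encode("utf-8")

df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")
print(df)
```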
Parsing date columns
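Passing a list of column names to parse_dates asks pandas to convert those columns to datetime values while reading. The column names and dates below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data with an ISO-formatted date column.
text = "date,value\n2023-01-15,10\n2023-02-20,12\n"

df = pd.read_csv(io.StringIO(text), parse_dates=["date"])

print(df.dtypes)
print(df["date"].dt.month.tolist())
```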
Specifying dtype
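The dtype parameter maps column names to types. One common reason to use it is to keep identifiers such as ZIP codes as strings so that leading zeros survive. The columns below are hypothetical.

```python
import io

import pandas as pd

# Hypothetical data; the zip column has a leading zero worth keeping.
text = "id,age,zip\n1,30,02134\n2,25,90210\n"

df = pd.read_csv(io.StringIO(text), dtype={"age": "int32", "zip": str})

print(df.dtypes)
print(df["zip"].tolist())
```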
Multi-character separator
By default, Pandas read_csv() uses a C parser engine for high performance. The C parser engine can only handle single-character separators (plus the special case '\s+'). If your CSV has a multi-character separator, you will need to use the 'python' engine, which interprets such a separator as a regular expression.
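A minimal sketch with a hypothetical "::" separator. Because multi-character separators are treated as regular expressions, a separator that contains regex metacharacters (for example "||") would need escaping, such as sep=r"\|\|".

```python
import io

import pandas as pd

# Hypothetical data using a two-character separator.
text = "a::b::c\n1::2::3\n4::5::6\n"

# engine="python" is stated explicitly to avoid a fallback warning.
df = pd.read_csv(io.StringIO(text), sep="::", engine="python")
print(df)
```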
UnicodeDecodeError while read_csv()
UnicodeDecodeError occurs when the data was stored in one encoding but is read with a different, incompatible one. The usual fix is to pass the file's actual encoding (for example 'latin-1' or 'cp1252') to read_csv() through the encoding parameter.
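The sketch below reproduces the error with hypothetical bytes written as Latin-1 and then retries with the matching encoding. Note that 'latin-1' can decode any byte sequence, so it never raises, but it only yields the right characters if the file really was written in that encoding.

```python
import io

import pandas as pd

# Hypothetical bytes written as Latin-1; 0xE9 is not valid UTF-8 here.
raw = "drink,price\ncafé,3\n".encode("latin-1")

try:
    df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")
    failed = False
except UnicodeDecodeError:
    # Retry with the encoding the data was actually written in.
    failed = True
    df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

print(failed)
print(df)
```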
"Unnamed: 0" while read_csv()
An "Unnamed: 0" column appears when a DataFrame with an unnamed index is saved to CSV and then read back. It is not an error: the saved index is simply read back as an ordinary column with no name. To fix it, pass index_col=0 (or index_col=[0]) to read_csv() so that the first column is read in as the index.
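A small round trip shows both the problem and the fix. The DataFrame contents are hypothetical, and to_csv() is called without a path so that it returns the CSV text instead of writing a file.

```python
import io

import pandas as pd

# Hypothetical DataFrame with a default, unnamed index.
original = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [30, 25]})
csv_text = original.to_csv()  # the index is written as a nameless first column

# Reading it back naively produces the extra "Unnamed: 0" column.
reread = pd.read_csv(io.StringIO(csv_text))

# Treating the first column as the index avoids it.
fixed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(reread.columns.tolist())
print(fixed.columns.tolist())
```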
Alternatively, avoid the problem when writing by passing index=False to to_csv(), so that the index is not written to the file at all.
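A sketch of the write-side fix, using a hypothetical DataFrame: with index=False the index is not written, so the file reads back with only the real columns.

```python
import io

import pandas as pd

# Hypothetical DataFrame with a default index.
original = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [30, 25]})

# index=False leaves the index out of the CSV output.
csv_text = original.to_csv(index=False)
roundtrip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(roundtrip.columns.tolist())
```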
Error tokenizing data while read_csv()
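In most cases, the cause is either (1) the delimiter in your data not matching the one pandas expects, or (2) pandas being confused by the headers or column layout of the file. Possible solutions:
If your source data has no row for headers/column titles, pass header=None so that the first line is not used to define the columns.
Alternatively, pass on_bad_lines='skip' (error_bad_lines=False before pandas 1.3), which causes the offending lines to be skipped.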
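To get information about the rows causing the error, use on_bad_lines='warn' (pandas 1.3 and later), which skips each bad line and reports it in a warning. Older versions used the combination of error_bad_lines=False and warn_bad_lines=True; both arguments were deprecated in pandas 1.3 and removed in pandas 2.0.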
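The sketch below assumes pandas 1.3 or later and a hypothetical file in which one row has an extra field. With on_bad_lines='skip' the row is dropped silently, and with on_bad_lines='warn' it is dropped and reported.

```python
import io

import pandas as pd

# Hypothetical data: the second data row has four fields instead of three.
text = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# Drop malformed rows silently.
skipped = pd.read_csv(io.StringIO(text), on_bad_lines="skip")

# Drop malformed rows but emit a warning naming each offending line.
warned = pd.read_csv(io.StringIO(text), on_bad_lines="warn")

print(skipped)
```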
FileNotFoundError
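In most cases, just put r before the opening quote of your file path, because the backslash (\) starts an escape sequence in an ordinary string.
Here r is a string prefix that marks a raw string, in which backslashes are kept literally.
Another way is to use \\ in your string to escape each \.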
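The difference can be seen on plain strings, without reading any file. The Windows-style path below is hypothetical: in the unprefixed version, \n and \t are silently turned into a newline and a tab, so the path no longer matches anything on disk.

```python
# Hypothetical Windows path, written three ways.
raw_path = r"C:\new_folder\table.csv"        # raw string: backslashes kept
escaped_path = "C:\\new_folder\\table.csv"   # each backslash escaped
broken_path = "C:\new_folder\table.csv"      # \n and \t become control characters

print(raw_path == escaped_path)
print(repr(broken_path))

# Either of the first two forms can be passed on, e.g. pd.read_csv(raw_path).
```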
MemoryError
Memory errors happen a lot with Python when using the 32-bit Windows version. This is because 32-bit processes only get 2 GB of memory to work with by default.
One solution is the dtype option of the pandas.read_csv() function, which tells pandas what types exist inside your CSV data instead of leaving it to infer them.
For example, specifying dtype={'age': int} as an option to read_csv() lets pandas know that age should be interpreted as an integer. This can save a lot of memory when the column would otherwise be stored as generic Python objects, and narrower types such as 'int32' or 'category' can reduce the footprint further.
Or try the solution below:
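The original snippet is not shown here. One commonly used option, offered as an assumption rather than as the author's exact solution, is to read the file in pieces with the chunksize parameter so that only one chunk is held in memory at a time. The data and chunk size below are hypothetical.

```python
import io

import pandas as pd

# Hypothetical data; a real use case would be a file too large for memory.
text = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n6,60\n"

total = 0
n_chunks = 0

# With chunksize set, read_csv() returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(text), chunksize=2):
    total += int(chunk["value"].sum())
    n_chunks += 1

print(n_chunks, total)
```

Aggregating per chunk, as above, keeps memory use bounded by the chunk size; collecting all chunks into a list and concatenating them would bring the full data back into memory.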
Conclusion
The read_csv() function in Pandas simplifies the process of reading data from CSV files and creating DataFrames. Its many parameters let you handle different CSV layouts, customize the import process, and deal with missing values, and most of the common errors covered above can be fixed by passing the right argument.