Loading and working with data is a critical task in data science and machine learning. Python, with its powerful pandas library, makes it easy to handle various data formats such as CSV, Excel, JSON, Parquet, and Avro. In this article, we will walk through different ways to load data from these file formats into a pandas DataFrame.
1. Loading a CSV File
Problem:
You need to import a comma-separated values (CSV) file into a pandas DataFrame.
Solution:
The pandas library provides the read_csv function to load local or hosted CSV files into a DataFrame.
# Load library
import pandas as pd
# Create URL
url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/data.csv'
# Load dataset
dataframe = pd.read_csv(url)
# View the first two rows
print(dataframe.head(2))
Discussion:
There are two things to keep in mind when loading CSV files:
- Quick File Inspection: It can be helpful to take a quick look at the contents of the file before loading. This allows you to understand the structure and see if there are any special parameters to use when loading the file.
- Understanding Parameters: The read_csv function has over 30 parameters to handle different formats and variations of CSV files. One key parameter is sep, which defines the delimiter. CSVs typically use commas, but some files use tabs instead (these are often called TSV files). Another is header: if the first line of the file defines the column names, use the header parameter to indicate this. If there are no headers, set header=None.
Example:
# Load dataset with no headers
dataframe = pd.read_csv(url, header=None)
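To sketch how the sep parameter works, the snippet below builds a small tab-separated dataset in memory (the column names and values are invented for illustration) and loads it by telling read_csv to split on tabs:

```python
import io

import pandas as pd

# A small tab-separated dataset built in memory for illustration
tsv_data = "name\tscore\nAlice\t90\nBob\t85\n"

# Tell read_csv that fields are separated by tabs, not commas
dataframe = pd.read_csv(io.StringIO(tsv_data), sep='\t')
print(dataframe)
```

The same call works for a file on disk or a URL; io.StringIO is used here only to keep the example self-contained.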
2. Loading an Excel File
Problem:
You need to import data from an Excel spreadsheet.
Solution:
Use the read_excel function from pandas to load the Excel file.
# Load library
import pandas as pd
# Create URL
url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/data.xlsx'
# Load data from the first sheet
dataframe = pd.read_excel(url, sheet_name=0)
# View the first two rows
print(dataframe.head(2))
Discussion:
The solution is similar to loading a CSV file, with one additional parameter, sheet_name, which specifies the sheet in the Excel file to load. It accepts a string for the sheet name or an integer for the sheet index (0-based). If you need to load multiple sheets, you can pass a list of sheet names or indices.
Example:
# Load multiple sheets as a dictionary of DataFrames
data = pd.read_excel(url, sheet_name=[0, 1, "Monthly Sales"])
3. Loading a JSON File
Problem:
You need to load a JSON file for data preprocessing.
Solution:
Use read_json from pandas to convert a JSON file into a DataFrame.
# Load library
import pandas as pd
# Create URL
url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/data.json'
# Load data
dataframe = pd.read_json(url, orient='columns')
# View the first two rows
print(dataframe.head(2))
Discussion:
The orient parameter specifies how the JSON data is structured. Depending on the file, you might need to experiment with different arguments (split, records, index, etc.) to determine which one works best. Additionally, pandas provides json_normalize to handle nested or semi-structured JSON data.
Example:
from pandas import json_normalize
# A nested JSON structure: each employee has a nested address
data = {
    "employees": [
        {"name": "John", "age": 30, "address": {"city": "New York", "zip": "10001"}},
        {"name": "Anna", "age": 22, "address": {"city": "London", "zip": "SW1A 1AA"}},
    ]
}
# json_normalize flattens nested fields into columns such as 'address.city'
df = json_normalize(data['employees'])
print(df)
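To see what the orient argument changes in practice, the sketch below serializes the same two rows (invented data) in the records and split layouts and reads each back with the matching orient value:

```python
import io

import pandas as pd

# The same two rows serialized in two different JSON layouts (invented data)
records_json = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'
split_json = '{"columns": ["a", "b"], "index": [0, 1], "data": [[1, 2], [3, 4]]}'

# Each layout needs the matching orient argument to decode correctly
df_records = pd.read_json(io.StringIO(records_json), orient='records')
df_split = pd.read_json(io.StringIO(split_json), orient='split')

print(df_records)
print(df_split)
```

Both layouts decode to the same two-column DataFrame; picking the wrong orient typically raises an error or produces a strangely shaped result, which is why some experimentation can be needed.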
4. Loading a Parquet File
Problem:
You need to load a Parquet file.
Solution:
Use the read_parquet function from pandas.
# Load library
import pandas as pd
# Create URL
url = 'https://machine-learning-python-cookbook.s3.amazonaws.com/data.parquet'
# Load data
dataframe = pd.read_parquet(url)
# View the first two rows
print(dataframe.head(2))
Discussion:
Parquet is a popular format for storing large-scale data efficiently. It is often used in big data tools such as Hadoop and Spark. While working with big data, you may frequently encounter this format.
5. Loading an Avro File
Problem:
You need to load an Avro file into a pandas DataFrame.
Solution:
Use the pandavro library to read Avro files.
# Load libraries
import requests
import pandavro as pdx
# Create URL
url = 'https://machine-learning-python-cookbook.s3.amazonaws.com/data.avro'
# Download file
r = requests.get(url)
with open('data.avro', 'wb') as f:
    f.write(r.content)
# Load data
dataframe = pdx.read_avro('data.avro')
# View the first two rows
print(dataframe.head(2))
Discussion:
Avro is another binary data format, often used with large data systems. Although it is less common than Parquet, it is worth knowing how to handle Avro, especially if you work with large data infrastructures.
6. Querying Data from a SQLite Database
Problem:
You need to load data from a database using SQL.
Solution:
Use pandas' read_sql_query function to query data from a database.
# Load libraries
import pandas as pd
from sqlalchemy import create_engine
# Create a connection to the SQLite database
database_connection = create_engine('sqlite:///sample.db')
# Execute SQL query and load data
dataframe = pd.read_sql_query('SELECT * FROM data', database_connection)
# View the first two rows
print(dataframe.head(2))
Discussion:
SQL is essential for querying data from databases. In this example, we use create_engine from SQLAlchemy to establish a connection to an SQLite database, and then use pandas to execute an SQL query. The read_sql_query function allows you to load the result of an SQL query directly into a DataFrame.
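For a fully self-contained sketch, the snippet below builds a throwaway SQLite database in memory using Python's built-in sqlite3 module (the table name, columns, and rows are invented) and queries it with read_sql_query, which also accepts a plain DB-API connection:

```python
import sqlite3

import pandas as pd

# Build an in-memory SQLite database with a small sample table
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE data (name TEXT, age INTEGER)')
connection.executemany('INSERT INTO data VALUES (?, ?)',
                       [('John', 30), ('Anna', 22)])
connection.commit()

# Load the query result directly into a DataFrame
dataframe = pd.read_sql_query('SELECT * FROM data WHERE age > 25', connection)
print(dataframe)
```

Pushing the filtering into the SQL WHERE clause, as here, lets the database do the work and keeps only the rows you need in memory.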
Conclusion:
In this guide, we explored how to load data from different file formats, including CSV, Excel, JSON, Parquet, Avro, and SQL databases, into a pandas DataFrame. Each format has its specific parameters and considerations, but with pandas, handling different data formats becomes relatively straightforward. Whether you're working with local files or pulling data from hosted URLs, pandas provides a powerful interface to load, manipulate, and analyze data in Python.
Make sure to consult the documentation for additional options and customization when handling larger datasets or more complex formats.