How to Read CSV Files in Python: Complete Tutorial

Reading CSV files is one of the most common tasks in data analysis. Here's how to do it efficiently with pandas and the built-in csv module:

1. Basic CSV Reading with Pandas

import pandas as pd

# Read a CSV file into a DataFrame
df = pd.read_csv('data.csv')

# Display the first five rows
print(df.head())

# Check shape
print(df.shape)  # (rows, columns)
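
For a fuller first look, df.info() summarizes column names, dtypes, non-null counts, and memory usage in one call:

# Overview of columns, dtypes, and memory usage
df.info()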

2. Handle Different Delimiters

# Tab-separated values
df = pd.read_csv('data.tsv', sep='\t')

# Semicolon-separated
df = pd.read_csv('data.csv', sep=';')

# Custom delimiter
df = pd.read_csv('data.txt', sep='|')

# Auto-detect delimiter (slower)
df = pd.read_csv('data.csv', sep=None, engine='python')

3. Specify Column Data Types

# Define data types for better performance
dtypes = {
    'user_id': 'int64',
    'name': 'string',
    'age': 'int64',
    'salary': 'float64',
    'active': 'bool'  # use 'boolean' if the column may contain missing values
}

df = pd.read_csv('data.csv', dtype=dtypes)

# Parse date columns while reading
df = pd.read_csv('data.csv',
                 parse_dates=['created_at', 'updated_at'])
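
If read-time parsing turns out to be slow (pandas may have to infer the date format), a common alternative is to parse after reading with an explicit format string, which skips inference; the column name and format below are just illustrative:

# Alternative: parse after reading with an explicit format
df = pd.read_csv('data.csv')
df['created_at'] = pd.to_datetime(df['created_at'],
                                  format='%Y-%m-%d %H:%M:%S')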

4. Handle Missing Values

# Specify missing value indicators
df = pd.read_csv('data.csv',
                 na_values=['NA', 'N/A', 'null', 'NULL', ''])

# Keep default NaN values and add custom ones
df = pd.read_csv('data.csv',
                 keep_default_na=True,
                 na_values=['missing', 'unknown'])

# Don't interpret any values as NaN
df = pd.read_csv('data.csv', keep_default_na=False)
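
To check how those settings played out, count the missing values per column after loading:

# Count missing values per column
print(df.isna().sum())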

5. Read Large Files Efficiently

# Read in chunks for large files
chunk_size = 10000
chunks = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    processed = chunk[chunk['age'] > 18]
    chunks.append(processed)

df = pd.concat(chunks, ignore_index=True)

# Read only specific columns (faster)
df = pd.read_csv('data.csv', usecols=['name', 'age', 'city'])

# Skip rows
df = pd.read_csv('data.csv', skiprows=5)  # Skip first 5 rows
df = pd.read_csv('data.csv', nrows=1000)  # Read first 1000 rows only
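
If you only need an aggregate, you can also avoid keeping every chunk around; this sketch computes a mean over the 'age' column used above, one chunk at a time:

# Aggregate across chunks without storing them
total = 0.0
count = 0

for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    total += chunk['age'].sum()    # sum() skips NaN values
    count += chunk['age'].count()  # count() excludes NaN values

print('Mean age:', total / count)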

6. Handle Headers and Column Names

# No header in file
df = pd.read_csv('data.csv', header=None)

# Custom column names
df = pd.read_csv('data.csv',
                 names=['col1', 'col2', 'col3'])

# Header on different row
df = pd.read_csv('data.csv', header=2)  # 3rd row is header

# Use first column as index
df = pd.read_csv('data.csv', index_col=0)
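
One gotcha with names: if the file already has a header row, pandas will read that row as data. Pass header=0 alongside names to discard the file's own header (the names here are illustrative):

# Replace an existing header with your own column names
df = pd.read_csv('data.csv', header=0,
                 names=['user_id', 'full_name', 'age'])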

7. Using Built-in csv Module

import csv

# Read CSV with csv.reader (open with newline='', as the csv docs recommend)
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.reader(file)
    header = next(csv_reader)  # Get header

    for row in csv_reader:
        print(row)  # Each row is a list

# Read as dictionaries (easier to work with)
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.DictReader(file)

    for row in csv_reader:
        print(row['name'], row['age'])  # Access by column name
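
The standard library can also detect the delimiter for you with csv.Sniffer, which inspects a sample of the file and returns a dialect:

# Detect the dialect (delimiter, quoting) from a sample
with open('data.csv', 'r', newline='') as file:
    sample = file.read(4096)
    dialect = csv.Sniffer().sniff(sample)
    file.seek(0)  # Rewind before reading the rows

    for row in csv.reader(file, dialect):
        print(row)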

8. Read from URLs

# Read directly from URL
url = 'https://example.com/data.csv'
df = pd.read_csv(url)

# Read compressed files
df = pd.read_csv('data.csv.gz', compression='gzip')
df = pd.read_csv('data.csv.zip', compression='zip')

# Compression is inferred from the file extension by default
df = pd.read_csv('data.csv.gz', compression='infer')

9. Handle Encoding Issues

# Specify encoding (common for international data)
df = pd.read_csv('data.csv', encoding='utf-8')
df = pd.read_csv('data.csv', encoding='latin-1')  # same encoding as ISO-8859-1

# Skip undecodable bytes (encoding_errors requires pandas 1.3+)
df = pd.read_csv('data.csv', encoding='utf-8', encoding_errors='ignore')
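
When the encoding is unknown, a common pattern is to try UTF-8 first and fall back to latin-1, which can decode any byte sequence (though non-ASCII characters may be mis-mapped):

# Try UTF-8 first, then fall back to a permissive encoding
try:
    df = pd.read_csv('data.csv', encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv('data.csv', encoding='latin-1')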

Best Practices

Pro Tip: When reading large CSV files, combine the dtype and usecols parameters to cut memory usage. Pandas never loads the columns you skip and doesn't have to infer types for the ones you keep, which can shrink the in-memory footprint dramatically.
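
As a sketch of that combination (the column names and dtypes are illustrative):

# Read only the needed columns with compact dtypes
df = pd.read_csv('large_file.csv',
                 usecols=['user_id', 'age', 'salary'],
                 dtype={'user_id': 'int32',
                        'age': 'int8',
                        'salary': 'float32'})

# Verify the memory footprint
df.info(memory_usage='deep')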
