How to Read CSV Files in Python: Complete Tutorial
Reading CSV files is one of the most common tasks in data analysis. Here's how to do it efficiently with pandas and the built-in csv module:
1. Basic CSV Reading with Pandas
import pandas as pd
# Read a CSV file into a DataFrame
df = pd.read_csv('data.csv')
# Display first rows
print(df.head())
# Check shape
print(df.shape) # (rows, columns)
2. Handle Different Delimiters
# Tab-separated values
df = pd.read_csv('data.tsv', sep='\t')
# Semicolon-separated
df = pd.read_csv('data.csv', sep=';')
# Custom delimiter
df = pd.read_csv('data.txt', sep='|')
# Auto-detect delimiter (slower)
df = pd.read_csv('data.csv', sep=None, engine='python')
3. Specify Column Data Types
# Define data types for better performance
dtypes = {
    'user_id': 'int64',
    'name': 'string',
    'age': 'int64',
    'salary': 'float64',
    'active': 'bool'
}
df = pd.read_csv('data.csv', dtype=dtypes)
# Parse date columns while reading instead of converting afterwards
df = pd.read_csv('data.csv',
                 parse_dates=['created_at', 'updated_at'])
4. Handle Missing Values
# Specify missing value indicators
df = pd.read_csv('data.csv',
                 na_values=['NA', 'N/A', 'null', 'NULL', ''])
# Keep default NaN values and add custom ones
df = pd.read_csv('data.csv',
                 keep_default_na=True,
                 na_values=['missing', 'unknown'])
# Don't interpret any values as NaN
df = pd.read_csv('data.csv', keep_default_na=False)
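After loading, it's worth checking that your missing-value markers were actually converted. A quick per-column count of missing values:
# Count NaNs in each column
print(df.isna().sum())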
5. Read Large Files Efficiently
# Read in chunks for large files
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk: keep only rows with age over 18
    processed = chunk[chunk['age'] > 18]
    chunks.append(processed)
df = pd.concat(chunks, ignore_index=True)
# Read only specific columns (faster)
df = pd.read_csv('data.csv', usecols=['name', 'age', 'city'])
# Skip rows
df = pd.read_csv('data.csv', skiprows=5) # Skip first 5 rows
df = pd.read_csv('data.csv', nrows=1000) # Read first 1000 rows only
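To quantify the savings from usecols and other options, you can measure a DataFrame's memory footprint directly:
# Total memory usage in bytes (deep=True also measures string contents)
print(df.memory_usage(deep=True).sum())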
6. Handle Headers and Column Names
# No header in file
df = pd.read_csv('data.csv', header=None)
# Custom column names (also pass header=0 if the file has a header row to replace)
df = pd.read_csv('data.csv',
                 names=['col1', 'col2', 'col3'])
# Header on different row
df = pd.read_csv('data.csv', header=2) # 3rd row is header
# Use first column as index
df = pd.read_csv('data.csv', index_col=0)
7. Using Built-in csv Module
import csv
# Read CSV with csv.reader
with open('data.csv', 'r', newline='') as file:  # newline='' is recommended for the csv module
    csv_reader = csv.reader(file)
    header = next(csv_reader)  # Get the header row
    for row in csv_reader:
        print(row)  # Each row is a list of strings
# Read as dictionaries (easier to work with)
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        print(row['name'], row['age'])  # Access fields by column name
8. Read from URLs
# Read directly from URL
url = 'https://example.com/data.csv'
df = pd.read_csv(url)
# Read compressed files
df = pd.read_csv('data.csv.gz', compression='gzip')
df = pd.read_csv('data.csv.zip', compression='zip')
# Compression is inferred from the file extension by default
df = pd.read_csv('data.csv.gz', compression='infer')
9. Handle Encoding Issues
# Specify encoding (common for international data)
df = pd.read_csv('data.csv', encoding='utf-8')
df = pd.read_csv('data.csv', encoding='latin-1')
df = pd.read_csv('data.csv', encoding='ISO-8859-1')
# Skip bytes that can't be decoded (characters may be silently dropped)
df = pd.read_csv('data.csv', encoding='utf-8', encoding_errors='ignore')
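If the encoding is unknown up front, one pragmatic option is to try a few common encodings in order. This is a minimal sketch; the candidate list is an assumption to adjust for your data:
# Try common encodings until one decodes the file
# latin-1 accepts any byte sequence, so it serves as a last resort
for enc in ['utf-8', 'cp1252', 'latin-1']:
    try:
        df = pd.read_csv('data.csv', encoding=enc)
        print(f'Loaded with encoding: {enc}')
        break
    except UnicodeDecodeError:
        continue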
Best Practices
- ✅ Use pandas for data analysis tasks (faster, more features)
- ✅ Use the built-in csv module for simple reading/writing without dependencies
- ✅ Specify dtype to reduce memory usage and improve performance
- ✅ Use parse_dates to convert date columns while reading instead of afterwards
- ✅ Read large files in chunks to avoid memory issues
- ✅ Use usecols to read only the columns you need
- ❌ Don't read an entire large file into memory if you only need part of it
- ⚠️ Always check the encoding if you see strange characters
Pro Tip: When reading large CSV files, use the dtype and usecols parameters together to dramatically reduce memory usage. Reading only the columns you need, with appropriate data types, can shrink the memory footprint by a large factor.
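A combined call might look like this, reusing the example columns from earlier (the chosen dtypes are illustrative assumptions):
# Load only the needed columns, each with an explicit type
# 'category' is a good fit for low-cardinality string columns
df = pd.read_csv('large_file.csv',
                 usecols=['user_id', 'age', 'city'],
                 dtype={'user_id': 'int64', 'age': 'int32', 'city': 'category'})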