How to Reduce Pandas Memory Usage by 90%
Large DataFrames can consume gigabytes of RAM. Here's how to shrink them dramatically:
1. Use Smaller Numeric Types
# Default: pandas stores integers as int64, 8 bytes per value
df['age'].dtype  # int64, even though ages only span 0-100
# Optimized: int8 uses 1 byte (-128 to 127 range)
df['age'] = df['age'].astype('int8')  # 87.5% memory saved!
# For decimal columns, float32 halves the cost of float64:
df['price'] = df['price'].astype('float32')  # 50% saved
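Caveat: casting a value outside the target range can silently wrap (or raise, depending on your pandas/NumPy version), so check the column's actual range first. A minimal sketch using NumPy's type bounds:
import numpy as np
# Verify the data actually fits before shrinking the type
lo, hi = df['age'].min(), df['age'].max()
bounds = np.iinfo('int8')
if bounds.min <= lo and hi <= bounds.max:
    df['age'] = df['age'].astype('int8')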
2. Use Categorical for Repeated Strings
# Before: Each country name stored separately
df['country'].memory_usage(deep=True) # 50 MB for 1M rows
# After: Categories stored once, integers used for values
df['country'] = df['country'].astype('category')
df['country'].memory_usage(deep=True) # 1 MB! (98% reduction)
When to use category: columns where the number of unique values is well under 50% of the row count (states, countries, product categories, status codes).
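A quick way to apply that rule before converting (the 0.5 threshold is a rule of thumb, not a hard cutoff):
# Convert only if the column is genuinely low-cardinality
if df['country'].nunique() / len(df) < 0.5:
    df['country'] = df['country'].astype('category')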
3. Read Only Needed Columns
# Bad: Load entire 50-column CSV ❌
df = pd.read_csv('data.csv')
# Good: Load only 5 needed columns ✅
df = pd.read_csv('data.csv', usecols=['date', 'user_id', 'amount', 'status', 'country'])
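You can go further and declare compact dtypes at read time, so the wide default representation never exists in memory. A sketch using the same columns:
df = pd.read_csv(
    'data.csv',
    usecols=['date', 'user_id', 'amount', 'status', 'country'],
    dtype={'status': 'category', 'country': 'category', 'amount': 'float32'},
    parse_dates=['date'],  # store dates as datetime64, not strings
)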
4. Automatic Downcasting
def optimize_dtypes(df):
    """Automatically optimize DataFrame dtypes"""
    # Optimize integers
    int_cols = df.select_dtypes(include=['int']).columns
    for col in int_cols:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    # Optimize floats
    float_cols = df.select_dtypes(include=['float']).columns
    for col in float_cols:
        df[col] = pd.to_numeric(df[col], downcast='float')
    # Convert low-cardinality strings to category
    obj_cols = df.select_dtypes(include=['object']).columns
    for col in obj_cols:
        if df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')
    return df
# Use it:
df = optimize_dtypes(df)
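If the raw file doesn't even fit in memory, the same function works chunk by chunk. One caveat in this sketch: pd.concat can widen category columns back to object when different chunks saw different category sets, so the optimizer runs once more on the combined result:
chunks = [optimize_dtypes(chunk)
          for chunk in pd.read_csv('data.csv', chunksize=100_000)]
df = optimize_dtypes(pd.concat(chunks, ignore_index=True))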
5. Check Memory Usage
# See memory by column (deep=True measures string contents too)
df.memory_usage(deep=True)
# Total memory in MB
df.memory_usage(deep=True).sum() / 1024**2
# Compare before/after optimization
before_mb = df.memory_usage(deep=True).sum() / 1024**2
df = optimize_dtypes(df)
after_mb = df.memory_usage(deep=True).sum() / 1024**2
print(f"Before: {before_mb:.1f} MB")
print(f"After: {after_mb:.1f} MB")
print(f"Saved: {(1 - after_mb/before_mb)*100:.1f}%")
Quick Reference Table
| Data Type | Range / Precision | Memory per value |
|---|---|---|
| int8 | -128 to 127 | 1 byte |
| int16 | -32,768 to 32,767 | 2 bytes |
| int32 | about ±2.1 billion | 4 bytes |
| int64 (default) | about ±9.2 quintillion | 8 bytes |
| float32 | ~7 significant digits | 4 bytes |
| float64 (default) | ~15-16 significant digits | 8 bytes |
| category | any repeated values | 1-4 bytes (integer code) + one copy of each category |
Pro Tip: Run the optimization function right after loading data. On a typical 1M row dataset with mixed types, you can expect 50-90% memory reduction.