Useful Data Tips

How to Reduce Pandas Memory Usage by 90%

⏱️ 35 sec read 🐍 Python

Large DataFrames can consume gigabytes of RAM. Here's how to shrink them dramatically:

1. Use Smaller Numeric Types

# Default: pandas stores integers as int64 (8 bytes per value)
df['age'].dtype  # int64, even when values only span 0-100

# Optimized: int8 uses 1 byte (-128 to 127 range)
df['age'] = df['age'].astype('int8')   # 87.5% memory saved!

# For larger ranges:
df['price'] = df['price'].astype('float32')  # Instead of float64 (50% saved)
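One caveat before casting by hand: astype does not range-check, so out-of-range values wrap around silently. A small guard helps (safe_downcast_int is a hypothetical helper name, not a pandas API):

```python
import numpy as np
import pandas as pd

def safe_downcast_int(s, dtype='int8'):
    # astype silently wraps out-of-range values (e.g. 200 becomes -56 as int8),
    # so check the target type's bounds before casting.
    info = np.iinfo(dtype)
    if s.min() >= info.min and s.max() <= info.max:
        return s.astype(dtype)
    return s  # values don't fit; keep the original dtype
```
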

2. Use Categorical for Repeated Strings

# Before: Each country name stored separately
df['country'].memory_usage(deep=True)  # 50 MB for 1M rows

# After: Categories stored once, integers used for values
df['country'] = df['country'].astype('category')
df['country'].memory_usage(deep=True)  # 1 MB! (98% reduction)

When to use category: columns where unique values make up well under 50% of rows (states, countries, categories, status codes)
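The savings are easy to verify yourself. A sketch with an illustrative column (only four distinct values repeated across a million rows):

```python
import pandas as pd

# Illustrative data: 1M rows, only four distinct country names
countries = pd.Series(['USA', 'Germany', 'Japan', 'Brazil'] * 250_000)

# object dtype stores every string separately; category stores
# each unique string once plus a compact integer code per row
as_object = countries.memory_usage(deep=True)
as_category = countries.astype('category').memory_usage(deep=True)
print(f"object:   {as_object / 1024**2:.1f} MB")
print(f"category: {as_category / 1024**2:.1f} MB")
```
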

3. Read Only Needed Columns

# Bad: Load entire 50-column CSV ❌
df = pd.read_csv('data.csv')

# Good: Load only 5 needed columns ✅
df = pd.read_csv('data.csv', usecols=['date', 'user_id', 'amount', 'status', 'country'])
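You can go one step further and set dtypes at read time, so the full-width int64/float64 columns are never materialized. A sketch (the in-memory CSV stands in for a real file; column names are illustrative):

```python
import io
import pandas as pd

# Stand-in for a real file on disk
csv = io.StringIO(
    "date,user_id,amount,status,country\n"
    "2024-01-01,1,9.99,ok,US\n"
    "2024-01-02,2,4.50,ok,DE\n"
)

# usecols trims columns; dtype sets compact types during parsing
df = pd.read_csv(
    csv,
    usecols=['user_id', 'amount', 'status'],
    dtype={'user_id': 'int32', 'amount': 'float32', 'status': 'category'},
)
print(df.dtypes)
```
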

4. Automatic Downcasting

def optimize_dtypes(df):
    """Downcast numeric columns and convert low-cardinality
    object columns to category. Modifies df in place."""

    # Optimize integers
    int_cols = df.select_dtypes(include=['int']).columns
    for col in int_cols:
        df[col] = pd.to_numeric(df[col], downcast='integer')

    # Optimize floats
    float_cols = df.select_dtypes(include=['float']).columns
    for col in float_cols:
        df[col] = pd.to_numeric(df[col], downcast='float')

    # Convert low-cardinality strings to category
    obj_cols = df.select_dtypes(include=['object']).columns
    for col in obj_cols:
        if df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')

    return df

# Use it:
df = optimize_dtypes(df)

5. Check Memory Usage

# See memory by column
df.memory_usage(deep=True)

# Total memory in MB
df.memory_usage(deep=True).sum() / 1024**2

# Compare before/after optimization
before_mb = df.memory_usage(deep=True).sum() / 1024**2
df = optimize_dtypes(df)
after_mb = df.memory_usage(deep=True).sum() / 1024**2

print(f"Before: {before_mb:.1f} MB")
print(f"After: {after_mb:.1f} MB")
print(f"Saved: {(1 - after_mb/before_mb)*100:.1f}%")

Quick Reference Table

Data Type        Range                 Memory
int8             -128 to 127           1 byte
int16            -32,768 to 32,767     2 bytes
int32            ~±2.1 billion         4 bytes
int64 (default)  ~±9.2 quintillion     8 bytes
category         any repeated values   1-4 bytes + overhead
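Rather than memorizing these bounds, you can query them directly from NumPy:

```python
import numpy as np

# Print the exact representable range for each integer width
for dtype in ('int8', 'int16', 'int32', 'int64'):
    info = np.iinfo(dtype)
    print(f"{dtype}: {info.min} to {info.max}")
```
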

Pro Tip: Run the optimization function right after loading data. On a typical 1M row dataset with mixed types, you can expect 50-90% memory reduction.
