Useful Data Tips

Why Vectorization is Faster Than Loops in Pandas


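Row-by-row Python loops over a DataFrame are slow. A vectorized version of the same computation is often 100x-1000x faster, depending on the operation and the size of the data. Here's why: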

The Performance Difference

# Slow: Python loop (12 seconds on 1M rows) ❌
# Assumes df has the default 0..n-1 index, so df.loc[i, ...] finds row i
total = 0
for i in range(len(df)):
    total += df.loc[i, 'price'] * df.loc[i, 'quantity']

# Fast: Vectorized (0.01 seconds) ✅
total = (df['price'] * df['quantity']).sum()
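
Timings like the ones above vary with hardware and library versions, so it helps to measure on your own machine. Here is a minimal benchmark sketch, assuming a synthetic DataFrame with made-up values; the function names `loop_total` and `vectorized_total` are illustrative, and the row count is kept small so the loop finishes quickly.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data: the column names match the example above, the values are made up.
rng = np.random.default_rng(0)
n_rows = 10_000  # small on purpose; raise it to see the gap widen
df = pd.DataFrame({
    "price": rng.uniform(1, 100, n_rows),
    "quantity": rng.integers(1, 10, n_rows),
})


def loop_total(frame):
    """Row-by-row version: one pandas lookup per cell."""
    total = 0.0
    for i in range(len(frame)):
        total += frame.loc[i, "price"] * frame.loc[i, "quantity"]
    return total


def vectorized_total(frame):
    """Column-wise version: one multiply and one sum over whole arrays."""
    return (frame["price"] * frame["quantity"]).sum()


start = time.perf_counter()
slow = loop_total(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = vectorized_total(df)
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vectorized_seconds:.4f}s")
```

Both versions compute the same total (up to floating-point rounding); only the time differs.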

Why Vectorization is Faster

1. Compiled C Code

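Pandas/NumPy operations run in compiled code (mostly C and Cython), not interpreted Python. For tight numeric loops, compiled code is often one to two orders of magnitude faster than the equivalent Python.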

2. No Python Overhead

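Python loops pay overhead on every iteration: the interpreter dispatches bytecode, checks types at runtime, wraps each value in a Python object, and, with df.loc, performs a label lookup.

Vectorized operations pay these costs once per column instead of once per row.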
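
To get a feel for the per-iteration cost, you can time a single element access on its own. This is a rough sketch using `timeit` on a small made-up column; the absolute numbers will differ between machines, so treat them as relative only.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.arange(1000, dtype="float64")})
prices = df["price"].to_numpy()
repeats = 2_000

# One label-based lookup through the pandas indexing machinery.
loc_seconds = timeit.timeit(lambda: df.loc[500, "price"], number=repeats) / repeats

# One element read from the underlying NumPy array (still a Python-level call).
array_seconds = timeit.timeit(lambda: prices[500], number=repeats) / repeats

# Amortized per-element cost when NumPy walks the whole array in compiled code.
sum_seconds = timeit.timeit(lambda: prices.sum(), number=repeats) / (repeats * len(prices))

print(f"df.loc lookup:     {loc_seconds * 1e9:.0f} ns per element")
print(f"NumPy index:       {array_seconds * 1e9:.0f} ns per element")
print(f"inside prices.sum: {sum_seconds * 1e9:.2f} ns per element")
```

The first two lines measure work done once per row in a loop; the last shows what each element costs when the loop runs inside NumPy.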

3. CPU Optimization (SIMD)

Modern CPUs can process multiple numbers with a single instruction (SIMD: Single Instruction, Multiple Data). NumPy uses SIMD for many operations on contiguous numeric arrays; a Python-level loop handles one value at a time and cannot.

4. Better Memory Access

Numeric columns are stored as contiguous arrays, so vectorized operations read memory sequentially. This is cache-friendly and much faster than jumping around between scattered Python objects.
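
You can inspect this layout directly. The sketch below uses two small made-up Series: a float column, which is one block of 8-byte values, and a column explicitly created with `dtype=object`, which holds references to separate Python objects.

```python
import numpy as np
import pandas as pd

prices = pd.Series([1.5, 2.5, 3.5]).to_numpy()
print(prices.dtype)                  # float64
print(prices.itemsize)               # 8 (bytes per value)
print(prices.strides)                # (8,): each value sits right after the previous one
print(prices.flags["C_CONTIGUOUS"])  # True

# An object column stores references; the values themselves live elsewhere on the heap.
names = pd.Series(["a", "b", "c"], dtype=object).to_numpy()
print(names.dtype)                   # object
```

Operations on the float array can stream through memory; operations on the object array have to follow a reference for every element.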

Common Vectorization Patterns

Conditional Logic

# Slow ❌
for i in range(len(df)):
    if df.loc[i, 'age'] > 18:
        df.loc[i, 'category'] = 'adult'
    else:
        df.loc[i, 'category'] = 'minor'

# Fast ✅ (requires: import numpy as np)
df['category'] = np.where(df['age'] > 18, 'adult', 'minor')
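
`np.where` covers a two-way split. For more than two branches, one option is `np.select`, which takes a list of conditions and a matching list of values. The sketch below is illustrative only: the 'child' and 'senior' labels and their age cut-offs are hypothetical additions, not part of the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [5, 15, 18, 30, 70]})

# Conditions are checked in order; the first one that matches wins.
conditions = [
    df["age"] < 13,   # hypothetical cut-off for 'child'
    df["age"] <= 18,  # same boundary as the np.where example: 18 is still 'minor'
    df["age"] < 65,   # hypothetical cut-off for 'senior'
]
labels = ["child", "minor", "adult"]

df["category"] = np.select(conditions, labels, default="senior")
print(df)
```

This replaces an if/elif/else chain inside a row loop with a single column-wise call.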

String Operations

# Slow ❌
for i in range(len(df)):
    df.loc[i, 'name'] = df.loc[i, 'name'].upper()

# Fast ✅
df['name'] = df['name'].str.upper()
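
The `.str` accessor also handles missing values, which the loop version does not. A small sketch with a made-up Series containing one missing name:

```python
import pandas as pd

names = pd.Series(["alice", None, "bob"], name="name")

# .str methods pass missing values through instead of raising.
upper = names.str.upper()
print(upper)

# The element-by-element version fails on the missing entry.
try:
    looped = [value.upper() for value in names]
except AttributeError as exc:
    looped = None
    print(f"loop failed: {exc}")
```

With the loop you would need an explicit check for missing values on every row.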

Math Operations

# Always vectorize math (np.log requires: import numpy as np)
df['total'] = df['price'] * df['quantity']
df['discount_price'] = df['price'] * 0.9
df['log_value'] = np.log(df['value'])

Golden Rule: If you're writing a for loop over DataFrame rows, there's almost always a vectorized way to do it. Ask yourself: "How can I express this operation on entire columns?"
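
As a worked example of that question, take a loop that carries state from one row to the next. The sketch below uses made-up data and a hypothetical `running_total` column; the running sum that the loop builds by hand maps onto `cumsum()`.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 5.0], "quantity": [1, 2, 3]})

# Loop version: accumulate the order value row by row.
running = []
total = 0.0
for price, quantity in zip(df["price"], df["quantity"]):
    total += price * quantity
    running.append(total)

# Column version: multiply whole columns, then take the cumulative sum.
df["running_total"] = (df["price"] * df["quantity"]).cumsum()

print(running)                       # [10.0, 50.0, 65.0]
print(df["running_total"].tolist())  # [10.0, 50.0, 65.0]
```

Cumulative sums, shifts, and rolling windows cover many loops that look like they need row-by-row state.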
