Why Vectorization is Faster Than Loops in Pandas


Row-by-row loops over a DataFrame are slow in Python. Vectorized operations are often 100x-1000x faster, depending on the operation and the data. Here's why:

The Performance Difference

# Slow: Python loop (12 seconds on 1M rows) ❌
# Note: df.loc[i, ...] with i from range() assumes the default 0..n-1 index
total = 0
for i in range(len(df)):
    total += df.loc[i, 'price'] * df.loc[i, 'quantity']

# Fast: Vectorized (0.01 seconds) ✅
total = (df['price'] * df['quantity']).sum()
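
To check the gap on your own machine, a minimal timing sketch might look like the one below. The DataFrame is synthetic (random prices and quantities, and only 10,000 rows so the loop finishes quickly), so treat the numbers it prints as indicative only; they will vary with hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data: an assumption for illustration, not the original dataset
rng = np.random.default_rng(0)
n_rows = 10_000
df = pd.DataFrame({
    'price': rng.uniform(1, 100, n_rows),
    'quantity': rng.integers(1, 10, n_rows),
})

# Row-by-row loop
start = time.perf_counter()
loop_total = 0.0
for i in range(len(df)):
    loop_total += df.loc[i, 'price'] * df.loc[i, 'quantity']
loop_seconds = time.perf_counter() - start

# Vectorized version
start = time.perf_counter()
vec_total = (df['price'] * df['quantity']).sum()
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vec_seconds:.4f}s")
print("same result:", np.isclose(loop_total, vec_total))
```

Both versions should agree on the total up to floating-point rounding, which is why the comparison uses np.isclose rather than ==.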

Why Vectorization is Faster

1. Compiled C Code

Pandas/NumPy operations run in compiled code (mostly C and Cython), not interpreted Python. For tight numeric loops, compiled code is often around 100x faster than the equivalent Python.

2. No Python Overhead

Python loops have overhead for each iteration:

Type checking on every operation
Function call overhead
Memory allocation for temporary objects

Vectorized operations pay these costs once per call instead of once per row.
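
One way to see the per-element cost: the column is stored as a single typed array, but each value pulled out of it at the Python level becomes its own scalar object. A small sketch with made-up prices:

```python
import numpy as np
import pandas as pd

# Illustrative data (hypothetical values)
prices = pd.Series([9.99, 19.99, 4.50])

# The whole column is backed by one typed array
print(prices.dtype)            # float64

# Pulling out a single element creates a separate scalar object
first = prices.iloc[0]
print(type(first).__name__)    # float64 (a NumPy scalar)

# The loop performs one Python-level addition per element...
loop_sum = 0.0
for value in prices:
    loop_sum += value

# ...while the vectorized call crosses into compiled code once
vector_sum = prices.sum()
```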

3. CPU Optimization (SIMD)

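Modern CPUs can process multiple numbers with one instruction (SIMD: Single Instruction, Multiple Data). The compiled loops inside NumPy can use these instructions for many numeric operations; a Python-level loop, which handles one value at a time, cannot.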

4. Better Memory Access

Numeric columns are stored as contiguous arrays, so vectorized operations read memory sequentially. Sequential access is cache-friendly and much faster than jumping around between scattered Python objects.
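
The layout can be inspected directly on a NumPy array, which is what backs most numeric pandas columns. The sketch below uses an arbitrary array of one million floats and only illustrates the layout; it does not measure cache behavior.

```python
import numpy as np

values = np.arange(1_000_000, dtype=np.float64)

# Elements sit back to back in memory
print(values.flags['C_CONTIGUOUS'])        # True
print(values.strides)                      # (8,): each float64 starts 8 bytes after the last

# A strided view skips elements, so consecutive reads are further apart
every_other = values[::2]
print(every_other.strides)                 # (16,)
print(every_other.flags['C_CONTIGUOUS'])   # False
```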

Common Vectorization Patterns

Conditional Logic

# Slow ❌
for i in range(len(df)):
    if df.loc[i, 'age'] > 18:
        df.loc[i, 'category'] = 'adult'
    else:
        df.loc[i, 'category'] = 'minor'

# Fast ✅ (requires: import numpy as np)
df['category'] = np.where(df['age'] > 18, 'adult', 'minor')
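
np.where covers a two-way split. For three or more branches, one option is np.select, sketched below. The age thresholds, labels, and the age_group column name are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [5, 15, 30, 70]})

# Conditions are checked in order; the first one that matches wins
conditions = [df['age'] < 13, df['age'] < 20]
labels = ['child', 'teen']

# Rows matching no condition get the default label
df['age_group'] = np.select(conditions, labels, default='adult')
print(df['age_group'].tolist())   # ['child', 'teen', 'adult', 'adult']
```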

String Operations

# Slow ❌
for i in range(len(df)):
    df.loc[i, 'name'] = df.loc[i, 'name'].upper()

# Fast ✅
df['name'] = df['name'].str.upper()
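
A side benefit of the .str accessor is that it passes missing values through, whereas calling .upper() on a missing value inside a loop would raise an error. A small sketch with made-up names:

```python
import pandas as pd

# Hypothetical data with one missing entry
names = pd.Series(['alice', None, 'bob'])

upper = names.str.upper()
print(upper.iloc[0], upper.iloc[2])   # ALICE BOB
print(pd.isna(upper.iloc[1]))         # True: the missing value stays missing
```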

Math Operations

# Always vectorize math
df['total'] = df['price'] * df['quantity']
df['discount_price'] = df['price'] * 0.9
df['log_value'] = np.log(df['value'])

Golden Rule: If you're writing a for loop over DataFrame rows, there's almost always a vectorized way to do it. Ask yourself: "How can I express this operation on entire columns?"
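
Even loops that look at the previous row often have a column-level equivalent. The sketch below, on a tiny made-up price column, replaces a row-to-row difference loop with diff() and a running sum with cumsum(); the change and running_total column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 12.0, 11.0, 15.0]})

# Loop version: change from the previous row (first row has no predecessor)
changes = [float('nan')]
for i in range(1, len(df)):
    changes.append(df.loc[i, 'price'] - df.loc[i - 1, 'price'])

# Column versions
df['change'] = df['price'].diff()
df['running_total'] = df['price'].cumsum()

print(df['change'].tolist())          # [nan, 2.0, -1.0, 4.0]
print(df['running_total'].tolist())   # [10.0, 22.0, 33.0, 48.0]
```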
