Why Vectorization is Faster Than Loops in Pandas
Explicit Python loops over DataFrame rows are slow. For numeric work, a vectorized version is often 100x-1000x faster, depending on the operation and the data size. Here's why:
The Performance Difference
# Slow: Python loop (12 seconds on 1M rows) ❌
# Assumes the default 0..n-1 integer index, since .loc looks rows up by label
total = 0
for i in range(len(df)):
    total += df.loc[i, 'price'] * df.loc[i, 'quantity']

# Fast: Vectorized (0.01 seconds) ✅
total = (df['price'] * df['quantity']).sum()
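To check timings like these on your own machine, something along the following lines can help. It is only a sketch: the synthetic data, the row count `n`, and the helper names `loop_total` and `vectorized_total` are assumptions made up for illustration, and the absolute numbers will vary with hardware and library versions.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data (assumed for illustration); kept small so the loop finishes quickly.
n = 10_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(1, 100, n),
    "quantity": rng.integers(1, 10, n),
})


def loop_total(frame):
    """Row-by-row sum of price * quantity using label-based access."""
    total = 0.0
    for i in range(len(frame)):
        total += frame.loc[i, "price"] * frame.loc[i, "quantity"]
    return total


def vectorized_total(frame):
    """The same computation expressed on whole columns."""
    return (frame["price"] * frame["quantity"]).sum()


start = time.perf_counter()
slow = loop_total(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = vectorized_total(df)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vec_seconds:.4f}s")
```

Both versions should produce the same total up to floating-point rounding; only the time taken differs.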
Why Vectorization is Faster
1. Compiled C Code
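Pandas/NumPy operations run in precompiled C code rather than in the Python interpreter. For tight numeric loops, compiled code is often around 100x faster than the equivalent pure-Python code, though the exact factor depends on the workload.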
2. No Python Overhead
Python loops have overhead for each iteration:
- Type checking on every operation
- Function call overhead
- Memory allocation for temporary objects
Vectorized operations pay these costs once per call instead of once per element.
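One way to see where the per-element overhead comes from is to compare how a plain Python list and a NumPy array hold the same numbers. This is a small sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np

values = [1.5, 2.5, 3.5]   # Python list: each element is a separate float object
arr = np.array(values)     # NumPy array: one typed buffer of raw 8-byte floats

# In the list, every element carries its own type, which the interpreter
# has to inspect on each operation.
element_types = {type(v) for v in values}

# In the array, the type is recorded once for the whole buffer.
array_dtype = arr.dtype
bytes_per_element = arr.itemsize

# A single call; the multiplication loop runs inside NumPy's compiled code.
doubled = arr * 2

print(element_types, array_dtype, bytes_per_element, doubled)
```

Because the array knows its element type up front, `arr * 2` needs one type check and one function call for the whole array, not one per number.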
3. CPU Optimization (SIMD)
Modern CPUs can process multiple numbers with a single instruction (SIMD: Single Instruction, Multiple Data). The compiled loops inside NumPy can take advantage of this for many operations; a loop running in the Python interpreter cannot.
4. Better Memory Access
Vectorized operations read a column's values sequentially from one contiguous block of memory, which is cache-friendly and much faster than jumping around between scattered Python objects.
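The memory layout can be inspected directly. The sketch below uses a made-up single-column DataFrame and looks at the strides of the underlying NumPy array, that is, how many bytes separate one element from the next.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [9.99, 14.50, 3.25, 20.00]})  # made-up values

col = df["price"].to_numpy()

# Consecutive elements sit 8 bytes apart (one float64 each) in a single block.
print(col.dtype, col.strides, col.flags["C_CONTIGUOUS"])

# Taking every second element doubles the stride, so reads skip over memory.
every_other = col[::2]
print(every_other.strides, every_other.flags["C_CONTIGUOUS"])
```

A stride equal to the element size means the CPU can stream through the values in order, which is the access pattern caches are built for.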
Common Vectorization Patterns
Conditional Logic
import numpy as np

# Slow ❌
for i in range(len(df)):
    if df.loc[i, 'age'] > 18:
        df.loc[i, 'category'] = 'adult'
    else:
        df.loc[i, 'category'] = 'minor'

# Fast ✅
df['category'] = np.where(df['age'] > 18, 'adult', 'minor')
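`np.where` covers the two-branch case. When there are more than two outcomes, `np.select` is one option that keeps the logic on whole columns. The sketch below is an illustration only: the extra thresholds (13 and 65) and the labels `child` and `senior` are assumptions, not part of the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [5, 15, 18, 19, 70]})  # made-up ages

# Conditions are checked in order; the first one that matches wins.
conditions = [
    df["age"] < 13,
    df["age"] <= 18,
    df["age"] < 65,
]
labels = ["child", "minor", "adult"]

# Rows that match none of the conditions get the default label.
df["category"] = np.select(conditions, labels, default="senior")

print(df)
```

Ordering the conditions from most to least restrictive plays the same role as an `if` / `elif` / `else` chain in the loop version.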
String Operations
# Slow ❌
for i in range(len(df)):
    df.loc[i, 'name'] = df.loc[i, 'name'].upper()

# Fast ✅
df['name'] = df['name'].str.upper()
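The `.str` accessor methods can also be chained, and they pass missing values through instead of failing on them. A short sketch with made-up data:

```python
import pandas as pd

names = pd.Series(["  alice ", None, "Bob"])  # made-up data with one missing value

# A plain loop calling .upper() would raise AttributeError on the None entry.
# The .str accessor propagates the missing value instead.
cleaned = names.str.strip().str.upper()

print(cleaned)
```

In the loop version you would need an explicit check for missing values before calling any string method.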
Math Operations
# Always vectorize math
df['total'] = df['price'] * df['quantity']
df['discount_price'] = df['price'] * 0.9
df['log_value'] = np.log(df['value'])
Golden Rule: If you're writing a for loop over DataFrame rows, there's almost always a vectorized way to do it. Ask yourself: "How can I express this operation on entire columns?"