PCA: Principal Component Analysis
PCA reduces high-dimensional data to fewer dimensions while preserving maximum variance. Use it for visualization, noise reduction, and speeding up machine learning models.
What PCA Does
# Transform 100 features → 2 dimensions
# While keeping as much information as possible
# Original: 100 features (hard to visualize, slow training)
# After PCA: 2 components (easy to visualize, fast training)
# Components are linear combinations of original features
# First component captures most variance
# Second component captures second-most, etc.
Basic PCA Example
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load data (4 features)
iris = load_iris()
X = iris.data
y = iris.target
# Reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Visualize
plt.figure(figsize=(10, 6))
colors = ['red', 'blue', 'green']
for i, color in enumerate(colors):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
                label=iris.target_names[i], color=color, alpha=0.7)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Iris Dataset')
plt.legend()
plt.show()
# 4D data now visible in 2D!
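As noted above, each component is a linear combination of the original features. The fitted model exposes those weights in pca.components_; this snippet reuses the pca and iris objects from the example above:
# Each row of components_ holds one component's weights
# over the original 4 features
for i, comp in enumerate(pca.components_):
    weights = ", ".join(f"{name}: {w:.2f}"
                        for name, w in zip(iris.feature_names, comp))
    print(f"PC{i+1}: {weights}")
# A large |weight| means that feature contributes strongly to the component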
Explained Variance
# How much information does each component capture?
print("Explained variance ratio:")
print(pca.explained_variance_ratio_)
# Approx. output: [0.92, 0.05] → 92% + 5% ≈ 97% total
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.2%}")
# Plot variance
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Scree Plot')
plt.show()
Choosing Number of Components
Cumulative Variance Method
# Pass a float between 0 and 1 to keep enough components to explain that share of variance
pca = PCA(n_components=0.95)  # keep 95% of variance
X_pca = pca.fit_transform(X)
print(f"Reduced from {X.shape[1]} to {X_pca.shape[1]} components")
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.2%}")
Scree Plot Method
# Try different numbers of components
import numpy as np  # needed for np.cumsum below
pca_full = PCA()
pca_full.fit(X)
# Plot cumulative variance
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cumsum)+1), cumsum, marker='o')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% variance')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.legend()
plt.show()
# Choose where curve flattens (elbow)
PCA for Speeding Up ML Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import time
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Without PCA
start = time.time()
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
time_no_pca = time.time() - start
acc_no_pca = rf.score(X_test, y_test)
# With PCA
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
start = time.time()
rf_pca = RandomForestClassifier()
rf_pca.fit(X_train_pca, y_train)
time_pca = time.time() - start
acc_pca = rf_pca.score(X_test_pca, y_test)
print(f"No PCA: {time_no_pca:.3f}s, Accuracy: {acc_no_pca:.3f}")
print(f"With PCA: {time_pca:.3f}s, Accuracy: {acc_pca:.3f}")
# Often: faster training, similar accuracy
# (on tiny data like iris the speedup is negligible; the gain appears with many features)
Standardization is Critical
from sklearn.preprocessing import StandardScaler
# Bad: PCA on raw data (features with different scales)
pca_bad = PCA(n_components=2)
X_pca_bad = pca_bad.fit_transform(X)
# Good: Scale first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# PCA is sensitive to feature scales!
# Always standardize first
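To make the scale-then-project order impossible to forget, the two steps can be chained in a scikit-learn Pipeline; a minimal sketch:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# The pipeline runs scaling, then PCA, on every fit/transform call
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
])
X_pca = pipeline.fit_transform(X)
# Access the fitted PCA step for its variance ratios
print(pipeline.named_steps['pca'].explained_variance_ratio_)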
When to Use PCA
- Visualization: Reduce to 2-3D for plotting
- Speed: Reduce features to speed up training
- Multicollinearity: Remove correlated features
- Noise reduction: Keep signal, drop noise (see the sketch after this list)
- Feature extraction: Create new features
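For the noise-reduction use, a common sketch is to project onto the top components and map back with inverse_transform; the dropped components carry part of the noise. The noise scale here is an illustrative assumption, and X is the iris matrix from earlier:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
X_noisy = X + rng.normal(scale=0.3, size=X.shape)  # add synthetic noise
# Project to 2 components, then reconstruct back to 4 features
pca_denoise = PCA(n_components=2)
X_reduced = pca_denoise.fit_transform(X_noisy)
X_denoised = pca_denoise.inverse_transform(X_reduced)
# Error vs. the clean data shrinks when the dropped components held mostly noise
print(f"Noisy error:    {np.mean((X_noisy - X) ** 2):.3f}")
print(f"Denoised error: {np.mean((X_denoised - X) ** 2):.3f}")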
Limitations
- Linear transformation only
- Components are hard to interpret
- Assumes variance = importance
- Sensitive to outliers
- May lose important information
Pro Tip: Always standardize features before PCA! Use the explained variance ratio to decide how many components to keep; a common target is 95% of the variance. For non-linear relationships, try kernel PCA (sketched below), t-SNE, or UMAP instead.
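For the non-linear case, scikit-learn offers KernelPCA as a drop-in alternative; a minimal sketch with an RBF kernel (the gamma value is an illustrative guess and would normally be tuned):
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
# The RBF kernel lets the projection capture curved structure
# that linear PCA would miss
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.5)
X_kpca = kpca.fit_transform(X_scaled)
print(X_kpca.shape)  # (150, 2)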