PCA: Principal Component Analysis

PCA reduces high-dimensional data to fewer dimensions while preserving maximum variance. Use it for visualization, noise reduction, and speeding up machine learning models.

What PCA Does

# Transform 100 features → 2 dimensions
# While keeping as much information as possible

Original: 100 features (hard to visualize, slow training)
After PCA: 2 components (easy to visualize, fast training)

# Components are linear combinations of original features
# First component captures most variance
# Second component captures second-most, etc.
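
To make "linear combination" concrete, here is a minimal sketch (using scikit-learn's PCA and the iris data, which the example below also uses) that prints each component's weights on the original features:

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Each row of components_ holds one component's weights (loadings)
iris = load_iris()
pca = PCA(n_components=2).fit(iris.data)

for i, weights in enumerate(pca.components_, start=1):
    terms = " + ".join(f"{w:+.2f}*{name}"
                       for w, name in zip(weights, iris.feature_names))
    print(f"PC{i} = {terms}")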

Basic PCA Example

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load data (4 features)
iris = load_iris()
X = iris.data
y = iris.target

# Reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Visualize
plt.figure(figsize=(10, 6))
colors = ['red', 'blue', 'green']
for i, color in enumerate(colors):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
                label=iris.target_names[i], color=color, alpha=0.7)

plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Iris Dataset')
plt.legend()
plt.show()

# 4D data now visible in 2D!

Explained Variance

# How much information does each component capture?
print("Explained variance ratio:")
print(pca.explained_variance_ratio_)
# e.g. [0.92, 0.05] → 92% + 5% ≈ 97% of the variance retained

print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.2%}")

# Plot variance
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1),
        pca.explained_variance_ratio_)
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Scree Plot')
plt.show()

Choosing Number of Components

Cumulative Variance Method

# Keep components that explain 95% of variance
pca = PCA(n_components=0.95)  # Keep 95% variance
X_pca = pca.fit_transform(X)

print(f"Reduced from {X.shape[1]} to {X_pca.shape[1]} components")
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.2%}")

Scree Plot Method

# Fit PCA with all components to inspect the full variance curve
import numpy as np

pca_full = PCA()
pca_full.fit(X)

# Plot cumulative variance
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cumsum)+1), cumsum, marker='o')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% variance')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.legend()
plt.show()

# Choose where curve flattens (elbow)
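
If you'd rather pick the count programmatically than eyeball the elbow, one line over the cumulative sum from the plot above does it (this reuses the cumsum array; the 0.95 threshold is just the common default, not a rule):

# Smallest number of components whose cumulative variance reaches 95%
n_components = int(np.argmax(cumsum >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_components}")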

PCA for Speeding Up ML Models

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import time

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Without PCA
start = time.time()
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
time_no_pca = time.time() - start
acc_no_pca = rf.score(X_test, y_test)

# With PCA
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

start = time.time()
rf_pca = RandomForestClassifier()
rf_pca.fit(X_train_pca, y_train)
time_pca = time.time() - start
acc_pca = rf_pca.score(X_test_pca, y_test)

print(f"No PCA: {time_no_pca:.3f}s, Accuracy: {acc_no_pca:.3f}")
print(f"With PCA: {time_pca:.3f}s, Accuracy: {acc_pca:.3f}")
# Often: faster training, similar accuracy.
# On a tiny dataset like iris the difference is negligible;
# the speed-up matters with hundreds or thousands of features.

Standardization is Critical

from sklearn.preprocessing import StandardScaler

# Bad: PCA on raw data (features with different scales)
pca_bad = PCA(n_components=2)
X_pca_bad = pca_bad.fit_transform(X)

# Good: Scale first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# PCA is sensitive to feature scales!
# Always standardize first
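
To keep scaling and PCA glued together (so the scaler is never accidentally fit on test data), scikit-learn's Pipeline works; a minimal sketch:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# fit_transform scales first, then projects, in one call
scale_then_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_pca = scale_then_pca.fit_transform(X)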

When to Use PCA

- Visualizing high-dimensional data in 2D or 3D
- Reducing noise before modeling
- Speeding up training when there are many correlated features

Limitations

- Linear only: non-linear structure is missed (t-SNE/UMAP handle that)
- Components mix the original features, so they are harder to interpret
- Sensitive to feature scales, so standardize first
- Lossy: whatever the dropped components captured is gone (demonstrated below)
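
The lossy point is easy to see with inverse_transform, which maps reduced data back to the original feature space; whatever lived in the dropped components doesn't come back. A minimal sketch, reusing X_scaled and the 2-component pca from the standardization example:

import numpy as np

# Project down to 2 components, then reconstruct all 4 features
X_reduced = pca.fit_transform(X_scaled)
X_restored = pca.inverse_transform(X_reduced)

# Mean squared reconstruction error = the variance PCA threw away
mse = np.mean((X_scaled - X_restored) ** 2)
print(f"Reconstruction MSE: {mse:.4f}")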

Pro Tip: Always standardize features before PCA! Use explained variance ratio to decide how many components to keep—typically aim for 95% of variance. For non-linear relationships, try t-SNE or UMAP instead.
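
Since the tip mentions t-SNE, here is a minimal sketch of the swap using scikit-learn's TSNE (note it is for visualization only: it has no transform method for new data):

from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

# Non-linear 2D embedding of the same 4-feature data
X = load_iris().data
X_embedded = TSNE(n_components=2, random_state=42).fit_transform(X)
print(X_embedded.shape)  # (150, 2)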
