Useful Data Tips

Random Forests: Ensemble Learning

⏱️ 26 sec read 🤖 AI & ML

Random forests combine many decision trees into a single ensemble model. Because each tree is trained on different data and features, their errors are largely uncorrelated, so averaging their predictions gives better accuracy and robustness than any single tree.

How Random Forests Work

1. Create many decision trees (e.g., 100 trees)
2. Train each tree on a random bootstrap sample of the data (drawn with replacement)
3. Consider only a random subset of features at each split
4. For prediction:
   - Classification: majority vote across trees
   - Regression: average prediction across trees

Result: the "wisdom of crowds" beats any individual tree; the sketch below shows the core idea from scratch.
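
To make the mechanism concrete, here's a minimal from-scratch sketch: bootstrap sampling plus a majority vote over plain decision trees. This is illustrative only; RandomForestClassifier does the same thing (including per-split feature subsetting) far more efficiently.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_trees=100, seed=42):
    """Bootstrap-sample the data, train one tree per sample, majority-vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample rows with replacement
        tree = DecisionTreeClassifier(max_features='sqrt')  # random features per split
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    # Majority vote (assumes binary 0/1 labels for simplicity)
    return (np.stack(votes).mean(axis=0) >= 0.5).astype(int)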

Creating a Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train random forest
rf = RandomForestClassifier(
    n_estimators=100,  # Number of trees
    max_depth=10,      # Maximum tree depth
    random_state=42
)

rf.fit(X_train, y_train)

# Evaluate
accuracy = rf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")

Key Parameters

n_estimators (Number of Trees)

# More trees = better performance, slower training
# Typical values: 100-500

rf = RandomForestClassifier(n_estimators=200)

# More isn't always better: returns usually diminish past ~200 trees (see the sweep below)
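
To see the plateau for yourself, one rough sketch is to sweep n_estimators on the split from earlier and watch test accuracy level off:

for n in [10, 50, 100, 200, 500]:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    print(f"{n} trees: {model.score(X_test, y_test):.3f}")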

max_depth (Tree Depth)

# Limits how deep each tree grows
# Prevents overfitting

rf = RandomForestClassifier(max_depth=10)

# None = grow until pure (may overfit)
# 5-20 = good range for most problems
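
A quick way to spot depth-driven overfitting is to compare train and test accuracy; a large gap at max_depth=None suggests the trees are memorizing the training data:

for depth in [None, 5, 10]:
    model = RandomForestClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train {model.score(X_train, y_train):.3f}, "
          f"test {model.score(X_test, y_test):.3f}")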

min_samples_split

# Minimum samples required to split a node
rf = RandomForestClassifier(min_samples_split=10)

# Higher value = more conservative, less overfitting

Feature Importance

import pandas as pd
import matplotlib.pyplot as plt

# Get feature importances
importances = rf.feature_importances_
feature_names = [f"Feature {i}" for i in range(X.shape[1])]

# Create dataframe
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

# Plot top 10, reversed so the most important feature sits on top
importance_df.head(10).iloc[::-1].plot(x='feature', y='importance', kind='barh', legend=False)
plt.xlabel('Importance')
plt.title('Top 10 Most Important Features')
plt.show()
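
Keep in mind that impurity-based importances can favor high-cardinality features. Permutation importance, computed on held-out data, is a common cross-check (a sketch using sklearn's permutation_importance):

from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set; the accuracy drop is its importance
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{feature_names[i]}: {result.importances_mean[i]:.4f}")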

Advantages of Random Forests

- Much more robust to overfitting than a single tree (averaging reduces variance)
- Strong out-of-the-box accuracy on tabular data, with no feature scaling required
- Built-in feature importance estimates
- Free validation estimate from out-of-bag (OOB) samples
- Trees are independent, so training parallelizes well (n_jobs=-1)
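
Because each tree's bootstrap sample leaves out roughly one-third of the rows, those "out-of-bag" rows act as a built-in validation set. A minimal sketch using sklearn's oob_score option:

# Score each sample with only the trees that never saw it
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print(f"OOB accuracy: {rf_oob.oob_score_:.3f}")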

Disadvantages

- Slower to train and predict than a single tree; models can be large in memory
- Less interpretable than one decision tree (hundreds of trees, no single diagram to read)
- Regression forests cannot extrapolate beyond the range of the training targets
- Often edged out by well-tuned gradient boosting on pure accuracy

Random Forest vs Decision Tree

from sklearn.tree import DecisionTreeClassifier

# Single decision tree
dt = DecisionTreeClassifier(max_depth=10, random_state=42)
dt.fit(X_train, y_train)
dt_accuracy = dt.score(X_test, y_test)

# Random forest
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
rf_accuracy = rf.score(X_test, y_test)

print(f"Decision Tree: {dt_accuracy:.3f}")
print(f"Random Forest: {rf_accuracy:.3f}")
# Random forest typically wins!

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10]
}

# Grid search
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

When to Use Random Forests

- Tabular data with unscaled or mixed-type features
- You need a strong baseline quickly, with minimal tuning
- You want feature importances alongside predictions
- Avoid when you need a single interpretable model, or when extrapolating beyond the training range matters

Pro Tip: Random forests work well out-of-the-box with minimal tuning. Start with 100-200 trees and default parameters. Use feature importances to identify which variables matter most for your predictions!
