Random Forests: Ensemble Learning
Random forests combine many decision trees into a single ensemble model. By aggregating the predictions of many trees (majority vote for classification, averaging for regression), they achieve better accuracy and robustness than any single tree.
How Random Forests Work
1. Create many decision trees (e.g., 100 trees)
2. Each tree trained on random sample of data (bootstrap)
3. Each split uses random subset of features
4. For prediction:
- Classification: Majority vote across trees
- Regression: Average prediction across trees
Result: "Wisdom of crowds" beats individual trees
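The four steps above can be hand-rolled in a few lines. The sketch below is purely illustrative (in practice you would use RandomForestClassifier, shown in the next section), and the variable names are made up for the example:
# Hand-rolled sketch of the ensemble idea (illustrative only;
# RandomForestClassifier does all of this internally and more efficiently)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):                                   # step 1: build many trees
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # step 2: bootstrap sample (rows drawn with replacement)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)  # step 3: random feature subset per split
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

votes = np.mean([t.predict(X_te) for t in trees], axis=0)  # fraction of trees voting "1"
ensemble_pred = (votes >= 0.5).astype(int)                 # step 4: majority vote
print("Ensemble test accuracy:", (ensemble_pred == y_te).mean())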
Creating a Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train random forest
rf = RandomForestClassifier(
    n_estimators=100,  # Number of trees
    max_depth=10,      # Maximum tree depth
    random_state=42
)
rf.fit(X_train, y_train)
# Evaluate
accuracy = rf.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")
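Because each tree only sees a bootstrap sample of the training rows, the rows it leaves out can double as a free validation set; scikit-learn exposes this as the out-of-bag (OOB) score. A quick variation on the model above (the rf_oob name is just for this example):
# Out-of-bag evaluation: each tree is scored on the rows it never saw,
# so no separate validation split is needed
rf_oob = RandomForestClassifier(n_estimators=100, max_depth=10, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print(f"OOB accuracy estimate: {rf_oob.oob_score_:.3f}")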
Key Parameters
n_estimators (Number of Trees)
# More trees = better performance, slower training
# Typical values: 100-500
rf = RandomForestClassifier(n_estimators=200)
# More isn't always better—diminishing returns after ~200
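One way to see those diminishing returns is a simple sweep; this illustrative experiment reuses the X_train/X_test split from above:
# Sweep the number of trees; test accuracy usually plateaus quickly
for n in [10, 50, 100, 200, 400]:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    print(f"{n:4d} trees -> accuracy {model.score(X_test, y_test):.3f}")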
max_depth (Tree Depth)
# Limits how deep each tree grows
# Prevents overfitting
rf = RandomForestClassifier(max_depth=10)
# None = grow until pure (may overfit)
# 5-20 = good range for most problems
min_samples_split
# Minimum samples required to split a node
rf = RandomForestClassifier(min_samples_split=10)
# Higher value = more conservative, less overfitting
Feature Importance
import pandas as pd
import matplotlib.pyplot as plt
# Get feature importances
importances = rf.feature_importances_
feature_names = [f"Feature {i}" for i in range(X.shape[1])]
# Create dataframe
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)
# Plot top 10 (reversed so the most important feature appears at the top)
importance_df.head(10).iloc[::-1].plot(x='feature', y='importance', kind='barh', legend=False)
plt.xlabel('Importance')
plt.title('Top 10 Most Important Features')
plt.show()
Advantages of Random Forests
- High accuracy: Often outperforms single models
- Robust: Less prone to overfitting than single trees
- Feature importance: Built-in feature ranking
- Handles missing values: Newer scikit-learn releases accept NaN inputs natively; older ones need imputation first
- Works with mixed data: Numeric features work as-is; categorical features need encoding in scikit-learn (see the preprocessing sketch below)
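A caveat on the last two points: depending on your scikit-learn version, the safest route is an explicit preprocessing pipeline that imputes missing values and encodes categorical columns. A minimal sketch, where the column names (age, income, city) and the DataFrame df are hypothetical:
# Portable preprocessing for missing values and categorical columns
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["age", "income"]   # hypothetical numeric columns
categorical_cols = ["city"]        # hypothetical categorical column

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
# model.fit(df[numeric_cols + categorical_cols], df["target"])  # df is a hypothetical DataFrame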
Disadvantages
- Slower than single tree
- Black box (harder to interpret)
- Large memory footprint
- Biased toward dominant classes in imbalanced data
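The class-imbalance issue can often be mitigated with class weighting, which RandomForestClassifier supports directly:
# Weight classes inversely to their frequency so minority classes count more
rf_balanced = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',   # or 'balanced_subsample' to reweight within each bootstrap sample
    random_state=42
)
rf_balanced.fit(X_train, y_train)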
Random Forest vs Decision Tree
from sklearn.tree import DecisionTreeClassifier
# Single decision tree
dt = DecisionTreeClassifier(max_depth=10, random_state=42)
dt.fit(X_train, y_train)
dt_accuracy = dt.score(X_test, y_test)
# Random forest
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
rf_accuracy = rf.score(X_test, y_test)
print(f"Decision Tree: {dt_accuracy:.3f}")
print(f"Random Forest: {rf_accuracy:.3f}")
# Random forest typically wins!
Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10]
}
# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
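The refitted best model is available as best_estimator_ and is worth confirming on the held-out test set, since best_score_ is only a cross-validation estimate:
# Evaluate the tuned model on data the search never used
best_rf = grid_search.best_estimator_
print(f"Test accuracy: {best_rf.score(X_test, y_test):.3f}")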
When to Use Random Forests
- Tabular data (structured data)
- Need high accuracy
- Want feature importance rankings
- Have mixed feature types
- Moderate dataset size (not billions of rows)
Pro Tip: Random forests work well out-of-the-box with minimal tuning. Start with 100-200 trees and default parameters. Use feature importances to identify which variables matter most for your predictions!