3
Ensemble Learning and Random Forests

3.1 Introduction

Suppose you ask a complex question to millions of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert’s answer.

This is called the wisdom of the crowd. ¹ ¹ also known as “wisdom of the majority”, which expresses the notion that the collective opinion of a diverse and independent group of individuals (rather than that of a single expert) produces the best judgement [10].

In a similar fashion, aggregating the predictions of a group of predictors, such as classifiers or regressors. We will often get better predictions than with the best individual predictor.

A group of predictors is called an ensemble and this technique, ensemble learning , and the learning method, ensemble method .

As an example of an ensemble method, we can train a group of decision tree classifiers , each on a different random subset of the training set. We can then obtain the predictions of all the individual trees, and the class that gets the most votes is the ensemble’s prediction. Such an ensemble of decision trees is called a RF , and despite its simplicity, this is one of the most powerful ML algorithms available today. ² ² A Historical Overview The general method of random decision forests was originally proposed by Salzberg and Heath in 1993 [11], with a method that used a randomized decision tree algorithm to create multiple trees and then combine them using majority voting. This idea was developed further by Ho in 1995 [12]. Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines [13] concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. This observation that a more complex classifier (a larger forest) gets more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. In this chapter we will examine the most popular ensemble methods, including:

voting classifiers,
bagging and pasting ensembles,
RF s,
boosting,
and stacking ensembles.

3.1.1 Voting Classifiers

Let’s assume that we have trained a few classifiers, with each one achieving about 80% accuracy.

This may have a logistic regression classifier , an SVM classifier , a RF classifier , a k-nearest neighbour classifier , and perhaps a few more.

A simple way to create an even better classifier is to combine the predictions of each classifier where the class which gets the most votes is the ensemble’s prediction.

This majority-vote classifier is called a hard voting classifier .

Interestingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a weak learner, meaning it does only slightly better than random guessing, the ensemble can still be a strong learner, achieving high accuracy, provided there are a sufficient number of weak learners in the ensemble and they are sufficiently diverse .

Let’s discuss how this all works with an example. For simplicity, we will have a slightly biased coin which has a 51% chance of coming up heads and 49% chance of coming up tails. ³ ³ A biased coin such as this, perhaps. If you toss it 1,000 times, we will generally get more or less 510 heads and 490 tails, and therefore will result in a majority of heads.

If you do the calculation, you will find that the probability of obtaining a majority of heads after 1,000 tosses is close to 75%. The more you toss the coin, the higher the probability (e.g., with 10,000 tosses, the probability climbs over 97%). This is due to the law of large numbers :

as we keep tossing the coin, the ratio of heads gets closer and closer to the probability of heads (51%).

Information : LLN

A mathematical law states that the average of the results obtained from a large number of independent random samples converges to the true value, if it exists. More formally, the LLN states that given a sample of independent and identically distributed values, the sample mean converges to the true mean. An example of this behaviour is shown in Fig. 3.1 .

Figure 3.1: An example of a biased coin, where as the number of coin tosses increases the value of the will reach to 51%.

If we extend this analogy to our previous case, Our case becomes the building of an ensemble containing 1,000 classifiers which are individually correct only 51% of the time. ⁴ ⁴ This can be barely considered to be better than just randomly guessing.

In this scenario, if we predict the majority voted class, we can hope for up to 75% accuracy.

However, this is only true if all classifiers are perfectly independent, making uncorrelated errors, which is clearly not the case because they are trained on the same data. They are likely to make the same types of errors, so there will be many majority votes for the wrong class, reducing the ensemble’s accuracy.

Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy [14].

To implement this behaviour, sklearn provides a VotingClassifier class which is intuitive as it just gives it a list of name/predictor pairs, and use it like a normal classifier.

Let’s try it on the moons dataset we used in SVM . We will load and split the moons dataset into a training set and a test set, then we’ll create and train a voting classifier composed of three (3) diverse classifiers:

from sklearn.datasets import make_moons 
from sklearn.ensemble import RandomForestClassifier, VotingClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 
from sklearn.svm import SVC 
X, y = make_moons(n_samples=500, noise=0.30, random_state=42) 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42) 
voting_clf = VotingClassifier( 
   estimators=[ 
      ('lr', LogisticRegression(random_state=42)), 
      ('rf', RandomForestClassifier(random_state=42)), 
      ('svc', SVC(random_state=42)) 
   ] 
) 
voting_clf.fit(X_train, y_train)

When we fit a VotingClassifier, it clones every estimator and fits the clones. The original estimators are available via the estimators attribute, whereas the fitted clones are available via the estimators_ attribute.

If we prefer a dict rather than a list, you can use named_estimators or named_estimators_ instead.

To begin, let’s look at each fitted classifier’s individual accuracy on the test set:

for name, clf in voting_clf.named_estimators_.items(): 
   print(name, "=", clf.score(X_test, y_test))

lr = 0.864 
rf = 0.896 
svc = 0.896

When we call the voting classifier’s predict() method, it performs hard voting.

For example, the voting classifier predicts class 1 for the first instance of the test set, because two out of three classifiers predict that class:

print(voting_clf.predict(X_test[:1])) 
print([clf.predict(X_test[:1]) for clf in voting_clf.estimators_]) 
print(voting_clf.score(X_test, y_test))

[1] 
[array([1]), array([1]), array([0])] 
0.912

And we can look at the performance of the voting classifier on the test set which is 0.912, which shows, the voting classifier outperforms all the individual classifiers.

If all classifiers are able to estimate class probabilities, ⁵ ⁵ i.e., if they all have a predict_proba() method. then we can tell sklearn to predict the class with the highest class probability, averaged over all the individual classifiers.

This is called soft voting , which often achieves higher performance than hard voting as it gives more weight to highly confident votes. All we need to do is set the voting classifier’s voting hyperparameter to soft, and ensure that all classifiers can estimate class probabilities.

This is NOT the case for the SVC class by default, so you need to set its probability hyperparameter to True

This will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba() method.

Let’s try that:

voting_clf.voting = "soft" 
voting_clf.named_estimators["svc"].probability = True 
voting_clf.fit(X_train, y_train) 
print(voting_clf.score(X_test, y_test))

0.92

We reach 92% accuracy simply by using soft voting.

To give a brief summary of what we have learned till now:

Hard Voting

Takes a simple majority vote to decide the final prediction, based on the most frequent class predicted by individual models.

Soft Voting

Considers the probability scores of each class predicted by individual models and averages them to produce a more refined final prediction.

When dealing with imbalanced datasets, soft voting can help mitigate the bias towards the majority class by taking into account the probabilities of all classes.

3.2 Bagging and Pasting

Now we got a general idea of RF , let’s look at methods to improve it’s performance. A way to get a diverse set of classifiers is to use diverse training algorithms. An alternative approach is to use the same training algorithm for every predictor but train them on different random subsets of the training set.

When sampling is performed with replacement, this method is called bagging . ⁶ ⁶ This is short for bootstrap aggregating.
When sampling is performed without replacement, it is called pasting .

Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode for classification, ⁷ ⁷ i.e., the most frequent prediction, just like with a hard voting classifier. or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance.

Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.

Predictors can all be trained in parallel, via different Central Processing Unit (CPU) cores or even different servers. Similarly, predictions can be made in parallel. This is one of the reasons bagging and pasting are such popular methods as they scale very well.

3.2.1 Implementation

sklearn offers simple classes for both bagging and pasting:

the BaggingClassifier class, or BaggingRegressor for regression.

The code below trains an ensemble of 500 decision tree classifiers where each is trained on 100 training instances randomly sampled from the training set with replacement. ⁸ ⁸ This is an example of bagging, but if you want to use pasting instead, just set bootstrap=False. The n_jobs parameter tells sklearn the number of CPU cores to use for training and predictions, and -1 tells sklearn to use all available cores:

from sklearn.ensemble import BaggingClassifier 
from sklearn.tree import DecisionTreeClassifier 
 
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, 
                    max_samples=100, n_jobs=-1, random_state=42) 
bag_clf.fit(X_train, y_train)

A BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities, ⁹ ⁹ i.e., if it has a predict_proba() method which is the case with decision tree classifiers.

Please have a look at Fig. 3.2 , which compares the decision boundary of a single decision tree with the decision boundary of a bagging ensemble of 500 trees, both trained on the moons dataset.

Figure 3.2: A single decision tree (left) versus a bagging ensemble of 500 trees (right)

As can be seen, the ensemble’s predictions will likely generalize much better than the single decision tree’s predictions:

the ensemble has a comparable bias but a smaller variance.

It makes roughly the same number of errors on the training set, but the decision boundary is less irregular.

Bagging introduces a higher diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, so the ensemble’s variance is reduced.

Overall, bagging often results in better models, which explains why it’s generally preferred. However if we have spare time and CPU power, we can also use cross-validation to evaluate both bagging and pasting and select the one that works best.

3.2.2 Out-of-Bag Evaluation

With bagging, some training instances may be sampled several times for any given predictor, while others may not be sampled at all. By default a BaggingClassifier samples $m$ training instances with replacement (bootstrap=True), where $m$ is the size of the training set. With this process, it can be shown mathematically that only about 63% of the training instances are sampled on average for each predictor.

Information : Limits of Bagging

If we were to randomly draw one instance from a dataset of size $m$ , each instance in the dataset obviously has probability $\frac{1}{m}$ of getting picked, and therefore it has a probability $1 - \frac{1}{m}$ of NOT getting picked.

If you draw $m$ instances with replacement , all draws are independent and therefore each instance has a probability

{(1 - \frac{1}{m})}^{m}

of NOT getting picked. Now let’s use the fact that $\exp x$ is equal to the limit of:

\exp x = \lim_{x \to \infty} {(1 + \frac{x}{m})}^{m}

So if we to assume $m$ to be sufficiently large, the ratio of Out-of Bag (OOB) instances will be about $\exp - 1 ≃ 0.37$ . Therefore roughly 63% will be sampled $■$

The remaining 37% of the training instances that are NOT sampled are called OOB instances.

Note that they are NOT the same 37% for all predictors.

A bagging ensemble can be evaluated using OOB instances, without the need for a separate validation set as, if there are enough estimators, then each instance in the training set will likely be an OOB instance of several estimators, so these estimators can be used to make a fair ensemble prediction for that instance.

Once we have a prediction for each instance, we can determine the ensemble’s prediction accuracy ¹⁰ ¹⁰ Or any other metric of interest. In sklearn, we can set oob_score=True when creating a BaggingClassifier to request an automatic OOB evaluation after training.

The following code demonstrates this effect:

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, 
                    oob_score=True, n_jobs=-1, random_state=42) 
bag_clf.fit(X_train, y_train) 
print(bag_clf.oob_score_)

The resulting evaluation score is available in the oob_score_ attribute:

0.896

According to this OOB evaluation, this BaggingClassifier is likely to achieve about 89.6% accuracy on the test set. Let’s verify this:

from sklearn.metrics import accuracy_score 
 
y_pred = bag_clf.predict(X_test) 
print(accuracy_score(y_test, y_pred))

0.912

We get 92% accuracy on the test. The OOB evaluation was a bit too pessimistic, just over 2% too low. The OOB decision function for each training instance is also available as the oob_decision_function_ attribute. As the base estimator has a predict_proba() method, the decision function returns the class probabilities for each training instance.

For example, the OOB evaluation estimates that the first training instance has a 67.6% probability of belonging to the positive class and a 32.4% probability of belonging to the negative class:

print(bag_clf.oob_decision_function_[:3])  # probas for the first 3 instances

[[0.32352941 0.67647059] 
 [0.3375     0.6625    ] 
 [1.         0.        ]]

3.2.3 Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. Sampling is controlled by two (2) hyper-parameters:

max_features,
bootstrap_features.

They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Therefore, each predictor will be trained on a random subset of the input features .

This technique is particularly useful when you are dealing with high-dimensional inputs, ¹¹ ¹¹ such as images. as it can considerably speed up training.

Sampling both training instances and features is called the random patches method.

Keeping all training instances (by setting bootstrap=False and max_samples=1.0 ) but sampling features (by setting bootstrap_features to True and/or max_features to a value smaller than 1.0) is called the random subspaces method [13].

Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.

3.3 Random Forests

As we have discussed, a RF is an ensemble of decision trees , generally trained via the bagging method, ¹² ¹² or sometimes pasting. typically with max_samples set to the size of the training set. Instead of building a BaggingClassifier class and passing it a DecisionTreeClassifier, we can just use the RandomForestClassifier class, which is more convenient and optimised for decision trees. ¹³ The following code trains a RF classifier with 500 trees, each limited to maximum 16 leaf nodes, using all available CPU cores:

from sklearn.ensemble import RandomForestClassifier 
 
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, 
                       n_jobs=-1, random_state=42) 
rnd_clf.fit(X_train, y_train) 
y_pred_rf = rnd_clf.predict(X_test)

With a few slight exceptions, a RandomForestClassifier has all the hyperparameters of the DecisionTreeClassifier class, which allows to control how trees are grown, in addition to all the hyperparameters of a BaggingClassifier to control the ensemble itself.

The RF algorithm introduces extra randomness when growing trees. Instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features.

By default, it samples $n$ features, where $n$ is the total number of features. The algorithm results in greater tree diversity, which trades a higher bias for a lower variance, generally giving an overall better model.

So, the following BaggingClassifier is equivalent to the previous RandomForestClassifier :

bag_clf = BaggingClassifier( 
   DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16), 
   n_estimators=500, n_jobs=-1, random_state=42)

3.3.1 Extra-Trees

When we are growing a tree in a RF , at each node, only a random subset of the features is considered for splitting. ¹⁴ ¹⁴ as discussed previously. It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds. ¹⁵ For this, we can simply set splitter="random" when creating a DecisionTreeClassifier.

A forest of such extremely random trees is called an extremely randomised trees , or ExtraTrees . Here, values are chosen from a uniform distribution within the feature’s empirical range. ¹⁶ ¹⁶ in the tree’s training set. Then, of all the randomly chosen splits, the one which produces the highest score is used to split the node [15].

As with previous methods, this technique trades more bias for a lower variance.

It also makes extra-trees classifiers much faster to train than regular RF s, as finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree [15].

We can create an extra-trees classifier using sklearn ’s ExtraTreesClassifier class. Its Application-Programming Interface (API) is identical to the RandomForestClassifier class, except bootstrap defaults to False. Similarly, the ExtraTreesRegressor class has the same API as the RandomForestRegressor class, except bootstrap defaults to False.

It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier. Generally, the only way to know is to try both and compare them using cross-validation.

3.3.2 Feature Importance

Another great quality of RF s is that they make it easy to measure the relative importance of each feature. sklearn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average, across all trees in the forest. More precisely, it is a weighted average, where each node’s weight is equal to the number of training samples that are associated with it.

sklearn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1. You can access the result using the feature_importances_ variable. The following code trains a RandomForestClassifier on the iris dataset and outputs each feature’s importance.

from sklearn.datasets import load_iris 
 
iris = load_iris(as_frame=True) 
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42) 
rnd_clf.fit(iris.data, iris.target) 
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns): 
   print(round(score, 2), name)

0.11 sepal length (cm) 
0.02 sepal width (cm) 
0.44 petal length (cm) 
0.42 petal width (cm)

Based on the results, it seems the most important features are the petal length (44%) and width (42%), while sepal length and width are rather unimportant in comparison which are 11% and 2%, respectively.

Similarly, if we were to train a RF classifier on the Modified National Institute of Standards and Technology (MNIST) dataset and plot each pixel’s importance, we get the image represented in Fig. 3.3 .

Figure 3.3: MNIST pixel importance (according to a RF classifier)

RF s are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.

3.4 Boosting

Boosting refers to any ensemble method that can combine several weak learners into a strong learner . The general idea of most boosting methods is to train predictors sequentially , each trying to correct its predecessor. The general structure of it is as follows:

Initially, a model is built using the training data.
Subsequent models are then trained to address the mistakes of their predecessors.
Boosting assigns weights to the data points in the original dataset.

Higher weights

Instances which were misclassified by the previous model receive higher weights.

Lower weights

Instances which were correctly classified receive lower weights.
Training on weighted data: The subsequent model learns from the weighted dataset, focusing its attention on harder-to-learn examples (those with higher weights).
This iterative process continues until the entire training dataset is accurately predicted, or a predefined maximum number of models is reached.

There are many boosting methods available, but by far the most popular are AdaBoost and gradient boosting . Let’s start with AdaBoost. ¹⁷ ¹⁷ which is short for adaptive boosting.

3.4.1 AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances which the predecessor underfit . This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.

For example, when training an AdaBoost classifier, the algorithm first trains a base classifier ¹⁸ ¹⁸ i.e., a decision tree. and uses it to make predictions on the training set. The algorithm then increases the relative weight of misclassified training instances. Following this, it trains a second classifier, using the updated weights, and again makes predictions on the training set, updates the instance weights, and so on.

Figure 3.4: Decision boundaries of consecutive predictors.

Fig. 3.4 shows the decision boundaries of five (5) consecutive predictors on the moons dataset. ¹⁹ ¹⁹ In this example, each predictor is a highly regularized SVM classifier with an RBF kernel. The first classifier gets many instances wrong, so their weights get boosted.

The second classifier therefore does a better job on these instances, and so on. The plot on the right in Fig. 3.4 represents the same sequence of predictors, except that the learning rate is halved. ²⁰ ²⁰ i.e., the misclassified instance weights are boosted much less at every iteration. As can be seen, this sequential learning technique has some similarities with gradient descent, except instead of tweaking a single predictor’s parameters to minimise a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set.

There is one important drawback to this sequential learning technique. Training cannot be parallelized as each predictor can only be trained after the previous predictor has been trained and evaluated. Therefore, it does NOT scale as well as bagging or pasting.

Let’s take a closer look at the AdaBoost algorithm.

Each instance weight $w_{}^{(i)}$ is initially set to $1 ∕ m$ . A first predictor is trained, and its weighted error rate $r_{1}^{}$ is computed on the training set.

r_{j}^{} = \frac{\sum_{\begin{array}{c} i = 1 \\ ŷ_{j}^{(i)} \neq y_{}^{(i)} \end{array}}^{m} w_{}^{(i)}}{\sum_{i = 1}^{m} w_{}^{(i)}}

(3.1)

where $ŷ_{j}^{(i)}$ is the $j^{th}$ predictor’s prediction for the $i^{th}$ instance. The predictor’s weight ( $α_{j}^{}$ ) is then determined using Eq. ( 3.2 ), where $η$ is the learning rate hyperparameter. ²¹ ²¹ which is 1 by default.

α_{j} = η \log \frac{1 - r_{j}^{}}{r_{j}^{}}

(3.2)

The more accurate the predictor is, the higher its weight will be. If it is just guessing randomly, then its weight will be close to zero. However, if it is most often wrong, ²² ²² i.e., less accurate than random guessing. then its weight will be negative.

Next, the AdaBoost algorithm updates the instance weights, using Eq. ( 3.3 ) , which boosts the weights of the misclassified instances.

w_{}^{(i)} = {\begin{matrix} w_{}^{(i)} & if ŷ_{j}^{(i)} = y_{}^{(i)}, \\ w_{}^{(i)} \exp α_{j} & if ŷ_{j}^{(i)} \neq y_{}^{(i)} . \end{matrix}

(3.3)

Then all the instance weights are normalised by dividing it with $\sum_{i = 1}^{m} w_{}^{(i)}$ .

Finally, a new predictor is trained using the updated weights, and the whole process is repeated:

the new predictor’s weight is computed, the instance weights are updated, then another predictor is trained, and so on.

The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.

To make predictions, AdaBoost simply computes the predictions of all the predictors and weighs them using the predictor weights $α_{j}^{}$ . The predicted class is the one that receives the majority of weighted votes.

sklearn uses a multi-class version of AdaBoost called Stagewise Additive Modeling using a Multiclass Exponential loss function (SAMME) . When there are just two (2) classes, SAMME is equivalent to AdaBoost. If the predictors can estimate class probabilities, ²³ ²³ i.e., if they have a predict_proba() method. sklearn can use a variant of SAMME called SAMME.R (the R stands for “Real”), which relies on class probabilities rather than predictions and generally performs better.

The following code trains an AdaBoost classifier based on 30 decision stumps using sklearn ’s AdaBoostClassifier class. ²⁴ ²⁴ As you might expect, there is also an AdaBoostRegressor class. A decision stump is a decision tree with max_depth=1, which in other words, a tree composed of a single decision node plus two leaf nodes. This is the default base estimator for the AdaBoostClassifier class:

from sklearn.ensemble import AdaBoostClassifier 
 
ada_clf = AdaBoostClassifier( 
   DecisionTreeClassifier(max_depth=1), n_estimators=30, 
   learning_rate=0.5, random_state=42) 
ada_clf.fit(X_train, y_train)

Advantages	Disadvantages
Can effectively combine multiple weak classifiers to create a strong classifier with high accuracy	Can be sensitive to outliers and noisy data
Can handle complex datasets and capture intricate patters by iteratively adapting to difficult examples	Training process can be computationally expensive, especially dealing with large datasets.
By focusing on misclassified examples and adjusting sample weights, AdaBoost mitigates the risk of overfitting	Appropriate selection of weak classifiers and the number of hyperparameters are crucial for performance.
A versatile algorithm which can work with different types of base classifiers	Can struggle with imbalanced datasets.

Table 3.1: The advantages and disadvantages of AdaBoost.

If AdaBoost ensemble is overfitting the training set, we can try reducing the number of estimators or more strongly regularizing the base estimator.

3.4.2 Gradient Boosting

Another very popular boosting algorithm is gradient boosting. Just like AdaBoost, gradient boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

Let’s go through a simple regression example, using decision trees as the base predictors. This is called gradient tree boosting, or gradient boosted regression trees (GBRT). First, let’s generate a noisy quadratic dataset and fit a DecisionTreeRegressor to it:

import numpy as np 
from sklearn.tree import DecisionTreeRegressor 
 
np.random.seed(42) 
X = np.random.rand(100, 1) - 0.5 
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)  # y = 3xš + Gaussian noise 
 
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42) 
tree_reg1.fit(X, y)

Next, we’ll train a second DecisionTreeRegressor on the residual errors made by the first predictor:

y2 = y - tree_reg1.predict(X) 
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43) 
tree_reg2.fit(X, y2)

And then we’ll train a third regressor on the residual errors made by the second predictor:

y3 = y2 - tree_reg2.predict(X) 
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44) 
tree_reg3.fit(X, y3)

Now we have an ensemble containing three (3) trees. It can make predictions on a new instance simply by adding up the predictions of all the trees:

X_new = np.array([[-0.4], [0.], [0.5]]) 
print(sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3)))

[0.49484029 0.04021166 0.75026781]

Figure 3.5: In this depiction of gradient boosting, the first predictor (top left) is trained normally, then each consecutive predictor (middle left and lower left) is trained on the previous predictor’s residuals; the right column shows the resulting ensemble’s predictions

Fig. 3.5 represents the predictions of these three trees in the left column, and the ensemble’s predictions in the right column. In the first row, the ensemble has just one tree, so its predictions are exactly the same as the first tree’s predictions. In the second row, a new tree is trained on the residual errors of the first tree. On the right we can see that the ensemble’s predictions are equal to the sum of the predictions of the first two trees. Similarly, in the third row another tree is trained on the residual errors of the second tree. You can see that the ensemble’s predictions gradually get better as trees are added to the ensemble.

We can use sklearn ’s GradientBoostingRegressor class to train GBRT ensembles more easily. As a reminder, there’s also a GradientBoostingClassifier class for classification. Much like the RandomForestRegressor class, it has hyperparameters to control the growth of decision trees, such as max_depth, min_samples_leaf, as well as hyperparameters to control the ensemble training, such as the number of trees. ²⁵ ²⁵ i.e., n_estimators.

The following code creates the same ensemble as the previous one:

from sklearn.ensemble import GradientBoostingRegressor 
 
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, 
                       learning_rate=1.0, random_state=42) 
gbrt.fit(X, y)

The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.05, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage where Fig. 3.6 shows two GBRT ensembles trained with different hyperparameters: the one on the left does not have enough trees to fit the training set, while the one on the right has about the right amount. If we added more trees, the GBRT would start to overfit the training set.

Figure 3.6: GBRT ensembles with not enough predictors (left) and just enough (right).

To find the optimal number of trees, we could perform cross-validation using GridSearchCV or RandomizedSearchCV, as usual, but there’s a simpler way:

if we set the n_iter_no_change hyperparameter to an integer value, say 10, then the GradientBoostingRegressor will automatically stop adding more trees during training if it sees that the last 10 trees didn’t help.

This is simply early stopping, but with a little bit of patience. It tolerates having no progress for a few iterations before it stops. Let’s train the ensemble using early stopping:

gbrt_best = GradientBoostingRegressor( 
   max_depth=2, learning_rate=0.05, n_estimators=500, 
   n_iter_no_change=10, random_state=42) 
gbrt_best.fit(X, y)

If you set n_iter_no_change too low, training may stop too early and the model will underfit. But if you set it too high, it will overfit instead. We also set a fairly small learning rate and a high number of estimators, but the actual number of estimators in the trained ensemble is much lower, thanks to early stopping:

print(gbrt_best.n_estimators_)

When n_iter_no_change is set, the fit() method automatically splits the training set into a smaller training set and a validation set: this allows it to evaluate the model’s performance each time it adds a new tree. The size of the validation set is controlled by the validation_fraction hyperparameter, which is 10% by default. The tol hyperparameter determines the maximum performance improvement that still counts as negligible. It defaults to 0.0001. The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree.

For example, if subsample=0.25, then each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this technique trades a higher bias for a lower variance. It also speeds up training considerably. This is called stochastic gradient boosting.

3.4.3 Histogram-Based Gradient Boosting

sklearn provides another GBRT implementation, optimised for large datasets: histogram based gradient boosting (HGB). It works by binning the input features, replacing them with integers. The number of bins is controlled by the max_bins hyperparameter, which defaults to 255 and cannot be set any higher than this. Binning can greatly reduce the number of possible thresholds that the training algorithm needs to evaluate. Moreover, working with integers makes it possible to use faster and more memory efficient data structures. And the way the bins are built removes the need for sorting the features when training each tree.

As a result, this implementation has a computational complexity of $O {(\underline{\times} m)}^{}$ instead of $O {(n \times m \log m)}^{}$ , where b is the number of bins, m is the number of training instances, and n is the number of features. In practice, this means that HGB can train hundreds of times faster than regular GBRT on large datasets. However, binning causes a precision loss, which acts as a regularizer: depending on the dataset, this may help reduce overfitting, or it may cause underfitting.

sklearn provides two (2) classes for HGB:

1.: HistGradientBoostingRegressor
2.: HistGradientBoostingClassifier

They’re similar to GradientBoostingRegressor and GradientBoostingClassifier, with a few notable differences:

Early stopping is automatically activated if the number of instances is greater than 10,000. You can turn early stopping always on or always off by setting the early_stopping hyperparameter to True or False.
Subsampling is not supported.
n_estimators is renamed to max_iter.
The only decision tree hyperparameters that can be tweaked are:
- max_leaf_nodes,
- min_samples_leaf,
- and max_depth.

The HGB classes also have two (2) nice features:

They support both categorical features and missing values. This simplifies pre-processing quite a bit.

However, the categorical features must be represented as integers ranging from 0 to a number lower than max_bins. You can use an OrdinalEncoder for this.

For example, here’s how to build and train a complete pipeline for the California housing dataset:

from sklearn.pipeline import make_pipeline 
from sklearn.compose import make_column_transformer 
from sklearn.ensemble import HistGradientBoostingRegressor 
from sklearn.preprocessing import OrdinalEncoder 
hgb_reg = make_pipeline( 
make_column_transformer((OrdinalEncoder(), ["ocean_proximity"]), 
remainder="passthrough"), 
HistGradientBoostingRegressor(categorical_features=[0], random_state=42) 
) 
hgb_reg.fit(housing, housing_labels)

The whole pipeline is just as short as the imports! No need for an imputer, scaler, or a one-hot encoder, so it’s really convenient. Note that categorical_features must be set to the categorical column indices (or a Boolean array). Without any hyperparameter tuning, this model yields an RMSE of about 47,600, which is not too bad.

Information : Implementations

Several other optimised implementations of gradient boosting are available in the Python ML ecosystem: in particular, XGBoost, CatBoost, and LightGBM. These libraries have been around for several years. They are all specialized for gradient boosting, their APIs are very similar to sklearn ’s, and they provide many additional features, including GPU acceleration; you should definitely check them out! Moreover, the TensorFlow RF s library provides optimised implementations of a variety of RF algorithms, including plain RF s, extra-trees, GBRT, and several more.

3.5 Bagging v. Boosting

Similarities

Bagging and Boosting, both being the commonly used methods, have a universal similarity of being classified as ensemble methods. To summarise, let’s look at the similarities between them.

Both are ensemble methods to get N learners from 1 learner.
Both generate several training data sets by random sampling.
Both make the final decision by averaging the N learners (or taking the majority of them i.e Majority Voting).
Both are good at reducing variance and provide higher stability.

Differences

Bagging	Boosting
The simplest way of combining predictions that belong to the same type. classifiers to create a strong classifier with high accuracy	A way of combining predictions that belong to the different types.
Aim to decrease variance, not bias	Aim to decrease bias, not variance.
Each model receives equal weight.	Models are weighted according to their performance.
Each model is built independently.	New models are influenced by the performance of previously built models.
Different training data subsets are selected using row sampling with replacement and random sampling methods from the entire training dataset.	Iteratively train models, with each new model focusing on correcting the errors (misclassifications or high residuals) of the previous models
Bagging tries to solve the over-fitting problem.	Boosting tries to reduce bias.
In this base classifiers are trained in parallel.	In this base classifiers are trained sequentially.

Table 3.2: A comparison between Bagging and Boosting.

3.6 Stacking

The last ensemble method we will discuss, is called stacking. ²⁶ ²⁶ short for stacked generalization. It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation?

To train the blender, you first need to build the blending training set. You can use cross_val_predict() on every predictor in the ensemble to get out-of-sample predictions for each instance in the original training set, and use these can be used as the input features to train the blender; and the targets can simply be copied from the original training set. Note that regardless of the number of features in the original training set (just one in this example), the blending training set will contain one input feature per predictor (three in this example). Once the blender is trained, the base predictors are retrained one last time on the full original training set.

It is actually possible to train several different blenders this way (e.g., one using linear regression, another using RF regression) to get a whole layer of blenders, and then add another blender on top of that to produce the final prediction. You may be able to squeeze out a few more drops of performance by doing this, but it will cost you in both training time and system complexity.

sklearn provides two classes for stacking ensembles: StackingClassifier and StackingRegressor. For example, we can replace the VotingClassifier we used at the beginning of this chapter on the moons dataset with a StackingClassifier :

from sklearn.ensemble import StackingClassifier 
stacking_clf = StackingClassifier( 
   estimators=[ 
      ('lr', LogisticRegression(random_state=42)), 
      ('rf', RandomForestClassifier(random_state=42)), 
      ('svc', SVC(probability=True, random_state=42)) 
   ], 
   final_estimator=RandomForestClassifier(random_state=43), 
 
cv=5 # number of cross-validation folds 
) 
stacking_clf.fit(X_train, y_train)

For each predictor, the stacking classifier will call predict_proba() if available; if not it will fall back to decision_function() or, as a last resort, call predict(). If you don’t provide a final estimator, StackingClassifier will use LogisticRegression and StackingRegressor will use RidgeCV. If you evaluate this stacking model on the test set, you will find 92.8% accuracy, which is a bit better than the voting classifier using soft voting, which got 92%.

In conclusion, ensemble methods are versatile, powerful, and fairly simple to use. RF s, AdaBoost, and GBRT are among the first models you should test for most ML tasks, and they particularly shine with heterogeneous tabular data. Moreover, as they require very little preprocessing, they’re great for getting a prototype up and running quickly. Lastly, ensemble methods like voting classifiers and stacking classifiers can help push your system’s performance to its limits.

3 Ensemble Learning and Random Forests