3
Ensemble Learning and Random Forests
3.1 Introduction
Suppose you ask a complex question to millions of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert’s answer.
This is called the wisdom of the crowd.
In a similar fashion, by aggregating the predictions of a group of predictors, such as classifiers or regressors, we will often get better predictions than with the best individual predictor.
A group of predictors is called an ensemble; this technique is called ensemble learning, and an algorithm that implements it is called an ensemble method.
As examples of ensemble methods, we will look at:
- voting classifiers,
- bagging and pasting ensembles,
- RF s,
- boosting,
- and stacking ensembles.
3.1.1 Voting Classifiers
Let’s assume that we have trained a few classifiers, with each one achieving about 80% accuracy.
These might include, for example, a logistic regression classifier, a random forest classifier, and an SVM classifier. A simple way to create an even better classifier is to combine the predictions of each classifier: the class which gets the most votes is the ensemble’s prediction.
This majority-vote classifier is called a hard voting classifier .
Interestingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each
classifier is a weak learner, meaning it does only slightly better than random guessing, the ensemble can still be a strong learner,
achieving high accuracy, provided there are a sufficient number of weak learners in the ensemble and they are sufficiently diverse.
Let’s discuss how this all works with an example. For simplicity, we will use a slightly biased coin which has a 51% chance of coming up heads and a 49% chance of coming up tails.
If you do the calculation, you will find that the probability of obtaining a majority of heads after 1,000 tosses is close to 75%. The more you toss the coin, the higher the probability (e.g., with 10,000 tosses, the probability climbs over 97%). This is due to the law of large numbers :
as we keep tossing the coin, the ratio of heads gets closer and closer to the probability of heads (51%).
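As a quick numerical check (a sketch using scipy’s binomial distribution, and assuming “majority” means at least half of the tosses come up heads), we can reproduce these figures directly:

from scipy.stats import binom

p_heads = 0.51
for n_tosses in (1_000, 10_000):
    # P(at least half of the tosses are heads) = P(X >= n/2)
    p_majority = binom.sf(n_tosses // 2 - 1, n_tosses, p_heads)
    print(n_tosses, "tosses:", round(p_majority, 3))
# prints roughly 0.75 for 1,000 tosses and roughly 0.98 for 10,000 tosses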
Information : LLN
A mathematical law states that the average of the results obtained from a large number of independent random samples converges to the true value, if it exists. More formally, the LLN states that given a sample of independent and identically distributed values, the sample mean converges to the true mean. An example of this behaviour is shown in Fig. 3.1 .
If we extend this analogy to our previous case, we are now building an ensemble containing 1,000 classifiers which are individually correct only 51% of the time.
In this scenario, if we predict the majority voted class, we can hope for up to 75% accuracy.
However, this is only true if all classifiers are perfectly independent, making uncorrelated errors, which is clearly not the case because they are trained on the same data. They are likely to make the same types of errors, so there will be many majority votes for the wrong class, reducing the ensemble’s accuracy.
Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy [14].
To implement this behaviour, sklearn provides a VotingClassifier class which is intuitive to use: we just give it a list of name/predictor pairs and then use it like a normal classifier.
Let’s try it on the moons dataset we used in the SVM chapter. We will load and split the moons dataset into a training set and a test set, then we’ll create and train a voting classifier composed of three diverse classifiers:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)
voting_clf.fit(X_train, y_train)
When we fit a VotingClassifier
, it clones every estimator and fits the clones. The original estimators are available via the estimators
attribute, whereas the fitted clones are available via the estimators_
attribute.
If we prefer a dict rather than a list, we can use named_estimators or named_estimators_ instead.
To begin, let’s look at each fitted classifier’s individual accuracy on the test set:
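A minimal sketch of such a check, looping over the fitted clones:

for name, clf in voting_clf.named_estimators_.items():
    # each clone was fitted on the training set when voting_clf was fitted
    print(name, "=", clf.score(X_test, y_test))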
When we call the voting classifier’s predict()
method, it performs hard voting.
For example, the voting classifier predicts class 1 for the first instance of the test set, because two out of three classifiers predict that class:
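A small sketch of this check (the first test instance is used purely for illustration):

voting_clf.predict(X_test[:1])                                # the ensemble's (hard-voting) prediction
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]   # the three individual predictions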
We can also look at the performance of the voting classifier on the test set, which is 0.912; this shows that the voting classifier outperforms all the individual classifiers.
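This accuracy can be checked with a one-liner:

voting_clf.score(X_test, y_test)   # about 0.912, as quoted above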
If all classifiers are able to estimate class probabilities (i.e., they all have a predict_proba() method), then we can tell sklearn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting, which often achieves higher performance than hard voting as it gives more weight to highly confident votes.
All we need to do is set the voting classifier’s voting hyperparameter to "soft", and ensure that all classifiers can estimate class probabilities. This is NOT the case for the SVC class by default, so you need to set its probability hyperparameter to True. This will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba() method.
Let’s try that:
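A sketch of how this could be done with the voting classifier trained above (reusing the same estimators and refitting):

voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)   # about 0.92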
We reach 92% accuracy simply by using soft voting.
To give a brief summary of what we have learned so far:
- Hard Voting: takes a simple majority vote to decide the final prediction, based on the most frequent class predicted by the individual models.
- Soft Voting: considers the probability scores of each class predicted by the individual models and averages them to produce a more refined final prediction.
When dealing with imbalanced datasets, soft voting can help mitigate the bias towards the majority class by taking into account the probabilities of all classes.
3.2 Bagging and Pasting
Now that we have a general idea of ensemble methods, let’s look at ways to improve their performance. One way to get a diverse set of classifiers is to use diverse training algorithms. An alternative approach is to use the same training algorithm for every predictor but train each one on a different random subset of the training set.
- When sampling is performed with replacement, this method is called bagging (short for bootstrap aggregating).
- When sampling is performed without replacement, it is called pasting.
Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode (i.e., the most frequent prediction, just like with a hard voting classifier) for classification, or the average for regression.
Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
Predictors can all be trained in parallel, via different Central Processing Unit (CPU) cores or even different servers. Similarly, predictions can be made in parallel. This is one of the reasons bagging and pasting are such popular methods as they scale very well.
3.2.1 Implementation
sklearn offers simple classes for both bagging and pasting: the BaggingClassifier class, or BaggingRegressor for regression. The code below trains an ensemble of 500 decision tree classifiers where each is trained on 100 training instances randomly sampled from the training set with replacement; this is bagging, but if you want to use pasting instead, just set bootstrap=False. The n_jobs parameter tells sklearn the number of CPU cores to use for training and predictions, and -1 tells sklearn to use all available cores:
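A sketch matching this description (the random_state is arbitrary, chosen only for reproducibility):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, n_jobs=-1, random_state=42   # bootstrap=True by default, i.e. bagging
)
bag_clf.fit(X_train, y_train)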
A BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with decision tree classifiers.
Please have a look at Fig. 3.2 , which compares the decision boundary of a single decision tree with the decision boundary of a bagging ensemble of 500 trees, both trained on the moons dataset.
As can be seen, the ensemble’s predictions will likely generalize much better than the single decision tree’s predictions:
the ensemble has a comparable bias but a smaller variance.
It makes roughly the same number of errors on the training set, but the decision boundary is less irregular.
Bagging introduces a higher diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, so the ensemble’s variance is reduced.
Overall, bagging often results in better models, which explains why it’s generally preferred. However if we have spare time and CPU power, we can also use cross-validation to evaluate both bagging and pasting and select the one that works best.
3.2.2 Out-of-Bag Evaluation
With bagging, some training instances may be sampled several times for any given predictor, while others may not be sampled at all. By default a BaggingClassifier samples $m$ training instances with replacement (bootstrap=True), where $m$ is the size of the training set. With this process, it can be shown mathematically that only about 63% of the training instances are sampled on average for each predictor.
Information : Limits of Bagging
If we were to randomly draw one instance from a dataset of size $m$, each instance in the dataset obviously has a probability $\frac{1}{m}$ of getting picked, and therefore a probability $1 - \frac{1}{m}$ of NOT getting picked. If we draw $m$ instances with replacement, all draws are independent and therefore each instance has a probability $\left(1 - \frac{1}{m}\right)^{m}$ of NOT getting picked. Now let’s use the fact that $\left(1 - \frac{1}{m}\right)^{m}$ converges to $\exp(-1)$ as $m$ grows. So if we assume $m$ to be sufficiently large, the ratio of Out-of-Bag (OOB) instances will be about $\exp(-1) \approx 0.37$; therefore roughly 63% of the training instances will be sampled.
The remaining 37% of the training instances that are NOT sampled are called OOB instances.
Note that they are NOT the same 37% for all predictors.
A bagging ensemble can be evaluated using OOB instances, without the need for a separate validation set: if there are enough estimators, then each instance in the training set will likely be an OOB instance of several estimators, so these estimators can be used to make a fair ensemble prediction for that instance.
Once we have a prediction for each instance, we can determine the ensemble’s prediction accuracy. In sklearn, we can set oob_score=True when creating a BaggingClassifier to request an automatic OOB evaluation after training.
The following code demonstrates this effect:
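A sketch consistent with the text (again with an illustrative random_state):

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    oob_score=True, n_jobs=-1, random_state=42
)
bag_clf.fit(X_train, y_train)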
The resulting evaluation score is available in the oob_score_ attribute:
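For instance, with the classifier sketched above:

bag_clf.oob_score_   # about 0.896, per the discussion below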
According to this OOB evaluation, this BaggingClassifier is likely to achieve about 89.6% accuracy on the test set. Let’s verify this:
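One way to check this, using the classifier sketched above:

from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)   # about 0.92, as noted next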
We get 92% accuracy on the test set; the OOB evaluation was a bit too pessimistic, just over 2% too low.
The OOB decision function for each training instance is also available via the oob_decision_function_ attribute. As the base estimator has a predict_proba() method, the decision function returns the class probabilities for each training instance.
For example, the OOB evaluation estimates that the first training instance has a 67.6% probability of belonging to the positive class and a 32.4% probability of belonging to the negative class:
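This can be inspected as follows (the exact numbers depend on the run):

bag_clf.oob_decision_function_[:1]   # e.g. class probabilities close to [0.324, 0.676] for the first instance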
3.2.3 Random Patches and Random Subspaces
The BaggingClassifier class supports sampling the features as well. Sampling is controlled by two hyperparameters:
- max_features,
- bootstrap_features.
They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Therefore, each predictor will be trained on a random subset of the input features.
This technique is particularly useful when you are dealing with high-dimensional inputs, such as images. Sampling both training instances and features is called the random patches method.
Keeping all training instances (by setting bootstrap=False
and max_samples=1.0
) but sampling features (by setting bootstrap_features
to True
and/or max_features
to a value smaller than 1.0) is called the random subspaces method [13].
Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
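As an illustrative sketch (the value max_features=0.5 is an arbitrary choice), the random subspaces method could be configured like this:

subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=False, max_samples=1.0,            # keep all training instances
    bootstrap_features=True, max_features=0.5,   # but sample the features
    n_jobs=-1, random_state=42
)
subspace_clf.fit(X_train, y_train)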
3.3 Random Forests
As we have discussed, a RF is an ensemble of decision trees, generally trained via the bagging method, with max_samples set to the size of the training set.
Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, we can just use the RandomForestClassifier class, which is more convenient and optimised for decision trees.
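For example, a random forest with 500 trees could be trained as follows (the hyperparameter values are illustrative; this is the classifier the equivalent BaggingClassifier later in this section is built to match):

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)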
With a few slight exceptions, a RandomForestClassifier has all the hyperparameters of the DecisionTreeClassifier class, which allow us to control how trees are grown, in addition to all the hyperparameters of a BaggingClassifier to control the ensemble itself.
The RF algorithm introduces extra randomness when growing trees. Instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features.
By default, it samples $\sqrt{n}$ features, where $n$ is the total number of features. The algorithm results in greater tree diversity, which trades a higher bias for a lower variance, generally giving an overall better model.
So, the following BaggingClassifier is equivalent to the previous RandomForestClassifier:
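A sketch of such an equivalent ensemble, mirroring the illustrative random forest above:

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42
)
bag_clf.fit(X_train, y_train)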
3.3.1 Extra-Trees
When we are growing a tree in a RF, at each node only a random subset of the features is considered for splitting. It is possible to make the trees even more random by also using random thresholds for each feature, rather than searching for the best possible thresholds; to do this, set splitter="random" when creating a DecisionTreeClassifier.
A forest of such extremely random trees is called an extremely randomised trees ensemble, or ExtraTrees. Here, the threshold values are chosen from a uniform distribution within the feature’s empirical range.
As with previous methods, this technique trades more bias for a lower variance.
It also makes extra-trees classifiers much faster to train than regular RF s, as finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree [15].
We can create an extra-trees classifier using sklearn’s ExtraTreesClassifier class. Its Application Programming Interface (API) is identical to the RandomForestClassifier class, except bootstrap defaults to False. Similarly, the ExtraTreesRegressor class has the same API as the RandomForestRegressor class, except bootstrap defaults to False.
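For illustration, an extra-trees classifier with hyperparameters mirroring the random forest above could be created like this:

from sklearn.ensemble import ExtraTreesClassifier

extra_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
extra_clf.fit(X_train, y_train)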
It is hard to tell in advance whether a RandomForestClassifier
will perform better or worse than an ExtraTreesClassifier
.
Generally, the only way to know is to try both and compare them using cross-validation.
3.3.2 Feature Importance
Another great quality of
RF
s is that they make it easy to measure the relative importance of each feature. sklearn
measures a
feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average, across all trees in
the forest. More precisely, it is a weighted average, where each node’s weight is equal to the number of training samples that are
associated with it.
sklearn
computes this score automatically for each feature after training, then it scales the results so that the sum of all
importances is equal to 1. You can access the result using the feature_importances_
variable. The following code trains a RandomForestClassifier
on the iris dataset and outputs each feature’s importance:
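The listing could look like the following sketch (using load_iris with as_frame=True so that the feature names are available):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score, 2), name)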
Based on the results, it seems the most important features are the petal length (44%) and width (42%), while sepal length and width are rather unimportant in comparison (11% and 2%, respectively).
Similarly, if we were to train a RF classifier on the Modified National Institute of Standards and Technology (MNIST) dataset and plot each pixel’s importance, we get the image represented in Fig. 3.3 .
RF s are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.
3.4 Boosting
Boosting refers to any ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor:
- Initially, a model is built using the training data.
- Subsequent models are then trained to address the mistakes of their predecessors.
- Boosting assigns weights to the data points in the original dataset.
- Training on weighted data: the subsequent model learns from the weighted dataset, focusing its attention on harder-to-learn examples (those with higher weights).
- This iterative process continues until the entire training dataset is accurately predicted, or a predefined maximum number of models is reached.
There are many boosting methods available, but by far the most popular are AdaBoost and gradient boosting. Let’s start with AdaBoost.
3.4.1 AdaBoost
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances which the predecessor underfit . This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.
For example, when training an AdaBoost classifier, the algorithm first trains a base classifier and uses it to make predictions on the training set; it then increases the relative weights of the misclassified training instances, trains a second classifier using the updated weights, and repeats this process. Fig. 3.4 shows the decision boundaries of five consecutive predictors trained in this way: the first classifier gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job on these instances, and so on. The plot on the right in Fig. 3.4 represents the same sequence of predictors, except that the learning rate is halved.
Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set.
There is one important drawback to this sequential learning technique. Training cannot be parallelized as each predictor can only be trained after the previous predictor has been trained and evaluated. Therefore, it does NOT scale as well as bagging or pasting.
Let’s take a closer look at the AdaBoost algorithm.
Each instance weight $w^{(i)}$ is initially set to $\frac{1}{m}$. A first predictor is trained, and its weighted error rate $r_1$ is computed on the training set; see Eq. (3.1).

$$ r_j = \frac{\sum\limits_{\substack{i=1 \\ \hat{y}_j^{(i)} \neq y^{(i)}}}^{m} w^{(i)}}{\sum\limits_{i=1}^{m} w^{(i)}} \qquad (3.1) $$

where $\hat{y}_j^{(i)}$ is the $j^{\text{th}}$ predictor’s prediction for the $i^{\text{th}}$ instance. The predictor’s weight ($\alpha_j$) is then determined using Eq. (3.2), where $\eta$ is the learning rate hyperparameter.

$$ \alpha_j = \eta \, \log\frac{1 - r_j}{r_j} \qquad (3.2) $$
The more accurate the predictor is, the higher its weight will be. If it is just guessing randomly, then its weight will be close to zero. However, if it is most often wrong (i.e., less accurate than random guessing), then its weight will be negative. Next, the AdaBoost algorithm updates the instance weights using Eq. (3.3), which boosts the weights of the misclassified instances.
$$ w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\ w^{(i)}\exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)} \end{cases} \quad \text{for } i = 1, 2, \dots, m \qquad (3.3) $$
Then all the instance weights are normalised (i.e., divided by $\sum_{i=1}^{m} w^{(i)}$).
Finally, a new predictor is trained using the updated weights, and the whole process is repeated:
the new predictor’s weight is computed, the instance weights are updated, then another predictor is trained, and so on.
The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.
To make predictions, AdaBoost simply computes the predictions of all the predictors and weighs them using the predictor weights . The predicted class is the one that receives the majority of weighted votes.
sklearn uses a multi-class version of AdaBoost called Stagewise Additive Modeling using a Multiclass Exponential loss function (SAMME). When there are just two classes, SAMME is equivalent to AdaBoost. Moreover, if the predictors can estimate class probabilities, i.e., they have a predict_proba() method, sklearn can use a variant of SAMME called SAMME.R (the R stands for “Real”), which relies on class probabilities rather than predictions and generally performs better.
The following code trains an AdaBoost classifier based on 30 decision stumps using sklearn’s AdaBoostClassifier class. A decision stump is a decision tree with max_depth=1, in other words, a tree composed of a single decision node plus two leaf nodes. This is the default base estimator for the AdaBoostClassifier class:
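A sketch matching this description (the learning_rate value is illustrative):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=30,
    learning_rate=0.5, random_state=42
)
ada_clf.fit(X_train, y_train)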
Advantages | Disadvantages |
---|---|
Can effectively combine multiple weak classifiers to create a strong classifier with high accuracy | Can be sensitive to outliers and noisy data |
Can handle complex datasets and capture intricate patterns by iteratively adapting to difficult examples | Training process can be computationally expensive, especially when dealing with large datasets. |
By focusing on misclassified examples and adjusting sample weights, AdaBoost mitigates the risk of overfitting | Appropriate selection of weak classifiers and careful tuning of the hyperparameters are crucial for performance. |
A versatile algorithm which can work with different types of base classifiers | Can struggle with imbalanced datasets. |
If the AdaBoost ensemble is overfitting the training set, we can try reducing the number of estimators or more strongly regularizing the base estimator.
3.4.2 Gradient Boosting
Another very popular boosting algorithm is gradient boosting. Just like AdaBoost, gradient boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.
Let’s go through a simple regression example, using decision trees as the base predictors. This is called gradient tree boosting, or
gradient boosted regression trees (GBRT). First, let’s generate a noisy quadratic dataset and fit a DecisionTreeRegressor
to
it:
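A sketch of this setup (the quadratic data generation below is an assumption consistent with the description; note that it redefines X and y for this regression example):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)   # noisy quadratic target

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)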
Next, we’ll train a second DecisionTreeRegressor
on the residual errors made by the first predictor:
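Continuing the sketch, the second tree is fitted to the first tree's residuals:

y2 = y - tree_reg1.predict(X)          # residual errors of the first tree
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(X, y2)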
And then we’ll train a third regressor on the residual errors made by the second predictor:
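And the third tree is fitted to the second tree's residuals:

y3 = y2 - tree_reg2.predict(X)         # residual errors of the second tree
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(X, y3)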
Now we have an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all three trees:
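For example (X_new is a hypothetical batch of new instances):

X_new = np.array([[-0.4], [0.0], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))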
Fig. 3.5 represents the predictions of these three trees in the left column, and the ensemble’s predictions in the right column. In the first row, the ensemble has just one tree, so its predictions are exactly the same as the first tree’s predictions. In the second row, a new tree is trained on the residual errors of the first tree. On the right we can see that the ensemble’s predictions are equal to the sum of the predictions of the first two trees. Similarly, in the third row another tree is trained on the residual errors of the second tree. You can see that the ensemble’s predictions gradually get better as trees are added to the ensemble.
We can use sklearn’s GradientBoostingRegressor class to train GBRT ensembles more easily. As a reminder, there’s also a GradientBoostingClassifier class for classification. Much like the RandomForestRegressor class, it has hyperparameters to control the growth of decision trees, such as max_depth and min_samples_leaf, as well as hyperparameters to control the ensemble training, such as the number of trees (n_estimators).
The following code creates the same ensemble as the previous one:
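A sketch reproducing the three-tree ensemble above (learning_rate=1.0 means each tree's contribution is not shrunk):

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, random_state=42)
gbrt.fit(X, y)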
The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.05, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage, where each tree’s contribution is shrunk by the learning rate.
Fig. 3.6 shows two GBRT ensembles trained with different hyperparameters: the one on the left does not have enough trees to fit the training set, while the one on the right has about the right amount. If we added more trees, the GBRT would start to overfit the training set.
To find the optimal number of trees, we could perform cross-validation using GridSearchCV
or RandomizedSearchCV
, as usual, but
there’s a simpler way:
if we set the n_iter_no_change hyperparameter to an integer value, say 10, then the GradientBoostingRegressor will automatically stop adding more trees during training if it sees that the last 10 trees didn’t help.
This is simply early stopping, but with a little bit of patience. It tolerates having no progress for a few iterations before it stops. Let’s train the ensemble using early stopping:
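A sketch of such an ensemble (the hyperparameter values are illustrative and mirror the discussion below):

gbrt_best = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.05, n_estimators=500,
    n_iter_no_change=10, random_state=42
)
gbrt_best.fit(X, y)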
If you set n_iter_no_change
too low, training may stop too early and the model will underfit. But if you set it too high, it will
overfit instead. We also set a fairly small learning rate and a high number of estimators, but the actual number of estimators in
the trained ensemble is much lower, thanks to early stopping:
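The number of trees actually kept can be inspected as follows:

gbrt_best.n_estimators_   # well below the 500 requested, thanks to early stopping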
When n_iter_no_change
is set, the fit()
method automatically splits the training set into a smaller training set and a validation
set: this allows it to evaluate the model’s performance each time it adds a new tree. The size of the validation set is controlled by
the validation_fraction
hyperparameter, which is 10% by default. The tol
hyperparameter determines the maximum
performance improvement that still counts as negligible. It defaults to 0.0001. The GradientBoostingRegressor
class also
supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each
tree.
For example, if subsample=0.25
, then each tree is trained on 25% of the training instances, selected randomly. As you can
probably guess by now, this technique trades a higher bias for a lower variance. It also speeds up training considerably. This is
called stochastic gradient boosting.
3.4.3 Histogram-Based Gradient Boosting
sklearn
provides another GBRT implementation, optimised for large datasets: histogram based gradient boosting (HGB). It
works by binning the input features, replacing them with integers. The number of bins is controlled by the max_bins
hyperparameter, which defaults to 255 and cannot be set any higher than this. Binning can greatly reduce the number of possible
thresholds that the training algorithm needs to evaluate. Moreover, working with integers makes it possible to use faster and
more memory efficient data structures. And the way the bins are built removes the need for sorting the features when training
each tree.
As a result, this implementation has a computational complexity of $O(b \times m)$ instead of $O(n \times m \times \log(m))$, where $b$ is the number of bins, $m$ is the number of training instances, and $n$ is the number of features. In practice, this means that HGB can train hundreds of times faster than regular GBRT on large datasets. However, binning causes a precision loss, which acts as a regularizer: depending on the dataset, this may help reduce overfitting, or it may cause underfitting.
sklearn provides two classes for histogram-based gradient boosting:
1. HistGradientBoostingRegressor,
2. HistGradientBoostingClassifier.
They’re similar to GradientBoostingRegressor
and GradientBoostingClassifier
, with a few notable differences:
- Early stopping is automatically activated if the number of instances is greater than 10,000. You can turn early stopping always on or always off by setting the early_stopping hyperparameter to True or False.
- Subsampling is not supported.
- n_estimators is renamed to max_iter.
- The only decision tree hyperparameters that can be tweaked are: max_leaf_nodes, min_samples_leaf, and max_depth.
- The HGB classes also have two nice features: they support both categorical features and missing values. This simplifies pre-processing quite a bit.
However, the categorical features must be represented as integers ranging from 0 to a number lower than max_bins
. You can use
an OrdinalEncoder
for this.
For example, here’s how to build and train a complete pipeline for the California housing dataset:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder

hgb_reg = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ["ocean_proximity"]),
                            remainder="passthrough"),
    HistGradientBoostingRegressor(categorical_features=[0], random_state=42)
)
hgb_reg.fit(housing, housing_labels)
The whole pipeline is just as short as the imports! No need for an imputer, scaler, or a one-hot encoder, so it’s really convenient.
Note that categorical_features
must be set to the categorical column indices (or a Boolean array). Without any hyperparameter
tuning, this model yields an RMSE of about 47,600, which is not too bad.
Information : Implementations
Several other optimised implementations of gradient boosting are available in the Python ML ecosystem: in
particular, XGBoost, CatBoost, and LightGBM. These libraries have been around for several years. They are
all specialized for gradient boosting, their APIs are very similar to sklearn
’s, and they provide many additional
features, including GPU acceleration; you should definitely check them out! Moreover, the TensorFlow
RF
s library
provides optimised implementations of a variety of
RF
algorithms, including plain
RF
s, extra-trees, GBRT, and
several more.
3.5 Bagging v. Boosting
Similarities
Bagging and boosting, both being commonly used methods, share the universal similarity of being classified as ensemble methods. To summarise, let’s look at the similarities between them.
- Both are ensemble methods to get N learners from 1 learner.
- Both generate several training data sets by random sampling.
- Both make the final decision by averaging the N learners (or taking the majority of them, i.e., majority voting).
- Both are good at reducing variance and provide higher stability.
Differences
Bagging | Boosting |
---|---|
The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to different types. |
Aim to decrease variance, not bias | Aim to decrease bias, not variance. |
Each model receives equal weight. | Models are weighted according to their performance. |
Each model is built independently. | New models are influenced by the performance of previously built models. |
Different training data subsets are selected using row sampling with replacement and random sampling methods from the entire training dataset. | Iteratively train models, with each new model focusing on correcting the errors (misclassifications or high residuals) of the previous models |
Bagging tries to solve the over-fitting problem. | Boosting tries to reduce bias. |
Base classifiers are trained in parallel. | Base classifiers are trained sequentially. |
3.6 Stacking
The last ensemble method we will discuss is called stacking (short for stacked generalisation). Instead of using a trivial function such as hard voting to aggregate the predictions of all the predictors in an ensemble, stacking trains a final model, called a blender (or meta learner), to perform this aggregation. To train the blender, you first need to build the blending training set. You can use cross_val_predict() on every predictor in the ensemble to get out-of-sample predictions for each instance in the original training set; these can be used as the input features to train the blender, and the targets can simply be copied from the original training set. Note that regardless of the number of features in the original training set (just one in this example), the blending training set will contain one input feature per predictor (three in this example). Once the blender is trained, the base predictors are retrained one last time on the full original training set.
It is actually possible to train several different blenders this way (e.g., one using linear regression, another using RF regression) to get a whole layer of blenders, and then add another blender on top of that to produce the final prediction. You may be able to squeeze out a few more drops of performance by doing this, but it will cost you in both training time and system complexity.
sklearn provides two classes for stacking ensembles: StackingClassifier and StackingRegressor. For example, we can replace the VotingClassifier we used at the beginning of this chapter on the moons dataset with a StackingClassifier:
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5  # number of cross-validation folds
)
stacking_clf.fit(X_train, y_train)
For each predictor, the stacking classifier will call predict_proba() if available; if not, it will fall back to decision_function() or, as a last resort, call predict(). If you don’t provide a final estimator, StackingClassifier will use LogisticRegression and StackingRegressor will use RidgeCV.
If you evaluate this stacking model on the test set, you will find 92.8% accuracy, which is a bit better than the voting classifier using soft voting, which got 92%.
In conclusion, ensemble methods are versatile, powerful, and fairly simple to use. RF s, AdaBoost, and GBRT are among the first models you should test for most ML tasks, and they particularly shine with heterogeneous tabular data. Moreover, as they require very little preprocessing, they’re great for getting a prototype up and running quickly. Lastly, ensemble methods like voting classifiers and stacking classifiers can help push your system’s performance to its limits.