4
Dimensionality Reduction
4.1 Introduction
Many ML problems involve thousands or even millions of features for each training instance. Not only do all these features make training extremely slow, but they can also make it much harder to find a good solution, as we will see in the remaining parts of this chapter. This problem is often referred to as the curse of dimensionality.
Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one.
For example, consider the MNIST images we worked with previously. The pixels on the image borders are almost always white, so we could completely drop these pixels from the training set without losing any information.
As we saw previously in Ensemble Learning, we confirmed that these pixels are unimportant for the classification task, as shown in Fig. 3.3. Additionally, two neighboring pixels are often highly correlated: if we merge them into a single pixel (e.g., by taking the mean of the two pixel intensities), we will not lose much information.
Information : Limits of Reduction
Reducing dimensionality does cause some information loss, similar to how compressing an image to JPEG can degrade its quality, so even though it will speed up training, it may make your system perform slightly worse. It also makes your pipelines a bit more complex and thus harder to maintain.
Therefore, it is in your best interest to first try to train your system with the original data before considering using dimensionality reduction.
In some cases, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance, but in general it won’t. It will just speed up training.
Apart from speeding up training, dimensionality reduction is also highly useful for data visualization.
In this chapter, we will first discuss the curse of dimensionality and get a sense of what goes on in high-dimensional space.
Following this, we will consider the two main approaches to dimensionality reduction:

- projection, and
- manifold learning,

and we will go through three of the most popular dimensionality reduction techniques:

1. PCA,
2. Random projection,
3. Locally Linear Embedding (LLE).
4.1.1 The Problems of Dimensions
We are so used to living in three dimensions that our intuition fails us when we try to imagine a high-dimensional space. Even a basic 4D hypercube is incredibly hard to picture in our minds, let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space.
The more dimensions the data has, the less geometrically explainable it becomes.
It turns out that many things behave very differently in high-dimensional space. For example, a point picked at random in a unit square (a 1 × 1 square) has only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be “extreme” along any dimension). However, if we do the same thing in a 10,000-dimensional unit hypercube, this probability is greater than 99.999999%. Most points in a high-dimensional hypercube are very close to the border, which is quite counterintuitive.
Let $n$ be the dimension of a unit cube in $n$-dimensional space. The cube has $2n$ boundary faces. For a point to be less than 0.001 from a boundary of the $n$-dimensional unit cube means that it does not lie inside the inner cube of side length $1 - 2 \times 0.001 = 0.998$ sharing the same center as the original cube, which happens with probability $1 - 0.998^{\,n}$.
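As a quick numerical sanity check, we can evaluate this probability for a few dimensions (a minimal sketch, using the formula above):

```python
# Probability that a uniformly random point in the unit n-cube lies
# within 0.001 of at least one boundary: 1 - 0.998**n
for n in (2, 3, 10_000):
    print(n, 1 - 0.998 ** n)
# 2      -> ~0.004      (about 0.4%)
# 3      -> ~0.006
# 10000  -> ~0.999999998 (greater than 99.999999%)
```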
Information : n-dimensional Geometry
When dimensions become detached from everyday experience, the mathematics and our intuition diverge more and more. Many problems in mathematics become far more complicated as the geometry moves to $n$ dimensions.
Here is another example. If we were to pick two points randomly in a unit square, the distance between them would be, on average, roughly 0.52; for two random points in a unit 3D cube, the average distance is roughly 0.66. But what about two points picked randomly in a 1,000,000-dimensional unit hypercube? The average distance, believe it or not, will be about 408.25. This is a problem known as Hypercube Line Picking.
Let two points $\mathbf{x}$ and $\mathbf{y}$ be picked randomly (uniformly) from a unit $n$-dimensional hypercube. The expected distance between the points $\Delta(n)$, i.e., the mean line segment length, is then:

$$\Delta(n) = \int_0^1 \cdots \int_0^1 \sqrt{\sum_{i=1}^{n}\left(x_i - y_i\right)^2}\; dx_1 \cdots dx_n\, dy_1 \cdots dy_n$$
The first few values of $\Delta(n)$ are given in the following table.

| n | OEIS | Δ(n) | n | OEIS | Δ(n) |
|---|------|------|---|------|------|
| 1 | – | 0.3333333333… | 5 | A103984 | 0.8785309152… |
| 2 | A091505 | 0.5214054331… | 6 | A103985 | 0.9689420830… |
| 3 | A073012 | 0.6617071822… | 7 | A103986 | 1.0515838734… |
| 4 | A103983 | 0.7776656535… | 8 | A103987 | 1.1281653402… |
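These values can be checked numerically; the following sketch estimates $\Delta(n)$ by Monte Carlo sampling:

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_distance(n, samples=100_000):
    """Monte Carlo estimate of the mean distance between two points
    drawn uniformly at random from the unit n-dimensional hypercube."""
    x = rng.random((samples, n))
    y = rng.random((samples, n))
    return np.linalg.norm(x - y, axis=1).mean()

print(mean_distance(2))   # ~0.521
print(mean_distance(3))   # ~0.662
print(mean_distance(8))   # ~1.128
```

For very large $n$, the mean distance approaches $\sqrt{n/6}$, which for $n = 1{,}000{,}000$ gives about 408.25.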
This is counterintuitive: how can two points be so far apart when they both lie within the same unit hypercube? Well, there’s just plenty of space in high dimensions. As a result, high-dimensional datasets are at risk of being very sparse:
most training instances are likely to be far away from each other.
This also means a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations.
The more dimensions the training set has, the greater the risk of overfitting it.
In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances. Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions.
With just 100 features, which is significantly fewer than in the MNIST problem, all ranging from 0 to 1, you would need more training instances than atoms in the observable universe in order for training instances to be within 0.1 of each other on average, assuming they were spread out uniformly across all dimensions.
4.2 Main Approaches to Dimensionality Reduction
Before we dive into different dimensionality reduction algorithms, let’s take a look at the two main approaches to reducing dimensionality:

1. projection, and
2. manifold learning.
4.2.1 Projection
In most real-world problems, training instances are NOT spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances lie within a much lower-dimensional subspace of the high-dimensional space. This sounds very abstract, so let’s look at an example.
In Fig. 4.0a we can see a 3D dataset represented by small spheres.
As you can see, almost all training instances lie close to a plane:
this is a lower-dimensional (2D) subspace of the higher-dimensional (3D) space.
If we were to project every training instance perpendicularly onto this subspace, we would obtain the new 2D dataset shown in Fig. 4.0b. We have just reduced the dataset’s dimensionality from 3D to 2D. Note the axes corresponding to the new features $z_1$ and $z_2$, which are the coordinates of the projections onto the plane.
4.2.2 Manifold Learning
It is worth mentioning that projection is not always the best approach to dimensionality reduction. In many cases the subspace may twist and turn, as in the famous Swiss roll toy dataset, available in sklearn, that is commonly used for testing and demonstrating nonlinear dimensionality reduction algorithms. It consists of a set of points in three dimensions, arranged in a rolled-up, spiral-like sheet. Simply projecting onto a plane (e.g., by dropping the third dimension) would squash different layers of the roll together; what we really want instead is to unroll it and obtain a flat 2D dataset.
The Swiss roll is an example of a 2D manifold. Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a $d$-dimensional manifold is a part of an $n$-dimensional space (where $d < n$) that locally resembles a $d$-dimensional hyperplane. In the case of the Swiss roll, $d = 2$ and $n = 3$: it locally resembles a 2D plane, but it is rolled in the third dimension.
Many dimensionality reduction algorithms work by modelling the manifold on which the training instances lie. This is called manifold learning. It relies on the manifold assumption, which holds that:
Most real-world high-dimensional datasets lie close to a much lower dimensional manifold.
This assumption is very often empirically observed.
Once again, let us look back at the MNIST dataset: all handwritten digit images have some similarities. They are made of connected lines, the borders are white, and they are more or less centered. If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits.
In other words, the degrees of freedom available to you if you try to create a digit image are dramatically lower than the degrees of freedom you have if you are allowed to generate any image you want.
These constraints tend to squeeze the dataset into a lower-dimensional manifold.
The manifold assumption is often accompanied by another implicit assumption:
The task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold.
For example, in Fig. 4.3c, the Swiss roll is split into two classes: in the unrolled 2D manifold space the decision boundary is much simpler than in the original 3D space. However, this implicit assumption does not always hold; for some datasets the decision boundary may actually be simpler in the original, higher-dimensional space. Reducing the dimensionality of your training set before training a model will usually speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.
We now have a good sense of what the curse of dimensionality is and how dimensionality reduction algorithms can fight it, especially when the manifold assumption holds. The rest of this chapter will go through some of the most popular algorithms for dimensionality reduction.
4.3 Principal Component Analysis (PCA)
PCA is by far the most popular dimensionality reduction algorithm, invented in 1901 by Karl Pearson.
4.3.1 Preserving the Variance
Before we can project the training set onto a lower-dimensional hyperplane, we first need to choose the right hyperplane. For example, Fig. 4.5 shows a simple 2D dataset on the left, along with three different candidate axes (i.e., 1D hyperplanes).
On the right is the result of the projection of the dataset onto each of these axes. As we can see:

- Top: the projection onto the solid line preserves the maximum variance;
- Bottom: the projection onto the dotted line preserves very little variance;
- Middle: the projection onto the dashed line preserves an intermediate amount of variance.
It seems reasonable to select the axis preserving the maximum amount of variance, as it will most likely lose less information than the other projections. Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the core idea behind PCA.
4.3.2 Principal Components
PCA identifies the axis that accounts for the largest amount of variance in the training set. In Fig. 4.5, it is the solid line.

It also finds a second axis, orthogonal to the first one, which accounts for the largest amount of the remaining variance. In this 2D example there is no choice: it is the dotted line. For a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on, as many axes as the number of dimensions in the dataset.

The $i$-th axis is called the $i$-th principal component (PC) of the data. In Fig. 4.5, the first PC is the axis on which vector $\mathbf{c}_1$ lies, and the second PC is the axis on which vector $\mathbf{c}_2$ lies.

In Fig. 4.0a the first two PCs are on the projection plane, and the third PC is the axis orthogonal to that plane. After the projection, in Fig. 4.0b, the first PC corresponds to the $z_1$ axis, and the second PC corresponds to the $z_2$ axis.
Information : The Idea Behind PCA
For each PC, PCA finds a zero-centered unit vector pointing in the direction of the PC. As two opposing unit vectors lie on the same axis, the direction of the unit vectors returned by PCA is not stable: if you perturb the training set slightly and run PCA again, the unit vectors may point in the opposite direction to the original vectors. However, they will generally still lie on the same axes. In some cases, a pair of unit vectors may even rotate or swap (if the variances along these two axes are very close), but the plane they define will generally remain the same.
So how can we find the principal components of a training set? Luckily, there is a standard matrix factorisation technique called Singular Value Decomposition (SVD), which decomposes the training set matrix $X$ into the matrix multiplication of three matrices, $U\,\Sigma\,V^{\top}$, where $V$ contains the unit vectors that define all the principal components we are looking for. The following Python code uses NumPy’s svd() function to obtain all the principal components of the 3D training set represented in Fig. 4.0a, then it extracts the two unit vectors that define the first two PCs:
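A minimal sketch of what this code might look like, assuming the 3D training set is stored in a NumPy array X:

```python
import numpy as np

# Center the data first (PCA assumes a zero-centered dataset)
X_centered = X - X.mean(axis=0)

# Full SVD: the rows of Vt are the unit vectors of the principal components
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt[0]   # first principal component
c2 = Vt[1]   # second principal component
```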
PCA assumes that the dataset is centered around the origin. As we will see, sklearn’s PCA classes take care of centering the data for us. If we were to implement PCA ourselves, or if we used other libraries, we shouldn’t forget to center the data first.
4.3.3 Downgrading Dimensions
Once we have identified all the PCs, we can reduce the dimensionality of the dataset down to $d$ dimensions by projecting it onto the hyperplane defined by the first $d$ principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible.
For example, in Fig. 4.0a the 3D dataset is projected down to the 2D plane defined by the first two principal components, preserving a large part of the dataset’s variance. To project the training set onto the hyperplane and obtain a reduced dataset $X_{d\text{-proj}}$ of dimensionality $d$, compute the matrix multiplication of the training set matrix $X$ by the matrix $W_d$, defined as the matrix containing the first $d$ columns of $V$:

$$X_{d\text{-proj}} = X\, W_d$$
The following Python code projects the training set onto the plane defined by the first two principal components:
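A sketch continuing the earlier SVD example (the names X_centered and Vt come from that sketch):

```python
W2 = Vt[:2].T              # first two principal components as columns
X2D = X_centered @ W2      # project the centered data onto the 2D plane
```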
We now know how to reduce the dimensionality of any dataset by projecting it down to any number of dimensions, while preserving as much variance as possible.
Using sklearn

The PCA class uses SVD to implement PCA, just like we did earlier. sklearn’s PCA also automatically takes care of centering the data. The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions:
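A sketch of this step, assuming the dataset is stored in X:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
```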
After fitting the PCA transformer to the dataset, its components_ attribute holds the transpose of $W_d$: it contains one row for each of the first $d$ principal components.
Explained Variance Ratio
Another useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ attribute. The ratio indicates the proportion of the dataset’s variance that lies along each principal component.
For example, let’s look at the explained variance ratios of the first two components of the 3D dataset represented in Fig. 4.0a:
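A sketch of this check, assuming pca is the 2-component transformer fitted above:

```python
# Proportion of the dataset's variance along each of the first two PCs
print(pca.explained_variance_ratio_)   # e.g. approximately [0.76, 0.15]
```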
This output tells us that about 76% of the dataset’s variance lies along the first PC, and about 15% lies along the second PC. This leaves about 9% for the third PC, so it is reasonable to assume that the third PC probably carries little information.
4.3.4 The Right Number of Dimensions
Instead of arbitrarily choosing the number of dimensions to reduce down to, it is simpler to choose the number of dimensions whose explained variance ratios add up to a sufficiently large portion of the variance, for example, 95%.
An exception to this rule, of course, is if you are reducing dimensionality for data visualization, in which case you will want to reduce the dimensionality down to 2 or 3.
Let’s load and split the MNIST dataset, perform PCA without reducing dimensionality, and then compute the minimum number of dimensions required to preserve 95% of the training set’s variance:
```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

mnist = fetch_openml('mnist_784', as_frame=False, parser="auto")
X_train, y_train = mnist.data[:60_000], mnist.target[:60_000]
X_test, y_test = mnist.data[60_000:], mnist.target[60_000:]

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1   # d equals 154
```
We could then set n_components=d
and run
PCA
again, but there’s a better option. Instead of specifying the number of principal
components you want to preserve, you can set n_components
to be a float between 0.0 and 1.0, indicating the ratio of variance you
wish to preserve:
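A sketch of this, assuming X_train is the MNIST training set loaded above:

```python
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
```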
The actual number of components is determined during training, and it is stored in the n_components_
attribute:
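Continuing the sketch above:

```python
print(pca.n_components_)   # 154 for MNIST when preserving 95% of the variance
```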
A different option is to plot the explained variance as a function of the number of dimensions which you can see in Fig. 4.6 . There will usually be an elbow in the curve, where the explained variance stops growing fast. In this case, we can see that reducing the dimensionality down to about 100 dimensions wouldn’t lose too much explained variance.
Finally, if we are using dimensionality reduction as a preprocessing step for a supervised learning task (e.g., classification), then we can tune the number of dimensions like any other hyperparameter.
For example, the following code example creates a two-step pipeline, first reducing dimensionality using
PCA
, then classifying
using a random forest. Next, it uses RandomizedSearchCV
to find a good combination of hyperparameters for both
PCA
and the
random forest classifier.
This example does a quick search, tuning only two hyperparameters, training on just 1,000 instances, and running for only 10 iterations to keep things fast:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

clf = make_pipeline(PCA(random_state=42),
                    RandomForestClassifier(random_state=42))
param_distrib = {
    "pca__n_components": np.arange(10, 80),
    "randomforestclassifier__n_estimators": np.arange(50, 500)
}
rnd_search = RandomizedSearchCV(clf, param_distrib, n_iter=10, cv=3,
                                random_state=42)
rnd_search.fit(X_train[:1000], y_train[:1000])
```
Let’s look at the best hyperparameters found:
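A sketch using the fitted search object from the previous block:

```python
print(rnd_search.best_params_)
```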
It’s interesting to see how low the optimal number of components is:
we reduced a 784-dimensional dataset to just 57 dimensions.
This is tied to the fact that we used a random forest, which is a pretty powerful model. If we used a linear model instead, such as
an SGDClassifier
, the search would find that we need to preserve more dimensions.
4.3.5 PCA for Compression
After dimensionality reduction, the training set takes up much less space. For example, after applying PCA to the MNIST dataset while preserving 95% of its variance, we are left with 154 features, instead of the original 784 features. So the dataset is now less than 20% of its original size, and we only lost 5% of its variance. This is a reasonable compression ratio, and it’s easy to see how such a size reduction would speed up a classification algorithm tremendously.
It is also possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection. This won’t give you back the original data, since the projection lost a bit of information (within the 5% variance that was dropped), but it will likely be close to the original data. The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error [18].
The inverse_transform()
method lets us decompress the reduced MNIST dataset back to 784 dimensions:
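A sketch of this round trip, assuming pca is the transformer fitted with n_components=0.95 above:

```python
X_reduced = pca.transform(X_train)              # 154 dimensions
X_recovered = pca.inverse_transform(X_reduced)  # back to 784 dimensions
```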
Fig. 4.7 shows a few digits from the original training set, seen on the left, and the corresponding digits after compression and decompression. You can see that there is a slight image quality loss, but the digits are still mostly intact. The equation for the inverse transformation is shown below.
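Using the same notation as the projection equation above, the inverse transformation is:

$$X_{\text{recovered}} = X_{d\text{-proj}}\, W_d^{\top}$$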
4.3.6 Randomized PCA
If you set the svd_solver hyperparameter to "randomized", sklearn uses a stochastic algorithm called randomised PCA that quickly finds an approximation of the first $d$ principal components. Its computational complexity is:

$$O(m d^{2}) + O(d^{3}) \qquad (4.1)$$

instead of $O(m n^{2}) + O(n^{3})$ for the full SVD approach; therefore it is dramatically faster than full SVD when $d$ is much smaller than $n$.
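A sketch of what such code might look like on the MNIST training set (the specific hyperparameter values are illustrative):

```python
rnd_pca = PCA(n_components=154, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train)
```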
By default, svd_solver is set to "auto": sklearn automatically uses the randomised PCA algorithm if $\max(m, n) > 500$ and n_components is an integer smaller than 80% of $\min(m, n)$, or else it uses the full SVD approach. So the preceding code would use the randomized PCA algorithm even if you removed the svd_solver="randomized" argument, as $154 < 0.8 \times 784$.
If we want to force sklearn to use full SVD for a slightly more precise result, we can set the svd_solver hyperparameter to "full".
4.3.7 Incremental PCA
A problem with the preceding implementations of PCA is that they require the whole training set to fit in memory in order for the algorithm to run. Fortunately, incremental PCA algorithms have been developed that allow you to split the training set into mini-batches and feed these in one mini-batch at a time. This is useful for large training sets and for applying PCA online. The following code splits the MNIST training set into 100 mini-batches using NumPy’s array_split() function and feeds them to sklearn’s IncrementalPCA class to reduce the dimensionality of the MNIST dataset down to 154 dimensions, just like before.
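A sketch of what this code might look like, reusing X_train from earlier:

```python
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X_train)
```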
We must call the partial_fit()
method with each mini-batch, rather than the fit()
method with the whole training set.
Alternatively, we can use NumPy’s memmap class, which allows us to manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs, when it needs it. To demonstrate this, let’s first create a memory-mapped (memmap) file and copy the MNIST training set to it, then call flush() to ensure that any data still in the cache gets saved to disk. In real life, X_train would typically not fit in memory, so you would load it chunk by chunk and save each chunk to the right part of the memmap array:
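A sketch of this step (the filename is an arbitrary choice):

```python
filename = "my_mnist.mmap"
X_mmap = np.memmap(filename, dtype="float32", mode="w+", shape=X_train.shape)
X_mmap[:] = X_train   # in real life this would be done chunk by chunk
X_mmap.flush()
```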
Next, we can load the memmap file and use it like a regular NumPy array. Let’s use the IncrementalPCA class to reduce its dimensionality. Since this algorithm uses only a small part of the array at any given time, memory usage remains under control. This makes it possible to call the usual fit() method instead of partial_fit(), which is quite convenient:
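A sketch of this step, reusing the filename and n_batches defined above:

```python
X_mmap = np.memmap(filename, dtype="float32", mode="r").reshape(-1, 784)

batch_size = X_mmap.shape[0] // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mmap)
```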
Only the raw binary data is saved to disk, so specify the data type and shape of the array when you load it.
If you omit the shape, np.memmap()
returns a 1D array.
For very high-dimensional datasets, PCA can be too slow. As we saw previously, even if you use randomized PCA its computational complexity is still $O(m d^{2}) + O(d^{3})$, so the target number of dimensions $d$ cannot be too large. If we are dealing with a dataset with tens of thousands of features or more, training may become much too slow: in this case, we should consider using random projection instead.
4.4 Random Projection
As its name suggests, the random projection algorithm projects the data to a lower-dimensional space using a random linear projection. This may sound counter intuitive, but turns out such a random projection is actually very likely to preserve distances fairly well, as was demonstrated mathematically by William B. Johnson and Joram Lindenstrauss in a famous lemma shown in an abridged form below [19].
For any $0 < \varepsilon < 1$, let $X$ be a set of $m$ points in $\mathbb{R}^{n}$, where $n$ is the number of dimensions of the original dataset. If we define a lower dimension $d$ such that:

$$d \;\geq\; \frac{4 \ln(m)}{\tfrac{1}{2}\varepsilon^{2} - \tfrac{1}{3}\varepsilon^{3}}$$

then there exists a linear mapping $f : \mathbb{R}^{n} \to \mathbb{R}^{d}$ such that for all $u, v \in X$:

$$(1 - \varepsilon)\,\lVert u - v\rVert^{2} \;\leq\; \lVert f(u) - f(v)\rVert^{2} \;\leq\; (1 + \varepsilon)\,\lVert u - v\rVert^{2}$$
The theorem states that, after fixing an error level $\varepsilon$, one can map a collection of points from one Euclidean space, no matter how high its dimension is, to a smaller Euclidean space while only changing the squared distance between any two points by a factor between $(1 - \varepsilon)$ and $(1 + \varepsilon)$. The dimension of the image space depends only on the error level and the number of points. Given that the original dimension is very large, one can achieve a significant dimensionality reduction.
So, two similar instances will remain similar after the projection, and two very different instances will remain very different.
As intuition goes, the more dimensions we drop, the more information is lost, and the more distances get distorted.
So how can we choose the optimal number of dimensions?
Well, the progenitors of the aforementioned lemma defined a method which determines the minimum number of dimensions to preserve in order to ensure, with high probability, that distances won’t change by more than a given tolerance.
For example, if we have a dataset containing $m = 5{,}000$ instances with $n = 20{,}000$ features each, and we don’t want the squared distance between any two instances to change by more than $\varepsilon$ = 10%, then we should project the data down to $d$ dimensions, with:

$$d \;\geq\; \frac{4 \ln(5000)}{\tfrac{1}{2}\times 0.1^{2} - \tfrac{1}{3}\times 0.1^{3}} \;\approx\; 7{,}300$$

which is a significant dimensionality reduction. Please observe that the equation does not use $n$; it only relies on $m$ and $\varepsilon$. This equation is implemented by the johnson_lindenstrauss_min_dim() function:
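A sketch of this computation using sklearn’s helper (the variable names are illustrative):

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

m, eps = 5_000, 0.1
d = johnson_lindenstrauss_min_dim(m, eps=eps)
print(d)   # 7300
```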
Now we can just generate a random matrix $P$ of shape $[d, n]$, where each item is sampled randomly from a Gaussian distribution with mean 0 and variance $1/d$, and use it to project a dataset from $n$ dimensions down to $d$ dimensions:
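A sketch continuing from the previous snippet (the dataset here is randomly generated just for illustration):

```python
n = 20_000
rng = np.random.default_rng(42)
P = rng.standard_normal((d, n)) / np.sqrt(d)   # std dev = sqrt(1/d)

X = rng.standard_normal((m, n))                # a fake dataset, for illustration
X_reduced = X @ P.T                            # project from n down to d dimensions
```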
Simple and efficient, and no training is required. The only thing the algorithm needs to create the random matrix is the dataset’s shape. The data itself is not used at all.
For implementation, sklearn offers a GaussianRandomProjection class. When we call its fit() method, it uses johnson_lindenstrauss_min_dim() to determine the output dimensionality, then it generates a random matrix, which it stores in the components_ attribute. Then when we call transform(), it uses this matrix to perform the projection. When creating the transformer, we can set eps if we want to tweak $\varepsilon$ (it defaults to 0.1), or n_components if we want to force a specific target dimensionality $d$. The following code example gives the same result as the preceding code, which we can also use to verify that gaussian_rnd_proj.components_ is equal to the matrix $P$:
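A sketch of the API usage (note that components_ will match the matrix P from the earlier sketch only if both random matrices are generated with the same scheme and seed, which is not guaranteed here):

```python
from sklearn.random_projection import GaussianRandomProjection

gaussian_rnd_proj = GaussianRandomProjection(eps=eps, random_state=42)
X_reduced = gaussian_rnd_proj.fit_transform(X)   # same shape as the manual projection
```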
sklearn also provides a second random projection transformer, known as SparseRandomProjection. It determines the target dimensionality in the same way, generates a random matrix of the same shape, and performs the projection identically. The main difference is that the random matrix is sparse. This means it uses much less memory: about 25 MB instead of almost 1.2 GB in the preceding example! And it’s also much faster, both to generate the random matrix and to reduce dimensionality: about 50% faster in this case. Moreover, if the input is sparse, the transformation keeps it sparse (unless you set dense_output=True).
Finally, it enjoys the same distance-preserving property as the previous approach, and the quality of the dimensionality reduction is comparable. In short, it’s usually preferable to use this transformer instead of the first one, especially for large or sparse datasets.
The ratio $r$ of nonzero items in the sparse random matrix is called its density. By default, it is equal to $1/\sqrt{n}$. With 20,000 features, this means that roughly only one in every 141 cells in the random matrix is non-zero. Each cell in the sparse random matrix has a probability $r$ of being non-zero, and each non-zero value is either $-v$ or $+v$ (both equally likely), where $v = 1/\sqrt{d\,r}$.
If we want to perform the inverse transform, we first need to compute the pseudo-inverse of the components matrix using SciPy’s pinv() function, then multiply the reduced data by the transpose of the pseudo-inverse:
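A sketch of this, reusing the fitted gaussian_rnd_proj transformer and X_reduced from above (note that pinv() can be quite slow when the components matrix is large):

```python
from scipy.linalg import pinv

components_pinv = pinv(gaussian_rnd_proj.components_)
X_recovered = X_reduced @ components_pinv.T
```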
Random projection is a simple, fast, memory-efficient, and powerful dimensionality reduction algorithm that we should keep in mind, especially when dealing with high-dimensional datasets.
4.5 Locally Linear Embedding
Locally linear embedding (LLE) is a non-linear dimensionality reduction (NLDR) technique. It is a manifold learning technique that does not rely on projections, unlike PCA and random projection: it works by first measuring how each training instance linearly relates to its nearest neighbours, and then looking for a low-dimensional representation of the training set where these local relationships are best preserved.
This approach makes it good at unrolling twisted manifolds, especially when there is not too much noise. The following code
makes a Swiss roll, then uses sklearn
’s LocallyLinearEmbedding
class to unroll it:
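A sketch of what this code might look like (the dataset size, noise level, and number of neighbours are illustrative choices):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X_swiss)
```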
The variable t
is a 1D NumPy array containing the position of each instance along the rolled axis of the Swiss roll. We don’t use
it in this example, but it can be used as a target for a nonlinear regression task. The resulting 2D dataset is shown
in Figure 8-10. As you can see, the Swiss roll is completely unrolled, and the distances between instances are
locally well preserved. However, distances are not preserved on a larger scale: the unrolled Swiss roll should be a
rectangle, not this kind of stretched and twisted band. Nevertheless, LLE did a pretty good job of modeling the
manifold.
Operation Principle
For each training instance $x^{(i)}$, the algorithm identifies its $k$ nearest neighbours, then tries to reconstruct $x^{(i)}$ as a linear function of these neighbours. More specifically, it tries to find the weights $w_{i,j}$ such that the squared distance between $x^{(i)}$ and:

$$\sum_{j=1}^{m} w_{i,j}\, x^{(j)}$$

is as small as possible, assuming $w_{i,j} = 0$ if $x^{(j)}$ is not one of the $k$ nearest neighbours of $x^{(i)}$. Thus the first step of LLE is the constrained optimization problem:

$$\widehat{W} = \underset{W}{\operatorname{argmin}} \sum_{i=1}^{m} \Bigl\lVert x^{(i)} - \sum_{j=1}^{m} w_{i,j}\, x^{(j)} \Bigr\rVert^{2}
\quad \text{subject to} \quad
\begin{cases}
w_{i,j} = 0 & \text{if } x^{(j)} \text{ is not one of the } k \text{ nearest neighbours of } x^{(i)}\\
\sum_{j=1}^{m} w_{i,j} = 1 & \text{for } i = 1, 2, \dots, m
\end{cases}$$

where $W$ is the weight matrix containing all the weights $w_{i,j}$. The second constraint simply normalizes the weights for each training instance $x^{(i)}$.

After this step, the weight matrix $\widehat{W}$ encodes the local linear relationships between the training instances. The second step is to map the training instances into a $d$-dimensional space (where $d < n$) while preserving these local relationships as much as possible. If $z^{(i)}$ is the image of $x^{(i)}$ in this $d$-dimensional space, then we want the squared distance between $z^{(i)}$ and $\sum_{j=1}^{m} \hat{w}_{i,j}\, z^{(j)}$ to be as small as possible. This idea leads to the following unconstrained optimization problem:

$$\widehat{Z} = \underset{Z}{\operatorname{argmin}} \sum_{i=1}^{m} \Bigl\lVert z^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\, z^{(j)} \Bigr\rVert^{2}$$

It looks very similar to the first step, but instead of keeping the instances fixed and finding the optimal weights, we are doing the reverse: keeping the weights fixed and finding the optimal positions of the instances’ images in the low-dimensional space. Note that $Z$ is the matrix containing all $z^{(i)}$.
sklearn’s LLE implementation has the following computational complexity: $O(m \log(m)\, n \log(k))$ for finding the $k$ nearest neighbours, $O(m n k^{3})$ for optimizing the weights, and $O(d m^{2})$ for constructing the low-dimensional representations. Unfortunately, the $m^{2}$ in the last term makes this algorithm scale poorly to very large datasets. As you can see, LLE is quite different from the projection techniques, and it’s significantly more complex, but it can also construct much better low-dimensional representations, especially if the data is nonlinear.