4
Dimensionality Reduction
4.1 Introduction
Many ML problems involve thousands or even millions of features for each training instance. Not only do all these features make training extremely slow, but they can also make it much harder to find a good solution, as we will see in the remaining parts of this chapter. This problem is often referred to as the curse of dimensionality.
Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one.
For example, consider the MNIST images we worked with previously. The pixels on the image borders are almost always white, so we could completely drop these pixels from the training set without losing any information.
As we saw previously in Ensemble Learning, we confirmed that these pixels are unimportant for the classification task, as shown in Fig. 3.3. Additionally, two neighboring pixels are often highly correlated: if we merge them into a single pixel (e.g., by taking the mean of the two pixel intensities), we will not lose much information.
Information : Limits of Reduction
Reducing dimensionality does cause some information loss, similar to how compressing an image to JPEG can degrade its quality, so even though it will speed up training, it may make your system perform slightly worse. It also makes your pipelines a bit more complex and thus harder to maintain.
Therefore, it is in your best interest to first try to train your system with the original data before considering using dimensionality reduction.
In some cases, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance, but in general it won’t. It will just speed up training.
Apart from speeding up training, dimensionality reduction is also highly useful for data visualization.
In this chapter, we will first discuss the curse of dimensionality and get a sense of what goes on in high-dimensional space.
Following this, we will consider the two main approaches to dimensionality reduction:

- projection, and
- manifold learning,

and we will go through three of the most popular dimensionality reduction techniques:

1. PCA,
2. Random projection,
3. Locally Linear Embedding (LLE).
4.1.1 The Problems of Dimensions
We are so used to living in three dimensions that our intuition fails us when we try to imagine a high-dimensional space. Even a basic 4D hypercube is incredibly hard to picture in our minds, let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space.
The more dimensions the data has, the less geometrically explainable it becomes.
It turns out that many things behave very differently in high-dimensional space. For example, a point picked at random in a unit square (a 1 × 1 square) has only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be “extreme” along any dimension). However, if we do the same thing in a 10,000-dimensional unit hypercube, this probability is greater than 99.999999%. Most points in a high-dimensional hypercube are very close to the border, which is quite counterintuitive.
Let $n$ be the dimension of a unit cube in $n$-dimensional space. The cube has $2n$ boundary faces. For a point to be less than 0.001 from a boundary of the $n$-dimensional unit cube means that it does not lie inside the inner cube of side length $1 - 2 \times 0.001 = 0.998$ sharing the same center as the original cube, which happens with probability $1 - 0.998^{\,n}$.
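As a quick numerical sanity check, we can evaluate this probability for a few dimensions (a minimal sketch, using the formula above):

```python
# Probability that a uniformly random point in the unit n-cube lies
# within 0.001 of at least one boundary: 1 - 0.998**n
for n in (2, 3, 10_000):
    print(n, 1 - 0.998 ** n)
# 2      -> ~0.004      (about 0.4%)
# 3      -> ~0.006
# 10000  -> ~0.999999998 (greater than 99.999999%)
```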
Information : n-dimensional Geometry
When dimensions become detached from everyday experience, the mathematics and our intuition diverge more and more. Many problems in mathematics become far more complicated as the geometry moves to $n$ dimensions.
Here is another example. If we were to pick two points randomly in a unit square, the distance between them would be, on average, roughly 0.52; for two random points in a unit 3D cube, the average distance is roughly 0.66. But what about two points picked randomly in a 1,000,000-dimensional unit hypercube? The average distance, believe it or not, will be about 408.25. This is a problem known as Hypercube Line Picking.
Let two points $\mathbf{x}$ and $\mathbf{y}$ be picked randomly (uniformly) from a unit $n$-dimensional hypercube. The expected distance between the points $\Delta(n)$, i.e., the mean line segment length, is then:

$$\Delta(n) = \int_0^1 \cdots \int_0^1 \sqrt{\sum_{i=1}^{n}\left(x_i - y_i\right)^2}\; dx_1 \cdots dx_n\, dy_1 \cdots dy_n$$
The first few values of $\Delta(n)$ are given in the following table.

| n | OEIS | Δ(n) | n | OEIS | Δ(n) |
|---|------|------|---|------|------|
| 1 | – | 0.3333333333… | 5 | A103984 | 0.8785309152… |
| 2 | A091505 | 0.5214054331… | 6 | A103985 | 0.9689420830… |
| 3 | A073012 | 0.6617071822… | 7 | A103986 | 1.0515838734… |
| 4 | A103983 | 0.7776656535… | 8 | A103987 | 1.1281653402… |
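These values can be checked numerically; the following sketch estimates $\Delta(n)$ by Monte Carlo sampling:

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_distance(n, samples=100_000):
    """Monte Carlo estimate of the mean distance between two points
    drawn uniformly at random from the unit n-dimensional hypercube."""
    x = rng.random((samples, n))
    y = rng.random((samples, n))
    return np.linalg.norm(x - y, axis=1).mean()

print(mean_distance(2))   # ~0.521
print(mean_distance(3))   # ~0.662
print(mean_distance(8))   # ~1.128
```

For very large $n$, the mean distance approaches $\sqrt{n/6}$, which for $n = 1{,}000{,}000$ gives about 408.25.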
This is counterintuitive: how can two points be so far apart when they both lie within the same unit hypercube? Well, there’s just plenty of space in high dimensions. As a result, high-dimensional datasets are at risk of being very sparse:
most training instances are likely to be far away from each other.
This also means a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations.
The more dimensions the training set has, the greater the risk of overfitting it.
In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances. Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions.
With just 100 features, which is significantly fewer than in the MNIST problem, all ranging from 0 to 1, you would need more training instances than atoms in the observable universe in order for training instances to be within 0.1 of each other on average, assuming they were spread out uniformly across all dimensions.
4.2 Main Approaches to Dimensionality Reduction
Before we dive into different dimensionality reduction algorithms, let’s take a look at the two main approaches to reducing dimensionality:

1. projection, and
2. manifold learning.
4.2.1 Projection
In most real-world problems, training instances are NOT spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances lie within a much lower-dimensional subspace of the high-dimensional space. This sounds very abstract, so let’s look at an example.
In Fig. 4.0a we can see a 3D dataset represented by small spheres.
As you can see, almost all training instances lie close to a plane:
this is a lower-dimensional (2D) subspace of the higher-dimensional (3D) space.
If we were to project every training instance perpendicularly onto this subspace, we would obtain the new 2D dataset shown in Fig. 4.0b. We have just reduced the dataset’s dimensionality from 3D to 2D. Note the axes corresponding to the new features $z_1$ and $z_2$, which are the coordinates of the projections onto the plane.
4.2.2 Manifold Learning
It is worth mentioning that projection is not always the best approach to dimensionality reduction. In many cases the subspace may twist and turn, as in the famous Swiss roll toy dataset, available in sklearn, that is commonly used for testing and demonstrating nonlinear dimensionality reduction algorithms. It consists of a set of points in three dimensions, arranged in a rolled-up, spiral-like sheet. Simply projecting onto a plane (e.g., by dropping the third dimension) would squash different layers of the roll together; what we really want instead is to unroll it and obtain a flat 2D dataset.
The Swiss roll is an example of a 2D manifold. Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a $d$-dimensional manifold is a part of an $n$-dimensional space (where $d < n$) that locally resembles a $d$-dimensional hyperplane. In the case of the Swiss roll, $d = 2$ and $n = 3$: it locally resembles a 2D plane, but it is rolled in the third dimension.
Many dimensionality reduction algorithms work by modelling the manifold on which the training instances lie. This is called manifold learning. It relies on the manifold assumption, which holds that:
Most real-world high-dimensional datasets lie close to a much lower dimensional manifold.
This assumption is very often empirically observed.
Once again, let us look back at the MNIST dataset: all handwritten digit images have some similarities. They are made of connected lines, the borders are white, and they are more or less centered. If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits.
In other words, the degrees of freedom available to you if you try to create a digit image are dramatically lower than the degrees of freedom you have if you are allowed to generate any image you want.
These constraints tend to squeeze the dataset into a lower-dimensional manifold.
The manifold assumption is often accompanied by another implicit assumption:
The task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold.
For example, in Fig. 4.3c, the Swiss roll is split into two classes: in the unrolled 2D manifold space the decision boundary is much simpler than in the original 3D space. However, this implicit assumption does not always hold; for some datasets the decision boundary may actually be simpler in the original, higher-dimensional space. Reducing the dimensionality of your training set before training a model will usually speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.
We now have a good sense of what the curse of dimensionality is and how dimensionality reduction algorithms can fight it, especially when the manifold assumption holds. The rest of this chapter will go through some of the most popular algorithms for dimensionality reduction.
4.3 Principal Component Analysis (PCA)
PCA is by far the most popular dimensionality reduction algorithm, invented in 1901 by Karl Pearson.
4.3.1 Preserving the Variance
Before we can project the training set onto a lower-dimensional hyperplane, we first need to choose the right hyperplane. For example, Fig. 4.5 shows a simple 2D dataset on the left, along with three different candidate axes (i.e., 1D hyperplanes).
On the right is the result of the projection of the dataset onto each of these axes. As we can see:

- Top: the projection onto the solid line preserves the maximum variance;
- Bottom: the projection onto the dotted line preserves very little variance;
- Middle: the projection onto the dashed line preserves an intermediate amount of variance.
It seems reasonable to select the axis preserving the maximum amount of variance, as it will most likely lose less information than the other projections. Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the core idea behind PCA.
4.3.2 Principal Components
PCA identifies the axis that accounts for the largest amount of variance in the training set. In Fig. 4.5, it is the solid line.

It also finds a second axis, orthogonal to the first one, which accounts for the largest amount of the remaining variance. In this 2D example there is no choice: it is the dotted line. For a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on, as many axes as the number of dimensions in the dataset.

The $i$-th axis is called the $i$-th principal component (PC) of the data. In Fig. 4.5, the first PC is the axis on which vector $\mathbf{c}_1$ lies, and the second PC is the axis on which vector $\mathbf{c}_2$ lies.

In Fig. 4.0a the first two PCs are on the projection plane, and the third PC is the axis orthogonal to that plane. After the projection, in Fig. 4.0b, the first PC corresponds to the $z_1$ axis, and the second PC corresponds to the $z_2$ axis.
Information : The Idea Behind PCA
For each PC, PCA finds a zero-centered unit vector pointing in the direction of the PC. As two opposing unit vectors lie on the same axis, the direction of the unit vectors returned by PCA is not stable: if you perturb the training set slightly and run PCA again, the unit vectors may point in the opposite direction to the original vectors. However, they will generally still lie on the same axes. In some cases, a pair of unit vectors may even rotate or swap (if the variances along these two axes are very close), but the plane they define will generally remain the same.
So how can we find the principal components of a training set? Luckily, there is a standard matrix factorisation technique called Singular Value Decomposition (SVD), which decomposes the training set matrix $X$ into the matrix multiplication of three matrices, $U\,\Sigma\,V^{\top}$, where $V$ contains the unit vectors that define all the principal components we are looking for. The following Python code uses NumPy’s svd() function to obtain all the principal components of the 3D training set represented in Fig. 4.0a, then it extracts the two unit vectors that define the first two PCs:
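A minimal sketch of what this code might look like, assuming the 3D training set is stored in a NumPy array X:

```python
import numpy as np

# Center the data first (PCA assumes a zero-centered dataset)
X_centered = X - X.mean(axis=0)

# Full SVD: the rows of Vt are the unit vectors of the principal components
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt[0]   # first principal component
c2 = Vt[1]   # second principal component
```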
PCA assumes that the dataset is centered around the origin. As we will see, sklearn’s PCA classes take care of centering the data for us. If we were to implement PCA ourselves, or if we used other libraries, we shouldn’t forget to center the data first.
4.3.3 Downgrading Dimensions
Once we have identified all the PCs, we can reduce the dimensionality of the dataset down to $d$ dimensions by projecting it onto the hyperplane defined by the first $d$ principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible.
For example, in Fig. 4.0a the 3D dataset is projected down to the 2D plane defined by the first two principal components, preserving a large part of the dataset’s variance. To project the training set onto the hyperplane and obtain a reduced dataset $X_{d\text{-proj}}$ of dimensionality $d$, compute the matrix multiplication of the training set matrix $X$ by the matrix $W_d$, defined as the matrix containing the first $d$ columns of $V$:

$$X_{d\text{-proj}} = X\, W_d$$
The following Python code projects the training set onto the plane defined by the first two principal components:
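A sketch continuing the earlier SVD example (the names X_centered and Vt come from that sketch):

```python
W2 = Vt[:2].T              # first two principal components as columns
X2D = X_centered @ W2      # project the centered data onto the 2D plane
```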
We now know how to reduce the dimensionality of any dataset by projecting it down to any number of dimensions, while preserving as much variance as possible.
Using sklearn

The PCA class uses SVD to implement PCA, just like we did earlier. sklearn’s PCA also automatically takes care of centering the data. The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions:
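A sketch of this step, assuming the dataset is stored in X:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
```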
After fitting the PCA transformer to the dataset, its components_ attribute holds the transpose of $W_d$: it contains one row for each of the first $d$ principal components.
Explained Variance Ratio
Another useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ attribute. The ratio indicates the proportion of the dataset’s variance that lies along each principal component.
For example, let’s look at the explained variance ratios of the first two components of the 3D dataset represented in Fig. 4.0a:
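A sketch of this check, assuming pca is the 2-component transformer fitted above:

```python
# Proportion of the dataset's variance along each of the first two PCs
print(pca.explained_variance_ratio_)   # e.g. approximately [0.76, 0.15]
```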
This output tells us that about 76% of the dataset’s variance lies along the first PC, and about 15% lies along the second PC. This leaves about 9% for the third PC, so it is reasonable to assume that the third PC probably carries little information.
4.3.4 The Right Number of Dimensions
Instead of arbitrarily choosing the number of dimensions to reduce down to, it is simpler to choose the number of dimensions whose explained variance ratios add up to a sufficiently large portion of the variance, for example, 95%.
An exception to this rule, of course, is if you are reducing dimensionality for data visualization, in which case you will want to reduce the dimensionality down to 2 or 3.
Let’s load and split the MNIST dataset, perform PCA without reducing dimensionality, and then compute the minimum number of dimensions required to preserve 95% of the training set’s variance:
```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

mnist = fetch_openml('mnist_784', as_frame=False, parser="auto")
X_train, y_train = mnist.data[:60_000], mnist.target[:60_000]
X_test, y_test = mnist.data[60_000:], mnist.target[60_000:]

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1   # d equals 154
```
We could then set n_components=d
and run
PCA
again, but there’s a better option. Instead of specifying the number of principal
components you want to preserve, you can set n_components
to be a float between 0.0 and 1.0, indicating the ratio of variance you
wish to preserve:
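A sketch of this, assuming X_train is the MNIST training set loaded above:

```python
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
```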
The actual number of components is determined during training, and it is stored in the n_components_
attribute:
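Continuing the sketch above:

```python
print(pca.n_components_)   # 154 for MNIST when preserving 95% of the variance
```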
A different option is to plot the explained variance as a function of the number of dimensions which you can see in Fig. 4.6 . There will usually be an elbow in the curve, where the explained variance stops growing fast. In this case, we can see that reducing the dimensionality down to about 100 dimensions wouldn’t lose too much explained variance.
Finally, if we are using dimensionality reduction as a preprocessing step for a supervised learning task (e.g., classification), then we can tune the number of dimensions like any other hyperparameter.
For example, the following code example creates a two-step pipeline, first reducing dimensionality using
PCA
, then classifying
using a random forest. Next, it uses RandomizedSearchCV
to find a good combination of hyperparameters for both
PCA
and the
random forest classifier.
This example does a quick search, tuning only two hyperparameters, training on just 1,000 instances, and running for only 10 iterations to keep things fast:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

clf = make_pipeline(PCA(random_state=42),
                    RandomForestClassifier(random_state=42))
param_distrib = {
    "pca__n_components": np.arange(10, 80),
    "randomforestclassifier__n_estimators": np.arange(50, 500)
}
rnd_search = RandomizedSearchCV(clf, param_distrib, n_iter=10, cv=3,
                                random_state=42)
rnd_search.fit(X_train[:1000], y_train[:1000])
```
Let’s look at the best hyperparameters found:
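A sketch using the fitted search object from the previous block:

```python
print(rnd_search.best_params_)
```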
It’s interesting to see how low the optimal number of components is:
we reduced a 784-dimensional dataset to just 57 dimensions.
This is tied to the fact that we used a random forest, which is a pretty powerful model. If we used a linear model instead, such as
an SGDClassifier
, the search would find that we need to preserve more dimensions.
4.3.5 PCA for Compression
After dimensionality reduction, the training set takes up much less space. For example, after applying PCA to the MNIST dataset while preserving 95% of its variance, we are left with 154 features, instead of the original 784 features. So the dataset is now less than 20% of its original size, and we only lost 5% of its variance. This is a reasonable compression ratio, and it’s easy to see how such a size reduction would speed up a classification algorithm tremendously.
It is also possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection. This won’t give you back the original data, since the projection lost a bit of information (within the 5% variance that was dropped), but it will likely be close to the original data. The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error [18].
The inverse_transform()
method lets us decompress the reduced MNIST dataset back to 784 dimensions:
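A sketch of this round trip, assuming pca is the transformer fitted with n_components=0.95 above:

```python
X_reduced = pca.transform(X_train)              # 154 dimensions
X_recovered = pca.inverse_transform(X_reduced)  # back to 784 dimensions
```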
Fig. 4.7 shows a few digits from the original training set, seen on the left, and the corresponding digits after compression and decompression. You can see that there is a slight image quality loss, but the digits are still mostly intact. The equation for the inverse transformation is shown below.
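Using the same notation as the projection equation above, the inverse transformation is:

$$X_{\text{recovered}} = X_{d\text{-proj}}\, W_d^{\top}$$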
4.3.6 Randomized PCA
If you set the svd_solver hyperparameter to "randomized", sklearn uses a stochastic algorithm called randomised PCA that quickly finds an approximation of the first $d$ principal components. Its computational complexity is:

$$O(m d^{2}) + O(d^{3}) \qquad (4.1)$$

instead of $O(m n^{2}) + O(n^{3})$ for the full SVD approach; therefore it is dramatically faster than full SVD when $d$ is much smaller than $n$.
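A sketch of what such code might look like on the MNIST training set (the specific hyperparameter values are illustrative):

```python
rnd_pca = PCA(n_components=154, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train)
```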
By default, svd_solver is set to "auto": sklearn automatically uses the randomised PCA algorithm if $\max(m, n) > 500$ and n_components is an integer smaller than 80% of $\min(m, n)$, or else it uses the full SVD approach. So the preceding code would use the randomized PCA algorithm even if you removed the svd_solver="randomized" argument, as $154 < 0.8 \times 784$.
If we want to force sklearn to use full SVD for a slightly more precise result, we can set the svd_solver hyperparameter to "full".
4.3.7 Incremental PCA
A problem with the preceding implementations of PCA is that they require the whole training set to fit in memory in order for the algorithm to run. Fortunately, incremental PCA algorithms have been developed that allow you to split the training set into mini-batches and feed these in one mini-batch at a time. This is useful for large training sets and for applying PCA online. The following code splits the MNIST training set into 100 mini-batches using NumPy’s array_split() function and feeds them to sklearn’s IncrementalPCA class to reduce the dimensionality of the MNIST dataset down to 154 dimensions, just like before.
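A sketch of what this code might look like, reusing X_train from earlier:

```python
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X_train)
```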
We must call the partial_fit()
method with each mini-batch, rather than the fit()
method with the whole training set.
Alternatively, we can use NumPy’s memmap class, which allows us to manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs, when it needs it. To demonstrate this, let’s first create a memory-mapped (memmap) file and copy the MNIST training set to it, then call flush() to ensure that any data still in the cache gets saved to disk. In real life, X_train would typically not fit in memory, so you would load it chunk by chunk and save each chunk to the right part of the memmap array:
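A sketch of this step (the filename is an arbitrary choice):

```python
filename = "my_mnist.mmap"
X_mmap = np.memmap(filename, dtype="float32", mode="w+", shape=X_train.shape)
X_mmap[:] = X_train   # in real life this would be done chunk by chunk
X_mmap.flush()
```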
Next, we can load the memmap file and use it like a regular NumPy array. Let’s use the IncrementalPCA class to reduce its dimensionality. Since this algorithm uses only a small part of the array at any given time, memory usage remains under control. This makes it possible to call the usual fit() method instead of partial_fit(), which is quite convenient:
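A sketch of this step, reusing the filename and n_batches defined above:

```python
X_mmap = np.memmap(filename, dtype="float32", mode="r").reshape(-1, 784)

batch_size = X_mmap.shape[0] // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mmap)
```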
Only the raw binary data is saved to disk, so specify the data type and shape of the array when you load it.
If you omit the shape, np.memmap()
returns a 1D array.
For very high-dimensional datasets, PCA can be too slow. As we saw previously, even if you use randomized PCA its computational complexity is still $O(m d^{2}) + O(d^{3})$, so the target number of dimensions $d$ cannot be too large. If we are dealing with a dataset with tens of thousands of features or more, training may become much too slow: in this case, we should consider using random projection instead.
4.4 Random Projection
As its name suggests, the random projection algorithm projects the data to a lower-dimensional space using a random linear projection. This may sound counter intuitive, but turns out such a random projection is actually very likely to preserve distances fairly well, as was demonstrated mathematically by William B. Johnson and Joram Lindenstrauss in a famous lemma shown in an abridged form below [19].
For any $0 < \varepsilon < 1$, let $X$ be a set of $m$ points in $\mathbb{R}^{n}$, where $n$ is the number of dimensions of the original dataset. If we define a lower dimension $d$ such that:

$$d \;\geq\; \frac{4 \ln(m)}{\tfrac{1}{2}\varepsilon^{2} - \tfrac{1}{3}\varepsilon^{3}}$$

then there exists a linear mapping $f : \mathbb{R}^{n} \to \mathbb{R}^{d}$ such that for all $u, v \in X$:

$$(1 - \varepsilon)\,\lVert u - v\rVert^{2} \;\leq\; \lVert f(u) - f(v)\rVert^{2} \;\leq\; (1 + \varepsilon)\,\lVert u - v\rVert^{2}$$
The theorem states that, after fixing an error level $\varepsilon$, one can map a collection of points from one Euclidean space, no matter how high its dimension is, to a smaller Euclidean space while only changing the squared distance between any two points by a factor between $(1 - \varepsilon)$ and $(1 + \varepsilon)$. The dimension of the image space depends only on the error level and the number of points. Given that the original dimension is very large, one can achieve a significant dimensionality reduction.
So, two similar instances will remain similar after the projection, and two very different instances will remain very different.
As intuition goes, the more dimensions we drop, the more information is lost, and the more distances get distorted.
So how can we choose the optimal number of dimensions?
Well, the progenitors of the aforementioned lemma defined a method which determines the minimum number of dimensions to preserve in order to ensure, with high probability, that distances won’t change by more than a given tolerance.
For example, if we have a dataset containing $m = 5{,}000$ instances with $n = 20{,}000$ features each, and we don’t want the squared distance between any two instances to change by more than $\varepsilon$ = 10%, then we should project the data down to $d$ dimensions, with:

$$d \;\geq\; \frac{4 \ln(5000)}{\tfrac{1}{2}\times 0.1^{2} - \tfrac{1}{3}\times 0.1^{3}} \;\approx\; 7{,}300$$

which is a significant dimensionality reduction. Please observe that the equation does not use $n$; it only relies on $m$ and $\varepsilon$. This equation is implemented by the johnson_lindenstrauss_min_dim() function:
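A sketch of this computation using sklearn’s helper (the variable names are illustrative):

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

m, eps = 5_000, 0.1
d = johnson_lindenstrauss_min_dim(m, eps=eps)
print(d)   # 7300
```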
Now we can just generate a random matrix $P$ of shape $[d, n]$, where each item is sampled randomly from a Gaussian distribution with mean 0 and variance $1/d$, and use it to project a dataset from $n$ dimensions down to $d$ dimensions:
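A sketch continuing from the previous snippet (the dataset here is randomly generated just for illustration):

```python
n = 20_000
rng = np.random.default_rng(42)
P = rng.standard_normal((d, n)) / np.sqrt(d)   # std dev = sqrt(1/d)

X = rng.standard_normal((m, n))                # a fake dataset, for illustration
X_reduced = X @ P.T                            # project from n down to d dimensions
```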
Simple and efficient, and no training is required. The only thing the algorithm needs to create the random matrix is the dataset’s shape. The data itself is not used at all.
For implementation, sklearn offers a GaussianRandomProjection class. When we call its fit() method, it uses johnson_lindenstrauss_min_dim() to determine the output dimensionality, then it generates a random matrix, which it stores in the components_ attribute. Then when we call transform(), it uses this matrix to perform the projection. When creating the transformer, we can set eps if we want to tweak $\varepsilon$ (it defaults to 0.1), or n_components if we want to force a specific target dimensionality $d$. The following code example gives the same result as the preceding code, which we can also use to verify that gaussian_rnd_proj.components_ is equal to the matrix $P$:
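A sketch of the API usage (note that components_ will match the matrix P from the earlier sketch only if both random matrices are generated with the same scheme and seed, which is not guaranteed here):

```python
from sklearn.random_projection import GaussianRandomProjection

gaussian_rnd_proj = GaussianRandomProjection(eps=eps, random_state=42)
X_reduced = gaussian_rnd_proj.fit_transform(X)   # same shape as the manual projection
```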
sklearn also provides a second random projection transformer, known as SparseRandomProjection. It determines the target dimensionality in the same way, generates a random matrix of the same shape, and performs the projection identically. The main difference is that the random matrix is sparse. This means it uses much less memory: about 25 MB instead of almost 1.2 GB in the preceding example! And it’s also much faster, both to generate the random matrix and to reduce dimensionality: about 50% faster in this case. Moreover, if the input is sparse, the transformation keeps it sparse (unless you set dense_output=True).
Finally, it enjoys the same distance-preserving property as the previous approach, and the quality of the dimensionality reduction is comparable. In short, it’s usually preferable to use this transformer instead of the first one, especially for large or sparse datasets.
The ratio $r$ of nonzero items in the sparse random matrix is called its density. By default, it is equal to $1/\sqrt{n}$. With 20,000 features, this means that roughly only one in every 141 cells in the random matrix is non-zero. Each cell in the sparse random matrix has a probability $r$ of being non-zero, and each non-zero value is either $-v$ or $+v$ (both equally likely), where $v = 1/\sqrt{d\,r}$.
If we want to perform the inverse transform, we first need to compute the pseudo-inverse of the components matrix using SciPy’s pinv() function, then multiply the reduced data by the transpose of the pseudo-inverse:
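A sketch of this, reusing the fitted gaussian_rnd_proj transformer and X_reduced from above (note that pinv() can be quite slow when the components matrix is large):

```python
from scipy.linalg import pinv

components_pinv = pinv(gaussian_rnd_proj.components_)
X_recovered = X_reduced @ components_pinv.T
```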
Random projection is a simple, fast, memory-efficient, and powerful dimensionality reduction algorithm that we should keep in mind, especially when dealing with high-dimensional datasets.
4.5 Locally Linear Embedding
Locally linear embedding (LLE) is a non-linear dimensionality reduction (NLDR) technique. It is a manifold learning technique that does not rely on projections, unlike PCA and random projection: it works by first measuring how each training instance linearly relates to its nearest neighbours, and then looking for a low-dimensional representation of the training set where these local relationships are best preserved.
This approach makes it good at unrolling twisted manifolds, especially when there is not too much noise. The following code
makes a Swiss roll, then uses sklearn
’s LocallyLinearEmbedding
class to unroll it:
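A sketch of what this code might look like (the dataset size, noise level, and number of neighbours are illustrative choices):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X_swiss)
```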
The variable t
is a 1D NumPy array containing the position of each instance along the rolled axis of the Swiss roll. We don’t use
it in this example, but it can be used as a target for a nonlinear regression task. The resulting 2D dataset is shown
in Figure 8-10. As you can see, the Swiss roll is completely unrolled, and the distances between instances are
locally well preserved. However, distances are not preserved on a larger scale: the unrolled Swiss roll should be a
rectangle, not this kind of stretched and twisted band. Nevertheless, LLE did a pretty good job of modeling the
manifold.
Operation Principle
For each training instance $x^{(i)}$, the algorithm identifies its $k$ nearest neighbours, then tries to reconstruct $x^{(i)}$ as a linear function of these neighbours. More specifically, it tries to find the weights $w_{i,j}$ such that the squared distance between $x^{(i)}$ and:

$$\sum_{j=1}^{m} w_{i,j}\, x^{(j)}$$

is as small as possible, assuming $w_{i,j} = 0$ if $x^{(j)}$ is not one of the $k$ nearest neighbours of $x^{(i)}$. Thus the first step of LLE is the constrained optimization problem:

$$\widehat{W} = \underset{W}{\operatorname{argmin}} \sum_{i=1}^{m} \Bigl\lVert x^{(i)} - \sum_{j=1}^{m} w_{i,j}\, x^{(j)} \Bigr\rVert^{2}
\quad \text{subject to} \quad
\begin{cases}
w_{i,j} = 0 & \text{if } x^{(j)} \text{ is not one of the } k \text{ nearest neighbours of } x^{(i)}\\
\sum_{j=1}^{m} w_{i,j} = 1 & \text{for } i = 1, 2, \dots, m
\end{cases}$$

where $W$ is the weight matrix containing all the weights $w_{i,j}$. The second constraint simply normalizes the weights for each training instance $x^{(i)}$.

After this step, the weight matrix $\widehat{W}$ encodes the local linear relationships between the training instances. The second step is to map the training instances into a $d$-dimensional space (where $d < n$) while preserving these local relationships as much as possible. If $z^{(i)}$ is the image of $x^{(i)}$ in this $d$-dimensional space, then we want the squared distance between $z^{(i)}$ and $\sum_{j=1}^{m} \hat{w}_{i,j}\, z^{(j)}$ to be as small as possible. This idea leads to the following unconstrained optimization problem:

$$\widehat{Z} = \underset{Z}{\operatorname{argmin}} \sum_{i=1}^{m} \Bigl\lVert z^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\, z^{(j)} \Bigr\rVert^{2}$$

It looks very similar to the first step, but instead of keeping the instances fixed and finding the optimal weights, we are doing the reverse: keeping the weights fixed and finding the optimal positions of the instances’ images in the low-dimensional space. Note that $Z$ is the matrix containing all $z^{(i)}$.
sklearn’s LLE implementation has the following computational complexity: $O(m \log(m)\, n \log(k))$ for finding the $k$ nearest neighbours, $O(m n k^{3})$ for optimizing the weights, and $O(d m^{2})$ for constructing the low-dimensional representations. Unfortunately, the $m^{2}$ in the last term makes this algorithm scale poorly to very large datasets. As you can see, LLE is quite different from the projection techniques, and it’s significantly more complex, but it can also construct much better low-dimensional representations, especially if the data is nonlinear.