1 Support Vector Machines

1.1 Introduction

One of the important tasks Machine Learning (ML) has to accomplish is to classify data. A useful method for this task is the Support Vector Machine (SVM).

SVMs are powerful ML models capable of performing linear or non-linear classification as well as regression. They are supervised learning methods which build on statistical learning theory, 1 1 A framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on data, and has led to successful applications in fields such as computer vision, speech recognition, and bio-informatics. and can be used for pattern recognition and regression. Statistical learning theory can also identify precisely the factors which need to be taken into account for certain simple types of algorithms to learn successfully; however, real-world applications usually need more complex models and algorithms (such as neural networks), which makes them much harder to analyse theoretically.

SVMs can be seen as lying at the intersection of learning theory and practice. They construct models that are complex enough (containing a large class of neural networks, for instance) yet simple enough to be analysed mathematically.

It is even possible to implement novelty detection using SVMs.

SVMs are most suitable for small to medium-sized non-linear datasets, 2 2 i.e., hundreds to thousands of instances. especially for classification tasks.

1.2 Linear Support Vector Machine Classification

As with most engineering and abstract concepts, the idea behind SVMs is best explained using programming and some visuals. Fig. 1.1 shows part of the iris dataset 3 3 The Iris flower data set or Fisher’s Iris data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. which was briefly discussed in Data Science I.


Figure 1.1: Large margin classification of the iris dataset.

The two (2) classes can easily and clearly be separated with a straight line.

This means the data is linearly separable .

The left plot shows the decision boundaries of three (3) possible linear classifiers. The model whose decision boundary is represented by the dashed line ( - - ) is so bad it does not even separate the classes properly.

The other two (2) models work perfectly on this training set, but their decision boundaries come so close to the instances that these models will probably NOT perform as well on new instances.

In contrast, the black solid line in the plot on the right represents the decision boundary of an SVM classifier . This line not only separates the two classes but also stays as far away from the closest training instances as possible. Think of an SVM classifier as fitting the widest possible street, which in the plot is represented by the parallel dashed lines, between the classes.

This is called large margin classification .

Information : Margin Classifier

A classifier which is able to give an associated distance from the decision boundary for each example. For instance, if a linear classifier is used, the distance of an example from the separating hyperplane is the margin of that example.

Please observe that adding more training instances off the street will NOT affect the decision boundary at all, as it is fully determined 4 4 or supported. by the instances located on the edge of the street. These instances are called the support vectors.

SVMs are sensitive to feature scales, as you can see in Fig. 1.2. In the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal. After feature scaling (e.g., using scikit-learn’s StandardScaler), the decision boundary in the right plot looks much better.


Figure 1.2: Sensitivity to feature scales.
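As a minimal sketch (not the code used to produce Fig. 1.2), the two situations can be reproduced by fitting the same linear SVM with and without scaling; the small dataset below is made up purely for illustration:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data whose two features live on very different scales.
X_toy = np.array([[1.0, 50.0], [5.0, 20.0], [3.0, 80.0], [5.0, 60.0]])
y_toy = np.array([0, 0, 1, 1])

# Unscaled: the feature with the larger scale dominates the margin.
unscaled_svc = SVC(kernel="linear", C=100).fit(X_toy, y_toy)

# Scaled: StandardScaler puts both features on a comparable scale first.
scaled_svc = make_pipeline(StandardScaler(),
                           SVC(kernel="linear", C=100)).fit(X_toy, y_toy)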

1.2.1 Soft Margin Classification

If we impose that all instances must be off the street and on the correct side, this is called hard margin classification. There are two (2) major issues with hard margin classification:

Sensitivity to Outliers

Hard margin SVM is highly sensitive to outliers or noisy data points. Even a single mislabeled point can significantly affect the position of the decision boundary and lead to poor generalization on unseen data.

Not Suitable for Non-linear Data

When the data is not linearly separable, hard margin SVM fails to find a valid solution, rendering it impractical for many real-world datasets.

Information : Outlier

A data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error.

Fig. 1.3 shows the iris dataset with just one (1) additional outlier. On the left, it is impossible to find a hard margin whereas on the right, the decision boundary ends up very different from the one we saw in Fig. 1.1 without the outlier, and the model will probably not generalize as well.


Figure 1.3: Hard margin sensitivity to outliers.

To avoid the aforementioned issues, we need to use a more flexible model. The idea is to find a good balance between keeping the street as large as possible and limiting the margin violations. 5 5 i.e., instances that end up in the middle of the street or even on the wrong side.

This is called soft margin classification .

As mentioned, soft margin SVM introduces flexibility by allowing some margin violations (i.e., misclassifications) to handle cases where the data is NOT perfectly separable.

This is suitable for scenarios where the data may contain noise or outliers.

It introduces a penalty term for misclassifications, allowing for a trade-off between a wider margin and a few misclassifications.

Soft margin SVM allows for some margin violations, meaning that it permits certain data points to fall within the margin or even on the wrong side of the decision boundary. This adaptability is managed by a hyperparameter called C, also known as the “regularisation parameter”, which controls the trade-off between making the margin as wide as possible and limiting the number of misclassifications.

When creating an SVM model using scikit-learn, you can specify several hyperparameters, including the regularisation hyperparameter C.

Information : Hyperparameter C

Regularisation parameter. The strength of the regularisation is inversely proportional to C, which must be strictly positive. The penalty is a squared ℓ2 penalty.

If we set C to a low value, we end up with the model on the left of Fig. 1.4; if we set a high value, we get the model on the right. As can be seen, reducing C makes the street larger, but it also leads to more margin violations.

In other words, reducing C results in more instances supporting the street, so there’s less risk of over-fitting. But if you reduce it too much, then the model ends up underfitting, as seems to be the case here:

The model with C=100 looks like it will generalise better than the one with C=1.


Figure 1.4: Large margin (left) v. fewer margin violations (right).

If your SVM model is overfitting, you can try regularising it by reducing C.

The following scikit-learn code loads the iris dataset and trains a linear SVM classifier to detect Iris virginica flowers. The pipeline first scales the features, then uses a LinearSVC with C=1 :

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2)  # Iris virginica

svm_clf = make_pipeline(StandardScaler(),
                        LinearSVC(C=1, random_state=42))
svm_clf.fit(X, y)

The resulting model is given on the left in Fig. 1.4 . Then, we can use the model to make predictions:

X_new = [[5.5, 1.7], [5.0, 1.5]]
print(svm_clf.predict(X_new))
[ True False]

The first plant is classified as an Iris virginica, while the second is NOT. Let us look at the scores that the SVM used to make these predictions. These measure the signed distance between each instance and the decision boundary:

print(svm_clf.decision_function(X_new))
[ 0.66163816 -0.22035761]

Unlike the LogisticRegression class, LinearSVC doesn’t have a predict_proba() method to estimate the class probabilities. That said, if you use the SVC class instead of LinearSVC, and if you set its probability hyperparameter to True, then the model will fit an extra model at the end of training to map the SVM decision function scores to estimated probabilities.

Under the hood, this requires using 5-fold cross-validation to create out-of-sample predictions for every instance in the training set, then training a LogisticRegression model, which will slow down training. After that, the predict_proba() and predict_log_proba() methods will be available.
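As a hedged sketch (reusing the iris features and X_new from above, and assuming a linear kernel simply to stay close to the previous classifier), this is how probability estimates could be obtained:

from sklearn.svm import SVC

# probability=True makes SVC fit an extra calibration model via internal
# 5-fold cross-validation, which slows training but enables predict_proba().
prob_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", C=1, probability=True, random_state=42)
)
prob_svm_clf.fit(X, y)

print(prob_svm_clf.predict_proba(X_new))      # estimated class probabilities
print(prob_svm_clf.predict_log_proba(X_new))  # their logarithms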

1.3 Nonlinear Support Vector Machine Classification

Although linear SVM classifiers are efficient and often work surprisingly well, many datasets are not even close to being linearly separable. One approach to handling nonlinear datasets is to add more features, such as polynomial features.

In some cases this can result in a linearly separable dataset. Consider the LHS plot in Fig. 1.5, which represents a simple dataset with just one feature, x₁.

As we can see, this dataset is NOT linearly separable. But if we add a second feature x₂ = x₁², the resulting 2D dataset is perfectly linearly separable.


Figure 1.5: Adding features to make a dataset linearly separable.
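As a minimal sketch of this idea (with a made-up 1D dataset in the spirit of Fig. 1.5, not its actual data), the squared feature can be added by hand before fitting a linear classifier:

import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical 1D data: the middle instances form one class, the outer ones
# the other, so no single threshold on x1 separates them.
x1 = np.array([-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
y_1d = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])

# Adding x2 = x1**2 lifts the data into 2D, where a straight line separates the classes.
X_2d = np.column_stack([x1, x1 ** 2])
lin_clf = LinearSVC(C=10, random_state=42).fit(X_2d, y_1d)
print(lin_clf.score(X_2d, y_1d))  # should report 1.0 on this toy set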

To implement this idea using scikit-learn, we can create a pipeline containing a PolynomialFeatures transformer, followed by a StandardScaler and a LinearSVC classifier.

Information : Pipeline

A series of interconnected data processing and modeling steps for streamlining the process of working with ML models.

Let’s test this on the moons dataset , a toy dataset for binary classification in which the data points are shaped as two (2) interleaving crescent moons.

We can generate this dataset using the make_moons() function, shown in Fig. 1.6 .

from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=10, max_iter=10_000, random_state=42)
)
polynomial_svm_clf.fit(X, y)


Figure 1.6: Linear support vector machine classifier using polynomial features.

1.3.1 Polynomial Kernel

Adding polynomial features is simple to implement and can work great with all sorts of ML algorithms, not just SVMs. That said, at a low polynomial degree this method cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow.

Fortunately, when using SVMs you can apply a mathematical technique called the kernel trick. The kernel trick makes it possible to get the same result as if you had added many polynomial features, even with a very high degree, without actually having to add them.

This means there’s no combinatorial explosion of the number of features.

This trick is implemented by the SVC class. Let’s test it on the moons dataset:

from sklearn.svm import SVC

poly_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=3, coef0=1, C=5)
)
poly_kernel_svm_clf.fit(X, y)

This code trains an SVM classifier using a 3rd-degree polynomial kernel, represented on the left in Fig. 1.7. On the right is another SVM classifier using a 10th-degree polynomial kernel. Obviously, if your model is overfitting, you might want to reduce the polynomial degree. Conversely, if it is underfitting, you can try increasing it. The hyperparameter coef0 controls how much the model is influenced by high-degree terms versus low-degree terms.


Figure 1.7: SVM classifiers with a polynomial kernel.
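For comparison, the higher-degree model on the right of Fig. 1.7 can be trained the same way; the hyperparameters below (in particular coef0=100) are assumed values for illustration rather than the exact settings behind the figure:

# Sketch of a 10th-degree polynomial kernel; a large coef0 gives more weight
# to the high-degree terms.
poly100_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=10, coef0=100, C=5)
)
poly100_kernel_svm_clf.fit(X, y)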

Although hyperparameters will generally be tuned automatically (e.g., using randomised search), it’s good to have a sense of what each hyperparameter actually does and how it may interact with other hyperparameters: this way, you can narrow the search to a much smaller space.
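As a hedged example of such automatic tuning, a randomised search over the polynomial kernel’s main hyperparameters might look like this (the parameter ranges are illustrative, not prescriptive):

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space; the step is named "svc" because make_pipeline
# lowercases the class name of each step.
param_distributions = {
    "svc__C": loguniform(0.001, 1000),
    "svc__gamma": loguniform(0.001, 10),
    "svc__degree": [2, 3, 4, 5],
    "svc__coef0": [0, 1, 10, 100],
}
rnd_search = RandomizedSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="poly")),
    param_distributions,
    n_iter=20, cv=3, random_state=42
)
rnd_search.fit(X, y)
print(rnd_search.best_params_)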

1.3.2 The Kernel Trick

We have seen how higher dimensional transformations 6 6 i.e., our previous example. can allow us to separate data in order to make classification predictions. It seems that in order to train a support vector classifier and optimize our objective function, we would have to perform operations with the higher dimensional vectors in the transformed feature space.

In real applications, there might be many features in the data and applying transformations that involve many polynomial combinations of these features will lead to extremely high and impractical computational costs.

The kernel trick provides a solution to this problem. The trick is that kernel methods represent the data only through a set of pairwise similarity comparisons between the original data observations x, 7 7 with the original coordinates in the lower dimensional space. instead of explicitly applying the transformations ϕ(x) and representing the data by these transformed coordinates in the higher dimensional feature space.

In kernel methods, the data set X is represented by an n × n kernel matrix of pairwise similarity comparisons, where the entries (i, j) are defined by the kernel function k(xᵢ, xⱼ).

This kernel function has a special mathematical property. The kernel function acts as a modified dot product.

Theory 1.2: The Kernel Function

A kernel is defined as a function which takes as its inputs vectors in the original space and returns the dot product of the vectors in the feature space.

Formally, if we have data x, z ∈ 𝒳 and a mapping ϕ : 𝒳 → ℝᴺ, then

k(x, z) = ⟨ϕ(x), ϕ(z)⟩

is a kernel function.

Our kernel function accepts inputs in the original lower dimensional space and returns the dot product of the transformed vectors in the higher dimensional space. There are also theorems which guarantee the existence of such kernel functions under certain conditions.

The ultimate benefit of the kernel trick is that the objective function we are optimizing to fit the higher dimensional decision boundary only includes the dot product of the transformed feature vectors. Therefore, we can just substitute these dot product terms with the kernel function, and we don’t even use ϕ ( x ) .
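A small numeric check (not from the original text) makes this concrete: for the second-degree polynomial kernel k(x, z) = (x · z)², one explicit feature map for 2D inputs is ϕ(x) = (x₁², √2·x₁x₂, x₂²), and the kernel returns exactly the dot product of the transformed vectors without ever constructing them:

import numpy as np

def phi(v):
    # Explicit 2nd-degree feature map for a 2D vector (one standard choice).
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly_kernel(x, z):
    # Degree-2 polynomial kernel, computed entirely in the original space.
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(phi(x) @ phi(z))    # dot product in the transformed space: 121.0
print(poly_kernel(x, z))  # same value, no transformation needed: 121.0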

1.3.3 Similarity Features

Another technique to tackle non-linear problems is to add features computed using a similarity function, which measures how much each instance resembles a particular landmark.


Figure 1.8: Similarity features using the Gaussian RBF.

For example, let’s take the 1D dataset from earlier and add two landmarks to it at x₁ = −2 and x₁ = 1 (see the left plot in Fig. 1.8). Next, we’ll define the similarity function to be the Gaussian Radial Basis Function (RBF) with γ = 0.3. As it is a Gaussian, it is a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at the landmark):

ϕ_γ(x, ℓ) = exp(−γ ‖x − ℓ‖²)

Now we are ready to compute the new features. For example, let’s look at the instance x₁ = −1. It is located at a distance of 1 from the first landmark and 2 from the second landmark. Therefore, its new features are:

x₂ = exp(−0.3 × 1²) ≈ 0.74 and x₃ = exp(−0.3 × 2²) ≈ 0.30

The plot on the right in Fig. 1.8 shows the transformed dataset (dropping the original features). As you can see, it is now linearly separable.
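A short sketch (assuming the landmarks and γ given above) reproduces these feature values:

import numpy as np

gamma = 0.3
landmarks = np.array([-2.0, 1.0])

def rbf_features(x1):
    # Gaussian RBF similarity of a 1D instance to each landmark.
    return np.exp(-gamma * (x1 - landmarks) ** 2)

print(rbf_features(-1.0))  # approximately [0.74, 0.30] for the instance x1 = -1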

You may wonder how to select the landmarks. The simplest approach is to create a landmark at the location of each and every instance in the dataset. Doing that creates many dimensions, which increases the chances that the transformed training set will be linearly separable.

The downside is that a training set with m instances and n features gets transformed into a training set with m instances and m features. 8 8 assuming you drop the original features.

A very large training set thus ends up with an equally large number of features.
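As a hedged sketch, using every training instance as a landmark can be done with scikit-learn’s rbf_kernel helper, which turns an (m, n) feature matrix into an (m, m) matrix of similarity features:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical small training set: m = 5 instances, n = 2 features.
rng = np.random.default_rng(42)
X_train = rng.random((5, 2))

# Every instance serves as a landmark, so the result has shape (m, m).
X_similarity = rbf_kernel(X_train, X_train, gamma=0.3)
print(X_similarity.shape)  # (5, 5)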

1.3.4 Gaussian RBF Kernel

Just like the polynomial features method, the similarity features method can be useful with any ML algorithm, but it may be computationally expensive to compute all the additional features. 9 9 especially on large training sets. Once again, the kernel trick can be used here, making it possible to obtain a similar result as if you had added many similarity features, but without actually doing so. Let’s try the SVC class with the Gaussian RBF kernel:

rbf_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", gamma=5, C=0.001)
)
rbf_kernel_svm_clf.fit(X, y)

This model is represented at the bottom left in Fig. 1.9 . The other plots show models trained with different values of hyperparameters gamma ( γ ) and C.

Increasing gamma makes the bell-shaped curve narrower (see the LHS plots in Figure 1.9 ). As a result, each instance’s range of influence is smaller . The decision boundary ends up being more irregular, wiggling around individual instances. Conversely, a small gamma value makes the bell-shaped curve wider: instances have a larger range of influence, and the decision boundary ends up smoother.

Therefore γ acts like a regularisation hyperparameter : if your model is overfitting, you should reduce γ ; if it is underfitting, you should increase γ (similar to the C hyperparameter).


Figure 1.9: SVM classifiers using an RBF kernel.
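The four models in Fig. 1.9 can be reproduced along these lines; the exact (gamma, C) pairs behind the figure are assumed here to be one small and one large value of each hyperparameter:

# Sketch: one RBF-kernel SVM per (gamma, C) combination, using assumed values.
rbf_models = []
for gamma in (0.1, 5):
    for C in (0.001, 1000):
        clf = make_pipeline(StandardScaler(),
                            SVC(kernel="rbf", gamma=gamma, C=C))
        clf.fit(X, y)
        rbf_models.append(clf)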

Other kernels exist but are used much more rarely. Some kernels are specialized for specific data structures. String kernels are sometimes used when classifying text documents or DNA sequences. 10 10 e.g., using the string subsequence kernel or kernels based on the Levenshtein distance.

You need to choose the right kernel for the job. As a rule of thumb, you should always try the linear kernel first. The LinearSVC class is much faster than SVC(kernel="linear"), especially if the training set is very large. If it is not too large, you should also try kernelised SVMs, starting with the Gaussian RBF kernel; it often works really well. Then, if you have spare time and computing power, you can experiment with a few other kernels using hyperparameter search. If there are kernels specialised for your training set’s data structure, make sure to give them a try too.

1.4 Regression

To use SVMs for regression instead of classification, the main idea is to reverse the objective: instead of trying to fit the largest possible street between two (2) classes while limiting margin violations, SVM regression tries to fit as many instances as possible on the street while limiting margin violations. 11 11 i.e., instances off the street.

The width of the street is controlled by a hyperparameter, 𝜖 . Fig. 1.10 shows two (2) linear SVM regression models trained on some linear data, one with a small margin ( 𝜖 = 0.5) and the other with a larger margin ( 𝜖 = 1.2).


Figure 1.10: SVM regression.

Reducing 𝜖 increases the number of support vectors, which regularises the model. Moreover, if we were to add more training instances within the margin, it would not affect the model’s predictions.

Therefore, the model is said to be 𝜖 -insensitive.
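As a hedged illustration of this, the kernelised SVR class (introduced below) exposes its support vectors, so the effect of 𝜖 on their number can be checked directly; LinearSVR does not expose them, which is the only reason SVR is used in this sketch:

import numpy as np
from sklearn.svm import SVR

# Generate some noisy linear data, similar in spirit to Fig. 1.10.
np.random.seed(42)
X_lin = 2 * np.random.rand(50, 1)
y_lin = 4 + 3 * X_lin[:, 0] + np.random.randn(50)

for eps in (0.5, 1.2):
    svr = make_pipeline(StandardScaler(), SVR(kernel="linear", epsilon=eps))
    svr.fit(X_lin, y_lin)
    # Instances on or outside the epsilon-tube become support vectors,
    # so a smaller epsilon generally yields more of them.
    print(f"epsilon={eps}: {len(svr[-1].support_)} support vectors")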

You can use scikit-learn ’s LinearSVR class to perform linear SVM regression. The following code produces the model represented on the left in Fig. 1.10 .

import numpy as np
from sklearn.svm import LinearSVR

np.random.seed(42)
X = 2 * np.random.rand(50, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(50)

svm_reg = make_pipeline(
    StandardScaler(),
    LinearSVR(epsilon=0.5, dual=True, random_state=42)
)
svm_reg.fit(X, y)

To tackle non-linear regression tasks, you can use a kernelized SVM model. Fig. 1.11 shows SVM regression on a random quadratic training set, using a second-degree polynomial kernel. There is some regularisation in the left plot (i.e., a small C value), and much less in the right plot (i.e., a large C value).


Figure 1.11: SVM regression using a second-degree polynomial kernel.

The following code uses scikit-learn’s SVR class 12 12 which supports the kernel trick. to produce the model represented on the left in Fig. 1.11:

import numpy as np
from sklearn.svm import SVR

# extra code - these 3 lines generate a simple quadratic dataset
np.random.seed(42)
X = 2 * np.random.rand(50, 1) - 1
y = 0.2 + 0.1 * X[:, 0] + 0.5 * X[:, 0] ** 2 + np.random.randn(50) / 10

svm_poly_reg = make_pipeline(
    StandardScaler(),
    SVR(kernel="poly", degree=2, C=0.01, epsilon=0.1)
)
svm_poly_reg.fit(X, y)

The SVR class is the regression equivalent of the SVC class, and the LinearSVR class is the regression equivalent of the LinearSVC class. The LinearSVR class scales linearly with the size of the training set (just like the LinearSVC class), while the SVR class gets much too slow when the training set grows very large (just like the SVC class).