1 Support Vector Machines
1.1 Introduction
One of the important tasks a Machine Learning (ML) system has to accomplish is to classify data. A powerful ML model, capable of performing both linear and nonlinear classification as well as regression, is the Support Vector Machine (SVM).
SVMs can be seen as lying at the intersection of learning theory and practice. They construct models that are complex enough (containing a large class of neural networks, for instance) and yet simple enough to be analysed mathematically.
It is even possible to implement novelty detection using SVMs.
1.2 Linear Support Vector Machine Classification
As with most engineering and abstract concepts, the idea behind SVMs is best explained using programming and some visuals. Fig. 1.1 shows part of the iris dataset. The two classes can clearly be separated with a straight line; this means the data is linearly separable.
The left plot shows the decision boundaries of three possible linear classifiers. One of them fails to separate the classes properly; the other two separate this training set perfectly, but their decision boundaries come so close to the instances that they will probably not perform as well on new data.
In contrast, the black solid line in the plot on the right represents the decision boundary of an SVM classifier. This line not only separates the two classes but also stays as far away from the closest training instances as possible. Think of an SVM classifier as fitting the widest possible street, which in the plot is represented by the parallel dashed lines, between the classes.
This is called large margin classification.
Information : Margin Classifier
A classifier which is able to give an associated distance from the decision boundary for each example. For instance, if a linear classifier is used, the distance of an example from the separating hyperplane is the margin of that example.
Please observe that adding more training instances off the street will NOT affect the decision boundary at all, as the boundary is fully determined (or "supported") by the instances located on the edge of the street. These instances are called the support vectors.
SVMs are sensitive to the feature scales, as you can see in Fig. 1.2. In the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal. After feature scaling (e.g., using scikit-learn's StandardScaler), the decision boundary in the right plot looks much better.
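A minimal sketch of this preprocessing step is shown below; the tiny toy dataset is purely illustrative and only serves to show where StandardScaler fits in:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# illustrative toy data: the second feature has a much larger scale than the first
X = np.array([[1.0, 50.0], [2.0, 800.0], [3.0, 200.0], [4.0, 1000.0]])
y = np.array([0, 0, 1, 1])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each feature now has zero mean and unit variance

svm_clf = LinearSVC(C=1, random_state=42)
svm_clf.fit(X_scaled, y)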
1.2.1 Soft Margin Classification
If we impose that all instances must be off the street and on the correct side, this is called hard margin classification. There are two main issues with hard margin classification:
- Sensitivity to Outliers: Hard margin SVM is highly sensitive to outliers or noisy data points. Even a single mislabeled point can significantly affect the position of the decision boundary and lead to poor generalization on unseen data.
- Not Suitable for Non-linear Data: When the data is not linearly separable, hard margin SVM fails to find a valid solution, rendering it impractical for many real-world datasets.
Information : Outlier
A data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error.
Fig. 1.3 shows the iris dataset with just one additional outlier: with it, it is either impossible to find a hard margin at all, or the decision boundary ends up very different from the one in Fig. 1.1 and will probably not generalise as well.
To avoid these issues, we need to use a more flexible model. The idea is to find a good balance between keeping the street as large as possible and limiting the margin violations. This is called soft margin classification.
As mentioned, soft margin SVM introduces flexibility by allowing some margin violations (i.e., misclassifications) to handle cases where the data is NOT perfectly separable. This is suitable for scenarios where the data may contain noise or outliers. Soft margin SVM permits certain data points to fall within the margin or even on the wrong side of the decision boundary. This adaptability is managed by a hyperparameter called C, also called the "regularisation parameter", which helps find a balance between making the street as wide as possible and limiting margin violations.
When creating an SVM model using scikit-learn, you can specify several hyperparameters, including the regularisation hyperparameter C.
Information : Hyperparameter C
Regularisation parameter. The strength of the regularisation is inversely proportional to C, which must be strictly positive. The penalty is a squared L2 penalty.
Setting C to a low value, we end up with the model on the left of Fig. 1.4; setting it to a high value, we get the model on the right. As can be seen, reducing C makes the street larger, but it also leads to more margin violations. In other words, reducing C results in more instances supporting the street, so there's less risk of over-fitting. But if you reduce it too much, then the model ends up underfitting, as seems to be the case here:
The model with C=100 looks like it will generalise better than the one with C=1.
If your SVM model is overfitting, you can try regularising it by reducing C.
The following scikit-learn code loads the iris dataset and trains a linear SVM classifier to detect Iris virginica flowers, using LinearSVC with C=1:
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2)  # Iris virginica

svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, random_state=42))
svm_clf.fit(X, y)
The resulting model is given on the left in Fig. 1.4. Then, we can use the model to make predictions:
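For example (a minimal sketch; the two petal measurements below are illustrative new instances):

X_new = [[5.5, 1.7], [5.0, 1.5]]   # [petal length (cm), petal width (cm)]
svm_clf.predict(X_new)             # predicts array([ True, False]) for these two instances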
The first plant is classified as an Iris virginica, whereas the second is not.
Unlike the LogisticRegression class, LinearSVC doesn't have a predict_proba() method to estimate the class probabilities. That said, if you use the SVC class instead of LinearSVC, and if you set its probability hyperparameter to True, then the model will fit an extra model at the end of training to map the SVM decision function scores to estimated probabilities.
Under the hood, this requires using 5-fold cross-validation to create out-of-sample predictions for every instance in the training set, then training a LogisticRegression model, which will slow down training. After that, the predict_proba() and predict_log_proba() methods will be available.
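A minimal sketch of what this looks like, reusing the iris features and X_new from above (the C value here is an illustrative choice):

from sklearn.svm import SVC

prob_svm_clf = make_pipeline(StandardScaler(),
                             SVC(C=1, probability=True, random_state=42))
prob_svm_clf.fit(X, y)                  # slower: runs 5-fold cross-validation internally
prob_svm_clf.predict_proba(X_new)       # estimated class probabilities
prob_svm_clf.predict_log_proba(X_new)   # log of the estimated probabilities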
1.3 Nonlinear Support Vector Machine Classification
Although linear SVM classifiers are efficient and often work surprisingly well, many datasets are not even close to being linearly separable. One approach to handling nonlinear datasets is to add more features, such as polynomial features.
In some cases this can result in a linearly separable dataset. Consider the LHS plot in Fig. 1.5, which represents a simple dataset with just one feature, $x_1$. As we can see, this dataset is NOT linearly separable. But adding a second feature $x_2 = (x_1)^2$, the resulting 2D dataset is perfectly linearly separable.
To implement this idea using scikit-learn, we can create a pipeline containing a PolynomialFeatures transformer, followed by a StandardScaler and a LinearSVC classifier.
Information : Pipeline
A series of interconnected data processing and modeling steps for streamlining the process of working with ML models.
Let's test this on the moons dataset, a toy dataset for binary classification in which the data points are shaped as two interleaving half circles. We can generate this dataset using the make_moons() function, shown in Fig. 1.6.
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=10, max_iter=10_000, random_state=42))
polynomial_svm_clf.fit(X, y)
1.3.1 Polynomial Kernel
Adding polynomial features is simple to implement and can work great with all sorts of ML algorithms, not just SVMs. That said, at a low polynomial degree this method cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow.
Fortunately, when using SVMs you can apply a mathematical technique called the kernel trick. The kernel trick makes it possible to get the same result as if you had added many polynomial features, even with a very high degree, without actually having to add them.
This means there’s no combinatorial explosion of the number of features.
This trick is implemented by the SVC
class. Let’s test it on the moons dataset:
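The listing is along these lines (a sketch; the degree, coef0, and C values here are illustrative choices, not taken from the original listing):

from sklearn.svm import SVC

poly_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=3, coef0=1, C=5))
poly_kernel_svm_clf.fit(X, y)   # X, y: the moons dataset generated above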
This code trains an SVM classifier using a low-degree polynomial kernel (degree 3 in the sketch above), represented on the left in Fig. 1.7. On the right is another SVM classifier using a much higher-degree polynomial kernel. Obviously, if your model is overfitting, you might want to reduce the polynomial degree. Conversely, if it is underfitting, you can try increasing it. The hyperparameter coef0 controls how much the model is influenced by high-degree terms versus low-degree terms.
Although hyperparameters will generally be tuned automatically (e.g., using randomised search), it’s good to have a sense of what each hyperparameter actually does and how it may interact with other hyperparameters: this way, you can narrow the search to a much smaller space.
1.3.2 The Kernel Trick
We have seen how higher-dimensional transformations can make a dataset linearly separable. In real applications, however, there might be many features in the data, and applying transformations that involve many polynomial combinations of these features will lead to extremely high and impractical computational costs.
The kernel trick provides a solution to this problem. The trick is that kernel methods represent the data only through a set of pairwise similarity comparisons between the original data observations, instead of explicitly applying the transformation and representing the data by the transformed coordinates in the higher-dimensional feature space.
In kernel methods, the dataset $X$ is represented by an $n \times n$ kernel matrix of pairwise similarity comparisons, where the entries $(i, j)$ are defined by the kernel function: $k(x_i, x_j)$.
This kernel function has a special mathematical property: it acts as a modified dot product. A kernel is defined as a function which takes as its inputs vectors in the original space and returns the dot product of the vectors in the feature space.
Formally, if we have data $x, z \in X$ and a mapping $\varphi : X \rightarrow \mathbb{R}^N$, then $k(x, z) = \langle \varphi(x), \varphi(z) \rangle$ is a kernel function.
Our kernel function accepts inputs in the original lower dimensional space and returns the dot product of the transformed vectors in the higher dimensional space. There are also theorems which guarantee the existence of such kernel functions under certain conditions.
The ultimate benefit of the kernel trick is that the objective function we are optimizing to fit the higher-dimensional decision boundary only includes the dot product of the transformed feature vectors. Therefore, we can just substitute these dot-product terms with the kernel function, and we don't even need to compute the transformed vectors $\varphi(x)$ at all.
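As a concrete illustration (a small numerical sketch with a hypothetical phi helper), the degree-2 polynomial kernel $k(x, z) = (x^\top z)^2$ on 2D inputs corresponds to the explicit feature map $\varphi([a, b]) = [a^2, \sqrt{2}\,ab, b^2]$, and the two computations agree:

import numpy as np

def phi(v):
    # explicit degree-2 feature map for a 2D input [a, b]
    a, b = v
    return np.array([a**2, np.sqrt(2) * a * b, b**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

kernel_value = (x @ z) ** 2        # evaluated entirely in the original 2D space
explicit_dot = phi(x) @ phi(z)     # dot product in the 3D feature space
print(kernel_value, explicit_dot)  # both print 121.0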
1.3.3 Similarity Features
Another technique to tackle non-linear problems is to add features computed using a similarity function, which measures how much each instance resembles a particular landmark.
For example, let's take the 1D dataset from earlier and add two landmarks to it at $x_1 = -2$ and $x_1 = 1$ (see the left plot in Fig. 1.8). Next, we'll define the similarity function to be the Gaussian Radial Basis Function (RBF) with $\gamma = 0.3$: $\phi_{\gamma}(x, \ell) = \exp(-\gamma \lVert x - \ell \rVert^{2})$, where $\ell$ is the landmark. As it is a Gaussian function, it is a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at the landmark).
Now we are ready to compute the new features. For example, let's look at the instance $x_1 = -1$. It is located at a distance of 1 from the first landmark and 2 from the second landmark. Therefore, its new features are $x_2 = \exp(-0.3 \times 1^2) \approx 0.74$ and $x_3 = \exp(-0.3 \times 2^2) \approx 0.30$.
The plot on the right in Fig. 1.8 shows the transformed dataset (dropping the original features). As you can see, it is now linearly separable.
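This computation can be sketched in a few lines (the gaussian_rbf helper is hypothetical, using the landmark locations and $\gamma = 0.3$ from above):

import numpy as np

def gaussian_rbf(x, landmark, gamma=0.3):
    # similarity of x to the landmark: exp(-gamma * ||x - landmark||^2)
    return np.exp(-gamma * (x - landmark) ** 2)

x1 = -1.0                               # the instance discussed above
x2 = gaussian_rbf(x1, landmark=-2.0)    # ~0.74 (distance 1 from the first landmark)
x3 = gaussian_rbf(x1, landmark=1.0)     # ~0.30 (distance 2 from the second landmark)
print(x2, x3)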
You may wonder how to select the landmarks. The simplest approach is to create a landmark at the location of each and every instance in the dataset. Doing that creates many dimensions, which increases the chances that the transformed training set will be linearly separable.
The downside is that a training set with $m$ instances and $n$ features gets transformed into a training set with $m$ instances and $m$ features (assuming you drop the original features). A very large training set therefore ends up with an equally large number of features.
1.3.4 Gaussian RBF Kernel
Just like the polynomial features method, the similarity features method can be useful with any ML algorithm, but it may be computationally expensive to compute all the additional features. Once again, the kernel trick does its magic, making it possible to obtain a similar result as if you had added many similarity features, without actually having to add them. Let's try the SVC class with the Gaussian RBF kernel:
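A sketch of such a model, reusing SVC and the moons dataset from earlier (the specific gamma and C values are assumptions, not taken from the original listing):

rbf_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", gamma=5, C=0.001))
rbf_kernel_svm_clf.fit(X, y)   # X, y: the moons dataset from earlier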
This model is represented at the bottom left in Fig. 1.9. The other plots show models trained with different values of the hyperparameters gamma ($\gamma$) and C.
Increasing gamma makes the bell-shaped curve narrower (see the LHS plots in Fig. 1.9). As a result, each instance's range of influence is smaller. The decision boundary ends up being more irregular, wiggling around individual instances. Conversely, a small gamma value makes the bell-shaped curve wider: instances have a larger range of influence, and the decision boundary ends up smoother.
Therefore $\gamma$ acts like a regularisation hyperparameter: if your model is overfitting, you should reduce $\gamma$; if it is underfitting, you should increase it (similar to the C hyperparameter).
Other kernels exist but are used much more rarely. Some kernels are specialized for specific data structures. String kernels are sometimes used when classifying text documents or DNA sequences.
You need to choose the right kernel for the job. As a rule of thumb, you should always try the linear kernel first. The LinearSVC class is much faster than SVC(kernel="linear"), especially if the training set is very large. If it is not too large, you should also try kernelised SVMs, starting with the Gaussian RBF kernel; it often works really well. Then, if you have spare time and computing power, you can experiment with a few other kernels using hyperparameter search. If there are kernels specialized for your training set's data structure, make sure to give them a try too.
1.4 Regression
To use SVMs for regression instead of classification, the main idea is to tweak the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street). The width of the street is controlled by a hyperparameter, $\epsilon$.
Fig. 1.10 shows two linear SVM regression models trained with different values of $\epsilon$. Reducing $\epsilon$ increases the number of support vectors, which regularises the model. Moreover, if we were to add more training instances within the margin, it would not affect the model's predictions. Therefore, the model is said to be $\epsilon$-insensitive.
You can use scikit-learn's LinearSVR class to perform linear SVM regression. The following code produces the model represented on the left in Fig. 1.10.
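A sketch of such a model (the toy linear dataset and the epsilon value are illustrative assumptions, not taken from the original listing):

import numpy as np
from sklearn.svm import LinearSVR

# illustrative noisy linear dataset
np.random.seed(42)
X = 2 * np.random.rand(50, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(50)

svm_reg = make_pipeline(StandardScaler(),
                        LinearSVR(epsilon=0.5, random_state=42))
svm_reg.fit(X, y)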
To tackle non-linear regression tasks, you can use a kernelized
SVM
model.
Fig.
1.11
shows
SVM
regression on a random
quadratic training set, using a second-degree polynomial kernel. There is some regularisation in the left plot (i.e., a small C
value), and much less in the right plot (i.e., a large C
value).
The following code uses scikit-learn's SVR class (which supports the kernel trick) to produce the model represented on the left in Fig. 1.11:
import numpy as np
from sklearn.svm import SVR

# extra code: these 3 lines generate a simple quadratic dataset
np.random.seed(42)
X = 2 * np.random.rand(50, 1) - 1
y = 0.2 + 0.1 * X[:, 0] + 0.5 * X[:, 0] ** 2 + np.random.randn(50) / 10

svm_poly_reg = make_pipeline(
    StandardScaler(),
    SVR(kernel="poly", degree=2, C=0.01, epsilon=0.1))
svm_poly_reg.fit(X, y)
The SVR class is the regression equivalent of the SVC class, and the LinearSVR class is the regression equivalent of the LinearSVC class. The LinearSVR class scales linearly with the size of the training set (just like the LinearSVC class), while the SVR class gets much too slow when the training set grows very large (just like the SVC class).