Motivation

Understand the use of kernels in regression problems. For their use in unsupervised learning / dimensionality reduction, see the notes on Kernel PCA.

Kernel in Classification

Kernels are usually introduced in SVM classification problems. The rationale is that a dataset that is not linearly separable in the original space may become separable in a high-dimensional feature space reached through a mapping $\phi: \mathcal{X} \to \mathcal{F}$.

But people found that for certain kinds of feature mappings, such as polynomial or radial basis functions, it is more efficient to compute the inner product $\phi(x_i)^T\phi(x_j)$ than to explicitly calculate the mapping $\phi(x_i)$. When the dimensionality of the feature space far exceeds the dimensionality or the number of data points, storing the inner products is also more efficient than storing the features.

Because of this, we rely on the kernel $K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$ and leave the mapping $\phi$ abstract.
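As a minimal sketch of this "kernel trick" (the degree-2 polynomial kernel on 2-D inputs is chosen here as an example; the explicit map `phi` below is the standard expansion for that kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

# Explicit degree-2 polynomial feature map for 2-D inputs:
# phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]
def phi(v):
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

# Kernel trick: K(x, z) = (x^T z)^2 equals the inner product in feature space
explicit = phi(x) @ phi(z)         # form the features, then take the inner product
via_kernel = (x @ z) ** 2          # never form phi explicitly
print(np.isclose(explicit, via_kernel))   # True
```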

Kernel in Regression

What does this have to do with regression?

For regression, we can imagine that our target can be modelled by a linear function in the high-dimensional feature space $\mathcal{F}$, but not in the data space $\mathcal{X}$:

$$\hat{y} = w^T \phi(x)$$

Thus, given a data set $X$ (shape $(N, p)$) and targets $y$ (shape $(N, 1)$), we can compute the features $\phi(X) = [\phi(x_1)^T; \phi(x_2)^T; \dots; \phi(x_N)^T]$, which is an $(N, N_f)$ matrix. Since usually $N_f > N$ (more high-dimensional features than data points), the linear regression problem is ill posed and needs regularization.
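A small numeric illustration of this ill-posedness (a random nonlinear feature map stands in for $\phi$ here; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, Nf = 20, 3, 100            # more features than samples
X = rng.normal(size=(N, p))

# A random nonlinear feature map standing in for phi (illustration only)
W = rng.normal(size=(p, Nf))
Phi = np.tanh(X @ W)             # phi(X), shape (N, Nf)

gram = Phi.T @ Phi               # (Nf, Nf), rank at most N < Nf
print(np.linalg.matrix_rank(gram))   # 20 -> singular, plain least squares is ill posed
```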

One common regularization is ridge regression

$$w = [\phi(X)^T\phi(X) + \lambda I_f]^{-1}\phi(X)^T y$$

The predicted values on the training set will be

$$\hat{y} = \phi(X)\,[\phi(X)^T\phi(X) + \lambda I_f]^{-1}\phi(X)^T y$$
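A sketch of this primal (feature-space) ridge fit in numpy, again with a random matrix standing in for $\phi(X)$ and an assumed $\lambda = 10^{-2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Nf, lam = 20, 100, 1e-2
Phi = rng.normal(size=(N, Nf))        # stands in for phi(X)
y = rng.normal(size=(N, 1))

# Primal ridge solution: w = [Phi^T Phi + lam*I_f]^{-1} Phi^T y
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Nf), Phi.T @ y)
y_hat = Phi @ w                        # fitted values on the training set
print(w.shape, y_hat.shape)            # (100, 1) (20, 1)
```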

Note that in ridge regression, $[\phi(X)^T\phi(X) + \lambda I_f]$ is full rank and invertible for $\lambda > 0$. So we can use the following identity:

$$\phi(X)\,[\phi(X)^T\phi(X) + \lambda I_f] = [\phi(X)\phi(X)^T + \lambda I_N]\,\phi(X)$$

Note the different dimensionality of the two identity matrices. Left-multiplying by $[\phi(X)\phi(X)^T + \lambda I_N]^{-1}$ and right-multiplying by $[\phi(X)^T\phi(X) + \lambda I_f]^{-1}$, we get

$$[\phi(X)\phi(X)^T + \lambda I_N]^{-1}\phi(X) = \phi(X)\,[\phi(X)^T\phi(X) + \lambda I_f]^{-1}$$
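A quick numerical check of this push-through identity (random `Phi` standing in for $\phi(X)$, $\lambda = 10^{-2}$ assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
N, Nf, lam = 20, 100, 1e-2
Phi = rng.normal(size=(N, Nf))

# Left side:  [Phi Phi^T + lam*I_N]^{-1} Phi
lhs = np.linalg.solve(Phi @ Phi.T + lam * np.eye(N), Phi)
# Right side: Phi [Phi^T Phi + lam*I_f]^{-1}
rhs = Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(Nf))
print(np.allclose(lhs, rhs))   # True
```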

Thus we can rewrite the original predictions as

$$\hat{y} = [\phi(X)\phi(X)^T + \lambda I_N]^{-1}\phi(X)\phi(X)^T y$$

Now we can hide the feature mappings and keep only the kernel matrix $K$, of shape $(N, N)$:

$$\hat{y} = [K + \lambda I_N]^{-1} K y, \qquad K = \phi(X)\phi(X)^T$$
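A sketch checking that this kernel (dual) form reproduces the primal ridge fit on the training set (same random stand-in for $\phi(X)$ as above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, Nf, lam = 20, 100, 1e-2
Phi = rng.normal(size=(N, Nf))          # stands in for phi(X)
y = rng.normal(size=(N, 1))

K = Phi @ Phi.T                         # (N, N) kernel / Gram matrix

# Kernel form of the fitted values: y_hat = [K + lam*I_N]^{-1} K y
y_hat_kernel = np.linalg.solve(K + lam * np.eye(N), K @ y)

# Primal form for comparison
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Nf), Phi.T @ y)
y_hat_primal = Phi @ w

print(np.allclose(y_hat_kernel, y_hat_primal))   # True
# As lam -> 0 (with K invertible), y_hat_kernel -> y: zero training error.
```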

It is easy to see that without the regularization term $\lambda$ (and with $K$ invertible), the training error is 0: the predictions for the training points equal the training targets exactly, since $K^{-1}Ky = y$.

When we obtain new samples $X^*$, the same identity (in transposed form, $[\phi(X)^T\phi(X) + \lambda I_f]^{-1}\phi(X)^T = \phi(X)^T[\phi(X)\phi(X)^T + \lambda I_N]^{-1}$) lets us express the prediction purely through kernel evaluations:

$$\begin{aligned}
\hat{y}^* &= \phi(X^*)\, w \\
&= \phi(X^*)\,[\phi(X)^T\phi(X) + \lambda I_f]^{-1}\phi(X)^T y \\
&= \phi(X^*)\phi(X)^T\,[\phi(X)\phi(X)^T + \lambda I_N]^{-1} y \\
&= K(X^*, X)\,[K(X, X) + \lambda I_N]^{-1} y
\end{aligned}$$

Through this we can see that, for prediction, the system stores weights $\alpha$ on the samples (shape $(N, 1)$), and then predicts new targets based on their similarity $K(X^*, X)$ with the previous samples:

$$\hat{y}^* = K(X^*, X)\,\alpha, \qquad \alpha = [K(X, X) + \lambda I_N]^{-1} y$$
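Putting the pieces together, here is a minimal kernel ridge regression sketch on a toy 1-D problem with an RBF kernel; the helper `rbf_kernel`, the bandwidth `sigma`, and the toy data are assumptions for illustration, not from the notes:

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.5):
    """K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))            # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40) # noisy targets
X_new = np.linspace(-3, 3, 5)[:, None]          # new query points

lam = 1e-2
K = rbf_kernel(X, X)                            # K(X, X), shape (40, 40)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # weights on the samples

y_new = rbf_kernel(X_new, X) @ alpha            # K(X*, X) @ alpha
print(y_new.round(2))                           # predictions, roughly tracking sin(X_new)
```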

In this sense, the intuition of kernel regression is to look up similar cases in the training data and form the prediction as a weighted combination of their targets, with similarity assessed in the kernel feature space.

Interpretation

The kernel matrix $K$ can be understood as a way of measuring similarity between samples.

  • A representational similarity matrix does exactly this.
  • It should take larger values for more similar points and smaller values for less similar ones.
  • In some sense we can picture it as a monotonically decreasing function of distance, e.g. $\exp(-d)$; the radial basis function kernel $\exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ is exactly of this form (see the sketch below).
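A tiny numeric illustration of this monotonic decay (RBF similarity with $\sigma = 1$ assumed):

```python
import numpy as np

d = np.linspace(0, 3, 7)        # distances between two points
sim = np.exp(-d**2 / 2)         # RBF similarity with sigma = 1
print(np.round(sim, 3))         # 1.0 at d = 0, decaying toward 0 as distance grows
```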