Motivation
Understand the use of kernels in regression problems. For their usage in unsupervised learning / dimension reduction, see the notes on Kernel PCA.
Kernel in Classification
Kernels are usually introduced in SVM classification problems. The rationale is that a linearly non-separable dataset can become separable in a high-dimensional feature space through a mapping $\phi: \mathcal{X} \to \mathcal{F}$.
But people found that for certain kinds of feature mapping, like the polynomial or radial basis function, it is more efficient to compute $\phi(x_i)^T \phi(x_j)$ than to explicitly calculate the mapping $\phi(x_i)$. If the dimensionality of the feature space far exceeds the data dimension or the number of data points, storing the inner products is more efficient than storing the features.
Because of this, we rely on the kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, and the mapping $\phi$ is left abstract.
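As a concrete illustration (a minimal NumPy sketch, not from the text above), the degree-2 polynomial kernel $(x^T z)^2$ equals the inner product of an explicit monomial feature map, so the features never have to be materialized:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 monomial feature map for a 2-D input x = (x1, x2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in the input space."""
    return (x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(z))    # inner product in the explicit feature space
print(poly_kernel(x, z))  # identical value, without ever forming phi
```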
Kernel in Regression
What does this have to do with regression?
For regression, we can imagine that our target can be modelled by a linear function in the high-dimensional feature space $\mathcal{F}$, but not in the data space $\mathcal{X}$.

$$\hat{y} = w^T \phi(x)$$

Thus, given a normal data set $X$ (shape $N \times p$) and target $y$ (shape $N \times 1$), we can compute the features $\phi(X) = [\phi(x_1)^T; \phi(x_2)^T; \phi(x_3)^T; \ldots]$, which is an $(N, N_f)$ matrix. Since usually $N_f > N$ (more high-dimensional features than data points), the linear regression problem is ill-posed and needs regularization.
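To make the ill-posedness concrete, here is a minimal NumPy sketch (the degree-2 feature map is a hypothetical choice) showing that with more features than samples, $\phi(X)^T \phi(X)$ is rank-deficient and cannot be inverted without regularization:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 4                                # few samples, small input dimension
X = rng.normal(size=(N, p))

def phi(x):
    """Hypothetical degree-2 monomial feature map for one sample."""
    return np.concatenate([x, np.outer(x, x)[np.triu_indices(len(x))]])

Phi = np.stack([phi(x) for x in X])        # phi(X), shape (N, N_f)
print(Phi.shape)                           # N_f = p + p(p+1)/2 = 14 > N = 5
print(np.linalg.matrix_rank(Phi.T @ Phi))  # rank <= N, so the (N_f, N_f)
                                           # matrix is singular without lambda*I
```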
One common regularization is ridge regression
$$w = \left[\phi(X)^T \phi(X) + \lambda I_f\right]^{-1} \phi(X)^T y$$

The predicted values will be

$$\hat{y} = \phi(X)\left[\phi(X)^T \phi(X) + \lambda I_f\right]^{-1} \phi(X)^T y$$

Note that in ridge regression, $[\phi(X)^T \phi(X) + \lambda I_f]$ is full rank and invertible. So we can use this identity

$$\phi(X)\left[\phi(X)^T \phi(X) + \lambda I_f\right] = \left[\phi(X)\phi(X)^T + \lambda I_N\right]\phi(X)$$

Note the different dimensionality of the two identity matrices. After inverting on both sides, we get

$$\left[\phi(X)\phi(X)^T + \lambda I_N\right]^{-1}\phi(X) = \phi(X)\left[\phi(X)^T \phi(X) + \lambda I_f\right]^{-1}$$

Thus we can rewrite our original prediction values

$$\hat{y} = \left[\phi(X)\phi(X)^T + \lambda I_N\right]^{-1}\phi(X)\phi(X)^T y$$

Now we can hide the feature mappings and retain only the kernel matrix $K$, of shape $(N, N)$:

$$\hat{y} = \left[K + \lambda I_N\right]^{-1} K y, \qquad K = \phi(X)\phi(X)^T$$

It is easy to see that without the regularization term $\lambda$, the training error will be 0, and the predictions at the training points will be exactly the same as the training targets.
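As a sanity check of the derivation above, here is a minimal NumPy sketch (using a random matrix as a stand-in for $\phi(X)$) confirming that the primal feature-space fit and the kernel form give identical fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_f, lam = 10, 40, 0.5
Phi = rng.normal(size=(N, N_f))   # stand-in for phi(X)
y = rng.normal(size=N)

# Primal form: phi(X) [phi(X)^T phi(X) + lambda*I_f]^{-1} phi(X)^T y
primal = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(N_f), Phi.T @ y)

# Kernel (dual) form: [K + lambda*I_N]^{-1} K y with K = phi(X) phi(X)^T
K = Phi @ Phi.T
dual = np.linalg.solve(K + lam * np.eye(N), K @ y)

print(np.allclose(primal, dual))  # True: both forms give the same fitted values
```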
When we obtain new samples $X'$,

$$\begin{aligned}
\hat{y}' &= \phi(X')\,w \\
&= \phi(X')\left[\phi(X)^T \phi(X) + \lambda I_f\right]^{-1} \phi(X)^T y \\
&= \phi(X')\,\phi(X)^T\left[\phi(X)\phi(X)^T + \lambda I_N\right]^{-1} y \\
&= K(X', X)\left[K(X, X) + \lambda I_N\right]^{-1} y
\end{aligned}$$

Through this we can see that, for prediction, the system stores weights on the samples, $\alpha$ (shape $N \times 1$), and then predicts new targets based on their similarity $K(X', X)$ with the previous samples.

$$\hat{y}' = K(X', X)\,\alpha, \qquad \alpha = \left[K(X, X) + \lambda I_N\right]^{-1} y$$

In this sense, the intuition of kernel regression is simply to look for similar cases in the training data and derive similar results by weighted averaging. The similarity is assessed in the kernel-induced feature space.
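Putting it together, here is a minimal NumPy sketch of kernel ridge regression with an RBF kernel (the toy $\sin$ data and the hyperparameters are arbitrary choices for illustration): training stores only $\alpha$, and prediction is a similarity-weighted combination of the training targets.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2), a similarity between samples."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
N, lam = 50, 1e-2
X_train = rng.uniform(-3, 3, size=(N, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=N)

# "Training": all information about the data is summarised in alpha (shape N)
alpha = np.linalg.solve(rbf_kernel(X_train, X_train) + lam * np.eye(N), y_train)

# Prediction: weight the stored alpha by similarity to the training samples
X_new = np.linspace(-3, 3, 7)[:, None]
y_new = rbf_kernel(X_new, X_train) @ alpha
print(np.c_[X_new[:, 0], y_new, np.sin(X_new[:, 0])])  # predictions vs. ground truth
```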
Interpretation
The kernel matrix $K$ can be understood as a way of measuring similarity.
- A representational similarity matrix is doing exactly this.
- It should take larger values for more similar points and lower values for less similar points.
- In some sense we can picture it as a monotonically decreasing function of distance, such as $\exp(-d)$, which is exactly the case for the radial basis function kernel (see the sketch below).
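For instance, a tiny NumPy sketch of how the RBF kernel value $\exp(-\gamma d^2)$ decays monotonically with distance (here with $\gamma = 1$):

```python
import numpy as np

# Pairwise distances and the corresponding RBF kernel values, gamma = 1
d = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
print(np.exp(-d ** 2))   # decreases monotonically from 1 toward 0 with distance
```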