\[ \def\F{\mathcal{F}} \def\G{\mathcal{G}} \]
A reproducing kernel Hilbert space (RKHS) is a pretty powerful abstract concept that you’ll eventually come across in statistics or machine learning. It shows up in lots of places, like Gaussian processes, support vector machines, and smoothing splines, where its role is to provide flexible and tractable infinite–dimensional function classes with a quantifiable notion of “smoothness.” Here are some references that I like (in increasing order of mathiness): Schölkopf and Smola (2018), Wainwright (2019), Paulsen and Raghupathi (2016).
In this post, I’ll clearly state an aspect of RKHS’s that I wish I had understood earlier: there are at least three distinct ways to construct an RKHS, each with different strengths and weaknesses:
- Via a given kernel
- Via feature vectors
- Via the evaluation operator
Mathematically, these perspectives are of course equivalent whenever they can all be applied. But each approach makes different aspects of the problem explicit, namely:
- The kernel
- The inner product
- The smooth subspace
Correspondingly, for a given approach, the other aspects may be left implicit or defined only non–constructively.
This post is intended for someone who already knows what an RKHS is and why you’d use it, and might benefit from this particular high–level perspective.
In each example, I’ll consider functions \(f: [0,1] \to \mathbb{R}\) in L2 on \([0,1]\) for simplicity — this is just an expository blog post, after all. Nothing I’m saying is very tied to that particular choice, and hopefully it will be clear enough how to generalize. I’ll denote the usual L2 inner product as \(<f, g> = \int_0^1 f(x) g(x) dx\). L2 itself is not an RKHS (why will become clear below if it is not already familiar). So let’s talk about how to construct a subspace of L2 that is an RKHS, and with it a corresponding reproducing kernel, inner product, and feature vector.
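To see concretely why L2 fails, here is a minimal numerical sketch (the grid size and bump widths are arbitrary choices of mine): narrow “tent” functions always evaluate to 1 at \(x = 0.5\), yet their L2 norms can be made as small as we like, so no bound of the form \(|f(0.5)| \le C \, ||f||_2\) is possible; that is, evaluation is not continuous in L2.

```python
import numpy as np

# Illustration: pointwise evaluation is not controlled by the L2 norm,
# so evaluation cannot be a bounded functional on L2.
xs = np.linspace(0.0, 1.0, 200_001)
dx = xs[1] - xs[0]

def l2_norm(f_vals):
    # Crude Riemann-sum approximation of (integral_0^1 f(x)^2 dx)^(1/2).
    return np.sqrt(np.sum(f_vals ** 2) * dx)

for eps in [0.1, 0.01, 0.001]:
    # Tent of height 1 and half-width eps centered at x = 0.5, so f(0.5) = 1 always.
    tent = np.clip(1.0 - np.abs(xs - 0.5) / eps, 0.0, None)
    print(f"eps = {eps:<6} f(0.5) = {tent[xs.size // 2]:.1f}, ||f||_2 ≈ {l2_norm(tent):.4f}")
```

The L2 norms shrink toward zero while the value at \(0.5\) stays fixed at one; contrast this with the bounded evaluation functionals of Method 3 below.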
Method 1: via a kernel
Suppose we start with a symmetric positive definite kernel \(k(x, y)\), for which \(k(\cdot, y) \in L2\) for any \(y\). We can define an RKHS \(\F\) as the closure of linear combinations of the kernel. That is, let
\[ \F' := \{ f: f(\cdot) = \sum_{n=1}^N a_n k(\cdot, x_n) \quad\textrm{ for some }N,\{a_n\}_{n \in [N]}, \{x_n\}_{n\in[N]}\}. \]
Then we take the RKHS \(\F\) to be the completion of \(\F'\), with respect to the RKHS norm defined below rather than the L2 norm. The completion allows \(f\) which are infinite sums \(\sum_{n=1}^\infty a_n k(\cdot, x_n)\), as long as the corresponding squared RKHS norm \(\sum_{n,m} a_n a_m k(x_n, x_m)\) remains finite.
Note that \(k(\cdot, x)\) does not have the reproducing property with respect to the L2 inner product — the RKHS inner product \(<\cdot, \cdot>_\F\) must be different from the L2 inner product. What should it be? Well, if we want the kernel to be reproducing, we require
\[ <k(\cdot, x), k(\cdot, y)>_\F = k(x, y) \]
Suppose we have two functions in \(\F'\), represented by
\[ f = \sum_{n=1}^{N_f} a^f_n k(\cdot, x^f_n) \quad\textrm{and}\quad g = \sum_{n=1}^{N_g} a^g_n k(\cdot, x^g_n). \]
Using the bilinearity of the inner product, together with the reproducing property of the kernel applied to itself, gives \[ <f, g>_\F = \sum_{n=1}^{N_f} \sum_{m=1}^{N_g} a^f_n a^g_m k(x_n^f, x_m^g). \]
The continuous extension to \(\F\) gives the inner product on the whole space.
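Here is a small sketch of this inner product in code. The kernel (a squared exponential with length scale 0.2), the points, and the coefficients are all arbitrary choices for illustration; any symmetric positive definite kernel would work the same way.

```python
import numpy as np

# Method 1 sketch: the RKHS inner product of two finite kernel expansions,
#   f = sum_n a_n^f k(., x_n^f)  and  g = sum_m a_m^g k(., x_m^g),
# computed as <f, g>_F = sum_{n,m} a_n^f a_m^g k(x_n^f, x_m^g).
def k(x, y, length_scale=0.2):
    # Squared-exponential kernel, chosen only for illustration.
    return np.exp(-0.5 * (x - y) ** 2 / length_scale ** 2)

def rkhs_inner_product(a_f, x_f, a_g, x_g):
    gram = k(x_f[:, None], x_g[None, :])   # N_f x N_g matrix of k(x_n^f, x_m^g)
    return a_f @ gram @ a_g

a_f, x_f = np.array([1.0, -0.5]), np.array([0.2, 0.7])
a_g, x_g = np.array([0.3, 0.4, 1.2]), np.array([0.1, 0.5, 0.9])

print(rkhs_inner_product(a_f, x_f, a_g, x_g))  # <f, g>_F
print(rkhs_inner_product(a_f, x_f, a_f, x_f))  # ||f||_F^2, nonnegative since k is p.d.
```

Taking \(g = k(\cdot, x)\) (a single term with coefficient one) reduces this formula to the reproducing property \(<f, k(\cdot, x)>_\F = f(x)\).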
The advantage of this construction is that you can start with any kernel, and you know exactly what it is. But it may be hard to see intuitively what subspace is being carved out, what the inner product actually means, or what sorts of feature vectors \(k(\cdot, \cdot)\) is implicitly computing.
Method 2: via feature vectors
Suppose that, instead of a kernel, we start with a (possibly infinite) collection of orthonormal features in L2, \(\Phi_n(x)\) for \(n=1,\ldots,\infty\). These features span a subspace of L2, and we can consider the functions living in their closed span:
\[ \G := \{ f: f = \sum_{n=1}^\infty a_n \Phi_n(x) \quad\textrm{having}\quad \sum_{n=1}^\infty a_n^2 < \infty \}. \]
I’ll now do something illegal but suggestive to motivate forming an RKHS using the feature vectors. The function \(\sum_{n=1}^\infty \Phi_n(x) \Phi_n(\cdot)\) is typically not a member of L2, since it would have infinite norm (each \(||\Phi_n(\cdot)||_2^2 = 1\), and they are orthogonal). But if it were, it would have the reproducing property in L2, since we’d have
\[ \begin{aligned} <\sum_{n=1}^\infty \Phi_n(x) \Phi_n(\cdot), f(\cdot)> ={}& <\sum_{n=1}^\infty \Phi_n(x) \Phi_n(\cdot), \sum_{m=1}^\infty a_m \Phi_m(\cdot)> \\={}& \sum_{n=1}^\infty a_n \Phi_n(x) <\Phi_n(\cdot), \Phi_n(\cdot)> \\={}& \sum_{n=1}^\infty a_n \Phi_n(x) \\={}& f(x). &\textrm{Actually illegal! But suggestive.} \end{aligned} \]
However, we can modify both the inner product (giving an RKHS inner product) and the representer (giving an RKHS kernel) so that the representer actually is a member of L2. Specifically, we can define
\[ \begin{aligned} k(x, \cdot) :={}& \sum_{n=1}^\infty \lambda_n \Phi_n(x) \Phi_n(\cdot) \quad\textrm{for some}\quad \lambda_n > 0 \textrm{ with } \sum_{n=1}^\infty \lambda_n^2 < \infty, \\ <\Phi_n(\cdot), \Phi_m(\cdot)>_\F ={}& 0 \quad\textrm{for}\quad n \ne m, \\ <\Phi_n(\cdot), \Phi_n(\cdot)>_\F ={}& \lambda_n^{-1}. \end{aligned} \]
Since \(\lambda_n\) must go to zero as \(n\rightarrow \infty\), this means taking the higher–indexed features to have a larger norm in the RKHS than in L2. Since the \(\lambda_n\) in the inner product cancels with the \(\lambda_n\) in the kernel, we have the reproducing property:
\[ \begin{aligned} <k(x, \cdot), f(\cdot)>_\F ={}& <\sum_{n=1}^\infty \lambda_n \Phi_n(x) \Phi_n(\cdot), \sum_{m=1}^\infty a_m \Phi_m(\cdot)>_\F \\={}& \sum_{n=1}^\infty \lambda_n a_n \Phi_n(x) <\Phi_n(\cdot), \Phi_n(\cdot)>_\F \\={}& \sum_{n=1}^\infty \lambda_n a_n \Phi_n(x) \lambda_n^{-1} \\={}& \sum_{n=1}^\infty a_n \Phi_n(x) = f(x). \end{aligned} \]
Now, a function has finite norm with respect to \(<\cdot,\cdot>_\F\) exactly when \[ <f,f>_\F = \sum_{n=1}^\infty \frac{a_n^2}{\lambda_n} < \infty. \] Since \(\lambda_n \rightarrow 0\), there are some sequences \(a_n\) that are square summable, but not square summable when divided by \(\lambda_n\), so the RKHS is a proper subspace of \(\G\).
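Here is a short sketch of this construction in code, assuming (for concreteness only) the cosine basis \(\Phi_n(x) = \sqrt{2}\cos(n \pi x)\) on \([0,1]\), the decay \(\lambda_n = n^{-2}\), and coefficients \(a_n = n^{-3}\); all infinite sums are truncated at \(N\) terms.

```python
import numpy as np

# Method 2 sketch, truncated at N terms.  The basis, eigenvalues, and coefficients
# below are illustrative choices, not anything canonical.
N = 500
n = np.arange(1, N + 1)
lam = 1.0 / n ** 2                        # lambda_n, positive and square-summable

def phi(x):
    # Rows: points x; columns: Phi_1(x), ..., Phi_N(x) = sqrt(2) cos(n pi x).
    x = np.atleast_1d(x)[:, None]
    return np.sqrt(2.0) * np.cos(n[None, :] * np.pi * x)

def kernel(x, y):
    # Truncation of k(x, y) = sum_n lambda_n Phi_n(x) Phi_n(y).
    return (phi(x) * lam) @ phi(y).T

a = 1.0 / n ** 3                          # coefficients of f = sum_n a_n Phi_n
f = lambda x: phi(x) @ a

rkhs_sq_norm = np.sum(a ** 2 / lam)       # <f, f>_F = sum_n a_n^2 / lambda_n
x0 = 0.3
b = lam * phi(x0).ravel()                 # coefficients of k(x0, .) in the Phi basis
f_via_kernel = np.sum(b * a / lam)        # <k(x0, .), f>_F; the lambda_n's cancel

print(kernel(0.3, 0.7)[0, 0])             # a (truncated) kernel evaluation
print(rkhs_sq_norm)                       # finite, so f is in the RKHS
print(f(x0)[0], f_via_kernel)             # agree: the reproducing property
```

With these choices, \(\sum_n a_n^2 / \lambda_n = \sum_n n^{-4} < \infty\), so \(f\) lands inside the RKHS; swapping in, say, \(a_n = n^{-1.2}\) gives a function that is still in \(\G\) but no longer in \(\F\).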
This construction is nice in the sense that it gives us explicit feature vectors, and an explicit notion of how they should “decay” in order to enforce the smoothness of the RKHS. However, computing the kernel requires evaluating an infinite sum (unless it happens to have a closed form), and the precise subspace, which is characterized only through the convergence of an infinite sum, may not be intuitive.
Method 3: via bounded evaluation operators
Finally, we have an RKHS if we simply have a Hilbert space of functions in which the evaluation operator is bounded. I’ll say below what this means. A relatively simple motivating example is the first–order Sobolev space (see Paulsen and Raghupathi (2016) chapter 1.3.1).
Suppose we begin with some subspace of L2 and an inner product \(<\cdot,\cdot>_\F\). On any such space, the “evaluation functional” is always linear. Specifically, let \(E_x(f(\cdot)) := f(x)\) for any \(x\). Then
\[ E_x(a f(\cdot) + g(\cdot)) = a f(x) + g(x) = a E_x(f(\cdot)) + E_x(g(\cdot)). \]
Suppose we can additionally show that the evaluation functional is bounded, meaning that, for any \(x\), there exists a constant \(C_x\) such that, for all \(f\) in the space,
\[ |E_x(f(\cdot))| \le C_x ||f(\cdot)||_\F. \]
When this is the case, then by the Riesz representation theorem, there exists a representer of evaluation in the space. We can suggestively call the representer for \(E_x\) “\(k(x, \cdot)\)”, giving
\[ E_x(f(\cdot)) = <k(x, \cdot), f>_\F. \]
We have thus shown that our space is an RKHS, with the kernel given by the Riesz representation theorem.
The nice thing about this construction is that you begin with an explicit subspace and inner product, and so know exactly what they are. By simply showing boundedness of the evaluation operator, you are guaranteed that a kernel exists and that you are in an RKHS. However, you may need to do extra work to actually find out what the kernel is, and what the feature vectors are.
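As a quick numerical check of this story, consider the first–order Sobolev example mentioned above: functions on \([0,1]\) with \(f(0) = 0\) and inner product \(<f, g>_\F = \int_0^1 f'(t) g'(t) dt\). For that space the representer of evaluation is known to be \(k(x, y) = \min(x, y)\); the sketch below (with an arbitrary test function and evaluation point) only verifies the reproducing property numerically, it does not derive the kernel.

```python
import numpy as np

# Method 3 sketch: check <k(x0, .), f>_F = f(x0) in the first-order Sobolev space
# {f on [0,1] : f(0) = 0, f' in L2} with <f, g>_F = integral of f'(t) g'(t) dt,
# whose reproducing kernel is k(x, y) = min(x, y).
ts = np.linspace(0.0, 1.0, 200_001)
dt = ts[1] - ts[0]

f = lambda t: np.sin(2.0 * t)            # an arbitrary smooth f with f(0) = 0
f_prime = lambda t: 2.0 * np.cos(2.0 * t)

x0 = 0.37
# d/dt min(x0, t) = 1 for t < x0 and 0 for t > x0.
k_x0_prime = (ts < x0).astype(float)

inner = np.sum(k_x0_prime * f_prime(ts)) * dt   # crude quadrature for <k(x0, .), f>_F
print(inner, f(x0))                             # both approximately sin(2 * 0.37)
```

Boundedness itself is easy to see in this example: \(|f(x)| = |\int_0^x f'(t) dt| \le \sqrt{x} \, ||f||_\F\) by Cauchy–Schwarz, which is exactly the condition that lets the Riesz argument go through.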
You always have it all, but you might not be able to write it down
- Method 1:
- Took in a kernel,
- Constructed a subspace and inner product, and
- Has feature vectors via Mercer’s theorem.
- Method 2:
- Took in feature vectors, and
- Constructed a kernel, subspace, and inner product.
- Method 3:
- Took in a subspace and inner product with bounded evaluation operator,
- Has a kernel via the Riesz representation theorem, and
- Has feature vectors via Mercer’s theorem.
So all of the key aspects of an RKHS — the kernel, the subspace, the inner product, and the feature vectors — exist no matter how you construct it. But in each method, you really only have an explicit handle by default on the thing that you start with: namely, the kernel, the features, or the space, respectively. The other aspects of the RKHS are then either constructed via infinite sums, or are guaranteed to exist (but not given explicit forms) via the Riesz representation or Mercer’s theorems.