Linear approximation when a norm is the quantity of interest.

differentiability
Published July 20, 2021

In the setting of some of our linear approximation work, I was recently asked the excellent question of what to do when the quantity of interest is a norm. The short answer is that there doesn't seem to be an easy fix except when the dimension of the parameter is very small. But I can suggest a direction in which to look for a workable solution in practice.

For concreteness, suppose we have data weights \(w\), an estimator \(\hat{\theta}(w)\) with \(\hat\theta = \hat\theta(1)\), and we wish to remove a small proportion of the data in order to produce a large change in \(\phi(\theta) = || \theta - \hat\theta||_2^2\). By the chain rule, \(d \phi(\hat\theta(w)) / dw = 2 (\hat\theta(w) - \hat\theta)^T d \hat\theta(w) / dw\), which is zero at \(w = 1\) because \(\hat\theta(1) = \hat\theta\). What to do?
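To make the vanishing derivative concrete, here is a minimal numerical sketch. The weighted-mean estimator, the toy data, and all of the names below are my own illustration (not anything from our papers); the point is just that automatic differentiation of the squared norm returns a gradient of zero at \(w = 1\).

```python
# Minimal sketch: the gradient of the squared norm of the change in a weighted
# estimator vanishes at the all-ones weight vector.  Toy example only.
import jax
import jax.numpy as jnp

x = jnp.array([[0.5, 1.0], [2.0, -1.0], [1.5, 0.3], [-0.2, 0.8]])  # N = 4 datapoints, D = 2

def thetahat(w):
    # A weighted mean standing in for a general weighted estimator thetahat(w).
    return (w[:, None] * x).sum(axis=0) / w.sum()

theta_full = thetahat(jnp.ones(x.shape[0]))  # thetahat evaluated at w = 1

def phi(w):
    # Squared 2-norm of the change in the estimate.
    return jnp.sum((thetahat(w) - theta_full) ** 2)

print(jax.grad(phi)(jnp.ones(x.shape[0])))  # numerically zero in every coordinate
```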

To my mind, the first thing to realize is that the zero derivative is a bit of a red herring. We might as well have analyzed \(\phi(\theta) = || \theta - \hat\theta||_2\), which for our purposes behaves like \(\phi(\theta) = || \theta - \hat\theta||_1 = \sum_{d} |\theta_d - \hat\theta_d|\) for \(\theta\) close to \(\hat\theta\), where \(\theta = (\theta_1, \ldots, \theta_D)^T\); the two norms are equivalent up to a factor of \(\sqrt{D}\). The absolute value is not differentiable at \(0\), so you can see that no single linear approximation will provide a good approximation to the norm. But now the task is a bit clearer: we must somehow form a linear approximation to each “facet” of \(w \mapsto \sum_{d} |\hat\theta_d(w) - \hat\theta_d|\).

To this end, let \(s\) denote a length-\(D\) vector whose entries are each \(-1\) or \(1\), i.e., \(s \in \{ -1, 1\}^D\). With this notation, \(\sum_{d} |\hat\theta_d(w) - \hat\theta_d| = \max_{s \in \{ -1, 1\}^D} \sum_{d} s_d (\hat\theta_d(w) - \hat\theta_d)\), since the maximum over \(s_d\) will simply choose the sign of \(\hat\theta_d(w) - \hat\theta_d\). Suppose that we restrict \(w \in \Omega_w\) (e.g., \(\Omega_w\) could be the set of weight vectors with ones in all entries except for \(M\) zeros, corresponding to leaving out \(M\) datapoints). Then

\[
\begin{aligned}
\max_{w \in \Omega_w} || \hat\theta(w) - \hat\theta||_1
={}& \max_{w \in \Omega_w} \max_{s \in \{ -1, 1\}^D} \sum_{d} s_d (\hat\theta_d(w) - \hat\theta_d) \\
={}& \max_{s \in \{ -1, 1\}^D} \max_{w \in \Omega_w} \sum_{d} s_d (\hat\theta_d(w) - \hat\theta_d).
\end{aligned}
\]
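As a quick sanity check of the inner identity (the vector below is arbitrary and purely illustrative), enumerating the sign vectors does recover the \(\ell_1\) norm:

```python
# Check that the max over s in {-1, 1}^D of s^T x equals the L1 norm of x.
import itertools
import numpy as np

x = np.array([0.3, -1.2, 0.0, 2.5])   # stand-in for thetahat(w) - thetahat
best = max(np.dot(s, x) for s in itertools.product([-1.0, 1.0], repeat=len(x)))
print(best, np.abs(x).sum())          # both print 4.0
```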

By exchanging the order of the maxima, we have made the problem more tractable since, for a fixed \(s\), the map \(w \mapsto \sum_{d} s_d (\hat\theta_d(w) - \hat\theta_d)\) is differentiable and can be treated with the linearization techniques from our papers. If \(D\) is very small, then one can find the adversarial choice of \(w\) for each of the \(2^D\) possible \(s\) vectors and take the max over all of them.
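Here is a sketch of that small-\(D\) enumeration, reusing the toy weighted-mean estimator from above. It assumes the usual linear approximation \(\hat\theta(w) \approx \hat\theta + \sum_n (w_n - 1) \, \partial \hat\theta / \partial w_n\), with the per-datapoint influence scores computed by automatic differentiation; the details are my own illustration rather than a recipe from our papers.

```python
# For each sign vector s, linearize w -> s^T (thetahat(w) - thetahat) using the
# per-datapoint influence scores psi_n = d thetahat / d w_n at w = 1, choose the
# M removals predicted to increase the objective the most, and keep the best s.
import itertools
import jax
import jax.numpy as jnp

x = jnp.array([[0.5, 1.0], [2.0, -1.0], [1.5, 0.3], [-0.2, 0.8]])  # toy data: N = 4, D = 2
N, D, M = x.shape[0], x.shape[1], 1

def thetahat(w):
    return (w[:, None] * x).sum(axis=0) / w.sum()

psi = jax.jacobian(thetahat)(jnp.ones(N)).T   # (N, D): row n is d thetahat / d w_n at w = 1

best_change, best_drop, best_s = -jnp.inf, None, None
for s in itertools.product([-1.0, 1.0], repeat=D):
    s = jnp.array(s)
    scores = psi @ s                   # linearized contribution of each datapoint to s^T (thetahat(w) - thetahat)
    drop = jnp.argsort(scores)[:M]     # dropping the most negative contributions maximizes the objective
    change = -scores[drop].sum()       # predicted increase from setting those M weights to zero
    if change > best_change:
        best_change, best_drop, best_s = change, drop, s

print(best_s, best_drop, best_change)
```

Of course, the predicted change should then be checked by actually re-computing \(\hat\theta(w)\) with the chosen datapoints removed, since the linearization is only an approximation.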

But in the more common case that \(D\) is large, some way of searching the space of \(s\) vectors is necessary. Although it seems likely to me that finding the exact optimal \(s\) is a hard combinatorial problem, it may not be hard to approximately find a sign vector that produces a large change, at least in cases where a large change can be produced (one such heuristic is sketched below). Since linearization produces at worst a sub-optimal set of points to remove anyway, further approximation still produces a lower bound on sensitivity as long as \(\hat\theta(w)\) can be re-computed. In any case, I like this formulation because it seems to make clear what is hard about forming linear approximations to norms, and it provides some directions to think about solutions.
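For completeness, here is one heuristic one might try, sketched on the same toy example; it is purely my own illustration and nothing worked out carefully here. The idea is to alternate between choosing the removal set for the current \(s\) via the linearization and resetting \(s\) to the signs of the predicted change in \(\hat\theta\). This is coordinate ascent on the double maximization, so it may stop at a local optimum.

```python
# Alternate between (i) the adversarial removal for the current sign vector s,
# using the linearization, and (ii) aligning s with the predicted change in
# thetahat.  A heuristic sketch only; it can get stuck at a local optimum.
import jax
import jax.numpy as jnp

x = jnp.array([[0.5, 1.0], [2.0, -1.0], [1.5, 0.3], [-0.2, 0.8]])  # toy data again
N, D, M = x.shape[0], x.shape[1], 1

def thetahat(w):
    return (w[:, None] * x).sum(axis=0) / w.sum()

psi = jax.jacobian(thetahat)(jnp.ones(N)).T   # (N, D) influence scores at w = 1

s = jnp.ones(D)                               # arbitrary starting sign vector
for _ in range(10):
    drop = jnp.argsort(psi @ s)[:M]           # adversarial removal for the current s (linearized)
    delta = -psi[drop].sum(axis=0)            # predicted thetahat(w) - thetahat for that removal
    s_new = jnp.sign(jnp.where(delta == 0, 1.0, delta))
    if bool(jnp.all(s_new == s)):
        break
    s = s_new

print(drop, jnp.abs(delta).sum())             # candidate removal and its predicted change in the L1 norm
```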