In the setting of some of our linear approximation work, I was recently asked the excellent question of what to do when the quantity of interest is a norm. The short answer is that there does not seem to be an easy fix except when the dimension of the parameter is very small. But I can suggest a direction in which to look for a workable solution in practice.

For concreteness, suppose we have data weights $$w$$, an estimator $$\hat{\theta}(w)$$ with $$\hat\theta = \hat\theta(1)$$, and we wish to remove a small proportion of the data in order to produce a large change in $$\phi(\theta) = || \theta - \hat\theta||_2^2$$. By the chain rule, $$d \phi(\hat\theta(w)) / dw = 2 (\hat\theta(w) - \hat\theta)^T d \hat\theta(w) / dw$$, which is zero at $$w = 1$$. What to do?
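The vanishing derivative is easy to see numerically. Here is a hedged sketch using a toy weighted-mean estimator of my own choosing (purely for illustration; this particular $$\hat\theta(w)$$ does not come from the papers), checking by finite differences that the derivative of $$\phi$$ with respect to a single weight is zero at $$w = 1$$:

```python
import numpy as np

# Toy illustration: theta_hat(w) is the weighted mean of the data
# (an assumed estimator, chosen only to make the example concrete).
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))

def theta_hat(w):
    return (w[:, None] * x).sum(axis=0) / w.sum()

theta0 = theta_hat(np.ones(20))

def phi(w):
    # Squared l2 distance from the full-data estimate.
    return np.sum((theta_hat(w) - theta0) ** 2)

# Central finite difference of phi with respect to w_0 at w = 1.
eps = 1e-6
w_plus = np.ones(20); w_plus[0] += eps
w_minus = np.ones(20); w_minus[0] -= eps
deriv = (phi(w_plus) - phi(w_minus)) / (2 * eps)
print(deriv)  # numerically indistinguishable from zero
```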

To my mind, the first thing to realize is that the zero derivative is a bit of a red herring. We might as well have analyzed $$\phi(\theta) = || \theta - \hat\theta||_2$$, which is comparable (up to constants depending only on $$D$$) to $$\phi(\theta) = || \theta - \hat\theta||_1 = \sum_{d} |\theta_d - \hat\theta_d|$$, where $$\theta = (\theta_1, \ldots, \theta_D)^T$$. The absolute value is not differentiable at $$0$$, so you can see that no single linear approximation can approximate the norm well. But now the task is a bit clearer: we must somehow form a linear approximation to each “facet” of $$w \mapsto \sum_{d} |\theta_d(w) - \hat\theta_d|$$.

To this end, let $$s$$ denote a length-$$D$$ vector whose entries are $$-1$$ or $$1$$, i.e., $$s \in \{ -1, 1\}^D$$. With this notation, $$\sum_{d} |\theta_d(w) - \hat\theta_d| = \max_{s \in \{ -1, 1\}^D} \sum_{d} s_d (\theta_d(w) - \hat\theta_d)$$, since the maximum over $$s_d$$ simply chooses the sign of $$\theta_d(w) - \hat\theta_d$$. Suppose that we restrict $$w \in \Omega_w$$ (e.g., $$\Omega_w$$ could be the set of vectors with ones in all entries except for $$M$$ zeros, corresponding to leaving out $$M$$ datapoints). Then
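The identity $$||x||_1 = \max_{s \in \{-1,1\}^D} s^T x$$ is easy to verify numerically; the following small check does just that for a random vector:

```python
import itertools
import numpy as np

# Check numerically that ||x||_1 equals the max of s^T x over sign vectors s.
rng = np.random.default_rng(0)
D = 4
x = rng.normal(size=D)

best = max(
    np.dot(s, x)
    for s in itertools.product([-1.0, 1.0], repeat=D)
)
print(np.isclose(best, np.linalg.norm(x, 1)))  # True
```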

\begin{align} \max_{w \in \Omega_w} || \hat\theta(w) - \hat\theta||_1 ={}& \max_{w \in \Omega_w} \max_{s \in \{ -1, 1\}^D} \sum_{d} s_d (\hat\theta_d(w) - \hat\theta_d) \\={}& \max_{s \in \{ -1, 1\}^D} \max_{w \in \Omega_w} \sum_{d} s_d (\hat\theta_d(w) - \hat\theta_d). \end{align}

By exchanging the order of the maxima, we have made the problem more tractable since, for a fixed $$s$$, the map $$w \mapsto \sum_{d} s_d (\hat\theta_d(w) - \hat\theta_d)$$ is differentiable and can be treated with the linearization techniques from our papers. If $$D$$ is very small, then one can find the adversarial choice of $$w$$ for each of the $$2^D$$ possible $$s$$ vectors and take the max over all of them.
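For small $$D$$, the brute force over sign vectors can be sketched as follows. The influence vectors `psi` and the additive form of the linear approximation are assumptions for illustration; in practice they would come from the actual linearization of $$\hat\theta(w)$$:

```python
import itertools
import numpy as np

# Brute-force search over all 2^D sign vectors, feasible only for small D.
# Assumes the linear approximation theta_hat(w) - theta_hat ~ sum_n (w_n - 1) psi[n],
# where psi[n] is a hypothetical influence vector for datapoint n.
rng = np.random.default_rng(0)
N, D, M = 50, 3, 5              # datapoints, parameter dimension, points to drop
psi = rng.normal(size=(N, D))   # illustrative influence vectors, not real data

best_value, best_s = -np.inf, None
for s in itertools.product([-1.0, 1.0], repeat=D):
    s = np.asarray(s)
    # Dropping point n changes s^T (theta_hat(w) - theta_hat) by about -s^T psi[n],
    # so the adversarial choice drops the M points with the largest such change.
    scores = -psi @ s
    value = np.sort(scores)[-M:].sum()
    if value > best_value:
        best_value, best_s = value, s

print(best_value)  # approximate max l1 change over removal sets and signs
```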

But in the more common case that $$D$$ is large, some way of searching the space of $$s$$ vectors is necessary. Although it seems likely to me that finding the exactly optimal $$s$$ is a hard combinatorial problem, it may not be hard to approximately find a sign vector that produces a large change in cases where a large change can be produced. Since linearization produces at worst a sub-optimal set of points to remove anyway, further approximation still yields a lower bound on sensitivity as long as $$\hat\theta(w)$$ can be re-computed. In any case, I like this formulation because it makes clear what is hard about forming linear approximations to norms, and it suggests some directions to think about solutions.
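One plausible heuristic for large $$D$$ (my own suggestion, not something from the papers) is alternating maximization: fix $$s$$ and pick the best $$M$$ points to drop under the linear approximation, then reset $$s$$ to the signs of the predicted parameter change, and repeat. Each step is non-decreasing in the objective, so the iteration reaches a fixed point. Again, `psi` is a hypothetical influence matrix:

```python
import numpy as np

# Alternating-maximization sketch for large D, under the assumed linear
# approximation theta_hat(w) - theta_hat ~ sum_n (w_n - 1) psi[n].
rng = np.random.default_rng(1)
N, D, M = 200, 50, 10
psi = rng.normal(size=(N, D))   # hypothetical per-datapoint influence vectors

s = np.ones(D)                  # initial sign vector
for _ in range(20):
    scores = -psi @ s                        # predicted change in s^T delta per drop
    drop = np.argsort(scores)[-M:]           # drop the M most adversarial points
    delta = -psi[drop].sum(axis=0)           # predicted theta_hat(w) - theta_hat
    s_new = np.where(delta >= 0, 1.0, -1.0)  # the s maximizing s^T delta
    if np.array_equal(s_new, s):
        break                                # fixed point reached
    s = s_new

print(np.abs(delta).sum())  # predicted l1 change: a lower bound on the max
```

The result is only a local optimum over sign vectors, but since the linearized removal set is itself sub-optimal, any candidate found this way can simply be checked by re-computing $$\hat\theta(w)$$ at the proposed $$w$$.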