\[ \def\thetahat{\hat{\theta}} \def\lambdahat{\hat{\lambda}} \def\f{f} \def\g{g} \]
When you have local sensitivity analysis as a tool, it’s tempting to use it to relax constraints on a problem. For example, you might solve a regression problem by fixing most of the coefficients to zero, making it computationally tractable, and then use local sensitivity analysis to approximately loosen that constraint to get the full solution. It’s interesting to note that this process is exactly equivalent to taking a Newton step from the original constrained solution.
In this post I’ll be a little loose with assumptions in order to just express the idea. This is something I’ve re–derived for myself periodically since my PhD, so in the spirit of this blog consisting of notes to myself, I’ll finally just put it here.
Suppose you solve a constrained optimization problem,
\[ \thetahat_{c} := \underset{\theta}{\mathrm{argmin}}\, \f(\theta) \textrm{ such that }\g(\theta) = \boldsymbol{0}. \]
Here, I’ll use the subscript “\(c\)” for “constrained”.
This can be equivalently cast as a Lagrange multiplier problem:
\[ \thetahat_c = \underset{\theta}{\mathrm{argmin}}\, \left(\f(\theta) + \lambda^\intercal\g(\theta) \right), \]
where \(\lambda\) is the Lagrange multiplier enforcing \(\g(\theta) = \boldsymbol{0}\). Setting the gradient of the Lagrangian with respect to \(\theta\) to zero at \(\thetahat_c\) gives you the value of \(\lambda\):
\[ \left. \frac{\partial \f(\theta)}{\partial \theta} \right|_{\thetahat_c} + \left. \frac{\partial \g(\theta)^\intercal}{\partial \theta} \right|_{\thetahat_c} \lambda = \boldsymbol{0}. \]
Assuming that the constraint Jacobian is full row–rank, we can solve to get
\[ \lambdahat = -\left( \left. \frac{\partial \g(\theta)}{\partial \theta^\intercal} \right|_{\thetahat_c} \left. \frac{\partial \g(\theta)^\intercal}{\partial \theta} \right|_{\thetahat_c} \right)^{-1} \left. \frac{\partial \g(\theta)}{\partial \theta^\intercal} \right|_{\thetahat_c} \left. \frac{\partial \f(\theta)}{\partial \theta} \right|_{\thetahat_c} \]
as the Lagrange multiplier that enforces \(\g(\theta) = \boldsymbol{0}\).
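To make this concrete, here's a small numerical sketch of the \(\lambdahat\) formula in JAX. The setup is purely hypothetical and not part of the derivation: a toy least–squares objective \(\f\), with \(\g(\theta) = A^\intercal \theta\) pinning the last two coefficients of \(\theta\) to zero (the names `X`, `y`, and `A` are made up for illustration).

```python
# Hypothetical toy example: least-squares objective with two coefficients
# pinned to zero; all names (X, y, A) are made up for illustration.
import jax
import jax.numpy as jnp

X = jax.random.normal(jax.random.PRNGKey(0), (20, 5))
y = jax.random.normal(jax.random.PRNGKey(1), (20,))

def f(theta):
    # Toy objective: 0.5 * ||X theta - y||^2.
    return 0.5 * jnp.sum((X @ theta - y) ** 2)

# g(theta) = A^T theta = 0 fixes the last two coefficients of theta to zero.
A = jnp.zeros((5, 2)).at[3, 0].set(1.0).at[4, 1].set(1.0)

def g(theta):
    return A.T @ theta

# Constrained solution: solve the free 3-dimensional subproblem, pad with zeros.
theta_free = jnp.linalg.solve(X[:, :3].T @ X[:, :3], X[:, :3].T @ y)
theta_c = jnp.concatenate([theta_free, jnp.zeros(2)])

# Lagrange multiplier: lambda = -(J J^T)^{-1} J grad_f, with J the constraint Jacobian.
grad_f = jax.grad(f)(theta_c)
J = jax.jacobian(g)(theta_c)                     # shape (2, 5)
lam = -jnp.linalg.solve(J @ J.T, J @ grad_f)

# Check the first-order condition grad_f + J^T lambda = 0.
print(grad_f + J.T @ lam)                        # approximately zero
```

The printed residual should be numerically zero, confirming that this \(\lambdahat\) satisfies the stationarity condition above.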
Now, suppose we instead view \(\thetahat(\lambda)\) as the solution to the Lagrangian problem, parameterized by \(\lambda\):
\[ \thetahat(\lambda) := \underset{\theta}{\mathrm{argmin}}\, \left(\f(\theta) + \lambda^\intercal\g(\theta) \right). \]
Clearly \(\thetahat(\boldsymbol{0}) = \thetahat_u\) is the unconstrained solution, and it can be approximated using the implicit function theorem and a first–order series expansion around \(\lambdahat\):
\[ \thetahat_u = \thetahat(\boldsymbol{0}) \approx \thetahat(\lambdahat) + \left. \frac{d\thetahat(\lambda)}{d\lambda^\intercal} \right|_{\lambdahat}(\boldsymbol{0}- \lambdahat). \]
The derivative is given by the implicit function theorem:
\[ \begin{aligned} \left. \frac{d\thetahat(\lambda)}{d\lambda^\intercal} \right|_{\lambdahat} ={}& -\left( \left. \frac{\partial^2 \f(\theta)}{\partial \theta \partial \theta^\intercal} \right|_{\thetahat(\lambdahat)} + \left. \frac{\partial^2 \lambdahat^\intercal\g(\theta)}{\partial \theta \partial \theta^\intercal} \right|_{\thetahat(\lambdahat)} \right)^{-1} \left. \frac{\partial^2 \lambda^\intercal\g(\theta)}{\partial \theta \partial \lambda^\intercal} \right|_{\thetahat(\lambdahat)} \\={}& -\left( \left. \frac{\partial^2 \f(\theta)}{\partial \theta \partial \theta^\intercal} \right|_{\thetahat(\lambdahat)} + \left. \frac{\partial^2 \lambdahat^\intercal\g(\theta)}{\partial \theta \partial \theta^\intercal} \right|_{\thetahat(\lambdahat)} \right)^{-1} \left. \frac{\partial \g(\theta)^\intercal}{\partial \theta} \right|_{\thetahat(\lambdahat)}. \end{aligned} \]
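As a quick sanity check on this formula, here's a sketch using the same hypothetical toy setup as above. For a quadratic objective with a linear constraint, the Lagrangian minimizer has a closed form, so the implicit–function–theorem derivative can be compared against direct differentiation; since \(\g\) is linear here, the \(\lambdahat^\intercal\g\) Hessian term drops out.

```python
# Hypothetical toy check: for f(theta) = 0.5 ||X theta - y||^2 and
# g(theta) = A^T theta, the Lagrangian minimizer has the closed form
# theta(lambda) = (X^T X)^{-1} (X^T y - A lambda), so its derivative can be
# compared against the implicit-function-theorem expression.
import jax
import jax.numpy as jnp

X = jax.random.normal(jax.random.PRNGKey(0), (20, 5))
y = jax.random.normal(jax.random.PRNGKey(1), (20,))
A = jnp.zeros((5, 2)).at[3, 0].set(1.0).at[4, 1].set(1.0)

def f(theta):
    return 0.5 * jnp.sum((X @ theta - y) ** 2)

def theta_hat(lam):
    # Closed-form minimizer of the Lagrangian for this quadratic objective.
    return jnp.linalg.solve(X.T @ X, X.T @ y - A @ lam)

lam0 = jnp.array([0.3, -0.7])

# Derivative of the closed form, computed directly ...
dtheta_direct = jax.jacobian(theta_hat)(lam0)

# ... versus the implicit function theorem: -(Lagrangian Hessian)^{-1} dg^T/dtheta.
# The constraint is linear, so its Hessian contribution vanishes here.
H = jax.hessian(f)(theta_hat(lam0))
dtheta_ift = -jnp.linalg.solve(H, A)             # dg^T/dtheta = A for g(theta) = A^T theta

print(jnp.max(jnp.abs(dtheta_direct - dtheta_ift)))   # ~0
```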
Plugging in,
\[ \begin{aligned} \thetahat_u \approx{}& \thetahat_c -\left( \left. \frac{\partial^2 \f(\theta)}{\partial \theta \partial \theta^\intercal} \right|_{\thetahat(\lambdahat)} + \left. \frac{\partial^2 \lambdahat^\intercal\g(\theta)}{\partial \theta \partial \theta^\intercal} \right|_{\thetahat(\lambdahat)} \right)^{-1} \left. \frac{\partial \g(\theta)^\intercal}{\partial \theta} \right|_{\thetahat(\lambdahat)} \left( \left. \frac{\partial \g(\theta)}{\partial \theta^\intercal} \right|_{\thetahat_c} \left. \frac{\partial \g(\theta)^\intercal}{\partial \theta} \right|_{\thetahat_c} \right)^{-1} \left. \frac{\partial \g(\theta)}{\partial \theta^\intercal} \right|_{\thetahat_c} \left. \frac{\partial \f(\theta)}{\partial \theta} \right|_{\thetahat_c} \\={}& \thetahat_c -\left( \left. \frac{\partial^2 \f(\theta)}{\partial \theta \partial \theta^\intercal} \right|_{\thetahat(\lambdahat)} + \left. \frac{\partial^2 \lambdahat^\intercal\g(\theta)}{\partial \theta \partial \theta^\intercal} \right|_{\thetahat(\lambdahat)} \right)^{-1} \left. \frac{\partial \f(\theta)}{\partial \theta} \right|_{\thetahat_c}, \end{aligned} \]
where we have used the fact that the first–order condition \(\left. \partial \f / \partial \theta \right|_{\thetahat_c} = -\left. \partial \g(\theta)^\intercal / \partial \theta \right|_{\thetahat_c} \lambdahat\) places the gradient of \(\f\) in the row space of the constraint Jacobian, so projecting it onto that space returns the gradient of \(\f\) itself.
Suppose \(\g(\theta) = A^\intercal\theta\) for some matrix \(A\), as in the case where we have set some components of \(\theta\) to zero. The constraint is then linear, so its Hessian term vanishes, and we get
\[ \thetahat_u \approx \thetahat_c -\left( \left. \frac{\partial^2 \f(\theta)}{\partial \theta \partial \theta^\intercal} \right|_{\thetahat(\lambdahat)} \right)^{-1} \left. \frac{\partial \f(\theta)}{\partial \theta} \right|_{\thetahat_c}, \]
which is precisely a Newton step in the unconstrained space, starting at the constrained estimator.
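Here is a final sketch verifying the punchline numerically, again on the hypothetical toy least–squares problem from the sketches above: the sensitivity–based approximation and the Newton step from \(\thetahat_c\) coincide, and because this objective is quadratic the step lands exactly on the unconstrained solution.

```python
# Hypothetical toy check of the final claim: the sensitivity-based
# approximation equals a Newton step from theta_c, and for a quadratic
# objective that step is exactly the unconstrained solution.
import jax
import jax.numpy as jnp

X = jax.random.normal(jax.random.PRNGKey(0), (20, 5))
y = jax.random.normal(jax.random.PRNGKey(1), (20,))
A = jnp.zeros((5, 2)).at[3, 0].set(1.0).at[4, 1].set(1.0)   # pins the last two coefficients

def f(theta):
    return 0.5 * jnp.sum((X @ theta - y) ** 2)

# Constrained solution: last two coefficients fixed to zero.
theta_free = jnp.linalg.solve(X[:, :3].T @ X[:, :3], X[:, :3].T @ y)
theta_c = jnp.concatenate([theta_free, jnp.zeros(2)])

grad_f = jax.grad(f)(theta_c)
H = jax.hessian(f)(theta_c)
J = A.T                                                     # constraint Jacobian

# Sensitivity-based approximation: theta_c - H^{-1} J^T (J J^T)^{-1} J grad_f.
sens = theta_c - jnp.linalg.solve(H, J.T @ jnp.linalg.solve(J @ J.T, J @ grad_f))

# Newton step from the constrained solution: theta_c - H^{-1} grad_f.
newton = theta_c - jnp.linalg.solve(H, grad_f)

# Exact unconstrained solution for comparison.
theta_u = jnp.linalg.solve(X.T @ X, X.T @ y)

print(jnp.max(jnp.abs(sens - newton)))     # ~0: the two routes agree
print(jnp.max(jnp.abs(newton - theta_u)))  # ~0: exact here because f is quadratic
```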