How does AMIP work for regression when the weight vector induces collinearity in the regressors? This problem came up in our paper, as well as for a couple of users of zaminfluence. Superficially, the higher-order infinitesimal jackknife has nothing to say about such a point, since a requirement for the accuracy of the approximation is that the Hessian matrix be uniformly non-singular. However, we shall see that, in the special case of regression, we can re-express the problem so that the singularity disappears.
Notation
Suppose we have a weighted regression problem with regressor matrix \(X\) (an \(N \times D\) matrix), response vector \(\vec{y}\) (an \(N\)-vector), and weight vector \(\vec{w}\) (an \(N\)-vector): \[ \begin{aligned} % \hat{\theta}(\vec{w}) :={}& \theta\textrm{ such that }\sum_{n=1}^N w_n (y_n - \theta^T x_n) x_n = 0 \Rightarrow\\ \hat{\theta}(\vec{w}) ={}& \left(\frac{1}{N}\sum_{n=1}^N w_n x_n x_n^T \right)^{-1} \frac{1}{N}\sum_{n=1}^Nw_n y_n x_n \\={}& \left((\vec{w}\odot X)^T X\right)^{-1} (\vec{w}\odot X)^T \vec{y}. % \end{aligned} \]
Here, we have used \(\vec{w}\odot X\) to denote the Hadamard product with broadcasting. Formally, we really mean \((\vec{w}1_D^T) \odot X\), where \(1_D^T\) is a \(D\)-length row vector containing all ones. Throughout, we will use \(1\) and \(0\), subscripted by a dimension, to represent vectors filled with ones and zeros respectively.
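To make the notation concrete, here is a minimal numpy sketch of the closed form above. The data, dimensions, and the leave-two-out weight vector are all invented for illustration.

```python
import numpy as np

# Made-up data purely for illustration.
rng = np.random.default_rng(0)
N, D = 10, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
y = rng.normal(size=N)
w = np.ones(N)
w[:2] = 0.0  # e.g., a leave-two-out weight vector

# thetahat(w) = ((w ⊙ X)^T X)^{-1} (w ⊙ X)^T y, broadcasting w over columns.
WX = w[:, None] * X
theta_hat = np.linalg.solve(WX.T @ X, WX.T @ y)

# Sanity check: with zero weights this matches OLS on the kept rows only.
theta_kept = np.linalg.lstsq(X[2:], y[2:], rcond=None)[0]
print(np.allclose(theta_hat, theta_kept))  # True
```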
How can weights induce rank deficiency?
We are interested in what happens to the linear approximation at a weight vector \(\vec{w}\) for which the Hessian \(\frac{1}{N}\sum_{n=1}^Nw_n x_n x_n^T\) is singular. Assume that \(X\) has rank \(D\), and that \((\vec{w}\odot X)\) has rank \(D-1\). Specifically, there exists some nonzero vector \(a_1 \in \mathbb{R}^D\) such that \((\vec{w} \odot X) a_1 = 0_N\), where \(0_N\) is the \(N\)-length vector of zeros. For each \(n\), the preceding expression implies that \(w_n x_n^T a_1 = 0\), so either \(w_n = 0\) or \(x_n^T a_1 = 0\). Taking \(\vec{w}\) to be a vector of zeros and ones that drops some observations, we can, without loss of generality, order the observations so that the first \(N_{d}\) rows are dropped:
\[ \begin{aligned} % \vec{w}= \left(\begin{array}{c} 0_{N_{d}}\\ 1_{N_{k}} \end{array}\right) \quad\textrm{and}\quad X= \left(\begin{array}{c} X_d\\ X_k \end{array}\right) % \end{aligned} \]
Here, \(X_d\) is an \(N_{d}\times D\) matrix of dropped rows and \(X_k\) is an \(N_{k}\times D\) matrix of kept rows, where \(N_{k}+ N_{d}= N\). We thus have
\[ \begin{aligned} % Xa_1 = \left(\begin{array}{c} X_d a_1\\ 0_{N_{k}} \end{array}\right). % \end{aligned} \]
Here, \(X_d a_1 \ne 0_{N_d}\) (for otherwise \(X\) could not have rank \(D\)). In other words, the kept rows \(X_k\) are rank deficient, the dropped rows \(X_d\) contain the direction \(a_1\) that is missing from \(X_k\), and \(\vec{w}\odot X\) is rank deficient precisely because \(\vec{w}\) drops that rank-restoring portion \(X_d\).
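Here is a small, entirely hypothetical numpy example of this situation: the third regressor is an indicator that is nonzero only on the rows that \(\vec{w}\) drops, so \(a_1 = (0, 0, 1)^T\) annihilates the kept rows.

```python
import numpy as np

# Hypothetical setup: the third regressor is an indicator that is nonzero
# only on the first N_d rows, which are exactly the rows that w drops.
rng = np.random.default_rng(0)
N, N_d = 10, 2
indicator = np.concatenate([np.ones(N_d), np.zeros(N - N_d)])
X = np.column_stack([np.ones(N), rng.normal(size=N), indicator])
w = np.concatenate([np.zeros(N_d), np.ones(N - N_d)])

a1 = np.array([0.0, 0.0, 1.0])               # X_k a_1 = 0 on the kept rows
print(np.linalg.matrix_rank(X))              # 3: X itself is full rank
print(np.linalg.matrix_rank(w[:, None] * X)) # 2: w ⊙ X is rank deficient
print(np.allclose((w[:, None] * X) @ a1, 0)) # True: (w ⊙ X) a_1 = 0_N
```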
Reparameterize to isolate the vanishing subspace
To understand how \(\hat{\theta}(\vec{w})\) behaves, let’s isolate the coefficient that corresponds to the subspace that vanishes. To that end, let \(A\) denote an invertible \(D \times D\) matrix whose first column is \(a_1\).
\[ \begin{aligned} % A := \left(\begin{array}{cccc} a_1 & a_2 & \ldots & a_D \end{array}\right). % \end{aligned} \]
Define \(Z:= XA\) and \(\beta := A^{-1} \theta\) so that \(X\theta = Z\beta\). Then we can equivalently investigate the behavior of
\[ \begin{aligned} % \hat{\beta}(\vec{w}) ={}& \left((\vec{w}\odot Z)^T Z\right)^{-1} (\vec{w}\odot Z)^T \vec{y}. % \end{aligned} \]
If we write \(Z_1\) for the first column of \(Z\), \(Z_{2:D}\) for the \(N \times (D - 1)\) matrix of remaining columns, and \(A_{2:D}\) for the last \(D-1\) columns of \(A\), we have
\[ \begin{aligned} % Z= \left(\begin{array}{cc} Z_1 & Z_{2:D} \end{array}\right) = \left(\begin{array}{cc} Xa_1 & XA_{2:D} \end{array}\right) = \left(\begin{array}{cc} X_d a_1 & X_d A_{2:D} \\ 0_{N_{k}} & X_k A_{2:D} \\ \end{array}\right), % \end{aligned} \]
where we have used the definition of \(a_1\) and the partition of \(X\) from above.
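A quick numpy check of this reparameterization, continuing the made-up indicator example. The particular \(A\) below, a column permutation, is just one convenient invertible matrix whose first column is \(a_1\).

```python
import numpy as np

# Continuing the hypothetical indicator example; all data are made up.
rng = np.random.default_rng(0)
N, N_d = 10, 2
indicator = np.concatenate([np.ones(N_d), np.zeros(N - N_d)])
X = np.column_stack([np.ones(N), rng.normal(size=N), indicator])
y = rng.normal(size=N)

a1 = np.array([0.0, 0.0, 1.0])
A = np.column_stack([a1, np.eye(3)[:, :2]])  # invertible, first column a_1
Z = X @ A

# The first column of Z vanishes on the kept rows, matching the partition above.
print(np.allclose(Z[N_d:, 0], 0.0))  # True

# At full weights (w = 1_N) both parameterizations exist and thetahat = A betahat.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(np.allclose(theta_hat, A @ beta_hat))  # True
```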
Consider a straight path from \(1_N\) to \(\vec{w}\)
Define \(\vec{w}(t) = (\vec{w}- 1_N) t+ 1_N\) for \(t\in [0, 1]\), so that \(\vec{w}(0) = 1_N\) and \(\vec{w}(1) = \vec{w}\). We can now write an explicit formula for \(\hat{\beta}(\vec{w}(t))\) as a function of \(t\) and consider what happens as \(t\rightarrow 1\).
Because \(\vec{w}\) has zeros in its first \(N_{d}\) entries, \(\vec{w}(t)\) equals \(1-t\) on those entries and \(1\) elsewhere, so
\[ \begin{aligned} % \vec{w}(t) \odot Z= \left(\begin{array}{cc} (1-t) X_d a_1 & (1-t) X_d A_{2:D} \\ 0_{N_{k}} & X_k A_{2:D} \\ \end{array} \right) % \end{aligned} \]
and
\[\begin{aligned} % \left((\vec{w}(t)\odot Z)^T Z\right) ={}& \left(\begin{array}{cc} (1-t) a_1^T X_d^T X_d a_1 & (1-t) a_1^T X_d^T X_d A_{2:D}\\ (1-t) A_{2:D}^T X_d^T X_d a_1 & A_{2:D}^T ( X_k^T X_k + (1-t) X_d^T X_d )A_{2:D} \\ \end{array}\right). % \end{aligned} \]
Since the upper left-hand entry \((1-t) a_1^T X_d^T X_d a_1 \rightarrow 0\) as \(t\rightarrow 1\), along with the rest of the first row and column, we can see again that the regression is singular when evaluated at \(\vec{w}\).
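Numerically, in the running made-up example, the Gram matrix \((\vec{w}(t)\odot Z)^T Z\) loses rank exactly at \(t = 1\); this sketch just traces that degeneration along the path.

```python
import numpy as np

# Same made-up indicator example as above.
rng = np.random.default_rng(0)
N, N_d = 10, 2
indicator = np.concatenate([np.ones(N_d), np.zeros(N - N_d)])
X = np.column_stack([np.ones(N), rng.normal(size=N), indicator])
a1 = np.array([0.0, 0.0, 1.0])
A = np.column_stack([a1, np.eye(3)[:, :2]])
Z = X @ A
w = np.concatenate([np.zeros(N_d), np.ones(N - N_d)])

for t in [0.0, 0.9, 0.99, 0.999, 1.0]:
    wt = (w - 1.0) * t + 1.0            # straight path from 1_N to w
    gram = (wt[:, None] * Z).T @ Z      # (w(t) ⊙ Z)^T Z
    smallest = np.linalg.svd(gram, compute_uv=False).min()
    print(f"t={t}: upper-left entry = {gram[0, 0]:.4f}, "
          f"smallest singular value = {smallest:.6f}")
```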
However, by partitioning \(\vec{y}\) into dropped and kept components, \(\vec{y}_d\) and \(\vec{y}_k\) respectively, we also have
\[ \begin{aligned} % (\vec{w}(t) \odot Z)^T \vec{y}={}& \left(\begin{array}{c} (1-t) a_1^T X_d^T \vec{y}_d\\ A_{2:D}^T \left(X_k^T \vec{y}_k + (1-t)X_d^T \vec{y}_d \right) \end{array}\right). % \end{aligned} \]
One can perhaps see at this point that the \((1-t)\) will cancel between the numerator and denominator of the regression. This can be seen directly with the Schur complement. Write \(M_{22}(t) := A_{2:D}^T ( X_k^T X_k + (1-t) X_d^T X_d )A_{2:D}\) and \(v_2(t) := A_{2:D}^T \left(X_k^T \vec{y}_k + (1-t)X_d^T \vec{y}_d \right)\) for the lower-right block and the lower sub-vector above; both converge to non-degenerate limits as \(t\rightarrow 1\) (because \(X_k\) has rank \(D-1\) with null space spanned by \(a_1\), the matrix \(X_k A_{2:D}\) has full column rank, so \(M_{22}(1)\) is invertible). Letting \(\hat{\beta}_1\) denote the first element of \(\hat{\beta}\), the Schur complement gives
\[ \begin{aligned} % \hat{\beta}_1(\vec{w}(t)) = \frac{(1-t)\left( a_1^T X_d^T \vec{y}_d - a_1^T X_d^T X_d A_{2:D} M_{22}(t)^{-1} v_2(t) \right)}{ (1-t) a_1^T X_d^T X_d a_1 - O((1-t)^2) } = \frac{a_1^T X_d^T \vec{y}_d - a_1^T X_d^T X_d A_{2:D} M_{22}(t)^{-1} v_2(t)}{ a_1^T X_d^T X_d a_1 - O(1-t) }, % \end{aligned} \]
where \(O(\cdot)\) denotes a term of the specified order as \(t\rightarrow 1\). We can see that, as \(t\rightarrow 1\), \(\hat{\beta}_1(\vec{w}(t))\) in fact varies smoothly, and we can expect linear approximations to work as well as they would without the singularity. Formally, the singularity is a “removable singularity,” analogous to a common factor cancelling in a ratio of polynomials.
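As a numerical sanity check in the running made-up example, the sketch below evaluates \(\hat{\beta}_1(\vec{w}(t))\) along the path and compares it to the \(t\rightarrow 1\) limit implied by the Schur-complement expression above.

```python
import numpy as np

# Same made-up indicator example; betahat_1(w(t)) should vary smoothly as t -> 1
# and approach the limit given by the Schur-complement formula.
rng = np.random.default_rng(0)
N, N_d = 10, 2
indicator = np.concatenate([np.ones(N_d), np.zeros(N - N_d)])
X = np.column_stack([np.ones(N), rng.normal(size=N), indicator])
y = rng.normal(size=N)
a1 = np.array([0.0, 0.0, 1.0])
A = np.column_stack([a1, np.eye(3)[:, :2]])
Z = X @ A
w = np.concatenate([np.zeros(N_d), np.ones(N - N_d)])

for t in [0.0, 0.9, 0.999, 1.0 - 1e-8]:
    wt = (w - 1.0) * t + 1.0
    WZ = wt[:, None] * Z
    beta_t = np.linalg.solve(WZ.T @ Z, WZ.T @ y)
    print(t, beta_t[0])

# The t -> 1 limit from the Schur-complement expression.
Xd, Xk, yd, yk = X[:N_d], X[N_d:], y[:N_d], y[N_d:]
A2 = A[:, 1:]
M22 = A2.T @ Xk.T @ Xk @ A2
v2 = A2.T @ Xk.T @ yk
num = a1 @ Xd.T @ yd - a1 @ Xd.T @ Xd @ A2 @ np.linalg.solve(M22, v2)
den = a1 @ Xd.T @ Xd @ a1
print(num / den)
```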
An analogous argument, again using the Schur complement, holds for the rest of the \(\hat{\beta}\) vector. Since \(\hat{\theta}(\vec{w}) = A \hat{\beta}(\vec{w})\) is simply a linear transformation of \(\hat{\beta}(\vec{w})\), the same reasoning applies to \(\hat{\theta}\) as well.
Conclusions and consequences
Though the regression problem is singular precisely at \(\vec{w}\), it is in fact well-behaved in a neighborhood of \(\vec{w}\). This is because re-weighting downweights the rank-restoring rows of the regressors and their responses at the same rate, so the corresponding coefficient remains determined by those rows. Singularity occurs only when entries of \(\vec{w}\) are precisely zero. For theoretical and practical purposes, you can avoid the problem entirely by considering weight vectors that are not precisely zero at the left-out points, taking arbitrarily small but nonzero values instead.
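A small numpy illustration of this workaround in the same made-up example: giving the left-out points arbitrarily small but nonzero weights yields a well-defined fit that is insensitive to the particular tiny value chosen.

```python
import numpy as np

# Made-up indicator example: replacing the exact zeros in w with tiny positive
# weights leaves the fit well defined and essentially unchanged.
rng = np.random.default_rng(0)
N, N_d = 10, 2
indicator = np.concatenate([np.ones(N_d), np.zeros(N - N_d)])
X = np.column_stack([np.ones(N), rng.normal(size=N), indicator])
y = rng.normal(size=N)

def theta_hat(w):
    WX = w[:, None] * X
    return np.linalg.solve(WX.T @ X, WX.T @ y)

for eps in [1e-4, 1e-6, 1e-8]:
    w = np.concatenate([np.full(N_d, eps), np.ones(N - N_d)])
    print(eps, theta_hat(w))
```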
The most common way this seems to occur in practice is when a weight vector drops every observation from some level of an indicator variable. It is definitely on my TODO list to find some way to allow zaminfluence to deal with this gracefully.
Note that this analysis leaned heavily on the structure of linear regression. In general, a nearly singular Hessian matrix of the objective function will be associated with non-linear behavior of \(\hat{\theta}(\vec{w}(t))\) along the path from \(1_N\) to \(\vec{w}\). Linear regression is rather a special case.