# R torch for statistics (not just machine learning).

The `torch`

package for R (found here) is
CRAN-installable and provides automatic differentiation in R, as long as you’re
willing to rewrite your code using Torch functions.

The current docs for the `torch`

package are great, but assume you’re interested
in machine learning. But gradients are useful for ordinary statistics, too!
In the notebook below I fit a simple Poisson regression model using `optim`

by implementing the log likelihod and derivatives in torch. Though not really
competitive with (the highly optimized) `lme4::glm`

on this toy example, the
my point is more how easily you can roll your own MLE in R using `torch`

.

The notebook itself can be downloaded here, and an markdown version follows.

# Example of `torch`

for classical stats (Poisson regression)

In this notebook, I’ll show how easy it is to use `torch`

for R to optimize loss functions and compute standard error estimates.

The torch for R website is mostly focused on machine learning applications. The purpose of this notebook is just to show how easy it is to use `torch`

to get gradients and Hessians for your own purposes, including vanilla classical statistics.

I’ll use `torch`

to implement and optimize a Poisson regression loss function and compute standard errors using Fisher information. This is just a toy problem, but by simply dropping the loss into an out-of-the-box optimizer, we get essentially the same answer as the (highly optimized) `lme4`

package in a similar amount of time.

# Installation

One of the big benefits of `torch`

is that it can be installed via CRAN, and so can be easily packaged in with your own R packages without the user having to do a bunch of extra Python nonsense. Installation instructions can be found here.

```
library(lme4)
library(tidyverse)
library(torch)
set.seed(44)
```

Let’s generate some data. The model will be a simple Poisson regression:

\[p(y_n | x_n) = \mathrm{Poisson}(\exp(x_n^T \beta))\]The goal will be to estimate \(\beta\), and standard errors, using maximum likelihood and the inverse Fisher information.

```
n_obs <- 1000
x <- runif(n_obs * 2) %>% matrix(n_obs, 2)
beta_true <- matrix(c(1, 1.8), nrow=2, ncol=1)
lambda_true <- exp(x %*% beta_true)
summary(lambda_true)
y <- rpois(n_obs, lambda_true)
summary(y)
```

```
V1
Min. : 1.021
1st Qu.: 2.570
Median : 3.998
Mean : 4.742
3rd Qu.: 6.349
Max. :15.447
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 4.000 4.664 6.000 25.000
```

Let’s define the log likelihood in `torch`

. Then we can use `torch`

to evaluated gradients of the log likelihood for optimization, and the Hessian for standard errors.

There are two important things to know:

- Torch does not operate on R numeric types. It operates on torch tensors, which can be created with torch_tensor().
- Torch uses only its own functions — not base R! You can typically find the things you need by browsing through the reference material.

I’ll keep torch versions of the data around in a list `tvars`

for easy re-use. And I’ll write a function `EvalLogLik`

, which takes in a torch tensor `beta`

, the data, and returns the log likelihood, again as a torch tensor.

```
tvars <- list()
tvars$x <- torch_tensor(x)
tvars$y <- torch_tensor(y)
EvalLogLikTorch <- function(beta, tvars) {
if (!is(beta, "torch_tensor")) {
stop("beta must be a torch tensor")
}
log_lambda <- torch_matmul(tvars$x, beta)
lp <- torch_sum(tvars$y * log_lambda - torch_exp(log_lambda))
return(lp)
}
# Sanity check that it works
EvalLogLikTorch(torch_tensor(beta_true), tvars)
```

```
torch_tensor
1.72212e+06
[ CPUFloatType{} ]
```

We want to pass the (negative) log likelihood to an R routine as a function to be optimized. So we need to write a wrapper that takes an `R`

numeric type, converts it to a torch tensor, calls `EvalLogProb`

, and converts the result back to an R numeric type.

```
# From now on we'll take `tvars` to be a global variable to save
# writing everything as lambda functions.
EvalLogLik <- function(beta, verbose=FALSE) {
log_lik <- EvalLogLikTorch(torch_tensor(beta), tvars) %>% as.numeric()
if (verbose) {
cat(paste(beta, collapse=", "), ": ", as.character(log_lik, digits=20), "\n")
}
return(log_lik)
}
# Just check that this runs
EvalLogLik(beta_true)
```

```
[1] 1722115.375
```

Now for some magic — a function that returns the gradient of `EvalLogLik`

with respect to beta. This is what we’re using `torch`

for. As before, we want something that we can pass to our optimizer, so that it takes a R numeric value for \(\beta\) as input and returns \(\partial \log p(y \vert x, \beta) / \partial \beta\) as R numeric output.

Unlike before, we call `torch_tensor`

with the extra argument `requires_grad=TRUE`

. That tells `torch`

that we will later want to compute a gradient with respt to this parameter.

We compute the `loss`

(the negative log likelihood) as we would normally.

We then call `autograd_grad`

, which returns the gradient of the first argument’s tensor with respect to the second argument’s tensor using all the computations that have been performed since the tensors were defined. The `autograd_grad`

function returns a list (you can take gradients with respect to multiple inputs), so we just pull out the first element of the list.

```
EvalLogLikGrad <- function(beta) {
beta_ad <- torch_tensor(beta, requires_grad=TRUE)
loss <- EvalLogLikTorch(beta_ad, tvars)
grad <- autograd_grad(loss, beta_ad)[[1]]
return(as.numeric(grad))
}
# Just check that this runs and has the correct dimensions
EvalLogLikGrad(beta_true)
```

```
[1] -440151.25 -691394
```

Now we can use the loss and gradient into a nonlinear optimizer. Sure enough, we get a reasonable estimate, with a somewhat small loss gradient.

(This gradient would ideally be smaller, but the out-of-the-box BFGS isn’t a very good optimization algorithm.)

```
optim_time <- Sys.time()
opt_result <- optim(
fn=EvalLogLik,
gr=EvalLogLikGrad,
method="BFGS",
par=c(0, 0),
control=list(fnscale=-1 / n_obs))
optim_time <- Sys.time() - optim_time
data.frame("Estimate"=opt_result$par, "Truth"=beta_true) %>% print()
cat("\nGradient at BFGS optimum:\n")
print(EvalLogLikGrad(opt_result$par))
beta_hat <- opt_result$par
```

```
Estimate Truth
1 0.9706614 1.0
2 1.7922191 1.8
Gradient at BFGS optimum:
[1] -0.0355956 -0.0324285
```

We can compare the results to what we’d get from the same regression using `lme4::glm`

. Sure enough they match, and run in a comparable amount of time. (I’ve found there’s a fair amount of noise in the timing, and of course this is only reporting a single run, so all that really matters here is that the two are of the same order.)

The glm algorithm (IRLS) tends to do a much better job of optimizing, in the sense that the gradient is smaller at the glm optimum, and the algorithm runs more quickly. Still, BFGS is just a quick-and-dirty choice, and doesn’t require any special structure to the problem.

```
glm_time <- Sys.time()
glm_fit <- glm(
y ~ x1 + x2 - 1,
data=data.frame(y=y, x1=x[, 1], x2=x[, 2]),
start=c(0, 0),
family="poisson")
glm_time <- Sys.time() - glm_time
cat("Difference in coefficients estimated by optim and glm:\t",
max(abs(coefficients(glm_fit) - beta_hat)), "\n")
cat("\nEstimation time (s):\n")
cat("optimization and torch: \t", optim_time, "\n")
cat("glm: \t\t\t\t", glm_time, "\n")
cat("\nGradient at glm optimum:\n")
print(EvalLogLikGrad(coefficients(glm_fit)))
```

```
Difference in coefficients estimated by optim and glm: 1.882297e-05
Estimation time (s):
optimization and torch: 0.1431296
glm: 0.01248574
Gradient at glm optimum:
[1] -3.278255e-05 -7.224083e-05
```

To compute standard errors, we need to compute the negative Hessian matrix of the log likelihood:

\[\hat{\mathcal{I}} := - \left. \frac{\partial \log p(y | x, \beta)} {\partial \beta \partial \beta^T} \right|_{\hat\beta}\]The quantity \(\mathcal{I}\) is the empirical Fisher information, and \(\mathcal{I}^{-1}\) is a standard estimator of the covariance of the MLE \(\hat\beta\) under correct specification.

The Python version of `torch`

has a native function, like `autograd_grad`

, that computes the Hessian directly. Unfortunately, that function has not yet been ported to R. (See this issue on github.) However, we can compute a Hessian by computing the gradients of each row of the gradient, as follows.

```
EvalLogLikHessian <- function(beta) {
beta_ad <- torch_tensor(beta, requires_grad=TRUE)
log_lik <- EvalLogLikTorch(beta_ad, tvars)
# The argument `create_graph` allows `grad` to be itself differentiated, and
# the argument `retain_graph` saves gradient computations to make repeated differentiation
# of the same quantity more efficient.
grad <- autograd_grad(log_lik, beta_ad, retain_graph=TRUE, create_graph=TRUE)[[1]]
# Now we compute the gradient of each element of the gradient, each of which is
# one row of the Hessian matrix.
hess <- matrix(NA, length(beta), length(beta))
for (d in 1:length(grad)) {
hess[d, ] <- autograd_grad(grad[d], beta_ad, retain_graph=TRUE)[[1]] %>% as.numeric()
}
return(hess)
}
# Just check that this runs and has the correct dimensions
EvalLogLikHessian(beta_true)
```

```
[1] -1907663 -1720186
-1720186 -2263947
```

In the code below, `fisher_info`

is precisely \(\hat{\mathcal{I}}\). We can see that the standard errors match one another.

```
fisher_info <- -1 * EvalLogLikHessian(opt_result$par)
torch_se <- solve(fisher_info) %>% diag() %>% sqrt()
glmer_se <- summary(glm_fit)$coefficients[, "Std. Error"]
cat("Difference in estimated standard errors:\t",
max(abs(torch_se - glmer_se)), "\n")
```

```
Difference in estimated standard errors: 3.857842e-07
```