The torch package for R (found here) is CRAN-installable and provides automatic differentiation in R, as long as you’re willing to rewrite your code using Torch functions.

The current docs for the torch package are great, but assume you’re interested in machine learning. But gradients are useful for ordinary statistics, too! In the notebook below I fit a simple Poisson regression model using optim by implementing the log likelihod and derivatives in torch. Though not really competitive with (the highly optimized) lme4::glm on this toy example, the my point is more how easily you can roll your own MLE in R using torch.

The notebook itself can be downloaded here, and an markdown version follows.

Example of torch for classical stats (Poisson regression)

In this notebook, I’ll show how easy it is to use torch for R to optimize loss functions and compute standard error estimates.

The torch for R website is mostly focused on machine learning applications. The purpose of this notebook is just to show how easy it is to use torch to get gradients and Hessians for your own purposes, including vanilla classical statistics.

I’ll use torch to implement and optimize a Poisson regression loss function and compute standard errors using Fisher information. This is just a toy problem, but by simply dropping the loss into an out-of-the-box optimizer, we get essentially the same answer as the (highly optimized) lme4 package in a similar amount of time.


One of the big benefits of torch is that it can be installed via CRAN, and so can be easily packaged in with your own R packages without the user having to do a bunch of extra Python nonsense. Installation instructions can be found here.



Let’s generate some data. The model will be a simple Poisson regression:

\[p(y_n | x_n) = \mathrm{Poisson}(\exp(x_n^T \beta))\]

The goal will be to estimate \(\beta\), and standard errors, using maximum likelihood and the inverse Fisher information.

n_obs <- 1000
x <- runif(n_obs * 2) %>% matrix(n_obs, 2)
beta_true <- matrix(c(1, 1.8), nrow=2, ncol=1)
lambda_true <- exp(x %*% beta_true)
y <- rpois(n_obs, lambda_true)
 Min.   : 1.021
 1st Qu.: 2.570
 Median : 3.998
 Mean   : 4.742
 3rd Qu.: 6.349
 Max.   :15.447

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   2.000   4.000   4.664   6.000  25.000

Let’s define the log likelihood in torch. Then we can use torch to evaluated gradients of the log likelihood for optimization, and the Hessian for standard errors.

There are two important things to know:

  • Torch does not operate on R numeric types. It operates on torch tensors, which can be created with torch_tensor().
  • Torch uses only its own functions — not base R! You can typically find the things you need by browsing through the reference material.

I’ll keep torch versions of the data around in a list tvars for easy re-use. And I’ll write a function EvalLogLik, which takes in a torch tensor beta, the data, and returns the log likelihood, again as a torch tensor.

tvars <- list()
tvars$x <- torch_tensor(x)
tvars$y <- torch_tensor(y)

EvalLogLikTorch <- function(beta, tvars) {
    if (!is(beta, "torch_tensor")) {
        stop("beta must be a torch tensor")
    log_lambda <- torch_matmul(tvars$x, beta)

    lp <- torch_sum(tvars$y * log_lambda - torch_exp(log_lambda))


# Sanity check that it works
EvalLogLikTorch(torch_tensor(beta_true), tvars)
[ CPUFloatType{} ]

We want to pass the (negative) log likelihood to an R routine as a function to be optimized. So we need to write a wrapper that takes an R numeric type, converts it to a torch tensor, calls EvalLogProb, and converts the result back to an R numeric type.

# From now on we'll take `tvars` to be a global variable to save
# writing everything as lambda functions.
EvalLogLik <- function(beta, verbose=FALSE) {
    log_lik <- EvalLogLikTorch(torch_tensor(beta), tvars) %>% as.numeric()
    if (verbose) {
        cat(paste(beta, collapse=", "), ": ", as.character(log_lik, digits=20), "\n")

# Just check that this runs
[1] 1722115.375

Now for some magic — a function that returns the gradient of EvalLogLik with respect to beta. This is what we’re using torch for. As before, we want something that we can pass to our optimizer, so that it takes a R numeric value for \(\beta\) as input and returns \(\partial \log p(y \vert x, \beta) / \partial \beta\) as R numeric output.

Unlike before, we call torch_tensor with the extra argument requires_grad=TRUE. That tells torch that we will later want to compute a gradient with respt to this parameter.

We compute the loss (the negative log likelihood) as we would normally.

We then call autograd_grad, which returns the gradient of the first argument’s tensor with respect to the second argument’s tensor using all the computations that have been performed since the tensors were defined. The autograd_grad function returns a list (you can take gradients with respect to multiple inputs), so we just pull out the first element of the list.

EvalLogLikGrad <- function(beta) {
    beta_ad <- torch_tensor(beta, requires_grad=TRUE)
    loss <- EvalLogLikTorch(beta_ad, tvars)
    grad <- autograd_grad(loss, beta_ad)[[1]]

# Just check that this runs and has the correct dimensions
[1]  -440151.25   -691394

Now we can use the loss and gradient into a nonlinear optimizer. Sure enough, we get a reasonable estimate, with a somewhat small loss gradient.

(This gradient would ideally be smaller, but the out-of-the-box BFGS isn’t a very good optimization algorithm.)

optim_time <- Sys.time()
opt_result <- optim(
    par=c(0, 0),
    control=list(fnscale=-1 / n_obs))
optim_time <- Sys.time() - optim_time

data.frame("Estimate"=opt_result$par, "Truth"=beta_true) %>% print()

cat("\nGradient at BFGS optimum:\n")

beta_hat <- opt_result$par
   Estimate Truth
1 0.9706614   1.0
2 1.7922191   1.8

Gradient at BFGS optimum:
[1] -0.0355956 -0.0324285

We can compare the results to what we’d get from the same regression using lme4::glm. Sure enough they match, and run in a comparable amount of time. (I’ve found there’s a fair amount of noise in the timing, and of course this is only reporting a single run, so all that really matters here is that the two are of the same order.)

The glm algorithm (IRLS) tends to do a much better job of optimizing, in the sense that the gradient is smaller at the glm optimum, and the algorithm runs more quickly. Still, BFGS is just a quick-and-dirty choice, and doesn’t require any special structure to the problem.

glm_time <- Sys.time()
glm_fit <- glm(
    y ~ x1 + x2 - 1,
    data=data.frame(y=y, x1=x[, 1], x2=x[, 2]),
    start=c(0, 0),
glm_time <- Sys.time() - glm_time

cat("Difference in coefficients estimated by optim and glm:\t",
    max(abs(coefficients(glm_fit) - beta_hat)), "\n")

cat("\nEstimation time (s):\n")
cat("optimization and torch: \t", optim_time, "\n")
cat("glm: \t\t\t\t", glm_time, "\n")

cat("\nGradient at glm optimum:\n")

Difference in coefficients estimated by optim and glm:	 1.882297e-05

Estimation time (s):
optimization and torch: 	 0.1431296
glm: 				 0.01248574

Gradient at glm optimum:
[1] -3.278255e-05 -7.224083e-05

To compute standard errors, we need to compute the negative Hessian matrix of the log likelihood:

\[\hat{\mathcal{I}} := - \left. \frac{\partial \log p(y | x, \beta)} {\partial \beta \partial \beta^T} \right|_{\hat\beta}\]

The quantity \(\mathcal{I}\) is the empirical Fisher information, and \(\mathcal{I}^{-1}\) is a standard estimator of the covariance of the MLE \(\hat\beta\) under correct specification.

The Python version of torch has a native function, like autograd_grad, that computes the Hessian directly. Unfortunately, that function has not yet been ported to R. (See this issue on github.) However, we can compute a Hessian by computing the gradients of each row of the gradient, as follows.

EvalLogLikHessian <- function(beta) {
    beta_ad <- torch_tensor(beta, requires_grad=TRUE)
    log_lik <- EvalLogLikTorch(beta_ad, tvars)
    # The argument `create_graph` allows `grad` to be itself differentiated, and
    # the argument `retain_graph` saves gradient computations to make repeated differentiation
    # of the same quantity more efficient.
    grad <- autograd_grad(log_lik, beta_ad, retain_graph=TRUE, create_graph=TRUE)[[1]]

    # Now we compute the gradient of each element of the gradient, each of which is
    # one row of the Hessian matrix.
    hess <- matrix(NA, length(beta), length(beta))
    for (d in 1:length(grad)) {
        hess[d, ] <- autograd_grad(grad[d], beta_ad, retain_graph=TRUE)[[1]] %>% as.numeric()

# Just check that this runs and has the correct dimensions
[1] -1907663  -1720186
    -1720186  -2263947

In the code below, fisher_info is precisely \(\hat{\mathcal{I}}\). We can see that the standard errors match one another.

fisher_info <- -1 * EvalLogLikHessian(opt_result$par)

torch_se <- solve(fisher_info) %>% diag() %>% sqrt()
glmer_se <- summary(glm_fit)$coefficients[, "Std. Error"]

cat("Difference in estimated standard errors:\t",
    max(abs(torch_se - glmer_se)), "\n")
Difference in estimated standard errors:	 3.857842e-07