Introduction

Penalized regression models are typically used to learn linear predictors from high-dimensional biomarker panels such as genome-wide genotyping chips and gene expression arrays.

Instead of maximizing the log-likelihood as in classical regression analysis, penalized regression maximizes an objective function defined as the log-likelihood minus a penalty term. Two types of penalty term have been widely used:

  • the L2 (ridge) penalty, proportional to the sum of squared regression coefficients \(\sum_i \beta_i^2\)

  • the L1 (lasso) penalty, proportional to the sum of absolute values of the regression coefficients \(\sum_i \lvert \beta_i \rvert\)

Unless the original scales of the predictor variables convey information about their relevance, it is appropriate to scale all these variables to have the same variance. It is also usual to exclude from the penalty any variables known to be relevant. For instance, in constructing a model to predict clinical outcome from a biomarker panel, we would not usually penalize variables such as age, sex and duration of disease.

With either type of penalty, the model specifies a penalty parameter \(\lambda\) that takes non-negative values. Without Bayesian methods, the only way to learn this penalty is by cross-validation, choosing the value of \(\lambda\) that optimizes predictive performance on test data. The elastic net penalty is a weighted average of the L2 and L1 penalties.
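As a non-Bayesian point of reference, these penalized models can be fitted with the glmnet package, with \(\lambda\) chosen by cross-validation. A minimal sketch, in which xmat (a numeric matrix of predictors) and y (a binary outcome) are placeholder names:

library(glmnet)
## alpha = 0 gives the L2 (ridge) penalty, alpha = 1 the L1 (lasso) penalty,
## and intermediate values give the elastic net
cvfit <- cv.glmnet(xmat, y, family="binomial", alpha=1)
cvfit$lambda.min             # value of lambda chosen by cross-validation
coef(cvfit, s="lambda.min")  # coefficients at that value
## penalty.factor can be set to 0 for variables that are not to be penalized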

In Bayesian terms, the L2 penalty is equivalent to maximizing the posterior density for a model that specifies gaussian priors on the regression coefficients \(\beta_i\) such that

\[ \beta_i \sim \textrm{N}\left(0, \lambda^2 \right) \] where \(\textrm{N} \left(\mu, \sigma^2 \right)\) denotes the gaussian distribution with mean \(\mu\) and variance \(\sigma^2\),

and the L1 penalty is equivalent to specifying a Laplace (double exponential) prior on these coefficients such that

\[ \lvert \beta_i \rvert \sim \textrm{Exponential} \left( \lambda \right)\]

However, in many real-life problems neither of these priors is heavy-tailed enough to be realistic. For instance, when evaluating the predictive performance of a large panel of biomarkers for some clinical outcome, a realistic prior might be that the regression coefficients are very near zero for most biomarkers but large for a few relevant biomarkers.

A fully Bayesian approach allows us to specify more complex priors on the parameters than the simple gaussian (L2 penalty) or Laplace (L1 penalty) distributions. A more general family of priors that includes heavy-tailed distributions is the family of hierarchical shrinkage priors.

These priors have a global scale parameter \(\tau\) but also have one or two levels of local scale parameters that allow some coefficients to “escape” the shrinkage imposed by the global scale parameter. Note for avoidance of confusion: in the conventional notation for models with L1 or L2 penalties, large values of the penalty parameter shrink the effect sizes. In the notation used below, small values of the scale parameters shrink the effect sizes.

With one level of local scale parameters: \[ \beta_i \sim \textrm{N} \left(0, \lambda_i^2 \tau^2 \right) \] \[ \lambda_i \sim t_{\nu}^+ \left(0, 1 \right) \]

where \(t_{\nu}^+ \left(0, 1 \right)\) denotes the half-Student-\(t\) distribution with \(\nu\) degrees of freedom. For \(\nu=1\) this is the half-Cauchy distribution.

The local shrinkage factor \(\kappa_i = 1 / \left( 1 + \lambda_i^2 \right)\) describes the relative shrinkage of the regression coefficient \(\beta_i\) on a scale from 0 (no shrinkage) to 1 (maximal shrinkage). The special case \(\nu = 1\) is known as the horseshoe prior, as the half-Cauchy prior on \(\lambda_i\) is equivalent to a \(\textrm{Beta} \left( \frac{1}{2}, \frac{1}{2} \right)\) prior (which has a horseshoe-like shape) on \(\kappa_i\). A half-Cauchy prior may also be specified for the global scale parameter \(\tau\).

To see whether the prior specification has imposed a sparse solution, we can examine the posterior means of the shrinkage coefficients \(\kappa\). If the likelihood supports a distribution of effect sizes that approximates a spike-and-slab, most of the shrinkage coefficients should be close to 1 (maximal shrinkage) and a few should be close to 0 (minimal shrinkage).
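For instance, given a matrix of posterior draws of the local scale parameters (here called lambda.draws, a placeholder name), the posterior mean shrinkage coefficients could be computed as:

## lambda.draws: posterior draws of the local scales lambda_i (placeholder name),
## one row per draw and one column per penalized coefficient
kappa.draws <- 1 / (1 + lambda.draws^2)  # shrinkage coefficient for each draw
kappa.postmean <- colMeans(kappa.draws)  # posterior mean shrinkage for each coefficient
round(sort(kappa.postmean), 2)           # values near 1: fully shrunk; near 0: escaped shrinkage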

The simple horseshoe prior has limitations: it does not encode prior belief about the plausible size of the largest effects, and the NUTS sampling algorithm often fails to sample the posterior efficiently. Three approaches have been used to overcome this problem:

Setting priors on the hyperparameters

  • The parameter scale_icept encodes the scale of the prior on unpenalized variables, including the intercept. It is usually appropriate to specify a value that reflects your prior belief about the size of the effects associated with these unpenalized variables.

  • The parameters scale_global and nu_global encode the scale and degrees of freedom of the prior on the global scale parameter \(\tau\). Piironen and Vehtari recommend a half-Cauchy prior (nu_global = 1), with the scale_global parameter set so as to encode prior beliefs about the size of the effects.

  • The parameter nu_local encodes the number of degrees of freedom of the prior on the local scale parameters \(\lambda_i\). A value of 1 specifies a half-Cauchy distribution.

  • The parameters slab_scale and slab_df control the prior on the slab width \(c\), and thus encode how much the largest coefficients are regularized. With a small value for slab_scale and a large value for slab_df, these coefficients are regularized as if under a gaussian prior (see the sketch after this list).
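For reference, in the regularized horseshoe parameterization of Piironen and Vehtari these hyperparameters enter the prior roughly as follows (a sketch of that parameterization, with \(\nu_c\) standing for slab_df and \(s\) for slab_scale):

\[ c^2 \sim \textrm{Inv-Gamma}\left( \frac{\nu_c}{2}, \frac{\nu_c s^2}{2} \right), \qquad \beta_i \sim \textrm{N}\left(0, \tilde{\lambda}_i^2 \tau^2 \right), \qquad \tilde{\lambda}_i^2 = \frac{c^2 \lambda_i^2}{c^2 + \tau^2 \lambda_i^2} \]

When \(\tau^2 \lambda_i^2 \ll c^2\) this reduces to the original horseshoe, while coefficients large enough to escape the local shrinkage are regularized approximately as a gaussian with scale \(c\).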

These models can be learned from data using the programs Stan or PyMC3, which generate the posterior distribution of the model parameters including the penalty (scale) parameters. With this fully Bayesian approach there is no need to use cross-validation to choose the penalty parameters, though we may still want to use cross-validation to evaluate predictive performance.

A toy example: the Cleveland Heart Study

The Cleveland Heart Study dataset is described elsewhere. For this example, the outcome variable y has been recoded as binary, and observations with missing values have been dropped. Predictor variables with more than 10 distinct values have been gaussianized, giving a data.frame xdata with 297 observations and 13 predictor variables.
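Here "gaussianized" means transformed to an approximately gaussian distribution; a minimal sketch of one such transform (a rank-based inverse-normal transform, which is an assumption about the transform actually used):

## rank-based inverse-normal transform: one possible way to "gaussianize" a variable
gaussianize <- function(x) qnorm((rank(x) - 0.5) / length(x))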

The columns of the matrix of predictors should be ordered so that unpenalized variables (for instance routinely available clinical variables) come before the variables (such as biomarkers) that are to be penalized.

## objects should be in memory
##  xdata data frame of predictors, unpenalized variables first
##  y vector of responses
##  u.varnames vector of names of unpenalized variables
##  p.varnames vector of names of penalized variables

beta.names <- c(u.varnames, p.varnames)
formulastring <- paste(beta.names, collapse=" + ")
formulastring <- paste("~", formulastring)
model.matrix.formula <- as.formula(formulastring) 

beta.names <- c("intercept", beta.names)
N <- nrow(xdata)
X <- model.matrix(model.matrix.formula, data=xdata) # this puts interaction terms at the right
P <- ncol(X) # number of columns of design matrix including intercept
U <- 1 + length(u.varnames) 

## this fix will work as long as the interaction terms are only between unpenalized variables
which.interaction <- grep(":", colnames(X))
P.i <-  length(which.interaction)
which.unpenalized.main <- 1:(U - P.i)
which.penalized <- (U - P.i + 1):(P - P.i)
X <- X[, c(which.unpenalized.main, which.interaction, which.penalized)]

X.means <- colMeans(X, na.rm=TRUE)
X.sd <- apply(X, 2, sd, na.rm=TRUE)
for(j in 2:ncol(X)) { # standardize predictors
    X[, j] <- X[, j] - X.means[j]
    X[, j] <- X[, j] / X.sd[j]
}

## values smaller than 2 for the three df params cause the PyMC3 sampler to fail to initialize

scale_icept <- 5  # weak prior on unpenalized variables including intercept
scale_global <- 2 # large effects unlikely  
nu_global <- 2  # 1 for half-Cauchy prior on global scale param
nu_local <- 2    # 1 for half-Cauchy, horseshoe+ or horseshoe if c is large
slab_scale <- 2 # small value regularizes largest coeffs 
slab_df <- 2    # 2 for half-Cauchy; a large value specifies a gaussian prior on the slab component
with_penalized <- 1  # flag: 1 to include the penalized variables in the model

save(N, y, X, U, P,
     scale_icept,
     scale_global,
     nu_global,
     nu_local,
     slab_scale,
     slab_df,
     with_penalized,
     file="cleveland.Rdata")

Bayesian inference with Stan and PyMC3

Like BUGS and its later clone JAGS, Stan and PyMC3 are programs for Bayesian inference using Markov chain Monte Carlo (MCMC) to sample the posterior distribution given the data and the model. Key differences from the earlier programs, which use Gibbs sampling to update one parameter at a time, include:

  • the use of gradient-based Hamiltonian Monte Carlo (the NUTS algorithm) to update all parameters jointly, which samples high-dimensional, correlated posteriors far more efficiently

To sample a distribution that has a representation as a scale mixture, Peltola (2014) recommends specifying the distribution as a scale mixture rather than sampling it directly. For instance, a Student-\(t\) distribution with \(\nu\) degrees of freedom can be specified as the marginal distribution of a gaussian whose variance is drawn from an inverse-gamma\((\nu/2, \nu/2)\) distribution. This representation is used below to specify a logistic regression model.
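As a quick check of this scale-mixture representation (a sketch, not part of the model code), we can simulate from the mixture and compare with the corresponding Student-\(t\) distribution:

## simulate beta ~ N(0, sigma2) with sigma2 ~ Inv-Gamma(nu/2, nu/2);
## marginally, beta should follow a Student-t distribution with nu degrees of freedom
nu <- 3
sigma2 <- 1 / rgamma(1e5, shape=nu/2, rate=nu/2)  # inverse-gamma draws
beta <- rnorm(1e5, mean=0, sd=sqrt(sigma2))
qqplot(qt(ppoints(1e5), df=nu), beta)  # points should lie close to the identity line
abline(0, 1)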