Calculating sample size required for given power and threshold p-value

Even if you plan to use other approaches to hypothesis testing, or if your objective is to construct a predictor rather than to test a null hypothesis about the effect of a single variable, reviewers of your grant applications will probably expect to see a power calculation somewhere.

This note describes a simple method for a statistical power calculation that can be applied to almost any test of a null hypothesis against an alternative.

We want a power calculation for the null hypothesis \(\mathcal{H}_0\) that a parameter \(\theta = 0\) against the alternative \(\mathcal{H}_1\) that \(\theta = \theta_1\).

It is usually best to specify the parameter \(\theta\) on a scale that allows it to take values anywhere on the real line (from minus infinity to plus infinity), so that a gaussian approximation to the likelihood (equivalent to a quadratic approximation to the log-likelihood) will be accurate. So if the parameter of interest is a rate ratio, specify the logarithm of the rate ratio. If you want to detect a difference between proportions, specify the parameter as a log odds ratio rather than as a difference between proportions. A gaussian approximation to the likelihood of the difference between proportions (which is bounded between \(-1\) and \(1\)) will be less accurate than a gaussian approximation to the likelihood of the log odds ratio (which is unbounded).
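
For example, a quick sketch of this conversion for two hypothetical proportions (0.10 and 0.15):

p0 <- 0.10 # proportion in the reference group (hypothetical)
p1 <- 0.15 # proportion in the comparison group (hypothetical)
theta <- log((p1 / (1 - p1)) / (p0 / (1 - p0))) # log odds ratio, unbounded
theta # about 0.46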

Write \(I\) for the Fisher information about \(\theta\) from a single observation, and \(N\) for the number of observations. The Fisher information is minus the expectation of the second derivative of the log-likelihood with respect to \(\theta\).
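
In symbols, writing \(\ell(\theta)\) for the log-likelihood of a single observation:

\[ I = -\mathbb{E}\left[ \frac{\partial^2 \ell(\theta)}{\partial \theta^2} \right] \]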

Under the null hypothesis, the sampling distribution of the maximum likelihood estimate \(\hat{\theta}\) is asymptotically gaussian with mean 0 and variance \(\frac{1}{N I}\).

Under the alternative, \(\hat{\theta}\) is asymptotically gaussian with mean \(\theta_1\) and variance \(\frac{1}{N I}\).

For Type 1 error probability \(\alpha\) (two-sided threshold p-value \(\alpha\)) and Type 2 error probability \(\beta\), we can write down the value of \(\theta_1\) at which, given \(\mathcal{H}_1\), there is probability \(1 - \beta\) that the absolute value of the maximum likelihood estimate \(\hat{\theta}\) will exceed quantile \((1 - \alpha / 2)\) of the distribution of \(\hat{\theta}\) under \(\mathcal{H}_0\):

\[ Z_{1 - \alpha/2} + Z_{1 - \beta} = \theta_1 \sqrt{N I} \]

where \(Z_q\) is quantile \(q\) of the standard normal distribution. This can be rearranged to give

\[ N = \frac{\left( Z_{1 - \alpha/2} + Z_{1 - \beta} \right)^2 }{I \theta_1^2} \]

For a typical criterion of 90% power at a threshold p-value of 0.01, we have \(Z_{1 - \alpha/2} = 2.58\) and \(Z_{1 - \beta} = 1.28\), and this expression becomes

\[ N = \frac{14.9}{I \theta_1^2} \]
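
As a quick check of this constant in R:

round((qnorm(1 - 0.01/2) + qnorm(1 - 0.1))^2, 1) # 14.9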

We just need to plug in the appropriate expression for the Fisher information \(I\). Expressions for common statistical models are given below, with snippets of R code:

alpha <- 0.01 # threshold p-value = Type 1 error probability
beta <- 0.1   # Type 2 error probability = 1 - power

# Linear regression: I = var.X / var.Y
# theta, linear regression coefficient
# var.Y, residual variance of Y
# var.X, variance of predictor X
N <- (qnorm(1 - alpha/2) + qnorm(1 - beta))^2 * var.Y / (theta^2 * var.X)

# Correlation
# rho, correlation coefficient
N <- (qnorm(1 - alpha/2) + qnorm(1 - beta))^2 / rho^2

# Logistic regression: I = p * (1 - p) * var.X
# p, proportion of cases
# theta, logistic regression coefficient (log odds ratio)
# var.X, variance of predictor X
N <- (qnorm(1 - alpha/2) + qnorm(1 - beta))^2 / (p * (1 - p) * var.X * theta^2)

# Cox (or Poisson) regression: I = lambda * var.X per person-year
# lambda, hazard rate per person-year
# var.X, variance of predictor X
# theta, log rate ratio
# N here is the required number of person-years at risk
N <- (qnorm(1 - alpha/2) + qnorm(1 - beta))^2 / (lambda * var.X * theta^2)
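
As a usage example of the logistic regression formula, with hypothetical values (outcome prevalence 0.1, binary exposure with prevalence 0.3, odds ratio 1.5 to detect):

p <- 0.1                 # proportion of cases (hypothetical)
p.x <- 0.3               # prevalence of binary exposure (hypothetical)
var.X <- p.x * (1 - p.x) # variance of the binary exposure
theta <- log(1.5)        # log odds ratio to detect (hypothetical)
ceiling((qnorm(1 - alpha/2) + qnorm(1 - beta))^2 / (p * (1 - p) * var.X * theta^2)) # 4789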

Example: in a population-based cohort of people with diabetes, new drugs were introduced five years ago. The total numbers \(N\) of person-years at risk for the eight classes of drug are respectively 1.65, 1.2, 1.05, 3.9, 8.25, 6.75, 6.75 and 21.75 million. The numbers of exposed person-years are respectively 40, 11, 20, 71, 189, 50, 90 and 329 thousand. The rates (per 1000 person-years at risk) of the two adverse outcomes under study are: cardiovascular disease 9, hip fracture 1. For 90% power to detect an effect at a threshold \(p\)-value of 0.01, we rearrange the equation above so that the log rate ratio \(\theta\) is on the left side. You would usually present the effect sizes as rate ratios.
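
Rearranged, with \(\mathrm{var}(X) = p(1 - p)\) for a binary exposure of prevalence \(p\), the minimum detectable log rate ratio is

\[ \theta = \frac{Z_{1 - \alpha/2} + Z_{1 - \beta}}{\sqrt{\lambda \, p (1 - p) \, N}} \]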

alpha <- 0.01
beta <- 0.1
N <- 1E6 * c(1.65, 1.2, 1.05, 3.9, 8.25, 6.75, 6.75, 21.75) # total person-years at risk
N.x <- 1E3 * c(40, 11, 20, 71, 189, 50, 90, 329) # exposed person-years
p <- N.x / N # proportion of person-years exposed
var.X <- p * (1 - p) # variance of binary exposure indicator
lambda <- 9 / 1000 # cardiovascular disease event rate per person-year
theta <- (qnorm(1 - alpha/2) + qnorm(1 - beta)) / sqrt(lambda * var.X * N) # minimum detectable log rate ratio
cvd <- round(exp(theta), 2) # detectable rate ratio for cardiovascular disease
lambda <- 1 / 1000 # hip fracture event rate per person-year
theta <- (qnorm(1 - alpha/2) + qnorm(1 - beta)) / sqrt(lambda * var.X * N)
hip.fracture <- round(exp(theta), 2) # detectable rate ratio for hip fracture
library(pander)
pander(data.frame(cvd, hip.fracture)) # detectable rate ratios by drug class

For a comparison of means between two groups of sizes \(N\) and \(rN\), with common within-group standard deviation \(\sigma\), the Fisher information about the difference \(\theta\) between the group means is

\[ \frac{N}{\sigma^2}\frac{r}{1+r} \]

so the required size \(N\) of the first group is

# sigma, within-group standard deviation
# theta, difference between group means
# r, ratio of the second group's size to the first's
N <- (qnorm(1 - alpha/2) + qnorm(1 - beta))^2 * sigma^2 * (1 + r) / (r * theta^2)

For two groups of equal size \(N\) (that is, \(r = 1\)), this expression becomes

N <- (qnorm(1 - alpha/2) + qnorm(1 - beta))^2 * sigma^2 * 2 / theta^2

Thus for 90% power to detect a difference between two groups of equal size at a threshold p-value of 0.01, the required number in each group is 30 for a difference of one standard deviation, and 119 for a difference of half a standard deviation.
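
As a quick check of these numbers in R (with \(\theta\) in units of \(\sigma\)):

alpha <- 0.01
beta <- 0.1
round((qnorm(1 - alpha/2) + qnorm(1 - beta))^2 * 2 / c(1, 0.5)^2) # 30 119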

Extension to two-step Mendelian randomization tests

In a two-step Mendelian randomization study, the first step is to construct a genetic predictor of an intermediate variable \(X\) from a genome-wide association study of \(X\); summary statistics can be used if individual-level data are not available. In the second step, the expected value of \(X\) given genotype is calculated in a genome-wide association study of an outcome variable \(Y\), and the outcome \(Y\) is regressed on this expected value. If the assumptions underlying the Mendelian randomization argument hold, the slope of this regression estimates the causal effect of \(X\) on \(Y\).
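
As an illustration, a minimal sketch of the two steps in R, using simulated data with hypothetical effect sizes (the variable names and effect sizes are placeholders, not from any particular study):

set.seed(1)
G <- rbinom(1000, 2, 0.3)   # genotype coded 0/1/2 (simulated)
X <- 0.5 * G + rnorm(1000)  # intermediate variable (hypothetical effect of G)
Y <- 0.3 * X + rnorm(1000)  # outcome (hypothetical causal effect 0.3)
step1 <- lm(X ~ G)          # step 1: genetic predictor of X
X.hat <- fitted(step1)      # expected value of X given genotype
step2 <- lm(Y ~ X.hat)      # step 2: regress Y on expected X given genotype
coef(step2)["X.hat"]        # slope estimates the causal effect of X on Y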

If the association between \(Y\) and \(X\) is causal, the sample size required to detect the effect of the expected value of \(X\) on \(Y\) given genotype will scale inversely with the proportion of information about \(X\) that is contained in the genotypic predictor. For a linear regression of \(Y\) on \(X\) where the squared correlation is \(R_{YX}^2\) and the proportion of variance in \(X\) explained by the genotypic predictor is \(R_{XG}^2\), the required sample size is

\[ N = \frac{\left( Z_{1 - \alpha/2} + Z_{1 - \beta} \right)^2 }{R_{YX}^2 R_{XG}^2} \]
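
In the style of the snippets above, this can be computed in R (R2.yx and R2.xg are placeholder variable names):

# R2.yx, proportion of variance in Y explained by X
# R2.xg, proportion of variance in X explained by the genotypic predictor
N <- (qnorm(1 - alpha/2) + qnorm(1 - beta))^2 / (R2.yx * R2.xg)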