Combining score tests and Wald tests in a meta-analysis

The test usually provided in the summary of a regression analysis is the Wald test:

The test statistic $Z$ is calculated as $\hat{\beta} / s$, where $\hat{\beta}$ is the maximum likelihood estimate of the parameter and $s$ is the standard error of this estimate.

Most GWAS programs compute a score test instead: the null model including covariates (such as age, sex and principal component scores) has to be fitted only once, and the program can then loop over SNPs to calculate a score test for each one. This is much faster than fitting a regression model for each SNP tested, especially when using a linear mixed model to allow for relatedness. When the genotypes are not scored directly but imputed as probabilities, the score test can be extended to allow for this uncertainty by averaging over the posterior distribution of the genotypes. This “missing-data likelihood” score test is implemented in SNPTEST.

The score test is based on calculating at the null value ($\beta = 0$):

the score $U$ (gradient of the log-likelihood with respect to $\beta$)
the information $V$ (minus the expectation of the second derivative of the log-likelihood)

Under the null hypothesis the score has mean zero and variance $V$, so the test statistic $Z$ is calculated as $U / \sqrt{V}$.

In a large sample, the log-likelihood is asymptotically quadratic (this holds exactly if the likelihood is gaussian). The Wald test and the score test are then algebraically equivalent, and either one can be calculated from the other:-

$\hat{\beta} = U/V$ and $s = 1/\sqrt{V}$
$V = 1/s^2$ and $U = \hat{\beta} V$
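
The two conversions above can be written as a pair of small functions. This is a minimal sketch in Python; the function names `wald_to_score` and `score_to_wald` and the numerical inputs are illustrative assumptions, not taken from any GWAS package.

```python
import math

def wald_to_score(beta_hat, se):
    """Convert a Wald summary (estimate, standard error) to (score, information)."""
    info = 1.0 / se**2          # V = 1 / s^2
    score = beta_hat * info     # U = beta_hat * V
    return score, info

def score_to_wald(score, info):
    """Convert (score, information) to (estimate, standard error)."""
    beta_hat = score / info         # beta_hat = U / V
    se = 1.0 / math.sqrt(info)      # s = 1 / sqrt(V)
    return beta_hat, se

# Round trip on made-up values: both forms give the same test statistic Z.
u, v = wald_to_score(0.12, 0.04)
z_wald = 0.12 / 0.04
z_score = u / math.sqrt(v)
```

Under the quadratic approximation, `z_wald` and `z_score` agree up to floating-point rounding.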

Meta-analysis

If you have a mixture of studies reported as score tests and studies reported as maximum likelihood estimate $\hat{\beta}$ and standard error $s$, there are two ways to combine them in a meta-analysis, which are algebraically equivalent.

Calculate the meta-analysis using score and information

Convert any studies reported as maximum likelihood estimates and standard errors to corresponding values of score $U$ and information $V$.
Then sum the score and information over the different studies, and calculate the test statistic as the score divided by the square root of the information.
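
A minimal sketch of this calculation in Python follows. The function name `meta_score` and the study values are hypothetical; two studies are assumed to be reported as $(U, V)$ and one as $(\hat{\beta}, s)$.

```python
import math

def meta_score(studies):
    """Combine studies given as (U, V) pairs by summing score and information."""
    total_u = sum(u for u, _ in studies)
    total_v = sum(v for _, v in studies)
    z = total_u / math.sqrt(total_v)
    return total_u, total_v, z

# Hypothetical inputs: two studies reported as (U, V), one as (beta_hat, s).
score_studies = [(30.0, 400.0), (12.0, 100.0)]
beta_hat, se = 0.10, 0.05
converted = (beta_hat / se**2, 1.0 / se**2)   # U = beta_hat / s^2, V = 1 / s^2
total_u, total_v, z = meta_score(score_studies + [converted])
```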

Calculate the meta-analysis using maximum likelihood estimates and standard errors

Convert any studies reported as score and information to corresponding values of maximum likelihood estimate $\hat{\beta}$ and standard error $s$.

Then calculate

weighted average of the maximum likelihood estimates as $\frac{\sum{\hat{\beta}_i / s_i^2}}{\sum{1 / s_i^2}}$, where the weights are the inverse variances
standard error of this weighted average as $1 / \sqrt{ \sum{1/s_i^2} }$
Calculate the test statistic $Z$ as the weighted average estimate of $\beta$ divided by its standard error.
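
The inverse-variance calculation can be sketched in Python as below. The function name `meta_inverse_variance` and the study values are hypothetical; one study is assumed to be reported as $(U, V)$ and is converted first.

```python
import math

def meta_inverse_variance(studies):
    """Combine (beta_hat, s) pairs by inverse-variance weighting."""
    weights = [1.0 / s**2 for _, s in studies]
    beta = sum(w * b for w, (b, _) in zip(weights, studies)) / sum(weights)
    se = 1.0 / math.sqrt(sum(weights))
    return beta, se, beta / se

# Hypothetical inputs, with one study converted from (U, V) = (30, 400).
u, v = 30.0, 400.0
studies = [(u / v, 1.0 / math.sqrt(v)), (0.12, 0.10), (0.10, 0.05)]
beta, se, z = meta_inverse_variance(studies)
```

Because each weight $1 / s_i^2$ is that study's information $V_i$ and each weighted estimate $\hat{\beta}_i / s_i^2$ is its score $U_i$, the resulting `z` is the same as the summed score divided by the square root of the summed information.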

Linear regression tests as an approximation to logistic regression for mixed models with related individuals

When using mixed models to allow for relatedness with a binary outcome, there is no computationally efficient implementation of logistic regression. The usual practice is to approximate the logistic regression by a linear mixed model.

When the effect size is small compared with the variance of the genotypes, the linear regression coefficients will approximate the logistic regression coefficients.

Write

$y$ for the binary outcome,
$x$ for the genotype, coded 0, 1, 2 and standardized to zero mean
$P$ for the predicted value of $y$ (probability that $y=1$) given the covariates included in the null model

Score and information for logistic regression

The contribution of a single observation to the score $U$ and information $V$ at log odds ratio $\beta=0$ is

\[ U = {\left( y - P \right) x} \]

\[ V = \textrm{Var} \left( x \right) P \left( 1 - P \right) \]

This logistic regression model can be approximated by a linear regression, in which $y$ given the covariates has a gaussian distribution with mean $P$ and residual variance $\sigma^2$.

For a linear regression, the contribution of a single observation to the score $U$ and information $V$ at $\beta=0$ is

\[ U = \frac{\left( y - P \right) x}{ \sigma^2 } \]

\[ V = \frac{ \textrm{Var} \left( x \right) }{ \sigma^2 } \]

If the log-likelihood is approximately quadratic, the score test and maximum likelihood estimate for the logistic regression can be approximated by the corresponding results of a linear regression. Setting the residual variance $\sigma^2$ equal to the binomial variance $P \left( 1 - P \right)$ of $y$, to rescale the score (gradient of the log-likelihood) we have to multiply by $P \left( 1 - P \right)$, and to rescale the information we have to multiply by $P^2 \left( 1 - P \right)^2$. Equivalently, to rescale the linear regression coefficient to approximate the logistic regression coefficient, we have to divide by the binomial variance of $y$, which is $P \left( 1 - P \right)$.
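
As an illustration of the rescaling of the coefficient, the sketch below converts a linear regression estimate and its standard error to the log odds ratio scale. The function name `linear_to_log_odds` and the input values are hypothetical, and the sketch assumes a single overall value of $P$ (for instance the proportion of cases) rather than a separate predicted value for each individual.

```python
def linear_to_log_odds(beta_linear, se_linear, p_case):
    """First-order rescaling of a linear regression estimate to the log odds scale.

    Assumes the residual variance of y is the binomial variance P(1 - P),
    with p_case used as a single overall value of P.
    """
    binom_var = p_case * (1.0 - p_case)
    # Estimate and standard error are divided by the same factor,
    # so the test statistic Z is unchanged.
    return beta_linear / binom_var, se_linear / binom_var

beta_lo, se_lo = linear_to_log_odds(0.02, 0.005, 0.2)
```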

Pirinen, Donnelly and Spencer call this the first-order approximation and derive a more accurate second-order approximation.

For the likelihood of the logistic regression model to be approximated accurately by the linear regression model:

the sample size should be large
the proportions $P$, $1 - P$ of cases and controls should not be close to 0 or 1
the effect size $\beta$ of the genotype should be small compared with the standard deviation of the genotype, $\sqrt{2 \phi \left( 1 - \phi \right)}$, where $\phi$ is the allele frequency.

When combining studies analysed with different programs, you may need to check that the scaling of the genotypes is consistent: some programs estimate the effect size after standardizing the genotypes at each SNP to have unit variance.

Score tests based on the missing-data likelihood to allow for uncertain imputation of genotypes

To test for association with imputed genotypes, we should allow for genotype uncertainty. A useful algorithm for constructing tests in this situation is a score test based on the missing-data likelihood. The first use of this algorithm in genetics was by David Clayton to allow for uncertainty in imputing segregation indicators in family-based association studies.

For any realization of the posterior distribution of the missing data, we can calculate the complete data score $U$ and the information $V$ by summing over all observations. Standard results [Dempster et al. 1977] yield the observed score as the posterior expectation of $U$, the missing information as the posterior variance of $U$, and the complete information as the posterior expectation of $V$. The observed information is calculated by subtracting the missing information from the complete information. A useful by-product of this algorithm is that the ratio of observed to complete information (proportion of information extracted) can be used to assess the efficiency of the study in relation to an ideal design in which no data are missing (in this context, where all genotypes are typed or imputed with certainty).
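
The calculation for a single locus can be sketched in Python with NumPy as below. This is an illustration under stated assumptions rather than the SNPTEST implementation: the function name `missing_data_score_test` is hypothetical, genotypes are assumed independent between individuals given the data, the centring constant is treated as fixed, and adjustment of the genotype for covariates is ignored.

```python
import numpy as np

def missing_data_score_test(resid, weight, geno_probs):
    """Score test at beta = 0 averaging over uncertain genotypes.

    resid      : (n,) per-observation score factor, e.g. y - P for logistic regression
    weight     : (n,) per-observation information factor, e.g. P(1 - P)
    geno_probs : (n, 3) posterior probabilities of genotypes 0, 1, 2
    """
    codes = np.array([0.0, 1.0, 2.0])
    dosage = geno_probs @ codes                      # posterior mean genotype
    centre = dosage.mean()                           # centring constant, treated as fixed
    mean_x = dosage - centre                         # E[x_i] after centring
    var_x = geno_probs @ codes**2 - dosage**2        # Var(x_i)
    observed_score = np.sum(resid * mean_x)          # posterior expectation of U
    missing_info = np.sum(resid**2 * var_x)          # posterior variance of U
    complete_info = np.sum(weight * (var_x + mean_x**2))  # posterior expectation of V
    observed_info = complete_info - missing_info
    z = observed_score / np.sqrt(observed_info)
    return observed_score, observed_info, complete_info, z

# Made-up data: three genotypes known with certainty, one uncertain.
y = np.array([0.0, 0.0, 1.0, 1.0])
p = np.full(4, 0.5)
probs = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.5, 0.5, 0.0]])
score, obs_info, comp_info, z = missing_data_score_test(y - p, p * (1 - p), probs)
efficiency = obs_info / comp_info   # proportion of information extracted
```

When every row of `geno_probs` puts probability 1 on a single genotype, the missing information is zero and `efficiency` equals 1.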

This algorithm is implemented in the program SNPTEST, for linear or logistic regression with unrelated individuals and genotypes represented as probabilities.

Extending the score tests based on the missing-data likelihood to mixed models for related individuals

Existing programs for GWAS with related individuals do not allow for genotype uncertainty. Programs such as ProbABEL and GMMAT fit a mixed model under the null hypothesis of no effect of genotype, and compute a score test at each locus using only the genotype dosage (posterior mean) rather than the missing-data likelihood.

In principle it is straightforward to extend the missing-data likelihood score test to mixed models. From the model fitted at the null, we obtain the residual deviations of the response variable from its predicted value, and the covariance matrix of the residual polygenic effects. We can sample these residual polygenic effects from a multivariate gaussian, and we have also the posterior distribution of the genotypes at each locus in each individual. At each locus, a score test can be obtained by averaging over the posterior distribution as above.

For computational efficiency, the expectations of $U$, $U^2$, and $V$ should be evaluated in two steps.

Draw $S$ realizations of the residual polygenic effects and store them in a matrix. For each realization of the residual polygenic effects, accumulate the conditional expectations of $U$, $U^2$, and $V$ over the probability distribution of genotypes.
Calculate the observed score as the average (over all realizations) of the conditional expectation of $U$, and the complete information as the average (over all realizations) of the conditional expectation of $V$. From the law of total variance, we can calculate the missing information (posterior variance of the score) as the sum of the variance of the conditional expectation of $U$ and the expectation of the conditional variance of $U$.

Using angle brackets to denote expectation, and $\theta$ to denote the residual polygenic effects:-

$\textrm{Var}_\theta \left( U \right) = \textrm{Var}_\theta \langle U \rangle + \langle \textrm{Var} \left( U \mid \theta \right) \rangle_\theta$

The conditional expectations at each realization of the polygenic effects can be calculated exactly from the genotype probabilities. Sampling is required only to average over the posterior distribution of residual polygenic effects, and only one set of realizations need be drawn to test all loci.

The missing information can be partitioned into two components:

information missing because of uncertainty about residual polygenic effects: $\textrm{Var}_\theta\langle U \rangle$
information missing because of uncertainty about genotypes: $\langle \textrm{Var} \left( U \mid \theta \right) \rangle_\theta$.
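
The two-step calculation and the partition of the missing information can be sketched in Python with NumPy for a linear mixed model at one locus. This is a sketch under assumptions, not an existing program: the function name `mixed_model_score_test` and the toy inputs are hypothetical, genotypes are assumed independent between individuals given the data, and the dosages are assumed already centred.

```python
import numpy as np

def mixed_model_score_test(resid_draws, dosage, geno_var, sigma2):
    """Two-step missing-data score test at one locus for a linear mixed model.

    resid_draws : (S, n) residuals y - mu - G for S draws of the polygenic effects G
    dosage      : (n,) centred posterior mean genotypes
    geno_var    : (n,) posterior variances of the genotypes
    sigma2      : residual variance
    """
    # Step 1: conditional moments of U given each draw, exact over genotype probabilities.
    cond_mean_u = resid_draws @ dosage / sigma2            # E[U | theta]
    cond_var_u = (resid_draws**2) @ geno_var / sigma2**2   # Var(U | theta)
    # Step 2: average over draws and apply the law of total variance.
    observed_score = cond_mean_u.mean()
    missing_polygenic = cond_mean_u.var()                  # Var_theta <U>
    missing_genotype = cond_var_u.mean()                   # <Var(U | theta)>_theta
    complete_info = np.sum(geno_var + dosage**2) / sigma2  # E[V]; free of theta in this model
    observed_info = complete_info - missing_polygenic - missing_genotype
    return observed_score, observed_info, missing_polygenic, missing_genotype

# Toy inputs: S = 2 draws, n = 3 individuals, one uncertain genotype.
resid_draws = np.array([[1.0, -1.0, 0.5],
                        [0.5, -0.5, 1.0]])
dosage = np.array([-1.0, 0.0, 1.0])
geno_var = np.array([0.0, 0.2, 0.0])
score, obs_info, miss_poly, miss_geno = mixed_model_score_test(
    resid_draws, dosage, geno_var, sigma2=1.0)
```

The matrix `resid_draws` depends only on the null model, so the same draws can be reused for every locus; only `dosage` and `geno_var` change between loci.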

Effect estimates and standard errors can be calculated from the score and information using a quadratic approximation to the log-likelihood.

Sampling from the posterior distribution of residual genotypic effects $G$

With standard software, we can fit a linear mixed model of the form

$\boldsymbol{Y} = \boldsymbol{\mu} + \boldsymbol{G} + \epsilon$

where $\boldsymbol{G} \sim \mathcal{N} \left( \boldsymbol{0}, \boldsymbol{\Sigma} \right)$

and $\epsilon \sim \mathcal{N} \left( 0, \sigma^2 \right)$

For a linear mixed model, the conditional distribution of the residual polygenic effects $\boldsymbol{G}$ given $\boldsymbol{y}$ is

\[ \mathcal{N} \left( \boldsymbol{\Sigma} \left(\boldsymbol{\Sigma} + \sigma^2 \boldsymbol{I} \right)^{-1} \left( \boldsymbol{y} - \boldsymbol{\mu} \right), \boldsymbol{\Sigma} - \boldsymbol{\Sigma} \left( \boldsymbol{\Sigma} + \sigma^2 \boldsymbol{I} \right)^{-1} \boldsymbol{\Sigma} \right) \]
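
A minimal sketch of drawing realizations from this conditional distribution in Python with NumPy follows. The function name `sample_polygenic_effects` and the toy inputs are hypothetical; the sketch assumes that $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ and $\sigma^2$ have already been estimated by fitting the null model, and that the residual covariance is $\sigma^2$ times the identity matrix.

```python
import numpy as np

def sample_polygenic_effects(y, mu, Sigma, sigma2, n_draws, seed=0):
    """Draw realizations of G from its conditional distribution given y."""
    n = len(y)
    # Solve with (Sigma + sigma^2 I) rather than forming the inverse explicitly.
    A = Sigma + sigma2 * np.eye(n)
    post_mean = Sigma @ np.linalg.solve(A, y - mu)
    post_cov = Sigma - Sigma @ np.linalg.solve(A, Sigma)
    post_cov = (post_cov + post_cov.T) / 2.0   # remove rounding asymmetry
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(post_mean, post_cov, size=n_draws)
    return post_mean, post_cov, draws

# Toy inputs: two related individuals.
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
y = np.array([1.0, -1.0])
mu = np.zeros(2)
post_mean, post_cov, draws = sample_polygenic_effects(y, mu, Sigma, 1.0, n_draws=100)
```

Each row of `draws` is one realization of $\boldsymbol{G}$. For large samples a dense solve of this kind is expensive, so a practical implementation would probably exploit the structure of $\boldsymbol{\Sigma}$.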

Notes on statistical analysis of genome-wide association studies: meta-analysis, mixed models to allow for relatedness

Paul McKeigue

8 January 2018
