Concordance rate and recurrence risk ratio

For a binary trait, the resemblance between relatives can be quantified in terms of the concordance rate.

The probandwise concordance rate is defined as the proportion \(K_{R}\) of affected individuals among the relatives of previously defined cases. This rate is the conditional probability that a person is affected, given that his/her relative is affected. The probandwise concordance rate is equivalent to the “recurrence risk” that clinical geneticists calculate to advise the parents of a child affected by a genetic disease of the risk that their next child will also be affected. Thus a rare Mendelian recessive disorder will have a probandwise concordance rate in siblings of 0.25. Dividing the concordance rate by the risk in the population at large yields the recurrence risk ratio \(\lambda_{R}\).
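As a numerical illustration (the population risk here is an assumed value, chosen only for the example): for a rare Mendelian recessive disorder the probandwise concordance rate in siblings is 0.25 regardless of how rare the disease is, so the recurrence risk ratio grows as the population risk shrinks.

```python
K_S = 0.25          # probandwise concordance rate in siblings (Mendelian recessive)
K = 0.004           # assumed population risk, for illustration only
lambda_S = K_S / K  # sibling recurrence risk ratio
print(lambda_S)
```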

Recurrence risks for a single-locus two-allele model

For a single-locus two-allele model, we can calculate the recurrence risks from the penetrances and allele frequencies, given the population risk \(K\).

Defining the additive variance \(V_A\) and dominance variance \(V_D\) on an arithmetic scale of risk, the genetic covariance between two relatives with coefficient of relationship \(R\) is

\(\textrm{Cov}(G_{i}, G_{j}) = RV_{A}+ k_{2} V_{D}\), where \(k_2\) is the probability that the two relatives share two alleles identical by descent,

and the recurrence risk \(K_R\) is \(K_R = K + \frac{R V_A + k_2 V_D}{K}\)

For full sibs the sibling recurrence risk ratio \(\lambda_S =1+\frac{\textstyle{1 \over 2}V_A +\textstyle{1 \over 4}V_D }{K^2}\)

and the parent-offspring recurrence risk ratio \(\lambda_1 =1+\frac{\textstyle{1 \over 2}V_A }{K^2}\)
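These relations can be sketched directly; the variance components and population risk below are assumed values for illustration (for full sibs \(R = 1/2\), \(k_2 = 1/4\); for parent-offspring \(R = 1/2\), \(k_2 = 0\)):

```python
def recurrence_risk(K, V_A, V_D, R, k2):
    """Recurrence risk K_R = K + (R*V_A + k2*V_D)/K, with V_A and V_D
    defined on the arithmetic scale of risk."""
    return K + (R * V_A + k2 * V_D) / K

# Illustrative (assumed) variance components and population risk
K, V_A, V_D = 0.05, 0.002, 0.0005

lambda_S = recurrence_risk(K, V_A, V_D, 0.5, 0.25) / K  # full sibs
lambda_1 = recurrence_risk(K, V_A, V_D, 0.5, 0.0) / K   # parent-offspring
print(lambda_S, lambda_1)
```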

Write \(p\) and \(q\) for the frequencies of the two alleles A\(_{1}\) and A\(_{2}\), and \(f_{i}\) for the penetrance of the genotype with \(i\) copies of the A\(_{1}\) allele.

\(V_{A} = 2pq [p(f_{2} - f_{1}) + q(f_{1} - f_{0})]^{2}\)

and \(V_{D} = p^{2}q^{2}(f_{2} - 2f_{1} + f_{0})^{2}\)

Assuming Hardy-Weinberg equilibrium, the population risk \(K\) is given by

\(K = p^{2}f_{2} + 2pqf_{1} + q^{2}f_{0}\)
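Putting the single-locus formulae together (the penetrances and allele frequency below are assumed, purely illustrative values):

```python
def single_locus(p, f0, f1, f2):
    """V_A, V_D and population risk K for one biallelic locus under
    Hardy-Weinberg equilibrium; p is the frequency of A1 and f_i is the
    penetrance of the genotype with i copies of A1."""
    q = 1 - p
    V_A = 2 * p * q * (p * (f2 - f1) + q * (f1 - f0)) ** 2
    V_D = (p * q) ** 2 * (f2 - 2 * f1 + f0) ** 2
    K = p ** 2 * f2 + 2 * p * q * f1 + q ** 2 * f0
    return V_A, V_D, K

# Illustrative (assumed) penetrances and allele frequency
V_A, V_D, K = single_locus(p=0.3, f0=0.01, f1=0.02, f2=0.04)
lambda_S = 1 + (0.5 * V_A + 0.25 * V_D) / K ** 2  # sibling
lambda_1 = 1 + 0.5 * V_A / K ** 2                 # parent-offspring
print(lambda_S, lambda_1)
```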

We can simplify the expression for the recurrence risk ratio by assuming a multiplicative model for the joint effects of the two alleles in which \(f_{2}/f_{1} = f_{1}/f_{0}=\gamma\), the genotypic risk ratio.

From the formulae given above for the additive and dominance components of variance, we can derive expressions for the sibling recurrence risk ratio \(\lambda_{S}\) and the parent-offspring recurrence risk ratio \(\lambda_1\) as

\(\lambda_S = \left( 1 + \frac{1}{2} w \right)^2\) and \(\lambda_1 = 1 + w\) where \(w=\frac{pq\left( {\gamma -1} \right)^2}{\left( {p\gamma + q} \right)^2}\)
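The closed-form expressions under the multiplicative model can be sketched as follows (the values of \(p\) and \(\gamma\) are assumed, for illustration; note that \(\gamma = 2\) with \(p = 0.3\) corresponds to penetrances in the ratio 1 : 2 : 4, so the result agrees with the variance-component route):

```python
def lambdas_multiplicative(p, gamma):
    """lambda_S and lambda_1 under the multiplicative model
    f2/f1 = f1/f0 = gamma."""
    q = 1 - p
    w = p * q * (gamma - 1) ** 2 / (p * gamma + q) ** 2
    return (1 + w / 2) ** 2, 1 + w

lam_S, lam_1 = lambdas_multiplicative(p=0.3, gamma=2.0)
print(lam_S, lam_1)
```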

For values of \(\gamma\) close to 1, \(\gamma - 1 \approx \log{\gamma} = \beta\), the log of the genotypic risk ratio, and \(\log{\lambda_S} \approx \log{\lambda_1} \approx w \approx p q \beta^2\).

For a rare disease, or with incidence density sampling, the risk ratio is equal to the rate ratio and \(\beta\) is the log odds ratio for the effect of genotype on the outcome. The expression \(p q \beta^2\) is then half the square of the standardized log odds ratio \(\sqrt{2 p q} \beta\).
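A numerical check of this small-effect approximation, with \(p\) and \(\gamma\) assumed for illustration (\(\gamma = 1.1\), so both \(\gamma - 1\) and \(\beta\) are small):

```python
import math

p, gamma = 0.3, 1.1  # assumed illustrative values, gamma close to 1
q = 1 - p
beta = math.log(gamma)
w_exact = p * q * (gamma - 1) ** 2 / (p * gamma + q) ** 2
approx = p * q * beta ** 2  # small-effect approximation to log(lambda)
print(w_exact, approx)
```

For \(\gamma = 1.1\) the approximation agrees with the exact value of \(w\) to within a few per cent.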

Relation of expected information for discrimination to a logistic polygenic model

As argued here, a natural way to quantify the relationship of predictors to a binary outcome is the expected information for discrimination (expected weight of evidence) \(\Lambda\).

For a polygenic model (the total genetic effect is generated by many loci of small effect), the weight of evidence in favour of true status (case or control) has, asymptotically, a gaussian distribution with variance twice its expectation. The \(C\)-statistic can be calculated from the expected weight of evidence, but there is no obvious advantage in doing this as \(\Lambda\) has a more intuitive interpretation.

An equivalent result has been noted previously by Pharoah et al (2002) and Clayton (2009), without the interpretation in terms of weight of evidence. They showed that a logistic regression of the outcome on the polygenic score (scaled to have a regression coefficient of 1) implies that the class-conditional variance of the polygenic score (assumed to be equal in cases and controls) is equal to the difference between the means of its distributions in cases and controls.

Thus the expected information for discrimination \(\Lambda\) (in natural logarithms) is half the variance of the polygenic score in a logistic model (under the true model). For a model learned from a training dataset, \(\Lambda\) should be estimated not from the variance of the polygenic score (which might overfit the training dataset) but from test data, by subtracting the prior log odds from the posterior log odds and averaging over all observations in the test dataset.
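A simulation sketch of this recipe (the simulation set-up and all numbers are assumptions made for illustration): the polygenic score is drawn as gaussian within each class with equal variance \(\sigma^2\) and mean difference \(\sigma^2\), so the true \(\Lambda = \sigma^2/2\), and averaging the weight of evidence in favour of true status over test observations recovers it.

```python
import math
import random

random.seed(42)

sigma2 = 2 * math.log(4)  # chosen so that Lambda = log(4) nats = 2 bits
sd = math.sqrt(sigma2)

def woe_case(g):
    # posterior minus prior log odds of being a case (log likelihood ratio)
    # for a score g, with class means 0 (controls) and sigma2 (cases)
    return g - sigma2 / 2

# Average weight of evidence in favour of true status over a "test" sample
n = 100_000
total = 0.0
for _ in range(n):
    if random.random() < 0.5:                        # sample a case
        total += woe_case(random.gauss(sigma2, sd))
    else:                                            # sample a control
        total += -woe_case(random.gauss(0.0, sd))

Lambda_hat = total / n
print(Lambda_hat / math.log(2))  # close to 2 bits
```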

Clayton noted a result first shown by Good (1963): additivity of effects on a logistic scale maximizes the entropy of the outcome \(Y\) given the predictors \(X\) subject to the constraint that the fitted averages of the conditional distributions of \(Y \mid X\) equate to the observed averages of these distributions on the training data. This supports using the logistic regression model unless other information about the dependence of \(Y\) on \(X\) is available: to use any other model would be to use information we do not have. I’ll use the term logistic polygenic model to denote a model in which polygenic effects on the outcome are additive on the logistic scale.

Relation of the recurrence risk ratio to the expected information for discrimination

Clayton also showed that under a polygenic additive model, \(\Lambda = \log{\lambda_S}\), where \(\lambda_S\) is the sibling recurrence risk ratio. So, for instance, where the sibling recurrence risk ratio is 4, the expected genetic information for discrimination is 2 bits and the optimal \(C\)-statistic of a genetic predictor (if complete genotypes and a training sample of infinite size were available) is 0.88.
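These worked figures can be checked directly. Under the gaussian approximation (weight of evidence with variance twice its expectation), the \(C\)-statistic is \(\Phi(\sqrt{\Lambda})\) with \(\Lambda\) in natural log units:

```python
import math
from statistics import NormalDist

lambda_S = 4.0
Lambda = math.log(lambda_S)       # Clayton: Lambda = log(lambda_S), in nats
bits = Lambda / math.log(2)       # expected information in bits
C = NormalDist().cdf(math.sqrt(Lambda))  # C-statistic under the gaussian model
print(bits, round(C, 2))
```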

Under the logistic polygenic model, the log recurrence risk ratio scales inversely with the degree of relationship. Collins et al used this to examine how well the logistic polygenic model fits the observed recurrence risks. They used the term “beta model” for the logistic polygenic model; their beta parameter corresponds to \(\Lambda\) expressed in natural log units. They examined the fit of this model to the observed recurrence risks in relatives for twelve complex diseases: only for Alzheimer disease and for schizophrenia was there strong evidence of departure from a model in which effects are additive on a logistic scale.
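The inverse scaling can be sketched as follows (ignoring dominance, so that the log recurrence risk ratio halves with each degree of relationship; the first-degree value is assumed for illustration):

```python
import math

lambda_first = 3.0  # assumed first-degree recurrence risk ratio
log_l1 = math.log(lambda_first)
# predicted recurrence risk ratios for first-, second- and third-degree relatives
predicted = {d: math.exp(log_l1 / 2 ** (d - 1)) for d in (1, 2, 3)}
print(predicted)
```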

The probit model and “heritability” of a binary trait

The approach most widely used in quantitative genetics is to quantify the genetic effects as “heritability” defined in a probit regression model as the ratio of the variance of the polygenic effect to the total variance.

Both the probit model and the logistic regression model can be written in the form of latent variable models. In this liability-threshold formulation of the model, the liability is a latent variable, equal to the linear predictor plus a random variable representing noise or error. Affected individuals are those in whom the liability is greater than zero. In the probit model, the error has a gaussian distribution with zero mean and variance fixed (conventionally) at 1. In the logistic regression model, the error has a logistic distribution, which is more heavy-tailed than the gaussian distribution.

The logistic distribution has a variance of \(\pi^2 / 3\), so an approximation to the genetic variance in a logistic model can be calculated from the genetic variance in a probit model by multiplying by \(\pi^2 / 3\).
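As a one-line sketch of this conversion:

```python
import math

def probit_to_logistic_variance(v):
    """Approximate conversion of a probit-scale genetic variance to the
    logistic scale: multiply by pi^2/3, the variance of the standard
    logistic distribution."""
    return v * math.pi ** 2 / 3

print(probit_to_logistic_variance(1.0))
```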

Transforming variance from arithmetic scale to probit scale

To estimate genetic variance on a logistic scale from genetic relationship matrices, we have to fit a logistic mixed model with a large covariance matrix. Methods for this have recently become available: they include the programs GMMAT and SAIGE.

Most studies quantifying genetic effects on binary traits use the GCTA program to estimate heritability. GCTA does not fit a probit regression, but simply estimates the genetic variance of the outcome on an arithmetic scale. This is transformed to estimate heritability on a (probit) liability scale using a relationship first derived by Robertson (1949, see appendix).

Rearranged, this gives

\(h^2_{\mathrm{liability}} = h^2_{\mathrm{binary}} \frac{P \left( 1 - P \right)}{f \left( T \right)^2}\)

where \(P\) is the prevalence of the trait, \(T = \Phi^{-1} \left( 1 - P \right)\) is the liability threshold, and \(f \left( \cdot \right)\) is the standard gaussian density function.
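A minimal sketch of this transform in Python (using the standard-library NormalDist; the gaussian density is evaluated at the liability threshold \(\Phi^{-1}(1-P)\), and the heritability and prevalence below are assumed illustrative values):

```python
from statistics import NormalDist

def liability_h2(h2_binary, P):
    """Transform heritability estimated on the arithmetic (0/1) scale to the
    liability (probit) scale for a trait with prevalence P."""
    nd = NormalDist()
    t = nd.inv_cdf(1 - P)  # liability threshold
    z = nd.pdf(t)          # standard gaussian density at the threshold
    return h2_binary * P * (1 - P) / z ** 2

# Illustrative (assumed) values: 5% heritability on the binary scale, 1% prevalence
print(round(liability_h2(0.05, 0.01), 3))
```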

A more complicated formula has to be used where cases have been oversampled, as in a case-control study.

Practical implications

For most complex diseases, genetic effects are not large enough for genotypic prediction on its own to be a good predictor, though genotypic prediction may be useful for identifying rare subtypes and as an adjunct to clinical predictors. For instance, estimates of the sibling recurrence risk ratio for coronary heart disease are in the range 2 to 3, equivalent to an expected information for discrimination of 1 to 1.6 bits. This is the predictive performance of an optimal model: a realistic model would not come near this. For comparison, a good predictor based on clinical data and biomarkers would be considered to be one with at least 3 bits of information for discrimination.
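The bits quoted for coronary heart disease follow from the relation \(\Lambda = \log{\lambda_S}\), taking the logarithm to base 2:

```python
import math

# Expected information for discrimination (bits) implied by a sibling
# recurrence risk ratio, using Lambda = log2(lambda_S)
for lam_S in (2.0, 2.5, 3.0):
    print(lam_S, round(math.log2(lam_S), 2))
```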