Introduction

ADMIXMAP is a general-purpose program for modelling admixture, using marker genotypes and trait data on a sample of individuals from an admixed population (such as African-Americans), where the markers have been chosen to have extreme differentials in allele frequencies between two or more of the ancestral populations between which admixture has occurred. The main difference between ADMIXMAP and classical programs for estimation of admixture such as ADMIX is that ADMIXMAP is based on a multilevel model for the distribution of individual admixture in the population and the stochastic variation of ancestry on hybrid chromosomes. This makes it possible to model the associations of ancestry between linked marker loci, and the association of a trait with individual admixture or with ancestry at a linked marker locus.

Possible uses of the ADMIXMAP program:

Modelling the distribution of individual admixture values and the history of admixture (inferred by modelling the stochastic variation of ancestry along chromosomes)
Case-control, cross-sectional or cohort studies that test for a relationship between disease risk and individual admixture
Localizing genes underlying ethnic differences in disease risk by admixture mapping
Controlling for population structure (variation in individual admixture) in genetic association studies so as to eliminate associations with unlinked genes
Reconstructing the genetic structure of an ancestral population where unadmixed modern descendants are not available for study

ADMIXMAP can model admixture between more than two populations, and can use data from multi-allelic or biallelic marker polymorphisms. The program has been developed for application to admixed human populations, but can also be used to model admixture in livestock or for fine mapping of quantitative trait loci in outbred stocks of mice.

ADMIXMAP is designed to analyse datasets that consist of trait measurements and genotype data on a sample of individuals from an admixed or stratified population. Although the name of the program reflects its origins as a program designed for admixture mapping, it has wider uses, especially in genetic association studies. The study design can be a cross-sectional survey of a quantitative trait or binary outcome, a case-control study or a cohort study. For admixture to be modelled efficiently, at least some of the loci typed should be "ancestry-informative markers": markers chosen to have large allele frequency differentials between the ancestral subpopulations that underwent admixture. The program can deal with any number of ancestral subpopulations and any number of linked marker loci. In its present version, the program handles only data from samples of unrelated individuals.

The program is written in C++, and is freely available with source code under a GPL license. Offers to help with development of the program are welcome. The current version runs only on a single processor, and computation time can be a serious limitation on large datasets.

The program is based on a hybrid of Bayesian and classical approaches. A Bayesian full probability model is specified, assigning vague prior distributions to parameters for the distribution of admixture in the population and the stochastic variation of ancestry along hybrid chromosomes. The posterior distribution of all unobserved variables, given the observed genotype and trait data, is generated by Markov chain Monte Carlo simulation. These unobserved variables include the ancestry at each locus and the ancestry-specific allele frequencies at each locus. For a description of the theory underlying this approach, see the following papers:

McKeigue, P.M., Carpenter, J., Parra, E.J., Shriver, M.D. Estimation of admixture and detection of linkage in admixed populations by a Bayesian approach: application to African-American populations. Annals of Human Genetics 2000; 64:171-86.

Hoggart, C.J., Parra, E.J., Shriver, M.D., Bonilla, C., Kittles, R.A., Clayton, D.G. and McKeigue, P.M. Control of confounding of genetic associations in stratified populations. Am J Hum Genet. 2003; 72:1492-1504.

Hoggart, C.J., Shriver, M.D., Kittles, R.A., Clayton, D.G. and McKeigue, P.M. Design and analysis of admixture mapping studies. Am J Hum Genet. 2004; 74:965-978.

McKeigue, P.M. Prospects for admixture mapping of complex traits. Am J Hum Genet 2005; 76 (1):1-7.

Applications of the program

1. Modelling the dependence of a disease or quantitative trait upon individual admixture

For a binary trait, such as presence of disease, the program fits a logistic regression model of the trait upon individual admixture, defined as the mean of the admixture proportions of both parents. For a continuous trait, such as skin pigmentation, the program fits a linear regression model of the trait value on individual admixture. Covariates such as age, sex and socioeconomic status can be included in the regression model. The program output includes posterior means and 95% credible intervals for the regression coefficients. Alternatively, the program can be used to test a null hypothesis of no association of disease risk or trait level with individual admixture, as described below.

2. Controlling for confounding of genetic associations in stratified populations

For more details of this application, see Hoggart et al (2003). The program calculates a score test for association of the disease or trait with alleles or haplotypes at each locus, adjusting for individual admixture and other covariates in a regression model. Where there is evidence for association of a trait with individual admixture, the posterior distribution of the regression coefficient can be estimated in a further analysis. For this application, the dataset should include at least 30 markers informative for ancestry.

3. Admixture mapping: localizing genes that underlie ethnic differences in disease risk

For more details of this application, see Hoggart et al (2004). Where differences in disease risk have a genetic basis, testing for association of the disease with locus ancestry by conditioning on parental admixture can localize genes underlying these differences. This approach is an extension of the principles underlying linkage analysis of an experimental cross. To exploit the full power of admixture mapping, 1000 or more markers informative for ancestry across the genome are required.

4. Detecting population stratification and identifying admixed individuals

Where no information about the demographic background of the population under study is available, ADMIXMAP can be used to test for population stratification, to determine how many subpopulations are required to model this stratification, and to identify admixed individuals. This is useful when assembling panels of unadmixed individuals to be used for estimating allele frequencies. We emphasize that when the program is run without supplying prior information about allele frequencies in each subpopulation, the subpopulations are not identifiable in the model. Thus inference should be based only on the posterior distribution of variables that are unaffected by permuting the labels of the subpopulations.

5. Testing for associations of a trait with haplotypes and estimating haplotype frequencies from a sample of unrelated individuals

Where two or more loci in the same gene have been typed, ADMIXMAP will model the unobserved haplotypes, conditional on the observed unordered genotypes. Score tests for association of haplotypes with the trait can be obtained, as can samples from the posterior distribution of haplotype frequencies. This application of the program is not limited to admixed or stratified populations: for a population that is not stratified, the user can simply specify the option populations=1.

For each of these applications, score tests of the appropriate null hypotheses are built into the program.

Modelling admixture and trait values

Any run of two or more loci for which the distance between loci is specified as zero is modelled as a single "compound locus". Thus if L "simple loci" (SNPs, insertion/deletion polymorphisms or microsatellites) have been typed, and three of these simple loci are in the same gene, the model will have L - 2 compound loci. The program assumes that on any gamete, the ancestry state is the same at all loci within a compound locus. The program allows for allelic association within any compound locus that contains two or more simple loci, and models the unobserved haplotypes at this compound locus.
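The grouping rule above can be illustrated with a short Python sketch. It is not taken from the ADMIXMAP source code: the function name and the convention that each locus carries the distance from the preceding locus (with the first entry ignored) are assumptions made for this example.

```python
def group_compound_loci(distances):
    """Group simple loci into compound loci.

    distances[i] is the map distance from simple locus i-1 to simple locus i.
    The first entry is ignored (hypothetical convention for this sketch).
    A distance of exactly zero joins a locus to the preceding compound locus.
    """
    groups = []
    for index, distance in enumerate(distances):
        if index > 0 and distance == 0:
            groups[-1].append(index)   # same compound locus as the previous simple locus
        else:
            groups.append([index])     # start a new compound locus
    return groups


# Five simple loci (L = 5); loci 1, 2 and 3 are in the same gene,
# so the distances into loci 2 and 3 are coded as zero.
distances = [None, 0.05, 0.0, 0.0, 0.12]
groups = group_compound_loci(distances)
print(len(groups))  # number of compound loci, here L - 2
print(groups)
```

With three of the five simple loci in one gene, the sketch returns three groups, matching the L - 2 compound loci described above.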

For each parent of each individual, admixture proportions are defined by a vector with K co-ordinates, where K is the number of ancestral subpopulations that contributed to the admixed population under study. For instance, in a Caribbean population it may be possible to model the gene pool of the admixed population as a mixture of three subpopulations: African, European and Native American. The model of admixture is described by the following hierarchy:

The population distribution from which the parental admixture proportions are drawn is modelled by a Dirichlet distribution with parameter vector of length K.
The allele or haplotype frequencies at each compound locus are modelled by a Dirichlet distribution, with prior parameters specified by the user.
Locus ancestry is modelled by a multinomial distribution, with cell probabilities specified by the admixture of both parents.
The probabilities of observing each allele or haplotype at each locus on each gamete, given the ancestry of the gamete at that locus, are modelled by a multinomial distribution, with parameters given by the ancestry-specific allele (or haplotype) frequencies.
The stochastic variation of ancestral states along the chromosomes transmitted from each parent is modelled as a mixture of independent Poisson arrival processes with intensities α, β, γ per Morgan (for three-way admixture). For given values of parental admixture, it is only necessary to specify a single parameter ρ for the sum of intensities: ρ = α + β + γ.
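The hierarchy above can be read as a recipe for simulating one gamete. The Python sketch below follows that reading and is only an illustration, not the program's implementation. All function names, the data layout and the numerical values are hypothetical. It assumes that the ancestry at the first locus is drawn from the parental admixture proportions, and that each Poisson arrival (at total intensity `rho` per Morgan, the sum of intensities) redraws the ancestry state from those same proportions.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one vector of proportions from a Dirichlet distribution."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_gamete_ancestry(admixture, positions, rho, rng):
    """Sample ancestry states at loci on one chromosome of one gamete.

    admixture: parental admixture proportions (length K).
    positions: sorted map positions in Morgans.
    rho: sum of arrival intensities per Morgan (assumed parameterisation).
    """
    populations = list(range(len(admixture)))
    states = []
    for i, position in enumerate(positions):
        if i > 0:
            gap = position - positions[i - 1]
            # Probability of no arrival between adjacent loci is exp(-rho * gap).
            no_arrival = rng.random() < math.exp(-rho * gap)
        else:
            no_arrival = False
        if no_arrival:
            states.append(states[-1])
        else:
            states.append(rng.choices(populations, weights=admixture)[0])
    return states


def sample_alleles(states, allele_freqs, rng):
    """Draw an allele at each locus given the ancestry state at that locus.

    allele_freqs[locus][population] is a list of ancestry-specific frequencies.
    """
    alleles = []
    for locus, state in enumerate(states):
        freqs = allele_freqs[locus][state]
        alleles.append(rng.choices(list(range(len(freqs))), weights=freqs)[0])
    return alleles


rng = random.Random(42)
parental_admixture = sample_dirichlet([2.0, 1.0, 0.5], rng)   # K = 3, made-up parameters
positions = [0.0, 0.1, 0.15, 0.6]
states = sample_gamete_ancestry(parental_admixture, positions, 6.0, rng)
freqs_one_locus = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]        # made-up biallelic frequencies
alleles = sample_alleles(states, [freqs_one_locus] * len(positions), rng)
print(parental_admixture, states, alleles)
```

Setting `rho` to zero makes the ancestry state constant along the chromosome, while large values make ancestry at neighbouring loci nearly independent given the parental admixture.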

If an outcome variable is supplied, ADMIXMAP fits a regression model (logistic regression for a binary trait, linear regression for a quantitative trait) with individual admixture proportions and any covariates supplied by the user as explanatory variables.
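As a rough illustration of the regression model for a binary trait, the following Python sketch computes individual admixture as the mean of the two parental admixture vectors and evaluates a logistic regression on it. The function names and coefficient values are hypothetical, and treating the first subpopulation as the baseline category (because the proportions sum to one) is an assumption of this sketch rather than a documented feature of the program.

```python
import math


def individual_admixture(parent1, parent2):
    """Individual admixture as the mean of the two parental admixture vectors."""
    return [(a + b) / 2.0 for a, b in zip(parent1, parent2)]


def logistic_risk(intercept, admixture_coefs, admixture,
                  covariate_coefs=(), covariates=()):
    """Probability of a binary outcome under a logistic regression model.

    The first subpopulation is treated as the baseline (assumption of this
    sketch), so admixture_coefs has K - 1 entries.
    """
    linear_predictor = intercept
    linear_predictor += sum(b * m for b, m in zip(admixture_coefs, admixture[1:]))
    linear_predictor += sum(b * x for b, x in zip(covariate_coefs, covariates))
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical two-way admixture, with one covariate (age in decades).
admixture = individual_admixture([0.8, 0.2], [0.6, 0.4])
risk = logistic_risk(-2.0, [1.5], admixture, covariate_coefs=[0.1], covariates=[4.5])
print(admixture, risk)
```

For a quantitative trait, the same linear predictor would be used directly as the expected trait value in a linear regression.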

Modelling allele/haplotype frequencies

The program can be run with ancestry-specific allele / haplotype frequencies specified either as fixed or as random variables. If one of the two options allelefreqfile or fixedallelefreqs is specified, the allele frequencies are specified as fixed at the values supplied. If option populations is specified, the allele frequencies are specified as random variables with reference (uninformative) prior distributions. If option priorallelefreqfile is specified, the allele frequencies are specified as random variables with a prior distribution given by the values in this file. This option is used where allele frequencies have been estimated from samples of unadmixed modern descendants of the ancestral subpopulations that contributed to the admixed population under study. For instance, in a study of a population of mixed European and west African ancestry, allele frequencies at some or all of the loci typed may have been estimated in samples from modern unadmixed west African and European populations. The program will use this information to estimate the ancestry-specific allele frequencies from the unadmixed and admixed population samples simultaneously, allowing for sampling error.

If no information about allele frequencies in the ancestral subpopulations is provided, the ancestry-specific allele frequencies are estimated only from the admixed population under study. If no information about allele frequencies is provided at any locus, the subpopulations are not identifiable in the model. This does not matter when the program is used only to control for confounding by hidden population stratification, as described in Hoggart et al. (2003).

The file priorallelefreqfile specifies the parameters of a Dirichlet prior distribution for the allele frequencies at each locus in each subpopulation. Where the alleles or haplotypes have been counted directly in samples from unadmixed modern descendants, these parameter values should be specified by adding 0.5 to the observed counts of each allele or haplotype in each subpopulation. These parameter values specify the Dirichlet posterior distribution that we would obtain by combining a reference prior with the observed counts. Using this as a prior distribution when analysing data from the admixed population is equivalent to estimating the allele frequencies simultaneously from the admixed and unadmixed population samples, with a reference prior.
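The rule of adding 0.5 to each observed count can be written as a one-line function. The Python sketch below uses made-up counts for a single biallelic locus and does not attempt to reproduce the layout of the priorallelefreqfile itself. The function name is hypothetical.

```python
def dirichlet_prior_parameters(observed_counts):
    """Dirichlet parameters obtained by combining a reference prior
    (0.5 per allele) with allele counts from an unadmixed sample."""
    return [count + 0.5 for count in observed_counts]


# Hypothetical allele counts at one biallelic locus in two unadmixed samples.
counts_by_subpopulation = {
    "west African": [12, 88],
    "European": [71, 29],
}
for name, counts in counts_by_subpopulation.items():
    parameters = dirichlet_prior_parameters(counts)
    expected = [p / sum(parameters) for p in parameters]  # prior mean frequencies
    print(name, parameters, expected)
```

Larger samples give larger parameter values and therefore a more concentrated prior, which is how sampling error in the unadmixed samples is carried into the analysis of the admixed population.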

For compound loci, where haplotype frequencies have been estimated from unordered genotypes rather than by counting phase-known gametes, the user cannot specify the prior distribution simply by adding 0.5 to the observed counts of each haplotype in a sample of unadmixed modern descendants, because the counts have not been observed. Instead, we can compute the posterior distribution of haplotype frequencies in the unadmixed population, given a reference prior and the observed unordered genotype data. In accordance with the principles of Bayesian inference, we can then use this posterior distribution to specify the prior distribution when modelling data from the admixed population. To simplify the computation, we generate a large sample from the posterior distribution of haplotype frequencies in the unadmixed population and calculate the parameters of the Dirichlet distribution that most closely approximates this posterior distribution. The parameters of this Dirichlet distribution are then entered in file priorallelefreqfile.

This can be implemented by running ADMIXMAP with genotype data from each unadmixed population sample, specifying options populations=1 and allelefreqoutputfile to sample the posterior distribution of the haplotype frequencies. The parameters of the Dirichlet distribution that most closely approximates this posterior distribution can then be computed, and substituted into the file priorallelefreqfile as input to the program when modelling data from the admixed population. This is straightforward: for each locus and each subpopulation, the posterior expectations and the posterior covariance matrix of the allele frequencies are evaluated using the samples from the posterior distribution in allelefreqoutputfile. The Dirichlet parameters α_i that approximate the posterior distribution are then computed by equating the posterior expectations of the allele frequencies to the ratios α_i / Σ_j α_j, and equating the determinant of the posterior covariance matrix to the determinant of the covariance matrix of the Dirichlet distribution. If ADMIXMAP is run with the option allelefreqoutputfile and the R script AdmixmapOutput.R is run to process the output files, the Dirichlet parameters will be computed and written to a file in the correct format for use in subsequent analyses as priorallelefreqfile.
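The moment-matching step can be sketched in Python as follows. This is an illustration of the idea, not a transcription of AdmixmapOutput.R, and the function names are hypothetical. Because the frequencies sum to one, the full covariance matrix is singular, so the sketch assumes that the determinant is taken over the first K - 1 alleles only. For a Dirichlet distribution with mean frequencies m_1, ..., m_K and parameter sum S, that determinant equals (m_1 × ... × m_K) / (S + 1)^(K - 1), which can be solved for S.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det                      # a row swap flips the sign
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
    return det


def fit_dirichlet(samples):
    """Approximate posterior samples of allele frequencies by a Dirichlet.

    samples: list of frequency vectors (each summing to 1) at one locus in
    one subpopulation. Returns the matched Dirichlet parameters.
    """
    n = len(samples)
    k = len(samples[0])
    means = [sum(s[i] for s in samples) / n for i in range(k)]
    # Sample covariance of the first k - 1 frequencies (assumption: the last
    # allele is dropped because the full covariance matrix is singular).
    cov = [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / (n - 1)
            for j in range(k - 1)] for i in range(k - 1)]
    # Solve prod(means) / (S + 1)^(k - 1) = det(cov) for the parameter sum S.
    parameter_sum = (math.prod(means) / determinant(cov)) ** (1.0 / (k - 1)) - 1.0
    return [mean * parameter_sum for mean in means]


# Two made-up posterior draws at a biallelic locus, for illustration only.
print(fit_dirichlet([[0.2, 0.8], [0.4, 0.6]]))
```

In practice the input would be the many posterior draws written to allelefreqoutputfile, one fit per locus and subpopulation.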

With options populations, allelefreqfile or priorallelefreqfile, the program fits a model in which the allele frequencies in modern unadmixed descendants of the ancestral subpopulations are identical to the corresponding ancestry-specific allele frequencies in the admixed population under study. The option dispersiontestfile will generate a diagnostic test of this assumption.

With option historicallelefreqfile, the program fits a more general model in which there is dispersion of allele frequencies between the unadmixed and admixed populations.

With option correlatedallelefreqs, a correlated allele frequency model is fitted, in which the allele frequency prior parameters are the same across subpopulations and specified as vectors of proportions and a sum, common to all loci.

Inference

There are various approaches to statistical inference and hypothesis testing using the ADMIXMAP program:

(1) A model based on the null hypothesis can be fitted, and this null hypothesis can be tested against alternatives with a score test computed by averaging over the posterior distribution of the missing data. For a description of the theory underlying this approach, see Hoggart et al. (2003). Several score tests are built into the program, and are described below. Additional score tests can be constructed by the user.

(2) The effect under study can be included in the regression model, so that the posterior distribution of the effect parameter is estimated. In large samples, the posterior mean and 95% credible interval for the effect parameter are asymptotically equivalent to the maximum likelihood estimate and 95% confidence interval that would be obtained by classical methods.

(3) The log-likelihood function for the effect of a parameter can be computed (not yet implemented).

(4) The marginal likelihood of the model can be evaluated using Chib's algorithm or thermodynamic integration. Chib's algorithm is implemented only for a single individual (the first listed in the genotypes file) and for a model with no outcome variable, but thermodynamic integration is implemented for all models.

In addition to these methods for formal inference, model diagnostics based on the posterior predictive check probability are provided to detect population stratification not accounted for in the model, lack of fit of the allele frequencies to the specified model, and departure from Hardy-Weinberg equilibrium.

Comparison with other programs for modelling admixture

The program STRUCTURE (available from http://pritch.bsd.uchicago.edu/) fits a similar hierarchical model for population admixture, given genotype data on admixed and unadmixed individuals, if you specify the "popalphas" option (see documentation for this program at http://pritch.bsd.uchicago.edu/software/readme_structure2.pdf).

The main differences between ADMIXMAP and STRUCTURE are:

STRUCTURE does not model the dependence of the outcome variable on individual admixture and thus cannot adjust for the effect of individual admixture on the outcome variable.
STRUCTURE does not allow the user to supply prior distributions for the allele frequencies. To use allele frequency data from unadmixed individuals in STRUCTURE, individuals sampled from unadmixed and admixed populations have to be included in the same model. This is not recommended, either with STRUCTURE or ADMIXMAP, because the model assumes a unimodal distribution of individual admixture values in the population. If samples from both unadmixed and admixed populations are included in the same model, the distribution of individual admixture values will generally not be unimodal, and the fit of the model will be poor.
STRUCTURE does not allow for allelic association (other than that generated by admixture), and is therefore unsuitable for analysis of datasets in which two or more tightly-linked loci (for instance SNPs in the same gene) have been typed. ADMIXMAP allows for allelic association (if the distance between loci is coded as zero) and models the unobserved haplotypes.
STRUCTURE assumes that the markers have not been selected to be highly informative for ancestry, and the authors recommend that a model which allows for correlation between the allele frequencies in each subpopulation is specified. ADMIXMAP does not in general assume any correlation between the allele frequencies in each subpopulation.

The program ANCESTRYMAP (available from http://genepath.med.harvard.edu/~reich/ ) is similar to ADMIXMAP but is restricted to diallelic simple loci and two populations.

Tools for converting data from ANCESTRYMAP format to ADMIXMAP format and vice-versa as well as from STRUCTURE format to ADMIXMAP format are available here.