Front. Gene.Frontiers in GeneticsFront. Gene.1664-8021Frontiers Research Foundation10.3389/fgene.2012.00078GeneticsMethods ArticleModel and Algorithm for Linkage Disequilibrium Analysis in a Non-Equilibrium PopulationLiuJingyuan^{1}WangZhong^{2}WangYaqun^{1}LiRunze^{1}^{2}WuRongling^{1}^{2}*^{1}Department of Statistics, The Pennsylvania State UniversityState College, PA, USA^{2}Division of Biostatistics and Bioinformatics, Pennsylvania State UniversityHershey, PA, USA
Edited by: Christina Kitchen, University of California Los Angeles, USA
Reviewed by: Dalin Li, Cedars-Sinai Medical Center, USA; Christina Kitchen, University of California Los Angeles, USA
*Correspondence: Rongling Wu, Division of Biostatistics and Bioinformatics, Pennsylvania State University, Hershey, 17033 PA, USA. e-mail: rwu@hes.hmc.psu.edu
This article was submitted to Frontiers in Statistical Genetics and Methodology, a specialty of Frontiers in Genetics.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non Commercial License, which permits non-commercial use, distribution, and reproduction in other forums, provided the original authors and source are credited.
The multilocus analysis of polymorphisms has emerged as a vital ingredient of population genetics and evolutionary biology. A fundamental assumption used for existing multilocus analysis approaches is Hardy–Weinberg equilibrium at which maternally- and paternally-derived gametes unite randomly during fertilization. Given the fact that natural populations are rarely panmictic, these approaches will have a significant limitation for practical use. We present a robust model for multilocus linkage disequilibrium analysis which does not rely on the assumption of random mating. This new disequilibrium model capitalizes on Weir’s definition of zygotic disequilibria and is based on an open-pollinated design in which multiple maternal individuals and their half-sib families are sampled from a natural population. This design captures two levels of associations: one is at the upper level that describes the pattern of cosegregation between different loci in the parental population and the other is at the lower level that specifies the extent of co-transmission of non-alleles at different loci from parents to their offspring. An MCMC method was implemented to estimate genetic parameters that define these associations. Simulation studies were used to validate the statistical behavior of the new model.
Linkage disequilibria have been used as a fundamental concept for studying the pattern of genetic diversity in a natural population (Lewontin, 1964, 1988; Hedrick, 1987; Weir, 1996; Lou et al., 2003; Li et al., 2007; Slatkin, 2008) as well as for fine-mapping the genetic architecture of complex traits (Kruglyak, 1999). The traditional definition of linkage disequilibrium is the non-random association of alleles at different loci within the same gametes. The theoretical basis of estimating gametic linkage disequilibria is founded on the assumption that the population under study is at Hardy–Weinberg equilibrium (HWE), in which individuals are randomly mating to produce next generations. With the HWE assumption, genotype frequencies in the population are expressed as the product of gamete frequencies, a key expression to estimate and test the gametic linkage disequilibrium. Recently, Wu et al. (2010) have showed that, when multiple loci are considered simultaneously, this expression may not be necessarily true even if these loci are individually at HWE.
More importantly, for a given population, the assumption of random mating may be violated by many evolutionary forces such as selection, mutation, genetic drift, and population structure. For a non-equilibrium population at Hardy–Weinberg disequilibrium (HWD), zygotic disequilibria that have power to characterize non-random associations at both gametic and zygotic levels (Weir, 1996; Yang, 2000, 2002) may be more relevant. Earlier studies have documented possible genetic and evolutionary causes for zygotic associations in a non-equilibrium population (Haldane, 1949; Bennett and Binet, 1956; Charlesworth, 1991; Barton and Gale, 1993). Weir (1996) documented five different types of disequilibria simultaneously which are (1) Hardy–Weinberg disequilibria at each locus, (2) gametic disequilibrium (including two alleles in the same gamete, each from a different locus), (3) non-gametic disequilibrium (including two alleles in different gametes, each from a different locus), (4) trigenic disequilibrium (including a zygote at one locus and an allele at the other), and (5) quadrigenic disequilibrium (including two zygotes each from a different locus). Because it is impossible to estimate all the five disequilibrium parameters due to inadequate degrees of freedom, Weir (1996) collapsed gametic and non-gametic disequilibria to estimate a so-called composite gametic disequilibrium. More recently, Liu et al. (2006) used Weir’s approach to estimate zygotic disequilibria in a canine population and gain a better insight into the structure and organization of the canine genome.
For a traditional population genetic strategy that samples unrelated individuals at random from a natural population, Weir’s approach cannot separate the gametic and non-gametic disequilibria. This is because it provides insufficient information to distinguish two diplotypes of a double heterozygote which have the same genotype. In this article, we present a disequilibrium model for estimating the relative proportions of these two diplotypes for the double heterozygote and, therefore, making a distinction between the disequilibria occurring within and between gametes. The new model is based on an open-pollinated (OP) plant design that contains a set of randomly selected maternal plants from a natural population and multiple progeny from each maternal plant. The OP design was first proposed by Wu and Zeng (2001) to simultaneously estimate the linkage and (gametic) linkage disequilibrium between different markers. Li et al. (2009) further used this design to estimate the mating behavior of an outcrossing plant species, thus providing a comprehensive means for studying the genetic structure and diversity of natural populations. Hou et al. (2009) extended Li et al.’s (2009) work to jointly model the linkage, linkage disequilibrium, and genetic interference of genes using multilocus data. Here, we use the OP design to relax the HWE assumption which is needed for traditional multilocus analysis and modeling. While most of the previous work was implemented with the EM algorithm, we develop a Markov chain Monte Carlo (MCMC) procedure to estimate the parameters that define zygotic disequilibria. We conduct computer simulation to test the statistical properties of the model and validate its use in practice.
Zygotic DisequilibriumGamete and non-gamete frequencies
Different types of zygotic disequilibria were defined in Weir (1996). As a follow-up of Liu et al.’s (2006) work, we will use the same notation and procedure for dissecting genotypic frequencies into disequilibrium components. Suppose we have two SNP markers A (with two alleles A and a) and B (with two alleles B and b). The frequencies of the corresponding alleles are denoted as p_{A} and p_{a} (p_{A}+ p_{a}= 1) as well as p_{B} and p_{b} (p_{B}+ p_{b}= 1), respectively. There are three distinguishable genotypes at each marker, i.e., AA, Aa, and aa for A and BB, Bb, and bb for B, whose genotype frequencies are denoted as P subscribed by the genotype notation. The two markers form 9 distinguishable genotypes, although there are 10 genotypic configurations or diplotypes. One genotype, AaBb, has two possible genotypic configurations BA|ba| and bA|Ba|, which are genotypically seen as the same.
Table 1 tabulates 9 genotype frequencies and 10 diplotype frequencies denoted by P, subscripted and superscripted by the genotype or diplotype notation. Using two-marker diplotype frequencies, we can estimate one-marker genotype frequencies by
for marker B, and further estimate the allele frequencies by
pA=PAA+12PAapa=Paa+12PAapB=PBB+12PBbpb=Pbb+12PBb.
Frequencies and numbers of observations of marker genotypes.
Marker
Marker B
A
BB
Bb
bb
Total
AA
PAABB(N1)
PAABb(N2)
PAAbb(N3)
p_{AA} (N_{1} + N_{2} + N_{3})
Aa
PAaBB(N4)
PAaBb+PAabB=PAaBb(N5)
PAabb(N6)
P_{Aa} (N_{4} + N_{5} + N_{6})
aa
PaaBB(N7)
PaaBb(N8)
Paabb(N9)
P_{aa} (N_{7} + N_{8} + N_{9})
P_{BB}
P_{Bb}
P_{bb}
1
Total
n_{1}+ N_{4} + N_{7}
N_{2} + N_{5} + N_{8}
N_{3} + N_{6} + N_{9}
N.
(1) Genotype AaBb contains two different configurations or diplotypes AB|ab and Ab|aB; (2) N_{i}’s in the parentheses are the numbers of corresponding observed genotypes.
The two markers form four gametes, AB, Ab, aB, and ab, whose frequencies are denoted as p subscribed by the gamete notation. They are derived from diplotype frequencies by
As shown above, all the genotype, allele, gamete, and non-gamete frequencies are uniquely determined by the diplotype frequencies. According to Table 1, the frequencies of all the diplotypes, except for AB|ab and Ab|aB, can be estimated from the observed data, since they are all one to one corresponding to genotypes which can be directly estimated from data.
A complete set of disequilibria
Zygotic disequilibrium is defined as the deviation of two-locus genotype frequencies from products of single-locus genotype frequencies and, thus, is composed of all non-allelic genic disequilibria at the two loci (Weir, 1996). For a population at HWD, the desirable property of an equilibrium population will not occur, such as independence of different alleles at the same locus (Lynch and Walsh, 1998). The HWD attempts to test for two alleles at the same locus, but on different gametes, whereas (gametic) linkage disequilibrium describes two alleles on the same gametes, but at different loci. For a zygotic disequilibrium, however, there is a third test, i.e., two alleles on different gametes and at different loci.
Since the population is not in HWE, two alleles at each marker are not independent, with the coefficients of HWD defined as
DA=PAA−pA2=−12PAa+pApa=Paa−pa2
for marker A and
DB=PBB−pB2=−12PBb+pBpb=Pbb−pb2
for marker B, respectively. The coefficient of digenic gametic linkage disequilibrium between the two markers is defined as
Dab=pAB−pApB=−pAb+pApb=−paB+papB=pab+papb.
For the non-equilibrium population, digenic linkage disequilibrium that occurs between non-alleles at different gametes is defined as
Da/b=pA/B−pApB=−pA/b−pApb=−pa/B−papB=pa/b−papb.
The trigenic disequilibria between two alleles from marker A and one allele from marker B is defined as
With diplotype frequencies, allele frequencies, HWD, gametic and non-gamete disequilibria, and trigenic disequilibria, we can derived the quadrigenic disequilibrium () between two alleles from marker A and two alleles from marker B (Weir, 1996). Analogous to Liu et al. (2006), we use a table (Table 2) to express the formulas for D_{AB}, from which it is clear that a full set of disequilibria can only be estimated from diplotype frequencies. In this article, D_{AB} will be estimated by using information from offspring genotypes. For clarity, we use small and capital letters to denote gamete and zygotic disequilibria, respectively. Conversely from Table 2, we can see that each of the diplotype frequencies can be expressed in terms of the allele frequencies (p_{A}, p_{a} and p_{B}, p_{b}), HWD coefficients (D_{A} and D_{B}) and gametic (D_{ab}) and non-gametic disequilibria of different orders (D_{a/b}, D_{Ab}, D_{aB}, and D_{AB}).
Expressions of quadrigenic disequilibrium D_{AB} in terms of genotypic configuration frequencies, allele frequencies and lower-order disequilibrium coefficients.
Frequency
1
DADB+Dab2+Da∕b2
D_{A}
D_{B}
D_{ab}
D_{a/b}
D_{Ab}
D_{aB}
PAABB
-pA2pB2
–1
-pB2
-pA2
–2p_{A}p_{B}
–2p_{A}p_{B}
–2p_{B}
–2p_{A}
-12PAABb
pA2pBpb
–1
p_{B}p_{b}
-pA2
–p_{A}p_{B} + p_{A}p_{b}
–p_{A}p_{B} + p_{A}p_{b}
–p_{B} + p_{b}
–2p_{A}
PAAbb
-pA2pb2
–1
-pb2
-pA2
2p_{A}p_{b}
–2p_{A}p_{b}
–2p_{b}
–2p_{A}
-12PAaBB
pApapB2
–1
-pB2
p_{A}p_{a}
–p_{A}p_{B} + p_{a}p_{B}
–p_{a}p_{b} + p_{a}p_{B}
–2p_{B}
–p_{A} + p_{a}
12PAaBb
–p_{A}p_{a}p_{B}p_{b}
–1
p_{B}p_{b}
p_{A}p_{a}
–p_{A}p_{B} – p_{a}p_{b}
p_{A}p_{b} + p_{a}p_{B}
–p_{B} + p_{b}
–p_{A} + p_{a}
12PAabB
–p_{A}p_{a}p_{B}p_{b}
–1
p_{B}p_{b}
p_{A}p_{a}
p_{A}p_{b} + p_{a}p_{B}
–p_{a}p_{b} – p_{a}p_{b}
–p_{B} + p_{b}
–p_{A} + p_{a}
-12PAabb
pApBpb2
–1
-pb2
p_{A}p_{a}
p_{A}p_{b} – p_{a}p_{b}
p_{A}p_{b} – p_{a}p_{b}
2p_{b}
–p_{A} + p_{a}
PaaBB
−pa2pB2
–1
−pB2
−pa2
2p_{a}p_{B}
2p_{a}p_{B}
–2p_{B}
2p_{a}
−12PaaBb
pa2pBpb
–1
p_{B}p_{b}
−pa2
p_{a}p_{B} – p_{a}p_{b}
p_{a}p_{B} – p_{a}p_{b}
–p_{B} + p_{b}
2p_{a}
paabb
−pa2pb2
–1
pb2
−pa2
–2p_{a}p_{b}
–2p_{a}p_{b}
2p_{b}
2p_{a}
Model for estimation
Based on the above description, it can be seen that the estimation of all disequilibrium parameters purely relies on the separation of two diplotypes underlying the double heterozygote, since all the other diplotypes are one to one corresponding to their genotypes and can be directly estimated from the data. In this section, we show that these two diplotypes can be separated using an open-pollinated (OP) design. For a monoecious plant, each plant may be OP randomly by both its own pollen and that from the natural pool. In a dioecious plant, a female plant is only pollinated by the pollen pool. To simplify our explanation, we focus on dioecious plants. A model for monoecious plants will be described elsewhere.
Let N_{i} denote the number of maternal plants which bear on genotype i (i = 1,…,9) for the two SNPs considered and Ni(j) denote the number of offspring which have genotype j (j = 1,…,9) derived from maternal genotype i. Table 3 gives the structure of genotypic data collected from N.=∑i=19Ni random maternal plants and N.(.)=∑i=19∑j=19Ni(j) random offspring plants, respectively. For a given maternal genotype, a certain group of offspring genotypes is produced. Table 4 gives genotype frequencies of offspring produced by any maternal genotype. For the OP design of dioecious plants, offspring genotypes are determined jointly by maternal gametes of the corresponding maternal plant and paternal gametes from the pollen pool. In the natural population, the frequencies of paternal gametes are expressed as p_{AB}, p_{Ab}, p_{aB}, and p_{ab}, respectively, which can be estimated from genotypic configuration frequencies by Eq. 4.
Data structure of two markers in the OP design.
Maternal family
Offspring genotype
No.
Genotype
Size
1
2
3
4
5
6
7
8
9
AABB
AABb
AAbb
AaBB
AaBb
Aabb
aaBB
aaBb
aabb
AB|AB
AB|Ab
Ab|Ab
AB|aB
AB|ab or Ab|aB
Ab|ab
aB|aB
aB|ab
ab|ab
1
AABB
N_{1}
N1(1)
N1(2)
N1(4)
N1(5)
2
AABb
N_{2}
N2(1)
N2(2)
N2(3)
N2(4)
N2(5)
N2(6)
3
AAbb
N_{3}
N3(2)
N3(3)
N3(5)
N3(6)
4
AaBB
N_{4}
N4(1)
N4(2)
N4(4)
N4(5)
N4(7)
N4(8)
5
AaBb
N_{5}
N5(1)
N5(2)
N5(3)
N5(4)
N5(5)
N5(6)
N5(7)
N5(8)
N5(9)
6
Aabb
N_{6}
N6(2)
N6(3)
N6(5)
N6(6)
N6(8)
N6(9)
7
aaBB
N_{7}
N7(4)
N7(5)
N7(7)
N7(8)
8
aaBb
N_{8}
N8(4)
N8(5)
N8(6)
N8(7)
N8(8)
N8(9)
9
aabb
N_{9}
N9(5)
N9(6)
N9(8)
N9(9)
Offspring genotype frequencies given each maternal genotype in the OP design.
The frequencies of maternal gametes are dependent on the genotype of a maternal plant. If the maternal plant is a double heterozygote AbBb, then any genotype generated in the offspring will include a mixture of genotypes derived from gametes of the two underlying diplotypes of the double heterozygote. The proportions of mixture components are determined by two parameters, the recombination fraction (r) between the two markers, and the relative proportions (ϕ) of the two underlying diplotypes to the double heterozygote. For a maternal plant with genotype AbBb, there are two possible diplotypes, AB|ab or Ab|aB, with relative frequencies
ϕ=PAaBbPAaBb,1−ϕ=PAabBPAaBb
where P_{AaBb} is the frequency of the double heterozygote. Based on the definition of ϕ, we rewrite Eq. 4 to calculate the paternal gamete frequencies by
Maternal diplotypes, AB|ab or Ab|aB, will produce four different maternal gametes, with relative frequencies depending on the recombination fraction (r) between the two markers:
where we define θ = ϕ(1 − r) + r(1 − ϕ) and 1 − θ = ϕr + (1 − r)(1 − ϕ). Thus, overall haplotype frequencies produced by the double heterozygote maternal are calculated as 12θ for AB or ab and 12θ for Ab and aB.
As seen from Table 4, the conditional genotype frequencies of offspring given maternal genotypes depend on parameter θ and paternal gamete frequencies. Since paternal gamete frequencies (14) are determined by unknown parameter ϕ and genotype/diplotype frequencies estimated from maternal genotypes (which can be thought of constants), these conditional probabilities are actually dependent on only θ and ϕ.
Parameter estimation with the MCMC algorithm
As have been clear above, only two parameters ϕ and θ determine the genotype frequencies of offspring (Table 4). The log-likelihood of a complete set of genotype data includes the two parts based on the maternal and offspring genotypes, respectively, expressed as
logL(ϕ,θ)=logLM(ϕ)+logL0(ϕ,θ).
We have
logLM(ϕ)∝N1logPAABB+…+N5logPAaBb+…+N9logPaabb
for the upper level of the log-likelihood that specifies the genotype distribution of maternal plants in the natural population, and
for the lower level of the log-likelihood that specifies the transmission of alleles from parents to offspring.
Since all the genotype frequencies of maternal plants can be directly estimated by the observed data, we can view the upper level of the log-likelihood as constant, thus only focus on the lower level logL_{0}(ϕ,θ), i.e., logL(ϕ,θ)∝logL_{0}(ϕθ). Furthermore, logL_{0}(ϕ,θ) can be simplified as
Notice that p_{AB}, p_{Ab}, p_{aB}, and p_{ab} are a function of ϕ according to equation (14). Then, within the Bayesian framework, the MCMC technique can be used to draw samples from the posterior distributions of ϕ and θ. We will use the Variable-at-a-Time Metropolis-Hastings algorithm described as follows:
Start with initial value (ϕ^{(0)},θ^{(0)});
After getting (ϕ^{(n)},θ^{(n)}), update ϕ from ϕ^{(n)} to ϕ^{(n+1)} according to logπ(ϕ|θ^{(n)},N) using proposal Unif(0,1): Generate ϕ* from Unif(0,1) and accept it, i.e., set ϕ^{(n+1)} = ϕ* with probability α, where logα = min{0,logπ(ϕ*|θ^{(n)},N) − logπ(ϕ^{(n)}|θ^{(n)},N)}; otherwise, set ϕ^{(n+1)} = ϕ^{(n)};
After getting (ϕ^{(n+1)} = θ^{(n)}), update θ from θ^{(n)} to θ^{(n+1)} according to logπ(θ|ϕ^{(n+1)},N) using proposal Unif(0,1): Generate θ* from Unif(0,1) and accept it, i.e., set θ^{(n+1)} = θ* with probability β, where logβ = min{0,logπ(θ_{*}|ϕ^{(n+1)},N)− logπ(θ^{(n)}|ϕ^{(n+1)},N)}; otherwise, set θ^{(n+1)} = θ^{(n)};
Repeat (2) and (3) n times, where n is the number of iterations.
After burning in the first few sampled ϕ and θ, get the mean values of the remaining samples as the respective estimates. The estimate of r can be obtained by
r=θ-ϕ1-2ϕfromθ=ϕ1-r+r1-ϕ.
Since the values of r, ϕ and, hence, θ have restricted domains as follows:
0≤r≤0.5;0≤ϕ≤1;0≤θ≤1,
we choose Uniform (0,1) as a proposal distribution when updating both ϕ and θ.
With the estimate of ϕ, we can separate two diplotypes of the double heterozygote. Therefore, all the frequencies and disequilibria are obtained from Eqs 1 to 12 and Table 2. Also, we provide an estimate of the recombination fraction.
Computer Simulation
The statistical properties of the new model were investigated through simulation studies. The values of zygotic disequilibria used to simulate marker data for the OP design were chosen from their spaces which were shown in Liu et al. (2006). Because of the limitation of a population-based sampling design, Liu et al. (2006) was not able to estimate all types of zygotic disequilibria. By collapsing gametic and non-gametic linkage disequilibria, leading to a composite quadrigenic disequilibrium rather than the quadrigenic disequilibrium as defined in Table 2, they provided a reduced model for disequilibrium analyses. In a comparison with the traditional gametic linkage disequilibrium model, Liu et al.’s reduced model was found to be more general and cover the results from the former. It is expected that our model covers Liu et al.’s model because we do not need to combine gametic and non-gametic linkage disequilibria.
The simulation uses three different scenarios. The first assumes different relative proportions of two diplotypes for the double heterozygote (ϕ) by fixing the recombination fraction. In order to obtain different ϕ values, we need to adjust the zygotic disequilibria. Specifically, we have three combinations:
in which r = 0.2, p_{A}= 0.5, p_{B}= 0.6, D_{A}= D_{B}= 0.05, and the true value of θ can be calculated given ϕ and r by equation (15). The second is to change the recombination fraction, i.e., (1) r = 0.3, (2) r = 0.05, and (3) r = 0.5, with the other parameters fixed, i.e., ϕ = 0.87, p_{A}= 0.5, p_{A}= 0.6, D_{A}= D_{B}= 0.05, D_{ab}= 0.01, D_{a/b}= D_{Ab}= D_{aB}= D_{AB}= 0.01. The third is about sampling strategies, (1) 1000 maternal plants × 9 seeds per family and (2) 200 maternal plants × 49 seeds per family, producing the same total size of samples, in which ϕ = 0.87, r = 0.2, p_{A}= 0.5, p_{B}= 0.6, D_{A}= D_{B}= 0.05, D_{ab}= 0.1, D_{a/b}= D_{Ab}= D_{aB}= D_{AB}= 0.01. For the first two scenarios, we use 1000 × 9 strategy.
For the simulated data, we estimated all genotype frequencies from maternal genotype data, which were used to estimate p_{A}, p_{B}, D_{A}, D_{B}, D_{Ab}, and D_{aB} with equations (1), (2), (3), (6), (7), (8), (11), and (12), respectively. However, D_{ab}, D_{a/b}, and D_{AB} can be estimated only after ϕ is estimated. Tables 5–7 give the results from simulated data under three different scenarios. In each case, 1000 simulation replicates were performed to get the means and standard deviations of the estimates. Each estimate was based on 1000 iterations after burns-in during the MCMC procedure. In general, all parameters can be estimated reasonably from our model.
Estimates of parameters and their standard deviations (in parentheses) based on the MCMC algorithm for the data simulated under scenario 1.
ϕ
θ
r
p_{A}
p_{B}
D_{A}
D_{B}
Dab
D_{a/b}
D_{Ab}
D_{aB}
D_{AB}
DESIGN 1
True
0.87
0.72
0.2
0.5
0.6
0.05
0.05
0.1
0.01
0.01
0.01
0.01
Estimate
0.871
0.724
0.192
0.500
0.599
0.050
0.099
0.011
0.010
0.010
0.010
0.010
SD
(0.036)
(0.021)
(0.049)
(0.009)
(0.008)
(0.010)
(0.007)
(0.003)
(0.009)
(0.003)
(0.003)
(0.002)
DESIGN 2
True
0.12
0.28
0.2
0.5
0.6
0.05
0.05
0.01
0.1
0.01
0.01
0.01
Estimate
0.122
0.276
0.196
0.500
0.600
0.050
0.050
0.010
0.100
0.010
0.010
0.010
SD
(0.048)
(0.021)
(0.059)
(0.009)
(0.011)
(0.007)
(0.008)
(0.003)
(0.007)
(0.003)
(0.003)
(0.003)
DESIGN 3
True
0.5
0.5
0.2
0.5
0.6
0.05
0.05
0.02
0.02
0.03
0.03
0.01
Estimate
0.496
0.501
0.374
0.500
0.600
0.050
0.050
0.020
0.021
0.030
0.030
0.010
SD
(0.053)
(0.022)
(7.368)
(0.012)
(0.012)
(0.008)
(0.008)
(0.003)
(0.010)
(0.003)
(0.003)
(0.002)
In scenario 1, uneven allocations of two diplotypes for the double heterozygote help the estimates of the recombination fraction (Table 5). If these two diplotypes are equally allocated, i.e., ϕ = 0.5, we could not give a good estimate of the recombination fraction because in this case θ is not dependent on r, making r unidentifiable. As shown in Li and Wu (2009), a three-locus analysis can overcome this problem. In scenario 2, we found that ϕ and all set of disequilibria can be precisely estimated, no matter whether r = 0.5 or not (Table 6). This is because the MCMC procedure treats ϕ and θ as two separate parameters. The results of scenario 3 help to determine an optimal sampling strategy. As expected, sampling more maternal plants increases the precision of parameter estimation (Table 7).
Estimates of parameters and their standard deviations (in parentheses) based on the MCMC algorithm for the data simulated under scenario 2.
ϕ
θ
r
p_{A}
p_{B}
D_{A}
D_{B}
D_{ab}
D_{a/b}
D_{Ab}
D_{aB}
d_{ab}
DESIGN 1
True
0.87
0.65
0.3
0.5
0.6
0.05
0.05
0.1
0.01
0.01
0.01
0.01
Estimate
0.871
0.640
0.295
0.500
0.599
0.050
0.050
0.100
0.011
0.010
0.010
0.010
SD
(0.045)
(0.018)
(0.040)
(0.012)
(0.012)
(0.008)
(0.008)
(0.003)
(0.010)
(0.003)
(0.003)
(0.002)
DESIGN 2
True
0.87
0.84
0.05
0.5
0.6
0.05
0.05
0.1
0.01
0.01
0.01
0.01
Estimate
0.875
0.837
0.044
0.500
0.600
0.050
0.050
0.100
0.010
0.010
0.010
0.010
SD
(0.045)
(0.016)
(0.067)
(0.012)
(0.012)
(0.008)
(0.007)
(0.003)
(0.010)
(0.003)
(0.003)
(0.002)
DESIGN 3
True
0.87
0.5
0.5
0.5
0.6
0.05
0.05
0.1
0.01
0.01
0.01
0.01
Estimate
0.874
0.499
0.501
0.500
0.600
0.049
0.050
0.100
0.010
0.010
0.010
0.010
SD
(0.046)
(0.018)
(0.025)
(0.012)
(0.012)
(0.008)
(0.007)
(0.003)
(0.010)
(0.003)
(0.003)
(0.002)
Estimates of parameters and their standard deviations (in parentheses) based on the MCMC algorithm for the data simulated under scenario 3.
ϕ
θ
r
p_{a}
p_{b}
d_{a}
d_{b}
D_{ab}
D_{a/b}
D_{ab}
D_{ab}
d_{ab}
SAMPLING STRATEGY 1: 1000 × 9
True
0.87
0.72
0.2
0.5
0.6
0.05
0.05
0.1
0.01
0.01
0.01
0.01
Estimate
0.871
0.724
0.192
0.500
0.599
0.050
0.099
0.011
0.010
0.010
0.010
0.010
SD
(0.036)
(0.011)
(0.049)
(0.009)
(0.008)
(0.010)
(0.007)
(0.003)
(0.009)
(0.003)
(0.003)
(0.002)
SAMPLING STRATEGY 2: 200 × 49
True
0.87
0.72
0.2
0.5
0.6
0.05
0.05
0.1
0.01
0.01
0.01
0.01
Estimate
0.866
0.726
0.154
0.500
0.599
0.049
0.050
0.098
0.011
0.010
0.010
0.010
SD
(0.085)
(0.018)
(0.500)
(0.028)
(0.027)
(0.027)
(0.017)
(0.004)
(0.021)
(0.006)
(0.006)
(0.004)
Discussion
The structural study of linkage disequilibrium helps to understand the genetic variation and evolution of populations and facilitates the positional cloning of genes underlying common complex diseases (Lewontin, 1964, 1988; Hedrick, 1987; Weir, 1996; Kruglyak, 1999). The current approaches for estimating linkage disequilibria rely on the assumption that the population under consideration is randomly mating, following Hardy–Weinberg equilibrium (HWE). However, many populations may be founded by a small number of ancestors and/or are frequently under evolutionary pressure, such as mutation, genetic drift, population admixture, and structure (Lynch and Walsh, 1998). For those populations, HWE may be violated. We need a new analysis that relaxes the random mating assumption. Weir (1996) introduced the concept of zygotic association or zygotic disequilibrium that specify the disequilibria between different loci in a non-equilibrium population. Part of these disequilibria was used by Liu et al. (2006) to examine the extent and distribution of zygotic disequilibria across the canine genome.
In this article, we have for the first time developed a new statistical model for estimating a complete set of zygotic disequilibria, including (1) Hardy–Weinberg disequilibria at each locus, (2) gametic disequilibrium (including two alleles in the same gamete, each from a different locus), (3) non-gametic disequilibrium (including two alleles in different gametes, each from a different locus), (4) trigenic disequilibrium (including a zygote at one locus and an allele at the other), and (5) quadrigenic disequilibrium (including two zygotes each from a different locus). This model is based on an open-pollinated (OP) design by sampling multiple maternal plants and their offspring. The major advantage of this design lies in its power to separate two diplotypes of a double heterozygote (specified by a proportion ϕ), thus retrieving the lost information of a traditional population-based sampling strategy.
Under the OP design setting, a full log-likelihood of diplotype frequencies from the maternal and offspring generations was formulated in terms of two unique parameters, ϕ and recombination fraction r. An MCMC procedure was then implemented to estimate these two parameters from which a full set of zygotic disequilibria and the recombination fraction are estimated. Extensive simulation studies have been performed to test the statistical behavior of the new model by considering a range of disequilibrium and recombination fraction as well as different sampling strategies. In general, all the parameters can be estimated with reasonable accuracy and precision.
As a first attempt to relax the Hardy–Weinberg equilibrium assumption that has been widely accepted for population genetic studies of many decades, our model will reshape the fundamental theory of this field. With an increasing availability of genetic data due to the rapid development of genotyping technologies, this model will show its increasing implications and likelihood for uncovering new discoveries related to population genetics. The model can be extended to accommodate various situations in the following aspects. First, by integrating selfing rates into the model, we can develop a similar design for monoecious plants in which the simultaneous occurrence of female and male flowers allows selfing. Second, a multilocus model including more than two markers should be developed. This will not only increase the power of the model by estimating genetic interference, but also overcome the problem of estimating the recombination fraction when ϕ is equal to 0.5. Third, the model can be integrated with quantitative traits to map their underlying QTLs and genetic interactions (see Wu et al., 2002). All these extensions will make our model more useful in practice, ultimately resolving difficult challenges in population and quantitative genetic studies.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This work is partially supported by NSF/IOS-0923975 to Rongling Wu and NIDA, NIH grants R21 DA024260 and R21 DA024266 to Runze Li. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA or the NIH.
ReferencesBartonN. H.GaleK. S. (1993). “Genetic analysis of hybrid zones,” in BennettJ. H.BinetF. E. (1956). Association between Mendelian factors with mixed selfing and random mating. CharlesworthB. (1991). The evolution of sex chromosomes. HaldaneJ. B. S. (1949). The association of characters as a result of inbreeding and linkage. HedrickP. W. (1987). Gametic disequilibrium measures: proceed with caution. HouW.LiuT.LiY.LiQ.LiJ. H.DasK.WuR. L. (2009). Multilocus genomics of outcrossing populations. KruglyakL. (1999). Prospects for whole-genome linkage disequilibrium mapping of common disease genes. LewontinR. C. (1964). The interaction of selection and linkage. General considerations, heterotic models. LewontinR. C. (1988). On measures of gametic disequilibrium. LiJ. H.LiQ.HouW.HanK.LiY.WuS.LiY. C.WuR. L. (2009). An algorithmic model for constructing a linkage and linkage disequilibrium map in open-pollinated progeny populations. LiQ.WuR. L. (2009). A multilocus model for constructing a linkage disequilibrium map in human populations. LiY.LiY.WuS.HanK.WangZ.HouW.ZengY.WuR. (2007). Estimation of multilocus linkage disequilibria in diploid populations with dominant markers. LiuT.TodhunterR. J.LuQ.SchoettingerL.LiH. Y.LittellR. C.BlissS.AclandG.LustG.WuR. L. (2006). Extent and distribution of zygotic linkage disequilibrium in canine. LouX. Y.CasellaG.LittellR. C.YangM. C. K.JohnsonJ. A.WuR. L. (2003). A haplotype-based algorithm for multilocus linkage disequilibrium mapping of quantitative trait loci with epistasis. LynchM.WalshB. (1998). SlatkinM. (2008). Linkage disequilibrium – understanding the evolutionary past and mapping the medical future. WeirB. (1996). WuR. L.MaC. X.CasellaG. (2002). Joint linkage and linkage disequilibrium mapping of quantitative trait loci in natural populations. WuR. L.ZengZ. B. (2001). Joint linkage and linkage disequilibrium mapping in natural populations. WuS.YangJ.WuR. L. (2010). Mapping quantitative trait loci in a non-equilibrium population. YangR. C. (2000). Zygotic associations and multilocus statistics in a nonequilibrium diploid population. YangR. C. (2002). Analysis of multilocus zygotic associations.