
Edited by: Qizhai Li, Chinese Academy of Sciences, China

Reviewed by: Guohua Zou, Chinese Academy of Sciences, China; Tian-Qing Zheng, Chinese Academy of Agricultural Sciences, China

*Correspondence: Tao Wang, Division of Biostatistics, Institute for Health and Society, Medical College of Wisconsin, 8701 Watertown Plank Road, PO Box 26509, Milwaukee, WI 53226, USA e-mail:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Zeng et al. (

Currently there are two types of statistical genetic models that are commonly used in genetic analysis of quantitative traits. One is the F_{∞} type models that concentrate on direct modeling of the expected genotypic values at targeted quantitative trait loci (QTL) or genetic markers and association testing for various allelic effects and interactions (Fisher,

In genetic association studies, we are often interested in direct comparison of the expected genotypic values at certain QTL or marker loci. The F_{∞} models are appealing in this setting because their model parameters are simple to interpret: they are often referred to as the fixed genetic effects, such as the additive and dominance effects, or the allelic effects and allelic interactions, in terms of the expected genotypic values. By applying the F_{∞} models, we can compare the expected genotypic values via hypothesis tests on various fixed genetic effects. However, as pointed out in Wang and Zeng (F_{∞} and the Fisher type models form the basis for the analysis of quantitative traits. They provide different perspectives in assessing the genetic effects of QTL or genetic markers.

The basic genetic model on assessing the genetic variance components was first proposed by Fisher (

In this study, we further extend the G2A model to QTL with multiple alleles and multiple loci. In the bi-allelic case, only one additive effect and one dominance effect are needed at each locus, and the locus-by-locus interactions can be easily included for constructing a full re-parameterization of the genotypic values. For one QTL with multiple alleles, how to define the dominance effects for various allelic interactions is not straightforward, especially when the phases of its genotypes are unknown. The extension to multiple loci is further complicated by the much more complex structure of locus-by-locus interactions. How to present the model and define the various genetic variance components is not a trivial task. To construct a one-locus general multi-allele (GMA) model, we overcome the phase problem by appropriately merging the paternal and maternal allelic effects and allelic interactions in the phase-known situation. Typically, with phase-unknown genotypes at a locus, we may have to assume that the paternal and maternal alleles have the same frequencies and contribute the same genetic effects, so that we can merge them without distinguishing their parental origins. With phase-known genotypes, we can further break down the additive variance component into paternal and maternal variance components. For multiple QTL with multiple alleles, we develop concise expressions for constructing multi-locus GMA models and defining various genetic components. Explicit formulas for calculating various genetic variance components in equilibrium populations are also derived.

This manuscript is organized as follows. First, we consider one multi-allele QTL with phase-known genotypes. Following the same strategy as adopted in Wang and Zeng (F_{∞} models and GMA models to build reduced models for the expected genotypic values is explored. Finally, we apply the GMA model to a published experimental data set.

The variation of a quantitative trait _{G}

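First, let us consider a single QTL with multiple alleles A_{1}, …, A_{m}. There are m^{2} possible phased genotypes (A_{i}, A_{j}), i, j = 1, …, m, where A_{i} denotes the paternal allele and A_{j} the maternal allele. Let G^{i}_{j} denote the expected genotypic value of genotype (A_{i}, A_{j}). The Fisher model decomposes the m^{2} possible expected genotypic values G^{i}_{j}, i, j = 1, …, m, as

G^{i}_{j} = μ + α^{i} + α_{j} + δ^{i}_{j}, (1)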

where α^{i} (or α_{j}) is the so-called _{i}_{j}^{i}_{j} is the _{i}_{j}_{1}, …, _{m}

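Let p^{i} be the frequency of the paternal allele A_{i} and p_{j} the frequency of the maternal allele A_{j}. With V_{A} and V_{D} denoting the additive and dominance variance components, the total genotypic variance V_{G} = Σ_{i,j} p^{i}p_{j}(G^{i}_{j} − μ)^{2} has an orthogonal partition V_{G} = V_{A} + V_{D}.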
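The orthogonal partition can be checked numerically. The following is a minimal sketch, assuming genotype frequencies of the product form p^{i}p_{j} (Hardy-Weinberg equilibrium) and the usual weighted-mean definitions of the Fisher effects; the function name `fisher_partition` and the numerical values of G and p are hypothetical and chosen only for illustration.

```python
import numpy as np

def fisher_partition(G, p_pat, p_mat):
    """Fisher-type decomposition of phased genotypic values under HWE.

    G[i, j] is the expected value of genotype (A_i paternal, A_j maternal);
    p_pat and p_mat hold the paternal and maternal allele frequencies.
    """
    G = np.asarray(G, dtype=float)
    p_pat = np.asarray(p_pat, dtype=float)
    p_mat = np.asarray(p_mat, dtype=float)
    mu = p_pat @ G @ p_mat                      # overall mean
    alpha_pat = G @ p_mat - mu                  # alpha^i (paternal effects)
    alpha_mat = p_pat @ G - mu                  # alpha_j (maternal effects)
    delta = G - mu - alpha_pat[:, None] - alpha_mat[None, :]
    freq = np.outer(p_pat, p_mat)               # assumed HWE genotype frequencies
    V_G = np.sum(freq * (G - mu) ** 2)
    V_A = p_pat @ alpha_pat**2 + p_mat @ alpha_mat**2
    V_D = np.sum(freq * delta**2)
    return mu, alpha_pat, alpha_mat, delta, V_G, V_A, V_D

# Hypothetical three-allele example (illustrative values only).
G = np.array([[10.0, 12.0, 15.0],
              [11.0, 14.0, 18.0],
              [13.0, 17.0, 25.0]])
p = np.array([0.5, 0.3, 0.2])
mu, a_pat, a_mat, delta, V_G, V_A, V_D = fisher_partition(G, p, p)
print(V_G, V_A + V_D)   # the two numbers agree up to rounding
```

The effects computed this way satisfy the frequency-weighted zero-sum constraints, which is what makes the additive and dominance parts uncorrelated in this setting.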

Note that there are in total m^{2} + 2m + 1 parameters involved in Fisher model (1), including the intercept μ, which is more than the total number m^{2} of the expected genotypic values G^{i}_{j}, i, j = 1, …, m.

With these constraints, Fisher (

where ^{·}_{.} = ^{i}_{.} = _{i}^{·}_{j} = _{j}

Here we propose a way to get rid of the redundant parameters. Let us first introduce the following indicator variables that describe the transmission inheritance of the paternal and maternal alleles.

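z_{Pi} = 1 if the paternal allele is A_{i} and 0 otherwise, and z_{Mj} = 1 if the maternal allele is A_{j} and 0 otherwise, for i, j = 1, …, m.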

Then we can re-write the Fisher model (1) as

where _{Pi}_{Mj}

Now, we further exclude the redundant parameters in model (3). For a diploid subject such as a human being, the genotype at a locus on a pair of homologous chromosomes consists of two alleles, one from the father and the other from the mother. Therefore, we always have z_{Pm} = 1 − Σ_{i=1}^{m−1} z_{Pi} and z_{Mm} = 1 − Σ_{j=1}^{m−1} z_{Mj}.

Model (4) provides a full re-parameterization of the m^{2} expected genotypic values G^{i}_{j}, in which we can interpret α^{*i} (or α_{*j}) as the average (additive) allelic effect of the paternal (or maternal) allele A_{i} (or A_{j}) and δ^{*i}_{*j} as the average allelic interaction between the paternal allele A_{i} and the maternal allele A_{j}, relative to the baseline allele A_{m}. In terms of the Fisher model, α^{*i} = α^{i} − α^{m}, α_{*j} = α_{j} − α_{m} and δ^{*i}_{*j} = δ^{i}_{j} − δ^{m}_{j} − δ^{i}_{m} + δ^{m}_{m}, for i, j = 1, …, m − 1.
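As an illustration of this re-parameterization, the sketch below computes the baseline-relative effects from a table of phased genotypic values and verifies that an intercept, the mean-corrected indicators x_{Pi} = z_{Pi} − p^{i} and x_{Mj} = z_{Mj} − p_{j}, and their products reproduce every G^{i}_{j}. It assumes that model (4) has this form, with the last allele as baseline; the function names and numerical values are hypothetical.

```python
import numpy as np

def gma_phase_known(G, p_pat, p_mat):
    """Baseline-relative effects, taking the last allele A_m as baseline."""
    G = np.asarray(G, dtype=float)
    row = G @ p_mat                   # marginal means over maternal alleles
    col = p_pat @ G                   # marginal means over paternal alleles
    mu = p_pat @ row
    a_pat = (row - row[-1])[:-1]      # alpha^{*i} = alpha^i - alpha^m
    a_mat = (col - col[-1])[:-1]      # alpha_{*j} = alpha_j - alpha_m
    # delta^{*i}_{*j} = delta^i_j - delta^m_j - delta^i_m + delta^m_m
    d = G[:-1, :-1] - G[-1:, :-1] - G[:-1, -1:] + G[-1, -1]
    return mu, a_pat, a_mat, d

def genotypic_value(i, j, params, p_pat, p_mat):
    """Evaluate the assumed form of model (4) for phased genotype (A_i, A_j)."""
    mu, a_pat, a_mat, d = params
    k = len(a_pat)
    x_pat = (np.arange(k) == i).astype(float) - p_pat[:-1]   # x_Pi = z_Pi - p^i
    x_mat = (np.arange(k) == j).astype(float) - p_mat[:-1]   # x_Mj = z_Mj - p_j
    return mu + a_pat @ x_pat + a_mat @ x_mat + x_pat @ d @ x_mat

# Hypothetical three-allele example (illustrative values only).
G = np.array([[10.0, 12.0, 15.0],
              [11.0, 14.0, 18.0],
              [13.0, 17.0, 25.0]])
p = np.array([0.5, 0.3, 0.2])
params = gma_phase_known(G, p, p)
rebuilt = np.array([[genotypic_value(i, j, params, p, p) for j in range(3)]
                    for i in range(3)])
print(np.allclose(rebuilt, G))
```

The parameter count is 1 + 2(m − 1) + (m − 1)^{2} = m^{2}, so no constraints are needed and the model can be handed to an ordinary least-squares routine.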

Model (4) retains most of the nice features of the original Fisher model (1) on partition of the genotypic variance. Based on this model, we have the genetic additive variance components _{Pi}, i_{Mj}, j_{AP}_{AM}_{D}

In HWD, the disequilibrium measurements can be captured by the covariances between the index variables _{Pi}_{Mj}_{m}_{AP}_{AM}_{D}

Note that the partition of the total genotypic variance _{G}_{G}_{G}

where, for _{k}_{k}_{k} is a model residual contributed by other environmental and genetic factors that cannot be captured by _{k}_{k}_{k}, ^{2}_{ϵ}, we have _{y}^{2}_{ϵ}. To further partition _{Pi}_{k}_{Mj}_{k}_{AP}_{AM}_{D}_{P}_{M}_{P}, A_{M}_{P}, D_{M}, D

respectively, where

In this subsection, we consider modeling a multi-allele QTL with phase unknown genotypes—a more common situation in practice. As we cannot distinguish the parental origins of alleles in QTL genotypes, as usual, we assume that the paternal and maternal gametes share the same set of alleles with the same allele frequencies. Let _{1}, …, _{m}_{i}, i_{i}A_{i}, i_{i}A_{j}_{ij} = _{i}A_{j}_{ij}_{ji}_{ij}_{ij}

where α_{i} is the average (additive) allelic effect of the paternal or maternal allele A_{i}, and δ_{ij} is the average allelic interaction between two alleles A_{i} and A_{j}.

From the symmetric property of G_{ij}, the δ_{jk}'s are symmetric. Similarly, the above constraints together with the symmetry property of δ_{jk} make it difficult to fit model (5) using the standard LS approach.

Note that model (5) can be treated as a special case of model (1) or (3) with α^{i} = α_{i} and δ^{i}_{j} = δ^{j}_{i} for ^{*i}_{Pi}_{*i}_{Mi}^{*i}_{*j}_{Pi}x_{Mj}^{*j}_{*i}_{Pj}x_{Mi}^{*i} = α_{*i} as α^{*}_{i} for ^{*j}_{*i} = δ^{*i}_{*j} as δ^{*}_{ij} for

where, for

and for

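Here A^{c}_{ij} denotes an allele which is different from A_{i} and A_{j}. The index variables w_{i}, v_{ii} and v_{ij} are obtained by merging the phase-known index variables x_{Pi}, x_{Mj}. We can interpret α^{*}_{i} as the average allelic effect of allele A_{i} and δ^{*}_{ij} as the average allelic interaction between two alleles A_{i} and A_{j}, relative to the baseline allele A_{m}: α^{*}_{i} = α_{i} − α_{m}, for i = 1, …, m − 1, and δ^{*}_{ij} = δ_{ij} − δ_{im} − δ_{jm} + δ_{mm}, for 1 ≤ i ≤ j ≤ m − 1.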
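The index variables for a phase-unknown genotype can be sketched in code as follows. This is a minimal illustration, assuming w_{i} = x_{Pi} + x_{Mi}, v_{ii} = x_{Pi}x_{Mi} and v_{ij} = x_{Pi}x_{Mj} + x_{Pj}x_{Mi} for i < j, with equal paternal and maternal allele frequencies; the function name `index_variables` and the allele frequencies are hypothetical.

```python
import numpy as np
from itertools import combinations_with_replacement

def index_variables(genotype, p):
    """Mean-corrected index variables for an unphased genotype (a, b).

    Alleles are coded 0..m-1 and the last allele is the baseline.
    Returns w (length m-1) and a dict v keyed by (i, j) with i <= j < m-1.
    """
    a, b = genotype
    m = len(p)
    # Arbitrarily call a "paternal" and b "maternal"; w and v are symmetric
    # in (a, b), so the unknown phase does not matter.
    xP = np.eye(m)[a] - p
    xM = np.eye(m)[b] - p
    w = (xP + xM)[:-1]
    v = {}
    for i, j in combinations_with_replacement(range(m - 1), 2):
        if i == j:
            v[(i, j)] = xP[i] * xM[i]
        else:
            v[(i, j)] = xP[i] * xM[j] + xP[j] * xM[i]
    return w, v

p = np.array([0.5, 0.3, 0.2])          # hypothetical allele frequencies
w, v = index_variables((0, 1), p)      # heterozygote A_1 A_2
print(w, v)
```

For the heterozygote A_{1}A_{2} this gives v_{12} = (1 − p_{1})(1 − p_{2}) + p_{1}p_{2}, and swapping the two alleles of the genotype leaves every index variable unchanged.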

Model (6) is an extension of the one-locus G2A model proposed in Zeng et al. (_{ij}_{ii}^{*}_{ii} can keep the same interpretation as δ^{*i}_{*i} in model (4). In addition, the combined index variables _{ij}^{*}_{i} = _{Pi}_{Mi}^{*}_{ij}(_{Pi}z_{Mj}_{Pj}z_{Mi}_{ij}^{*}_{ij} − (_{i}w^{*}_{j} + _{j}w^{*}_{i}) + 2_{i}p_{j}

Still, model (6) retains the nice feature of the classical Fisher's model on partition of the genotypic variance. The additive variance component _{A}_{D}

Here we define δ^{*}_{ji} = δ^{*}_{ij}, for _{Pi}_{Mj}

As an example, let us consider a QTL with 3 alleles A_{1}, A_{2}, and A_{3}. By taking A_{3} as the baseline allele, model (6) leads to

Or, in a matrix form, we have

If we choose A_{1} (or A_{2}) instead of A_{3} as the baseline allele, we can obtain different re-parameterizations of the six expected genotypic values. But they all give the same partition on the variance of the expected genotypic values. Álvarez-Castro and Yang (
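The invariance of the variance partition to the choice of baseline allele can be illustrated numerically. The sketch below builds the 6 × 6 design matrix of a three-allele, phase-unknown model for each possible baseline, solves for the parameters, and compares the resulting additive variance, dominance variance and their covariance. It assumes the index-variable coding w_{i} = x_{Pi} + x_{Mi}, v_{ii} = x_{Pi}x_{Mi}, v_{ij} = x_{Pi}x_{Mj} + x_{Pj}x_{Mi}; the genotypic values and genotype frequencies are hypothetical and deliberately not in Hardy-Weinberg proportions.

```python
import numpy as np
from itertools import combinations_with_replacement

def variance_partition(G, freq, baseline):
    """Return (V_A, V_D, Cov(A, D)) for a given baseline allele.

    G and freq are dicts keyed by unphased genotype (a, b) with a <= b.
    """
    genos = sorted(G)
    m = 1 + max(b for _, b in genos)
    f = np.array([freq[g] for g in genos], dtype=float)
    f = f / f.sum()
    p = np.zeros(m)                              # allele frequencies from genotypes
    for (a, b), fg in zip(genos, f):
        p[a] += fg / 2
        p[b] += fg / 2
    keep = [i for i in range(m) if i != baseline]
    pairs = list(combinations_with_replacement(keep, 2))
    W, V = [], []
    for a, b in genos:
        xP, xM = np.eye(m)[a] - p, np.eye(m)[b] - p
        W.append([xP[i] + xM[i] for i in keep])
        V.append([xP[i] * xM[i] if i == j else xP[i] * xM[j] + xP[j] * xM[i]
                  for i, j in pairs])
    W, V = np.array(W), np.array(V)
    X = np.column_stack([np.ones(len(genos)), W, V])
    beta = np.linalg.solve(X, np.array([G[g] for g in genos]))
    A = W @ beta[1:1 + len(keep)]                # additive component per genotype
    D = V @ beta[1 + len(keep):]                 # dominance component per genotype
    dA, dD = A - f @ A, D - f @ D
    return f @ dA**2, f @ dD**2, f @ (dA * dD)

# Hypothetical values, for illustration only.
G = {(0, 0): 10.0, (0, 1): 12.0, (1, 1): 13.0,
     (0, 2): 15.0, (1, 2): 19.0, (2, 2): 20.0}
freq = {(0, 0): 0.20, (0, 1): 0.30, (1, 1): 0.15,
        (0, 2): 0.15, (1, 2): 0.12, (2, 2): 0.08}
results = [variance_partition(G, freq, b) for b in range(3)]
for b, r in enumerate(results):
    print(b, r)
```

The parameter estimates themselves differ between baselines; only the per-genotype additive and dominance components, and hence the variance partition, stay the same.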

Similar to the phase-known case, we can estimate the additive, dominance variance components and the covariance Cov(_{i}(g_{k})_{ij}(g_{k})_{A}_{D}

The one-locus GMA models can be extended to multiple loci. Typically, for each locus ^{(k)}_{Pi}^{(k)}_{Mj}_{k,i}, v_{k,ij}

Consider _{k1}, …, _{kmk}^{2}_{1}··· ^{2}_{L} possible expected genotypic values: ^{s1···sL}_{t1···tL} = _{1s1}···_{LsL}_{1t1}···_{LtL}_{1s1}···_{LsL}_{1t1}···_{LtL}_{1s1}···_{LsL}_{1t1}···_{LtL}_{k}, t_{k}_{k}^{P}_{ksk}^{M}_{ktk}_{ksk}_{ktk}^{(k)}_{Psk}_{ksk} (_{k}_{k}^{(k)}_{Mtk}_{ktk}_{k}_{k}_{1m1}, …, _{LmL}

where the summation of _{k}_{k}_{k}_{k}_{k}_{ksk}_{ktk}_{k}_{k}^{s*1···s*L}_{t*1···t*L} in each term represents an average allelic effect of a single paternal or maternal allele, or an allelic interaction from a set of paternal and maternal alleles that are involved in this term with respect to the baseline alleles _{1m1}, …, _{LmL}^{s*1···s*L}_{t*1···t*L} are defined as ^{*}_{k} = _{k}_{k}^{*}_{k} = _{k}_{k}^{*}_{k} = 0 (or ^{*}_{k} = 0). Note that _{P}_{M}

The multi-locus GMA model (8) provides a full re-parameterization of the ^{2}_{1}···^{2}_{L} expected genotypic values ^{s1···sL}_{t1···tL} with phase-known genotypes without using redundant parameters. Note that

where _{1}, _{1}, …, _{L}, j_{L}^{0···0}_{0···0} is a constant, which corresponds to an intercept without any alleles being involved. Therefore, for a ^{2L}−1) genetic variance components _{1}, _{1}, …, _{L}, j_{L}

respectively (

and the two-locus paternal by maternal variance component is

The variance component of epistases with the highest order is

which has 2

Based on these genetic components, we can partition the variance of the expected genotypic values into the variances and covariances of the genetic components. Still, the coefficients α^{s*1···s*L}_{t*1···t*L} are defined based on the baseline alleles _{1m1}, …, _{LmL}^{(k)}_{Psk}, s_{k}_{k}^{(k)}_{Mtk}, t_{k}_{k}^{(j)}_{Psj}, x^{(j)}_{Mtj}, s_{j}, t_{j}_{j}^{(k)}_{Psk}, x^{(k)}_{Mtk}, s_{k}, t_{k}_{k}^{(k)}_{Psk}, s_{k}_{k}, k^{(k)}_{Mtk}, t_{k}_{k}, k^{0···0}_{0···0} and an orthogonal partition on the variance of the expected genotypic values is given by

where

Similarly, we can construct multi-locus GMA models for QTL with phase unknown genotypes. Without distinguishing the parental origin of the alleles, there are totally ∏^{L}_{k = 1 }_{k}_{k}^{L} possible expected genotypic values: _{s1t1···sLtL} = _{1s1}_{1t1}, …, _{LsL}A_{LtL}_{k}, t_{k}_{k}_{ksk}_{k, i}_{k,ij}_{1m1}, …, _{LmL}

where 1_{{ik=j}} is the Kronecker function which equals 1 when _{k}_{k}_{k}_{k}_{k}_{k}_{ksk}_{ktk}_{k}_{k}_{ksk}_{k,sk}_{k}^{*} = _{k}_{k}_{ksk}_{ktk}_{k,sktk}^{*}_{k}_{k}t_{k}_{k}_{k}_{k}_{s*1···s*L} represents the average allelic effect of a single allele, or an allelic interaction from all the alleles that are involved in this term, with respect to the baseline alleles _{1m1}, …, _{LmL}

Based on the above model, we can define the genetic components as

for _{1}, …, _{L}_{1} = ··· = _{L}^{L} − 1 genetic variance components _{i1···iL}) for _{1}, …, _{L}

respectively, for

The variance component of epistases with the highest order of 2

Under both the gametic and linkage equilibria, we have _{0···0} and an orthogonal partition on the variance of the expected genotypic values

where

Here, when _{k}^{*}_{k} = _{k}t_{k}^{*}′_{k} = _{k}t′_{k}_{s*1···s*L} (or α_{s*′1···s*′L}) to be the same if we switch the order of _{k}_{k}^{*}_{k} (or _{k}_{k}^{*}′_{k}_{s*1···s*L} in model (10) are defined based on the baseline alleles _{1m1}, …, _{LmL}
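The bookkeeping of genetic components in the multi-locus models can be sketched as a small enumeration. The code below assumes that a phase-known component is identified by the non-empty subset of the 2L paternal and maternal alleles that enter its terms, and that a phase-unknown component is identified by whether each locus contributes no allele, one allele (additive) or two alleles (dominance). The labels and function names are hypothetical and serve only to count components and terms.

```python
from itertools import product
from math import prod

def phase_known_components(L):
    """Non-empty subsets of the 2L parental alleles (P = paternal, M = maternal)."""
    factors = [f"{origin}{k}" for k in range(1, L + 1) for origin in ("P", "M")]
    return [tuple(f for f, used in zip(factors, mask) if used)
            for mask in product((0, 1), repeat=2 * L) if any(mask)]

def phase_unknown_components(L):
    """Per locus: '-' (not involved), 'A' (one allele), 'D' (two alleles)."""
    return [c for c in product("-AD", repeat=L) if set(c) != {"-"}]

def n_terms_phase_unknown(component, m):
    """Number of model terms in a component; m[k] is the allele count at locus k."""
    size = {"-": lambda mk: 1,
            "A": lambda mk: mk - 1,
            "D": lambda mk: mk * (mk - 1) // 2}
    return prod(size[s](mk) for s, mk in zip(component, m))

L = 2
m = [3, 4]                                   # hypothetical allele counts
known = phase_known_components(L)
unknown = phase_unknown_components(L)
total_terms = sum(n_terms_phase_unknown(c, m) for c in unknown)
print(len(known), len(unknown), total_terms)
```

With one intercept added, the total number of phase-unknown terms equals the product of m_{k}(m_{k} + 1)/2 over loci, that is, the number of joint unphased genotypes, which is why the full model is a re-parameterization rather than an over-parameterization.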

In practice, we do not have to rely on the derived formulas to estimate the genetic variance or covariance components. Similar to the one-locus case, given the observed QTL genotypes for a random sample from a study population, we can always incorporate model (8) or (10) into a regression model with other possible adjusted covariates and fit the model using the standard LS approach. Then we can estimate various genetic variance components, as well as the covariances among different genetic components, based on the fitted model. A good fit of a fully parameterized GMA model often requires that the expected genotypic values for all possible joint genotypes of the QTL be estimable from the study sample. If certain genotypes are not observed or are rarely present in the study sample, which is likely when the number of alleles or the number of QTL is large and the sample size is moderate or small, the design matrix for the genetic effects could become singular, which implies that some genetic variance components cannot be estimated reliably. However, we do not have to use fully parameterized GMA models to model the expected genotypic values. In this case, we may want to build a reduced GMA model that provides a good overall approximation to the expected genotypic values while having a less complicated model structure. The fact that two terms within the same genetic component are unavoidably correlated suggests that we should perhaps treat each genetic component as a whole and keep or drop its terms all at once in building a GMA model. As genetic components of lower orders tend to have a bigger impact on the expected genotypic values than higher-order ones, one way to construct a reduced GMA model is a hierarchical stepwise forward selection procedure: at each step, add to the model the lowest-order genetic component that has not yet been selected and that achieves a nominal significance level (e.g., 5%). Here, the classical likelihood ratio statistic can be used to assess each genetic component for entering into or dropping from the model.
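The regression-based estimation can be illustrated with simulated data. The sketch below is a one-locus, three-allele example under stated assumptions: genotypes are sampled in Hardy-Weinberg proportions, the trait follows a fully parameterized phase-unknown model with normal residuals, allele frequencies are estimated from the sample, and the variance components are taken as sample variances of the fitted additive and dominance components. All parameter values and names are hypothetical.

```python
import numpy as np

def design(a, b, p):
    """Index variables w_1, w_2 and v_11, v_12, v_22 (allele 3 as baseline)."""
    xP, xM = np.eye(len(p))[a] - p, np.eye(len(p))[b] - p
    W = (xP + xM)[:, :2]
    V = np.column_stack([xP[:, 0] * xM[:, 0],
                         xP[:, 0] * xM[:, 1] + xP[:, 1] * xM[:, 0],
                         xP[:, 1] * xM[:, 1]])
    return W, V

rng = np.random.default_rng(0)
n = 20000
p_true = np.array([0.5, 0.3, 0.2])        # hypothetical allele frequencies
alpha = np.array([1.0, -0.5])             # hypothetical alpha*_1, alpha*_2
delta = np.array([0.4, 0.2, -0.3])        # hypothetical delta*_11, delta*_12, delta*_22

a = rng.choice(3, size=n, p=p_true)       # first allele of each subject
b = rng.choice(3, size=n, p=p_true)       # second allele of each subject
W0, V0 = design(a, b, p_true)
y = 10.0 + W0 @ alpha + V0 @ delta + rng.normal(0.0, 1.0, n)

# Allele frequencies are estimated from the genotype data before coding.
p_hat = (np.bincount(a, minlength=3) + np.bincount(b, minlength=3)) / (2 * n)
W, V = design(a, b, p_hat)
X = np.column_stack([np.ones(n), W, V])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

A_hat, D_hat = W @ beta[1:3], V @ beta[3:]    # fitted genetic components
V_A, V_D = A_hat.var(), D_hat.var()
cov_AD = np.cov(A_hat, D_hat)[0, 1]
print(beta, V_A, V_D, cov_AD)
```

A reduced model would simply drop the columns of a whole component (for example all of V) before the least-squares fit, in line with the suggestion to keep or drop the terms of a component together.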

It is well known that the model building procedure is often sensitive to potential confounding among the selected variables. A GMA model uses the mean-corrected index variables x^{(k)}_{Psk}, x^{(k)}_{Mtk} (or w_{k,sk}, v_{k,sktk}), whereas an F_{∞} model can be thought of as directly using the inheritance indicator variables z^{(k)}_{Psk}, z^{(k)}_{Mtk} (or w^{*}_{k,sk} = z^{(k)}_{Psk} + z^{(k)}_{Msk}, v^{*}_{k,sktk} = z^{(k)}_{Psk}z^{(k)}_{Mtk} + z^{(k)}_{Ptk}z^{(k)}_{Msk}). An F_{∞} model could have its low-order terms highly confounded with other high-order terms when they contain shared alleles (see Zeng et al., F_{∞} models, the stepwise forward selection procedure could be problematic because failing to include a significant higher-order term (or component) in a reduced F_{∞} model could make the assessment of some low-order terms (or components) unreliable. When selecting significant QTL from a set of loci without locus-by-locus interactions, the choice between the GMA and the F_{∞} model in building a reduced model for the expected genotypic values should not matter much, because using mean-corrected index variables mainly affects the intercept in this case. However, when we consider including epistases for a given set of QTL, the GMA model can use the orthogonality among different genetic components to dissect the confounding, at least in equilibrium populations, while it appears that the F_{∞} models cannot make full use of the equilibrium information. When disequilibria are present, as Hardy-Weinberg equilibrium is expected to hold in most human genomic regions and LD is mainly present between closely linked loci, we would expect most of the genetic components in a GMA model to be uncorrelated. Therefore, in most cases, using the GMA model could still be preferable to using the F_{∞} model in building reduced models for expected genotypic values, especially when epistases are involved.
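The confounding argument can be made concrete with a small exact calculation. The sketch below considers two bi-allelic loci in Hardy-Weinberg and linkage equilibrium and compares the correlation between an additive term and an additive-by-additive term under raw allele-count coding (as in an F_{∞}-type model) and under mean-corrected coding (as in a GMA-type model). The allele frequencies are hypothetical, and the two-locus bi-allelic setting is a simplification chosen for brevity.

```python
import numpy as np
from itertools import product

def corr(u, v, f):
    """Correlation of u and v under the probability weights f."""
    mu, mv = f @ u, f @ v
    cov = f @ ((u - mu) * (v - mv))
    return cov / np.sqrt((f @ (u - mu) ** 2) * (f @ (v - mv) ** 2))

p1, p2 = 0.3, 0.6        # hypothetical frequencies of the non-baseline alleles

rows = []
for zP1, zM1, zP2, zM2 in product([0, 1], repeat=4):
    # Probability of this phased two-locus genotype under full equilibrium.
    f = np.prod([q if z else 1 - q
                 for z, q in [(zP1, p1), (zM1, p1), (zP2, p2), (zM2, p2)]])
    rows.append((zP1 + zM1, zP2 + zM2, f))
w1_raw, w2_raw, f = map(np.array, zip(*rows))

w1, w2 = w1_raw - 2 * p1, w2_raw - 2 * p2      # mean-corrected allele counts

r_raw = corr(w1_raw, w1_raw * w2_raw, f)       # raw coding
r_gma = corr(w1, w1 * w2, f)                   # mean-corrected coding
print(r_raw, r_gma)
```

Under raw coding the main-effect column is strongly correlated with the interaction column, so omitting a real interaction distorts the apparent main effect; under mean-corrected coding the two columns are uncorrelated in this equilibrium setting.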

As an example, we apply the GMA model to a published experimental data set on the polymorphism at the human acid phosphatase locus (ACP1). The analysis of this data set was first conducted by Greene et al. (^{ac}) and inhibition (^{in}). The estimates of the expected genotypic values and the genotype frequencies are summarized in Table

Genotype | AA | AB | BB | AC | BC | CC |
ACP1 enzyme activity (G^{ac}) | 122.4 | 153.9 | 188.3 | 183.6 | 212.3 | 240.0 |
ACP1 enzyme inhibition (G^{in}) | 41.2 | 37.9 | 34.4 | 58.7 | 53.1 | 76.0 |
Genotype frequency | 0.1242 | 0.4139 | 0.3349 | 0.0445 | 0.0799 | 0.0025 |

From the genotype frequencies, we first estimate the allele frequencies p_{A}, p_{B} and p_{C}. We then evaluate the index variables w_{1}(g), w_{2}(g), v_{11}(g), v_{22}(g) and v_{12}(g) for each genotype g and obtain the LSE of the model parameters. For ACP1 enzyme activity, we have α^{*}_{1} = −59.260, α^{*}_{2} = −26.254, δ^{*}_{11} = −4.800, δ^{*}_{22} = 3.700 and δ^{*}_{12} = −2.000. For ACP1 enzyme inhibition, we have μ = 39.386, α^{*}_{1} = −16.149, α^{*}_{2} = −19.714, δ^{*}_{11} = −0.200, δ^{*}_{22} = 4.200, and δ^{*}_{12} = 2.100.
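The reported estimates can be reproduced from the tabulated values with a few lines of code. The sketch below assumes that the six table columns correspond to genotypes AA, AB, BB, AC, BC and CC in that order and that allele C is the baseline (allele 1 = A, allele 2 = B); both are assumptions, adopted because they reproduce the reported δ^{*} values. The closed-form expressions use the relations α^{*}_{i} = α_{i} − α_{m} and δ^{*}_{ij} = δ_{ij} − δ_{im} − δ_{jm} + δ_{mm}; the function names are hypothetical.

```python
import numpy as np

# Assumed genotype order: AA, AB, BB, AC, BC, CC (alleles coded A=0, B=1, C=2).
genos = [(0, 0), (0, 1), (1, 1), (0, 2), (1, 2), (2, 2)]
G_ac = [122.4, 153.9, 188.3, 183.6, 212.3, 240.0]   # enzyme activity
G_in = [41.2, 37.9, 34.4, 58.7, 53.1, 76.0]         # enzyme inhibition
freq = [0.1242, 0.4139, 0.3349, 0.0445, 0.0799, 0.0025]

def allele_freqs(genos, freq, m=3):
    """Allele frequencies implied by the (renormalized) genotype frequencies."""
    p = np.zeros(m)
    for (a, b), f in zip(genos, freq):
        p[a] += f / 2
        p[b] += f / 2
    return p / p.sum()

def gma_params(genos, values, p):
    """Intercept, alpha* and delta* with the last allele as baseline."""
    m = len(p)
    Gm = np.zeros((m, m))
    for (a, b), g in zip(genos, values):
        Gm[a, b] = Gm[b, a] = g
    marg = Gm @ p                       # allele-frequency-weighted marginal means
    mu = p @ marg
    alpha = (marg - marg[-1])[:-1]
    delta = Gm[:-1, :-1] - Gm[-1:, :-1] - Gm[:-1, -1:] + Gm[-1, -1]
    return mu, alpha, delta

p = allele_freqs(genos, freq)
mu_ac, alpha_ac, delta_ac = gma_params(genos, G_ac, p)
mu_in, alpha_in, delta_in = gma_params(genos, G_in, p)
print(p)
print(alpha_ac, delta_ac)
print(mu_in, alpha_in, delta_in)
```

Small differences in the last digit relative to the reported values can arise because the tabulated genotype frequencies sum to 0.9999 and are renormalized here.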

Next, for each trait separately, we calculate _{k})^{*}_{1}_{1}(_{k}^{*}_{2}_{2}(_{k}_{k}^{*}_{11}_{11}(_{k}^{*}_{22}_{22}(_{k}^{*}_{12}_{12}(_{k}_{A}_{D}^{ac}_{A}^{ac}_{A}_{D}^{in}_{A}^{in}_{A}_{D}_{A}^{in}_{D}

Using the same allele frequencies but assuming HWE, we would have slightly different genotype frequencies. Note that the LSE of the model parameters remain the same and do not depend on the genotype frequencies, because they are completely determined by the allele frequencies and the six expected genotypic values when a fully parameterized GMA model is used. However, the total variance of the expected genotypic values and its variance components will be different. For ACP1 enzyme activity, we obtain _{A}_{D}^{ac}_{A}^{ac}_{A}_{D}^{in}_{A}^{in}|
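The effect of the genotype frequencies on the variance partition can be sketched as follows for the enzyme activity trait. The code assumes the genotype order AA, AB, BB, AC, BC, CC for the tabulated values, and it uses the fact that, for a fully parameterized one-locus model, the additive component of genotype A_{a}A_{b} reduces to α_{a} + α_{b}, with α_{i} the allele-frequency-weighted marginal mean minus μ, and the dominance component is the remainder. The numerical outputs are whatever the code computes and are not taken from the text.

```python
import numpy as np

# Assumed genotype order: AA, AB, BB, AC, BC, CC (alleles coded A=0, B=1, C=2).
genos = [(0, 0), (0, 1), (1, 1), (0, 2), (1, 2), (2, 2)]
G = np.array([122.4, 153.9, 188.3, 183.6, 212.3, 240.0])   # enzyme activity
f_obs = np.array([0.1242, 0.4139, 0.3349, 0.0445, 0.0799, 0.0025])
f_obs = f_obs / f_obs.sum()

p = np.zeros(3)                          # allele frequencies from genotypes
for (a, b), f in zip(genos, f_obs):
    p[a] += f / 2
    p[b] += f / 2

# HWE genotype frequencies implied by the same allele frequencies.
f_hwe = np.array([p[a] * p[b] * (1 if a == b else 2) for a, b in genos])

Gm = np.zeros((3, 3))
for (a, b), g in zip(genos, G):
    Gm[a, b] = Gm[b, a] = g
mu = p @ Gm @ p
alpha = Gm @ p - mu                      # allelic effects; sum_i p_i alpha_i = 0
A = np.array([alpha[a] + alpha[b] for a, b in genos])   # additive component
D = G - mu - A                                          # dominance component

def partition(f):
    """V_A, V_D and Cov(A, D) under genotype frequencies f."""
    dA, dD = A - f @ A, D - f @ D
    return f @ dA**2, f @ dD**2, f @ (dA * dD)

VA_obs, VD_obs, cov_obs = partition(f_obs)
VA_hwe, VD_hwe, cov_hwe = partition(f_hwe)
print(VA_obs, VD_obs, cov_obs)
print(VA_hwe, VD_hwe, cov_hwe)
```

Under the HWE frequencies the covariance term vanishes and the partition is orthogonal, whereas under the observed frequencies the total variance equals V_{A} + V_{D} + 2Cov(A, D).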

For this three-allele example, we also applied the NOIA model using the formulas (10) and (11) provided in Álvarez-Castro and Yang (_{D}_{D}

In the analysis of genetic variance components, separating the variation contributed by the additive allelic effects from that contributed by allelic interactions is complicated by the fact that the observed genotypes are often phase-unknown. In this study, by appropriately merging the paternal and maternal allelic effects and allelic interactions in the phase-known situation, we propose a way to construct one-locus and multi-locus GMA models for the analysis of genetic variance components for QTL with multiple alleles. In the same way as building a G2A model, we construct a GMA model by first specifying its design matrix for the genetic effects via some mean-corrected index variables. As these mean-corrected index variables are well defined based on the observed genotypes and allele frequencies, they can be treated as regular covariates for coding QTL genotypes. These one-locus or multi-locus GMA models can then be incorporated into standard regression models with other possible adjusted covariates and fitted using the standard LS approach. Based on the fitted models, we can further estimate the genetic variance and covariance components through the sample variances and covariances of various genetic components. As we have pointed out, these GMA models can be applied to equilibrium populations as well as populations in Hardy-Weinberg and/or linkage disequilibria. By using the full set or a low-order subset of the index variables (or genetic components), the GMA model allows us to make either a full or a reduced re-parameterization of the genotypic values. When some loci have phase-known genotypes while other loci have phase-unknown genotypes (a possible hypothetical situation), a mixed GMA model could also be constructed by adopting the same modeling strategy.

Sometimes we may want to perform hypothesis tests on the existence of certain genetic variance and covariance components. Note that the GMA models have allele frequencies being involved in their design matrices. As allele frequencies often need to be estimated from the genotype data, they could contribute another source of variation in the LSE (^{2}E[(^{−1}], where

In genetic studies, QTL with missing genotypes are common. The GMA model can be used to fit QTL with missing genotypes. Rather than excluding patients with missing QTL genotypes, we could treat "missing" as an allele, although this strategy may induce potential bias because it assumes that all the missing alleles have the same genetic effect. GMA models could also be applied in combination with various imputation methods. In recent years, there has been a great deal of interest in developing methodologies for QTL mapping using recombinant intercrosses from multiple inbred lines. In this case, the locations and genotypes of the putative QTL are often unknown. But the allele frequencies of the QTL could probably be inferred from the study design, and the QTL genotypes might be imputable from their neighboring genetic markers. How to apply GMA models to this type of experimental cross for QTL mapping could be a research topic for further exploration.

In summary, the analysis of genetic variance components for multi-allele QTL has been challenging due to complex allelic interactions and locus-by-locus interactions. In this study, we thoroughly explored the architecture of one-locus and multi-locus GMA models with either phase-known or phase-unknown genotypes. In particular, we described in detail the architecture of the multi-locus GMA model and how the model terms can be grouped into various genetic components. For equilibrium populations, we also derived formulas for the orthogonal partition of the genetic variance components, which could be useful for analytical assessment of the variance components. Compared with the classical Fisher model, the GMA models can estimate the genetic variance and covariance components more conveniently via the standard LS approach for either one or multiple QTL with multiple alleles, in equilibrium as well as disequilibrium populations.

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author would like to acknowledge Dr. Zhao-Bang Zeng at Bioinformatics Research Center, North Carolina State University, for his thoughtful comments and suggestions on an earlier version of the manuscript.

In HWD, we can represent the genotype frequencies as _{(Ai,Aj)}^{i}p_{j}^{i}_{j}^{i}_{j}^{i}_{j} = _{(Ai,Aj)}^{i}_{j} = Cov(_{Pi}, x_{Mj}

where _{AP}_{AM}_{D}

For the covariances, we have

As the paternal and maternal alleles are correlated in HWD, we will likely have non-zero covariances among A_{P}, A_{M} and D.

In HWD, we can represent the genotype frequencies as _{AiAi}^{2}_{i}_{ii}_{AiAj}_{i}p_{j}_{ij}_{i}_{AiAi}_{j≠i}P_{AiAj}_{ij}_{ji}_{ij}_{Pi}, x_{Mj}_{A}_{D}

In HWD, the genotype frequencies can also be parameterized as _{AiAi}^{2}_{i} + _{i}_{i}_{AiAj}_{i}p_{j}_{ii}_{i}_{i}_{ij}_{i}p_{j}f