Edited by: Mariza De Andrade, Mayo Clinic, USA

Reviewed by: Bjarni V. Halldorsson, Reykjavik University, Iceland; Tian-Qing Zheng, Chinese Academy of Agricultural Sciences, China

*Correspondence: Fabrice Larribe, Département de Mathématiques, Université du Québec à Montréal, Case Postale 8888, Succursale Centre-Ville, Montréal, QC H3C 3P8, Canada

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

We present a methodology which jointly infers haplotypes and the causal alleles at a gene influencing a given trait. Often in human genetic studies, the available data consists of genotypes (series of genetic markers along the chromosomes) and a phenotype. However, for many genetic analyses, one needs haplotypes instead of genotypes. Our methodology is not only able to estimate haplotypes conditionally on the disease status, but is also able to infer the alleles at the unknown disease locus. Some applications of our methodology are in genetic mapping and in genetic counseling.

In human genetic studies, the unobserved raw data consists of two DNA sequences for each individual (see Figure

For many genetic analyses however, the haplotype data is required, and in some cases even this information may not suffice. The unknown allelic state at the TIM is also required and this is an issue addressed by our work. Indeed, many methodologies use genealogies to infer parameters of populations (such as recombination rate and mutation rate) and these genealogies must be built using haplotypes, not genotypes, since the haplotypes contain the additional information about which genetic material was transmitted from one ancestor to a child. All such methodologies have a way to deal with this problem: some include an estimate of the haplotypes, like the Margarita program (Minichiello and Durbin,

To our knowledge, none of the current methodologies is able to jointly estimate haplotypes and the alleles at a causal gene. To infer haplotypes from genotypes, laboratory or computational methods can be used (Browning and Browning,

Most of the aforementioned methodologies do not take into account the phenotype, nor the genetic model. In our context of a case/control study, one would not want to ignore the phenotype information. Unlike our method, none of the existing methodologies proposes to estimate the alleles at the TIM.

Finally, note that recent methodologies for estimating haplotypes use human sequence data, but some parts of the human genome are still difficult to sequence, which can limit the use of these methods. Our method can be used on human sequences as well as on animal or plant sequences, and could also be extended to non-diploid organisms.

Note that the EM algorithm presented here, by taking into account the phenotype, is in a way an extension of the work of Excoffier and Slatkin (

Assume a large population of diploid individuals in Hardy-Weinberg equilibrium, where we can observe a dichotomous trait that depends on at least one Trait Influencing Mutation (TIM). As the phenotype ϕ depends on the TIM through a genetic model, it is actually possible for the trait to be dependent on several TIMs, but we consider only one TIM at a time. Let _{0} denote the distribution of haplotypes among non carriers of the TIM, and _{1} denote the distribution of haplotypes among carriers. Alternatively, carrier haplotypes will be called mutant haplotypes, and non carrier haplotypes, primitive haplotypes. For a given type of haplotype _{0}(_{1}(_{0}, _{1}, _{2}) associated with the TIM we are studying is such that _{i} is the probability for an individual to express the trait given that it bears

Since the population is in Hardy-Weinberg equilibrium, we can easily calculate specific probabilities of the form _{1}, δ_{2})]; the sample spaces for ^{2} is the probability for an individual to be a double carrier, and _{2} the probability for a double carrier to express the trait, the probability for any individual

The other seven cases are treated similarly (see Table

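Table 1. Joint probabilities of the allelic status (δ_{1}, δ_{2}) at the TIM and of the phenotype ϕ under Hardy-Weinberg equilibrium, writing p for the frequency of the TIM, f_{i} for the probability of expressing the trait given i copies of the TIM, and K for the probability that an individual of the population expresses the trait.

(δ_{1}, δ_{2}) | ϕ = 1 | ϕ = 0 | Total |
(0, 0) | f_{0}(1 − p)^{2} | (1 − f_{0})(1 − p)^{2} | (1 − p)^{2} |
(0, 1) | f_{1}p(1 − p) | (1 − f_{1})p(1 − p) | p(1 − p) |
(1, 0) | f_{1}p(1 − p) | (1 − f_{1})p(1 − p) | p(1 − p) |
(1, 1) | f_{2}p^{2} | (1 − f_{2})p^{2} | p^{2} |
Total | K | 1 − K | 1 |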
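The eight joint probabilities and the marginal probability of expressing the trait can be checked numerically. The following is a minimal sketch, assuming the names `p` (TIM frequency), `penetrance` (probabilities of expressing the trait with 0, 1 or 2 copies of the TIM) and `prevalence`; these names are illustrative choices, not notation taken from the article.

```python
from itertools import product


def joint_status_phenotype(p, penetrance):
    """Return {((d1, d2), phi): probability} under Hardy-Weinberg equilibrium.

    p          -- assumed name for the frequency of the TIM
    penetrance -- (f0, f1, f2): probability of expressing the trait given
                  0, 1 or 2 copies of the TIM (assumed names)
    """
    table = {}
    for d1, d2 in product((0, 1), repeat=2):
        # Under Hardy-Weinberg equilibrium the two alleles are independent.
        p_status = (p if d1 else 1 - p) * (p if d2 else 1 - p)
        f = penetrance[d1 + d2]
        table[(d1, d2), 1] = p_status * f        # expresses the trait
        table[(d1, d2), 0] = p_status * (1 - f)  # does not express the trait
    return table


def prevalence(p, penetrance):
    """Marginal probability of expressing the trait."""
    table = joint_status_phenotype(p, penetrance)
    return sum(prob for (_, phi), prob in table.items() if phi == 1)
```

For example, with `p = 0.1` and `penetrance = (0.01, 0.1, 0.5)`, the double-carrier cell for an affected individual is 0.1² × 0.5 = 0.005.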

Let ^{0} be a non carrier haplotype of type ^{1} a carrier haplotype of type _{1}, δ_{2} ∈ {0, 1}. Let _{0} and _{1}, we are in an incomplete data problem, where the complete data is the set of phenotypes Φ and the set of diplotypes _{1}, δ_{2}) at the causal gene were known, then the probability of the diplotype _{0} and _{1}, and we would then have:

Since the phenotype depends on the diplotype only through the causal gene, the joint probability of the diplotype

We have previously seen how to calculate probabilities of the form _{1}, δ_{2})] (see for example Equation 1); hence it becomes easy to calculate the above probability for each of the eight combinations of δ_{1}, δ_{2} and ϕ; for instance:

Because individuals are assumed independent, the likelihood of (_{0}, _{1}) on the complete data is:

Since the probabilities _{i}, _{i1}, δ_{i2})] do not depend on the distributions _{0} and _{1} but only on the penetrance model

where the last expression is obtained by taking the product over the types of haplotypes instead of individuals, and _{0} and _{1} are the frequencies _{0} and _{1} were known.

Denote the expectation of the sufficient statistics by:

We then have to maximize the function:

with the constraints

i.e., we obtain maximum likelihood estimates from the complete data. It can be shown that _{L} is a maximum if

Applying the constraints, the

where

For now we have seen how to evaluate

which gives, by conditioning on the phenotype:

Using _{i} = 1] = _{i} = 0] = 1 −

The joint probability of the genotype _{i} and the phenotype ϕ_{i} is obtained from Equation (3) by summing over all the possible diplotypes:

Then, the probability of a diplotype _{i} and the genotype _{i}, is, using Equations (3) and (5):

We can see that the conditional probability depends only on the distributions _{0} and _{1}, and the probabilities _{i}].

Let's now evaluate the conditional expectation _{g, ϕ} be the number of individuals with genotype

will receive the sequence ^{δ} from their mother, and

from their father. As usual, the conditional probability of having a given sequence as the maternal haplotype is obtained by summing on all the compatible paternal haplotypes (and vice versa). Recall that if there is no missing information on the genotypes, there is a unique sequence _{g} compatible with _{g}) ∈

These two probabilities being equal by symmetry, the mean number of copies of ^{δ} carried by the _{g, ϕ} individuals presenting this profile is

We then obtain

for each ^{δ}. Note that the method can be generalized to missing data, by considering every combination of haplotypes compatible with the observed genotypes.
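The enumeration of diplotypes compatible with a genotype, on which the conditional expectation relies, can be sketched as follows. The genotype coding (number of copies, 0, 1 or 2, of the allele coded 1 at each marker) and the function names are assumptions made for this illustration; missing data are not handled.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """List ordered (maternal, paternal) haplotype pairs matching a genotype.

    genotype -- tuple with one entry per marker: the number of copies
                (0, 1 or 2) of the allele coded 1 (assumed coding)
    """
    per_marker = []
    for g in genotype:
        if g == 0:
            per_marker.append([(0, 0)])
        elif g == 2:
            per_marker.append([(1, 1)])
        else:  # heterozygous: either parent may have transmitted allele 1
            per_marker.append([(0, 1), (1, 0)])
    pairs = []
    for choice in product(*per_marker):
        maternal = tuple(m for m, _ in choice)
        paternal = tuple(f for _, f in choice)
        pairs.append((maternal, paternal))
    return pairs


def complement(genotype, haplotype):
    """The unique haplotype that pairs with `haplotype` to give `genotype`."""
    return tuple(g - h for g, h in zip(genotype, haplotype))
```

A genotype with k heterozygous markers yields 2^k ordered pairs, and for each maternal haplotype the paternal one is determined uniquely, which is the property used in the text when there is no missing information.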

We have assumed until now that the sample was obtained by simple random sampling from the population, but usually this is not the case in genetics, since most samples are obtained using a case/control design. Let _{1} be the number of cases in the sample of size _{1} ∕

In this section we show that the algorithm described in Section 2.2 is robust to this case/control sampling. The proportions given in Table _{1}] =

_{00} |
|||

_{01} |
|||

_{10} |
|||

_{11} |
|||

Total | ω | 1 − ω | 1 |
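As a worked illustration of why the column totals are ω and 1 − ω, the proportions in a case/control sample can be written out under assumed notation: p for the TIM frequency, f_i for the penetrance with i copies of the TIM, and K for the population probability of expressing the trait. These symbols are illustrative and may differ from those of the original table.

```latex
% Assumed notation: p = TIM frequency, f_i = penetrance with i copies,
% K = P[\phi = 1] in the population, \omega = proportion of cases.
\begin{align*}
P[(\delta_1,\delta_2),\ \phi = 1 \mid \omega]
  &= \omega\,
     \frac{f_{\delta_1+\delta_2}\, p^{\delta_1+\delta_2}(1-p)^{2-\delta_1-\delta_2}}{K}, \\
P[(\delta_1,\delta_2),\ \phi = 0 \mid \omega]
  &= (1-\omega)\,
     \frac{(1-f_{\delta_1+\delta_2})\, p^{\delta_1+\delta_2}(1-p)^{2-\delta_1-\delta_2}}{1-K}, \\
P[(\delta_1,\delta_2) \mid \phi = 1,\ \omega]
  &= \frac{f_{\delta_1+\delta_2}\, p^{\delta_1+\delta_2}(1-p)^{2-\delta_1-\delta_2}}{K}
   = P[(\delta_1,\delta_2) \mid \phi = 1].
\end{align*}
```

Summing the first line over the four values of (δ_1, δ_2) gives ω, and the last line shows that ω cancels once the phenotype is conditioned upon.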

Let's now review the steps of the algorithm. The likelihood on the complete data for such a stratified sample, conditional on the number of cases, is:

Since knowledge of the phenotypes Φ carries knowledge of _{1}, _{1} can be removed from the numerator. Moreover, the probability of obtaining _{1} cases from a simple random sample does not depend on the distributions _{0} and _{1}. After removing terms which do not depend on _{0} and _{1}, the likelihood for this data is the same as before:

which shows the likelihood remains the same for case/control sampling, and hence the M step of the algorithm remains unchanged.

Recall that the E step depended on diplotypes' probabilities, conditional on the genotype and the phenotype. We prove that these probabilities are not modified by the type of sampling. Let's begin by calculating the joint probability of a diplotype and a phenotype, conditional on ω, the proportion of cases; this probability is obtained by adding a condition on the proportion of cases in Equation (3):

Once the status at the causal gene is determined, the diplotype probability depends only on the distributions _{0} and _{1}. Moreover, if the phenotype is known,

Following the derivation for a simple random sample, the term _{i} ∣ ω] cancels out in the conditional probability formula, and we get the same result as before. Because these probabilities are not affected by the sampling design, the E step of the algorithm, described in Equation (8) remains the same. We have shown that this EM algorithm can be applied to case/control samples.

Assume the penetrance model _{0}, _{1}, _{2}) and the frequency of the causal mutation are known. The steps of our algorithm are:

Compute the probabilities

Consider an initial

E Step:

For each genotype

(1) Evaluate, for all

(2) Sum these probabilities to obtain:

(3) The conditional probability can then be computed as:

Compute, for each sequence ^{δ} (see Equation 8):

M Step: update the _{·} distributions, by evaluating, for all

Convergence test. Convergence is reached when

Once convergence is reached, let
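The steps above can be sketched in a compact form. This is a minimal illustration, not the authors' C++ implementation: it assumes complete genotypes coded 0/1/2, a known TIM frequency `p` with 0 < p < 1, a known penetrance triple, and a uniform initial distribution; all names are hypothetical.

```python
from itertools import product


def phase_options(genotype):
    """Ordered haplotype pairs compatible with a 0/1/2-coded genotype."""
    sites = [[(0, 0)] if g == 0 else [(1, 1)] if g == 2 else [(0, 1), (1, 0)]
             for g in genotype]
    return [(tuple(a for a, _ in c), tuple(b for _, b in c))
            for c in product(*sites)]


def em_haplotypes(sample, p, penetrance, tol=1e-8, max_iter=500):
    """Estimate haplotype distributions among non carriers (index 0) and
    carriers (index 1) from a list of (genotype, phenotype) pairs.

    p and penetrance = (f0, f1, f2) are taken as known (assumed names).
    """
    options = [phase_options(g) for g, _ in sample]
    haps = sorted({h for opts in options for pair in opts for h in pair})
    # Initial step: uniform distributions over the compatible haplotypes.
    dist = [{h: 1.0 / len(haps) for h in haps} for _ in (0, 1)]
    for _ in range(max_iter):
        counts = [{h: 0.0 for h in haps} for _ in (0, 1)]
        for (_, phi), opts in zip(sample, options):
            weights = {}
            for (h1, h2), (d1, d2) in product(opts, product((0, 1), repeat=2)):
                f = penetrance[d1 + d2]
                status = (p if d1 else 1 - p) * (p if d2 else 1 - p)
                pheno = f if phi == 1 else 1 - f
                # Joint probability of diplotype, carrier status and phenotype.
                weights[h1, d1, h2, d2] = (status * pheno
                                           * dist[d1][h1] * dist[d2][h2])
            total = sum(weights.values())
            # E step: conditional probability given genotype and phenotype,
            # accumulated as expected haplotype counts.
            for (h1, d1, h2, d2), w in weights.items():
                counts[d1][h1] += w / total
                counts[d2][h2] += w / total
        # M step: renormalise expected counts within each carrier class.
        new = [{h: c / sum(cnt.values()) for h, c in cnt.items()}
               for cnt in counts]
        change = max(abs(new[d][h] - dist[d][h]) for d in (0, 1) for h in haps)
        dist = new
        if change < tol:  # convergence test on the largest frequency change
            break
    return dist
```

The convergence criterion used here (largest absolute change in any estimated frequency) is one common choice and is an assumption of this sketch.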

We have assumed that the proportion of carrier haplotypes is known, which is of course not realistic in practice. Note however, by assuming that the penetrance model

if _{0} + _{2} − 2_{1} = 0, then _{0}) ∕ 2(_{1} − _{0}). In general, however, we have:

and there exists a solution in [0, 1] which satisfies the penetrance model. If 0 ≤ _{0} ≤ _{1} ≤ _{2}, then the solution is unique. The methodology has been implemented in C++, and is available from the corresponding author.
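Recovering the carrier frequency from the penetrance model amounts to solving a quadratic equation. The sketch below assumes the names `K` for the population probability of expressing the trait and `(f0, f1, f2)` for the penetrances, and solves f0(1 − p)² + 2·f1·p(1 − p) + f2·p² = K; these names are illustrative.

```python
import math


def tim_frequency(K, penetrance, eps=1e-12):
    """Return the roots p in [0, 1] of
    (f0 + f2 - 2*f1) * p**2 + 2*(f1 - f0) * p + (f0 - K) = 0.
    """
    f0, f1, f2 = penetrance
    a = f0 + f2 - 2 * f1
    b = 2 * (f1 - f0)
    c = f0 - K
    if abs(a) < eps:
        # Degenerate (linear) case: p = (K - f0) / (2 * (f1 - f0)).
        if abs(b) < eps:
            raise ValueError("penetrances are equal: p is not identifiable")
        roots = [-c / b]
    else:
        disc = b * b - 4 * a * c
        if disc < 0:
            raise ValueError("no real solution for these values")
        s = math.sqrt(disc)
        roots = [(-b + s) / (2 * a), (-b - s) / (2 * a)]
    valid = [r for r in roots if 0 <= r <= 1]
    if not valid:
        raise ValueError("no solution in [0, 1]")
    return valid
```

With `penetrance = (0.01, 0.1, 0.5)` and `K = 0.0311`, the only root in [0, 1] is p = 0.1.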

For the proposed illustration, we have used ^{8} haplotypes are possible in theory, but only 24 of them are compatible with the observed genotypes. Figure _{0} and _{1} for each of the 24 possible types of haplotypes in the sample. By comparing individual values, _{1}(1) and _{0}(·) seem to be slightly better than those of _{1}(·); this is due to the fact that we have more information on control haplotypes than on case haplotypes, because phenocopy causes many case haplotypes to be non carriers.

_{1} (left) and _{0} (right)^{8} possible haplotypes of length 8 markers, only 24 were compatible with the observed genotypes.

The methodology presented in this paper is the first to permit one to jointly estimate the haplotypes at genotyped markers and the (non observed) alleles at the TIM. Estimating the causal alleles could be very useful in genetic counseling for example, where the patient's risk and treatment could be adjusted if the alleles at the disease genes were known. In the sequel, we assume that haplotypes are known and we assess the capacity of our method to infer the causal alleles. As explained in Dupont (

In the EM algorithm, the number of parameters increases as the number of genetic markers increases: since the markers are binary, if sequences of length ^{d} possible haplotypes, leading to a maximum of 2^{d} − 1 parameters to estimate. For this reason, and because huge numbers of genetic markers are available today, the method is illustrated here using a moving windows strategy, i.e., we use windows made of
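To make the parameter count and the moving-window idea concrete, here is a minimal sketch. The `step` parameter and the index convention are assumptions of this illustration; the window layout actually used in the article is not reproduced here.

```python
def max_parameters(width):
    """Upper bound on free haplotype frequencies for `width` binary markers."""
    return 2 ** width - 1


def moving_windows(n_markers, width, step=1):
    """Yield (start, stop) index ranges of windows of consecutive markers.

    step is a hypothetical parameter: the shift between successive windows.
    """
    for start in range(0, n_markers - width + 1, step):
        yield start, start + width
```

For 8 markers, `max_parameters(8)` gives 255, which is why the number of markers per window has to stay small.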

Let _{cas} and _{con} be the numbers of case and control haplotypes, and ^{0} and ^{1} the numbers of non carrier and carrier haplotypes, respectively. Let

and semi-partial success rates:

finally, we have the global success rate:

All these rates have different meanings, and each is useful depending on the question of interest. In particular, π^{0} is the probability of estimating a non carrier when the individual is a non carrier, in other words the specificity, while π^{1} is the probability of estimating a carrier when the individual is a carrier, in other words the sensitivity. The probability π is known as the accuracy. As with all classification rules, it is not informative to achieve high sensitivity without specificity and vice versa. Accuracy alone is not an ideal measure of success for a low-frequency TIM, as high accuracy could be achieved by simply setting _{1} = 0. Figure
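The three rates discussed here can be computed as follows. This is a minimal sketch assuming carrier statuses coded 0 (non carrier) and 1 (carrier) and a non-empty sample; the function name is illustrative.

```python
def success_rates(true_status, estimated_status):
    """Return (specificity, sensitivity, accuracy) for 0/1 carrier statuses."""
    pairs = list(zip(true_status, estimated_status))
    n0 = sum(1 for t, _ in pairs if t == 0)   # true non carriers
    n1 = len(pairs) - n0                      # true carriers
    hit0 = sum(1 for t, e in pairs if t == 0 and e == 0)
    hit1 = sum(1 for t, e in pairs if t == 1 and e == 1)
    specificity = hit0 / n0 if n0 else float("nan")
    sensitivity = hit1 / n1 if n1 else float("nan")
    accuracy = (hit0 + hit1) / len(pairs)
    return specificity, sensitivity, accuracy
```

With nine non carriers and one carrier, estimating every haplotype as a non carrier gives an accuracy of 0.9 but a sensitivity of 0, which is the weakness of accuracy alone for a rare TIM.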

The accuracy of our method depends on several factors. We have identified three of them: the genetic model, the sample size, and the windows width RR_{1} = _{1} ∕ _{0} ∈ {1.01, 1.1, 2, 10}, RR_{2} = _{2} ∕ _{0} ∈ {1.01, 1.1, 2, 10}, for different sample sizes (_{con} ∕ _{cas} ∈ {100∕100, 200∕200, 400∕400, 800∕800}) and for various window widths RR_{1} and RR_{2} implies there is less information in the data to infer mutant haplotypes. To obtain an informative range of values for RR_{1} and RR_{2}, we fixed _{0} at 0.01 and allowed _{1} and _{2} to take every value in the set {0.0101, 0.011, 0.02, 0.1} such that _{0} ≤ _{1} ≤ _{2}. These combinations lead to various genetic models, including recessive and dominant ones. Regarding the windows width, we expect short windows to contain less information about the data; however, very wide windows can produce many haplotypes that are observed only once, making ^{0} and ^{1} difficult to estimate. Each sample originates from the same population, with a TIM frequency of
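The grid of genetic models described above can be enumerated directly. In this minimal sketch the names `f0`, `f1`, `f2` for the three penetrances are assumptions; the numerical values are those given in the text.

```python
from itertools import combinations_with_replacement


def genetic_models(f0=0.01, values=(0.0101, 0.011, 0.02, 0.1)):
    """List the penetrance triples with f0 <= f1 <= f2 and their relative risks."""
    models = []
    # On sorted input, combinations_with_replacement yields pairs with f1 <= f2.
    for f1, f2 in combinations_with_replacement(sorted(values), 2):
        models.append({
            "penetrance": (f0, f1, f2),
            "RR1": round(f1 / f0, 6),
            "RR2": round(f2 / f0, 6),
        })
    return models
```

This gives ten models, ranging from RR1 = RR2 = 1.01 to RR1 = RR2 = 10.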

Figure _{1} and RR_{2}. When relative risks are low, we observe that the rates are constant along the sequences, with

RR_{1} = RR_{2} = 10, for a sample size of 400/400 cases/controls, and windows of 16 SNPs

The effect of the windows width (

RR_{1} = RR_{2} = 10

An overview of the effect of the different factors on the global rate π (the accuracy) is shown in Figure

RR_{1} = RR_{2} = 10, windows of 16 SNPs. RR_{1} = RR_{2} = 10, 400/400 cases/controls.

RR_{1} = RR_{2} = 10, windows of 16 SNPs. RR_{1} = RR_{2} = 10, 400/400 cases/controls.

RR_{1} = RR_{2} = 10, windows of 16 SNPs. RR_{1} = RR_{2} = 10, 400/400 cases/controls.

We have compared this EM methodology to a simpler naive method, which consists of testing the association of each marker in the region with the phenotype, and inferring the alleles at the TIM to be the alleles at the marker having the strongest association with the phenotype. To illustrate this procedure, we have used 7503 heterozygote markers to test the association on the data in the case RR_{1} = RR_{2} = 2. As shown in Figure ^{1} and the specificity π^{0} for the naive method, and to compare them with our previous estimates obtained using our method. This comparison is illustrated in Figure ^{1} and π^{0} (from Figure _{10} RR_{1} = RR_{2} = 10, then it is likely that one of the markers could perform as well as or better than the EM method, because the association between the marker and the phenotype in this case would be more direct. This illustrates that our EM method surpasses the naive method in the most interesting cases.
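The naive method can be sketched as a marker-by-marker association scan. The choice of a Pearson chi-square statistic on the 2 × 2 allele-by-status table is an assumption of this illustration, since the test used is not specified here; the function names are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def most_associated_marker(haplotypes, is_case):
    """Return (index, statistic) of the marker most associated with case status.

    haplotypes -- list of equal-length tuples of 0/1 alleles
    is_case    -- list of booleans, one per haplotype
    """
    best_index, best_stat = None, -1.0
    for j in range(len(haplotypes[0])):
        rows = list(zip(haplotypes, is_case))
        a = sum(1 for h, case in rows if h[j] == 1 and case)
        b = sum(1 for h, case in rows if h[j] == 1 and not case)
        c = sum(1 for h, case in rows if h[j] == 0 and case)
        d = sum(1 for h, case in rows if h[j] == 0 and not case)
        stat = chi_square_2x2(a, b, c, d)
        if stat > best_stat:
            best_index, best_stat = j, stat
    return best_index, best_stat
```

The alleles at the returned marker are then used as the estimated alleles at the TIM, which is what the sensitivity and specificity of the naive method measure.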

RR_{1} = RR_{2} = 2. The green line indicates the success rate π^{0} (specificity) and the red line indicates the success rate π^{1} (sensitivity), both obtained with the TIM estimated by our proposed methodology; the four green dots are the specificity of the four most associated markers (naive method), and the four red dots are the sensitivity of the same four markers (naive method). _{10} of the

We have shown how to build an EM algorithm to jointly estimate haplotypes and unknown alleles at the TIM conditionally on the phenotype. In contrast to other methodologies, we use the phenotypic information available, and estimate the frequencies of haplotypes for non carriers, _{0}, and for carriers, _{1}; the method also estimates the alleles (δ_{1}, δ_{2}), opening new avenues. This method to estimate the alleles at the TIM can also be used with resolved haplotypes, and with missing values. We have shown that the methodology is robust to sampling from a case/control design, which is commonly used in genetic studies. We benchmarked the method on data simulated under the coalescent. The efficiency of the method to infer the alleles at the TIM depends mostly on the strength of the genetic model: when the relative risks are high, the success rates of correctly estimating the alleles are high. This implies that it would be relatively easy to infer TIM alleles for Mendelian traits; however, we probably need more data when relative risks are low. We observed that neither the frequency of the disease nor that of the causal alleles in the population had any impact on the efficiency of the method. This was to be expected given the case/control design, which implies an enrichment in cases, and thus in causal alleles, in the samples. We also compared our methodology to a naive method, which consists of estimating the alleles at the TIM by the alleles of the marker most significantly associated with the phenotype. By studying specificity and sensitivity, we have shown that the proposed method provides both higher specificity and higher sensitivity, especially around the true position of the TIM.

This work was supported by the Natural Sciences and Engineering Research Council of Canada through a grant to FL. GB has received scholarships from FQRNT and NSERC.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We thank our colleague S. Froda for helpful comments on earlier versions of the manuscript.