Edited by: Raina Robeva, Sweet Briar College, USA

Reviewed by: Tom M. W. Nye, Newcastle University, UK; Liang Liu, Harvard University, USA

*Correspondence: Ruriko Yoshida, Department of Statistics, University of Kentucky, 817 Patterson Office Tower, Lexington, KY 40506-0027, USA. e-mail:

^{†}Elissaveta Arnaoudova and David C. Haws contributed equally to this work.

This article was submitted to Frontiers in Systems Biology, a specialty of Frontiers in Neuroscience.

This is an open-access article subject to an exclusive license agreement between the authors and the Frontiers Research Foundation, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are credited.

We propose a statistical method to test whether two phylogenetic trees with given alignments are significantly incongruent. Our method compares the two distributions of phylogenetic trees given by two input alignments, instead of comparing point estimates of trees. This statistical approach can be applied to gene tree analysis, for example, to detect unusual events in genome evolution such as horizontal gene transfer and reshuffling. Our method uses the difference of means to compare two distributions of trees, after mapping trees into a vector space. Bootstrapping alignment columns can then be applied to obtain

Estimating differences between phylogenetic trees is one of the fundamental questions in computational biology. Conflicting phylogenies arise when, for example, different phylogenetic reconstruction methods are applied to the same data set, or even with one reconstruction method applied to multiple different genes. Gene phylogenies may be codivergent by virtue of congruence (identical trees) or insignificant incongruence. Otherwise, they may be significantly incongruent Maddison (

$H_0$: Phylogenetic trees $T_1$ and $T_2$ are congruent.

$H_1$: Phylogenetic trees $T_1$ and $T_2$ are incongruent.

Usually a statistical test on the above hypotheses considers point estimates of the trees obtained by a tree reconstruction method, such as maximum likelihood (ML) estimates (Felsenstein,

There are several techniques to test if gene trees are codiverged. For example, the Bayesian estimation methods (e.g., Ane et al.,

This paper is organized as follows: In Section

Let 𝒯_{n}_{n}

_{n}^{m} for some m, the vector v(T) is the_{n}

The difference between trees _{1},_{2} ∈ 𝒯_{n}_{2} norms.

A notable example of our framework is the

_{n}, let

_{2} norm (Euclidean length)

In our computational experiments, we will use the dissimilarity map distance. Dissimilarity map distance was studied in Buneman (
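As a rough illustration of the dissimilarity map distance, the sketch below treats a tree as a weighted undirected graph, collects the path lengths between all pairs of leaves into a vector, and takes the Euclidean length of the difference of two such vectors. The function names (`tree_from_edges`, `dissimilarity_map`, `dissimilarity_map_distance`), the edge-list input format, and the toy four-taxon trees are assumptions made for this example; they are not taken from the authors' software.

```python
import math
from itertools import combinations


def tree_from_edges(edges):
    """Build an adjacency map {node: {neighbor: length}} from (u, v, length) triples."""
    tree = {}
    for u, v, w in edges:
        tree.setdefault(u, {})[v] = w
        tree.setdefault(v, {})[u] = w
    return tree


def path_lengths_from(tree, source):
    """Path length from `source` to every node (paths in a tree are unique)."""
    dist = {source: 0.0}
    stack = [source]
    while stack:
        node = stack.pop()
        for nbr, w in tree[node].items():
            if nbr not in dist:
                dist[nbr] = dist[node] + w
                stack.append(nbr)
    return dist


def dissimilarity_map(tree, leaves):
    """Vector of pairwise leaf-to-leaf path lengths, in sorted pair order."""
    leaves = sorted(leaves)
    from_leaf = {a: path_lengths_from(tree, a) for a in leaves}
    return [from_leaf[a][b] for a, b in combinations(leaves, 2)]


def dissimilarity_map_distance(tree1, tree2, leaves):
    """Euclidean (L2) length of the difference of the two dissimilarity maps."""
    d1 = dissimilarity_map(tree1, leaves)
    d2 = dissimilarity_map(tree2, leaves)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(d1, d2)))


# Toy example (hypothetical trees): ((A,B),(C,D)) versus ((A,C),(B,D)), unit edges.
leaves = ["A", "B", "C", "D"]
t1 = tree_from_edges([("A", "u", 1), ("B", "u", 1), ("u", "v", 1), ("C", "v", 1), ("D", "v", 1)])
t2 = tree_from_edges([("A", "u", 1), ("C", "u", 1), ("u", "v", 1), ("B", "v", 1), ("D", "v", 1)])
print(dissimilarity_map_distance(t1, t2, leaves))
```

Both trees must be defined on the same leaf set so that the two vectors share a coordinate order; sorting the leaf names is one simple way to guarantee that.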

In our framework, given are _{1},_{2}, each a collection of _{1},_{2} were generated by models of sequence evolution on unknown trees _{1},_{2} ∈ 𝒯_{n}

For convenience, we describe our approach as comparing two gene trees _{1},_{2} ∈ 𝒯_{n}

Random fluctuations in sequence evolution can cause reconstructed gene trees for _{1} and _{2} to look at least slightly different, even if the true underlying trees are equal. Thus we need a way to tell if the difference between two estimated trees is “significant.”

One classical approach to assess variability in reconstructed trees is the bootstrap (Felsenstein,
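Bootstrapping an alignment amounts to resampling its columns with replacement. The following is a minimal sketch of that resampling step only, assuming the alignment is held as a mapping from taxon name to an aligned sequence string; the function name `bootstrap_columns` and the toy alignment are hypothetical, and tree reconstruction on each replicate is left to whatever estimator is being used.

```python
import random


def bootstrap_columns(alignment, n_columns=None, rng=None):
    """Return a bootstrap replicate of `alignment` (dict: taxon -> aligned sequence).

    Columns are drawn with replacement; by default the replicate has as many
    columns as the original alignment.
    """
    rng = rng or random.Random()
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same length")
    k = length if n_columns is None else n_columns
    # The same column indices are used for every taxon, so columns stay intact.
    cols = [rng.randrange(length) for _ in range(k)]
    return {taxon: "".join(seq[c] for c in cols) for taxon, seq in alignment.items()}


# Toy alignment (hypothetical data) and one replicate with a fixed seed.
aln = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}
replicate = bootstrap_columns(aln, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps replicates reproducible, which is convenient when the same resampled alignments need to be fed to several tree estimators.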

Here we propose a bootstrap procedure to assess significance of the distance between two trees. Our method is based on the triangle inequality. Namely, if _{1}),_{2}), then the triangle inequality says

which gives a lower bound on the distance between the true trees _{1},_{2} ∈ 𝒯_{n}_{1}) − _{2})|| = 0. So the inequality in Eq. _{1},_{2} ∈ 𝒯_{n}
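For reference, the triangle-inequality argument can be written out as follows. This is a sketch under assumed notation: $\hat{v}_1, \hat{v}_2$ denote the feature vectors of the estimated trees (the hat notation is an assumption introduced here), and $v(T_1), v(T_2)$ denote the feature vectors of the true trees.

```latex
\begin{align*}
\|\hat{v}_1 - \hat{v}_2\|
  &\le \|\hat{v}_1 - v(T_1)\| + \|v(T_1) - v(T_2)\| + \|v(T_2) - \hat{v}_2\|, \\
\text{so}\quad
\|v(T_1) - v(T_2)\|
  &\ge \|\hat{v}_1 - \hat{v}_2\| - \|\hat{v}_1 - v(T_1)\| - \|\hat{v}_2 - v(T_2)\|.
\end{align*}
```

If the true trees are equal, the left-hand side of the second line is zero, so the observed distance between the estimates cannot exceed the sum of the two estimation errors. An observed distance that is large relative to those errors is therefore evidence against equality.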

The bootstrap procedure we have proposed can be applied with any tree estimator, such as neighbor-joining or ML. Since we are presuming tree uncertainty is high, and Bayes estimator trees are more accurate than neighbor-joining or ML (Huggins et al.,

Given an alignment _{n}_{1}|_{1}) and _{2}|_{2}) of trees _{1},_{2} ∈ 𝒯_{n}_{1},_{2}, respectively, let _{1} drawn from _{1}|_{1}), and similarly let _{2} drawn from _{2}|_{2}). Then we can use _{1}), and similarly _{2}). The

and _{1}) − _{2})||.

Some feature space maps produce very high-dimensional feature vectors $v(T_1), v(T_2)$ for trees $T_1, T_2 \in \mathcal{T}_n$, yet the norm $\|v(T_1) - v(T_2)\|$ can be computed quickly without explicitly writing down the feature vectors for $T_1$ and $T_2$. Notable examples include Robinson–Foulds distance and quartet distance. In such cases, it would be desirable if the difference of means

_{1},x_{2},y_{1},y_{2}^{m} be four pairwise independent random variables, where x_{1} _{2} _{1}, _{2} _{1}) = 𝔼(_{2}) = μ_{x} and_{1}) = 𝔼(_{2}) = μ_{y}

A proof of Proposition 1 is provided in Supplementary Material. Using the proposition and a subroutine which computes the norm in Definition 2, the length
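The idea of obtaining the length of a difference of means from pairwise distances alone can be illustrated with a sample-level identity for the Euclidean norm: for samples $x_1,\dots,x_N$ and $y_1,\dots,y_M$, the squared distance between the sample means equals the mean squared cross distance minus half the mean squared within-sample distances. The sketch below is not a restatement of Proposition 1 and not the authors' implementation; the function names are hypothetical, and `sq_dist` stands in for any subroutine that returns a squared distance between two objects.

```python
import math
from itertools import product


def euclid_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def mean_sq_dist(xs, ys, sq_dist=euclid_sq):
    """Average squared distance over all pairs (x, y), x in xs and y in ys."""
    return sum(sq_dist(a, b) for a, b in product(xs, ys)) / (len(xs) * len(ys))


def diff_of_means_norm(xs, ys, sq_dist=euclid_sq):
    """Length of (mean of xs) - (mean of ys), using pairwise squared distances only.

    Relies on: ||xbar - ybar||^2
        = mean ||x_i - y_j||^2 - 0.5 * mean ||x_i - x_j||^2 - 0.5 * mean ||y_i - y_j||^2,
    which holds exactly when sq_dist is a squared Euclidean distance.
    """
    value = (mean_sq_dist(xs, ys, sq_dist)
             - 0.5 * mean_sq_dist(xs, xs, sq_dist)
             - 0.5 * mean_sq_dist(ys, ys, sq_dist))
    return math.sqrt(max(value, 0.0))  # clamp tiny negative rounding errors


# Toy check (hypothetical vectors): the means are (1, 0) and (1, 4).
xs = [(0.0, 0.0), (2.0, 0.0)]
ys = [(1.0, 3.0), (1.0, 5.0)]
print(diff_of_means_norm(xs, ys))
```

The feature vectors are never averaged explicitly, so the same routine applies when `sq_dist` is computed directly on trees, provided the underlying distance is the Euclidean norm of a feature-vector difference.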

In this section we estimate posterior distributions of phylogenetic trees via MCMC-based software ^{−8} was used to yield sequences with sequence divergence similar to real data. Table

Species depth | Min | Q1 | Median | Q3 | Max |
---|---|---|---|---|---|
1000K | 0.000 | 0.002 | 0.005 | 0.008 | 0.017 |
600K | 0.000 | 0.003 | 0.006 | 0.010 | 0.022 |
100K | 0.000 | 0.001 | 0.001 | 0.002 | 0.006 |
1000K | 0.032 | 0.040 | 0.043 | 0.045 | 0.054 |
600K | 0.025 | 0.030 | 0.032 | 0.035 | 0.046 |
100K | 0.004 | 0.007 | 0.008 | 0.012 | 0.016 |

In order to estimate posterior distributions we used the MCMC-based software

We generated simulated data sets in three different ways: (i) two separate sequence data sets generated from the same gene tree, (ii) two sequence data sets generated from two different gene trees under the same species tree, and (iii) two sequence data sets generated from two different gene trees whose species trees are also different. We tested 10 gene trees for each species depth (i.e., 30 different gene trees in total) generated under the same species tree. One can find the species trees we used in Figure

_{1} and _{2}. _{1} and _{2}. In

We also compared our method with two others: the statistical hypothesis testing described in Example 3 of Section 4.4.1 in Holmes (

For SH test we used _{1} is contained in the confidence region for an unknown tree _{2}. In our framework, both _{1} and _{2} are unknown. Thus we applied the SH procedure twice: once to test whether the ML estimate _{2}, and once to test whether _{1}. If both tests reject, then we declare that the overall procedure rejects _{1} = _{2}. We call this the “paired SH test.” To run the paired SH test at level α, each of the two individual SH tests is run at level α.
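The decision rule of the paired SH test can be sketched as below. This is only the combination step: the two p-values are assumed to come from separate runs of an SH test, and the function name `paired_sh_rejects`, its arguments, and the strict comparison `p < alpha` are assumptions made for this example.

```python
def paired_sh_rejects(p_tree1_vs_tree2, p_tree2_vs_tree1, alpha):
    """Reject equality of the two trees only if BOTH individual SH tests reject.

    Each individual test is run at the same level `alpha` as the overall procedure.
    """
    return p_tree1_vs_tree2 < alpha and p_tree2_vs_tree1 < alpha


# Hypothetical p-values from the two directions of the test.
print(paired_sh_rejects(0.01, 0.03, alpha=0.05))  # both reject
print(paired_sh_rejects(0.01, 0.20, alpha=0.05))  # only one rejects
```

Because the paired procedure rejects only when both individual tests do, its Type I error rate cannot exceed that of either individual test, so running each at level α keeps the overall level at most α.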

With these parameters, neither SH nor our method exhibited any false positives when the nominal Type I error rate was set to α ≤ 0.1. For α ≥ 0.05, SH had slightly more power, but our method was much more powerful than SH for small α. See Figure

We tested our method with a well-known gopher-louse data set (Hafner and Nadler,

Data set | |
---|---|
Gopher-louse (dataset 1) | 0.64 |
Gopher-louse (dataset 2) | 0.40 |
Gopher-louse (dataset 3) | 0.84 |
Gopher-louse (dataset 4) | 0.59 |
Grass-endophyte | 0.04 |
Grass-endophyte | 0.08 |
Grass-endophyte | 0.00 |

The posterior distributions were estimated using MrBayes with the following parameters: (i) for the model: GTR + Gamma + Invariant sites; (ii) for MCMC: number of runs: 1, number of chains: 2, chain length: 100,000, sample frequency: 1,000, burn-in: 25%; and (iii) for bootstrap sampling: 100 bootstrap samples, each with 379 columns, which is the length of the sequence alignments in the data sets.

We also tested our method with the data sets from Schardl et al. (

The posterior distributions were estimated using MrBayes with the following parameters: (i) for the model: GTR + Gamma + Invariant sites; (ii) for MCMC: number of runs: 1, number of chains: 2, chain length: 100,000, sample frequency: 1,000, burn-in: 25%; and (iii) for bootstrap sampling: 100 bootstrap samples, with the number of bootstrap columns equal to the length of the original alignment.

These results are interesting in comparison with the prior finding of a significant relationship between the phylogenies of the grasses and their endophytes (Schardl et al.,

We chose an additional biological data set to compare phylogenies of genes that occur together in endophyte genomes. Whereas

Data set | |
---|---|
0.39 | |
0.56 | |
0.94 | |
0.23 | |
0.34 | |
0.87 | |

To facilitate computations for our experiments, we developed a set of programs, collectively called

In this paper we presented a method to determine if two phylogenetic trees with given alignments are significantly incongruent. Our method computes the difference of means of posterior distributions of trees, which has the advantage of using entire tree distributions, as opposed to single tree estimators.

In this paper we used the triangle inequality (_{1} ≤ _{2} + _{3} in Figure _{1} ≤ max(_{2},_{3})]. We explored this in the Supplementary Material, and it seems that the max condition provides much more power, but is somewhat anti-conservative.
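The trade-off between the two conditions can be illustrated with a small sketch. It assumes that each bootstrap replicate yields three distances, written here as `d1`, `d2`, `d3` (the letter and the function name `fraction_satisfying` are assumptions), and that significance is judged by the fraction of replicates for which the chosen condition holds; this section does not spell out those details, so the code is one plausible reading rather than the authors' procedure.

```python
def fraction_satisfying(triples, condition="sum"):
    """Fraction of replicates (d1, d2, d3) with d1 <= d2 + d3 ("sum")
    or d1 <= max(d2, d3) ("max")."""
    if not triples:
        raise ValueError("need at least one replicate")
    if condition == "sum":
        bound = lambda d2, d3: d2 + d3
    elif condition == "max":
        bound = max
    else:
        raise ValueError("condition must be 'sum' or 'max'")
    hits = sum(1 for d1, d2, d3 in triples if d1 <= bound(d2, d3))
    return hits / len(triples)


# Hypothetical replicate distances.
triples = [(1.0, 0.6, 0.6), (1.0, 0.3, 0.4), (1.0, 1.2, 0.1), (1.0, 0.5, 0.5)]
print(fraction_satisfying(triples, "sum"))
print(fraction_satisfying(triples, "max"))
```

Since `max(d2, d3)` never exceeds `d2 + d3` for non-negative distances, every replicate that satisfies the max condition also satisfies the sum condition. The max condition therefore yields a fraction that is no larger, which is consistent with it rejecting more often: more power, at the cost of being less conservative.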

In this paper we used the dissimilarity map as a feature space. However, there are other common tree features which can be used to define different feature spaces. Examples of distances derived from tree features include (normalized) Robinson–Foulds distance (Robinson and Foulds, _{p}

The Supplementary Material for this article can be found online at

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

David Haws, Elissaveta Arnaoudova, Jerzy W. Jaromczyk, Christopher L. Schardl, and Ruriko Yoshida are supported by NIH R01 grant 5R01GM086888. Elissaveta Arnaoudova, Jerzy W. Jaromczyk, and Neil Moore developed the software