Edited by: Lingling An, University of Arizona, United States

Reviewed by: Michael B. Sohn, University of Rochester, United States; Hongmei Jiang, Northwestern University, United States

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Differential abundance analysis is a crucial task in many microbiome studies, where the central goal is to identify microbiome taxa associated with certain biological or clinical conditions. There are two modes of microbiome differential abundance analysis: individual-based univariate analysis and group-based multivariate analysis. The univariate analysis identifies differentially abundant microbiome taxa subject to multiple testing correction under a statistical error measure such as the false discovery rate, a task that is typically complicated by the high dimensionality of taxa and the complex correlation structure among them. The multivariate analysis evaluates the overall shift in microbiome composition between two conditions, which provides useful preliminary information on whether follow-up validation studies are warranted. In this paper, we present a novel

The human microbiome, defined as the aggregate of microorganisms that reside on or within human tissues and biofluids, has recently gained substantial scientific interest due to its vital role in many human health and disease conditions, including but not limited to obesity (Turnbaugh et al.,

In many microbiome studies, investigators are interested in how microbiome abundance is related to clinical characteristics of the samples, such as health/disease status, smoking status, or dietary habit (high-calorie or low-calorie). That is, many studies attempt to detect differentially abundant microbiome features (species/OTUs) between two predefined classes of samples, where a microbiome feature is considered differentially abundant if its mean proportion differs significantly between the two conditions. This type of analysis can improve understanding of the pathology of the disease from a microbiome perspective and potentially lead to preventive or therapeutic strategies (Virgin and Todd,

Similar to individual gene-based and pathway-based differential expression analysis, there are two types of microbiome differential analyses: individual taxon-based univariate analysis and taxa set-based multivariate analysis. Along with the recent huge scientific interest in microbiome studies, many statistical methods for microbiome differential analysis have also been proposed (Sohn et al.,

An alternative approach to taxon-level microbiome differential analysis is to compare the microbiome composition at the level of taxa-set. Examples of such a taxa set can be either a group of OTUs belonging to the same upper-level taxonomic rank (e.g., phylum, class, order, family, or genus) or even all OTUs in the microbiome community. The multivariate-type microbiome differential analysis usually gains power by reducing the multiple testing correction burden and aggregating modest effects across multiple taxa. Moreover, the multivariate analysis is typically less sensitive to normalization/transformation compared to individual analysis as it has a much larger analysis unit. Motivated by this, many statistical methods for microbiome community-level analysis have been recently proposed (McArdle and Anderson,

Despite the potential power gain, a major critique of these existing multivariate microbiome analyses (e.g., differential analysis) is that the test result is global and cannot identify the specific taxa in the taxa-set that are differentially abundant. Besides limiting the interpretation of results, this may also reduce the power of the test when the taxa-set contains many taxa that are not differentially abundant (Cao et al.,

Assume that we have measured the microbiome abundances of a community of $p$ taxa on $n$ $(= n_{1} + n_{2})$ samples collected from two groups with sizes $n_{1}$ and $n_{2}$, respectively. Here, the term community refers to a taxa-set, which typically consists of taxa from the same taxonomic rank such as genus, family, phylum, or bacteria kingdom. Let the $n_{k} \times p$ abundance matrices of the two groups ($k = 1, 2$) have mean vectors $\mu^{(1)}$ and $\mu^{(2)}$, respectively. In many practical problems, the hypothesis of interest is to examine whether microbiome abundances are different under two different conditions, that is,

$$H_{0}: \mu^{(1)} = \mu^{(2)} \quad \text{vs.} \quad H_{1}: \mu^{(1)} \neq \mu^{(2)}. \tag{1}$$

For microbiome data, due to the varying amount of DNA yielding materials across different samples, the count of microbiome sequencing reads can vary greatly from sample to sample. The normalization of the raw sequencing read counts to relative abundances makes the microbial abundances comparable across samples. Therefore, it is a common practice to analyze high-dimensional microbiome compositional data with a unit sum (Li,

A popular approach to relax the compositional constraint of microbiome data is to perform the statistical analysis through log-ratio transformations (Aitchison,

To avoid a zero relative abundance in Equation (2), as a common practice, a zero count is usually replaced by a pseudo count of 0.5 before the relative abundance normalization and centered log-ratio transformation (Li, _{k} and
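The preprocessing described above can be illustrated with a minimal sketch. It assumes that Equation (2) is the standard centered log-ratio transformation (the log relative abundance minus the per-sample mean of the log relative abundances) and that the input is a samples-by-taxa matrix of raw counts. The function name `clr_transform` and the toy counts are hypothetical.

```python
import numpy as np

def clr_transform(counts, pseudo_count=0.5):
    """Centered log-ratio transform of a samples-by-taxa count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Replace zero counts by the pseudo count before normalizing.
    counts = np.where(counts == 0, pseudo_count, counts)
    # Normalize each sample (row) to relative abundances with unit sum.
    rel = counts / counts.sum(axis=1, keepdims=True)
    # Subtract the per-sample mean log abundance (log of the geometric mean).
    log_rel = np.log(rel)
    return log_rel - log_rel.mean(axis=1, keepdims=True)

# Toy example (made-up counts): two samples, three taxa, with zeros.
counts = np.array([[10, 0, 5], [3, 7, 0]])
clr = clr_transform(counts)
```

Each row of the transformed matrix sums to zero by construction, which is the usual sanity check for this transformation.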

Two-sample testing on the equality of two high-dimensional means has been well studied in the statistical literature (Bai and Saranadasa,

An alternative approach to testing hypothesis (Equation 1) is to use a non-parametric test that does not require estimating the covariance matrix. One such test is the kernel-based maximum mean discrepancy (MMD) test (Gretton et al.,

In particular, the MMD statistic between two independent samples

where ^{2} statistic is zero, and thus, a larger MMD^{2} statistic indicates a larger discrepancy between the two distributions. Asymptotically, MMD^{2} follows a mixture of
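As an illustration of the statistic discussed here, the following sketch computes one common estimate of the squared MMD, the biased (V-statistic) version, with a Gaussian kernel of the form exp(-||x - y||^2 / rho). Whether this matches the exact estimator used in the paper cannot be confirmed from the text, and the bandwidth value in the example is an arbitrary placeholder. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, rho):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / rho)."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / rho)

def mmd2(x, y, rho):
    """Biased (V-statistic) estimate of the squared MMD between x and y."""
    kxx = gaussian_kernel(x, x, rho)
    kyy = gaussian_kernel(y, y, rho)
    kxy = gaussian_kernel(x, y, rho)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()

# Toy example: identical samples versus samples with a mean shift.
rng = np.random.default_rng(0)
x = rng.normal(size=(30, 5))
y = rng.normal(size=(30, 5)) + 3.0
rho = 5.0  # placeholder bandwidth, not the choice made in the paper
same = mmd2(x, x, rho)
shifted = mmd2(x, y, rho)
```

In this toy setting the statistic is zero when a sample is compared with itself and clearly positive for the shifted sample, in line with the interpretation that larger values indicate a larger discrepancy.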

A limitation of the aforementioned MMD test is that it equally utilizes information in all dimensions. When the signal is sparse, the MMD test typically has a low power due to the high degrees of freedom paid for many noise variables. The same phenomenon has been widely observed in the field of set-based genetic association studies (Cai et al.,

A common adaptive approach in a multivariate association test or two-sample test is to assign different weights to variables so that important variables are up-weighted and non-informative variables are down-weighted (Cai et al., _{1}, …, _{p}), we apply the test on a putative testing subset _{S}, where

An adaptive two-sample test for microbiome differential abundance analysis

Apply the centered log-ratio transformation Equation (2) to the microbiome composition matrix. Without loss of generality, we still use

Use the testing subset selection procedure described in section 2.4 to select a testing subset

For

Calculate the final
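The general flow of an adaptive two-sample test of this kind (select a testing subset, then test on that subset) can be sketched as follows. This is not the selection procedure of section 2.4, whose details are not reproduced here: a simple screen that keeps the taxa with the largest absolute Welch t-statistics is swapped in, the bandwidth is set to the subset size as a placeholder, and the p-value is calibrated by permutation, all of which are assumptions. The inputs are assumed to be already transformed, and all function names are hypothetical.

```python
import numpy as np

def mmd2(x, y, rho):
    """Biased squared MMD with kernel exp(-||u - v||^2 / rho)."""
    z = np.vstack([x, y])
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    k = np.exp(-sq / rho)
    n1 = len(x)
    return k[:n1, :n1].mean() + k[n1:, n1:].mean() - 2.0 * k[:n1, n1:].mean()

def select_subset(x, y, size):
    """Hypothetical screen: indices of the `size` largest |Welch t| values."""
    se = np.sqrt(x.var(0, ddof=1) / len(x) + y.var(0, ddof=1) / len(y))
    t = np.abs(x.mean(0) - y.mean(0)) / se
    return np.argsort(t)[-size:]

def adaptive_stat(x, y, size):
    """MMD statistic restricted to the selected subset of columns."""
    s = select_subset(x, y, size)
    return mmd2(x[:, s], y[:, s], rho=float(size))  # placeholder bandwidth

def adaptive_test(x, y, size=5, n_perm=200, seed=0):
    """Permutation p-value; the selection is repeated in every permutation."""
    rng = np.random.default_rng(seed)
    observed = adaptive_stat(x, y, size)
    z = np.vstack([x, y])
    n1 = len(x)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(z))
        # Re-select on the permuted data so that the selection step is
        # reflected in the null distribution of the statistic.
        if adaptive_stat(z[idx[:n1]], z[idx[n1:]], size) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Toy example: 20 variables, only the first three carry a mean shift.
rng = np.random.default_rng(1)
x = rng.normal(size=(30, 20))
y = rng.normal(size=(30, 20))
y[:, :3] += 2.0
p_value = adaptive_test(x, y)
```

Repeating the selection inside each permutation is the key design choice in this sketch: selecting the subset once on the observed data and permuting only afterwards would ignore the selection step and could inflate the type I error.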

There is a vast statistical literature on high-dimensional variable selection. Some famous examples include the lasso (Tibshirani,

We first randomly permute the row indices of matrix _{1}, …, _{p} and _{j}, as

It should be noted that the aforementioned permutation-based procedure is one way to achieve testing subset selection but not the only way, and it is possible to select testing subset

A comprehensive simulation study has been conducted to compare the performance of AMDA to a wide range of existing microbiome association tests in the framework of microbiome differential abundance analysis. The five other tests evaluated in this simulation include the MiRKAT (Zhao et al.,

We closely followed the simulation design of the MAX test (Cao et al., ^{(1)} were drawn from a uniform distribution Unif(0,10) and we considered the banded covariance structure ^{1/2}^{1/2}, where _{jj} = 1, _{j,j−1} = _{j−1,j} = −0.5. Under the null model, we set ^{(2)} = ^{(1)}. Under the alternative model, we randomly picked a subset _{j} ~ _{1} = _{2} =
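A rough sketch of this kind of simulation is given below. Several details are assumptions because they are not recoverable from the text above: the covariance is taken to be a banded correlation matrix (ones on the diagonal, -0.5 on the first off-diagonals) rescaled on both sides by the square root of a diagonal matrix, the diagonal entries (`variances`) are set to one as a placeholder, and compositions are obtained by exponentiating multivariate normal draws and normalizing each row to unit sum. The function names are hypothetical.

```python
import numpy as np

def banded_correlation(p, off_diag=-0.5):
    """Tridiagonal matrix with unit diagonal and `off_diag` next to it."""
    a = np.eye(p)
    i = np.arange(p - 1)
    a[i, i + 1] = off_diag
    a[i + 1, i] = off_diag
    return a

def simulate_group(n, mu, variances, rng):
    """Draw n compositions from a log-normal basis (assumed design)."""
    p = len(mu)
    d_half = np.diag(np.sqrt(variances))
    sigma = d_half @ banded_correlation(p) @ d_half
    log_basis = rng.multivariate_normal(mu, sigma, size=n)
    basis = np.exp(log_basis)
    # Normalize each sample to relative abundances with unit sum.
    return basis / basis.sum(axis=1, keepdims=True)

# Null model: both groups share the same mean vector drawn from Unif(0, 10).
rng = np.random.default_rng(0)
p = 6
mu1 = rng.uniform(0, 10, size=p)
mu2 = mu1.copy()
x1 = simulate_group(20, mu1, np.ones(p), rng)
x2 = simulate_group(20, mu2, np.ones(p), rng)
```

The tridiagonal matrix with off-diagonal entries of -0.5 is positive definite for any finite dimension, although its smallest eigenvalue approaches zero as the dimension grows.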

After the data were simulated, we applied AMDA, MAX, OMiAT, MMD, MiRKAT, and QCAT to examine the two-sample differences. The first three tests AMDA, MAX, OMiAT are adaptive in the sense that they either use a testing subset of the taxa (AMDA and MAX) or assign a different weight for each taxon in the set (OMiAT) to conduct the multivariate two-sample test. The Gaussian kernel (^{2}/ρ}, where ^{2}. The type I error was evaluated using 5,000 replicates generated under the null model and the power of test was assessed with 1,000 replicates under the alternative model. Without loss of generality, we set the nominal significance level α = 0.05 throughout this simulation.

The type I error of different tests are reported in ^{*}/

Empirical type I errors of different tests for microbiome differential abundance analysis under nominal significance level α = 0.05.

| | | AMDA | MAX | OMiAT | MMD | MiRKAT | QCAT |
| 50 | 50 | 0.0478 | 0.0478 | 0.0506 | 0.0516 | 0.0508 | 0.0436 |
| 50 | 100 | 0.0464 | 0.0458 | 0.0492 | 0.0536 | 0.0540 | 0.0488 |
| 50 | 200 | 0.0504 | 0.0542 | 0.0530 | 0.0534 | 0.0548 | 0.0480 |
| 100 | 50 | 0.0486 | 0.0478 | 0.0490 | 0.0434 | 0.0424 | 0.0532 |
| 100 | 100 | 0.0464 | 0.0494 | 0.0492 | 0.0544 | 0.0542 | 0.0478 |
| 100 | 200 | 0.0524 | 0.0558 | 0.0514 | 0.0440 | 0.0424 | 0.0470 |
| 200 | 50 | 0.0454 | 0.0498 | 0.0492 | 0.0438 | 0.0400 | 0.0490 |
| 200 | 100 | 0.0514 | 0.0476 | 0.0464 | 0.0530 | 0.0516 | 0.0538 |
| 200 | 200 | 0.0464 | 0.0510 | 0.0506 | 0.0542 | 0.0530 | 0.0476 |
| 500 | 50 | 0.0480 | 0.0464 | 0.0504 | 0.0556 | 0.0442 | 0.0474 |
| 500 | 100 | 0.0540 | 0.0544 | 0.0566 | 0.0570 | 0.0498 | 0.0468 |
| 500 | 200 | 0.0556 | 0.0576 | 0.0456 | 0.0490 | 0.0442 | 0.0336 |
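When reading empirical type I errors such as those above, it can help to keep the Monte Carlo error in mind. The short calculation below is an added illustration rather than part of the original analysis: it uses the normal approximation to the binomial distribution to give the range in which an empirical rejection rate would typically fall if the true type I error were exactly 0.05 and 5,000 replicates were used. The function name is hypothetical.

```python
import math

def rejection_rate_band(alpha=0.05, n_rep=5000, z=1.96):
    """Approximate 95% range of an empirical rejection rate under the null."""
    # Standard error of a binomial proportion with success probability alpha.
    se = math.sqrt(alpha * (1.0 - alpha) / n_rep)
    return alpha - z * se, alpha + z * se

low, high = rejection_rate_band()
```

With these settings the range is roughly 0.044 to 0.056, so deviations of a few thousandths from the nominal level are within simulation noise.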

Empirical power of different tests under

Empirical power of different tests under

Among three non-adaptive tests, MMD and MiRKAT have similar power under each scenario. On the other hand, QCAT has the highest power when the dimension of taxa-set is relatively low (

Among the three more powerful adaptive tests, MAX seems to be slightly more powerful than AMDA and OMiAT when the density of signal is sparse (^{*}/^{*} = 50 even under the sparse scenario and AMDA can be more powerful than MAX by including more signals in the testing subset (bottom row of ^{*}/^{*}/^{*}/

To conclude, like the five other methods, the proposed AMDA method preserves the nominal type I error in microbiome differential abundance analysis. In terms of power, there is no uniformly most powerful test in our simulations. However, the proposed AMDA method is the most powerful of the six tests evaluated under most scenarios, and the power advantage of AMDA over the other five methods can be huge (^{*}/

We applied the proposed AMDA method to a study investigating how the oral microbiome differs across children with autistic behaviors (Hicks et al.,

Taxonomic reads were further filtered to include only the taxa with counts of more than 10 in more than 20% of samples, which resulted in an oral microbiome community of 753 taxa. Sequence alignment with k-SLAM (Ainsworth et al.,
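The prevalence filter described above can be sketched as follows, assuming a samples-by-taxa count matrix and reading the rule as "keep a taxon if its count exceeds 10 in strictly more than 20% of samples". The function name `filter_taxa` and the toy counts are hypothetical.

```python
import numpy as np

def filter_taxa(counts, min_count=10, min_fraction=0.20):
    """Keep taxa whose count exceeds min_count in more than min_fraction of samples."""
    counts = np.asarray(counts)
    # Fraction of samples in which each taxon (column) exceeds the count cutoff.
    frac = (counts > min_count).mean(axis=0)
    keep = frac > min_fraction
    return counts[:, keep], keep

# Toy example (made-up counts): five samples, three taxa.
counts = np.array([
    [11, 11, 10],
    [12, 0, 10],
    [0, 0, 10],
    [0, 0, 10],
    [0, 0, 10],
])
filtered, keep = filter_taxa(counts)
```

In the toy example only the first taxon is retained: the second exceeds the cutoff in exactly 20% of samples, which is not more than 20%, and the third never exceeds a count of 10.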

We first applied these tests to examine whether there is an overall shift in oral microbiome composition between different developmental groups by testing the differential abundances of all 753 taxa as a whole community. For the comparison of ASD vs. DD, the test ^{*} = 3 or 6 as suggested in the original analysis) compared to the number of variables (

Number of significantly differentially abundant taxa-sets at each taxonomic rank detected by different methods under a family-wise error rate of 0.05.

| Comparison | Taxonomic rank | AMDA | MAX | OMiAT | MMD | MiRKAT | QCAT |
| ASD vs. DD | Phylum (10) | 3 | 1 | 0 | 0 | 0 | 1 |
| ASD vs. DD | Class (18) | 3 | 1 | 1 | 2 | 2 | 1 |
| ASD vs. DD | Order (34) | 2 | 0 | 0 | 1 | 1 | 2 |
| ASD vs. DD | Family (52) | 1 | 0 | 0 | 0 | 0 | 1 |
| ASD vs. TD | Phylum (10) | 2 | 2 | 2 | 0 | 0 | 1 |
| ASD vs. TD | Class (18) | 4 | 3 | 3 | 2 | 2 | 5 |
| ASD vs. TD | Order (34) | 3 | 2 | 1 | 1 | 1 | 2 |
| ASD vs. TD | Family (52) | 2 | 2 | 1 | 2 | 2 | 2 |

Next, we shift our analysis unit to lower ranks than the community-level to comprehensively assess taxa-set (with multiple taxa) at each taxonomic rank that are differentially abundant among different developmental status groups. The testing results are summarized here in
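For readers unfamiliar with family-wise error control across several taxa-sets, one standard approach is the Bonferroni correction, sketched below. The text does not state which correction was used in this analysis, so this is an illustration only. The p-values are made up, and the example assumes ten taxa-sets are tested at one rank.

```python
def bonferroni_reject(p_values, fwer=0.05):
    """Flag tests whose p-value is at most fwer divided by the number of tests."""
    threshold = fwer / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values for ten taxa-sets at one rank (not from the study).
p_values = [0.0004, 0.003, 0.02, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
rejected = bonferroni_reject(p_values)
n_significant = sum(rejected)
```

With ten tests the per-test threshold is 0.005, so two of the hypothetical taxa-sets would be declared significant.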

With the ever-increasing availability of microbiome and metagenomics data generated by next generation sequencing technology, efficient statistical analysis methods are needed to ensure both statistical rigor and biological relevance. In this paper, we consider the problem of differential abundance analysis for microbiome data, which leads to a better understanding of the behavior of microbiome communities. Most existing methods tackle this problem using an individual taxon-based approach followed by multiple testing adjustment. However, because taxa living in the same community do not grow independently, the interactions among taxa induce complicated correlation structures among their relative abundances, which may violate the correlation assumptions (among individual tests) of existing multiple testing correction methods (Hawinkel et al.,

The AMDA method has two main advantages compared to a traditional individual taxon-based approach. First, it can provide new biological and biomedical insights. The joint modeling of all taxa in the set can capture conditional effects of taxa that are missed in the traditional individual taxon-based approach, and thus new insights can be gained by shifting the analysis unit to a higher taxonomic rank. Second, it is statistically powerful because it aggregates marginal signals of individual taxa and reduces the multiple testing burden. By adaptively choosing the subset being tested, AMDA further boosts the statistical testing power compared to existing taxa set-based differential abundance analyses (e.g., MiRKAT). Moreover, the adaptive strategy used in AMDA could be easily extended to other hypothesis testing frameworks (e.g., association testing) beyond the two-sample problem considered in this paper. We conducted comprehensive numerical simulation studies to show the superior performance of AMDA over existing approaches in terms of maintaining the correct type I error while having a higher power to detect a true difference. The potential usefulness of AMDA was further demonstrated via its application to an oral microbiome dataset, where AMDA tends to detect more significant differences than its competitors.

For illustration of our method, we applied the Gaussian kernel-based MMD test, which has been shown to be a consistent two-sample test (Gretton et al.,

This study involves only secondary analyses, where all the utilized data sets are published in a previous study.

KB and NZ analyzed the data, drafted the paper, and prepared figures and tables. AS and LX conducted the testing subset simulations. SH and FM provided and helped analyze the oral microbiome data. RW contributed substantial expertise to improve the paper and revised the paper. XZ conceived and designed the experiments, analyzed the data, wrote the paper, and developed the software. All authors read and approved the final manuscript.

The authors declare that this study received funding from a National Institute of Mental Health STTR award (R41 MH111347) to Quadrant Biosciences, Inc. Quadrant Biosciences was involved with study design and data collection for the RNA sequencing results employed in this study's secondary data analysis (autism microbiome data). SH and FM serve on the scientific and medical advisory boards of Quadrant Biosciences Inc., and SH is a paid consultant for Quadrant Biosciences Inc. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The authors would like to thank the Associate Editor and two reviewers for their insightful comments that improved the paper. Funding was provided by Quadrant Biosciences Inc. (Research agreement with SH) and an NIH STTR award (R41 MH111347).

The Supplementary Material for this article can be found online at: