Edited by: Michele Guindani, University of California, Irvine, United States

Reviewed by: Yanxun Xu, Johns Hopkins University, United States; Michael Pester, Deutsche Sammlung von Mikroorganismen und Zellkulturen (DSMZ), Germany

*Correspondence: Juhee Lee

This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The successional dynamics of microbial communities are influenced by the synergistic interactions of physical and biological factors. In our motivating data, ocean microbiome samples were collected from the Santa Cruz Municipal Wharf, Monterey Bay at multiple time points and then 16S ribosomal RNA (rRNA) sequenced. We develop a Bayesian semiparametric regression model to investigate how microbial abundance and succession change with covarying physical and biological factors, including algal bloom and domoic acid concentration level, using 16S rRNA sequencing data. A generalized linear regression model is built using the Laplace prior, a sparsity-inducing prior, to improve estimation of covariate effects on mean abundances of microbial species represented by operational taxonomic units (OTUs). A nonparametric prior model is used to facilitate borrowing strength across OTUs, across samples, and across time points. It flexibly estimates baseline mean abundances of OTUs and provides the basis for improved quantification of covariate effects. The proposed method does not require prior normalization of OTU counts to adjust for differences in sample total counts. Instead, the normalization and the estimation of covariate effects on OTU abundance are carried out simultaneously in a joint analysis of all OTUs. Using simulation studies and a real data analysis, we demonstrate improved inference compared to an existing method.

Microbial communities are influenced by several factors whether they live in the host's guts or other occupied niches. Their successional dynamics could further change in response to perturbations of the host or of the surrounding environments (Turnbaugh et al.,

Analysis of large NGS datasets is computationally expensive and challenging. One of the key challenges is the normalization of counts across samples. Total counts (often called library size or sequencing depth) may vary widely across samples for technical reasons. Thus, observed counts are not directly comparable across samples and cannot be used directly as a measure of the abundance of an OTU. Counts normalized through rarefaction or conversion to relative frequencies are commonly used for easy comparison of OTU abundance across samples. However, such

Ocean microbiome data. OTU counts, with OTUs in rows and samples in columns; OTU counts are rescaled within a sample for better illustration. Numbers of replicates at the time points.

We develop a Bayesian semiparametric generalized linear regression model to study the effects of physical and biological factors on the abundance of microbes. The proposed method performs model-based normalization through a hierarchical model, which enables direct modeling of OTU counts. Furthermore, the hierarchical model facilitates borrowing strength between OTUs, between samples, and between time points through joint analysis and improves inference on the effects of covariates. The baseline mean abundance is decomposed into a sample factor (r), an OTU factor (α_{0}), and an OTU and time factor (α_{t}), that is, r × α_{0} × α_{t}. Due to the overparametrization of the baseline mean abundance, individual factors are not identifiable. To avoid identifiability issues, we place regularizing priors with mean constraints (Li et al.) on r and α_{0}. In addition, we model a temporal dependence structure between the baseline expected counts for an OTU through a convolutional Gaussian process (Higdon). The individual factors r, α_{0}, and α_{t} are not fully interpretable under the proposed model, but baseline mean counts

The rest of the paper is organized as follows. In section 2 we describe the proposed model and discuss the prior formulations and the resulting posterior inference. We perform simulation studies to assess the proposed model and compare it with an existing method that analyzes one OTU at a time. We then apply the proposed model to an ocean microbiome dataset. Section 3 presents the performance of the proposed model in the simulation experiment and on the ocean microbiome data. Section 4 concludes the paper with a discussion of limitations and possible extensions.

Suppose that samples are taken at _{i} ≤ _{i} replicates at time point _{i}. We consider count _{ti,k, j} of OTU _{i}, where _{i}, and _{i} and _{ti,k, j}] denote the _{ti,k, j} is integer-valued and nonnegative. Also, suppose that covariates _{i}. For example, covariates are physical and biological factors in our motivating data.

Count data from NGS methods are often modeled through a Poisson distribution. The Poisson assumption that the variance equals the mean is often too restrictive to accommodate overdispersion, in which the variation in the data exceeds the mean. The negative binomial (NB) distribution is a popular and convenient alternative that addresses the overdispersion problem and is widely recognized as a model that provides improved inference for NGS count data (for example, see Robinson and Smyth, _{t,k,j} of OTU

where mean count μ_{t,k,j} > 0 and overdispersion parameter _{j} > 0. The model in Equation (1) implies that count of OTU _{t,k,j} ∣ μ_{t,k,j}) = μ_{t,k,j} and variance _{j}. In the limit as _{j} → 0, the model in Equation (1) yields the Poisson distribution with mean μ_{t,k,j}. We assume a gamma distribution for a prior distribution of _{j}, _{s} and _{s}.
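The mean-variance relationship described above can be illustrated with a short Python sketch. It assumes the common NB parametrization in which the variance is μ(1 + sμ), which reduces to the Poisson variance as s → 0; this form, the function names, and the numerical values are assumptions made for illustration and are not taken from Equation (1). Counts are drawn as a gamma-Poisson mixture.

```python
import math
import random


def nb_mean_var(mu, s):
    """Mean and variance under the ASSUMED parametrization Var = mu * (1 + s * mu)."""
    return mu, mu * (1.0 + s * mu)


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count by Knuth's multiplication method (moderate lam only)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_nb(mu, s, rng):
    """Draw one NB count with mean mu and overdispersion s as a gamma-Poisson mixture."""
    if s == 0:
        return sample_poisson(mu, rng)  # Poisson limit
    # A gamma rate with shape 1/s and scale s*mu has mean mu and variance s*mu^2,
    # so the mixed Poisson count has variance mu + s*mu^2.
    lam = rng.gammavariate(1.0 / s, s * mu)
    return sample_poisson(lam, rng)


rng = random.Random(0)
draws = [sample_nb(5.0, 0.5, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
```

With μ = 5 and s = 0.5 the assumed variance is 17.5, more than three times the Poisson variance of 5, which is the kind of overdispersion the NB model is meant to absorb.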

We next model the mean count μ_{t,k,j} of _{t,k,j}. We decompose the mean count into factors, a baseline mean count and a function of covariates, μ_{t,k,j} = _{t,k,j}η_{j}(_{t}). Here parameter _{t,k,j} denotes the baseline mean abundance of OTU _{j}(_{t}) is a function of covariates _{t} for OTU _{j, p} quantifies the effect of covariate _{p} on the mean abundance of OTU _{j} close to the zero vector produces a value of η_{j}(_{t}) close to 1, and the mean count remains similar to the baseline mean count _{t,k,j}, implying insignificant covariate effects. A negative (positive) of β_{j, p} implies a negative (positive) association between mean counts and the _{j,p} decreases (increases) the mean count, while holding the other covariates constant. We consider a Laplace prior on β_{j}. Specifically, we express the Laplace distribution as a scale mixture of normals and assume for

where _{λ}, _{λ}, _{σ}, and _{σ} are fixed. _{j,p} denote the global and local shrinkage parameters, respectively, for OTU _{j,p} out, the prior distribution of β_{j,p} is the Laplace distribution with location parameter 0 and scale parameter _{j,p}, the Laplace distribution has more concentration around zero but allows heavier tails. Regularized regression through the Laplace prior shrinks the coefficients of weakly related covariates strongly toward zero while pulling the coefficients of important covariates toward zero to a lesser extent. Shrinkage of β estimates through the model in Equation (2) mitigates possible issues due to multicollinearity and efficiently improves estimation of β in a high dimensional setting (Polson and Scott,
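The scale-mixture representation of the Laplace prior can be sketched in Python as follows. This is a simplified, single-coefficient version that assumes the standard construction (a normal with exponentially distributed variance of rate λ²/2), which yields a Laplace distribution with location 0 and scale 1/λ. It omits the OTU-specific variance and the hyperpriors of the model in Equation (2), and the names `lam` and `sample_laplace_via_mixture` are introduced here for illustration only.

```python
import math
import random


def sample_laplace_via_mixture(lam, rng):
    """Draw beta ~ Laplace(0, 1/lam) as a scale mixture of normals.

    ASSUMED construction: tau ~ Exponential(rate = lam^2 / 2), beta | tau ~ N(0, tau).
    """
    tau = rng.expovariate(lam * lam / 2.0)  # local variance (shrinkage) parameter
    return rng.gauss(0.0, math.sqrt(tau))


rng = random.Random(1)
draws = [sample_laplace_via_mixture(2.0, rng) for _ in range(20000)]
mean_abs = sum(abs(b) for b in draws) / len(draws)  # should be near the scale 1/lam
```

For a Laplace distribution with scale 1/λ the expected absolute value equals 1/λ, so with λ = 2 the Monte Carlo average of |β| should be close to 0.5.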

We next build a prior probability model for the baseline mean count _{t,k,j} of OTU _{t,k,j} = _{t,k}α_{0, j}α_{t, j} to separate sample (_{t,k}), OTU (α_{0, j}), and OTU-time (α_{t, j}) factors. Sample total counts _{t,k} account for different total counts in different samples and expected counts normalized by _{t,k} are comparable across samples. Factor α_{0, j} explains variabilities in baseline mean abundances of OTUs and α_{t,j} models temporal dependence of the mean counts for an OTU, respectively. Factors α_{0, j} and α_{t,j} are not indexed by replicate
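A small Python sketch may help to show why the individual factors in such a product cannot be recovered from the baseline mean alone: rescaling the sample factor by a constant and the OTU factor by its reciprocal leaves the product unchanged. The function name and the numerical values below are hypothetical and chosen only for illustration.

```python
def baseline_mean(r, alpha0, alpha_t):
    """Baseline mean count as the product of sample, OTU, and OTU-time factors."""
    return r * alpha0 * alpha_t


# Two different factor configurations (hypothetical values) ...
g_original = baseline_mean(2048.0, 0.125, 1.5)
c = 4.0
g_rescaled = baseline_mean(2048.0 * c, 0.125 / c, 1.5)

# ... give exactly the same baseline mean, so the data cannot tell them apart.
same_baseline = g_original == g_rescaled
```

Only the product is informed by the data, which is why additional structure, such as constraints on the priors of the factors, is needed before the baseline mean can be estimated stably.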

The model for g_{t,k,j} in Equation (3) is overparameterized and the individual parameters are not identifiable. To avoid potential identifiability issues, many NB models rely on some form of approximation for the baseline mean counts. For example, one may find the maximum likelihood estimates (MLEs) of baseline mean abundance under some constraints and plug in those estimates to infer the mean abundance levels μ_{ti, j} of OTUs (Witten, _{t,k,j}, we take an alternative in Li et al. (_{t,k} and α_{0, j}. We let the logarithm of the factors

where ϕ(η, ^{2}) is the probability density function of the normal distribution with mean η and variance ^{2}, constraints for the mixture weights _{ℓ} and 1 − _{ℓ}, respectively, and the mean of the component is _{r} and _{α}, respectively. Li et al. (_{t,k,j} are identifiable, while _{t,k}, α_{0, j}, and α_{t,j} are not directly interpretable. More importantly, the parameters of primary interest η_{j}(_{t}) can be uniquely estimated and β_{j,p}'s keep their interpretation as parameters that quantify the effects of covariates on mean counts. We used an empirical approach to fix the mean constraints _{r} and _{α}. Sensitivity analyses were conducted to assess the robustness to the specification of _{r} and _{α} and show that the model provides reasonable estimates of _{t,k,j} and moderate changes in the values of _{r} and _{α} minimally change the estimates. More details of the specification of _{r} and _{α} are discussed in section 3.1. We fix the numbers of mixture components, ^{r} and ^{α} and variances _{r} and _{α}. We let ^{r} and ^{α} with fixed _{r}, _{r}, _{α}, and _{α}.
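The mean constraint can be made concrete with a short Python sketch. It assumes one common way of building such a prior: each outer mixture component is itself a two-part mixture with weights w and 1 − w, where the first part has a free mean η and the second part has mean (ν − wη)/(1 − w), so that every pair, and hence the whole mixture, has mean ν. The symbols `nu`, `psi`, `w`, and `eta`, and all numerical values, are assumptions for this illustration and may differ from the notation in the lost display equation.

```python
def paired_means(nu, w, eta):
    """Means of the two sub-components whose weighted average equals nu (ASSUMED form)."""
    return eta, (nu - w * eta) / (1.0 - w)


def mixture_mean(nu, psi, w, eta):
    """Overall mean of the mixture of mean-constrained pairs.

    psi: outer mixture weights (sum to 1); w: inner weights; eta: free means.
    """
    total = 0.0
    for psi_l, w_l, eta_l in zip(psi, w, eta):
        m1, m2 = paired_means(nu, w_l, eta_l)
        total += psi_l * (w_l * m1 + (1.0 - w_l) * m2)
    return total


# Hypothetical settings: three outer components, mean constraint nu = 3.0.
overall = mixture_mean(3.0, [0.2, 0.5, 0.3], [0.3, 0.6, 0.5], [1.0, 4.0, 2.5])
```

Whatever values the free means and weights take, the overall mean stays at ν, which is what anchors the scale of each factor while leaving the shape of its distribution flexible.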

Recall that samples are collected over time points _{1}, …, _{n} in [0, _{t,j} accounts for temporal dependence in the baseline mean count for an OTU. We let _{t,k,j}. The Gaussian process (GP) is one of the most popular stochastic models for the underlying process in spatial and spatio-temporal data (for example, see Cressie, _{j}(_{1}, …, _{M} in [0,

where {_{1}, …, _{M}} a set of basis points in _{m}) a Gaussian kernel centered at _{m}, _{m} and the range parameter γ can be treated as random variables by placing prior distributions, e.g., consider a gamma prior for γ. For simplicity, we fix them as follows. We first choose a value for _{m} evenly spaced over time _{t,k,j}. A discussion is included in section 3.1. Given the number of basis points
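The process convolution construction can be sketched in Python as a kernel-weighted sum of latent values placed at evenly spaced basis points. The sketch assumes an unnormalized Gaussian kernel exp(−d²/(2γ²)) and independent standard normal latent values; the exact kernel scaling, the latent distribution, and the names used here are assumptions and are not taken from the lost display equation.

```python
import math
import random


def basis_points(T, M):
    """M evenly spaced basis points on [0, T] (requires M >= 2)."""
    return [T * m / (M - 1) for m in range(M)]


def gaussian_kernel(d, gamma):
    """Unnormalized Gaussian kernel with range parameter gamma (ASSUMED scaling)."""
    return math.exp(-0.5 * (d / gamma) ** 2)


def process_convolution(t, points, weights, gamma):
    """Value of the smooth process at time t: sum over m of K(t - u_m) * z_m."""
    return sum(gaussian_kernel(t - u, gamma) * z for u, z in zip(points, weights))


points = basis_points(1.0, 5)
rng = random.Random(2)
latent = [rng.gauss(0.0, 1.0) for _ in points]  # hypothetical latent values
curve = [process_convolution(t / 20.0, points, latent, 0.2) for t in range(21)]
```

Because nearby time points share most of their kernel weights, the resulting values vary smoothly in time. A larger γ or fewer basis points gives a smoother curve, while more basis points allow finer temporal structure at higher computational cost.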

We implement posterior inference on the parameters

We conducted simulation studies to assess the performance of our model. We compared the model to an alternative model, the negative binomial mixed model (NBMM) in Zhang et al. (_{i}) and numbers of replicates (_{i}) of our ocean microbiome data as shown in Figure ^{2}) or N(1.5, 0.05^{2}) with equal probability, where N(^{2}) denotes the normal distribution with mean ^{2}. It implies that a covariate has no effect on OTU abundance with probability 0.85 or may significantly affect mean abundance with the remaining probability 0.15. To specify _{i} in year and _{ti,k, j} from the negative binomial distribution

Ocean microbiome data. Bar plots of discretized covariates, concentration levels of _{1}) and _{2}), Pseudo-nitzschia (Pn, _{3}), domoic acid (DA, _{4}) in _{4}, _{5}), nitrate (N, _{6}), phosphate (P, _{7}), silicate (Si, _{8}), water temperature (T, _{9}), concentration level of chlorophyll (Chl, _{10}) in

For comparison, we used the negative binomial mixed model (NBMM) in Zhang et al. (^{NBMM} and shape parameter θ^{NBMM} to model OTU counts and assumes _{t} and _{t,k} are the covariate matrices for fixed effects and random effects, respectively. It assumes random effects _{t,k, ·} are used as an offset to adjust for the variability in total counts across samples. Like other existing methods, the NBMM performs separate analyses of OTUs. An iterative weighted least squares algorithm is developed to produce the MLEs under the NBMM and is implemented in an R function

We applied the proposed statistical method to ocean microbiome data. Seawater samples were collected weekly at the end of Santa Cruz Municipal Wharf (SCW), Monterey Bay (36.958°N, −122.017°W), with an approximate depth of 10 m. SCW is one of the ocean observing sites in Central and Northern California (CenCOOS), where harmful algal bloom species [HAB species: _{4}), silicate (Si), nitrate (N), phosphate (P)], temperature (T), domoic acid (DA) concentration, and chlorophyll (Chl). Details of the phytoplankton net tow sampling for measuring phytoplankton abundance and of the measurement of physical (nutrients and temperature) and biological parameters (chlorophyll a and DA concentration) are described in Sison-Mangus et al. (

For bacterial RNA samples, three depth-integrated (0, 5, and 10 ft) water samples were collected at a total of 55 time points between April 2014 and November 2015. Two or three samples were sequenced at each time point. The numbers of replicates at the time points are illustrated in Figure

To fit the proposed model for the simulated data designed in section 2.2, we specified hyperparameter values of the model as follows: for the Laplace prior of β_{j,p}, we let _{λ} = _{λ} = 0.5 for a gamma prior of _{λ}/_{λ} and variance _{σ} = _{σ} = 0.3 for an inverse gamma prior for common variances _{α} = _{r} = 10, _{r} = _{r} = _{α} = _{α} = 1, _{r} = 30 and _{α} = 50. To specify values of the mean constraints _{r} and _{α}, we took an empirical approach. We used the simulated _{ti,k, j}, computed estimates of _{ti,k, j} and α_{j, 0} as described in section 2.2 and fixed the mean constraints at the means of the logarithm of the estimates, respectively. Note that the specified values of _{r} and _{α} were very different from the means of their true values. For the process convolution prior of OTU-time factor _{m}, _{i} + 10. For overdispersion parameter _{j} we let _{s} = 1 and _{s} = 2. To run the MCMC simulation, we initialized the parameters by simulating from their prior distributions. We then implemented posterior inference using MCMC simulation over 25,000 iterations, discarding the first 10,000 iterations as burn-in and keeping every other sample thereafter (thinning by two).
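As a small worked example of the burn-in and thinning scheme just described, the following Python sketch counts how many posterior draws are retained. The helper name `retained_indices` is hypothetical, and whether the first kept draw is the one immediately after burn-in is an assumption about bookkeeping that does not affect the count here.

```python
def retained_indices(n_iter, burn_in, thin):
    """0-based indices of MCMC iterations kept after burn-in and thinning."""
    return list(range(burn_in, n_iter, thin))


# Simulation settings stated in the text: 25,000 iterations, 10,000 burn-in, thin by 2.
kept = retained_indices(25000, 10000, 2)
n_kept = len(kept)
```

Under these settings 7,500 draws remain for computing posterior means and 95% credible intervals.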

Figure _{j,p} to their true values _{j,p} and their 95% credible intervals, respectively. _{4} in Figures _{j,p} with

Simulation 1—proposed model. Comparison of the true values _{j,p} under the proposed model for some selected covariates. Dots and blue dashed lines represent estimates of posterior means _{j,p} and 95% credible intervals (CIs) of β_{j,p}, respectively. The insert plot in each panel is a scatter plot of

Figures _{t,k,j} with their means (black dots) and 95% credible intervals (blue vertical lines) for some selected OTUs, _{ti,k}, α_{0, j}, and α_{t,j}, but we rather aim to reasonably recover the true baseline mean counts, _{r} used for analysis. Different from the estimates of _{t,k,j} as seen in Figures _{t,k,j}. The posterior predicted values of _{ti,k, j} with their 95% predictive intervals for OTUs

Simulation 1—proposed model. Panels _{t,k,j} for some selected OTUs _{t,k,j} for each OTU. Panels ^{TR}. Dots represent posterior mean estimates and blue vertical dotted lines 95% credible intervals. Red squares represent the true values.

In addition, we conducted a sensitivity analysis to the specification of mean constraints _{r} and _{α} for the priors of _{r} and _{α} and compared the estimates of _{t,k,j} to their truth. Supplementary Figures _{j} between ĝ_{ti,k, j} and _{r} and _{α}. The histograms show minor changes in estimates of _{ti,k,j} under different specifications of _{r} and _{α}. A sensitivity analysis to the specification of the number _{t,k,j}. Supplementary Figures _{j} for each of

For comparison, we applied the NBMM to the simulated data. Since the NBMM does not accommodate missing covariates, we used ^{TR} to fit the NBMM. Figure _{j,p} to the true values for the same covariates used in Figure _{j,p} for all covariates. Supplementary Figures ^{NBMM} is the inverse of _{j} for many OTUs, and yields poor predicted values, implying a lack of fit.

Simulation 1—NBMM. Comparison of the true values _{j,p} under the negative binomial mixed model (NBMM) for some selected covariates. Dots and blue dashed lines represent

We further examined the performance of the proposed model through additional simulation studies, Simulations 2 and 3 in Supplementary section _{j,p} from a mixture of normals. The performance of our model is almost the same as in Simulation 1 (see Supplementary Figures _{j,p} (see Supplementary Figures

We specified hyperparameters similar to those in the simulations and analyzed the microbiome data in section 2.3. The MCMC simulation was run over 25,000 iterations. The first 15,000 iterations were discarded as burn-in, and every other sample thereafter was kept and used for inference. Figure _{4}) and the concentration level of nitrate (N, _{6}) significantly decrease the mean abundance of OTU 16 by a multiplicative factor of exp(−0.572) = 0.564 and exp(−0.260) = 0.771, respectively. One may infer that the medium concentration level of domoic acid is significantly associated with lower expected counts for the OTU compared to the 'none' category of the domoic acid concentration level. A similar argument can be applied to the inference on the nitrate concentration level. Interestingly, we observed a statistically significant reduction in abundance with increasing domoic acid concentration for many OTUs belonging to Gamma-proteobacteria, including those OTUs (not shown). The resulting inference was further validated through a lab experiment. Most notably, four bacterial cultured isolates belonging to Gamma-proteobacteria (three among them are _{j,p}, are illustrated in Supplementary Figure _{ti,k} and OTU specific overdispersion parameters _{j} is illustrated in Supplementary Figures
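The multiplicative factors quoted above follow directly from exponentiating the coefficient estimates, as the following Python sketch checks. It assumes a log-linear covariate function, so that a one-unit change in a covariate multiplies the mean count by exp(β); the function names are introduced here for illustration.

```python
import math


def multiplicative_factor(beta):
    """Factor by which the mean count changes per unit increase of the covariate."""
    return math.exp(beta)


def percent_change(beta):
    """Same effect expressed as a percentage change in the mean count."""
    return 100.0 * (math.exp(beta) - 1.0)


factor_da = round(multiplicative_factor(-0.572), 3)      # domoic acid estimate for OTU 16
factor_nitrate = round(multiplicative_factor(-0.260), 3)  # nitrate estimate for OTU 16
```

The factors 0.564 and 0.771 correspond to reductions of about 44% and 23% in the mean abundance of OTU 16, with the other covariates held constant.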

Ocean microbiome data—proposed model. Posterior Inference on β_{j} for some selected OTUs (_{j,p}. Each vertical line connects the lower bounds and the upper bounds of 95% credible intervals.

For comparison, we fitted the NBMM to the data. Since the NBMM does not account for missing values, we used the maximum a posteriori estimates of the missing values under the proposed model when fitting the NBMM. We used the R function _{j,p}. Histograms of the MLEs of β_{j,p},

Ocean microbiome data—NBMM. Inference on β_{j} for some selected OTUs (_{j,p}. Each vertical line connects the lower bounds and the upper bounds of 95% confidence intervals.

In this paper, we developed a Bayesian semiparametric regression model for joint analysis of microbiome data. We formulated the mean counts of OTUs as a product of factors and built models for the factors. We utilized regularizing priors with mean constraints to avoid possible identifiability issues, and the process convolution model to capture the temporal dependence structure in the baseline mean abundance of an OTU. The flexible model developed for baseline abundance enables joint analysis of all OTUs in the data and allows borrowing information across OTUs, across samples, and across time points. The model produces accurate estimates of the baseline mean counts and yields improved estimates of the effects of the covariates. We incorporated the Laplace distribution, a sparsity-inducing shrinkage prior, for the coefficients, so that the proposed model produces sparse estimates, which are desirable when the problem is high-dimensional and covariates are highly correlated. We compared the proposed model to a comparable frequentist model that performs separate analyses for individual OTUs. The comparisons through the simulation study and the real data analysis show better performance of the proposed model.

Although we focused on the analysis of NGS count data, the proposed model is quite general and can be applied to the analysis of any count data. Future work will explore alternative approaches to model the effects of covariates on the mean counts. For example, one may consider a nonparametric model using linear combinations of basis functions (Kohn et al., _{j,p,t−1} and β_{j,p,t}. Considering the high dimensionality of OTU data, posterior computation may need to be handled carefully. Also, prior information may be needed to produce sensible inference due to sparsity in the data.

JL developed the statistical model and conducted the simulation studies and data analysis. She also prepared the first draft and led the collaboration with MS-M for statistical analysis. MS-M provided the ocean microbiome data, participated in the statistical model development, provided biological interpretation of the resulting inference, and edited the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We gratefully acknowledge Raphael Kudela, who provided the environmental data in this study, and Michael Kempnich and Sanjin Mehic, who performed the water sampling and processed the water samples for DNA extraction.

The Supplementary Material for this article can be found online at: