Edited by: Helder Nakaya, University of São Paulo, Brazil

Reviewed by: Diego Bonatto, Federal University of Rio Grande do Sul, Brazil; Ling-Yun Wu, Academy of Mathematics and Systems Science (CAS), China

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The study of interactions among biological components can be carried out using methods grounded in network theory. Most of these methods focus on the comparison of two biological networks (e.g., control vs. disease). However, biological systems often present more than two biological states (e.g., tumor grades). To compare two or more networks simultaneously, we developed

In the last two decades, the production of high-dimensional data, such as metabolomic, proteomic, transcriptomic, and genomic data, has increased considerably (Zhu et al.,

Biological systems can be assessed by correlation networks, in which the nodes represent the elements (variables) and the edges represent the statistical relations among these elements. Some approaches have been proposed to qualitatively analyze the correlation networks by performing a visual inspection of their structure (Caldana et al.,

Over the last years, several tools have been developed to statistically test whether correlation networks are different across conditions. Examples include

Although several biological studies compare more than two networks (Caldana et al.,

In the context of functional brain network studies, a generalization of

We propose a method for simultaneously comparing two or more biological correlation networks. In the following subsections, we explain the construction of correlation networks (graphs), the structural graph analysis, and the statistical test performed by

A correlation network is an undirected graph, where each node corresponds to a biological variable, and each edge connects a pair of nodes indicating the association between two variables. In our context, the edge corresponds to the statistical dependence between two variables. To measure and detect monotonic relations,
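As an illustration of how such a network can be built, the following minimal sketch computes a rank-based (Spearman) correlation for every pair of variables and keeps the pairs whose absolute correlation reaches a cutoff. The choice of Spearman correlation, the `threshold` value, the function names, and the toy data are all assumptions made for this example, not details taken from the method itself.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def correlation_network(data, threshold=0.5):
    """Build weighted edges from {variable: [observations]}.

    An edge (u, v) is kept when |Spearman rho| >= threshold.
    The threshold is a hypothetical cutoff chosen for illustration.
    """
    ranked = {name: ranks(obs) for name, obs in data.items()}
    edges = {}
    for u, v in combinations(sorted(data), 2):
        rho = pearson(ranked[u], ranked[v])  # Spearman = Pearson on ranks
        if abs(rho) >= threshold:
            edges[(u, v)] = abs(rho)
    return edges


# Toy data: three variables measured in five samples (made-up numbers).
toy = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [1.0, 4.0, 9.0, 16.0, 25.0],  # monotonic in A
    "C": [3.0, 1.0, 2.0, 5.0, 4.0],
}
network = correlation_network(toy, threshold=0.8)
```

Because B is a monotonic (though nonlinear) function of A, the rank-based correlation between them is maximal and the edge A-B is retained, whereas the weaker associations with C fall below the cutoff.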

The proposed method is based on graph topological features. In the following sections, we describe how

A random graph

Consider a set of nodes $V = \{v_1, v_2, \ldots, v_{n_v}\}$ of the graph, a set of states $s_1, s_2, \ldots, s_r$, and $n_i$ samples (number of observations) for each state $s_i$, for $i = 1, \ldots, r$. The aim is to test whether the graphs $g_1, g_2, \ldots, g_r$ (each one representing a state) were generated by the same random graph model. If the PDFGs are different, we assume that the graphs were generated by different random graph models. As will be seen next, here we analyzed correlation networks in which the elements correspond to variables such as genes, proteins, metabolites, and phenotypic variables. Examples of states include different treatments or conditions. An alteration in the structure of the network, detected by a change in the PDFG, could mean that a healthy human cell may be turning into a tumor cell or that the tumor tissue might be entering a new degree of aggressiveness.

The test comprises (i) construction of the graphs $g_1, g_2, \ldots, g_r$, (ii) computation of the test statistic, denoted by θ, which quantifies the differences among the networks, and (iii) a permutation test.

The PDFG is the probability density function of some topological feature of a graph with $n_v$ elements $v_1, v_2, \ldots, v_{n_v}$. Examples of topological features are the set of eigenvalues of the adjacency matrix of the graph, or graph centrality measures. Let δ be the Dirac delta and let the brackets “〈〉” denote the expectation according to the probability law of a random graph. Formally, the PDFG (

In real systems, the PDFG is unknown. To estimate the PDFG,

The

The θ statistic is calculated as follows:

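1. For each graph $g_i$ ($i = 1, \ldots, r$), compute the PDFG $f_{g_i}$.

2. Calculate the average PDFG as: $f_{g_M} = \frac{1}{r} \sum_{i=1}^{r} f_{g_i}$

3. Calculate the Kullback-Leibler (KL) divergence between each $f_{g_i}$ and $f_{g_M}$: $d_i = KL(f_{g_i}|f_{g_M})$

4. The statistic θ, which measures the difference among graphs, is the average distance: $\theta = \frac{1}{r} \sum_{i=1}^{r} d_i$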
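The calculation of θ can be sketched numerically as follows. This is a minimal illustration under several assumptions: each graph is summarized by a list of feature values (for example, eigenvalues of its adjacency matrix), the density is estimated with a Gaussian kernel on a regular grid, and the bandwidth, grid, and function names are hypothetical choices rather than the settings of the actual tool.

```python
from math import exp, log, pi, sqrt


def gaussian_kde(points, grid, bandwidth):
    """Gaussian kernel density estimate on `grid`, renormalized to integrate to 1."""
    step = grid[1] - grid[0]
    dens = []
    for x in grid:
        total = sum(exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        dens.append(total / (len(points) * bandwidth * sqrt(2 * pi)))
    area = sum(dens) * step
    return [d / area for d in dens]


def kl_divergence(f, g, step):
    """Discretized KL(f | g); terms with f == 0 contribute 0 (0 log 0 = 0)."""
    total = 0.0
    for a, b in zip(f, g):
        if a > 0:
            if b == 0:
                return float("inf")  # support of g does not contain support of f
            total += a * log(a / b) * step
    return total


def theta_pdfg(features, grid, bandwidth=0.5):
    """Average KL divergence between each estimated density and their mean."""
    densities = [gaussian_kde(x, grid, bandwidth) for x in features]
    r = len(densities)
    mean_density = [sum(col) / r for col in zip(*densities)]
    step = grid[1] - grid[0]
    distances = [kl_divergence(f, mean_density, step) for f in densities]
    return sum(distances) / r


# Made-up feature values for three graphs, evaluated on a grid over [-5, 5].
grid = [-5 + 0.05 * k for k in range(201)]
graph_a = [-1.0, 0.0, 1.0]
graph_b = [-1.0, 0.0, 1.0]
graph_c = [-2.5, 0.0, 2.5]

theta_same = theta_pdfg([graph_a, graph_b], grid)
theta_diff = theta_pdfg([graph_a, graph_b, graph_c], grid)
```

One convenient property of comparing each density with the average density is that the support of the average always contains the support of every individual density, so each KL divergence in the sum is finite.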

The KL divergence measures the discrepancy between two probability distributions. For graphs, we can use the KL divergence to select the graph model that best describes the observed graph or to discriminate PDFGs (Takahashi et al.). Let $g_1$ and $g_2$ be two random graphs with densities $f_{g_1}$ and $f_{g_2}$, respectively. If the support of $f_{g_2}$ contains the support of $f_{g_1}$, then the KL divergence between $f_{g_1}$ and $f_{g_2}$ is (Takahashi et al.,

where $0 \log 0 = 0$ and $f_{g_2}$ is called the reference measure. If the support of $f_{g_2}$ does not contain the support of $f_{g_1}$, then $KL(f_{g_1}|f_{g_2}) = +\infty$. The KL divergence is non-negative, and it is zero if and only if $f_{g_1}$ and $f_{g_2}$ are equal. In many cases, $KL(f_{g_1}|f_{g_2})$ and $KL(f_{g_2}|f_{g_1})$ are different when $f_{g_1}$ and $f_{g_2}$ are not equal, i.e., KL is an asymmetric measure.
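For reference, the standard form of the KL divergence between two densities can be written as below. The symbols $f_{g_1}$ and $f_{g_2}$ for the two PDFGs and the integration variable $x$ are notational assumptions adopted for this sketch.

```latex
KL(f_{g_1} | f_{g_2}) = \int_{-\infty}^{+\infty} f_{g_1}(x) \, \log \frac{f_{g_1}(x)}{f_{g_2}(x)} \, dx
```

Swapping the two arguments changes both the weighting density and the sign of the logarithm, which is why the two directions generally give different values.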

As in section 2.2, consider a set of nodes $V = \{v_1, v_2, \ldots, v_{n_v}\}$ and a set of edges $E = \{e_1, e_2, \ldots, e_{n_e}\}$ of the graph, a set of states $s_1, s_2, \ldots, s_r$, and $n_i$ samples (number of observations) of each state $s_i$, for $i = 1, \ldots, r$. The aim is to test whether the centralities of the graphs $g_1, g_2, \ldots, g_r$, of each state, are the same among all graphs.

Alterations in the centrality measures among networks mean that the importance of the gene/protein/metabolite has changed, i.e., its connectivity within the network was altered. Our tool therefore allows the data to be evaluated by assessing: (i) the importance of a node in relation to the entire population of nodes in the network; (ii) the proximity among nodes; (iii) the importance of a node in the communication within the network; and (iv) the connectivity strength of the network as a whole.

The differential analysis consists of the same steps described in section 2.2.1. However, since in this case we are comparing graph centralities, the PDFG $f_{g_i}$ is replaced by the vector of centrality measures, and the distance $d_i$ by the Euclidean distance between the vector of node/edge centralities of graph $g_i$ and the vector containing the average centralities among the graphs (steps 2 and 3 of section 2.2.1).
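A minimal sketch of this centrality-based variant is given below, using weighted degree as the centrality. The edge representation (a dictionary keyed by node pairs), the function names, and the toy graphs are assumptions made for illustration only.

```python
from math import sqrt


def weighted_degree(edges, nodes):
    """Sum of incident edge weights for each node, in the order given by `nodes`."""
    deg = {v: 0.0 for v in nodes}
    for (u, v), w in edges.items():
        deg[u] += w
        deg[v] += w
    return [deg[v] for v in nodes]


def theta_centrality(centrality_vectors):
    """Average Euclidean distance between each centrality vector and the mean vector."""
    r = len(centrality_vectors)
    mean_vec = [sum(col) / r for col in zip(*centrality_vectors)]
    dists = [
        sqrt(sum((c - m) ** 2 for c, m in zip(vec, mean_vec)))
        for vec in centrality_vectors
    ]
    return sum(dists) / r


# Two toy graphs on the same node set (made-up edges and weights).
nodes = ["A", "B", "C"]
g1 = {("A", "B"): 1.0, ("B", "C"): 1.0}
g2 = {("A", "B"): 1.0}

vectors = [weighted_degree(g, nodes) for g in (g1, g2)]
theta = theta_centrality(vectors)
theta_identical = theta_centrality([vectors[0], vectors[0]])
```

Here the degree vectors are [1, 2, 1] and [1, 1, 0], so each lies at distance sqrt(0.5) from their mean, while two identical graphs give a statistic of zero.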

Consider a set of nodes $V = \{v_1, v_2, \ldots, v_{n_v}\}$ and a set of edges $E = \{e_1, e_2, \ldots, e_{n_e}\}$ of the graph, a set of states $s_1, s_2, \ldots, s_r$, and $n_i$ samples (number of observations) of each state $s_i$, for $i = 1, \ldots, r$. The aim is to test whether the centrality of the node $v_j$, for $j = 1, \ldots, n_v$, or of the edge $e_l$, for $l = 1, \ldots, n_e$, is the same among the graphs $g_1, g_2, \ldots, g_r$ of each state. As in section 2.3, here we consider the five node centrality measures (degree, eigenvector, closeness, betweenness, and clustering coefficient) and the edge centrality (edge betweenness).

The test comprises (i) construction of the graphs $g_1, g_2, \ldots, g_r$, (ii) computation of the test statistic, denoted by θ, which quantifies the differences among the node centralities of each network, and (iii) a permutation test.

The θ statistic is calculated as follows:

For each node _{j} (_{v}) or for each edge _{l} (_{v}) in graph _{i} (

From the

Calculate the distance between the centrality of nodes/edges in each graph _{i} ^{j}):

The statistic θ, which measures the difference among centralities for each node/edge

The hypotheses to be tested are defined as:

$H_0: \theta = 0$ vs. $H_1: \theta > 0$.

To construct the null hypothesis we perform a permutation test as follows:

Compute

Construct

Compute

Repeat steps 2 and 3 until obtaining the desired number of permutation replications.

Test if
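A generic permutation test of this kind can be sketched as follows. The sketch assumes that the null distribution is obtained by shuffling the state labels of the samples and recomputing the statistic, and that the p-value is the proportion of permuted statistics at least as large as the observed one, with an add-one correction. The function names, the stand-in statistic (spread of group means instead of a network statistic), and the toy data are hypothetical.

```python
import random


def permutation_pvalue(samples, labels, statistic, n_perm=1000, seed=0):
    """Return (observed statistic, permutation p-value) for label shuffling."""
    rng = random.Random(seed)
    observed = statistic(samples, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between samples and states
        if statistic(samples, shuffled) >= observed:
            exceed += 1
    # Add-one correction (an assumption here) avoids reporting an exact zero.
    return observed, (exceed + 1) / (n_perm + 1)


def spread_of_group_means(samples, labels):
    """Stand-in statistic: mean absolute deviation of the group means."""
    groups = {}
    for x, lab in zip(samples, labels):
        groups.setdefault(lab, []).append(x)
    means = [sum(g) / len(g) for g in groups.values()]
    center = sum(means) / len(means)
    return sum(abs(m - center) for m in means) / len(means)


# Three clearly separated states with four made-up observations each.
samples = [0.1, 0.2, 0.0, 0.3, 5.1, 5.3, 4.9, 5.2, 9.8, 10.1, 10.0, 9.9]
labels = ["s1"] * 4 + ["s2"] * 4 + ["s3"] * 4

theta_obs, p_value = permutation_pvalue(
    samples, labels, spread_of_group_means, n_perm=500
)
```

In a network setting, `statistic` would rebuild one graph per state from the relabeled samples and return θ; passing the statistic as a function keeps the permutation loop independent of how θ is defined.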

For

Differential node analysis based on the degree centrality.

| Node | θ | p-value | q-value | Degree (network 1) | Degree (network 2) | Degree (network 3) | Degree (network 4) |
|---|---|---|---|---|---|---|---|
| MAPK3 | 25.151 | 0.001 | 0.017 | 25 | 28.1 | 18.7 | 9.3 |
| MAPK10 | 19.904 | 0.001 | 0.017 | 29 | 30.7 | 22.2 | 17.5 |
| MAPK9 | 18.653 | 0.001 | 0.017 | 27.9 | 30.9 | 22.4 | 17.8 |
| TOLLIP | 17.877 | 0.002 | 0.026 | 25 | 28.2 | 20 | 15.3 |
| TAB1 | 17.393 | 0.001 | 0.017 | 27.2 | 30.8 | 25.2 | 16.1 |
| PIK3R1 | 17.098 | 0.001 | 0.017 | 28.9 | 30.7 | 24.5 | 18 |
| AKT3 | 17.013 | 0.001 | 0.017 | 31.1 | 31.4 | 24.2 | 21.3 |
| PIK3CB | 15.215 | 0.002 | 0.026 | 29.1 | 31.8 | 23.9 | 21.7 |

The

Schematic diagram of BioNetStat. BioNetStat receives an input file containing the values of the variables to be analyzed and the state of each sample ($s_1, \ldots, s_r$). This figure illustrates the method performed with the PDFG; however, the PDFG can be replaced by centralities (such as degree, betweenness, and closeness) without loss of generality.

To illustrate the utility of

The glioma dataset was obtained from a public database (TCGA) (Tomczak et al.,

The plant metabolism dataset contains 73 metabolites from whole-plant sorghum development (de Souza et al.,

To evaluate the performance of

We performed Monte Carlo experiments to verify the ability of

To measure the statistical power (the ability to detect differences among two or more networks when indeed they are different) of the methods, we build

To summarize the statistical power of the test, we constructed Receiver Operating Characteristic (ROC) curves. The
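One common way to build such a curve from simulation output is sketched below: for each significance threshold α, the proportion of simulated datasets whose p-value is at most α is recorded. That the curves in this study were built exactly this way is an assumption, and the function names and the synthetic p-values are hypothetical.

```python
def roc_from_pvalues(p_values, n_thresholds=101):
    """Return [(alpha, proportion of p-values <= alpha)] on a regular grid of alphas."""
    thresholds = [k / (n_thresholds - 1) for k in range(n_thresholds)]
    n = len(p_values)
    curve = []
    for alpha in thresholds:
        rejected = sum(1 for p in p_values if p <= alpha)
        curve.append((alpha, rejected / n))
    return curve


def area_under_curve(curve):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Synthetic p-values: evenly spread (as under a true null hypothesis)
# versus uniformly small (as for a test with high power).
null_pvalues = [(j + 0.5) / 100 for j in range(100)]
powerful_pvalues = [0.001] * 100

auc_null = area_under_curve(roc_from_pvalues(null_pvalues))
auc_powerful = area_under_curve(roc_from_pvalues(powerful_pvalues))
```

With this construction, a test that controls the type I error under the null hypothesis yields a curve close to the diagonal (area near 0.5), and a more powerful test yields a curve closer to the upper left corner (area near 1).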

In

Comparison of the statistical power of BioNetStat based on PDFG (black circles) and degree centrality (red circles), and GSCA (blue triangles). The values in the

As expected, we observe in

We also observed that for a fixed γ, the empirical power decreases with the increase of the number of networks, as shown in

Besides the statistical power, other criteria are relevant in the choice of the method to be used. In the following steps, we further analyze the glioma dataset.

We applied

We show the results of the tests, each one based on 1,000 permutation tests, for all gene sets in

This complementarity is already expected, because

To verify this hypothesis, we classified the 1,289 gene sets in

To highlight the applicability of the proposed method, we went deeper in the analysis of the 62 gene sets that were detected by

Our analyses suggested that at least one network is different from the others in the TLR gene set. We then performed a pairwise comparison of the four cancer types to better understand how they differ from each other.

Dendrogram of the distances among the four glioma subtypes regarding the

In the second data set, we studied how the metabolic networks of five plant organs differ from each other. The 73 metabolites analyzed in sorghum organs (leaf, culm, root, prop root, and grains) were partitioned into five groups according to their biochemical roles: carbohydrates, amino acids, organic acids, nucleotides, and all 73 metabolites. We built one network for each organ and each metabolic group. Then we compared the networks across the organs using the PDFG, the centrality tests of

The grain-filling stage in plants is largely dependent on metabolic status (Schnyder,

Results of the PDFG and degree centrality statistical tests comparing all five organs networks.

All | 73 | 0.017 | 17.167 | 0.329 | ||||||

Carbohydrate | 18 | 0.056 | 3.857 | 0.299 | 0.416 | 0.416 | ||||

Organic acid | 13 | 0.044 | 0.065 | 0.108 | 3.482 | 0.341 | ||||

Amino acid | 24 | 0.018 | 0.292 | 0.312 | 5.152 | 0.314 | 0.179 | 0.313 | ||

Nucleotide | 12 | 0.034 | 0.312 | 0.312 | 3.041 | 0.352 |

We obtained pairwise distances among the organ networks for those metabolic sets with a statistically significant difference.

For the tests performed with the degree centrality, we identified significant differences in all groups. The results suggest that even if the network structure (PDFG) does not change, the roles of the metabolites and their mean correlation values in each organ can be different. The

Differential node analysis based on degree centrality.

| Metabolite | θ | p-value | q-value | Degree (organ 1) | Degree (organ 2) | Degree (organ 3) | Degree (organ 4) | Degree (organ 5) |
|---|---|---|---|---|---|---|---|---|
| Pyruvate | 4.219 | 0.001 | 0.003 | 10.328 | 1.906 | 0.765 | 0.821 | 9.579 |
| Mevalonate | 4.215 | 0.001 | 0.003 | 9.93 | 0 | 0.838 | 0 | 8.191 |
| cis-Aconitate | 3.582 | 0.001 | 0.003 | 9.361 | 0 | 0.872 | 0.821 | 6.693 |
| AKG | 3.499 | 0.001 | 0.003 | 9.474 | 2.641 | 0.872 | 5.11 | 10.854 |
| 2/3PGA | 3.862 | 0.001 | 0.003 | 10.412 | 0.805 | 5.216 | 0 | 9.695 |
| Shikimate | 3.523 | 0.002 | 0.004 | 8.97 | 0 | 0.913 | 2.517 | 7.994 |
| Malate | 3.029 | 0.003 | 0.005 | 10.374 | 1.917 | 0.765 | 5.206 | 7.376 |
| Isocitrate | 2.782 | 0.003 | 0.005 | 9.588 | 1.872 | 4.654 | 5.316 | 9.898 |
| Citrate | 2.631 | 0.009 | 0.013 | 10.019 | 1.857 | 2.702 | 6.104 | 7.156 |
| PEP | 2.205 | 0.064 | 0.083 | 9.393 | 1.863 | 3.583 | 5.111 | 2.529 |
| Fumarate | 2.387 | 0.089 | 0.105 | 4.018 | 1.845 | 3.505 | 0.829 | 10.01 |
| trans-Aconitate | 2.712 | 0.097 | 0.105 | 0 | 1.842 | 0.913 | 3.304 | 9.835 |
| Succinate | 2.115 | 0.141 | 0.141 | 8.871 | 1.84 | 4.567 | 6.22 | 7.738 |
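As a quick arithmetic check on the table above, the first numeric column for several rows appears to match the mean absolute deviation of the five degree values from their average. The sketch below recomputes it under that assumption; this reading is inferred from the tabulated numbers, it is not a definition stated in the text, and the function name is hypothetical.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean of all values."""
    center = sum(values) / len(values)
    return sum(abs(v - center) for v in values) / len(values)


# (reported statistic, degree centrality in each of the five organ networks),
# copied from four rows of the table.
rows = {
    "Pyruvate": (4.219, [10.328, 1.906, 0.765, 0.821, 9.579]),
    "Mevalonate": (4.215, [9.93, 0.0, 0.838, 0.0, 8.191]),
    "cis-Aconitate": (3.582, [9.361, 0.0, 0.872, 0.821, 6.693]),
    "Malate": (3.029, [10.374, 1.917, 0.765, 5.206, 7.376]),
}

recomputed = {
    name: round(mean_absolute_deviation(degrees), 3)
    for name, (_, degrees) in rows.items()
}
```

For these four rows the recomputed values agree with the reported statistic to the three decimals shown, which is consistent with a per-node statistic that averages the distances between each network's centrality and the across-network mean.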

The majority of the metabolites of the

Citrate cycle (TCA Cycle) metabolic pathway from KEGG database (Kanehisa and Goto,

The analyzed data were collected between 10 a.m. and noon, when the leaf performs constant photosynthesis and mobilization of carbon. Also, the grain metabolism is geared toward storage of carbohydrates and proteins. Therefore, we have evidence to believe that the average degree centrality of metabolites is higher in the leaf and grain networks because the organic acid metabolism of these organs is more active than that of the other organs. Our findings reinforce that network analysis brings a new view to the data, since de Souza et al. (

Publicly available datasets were analyzed in this study. These data can be found here:

VJ, SS, AF, and MB conceived and designed the experiments, analyzed the data, and wrote the paper. VJ performed the experiments.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The handling editor declared a shared affiliation, though no other collaboration, with the authors at the time of the review.

The Supplementary Material for this article can be found online at: