Edited by: Alejandro Sanchez-Flores, Universidad Nacional Autonoma de Mexico, Mexico

Reviewed by: Ali Rana Atilgan, Sabanci University, Turkey; Raul Isea, Fundación Instituto de Estudios Avanzados IDEA, Venezuela

*Correspondence: Claudia Steglich, Genetics & Experimental Bioinformatics, Faculty of Biology, University of Freiburg, Schaenzlestr. 1, 79104 Freiburg, Germany e-mail:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The visualization of massive datasets, such as those resulting from comparative metatranscriptome analyses or the analysis of microbial population structures using ribosomal RNA sequences, is a challenging task. We developed a new method called CoVennTree (Comparative weighted Venn Tree) that simultaneously compares up to three multifarious datasets by aggregating and propagating information from the bottom to the top level and produces a graphical output in Cytoscape. With the introduction of weighted Venn structures, the contents and relationships of various datasets can be correlated and simultaneously aggregated without losing information. We demonstrate the suitability of this approach using a dataset of 16S rDNA sequences obtained from microbial populations at three different depths of the Gulf of Aqaba in the Red Sea. CoVennTree has been integrated into the Galaxy ToolShed and can be directly downloaded and integrated into the user instance.

In recent years, new high-throughput sequencing technologies such as 454, Illumina and SOLiD have become available and have led to an enormous increase in the volume of available sequence data while simultaneously facilitating a dramatic decrease in sequencing costs. The development of these technologies has enabled the large-scale application of metatranscriptomics and metagenomics approaches and has been responsible for substantial advances in a broad variety of research, including the large-scale identification of DNA polymorphisms, investigations of the compositions of microbial communities, and genome- and population-wide gene expression studies at single-nucleotide resolution. For the first time, the comprehensive comparison of sequences obtained in the field with sequences from databases using annotated functions has become possible and has enabled the assessment of environmentally important genes and their linked metabolic pathways. The first step in the analysis of sequencing data is based on either a composition or a comparison approach. The latter consists of the mapping of reads against a database using BLAST (Altschul et al.,

A weighted Venn data structure for three datasets is completely defined by a 6-tuple (w_{1}, w_{2}, w_{3}, w_{1,2}, w_{1,3}, w_{2,3}), where w_{i} denotes the weight of dataset i and w_{i,j} denotes the co-occurrence weight of datasets i and j. For example, if the weights of a leaf are w_{1} = 1000, w_{2} = 3000, and w_{3} = 4000, the co-occurrence weights are w_{1,2} = 1000, w_{1,3} = 1000, and w_{2,3} = 3000. The resulting weighted Venn diagram for each leaf contains three interleaving circles, which overlap by 100%.
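The leaf case can be illustrated with a minimal sketch. It assumes, as the example values and the 100% overlap suggest, that the co-occurrence weight of two datasets at a leaf equals the smaller of their two weights. The names `WeightedVenn` and `leaf_venn` are hypothetical and are not part of CoVennTree.

```python
from typing import NamedTuple


class WeightedVenn(NamedTuple):
    """Weights of datasets 1-3 followed by the pairwise co-occurrence weights."""
    w1: int
    w2: int
    w3: int
    w12: int
    w13: int
    w23: int


def leaf_venn(w1: int, w2: int, w3: int) -> WeightedVenn:
    # Assumption: at a leaf the circles overlap by 100%, so each
    # co-occurrence weight is the smaller of the two dataset weights.
    return WeightedVenn(w1, w2, w3, min(w1, w2), min(w1, w3), min(w2, w3))


leaf = leaf_venn(1000, 3000, 4000)
print(leaf)
```

With the weights from the example above, the sketch reproduces the co-occurrence weights 1000, 1000, and 3000.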

Prior to the VDS calculation, three sets are defined as follows: “

To compute the VDS value for the given children, five steps are required (Equation 1). The two sums in Equation (1) represent the decomposition of the weighted Venn diagrams: the first sum is related to the total content of every dataset, and the second sum is related to the overlaps between different datasets. The maximum number of datasets or possible overlaps is three; therefore, the sums run from 1 to 3. To normalize the values to an interval of [0, 1], the outcome of each sum is divided by its corresponding set, |

Equations (2) through (5) describe the essential steps that are involved in the decomposition in detail. In this context, decomposition means the splitting of every child node (weighted Venn diagram) into two vectors. One vector contains the number of data points in every dataset (called weights), and the other contains the numbers of data points that are shared between datasets 1 and 2, between datasets 1 and 3, and between datasets 2 and 3 (called co-occurrence weights). All vectors of the children of a parent node are stored in a corresponding matrix. Matrix Θ contains all sets, and matrix Π contains all overlaps. Every column ϑ_{1n}, ϑ_{2n}, and ϑ_{3n} in matrix Θ is related to a corresponding column in matrix Π: π_{1n}, π_{2n}, and π_{3n}, respectively. Every row in matrix Θ corresponds to a condition, and every row in matrix Π corresponds to a co-occurrence (the co-occurrence of conditions 1 and 2, the co-occurrence of conditions 1 and 3 or the co-occurrence of conditions 2 and 3). The information contents of the matrices Θ - Π, Θ′ - Π′, Θ″ - Π″, and Θ‴ - Π‴ are distinct, but the mathematical operations are the same for each step.
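The layout of the two matrices can be sketched as follows. This covers only the splitting of the children into Θ and Π, not the operations of Equations (2) through (5). The function name `decompose` and the child values are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[int]]


def decompose(children: Sequence[Sequence[int]]) -> Tuple[Matrix, Matrix]:
    """Split child 6-tuples into a weight matrix and a co-occurrence matrix.

    Rows correspond to conditions (theta) or to pairs of conditions (pi);
    column n holds the values of the n-th child.
    """
    theta = [[child[row] for child in children] for row in range(3)]
    pi = [[child[3 + row] for child in children] for row in range(3)]
    return theta, pi


# Two illustrative children, each given as
# (weight 1, weight 2, weight 3, overlap 1&2, overlap 1&3, overlap 2&3).
children = [
    (1000, 3000, 4000, 1000, 1000, 3000),
    (200, 0, 50, 0, 50, 0),
]
theta, pi = decompose(children)
print(theta)
print(pi)
```

Storing one child per column keeps every column of Θ aligned with the column of Π that belongs to the same child, as described above.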

In Equation (2), the variables ϑ_{i.} and π_{i.} for _{i.} and π′_{i.} for _{i} and π‴_{i} for

The following formulas (Equations 6–11) represent the procedure used to compute the frame size (space), which is essential for drawing a weighted Venn diagram. The graphical output, consisting of a weighted Venn diagram, is achieved by applying the Google API, but this tool does not allow for the manual adjustment of the position of a single set. Therefore, a combination of the complete sums [_{sum})] and the overlaps with the largest set [_{sum})] is required to determine the frame size in pixels (Equation 6). The function _{sum}, the available sets for the corresponding weighted Venn diagram are summed (Equation 8).

For instance, if only the first two sets are available, the final set (3 of 3) takes a value of zero and does not contribute to the outcome. The additional value add_{sum} represents the region in which there is no overlap between the largest set and the remaining smaller sets, which is incorporated into the weighted Venn diagram structure. Equation (9) returns the sum of the smaller sets, and Equation (10) returns the overlap between the largest set and the smaller sets. The non-overlapping component is determined by subtracting corr_{ov} from corr_{set}, and this additional value add_{sum} is used to expand the native frame size.
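The computation of add_{sum} can be sketched under one stated assumption: the overlap between the largest set and the smaller sets is taken to be the sum of the co-occurrence weights that involve the largest set. Equations (9) and (10) are not reproduced here, so this reading may differ from the published formulas, and the conversion to a frame size in pixels (Equation 6) is not attempted. The function name `add_sum` is hypothetical.

```python
from typing import Sequence


def add_sum(weights: Sequence[int], overlaps: Sequence[int]) -> int:
    """Non-overlapping part of the smaller sets relative to the largest set.

    weights  = (set 1, set 2, set 3); an unavailable set has weight 0.
    overlaps = (1&2, 1&3, 2&3) co-occurrence weights.
    """
    largest = max(range(3), key=lambda i: weights[i])
    pairs = [(0, 1), (0, 2), (1, 2)]
    # corr_set: sum of the sets other than the largest one.
    corr_set = sum(w for i, w in enumerate(weights) if i != largest)
    # corr_ov (assumed): overlaps in which the largest set takes part.
    corr_ov = sum(ov for pair, ov in zip(pairs, overlaps) if largest in pair)
    return corr_set - corr_ov


full_overlap = add_sum((1000, 3000, 4000), (1000, 1000, 3000))
two_sets = add_sum((10, 20, 0), (5, 0, 0))
print(full_overlap, two_sets)
```

For a diagram in which the smaller sets lie entirely within the largest set, the sketch yields no additional space; when only two sets are available, the third contributes zero, in line with the description above.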

CoVennTree associates rooted tree data structures with weighted Venn diagrams to produce an aggregated and comparative tree visualization for up to three massive datasets (Figure

We developed a new correlation measure named the VDS (

To demonstrate the power of CoVennTree and illustrate its use, a comparative analysis was performed using three 16S rDNA datasets containing more than 150,000 sequences. Sampling for the 16S rDNA analysis was performed at station A in the Red Sea at depths of 60 m, 100 m, and 130 m. The processing of the samples has been described by Steglich et al. (

Producing clear, publication-ready trees for large datasets that can be presented on a single printed page is not a simple task. Most attempts focus on the extensive analysis of single datasets (for example, Krona Ondov et al.,

SCL, BV, and CS conceived the tool. SCL developed the tool. CS and WRH performed the experiments. SCL, BV, WRH, and CS wrote the paper.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

This work was supported by the ASSEMBLE (Association of European Marine Biological Laboratories) Infrastructure Access Call 2 to the Interuniversity Institute for Marine Sciences (IUI), Eilat, Israel (grant agreement no: 227799) through CS and by the German-Israeli Foundation for Scientific Research and Development (GIF) (project number 1133-13.8/2011) through WRH. We thank Franz Baumdicker for critical reading and helpful comments.

The Supplementary Material for this article can be found online at:

*.sif) and the corresponding attribute file (*.venn) into Cytoscape version 2.8.x and provides an example of graph structuring