
Edited by: Celia M. T. Greenwood, McGill University, Canada

Reviewed by: Claudia L. Kleinman, McGill University, Canada; Pingzhao Hu, University of Manitoba, Canada

*Correspondence: Elie Maza

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

In the past 5 years, RNA-Seq has become a powerful tool in transcriptome analysis, even though computational methods dedicated to the analysis of high-throughput sequencing data are yet to be standardized. It is, however, now commonly accepted that the choice of a normalization procedure is an important step in such a process, for example in differential gene expression analysis. The present article highlights the similarities between three normalization methods: TMM from the edgeR R package, RLE from the DESeq2 R package, and MRN. Both edgeR and DESeq2 are widely used for differential gene expression analysis. This paper introduces properties that show when these three methods will give exactly the same results. These properties are proven mathematically and illustrated by performing

In the past 5 years, RNA-Seq approaches, based on high-throughput sequencing technologies, have become an essential tool in transcriptomics studies (cf. Wang et al.,

This paper deals with two widely used and very important normalization methods and a third method related to these. The first method is the “Trimmed Mean of

In this paper, all theoretical results will be illustrated by

TMM | 0.98012 | 0.92236 | 0.71989 | 1.05807 | 0.98130 | 0.88352 | 1.13027 | 1.19388 | 1.24130 |
RLE | 1.01712 | 0.80899 | 0.72660 | 0.86594 | 1.23622 | 0.73647 | 1.28172 | 1.27220 | 1.37315 |
MRN | 0.87105 | 0.75416 | 0.91430 | 0.79324 | 1.20131 | 0.80461 | 1.33984 | 1.25330 | 1.29317 |

The aim of this study is to provide a deeper understanding of why the three normalization methods mentioned above share a similar normalization approach. This paper also demonstrates that, in some cases, some shared parameters (such as the relative sizes of transcriptomes or the normalization factors) are strictly equal.

To investigate the tomato transcriptome dynamics of fruit set, RNA was isolated from flower buds (Bud) and from flowers at the anthesis (Ant) and post-anthesis (Pos) stages. For each stage, cDNA libraries were generated from three biological replicates and sequenced with Illumina mRNA-Seq technology. Then, after mapping reads to the tomato genome sequence, we obtained a table of raw counts with 34675 rows (genes) and 9 columns (3 stages and 3 replicates per stage). These technical procedures are described in Maza et al. (

All computations were done within the R environment (cf. R Development Core Team,

As described above, the matrix containing raw counts is denoted by

The TMM normalization method is implemented in the edgeR package by means of the

The RLE normalization method is implemented in the DESeq2 package by means of the function

The MRN normalization method is implemented in a homemade function called

We define hereafter some important terms that are used in the studied normalization methods. More detailed definitions are given in Robinson and Oshlack (

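Both M and A values are computed for each gene $g$ from its counts in two samples, written with subscripts $g1$ and $g2$. They represent, respectively, the fold change and the absolute expression level of the gene: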
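The two quantities can be sketched in a few lines of Python. This is only an illustration under stated assumptions: counts are scaled by library size before taking logarithms, following the convention of Robinson and Oshlack, and the function names `m_a_values` and `trimmed_mean`, as well as the default trimming fraction, are hypothetical choices made for this example.

```python
import math


def m_a_values(x1, x2, n1=None, n2=None):
    """Per-gene M (log2 fold change) and A (mean log2 expression) values.

    x1, x2: lists of raw counts for the same genes in two samples.
    n1, n2: library sizes; default to the column sums (an assumption).
    """
    n1 = sum(x1) if n1 is None else n1
    n2 = sum(x2) if n2 is None else n2
    m_vals, a_vals = [], []
    for c1, c2 in zip(x1, x2):
        if c1 <= 0 or c2 <= 0:
            continue  # logarithms are undefined for zero counts
        p1, p2 = c1 / n1, c2 / n2  # library-size-scaled counts
        m_vals.append(math.log2(p1) - math.log2(p2))
        a_vals.append(0.5 * (math.log2(p1) + math.log2(p2)))
    return m_vals, a_vals


def trimmed_mean(values, trim=0.3):
    """Unweighted mean after dropping a fraction `trim` from each tail."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values removed per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Two samples whose counts differ only by sequencing depth: every M-value is 0.
m_vals, a_vals = m_a_values([10, 20, 30, 40], [20, 40, 60, 80])
```

When two samples differ only in sequencing depth, all M-values vanish, which is the situation a scaling normalization is meant to restore.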

A trimmed mean of

In this section, the three studied normalization methods are first described and compared. Then, three propositions are introduced and

We emphasize that this paper focuses on so-called “scaling normalization methods”; this is only one approach, and it may be limited to some specific experimental situations (cf. Maza et al.,

Note that, for consistency, the first paragraph below (named “Notations and Experimental design”) reproduces information that has already been reported in detail in Maza et al. (

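Let $X_{gkr}$ be the observed number of reads (or count) of gene $g$ in replicate $r$ of condition $k$, $\mu_{gk}$ the expectation of the true and unknown number of transcripts of a given cell for gene $g$ in condition $k$, $L_g$ the length of gene $g$, and $N_{kr}$ the total number of reads in replicate $r$ of condition $k$. The expectation of $X_{gkr}$ can then be modeled as

$$E(X_{gkr}) = \frac{\mu_{gk} L_g}{S_k} N_{kr},$$

where $S_k = \sum_{g} \mu_{gk} L_g$ is the size of the studied transcriptome in condition $k$.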

Then, for each gene

We can easily see in the equation above that the ratio of interest, from a differential analysis point of view, i.e.,

Table

^{6}.

Step | Description
I | Pre-normalization by library size
II | Reference sample
III | Relative sizes of transcriptomes and reference sample
IV |
V | Taking into account both the relative size and the library size
VI | Normalization factors
VII | Normalization of counts
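As an illustration of the reference-sample and median-ratio steps listed above, the RLE size factors can be sketched as follows. This is a minimal sketch of the median-of-ratios idea and not the DESeq2 implementation: the function name `rle_size_factors` is hypothetical, and the choice to discard genes with a zero count in any sample is an assumption of this example.

```python
import math
from statistics import median


def rle_size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of samples, each a list of per-gene raw counts
            (all samples list the genes in the same order).
    """
    n_genes = len(counts[0])
    # Keep genes with strictly positive counts in every sample,
    # so that the geometric mean is well defined.
    usable = [g for g in range(n_genes) if all(s[g] > 0 for s in counts)]
    # Reference sample: per-gene geometric mean across samples (log scale).
    log_ref = {
        g: sum(math.log(s[g]) for s in counts) / len(counts) for g in usable
    }
    # Size factor of a sample: median over genes of count / reference.
    return [
        math.exp(median(math.log(s[g]) - log_ref[g] for g in usable))
        for s in counts
    ]


# Second sample sequenced twice as deeply as the first.
size_factors = rle_size_factors([[10, 20, 30], [20, 40, 60]])
normalized = [
    [c / f for c in sample]
    for sample, f in zip([[10, 20, 30], [20, 40, 60]], size_factors)
]
```

Dividing each sample by its size factor brings the two toy samples onto the same scale, which corresponds to the last step of the table.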

After the above detailed descriptions of our three methods, we introduce below three properties showing particular cases where all three methods give the same result.

Proposition 1 (concerning TMM and MRN).

An example illustrating Proposition 1 is given in Table

TMM | 0.97654 | 0.92966 | 0.72054 | 1.06259 | 0.97360 | 0.87363 | 1.14166 | 1.19541 | 1.23937 |
MRN | 0.97658 | 0.92957 | 0.72079 | 1.06280 | 0.97361 | 0.87361 | 1.14189 | 1.19599 | 1.23792 |

We can clearly see in Table

Proposition 2 (concerning RLE and MRN).

We illustrate Proposition 2 by calculating the size factors for some pairs of samples (see Table

RLE | 1.1015522 | 0.9078099 | 0.7870385 | 1.2705859 | 0.8248517 | 1.2123391 |
MRN | 1.1015522 | 0.9078099 | 0.7870385 | 1.2705859 | 0.8248517 | 1.2123391 |

We can see in Table

We must note here that, with

Proposition 3 (concerning RLE, TMM, and MRN).

We note here that the

Moreover, if we assume that (i) the trimmed mean of the TMM method is computed as an unweighted mean, as described in Step III of the edgeR method in Table

Then

□

Then, we directly have that

and

Let's then describe calculations for the MRN method. For

Then, the relative sizes are the following:

That leads to

Finally, the calculation of the geometric mean of these values, i.e.,

implies that

and

It follows that

□
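The equality argued above for two samples can be checked numerically. The sketch below is an illustration under stated assumptions, not the code of the Supplementary Material: the function names `rle_pair` and `mrn_pair` are hypothetical, the MRN steps follow the sequence "pre-normalization, relative size, multiplication by library size, division by the geometric mean" with the first sample taken as reference, and the toy counts contain an odd number of genes so that each median is a single observed ratio.

```python
import math
from statistics import median


def rle_pair(x1, x2):
    """RLE size factors for two samples (reference = per-gene geometric mean)."""
    pairs = [(a, b) for a, b in zip(x1, x2) if a > 0 and b > 0]
    s1 = median(a / math.sqrt(a * b) for a, b in pairs)
    s2 = median(b / math.sqrt(a * b) for a, b in pairs)
    return [s1, s2]


def mrn_pair(x1, x2):
    """MRN-style normalization factors for two samples without replicates."""
    n1, n2 = sum(x1), sum(x2)  # library sizes
    # Ratios of library-size-scaled counts, sample 1 being the reference.
    ratios = [(b / n2) / (a / n1) for a, b in zip(x1, x2) if a > 0 and b > 0]
    tau = [1.0, median(ratios)]           # relative sizes of transcriptomes
    e = [tau[0] * n1, tau[1] * n2]        # relative size times library size
    g = math.sqrt(e[0] * e[1])            # geometric mean of the two values
    return [e[0] / g, e[1] / g]           # normalization factors


x1 = [10, 20, 30, 40, 50]
x2 = [12, 30, 33, 90, 40]
rle = rle_pair(x1, x2)
mrn = mrn_pair(x1, x2)
```

On these toy counts both functions return the same pair of factors up to floating-point rounding, because both reduce to the square root of the median count ratio and its inverse.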

For the TMM method, under the assumptions of Proposition 1, the relative scaling factors are the following:

and

Then, with the following geometric mean of these values:

the adjusted relative scaling factors are the following:

and

We can then calculate the effective library sizes:

and

Hence, these effective library sizes are equal (up to a constant) to the normalization factors obtained from the RLE (and MRN) methods:

and

And the proposition is proved. □
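The proportionality between TMM effective library sizes and RLE size factors can also be illustrated on toy data. This sketch rests on a loud assumption: the weighted trimmed mean of M-values is replaced by a plain median of M-values, which stands in for the conditions of Proposition 1 and is not the edgeR default. The function names `tmm_like_effective_sizes` and `rle_pair` are hypothetical.

```python
import math
from statistics import median


def tmm_like_effective_sizes(x1, x2):
    """Effective library sizes for two samples, TMM-style.

    Assumption: the trimmed mean of M-values is replaced by their
    unweighted median; sample 1 is the reference sample.
    """
    n1, n2 = sum(x1), sum(x2)  # library sizes
    m_vals = [
        math.log2((b / n2) / (a / n1))
        for a, b in zip(x1, x2)
        if a > 0 and b > 0
    ]
    f = [1.0, 2 ** median(m_vals)]        # relative scaling factors
    g = math.sqrt(f[0] * f[1])            # geometric mean of the factors
    f_adj = [v / g for v in f]            # adjusted factors, product equal to 1
    return [f_adj[0] * n1, f_adj[1] * n2]


def rle_pair(x1, x2):
    """RLE size factors for two samples (reference = per-gene geometric mean)."""
    pairs = [(a, b) for a, b in zip(x1, x2) if a > 0 and b > 0]
    return [
        median(a / math.sqrt(a * b) for a, b in pairs),
        median(b / math.sqrt(a * b) for a, b in pairs),
    ]


x1 = [10, 20, 30, 40, 50]
x2 = [12, 30, 33, 90, 40]
eff = tmm_like_effective_sizes(x1, x2)
rle = rle_pair(x1, x2)
# The ratio effective size / size factor is the same constant for both samples.
constants = [e / s for e, s in zip(eff, rle)]
```

In this toy case the common constant is the geometric mean of the two library sizes, so dividing counts by either set of factors yields normalized counts that differ only by a global scale.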

This paper focuses on two widely used normalization methods for RNA-Seq data and a third method related to these, which seem to give similar results and to outperform many other classical methods, according to the references given in the Introduction. A better understanding of these methods is therefore an important issue, which this paper addresses.

We highlight in this paper that the three normalization methods considered rest on similar underlying ideas. Moreover, we prove that these methods give exactly the same result in some simple experimental designs. For instance, Proposition 3 shows that, for two given samples, the normalized counts are equal up to a constant.

It has also been shown in this paper that users should take care not to mix these normalization methods and R packages, because their concepts are not all equivalent. For instance, the so-called “normalization factors” from edgeR and “size factors” from DESeq2 are not the same theoretical parameters.

Nevertheless, it has been shown in Maza et al. (

Finally, we conclude that for a very simple experimental design, i.e., two conditions and no replicates, users can apply any of the three studied normalization methods with no impact on the results. But, for a more complex experimental design, the results described in Maza et al. (

EM has carried out the calculations, performed the analysis, and written the paper.

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The reviewer CK and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

I thank Sarah Chrisment for the thorough revision of my manuscript, which helped me to fix the grammatical errors and to improve the overall readability of the text. This work was supported by the Laboratoire d'Excellence (LABEX) TULIP (ANR-10-LABX-41). This work benefited from the networking activities within the European funded COST ACTION FA1106 “Qualityfruit.”

The Supplementary Material for this article can be found online at:

This first additional file contains the R code of all calculations carried out in this paper. This file can be executed directly in R (see Materials and Methods), but it can also be opened with a simple text viewer.

The second additional file contains the R code for the

This file contains the fruit set data used in Additional file