Edited by: Lee Sweetlove, University of Oxford, UK
Reviewed by: Rui Alves, Universitat de Lleida, Spain; Thomas Christopher Rhys Williams, Universidade Federal de Vicosa, Brazil
*Correspondence: Zoran Nikoloski, Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, 14424 Potsdam, Germany email:
This article was submitted to Plant Systems Biology, a section of the journal Frontiers in Plant Science.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Genome-scale metabolic models (GEMs) are increasingly applied to investigate the physiology not only of simple prokaryotes, but also of eukaryotes, such as plants, characterized by compartmentalized cells of multiple types. While genome-scale models aim to include the entirety of known metabolic reactions, mounting evidence indicates that only a subset of these reactions is active in a given context, defined by, for example, developmental stage, cell type, or environment. As a result, several methods have been proposed to reconstruct context-specific models from existing genome-scale models by integrating various types of high-throughput data. Here we present a mathematical framework that puts all existing methods under one umbrella and provides the means to better understand their functioning, highlight similarities and differences, and help users select the most suitable method for an application.
Genome-scale metabolic models (GEMs) have become a useful tool to investigate metabolism. They have numerous applications, from basic research on metabolic functioning and cell physiology (Bordbar et al.,
The success of GEMs is largely due to their integrative nature, representing the whole known network of biochemical reactions of a given organism, and the possibility to readily use them in a mathematical model. This mathematical model can be further interrogated with powerful methods from constraint-based analysis (Lewis et al.,
The recent advent of high-throughput technologies has propelled the GEM community to develop new methods for integrating high-throughput data into existing metabolic models. In general, these methods employ data to (1) improve flux predictions through further constraining of the solution space (Colijn et al.,
High-throughput data sets can be divided into hierarchical categories that correspond to different cellular processes. On one hand, transcript profiles capture the instantaneous expression state of a given genome under a particular condition. They have the greatest coverage, since usually all known genes are considered. They are also the most accessible in terms of experimental tractability, due to the availability of classical technologies (e.g., microarrays) as well as modern developments (e.g., RNA-seq). However, gene expression is also at the top of the hierarchical chain of events that govern metabolic fluxes, which may explain the relatively low correlation values between these two quantities, as reported in previous works (Yang et al.,
Several recent comprehensive reviews provide extensive coverage of computational methods for integrating high-throughput data in GEMs (Joyce and Palsson,
Our framework for classification of the existing methods for extraction of context-specific models simultaneously offers a generalization of the mathematical and algorithmic formulation. With respect to the employed objective, these methods can be divided into three main families, namely the GIMME-like, iMAT-like, and MBA-like families, named after the first representative method in each class (Figure
The GIMME-like family encompasses the GIMME method (Becker and Palsson,
1. function
2.   max_{v} RMF
       subject to:
         Sv = 0
         v_{min} ≤ v ≤ v_{max}
3.   min_{v} IS
       subject to:
         Sv = 0
         v_{min} ≤ v ≤ v_{max}
         RMF = k · RMF_{opt}
         k ∈ [0, 1]
4. end function
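The two-step scheme above can be illustrated on a toy network. The following is a minimal sketch, not an implementation of any published tool: the four-reaction network, the penalty values, and the function names are hypothetical, IS is assumed to be a penalty-weighted sum of absolute fluxes, and a brute-force search over integer fluxes stands in for the LP solver that a real application would use.

```python
from itertools import product

# Hypothetical toy network: v1: -> A, v2: A -> B, v3: A -> B (alternative route),
# v4: B -> (taken as the RMF). Steady state implies v1 = v4 = v2 + v3.
V_MAX = 10  # assumed upper bound for every flux; lower bounds are 0
PENALTY = {"v1": 0.0, "v2": 0.0, "v3": 4.0, "v4": 0.0}  # assumed penalties


def feasible_fluxes():
    """Enumerate integer steady-state flux vectors within the bounds."""
    for v2, v3 in product(range(V_MAX + 1), repeat=2):
        total = v2 + v3
        if total <= V_MAX:
            yield {"v1": total, "v2": v2, "v3": v3, "v4": total}


def inconsistency_score(fluxes):
    """IS, assumed here to be a penalty-weighted sum of flux magnitudes."""
    return sum(PENALTY[r] * abs(value) for r, value in fluxes.items())


def gimme_like(k):
    # Step 1: maximize the RMF (grid search standing in for an LP).
    rmf_opt = max(v["v4"] for v in feasible_fluxes())
    # Step 2: minimize IS while fixing RMF = k * RMF_opt.
    candidates = [v for v in feasible_fluxes()
                  if abs(v["v4"] - k * rmf_opt) < 1e-9]
    return rmf_opt, min(candidates, key=inconsistency_score)


rmf_opt, fluxes = gimme_like(k=0.9)
print(rmf_opt, fluxes, inconsistency_score(fluxes))
```

In this sketch the penalized route v3 carries no flux in the solution, because the unpenalized route v2 alone can sustain the required fraction of the optimal RMF.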
R_{G}  Set of reactions of the generic model
R_{P}  Set of reactions of the (partial) context-specific model
C  Core set of reactions
C_{H}  Core set of reactions with high likelihood
C_{M}  Core set of reactions with moderate likelihood
N_{C}  Non-core set of reactions
R_{Nc}  Subset of reactions from N_{C}
KeyMet  Set of key metabolites (holding positive evidence)
KeyMetProd  Set of reactions producing a key metabolite (with positive evidence)
KeyMetSink  Set of sink reactions for a metabolite with positive evidence
MetTask  Set of reactions participating in a given metabolic task (a linear combination of a subset of the generic model)
Negative  Set of reactions whose associated transcript(s) hold negative evidence (not expressed in any condition)
Rev  Set of reversible reactions of the generic model
Set of reactions with high associated expression value
Set of reactions with low associated expression value
k  Weighting factor (scalar), typically k ∈ [0, 1]
User-defined threshold for expression values
ϵ, δ  User-defined small positive values
Vector of weighting factors (arbitrary function of experimental evidence)
S  Stoichiometric matrix
v  Vector of flux values
v_{min}, v_{max}  Boundary conditions for v (physiologically minimal and maximal flux capacity)
Forward and reverse senses of reversible reactions
Vector of concentration rates
Vector of data values
IS  Inconsistency score
FVA  Flux Variability Analysis
RMF, RMF_{opt}  Required Metabolic Functionality, RMF optimum value as calculated by FBA
Gene expression measured intensities, maximum gene intensity (for a given sample) and intensity value for a particular gene, respectively
In GIMME, the penalty function is termed
GIM^{3}E introduces several modifications to the original GIMME. First, it allows integration of metabolomics data, imposing a nonzero flux value on reactions involving a metabolite for which there is evidence of being synthesized in the investigated condition. Second, it modifies the definition of the reaction penalty; here, the penalties for all reaction-associated genes are determined separately and are then mapped to the reaction following the GPR rules. Moreover, the penalties are calculated as the distance between each transcript and the maximum expression level of the set. Consequently, after mapping transcript penalties, all reactions obtain a penalty value, rather than only those below the threshold, as is the case in GIMME. Finally, GIM^{3}E takes into account the directionality of reversible reactions by constraining them to operate in only one direction, which is modeled by introducing a binary variable for the direction of choice. As a result, GIM^{3}E is formulated as a mixed integer linear program (MILP), which is more computationally challenging than the LP formulation of GIMME.
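The penalty definition described above may be sketched as follows. This is only an illustration: the gene names, expression values, and GPR rules are hypothetical, and the convention used to combine penalties across AND/OR operators (largest penalty for complexes, smallest for isozymes) is an assumption, since the text does not specify it.

```python
# Hypothetical expression values for four genes in one sample.
expression = {"g1": 12.0, "g2": 3.0, "g3": 8.0, "g4": 12.0}

# Gene penalty: distance to the maximum expression level of the set.
max_level = max(expression.values())
gene_penalty = {gene: max_level - value for gene, value in expression.items()}

# Hypothetical GPR rules as nested tuples:
# ("and", ...) for enzyme complexes, ("or", ...) for isozymes.
gpr = {
    "R1": "g1",
    "R2": ("and", "g1", "g2"),
    "R3": ("or", "g2", ("and", "g3", "g4")),
}


def reaction_penalty(rule):
    """Map gene penalties to a reaction by walking its GPR rule.

    Assumed convention: a complex (AND) is limited by its least expressed
    subunit and takes the largest penalty; isozymes (OR) take the smallest.
    """
    if isinstance(rule, str):
        return gene_penalty[rule]
    operator, *parts = rule
    values = [reaction_penalty(part) for part in parts]
    return max(values) if operator == "and" else min(values)


penalties = {reaction: reaction_penalty(rule) for reaction, rule in gpr.items()}
print(penalties)
```

Every reaction with a GPR rule receives a penalty in this scheme, including those associated with highly expressed genes, whose penalty is simply zero or close to zero.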
When a given RMF operates in different contexts, the operability constraint may lead to more accurate context-specific model reconstructions and flux distributions. This issue has been evaluated in a recent review (Machado and Herrgård,
Nevertheless, while the selection of an RMF can be a relatively easy task for prokaryotes, for which experimental evidence supports the choice of cellular growth or biomass maximization as a plausible RMF, this task is much more challenging for eukaryotic organisms, especially multicellular ones. In this case, choosing an RMF for a given tissue or cell type is complicated, as each cell type is specialized in certain biochemical functions that are modulated at the level of the entire organism. Therefore, methods that do not require an RMF may be more easily applied to models of multicellular organisms.
There are existing implementations for both methods: GIMME can be executed using the c
The iMAT-like family comprises three methods, iMAT (Shlomi et al.,
The algorithm first classifies reactions into two groups based on a previously defined threshold for the corresponding expression data; this results in groups of reactions with high and low associated expression values. It then maximizes the number of matches between a reaction state, defined through a minimum flux value, and the group to which the reaction belongs. Thus, if a reaction is included in the highly expressed group, the aim is to obtain a flux value over the minimum, which is performed by solving the MILP in Box
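The classification step and the quantity that the MILP maximizes can be sketched for a fixed flux vector. This is a minimal illustration under stated assumptions: the reaction names, expression values, threshold, and minimum flux `eps` are hypothetical, and a reaction in the low group is assumed to match when its flux magnitude stays below `eps`. The optimization itself (searching over flux vectors) is not shown.

```python
def classify(expression, threshold):
    """Split reactions into high and low expression groups."""
    r_high = {r for r, value in expression.items() if value >= threshold}
    r_low = set(expression) - r_high
    return r_high, r_low


def similarity(fluxes, r_high, r_low, eps=1e-3):
    """Count reactions whose flux state agrees with their expression group."""
    active_matches = sum(1 for r in r_high if abs(fluxes[r]) >= eps)
    inactive_matches = sum(1 for r in r_low if abs(fluxes[r]) < eps)
    return active_matches + inactive_matches


# Hypothetical reaction-level expression values and one candidate flux vector.
expression = {"R1": 9.0, "R2": 1.5, "R3": 7.2, "R4": 0.4}
fluxes = {"R1": 2.0, "R2": 0.0, "R3": 0.0, "R4": 1.0}

r_high, r_low = classify(expression, threshold=5.0)
score = similarity(fluxes, r_high, r_low)
print(sorted(r_high), sorted(r_low), score)
```

Here R1 (high, active) and R2 (low, inactive) agree with the data, whereas R3 and R4 do not, so the candidate flux vector scores two matches out of four.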
Several network states can yield the same overall similarity to expression data, i.e., multiple flux distributions may yield the same objective function value. iMAT tackles this issue through an adapted flux variability analysis (FVA): first, it forces each reaction to be active and evaluates the similarity, and then it repeats the process by forcing each reaction to be inactive. The final outcome is computed by comparing the two obtained similarities. A reaction is termed active if its inclusion results in higher similarity to data, and inactive if its inclusion decreases this similarity. If both similarities are equal, iMAT categorizes the reaction as undetermined.
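The decision rule applied after the two forced optimizations can be written compactly. In this sketch the similarity values are hypothetical numbers supplied by hand; in practice each pair would come from solving two MILPs per reaction.

```python
def imat_status(sim_forced_active, sim_forced_inactive):
    """Label a reaction by comparing the two optimal similarity values."""
    if sim_forced_active > sim_forced_inactive:
        return "active"
    if sim_forced_active < sim_forced_inactive:
        return "inactive"
    return "undetermined"


# Hypothetical (forced active, forced inactive) similarity pairs per reaction.
similarities = {"R1": (12, 10), "R2": (9, 11), "R3": (10, 10)}
status = {r: imat_status(a, b) for r, (a, b) in similarities.items()}
print(status)
```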
INIT was optimized to integrate evidence from the Human Protein Atlas, although expression data are integrated when proteomic evidence is missing. In contrast to iMAT, INIT does not group reactions into categories. Instead, it uses experimental data to weight the binary variable of the corresponding reaction, whereby the weight is a function of experimental data (e.g., gene expression profiles) or a set of arbitrary numbers that quantify the color code of the entries of the Human Protein Atlas. In addition, INIT imposes a positive net production of metabolites for which there is experimental support in that context or tissue. Hence, when a metabolite is experimentally determined to be present, its net production is forced to comply with a given lower bound. As a result, INIT allows the integration of metabolomics data in a qualitative way. This method has been applied to generate a human metabolic reaction database (“Human Metabolic Atlas
tINIT, an extension of INIT, has been recently proposed (Agren et al.,
The main advantage of this family of methods is its independence of an RMF; therefore, these methods are convenient for extracting context-specific models when no specific RMF is known to dominate the context, which is often the case for tissue-specific models of multicellular organisms. However, MILP problems are computationally more challenging than LP problems and may, in general, require longer computation times. This is particularly the case for iMAT, in which two MILPs have to be solved per reaction in the modified FVA. iMAT can be easily implemented using the c
The MBA-like family is composed of MBA (Jerby et al.,
function
  R_{P} ← C
  N_{C} ← R_{G}\C
  blockedReactions ←
  if blockedReactions = ∅
    return R_{P}
  end if
  while blockedReactions ≠ ∅
    R_{P} ← R_{P} ∪ R_{Nc}
    N_{C} ← N_{C}\R_{Nc}
    blockedReactions ←
  end while
  return R_{P}
end function
function
  R_{P} ← R_{G}
  N_{C} ← R_{G}\(C_{H} ∪ C_{M})
  choose random permutation, P, from N_{C}
  for each reaction r ∈ P
    R_{P} ← R_{P}\r
    blockedReactions ←
    e_{H} ← blockedReactions ∩ C_{H}
    e_{M} ← blockedReactions ∩ C_{M}
    e_{Nc} ← blockedReactions\(C_{H} ∪ C_{M})
    if (e_{H} = 0) AND (e_{M} < k · e_{Nc})
      R_{P} ← R_{P}\(e_{M} ∪ e_{Nc})
    end if
  end for
end function
function
  R_{P} ← ∅
  J ← C
  P ← R_{G}\C
  while J ≠ ∅
    R_{P} ← R_{P} ∪
    J ← J\R_{P}
    P ← P\R_{P}
  end while
end function
function
  R_{P} ← R_{G}
  N_{C} ← R_{G}\C
  for each reaction r ∈ N_{C}
    R_{P} ← R_{P}\r
    blockedReactions ←
    e_{C} ← blockedReactions ∩ C
    e_{Met} ← blockedReactions ∩ KeyMetProd
    e_{Nc} ← blockedReactions ∩ N_{C}
    if r ∉ Negative
      if (e_{C} = 0) AND (e_{Met} = 0)
        R_{P} ← R_{P}\(r ∪ e_{Nc})
      end if
    else if r ∈ Negative
      if (e_{Met} = 0) AND (e_{C} < k · e_{Nc})
        R_{P} ← R_{P}\(r ∪ e_{Nc} ∪ e_{C})
      end if
    end if
  end for
end function
function
  max_{v} ∑_{i ∈ R_{P}} v_{i}
    subject to:
      Sv = 0
      v_{min} ≤ v ≤ v_{max}
  R_{P} ← R_{P}\{i ∈ R_{P}: v_{i} ≥ ε}
  min_{v} ∑_{i ∈ R_{P} ∩ Rev} v_{i}
    subject to:
      Sv = 0
      v_{min} ≤ v ≤ v_{max}
  R_{P} ← R_{P}\{i ∈ R_{P}: v_{i} ≥ ε}
  if {i ∈ R_{P}: v_{i} ≥ ε} = ∅
    select random reaction, i, and solve FVA
    R_{P} ← R_{P}\{i : v_{i} ≥ ε}
  end if
end function
function
  max_{v,z} ∑_{i ∈ J} z_{i}
    subject to:
      z_{i} ∈ [0, ε], ∀i ∈ J, z_{i} ∈ ℝ_{+}
      v_{i} ≥ z_{i}, ∀i ∈ J
      Sv = 0
      v_{min} ≤ v ≤ v_{max}
  K ← {i ∈ J: v_{i} ≥ ε}
  min_{v,z} ∑_{i ∈ P} z_{i}
    subject to:
      v_{i} ∈ [−z_{i}, z_{i}], ∀i ∈ P, z_{i} ∈ ℝ_{+}
      v_{i} ≥ ε, ∀i ∈ K
      Sv = 0
      v_{min} ≤ v ≤ v_{max}
end function
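The pruning strategy shared by the elimination-based boxes above (remove a non-core reaction, check consistency, and keep the removal only if no core reaction becomes blocked) can be sketched on a toy network. Several simplifications are assumed here: the network and core set are hypothetical, reactions are visited in a fixed rather than random order, and a purely topological dead-end check replaces the LP/FVA-based consistency test used by the actual methods.

```python
# Hypothetical toy network: reaction -> (substrates, products).
NETWORK = {
    "EX_A": (set(), {"A"}),
    "R1": ({"A"}, {"B"}),
    "R2": ({"B"}, {"C"}),
    "R3": ({"A"}, {"D"}),
    "R4": ({"D"}, {"C"}),
    "R5": ({"A"}, {"C"}),
    "EX_C": ({"C"}, set()),
}
CORE = {"R1", "R2"}


def blocked_reactions(model):
    """Topological stand-in for a consistency check (not FVA).

    A reaction is treated as blocked when one of its substrates has no
    producer, or one of its products has no consumer, among the reactions
    that are still unblocked; the check is iterated to a fixed point.
    """
    active = set(model)
    changed = True
    while changed:
        changed = False
        produced = set().union(*(NETWORK[r][1] for r in active))
        consumed = set().union(*(NETWORK[r][0] for r in active))
        for r in sorted(active):
            substrates, products = NETWORK[r]
            if not substrates <= produced or not products <= consumed:
                active.discard(r)
                changed = True
    return set(model) - active


def prune(order):
    """Remove non-core reactions one by one while the core stays unblocked."""
    model = set(NETWORK)
    for r in order:
        if r in CORE or r not in model:
            continue
        trial = model - {r}
        blocked = blocked_reactions(trial)
        if not blocked & CORE:
            model = trial - blocked
    return model


model = prune(sorted(set(NETWORK) - CORE))
print(sorted(model))
```

In this example the exchange reactions are retained because removing either would block the core, whereas the alternative routes through R3, R4, and R5 are discarded.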
MBA divides the core set into two subcores: a set with high likelihood to be present in the context-specific model (
A prominent characteristic of mCADRE lies in ranking the reactions of the genome-scale reconstruction according to three scores: expression-based, connectivity-based, and confidence level-based. This ranking determines the core set of reactions as well as the order in which non-core reactions are eliminated. The core is determined by fixing a threshold value for the expression-based score; reactions whose values are above the threshold are included in the core, and the rest constitute the non-core reactions. Unlike in other methods, the expression-based score does not directly consider the levels of expression. Instead, it calculates the frequency of expressed states over a battery of transcript profiles in the same context and thus requires a previous binarization of the expression data. Reactions outside the core are then ranked according to the connectivity-based score, which assesses the connectedness of adjacent reactions, and the confidence level-based score, which accounts for the type of evidence supporting a reaction in the genome-scale reconstruction.
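The expression-based score and the resulting core definition can be illustrated as follows. This is a minimal sketch: the sample values, the binarization cutoff, and the core threshold are hypothetical, and expression values are assumed to have already been mapped from genes to reactions.

```python
def binarize(profile, cutoff):
    """Turn one expression profile into expressed (1) / not expressed (0) calls."""
    return {r: int(value > cutoff) for r, value in profile.items()}


def expression_score(binary_profiles):
    """Frequency of expressed states for each reaction across all profiles."""
    n_samples = len(binary_profiles)
    reactions = binary_profiles[0].keys()
    return {r: sum(p[r] for p in binary_profiles) / n_samples for r in reactions}


# Hypothetical reaction-level expression in four samples of the same context.
samples = [
    {"R1": 8.1, "R2": 1.0, "R3": 6.0},
    {"R1": 7.4, "R2": 2.2, "R3": 3.9},
    {"R1": 9.0, "R2": 5.5, "R3": 7.1},
    {"R1": 6.3, "R2": 0.8, "R3": 5.2},
]

scores = expression_score([binarize(s, cutoff=5.0) for s in samples])
core = {r for r, score in scores.items() if score > 0.9}
non_core = set(scores) - core
print(scores, sorted(core), sorted(non_core))
```

Note that R3 is expressed in three of four samples but still falls outside the core under the assumed threshold of 0.9, which shows how the threshold choice controls the core size.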
Non-core reactions are in turn sequentially removed according to the previous ranking, and consistency is evaluated. Here, mCADRE presents two other innovations: it defines a set of key metabolites, with positive evidence of appearing in the context-specific model reconstruction, and relaxes the stringent condition of including all core reactions in the final model. More specifically, a reaction can only be eliminated if it does not prevent the production of a key metabolite and if it is unnecessary to ensure core consistency. However, if evidence exists for the respective transcript to be unexpressed in any of the context-specific samples, mCADRE allows the elimination of the reaction even if it blocks some of the core reactions. To this end, two conditions have to be satisfied: (1) production of key metabolites is not impaired and (2) the relation between the number of blocked core and non-core reactions matches a predefined ratio. To check model consistency, mCADRE maintains the procedure proposed in MBA, although adapted to use FastFVA (Gudmundsson and Thiele,
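The removal rule described in this paragraph can be expressed as a small decision function. The sketch below assumes that the set of reactions blocked by removing a candidate has already been computed by a consistency check; the function name, the reaction identifiers, and the value of `k` are hypothetical.

```python
def mcadre_removal(r, blocked, core, key_met_prod, negative, k):
    """Return the set of reactions to delete when testing candidate r.

    blocked: reactions that become blocked once r is removed.
    An empty set means that r has to be kept.
    """
    e_core = blocked & core
    e_met = blocked & key_met_prod
    e_non_core = blocked - core
    if e_met:
        # Production of a key metabolite would be impaired.
        return set()
    if r not in negative:
        # Regular case: no core reaction may become blocked.
        return {r} | e_non_core if not e_core else set()
    # Negative evidence: tolerate blocked core reactions up to the ratio k.
    if len(e_core) < k * len(e_non_core):
        return {r} | e_non_core | e_core
    return set()


core = {"R1", "R2"}
key_met_prod = {"R3"}
negative = {"R9"}

print(mcadre_removal("R7", {"R8"}, core, key_met_prod, negative, k=0.5))
print(mcadre_removal("R7", {"R1"}, core, key_met_prod, negative, k=0.5))
print(mcadre_removal("R9", {"R1", "R8", "R10", "R11"}, core, key_met_prod,
                     negative, k=0.5))
```

The third call shows the relaxation: because R9 holds negative evidence and only one core reaction is blocked against three non-core reactions, the candidate is removed together with everything it blocks.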
While FastCORE also aims at obtaining a minimal consistent model containing all core reactions, typical for this family of methods, it differs principally from MBA and mCADRE in the algorithmic strategy. Instead of eliminating one non-core reaction followed by consistency evaluation at each step, FastCORE solves two LPs: the first LP maximizes the cardinality of the core set of reactions, computed as the number of reaction values above a small positive constant. On the other hand, the second LP minimizes the cardinality outside the core set by minimizing the
One of the main advantages of this family over other methods is the possibility to integrate multiple data sets of different nature together with well-curated biochemical knowledge. Defining a core set of reactions from such a diverse collection of experimental evidence may increase the confidence for a particular set of reactions to appear in a certain context (e.g., tissue), as missing information in one data set can be complemented by another. Moreover, imposing the inclusion of the whole core set can be highly advantageous, as reactions with overwhelming evidence would always be included in the context-specific model. In addition, like the iMAT-like family, MBA-like methods are independent of an RMF and, hence, appropriate if no RMF is known to operate in a given context. Nevertheless, we would like to emphasize that MBA-like methods provide only a context-specific model reconstruction, in contrast to the iMAT-like methods, which generate both a context-specific reconstruction and a flux distribution.
MBA-like methods follow two ways to define the core set of reactions. MBA takes into account well-curated biochemical knowledge and a variety of experimental data (e.g., transcript, protein, metabolite, and/or metabolic flux profiles). While this approach to defining the core set of reactions may be more accurate, it is also time-consuming due to its manual nature. On the other hand, the definition of the core set in mCADRE allows for full automation, since it relies only on determining a threshold for expression-based evidence.
In terms of computation time, FastCORE outperforms the contending alternatives. Therefore, it has advantages over other methods when computing time is the limiting resource, provided that a properly defined core set is given (note that FastCORE does not provide an operational definition of a core set). The good time-related performance of FastCORE is due to two main innovations: First, the maximization of the cardinality represents a softer objective than the maximization of the total sum of flux values (used in MBA), since fluxes are only required to be above a small positive value. Consequently, solving this optimization problem usually results in more active reactions per iteration than the MBA counterpart. Second, the computation of the
Here we presented a classification of the existing approaches for extracting context-specific metabolic models. We classified the methods into three families according to their mathematical formulation. Furthermore, we also proposed a mathematical generalization for each family, which summarizes the fundamental principles shared by its members.
Altogether, the classification and generalization constitute a mathematical framework that aims to fulfill three main purposes. First, it provides a better understanding of the rationale behind the methods, allowing an easy inspection of their main characteristics as well as highlighting their advantages and shortcomings. Second, such structured knowledge may facilitate the envisioning of novel approaches to extract context-specific models. Third, it may help users choose the best-suited method for their particular problem, since the classification outlines the differences in the data and knowledge that the particular methods require as input.
The flowchart on Figure
GIMME  LP  COBRA (Matlab)  Transcripts  Required  Yes
GIM^{3}E  MILP  COBRA (Python)  Transcripts, metabolites  Required  Yes
iMAT  Data discretization  ⟳ MILP  COBRA (Matlab)  Transcripts, proteins  Not required  Yes
INIT/tINIT  Data discretization  MILP  RAVEN (Matlab)  Transcripts, proteins, metabolites  Optional  Yes
MBA  Data discretization  ⟳ LP    Curated biochemical knowledge, transcripts, proteins, metabolites, fluxes  Not required  No
mCADRE  Data discretization  ⟳ LP  Matlab  Transcripts, metabolites  Not required  No
FastCORE  ϵ,  ⟳ LP  COBRA (Matlab)    Not required  No
Without information about the operability of a particular RMF in a given context, the iMAT-like family may provide the method of choice. To select between iMAT and INIT, one could take into account the flexibility in integrating different types of experimental data, since iMAT was developed to integrate transcript profiles, whereas INIT can integrate semi-quantitative proteomic data, transcript profiles, and metabolic evidence. In addition, one could consider the ability of the method to discriminate between multiple optima with the same similarity score, together with the computational cost of performing this task.
In contrast, if only a context-specific model extraction is required, one may opt for any of the presented methods. However, the methods in the MBA-like family have some advantageous properties, namely, the integration of a variety of experimental data sources and the inclusion of reactions for which there is strong experimental evidence in the context-specific reconstruction. One may then choose based on the core set definition of each method as well as on the total computational time required. The MBA-like family proposes two ways to define the core: the MBA semi-automated procedure, whereby reactions are included in the core set if there is sufficient positive evidence across different databases, and the mCADRE automated procedure, whereby reactions are included if the expression value of the respective transcript is larger than a given threshold. Thus, if an appropriate number of databases contain experiments about the context of interest and the computation time is not a primary limitation, the MBA core definition may be a suitable alternative. As noted above, this procedure can cross-validate the confidence that a reaction belongs to a certain context, owing to the simultaneous use of several databases. Subsequently, one can readily employ MBA to extract the context-specific model, or opt for FastCORE, which can perform the extraction more efficiently using the previously defined core. On the other hand, mCADRE could be preferentially applied when an automated core definition is preferred. Moreover, the mCADRE relaxation of whole core inclusion can improve accuracy when a core reaction diminishes the overall coherence with respect to the data through the inclusion of non-core reactions with negative evidence to ensure consistency. Finally, one can also apply FastCORE to a core set defined in an automated way to benefit from its rapid computation, although this neglects the characteristic core relaxation and ranking of non-core reactions of mCADRE.
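The selection criteria discussed above can be condensed into a small helper. This is only one possible reading of the guidelines, not the published flowchart: the function name and its boolean arguments are hypothetical, and the many practical considerations mentioned in the text (data types available, number of databases, and so on) are reduced to four flags.

```python
def suggest_methods(rmf_known, need_flux_distribution,
                    automated_core=False, time_limited=False):
    """Suggest candidate methods from a simplified reading of the guidelines."""
    if need_flux_distribution:
        # Families that return a flux distribution along with the model.
        if rmf_known:
            return ["GIMME", "GIM3E"]
        return ["iMAT", "INIT/tINIT"]
    # Only a context-specific reconstruction is needed: MBA-like family.
    if time_limited:
        # FastCORE needs a core set defined beforehand (manually or automatically).
        return ["FastCORE"]
    return ["mCADRE"] if automated_core else ["MBA"]


print(suggest_methods(rmf_known=True, need_flux_distribution=True))
print(suggest_methods(rmf_known=False, need_flux_distribution=True))
print(suggest_methods(rmf_known=False, need_flux_distribution=False,
                      automated_core=True))
print(suggest_methods(rmf_known=False, need_flux_distribution=False,
                      time_limited=True))
```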
Development of new approaches for extraction of context-specific metabolic models can further expand on the advantages of the existing methods, while facilitating efficient computation and addressing their shortcomings. This will allow rapid devising of context-specific models and their interconnection in larger multilevel models, typical for complex eukaryotes, to allow for more realistic simulation scenarios.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The authors wish to acknowledge the Max Planck Society and the International Max Planck Research School in Primary Metabolism and Plant Growth (IMPRS-PMPG) for funding support.