
Edited by: Mariza De Andrade, Mayo Clinic, USA

Reviewed by: Paola Sebastiani, Boston University, USA; Ricardo De Matos Simoes, Dana Farber Cancer Institute, USA

*Correspondence: Stéphane Guerrier

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

†The first two authors are Joint First Authors.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Gene selection has become a common task in most gene expression studies. The objective of such research is often to identify the smallest possible set of genes that can still achieve good predictive performance. To do so, many of the recently proposed classification methods require some form of dimension reduction of the problem, ultimately provide a single model as output and, in most cases, rely on the likelihood function to achieve variable selection. We propose a new prediction-based objective function that can be tailored to the requirements of practitioners and can be used to assess and interpret a given problem. Based on cross-validation techniques and the idea of importance sampling, our proposal scans low-dimensional models under the assumption of sparsity and, for each of them, estimates the objective function to assess their predictive power and guide selection. Two applications on cancer data sets and a simulation study show that the proposal compares favorably with competing alternatives such as, for example, the Elastic Net and Support Vector Machines. Indeed, the proposed method not only selects smaller models with better, or at least comparable, classification errors, but also provides a set of selected models instead of a single one, allowing the construction of a network of possible models for a target prediction accuracy level.

Gene selection has become a common task in most gene expression studies. The problem of assigning tumors to a known class is an example of particular importance that has received considerable attention in the last 10 years. Conventional class prediction methods for leukemia and other cancers are generally based on the microscopic examination of stained tissue specimens. However, such methods require highly trained specialists and are subjective (Tibshirani et al.,

To avoid these drawbacks, many automatic selection methods have been proposed recently. The goal of these methods is often to identify the smallest possible set of genes that can still achieve good predictive performance (Díaz-Uriarte and De Andres,

Nonetheless, many of these methods do not necessarily respond to the needs of practitioners and researchers when they approach the gene selection process. First of all, many of them have to rely on some form of size reduction and often require a subjective input to determine the dimension of the problem. Moreover, many of these methods provide a single model as output, whereas genes interact inside biological systems and can be interchangeable in explaining a specific response. The idea of interchangeability of genes in explaining responses appears for instance in Kristensen et al. (

Another issue with most existing gene selection methods is their reliance on the likelihood function, or a penalized version of it, as a means to develop a selection criterion. However, the likelihood function may not necessarily be the quantity that users are interested in, as they may want to target some other kind of loss function such as, for example, the classification error. Indeed, maximizing the likelihood function is typically not the same as minimizing a particular loss function. Moreover, adapting these methods to handle missing or contaminated data is not straightforward. This has limited the applicability and reliability of these methods in many practical cases.

To eliminate the limitations of the gene selection procedures described above, this paper proposes an objective function for out-of-sample predictions that can be tailored to the requirements of practitioners and researchers. This is achieved by enabling them to select a criterion according to which they would like to assess and/or interpret a given problem. However, the optimization of such a criterion function is typically not an easy task since the function can be discontinuous and non-convex, requiring computationally intensive techniques. To tackle this issue, we propose a solution using a different approach based on a procedure that resembles

The advantages of this proposal are multiple:

This last aspect is of great interest for gene selection since this list can provide insight into the complex mechanisms behind different biological phenomena. Different cases, some of which can be found in Section 4, indicate that this method appears to outperform other methods in terms of criterion minimization while, at the same time, selecting models of considerably smaller dimension, which allows improved interpretation of the results. The set of selected models can naturally be viewed as a network of possible structures of genetic information. We call this a paradigmatic network. In Section 4 we give an example of a graphical representation of such networks based on the analysis of one of the two cancer data sets discussed therein.

In this paper we first describe and formalize the proposed approach within the model selection statistical framework in Section 2. In Section 3 we illustrate the techniques and algorithms used to address the criterion minimization problem highlighted in Section 2. The performance of our approach is then illustrated on two data sets concerning leukemia classification (Golub et al.,

To introduce the proposed method, let us first define some notation which will be used throughout this paper:

Let 𝒮_{f} = {1, 2, …, p} be the index set of the p candidate covariates (genes).

Let 𝒥 = 𝒫(𝒮_{f})\∅, |𝒥| = 2^{p} − 1, be the power set including all possible models that can be constructed with the p covariates.

Let

Let ^{J} ∈ ℝ^{p} be the parameter vector for model

where _{k} are respectively the ^{J} and

Keeping this notation in mind, for a given model

where 𝔼[·] is the expectation operator and ^{J} ∈ ℝ^{p}. Models of the form (1) are very general and include all parametric models and a large class of semiparametric models when

We assume that for a fixed ^{J} and given a new covariate vector _{0}, the user can construct a prediction

With this property being respected, the divergence measure can be arbitrarily specified by the user according to the interest in the problem. Examples of such divergence measures include the L_{1} loss function

or an asymmetric classification error

where _{1}, _{2} ≥ 0. The latter is for a Bernoulli response and is typically an interesting divergence measure when asymmetric classification errors have to be considered. Indeed, in most clinical situations, the consequences of classification errors are not equivalent with respect to the direction of the misclassification. For instance, the prognosis and the treatment of Estrogen Receptor (ER) positive Breast Cancers (BC) are quite different from those of ER negative ones. Indeed, if a patient with ER negative is treated with therapies designed for patients with ER positive, the consequence is much more severe than if this were done the other way round because of the excessive toxicities and potentially severe side effects. It therefore makes sense to give different values to _{1} and _{2}. By defining _{1} > _{2} we would take these risks into account, where _{1} would be the weight for a misclassification from ER negative to ER positive BC and _{2} for the opposite direction. Weight values can be modulated according to the current medical knowledge and the clinical intuition of the physicians.
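To make the asymmetric divergence concrete, a minimal sketch is given below; the function name and the weight names `c1` and `c2` are illustrative labels for the two subscripted weights in the text, and the default values are arbitrary:

```python
def asymmetric_loss(y_true, y_pred, c1=2.0, c2=1.0):
    """Asymmetric classification error for a Bernoulli response.

    c1 weighs a misclassification of a class-0 case (e.g., ER negative)
    as class 1 (ER positive); c2 weighs the opposite direction.
    With c1 = c2 = 1 this reduces to the usual 0-1 loss.
    """
    if y_true == y_pred:
        return 0.0
    return c1 if (y_true == 0 and y_pred == 1) else c2
```

Setting `c1 > c2` encodes the clinical judgment that treating an ER negative patient with a therapy designed for ER positive patients is the more severe mistake.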

Considering this divergence measure

where 𝔼_{0} denotes the expectation with respect to the new observation (x_{0}, y_{0}). Let 𝒥_{0} denote the set of models with the smallest cardinality among all minimizers; note that 𝒥_{0} could contain more than one model. Let us define the models corresponding to 𝒥_{0} as the “true” models. Thus, our “true” models are essentially the most parsimonious models that minimize the expected prediction error.

The optimization problem in Equation (2) is typically very difficult to solve. First of all, supposing we do not consider interaction terms, the outer minimization would require comparing a total of 2^{p} − 1 results, each the output of the inner minimization problem. In addition, each of the 2^{p} − 1 inner minimization problems is also very hard to solve, even if the risk ^{J}. Indeed, the inner minimization problem is in general non-convex and could be combinatorial, implying that the minimizer might not be unique. For example, when ^{J} without explicit form and needs to be approximated.

We propose to estimate _{k, l} of size _{l} for

Having approximated the expectation 𝔼_{0}, the minimization problem in Equation (2) becomes

Despite the above approximation, the minimization problem remains complicated for the reasons mentioned earlier. Thus, we further eliminate the inner minimization problem in Equation (4) by inserting an estimator ^{J}, say _{k, l}. This estimator can be any available estimator, for example, the maximum likelihood estimator (MLE), a moment based estimator, or a quantile regression based estimator, etc. (see for example, Azzalini,

The intuition of replacing the inner minimization in Equation (4) with a sample average evaluated at an arbitrary estimator is due to the fact that this estimator, under a fixed “true” model and regardless of whether this estimator is a standard MLE or a minimizer of the divergence measure _{l} → ∞, then we have that

We now have, in Equation (5), an optimization problem which requires a comparison of 2^{p} − 1 values and is much easier to solve. To further reduce the number of comparisons, the following section describes some procedures and algorithms allowing this problem to be solved in a more efficient manner.
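The plug-in approximation described above — estimate the parameters of a candidate model on part of the data and average the user-chosen divergence over held-out observations — can be sketched as follows. The function names, the K-fold scheme and the stdlib-only implementation are illustrative assumptions, not the paper's exact procedure:

```python
import random

def cv_error(X, y, J, fit, predict, loss, K=5, seed=0):
    """K-fold cross-validated estimate of the expected prediction error
    of model J (a tuple of column indices into the rows of X).

    `fit` maps a training design matrix and responses to a fitted
    parameter object; `predict` maps (theta, covariate row) to a
    prediction; `loss` is the user-chosen divergence measure.
    """
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]
    total, count = 0.0, 0
    for k in range(K):
        held_out = set(folds[k])
        train = [i for i in idx if i not in held_out]
        # Plug in any available estimator fitted on the training part.
        theta = fit([[X[i][j] for j in J] for i in train],
                    [y[i] for i in train])
        # Average the divergence over the held-out observations.
        for i in folds[k]:
            total += loss(y[i], predict(theta, [X[i][j] for j in J]))
            count += 1
    return total / count
```

Any estimator can be plugged in for `fit` (MLE, moment-based, quantile-based), which is exactly what makes the replacement of the inner minimization attractive.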

To solve the optimization problem in Equation (5), we propose an approach designed to have the following three features:

Identify a

Find this set of models within a

This set achieves

Note that the last feature above reflects the belief that most of the covariates are irrelevant for the problem under consideration and should be excluded. Indeed, our method is designed to work effectively if such a sparsity assumption holds, putting it on the same level as almost all variable selection procedures in the literature. Moreover, we require the method to have the first feature in order to increase flexibility in terms of interpretation. Indeed, in many domains, such as gene selection, the aim may not be to find a single model but a set of variables (genes) that can be inserted in a paradigmatic structure to better understand the contribution of each of them via their interactions.

Given this goal, assume that we have at our disposal an estimate of the measure of interest for each of the 2^{p} − 1 models. In this case, our interest would be to select a set of “best” models by simply keeping those with a low discrepancy measure. However, 2^{p} − 1 is an extremely large number and the probability of randomly sampling a “good” subset from the p available variables is very small. Using the sparsity property of the problem, we propose to start with the set of variables _{0} (typically an empty set) and increase the model complexity stepwise. Throughout this procedure, we ensure that at step _{max} ≪

More formally, let us first define the set of all possible models of size

We then define the set of promising models,

where

and

whose complement we define as

With this approach in mind and using the above notations, to start the procedure we assume that we have

_{0} with the goal of finally obtaining the set

Construct the _{0} with each of the

Compute

From Steps A.1 and A.2, construct the set

_{max}.

Augment _{0} with

Randomly select a set, either set

Select one variable uniformly at random and without replacement from the set chosen in Step (i) and add this variable to _{0}.

Repeat Steps (i) and (ii) until _{0}.

Construct a model of dimension

From Steps B.1 and B.2, construct the set _{max}, go to Step B and let

Once the algorithm is implemented, the user obtains an out-of-sample discrepancy measure for all evaluated models. Given that the goal is to obtain a set of models
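Under the assumption of sparsity, the stepwise search of Steps A and B above can be sketched as follows. This is a simplified, hypothetical rendering: the retention rule, the 50/50 mixing between “promising” and uniformly drawn variables, and all parameter names are illustrative choices rather than the exact procedure of the paper:

```python
import random

def panning(p, d_max, m, score, alpha=0.2, seed=0):
    """Simplified sketch of the stepwise stochastic model search.

    p      : number of candidate variables
    d_max  : maximum model dimension explored
    m      : candidate models sampled per dimension (must not exceed
             the number of distinct size-d subsets, or sampling stalls)
    score  : maps a frozenset of variable indices to an estimated
             out-of-sample discrepancy (e.g., a CV error)
    alpha  : fraction of the best-scoring models retained at each step

    Returns {d: list of (model, score) pairs retained at dimension d}.
    """
    rng = random.Random(seed)
    # Step A: score every one-variable model, keep the best fraction.
    scored = sorted(((frozenset([j]), score(frozenset([j])))
                     for j in range(p)), key=lambda t: t[1])
    retained = {1: scored[:max(1, int(alpha * p))]}
    promising = [j for model, _ in retained[1] for j in model]
    # Step B: grow dimension, oversampling previously promising variables
    # (the importance-sampling idea) while mixing in uniform draws so no
    # variable is excluded forever.
    for d in range(2, d_max + 1):
        cands = set()
        while len(cands) < m:
            model = set()
            while len(model) < d:
                if rng.random() < 0.5:
                    model.add(rng.choice(promising))
                else:
                    model.add(rng.randrange(p))
            cands.add(frozenset(model))
        scored = sorted(((c, score(c)) for c in cands),
                        key=lambda t: t[1])
        retained[d] = scored[:max(1, int(alpha * m))]
        promising = [j for model, _ in retained[d] for j in model]
    return retained
```

The output is precisely the by-dimension collection of low-discrepancy models from which the final set and the paradigmatic network are then built.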

The algorithm described above lays out the basic procedure to solve the problem in Equation (2). However, as with many other heuristic selection procedures, there is a series of “hyper-parameters” to be determined and certain aspects to be considered. In the following paragraphs we discuss some of the issues arising when implementing our algorithm in practice.

The parameters _{max}, _{max} represents a reasonable upper bound for the model dimension which is constrained to _{max} ≤ _{max} ≪

As a final note, it is also possible for the initial model _{0} to already contain a set of _{0} covariates which the user considers essential for the final output. In this case, the procedure described above would remain exactly the same, since it would simply select from the _{0} +

The final goal of the algorithm is to find a subset of models of dimension ^{*} that in some way minimize the considered discrepancy. A possible solution would be to select the set of models _{d}(α) is unknown and replaced by its estimator ^{*} taking into account the variability of ^{*} such that we cannot reject the hypothesis that _{max}. As long as the difference is significant we increment ^{*} =

The type of test and its corresponding rejection level are determined by the user based on the nature of the divergence measure. For example, if we take the L_{1} loss function as divergence, one could opt for the Mann-Whitney test or, if the loss function is a classification error (as in the applications in Section 4), one could choose the binomial test or other tests for proportions. The rejection level will depend, among other things, on the number of tests that need to be run, typically fewer than _{max} − 1, and needs to be adjusted using, for example, the Bonferroni correction. Finally, once the set ^{*}.
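As a hedged illustration of this testing step for a classification-error divergence, the sketch below uses a one-sided exact binomial test with a Bonferroni adjustment; the function names and the exact decision rule are illustrative assumptions:

```python
from math import comb

def binom_pvalue(k, n, p0):
    """One-sided exact binomial p-value: P(X >= k) for X ~ Bin(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i)
               for i in range(k, n + 1))

def select_dimension(errors, n, level=0.05):
    """Pick the smallest dimension whose misclassification count is not
    significantly larger than the best one observed.

    errors : dict {dimension d: errors out of n held-out predictions}
    The significance level is Bonferroni-adjusted for the number of
    tests performed (at most one per dimension below the best one).
    """
    d_best = min(errors, key=errors.get)
    p0 = max(errors[d_best] / n, 1e-12)      # best observed error rate
    smaller = [d for d in sorted(errors) if d < d_best]
    adjusted = level / max(1, len(smaller))
    for d in smaller:
        if binom_pvalue(errors[d], n, p0) > adjusted:
            return d    # cannot reject: error comparable to the best
    return d_best
```

For example, with 34 held-out observations, a dimension-1 model making 3 errors is not significantly worse than a dimension-2 model making 2 errors, so the smaller dimension would be kept.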

Some of the ideas put forth in this work have also been considered in the literature. An extensive survey of the related works goes beyond the scope of this paper. Here we briefly describe some of the connections to three main ideas that have been explored to this point.

The first one is recognizing that practitioners might aim to minimize some criterion that differs from likelihood-type losses. An interesting paper illustrating this point is Juang et al. (

Secondly, there is a large literature that uses stochastic search procedures to explore the space of candidate models. Influential work in this direction includes George and McCulloch (

Finally, other authors have also considered providing a set of interesting models as opposed to a single “best” model. The stochastic search procedures mentioned in the above paragraph can naturally be used to obtain a group of interesting models. For example, Cantoni et al. (

In this section we provide an example of how the methodology proposed in this paper selects and groups genes to explain, describe and predict specific outcomes. We focus on the data-set (hereinafter

The analysis of these data-sets focuses both on the advantages of the proposed methodology and the biological interpretation of the outcomes. One of the goals of our method is to help decipher the complexity of biological systems. We will take on an overly simplified view of the cellular processes in which we will assume that one biomarker maps to only one gene that in turn has only one function. Although this assumption is not realistic, it allows us to give a straightforward interpretation of the selected models or “networks” which can therefore provide an approximate first insight into the relationships between variables and biomarkers (as well as between the biomarkers themselves). We clarify that we do not claim any causal nature in the conclusions we present in these analyses but we believe that the selected covariates can eventually be strongly linked to other covariates that may have a more obvious and direct interpretation for the problem at hand. Finally, the data-set has binary outcomes [as does the data-set in Appendix

Golub et al. (

In order to understand how our proposed methodology performs compared to existing ones, we split the

In Figure ^{1}

Method | Training error | Test error | Number of genes
Golub | 3/38 | 4/34 | 50
Support vector machine (with recursive feature elimination) | 2/38 | 1/34 | 31
Penalized logistic regression (with recursive feature elimination) | 2/38 | 1/34 | 26
Nearest shrunken centroids | 2/38 | 2/34 | 21
Elastic net | 3/38 | 0/34 | 45
Panning Algorithm: | | |
Model a | 0/38 | 2/34 | 2
Model b | 0/38 | 2/34 | 2
Model c | 0/38 | 2/34 | 2
[…] | | |
Model averaging | | 2/34 | 2

Once this procedure is completed, we can create a gene network to facilitate interpretation. This is a direct benefit of our method which does not deliver a single model after the selection process but provides a series of models that can be linked to each other and interpreted jointly. Indeed, the existence of a single model that links the covariates to the explained variable is probably not realistic in many settings, especially for gene classification. For this reason, the frequency with which each gene is included within the selected models and with which these genes are coupled with other genes provides the building block to create an easy-to-interpret gene network with powerful explanatory and predictive capacities. A graphical representation of this gene network can be found in Figure
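The frequency bookkeeping that underlies such a network can be sketched as a simple co-occurrence count over the set of selected models; the gene symbols in the test example (CST3 for Cystatin C, ZYX for Zyxin, CFD for Complement factor D) are only illustrative labels:

```python
from collections import Counter
from itertools import combinations

def build_network(models):
    """Turn a collection of selected models (each a set of gene names)
    into node and edge weights for a paradigmatic gene network.

    Node weight = number of selected models containing the gene;
    edge weight = number of selected models containing both genes.
    """
    nodes, edges = Counter(), Counter()
    for model in models:
        nodes.update(model)
        edges.update(frozenset(pair)
                     for pair in combinations(sorted(model), 2))
    return nodes, edges
```

Frequently recurring, highly connected genes — the hubs — then stand out directly from the node weights.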

The three hubs that were identified are the following:

Cystatin C: a secreted cysteine protease inhibitor abundantly expressed in body fluids (see Xu et al.,

Zyxin: a zinc-binding phosphoprotein that concentrates at focal adhesions and along the actin cytoskeleton;

Complement factor D: a rate-limiting enzyme in the alternative pathway of complement activation (see White et al.,

In the current state of knowledge about acute leukemia, these three hubs appear to make sense from a biological viewpoint. Cystatin C is directly linked to many pathologic processes through various mechanisms and recent studies indicate that the roles of Cystatin C in neuronal cell apoptosis induction include decreasing B-cell leukemia-2 (BCL-2) whose deregulation is known to be implicated in resistant AML (see Sakamoto et al.,

The interpretation of the network can be carried out through plots or tables such as those presented in Appendix

In this section we present a simulation study whose goal is to highlight the practical benefits of the proposed method over competing methods frequently used in genomics. Considering the complexity of simulating from a gene network, in this setting we limit ourselves to considering the existence of a unique true model, which therefore does not allow us to assess one feature of the proposed approach, namely its network-building capacities. Hence, this section specifically focuses on the prediction power and dimension-reduction ability of the method and, for the comparison with alternative methods to be fair, we only keep one model for each simulation replicate. This means that, once the dimension of the model has been identified, the model with the lowest estimated prediction error is kept (thereby discarding the other potential candidates).

With this in mind, for the simulation study we mimicked the acute leukemia data-set seen in Section 4.1, where we set the true model to be generated by a combination of two gene expressions: Cystatin C (_{1}) and Thymine-DNA Glycosylase (_{2}) (see Section 4.1.2). Hence the response ^{⋆} in the simulations is a realization of a Bernoulli random variable with probability parameter γ, which is obtained through a logit-link function applied to a linear combination of the two above-mentioned variables plus an intercept (with all β coefficients equal to one), i.e.,:

Once the binary response variable ^{⋆} is generated, this is then separated into a training and a test set of the same size as that in the original data-set (i.e., 38 and 34 respectively).
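A minimal sketch of this generating mechanism is given below. The Gaussian draws standing in for the two gene-expression covariates are an assumption for illustration (the study uses the observed expressions of Cystatin C and Thymine-DNA Glycosylase), as are the function and variable names:

```python
import math
import random

def simulate(n, beta0=1.0, beta1=1.0, beta2=1.0, seed=0):
    """Draw n observations of a binary response through a logit link:
    gamma = logit^{-1}(beta0 + beta1*x1 + beta2*x2), y ~ Bernoulli(gamma).

    All beta coefficients default to one, as in the simulation design.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Stand-ins for the two gene expressions (illustrative only).
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        gamma = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x1 + beta2 * x2)))
        y = 1 if rng.random() < gamma else 0
        data.append((x1, x2, y))
    return data

# Training and test sets of the same sizes as the original data-set.
train, test = simulate(38, seed=0), simulate(34, seed=1)
```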

Using the implementation of the proposed algorithm available at the corresponding GitHub repository^{2}, we ran the testing procedure described in Section 3.1.2 based on a ^{*} instead of a set of models. This model was chosen such that it had the minimum training error and, if this minimum was not unique, the model was randomly chosen among those achieving it.

Panning algorithm | | |
Elastic net | | |
Support vector machine | | |
Penalized logistic regression | | |
Logistic regression | | |
Nearest shrunken centroids | | |

Concerning the competing methods, these were implemented using existing R functions with default values. For the Elastic Net we used the R package “^{3}

Table

This paper has proposed a new model selection method with various advantages compared to existing approaches. Firstly, it allows the user to specify the criterion according to which they would like to assess the predictive quality of a model. In this setting, it gives an estimate of the dimension of the problem, allowing the user to understand how many gene expressions are needed in a model to describe and predict the response of interest well. Building on this, it provides a paradigmatic structure of the selected models where the selected covariates are considered as elements in an interconnected biological network. Finally, the approach can handle more variables than observations without resorting to dimension-reduction techniques such as pre-screening or penalization.

The problem definition of this method and the algorithmic structure used to solve it deliver further advantages such as the ability to cope with noisy inputs, missing data, multicollinearity and the capacity to deal with outliers within the response and the explanatory variables (robustness).

Some issues which must be taken into account concerning the proposed method are (i) its computational demand and (ii) its need for external validation. As far as the first aspect goes, the computational demand can indeed be considered negligible compared to the time often required to collect the data to be analysed, and it can be greatly reduced according to the needs and requirements of the user. Concerning the second aspect, external validation is a crucial point which is often overlooked and is required for any model selection procedure. In this sense, the proposed method does not differ from any other existing approach in terms of additional requirements.

Having proposed a method with considerable advantages for gene selection using statistical ideas in model selection and machine learning, future research aims at studying the statistical properties of this approach to understand its asymptotic behavior and develop the related inference tools.

SG: Proposed the algorithm for model selection purposes and supervised the writing of the paper. NM: Adapted the algorithm for gene selection and gave biological interpretation of the algorithm outputs. RM: Programmed the basic functions for the algorithm and wrote different sections of the paper. SO: Developed the software for the algorithm to be distributed and ran the case studies with relative outputs (tables and plots). MA: Supervised the mathematical contents of the paper and provided the context for the methodology and its links with current literature. YM: Provided inputs for the formal presentation of the algorithm and guaranteed an overall quality-check.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We are very thankful to John Ramey (

The Supplementary Material for this article can be found online at:

^{1}The use of the software making available the competing methods is described in Section 5.

^{2}

^{3}Note that the special cases α = 0 and α = 1 correspond respectively to ridge regression and lasso.