Edited by: Zhongyu Wei, Fudan University, China

Reviewed by: Zhi-Ping Liu, Shandong University, China; Qin Ma, The Ohio State University, United States

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Detecting gene sets that serve as biomarkers for differentiating patient survival groups may help diagnose diseases robustly and develop multi-gene targeted therapies. However, because the search space grows exponentially with the number of gene combinations, the performance of existing methods is still far from satisfactory. In this study, we developed a new method called BISG (BIclustering based Survival-related Gene sets detection) based on a rectified factor network (RFN) model, which allows efficient biclustering of gene subsets. By correlating genes in each significant bicluster with patient survival outcomes using a log-rank test and a multi-sampling strategy, multiple survival-related gene sets can be detected. We first applied BISG to three different cancer types, and the resulting gene sets were tested as biomarkers for survival analyses. We then systematically analyzed 12 different cancer datasets. Our analysis shows that the genes in all the survival-related gene sets are mainly from five gene families: microRNA protein coding host genes, zinc fingers C2H2-type, solute carriers, CD (cluster of differentiation) molecules, and ankyrin repeat domain containing genes. Moreover, we found that they are mainly enriched in heme metabolism, apoptosis, hypoxia and inflammatory response-related pathways. We compared BISG with two other methods, GSAS and IPSOV. The results show that BISG can better differentiate patient survival groups in different datasets. The biomarkers identified in our study provide useful hypotheses for further investigation. BISG is open source and publicly available at

Identifying biomarker genes for survival risk prediction allows earlier detection of mortality risk and design of individualized therapy (Wang and Liu,

In gene expression experiments, functionally related genes often exhibit a similar pattern in only a subset of samples or under specific experimental conditions (Padilha and Campello,

In this study, we adapted RFN for biclustering analysis of integrated mutation and gene expression datasets from the same sets of samples, and developed a new method called BISG (BIclustering based

The overall design of BISG is shown in

Overview of BISG.

Cancer data used for training and validating biomarkers.

1 | Brain lower grade glioma | 2,511 | 3,141 | 282 |
2 | Colorectal adenocarcinoma | 10,680 | 23,982 | 222 |
3 | Glioblastoma | 4,148 | 5,974 | 130 |
4 | Head and neck squamous cell carcinoma | 11,767 | 27,742 | 500 |
5 | Kidney renal clear cell carcinoma | 6,572 | 9,923 | 435 |
6 | Lung adenocarcinoma | 8,180 | 16,625 | 221 |
7 | Ovarian serous cystadenocarcinoma | 3,641 | 4,573 | 183 |
8 | Pancreatic adenocarcinoma | 6,101 | 9,415 | 150 |
9 | Papillary thyroid carcinoma | 1,320 | 1,437 | 313 |
10 | Prostate adenocarcinoma | 7,673 | 12,658 | 496 |
11 | Thyroid carcinoma | 1,656 | 1,835 | 395 |
12 | Breast invasive carcinoma | 7,079 | 11,089 | 448 |

After the biomarkers were predicted, we utilized three microarray datasets GSE16011 (Gravendeel et al.,

Independent test datasets used for confirming predicted biomarkers and for comparison.

GSE3494 | Breast cancer | 4,883 | 236 |
GSE11969 | Lung adenocarcinoma | 5,273 | 149 |
GSE16011 | Gliomas | 2,061 | 264 |
GSE1456 | Breast cancer | 14,204 | 159 |
GSE32062 | Ovarian cancer | 19,592 | 260 |

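Given a normalized gene expression matrix $A$ with a set of rows $R = \{r_{1}, \ldots, r_{N}\}$, a set of columns $C = \{c_{1}, \ldots, c_{M}\}$, and elements $a_{ij} \in A$, a bicluster is a submatrix defined by a row subset $I = (i_{1}, \ldots, i_{n}) \subset R$ and a column subset $J = (j_{1}, \ldots, j_{m}) \subset C$. Biclustering identifies a set of biclusters $\{B_{1}, \ldots, B_{s}\}$ such that each bicluster $B_{k} = (I_{k}, J_{k})$ satisfies specific homogeneity criteria. The RFN model is a single or stacked factor analysis model as in Equation (1), which extracts the covariance structure of the data.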

where $V = (\mathbf{v}_{1}, \ldots, \mathbf{v}_{N})$ is the input data (visible units),

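Let $\mathrm{E}(\mathbf{v}\mathbf{v}^{T}) = \mathbf{W}\mathbf{W}^{T} + \Upsilon$. The marginal distribution for $\mathbf{v}$ is $\mathcal{N}(\mathbf{0}, \mathbf{W}\mathbf{W}^{T} + \Upsilon)$. The log-likelihood of the input data is given in Equation (2).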
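For orientation, the textbook log-likelihood of a zero-mean Gaussian with covariance $\mathbf{W}\mathbf{W}^{T} + \Upsilon$ is sketched below. The symbols $n$ (number of input vectors), $m$ (dimension of each vector), $\mathbf{v}_{i}$, and $\mathbf{W}$ are notational assumptions made here, so the expression may differ in notation from Equation (2).

```latex
\log p(\mathbf{v}_{1}, \ldots, \mathbf{v}_{n})
  = -\frac{nm}{2}\log(2\pi)
    - \frac{n}{2}\log\left|\mathbf{W}\mathbf{W}^{T} + \Upsilon\right|
    - \frac{1}{2}\sum_{i=1}^{n} \mathbf{v}_{i}^{T}
      \left(\mathbf{W}\mathbf{W}^{T} + \Upsilon\right)^{-1} \mathbf{v}_{i}
```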

For the mean-centered input vector $\mathbf{v}_{i}$, the posterior $p(\mathbf{h}_{i} \mid \mathbf{v}_{i})$ is Gaussian with the mean vector $(\mu_{p})_{i}$ and covariance matrix $\Sigma_{p}$ as in Equation (3):
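As a minimal sketch of how such a posterior can be computed, the snippet below uses the standard factor-analysis result for a model with a standard normal prior on the hidden units and diagonal Gaussian noise. The function name `fa_posterior`, the argument names, and the assumption that Υ is diagonal are illustrative choices made here and are not taken from the BISG implementation.

```python
import numpy as np

def fa_posterior(W, upsilon_diag, v):
    """Posterior mean and covariance of h given v for v = W h + eps,
    with h ~ N(0, I) and eps ~ N(0, diag(upsilon_diag)) (assumed model)."""
    k = W.shape[1]
    Wt_Uinv = W.T / upsilon_diag                       # W^T Upsilon^{-1}, shape (k, m)
    sigma_p = np.linalg.inv(np.eye(k) + Wt_Uinv @ W)   # posterior covariance
    mu_p = sigma_p @ Wt_Uinv @ v                       # posterior mean
    return mu_p, sigma_p

# Toy example with random parameters (hypothetical data).
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))
upsilon_diag = np.full(5, 0.5)
v = rng.normal(size=5)
mu_p, sigma_p = fa_posterior(W, upsilon_diag, v)
```

The covariance does not depend on the individual input vector, so it can be computed once and reused for every sample.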

To maximize the likelihood, we introduce a variational distribution

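where $Q(\mathbf{h}_{i} \mid \mathbf{v}_{i})$ is the variational distribution. We constrain $Q$ through posterior regularization; $D_{KL} > 0$ is the KL distance. $\mathbb{F}$ is the objective of the EM algorithm. The E-step maximizes $\mathbb{F}$ with respect to $Q$, which amounts to minimizing $D_{KL}(Q(\mathbf{h}_{i} \mid \mathbf{v}_{i}) \,\|\, p(\mathbf{h}_{i} \mid \mathbf{v}_{i}))$. The M-step maximizes $\mathbb{F}$ with respect to the parameters $(\mathbf{W}, \Upsilon)$, which amounts to maximizing $\sum_{i} \int Q(\mathbf{h}_{i} \mid \mathbf{v}_{i}) \log p(\mathbf{v}_{i} \mid \mathbf{h}_{i}) \, d\mathbf{h}_{i}$. Because posterior regularization leads to a quadratic problem, and to speed up the computation with fast GPU implementations, we perform a gradient step in both the E- and M-steps. In the E-step, we use the projected Newton method as in Equation (5).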
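The constrained quadratic problem in the E-step can be illustrated with a small sketch. It assumes that the constraint is non-negativity of the posterior mean (consistent with the sparse, non-negative representations the method aims for) and, for simplicity, uses plain projected gradient descent rather than the projected Newton method of Equation (5). The function name `project_posterior_mean` and its parameters are hypothetical.

```python
import numpy as np

def project_posterior_mean(mu_p, precision, n_steps=500):
    """Approximate argmin_h (h - mu_p)^T P (h - mu_p) subject to h >= 0,
    where P is the posterior precision (inverse covariance) matrix."""
    # Step size 1/L, with L the Lipschitz constant of the gradient 2 P (h - mu_p).
    step = 1.0 / (2.0 * np.linalg.eigvalsh(precision).max())
    h = np.maximum(mu_p, 0.0)                 # start from the rectified mean
    for _ in range(n_steps):
        grad = 2.0 * precision @ (h - mu_p)   # gradient of the quadratic objective
        h = np.maximum(h - step * grad, 0.0)  # gradient step, then projection
    return h

mu_p = np.array([1.0, -0.5])
precision = np.array([[2.0, 1.0], [1.0, 2.0]])
h = project_posterior_mean(mu_p, precision)   # approximately [0.75, 0.0]
```

With correlated hidden units the constrained solution (0.75, 0) differs from simply rectifying the mean, which would give (1, 0); with a diagonal precision matrix the two coincide.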

In Equation (5), with

In the M-step, we decrease the expected reconstruction error, as in Equation (6).

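where $\mathbf{W}^{new} = \mathbf{U}\mathbf{S}^{-1}$ and $\Upsilon^{new} = \mathbf{C} - \mathbf{U}\mathbf{W}^{T} - \mathbf{W}\mathbf{U}^{T} + \mathbf{W}\mathbf{S}\mathbf{W}^{T}$.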

To obtain sparse, non-negative, and non-linear representations of the input, and also to model the covariance structure of the input, we choose maximum likelihood factor analysis as the model and apply the posterior regularization method (Ganchev et al.,

According to Koyuturk et al. (

where δ > 0. Assume that the probability of observing ^{*}, then by Equation (8), the bicluster is significant if

For each bicluster identified, the Bonferroni correction is used to control the overall type I error. The level of significance is set at
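A minimal sketch of a Bonferroni adjustment over a list of bicluster p-values follows. The function name `bonferroni_significant` is hypothetical, and the default `alpha=0.05` is only a placeholder assumption, not the significance level used in this study.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return (adjusted p-value, is_significant) for each raw p-value.

    alpha is a placeholder significance level (assumption).
    """
    m = len(p_values)                                   # number of tests
    adjusted = [min(1.0, p * m) for p in p_values]      # Bonferroni adjustment
    return [(p_adj, p_adj <= alpha) for p_adj in adjusted]

# Toy p-values for three hypothetical biclusters.
results = bonferroni_significant([0.001, 0.02, 0.5])
flags = [is_sig for _, is_sig in results]
```

Multiplying each p-value by the number of tests is equivalent to comparing the raw p-values against `alpha / m`.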

Significant bicluster extraction process.

We use Kaplan-Meier plots (Goel et al., $t_{i}$, $S(t_{i})$ is calculated as below:

where $S(t_{i-1})$ is the probability of being alive at $t_{i-1}$. $n_{i}$ is the number of patients alive just before $t_{i}$. $d_{i}$ is the number of events at $t_{i}$. $t_{0} = 0$ and
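These definitions correspond to the standard Kaplan-Meier product-limit recursion, $S(t_{i}) = S(t_{i-1})\,(1 - d_{i}/n_{i})$. The sketch below implements that recursion under the usual convention that the curve starts at 1; the function name `kaplan_meier` and the input layout (parallel lists of follow-up times and 0/1 event indicators) are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t_i, S(t_i))] at each distinct event time.

    times  : follow-up time of each patient
    events : 1 if the event (death) was observed, 0 if censored
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0                                  # survival before the first event
    curve = []
    for t in event_times:
        n_i = sum(1 for x in times if x >= t)       # alive just before t
        d_i = sum(1 for x, e in zip(times, events) if x == t and e)
        survival *= 1.0 - d_i / n_i                 # S(t_i) = S(t_{i-1}) (1 - d_i / n_i)
        curve.append((t, survival))
    return curve

# Four hypothetical patients; the second one is censored at time 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
# curve == [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Censored patients never trigger a step in the curve, but they still count in the number at risk until their censoring time.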

Considering the genes in each significant bicluster, the samples in both the training set and the validation set can be divided into two groups: G1 (in which over 80% of the bicluster genes are significantly changed) and G2 (in which the bicluster genes are expressed normally). To test the survival difference between samples in G1 and G2, a multi-sampling strategy is used; each time, the same number of samples is selected. The survival curves of the two selected sample groups can be compared statistically by testing the null hypothesis that there is no difference in survival between the two groups. This null hypothesis is tested with a log-rank test. In the log-rank test, we calculate the expected number of events in each group, E1 and E2, while O1 and O2 are the total numbers of observed events in each group, respectively. The test statistic is:
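A commonly used form of the statistic built from these quantities is $(O_{1} - E_{1})^{2}/E_{1} + (O_{2} - E_{2})^{2}/E_{2}$, referred to a chi-square distribution with one degree of freedom. The sketch below assumes that simple form (not the variance-weighted variant found in some statistical packages); the function name `logrank_test` and its arguments are hypothetical, and the p-value uses the identity $P(\chi^{2}_{1} > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math

def logrank_test(times1, events1, times2, events2):
    """Two-group log-rank test from observed (O) and expected (E) event counts.

    events are 1 (event observed) or 0 (censored). Assumes both groups have
    patients at risk at some event time, so that E1 and E2 are positive.
    """
    pooled = list(zip(times1, events1)) + list(zip(times2, events2))
    event_times = sorted({t for t, e in pooled if e})
    o1, o2 = sum(events1), sum(events2)             # observed events per group
    e1 = e2 = 0.0
    for t in event_times:
        n1 = sum(1 for x in times1 if x >= t)       # at risk in group 1 just before t
        n2 = sum(1 for x in times2 if x >= t)       # at risk in group 2 just before t
        d = sum(1 for x, e in pooled if x == t and e)
        e1 += d * n1 / (n1 + n2)                    # expected events under the null
        e2 += d * n2 / (n1 + n2)
    chi2 = (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2
    p_value = math.erfc(math.sqrt(chi2 / 2.0))      # chi-square, 1 degree of freedom
    return chi2, p_value

# Hypothetical groups: early deaths versus late deaths.
chi2, p_value = logrank_test([1, 2, 3], [1, 1, 1], [10, 11, 12], [1, 1, 1])
```

With only three patients per group the difference does not reach significance at the 0.05 level, which illustrates why group size matters when sampling.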

The significance of the test statistic can be determined by comparing the calculated value with the critical value (from a chi-square table). To increase confidence that the bicluster genes are survival-related, for each significant bicluster we repeat the log-rank test 100 times on samples from the training set. If the genes in the bicluster separate the patient groups in more than 80% of the sampling rounds, we then use the validation datasets to test whether they also separate the validation samples into two different survival groups. Only bicluster gene sets that pass all these significance tests are retained as the final biomarkers. We also confirm some biomarkers with independent datasets from the GEO database. In this study, the log-rank test and survival analysis are conducted based on functions in the
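The repeated-sampling check described above can be sketched as follows. This is an illustration under stated assumptions, not the BISG implementation: the function name `passes_multi_sampling`, the equal subsample size `k` drawn from each group, the placeholder level `alpha=0.05`, and the toy `toy_test` function (a stand-in for a log-rank test that returns a p-value) are all hypothetical.

```python
import random

def passes_multi_sampling(g1, g2, test_fn, n_rounds=100, min_fraction=0.8,
                          alpha=0.05, k=None, seed=0):
    """Repeat a two-group test on random equal-size subsamples.

    g1, g2  : lists of (time, event) records for the two patient groups
    test_fn : callable returning a p-value for two lists of records
    Returns (passed, fraction of rounds with p < alpha).
    """
    rng = random.Random(seed)
    if k is None:
        k = min(len(g1), len(g2))          # assumption: equal-size draws per group
    hits = 0
    for _ in range(n_rounds):
        s1 = rng.sample(g1, k)             # sampling without replacement
        s2 = rng.sample(g2, k)
        if test_fn(s1, s2) < alpha:
            hits += 1
    fraction = hits / n_rounds
    return fraction > min_fraction, fraction

def toy_test(s1, s2):
    """Stand-in for a log-rank test: small p-value if mean times differ by > 5."""
    m1 = sum(t for t, _ in s1) / len(s1)
    m2 = sum(t for t, _ in s2) / len(s2)
    return 0.01 if abs(m1 - m2) > 5 else 0.5

g1 = [(t, 1) for t in range(1, 11)]        # hypothetical short survivors
g2 = [(t, 1) for t in range(20, 40)]       # hypothetical long survivors
passed, fraction = passes_multi_sampling(g1, g2, toy_test)
```

Passing the test function as an argument keeps the sampling logic independent of the particular survival test used.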

We applied BISG on the datasets of brain lower grade glioma, lung adenocarcinoma and breast invasive carcinoma from the cBioPortal database (

Twenty-four significant survival-related gene sets detected in brain lower grade glioma with datasets from the cBioPortal database (

We systematically detected significant survival-related biomarker gene sets in 12 different cancer types with datasets in

Enriched GSEA hallmark gene sets for the biomarker gene sets of all 12 cancer types. Names on the right Y-axis are the hallmark gene sets. Names on the bottom X-axis are the 12 cancer types. Count is the number of cancers whose significant gene sets are enriched in the corresponding hallmark gene set. Values in this figure are 0 or 1; zero means that the biomarker gene sets of the corresponding cancer are not enriched in the hallmark gene set.

We also analyzed the enriched KEGG pathways of all the bicluster gene sets. As shown in

To test whether biomarker gene sets detected by BISG with datasets from cBioPortal database can differentiate patients into different survival groups with new independent datasets, we collected three microarray datasets GSE16011, GSE3494, and GSE11969, as well as their corresponding sample survival information (

Kaplan-Meier plots of the survival analysis of samples from brain lower grade glioma (GSE16011), lung adenocarcinoma (GSE11969), and breast invasive carcinoma (GSE3494) patients. The labels 1 and 3 denote the first and third top-ranked biomarker gene sets detected by BISG with the corresponding cBioPortal datasets. The patients were separated into two groups according to the expression profiles of the genes in the selected biomarker gene set. These genes provided the best split between high-risk and low-risk patients based on their expression levels. For genes in biomarker gene sets (labeled in brown), over-expression is correlated with poor survival (only up-regulated genes were considered); for patients without biomarker genes (labeled in blue), over-expression is correlated with good survival. In all cases the adjusted

To further validate our method, we first compared it with GSAS. GSAS quantitatively assesses a gene set's activity score with the BASE algorithm (Cheng et al.,

Comparison of gene set based patient survival group classification. “With gene set” refers to patients in whom the expression of over 80% of the genes in the gene set is significantly changed. “Without the gene set” refers to patients in whom the expression of the genes in the gene set is normal.

Furthermore, we compared BISG with IPSOV. We tested whether the ovarian cancer survival-related gene sets detected by IPSOV (with data from GSE32062) and the top-ranked gene set identified by BISG with ovarian cancer datasets from the cBioPortal database can differentiate samples in GSE32062 (used by IPSOV but not BISG) into different survival groups. Detailed results are shown in

Based on the fast GPU implementation of the RFN model, BISG can perform biclustering analysis of large input datasets quickly and accurately, which enables BISG to use a multi-sampling strategy to iteratively detect survival-related biomarker gene sets. In contrast to standard clustering, the samples of a bicluster are similar to each other only on a subset of genes. As a result, genes in each significant bicluster can better differentiate samples into different survival groups. Unlike GSAS and IPSOV, our method detects biomarker gene sets directly from biclustering analysis of the expression datasets, which can capture the dynamic changes of gene sets and reflect the actual relationships among these genes.

In this paper, we proposed BISG for identifying cancer survival-related biomarker gene sets. BISG can efficiently bicluster high-dimensional gene expression matrices and, together with patient time-to-event data, perform survival analyses. To speed up computation, BISG performs a generalized alternating minimization algorithm with GPU implementations. In this way, BISG can efficiently construct very sparse, non-linear, high-dimensional representations of the input via their posterior means. To identify robust biomarker gene sets, multiple iterations and a random sampling strategy were used, and each time only bicluster genes that significantly differentiate patient survival groups were kept. To detect patterns in survival-related gene sets, we systematically analyzed 12 different cancer types and identified their enriched pathways and gene families. The results indicate that the identified gene families and genes are biologically meaningful and consistent with existing scientific findings. The identified biomarkers were confirmed with several independent test datasets. We also compared BISG with two related methods, and BISG outperformed them. The predicted biomarker gene sets can be further investigated for improving cancer patient survival. BISG is currently based on a simple factor analysis model, which can be further extended to multiple layers with a deep learning network structure.

Our method has the potential to be extended to single-cell RNA-seq analysis, which has been widely applied in studying cell heterogeneity, such as cells of different cancer types or subtypes. A pertinent question in such analyses is how to identify cell subpopulations. Our method can conduct biclustering effectively and efficiently, especially for large expression matrices. Ongoing consortium efforts have generated extensive atlases of single-cell datasets covering diverse biological contexts with thousands of samples (Xie et al.,

Publicly available datasets were analyzed in this study. These data can be found here: GSE3494, GSE11969, GSE16011, GSE1456, and GSE32062,

LS, DX, and GL contributed to the conception and design of the study. LS, JW, and JG downloaded and organized the datasets. LS performed the statistical and result analysis. LS wrote the first draft of the manuscript. All authors contributed to manuscript revision, and read and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at: