Edited by: Jianing Xi, Northwestern Polytechnical University, China

Reviewed by: Hao Wu, Shandong University, China; Feng Li, Qufu Normal University, China; Kai Shi, Guilin University of Technology, China

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Multi-omics molecules regulate complex biological processes (CBPs), which reflect the activities of various molecules in living organisms. Meanwhile, the growing use of such data to characterize disease subtypes and cell types has created an urgent need for tools that group samples and infer the associated CBPs. In this paper, we present CBP-JMF, a practical tool primarily for discovering the CBPs that underlie sample groups such as disease subtypes. Unlike existing methods, CBP-JMF is based on a joint non-negative matrix tri-factorization framework and is implemented in Python. As a practical application, we apply CBP-JMF to identify CBPs for four subtypes of breast cancer. The results show significant overlap between genes extracted from the CBPs and known subtype pathways, which supports the effectiveness of our tool in detecting CBPs that explain disease subtypes.

Complex biological processes (CBPs) are the coordinated effect of multiple molecules, which result in some functional pathways and the vital processes occurring in living organisms. In addition, the vast amounts of multi-omics data, such as genomics, epigenomics, transcriptomics, proteomics, and metabolomics, can be integrated to understand systems biology accurately (Suravajhala et al.,

Non-negative matrix factorization (NMF) (Lee and Seung,

Omics data across the same samples contain signal values from expression counts, methylation levels, and protein concentrations, which control biological systems, resulting in so-called multi-dimensional genomic (MG) data. The natural representation of these diverse MG data is a series of matrices with measured values in rows and individual samples in columns. Recently, integrative analysis tools based on NMF have been developed to reveal low-dimensional structure patterns. These patterns reflect CBPs and sample groups while preserving as much information as possible from high-dimensional MG data (Stein-O'Brien et al.,

In general, most particular matrix factorization techniques are being developed to enhance their applicability to specific biological problems. Meanwhile, the applications to represent disease subtypes (Biton et al.,

The rest of this paper is organized as follows. Section “Framework of CBP-JMF” presents the problem formulation of CBP-JMF and its implementation. Section “Results” illustrates our approach by applying CBP-JMF to identify CBPs for different subtypes of breast cancer and compares the classification of unlabeled samples by CBP-JMF and several of its variants. Section “Discussion” discusses our results, our expectations for the method, and its limitations. Finally, Section “Conclusions” summarizes our method.

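Given a non-negative matrix $X \in \mathbb{R}^{m \times n}$, it can be factorized into three non-negative matrix factors based on matrix tri-factorization: $U \in \mathbb{R}^{m \times k}$, $S \in \mathbb{R}^{k \times k}$, and $V \in \mathbb{R}^{k \times n}$. Factored matrix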

In CBP-JMF, given a MG dataset composed of ^{(1)}, ^{(2)}, ..., ^{(P)}, as illustrated in ^{(p)} (^{(p)} ∈ ^{m×n}, ^{(p)} ≈ ^{(p)}^{(p)}^{(p)} ∈ ^{m×k} and sample basis matrix (SBM) ^{k×n} are the pattern indicator matrices of ^{(p)} ∈ ^{k×k} explores the relationships between them. Furthermore, MCM describes the structure pattern between molecules (

Illustration of the framework and optimization objective function of complex biological processes–joint matrix tri-factorization.

Overall, $X^{(1)}, X^{(2)}, \ldots, X^{(P)}$ can be jointly factorized into specific $U^{(1)}, U^{(2)}, \ldots, U^{(P)}$, $S^{(1)}, S^{(2)}, \ldots, S^{(P)}$, and a common matrix $V$. Here, $X^{(1)}, X^{(2)}, \ldots, X^{(P)}$ are measured across the same samples, and $V$ is divided into $V^{L}$ and $V^{UL}$ according to input data, where L and UL mean “labeled” samples and “unlabeled” samples, respectively.

Considering that different datasets may play different roles in data integration, we adopted a method that can learn the weights of different input data through a weighted joint tri-NMF:

where $\Pi = (\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(P)})$. CBP-JMF differentiates the importance of the datasets through the weight constraint $\|\Pi\|^{2}$: after optimization, each $\pi^{(p)}$ represents the contribution of dataset $X^{(p)}$ to the objective function. If $X^{(p)}$ contributes to the optimization of the cost function, it is given a higher weight $\pi^{(p)}$; if $X^{(p)}$ contains substantial noise that hinders the optimization, it is given a lower weight $\pi^{(p)}$.
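The weighted joint objective described above can be sketched numerically. The following is a minimal illustration rather than the authors' implementation: the function name `weighted_joint_loss`, the toy matrix sizes, and the fixed weights are assumptions, and any additional constraint on $\Pi$ (for example, that the weights sum to one) is not reproduced because it is not shown in this text.

```python
import numpy as np

def weighted_joint_loss(Xs, Us, Ss, V, pi, omega):
    """Weighted reconstruction error plus the weight penalty omega * ||pi||^2.

    Xs, Us, Ss are lists with one entry per omics layer; V is shared.
    The exact form of the published objective is assumed here.
    """
    recon = sum(
        w * np.linalg.norm(X - U @ S @ V, "fro") ** 2
        for w, X, U, S in zip(pi, Xs, Us, Ss)
    )
    return recon + omega * float(np.dot(pi, pi))

rng = np.random.default_rng(0)
n, k = 30, 4                     # samples and modules (toy sizes)
dims = [50, 20]                  # molecules per omics layer (toy sizes)
V = rng.random((k, n))
Us = [rng.random((m, k)) for m in dims]
Ss = [rng.random((k, k)) for _ in dims]
Xs = [U @ S @ V for U, S in zip(Us, Ss)]      # noise-free data
Xs[1] = Xs[1] + rng.random(Xs[1].shape)       # corrupt the second layer

pi = np.array([0.5, 0.5])
loss = weighted_joint_loss(Xs, Us, Ss, V, pi, omega=1.0)
print(f"loss with equal weights: {loss:.3f}")
```

In this toy setting, moving weight from the noisy second layer to the clean first layer lowers the reconstruction term, which is the intuition behind assigning lower weights to noisy datasets.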

In addition, ^{L} and unlabeled ^{UL} parts according to the labeled samples and unlabeled samples. In order to learn the correlation between labeled samples, we use a graph Laplacian to represent the distance of labeled sample in latent space (Guan et al.,

where ^{L} is the number of labeled samples in ^{a} (^{affinity}) and ^{p} (^{penalty}) are the weighted adjacency matrices (see ^{a} (^{affinity}) and ^{p} (^{penalty}) are the Laplacian matrix of ^{a} and ^{p}, respectively, where ^{a}=^{a} − ^{a}, ^{p}=^{p} − ^{p}, ^{a}=

Combining weighted joint tri-NMF and the constraints of correlation between labeled samples mentioned above, we give the formulation of the optimization objective function of CBP-JMF as follows (

Parameters β and ω control the importance of the graph Laplacian regularization and the weight constraint $\|\Pi\|^{2}$, respectively. In total, each $X^{(p)}$ is factorized into an individual molecular matrix $U^{(p)}$, a scale matrix $S^{(p)}$, and a common sample matrix $V$.
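To make the graph Laplacian regularization more concrete, the sketch below evaluates a trace penalty of the form $\mathrm{tr}(V^{L} L V^{L\top})$ and checks it against the equivalent pairwise-distance form $\tfrac{1}{2}\sum_{ij} W_{ij}\|v_i - v_j\|^2$. The names `laplacian`, `laplacian_penalty`, and `pairwise_penalty`, the binary toy graphs, and the choice of subtracting the penalty-graph term from the affinity-graph term are assumptions for illustration; the exact regularizer used by CBP-JMF is not recoverable from this text.

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian L = D - W for a symmetric weighted adjacency matrix W."""
    return np.diag(W.sum(axis=1)) - W

def laplacian_penalty(V_L, W):
    """tr(V_L L V_L^T), where the columns of V_L are labeled samples."""
    return float(np.trace(V_L @ laplacian(W) @ V_L.T))

def pairwise_penalty(V_L, W):
    """Equivalent form: 0.5 * sum_ij W_ij * ||v_i - v_j||^2."""
    diff = V_L[:, :, None] - V_L[:, None, :]          # shape k x n x n
    return 0.5 * float((W * (diff ** 2).sum(axis=0)).sum())

rng = np.random.default_rng(1)
labels = np.array([0, 0, 1, 1, 2])                    # toy subtype labels
same = labels[:, None] == labels[None, :]
W_a = same.astype(float)                              # affinity: same label
np.fill_diagonal(W_a, 0.0)
W_p = (~same).astype(float)                           # penalty: different label
V_L = rng.random((3, labels.size))                    # 3 latent factors

# Assumed combination: pull same-label samples together, push others apart.
reg = laplacian_penalty(V_L, W_a) - laplacian_penalty(V_L, W_p)
print(f"regularization term: {reg:.3f}")
```

A small affinity term means that samples sharing a label lie close together in the latent space, which is what the regularization is meant to encourage.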

To solve the problem of factorization

The CBP-JMF algorithm.

^{(1)}, ^{(2)}, ..., ^{(P)}, parameters |

^{(1)}, ^{(2)}, ..., ^{(P)}, ^{(1)}, ^{(2)}, ..., ^{(P)}, factor matrices ^{(1)}, ^{(2)}, ..., ^{(P)}) |

1: |

2: Initialize^{(1)}, ^{(2)}, ..., ^{(P)}, ^{(1)}, ^{(2)}, ..., ^{(P)}, V |

3: Initialize |

4: |

5: |

6: Fix V, update ^{(p)}^{(p)} |

7: |

8: Fix ^{(1)}, ^{(2)}, ..., ^{(P)}, update ^{L} |

9: Fix ^{(1)}, ^{(2)}, ..., ^{(P)}, update ^{UL} |

10: |

11: Fix |

12: |

13: |

14: |

15: |
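The alternating scheme in the algorithm (fix $V$ and update the dataset-specific factors, then fix those and update $V$) can be sketched as follows. This is a simplified illustration that uses the standard multiplicative updates for unregularized non-negative tri-factorization; it keeps the weights fixed and omits the graph Laplacian term and the labeled/unlabeled split, so it is not the CBP-JMF update rule itself. The function names and toy data are assumptions.

```python
import numpy as np

EPS = 1e-10  # guards against division by zero

def joint_loss(Xs, Us, Ss, V, pi):
    """Weighted sum of squared Frobenius reconstruction errors."""
    return sum(w * np.linalg.norm(X - U @ S @ V) ** 2
               for w, X, U, S in zip(pi, Xs, Us, Ss))

def fit_joint_trinmf(Xs, k, pi, n_iter=200, seed=0):
    """Alternate multiplicative updates for U^(p), S^(p), and the shared V."""
    rng = np.random.default_rng(seed)
    n = Xs[0].shape[1]
    Us = [rng.random((X.shape[0], k)) + 0.1 for X in Xs]
    Ss = [rng.random((k, k)) + 0.1 for _ in Xs]
    V = rng.random((k, n)) + 0.1
    history = [joint_loss(Xs, Us, Ss, V, pi)]
    for _ in range(n_iter):
        # Fix V, update each U^(p) and S^(p).
        for p, X in enumerate(Xs):
            SV = Ss[p] @ V
            Us[p] *= (X @ SV.T) / (Us[p] @ SV @ SV.T + EPS)
            UtU = Us[p].T @ Us[p]
            Ss[p] *= (Us[p].T @ X @ V.T) / (UtU @ Ss[p] @ V @ V.T + EPS)
        # Fix all U^(p), S^(p), update the shared V using every dataset.
        num = sum(w * (U @ S).T @ X for w, X, U, S in zip(pi, Xs, Us, Ss))
        den = sum(w * (U @ S).T @ (U @ S) @ V for w, U, S in zip(pi, Us, Ss))
        V *= num / (den + EPS)
        history.append(joint_loss(Xs, Us, Ss, V, pi))
    return Us, Ss, V, history

rng = np.random.default_rng(0)
V_true = rng.random((4, 30))
Xs = [rng.random((m, 4)) @ rng.random((4, 4)) @ V_true for m in (50, 20)]
Us, Ss, V, history = fit_joint_trinmf(Xs, k=4, pi=[0.5, 0.5])
print(f"loss: {history[0]:.2f} -> {history[-1]:.4f}")
```

Because every update multiplies a non-negative matrix by a non-negative ratio, all factors stay non-negative without an explicit projection step.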

To clarify the update rules of the objective function of CBP-JMF, we define

The partial derivatives of ^{(P)}) with

Based on the KKT conditions Ψ_{ij}_{ij} = 0, we can get the following update rules:

Similarly, we can get the update rules for ^{L}, and ^{UL}:

As for updating of π, when

Values in each column of ^{(p)} represent the relative contribution of each molecule in each module, and values in each row of ^{(p)}, ^{(p)} matrix. Firstly, we need to know the relationship between ^{(p)} matrix (see

To select features associated with each module, CBP-JMF calculates the z-scores of each molecule for each column vector of ^{(p)} as ^{(p)} and infer a latent feature associated with _{i}, and the length of
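The column-wise z-score selection described above can be sketched as follows. The threshold value, the function name `select_module_members`, and the toy matrix are hypothetical; the cutoff actually used by CBP-JMF is not given in this text.

```python
import numpy as np

def select_module_members(U, threshold=2.0):
    """Return, for each column (module) of U, the row indices whose
    z-score within that column exceeds the threshold."""
    mu = U.mean(axis=0)
    sd = U.std(axis=0, ddof=1)
    z = (U - mu) / sd
    return [np.flatnonzero(z[:, j] > threshold) for j in range(U.shape[1])]

# Toy molecular matrix: 20 molecules, 2 modules, a few dominant entries.
U = np.full((20, 2), 0.1)
U[[0, 1], 0] = 5.0      # molecules 0 and 1 dominate module 0
U[5, 1] = 5.0           # molecule 5 dominates module 1

members = select_module_members(U, threshold=2.0)
for j, idx in enumerate(members):
    print(f"module {j}: molecules {idx.tolist()}")
```

Standardizing within each column makes the cutoff comparable across modules whose raw coefficients differ in scale.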

We applied CBP-JMF to multi-omics data from BRCA. We chose BRCA as an example because breast cancer is a heterogeneous, complex disease and the most commonly occurring cancer. It can also be divided into subtypes based on characteristics of the cancer cells, and distinct complex biological processes represent different subtypes. Characterizing these processes can provide comprehensive insights into how molecules at multiple levels interact with each other and into the heterogeneity of breast cancers.

Firstly, we downloaded the Gene Expression (GE) data, miRNA expression (ME) data, and copy number variation (CNV) data across the same set of 738 breast cancer samples from UCSC Xena (Goldman et al., ^{(1)} ∈ ^{2913×725} and ME data ^{(2)} ∈ ^{516×725}. Among 725 samples, 179 samples are marked with subtype labels (80 luminal A, 38 luminal B, 39 basal-like, 22 HER2-enriched) and shared between GE, ME, and CNV datasets. Furthermore, we calculated the Pearson correlation of 179 labeled samples using CNV data to construct ^{a} ∈ ^{179×179}, ^{p} ∈ ^{179×179}, and their Laplacian matrices to form the graph Laplacian regularization
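One possible way to turn sample-by-sample Pearson correlations and subtype labels into affinity and penalty graphs is sketched below. The rescaling of correlations to [0, 1], the rule of assigning same-label pairs to the affinity graph and different-label pairs to the penalty graph, and the name `label_graphs` are all assumptions; the construction actually used by the authors is not recoverable from this text.

```python
import numpy as np

def label_graphs(data, labels):
    """Build affinity and penalty graphs from sample correlations and labels.

    data:   features x samples matrix (for example, CNV values)
    labels: one subtype label per sample (column of data)
    """
    corr = np.corrcoef(data, rowvar=False)   # samples x samples Pearson r
    weight = (corr + 1.0) / 2.0              # hypothetical map of [-1, 1] to [0, 1]
    same = labels[:, None] == labels[None, :]
    W_a = np.where(same, weight, 0.0)        # edges between same-subtype samples
    W_p = np.where(~same, weight, 0.0)       # edges between different subtypes
    np.fill_diagonal(W_a, 0.0)               # no self-loops
    return W_a, W_p

def laplacian(W):
    """Graph Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(2)
cnv = rng.normal(size=(40, 6))               # toy data: 40 features, 6 samples
labels = np.array(["LumA", "LumA", "LumB", "Basal", "Basal", "HER2"])
W_a, W_p = label_graphs(cnv, labels)
L_a, L_p = laplacian(W_a), laplacian(W_p)
print(W_a.shape, L_p.shape)
```

With the 179 labeled samples of the study, the same procedure would yield 179 × 179 matrices, matching the dimensions stated above.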

In our example, we set parameters ^{(1)} ∈ ^{2913×4}, ^{(2)} ∈ ^{516×4}, ^{(1)} ∈ ^{4×4}, and ^{(2)} ∈ ^{4×4} and a common matrix ^{4 ×725}.

To get heterogeneous CBPs (^{(p)},

Complex biological processes of the luminal B and basal-like subtypes. We mapped the genes and miRNAs obtained from the luminal B module and the basal-like module to an integrated gene regulation network. The network was obtained by integrating three databases: Reactome, the Kyoto Encyclopedia of Genes and Genomes, and the NCI-PID Pathway Interaction Database. The interactions between genes and miRNAs were obtained from miRTarBase. Node size is proportional to node degree. Edge thickness indicates the strength of the regulatory relationship, expressed as the Pearson correlation coefficient between microRNA and gene.

To explore whether the genes in the CBPs of luminal B and basal-like subtype have significant biological importance or not, we performed an enrichment analysis with all 124 genes from

Enrichment analysis of the extracted module genes across six datasets.

Total | 51 | 43 | 947 | 516 | 61 | 102 |
Overlapped nodes | 2 | 5 | 13 | 6 | 3 | 6 |
P-value | 0.049 | 0.0003 | 0.007 | 0.008 | 0.010 | 0.012 |
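An overlap-based enrichment of the kind summarized in the table is commonly assessed with a one-sided hypergeometric test, sketched below. Whether the authors used this exact test is an assumption, and the background size of 20,000 genes is a hypothetical placeholder; only the module size (124 genes) and one column of the table (43 pathway genes, 5 overlapping) are taken from the text.

```python
from math import comb

def hypergeom_enrichment_p(universe, pathway_size, module_size, overlap):
    """P(X >= overlap) when module_size genes are drawn without replacement
    from a universe that contains pathway_size pathway genes."""
    upper = min(pathway_size, module_size)
    total = comb(universe, module_size)
    tail = sum(
        comb(pathway_size, i) * comb(universe - pathway_size, module_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical background of 20,000 genes; other numbers come from the text.
p_value = hypergeom_enrichment_p(
    universe=20_000, pathway_size=43, module_size=124, overlap=5
)
print(f"enrichment p-value: {p_value:.3g}")
```

The resulting p-value depends strongly on the chosen background, so it should not be expected to reproduce the values in the table.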

Part of the complex biological processes of the luminal B and basal-like subtypes. Edges with checkmarks are interactions that have been documented.

Evidence for luminal B's complex biological processes.

miR-34a->ERBB2 | Wang et al., | MiR-34a modulates ErbB2 in breast cancer |
ERBB2->VAV2 | Wang et al., | ErbB2 colocalizes with Vav2 |
VAV2->RAC3 | Rosenberg et al., | Vav2 promotes Rac3 activation at invadopodia |
miR-200b->JUN | Jin et al., | MiR-200b upregulates JUN in breast cancer |
JUN->CCND1 | Cicatiello et al., | CCND1 promoter activation by estrogens in human breast cancer cells is mediated by the recruitment of a c-Jun/c-Fos/estrogen receptor |
JUN->ESR1 | Stossi et al., | Activation of the ESR1 gene locus was dependent upon activation and recruitment of the c-Jun transcription factor |
miR-26a->ESR1 | Howard and Yang, | MiR-26a modulates ESR1 in breast cancer |
ESR1->VAV2 | Grassilli et al., | ESR1 upregulates VAV2 in breast cancer cell lines |

Evidence for basal-like's complex biological processes.

CCNB1(CCNB2)->PLK1->CDK1 | Li et al., | CCNB1 (CCNB2), PLK1, and CDK1 have interactions in chicken breast muscle |
miR221->FOS | Yao et al., | miR221 modulates FOS |
miR221->PAK1 | Ergun et al., | miR221 modulates PAK1 in breast cancer cell lines |
PAK1->PLK1 | Maroto et al., | PAK1 regulates PLK1 |
MAPKAPK2->CDC25B | MAPK signaling pathway | MAPKAPK2 and CDC25B are involved in the MAPK signaling pathway |
CDC25B->CDK1 | Timofeev et al., | Timely assembly of CDK1 required CDC25B |

Meanwhile, to classify unlabeled samples into subtypes, CBP-JMF returned predicted labels for unlabeled samples (

Kaplan–Meier (K–M) survival analysis for patients classified using different methods.

Understanding CBPs is vital for understanding how diseases develop and how to intervene in them. NMF is an effective tool for dimension reduction and data mining in high-throughput genomic data. In this paper, we proposed CBP-JMF, an improved NMF-based method of multi-view data analysis designed for heterogeneous biological data, and we created an easy-to-use Python package. CBP-JMF jointly analyzes multi-dimensional genomic data measured across the same samples. Our method can discover the CBPs that underlie sample groups and classify unlabeled samples by learning the relationships between labeled samples.

We tested this framework on the gene expression and miRNA expression data of BRCA. CBP-JMF discovered subtype-specific biological processes and classified unlabeled samples into four subtypes. Survival analysis and functional analysis indicated that CBP-JMF performs well on both tasks. Because CBP-JMF is in essence a weighted joint tri-NMF framework, we expect that it can be applied to a wide range of problems, including disease subtypes, cell types, and population stratification. We also expect that CBP-JMF can be used to identify hub genes or to predict associations between genes or non-coding RNAs and diseases by integrating a variety of data. Although CBP-JMF is effective at uncovering CBPs from multi-omics data, it requires that all integrated datasets share the same samples. This requirement limits the types of data that can be used and therefore the amount of information that can be integrated to obtain more significant results.

In this article, we developed CBP-JMF, a Python tool based on matrix tri-factorization and weighted joint integration, for detecting the CBPs that characterize predefined disease subtypes and cell groups. We improve its usability by estimating parameters, for example by determining the number of features through consensus clustering, and CBP-JMF provides reference values for all parameters. In our application, CBP-JMF characterized the CBPs of four subtypes of BRCA based on gene and miRNA expression data from TCGA, and we found significantly different functional pathways that characterize the luminal B and basal-like subtypes.

The datasets presented in this study are publicly available and the addresses for finding them are listed within the article. Prediction results and a reference implementation of CBP-JMF in Python are available at:

BW, YWu, and XM conceived and designed the experiments. YWu and MX performed the experiments. XM, RD, CZ, LY, XG, and LG analyzed the data. BW, YWu, XM, and YWa proofread the paper. All authors contributed to the article and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at: