
Edited by: Jialiang Yang, Geneis (Beijing) Co. Ltd., China

Reviewed by: Ali Salehzadeh-Yazdi, University of Rostock, Germany; JunLin Xu, Hunan University, China; Lan Yu, Inner Mongolia People’s Hospital, China

This article was submitted to Systems Biology, a section of the journal Frontiers in Genetics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

It is known that miRNAs play an increasingly important role in many physiological processes. Disease-related miRNAs could be potential biomarkers for clinical diagnosis, prognosis, and treatment. Therefore, accurately inferring potential miRNAs related to diseases has recently become a hot topic in the bioinformatics community. In this study, we proposed a mathematical model based on matrix decomposition, named MFMDA, to identify potential miRNA–disease associations by integrating known miRNA–disease associations with the similarities between miRNAs and between diseases. We also compared MFMDA with several recent algorithms on established miRNA–disease databases. MFMDA reached an AUC of 0.9061 in fivefold cross-validation. The experimental results show that MFMDA effectively infers novel miRNA–disease associations. In addition, we conducted case studies by applying MFMDA to three types of high-risk human cancers. While most predicted miRNAs are confirmed by external databases of experimental literature, we also identified a few novel disease-related miRNAs for further experimental validation.

Non-coding RNA (ncRNA) is a type of RNA that is not translated into protein. Nevertheless, ncRNAs can regulate their target genes at the post-transcriptional level, thereby affecting disease.

In recent years, a growing number of studies have shown that miRNAs play a major role in cell differentiation, biological development, and disease development, which has attracted increasing attention from researchers.

However, identifying disease-associated miRNAs through biological experiments is expensive and time-consuming, and without prior guidance the search is largely undirected. Therefore, there is an urgent need for simple and effective computational models for predicting disease-related miRNAs. With the rapid development of high-throughput sequencing technology, more and more omics data are being published, which also provides data support for the study of computational prediction models.

Machine learning-based methods for predicting potential miRNA–disease associations can be divided into supervised and semi-supervised methods. Supervised methods build a machine learning model from a set of labeled samples and a set of unlabeled samples. Jiang et al. extracted feature sets from known and unknown associations to train support vector machine (SVM) classifiers for predicting potential miRNA–disease associations, and achieved competitive prediction performance in cross-validation.

In addition to machine learning-based methods, network-based methods to predict disease-related miRNAs have also attracted the attention of many researchers. Such methods are mainly based on a common biological hypothesis: “miRNAs with similar functions are more likely to be associated with disease phenotypes with similar functions, and vice versa.”

Although research on miRNA–disease association prediction models has made some progress, there is still room to improve prediction performance. In this study, we propose a predictive model based on matrix decomposition, called MFMDA, which fully considers the similarity between miRNAs and the similarity between diseases. To evaluate the effectiveness of MFMDA, we tested it under global fivefold CV and local LOOCV frameworks. MFMDA outperforms the benchmark algorithms used for comparison and achieves reliable performance under both frameworks (AUCs of 0.9061 and 0.7933, respectively) on the HMDD V2.0 data set. To further demonstrate the usefulness of MFMDA, we analyzed three common diseases. Of the top 30 potential miRNAs predicted by MFMDA for the three diseases (the top 10 for each), 28 have been confirmed by other databases.

In the past few decades, as the technology has matured, a large amount of omics data has been published, including many known miRNA–disease association pairs. Here, we use HMDD V2.0, a data set of known miRNA–disease associations, as the benchmark dataset. The associations are encoded in an adjacency matrix $Y$: if miRNA $m_i$ is known to be associated with disease $d_j$, the entry $Y_{ij}$ is 1; otherwise, the corresponding position is set to 0.
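The encoding described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `build_association_matrix`, the orientation (miRNAs as rows, diseases as columns), and the toy names and pairs are all assumptions made for the example and are not taken from HMDD.

```python
def build_association_matrix(pairs, mirnas, diseases):
    """Return Y as a list of rows with Y[i][j] = 1 for each known (miRNA, disease) pair."""
    row = {name: i for i, name in enumerate(mirnas)}
    col = {name: j for j, name in enumerate(diseases)}
    Y = [[0] * len(diseases) for _ in mirnas]
    for mirna, disease in pairs:
        Y[row[mirna]][col[disease]] = 1
    return Y


# Hypothetical toy input: placeholder names, not real database records.
mirnas = ["mirna_a", "mirna_b", "mirna_c"]
diseases = ["disease_x", "disease_y"]
pairs = [("mirna_a", "disease_y"), ("mirna_b", "disease_x"), ("mirna_b", "disease_y")]

Y = build_association_matrix(pairs, mirnas, diseases)
for name, row_values in zip(mirnas, Y):
    print(name, row_values)
```

A miRNA with no known association, such as `mirna_c` here, simply gets an all-zero row.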

Previous research indicates that miRNAs with similar functions are more likely to be related to similar diseases. Therefore, we constructed a functional similarity matrix $FS$ between miRNAs based on miRNA functional similarity data, where the entry $FS(m_i, m_j)$ is the functional similarity between miRNAs $m_i$ and $m_j$.

Semantic similarity is a common way to express the similarity between diseases in this field. MFMDA uses a hierarchical directed acyclic graph (DAG) to calculate the similarity between two diseases. Let $DAG_d = (T_d, E_d)$ be the DAG of disease $d$, where $T_d$ represents the ancestor node set of $d$ and $E_d$ represents the hierarchical connections between diseases defined by the MeSH disease tree structure of the National Library of Medicine. For any disease $t \in T_d$, MFMDA defines the semantic contribution of disease $t$ to disease $d$ as:

Where $\Delta$ is the semantic decay factor, which is set to 0.5 in the iterative equation according to previous research. The semantic similarity between diseases $d_1$ and $d_2$ can be defined as:
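As a minimal sketch of this DAG-based scheme, assume the form widely used in the miRNA–disease literature. A disease contributes 1 to itself, and each ancestor receives the largest decayed contribution $\Delta \cdot D_d(t')$ over its children $t'$. The similarity of two diseases is then the summed contributions of their shared ancestors divided by the sum of all contributions in both DAGs. The function names, the `parents` dictionary layout, and the toy hierarchy are hypothetical, and the exact formula used by MFMDA may differ.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Map every node in the DAG of `disease` (itself plus ancestors) to its contribution."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = delta * contrib[node]
            # Keep the largest contribution reaching this ancestor over any path.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared-ancestor contributions divided by the total contributions of both DAGs."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))


# Hypothetical toy hierarchy: child -> list of direct parents.
parents = {"leaf_1": ["root"], "leaf_2": ["root"], "deep": ["leaf_1"]}
print(semantic_similarity("leaf_1", "leaf_2", parents))
```

In the toy hierarchy the two leaves share only the root, so their similarity is well below 1, while any disease compared with itself scores exactly 1.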

Among the various similarity measures, Gaussian similarity is effective and has been widely used in many fields. Let $Y(d_i)$ be the association profile of disease $d_i$ in $Y$, i.e., the $i$th column of $Y$. Then, the Gaussian similarity between diseases $d_i$ and $d_j$ is calculated as follows:

Where $\gamma_d$ is the adjustment parameter of the kernel bandwidth, which is calculated as follows:

Similarly, the Gaussian similarity between miRNAs can be defined as follows:
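A minimal sketch of this kernel, assuming the commonly used form $\exp(-\gamma \lVert p_i - p_j \rVert^2)$ in which the bandwidth $\gamma$ is a base value divided by the mean squared norm of all profiles. The base value `gamma_prime = 1.0`, the function name, and the toy matrix are assumptions for illustration only.

```python
import math


def gaussian_profile_similarity(profiles, gamma_prime=1.0):
    """Gaussian kernel between association profiles (one 0/1 vector per entity)."""
    n = len(profiles)
    # Normalise the bandwidth by the average squared profile norm
    # (assumes at least one profile is non-zero).
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    gamma = gamma_prime / mean_sq_norm
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist2 = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            K[i][j] = math.exp(-gamma * dist2)
    return K


# Toy association matrix: rows are miRNAs, columns are diseases (hypothetical values).
Y_toy = [[1, 0], [1, 1], [0, 0]]
mirna_kernel = gaussian_profile_similarity(Y_toy)                              # rows as profiles
disease_kernel = gaussian_profile_similarity([list(c) for c in zip(*Y_toy)])   # columns as profiles
```

The same function serves both sides: passing the rows gives the miRNA kernel, and passing the columns gives the disease kernel.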

To obtain a more comprehensive disease similarity, the semantic similarity of diseases is combined with the Gaussian interaction profile kernel similarity through the following piecewise function, which gives the final similarity between diseases:

Similarly, the similarity between miRNAs can also be redefined as:
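One common piecewise rule, shown here only as an assumption about what such a combination can look like, keeps the primary similarity (semantic for diseases, functional for miRNAs) wherever it is available and falls back to the Gaussian similarity where it is zero. The function name and the toy matrices are hypothetical.

```python
def integrate_similarity(primary, gaussian):
    """Entry-wise: use the primary similarity if it is non-zero, else the Gaussian one."""
    return [
        [p if p != 0 else g for p, g in zip(primary_row, gaussian_row)]
        for primary_row, gaussian_row in zip(primary, gaussian)
    ]


# Hypothetical toy matrices for two diseases.
semantic = [[1.0, 0.0], [0.0, 1.0]]   # no semantic similarity known off the diagonal
gaussian = [[1.0, 0.4], [0.4, 1.0]]
combined = integrate_similarity(semantic, gaussian)
print(combined)
```

Other variants, such as averaging the two scores when both exist, fit the same interface by changing only the expression inside the comprehension.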

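Matrix factorization (MF) is an effective technique that has been widely used in data representation. It aims to factorize a matrix $Y \in \mathbb{R}^{n \times m}$ into two low-rank matrices, that is, $W \in \mathbb{R}^{n \times k}$ and $H \in \mathbb{R}^{m \times k}$, such that $Y \approx WH^{T}$.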

where _{ij} = 0 if the entry (

The standard MF in Eq. 2 simply finds two matrices whose product approximates the original matrix; it ignores the similarities among miRNAs and among diseases. Intuitively, if two miRNAs have very similar functions, their learned latent vectors should lie close to each other in the low-dimensional space. The same idea applies on the disease side: if two diseases are similar, the distance between their latent vectors should also be small. These similarity constraints are therefore added to the factorization as regularization terms.

where $\lambda_l$, $\lambda_d$, and $\lambda_m$ are the regularization coefficients; $S^{m*}$ is the similarity between miRNAs; and $S^{d*}$ is the similarity between diseases.

To find a locally optimal solution of Eq. 3, we use a gradient descent algorithm. Using the properties of the Frobenius norm, the corresponding Lagrange function $L_{E}$ of Eq. 2 can be rewritten as:

where $L_m = D_m - S^{m*}$ and $L_d = D_d - S^{d*}$ are the graph Laplacian matrices for $S^{m*}$ and $S^{d*}$, respectively; and $D_m$ and $D_d$ are the diagonal matrices whose entries are the row (or column) sums of $S^{m*}$ and $S^{d*}$, respectively.
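The graph Laplacian of a similarity matrix is its diagonal degree matrix minus the matrix itself. The short sketch below builds it for a symmetric matrix stored as a list of rows; the function name and the toy values are illustrative assumptions.

```python
def graph_laplacian(S):
    """Return L = D - S, where D is diagonal and holds the row sums of S."""
    n = len(S)
    degree = [sum(row) for row in S]
    return [
        [(degree[i] if i == j else 0.0) - S[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 2 x 2 similarity matrix.
S_toy = [[1.0, 0.5], [0.5, 1.0]]
L_toy = graph_laplacian(S_toy)
print(L_toy)
```

Every row of a graph Laplacian sums to zero, which gives a quick sanity check on the construction.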

The partial derivatives of the above functions with respect to W and H are:
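For orientation, the block below writes out one regularized objective that is consistent with the description in this section, together with its partial derivatives. Its exact form is an assumption and is not necessarily the expression used by MFMDA: it uses squared Frobenius (Tikhonov) terms weighted by $\lambda_l$, a trace form for the two similarity penalties, and it omits the Lagrange multiplier terms for the non-negativity constraints. The derivatives use the fact that $L_m$ and $L_d$ are symmetric.

```latex
\begin{aligned}
L_E &= \lVert Y - WH^{T} \rVert_F^{2}
      + \lambda_l \left( \lVert W \rVert_F^{2} + \lVert H \rVert_F^{2} \right)
      + \lambda_m \operatorname{Tr}\!\left( W^{T} L_m W \right)
      + \lambda_d \operatorname{Tr}\!\left( H^{T} L_d H \right) \\[4pt]
\frac{\partial L_E}{\partial W} &= -2\,YH + 2\,WH^{T}H + 2\lambda_l W + 2\lambda_m L_m W \\[4pt]
\frac{\partial L_E}{\partial H} &= -2\,Y^{T}W + 2\,HW^{T}W + 2\lambda_l H + 2\lambda_d L_d H
\end{aligned}
```

Substituting $L_m = D_m - S^{m*}$ and $L_d = D_d - S^{d*}$ separates each gradient into a non-negative part and a part that is subtracted, which is the split that multiplicative update rules rely on.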

According to the Karush–Kuhn–Tucker (KKT) conditions, $\phi_{ik} w_{ik} = 0$ and $\psi_{jk} h_{jk} = 0$, which yields the following equations for $w_{ik}$ and $h_{jk}$:

Therefore, we get the update rules for $w_{ik}$ and $h_{jk}$ as follows:

The matrices $W$ and $H$ are updated based on Eq. 3 until convergence. Finally, we can obtain the predicted miRNA–disease association matrix as $Y^{*} = WH^{T}$. In principle, the miRNAs with the highest scores in $Y^{*}$ are more likely to be associated with the corresponding disease. The flow chart of MFMDA is shown in the accompanying diagram.
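The alternating updates can be sketched as follows. This is not the authors' implementation. It assumes the objective $\lVert Y - WH^{T} \rVert_F^2 + \lambda_l(\lVert W \rVert_F^2 + \lVert H \rVert_F^2) + \lambda_m \operatorname{Tr}(W^{T} L_m W) + \lambda_d \operatorname{Tr}(H^{T} L_d H)$ with non-negative factors, and it uses the standard multiplicative rules for graph-regularized non-negative factorization. The function names, default parameter values, the orientation (rows of `Y` are miRNAs), and the toy data are all hypothetical.

```python
import random


def matmul(A, B):
    """Matrix product of two list-of-lists matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def multiplicative_step(A, B, Y, S, lam_l, lam_s, eps=1e-12):
    """Update A in Y ~ A B^T with a graph penalty Tr(A^T (D - S) A) on A."""
    YB = matmul(Y, B)                            # data term (numerator)
    ABtB = matmul(A, matmul(transpose(B), B))    # reconstruction term (denominator)
    SA = matmul(S, A)                            # similarity-weighted neighbours
    degree = [sum(row) for row in S]             # diagonal of D
    return [[a * (YB[i][c] + lam_s * SA[i][c])
             / (ABtB[i][c] + lam_l * a + lam_s * degree[i] * a + eps)
             for c, a in enumerate(row)]
            for i, row in enumerate(A)]


def objective(Y, W, H, S_m, S_d, lam_l, lam_m, lam_d):
    """Fit error + Tikhonov terms + graph Laplacian terms."""
    R = matmul(W, transpose(H))
    fit = sum((y - r) ** 2 for y_row, r_row in zip(Y, R) for y, r in zip(y_row, r_row))
    l2 = sum(x * x for M in (W, H) for row in M for x in row)

    def graph(S, A):  # Tr(A^T (D - S) A) = 1/2 * sum_ij S_ij * ||a_i - a_j||^2
        return 0.5 * sum(S[i][j] * sum((p - q) ** 2 for p, q in zip(A[i], A[j]))
                         for i in range(len(A)) for j in range(len(A)))

    return fit + lam_l * l2 + lam_m * graph(S_m, W) + lam_d * graph(S_d, H)


def factorize(Y, S_m, S_d, k=2, lam_l=0.1, lam_m=0.2, lam_d=0.2, iters=200, seed=0):
    """Alternate the two multiplicative updates from a random non-negative start."""
    rng = random.Random(seed)
    W = [[rng.random() for _ in range(k)] for _ in Y]
    H = [[rng.random() for _ in range(k)] for _ in Y[0]]
    Yt = transpose(Y)
    for _ in range(iters):
        W = multiplicative_step(W, H, Y, S_m, lam_l, lam_m)
        H = multiplicative_step(H, W, Yt, S_d, lam_l, lam_d)
    return W, H


# Hypothetical toy data: 4 miRNAs x 3 diseases.
Y_mf = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
S_m = [[1, .8, .1, .2], [.8, 1, .1, .1], [.1, .1, 1, .7], [.2, .1, .7, 1]]
S_d = [[1, .2, .5], [.2, 1, .3], [.5, .3, 1]]

W0, H0 = factorize(Y_mf, S_m, S_d, iters=0)   # random initialisation only
W, H = factorize(Y_mf, S_m, S_d)
scores = matmul(W, transpose(H))              # predicted association scores
```

Multiplicative updates keep every entry non-negative as long as the starting values are non-negative, which is why no explicit projection step appears in the loop. A fixed iteration count stands in for a convergence test here; in practice one would stop when the change in the objective falls below a tolerance.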

Diagram of MFMDA for predicting potential miRNA–disease associations.

There are many performance indicators for evaluating prediction models. In this field, the ROC curve and AUC value, and the PR curve and AUPR value, are usually used to evaluate the performance of an algorithm.

The ROC (receiver operating characteristic) curve, also called the sensitivity curve, is a comprehensive indicator reflecting both sensitivity and specificity, and it graphically reveals the trade-off between them. By setting different thresholds, a series of sensitivities and specificities is calculated, and the curve is plotted with the true positive rate (TPR) on the ordinate and the false positive rate (FPR) on the abscissa. For a binary classification problem (only positive and negative samples), TPR and FPR are calculated as shown in Eq. 15:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} \tag{15}$$

TP refers to the number of positive samples that are correctly predicted as positive; FP refers to the number of negative samples that are incorrectly predicted as positive; TN refers to the number of negative samples that are correctly predicted as negative; and FN refers to the number of positive samples that are incorrectly predicted as negative. The area under the ROC curve is the AUC. The more the ROC curve bulges toward the upper left corner, the larger the AUC value and the better the prediction performance. The AUC value is generally between 0.5 and 1: a value of 0.5 corresponds to random prediction, and a value of 1 corresponds to a perfect classifier that correctly separates all positive and negative samples.
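The threshold sweep can be made concrete with a short sketch. It sorts the samples by predicted score, moves the threshold down one distinct score at a time, records the resulting (FPR, TPR) points, and integrates them with the trapezoidal rule. The function names and the toy scores are illustrative assumptions.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for idx, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores: 1 = known association, 0 = unlabeled pair.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
print(auc(roc_points(toy_scores, toy_labels)))
```

In the toy example three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.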

The PR curve is obtained by calculating a series of precision and recall values at different thresholds and plotting precision on the ordinate against recall on the abscissa. Precision and recall are calculated as in Eq. 16:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{16}$$

The PR curve reflects the trade-off between precision and recall. The area under the PR curve is the AUPR. The larger the AUPR value, the better the performance.
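A matching sketch for the PR curve is shown below. The area is computed step-wise (precision multiplied by the increase in recall at each threshold), which is one common convention and an assumption here; trapezoidal interpolation would give slightly different values. The function names and the toy scores are hypothetical.

```python
def pr_points(scores, labels):
    """Return (recall, precision) points obtained by lowering the threshold step by step."""
    pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    points = []
    for idx, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample of a tied score group.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            points.append((tp / pos, tp / (tp + fp)))
    return points


def aupr(points):
    """Step-wise area: precision times the recall gained at each threshold."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Hypothetical scores: 1 = known association, 0 = unlabeled pair.
pr_scores = [0.9, 0.8, 0.7, 0.6]
pr_labels = [1, 0, 1, 0]
print(aupr(pr_points(pr_scores, pr_labels)))
```

Because precision depends on the ratio of positives to negatives, the AUPR is more sensitive than the AUC when known associations are rare.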

We further compared the prediction performance of the MFMDA model with four benchmark prediction models (i.e., LRMCMDA, IMCMDA, NCPMDA, and RLSMDA). LRMCMDA and IMCMDA are matrix completion algorithms that have achieved good predictive performance in this field. NCPMDA is a network projection algorithm and one of the representative network-based prediction algorithms. RLSMDA is a semi-supervised learning method based on the Regularized Least Squares (RLS) framework and is a representative of the machine learning-based algorithms. Since the data used in this study all come from the public data set HMDD V2.0, all comparison algorithms use the parameter settings given by their original authors.

We applied MFMDA, LRMCMDA, IMCMDA, NCPMDA, and RLSMDA to the HMDD V2.0 miRNA–disease association data, which contain 5430 unique associations between 495 miRNAs and 383 diseases, and plotted their ROC curves under global fivefold CV.

Comparison of MFMDA with four best performers for miRNA–disease associations.
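The global fivefold CV protocol can be sketched as follows, under the usual assumption that the known associations are shuffled, split into five folds, and hidden one fold at a time while the model is trained on the rest. The function name, the fixed seed, and the toy matrix are hypothetical.

```python
import random


def fivefold_splits(Y, n_folds=5, seed=0):
    """Yield (train_matrix, held_out_pairs) for each fold of the known associations."""
    known = [(i, j) for i, row in enumerate(Y) for j, v in enumerate(row) if v == 1]
    random.Random(seed).shuffle(known)
    for fold in range(n_folds):
        held_out = known[fold::n_folds]
        train = [row[:] for row in Y]      # copy so the original matrix is untouched
        for i, j in held_out:
            train[i][j] = 0                # hide the test associations
        yield train, held_out


# Hypothetical toy matrix with six known associations.
Y_cv = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
folds = list(fivefold_splits(Y_cv))
for train, held_out in folds:
    print(len(held_out), "held out;", sum(map(sum, train)), "left for training")
```

Each held-out pair is then scored by a model trained on `train` and ranked against the unlabeled pairs to build the ROC and PR curves.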

However, because the number of known, experimentally verified miRNA–disease associations is limited, AUC alone is not sufficient to evaluate the performance of prediction methods. Therefore, we also report the precision–recall (PR) curve and the AUPR.

Finding the miRNAs related to a new disease greatly helps in understanding its pathogenesis. Therefore, we performed a $CV_{d}$ experiment to test the performance of MFMDA in predicting miRNAs associated with a novel disease. In $CV_{d}$ (CV on disease $d_i$), we remove all known miRNA associations of disease $d_i$ (the corresponding column vector of the association matrix $Y$) and build the prediction model on the remaining data to infer the deleted associations.

Comparison between MFMDA and benchmark algorithms based on local LOOCV.
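A minimal sketch of the local leave-one-disease-out setting: every known association of one disease is blanked out, so that the disease looks new to the model, and the hidden column is kept as ground truth for evaluation. The function name, the column orientation (diseases as columns), and the toy matrix are assumptions.

```python
def leave_one_disease_out(Y, disease_index):
    """Return (train_matrix, hidden_column) with all associations of one disease removed."""
    hidden = [row[disease_index] for row in Y]
    train = [row[:] for row in Y]          # copy so the original matrix is untouched
    for row in train:
        row[disease_index] = 0
    return train, hidden


# Hypothetical toy matrix: rows are miRNAs, columns are diseases.
Y_loo = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
train_loo, hidden_loo = leave_one_disease_out(Y_loo, disease_index=1)
print(train_loo, hidden_loo)
```

Because the blanked disease has no remaining associations, any prediction for it has to come from its similarity to other diseases, which is what this setting is designed to test.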

Finally, we explored the effect of disease similarity and miRNA similarity on prediction performance. Specifically, we performed global fivefold CV with the parameters $\lambda_m$ and $\lambda_d$ varied from 0.2 to 1 in steps of 0.2.

Prediction AUCs of MFMDA at different choices of parameters.

| MFMDA | $\lambda_m = \lambda_d = 0.2$ | $\lambda_m = \lambda_d = 0.4$ | $\lambda_m = \lambda_d = 0.6$ | $\lambda_m = \lambda_d = 0.8$ | $\lambda_m = \lambda_d = 1$ |
|---|---|---|---|---|---|
| AUC | 0.9061 | 0.9058 | 0.9013 | 0.8924 | 0.8912 |

Next, case studies on three diseases were conducted to further validate the ability of MFMDA to discover new miRNA–disease pairs. We first use the verified associations in HMDD V2.0 as training samples. For each disease of interest, the miRNAs without a verified association are ranked according to their predicted scores. Then, the top-ranked candidates are checked against three other well-known databases: dbDEMC 2.0, miR2Disease, and HMDD V3.0.
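The ranking step of a case study can be sketched as follows: for one disease, miRNAs that already have a verified association in the training data are dropped, and the remaining candidates are sorted by predicted score. The function name, the placeholder miRNA names, and the scores are hypothetical.

```python
def rank_candidates(scores, known, mirna_names, top_k=10):
    """Rank miRNAs without a known association to the disease by predicted score."""
    candidates = [
        (score, name)
        for score, is_known, name in zip(scores, known, mirna_names)
        if is_known == 0
    ]
    candidates.sort(key=lambda item: -item[0])
    return [name for _, name in candidates[:top_k]]


# Hypothetical predicted scores for one disease (one column of the score matrix).
names = ["mirna_a", "mirna_b", "mirna_c", "mirna_d"]
predicted = [0.9, 0.2, 0.7, 0.4]
already_known = [1, 0, 0, 0]       # mirna_a is already in the training data
top = rank_candidates(predicted, already_known, names, top_k=2)
print(top)
```

The resulting list is what would be looked up in the external databases to count confirmed predictions.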

Endometrial cancer is a group of epithelial malignant tumors that arise in the endometrium, and it occurs mostly in perimenopausal and postmenopausal women. It is one of the most common tumors of the female reproductive system, with nearly 200,000 new cases each year, and it is the third most common cause of death among gynecological malignant tumors. Earlier studies have shown that the differential expression of miRNAs in endometrial adenocarcinoma can play a key auxiliary role in its diagnosis and treatment.

The top 10 potential miRNA candidates detected by MFMDA for endometrial neoplasms.

| Cancer | No. of confirmed miRNAs | Rank | miRNA | Evidence | Rank | miRNA | Evidence |
|---|---|---|---|---|---|---|---|
| Endometrial neoplasms | 9 | 1 | hsa-mir-146a | HMDD V3.0 | 6 | hsa-mir-34a | HMDD V3.0 |
| | | 2 | hsa-mir-221 | Unconfirmed | 7 | hsa-mir-29a | HMDD V3.0 |
| | | 3 | hsa-mir-20a | HMDD V3.0 | 8 | hsa-mir-145 | HMDD V3.0 |
| | | 4 | hsa-mir-17 | HMDD V3.0 | 9 | hsa-mir-15a | HMDD V3.0 |
| | | 5 | hsa-mir-16 | HMDD V3.0 | 10 | hsa-mir-29b | HMDD V3.0 |

In the second case study, we again chose a tumor with a high incidence in women, namely, breast tumor. Breast tumors are malignant tumors that arise in the epithelial tissue of the mammary glands. Currently, treatment is mainly based on clinical and pathological features, with targeted and personalized therapy as the ultimate goals. Related studies have shown that the occurrence of breast tumors is also related to abnormalities of certain miRNAs. For example, an abnormal increase in miR-22 may promote the occurrence and metastasis of breast cancer and lead to a higher degree of tumor malignancy. Therefore, predicting miRNAs related to breast tumors computationally can also support breast cancer treatment. As shown in the table below, all of the top 10 predicted miRNAs were confirmed by dbDEMC 2.0.

The top 10 potential miRNA candidates detected by MFMDA for breast neoplasms.

| Cancer | No. of confirmed miRNAs | Rank | miRNA | Evidence | Rank | miRNA | Evidence |
|---|---|---|---|---|---|---|---|
| Breast neoplasms | 10 | 1 | hsa-mir-150 | dbDEMC 2.0 | 6 | hsa-mir-130a | dbDEMC 2.0 |
| | | 2 | hsa-mir-142 | dbDEMC 2.0 | 7 | hsa-mir-99a | dbDEMC 2.0 |
| | | 3 | hsa-mir-15b | dbDEMC 2.0 | 8 | hsa-mir-196b | dbDEMC 2.0 |
| | | 4 | hsa-mir-106a | dbDEMC 2.0 | 9 | hsa-mir-378a | dbDEMC 2.0 |
| | | 5 | hsa-mir-192 | dbDEMC 2.0 | 10 | hsa-mir-212 | dbDEMC 2.0 |

Finally, we conducted a prediction study on miRNAs associated with lung tumors. Lung cancer is among the malignant tumors with the fastest-growing morbidity and mortality, and it is one of the greatest threats to the health and life of the population. In the past 50 years, many countries have reported that the incidence and mortality of lung cancer have increased significantly. In men, both the incidence and the mortality of lung cancer rank first among all malignant tumors; in women, both rank second. Despite the important therapeutic value of chemotherapy, surgery is still the only way to cure lung cancer. There is an urgent need to find potential biomarkers that respond strongly to clinical observations. Researchers have found that the expression level of miR-99a is related to the clinicopathological factors of lung cancer and to lymph node metastasis. Identifying more miRNAs related to lung cancer helps to assess clinical outcomes accurately. Therefore, we conducted a lung cancer case study based on MFMDA. In the prediction list, nine of the top 10 predicted miRNAs were confirmed to be associated with lung tumors (see the table below).

The top 10 potential miRNA candidates detected by MFMDA for lung neoplasms.

| Cancer | No. of confirmed miRNAs | Rank | miRNA | Evidence | Rank | miRNA | Evidence |
|---|---|---|---|---|---|---|---|
| Lung neoplasms | 9 | 1 | hsa-mir-16 | miR2Disease | 6 | hsa-mir-141 | miR2Disease |
| | | 2 | hsa-mir-122 | dbDEMC 2.0 | 7 | hsa-mir-195 | miR2Disease |
| | | 3 | hsa-mir-15a | dbDEMC 2.0 | 8 | hsa-mir-429 | miR2Disease |
| | | 4 | hsa-mir-15b | Unconfirmed | 9 | hsa-mir-23b | dbDEMC 2.0 |
| | | 5 | hsa-mir-106b | dbDEMC 2.0 | 10 | hsa-mir-20b | dbDEMC 2.0 |

For a clear view, we illustrate the top 10 predicted associations for the three diseases as a network.

Network of the top 10 predicted associations for the three diseases via MFMDA.

A large number of studies have shown that miRNAs play an increasingly important role in many physiological processes. Researchers are trying to identify disease-related miRNAs as valuable biomarkers that can be used for clinical measurement, diagnosis, prognosis, and treatment. Therefore, accurately inferring potential miRNAs related to diseases can help us study the pathogenesis of diseases and find more effective treatments. In this study, we proposed a mathematical model based on MF (MFMDA) to identify potential miRNA–disease associations. First, MFMDA not only uses known miRNA–disease association data, but also integrates the similarities between miRNAs and between diseases. Second, the model is semi-supervised and does not rely on negative samples. Finally, in solving the model, we use an alternating gradient descent algorithm to find the optimal solution and obtain a stable decomposition. Experimental results show that, compared with other methods, MFMDA can effectively improve performance and is a powerful tool for discovering potential associations between diseases and miRNAs. However, the method still has some limitations and needs further optimization. For example, the similarity measures between diseases and between miRNAs used by MFMDA draw on a limited set of data sources and may not be the best choice. How to integrate multiple types of omics information more effectively to improve prediction performance is also worthy of further research.

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding authors.

SH and RC designed the study. PS collected and wrote the manuscript. SY and YC reviewed the manuscript. All authors contributed to the article and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.