Edited by: Shikui Tu, Shanghai Jiao Tong University, China

Reviewed by: Xiaofei Zhang, Central China Normal University, China; Chen Zheng, Michigan State University, United States; Minzhu Xie, Hunan Normal University, China

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Identifying drug-disease associations is integral to drug development. Computationally prioritizing candidate drug-disease associations has attracted growing attention because it reduces the cost of laboratory screening. Drug-disease associations involve different association types, such as drug indications and drug side effects. However, existing models for predicting drug-disease associations concentrate on independent tasks: recommending novel indications to benefit drug repositioning, predicting potential side effects to prevent drug-induced risk, or only determining whether a drug-disease association exists. They ignore crucial prior knowledge of the correlations between different association types. Since the Comparative Toxicogenomics Database (CTD) annotates drug-disease associations as therapeutic or marker/mechanism, we consider predicting these two types of association. To this end, we propose a collective matrix factorization-based multi-task learning method (CMFMTL). CMFMTL treats the problem as multi-task learning in which each task is to predict one type of association, and the two tasks complement and improve each other by capturing the relatedness between them. First, drug-disease associations are represented as a bipartite network with two types of links representing therapeutic effects and non-therapeutic effects. Then, CMFMTL approximates the association matrix of each link type by matrix tri-factorization and shares the low-dimensional latent representations of drugs and diseases between the two related tasks for collective learning. Finally, CMFMTL puts the two tasks into a unified framework, and an efficient algorithm is developed to solve the resulting optimization problem. In the computational experiments, CMFMTL outperforms several state-of-the-art methods in both tasks. Moreover, case studies show that CMFMTL helps discover novel drug-disease associations that are not included in CTD and simultaneously predicts their association types.

Drugs are chemicals used to treat, cure, prevent, or diagnose diseases. The development of a new drug has three steps: discovery stage, preclinical stage, and clinical stage (Wilson,

Recently, a large number of computational methods have been proposed for the drug-disease association prediction. Gottlieb et al. (

The existing models for predicting drug-disease associations only focus on indication prediction or side effect prediction, but ignore the relatedness of the two tasks, which is vital for knowledge of drug-disease associations. Despite the fact that some studies (Yang and Agarwal,

In this paper, we propose a collective matrix factorization-based multi-task learning method (abbreviated as “CMFMTL”) to predict two types of drug-disease associations. From the CTD database, we collect drug-disease associations annotated as therapeutic or marker/mechanism (non-therapeutic), and then construct a drug-disease network with two types of links representing therapeutic effects and non-therapeutic effects. CMFMTL approximates the association matrix of each link type by matrix tri-factorization and shares the low-dimensional latent representations of drugs and diseases between the two related tasks for collective learning. We also develop an efficient algorithm to solve the proposed model. In the computational experiments, CMFMTL outperforms several state-of-the-art methods in both tasks. Moreover, case studies show that CMFMTL helps discover novel drug-disease associations that are not included in CTD and simultaneously predicts their association types.

The Comparative Toxicogenomics Database (CTD) (Davis et al.,

As described above, chemical-disease associations in CTD are annotated as therapeutic or marker/mechanism. Therapeutic associations mean that chemicals play a therapeutic role in diseases, while marker/mechanism associations mean that chemicals correlate with diseases. In this study, we label these associations as therapeutic associations or non-therapeutic (marker/mechanism) associations accordingly. Very few associations are annotated with both types; since they are too few to change the statistical properties of the data, we label these cases as therapeutic associations only. Finally, the benchmark dataset contains 18,416 drug-disease associations involving 269 drugs and 598 diseases. Among these, 6,244 are therapeutic associations and 12,172 are non-therapeutic associations.

Let

A feature of a drug is a collection of entities or attributes related to the drug. Thus, we can use the Tanimoto score (Tanimoto, _{i} and Γ_{j} denote features of two drugs, the Jaccard index is described as:

Let _{i} is encoded as _{i}; otherwise, it is set to zero. Obviously, the Equation (1) can be rewritten as:
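As a minimal sketch of the Jaccard (Tanimoto) index in both forms discussed here, the following Python computes the score first from two feature sets and then from their binary encodings. The function names and the toy feature identifiers are illustrative assumptions, and returning 0.0 for two empty feature sets is a convention chosen for this sketch rather than something stated in the text.

```python
def jaccard_sets(features_i, features_j):
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature collections."""
    a, b = set(features_i), set(features_j)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty feature sets
    return len(a & b) / len(union)


def jaccard_binary(x_i, x_j):
    """The same index computed from 0/1 vectors of equal length."""
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    dot = sum(p * q for p, q in zip(x_i, x_j))   # |A ∩ B|
    denom = sum(x_i) + sum(x_j) - dot            # |A| + |B| - |A ∩ B|
    return dot / denom if denom else 0.0


# Hypothetical feature identifiers for two drugs.
drug_a = {"f1", "f2", "f3"}
drug_b = {"f2", "f3", "f4"}

# Encode each drug as a binary vector over the shared feature vocabulary.
vocabulary = sorted(drug_a | drug_b)
vec_a = [1 if f in drug_a else 0 for f in vocabulary]
vec_b = [1 if f in drug_b else 0 for f in vocabulary]

print(jaccard_sets(drug_a, drug_b))   # 0.5
print(jaccard_binary(vec_a, vec_b))   # 0.5
```

Both calls give the same value because the dot product of two binary vectors counts the shared features, which is why the set form can be rewritten in vector form.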

As described in Wang et al. (_{d} = (_{d}, _{d}), where _{d} is the set of all ancestors of _{d} is the set of links from ancestor disease to their children. The semantic contribution of disease _{d} to disease _{i} and _{j} is calculated by:
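The DAG-based semantic similarity can be sketched as follows. This is an illustrative reading of the scheme attributed to Wang et al., not the authors' code: the `parents` mapping, the toy hierarchy, and the decay factor of 0.5 are assumptions made for this example.

```python
def semantic_contributions(disease, parents, decay=0.5):
    """Return {term: contribution} for `disease` and all of its ancestors.

    The disease contributes 1.0 to itself; each ancestor receives the largest
    decayed contribution over all paths leading down to the disease.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        term = frontier.pop()
        for parent in parents.get(term, ()):
            value = decay * contrib[term]
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)  # re-propagate the improved value upward
    return contrib


def semantic_similarity(d_i, d_j, parents, decay=0.5):
    """Shared-ancestor contributions divided by the total semantic values."""
    c_i = semantic_contributions(d_i, parents, decay)
    c_j = semantic_contributions(d_j, parents, decay)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))


# Hypothetical three-level hierarchy: two sibling diseases under one parent.
parents = {"child_a": ["parent"], "child_b": ["parent"], "parent": ["root"]}

print(semantic_similarity("child_a", "child_b", parents))
print(semantic_similarity("child_a", "child_a", parents))
```

Two diseases that share many close ancestors obtain a score near 1, and a disease compared with itself obtains exactly 1 because every term in its DAG is shared.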

Multi-task learning is an inductive transfer learning approach that captures the connections amongst multiple related learning tasks as an inductive bias by a specific shared mechanism (Ando and Zhang,

The workflow of the collective matrix factorization-based multi-task learning method (CMFMTL) is demonstrated in

Workflow of collective matrix factorization-based multi-task learning method (CMFMTL): ^{p} is the corresponding binary matrix for the therapeutic subnetwork; ^{n} is the corresponding binary matrix for non-therapeutic subnetwork; ^{m×k} and ^{n×k} are, respectively, the low-dimensional representations for drugs and diseases; ^{p} and ^{n} are coefficient matrices.

Given a set of drugs _{i} and disease _{j} is labeled as therapeutic link if the drug _{i} has a therapeutic effect on the disease _{j}; the edge is labeled as a non-therapeutic link if the drug _{i} has a non-therapeutic effect on the disease _{j}. Then the drug-disease association network ^{p} ∈ {0, 1}^{m×n} is the corresponding binary matrix for _{i} has a therapeutic link to the disease _{j}, otherwise ^{n} ∈ {0, 1}^{m×n} is the corresponding binary matrix for _{i} has a non-therapeutic link to the disease _{j}, otherwise ^{p} and ^{n}, respectively, and map the drugs (diseases) into common latent representations shared in two tasks. Specifically, we approximate the association matrices ^{p} and ^{n} by minimizing the reconstruction errors:
^{m×k} and ^{n×k} are the low-dimensional representations for drugs and diseases, respectively; ^{p} and ^{n} are coefficient matrices which model how the latent representations interact in the respective association type;
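To make the shared-factor idea concrete, here is a small pure-Python sketch of the two reconstruction errors. The names `u`, `v`, `s_p`, and `s_n` are placeholders chosen for this example (drug factors of size m×k, disease factors of size n×k, and one k×k coefficient matrix per association type), and the toy data are hypothetical. The sketch only evaluates the error; it does not perform the optimization.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def squared_frobenius(a, b):
    """Squared Frobenius norm of a - b."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def build_association_matrices(m, n, therapeutic, non_therapeutic):
    """Binary m x n matrices; a pair annotated with both types is kept
    as therapeutic only, following the dataset rule in the text."""
    a_p = [[0] * n for _ in range(m)]
    a_n = [[0] * n for _ in range(m)]
    for i, j in therapeutic:
        a_p[i][j] = 1
    for i, j in non_therapeutic:
        if not a_p[i][j]:
            a_n[i][j] = 1
    return a_p, a_n


def reconstruction_error(a_p, a_n, u, v, s_p, s_n):
    """Sum of both tri-factorization errors; u and v are shared by the two tasks."""
    v_t = transpose(v)
    pred_p = matmul(matmul(u, s_p), v_t)
    pred_n = matmul(matmul(u, s_n), v_t)
    return squared_frobenius(a_p, pred_p) + squared_frobenius(a_n, pred_n)


# Toy example: 2 drugs, 3 diseases, latent dimension 2.
a_p, a_n = build_association_matrices(2, 3, [(0, 0)], [(1, 2), (0, 0)])
u = [[1, 0], [0, 1]]            # drug representations (m x k)
v = [[1, 0], [0, 0], [0, 1]]    # disease representations (n x k)
s_p = [[1, 0], [0, 0]]          # coefficients for therapeutic links
s_n = [[0, 0], [0, 1]]          # coefficients for non-therapeutic links
error = reconstruction_error(a_p, a_n, u, v, s_p, s_n)
print(error)
```

Because the same `u` and `v` appear in both terms, any update that improves one task's fit also moves the representations used by the other task, which is the mechanism by which the two tasks inform each other.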

Since Equation (6) maps drugs and diseases into a low-dimensional space, a natural idea occurs that the low-dimensional representations should preserve the underlying interconnection information of drugs and diseases. Studies on manifold learning (Belkin et al., ^{r} ∈ ℝ^{m×m} where the (^{d} ∈ ℝ^{n×n} where the (^{U} = ^{r} − ^{r} and ^{V} = ^{d} − ^{d}, where ^{r} and ^{d} are, respectively, diagonal matrices whose diagonal elements are corresponding row sums of ^{r} and ^{d}. The graph Laplacian regularizations are formulated as:
_{i} (the disease _{i}) is closer to the drug _{j} (the disease _{j}) in the low-dimensional space if the similarity between them _{2} regularizations to reinforce the smoothness of ^{p}, and ^{n}. Therefore, we obtain the optimization objective of the CMFMTL by combining the _{2} regularizations, Equations (6) and (7):
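The graph Laplacian regularization described above can be illustrated with a short sketch. It builds the Laplacian as the diagonal row-sum matrix minus the similarity matrix and checks numerically that the trace form equals the weighted sum of pairwise squared distances between latent vectors. The similarity values, latent vectors, and function names are hypothetical, and the similarity matrix is assumed to be symmetric.

```python
def laplacian(w):
    """L = D - W, with D the diagonal matrix of row sums of W."""
    size = len(w)
    return [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(size)]
            for i in range(size)]


def trace_form(u, lap):
    """tr(U^T L U), computed as sum_ij L_ij * <u_i, u_j>."""
    size = len(u)
    return sum(lap[i][j] * sum(a * b for a, b in zip(u[i], u[j]))
               for i in range(size) for j in range(size))


def pairwise_form(u, w):
    """0.5 * sum_ij W_ij * ||u_i - u_j||^2."""
    size = len(u)
    return 0.5 * sum(w[i][j] * sum((a - b) ** 2 for a, b in zip(u[i], u[j]))
                     for i in range(size) for j in range(size))


# Hypothetical symmetric similarity matrix for three drugs.
w = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
# Hypothetical 2-dimensional latent representations, one row per drug.
u = [[1.0, 0.0],
     [0.9, 0.1],
     [0.0, 1.0]]

print(trace_form(u, laplacian(w)))
print(pairwise_form(u, w))
```

The two printed values agree up to floating-point error, which shows why minimizing the trace term pulls the latent vectors of highly similar drugs (or diseases) closer together.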

To efficiently solve problem (8), we equivalently convert it into an equality-constrained optimization problem:
ρ_{1} > 0, ρ_{2} > 0 are called the penalty parameters;

Next, differentiating ^{p} is simplified as:
^{p}). The objective function with regard to ^{n} shares the same optimization structure with the Equation (12), and thus we denote the solution as ^{n}).

Finally, the Lagrange multipliers and the penalty parameters are updated as follows:
^{2} + ^{3}) time. We set the maximal iterative number in conjugate gradient procedure as ^{3} + ^{3} + ^{3}), several matrix multiplications [in Equation (11) and the initialization for conjugate gradient procedure] that cost ^{2}^{2}^{2} + ^{2} + ^{3} + ^{2} + ^{3})

The update process of CMFMTL.

^{p} ∈ {0, 1}^{m×n};^{n} ∈ {0, 1}^{m×n};^{r} ∈ ℝ^{m×m};^{d} ∈ ℝ^{n×n};^{Ap*}, ^{An*}^{n×k} and ^{m×k} in the interval [0, 1] randomly; _{1} = ρ_{2} = 1^{p} and ^{n} using^{p} = ^{p}), ^{n} = ^{n})_{1} and ρ_{2} via the equation (13)^{Ap*}, ^{An*} using^{Ap*} = ^{p}^{T},^{An*} = ^{n}^{T} |

In our experiments, 5-fold cross validation (5-CV) is conducted to systematically evaluate the prediction models. Since models are assessed in two tasks, where predicting therapeutic drug-disease associations is called task 1 and predicting non-therapeutic associations is called task 2, we separately split the known therapeutic associations and the known non-therapeutic associations into five equal-sized parts at random. In each task, each of the five subsets is used in turn as the testing set, and the remaining four subsets are combined as the training set. The metrics are calculated in each fold, and the average over the five folds is reported.
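The splitting procedure can be sketched as below. The function name, the fixed seed, and the toy index pairs are assumptions made for illustration; the same routine would be applied separately to the therapeutic and the non-therapeutic associations.

```python
import random


def five_fold_splits(pairs, n_folds=5, seed=0):
    """Yield (train, test) lists; each fold serves as the test set exactly once."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Striding over the shuffled list gives folds whose sizes differ by at most one.
    folds = [shuffled[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        train = [p for idx, fold in enumerate(folds) if idx != k for p in fold]
        yield train, folds[k]


# Hypothetical (drug index, disease index) pairs for the two association types.
therapeutic = [(i, j) for i in range(2) for j in range(5)]
non_therapeutic = [(i, j) for i in range(2, 6) for j in range(5)]

task1 = five_fold_splits(therapeutic, seed=1)
task2 = five_fold_splits(non_therapeutic, seed=2)
for fold, ((train1, test1), (train2, test2)) in enumerate(zip(task1, task2), start=1):
    print(fold, len(train1), len(test1), len(train2), len(test2))
```

Pairing the k-th fold of each task, as the loop does, means that in every round both types of held-out associations are hidden from the model at the same time.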

Several evaluation metrics, such as sensitivity (SE, also known as recall), specificity (SP), accuracy (ACC), precision (PRE), and F-measure (F), are calculated. Since they depend on a threshold to classify predictions as positive or negative, we adopt the threshold that yields the maximum F-measure. Moreover, the area under the receiver-operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR) are adopted as the primary metrics.
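A minimal sketch of the threshold-dependent metrics follows, with the threshold chosen to maximize the F-measure. The function names and toy scores are hypothetical, and treating a score greater than or equal to the threshold as a positive prediction is an assumption of this example. The search tries every distinct score, which is quadratic in the number of predictions and only meant for illustration.

```python
def confusion_metrics(labels, scores, threshold):
    """SE, SP, PRE, ACC and F when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        if s >= threshold:
            if y:
                tp += 1
            else:
                fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    pre = tp / (tp + fp) if tp + fp else 0.0
    acc = (tp + tn) / len(labels)
    f = 2 * pre * se / (pre + se) if pre + se else 0.0
    return {"SE": se, "SP": sp, "PRE": pre, "ACC": acc, "F": f}


def best_f_threshold(labels, scores):
    """Return the distinct score that gives the largest F-measure."""
    return max(sorted(set(scores)),
               key=lambda t: confusion_metrics(labels, scores, t)["F"])


# Toy predictions: 1 marks a held-out true association, 0 an unknown pair.
labels = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

threshold = best_f_threshold(labels, scores)
metrics = confusion_metrics(labels, scores, threshold)
print(threshold, metrics)
```

On this toy data the selected threshold is 0.7, where both true associations are recovered at the cost of one false positive.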

The collective matrix factorization-based multi-task learning method (CMFMTL) has four key parameters: the dimensionality of the common latent space _{1} and ρ_{2} in Equation (13), we set μ = 1.1. By grid-search, we obtain the best results with an AUPR of 0.2122 in task 1 when α = β = 8, λ = 4 and _{2} regularization coefficient λ may control the trade-off between the two tasks, e.g., greater λ produces better performance in task 2 than task 1. When the dimensionality

Influence of parameters on the performance of CMFMTL involving two tasks:

As discussed above, CMFMTL is a multi-task learning method that simultaneously predicts therapeutic and non-therapeutic associations between drugs and diseases. Existing methods predict only one type of drug-disease association, such as drug indications or side effects. For this reason, we run each of several association prediction methods separately on the two tasks and then compare their performance with that of our proposed CMFMTL model.

Here, we consider three state-of-the-art association prediction methods: TL-HGBI, LRSSL, and DRRS, which are the classic or latest works of predicting drug-disease associations. TL-HGBI (Wang et al., ^{p} without decomposing ^{n} in task 1. We also retain the graph regularizations and _{2} regularizations, and use the same algorithm and parameter setting in CMFMTL-R as in CMFMTL for fair comparison.

All methods are evaluated by 5-CV, and results are shown in

Performances of Prediction Models in Task 1.

Method | AUPR | AUC | SE | SP | PRE | ACC | F |
CMFMTL | 0.2122 | 0.8898 | 0.2888 | 0.9926 | 0.2544 | 0.9866 | 0.2690 |
CMFMTL-R | 0.1217 | 0.8543 | 0.2135 | 0.9905 | 0.1644 | 0.9839 | 0.1849 |
TL-HGBI | 0.0444 | 0.7444 | 0.1265 | 0.9827 | 0.0624 | 0.9753 | 0.0808 |
LRSSL | 0.0420 | 0.7341 | 0.1489 | 0.9745 | 0.0490 | 0.9674 | 0.0731 |
DRRS | 0.1735 | 0.8893 | 0.2756 | 0.9917 | 0.2292 | 0.9856 | 0.2468 |

Performances of Prediction Models in Task 2.

Method | AUPR | AUC | SE | SP | PRE | ACC | F |
CMFMTL | 0.1838 | 0.8661 | 0.3091 | 0.9798 | 0.2091 | 0.9686 | 0.2473 |
CMFMTL-R | 0.1465 | 0.8449 | 0.2623 | 0.9798 | 0.1812 | 0.9679 | 0.2139 |
TL-HGBI | 0.0635 | 0.7469 | 0.1839 | 0.9653 | 0.0840 | 0.9523 | 0.1140 |
LRSSL | 0.0606 | 0.7393 | 0.1812 | 0.9644 | 0.0801 | 0.9514 | 0.1106 |
DRRS | 0.1150 | 0.8570 | 0.3105 | 0.9690 | 0.1454 | 0.9580 | 0.1979 |

In practical applications, one may be concerned about how many true associations the predictive models can recover among highly ranked predictions. We therefore evaluate the capabilities of all models for top-N predictions. Recall that, in one fold of 5-CV, we randomly select 20% of the known therapeutic associations and 20% of the known non-therapeutic associations and remove them. We then investigate the recall and precision of all models over the top predictions, ranging from top 10 to top 1,000 (in steps of 10), and the results are shown in

Top-N ranked recall and precision of all methods in two tasks:
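The top-N evaluation can be sketched as follows. The function name, the drug and disease labels, and the scores are hypothetical; `held_out` stands for the associations removed in the test fold of one task.

```python
def top_n_recall_precision(scored_pairs, held_out, n):
    """Recall and precision of the n highest-scoring candidate pairs.

    scored_pairs: iterable of ((drug, disease), score) for candidate pairs.
    held_out: set of true associations hidden from the model.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:n]
    hits = sum(1 for pair, _ in ranked if pair in held_out)
    recall = hits / len(held_out) if held_out else 0.0
    precision = hits / len(ranked) if ranked else 0.0
    return recall, precision


# Toy candidate scores and held-out associations.
scored = [(("d1", "x"), 0.9), (("d2", "y"), 0.8),
          (("d1", "y"), 0.4), (("d3", "x"), 0.1)]
held_out = {("d1", "x"), ("d1", "y")}

for n in (2, 3):
    print(n, top_n_recall_precision(scored, held_out, n))
```

Sweeping `n` from 10 to 1,000 in steps of 10 and plotting the two values for each method would produce curves of the kind the caption describes.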

In this section, we use case studies to demonstrate the practical usefulness of CMFMTL in predicting therapeutic and non-therapeutic associations. CMFMTL makes predictions by collective learning and shares predictive signals across the two tasks. Hence, the prediction scores that CMFMTL simultaneously generates for the two tasks can measure the probabilities that drugs are associated with diseases under a given association type. We use all drug-disease associations in our dataset to train the CMFMTL model and then rank the prediction scores of all unknown entries, i.e., those not recorded in the dataset. Then, we focus on the top predicted (drug, disease, association type) triples. We list the top 10 ranked predictions in

Top 10 Drug-Disease Associations Predicted by CMFMTL.

Drug | Disease | Association type | Evidence |
Chloroquine | Bradycardia | −1 | Don Michael and Aiwazzadeh, |
Chlorpromazine | Coma | −1 | N.A. |
Risperidone | Anxiety disorders | 1 | Ravindran et al., |
Clozapine | Headache | −1 | |
Methotrexate | Neoplasms | 1 | |
Valproic Acid | Fatigue | −1 | N.A. |
Amitriptyline | Confusion | −1 | |
Ibuprofen | Drug hypersensitivity | −1 | Nanau and Neuman, |
Tamoxifen | Diarrhea | −1 | N.A. |
Vincristine | Neoplasms | 1 | |

In this work, to simultaneously predict two types of drug-disease association, we present a novel model named collective matrix factorization-based multi-task learning (CMFMTL). Unlike existing methods that focus on the existence of drug-disease associations, CMFMTL aims to predict both the drug-disease associations and their corresponding association types. Since drug-disease associations are annotated into two categories, predicting each type of association can serve as an individual task. The underlying relatedness across the tasks is a vital piece of prior knowledge that can greatly improve learning abilities. CMFMTL captures the relations between the two tasks and utilizes all useful information to achieve accurate and robust performance. The experimental results show that CMFMTL outperforms other state-of-the-art association prediction methods. Case studies demonstrate that CMFMTL can discover novel associations and accurately infer their association types.

Nevertheless, CMFMTL still has limitations. CMFMTL predicts the probabilities of therapeutic and non-therapeutic associations for all drug-disease pairs without a known association. However, we notice that some drug-disease pairs appear among the top predictions for therapeutic associations as well as the top predictions for non-therapeutic associations. This means that CMFMTL predicts these associations to be both therapeutic and non-therapeutic, which is conflicting. A possible reason is that these drugs and diseases are very popular and have a great number of associations, so the model learns the data bias. In future work, we will optimize the proposed model to avoid this conflict. Note that similarity integration methods are usually able to achieve high-accuracy performance in similar bioinformatics issues (Zhang et al.,

Publicly available datasets were analyzed in this study. This data can be found here: The Comparative Toxicogenomics Database (CTD)

FH and SL designed the project and wrote the manuscript. YQ and QL performed the experiments and analyzed the results. SL and FN supervised and conceived the study. All authors read and approved the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.