<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Bioeng. Biotechnol.</journal-id>
<journal-title>Frontiers in Bioengineering and Biotechnology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Bioeng. Biotechnol.</abbrev-journal-title>
<issn pub-type="epub">2296-4185</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fbioe.2020.00267</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Bioengineering and Biotechnology</subject>
<subj-group>
<subject>Methods</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Dipeptide Frequency of Word Frequency and Graph Convolutional Networks for DTA Prediction</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Wang</surname> <given-names>Xianfang</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/898076/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Liu</surname> <given-names>Yifeng</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Lu</surname> <given-names>Fan</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Li</surname> <given-names>Hongfei</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/903010/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Gao</surname> <given-names>Peng</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Wei</surname> <given-names>Dongqing</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>School of Computer Science and Technology, Henan Institute of Technology</institution>, <addr-line>Xinxiang</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>School of Computer and Information Engineering, Henan Normal University</institution>, <addr-line>Xinxiang</addr-line>, <country>China</country></aff>
<aff id="aff3"><sup>3</sup><institution>School of Life Sciences and Biotechnology, Shanghai Jiao Tong University</institution>, <addr-line>Shanghai</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Wen Zhang, Huazhong Agricultural University, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Juexin Wang, University of Missouri, United States; Pu-Feng Du, Tianjin University, China; Liu Li, Inner Mongolia University, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Xianfang Wang <email>2wangfang&#x00040;163.com</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology</p></fn></author-notes>
<pub-date pub-type="epub">
<day>03</day>
<month>04</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>8</volume>
<elocation-id>267</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>01</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>13</day>
<month>03</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2020 Wang, Liu, Lu, Li, Gao and Wei.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Wang, Liu, Lu, Li, Gao and Wei</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Deep learning is an effective method for capturing drug-target binding affinity, but low accuracy remains an obstacle to overcome. Thus, we propose a novel predictor of drug-target binding affinity based on dipeptide frequency of word frequency encoding and a hybrid graph convolutional network. Word-frequency characteristics from natural language are used to improve the frequency characteristics of peptides representing target proteins. For each drug molecule, five different features of the drug atoms and the atomic bond relationships are expressed as a graph. The obtained protein features and graph structures are used as the inputs of a convolutional neural network and a graph convolutional neural network, respectively. A prediction model is established to predict drug affinity by calculating the hidden relationships. In the KIBA data set test experiment, the concordance index of the model is 0.901, which is 0.01 higher than that of existing models, and the mean squared error (MSE) of the model is 0.126, which is 5% lower than that of existing models. In the Davis data set test experiment, the concordance index of the model is 0.895, which is 0.006 higher than that of existing models, and the MSE of the model is 0.220, which is 4% lower than that of existing models. These results show that our proposed method not only predicts affinity better than existing models but also outperforms unitary deep learning approaches.</p></abstract>
<kwd-group>
<kwd>drug-target binding affinity</kwd>
<kwd>dipeptide frequency of word frequency</kwd>
<kwd>graph convolutional network</kwd>
<kwd>variable importance measures</kwd>
<kwd>deep learning</kwd>
</kwd-group>
<contract-num rid="cn001">61173071</contract-num>
<contract-num rid="cn001">61503244</contract-num>
<contract-num rid="cn001">61832019</contract-num>
<contract-sponsor id="cn001">Foundation for Innovative Research Groups of the National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100012659</named-content></contract-sponsor>
<counts>
<fig-count count="7"/>
<table-count count="4"/>
<equation-count count="11"/>
<ref-count count="35"/>
<page-count count="10"/>
<word-count count="6093"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>Introduction</title>
<p>The discovery process for a new drug is not only time consuming but also expensive (Roses, <xref ref-type="bibr" rid="B26">2008</xref>), usually requiring about $2.6 billion and 10&#x02013;17 years of research and experimentation (Yang et al., <xref ref-type="bibr" rid="B34">2017</xref>). One core strategy is to find novel targets for existing drugs (Santos et al., <xref ref-type="bibr" rid="B27">2016</xref>) and thereby overcome current shortfalls in drug discovery capability (Chu et al., <xref ref-type="bibr" rid="B4">2019</xref>). By eliminating multiple experimental steps such as drug stability testing (Oprea and Mestres, <xref ref-type="bibr" rid="B21">2012</xref>), this strategy not only reduces experimental cost but also greatly shortens drug discovery time (Martin et al., <xref ref-type="bibr" rid="B17">2018</xref>). Discovering novel target proteins for existing drugs has thus become an important task in drug development, and successful identification of drug-target interactions (DTI) is a prerequisite for it (Ezzat et al., <xref ref-type="bibr" rid="B7">2019</xref>).</p>
<p>High-throughput screening (HTS) experiments are often used to identify biological activity between drugs and targets, but they are expensive and time consuming (Cohen, <xref ref-type="bibr" rid="B5">2002</xref>). In silico DTI prediction is an effective alternative (Liu et al., <xref ref-type="bibr" rid="B16">2012</xref>), and machine learning is a prevalent approach (Yan et al., <xref ref-type="bibr" rid="B33">2019</xref>). Support vector machines (SVM) (Keum and Nam, <xref ref-type="bibr" rid="B12">2017</xref>) and random forests (RF) (Wang et al., <xref ref-type="bibr" rid="B31">2018</xref>; Strobl et al., <xref ref-type="bibr" rid="B28">2019</xref>) are often used as predictors in existing research (Olayan et al., <xref ref-type="bibr" rid="B20">2018</xref>). Although these methods are effective, shallow learning models may oversimplify the relationship between drugs and target proteins (Nanni et al., <xref ref-type="bibr" rid="B18">2020</xref>) and are limited by the size of the data set (Keogh and Mueen, <xref ref-type="bibr" rid="B11">2009</xref>). Deep learning methods have achieved remarkable results in many research areas, such as image processing (Zhou et al., <xref ref-type="bibr" rid="B35">2020</xref>), natural language recognition (Rabovsky and McClelland, <xref ref-type="bibr" rid="B24">2020</xref>), and bioinformatics (Khurana et al., <xref ref-type="bibr" rid="B13">2018</xref>). Their main advantage is that hidden relationships are obtained by calculating non-linear mappings in the original data.</p>
<p>DTI prediction is often treated as a binary classification problem in existing studies (Ban et al., <xref ref-type="bibr" rid="B1">2019</xref>; Yan et al., <xref ref-type="bibr" rid="B33">2019</xref>; Le et al., <xref ref-type="bibr" rid="B15">2020</xref>): whether or not an interaction exists. However, such methods ignore the degree of the interaction, namely the value of the binding affinity. Binding affinity provides information about the strength of the interaction between a drug-target (DT) pair and is usually expressed by measures such as the dissociation constant (Kd), the inhibition constant (Ki), or the half-maximal inhibitory concentration (IC50) (Cer et al., <xref ref-type="bibr" rid="B3">2009</xref>). Predicting drug-target binding affinity (DTA) with deep learning algorithms therefore has important research significance.</p>
<p>DeepDTA is a predictive tool for drug-target binding affinity (Ozturk et al., <xref ref-type="bibr" rid="B22">2018</xref>); it is a convolutional neural network (CNN) that uses 1D encodings of proteins and drug molecules to learn hidden relationships between features and to predict affinity. To obtain better model performance, WideDTA (&#x000D6;zt&#x000FC;rk et al., <xref ref-type="bibr" rid="B23">2019</xref>) extracted four text-based information sources to represent proteins and drug structures on the basis of DeepDTA. GraphDTA is an effective prediction model (Nguyen and Venkatesh, <xref ref-type="bibr" rid="B19">2019</xref>) whose framework is a graph convolutional network taking the graph structures of drugs as input, while OneHot encoding is used to represent protein sequences as input to a convolutional neural network. However, because OneHot encoding represents each residue individually, it loses the correlations between residues, which weakens the expressive ability of the protein sequence and lowers prediction ability.</p>
<p>To overcome these problems, we propose a novel feature extraction method, polypeptide frequency of word frequency, based on the word-frequency characteristics of natural language, to enhance the expressive ability of protein sequences. The network model is constructed by merging a graph convolutional network, which processes the graph structures of drugs, with a convolutional neural network, which captures the hidden relationships of the protein features. The outputs of the two networks are combined as the input of two hidden layers for regression training and prediction of DTA.</p>
</sec>
<sec id="s2">
<title>Data Sets and Feature Extraction</title>
<sec>
<title>Data Sets</title>
<p>We use two data sets: the KIBA data set (Tang et al., <xref ref-type="bibr" rid="B29">2014</xref>) and the Davis data set (Davis et al., <xref ref-type="bibr" rid="B6">2011</xref>) (both can be obtained from the <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>), as shown in <xref ref-type="table" rid="T1">Table 1</xref>. KIBA (Tang et al., <xref ref-type="bibr" rid="B29">2014</xref>) was used as a benchmark data set to evaluate the algorithm model. The Davis data set (Davis et al., <xref ref-type="bibr" rid="B6">2011</xref>) was selectively assembled from the kinase protein family and its associated inhibitors with their dissociation constant (<italic>K</italic><sub><italic>d</italic></sub>) values, covering the affinities of 442 proteins and 68 drugs. Following the processing method in the literature, we transform the Davis <italic>K</italic><sub><italic>d</italic></sub> values into p<italic>K</italic><sub><italic>d</italic></sub> values (as shown in formula 1).</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mi>K</mml:mi><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:mo class="qopname">lg</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>K</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mi>e</mml:mi><mml:mn>9</mml:mn></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
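<p>As a minimal illustration of formula 1 (a sketch, not the authors' code), a nanomolar <italic>K</italic><sub><italic>d</italic></sub> value is converted to p<italic>K</italic><sub><italic>d</italic></sub> as follows:</p>

```python
import math

def pkd_from_kd(kd_nm: float) -> float:
    """pKd = -log10(Kd / 1e9) for a dissociation constant Kd given in nM (formula 1)."""
    return -math.log10(kd_nm / 1e9)

# A weak binder at Kd = 10,000 nM maps to pKd = 5.0;
# a strong binder at Kd = 1 nM maps to pKd = 9.0.
print(pkd_from_kd(10000.0), pkd_from_kd(1.0))
```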
<p>It can be seen from <xref ref-type="table" rid="T1">Table 1</xref> that only about a quarter of the possible drug-protein pairs in the KIBA data set (118,254 of 229 &#x000D7; 2,111 = 483,419) have recorded interactions. KIBA values are calculated by combining different information sources such as IC50, <italic>K</italic><sub><italic>i</italic></sub>, and <italic>K</italic><sub><italic>d</italic></sub>. We used a filtered version of the KIBA data set in which each protein and each ligand has at least ten interactions (He et al., <xref ref-type="bibr" rid="B9">2017</xref>).</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Statistics of the data sets.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Data set</bold></th>
<th valign="top" align="center"><bold>Number of proteins</bold></th>
<th valign="top" align="center"><bold>Number of drugs</bold></th>
<th valign="top" align="center"><bold>Number of correlations</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Davis(pKd)</td>
<td valign="top" align="center">442</td>
<td valign="top" align="center">68</td>
<td valign="top" align="center">30,056</td>
</tr>
<tr>
<td valign="top" align="left">KIBA</td>
<td valign="top" align="center">229</td>
<td valign="top" align="center">2111</td>
<td valign="top" align="center">118,254</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>Drug Molecular Feature Extraction</title>
<p>The graphs of the drugs are constructed using the GraphDTA (Nguyen and Venkatesh, <xref ref-type="bibr" rid="B19">2019</xref>) method, which reflects the interactions among the internal atoms of each SMILES compound. RDKit, an open-source cheminformatics package (<xref ref-type="bibr" rid="B8">G</xref>, <xref ref-type="bibr" rid="B8">2013</xref>), is used to calculate the feature vector of each atom and the connections between adjacent atoms of a drug. The nodes of the graph represent the features of the drug&#x00027;s atoms, and the bonds between atoms are represented by the edges. The feature vector of a drug atom is made up of five characteristics: the atom type, the atom degree, the total number of hydrogen atoms, the implicit valence of the atom, and the presence or absence of an aromatic group. The atom degree is the sum of the number of bonds between the current atom and its neighboring atoms and the number of hydrogen atoms. The edges of the graph represent the connections between adjacent atoms. The overall process is shown in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Drug molecular feature extraction process. The input of the extraction process of drugs is the drug molecular structure. Each atom is represented as a node by 5 different characteristics, and the bond between the atom and adjacent atoms is used as the edge set. The red atom has a binding bond with two yellow atoms, and no binding bond with the green atom. The set of nodes and the set of edges are made up of all the atoms together to form the graph structure representing the current drug molecule.</p></caption>
<graphic xlink:href="fbioe-08-00267-g0001.tif"/>
</fig>
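<p>The graph representation described above can be sketched by hand (the paper itself derives these quantities with RDKit; the truncated element alphabet and the toy molecule below are illustrative assumptions):</p>

```python
ELEMENTS = ["C", "N", "O", "S", "Other"]  # truncated one-hot alphabet (assumption)

def atom_features(symbol, degree, num_h, implicit_valence, is_aromatic):
    """Five characteristics per atom: type (one-hot), degree, hydrogen count,
    implicit valence, and an aromaticity flag."""
    onehot = [1.0 if symbol == e else 0.0 for e in ELEMENTS]
    return onehot + [float(degree), float(num_h), float(implicit_valence), float(is_aromatic)]

def mol_to_graph(atoms, bonds):
    """atoms: list of (symbol, degree, num_h, implicit_valence, aromatic) tuples;
    bonds: list of (i, j) atom-index pairs. Returns node feature vectors and a
    directed edge list (each undirected bond contributes both directions)."""
    x = [atom_features(*a) for a in atoms]
    edge_index = []
    for i, j in bonds:
        edge_index += [(i, j), (j, i)]
    return x, edge_index

# Toy three-atom chain with two bonds and no aromaticity.
atoms = [("C", 1, 3, 0, False), ("C", 2, 0, 0, False), ("N", 1, 0, 0, False)]
x, edges = mol_to_graph(atoms, [(0, 1), (1, 2)])
```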
</sec>
<sec>
<title>Protein Sequence Feature Extraction</title>
<sec>
<title>Protein Sequence Representation</title>
<p>Vectorization of the primary structure is a prerequisite for data analysis of protein sequences; formula 2 is used to discretize the primary structure of a protein.</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>.</mml:mo><mml:mo>&#x02026;</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02026;</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>K</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>S</italic><sub><italic>n</italic></sub> is the <italic>n</italic>th protein sequence, <italic>R</italic><sub><italic>i</italic></sub> is the <italic>i</italic>th amino acid residue in the sequence, and <italic>K</italic> is the number of protein sequences in the data set.</p>
</sec>
<sec>
<title>Polypeptide Frequency of Word Frequency</title>
<p>The term frequency-inverse document frequency (TF-IDF) algorithm plays an important role in natural language processing (NLP) (Kaur and Jatinderkumar, <xref ref-type="bibr" rid="B10">2019</xref>). TF-IDF consists of term frequency (TF) and inverse document frequency (IDF). The polypeptide frequency <italic>F</italic> (as shown in formula 3) is calculated analogously to TF, transposed to bioinformatics.</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>F</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mn>25</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>n</italic> is the number of residues making up the polypeptide, drawn from an alphabet of 25 residue symbols, so that 25<sup><italic>n</italic></sup> different polypeptides can be formed by dehydration condensation, and <italic>v</italic><sub><italic>i</italic></sub> represents the frequency of the <italic>i</italic>th polypeptide feature. The formula for <italic>v</italic><sub><italic>i</italic></sub> is as follows.</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mtext class="textrm" mathvariant="normal">=</mml:mtext><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>/</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mn>25</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>/</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>L</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>L</italic> represents the length of the protein sequence and <italic>n</italic><sub><italic>u</italic></sub> represents the number of occurrences of the <italic>u</italic>th dipeptide in the protein sequence.</p>
<p>IDF is the inverse document frequency, which reweights the TF values, as specified in formula 5.</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>I</mml:mi><mml:mi>D</mml:mi><mml:mi>F</mml:mi><mml:mo>=</mml:mo><mml:mo class="qopname">lg</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo>,</mml:mo><mml:mn>4</mml:mn><mml:mo>,</mml:mo><mml:mo class="qopname">&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mn>25</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where, in bioinformatics, <italic>N</italic> is the number of protein sequences in the data set and <italic>w</italic><sub><italic>i</italic></sub> is the number of protein sequences that contain the <italic>i</italic>th polypeptide. The formula shows that IDF is inversely proportional to how often a word occurs across documents, so the TF-IDF algorithm assigns lower weights to high-frequency words; this is not suitable for bioinformatics calculations. Therefore, we propose the polypeptide frequency of word frequency method, which avoids this problem by calculating only word frequencies, as shown in formula 6:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>W</mml:mi><mml:mi>F</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>w</mml:mi><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>w</mml:mi><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:mi>w</mml:mi><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:mi>w</mml:mi><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mn>25</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>n</italic> is the number of residues that make up the polypeptide, and <italic>wf</italic><sub><italic>i</italic></sub> is the word frequency of the <italic>i</italic>th polypeptide, as shown in formula 7.</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>w</mml:mi><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x000D7;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>L</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where, <italic>w</italic><sub><italic>i</italic></sub> is the number of protein sequences containing the <italic>ith</italic> peptide, <italic>N</italic> is the total number of proteins contained in the data set, <italic>p</italic><sub><italic>i</italic></sub> is the number of times that the <italic>ith</italic> peptide appears in the current protein, and <italic>L</italic> is the number of residues contained in the current protein.</p>
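<p>Formulas 6 and 7 can be sketched directly (a minimal illustration assuming a 20-letter residue alphabet for brevity, where the paper uses 25 symbols; the <italic>L</italic> - 1 denominator matches the dipeptide case <italic>n</italic> = 2):</p>

```python
from itertools import product

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (the paper's alphabet has 25 symbols)

def wf_features(sequences, n=2, alphabet=AMINO):
    """Word-frequency encoding per formulas 6-7: the i-th feature of a protein is
    (w_i / N) * (p_i / (L - 1)), where w_i is the number of sequences containing
    the i-th n-peptide, p_i its occurrence count in this protein, L its length."""
    peptides = ["".join(p) for p in product(alphabet, repeat=n)]
    N = len(sequences)
    # w_i: in how many sequences each n-peptide appears
    w = {pep: sum(1 for s in sequences if pep in s) for pep in peptides}
    feats = []
    for s in sequences:
        counts = {}
        for k in range(len(s) - n + 1):  # all overlapping n-peptides
            frag = s[k:k + n]
            counts[frag] = counts.get(frag, 0) + 1
        feats.append([(w[pep] / N) * (counts.get(pep, 0) / (len(s) - 1))
                      for pep in peptides])
    return peptides, feats

peps, feats = wf_features(["ACAC", "ACDE"])
```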
</sec>
</sec>
<sec>
<title>Network Model Construction</title>
<p>A novel model combining a graph convolutional neural network and a convolutional neural network is designed to predict DTA by regression. A multi-layer graph convolutional neural network is used to obtain the hidden relationships of the drug graphs, while the hidden relationships of the polypeptide frequency of word frequency features are obtained through the convolutional neural network. The output results of the two networks are combined as the input of fully connected layers. The complete process is shown in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Network structure diagram.</p></caption>
<graphic xlink:href="fbioe-08-00267-g0002.tif"/>
</fig>
</sec>
<sec>
<title>Graph Convolutional Neural Network of Drug</title>
<p>We use the four types of graph convolutional neural networks adapted in GraphDTA to discover potential relationships in the graph structures of drug features: GCN (Kipf and Welling, <xref ref-type="bibr" rid="B14">2017</xref>), GAT (Veli&#x0010D;kovi&#x00107; et al., <xref ref-type="bibr" rid="B30">2018</xref>), GIN (Xu et al., <xref ref-type="bibr" rid="B32">2019</xref>), and GAT-GCN (Nguyen and Venkatesh, <xref ref-type="bibr" rid="B19">2019</xref>). A linear layer maps the output of each graph convolutional network to a 128-dimensional feature vector, consistent with the size of the protein feature vector.</p>
<p>The GCN model was originally proposed by Kipf and Welling (<xref ref-type="bibr" rid="B14">2017</xref>) as a graph-structure learner for semi-supervised classification. To meet the regression requirements of our work, three graph convolutional units are stacked, each comprising a GCN layer and a ReLU activation layer, with 78, 156, and 312 output channels, respectively. A fully connected layer of 1,024 neurons follows, and the results are mapped to a 128-dimensional feature vector in the output layer.</p>
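<p>The propagation rule underlying each GCN unit can be sketched in NumPy (an illustrative sketch of Kipf and Welling's rule, not the trained model; the dimensions follow the 78-to-156 channel step described above):</p>

```python
import numpy as np

def gcn_layer(X, A, W):
    """One GCN propagation step: ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # inverse sqrt of node degrees
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)         # ReLU activation

# Toy 3-node path graph, 78 input channels mapped to 156 output channels.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.standard_normal((3, 78))
H = gcn_layer(X, A, rng.standard_normal((78, 156)))
```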
<p>The Graph Isomorphism Network (GIN) is an improved algorithm based on GCN: its injective aggregation updates the parameters and performs the feature-vector mapping to obtain better model performance. A network of five GIN layers is designed, each consisting of two linear computations with an output size of 32. The input and output layers are mapped to 128-dimensional feature vectors.</p>
<p>The Graph Attention Network (GAT) differs from the GCN model in that it computes hidden information for each node using an attention mechanism over its neighboring nodes. The network is designed with two GAT layers. In the first layer, the number of output channels is 78 and the number of attention heads is 10; in the second layer, the number of output channels is 128 and the number of attention heads is 1. The results are fed to the output layer, which maps them to a 128-dimensional feature vector.</p>
<p>Based on the GAT and GCN models, GAT-GCN combines the advantages of the two in series to obtain better model performance. The GAT layer has 78 output channels and 10 attention heads, and the GCN layer has 780 output channels. A fully connected layer of 1,500 neurons follows, and the results are mapped to a 128-dimensional feature vector in the output layer.</p>
</sec>
<sec>
<title>Convolutional Neural Network of Protein</title>
<p>A convolutional neural network is used to obtain the hidden relationships in the protein feature vector. A 1D convolutional neural network is designed by analyzing the structure of the protein word-frequency and polypeptide-frequency features. The model contains a convolutional layer with a kernel of size 32. The result of the convolution is fed to a fully connected layer that maps it to 256 neurons, keeping the sizes of the drug and protein representations consistent.</p>
<p>We concatenate the protein feature vectors produced by the convolutional neural network with the drug feature vectors produced by the graph convolutional neural network, and feed them into two fully connected layers with 512 and 128 neurons, respectively. The batch size is set to 512 and the learning rate to 0.00005.</p>
</sec>
</sec>
<sec id="s3">
<title>Results and Discussion</title>
<sec>
<title>Performance Evaluation</title>
<p>In this work, each data set is divided into two parts: a training set and a test set. That is, 80% of the data instances are used for training and 20% for testing the models. The performance of our model is comprehensively compared across several experiments using evaluation metrics such as the Concordance Index (CI), Mean Squared Error (MSE), and the Pearson correlation coefficient. These evaluation indicators are consistent with WideDTA and GraphDTA. The performance of models that predict continuous values is evaluated by CI, computed as follows.</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M8"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>C</mml:mi><mml:mi>I</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>Z</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B4;</mml:mi><mml:mi>x</mml:mi><mml:mo>&#x0003E;</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mi>x</mml:mi><mml:mo>-</mml:mo><mml:mi>b</mml:mi><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>b</italic><sub><italic>x</italic></sub> is the prediction value for the larger affinity &#x003B4;<sub><italic>x</italic></sub>, <italic>b</italic><sub><italic>y</italic></sub> is the prediction value for the smaller affinity &#x003B4;<sub><italic>y</italic></sub>, <italic>Z</italic> is the normalization constant, and <italic>h(m)</italic> is the step function, as shown in the following formula:</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr columnalign="left"><mml:mtd columnalign="left"><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr columnalign="left"><mml:mtd columnalign="left"><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>m</mml:mi><mml:mo>&#x0003E;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr columnalign="left"><mml:mtd columnalign="left"><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>5</mml:mn><mml:mtext>&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr columnalign="left"><mml:mtd columnalign="left"><mml:mn>0</mml:mn><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>m</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
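For concreteness, Equations (8) and (9) can be computed with a straightforward O(n&#x000B2;) pairwise loop; this is an illustrative sketch assuming the affinities are given as plain arrays, not the authors' evaluation script.

```python
import numpy as np

def concordance_index(y_true, y_pred):
    """CI per Eqs. (8)-(9): over all pairs whose true affinities differ,
    count 1 for a correctly ordered prediction pair, 0.5 for a tied
    prediction, and 0 otherwise, normalized by the pair count Z."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    num, z = 0.0, 0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:          # delta_x > delta_y
                z += 1
                diff = y_pred[i] - y_pred[j]   # b_x - b_y
                num += 1.0 if diff > 0 else (0.5 if diff == 0 else 0.0)
    return num / z
```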
<p>MSE measures the difference between the predicted and actual value vectors, and it is an important index for evaluating regression models; the formula is as follows.</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>S</mml:mi><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>n</italic> is the number of samples in the KIBA or Davis dataset, and the other parameters have the same meanings as above.</p>
<p>The Pearson correlation coefficient evaluates the linear correlation between the true and predicted affinity values; the formula is as follows.</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo class="qopname">cov</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>p</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>cov</italic> indicates the covariance, <italic>p</italic> is the vector of predicted values, <italic>y</italic> is the vector of original values, and &#x003C3; represents the standard deviation.</p>
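Equations (10) and (11) translate directly into NumPy; the following is an illustrative sketch, not the authors' evaluation code.

```python
import numpy as np

def mse(y_true, y_pred):
    """Eq. (10): mean squared error between predictions b_k and affinities delta_k."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

def pearson(y_true, y_pred):
    """Eq. (11): cov(p, y) / (sigma(p) * sigma(y))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    cov = np.cov(y_pred, y_true)[0, 1]        # sample covariance (ddof=1)
    return float(cov / (np.std(y_pred, ddof=1) * np.std(y_true, ddof=1)))
```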
</sec>
<sec>
<title>Contrast Experiments and Analysis of Different Characteristics</title>
<p>In this study, we introduced the polypeptide frequency of word frequency, a novel method of protein feature extraction. For every protein sequence, we calculated the word frequency characteristics and the frequencies of single peptides, dipeptides, and tripeptides. Different graph convolutional network models were then designed to predict drug-target binding affinity. The results of the comparative experiments are shown in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
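The full construction of the dipeptide frequency of word frequency is given in the Methods section; purely as an illustration of the underlying 2-mer counting, a sketch of the plain dipeptide frequency might look like this. The 25-letter extended amino acid alphabet (matching the 25 symbols on the axes of Figure 6) and the length normalization are assumptions.

```python
from itertools import product

# 25-symbol extended amino acid alphabet (an assumption; Figure 6's axes list
# 25 amino acid symbols, giving 25 x 25 = 625 dipeptide features).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYXBZUO"
DIPEPTIDES = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 625 combinations

def dipeptide_frequency(seq):
    """Relative frequency of each dipeptide (2-mer) in a protein sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = max(len(seq) - 1, 1)               # number of overlapping 2-mers
    return [counts[d] / total for d in DIPEPTIDES]
```

Single-peptide (25-dim) and tripeptide (15,625-dim) variants follow the same pattern with `repeat=1` and `repeat=3`.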
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Comparative experimental results of word frequency feature of many different peptides.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Graph neural network model</bold></th>
<th valign="top" align="center"><bold>Peptides</bold></th>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>KIBA</bold></th>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>Davis</bold></th>
</tr>
<tr>
<th/>
<th/>
<th valign="top" align="center"><bold>MSE</bold></th>
<th valign="top" align="center"><bold>CI</bold></th>
<th valign="top" align="center"><bold>Pearson</bold></th>
<th valign="top" align="center"><bold>MSE</bold></th>
<th valign="top" align="center"><bold>CI</bold></th>
<th valign="top" align="center"><bold>Pearson</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">GAT</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.758</td>
<td valign="top" align="center">0.372</td>
<td valign="top" align="center">0.323</td>
<td valign="top" align="center">0.740</td>
<td valign="top" align="center">0.649</td>
<td valign="top" align="center">0.436</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.176</td>
<td valign="top" align="center">0.868</td>
<td valign="top" align="center">0.873</td>
<td valign="top" align="center">0.231</td>
<td valign="top" align="center">0.899</td>
<td valign="top" align="center">0.698</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">3</td>
<td valign="top" align="center">0.187</td>
<td valign="top" align="center">0.858</td>
<td valign="top" align="center">0.823</td>
<td valign="top" align="center">0.244</td>
<td valign="top" align="center">0.861</td>
<td valign="top" align="center">0.659</td>
</tr>
<tr>
<td valign="top" align="left">GIN</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.427</td>
<td valign="top" align="center">0.696</td>
<td valign="top" align="center">0.569</td>
<td valign="top" align="center">0.472</td>
<td valign="top" align="center">0.802</td>
<td valign="top" align="center">0.634</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.148</td>
<td valign="top" align="center">0.881</td>
<td valign="top" align="center">0.856</td>
<td valign="top" align="center">0.222</td>
<td valign="top" align="center">0.894</td>
<td valign="top" align="center">0.687</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">3</td>
<td valign="top" align="center">0.151</td>
<td valign="top" align="center">0.871</td>
<td valign="top" align="center">0.851</td>
<td valign="top" align="center">0.239</td>
<td valign="top" align="center">0.882</td>
<td valign="top" align="center">0.685</td>
</tr>
<tr>
<td valign="top" align="left">GCN</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.803</td>
<td valign="top" align="center">0.431</td>
<td valign="top" align="center">0.341</td>
<td valign="top" align="center">0.834</td>
<td valign="top" align="center">0.408</td>
<td valign="top" align="center">0.337</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.127</td>
<td valign="top" align="center">0.898</td>
<td valign="top" align="center">0.864</td>
<td valign="top" align="center">0.223</td>
<td valign="top" align="center">0.894</td>
<td valign="top" align="center">0.697</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">3</td>
<td valign="top" align="center">0.151</td>
<td valign="top" align="center">0.873</td>
<td valign="top" align="center">0.846</td>
<td valign="top" align="center">0.247</td>
<td valign="top" align="center">0.887</td>
<td valign="top" align="center">0.691</td>
</tr>
<tr>
<td valign="top" align="left">GAT_GCN</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.624</td>
<td valign="top" align="center">0.798</td>
<td valign="top" align="center">0.698</td>
<td valign="top" align="center">0.743</td>
<td valign="top" align="center">0.644</td>
<td valign="top" align="center">0.434</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">2</td>
<td valign="top" align="center"><bold>0.126</bold></td>
<td valign="top" align="center"><bold>0.901</bold></td>
<td valign="top" align="center"><bold>0.893</bold></td>
<td valign="top" align="center"><bold>0.220</bold></td>
<td valign="top" align="center"><bold>0.899</bold></td>
<td valign="top" align="center"><bold>0.701</bold></td>
</tr>
<tr>
<td/>
<td valign="top" align="center">3</td>
<td valign="top" align="center">0.191</td>
<td valign="top" align="center">0.852</td>
<td valign="top" align="center">0.839</td>
<td valign="top" align="center">0.224</td>
<td valign="top" align="center">0.896</td>
<td valign="top" align="center">0.693</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The bold values are maximum</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>When the protein sequence is represented by the dipeptide frequency of word frequency and the GAT_GCN model is used, the model is the best predictor on all three evaluation metrics, yielding a CI of 0.901, an MSE of 0.126, and a Pearson of 0.893 on the KIBA data set, and a CI of 0.899, an MSE of 0.220, and a Pearson of 0.701 on the Davis data set. When the dipeptide frequency of word frequency is used to represent protein sequences, compared with the second-best GCN model, the CI and Pearson of the GAT_GCN model on the KIBA data set increase by 0.003 and 0.029, respectively, and the MSE decreases by 0.001. Compared with the GAT and GIN models, the CI values of the GAT_GCN model are 0.033 and 0.020 higher, the MSE values are reduced by 0.050 and 0.022, and the Pearson values are increased by 0.020 and 0.037, respectively. On the Davis data set, the CI value of the GAT_GCN model equals that of the GAT model, the next-highest model, while the MSE is reduced by 0.011 and the Pearson is increased by 0.003. The CI value of the GAT_GCN model is 0.005 higher than both the GCN and GIN models, the MSE values are decreased by 0.003 and 0.002, and the Pearson values are increased by 0.004 and 0.014, respectively. Thus, the GAT_GCN model has the best performance of the four models.</p>
<p>When the GAT_GCN model is used as the graph feature extractor, the dipeptide frequency of word frequency outperforms the single peptide frequency of word frequency and the tripeptide frequency of word frequency: on the KIBA data set, its CI values are higher by 0.103 and 0.049, its MSE values are reduced by 0.498 and 0.065, and its Pearson values are increased by 0.195 and 0.054, respectively. On the Davis data set, the CI values are 0.255 and 0.003 higher, the MSE values are decreased by 0.523 and 0.004, and the Pearson values are increased by 0.267 and 0.008, respectively. The dipeptide frequency of word frequency also obtains the optimal indices when combined with the GIN, GAT, and GCN models on the KIBA and Davis data sets, indicating that it performs best among the compared characteristics.</p>
</sec>
<sec>
<title>Word Frequency Comparison Experiment</title>
<p>We also compared the dipeptide frequency features with and without the word frequency characteristics. The results are shown in <xref ref-type="table" rid="T3">Table 3</xref> and <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Comparison results of dipeptide features.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Features</bold></th>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>KIBA</bold></th>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>Davis</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center"><bold>MSE</bold></th>
<th valign="top" align="center"><bold>CI</bold></th>
<th valign="top" align="center"><bold>Pearson</bold></th>
<th valign="top" align="center"><bold>MSE</bold></th>
<th valign="top" align="center"><bold>CI</bold></th>
<th valign="top" align="center"><bold>Pearson</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Dipeptide frequency of word frequency</td>
<td valign="top" align="center"><bold>0.126</bold></td>
<td valign="top" align="center"><bold>0.901</bold></td>
<td valign="top" align="center"><bold>0.893</bold></td>
<td valign="top" align="center"><bold>0.220</bold></td>
<td valign="top" align="center"><bold>0.899</bold></td>
<td valign="top" align="center"><bold>0.701</bold></td>
</tr>
<tr>
<td valign="top" align="left">Dipeptide frequency</td>
<td valign="top" align="center">0.148</td>
<td valign="top" align="center">0.882</td>
<td valign="top" align="center">0.857</td>
<td valign="top" align="center">0.239</td>
<td valign="top" align="center">0.881</td>
<td valign="top" align="center">0.690</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The bold values are maximum</italic>.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Comparison results of dipeptide features.</p></caption>
<graphic xlink:href="fbioe-08-00267-g0003.tif"/>
</fig>
<p>After adding the word frequency characteristics to the dipeptide frequency, the MSE decreased by 0.022 and the CI and Pearson increased by 0.019 and 0.036 on the KIBA data set, while the MSE decreased by 0.019 and the CI and Pearson increased by 0.018 and 0.011 on the Davis data set. This shows that the dipeptide frequency of word frequency is more conducive to prediction than the plain dipeptide frequency and represents protein sequences better.</p>
</sec>
<sec>
<title>Analysis of Protein Features</title>
<p>Through the analysis of the comparative experiments, we found that the model obtained the best performance metrics when the dipeptide frequency of word frequency was used to represent protein sequences. For every protein, we calculated the mean and variance of the features in the Davis and KIBA datasets, respectively. The results are shown in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
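The statistics underlying Figure 4 reduce to per-feature means and variances; a minimal sketch, assuming the feature vectors are stacked into a matrix with one row per protein (whether the statistics are taken per feature across proteins, as here, or per protein is our assumption):

```python
import numpy as np

def feature_stats(X):
    """Per-feature mean and variance across proteins.
    X: (n_proteins, n_features) matrix, one word-frequency peptide vector per row."""
    X = np.asarray(X, dtype=float)
    return X.mean(axis=0), X.var(axis=0)
```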
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Frequency chart of word frequency peptides. The X-axis is the peptide amino acid combination, and the Y-axis is the word frequency polypeptide frequency score. <bold>(a)</bold> Shows the scores of the single peptide frequency of word frequency, <bold>(b)</bold> the scores of the tripeptide frequency of word frequency, <bold>(c)</bold> the scores of the dipeptide frequency of word frequency, and <bold>(d)</bold> the scores of the dipeptide frequency. The red lines represent the upper and lower quartiles. The first column is the Davis data set, and the second column is the KIBA data set.</p></caption>
<graphic xlink:href="fbioe-08-00267-g0004.tif"/>
</fig>
<p>In the Davis and KIBA datasets, the distributions of the scores are basically the same. The single peptide frequency of word frequency scores are mainly concentrated between 0.20 and 0.61, and the variances between 0.007 and 0.220. Although these features have high scores and large variances, the differences within the feature vectors are too large, and with only 25 dimensions they contribute less to the spatially specific division of the model. Although the tripeptide frequency of word frequency features have a huge number of dimensions (15,625), the scores are mainly distributed below 0.018 and the variances below 0.003; the differences between data points are small, and many features are zero. The scores of the dipeptide frequency of word frequency characteristic are mainly distributed between 0 and 0.14, and the variances between 0 and 0.0149, giving both good scores and good data differences.</p>
<p>Compared with the dipeptide frequency of word frequency, the scores of the plain dipeptide frequency are mainly distributed below 0.17, and the variances between 0 and 0.021. Although it has good scores, the differences within the feature vectors are high, similar to the single peptide frequency of word frequency. To examine the difference between the dipeptide frequency and the dipeptide frequency of word frequency, we drew histograms of the two frequency distributions and overlaid them; the results are shown in <xref ref-type="fig" rid="F5">Figure 5</xref>.</p>
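The overlaid histograms of Figure 5 can be reproduced by binning both feature-score samples on a common grid and taking the bin-wise minimum as the overlap; the bin count and value range used here are assumptions.

```python
import numpy as np

def overlap_histograms(scores_a, scores_b, bins=20, value_range=(0.0, 0.45)):
    """Bin two feature-score samples on a common grid and return both
    histograms plus their bin-wise overlap (the orange region in Figure 5)."""
    ha, edges = np.histogram(scores_a, bins=bins, range=value_range)
    hb, _ = np.histogram(scores_b, bins=bins, range=value_range)
    return ha, hb, np.minimum(ha, hb)
```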
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Histogram of frequency distribution. Yellow represents the histogram of the frequency distribution of the dipeptide frequency of word frequency, red represents the histogram of the frequency distribution of the dipeptide frequency, and orange represents the overlap between the two.</p></caption>
<graphic xlink:href="fbioe-08-00267-g0005.tif"/>
</fig>
<p>After adding the word frequency characteristics, the number of dipeptide frequency of word frequency features is smaller than that of the dipeptide frequency features in the score intervals [0.025, 0.175] and [0.250, 0.300], and larger in the score intervals [0, 0.025] and [0.175, 0.250]. The dipeptide frequency features are distributed over [0, 0.35], while the dipeptide frequency of word frequency features are distributed over [0, 0.45], a wider and more continuous range. This shows that the word frequency characteristics can reduce non-significant features and improve the score differences.</p>
</sec>
<sec>
<title>Analysis of Variable Importance Measure</title>
<p>The protein dipeptide frequency of word frequency is composed of 625-dimensional features. Variable Importance Measures (VIM) are used to analyze the contribution of each feature. In bioinformatics, Random Forest (RF) is a commonly used classification and regression model (Belgiu et al., <xref ref-type="bibr" rid="B2">2016</xref>), and compared with other machine learning algorithms such as the support vector machine (SVM), its unique advantage is that it can calculate VIM (Rawi et al., <xref ref-type="bibr" rid="B25">2018</xref>). We used an RF model containing 10,000 decision trees to obtain the VIM score of each dipeptide frequency of word frequency feature, as shown in <xref ref-type="fig" rid="F6">Figure 6</xref>. Only 199 dimensions have non-zero VIM scores, indicating that there is much noise in the feature vectors. The 27 features with a contribution &#x0003E;0.5% are listed in <xref ref-type="fig" rid="F7">Figure 7</xref>. The top five dipeptide frequency of word frequency features are PE (20.1%), WT (6.6%), AA (4.4%), EB (3.9%), and VV (3.2%). This shows that PE (the combination of proline and glutamic acid) is significantly related to the affinity prediction; its contribution is about three times that of the second highest, WT (the combination of tryptophan and threonine), and much larger than the other combinations.</p>
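The VIM computation described above maps onto a standard random forest importance calculation; this sketch uses scikit-learn's impurity-based `feature_importances_`, which is an assumption about the exact importance measure the authors used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def vim_scores(X, y, n_trees=10000):
    """Variable importance from a random forest regressor (10,000 trees in the text).
    X: (n_proteins, 625) dipeptide frequency of word frequency matrix; y: affinities."""
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0, n_jobs=-1)
    rf.fit(X, y)
    return rf.feature_importances_   # normalized so the scores sum to 1
```

Sorting the returned vector in descending order gives the feature ranking of Figure 7.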
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Dipeptide frequency of word frequency VIM score. Its X-axis and Y-axis are 25 kinds of amino acids. Each point represents the importance score of the corresponding dipeptide frequency of word frequency characteristic variable. The color from white to purple represents the score from low to high.</p></caption>
<graphic xlink:href="fbioe-08-00267-g0006.tif"/>
</fig>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Features ranking diagram with contribution &#x0003E;0.5%.</p></caption>
<graphic xlink:href="fbioe-08-00267-g0007.tif"/>
</fig>
</sec>
<sec>
<title>Comparison of Existing Models</title>
<p>The predictor in our work was compared with the state-of-the-art methods DeepDTA, WideDTA, and GraphDTA using an independent test set on Davis and KIBA. The results are shown in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Algorithm comparison experiment results.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Features</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>KIBA</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>Davis</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center"><bold>MSE</bold></th>
<th valign="top" align="center"><bold>CI</bold></th>
<th valign="top" align="center"><bold>MSE</bold></th>
<th valign="top" align="center"><bold>CI</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">DeepDTA</td>
<td valign="top" align="center">0.194</td>
<td valign="top" align="center">0.863</td>
<td valign="top" align="center">0.261</td>
<td valign="top" align="center">0.878</td>
</tr>
<tr>
<td valign="top" align="left">WideDTA</td>
<td valign="top" align="center">0.179</td>
<td valign="top" align="center">0.875</td>
<td valign="top" align="center">0.262</td>
<td valign="top" align="center">0.886</td>
</tr>
<tr>
<td valign="top" align="left">GraphDTA</td>
<td valign="top" align="center">0.139</td>
<td valign="top" align="center">0.891</td>
<td valign="top" align="center">0.229</td>
<td valign="top" align="center">0.893</td>
</tr>
<tr>
<td valign="top" align="left">This model</td>
<td valign="top" align="center"><bold>0.126</bold></td>
<td valign="top" align="center"><bold>0.901</bold></td>
<td valign="top" align="center"><bold>0.220</bold></td>
<td valign="top" align="center"><bold>0.899</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The bold values are maximum</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>Our method outperformed the state-of-the-art methods on the two main quality metrics, CI and MSE, on Davis and KIBA. Compared with the DeepDTA and WideDTA models, our model reduced the MSE by 0.068 and 0.053 and increased the CI by 0.038 and 0.026, respectively, on the KIBA dataset; on the Davis dataset, the MSE decreased by 0.041 and 0.042 and the CI increased by 0.021 and 0.013, respectively. This shows that a graph neural network model whose input is the graph structure of the drug can obtain better performance. Our method also outperformed the GraphDTA model using the same graph convolutional neural network: the MSE decreased by 9% (0.013) and the CI increased by 0.010 on the KIBA dataset, and the MSE decreased by 4% (0.009) and the CI increased by 0.006 on the Davis dataset. This shows that the dipeptide frequency of word frequency expresses target proteins better and yields better prediction models than 1D coding.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="s4">
<title>Conclusion</title>
<p>DTA prediction plays an important role in the discovery of new drugs. The dipeptide frequency of word frequency, a novel feature extraction method, is employed to represent protein sequences using natural language processing techniques. In addition, we use graphs to represent drug structures, where the nodes are constructed from five different features and the edges represent atomic bond relationships. A network model is constructed consisting of three parts: a convolutional neural network, a graph convolutional neural network, and fully connected layers. The convolutional neural network, whose input is the dipeptide frequency of word frequency, extracts hidden relationships from the protein data, while the graph convolutional neural network extracts hidden relationships from the drug graphs. The outputs of the two networks are combined and mapped through the fully connected layers to predict DTA. The peptide frequency comparison experiments showed that the dipeptide divides the spatial relationships better than the single peptide and tripeptide, so better model performance can be obtained. The dipeptide frequency comparison experiments showed that adding the word frequency characteristics to the dipeptide frequency can reduce the feature differences. In comparison with state-of-the-art models, our model improves on the DeepDTA and WideDTA models, indicating that graphs can express the structure of drugs better, and it also outperforms the GraphDTA model that uses the same graph convolutional neural network: on the KIBA dataset, the MSE decreased by 9% (0.013) and the CI increased by 0.010, and on the Davis dataset, the MSE decreased by 4% (0.009) and the CI increased by 0.006. This shows that the dipeptide frequency of word frequency can represent protein sequences better.
Through the analysis of protein features, we observed that the feature vectors have suitable differences and intensity when the average feature score is below 0.014 and the variance is below 0.015, which is more conducive to spatial division. In the analysis of variable importance, PE, WT, AA, EB, and VV had high contributions to model prediction, among which PE (the combination of proline and glutamic acid) was highest at 20.1%. Besides drug discovery, the dipeptide frequency of word frequency proposed in this work may also be applied in other fields to represent protein sequences. Thus, it has practical significance.</p>
</sec>
<sec sec-type="data-availability-statement" id="s5">
<title>Data Availability Statement</title>
<p>The datasets [Davis] for this study can be found in the [Comprehensive analysis of kinase inhibitor selectivity] [<ext-link ext-link-type="uri" xlink:href="https://www.nature.com/articles/nbt.1990">https://www.nature.com/articles/nbt.1990</ext-link>]. The datasets [KIBA] for this study can be found in the [Making Sense of Large-Scale Kinase Inhibitor Bioactivity Data Sets: A Comparative and Integrative Analysis] [<ext-link ext-link-type="uri" xlink:href="https://pubs.acs.org/doi/10.1021/ci400709d">https://pubs.acs.org/doi/10.1021/ci400709d</ext-link>].</p>
</sec>
<sec id="s6">
<title>Author Contributions</title>
<p>XW and YL designed the study and wrote the manuscript. FL translated the manuscript. HL and PG analyzed the data and drew the illustrations. DW provided theoretical guidance on drug targets. All authors have read and approved the final manuscript.</p>
<sec>
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<sec sec-type="supplementary-material" id="s7">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fbioe.2020.00267/full&#x00023;supplementary-material">https://www.frontiersin.org/articles/10.3389/fbioe.2020.00267/full&#x00023;supplementary-material</ext-link></p>
<supplementary-material xlink:href="Table_1.XLSX" id="SM1" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_2.XLSX" id="SM2" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_3.XLSX" id="SM3" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_4.XLSX" id="SM4" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ban</surname> <given-names>T.</given-names></name> <name><surname>Ohue</surname> <given-names>M.</given-names></name> <name><surname>Akiyama</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>NRLMF&#x003B2;: &#x003B2;-distribution-rescored neighborhood regularized logistic matrix factorization for improving the performance of drug-target interaction prediction</article-title>. <source>Biochem. Biophys. Rep.</source> <volume>18</volume>, <fpage>100615</fpage>&#x02013;<lpage>100615</lpage>. <pub-id pub-id-type="doi">10.1016/j.bbrep.2019.01.008</pub-id><pub-id pub-id-type="pmid">30793050</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Belgiu</surname> <given-names>M.</given-names></name> <name><surname>Dr&#x00103;gut</surname> <given-names>L. J.</given-names></name> <name><surname>Sensing</surname> <given-names>R.</given-names></name></person-group> (<year>2016</year>). <article-title>Random forest in remote sensing: a review of applications and future directions</article-title>. <source>ISPRS J. Photogrammetry Remote Sens.</source> <volume>114</volume>, <fpage>24</fpage>&#x02013;<lpage>31</lpage>. <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2016.01.011</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cer</surname> <given-names>R.Z.</given-names></name> <name><surname>Mudunuri</surname> <given-names>U.</given-names></name> <name><surname>Stephens</surname> <given-names>R.</given-names></name> <name><surname>Lebeda</surname> <given-names>F. J.</given-names></name></person-group> (<year>2009</year>). <article-title>IC50-to-K-i: a web-based tool for converting IC50 to K-i values for inhibitors of enzyme activity and ligand binding</article-title>. <source>Nucleic Acids Res.</source> <volume>37</volume>, <fpage>W441</fpage>&#x02013;<lpage>W445</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkp253</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chu</surname> <given-names>Y.</given-names></name> <name><surname>Kaushik</surname> <given-names>A.C.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>W.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Shan</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>DTI-CDF: a cascade deep forest model towards the prediction of drug-target interactions based on hybrid features</article-title>. <source>Brief. Bioinform</source>. <volume>2019</volume>:<fpage>bbz152</fpage>. <pub-id pub-id-type="doi">10.1093/bib/bbz152</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohen</surname> <given-names>P.</given-names></name></person-group> (<year>2002</year>). <article-title>Protein kinases - The major drug targets of the twenty-first century?</article-title> <source>Nat. Rev. Drug Discov.</source> <volume>1</volume>, <fpage>309</fpage>&#x02013;<lpage>315</lpage>. <pub-id pub-id-type="doi">10.1038/nrd773</pub-id><pub-id pub-id-type="pmid">12120282</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Davis</surname> <given-names>M. I.</given-names></name> <name><surname>Hunt</surname> <given-names>J. P.</given-names></name> <name><surname>Herrgard</surname> <given-names>S.</given-names></name> <name><surname>Ciceri</surname> <given-names>P.</given-names></name> <name><surname>Wodicka</surname> <given-names>L. M.</given-names></name> <name><surname>Pallares</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2011</year>). <article-title>Comprehensive analysis of kinase inhibitor selectivity</article-title>. <source>Nat. Biotechnol.</source> <volume>29</volume>, <fpage>1046</fpage>&#x02013;<lpage>1051</lpage>. <pub-id pub-id-type="doi">10.1038/nbt.1990</pub-id><pub-id pub-id-type="pmid">22037378</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ezzat</surname> <given-names>A.</given-names></name> <name><surname>Wu</surname> <given-names>M.</given-names></name> <name><surname>Li</surname> <given-names>X.-L.</given-names></name> <name><surname>Kwoh</surname> <given-names>C.-K.</given-names></name></person-group> (<year>2019</year>). <article-title>Computational prediction of drug-target interactions using chemogenomic approaches: an empirical survey</article-title>. <source>Brief. Bioinform.</source> <volume>20</volume>, <fpage>1337</fpage>&#x02013;<lpage>1357</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bby002</pub-id><pub-id pub-id-type="pmid">29377981</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Landrum</surname> <given-names>G.</given-names></name></person-group> (<year>2013</year>). <source>RDKit: Cheminformatics and Machine Learning Software</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://sourceforge.net/projects/rdkit/">https://sourceforge.net/projects/rdkit/</ext-link></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>T.</given-names></name> <name><surname>Heidemeyer</surname> <given-names>M.</given-names></name> <name><surname>Ban</surname> <given-names>F.</given-names></name> <name><surname>Cherkasov</surname> <given-names>A.</given-names></name> <name><surname>Ester</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>SimBoost: a read-across approach for predicting drug-target binding affinities using gradient boosting machines</article-title>. <source>J. Cheminform.</source> <volume>9</volume>:<fpage>24</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-017-0209-z</pub-id><pub-id pub-id-type="pmid">29086119</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaur</surname> <given-names>J.</given-names></name> <name><surname>Jatinderkumar</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>Designing punjabi poetry classifiers using machine learning and different textual features</article-title>. <source>Int. Arab J. Inform. Tech.</source> <volume>17</volume>, <fpage>38</fpage>&#x02013;<lpage>44</lpage>. <pub-id pub-id-type="doi">10.34028/iajit/17/1/5</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Keogh</surname> <given-names>E.</given-names></name> <name><surname>Mueen</surname> <given-names>A.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;Curse of dimensionality,&#x0201D;</article-title> in <source>Encyclopedia of Machine Learning and Data Mining</source> (Boston, MA: Springer). <pub-id pub-id-type="doi">10.1007/978-1-4899-7687-1_192</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Keum</surname> <given-names>J.</given-names></name> <name><surname>Nam</surname> <given-names>H.</given-names></name></person-group> (<year>2017</year>). <article-title>SELF-BLM: prediction of drug-target interactions via self-training SVM</article-title>. <source>PLoS ONE</source> <volume>12</volume>:<fpage>e0171839</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0171839</pub-id><pub-id pub-id-type="pmid">28192537</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khurana</surname> <given-names>S.</given-names></name> <name><surname>Rawi</surname> <given-names>R.</given-names></name> <name><surname>Kunji</surname> <given-names>K.</given-names></name> <name><surname>Chuang</surname> <given-names>G.-Y.</given-names></name> <name><surname>Bensmail</surname> <given-names>H.</given-names></name> <name><surname>Mall</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <article-title>DeepSol: a deep learning framework for sequence-based protein solubility prediction</article-title>. <source>Bioinformatics</source> <volume>34</volume>, <fpage>2605</fpage>&#x02013;<lpage>2613</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bty166</pub-id><pub-id pub-id-type="pmid">29554211</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kipf</surname> <given-names>T. N.</given-names></name> <name><surname>Welling</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Semi-supervised classification with graph convolutional networks,&#x0201D;</article-title> in <source>Proceedings of the International Conference on Learning Representations (ICLR)</source>. arXiv:1609.02907.</citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Le</surname> <given-names>N. Q. K.</given-names></name> <name><surname>Ho</surname> <given-names>Q. T.</given-names></name> <name><surname>Yapp</surname> <given-names>E. K. Y.</given-names></name> <name><surname>Ou</surname> <given-names>Y. Y.</given-names></name> <name><surname>Yeh</surname> <given-names>H. Y.</given-names></name></person-group> (<year>2020</year>). <article-title>DeepETC: a deep convolutional neural network architecture for investigating and classifying electron transport chain&#x00027;s complexes</article-title>. <source>Neurocomputing</source> <volume>375</volume>, <fpage>71</fpage>&#x02013;<lpage>79</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2019.09.070</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Fang</surname> <given-names>H.</given-names></name> <name><surname>Reagan</surname> <given-names>K.</given-names></name> <name><surname>Xu</surname> <given-names>X.</given-names></name> <name><surname>Mendrick</surname> <given-names>D. L.</given-names></name> <name><surname>Slikker</surname> <given-names>W.</given-names></name> <etal/></person-group>. (<year>2012</year>). <article-title><italic>In silico</italic> drug repositioning - what we need to know</article-title>. <source>Drug Discov. Today</source> <volume>18</volume>, <fpage>110</fpage>&#x02013;<lpage>115</lpage>. <pub-id pub-id-type="doi">10.1016/j.drudis.2012.08.005</pub-id><pub-id pub-id-type="pmid">22935104</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Noble</surname> <given-names>M. E. M.</given-names></name> <name><surname>Endicott</surname> <given-names>J. A.</given-names></name> <name><surname>Johnson</surname> <given-names>L. N.</given-names></name></person-group> (<year>2004</year>). <article-title>Protein kinase inhibitors: insights into drug design from structure</article-title>. <source>Science</source> <volume>303</volume>, <fpage>1800</fpage>&#x02013;<lpage>1805</lpage>. <pub-id pub-id-type="doi">10.1126/science.1095920</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nanni</surname> <given-names>L.</given-names></name> <name><surname>Lumini</surname> <given-names>A.</given-names></name> <name><surname>Pasquali</surname> <given-names>F.</given-names></name> <name><surname>Brahnam</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>iProStruct2D: identifying protein structural classes by deep learning via 2D representations</article-title>. <source>Exp. Systems Appl.</source> <volume>142</volume>:<fpage>113019</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2019.113019</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nguyen</surname> <given-names>T.</given-names></name> <name><surname>Le</surname> <given-names>H.</given-names></name> <name><surname>Venkatesh</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>GraphDTA: prediction of drug-target binding affinity using graph convolutional networks</article-title>. <source>bioRxiv [preprint]</source>. <pub-id pub-id-type="doi">10.1101/684662</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Olayan</surname> <given-names>R. S.</given-names></name> <name><surname>Ashoor</surname> <given-names>H.</given-names></name> <name><surname>Bajic</surname> <given-names>V. B.</given-names></name></person-group> (<year>2018</year>). <article-title>DDR: efficient computational method to predict drug-target interactions using graph mining and machine learning approaches (vol 34, pg 1164, 2018)</article-title>. <source>Bioinformatics</source> <volume>34</volume>, <fpage>3779</fpage>&#x02013;<lpage>3779</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bty417</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oprea</surname> <given-names>T. I.</given-names></name> <name><surname>Mestres</surname> <given-names>J.</given-names></name></person-group> (<year>2012</year>). <article-title>Drug repurposing: far beyond new targets for old drugs</article-title>. <source>AAPS J.</source> <volume>14</volume>, <fpage>759</fpage>&#x02013;<lpage>763</lpage>. <pub-id pub-id-type="doi">10.1208/s12248-012-9390-1</pub-id><pub-id pub-id-type="pmid">22826034</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ozturk</surname> <given-names>H.</given-names></name> <name><surname>Ozgur</surname> <given-names>A.</given-names></name> <name><surname>Ozkirimli</surname> <given-names>E.</given-names></name></person-group> (<year>2018</year>). <article-title>DeepDTA: deep drug-target binding affinity prediction</article-title>. <source>Bioinformatics</source> <volume>34</volume>, <fpage>i821</fpage>&#x02013;<lpage>i829</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bty593</pub-id><pub-id pub-id-type="pmid">30423097</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>&#x000D6;zt&#x000FC;rk</surname> <given-names>H.</given-names></name> <name><surname>Ozkirimli</surname> <given-names>E.</given-names></name> <name><surname>&#x000D6;zg&#x000FC;r</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>WideDTA: prediction of drug-target binding affinity</article-title>. <source>Bioinformartics</source> <volume>34</volume>, <fpage>i821</fpage>&#x02013;<lpage>i829</lpage>.</citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rabovsky</surname> <given-names>M.</given-names></name> <name><surname>McClelland</surname> <given-names>J. L.</given-names></name></person-group> (<year>2020</year>). <article-title>Quasi-compositional mapping from form to meaning: a neural network-based approach to capturing neural responses during human language comprehension</article-title>. <source>Philos. Trans. R. Soc. B Biol. Sci.</source> <volume>375</volume>:<fpage>20190313</fpage>. <pub-id pub-id-type="doi">10.1098/rstb.2019.0313</pub-id><pub-id pub-id-type="pmid">31840583</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rawi</surname> <given-names>R.</given-names></name> <name><surname>Mall</surname> <given-names>R.</given-names></name> <name><surname>Kunji</surname> <given-names>K.</given-names></name> <name><surname>Shen</surname> <given-names>C.-H.</given-names></name> <name><surname>Kwong</surname> <given-names>P. D.</given-names></name> <name><surname>Chuang</surname> <given-names>G.-Y.</given-names></name></person-group> (<year>2018</year>). <article-title>PaRSnIP: sequence-based protein solubility prediction using gradient boosting machine</article-title>. <source>Bioinformatics</source> <volume>34</volume>, <fpage>1092</fpage>&#x02013;<lpage>1098</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btx662</pub-id><pub-id pub-id-type="pmid">29069295</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roses</surname> <given-names>A. D.</given-names></name></person-group> (<year>2008</year>). <article-title>Pharmacogenetics in drug discovery and development: a translational perspective</article-title>. <source>Nat. Rev. Drug Discov.</source> <volume>7</volume>, <fpage>807</fpage>&#x02013;<lpage>817</lpage>. <pub-id pub-id-type="doi">10.1038/nrd2593</pub-id><pub-id pub-id-type="pmid">18806753</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Santos</surname> <given-names>R.</given-names></name> <name><surname>Ursu</surname> <given-names>O.</given-names></name> <name><surname>Gaulton</surname> <given-names>A.</given-names></name> <name><surname>Bento</surname> <given-names>A. P.</given-names></name> <name><surname>Donadi</surname> <given-names>R. S.</given-names></name> <name><surname>Bologa</surname> <given-names>C. G.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>A comprehensive map of molecular drug targets</article-title>. <source>Nat. Rev. Drug Discov.</source> <volume>16</volume>, <fpage>19</fpage>&#x02013;<lpage>34</lpage>. <pub-id pub-id-type="doi">10.1038/nrd.2016.230</pub-id><pub-id pub-id-type="pmid">27910877</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Strobl</surname> <given-names>C.</given-names></name> <name><surname>Boulesteix</surname> <given-names>A.-L.</given-names></name> <name><surname>Zeileis</surname> <given-names>A.</given-names></name> <name><surname>Hothorn</surname> <given-names>T.</given-names></name></person-group> (<year>2007</year>). <article-title>Bias in random forest variable importance measures: illustrations, sources and a solution</article-title>. <source>BMC Bioinform.</source> <volume>8</volume>:<fpage>25</fpage>. <pub-id pub-id-type="doi">10.1186/1471-2105-8-25</pub-id><pub-id pub-id-type="pmid">17254353</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Szwajda</surname> <given-names>A.</given-names></name> <name><surname>Shakyawar</surname> <given-names>S.</given-names></name> <name><surname>Xu</surname> <given-names>T.</given-names></name> <name><surname>Hintsanen</surname> <given-names>P.</given-names></name> <name><surname>Wennerberg</surname> <given-names>K.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis</article-title>. <source>J. Chem. Inform. Model.</source> <volume>54</volume>, <fpage>735</fpage>&#x02013;<lpage>743</lpage>. <pub-id pub-id-type="doi">10.1021/ci400709d</pub-id><pub-id pub-id-type="pmid">24521231</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Veli&#x0010D;kovi&#x00107;</surname> <given-names>P.</given-names></name> <name><surname>Cucurull</surname> <given-names>G.</given-names></name> <name><surname>Casanova</surname> <given-names>A.</given-names></name> <name><surname>Romero</surname> <given-names>A.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Graph attention networks,&#x0201D;</article-title> in <source>Proceedings of the International Conference on Learning Representations (ICLR)</source>.</citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>You</surname> <given-names>Z.-H.</given-names></name> <name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Yan</surname> <given-names>X.</given-names></name> <name><surname>Liu</surname> <given-names>G.</given-names></name> <name><surname>Zhang</surname> <given-names>W.</given-names></name></person-group> (<year>2018</year>). <article-title>RFDT: a rotation forest-based predictor for predicting drug-target interactions using drug structure and protein sequence information</article-title>. <source>Curr. Protein Peptide Sci.</source> <volume>19</volume>, <fpage>445</fpage>&#x02013;<lpage>454</lpage>. <pub-id pub-id-type="doi">10.2174/1389203718666161114111656</pub-id><pub-id pub-id-type="pmid">27842479</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>K.</given-names></name> <name><surname>Hu</surname> <given-names>W.</given-names></name> <name><surname>Leskovec</surname> <given-names>J.</given-names></name> <name><surname>Jegelka</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;How Powerful are Graph Neural Networks?,&#x0201D;</article-title> in <source>Proceedings of the International Conference on Learning Representations (ICLR)</source>.</citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yan</surname> <given-names>X. Y.</given-names></name> <name><surname>Zhang</surname> <given-names>S. W.</given-names></name> <name><surname>He</surname> <given-names>C. R.</given-names></name></person-group> (<year>2019</year>). <article-title>Prediction of drug-target interaction by integrating diverse heterogeneous information source with multiple kernel learning and clustering methods</article-title>. <source>Comput. Biol. Chem.</source> <volume>78</volume>, <fpage>460</fpage>&#x02013;<lpage>467</lpage>. <pub-id pub-id-type="doi">10.1016/j.compbiolchem.2018.11.028</pub-id><pub-id pub-id-type="pmid">30528728</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>Q. H.</given-names></name> <name><surname>Zhong</surname> <given-names>Y. N.</given-names></name> <name><surname>Gillespie</surname> <given-names>C.</given-names></name> <name><surname>Merritt</surname> <given-names>R.</given-names></name> <name><surname>Bowman</surname> <given-names>B.</given-names></name> <name><surname>George</surname> <given-names>M. G.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Assessing potential population impact of statin treatment for primary prevention of atherosclerotic cardiovascular diseases in the USA: population-based modelling study</article-title>. <source>BMJ Open</source> <volume>7</volume>:<fpage>e011684</fpage>. <pub-id pub-id-type="doi">10.1136/bmjopen-2016-011684</pub-id><pub-id pub-id-type="pmid">28119384</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>X.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name> <name><surname>Tian</surname> <given-names>Y.</given-names></name> <name><surname>Lu</surname> <given-names>B.</given-names></name> <name><surname>Hang</surname> <given-names>Y. Y.</given-names></name> <name><surname>Chen</surname> <given-names>Q. S.</given-names></name></person-group> (<year>2020</year>). <article-title>Development of deep learning method for lead content prediction of lettuce leaf using hyperspectral images</article-title>. <source>Int. J. Remote Sens.</source> <volume>41</volume>, <fpage>2263</fpage>&#x02013;<lpage>2276</lpage>. <pub-id pub-id-type="doi">10.1080/01431161.2019.1685721</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> This work was supported by grants from the Key Research Area Grant 2016YFA0501703 of the Ministry of Science and Technology of China, the National Natural Science Foundation of China (Contract nos. 61802116, 61832019, and 61503244), the State Key Lab of Microbial Metabolism, and the Joint Research Funds for Medical and Engineering and Scientific Research at Shanghai Jiao Tong University (YG2017ZD14).</p>
</fn>
</fn-group>
</back>
</article>