Edited by: Liang Cheng, Harbin Medical University, China

Reviewed by: Junwei Han, Harbin Medical University, China; Ying Wang, Xiamen University, China

This article was submitted to Molecular Medicine, a section of the journal Frontiers in Cell and Developmental Biology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Mitochondria play essential roles in eukaryotic cells, especially in Plasmodium cells. They have several unusual evolutionary and functional features that are vital for disease diagnosis and drug design. Predicting the mitochondrial proteins of Plasmodium is therefore a worthwhile task. However, existing computational methods can only predict mitochondrial proteins of

The parasite Plasmodium is the main cause of malaria, and kills more than one million African children annually (Phillips et al.,

Traditional methods for predicting protein functions are based on biological experiments, which are costly and time-consuming. Researchers have therefore proposed computational methods to predict protein functions (Wei et al.,

Many methods exist for extracting features from protein primary sequences. Methods that extract amino acid, dipeptide, and tripeptide features generate fixed-length data for protein sequences of different lengths. Nakashima and Nishikawa (

Many approaches exist for predicting mitochondrial proteins of

The proteins of PM275 are selected from UniProtKB/Swiss-Prot (release 2020_01) by the following rules: (1) the sequence contains no ambiguous amino acids, such as “B,” “X,” and “Z;” (2) the function has been confirmed by biological experiments; (3) the sequence is longer than 50 residues. We obtain 54 mitochondrial proteins as positive examples and 340 non-mitochondrial proteins as negative examples, including cytosol proteins, secreted proteins, and apicoplast proteins. Next, we used CD-HIT (Fu et al.,

PfM175 is mainly used to predict the mitochondrial proteins of

A protein sequence needs an efficient mathematical representation that captures its inherent connection with the classes to be predicted. To efficiently identify mitochondrial proteins of Plasmodium and build a robust model, we jointly considered three sets of features derived from the protein primary sequence.

AAC has low complexity and has been widely used to predict the function of proteins. Given a protein sequence

where _{i} = _{i}/_{i} is the frequency of the

DPC computes the frequencies of pairs of contiguous amino acids. A protein sequence can be represented by a 400-dimensional vector. DPC contains information about the proportion of amino acids as well as the order of the sequence.

where

TPC computes the frequencies of three contiguous amino acids. A protein sequence can be represented by an 8000-dimensional vector.

where
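The three feature sets differ only in the length of the contiguous window that is counted, so they can be sketched with one function. The following is a minimal illustration, not the authors' implementation: the function name `kmer_composition`, the example sequence, and the choice to normalize counts by the number of overlapping windows are assumptions, since the original normalization is not recoverable from the text above.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k):
    """Return the frequency vector of contiguous k-mers (length 20**k).

    k=1 gives an AAC-style vector (20 values), k=2 a DPC-style vector
    (400 values), and k=3 a TPC-style vector (8000 values).
    Assumption: counts are divided by the number of overlapping windows.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    total = len(sequence) - k + 1  # number of overlapping windows
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = [0] * len(kmers)
    for start in range(total):
        window = sequence[start:start + k]
        counts[index[window]] += 1  # raises KeyError on non-standard residues
    return [c / total for c in counts]


seq = "MKVLAAGIVKM"  # made-up sequence, for illustration only
aac = kmer_composition(seq, 1)
dpc = kmer_composition(seq, 2)
tpc = kmer_composition(seq, 3)
```

Because every sequence maps to a vector of the same length regardless of its own length, the output can be passed directly to a classifier; most of the 8000 TPC entries are zero for any single protein.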

SVM is a powerful and efficient machine learning algorithm for linear and non-linear classification and for regression. Compared with other machine learning algorithms, the advantage of the SVM algorithm is that the dimension of SVM parameters equals the number of training samples (Zavaljevski et al.,

The SVM algorithm aims to find an optimal hyperplane that correctly separates the samples of two classes in feature space. The optimal hyperplane is obtained by maximizing the separating margin on the training set and is determined by a subset of the training vectors, known as support vectors. For linearly separable classification problems, the optimal hyperplane can be obtained directly by solving a constrained optimization problem. For non-linear classification, SVM introduces a kernel function that transforms the non-linear classification problem into a linear one (Amari and Wu, [2^{−1}, 2^{3}] and [2^{−4}, 2^{−1}] with a step of 2.

TPC yields 8000 feature values for a protein. However, these feature values may contain redundant and noisy information that affects the trained model and can eventually lower prediction accuracy. It is therefore vital to select appropriate features from TPC to improve prediction accuracy. The analysis of variance (ANOVA) can filter out the tripeptides with low variance, which suits TPC because its vectors contain many zero values. ANOVA tests for differences in the means of two or more samples. It computes an F-value from the variation within groups and the variation among groups (Anderson,

where S_{b}^{2} and S_{w}^{2} are calculated by the following formulas:

where K and M represent the number of groups and total number of samples. _{ξ}(_{i} represents the number of samples in the
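The F-value of a single tripeptide feature can be sketched as follows. This uses the standard one-way ANOVA definitions (between-group variance with K − 1 degrees of freedom, within-group variance with M − K), which are assumed here to match the formulas referred to above; the function name `anova_f_value` and the example values are hypothetical.

```python
def anova_f_value(groups):
    """One-way ANOVA F-value for one feature.

    `groups` is a list of lists: the feature's values in each group
    (for example, mitochondrial and non-mitochondrial proteins).
    """
    k = len(groups)                    # number of groups (K)
    m = sum(len(g) for g in groups)    # total number of samples (M)
    grand_mean = sum(sum(g) for g in groups) / m
    means = [sum(g) / len(g) for g in groups]

    # Between-group variance: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (mu - grand_mean) ** 2
                     for g, mu in zip(groups, means))
    s_b2 = ss_between / (k - 1)

    # Within-group variance: spread of the samples around their own group mean.
    ss_within = sum((x - mu) ** 2
                    for g, mu in zip(groups, means) for x in g)
    s_w2 = ss_within / (m - k)

    if s_w2 == 0:
        # Degenerate case: no variation inside any group.
        return float("inf") if s_b2 > 0 else 0.0
    return s_b2 / s_w2


# Made-up values of one feature in two groups, for illustration only.
f_value = anova_f_value([[1, 2, 3], [2, 3, 4]])
```

A larger F-value means the feature's mean differs more between the groups relative to its scatter within each group, which is why features are ranked in descending order of F.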

We use 5-fold cross-validation to assess the prediction performance of our method. First, we randomly divide the dataset into five mutually exclusive subsets of similar size. Second, we choose one subset as the testing dataset and the other four subsets as the training dataset. We repeat this so that each subset serves once as the testing dataset, giving five rounds of training and testing, and report the average of the five test results.
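The procedure above can be sketched in a few lines. This is an illustrative outline under stated assumptions: the names `k_fold_indices` and `cross_validate`, the fixed random seed, and the `train_and_score` callback (a stand-in for training a classifier on the training indices and scoring it on the testing indices) are hypothetical and not part of the original method description.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Randomly split sample indices into k mutually exclusive folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index yields folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, train_and_score, k=5, seed=0):
    """Average the score over k rounds, each fold serving once as test set."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


folds = k_fold_indices(275)  # 275 is the size of PM275
```

Note that this split is purely random, as the text describes; it does not stratify by class, so the number of positive examples may differ between folds.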

Six metrics are used to evaluate the methods: accuracy, sensitivity, precision, recall, F-score, and the Matthews correlation coefficient (MCC). The formulas are as follows:

Here TP represents the number of mitochondrial proteins predicted correctly, FP represents the number of non-mitochondrial proteins incorrectly predicted as mitochondrial, TN represents the number of non-mitochondrial proteins predicted correctly, and FN represents the number of mitochondrial proteins incorrectly predicted as non-mitochondrial.
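Given the four counts defined above, the metrics can be computed as in the sketch below. It uses the standard binary definitions, which are an assumption here because the original formulas are not available in this text. Under these definitions recall coincides with sensitivity; since the metrics are listed separately above, the original recall, precision, and F-score may be defined differently (for example, averaged over both classes). The function name `binary_metrics` and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall of the positive class
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + sensitivity:
        f_score = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_score = 0.0
    # MCC is set to 0 when any marginal total is zero (undefined otherwise).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f_score": f_score,
        "mcc": mcc,
    }


# Made-up counts, for illustration only.
metrics = binary_metrics(tp=8, fp=2, tn=8, fn=2)
```

MCC is informative on imbalanced data such as PM275: a classifier that labels every protein as non-mitochondrial reaches high accuracy but an MCC of 0.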

We first evaluate the performance of AAC, DPC, and TPC as features with different machine learning algorithms as classifiers. The results show that TPC performs better than the other feature sets (

Cross-validation performances of AAC with different classifiers on PM275.

LR | 12.73% | 56.36% | ||||

NB | 57.82% | 58.16% | 50.68% | 0.12 | ||

SVM | 80.36% | 1.82% | 40.18% | 50% | 44.56% | 0 |

Cross-validation performances of DPC, TPC, and combination features using different classifiers on PM275.

LR | 83.27% | 33.27% | 76.59% | 64.36% | 66.84% | 0.39 | |

DPC | NB | 60.73% | 63.55% | 70.49% | 57.46% | 0.33 | |

SVM | 86.18% | 29.45% | 92.68% | 72.73% | 68.33% | 0.49 | |

LR | 82.55% | 31.45% | 74.25% | 63.23% | 65.32% | 0.35 | |

AAC+DPC | NB | 60.73% | 63.61% | 70.48% | 57.42% | 0.33 | |

SVM | 85.82% | 27.64% | 92.52% | 63.82% | 67.18% | 0.48 | |

LR | 88.73% | 42.55% | 93.91% | 71.27% | 75.54% | 0.60 | |

TPC | NB | 82.91% | 12.73% | 81.24% | 56.36% | 56.09% | 0.29 |

SVM | 48.18% |

Cross-validation performances of optimized TPC using different classifiers on PM275.

LR | 984 | 91.27% | 55.64% | 95.13% | 77.82% | 82.8% | 0.71 |

NB | 2578 | 85.09% | 24% | 92.22% | 62% | 63.98% | 0.43 |

SVM | 399 |

Cross-validation performance of PM-OTC compared with other methods on PfM175.

PlasMit (Bender et al., | 90.00% | 94.00% | 89.00% | 0.74 |

PFMpred (Verma et al., | 92.00% | 97.50% | 90.40% | 0.81 |

PfMP-N25 (Jia et al., | 96.00% | 87.50% | 98.50% | 0.93 |

PfMP-30 (Jia et al., | 98.80% | 97.50% | 0.97 | |

ID (Chen et al., | 92.00% | 89.63% | 0.82 | |

Ding (Ding and Li, | 97.10% | 90.00% | 0.92 | |

Our method | 97.50% | 98.75% |

We plot a histogram based on the frequency of each amino acid for each protein from PM275 (

Amino acid composition of 54 mitochondrial proteins (mito) and 221 non-mitochondrial proteins (non_mito). The abscissa represents the abbreviation of amino acid, and the ordinate represents the percentage content of amino acid.

Next, we consider three feature sets: DPC (Equations 2, 3), DPC combined with AAC, and TPC (Equations 4, 5). We input these three feature sets into three classifiers (LR, NB, SVM). Results are recorded in

We use ANOVA and the IFS strategy to reduce the dimension of TPC and obtain the optimized TPC as the features. We rank the 8000 TPC features according to their F-values (Equation 6) and adopt IFS to generate 8000 subsets. We then input all 8000 subsets into three classifiers (LR, NB, SVM) and calculate the 5-fold cross-validation accuracy of each subset.
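The ranking-and-subset procedure can be sketched as below, reading IFS as incremental feature selection (the usual expansion of the acronym, assumed here): the first subset holds the top-ranked feature, and each later subset adds the next one. The function name and the `evaluate` callback, which stands in for the 5-fold cross-validation accuracy of a classifier on a feature subset, are hypothetical; the toy scoring function is made up for illustration.

```python
def incremental_feature_selection(f_values, evaluate):
    """Rank features by F-value and pick the best-scoring nested subset.

    `f_values[i]` is the F-value of feature i. `evaluate(subset)` returns
    a score for a list of feature indices (assumed to be a cross-validated
    accuracy in practice). Returns (best subset, best score, score curve).
    """
    ranked = sorted(range(len(f_values)),
                    key=lambda i: f_values[i], reverse=True)
    best_size, best_score = 0, float("-inf")
    curve = []  # score of the top-1, top-2, ..., top-n subsets
    for size in range(1, len(ranked) + 1):
        score = evaluate(ranked[:size])
        curve.append(score)
        if score > best_score:  # ties keep the smaller subset
            best_size, best_score = size, score
    return ranked[:best_size], best_score, curve


# Toy example: features 1 and 3 are informative, extra features cost 0.1 each.
informative = {1, 3}
toy_score = lambda subset: len(informative & set(subset)) - 0.1 * len(subset)
selected, score, curve = incremental_feature_selection(
    [0.5, 3.0, 0.1, 2.0], toy_score)
```

With 8000 ranked features this loop calls `evaluate` 8000 times per classifier, and plotting `curve` against subset size gives the kind of IFS curve described in the text.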

The IFS curve for predicting mitochondrial proteins of Plasmodium using three classifiers. The accuracies of the SVM and LR classifiers improve as the number of features initially increases. When the number of features exceeds 399, the accuracy of the SVM classifier decreases significantly and finally stabilizes. As the number of features increases, the accuracy of the NB classifier first decreases significantly and then gradually increases until it stabilizes.

The structure of the PM-OTC model. The input is the whole protein sequence. First, tripeptides are extracted from the raw sequence to obtain the 8000-dimensional TPC. ANOVA-based selection then reduces the TPC to a 399-dimensional feature vector, which is fed into the SVM classifier for prediction.

Most of the published methods can only predict mitochondrial proteins of

Predicting mitochondrial proteins of Plasmodium is key to treating malaria because the mitochondrion is a suitable target for anti-malarial drugs. Here we build PM-OTC to predict the mitochondrial proteins of Plasmodium instead of only predicting mitochondrial proteins of

PM-OTC uses the optimized TPC as features and an SVM as the classifier to predict mitochondrial proteins of Plasmodium. Its performance on PM275 indicates that PM-OTC performs well in predicting mitochondrial proteins of Plasmodium, with an accuracy of 94.91%. Its performance on PfM175 shows that PM-OTC improves the accuracy by 0.64−9.43% compared with other methods. PM-OTC is therefore efficient and effective in predicting mitochondrial proteins of

All datasets presented in this study are included in the article/supplementary material.

The PM-OTC software can be downloaded from

HB proposed the method. HB and JW designed the experiments. All authors wrote the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.