Edited by: Daniel B. Hier, Missouri University of Science and Technology, United States

Reviewed by: Karthik Seetharam, West Virginia State University, United States

This article was submitted to Health Informatics, a section of the journal Frontiers in Digital Health

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

We write to expand on Faes et al.'s recent publication “

Prediction models can be based on traditional statistical learning methods, such as regression, and on modern machine learning approaches, such as tree-based methods (random forests, XGBoost) and neural networks. These models can be evaluated along several axes. Measures of discrimination typically quantify the separation between low- vs. high-risk subjects, independent of the event rate. Measures of overall performance reflect both discrimination and calibration. Lastly, measures of clinical utility have been proposed, which consider the clinical context with respect to the event rate and the decision threshold that defines high vs. low risk.

We here highlight key measures focusing on discriminative ability and clinical utility [or effectiveness].

Evaluation measures from the statistics (S) and machine learning (ML) fields.

Measure | Field | Description |

Area under the receiver operating characteristic curve (AUROC) | S/ML | The receiver operating characteristic (ROC) curve plots sensitivity as a function of 1 − specificity. The baseline is fixed: a random classifier yields an AUROC of 0.5 regardless of the event rate. The area under the ROC curve can therefore be compared across settings with different event rates |

Area under the precision-recall curve (AUPRC) | ML | The precision-recall curve plots the precision (positive predictive value) as a function of sensitivity. The baseline is determined by the event rate (the ratio of positive cases to total cases). The area under the precision-recall curve therefore cannot be compared across settings with different event rates, and it ignores true negatives |

Crude accuracy | ML | Crude accuracy is the number of true positive and negative predictions divided by the total number of cases |

Sensitivity (recall) | S/ML | The sensitivity is the number of true positive predictions divided by the number of true positive cases at a specified probability threshold |

Specificity | S/ML | The specificity is the number of true negative predictions divided by the number of true negative cases at a specified probability threshold |

Positive predictive value (precision) | S/ML | The positive predictive value (PPV) is the number of true positive predictions divided by the total number of positive predictions at a specified probability threshold |

Negative predictive value | S/ML | The negative predictive value (NPV) is the number of true negative predictions divided by the total number of negative predictions at a specified probability threshold |

F_β-score | ML | The F_β-score is the harmonic mean of sensitivity and positive predictive value (PPV), weighted by the β coefficient: F_β = (1 + β²) · PPV · sensitivity / (β² · PPV + sensitivity). Common instances of the F_β-score are the F_1- and F_2-score. The F_1-score implies equal weight for false negative and false positive classifications, which is “absurd” for most medical contexts |

Net Benefit | S | Net Benefit is a weighted sum of true positive (TP) and false positive (FP) predictions at a given decision threshold (t): Net Benefit = TP/N − (FP/N) · t/(1 − t), where N is the total number of subjects |

Relative utility | S | Relative utility is the maximum net benefit of risk prediction at a given decision threshold divided by the maximum net benefit of perfect prediction. A relative utility curve plots relative utility over a range of decision thresholds |
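As an illustration, the threshold-based measures in the table above can all be computed from a single confusion matrix. The following is a minimal sketch with illustrative data, not code from the article; function names are our own:

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN, FN at a given probability threshold."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def measures(y_true, y_prob, threshold, beta=1.0):
    """Threshold-based evaluation measures from the table."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    n = tp + fp + tn + fn
    sens = tp / (tp + fn)          # sensitivity (recall)
    spec = tn / (tn + fp)          # specificity
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    acc = (tp + tn) / n            # crude accuracy
    f_beta = (1 + beta**2) * ppv * sens / (beta**2 * ppv + sens)
    # Net Benefit weights false positives by the odds of the threshold t
    nb = tp / n - fp / n * threshold / (1 - threshold)
    return {"sensitivity": sens, "specificity": spec, "PPV": ppv,
            "NPV": npv, "accuracy": acc, "F_beta": f_beta, "net_benefit": nb}

# Toy data: 3 events among 8 subjects, evaluated at threshold 0.5
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.4, 0.2, 0.6, 0.1, 0.8, 0.3, 0.05]
m = measures(y_true, y_prob, threshold=0.5)
# accuracy 0.75, F_1 ≈ 0.667, net benefit 0.125 for this toy data
```

Note that only Net Benefit uses the threshold itself as a weight; the other measures treat false positives and false negatives symmetrically or per the β coefficient.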

The precision-recall curve and F_1-score are often described in the machine learning field as “superior for imbalanced data.” However, both depend on the event rate, which impedes the comparison of model performance across settings with different event rates.
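The event rate dependence is visible in the baselines themselves. A short sketch with assumed data (variable names are illustrative): the chance-level precision, i.e., the precision of a classifier that labels everyone positive, equals the event rate, so the same AUPRC value means different things at different prevalences, whereas the ROC chance level is 0.5 everywhere.

```python
def pr_baseline(y_true):
    """Precision of a classifier that labels every subject positive:
    this equals the event rate (positives / total)."""
    return sum(y_true) / len(y_true)

rare = [1] * 1 + [0] * 99      # 1% event rate
common = [1] * 30 + [0] * 70   # 30% event rate

print(pr_baseline(rare))    # 0.01 -> the PR baseline shifts with prevalence
print(pr_baseline(common))  # 0.3  -> so AUPRC values are not comparable
```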

Some measures are considered outdated in the classical statistical learning field while still popular in the machine learning field. One such measure is crude accuracy (the fraction of correct classifications). Crude accuracy is event rate dependent: in a setting with a 1% event rate, classifying all subjects as “low risk” already yields 99% accuracy.
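The trivial-classifier pitfall above can be made concrete with a few lines (illustrative data only):

```python
# With a 1% event rate, predicting "low risk" for every subject
# already achieves 99% crude accuracy while detecting no events.
y_true = [1] * 1 + [0] * 99   # 1 event among 100 subjects
y_pred = [0] * 100            # classify every subject as low risk

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99, yet sensitivity is 0
```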

Decision analytical approaches move away from pure discrimination and toward clinical utility. Net Benefit is the most popular among several recently proposed measures of clinical utility.
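A minimal decision-curve sketch, assuming hypothetical data: the Net Benefit of a model is compared across thresholds with the default “treat all” strategy (the “treat none” strategy has Net Benefit 0 by definition). Function names are our own:

```python
def net_benefit(y_true, y_prob, t):
    """Net Benefit of the model at decision threshold t."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
    return tp / n - fp / n * t / (1 - t)

def net_benefit_treat_all(y_true, t):
    """Net Benefit of treating every subject: the event rate enters directly."""
    rate = sum(y_true) / len(y_true)
    return rate - (1 - rate) * t / (1 - t)

# Toy cohort with a 30% event rate
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
y_prob = [0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.7, 0.2, 0.1, 0.3]
for t in (0.1, 0.3, 0.5):
    print(t, net_benefit(y_true, y_prob, t), net_benefit_treat_all(y_true, t))
```

Plotting these values over a range of thresholds gives a decision curve; a model is clinically useful at thresholds where its Net Benefit exceeds both reference strategies.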

In conclusion, measures that are affected by the event rate are common in the machine learning field, such as the AUPRC, F_1-score, and crude accuracy. These impede the comparison of model performance across different settings. The medical decision-making context is better captured by modern measures such as Net Benefit, which consider not only the event rate but also the clinical consequences of false-positive vs. true-positive decisions (harm vs. benefit), rather than weighting these costs arbitrarily.

AH, BC, and ES conceived the idea, wrote the initial draft, edited, and approved the final manuscript. All authors contributed to the article and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.