This article was submitted to Medicine and Public Health, a section of the journal Frontiers in Artificial Intelligence

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The rise of artificial intelligence (AI) has transformed many aspects of human life, especially in healthcare, personal transport, law-making, and entertainment (

The goal of this project is to develop a model that can infer the causality of clinical outcome from unstructured pharmacovigilance reports. Causality (also referred to as causation or cause and effect) is the influence by which one event, process, or state (a cause) contributes to the production of another event, process or state (an effect). Causal inference is the process of identifying the cause and effect based on the conditions of the occurrence of the event (

One of the conventional approaches to prove cause and effect is a randomized controlled trial. In a randomized controlled trial, test subjects are randomly assigned to the treatment or control group; the groups are identical in every way except that one receives the drug (treatment) and the other receives a placebo (control). If the clinical outcome is better in one group than the other with statistical significance, then causality is established. However, conducting a randomized controlled trial to establish causal relationships is often time-consuming, expensive, and can be impractical in the real world. For example, it would be impractical to conduct a randomized controlled trial to demonstrate causality regarding the impact of a vegetarian diet on life expectancy. Thus, there is a pressing need to develop AI-powered language models that can identify potential causality from accumulated real-world data.

Only one attempt has been made so far to perform causal inference using text as a potential cause of an effect (

One of the potential applications of transformer-based language models for causal inference is pharmacovigilance. Pharmacovigilance, also known as drug safety, is the pharmacological science related to collecting, detecting, assessing, monitoring, and preventing adverse effects with pharmaceutical products (

In this study, we propose a novel transformer-based causal inference model—InferBERT, by integrating A Lite Bidirectional Encoder Representations from Transformers (ALBERT)

1. The FDA Adverse Event Reporting System (FAERS) case reports, including Analgesics-related acute liver failure and Tramadol-related mortalities, were extracted and preprocessed.

2. The preprocessed case reports were converted into the sentence-like descriptions for the subsequent pretrained language model ALBERT.

3. We fine-tuned the pretrained ALBERT model based on the transformed sentence-like descriptions to predict Analgesics-related acute liver failure and Tramadol-related mortalities, respectively.

4. Do-calculus was implemented into the fine-tuned ALBERT models for causal inference.

Workflow of the study.

The two critical aspects of causal relations in pharmacovigilance are 1) whether a drug causes a particular adverse drug reaction and 2) how the adverse drug reaction is causally related to different clinical factors. Therefore, we employed two FAERS datasets, Analgesics-induced acute liver failure and Tramadol-related mortalities, to investigate the performance of the proposed Deep Causal Pharmacovigilance (InferBERT) approach.

Analgesics or painkillers form a group of drugs used to achieve analgesia and relief from pain. Analgesics include acetaminophen (APAP), the nonsteroidal anti-inflammatory drugs (NSAIDs) such as the salicylates, and opioid drugs such as morphine and oxycodone. Analgesics are one of the most common causes of drug-induced acute liver failure (

Tramadol is an opioid-related medicine used to treat severe pain. In the United States, there is a Boxed Warning to Tramadol labeling to ensure appropriate inclusion of the serious adverse reactions such as addiction, abuse, and misuse, life-threatening respiratory depression, accidental ingestion, and interaction with drugs affecting cytochrome P450 isoenzymes. In particular, the statement “Do not prescribe tramadol for patients who are suicidal or addiction-prone. Consideration should be given to the use of non-narcotic analgesics in patients who are suicidal or depressed” is highlighted in the Drug Abuse and Dependence section of the US FDA label (

The FAERS case reports curated in the PharmaPendium database (

The FAERS data in the PharmaPendium database have been preprocessed, including removing duplicate records, normalizing drug names, and standardizing adverse event terminology. However, some hurdles still exist for consolidating the information to carry out causal inference. Therefore, we implemented the following data cleaning procedure to further process the datasets:

1) We normalized the terms such as “UNK,” “UNKNOWN,” “()” and considered them as missing values.

2) Considering the different doses used in FAERS case reports, we unified the dose unit into milligram (mg). We categorized the dose into two classes: larger than 100 mg and less than 100 mg.

3) We categorized the patient age into four groups: less than 18 years old, 18–39 years old, 40–64 years old, and older than 65 years.

4) For the tramadol-related mortalities dataset, we excluded the case reports without clinical outcome information since we used the clinical outcome as the prediction endpoint. As a result, we obtained a total of 36,661 and 27,245 case reports for Analgesics-induced acute liver failure and Tramadol-related mortalities, respectively.
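The cleaning rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the unit conversion table, and the handling of boundary values (a dose of exactly 100 mg, an age of exactly 65 years) are assumptions, since the text does not specify them.

```python
from typing import Optional

# Tokens treated as missing values (rule 1). The empty string is an added assumption.
MISSING_TOKENS = {"UNK", "UNKNOWN", "()", ""}

# Conversion factors to milligrams (rule 2). The list of units is an assumption.
UNIT_TO_MG = {"mg": 1.0, "g": 1000.0, "ug": 0.001, "mcg": 0.001}


def normalize_missing(value: Optional[str]) -> Optional[str]:
    """Return None for placeholder tokens, otherwise the stripped value."""
    if value is None:
        return None
    cleaned = value.strip()
    return None if cleaned.upper() in MISSING_TOKENS else cleaned


def dose_category(amount: float, unit: str) -> str:
    """Convert a dose to mg and assign one of the two dose classes."""
    mg = amount * UNIT_TO_MG[unit.lower()]
    # Assumption: a dose of exactly 100 mg falls into the lower class.
    return "larger than 100 mg" if mg > 100 else "less than 100 mg"


def age_group(age_years: float) -> str:
    """Assign one of the four age groups (rule 3)."""
    if age_years < 18:
        return "less than 18"
    if age_years < 40:
        return "18-39"
    if age_years < 65:
        return "40-64"
    # Assumption: patients aged exactly 65 are placed in the oldest group.
    return "65 and older"
```

Records whose endpoint field normalizes to `None` would then be dropped, which corresponds to rule 4 for the tramadol dataset.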

Our proposed model for causal inference, InferBERT, is based on the transformer model. Given a dataset of $N$ case reports $D = (d_1, d_2, \dots, d_N)$, $d_i$ is the $i$-th case report. Each clinical feature $f_j$ consists of a set of terms $T_j$ (e.g., feature gender includes terms male and female) as value, where $T_j = (t_{j1}, t_{j2}, \dots, t_{jG})$. Each case report is $d_i = (d_{i1}, d_{i2}, \dots, d_{iM})$, where $d_{ij}$ is the value of the $j$-th feature in the $i$-th case report and $d_{ij} \subset T_j$. Without losing generality, we set the $f_m$ as the end point, which means the

Then, we transformed each case report $d_i$ into the corresponding sentence $s_i$. For example, in the FAERS dataset, the clinical features included gender, age, primary suspect drug, dose, indication, adverse events, and outcomes in each case report $d_i$. The generated sentence followed the template listed below:

Patient (gender and age) takes a primary suspect drug to treat which disease and cause some adverse events, leading to outcomes.

Then we generated the sentence set $S = (s_1, s_2, \dots, s_N)$.
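As an illustration of the template, the following sketch turns one structured case report into a sentence-like description. The dictionary keys, the helper names, and the exact wording of the generated sentence are hypothetical; the study only states the template, not the literal strings used.

```python
from typing import Dict, List, Union

Value = Union[str, List[str], None]


def _as_text(value: Value) -> str:
    """Join a list of terms with commas; pass a single term through."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value
    return ", ".join(value)


def report_to_sentence(report: Dict[str, Value]) -> str:
    """Fill the sentence template, skipping any feature that is missing."""
    demographics = [_as_text(report.get(key)) for key in ("gender", "age")]
    demographics = [d for d in demographics if d]
    sentence = "Patient"
    if demographics:
        sentence += " (" + ", ".join(demographics) + ")"
    drug = _as_text(report.get("drug"))
    if drug:
        dose = _as_text(report.get("dose"))
        sentence += " takes " + drug + (f" at a dose {dose}" if dose else "")
    indication = _as_text(report.get("indication"))
    if indication:
        sentence += " to treat " + indication
    events = _as_text(report.get("adverse_events"))
    if events:
        sentence += " and causes " + events
    outcomes = _as_text(report.get("outcomes"))
    if outcomes:
        sentence += ", leading to " + outcomes
    return sentence + "."


example = {
    "gender": "female",
    "age": "18-39",
    "drug": "acetaminophen",
    "dose": "larger than 100 mg",
    "indication": "pain",
    "adverse_events": ["acute liver failure", "vomiting"],
    "outcomes": "death",
}
print(report_to_sentence(example))
```

Skipping missing features keeps the sentence grammatical when a case report lacks, for example, dose or indication information.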

For the Analgesics-induced acute liver failure data, the term “acute liver failure” in clinical feature “adverse event” was used as the endpoint. Of 36,661 FAERS case reports, 15,224 cases with “acute liver failure” were considered as positives and the remaining 21,437 cases as negatives (positive/negative ratio = 0.71). For the Tramadol-related death data, the clinical feature “outcomes” was used as the endpoint. The case reports with the term “death” in the clinical feature “outcomes” were considered as positives and other case reports were used as negatives. Accordingly, a total of 27,245 case reports with 9,846 positives and 17,399 negatives were obtained (positive/negative ratio = 0.57). Next, we employed a stratified splitting strategy to divide each sentence set

Sentence sets of Analgesics-related acute liver failure and Tramadol-related mortalities.

Endpoints | Datasets | Number of positives | Number of negatives | Positive versus negative ratio |
---|---|---|---|---|
Acute liver failure | Total | 15,224 | 21,437 | 0.71 |
Acute liver failure | Training set | 9,798 | 13,663 | 0.71 |
Acute liver failure | Development set | 2,399 | 3,467 | 0.69 |
Acute liver failure | Test set | 3,027 | 4,307 | 0.70 |
Tramadol-related death | Total | 9,846 | 17,399 | 0.57 |
Tramadol-related death | Training set | 6,250 | 11,185 | 0.56 |
Tramadol-related death | Development set | 1,588 | 2,722 | 0.57 |
Tramadol-related death | Test set | 2,008 | 3,442 | 0.58 |
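A stratified split keeps the positive/negative ratio roughly constant across the training, development, and test sets, as the table shows. The sketch below is one way to do this with the standard library. The 64/16/20 fractions are an assumption inferred from the counts in the table, and the function name and seed are illustrative.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int],
    fractions: Tuple[float, float, float] = (0.64, 0.16, 0.20),
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Return index lists (train, dev, test) with similar class ratios."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_dev = round(len(indices) * fractions[1])
        train += indices[:n_train]
        dev += indices[n_train:n_train + n_dev]
        test += indices[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


# Toy labels with a positive/negative ratio of 0.71.
labels = [1] * 71 + [0] * 100
train, dev, test = stratified_split(labels)
```

Splitting each class separately is what makes the split stratified: every subset receives about the same share of positives and negatives.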

Bidirectional Encoder Representations from Transformers (BERT) is a transformer that learns contextual bidirectional representations from unlabeled text documents by jointly conditioning on both left and right contexts (

Increasing the model size of pre-trained language models often results in improved model performance for downstream tasks. However, GPU/TPU memory limitations, longer training times, and model overfitting are obstacles to further expanding the model size. To address these obstacles, Google AI proposed a Lite BERT (ALBERT) by adopting three techniques to trim down BERT (

The ALBERT_{base} classification model was employed to classify the endpoint term of each instance. We built a simple SoftMax classifier for the downstream classification task of the ALBERT model. In the ALBERT model, the learned representation vector of the [CLS] special token of the last layer acts as the input of the downstream model, with no hidden layers. The dimensionality of the output layer in the classification model is two, where the SoftMax function is adopted to classify whether the endpoint term exists or not. The loss function of the classification model is the cross-entropy loss:

$$\mathcal{L} = -\sum_{i=1}^{N} y_i \log p(s_i),$$

where $p(s_i)$ is the output of the classification model for $s_i$, which is a calculated probability of the predicted class of $s_i$, and $y_i$ is the true probability of the end point of $s_i$.
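To make the classification head concrete, the sketch below computes a two-way SoftMax and the corresponding cross-entropy loss from raw output scores. It uses plain Python rather than the actual ALBERT implementation; the function names and the choice to average the loss over the batch are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the output scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_batch: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean negative log-probability assigned to the true class.

    labels[i] is 1 if the endpoint term is present in instance i, else 0.
    """
    loss = 0.0
    for logits, label in zip(logits_batch, labels):
        loss -= math.log(softmax(logits)[label])
    return loss / len(labels)


# The second softmax output plays the role of the positive probability of the endpoint.
positive_probability = softmax([0.2, 1.5])[1]
```

With two output units, the positive probability is all the downstream do-calculus step needs from the classifier.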

We denote $p_i = p(s_i)$ as the output of the classification model, where $p_i$ is the positive probability of the end point for instance $s_i$, so that the outputs over all instances are $(p_1, p_2, \dots, p_N)$.

Since the transformer is a generative model, the ALBERT based classification model can be seen as a conditional probability distribution

Based on the conditional probability distribution of the fine-tuned ALBERT_{base} classifier, we performed the Do-calculus procedure to estimate the cause of the endpoint. The pseudo code of the Do-calculus procedure is shown below.

For all the terms in each clinical feature, we applied the Do-calculus algorithm to check whether it is the cause of the endpoint. For a term $t_{jk}$, if a case report $d_i$ contains $t_{jk}$, we say it is do $t_{jk}$, while if $d_{ij} \neq \emptyset$ and $t_{jk}$ is not in $d_{ij}$, then it is not do $t_{jk}$. We assigned the case $d_i$ to different sets: $D_{do(t_{jk})}$ contains the cases that are do $t_{jk}$, while $D_{not\,do(t_{jk})}$ contains the cases that are not do $t_{jk}$. We used the one-tailed z-test to evaluate whether instances in $D_{do(t_{jk})}$ have a significantly higher probability of the endpoint than instances in $D_{not\,do(t_{jk})}$. For example, if the endpoint is $f_m$ and we want to see the impact of $t_{11}$ (the first term of the first feature), then for each instance $d_i$ we have the probability of $f_m$ being positive as follows:

As shown in _{11}, the set is

To establish all the causal terms of the end point, we evaluated every term in each feature. This generated the term set
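The do / not do comparison for a single term can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' implementation: the data layout (each report maps a feature name to a set of terms), the function names, and the use of a two-sample z statistic with unpooled sample variances are assumptions.

```python
import math
from statistics import mean, variance
from typing import Dict, List, Sequence, Set, Tuple

Report = Dict[str, Set[str]]  # feature name -> set of terms (empty set = missing)


def split_do_sets(
    reports: Sequence[Report], probs: Sequence[float], feature: str, term: str
) -> Tuple[List[float], List[float]]:
    """Collect predicted endpoint probabilities for the do and not-do sets."""
    do_probs, not_do_probs = [], []
    for report, p in zip(reports, probs):
        terms = report.get(feature, set())
        if term in terms:
            do_probs.append(p)
        elif terms:  # feature is present but does not contain the term
            not_do_probs.append(p)
        # reports with a missing feature belong to neither set
    return do_probs, not_do_probs


def one_tailed_z_test(do_probs: Sequence[float], not_do_probs: Sequence[float]) -> Tuple[float, float]:
    """Test whether the do set has a higher mean probability than the not-do set."""
    if len(do_probs) < 2 or len(not_do_probs) < 2:
        raise ValueError("each set needs at least two instances")
    se = math.sqrt(variance(do_probs) / len(do_probs) + variance(not_do_probs) / len(not_do_probs))
    if se == 0:
        raise ValueError("zero variance in both sets")
    z = (mean(do_probs) - mean(not_do_probs)) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return z, p_value


reports = [
    {"gender": {"female"}}, {"gender": {"female"}}, {"gender": {"female"}},
    {"gender": {"male"}}, {"gender": {"male"}}, {"gender": {"male"}}, {"gender": {"male"}},
    {"gender": set()},  # missing gender: excluded from both sets
]
probs = [0.80, 0.90, 0.85, 0.30, 0.40, 0.35, 0.20, 0.50]
do_probs, not_do_probs = split_do_sets(reports, probs, "gender", "female")
z, p_value = one_tailed_z_test(do_probs, not_do_probs)
```

Repeating this test for every term of every feature, followed by a multiple-testing adjustment of the p-values, would yield the list of enriched terms.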

To further explore the causal relationship among the enriched causal terms, we built a causal tree based on the Do-calculus. For each causal term enriched for the endpoint $f_m$, we explored the secondary causal terms. For example, if $t_{11}$ is an enriched causal term and we want to test whether $t_{21}$ is a secondary cause for the endpoint $f_m$, then we fixed the $t_{11}$ term and performed a statistical significance test on the difference between the instances following distribution shown as

By recursively performing the do-calculus algorithm on the subset of
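The recursive construction can be sketched as below: at each level the instances are restricted to those containing the fixed term, and the remaining terms are tested again. This is a simplified illustration under several assumptions: the data layout, the function names, a fixed z-score threshold of 1.645 (an unadjusted one-tailed 5% level) in place of the adjusted p-value used in the study, and the exclusion of all terms of an already fixed feature.

```python
import math
from statistics import mean, variance
from typing import Dict, List, Sequence, Set, Tuple

Report = Dict[str, Set[str]]  # feature name -> set of terms
Term = Tuple[str, str]        # (feature, term)


def z_score(do: Sequence[float], not_do: Sequence[float]) -> float:
    """Two-sample z statistic; returns 0.0 when it cannot be computed."""
    if len(do) < 2 or len(not_do) < 2:
        return 0.0
    se = math.sqrt(variance(do) / len(do) + variance(not_do) / len(not_do))
    return (mean(do) - mean(not_do)) / se if se > 0 else 0.0


def enriched_terms(reports, probs, candidates: Sequence[Term], z_threshold: float):
    """Return (term, z) pairs above the threshold, highest z first."""
    scored = []
    for feature, term in candidates:
        do = [p for r, p in zip(reports, probs) if term in r.get(feature, set())]
        not_do = [p for r, p in zip(reports, probs) if r.get(feature) and term not in r[feature]]
        z = z_score(do, not_do)
        if z > z_threshold:
            scored.append(((feature, term), z))
    return sorted(scored, key=lambda item: -item[1])


def build_causal_tree(reports, probs, candidates, z_threshold=1.645, max_depth=2) -> List[dict]:
    """Recursively test secondary causes within the subset that contains each enriched term."""
    if max_depth == 0:
        return []
    tree = []
    for (feature, term), z in enriched_terms(reports, probs, candidates, z_threshold):
        keep = [i for i, r in enumerate(reports) if term in r.get(feature, set())]
        sub_reports = [reports[i] for i in keep]
        sub_probs = [probs[i] for i in keep]
        remaining = [c for c in candidates if c[0] != feature]
        children = build_causal_tree(sub_reports, sub_probs, remaining, z_threshold, max_depth - 1)
        tree.append({"term": (feature, term), "z": z, "children": children})
    return tree


def _make(drug: str, age: str) -> Report:
    return {"drug": {drug}, "age": {age}}


reports = [_make("A", "young")] * 4 + [_make("A", "old")] * 4 + [_make("B", "young")] * 4 + [_make("B", "old")] * 4
probs = [0.90, 0.95, 0.92, 0.88, 0.60, 0.65, 0.62, 0.58,
         0.20, 0.25, 0.22, 0.18, 0.10, 0.15, 0.12, 0.08]
candidates = [("drug", "A"), ("drug", "B"), ("age", "young"), ("age", "old")]
tree = build_causal_tree(reports, probs, candidates)
```

In this toy data, drug A is the only root cause, and the young age group appears as a secondary cause once drug A is fixed.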

The proposed InferBERT model is based on the fine-tuned pretrained ALBERT_{base} for text classification and causal inference. BERT and its derivatives such as ALBERT are designed so that the pretrained language model can be fine-tuned on a supervised downstream task. However, this process can be brittle: even with the same parameter values, distinct random seeds can lead to different results (

Second, the percentage of overlapped terms (POT) strategy was used to investigate the consistency of the order of enriched terms. Specifically, we ranked the enriched terms by z-score from high to low. We then calculated the POT as the number of terms shared by all three repeated runs divided by the number of enriched terms in each subset of the ranked enriched term list.
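One way to read the POT definition is as a curve over the top-k subsets of the ranked lists: for each k, count the terms that appear in the top k of every run and divide by k. The sketch below follows that reading; the function names and the toy term lists are illustrative assumptions.

```python
from typing import List, Sequence


def percentage_of_overlapped_terms(ranked_runs: Sequence[Sequence[str]], k: int) -> float:
    """Share of the top-k terms that are common to every run."""
    top_sets = [set(run[:k]) for run in ranked_runs]
    overlap = set.intersection(*top_sets)
    return len(overlap) / k


def pot_curve(ranked_runs: Sequence[Sequence[str]]) -> List[float]:
    """POT for k = 1 .. length of the shortest ranked list."""
    max_k = min(len(run) for run in ranked_runs)
    return [percentage_of_overlapped_terms(ranked_runs, k) for k in range(1, max_k + 1)]


# Three repeated runs, each ranking enriched terms by z-score from high to low.
runs = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["b", "a", "c", "e"],
]
curve = pot_curve(runs)
```

A curve that stays close to 1.0 indicates that the repeated runs agree not only on which terms are enriched but also on their order.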

To further verify the results yielded by the proposed InferBERT model, we employed three conventional causal inference methods: the proportional reporting ratio (PRR), the reporting odds ratio (ROR), and the empirical Bayes geometric mean (EBGM). For the PRR, the signal criteria include a χ^{2} value of four or more. For the ROR, a signal is detected if the lower limit of the 95% two-sided confidence interval exceeds one. For the EBGM, a signal is detected when the lower one-sided 95% confidence limit of the EBGM (EB05) is equal to or greater than two.
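The PRR and ROR are both computed from a 2×2 contingency table of report counts. The sketch below shows the standard formulas together with the two thresholds stated above (χ² of four or more, lower 95% confidence limit of the ROR above one). The cell naming, the use of the Pearson χ² without continuity correction, and the example counts are assumptions. EBGM is omitted because it requires fitting a prior distribution to the whole database.

```python
import math
from typing import Tuple


def prr(a: int, b: int, c: int, d: int) -> float:
    """Proportional reporting ratio.

    a: target drug, target event     b: target drug, other events
    c: other drugs, target event     d: other drugs, other events
    """
    return (a / (a + b)) / (c / (c + d))


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def ror_with_ci(a: int, b: int, c: int, d: int) -> Tuple[float, float, float]:
    """Reporting odds ratio with its two-sided 95% confidence interval."""
    ror = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - 1.96 * se_log)
    upper = math.exp(math.log(ror) + 1.96 * se_log)
    return ror, lower, upper


# Made-up counts for illustration only.
a, b, c, d = 20, 80, 10, 890
prr_value = prr(a, b, c, d)
chi2_value = chi_square(a, b, c, d)
ror_value, ror_lower, ror_upper = ror_with_ci(a, b, c, d)

chi2_criterion_met = chi2_value >= 4
ror_signal = ror_lower > 1
```

Unlike the do-calculus test, these disproportionality measures use only report counts and do not involve a trained model.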

To facilitate the application of our model, we developed a standalone package to simplify the implementation process. The current version of the InferBERT is based on a lite version of BERT (ALBERT,

The distribution of sequence length:

Top 10 most frequent terms in the two sentence sets based on the tf-idf values.

Terms (Analgesics-related acute liver failure) | Tf-idf value | Terms (Tramadol-related mortalities) | Tf-idf value |
---|---|---|---|
Acetylcysteine | 0.0318 | Abacavir | 0.0323 |
Acinetobacter | 0.0318 | Indomethacin | 0.0323 |
Alafenamide | 0.0318 | Glossodynia | 0.0315 |
Altered | 0.0318 | Idiopathic | 0.0315 |
Appendicectomy | 0.0318 | Amnestic | 0.0312 |
Appetite | 0.0318 | Assault | 0.0312 |
Assist | 0.0318 | Axetil | 0.0312 |
Atherosclerosis | 0.0318 | Bradyarrhythmia | 0.0312 |
Brucellosis | 0.0318 | Brugada | 0.0312 |
Cabazitaxel | 0.0318 | Cardiorenal | 0.0312 |

We used the ALBERT_{base} model pretrained on a 16 GB corpus that includes BOOKCORPUS. The ALBERT_{base} model consisted of 12 repeating layers, an embedding size of 128, a hidden size of 768, and 12 attention heads, with 11 M parameters. We further fine-tuned the ALBERT_{base} model with the training sets and selected the optimized models based on text classification results in the development sets for the endpoints (i.e., acute liver failure and death). We used one NVIDIA V100 (32 GB) GPU for fine-tuning the model. For the Analgesics-induced acute liver failure dataset, the maximum sequence length was fixed to 128, and the mini-batch size was set to 128. A total of 10,000 training steps were implemented with a 2,000-step warmup, and the checkpoint step was set to 500 for recording the prediction results. For the Tramadol-related mortalities dataset, we used the same parameter settings except for a longer maximum sequence length (i.e., 256). More training steps (i.e., 20,000 steps) were selected as well, since the average sequence length of the Tramadol-related mortalities dataset was longer than that of the Analgesics-induced acute liver failure dataset.
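The fine-tuning settings can be collected in a small configuration object, which also makes it easy to see what the step counts mean in terms of passes over the data. The class and field names below are hypothetical; the training set sizes (23,461 and 17,435 sentences) are the sums of positives and negatives in the training rows of the dataset table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    max_seq_length: int
    batch_size: int
    train_steps: int
    warmup_steps: int
    checkpoint_every: int

    def epochs(self, n_train_examples: int) -> float:
        """Approximate number of passes over the training set."""
        return self.train_steps * self.batch_size / n_train_examples

    def n_checkpoints(self) -> int:
        """Number of checkpoints at which predictions are recorded."""
        return self.train_steps // self.checkpoint_every


ALF = FineTuneConfig(max_seq_length=128, batch_size=128, train_steps=10_000,
                     warmup_steps=2_000, checkpoint_every=500)
TRAMADOL = FineTuneConfig(max_seq_length=256, batch_size=128, train_steps=20_000,
                          warmup_steps=2_000, checkpoint_every=500)

alf_epochs = ALF.epochs(9_798 + 13_663)           # about 54.6 passes
tramadol_epochs = TRAMADOL.epochs(6_250 + 11_185)  # about 146.8 passes
```

Under these settings the model sees each training sentence many times, which is why the development set is needed to select the checkpoint.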

The relationship between cross-entropy loss and accuracy and training steps in fine-tuned ALBERT models:

To investigate whether the proposed InferBERT approach could capture the causal factors aligned with clinical knowledge, we further carried out the do-calculus analysis to decipher the causal factors for the Analgesics-induced acute liver failure and Tramadol-related mortalities datasets. There are 42 and 48 clinical terms enriched with an adjusted

Enriched causal clinical terms by the proposed InferBERT AI model.

Dataset | Clinical categories | Clinical terms | Z-score | Average of do probabilities | Average of not do probabilities | Adjusted p-value |
---|---|---|---|---|---|---|
Analgesics-induced acute liver failure | Primary suspect drug | APAP | 153.92 | 0.84 | 0.33 | < 1E-16 |
Analgesics-induced acute liver failure | Age | 18–39 | 36.01 | 0.54 | 0.35 | < 1E-16 |
Analgesics-induced acute liver failure | Gender | Female | 17.06 | 0.41 | 0.35 | < 1E-16 |
Analgesics-induced acute liver failure | Dose | Larger than 100 mg | 8.93 | 0.39 | 0.35 | < 1E-16 |
Analgesics-induced acute liver failure | Outcome | Death | 119.33 | 0.68 | 0.30 | < 1E-16 |
Tramadol-related mortalities | Adverse events | Completed suicide | 252.27 | 1.00 | 0.28 | < 1E-16 |
Tramadol-related mortalities | Age | 40–64 | 18.33 | 0.44 | 0.32 | < 1E-16 |
Tramadol-related mortalities | Gender | Male | 3.62 | 0.37 | 0.34 | 0.0001 |
Tramadol-related mortalities | Dose | Drug abuse | 38.77 | 0.74 | 0.33 | < 1E-16 |
Tramadol-related mortalities | Primary suspect drug | Hydrocodone bitartrate | 23.67 | 0.91 | 0.36 | < 1E-16 |

For Analgesics-induced acute liver failure, the enriched root causal factors (z-score) included primary suspect drug: APAP (153.92), age: 18–39 (36.01), gender: female (17.06), dose: larger than 100 mg (8.93), and outcome: death (119.33), which is highly consistent with the clinical background mentioned above. For Tramadol-related mortalities, the enriched root causal factors (z-score) consisted of primary suspect drug: Hydrocodone Bitartrate (23.66), age: 40–64 (18.33), gender: male (3.62), dose: drug abuse (38.77), and adverse events: Completed suicide (252.27), which is aligned with its clinical background.

To further uncover the interrelationship among causal factors, we implemented a causal tree analysis using the causal factor with the highest z-score as the starting point.

Causal trees for

Robustness evaluation of the proposed InferBERT model. The yellow and green colors denote the Analgesics-induced acute liver failure and Tramadol-related mortalities datasets, respectively. The Venn diagram illustrates the overlap of the enriched causal terms across three repeated runs. The percentage of overlapped terms (POT) shown in the dotted-line curve represents the consistency of the ranked terms among the three repeated runs.

We further compared the proposed InferBERT model with three conventional signal detection methods (i.e., PRR, ROR, and EBGM) widely applied in pharmacovigilance.

Comparison between the proposed InferBERT model and the three conventional causal inference models, PRR, ROR, and EBGM:

Transformer-based language models have greatly expanded the potential of NLP applications. However, few attempts have been made to apply transformer-based language models to the unmet need for enhanced model-based reasoning about causality. To the best of our knowledge, the current study and description of InferBERT is the first to succeed in transformer-based causal inference aimed at boosting pharmacovigilance. To investigate the performance of our proposed InferBERT model, we used two FAERS case studies, Analgesics-induced acute liver failure and Tramadol-related mortalities, as a proof of concept. The root causes in the two datasets were identified, and the results were consistent with the causal relationships derived from real-world data. Moreover, the proposed causal tree linked the enriched causal factors into a hierarchical structure to decipher the interrelationship among the causal factors. Furthermore, the high reproducibility of the proposed InferBERT model supports its potential real-world application.

The FAERS database is an essential resource for hypothesis generation to support pharmacovigilance. However, FAERS data derive from spontaneous submissions by pharmaceutical companies and physicians. There are many data integrity issues such as duplicate records, unstandardized terminologies, missing values, and missing information. Tremendous efforts have been made to clean, normalize, and standardize the data and format, enabling researchers to take full advantage of the datasets (

To demonstrate the performance of the proposed InferBERT model, we employed synthetic sentences constructed from standard terminology in the processed FAERS data. The quality of the data resources is crucial for applying the model to causality analysis. For example, complex causal relationships are embedded in electronic medical records (EMR), and uncovering them is essential to suggest the right clinical decision and improve the clinical outcome. Initial efforts such as ClinicalBERT have been proposed to address clinical questions. A further investigation to combine the ClinicalBERT

There are two limitations in the current version of the InferBERT model, which need further investigation. First, we developed the InferBERT model based on FAERS data with a fixed pattern. Further investigation of different types of free-text data in the biomedical fields is a “must” to evaluate the generalization of the proposed model. Second, we only investigated the model performance with two endpoints (i.e., Analgesics-related acute liver failure and Tramadol-related death). The proposed InferBERT model should be further evaluated with diverse free text-based biomedical datasets to lay out the pros and cons in real-world applications.

Several additional studies could further improve the proposed InferBERT model. Firstly, the proposed InferBERT model was developed based on the ALBERT_{base} model. Other transformer-based language models could be investigated to improve causal inference results, and a comparative analysis between different transformer models is strongly recommended. The comparison could address factors such as computational power, computing time, and improvement of model performance, which could be very helpful for selecting the “fit-for-purpose” model to carry out causal inference in real-world applications. Secondly, the language model represents the interrelationship of variables in a probabilistic graph. Therefore, Bayesian theory could be considered as a possible route to improve causal inference. The proposed model needs a predefined endpoint to carry out the causal analysis. The combination of the transformer model and Bayesian approaches may be a promising solution to comprehensively evaluate the causal relationship among different variables in the data. Thirdly, in the current study, we focused on the identification of causal factors of the endpoint. The developed InferBERT model could be utilized to test the potential influence of any term combination on endpoints, which may provide further confidence and establish a causality-based question answering system. Lastly, the currently developed InferBERT model is a supervised causal inference system. Future work on self-learning of interrelationships among variables directly derived from the pre-trained language models may provide a more intelligent way to identify causal factors for any clinical outcome.

Despite the current attention around AI, most AI-powered language models focus on predicting outcomes rather than understanding causality. Here, we explored the potential utility of transformer-based language models for causal inference in pharmacovigilance. We hope our study can further trigger community interest to examine the potential of AI for understanding the data and to improve the causal interpretability of AI models in the biomedical field.

The original contributions presented in the study are included in the article/

XX devised the deep causal model applied to this study. ZL and WT conceived and designed the study of utilizing the model for pharmacovigilance. XW coded the deep causal model. XW and ZL performed data analysis. ZL, XW, and XX wrote the manuscript. WT and RR revised the manuscript. All authors read and approved the final manuscript.

This article reflects the views of the authors and does not necessarily reflect those of the U.S. Food and Drug Administration. Any mention of commercial products is for clarification and is not intended as an endorsement.

RR is co-founder and co-director of ApconiX, an integrated toxicology and ion channel company that provides expert advice on non-clinical aspects of drug discovery and drug development to academia, industry, and not-for-profit organizations.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at: