Because of the inherent ambiguity in medical images such as X-rays, radiologists often use words like “may” or “likely” when describing whether a particular pathology, such as pneumonia, is present.
But do the words radiologists use to convey their level of confidence accurately reflect how often a particular pathology actually occurs in patients? A new study suggests that when radiologists express high confidence about a pathology using a phrase like “very likely,” they tend to be overconfident, and the reverse holds when they express lower confidence using a word like “possibly.”
Using clinical data, a multidisciplinary team of MIT researchers, working with researchers and clinicians at hospitals affiliated with Harvard Medical School, developed a framework for measuring how reliable radiologists are when they express certainty using natural-language terms.
They used this framework to produce concrete suggestions that help radiologists choose certainty phrases that would improve the reliability of their clinical reporting. They also showed that the same technique can effectively measure and improve the calibration of large language models, by better aligning the words the models use to express confidence with the accuracy of their predictions.
By helping radiologists more accurately describe the likelihood of certain pathologies in medical images, this new framework could improve the reliability of critical clinical information.
“The terminology radiologists choose is significant. It influences how doctors intervene in their decision-making for the patient. If these professionals can become more reliable in their reporting, ultimately, patients will benefit,” states Peiqi Wang, an MIT graduate student and lead author of a paper on this study.
He is joined on the paper by senior author Polina Golland, the Sunlin and Priscilla Chou Professor of Electrical Engineering and Computer Science (EECS), a principal investigator in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), and the leader of the Medical Vision Group; as well as Barbara D. Lam, a clinical fellow at the Beth Israel Deaconess Medical Center; Yingcheng Liu, also an MIT graduate student; Ameneh Asgari-Targhi, a research fellow at Massachusetts General Brigham (MGB); Rameswar Panda, a research staff member at the MIT-IBM Watson AI Lab; William M. Wells, a professor of radiology at MGB and a research scientist in CSAIL; and Tina Kapur, an assistant professor of radiology at MGB. The research will be presented at the International Conference on Learning Representations.
Interpreting uncertainty in language
A radiologist composing a report on a chest X-ray might indicate that the image reveals a “possible” pneumonia, which is an infection causing inflammation of the air sacs in the lungs. In such a scenario, a physician could request a follow-up CT scan to verify the diagnosis.
Conversely, if the radiologist declares that the X-ray indicates a “likely” pneumonia, the physician may start treatment immediately, such as prescribing antibiotics, while still ordering further tests to evaluate severity.
Attempting to assess the calibration, or reliability, of ambiguous natural language expressions such as “possibly” and “likely” poses numerous challenges, Wang notes.
Current calibration techniques typically depend on the confidence score provided by an AI model, reflecting the model’s estimated probability that its prediction is correct.
For example, a weather application might forecast an 83 percent chance of rain tomorrow. That model is well-calibrated if, across all instances where it predicts an 83 percent chance of rain, it rains about 83 percent of the time.
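To make that standard notion of calibration concrete, here is a minimal Python sketch (not drawn from the study itself) that bins a model’s predicted probabilities and compares each bin’s average confidence with how often the event actually occurred; the function name and binning choices are assumptions made for the example.

```python
import numpy as np

def expected_calibration_error(pred_probs, outcomes, n_bins=10):
    """Bin predictions and compare average confidence to observed frequency.

    A generic sketch of the standard calibration check described above,
    not code from the study; names and the binning scheme are illustrative.
    """
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin (clip so a prediction of 1.0 lands in the last bin).
    bin_ids = np.clip(np.digitize(pred_probs, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        avg_confidence = pred_probs[mask].mean()   # e.g., "83 percent chance of rain"
        observed_freq = outcomes[mask].mean()      # how often it actually rained
        ece += mask.mean() * abs(avg_confidence - observed_freq)
    return ece
```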
“However, humans utilize natural language, and if we map these expressions to a singular number, it does not accurately depict reality. When someone states that an event is ‘likely,’ they are not necessarily considering the exact probability, such as 75 percent,” Wang observes.
Instead of striving to map certainty expressions to a single percentage, the researchers’ approach treats them as probability distributions. A distribution illustrates the spectrum of potential values and their likelihoods — consider the conventional bell curve in statistics.
“This captures more nuances of what each term signifies,” Wang adds.
Evaluating and enhancing calibration
The researchers drew on prior work that surveyed radiologists to obtain probability distributions corresponding to each diagnostic certainty phrase, ranging from “very likely” to “consistent with.”
For instance, because more radiologists interpret the phrase “consistent with” as indicating that a pathology is present in a medical image, its probability distribution rises sharply to a high peak, with most values clustered around the 90 to 100 percent range.
In contrast, the expression “may represent” implies greater uncertainty, resulting in a wider, bell-shaped distribution centered around 50 percent.
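One simple way to encode such phrase-level distributions, sketched below in Python, is to assign each certainty phrase a Beta distribution over the probability that the pathology is present. The phrases shown and their shape parameters are illustrative assumptions, not the distributions elicited in the study.

```python
from scipy.stats import beta

# Illustrative Beta-distribution encodings of certainty phrases.
# The shape parameters below are invented for this example; the study
# obtained its distributions from surveys of radiologists.
PHRASE_DISTRIBUTIONS = {
    "consistent with": beta(a=18, b=2),   # sharp peak near 90-100 percent
    "likely": beta(a=8, b=3),             # skewed toward higher probabilities
    "may represent": beta(a=5, b=5),      # broad bell shape centered near 50 percent
    "unlikely": beta(a=2, b=8),           # skewed toward lower probabilities
}

dist = PHRASE_DISTRIBUTIONS["may represent"]
print(round(dist.mean(), 2), dist.interval(0.8))  # center of mass and an 80 percent range
```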
Standard methods assess calibration by comparing how well a model’s predicted probability scores correspond with the actual number of positive results.
The researchers’ method adheres to the same basic framework but extends it to acknowledge that certainty phrases signify probability distributions rather than simple probabilities.
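To illustrate the idea, the hypothetical sketch below compares, for each phrase, the observed frequency of the pathology against the phrase’s full distribution rather than a single score. The particular comparison used here (how much probability mass the distribution places near the observed frequency) is an assumption made for the example; the paper’s exact formulation may differ.

```python
import numpy as np

def phrase_calibration_report(reports, phrase_dists, tolerance=0.1):
    """Compare observed frequencies to each phrase's distribution (illustrative only).

    `reports` maps a certainty phrase to a list of 0/1 outcomes (was the
    pathology actually present?). `phrase_dists` maps each phrase to a
    probability distribution, e.g., the Beta distributions sketched above.
    """
    summary = {}
    for phrase, outcomes in reports.items():
        observed = float(np.mean(outcomes))      # empirical frequency of the pathology
        dist = phrase_dists[phrase]
        # Probability mass the phrase's distribution places within `tolerance`
        # of the observed frequency: higher suggests better calibration.
        lo, hi = max(observed - tolerance, 0.0), min(observed + tolerance, 1.0)
        summary[phrase] = {
            "observed_frequency": observed,
            "distribution_mean": float(dist.mean()),
            "mass_near_observed": float(dist.cdf(hi) - dist.cdf(lo)),
        }
    return summary
```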
To improve calibration, the researchers formulated and solved an optimization problem that adjusts how often certain phrases are used, so that expressed confidence better matches reality.
From this they derived a calibration map that suggests which certainty terms a radiologist should use to make their reports for a specific pathology more accurate.
“Perhaps, for this dataset, if every time a radiologist indicated that pneumonia was ‘present,’ they altered the phrase to ‘likely present’ instead, they would achieve better calibration,” Wang clarifies.
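A deliberately simplified way to picture such a calibration map, under the same illustrative assumptions as the sketches above, is a greedy remapping that sends each phrase to whichever candidate phrase’s distribution sits closest to the observed frequency. The study instead solves an optimization problem over how often phrases are used, so this only conveys the flavor of the output.

```python
def build_calibration_map(reports, phrase_dists):
    """Greedy, illustrative stand-in for the study's optimized calibration map.

    For each phrase currently in use, pick the candidate phrase whose
    distribution mean is closest to the observed frequency of the pathology
    among reports that used the original phrase.
    """
    calibration_map = {}
    for phrase, outcomes in reports.items():
        observed = sum(outcomes) / len(outcomes)
        closest = min(phrase_dists,
                      key=lambda candidate: abs(phrase_dists[candidate].mean() - observed))
        calibration_map[phrase] = closest
    return calibration_map

# For example, if findings reported as "present" are confirmed only ~70 percent
# of the time, the map might suggest writing "likely present" instead.
```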
Upon applying their framework to assess clinical reports, the researchers discovered that radiologists were generally underconfident when diagnosing prevalent conditions like atelectasis, but overconfident with more ambiguous conditions like infection.
Additionally, the researchers assessed the reliability of language models using their method, providing a more nuanced depiction of confidence than traditional methods that depend on confidence scores.
In the future, the researchers intend to continue their collaboration with clinicians in hopes of enhancing diagnoses and treatment. They aim to broaden their study to incorporate data from abdominal CT scans.
Moreover, they are keen on investigating how receptive radiologists are to suggestions that improve calibration and whether they can effectively adjust their use of certainty phrases.
“The expression of diagnostic certainty is a pivotal component of the radiology report, as it influences critical management decisions. This study adopts an innovative approach to analyzing and calibrating how radiologists express diagnostic certainty in chest X-ray reports, providing feedback on term usage and corresponding outcomes,” remarks Atul B. Shinagare, associate professor of radiology at Harvard Medical School, who was not involved in this research. “This method has the potential to enhance the accuracy and communication of radiologists, ultimately improving patient care.”
The work received funding, in part, from a Takeda Fellowship, the MIT-IBM Watson AI Lab, the MIT CSAIL Wistron Program, and the MIT Jameel Clinic.