Ambiguity in medical imaging can pose significant obstacles for healthcare professionals attempting to detect disease. For example, in a chest X-ray, pleural effusion, an abnormal buildup of fluid in the space surrounding the lungs, may closely resemble pulmonary infiltrates, which are accumulations of pus or blood within the lung tissue.
An artificial intelligence model could aid the physician in analyzing X-rays by helping to pinpoint subtle details and speeding up the diagnostic process. However, because so many possible conditions could appear in a single image, the clinician would likely prefer to assess a range of possibilities rather than rely on a single AI prediction.
One promising approach to generating such a set of possibilities, known as conformal classification, has the advantage that it can be layered on top of an existing machine-learning model without retraining. However, the prediction sets it produces can be impractically large.
Researchers at MIT have now introduced a straightforward and effective enhancement that can reduce the size of prediction sets by as much as 30 percent, while also enhancing the reliability of the predictions.
A smaller prediction set could enable a clinician to focus on the most appropriate diagnosis more efficiently, thereby improving and streamlining patient treatment. This method may be beneficial across various classification tasks — for instance, determining the species of an animal in an image from a wildlife reserve — as it yields a more compact yet accurate selection of options.
“With fewer categories to evaluate, the sets of predictions naturally become more insightful since you are selecting from fewer alternatives. In a way, you are not truly compromising on accuracy for something more informative,” states Divya Shanmugam, PhD ’24, a postdoctoral researcher at Cornell Tech, who conducted this study during her time as a graduate student at MIT.
Shanmugam is joined on the paper by Helen Lu ’24; Swami Sankaranarayanan, a former postdoctoral researcher at MIT who is currently a research scientist at Lila Sciences; and senior author John Guttag, the Dugald C. Jackson Professor of Computer Science and Electrical Engineering at MIT and a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the Conference on Computer Vision and Pattern Recognition in June.
Prediction guarantees
AI tools employed for critical tasks, such as disease classification in medical images, typically generate a probability score alongside each prediction to allow a user to assess the model’s confidence level. For example, a model might estimate that there is a 20 percent likelihood an image corresponds to a specific diagnosis, such as pleurisy.
However, it is challenging to trust a model’s predicted confidence, as extensive prior studies indicate that these probabilities can be unreliable. With conformal classification, the model’s prediction is substituted with a set of the most plausible diagnoses, accompanied by a guarantee that the correct diagnosis exists within that set.
Yet, the inherent uncertainty in AI predictions often causes the model to output sets that are far too large to be useful.
For instance, if a model classifies an animal in an image from among 10,000 possible species, it might output a set of 200 predictions in order to provide a strong assurance.
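The coverage guarantee behind these prediction sets comes from a simple calibration step. The sketch below, in plain Python, shows the standard split conformal recipe (a textbook version, not code from the paper): score each held-out labeled example by one minus the probability the model assigns to its true class, take a finite-sample-corrected quantile of those scores as a threshold, and include in the prediction set every class whose score falls within it. The function names and toy numbers are illustrative.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate on held-out data: score = 1 - model probability of the true class."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample correction: take the ceil((n+1)(1-alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[min(k, n) - 1]

def prediction_set(probs, qhat):
    """Every class whose conformal score 1 - p_c falls within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]
```

With 19 calibration examples and alpha = 0.1, the correct class lands inside the returned set at least 90 percent of the time on fresh data; a poorly calibrated model simply ends up with a looser threshold, and therefore larger sets.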
“That’s quite a number of categories for someone to sift through to discern the correct one,” Shanmugam remarks.
The technique can also lack reliability because minor alterations to inputs, like slightly rotating an image, can generate entirely different sets of predictions.
To enhance the utility of conformal classification, the researchers applied test-time augmentation (TTA), a technique designed to improve the accuracy of computer vision models.
TTA creates several variations of a single image in a dataset, perhaps by cropping, flipping, zooming in, and so forth. It then applies a computer vision model to each version of the identical image and consolidates its predictions.
“In this manner, you receive multiple predictions from a single example. Aggregating predictions in this way enhances both the accuracy and robustness of the outcomes,” Shanmugam clarifies.
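As a concrete illustration, here is a minimal TTA loop in plain Python (the model, image, and augmentations are toy stand-ins, not the paper's code): apply each augmentation to the image, run the model on every copy, and average the resulting class probabilities.

```python
def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def tta_predict(model, image, augmentations):
    """Run the model on each augmented copy and average the class probabilities."""
    outputs = [model(aug(image)) for aug in augmentations]
    n_classes = len(outputs[0])
    return [sum(p[c] for p in outputs) / len(outputs) for c in range(n_classes)]
```

A real pipeline would use more augmentations (crops, rotations, zooms) and a trained network, but the aggregation step is exactly this averaging.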
Maximizing accuracy
To implement TTA, the researchers set aside some labeled image data utilized for the conformal classification process. They learn to aggregate the augmentations on these reserved data, automatically modifying the images in a manner that maximizes the underlying model’s prediction accuracy.
Following this, they conduct conformal classification on the model’s new TTA-transformed predictions. The conformal classifier then outputs a smaller set of likely predictions while maintaining the same confidence guarantee.
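The paper learns its aggregation from the reserved labeled data; the sketch below substitutes a much simpler, hypothetical rule, weighting each augmentation by its standalone top-1 accuracy on that held-out set, just to show where such learning fits in the pipeline. The weighted-average probabilities it produces are what the conformal calibration step would then consume.

```python
def learn_weights(model, augmentations, images, labels):
    """Weight each augmentation by its standalone top-1 accuracy on reserved,
    labeled data (a simplified, hypothetical stand-in for the paper's learned
    aggregation)."""
    weights = []
    for aug in augmentations:
        preds = [model(aug(img)) for img in images]
        correct = sum(1 for p, y in zip(preds, labels) if p.index(max(p)) == y)
        weights.append(correct / len(labels))
    total = sum(weights) or 1.0  # guard against all-zero accuracies
    return [w / total for w in weights]

def weighted_tta(model, augmentations, weights, image):
    """Aggregate per-augmentation predictions with the learned weights."""
    outputs = [model(aug(image)) for aug in augmentations]
    return [sum(w * p[c] for w, p in zip(weights, outputs))
            for c in range(len(outputs[0]))]
```

Because the aggregated probabilities are sharper, the calibrated threshold admits fewer classes, which is the source of the smaller prediction sets.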
“Integrating test-time augmentation with conformal prediction is straightforward to implement, effective in practice, and does not require any model retraining,” Shanmugam notes.
In comparison with previous work utilizing conformal prediction across various standard image classification benchmarks, their TTA-augmented approach diminished prediction set sizes across experiments by 10 to 30 percent.
Significantly, this method achieves a reduction in prediction set size while preserving the probability guarantee.
The researchers also discovered that, despite sacrificing some labeled data typically employed in the conformal classification procedure, TTA enhances accuracy sufficiently to offset the cost associated with losing that data.
“This raises intriguing questions regarding how we utilize labeled data following model training. The distribution of labeled data among various post-training processes is a crucial direction for future exploration,” Shanmugam states.
In the future, the researchers aim to validate the effectiveness of this approach within the context of models that classify text rather than images. To further enhance their work, they are also exploring ways to reduce the computational demands associated with TTA.
This research is funded, in part, by the Wistron Corporation.