Every year, countless students enroll in courses that teach them how to deploy artificial intelligence models that can help doctors diagnose disease and determine appropriate treatments. However, many of these courses omit a key element: training students to detect flaws in the data used to build the models.
Leo Anthony Celi, a principal research scientist at MIT’s Institute for Medical Engineering and Science, a physician at Beth Israel Deaconess Medical Center, and an associate professor at Harvard Medical School, has documented these shortcomings in a recent paper, and he hopes to persuade course developers to teach students to evaluate their data thoroughly before incorporating it into their models. Many previous studies have found that models trained mostly on clinical data from white males do not work well when applied to people from other groups. Here, Celi describes the impact of such bias and how educators might address it when they teach about AI models.
Q: How does bias get into these datasets, and how can these problems be addressed?
A: Any problems in the data will be baked into any modeling of that data. In the past we have described instruments and devices that don’t work well across individuals. For example, we found that pulse oximeters overestimate oxygen saturation in people of color, because people of color were underrepresented in the clinical trials of those devices. We remind our students that medical devices and equipment are optimized on healthy young males. They have never been optimized for an 80-year-old woman with heart failure, yet we use them in such cases. And the FDA does not require that a device work well across the diverse population it will be used on; it only requires evidence of efficacy in healthy subjects.
On top of that, electronic health record systems are a poor foundation on which to build AI. Those records were not designed as a learning system, so you have to be careful about leveraging electronic health records. A replacement for these systems is under discussion, but such a change is not imminent, so in the meantime we must be smarter and more creative about using the data we have, whatever its shortcomings, to build algorithms.
One promising direction we are exploring is building a transformer model of numeric electronic health record data, including laboratory test results. By modeling the relationships among laboratory tests, vital signs, and treatments, we can mitigate the effect of missing data that stems from social determinants of health and from providers’ implicit biases.
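To make the idea concrete, here is a minimal sketch of what a transformer over numeric EHR data might look like, written in PyTorch. The architecture, dimensions, and masking scheme are illustrative assumptions, not the actual model Celi’s group is building: each observation is treated as a (measurement type, numeric value) pair, and masked (missing) values are reconstructed from correlated observations.

```python
# Hypothetical sketch: a small transformer over numeric EHR measurements
# (lab results, vital signs). All names and dimensions are illustrative.
import torch
import torch.nn as nn

class NumericEHRTransformer(nn.Module):
    def __init__(self, n_measurement_types: int, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Each observation = (measurement type, standardized numeric value).
        self.type_embedding = nn.Embedding(n_measurement_types, d_model)
        self.value_projection = nn.Linear(1, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Head that reconstructs the numeric value of each observation,
        # so missing measurements can be imputed from context.
        self.value_head = nn.Linear(d_model, 1)

    def forward(self, types, values, missing_mask):
        # types: (batch, seq) int codes, e.g. 0=creatinine, 1=heart rate, ...
        # values: (batch, seq) standardized numeric values
        # missing_mask: (batch, seq) bool, True where the value is unobserved
        x = self.type_embedding(types) + self.value_projection(values.unsqueeze(-1))
        # Where the value is missing, keep only the type embedding, so the
        # model must infer the value from correlated observations.
        x = torch.where(missing_mask.unsqueeze(-1), self.type_embedding(types), x)
        h = self.encoder(x)
        return self.value_head(h).squeeze(-1)  # predicted values, (batch, seq)

# Toy usage: 8 measurement types, 2 patients, 16 observations each.
model = NumericEHRTransformer(n_measurement_types=8)
types = torch.randint(0, 8, (2, 16))
values = torch.randn(2, 16)
missing = torch.rand(2, 16) < 0.2
predicted = model(types, values, missing)  # train with MSE on observed entries
```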
Q: Why is it important for AI courses to teach about the sources of potential bias? What did you find when you reviewed the content of such courses?
A: Our course at MIT started in 2016, and at some point we realized that we were inadvertently encouraging people to race to build models that were overfit to particular statistical measures of model performance, while the data we were using was riddled with problems that many people were unaware of. That made us ask: How common is this problem?
Our suspicion was that if we looked at courses whose syllabi are available online, or at the online courses themselves, none of them would even bother to warn students to be skeptical of the data. And indeed, when we looked, we found that most online courses focus exclusively on building the model. How do you build the model? How do you visualize the data? Of the 11 courses we reviewed, only five included sections on bias in datasets, and only two discussed bias in any depth.
That said, we cannot discount the value of these courses. There are many stories of people who taught themselves through these online offerings, but given their reach and influence, we should insist that they teach the right skill sets, especially as more and more people are drawn to the AI field. It is important for learners to be empowered to work with AI effectively. We hope this paper shines a spotlight on this major gap in the way we now teach students about AI.
Q: What kind of content should course developers incorporate?
A: First, give students a checklist of questions at the outset. Where did this data come from? Who collected the observations? Who were the doctors and nurses who collected the data? Then, learn a little about the landscape of the institutions involved. If it’s an ICU database, you need to ask who makes it to the ICU and who doesn’t, because that already introduces a sampling selection bias. If minority patients don’t even get admitted to the ICU because they cannot reach it in time, then the models will not work for them. In my view, 50 percent of the course content should center on understanding the data, if not more, because the modeling itself becomes much simpler once you understand the data.
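As one illustration of what such a checklist can look like in practice, here is a minimal pre-modeling data audit in Python using pandas. The column names ("race", "sex", and the lab columns) are hypothetical and depend entirely on the dataset at hand; this is a sketch of the kind of check described above, not a prescribed tool.

```python
# Illustrative sketch of a pre-modeling data audit with pandas.
# Column names below ("race", "sex", "spo2", ...) are hypothetical.
import pandas as pd

def audit_dataset(df: pd.DataFrame, demographic_cols, measurement_cols):
    """Summarize representation and missingness before any modeling."""
    print(f"Rows: {len(df)}")
    for col in demographic_cols:
        # Who is in the data? Compare these proportions against the
        # population the model will actually serve.
        print(f"\nRepresentation by {col}:")
        print(df[col].value_counts(normalize=True, dropna=False))
        # Is missingness uneven across subgroups? Differential missingness
        # can encode selection bias (e.g., who gets which tests ordered).
        print(f"\nFraction of missing measurements by {col}:")
        print(df[measurement_cols].isna().groupby(df[col]).mean())

# Hypothetical usage on an ICU-style cohort table:
# df = pd.read_csv("icu_cohort.csv")
# audit_dataset(df, ["race", "sex"], ["spo2", "lactate", "creatinine"])
```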
Since 2014, the MIT Critical Data consortium has been hosting datathons (data “hackathons”) around the world. At these events, healthcare professionals and data scientists collaborate to analyze databases and examine health and disease in the local context. Textbooks and academic articles often present diseases based on observations and trials involving a narrow demographic, typically from well-resourced countries.
Our primary goal now is to instill critical thinking capabilities. The key ingredient for fostering critical thinking is diverse collaboration.
You cannot teach critical thinking in a room full of CEOs or in a room full of doctors; the environment just isn’t there. At datathons, we don’t even need to teach critical thinking explicitly. Once you bring together the right mix of people, not only from diverse backgrounds but from different generations, critical thinking emerges on its own; the environment becomes conducive to it. So we now tell our participants and students: please, don’t start building any model until you truly understand how the data was generated, which patients made it into the database, what devices were used to measure the data, and whether those devices are equally accurate across individuals.
When we hold these events around the world, we encourage participants to seek out local datasets, so that the work is relevant. There is often resistance, because people are afraid of discovering how bad their datasets are, but we reassure them that recognizing those flaws is the first step toward improvement. If they remain unaware of their data’s limitations, they will keep collecting it in the same poor way, and it will remain useless. They have to accept that they won’t get it right the first time, and that’s perfectly fine. It took a decade to build a reasonable schema for MIMIC (the Medical Information Mart for Intensive Care database, created at Beth Israel Deaconess Medical Center), and we only arrived at that schema because people told us how bad MIMIC was.
We may not have answers to every question, but we can raise people’s awareness of the many problems in the data. I am always delighted to read blog posts from datathon participants describing how their perspectives have changed. Many come away re-energized about the field, recognizing both its enormous potential and the significant risk of harm if it is approached the wrong way.