Racial bias exists in photo-based medical diagnosis despite AI help
While the overall accuracy of dermatological diagnosis improves with AI, the gap between patients with light and dark skin tones widens
- Before AI assistance, physicians are 4 percentage points less accurate on darker skin tones than on lighter skin tones
- AI assistance yields significant accuracy improvements in skin disease diagnosis across all skin tones
- However, AI assistance exacerbates the accuracy disparity in primary care physicians’ diagnoses by 5 percentage points
EVANSTON, Ill. --- As in many fields, medical professionals are still figuring out whether artificial intelligence will help or hinder their work. Experts also are exploring how biases within society make their way into human-created machines. A new Northwestern University study finds that physician-machine partnerships boost diagnostic accuracy in dermatology, but accuracy disparities across skin color persist.
In a large-scale digital experiment comparing dermatology diagnoses by doctors alone and by doctors paired with AI, a research team led by Matthew Groh of Northwestern’s Kellogg School of Management sought to benchmark how well physicians diagnose skin diseases from images and to evaluate how AI assistance influences physicians’ diagnoses.
The study was published today (Feb. 5) in Nature Medicine.
Research suggests that decision support from deep learning systems (DLS) can help improve the diagnostic accuracy of physicians, both primary care physicians and dermatology specialists, but the accuracy gap across skin tones widens among primary care physicians. The study compares human performance both with and without AI because both humans and AI are susceptible to systematic errors, especially in diagnoses for underrepresented populations.
“While research has shown there is less representation of dark skin in textbooks and in dermatology residency programs, there's minimal research around how accurate doctors are on light or dark skin with respect to diagnosing diseases,” said Groh, assistant professor of management and organizations. “Our study reveals that there are disparities in accuracy of physicians on light versus dark skin. And in this case, it’s not the AI that is biased, it’s how physicians use it.”
Humans versus AI
Without AI assistance, across all skin tones and skin conditions in the experiment, dermatology specialists were 38% accurate in their diagnoses, and primary care physicians were 19% accurate. Having little experience with darker-skinned patients had a particularly concerning result: primary care providers who reported seeing mostly or all white patients were less accurate on dark skin than on light skin.
“We suspected bias, but specialists don't have this AI-exacerbated bias, whereas primary care physicians do,” Groh said. “When a specialist sees advice from AI, they take their own vast knowledge into account when diagnosing. Whereas primary care physicians might not have that same deep intuition of pattern matching, so they go with the AI suggestion on patterns that they are aware of.”
When decision support from a DLS was introduced, diagnostic accuracy increased by 33% among dermatologists and by 69% among primary care physicians. Among dermatologists, DLS support increased accuracy relatively evenly across skin tones. However, the same was not true for primary care physicians: their accuracy increased more on light skin tones than on dark ones. AI assistance exacerbated the accuracy disparity in primary care physicians’ diagnoses by 5 percentage points, a statistically significant difference.
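To see how those figures fit together, note that the gains appear to be relative improvements over each group’s unassisted baseline, while the disparity is measured in absolute percentage points. A back-of-the-envelope reading (the exact assisted accuracies are an assumption here, since the release reports only the relative gains):

0.19 × (1 + 0.69) ≈ 0.32 (primary care physicians, AI-assisted)
0.38 × (1 + 0.33) ≈ 0.51 (dermatology specialists, AI-assisted)

And if the unassisted light-versus-dark gap of 4 percentage points widens by 5 points, primary care physicians’ assisted gap lands at roughly 9 percentage points.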
Until now, many researchers in machine learning for health care have approached the problem by training a model to classify diagnoses from an image or set of images. But perhaps, Groh said, it would be more helpful to train a DLS to generate lists of possible diagnoses, or even descriptions of the skin condition (such as its size, shape, color and texture), that could guide doctors in their diagnoses.
Patients often send images to their doctors for virtual evaluation. Apps like VisualDX, RXPhoto and Piction Health are used by doctors and patients alike. The study gives a sense of how accurate diagnoses based on a single image, without further context, might be.
For the study, the researchers recruited more than 1,100 physicians, including dermatologists and primary care physicians. Participants were presented with 364 images spanning 46 skin diseases and asked to submit up to four different diagnoses. Specialists and primary care physicians achieved diagnostic accuracies of 38% and 19%, respectively, but both groups were 4 percentage points less accurate on images of dark skin than on images of light skin. DLS decision support improved the diagnostic accuracy of both specialists and primary care physicians by more than 33% but exacerbated the gap in generalists’ diagnostic accuracy across skin tones.
Groh said he wants to continue studying how society can solve problems more effectively with AI assistance, and he hopes this research will lay the foundation for similar scrutiny in other medical fields.
“Future research in human-computer interaction, machine learning and psychology is going to need to evaluate how well a model does and how people and machines adapt when things go wrong,” Groh said. “We have to find a way to incorporate underrepresented demographics in our research. That way we will be ready to accurately implement these models in the real world and build AI systems that serve as tools that are designed to avoid the kind of systematic errors we know humans and machines are prone to. Then you can update curricula, you can change norms in different fields and hopefully everyone gets better.”
The study is titled “Deep learning-aided decision support for diagnosis of skin disease across skin tones.” In addition to Groh, co-authors of the study include Omar Badri, Northeast Dermatology Associates; Roxana Daneshjou, Stanford University; Arash Koochek, Banner Health; Caleb Harris, MIT; Luis Soenksen, MIT; P. Murali Doraiswamy, Duke University; and Rosalind Picard, MIT.
Funding for the study was provided by MIT Media Lab consortium members and the Harold Horowitz (1951) Student Research Fund.