Some researchers say that deep-learning ‘foundation’ models will revolutionize the field — but others are not so sure.
The increasing digitization of microscope slides is making pathology more amenable to advances in AI.
If you’ve ever had a biopsy, you — or at least,
your excised tissues — have been seen by a
pathologist. “Pathology is the cornerstone
of diagnosis, especially when it comes to
cancer,” says Bo Wang, a computer scientist
at the University of Toronto in Canada.
But pathologists are increasingly under
strain. Globally, demand is outstripping supply,
and many countries are facing shortages. At
the same time, pathologists’ jobs have become
more demanding. They now involve not only
more and more conventional tasks, such as
sectioning and staining tissues, then viewing
them under a microscope, but also tests that
require extra tools and expertise, such as assays
for genes and other molecular markers. For
Wang and others, one solution to this growing
problem could lie in artificial intelligence (AI).
AI tools can help pathologists in several
ways: highlighting suspicious regions in the
tissue, standardizing diagnoses and uncovering patterns invisible to the human eye, for
instance. “They hold the potential to improve
diagnostic accuracy, reproducibility and also
efficiency,” Wang says, “while also enabling
new research directions for mining large-scale
pathological and molecular data.”
Over the past few decades, slides have
increasingly been digitized, enabling pathologists to study samples on-screen rather than
under a microscope — although many still
prefer the microscope. The resulting images,
which can encompass whole slides, have
proved invaluable to computer scientists and
biomedical engineers, who have used them to
develop AI-based assistants. Moreover, the
success of AI chatbots such as ChatGPT and
DeepSeek has inspired researchers to apply
similar techniques to pathology. “This is a
very dynamic research area, with lots of new
research coming up every day,” Wang says. “It’s
very exciting.”
Scientists have designed AI models to
perform tasks such as classifying illnesses, predicting treatment outcomes and identifying
biological markers of disease. Some have even
produced chatbots that can assist physicians
and researchers seeking to decipher the data
hidden in stained slices of tissue. Such models “can essentially mimic the entire pathology
process”, from analysing slides and ordering
tests to writing reports, says Faisal Mahmood, a
computer scientist at Harvard Medical School
in Boston, Massachusetts. “All of that is possible with technology today,” he says.
But some researchers are wary. They say that
AI models have not yet been validated sufficiently — and that the opaque nature of some
models poses challenges to deploying them
in the clinic. “At the end of the day, when these
tools go into the hospital, to the bedside of the
patient, they need to provide reliable, accurate
and robust results,” says Hamid Tizhoosh, a
computer scientist at Mayo Clinic in Rochester, Minnesota. “We are still waiting for those.”
Building foundations
The earliest AI tools for pathology were
designed to perform clearly defined tasks,
such as detecting cancer in breast-tissue
biopsy samples. But the advent of ‘foundation’
models — which can adapt to a broad range of
applications that they haven’t been specifically trained for — has provided an alternative
approach.
Among the best-known foundation models
are the large language models that drive generative-AI tools such as ChatGPT. However,
ChatGPT was trained on much of the text on
the Internet, and pathologists have no
comparably vast resource with which to train their
software. For Mahmood, a potential solution to
this problem came in 2023, when researchers at
tech giant Meta released DINOv2, a foundation
model designed to perform visual tasks, such
as image classification1. The study describing DINOv2 provided an important insight,
Mahmood says — namely, that the diversity of
a training data set was more important than
its size.
By applying this principle, Mahmood and
his team launched, in March 2024, what they
describe as a general-purpose model for
pathology, named UNI. They gathered
a data set of more than 100 million images from
100,000 slides representing both diseased and
healthy organs and tissues. The researchers
then used the data set to train a self-supervised
learning algorithm — a machine-learning model
that teaches itself to detect patterns in large
data sets. The team reported that UNI could outperform existing state-of-the-art
computational-pathology models on dozens of classification
tasks, including detecting cancer metastases
and identifying various tumour subtypes in the
breast and brain. The current version of UNI
has an expanded training data set, which
includes more than 200 million images and
350,000 slides.
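Self-supervised learning of this kind needs no diagnostic labels: the model invents its own training signal from the unlabelled images. As a rough, deliberately tiny illustration (this is not UNI's actual pipeline, and every size, name and number below is invented), a DINO-style set-up trains a 'student' network to match the output of a slowly updated 'teacher' on two augmented views of the same patch:

```python
# Minimal, illustrative sketch of DINO-style self-supervised pretraining on
# unlabelled image patches. All components are hypothetical stand-ins.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for a vision-transformer backbone plus projection head."""
    def __init__(self, dim=128, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, dim), nn.ReLU(), nn.Linear(dim, out_dim),
        )
    def forward(self, x):
        return self.net(x)

student = TinyEncoder()
teacher = copy.deepcopy(student)          # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)               # the teacher is never trained by gradients

opt = torch.optim.AdamW(student.parameters(), lr=1e-3)

def augment(x):
    # Placeholder 'augmentation': a little noise stands in for crops, colour jitter, etc.
    return x + 0.1 * torch.randn_like(x)

for step in range(10):                    # a real run would see millions of patches
    patches = torch.rand(8, 3, 32, 32)    # a batch of unlabelled image patches
    v1, v2 = augment(patches), augment(patches)

    with torch.no_grad():
        t = F.softmax(teacher(v1) / 0.04, dim=-1)   # sharpened teacher targets
    s = F.log_softmax(student(v2) / 0.1, dim=-1)
    loss = -(t * s).sum(dim=-1).mean()    # student learns to match the teacher's view

    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():                 # momentum (EMA) update of the teacher
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(0.99).add_(0.01 * sp)
```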
A second foundation model designed by
the team used the same philosophy regarding diverse data sets, but also included images
from pathology slides and text obtained from
PubMed and other medical databases. (Such
models are called multimodal.)
Like UNI, the model — called CONCH (for
Contrastive Learning from Captions for Histopathology) — could perform classification
tasks, such as cancer subtyping, better than
could other models, the researchers found. For
example, it could distinguish between subtypes
of cancer that contain mutations in the BRCA
genes with more than 90% accuracy, whereas
other models performed, for the most part, no
better than would be expected by chance. It
could also classify and caption images, retrieving text in response to image queries and vice
versa, to produce graphics of the patterns seen
in specific cancers. However, it was not as accurate in these tasks as it was for classification. In
head-to-head evaluations, CONCH consistently
outperformed baseline approaches even in
cases where very few data points were available for downstream model training.
UNI and CONCH are publicly available on
the model-sharing platform Hugging Face. Researchers have
used them for a variety of applications, including grading and subtyping neural tumours
called neuroblastomas, predicting treatment
outcomes and identifying gene-expression
biomarkers associated with specific diseases.
With more than 1.5 million downloads and hundreds of citations, the models have “been used
in ways that I never thought people would be
using them for”, Mahmood says. “I had no
idea that so many people were interested in
computational pathology.”
Other groups have developed their own
foundation models for pathology. Microsoft’s
GigaPath, for instance, is trained on more than
170,000 slides obtained from 28 US cancer
centres to do tasks such as cancer subtyping.
mSTAR (for Multimodal Self-taught Pretraining), designed by computer scientist Hao
Chen at the Hong Kong University of Science
and Technology and his team, folds together
gene-expression profiles, images and text. Also available on Hugging Face, mSTAR was designed to detect metastases, to subtype cancers and to perform other tasks.
Now, Mahmood and Chen’s teams have built
‘copilots’ based on their models. In June 2024,
Mahmood’s team released PathChat — a generalist AI assistant that combined UNI with a
large language model. The resulting model
was then fine-tuned with almost one million
questions and answers using information
derived from articles on PubMed, case reports
and other sources. Pathologists can use it to
have ‘conversations’ about uploaded images
and generate reports, among other things.
Licensed to Modella AI, a biomedical firm in
Boston, Massachusetts, the chatbot received
breakthrough-device designation from the US
Food and Drug Administration earlier this year.
Similarly, Chen’s team has developed
SmartPath, a chatbot that Chen says is
being tested in hospitals in China. Pathologists are going head to head with the tool in
assessments of breast, lung and colon cancers.
Beyond classification tasks, both PathChat
and SmartPath have been endowed with agent-like capabilities — the ability to plan, make
decisions and act autonomously. According
to Mahmood, this enables PathChat to streamline a pathologist’s workflow — for example, by
highlighting cases that are likely to be positive
for a given disease, ordering further tests to
support the diagnostic process and writing
pathology reports.
Foundation models, says Jakob Kather,
an oncologist at the Technical University
of Dresden in Germany, represent “a really
transformative technological advancement”
in pathology — although they are yet to be
approved by regulatory authorities. “I think
it’ll take about two or three years until these
tools are widely available, clinically proven
products,” he says.
An AI revolution?
Not everyone is convinced that foundation
models will bring about groundbreaking
changes in medicine — at least, not in the short
term.
One key concern is accuracy: specifically,
how to quantify it, says Anant Madabhushi,
a biomedical engineer at Emory University in Atlanta, Georgia. Owing to a relative
lack of data, most pathology-AI studies use
a cross-validation approach, in which one
chunk of a data set is reserved for training and
another for testing. This can lead to issues such
as overfitting, meaning that an algorithm performs well on data similar to the information
that the model has encountered before but
does poorly on disparate data.
“The problem with cross-validation is that
it tends to provide fairly optimistic results,”
Madabhushi explains. “The cleanest and
best way to validate these models is through
external, independent validation, where the
external test set is separate and distinct from
the training set and ideally from a separate
institution.”
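The difference is easy to see in miniature. The sketch below uses synthetic, made-up data to contrast the two procedures: an internal cross-validation estimate on one institution's data, and a single check against a separate 'external' cohort whose slides are prepared differently:

```python
# Synthetic illustration of internal cross-validation versus external,
# independent validation. The data and the distribution shift are invented;
# the point is the evaluation procedure, not the numbers.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 'Institution A': the development cohort.
X_a = rng.normal(size=(300, 20))
y_a = (X_a[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# 'Institution B': an external cohort where, say, different staining and
# scanners shift the signal, so the learned decision rule fits less well.
X_b = rng.normal(loc=0.4, size=(200, 20))
y_b = (X_b[:, 0] > 0.8).astype(int)

model = LogisticRegression(max_iter=1000)

# Internal estimate: 5-fold cross-validation within institution A.
cv_acc = cross_val_score(model, X_a, y_a, cv=5).mean()

# External validation: train on all of A, test once on B.
ext_acc = model.fit(X_a, y_a).score(X_b, y_b)

print(f"internal cross-validated accuracy: {cv_acc:.2f}")
print(f"external-cohort accuracy:          {ext_acc:.2f}")
```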
Furthermore, models might not perform
as well in the field as their developers
suggest they do. In a study published in February, Tizhoosh and his colleagues put a
handful of pathology foundation models to
the test, including UNI and GigaPath. Using
a zero-shot approach, in which a model is
tested on a data set it hasn’t yet encountered
(in this case, data from the Cancer Genome
Atlas, which contains around 11,000 slides
from more than 9,000 individuals), the team
found that the assessed models were, on average, less accurate at identifying cancers than
a coin toss would be — although some models
did perform better for specific organs, such
as the kidneys.
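Testing a model 'as is' usually means freezing it, extracting an embedding for each slide or patch and classifying with something as simple as a nearest-neighbour search, with no gradient updates at all. The sketch below mimics that protocol on random stand-in embeddings (nothing here comes from UNI, GigaPath or the Cancer Genome Atlas):

```python
# Illustrative only: evaluating a frozen foundation model by nearest-neighbour
# classification of its embeddings, without any fine-tuning. The embeddings
# below are random placeholders, so accuracy hovers around chance.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(1)

ref_emb = rng.normal(size=(500, 768))      # embeddings of labelled reference slides
ref_labels = rng.integers(0, 2, size=500)  # e.g. tumour versus normal

test_emb = rng.normal(size=(100, 768))     # embeddings of unseen test slides
test_labels = rng.integers(0, 2, size=100)

knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(ref_emb, ref_labels)               # no gradient updates: the encoder stays frozen

acc = balanced_accuracy_score(test_labels, knn.predict(test_emb))
print(f"balanced accuracy without fine-tuning: {acc:.2f}")
```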
According to Tizhoosh, the discrepancy
between published performance and what
his team observed could come down to
“fine-tuning”. Researchers typically tweak
models before use by providing numerous
examples related to their specific question,
but Tizhoosh’s team used the models as is. Still,
these findings suggest that AI-based pathology tools might not be as revolutionary as their
designers state, he says. “I’m worried that they
are overpromising, and that this will create a
new wave of disappointment in AI.”
Several groups have launched efforts to
standardize validation and benchmarking
processes. For example, Tizhoosh, along with
colleagues at the Memorial Sloan Kettering
Cancer Center in New York City and the University of Texas MD Anderson Cancer Center
in Houston, is preparing a challenge in which
participants will be given 150 million images on
which to train their models. They will then submit the models for independent testing. “We are
hoping that from that undertaking, which will
be closed by the end of the year, a set of rules
and guidelines could emerge,” Tizhoosh says.
Another group, led by computer scientist
Francesco Ciompi at the Radboud University
Medical Center in Nijmegen, the Netherlands,
has also launched several such challenges.
One, called UNICORN (Unified Benchmark
for Imaging in Computational Pathology,
Radiology and Natural Language), will test
multimodal foundation models on 20 pathology-related tasks, including scoring biopsies,
identifying regions of interest and classifying
diseases. “The goal is to see how well these
foundation models do without much fine-tuning,” Ciompi says.
No simple task
Even those enthusiastic about foundation
models acknowledge that validation is no simple task. The models are designed to be
open-ended and adaptable. The “most conservative”
way to assess them, says Kather, is to test every
application. “So, if you have 1,000 different
use cases, you have to collect hundreds of
tissue slides for every single one of these use
cases and apply your model to that.”
Kather says that there are ongoing discussions around radically different approaches
for measuring performance. For instance, he
suggests, as AI models become more human-like in their abilities, perhaps they should be
tested the same way people are. “You don’t
evaluate a human in every single use case,
you evaluate their general understanding of
things — you pick some examples and evaluate
their performance.”
There are other issues, too, including generalizability: making sure that these tools work
across a diverse range of people. In 2021, for
instance, Oncotype DX, a molecular test that
assesses whether people with breast cancer
could benefit from chemotherapy, came
under fire. Researchers found that, despite
having been on the market for at least two
decades, the test was much less effective for
Black women than for white women. “If you’re
not intentional about how you’re developing
and also validating these algorithms, you’re
going to run into these catastrophic errors,”
Madabhushi says.
There’s also the problem of hallucination,
in which chatbots fabricate responses to queries. In medicine, an incorrect answer could
lead to a wrong or missed diagnosis. “How do
you measure the safety and the interpretability of these models to alleviate risks when
it comes to patient diagnosis?” Wang says.
“Regulators like the FDA simply do not have
any guidelines for generative models in the
health-care space.”
And then there’s the fact that foundation
models are effectively black boxes, meaning
that it can be difficult to work out how they
arrived at their answers. “Foundation models
are exciting, but we still lack an understanding of what these models are picking up,” says
Madabhushi.
Madabhushi works on what he calls “explainable AI” — models based on conventional
techniques in which researchers program
algorithms to pinpoint specific biological
features associated with diseases. For example, his team has developed models that
search for specific patterns of collagen fibres
that identify early-stage breast cancer and
arrangements of immune cells that predict
outcomes in individuals with cancer who
receive immunotherapy. (Madabhushi
co-founded Picture Health, a biotechnology
company in Cleveland, Ohio, that has licensed
these technologies and is working towards
regulatory approval.)
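In contrast to a foundation model's opaque embeddings, this style of model works from a handful of named measurements, so its reasoning can be read directly from the learned weights. A loose sketch, with hypothetical feature names and synthetic data standing in for real tissue measurements:

```python
# Loose sketch of a feature-based, 'explainable' classifier: a few named,
# biologically meaningful measurements per image feed a transparent model
# whose weights can be inspected. Feature names and data are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
feature_names = ["collagen_fibre_alignment", "immune_cell_density", "nuclear_size_variance"]

X = rng.normal(size=(400, 3))                          # one row of measurements per image
y = (1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=400) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Unlike a black box, each learned weight maps onto a named tissue feature,
# so a pathologist can see what drove the prediction.
for name, weight in zip(feature_names, model.coef_[0]):
    print(f"{name}: {weight:+.2f}")
```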
Other researchers are working on opening
the models’ black boxes — at least to some
extent. Chen, for one, says that he and his team
are working on ways to trace the steps their
models take to get to the answers, in hopes of
shedding light on how these algorithms make
the decisions they do. “We want our models to
be accurate and trustworthy,” Chen says. “And
for doctors, one of the most important things
is explainability.”
The field has a long way to go, but Chen is
optimistic. “This is just the beginning,” he says.
“Some people may overestimate the power
of this technology — but it can also easily be
underestimated in the long term.”