A case for AI involvement in clinical studies?

A commercially available artificial intelligence (AI) algorithm had diagnostic performance comparable to that of radiologists in assessing screening mammograms and, when used to back up first-reader radiologists, detected more cancers than the combination of first and second human readers.

While population-wide mammography has resulted in earlier detection of breast cancer tumors, as well as accompanying reductions in breast cancer mortality, Mattie Salim, MD, Department of Oncology-Pathology, Karolinska Institute, Stockholm, Sweden, and colleagues pointed out that it also places a considerable workload burden on radiologists and that their assessments can vary.

“Having a computer algorithm that performs at, or above, the level of radiologists in mammography assessment would be valuable,” the authors wrote in JAMA Oncology. “An added benefit of artificial intelligence computer-aided detection (CAD) algorithms would be to reduce the broad variation in performance among human readers that has been shown in previous studies.” And, according to the study authors, the results of their study suggest an AI algorithm has the diagnostic power to be used as an independent reader in prospective clinical studies.

In this study, the authors evaluated three commercially available artificial intelligence (AI) computer-aided detection algorithms (AI-1, AI-2, and AI-3) as independent mammography readers, as well as their performance when combined with human readers.

The subjects of the study included 8,805 women (ages 40-74 years) who underwent double-reader mammography screening at an academic hospital in Stockholm between 2008 and 2015. Of those women, 739 were diagnosed with breast cancer (positive), while the remainder were healthy controls (negative).

The authors determined that the area under the receiver operating characteristic curve (AUC) for the three algorithms was 0.956 (95% CI, 0.948-0.965) for AI-1, 0.922 (95% CI, 0.910-0.934) for AI-2, and 0.920 (95% CI, 0.909-0.931) for AI-3. The differences between AI-1 and the other two algorithms were statistically significant, and AI-1 had a significantly higher AUC in all analyzed subgroups (age, mode of detection, and breast density). For example, the AUC for clinically detected cancer after a negative assessment by a radiologist was 0.810 for AI-1 compared with 0.728 and 0.744 for AI-2 and AI-3, respectively.
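For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen cancer case receives a higher algorithm score than a randomly chosen healthy control. The sketch below illustrates that definition only; the function name `auc` and the toy scores are invented for illustration and are not taken from the study.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (cancer, control) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores, invented for illustration only (not study data).
cancer_scores = [0.9, 0.8, 0.4]
control_scores = [0.7, 0.3, 0.2, 0.1]

# 11 of the 12 cancer/control pairs are ranked correctly.
toy_auc = auc(cancer_scores, control_scores)
print(round(toy_auc, 3))
```

An AUC of 0.5 would mean the scores rank cases no better than chance, while 1.0 would mean every cancer case outscored every control.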

There was no statistically significant difference in AUC between the AI-2 and AI-3 algorithms, Salim and colleagues noted.

As for the comparison with radiologists’ assessments, the sensitivities were 81.9% for AI-1, 67.0% for AI-2, 67.4% for AI-3, 77.4% for the first-reader radiologist, and 80.1% for the second-reader radiologist. These results represented a significant difference in sensitivity between AI-1 and the other two algorithms, and between AI-1 and the first reader, but not between AI-1 and the second reader.

Combining AI-1 with first-reader radiologists achieved 88.6% sensitivity at 93.0% specificity, a sensitivity level that was not surpassed by any other combination of radiologists and AI algorithms. Thus, Salim and colleagues pointed out, combining the first reader with the best performing algorithm (AI-1) identified more cancer cases than combining the first and second readers.
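To make the idea of a combined read concrete, here is a minimal sketch of sensitivity and specificity for two readers and for their combination. It assumes an "either reader flags the exam" rule, which is an assumption for illustration; the article does not describe how the study combined readers or how it set the algorithms' operating thresholds. All function names and data are hypothetical.

```python
def sensitivity(flags, has_cancer):
    """Share of cancer cases that were flagged."""
    true_pos = sum(1 for f, c in zip(flags, has_cancer) if f and c)
    return true_pos / sum(has_cancer)


def specificity(flags, has_cancer):
    """Share of healthy controls that were not flagged."""
    true_neg = sum(1 for f, c in zip(flags, has_cancer) if not f and not c)
    return true_neg / (len(has_cancer) - sum(has_cancer))


def combine_either(a, b):
    """Assumed rule: the combined read is positive if either reader flags."""
    return [x or y for x, y in zip(a, b)]


# Toy data, invented for illustration: 4 cancers followed by 6 controls.
has_cancer = [True] * 4 + [False] * 6
reader = [True, True, False, False, False, False, False, False, False, True]
algo = [True, False, True, False, True, False, False, False, False, False]
combined = combine_either(reader, algo)

print(sensitivity(reader, has_cancer), sensitivity(combined, has_cancer))
print(specificity(reader, has_cancer), specificity(combined, has_cancer))
```

Under this rule the combined read catches any cancer that either reader catches, so sensitivity can only stay the same or rise, while false positives from both readers accumulate and specificity can only stay the same or fall. That tradeoff is why a sensitivity figure is reported together with the specificity at which it was achieved.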

“In conclusion, our results suggested that the best computer algorithm evaluated in this study assessed screening mammograms with a diagnostic performance on par with or exceeding that of radiologists in a retrospective cohort of women undergoing regular screening,” the authors concluded. “We believe that the time has come to evaluate AI CAD algorithms as independent readers in prospective clinical studies in mammography screening programs.”

In a commentary accompanying the study, Constance Dobbins Lehman, MD, PhD, Department of Radiology, Harvard Medical School, Massachusetts General Hospital, Boston, wrote that “rigorous studies evaluating whether results from simulation studies will translate to success in routine clinical practice are now essential.”

For example, she noted “there is much to learn” from the failures of previous computer-aided detection programs that originally showed promise but, in the end, failed to show improved outcomes for patients who received mammogram interpretations supported by CAD. “Many studies have confirmed that humans respond differently to CAD assistance, and the same may be true for AI-assisted readings,” Lehman pointed out.

“In the continued evolution of AI applied to improve human health, it is time to move beyond simulation and reader studies and enter the critical phase of rigorous, prospective clinical evaluation,” she wrote. “The need is great and a more rapid pace of research in this domain can be partnered with safe, careful, and effective testing in prospective clinical trials.”

  1. The diagnostic performance of a commercially available artificial intelligence algorithm proved comparable to that of human radiologists in assessing screening mammograms, according to results from a retrospective case-control study.

  2. The study authors argued that it may be time to use such an algorithm as an independent reader in prospective clinical studies.

Michael Bassett, Contributing Writer, BreakingMED™

Lehman’s institution receives grants from GE Healthcare outside the submitted work.
