A prognostic model built on a large electronic health record (EHR) lung cancer cohort helped predict survival odds out to 5 years for patients with non-small cell lung cancer (NSCLC), researchers found.
“These findings suggest that with well-designed strategies involving machine learning, NLP (natural language processing), and quality assessment, EHR data may be used for cancer research,” Qianyu Yuan, PhD, Harvard T.H. Chan School of Public Health, Boston, and colleagues reported in JAMA Network Open.
The researchers noted that the primary goal of the study “was to build a large and reliable lung cancer EHR cohort that could be used for studying lung cancer progression with a set of generalizable approaches.” The prognostic model for overall survival (OS) in patients with NSCLC also showed good discrimination, the authors added.
However, not everyone is convinced that we’re quite there yet in culling research data from the EHR.
Commenting on the findings, Neal Meropol, MD, Flatiron Health, New York, New York, and colleagues suggested that the report by Yuan et al hints at both the promise and the challenges that remain when applying artificial intelligence to the interpretation of real-world data derived from EHRs.
“In oncology, a small minority of patients take part in prospective clinical trials of investigational therapies, and those who do tend to be younger, have fewer comorbidities, and be less sociodemographically diverse than the broad population of patients with cancer,” the editorialists wrote, and added, “the generalizability of clinical trial results may be questioned.”
On the one hand, real-world data gleaned from EHRs, as in the current study, may well yield information that is more representative of the overall oncology population, Meropol and colleagues suggested. On the other hand, an algorithm that identifies patients with 90% specificity, as was the case with this particular patient cohort, may not be suitable for answering every research question.
For example, for certain cohorts, it may be more important to favor sensitivity (minimizing false negatives) than to favor specificity (minimizing false positives), they argued.
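The trade-off the editorialists describe can be illustrated with a minimal Python sketch. The scores, labels, and thresholds below are invented for illustration and are not study data, and the function name `sens_spec` is hypothetical. The sketch shows only that moving a classification threshold raises one measure at the expense of the other.

```python
def sens_spec(scores, labels, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)  # true positives
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)   # false negatives
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)   # true negatives
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)  # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical classifier scores; label 1 = truly has the condition.
scores = [0.95, 0.90, 0.80, 0.65, 0.55, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

strict = sens_spec(scores, labels, threshold=0.7)   # fewer false positives
lenient = sens_spec(scores, labels, threshold=0.5)  # fewer false negatives

print("strict  (sensitivity, specificity):", strict)
print("lenient (sensitivity, specificity):", lenient)
```

In this toy example the stricter threshold misses some true cases but admits no false positives, while the lenient threshold captures every true case at the cost of one false positive. Which setting is preferable depends on the research question.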
“The true promise of machine-based approaches is in enabling a learning health care system in which patient data are used for research and clinical applications and evolving care patterns and outcomes measurements are incorporated in a continuous feedback loop,” Meropol and colleagues wrote. “Success demands a broad recognition of the importance of high-quality data collection, data standards, and the benefits of data sharing for patients and public health.”
In the study, of the 76,643 patients with a diagnosis of lung cancer in the Mass General Brigham health care system, 42,069 patients were identified as having lung cancer using the new classification algorithm, yielding a positive predictive value (PPV) of 94.4%.
Measured by the area under the receiver operating characteristic curve (AUROC), the model's discrimination for OS was 0.828 (95% CI, 0.815-0.842) at 1 year; 0.825 (95% CI, 0.812-0.836) at 2 years; 0.814 (95% CI, 0.800-0.826) at 3 years; 0.814 (95% CI, 0.799-0.829) at 4 years; and 0.812 (95% CI, 0.798-0.825) at 5 years, investigators reported.
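For readers unfamiliar with the metric, an AUROC can be read as the probability that a randomly chosen patient who had the event received a higher risk score than a randomly chosen patient who did not. The following Python sketch computes it that way on invented data. It is a simplified illustration, not the study's method: the function name `auroc` and all values are hypothetical, and it ignores the censoring that time-dependent survival analyses must handle.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (event, non-event) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks and outcomes (1 = died within the time window).
risk = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
died = [1, 1, 0, 1, 0, 0]

value = auroc(risk, died)
print(f"AUROC = {value:.3f}")
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, which is why figures above 0.8 are generally described as good discrimination.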
Patients with NSCLC diagnosed between January 2000 and January 2015 in the EHR cohort were included in the analysis. The study cohort included 35,375 patients with lung cancer, over 85% of whom were White. The median age at diagnosis was 66.7 years (interquartile range [IQR], 58.4-74.1 years), and over 92% of the cohort had a history of smoking.
A total of 11,724 patients were included in the final analysis.
Independent predictors associated with OS included:
- Male sex: hazard ratio (HR) 1.30 (95% CI, 1.17-1.44; P<0.001).
- Stage 4 cancer versus stage 1 cancer: HR 4.83 (95% CI, 4.16-5.62; P<0.001).
- Squamous cell carcinoma versus adenocarcinoma: HR 1.14 (95% CI, 1.01-1.29; P=0.03).
- Neutrophil-lymphocyte ratio: HR 1.23 (95% CI, 1.10-1.38; P<0.001).
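As a rough plausibility check on figures like these, a hazard ratio and its confidence interval can be compared on the log scale. The Python sketch below assumes the intervals are conventional Wald-type intervals, exp(log HR ± 1.96 × SE), which the report does not state explicitly. Under that assumption the reported HR should sit near the geometric midpoint of its interval. The helper names `implied_hr` and `implied_se` are hypothetical.

```python
import math

def implied_hr(ci_low, ci_high):
    """Geometric midpoint of the interval, i.e. the HR implied by a symmetric log-scale CI."""
    return math.exp((math.log(ci_low) + math.log(ci_high)) / 2)

def implied_se(ci_low, ci_high, z=1.96):
    """Standard error of log(HR) implied by a 95% Wald-type interval (assumed)."""
    return (math.log(ci_high) - math.log(ci_low)) / (2 * z)

# (reported HR, CI lower bound, CI upper bound) as given in the article.
reported = {
    "male sex": (1.30, 1.17, 1.44),
    "stage 4 vs stage 1": (4.83, 4.16, 5.62),
    "squamous vs adenocarcinoma": (1.14, 1.01, 1.29),
    "neutrophil-lymphocyte ratio": (1.23, 1.10, 1.38),
}

for name, (hr, low, high) in reported.items():
    print(f"{name}: reported HR {hr:.2f}, "
          f"implied HR {implied_hr(low, high):.2f}, "
          f"implied SE(log HR) {implied_se(low, high):.3f}")
```

Under the stated assumption, each implied HR should agree with the reported value to within rounding. A large mismatch would usually point to a transcription error rather than a statistical problem.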
Some 6,225 patients from the Massachusetts General Hospital Boston Lung Cancer Study (BLCS) were compared to the EHR-based cohort.
“We found that HR estimates obtained from the BLCS cohort data were similar to those obtained from extracted EHR data,” Yuan and colleagues wrote. “[And w]e [also] found that NICE (natural language processing interpreter for cancer extraction) was able to reliably extract important cancer prognostic factor information embedded in EHR notes, including cancer stage, histologic type, and somatic variants,” they added.
The authors also pointed out that this EHR-based cohort offers data not typically found in existing registries, such as real-world treatment effects, molecular profiling, and laboratory test results.
Thus, “this cohort may help provide a better understanding of the association between specific drugs and improved survival outcomes, modeling clinical outcomes with comprehensive variables collected in routine clinical care,” they observed.
Limitations of the study include incomplete mortality data; difficulty determining diagnosis data for patients with recurrences or those transferred from other hospitals; challenges in extracting stage through NLP; possibly incomplete treatment data; and the fact that the patients studied were not drawn from the U.S. population as a whole.
A prognostic model based on an EHR cohort and built using machine learning predicted survival odds out to 5 years for NSCLC patients.
These findings suggest that with well-designed strategies involving machine learning, NLP (natural language processing), and quality assessment, EHR data may be used for cancer research. Be aware, however, that experts say challenges remain in applying artificial intelligence to the interpretation of real-world data derived from EHRs.
Pam Harrison, Contributing Writer, BreakingMED™
The study was funded by the National Cancer Institute.
Yuan had no conflicts of interest to declare.
The editorialists reported being employed at Flatiron Health, an independent subsidiary of the Roche Group, and owning stock in Roche.