Most lung cancers are diagnosed at an advanced stage. Pre-symptomatic identification of high-risk individuals can prompt earlier intervention and improve long-term outcomes.
We aimed to develop a machine-learning model that predicts a future diagnosis of lung cancer from routine clinical and laboratory data.
We assembled 6,505 non-small cell lung cancer (NSCLC) cases and 189,597 contemporaneous controls and compared the accuracy of a novel machine-learning model with that of a modified version of the well-validated PLCOm2012 risk model. Model performance was measured by the area under the receiver operating characteristic curve (AUC), sensitivity, and diagnostic odds ratio (OR).
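As a minimal sketch of two of the performance measures named above, the following Python computes an AUC and the sensitivity at a pre-defined specificity from risk scores. The function names and the toy scores are hypothetical and for illustration only; they are not the study's code or data, and the study may have chosen thresholds differently.

```python
def auc(case_scores, control_scores):
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def sensitivity_at_specificity(case_scores, control_scores, target_specificity):
    """Return (threshold, sensitivity) at the lowest threshold meeting the target.

    A subject is flagged positive when its score is strictly above the threshold,
    so specificity is the fraction of controls scoring at or below it.
    """
    n_controls = len(control_scores)
    for threshold in sorted(set(control_scores)):
        specificity = sum(s <= threshold for s in control_scores) / n_controls
        if specificity >= target_specificity:
            sensitivity = sum(s > threshold for s in case_scores) / len(case_scores)
            return threshold, sensitivity
    raise ValueError("target_specificity must be at most 1.0")


# Hypothetical toy scores (not study data).
toy_cases = [0.9, 0.8, 0.6, 0.3]
toy_controls = [0.1, 0.2, 0.4, 0.5, 0.7]

toy_auc = auc(toy_cases, toy_controls)
toy_threshold, toy_sensitivity = sensitivity_at_specificity(
    toy_cases, toy_controls, target_specificity=0.8
)
print(toy_auc, toy_threshold, toy_sensitivity)
```

Fixing specificity in advance, as the study does at 95%, lets two models be compared on sensitivity alone at the same false-positive rate.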
Among ever-smokers in the test set, the machine-learning model was more accurate than the modified PLCOm2012 for identifying NSCLC 9-12 months before clinical diagnosis (P<0.00001), with an AUC of 0.86, a diagnostic OR of 12.83 and a sensitivity of 40.31% at a pre-defined specificity of 95%. In comparison, the modified PLCOm2012 had an AUC of 0.79, an OR of 7.4 and a sensitivity of 27.9% at the same specificity. The machine-learning model was more accurate than standard eligibility criteria for lung cancer screening and more accurate than the modified PLCOm2012 model when applied to a screening-eligible population. Influential model variables included known risk factors and novel predictors such as white blood cell and platelet counts.
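The diagnostic ORs reported above can be related to the stated sensitivities and the 95% specificity through the standard definition of the diagnostic odds ratio, (sensitivity / (1 - sensitivity)) / ((1 - specificity) / specificity). The sketch below assumes that definition applies here, which the text does not confirm; the function name is hypothetical.

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """Odds of a positive test in cases divided by the odds in controls."""
    return (sensitivity * specificity) / ((1 - sensitivity) * (1 - specificity))


# Sensitivities and specificity as stated in the results above.
dor_ml = diagnostic_odds_ratio(0.4031, 0.95)
dor_plco = diagnostic_odds_ratio(0.279, 0.95)

print(f"machine-learning model: {dor_ml:.1f}")   # prints 12.8
print(f"modified PLCOm2012:     {dor_plco:.1f}")  # prints 7.4
```

Under this assumption, the stated sensitivities reproduce diagnostic ORs of roughly 12.8 and 7.4, in line with the reported values.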
A machine-learning model was more accurate for early diagnosis of NSCLC than either standard eligibility criteria for screening or the modified PLCOm2012, demonstrating the potential to help prevent lung cancer deaths through early detection.