Posted: Thursday, November 3, 2022
Raymond H. Mak, MD, of Mass General Brigham, Boston, and colleagues developed a strategy for the clinical validation of deep learning models for segmenting primary non–small cell lung cancer (NSCLC) tumors and involved lymph nodes in CT images. Their findings, which were published in The Lancet Digital Health, revealed that in silico geometric segmentation metrics may not correlate with the clinical utility of the artificial intelligence (AI) models.
“The benefits of this approach for patients include greater consistency in segmenting tumors and accelerated times to treatment,” commented Dr. Mak in an institutional press release. “The clinician benefits include a reduction in mundane but difficult computer work.”
Using CT images from 787 patients, the investigators trained the model to distinguish tumors from other tissues. The algorithm’s performance was tested using scans from more than 1,300 patients from external data sets.
Compared with the interobserver benchmark, the models demonstrated improvements in the volumetric (0.83 vs. 0.91; P = .0062) and surface (0.72 vs. 0.86; P = .0005) Dice coefficients. Primary validation of the AI models on internal Harvard-RT1 data, which were segmented by the same expert who segmented the discovery data, revealed a volumetric Dice coefficient of 0.83 and a surface Dice coefficient of 0.79. The AI models demonstrated volumetric and surface Dice coefficients of 0.70 and 0.50, respectively, when tested on internal Harvard-RT2 data segmented by other experts. For the RTOG-0617 data set, the volumetric Dice coefficient was 0.71, and the surface Dice coefficient was 0.47; testing on the diagnostic radiology data sets NSCLC-radiogenomics and Lung-PET-CT-Dx yielded similar results.
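For readers unfamiliar with the metric, the volumetric Dice coefficient measures the overlap between two segmentations as twice the number of shared voxels divided by the total number of voxels in both, ranging from 0 (no overlap) to 1 (identical). The following is a minimal sketch of that standard formula only; the function name and the toy masks are hypothetical, the values are not from the study, and it does not reproduce the investigators' code or the surface Dice variant.

```python
def volumetric_dice(mask_a, mask_b):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flattened binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must contain the same number of voxels")
    size_a = sum(1 for v in mask_a if v)
    size_b = sum(1 for v in mask_b if v)
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * overlap / (size_a + size_b)


# Toy example (illustrative values only): 1 marks a voxel labeled as tumor.
expert = [0, 1, 1, 1, 1, 0, 0, 0]
model = [0, 0, 1, 1, 1, 1, 0, 0]
print(volumetric_dice(expert, model))  # 3 shared voxels, 4 + 4 total -> 0.75
```

In this toy case, the two masks each mark four voxels and agree on three of them, so the coefficient is 0.75.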
Despite the lower geometric overlap on some data sets, the models yielded target volumes with radiation dose coverage equivalent to those of the experts. The performance of de novo expert segmentation and AI-assisted segmentation did not appear to differ significantly. Physicians worked 65% faster (P < .0001) and with 32% less variation (P = .013) when editing an AI-produced segmentation than when editing a manually produced one.
Disclosure: For full disclosures of the study authors, visit thelancet.com.