Published on in Vol 11 (2025)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/64399, first published .
Next-Generation Sequencing–Based Testing Among Patients With Advanced or Metastatic Nonsquamous Non–Small Cell Lung Cancer in the United States: Predictive Modeling Using Machine Learning Methods

Next-Generation Sequencing–Based Testing Among Patients With Advanced or Metastatic Nonsquamous Non–Small Cell Lung Cancer in the United States: Predictive Modeling Using Machine Learning Methods

Next-Generation Sequencing–Based Testing Among Patients With Advanced or Metastatic Nonsquamous Non–Small Cell Lung Cancer in the United States: Predictive Modeling Using Machine Learning Methods

Original Paper

1Eli Lilly and Company, Sydney, Australia

2Eli Lilly and Company, Indianapolis, IN, United States

3Syneos Health, Morrisville, NC, United States

Corresponding Author:

Lisa M Hess, PhD

Eli Lilly and Company

LCC Corporate Center

Indianapolis, IN, 46285

United States

Phone: 1 317 908 1872

Email: hess_lisa_m@lilly.com


Background: Next-generation sequencing (NGS) has become a cornerstone of treatment for lung cancer and is recommended in current treatment guidelines for patients with advanced or metastatic disease.

Objective: This study was designed to use machine learning methods to determine demographic and clinical characteristics of patients with advanced or metastatic non–small cell lung cancer (NSCLC) that may predict likelihood of receiving NGS-based testing (ever vs never NGS-tested) as well as likelihood of timing of testing (early vs late NGS-tested).

Methods: Deidentified patient-level data were analyzed in this study from a real-world cohort of patients with advanced or metastatic NSCLC in the United States. Patients with nonsquamous disease, who received systemic therapy for NSCLC, and had at least 3 months of follow-up data for analysis were included in this study. Three strategies, logistic regression models, penalized logistic regression using least absolute shrinkage and selection operator penalty, and extreme gradient boosting with classification trees as base learners, were used to identify predictors of ever versus never and early versus late NGS testing. Data were split into D1 (training+validation; 80%) and D2 (testing; 20%) sets; the 3 strategies were evaluated by comparing their performance on multiple m=1000 splits in the training (70%) and validation data (30%) within the D1 set. The final model was selected by evaluating performance using the area under the receiver operating curve while taking into account considerations of simplicity and clinical interpretability. Performance was re-estimated using the test data D2.

Results: A total of 13,425 met the criteria for the ever NGS-tested, and 17,982 were included in the never NGS-tested group. Performance metrics showed the area under the receiver operating curve evaluated from validation data was similar across all models (77%-84%). Among those in the ever NGS-tested group, 84.08% (n=11,289) were early NGS-tested, and 15.91% (n=2136) late NGS-tested. Factors associated with both ever having NGS testing as well as early NGS testing included later year of NSCLC diagnosis, no smoking history, and evidence of programmed death ligand 1 testing (all P<.05). Factors associated with a greater chance of never receiving NGS testing included older age, lower performance status, Black race, higher number of single-gene tests, public insurance, and treatment in a geography with Molecular Diagnostics Services Program adoption (all P<.05).

Conclusions: Predictors of ever versus never as well as early versus late NGS testing in the setting of advanced or metastatic NSCLC were consistent across machine learning methods in this study, demonstrating the ability of these models to identify factors that may predict NGS-based testing. There is a need to ensure that patients regardless of age, race, insurance status, and geography (factors associated with lower odds of receiving NGS testing in this study) are provided with equitable access to NGS-based testing.

JMIR Cancer 2025;11:e64399

doi:10.2196/64399

Keywords



The care of patients with non–small cell lung cancer (NSCLC) has changed dramatically since the early 2010s, from a chemotherapy-based approach that was tailored only to the disease histology (squamous or nonsquamous tumors) to becoming a disease with multiple actionable biomarkers that can identify targeted therapies associated with superior outcomes based on individual patient genomic characteristics [Mitchell CL, Zhang AL, Bruno DS, Almeida FA. NSCLC in the era of targeted and immunotherapy: what every pulmonologist must know. Diagnostics (Basel). 2023;13(6):1117. [FREE Full text] [CrossRef] [Medline]1,VanderLaan PA, Rangachari D, Majid A, Parikh MS, Gangadharan SP, Kent MS, et al. Tumor biomarker testing in non-small-cell lung cancer: a decade of change. Lung Cancer. 2018;116:90-95. [FREE Full text] [CrossRef] [Medline]2]. This has led to the adoption of next-generation sequencing (NGS) recommendations included in treatment guidelines for patients with NSCLC [NCCN Clinical Practice Guidelines in Oncology: non-small cell lung cancer version 3.2023. National Comprehensive Cancer Network. 2023. URL: https://www.nccn.org/guidelines/category_1 [accessed 2023-07-02] 3].

Unfortunately, despite these recommendations, multiple studies have shown that NGS-based testing is not being used for all patients with advanced or metastatic NSCLC, and only about half of all patients in some studies receive comprehensive biomarker testing [Hess LM, Krein PM, Haldane D, Han Y, Sireci AN. Biomarker testing for patients with advanced/metastatic nonsquamous NSCLC in the United States of America, 2015 to 2021. JTO Clin Res Rep. 2022;3(6):100336. [FREE Full text] [CrossRef] [Medline]4-Evangelist MC, Butrynski JE, Paschold JC, Ward PJ, Spira AI, Ali K, et al. Contemporary biomarker testing rates in both early and advanced NSCLC: results from the MYLUNG pragmatic study. J Clin Oncol. 2023;41(16 Suppl):9109-9109. [CrossRef]6]. The reasons for the lack of testing are unclear but may include barriers to ordering tests, insufficient tissue, clinical deterioration, or a crisis that requires immediate care [Evangelist MC, Butrynski JE, Paschold JC, Ward PJ, Spira AI, Ali K, et al. Contemporary biomarker testing rates in both early and advanced NSCLC: results from the MYLUNG pragmatic study. J Clin Oncol. 2023;41(16 Suppl):9109-9109. [CrossRef]6]. More recent studies have also demonstrated a racial disparity in receipt of biomarker testing; patients who are Black are significantly less likely than those who are White to receive NGS-based testing in the United States [Bruno DS, Hess LM, Li X, Su EW, Patel M. Disparities in biomarker testing and clinical trial enrollment among patients with lung, breast, or colorectal cancers in the United States. JCO Precis Oncol. 2022;6:e2100427. [CrossRef]7].

Studies evaluating the barriers to testing have typically taken a specific hypothesis-driven a priori categorization of potential barriers to investigate the lack of testing [Evangelist MC, Butrynski JE, Paschold JC, Ward PJ, Spira AI, Ali K, et al. Contemporary biomarker testing rates in both early and advanced NSCLC: results from the MYLUNG pragmatic study. J Clin Oncol. 2023;41(16 Suppl):9109-9109. [CrossRef]6,Bruno DS, Hess LM, Li X, Su EW, Patel M. Disparities in biomarker testing and clinical trial enrollment among patients with lung, breast, or colorectal cancers in the United States. JCO Precis Oncol. 2022;6:e2100427. [CrossRef]7]. While certainly this approach is critical to investigate specific issues such as racial disparities, this falls short when trying to evaluate the complexity of care and the multiple and potentially interacting factors. Clinical prediction models are an alternative approach to using patient-level evidence to help inform health care decision makers about patient care. These models have been used for decades by health care professionals [Steyerberg EW. Clinical Prediction Models. Washington DC. Springer; 2019. 8]. Traditionally, prediction models combine patient demographic, clinical, and treatment characteristics in the form of a statistical or mathematical model, usually regression, classification, or neural networks, but deal with a limited number of predictor variables (usually below 25). Flexible machine learning methods can be used, by which the researcher does not force the model to evaluate a limited set of covariates, but rather the models themselves learn by trial and error from the data to make predictions, without having a predefined set of rules for decision-making. Simply, machine learning can be better understood as “learning from data” [Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov. 2019;18(6):463-477. [FREE Full text] [CrossRef] [Medline]9]. The setting of biomarker testing provides an opportunity to apply these methods to more thoroughly explore the factors that are associated with the lack of recommended biomarker testing.

While machine learning methods have been more commonly used for biomarker identification and treatment selection, there is little evidence of these methods applied to the prediction of biomarker testing itself. To date, the investigations surrounding the gaps in biomarker testing have remained largely limited to descriptive research and opinion pieces [Robert NJ, Espirito JL, Chen L, Nwokeji E, Karhade M, Evangelist M, et al. Biomarker testing and tissue journey among patients with metastatic non-small cell lung cancer receiving first-line therapy in The US Oncology Network. Lung Cancer. 2022;166:197-204. [FREE Full text] [CrossRef] [Medline]10-Burns L, Jani C, Radwan A, Omari OA, Patel M, Oxnard GR, et al. Implementation challenges and disparities in molecular testing for patients with stage IV NSCLC: perspectives from an urban safety-net hospital. Clin Lung Cancer. 2023;24(2):e69-e77. [CrossRef] [Medline]13]. Therefore, this study was designed to fill this gap in evidence by applying machine learning methods to the question of biomarker testing for patients with advanced or metastatic nonsquamous NSCLC to determine demographic and clinical characteristics that may predict receipt of NGS-based testing. A second objective was to further determine the characteristics that predict receipt of NGS-based testing (early testing) in accordance with clinical guidelines that can inform first-line therapy (vs those who receive NGS-based testing after the first-line therapy is underway). These objectives were pursued to better understand factors associated with experiencing barriers to recommended testing and the timing of such testing to inform future intervention strategies.


Data Source

This study used the Advanced NSCLC Analytic Cohort from the nationwide Flatiron Health electronic health record–derived longitudinal database, comprising deidentified patient-level structured and unstructured data, curated via technology-enabled abstraction [Ma X, Long L, Moon S, Adamson BJS, Baxi SS. Comparison of population characteristics in real-world clinical oncology databases in the US: Flatiron Health, SEER, and NPCR. medRxiv. Preprint posted online January 5, 2023. [CrossRef]14,Birnbaum B, Nussbaum N, Seidl-Rathkopf K, Agrawal M, Estevez M, Estola E, et al. Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research. ArXiv. Preprint posted online on January 13, 2020. [FREE Full text]15]. The data are deidentified and subject to obligations to prevent reidentification and protect patient confidentiality and are not considered human participants in accordance with the US Code of Federal Regulations [Definition of human subjects research. National Institutes of Health. 2024. URL: https://grants.nih.gov/policy/humansubjects/research.htm [accessed 2024-03-07] 16]. These deidentified data originate from approximately 280 cancer clinics (~800 sites of care) in the United States. Patients in this database are those who have lung cancer ICD (International Classification of Diseases) codes 162.x (ICD-9 [International Classification of Diseases, Ninth Revision]), C34x, or C39.9 (ICD-10 [International Statistical Classification of Diseases, Tenth Revision]) on at least 2 documented clinical visits on different days occurring on or after January 1, 2011. Longitudinal patient-level data were available through November 2021. Patients must further have had pathology consistent with NSCLC and have advanced or metastatic disease (diagnosed with stage IIIB, IIIC, IVA, or IVB disease or diagnosed with early-stage NSCLC and subsequently developed recurrent or progressive disease).

Definitions of NGS Testing Cohorts

Patients were included in this analysis if they were in the Flatiron Health Advanced NSCLC Analytic Cohort, had nonsquamous NSCLC, evidence of receipt of systemic therapy, and at least 3 months of follow-up in the database. Receipt of testing by NGS is a field recorded in the electronic medical record database by the health care provider that was used for testing identification in this study. The method of NGS testing (tissue or circulating tumor) is not specified. Patients were excluded who had evidence of NGS-based testing more than 20 days prior to initial NSCLC diagnosis. Patients meeting the inclusion criteria for this study were categorized into 2 groups. The ever NGS-tested group included patients with at least 1 NGS test recorded in the database. All remaining patients were included in the never NGS-tested group, as this group was comprised of patients with no evidence of any NGS test recorded in the database. Among those in the ever NGS-tested group, individuals were further subgrouped by the timing of NGS-based testing. Each patient in the ever NGS-tested group was either included in the early NGS-tested subgroup, including patients whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy, or the late NGS-tested subgroup, all remaining patients whose first NGS-based test occurred 8 days or later after the start of first-line therapy. The date of advanced or metastatic diagnosis was considered the index diagnosis date.

Candidate Predictors

Candidate predictors for receipt and timing of NGS-based testing were prespecified based on published literature, analyses of real-world data, and expert input from the field of cancer diagnostics [Hess LM, Krein PM, Haldane D, Han Y, Sireci AN. Biomarker testing for patients with advanced/metastatic nonsquamous NSCLC in the United States of America, 2015 to 2021. JTO Clin Res Rep. 2022;3(6):100336. [FREE Full text] [CrossRef] [Medline]4,Bruno DS, Hess LM, Li X, Su EW, Patel M. Disparities in biomarker testing and clinical trial enrollment among patients with lung, breast, or colorectal cancers in the United States. JCO Precis Oncol. 2022;6:e2100427. [CrossRef]7,Bruno DS, Li X, Hess LM. Biomarker testing, targeted therapy and clinical trial participation by race among patients with lung cancer: a real-world Medicaid database study. JTO Clin Res Rep. 2024;5(3):100643. [CrossRef] [Medline]17]. These variables included patient age at advanced or metastatic diagnosis date (years), sex (male or female), race (Asian, Black, White, and other), insurance type (public, private, or other), Eastern Cooperative Oncology Group (ECOG) performance status (0-4), smoking history (ever vs never smoker), body weight (kilograms), BMI (kg/m2), practice setting (academic or community), practice volume (the average number of those with NSCLC receiving care at the site where the included patient received care by index year over the period 2011 to 2021), biomarker result (positive, not positive, and not tested) by each available biomarker (anaplastic lymphoma kinase [ALK]; epidermal growth factor receptor [EGFR]; V-Raf murine sarcoma viral oncogene homolog B [BRAF]; Kirsten rat sarcoma virus [KRAS]; c-ros oncogene 1 [ROS1]; mesenchymal epithelial transition [MET]; neurotrophic tyrosine receptor kinase [NTRK]; rearranged during transfection [RET]; and programmed death ligand 1 [PD-L1]), stage of disease at initial diagnosis (0-IV), laboratory value (low, normal, high, or not tested) by blood test (alkaline phosphatase, alanine transaminase, aspartate transferase, bilirubin, creatinine, lymphocyte count, red blood cell count, hematocrit, platelet count, white blood cell count, and hemoglobin), number of non-NGS biomarker tests received (total number of fluorescence in situ hybridization, immunohistochemistry, polymerase chain reaction, or other non-NGS–based tests), as well as 2 variables to identify periods of environmental changes. The first of these variables categorized the status of National Comprehensive Cancer Network (NCCN) Clinical Guidelines: prior to 2016, before NGS was recommended in the guidelines; 2016-2019, when broad-based testing was recommended; and 2020 and later, when NGS-based testing was recommended [NCCN Clinical Practice Guidelines in Oncology. National Comprehensive Cancer Network. 2022. URL: https://www.nccn.org/guidelines/category_1 [accessed 2022-07-18] 18]. The second variable evaluated the timing of US Food and Drug Administration approval of drugs that targeted the available biomarkers: period (1) January 1, 2011-August 25, 2011 (EGFR drugs only); period (2) August 26, 2011-March 10, 2016 (EGFR+ALK); period (3) March 11, 2016-June 21, 2017 (EGFR+ALK+ROS1); period (4) June 22, 2017-November 25, 2018 (EGFR+ALK+ROS1+BRAF); period (5) November 26, 2018-May 5, 2020 (EGFR+ALK+ROS1+BRAF+NTRK); period (6) May 6, 2020-May 26, 2021 (EGFR+ALK+ROS1+BRAF+NTRK+MET+RET); and period (7) May 27, 2021, and later (EGFR+ALK+ROS1+BRAF+MET+NTRK+RET+KRAS) [Oncology (cancer)/hematologic malignancies approval notification. Food and Drug Administration. 2022. URL: https:/​/www.​fda.gov/​drugs/​resources-information-approved-drugs/​oncology-cancer-hematologic-malignancies-approval-notifications [accessed 2022-07-01] 19]. Additionally, candidate predictors of Medicare Administrative Contractor (MAC) region [Experian. MAC Jurisdiction Map. 2016. URL: https://www.experian.com/blogs/healthcare/wp-content/uploads/2016/08/MAC-A-B-Jurisdiction-Map.jpg [accessed 2016-06-30] 20] and Molecular Diagnostics Services (MolDX) Program adoption (yes or no) [MolDX® Program (Administered by Palmetto GBA). Palmetto-GBA. 2022. URL: https://palmettogba.com/palmetto/moldxv2.nsf [accessed 2025-05-30] 21] were included. These variables explored the policies in place at the geography in which the patient received care. MACs are private companies that process claims for Medicare beneficiaries. These companies are geographically distinct and identifiable by unique alphanumeric designations (eg, J8=jurisdiction 8) and by private company names (eg, Noridian and Palmetto) [CMS. Medicare Administrative Contractors (MACs). 2023. URL: https://www.cms.gov/medicare/coding-billing/medicare-administrative-contractors-macs/whats-mac [accessed 2025-04-10] 22]. The MolDX Program determines the coverage of diagnostic testing in 4 MACs across 28 states [Experian. MAC Jurisdiction Map. 2016. URL: https://www.experian.com/blogs/healthcare/wp-content/uploads/2016/08/MAC-A-B-Jurisdiction-Map.jpg [accessed 2016-06-30] 20,MolDX® Program (Administered by Palmetto GBA). Palmetto-GBA. 2022. URL: https://palmettogba.com/palmetto/moldxv2.nsf [accessed 2025-05-30] 21]. Importantly, all candidate predictor variables were required to be recorded prior to the end of the early NGS testing period to ensure that no covariates were recorded after the measurement of the NGS testing outcome.

The following interactions were deemed to be clinically relevant and forced into the models for evaluation: smoking and sex, smoking and NCCN guideline periods, race and insurance type, age and ECOG performance status, MAC region and public insurance, and MolDX region and public insurance. The estimates of the expected direction of these relationships were defined in the study protocol and are summarized in Table 1.

Table 1. Expected direction of candidate predictors for next-generation sequencing (NGS) testing.
Candidate predictor variableExpected direction
Year of advanced or metastatic diagnosisAs year increases, NGS testing is more likely.
Smoking status (yes vs no)Smoking=no, NGS testing is more likely.
Sex (male vs female)Sex=female, NGS testing is more likely.
Race (Asian, Black, White, other)Race=Asian or White, NGS testing is more likely.
Practice volume (continuous)As practice volume increases, NGS testing is more likely.
BMI (using WHOa categories)BMI=underweight, NGS testing is less likely.
ECOGb performance status (0, 1, 2, 3, or 4)As ECOG performance status increases, NGS testing is less likely.
Body weight (continuous, in kilograms)As weight increases, NGS testing is more likely.
Stage at initial diagnosis (0-I, II, III, or IV)Stage 0-II=NGS is more likely than stage III; stage IV=NGS is more likely than stage III.
EGFRc (not tested, positive, not positive) by non-NGS testEGFR=positive, NGS less likely.
ROS1d (not tested, positive, not positive) by non-NGS testROS1=positive, NGS less likely.
ALKe (not tested, positive, not positive) by non-NGS testALK=positive, NGS less likely.
BRAFf (not tested, positive, not positive) by non-NGS testBRAF=positive, NGS less likely.
KRASg (not tested, positive, not positive) by non-NGS testKRAS=positive, NGS less likely.
PD-L1h (not tested, positive, not positive)PD-L1=positive, NGS less likely.
Number of single-gene tests (continuous)As the number of single-gene tests increase, NGS less likely.
Practice setting (academic, community)Practice setting=academic, NGS more likely.
Insurance status (public, private, other)This relationship is unknown. It is possible that insurance status=public, NGS less likely; however, it is possible that in some cases, insurance status=private only, NGS could be less likely.
MACi regionNo direction is known.
MolDXjWhile this only applies to Medicare, states may adopt broader policies, and the relationship is uncertain. MolDX may make NGS more likely, but it is largely unknown.
NCCNk guidelines (pre, broad, or NGS)NCCN guidelines=NGS, NGS more likely.
Drug approval periods (1, 2, 3, 4, 5, 6, 7)As drug approval periods increase, NGS more likely.
Laboratory values (high, normal, low, not tested) for alkaline phosphatase, alanine transaminase, aspartate transferase, bilirubin, creatine, lymphocyte count, red blood cell count, hematocrit, platelet count, white blood cell count, hemoglobinThe direction of a single laboratory value is unknown. However, generally one would expect multiple out-of-range values to reflect poor health and may make NGS less likely, but the a priori assumed direction is unknown.

aWHO: World Health Organization.

bECOG: Eastern Cooperative Oncology Group.

cEGFR: epidermal growth factor receptor.

dROS1: c-ros oncogene 1.

eALK: anaplastic lymphoma kinase.

fBRAF: V-Raf murine sarcoma viral oncogene homolog B.

gKRAS: Kirsten rat sarcoma virus.

hPD-L1: programmed death ligand 1.

iMAC: Medicare Administrative Contractor.

jMolDX: Molecular Diagnostics Services.

kNCCN: National Comprehensive Cancer Network.

Statistical Analysis

Descriptive analyses were conducted to summarize available data and to understand the extent of missingness in the database. Categorical variables were assessed using a 1-sided chi-square test or Fisher exact test and continuous variables using a 2-sided t test. Missing values were imputed using the random forest missing data algorithm (impute.rfsrc function in R package randomForestSRC) [Tang F, Ishwaran H. Random forest missing data algorithms. Stat Anal Data Min. 2017;10(6):363-377. [CrossRef] [Medline]23].

Three modeling strategies were used to identify potential predictors of NGS-based testing with 2 sets of outcomes for ever versus never NGS-tested (model 1) and early versus late NGS-tested (model 2). The 3 modeling strategies included logistic regression (LR) models, penalized logistic regression (PLR) using least absolute shrinkage and selection operator (LASSO) penalty, and extreme gradient boosting (XGBoost) with trees as base learners. LR was implemented using forward selection on the main effects and predefined interactions (listed earlier), starting with the predefined variables and adding the most significant terms to the model. PLR was implemented using sparse group LASSO on the main effects and predefined interactions, forcing some predefined variables into the model with the penalty selected using 5-fold cross-validation. XGBoost is a decision tree–based machine learning algorithm [Chen T, Guestrin C. XGBoost: a scalable tree boosting system. 2016. Presented at: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13, 2016; San Francisco, CA, United States. [CrossRef]24]. The model matrix for XGBoost was built using main effects and predefined interactions. Hyperparameters were selected based on 5-fold cross-validation over a grid search, and hyperparameters included the shrinkage (learning rate), the number of trees, and tree depth. Table S1 in

Multimedia Appendix 1

Details of study methods, model performance, and variable importance plots.

DOCX File , 406 KBMultimedia Appendix 1 contains the full list of hyperparameters used in this study. The data extraction approach and modeling process is summarized in Figure 1.

In step 1, data were extracted based on the prespecified inclusion and exclusion criteria. Step 2 involved variable recoding, which included transforming all categorical variables with missing information by creating an additional level to represent missing data. Step 3 was a data quality method used to identify any unusual observations that needed to be excluded or recoded in addition to any imputation that was required. Steps 4 to 6 outline the implementation of models, evaluation of the performance of these models, and interpretation of the final features selected using LR. Figure 2 provides an overview of the model strategy evaluation process for the 2 outcomes mentioned in step 4 of Figure 1.

First, the data were split into D1 (training+validation; 80%) and D2 (testing; 20%) sets. Then, the 3 strategies were evaluated by comparing their performance on multiple m=1000 splits in the training (70%) and validation data (30%) within the D1 set. Specifically, for each split, all 3 strategies were fit to training data, and performance measures (eg, area under the receiver operating curve) were computed on the validation data. Modeling was done using R packages, sparsegl was used for LASSO, XGBoost for gradient boosting, and PRROC, which computes the areas under the precision-recall and ROC curve, for performance measures. PLR and XGBoost involved hyperparameters that were fine-tuned using 5-fold cross-validation nested within training datasets. Prediction models were developed on 2 different groups: ever versus never and early versus late NGS-tested groups. In total, 146 features (including all levels of all variables) were entered into both the XGBoost and LASSO models, with only 36 features (main effects and interactions) being used in the LR model. Preselection of features consisted of excluding variables that have little to no association with the outcomes of interest.

The final model was selected by evaluating performance as described earlier (area under the receiver operating curve from validation data) and by considering the simplicity and clinical interpretability. Model performance was re-estimated using the test data D2. For the final model choice, the features with nonzero coefficients selected by PLR were run on the D1 data. These variables were fitted to an LR model within the test data D2 to calculate model estimates (odds ratios, 95% CIs, and P values). Odds ratios for main effects in the presence of interaction terms were calculated using the analytical formula presented in

Multimedia Appendix 2

Computing partial effects from covariates from logistic regression models.

DOCX File , 16 KBMultimedia Appendix 2. All analyses were conducted using SAS (version 9.4; SAS Institute Inc) and R (version 4.0.3; R Foundation for Statistical Computing).

Figure 1. Data extraction and modeling flow. AUC: area under the curve; CI: confidence interval; NSCLC: non–small cell lung cancer; OR: odds ratio; ROC: receiver operating curve.
Figure 2. Modeling evaluation flow. EHR: electronic health record; NGS: next-generation sequencing.

Ethical Considerations

The data used for this study are deidentified and subject to obligations to prevent reidentification and protect patient confidentiality, and as such are not considered human subjects research and are exempt from review in accordance with the US Code of Federal Regulations [Definition of human subjects research. National Institutes of Health. 2024. URL: https://grants.nih.gov/policy/humansubjects/research.htm [accessed 2024-03-07] 16].


A total of 74,211 patient records were available in the Flatiron Health NSCLC dataset for this analysis. After applying eligibility criteria, a total of 31,407 patients were included in this analysis. Of all patients, 42.75% (n=13,425) were included in the ever NGS-tested group and 57.25% (n=17,982) were included in the never NGS-tested group. Among those in the ever NGS-tested group, 84.08% (n=11,289) were early NGS-tested, and 15.91% (n=2136) late NGS-tested. Characteristics of these groups and subgroups used as features in the machine learning models are listed in Tables 2-11.

Most features were significantly different between both the ever and never NGS-tested as well as the early NGS versus late NGS-tested groups. Of note, smoking rates and testing conducted during the NCCN prerecommendations period were lower for the ever NGS-tested group (n=10,589, 78.88% vs n=14,987, 83.34% and n=2663, 19.84% vs n=10,734, 59.69%, respectively), and ECOG status of 0 (n=4410, 32.85% vs n=4665, 25.94%) was higher for the ever NGS-tested group versus those who were never tested. Similarly, for the early versus late NGS-tested groups, there was a higher proportion of patients with a history of smoking (n=9025, 79.95% vs n=1564, 73.22%) and a lower proportion of testing conducted during the NCCN prerecommendations period (n=1746, 15.47% vs n=917, 42.93%) as well as a lower proportion of ECOG status of 0 (n=3606, 31.94% vs n=804, 37.64%) for the early tested group.

Comparison of performance metrics for each model showed that the percent AUC was similar across models (80%-84% and 77%-80%) and marginally better when the models were fit on the ever versus never NGS-tested groups. In addition, other metrics were also comparable (Table S2 in

Multimedia Appendix 1

Details of study methods, model performance, and variable importance plots.

DOCX File , 406 KBMultimedia Appendix 1). The final model chosen was the LASSO model, as it was able to identify important features including interactions (those with nonzero coefficients after shrinkage) and the metrics for each model were highly comparable (Table S2 in

Multimedia Appendix 1

Details of study methods, model performance, and variable importance plots.

DOCX File , 406 KB
Multimedia Appendix 1
). Figures S1 and S2 in

Multimedia Appendix 1

Details of study methods, model performance, and variable importance plots.

DOCX File , 406 KB
Multimedia Appendix 1
show the feature importance plots for both groups. The most important factors associated with ever versus never testing included year of diagnosis, observation of a PD-L1 test, Black or African American race, and number of single-gene tests observed. The most important factors associated with early versus late testing included the observation of a PD-L1 test, a positive single-gene test result, the year of diagnosis, and the geographical region of care. Later year of diagnosis, evidence of PD-L1 testing, patient race, positive single-gene test results, and region were among the top 5 predictors of NGS testing for both ever versus never as well as early versus late NGS testing.

Table 2. Demographic characteristics of the overall, ever, and never NGSa-tested study cohorts prior to imputation.
CharacteristicOverall (N=31,407)Ever NGS-testedb (n=13,425)Never NGS-testedc (n=17,982)Ever NGS-tested versus never NGS-tested, P valued
Age at initial diagnosis (years), mean (SD)67.2 (9.8)67.2 (10.1)67.3 (9.5).66
Sex, n (%).0007

Female16,680 (53.11)7281 (54.23)9399 (52.27)

Male14,726 (46.89)6144 (45.77)8582 (47.73)

Unknown or missing1 (0)0 (0)1 (0.01)
Race, n (%)<.0001

Asian1050 (3.34)552 (4.11)498 (2.77)

Black or African American2845 (9.06)1089 (8.11)1756 (9.77)

White21,248 (67.65)9109 (67.85)12,139 (67.51)

Other3269 (10.41)1392 (10.37)1877 (10.44)

Unknown or missing2995 (9.54)1283 (9.56)1712 (9.52)
Smoking status, n (%)<.0001

History of smoking25,576 (81.43)10,589 (78.88)14,987 (83.34)

No history of smoking5657 (18.01)2826 (21.05)2831 (15.74)

Unknown or missing174 (0.55)10 (0.07)164 (0.91)
ECOGe performance status, n (%)<.0001

09075 (28.89)4410 (32.85)4665 (25.94)

111,215 (35.71)5275 (39.29)5940 (33.03)

23401 (10.83)1393 (10.38)2008 (11.17)

3762 (2.43)306 (2.28)456 (2.54)

451 (0.16)17 (0.13)34 (0.19)

Unknown or missing6903 (21.98)2024 (15.08)4879 (27.13)

aNGS: next-generation sequencing.

bPatients in the overall study cohort with evidence of NGS-based biomarker testing in the database.

cPatients in the overall study cohort with no evidence of NGS-based biomarker testing.

dTwo-sided t test for continuous variables; chi-square or Fisher exact test (where expected cell size <5) for categorical variables.

eECOG: Eastern Cooperative Oncology Group.

Table 3. Biomarker status of the overall, ever, and never NGSa-tested study cohorts prior to imputation.
CharacteristicOverall (N=31,407)Ever NGS-testedb (n=13,425)Never NGS-testedc (n=17,982)Ever NGS-tested vs never NGS-tested, P valued
Non-NGS–based (single gene) ALKe status, n (%)<.0001

Positive617 (1.96)253 (1.88)364 (2.02)

Not positive15,626 (49.75)6278 (46.76)9348 (51.99)

Not tested15,164 (48.28)6894 (51.35)8270 (45.99)
Non-NGS–based (single gene) BRAFf status, n (%)<.0001

Positive94 (0.30)32 (0.24)62 (0.34)

Not positive3775 (12.02)1729 (12.88)2046 (11.38)

Not tested27,538 (87.68)11,664 (86.88)15,874 (88.28)
Non-NGS–based (single gene) EGFRg status, n (%)<.0001

Positive2822 (8.99)928 (6.91)1894 (10.53)

Not positive12,312 (39.20)3427 (25.53)8885 (49.41)

Not tested16,273 (51.81)9070 (67.56)7203 (40.06)
Non-NGS–based (single gene) KRASh status, n (%)<.0001

Positive1141 (3.63)298 (2.22)843 (4.69)

Not positive2958 (9.42)1082 (8.06)1876 (10.43)

Not tested27,308 (86.95)12,045 (89.72)15,263 (84.88)
Non-NGS–based (single gene) ROS1i status, n (%)<.0001

Positive128 (0.41)58 (0.43)70 (0.39)

Not positive9383 (29.88)5011 (37.33)4372 (24.31)

Not tested21,896 (69.72)8356 (62.24)13,540 (75.30)
Non-NGS–based (single gene) METj status, n (%)<.0001

Positive7 (0.02)3 (0.02)4 (0.02)

Not positive1965 (6.26)1517 (11.30)448 (2.49)

Not tested29,435 (93.72)11,905 (88.68)17,530 (97.49)
Non-NGS–based (single gene) RETk status, n (%)<.0001

Positive34 (0.11)27 (0.20)7 (0.04)

Not positive2381 (7.58)1679 (12.51)702 (3.90)

Not tested28,992 (92.31)11,719 (87.29)17,273 (96.06)
Non-NGS–based (single gene) NTRKl status, n (%)<.0001

Positive2 (0.01)1 (0.01)1 (0.01)

Not positive747 (2.38)617 (4.60)130 (0.72)

Not tested30,658 (97.62)12,807 (95.40)17,851 (99.27)
Non-NGS–based (single gene) testingm, n (%)<.0001

Any positive result observed4795 (15.27)1576 (11.74)3219 (17.90)

Never tested11,968 (38.11)5661 (42.17)6307 (35.07)

Tested, but no positive results observed14,644 (46.63)6188 (46.09)8456 (47.02)
PD-L1n status, n (%)<.0001

Positive1826 (5.81)1289 (9.60)537 (2.99)

Not positive9988 (31.80)6354 (47.33)3634 (20.21)

Not tested19,593 (62.38)5782 (43.07)13,811 (76.80)
Single-gene tests receivedm, mean (SD)2.1 (2.0)2.3 (2.0)2.0 (1.9)<.0001

aNGS: next-generation sequencing.

bPatients in the overall study cohort with evidence of NGS-based biomarker testing in the database.

cPatients in the overall study cohort with no evidence of NGS-based biomarker testing.

dTwo-sided t test for continuous variables; chi-square or Fisher exact test (where expected cell size <5) for categorical variables.

eALK: anaplastic lymphoma kinase.

fBRAF: V-Raf Murine Sarcoma Viral Oncogene Homolog B.

gEGFR: epidermal growth factor receptor.

hKRAS: Kirsten rat sarcoma virus.

iROS1: c-ros oncogene 1.

jMET: mesenchymal epithelial transition.

kRET: rearranged during transfection.

lNTRK: neurotrophic tyrosine receptor kinase.

mResults are based on biomarkers ALK, BRAF, EGFR, KRAS, ROS1, MET, RET, and NTRK.

nPD-L1: programmed death ligand 1.

Table 4. Geographic and time characteristics of the overall, ever, and never NGSa-tested study cohorts prior to imputation.
CharacteristicOverall (N=31,407), n (%)Ever NGS-testedb (n=13,425), n (%)Never NGS-testedc (n=17,982), n (%)Ever NGS-tested vs never NGS-tested, P valued
MACe region<.0001

JE Noridian2097 (6.68)814 (6.06)1283 (7.13)

JF Noridian2476 (7.88)1111 (8.28)1365 (7.59)

J6 NGS856 (2.73)335 (2.50)521 (2.90)

J5 WPS603 (1.92)235 (1.75)368 (2.05)

J8 WPS2025 (6.45)1051 (7.83)974 (5.42)

JK NGS2459 (7.83)1102 (8.21)1357 (7.55)

JL Novitas2817 (8.97)1283 (9.56)1534 (8.53)

JM Palmetto2218 (7.06)858 (6.39)1360 (7.56)

J15 CGS924 (2.94)397 (2.96)527 (2.93)

JJ Cahaba4194 (13.35)2049 (15.26)2145 (11.93)

JH Novitas6093 (19.40)2176 (16.21)3917 (21.78)

Unknown or missing4645 (14.79)2014 (15)2631 (14.63)
MolDXf Program<.0001

Yes14,294 (45.51)6399 (47.66)7895 (43.91)

No12,468 (39.70)5012 (37.33)7456 (41.46)

Unknown or missing4645 (14.79)2014 (15)2631 (14.63)
NCCNg guideline period<.0001

Prerecommendations13,397 (42.66)2663 (19.84)10,734 (59.69)

Broad-based testing recommended13,552 (43.15)7339 (54.67)6213 (34.55)

NGS-based testing recommended4458 (14.19)3423 (25.50)1035 (5.76)
Timing of diagnosis by drug approval period<.0001

Period 11223 (3.89)96 (0.72)1127 (6.27)

Period 212,850 (40.91)2823 (21.03)10,027 (55.76)

Period 34396 (14)1868 (13.91)2528 (14.06)

Period 44877 (15.53)2724 (20.29)2153 (11.97)

Period 54613 (14.69)3224 (24.01)1389 (7.72)

Period 62858 (9.10)2216 (16.51)642 (3.57)

Period 7590 (1.88)474 (3.53)116 (0.65)

aNGS: next-generation sequencing.

bPatients in the overall study cohort with evidence of NGS-based biomarker testing in the database.

cPatients in the overall study cohort with no evidence of NGS-based biomarker testing.

dTwo-sided t test for continuous variables; chi-square or Fisher exact test (where expected cell size <5) for categorical variables.

eMAC: Medicare Administration Contractor.

fMolDX: Molecular Diagnostics Services.

gNCCN: National Comprehensive Cancer Network.

Table 5. Clinical care characteristics of the overall, ever, and never NGSa-tested study cohorts prior to imputation.
CharacteristicOverall (N=31,407)Ever NGS-testedb (n=13,425)Never NGS-testedc (n=17,982)Ever NGS-tested vs never NGS-tested, P valued
Practice setting, n (%)<.0001

Academic3626 (11.55)1783 (13.28)1843 (10.25)

Community27,781 (88.45)11,642 (86.72)16,139 (89.75)
Insurance type, n (%)<.0001

Private+public4301 (13.69)1940 (14.45)2361 (13.13)

Private only7083 (22.55)3601 (26.82)3482 (19.36)

Public only4037 (12.85)1560 (11.62)2477 (13.77)

Multiple types8997 (28.65)4066 (30.29)4931 (27.42)

Unknown or missing6989 (22.25)2258 (16.82)4731 (26.31)
Stage at initial diagnosis, n (%)<.0001

0-I2736 (8.71)1208 (9)1528 (8.50)

II1453 (4.63)671 (5)782 (4.35)

III5621 (17.90)2227 (16.59)3394 (18.87)

IV20,929 (66.64)9096 (67.75)11,833 (65.80)

Unknown or missing668 (2.13)223 (1.66)445 (2.47)
Year of index diagnosis, n (%)<.0001

20111896 (6.04)158 (1.18)1738 (9.67)

20122402 (7.65)229 (1.71)2173 (12.08)

20132699 (8.59)476 (3.55)2223 (12.36)

20143054 (9.72)664 (4.95)2390 (13.29)

20153346 (10.65)1136 (8.46)2210 (12.29)

20163397 (10.82)1372 (10.22)2025 (11.26)

20173472 (11.05)1708 (12.72)1764 (9.81)

20183401 (10.83)1966 (14.64)1435 (7.98)

20193282 (10.45)2293 (17.08)989 (5.50)

20202777 (8.84)2066 (15.39)711 (3.95)

20211681 (5.35)1357 (10.11)324 (1.80)
Practice volumee, mean (SD)154.1 (143.6)169.2 (156.0)142.8 (132.5)<.0001
BMI, n (%)<.0001

Underweight1373 (4.37)597 (4.45)776 (4.32)

Normal weight10,593 (33.73)4638 (34.55)5955 (33.12)

Overweight8897 (28.33)4019 (29.94)4878 (27.13)

Obese6492 (20.67)2920 (21.75)3572 (19.86)

Unknown or missing4052 (12.90)1251 (9.32)2801 (15.58)
Body weight (kg), mean (SD)75.0 (18.6)75.3 (18.8)74.8 (18.4).04
Duration of follow-up (days), mean (SD)704.8 (638.1)735.1 (636.5)682.2 (638.3)<.0001

aNGS: next-generation sequencing.

bPatients in the overall study cohort with evidence of NGS-based biomarker testing in the database.

cPatients in the overall study cohort with no evidence of NGS-based biomarker testing.

dTwo-sided t test for continuous variables; chi-square or Fisher exact test (where expected cell size <5) for categorical variables.

eNumber of patients with non-small cell lung cancer receiving care at the same practice per year.

Table 6. Laboratory values of the overall, ever, and never NGSa-tested study cohorts prior to imputation.
CharacteristicOverall
(N=31,407), n (%)
Ever NGS-testedb (n=13,425), n (%)Never NGS-testedc (n=17,982), n (%)Ever NGS-tested vs never NGS-tested, P valued
ALPe<.0001

High3550 (11.30)1583 (11.79)1967 (10.94)

Low146 (0.46)60 (0.45)86 (0.48)

Normal15,295 (48.70)6805 (50.69)8490 (47.21)

Not tested12,416 (39.53)4977 (37.07)7439 (41.37)
ALTf<.0001

High1480 (4.71)676 (5.04)804 (4.47)

Low850 (2.71)384 (2.86)466 (2.59)

Normal16,606 (52.87)7389 (55.04)9217 (51.26)

Not tested12,471 (39.71)4976 (37.07)7495 (41.68)
ASTg<.0001

High1364 (4.34)579 (4.31)785 (4.37)

Low1018 (3.24)447 (3.33)571 (3.18)

Normal16,706 (53.19)7479 (55.71)9227 (51.31)

Not tested12,319 (39.22)4920 (36.65)7399 (41.15)
Bilirubin<.0001

High461 (1.47)212 (1.58)249 (1.38)

Low1200 (3.82)545 (4.06)655 (3.64)

Normal16,014 (50.99)7138 (53.17)8876 (49.36)

Not tested13,732 (43.72)5530 (41.19)8202 (45.61)
Creatinine<.0001

High2272 (7.23)950 (7.08)1322 (7.35)

Low2143 (6.82)965 (7.19)1178 (6.55)

Normal15,512 (49.39)6917 (51.52)8595 (47.80)

Not tested11,480 (36.55)4593 (34.21)6887 (38.30)
Lymphocyte count<.0001

High435 (1.39)162 (1.21)273 (1.52)

Low7325 (23.32)3270 (24.36)4055 (22.55)

Normal12,238 (38.97)5504 (41)6734 (37.45)

Not tested11,409 (36.33)4489 (33.44)6920 (38.48)
Red blood cell count<.0001

High371 (1.18)135 (1.01)236 (1.31)

Low5751 (18.31)2336 (17.40)3415 (18.99)

Normal12,350 (39.32)5551 (41.35)6799 (37.81)

Not tested12,935 (41.19)5403 (40.25)7532 (41.89)
Hematocrit<.0001

High482 (1.53)187 (1.39)295 (1.64)

Low6440 (20.50)2772 (20.65)3668 (20.40)

Normal13,085 (41.66)6026 (44.89)7059 (39.26)

Not tested11,400 (36.30)4440 (33.07)6960 (38.71)
Platelet count.003

High2605 (8.29)1038 (7.73)1567 (8.71)

Low675 (2.15)271 (2.02)404 (2.25)

Normal14,807 (47.15)6436 (47.94)8371 (46.55)

Not tested13,320 (42.41)5680 (42.31)7640 (42.49)
White blood cell count.03

High5171 (16.46)2166 (16.13)3005 (16.71)

Low461 (1.47)195 (1.45)266 (1.48)

Normal13,237 (42.15)5790 (43.13)7447 (41.41)

Not tested12,538 (39.92)5274 (39.28)7264 (40.40)
Hemoglobin, whole blood<.0001

High406 (1.29)141 (1.05)265 (1.47)

Low6973 (22.20)2969 (22.12)4004 (22.27)

Normal13,193 (42.01)5997 (44.67)7196 (40.02)

Not tested10,835 (34.50)4318 (32.16)6517 (36.24)

aNGS: next-generation sequencing.

bPatients in the overall study cohort with evidence of NGS-based biomarker testing in the database.

cPatients in the overall study cohort with no evidence of NGS-based biomarker testing.

dTwo-sided t test for continuous variables; chi-square or Fisher exact test (where expected cell size <5) for categorical variables.

eALP: alkaline phosphatase.

fALT: alanine transaminase.

gAST: aspartate aminotransferase.

Table 7. Demographic characteristics of early and late NGSa-tested study cohorts prior to imputation.
CharacteristicEarly NGS-testedb (n=11,289)Late NGS-testedc (n=2136)Early NGS-tested vs late NGS-tested, P valued
Age at initial diagnosis (years), mean (SD)67.5 (10.1)65.5 (10.0)<.0001
Sex, n (%).02

Female6073 (53.80)1208 (56.55)

Male5216 (46.20)928 (43.45)
Race, n (%)<.0001

Asian408 (3.61)144 (6.74)

Black or African American897 (7.95)192 (8.99)

White7655 (67.81)1454 (68.07)

Other1215 (10.76)177 (8.29)

Unknown or missing1114 (9.87)169 (7.91)
Smoking status, n (%)<.0001

History of smoking9025 (79.95)1564 (73.22)

No history of smoking2256 (19.98)570 (26.69)

Unknown or missing8 (0.07)2 (0.09)
ECOGe performance status, n (%)<.0001

03606 (31.94)804 (37.64)

14440 (39.33)835 (39.09)

21220 (10.81)173 (8.10)

3282 (2.50)24 (1.12)

415 (0.13)2 (0.09)

Unknown or missing1726 (15.29)298 (13.95)

aNGS: next-generation sequencing.

bPatients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy.

cPatients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy.

dTwo-sided t test for continuous variables; chi-square or Fisher exact test (where expected cell size <5) for categorical variables.

eECOG: Eastern Cooperative Oncology Group.

Table 8. Biomarker status of early and late NGSa-tested study cohorts prior to imputation.
CharacteristicEarly NGS-testedb (n=11,289)Late NGS-testedc (n=2136)Early NGS-tested vs late NGS-tested, P valued
Non-NGS–based (single gene) ALKe status, n (%)<.0001

Positive193 (1.71)60 (2.81)

Not positive5208 (46.13)1070 (50.09)

Not tested5888 (52.16)1006 (47.10)
Non-NGS–based (single gene) BRAFf status, n (%).04

Positive25 (0.22)7 (0.33)

Not positive1420 (12.58)309 (14.47)

Not tested9844 (87.20)1820 (85.21)
Non-NGS–based (single gene) EGFRg status, n (%)<.0001

Positive435 (3.85)493 (23.08)

Not positive2589 (22.93)838 (39.23)

Not tested8265 (73.21)805 (37.69)
Non-NGS–based (single gene) KRASh status, n (%)<.0001

Positive221 (1.96)77 (3.60)

Not positive825 (7.31)257 (12.03)

Not tested10,243 (90.73)1802 (84.36)
Non-NGS–based (single gene) ROS1i status, n (%)<.0001

Positive44 (0.39)14 (0.66)

Not positive4376 (38.76)635 (29.73)

Not tested6869 (60.85)1487 (69.62)
Non-NGS–based (single gene) METj status, n (%)<.0001

Positive2 (0.02)1 (0.05)

Not positive1449 (12.84)68 (3.18)

Not tested9838 (87.15)2067 (96.77)
Non-NGS–based (single gene) RETk status, n (%)<.0001

Positive27 (0.24)0 (0)

Not positive1558 (13.80)121 (5.66)

Not tested9704 (85.96)2015 (94.34)
Non-NGS–based (single gene) NTRKl status, n (%)<.0001

Positive1 (0.01)0 (0)

Not positive596 (5.28)21 (0.98)

Not tested10,692 (94.71)2115 (99.02)
Non-NGS–based (single gene) testingm, n (%)<.0001

Any positive result observed931 (8.25)645 (30.20)

Never tested4959 (43.93)702 (32.87)

Tested, but no positive results observed5399 (47.83)789 (36.94)
PD-L1n status, n (%)<.0001

Positive1228 (10.88)61 (2.86)

Not positive5785 (51.24)569 (26.64)

Not tested4276 (37.88)1506 (70.51)
Number of single-gene tests receivedm, mean (SD)2.3 (2.1)2.2 (2.0).002

aNGS: next-generation sequencing.

bPatients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy.

cPatients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy.

dTwo-sided t test for continuous variables; chi-square or Fisher exact test (where expected cell size <5) for categorical variables.

eALK: anaplastic lymphoma kinase.

fBRAF: V-Raf murine sarcoma viral oncogene homolog B.

gEGFR: epidermal growth factor receptor.

hKRAS: Kirsten rat sarcoma virus.

iROS1: c-ros oncogene 1.

jMET: mesenchymal epithelial transition.

kRET: rearranged during transfection.

lNTRK: neurotrophic tyrosine receptor kinase.

mResults are based on biomarkers ALK, BRAF, EGFR, KRAS, ROS1, MET, RET, and NTRK.

nPD-L1: programmed death ligand 1.

Table 9. Geographic and time characteristics of early and late NGSa-tested study cohorts prior to imputation.
CharacteristicEarly NGS-testedb (n=11,289), n (%)Late NGS-testedc (n=2136), n (%)Early NGS-tested vs late NGS-tested, P valued
MACe region<.0001

JE Noridian639 (5.66)175 (8.19)

JF Noridian956 (8.47)155 (7.26)

J6 NGS283 (2.51)52 (2.43)

J5 WPS205 (1.82)30 (1.40)

J8 WPS921 (8.16)130 (6.09)

JK NGS924 (8.18)178 (8.33)

JL Novitas1094 (9.69)189 (8.85)

JM Palmetto707 (6.26)151 (7.07)

J15 CGS339 (3)58 (2.72)

JJ Cahaba1734 (15.36)315 (14.75)

JH Novitas1786 (15.82)390 (18.26)

Unknown or missing1701 (15.07)313 (14.65)
MolDXf Program.38

Yes5402 (47.85)997 (46.68)

No4186 (37.08)826 (38.67)

Unknown or missing1701 (15.07)313 (14.65)
NCCNg guideline period<.0001

Pre recommendations1746 (15.47)917 (42.93)

Broad-based testing recommended6286 (55.68)1053 (49.30)

NGS-based testing recommended3257 (28.85)166 (7.77)
Timing of diagnosis by drug approval period<.0001

Period 143 (0.38)53 (2.48)

Period 21902 (16.85)921 (43.12)

Period 31458 (12.92)410 (19.19)

Period 42347 (20.79)377 (17.65)

Period 52955 (26.18)269 (12.59)

Period 62122 (18.80)94 (4.40)

Period 7462 (4.09)12 (0.56)

aNGS: next-generation sequencing.

bPatients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy.

cPatients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy.

dTwo-sided t test for continuous variables; chi-square or Fisher exact test (where expected cell size <5) for categorical variables.

eMAC: Medicare Administration Contractor.

fMolDX: Molecular Diagnostics Services.

gNCCN: National Comprehensive Cancer Network.

Table 10. Clinical care characteristics of early and late NGSa-tested study cohorts prior to imputation.
CharacteristicEarly NGS-testedb (n=11,289)Late NGS-testedc (n=2136)Early NGS-tested vs late NGS-tested, P valued
Practice setting, n (%).50

Academic1509 (13.37)274 (12.83)

Community9780 (86.63)1862 (87.17)
Insurance type, n (%)<.0001

Private+public1689 (14.96)251 (11.75)

Private only3026 (26.80)575 (26.92)

Public only1317 (11.67)243 (11.38)

Multiple types3491 (30.92)575 (26.92)

Unknown or missing1766 (15.64)492 (23.03)
Stage at initial diagnosis, n (%).0004

0-I1051 (9.31)157 (7.35)

II580 (5.14)91 (4.26)

III1822 (16.14)405 (18.96)

IV7655 (67.81)1441 (67.46)

Unknown or missing181 (1.60)42 (1.97)
Year of index diagnosis<.0001

201169 (0.61)89 (4.17)

2012121 (1.07)108 (5.06)

2013295 (2.61)181 (8.47)

2014430 (3.81)234 (10.96)

2015831 (7.36)305 (14.28)

20161038 (9.19)334 (15.64)

20171425 (12.62)283 (13.25)

20181712 (15.17)254 (11.89)

20192111 (18.70)182 (8.52)

20201939 (17.18)127 (5.95)

20211318 (11.68)39 (1.83)
Practice volumee, mean (SD)169.6 (156.7)166.8 (152.4).44
BMI, n (%)<.0001

Underweight529 (4.69)68 (3.18)

Normal weight3963 (35.10)675 (31.60)

Overweight3410 (30.21)609 (28.51)

Obese2435 (21.57)485 (22.71)

Unknown or missing952 (8.43)299 (14)
Body weight (kg), mean (SD)75.2 (18.8)75.9 (18.7).11
Duration of follow-up (days), mean (SD)644.1 (547.5)1216.2 (829.1)<.0001

aNGS: next-generation sequencing.

bPatients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy.

cPatients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy.

dTwo-sided t test for continuous variables; chi-square or Fisher exact test (where expected cell size <5) for categorical variables.

eNumber of patients with non-small cell lung cancer receiving care at the same practice per year.

Table 11. Laboratory values of early and late NGSa-tested study cohorts prior to imputation.
CharacteristicEarly NGS-testedb (n=11,289)Late NGS-testedc (n=2136)Early NGS-tested vs late NGS-tested, P valued
ALPe.79

High1339 (11.86)244 (11.42)

Low48 (0.43)12 (0.56)

Normal5720 (50.67)1085 (50.80)

Not tested4182 (37.04)795 (37.22)
ALTf.01

High577 (5.11)99 (4.63)

Low345 (3.06)39 (1.83)

Normal6192 (54.85)1197 (56.04)

Not tested4175 (36.98)801 (37.50)
ASTg.07

High486 (4.31)93 (4.35)

Low396 (3.51)51 (2.39)

Normal6278 (55.61)1201 (56.23)

Not tested4129 (36.58)791 (37.03)
Bilirubin.47

High182 (1.61)30 (1.40)

Low470 (4.16)75 (3.51)

Normal5992 (53.08)1146 (53.65)

Not tested4645 (41.15)885 (41.43)
Creatinine.52

High815 (7.22)135 (6.32)

Low813 (7.20)152 (7.12)

Normal5808 (51.45)1109 (51.92)

Not tested3853 (34.13)740 (34.64)
Lymphocyte count.003

High122 (1.08)40 (1.87)

Low2792 (24.73)478 (22.38)

Normal4611 (40.85)893 (41.81)

Not tested3764 (33.34)725 (33.94)
Red blood cell count.001

High108 (0.96)27 (1.26)

Low2004 (17.75)332 (15.54)

Normal4594 (40.69)957 (44.80)

Not tested4583 (40.60)820 (38.39)
Hematocrit.02

High150 (1.33)37 (1.73)

Low2378 (21.06)394 (18.45)

Normal5031 (44.57)995 (46.58)

Not tested3730 (33.04)710 (33.24)
Platelet count.04

High855 (7.57)183 (8.57)

Low233 (2.06)38 (1.78)

Normal5372 (47.59)1064 (49.81)

Not tested4829 (42.78)851 (39.84)
White blood cell count.04

High1837 (16.27)329 (15.40)

Low162 (1.44)33 (1.54)

Normal4811 (42.62)979 (45.83)

Not tested4479 (39.68)795 (37.22)
Hemoglobin, whole blood.0004

High111 (0.98)30 (1.40)

Low2564 (22.71)405 (18.96)

Normal4987 (44.18)1010 (47.28)

Not tested3627 (32.13)691 (32.35)

aNGS: next-generation sequencing.

bPatients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy.

cPatients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy.

dTwo-sided t test for continuous variables; chi-square or Fisher exact test (where expected cell size <5) for categorical variables.

eALP: alkaline phosphatase.

fALT: alanine transaminase.

gAST: aspartate aminotransferase.

Over the 1000 bootstrap samples over the training data D1, an average of 135 and 89 features were identified by the LASSO models for the ever versus never and early versus late NGS-tested groups, respectively. These variables were then entered into an LR model using the testing set. The final model was established after removing any nonsignificant interaction terms, as explained earlier in the study methods. Details of the model fit statistics are shown in Table S3 in

Multimedia Appendix 1

Details of study methods, model performance, and variable importance plots.

DOCX File , 406 KBMultimedia Appendix 1. All main effects identified from the modeling for each group are shown in Figures 3-Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov. 2019;18(6):463-477. [FREE Full text] [CrossRef] [Medline]9.

There were lower odds of ever receiving NGS testing among patients with later age at initial diagnosis, bilirubin not tested, worse ECOG performance status, treated in geographies under the MolDX Program, a total higher number of genetic tests received, had only public insurance, and who were of Black or African American race as compared with those who were never tested. Patients who were obese, had a later year of initial NSCLC diagnosis, were from larger practices, had evidence of PD-L1 testing, no results for platelet testing, no history of smoking, had stage II disease, and were treated in a MAC region other than JH Novitas or J6 NGS had higher odds of ever receiving NGS-based testing.

For early versus late NGS testing (Figures 10-Bruno DS, Li X, Hess LM. Biomarker testing, targeted therapy and clinical trial participation by race among patients with lung cancer: a real-world Medicaid database study. JTO Clin Res Rep. 2024;5(3):100643. [CrossRef] [Medline]17), there were greater odds of receiving early NGS-based testing among patients with a later year of initial NSCLC diagnosis, who had no history of smoking, who were in later drug period approval periods, had a PD-L1 test, treated in the MAC J8 WPS, and who had no other biomarker tests or inconclusive testing.

Figure 3. Forest plot of ever versus never NGS-tested: variables determined by a logistic regression model from variables preselected by a LASSO model: clinical care and demographic variables. Ever NGS-tested: patients in the overall study cohort with evidence of NGS-based biomarker testing in the database; never NGS-tested: patients in the overall study cohort with no evidence of NGS-based biomarker testing. Index year: year of index diagnosis; LASSO: least absolute shrinkage and selection operator; NGS: next-generation sequencing; OR: odds ratio.
Figure 4. Forest plot of ever versus never NGS-tested: variables determined by a logistic regression model from variables preselected by a LASSO model: ECOG performance status and stage. Ever NGS-tested: patients in the overall study cohort with evidence of NGS-based biomarker testing in the database; never NGS-tested: patients in the overall study cohort with no evidence of NGS-based biomarker testing. ECOG: Eastern Cooperative Oncology Group; LASSO: least absolute shrinkage and selection operator; NGS: next-generation sequencing; OR: odds ratio.
Figure 5. Forest plot of ever versus never NGS-tested: variables determined by a logistic regression model from variables preselected by a LASSO model: biomarkers and MolDX region. Ever NGS-tested: patients in the overall study cohort with evidence of NGS-based biomarker testing in the database; never NGS-tested: patients in the overall study cohort with no evidence of NGS-based biomarker testing. LASSO: least absolute shrinkage and selection operator; MolDX: Molecular Diagnostics Services; NGS: next-generation sequencing; OR: odds ratio; PDL1: programmed death ligand 1.
Figure 6. Forest plot of ever versus never NGS-tested: variables determined by a logistic regression model from variables preselected by a LASSO model: insurance. Ever NGS-tested: patients in the overall study cohort with evidence of NGS-based biomarker testing in the database; never NGS-tested: patients in the overall study cohort with no evidence of NGS-based biomarker testing. LASSO: least absolute shrinkage and selection operator; NGS: next-generation sequencing; OR: odds ratio.
Figure 7. Forest plot of ever versus never NGS-tested: variables determined by a logistic regression model from variables preselected by a LASSO model: NCCN guidelines. Ever NGS-tested: patients in the overall study cohort with evidence of NGS-based biomarker testing in the database; never NGS-tested: patients in the overall study cohort with no evidence of NGS-based biomarker testing. LASSO: least absolute shrinkage and selection operator; NCCN: National Comprehensive Cancer Network; NGS: next-generation sequencing; OR: odds ratio.
Figure 8. Forest plot of ever versus never NGS-tested: variables determined by a logistic regression model from variables preselected by a LASSO model: laboratory values. Ever NGS-tested: patients in the overall study cohort with evidence of NGS-based biomarker testing in the database; never NGS-tested: patients in the overall study cohort with no evidence of NGS-based biomarker testing. ALP: alkaline phosphatase; ALT: alanine transaminase; AST: aspartate aminotransferase; LASSO: least absolute shrinkage and selection operator; NGS: next-generation sequencing; OR: odds ratio.
Figure 9. Forest plot of ever versus never NGS-tested: variables determined by a logistic regression model from variables preselected by a LASSO model: geographic region. Geographic regions reflect Medicare Administration Contractors. Ever NGS-tested: patients in the overall study cohort with evidence of NGS-based biomarker testing in the database; never NGS-tested: patients in the overall study cohort with no evidence of NGS-based biomarker testing. LASSO: least absolute shrinkage and selection operator; NGS: next-generation sequencing; OR: odds ratio.
Figure 10. Forest plots early versus late NGS-tested: variables determined by logistic regression from variables preselected by a LASSO model: clinical care and demographic variables. Early NGS-tested: patients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy; late NGS-tested: patients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy. Index year: year of index diagnosis; LASSO: least absolute shrinkage and selection operator; NGS: next-generation sequencing; OR: odds ratio.
Figure 11. Forest plots early versus late NGS-tested: variables determined by logistic regression from variables preselected by a LASSO model: ECOG performance status and stage. Early NGS-tested: patients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy; late NGS-tested: patients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy. ECOG: Eastern Cooperative Oncology Group; LASSO: least absolute shrinkage and selection operator; NGS: next-generation sequencing; OR: odds ratio.
Figure 12. Forest plots early versus late NGS-tested: variables determined by logistic regression from variables preselected by a LASSO model: biomarkers. Early NGS-tested: patients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy; late NGS-tested: patients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy. LASSO: least absolute shrinkage and selection operator; NGS: next-generation sequencing; OR: odds ratio; PDL1: programmed death ligand 1.
Figure 13. Forest plots early versus late NGS-tested: variables determined by logistic regression from variables preselected by a LASSO model: insurance. Early NGS-tested: patients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy; late NGS-tested: patients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy. LASSO: least absolute shrinkage and selection operator; NGS: next-generation sequencing; OR: odds ratio.
Figure 14. Forest plots early versus late NGS-tested: variables determined by logistic regression from variables preselected by a LASSO model: NCCN guidelines. Early NGS-tested: patients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy; late NGS-tested: patients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy. LASSO: least absolute shrinkage and selection operator; NCCN: National Comprehensive Cancer Network; NGS: next-generation sequencing; OR: odds ratio.
Figure 15. Forest plots early versus late NGS-tested: variables determined by logistic regression from variables preselected by a LASSO model: laboratory values. Early NGS-tested: patients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy; late NGS-tested: patients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy. ALT: alanine transaminase; LASSO: least absolute shrinkage and selection operator; NGS: next-generation sequencing; OR: odds ratio.
Figure 16. Forest plots early versus late NGS-tested: variables determined by logistic regression from variables preselected by a LASSO model: geographic region. Geographic regions reflect Medicare Administration Contractors. Early NGS-tested: patients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy; late NGS-tested: patients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy. LASSO: least absolute shrinkage and selection operator; NGS: next-generation sequencing; OR: odds ratio.
Figure 17. Forest plots early versus late NGS-tested: variables determined by logistic regression from variables preselected by a LASSO model: time period variables. Early NGS-tested: patients in the ever NGS-tested group whose first or only NGS-based test occurred prior to the start of first-line therapy through day 7 of first-line therapy; late NGS-tested: patients in the ever NGS-tested group whose first NGS-based test occurred 8 days or later after the start of first-line therapy. ALK: anaplastic lymphoma kinase; BRAF: V-Raf murine sarcoma viral oncogene homolog B; EGFR: epidermal growth factor receptor; KRAS: Kirsten rat sarcoma virus; LASSO: least absolute shrinkage and selection operator; MET: mesenchymal epithelial transition; NGS: next-generation sequencing; NTRK: neurotrophic tyrosine receptor kinase; OR: odds ratio; RET: rearranged during transfection; ROS1: c-ros oncogene 1; period 1: January 1-August 25, 2011 (EGFR drugs only); period 2: August 26, 2011-March 10, 2016 (EGFR+ALK); period 3: March 11, 2016-June 21, 2017 (EGFR+ALK+ROS1); period 4: June 22, 2017-November 25, 2018 (EGFR+ALK+ROS1+BRAF); period 5: November 26, 2018-May 5, 2020 (EGFR+ALK+ROS1+BRAF+NTRK); period 6: May 6, 2020-May 26, 2021 (EGFR+ALK+ROS1+BRAF+NTRK+MET+RET); period 7: May 27, 2021 and later (EGFR+ALK+ROS1+BRAF+MET+NTRK+RET+KRAS).

Overview

This study applied machine learning methods and traditional statistical tools that identified several factors that were significantly associated with not only receiving NGS-based testing but also receiving the testing early when there is a potential for early intervention with targeted therapies. Factors associated with both ever having NGS testing as well as early NGS testing included later year of NSCLC diagnosis, no history of smoking, and evidence of PD-L1 testing. These factors were consistent with the hypothesized direction of candidate variables, as NGS-based testing has been increasing over time, and it was not unexpected that the rate of testing has increased in recent years [Hess LM, Krein PM, Haldane D, Han Y, Sireci AN. Biomarker testing for patients with advanced/metastatic nonsquamous NSCLC in the United States of America, 2015 to 2021. JTO Clin Res Rep. 2022;3(6):100336. [FREE Full text] [CrossRef] [Medline]4,Swaminathan A, Stergiopoulos S, Snider J, Fisher V, Castellanos E, Snow T. Changes over time in real-world next-generation sequencing (NGS) test use in patients (pts) with advanced non-small cell lung cancer (aNSCLC). Cancer Res. 2021;81(13_Suppl):864. [CrossRef]25]. In addition, consistent with the hypothesized direction of these relationships, patients without a smoking history were more likely to undergo NGS-based testing. The lack of environmental causal factors would lead one to seek other explanations for the onset of lung cancer, including certain genomic abnormalities, which are frequently observed among nonsmokers with lung cancer [Mack PC, Klein MI, Ayers KL, Zhou X, Guin S, Fink M, et al. Targeted next-generation sequencing reveals exceptionally high rates of molecular driver mutations in never-smokers with lung adenocarcinoma. Oncologist. 2022;27(6):476-486. [CrossRef] [Medline]26]. PD-L1 testing is generally conducted alongside the NGS test and was only available in later years, so the observation of these relationships was also not unexpected.

Principal Findings

Factors associated with a greater chance of never receiving NGS testing included older age, lower ECOG performance status, Black race, higher number of single-gene tests, public insurance, and treatment in a geography associated with MolDX Program adoption. Patient age and public insurance are factors that are closely related. Patients aged 65 years and older generally have Medicare coverage, whereas younger patients will have private insurance. The median age of lung cancer diagnosis is 71 years [Cancer stat facts: lung and bronchus cancer. National Cancer Institute. SEER. URL: https://seer.cancer.gov/statfacts/html/lungb.html [accessed 2024-03-20] 27], and it is highly likely that a younger patient presenting with NSCLC could raise questions about the genomic aspects of the disease that should be investigated as a result be associated with a higher likelihood of receiving early NGS-based testing as noted in the published literature [Wu X, Zhao J, Yang L, Nie X, Wang Z, Zhang P, et al. Next-generation sequencing reveals age-dependent genetic underpinnings in lung adenocarcinoma. J Cancer. 2022;13(5):1565-1572. [CrossRef] [Medline]28]. Importantly, patient race, similar to prior research [Bruno DS, Hess LM, Li X, Su EW, Patel M. Disparities in biomarker testing and clinical trial enrollment among patients with lung, breast, or colorectal cancers in the United States. JCO Precis Oncol. 2022;6:e2100427. [CrossRef]7], remains a significant factor that continues to demonstrate the lack of equity in receipt of NGS-based testing. Of all factors evaluated in this study, racial inequity cannot be explained by any reasonable clinical factors and requires immediate attention by the health care community.

Several factors that did not have a clear association with NGS-based testing were those that also did not have a hypothesized direction associated with a potential relationship. While blood test results may have captured some aspect of well-being, there was no consistent relationship identified. Similarly, while patients with better performance status were more likely to receive NGS-based testing, this relationship was not strong, and the factor was not among those with the highest importance scores observed in this study. Therefore, this study suggests that these factors are likely not largely factored into a decision to receive NGS-based testing and could be why little data were observed in the published literature related to these factors.

The roles of the MolDX program and the MAC region are unclear. The emergence of MAC region J8 WPS as a predictive factor for greater odds of receiving early NGS testing and both JH Novitas and J6 NGS at lower odds of receiving any NGS testing could be an artifact of a large dataset with multiple subgroups or could reflect underlying factors related to this region that could not be explored, given the available data in the electronic data used for this study. Additionally, the timing of MolDX program adoption was not taken into account, so the patients in these regions could have had the decision made at a time that was unrelated to this variable (“yes” or “no”). Other geographic factors such as distance to a clinic, access to testing resources, and site of care could certainly have played a role as well; therefore, the relationship with MolDX should not be overinterpreted. Additionally, not all patients in these regions had Medicare coverage, so there is a great deal of uncertainty in these variables. A study with more comprehensive variables related to patient care in these regions would be needed to come to any clear conclusion about these relationships.

Limitations

First, this study is based on real-world data. The Flatiron Health deidentified data, as with most other electronic health record–based datasets, do not contain all potentially relevant variables to investigate all aspects of the complex question of NGS-based testing. Factors such as tissue availability, tissue quality, a patient crisis requiring immediate care, and other health care system–related factors were not recorded and may be additional factors that could impact access and receipt of NGS-based testing. The availability of these data, however, would not invalidate the factors that were observed in this study. Second, there were some patients who could have received NGS-based testing at an early stage diagnosis who were not included in this study due to our eligibility criteria, requiring testing within the time frame of advanced or metastatic diagnosis. Therefore, this study may not be generalizable to those diagnosed and tested at earlier stages of the disease. Third, as with all real-world data sources, missingness is a potential issue. However, the rates of NGS-based testing in this study are very similar to other estimates from different data sources, which provides confidence in the outcome variable assessed within the database used for this study [Robert NJ, Espirito JL, Chen L, Nwokeji E, Karhade M, Evangelist M, et al. Biomarker testing and tissue journey among patients with metastatic non-small cell lung cancer receiving first-line therapy in The US Oncology Network. Lung Cancer. 2022;166:197-204. [FREE Full text] [CrossRef] [Medline]10]. Finally, when evaluating predictive models, a cutoff of 0.5 was applied to the predicted probability of events. While this may result in a suboptimal trade-off of specificity versus sensitivity for certain models (eg, for modeling “early vs late” NGS testing, it resulted in low specificities of ~20% and very high sensitivity of ~98%; Table S2 in

Multimedia Appendix 1

Details of study methods, model performance, and variable importance plots.

DOCX File , 406 KBMultimedia Appendix 1), the objective of this study was to identify predictors of NGS testing rather than optimizing predictive rules. The probability cutoffs could be further calibrated to strike a desired balance between false positives and negatives.

Conclusions

Despite the limitations of these data, this study reinforces the need to assure equity in access to NGS-based testing that has been observed in prior research. Black race is consistently associated with lower biomarker testing rates [Bruno DS, Hess LM, Li X, Su EW, Patel M. Disparities in biomarker testing and clinical trial enrollment among patients with lung, breast, or colorectal cancers in the United States. JCO Precis Oncol. 2022;6:e2100427. [CrossRef]7]. Other factors may be more associated with disease trajectory (eg, age, lower ECOG performance status, and single-gene tests), emphasizing the flexibility needed in testing for those patients who may not be well enough for systemic therapy or who have an actionable biomarker previously identified. While efforts must be made to ensure all patients diagnosed with NSCLC have equal access to NGS-based testing early in the trajectory of the disease, there may be consideration for the specific patient needs in these cases.

Acknowledgments

This was an unfunded research project with employee time and materials provided by Eli Lilly and Company.

Data Availability

The data that support the findings of this study have been originated by Flatiron Health, Inc. Requests for data sharing by license or by permission for the specific purpose of replicating results in this manuscript can be submitted to dataaccess@flatiron.com.

Authors' Contributions

AJMB conceptualized the study, planned the methodology, investigated the study, and revised and edited the final manuscript. LMH conceptualized the study, planned the methodology, investigated, wrote the original manuscript, and revised and edited the final manuscript. PMK conceptualized the study, investigated the study, and revised and edited the final manuscript. IL planned the methodology, conducted the final analysis, validated and investigated the study, and revised and edited the final manuscript. ZK planned the methodology, investigated the study, conducted final analysis and validation, produced visualization, and revised and edited the final manuscript. DH investigated the study, conducted final analysis, validation, and data curation, created visualization, and revised and edited the final manuscript.

Conflicts of Interest

AJMB, IL, ZK, and LMH are employees of Eli Lilly and Company, and PMK was an employee of Loxo@Lilly, subsequently Eli Lilly and Company, when this work was conducted. DH is an employee of Syneos Health, an organization that receives funding from Eli Lilly and Company for analytic support services.

Multimedia Appendix 1

Details of study methods, model performance, and variable importance plots.

DOCX File , 406 KB

Multimedia Appendix 2

Computing partial effects from covariates from logistic regression models.

DOCX File , 16 KB

  1. Mitchell CL, Zhang AL, Bruno DS, Almeida FA. NSCLC in the era of targeted and immunotherapy: what every pulmonologist must know. Diagnostics (Basel). 2023;13(6):1117. [FREE Full text] [CrossRef] [Medline]
  2. VanderLaan PA, Rangachari D, Majid A, Parikh MS, Gangadharan SP, Kent MS, et al. Tumor biomarker testing in non-small-cell lung cancer: a decade of change. Lung Cancer. 2018;116:90-95. [FREE Full text] [CrossRef] [Medline]
  3. NCCN Clinical Practice Guidelines in Oncology: non-small cell lung cancer version 3.2023. National Comprehensive Cancer Network. 2023. URL: https://www.nccn.org/guidelines/category_1 [accessed 2023-07-02]
  4. Hess LM, Krein PM, Haldane D, Han Y, Sireci AN. Biomarker testing for patients with advanced/metastatic nonsquamous NSCLC in the United States of America, 2015 to 2021. JTO Clin Res Rep. 2022;3(6):100336. [FREE Full text] [CrossRef] [Medline]
  5. Sireci AN, Krein PM, Hess LM, Khan T, Willey J, Ayars M, et al. Real-world biomarker testing patterns in patients with metastatic non-squamous non-small cell lung cancer (NSCLC) in a US community-based oncology practice setting. Clin Lung Cancer. 2023;24(5):429-436. [CrossRef] [Medline]
  6. Evangelist MC, Butrynski JE, Paschold JC, Ward PJ, Spira AI, Ali K, et al. Contemporary biomarker testing rates in both early and advanced NSCLC: results from the MYLUNG pragmatic study. J Clin Oncol. 2023;41(16 Suppl):9109-9109. [CrossRef]
  7. Bruno DS, Hess LM, Li X, Su EW, Patel M. Disparities in biomarker testing and clinical trial enrollment among patients with lung, breast, or colorectal cancers in the United States. JCO Precis Oncol. 2022;6:e2100427. [CrossRef]
  8. Steyerberg EW. Clinical Prediction Models. Washington DC. Springer; 2019.
  9. Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov. 2019;18(6):463-477. [FREE Full text] [CrossRef] [Medline]
  10. Robert NJ, Espirito JL, Chen L, Nwokeji E, Karhade M, Evangelist M, et al. Biomarker testing and tissue journey among patients with metastatic non-small cell lung cancer receiving first-line therapy in The US Oncology Network. Lung Cancer. 2022;166:197-204. [FREE Full text] [CrossRef] [Medline]
  11. Boehmer L, Basu Roy UK, Schrag J, Martin NA, Salinas GD, Coleman B, et al. Identifying barriers to equitable biomarker testing in underserved patients with NSCLC: A mixed-methods study to inform quality improvement opportunities. J Clin Oncol. 2021;39(28 Suppl 123). [CrossRef]
  12. Mileham KF, Schenkel C, Bruinooge SS, Freeman-Daily J, Basu Roy U, Moore A, et al. Defining comprehensive biomarker-related testing and treatment practices for advanced non-small-cell lung cancer: results of a survey of U.S. oncologists. Cancer Med. 2022;11(2):530-538. [FREE Full text] [CrossRef] [Medline]
  13. Burns L, Jani C, Radwan A, Omari OA, Patel M, Oxnard GR, et al. Implementation challenges and disparities in molecular testing for patients with stage IV NSCLC: perspectives from an urban safety-net hospital. Clin Lung Cancer. 2023;24(2):e69-e77. [CrossRef] [Medline]
  14. Ma X, Long L, Moon S, Adamson BJS, Baxi SS. Comparison of population characteristics in real-world clinical oncology databases in the US: Flatiron Health, SEER, and NPCR. medRxiv. Preprint posted online January 5, 2023. [CrossRef]
  15. Birnbaum B, Nussbaum N, Seidl-Rathkopf K, Agrawal M, Estevez M, Estola E, et al. Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research. ArXiv. Preprint posted online on January 13, 2020. [FREE Full text]
  16. Definition of human subjects research. National Institutes of Health. 2024. URL: https://grants.nih.gov/policy/humansubjects/research.htm [accessed 2024-03-07]
  17. Bruno DS, Li X, Hess LM. Biomarker testing, targeted therapy and clinical trial participation by race among patients with lung cancer: a real-world Medicaid database study. JTO Clin Res Rep. 2024;5(3):100643. [CrossRef] [Medline]
  18. NCCN Clinical Practice Guidelines in Oncology. National Comprehensive Cancer Network. 2022. URL: https://www.nccn.org/guidelines/category_1 [accessed 2022-07-18]
  19. Oncology (cancer)/hematologic malignancies approval notification. Food and Drug Administration. 2022. URL: https:/​/www.​fda.gov/​drugs/​resources-information-approved-drugs/​oncology-cancer-hematologic-malignancies-approval-notifications [accessed 2022-07-01]
  20. Experian. MAC Jurisdiction Map. 2016. URL: https://www.experian.com/blogs/healthcare/wp-content/uploads/2016/08/MAC-A-B-Jurisdiction-Map.jpg [accessed 2016-06-30]
  21. MolDX® Program (Administered by Palmetto GBA). Palmetto-GBA. 2022. URL: https://palmettogba.com/palmetto/moldxv2.nsf [accessed 2025-05-30]
  22. CMS. Medicare Administrative Contractors (MACs). 2023. URL: https://www.cms.gov/medicare/coding-billing/medicare-administrative-contractors-macs/whats-mac [accessed 2025-04-10]
  23. Tang F, Ishwaran H. Random forest missing data algorithms. Stat Anal Data Min. 2017;10(6):363-377. [CrossRef] [Medline]
  24. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. 2016. Presented at: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13, 2016; San Francisco, CA, United States. [CrossRef]
  25. Swaminathan A, Stergiopoulos S, Snider J, Fisher V, Castellanos E, Snow T. Changes over time in real-world next-generation sequencing (NGS) test use in patients (pts) with advanced non-small cell lung cancer (aNSCLC). Cancer Res. 2021;81(13_Suppl):864. [CrossRef]
  26. Mack PC, Klein MI, Ayers KL, Zhou X, Guin S, Fink M, et al. Targeted next-generation sequencing reveals exceptionally high rates of molecular driver mutations in never-smokers with lung adenocarcinoma. Oncologist. 2022;27(6):476-486. [CrossRef] [Medline]
  27. Cancer stat facts: lung and bronchus cancer. National Cancer Institute. SEER. URL: https://seer.cancer.gov/statfacts/html/lungb.html [accessed 2024-03-20]
  28. Wu X, Zhao J, Yang L, Nie X, Wang Z, Zhang P, et al. Next-generation sequencing reveals age-dependent genetic underpinnings in lung adenocarcinoma. J Cancer. 2022;13(5):1565-1572. [CrossRef] [Medline]


ALK: anaplastic lymphoma kinase
BRAF: V-Raf murine sarcoma viral oncogene homolog B
ECOG: Eastern Cooperative Oncology Group
EGFR: epidermal growth factor receptor
ICD: International Classification of Diseases
ICD-9: International Classification of Diseases, Ninth Revision
ICD-10: International Statistical Classification of Diseases, Tenth Revision
KRAS: Kirsten rat sarcoma virus
LASSO: least absolute shrinkage and selection operator
LR: logistic regression
MAC: Medicare Administrative Contractor
MET: mesenchymal epithelial transition
MolDX: Molecular Diagnostics Services
NCCN: National Comprehensive Cancer Network
NGS: next-generation sequencing
NSCLC: non–small cell lung cancer
NTRK: neurotrophic tyrosine receptor kinase
PD-L1: programmed death ligand 1
PLR: penalized logistic regression
RET: rearranged during transfection
ROS1: c-ros oncogene 1
XGBoost: extreme gradient boosting


Edited by N Cahill; submitted 16.07.24; peer-reviewed by P Wu; comments to author 17.12.24; revised version received 26.02.25; accepted 26.03.25; published 11.06.25.

Copyright

©Alan James Michael Brnabic, Ilya Lipkovich, Zbigniew Kadziola, Dan He, Peter M Krein, Lisa M Hess. Originally published in JMIR Cancer (https://cancer.jmir.org), 11.06.2025.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Cancer, is properly cited. The complete bibliographic information, a link to the original publication on https://cancer.jmir.org/, as well as this copyright and license information must be included.