This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Cancer, is properly cited. The complete bibliographic information, a link to the original publication on http://cancer.jmir.org/, as well as this copyright and license information must be included.
The integration of data from disparate sources could help alleviate data insufficiency in real-world studies and compensate for the inadequacies of single data sources and short-duration, small sample size studies while improving the utility of data for research.
This study aims to describe and evaluate a process of integrating data from several complementary sources to conduct health outcomes research in patients with non–small cell lung cancer (NSCLC). The integrated data set is also used to describe patient demographics, clinical characteristics, treatment patterns, and mortality rates.
This retrospective cohort study integrated data from 4 sources: administrative claims from the HealthCore Integrated Research Database, clinical data from a Cancer Care Quality Program (CCQP), clinical data from abstracted medical records (MRs), and mortality data from the US Social Security Administration. Patients with lung cancer who initiated second-line (2L) therapy between November 01, 2015, and April 13, 2018, were identified in the claims and CCQP data. Eligible patients were 18 years or older and received atezolizumab, docetaxel, erlotinib, nivolumab, pembrolizumab, pemetrexed, or ramucirumab in the 2L setting. The main analysis cohort included patients with claims data and data from at least one additional data source (CCQP or MR). Patients without integrated data (claims only) were reported separately. Descriptive and univariate statistics were reported.
Data integration resulted in a main analysis cohort of 2195 patients with NSCLC; 2106 patients had CCQP and 407 patients had MR data. The claims-only cohort included 931 eligible patients. For the main analysis cohort, the mean age was 62.1 (SD 9.27) years, 48.56% (1066/2195) were female, the median length of follow-up was 6.8 months, and for 37.77% (829/2195), death was observed. For the claims-only cohort, the mean age was 66.6 (SD 12.69) years, 52.1% (485/931) were female, the median length of follow-up was 8.6 months, and for 29.3% (273/931), death was observed. The most frequent 2L treatment was immunotherapy (1094/2195, 49.84%), followed by platinum-based regimens (472/2195, 21.50%) and single-agent chemotherapy (441/2195, 20.09%); mean duration of 2L therapy was 5.6 (SD 4.9, median 4) months. We describe challenges and learnings from the data integration process, and the benefits of the integrated data set, which includes a richer set of clinical and outcome data to supplement the utilization metrics available in administrative claims.
The management of patients with NSCLC requires care from a multidisciplinary team, leading to a lack of a single aggregated data source in real-world settings. The availability of integrated clinical data from MRs, health plan claims, and other sources of clinical care may improve the ability to assess emerging treatments.
Real-world health outcomes research is often challenged by data insufficiency resulting from studies using a single data source and/or short durations [
Lung cancer is the second most common cancer in the United States, with approximately 230,000 new diagnoses in 2020 [
Treatment sequencing in the setting of NSCLC is not well characterized, largely because of the sparseness of applicable studies, which tend to be limited by inadequate data. This study was designed based on the rationale that a combination of retrospective data from multiple sources, such as MRs, administrative claims, and care quality initiatives, would provide a solid foundation for observing and characterizing real-world treatment outcomes at a lower cost than a traditional site-based prospective approach.
The central objective of this study was to create an integrated database from several complementary sources and to assess the feasibility and effectiveness of these integrated observational data for health outcomes research. Patient characteristics and outcomes were described to evaluate the enrichment attained through integration. This analysis presents a descriptive summary of the final study cohort.
RESOUNDS (Real-World Treatment Sequences and Outcomes Among Patients With Non-Small Cell Lung Cancer) was a retrospective, observational cohort study that integrated data from 4 sources: administrative claims from the HealthCore Integrated Research Database (HIRD), clinical data from a quality initiative called the Cancer Care Quality Program (CCQP), clinical data extracted from patients’ MRs obtained from treating providers, and all-cause mortality data from the Death Master File of the US Social Security Administration. Details of the RESOUNDS study design and each of these data sources have been published previously [
Patients diagnosed with lung cancer who initiated 2L therapy between November 01, 2015, and April 13, 2018, were identified in the HIRD and CCQP data. Patients were required to receive 1 of the following 2L therapies alone or in combination: atezolizumab, docetaxel, erlotinib, nivolumab, pembrolizumab, pemetrexed, or ramucirumab. This subset of the original set of therapies listed in the protocol [
Patients were first identified in the CCQP data, where information on the type of lung cancer (NSCLC or not) was available; patients with a record of a 2L therapy of interest were retained. All cancer stages were included in the analyses. Second, lung cancer diagnosis and treatment claims were used to identify patients with 2L treatment in the HIRD. Patients who also had claims for other primary cancers were retained. All patients identified in the CCQP data were also included in the HIRD sample; patients who appeared in the HIRD but not the CCQP were retained. Third, copies of MRs were obtained from selected patients’ 2L prescribers (focusing on oncologists, as identified in the HIRD) and screened for qualification (evidence of NSCLC and evidence that the index treatment was used as therapy for NSCLC). The regulatory and operational requirements for inclusion in this process were a fully insured status (vs administrative services only) and complete contact information for the 2L prescriber. Once obtained and screened, clinical information was abstracted from each record by trained health information management technicians using a standardized form. The target sample size for MR abstraction was 398 patients, based on the expected feasible accrual over the 2.5-year patient identification period.
Data from each source were accumulated in 3 consecutive waves to continuously build the database. After each MR abstraction wave was complete, the claims and CCQP data were refreshed to the most current date at that point to obtain additional follow-up outcomes. The integrated data were used to establish the main analysis cohort, consisting of patients with both claims and either CCQP or MR data (or both). Eligible patients from the HIRD who did not appear in the CCQP and for whom no MRs were obtained were included in the claims-only cohort (these patients could have any type and stage of lung cancer).
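The cohort assignment rule described above can be written compactly. The following is a minimal sketch under stated assumptions, not the study's actual code: the function name `assign_cohort`, the boolean flags, and the patient identifiers are hypothetical and serve only to illustrate the rule.

```python
def assign_cohort(has_claims: bool, has_ccqp: bool, has_mr: bool) -> str:
    """Assign one patient to a cohort using the rule stated in the text.

    Main analysis cohort: claims data plus CCQP and/or MR data.
    Claims-only cohort: claims data with neither CCQP nor MR data.
    """
    if not has_claims:
        # Assumption: patients without claims fall outside both cohorts,
        # because both cohort definitions in the text require claims data.
        return "not_in_study"
    if has_ccqp or has_mr:
        return "main_analysis"
    return "claims_only"


# Hypothetical patients (illustrative flags, not study data).
patients = {
    "P001": {"has_claims": True, "has_ccqp": True, "has_mr": False},
    "P002": {"has_claims": True, "has_ccqp": True, "has_mr": True},
    "P003": {"has_claims": True, "has_ccqp": False, "has_mr": True},
    "P004": {"has_claims": True, "has_ccqp": False, "has_mr": False},
}

cohorts = {pid: assign_cohort(**flags) for pid, flags in patients.items()}
print(cohorts)
```

Keeping the rule in a single function makes it straightforward to reapply after each data refresh, when a patient's available sources may have changed.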
Demographic and clinical characteristics, treatment patterns, and survival outcomes were recorded. Baseline was defined as the 6 months before the index date (start of 2L therapy). The Quan-Charlson Comorbidity Index (QCI) was calculated, excluding lung cancer and metastatic carcinoma [
Univariate statistics, including means, SDs, and medians for continuous variables and frequencies and percentages for categorical variables, were reported. No hypothesis testing was performed. Statistical analysis was performed using SAS version 9.3 (SAS Institute Inc).
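For readers who want to reproduce this style of summary outside SAS, the sketch below shows one way to compute the same kinds of statistics with the Python standard library. It is an illustration only: the function names and the input values are hypothetical, and the use of the sample SD (n - 1 denominator) is an assumption, since the text does not state which SD definition was applied.

```python
from collections import Counter
from statistics import mean, median, stdev


def summarize_continuous(values):
    """Mean, sample SD (n - 1 denominator), and median of a numeric variable."""
    return {
        "mean": round(mean(values), 2),
        "sd": round(stdev(values), 2),
        "median": median(values),
    }


def summarize_categorical(values):
    """Count and percentage (1 decimal) for each level of a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {level: (n, round(100 * n / total, 1)) for level, n in counts.items()}


# Hypothetical values for illustration only (not study data).
ages = [55, 62, 70, 61]
sexes = ["F", "M", "F", "M"]

age_summary = summarize_continuous(ages)
sex_summary = summarize_categorical(sexes)
print(age_summary)
print(sex_summary)
```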
Following data integration, the main analysis cohort consisted of 2195 patients. All patients had claims data, 2106 patients had CCQP data, and 407 patients had MR data (
Overall, 47.14% (997/2115) of patients fulfilled the regulatory and operational requirements for their MRs to be requested from their 2L-prescribing providers; records were obtained for 54.5% (543/997) of those patients. A large number of MRs were not obtained because outreach was stopped after the planned sample size (n=398) was achieved; others could not be obtained because the provider did not have a record of the particular patient or because the provider could not be contacted. Among the obtained records, the most frequent reason for exclusion was the absence of confirmation of NSCLC (43/543, 7.9% of the obtained records). The claims-only cohort comprised 931 patients.
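The MR outreach funnel can be checked with simple arithmetic. The sketch below uses only counts reported in the text and in the patient selection table (final sample column); the variable names and the helper `pct` are hypothetical.

```python
def pct(numerator, denominator, digits=1):
    """Percentage rounded to the requested number of decimals."""
    return round(100 * numerator / denominator, digits)


# Counts taken from the text and the patient selection table (final sample).
unique_2l_patients = 2115   # unique patients with 2L therapy
mr_outreach = 997           # patients used for MR outreach
mrs_obtained = 543          # MRs obtained
mrs_failed = 65             # MRs that failed screening
mrs_unused = 62             # MRs not used because the target had been met
no_nsclc_confirmation = 43  # obtained MRs lacking confirmation of NSCLC

outreach_rate = pct(mr_outreach, unique_2l_patients, digits=2)
obtained_rate = pct(mrs_obtained, mr_outreach)
mrs_used = mrs_obtained - mrs_failed - mrs_unused
no_nsclc_rate = pct(no_nsclc_confirmation, mrs_obtained)

print(outreach_rate, obtained_rate, mrs_used, no_nsclc_rate)
```

The computed values agree with the reported 47.14%, 54.5%, 416 final MRs used, and 7.9%.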
Patient selection.
Criteria | First wave sample (patients, n) | Second wave samplea (patients, n) | Third wave sample (patients, n) | Final sampleb (patients, n) |
Step A: CCQPc data | | | | |
Step 1: Patients with non–small cell lung cancer | 295 | 760 | 1428 | —d |
Step 2: From step A1, patients with 2Le therapyf | 174 | 469 | 863 | — |
Step B: Claims data | | | | |
Step 1: Patients with lung cancer claim before start of first-line therapy | 640 | 1058 | 2187 | — |
Step 2: From step B1, patients with 2L therapy | 368 | 600 | 1127 | — |
Step C: Combined CCQP and claims data | | | | |
Step 1: From A2 and B2, unique patients with 2L therapy | 423 | 756 | 1732 | 2115 |
Step D: MRg data | | | | |
Step 1: Patients used for MR outreach | 149 | 279 | 718 | 997 |
Step 2: Number of patient MRs obtained | 102 | 194 | 349 | 543 |
Step 3: Number of failed MRsh | 15 | 20 | 45 | 65 |
Step 4: Not used (target had been met previously) | — | — | 62 | 62 |
Step 5: Final MRs used | 87 | 174 | 242 | 416 |
Step E: Main analysis cohort (patients with claims data and CCQP and/or MR data) | 272 | 791 | 1446 | 2195i |
Step 1: Patients with CCQP data | 223 | 748 | 1399 | 2106 |
Step 2: Patients with MR data | 85 | 168 | 239 | 407 |
Step F: Claims-only cohort (patients with claims data only, no CCQP or MR data) | 377 | 243 | 659 | 931i |
aSecond wave included all patients from the first wave.
bThe final sample removed duplicates that were included in >1 wave. For those patients, information from the most recent wave was used for analysis.
cCCQP: Cancer Care Quality Program.
dNot available.
e2L: second-line therapy.
f2L medications of interest included atezolizumab, docetaxel, erlotinib, nivolumab, pembrolizumab, pemetrexed, or ramucirumab.
gMR: medical record.
hMedical records excluded due to one or more of the following: no documentation of lung cancer, no documentation of non–small cell lung cancer, and patient mismatch (missing or unmatched name, sex, or date of birth; wrong timeframe; inconsistent clinical information).
iThese are the final sample sizes for the 2 cohorts of interest.
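Footnote b states that patients appearing in more than one wave were counted once, using information from the most recent wave. A minimal sketch of that rule follows, assuming each record carries a patient identifier and a wave number; the field names and records are hypothetical and do not reflect the study's data layout.

```python
def latest_wave_records(records):
    """Keep one record per patient: the one from the most recent wave."""
    latest = {}
    for record in records:
        pid = record["patient_id"]
        if pid not in latest or record["wave"] > latest[pid]["wave"]:
            latest[pid] = record
    return latest


# Hypothetical records for illustration only (not study data).
records = [
    {"patient_id": "P001", "wave": 1, "followup_months": 3.0},
    {"patient_id": "P002", "wave": 2, "followup_months": 5.5},
    {"patient_id": "P001", "wave": 3, "followup_months": 9.5},
]

final_sample = latest_wave_records(records)
print(sorted(final_sample))                    # one entry per patient
print(final_sample["P001"]["followup_months"])  # value from the latest wave
```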
Variable sourcing by database type.
Variable | HealthCore Integrated Research Database (claims) | Cancer Care Quality Program | Medical record |
Length of follow-up | ✓a | —b | — |
Age | ✓ | — | ✓ |
Gender | ✓ | — | ✓ |
Health plan type | ✓ | — | — |
Geographic region of patient residence | ✓ | — | — |
Race/ethnicity | — | — | ✓ |
Weight, height, and BMI | — | — | ✓ |
Histology | — | ✓ | ✓ |
Staging | ✓c | ✓ | ✓ |
Treating physician specialty | ✓ | — | — |
Smoking status | — | — | ✓ |
Performance status (Eastern Cooperative Oncology Group) | — | ✓ | ✓ |
Comorbidities | ✓ (Quan-Charlson Comorbidity Index, secondary cancers) | — | — |
Mortality | ✓d | — | — |
aIndicates variable was sourced from the data set listed in the column header.
bVariable was not sourced from the data set listed in the column header.
cIndicates the presence of claims for metastatic disease.
dThis was based on the Death Master File data from the US Social Security Administration.
In the main analysis cohort, mean age was 62.1 (SD 9.27) years and 48.56% (1066/2195) were female (
Demographic characteristics at baseline (on or close to second-line therapy initiation date).
Variables | Main analysis cohort (n=2195) | Claims-only cohort (n=931) |
Age at second-line therapy initiation (years), mean (SD) | 62.1 (9.27) | 66.6 (12.69) |
Age category (years), n (%) | | |
18-39 | 22 (1.0) | 33 (3.5) |
40-64 | 1509 (68.7) | 343 (36.8) |
65-74 | 412 (18.8) | 278 (29.9) |
≥75 | 252 (11.5) | 277 (29.8) |
Female, n (%) | 1066 (48.6) | 485 (52.1) |
Health plan type, n (%) | | |
Health maintenance organization | 769 (35.0) | 225 (24.2) |
Preferred provider organization | 1126 (51.3) | 628 (67.5) |
Consumer-driven health plan | 300 (13.7) | 78 (8.4) |
Medicare Advantagea, n (%) | 505 (23.0) | 457 (49.1) |
Affordable Care Act exchange plan, n (%) | 550 (25.1) | 106 (11.4) |
Geographic region of patient residence, n (%) | | |
Northeast | 344 (15.7) | 164 (17.6) |
Midwest | 815 (37.1) | 262 (28.1) |
South | 683 (31.1) | 274 (29.4) |
West | 353 (16.1) | 231 (24.8) |
Treating physician specialty, n (%) | | |
Oncology | 1482 (67.5) | 286 (30.7) |
Pulmonary medicine | 34 (1.5) | 18 (1.9) |
Primary care provider | 77 (3.5) | 36 (3.9) |
Other | 481 (21.9) | 133 (14.3) |
Missing | 121 (5.5) | 458 (49.2) |
aIncludes Supplemental and Part D plans.
In the main analysis cohort, the mean QCI was 1.6 (SD 1.59). The most frequent comorbidities were dyspnea (1417/2195, 64.56%), chronic pulmonary disease (1125/2195, 51.25%), hypertension (1073/2195, 48.88%), anemia (880/2195, 40.09%), and dyslipidemia (792/2195, 36.08%;
In the main analysis cohort, additional clinical information was available via CCQP and/or MRs (
Clinical characteristics from claims at baseline (over 6 months before second-line therapy initiation date).
Variables | Main analysis cohort (n=2195) | Claims-only cohort (n=931) |
QCIa, mean (SD) | 1.6 (1.59) | 1.8 (1.69) |
QCI category, n (%) | | |
0 | 570 (26.0) | 230 (24.7) |
1 | 705 (32.1) | 271 (29.1) |
2 | 414 (18.9) | 185 (19.9) |
3-5 | 444 (20.2) | 212 (22.8) |
6+ | 62 (2.8) | 33 (3.5) |
QCI comorbidity conditions, n (%) | | |
Myocardial infarction | 112 (5.1) | 46 (4.9) |
Congestive heart failure | 195 (8.9) | 111 (11.9) |
Peripheral vascular disease | 357 (16.3) | 186 (20.0) |
Cerebrovascular disease | 255 (11.6) | 100 (10.7) |
Dementia | 18 (0.8) | 10 (1.1) |
Chronic pulmonary disease | 1125 (51.2) | 390 (41.9) |
Connective tissue/rheumatic disease | 57 (2.6) | 32 (3.4) |
Peptic ulcer disease | 31 (1.4) | 13 (1.4) |
Mild liver disease | 421 (19.2) | 162 (17.4) |
Moderate or severe liver disease | 10 (0.5) | <10b |
Paraplegia and hemiplegia | 50 (2.3) | <10b |
Renal disease | 172 (7.8) | 127 (13.6) |
Diabetes with chronic complications | 96 (4.4) | 75 (8.1) |
Diabetes without chronic complications | 380 (17.3) | 211 (22.7) |
Malignancy (excluding lung cancer) | 1224 (55.8) | 681 (73.1) |
Metastatic carcinoma | 1743 (79.4) | 632 (67.9) |
AIDS/HIV | <10b | <10b |
Other comorbidities, n (%) | | |
Anemia (any) | 880 (40.1) | 376 (40.4) |
Anemia due to chemotherapy | 323 (14.7) | 92 (9.9) |
Asthma | 166 (7.6) | 88 (9.5) |
Cardiac dysrhythmias | 375 (17.1) | 199 (21.4) |
Coronary heart disease | 410 (18.7) | 209 (22.4) |
Depression | 338 (15.4) | 139 (14.9) |
Dyslipidemia | 792 (36.1) | 402 (43.2) |
Dyspnea | 1417 (64.6) | 542 (58.2) |
Hypertension | 1073 (48.9) | 565 (60.7) |
Idiopathic fibrosis of the lung | 15 (0.7) | <10b |
Interstitial lung disease | 29 (1.3) | <10b |
Peripheral vascular disease | 361 (16.4) | 187 (20.1) |
Pneumonia | 508 (23.1) | 151 (16.2) |
Pneumonitis | 29 (1.3) | 16 (1.7) |
Pulmonary fibrosis | 112 (5.1) | <10b |
Stroke | 255 (11.6) | 100 (10.7) |
Thyroid disease | 272 (12.4) | 165 (17.7) |
Tuberculosis | <10b | <10b |
aQCI: Quan-Charlson Comorbidity Index.
bValues <10 have not been reported for patient confidentiality.
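The small-cell suppression described in footnote b can be illustrated with a short formatting helper. This is a sketch under assumptions, not the study's reporting code: the function name `format_count` and the threshold parameter are hypothetical, and the choice to suppress every count below 10 (without distinguishing zero) is an assumption, because the footnote does not say how zero counts were handled in this table.

```python
def format_count(n, total, threshold=10):
    """Format a cell as 'n (%)', masking counts below the threshold."""
    if n < threshold:
        # Assumption: all counts below the threshold are masked alike.
        return f"<{threshold}"
    return f"{n} ({100 * n / total:.1f})"


# Counts of 880 and 10 out of 2195 appear in the table above;
# the count of 7 out of 931 is hypothetical.
anemia_cell = format_count(880, 2195)
liver_cell = format_count(10, 2195)
masked_cell = format_count(7, 931)

print(anemia_cell, liver_cell, masked_cell)
```

A count of exactly 10 is still reported, which matches the table entry of 10 (0.5) for moderate or severe liver disease.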
Clinical characteristics from Cancer Care Quality Program and/or medical records at baseline (on or close to second-line therapy initiation date).
Variables | Main analysis cohort |
From MRa (n=407) | |
Smoking status, n (%) | |
Current smoker | 67 (16.5) |
Former smoker | 241 (59.2) |
Never smoker | 58 (14.3) |
Not documented | 41 (10.1) |
Patients with number of years smoked documented, n (%) | 201 (49.4) |
Number of years smoked, mean (SD) | 36.1 (13.48) |
Patients with weight documented, n (%) | 371 (91.2) |
Weight (pounds), mean (SD) | 165.0 (44.48) |
Patients with height documented, n (%) | 341 (83.8) |
Height (inches), mean (SD) | 66.5 (3.88) |
Patients with BMI documented, n (%) | 339 (83.3) |
BMI, mean (SD) | 26.1 (6.36) |
Histology, n (%) | |
Adenocarcinoma | 271 (66.6) |
Large-cell carcinoma | 9 (2.2) |
Bronchioloalveolar carcinoma | 2 (0.5) |
Mixed | 3 (0.7) |
Unspecified nonsquamous | 2 (0.5) |
Other | 4 (1.0) |
Unknown/not documented | 116 (28.5) |
Patients with metastatic sites documented, n (%) | 387 (95.1) |
Lymph nodes (thoracic region) | 289 (71.0) |
Supraclavicular nodes | 87 (21.4) |
Superior mediastinal nodes | 201 (49.4) |
Aortic nodes | 64 (15.7) |
Inferior mediastinal nodes | 132 (32.4) |
Hilar, lobar, and/or (sub)segmental nodes | 199 (48.9) |
Bone | 190 (46.7) |
Other respiratory systems (not trachea) | 163 (40.0) |
Brain | 121 (29.7) |
Liver | 72 (17.7) |
Adrenal gland | 59 (14.5) |
Number of metastases sites, mean (SD) | 3.2 (1.90) |
From Cancer Care Quality Program and/or MR (n=2195) | |
Patients with Eastern Cooperative Oncology Group performance status available, n (%) | 2113 (96.26) |
0 | 464 (21.96) |
1 | 1201 (56.84) |
2 | 364 (17.23) |
3 | 74 (3.50) |
4 | 10 (0.47) |
5 | 0 (0) |
Patients with stage (TNMb) available, n (%) | 2146 (97.77) |
0 | 0 (0) |
1 | <10 |
2 | 32 (1.49) |
3 | 167 (7.78) |
4 | 1935 (90.17) |
Unknown or not documented | <10 |
aMR: medical record.
bTNM: tumor/lymph nodes/metastasis cancer staging system.
The mean length of follow-up in months was 7.9 (SD 5.77) for the main analysis cohort (median 6.8) and 9.1 (SD 6.06) for the claims-only cohort (median 8.6). Death (for all causes) was observed in 37.77% (829/2195) of the main analysis cohort and 29.3% (273/931) of the claims-only cohort.
Among the 1974 patients with first-line (1L) treatment information, 69.50% (1372/1974) used platinum-based regimens, 37.69% (744/1974) used pemetrexed-containing regimens, and 16.51% (326/1974) used single-agent chemotherapy (treatment groups are not mutually exclusive;
Treatment patterns from Cancer Care Quality Program and claims, measured from the initiation of first-line treatment to the end of follow-up.
Therapy | Main analysis cohort (N=2195) |
Patients with 1La treatment information, n (%) | 1974 (89.9) |
Chemotherapy, n (%) | |
Platinum-based regimen | 1372 (69.5) |
Nonplatinum-based regimen | 90 (4.6) |
Pemetrexed-containing regimen | 744 (37.7) |
Single-agent chemotherapy | 326 (16.5) |
Immunotherapy, n (%) | |
PD-1/PD-(L)1b inhibitor–containing regimen | 241 (12.2) |
Targeted therapy, n (%) | |
EGFRc TKIsd-containing regimen | 98 (5.0) |
EGFR mAbe-containing regimen | 11 (0.6) |
VEGFf mAb-containing regimen | 308 (15.6) |
ALKg inhibitor | 21 (1.1) |
Duration of time (days) between initial lung cancer diagnosis and 1L treatment, mean (SD) | 134.6 (380.98) |
Duration (days) of 1L therapy, mean (SD)h | 127.7 (142.75) |
Gap of ≤90 days before 2Li | 1122 (56.8) |
Gap of >90 days before 2L | 852 (43.2) |
Patients with 2L treatment information, n (%) | 2195 (100.0) |
Chemotherapy, n (%) | |
Platinum-based regimen | 472 (21.5) |
Nonplatinum-based regimen | 221 (10.1) |
Pemetrexed-containing regimen | 344 (15.7) |
Single-agent chemotherapy | 441 (20.1) |
Immunotherapy, n (%) | |
PD-1/PD-L1 inhibitor–containing regimen | 1094 (49.8) |
Targeted therapy, n (%) | |
EGFR TKIs-containing regimen | 36 (1.6) |
EGFR mAb-containing regimen | 10 (0.5) |
VEGF mAb-containing regimen | 141 (6.4) |
ALK inhibitor | <10j |
Duration (days) of 2L therapy, mean (SD)k | 168.6 (148.4) |
 | 269 (12.3) |
Curative | 21 (7.8) |
Palliative | 124 (46.1) |
Both curative and palliative (separate instances) | 15 (5.6) |
Unknown | 109 (40.5) |
a1L: first-line therapy.
bPD-(L)1: programmed death-(ligand) 1.
cEGFR: epidermal growth factor receptor.
dTKI: tyrosine kinase inhibitor.
emAb: monoclonal antibodies.
fVEGF: vascular endothelial growth factor.
gALK: anaplastic lymphoma kinase.
hMedian 90.0.
i2L: second-line therapy.
jValues <10 have not been reported for patient confidentiality.
kMedian 121.0.
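The abstract reports the duration of 2L therapy in months, whereas the table above reports it in days. The conversion factor is not stated in the source; the sketch below assumes a 30-day month (an assumption, labeled as such in the code), which reproduces the reported values of 5.6 (SD 4.9, median 4) months. The same block rechecks the percentages for the gap between 1L and 2L therapy. All function and variable names are hypothetical.

```python
DAYS_PER_MONTH = 30  # Assumption: not stated in the source.


def days_to_months(days, digits=1):
    """Convert a duration in days to months, rounded to the given decimals."""
    return round(days / DAYS_PER_MONTH, digits)


# Duration of 2L therapy in days, as reported in the table and footnote k.
mean_months = days_to_months(168.6)
sd_months = days_to_months(148.4)
median_months = int(days_to_months(121.0, digits=0))

# Gap before 2L among the 1974 patients with 1L treatment information.
patients_with_1l = 1974
gap_le_90_pct = round(100 * 1122 / patients_with_1l, 1)
gap_gt_90_pct = round(100 * 852 / patients_with_1l, 1)

print(mean_months, sd_months, median_months)
print(gap_le_90_pct, gap_gt_90_pct)
```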
This study combined 3 data sources for the analysis of real-world outcomes in patients with NSCLC, conducting data integration on a large scale across disparate but complementary sources. It was designed to simulate a prospective observational study by identifying patients upfront within large preexisting databases and then following them within the data set to examine outcomes. One of the potential strengths of this approach is the development of a database that includes demographic, clinical, and health care resource utilization data that can more accurately assess health outcomes.
The use of big data from multiple sources, such as health plan enrollment, disease registries, and scanned image repositories, among others, is becoming more important for the accurate determination of patient outcomes, particularly in the setting of NSCLC [
Real-world evidence (RWE), which is largely derived from big health care data, has increasingly been driven by important technological advances, including machine learning, natural language processing improvements in electronic medical systems, and the ability to link clinical and health claims data in private and public systems [
Furthermore, RWE is taking on a bigger role in decision making across the health care system, including among regulators, payers, providers, and patients. Part of the reason is that although RCTs have internal validity, which is essential for safety and efficacy determinations, their results may have limited external validity. At the same time, RWE studies using big data are able to explore key clinical questions that are outside the scope of RCTs. Such studies are well suited for investigations seeking safety and effectiveness outcomes data for broader target populations. This is especially valuable for the evaluation of fast-tracked medical products, which typically gain regulatory approval based on limited data. In addition, large RWE studies are invaluable in detecting the side effects of treatments over longer periods. Other circumstances in which RWE is valuable include exploring rare diseases, assessing the impact of treatment adherence, obtaining rapid retrospective results, comparing multiple treatments that have not been explored in trials, and focusing on population subsets of interest, given the greater heterogeneity and larger population sizes in real-world data compared with clinical trials [
Due to the frequency of onset of NSCLC later in life, our study sample included patients with an average age greater than 60 years, with females constituting about half of the study population, which is consistent with other real-world US outcomes studies that examined patients with NSCLC [
By integrating data across 3 sources to create a cohort of patients with NSCLC with rich clinical and economic data, our study adds to the comparatively small body of evidence on the performance of data integration methods and on the determination of health outcomes from such data for patients with NSCLC. To the extent that our study sample reflects the larger national population with lung cancer and commercial insurance, these data could inform a range of decisions made by health care stakeholders, including providers and patients, who require insights into the allocation of resources and overall disease management that cannot be fully ascertained from a single data source. One example would be the interaction of biomarker testing, treatment choice, and health outcomes. Integrated data sets such as RESOUNDS that can be refreshed regularly also offer many opportunities for future research on topics such as treatment sequencing, disease progression, and health care resource utilization and costs.
Our study also highlighted some challenges in the creation, maintenance, and analysis of large integrated data sets. Integrating data sets in the midst of a rapid shift in the treatment landscape (such as the introduction of immune checkpoint inhibitors in oncology) may reduce the value of data sets that are large and deep but that include periods of time that are no longer relevant to current standards of care. Maintaining these data sets requires constant refreshes and updates so that the periods of interest to the investigator are current and available for analysis. The wealth of data available in MRs presents a trade-off between generating a limited set of relevant data that is available reasonably quickly and a broader set of data that is potentially available but more difficult to obtain and prepare for analysis. Methods of data integration and data extraction may be improved with machine learning or natural language processing to reduce the manual extraction via data collection forms that was used in this study. Patient sample sizes available for analysis diminish when multiple data sources are required. Finally, specific data integration challenges in our study required additional effort from the project team to understand and address (eg, the estimated 2L therapy start date for a given patient sometimes differed between the data sources, plan enrollment changes entailed patients leaving or entering the data set multiple times, and conflicts between data sources for a given variable had to be resolved).
Results based on integrated data must also be interpreted in light of some limitations. The data quality and content depend on the underlying data selected for integration. Limitations specific to the data used for this project include the following. CCQP data were collected at the time of the prior authorization request, not at diagnosis. The CCQP offers incentives to physicians for treating according to evidence-based guidelines created by the health plan, which could have influenced treatment choices. MR data may be underreported or missing due to vague, incomplete, or illegible entries; the inability to locate the required information; or missing patient files. ECOG performance status, a standard data item in cancer trials, is not always assessed in real-world patient care settings (in our study, this variable was available for 96.26% [2113/2195] of the sample, mostly from the CCQP), and information on race/ethnicity is often missing in claims data. Similarly, tumor growth and progression information is collected in various formats and levels of detail outside of a clinical trial setting. As a result, data were sparse for some of our research questions of interest. Efforts by payers to tie provider reimbursement to the collection of key data points, for example, through quality improvement initiatives, may over time alleviate some of the missing data issues. Data collected during MR abstraction may contain measurement errors linked to inconsistent coding, transcription, and data transfer. The typical limitations of claims data also apply. For example, a diagnosis code on a medical claim (eg, for secondary malignancies) does not guarantee the presence of a disease. Similarly, a claim for a prescription fill does not indicate that the medication was consumed or taken as prescribed. The generalizability of claims-based results is confined to similarly insured populations (eg, commercial, US-based in this study).
The care of patients with NSCLC requires a range of resources in a variety of settings in the real world. NSCLC and other forms of cancer are increasingly being managed like chronic diseases with a broad range of increasingly effective treatments. The assessment of real-world data to evaluate outcomes among patients with NSCLC will require the integration of a broad range of clinical data with health plan claims data. Overcoming data integration and completeness challenges will allow better informed decision making by all stakeholders of the health care system.
1L: first-line
2L: second-line
CCQP: Cancer Care Quality Program
ECOG: Eastern Cooperative Oncology Group
HIRD: HealthCore Integrated Research Database
MR: medical record
NSCLC: non–small cell lung cancer
QCI: Quan-Charlson Comorbidity Index
RCT: randomized clinical trial
RESOUNDS: Real-World Treatment Sequences and Outcomes Among Patients With Non-Small Cell Lung Cancer
RWE: real-world evidence
Funding for the study was provided to HealthCore, Inc by Eli Lilly and Company. Bernard Tulsi, an employee of HealthCore, Inc at the time of the study, provided writing and editorial support for this manuscript.
MG is an employee of HealthCore, Inc, an independent research organization that received funding from Eli Lilly and Company for the conduct of this study. CM, KW, ZC, and LH are employees and stockholders of Eli Lilly and Company. GC was an employee of Eli Lilly and Company at the time the study was conducted. LW was an employee of HealthCore at the time the study was conducted.