Data Integration to Improve Real-world Health Outcomes Research for Non–Small Cell Lung Cancer in the United States: Descriptive and Qualitative Exploration

Background: The integration of data from disparate sources could help alleviate data insufficiency in real-world studies and compensate for the inadequacies of single data sources and short-duration, small sample size studies while improving the utility of data for research.
Objective: This study aims to describe and evaluate a process of integrating data from several complementary sources to conduct health outcomes research in patients with non–small cell lung cancer (NSCLC). The integrated data set is also used to describe patient demographics, clinical characteristics, treatment patterns, and mortality rates.
Methods: This retrospective cohort study integrated data from 4 sources: administrative claims from the HealthCore Integrated Research Database, clinical data from a Cancer Care Quality Program (CCQP), clinical data from abstracted medical records (MRs), and mortality data from the US Social Security Administration. Patients with lung cancer who initiated second-line (2L) therapy between November 01, 2015, and April 13, 2018, were identified in the claims and CCQP data. Eligible patients were 18 years or older and received atezolizumab, docetaxel, erlotinib, nivolumab, pembrolizumab, pemetrexed, or ramucirumab in the 2L setting. The main analysis cohort included patients with claims data and data from at least one additional data source (CCQP or MR). Patients without integrated data (claims only) were reported separately. Descriptive and univariate statistics were reported.
Results: Data integration resulted in a main analysis cohort of 2195 patients with NSCLC; 2106 patients had CCQP and 407 patients had MR data. The claims-only cohort included 931 eligible patients. For the main analysis cohort, the mean age was 62.1 (SD 9.27) years, 48.56% (1066/2195) were female, the median length of follow-up was 6.8 months, and for 37.77% (829/2195), death was observed. For the claims-only cohort, the mean age was 66.6 (SD 12.69) years, 52.1% (485/931) were female, the median length of follow-up was 8.6 months, and for 29.3% (273/931), death was observed. The most frequent 2L treatment was immunotherapy (1094/2195, 49.84%), followed by platinum-based regimens (472/2195, 21.50%) and single-agent chemotherapy (441/2195, 20.09%); mean duration of 2L therapy was 5.6 (SD 4.9, median 4) months. We describe challenges and learnings from the data integration process, and the benefits of the integrated data set, which includes a richer set of clinical and outcome data to supplement the utilization metrics available in administrative claims.
Conclusions: The management of patients with NSCLC requires care from a multidisciplinary team, leading to a lack of a single aggregated data source in real-world settings. The availability of integrated clinical data from MRs, health plan claims, and other sources of clinical care may improve the ability to assess emerging treatments.
(Grabner et al, JMIR Cancer 2021;7(2):e23161) doi: 10.2196/23161, https://cancer.jmir.org/2021/2/e23161


Introduction

Background
Real-world health outcomes research is often challenged by data insufficiency resulting from studies using a single data source and/or short durations [1][2][3]. For example, medical records (MRs) generally do not contain details of care outside of the point of service of the single health care provider, claims data contain few variables related to clinical outcomes, and registries often do not contain complete longitudinal data [4][5][6][7]. The integration of clinical data from different sources such as MRs [8], disease registries, or quality initiatives with large administrative claims repositories has been shown to increase the volume and quality of available data [9][10][11][12]. For example, integrated data allow the inclusion of important clinical factors when analyzing health care utilization and costs, as recorded in claims [13]. Such integrated observational data sets have also been used to generate predictive algorithms to better identify patients with cancer [14][15][16][17] and their disease characteristics [18][19][20].
Lung cancer is the second most common cancer in the United States, with approximately 230,000 new diagnoses in 2020 [21]. It is the leading cause of cancer-related deaths in the United States, projected at 136,000 in 2020 [22]. Non-small cell lung cancer (NSCLC) accounts for approximately 85% of all lung cancer cases [23]. Treatment modalities for advanced and/or metastatic NSCLC include radiotherapy, chemotherapy, targeted therapy, or a combination therapy [24]. Over the last few years, second-line (2L) treatment options have expanded rapidly with the introduction of immune checkpoint and epidermal growth factor receptor inhibitors and associated predictive biomarkers [25].
Treatment sequencing in the setting of NSCLC is not well characterized, largely because of the sparseness of applicable studies, which tend to be limited by inadequate data. This study was designed based on the rationale that a combination of retrospective data from multiple sources, such as MRs, administrative claims, and care quality initiatives, would provide a solid foundation for observing and characterizing real-world treatment outcomes at a lower cost than a traditional site-based prospective approach.

Objectives
The central objective of this study is to create an integrated database from several complementary sources and to assess the feasibility and effectiveness of these integrated observational data for health outcomes research. Patient characteristics and outcomes were described to evaluate the enrichment attained through integration. This analysis presents a descriptive summary of the final study cohort.

Study Design
RESOUNDS (Real-World Treatment Sequences and Outcomes Among Patients With Non-Small Cell Lung Cancer) was a retrospective, observational cohort study that integrated data from 4 sources: administrative claims from the HealthCore Integrated Research Database (HIRD), clinical data from a quality initiative called the Cancer Care Quality Program (CCQP), clinical data extracted from patients' MRs obtained from treating providers, and all-cause mortality data from the Death Master File of the US Social Security Administration. Details of the RESOUNDS study design and each of these data sources have been published previously [26]. The study protocol was approved by the New England Institutional Review Board before the commencement of data collection activities. This study was conducted in full compliance with the relevant provisions of the Health Insurance Portability and Accountability Act.

Patient Identification
Patients diagnosed with lung cancer who initiated 2L therapy between November 01, 2015, and April 13, 2018, were identified in the HIRD and CCQP data. Patients were required to receive 1 of the following 2L therapies alone or in combination: atezolizumab, docetaxel, erlotinib, nivolumab, pembrolizumab, pemetrexed, or ramucirumab. This subset of the original set of therapies listed in the protocol [26] was selected based on treatment guidelines and observed frequency of use during the study period, to ensure sufficient sample sizes to evaluate treatment patterns. Patients aged under 18 years at the start of 2L therapy were excluded. Due to the absence of specific International Classification of Diseases, Ninth and Tenth Revision, Clinical Modification codes for NSCLC, cancer type was confirmed via CCQP or MR data. Follow-up for all-cause death events was conducted through March 31, 2019.
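The eligibility criteria above can be expressed as a simple filter. The following is a minimal sketch, not the study's actual code: the `Candidate` record layout and field names are hypothetical, "alone or in combination" is interpreted as at least one listed agent appearing in the 2L regimen, and confirmation of NSCLC via CCQP or MR data is not modeled.

```python
from dataclasses import dataclass
from datetime import date

# Values taken from the patient identification criteria in the text.
WINDOW_START = date(2015, 11, 1)
WINDOW_END = date(2018, 4, 13)
MIN_AGE_YEARS = 18
ELIGIBLE_2L_AGENTS = frozenset({
    "atezolizumab", "docetaxel", "erlotinib", "nivolumab",
    "pembrolizumab", "pemetrexed", "ramucirumab",
})


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record layout; field names are illustrative only."""
    index_date: date      # start of 2L therapy
    age_at_index: int     # age in years at the start of 2L therapy
    agents_2l: frozenset  # lower-case names of agents in the 2L regimen


def is_eligible(c: Candidate) -> bool:
    """Apply the index-date window, age, and 2L agent criteria."""
    in_window = WINDOW_START <= c.index_date <= WINDOW_END
    adult = c.age_at_index >= MIN_AGE_YEARS
    # Assumption: a regimen qualifies if it contains at least one listed agent.
    has_study_agent = bool(c.agents_2l & ELIGIBLE_2L_AGENTS)
    return in_window and adult and has_study_agent
```

For example, under these assumptions a 64-year-old patient starting nivolumab on May 1, 2016, would pass the filter, whereas the same patient starting on April 14, 2018, would not.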

Integrated Database Development
Patients were first identified in the CCQP data, where information on the type of lung cancer (NSCLC or not) was available, and information for patients with a record of 2L therapies of interest was retained. All cancer stages were included in the analyses. Second, lung cancer diagnosis and treatment claims were used to identify patients with 2L treatment in the HIRD. Patients who also had claims for other primary cancers were retained. All patients identified in the CCQP data were also included in the HIRD sample; patients who appeared in the HIRD but not the CCQP were retained. Third, copies of MRs were obtained from selected patients' 2L prescribers (focusing on oncologists, as identified in the HIRD) and screened for qualification (presence of evidence for NSCLC and that the index treatment was used as therapy for NSCLC). Regulatory and operational requirements for inclusion in this process consisted of patients having a fully insured status (vs administrative services only) and presence of complete contact information for the 2L prescriber. Once obtained and screened, clinical information was abstracted from each record by trained health information management technicians using a standardized form. The target sample size for MR abstraction was 398 patients, based on the expected feasible accrual over the 2.5-year patient identification period.
Data from each source were accumulated in 3 consecutive waves to continuously build the database. After each MR abstraction wave was complete, the claims and CCQP data were refreshed to the most current date at that point to obtain additional follow-up outcomes. The integrated data were used to establish the main analysis cohort, consisting of patients with both claims and either CCQP or MR data (or both). Eligible patients from the HIRD who did not appear in the CCQP and for whom no MRs were obtained were included in the claims-only cohort (these patients could have any type and stage of lung cancer).
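The cohort definitions above reduce to a small decision rule on data availability. A minimal sketch follows; the function name and the returned labels are illustrative and are not taken from the study.

```python
def assign_cohort(has_claims: bool, has_ccqp: bool, has_mr: bool) -> str:
    """Assign a patient to a cohort from the data sources available.

    Main analysis cohort: claims plus CCQP and/or MR data.
    Claims-only cohort: claims with neither CCQP nor MR data.
    """
    if not has_claims:
        # All study patients were identified with claims data, so this
        # branch is a defensive fallback rather than a study cohort.
        return "not_in_study"
    if has_ccqp or has_mr:
        return "main_analysis"
    return "claims_only"
```

Because NSCLC status could be confirmed only through CCQP or MR data, patients labeled `claims_only` by this rule could have any type and stage of lung cancer, as noted above.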

Patient Characteristics and Outcomes
Demographic and clinical characteristics, treatment patterns, and survival outcomes were recorded. Baseline was defined as the 6 months before the index date (start of 2L therapy). The Quan-Charlson Comorbidity Index (QCI) was calculated, excluding lung cancer and metastatic carcinoma [27]. A patient was considered to remain on the same line of therapy until the first of the following: addition of a new agent (except for maintenance therapy and switching between platinum agents), a gap of >90 days between treatments, end of follow-up, or (for 2L and higher) discontinuation. The percentage of patients flagged as deceased (for all causes) was calculated using a combination of the Death Master File, a hospitalization discharge code of deceased from claims, and mortality recorded from the health plan enrollment files.
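The line-of-therapy rule can be illustrated with a simplified sketch. Several assumptions here are not from the study: the regimen is defined by the agents given on the first administration date, gaps are measured between consecutive administration dates, only carboplatin and cisplatin are treated as interchangeable platinum agents, and the maintenance exception and the distinction between discontinuation and end of follow-up are not modeled.

```python
from datetime import date

PLATINUM_AGENTS = {"carboplatin", "cisplatin"}  # assumed interchangeable
MAX_GAP_DAYS = 90


def _normalize(agents):
    """Collapse platinum agents so a platinum switch is not a new agent."""
    return frozenset("platinum" if a in PLATINUM_AGENTS else a for a in agents)


def end_of_line(administrations):
    """Find where a line of therapy ends.

    administrations: non-empty list of (date, set_of_agents), sorted by date,
    beginning at the start of the line.
    Returns (index of the last administration in the line, reason).
    """
    first_date, first_agents = administrations[0]
    regimen = _normalize(first_agents)
    previous_date = first_date
    for i, (day, agents) in enumerate(administrations[1:], start=1):
        if (day - previous_date).days > MAX_GAP_DAYS:
            return i - 1, "gap_over_90_days"
        if not _normalize(agents) <= regimen:
            return i - 1, "new_agent_added"
        previous_date = day
    # No further records: end of follow-up or discontinuation.
    return len(administrations) - 1, "end_of_records"
```

In this sketch, a switch from carboplatin to cisplatin within a pemetrexed-containing regimen keeps the patient on the same line, whereas a later administration of nivolumab ends it.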

Statistical Analysis
Univariate statistics including means, SDs, and medians for continuous variables and relative frequencies and percentages for categorical variables were reported. No hypothesis testing was performed. Statistical analysis was performed using SAS version 9.3 (SAS Institute Inc).

Data Integration and Patient Selection
Following data integration, the main analysis cohort consisted of 2195 patients. All patients had claims data, 2106 patients had CCQP data, and 407 patients had MR data (Table 1).
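Because every patient in the main analysis cohort had CCQP data, MR data, or both, the overlap between the 2 clinical sources follows from inclusion-exclusion. The sketch below derives it from the reported counts; the resulting figures are arithmetic consequences of the cohort definition and are not reported in the text.

```python
def source_overlap(total: int, with_ccqp: int, with_mr: int) -> dict:
    """Split a cohort in which everyone has CCQP and/or MR data.

    Inclusion-exclusion: total = with_ccqp + with_mr - both.
    """
    both = with_ccqp + with_mr - total
    if both < 0:
        raise ValueError("counts are inconsistent with full coverage")
    return {
        "both": both,
        "ccqp_only": with_ccqp - both,
        "mr_only": with_mr - both,
    }


# Counts reported for the main analysis cohort.
print(source_overlap(total=2195, with_ccqp=2106, with_mr=407))
# {'both': 318, 'ccqp_only': 1788, 'mr_only': 89}
```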
Approximately 47.14% (997/2115) of patients fulfilled regulatory and operational requirements for their MRs to be requested from their 2L-prescribing providers; for 54.5% (543/997) of those, the records were obtained. A large number of MRs were not obtained as outreach was stopped after the planned sample size (n=398) was achieved; others could not be obtained because the provider did not have a record of the particular patient or because of inability to contact the provider. Among the obtained records, the most frequent reason for exclusion was the absence of confirmation of NSCLC (43/543, 7.9% of the obtained records). The claims-only cohort comprised 931 patients. Table 2 details what variables were obtained from which source.
Table 1. Patient identification (excerpt; column headings and some rows are not preserved in this copy).
Step A1: Patients with non-small cell lung cancer (values not preserved)
Step A2: From step A1, patients with 2L therapy: -, 863, 469, 174
Step B: Patients identified from claims
Step B1: Patients with lung cancer claim before start of first-line therapy: -, 2187, 1058, 640
Step B2: From step B1, patients with 2L therapy: -, 1127, 600, 368
Step C: Combined patients from CCQP and claims
Step C1: From A2 and B2, unique patients with 2L therapy: 2115, 1732, 756, 423
Table notes:
h Medical records excluded due to one or more of the following: no documentation of lung cancer, no documentation of non-small cell lung cancer, and patient mismatch (missing or unmatched name, sex, or date of birth; wrong timeframe; inconsistent clinical information).
i These are the final sample sizes for the 2 cohorts of interest.
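The MR retrieval percentages reported in this section can be reproduced from the stated counts. The helper below is an illustrative sketch; the function and variable names are not from the study.

```python
def pct(numerator: int, denominator: int, digits: int) -> str:
    """Format a proportion as a percentage with a fixed number of decimals."""
    return f"{100 * numerator / denominator:.{digits}f}%"


# Patients meeting regulatory and operational requirements for MR requests.
eligible_for_request = pct(997, 2115, 2)
# Requested records that were actually obtained.
records_obtained = pct(543, 997, 1)
# Obtained records excluded for lack of NSCLC confirmation.
excluded_no_nsclc = pct(43, 543, 1)

print(eligible_for_request, records_obtained, excluded_no_nsclc)
# 47.14% 54.5% 7.9%
```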

Demographic Characteristics at Baseline
In the main analysis cohort, mean age was 62.1 (SD 9.27) years and 48.56% (1066/2195) were female (Table 3), whereas in the claims-only cohort, mean age was 66.6 (SD 12.69) years and 52.1% (485/931) were female. More than two-thirds (1498/2195, 68.25%) of the main analysis cohort were from the Midwest and South, and 23

Clinical Characteristics at Baseline
In the main analysis cohort, the mean QCI was 1.6 (SD 1.59 In the main analysis cohort, additional clinical information was available via CCQP and/or MRs (

Length of Follow-Up and Mortality
The mean length of follow-up in months was 7.9 (SD 5.77) for the main analysis cohort (median 6.8) and 9.1 (SD 6.06) for the claims-only cohort (median 8.6). Death (for all causes) was observed in 37.77% (829/2195) of the main analysis cohort and 29.3% (273/931) of the claims-only cohort.

Principal Findings
This study combined 3 data sources for the analysis of real-world outcomes in patients with NSCLC, conducting data integration on a large scale across disparate but complementary sources. It was designed to simulate a prospective observational study by identifying patients upfront within large preexisting databases and then following them within the data set to examine outcomes. One of the potential strengths of this approach is the development of a database that includes demographic, clinical, and health care resource utilization data that can more accurately assess health outcomes.
The use of big data from multiple sources, such as health plan enrollment, disease registries, and scanned image repositories, among others, is becoming more important for the accurate determination of patient outcomes, particularly in the setting of NSCLC [28][29][30][31]. With the current availability of a wide range of newer, more effective systemic therapies, including several novel biologic agents, the use of diverse provider, institutional, and registry databases is increasingly necessary to evaluate outcomes due to the gaps in administrative claims data alone [32][33][34][35]. As treatments in oncology have improved, patients with lung cancer are living longer with the ability to personalize care with novel targeted therapies. This approach, coupled with more effective treatment, means that treatment strategies are increasingly complex, and factors influencing these strategies and their resultant outcomes are not fully identifiable in administrative claims data. As a result, the effective evaluation of treatment outcomes increasingly draws on data from multiple sources across lines of treatment, providers, and institutions.
Real-world evidence (RWE), which is largely derived from big health care data, has increasingly been driven by important technological advances, including machine learning, natural language processing, improvements in electronic medical systems, and the ability to link clinical and health claims data in private and public systems [9]. As RWE grows and gains value, especially for pragmatic clinical trials (PCTs), the traditional gold standard of a randomized clinical trial (RCT) is facing major hurdles: low recruitment rates, small patient populations, long durations, and high costs. This evolving environment, along with growing interest in PCTs, is increasing the importance of big data and RWE as a complement to RCTs [36,37].
Furthermore, a bigger role for RWE is developing in decision making across the health care system, including regulators, payers, providers, and patients. Part of the reason is that although RCTs have internal validity, which is essential for safety and efficacy determinations, their results may have limited external validity. At the same time, RWE studies using big data are able to explore key clinical questions that are outside the scope of RCTs. Such studies are well suited for investigations seeking safety and effectiveness outcomes data for broader target populations. This is especially valuable for the evaluation of fast-tracked medical products, which typically gain regulatory approval based on limited data. In addition, large RWE studies are invaluable in detecting the side effects of treatments over longer periods. RWE is also valuable for exploring rare diseases, assessing the impact of treatment adherence, obtaining rapid retrospective results, comparing multiple treatments that have not been compared in trials, and focusing on population subsets of interest, given the greater heterogeneity and larger population sizes in real-world data compared with clinical trials [36][37][38].
Because NSCLC typically develops later in life, our study sample had an average age greater than 60 years, with females constituting about half of the study population, which is consistent with other real-world US outcomes studies that examined patients with NSCLC [39][40][41][42][43][44][45][46][47][48]. All prior studies, to our knowledge, that focused on the United States used 1 or 2 of the following data sources: administrative claims, registry data, or MRs. Limitations of these studies fall into 2 categories: (1) missing data on potential confounders and/or outcomes of interest (eg, claims data can assess utilization outcomes but lack disease characteristics; MR data have a rich set of clinical characteristics but lack longitudinality and utilization or cost data) and (2) limited generalizability (eg, the SEER-Medicare linked data in the United States capture claims and cancer registry data only for patients aged 65 years or older).
The ability of our study to integrate data across 3 sources to create a cohort of NSCLC patients with rich clinical and economic data offers an important addition to the comparatively small body of data on the performance of data integration methods and the determination of health outcomes based on these data for patients with NSCLC. To the extent that our study sample reflects the larger national population affected by lung cancer and with commercial insurance, these data could be instructive for a range of decisions made by multiple health care stakeholders including providers and patients requiring insights into the allocation of resources and overall disease management that cannot be completely ascertained from a single data source alone. One example would be the interaction of biomarker testing, treatment choice, and health outcomes. Integrated data sets such as RESOUNDS that can be refreshed regularly also offer many opportunities for future research, such as treatment sequencing, disease progression, and health care resource utilization and costs.

Data Integration Challenges
Our study also highlighted some challenges in the creation, maintenance, and analysis of large integrated data sets.
Integration of data sets in the midst of a rapid shift in the treatment landscape (such as the introduction of immune checkpoint inhibitors for oncology) may reduce the value of data sets that are large and deep but that include periods of time that are no longer relevant to current standards of care. The maintenance of these data sets requires continual refreshing and updating, so that the periods of interest to the investigator are current and available for analysis. The wealth of data available in MRs presents challenges in identifying the trade-offs between generating a limited set of relevant but reasonably quickly available data versus a broader set of data that is potentially available but more difficult to obtain and prepare for analysis. Methods of data integration and data extraction may be improved with machine learning or natural language processing to reduce the manual extraction via data collection forms that was used in this study. Patient sample sizes available for analysis diminish when multiple data sources are required. Finally, specific data integration challenges in our study required additional effort from the project team to understand and address (eg, the estimated 2L therapy start date for a given patient sometimes differed between the data sources, plan enrollment changes entailed patients leaving or entering the data set multiple times, and conflicts between data sources for a given variable had to be resolved).
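One of the challenges above, differing 2L start dates between sources, can be illustrated with a sketch of a reconciliation step. The rule used here (take the earliest available date and flag large discrepancies for manual review) and the 30-day tolerance are purely hypothetical; the text does not describe how such conflicts were actually resolved.

```python
from datetime import date
from typing import Optional


def reconcile_index_date(
    claims_date: date,
    ccqp_date: Optional[date] = None,
    mr_date: Optional[date] = None,
    tolerance_days: int = 30,  # hypothetical review threshold
) -> dict:
    """Pick a single 2L start date and flag disagreement between sources."""
    candidates = {"claims": claims_date, "ccqp": ccqp_date, "mr": mr_date}
    available = {src: d for src, d in candidates.items() if d is not None}
    # Hypothetical rule: use the earliest date reported by any source.
    chosen_source = min(available, key=available.get)
    spread = (max(available.values()) - min(available.values())).days
    return {
        "index_date": available[chosen_source],
        "source": chosen_source,
        "spread_days": spread,
        "needs_review": spread > tolerance_days,
    }
```

Recording the chosen source and the spread alongside the reconciled date keeps the decision auditable, which matters when the data are refreshed in later waves and the same conflict may reappear.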

Study Limitations
Results based on integrated data must also be viewed in light of some limitations. The data quality and content depend on the underlying data selected for integration. Specific to the data used for this project, limitations include the following. CCQP data were collected at the time of the prior authorization request, not at diagnosis. CCQP offers incentives to physicians for treating according to evidence-based guidelines created by the health plan, which could have influenced treatment choices. MR data may be underreported or missing due to vague, incomplete, or illegible entries; the inability to locate the required information; or missing patient files. ECOG performance status, a standard data item in cancer trials, is not always assessed in real-world patient care settings (in our study, this variable was available for 96.26% [2113/2195] of the sample, mostly from the CCQP), and information on race/ethnicity is often missing in claims data. Similarly, tumor growth and progression information is collected in various formats and levels of detail outside of a clinical trial setting. As a result, the data for some of our research questions of interest were underpopulated. Efforts by payers to tie provider reimbursement to the collection of key data points, for example, through quality improvement initiatives, may over time alleviate some of the missing data issues. Data collected during MR abstraction may have measurement errors linked to inconsistent coding, transcription, and data transfer errors. The typical limitations of claims data also apply. For example, a diagnosis code on a medical claim (eg, for secondary malignancies) does not guarantee the presence of a disease. Similarly, a claim for a prescription fill does not indicate that the medication was consumed or taken as prescribed. The generalizability of claims-based results is confined to similarly insured populations (eg, commercial, US-based in this study).

Conclusions
The care of patients with NSCLC requires a range of resources in a variety of settings in the real world. NSCLC and other forms of cancer are increasingly being managed like chronic diseases with a broad range of increasingly effective treatments. The assessment of real-world data to evaluate outcomes among patients with NSCLC will require the integration of a broad range of clinical data with health plan claims data. Overcoming data integration and completeness challenges will allow better informed decision making by all stakeholders of the health care system.