Digital Biomarkers of Symptom Burden Self-Reported by Perioperative Patients Undergoing Pancreatic Surgery: Prospective Longitudinal Study

Background: Cancer treatments can cause a variety of symptoms that impair quality of life and functioning but are frequently missed by clinicians. Smartphone and wearable sensors may capture behavioral and physiological changes indicative of symptom burden, enabling passive and remote real-time monitoring of fluctuating symptoms Objective: The aim of this study was to examine whether smartphone and Fitbit data could be used to estimate daily symptom burden before and after pancreatic surgery. Methods: A total of 44 patients scheduled for pancreatic surgery participated in this prospective longitudinal study and provided sufficient sensor and self-reported symptom data for analyses. Participants collected smartphone sensor and Fitbit data and completed daily symptom ratings starting at least two weeks before surgery, throughout their inpatient recovery, and for up to 60 days after postoperative discharge. Day-level behavioral features reflecting mobility and activity patterns, sleep, screen time, heart rate, and communication were extracted from raw smartphone and Fitbit data and used to classify the next day as high or low symptom burden, adjusted for each individual’s typical level of reported symptoms. In addition to the overall symptom burden, we examined pain, fatigue, and diarrhea specifically. Results: Models using light gradient boosting machine (LightGBM) were able to correctly predict whether the next day would be a high symptom day with 73.5% accuracy, surpassing baseline models. The most important sensor features for discriminating high symptom days were related to physical activity bouts, sleep, heart rate, and location. LightGBM models predicting next-day diarrhea (79.0% accuracy), fatigue (75.8% accuracy), and pain (79.6% accuracy) performed similarly. Conclusions: Results suggest that digital biomarkers may be useful in predicting patient-reported symptom burden before and after cancer surgery. Although model performance in this small sample may not be adequate for clinical implementation, findings support the feasibility of collecting mobile sensor data from older patients who are acutely ill as well as the potential clinical value of mobile sensing for passive monitoring of patients with cancer and suggest that data from devices that many patients JMIR Cancer 2021 | vol. 7 | iss. 2 | e27975 | p. 1 https://cancer.jmir.org/2021/2/e27975 (page number not for citation purposes) Low et al JMIR CANCER


Introduction
Cancer treatments such as chemotherapy and surgery cause a variety of symptoms and side effects that can impair subjective quality of life and functioning. Across a variety of cancer types, fatigue, pain, nausea, and other physical symptoms are highly prevalent and often severe [1,2], and many patients experience multiple symptoms simultaneously [3]. Patients who report more significant symptoms tend to exhibit worse performance status and functional ability [4,5]. Unfortunately, symptoms remain undetected by clinicians up to half of the time [6,7], limiting opportunities for timely and effective clinical management and resulting in undue patient suffering and functional impairment.
Remotely monitoring symptoms between hospital or clinic visits may improve our ability to capture severe or bothersome symptoms when they begin to emerge [8]. Smartphones, now owned by 81% of adults and increasing proportions of older adults, those living in rural areas, and all racial groups, offer new opportunities for remote symptom monitoring [9]. Systems leveraging smartphones for real-time patient-reported outcome (PRO) assessment during outpatient chemotherapy have been demonstrated to be feasible [10,11] and to reduce chemotherapy-related morbidity [12]. Although daily PRO symptom data are valuable, long-term assessment of PROs (eg, over months or years of chemotherapy) is burdensome. Indeed, previous work suggests that patients become significantly less compliant at recording symptoms over time [13], with patient compliance dropping to below 50% after 1 month in one longitudinal study [14]. Developing a remote symptom monitoring system that is less reliant on patient compliance may enable longitudinal symptom tracking and management throughout cancer treatment and even after treatment is completed, when symptoms persist for many survivors.
Smartphones are equipped with a rich array of sensors capable of measuring many behavioral and contextual variables, including mobility, location, ambient light and noise, and social interactions [15]. Most users keep their smartphones within arm's reach at all times and spend over 4 hours per day interacting with the device [16]. Thus, smartphones can gather digital traces as individuals go about their daily routines. From these raw digital data, meaningful behavioral features such as number of unique locations visited, number of outgoing calls placed, and average level of ambient noise detected during the night can be calculated to provide information about behavior patterns in real-world contexts [17].
Smartwatches and other wearable commercial activity monitors are also becoming more widely used, with about 1 in 5 adults using a wearable device [9]. Wearable devices contain sensors such as accelerometers and photoplethysmography which can provide continuous information about activity, sleep, and physiology (eg, heart rate). Together, these mobile sensing technologies enable objective assessment of behavioral patterns that may reflect worsening health status, including severe or increasing symptoms. Moreover, this high-density, multimodal, and objective data collection can be completed with minimal burden to patients; this feature makes this approach highly scalable and appropriate for remotely monitoring patients, even older patients and those who are acutely ill and even over long periods. Given evidence that physical activity and sleep behaviors as well as heart rate have prognostic value in oncology, technology that enables passive quantification of these metrics holds considerable promise for clinical cancer research [18][19][20].
Applying machine learning classification to smartphone sensor data has been shown to accurately discriminate depressed from nondepressed individuals [21], to recognize depressive and manic episodes in patients with bipolar disorder [22][23][24], to predict mental health indicators in schizophrenia [25], and to detect binge drinking and other substance use [26]. These methods can also shed light on which behavioral features are most useful for detecting or predicting mental health states or risky behaviors. Work applying this approach to passively detect physical health status in patients with cancer is more limited, but results from 14 recent small studies suggest that wearable and smartphone sensor data are related to symptom burden, quality of life, and other clinical oncology outcomes [27].
The perioperative context is an especially critical time for remote patient monitoring, as complications after cancer surgery are common and can escalate into re-admissions that may be preventable if detected and managed earlier. Results from similar studies of patients undergoing surgical oncology procedures found that accelerometer data were useful for quantifying differences in postoperative recovery [28] and for predicting re-admission risk [29]. In this study, we aimed to examine whether smartphone and wearable sensors can be useful in detecting overall patient-reported symptom burden as well as 3 specific physical symptoms (fatigue, pain, and diarrhea) among patients undergoing pancreatic cancer surgery, a complex but potentially curative procedure with postoperative morbidity rates as high as 40% [30].

Participants
Potential study participants were identified for the study by their surgical oncology care team. Men and women aged 18 years or older who were scheduled for pancreatic surgery at a large academic cancer center were eligible and were enrolled at their preoperative clinic visit. Of 72 eligible and approached patients, 60 consented to participate in this study. Surgery was canceled for 4 patients, and 2 withdrew from the study prior to surgery due to poor health or feeling overwhelmed. An additional 10 had insufficient sensor data for analyses based on data cleaning thresholds (described in detail later), leaving 44 participants in our analytic sample (mean age 65.7 years, range 40-82; 41% [18/44]

Study Procedure
Study assessments began prior to surgery and continued during inpatient recovery after surgery (mean 7-day stay, range 2-22) and for 60 days after postoperative discharge. A total of 13/44 patients (30%) were re-admitted to the hospital at some point during the 60 days. At their preoperative visit, participants were provided with an Android smartphone with the AWARE app installed [31]. AWARE was used to passively collect smartphone sensor data, including movement and approximate location of the phone, device use, metadata about call and SMS events, and ambient light and noise levels. AWARE was also used to collect patient-reported symptom ratings each morning; participants rated the severity of 10 physical and psychological symptoms (pain, fatigue, sleep disturbance, trouble concentrating/remembering things, feeling sad or down, feeling anxious or worried, shortness of breath, numbness or tingling, nausea, diarrhea or constipation) on a scale from 0 (not present) to 10 (as bad as you can imagine). These symptoms were selected because they reflect common core symptoms during oncology treatment [32] and the symptom severity rating format was adapted from the MD Anderson Symptom Inventory [33]. AWARE stored this information on the device and transmitted deidentified data to a secure server over a secure network connection when the device was connected to Wi-Fi. Participants were asked to keep the phone charged and with them at all times and to use the phone for communication as much as possible.
Participants were also given a Fitbit Charge 2 device to wear for the duration of the study, which they were invited to keep after study completion. The Fitbit collected data about activity, sleep, and heart rate. The Fitbit Charge 2 has been shown to measure activity and sleep parameters with acceptable accuracy in older free-living adults [34].
After study completion, participants returned the mobile phones to the study team and received a compensation of US $150. The University of Pittsburgh institutional review board approved all study procedures.

Patient-Reported Symptoms
To compute daily symptom burden scores, we summed all 10 symptom ratings to create a composite reflecting total daily symptom burden (mean 15, range 0-97). We then calculated the mean daily symptom burden for each individual patient and then subtracted individual means from each of that patient's daily symptom burden scores and categorized the resulting residual into average or below average (residual of daily score -individual mean ≤ 0) or high (residual of daily scoreindividual mean > 0). This approach allowed us to classify each day as a high or low symptom burden day, adjusting for each individual's typical level of reported symptoms. Approximately 35.99% (487/1353) of all days were classified as high symptom days (proportion of high symptom days for individual patients ranged from ranged from 0% [0/11] to 80% [8/10]). As the data set was imbalanced, we used the support vector machine synthetic minority over-sampling technique (SVM SMOTE) to resample the minority class. We also examined 3 specific physical symptoms (pain, fatigue, and diarrhea because these were the most common in our sample) using a similar approach.

Passive Smartphone and Wearable Sensor Data
We computed day-level (24 hours from midnight to midnight) behavioral features from both AWARE and Fitbit data using our Reproducible Analysis Pipeline for Data Streams (RAPIDS) [35]. Accelerometer, activity recognition, application, battery, call, conversation, light, location, SMS text message, and screen features were extracted from AWARE data. Heart rate, step, and sleep features were extracted from Fitbit data. For sleep, features were extracted for any sleep episodes that ended on that day to capture both overnight main sleep and naps. In total, we extracted 213 features from smartphone and Fitbit data; feature descriptions can be found in RAPIDS documentation [35,36]. We also included 3 additional features judged to be important for symptom prediction: (1) days since surgery, because symptoms tended to considerably increase immediately after surgery and then decline over time; (2) most recent symptom burden score, given that high symptom burden scores today tended to predict high symptom burden tomorrow; and (3) participant's average symptom burden score up to current time point, given the substantial between-participant variability in the range of symptom severities reported. Because symptom ratings were completed each morning, sensor data were used to predict the next day's symptom burden class.
We dropped sensor and symptom data from the date of surgery (as devices were with caregivers while patients were in the operating room) and from days that the patient was hospitalized (both after surgery and during any subsequent re-admissions, as we anticipated behavioral patterns to differ systematically in the hospital and we are most interested in detecting symptoms when patients are not in a health care setting).
To clean data, we first excluded days with less than 20 hours of sensor data and participants with fewer than 5 days of sensor data. We then dropped features missing more than 30% of values (days) or with 0 variance as well as days missing more than 30% of values (features). We merged sensor data with high/low symptom labels, then again filtered out participants with less than 5 days of valid labeled sensor feature data. After data cleaning, we had 1353 (mean 30.75, range 5-67 per patient) days of sensor data including 142 features from 44 patients.
On average, participants were missing 7.25% of data values (range 0%-19.08%). For each participant, we imputed continuous missing data as follows: (1) missing features in the training set (ie, subset of data used to train the model) were replaced with the average of the 2 closest days; (2) missing features in the test set (ie, subset of data used to evaluate model performance) were replaced with the last valid day's feature from the training set; and (3) if a participant is missing a specific feature, replace it with the average from the rest of the participants' data. We imputed categorical missing data as follows: (1) missing features were replaced with the mode of that participant's training data; (2) if a participant is missing a specific feature, replace it with the mode of the remaining participants' training data.
Categorical features were converted into integer representation via one-hot encoding. Because the scale of features will not influence the results of tree-based algorithms (eg, light gradient boosting machine [LightGBM]), we normalized numerical features with either min-max, z-score, or scikit-learn package's robust scaler for the rest of the models. A total of 75 features were selected via mutual information.
We evaluated a number of different binary classifiers, including logistic regression, k-nearest neighbors, support vector machine, random forest, gradient boosting, extreme gradient boosting, and LightGBM. Model performance (ability of the model to generate predicted binary class labels [0 vs 1] that match true class labels) was compared with several baselines: majority class, random weighted classifier, and decision tree using days since surgery, most recent score, and average score (ie, the 3 nonsensor features used in our models). We used nested cross-validation. Three-fold cross-validation was considered for the inner loop to tune hyperparameters and leave-one-day-out cross-validation was considered for the outer loop to evaluate performance and calculate accuracy, precision, recall, F1, and area under the receiver operating characteristic curve (AUC) across all folds. Because our ultimate goal is real-time clinical implementation of these algorithms, we trained models only on past data from that participant as well as data from other participants (ie, data collected after the test day were not included in the training set for that fold). The code for feature extraction and analysis is available online [37].

Results
Models using LightGBM performed best for the population model. We used 0 as the random seed, 200 as the number of boosted trees, and 128 as the maximum tree leaves. The learning rate was chosen from {0.008, 0.01, 0.012} and the subsample ratio of columns when constructing each tree was chosen from {0.68, 0.7, 0.72}. Using this approach, models using smartphone and wearable feature data were able to correctly predict whether the next day would be a high symptom day with 73.5% accuracy (0.611 recall for the high symptom class and 0.772 AUC). This model surpassed the accuracy and performance of all 3 baseline models (Table 1).  The most important features included the most recent symptom burden score, days since surgery, average symptom burden score, duration of active and exertional activity bouts, minimum heart rate, number of unique activities, time spent at the most frequent location, maximum ambient lux, total duration of time awake and asleep, and total duration of the heart rate in cardio zone (70%-84% of the participant's maximum heart rate) and peak zone (85%-100% of the participant's maximum heart rate; Figure 1). In this plot, features with many instances in red with SHAP (SHapley Additive exPlanations) [38] value greater than 0 had a positive relationship with symptom burden (eg, longer median duration of nonexertional episodes related to high symptom burden), whereas those in blue had an inverse association (eg, shorter total duration of active bouts related to high symptom burden).
We also generated population models for diarrhea, fatigue, and pain, respectively. All steps are the same as above except for the target values. Instead of calculating the labels based on the summation of all 10 symptom ratings, diarrhea score or fatigue score or pain score is applied directly.
Like the overall symptom burden results, LightGBM models outperformed all 3 baseline models and predicted next-day diarrhea with 79.0% accuracy (AUC 83.41%), next-day fatigue with 75.8% accuracy (AUC 80.29%), and next-day pain with 79.6% accuracy (AUC 83.48%; Table 2). Location features are very important for diarrhea prediction, while step features and sleep features are very important for fatigue prediction and pain prediction, respectively. The most recent symptom burden score, days since surgery, and average symptom burden score are the most important features for all symptoms.  Table 2. Performance of population models classifying next-day diarrhea or fatigue or pain symptom class (1=higher than average) from wearable and smartphone sensors.

Discussion
The purpose of this prospective longitudinal study was to evaluate passive smartphone and wearable sensor features as predictors of symptom burden in perioperative patients undergoing pancreatic surgery. Results suggest that machine learning models developed using mobile sensor data were more accurate than non-sensor-based baseline models in predicting whether the next-day patient-reported overall symptom burden would be higher than average for that patient. The most important features for symptom prediction included features related to physical activity, heart rate, and location. Models also accurately predicted next-day diarrhea, fatigue, and pain, although the most important features in each model differed across specific symptoms.
This work contributes to a small but growing literature investigating associations between consumer mobile sensors and clinical outcomes in oncology [27]. Similar to studies of patients undergoing chemotherapy [39] and hematopoietic cell transplant [40], features related to physical activity were most strongly related to fluctuations in physical symptom severity. Feature importance revealed that these were not simple features such as daily step counts but rather features reflecting patterns of activity and included measurements from both wearable Fitbit devices (eg, number, total duration, and maximum duration of active bouts) and smartphones (eg, duration of nonexertional episodes from phone accelerometer, number of unique activities recognized). Heart rate and sleep features were also important, suggesting that future work in this area should consider using wearable devices that enable collection of 24-hour behavioral and physiological data and examination of circadian rest-activity rhythms previously linked to outcomes in patients with cancer [41].
Because wearable and smartphone sensor data can be collected continuously as patients go about their daily lives, requiring minimal effort or attention from patients or their caregivers, mobile sensing offers an opportunity for long-term remote patient monitoring over months or years of cancer treatment and survivorship. This study supports the feasibility of collecting mobile sensor data, even from patients who are seriously ill during times of acute sickness and recovery. Despite undergoing invasive surgery and (for most patients) grappling with one of the deadliest cancer diagnoses, over 80% of participants had sufficient sensor data for analyses. This is also noteworthy given that the average age of patients was over 65 and that, as these data were collected in 2017, participants varied considerably in their comfort and familiarity with mobile technology.
Although models trained on past mobile sensor data outperformed baseline models, model performance still may not be adequate for clinical implementation. For example, recall of the high overall symptom burden class (when timely clinical action would be needed) was only 61%, meaning nearly 40% of high symptom days would be missed by our model. This may be due in part to the relatively small sample and data set, the use of study-provided (rather than personal) smartphones, or the powerful effect of major abdominal surgery and prolonged hospitalization on patient symptom profiles as well as behavior. Future studies with larger samples that collect data using their own personal devices over a period with less dramatic shifts in symptoms and behavior may yield better model performance. In future studies with larger data sets more robust to class imbalance, setting a higher threshold for severe symptoms requiring care provider attention or intervention may also result in more clinically useful models. Regardless, mobile sensor data may be a useful complement to patient-reported symptom data, allowing for a more personalized and adaptive delivery of symptom ratings when behavioral fluctuations are detected, reducing patient burden and improving early capture of worsening side effects and symptoms. Predictive models based on sensor and patient-reported data could also be used to deliver symptom self-management instructions to patients, an approach demonstrated to benefit patients undergoing pancreatic cancer surgery [42].
Given the small data set, we focused on building population models that used data from all other participants, which also may have constrained model performance. Because each participant had on average only 30 rows of data, individual models were unstable, but with more training data could be useful in learning patterns based on each participant's behavior and its relationship to symptoms and developing more accurate predictions. Developing models based on similar subgroups of participants (based on demographic, clinical, or behavioral factors) could be a useful approach for future work and could yield superior results to a single population model.
Strengths of the study include longitudinal sensor data collection over a wide perioperative window, from presurgery to 60 days after discharge following pancreatic surgery. We considered a wide range of features from both wearable and smartphone sensors and examined prediction of next-day overall symptom burden as well as next-day pain, fatigue, and diarrhea specifically. Our models were also trained on past data only so that we could evaluate how well models could perform if implemented in real-world clinical settings.
This study suggests that digital biomarkers may be useful in predicting patient-reported symptom burden during cancer treatment. In an ongoing study, we are following up on this work by collecting 3 months of smartphone and wearable sensor data as well as daily symptom reports from a large sample of patients undergoing outpatient chemotherapy. With a larger outpatient sample using their own smartphones, we hope to improve upon the models developed here and to use real-time next-day symptom predictions to deliver more timely and personalized symptom management support.