This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Cancer, is properly cited. The complete bibliographic information, a link to the original publication on https://cancer.jmir.org/, as well as this copyright and license information must be included.
To assess the impact of COVID-19 on cancer survivors, we fielded a survey promoted via email and social media in winter 2020. Examination of the data showed suspicious patterns that warranted serious review.
The aim of this paper is to review the methods used to identify and prevent fraudulent survey responses.
As precautions, we included a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA), a hidden question, and instructions for respondents to type a specific word. To identify likely fraudulent data, we defined a priori indicators that warranted elimination or suspicion. If a survey contained two or more suspicious indicators, the survey was eliminated. We examined differences between the retained and eliminated data sets.
Of the total responses (N=1977), 1408 (71.2%) were dropped and 569 (28.8%) were retained after data quality checking. Comparisons of the two data sets showed statistically significant differences across almost all demographic characteristics.
Numerous precautions beyond the inclusion of a CAPTCHA are needed when fielding web-based surveys, particularly if a financial incentive is offered.
The COVID-19 pandemic resulted in significant delays to health care administration. To assess the impact of the pandemic on cancer survivors in the United States, the study team fielded a survey in the winter of 2020. The survey was promoted via email and, briefly, via social media. The volume of responses received in a short time period suggested that the data should be reviewed for fraudulent responses.
Social media can be an efficient way to disseminate web-based surveys [
We recruited cancer survivors primarily via an email request sent to physician liaisons and cancer registrars at institutions accredited by the Commission on Cancer (CoC). The study invitation, which came directly from the CoC, asked recipients to forward the invitation to their cancer center survivorship coordinator, who in turn was asked to forward the invitation to patients. Emails were sent on October 13, 2020, followed by two reminders, each 1 week apart. In addition, the study team disseminated the survey to community partners on October 8, 2020; posted on the Association of Community Cancer Centers eXchange and Association of Oncology Social Work listservs; and included the survey link in a George Washington University newsletter to health care professionals.
Participants were asked to complete a 20-minute survey and were told they would receive a US $25 gift card to thank them for their time.
To dissuade bots, we included a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA), a question asking how the participant heard about the survey, time stamps, open-ended questions, and pairs of items that could be compared for consistency. After receiving more than 1000 responses in the first 3 days the survey was open, we examined the data and identified suspicious patterns. We then removed all links from social media and added further precautions based on extant literature about optimizing valid responses for public-access surveys [
Our survey questions included demographics and health history: age, sex, and gender identity; sexual orientation; race/ethnicity; marital status; household size; education; income; age at diagnosis; cancer stage; cancer type; employment status; and insurance type. We also included questions related to COVID-19 and patient-reported outcomes.
Data were exported from Research Electronic Data Capture (REDCap) and analyzed in SAS 9.4 (SAS Institute). As of Thursday, December 3, 2020, we had received 1977 responses. Because the early responses had shown suspicious patterns, we developed criteria to identify suspicious and fraudulent data.
We began by eliminating those who were ineligible: respondents who were living outside of the United States, had stage 0 cancer, had no cancer diagnosis (n=83), or reported that they had only nonmelanoma skin cancer (n=46) [
We analyzed irregularities in the remaining data (n=1650) and eliminated responses that contained two or more suspicious indicators (
We sent emails to all respondents excluded from the final data set to alert them that their responses had not passed a quality check, and we welcomed them to reach out to the study team with any questions. We received only 1 response, which said: “Why.” We also emailed all of the respondents who were retained in the data set and instructed them on how to claim their incentive. We received 1 response from a person who did not recall participating in the study. As additional quality control, we reviewed a subset of data for respondents who indicated hearing about the survey from a specific community partner. Of the 35 respondents who indicated hearing about the survey from this partner, we excluded 30. Upon member checking, all 5 participants retained in the data set were confirmed as clients of the community partner, and only 1 of the excluded respondents was a legitimate client.
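The member-checking result can be read as a small validation of the exclusion rule. The following Python sketch tabulates it. It assumes, for illustration only, that excluded respondents who were not confirmed as clients were in fact fraudulent; the partner did not confirm this, and all variable names are hypothetical.

```python
# Member-check counts reported for one community partner.
total_from_partner = 35
excluded = 30
retained = total_from_partner - excluded           # 5 responses kept
retained_confirmed_legit = 5                        # all retained were confirmed clients
excluded_confirmed_legit = 1                        # one valid response was dropped

# Assumption for illustration: unconfirmed excluded responses were fraudulent.
excluded_presumed_fraud = excluded - excluded_confirmed_legit
legit_total = retained_confirmed_legit + excluded_confirmed_legit

# Share of exclusions that removed a presumably fraudulent response.
exclusion_precision = excluded_presumed_fraud / excluded
# Share of legitimate clients who survived the quality check.
legit_retention = retained_confirmed_legit / legit_total

print(f"Exclusion precision: {exclusion_precision:.1%}")
print(f"Legitimate responses retained: {legit_retention:.1%}")
```

Under that assumption, 29 of 30 exclusions in this subset were correct, and 5 of 6 legitimate responses were kept.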
Types of fraudulent or suspicious data identified in eliminated survey responses (n=1081).a
Description | Value, n (%) |
Year of birth is reported as 2020, or reported age and age calculated from reported date of birth are different by more than 1 year | 250 (17.8) |
Reported age is <40 years and cancer type is rare for those aged <40 years | 283 (20.1) |
Respondents indicate a survey source prior to dissemination of the survey from that source | 820 (58.2) |
Open-ended comments focus on information technology rather than answering the question asked | 56 (4) |
Open-ended telehealth comments are duplicates | 34 (2.4) |
Final open-ended suggestion responses are duplicates | 107 (7.6) |
Email addresses are duplicates | 20 (1.4) |
Time since diagnosis is <2 years, but time since treatment is 2-5 years | 11 (0.8) |
Time since diagnosis is ≤5 years, but time since treatment is >5 years | 57 (4) |
Suspicious survey time (at least 10 surveys completed in succession within 5 minutes of each other or completed between midnight and 4 AM EST) | 986 (70) |
Email/address is suspicious (for email: at least 10 random numbers or letters in a row, or strange punctuation or capitalization; for address: incomplete address, address of a business, address is not real, address includes quotation marks, or pattern of strange capitalization or spacing) | 166 (11.8) |
Name/suffix is suspicious (first and last name flipped, part of last name in first name field or vice versa, male suffix and female name, random letters or numbers in suffix field) | 78 (5.5) |
aIndividuals could be counted in as many indicators as their responses suggested; thus, the n values do not add up to the total of excluded data.
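The elimination rule (two or more suspicious indicators) can be sketched in Python as follows. This is a minimal illustration covering four of the indicators in the table, not the SAS code used in the study. The field names, channel keys, and sample records are hypothetical; only the two launch dates come from the text.

```python
from collections import Counter
from datetime import date

# Launch dates named in the text; the channel keys are hypothetical labels.
CHANNEL_LAUNCH = {
    "coc_email": date(2020, 10, 13),
    "community_partner": date(2020, 10, 8),
}
# Treatment cannot predate diagnosis, so these category pairs are inconsistent.
BAD_TIMELINES = {("<2", "2-5"), ("<2", ">5"), ("2-5", ">5")}

def age_on(dob, when):
    """Age in whole years on a given date."""
    return when.year - dob.year - ((when.month, when.day) < (dob.month, dob.day))

def suspicious_indicators(resp, email_counts):
    """Return the names of the indicators that one response triggers."""
    flags = []
    dob, submitted = resp["date_of_birth"], resp["submitted"]
    if dob.year == 2020 or abs(age_on(dob, submitted) - resp["reported_age"]) > 1:
        flags.append("age_mismatch")
    launch = CHANNEL_LAUNCH.get(resp["source"])
    if launch is not None and submitted < launch:
        flags.append("source_before_dissemination")
    if email_counts[resp["email"].strip().lower()] > 1:
        flags.append("duplicate_email")
    if (resp["years_since_dx"], resp["years_since_tx"]) in BAD_TIMELINES:
        flags.append("timeline_inconsistent")
    return flags

def screen(responses, threshold=2):
    """Split responses into retained and eliminated lists of (id, flags)."""
    email_counts = Counter(r["email"].strip().lower() for r in responses)
    retained, eliminated = [], []
    for r in responses:
        flags = suspicious_indicators(r, email_counts)
        (eliminated if len(flags) >= threshold else retained).append((r["id"], flags))
    return retained, eliminated

# Two made-up records: one plausible, one triggering three indicators.
responses = [
    {"id": 1, "submitted": date(2020, 10, 20), "date_of_birth": date(1960, 5, 1),
     "reported_age": 60, "source": "coc_email", "email": "a@example.org",
     "years_since_dx": "2-5", "years_since_tx": "<2"},
    {"id": 2, "submitted": date(2020, 10, 10), "date_of_birth": date(2020, 1, 1),
     "reported_age": 35, "source": "coc_email", "email": "b@example.org",
     "years_since_dx": "<2", "years_since_tx": ">5"},
]
retained, eliminated = screen(responses)
print("retained:", retained)
print("eliminated:", eliminated)
```

Keeping the list of triggered indicators alongside each decision makes it possible to report per-indicator counts, in which one response can appear under several indicators.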
Flow chart of survey response elimination. REDCap: Research Electronic Data Capture.
After eliminating responses deemed as fraudulent, we used means and frequencies to create a demographics table comparing respondents who were included with those who were excluded. We used chi-square or Fisher exact tests to examine differences between groups.
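As an illustration of the group comparison, the sketch below computes a Pearson chi-square test for a single 2x2 table using only the Python standard library. It is not the SAS procedure used in the study. It omits the continuity correction, does not cover the Fisher exact test (which is preferable when expected cell counts are small), and assumes no missing data, so that the full group sizes serve as denominators.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Example cell counts: respondents selecting "Black" in the retained sample
# (83 of 569) and the excluded sample (200 of 1081), from the comparison table.
chi2, p = chi_square_2x2(83, 569 - 83, 200, 1081 - 200)
print(f"chi2 = {chi2:.2f}, P = {p:.3f}")
```

For these counts, the result is close to the P value of .045 reported for that row.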
This study was deemed exempt by the George Washington University Institutional Review Board (IRB) (NCR202819).
Of the total sample (N=1977), 1408 responses were excluded (327 due to ineligibility and 1081 due to suspicious responses) and 569 were retained. Most surveys eliminated were dated October 9-11, 2020 (n=1072). These dates align with the period when the survey link was posted on social media.
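Clustering of submissions in time is the kind of pattern that the timing indicator is meant to catch (at least 10 surveys completed in succession within 5 minutes of each other, or completion between midnight and 4 AM EST). A minimal Python sketch of such a check follows. It interprets "in succession" as a run of consecutive submissions, each no more than 5 minutes after the previous one, and assumes that time stamps are already in Eastern time; both are assumptions, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

def flag_suspicious_times(timestamps, run_length=10, gap=timedelta(minutes=5)):
    """Return the set of indices of submissions with suspicious timing."""
    flagged = set()

    # Overnight completions: midnight up to, but not including, 4 AM.
    for i, ts in enumerate(timestamps):
        if ts.hour < 4:
            flagged.add(i)

    # Runs of consecutive submissions separated by no more than `gap`.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    run = [order[0]] if order else []
    for prev, cur in zip(order, order[1:]):
        if timestamps[cur] - timestamps[prev] <= gap:
            run.append(cur)
        else:
            if len(run) >= run_length:
                flagged.update(run)
            run = [cur]
    if len(run) >= run_length:
        flagged.update(run)
    return flagged

# Made-up data: 12 submissions 2 minutes apart, plus two isolated submissions.
start = datetime(2020, 10, 9, 14, 0)
burst = [start + timedelta(minutes=2 * k) for k in range(12)]
isolated = [datetime(2020, 10, 15, 10, 30), datetime(2020, 10, 16, 2, 15)]
flags = flag_suspicious_times(burst + isolated)
print(sorted(flags))
```

In this example, the 12 burst submissions and the 2:15 AM submission are flagged, whereas the isolated mid-morning submission is not.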
Comparisons of retained and excluded respondents showed statistically significant differences across most demographic characteristics (
Differences between respondents in the retained and excluded samples.
Characteristic | Retained sample (n=569) | Excluded sample (n=1081) | P value |
Current age (years), mean (SD) | 55.9 (13.1) | 41.4 (8.2) | <.001 |
Gender identity, n (%) | | | |
Cisgender male | 132 (23.2) | 575 (53.2) | <.001 |
Transgender male, transgender female, gender fluid, or two-spirit | 1 (0.2) | 32 (3.0) | <.001 |
Cisgender female | 399 (70.1) | 463 (42.8) | <.001 |
Other/prefer not to answer/do not understand the question | 40 (7.0) | 14 (1.3) | <.001 |
Sexual orientation, n (%) | | | <.001 |
Straight | 532 (93.5) | 984 (91.0) | |
Lesbian, gay, homosexual, bisexual/pansexual, queer, two-spirit | 23 (4.0) | 89 (8.2) | |
Other/prefer not to answer/do not understand the question | 14 (2.5) | 8 (0.7) | |
Race/ethnicity, n (%) | | | |
Asian | 19 (3.3) | 58 (5.4) | .06 |
Black | 83 (14.6) | 200 (18.5) | .045 |
Hispanic/Latinx | 42 (7.4) | 90 (8.3) | .50 |
Native American/Alaska Native/Pacific Islander | 17 (3.0) | 83 (7.7) | <.001 |
White | 411 (72.2) | 677 (62.6) | <.001 |
Marital status, n (%) | | | <.001 |
Single | 93 (16.3) | 152 (14.1) | |
Married/partnered | 388 (68.2) | 884 (81.8) | |
Divorced/separated | 60 (10.5) | 37 (3.4) | |
Widowed | 28 (4.9) | 8 (0.7) | |
Number of individuals in household, mean (SD) | 2.6 (1.3) | 3.3 (0.9) | <.001 |
Education, n (%) | | | <.001 |
Some high school or less | 17 (3.0) | 38 (3.5) | |
High school diploma or GEDc/vocational school | 88 (15.8) | 294 (27.2) | |
Some college | 164 (28.8) | 415 (38.4) | |
Completed 4-year degree | 156 (27.4) | 261 (24.1) | |
Graduate school | 144 (25.3) | 73 (6.8) | |
Income (US $), n (%) | | | <.001 |
<25,000 | 59 (10.4) | 46 (4.3) | |
25,001-50,000 | 106 (18.6) | 383 (35.4) | |
50,001-75,000 | 124 (21.7) | 375 (34.7) | |
75,001-100,000 | 61 (10.7) | 182 (16.8) | |
>100,000 | 129 (22.7) | 93 (8.6) | |
I prefer not to answer | 90 (15.8) | 1 (0.09) | |
Age at cancer diagnosis (years), mean (SD) | 51.4 (13.4) | 36.8 (8.6) | <.001 |
Cancer stage, n (%) | | | <.001 |
I | 172 (30.2) | 456 (42.2) | |
II | 167 (29.4) | 367 (34.0) | |
III | 88 (15.5) | 177 (16.4) | |
IV | 62 (10.9) | 51 (4.7) | |
Unknown | 66 (11.6) | 24 (2.2) | |
Cancer type, n (%) | | | |
Melanoma | 26 (4.6) | 57 (5.3) | .53 |
Lung | 23 (4) | 199 (18.4) | <.001 |
Prostate | 37 (6.5) | 90 (8.3) | .19 |
Breast | 328 (57.6) | 161 (14.9) | <.001 |
Colorectal | 39 (6.9) | 117 (10.8) | .008 |
Kidney | 8 (1.4) | 63 (5.8) | <.001 |
Bladder | 8 (1.4) | 83 (7.7) | <.001 |
Blood cancer (leukemia, lymphoma, myeloma) | 44 (7.7) | 82 (7.6) | .92 |
Uterine/cervical | 32 (5.6) | 160 (14.8) | <.001 |
Thyroid | 31 (5.5) | 91 (8.4) | .03 |
Other | 62 (10.9) | 13 (1.2) | <.001 |
 | | | <.001 |
<2 | 238 (43.4) | 476 (44.1) | |
2-5 | 168 (30.7) | 488 (45.2) | |
>5 | 142 (25.9) | 116 (10.7) | |
 | | | |
My cancer is in remission or no evidence of disease | 447 (78.6) | 612 (56.6) | <.001 |
I have chronic cancer | 77 (13.5) | 240 (22.2) | <.001 |
I am receiving palliative care | 30 (5.3) | 253 (23.4) | <.001 |
I am in hospice care | 0 (0) | 60 (5.6) | <.001 |
None of these apply to me | 42 (7.4) | 39 (3.6) | <.001 |
Part of a tribe or territory, n (%)b | 41 (7.2) | 397 (38.1) | <.001 |
Employment status, n (%) | | | |
Retired | 198 (34.8) | 48 (4.4) | <.001 |
Paid work (full- or part-time) | 251 (44.1) | 667 (61.7) | <.001 |
Unpaid work (homemaker, volunteer) | 44 (7.7) | 127 (11.8) | .01 |
Unemployed | 77 (13.5) | 247 (22.9) | <.001 |
Insurance type, n (%) | | | |
Private insurance | 320 (56.2) | 436 (40.3) | <.001 |
Medicaid | 83 (14.6) | 491 (45.4) | <.001 |
Medicare | 210 (36.9) | 633 (58.6) | <.001 |
Tricare/COBRAd/other | 48 (8.4) | 64 (5.9) | .054 |
I do not have health insurance | 31 (5.5) | 45 (4.2) | .24 |
Health status, n (%) | | | <.001 |
Excellent/very good | 165 (29.0) | 375 (34.7) | |
Good | 226 (39.7) | 318 (29.4) | |
Fair | 101 (17.8) | 254 (23.5) | |
Poor | 17 (3.0) | 133 (12.3) | |
aRespondents could select multiple responses for this question.
bResponses may not add up to n=569 or n=1081 due to missing data or multiple responses.
cGED: General Educational Development.
dCOBRA: Consolidated Omnibus Budget Reconciliation Act.
The samples also differed in cancer stage, type, health status, and insurance coverage status. The retained sample reported more stage IV cancer and a higher percentage of breast cancer than the excluded sample. The excluded sample reported more lung, kidney, bladder, and uterine/cervical cancers than the retained sample (
Several indications support the greater integrity of the data in the retained sample (n=569) compared with the excluded sample (n=1081). First, discordant data reported by the same respondent, such as the anatomical site of their cancer not being physically possible for their reported sex/gender, were clear signs of random survey completion. Second, the younger mean age of the excluded sample combined with cancers more likely to be diagnosed at a later age (eg, lung, kidney, and bladder cancers), more serious disease (chronic, receiving palliative care, or hospice), and poorer health is highly suspicious. Conversely, the higher proportion of self-reported breast cancer diagnoses in the retained sample aligns with the authors’ prior research experience in more easily recruiting breast cancer survivors than those with a history of other cancers.
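Cross-field checks of this kind are straightforward to automate. The Python sketch below shows two such checks. The rule tables are illustrative assumptions rather than the study's actual rules: the site and sex pairings are examples, the late-onset list uses only the cancers named above, and the check is keyed to reported sex rather than gender identity.

```python
# Illustrative rule tables; the study's exact rules are not listed in the text.
SEX_INCOMPATIBLE_SITES = {
    "male": {"uterine/cervical"},
    "female": {"prostate"},
}
LATE_ONSET_SITES = {"lung", "kidney", "bladder"}  # examples named in the text

def plausibility_flags(sex, age, cancer_types):
    """Return cross-field inconsistencies for one respondent."""
    sites = set(cancer_types)
    flags = []
    # Cancer site that is anatomically incompatible with reported sex.
    if SEX_INCOMPATIBLE_SITES.get(sex, set()) & sites:
        flags.append("site_incompatible_with_sex")
    # Cancer that is rare before age 40 reported by a respondent younger than 40.
    if age < 40 and LATE_ONSET_SITES & sites:
        flags.append("late_onset_cancer_under_40")
    return flags

print(plausibility_flags("female", 34, ["prostate", "lung"]))
print(plausibility_flags("female", 58, ["breast"]))
```

A flag from the second check marks a response as unusual rather than impossible, which is one reason to require more than one indicator before eliminating a response.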
This study contributes to the literature by providing guidance for identifying potentially fraudulent data. Importantly, use of screening questions and CAPTCHA was insufficient to dissuade fraudulent respondents. Consistent with past research, we found that examining repeated personal data across responses [
Social media is an efficient and cost-effective recruitment method for health research. However, these advantages must be weighed against the potential loss of data integrity [
To minimize bot contamination and reduce duplicate entries, precautions similar to those taken in this study are warranted. Additional recommendations include using software with fraud prevention and detection capabilities (eg, Qualtrics), capturing IP addresses, capturing time stamps for both start and stop times, including a required open text question, and distributing surveys only to closed groups on social media or avoiding social media altogether. If social media is used, financial incentives should be avoided. If financial incentives are provided, investigators should (1) downplay the incentive and require participants to check a box acknowledging that ineligible respondents and those who respond more than once will not receive it, and (2) state that investigators reserve the right to confirm eligibility by telephone (or other means) and include a required telephone number field.
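Several of these precautions can be enforced at submission time. The Python sketch below is a hypothetical illustration of three of them: a hidden question, an instructed word, and a required acknowledgment checkbox. The field names and the expected word are assumptions, not details of the study's REDCap instrument.

```python
EXPECTED_WORD = "purple"  # hypothetical word shown in the survey instructions

def passes_bot_screens(submission):
    """Apply three entry screens to one raw submission (a dict of form fields)."""
    # Hidden question: invisible to people, so any answer suggests automation.
    if submission.get("hidden_question", "").strip():
        return False
    # Typed-word instruction: tolerate case and surrounding whitespace.
    if submission.get("typed_word", "").strip().lower() != EXPECTED_WORD:
        return False
    # Required acknowledgment of the incentive eligibility terms.
    if not submission.get("incentive_acknowledged", False):
        return False
    return True

human = {"hidden_question": "", "typed_word": " Purple ", "incentive_acknowledged": True}
bot = {"hidden_question": "yes", "typed_word": "purple", "incentive_acknowledged": True}
print(passes_bot_screens(human), passes_bot_screens(bot))
```

Screens such as these only raise the cost of automated entry; they do not replace data integrity checks after collection.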
Once data are collected, data integrity checks such as those in
The criteria used to eliminate responses were subjective, and it is impossible to know if all fraudulent data were removed. The authors erred on the side of potentially eliminating valid responses rather than retaining responses that were likely to be invalid. Limitations in our ability to detect potentially fraudulent responses included the inability to capture IP addresses or completion times.
Providing a survey incentive in combination with social media recruitment may increase the likelihood of fraudulent activity. CAPTCHA alone is unlikely to prevent fraudulent responses in internet-based research promoted on social media. Precautions to prevent and detect fraud are important for the validity of research findings. Ethical considerations of participant privacy and incentive payments should be weighed against data integrity concerns to ensure valid, meaningful health research results.
CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart
CoC: Commission on Cancer
IRB: Institutional Review Board
REDCap: Research Electronic Data Capture
Funding for the study was provided by a Patient Centered Outcomes Research Institute Engagement Award (EADI-12744). Thank you to the Community Advisory Board for this project, including Katie Bathje, Benoit Blondeau, Cindy Cisneros, Ysabel Duron, Maureen Killackey, Larissa Nekhlyudov, Beth Sieloff, and Megan Slocum. Thank you to Ysabel Duron and Larissa Nekhlyudov for review and feedback on an earlier draft of this manuscript.
Conflicts of interest: None declared.