This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Cancer, is properly cited. The complete bibliographic information, a link to the original publication on https://cancer.jmir.org/, as well as this copyright and license information must be included.
A cancer diagnosis is a source of psychological and emotional stress, which is often sustained over long periods and may lead to depressive disorders. Depression is one of the most common psychological conditions in patients with cancer. According to the Global Cancer Observatory, breast and colorectal cancers are the most prevalent cancers in both sexes and across all age groups in Spain.
This study aimed to compare the prevalence of depression in patients before and after the diagnosis of breast or colorectal cancer, as well as to assess the usefulness of the analysis of free-text clinical notes in 2 languages (Spanish or Catalan) for detecting depression in combination with encoded diagnoses.
We analyzed the electronic health records from a general hospital, considering the different sources of clinical information related to depression in patients with breast and colorectal cancer. This analysis included ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) diagnosis codes and unstructured information extracted by mining free-text clinical notes with natural language processing tools, based on Systematized Nomenclature of Medicine Clinical Terms concepts, to find mentions of depression, its symptoms, and drugs used for its treatment.
The percentage of patients diagnosed with depressive disorders increased significantly after cancer diagnosis for both types of cancer considered (breast and colorectal). Mining free-text clinical notes identified more patients with depression than relying exclusively on ICD-9-CM codes; patients identified only through clinical notes accounted for 34.8% (441/1269) of all patients with depression. In addition, the number of patients with depression who received chemotherapy was higher than the number who did not receive this treatment, with significant differences (
This study provides new clinical evidence of the depression-cancer comorbidity and supports the use of natural language processing for extracting and analyzing free-text clinical notes from electronic health records, contributing to the identification of additional clinical data that complements those provided by coded data to improve the management of these patients.
Cancer continues to be one of the main causes of morbidity and mortality in the world, with approximately 19.3 million new cancer cases in 2020 [
A cancer diagnosis is life-changing; it is a source of considerable psychological and emotional stress that is usually sustained over long periods and may lead to depressive disorders [
For these reasons, it is critical to detect, diagnose, and treat depressive symptoms in patients with cancer. The information available in electronic health records (EHRs) can provide a complete clinical history of these patients, but its content must be fully exploited to make the most of these information systems [
In this study, we identified and analyzed the presence of depressive disorders in patients with the most common cancers in Spain—breast or colorectal cancer—using 2 different sources of clinical information: diagnosis codes in ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) and free-text clinical notes, including mentions of depression diagnoses, their symptoms, and antidepressants.
The aim of the study was twofold: (1) to compare the prevalence of depression in patients with breast or colorectal cancer before and after the cancer diagnosis and (2) to determine the usefulness of analyzing free-text clinical notes with natural language processing (NLP), in combination with encoded structured clinical information, for detecting depression among patients with cancer.
The clinical database used for the study was the EHR of the Parc de Salut Mar Barcelona, a complete health care services organization with its information system database (IMASIS). IMASIS includes the clinical information of 2 general hospitals, 1 mental health care center, and 1 social health care center in the Barcelona city area (Catalonia, Spain) since 1990, including different settings such as admissions, outpatient consultations, and emergency department visits [
The diagnoses included in IMASIS-2 are encoded using the ICD-9-CM codification [
The Hospital del Mar Cancer Registry, which included 37,741 diagnosed malignant tumors, was also used as an additional source of information, providing data on the number of cases, characteristics, diagnostic and therapeutic process, and survival of patients with cancer at Parc de Salut Mar Barcelona [
The initial group of patients considered in our study consisted of the 10,668 individuals who were diagnosed with breast cancer (in women; ICD-9-CM–related code 174) or colorectal cancer (ICD-9-CM–related codes 153 and 154). In the Cancer Registry, these patients were classified by stage (in situ, I, II, III, or IV) and by the type of treatment received, including chemotherapy. Of the 10,668 patients, 2485 were excluded because they had more than 1 cancer or incomplete clinical information, with 8147 patients remaining. Of these 8147 patients, we selected 4238 individuals for the study who had (1) at least 4 visits recorded in IMASIS-2, including 2 before and 2 after the cancer diagnosis; (2) breast or colorectal cancer in the “in situ” stage or in stage I, II, or III; and (3) complete information about the treatments received for cancer. Patients in stage IV were not included because they were in an advanced stage of cancer and usually received palliative care or experienced depression [
Flow diagram of the study process.
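The three inclusion criteria described above can be expressed as a simple filter. The following is a minimal sketch in Python, not the code used in the study: the `Patient` record, its field names, and the `is_eligible` function are hypothetical, and visits recorded on the diagnosis date itself are assumed to count neither as "before" nor as "after".

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Stages retained in the study; stage IV is excluded, as stated in the text.
ELIGIBLE_STAGES = {"in situ", "I", "II", "III"}


@dataclass
class Patient:
    """Hypothetical record combining EHR visits and Cancer Registry fields."""
    patient_id: str
    cancer_diagnosis: date
    stage: str
    visit_dates: List[date]
    treatment_info_complete: bool


def is_eligible(p: Patient, min_before: int = 2, min_after: int = 2) -> bool:
    """Apply the three inclusion criteria to one patient."""
    # Assumption: a visit on the diagnosis date counts on neither side.
    before = sum(1 for d in p.visit_dates if d < p.cancer_diagnosis)
    after = sum(1 for d in p.visit_dates if d > p.cancer_diagnosis)
    return (
        before >= min_before
        and after >= min_after
        and p.stage in ELIGIBLE_STAGES
        and p.treatment_info_complete
    )
```

Requiring 2 visits on each side of the diagnosis date automatically satisfies the overall minimum of 4 visits, so the sketch does not check the total separately.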
To get thorough information describing the occurrence of depressive disorders among patients with breast and colorectal cancers, we used a combination of different sources of clinical information present in the EHR. The included sources are the occurrence of ICD-9-CM diagnosis codes registered and related to depressive disorders (
We analyzed the textual content of the 272,575 clinical notes from the visits of the 4238 patients with the considered cancers. The text of each clinical note was processed by means of the FreeLing [
The different text mining tools used and applied for the clinical annotations analysis.
Language identification: The FreeLing language analyzer determined, for each clinical note, the language used (Spanish or Catalan). All subsequent NLP analyses performed were language-specific.
Tokenization and part-of-speech tagging: The text of each clinical note was divided into tokens (substrings with assigned and identified meaning), and the part of speech of each token was identified (determiner, preposition, conjunction, punctuation, verb, adjective, pronoun, adverb, and noun).
Terms detection: In the text of each clinical note, mentions of the following types of terms were identified: (1) names of the active substances of the 35 antidepressants and their corresponding 82 brand names used in Spain; and (2) SNOMED CT terms related to depressive disorders, including the lexicalizations of the 139 concepts classified under the concept “trastorno depresivo (trastorno)” (Spanish for “depressive disorder [disorder]”; SNOMED CT ID 35489007). We searched for mentions of antidepressant active substances and their brand names over the whole textual content of the clinical notes. For this purpose, we used the Elasticsearch search and analytics tool [
Negation characterization: A negation detection algorithm tailored to Spanish and Catalan was applied to the clinical notes, for both SNOMED CT depressive disorder terms and antidepressant active substance and brand names, to exclude negated occurrences of these terms from our study. The algorithm was implemented as a token sequence tagger relying on Conditional Random Fields. For this purpose, a corpus of 949 sentences (572 in Spanish and 277 in Catalan) extracted from clinical notes was manually annotated, marking in each sentence the negation marker and the related negation span (ie, the portion of the sentence that is actually negated). This corpus was used to train a Conditional Random Fields sequence tagger that automatically identifies negation markers and related spans in the text of clinical notes in Spanish and Catalan.
When needed, the names of antidepressant active substances as well as the names of depressive disorders–related terms from SNOMED CT were manually translated into Spanish and Catalan by a bilingual psychologist, since the textual content of the clinical notes analyzed in our study includes both languages.
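The combination of term detection and negation filtering described above can be illustrated with a much simpler stand-in. The sketch below is not the study's pipeline: it replaces FreeLing, Elasticsearch, and the Conditional Random Fields tagger with a regular-expression lookup and a fixed-window negation cue rule. The lexicon is a tiny illustrative sample (the study used 35 active substances, 82 brand names, and the lexicalizations of 139 SNOMED CT concepts), and the negation cues, the window size, and the function name are assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple

# Illustrative sample only; the real term lists are far larger.
LEXICON: Dict[str, str] = {
    "citalopram": "antidepressant",
    "escitalopram": "antidepressant",
    "trastorno depresivo": "depressive_disorder",  # Spanish
    "trastorn depressiu": "depressive_disorder",   # Catalan
    "depresión": "depressive_disorder",            # Spanish
    "depressió": "depressive_disorder",            # Catalan
}

# Assumed negation cues (Spanish and Catalan) and look-back window in tokens.
NEGATION_CUES = {"no", "sin", "niega", "sense", "nega"}
WINDOW = 3

# Longest terms first so multiword terms win over their substrings.
_TERM_RE = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(LEXICON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_affirmed_terms(note: str) -> List[Tuple[str, str]]:
    """Return (term, category) pairs for mentions not preceded by a negation cue."""
    hits = []
    for m in _TERM_RE.finditer(note):
        # Only look back within the current sentence-like segment.
        segment_start = max(note.rfind(c, 0, m.start()) for c in ".;\n") + 1
        preceding = re.findall(r"\w+", note[segment_start:m.start()].lower())
        if NEGATION_CUES.isdisjoint(preceding[-WINDOW:]):
            term = m.group(1).lower()
            hits.append((term, LEXICON[term]))
    return hits
```

A fixed window is far cruder than a trained sequence tagger: it misses negation markers that follow the term and cannot determine the true extent of the negation span, which is the problem the annotated corpus and the trained tagger address.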
The study was approved by the Hospital del Mar Research Ethics Committee (Comitè Ètic d'Investigació Clínica del Parc de Salut Mar; 2016/7130/l) and performed according to the Declaration of Helsinki, the General Data Protection Regulation (EU 2016/679), and the Spanish Law (3/2018) for data protection. All data were anonymized and treated with maximal confidentiality and respect according to good clinical practice guidelines.
The number of patients with cancer included in our study was 4238. There were 2032 women with breast cancer with a mean age of 62.3 (SD 13.2) years, and there were 2206 patients with colorectal cancer with a mean age of 70.5 (SD 11.4) years, including 1277 (57.9%) men and 929 (42.1%) women with significant differences in the ages of both groups of patients with these cancers (
Distribution of age by the stages of breast and colorectal cancers. The median age is shown as a vertical line.
The total number of patients with depression based on the use of ICD-9-CM, antidepressants drug mentions, SNOMED CT concepts related to depressive disorders, or the combination of these 3 methods was 1269. The percentage of patients diagnosed with depressive disorders increased after cancer diagnosis, with significant differences across all the types of cancer considered (
The increase in the number of patients with depression was a trend that we found separately in the ICD-9-CM codes, the mentions of antidepressant drugs, and the mentions of the set of SNOMED CT depression concepts. In the tables below, we show the number of patients with depression before and after the diagnosis of cancer according to each of the 3 detection methods (ICD-9-CM depression codes, antidepressant drug mentions, and SNOMED CT concepts related to “trastorno depresivo”) and to the combination of the 3 methods.
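The before and after comparison rests on labeling each patient by when evidence of depression appears relative to the cancer diagnosis date, and then setting aside patients with evidence on both sides. A minimal sketch of that bookkeeping follows; the function names are hypothetical, and treating evidence recorded on the diagnosis date itself as "after" is an assumption, not something stated in the study.

```python
from collections import Counter
from datetime import date
from typing import Iterable


def classify_timing(cancer_dx: date, evidence_dates: Iterable[date]) -> str:
    """Label a patient as 'before', 'after', 'both', or 'none'."""
    dates = list(evidence_dates)
    before = any(d < cancer_dx for d in dates)
    # Assumption: evidence on the diagnosis date counts as "after".
    after = any(d >= cancer_dx for d in dates)
    if before and after:
        return "both"
    if before:
        return "before"
    if after:
        return "after"
    return "none"


def share_after(labels: Iterable[str]) -> float:
    """Percentage diagnosed after cancer, excluding 'both' and 'none' patients."""
    counts = Counter(labels)
    denominator = counts["before"] + counts["after"]
    return round(100 * counts["after"] / denominator, 1) if denominator else 0.0
```

With the ICD-9-CM counts reported in this section (89 patients before, 575 after, and 164 in both periods), `share_after` reproduces the reported 86.6% (575/664).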
Considering exclusively the ICD-9-CM codes of depressive disorders and excluding patients diagnosed with depression in visits both before and after the date of cancer diagnosis (n=164), of the 4074 remaining patients, 16.3% (n=664) were diagnosed with depression, and 86.6% (575/664) were diagnosed after the cancer diagnosis date (see
Considering the diagnosis of depression based on antidepressant drug mentions and excluding patients diagnosed with depression in visits both before and after the date of cancer diagnosis (n=68), of the 4170 remaining patients, 15% (n=624) were diagnosed with depression, and 91% (568/624) were diagnosed after the cancer diagnosis date (see
Of the 824 antidepressant mentions, the most frequent were citalopram (n=274, 33.3%), escitalopram (n=174, 21.1%), amitriptyline (n=125, 15.2%), trazodone (n=64, 7.8%), venlafaxine (n=57, 6.9%), paroxetine (n=37, 4.5%), desvenlafaxine (n=22, 2.7%), fluoxetine (n=22, 2.7%), and bupropion (n=21, 2.5%).
Considering the mentions of SNOMED CT depression concepts and excluding patients diagnosed with depression in visits both before and after the date of cancer diagnosis (n=20), of the 4218 remaining patients, 10% (n=426) were diagnosed with depression, and 89% (379/426) were diagnosed after the cancer diagnosis date: 222 (94.5%) out of 235 for breast cancer and 157 (82.2%) out of 191 for colorectal cancer (see
Distribution of patients according to the type of cancer, stage, and diagnosis of depression based on ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) codification.
Cancer type, cancer stage | Number of patients, n/N (%) | Depression (ICD-9-CM) after cancer diagnosis, n/N (%) |
Breast cancer | | |
In situ | 234/2032 (11.5) | 40/234 (17.1) |
Stage I | 739/2032 (36.4) | 152/739 (20.6) |
Stage II | 781/2032 (38.4) | 166/781 (21.3) |
Stage III | 278/2032 (13.7) | 82/278 (29.5) |
All stages | 2032/2032 (100) | 440/2032 (21.7) |
Colorectal cancer | | |
In situ | 544/2206 (24.7) | 48/544 (8.8) |
Stage I | 438/2206 (19.9) | 61/438 (13.9) |
Stage II | 656/2206 (29.7) | 94/656 (14.3) |
Stage III | 568/2206 (25.7) | 96/568 (16.9) |
All stages | 2206/2206 (100) | 299/2206 (13.6) |
Total | 4238/4238 (100) | 739/4238 (17.4) |
Number of patients characterized by ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) depression diagnosis codes before and after the cancer diagnosis date.
Cancer type | Before cancer diagnosis date, n/N (%) | After cancer diagnosis date, n/N (%) | Patients with depression, n/N (%) | Patients without depression, n/N (%) |
Breast | 39/398 (9.8) | 359/398 (90.2) | 398/1951 (20.4) | 1553/1951 (79.6) |
Colorectal | 50/266 (18.8) | 216/266 (81.2) | 266/2123 (12.5) | 1857/2123 (87.5) |
Total | 89/664 (13.4) | 575/664 (86.6) | 664/4074 (16.3) | 3410/4074 (83.7) |
Number of patients with antidepressant drug mentions before and after the cancer diagnosis date.
Cancer type | Before cancer diagnosis date, n/N (%) | After cancer diagnosis date, n/N (%) | Patients with depression, n/N (%) | Patients without depression, n/N (%) |
Breast | 27/352 (7.7) | 325/352 (92.3) | 352/2009 (17.5) | 1657/2009 (82.5) |
Colorectal | 29/272 (10.7) | 243/272 (89.3) | 272/2161 (12.6) | 1889/2161 (87.4) |
Total | 56/624 (9) | 568/624 (91) | 624/4170 (15) | 3546/4170 (85) |
Number of patients with mentions of SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms) concepts related to “trastorno depresivo” (depressive disorder in Spanish) before and after the cancer diagnosis date.
Cancer type | Before cancer diagnosis date, n/N (%) | After cancer diagnosis date, n/N (%) | Patients with depression, n/N (%) | Patients without depression, n/N (%) |
Breast | 13/235 (5.5) | 222/235 (94.5) | 235/2021 (11.6) | 1786/2021 (88.4) |
Colorectal | 34/191 (17.8) | 157/191 (82.2) | 191/2197 (8.7) | 2006/2197 (91.3) |
Total | 47/426 (11) | 379/426 (89) | 426/4218 (10) | 3792/4218 (90) |
When we considered the previous 3 selection criteria together (ICD-9 codes, drug mentions, and SNOMED CT concepts) to detect patients with a diagnosis of depression and excluded the patients with a depression diagnosis both before and after cancer diagnosis date (n=248), of a total of 1021 patients, 920 (90.1%) were diagnosed after the cancer diagnosis date—533 (92.5%) out of 576 for breast cancer and 387 (87%) out of 445 for colorectal cancer (see
Of the total 4238 individuals, we identified 1269 (30%) with 1 or more diagnoses of depression by analyzing their clinical histories (both ICD-9-CM codes and clinical notes, including the detection of drug mentions and SNOMED CT concepts). For 441 (34.8%) of these 1269 patients, depression was identified exclusively through text mining of the clinical notes (drug and SNOMED CT concept detection); these patients would not have been considered as having depression by relying on ICD-9-CM codes alone. Among patients with breast cancer, depression was identified exclusively through text mining in 30.6% (211/690) of the patients; this percentage was 39.7% (230/579) among patients with colorectal cancer. Consequently, the analysis of clinical notes detected a considerably larger number of patients with depression: in addition to the 828 (65.2%) of 1269 patients identified by ICD-9-CM codes, with or without drug or SNOMED CT concept mentions, a further 441 (34.8%) were identified only by text mining (see
Finally, we tried to determine if there was a relationship between the onset of depression and receiving chemotherapy. Of the 2032 patients with breast cancer, 907 (44.6%) received chemotherapy and 1125 (55.4%) did not. Of the 2206 patients with colorectal cancer, 564 (25.6%) received chemotherapy and 1642 (74.4%) did not. The number of patients with depression who received chemotherapy was higher than those who did not receive chemotherapy, with significant differences (
Number of patients with ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) codes of depressive disorders, a mention of antidepressant drugs, or a mention of one of the sets of 139 SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms) concepts subsumed by the concept “trastorno depresivo” (depressive disorder in Spanish), before and after the cancer diagnosis date.
Cancer type | ICD-9-CM codes or mentions of drugs and SNOMED CT concepts before cancer diagnosis date, n/N (%) | ICD-9-CM codes or mentions of drugs and SNOMED CT concepts after cancer diagnosis date, n/N (%) | ICD-9-CM codes or mentions of drugs and SNOMED CT concepts, n/N (%) | No ICD-9-CM codes or mentions of drugs and SNOMED CT concepts, n/N (%) |
Breast | 43/576 (7.5) | 533/576 (92.5) | 576/1918 (30) | 1342/1918 (70) |
Colorectal | 58/445 (13) | 387/445 (87) | 445/2072 (21.5) | 1627/2072 (78.5) |
Total | 101/1021 (9.9) | 920/1021 (90.1) | 1021/3990 (25.6) | 2969/3990 (74.4) |
Number of patients with ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) codes with or without mentions of drugs or SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms) concepts.
Cancer type | ICD-9-CM codes, with or without mentions of drugs or SNOMED CT concepts, n/N (%) | Mentions of drugs or SNOMED CT concepts without ICD-9-CM codes, n/N (%) |
Breast | 479/690 (69.4) | 211/690 (30.6) |
Colorectal | 349/579 (60.3) | 230/579 (39.7) |
Total | 828/1269 (65.2) | 441/1269 (34.8) |
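The split between patients identified through ICD-9-CM codes and patients identified only through clinical notes is plain set arithmetic over patient identifiers. The sketch below shows one way to compute it; the function name, the dictionary keys, and the use of integer identifiers are assumptions made for illustration.

```python
from typing import Dict, Set


def source_breakdown(
    icd_ids: Set[int], drug_ids: Set[int], snomed_ids: Set[int]
) -> Dict[str, float]:
    """Count patients per evidence source and the share found only in free text."""
    text_ids = drug_ids | snomed_ids   # any evidence from clinical notes
    any_ids = icd_ids | text_ids       # any evidence at all
    text_only = text_ids - icd_ids     # would be missed by coded data alone
    share = round(100 * len(text_only) / len(any_ids), 1) if any_ids else 0.0
    return {
        "any_source": len(any_ids),
        "icd_coded": len(icd_ids),
        "text_only": len(text_only),
        "text_only_share_pct": share,
    }
```

Applied to identifier sets matching the reported totals (828 patients with ICD-9-CM codes and 441 found only in clinical notes), the function returns 1269 patients overall and a text-only share of 34.8%.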
The detection of depressive disorders in patients with cancer is a key element in the management of these patients, which can impact the treatment outcomes of cancer [
The use of unstructured data for the identification of conditions such as depression, as well as other diseases and comorbidities [
The value of relying on these 2 types of clinical information—structured and unstructured—has been analyzed in other conditions such as geriatric syndrome [
This study has some limitations. If the main cause of a patient's admission is a complication of cancer, secondary diagnoses such as depression are often not included in the medical discharge report, and these diagnoses can therefore be underrecorded. In addition, specific words and expressions used by medical doctors to mention depression-related symptoms in clinical notes may not have been included among the terms used in this study. We based our analyses of clinical notes exclusively on the terminology encoded in SNOMED CT to capture mentions of depressive disorders; therefore, our terminology could underestimate the number of patients with depression. In this regard, free text can be further explored to identify other expressions and terms used by clinicians to describe depression symptoms [
This study demonstrated that using NLP to extract and process the unstructured clinical information present in free-text clinical notes in the EHR, in combination with encoded diagnoses, can contribute to the identification of relevant clinical data, in this case the detection of depressive disorders in patients with breast and colorectal cancers. The study shows the possibility of combining structured and unstructured data included in the EHR, providing new opportunities to better understand and manage complex diseases and their comorbidities, such as cancer and depression, to the benefit of these patients. In future work, we intend to extract information from the EHR using NLP in combination with machine learning methods and to apply prediction models to estimate different possible outcomes.
ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) diagnosis codes related to depressive disorders used in the study.
Names of the active substances of the 35 antidepressants and their corresponding 82 brand names used in Spain.
EHR: electronic health record
ICD-9-CM: International Classification of Diseases, Ninth Revision, Clinical Modification
NLP: natural language processing
SNOMED CT: Systematized Nomenclature of Medicine Clinical Terms
The Research Programme on Biomedical Informatics (GRIB) is a member of the Spanish National Bioinformatics Institute and funded by Instituto de Salud Carlos III and FEDER (PT17/0009/0014). The GRIB is also supported by the Agència de Gestió d’Ajuts Universitaris i de Recerca, Generalitat de Catalunya (2017 SGR 00519). This research was carried out under the framework of the project Creating medically-driven integrative bioinformatics applications focused on oncology, CNS disorders and their comorbidities (MedBioinformatics, H2020-EU; grant 634143); and partially funded by the Institute of Health Carlos III (project IMPaCT-Data; IMP/00019) and cofunded by the European Union, European Regional Development Fund (“A way to make Europe”); and the Clinical Knowledge Aggregation by Mining Medical Reports (CliKA-MinE; PI17/00230), which is funded by Institute of Health Carlos III and cofunded by the European Union.
The study involves the use of patients’ medical data from the Hospital del Mar according to the General Data Protection Regulation. The data is not publicly available due to the ethical regulations under which the data is collected from our hospital database.
The first draft was written jointly by AL, MAM, and FR. All the authors have read and agreed to the final version of the manuscript.
None declared.