Introduction
Environmental social determinants of health (SDOH), such as one’s living circumstances and housing stability, significantly impact a person’s overall health, and the lack of stable housing can lead to serious adverse effects [Reference Kushel, Gupta, Gee and Haas1]. Housing directly impacts a person’s access and means to medical care; lack of housing is related to a disproportionately higher reliance on emergency medical services and ambulance transports [Reference Abramson, Sanko and Eckstein2]. The most severe manifestation of housing instability, known as housing deprivation or homelessness, can reduce life expectancy by as much as 12 years and increase rates of illness or disability [3,Reference Kushel, Vittinghoff and Haas4]. Experiencing homelessness is linked to notably increased rates of hospital readmissions and extended hospital stays [Reference Khatana, Wadhera and Choi5].
Despite the importance of housing and its relevance to health, housing issues are underreported in electronic health records (EHRs) due to a lack of national standards, social stigma, and reliance on self-reporting [Reference Brown and Steinman6]. A previous study on housing found that diagnosis codes used for billing only identified 58.5% of the population experiencing housing instability or homelessness [Reference Harris, Anthony, Quesinberry and Delcher7]; the remaining population was only identifiable through clinical notes or address data [Reference Harris, Anthony, Quesinberry and Delcher7]. Clinical text combined with natural language processing (NLP) techniques may assist in identifying housing issues from unstructured data in EHRs [Reference Bejan, Angiolillo and Conway8–Reference Chapman, Jones and Kelley12]. We extend these housing-related techniques and findings as part of a national effort to capture housing-related concepts.
The Evolve to Next-Gen Accrual to Clinical Trials (ENACT) Network spans the Clinical and Translational Science Award (CTSA) consortium and connects CTSA sites with a single interface capable of querying over 142 million patients using the ENACT web-based query tool [Reference Morrato, Lennox and Dearing13]. One of the goals of ENACT is to allow informatics researchers to develop and validate new EHR research tools; to that end, a working group (WG) for developing NLP tools was established across participating ENACT sites. This paper outlines the WG’s progress on applying NLP to unstructured clinical text to identify housing issues and thereby address the known underreporting of housing instability in structured clinical data. We present our custom lexicon of housing-related terms, constructed after a literature review, and discuss the performance of our initial implementation on three distinct data sets.
Materials and methods
Lexicon development
We conducted a literature review of studies involving housing instability and homelessness to identify relevant works and to help construct a lexicon of housing-related terms [Reference Harris, Anthony, Quesinberry and Delcher7,Reference Chapman, Jones and Kelley12,Reference Rollings, Kunnath, Ryus, Janke and Ibrahim14,Reference Richards and Kuhn15]. An existing Open Health Natural Language Processing (OHNLP) project on food and housing insecurity was reviewed to compare important words, phrases, and patterns [Reference Zhang, Huang and Zong16]. We organized our relevant findings into six concepts: homeless, unstable housing, recovery housing, emergency housing, temporary housing, and exposure. These concepts and associated phrases are summarized in Table 1 and were selected to support fine-grained querying of housing in the ENACT query tool and clinical trial recruitment.
Algorithm development
We developed patterns to identify housing-related issues that were compatible with the OHNLP Toolkit, developed by the OHNLP consortium, for automated concept extraction from clinical notes [Reference Wen, Fu and Moon17]. The OHNLP Toolkit was selected due to its customizable interfaces, which could support NLP efforts in multiple domains, including those beyond housing, and its easy integration with the ENACT query tool. This toolkit utilizes MedTagger, a lightweight tool for indexing based on dictionaries and patterns, as the core component for information extraction [Reference Wen, Fu and Moon17,Reference Liu, Bielinski and Sohn18]. The phrases in Table 1 were converted to regular expressions, which compressed the list. For example, “lack of housing” and “lack of shelter” are reduced to one pattern: “lack of (shelter|housing).” Furthermore, patterns were developed to allow flexible matching. For example, “living on the streets” became “living on (the)? street(s)?,” where the article “the” is optional and “street” may be singular or plural. A common misspelling of “homeless” as “homelesss” was added based on the observational experience of the team in the housing domain.
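The pattern compression described above can be illustrated with a minimal Python sketch. It uses the standard `re` module rather than MedTagger's own rule format, and the pattern list, the word-boundary anchors, and the helper name `find_housing_mentions` are assumptions made for illustration; they are not the actual ENACT rule set.

```python
import re

# Illustrative patterns only (assumption): a handful of the phrases discussed
# in the text, not the full lexicon from Table 1.
PATTERNS = [
    r"lack of (shelter|housing)",   # merges "lack of shelter" / "lack of housing"
    r"living on (the )?streets?",   # optional article, singular or plural
    r"homelesss?",                  # tolerates the "homelesss" misspelling
]

# Word boundaries and case-insensitive matching are choices of this sketch.
HOUSING_RE = re.compile(r"\b(?:" + "|".join(PATTERNS) + r")\b", re.IGNORECASE)


def find_housing_mentions(text):
    """Return every substring of `text` matched by one of the patterns."""
    return [match.group(0) for match in HOUSING_RE.finditer(text)]


if __name__ == "__main__":
    print(find_housing_mentions("Pt reports lack of shelter; living on street."))
```

One compiled alternation keeps the lexicon in a single list, so adding a phrase variant means adding or widening one pattern.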
The OHNLP Toolkit uses an expanded version of the ConText algorithm to classify whether identified entities are negated or part of a patient’s medical history [Reference Harkema, Dowling, Thornblade and Chapman19]. Irrelevant ConText rules, such as “did not demonstrate,” were removed from the rule list to avoid wrongly negating detected entities; housing issues are not items for which patients test positive or negative. Each clinical document was divided into sentences as a preprocessing step; this was necessary after observing hits with negations generated from the wrong contextual window. These sentences were input into the OHNLP Toolkit, and annotated text files of the results were produced as output.
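The effect of sentence splitting on negation scope can be sketched as follows. This is a deliberately simplified stand-in and not the ConText algorithm or the OHNLP Toolkit implementation: the trigger list, the naive splitter, and the function names are all assumptions, and a hit is treated as negated whenever a trigger word appears anywhere before it in the same text window.

```python
import re

# Hypothetical, heavily reduced trigger list (assumption); ConText uses a
# much richer rule set with scopes and termination terms.
NEGATION_TRIGGERS = re.compile(r"\b(no|not|denies|without)\b", re.IGNORECASE)
TARGET = re.compile(r"\bhomeless(ness)?\b", re.IGNORECASE)


def split_sentences(document):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def tag_window(window):
    """Return (hit, negated) pairs, looking for triggers before each hit."""
    hits = []
    for match in TARGET.finditer(window):
        preceding = window[:match.start()]
        hits.append((match.group(0), bool(NEGATION_TRIGGERS.search(preceding))))
    return hits


def tag_document(document):
    """Tag hits sentence by sentence so triggers cannot leak across sentences."""
    hits = []
    for sentence in split_sentences(document):
        hits.extend(tag_window(sentence))
    return hits


if __name__ == "__main__":
    note = "Patient denies chest pain. Patient is homeless."
    print(tag_window(note))    # whole note as one window
    print(tag_document(note))  # one window per sentence
```

When the whole note is treated as one window, "denies" from the first sentence wrongly negates "homeless" in the second; splitting into sentences first confines each trigger to its own sentence.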
Testing
Each participating implementation site developed its own test data. For piloting the implementation, we developed a collection of emergency department notes using ChatGPT 3.5 that could be shared across sites for testing purposes. The initial question was “Can you write a sample discharge note from an Emergency Department for a homeless person?” Several additional prompts were used to generate positive cases in which the hypothetical patient has housing problems and negative cases in which there is no housing concern (“Can you generate a report for someone who is not homeless and not experiencing housing instability?”).
We repurposed a selection of 250 documents from a related study on housing within a cohort with substance use disorders (SUD) specific to stimulant and opioid use disorders (randomly sampled from individuals having ICD-10-CM diagnosis codes of F11.*, F14.*, F15.*, T40.[1-6].*, and T43.6*) [Reference Harris, Anthony, Quesinberry and Delcher7]. These documents were manually annotated as positive or negative for housing issues; patients experiencing housing issues have higher rates of SUD and are at higher risk of overdose, which highlights the importance of housing as an SDOH [Reference Doran, Rahai and McCormack20]. We also created a collection of 24,917 documents from a cohort (n = 225) diagnosed (ICD-10-CM Z59.6) with problems related to housing and economic circumstances (HEC) from UT Physicians, a multispecialty medical group associated with the University of Texas Health Science Center at Houston (UTHealth) and the UTHealth Harris County Psychiatric Center.
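The SUD code list above can be read as a set of wildcard patterns. The sketch below shows one possible translation into Python regular expressions; how the wildcards are interpreted (for example, requiring the decimal point after F11) and the helper name `is_sud_code` are assumptions, not the cohort definition used in the cited study.

```python
import re

# One regular expression per code family named in the text. The translation
# of the "*" wildcards is an assumption of this sketch.
SUD_CODE_PATTERNS = [
    r"F11\..*",        # F11.*
    r"F14\..*",        # F14.*
    r"F15\..*",        # F15.*
    r"T40\.[1-6].*",   # T40.[1-6].*
    r"T43\.6.*",       # T43.6*
]
SUD_CODE_RE = re.compile("|".join(f"(?:{p})" for p in SUD_CODE_PATTERNS))


def is_sud_code(code):
    """Return True when an ICD-10-CM code falls in one of the listed families."""
    return SUD_CODE_RE.fullmatch(code.strip().upper()) is not None


if __name__ == "__main__":
    for code in ["F11.20", "T40.2X1A", "T40.7X1A", "F10.20"]:
        print(code, is_sud_code(code))
```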
Results
The results of running the OHNLP Toolkit with our custom ENACT rule set and custom patterns on our three test data sets are summarized in Table 2. The MedTagger output contains a negation flag for each hit; a note was considered a positive case for housing issues if at least one of its hits was not negated. True negatives were cases that either had no documented housing issues or in which all mentions of housing were negated. For the HEC cohort, only the extracted hits were reviewed for correctness, so only precision is reported.
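The note-level decision rule and the resulting metrics can be sketched in a few lines of Python. The function names are hypothetical and the labels in the usage example are toy values, not results from Table 2.

```python
def note_is_positive(hit_negation_flags):
    """A note is positive when at least one hit is not negated.

    `hit_negation_flags` holds one boolean per hit (True = negated).
    A note with no hits, or with only negated hits, is negative.
    """
    return any(not negated for negated in hit_negation_flags)


def confusion_counts(predicted, gold):
    """Count TP, FP, FN, TN from parallel lists of note-level booleans."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(not p and g for p, g in zip(predicted, gold))
    tn = sum(not p and not g for p, g in zip(predicted, gold))
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Return (precision, recall); a value is None when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


if __name__ == "__main__":
    # Toy example (assumption): per-note negation flags and manual labels.
    notes = [[False], [True, True], [True, False], []]
    gold = [True, True, False, False]
    predicted = [note_is_positive(flags) for flags in notes]
    tp, fp, fn, tn = confusion_counts(predicted, gold)
    print(predicted, (tp, fp, fn, tn), precision_recall(tp, fp, fn))
```

Recall needs the false negative count, which is only available when every note has a manual label; reviewing only the extracted hits yields precision alone.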
ENACT = Evolve to Next-Gen Accrual to Clinical Trials; FN = false negative; FP = false positive; HEC = housing and economic circumstances; NLP = natural language processing; SUD = substance use disorders; TN = true negative; TP = true positive.
Table 3 lists common errors observed. For the SUD collection, false positives mostly stemmed from “Patient Education” notes that list dozens of community resources available for any patient; the “Homeless Veterans Center” caused false positives as it was only listed as a generic resource and did not imply the patient was a homeless veteran. Another false positive stemmed from a note describing someone who visited the emergency department after finding “a homeless person sleeping in her bathroom.”
For the SUD cohort, there were only 2 false negatives at the note level, where all individual hits were negated, and 10 false negatives at the individual hit level; these hits are described in Table 3. These examples are all failures to determine which concept is being negated in the sentence. For example, in “Patient is not safe candidate for home IV abx therapy given active IVDA and homelessness,” the negation is intended to apply to the concept of a safe candidate rather than to homelessness.
We report the distribution of concepts identified for each cohort in Table 4. The most frequently identified concept across all cohorts was homelessness. The second most frequent concept varied across cohorts. Temporary housing was likely frequent in the SUD cohort because a large number of patients were staying in shelters; the HEC cohort was a general population where unstable housing may be more common than staying in a shelter.
Discussion
Our pilot suggests that developing a lexicon for housing-related issues and rule-based NLP methods for identifying housing concepts in unstructured EHR data is a realistic goal for the ENACT Network. The OHNLP platform is easily deployable and customizable by any ENACT site. The OHNLP toolkit can be customized to read and write to any database; the input can be clinical data warehouses containing the clinical notes and the output can be the ENACT database that stores the searchable patient observations.
The ENACT web-based query tool is based on a browsable ontology that organizes concepts and codes that can be used in a “drag and drop” fashion. Our housing results are searchable on two tiers: the overall housing concept and the embedded individual concepts described in Table 1.
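The two search tiers can be illustrated with a small Python sketch. The six concept names come from the lexicon described earlier; the `(patient_id, concept)` observation format and the function name `patients_with` are assumptions for illustration and do not reflect the ENACT database schema or query tool API.

```python
# The six individual concepts from the lexicon; together they make up the
# overall "housing" tier.
HOUSING_CONCEPTS = (
    "homeless",
    "unstable housing",
    "recovery housing",
    "emergency housing",
    "temporary housing",
    "exposure",
)


def patients_with(observations, concept=None):
    """Return the set of patient ids with a housing observation.

    `observations` is an iterable of (patient_id, concept) pairs (a
    hypothetical format). With `concept=None` the query runs at the overall
    housing tier; otherwise it is restricted to one individual concept.
    """
    if concept is not None and concept not in HOUSING_CONCEPTS:
        raise ValueError(f"unknown housing concept: {concept!r}")
    return {
        patient_id
        for patient_id, observed in observations
        if observed in HOUSING_CONCEPTS and (concept is None or observed == concept)
    }


if __name__ == "__main__":
    toy = [("p1", "homeless"), ("p2", "temporary housing"), ("p3", "diabetes")]
    print(sorted(patients_with(toy)))              # overall housing tier
    print(sorted(patients_with(toy, "homeless")))  # single-concept tier
```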
Performance on the ChatGPT-generated notes was without error; this result is largely unrealistic and likely reflects how formulaic ChatGPT output is. Additionally, ChatGPT occasionally documented negative cases of housing issues as “no visible signs of homelessness,” which is highly unlikely to occur in a real note; if a patient does not appear homeless, clinical documentation of homelessness would simply be absent. The phrase “no visible signs of homelessness” may also be pejorative if included in clinical documentation. Despite these limitations, the ChatGPT notes are useful for prototyping and setting up the infrastructure needed to run MedTagger and to interface with the ENACT Network. We leave two items as future work: improving ChatGPT’s formulaic responses, where prompt engineering could potentially produce a more realistic data set, and exploring the role of generative models in identifying housing issues.
Our SUD results highlighted a false positive where the note references an unhoused individual who is not the patient; this example would be difficult to fix using rule-based methods, as there are very few contextual clues or markers in these sentences to indicate that the unhoused person was not the patient. The HEC results highlighted an example where the patient’s family member was experiencing homelessness, which may be addressable by fine-tuning the ConText algorithm to correctly identify family history.
Our study is limited by the breadth and depth of our housing lexicon. Although our intent was to be comprehensive, there may exist phrases or patterns that were not found during our literature review or during our tests. Furthermore, the language used to describe patients experiencing housing problems may change over time. We did not study recall in the HEC cohort due to the large number of notes; a smaller sampling strategy may be needed to manually review and validate recall. We also did not evaluate the temporality of the housing concepts or occurrences where stable housing is explicitly mentioned.
Conclusion
The ENACT Network is based largely on querying structured, standardized codes; diagnostic billing codes are insufficient for identifying patients experiencing housing instability or homelessness. We designed our housing lexicon and rule-based NLP methods based on a literature review of other studies and how they reference housing issues. We piloted our methods across a small group of ENACT sites and will be moving to implement these findings as routine updates to the entire ENACT Network, where cohort size estimates can be calculated across sites in support of innovative clinical trials involving those experiencing housing instability.
Author contributions
Conception and design: DRH, SF, and YW; collection or contribution of data: DRH, SF, AC, DH, JH, DO, and YW; contribution of analysis tools or expertise: AW, JH, and DO; drafting of manuscript: DRH, SF, AW, AC, DH, JH, DO, and YW.
Funding statement
The project described was supported by the NIH National Center for Advancing Translational Sciences through grant numbers UL1TR001998, UL1TR001857, and U24TR004111. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The SUD cohort was supported by the Centers for Disease Control and Prevention of the US Department of Health and Human Services as part of grant 1R01CE003360-01-00.
Competing interests
None.