Coding long COVID: characterizing a new disease through an ICD-10 lens

Background
Naming a newly discovered disease is a difficult process; in the context of the COVID-19 pandemic and the existence of post-acute sequelae of SARS-CoV-2 infection (PASC), which includes long COVID, it has proven especially challenging. Disease definitions and assignment of a diagnosis code are often asynchronous and iterative. The clinical definition and our understanding of the underlying mechanisms of long COVID are still in flux, and the deployment of an ICD-10-CM code for long COVID in the USA took nearly 2 years after patients had begun to describe their condition. Here, we leverage the largest publicly available HIPAA-limited dataset about patients with COVID-19 in the US to examine the heterogeneity of adoption and use of U09.9, the ICD-10-CM code for “Post COVID-19 condition, unspecified.”

Methods
We undertook a number of analyses to characterize the N3C population with a U09.9 diagnosis code (n = 33,782), including assessing person-level demographics and a number of area-level social determinants of health; identifying diagnoses commonly co-occurring with U09.9, clustered using the Louvain algorithm; and quantifying medications and procedures recorded within 60 days of U09.9 diagnosis. We stratified all analyses by age group in order to discern differing patterns of care across the lifespan.

Results
We established the diagnoses most commonly co-occurring with U09.9 and algorithmically clustered them into four major categories: cardiopulmonary, neurological, gastrointestinal, and comorbid conditions. Importantly, we discovered that the population of patients diagnosed with U09.9 is demographically skewed toward female, White, non-Hispanic individuals, as well as individuals living in areas with low poverty and low unemployment. Our results also include a characterization of common procedures and medications associated with U09.9-coded patients.

Conclusions
This work offers insight into potential subtypes and current practice patterns around long COVID and speaks to the existence of disparities in the diagnosis of patients with long COVID. This latter finding in particular requires further research and urgent remediation.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12916-023-02737-6.

groups in the prevalence for each condition that is prevalent in at least one age cohort (a total of 46 conditions). Although the observed value for the age-stratified prevalence of a few conditions was < 5, the expected value in each cell was > 5 across all conditions, allowing us to use the chi-square test instead of an exact test of independence that accounts for issues associated with small cell sizes. Of the 46 conditions present in all cohorts, the difference in prevalence was statistically significant for all 46 conditions (p < 0.05).
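The test of independence described above can be sketched in a few lines of Python. This is an illustration only, not the code used for the analysis: the function names, the choice of four age groups, and every count in the example table are assumptions made for demonstration. The sketch computes expected cell counts, confirms that each exceeds 5, and derives the chi-square statistic and p-value using only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form base cases: df = 2 (even) or df = 1 (odd).
    if df % 2 == 0:
        q, k = math.exp(-half), 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1
    # Recurrence: Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def chi_square_independence(table):
    """Pearson chi-square test on a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_sf(stat, df), expected


if __name__ == "__main__":
    # Hypothetical counts for one condition across four assumed age groups.
    with_condition = [3, 40, 120, 90]
    without_condition = [197, 960, 1880, 910]
    stat, df, p, expected = chi_square_independence(
        [with_condition, without_condition]
    )
    # One observed cell is below 5, but the test relies on expected counts.
    assert min(min(row) for row in expected) > 5
    print(f"chi2 = {stat:.2f}, df = {df}, p = {p:.3g}")
```

In the hypothetical table, one observed cell is below 5 while every expected cell is above 5, which is the situation in which the chi-square approximation is generally considered acceptable.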

Standard N3C Data Quality Checks
N3C's Data Ingestion and Harmonization team runs a suite of quality checks against each participating site's initial data submission before those data are incorporated into the N3C secure enclave for use in research. Sites that do not pass our minimum checks are asked to remediate issues and resubmit, with that process continuing iteratively until all checks are passed. N3C's data quality workflow is described in detail in [23]. The following list, adapted from [23], summarizes the minimum requirements for inclusion.
1. Source common data model (CDM) conformance. All tables required by the site's source CDM (e.g., OMOP, PCORnet, ACT, TriNetX) are present, with all required fields populated; fields that use a controlled value set (e.g., "M" for male, "F" for female, etc.) are populated with valid values.
2. Demographics. Count of patients qualifying for COVID phenotype (e.g., COVID positive) is reasonable when compared with sites of similar size; sex, race, and ethnicity distributions are reasonable for the site's population; month of birth is evenly distributed throughout the calendar year.
3. COVID tests. All COVID tests must be coded with a standard concept; all COVID test results must be coded with a standard concept; numbers of negative and positive COVID tests are reasonable when compared with sites of similar size.
4. Conditions. Clinical encounters coded with U07.1 (ICD-10 code for COVID) are present, and those encounters are distributed across various visit types (e.g., outpatient, inpatient, emergency).
5. Encounters. Clinical encounters are distributed across a variety of visit types (e.g., outpatient, inpatient, emergency); the distribution of visit types is reasonable when compared with similar sites; the majority of inpatient visits have valid end dates; the mean duration of visits is reasonable for that type of visit; the vast majority of visit end dates are later than or equal to the visit start date.
6. Coding completeness. No more than 20% of records in any domain are coded with nonstandard OMOP concept IDs without further explanation (OMOP sites only); no more than 20% of records in any domain are coded with "0-No Matching Concept" without further explanation (OMOP sites only); the PERSON_ID attached to all records in domain tables must exist in the PERSON table; primary keys are valid with no duplicate rows in any table; if applied by the site, date shifting is consistent within each patient across all domains.
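A few of the checks listed above lend themselves to a short illustration. The Python sketch below is not N3C's actual pipeline, which is described in [23]; the function name, the record layout (plain dictionaries), and the 95% cutoff standing in for "the vast majority" of visit dates are all assumptions. Only the 20% ceiling on unmapped records comes from the list itself.

```python
from datetime import date


def check_site(persons, visits, concept_ids_by_domain,
               max_unmapped=0.20, min_valid_dates=0.95):
    """Return a list of failure messages; an empty list means these checks passed.

    min_valid_dates is an assumed stand-in for "the vast majority".
    """
    failures = []

    # Primary keys: no duplicate rows in the PERSON table.
    person_ids = [p["person_id"] for p in persons]
    if len(person_ids) != len(set(person_ids)):
        failures.append("PERSON: duplicate primary keys")

    # Referential integrity: every visit must point at a known person.
    known = set(person_ids)
    orphans = sum(1 for v in visits if v["person_id"] not in known)
    if orphans:
        failures.append(f"VISIT: {orphans} record(s) with unknown PERSON_ID")

    # Visit end dates should be on or after the start date.
    dated = [v for v in visits if v["end"] is not None]
    if dated:
        valid = sum(1 for v in dated if v["end"] >= v["start"])
        if valid / len(dated) < min_valid_dates:
            failures.append("VISIT: too many end dates precede start dates")

    # Coding completeness: concept ID 0 means "No Matching Concept" in OMOP.
    for domain, ids in concept_ids_by_domain.items():
        if ids and sum(1 for c in ids if c == 0) / len(ids) > max_unmapped:
            failures.append(
                f"{domain}: more than {max_unmapped:.0%} unmapped concepts"
            )
    return failures


if __name__ == "__main__":
    # Tiny hypothetical submission with one orphaned visit.
    persons = [{"person_id": 1}, {"person_id": 2}]
    visits = [
        {"person_id": 1, "start": date(2022, 1, 3), "end": date(2022, 1, 5)},
        {"person_id": 9, "start": date(2022, 2, 1), "end": date(2022, 2, 1)},
    ]
    print(check_site(persons, visits, {"condition": [0, 101, 102, 103, 104]}))
```

Returning messages rather than raising on the first failure mirrors the iterative remediate-and-resubmit process: a site sees every failed check in a single pass.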
Supplemental Table 1 Demographic breakdown of all COVID-positive patients across 34 N3C sites. The cohort shown in this table is composed of any COVID-positive patient at our 34 eligible sites. This cohort was used as a comparator against the U09.9-coded population in Table 1, enabling us to ascertain the ways in which the U09.9 cohort differs from the COVID-positive cohort from the same sites.
Uptake of U09.9 and B94.8 across 34 N3C sites. Each box represents one of our 34 sites, with counts of U09.9 and B94.8 diagnosis codes plotted over time. Most sites follow a similar uptake pattern, with U09.9 use (blue) rising since its release on October 1, 2021, and B94.8 (pink) decreasing or plateauing after that date. Note that because N3C sites refresh their data on different cycles, some sites have more recent data than others, leading to different end dates across these plots.

Supplemental Figure 2 Common medications among patients with a U09.9 code. Medications shown occur within 60 days after a patient's U09.9 diagnosis, and do not occur prior to the U09.9 (i.e., new medications). Medications are coded using the ATC terminology. Because a single drug can have multiple ATC codes, some medications are counted in more than one category. Category totals represent unique patient-drug pairs, not necessarily unique individuals. Medication classes associated with fewer than 20 patients or less than 0.5% of the age-stratified cohort size are not shown, per N3C download policy. Percentages in each column are shown relative to the total n in that column.
When using EHR data, it can be difficult to discern indication from drug records, particularly when drugs are recorded at the ingredient level, and particularly when those ingredients can be used in a wide variety of medications and medication forms. For this reason, the categories of Ophthalmologicals; Otologicals; Corticosteroids, Dermatological Preparations; and Blood Substitutes and Perfusion Solutions are artificially inflated. They are included for completeness, but should not be interpreted at face value.
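The selection rules in the Supplemental Figure 2 caption (new medications only, a 60-day window, unique patient-drug pairs, and small-cell suppression) can be sketched as follows. This is a hedged illustration rather than the analysis code: the function name and record layout are hypothetical, and treating the day of diagnosis as inside the window is an assumption the caption does not settle.

```python
from collections import defaultdict
from datetime import date, timedelta


def new_medication_counts(index_dates, drug_records, window_days=60,
                          min_patients=20, min_fraction=0.005):
    """Count unique patient-drug pairs per ATC class for new medications.

    index_dates: {patient_id: date of first U09.9 diagnosis}
    drug_records: iterable of (patient_id, atc_class, drug, date) tuples;
        a drug with several ATC codes appears once per class.
    """
    records = list(drug_records)
    # Any exposure before the U09.9 date means the drug is not "new".
    prior = {(pid, drug) for pid, _, drug, when in records
             if pid in index_dates and when < index_dates[pid]}

    pairs = defaultdict(set)
    for pid, atc_class, drug, when in records:
        if pid not in index_dates:
            continue
        start = index_dates[pid]
        # Assumption: the diagnosis date itself counts as inside the window.
        in_window = start <= when <= start + timedelta(days=window_days)
        if in_window and (pid, drug) not in prior:
            pairs[atc_class].add((pid, drug))

    # Suppress classes below 20 patients or 0.5% of the cohort.
    floor = max(min_patients, min_fraction * len(index_dates))
    shown = {}
    for atc_class, patient_drugs in pairs.items():
        patients = {pid for pid, _ in patient_drugs}
        if len(patients) >= floor:
            shown[atc_class] = len(patient_drugs)
    return shown


if __name__ == "__main__":
    # Hypothetical three-patient cohort; min_patients lowered for the demo.
    index_dates = {1: date(2022, 1, 1), 2: date(2022, 1, 1), 3: date(2022, 1, 1)}
    records = [
        (1, "R03", "albuterol", date(2022, 1, 10)),
        (2, "R03", "albuterol", date(2022, 1, 15)),
        (3, "R03", "albuterol", date(2021, 12, 1)),  # prior use, so not new
        (3, "R03", "albuterol", date(2022, 1, 5)),
        (2, "H02", "prednisone", date(2022, 1, 20)),  # one patient, suppressed
    ]
    print(new_medication_counts(index_dates, records, min_patients=2))
```

Suppression is applied to the number of distinct patients in a class, while the reported total is the number of patient-drug pairs, which matches the caption's distinction between pairs and individuals.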
Supplemental Figure 3 Common conditions among patients with a U09.9 code. Conditions shown occur within 60 days after a patient's U09.9 diagnosis. Conditions associated with fewer than 20 patients or less than 1% of the age-stratified cohort size are not shown.