Impact of data source choice on multimorbidity measurement: a comparison study of 2.3 million individuals in the Welsh National Health Service

Background Measurement of multimorbidity in research is variable, including the choice of the data source used to ascertain conditions. We compared the estimated prevalence of multimorbidity and associations with mortality using different data sources.
Methods A cross-sectional study of SAIL Databank data including 2,340,027 individuals of all ages living in Wales on 01 January 2019. Comparison of prevalence of multimorbidity and its 47 constituent conditions using data from primary care (PC), hospital inpatient (HI), and linked PC-HI data sources and examination of associations between condition count and 12-month mortality.
Results Using linked PC-HI compared with only HI data, multimorbidity was more prevalent (32.2% versus 16.5%), and the population of people identified as having multimorbidity was younger (mean age 62.5 versus 66.8 years) and included more women (54.2% versus 52.6%). Individuals with multimorbidity in both PC and HI data had stronger associations with mortality than those with multimorbidity only in HI data (adjusted odds ratio 8.34 [95% CI 8.02-8.68] versus 6.95 [95% CI 6.79-7.12] in people with ≥ 4 conditions). The prevalence of conditions identified using only PC versus only HI data was significantly higher for 37/47 and significantly lower for 10/47: the highest PC/HI ratio was for depression (14.2 [95% CI 14.1–14.4]) and the lowest for aneurysm (0.51 [95% CI 0.5–0.5]). Agreement in ascertainment of conditions between the two data sources varied considerably, being slight for five (kappa < 0.20), fair for 12 (kappa 0.21–0.40), moderate for 16 (kappa 0.41–0.60), and substantial for 12 (kappa 0.61–0.80) conditions, and by body system was lowest for mental and behavioural disorders. The percentage agreement (individuals with a condition identified in both PC and HI data) was lowest for anxiety (4.6%) and highest for coronary artery disease (62.9%).
Conclusions The use of single data sources may underestimate prevalence when measuring multimorbidity and many important conditions (especially mental and behavioural disorders). Caution should be used when interpreting findings of research examining individual and multiple long-term conditions using single data sources. Where available, researchers using electronic health data should link primary care and hospital inpatient data to generate more robust evidence to support evidence-based healthcare planning decisions for people with multimorbidity. Supplementary Information The online version contains supplementary material available at 10.1186/s12916-023-02970-z.


Background
Multimorbidity, most commonly defined as the coexistence of two or more long-term conditions, is an issue of global importance because of its association with increased healthcare use, mortality, and reduced quality of life [1]. Accurately estimating the prevalence of multimorbidity is therefore important to inform policy decision-making, healthcare planning, and research. The widespread availability of electronic health data, including electronic and administrative health records, presents opportunities for medical research examining multimorbidity, the volume of which has rapidly increased over the past two decades [2,3].
Measurement of multimorbidity in research is highly variable in terms of the definition of multimorbidity used (for example two or more conditions, or three or more conditions from three or more body systems) [3,4], the number and selection of conditions considered in the count [5], study setting, and participant age [6], resulting in widely varying estimates of the prevalence of multimorbidity [6]. Additionally, little is known about the impact of data source on multimorbidity research findings. A recent systematic review found that although many studies examining multimorbidity are based in primary care or community settings (441 of 566 [77.9%]), a lower proportion used electronic health records rather than patient self-report measures (142 of 441 [32.2%] in primary or community versus 89 of 103 [86.4%] in hospital settings respectively) [3].
Electronic health data are increasingly available for research, with 50% of upper-middle and high-income countries globally adopting these in primary and/or secondary care settings [7,8]. Despite this, barriers to accessing electronic health data, particularly from primary care settings, can include access restrictions imposed by information governance legislation and challenges faced by researchers when manipulating and interpreting non-intuitive records of events and conditions [9]. Although the primary purpose of these data is to record the provision of clinical care, their large size and inclusion of populations often underrepresented in clinical trials and registries, such as women and people with multimorbidity [10], mean they better reflect true clinical populations [11]. However, routinely collected data may under-ascertain some conditions [12]. For example, using only primary care (PC) or only hospital inpatient (HI) data may lead to under-ascertainment of the prevalence and incidence of stroke [13] and myocardial infarction [12,13]. It is unclear how general these findings are across the many conditions recommended for inclusion in studies of multimorbidity [14]. Despite the importance and widespread availability of these data, and the intensity of multimorbidity research, there is no standard approach to choice of data and little is currently understood about how the choice of data source impacts on the estimated prevalence of multimorbidity and its constituent conditions. The aim of this study was to compare the estimated prevalence of multimorbidity and the 47 constituent conditions using only PC, only HI, and linked PC-HI data in the SAIL Databank and to examine associations of condition counts derived in the same three ways with mortality.

Study design and population
This cross-sectional study used routinely collected anonymised data available in the SAIL Databank and included individuals of all ages living in Wales and registered with a GP contributing data to the Secure Anonymised Information Linkage (SAIL) Databank on 1 January 2019. Intentionally, this study examines condition coding outside the COVID-19 pandemic to avoid capturing the effects of related restrictions and associated decreases in the diagnosis of physical and mental health conditions [15]. The study population was limited to people with at least 1 year of GP registration before 1 January 2019, to improve the stability of records and avoid under-ascertainment where an individual has recently moved practice and their PC record has not yet been populated with historic codes [16], and to those registered with GP practices who contribute data to SAIL Databank (80% of GP practices and 83% of Welsh residents [17]). The population was stratified into groups according to age, sex, and deprivation status of neighbourhood residence (using deciles of the Welsh Index of Multiple Deprivation [WIMD] 2019) [18]. Mortality was measured in the subsequent calendar year (to 31 December 2019).
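The eligibility rule described above can be sketched as a small predicate. This is only an illustration in Python (the study itself used SQL and R); the function name, its arguments, and the assumption that registration is continuous from the given start date are hypothetical and not taken from the study.

```python
from datetime import date

# Study cross-section date, as stated in the text.
INDEX_DATE = date(2019, 1, 1)


def is_eligible(registration_start: date, practice_contributes_to_sail: bool) -> bool:
    """Return True if a person meets the cohort rule sketched here.

    Assumptions (not from the study): registration is continuous from
    `registration_start`, and "at least 1 year" means the registration
    began on or before 1 January 2018.
    """
    one_year_before = date(INDEX_DATE.year - 1, INDEX_DATE.month, INDEX_DATE.day)
    return practice_contributes_to_sail and registration_start <= one_year_before
```

For example, `is_eligible(date(2017, 6, 1), True)` is true, while a person who registered in June 2018 would be excluded under this sketch.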

Data sources
PC data obtained from the Welsh Longitudinal General Practice Dataset (WLGP) were used to define conditions using Read version 2 codes (SNOMED-CT codes were not operational in the SAIL Databank during the study period), prescribing and/or laboratory data [19]. HI data were derived from general and psychiatric HI episodes obtained from the Patient Episode Database for Wales (PEDW) using all recorded International Classification of Diseases 10th Revision codes present for each hospital discharge [20]. PEDW records hospital inpatient events for English hospitals where a patient is registered with a Welsh GP; however, neither PEDW nor WLGP will provide data for patients prior to when they registered with a Welsh GP. Unlike Hospital Episode Statistics (HES), which records admissions, A&E attendances, and outpatient appointments from NHS England, PEDW records hospital inpatient episodes only [21]. Mortality data were derived from the Welsh Demographic Service Dataset.

Definition of long-term conditions
Choice of the 47 conditions was based on results of a recent Delphi consensus study recommending those to include in the measurement of multimorbidity (Additional file 1) [14], and multimorbidity was defined as the presence of two or more conditions [1]. Phenotype definition and look-back duration for the codes defining each of the conditions followed rules defined by Barnett et al. [22] where possible. For the remaining conditions, inclusion criteria were agreed through discussion between authors CM, SWM, and BG. In certain cases, look-back durations varied within conditions to reflect the impact living with the condition was likely to have on an individual. For example, anaemia was defined as a relevant code ever recorded for aplastic anaemia, sickle cell anaemia, and thalassaemia (conditions that are either life-long or life-threatening), but as a relevant code dated in the last 12 months for iron-, B12-, or folate-deficient anaemias (conditions that are more likely to be transient), with the results of both combined into a single variable defining the presence of 'anaemia' on 1 January 2019. Unless the look-back duration was specifically stipulated, for example, 1 year for asthma clinical codes, codes present between 1 January 2000 and the study cross-section date of 1 January 2019 were used for both PC and HI data. This approach was taken to avoid relative over-ascertainment of PC codes. Historic codes are present for lifetime records that have been transcribed into the electronic record in the PC data source, but the first electronic HI records held within PEDW began on 1 April 1995. Code lists used to define conditions were those created by Kuan et al. [23] available on the HDR UK Phenotype Library [17] and de novo code lists created specifically by the authors of this study where required (detailed in Additional file 2). We adapted prescribing code lists from the Cambridge Multimorbidity Score by Payne et al. to qualify conditions that resolve as 'active' on 1 January 2019 (e.g. asthma, epilepsy) [24].
Prescribing and laboratory data were available within the PC data source (WLGP). To ensure that the study reflected a fair comparison between ascertainment using codes present in PC and HI datasets based on availability within each data source, prescribing data were applied to only PC and to linked PC-HI data. Conditions were categorised by the International Classification of Diseases and Related Health Problems 10th Revision (ICD-10) (Additional file 2).
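To make the look-back rules concrete, the anaemia example can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the function names are hypothetical, "last 12 months" is approximated as 365 days, and treating the window boundaries as inclusive is an assumption.

```python
from datetime import date, timedelta

INDEX_DATE = date(2019, 1, 1)    # study cross-section date (from the text)
WINDOW_START = date(2000, 1, 1)  # default start of the code window (from the text)


def has_code_in_window(code_dates, lookback_days=None):
    """True if any relevant code date falls inside the look-back window.

    With no stipulated look-back, the default window of 1 January 2000 to
    the index date is used. Inclusive boundaries are an assumption.
    """
    start = WINDOW_START
    if lookback_days is not None:
        start = max(WINDOW_START, INDEX_DATE - timedelta(days=lookback_days))
    return any(start <= d <= INDEX_DATE for d in code_dates)


def has_anaemia(lifelong_code_dates, transient_code_dates):
    """Combine the two rules into one 'anaemia' flag, as described in the text.

    lifelong_code_dates: dates of codes for aplastic anaemia, sickle cell
        anaemia, or thalassaemia (any time in the default window).
    transient_code_dates: dates of codes for iron-, B12-, or folate-deficient
        anaemia (only the last 12 months count).
    """
    return (has_code_in_window(lifelong_code_dates)
            or has_code_in_window(transient_code_dates, lookback_days=365))
```

Under these assumptions, a deficiency anaemia code from June 2017 alone would not count towards 'anaemia' on the index date, whereas a thalassaemia code from 2005 would.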

Data analysis
We conducted a suite of analyses to estimate the prevalence and concordance of individual conditions and multimorbidity, and associations with mortality, between data sources. First, prevalence estimates for multimorbidity and each of the 47 conditions were calculated separately using only PC, only HI, and linked PC and HI (PC-HI) data. Second, the number of conditions each individual had was calculated using only PC, only HI, and linked PC-HI data. Associations with 12-month mortality were estimated using binary logistic regression and were used to calculate unadjusted and adjusted (by age, sex, and deprivation) odds ratios between morbidity counts (grouped into 0, 1, 2, 3, and 4+ conditions) with 95% confidence intervals. Third, PC/HI prevalence ratios were calculated by dividing the estimated prevalence measured using only PC data by the estimated prevalence measured using only HI data. Fourth, the proportion ascertained by each data source alone compared with linked PC-HI data was calculated, with Wilson's exact method used to calculate 95% confidence intervals [25]. Finally, we estimated concordance between only PC and only HI data by (1) calculating the percentage of patients identified as having each of the 47 conditions in both PC and HI data (hereinafter referred to as 'percent agreement') and (2) calculating Cohen's kappa for each individual condition and for multimorbidity, using the following formula [26]:

$$\kappa = \frac{p_o - p_h}{1 - p_h}$$

where $p_o$ is the relative observed agreement among PC and HI data and $p_h$ is the hypothetical probability of chance agreement between PC and HI data.
Kappa statistic for each of the 47 conditions was stratified into categories to describe concordance between data sources (slight 0.01-0.20, fair 0.21-0.40, moderate 0.41-0.60, substantial 0.61-0.80, almost perfect 0.81-1.00). Given that HI ascertainment of asthma and epilepsy was not constrained by prescribing data but PC and linked PC-HI was, the final three measures of concordance could not be assessed for these conditions.
The project received ethical approval from the SAIL Databank independent information governance panel [27]. Data cleaning was performed using SQL to query IBM DB2 databases. Analysis (using the glm function in the 'stats' package) and data visualisation were performed using R version 4.1.2 [28].
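As an illustration of the two concordance measures, the sketch below computes Cohen's kappa (observed agreement corrected for chance agreement, in the paper's p_o and p_h notation) and percent agreement from the four cells of a PC-by-HI cross-tabulation for one condition. It is a hedged Python sketch: the function name is hypothetical, and the denominator used for percent agreement (everyone identified in at least one source) is an assumption, since the text does not spell it out.

```python
def concordance(both, pc_only, hi_only, neither):
    """Return (kappa, percent_agreement) for one condition.

    both:     people with the condition recorded in PC and in HI data
    pc_only:  recorded in PC data only
    hi_only:  recorded in HI data only
    neither:  recorded in neither source
    """
    n = both + pc_only + hi_only + neither

    # p_o: observed agreement (both sources say yes, or both say no).
    p_o = (both + neither) / n

    # p_h: agreement expected by chance from the marginal prevalences.
    p_pc = (both + pc_only) / n
    p_hi = (both + hi_only) / n
    p_h = p_pc * p_hi + (1 - p_pc) * (1 - p_hi)

    kappa = (p_o - p_h) / (1 - p_h)

    # Assumed denominator: people identified in at least one source.
    percent_agreement = 100 * both / (both + pc_only + hi_only)
    return kappa, percent_agreement


# Made-up counts for illustration only (not study data).
kappa, agreement = concordance(both=40, pc_only=10, hi_only=10, neither=40)
```

With these illustrative counts, observed agreement is 0.8 and chance agreement is 0.5, so kappa is approximately 0.6.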
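The kappa bands used to describe concordance can be written as a simple lookup. This Python sketch is illustrative: rounding kappa to two decimal places before banding and the label returned for values at or below zero are assumptions, as the text only defines bands from 0.01 upwards.

```python
def kappa_category(kappa):
    """Map a kappa value to the descriptive bands listed in the text."""
    k = round(kappa, 2)  # assumption: band on the two-decimal value
    if k <= 0.0:
        return "no agreement beyond chance"  # hypothetical label, not in the text
    if k <= 0.20:
        return "slight"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "almost perfect"
```

For instance, the reported kappa of 0.08 for depression falls in the 'slight' band and 0.72 for diabetes in the 'substantial' band.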
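For readers unfamiliar with the unadjusted odds ratios described in the data analysis, the sketch below shows the equivalent calculation from a 2x2 table. In a logistic regression with condition count group as the only (categorical) predictor, the odds ratio for a group versus the reference group equals this cross-product ratio. The code is a Python illustration with made-up counts, the function name is hypothetical, and the Wald interval on the log scale is an assumed choice; it does not reproduce the authors' adjusted R models.

```python
import math


def unadjusted_odds_ratio(deaths_exposed, alive_exposed, deaths_ref, alive_ref, z=1.96):
    """Odds ratio for death (exposed vs reference group) with a Wald 95% CI."""
    odds_ratio = (deaths_exposed * alive_ref) / (alive_exposed * deaths_ref)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(
        1 / deaths_exposed + 1 / alive_exposed + 1 / deaths_ref + 1 / alive_ref
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Made-up counts for illustration only (not study data):
# 20 deaths and 80 survivors with 4+ conditions, 10 deaths and 90 survivors with 0.
or_, lo, hi = unadjusted_odds_ratio(20, 80, 10, 90)
```

Adjusting for age, sex, and deprivation requires fitting the full regression model, which this sketch does not attempt.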

Role of funding source
The funders of the study had no role in the study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data used in the study and had final responsibility for the decision to submit the study for publication.

Results
On 1 January 2019, 2,340,027 individuals living in Wales had been registered with SAIL-contributing GP practices for at least 1 year. Multimorbidity had the highest estimated prevalence using linked PC-HI data (32.2%), followed by only PC data (29.6%), and lowest using only HI data (16.5%) (Table 1). The mean age of people with multimorbidity was nearly 4 years younger using linked PC-HI data (62.5 years) compared with only HI data (66.8 years). The proportion of women with multimorbidity was nearly two percentage points higher using linked PC-HI data (54.2% [408,760 of 754,082]) than HI data (52.6% [202,678 of 385,276]). There was little difference in the distribution of people by deprivation status when multimorbidity was defined using different data sources (Table 1).
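The headline percentages can be checked directly from the counts given in this paragraph. The Python sketch below also shows one way to compute a Wilson score interval for the proportion of people with multimorbidity in linked data who were also identified using only HI data; applying the interval to this particular proportion, and the helper name, are illustrative assumptions rather than a reproduction of the authors' analysis.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Counts taken from the text.
cohort = 2_340_027
multimorbid_linked = 754_082
multimorbid_hi = 385_276

prevalence_linked = 100 * multimorbid_linked / cohort  # about 32.2%
prevalence_hi = 100 * multimorbid_hi / cohort          # about 16.5%

# Share of linked-data multimorbidity also found using only HI data.
share_hi = multimorbid_hi / multimorbid_linked          # about 51.1%
share_hi_ci = wilson_interval(multimorbid_hi, multimorbid_linked)
```

With denominators this large the interval is very narrow, which is why differences between data sources are statistically clear even when they are small.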
The 1-year mortality rate increased markedly with increasing number of conditions when identified in all three data sources (Additional file 3). In unadjusted analysis, the odds ratio for mortality in people with 4+ versus
The estimated prevalence of most conditions was higher using only PC versus only HI data (PC/HI prevalence ratios). For 37/47 conditions, the PC/HI data prevalence ratio was statistically significantly > 1 (i.e. prevalence using only PC > only HI) including tuberculosis, cancer, and anaemia; congenital disease, visual impairment, and all mental and behavioural disorders; diseases of the respiratory system; and diseases of the ear and mastoid process. The PC/HI prevalence ratios were statistically significantly lower for 10/47 conditions: Addison's disease, epilepsy, paralysis, coronary artery disease, heart valve disorders, arrhythmia, aneurysm, osteoporosis, and endometriosis (Fig. 2 and Additional file 5). Conditions with the highest PC/HI prevalence ratios were mental and behavioural and sensory disorders (depression at 14.2 [95% CI 14.0 to 14.4] and hearing impairment at 9.1 [95% CI 8.9 to 9.2]) and lowest were diseases of the circulatory system and the nervous system (aneurysm at 0.5 [95% CI 0.5 to 0.5] and paralysis at 0.7 [95% CI 0.7 to 0.7]). The PC/HI prevalence ratios were close to 1 (lying between 0.9 and 1.1) for six conditions (cancer, arrhythmias, coronary heart disease, heart failure, heart valve disorders, and endometriosis) of which four were diseases of the circulatory system (Table 2).
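The PC/HI prevalence ratio itself is a simple quotient, sketched below in Python with made-up counts. The text does not state how the confidence intervals for the ratios were derived; the log-scale (Katz) interval shown here is an assumption for illustration, and it treats the two prevalences as independent even though both come from the same cohort. The function name is hypothetical.

```python
import math


def pc_hi_prevalence_ratio(pc_count, hi_count, cohort_size, z=1.96):
    """PC/HI prevalence ratio with an assumed log-scale (Katz) 95% CI."""
    # Both prevalences share the same denominator, so it cancels in the ratio.
    ratio = pc_count / hi_count
    se_log_ratio = math.sqrt(
        1 / pc_count - 1 / cohort_size + 1 / hi_count - 1 / cohort_size
    )
    lower = math.exp(math.log(ratio) - z * se_log_ratio)
    upper = math.exp(math.log(ratio) + z * se_log_ratio)
    return ratio, lower, upper


# Made-up counts for illustration only (not study data).
ratio, lower, upper = pc_hi_prevalence_ratio(pc_count=300, hi_count=100, cohort_size=1000)
```

A ratio above 1 means the condition is recorded more often in PC than in HI data, and a ratio below 1 means the reverse.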
For most conditions, using only PC data ascertained a higher proportion of people identified using linked PC-HI data than using only HI data.

Fig. 1 Adjusted odds ratios for 12-month mortality. Odds ratios for 12-month mortality in people with 4+ versus 0 conditions, adjusted for age, sex, and deprivation status. Ninety-five percent confidence intervals are represented by error bars

Fig. 2 Forest plot of primary care to hospital inpatient data prevalence ratios. Prevalence ratios are calculated by dividing prevalence using only primary care (PC) data by prevalence using only hospital inpatient (HI) data: PC/HI ratio. Error bars represent 95% confidence intervals. The vertical dotted line represents where the PC/HI ratio is 1, meaning the prevalence rate is the same using both PC and HI data. Where the ratio is > 1, the prevalence was higher using PC versus HI data. Conversely, a ratio < 1 represents conditions where prevalence is higher using HI versus PC data.

Concordance between data sources was variable across conditions and ICD-10 body systems. The percentage agreement of people identified as having each condition in both PC and HI data varied considerably across conditions, ranging from the lowest agreement for anxiety (4.6%) and depression (5.1%) to the highest agreement for coronary artery disease (62.9%) and multiple sclerosis (60.4%). ICD-10 chapters with the highest percent agreement were endocrine, nutritional, and metabolic diseases (median 44.9% [IQR 42.6 to 48.4]) and diseases of the nervous system (median across conditions in that chapter 38.1% [IQR 25.3 to 49.7]). Percent agreement was lowest for diseases of the ear and mastoid process (median 6.9% [IQR 6.6 to 7.3]) and mental and behavioural disorders (median 22.7% [IQR 5.1 to 28.7]) (Fig. 3 and Additional file 5). Agreement measured using Cohen's kappa was slight (< 0.20) for five, fair (0.21-0.40) for 12, moderate (0.41-0.60) for 16, and substantial (0.61-0.80) for 12 conditions. Kappa was lowest in depression (0.08) and hearing impairment (0.10), and highest in diabetes (0.72) and alcohol and substance misuse (0.79). At the ICD-10-chapter level, conditions with the lowest kappa were found in mental and behavioural disorders (three slight and two fair agreement out of nine) and diseases of the ear and mastoid process (two slight out of two); in contrast, ICD-10 chapters with the highest kappa were seen in diseases of the circulatory system (four substantial and two moderate out of nine) and diseases of the endocrine system (three substantial and one moderate out of four).

Discussion
The prevalence of multimorbidity was higher using only PC (29.6%) than only HI (16.5%) data and higher still using linked PC-HI (32.2%) data. The population of people identified as having multimorbidity using linked PC-HI data compared to only HI was younger, included a higher proportion of women, and people identified as multimorbid in both PC and HI had a stronger association with mortality. Using only PC data identified more people as having most of the 47 conditions than using only HI, and this was most marked for mental and behavioural and sensory disorders. Concordance between data sources was variable across conditions and ICD-10 body systems. The use of single data sources may underestimate the prevalence of multimorbidity and most individual conditions, especially mental and behavioural disorders. Findings from this study support the use of linked primary care and hospital inpatient data where available.
Strengths of the study are the inclusion of almost the entire adult population of Wales and examination of multimorbidity and a large number of individual conditions recommended for use in multimorbidity research [14], providing granular insights into variation in the relative ascertainment of disease from PC versus HI data sources. Limitations include variation in longitudinal availability of data for individuals (for example because individuals change GP registration or migrate into Wales), although we mitigated against this by requiring 1 year of GP registration to minimise impact [16]. Like other UK datasets, PEDW data only reliably includes ICD-10 codes for hospital inpatient events, although all specialist outpatient clinics generate a letter to the general practitioner which is commonly used to code the primary care data. A further limitation is that primary care prescribing data were used to qualify epilepsy and asthma as 'active' on the analysis date, with the same data/rules used to estimate prevalence in linked PC-HI data. In contrast, only HI data were not qualified in this way, meaning that ascertainment of asthma and epilepsy is not strictly comparable because only PC and linked PC-HI estimates are for 'active' disease, whereas only HI is for 'ever recorded'. Finally, we treated the linked PC-HI estimates of prevalence as gold standard but did not have any way of examining false positive diagnoses in either of the data sources.

* Primary care (PC) and linked PC-HI prevalence estimates in Table 2 include prescribing data as documented in Additional file 2

Table 2 column headings: ICD-10 chapter; Long-term condition; Prevalence, no. (%) using primary care data, hospital inpatient data, and linked PC-HI data; Difference between PC and HI, number (% of total cohort)

Fig. 3 Venn diagrams of concordance between data sources for 47 long-term conditions by ICD-10 body system. Red represents people identified as having each condition in primary care (PC) and blue represents people identified in hospital inpatient (HI) data sources. Area of cross-over represents individuals identified as having each condition in both data sources

Consistent with our findings, a study from Canada found varying degrees of discordance comparing ascertainment of seven conditions (myocardial infarction, asthma, diabetes, chronic lung disease, stroke, hypertension, and congestive heart failure) between the Canadian Community Health Survey (patient self-report) and health administrative data [29]. Ascertainment of diabetes and hypertension was similar, but administrative health data gave lower prevalence estimates for stroke, congestive heart failure, and COPD. Another study, from the USA, which compared ascertainment of conditions using hospital outpatient EHRs and encounter diagnosis data from Community Health Center (CHC) patients, where care is provided for un- and under-insured patients regardless of their ability to pay, found considerable variation in ascertainment across sources [30]. The authors concluded that using EHRs capturing hospital outpatient data only might under-ascertain conditions in people who attend CHCs with less access to hospital services. In our study, where we have examined a broader range of conditions in individuals who have access to universal care that is free at the point of delivery, using hospital inpatient data alone usually under-ascertains conditions, most consistently for mental health conditions. Given marked socioeconomic gradients in mental-physical health multimorbidity, it is important to ascertain mental and behavioural disorders to represent morbidities experienced by people living in deprived areas [4], who often have poorer health outcomes [31]. This is important in terms of application to health policy where models predicting healthcare costs and risk of admission perform better when two sources, outpatient and prescribing data, are linked than when using single sources [32].
It is important to note the variation in prevalence estimates for individual conditions and multimorbidity where differing rules for ascertainment have been applied across studies. Ascertainment of conditions using the same criteria can be similar; for example, estimates of hypertension prevalence calculated using any-time look-back for clinical codes were 19.52% in this study compared with 18.2% in a recent study using linked primary care to HES in CPRD (both studies examined all ages) [5]. However, where ascertainment rules differ, such as when ascertaining depression, estimated prevalence was lower (13.31%) in the current study, where a 1-year look-back for either clinical codes or prescribing activity was necessary to reach the diagnosis, compared with a higher estimate of 17.3% in the recent study where any-time look-back for codes was used [5]. Multimorbidity prevalence estimates and associations with adverse outcomes vary across studies where studies use different data sources and numbers and selections of conditions. For example, in the current study, the aOR for mortality was 8.34 (95% CI 8.02-8.68) when using linked PC-HI data. This is higher than in a recent similar study, also in the SAIL Databank, that used primary care data to define the 40 long-term conditions described by Barnett et al. [22] and reported a hazard ratio of 5.14 (95% CI 4.95-5.34) for mortality in people with five or more long-term conditions [33]. Although the aOR in the current study cannot be directly compared with an HR, the result was similar when using only PC data.
Similar to the 14-fold higher ascertainment of depression using only PC versus only HI data in this study, previous studies have shown that the recording of depression in hospital data is incomplete, which has been attributed to clinicians considering depression as being non-relevant to admissions for physical conditions [34]. A similar pattern is seen in studies examining ascertainment of musculoskeletal conditions in hospital data, in particular where clinicians under-reported back pain [35]. In this study, osteoarthritis was 1.7 times more commonly identified from only PC than only HI data, although both may under-ascertain because patients can under-consult medical professionals about this condition, considering it to be part of the normal ageing process [36]. There was substantial agreement between data sources for several conditions, including inflammatory bowel disease, where this is likely to reflect the need for shared primary and hospital care and the frequency of hospital admission for acute disease flares [37].
The implication of the study for clinicians and managers is that coding of conditions in the two settings seems inconsistent, often reflecting a manual transfer of diagnoses between settings. More consistent coding is important to improve information transfer across the primary care-hospital boundary, which is a critical underpinning for good care. It will be necessary to examine the effects of the implementation of SNOMED-CT codes in electronic health records in the UK. Due to the use of a more consistent medical vocabulary, it is anticipated that the introduction of SNOMED-CT will improve precision in the exchange of clinical information between primary and secondary care settings for both clinical and research purposes and therefore comparability with international studies where the same system is used [38]. For researchers, the key implication is to recognise that only using hospital data is likely to seriously under-ascertain many conditions (although it will identify people with more severe disease for some conditions like heart failure), which will particularly matter in studies of multimorbidity or in studies where mental health is important. There are, however, certain circumstances where it is appropriate to use primary care data alone, for example when examining trajectories of workload pressures in primary care [39] or changes in quality of care in relation to financial incentives offered to general practitioners [40]. Further research is needed to examine the validity of diagnoses recorded in both primary and secondary care, with accurate estimation of false positive and false negative rates for different choices of data source. For multimorbidity studies involving large numbers of conditions, it is unlikely that gold-standard medical data review at the scale required is feasible, but code lists can be at least partially validated by examining associations with a range of other data including clinical outcomes, laboratory, and prescribing data.

Conclusions
This study highlights the importance of, where available, linking primary care and hospital inpatient data when measuring multimorbidity to avoid underestimation of prevalence and underrepresentation of certain population groups. Robust and consistent methods across studies are needed to improve comparability and reproducibility and ultimately improve the quality of research and the clinical trials and guideline development needed to support people with multimorbidity.

Fig. 2 Forest plot of primary care to hospital inpatient data prevalence ratios. Prevalence ratios are calculated by dividing prevalence using only primary care (PC) data by prevalence using only hospital inpatient (HI) data: PC/HI ratio. Error bars represent 95% confidence intervals. The vertical dotted line represents where the PC/HI ratio is 1, meaning the prevalence rate is the same using both PC and HI data. Where the ratio is > 1, the prevalence was higher using PC versus HI data. Conversely, a ratio < 1 represents conditions where prevalence is higher using HI versus PC data. Conditions are grouped by ICD-10 chapter

Table 1
Study population characteristics for the whole study cohort and by multimorbidity measured using different data sources

Table 2
Prevalence of long-term conditions using primary care, hospital inpatient, and linked primary care to hospital inpatient data