We conducted this study using data from three Swedish cohorts: the Malmö Offspring Study (MOS), the Malmö Diet and Cancer study (MDC) and the Malmö Preventive Project (MPP). As described below, a diet-associated metabolic signature was generated and internally validated in MOS and then externally validated in MDC. The associations between the metabolic signature and future type 2 diabetes and coronary artery disease (CAD) were tested in both MDC and MPP.
MOS is an ongoing cohort study that was launched in 2013 to map risk factors for chronic diseases. In the present study, the sample consisted of 1538 individuals with overlapping data on metabolomics and adherence to a previously derived data-driven healthy food pattern.
MDC is a population-based prospective cohort study of 28,098 individuals who attended the baseline examination between 1991 and 1996. We had previously included a random sample of 3833 participants from the MDC cardiovascular cohort, of whom 2684 had information on adherence to a previously derived data-driven healthy food pattern. After exclusion of participants with prevalent coronary artery disease (CAD) (n = 0), prevalent type 1 or type 2 diabetes (n = 138), missing data on alcohol intake (n = 1) or smoking status (n = 7), or unknown vital status due to emigration (n = 17), 2521 individuals remained and were included in the statistical analyses.
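The exclusion counts reported for MDC can be tallied as a quick consistency check. The sketch below assumes the exclusion categories are mutually exclusive (each participant is counted in only one category), which the text does not state explicitly; the variable names are illustrative.

```python
# Exclusion counts as reported in the text for the MDC sample.
# Assumption: categories do not overlap, so counts can simply be summed.
participants_with_food_pattern = 2684

exclusions = {
    "prevalent CAD": 0,
    "prevalent type 1 or type 2 diabetes": 138,
    "missing alcohol intake": 1,
    "missing smoking status": 7,
    "unknown vital status (emigration)": 17,
}

remaining = participants_with_food_pattern - sum(exclusions.values())
print(remaining)  # 2521, matching the analysed MDC sample
```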
MPP is another population-based prospective cohort, with 33,346 individuals enrolled between 1974 and 1992. Between 2002 and 2006, all participants still alive were invited to a re-examination, which serves as the baseline in this study. Within a random sample of 5386 of the 18,240 individuals who attended the re-examination, we have previously created a nested case-control study. Of these 5386 individuals, 1406 were excluded because of prevalent type 2 diabetes or CAD, incomplete data on CAD risk factors, or missing plasma samples. Of the remaining 3980 individuals, 382 developed CAD before December 31, 2013, and 203 developed type 2 diabetes. In total, 35 individuals developed both type 2 diabetes and CAD. The remaining 3361 individuals qualified as controls because they did not develop CAD or type 2 diabetes during follow-up. Owing to high analytical demand, 498 of them were randomly selected as controls, resulting in a baseline study sample of 1083 individuals. The median follow-up time was 6.3 years for type 2 diabetes and 7.2 years for CAD.
At the baseline examination of each cohort, covariates were collected primarily through questionnaires combined with a visit to a research nurse, who conducted standardised anthropometric measurements and blood sampling. BMI was calculated from the weight and height measured at the baseline visit. Supine blood pressure (mm Hg) was measured once after 10 min of rest. Use of antihypertensive medication was identified through a questionnaire in which participants listed their daily medications.
In MDC, physical activity was assessed using a questionnaire covering 17 different activities, adapted from the Minnesota Leisure Time Physical Activity Questionnaire, and participants were split into three equally sized groups: low, medium and high. In MPP, physical activity was classified into four categories in a questionnaire, as previously described. Because the highest category contained only two participants, they were merged into the “high” activity group, leaving three groups. Participants with missing data on physical activity were assigned to the middle group, which was the largest. Smoking status was defined as smoking or non-smoking based on self-report. Ex-smokers were classified as non-smokers. In MDC, total alcohol consumption was defined by a four-category variable created by combining information from the questionnaire and the 7-day menu book, as previously described. After the exclusions in MDC described above and the imputation of physical activity in MPP, no covariate values were missing.
Baseline blood samples were drawn for analysis of blood lipids (total cholesterol, HDL-cholesterol and triglycerides) and blood glucose according to standard procedures at the Department of Clinical Chemistry, Malmö University Hospital. LDL-cholesterol concentration was calculated according to the Friedewald formula. Aliquots of plasma were collected in citrate-coated vials in MDC and in EDTA-coated vials in MPP and MOS, and were stored frozen at −80 °C until extraction for metabolomics analysis as described below.
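For reference, the Friedewald formula mentioned above estimates LDL-cholesterol (LDL-C) from total cholesterol (TC), HDL-cholesterol (HDL-C) and triglycerides (TG). The form below assumes concentrations in mmol/L, which the text does not state; with mg/dL the divisor is 5 instead of 2.2.

```latex
\[
\text{LDL-C} = \text{TC} - \text{HDL-C} - \frac{\text{TG}}{2.2}
\qquad \text{(all concentrations in mmol/L)}
\]
```

The estimate is generally considered unreliable at high triglyceride concentrations (above roughly 4.5 mmol/L).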
Endpoints were retrieved by linking the ten-digit Swedish personal identification number with three registers: the Swedish Hospital Discharge Register, the Swedish Cause of Death Register, and the Swedish Coronary Angiography and Angioplasty Registry (SCAAR), as previously described. These registers have been previously described and validated for classification of outcomes. CAD was defined as coronary artery revascularization, fatal or non-fatal myocardial infarction, or death due to ischemic heart disease. Myocardial infarction was defined on the basis of International Classification of Diseases, 9th revision (ICD-9) code 410 or ICD-10 code I21. Death attributable to ischemic heart disease was defined as ICD-9 codes 412 and 414, or ICD-10 codes I22, I23, or I25. Coronary artery bypass surgery was identified from the national Swedish classification systems of surgical procedures and defined as procedure codes 3065, 3066, 3068, 3080, 3092, 3105, 3127, or 3158 in the Op6 system or as procedure code FN in the KKÅ97 system. Percutaneous coronary intervention was identified from SCAAR.
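The CAD endpoint definition can be expressed as a small rule set. The sketch below is illustrative only: the function name, the argument layout and the use of prefix matching on codes (so that, for example, "I21.9" counts as I21 and "FNC10" counts as FN) are assumptions, not details taken from the register linkage itself.

```python
# Code lists copied from the endpoint definition in the text.
MI_CODES = {"icd9": ("410",), "icd10": ("I21",)}
IHD_DEATH_CODES = {"icd9": ("412", "414"), "icd10": ("I22", "I23", "I25")}
CABG_OP6 = {"3065", "3066", "3068", "3080", "3092", "3105", "3127", "3158"}
CABG_KKA97_PREFIX = "FN"


def is_cad_event(system, code, fatal=False, pci_in_scaar=False):
    """Return True if a register record qualifies as a CAD event.

    system: "icd9", "icd10", "op6" or "kka97" (hypothetical labels).
    fatal: True if the code is the underlying cause of death.
    pci_in_scaar: True if a percutaneous coronary intervention is
        recorded in SCAAR for this participant.
    """
    if pci_in_scaar:
        return True
    if system == "op6":
        return code in CABG_OP6
    if system == "kka97":
        return code.startswith(CABG_KKA97_PREFIX)  # assumed prefix match
    if system in MI_CODES:
        # Myocardial infarction counts whether fatal or non-fatal.
        if code.startswith(MI_CODES[system]):
            return True
        # Other ischemic heart disease codes count only as cause of death.
        if fatal and code.startswith(IHD_DEATH_CODES[system]):
            return True
    return False
```

For example, `is_cad_event("icd10", "I25.1")` is false for a hospital discharge but true with `fatal=True`.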
Incident diabetes cases were retrieved from six different national and regional diabetes registers, as described elsewhere. Prevalent diabetes mellitus at baseline was defined as a fasting whole blood glucose ≥ 6.1 mmol/L (corresponding to a plasma glucose of ≥ 7.0 mmol/L), a history of physician-diagnosed diabetes mellitus, use of antidiabetic medication, or registration in any of the six national and regional diabetes registers.
The date of last follow-up was 2016-12-31 in MDC and 2013-12-31 in MPP.
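The baseline definition of prevalent diabetes is a logical OR of four criteria. A minimal sketch, with hypothetical argument names and a missing glucose value treated as not meeting the glucose criterion (an assumption):

```python
def has_prevalent_diabetes(
    fasting_whole_blood_glucose=None,  # mmol/L, None if not measured
    physician_diagnosis=False,
    antidiabetic_medication=False,
    in_diabetes_register=False,
):
    """Apply the baseline definition of prevalent diabetes from the text."""
    high_glucose = (
        fasting_whole_blood_glucose is not None
        and fasting_whole_blood_glucose >= 6.1
    )
    return bool(
        high_glucose
        or physician_diagnosis
        or antidiabetic_medication
        or in_diabetes_register
    )
```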
Health-conscious food patterns
In this study, we utilised two published data-driven dietary patterns: a health-conscious food pattern from MOS and a health-conscious food pattern from MDC, both of which were created using principal component analysis to reduce food groups to dietary patterns. In MDC, dietary data were collected using a modified diet history method that combined a 7-day menu book, a food frequency questionnaire and a 45-min interview [23, 24]. In MOS, diet was assessed using the 4-day online food record Riksmaten2010, developed by the Swedish National Food Agency, and a short food frequency questionnaire [25, 26]. The food patterns had similar loadings in MOS and MDC (Additional file 1: Supplementary method).
Profiling of plasma metabolites was performed by liquid chromatography-mass spectrometry (LC-MS) on a UPLC-QTOF-MS system (Agilent Technologies 1290 LC, 6550 MS, Santa Clara, CA, USA) and has been described elsewhere. Briefly, overnight-fasted plasma samples were extracted and subsequently separated on an Acquity UPLC BEH Amide column (1.7 μm, 2.1 × 100 mm; Waters Corporation, Milford, MA, USA).
We identified metabolites by matching the measured mass-to-charge ratio (m/z) and chromatographic retention times with an in-house metabolite library consisting of 111 metabolites that were measurable in all three cohorts (Additional file 1: Table S1). Of the 111 metabolites, 25, mostly acylcarnitines, had putative identities based on their fragmentation spectra, and the rest had confirmed identities (Additional file 1: Table S1). Metabolite peak areas were integrated using Agilent Profinder B.06.00 (Agilent Technologies, Santa Clara, CA, USA). The normalisation of metabolite levels is described in the supplementary method (Additional file 1: Supplementary method).
All statistical analyses were done using R (version 4.0.4). To create a metabolic signature for health-conscious eating in MOS, partial least squares (PLS) regression was applied with metabolite data as X and the health-conscious food pattern in MOS as Y, using the package mixOmics (version 6.14.0). The model was trained on a randomly selected 80% of the MOS participants. The number of principal components included in the model was determined by calculating the Q2 (predicted variation) and R2 (explained variation) values using ten-fold cross-validation and a threshold of Q2 > 0.0975. This resulted in only one principal component, which we named the metabolic signature. The results were validated in the remaining 20% using Pearson correlation after calculating the metabolic signature with the “predict” function in mixOmics. We tested correlations between the metabolic signature and intake of food groups in MOS using Pearson correlation. The “predict” function in mixOmics was further used to calculate the metabolic signature in MDC and MPP. The correlation between the metabolic signature and the health-conscious food pattern in MDC was tested using Pearson correlation as well as partial Pearson correlation adjusted for sex, age and body mass index (BMI).
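To illustrate the workflow described above (train on 80%, choose components by cross-validated Q2, validate on the remaining 20% by Pearson correlation), the following is a minimal sketch of a one-component PLS model for a single response, written in plain Python on synthetic data. It is not the mixOmics implementation: the function names are hypothetical, predictors are centred and scaled with training-set statistics (an assumption), and Q2 is computed in the simplified form 1 − PRESS/TSS, which may differ in detail from how mixOmics reports it.

```python
import random
from statistics import mean, stdev


def fit_pls1(X, y):
    """Fit a one-component PLS model for a single response."""
    p = len(X[0])
    mu = [mean(row[j] for row in X) for j in range(p)]
    sd = [stdev([row[j] for row in X]) or 1.0 for j in range(p)]
    y_mu = mean(y)
    Z = [[(row[j] - mu[j]) / sd[j] for j in range(p)] for row in X]
    yc = [v - y_mu for v in y]
    # Weight vector: covariance direction between predictors and response.
    w = [sum(z[j] * r for z, r in zip(Z, yc)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    t = [sum(zj * wj for zj, wj in zip(z, w)) for z in Z]  # scores
    c = sum(ti * ri for ti, ri in zip(t, yc)) / sum(ti * ti for ti in t)
    return {"mu": mu, "sd": sd, "y_mu": y_mu, "w": w, "c": c}


def signature(model, X):
    """Project new samples onto the component (the 'metabolic signature')."""
    mu, sd, w = model["mu"], model["sd"], model["w"]
    return [sum((x[j] - mu[j]) / sd[j] * w[j] for j in range(len(w))) for x in X]


def predict(model, X):
    return [model["y_mu"] + model["c"] * t for t in signature(model, X)]


def q2_kfold(X, y, k=10, seed=1):
    """Simplified cross-validated Q2 = 1 - PRESS / TSS."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    press = 0.0
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        m = fit_pls1([X[i] for i in train], [y[i] for i in train])
        preds = predict(m, [X[i] for i in fold])
        press += sum((y[i] - yhat) ** 2 for i, yhat in zip(fold, preds))
    y_mu = mean(y)
    return 1 - press / sum((v - y_mu) ** 2 for v in y)


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den


# Synthetic stand-in for metabolite data (X) and a food pattern score (y).
rng = random.Random(0)
n, p = 200, 5
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [row[0] - 0.5 * row[1] + rng.gauss(0, 0.5) for row in X]

cut = int(0.8 * n)  # 80% training, 20% validation
model = fit_pls1(X[:cut], y[:cut])
q2 = q2_kfold(X[:cut], y[:cut])
keep_component = q2 > 0.0975  # threshold quoted in the text
r_holdout = pearson(signature(model, X[cut:]), y[cut:])
```

The same `signature` call, applied to another cohort's metabolite matrix, mirrors how a trained model is used to score external data.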
To test the associations between the metabolic signature and type 2 diabetes and CAD, together referred to as cardiometabolic disease, prospective data were used in both MPP and MDC. First, we constructed Kaplan–Meier curves in MDC for type 2 diabetes and CAD separately, with participants split into quintiles of the metabolic signature. Differences in risk between quintiles in the Kaplan–Meier analysis were evaluated using the log-rank test.
To further explore the phenotype of the metabolic signature, baseline characteristics were summarised by quintile of the metabolic signature in both MPP and MDC. Differences were tested using ANOVA for continuous variables and the chi-square test for categorical variables.
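The quintile split and the Kaplan–Meier estimate can be sketched in a few lines of plain Python. This is an illustration on toy data with hypothetical function names, not the analysis code; ties in the signature are broken by input order (an assumption), and the log-rank test is not included.

```python
def quintile_labels(values):
    """Assign each value to a quintile 1..5 by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // len(values) + 1
    return labels


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times: follow-up time per participant.
    events: 1 if the event occurred at that time, 0 if censored.
    """
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # events + censored at t
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


# Toy example: five participants, two of them censored.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 0, 1])
toy_quintiles = quintile_labels([0.3, -1.2, 0.8, 2.1, -0.4, 1.5, 0.0, -2.0, 0.9, 1.1])
```

In practice one curve would be estimated per quintile by calling `kaplan_meier` on the participants sharing each label.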
In the logistic and proportional hazards regression analyses, the metabolic signature was entered as a continuous variable that was mean-centred and scaled to unit variance.
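Mean centring and unit-variance scaling amount to a z-score transformation. A minimal sketch, assuming the sample standard deviation (n − 1 denominator) is used; the function name is illustrative:

```python
from statistics import mean, stdev


def standardise(values):
    """Mean-centre and scale to unit variance (z-scores)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


scaled = standardise([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])
```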
In MDC, Cox proportional hazards regression was used to create three models associating the metabolic signature with CAD and type 2 diabetes separately. Model 1 was unadjusted; model 2 was adjusted for the potential confounders smoking, age, sex, alcohol intake and physical activity. Model 3 was additionally adjusted for the potential mediators LDL cholesterol, HDL cholesterol, glucose, triglycerides, BMI, systolic blood pressure and use of antihypertensive medication. Model 2 was considered the main analysis, whereas model 3 additionally adjusted for the above-mentioned potential mediators, which are established risk factors for cardiometabolic disease. Smoking status, sex, alcohol intake and physical activity were adjusted for as categorical variables and the remaining covariates as continuous variables. The proportional hazards assumption was tested using the “cox.zph” function in the “survival” package. Years to event or to last follow-up was used as the underlying time variable in the Cox regressions. The association between the metabolic signature and CAD in MDC was also tested with logistic regression. As MPP had a nested case-control design, as described above, we used logistic regression to test the association between the metabolic signature and future disease. We created three models for CAD and three models for type 2 diabetes that were adjusted for the same variables as Cox regression models 1–3, except for alcohol intake, which was not included in the MPP models because MPP has no baseline estimate of alcohol intake. Analyses were considered significant if the p value was below 0.05.
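In generic notation (the symbols are ours, not the authors'), the two model families can be written as follows, where \(s\) is the standardised metabolic signature and \(\mathbf{z}\) is the vector of covariates included in a given model (empty for model 1):

```latex
\[
h(t \mid s, \mathbf{z}) = h_0(t)\,
  \exp\!\left(\beta_{\mathrm{Cox}}\, s + \boldsymbol{\gamma}^{\top}\mathbf{z}\right),
\qquad
\mathrm{HR} = e^{\beta_{\mathrm{Cox}}}
\]
\[
\operatorname{logit}\Pr(Y = 1 \mid s, \mathbf{z})
  = \alpha + \beta_{\mathrm{logit}}\, s + \boldsymbol{\delta}^{\top}\mathbf{z},
\qquad
\mathrm{OR} = e^{\beta_{\mathrm{logit}}}
\]
```

Because \(s\) has unit variance, the hazard ratio and odds ratio are expressed per one standard deviation increase in the metabolic signature.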