The EPIC study
EPIC is an ongoing multi-center cohort study including approximately 520,000 participants recruited between 1992 and 2000 from ten European countries [10]. Female participants (n = 367,903) were aged 35–75 years at recruitment. Detailed dietary, lifestyle, reproductive, medical, and anthropometric data were collected at inclusion [10]. Around 246,000 women from all countries provided a baseline blood sample. Blood was collected according to a standardized protocol in France, Germany, Greece, Italy, the Netherlands, Norway, Spain, and the UK [10]. Serum (except in Norway), plasma, erythrocytes, and buffy coat aliquots were stored in liquid nitrogen (−196°C) in a centralized biobank at IARC. In Denmark, blood fractions were stored locally in the vapor phase of liquid nitrogen containers (−150°C), and in Sweden, they were stored locally at −80°C in standard freezers. All participants provided written informed consent to participate in the EPIC study. This study was approved by the ethics committee of the International Agency for Research on Cancer (IARC) and by the ethics committees of all participating centers.
Study population and cross-sectional design
This study included all female EPIC participants (1) who provided a blood sample; (2) who were previously included in one of six case-control studies on cancer etiology nested within the EPIC cohort (on breast [1], endometrial [11], colorectal [12], kidney [13], liver [14], and gallbladder cancers) with available blood concentrations of acetylcarnitine, arginine, asparagine, PCs aa C36:3, ae C34:2, ae C36:2, ae C36:3, and ae C38:2 measured by the same targeted metabolomics approach; (3) who were included as control participants in these studies (i.e., free of cancer (except non-melanoma skin cancer) at the time of the diagnosis of the cases, using incidence-density sampling, and matched to cases by age, sex, study center, time of blood collection, fasting status at blood collection (except for kidney cancer study), menopausal status and exogenous hormone use at blood collection (for breast, endometrial, liver, and gallbladder studies), and phase of menstrual cycle (for breast and endometrial cancer studies)); and (4) whose samples were included in an analytical batch including at least 10 samples, to ensure proper normalization of metabolite concentrations (see the “Statistical analyses” section) (N = 3163).
We then excluded women who declared use of hormones at blood collection (n = 768), and those whose hormone use status at blood collection was unknown (n = 37), because associations between the studied metabolites and breast cancer risk were limited to hormone non-users [1]. The current analysis included data from 2358 participants.
The 2358 participants were split into a discovery set (N = 1572, 66.7% of the population) and a validation set (N = 786, 33.3% of the population). The metabolites of interest were those found to be associated with breast cancer risk, and that association could itself result from associations between the metabolites and some of the correlates examined in the present work. The discovery set therefore included all controls from the breast cancer study (n = 1079) and randomly selected controls from the other nested case-control studies (n = 493), while the validation set included no participants from the breast cancer study. In this way, associations identified in the discovery set and confirmed in the validation set cannot be driven by the breast cancer study alone.
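The allocation rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `(participant_id, study)` tuple layout, and the random seed are all assumptions made for the example.

```python
import random

def split_discovery_validation(participants, n_discovery, seed=0):
    """Split (participant_id, study) tuples into discovery and validation sets.

    All controls from the 'breast' study go to the discovery set; the rest of
    the discovery set is drawn at random from the other studies. The tuple
    layout and the seed are illustrative assumptions.
    """
    breast = [p for p in participants if p[1] == "breast"]
    others = [p for p in participants if p[1] != "breast"]
    n_extra = n_discovery - len(breast)
    if n_extra < 0 or n_extra > len(others):
        raise ValueError("n_discovery is incompatible with the study sizes")
    rng = random.Random(seed)
    extra = rng.sample(others, n_extra)
    extra_ids = {p[0] for p in extra}
    discovery = breast + extra
    # Everything not drawn into the discovery set forms the validation set.
    validation = [p for p in others if p[0] not in extra_ids]
    return discovery, validation

# Synthetic cohort with the sizes reported in the text (1079 + 1279 = 2358).
cohort = [(i, "breast") for i in range(1079)]
cohort += [(1079 + i, "other") for i in range(1279)]
discovery, validation = split_discovery_validation(cohort, n_discovery=1572)
print(len(discovery), len(validation))
```

With the reported sizes, 493 of the 1279 non-breast controls are drawn into the discovery set, leaving 786 for validation.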
Laboratory measurements
Before exclusion of hormone users, a total of 3179 samples were available for 3163 women. All samples, plasma (95.1% of samples) or serum, were assayed by liquid chromatography-mass spectrometry using the AbsoluteIDQ p180 commercial kit (Biocrates Life Sciences AG, Innsbruck, Austria). A total of 2289 (72.0%) samples were assayed at the laboratory of the Biomarkers Group at IARC (breast, colorectal, kidney, and liver cancer studies); 851 (26.8%) at Imperial College London; and 39 (1.2%) at the Helmholtz Zentrum, München, Germany. At IARC, analyses were run on QTRAP5500 (breast, kidney, and liver cancer studies) and TQ4500 (colorectal cancer study) mass spectrometers (AB Sciex, Framingham, MA, USA), while at Imperial College London and the Helmholtz Zentrum, analyses were run using an API4000TQ (endometrial and gallbladder cancer studies). All analyses for a given study were performed using the same instrument. Sixteen participants had samples analyzed in two different studies, at IARC and at the Helmholtz Zentrum; for these participants, metabolite concentrations were averaged over the two measurements.
Arginine concentrations could not be quantified in five of the 3179 samples because they were below the lower limit of quantification (LLOQ); these values were imputed to half the LLOQ, consistent with previous work [1].
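The half-LLOQ rule can be written in a few lines. In this sketch the function name is hypothetical, and encoding a non-quantified value as `None` is an assumption, since the text does not say how such values were recorded.

```python
def impute_below_lloq(values, lloq):
    """Replace non-quantified values with half the LLOQ.

    A value counts as non-quantified if it is None (assumed encoding)
    or numerically below the LLOQ.
    """
    return [lloq / 2 if v is None or v < lloq else v for v in values]

# Toy concentrations with an LLOQ of 2.0 (arbitrary units).
print(impute_below_lloq([None, 1.5, 2.0, 10.0], lloq=2.0))
```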
Covariate data
Details of data collection in EPIC are described elsewhere [10]. Lifestyle and medical factors were assessed in the baseline questionnaire. Usual dietary intakes were assessed using center- or country-specific validated questionnaires covering the previous 12 months and matched to the US Department of Agriculture food composition database to estimate macronutrient intakes [15]. Glycemic index and glycemic load were computed. In all EPIC centers, except France, Oxford, and Norway, height, weight, and waist and hip circumference were measured on all participants using similar protocols (in Umeå (Sweden), only weight and height were measured). In France and Oxford, weight, height, and waist and hip circumferences were measured in a sub-set of participants, but self-reported weight and height were obtained from all individuals, and validation studies showed high correlations between self-reported and measured values (r ≥ 0.90) [16, 17]. In Oxford, self-reported measurements also included waist and hip circumferences. In Norway, only self-reported height and weight were available.
Dietary data were used to compute the inflammatory score of the diet (ISD) [18] (reflecting the inflammatory potential of the diet based on 28 dietary components), the modified Mediterranean diet score [19] (a 9-component score indicating the degree of adherence to the traditional Mediterranean diet; 0 minimal adherence to 9 maximal adherence), and the Diet Quality Index-International (DQI-I; a 17-component score based on general nutritional guidelines [20, 21]; 0 to 100, minimal to maximal diet quality). Dietary and lifestyle data were combined to calculate the Healthy Lifestyle Index (HLI) [22], designed to reflect five components of lifestyle factors (smoking, alcohol consumption, diet (cereal fibers, red and processed meat, the ratio of polyunsaturated to saturated fatty acids, margarine, glycemic load, and fruits and vegetables), physical activity, and body mass index; ranging from 0, least healthy, to 20). Furthermore, we calculated the World Cancer Research Fund/American Institute for Cancer Research score, which reflects recommendations for cancer prevention on weight maintenance, physical activity, intake of food and drinks which promote weight gain, of plant-based foods, of animal-based foods, of alcohol, and breastfeeding [23] (from 0, low adherence to recommendation, to 7 for women).
Statistical analyses
Normalization of metabolite concentrations
A specific statistical pipeline was developed [24] and applied on raw metabolite concentrations (before exclusion of hormone users) to adequately pool measures obtained from different studies, instruments, and laboratories. This pipeline was shown to be efficient in removing unwanted variability and improving the comparability of measurements acquired across different nested studies. Log-transformed concentrations of the metabolites of interest were normalized to remove effects of analytical batch and study, which were estimated as random effects in mixed-effects linear models correcting for possible heteroscedasticity. Corrected metabolite concentrations analyzed in this work correspond to residuals from the model.
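One way to write the normalization model is given below. The notation is introduced here for illustration and is not taken from [24]: y_{ijk} is the raw concentration for participant i in analytical batch j of study k; the study-specific residual variance is one possible way to represent heteroscedasticity; and any fixed-effect terms that the pipeline in [24] may include are omitted.

```latex
\log y_{ijk} = \mu + s_k + b_{j(k)} + \varepsilon_{ijk},
\qquad
s_k \sim \mathcal{N}(0, \sigma_s^2),\quad
b_{j(k)} \sim \mathcal{N}(0, \sigma_b^2),\quad
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma_k^2)

\hat{\varepsilon}_{ijk} = \log y_{ijk} - \hat{\mu} - \hat{s}_k - \hat{b}_{j(k)}
```

Under this sketch, the corrected concentration analyzed in the rest of the work is the residual \hat{\varepsilon}_{ijk}, that is, the log concentration with the estimated study and batch effects removed.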
Missing data
When missing values on covariates represented less than 5% of the overall values, they were imputed to the mode value (categorical variables: number of full-term pregnancies, ever use of oral contraceptive, ever use of hormones for menopause (by menopausal status), education level, physical activity, smoking status, fasting status) or median (continuous variables: age at menarche, age at first full-term pregnancy (among parous women), duration of breastfeeding among women who breastfed, waist circumference, hip circumference, waist/hip ratio, time at blood collection). When missing values represented more than 5% of values for a variable, this variable was categorized, and a “missing” category was created (phase of menstrual cycle at blood collection for pre- and perimenopausal women, breastfeeding, lifetime alcohol consumption, Healthy Lifestyle Index, WCRF/AICR score).
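The decision rule for missing covariate values can be sketched as below. The function name and the use of `None` for a missing value are assumptions. The text does not say how a variable with exactly 5% missing values was handled; this sketch assigns it to the "missing" category branch. Categorizing the observed values of a continuous variable (for example, into quartiles) is left to the caller.

```python
from collections import Counter
from statistics import median

def handle_missing(values, kind, threshold=0.05):
    """Apply the <5% imputation rule or create a 'missing' category.

    values: list with None for missing entries (assumed encoding).
    kind: 'categorical' (mode imputation) or 'continuous' (median imputation).
    Returns (new_values, action), where action is 'none', 'imputed',
    or 'missing_category'.
    """
    n_missing = sum(v is None for v in values)
    if n_missing == 0:
        return list(values), "none"
    observed = [v for v in values if v is not None]
    if n_missing / len(values) < threshold:
        if kind == "categorical":
            fill = Counter(observed).most_common(1)[0][0]  # mode
        else:
            fill = median(observed)
        return [fill if v is None else v for v in values], "imputed"
    # Too many missing values: keep them as an explicit category.
    return ["missing" if v is None else v for v in values], "missing_category"

smoking = ["never"] * 30 + ["former"] * 9 + [None]  # 2.5% missing
print(handle_missing(smoking, "categorical")[1])
```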
Identification of correlates
Participants’ characteristics were described using frequencies for categorical variables and mean (standard deviation) for continuous variables. We calculated partial Pearson’s correlations between metabolite concentrations (adjusted for center and age) and between metabolites and age (adjusted for center).
Analyses were first run in the discovery set. For each metabolite of interest and each lifestyle variable, a linear regression model was built with metabolite concentration as a dependent variable. Models were adjusted for center of recruitment, age at blood collection, menopausal status (premenopausal, perimenopausal, postmenopausal [25]), phase of the menstrual cycle for premenopausal women (follicular, ovulatory, luteal, missing), time of the day, and fasting status at blood collection (“No”: < 3 h since last meal (< 4 h in Umeå), “In between”: 3–6 h (4–8 h in Umeå), and “Yes”: > 6 h (> 8 h in Umeå)). Models that examined age as exposure were not adjusted for age, and models with menopausal status as main exposure were not adjusted for phase of menstrual cycle, as this variable is defined in premenopausal women only.
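The fasting-status categories, with their separate cut-offs for Umeå, translate directly into a small function. The function and parameter names are hypothetical; the cut-offs are those given in the text.

```python
def fasting_status(hours_since_last_meal, umea=False):
    """Classify fasting status at blood collection.

    Cut-offs are 3 and 6 hours, or 4 and 8 hours for the Umeå center.
    """
    lower, upper = (4, 8) if umea else (3, 6)
    if hours_since_last_meal < lower:
        return "No"
    if hours_since_last_meal <= upper:
        return "In between"
    return "Yes"

print(fasting_status(3.5), fasting_status(3.5, umea=True))
```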
Variables tested as possible correlates were age at blood collection (continuous), age at menarche (continuous), total duration of menstrual cycles (quartiles/missing), pregnancy (ever/never), number of full-term pregnancies (continuous), age at first full-term pregnancy (nulliparous/quartiles), breastfeeding (ever/never/missing), duration of breastfeeding (nulliparous/quartiles/missing), use of oral contraceptive (ever/never; current users excluded), menopausal status at blood collection (premenopausal/perimenopausal/postmenopausal), use of hormones for menopause (ever/never; current users excluded), education level (no schooling or primary/technical, professional or secondary/longer education), physical activity (Cambridge Index [26]: inactive/moderately inactive/moderately active/active), smoking status (never/former/current), smoking status combined with intensity (never; current, 1–15 cigarettes/day; current, 16+ cigarettes/day; current, pipe/cigar/occasional; former, quit ≤10 years; former, quit 11–20 years; former, quit > 20 years), baseline alcohol consumption (continuous, g/day), lifetime alcohol consumption (non-drinker; former drinker; current drinker, > 0–3, > 3–12, > 12–24, or > 24 g/day; missing), BMI (continuous, kg/m²), waist circumference (continuous, cm), hip circumference (continuous, cm), waist/hip ratio (continuous), height (continuous, cm), total energy intake (continuous, kcal/day), and the following food components estimated as residuals on total energy intake (continuous, g/day): protein, carbohydrate, starch, sugar, fiber, fat (total), fatty acids (monounsaturated, polyunsaturated, saturated, trans, trans-monoenoic, trans-polyenoic), glycemic index (continuous), glycemic load (continuous), Healthy Lifestyle Index (0–10/11–15/16–20), WCRF/AICR score (quartiles/missing), modified Mediterranean diet score (continuous), diet quality index (continuous), and inflammatory score of the diet (continuous).
For each metabolite, P-values from F-tests for each variable were collected and were corrected for multiple testing by controlling for family-wise error rate at α = 0.05 by permutation-based stepdown minP adjustment of P-values, a method which accounts for dependencies between tests [27].
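The step-down minP adjustment can be sketched as follows, in its usual Westfall-Young form. This is an illustration under stated assumptions rather than the implementation of [27]: the function name is hypothetical, and the per-permutation P-values (one row per permutation, one column per tested variable) are assumed to have been computed beforehand by refitting the models on permuted data.

```python
def stepdown_minp(raw_p, perm_p):
    """Step-down minP adjusted P-values.

    raw_p: list of m observed P-values.
    perm_p: list of B rows, each holding m P-values obtained under
            one permutation (assumed to be precomputed).
    Returns adjusted P-values in the original variable order.
    """
    m, n_perm = len(raw_p), len(perm_p)
    order = sorted(range(m), key=lambda j: raw_p[j])  # most significant first
    counts = [0] * m
    for row in perm_p:
        running_min = 1.0
        # Successive minima, from the least to the most significant variable.
        for k in range(m - 1, -1, -1):
            running_min = min(running_min, row[order[k]])
            if running_min <= raw_p[order[k]]:
                counts[k] += 1
    adjusted = [0.0] * m
    previous = 0.0
    for k in range(m):
        # Enforce monotonicity of the adjusted P-values along the ordering.
        previous = max(previous, counts[k] / n_perm)
        adjusted[order[k]] = previous
    return adjusted
```

Because the minima are taken over the joint permutation distribution, correlated tests are penalized less than they would be by a correction that assumes independence.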
Validation
All statistically significant associations in the discovery set (corrected P-value ≤ 0.05) were assessed in the validation set, using the same model and categories of variables as in the discovery set. In the validation set, a more conservative approach was chosen to control for multiple testing [28], i.e., the Bonferroni correction based on the number of tests run for each metabolite.
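The two-stage logic for a single metabolite can be sketched as below. The function name and the dictionary inputs are hypothetical. The sketch also assumes that "the number of tests run for each metabolite" is the number of associations carried forward to the validation set for that metabolite, which is one reading of the text.

```python
def validated_correlates(discovery_adjusted_p, validation_raw_p, alpha=0.05):
    """Return variables significant in both sets for one metabolite.

    discovery_adjusted_p: variable -> multiplicity-corrected P (discovery set).
    validation_raw_p: variable -> unadjusted P (validation set).
    """
    carried = [v for v, p in discovery_adjusted_p.items() if p <= alpha]
    n_tests = len(carried)  # assumed Bonferroni denominator
    return [
        v for v in carried
        if min(1.0, validation_raw_p[v] * n_tests) <= alpha
    ]

discovery = {"bmi": 0.001, "age": 0.03, "height": 0.40}
validation = {"bmi": 0.01, "age": 0.04, "height": 0.001}
print(validated_correlates(discovery, validation))
```

In this toy example, height is never tested in the validation set because it was not significant in the discovery set, and age fails validation once its P-value is doubled.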
For all variables showing a significant association with the metabolites of interest in both the discovery and validation sets, continuous variables were categorized (quartiles) and means of metabolites, with 95% confidence intervals, were estimated in each category, using the overall dataset (n = 2358).
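A simplified version of this summary is sketched below. It computes unadjusted means with normal-approximation confidence intervals; the authors may instead have estimated model-adjusted means, which this sketch does not attempt. The function name is hypothetical, and each quartile is assumed to contain at least two participants.

```python
from math import sqrt
from statistics import mean, quantiles, stdev

def quartile_means(exposure, metabolite):
    """Mean metabolite level and 95% CI within quartiles of an exposure.

    Returns {quartile: (mean, lower, upper)}. Unadjusted means with a
    normal-approximation interval (an illustrative simplification).
    """
    q1, q2, q3 = quantiles(exposure, n=4)
    groups = {1: [], 2: [], 3: [], 4: []}
    for x, y in zip(exposure, metabolite):
        k = 1 if x <= q1 else 2 if x <= q2 else 3 if x <= q3 else 4
        groups[k].append(y)
    summary = {}
    for k, ys in groups.items():
        m = mean(ys)
        half_width = 1.96 * stdev(ys) / sqrt(len(ys))
        summary[k] = (m, m - half_width, m + half_width)
    return summary

# Toy data: the metabolite level rises linearly with the exposure.
exposure = list(range(1, 41))
metabolite = [float(x) for x in exposure]
for quartile, (m, lo, hi) in quartile_means(exposure, metabolite).items():
    print(quartile, round(m, 2), round(lo, 2), round(hi, 2))
```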
Interactions
For each metabolite and each variable examined as a potential correlate, we investigated interaction with fasting status (no/in between/yes), menopausal status at blood collection (pre-/peri-/postmenopausal), and BMI (18.5–24.9/25–29.9/≥30 kg/m², excluding n = 15 participants with BMI < 18.5 kg/m²), in the discovery set. To do so, an interaction term was added to the model and the P-value associated with this term was evaluated, after correction for multiple testing using the permutation minP algorithm.
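The interaction model can be written schematically as follows. The notation is introduced here for illustration only: M_i is the normalized metabolite concentration, X_i the candidate correlate, Z_i the potential effect modifier (fasting status, menopausal status, or BMI category), and C_i the vector of adjustment covariates. For categorical X or Z, the product stands for the full set of product terms between indicator variables.

```latex
M_i = \beta_0 + \beta_X X_i + \beta_Z Z_i + \beta_{XZ}\,(X_i \times Z_i)
      + \boldsymbol{\gamma}^{\top} C_i + \varepsilon_i,
\qquad
H_0 : \beta_{XZ} = 0
```

The P-value for H_0 is the quantity that then enters the permutation minP correction.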
Sensitivity analyses
We conducted sensitivity analyses (1) excluding participants from the liver and gallbladder studies (n = 128), for which the blood fraction analyzed was serum and not plasma, and (2) excluding participants with self-reported diabetes (n = 71) or with missing data on diabetes status (n = 160) at recruitment.