Cohort design and study population
The Lifelines cohort study is a multidisciplinary prospective population-based cohort study that examines, in a unique three-generation design, the health and health-related behaviors of 167,729 persons living in the north of The Netherlands. It employs a broad range of investigative procedures to assess the biomedical, socio-demographic, behavioral, physical, and psychological factors that contribute to health and disease in the general population.
Participants were included in the study between 2006 and 2013. So far, four assessment rounds have taken place: baseline (T1) and three follow-up rounds, with median (interquartile range) months to follow-up of T2=13 (13-15), T3=25 (23-28), and T4=44 (35-51). Comprehensive physical examinations, biobanking, and questionnaires were conducted at T1 and T4, and follow-up questionnaires (including questions on diabetes status) were issued to participants at T2 and T3. The timeline of data collection of the Lifelines cohort study is presented in Additional file 1: Fig. S1. Before study entry, a signed informed consent form was obtained from each participant. The Lifelines study is conducted according to the principles of the Declaration of Helsinki and was approved by the Medical Ethics Committee of the University Medical Center Groningen, The Netherlands (approval number 2007/152). The overall design and rationale of the study have been described in detail elsewhere [25, 26].
Participants aged between 35 and 70 years who were free of diabetes at baseline and for whom valid dietary intake data was available were included in this study. The ascertainment of prevalent diabetes cases at baseline was based on (1) self-report questionnaires, (2) fasting glucose ≥ 7.0 mmol/L, (3) HbA1c ≥ 48 mmol/mol (6.5%) [27], and (4) use of glucose-lowering medication (ATC code A10) [28]. Dietary intake data was considered unreliable when the ratio between reported energy intake and basal metabolic rate (calculated with the Schofield equation [29]) was below 0.50 or above 2.75 (based on the considerations by Goldberg [30]). Furthermore, participants for whom only baseline data was available, or who reported the development of type 1 diabetes or gestational diabetes during follow-up, were excluded. In total, 70,421 participants (41,243 women and 29,178 men) were included in the analysis (Additional file 1: Fig. S2).
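The plausibility screen for dietary data described above can be illustrated with a minimal sketch. It assumes that the basal metabolic rate has already been computed with the Schofield equation (the equation itself is not reproduced here) and that both quantities are expressed in the same energy unit. The function name and arguments are hypothetical and not taken from the Lifelines data dictionary.

```python
def has_plausible_energy_intake(energy_intake, bmr, lower=0.50, upper=2.75):
    """Return True when reported energy intake / BMR lies within [lower, upper].

    Assumptions (not from the source): `bmr` was precomputed with the
    Schofield equation and uses the same unit as `energy_intake`.
    Ratios below 0.50 or above 2.75 are treated as unreliable.
    """
    if bmr <= 0:
        raise ValueError("bmr must be positive")
    ratio = energy_intake / bmr
    return lower <= ratio <= upper


# Illustrative values only (not study data)
print(has_plausible_energy_intake(2000, 1500))  # ratio of about 1.33
print(has_plausible_energy_intake(600, 1500))   # ratio of 0.40
```

Values exactly at 0.50 or 2.75 are kept here, because the text excludes only ratios below or above these cutoffs.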
Data collection
Ascertainment of incident type 2 diabetes
Incident type 2 diabetes was assessed by self-report questionnaires at the two follow-ups (T2, from year 2011 to 2015; and T3, from year 2012 to 2016) and the second assessment (T4, from year 2014 to 2018). Additionally, blood glucose and HbA1c measurements were available at the second assessment (T4). Participants were considered an incident case if they met one of the following criteria: (1) self-reported newly developed type 2 diabetes since the last questionnaire they completed, (2) fasting glucose ≥ 7.0 mmol/L, or (3) HbA1c ≥ 48 mmol/mol (6.5%) [27]. Data on prescribed medication was not available during follow-up, and the precise time of diabetes diagnosis was not documented.
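The case definition above can be expressed as a small function. This is a sketch under stated assumptions: the argument names are hypothetical, self-reports are passed as booleans (False when nothing was reported), and a missing T4 laboratory value is treated as not meeting the corresponding criterion, which the text does not specify.

```python
GLUCOSE_CUTOFF_MMOL_L = 7.0    # fasting glucose threshold from the text
HBA1C_CUTOFF_MMOL_MOL = 48.0   # HbA1c threshold from the text (6.5%)


def is_incident_case(self_report_t2, self_report_t3, self_report_t4,
                     glucose_t4=None, hba1c_t4=None):
    """Return True if any one of the three incident-case criteria is met.

    Assumption: None for a T4 measurement means "not measured" and does
    not count toward case status.
    """
    self_reported = self_report_t2 or self_report_t3 or self_report_t4
    high_glucose = glucose_t4 is not None and glucose_t4 >= GLUCOSE_CUTOFF_MMOL_L
    high_hba1c = hba1c_t4 is not None and hba1c_t4 >= HBA1C_CUTOFF_MMOL_MOL
    return bool(self_reported or high_glucose or high_hba1c)
```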
Clinical measurements
Blood samples were collected by venipuncture in a fasting state between 8 and 10 am and were further transferred to the Lifelines central laboratory for analysis. Serum levels of glucose and HbA1c were subsequently analyzed. Anthropometric measurements were made by trained research staff following standardized protocols. These measurements were performed without shoes and heavy clothing. BMI was calculated as weight in kilograms divided by the square of height in meters.
Dietary assessment
At baseline, dietary consumption was assessed using a validated 110-item semi-quantitative food frequency questionnaire (FFQ), which was designed to assess the food consumption (including alcohol) over the previous month [31]. The questionnaire assessed the frequency of consumption and portion sizes, the latter of which were estimated by fixed portion sizes (e.g., slices of bread, pieces of fruit) and commonly used household measures (e.g., cups, spoons). For insight into the overall diet quality, the food-based Lifelines Diet Score (LLDS) was calculated. This score ranks the relative intake of nine food groups with positive health effects (vegetables, fruit, whole grain products, legumes/nuts, fish, oils/soft margarines, unsweetened dairy, coffee, and tea) and three food groups with negative health effects (red/processed meat, butter/hard margarines, and sugar-sweetened beverages). The development of this score is described in detail elsewhere [32].
Categorizing the degree of food processing—the NOVA classification
The NOVA classification was used to categorize all 110 food items into the four proposed categories: (1) unprocessed or minimally processed food (e.g., fresh vegetables/fruits, unprocessed meat), (2) processed culinary ingredients (e.g., butter/oil for cooking, sugar, salt), (3) processed food (e.g., canned vegetables/fish, fruits in syrup), and (4) ultra-processed food (UPF; e.g., processed meat, soft drinks) [9]. The proportion (weight ratio, %) of UPF intake in the total weight of food and beverages consumed per day was calculated and then divided into sex-specific quartiles for further analyses. Using the weight ratio of UPF intake accounts for foods that do not provide energy (e.g., artificially sweetened beverages) as well as for non-nutritional factors (e.g., additives, by-products formed during processing). The categorization of the items was verified by four of the authors and can be found in Additional file 1: Table S1.
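The weight ratio and the sex-specific quartiles can be sketched as follows. The item names, gram amounts, and function names are illustrative assumptions. The quartile cut points use the default method of `statistics.quantiles`, which may differ from the method used in the original analysis, and a value equal to a cut point is assigned to the lower quartile.

```python
from bisect import bisect_left
from statistics import quantiles


def upf_weight_percent(grams_by_item, nova_by_item):
    """Share (%) of NOVA group 4 items in total daily food and beverage weight."""
    total = sum(grams_by_item.values())
    upf = sum(g for item, g in grams_by_item.items() if nova_by_item[item] == 4)
    return 100.0 * upf / total


def sex_specific_quartiles(values, sexes):
    """Assign quartile 1-4 to each value using cut points computed within each sex."""
    cuts = {}
    for sex in set(sexes):
        group = [v for v, s in zip(values, sexes) if s == sex]
        cuts[sex] = quantiles(group, n=4)  # three cut points per sex
    return [bisect_left(cuts[s], v) + 1 for v, s in zip(values, sexes)]


# Illustrative daily intake in grams (not study data)
grams = {"fresh fruit": 150, "butter": 10, "canned fish": 40, "soft drinks": 300}
nova = {"fresh fruit": 1, "butter": 2, "canned fish": 3, "soft drinks": 4}
print(upf_weight_percent(grams, nova))  # 60.0
```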
Assessment of other baseline covariates
Age, smoking status, TV watching time, and educational level were assessed by self-administered questionnaires. Smoking status was categorized as never, former, and current smoker. The highest educational level achieved was categorized as (1) low—junior general secondary education or lower (International Standard Classification of Education [ISCED] level 0, 1, or 2); (2) middle—secondary vocational education and senior general secondary education (ISCED level 3 or 4); and (3) high—higher vocational education or university (ISCED level 5 or 6) [33]. The validated Short QUestionnaire to ASsess Health-enhancing physical activity (SQUASH) was used to assess physical activity level [34]. From the SQUASH data, leisure time and commuting physical activities, including sports, at moderate (4.0–6.4 MET) to vigorous (≥ 6.5 MET) intensity (non-occupational moderate-to-vigorous physical activity [MVPA]), were calculated in minutes per week [34]. The variable was categorized by dividing participants who reported any non-occupational MVPA into sex-specific quartiles. For participants who reported zero non-occupational MVPA, the categorical variable was coded as 0.
Statistical analysis
Consumption patterns of ultra-processed food
As UPF is highly heterogeneous in multiple respects (i.e., nutrient density, nutrient composition, taste, snack or main meal items), it is difficult to create well-founded subgroups. Therefore, instead of using a priori defined subgroups, we used principal component analysis (PCA) to derive underlying consumption patterns of UPF and thereby obtain real-world insight into the intake of this highly heterogeneous food category. Based on the scree plot, eigenvalues, and explained variance, four UPF consumption patterns were retained. Thereafter, the derived components were orthogonally rotated, which keeps the components uncorrelated while enhancing interpretability. We selected food items with absolute factor loadings ≥ 0.20 to construct simplified pattern scores while retaining the weight (factor loading) of each selected food item. The simplified UPF consumption pattern scores (hereafter referred to as UPF consumption patterns) were standardized and then divided into sex-specific quartiles for further analyses. Sensitivity analysis was performed by repeating the PCA procedure 3 times on a random half sample.
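The pattern derivation above can be sketched with NumPy. Several details are assumptions rather than the documented procedure: PCA is run on the correlation matrix of item intakes, varimax is used as the orthogonal rotation (the text states only that the rotation was orthogonal), and the simplified scores are loading-weighted sums of standardized intakes. All function names are hypothetical.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Varimax rotation of an (items x components) loading matrix."""
    n_items, n_comp = loadings.shape
    rotation = np.eye(n_comp)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        gradient = loadings.T @ (
            rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / n_items
        )
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt  # closest orthogonal matrix to the gradient
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break  # criterion no longer improves
        criterion = new_criterion
    return loadings @ rotation


def upf_pattern_scores(intake, n_components=4, threshold=0.20):
    """Return rotated loadings and standardized simplified pattern scores.

    intake: (participants x UPF items) array of item intakes.
    """
    z = (intake - intake.mean(axis=0)) / intake.std(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    rotated = varimax(loadings)
    # Keep only items with |loading| >= threshold, retaining their loading as weight
    weights = np.where(np.abs(rotated) >= threshold, rotated, 0.0)
    scores = z @ weights
    scores = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    return rotated, scores
```

The default of four components mirrors the number of patterns retained in the text. In practice that number would be chosen from the scree plot and eigenvalues rather than fixed in advance.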
Risk of incident type 2 diabetes
Associations of UPF intake (total intake [continuous or sex-specific quartiles] and UPF consumption patterns [continuous or sex-specific quartiles]) with incident type 2 diabetes were estimated with logistic regression models, and results are presented as odds ratios (ORs) with 95% confidence intervals. In models where UPF intake was included as a continuous variable (weight ratio), ORs per 10% absolute increment of UPF in the total diet were calculated. In four steps, the analyses were adjusted for (1) age and sex; (2) diet quality (LLDS), total energy intake, and alcohol intake; (3) non-occupational MVPA, TV watching time, smoking status, and educational level; and (4) BMI (continuous). The addition of BMI in the last step aimed to investigate the role of this intermediate factor in the association between UPF and type 2 diabetes. Additionally, possible effect modification by sex was tested by including an interaction term for sex and UPF intake in the models. Multiple imputation by chained equations was used to handle missing data for non-occupational MVPA (6.5% missing), TV watching time (0.6% missing), smoking status (0.6% missing), and educational level (0.4% missing).
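The conversion from a logistic regression coefficient to an OR per 10% absolute increment can be illustrated as follows. The sketch assumes that the coefficient is expressed per one percentage point of UPF weight ratio and that a Wald-type confidence interval is used. Neither detail is stated in the text, and the function name is hypothetical.

```python
from math import exp
from statistics import NormalDist


def odds_ratio_per_increment(beta, se, increment=10.0, level=0.95):
    """Return (OR, lower, upper) for an `increment`-unit rise in the exposure.

    beta: log-odds coefficient per one unit (assumed: one percentage point).
    se:   standard error of beta.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    point = exp(increment * beta)
    lower = exp(increment * (beta - z * se))
    upper = exp(increment * (beta + z * se))
    return point, lower, upper
```

Scaling the coefficient before exponentiating, rather than multiplying the OR by 10, reflects that the model is linear on the log-odds scale.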
We performed several sensitivity analyses to test the robustness of our results. First, analyses were performed using energy-adjusted UPF intake. Second, sensitivity analyses on missing data were performed by complete case analysis. Moreover, we excluded participants who were lost to follow-up after 24 months, in an attempt to address the possible reverse causation caused by short follow-up time.
Post hoc analysis—baseline diabetes risk and ultra-processed food consumption patterns
Awareness of an elevated diabetes risk may have influenced individuals’ dietary behaviors at baseline. Therefore, linear regression models were fitted to investigate whether type 2 diabetes risk at baseline, as calculated with the PROCAM risk algorithm (Additional file 1: Table S2) [35], was associated with the total intake of UPF and the distinct UPF consumption patterns. In these models, the total intake of UPF and each UPF consumption pattern score were set as the dependent variable in turn. The analyses were adjusted for the same covariates as described above, except for energy intake and BMI. Energy intake was not considered to be a confounding factor, and BMI was not included due to its high correlation with the PROCAM diabetes risk score (Pearson correlation coefficient = 0.835).