Development and validation of a lifestyle-based model for colorectal cancer risk prediction: the LiFeCRC score

Background: Nutrition and lifestyle have been long established as risk factors for colorectal cancer (CRC). Modifiable lifestyle behaviours bear potential to minimize long-term CRC risk; however, translation of lifestyle information into individualized CRC risk assessment has not been implemented. Lifestyle-based risk models may aid the identification of high-risk individuals, guide referral to screening and motivate behaviour change. We therefore developed and validated a lifestyle-based CRC risk prediction algorithm in an asymptomatic European population.
Methods: The model was based on data from 255,482 participants in the European Prospective Investigation into Cancer and Nutrition (EPIC) study aged 19 to 70 years who were free of cancer at study baseline (1992–2000) and were followed up to 31 September 2010. The model was validated in a sample comprising 74,403 participants selected among five EPIC centres. Over a median follow-up time of 15 years, there were 3645 and 981 colorectal cancer cases in the derivation and validation samples, respectively. Variable selection algorithms in Cox proportional hazard regression and random survival forest (RSF) were used to identify the best predictors among plausible predictor variables. Measures of discrimination and calibration were calculated in derivation and validation samples. To facilitate model communication, a nomogram and a web-based application were developed.
Results: The final selection model included age, waist circumference, height, smoking, alcohol consumption, physical activity, vegetables, dairy products, processed meat, and sugar and confectionary. The risk score demonstrated good discrimination overall and in sex-specific models. Harrell’s C-index was 0.710 in the derivation cohort and 0.714 in the validation cohort. The model was well calibrated and showed strong agreement between predicted and observed risk. Random survival forest analysis suggested high model robustness. Beyond age, lifestyle data led to improved model performance overall (continuous net reclassification improvement = 0.307 (95% CI 0.264–0.352)), and especially for young individuals below 45 years (continuous net reclassification improvement = 0.364 (95% CI 0.084–0.575)).
Conclusions: LiFeCRC score based on age and lifestyle data accurately identifies individuals at risk for incident colorectal cancer in European populations and could contribute to improved prevention through motivating lifestyle change at an individual level.


Background
Colorectal cancer accounted for over 1.8 million new cases or 10% of all new cases of cancer worldwide in 2018 [1]. Worryingly, the global burden of colorectal cancer is expected to rise by 60% reaching 2.2 million new cases and 1.1 million deaths in 2030, with European countries ranking highest in the global statistics of colorectal cancer incidence and mortality [2]. The projected increase in colorectal cancer burden necessitates improved assessment of primary prevention strategies [2,3]. Targeted prevention in an asymptomatic population that addresses potentially modifiable factors has potential for reducing lifestyle-associated long-term risk of colorectal cancer and represents a cost-effective approach to reduce the cancer burden [4,5].
Lifestyle behaviours such as smoking, alcohol consumption, and poor diet have long been recognized to be associated with a higher risk of colorectal cancer [6][7][8][9][10][11][12][13][14][15]. Updated evidence on nutrition and cancer risk further highlighted the importance of risk factors such as body fatness (i.e. abdominal adiposity), adult-attained height, physical activity, high intake of red and processed meat and low intakes of whole grains, dairy products and fish [15,16]. Despite accumulation of evidence, translation of lifestyle information into individualized colorectal cancer risk assessment strategies has not been implemented so far. Risk stratification may aid the identification of high-risk individuals, guide referral to screening and motivate lifestyle modification [17]. Individualized risk estimates in primary care may aid behaviour change and complement preventive approaches to shifting population distributions of risk factors [17].
A number of colorectal cancer risk prediction models have been published over the last decade [18][19][20][21]. Most published models have been predominantly developed using data from American and Asian populations [18,19]. We have previously validated several models in European populations based on data from UK Biobank and the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort studies [20]; however, several gaps remain to be addressed. First, only a few previous models have been developed based on prospective cohort data with long enough follow-up time to account for the potentially long latency period of colorectal cancer development [18]. Second, important emerging predictors related to nutrition and lifestyle such as abdominal fatness have not been considered [22]. Third, most models focused only on model development and did not address the full continuum of model development, validation and communication recommended in recent methodological guidelines for research on risk prediction (i.e. TRIPOD, Transparent Reporting of a multivariable Prediction model for Individual Prognosis or Diagnosis) [19,23]. Fourth, previous models were mostly developed using logistic regression and did not account for time-to-event. New approaches such as penalized regression methods (i.e. elastic net regression) and machine learning algorithms (i.e. random survival forest) might offer additional means for model improvement [24,25]. Finally, model communication to the wider public was generally not addressed by previous studies and was restricted to providing a formula to calculate individual absolute risk of colorectal cancer [18]. Graphical nomograms and web-based applications could further aid in facilitating model communication [26].
In this context, we aimed to develop and validate a lifestyle-based risk prediction model for the prevention of colorectal cancer in a population-based European cohort. We further aimed to construct a simple, widely applicable and user-friendly risk calculator offering an estimate of colorectal cancer risk based on an individual's personal data.

Study design and data source
The lifestyle-based prediction model for colorectal cancer risk (LiFeCRC score) was developed using data collected within EPIC, a multicentre prospective cohort study comprising 521,324 participants aged 17 to 98 years at study baseline (predominantly 35 to 70 years) recruited between 1992 and 2000 across 23 centres in 10 European countries [27]. Participants included blood donors, screening participants, health-conscious individuals and the general population. Written informed consent was obtained from all participants before joining the EPIC study. Approval for the EPIC study was obtained from the ethical review boards of the International Agency for Research on Cancer and from all local institutions through which subjects were recruited for the EPIC study, as previously reported [28].

Case ascertainment
The primary outcome was incident colorectal cancer. Cancer cases were identified through population cancer registries in Denmark, Italy, the Netherlands, Spain, Sweden and the UK. In France, Germany and Greece, a combination of methods was used including health insurance records, cancer pathology registries and active follow-up of study participants. Follow-up began at the date of enrolment and ended at the date of diagnosis of colorectal cancer, death or last complete follow-up. The last update of endpoint information was done up to 31 September 2010. Colon and rectal cancers were defined according to the 10th Revision of the International Statistical Classification of Diseases, Injuries and Causes of Death (ICD-10). Proximal colon tumours include tumours in the cecum, cecal appendix, ascending colon, hepatic flexure, transverse colon and splenic flexure (ICD-10 codes C18.0-C18.5); distal colon tumours include those in the descending colon (ICD-10 code C18.6) and sigmoid colon (ICD-10 code C18.7); and rectal tumours are those occurring at the rectosigmoid junction (ICD-10 code C19) or in the rectum (ICD-10 code C20). Only the first primary neoplasm was included in the analysis; nonmelanoma skin cancer was excluded. Figure 1 presents a flowchart of study population selection for deriving the LiFeCRC score in the EPIC cohort.
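The subsite definitions above can be expressed as a small lookup. The following is a minimal sketch, not the study's code; the function name `crc_subsite` and the label used for codes that the text does not assign to a subsite (for example C18.8 and C18.9) are assumptions made for illustration.

```python
def crc_subsite(icd10_code: str) -> str:
    """Map an ICD-10 code to the subsite grouping described in the text."""
    code = icd10_code.strip().upper().replace(".", "")
    # Rectosigmoid junction (C19) and rectum (C20) are grouped as rectal tumours.
    if code.startswith("C19") or code.startswith("C20"):
        return "rectum"
    if code.startswith("C18"):
        if len(code) >= 4 and code[3].isdigit():
            digit = int(code[3])
            if digit <= 5:            # C18.0-C18.5: cecum to splenic flexure
                return "proximal colon"
            if digit in (6, 7):       # C18.6 descending, C18.7 sigmoid colon
                return "distal colon"
        # Assumption: codes not assigned in the text get a separate label.
        return "colon, subsite not assigned"
    return "not colorectal"


for example in ("C18.2", "C18.7", "C19", "C20", "C18.9"):
    print(example, "->", crc_subsite(example))
```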

Baseline data collection
At baseline, participants completed extensive medical, dietary and lifestyle questionnaires, including questions on alcohol use, smoking status, physical activity, education and previous illnesses. Body weight, height and waist circumference were measured in all centres except for EPIC-Oxford (health-conscious population) and EPIC-France where anthropometric measurements were self-reported [27]. Usual food intakes were measured by using country-specific validated dietary questionnaires [29]. All dietary variables used in the present study were calibrated by using an additive calibration method as previously described [30]. Non-steroidal anti-inflammatory drug (NSAID) use was only assessed in the Cambridge study center, and family history of colorectal cancer was assessed only in study centres in France, Spain and the UK. Baseline characteristics of participants with available information on NSAID use and family history of colorectal cancer are presented in Supplementary Table 1, Additional File 1.

Model development
The model development and model validation were performed and reported following the TRIPOD guidelines [23,31] (Supplementary Table 2, Additional File 1). The general workflow of model derivation, performance evaluation, validation and model communication is presented in Supplementary Fig. 1, Additional File 2.
Overall, the LiFeCRC score was derived based on beta coefficients for colorectal cancer risk estimated in Cox proportional hazard models within the derivation dataset. Time-to-event was defined as time from baseline assessment to first cancer event. Supplementary Table 3, Additional File 1 presents the variable names and measurement scales of a predefined set of 16 predictors selected based on published literature reflecting the latest evidence from systematic reviews (i.e. World Cancer Research Fund/American Institute for Cancer Research reports) and based on availability of data in the EPIC cohort. Analyses based on Schoenfeld residuals and stratified Kaplan-Meier curves revealed no violation of the proportional hazard assumption of the Cox model. To test whether the predictive performance of each variable is the same regardless of the values of other predictors, statistical interactions between different combinations of predictor variables on the multiplicative scale were tested using the likelihood ratio test. Since model discrimination was not improved by including significant interaction terms, interaction terms were not included in the final Cox models to avoid overfitting.

Elastic net selection
Predictor variable selection was performed using bootstrapped elastic net regularization [32]. Elastic net regularization is a penalized regression method, combining least absolute shrinkage and selection operator (LASSO) and ridge regression. A penalty parameter λ is used to shrink predictor regression coefficients, eventually removing predictor variables from the model by setting their respective regression coefficient to zero. A mixing parameter α is used to fix the proportion for combining LASSO and ridge regression. Optimal values for both parameters λ and α were determined based on minimal mean error of 10-fold cross-validation using 100 possible λ values for α values between 0.5 and 1 (0.5, 0.6, 0.7, 0.8, 0.9, 1). The selected parameters were then used to bootstrap the elastic net regularization of each predictor's Cox regression coefficient with 1000 replications. Based on all bootstrap replications, mean coefficient values and 95% confidence intervals were calculated for each predictor coefficient. Predictors with confidence intervals including zero were removed. All remaining predictors were then used to generate reduced elastic net penalized Cox regression models. The model selection was conducted for colorectal cancer as a single endpoint (LiFeCRC score) and according to sex and cancer subsite (colon/rectum). Variable selection and Cox regression modeling were performed using R 3.6.1 (R Core Team) [33], and the glmnet (version 2.0-18) [34] and survival (version 2.44-1.1) [35] packages.
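The selection rule applied to the bootstrap replications can be illustrated with a short sketch. It assumes that the penalized Cox models have already been fitted elsewhere and that only the replicated coefficients are available; the percentile interval, the function names and the simulated coefficient values are assumptions for illustration, since the text does not state how the 95% confidence intervals were computed.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 1]) of a sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def select_predictors(boot_coefs, level=0.95):
    """Keep predictors whose bootstrap interval does not include zero.

    boot_coefs maps a predictor name to its coefficients, one per replication.
    """
    alpha = 1 - level
    summary = {}
    for name, values in boot_coefs.items():
        ordered = sorted(values)
        lower = percentile(ordered, alpha / 2)
        upper = percentile(ordered, 1 - alpha / 2)
        summary[name] = {
            "mean": statistics.fmean(values),
            "ci": (lower, upper),
            "selected": lower > 0 or upper < 0,
        }
    return summary


# Simulated replicates (made-up numbers, not study estimates).
rng = random.Random(0)
boot = {
    # A strong predictor: coefficients stay well away from zero.
    "age": [rng.gauss(0.08, 0.005) for _ in range(1000)],
    # A weak predictor: often shrunk to exactly zero, otherwise small noise.
    "noise_variable": [
        0.0 if rng.random() < 0.6 else rng.gauss(0.0, 0.01) for _ in range(1000)
    ],
}
summary = select_predictors(boot)
for name, result in summary.items():
    print(name, result["selected"])
```

In this toy setting the strong predictor is retained and the weak one is dropped, mirroring the criterion that predictors with intervals including zero were removed.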

Absolute risk assessment
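The individual 10-year absolute risk P(10y) for colorectal cancer was calculated using the following formula:
$$P(10y) = 1 - S_m(10y)^{\exp(\text{Risk Score}_i - \text{Risk Score}_m)}$$
The 10-year survival function estimate S_m(10y) was calculated for average predictor variable values. The average Risk Score_m and the individual Risk Score_i were computed using the following formulas:
$$\text{Risk Score}_m = \sum_j \beta_j \bar{x}_j, \qquad \text{Risk Score}_i = \sum_j \beta_j x_{ij}$$
The j index stands for a predictor variable of a Cox regression model, β_j is the beta estimate, \bar{x}_j is the average value of predictor j and x_{ij} is the value of predictor j for individual i.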
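A minimal sketch of this calculation follows, assuming the standard Cox form in which the baseline survival at average predictor values is raised to the power exp(Risk Score_i - Risk Score_m). The coefficient values, the mean waist circumference and the baseline survival of 0.995 are placeholders chosen for illustration and are not the estimates reported in Table 2.

```python
import math


def risk_score(betas, values):
    """Linear predictor: sum of beta_j times the predictor value x_j."""
    return sum(betas[name] * values[name] for name in betas)


def absolute_risk(s_m, score_i, score_m):
    """Absolute risk P = 1 - S_m ** exp(score_i - score_m)."""
    return 1.0 - s_m ** math.exp(score_i - score_m)


# Placeholder inputs (hypothetical, for illustration only).
betas = {"age": 0.07, "waist": 0.01}
mean_values = {"age": 51.4, "waist": 85.0}
person = {"age": 65.0, "waist": 100.0}
s_m_10y = 0.995

score_m = risk_score(betas, mean_values)
score_i = risk_score(betas, person)
risk_at_mean = absolute_risk(s_m_10y, score_m, score_m)
risk_person = absolute_risk(s_m_10y, score_i, score_m)
print(f"risk at average predictor values: {risk_at_mean:.4f}")
print(f"risk for the example individual:  {risk_person:.4f}")
```

An individual with exactly average predictor values receives a risk of 1 - S_m, which is a convenient sanity check for any implementation.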
In additional analyses, the study population was stratified according to predefined risk categories of low, intermediate and high risk, based on the 50th and 90th percentiles of predicted risk in the derivation cohort. Incidence rates and model selection characteristics across the risk categories so defined were assessed in both the derivation and validation samples.
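The stratification can be sketched as follows. The cutoffs are taken from the derivation cohort and then applied unchanged to any individual; how values falling exactly on a cutoff were assigned is not stated in the text, so the boundary handling below is an assumption, as are the function names and the toy risks.

```python
import statistics


def risk_cutoffs(derivation_risks):
    """Return the 50th and 90th percentiles of predicted risk."""
    cuts = statistics.quantiles(derivation_risks, n=100, method="inclusive")
    return cuts[49], cuts[89]


def risk_category(risk, cutoffs):
    """Assign low / intermediate / high risk from the derivation cutoffs."""
    p50, p90 = cutoffs
    if risk < p50:
        return "low"
    if risk < p90:  # assumption: a value equal to a cutoff goes to the upper group
        return "intermediate"
    return "high"


# Toy predicted risks from 0.1% to 10% (made-up values).
derivation_risks = [0.001 * i for i in range(1, 101)]
cutoffs = risk_cutoffs(derivation_risks)

counts = {"low": 0, "intermediate": 0, "high": 0}
for risk in derivation_risks:
    counts[risk_category(risk, cutoffs)] += 1
print(cutoffs, counts)
```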

Model discrimination
Model discrimination was assessed based on Harrell's C-index, a measure similar to the area under the receiver operating characteristic curve that takes the censored nature of the data into account. This value represents the probability that, of two comparable individuals, the one who develops colorectal cancer earlier has the higher predicted risk. To account for model optimism in terms of overfitting, bootstrapping with 1000 replications was performed. In bootstrapping, entries are randomly drawn with replacement from a data set until the bootstrap sample has the same size as the original dataset. For each bootstrap sample, an elastic net penalized Cox regression model was fitted.
Harrell's C-index of each bootstrap model was then calculated for the bootstrap sample and the original data in each bootstrap replication. The difference of these values was averaged over all 1000 bootstrap replications to calculate the amount of optimism for the C-index of the original model, which was used to calculate an optimism-corrected C-index. This analysis was performed in R [33] with the package rms (version 5.1-3.1) [36].
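The two quantities described above can be sketched in a few lines. This is a simplified illustration, not the rms implementation: pairs with tied event times are skipped, and the bootstrap C-index pairs passed to `optimism_corrected` are assumed to come from models fitted elsewhere. All numbers are made up.

```python
def harrell_c(times, events, scores):
    """Share of comparable pairs in which the earlier event has the higher score."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is anchored on an observed event
        for j in range(len(times)):
            if times[i] < times[j]:  # j was still event-free when i had the event
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable


def optimism_corrected(apparent_c, bootstrap_pairs):
    """Subtract the mean optimism from the apparent C-index.

    bootstrap_pairs holds, per replication, the C-index of the bootstrap model
    on its own bootstrap sample and on the original data.
    """
    optimism = sum(c_boot - c_orig for c_boot, c_orig in bootstrap_pairs)
    return apparent_c - optimism / len(bootstrap_pairs)


times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_perfect = harrell_c(times, events, [0.9, 0.7, 0.5, 0.1])
c_mixed = harrell_c(times, events, [0.9, 0.1, 0.5, 0.7])
c_corrected = optimism_corrected(0.72, [(0.73, 0.71), (0.74, 0.72)])
print(c_perfect, c_mixed, round(c_corrected, 3))
```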

Model calibration
Calibration plots of estimated individual predicted risks of developing colorectal cancer in the next 10 years were derived from the penalized Cox regression model. These values were divided into deciles, and each decile's mean value was computed. The Kaplan-Meier survival function at 10 years with 95% confidence interval was calculated for each decile group. Subsequently, the trend of the mean predicted risks and the observed complement of the Kaplan-Meier survival of each decile was visually compared as a measure of calibration. Model performance, including Harrell's C-index and calibration plots, was also evaluated in the validation cohort.
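The decile comparison described above can be sketched as follows: individuals are sorted by predicted risk, split into groups, and the mean predicted risk of each group is set against one minus the Kaplan-Meier survival estimate at the horizon. Confidence intervals are omitted, and the function names and toy data are assumptions for illustration.

```python
def km_survival(times, events, horizon):
    """Kaplan-Meier survival estimate at `horizon` for right-censored data."""
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        failures = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1 - failures / at_risk
    return survival


def calibration_table(predicted, times, events, horizon=10.0, groups=10):
    """Return (mean predicted risk, observed risk) per risk group."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    rows = []
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        observed = 1 - km_survival(
            [times[i] for i in idx], [events[i] for i in idx], horizon
        )
        rows.append((mean_pred, observed))
    return rows


# Toy data with two groups instead of ten (made-up values).
predicted = [0.01] * 10 + [0.5] * 10
times = [12.0] * 10 + [5.0] * 5 + [12.0] * 5
events = [0] * 10 + [1] * 5 + [0] * 5
rows = calibration_table(predicted, times, events, horizon=10.0, groups=2)
km_check = km_survival([1, 2, 3, 4, 5], [1, 0, 1, 0, 0], horizon=10)
print(rows, round(km_check, 4))
```

A well-calibrated model yields rows in which the two values are close, which corresponds to points lying near the diagonal of a calibration plot.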

Model communication
In order to assist the translation of the generated statistical model into an individual risk prediction equation, we created a 10-year risk assessment nomogram as a graphical model representation that allows risk estimation. For this purpose, we used the R [33] package rms (version 5.1-3.1) [36]. In addition, we developed a user-friendly risk calculator application using the R [33] packages shiny (version 1.2.0) [37] and shinydashboard (version 0.7.1) [38] that can be adapted for web-based use. This application allows the prediction of individual colorectal cancer risk by entering individual characteristics into input fields. The input values are then evaluated using the validated colorectal cancer risk prediction model.

Random survival forest
Random survival forest was used as an alternative machine learning method in order to assess model robustness, i.e. to assess whether the same set of predictors would be selected. Each random survival forest was generated with a total number of 500 decision trees with 100 unique data points on average in each terminal node and a maximum of 10 possible random split points to consider at each branch of a decision tree. A variable importance measure for each predictor variable, describing the impact of using randomly permuted values of this variable instead of observed values for the prediction of known entries, was then extracted from the random survival forest. For the computation of random survival forests, the package "randomForestSRC" (version 2.6.1) was used. Model performance was evaluated in the derivation and validation cohort using Harrell's C-index and calibration plots.
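The idea behind a permutation-based importance measure can be illustrated independently of any forest implementation. The sketch below is not the randomForestSRC procedure; it assumes a generic prediction function and measures the average drop in a Harrell-type C-index when one variable is randomly permuted. The toy data and the helper names are hypothetical.

```python
import random


def concordance(times, events, scores):
    """Harrell-type C-index for right-censored data (tied times are skipped)."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable


def permutation_importance(predict, rows, times, events, variable,
                           repeats=20, seed=0):
    """Mean drop in C-index after shuffling one variable across individuals."""
    rng = random.Random(seed)
    baseline = concordance(times, events, [predict(r) for r in rows])
    drops = []
    for _ in range(repeats):
        shuffled = [r[variable] for r in rows]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{variable: v}) for r, v in zip(rows, shuffled)]
        permuted_c = concordance(times, events, [predict(r) for r in permuted])
        drops.append(baseline - permuted_c)
    return sum(drops) / len(drops)


def predict(row):
    """Hypothetical risk model that uses age only."""
    return 0.07 * row["age"]


ages = [40, 45, 50, 55, 60, 65, 70, 75]
noise = [3, 1, 4, 1, 5, 9, 2, 6]
rows = [{"age": a, "noise": n} for a, n in zip(ages, noise)]
times = [16, 15, 14, 12, 10, 8, 6, 4]
events = [1] * 8

imp_age = permutation_importance(predict, rows, times, events, "age")
imp_noise = permutation_importance(predict, rows, times, events, "noise")
print(imp_age, imp_noise)
```

A variable the model does not use has an importance of exactly zero in this sketch, whereas permuting an informative variable lowers discrimination.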

Sensitivity analyses
In sensitivity analyses, we evaluated the added predictive value of lifestyle data beyond age, using the following statistics: (1) improvement in model discrimination based on goodness of fit (likelihood ratio test), estimated net change in Harrell's C-index and continuous net reclassification improvement (NRI>0); (2) improvement in model calibration based on comparison of calibration plots and (3) net benefit of the model based on decision curve analysis. We also stratified the study population in the derivation and validation samples according to age groups (< 45 years, 45-65 years and > 65 years) and calculated model performance characteristics (Harrell's C-index and NRI>0) for the lifestyle-based model across these categories. In addition, we calculated the predicted 10-year absolute risk of colorectal cancer for a predefined "healthy" and "unhealthy" lifestyle pattern across different age groups and a constant body height. In a subsample of the derivation cohort with available information, Harrell's C-index was compared between models with and without inclusion of NSAID use or family history. To address model generalizability, we further evaluated model performance across subgroups by selected variables, i.e. waist circumference, education, smoking status (including level of smoking intensity) and level of alcohol consumption. Finally, to account for the potential influence of competing risk of death (N = 23,774), we calculated the cumulative incidence adjusted for mortality and evaluated the discrimination of the reduced model based on Fine-Gray subdistribution hazard regression [39] in both the derivation and validation samples.

Results
Table 1 shows the baseline characteristics of men and women in the derivation and validation cohorts. Overall, the distribution of risk factors was similar across both cohorts.
In the derivation cohort, the mean age at study baseline was 51.4 years, 67.5% of the participants were women, and mean age at colorectal cancer diagnosis was 66.0 years in women and 66.4 years in men. Never-smokers, physically active and highly educated people comprised 49.1%, 10.3% and 24.6% of the derivation cohort, respectively. The median follow-up time was 15.4 (interquartile range 13.2 to 16.9) years in the derivation cohort and 14.1 (interquartile range 10.5 to 16.0) years in the validation cohort. Figure 2 illustrates the distribution of Cox regression coefficients of all predictor variables based on the bootstrapped elastic net regularization. Selected variables in the reduced model are highlighted based on the selection criterion that the 95% confidence interval does not include 0. Table 2 shows derived colorectal cancer hazard ratios for all risk factors (full model) and risk factors that remained after elastic net selection (reduced model). The selected predictors of the overall colorectal cancer risk in men and women included age, waist circumference, height, daily alcohol consumption, smoking, physical activity, vegetables, dairy products, processed meat, and sugar and confectionary (Table 2). The models derived separately for men and women confirmed age, waist circumference, smoking and vegetable intake as consistent predictors in both sexes. Additional predictors retained in the reduced model in men were daily alcohol consumption, dairy intake, dark bread and red meat, and in women, height and processed meat. The estimated 10-year mean absolute risk for colorectal cancer of the derivation cohort was 0.78% in both sexes combined, 1.07% in men and 0.64% in women (Table 2). Table 3 provides an overview of selected variables by anatomical subsite, colon and rectal cancer, overall and separately in men and women. An additional predictor that was retained in the model for rectal cancer was the intake of soft drinks.
Notably, selected predictors in women were somewhat different for colon and rectal cancer. For colon cancer, the model included age, waist circumference, height, smoking and vegetable intake, whereas for rectal cancer it included age, processed meat and soft drinks (Table 3).

Model performance: discrimination and calibration
Overall model discrimination was good with Harrell's C-index of 0.709 for the derived colorectal cancer risk model. Optimism-adjusted Harrell's C-index ranged from 0.667 for the model for rectal cancer in women to 0.716 for the model for colon cancer in both sexes (Table 4). Reduced models showed predictive performance similar to that of the "full models", suggesting that obtaining data on the selected predictors would yield sufficient information and that additional factors do not add predictive value to the model. The performance in the validation cohort was similar for all models, suggesting a high level of stability and a lack of overfitting. Calibration plots of derived colorectal cancer risk models in the derivation and validation sample overall and by sex are presented in Fig. 3. An overall good calibration was observed based on the comparable intercepts for models across derivation and validation samples.

Model communication
Absolute risk formula
To provide assessment of the absolute 10-year risk of colorectal cancer for individuals with various combinations of risk factors, we prepared a formula with the following selected predictors:

Absolute risk of colorectal cancer within 10 years:
$$P(10\,\text{years}) = 1 - S_m(10\,\text{years})^{\exp(\text{Risk Score}_i - \text{Risk Score}_m)}$$
Here, Risk Score_i is the sum, over the selected predictors, of each beta estimate multiplied by the individual's value for that predictor. Values for S_m(10 years) and Risk Score_m are given in Table 2. Absolute risk for different timespans can be calculated by replacing S_m in the formula accordingly. The survival function estimates for timespans between 0 and 20 years are shown in Supplementary Fig. 2.

Nomogram
Figure 4 shows a nomogram of the weights and points of the colorectal cancer risk prediction score allowing estimation of an individual's probability to develop colorectal cancer over a 10-year period. The nomogram is characterized by a scale corresponding to each variable, a point scale, a total point scale and a probability scale. The use of the nomogram is simple and involves 3 steps. First, on the scale for each variable, the value corresponding to a specific individual is read and the point scale is used to calculate the points for all variable values. Second, the total number of points is calculated by adding up all the points obtained in the previous step, and its value is identified on the total point scale. Finally, the probability of an event corresponding to the total points of the individual is represented on the risk scale.
As a practical example, we estimated the 10-year risk of colorectal cancer for individuals with two different combinations of age and lifestyle factors, representing low-risk and high-risk extremes: individual 1 was 45 years old (50 points) with a body height of 166 cm (7.5 points), a waist circumference of 70 cm (3 points) and healthy lifestyle behaviour (low daily alcohol consumption (0 points), non-smoker (0 points), physically active (0 points), 430 g daily vegetable intake (7 points), 630 g daily dairy products intake (2.5 points), 0 g daily processed meat intake (0 points), and 5 g daily sugar and confectionary intake (0 points)), and individual 2 was 65 years old (90 points) with a body height of 166 cm (7.5 points), a waist circumference of 100 cm (12 points) and rather unhealthy lifestyle behaviour (high daily alcohol consumption (3 points), smoker (5 points), physically inactive (2.5 points), 80 g daily vegetable intake (14.5 points), 70 g daily dairy products intake (5 points), 60 g daily processed meat intake (2.5 points), and 90 g daily sugar and confectionary intake (1.5 points)). The total number of points of the various prediction indicators was ~70 and ~143.5 and the corresponding absolute predicted 10-year risk of colorectal cancer was ~0.2% (risk score of ~5.7) and ~3-3.5% (risk score of ~8.6) for individual 1 and individual 2, respectively.
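The arithmetic of this example can be checked directly. The sketch below adds up the points quoted in the text for both individuals and, assuming the Cox form of the absolute risk formula, derives the relative hazard implied by the two approximate risk scores. Because the risk scores and the 0.2% value are rounded readings from the nomogram, the implied risk for individual 2 is only expected to agree roughly with the reported range.

```python
import math

# Points per predictor as quoted in the text.
individual_1 = {
    "age": 50, "height": 7.5, "waist": 3, "alcohol": 0, "smoking": 0,
    "physical activity": 0, "vegetables": 7, "dairy": 2.5,
    "processed meat": 0, "sugar and confectionary": 0,
}
individual_2 = {
    "age": 90, "height": 7.5, "waist": 12, "alcohol": 3, "smoking": 5,
    "physical activity": 2.5, "vegetables": 14.5, "dairy": 5,
    "processed meat": 2.5, "sugar and confectionary": 1.5,
}

total_1 = sum(individual_1.values())
total_2 = sum(individual_2.values())

# Relative hazard implied by the approximate risk scores (8.6 versus 5.7).
relative_hazard = math.exp(8.6 - 5.7)

# Assumption: both risks follow P = 1 - S ** exp(score), so the risk of
# individual 2 can be derived from the rounded 0.2% risk of individual 1.
implied_risk_2 = 1 - (1 - 0.002) ** relative_hazard

print(total_1, total_2)
print(round(relative_hazard, 1), round(implied_risk_2, 3))
```

The point totals reproduce the values of 70 and 143.5 given in the text, and the implied risk for individual 2 falls in the range of 3 to 4%.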

Web-based calculator
As an alternative approach to model communication, we developed a web-based calculator for the estimation of a personalized colorectal cancer risk based on the validated LiFeCRC score. A graphical illustration of the application layout with predicted and absolute risk values for a modifiable time span is presented in Fig. 5. Of note, the results produced by the web-based calculator should be interpreted considering that competing risk of mortality was not included in the absolute risk calculation.

Random survival forest
Results of random survival forest-based relative variable importance for colorectal cancer risk prediction are presented in Supplementary Fig. 3, Additional File 2. The main selected predictors remained similar to those in the Cox regression model, confirming model robustness. The highest relative importance was observed for age, followed by waist circumference, red and processed meat intake, height and vegetable consumption. The model for women showed, in addition, height, dark bread and dairy products intake as additional important predictors, whereas the model for men showed smoking and sweets and confectionary consumption as additional important predictors. Overall, the discrimination (Supplementary Fig. 3, Additional File 2) and calibration (Supplementary Fig. 4, Additional File 2) of the random survival forest-based colorectal cancer risk prediction model were comparable to those of the Cox regression model.

Sensitivity analysis
In a sensitivity analysis, we evaluated to what extent lifestyle data added predictive value to the colorectal cancer risk model based on age only. The addition of the lifestyle variables resulted in a statistically significantly increased goodness of fit (likelihood ratio test p < 0.001). The estimated NRI>0 was 0.307 (95% confidence interval 0.264 to 0.352), indicating an improvement in model performance. An improved calibration and higher net benefit were observed for colorectal cancer risk thresholds between 0.7 and 2.5% for the LiFeCRC model compared to the age-based model (Supplementary Fig. 5, Additional File 2). In analyses stratified according to age groups, model performance was higher in individuals < 45 years, and adding lifestyle data contributed to improved reclassification statistics in this group (NRI>0 = 0.364, 95% confidence interval 0.084 to 0.575), suggesting the relative importance of lifestyle data assessment for risk prediction at younger ages (Supplementary Table 5, Additional File 1). We further estimated the predicted 10-year absolute risk of colorectal cancer for an arbitrary predefined "healthy" and "unhealthy" lifestyle, across different age groups and a constant body height (Supplementary Fig. 6, Additional File 2).
For example, an individual aged 45 years with a body height of 166 cm adopting a predefined "unhealthy lifestyle" (waist circumference of 100 cm, high daily alcohol consumption, smoker, physically inactive, 80 g daily vegetable intake, 70 g daily dairy products intake, 60 g daily processed meat intake and 90 g daily sugar and confectionary intake) has a 3.6 times higher absolute risk of colorectal cancer within the next 10 years compared to a person of the same age and body height, adopting a predefined "healthy lifestyle" (waist circumference of 70 cm, low daily alcohol consumption, non-smoker, physically active, 430 g daily vegetable intake, 630 g daily dairy products intake, 0 g daily processed meat intake and 5 g daily sugar and confectionary intake). In a subsample with available information, addition of information on NSAID use or family history of colorectal cancer to the list of predictors did not further improve model performance beyond main lifestyle variables (Supplementary Fig. 7, Additional File 2). The results did not reveal marked differences in model discrimination among subgroups by waist circumference, education, smoking status and levels of alcohol consumption (Supplementary Table 6, Additional File 1). Furthermore, no substantial differences could be seen between the Kaplan-Meier survival function and the cumulative incidence function taking competing risk into account (data not shown). Also, no differences in the discrimination ability of the Fine-Gray model taking competing risk of death into account could be observed (C-index = 0.710).
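Two of the statistics used in these analyses, the continuous net reclassification improvement and the net benefit from decision curve analysis, can be sketched for the simplified case of a binary outcome without censoring. The study's survival setting would require censoring-adjusted versions; the function names and the toy risks below are assumptions for illustration only.

```python
def continuous_nri(old_risk, new_risk, events):
    """NRI>0: net share of upward moves in events plus downward moves in non-events."""
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, event in zip(old_risk, new_risk, events):
        if event:
            n_e += 1
            up_e += new > old
            down_e += new < old
        else:
            n_n += 1
            up_n += new > old
            down_n += new < old
    return (up_e - down_e) / n_e + (down_n - up_n) / n_n


def net_benefit(risk, events, threshold):
    """Net benefit of classifying individuals with risk >= threshold as high risk."""
    n = len(risk)
    true_pos = sum(1 for r, e in zip(risk, events) if r >= threshold and e)
    false_pos = sum(1 for r, e in zip(risk, events) if r >= threshold and not e)
    return true_pos / n - false_pos / n * threshold / (1 - threshold)


# Toy example: an age-only model versus a model with lifestyle data added.
events = [1, 1, 0, 0]
age_only = [0.010, 0.010, 0.010, 0.010]
with_lifestyle = [0.020, 0.020, 0.005, 0.020]

nri = continuous_nri(age_only, with_lifestyle, events)
nb = net_benefit([0.03, 0.03, 0.001, 0.001], [1, 0, 0, 0], threshold=0.02)
print(nri, round(nb, 4))
```

In the toy example both events move upward, while the non-events split evenly between upward and downward moves, so the event component alone drives the NRI>0.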

Discussion
In this large European prospective cohort study, we developed and validated the LiFeCRC score, a lifestyle-based prediction model for the prevention of colorectal cancer in asymptomatic populations across Europe. Beyond age, the variables retained in the model were waist circumference, height, daily alcohol consumption, smoking status, physical activity and dietary intakes of vegetables, dairy products, processed meat and sugar and confectionary. Separate models were also developed for men and women and for colon and rectal cancer subtypes. The model showed good calibration and discrimination properties to identify individuals at all levels of colorectal cancer risk. Modifiable lifestyle factors contributed to model performance and accuracy beyond age alone and could improve reclassification statistics.
Currently, the target population for colorectal cancer screening is mainly selected based on age alone (i.e. 50 years or above). Although age is undoubtedly an important predictor of colorectal cancer as shown in our data, information on modifiable lifestyle factors allows provision of preventive health recommendations for individuals at risk [40]. Lifestyle-based models have been suggested in medical practice as important tools that could be used to identify those most likely to benefit from lifestyle interventions and to contribute to behaviour change interventions [41]. A number of intervention studies focusing on changing lifestyle for colorectal cancer prevention reported significant effects on the target behaviours [42][43][44][45][46]. In those studies, tailored approaches that enable personalized feedback regarding individual lifestyle patterns were suggested as more successful compared to generic approaches [42][43][44][45][46][47]. Although lifestyle interventions represent a powerful cost-effective strategy for colorectal cancer prevention, there has been little incentive on the side of health professionals to advocate lifestyle-based recommendations [48]. Risk assessment tools such as the LiFeCRC score could facilitate improved advocacy on the side of health professionals and motivate or empower individuals to implement behaviour changes [47,49]. Targeting lifestyle factors in those at highest risk may be particularly relevant for younger age groups that may profit most from early preventive interventions aimed at encouraging behavioural changes [47].
A number of previous models incorporated lifestyle data, with common covariates including self-reported body mass index (BMI), alcohol consumption and smoking [18][19][20][21]. Recently, a model based on BMI, smoking, alcohol, red and processed meat, fruits, vegetables and physical activity demonstrated C-statistics of 0.66 and 0.68 in men and women, respectively [41]. Compared with this and other published models that also include family history and more complex variables [18,19,50,51], the EPIC lifestyle-based model showed comparable or even improved performance, with a Harrell's C-index of 0.710 in the derivation cohort and 0.714 in the validation cohort. The highest C-statistics previously reported for a colorectal cancer risk prediction model ranged from 0.67 in UK Biobank to 0.69 in EPIC validation samples [20]. Unlike our model, that model included 13 variables: age, ethnicity, education, BMI, family history, diabetes, oestrogen exposure, non-steroidal anti-inflammatory use, physical activity, smoking, alcohol, red meat intake and multivitamin use. Given the strong discrimination of models based on age alone, additional predictors added little improvement to model C-statistics in previous studies as well as in our data [18,20,51]. To address the question of whether lifestyle information is important for absolute risk assessment beyond age, we evaluated model performance across different age groups. These results showed that model performance was highest among participants younger than 45 years and suggested this age period as a relevant time window for early cancer prevention. We further calculated the 10-year absolute risk of colorectal cancer across different ages, comparing predefined "healthy" versus "unhealthy" lifestyle patterns based on the selected model predictors. These analyses suggested that at a given age and height, for example for an individual aged 45 years with a body height of 166 cm, following the unhealthy lifestyle pattern would lead to a 3.6 times higher absolute risk of colorectal cancer within the next 10 years compared with a person of the same age and body height who adopts a healthy lifestyle. These results highlight the importance of adherence to a healthy lifestyle for the long-term reduction of colorectal cancer risk. In support of these data, a recent analysis based on a large German population sample showed that a healthy lifestyle could improve prospects for avoiding colorectal cancer in the long term, even beyond individual genetic risk [52].
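The comparison of absolute risks between lifestyle patterns can be illustrated with a minimal sketch. It assumes the standard Cox-model relation, in which the absolute risk at a time horizon equals 1 - S0(t)^exp(linear predictor). The coefficients and the baseline survival below are hypothetical values chosen only for illustration and are not the published LiFeCRC estimates; the predictor names are likewise assumptions. Age and height are held constant across both profiles and are therefore treated as absorbed into the baseline. The profile values loosely follow the patterns described for Supplementary Figure 6, restricted to a subset of predictors.

```python
import math

# Hypothetical log-hazard ratios, chosen only for illustration.
# These are NOT the published LiFeCRC coefficients.
HYPOTHETICAL_BETAS = {
    "waist_cm_above_70": 0.010,
    "smoker": 0.20,
    "physically_inactive": 0.10,
    "vegetables_100g_per_day": -0.05,
    "processed_meat_100g_per_day": 0.30,
}

# Hypothetical 10-year baseline survival S0(10) for a person whose
# predictors are all zero; an assumption, not a study estimate.
HYPOTHETICAL_S0_10Y = 0.995


def linear_predictor(profile, betas):
    """Sum of coefficient * predictor value over all predictors."""
    return sum(beta * profile[name] for name, beta in betas.items())


def absolute_risk(profile, betas, baseline_survival):
    """Cox-model absolute risk: 1 - S0(t) ** exp(linear predictor)."""
    return 1.0 - baseline_survival ** math.exp(linear_predictor(profile, betas))


# Two profiles with the same age and height (not modelled here).
unhealthy = {
    "waist_cm_above_70": 30,             # waist circumference of 100 cm
    "smoker": 1,
    "physically_inactive": 1,
    "vegetables_100g_per_day": 0.8,      # 80 g per day
    "processed_meat_100g_per_day": 0.6,  # 60 g per day
}
healthy = {
    "waist_cm_above_70": 0,              # waist circumference of 70 cm
    "smoker": 0,
    "physically_inactive": 0,
    "vegetables_100g_per_day": 4.3,      # 430 g per day
    "processed_meat_100g_per_day": 0.0,
}

risk_unhealthy = absolute_risk(unhealthy, HYPOTHETICAL_BETAS, HYPOTHETICAL_S0_10Y)
risk_healthy = absolute_risk(healthy, HYPOTHETICAL_BETAS, HYPOTHETICAL_S0_10Y)
risk_ratio = risk_unhealthy / risk_healthy

print(f"10-year risk, unhealthy pattern: {risk_unhealthy:.4f}")
print(f"10-year risk, healthy pattern:   {risk_healthy:.4f}")
print(f"ratio of absolute risks:         {risk_ratio:.2f}")
```

Because colorectal cancer is a rare outcome over 10 years, the ratio of absolute risks in such a sketch stays close to the hazard ratio between the two profiles; with the illustrative numbers used here it will not reproduce the 3.6-fold difference reported above.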
The detailed phenotyping and assessment of nutritional data in the EPIC cohort allowed selection of several factors not commonly included in previous colorectal cancer risk prediction models. Whereas previous models used data on self-reported BMI, waist circumference measurements were available in the EPIC cohort and were among the main predictors [53,54]. Unlike BMI, which does not take body fat distribution into account, waist circumference provides a proxy for centrally located visceral fat, which has been shown to be especially relevant for colorectal cancer development [53,55]. Only a few previous models included data on height, which was selected as another important predictor by our model [56,57]. Greater height may reflect an increased standard of living characterized by greater availability of energy and protein-rich foods, lower physical activity and a reduced incidence of childhood infections, factors that follow different patterns across Europe [58]. Physical activity was also selected as a predictor of colorectal cancer risk, particularly in the model for women. These data support recent findings from the Women's Health Initiative [59] and the overall notion of the importance of physical activity for the prevention of colorectal cancer [60]. Beyond red meat [56,57,61] and vegetable intake [56,62,63,64], additional dietary predictors selected by our model included low dairy intake and high intakes of sugary products, including soft drinks. Guiding individuals towards healthy dietary and lifestyle choices could complement colorectal cancer screening as a means of colorectal cancer prevention.
The selected model performed about as well as the model with the full list of predictors, suggesting that it can be used as a simpler approach for identifying high-risk individuals. Thus, individuals and health professionals would need to inquire about fewer lifestyle factors, avoiding the use of long questionnaires and minimizing the burden of data collection on both the patient and the clinician. However, for a comprehensive lifestyle recommendation, all healthy behaviours could be considered in additional counselling. Model performance was modest among women and better among men, likely because some risk factors were more strongly associated with risk among men. The general distribution and influence of risk factors may differ geographically across populations, and additional model elaboration and adaptation of country-specific risk models should be further considered. Ultimately, research is needed to assess the feasibility and effectiveness of the current lifestyle-based risk assessment tool for health behaviour modification, colorectal cancer risk factor improvement, and overall potential for colorectal cancer prevention when incorporated into the primary care setting, particularly as a pre-screening instrument for high-risk patients. More work is also warranted to refine the risk communication tool before its general integration into practice. Finally, in future research, additional predictors, including relevant biomarker and genetic variables, should be further explored on the way towards improved precision prevention of colorectal cancer. For example, in a systematic review of 29 studies, addition of common single nucleotide polymorphisms (SNPs) to other risk factors in models developed in asymptomatic individuals in the general population increased model discrimination by 0.01 to 0.06 [19]. Overall, the reported C-statistic ranged from 0.56 to 0.63 for SNPs alone and in combination with other risk factors, respectively [19]. Further studies are warranted to evaluate whether employing genetic risk profiling beyond established risk factors can be useful to identify individuals at high colorectal cancer risk.
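The C-statistics compared in this discussion are concordance measures for censored time-to-event data. As an illustration of what such a number expresses, the following is a simplified sketch of Harrell's C-index under stated assumptions: a pair of participants is treated as comparable only when the participant with the strictly shorter follow-up time experienced the event, tied follow-up times are ignored, and tied risk scores count as half-concordant. The function name and the toy data are hypothetical and are not taken from the EPIC analysis.

```python
def harrell_c_index(times, events, risk_scores):
    """Simplified Harrell's C-index for right-censored data.

    times:       follow-up time per participant
    events:      1 if the event was observed, 0 if censored
    risk_scores: predicted risk (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored participant cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event had the higher predicted risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data for illustration only (not study data).
toy_times = [2, 4, 6, 8, 10]
toy_events = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.6, 0.3, 0.1]

toy_c = harrell_c_index(toy_times, toy_events, toy_scores)
print(f"C-index on toy data: {toy_c:.3f}")
```

A value of 0.5 corresponds to predictions no better than chance and 1.0 to perfect ranking, which is the scale on which increments of 0.01 to 0.06 from added predictors should be read.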
Our work has several strengths. The EPIC study provided an ideal setting to develop a lifestyle-based colorectal cancer risk prediction model, given its large sample size, various population backgrounds and a long follow-up time of over 20 years. Furthermore, the study provided a variety of objectively measured anthropometric data along with dietary and lifestyle information. The current model is therefore the first developed on a European-wide study population sample, allowing assessment of risk across a broad range of diet and lifestyle behaviours. Given the large sample size, we were also able to validate the risk scores in an independent subset of the EPIC populations. Additionally, we derived the colorectal cancer risk estimates empirically following state-of-the-art and novel machine learning approaches, i.e. random survival forest, considering various predictors simultaneously and the gradient in risk across the full distribution of risk levels. Finally, we considered model application and suggested a nomogram and a web tool to enable risk communication.
Several potential limitations of our study warrant discussion. First, we derived the risk equations based on a study population composed of volunteers. Individuals who volunteer for studies are often more likely to have favourable exposure and health profiles than those who do not. Thus, higher prevalence of healthy behaviours in our sample as compared to the general population could have resulted in overestimated absolute risk estimates. Second, with the exception of age and the anthropometric measures, we relied on self-reported predictor data and routinely collected cancer outcomes. Though any risk prediction tool made publicly available online would also rely on self-reported data, more accurate risk factor ascertainment would possibly improve overall model discrimination and calibration. Nevertheless, our model has shown good discrimination and excellent calibration. Third, dietary data were collected using food frequency questionnaires, a commonly applied dietary assessment method in epidemiology; however, future model application should consider further adaptation and feasibility assessment to facilitate model communication in practice. Fourth, we based analyses on lifestyle information collected at study baseline and, therefore, could not account for potential behavioural changes during study follow-up. Finally, the model was developed based on data available in the EPIC cohort and did not include some potentially important predictors, such as NSAID use or family history of colorectal cancer. However, we conducted a sensitivity analysis using data from study centres that collected these data, and the model performance was not altered.

Conclusions
Despite being one of the leading causes of cancer morbidity and mortality, colorectal cancer is largely preventable. The LiFeCRC score, based on age and lifestyle data, accurately identifies individuals at risk for incident colorectal cancer in European populations and could contribute to improved prevention by motivating lifestyle change at the individual level.

Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s12916-020-01826-0.
Model performance comparison of the LiFeCRC score and a colorectal cancer risk model including only age. (a) Calibration plot of predicted 10-year colorectal cancer risk for a model that included only age and the LiFeCRC score model with additional lifestyle predictors (waist circumference, body height, daily alcohol consumption, smoking, physical activity, and daily intake of vegetables, dairy products and red meat). (b) Decision curves illustrating net benefit of prediction models for a range of colorectal cancer risk thresholds, used to decide about further treatment or intervention. Decision curves are shown for different strategies: treating no one, treating everyone, treatment based on the age-only model, and treatment based on the LiFeCRC model.
Supplementary Figure 6. Predicted 10-year absolute risk of colorectal cancer for a healthy and an unhealthy lifestyle. Risk across different age groups and a constant body height of 166 cm. Unhealthy lifestyle: waist circumference of 100 cm, high daily alcohol consumption, smoker, physically inactive, 80 g daily vegetable intake, 70 g daily dairy products intake, 60 g daily processed meat intake, and 90 g daily sugar and confectionary intake. Healthy lifestyle: waist circumference of 70 cm, low daily alcohol consumption, non-smoker, physically active, 430 g daily vegetable intake, 630 g daily dairy products intake, 0 g daily processed meat intake, and 5 g daily sugar and confectionary intake.
Supplementary Figure 7. Full model performance including NSAID use and family history of colorectal cancer.