  • Research article
  • Open access

Polygenic risk score-based phenome-wide association study of head and neck cancer across two large biobanks



Numerous observational studies have highlighted associations between genetic predisposition to head and neck squamous cell carcinoma (HNSCC) and diverse risk factors, but these findings are constrained by the design limitations of observational studies. In this study, we utilized a phenome-wide association study (PheWAS) approach, incorporating a polygenic risk score (PRS) derived from a wide array of genomic variants, to systematically investigate phenotypes associated with genetic predisposition to HNSCC. Furthermore, we validated our findings across heterogeneous cohorts, enhancing the robustness and generalizability of our results.


We derived PRSs for HNSCC and its subgroups, oropharyngeal cancer and oral cancer, using large-scale genome-wide association study summary statistics from the Genetic Associations and Mechanisms in Oncology Network. We conducted a comprehensive investigation, leveraging genotyping data and electronic health records from 308,492 individuals in the UK Biobank and 38,401 individuals in the Penn Medicine Biobank (PMBB), and subsequently performed PheWAS to elucidate the associations between PRS and a wide spectrum of phenotypes.


The HNSCC PRS showed significant associations with phenotypes related to tobacco use disorder (OR, 1.06; 95% CI, 1.05–1.08; P = 3.50 × 10−15), alcoholism (OR, 1.06; 95% CI, 1.04–1.09; P = 6.14 × 10−9), alcohol-related disorders (OR, 1.08; 95% CI, 1.05–1.11; P = 1.09 × 10−8), emphysema (OR, 1.11; 95% CI, 1.06–1.16; P = 5.48 × 10−6), chronic airway obstruction (OR, 1.05; 95% CI, 1.03–1.07; P = 2.64 × 10−5), and cancer of bronchus (OR, 1.08; 95% CI, 1.04–1.13; P = 4.68 × 10−5). These findings were replicated in the PMBB cohort, and sensitivity analyses, including the exclusion of HNSCC cases and the major histocompatibility complex locus, confirmed the robustness of these associations. Additionally, we identified significant associations between HNSCC PRS and lifestyle factors related to smoking and alcohol consumption.


The study demonstrated the potential of PRS-based PheWAS in revealing associations between genetic risk factors for HNSCC and various phenotypic traits. The findings emphasized the importance of considering genetic susceptibility in understanding HNSCC and highlighted shared genetic bases between HNSCC and other health conditions and lifestyles.


Head and neck squamous cell carcinoma (HNSCC), which includes malignancies mainly affecting the oral cavity and oropharynx, is the sixth most common cancer worldwide [1, 2]. Tobacco use, including both direct consumption and exposure to smoke, and excessive alcohol intake are accepted as the primary etiological contributors to the development of HNSCC [3]. Infection with human papillomavirus (HPV) also constitutes a significant causative factor, particularly for oropharyngeal cancer (OPC) [4]. However, considering that a significant portion of the evidence concerning these risk factors originates from observational epidemiological studies, it is crucial to examine the underlying associations between risk factors. Moreover, the observation that HNSCC occurs in only a minority of tobacco users, alcohol consumers, and individuals infected with HPV implies a significant involvement of genetic predisposition in its pathophysiology [5]. A comprehensive investigation into the potential involvement of genetic factors is therefore warranted.

Extensive genome-wide association studies (GWASs) have revealed thousands of common variants to be associated with various types of cancer [6]. Polygenic risk scores (PRSs) aim to achieve a substantial improvement in risk prediction by considering the combined effects of multiple risk alleles. These scores provide a valuable methodology for capturing the collective influence of multiple genetic variants, enabling the identification of individuals who are at increased risk of developing site-specific cancers [7]. While the general predictive ability of PRSs for disease outcomes across diverse populations has been only modest in various cancer types, their effectiveness in cohort risk stratification has been substantiated [8, 9]. Recently, utilization of PRSs has expanded to encompass the screening of a diverse array of clinical phenotypes, collectively referred to as the medical phenome, to explore associations of these phenotypes with secondary traits [10].

As a singular biomarker computationally derived from a diverse spectrum of genetic variants, a PRS has markedly greater power than an individual single nucleotide polymorphism (SNP) and can be leveraged to great effect by phenome-wide association studies (PheWAS). PheWAS provide a valuable framework for the simultaneous investigation of genetic variants and physiological and clinical phenotypes, thereby facilitating the exploration of associations across a broad spectrum of traits. In such investigations of the combined landscape of genomics and phenomics, access to both electronic health records (EHRs) and GWAS data is essential.

To date, no studies have been reported that examine the correlation between genetic predisposition to HNSCC and related phenotypes utilizing a PRS-PheWAS analysis. The objective of our study was to demonstrate the potential utility of a PRS derived from a comprehensive population-based GWAS on HNSCC in the prediction of secondary phenotypes within an independent cohort. We conducted PheWAS to examine the correlation between the HNSCC PRS and the EHR-based phenome and validated our findings across independent diverse cohorts. Furthermore, we analyzed the association between HNSCC PRS and lifestyles related to significant phenotypes.


Study population

The UK Biobank (UKBB) is a large prospective observational cohort study that has recruited > 500,000 adults across 22 centers located throughout the UK. The full protocol of the UKBB study is publicly available, and the study design and measurement methods have been described elsewhere [11]. Participants aged 40–69 years were enrolled between 2006 and 2010 and were followed up for subsequent health events. The main analysis included individuals with diagnoses recorded as International Classification of Diseases (ICD)-9 or ICD-10 codes or identified from hospital episode statistics. All ICD-9 and ICD-10 diagnosis codes and laboratory measurements up to July 2020 were extracted from the EHRs.

The Penn Medicine Biobank (PMBB) is a large academic medical biobank in which participants are agnostically recruited from the outpatient setting and consented for access to their EHR data and permission to generate genomic and biomarker data [12]. The study flowchart is illustrated in Additional file 1: Fig. S1.

Definition of HNSCC and subtypes

Cancer cases comprised the following ICD-9 codes: oropharynx (145.3, 146.0, and 146.1), oral cavity (140.0–140.9, 141.0–141.9, 142.0–142.8, 143.0–143.9, 144.0–144.9, 145.0–145.9, and 230.0), and larynx (161.0–161.9); and the following ICD-10 codes: oropharynx (C01, C02.0, C02.4, C05.1, C05.2, C09.0–C10.9, and C14.0), oral cavity (C00.0–C00.9, C02.0–C02.9, C03.0–C03.9, C04.0–C04.9, C05.0–C06.9, and C14.8), hypopharynx (C12.9, C13.0–C13.2, C13.8, and C13.9), and larynx (C32.0–C32.3, C32.8, and C32.9). The detailed definition criteria for HNSCC and its subtypes in each cohort are described in Additional file 1: Method S2.

Genotype data quality control and imputation

Genotyping, quality control (QC), and imputation followed standard practices and were performed per cohort-genotyping platform pair. In both biobanks, we filtered out related individuals (second-degree or closer relatives) using KING software [13]. Further details are described in Additional file 1: Method S3 [14,15,16,17,18,19,20].

UK Biobank

The UKBB samples (version 3; March 2018) were genotyped for > 800,000 SNPs using either the Affymetrix UK BiLEVE Axiom array or the Affymetrix UKBB Axiom array. After QC and imputation, 308,492 European (White-British) individuals were determined eligible for the validation analyses.

Penn Medicine Biobank

The PMBB consists of 43,623 samples that have been genotyped with the GSA genotyping array. After QC and imputation, a total of 27,933 individuals considered of European (non-Hispanic White) ancestry and 10,468 individuals considered of African American (non-Hispanic Black) ancestry were determined eligible for the replication analyses.

Polygenic risk score

The HNSCC, OPC, and oral cavity cancer (OC) PRSs were generated based on the large-scale HNSCC (5974 cases and 4012 controls), OPC (2617 cases and 4012 controls), and OC (2958 cases and 4012 controls) GWAS summary statistics from the Genetic Associations and Mechanisms in Oncology (GAME-ON) Network (dbGAP [OncoArray: Oral and Pharynx Cancer; study accession number: phs001202.v1.p1]) [21].

To generate the PRSs, we used the Bayesian polygenic prediction method PRS-CS [22]. Individual PRSs were computed from beta coefficients as the weighted sum of the risk alleles by applying PLINK version 1.90 with the --score command [23]. Details of the PRS analysis are described in Additional file 1: Method S4.
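The weighted-sum scoring step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the variant IDs, weights, and dosages are invented for illustration, and the helper names `polygenic_risk_score` and `standardize` are hypothetical. It only shows how per-variant effect sizes and effect-allele counts combine into an individual score, and how scores can be standardized so that effects are read per one SD.

```python
# Minimal sketch of a polygenic risk score as a weighted sum of allele dosages.
# All variant IDs, weights, and dosages are made-up illustrative values.


def polygenic_risk_score(dosages, weights):
    """Sum weight * dosage over the variants present in both mappings."""
    shared = dosages.keys() & weights.keys()
    return sum(weights[v] * dosages[v] for v in shared)


def standardize(scores):
    """Convert raw scores to z-scores (sample SD) for per-SD interpretation."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sd = variance ** 0.5
    return [(s - mean) / sd for s in scores]


# Hypothetical per-variant effect sizes (log-odds per effect allele).
weights = {"rsA": 0.02, "rsB": -0.01, "rsC": 0.05}

# Hypothetical effect-allele dosages (0, 1, or 2 copies) for three people.
cohort = {
    "person1": {"rsA": 2, "rsB": 0, "rsC": 1},
    "person2": {"rsA": 0, "rsB": 2, "rsC": 0},
    "person3": {"rsA": 1, "rsB": 1, "rsC": 2},
}

raw = {pid: polygenic_risk_score(d, weights) for pid, d in cohort.items()}
z = standardize(list(raw.values()))
```

In practice the weights would come from the PRS-CS output and the dosages from imputed genotypes; the toy dictionaries stand in for both.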

Phenome-wide association study

The PheWAS R package (version 0.99.5-5) was used to perform PheWAS analyses [24]. In these analyses, the PRS was set as the independent variable, and disease phenotypes as the dependent variables, with age, sex, genotyping array, and the first 10 genetic principal components (PCs) as covariates. Disease diagnosis category phenotypes were obtained by mapping the ICD-9 and ICD-10 diagnosis codes of the UKBB to 1608 hierarchical phenotypes (PheCodes) categorized into 17 disease categories [24, 25]. We removed phenotypic codes with fewer than 200 cases and those concerning symptoms, injuries, and poisoning; this left 850 phenotypes in 15 disease categories that were included in our analysis. Of these, 838 were eligible for replication analysis in the PMBB.
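The phenotype filtering rule above can be sketched as follows. This is an illustration under stated assumptions, not the PheWAS R package itself: the phenotype identifiers, case counts, and category labels are invented, and `eligible_phenotypes` is a hypothetical helper.

```python
# Sketch of the phenotype filter: keep phenotypes with at least 200 cases
# and drop those in excluded categories. All data below are invented.

MIN_CASES = 200
# Assumed category labels; the real labels depend on the phecode map used.
EXCLUDED_CATEGORIES = {"symptoms", "injuries & poisonings"}


def eligible_phenotypes(phenotypes):
    """Return sorted codes that pass the case-count and category filters."""
    kept = []
    for code, info in phenotypes.items():
        if info["category"] in EXCLUDED_CATEGORIES:
            continue  # category excluded from the analysis
        if info["n_cases"] < MIN_CASES:
            continue  # fewer than 200 cases
        kept.append(code)
    return sorted(kept)


phenotypes = {
    "code_A": {"category": "respiratory", "n_cases": 5400},
    "code_B": {"category": "respiratory", "n_cases": 150},
    "code_C": {"category": "symptoms", "n_cases": 9000},
    "code_D": {"category": "neoplasms", "n_cases": 200},
}

selected = eligible_phenotypes(phenotypes)
```

A phenotype with exactly 200 cases is retained here, matching the stated rule of removing codes with fewer than 200 cases.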

Statistical analysis

Demographic and clinical characteristics are presented as mean ± standard deviation (SD) or as number (percentage). Continuous variables were compared by Student’s t-test or the Mann–Whitney U test as appropriate. Categorical variables were compared by the chi-square test or Fisher’s exact test as appropriate.

We used a multivariable logistic regression model to evaluate the association of the HNSCC, OPC, and OC PRSs with HNSCC, OPC, and OC occurrence. In the PheWAS analysis, we calculated odds ratios (ORs) and 95% confidence intervals (CIs) after adjusting for age, sex, the first 10 PCs of ancestry, and genotyping array type. The PRS was analyzed both as a continuous variable, with ORs reported per one-SD increase, and as a categorical variable defined as follows: low (0–24th percentile), intermediate (25–49th percentile), high (50–74th percentile), and very high (75–99th percentile). For the PRS-PheWAS analyses, we applied Bonferroni’s correction for multiple hypothesis testing and set P < 5.88 × 10−5 (= 0.05/850, adjusted for the number of phecode-based traits analyzed in the study) as the threshold for statistical significance. In addition, we performed sensitivity analyses stratified by sex, age, and smoking status, as well as analyses excluding HNSCC cases and masking the major histocompatibility complex (MHC) region. Subsequently, we conducted trend analyses to test for differences in lifestyle factors (alcohol use and smoking) and HPV status across PRS risk groups (Additional file 1: Method S5).
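The arithmetic behind these definitions can be sketched briefly. The example below is illustrative only: the coefficient and standard error are hypothetical values, not estimates from this study, and `odds_ratio_with_ci` and `risk_group` are hypothetical helper names. It shows the Bonferroni threshold for 850 tests, the conversion of a logistic regression coefficient into an OR with a Wald 95% CI, and the mapping of a PRS percentile onto the four risk groups.

```python
import math

# Bonferroni threshold for the 850 phenotypes tested in the PheWAS.
ALPHA = 0.05
N_PHENOTYPES = 850
bonferroni_threshold = ALPHA / N_PHENOTYPES  # approximately 5.88e-5


def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log-odds coefficient and its SE to (OR, lower, upper)."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


def risk_group(percentile):
    """Map a PRS percentile (0 to below 100) to the four risk groups."""
    if percentile < 25:
        return "low"
    if percentile < 50:
        return "intermediate"
    if percentile < 75:
        return "high"
    return "very high"


# Hypothetical per-SD coefficient and standard error (not study estimates).
beta = math.log(1.06)
se = 0.0075
odds_ratio, ci_lower, ci_upper = odds_ratio_with_ci(beta, se)
```

Because the interval is built on the log-odds scale, the lower and upper bounds are symmetric around the OR only after taking logarithms, which is a useful sanity check when reading reported intervals.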

All statistical tests were two-sided, and P < 0.05 was considered statistically significant. All statistical analyses were conducted using the R Statistical Software (version 4.1.0; R Foundation for Statistical Computing, Vienna, Austria) and PLINK version 1.90 [23].



In total, 308,492 participants of European descent from the UKBB were included, after excluding those with no history of in-patient records or lacking ICD or self-reported information relevant to this study. The mean age of participants was 58.0 years (SD, 7.9 years). The characteristics of participants in each group are presented in Additional file 1: Table S1. In total, 1753 study subjects had a history of HNSCC, of whom 556 (31.7%) had OPC and 856 (48.8%) had OC. The “others” category (341 [19.5%]) includes hypopharynx cancer, larynx cancer, and other cancers. Significant differences between the controls and HNSCC cases were observed in HPV positivity, smoking status, and alcohol intake frequency.

For the replication set, a total of 38,401 PMBB participants of European (n = 27,933) and African American (n = 10,468) descent were included (Additional file 1: Table S2). The mean age of participants was 55.9 years (SD, 16.4 years). Among the HNSCC cases, there were 437 (59.8%) diagnosed with OC, 231 (31.6%) with OPC, and 64 (8.8%) with other cancers.

PRS association with HNSCC and validation in the UKBB and PMBB

We investigated the associations between PRSs and HNSCC and its subtypes in the UKBB. We observed HNSCC PRS to be associated with the occurrence risk of HNSCC (OR, 1.12; 95% CI, 1.06–1.17; P < 0.001), OPC (OR, 1.18; 95% CI, 1.08–1.28; P < 0.001), and OC (OR, 1.10; 95% CI, 1.02–1.17; P = 0.009). OPC PRS was also associated with the occurrence risk of HNSCC (OR, 1.10; 95% CI, 1.05–1.16; P < 0.001) and OPC (OR, 1.20; 95% CI, 1.10–1.31; P < 0.001), but not with OC risk. Meanwhile, OC PRS was associated with the occurrence risk of HNSCC (OR, 1.09; 95% CI, 1.04–1.15; P < 0.001), OPC (OR, 1.10; 95% CI, 1.01–1.20; P = 0.027), and OC (OR, 1.09; 95% CI, 1.02–1.17; P = 0.015) (Additional file 1: Table S3). We also confirmed the association of the PRSs for HNSCC and its subtypes with the risk of occurrence in subgroups based on age, sex, and smoking status (Additional file 1: Table S4). These associations were replicated in the PMBB cohort: the PRSs for HNSCC and OPC showed significant association with HNSCC and its subtypes, while that for OC exhibited weaker association (Additional file 1: Table S5). We estimated the proportion of variance explained by the PRSs for HNSCC, OPC, and OC in both cohorts (Additional file 1: Table S3 and S5).

To investigate the impact of unbalanced case-to-control ratios between the two cohorts, we repeated the PRS analysis at different ratios using data from both biobanks (Additional file 1: Table S6). In addition, we performed ancestry-specific analyses in the PMBB (Additional file 1: Table S7). We found that inherent differences in the characteristics of the target cohorts could affect the performance of the PRS analysis, including the proportion of variance explained, even when the case-to-control ratios were identical.


We tested the association between HNSCC PRS and phenotypes constructed in the UKBB (Fig. 1). For the HNSCC PRS, the strongest association was observed for “Tobacco use disorder” (OR, 1.06; 95% CI, 1.05–1.08; P = 3.50 × 10−15). The HNSCC PRS was also associated with “Alcoholism” (OR, 1.06; 95% CI, 1.04–1.09; P = 6.14 × 10−9), “Alcohol-related disorders” (OR, 1.08; 95% CI, 1.05–1.11; P = 1.09 × 10−8), “Emphysema” (OR, 1.11; 95% CI, 1.06–1.16; P = 5.48 × 10−6), “Chronic airway obstruction” (OR, 1.05; 95% CI, 1.03–1.07; P = 2.64 × 10−5), “Cancer of bronchus; lung” (OR, 1.08; 95% CI, 1.04–1.13; P = 4.68 × 10−5), and “Spondylosis and allied disorders” (OR, 1.05; 95% CI, 1.03–1.07; P = 1.46 × 10−5) (Table 1 and Additional file 2: Table S8).

Fig. 1 PheWAS Manhattan plot of HNSCC and subtypes genetic risk score in UK Biobank. Abbreviations: PheWAS, phenome-wide association study; HNSCC, head and neck squamous cell carcinoma

Table 1 Significant associations of HNSCC PRS with PheWAS in the UK Biobank that were also replicated in the Penn Medicine Biobank

In the subtype analysis for the OPC PRS, the most strongly associated phenotype was “Tobacco use disorder,” followed by “Cancer of bronchus; lung” and “Chronic airway obstruction” (Table 1 and Additional file 2: Table S9). Meanwhile, for the OC PRS, significant associations were observed with “Tobacco use disorder,” “Alcoholism,” and “Alcohol-related disorders” (Table 1 and Additional file 2: Table S10). When stratified by HNSCC PRS percentile, the prevalence of each phenotype increased with higher PRS percentiles (Additional file 1: Fig. S2).

PRS-PheWAS validation in the PMBB

To establish the correlation of PRSs with the identified phenotype traits, we replicated the association analyses within the corresponding phenotype of the PMBB dataset. Upon examination of the PMBB phenome, the majority of previously observed associations were validated; the exceptions were the traits “Spondylosis and allied disorders” with the HNSCC PRS and “Alcoholism” and “Alcohol-related disorders” with the OC PRS, which did not exhibit significant associations (Table 1 and Additional file 2: Tables S8-10).

Sensitivity analysis

Exclusion PheWAS

To investigate whether the observed associations of the HNSCC PRS with phenotypes were solely attributable to the inclusion of HNSCC cases, we conducted a PheWAS after excluding HNSCC cases from the UKBB. After removing 1753 HNSCC case subjects, the associations between HNSCC PRS and the phenotypes remained consistent with those in the full analysis. Specifically, “Tobacco use disorder,” “Alcoholism,” “Alcohol-related disorders,” “Emphysema,” “Chronic airway obstruction,” and “Cancer of bronchus; lung” remained significantly associated with HNSCC PRS in the UKBB (Table 2).

Table 2 Sensitivity analysis results of HNSCC PRS with significant associations in the UK Biobank

MHC region exclusion analysis

We also generated a HNSCC PRS excluding the MHC locus. This score remained significantly associated with all phenotypes even after the entire MHC region was excluded. Moreover, these significant associations remained in a second sensitivity analysis that excluded HNSCC cases as well as the MHC region (Table 2).

Sex, age, and smoking status-stratified analyses

In sex-stratified analysis, all phenotypes remained significant. Overall, there was no significant sex interaction (Table 3). There was no significant association between “Cancer of bronchus; lung” and HNSCC PRS in the younger (age ≤ 60 years) group, while all phenotypes showed significant associations in the elderly group (age > 60 years). In addition, only “Alcoholism” and “Emphysema” were significant in the never-smoker group, while all phenotypes showed significant associations in the ever-smoker group (Table 4).

Table 3 Sex-stratified results of HNSCC PRS with significant associations in the UK Biobank
Table 4 Subgroup-stratified results of HNSCC PRS with significant associations in the UK Biobank

Association between HNSCC PRS and smoking, alcohol consumption, and HPV seropositivity

As we observed HNSCC PRS to have associations with the phenotypes of alcoholism and smoking, which were generated based on ICD codes, we proceeded to explore its connections with lifestyle factors related to actual alcohol consumption and smoking. Having a very high PRS was significantly associated with current smoking status (P < 0.001), a high number of cigarettes previously smoked daily (P < 0.001), high pack years of smoking (P < 0.001), past tobacco smoking (P < 0.001), maternal smoking around birth (P < 0.001), a higher age at smoking cessation (P < 0.001), and a high number of unsuccessful stop-smoking attempts (P = 0.006) (Table 5). We also observed significant associations of HNSCC PRS with alcohol drinker status (P < 0.001), frequency of alcohol intake (P = 0.045), amount of alcohol intake (P < 0.001), alcohol usually taken with meals (P < 0.001), and a history of past alcohol consumption (P < 0.001) (Table 6). However, no significant association was found between HNSCC PRS and seropositivity for HPV type-16 (Table 7).

Table 5 Smoking-related characteristics according to the genetic risk group of HNSCC
Table 6 Alcohol-related characteristics according to the genetic risk group of HNSCC
Table 7 HPV characteristics according to the genetic risk group of HNSCC


The aim of this study was to explore phenotypes connected to the genetic predisposition for HNSCC within the UKBB cohort, for which we utilized a PheWAS. These findings were validated in a replication set involving 38,401 participants from the PMBB.

The HNSCC PRS constructed here, together with the subtype PRSs for OC and OPC, incorporated the most extensive assemblage of SNPs discovered in the recent GWAS for HNSCC conducted by the GAME-ON Network [21]. The resultant PRS was robustly validated in both the UKBB (European) and the PMBB (European and African American) cohorts, despite the population diversity present within the PMBB. One previous study derived PRSs for 16 cancer types, including a HNSCC PRS derived from the 14 SNPs in prior HNSCC GWASs; this PRS demonstrated the smallest effect size, with an OR of 1.08 [26]. Another PRS based on summary data from the FinnGen HNSCC GWAS showed a nonsignificant association with the risk of HNSCC [27]. Our validated and replicated results and higher OR of 1.17 (95% CI, 1.07–1.26) indicate improved performance of the HNSCC PRS for capturing high-risk individuals. The two datasets used to evaluate and validate the performance of the HNSCC PRS are cohorts with distinct characteristics and diverse ancestry. The UKBB is a prospective national cohort study based on healthy participants, whereas the PMBB is an academic research cohort derived from a regional university hospital with diverse ancestry. Accordingly, even when the two datasets were analyzed at matched case-to-control ratios rather than their native ones, they differed in the proportion of variance explained. This suggests that the distinct characteristics and ancestry of the cohorts influence the results of the PRS performance analysis.

The overall low effect of the HNSCC PRS can be attributed to several factors. Firstly, the etiology of HNSCC is multifaceted, involving a complex interplay of genetic, environmental, and lifestyle factors. While PRSs are designed to capture the cumulative effect of multiple genetic variants, they might not fully account for the intricate interactions between genetic variations and the diverse array of risk factors specific to HNSCC. Additionally, the genetic architecture of HNSCC might not be as strongly influenced by common variants as that of some other diseases [28]. This could result in the PRS having lower predictive power, as it relies heavily on the contributions of common variants. Furthermore, the HNSCC patient group is heterogeneous, which poses a distinct challenge. Cancers at different subsites within the head and neck region (e.g., oral cavity, pharynx, and larynx) may have distinct genetic underpinnings and risk factors, making it harder for a general PRS to accurately predict risk across all subtypes. On the other hand, PRSs can serve as a valuable tool for conducting PheWAS to unveil secondary trait associations facilitated by the presence of shared genetic risk factors. These secondary associations have the potential to unveil characteristics within EHRs that manifest prior to cancer diagnosis, and hence could emerge as meaningful predictors for cancer outcomes [27]. Fritsche et al. conducted a comprehensive PheWAS using PRSs encompassing 35 prevalent cancer traits; however, their analysis did not yield any substantial phenotypic associations for oral cancer and laryngeal cancer, the examined types that correspond to HNSCC [27]. Our study explored the associations between HNSCC PRS and various phenotypes constructed from the UKBB cohort. Notably, we observed strong associations of the PRS with certain phenotypes; for instance, “Tobacco use disorder” showed a particularly robust association, indicating the importance of smoking as a risk factor for HNSCC.
This association was also detected when using both OPC and OC PRSs. Additional associations with “Alcoholism,” “Alcohol-related disorders,” and other health conditions suggest a complex interplay of lifestyle and genetic factors in HNSCC risk and particularly imply that HNSCC and disorders related to alcohol and smoking share a genetic basis. A case–control study also reported polymorphism in glutathione S-transferase genes and interaction with environmental factors such as smoking and alcohol on susceptibility to HNSCC [29].

In a previous Mendelian randomization (MR) analysis, researchers observed a PRS representing genetic susceptibility to smoking initiation to be significantly associated with elevated risk of HNSCC [30]. Another study conducted univariable and multivariable MR analyses utilizing summary-level genetic data from the GWAS and Sequencing Consortium of Alcohol and Nicotine Use, the UKBB, and the GAME-ON Network, which revealed independent causal impacts of both smoking and alcohol on the risk of OC and OPC [31].

Smoking is notably correlated with the prevalence of HNSCC [32], and this association is particularly evident in cases involving tumors originating from the oral cavity, nasopharynx, oropharynx, hypopharynx, and larynx [33]. Some genetic variations might contribute to both increased HNSCC risk and a higher susceptibility to smoking addiction [34]. In particular, certain genes related to nicotine metabolism, neurotransmitter pathways, and cellular processes can influence both smoking behavior and cancer susceptibility [35]. A recent study also found that genetic variants in metabolic genes linked to polycyclic aromatic hydrocarbons and tobacco-specific nitrosamines exhibit associations with susceptibility to HNSCC and its subtypes [36]. Moreover, findings from prior PheWAS have revealed significant correlations between these genes and the risks of diverse cancers, along with smoking behavior. Meanwhile, observational evidence on the connections between alcohol consumption and different types of cancer is inconsistent [37]. The interaction of genetic polymorphisms related to alcohol metabolism with alcohol drinking has been noted to affect the risk of HNSCC [38]. In particular, Chien et al. showed SNPs in genes encoding alcohol-metabolizing enzymes (ADH1B, ADH1C, and ALDH2) to be associated with patients’ susceptibility to developing multiple primary tumors, especially in the hypopharynx and esophagus, which are challenging to manage in patients with head and neck cancer [39]. Our findings add to these reports by unveiling the association of HNSCC PRS with smoking- and alcohol-related disorders.

Graff et al. previously explored the presence of PRS-specific pleiotropy across 16 types of cancer using individuals of European ancestry from the Genetic Epidemiology Research on Adult Health and Aging cohort and the UKBB [26]. In their findings, lung cancer PRS was positively associated with oral/pharyngeal cancer, but oral/pharyngeal cancer PRS was inversely associated with lung cancer. This inconsistency could be attributed to two specific variants (rs467095 and rs10462706) among the 14 associated with oral/pharyngeal sites, which were inversely correlated with lung cancer risk. Meanwhile, the HNSCC PRS in this study, which was based on hundreds of thousands of variants through the PRS-CS approach, showed significant positive pleiotropy with cancer of the bronchus, chronic airway obstruction, and emphysema. A recent study showed that a SNP (rs3017895, located in FAM13A) strongly associated in GWAS with chronic obstructive lung disease, including emphysema, may contribute to OC [40].

In this study, we conducted several sensitivity analyses to assess the robustness of our findings, including analyses stratified by sex, age, and smoking status, exclusion of HNSCC cases, and exclusion of the MHC region. The MHC exclusion analysis was conducted because several MHC risk variants, particularly in the class II HLA genes (e.g., HLA-DPB1), have a known substantial impact on genetic predisposition to HNSCC [21, 41]. The identified associations were consistent across the sensitivity analyses, providing further confidence in the study’s results. Moreover, the analyses excluding MHC variants consistently showed similar effect sizes, indicating that such variants play a limited role in the observed associations.

Cancer susceptibility is multifaceted, encompassing not only genetic risk factors but also various lifestyle, anthropometric, hormonal, reproductive, and imaging factors [42]. In the context of our study, the prediction of HNSCC based solely on genetic factors proves challenging, given the multifactorial nature of cancer, the involvement of numerous genes, the impact of environmental factors, and the incomplete elucidation of the intricate interplay between genetic and non-genetic risk factors. Our results, derived from the establishment of HNSCC PRS within a relatively extensive cohort, reveal an association with the disease across two cohorts. However, the predictive efficacy was relatively low. Notably, through PRS-PheWAS, our investigation confirmed a significant association between HNSCC PRS and disease entities related to alcohol and smoking, which are well-known modifiable risk factors for HNSCC. We analyzed the association between genetic risk and the major risk factors for HNSCC, such as alcohol and tobacco-related lifestyle habits and HPV infection. We found a high PRS to be significantly associated with various smoking-related characteristics, including current smoking status, pack years of smoking, and age at smoking cessation. This reinforces the well-established link between smoking and HNSCC risk. Similarly, the study identified significant associations with alcohol-related factors, such as alcohol drinker status and past alcohol consumption. Taken together, these findings emphasize the roles of smoking and alcohol consumption as risk factors for HNSCC. However, no significant association was found between HNSCC PRS and seropositivity for HPV type-16. Considering the limited sample size for HPV seropositive and seronegative cases in the UKBB, it is difficult to draw definitive conclusions regarding the correlation between HNSCC PRS and HPV seropositivity.
Our investigation establishes significant associations between genetic and modifiable risk factors for HNSCC within a population-based cohort, distinguished by a comprehensive dataset encompassing diverse phenotypes and cancer risk factors. By identifying these associated secondary phenotypes, we could understand the genetic factors in HNSCC better and improve the prediction ability for HNSCC by considering interactions with various non-genetic traits in the future [43, 44].


This study has several limitations. Firstly, despite conducting numerous sensitivity analyses, the possibility of pleiotropic effects resulting from multiple genetic instruments cannot be eliminated unless all the biological impacts of each and every SNP are comprehensively understood. Secondly, HNSCC is a markedly heterogeneous malignancy, encompassing molecular subtypes that exhibit contrasting behaviors [45]. Adopting a broader phenotype definition would permit larger sample sizes, but it could also lead to the inclusion of genetically diverse phenotypes, contributing to increased disease heterogeneity and a subsequent reduction in predictive capability [46]. Conversely, refining the phenotype might enhance homogeneity, but it could constrain sample size, with consequent loss of statistical power.


In conclusion, this study provides valuable insight into the genetic risk factors associated with HNSCC and its subtypes. The findings highlight the importance of PRS as a tool for understanding disease risk and suggest a complex interaction between genetic susceptibility and lifestyle factors, particularly smoking and drinking. These findings have the potential to inform strategies for HNSCC prevention and personalized medicine. Further research may be needed to explore the underlying mechanisms linking genetics, lifestyle, and HNSCC risk in more detail.

Availability of data and materials

GAME-ON Network data, including genotype and phenotype data for HNSCC and its subtypes, are available for download from dbGaP upon appropriate request under study accession number phs001202.v1.p1 (OncoArray: Oral and Pharynx Cancer). The HNSCC, OPC, and OC PRS models constructed in the current paper are available for download from GitHub.



Abbreviations

CI: Confidence interval
GAME-ON: Genetic Associations and Mechanisms in Oncology
GWAS: Genome-wide association study
EHR: Electronic health records
HNSCC: Head and neck squamous cell carcinoma
HPV: Human papillomavirus
ICD: International Classification of Diseases
MHC: Major histocompatibility complex
MR: Mendelian randomization
OC: Oral cancer
OPC: Oropharyngeal cancer
OR: Odds ratio
PC: Principal component
PheWAS: Phenome-wide association study
PMBB: Penn Medicine Biobank
PRS: Polygenic risk score
QC: Quality control
SD: Standard deviation
SNP: Single nucleotide polymorphism
UKBB: UK Biobank


  1. Warnakulasuriya S. Global epidemiology of oral and oropharyngeal cancer. Oral Oncol. 2009;45:309–16.

  2. Saba NF, Goodman M, Ward K, Flowers C, Ramalingam S, Owonikoko T, et al. Gender and ethnic disparities in incidence and survival of squamous cell carcinoma of the oral tongue, base of tongue, and tonsils: a surveillance, epidemiology and end results program-based analysis. Oncology. 2011;81:12–20.

  3. Hashibe M, Brennan P, Benhamou S, Castellsague X, Chen C, Curado MP, et al. Alcohol drinking in never users of tobacco, cigarette smoking in never drinkers, and the risk of head and neck cancer: pooled analysis in the International Head and Neck Cancer Epidemiology Consortium. J Natl Cancer Inst. 2007;99:777–89.

  4. Vidal L, Gillison ML. Human papillomavirus in HNSCC: recognition of a distinct disease type. Hematol Oncol Clin North Am. 2008;22:1125–42.

  5. Ho T, Wei Q, Sturgis EM. Epidemiology of carcinogen metabolism genes and risk of squamous cell carcinoma of the head and neck. Head Neck J Sci Spec Head Neck. 2007;29:682–99.

  6. Buniello A, MacArthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47:D1005–12.

  7. Dudbridge F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 2013;9:e1003348.

  8. Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet. 2018;50:1219–24.

  9. Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007;17:1520–8.

  10. Fritsche LG, Gruber SB, Wu Z, Schmidt EM, Zawistowski M, Moser SE, et al. Association of polygenic risk scores for multiple cancers in a phenome-wide study: results from the Michigan Genomics Initiative. Am J Hum Genet. 2018;102:1048–61.

  11. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–9.

  12. Verma A, Damrauer SM, Naseer N, Weaver J, Kripke CM, Guare L, et al. The Penn Medicine BioBank: towards a genomics-enabled learning healthcare system to accelerate precision medicine in a diverse population. J Pers Med. 2022;12:1974.

  13. Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–73.

  14. Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529.

  15. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015;526:68–74.

  16. McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet. 2016;48:1279–83.

  17. O’Connell J, Sharp K, Shrine N, Wain L, Hall I, Tobin M, et al. Haplotype estimation for biobank-scale data sets. Nat Genet. 2016;48:817–20.

  18. Fuchsberger C, Abecasis GR, Hinds DA. minimac2: faster genotype imputation. Bioinforma Oxf Engl. 2015;31:782–4.

  19. Browning SR. Missing data imputation and haplotype phase inference for genome-wide association studies. Hum Genet. 2008;124:439–50.

  20. Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48:1284–7.

  21. Lesseur C, Diergaarde B, Olshan AF, Wünsch-Filho V, Ness AR, Liu G, et al. Genome-wide association analyses identify new susceptibility loci for oral cavity and pharyngeal cancer. Nat Genet. 2016;48:1544–50.

  22. Ge T, Chen CY, Ni Y, Feng YCA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun. 2019;10:1776.

  23. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7.

  24. Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, Brown-Gentry K, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics. 2010;26:1205–10.

  25. Denny JC, Bastarache L, Ritchie MD, Carroll RJ, Zink R, Mosley JD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol. 2013;31:1102–11.

  26. Graff RE, Cavazos TB, Thai KK, Kachuri L, Rashkin SR, Hoffman JD, et al. Cross-cancer evaluation of polygenic risk scores for 16 cancer types in two large cohorts. Nat Commun. 2021;12:970.

  27. Fritsche LG, Patil S, Beesley LJ, VandeHaar P, Salvatore M, Ma Y, et al. Cancer PRSweb: an online repository with polygenic risk scores for major cancer traits and their evaluation in two independent biobanks. Am J Hum Genet. 2020;107:815–36.

  28. Lacko M, Braakhuis BJM, Sturgis EM, Boedeker CC, Suárez C, Rinaldo A, et al. Genetic susceptibility to head and neck squamous cell carcinoma. Int J Radiat Oncol. 2014;89:38–48.

  29. Singh M, Shah PP, Singh AP, Ruwali M, Mathur N, Pant MC, et al. Association of genetic polymorphisms in glutathione S-transferases and susceptibility to head and neck cancer. Mutat Res Mol Mech Mutagen. 2008;638:184–94.

  30. Larsson SC, Carter P, Kar S, Vithayathil M, Mason AM, Michaëlsson K, et al. Smoking, alcohol consumption, and cancer: a mendelian randomisation study in UK Biobank and international genetic consortia participants. PLoS Med. 2020;17:e1003178.

  31. Gormley M, Dudding T, Sanderson E, Martin RM, Thomas S, Tyrrell J, et al. A multivariable Mendelian randomization analysis investigating smoking and alcohol consumption in oral and oropharyngeal cancer. Nat Commun. 2020;11:6071.

  32. Argiris A, Karamouzis MV, Raben D, Ferris RL. Head and neck cancer. The Lancet. 2008;371:1695–709.

  33. Jethwa AR, Khariwala SS. Tobacco-related carcinogenesis in head and neck cancer. Cancer Metastasis Rev. 2017;36:411–23.

  34. Anantharaman D, Chabrier A, Gaborieau V, Franceschi S, Herrero R, Rajkumar T, et al. Genetic variants in nicotine addiction and alcohol metabolism genes, oral cancer risk and the propensity to smoke and drink alcohol: a replication study in India. PLoS ONE. 2014;9:e88240.

  35. Bierut LJ. Genetic vulnerability and susceptibility to substance dependence. Neuron. 2011;69:618–27.

  36. Liu H, Li G, Sturgis EM, Shete S, Dahlstrom KR, Du M, et al. Genetic variants in CYP2B6 and HSD17B12 associated with risk of squamous cell carcinoma of the head and neck. Int J Cancer. 2022;151:553–64.

  37. Bagnardi V, Rota M, Botteri E, Tramacere I, Islami F, Fedirko V, et al. Alcohol consumption and site-specific cancer risk: a comprehensive dose-response meta-analysis. Br J Cancer. 2015;112:580–93.

  38. Kawakita D, Matsuo K. Alcohol and head and neck cancer. Cancer Metastasis Rev. 2017;36:425–34.

  39. Chien HT, Young CK, Chen TP, Liao CT, Wang HM, Cheng SD, et al. Alcohol-metabolizing enzymes’ gene polymorphisms and susceptibility to multiple head and neck cancers. Cancer Prev Res (Phila Pa). 2019;12:247–54.

  40. Hsieh MJ, Lo YS, Tsai YJ, Ho HY, Lin CC, Chuang YC, et al. FAM13A polymorphisms are associated with a specific susceptibility to clinical progression of oral cancer in alcohol drinkers. BMC Cancer. 2023;23:607.

  41. Shete S, Liu H, Wang J, Yu R, Sturgis EM, Li G, et al. A genome-wide association study identifies two novel susceptible regions for squamous cell carcinoma of the head and neck. Cancer Res. 2020;80:2451–60.

  42. Yang X, Kar S, Antoniou AC, Pharoah PDP. Polygenic scores in cancer. Nat Rev Cancer. 2023;21:1–12.

  43. Kachuri L, Graff RE, Smith-Byrne K, Meyers TJ, Rashkin SR, Ziv E, et al. Pan-cancer analysis demonstrates that integrating polygenic risk scores with modifiable risk factors improves risk prediction. Nat Commun. 2020;11:6084.

  44. Garcia-Closas M, Gunsoy NB, Chatterjee N. Combined associations of genetic and environmental risk factors: implications for prevention of breast cancer. J Natl Cancer Inst. 2014;106:dju305.

  45. Yin J, Zheng S, He X, Huang Y, Hu L, Qin F, et al. Identification of molecular classification and gene signature for predicting prognosis and immunotherapy response in HNSCC using cell differentiation trajectories. Sci Rep. 2022;12:20404.

  46. Fritsche LG, Beesley LJ, VandeHaar P, Peng RB, Salvatore M, Zawistowski M, et al. Exploring various polygenic risk scores for skin cancer in the phenomes of the Michigan genomics initiative and the UK Biobank with a visual catalog: PRSWeb. PLoS Genet. 2019;15:e1008202.


We thank the participants who contributed their data to the UK Biobank study. We also acknowledge the Penn Medicine Biobank for providing data and thank the patient-participants of Penn Medicine who consented to participate in this research program. We further thank the Penn Medicine Biobank team and the Regeneron Genetics Center for providing genetic variant data for analysis. A full list of contributors to the Penn Medicine Biobank is available in Additional file 1: Method S1.


This work was supported by the National Institute of General Medical Sciences (NIGMS) R01 GM138597.

Author information

Authors and Affiliations



YL, S-HJ, and DK conceived and designed the study and analyzed the data. YL and S-HJ performed the statistical analyses and wrote the manuscript. YL and Y-GE curated the data. S-HJ, MS, and SC conducted data pre-processing. W-YP and H-HW interpreted the data. MS, SC, W-YP, H-HW, and Y-GE read and critically revised the manuscript for intellectual content. DK supervised the project. All authors have read and approved the final manuscript.

Authors’ information

Young Chan Lee and Sang-Hyuk Jung contributed equally to this work.

Corresponding author

Correspondence to Dokyoon Kim.

Ethics declarations

Ethics approval and consent to participate

The UK Biobank (UKBB) was approved by the National Research Ethics Committee (June 17, 2011 [RES reference 11/NW/0382]; extended on May 10, 2016 [RES reference 16/NW/0274]). The present research using the UKBB Resource was approved under Application Number 33002. The collection, storage, and analysis of biospecimens, genetic data, and data derived from electronic health records as part of the Penn Medicine Biobank (PMBB) is approved under University of Pennsylvania IRB protocol #813913. Participants from the UKBB and the PMBB provided written informed consent allowing the use of their samples and data for medical research purposes. This study followed the reporting requirements of the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) Statement.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Method S1. Penn Medicine Biobank banner author list and contribution statements. Method S2. Detailed definition of HNSCC. Method S3. Detailed information on the genotype data quality control and imputation procedures. Method S4. Generation of polygenic risk scores. Method S5. Number of missing data for each variable in the UK Biobank. Table S1. Characteristics of participants in the UK Biobank. Table S2. Characteristics of participants in the Penn Medicine Biobank. Table S3. Odds ratio for HNSCC and its subtypes associated with genetic risk in the UK Biobank. Table S4. Odds ratio for HNSCC and its subtypes associated with genetic risk across subgroups by age, sex, and smoking status in the UK Biobank. Table S5. Odds ratio for HNSCC and its subtypes associated with genetic risk in the Penn Medicine Biobank. Table S6. Odds ratio for HNSCC associated with genetic risk across different case–control ratios in the UK Biobank and Penn Medicine Biobank. Table S7. The ancestry-specific odds ratio for HNSCC associated with genetic risk in the Penn Medicine Biobank. Figure S1. Study flowchart. Figure S2. Prevalence plot for significant phenotypes in PheWAS according to genetic risk groups.

Additional file 2:

Table S8. Full results of HNSCC PRS-PheWAS in UK Biobank and Penn Medicine Biobank. Table S9. Full results of OPC PRS-PheWAS in UK Biobank and Penn Medicine Biobank. Table S10. Full results of OC PRS-PheWAS in UK Biobank and Penn Medicine Biobank.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit the Creative Commons website. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Lee, Y.C., Jung, SH., Shivakumar, M. et al. Polygenic risk score-based phenome-wide association study of head and neck cancer across two large biobanks. BMC Med 22, 120 (2024).
