Study design
UK Biobank
Our study was based on data from the large-scale prospective cohort of UK Biobank, which enrolled 502,507 individuals aged between 40 and 69 years across the UK during 2006–2010 (i.e., ages 50–83 years at the time of the COVID-19 outbreak). Genotyping data were obtained from 488,377 blood samples collected at baseline for each participant. They were assayed using UK BiLEVE (Applied Biosystems, Foster City, CA, USA) and UK Biobank Axiom Array [13]. After quality control following the UK Biobank pipeline, genotype imputation was completed using the Haplotype Reference Consortium and the UK10K haplotype resource as reference panels [13]. The kinship coefficient and principal components (PCs), which were calculated using the KING tool, were also provided by UK Biobank. Details about the quality-control pipeline of UK Biobank and imputation methods have been described previously [13]. The final quality-controlled and imputed-genotype dataset was the basis for our analyses and contained >93 million autosomal single-nucleotide polymorphisms (SNPs) of 487,296 individuals.
Phenotypic data such as sex and birth year were collected upon recruitment using a questionnaire. Health-related outcomes were obtained through periodically linked data from multiple national datasets, including death registries and inpatient hospital data from across England, Scotland, and Wales [13]. After the global outbreak of COVID-19, UK Biobank was also been linked to Public Health England (PHE), where results of COVID-19 tests by reverse transcription-polymerase chain reaction (RdRp gene assay) from oral swabs have been documented since 16 March 2020 [14].
UK Biobank collected all data after written informed consent had been obtained from each participant. The study protocol had full ethical approval (16/NW/0274) from the National Research Ethics Service of the UK National Health Service. The present study was also approved (2020.661) by the biomedical research ethics committee of West China Hospital (Chengdu, China).
Ascertainment of psychiatric disorders and COVID-19
To align with our previous analyses of phenotypic association [6], we used an identical approach for ascertaining psychiatric disorders and COVID-19 in the present study. Briefly, we defined five broad diagnostic categories of psychiatric disorders (substance misuse, depressive disorders, anxiety disorders, psychotic disorders, and stress-related disorders) based on hospital admissions. The diagnosis of these disorders was according to the International Classification of Diseases, Tenth Revision (ICD-10), or ICD-9 codes in the UK Biobank inpatient hospital data (Additional file 1: Table S1) before 31 January 2020. Identification of COVID-19 was based on a positive test result from the PHE dataset, a diagnosis in the UK Biobank inpatient hospital data, or a cause of death based on the UK Biobank mortality data (ICD codes are shown in Additional file 1: Table S1), updated until 18 October 2021. COVID-19 cases from hospitalization records or death records were considered “severe cases.”
PRS for psychiatric disorders
We conducted genetic analyses of white British participants in UK Biobank who were registered in England at recruitment, alive, and trackable on 31 January 2020 and had available genetic data (n = 346,502). Standard quality control of GWAS was carried out. Briefly, we restricted our analyses to autosomal biallelic SNPs and removed variants with a call rate <98%, minor allele frequency <0.01, or deviation from Hardy–Weinberg equilibrium (p < 10−6). Then, related individuals, up to the third degree (i.e., kinship coefficient > 0.044) [15], were removed before GWAS based on the principle of prioritizing the inclusion of individuals with COVID-19. We also removed individuals having a genotyping rate < 98% and outlier samples based on their abnormal heterozygosity level, leaving 287,123 participants for further analysis. The details of the quality-control strategy are summarized in Additional file 1: Fig. S1.
We calculated the PRS for each studied psychiatric disorder based on two types of GWAS summary statistics. First, due to human heterogeneity (and therefore possible limited portability of the PRS between populations [16]) and to keep consistent ascertainment of psychiatric disorders between the present study and our previous study on phenotypic association [6], we undertook a GWAS by splitting the data from UK Biobank into a base dataset and target dataset (see the study design in Fig. 1). GWAS analyses of the base dataset (i.e., 50% of participants selected randomly from the study population) were performed after removal of all individuals with confirmed COVID-19 (n = 10,868) to avoid the influence of phenotypic associations (e.g., between depression and COVID-19) on the identification of the genetic background for the exposure trait (e.g., depression). Specifically, the associations between these phenotypes (as binary variables) and each eligible SNP were tested using logistic regression models adjusted for the covariates sex, birth year, genotyping array, and the top-10 PCs. Second, given that publicly available GWAS of the five specific psychiatric disorders have larger discovery samples, in the dataset that included all eligible and unrelated participants (n = 287,123), we also generated PRSs for these psychiatric disorders based on summary statistics from publicly available GWAS [17,18,19,20,21].
The PRS for each exposure trait was calculated as the weighted sum of the risk alleles based on the summary statistics (i.e., effect sizes and standard errors for the variants) of the GWAS results mentioned above using the least absolute shrinkage and selection operator (LASSO) approach, which allowed heavy shrinkage in the effect estimates of SNPs via regularization methods [22]. The hg19 genome of the European population was used as a reference panel to handle linkage disequilibrium (LD) [23].
Statistical analyses
First, we validated the predictability of the generated PRSs (i.e., PRS calculated based on GWAS from UK Biobank and that based on publicly available GWAS) on the corresponding category of psychiatric disorder in UK Biobank data using logistic regression models adjusted for sex, birth year (1936–1940, 1941–1945, 1946–1950, 1951–1955, 1956–1960, 1961–1965, and 1966–1970), genotyping array, and significant PCs for population heterogeneity (i.e., p < 0.05). The variance explained by the PRS was assessed as the difference in variance from the full model, including the PRS compared with the basic model, and represented as Nagelkerke’s R2.
Then, we examined the association between the PRS of a specific category of psychiatric disorder and the risk of COVID-19, as well as severe COVID-19, using odds ratios (ORs) with 95% confidence intervals (CIs) derived from logistic regression models adjusted for the covariates mentioned above. In addition to treating the standardized PRS as a continuous variable, we also divided participants into “low,” “moderate,” and “high” genetic risk groups based on the tertile distribution of the PRS and then compared the risk of COVID-19 outcomes using the group of low genetic risk (i.e., below the lower tertile of the PRS) as the reference. The degree of pleiotropy or genetic correlation between psychiatric disorders and COVID-19 was additionally estimated using cross-trait LD score regression based on publicly available GWAS summary statistics of both traits [17,18,19,20,21, 24].
In sensitivity analyses, all the analyses stated above were repeated after the application of the standard clumping + thresholding (C+T) approach, instead of the LASSO approach used in the main analyses, for PRS calculation. This method handles the shrinkage-of-effect size estimates through selectively including SNPs with a GWAS association p-value below a certain threshold (e.g., p < 5 × 10−8). Then, through clumping, only SNPs that were largely independent of each other were retained so that their effects could be summed. We computed the PRS under 10 p-value thresholds (i.e., 5 × 10−8, 1 × 10−6, 1 × 10−4, 1 × 10−3, 0.05, 0.1, 0.2, 0.3, 0.4, and 0.5).
PLINK (version 1.9) and R software (version 4.0) were used for GWAS and PRS calculations (“LASSO” package, version 0.4.4) [22]. LD score regression was done by LDSC software (version 1.0.1) [25]. Other analyses were carried out with R software (version 4.0), and p < 0.05 (two-sided) was considered significant.