The normality assumption on between-study random effects was questionable in a considerable number of Cochrane meta-analyses
BMC Medicine volume 21, Article number: 112 (2023)
Abstract
Background
Studies included in a meta-analysis are often heterogeneous. Traditional random-effects models assume that the studies’ true effects follow a normal distribution, but it is unclear whether this critical assumption holds in practice. Violations of this between-study normality assumption could lead to problematic meta-analytical conclusions. We aimed to empirically examine whether this assumption is valid in published meta-analyses.
Methods
In this cross-sectional study, we collected meta-analyses available in the Cochrane Library with at least 10 studies and with between-study variance estimates > 0. For each extracted meta-analysis, we performed the Shapiro–Wilk (SW) test to quantitatively assess the between-study normality assumption. For binary outcomes, we assessed between-study normality for odds ratios (ORs), relative risks (RRs), and risk differences (RDs). Subgroup analyses based on sample sizes and event rates were used to rule out potential confounders. In addition, we obtained the quantile–quantile (Q–Q) plot of study-specific standardized residuals for visually assessing between-study normality.
Results
Based on 4234 eligible meta-analyses with binary outcomes and 3433 with non-binary outcomes, the proportion of meta-analyses that had statistically significant non-normality varied from 15.1 to 26.2%. RDs and non-binary outcomes led to more frequent non-normality issues than ORs and RRs. For binary outcomes, between-study non-normality was more frequently found in meta-analyses with larger sample sizes and event rates away from 0 and 100%. The agreement between two independent researchers who assessed normality based on Q–Q plots was fair to moderate.
Conclusions
The between-study normality assumption is commonly violated in Cochrane meta-analyses. This assumption should be routinely assessed when performing a meta-analysis. When it may not hold, alternative meta-analysis methods that do not make this assumption should be considered.
Background
The normality assumption is used in most meta-analytic methods [1,2,3,4], but this assumption could be questionable in practice [5]. Specifically, the normality assumption is typically involved in two levels of meta-analysis with random-effects models. At the within-study level, two-stage meta-analysis methods assume the observed summary statistics follow normal distributions with an underlying true mean [6, 7]. By the central limit theorem, this is generally valid if the sample sizes of individual studies are sufficiently large [5]. One-stage meta-analysis methods could avoid the normality assumption at this level by using exact distributions for outcome measures, such as the binomial likelihood for binary outcomes [8,9,10]. The validity of the within-study normality assumption could be affected by multiple factors, such as individual studies’ sample size, event probabilities of binary outcomes, and true distributions of continuous measures [11,12,13,14]. As such, this assumption needs to be evaluated on a case-by-case basis and is generally difficult to assess. If strong evidence indicates this assumption is violated, researchers should consider alternative meta-analysis methods (e.g., one-stage models) that do not make this assumption [8, 10, 15].
On the other hand, the normality assumption at the between-study level is typically required by the most commonly used one- and two-stage random-effects methods and is not guaranteed for large sample sizes by the central limit theorem. It assumes that the true effects of individual studies differ due to heterogeneity, and they follow a normal distribution with a mean of overall effect and variance of between-study heterogeneity [16]. This article will focus on this between-study normality assumption. Heterogeneity between studies is generally expected in meta-analyses because of the potential differences in baseline characteristics of populations, study locations, methods used by research teams, etc. [17, 18]. Although different studies’ underlying effects are conveniently modeled to have a normal distribution as a convention in the literature, this assumption should not be taken for granted [19, 20]. The presence of between-study normality depends on the choice of effect measures because effect measures are assumed exchangeable across studies [6, 21, 22], and the presence of outlying studies could make this exchangeability assumption questionable [23, 24].
Violations of the between-study normality assumption could lead to problematic meta-analytical conclusions [5, 25]. Although non-normality might not have substantial impacts on the point estimates, it could greatly affect the interval estimates [26]. For example, if the true between-study distributions are skewed, 95% confidence intervals (CIs) of the overall effect estimates produced by commonly used meta-analysis methods could have coverage away from the nominal level of 95% [19]. Such inaccuracy in coverage could greatly affect the conclusions about the significance of treatment effects. Moreover, a group of studies may share similar treatment effects but have substantially different effects from another group of studies. The between-study distribution could then be bimodal rather than normal, and it may be sensible to perform separate meta-analyses for different groups of studies instead of pooling all studies together [27]. Such non-normality challenges the generalizability of meta-analytic conclusions. In addition, non-normality caused by a few outlying studies could seriously bias meta-analytic results [23]. It is possible to remove evident outlying studies or subgroup certain studies with similar features if they are substantially different from other studies. However, the practice of removal or subgrouping might not be well-justified when it is not pre-specified in the protocol, as this could lead to “cherry-picking” favorable studies in a systematic review [28, 29].
Several methods have been proposed to test the between-study normality assumption [16, 30]. The fundamental idea is to construct study-specific standardized effect estimates, which are calculated as the differences between individual studies’ effect sizes and the overall effect size, divided by the marginal standard deviations. These standardized effect estimates are expected to be independently and identically distributed as standard normal variables. Consequently, approaches for assessing normality, such as the quantile–quantile (Q–Q) plot and statistical tests for normality, can describe the deviation from the between-study normality assumption visually and quantitatively.
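The two levels of normality and the standardized effect estimates described above can be written compactly as follows. The symbols \(y_i\), \(s_i^2\), \(\theta_i\), \(\mu\), and \(k\) are notation assumed here for illustration and are not taken from the article; the exact standardization used in [30] may differ in detail (e.g., in how the uncertainty of the estimated overall effect is handled).

```latex
% Within-study and between-study levels of the random-effects model
% (y_i: observed effect size of study i; s_i^2: its within-study variance)
y_i \mid \theta_i \sim N\left(\theta_i,\, s_i^2\right), \qquad
\theta_i \sim N\left(\mu,\, \tau^2\right), \qquad i = 1, \dots, k

% Marginally, y_i \sim N(\mu, s_i^2 + \tau^2), which motivates the
% study-specific standardized effect estimate
z_i = \frac{y_i - \widehat{\mu}}{\sqrt{s_i^2 + \widehat{\tau}^2}}
```

If both levels of normality hold, the \(z_i\) should behave approximately like draws from a standard normal distribution, which is what the Q–Q plot and the tests for normality examine.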
Considering the lack of proper assessment of the between-study normality assumption, this article empirically assesses this assumption using the Cochrane Library, a large database of systematic reviews. Our aims are threefold. First, we will use hypothesis testing to examine the proportions of Cochrane meta-analyses with a questionable between-study normality assumption. Second, for binary outcomes, we aim to compare the validity of the between-study normality assumption among three commonly used effect measures, i.e., odds ratios (ORs), relative risks (RRs), and risk differences (RDs). Third, we will construct Q–Q plots for assessing the between-study normality and evaluate the agreement between the visual assessments by independent researchers.
Methods
Datasets
This study used the Cochrane Library, a large database of systematic reviews and meta-analyses, which has been used in our previous work on assessing heterogeneity and small-study effects [18, 22, 31]. Specifically, the Cochrane Library publishes and records systematic reviews on a wide range of healthcare-related topics; it generally has better data quality than non-Cochrane reviews [32, 33]. We extracted the statistical data from all systematic reviews published from 2003 Issue 1 to 2020 Issue 1. Data withdrawn from the Cochrane reviews (which may be flawed or outdated) were excluded from our analyses. The detailed data collection procedures have been documented in our previous publications [31, 34].
Additional exclusion criteria were applied to the meta-analyses. First, like the assessment of small-study effects based on the funnel plot [35], the statistical power of tests may be too low for distinguishing true non-normality from chance in a meta-analysis containing few studies. Therefore, we excluded meta-analyses with fewer than 10 studies. Second, we employed the restricted maximum-likelihood (REML) method for the random-effects model in each meta-analysis [36]. When the REML algorithm for estimating the overall effect size failed to converge, we excluded the meta-analysis from our analysis. Third, the between-study normality cannot be assessed for homogeneous meta-analyses (\({\widehat{\tau }}^{2}=0\)), so these meta-analyses were also excluded.
We classified the eligible meta-analyses into those with binary outcomes and those with non-binary outcomes (such as continuous data, survival data, and outcomes reported as generic effect sizes). For both types of outcomes, we obtained the originally reported study-specific effect size and its standard error in each meta-analysis. The originally reported effect measures included the (log) OR, Peto OR, (log) RR, or RD for binary outcomes and the mean difference, standardized mean difference, and rate ratio (of count or survival data) for non-binary outcomes. For binary outcomes, we additionally extracted the counts of events and non-events in the treatment and control groups (i.e., 2 × 2 table) for each study.
Assessing the between-study normality assumption
We used the methods recently proposed by Wang and Lee [30] to assess the between-study normality assumption in the meta-analyses. Specifically, this assumption was assessed both visually and quantitatively. The visual assessment was based on the Q–Q plot of standardized effect estimates, and the quantitative assessment was based on the Shapiro–Wilk (SW) test for normality [37]. Considering the relatively low statistical power of tests for normality, we set the significance level to 0.1. This follows the conventions for handling underpowered tests that also occur in the assessments of heterogeneity and publication bias [38, 39], although we acknowledge that the choice of the significance level is debated broadly in scientific communities [40,41,42].
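As a rough illustration of the visual assessment, the sketch below computes standardized effect estimates and the coordinates of a normal Q–Q plot using only the Python standard library. It is not the authors' implementation: for simplicity, the between-study variance is estimated with the DerSimonian–Laird moment estimator [1] rather than REML, the function names are hypothetical, and the example data are made up.

```python
import math
from statistics import NormalDist


def dl_tau2(y, se):
    """DerSimonian-Laird moment estimate of the between-study variance.

    Assumption: used here in place of REML only to keep the sketch short.
    """
    w = [1.0 / s ** 2 for s in se]                      # inverse-variance weights
    sum_w = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    q_stat = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    return max(0.0, (q_stat - (len(y) - 1)) / c)        # truncated at zero


def standardized_estimates(y, se):
    """Return (overall effect, tau^2, standardized effect estimates)."""
    tau2 = dl_tau2(y, se)
    w = [1.0 / (s ** 2 + tau2) for s in se]             # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Difference from the overall effect divided by the marginal SD
    z = [(yi - mu) / math.sqrt(s ** 2 + tau2) for yi, s in zip(y, se)]
    return mu, tau2, z


def qq_points(z):
    """Pair sorted standardized estimates with standard normal quantiles."""
    n = len(z)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(z)))


# Hypothetical effect sizes (e.g., log ORs) and standard errors for 10 studies
y = [0.10, -0.25, 0.32, 0.05, 0.48, -0.10, 0.21, 0.15, 0.60, -0.02]
se = [0.12, 0.20, 0.15, 0.10, 0.25, 0.18, 0.14, 0.11, 0.30, 0.16]
mu, tau2, z = standardized_estimates(y, se)
for theo, obs in qq_points(z):
    print(f"{theo:7.3f} {obs:7.3f}")
```

Points lying close to the 45-degree line would be compatible with between-study normality. The list `z` could also be passed to a normality test; for instance, `scipy.stats.shapiro` (assuming SciPy is available) returns the SW statistic and its P-value.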
We applied the SW test to the originally reported effect sizes in each meta-analysis. If the resulting P-value was < 0.1, then the null hypothesis of normality between studies was rejected. We recorded the test results’ statistical significance. Additionally, for each meta-analysis with a binary outcome, we used the 2 × 2 tables to recalculate individual studies’ ORs, RRs, and RDs and applied the SW test to compare the normality assessments among these recalculated effect sizes. Of note, the ORs and RRs were analyzed on the logarithmic scale, as in the convention of meta-analyses.
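The recalculation from a 2 × 2 table can be sketched as follows, using the standard large-sample formulas for the log OR, log RR, and RD and their standard errors. The function name and the example counts are hypothetical. How studies with zero cells were handled is not described in this section, so the sketch simply rejects such tables.

```python
import math


def effects_from_2x2(events_t, nonevents_t, events_c, nonevents_c):
    """Return {measure: (estimate, standard error)} for one study.

    Assumption: all four cells are positive; zero cells are not handled here.
    """
    a, b, c, d = events_t, nonevents_t, events_c, nonevents_c
    if min(a, b, c, d) <= 0:
        raise ValueError("zero cells are not handled in this sketch")
    n_t, n_c = a + b, c + d
    p_t, p_c = a / n_t, c / n_c                     # event rates per arm

    log_or = math.log((a * d) / (b * c))
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    log_rr = math.log(p_t / p_c)
    se_log_rr = math.sqrt(1 / a - 1 / n_t + 1 / c - 1 / n_c)

    rd = p_t - p_c
    se_rd = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)

    return {
        "log_or": (log_or, se_log_or),
        "log_rr": (log_rr, se_log_rr),
        "rd": (rd, se_rd),
    }


# Hypothetical study: 15/100 events under treatment, 30/100 under control
for name, (est, se) in effects_from_2x2(15, 85, 30, 70).items():
    print(f"{name}: estimate = {est:.3f}, SE = {se:.3f}")
```

Each measure yields its own set of study-specific effect sizes and standard errors, so the same meta-analysis can be assessed three times, once per measure.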
Approximate proportion of truly non-normal meta-analyses
The above procedure gave the proportion of meta-analyses with significant non-normality by the SW test, denoted by \(q\). Because of type I and II errors, a P-value < 0.1 did not establish that the between-study normality was violated in a meta-analysis, nor did a P-value ≥ 0.1 establish that it held. Thus, \(q\) did not represent the proportion of truly non-normal meta-analyses, denoted by \(p\).
Based on the available information, we proposed a method to approximate \(p\) from \(q\) as follows. By conditional probabilities, the proportion of meta-analyses with significant non-normality should be \(q=p\cdot \mathrm{power}+\left(1-p\right)\cdot \alpha\), where \(\alpha\) is the type I error rate of 0.1, and the SW test’s power could be determined by the simulations in Wang and Lee [30]. The statistical power depends on many factors, including the number of studies in a meta-analysis and the true between-study distributions. There is no explicit formula to calculate this power; we used the empirical evidence from simulation studies by Wang and Lee [30] to impute the SW test’s power. Based on the foregoing observations, we approximated the proportion of truly non-normal meta-analyses as \(p=\left(q-\alpha \right)/\left(\mathrm{power}-\alpha \right)\). Here, we assumed that all meta-analyses were independent and shared the same power of the SW test. Although these assumptions are unrealistic, they could provide a rough proportion of truly non-normal meta-analyses for a possible range of power.
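The approximation above is simple arithmetic and can be sketched in a few lines. The function name is hypothetical, and truncating the result to the interval [0, 1] is an assumption added here for robustness, not something stated in the article. The illustrative value \(q=0.157\) is the proportion reported in the Results for binary outcomes.

```python
def approx_truly_nonnormal(q, power, alpha=0.1):
    """Approximate p from q = p * power + (1 - p) * alpha.

    q:     observed proportion of significant SW tests
    power: assumed common power of the SW test (must exceed alpha)
    alpha: significance level (type I error rate)
    """
    if not alpha < power <= 1.0:
        raise ValueError("power must exceed alpha and be at most 1")
    p = (q - alpha) / (power - alpha)
    return min(1.0, max(0.0, p))   # truncation is an added assumption


# Sensitivity of the approximation to the assumed power
for power in (0.3, 0.4, 0.5, 0.6):
    p = approx_truly_nonnormal(0.157, power)
    print(f"power = {power:.1f}: approximate p = {p:.3f}")
```

Because the denominator \(\mathrm{power}-\alpha\) is small when the test is underpowered, the approximation is sensitive to the assumed power, which is why a range of power values is considered rather than a single one.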
Subgroup analyses
Methods for assessing the between-study normality assume the within-study normality. This within-study normality assumption generally requires large sample sizes and event rates that are away from the boundary values of 0% and 100% [5]. Therefore, we conducted subgroup analyses by categorizing the meta-analyses by sample sizes (for both types of outcomes) and event rates (for binary outcomes only). In each subgroup, the meta-analyses were restricted to those with studies that meet a sample size threshold, which was set to 0, 10, …, and 100. Meta-analyses with binary outcomes were further categorized based on the crude event rate, which was calculated by dividing the total event count by the total sample size across studies. The thresholds of crude event rates were set to 0–100%, 1–99%, …, and 25–75%. Of note, we did not use two-dimensional analyses with a factorial design that would lead to too many subgroups. Instead, the subgroups were created by matching the 11 thresholds of crude event rates with the foregoing 11 thresholds of sample sizes accordingly; the within-study normality assumption was gradually more likely to hold in these subgroups.
Visual assessment of Q–Q plots
As the SW test has low statistical power for meta-analyses with a small or moderate number of studies, visual assessments of the normality based on Q–Q plots remain essential. Two authors (ZL and FMAA) independently performed visual assessments of the between-study normality in Q–Q plots of the originally reported effect sizes. To reduce workload, we focused on the meta-analyses with non-significant test results (P-values ≥ 0.1), i.e., when the SW test failed to detect non-normality. The two authors also assessed the Q–Q plots based on the (log) OR, (log) RR, and RD for each meta-analysis with a binary outcome.
To describe our visual assessment of normality, we set five tail scores (\(-2\), \(-1\), 0, 1, and 2) for tails in a Q–Q plot, representing an apparently light tail, slightly light tail, approximately normal tail, slightly heavy tail, and apparently heavy tail, respectively. Here, light and heavy tails were defined based on the normal distribution’s tails. A Q–Q plot with both light left and right tails implied a light-tailed distribution, that with both heavy left and right tails implied a heavy-tailed distribution, that with a heavy left tail and a light right tail implied a left-skewed distribution, and that with a light left tail and a heavy right tail implied a right-skewed distribution.
The normality assumption could also be affected by subgroup effects, where different subgroups may come from different distributions, leading to an overall multimodal distribution if the subgroups are inappropriately combined in the same meta-analysis. We set three mode scores (0, 1, and 2) for assessing the multimodal status, representing approximately unimodal, suspicious multimodal, and apparent multimodal distributions, respectively. Additional file 1: Figs. S1 and S2 give examples of Q–Q plots in different scenarios.
A meta-analysis was considered to approximately satisfy the between-study normality assumption only if both tail scores and the mode score of the visual assessment equaled 0 in a Q–Q plot. Cohen’s \(\kappa\) statistic was used to quantify the agreement between the visual assessments by the two authors [43]. We calculated Cohen’s \(\kappa\) statistics for two types of assessment: (I) all 5 × 5 × 3 = 75 categories for the 5 scores for the left tail, 5 scores for the right tail, and 3 scores for the multimodal status, and (II) 2 aggregate categories of normality (all scores equal 0) vs. non-normality (any score does not equal 0). The first type of assessment involves detailed scores evaluated by the two assessors, while the second type of assessment represents the goal of making a binary decision of whether the between-study normality assumption holds approximately.
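The two types of agreement can be illustrated with a short, standard-library sketch of the unweighted Cohen’s \(\kappa\). The function names and the four example ratings are hypothetical; each rating is assumed to be stored as a tuple (left tail score, right tail score, mode score).

```python
from collections import Counter


def cohens_kappa(ratings_1, ratings_2):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(ratings_1) != len(ratings_2) or not ratings_1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_1)
    # Observed agreement: share of items given identical categories
    p_obs = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    # Chance agreement from the raters' marginal category frequencies
    c1, c2 = Counter(ratings_1), Counter(ratings_2)
    p_exp = sum(c1[k] * c2[k] for k in c1) / n ** 2
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


def to_binary(rating):
    """Aggregate a (left, right, mode) rating: normal only if all scores are 0."""
    return "normal" if all(score == 0 for score in rating) else "non-normal"


# Hypothetical ratings of four Q-Q plots by two assessors
assessor_1 = [(0, 0, 0), (1, 0, 0), (0, 0, 0), (-1, 2, 1)]
assessor_2 = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (-1, 1, 1)]

kappa_detailed = cohens_kappa(assessor_1, assessor_2)      # 75-category version
kappa_binary = cohens_kappa([to_binary(r) for r in assessor_1],
                            [to_binary(r) for r in assessor_2])
print(f"detailed kappa = {kappa_detailed:.2f}, binary kappa = {kappa_binary:.2f}")
```

In this toy example the binary \(\kappa\) exceeds the detailed one, because two ratings that differ only in the degree of non-normality still agree once aggregated.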
Results
Characteristics of included meta-analyses
Additional file 1: Fig. S3 presents the flow chart of selecting the meta-analyses from the Cochrane Library. We collected a total of 107,140 meta-analyses, of which 64,929 had binary outcomes and 42,211 had non-binary outcomes. Among the 64,929 meta-analyses with binary outcomes, 6162 meta-analyses contained at least 10 studies. Based on their originally reported effect measures, 259 had convergence issues with the REML method, and 1669 had zero between-study variance estimates. As a result, 4234 meta-analyses were eligible for our analyses. Among the 4234 meta-analyses, 498 originally used ORs, 3340 used RRs, 32 used RDs, and the remaining used other effect measures such as Peto ORs. We recalculated the ORs, RRs, and RDs using the 2 × 2 tables; Table 1 presents the number of eligible meta-analyses based on the REML method’s convergence and the \({\widehat{\tau }}^{2}>0\) criterion using the recalculated ORs, RRs, and RDs for the 6162 meta-analyses with ≥ 10 studies.
For the 42,211 meta-analyses with non-binary outcomes, 4014 meta-analyses contained at least 10 studies, of which 101 had convergence issues with the REML method and 480 had zero between-study variance estimates. Thus, 3433 meta-analyses had \({\widehat{\tau }}^{2}>0\) based on the REML method.
Test results for originally reported effect measures
The overall proportion of meta-analyses of binary outcomes having significant non-normality between studies was 15.7% (95% CI, 14.6% to 16.8%) based on originally reported effect measures. The overall proportion of meta-analyses with non-binary outcomes having significant non-normality between studies was 26.2% (95% CI, 24.8% to 27.7%).
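The article does not state how these confidence intervals were computed. As a plausibility check only, the sketch below applies a simple Wald (normal approximation) interval to the reported proportion for binary outcomes; the choice of the Wald interval and the function name are assumptions made here.

```python
import math
from statistics import NormalDist


def wald_ci(proportion, n, level=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)          # about 1.96 for 95%
    half_width = z * math.sqrt(proportion * (1 - proportion) / n)
    return proportion - half_width, proportion + half_width


# Reported values for binary outcomes: 15.7% of 4234 eligible meta-analyses
lower, upper = wald_ci(0.157, 4234)
print(f"{100 * lower:.1f}% to {100 * upper:.1f}%")
```

For the binary outcomes, this agrees with the reported interval after rounding to one decimal place. Because the input proportion is itself rounded, small discrepancies in the last digit are possible for other reported intervals.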
We also calculated these proportions categorized by sample sizes and event rates, as shown in Fig. 1. For binary outcomes, the proportion with significant non-normality increased as the sample size increased and the event rate moved away from 0 and 100% (Fig. 1A). As the within-study normality assumption was more likely violated for smaller sample sizes and event rates close to 0% or 100%, this increasing trend implied that the potential violation of the within-study normality might confound the assessment of the between-study normality, possibly through the impact on the test power. In contrast, the proportions for non-binary outcomes were stable (Fig. 1B). This might be because most such meta-analyses used mean differences as effect measures, which converged quickly to normality within studies, even for moderate sample sizes, making the within-study normality assumption generally valid.
According to Wang and Lee [30], the statistical power of the SW test is higher for meta-analyses with more studies. In our analyses, the median number of studies in meta-analyses was 15; the simulation studies by Wang and Lee [30] indicated that the test’s power was about 30–60%. Based on these observations and the calculation in the methods section, Fig. 2 presents the approximated proportions of truly non-normal meta-analyses. When the power of the SW test changed from 30 to 60%, the proportion for binary outcomes roughly varied from 28 to 10%, and that for non-binary outcomes roughly varied from 80 to 30%. The proportion of truly non-normal meta-analyses had a wide range, but it sufficiently suggested that the non-normality issue occurred quite frequently, especially for non-binary outcomes.
Impact by effect measures for binary outcomes
For binary outcomes, we investigated how the choices of effect measures affected the assessment of the between-study normality. Based on the recalculated ORs, RRs, and RDs from 2 × 2 table data among all eligible meta-analyses (Table 1), the proportions of meta-analyses with significant non-normality for ORs, RRs, and RDs were 15.1% (95% CI, 14.0% to 16.2%), 15.2% (95% CI, 14.1% to 16.3%), and 21.8% (95% CI, 20.6% to 23.0%), respectively.
For the three effect measures, Fig. 3 presents the proportions of meta-analyses with significant non-normality subgrouped by sample sizes and event rates. The proportion for ORs varied from 15.1 to 29.0%, that for RRs varied from 15.2 to 26.3%, and that for RDs varied from 21.8 to 32.5%. The proportion of meta-analyses with significant non-normality for the recalculated RDs was lower than that based only on the 32 meta-analyses originally using the RD. This difference was likely because of sampling variability, as using all eligible meta-analyses led to much more precise results. Like the trend in Fig. 1A, the proportions were higher for larger sample sizes and event rates away from 0 and 100%. This again suggested that the within-study normality might not be valid for smaller study sample sizes or event rates closer to boundary values; this could affect the assessment of the between-study normality. Moreover, we approximated the proportions of truly non-normal meta-analyses when using ORs, RRs, and RDs (Fig. 4). The proportion for RDs varied in a wider range than those for the ORs and RRs.
Visual assessment based on Q–Q plots
Table 2 presents Cohen’s \(\kappa\) statistics of agreements on the visual assessment of Q–Q plots between the two independent assessors. All Q–Q plots and the two assessors’ scores can be accessed on the Open Science Framework [44]. Based on all 75 categories of tail scores and multimodal status scores, the \(\kappa\) statistics were 0.36 for meta-analyses with binary outcomes and 0.37 for those with non-binary outcomes. When only focusing on 2 aggregate categories of normality vs. non-normality, the \(\kappa\) statistics were 0.44 for meta-analyses with binary outcomes and 0.46 for those with non-binary outcomes. In general, these statistics implied fair to moderate agreements [45]. They did not differ much for different types of outcomes and different effect measures. The 2-category-based \(\kappa\) statistics were larger than the 75-category-based ones. This difference was expected because it was more likely to achieve an agreement on whether a Q–Q plot reflects normality (i.e., scatter points approximately on a straight line) than to have a consensus on the magnitudes of non-normality.
Discussion
In this study, we investigated the between-study normality assumption in random-effects meta-analyses based on a large-scale real-world dataset. Our findings suggested that the between-study normality assumption is questionable in a considerable number of Cochrane meta-analyses, although this assumption dominates the current meta-analytical practice.
We also found that the validity of the between-study normality assumption is relevant to the types of outcomes and effect measures. In general, between-study non-normality issues are less likely to occur with ORs and RRs than RDs and effect measures for non-binary outcomes. This is generally expected because RD values are bounded between \(-1\) and 1, so assuming them to follow a normal distribution may not be plausible. Researchers should carefully account for the exchangeability across studies when choosing the effect measure in a meta-analysis [6, 22, 46,47,48].
In addition, we evaluated the confounding effects of the within-study non-normality on assessing the between-study normality by subgroup analyses with restrictions on sample sizes and event rates. For binary outcomes, the subgroup analyses showed that the between-study non-normality occurred more frequently in meta-analyses with larger sample sizes and event rates away from the boundary values of 0% and 100%. In such cases, the within-study normality was more likely valid and possibly led to a larger power of the SW test. Restricting to large sample sizes within studies generally did not affect the assessment of the between-study normality for non-binary outcomes.
Our findings suggested that the Q–Q plot, as a visual tool, could be very subjective, as the agreement between two independent assessors was only fair to moderate. As statistical tests for normality have relatively low power, particularly when the number of studies is small [30], the Q–Q plot remains essential for assessing normality. Nevertheless, researchers should expect high uncertainties in the conclusions of visual assessments. Such conclusions should be evaluated and discussed with multiple assessors.
Considering that the between-study non-normality is a common issue, we have some recommendations as follows. First, if there are a sufficient number of studies (e.g., > 10) and heterogeneity likely exists between studies, researchers should validate the normality assumption for performing a random-effects meta-analysis. Second, if the between-study normality in a meta-analysis may not hold, researchers should explore potential clinical characteristics of included studies that might contribute to the non-normality. For example, based on the studies’ characteristics, researchers may consider subgroup analyses, meta-regressions, and sensitivity analyses that exclude outlying studies. Small-study effects could also lead to skewed between-study distributions, so methods that account for small-study effects may be used to examine if they might improve the normality [49]. Third, researchers should consider if it makes sense to assume the effect measure is exchangeable across studies [21, 22]. If not, they may try using other effect measures to examine whether the normality could be improved. Finally, researchers may consider alternative statistical meta-analytic methods that are robust to model misspecification [50,51,52,53], non- or semi-parametric methods [54,55,56], and exact models that do not require the within-study normality assumption [57,58,59]. If the between-study normality is evidently violated, the robust methods could produce less biased results, while they may sacrifice statistical power for finding true treatment effects. Figure 5 describes a framework of recommendations based on the assessments of heterogeneity and normality.
This study had several limitations. First, due to the nature of large-scale analyses, it was not feasible to investigate the non-normality on a case-by-case basis. For example, although we might identify multimodal patterns in the Q–Q plot of a meta-analysis, we did not further investigate if such patterns were caused by certain effect modifiers or some outlying studies. When the between-study normality is violated in a particular meta-analysis, we recommend exploring the potential causes of non-normality. Second, the statistical tests for non-normality have relatively low power, and many factors could affect the assessment of the between-study normality. Those factors may include type I and II error rates of the SW test, sample sizes, and event rates. Nevertheless, many other factors (e.g., publication bias) could not be accurately taken into account. Third, although the REML method is generally recommended for estimating the between-study variance [36], it could have convergence problems that lead to a loss of about 2.5–4.4% of meta-analysis samples and thus affect their representativeness. Fourth, our analyses were restricted to meta-analyses with at least 10 studies due to the relatively low power of statistical tests for normality. This restriction is similarly recommended when using statistical methods to assess small-study effects [35]. Nevertheless, meta-analyses with a small number of studies could also seriously suffer from non-normality issues, which were not investigated in the current study. Last, the Q–Q plots were assessed by two authors, who are well-trained statisticians. The \(\kappa\) statistics’ interpretations only represent the agreements between these two assessors, and they may not be generalizable to other systematic reviewers.
Conclusions
In conclusion, despite its popularity, the between-study normality assumption should not be taken for granted in meta-analyses. It needs to be carefully assessed; if it is evidently violated, alternative meta-analysis methods that do not make this assumption should be considered.
Availability of data and materials
The data are available upon reasonable request from the corresponding author.
Abbreviations
CI: Confidence interval
OR: Odds ratio
Q–Q plot: Quantile–quantile plot
RD: Risk difference
REML: Restricted maximum-likelihood
RR: Relative risk
References
DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7(3):177–88.
Brockwell SE, Gordon IR. A comparison of statistical methods for meta-analysis. Stat Med. 2001;20(6):825–40.
Jackson D, Riley R, White IR. Multivariate meta-analysis: potential and promise. Stat Med. 2011;30(20):2481–98.
Cheung MWL, Ho RCM, Lim Y, Mak A. Conducting a meta-analysis: basics and good practices. Int J Rheum Dis. 2012;15(2):129–35.
Jackson D, White IR. When should meta-analysis avoid making hidden normality assumptions? Biom J. 2018;60(6):1040–58.
Deeks JJ. Issues in the selection of a summary statistic for meta-analysis of clinical trials with binary outcomes. Stat Med. 2002;21(11):1575–600.
Lin L, Aloe AM. Evaluation of various estimators for standardized mean difference in meta-analysis. Stat Med. 2021;40(2):403–26.
Jackson D, Law M, Stijnen T, Viechtbauer W, White IR. A comparison of seven random-effects models for meta-analyses that estimate the summary odds ratio. Stat Med. 2018;37(7):1059–85.
Xu C, Furuya-Kanamori L, Lin L. Synthesis of evidence from zero-events studies: a comparison of one-stage framework methods. Res Synth Methods. 2022;13(2):176–89.
Simmonds MC, Higgins JPT. A general framework for the use of logistic regression models in meta-analysis. Stat Methods Med Res. 2016;25(6):2858–77.
Efthimiou O. Practical guide to the meta-analysis of rare events. Evid Based Ment Health. 2018;21(2):72–6.
Lin L. Bias caused by sampling error in meta-analysis with small sample sizes. PLoS ONE. 2018;13(9):e0204056.
Higgins JPT, White IR, Anzures-Cabrera J. Meta-analysis of skewed data: combining results reported on log-transformed or raw scales. Stat Med. 2008;27(29):6072–92.
Sun RW, Cheung SF. The influence of nonnormality from primary studies on the standardized mean difference in meta-analysis. Behav Res Methods. 2020;52(4):1552–67.
Rosenberger KJ, Chu H, Lin L. Empirical comparisons of meta-analysis methods for diagnostic studies: a meta-epidemiological study. BMJ Open. 2022;12(5):e055336.
Hardy RJ, Thompson SG. Detecting and describing heterogeneity in meta-analysis. Stat Med. 1998;17(8):841–56.
Higgins JPT. Commentary: heterogeneity in meta-analysis should be expected and appropriately quantified. Int J Epidemiol. 2008;37(5):1158–60.
Ma X, Lin L, Qu Z, Zhu M, Chu H. Performance of between-study heterogeneity measures in the Cochrane Library. Epidemiology. 2018;29(6):821–4.
Kontopantelis E, Reeves D. Performance of statistical methods for meta-analysis when true study effects are non-normally distributed: a simulation study. Stat Methods Med Res. 2012;21(4):409–26.
Rubio-Aparicio M, Marín-Martínez F, Sánchez-Meca J, López-López JA. A methodological review of meta-analyses of the effectiveness of clinical psychology treatments. Behav Res Methods. 2018;50(5):2057–73.
Takeshima N, Sozu T, Tajika A, Ogawa Y, Hayasaka Y, Furukawa TA. Which is more generalizable, powerful and interpretable in meta-analyses, mean difference or standardized mean difference? BMC Med Res Methodol. 2014;14(1):30.
Zhao Y, Slate EH, Xu C, Chu H, Lin L. Empirical comparisons of heterogeneity magnitudes of the risk difference, relative risk, and odds ratio. Syst Rev. 2022;11(1):26.
Viechtbauer W, Cheung MWL. Outlier and influence diagnostics for meta-analysis. Res Synth Methods. 2010;1(2):112–25.
Lin L, Chu H, Hodges JS. Alternative measures of between-study heterogeneity in meta-analysis: reducing the impact of outlying studies. Biometrics. 2017;73(1):156–66.
Blázquez-Rincón D, Sánchez-Meca J, Botella J, Suero M. Heterogeneity estimation in meta-analysis of standardized mean differences when the distribution of random effects departs from normal: a Monte Carlo simulation study. BMC Med Res Methodol. 2023;23(1):19.
Rubio-Aparicio M, López-López JA, Sánchez-Meca J, Marín-Martínez F, Viechtbauer W, Van den Noortgate W. Estimation of an overall standardized mean difference in random-effects meta-analysis if the distribution of random effects departs from normal. Res Synth Methods. 2018;9(3):489–503.
Sedgwick P. Meta-analyses: heterogeneity and subgroup analysis. BMJ. 2013;346:f4040.
Mayo-Wilson E, Li T, Fusco N, Bertizzolo L, Canner JK, Cowley T, Doshi P, Ehmsen J, Gresham G, Guo N, et al. Cherry-picking by trialists and meta-analysts can drive conclusions about intervention efficacy. J Clin Epidemiol. 2017;91:95–110.
Palpacuer C, Hammas K, Duprez R, Laviolle B, Ioannidis JPA, Naudet F. Vibration of effects from diverse inclusion/exclusion criteria and analytical choices: 9216 different ways to perform an indirect comparison meta-analysis. BMC Med. 2019;17(1):174.
Wang CC, Lee WC. Evaluation of the normality assumption in meta-analyses. Am J Epidemiol. 2020;189(3):235–42.
Lin L, Shi L, Chu H, Murad MH. The magnitude of small-study effects in the Cochrane Database of Systematic Reviews: an empirical study of nearly 30 000 meta-analyses. BMJ Evid Based Med. 2020;25(1):27–32.
Petticrew M, Wilson P, Wright K, Song F. Quality of Cochrane reviews is better than that of non-Cochrane reviews. BMJ. 2002;324(7336):545.
Büchter RB, Weise A, Pieper D. Reporting of methods to prepare, pilot and perform data extraction in systematic reviews: analysis of a sample of 152 Cochrane and non-Cochrane reviews. BMC Med Res Methodol. 2021;21(1):240.
Lin L, Chu H, Murad MH, Hong C, Qu Z, Cole SR, Chen Y. Empirical comparison of publication bias tests in meta-analysis. J Gen Intern Med. 2018;33(8):1260–7.
Sterne JAC, Sutton AJ, Ioannidis JPA, Terrin N, Jones DR, Lau J, Carpenter J, Rücker G, Harbord RM, Schmid CH, et al. Recommendations for examining and interpreting funnel plot asymmetry in metaanalyses of randomised controlled trials. BMJ. 2011;343: d4002.
Langan D, Higgins JPT, Jackson D, Bowden J, Veroniki AA, Kontopantelis E, Viechtbauer W, Simmonds M. A comparison of heterogeneity variance estimators in simulated randomeffects metaanalyses. Res Synth Methods. 2019;10(1):83–98.
Shapiro SS, Wilk MB. An analysis of variance test for normality (complete samples). Biometrika. 1965;52(3/4):591–611.
Sedgwick P. Metaanalyses: what is heterogeneity? BMJ. 2015;350: h1435.
Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L. Comparison of two methods to detect publication bias in metaanalysis. JAMA. 2006;295(6):676–80.
Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ, Berk R, Bollen KA, Brembs B, Brown L, Camerer C, et al. Redefine statistical significance. Nat Hum Behav. 2018;2(1):6–10.
Ioannidis JPA. The proposal to lower P value thresholds to .005. JAMA. 2018;319(14):1429–30.
Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567(7748):305–7.
Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46.
Betweenstudy normality in Cochrane metaanalyses. URL: https://osf.io/vzshp/.
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74.
Doi SA, FuruyaKanamori L, Xu C, Lin L, Chivese T, Thalib L. Controversy and Debate: Questionable utility of the relative risk in clinical research: Paper 1: A call for change to practice. J Clin Epidemiol. 2022;142:271–9.
Xiao M, Chen Y, Cole SR, MacLehose RF, Richardson DB, Chu H. Controversy and Debate: Questionable utility of the relative risk in clinical research: Paper 2: Is the Odds Ratio “portable” in metaanalysis? Time to consider bivariate generalized linear mixed model. J Clin Epidemiol. 2022;142:280–7.
Bakbergenuly I, Hoaglin DC, Kulinskaya E. Pitfalls of using the risk ratio in metaanalysis. Res Synth Methods. 2019;10(3):398–419.
Duval S, Tweedie R. A nonparametric “trim and fill” method of accounting for publication bias in metaanalysis. J Am Stat Assoc. 2000;95(449):89–98.
Maier M, Bartoš F, Wagenmakers EJ. Robust Bayesian metaanalysis: addressing publication bias with modelaveraging. Psychol Methods. 2022:In press.
Chen Y, Hong C, Ning Y, Su X. Metaanalysis of studies with bivariate binary outcomes: a marginal betabinomial model approach. Stat Med. 2016;35(1):21–40.
Wang Y, Lin L, Thompson CG, Chu H. A penalization approach to randomeffects metaanalysis. Stat Med. 2022;41(3):500–16.
Henmi M, Copas JB. Confidence intervals for random effects metaanalysis and robustness to publication bias. Stat Med. 2010;29(29):2969–83.
Doi SAR, Barendregt JJ, Khan S, Thalib L, Williams GM. Advances in the metaanalysis of heterogeneous clinical trials I: the inverse variance heterogeneity model. Contemp Clin Trials. 2015;45:130–8.
Burr D, Doss H. A Bayesian semiparametric model for randomeffects metaanalysis. J Am Stat Assoc. 2005;100(469):242–51.
Karabatsos G, Talbott E, Walker SG. A Bayesian nonparametric metaanalysis model. Res Synth Methods. 2015;6(1):28–44.
Chu H, Nie L, Chen Y, Huang Y, Sun W. Bivariate random effects models for metaanalysis of comparative studies with binary outcomes: methods for the absolute risk difference and relative risk. Stat Methods Med Res. 2012;21(6):621–33.
Smith TC, Spiegelhalter DJ, Thomas A. Bayesian approaches to randomeffects metaanalysis: a comparative study. Stat Med. 1995;14(24):2685–99.
Tian L, Cai T, Pfeffer MA, Piankov N, Cremieux PY, Wei LJ. Exact and efficient inference procedure for metaanalysis and its application to the analysis of independent 2 × 2 tables with all available data but without artificial continuity correction. Biostatistics. 2009;10(2):275–81.
Acknowledgements
Not applicable.
Funding
This study was supported in part by the National Institute of Mental Health grant R03 MH128727 (LL) and the National Library of Medicine grant R01 LM012982 (LS and LL). The content is solely the responsibility of the authors and does not necessarily represent the official views of the US National Institutes of Health.
Contributions
ZL: conceptualization, methodology, formal analysis, investigation, writing—original draft, visualization. FMAA: conceptualization, methodology, formal analysis, investigation. MX: writing—review and editing. CX: writing—review and editing. LFK: writing—review and editing. HH: writing—review and editing. LS: writing—review and editing, funding acquisition. LL: conceptualization, methodology, validation, data curation, writing—original draft, writing—review and editing, supervision, funding acquisition. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Ethics approval and consent to participate were not required for this study because it investigated statistical properties using published data in the existing literature.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1: Figure S1. Examples of Q–Q plots in four tail scenarios with respect to the normal distribution’s tails. Figure S2. Examples of Q–Q plots in unimodal and bimodal scenarios. Figure S3. Flow chart of selecting the meta-analyses from the Cochrane Library.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Liu, Z., Al Amer, F.M., Xiao, M. et al. The normality assumption on between-study random effects was questionable in a considerable number of Cochrane meta-analyses. BMC Med 21, 112 (2023). https://doi.org/10.1186/s12916-023-02823-9
DOI: https://doi.org/10.1186/s12916-023-02823-9
Keywords
 Cochrane Library
 Effect measure
 Heterogeneity
 Meta-analysis
 Normality assumption
 Q–Q plot