Systematic reviews: a cross-sectional study of location and citation counts

Background: Systematic reviews summarize all pertinent evidence on a defined health question. They help clinical scientists to direct their research and clinicians to keep updated. Our objective was to determine the extent to which systematic reviews are clustered in a large collection of clinical journals and whether review type (narrative or systematic) affects citation counts.
Methods: We used hand searches of 170 clinical journals in the fields of general internal medicine, primary medical care, nursing, and mental health to identify review articles (year 2000). We defined 'review' as any full text article that was bannered as a review, overview, or meta-analysis in the title or in a section heading, or that indicated in the text that the intention of the authors was to review or summarize the literature on a particular topic. We obtained citation counts for review articles in the five journals that published the most systematic reviews.
Results: 11% of the journals published 80% of all systematic reviews. Impact factors were weakly correlated with the publication of systematic reviews (R² = 0.075, P = 0.0035). Systematic reviews received more citations (median 26.5, IQR 12 – 56.5) than narrative reviews (median 8, IQR 3 – 20; P < .0001 for the difference). Systematic reviews had twice as many citations as narrative reviews published in the same journal (relative citation rate 2.0, 95% confidence interval 1.5 – 2.7).
Conclusions: A few clinical journals published most systematic reviews. Authors cited systematic reviews more often than narrative reviews, an indirect endorsement of the 'hierarchy of evidence'.


Background
Evidence-based medicine (EBM) is the judicious and conscientious incorporation of the best available evidence into clinical decision making while considering the values and preferences of the patient [1]. This definition of EBM invokes a hierarchy of evidence. Systematic reviews of the literature occupy the highest position in currently proposed hierarchies of evidence [2]. It is argued that systematic reviews should occupy this top position because of two fundamental premises. First, clinical reviews systematically search, identify, and summarize the available evidence that answers a focused clinical question, with particular attention to methodological quality. Second, reviews that include a meta-analysis provide precise estimates of the association or the treatment effect. Clinicians can apply the results of meta-analyses to a wide array of patients (certainly wider than those included in each of the primary studies) who do not differ importantly from those enrolled in the primary studies.
Narrative reviews are summaries of research that lack an explicit description of a systematic approach. Despite the emerging dominance of systematic reviews, narrative reviews persist. A study by Antman et al. [3] found that narrative reviews, which frequently reflected the opinion of a single expert, lagged behind the evidence, disagreed with the existing evidence, and disagreed with other published expert opinions. Mulrow [4,5] and later McAlister et al. [4,5] found that these reviews lacked methods to limit the intrusion of bias in the summary or the conclusions.
Because of the importance of systematic reviews in summarizing the advances of health care knowledge, their number is growing rapidly. The Cochrane Collaboration, a world-wide enterprise to produce and disseminate systematic reviews of effectiveness, has published in excess of 1000 systematic reviews since its inception [6,7]. Collectively, other groups and individuals have likely contributed three to five times that number in the past 20 years and these reviews are dispersed throughout the medical literature [8]. Researchers wanting to define the frontier of current research and clinicians wanting to practice EBM should be able to reliably and quickly find all valid systematic reviews of the literature. Nonetheless, researchers have reported difficulty finding systematic reviews within the mass of biomedical literature represented in large bibliographic databases such as MEDLINE [9][10][11][12].
If systematic reviews in fact represent the best available evidence, they are likely to have great clinical importance. It follows that they should be cited often in the literature. The Institute for Scientific Information (ISI) impact factors reflect (albeit far from ideally [13]) the prestige of a journal and the importance of the articles it publishes. The impact factor for a journal is the number of citations to articles published in the journal in the past two years, divided by the number of articles published during that period. ISI also reports the number of citations that individual articles receive. Since the impact factor relates to the overall citation performance of the articles a journal publishes and not to any individual article type, and since systematic reviews are a relatively small fraction of all articles published in journals, we did not expect a strong association between impact factors and the frequency of publication of systematic reviews. However, we hypothesized that the number of citations for systematic reviews would be greater than the number of citations for a "look alike" article, in this case, a narrative review published in the same journal.
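The impact factor definition above is a simple ratio, and a minimal sketch may make it concrete. The function name, parameter names, and the example figures below are illustrative assumptions, not values from this study or from ISI.

```python
def impact_factor(citations_to_recent_articles: int, recent_articles: int) -> float:
    """Citations to a journal's articles from the past two years,
    divided by the number of articles it published in that period."""
    if recent_articles <= 0:
        raise ValueError("the journal must have published at least one article")
    return citations_to_recent_articles / recent_articles


# Hypothetical journal: 1200 citations to the 400 articles it published
# in the past two years.
example_impact_factor = impact_factor(1200, 400)  # 3.0
```

Because the ratio pools every article type a journal publishes, a handful of heavily cited systematic reviews would move it only slightly, which is the reasoning behind the weak association expected above.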
Thus, we sought to answer the following research questions: (1) Where are systematic reviews published? (2) What is the relation between journal impact factor and journal yield of systematic reviews? (3) Do systematic reviews receive more citations than narrative reviews?
Answers to our first question may lead to definition of journal subsets in MEDLINE within which most systematic reviews will reside. Answers to our second and third questions will indicate whether the literature reflects the hierarchy of evidence, one of the basic tenets of EBM.

Methods
The Hedges Team of the Health Information Research Unit (HIRU) at McMaster University is conducting an expansion and update of our 1991 work on search filters or 'hedges' to help clinicians, researchers, and policymakers harness high-quality and relevant information from MEDLINE [14]. We planned to conduct the present work within the larger context of the Hedges Project prior to the onset of data collection and analyses.

Journal selection
The editorial group at HIRU prepares four evidence-based medical journals, the ACP Journal Club, Evidence-based Medicine, Evidence-based Nursing, and, up to 2003, Evidence-based Mental Health. These journals help keep healthcare providers up-to-date. To produce these secondary journals, the editorial staff has identified 170 journals that regularly publish clinically relevant research in the areas of focus of these evidence-based journals (i.e., general internal medicine, family practice, nursing, and mental health). Journals evaluated for inclusion in this set were those with the highest Science Citation Index impact factors in each field and those that clinicians and librarians who collaborate with HIRU recommended based on their perceived yield of important papers. The editorial staff then monitors the yield of original studies and reviews of scientific merit and clinical relevance (criteria below) for each of these journals, to determine if they should be kept on the list or replaced with higher yielding nominated journals.

Study identification and classification
On an ongoing basis, six research associates review each of these journals and apply methodological criteria to each item to determine if the article is eligible for inclusion in the evidence-based publications. For the purpose of the Hedges Project (i.e., to develop search strategies for large bibliographic databases such as MEDLINE), we expanded the data collection effort and began intensive training and calibration of the research staff in 1999. In this manuscript, we report the κ statistic measuring chance-adjusted agreement between the six research associates for each classification procedure.
We reported the training and calibration process in detail elsewhere [15]. Briefly, prior to the first inter-rater reliability test, research staff met to develop the data collection form and a document outlining the coding instructions and category definitions using examples from the 1999 literature. Meetings involving the research staff revealed differences in interpretation of the definitions (early κ were as low as 0.54). Intensive discussion periods and practice sessions using actual articles were used to hone definitions and thus remove ambiguities (goal κ > 0.8). The six research associates received the same articles packaged with the data collection form and the instructions document (this document is available from the authors on request), and each independently and blindly reviewed each article and recorded their classification on the data collection form. We conducted three reliability tests during 1999. We conducted the fourth and final inter-rater reliability test approximately 14 months after the process had commenced using a sample of 72 articles randomly selected across the 170 journal titles. In calculating the κ statistic for methodological rigor, raters had to agree on the purpose category for the item to be included in the calculation (Table 1 describes the purpose categories and the criteria for methodological rigor for each one). We analyzed data using PC-agree (software code written by Richard Cook; maintained by Stephen Walter, McMaster University, Hamilton, Ontario, Canada).
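To illustrate what chance-adjusted agreement means, the following is a minimal sketch of Cohen's κ for two raters. It is an assumption made for illustration only: the study involved six raters and used the PC-agree software, whose exact multi-rater computation is not described here. The function name and the example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must classify the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is complete")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten reviews: "S" = systematic, "N" = narrative.
rater_1 = ["S", "S", "N", "N", "S", "N", "N", "N", "S", "N"]
rater_2 = ["S", "S", "N", "N", "N", "N", "N", "N", "S", "N"]
kappa = cohen_kappa(rater_1, rater_2)  # observed 0.90, expected 0.54
```

In this made-up example the raters agree on 9 of 10 items, yet κ is about 0.78 because more than half of that agreement would be expected by chance alone.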
For the purposes of the Hedges Project, we defined review as any full text article that was bannered as a review, overview, or meta-analysis in the title or in a section heading, or that indicated in the text that the intention of the authors was to review or summarize the literature on a particular topic [15]. To be considered a systematic review, the authors had to clearly state the clinical topic of the review, how the evidence was retrieved and from what sources (i.e., name the databases), and provide explicit inclusion and exclusion criteria. The absence of any one of these 3 characteristics would classify a review as a narrative review. The inter-rater agreement for this classification was almost perfect (κ = 0.92, 95% confidence interval 0.89 – 0.95).

Table 1: Purpose categories and criteria for methodological rigor*

Etiology
Definition: Content pertains directly to determining if there is an association between an exposure and a disease or condition. The question is "What causes people to get a disease or condition?"
Methodological rigor: Observations concerned with the relationship between exposures and putative clinical outcomes; data collection is prospective; clearly identified comparison group(s); blinding of observers of outcome to exposure.

Prognosis
Definition: Content pertains directly to the prediction of the clinical course or the natural history of a disease or condition with the disease or condition existing at the beginning of the study.
Methodological rigor: Inception cohort of individuals all initially free of the outcome of interest; follow-up of at least 80% of patients until occurrence of a major study end point or to the end of the study; analysis consistent with study design.

Diagnosis
Definition: Content pertains directly to using a tool to arrive at a diagnosis of a disease or condition.
Methodological rigor: Inclusion of a spectrum of participants; objective diagnostic reference standard OR current clinical standard for diagnosis; participants received both the new test and some form of the diagnostic standard; interpretation of the diagnostic standard without knowledge of test result and vice versa; analysis consistent with study design.

Treatment
Definition: Content pertains directly to an intervention for therapy (including adverse effects studies), prevention, rehabilitation, quality improvement, or continuing medical education.
Methodological rigor: Random allocation of participants to comparison groups; outcome assessment of at least 80% of those entering the investigation accounted for in 1 major analysis at any given follow-up assessment; analysis consistent with study design.

Economics
Definition: Content pertains directly to the economics of a healthcare issue with the economic question addressed being based on the comparison of alternatives.
Methodological rigor: Question is a comparison of the alternatives; alternative services or activities compared on outcomes produced (effectiveness) and resources consumed (costs); evidence of effectiveness must be from a study of real patients that meets the above-noted criteria for diagnosis, treatment, quality improvement, or a systematic review article; effectiveness and cost estimates based on individual patient data (micro-economics); results presented in terms of the incremental or additional costs and outcomes of one intervention over another; sensitivity analysis if there is uncertainty.

Clinical prediction guide
Definition: Content pertains directly to the prediction of some aspect of a disease or condition.
Methodological rigor: Guide is generated in one or more sets of real patients (training set); guide is validated in another set of real patients (test set).

* Other study categories included qualitative (studies that pertain directly to how people feel or experience certain situations using data collection methods and analyses appropriate for qualitative data) and a category 'something else' to include studies with a content that did not fit any of the above definitions.
Then, we classified all reviews by whether they were concerned with the understanding of healthcare in humans. Examples of studies that would not have a direct effect on patients or participants (and, thus, are excluded from analysis) include studies that describe the normal development of people; basic science; gender and equality studies in the health profession; or studies looking at research methodology issues. The inter-rater agreement for this classification was almost perfect (κ = 0.87, 95% confidence interval 0.89 – 0.96).
A third level of classification placed reviews in purpose categories (i.e., what question(s) are the investigators addressing) that we defined for the Hedges Project and included etiology (causation and safety), prognosis, diagnosis, treatment, economics, clinical prediction guides, and qualitative (Table 1) [15]. The inter-rater agreement for this classification was 81% beyond chance (κ = 0.81, 95% confidence interval 0.79 – 0.84).
A fourth level of classification graded reviews for methodological rigor, placing them in pass and fail categories. To pass, the review should include a statement of the clinical topic (i.e., a focused review question); explicit statements of the inclusion and exclusion criteria; a description of the search strategy and study sources (i.e., a list of the databases); and at least 1 included study that satisfied methodological rigor criteria for the purpose category (Table 1). For example, reviews of treatment interventions had to have at least one study with random allocation of participants to comparison groups and assessment of at least one clinical outcome. All narrative reviews were included in the fail category. We refer to systematic reviews that passed this methodological rigor evaluation as rigorous systematic reviews. Again, the inter-rater agreement for this classification was almost perfect (κ = 0.89, 95% confidence interval 0.78 – 0.99).
For this report, we retrieved data on review articles including a complete bibliographic citation (including journal title), the pass/fail methodological grade, and the review type (narrative or systematic review).

Impact factor and citation counts
To collect impact factor data for all 170 journals in the database, we used the ISI Journal Citation Reports (http://isiknowledge.com). We also queried the ISI Web of Science database to ascertain, as of February 2003, the number of citations to each one of the reviews in an arbitrary subset of five journals that published the most systematic reviews and are indexed journals in the ISI database.

Data analysis
Data were arrayed in frequency tables. We conducted nonparametric univariate analysis (Kruskal-Wallis) to assess the relationship between the number of citations and the type of review. We assessed the correlation between journal impact factor and citation counts. Then, using multiple linear regression, we determined the ability of the independent variables (methodological quality of the reviews and journal source) to predict the dependent variable, the number of citations (after log transformation). Thus, this analysis was stratified by journal to adjust not only for impact factor, but also for other journal-specific factors not captured by this measure.
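The journal-stratified regression on log-transformed citations can be sketched as follows. This is a minimal illustration under stated assumptions, not the software or exact model used in the study: it treats journal as a fixed effect by centering within each journal (which gives the same slope as including journal indicator variables), assumes every citation count is positive so the logarithm is defined, and uses a hypothetical function name and made-up data.

```python
import math
from collections import defaultdict


def relative_citation_rate(records):
    """records: iterable of (journal, is_systematic, citations), citations > 0.

    Returns exp(slope) from regressing log(citations) on review type with
    journal as a stratifying factor, i.e. the multiplicative difference in
    citations between review types published in the same journal.
    """
    by_journal = defaultdict(list)
    for journal, is_systematic, citations in records:
        by_journal[journal].append((1.0 if is_systematic else 0.0, math.log(citations)))

    sum_xy = 0.0
    sum_xx = 0.0
    for rows in by_journal.values():
        # Center both variables within the journal to remove journal effects.
        mean_x = sum(x for x, _ in rows) / len(rows)
        mean_y = sum(y for _, y in rows) / len(rows)
        for x, y in rows:
            sum_xy += (x - mean_x) * (y - mean_y)
            sum_xx += (x - mean_x) ** 2
    if sum_xx == 0:
        raise ValueError("no journal contains both review types")
    return math.exp(sum_xy / sum_xx)


# Made-up data: within each journal, systematic reviews have twice the citations.
example_records = [
    ("Journal A", True, 20), ("Journal A", True, 20),
    ("Journal A", False, 10), ("Journal A", False, 10),
    ("Journal B", True, 80),
    ("Journal B", False, 40), ("Journal B", False, 40),
]
rate = relative_citation_rate(example_records)  # 2.0 up to rounding
```

Exponentiating the slope is what turns a difference on the log scale into a ratio of citation counts, which is why a result from such a model can be read as "twice as many citations" rather than as an absolute difference.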

What journals publish systematic reviews?
For the year 2000, there were 60,330 articles published in the 170 journals, of which 26,694 were original research reports and 3,193 were review articles. Of the review articles, 768 (24%) were systematic reviews that passed methodological criteria. Of these (and some entered in more than one purpose category), 662 (63%) were about therapy, 308 (29%) were about causation and safety, 47 (4.4%) were about diagnosis, 22 (2.1%) were about prognosis, and 18 reviews were about economics, clinical prediction guides, and qualitative research. Table 2 shows the top 20 journals that published the largest number of systematic reviews that passed methodological criteria: 11% of all journals published 80% of all rigorous systematic reviews. Of these, the Cochrane Library published 56% of all rigorous systematic reviews. Within the 102 journals that published at least one rigorous systematic review, the median number of rigorous systematic reviews published per journal was 2 (interquartile range [IQR] 1 – 4; total range 1 – 427). Table 3 indicates the top five journals that published the most systematic reviews in 2000 by purpose category (therapy and diagnosis). Table 4 indicates the top five journals that published the most systematic reviews by audience (nursing and general medicine).
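A concentration figure of this kind (a small share of journals accounting for most reviews) can be computed by ranking journals by yield and accumulating their counts. The sketch below is illustrative only; the function name and the per-journal counts are hypothetical and are not the study data.

```python
def journals_needed_for_share(reviews_per_journal, share):
    """Smallest number of top-yielding journals that together publish
    at least `share` (between 0 and 1) of all reviews."""
    counts = sorted(reviews_per_journal, reverse=True)
    total = sum(counts)
    if total == 0 or not 0 < share <= 1:
        raise ValueError("need at least one review and a share in (0, 1]")
    running = 0
    for n_journals, count in enumerate(counts, start=1):
        running += count
        if running >= share * total:
            return n_journals
    return len(counts)


# Hypothetical yields for ten journals, including three that published none.
example_counts = [60, 25, 5, 4, 3, 2, 1, 0, 0, 0]
needed = journals_needed_for_share(example_counts, 0.8)  # 2 journals
fraction_of_journals = needed / len(example_counts)      # 0.2
```

Journals with no reviews stay in the denominator, matching the way the percentage above is expressed relative to all journals surveyed rather than only those that published a review.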

The relationship between journal impact factor and publication of systematic reviews
In the subset of 99 journals for which impact factor data were available, impact factor was significantly and weakly associated with the number of rigorous reviews published (R² = 0.075, P = 0.0035). The association was also significant and somewhat stronger in the subset of general medicine journals (No. rigorous systematic reviews = 2. + • impact factor; R² = 0.257, P = 0.0156) with all other clinical topic subsets being not significant (P > 0.05).
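For readers less familiar with R², the following minimal sketch fits a one-predictor least-squares line and reports the proportion of variance explained. The function name and the data points are hypothetical; they do not reproduce the study's regression or its coefficients.

```python
def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With a single predictor, R squared equals the squared correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical impact factors and numbers of rigorous reviews for four journals.
slope, intercept, r_squared = simple_linear_regression([1, 2, 3, 4], [1, 3, 2, 4])
```

An R² of 0.075, as reported above, means that impact factor accounts for under a tenth of the variation in the number of rigorous reviews a journal publishes, even though the association is statistically significant.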

Citation analyses
To conduct citation analysis, we identified the top five journals that published the most systematic reviews (Table 2). The Cochrane Library was excluded because ISI does not track citations for Cochrane reviews. In this subset, there were 172 narrative reviews and 99 systematic reviews, of which 82 were rigorous systematic reviews. For the rest of the analyses we considered the systematic reviews that did not meet methodological criteria (n = 17) in the same group as narrative reviews.
Rigorous systematic reviews were cited significantly more often (median 26.5, IQR 12 – 56.5) than narrative reviews (median 8, IQR 3 – 20; P < 0.0001). After stratifying for journal source, review type (narrative vs. rigorous systematic review) was an independent predictor of citation counts (R² = 0.257, P < 0.0001): a rigorous systematic review had, on average, twice the number of citations as a narrative review published in the same journal (relative citation rate 2.0, 95% confidence interval 1.5 – 2.7). There was no significant interaction between journal and review type.
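The unadjusted comparison of citation counts relies on the Kruskal-Wallis test, which compares groups by the ranks of their observations rather than the raw counts. The following is a minimal pure-Python sketch of the H statistic with the usual tie correction, plus the chi-square approximation for the two-group case; the function names and data are hypothetical, and the study's own software may differ in detail.

```python
import math


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction for two or more groups."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the mean of the ranks i+1 .. j they occupy.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        tie_size = j - i
        tie_term += tie_size ** 3 - tie_size
        i = j
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    rank_sum_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    return h / correction


def p_value_two_groups(h):
    """Chi-square approximation with 1 degree of freedom (two groups only)."""
    return math.erfc(math.sqrt(max(h, 0.0) / 2))


# Hypothetical citation counts for narrative and systematic reviews.
h_statistic = kruskal_wallis_h([1, 2, 3], [4, 5, 6])  # 27/7, about 3.86
p_value = p_value_two_groups(h_statistic)
```

A rank-based test suits citation counts because they are strongly right-skewed, so a few very highly cited reviews do not dominate the comparison.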

Main findings
Our study indicates that 11% of the 170 clinical journals we reviewed published more than 80% of all systematic reviews. Impact factor was a weaker predictor of citations than the methodological quality of the review. Among the five journals publishing most systematic reviews, and after stratifying by journal, the type of review (rigorous systematic vs. narrative) was independently associated with the number of citations. Thus, our findings are consistent with the priority given to systematic reviews in the hierarchy of evidence to support evidence-based clinical decisions.

Limitations and strengths of the research
Our research has some limitations. First, we did not determine the nature of the citations. That is, it is possible that certain citations pointed out a fatal flaw in the index paper. Second, of all the journals, the Cochrane Library provides the largest number of reviews. Unfortunately, the Cochrane Library is not an ISI indexed resource. Third, the New England Journal of Medicine had the highest impact factor, but no systematic reviews in 2000. Nevertheless, our results were statistically significant and did not lack statistical power. Furthermore, our results apply to most medical journals that publish systematic reviews (unlike the New England Journal of Medicine) in addition to reports using other study designs (unlike the Cochrane Library).
We did not set out to evaluate the impact of these reviews on clinical practice.
Our research has several strengths. The methods we used to ascertain the database and classify the records involved highly trained personnel, independent assessments, explicit definitions, third-party arbitration of differences between reviewers, and a large and complete database. To our knowledge, this is the first paper to describe where systematic reviews are most often published in a broad range of clinical journals. Also for the first time, we evaluated and demonstrated that rigorous systematic reviews were cited more often than less rigorous and narrative reviews in the subset of journals that publish most systematic reviews, even after adjusting for journal of publication (e.g., journal impact factor). Our results are consistent with another study that also documented a weak association between journal impact factor and the methodological quality of published studies [16].

Meaning of the research
We can only speculate about the causes of the maldistribution of rigorous systematic reviews among a few journals, since exploration of such causes was not an objective of our study. Journal policy and author preferences may contribute to this maldistribution. The lack of systematic reviews and meta-analyses published in the New England Journal of Medicine is evidence of the effect of journal policy. Other journals, such as The Journal of the American Medical Association, Lancet, The British Medical Journal, and Annals of Internal Medicine, have published articles about systematic review methodology and reporting, and enthusiastically publish rigorous reviews of clinical importance. Authors of such reviews, naturally, may prefer to submit their reviews to journals with large circulation and impact. The relative contributions of these sources to the observed maldistribution constitute hypotheses that remain to be tested.
Given that our research design does not support causal inferences, it is unwise to derive recommendations to journal editors based on our findings. We think that journal editors interested in publishing rigorous research should prefer systematic reviews over narrative reviews. Furthermore, our research generates the hypothesis that choosing systematic over narrative reviews may help increase a journal's impact factor. However, editors of traditional journals have other competing priorities that rely less on citation counts and more on popularity (e.g., attracting and maintaining readership, attracting advertising, and generating revenue), which may direct their choice of reviews to publish (i.e., if they perceive narrative reviews as easier to read and more attractive to their readership than systematic reviews and meta-analyses).

Future directions
Future research may refine citation counting to ascertain whether the citation is positive or negative. This work will also inform our development of MEDLINE search filters for identifying systematic reviews in that database, particularly through the generation of journal subsets within the database to expedite the search.

Conclusions
In summary, our report identifies for researchers and clinicians the journals that are most likely to publish rigorous reviews. Furthermore, rigorous systematic reviews are cited more often than narrative ones, an indirect endorsement of the hierarchy of evidence.