Predicting streptococcal pharyngitis in adults in primary care: a systematic review of the diagnostic accuracy of symptoms and signs and validation of the Centor score

Background: Stratifying patients with a sore throat into the probability of having an underlying bacterial or viral cause may be helpful in targeting antibiotic treatment. We sought to assess the diagnostic accuracy of signs and symptoms and validate a clinical prediction rule (CPR), the Centor score, for predicting group A β-haemolytic streptococcal (GABHS) pharyngitis in adults (> 14 years of age) presenting with sore throat symptoms.
Methods: A systematic literature search was performed up to July 2010. Studies that assessed the diagnostic accuracy of signs and symptoms and/or validated the Centor score were included. For the analysis of the diagnostic accuracy of signs and symptoms and the Centor score, studies were combined using a bivariate random effects model, while for the calibration analysis of the Centor score, a random effects model was used.
Results: A total of 21 studies incorporating 4,839 patients were included in the meta-analysis on diagnostic accuracy of signs and symptoms. The results were heterogeneous and suggest that individual signs and symptoms generate only small shifts in post-test probability (range positive likelihood ratio (+LR) 1.45-2.33, -LR 0.54-0.72). As a decision rule for considering antibiotic prescribing (score ≥ 3), the Centor score has reasonable specificity (0.82, 95% CI 0.72 to 0.88) and a post-test probability of 12% to 40% based on a prior prevalence of 5% to 20%. Pooled calibration shows no significant difference between the numbers of patients predicted and observed to have GABHS pharyngitis across strata of Centor score (0-1 risk ratio (RR) 0.72, 95% CI 0.49 to 1.06; 2-3 RR 0.93, 95% CI 0.73 to 1.17; 4 RR 1.14, 95% CI 0.95 to 1.37).
Conclusions: Individual signs and symptoms are not powerful enough to discriminate GABHS pharyngitis from other types of sore throat. The Centor score is a well calibrated CPR for estimating the probability of GABHS pharyngitis. The Centor score can enhance appropriate prescribing of antibiotics, but should be used with caution in low prevalence settings of GABHS pharyngitis such as primary care.


Background
Upper respiratory tract infections such as acute pharyngitis represent a substantial portion of the cases seen in primary care [1]. Although the cause of acute pharyngitis in the majority of patients is viral, approximately 5% to 17% is caused by a bacterial infection, often β-haemolytic streptococci [2]. A number of serotypes of β-haemolytic streptococci can cause pharyngitis in humans; however, antibiotics are only recommended in US and UK guidelines for treating patients with group A β-haemolytic streptococcal (GABHS) pharyngitis [3,4]. Antibiotics reduce the risk of complications (for example, peritonsillar abscess, bacteraemia, acute glomerulonephritis and rheumatic fever), as well as reducing the duration of symptoms and spread of the disease [5][6][7].
Throat cultures are currently considered to be the 'reference standard' for the diagnosis of streptococcal pharyngitis [8,9]. This test has a number of limitations in practice: it is relatively expensive; the laboratory tests take 1-2 days, leading to delays in starting treatment; and excessive false positive results in asymptomatic pharyngeal carriers may lead to overtreatment [10,11]. To enhance the appropriate prescribing of antibiotics without performing cultures on all patients, a number of clinical prediction rules (CPRs) have been developed over the last 40 years to distinguish streptococcal pharyngitis from pharyngitis by other causes [12][13][14][15]. CPRs are evidence-based tools that allow clinicians to stratify patients according to their probability of having a particular disorder. They can also be used to provide a rational basis for treatment.
The most widely recognised CPR for GABHS pharyngitis is the Centor score [16]. The Centor score consists of four signs and symptoms (Table 1) and is recommended in clinical guidelines from the American College of Physicians-American Society of Internal Medicine (ACP/ASIM) and Centers for Disease Control and Prevention (CDC) in the US. The ACP/ASIM recommends (a) empirical antibiotic treatment of adults with at least three of four Centor criteria and no treatment for all others; or (b) empirical treatment of adults with all four criteria, rapid antigen detection test (RADT) of patients with three or two criteria, and subsequent treatment of those with positive test results and no treatment for all others [17]. In the UK, the National Institute for Health and Clinical Excellence (NICE) recommend that clinicians consider immediate treatment with antibiotics for patients who have three or more Centor criteria [4]. A modified version of the Centor criteria is also used in New Zealand as part of a guideline for sore throat management [14,18].
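The two ACP/ASIM strategies described above amount to a small decision rule, which can be sketched as follows. This is an illustrative sketch only: Table 1 is not reproduced in this text, so the four criteria are written here as they are commonly cited (tonsillar exudate, tender anterior cervical adenopathy, history of fever, absence of cough), and all class, field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Presence of each Centor criterion (field names are illustrative)."""
    tonsillar_exudate: bool
    tender_anterior_cervical_adenopathy: bool
    fever_history: bool
    cough_absent: bool


def centor_score(f: Findings) -> int:
    """One point per criterion present, giving a score from 0 to 4."""
    return sum([
        f.tonsillar_exudate,
        f.tender_anterior_cervical_adenopathy,
        f.fever_history,
        f.cough_absent,
    ])


def acp_asim_strategy_a(score: int) -> str:
    """Strategy (a): empirical treatment at >= 3 criteria, none otherwise."""
    return "treat empirically" if score >= 3 else "no antibiotics"


def acp_asim_strategy_b(score: int) -> str:
    """Strategy (b): treat at 4 criteria, test at 2 or 3, none otherwise."""
    if score == 4:
        return "treat empirically"
    if score in (2, 3):
        return "RADT; treat only if positive"
    return "no antibiotics"


patient = Findings(True, True, False, True)
score = centor_score(patient)
print(score, acp_asim_strategy_a(score), "|", acp_asim_strategy_b(score))
```

The same patient can be managed differently under the two strategies: with three criteria, strategy (a) treats empirically while strategy (b) first requires a positive RADT.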
The pretest probability of GABHS pharyngitis is reported to peak between the ages of 5 and 10 years [15]. The prevalence in children is reported to be around 20% to 25%, while in adults it is between 5% and 10% [12]. This review will focus on adults (≥ 15 years of age), the age group of the cohort in which the Centor score was derived.
Although a considerable amount of research has already been devoted to streptococcal pharyngitis, it remains unclear which symptoms and signs have the most discriminatory power and whether the most widely recognised rule, the Centor score, is valid in a range of clinical settings. The aim of this systematic review was to analyse the current evidence on the usefulness of individual signs and symptoms in assessing the risk of streptococcal pharyngitis in adults, to assess the diagnostic accuracy of the Centor score as a decision rule for antibiotic treatment (discrimination analysis) and to perform a meta-analysis on validation studies of the Centor score (calibration analysis).

Data sources and searches
An electronic search was performed using a search filter developed by Haynes et al. [19,20]

Table 1 (note): Patients receive a point for the presence or absence of signs and symptoms. Each patient is assigned a score between 0 and 4 which is associated with a post-test probability as calculated by Centor and colleagues [16]. (The post-test probability values presented are the mean of the original probability intervals reported by Centor and colleagues.)

Study selection
Two investigators (JA and KOB) independently evaluated the title, abstract and subsequently full text of all articles for inclusion and any disagreements were resolved by discussion with a third investigator (WSC). Studies were included if participants were recruited upon first presentation from an ambulatory care setting, had a sore throat as their main presenting complaint, and were ≥ 15 years of age. Both prospective and retrospective studies were included in the review. Each included study assessed the diagnostic accuracy of signs and symptoms and/or validated the Centor score. The reference standard for all studies was a throat culture. If this information was not available in publications, data were sought from corresponding authors. The majority of studies separated positive results for group A β-haemolytic streptococcal infection from non-group A infection (mostly group C and G). Patients who were positive with a non-group A streptococcal infection were counted as negatives when the data were pooled. Additional file 1 has more information on the reported proportions of non-group A infection.

Data extraction and quality assessment
Data were extracted by two investigators (JA and KOB) independently, and any discrepancies were resolved by discussion.
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool was used to assess the quality of each included study [21]. This tool was modified to ensure appropriateness to this study. Items 3, 4, 6, 7, 12 and 13 were omitted from the original QUADAS tool as they were not relevant to this study and four questions extracted from other reviews were added; 'Was the hypothesis clearly defined', 'Were the patients selected in a non-biased manner', 'Were the statistical tests for the main outcomes adequate' and 'Were data on observer variation reported and within acceptable range' (Figure 1) [22,23]. Quality assessment was performed independently by three researchers (JA, KOB and WSC). Each article was assessed by at least two researchers with disagreements resolved by review and discussion with the third researcher.

Data synthesis and analysis

Diagnostic accuracy of signs and symptoms
Data were extracted and 2 × 2 tables constructed for the following signs and symptoms: (i) absence of cough, (ii) fever, (iii) anterior cervical adenopathy, (iv) tender anterior cervical adenopathy, and (v) any exudates (either tonsillar exudate or pharyngeal exudate or any exudate). Although some studies examined other signs and symptoms, those chosen for inclusion in this diagnostic test accuracy study were the most consistently studied signs and symptoms. Review Manager v.5.0.16 [24] and a bivariate random effects model [25] were used to analyse the extracted data. The analysis consisted of (a) summary sensitivities and specificities calculated for each sign and symptom, (b) positive and negative likelihood ratios and (c) summary receiver operating characteristic (SROC) curves. The bivariate random effects model accounts for the bivariate nature of sensitivity and specificity as well as the within-study and between-study variability [25]; as this approach is not available in Review Manager v.5.0.16, the Stata package metandi [26] was used for this part of the analysis.
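For a single study, the quantities derived from each 2 × 2 table follow directly from the cell counts. The sketch below is illustrative only: the counts are hypothetical (not taken from any included study), the function name is an assumption, and it does not reproduce the bivariate random effects pooling performed with metandi.

```python
def accuracy_from_2x2(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity and likelihood ratios from one 2 x 2 table.

    tp/fn: culture-positive patients with/without the sign
    fp/tn: culture-negative patients with/without the sign
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": sensitivity / (1 - specificity),
        "lr_neg": (1 - sensitivity) / specificity,
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
result = accuracy_from_2x2(tp=57, fp=26, fn=43, tn=74)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

A +LR above 1 raises the probability of disease when the sign is present, and a -LR below 1 lowers it when the sign is absent; values close to 1 shift the probability very little.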

Diagnostic accuracy of the Centor score
As the Centor score is recommended by guidelines as a decision aid for empirical antibiotic use [4,17], we explored the diagnostic accuracy of the score at different cut points. In all, 12 studies were included in this analysis [14,[27][28][29][30][31][32][33][34][35][36][37]; 3 studies were excluded from this analysis as they excluded patients with a Centor score less than 2 [38][39][40]. The analysis consisted of (a) summary sensitivities and specificities and (b) positive and negative likelihood ratios, calculated using a bivariate random effects model (using the Stata package metandi [26]). Post-test probabilities are presented for the Centor score at a range of pretest probabilities.

Calibration of the Centor score
We assessed calibration of the Centor score across four levels (0-1, 2, 3 and 4). Calibration enables visual and quantitative assessment of how well a CPR performs across different levels of risk [41]. The predicted number of patients with GABHS pharyngitis (based on the probability calculated in the derivation study [16], Table 1) was compared with the observed number of patients with GABHS pharyngitis in each validation study. The data were pooled and analysed using a Mantel-Haenszel random effects model and risk ratios (RRs) reported. The total heterogeneity across studies was quantified using the I² index. The Centor score data were analysed in groups (score 0-1, 2-3 and 4) as the ACP/ASIM guidelines recommend treatment options on the basis of these categories [17]. In the majority of studies, data were available for all score categories; the predicted number was calculated for each score category and the results added together to form the group data (0-1, 2-3). For example, Atlas et al. [27] reported 11 patients had a score of 0 and 44 had a score of 1. We calculated the number predicted to have GABHS pharyngitis based on the probabilities reported in Table 1: 11 × 2.5% + 44 × 6.5% = 0.275 + 2.86 = 3.135. In one case [29], data were only available for the score groups (0-1, 2-3); in this case the mean post-test probability for the group (for example, for 0-1, the mean of 2.5% and 6.5% = 4.5%) was used to calculate the predicted number.
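The predicted-count arithmetic in the Atlas et al. example can be sketched in a few lines. Only the probabilities for scores 0 and 1 (2.5% and 6.5%) are stated in the text, so only those appear here; the function and variable names are illustrative, and the grouped example uses a hypothetical patient count.

```python
def predicted_events(counts_by_score: dict, prob_by_score: dict) -> float:
    """Expected number of GABHS-positive patients, summed over score levels."""
    return sum(n * prob_by_score[score] for score, n in counts_by_score.items())


# Derivation-study probabilities quoted in the text for scores 0 and 1.
derivation_prob = {0: 0.025, 1: 0.065}

# Atlas et al. [27]: 11 patients with a score of 0 and 44 with a score of 1.
atlas_counts = {0: 11, 1: 44}
print(round(predicted_events(atlas_counts, derivation_prob), 3))

# When only grouped counts are available, the mean group probability is used.
group_prob_0_1 = (derivation_prob[0] + derivation_prob[1]) / 2
hypothetical_group_count = 55  # illustrative count, not taken from study [29]
print(round(hypothetical_group_count * group_prob_0_1, 3))
```

Summing per-score predictions and applying the mean group probability give different totals unless the two score levels contain equal numbers of patients, which is why the per-score calculation was used wherever the data allowed.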
We carried out a subgroup analysis to discover the influence of disease prevalence on the performance of the score (cut point 17.1% prevalence as in the derivation study [16]). Poses and colleagues suggested the use of the likelihood ratio formulation of Bayes' theorem to adjust for prevalence [42]. In our review, the method of Poses et al. was applied to the meta-analysis data and the effect on the results is discussed.
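One way to read the likelihood ratio formulation of Bayes' theorem is sketched below: the derivation-study probability at a given score is converted to a likelihood ratio using the derivation prevalence (17.1%), and that ratio is then applied to the prevalence of the validation setting. This is an assumption about the general approach, not a reproduction of the exact procedure of Poses et al. [42]; the function names and the 10% local prevalence are illustrative.

```python
def to_odds(p: float) -> float:
    return p / (1 - p)


def to_prob(odds: float) -> float:
    return odds / (1 + odds)


def adjust_for_prevalence(p_derivation: float,
                          prev_derivation: float,
                          prev_local: float) -> float:
    """Re-express a score-specific probability at a different prevalence.

    The likelihood ratio implied by the derivation study is
    post-test odds / pretest odds; it is assumed to carry over unchanged
    to the new setting.
    """
    lr = to_odds(p_derivation) / to_odds(prev_derivation)
    return to_prob(lr * to_odds(prev_local))


# Score 1 carried a 6.5% probability at the derivation prevalence of 17.1%.
adjusted = adjust_for_prevalence(0.065, prev_derivation=0.171, prev_local=0.10)
print(f"{adjusted:.3f}")
```

When the local prevalence equals the derivation prevalence the function returns the original probability unchanged, which is a quick sanity check on the formulation.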
The Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) statement was followed during the course of this study [43].

Study identification
A flow diagram of our search strategy is presented in Figure 2. Two researchers screened all potential articles. They agreed that the full text of 58 articles should be examined. In all, 35 relevant studies were identified, 18 of which included only adults whilst the other 17 included both adults and children. Only 4 of the 18 adult-only studies reported all required data [16,28,39,44]. The authors of the remaining adult papers were contacted for additional data. In all, 13 authors responded and 8 studies were subsequently included [13,27,29,37,38,40,45,46]. After writing to the authors of mixed adult and children papers, 13 responses were received and the data for adults only were included for 8 of these studies [14,[31][32][33][34][35][36]47]. Data from one thesis were included in the analysis and were obtained through a personal communication [30].

Study quality
The result of the quality assessment is shown in Figure 1. The overall quality of the included studies was good. The spectrum of patients was generally appropriate and representative of the patients who would receive the test in practice, the selection criteria were stated and the signs and symptoms being studied were generally clearly described. Quality items on test and diagnostic review bias scored well. This was due to the result of the throat culture being unknown at the time of the first visit when the signs and symptoms were recorded, and the throat culture being analysed by an independent laboratory. Observer variation in assessing signs and symptoms (question 12) was poorly reported.

Diagnostic accuracy of individual symptoms and signs
The sensitivity, specificity, positive likelihood ratio (+LR) and -LR are reported in Table 2. Absence of cough and tender cervical adenopathy had a higher sensitivity than specificity (sensitivity 0.74, specificity 0.49 and sensitivity 0.67, specificity 0.59 respectively), while fever and any exudates had a higher specificity than sensitivity (sensitivity 0.50, specificity 0.70 and sensitivity 0.57, specificity 0.74 respectively). 'Any exudates' had the highest +LR (2.20), suggesting it raises the probability of disease by 15% to 20% when present [48]. Absence of cough and tender anterior cervical adenopathy both decrease the likelihood of GABHS pharyngitis by 15% to 20% when absent.
A summary ROC curve for signs and symptoms is presented in Figure 3. The curve and point estimate are presented because a strong negative correlation was found between sensitivity and specificity for some of the signs and symptoms, suggesting the presence of an implicit threshold effect. The ROC curves are overlapping, suggesting that each of the individual signs and symptoms included in the analysis has a similar, relatively low ability to discriminate GABHS pharyngitis patients from other patients presenting with a sore throat.

Diagnostic test accuracy of the Centor score
Summary estimates for the four levels of Centor categories show (as expected) increasing specificity and diminishing sensitivity with higher scores (Table 3). When ≥ 3 signs or symptoms are present (the recommended cut-off point for empirical antibiotic treatment according to the ACP/ASIM guidelines), the Centor score has a specificity of 0.82 and a sensitivity of 0.49 and raises the probability of GABHS in absolute terms by 17% in situations of intermediate pretest probability (pretest probability 15%) (Table 4) [48]. Based on the pooled results, Table 4 shows the post-test probability of GABHS pharyngitis for a range of pretest probabilities. If clinicians estimate the prevalence of GABHS pharyngitis in their area, this table can be used to find the corresponding post-test probability of GABHS.
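The step from pretest to post-test probability can be sketched with the odds form of Bayes' theorem, using the pooled sensitivity (0.49) and specificity (0.82) quoted above for a score of ≥ 3. This is a minimal illustration with a hypothetical function name; the values in Table 4 come from the pooled likelihood ratios of the bivariate model and may differ slightly from this direct calculation.

```python
def post_test_probability(pretest: float,
                          sensitivity: float,
                          specificity: float,
                          positive: bool = True) -> float:
    """Post-test probability via likelihood ratios and odds."""
    if positive:
        lr = sensitivity / (1 - specificity)   # +LR: rule met
    else:
        lr = (1 - sensitivity) / specificity   # -LR: rule not met
    pre_odds = pretest / (1 - pretest)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


# Centor score >= 3 at a range of pretest probabilities.
for pretest in (0.05, 0.10, 0.15, 0.20):
    post = post_test_probability(pretest, sensitivity=0.49, specificity=0.82)
    print(f"pretest {pretest:.0%} -> post-test {post:.1%}")
```

At a pretest probability of 15% this gives a post-test probability of roughly 32%, an absolute increase of about 17 percentage points, in line with the figures reported in the text.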

Calibration of the Centor score
There was no significant difference between predicted and observed events in any of the Centor score categories (Figure 4), suggesting that the Centor score performed as well in the pooled data at predicting the probability of culture positive GABHS pharyngitis across the strata of risk as it did in the derivation study. Slightly fewer events were predicted in the 0-1 category than observed (z = 1.69, P = 0.09). There was modest between-study heterogeneity in the analysis, with I² values ranging from 11% to 49%.
A subgroup analysis based on prevalence was carried out for each score category of the Centor score. The prevalence was classified as 'high' if it was higher than that reported in the Centor derivation study (17.1%). The analysis showed that in the 0-1 and 2-3 score categories fewer events were predicted than observed in the high prevalence subgroup (0-1 n = 7 RR 0.42, 95% CI 0.25 to 0.70; 2-3 n = 9 RR 0.77, 95% CI 0.60 to 0.98) and slightly more events were predicted than observed in the low prevalence subgroup (0-1 n = 5 RR 1.11, 95% CI 0.72 to 1.71; 2-3 n = 6 RR 1.43, 95% CI 1.07 to 1.91). For score category 4, prevalence made little difference to the performance of the score (high prevalence n = 9 RR 1.13, 95% CI 0.89 to 1.43 and low prevalence n = 6 RR 1.20, 95% CI 0.85 to 1.70). Overall the subgroup analysis reduced interstudy heterogeneity, but did not improve the performance of the score.
We used the method of Poses et al. [42] to adjust each study for its own prevalence. We found this method decreased between-study heterogeneity, but the predicted-to-observed ratio did not improve significantly (data not shown).

Principal findings
From the diagnostic test accuracy of signs and symptoms analysis, all symptoms and signs included in the analysis have only a modest ability to discriminate patients with GABHS pharyngitis from those without it (range +LR 1.45-2.20, range -LR 0.53-0.71); therefore no sign or symptom on its own has the power to rule in or rule out a diagnosis of GABHS pharyngitis. Fever and 'any exudates' have a higher specificity than sensitivity and are more valid for ruling in a diagnosis of GABHS pharyngitis when present, while absence of cough and tender anterior cervical adenopathy have a higher sensitivity than specificity and are more valid for ruling out GABHS pharyngitis when absent. Based on our analysis it could be argued that the signs and symptoms present in the Centor score could be given different weights depending on whether the aim of the physician is to rule in or rule out a diagnosis of GABHS pharyngitis. However, it is highly unlikely that the benefit would outweigh the cost of complicating such a simple score.
In terms of diagnostic accuracy, our analysis of the Centor score as a decision aid for antibiotic prescribing suggests that although the score is reasonably specific when ≥ 3 signs or symptoms are present (0.82) and very specific when 4 are present (0.95), the post-test probability of GABHS pharyngitis is relatively low (that is, for a prevalence of 15% and a score of ≥ 3, post-test probability is 32%, Table 4). Therefore, although the Centor score can enhance appropriate prescribing of antibiotics, it should be used with caution as treating all patients presenting with a sore throat and a score of ≥ 3 may lead to many patients being treated with antibiotics inappropriately (Table 4).
In terms of calibration, the Centor score produces consistent observed:predicted performance across all risk strata in different populations (Figure 4). This shows that the Centor score is well calibrated, suggesting that the rule is generalisable across settings and countries [41].

Findings in the context of other studies
The diagnostic accuracy of signs and symptoms findings of this systematic review are consistent with a previous review on GABHS pharyngitis which concluded that no sign or symptom on its own is powerful enough to rule in or rule out the diagnosis of GABHS pharyngitis [12]. Not all studies reported the same signs or symptoms to be of similar predictive value. For example, Lindbaek et al. and Llor et al. found that among the four Centor criteria, only cervical adenitis and absence of cough were significantly more frequent in the GABHS pharyngitis patients compared to those with negative cultures [33,39], while Meland et al. found that tonsillar exudate had no predictive ability [35]. Our meta-analysis shows that all individual symptoms and signs that comprise the Centor score do have modest discriminatory power, with 'any exudates' being the strongest (Table 2).
To the best of our knowledge, this is the first diagnostic test accuracy review of the Centor score. Wigton et al. [49] reported that a cut-off point of ≥ 2 signs or symptoms in their patient cohort produced a sensitivity of 86% and a specificity of 42%, which was similar to our pooled results (79% and 55% respectively). The most appropriate cut point for antibiotic treatment when using the Centor score depends on the clinician's aim. Adults in Western society rarely have complications such as rheumatic fever, so clinicians may want to ensure a high specificity in the test, which would lead to lower antibiotic prescription rates but missed cases of GABHS pharyngitis. In contrast, a clinician in a developing country with a high rate of rheumatic fever, and no access to other diagnostic tests, may feel a high sensitivity is more important.

Strengths and weaknesses
The strengths of this study include the inclusion of additional data from authors, and pooling the results of validation studies for the Centor score so that formal quantitative validation of the Centor score is accomplished.
We acknowledge that our review has several limitations. There is moderate heterogeneity in the Centor score calibration analysis (I² = 11% to 49%). Heterogeneity in the studies could be due to a variety of factors: chance; a threshold effect as caused by observer variation in the measurement of signs and symptoms; a variation in the pretest probability of GABHS pharyngitis; or other unanticipated factors. The prevalence of GABHS pharyngitis was highly variable between studies (Additional file 1). We addressed the effect of study prevalence as a source of heterogeneity in our calibration analysis.
Although we used a systematic search strategy, we acknowledge that it was not exhaustive and it is possible that we may have missed relevant articles. In particular, the use of search filters in systematic reviews is debatable and not always recommended [50].
The use of a throat culture as the reference standard for diagnosing GABHS pharyngitis is open to some debate. To date, throat culture is still considered by most to be the reference standard of choice when diagnosing GABHS pharyngitis [3,8]. Newly developed RADTs can be used in ambulatory care settings, with results available within minutes [51,52]. However, throat cultures and RADTs fail to distinguish between active infection and carriage, which can lead to inappropriate prescribing of antibiotics for cases of carriage [10,53]. In addition, many argue that lower sensitivities and the lack of cost effectiveness of RADTs in primary care, will limit their use and that signs and symptoms will always be valuable [54,55].
The method of analysis in pooling the individual Centor score studies (calibration analysis) is based on the comparative approach used by Bont et al. to validate the CRB-65 CPR in a single validation study [56]. This  [57]. No statistically significant difference was found between the predicted events by the two methods (P > 0.05). A limitation of this method is that it compares the proportion of patients predicted and observed to have GABHS pharyngitis but without patient level data it is not possible to determine if the positives as predicted by the Centor score are the same patients who are positive based on the throat swab.

Implications for practice
Our meta-analysis of the Centor score suggests that it transfers well to other populations and can be used by [...] (Table 4 and Figure 4). However, the relatively low post-test probability of GABHS pharyngitis even in areas of high prevalence (Table 4) suggests the score should be used with caution by clinicians when used as a decision aid for antibiotic prescribing. Studies have shown that the use of scores can improve antibiotic prescribing [14], while others have found them no better than clinician judgement [58].

Notes: Meland 1993 reported the proportion of group A positive cultures versus group C and G in their study. After doing a sensitivity analysis, we adjusted the results using the percentage of group A when estimating the number of observed GABHS positive patients. Kljakovic 1993 did not report the group (A, C or G) for cultures positive for streptococci. After a sensitivity analysis we assumed that all positive cultures were group A, as the overall prevalence in this study was relatively low.
A barrier when introducing CPRs such as the Centor score into practice is that clinicians often fail to apply them [59,60]. One community-based study that used repeated clinical prompts for the modified Centor score to try to influence physicians' behaviour when prescribing antibiotics for sore throats found no significant change in physician behaviour [60]. However, the authors had problems retaining community-based physicians for the duration of the study and believe their results may have been biased by these losses [60].
The formal incorporation of CPRs can be facilitated by computer-based clinical decision support systems (CDSSs) that quantify diagnostic and prognostic information so as to provide physicians with patient specific recommendations: such aids have been shown to reduce antibiotic prescribing in respiratory tract infections in children in primary care settings [61,62].

Conclusions
Individual symptoms and signs have only a modest ability to rule in or out a diagnosis of GABHS pharyngitis. The Centor score uses a combination of signs and symptoms to predict the risk of GABHS pharyngitis; the score is well calibrated across a variety of countries and settings. It has reasonably good specificity, and can enhance the appropriate prescribing of antibiotics, but should be used with caution in low prevalence settings of GABHS pharyngitis such as primary care.

Additional material
Additional file 1: Table S1. Summary of included studies.