Grading Strength of Evidence Prepared for: The Agency for Healthcare Research and Quality (AHRQ) Training Modules for Systematic Reviews Methods Guide
Systematic Review Process Overview
Learning Objectives
- To define what “grading strength of evidence (SOE)” is
- To describe why grading SOE is important
- To distinguish between grading SOE and rating the quality of individual articles
- To list primary and additional domains for grading SOE
- To describe options for scoring SOE domains
- To describe how to score and present SOE grades

Grading Strength of Evidence
- Is distinct from rating the quality of individual studies
- Is generally used only to assess:
  - Major outcomes (benefits and harms)
  - Major comparisons, when relevant

Why Grade Strength of Evidence?
- To facilitate use of systematic reviews by diverse decisionmakers and stakeholders
- To give decisionmakers:
  - A comprehensive evaluation of the evidence
  - A sense of how much confidence they can place in the evidence
- To foster transparency and documentation

Three Steps to Grading Strength of Evidence
1. Scoring four required domains
   a. Risk of bias
   b. Consistency
   c. Directness
   d. Precision
2. Considering, and possibly scoring, four additional domains
   a. Dose-response association
   b. Plausible confounders
   c. Strength of association
   d. Publication bias
3. Combining scores from required domains into a single strength-of-evidence score, taking scores on additional domains into account as needed

Four Required Domains: Risk of Bias
- Concerns both study design and study conduct for individual studies, rated by usual methods
- Assesses the aggregate quality of studies within each major study design and integrates those assessments into an overall risk-of-bias score
- Risk-of-bias scores:
  - High: lowers strength-of-evidence grade
  - Medium
  - Low: raises strength-of-evidence grade

Four Required Domains: Consistency
- Defined as the degree of similarity in the effect sizes of different studies within an evidence base
- Consistent evidence bases:
  - Have the same direction of effect (same side of “no effect”)
  - Have a narrow range of effect sizes
- Inconsistent evidence bases:
  - Have nonoverlapping confidence intervals
  - Have significant unexplained clinical or statistical heterogeneity

Four Required Domains: Consistency Scores
- Only three scores are possible for consistency:
  - Consistent (i.e., no inconsistency)
  - Inconsistent
  - Unknown or not applicable (consistency cannot be assessed from a single study)
- In a meta-analysis, use appropriate tests, such as Cochran’s Q test or the I² statistic

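The two heterogeneity measures named on the consistency slide can be sketched in a few lines. This is a minimal illustration that assumes fixed-effect, inverse-variance weights and per-study effect estimates with standard errors; the function name `cochran_q_and_i2` and its inputs are hypothetical and do not come from the Methods Guide.

```python
from typing import Sequence, Tuple


def cochran_q_and_i2(
    effects: Sequence[float], std_errors: Sequence[float]
) -> Tuple[float, int, float]:
    """Return (Q, degrees of freedom, I-squared as a percentage).

    Assumes inverse-variance (fixed-effect) weights, w_i = 1 / se_i**2.
    """
    if len(effects) != len(std_errors):
        raise ValueError("effects and std_errors must have the same length")
    if len(effects) < 2:
        # Mirrors the slide: a single study cannot be assessed for consistency.
        raise ValueError("consistency cannot be assessed from a single study")

    weights = [1.0 / se**2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

    # Cochran's Q: weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # I-squared: share of total variation attributed to heterogeneity,
    # truncated at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2


if __name__ == "__main__":
    q, df, i2 = cochran_q_and_i2([0.0, 1.0], [0.1, 0.1])
    print(f"Q = {q:.1f} on {df} df, I2 = {i2:.1f}%")
```

Q is compared against a chi-square distribution with `df` degrees of freedom. How large a Q or I² value should count as "inconsistent" remains a reviewer judgment that this sketch does not encode.
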
Four Required Domains: Directness
- Defined as whether the evidence being assessed:
  - Reflects a single, direct link between the interventions of interest and the ultimate health outcome under consideration, or
  - Relies on multiple links in a causal chain
- If multiple links are involved, strength of evidence can be only as strong as the weakest link
- Using analytic frameworks* is important
*See the “Analytic Frameworks” module

Four Required Domains: Aspects of Indirectness
- Intermediate or surrogate outcomes instead of health or patient-centered outcomes
  - Example: laboratory test results or radiographic findings versus patient-reported functional outcomes or death
- Indirect comparisons rather than direct, head-to-head comparisons
  - Direct (e.g., A vs. B, A vs. C, and B vs. C):
    - Head-to-head studies in the evidence base
    - Generally assumes use of health outcomes, not surrogate/proxy outcomes
    - Better strength of evidence
  - Indirect (e.g., A vs. B, B vs. C, but not A vs. C):
    - No head-to-head studies that cover all interventions or outcomes of interest
    - Problematic situation for all types of comparisons
    - Strength-of-evidence grades not as strong as with direct evidence

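One way to see why indirect comparisons tend to support weaker grades than head-to-head evidence is the adjusted indirect comparison, often attributed to Bucher and colleagues. The slides do not name this method; it is brought in here purely as an illustration. The sketch assumes two independent bodies of evidence, effects on an additive scale (for example, a risk difference or a log odds ratio), and effects defined as first arm minus second arm. The function name is hypothetical.

```python
import math
from typing import Tuple


def adjusted_indirect_comparison(
    d_ab: float, se_ab: float, d_bc: float, se_bc: float, z: float = 1.96
) -> Tuple[float, float, Tuple[float, float]]:
    """Estimate A vs. C from A vs. B and B vs. C (no head-to-head A vs. C data).

    Returns (estimate, standard error, (CI lower, CI upper)).
    """
    # (A - B) + (B - C) = A - C on an additive scale.
    d_ac = d_ab + d_bc
    # Variances add for independent estimates, so the indirect estimate is
    # always less precise than either direct comparison it is built from.
    se_ac = math.sqrt(se_ab**2 + se_bc**2)
    return d_ac, se_ac, (d_ac - z * se_ac, d_ac + z * se_ac)


if __name__ == "__main__":
    est, se, (lo, hi) = adjusted_indirect_comparison(0.3, 0.1, 0.2, 0.1)
    print(f"A vs. C: {est:.2f} (SE {se:.3f}, 95% CI {lo:.2f} to {hi:.2f})")
```

The widened standard error shows one reason the indirect situation is described as problematic: the same data yield a less certain answer than a direct A vs. C trial of comparable size would.
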
Related Issue of Applicability*
- Applicability is evaluated separately from directness for the Evidence-based Practice Center (EPC) program.
- For decisionmakers, the applicability of evidence depends on the different interests of diverse groups.
- A PICOS framework (patient populations, interventions, comparators, outcomes, and settings) is used for applicability assessment in the EPC program.
- Although the EPC program separates applicability from strength-of-evidence grading, other systems that work with one decisionmaker may incorporate applicability issues into their evaluations of directness.
*See the “Assessing Applicability” module

Four Required Domains: Directness Scores
- Only two scores are possible for directness:
  - Direct: evidence is based on a single link between the intervention and health outcomes
  - Indirect: evidence relies on:
    - Surrogate/proxy outcomes
    - More than one body of evidence
    - Both situations

Four Required Domains: Precision
- Defined as the degree of certainty for the estimate of effect with respect to a specific outcome
- Is a complicated concept that:
  - Asks the question: What can decisionmakers conclude about whether one treatment is, clinically speaking, inferior, superior, or equivalent (neither inferior nor superior) to another?
  - Includes considerations of:
    - Statistical significance for effect estimates
    - Confidence intervals for those effect estimates

Four Required Domains: Precision Scores
- Are rated separately for each important outcome or comparison, including for any summary estimate of effect size
- Only two scores are possible:
  - Precise: estimate allows a clinically useful conclusion
  - Imprecise: confidence interval is so wide it could include clinically distinct (even conflicting) conclusions

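The precise/imprecise distinction can be made concrete with a small sketch. It assumes that reviewers have agreed on a minimal clinically important difference (`mcid`) for the outcome; that threshold, the function name, and the three conclusion labels are illustrative assumptions rather than rules stated in the slides.

```python
from typing import Tuple


def precision_score(ci_lower: float, ci_upper: float, mcid: float) -> Tuple[str, str]:
    """Classify a confidence interval as ("precise" | "imprecise", conclusion).

    The interval is treated as precise only when it sits entirely inside one
    clinical conclusion: superior, inferior, or equivalent.
    """
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if mcid <= 0:
        raise ValueError("mcid must be positive")

    if ci_lower > mcid:
        return "precise", "superior"
    if ci_upper < -mcid:
        return "precise", "inferior"
    if -mcid <= ci_lower and ci_upper <= mcid:
        return "precise", "equivalent"
    # The interval spans clinically distinct (possibly conflicting) conclusions.
    return "imprecise", "no single conclusion"


if __name__ == "__main__":
    print(precision_score(2.0, 8.0, mcid=1.0))
    print(precision_score(-8.0, 1.0, mcid=1.0))
```

Positive values are assumed to favor the intervention. In practice the threshold, and whether such a rule is used at all, would be set per outcome by the review team.
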
Additional Domains
- Four “discretionary” domains:
  - Dose-response association
  - Plausible confounders
  - Strength of association
  - Publication bias
- Use when they are:
  - Applicable
  - Helpful in reaching conclusions about overall grades for strength of evidence

Additional Domains: Dose-Response Association
- Pattern of a larger effect with greater exposure (dose, duration, adherence) either across or within studies
- Rate if studies give levels of exposure

Additional Domains: Dose-Response Scores
- Three scores are possible for dose-response:
  - Present: dose-response pattern observed
    - In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.
  - Not present: no dose-response pattern observed
  - Not applicable or not tested

Additional Domains: Plausible Confounding
- In an observational study, sometimes plausible confounding factors work in the direction opposite that of the observed effect.
- Had such “effect-weakening” confounders not been present, the observed effect would have been even larger than the one observed.
- In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.
- Consider whether or not plausible confounding exists that would decrease the observed effect.

Additional Domains: Plausible Confounding Scores
- Two scores are possible for plausible confounding:
  - Present: confounding factors that would decrease the observed effect may be present
  - Absent: confounding factors that would decrease the observed effect are not likely to be present

Additional Domains: Strength of Association
- Magnitude of effect
- Defined as the likelihood that the observed effect is large enough that it cannot have occurred solely as a result of bias from potential confounding factors
- Consider when effect size is particularly large

Additional Domains: Strength of Association Scores
- Two scores are possible for strength of association:
  - Strong: large effect size that is unlikely to have occurred in the absence of a true effect of the intervention
    - In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.
  - Weak: small enough effect size that it could have occurred solely as a result of bias from confounding factors

Additional Domains: Publication Bias
- Studies may have been published selectively.
  - Example: only a small proportion of relevant trials or other studies has been published.
- As a result, estimated effects of an intervention that are based on published studies may not reflect the true effect.
- Publication bias may undermine the overall robustness of a body of evidence.

Additional Domains: Publication Bias Scores
- Publication bias scores need not be formally computed but can influence ratings of required domains.
- Take possible publication bias into account when:
  - Rating consistency
  - Calculating a summary confidence interval for an effect
- Add comments on publication bias when circumstances suggest that relevant empirical findings, particularly negative or no-difference findings, have not been published or are not otherwise available.

Procedures for Assessing Domains
- Use two or more reviewers with the appropriate clinical and methodological expertise.
- Assess separately:
  - Each required domain (or each optional domain, as relevant)
  - Each major outcome, including benefits and harms
- Resolve differences by consensus or mediation by an additional expert; consensus scores should appear in tables.
- Record and maintain records of each reviewer's individual judgments about domains as background documentation.

Strength of Evidence Grades (I)
- Reflect a global assessment that:
  - Takes the required domains directly into account
  - Incorporates judgments about the additional domains as needed
- Aim to:
  - Provide “actionable” information for a variety of different users, readers, and stakeholders
  - Be transparent in how the strength-of-evidence grades are reached

Strength of Evidence Grades (II)
- For each comparison of interest, rate the strength of evidence for:
  - Each major benefit (e.g., positive effects on health outcomes such as physical function or quality of life, or effects on laboratory measures or other surrogate variables)
  - Each major harm (ranging from rare, serious, or life-threatening adverse events to common but bothersome effects)
- For both benefits and harms:
  - Focus on the outcomes most relevant to patients, clinicians, and policymakers

Strength of Evidence Grades and Definitions
- High: High confidence that the evidence reflects the true effect. Further research is very unlikely to change our confidence in the estimate of effect.
- Moderate: Moderate confidence that the evidence reflects the true effect. Further research may change our confidence in the estimate of effect and may change the estimate.
- Low: Low confidence that the evidence reflects the true effect. Further research is likely to change the confidence in the estimate of effect and is likely to change the estimate.
- Insufficient: Evidence either is unavailable or does not permit a conclusion.

Strength of Evidence Grades: Additional Points (I)
- Using the high, moderate, or low strength-of-evidence grade:
  - Implies that a body of evidence actually exists
  - Is intended to convey how confident reviewers are about decisions that may be made based on evidence graded one way or another
  - Requires the use of only one designation, not a range (e.g., not “low to moderate”)

Strength of Evidence Grades: Additional Points (II)
- The insufficient strength-of-evidence grade:
  - Is applied when:
    - Reviewers cannot draw conclusions about an outcome, comparison, or other question
  - Is appropriate when:
    - No evidence is available at all
    - Evidence is too insubstantial to permit conclusions to be drawn (e.g., opposing results from studies with a similar risk of bias; wide and overlapping confidence intervals)

Scoring and Reporting: General Guidance
- Different approaches may be used to incorporate multiple domains into an overall strength-of-evidence grade:
  - GRADE algorithm
  - Weighting system of the Evidence-based Practice Center
  - Some qualitative approach
- Use (at least) two reviewers
- Assess resulting interrater reliability for each domain score, and keep records

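Interrater reliability for a domain score could be summarized with Cohen's kappa, one common agreement statistic for two raters. The slides do not say which statistic to use, so the choice of kappa and the function name below are assumptions. The sketch takes each reviewer's scores for the same list of outcomes, in the same order.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_1: Sequence[Hashable], rater_2: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two reviewers on categorical scores."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("both raters must score the same non-empty set of items")

    n = len(rater_1)
    # Proportion of items on which the two reviewers gave the same score.
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n

    # Agreement expected by chance, from each reviewer's marginal frequencies.
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    categories = counts_1.keys() | counts_2.keys()
    expected = sum(counts_1[c] * counts_2[c] for c in categories) / n**2

    if expected == 1.0:
        # Both reviewers used one identical category for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    reviewer_a = ["consistent", "consistent", "inconsistent", "inconsistent"]
    reviewer_b = ["consistent", "inconsistent", "inconsistent", "inconsistent"]
    print(f"kappa for consistency scores: {cohens_kappa(reviewer_a, reviewer_b):.2f}")
```

Running the function once per domain and keeping the inputs alongside the result also serves the recordkeeping step.
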
Guiding Principles: Risk of Bias
- Risk of bias (given design and conduct of available studies) is the essential component in determining the strength-of-evidence grade.
- First, consider which study design is most appropriate to reduce bias for each question.
- Next, consider the risk of bias from available studies.

Guiding Principles: Risk of Bias Example
- Drug comparisons in randomized controlled trials (RCTs), with either placebo or an active comparator as an appropriate design:
  - Evidence from well-conducted RCTs will have less risk of bias than evidence based on observational studies.
  - For RCTs, reviewers can start with a rating of low for risk of bias and change the assessment if the RCTs have important flaws.
  - For observational data, reviewers can start with a rating of high for risk of bias and change the assessment, depending upon how well studies were conducted.

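The starting-point logic in this example can be written as a small function. The starting ratings (low for RCTs, high for observational studies) come from the slide. The size of the adjustment, a single step to medium, is an assumption made for illustration, as are the function and parameter names.

```python
def risk_of_bias(
    design: str, important_flaws: bool = False, well_conducted: bool = False
) -> str:
    """Return "low", "medium", or "high" for a body of evidence of one design.

    design: "rct" or "observational" (hypothetical labels).
    """
    if design == "rct":
        # Start at low; move up if the trials have important flaws.
        return "medium" if important_flaws else "low"
    if design == "observational":
        # Start at high; move down if the studies were well conducted.
        return "medium" if well_conducted else "high"
    raise ValueError(f"unknown design: {design!r}")


if __name__ == "__main__":
    print(risk_of_bias("rct"))
    print(risk_of_bias("observational", well_conducted=True))
```

A real assessment could move more than one step in either direction. The sketch only encodes the direction of adjustment described on the slide.
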
Further Guidance: Principles for Scoring
- Be explicit about how the evidence grade will be determined, for example:
  - A point system for combining ratings of the domains
  - A qualitative consideration of the domains
- Carefully document procedures.
- Keep records of procedures and results for each review so that they may contribute to the overall expertise of the Evidence-based Practice Center and the science of grading evidence.

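To show what "being explicit" about a point system might look like, here is a purely hypothetical scheme. Neither the point values nor the treatment of an unknown consistency score comes from the Methods Guide or from GRADE; an Evidence-based Practice Center would define and document its own rules. Publication bias is left out because the slides say it need not be formally scored.

```python
from dataclasses import dataclass

GRADES = ["insufficient", "low", "moderate", "high"]  # indexed by points 0..3


@dataclass
class DomainScores:
    risk_of_bias: str  # "low" | "medium" | "high"
    consistency: str   # "consistent" | "inconsistent" | "unknown"
    directness: str    # "direct" | "indirect"
    precision: str     # "precise" | "imprecise"
    dose_response: bool = False
    strong_association: bool = False
    effect_weakening_confounding: bool = False


def overall_grade(s: DomainScores) -> str:
    """Combine domain scores into one grade using made-up point values."""
    # Required domains: risk of bias sets the starting points ...
    points = {"low": 3, "medium": 2, "high": 1}[s.risk_of_bias]
    # ... and each remaining required domain can deduct one point.
    # Assumption: "unknown" consistency is penalized like "inconsistent".
    if s.consistency != "consistent":
        points -= 1
    if s.directness == "indirect":
        points -= 1
    if s.precision == "imprecise":
        points -= 1
    # Additional domains: each one that is present can add one point.
    for present in (s.dose_response, s.strong_association,
                    s.effect_weakening_confounding):
        if present:
            points += 1
    # Clamp to the four available grades; a single designation, never a range.
    return GRADES[max(0, min(3, points))]


if __name__ == "__main__":
    print(overall_grade(DomainScores("low", "consistent", "direct", "precise")))
```

Writing the rule down in this form makes every upgrade and downgrade traceable, which supports the documentation and recordkeeping points on the slide.
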
Further Guidance: Principles for Reporting (I)
- Explain the rationale for the approach used and identify which domains were important in upgrading or downgrading the strength of evidence.
- Explain judgments about the degree to which any additional domains altered the overall strength-of-evidence grade.
- Provide enough detail within the report to ensure that users can grasp the methods.

Further Guidance: Principles for Reporting (II)
- Use the terms high, moderate, low, or insufficient. Do not use Roman numerals or other symbols.
- Use or adapt the illustrative tabular approach to reporting (see the publications listed below for examples).
Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Methods Guide for Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality, Posted August. Available at: ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf.
Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions - Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:

Grading Strength of Evidence: Presentation of Results (Moderate and High Grades)
Columns 2 to 5 are the domains pertaining to strength of evidence; the last column gives the magnitude of effect.

Number of Studies (Subjects) | Risk of Bias; Design/Quality | Consistency | Directness | Precision | Absolute Risk Difference per 100 Patients
Severe Diarrhea: Moderate SOE
4 (256) | RCT/Fair | Consistent | Direct | Imprecise | 4 (95% CI –8 to +1)
14 (28,400) | Cohort/Fair | Consistent | Direct | Precise | 5 (95% CI 8 to 2)
Improved Quality of Life: High SOE
6 (265) | RCTs/Good | Consistent | Direct | Precise | 5 (95% CI 1 to 7)

CI = confidence interval; RCT = randomized controlled trial; SOE = strength of evidence

Grading Strength of Evidence: Presentation of Results (Insufficient and Low Grades)
Columns 2 to 5 are the domains pertaining to strength of evidence; the last column gives the magnitude of effect.

Number of Studies (Subjects) | Risk of Bias; Design/Quality | Consistency | Directness | Precision | Absolute Risk Difference per 100 Patients
Mortality: Insufficient SOE
1 (80) | RCT/Fair | Unknown | Direct | Imprecise | 1 (95% CI 4 to +3)
14 (384) | Retrospective cohort/Fair | Inconsistent | Direct | Imprecise | 7 to +5 (range)
Myocardial Infarction: Low SOE
7 (625) | Retrospective cohort/Low | Consistent | Direct | Imprecise | 3 (95% CI 5 to 1)

CI = confidence interval; RCT = randomized controlled trial; SOE = strength of evidence

Comparison With the GRADE System
- The grading system used by the Evidence-based Practice Centers (EPCs) is similar to the GRADE system.
- The EPC grading system reflects the needs of AHRQ stakeholders for reviews on a wide variety of topics and not for recommendations or guidelines.
- The main differences between the two grading systems:
  - The definitions of domains differ slightly; in the EPC system “directness” excludes “applicability,” which is handled separately.
  - In the EPC system, observational studies are considered to have less risk of bias for outcomes such as harms, which can raise the initial grade to “moderate.”
  - The definition of overall grade differs; the EPC system emphasizes confidence in the estimate, whereas the GRADE system emphasizes the effect of future research.
  - The EPC system permits three different ways to reach an overall strength-of-evidence grade; the GRADE formula has one.

Summary: Grading Strength of Evidence
- Is a critical last step in analysis and presentation
- Is done after the quality of articles is rated by at least two independent reviewers
- Helps users of systematic reviews understand the body of evidence and how much confidence they can have in making decisions based on that evidence
- Uses scores on four primary (mandatory) domains and four additional (discretionary) domains
- Focuses on major outcomes and comparisons
- Is denoted in terms of high, moderate, or low strength or insufficient evidence
- Presents strength-of-evidence grades in tabular form

References
Atkins D, Best D, Briss PA, et al, for the GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004;328:1490.
Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Agency for Healthcare Research and Quality. Methods Guide for Comparative Effectiveness Reviews [posted July 2009]. Rockville, MD. Available at: ahrq.gov/healthInfo.cfm?infotype=rr&ProcessID=60.
Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions - Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:

Author
This presentation was prepared by Kathleen N. Lohr, Ph.D., a Distinguished Fellow at RTI International. This module is based on an update of chapter 11 in version 1.0 of the Methods Guide for Comparative Effectiveness Reviews (updated chapter available at: /318/2009_0805_grading.pdf).