Grading Strength of Evidence Prepared for: The Agency for Healthcare Research and Quality (AHRQ) Training Modules for Systematic Reviews Methods Guide.


1 Grading Strength of Evidence Prepared for: The Agency for Healthcare Research and Quality (AHRQ) Training Modules for Systematic Reviews Methods Guide www.ahrq.gov

2 Systematic Review Process Overview

3 Learning Objectives
• To define what “grading strength of evidence (SOE)” is
• To describe why grading SOE is important
• To distinguish between grading SOE and rating the quality of individual articles
• To list primary and additional domains for grading SOE
• To describe options for scoring SOE domains
• To describe how to score and present SOE grades

4 Grading Strength of Evidence
• Is distinct from rating the quality of individual studies
• Is generally used only to assess:
  • Major outcomes (benefits and harms)
  • Major comparisons, when relevant

5 Why Grade Strength of Evidence?
• To facilitate use of systematic reviews by diverse decisionmakers and stakeholders
• To give decisionmakers:
  • A comprehensive evaluation of the evidence
  • A sense of how much confidence they can place in the evidence
• To foster transparency and documentation

6 Three Steps to Grading Strength of Evidence
1. Scoring four required domains
   a. Risk of bias
   b. Consistency
   c. Directness
   d. Precision
2. Considering, and possibly scoring, four additional domains
   a. Dose-response association
   b. Plausible confounders
   c. Strength of association
   d. Publication bias
3. Combining scores from required domains into a single strength-of-evidence score, taking scores on additional domains into account as needed

7 Four Required Domains: Risk of Bias
• Concerns both study design and study conduct for individual studies, rated by usual methods
• Assesses the aggregate quality of studies within each major study design and integrates those assessments into an overall risk-of-bias score
• Risk-of-bias scores:
  • High: lowers strength-of-evidence grade
  • Medium
  • Low: raises strength-of-evidence grade

8 Four Required Domains: Consistency
• Defined as the degree of similarity in the effect sizes of different studies within an evidence base
• Consistent evidence bases:
  • Have the same direction of effect (same side of “no effect”)
  • Have a narrow range of effect sizes
• Inconsistent evidence bases:
  • Have nonoverlapping confidence intervals
  • Have significant unexplained clinical or statistical heterogeneity

9 Four Required Domains: Consistency Scores
• Only three possible scores for consistency:
  • Consistent (i.e., no inconsistency)
  • Inconsistent
  • Unknown or not applicable (single study cannot be assessed)
• Meta-analysis:
  • Use appropriate tests, such as Cochran’s Q test or the I² statistic
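As a rough illustration of the heterogeneity tests named on this slide, the following Python sketch computes Cochran's Q and the I² statistic from per-study effect estimates and their variances. It assumes fixed-effect, inverse-variance weighting; the function name and argument layout are illustrative and do not come from the slides.

```python
from math import fsum


def cochran_q_i2(effects, variances):
    """Return (Q, degrees of freedom, I^2 in percent).

    Assumes inverse-variance (fixed-effect) weights; effects and
    variances are per-study estimates on the same scale.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]
    # Inverse-variance weighted pooled estimate.
    pooled = fsum(w * y for w, y in zip(weights, effects)) / fsum(weights)
    # Q: weighted sum of squared deviations from the pooled estimate.
    q = fsum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation beyond what chance alone would give.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2
```

Q would then be compared with a chi-square distribution on `df` degrees of freedom. Any cutoff on I² for calling an evidence base inconsistent is a reviewer judgment that the slides do not specify.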

10 Four Required Domains: Directness
• Defined as whether the evidence being assessed:
  • Reflects a single, direct link between the interventions of interest and the ultimate health outcome under consideration
  • Relies on multiple links in a causal chain
• If multiple links are involved, strength of evidence can be only as strong as the weakest link
• Using analytic frameworks* is important
*See the “Analytic Frameworks” module
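The weakest-link rule can be sketched in a few lines of Python. This assumes each link in the causal chain has already been given one of the four grades used later in this module; the `Grade` enum and `chain_ceiling` function are hypothetical names chosen for illustration.

```python
from enum import IntEnum


class Grade(IntEnum):
    """Strength-of-evidence grades, ordered weakest to strongest."""
    INSUFFICIENT = 0
    LOW = 1
    MODERATE = 2
    HIGH = 3


def chain_ceiling(link_grades):
    """Upper bound on the grade for a causal chain: its weakest link."""
    grades = list(link_grades)
    if not grades:
        raise ValueError("a causal chain needs at least one link")
    return min(grades)
```

For example, a chain whose links are graded high, low, and moderate could be graded no higher than low.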

11 Four Required Domains: Aspects of Indirectness
• Intermediate or surrogate outcomes instead of health or patient-centered outcomes
  • Example: laboratory test results or radiographic findings versus patient-reported functional outcomes or death
• Indirect comparisons rather than direct, head-to-head comparisons
• Direct (e.g., A vs. B, A vs. C, and B vs. C):
  • Head-to-head studies in the evidence base
  • Generally assumes use of health outcomes, not surrogate/proxy outcomes
  • Better strength of evidence
• Indirect (e.g., A vs. B, B vs. C, but not A vs. C):
  • No head-to-head studies that cover all interventions or outcomes of interest
  • Problematic situation for all types of comparisons
  • Strength-of-evidence grades not as strong as with direct evidence

12 Related Issue of Applicability*
• Applicability is evaluated separately from directness for the Evidence-based Practice Center (EPC) program.
• For decisionmakers, the applicability of evidence depends on the different interests of diverse groups.
• A PICOS framework (patient populations, interventions, comparators, outcomes, and settings) is used for applicability assessment in the EPC program.
• Although the EPC program separates applicability from strength-of-evidence grading, other systems that work with one decisionmaker may incorporate applicability issues into their evaluations of directness.
*See the “Assessing Applicability” module

13 Four Required Domains: Directness Scores
• Only two possible scores for directness:
  • Direct:
    • Evidence is based on a single link between the intervention and health outcomes
  • Indirect:
    • Evidence relies on:
      • Surrogate/proxy outcomes
      • More than one body of evidence
      • Both situations

14 Four Required Domains: Precision
• Defined as the degree of certainty for estimate of effect with respect to a specific outcome
• Is a complicated concept that:
  • Asks the question:
    • What can decisionmakers conclude about whether one treatment is, clinically speaking, inferior, superior, or equivalent (neither inferior nor superior) to another?
  • Includes considerations of:
    • Statistical significance for effect estimates
    • Confidence intervals for those effect estimates

15 Four Required Domains: Precision Scores
• Are rated separately for each important outcome or comparison, including for any summary estimate of effect size
• Only two scores are possible:
  • Precise: estimate allows a clinically useful conclusion
  • Imprecise: confidence interval is so wide it could include clinically distinct (even conflicting) conclusions
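One possible way to operationalize the two precision scores is sketched below in Python. It assumes that reviewers have chosen a minimal clinically important difference (`mcid`) and that positive effects favor the treatment; both are assumptions for illustration, since the slides do not prescribe a threshold or a sign convention. The interval is called precise only if both ends support the same clinical conclusion.

```python
def conclusion(value, mcid):
    """Clinical reading of a single effect value (positive favors treatment)."""
    if value <= -mcid:
        return "inferior"
    if value >= mcid:
        return "superior"
    return "equivalent"


def score_precision(ci_low, ci_high, mcid):
    """Return 'precise' if the whole confidence interval supports one
    clinical conclusion, otherwise 'imprecise'.

    mcid is a hypothetical, reviewer-chosen threshold.
    """
    if mcid <= 0 or ci_low > ci_high:
        raise ValueError("need mcid > 0 and ci_low <= ci_high")
    # The three conclusions are contiguous ranges, so comparing the two
    # endpoints is enough to cover every value inside the interval.
    same = conclusion(ci_low, mcid) == conclusion(ci_high, mcid)
    return "precise" if same else "imprecise"
```

With `mcid=1`, an interval of 2 to 6 is scored precise (superior throughout), whereas an interval of -3 to 0.5 is scored imprecise because it spans both "inferior" and "equivalent".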

16 Additional Domains
• Four “discretionary” domains:
  • Dose-response association
  • Plausible confounders
  • Strength of association
  • Publication bias
• Use when they are:
  • Applicable
  • Helpful in reaching conclusions about overall grades for strength of evidence

17 Additional Domains: Dose-Response Association
• Pattern of a larger effect with greater exposure (dose, duration, adherence) either across or within studies
• Rate if studies give levels of exposure

18 Additional Domains: Dose-Response Scores
• Three scores are possible for dose-response:
  • Present: dose-response pattern observed
    • In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.
  • Not present: no dose-response pattern observed (dose-response relationship not present)
  • Not applicable or not tested

19 Additional Domains: Plausible Confounding
• In an observational study, sometimes plausible confounding factors work in the direction opposite that of the observed effect.
• Had such “effect-weakening” confounders not been present, the observed effect would have been even larger than the one observed.
• In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.
• Consider whether or not plausible confounding exists that would decrease the observed effect.

20 Additional Domains: Plausible Confounding Scores
• Two scores are possible for plausible confounding:
  • Present: confounding factors that would decrease the observed effect may be present
  • Absent: confounding factors that would decrease the observed effect are not likely to be present

21 Additional Domains: Strength of Association
• Magnitude of effect:
  • Defined as the likelihood that the observed effect is large enough that it cannot have occurred solely as a result of bias from potential confounding factors
  • Consider when effect size is particularly large

22 Additional Domains: Strength of Association Scores
• Two scores are possible for strength of association:
  • Strong: large effect size that is unlikely to have occurred in the absence of a true effect of the intervention
    • In such a case, Evidence-based Practice Center reviewers may want to upgrade the level of evidence.
  • Weak: small enough effect size that it could have occurred solely as a result of bias from confounding factors

23 Additional Domains: Publication Bias
• Studies may have been published selectively.
  • Example: only a small proportion of relevant trials or other studies has been published.
• In that case, estimated effects of an intervention that are based on published studies may not reflect the true effect.
• Publication bias may undermine the overall robustness of a body of evidence.

24 Additional Domains: Publication Bias Scores
• Publication bias scores:
  • Need not be formally computed but can influence ratings of required domains
  • Should take these possible publication bias factors into account:
    • Rating for consistency
    • Calculating a summary confidence interval for an effect
• Add comments on publication bias when circumstances suggest that relevant empirical findings, particularly negative or no-difference findings, have not been published or are not otherwise available.

25 Procedures for Assessing Domains
• Use two or more reviewers with the appropriate clinical and methodological expertise.
• Assess separately:
  • Each required domain (or each optional domain, as relevant)
  • Each major outcome, including benefits and harms
• Resolve differences by consensus or mediation by an additional expert; consensus scores should appear in tables.
• Maintain records of each reviewer's individual judgments about domains as background documentation.
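A small Python sketch of the bookkeeping this procedure implies: each reviewer's judgments are stored per outcome and domain, and any pairs that differ are listed so they can be resolved by consensus or mediation. The dictionary layout and the `find_disagreements` name are assumptions made for illustration.

```python
def find_disagreements(reviewer_a, reviewer_b):
    """Return {(outcome, domain): (score_a, score_b)} wherever the two
    reviewers' recorded scores differ.

    Each argument maps (outcome, domain) tuples to a score string.
    A key missing from one reviewer shows up as None on that side.
    """
    keys = set(reviewer_a) | set(reviewer_b)
    return {
        key: (reviewer_a.get(key), reviewer_b.get(key))
        for key in sorted(keys)
        if reviewer_a.get(key) != reviewer_b.get(key)
    }


# Hypothetical judgments, kept unchanged as background documentation.
a = {("mortality", "precision"): "imprecise",
     ("mortality", "directness"): "direct"}
b = {("mortality", "precision"): "precise",
     ("mortality", "directness"): "direct"}
to_resolve = find_disagreements(a, b)
```

Here `to_resolve` contains only the mortality/precision entry, which is the one item the reviewers would need to discuss.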

26 Strength of Evidence Grades (I)
• Reflect a global assessment that:
  • Takes the required domains directly into account
  • Incorporates judgments about the additional domains as needed
• Aim to:
  • Provide “actionable” information for a variety of different users, readers, and stakeholders
  • Be transparent in how the strength-of-evidence grades are reached

27 Strength of Evidence Grades (II)
• For each comparison of interest, rate the strength of evidence for:
  • Each major benefit (e.g., positive effects on health outcomes such as physical function or quality of life, or effects on laboratory measures or other surrogate variables)
  • Each major harm (ranging from rare, serious, or life-threatening adverse events to common but bothersome effects)
• For both benefits and harms:
  • Focus on the outcomes most relevant to patients, clinicians, and policymakers

28 Strength of Evidence Grades and Definitions
• High: High confidence that the evidence reflects the true effect. Further research is very unlikely to change our confidence in the estimate of effect.
• Moderate: Moderate confidence that the evidence reflects the true effect. Further research may change our confidence in the estimate of effect and may change the estimate.
• Low: Low confidence that the evidence reflects the true effect. Further research is likely to change the confidence in the estimate of effect and is likely to change the estimate.
• Insufficient: Evidence either is unavailable or does not permit a conclusion.

29 Strength of Evidence Grades: Additional Points (I)
• Using the high, moderate, or low strength-of-evidence grade:
  • Implies that a body of evidence actually exists
  • Is intended to convey how confident reviewers are about decisions that may be made based on evidence graded one way or another
  • Requires the use of only one designation, not a range (e.g., not “low to moderate”)

30 Strength of Evidence Grades: Additional Points (II)
• The insufficient strength-of-evidence grade:
  • Is applied when:
    • Reviewers cannot draw conclusions about an outcome, comparison, or other question
  • Is appropriate when:
    • No evidence is available at all
    • Evidence is too insubstantial to permit conclusions to be drawn (e.g., opposing results from studies with a similar risk of bias; wide and overlapping confidence intervals)

31 Scoring and Reporting: General Guidance
• Use different approaches to incorporate multiple domains into an overall strength-of-evidence grade
  • GRADE algorithm
  • Weighting system of the Evidence-based Practice Center
  • Some qualitative approach
• Use (at least) two reviewers
• Assess resulting interrater reliability for each domain score, and keep records
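The slide asks for interrater reliability on each domain score but does not name a statistic. The Python sketch below uses Cohen's kappa for two reviewers, which is one common choice and an assumption here, not something the slides prescribe.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers rating the same items.

    ratings_a and ratings_b are equal-length sequences of category
    labels (for example, consistency scores for each outcome).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed proportion of items on which the reviewers agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each reviewer's marginal totals.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical category throughout
    return (observed - expected) / (1.0 - expected)
```

Computing kappa separately for each domain, rather than pooling all domains, matches the slide's request for reliability "for each domain score".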

32 Guiding Principles: Risk of Bias
• Risk of bias (given design and conduct of available studies) is the essential component in determining the strength-of-evidence grade.
• First, consider which study design is most appropriate to reduce bias for each question.
• Next, consider the risk of bias from available studies.

33 Guiding Principles: Risk of Bias Example
• Drug comparisons in randomized controlled trials (RCTs), with either placebo or an active comparator as an appropriate design:
  • Evidence from well-conducted RCTs will have less risk of bias than evidence based on observational studies.
  • For RCTs, reviewers can start with a rating of low for risk of bias and change the assessment if the RCTs have important flaws.
  • For observational data, reviewers can start with a rating of high for risk of bias and change the assessment, depending upon how well studies were conducted.
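The starting points in this example can be written as a short Python sketch. The default ratings (low for RCTs, high for observational data) come from the slide; moving exactly one level in response to flaws or good conduct is a simplifying assumption, and the function and parameter names are hypothetical.

```python
LEVELS = ["low", "medium", "high"]  # risk-of-bias scores, best to worst


def provisional_risk_of_bias(design, important_flaws=False, well_conducted=False):
    """Start from the design-based default, then adjust by one level.

    design: "rct" or "observational".
    important_flaws: raises the rating for RCTs (assumed one step).
    well_conducted: lowers the rating for observational data (assumed one step).
    """
    if design == "rct":
        index = LEVELS.index("low") + (1 if important_flaws else 0)
    elif design == "observational":
        index = LEVELS.index("high") - (1 if well_conducted else 0)
    else:
        raise ValueError("design must be 'rct' or 'observational'")
    return LEVELS[index]
```

In practice reviewers may move a rating by more than one level; the single-step rule only keeps the sketch short.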

34 Further Guidance: Principles for Scoring
• Be explicit about how the evidence grade will be determined.
  • A point system for combining ratings of the domains
  • A qualitative consideration of the domains
• Carefully document procedures.
• Keep records of procedures and results for each review so that they may contribute to the overall expertise of the Evidence-based Practice Center and the science of grading evidence.
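To make "a point system for combining ratings of the domains" concrete, here is a purely hypothetical Python sketch. The penalty weights, the starting point, and the `grade_soe` name are invented for illustration; they are not the EPC weighting system or the GRADE algorithm, and a real review would document its own rules.

```python
# Hypothetical penalties per required-domain score (0 = no concern).
PENALTIES = {
    "risk_of_bias": {"low": 0, "medium": 1, "high": 2},
    "consistency": {"consistent": 0, "inconsistent": 1, "unknown": 1},
    "directness": {"direct": 0, "indirect": 1},
    "precision": {"precise": 0, "imprecise": 1},
}
GRADES = ["low", "moderate", "high"]


def grade_soe(scores, upgrades=0, has_evidence=True):
    """Combine required-domain scores into one grade (illustrative only).

    scores: dict with one score per required domain.
    upgrades: points granted for additional domains, such as a
              dose-response pattern or a strong association.
    has_evidence: False maps straight to "insufficient".
    """
    if not has_evidence:
        return "insufficient"
    penalty = sum(PENALTIES[domain][scores[domain]] for domain in PENALTIES)
    points = 2 - penalty + upgrades  # start at "high" and subtract
    return GRADES[max(0, min(2, points))]  # one designation, never a range


best = {"risk_of_bias": "low", "consistency": "consistent",
        "directness": "direct", "precision": "precise"}
```

Writing the rules down this explicitly, whatever weights are chosen, is what lets readers see which domains drove an upgrade or downgrade.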

35 Further Guidance: Principles for Reporting (I)
• Explain the rationale for the approach used and identify which domains were important in upgrading or downgrading the strength of evidence.
• Explain judgments about the degree to which any additional domains altered the overall strength-of-evidence grade.
• Provide enough detail within the report to ensure that users can grasp the methods.

36 Further Guidance: Principles for Reporting (II)
• Use the terms high, moderate, low, or insufficient.
• Do not use Roman numerals or other symbols.
• Use or adapt the illustrative tabular approach to reporting (see the publications listed below for examples).
  • Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Methods Guide for Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality. Posted August 2009. Available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf.
  • Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions: Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.

37 Grading Strength of Evidence: Presentation of Results (Moderate and High Grades)
Columns 2 to 5 are the domains pertaining to strength of evidence; the last column gives the magnitude of effect, and each outcome heading gives its strength of evidence (SOE).

Number of Studies (Subjects) | Risk of Bias; Design/Quality | Consistency | Directness | Precision | Absolute Risk Difference per 100 Patients
Severe Diarrhea: Moderate SOE
4 (256) | RCT/Fair | Consistent | Direct | Imprecise | −4 (95% CI −8 to +1)
14 (28,400) | Cohort/Fair | Consistent | Direct | Precise | −5 (95% CI −8 to −2)
Improved Quality of Life: High SOE
6 (265) | RCTs/Good | Consistent | Direct | Precise | +5 (95% CI +1 to +7)

CI = confidence interval; RCT = randomized controlled trial

38 Grading Strength of Evidence: Presentation of Results (Insufficient and Low)
Columns 2 to 5 are the domains pertaining to strength of evidence; the last column gives the magnitude of effect, and each outcome heading gives its strength of evidence (SOE).

Number of Studies (Subjects) | Risk of Bias; Design/Quality | Consistency | Directness | Precision | Absolute Risk Difference per 100 Patients
Mortality: Insufficient SOE
1 (80) | RCT/Fair | Unknown | Direct | Imprecise | −1 (95% CI −4 to +3)
14 (384) | Retrospective cohort/Fair | Inconsistent | Direct | Imprecise | −7 to +5 (range)
Myocardial Infarction: Low SOE
7 (625) | Retrospective cohort/Low | Consistent | Direct | Imprecise | −3 (95% CI −5 to −1)

CI = confidence interval; RCT = randomized controlled trial

39 Comparison With the GRADE System
• The grading system used by the Evidence-based Practice Centers (EPCs) is similar to the GRADE system.
• The EPC grading system reflects the needs of AHRQ stakeholders for reviews on a wide variety of topics and not for recommendations or guidelines.
• The main differences between the two grading systems:
  • The definitions of domains differ slightly; in the EPC system “directness” excludes “applicability,” which is handled separately.
  • In the EPC system, observational studies are considered to have less risk of bias for outcomes such as harms, which can raise the initial grade to “moderate.”
  • The definition of overall grade differs; the EPC system emphasizes confidence in estimate, whereas the GRADE system emphasizes effect of future research.
  • The EPC system permits three different ways to reach an overall strength-of-evidence grade; the GRADE formula has one.

40 Summary: Grading Strength of Evidence
• Is a critical last step in analysis and presentation
• Is done after the quality of articles is rated by at least two independent reviewers
• Helps users of systematic reviews understand the body of evidence and how much confidence they can have in making decisions based on that evidence
• Uses scores on four primary (mandatory) domains and four additional (discretionary) domains
• Focuses on major outcomes and comparisons
• Is denoted in terms of high, moderate, or low strength or insufficient evidence
• Presents strength-of-evidence grades in tabular form

41 References
• Atkins D, Best D, Briss PA, et al, for the GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004;328:1490.
• Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions. In: Agency for Healthcare Research and Quality. Methods Guide for Comparative Effectiveness Reviews [posted July 2009]. Rockville, MD. Available at: http://effectivehealthcare.ahrq.gov/healthInfo.cfm?infotype=rr&ProcessID=60.
• Owens DK, Lohr KN, Atkins D, et al. Grading the strength of a body of evidence when comparing medical interventions: Agency for Healthcare Research and Quality and the Effective Health Care Program. J Clin Epidemiol 2010;63:513-523.

42 Author
• This presentation was prepared by Kathleen N. Lohr, Ph.D., a Distinguished Fellow at RTI International.
• This module is based on an update of chapter 11 in version 1.0 of the Methods Guide for Comparative Effectiveness Reviews (updated chapter available at: http://effectivehealthcare.ahrq.gov/ehc/products/60/318/2009_0805_grading.pdf).

