
1 Viva la evidence (http://www.youtube.com/watch?v=QUW0Q8tXVUc)

2 Quiz
For each question, choose at least one answer…

3 Question: What is the purpose of a systematic review?
to get a publication
to explore the differences between available studies
to help people make decisions about whether to use the intervention
to reach a definite conclusion about whether the intervention works or not

4 Answer: The purpose of a systematic review is to
to get a publication
to explore the differences between available studies
to help people make decisions about whether to use the intervention
to reach a definite conclusion about whether the intervention works or not
Participants might choose 'to reach a definite conclusion about whether the intervention works or not', but a review does not always give a final answer to your question, for example due to the inclusion of a small number of studies with poor methodological quality. The research questions of individual trials are likely to differ from the question you want to answer in your review. This can lead to differences in population, intervention etc. between the studies you have included, and this variation can be the underlying reason for differences in the results of the different trials. In a review you can explore these differences.

5 Question: What are key features of a systematic review?
explicit, pre-defined eligibility criteria are used to select included studies
studies can be excluded if they come to the wrong conclusions
a systematic and comprehensive search to find all available evidence
a systematic and comprehensive search to find all the studies that meet the eligibility criteria
critical appraisal is used to assess the reliability of the included studies

6 Answer: The key features of a systematic review are
explicit, pre-defined eligibility criteria are used to select included studies
studies can be excluded if they come to the wrong conclusions
a systematic and comprehensive search to find all available evidence
a systematic and comprehensive search to find all the studies that meet the eligibility criteria
critical appraisal is used to assess the reliability of the included studies
Finding all available evidence is not a key feature, because you are trying to find all the evidence specific to your topic – that is, all the studies that will meet your eligibility criteria. Studies can only be excluded if they do not meet explicit eligibility criteria that were pre-defined (this can be checked against the protocol). In every Cochrane review, the risk of bias of all included studies is assessed.

7 Introduction to meta-analysis

8 Steps of a Cochrane review
define the question
plan eligibility criteria
plan methods
search for studies
apply eligibility criteria
collect data
assess studies for risk of bias
analyse and present results
interpret results and draw conclusions
improve and update review
When we have collected the relevant results for all of our included studies, we now need to analyse them – to combine them together and reach overall results for our review. One of the ways we can do that is using meta-analysis.

9 Session outline
principles of meta-analysis
steps in a meta-analysis
presenting your results
See Chapter 9 of the Handbook

10 Study level → Review level
[Diagram: each included study (A, B, C, D) contributes outcome data, which is converted to an effect measure; the effect measures are then combined at the review level.]
As mentioned in the presentations on Dichotomous and Continuous data, we have until now been mainly focused on collecting appropriate data and effect estimates for each included study. With this presentation, we're now looking at the review level – how to bring the results of our collection of included studies together, and one of the ways we can do that is meta-analysis.
Source: Jo McKenzie & Miranda Cumpston

11 What is a meta-analysis?
combines the results from two or more studies
estimates an 'average' or 'common' effect
optional part of a systematic review
[Venn diagram: systematic reviews and meta-analyses overlap]
The terms 'systematic review' and 'meta-analysis' are often used interchangeably, but they are not the same. Meta-analysis is the term used for the statistical method of combining the results of more than one study, to find the average or common effect across those studies. A systematic review, bringing together all the relevant studies to answer a particular question, can synthesise the results with or without a meta-analysis – for example by presenting a narrative synthesis in the text – or it may not find enough of the right kind of data to perform a meta-analysis. Equally, a meta-analysis can be presented with or without a systematic review: you can meta-analyse any group of studies, which may not represent a systematic, critically appraised, comprehensive view of the literature.
Source: Julian Higgins

12 Why perform a meta-analysis?
quantify treatment effects and their uncertainty
increase power
increase precision
explore differences between studies
settle controversies from conflicting studies
generate new hypotheses
There are a number of reasons why we might want to perform a meta-analysis in our review. First, it's useful to have a quantitative answer as to how effective our intervention is, and how uncertain we are about the results. Bringing together the results of multiple studies gives us advantages – by combining samples we increase our power to detect differences, and increase the precision of our answer. We can also do things that a single study can't, no matter how well conducted – we can explore the differences between the individual studies, giving us more answers about the way the intervention works in different variations, or in different populations and contexts. If the individual studies are giving us conflicting answers, a meta-analysis may settle the controversy by giving us an overall answer, although sometimes controversies can be hard to settle. We can also identify new ideas and hypotheses to be tested by future studies.
Source: Julian Higgins

13 When not to do a meta-analysis
mixing apples with oranges
each included study must address the same question
consider comparison and outcomes
requires your subjective judgement
combining a broad mix of studies answers broad questions
the answer may be meaningless, and genuine effects may be obscured, if studies are too diverse
But it's not always a good idea to perform a meta-analysis, and there are some situations where we should not. The first of these situations is where we are mixing apples with oranges – when the studies are too different from each other, and it would not make sense to combine their results. Before we combine the results of multiple studies, we need to be confident that they are comparing the same interventions, and measuring the same outcomes. You'll need to use your judgement to decide whether this is the case, and refer back to the objective of your review. In some cases, it might make sense to combine a broad range of studies. If your objective is to investigate the impact of exercise programs compared to no exercise, then you might be happy to combine studies using many different kinds of exercise programs, and you would get a broad answer about their effectiveness. On the other hand, this would not answer questions about the difference between swimming and jogging, or between self-managed exercise and exercise with a physiotherapist, or between short and long exercise programs, and the overall answer you get may be too broad to be useful in predicting the effect of any particular exercise program. If that's what you want to do, you might decide to break up your review into several separate meta-analyses. It's up to you to make those judgements.
Source: Julian Higgins

14 When not to do a meta-analysis
garbage in – garbage out
a meta-analysis is only as good as the studies in it
if included studies are biased: the meta-analysis result will also be incorrect, and will be given more credibility and a narrower confidence interval
if serious reporting biases are present: an unrepresentative set of studies may give a misleading result
The second reason why we may not want to do a meta-analysis is if the studies are too unreliable – if their risk of bias is too high for us to be confident that they are telling us the truth. A meta-analysis is only as good as the studies in it – as we say, 'garbage in, garbage out'. If the studies are biased, then the result of the meta-analysis may also be wrong. Even worse, meta-analysing biased results will increase their precision, narrowing the confidence intervals and increasing people's confidence in the result, and it will give the results more credibility by labelling them as a systematic review. It's also important to consider whether your studies are a true reflection of the research in a field. Reporting bias or publication bias might mean that we have an unrepresentative sample that exaggerates the true intervention effect. If you suspect that your review is suffering from this problem, it may be best not to present the meta-analysed result.
Source: Julian Higgins

15 When can you do a meta-analysis?
more than one study has measured an effect
the studies are sufficiently similar to produce a meaningful and useful result
the outcome has been measured in similar ways
data are available in a format we can use
However, if you are confident that you have a group of studies that are sufficiently comparable, and they are sufficiently reliable, then we can go ahead and do a meta-analysis. To do this, we need to have at least two studies measuring the same thing in a similar way, and we need the data in a format we can use, e.g. for dichotomous outcomes the number of events and the number of people in each group, and for continuous outcomes the mean, SD and number of people in each group.

16 Session outline
principles of meta-analysis
steps in a meta-analysis
presenting your results

17 Steps in a meta-analysis
identify comparisons to be made
identify outcomes to be reported and statistics to be used
collect data from each relevant study
combine the results to obtain the summary of effect
explore differences between the studies
interpret the results
To generate a meta-analysis we follow these steps. We begin by identifying the comparison to be made, and then the outcome to be measured, and the appropriate statistics to measure the effect. When we've decided what we're measuring, we collect the relevant data from each study, and combine the results together. We can then explore the differences between the studies, before reaching our final interpretation of the result and describing it in our review. We'll be looking at all these steps in more detail.

18 Selecting comparisons
Hypothetical review: Caffeine for daytime drowsiness – caffeinated coffee vs decaffeinated coffee
break your topic down into pair-wise comparisons
each review may have one or many
use your judgement to decide what to group together, and what should be a separate comparison
The first step is to identify the comparisons in your review. In a Cochrane review, we always have to break down our question into pairwise comparisons – one thing compared against another, e.g. intervention vs placebo, or intervention A vs intervention B. That way we can compare the two results against each other, and test which intervention is most effective. Depending on your objective, your review may have a single comparison, or it may have many. You may be comparing one specific intervention to another specific intervention – which is one comparison. If you are looking at a number of different interventions for a condition, each different intervention might become a separate comparison. In our hypothetical example, the review of caffeine for daytime drowsiness, we may have a collection of studies comparing ordinary coffee with decaffeinated coffee, but our review includes any studies of caffeine, so we may have other comparisons as well. We might have some studies comparing coffee vs tea, or tea vs placebo, or Red Bull vs coffee. We might also decide that the effect of caffeine in children should be treated as a separate comparison to the effect in adults. Although our review is interested in all those things, we need to break them down and look at them systematically, two at a time. Your judgement will be needed here to decide what your comparisons are – you don't need to take every tiny difference in your studies and make a separate comparison. We will still be exploring the important differences between the studies within each comparison. The purpose of selecting comparisons is to say "this group of studies is similar enough that it makes sense to analyse their results together".

19 Selecting outcomes & effect measures
Hypothetical review: Caffeine for daytime drowsiness – caffeinated coffee vs decaffeinated coffee
asleep at end of trial (RR)
irritability (MD/SMD)
headaches (RR)
for each comparison, select outcomes
for each outcome, select an effect measure
may depend on the available data from included studies
Once you have identified your comparisons, you can then select the outcomes you will be measuring to decide which of the interventions is the most effective. These outcomes should be those you identified at the protocol stage, although you may add additional outcomes that have been identified as important during the course of the review. For each outcome, you'll also need to identify the effect measure you will use to report the results. For example, in our review of caffeine vs decaf, our first outcome might be the number of people who fell asleep during the study – this is a dichotomous outcome, and based on our protocol, we have planned to report dichotomous outcomes using RR. Our next outcome, irritability, is a continuous outcome measured on a scale. We planned to report continuous outcomes using MD, unless we have studies measuring irritability on different scales, in which case we may need to use SMD. Your decisions about how to analyse and report the results may depend on the data you have in your included studies. It may help you to map out which studies reported results for each outcome, and how they have reported it, as we discussed in the presentation on collecting data.
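To make these effect measures concrete, here is a minimal Python sketch of how RR, MD and SMD are calculated from raw study data. The dichotomous counts are the Piazza-Allerta 2003 headache data from the table later in this presentation; the continuous values are hypothetical, and note that RevMan applies a small-sample correction (Hedges' g) that this basic SMD omits.

```python
import math

def risk_ratio(events_int, n_int, events_ctrl, n_ctrl):
    """Risk ratio for a dichotomous outcome: ratio of the two group risks."""
    return (events_int / n_int) / (events_ctrl / n_ctrl)

def mean_difference(mean_int, mean_ctrl):
    """Mean difference for a continuous outcome measured on a common scale."""
    return mean_int - mean_ctrl

def standardised_mean_difference(mean_int, sd_int, n_int,
                                 mean_ctrl, sd_ctrl, n_ctrl):
    """Basic SMD (Cohen's d): mean difference divided by the pooled SD."""
    pooled_sd = math.sqrt(((n_int - 1) * sd_int**2 + (n_ctrl - 1) * sd_ctrl**2)
                          / (n_int + n_ctrl - 2))
    return (mean_int - mean_ctrl) / pooled_sd

print(risk_ratio(8, 35, 6, 37))        # headaches, Piazza-Allerta 2003: ~1.41
# Hypothetical irritability scores (0-10 scale), not from the review:
print(standardised_mean_difference(4.2, 1.1, 35, 5.0, 1.4, 37))  # ~-0.63
```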

20 Calculating the summary result
collect a summary statistic from each contributing study
how do we bring them together?
treat as one big study – add intervention & control data? breaks randomisation, will give the wrong answer
simple average? weights all studies equally – some studies are closer to the truth
weighted average
So, starting with the first outcome in our first comparison, we need to combine the results from our set of studies together.
ASK: How do we bring the results of several studies together?
CLICK: Particularly for dichotomous data, we could add the events and sample sizes from each study together, and then compare the groups as if they were part of one big study.
CLICK: This is not correct – in effect, we are comparing the intervention data from one study with the control data from other studies, which is not a randomised comparison, and this can change the answer we get.
CLICK: We could simply take the average of all the study results.
CLICK: But this ignores the fact that some studies are contributing more information than others.
CLICK: The way we combine results is using a weighted average.
ASK: How do you think we might weight the studies? It would be nice to weight by their risk of bias, but unfortunately we don't have the empirical information to calculate weights on that basis.

21 Weighting studies
more weight to the studies which give more information
more participants, more events, narrower confidence interval
calculated using the effect estimate and its variance
inverse-variance method: pooled effect = Σ(wᵢ × effectᵢ) / Σwᵢ, where wᵢ = 1/varianceᵢ
We want to give the most weight to the studies that give us the most information about the effect – the most precise estimate of the difference between the two groups. Usually that means the studies that have more participants, or more events of interest for dichotomous data, or more precise estimates of the mean for continuous data, should have the most weight.
To weight studies this way, we need two things: an effect estimate for each study, and a measure of its precision or uncertainty. A good way to summarise this precision is the variance of the effect estimate. The variance is the square of the standard error – a high variance means a very imprecise or uncertain study, and a low variance means a more precise study that we want to give more weight to. For dichotomous data, RevMan can calculate the variance from the raw data about numbers of events and people. For continuous data, RevMan can use the number of people and the standard deviations we entered. Alternatively, we can enter the effect estimate and a measure of variance directly – such as when a study reports an effect estimate without the separate data for each group.
We use these numbers in what's known as the inverse-variance method of meta-analysis. The weight of each study is the inverse of its variance – studies with a low variance get the most weight, and studies with a high variance get the least weight.
Note that this has some implications for the kind of studies that are likely to get greater weight. For example, if we're measuring a continuous outcome, we enter the standard deviation into RevMan, which is used to calculate the variance. Pragmatic studies with broader inclusion criteria are likely to have more variation from participant to participant, and therefore a higher standard deviation. This will mean they get relatively lower weight in a meta-analysis than tightly-controlled studies. Similarly, studies with a longer follow-up period are likely to have higher standard deviations.
We multiply the result of each study by its weight, add them all together, and divide the result by the total of the weights to get the combined, meta-analysed result. You don't need to calculate the weights or do these multiplications yourself – RevMan will calculate the weights and combine the results for you, but it's important that you understand how the weights come about when you see them in your results.
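As a minimal sketch of the inverse-variance method just described (not RevMan's exact implementation): each study's weight is the reciprocal of its variance, and the pooled result is the weighted average. Note that ratio measures such as RR or OR are pooled on the log scale.

```python
import math

def inverse_variance_pool(effects, variances):
    """Fixed-effect inverse-variance meta-analysis.

    effects:   per-study effect estimates (e.g. log risk ratios)
    variances: squared standard errors of those estimates
    Returns the pooled estimate and its standard error.
    """
    weights = [1 / v for v in variances]          # precise studies weigh more
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se
```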

22 For example
Headache              Caffeine  Decaf   Weight
Amore-Coffea 2000     2/31      10/34
Deliciozza 2004       10/40     9/40
Mama-Kaffa 1999       12/53     9/61
Morrocona 1998        3/15      1/17
Norscafe 1998         19/68     9/64
Oohlahlazza 1998      4/35      2/37
Piazza-Allerta 2003   8/35      6/37
For example, we have a group of studies here measuring the effect of caffeine compared to decaf, measuring the outcome of headache.
ASK: Which study will have the most weight?
ASK: Which study will have the least weight?

23 For example
Headache              Caffeine  Decaf   Weight
Amore-Coffea 2000     2/31      10/34   6.6%
Deliciozza 2004       10/40     9/40    21.9%
Mama-Kaffa 1999       12/53     9/61    22.2%
Morrocona 1998        3/15      1/17    2.9%
Norscafe 1998         19/68     9/64    26.4%
Oohlahlazza 1998      4/35      2/37    5.1%
Piazza-Allerta 2003   8/35      6/37    14.9%
These are the results as calculated by RevMan. You can see that none of the studies dominates the meta-analysis – no one study is taking most of the weight. Norscafe, with the largest sample size, has the largest weight. Not far behind is Mama-Kaffa, with only a slightly smaller sample. Deliciozza has almost the same weight as Mama-Kaffa, even though it has a smaller sample – but with a very similar event rate, it's giving us a very similar amount of information on the difference between the intervention and the control. Morrocona, with its small sample and few events, has the least weight.
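The weights in this table can be reproduced from the raw counts: they are inverse-variance weights based on the standard variance formula for a log risk ratio. A short sketch, assuming that formula:

```python
import math

# Raw headache data from the table above: (events, total) caffeine, then decaf
studies = {
    "Amore-Coffea 2000":   (2, 31, 10, 34),
    "Deliciozza 2004":     (10, 40, 9, 40),
    "Mama-Kaffa 1999":     (12, 53, 9, 61),
    "Morrocona 1998":      (3, 15, 1, 17),
    "Norscafe 1998":       (19, 68, 9, 64),
    "Oohlahlazza 1998":    (4, 35, 2, 37),
    "Piazza-Allerta 2003": (8, 35, 6, 37),
}

# Variance of each study's log risk ratio, then inverse-variance weights
weights = {}
for name, (a, n1, c, n2) in studies.items():
    var_log_rr = 1/a - 1/n1 + 1/c - 1/n2
    weights[name] = 1 / var_log_rr

total = sum(weights.values())
for name, w in weights.items():
    print(f"{name:22s} {100 * w / total:5.1f}%")
# Matches the table: Norscafe 26.4%, Mama-Kaffa 22.2%, ..., Morrocona 2.9%
```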

24 Meta-analysis options
inverse-variance: for dichotomous or continuous data – a straightforward, general method
Mantel-Haenszel (default): for dichotomous data only – good with few events, which is common in Cochrane reviews; the weighting system depends on the effect measure
Peto: for odds ratios only – good with few events and small effect sizes (OR close to 1)
Although RevMan will do all these calculations for you, you do have some options about the meta-analysis method used. The inverse-variance method, as you've just seen, is a straightforward method that can be used generally in most situations, but there are some slight variations on this method available in RevMan. One of these is the Mantel-Haenszel method, which is actually the default method RevMan uses for dichotomous data. The Mantel-Haenszel method is particularly good for reviews with few events or small studies – which is often the case with Cochrane reviews. For odds ratios, there's also the Peto method. This is a good method if you have few events and small effects, such as an OR close to 1, but you shouldn't use it if that's not the case, as it can be biased.
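For comparison, here is a sketch of the Mantel-Haenszel pooled risk ratio using its standard formula, applied to the caffeine headache data above. This is a simplified illustration – RevMan also computes a confidence interval, which is omitted here.

```python
# a, n1 = events/total with caffeine; c, n2 = events/total with decaf
studies = [
    (2, 31, 10, 34), (10, 40, 9, 40), (12, 53, 9, 61), (3, 15, 1, 17),
    (19, 68, 9, 64), (4, 35, 2, 37), (8, 35, 6, 37),
]

# Mantel-Haenszel pooled RR: each study contributes a numerator and
# denominator term, rather than a separately calculated effect estimate.
numerator = sum(a * n2 / (n1 + n2) for a, n1, c, n2 in studies)
denominator = sum(c * n1 / (n1 + n2) for a, n1, c, n2 in studies)
print(f"Mantel-Haenszel pooled RR = {numerator / denominator:.2f}")  # ~1.31
```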

25 Meta-analysis options
When you create an outcome in RevMan, these are the options you have available to choose from. You can see the choice between the meta-analysis methods. Unless you have a strong preference, or your Review Group has recommended one of these methods, you can leave the default settings in place. For this dichotomous outcome, we can also choose between RR, OR and RD. There's one other important choice to make about your meta-analysis – between fixed-effect and random-effects meta-analysis. We'll come back to that choice in a separate presentation, on Heterogeneity.

26 Session outline
principles of meta-analysis
steps in a meta-analysis
presenting your results

27 A forest of lines
ASK: Does anyone know how we present the results of a meta-analysis? They are presented on a forest plot – so-called because it's said to resemble a forest of lines on the page.
Image: Joyce Kilmer Forest, by charlescleonard

28 Forest plots: Headache at 24 hours
This is what a forest plot looks like. This example is from our caffeine review, reporting the headache outcome.
ASK: Who has seen one of these before? Are you comfortable interpreting a forest plot?
CLICK: Headings at the top of the table tell you what the comparison is – first the intervention, and then the control. In this case, our intervention is caffeinated coffee, and our control is decaffeinated coffee.
headings explain the comparison

29 Forest plots: Headache at 24 hours
On the left is a list of included studies (by first author's name and year of publication, by Cochrane convention).
list of included studies

30 Forest plots: Headache at 24 hours
Individual data are presented for each study – in this case, the number of events and sample size. For a continuous outcome, the mean and SD would be shown with the sample size.
raw data for each study

31 Forest plots: Headache at 24 hours
The total data for all the included studies are also given – in this case, the total number of events and participants in the intervention and control groups.
total data for all studies

32 Forest plots: Headache at 24 hours
The weight assigned to each study in the meta-analysis is given.
weight given to each study

33 Forest plots: Headache at 24 hours
The individual result for each study is given – in this case, the Risk Ratio with a 95% confidence interval. The statistical options chosen are noted at the top. effect estimate for each study, with CI

34 Forest plots: Headache at 24 hours
The individual study results are also presented graphically. The coloured square shows the effect estimate, and the size of the square corresponds to the weight given to the study in the meta-analysis. The horizontal line shows the confidence interval. The vertical line down the middle indicates the line of no effect – in this case, for a ratio, at 1.
ASK: What does it mean if the 95% CI crosses the line of no effect? It means the result is not statistically significant, although there's more to interpreting these results than statistical significance.
effect estimate for each study, with CI

35 Forest plots: Headache at 24 hours
At the bottom of the plot is the scale, which you can adjust in RevMan as needed. Note that for ratios the scale is a log scale: the lowest value a ratio can take is 0, 1 represents no effect, and the highest value it can take is infinity. The data are presented on a log scale to make the scale and the confidence intervals appear symmetrical. For an absolute effect (e.g. RD, MD), the scale is symmetrical, showing positive and negative values around 0 as the point of no effect.
Below the scale is an indication of which side of the plot favours the intervention. This will depend on the outcome you are measuring. The right side of the scale always indicates more events, or a higher score, for the intervention; the left side always indicates fewer events, or a lower score. If you're measuring something good, such as recovery or quality of life, then a result on the right side will be a good outcome for the intervention, because you want an increase in your outcome, and a result on the left side will favour your control. If you're measuring something bad, such as headaches or irritability, then a result on the left side of the scale will indicate a favourable result for your intervention, because you wanted to reduce the outcome, and a result on the right side will favour the control. It's important that you read these labels carefully, and make sure you have them the right way around, depending on whether you're measuring a good or a bad outcome.
scale and direction of benefit

36 Forest plots: Headache at 24 hours
Finally, the pooled result for all the studies combined is presented, both in numbers and graphically. The result is shown graphically as a black diamond. The top and bottom points of the diamond correspond to the overall effect estimate, and the width of the diamond represents the confidence interval. pooled effect estimate for all studies, with CI
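Outside RevMan, a rough forest plot can be drawn in a few lines of matplotlib. This is only an illustrative sketch, using fixed-effect inverse-variance results for the caffeine data above – RevMan generates publication-ready forest plots for you.

```python
import math
import matplotlib.pyplot as plt

# Caffeine vs decaf, headache: (study, events/total caffeine, events/total decaf)
studies = [
    ("Amore-Coffea 2000", 2, 31, 10, 34), ("Deliciozza 2004", 10, 40, 9, 40),
    ("Mama-Kaffa 1999", 12, 53, 9, 61), ("Morrocona 1998", 3, 15, 1, 17),
    ("Norscafe 1998", 19, 68, 9, 64), ("Oohlahlazza 1998", 4, 35, 2, 37),
    ("Piazza-Allerta 2003", 8, 35, 6, 37),
]

# Per-study log risk ratios, standard errors and inverse-variance weights
names, log_rrs, ses, weights = [], [], [], []
for name, a, n1, c, n2 in studies:
    log_rr = math.log((a / n1) / (c / n2))
    se = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)
    names.append(name); log_rrs.append(log_rr); ses.append(se)
    weights.append(1 / se**2)

pooled = sum(w * e for w, e in zip(weights, log_rrs)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

fig, ax = plt.subplots()
ys = range(len(studies), 0, -1)               # first study at the top
for y, lr, se, w in zip(ys, log_rrs, ses, weights):
    lo, hi = math.exp(lr - 1.96 * se), math.exp(lr + 1.96 * se)
    ax.plot([lo, hi], [y, y], color="black")  # CI line
    ax.plot(math.exp(lr), y, "s", color="tab:blue",
            markersize=4 + 10 * w / max(weights))  # square sized by weight
# Pooled result shown as a diamond at the bottom
lo, hi = math.exp(pooled - 1.96 * pooled_se), math.exp(pooled + 1.96 * pooled_se)
mid = math.exp(pooled)
ax.fill([lo, mid, hi, mid], [0, 0.25, 0, -0.25], color="black")
ax.axvline(1, color="grey", linewidth=0.8)    # line of no effect
ax.set_xscale("log")                          # ratios go on a log scale
ax.set_yticks(list(ys)); ax.set_yticklabels(names)
ax.set_xlabel("Risk ratio (log scale): left favours caffeine, right favours decaf")
plt.tight_layout(); plt.show()
```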

37 Interpreting confidence intervals
always present an estimate with a confidence interval
precision: the point estimate is the best guess of the effect; the CI expresses uncertainty – the range of values we can be reasonably sure includes the true effect
significance: if the CI includes the null value, this rarely means evidence of no effect – the effect can neither be confirmed nor refuted by the available evidence
consider what level of change is clinically important
Whenever we present results in a Cochrane review, we need to include a measure of uncertainty, such as a confidence interval. While the point estimate is our best guess of the effect of the intervention, based on the information we have, we need to take into account that next time we take a sample, we might not get the same result. The confidence interval represents the range of values we can be reasonably sure includes the true value of the effect – for a 95% CI, if we repeated the study indefinitely, the CI would include the true effect 95% of the time.
A narrow confidence interval means we have a precise estimate of the effect. A wide confidence interval means less precision, although sometimes we can still be certain enough to make a decision about the intervention – if the CI is wide, but both the top and bottom of the range indicate a beneficial effect, we can go ahead and use the intervention. If the CI is very wide, and it includes conflicting effects (e.g. benefit and harm), then perhaps we don't have enough information to make a decision. For an individual study, larger studies tend to have narrower confidence intervals. For a meta-analysis, more studies will usually mean a narrower CI, although if the study results conflict with each other, more studies may lead to a wider CI.
The CI can also tell us about the statistical significance of the estimate – if the CI includes the line of no effect, then the result is not statistically significant at that level (e.g. a 95% CI corresponds to a P value of 0.05, and a 90% CI to a P value of 0.1). Authors are advised NOT to describe results as 'not statistically significant' or 'non-significant', but to interpret what the results tell us. It's important to be able to tell the difference between 'evidence of no effect' and 'no evidence of effect'. A non-significant result may mean that we don't have enough information to be certain that the intervention works, but if we had some more studies and more results, our precision might increase. Alternatively, if we have lots of studies, and a very precise result sitting right on the line of no effect, then perhaps we can be certain that the intervention has no effect.
It's also important to consider clinical significance – for this outcome, in the context of your question, what level of change would be considered important? e.g. a 10% decrease in risk? A 2 point change on a 10 point pain scale? If the CI shows a range that includes values above and below your clinically important change, then you can't be confident that the effect will be large enough to be important to your patients. If the range also includes the line of no effect, then you can't be certain that the intervention will have any effect, and it may even be harmful. More on interpretation will be covered in a separate presentation.
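For ratio measures, the confidence interval is computed on the log scale and then back-transformed. A small sketch, with hypothetical numbers:

```python
import math

def ratio_ci(rr, se_log_rr, z=1.96):
    """95% CI for a ratio measure, computed on the log scale."""
    log_rr = math.log(rr)
    return math.exp(log_rr - z * se_log_rr), math.exp(log_rr + z * se_log_rr)

# e.g. a pooled RR of 1.3 with SE(log RR) = 0.19 (hypothetical numbers):
lo, hi = ratio_ci(1.3, 0.19)
print(f"RR 1.30, 95% CI {lo:.2f} to {hi:.2f}")  # crosses 1 -> not significant at 0.05
```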

38 Considering clinical significance
In this example, the review is of antibiotics for otitis media, or ear infections, in children. We are measuring the number of children experiencing pain (in this case, as a dichotomous outcome, not a continuous pain scale). The subgroups are according to the time point at which the outcome was measured: in the first subgroup, pain is measured at 24 hours; in the second, at 2-7 days.
ASK: Is this effect clinically important?
Looking at the overall numbers of children in pain in the control groups, given that presumably almost all of them would have been in pain at the start of the trial, almost two thirds had spontaneously recovered without any intervention after 24 hours. After 2-7 days, 78% of children had spontaneously recovered. So the effect observed, while significant, only translates to a few more children without pain in practice. The benefit of antibiotics is relatively limited, and perhaps pain relief might be an effective alternative intervention. This would need to be weighed against the risks of side effects from antibiotics, such as diarrhoea and antibiotic resistance, and the risk of more serious complications of ear infections, such as mastoiditis, which is more common in developing countries than in high-income countries.
Based on Sanders S, Glasziou PP, Del Mar C, Rovers MM. Antibiotics for acute otitis media in children. Cochrane Database of Systematic Reviews 2004, Issue 1.

39 The Results section of your review
a systematic, narrative summary of results
forest plots: key forest plots linked as figures (usually primary outcomes); avoid forest plots with only one study
may also add other data tables: results of single studies; summary data for each group, effect estimates, confidence intervals; non-standard data
not helpful to report trivial outcomes or results at high risk of bias
There's more to your analysis than this, and we'll come back to some tricky types of data, and to exploring and interpreting your results, in separate presentations. First, a few words on how your forest plots fit into the results section of your review.
Your results section should present a systematic, narrative summary of the results. You don't need to repeat all the data in the text, but make sure you summarise the key findings, and that the text makes sense without referring to the plots.
All your forest plots will be included in the online version of the review, and you should make sure that you're not including unnecessary forest plots, such as repetitive variations on the same plot, or forest plots with only one included study, as these just make it more difficult for the reader to navigate through all the information. It may be preferable to include outcome data from single studies in a table rather than presenting multiple single-study forest plots. The complete set of forest plots is treated as supplementary data alongside the online version of your published review. Some printed versions will not include all the forest plots, so you should select a small number of key forest plots, usually relating to your primary outcomes, and link them as figures in the results section – the same way you would for any published paper. These will then be included with any printed version of the review.
Don't forget that you might also have other data to present that wouldn't fit in the forest plots, such as results in formats that did not match the other studies, or results of single studies that were the only ones to report a particular comparison or outcome. These results should not be left out of your review – you need to give a complete and unbiased picture of the evidence. Some results, though, may not be helpful to include. For example, trivial outcomes measured by your included studies but not considered important at the protocol stage do not have to be included (although you might note that they were measured in your 'Characteristics of included studies' table). You may also choose not to report results at high risk of bias. Always be clear and report when you have chosen not to include some results.

40 What to include in the protocol
how will you decide whether a meta-analysis is appropriate?
meta-analysis model to be used
Thinking back to the protocol stage, you'll need to give brief descriptions of your planned analysis. First, you'll need to briefly state that you will consider whether your studies are similar enough to meta-analyse before proceeding. You'll also need to specify the meta-analysis methods you plan to use.

41 Take home message
there are several advantages to performing a meta-analysis, but it is not always possible (or appropriate)
plan your analysis carefully, including comparisons, outcomes and meta-analysis methods
forest plots display the results of meta-analyses graphically
interpret your results with caution

42 Question: How are studies in a meta-analysis combined?
by giving more weight to the studies with larger sample sizes and lower standard deviations, because they are likely to estimate the true intervention effect more accurately
by a simple average, to give each study equal weight
by giving less weight to studies with lower variance in treatment effect
by giving more weight to studies with higher variance in treatment effect

43 Answer: Studies in a meta-analysis are combined
by giving more weight to the studies with larger sample sizes and lower standard deviations, because they are likely to estimate the true intervention effect more accurately
CORRECT – we want to give the most weight to the studies that give us the most information about the effect: the most precise estimate of the difference between the two groups. Usually that means the studies that have more participants, more events of interest for dichotomous data, or more precise estimates of the mean for continuous data, should have the most weight.
by a simple average, to give each study equal weight
INCORRECT – a simple average ignores the fact that some studies contribute more information than others (i.e. some studies provide more precise estimates of the true effect).
by giving less weight to studies with lower variance in treatment effect
INCORRECT – studies with lower variance in the treatment effect (e.g. smaller standard deviations or narrower 95% CIs) provide a more precise or certain estimate of the true effect, so should receive relatively MORE weight in the meta-analysis.
by giving more weight to studies with higher variance in treatment effect
INCORRECT – studies with higher variance in the treatment effect (e.g. larger standard deviations or wider 95% CIs) provide a less precise or more uncertain estimate of the true effect, and so should receive relatively LESS weight in the meta-analysis.

44 Collecting data

45 Steps of a systematic review
define the question
plan eligibility criteria
plan methods
search for studies
apply eligibility criteria
collect data
assess studies for risk of bias
analyse and present results
interpret results and draw conclusions
improve and update review
Once you have identified your list of included studies, the next thing to do is to collect the data you will need for your review from each study. This is a necessary step before we can go on to assessing the risk of bias of those studies or analysing their results.

46 Outline
data to be collected
putting it into practice
See Chapter 7 of the Handbook

47 What data should you collect?
comprehensive information about each study:
population and setting (e.g. age, race, sex, socioeconomic details, disease status, duration, severity, comorbidities)
interventions and integrity of delivery
methods and potential sources of bias
outcomes, authors' conclusions
citation, author contact details
sources of funding, etc.
information required for: references; description of included studies; risk of bias assessment; analyses
ASK: What kind of data do you think we need to collect about each study?
You will need to collect a wide range of information about each study: everything you will want to report and analyse in your review, and everything your readers will want to know about your included studies.
First, you will need to describe your included studies in detail – so you'll need to collect information on the population, setting and intervention, remembering those factors and variations in the population and intervention that you think might have an impact on the results of the study, and that you specified in your protocol that you would like to investigate. The readers of your review will want information detailed enough to help them decide whether to apply the results in their context – information such as socioeconomic or cultural details might have a big impact on whether an intervention is feasible in different settings. Interventions should be described in enough detail to allow them to be replicated in practice. Integrity of delivery, or compliance/fidelity, can help you understand whether incomplete implementation may explain poor findings, and can also highlight feasibility difficulties for future users.
You will need to collect detailed information about how the studies were conducted for your risk of bias assessment. Then, you will need to collect detailed information about the outcomes – we'll come back to this in more detail later. You should focus on the outcomes you planned to report in your protocol, but you may wish to report a complete list of the outcomes measured by each study for your readers, or perhaps to identify important outcomes that you did not consider at the protocol stage. Although you will be conducting your own analysis, it's also helpful to collect the study authors' conclusions – these can be useful to double-check your own findings later. Other items of interest include bibliographic information, contact details of the authors, sources of funding, the trial registration number, etc. You may have additional things you'd like to collect relevant to your question.
HANDOUT: Data collection: what items to consider?

48 Collecting outcome data
focus on those outcomes specified in your protocol, but be open to unexpected findings, e.g. adverse effects
measures used: definition (e.g. diagnostic criteria, threshold), timing, unit of measurement
for scales: upper and lower limits, direction of benefit, modifications, validation, minimally important difference
numerical results may be in many formats – conversion may be required; collect whatever is available
no. of participants for each outcome & time point
Results should only be reported and analysed for those outcomes specified in your protocol, although you should be aware of any important and unexpected outcomes, e.g. serious adverse effects. Remember to clearly identify any outcomes that weren't pre-specified in your protocol. Also, if a study with more than two intervention arms is included, extract data only for the intervention and control groups that meet the eligibility criteria (but make it clear in the 'Characteristics of included studies' table that these intervention groups were present in the study).
Quite a lot of detail is needed for good reporting of outcomes. Aside from the results themselves, you should collect information about the definitions used for each outcome, such as diagnostic criteria or thresholds for definitions such as 'improved' or 'high vs low' levels of any measure, as these might vary from study to study. Collect details on the timing of each measure – each study is likely to report measures at several time points. The unit of measurement is also very important.
When a study reports results measured on a scale, be as detailed as you can. Report the upper and lower limits of the scale – is it a 0-10 pain scale, or a 1-10 pain scale? For some complex scales, such as quality of life or function, it may not be possible to score 0. Report the direction of benefit – does a higher score mean better quality of life, or worse? This may not be the same for every scale used in your included studies. Is the study using a complete scale, or a modified version or subscale? Has the scale been validated? What is the minimally important difference on the scale – that is, what size of difference on the scale is big enough to be detectable and important to study participants? e.g. on a 10 point pain scale, changes of less than 1.5 points may not be considered large enough to be important to patients. This level of detail is important not just for your own benefit – your readers may not be experts in this field, and may not be familiar with the scales used.
There are many ways to report statistics and numerical outcomes, and you should expect to find some variation among your included studies. RevMan requires results in specific formats for analysis – you may need to do some simple calculations to get the results from your included studies into the right format, or to make results from one study comparable to the others. We'll look in detail at how to present and analyse your results in a separate presentation. For now, the best strategy when collecting data is to collect whatever outcome measures are reported in the study – that way you have the most comprehensive picture possible, and you can compare the results reported across all your included studies before selecting the best approach to take.
Participant numbers will change throughout the study – keep track of how many people were still in each group and were measured for each outcome at each time point. These changes affect your analysis, and will also help you assess the risk of bias for each study.

49 Data in many formats
This is an example of data collected for a real review, measuring blood transfusion as an outcome:

Outcome                                     Reported as                                   Trials
Volume transfused (mls)                     Mean and SEM                                  4
                                            Mean and SD                                   2
                                            Mean and something in brackets                1
                                            Median and something in brackets
                                            Two unlabelled numbers, e.g. x(y)
                                            Bar chart showing mean per person per day
Units transfused                            Mean only
                                            Total in each group
Volume adjusted for patient mass (mls/kg)
Patients who had a transfusion              Number of patients                            3
Not reported

As you can see, almost every study reported the data in a different way. Some reported the outcome as the volume of blood transfused, some as the number of standard units transfused, some as the volume of blood adjusted for the weight of the patient, and some simply as the number of patients who had any blood transfused. Some studies may have reported the result in more than one way, appearing more than once in this table.
ASK: How would you manage a set of data like this?
Within each of those outcome definitions, the numbers reported also differed. Most reported means, but one reported medians. Some reported standard errors, some standard deviations. Quite a few did not label the numbers they reported, making it very difficult to interpret. One study only reported the results in a graph, so that the systematic review author would have to measure the graph to work out the numerical results.
The data aren't always presented in a way that is useful for meta-analysis. Unless we can get more data from the authors, it's difficult to summarise all these studies together, as they're not all quite measuring the same thing. There are some choices we can make as authors: where studies report results in more than one way, we can choose the most common measure. We can do some conversion of results into more useful formats – we'll address that in more detail in a separate presentation (Continuous outcomes). We should certainly contact the authors to clarify what they're reporting, and see if we can get more detailed information that may help us report consistent data from each study. We won't have a clear picture of all the data we have, how compatible they are, and the best way to manage them, until after we have completed our data collection from all our included studies, and collected all this detail.
Source: Phil Wiffen
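One conversion that comes up often with data like these is turning a reported SEM back into the SD that RevMan needs, using SD = SEM × √n. A small sketch with hypothetical numbers:

```python
import math

def sd_from_sem(sem, n):
    """Convert a standard error of the mean back to a standard deviation."""
    return sem * math.sqrt(n)

# e.g. a study reporting mean volume 310 mls (SEM 12) with n = 25 (hypothetical):
print(sd_from_sem(12, 25))  # 60.0 -> enter mean 310, SD 60, n 25 into RevMan
```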

50 Outline
data to be collected
putting it into practice

51 Data collection forms
organise all the information you need
remind you what to collect
record what was not reported in the study
record the decisions you make about each study
source document for data entry into your review
must be tailored to your review – adapt from a good example
paper or electronic – your choice
A data collection form is a crucial tool to help you organise the collection of all this information. Once you have decided what you need to collect, creating a form will help you systematically look for and collect the data. The form will keep a record of what you found, in an organised way, and also of what was not reported in the study. The form will then act as the main source document for your review, saving you from repeatedly going back to the study reports to sift through the various sections, wondering where the piece of information you are looking for might be found.
Every review is unique, so you will need to create your own form for your review. Some examples are available that you may wish to adapt – ask your Review Group if they have a model to work from. You can set up the form however you prefer – on paper, in a Word document, an Excel spreadsheet, a PDF form, a database – depending on how and where you like to work. Paper forms can be taken anywhere, such as into the back garden or on the train; they are easy to create, and it is easy to compare forms completed by different authors. Electronic forms can be more advanced, perhaps including simple data calculations or guided steps. They allow you to copy and paste data into RevMan when you are finished, they save paper, and you don't have to rely on reading messy handwriting.
Note that your CRG may ask you to submit data extraction forms for checking. You might also need them to check your own data entry at later stages of the review, or even after publication, so make sure you keep them.

52 Hints and tips
plan what you need to collect – not too much or too little
consider including:
review title
name of author completing the form
Study ID (and Record ID if there are multiple reports of a study)
plenty of space for notes (at the beginning and throughout)
eligibility criteria at the beginning
source of each piece of information (e.g. page no.)
tick boxes or coded options to save time
'not reported' and 'unclear' options
format to match RevMan data entry
provide instructions for all authors
Some suggestions about what to include in your data collection form. Make sure you have planned the data you need to collect – include everything you will need later on, but make sure you are not collecting too much information that you don't need. Begin your form with the review title and the name of the author completing the form, and then a clear identification of the study to which the data relate. If there is more than one publication for a study, clearly identify which report the data come from, although you may also choose to record information from all publications on a single form. Include plenty of space for notes, including on the front page for notes to yourself that you need to remember, such as missing data you need to follow up. If you wish, you can incorporate your eligibility criteria at the beginning of the form, combining the selection and data collection processes.
For each item you are collecting, note where in the report it was found (e.g. page numbers), consider whether you can use tick boxes or lists of pre-specified categories to save time, and make sure you include 'not reported' and 'unclear' as well as 'yes' and 'no' or other available options. Weeks and months after you complete the form, it will be difficult to remember the details of what was found, and keeping track of where information was not available, or where some information was reported but not enough to classify the study, will save you time re-reading the papers later.
If possible, format your data tables to match the required data entry in RevMan (which we will address in a separate presentation). This will save time if you are able to cut and paste from an electronic form, but even if you are using a paper form it will help prevent data entry errors, such as entering the intervention group data into the control group column.
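If you use an electronic form, it can help to lay out the results fields in the same order as RevMan's data-entry screen. A hypothetical column layout, sketched as a CSV written from Python (the field names are illustrative, not a standard):

```python
import csv

# Hypothetical columns for the results section of a data collection form,
# ordered to match RevMan's entry for dichotomous outcomes
# (intervention group first, then control).
FIELDS = [
    "study_id", "report_id", "outcome", "time_point", "page_no",
    "events_intervention", "n_intervention", "events_control", "n_control",
    "notes",  # e.g. 'not reported', 'unclear - emailed author'
]

with open("outcome_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow({
        "study_id": "Norscafe 1998", "report_id": "1", "outcome": "headache",
        "time_point": "24 h", "page_no": "p. 4, Table 2",
        "events_intervention": 19, "n_intervention": 68,
        "events_control": 9, "n_control": 64, "notes": "",
    })
```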

53 Example of a completed data collection form
This is an example of a data collection form completed on paper – this one is four pages long, and includes the eligibility criteria on the first page, and a combination of checkboxes (yes/no/unclear, or other coded options as appropriate) and space for notes about each aspect of the study. At the end is a table for data entry. You can see there are plenty of notes and corrections, which gives a history of the decisions made.
Source: Miranda Cumpston

54 Minimising bias in data collection
two authors should independently collect study characteristics and outcome data
reduces error
checks agreement on subjective judgements and interpretations
resolving disagreements: can usually be resolved by discussion; if not, refer to a third author
pilot the data collection process: include each person assisting; check criteria are consistently applied; may need to revise the form or instructions
To ensure your data collection is accurate, and to reduce the risk of bias, particularly where you are making subjective judgements and interpreting the data, it's important to have data collected independently by two authors. It may also be helpful to have authors from different perspectives collecting data, e.g. a content expert and a methods expert. Wherever you find disagreements between the data collected by two authors, these should be resolved by discussion, to identify whether they arise from a data entry error or a more substantive disagreement. A third author may also be used to resolve any disagreements.
You should plan to pilot your data collection form on a small number of studies before continuing with complete data collection – this will help you identify any practical improvements to the way the form is set out, and ensure the forms are being used consistently by each author. You may need to discuss and revise your guidance to authors, or the form itself.
[Note if asked: It is possible to conduct the data collection process with blinding, e.g. by editing copies of the articles to remove information about authors, location and journals. This is not necessary for Cochrane reviews.]
[Note if asked: Agreement can be measured using the kappa statistic, but this is not required (see the Handbook for the method of calculation).]
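For the optional kappa statistic mentioned in the note above, here is a minimal sketch of Cohen's kappa for two authors making a yes/no judgement; the counts are hypothetical.

```python
def cohens_kappa(yes_yes, yes_no, no_yes, no_no):
    """Chance-corrected agreement between two raters on a yes/no judgement."""
    n = yes_yes + yes_no + no_yes + no_no
    p_observed = (yes_yes + no_no) / n
    # Agreement expected by chance, from each author's marginal totals
    p_yes_1, p_yes_2 = (yes_yes + yes_no) / n, (yes_yes + no_yes) / n
    p_expected = p_yes_1 * p_yes_2 + (1 - p_yes_1) * (1 - p_yes_2)
    return (p_observed - p_expected) / (1 - p_expected)

print(cohens_kappa(20, 3, 2, 15))  # ~0.75, i.e. good agreement
```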

55 Contacting study authors
to obtain unreported data or confirm unclear data, e.g. unclear risk of bias, missing outcomes, missing SDs
finding contact details: check the study reports; check PubMed for recent publications; check Google for current staff profiles
save all your queries for one request
be clear about what you need
avoid sounding critical
ask for descriptions rather than yes/no answers
providing a table to complete may be helpful
If you find that information is missing from the study reports, or if anything is unclear, it's always worth attempting to contact the study authors – even for very old studies. You won't hear back from everyone you contact, but many authors are very happy to share additional information about their work. Authors' addresses are often included in published papers these days. For older papers, or where the details aren't available, it's usually fairly easy to find current details: look up the authors' more recent papers in PubMed, which may include contact details, or use Google to find the authors' current staff pages at their home institutions. Don't forget to check the second and third authors as well as the first.
When contacting authors for information, whether about the description of the study, methods details for risk of bias assessment, or additional outcome data, be as polite as you can. Try to save all your questions for one request – don't contact authors repeatedly with extra requests. Be clear about what you need – sometimes a table can help clarify the statistics you're after. When asking for descriptive information about the study, open-ended questions may be more helpful – asking "Can you describe how you managed blinding in this study?" is much more likely to get a detailed, helpful answer than asking "Did you use blinding?". Authors may become defensive and be less likely to respond if your questions sound as if you are being critical of their work, and one study has shown that authors are likely to give overly positive answers to direct yes/no questions (Haahr 2006, Handbook 8.3.4).

56 Managing data
can enter data directly from the form into RevMan
may need to consider intermediate steps, e.g. an Excel spreadsheet
group studies by comparison & outcomes measured
calculations to convert into required statistics
don't forget to check your final results against those reported in the study
When you have your data collected and organised, you can begin entering data into RevMan, or an alternative meta-analysis program. For results data, it may be helpful to set up a spreadsheet or table as an intermediate step, to collect together the data reported for each outcome from each study. This will give you an overview of the data available, and can help you make final decisions about how best to analyse the results. As mentioned earlier, you may need to do some simple calculations with data from some or all studies to make them compatible with each other and with RevMan; this is easiest to do with a spreadsheet (avoiding errors that can occur when making conversions using a calculator or by hand). Having an overall picture of the results will also help ensure that you don't forget about any reported results that are not compatible with the majority of studies and cannot be added to your analysis in RevMan – these results may need to be reported in the text or in an additional table in the review.
We'll talk about the different kinds of analysis you might undertake in separate presentations, but regardless of these calculations, don't forget to go back and check your results against the findings of the original study. Are the direction and size of effect comparable? If not, can you explain why? This is an important check against data entry and analysis errors before your review goes for publication.

57 Author support tools
Covidence: http://www.covidence.org/reviews
EPPI-Reviewer
Covidence is free at the moment and works well for simple reviews. EPPI-Reviewer offers a free one-month trial, and is apparently better for complex reviews. I have not used EPPI-Reviewer, but I have played with Covidence – you can use it for screening (good if you are doing double screening), and you can set up comparisons. It is compatible with RevMan. I have not used these tools enough to make a recommendation, and I am not sure how useful they are if you happen to be doing a single-author review – just letting you know there are various tools around.

58 What to include in your protocol
data categories to be collected
whether two authors will independently extract data
piloting and use of instructions for data collection form
how disagreements will be managed
processes for managing missing data
Thinking back to the protocol stage, you should describe your data collection process. Include a brief description of the categories of data you will collect, whether two authors will independently extract data, whether a form will be used and whether you will pilot it first, how you will resolve disagreements, and how you plan to manage any missing data, e.g. by contacting the study authors.

59 Take home message
think carefully about the data you wish to collect
design and pilot a data collection form
data collection should be done independently by two authors to minimise error and bias

60 Some studies that I like to quote

61 Analysing dichotomous outcomes
To begin looking at analysis of outcomes in your review, we’re going to start with one of the most common outcome types that you may come across – dichotomous outcomes.

62 Steps of a systematic review
define the question plan eligibility criteria plan methods search for studies apply eligibility criteria collect data assess studies for risk of bias analyse and present results interpret results and draw conclusions improve and update review

63 Study level ↓ Review level ↓
[Diagram: for each included study (A, B, C, D), outcome data are converted into an effect measure; the study-level effect measures are then combined at the review level.]
When we've collected the data from a study, we need to organise it and make sense of it. Each study measures outcomes among its individual participants, and reports those outcomes in various ways. For each study in your review, you will select an effect measure – a statistic to represent the effect of the intervention on the people in that study, such as an odds ratio. It might be the effect measure reported in the study itself, or you might choose a different effect measure that you think is more useful. [Click] We have to do this for each included study in your review – if possible, so that we have a comparable set of effect measures for each study. [Click] Ultimately, our aim will be to synthesise those results across all the studies, and report an overall effect for the intervention. We'll come back to that stage of the process later. For now, we will look at some of the different types of outcome data you are likely to find at the study level, and how we can select and calculate appropriate effect measures. We're going to start by talking about the analysis of dichotomous outcomes.
Source: Jo McKenzie & Miranda Cumpston

64 Session outline
expressing chance: risk and odds
effect measures for comparing groups
choosing an effect measure
collecting data for dichotomous outcomes
See Chapters 7 & 9 of the Handbook

65 What are dichotomous outcomes?
when the outcome for every participant is one of two possibilities or events
alive or dead
healed or not healed
pregnant or not pregnant
Truly dichotomous data are outcomes for which every participant is in one of two possible groups – either with the event of interest or without, yes or no.

66 What were the chances of that?
Risk and odds
express chance in numbers
for dichotomous outcomes, express the chance within a group of being in one of two states
particular statistical meanings, calculated differently
In general conversation the terms 'risk' and 'odds' are used interchangeably (as are the terms 'chance', 'probability' and 'likelihood'). In statistics, however, risk and odds have particular meanings and are calculated in different ways, although they are both ways of expressing the chance of being in one of two groups – with the dichotomous outcome, or without. It's important to be clear about the difference between risk and odds, and which one you're using.

67 Risk
24 people drank coffee, 6 developed a headache
risk of a headache = 6 headaches / 24 people who could have had one = 6/24 = ¼ = 0.25 = 25%
risk = no. participants with event of interest / total no. participants
Risk is the concept more familiar to patients and health professionals. Risk describes the probability with which a health outcome or event will occur. Risk is very easy to calculate: in this case, we have a group of 24 people who drank coffee, and 6 of them had a headache. So we take the number of people with a headache, and divide it by the total number of people – in this case, 6 out of 24, or 25%. Risk is often expressed as a percentage, but you can also express it as a decimal (0.25 in this case), or as a number out of 100 or 1,000 people.

68 Odds
24 people drank coffee, 6 developed a headache
odds of a headache = 6 headaches / 18 without headaches = 6/18 = 1/3 = 0.33 = 1:3 (not usually as %)
odds = no. participants with event of interest / no. participants without event of interest
Odds is slightly different, but just as easy to calculate. It's probably a concept that is more familiar to gamblers. To calculate the odds, we take the number of people with a headache, and divide it not by the total number, but by the number of people without a headache. So in this case, 6 people with a headache, divided by 18 people without a headache, or 0.33. Odds are not usually expressed as a percentage, but you can express them as a ratio – in this case 1:3.
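To make the two formulas concrete, here is a minimal Python sketch (not from the original slides; the function names are my own) that reproduces the coffee example:

    def risk(events, total):
        """Risk: participants with the event, divided by all participants."""
        return events / total

    def odds(events, total):
        """Odds: participants with the event, divided by those without it."""
        return events / (total - events)

    # 24 people drank coffee, 6 developed a headache
    print(risk(6, 24))  # 0.25, i.e. 25%
    print(odds(6, 24))  # 0.333..., i.e. 1:3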

69 Do risks and odds differ much?
Two examples from caffeine trials
5 people with 'headaches' out of 65: chance of having a headache – risk = 5/65 = 0.077, odds = 5/60 = 0.083
130 people 'still awake' out of 165: chance of still being awake – risk = 130/165 = 0.79, odds = 130/35 = 3.71
It's important to be clear about whether you're using risk or odds, because they are not always the same. As you can see from the first example here, when the event is rare (5 out of 65 people with the event, in this case a headache), the risk and odds will be very similar – risk is 0.077 and odds are 0.083. [CLICK] However, when the event is common, such as 130 people out of 165 'still awake' at the end of the trial, risk and odds will be very different. In this case, risk is 0.79, and odds are 3.71.
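Rerunning the same two calculations on these trial examples shows how risk and odds agree for rare events and diverge for common ones (a sketch, with values rounded as on the slide):

    def risk(events, total):
        return events / total

    def odds(events, total):
        return events / (total - events)

    # Rare event: 5 of 65 with headaches – risk and odds nearly identical
    print(risk(5, 65), odds(5, 65))        # ~0.077 vs ~0.083

    # Common event: 130 of 165 still awake – risk and odds very different
    print(risk(130, 165), odds(130, 165))  # ~0.79 vs ~3.71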

70 Overview
expressing chance: risk and odds
effect measures for comparing groups
choosing an effect measure
collecting data for dichotomous outcomes
So, we can calculate risk and odds in a group of people, some of whom experience an event. In a trial, what we usually want to do is compare the chances of an event between two groups (such as the intervention and control groups), to decide which of the two groups has a better outcome. For that, we need to use effect measures that compare the risk or odds between two groups.

71 Comparing two groups

              Headache   No headache   Total
    Caffeine     17          51          68
    Decaf         9          55          64
    Total        26         106         132

All the information you need to calculate measures for dichotomous outcomes can be expressed in a 2x2 table. ASK: Has anyone seen this kind of table before? ASK: Can anyone suggest any of the effect measures we might use to compare two groups with each other?

72 Comparing two groups: effect measures
risk ratio (RR) (relative risk)
odds ratio (OR)
risk difference (RD) (absolute risk reduction)
all estimates are uncertain, and should be presented with a confidence interval
There are three main effect measures we use to express the difference between two groups for a dichotomous outcome: risk ratios (also known as relative risk), odds ratios and risk difference (also known as absolute risk reduction). Whichever one you use, remember that all measures of effect are uncertain – we should never present the result of any study without a measure of uncertainty, preferably a confidence interval. Both the effect measure and measures of uncertainty can be calculated from a 2x2 table. Note: Number Needed to Treat (NNT) is a popular measure of the effect of an intervention, giving the number of people who need to receive the intervention for one additional person to experience or avoid a particular event. NNT can't be used directly in a meta-analysis. You don't have to report the NNT in your review, but if you would like to, you'll need to use one of these other statistics for your analysis, and then calculate the NNT at the end to report it in your review.

73 Risk ratio
risk of event with intervention = 17/68 = 0.25
risk of event with control = 9/64 = 0.14
risk ratio = intervention risk / control risk = (17/68) / (9/64) = 0.25 / 0.14 = 1.79
Where risk ratio = 1, there is no difference between the groups

              Headache   No headache   Total
    Caffeine     17          51          68
    Decaf         9          55          64
    Total        26         106         132

The risk ratio simply divides the risk in the intervention group by the risk in the control group. In this case, the risk in the intervention group is 17/68 or 0.25, and the risk in the control group is 9/64 or 0.14. We take the intervention risk, 0.25, and divide it by the control risk, 0.14, and get the answer 1.79. If you're comparing two active interventions, it's up to you to decide which intervention you're going to treat as the intervention and which you will treat as the control in this calculation – remember to be consistent throughout the review, and make sure you're interpreting the result correctly for each group in that case. Where RR is less than 1, this implies that the intervention reduced the number of events compared to the control. If the RR is greater than 1, the intervention increased the number of events. If the RR is equal to 1, this implies there was no difference between the two groups – 1 is our point of no effect for risk ratios. ASK: What does this RR of 1.79 mean?
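As a rough sketch of the calculation RevMan performs for you (assuming the standard log-transformation approach to the confidence interval; this is illustrative, not RevMan's actual code):

    import math

    def risk_ratio_with_ci(a, n1, c, n2, z=1.96):
        """Risk ratio with a 95% CI, via the usual standard error of log(RR)."""
        rr = (a / n1) / (c / n2)
        se_log_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)
        lower = math.exp(math.log(rr) - z * se_log_rr)
        upper = math.exp(math.log(rr) + z * se_log_rr)
        return rr, lower, upper

    # Caffeine: 17 headaches of 68; decaf: 9 of 64
    print(risk_ratio_with_ci(17, 68, 9, 64))
    # RR ~1.78 (the slide's 1.79 rounds the risks first), 95% CI ~0.85 to ~3.70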

74 Expressing it in words
Risk ratio 1.79
the risk of having a headache with treatment was 179% of the risk in the control group
intervention increased the risk of headache by 79%
or for a reduction in risk:
Risk ratio 0.79
the risk of having a headache with treatment was 79% of the risk in the control group
intervention reduced the risk of headache by 21%
These are some examples of the ways we can talk about RR. For our calculated RR of 1.79, this means the intervention increased the number of headaches by 79%, or that the risk in the intervention group was 179% of the control. Hypothetically, if we had found a decreased risk, say a RR of 0.79, less than one, we could say that the chances of having a headache in the intervention group were 79% of the chances in the control group, or that the intervention reduced the risk of headache by 21%. This can also be called the relative risk reduction.

75 Odds ratio
odds of event with intervention = 17/51 = 0.33
odds of event with control = 9/55 = 0.16
odds ratio = intervention odds / control odds = (17/51) / (9/55) = 0.33 / 0.16 = 2.06
Where odds ratio = 1, there is no difference between the groups

              Headache   No headache   Total
    Caffeine     17          51          68
    Decaf         9          55          64
    Total        26         106         132

The odds ratio is very similar – instead of dividing the risk by the risk, we divide the odds in the intervention group by the odds in the control group. In this case, an odds for the intervention group of 0.33 divided by the control group odds of 0.16 gives an odds ratio of 2.06. Just like a RR, where OR is less than 1, this implies that the intervention reduced the number of events compared to the control. If the OR is greater than 1, the intervention increased the number of events. If the OR is equal to 1, this implies there was no difference between the two groups, and therefore no intervention effect. ASK: What does this OR of 2.06 mean?
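The corresponding sketch for the odds ratio (again illustrative; the standard error of log(OR) uses all four cells of the 2x2 table):

    import math

    def odds_ratio_with_ci(a, b, c, d, z=1.96):
        """Odds ratio from a 2x2 table (a, b = intervention; c, d = control)."""
        or_ = (a * d) / (b * c)
        se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
        lower = math.exp(math.log(or_) - z * se_log_or)
        upper = math.exp(math.log(or_) + z * se_log_or)
        return or_, lower, upper

    # Caffeine: 17 with / 51 without; decaf: 9 with / 55 without
    print(odds_ratio_with_ci(17, 51, 9, 55))
    # OR ~2.04 (the slide's 2.06 rounds the odds first), 95% CI ~0.83 to ~4.98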

76 Expressing it in words
Odds ratio 2.06
intervention doubled the odds of headache
intervention increased the odds to 206% of the odds in the control group
intervention increased the odds of headache by 106%
or for a reduction in odds:
Odds ratio 0.06
intervention reduced the odds of headache to 6% of the odds in the control group
intervention reduced the odds of headache by 94%
These are some examples of the ways we can talk about OR. In this case, our OR of 2.06 means the intervention doubled the odds of headache. We can say the odds in the intervention group were increased to 206% of the odds in the control group, or increased by 106%. Hypothetically, if we had found a result indicating fewer headaches in the intervention group, or a number less than 1, say an odds ratio of 0.06, we could say that the intervention had reduced the odds of headache to 6% of the odds in the control group, or reduced the odds by 94%. You can see that the way we express RR and OR is not very different, but we do need to be careful to be specific about which we are reporting. Misinterpreting an OR as an RR can lead to overestimating or underestimating the effect, especially for common events.

77 Risk difference
risk of event with intervention = 17/68 = 0.25
risk of event with control = 9/64 = 0.14
risk difference = risk with intervention – risk with control = 17/68 – 9/64 = 0.25 – 0.14 = 0.11
When risk difference = 0, there is no difference between the groups

              Headache   No headache   Total
    Caffeine     17          51          68
    Decaf         9          55          64
    Total        26         106         132

The third effect measure that can be used to measure dichotomous outcomes is the risk difference. The previous two measures, RR and OR, have been ratios, which give relative measures of the effect, dividing one by the other. RD is an absolute measure, giving you the absolute difference between the risks in each group. All we do is subtract – take the risk in the intervention group, and subtract the risk in the control group. In this case, 0.25 – 0.14, which gives a RD of 0.11. You cannot use odds to calculate a corresponding odds difference. For absolute measures, our point of no effect is different. For ratios, the point of no effect is 1, but for RD the point of no effect is 0. Where risk difference is less than 0, this implies that the intervention reduced the number of events compared to the control. If the risk difference is greater than 0, the intervention increased the number of events. If the risk difference is equal to 0, this implies there was no difference between the two groups, and therefore no intervention effect. ASK: What does this RD of 0.11 mean?
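And a sketch for the risk difference (illustrative; the standard error shown is the usual large-sample formula):

    import math

    def risk_difference_with_ci(a, n1, c, n2, z=1.96):
        """Risk difference with a 95% CI from group event counts and sizes."""
        p1, p2 = a / n1, c / n2
        rd = p1 - p2
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        return rd, rd - z * se, rd + z * se

    print(risk_difference_with_ci(17, 68, 9, 64))
    # RD ~0.11, 95% CI ~-0.02 to ~0.24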

78 Expressing it in words
Risk difference 0.11
intervention increased the risk of headache by 11 percentage points
14 out of 100 people experienced a headache in the control group; 11 more people experienced a headache with caffeine
or for a reduction in risk:
Risk difference -0.11
intervention reduced the risk of headache by 11 percentage points
14 out of 100 people experienced a headache in the control group; 11 fewer people experienced a headache with caffeine
It's common to talk about RD in terms of 'percentage points', to make it clear that we're not speaking in relative terms, so we could say our RD of 0.11 means the intervention increased the risk of headache by 11 percentage points. 11 percentage points is not 11% of the control value, but 11 points higher than the control value. An alternative way to express absolute differences is to talk about them in natural frequencies – for example, numbers out of 100, or per 1,000, or per million. Consumers find these values particularly easy to understand. In this case, we would start by noting that 14 out of 100 people had a headache in the control group – this represents what would happen anyway, without the intervention. With caffeine, 11 more people experienced a headache (making a total of 25 out of 100). Hypothetically, if we had a reduction in headaches, say a RD of -0.11, we might say that the intervention reduced the risk of headache by 11 percentage points. This can also be called the absolute risk reduction. We could also say that, in this case, 11 fewer people had a headache with caffeine.

79 Overview
expressing chance: risk and odds
effect measures for comparing groups
choosing an effect measure
collecting data for dichotomous outcomes
So, we have three effect measures available for dichotomous outcomes. How do you decide which effect measure to use for your review?

80 Choosing an effect measure
communication of effect: users must be able to understand and apply the result
consistency of effect: applicable to all populations and contexts
mathematical properties
All the effect measures are equally valid, but there are some properties that we might wish to consider when choosing which statistic to use: communication – everyone knows what it means and it is useful for communicating treatment effects in practice; consistency of effect – it would be nice to have one number we can take away and use in a range of populations, which may be different from the population or context in which the study was conducted; and ease of mathematical manipulation. Your CRG may have a policy on which to use – in which case, take their advice.

81 Communication
OR is hard to understand, often misinterpreted
RR is easier, but relative – can mean a very big or very small change
RD is easiest: absolute measure of actual change in risk, easily converted to natural frequencies or NNT
People find risk a more intuitive concept, and often interpret odds ratios as if they were risk ratios. This may be a problem for frequent events, where the statistics are not the same, and may lead readers to exaggerate the treatment effect. RR is easier, but as a relative measure, it can mean a very big or very small change, depending on the underlying risk. For example, a 50% decrease in risk is very important when coming from a baseline of 80% down to 40%, but less so when going from a baseline risk of 4% to 2%. Absolute measures like RD are often most easily interpreted by clinicians and patients, as they give clear estimations of actual risk, and can be expressed as natural frequencies, such as 11 fewer people out of 100 experiencing a headache. The NNT can be easily calculated as 1/|RD| (1 divided by the absolute value of the RD), although it can also be calculated from other statistics.
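Since the notes give NNT = 1/|RD|, a quick illustration using the caffeine example's RD of 0.11 (here really a number needed to harm, since the event rate increased):

    import math

    rd = 0.11                     # risk difference from the caffeine example
    nnt = math.ceil(1 / abs(rd))  # conventionally rounded up to a whole person
    print(nnt)                    # 10: roughly one extra headache per 10 treated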

82 Consistency
event rates vary from study to study within a review
study of meta-analyses in The Cochrane Library: RR and OR are less variable across different populations; RD is more variable, dependent on baseline risk
readers will apply results to their own population, which may be different again
The amount of variation in the effect measure (RR, OR, RD) across a set of studies conducted in different times, places and populations depends on which effect measure you choose. This is mainly due to changes in the underlying event rates across studies, e.g. some populations may have high underlying death rates and some lower rates. A study of meta-analyses in the Cochrane Library shows that, in general, OR and RR are less variable across different populations – that is, relative measures like OR and RR tend to stay more stable when applied to different populations, which also makes it easier for readers to apply the results of the review to their own population. RD varies more widely from population to population, and must be interpreted in the context of the underlying risk of the population being considered. For example, a risk difference of 0.02 (or 2%) may represent a small, clinically insignificant change from a risk of 58% to 60%, or a proportionally much larger and potentially important change from 1% to 3%. The absolute value is likely to be different in every population, and different again in the reader's population. Sometimes this can cause problems – for example, if the RD is -11%, and the reader's population has an underlying risk of 5%, then their risk with the intervention would be calculated as -6% risk of the event occurring, which doesn't make sense. This variation is an important consideration if you're using the RD to communicate absolute effects, or to calculate other statistics such as NNT – you may need to consider calculating multiple statistics for different levels of assumed underlying risk, or using a relative statistic to calculate the NNT instead.
Source: Deeks JJ. Issues in the selection of a summary statistic for meta-analysis of clinical trials with binary outcomes. Statistics in Medicine 2002; 21: 1575-1600.

83 Mathematical properties
defining your event: good or bad, presence or absence? think carefully and choose in advance – OR and RD are stable either way, RR varies
unbounded values: OR is the only unbounded effect measure
The first issue arising from the mathematical properties of the effect measures is the importance of defining your event. You have the option to define your event of interest as the desirable outcome or the undesirable outcome – e.g. people who made it to the bottom of the ski slope, or people who didn't make it; people with headache, or people without headache. For OR and RD the effect measure is stable, but reversed, when you change the event: e.g. an OR of 0.5 for the bad event corresponds to an OR of 2.0 for the good event (in general, the OR for the good event is 1 divided by the OR for the bad event). For RD, if we have an RD of 5 percentage points for the good event, we get -5 percentage points for the bad event (the same size, but in the opposite direction). RR is not so straightforward – changing the event changes the size and significance of your intervention effect. Think carefully about which outcome makes the most sense to you to measure and report. It's also the case that, while RR and RD are limited in the maximum values they can take, OR is unbounded (that is, it can go up to infinity). There is no clear consensus on the importance of these two factors.

84 Summary
[Table comparing OR, RR and RD against three criteria: communication, consistency and mathematical properties.]
As mentioned before, your CRG may have a preferred measure. If not, RR is generally the best default setting for most outcomes.

85 Overview
expressing chance: risk and odds
effect measures for comparing groups
choosing an effect measure
collecting data for dichotomous outcomes
Going back to our data collection – what data do we need to collect from the included studies to analyse a dichotomous outcome?

86 Collecting data
four numbers needed for effect measure and variance:

              Headache   No headache   Total
    Caffeine     17          51          68
    Decaf         9          55          64
    Total        26         106         132

If they're available, the four shaded numbers are all you need to collect from your included studies – the number of events in each group, and the total number of people in each group. The Cochrane review software, RevMan, will calculate the risk ratio or odds ratio, as well as a confidence interval as a measure of uncertainty, based on these numbers. [CLICK] It's not always clear whether the total number of people for whom the outcome was measured is the same as the total sample size at any given time point. Where possible, collect the actual number of people who were measured for each outcome, at each time point. If that information is not available in the study, that's ok – you can use the total sample size for the group, or you can contact the author to request more information. The reason we try to use the number actually measured relates to the issue of intention-to-treat analysis. Wherever possible, we want complete data about the whole sample, but when people are missing, we need to make sure that we are aware of that. If we don't have measurements on some people, and we use the total sample size as our N, then we are effectively imputing data – making an assumption that none of the missing people had a headache, or experienced the event you are measuring. This may or may not be a plausible assumption, so it's important to be clear on the decisions you're making, and get statistical advice if you want to know more about imputation.
Try to collect the actual number measured for each outcome, at each time point

87 Other data formats can also be used
percentages: number of events can be calculated if the sample size is known
overall effect estimate (e.g. OR, RR), where results for each group are not reported separately: can be included in a meta-analysis using the generic inverse variance method; need a measure of variance (e.g. SE, 95% CI)
Those four numbers may be the most straightforward way to analyse dichotomous data, but it's very important to remember that if your study does not report those four numbers, it doesn't mean you can't use the data reported in the study. There are other ways the outcome can be reported that are just as useful. For example, if the study doesn't report the actual number of events in each group, but reports a percentage, this can easily be converted to an absolute number if you have the sample size for each group. Alternatively, your study might not provide separate data for the intervention and control groups at all, but might report the overall effect estimate – such as the OR or RR. You can use these effect estimates in a meta-analysis too, as long as they have been reported with a measure of variance, such as a standard error or a confidence interval. When we enter the usual numbers of events and people, RevMan uses these to calculate the OR or the RR, along with the variance or uncertainty for each study. In effect, studies that report effect estimates have done these calculations for you. Because they're in a different format, these studies need to be entered using a different outcome type in RevMan – the generic inverse variance outcome type – rather than the usual dichotomous outcome type. An explanation of the generic inverse variance method, and what to do if you have an overall effect estimate instead of separate data for each group, will be given later in a separate presentation on analysing non-standard data.
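For instance, the percentage-to-events conversion mentioned above is a one-liner (a sketch; rounding to the nearest whole participant):

    def events_from_percent(percent, group_size):
        """Recover the event count when a study reports only a percentage."""
        return round(percent / 100 * group_size)

    print(events_from_percent(25, 68))  # 17 events in a group of 68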

88 What to include in your protocol
effect measure(s) to be used for dichotomous data

89 Take home message
risks and odds are two ways of expressing chance
risk ratio, odds ratio and risk difference compare chance between two groups
to enter dichotomous data you need the number of events and the total number in each group

90 Question: Which of the following are examples of dichotomous outcome measures?
weight loss in kilograms
reduction in size of wound measured in centimetres
success or failure
blood pressure readings
rate of cardiovascular events per 1000 person-years

91 Answer: An example of a dichotomous outcome measure is
weight loss in kilograms
reduction in size of wound measured in centimetres
success or failure
blood pressure readings
rate of cardiovascular events per 1000 person-years
INCORRECT – this outcome is continuous
INCORRECT – this outcome is continuous
CORRECT – this outcome is dichotomous because there are only two possible outcome response options – 'success' or 'failure'
INCORRECT – this outcome is continuous
INCORRECT – this outcome is rate data, i.e. this outcome has a time component incorporated in its calculation

92 Question: What information from each study do you need to meta-analyse a dichotomous outcome?
mean, standard deviation, and sample size for each group
numbers of events in each group and sample size for each group
total number of events overall and sample size of each group
effect estimate (e.g. risk ratio) only

93 Answer: What information from each study allows you to meta-analyse a dichotomous outcome?
mean, standard deviation, and sample size for each group
numbers of events in each group and sample size for each group
total number of events overall and sample size of each group
effect estimate (e.g. risk ratio) only
INCORRECT – this information allows you to meta-analyse a CONTINUOUS outcome
CORRECT – the Cochrane review software, RevMan, will calculate the risk ratio, odds ratio or risk difference, as well as a confidence interval as a measure of uncertainty, based on this information
INCORRECT – in order to calculate a risk ratio, odds ratio or risk difference, information on the number of events in each group is necessary
INCORRECT – an effect estimate (e.g. risk ratio) may have been calculated for a dichotomous outcome, but can only be included in a meta-analysis if a measure of variation (e.g. standard error or 95% CI of the effect estimate) is available

94 Analysing continuous outcomes

95 Steps of a Cochrane review
define the question plan eligibility criteria plan methods search for studies apply eligibility criteria collect data assess studies for risk of bias analyse and present results interpret results and draw conclusions improve and update review Still at the stage of collecting and analysing data, continuous outcomes are another common type of data you will encounter.

96 Study level ↓ Review level ↓
[Diagram: for each included study (A, B, C, D), outcome data are converted into an effect measure; the study-level effect measures are then combined at the review level.]
As we mentioned in the presentation on dichotomous data, what we are aiming to look at with this presentation is the process of taking the outcome data reported in a study, and organising it into a useful effect measure. Our aim is to get an effect measure to report the results of each included study, and hopefully we can bring them together into some kind of synthesised effect measure at the review level. For now, at the study level, we will look at how we go about analysing continuous data.
Source: Jo McKenzie & Miranda Cumpston

97 Session outline
effect measures for continuous outcomes
collecting data for continuous outcomes
See Chapters 7 & 9 of the Handbook

98 What are continuous outcomes?
can take any value in a specified range
intervals between values are equally spaced
e.g. height, weight: a person can be 171.3333 cm tall; the distance from 1 to 2 cm is the same as from 171 to 172 cm
other numerical scales commonly assessed as continuous, e.g. quality of life, pain, depression
Continuous outcomes, generally, are any outcomes measured on a numerical scale. Truly continuous outcomes, using a formal definition, have two characteristics. Firstly, the outcome can take any value in a specified range – that is, not just whole numbers: for any two values on the scale, there could be one in between. In theory, a person could be 171.3333 cm tall, if only we had an instrument sensitive enough to measure it. Secondly, each interval on the scale should be evenly spaced, and have the same quantitative value. So, the distance between 1 and 2 cm is the same as the distance between 2 and 3, or 171 and 172. Many of the numerical scales we use in research don't formally meet these requirements – many scales can only measure in whole numbers, and in some qualitative scales, such as quality of life, we can't really say that the distance between 1 and 2 points is quantitatively the same as the distance between 29 and 30 points. Formally speaking these are called 'ordinal' outcomes, rather than truly continuous, but these scales are commonly analysed in studies as if they were continuous. For now, we'll treat them that way. More information about the specific measures available to analyse ordinal outcomes is covered in a separate presentation on non-standard data.

99 Expressing continuous outcomes
two components: mean value, measure of variation

    Irritability score   Mean   SD    N
    Caffeine              20    9.1   65
    Decaf                 33    8.6   67

When reporting the results of a continuous outcome, we need to do more than report the number of people in a particular category. We need to take a measurement from each person in a group, and then summarise the results – commonly as a mean value (to give us an idea of the middle of the group), and a measure of variability within the group, or how spread out they are, such as the standard deviation.

100 Standard deviations: m = 10, sd = 1
[Histogram: participants (y-axis) by score (x-axis), with scores clustered closely around the mean of 10.]
It's important to have both the mean and the standard deviation to give us a good idea of the values in each group. In this example, we have a group of participants, each with a score on a continuous scale from 1 to 20. The mean score is 10, with some scatter of individual values that are higher or lower than 10. The standard deviation is 1 – if the data are normally distributed, then we would not expect many scores more than two standard deviations – in this case two points – either side of the mean.

101 Standard deviations: m = 10, sd = 3
[Histogram: participants by score, mean 10, SD 3 – the scores are more spread out.]
In this example, the mean score for the group is still 10, but you can see the data are more scattered either side – the SD for this group is 3.

102 Standard deviations: m = 10, sd = 5
[Histogram: participants by score, mean 10, SD 5 – the scores are spread more widely still.]
Or we could find the data even more scattered – here the mean is still 10, but the SD is 5. You can see that although we have the same number of participants and the same mean, without the standard deviation we are missing some important information to represent the data in the study.

103 Comparing two groups: effect measures
mean difference (MD) (difference of means)
standardised mean difference (SMD) (effect size)
all estimates are uncertain, and should be presented with a confidence interval
When comparing two groups using continuous outcomes, the two most commonly used effect measures to summarise the difference between groups are the mean difference (also called the difference of means) and the standardised mean difference (which you will sometimes see called an effect size). There are clear circumstances in which you would use each option. As always, any effect estimate is uncertain, and should always be reported with a measure of that uncertainty, such as a confidence interval.

104 Mean difference
when all studies use the same measurement scale
mean difference = mean of intervention – mean of control = 20 – 33 = -13 points
When mean difference = 0, there is no difference between the groups

    Irritability score   Mean   SD    N
    Caffeine              20    9.1   65
    Decaf                 33    8.6   67

Our first effect estimate is the MD. The MD is quite simply the mean score in the intervention group, minus the mean score in the control group. The result is then expressed in units on the scale used to measure the outcome. In this example, we are measuring irritability. We take the mean in the intervention group, 20 points on the irritability scale, and subtract the mean in the control group, which is 33 points. So, 20 – 33 is -13 points. Like a risk difference, this is an absolute measure – if the result is 0, this implies there is no difference in the average score between the two groups. Zero is our point of no effect. If the result is a positive number, then the intervention group scored higher on the scale. If the result is negative, the intervention group scored lower on the scale.
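A sketch of the mean difference calculation, including the conventional standard-error-based confidence interval that RevMan would compute for you (illustrative only):

    import math

    def mean_difference_with_ci(m1, sd1, n1, m2, sd2, n2, z=1.96):
        """MD between two groups, with a 95% CI from the SE of the difference."""
        md = m1 - m2
        se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
        return md, md - z * se, md + z * se

    # Caffeine: mean 20 (SD 9.1, n=65); decaf: mean 33 (SD 8.6, n=67)
    print(mean_difference_with_ci(20, 9.1, 65, 33, 8.6, 67))
    # MD -13 points, 95% CI ~-16.0 to ~-10.0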

105 Interpreting mean difference
how should we interpret a score of -13? depends on:
direction of the scale (here: higher = more irritable)
length of the scale (here: 0-50)
minimally important difference (here: 5)
good or bad outcome (here: bad)
ASK: So, how would you interpret our score of -13? Does our intervention have an effect on irritability? Is it a positive or a harmful effect? A MD of -13 indicates that the intervention group had, on average, a lower score on the irritability scale. But what does that mean? Interpreting a continuous outcome score depends on knowing the scale you're using. [CLICK] First, we need to know whether a higher score means more irritability or less irritability – this isn't always clear, so be sure to check your study carefully to make sure you have the direction correct. [CLICK] Let's assume that on this scale, a higher score means more irritability. So a score that is 13 points lower means the intervention group, on average, was less irritable. But how much less irritable? [CLICK] To know whether 13 points is a big or small change, we need to know how long the scale is – 1-20? 1-200? [CLICK] In this case, let's say the scale is 0-50. ASK: Do you think 13 points is a big or small change relative to the size of the scale? It's still hard to know if this is a qualitatively important change. [CLICK] We can't tell the importance of a change just from the numbers – we also need to know what's called the minimally important difference on this scale, also called the clinically important difference. That is, how big does a change have to be on this scale before it is meaningful and noticeable to patients and health professionals? For example, on a 10-point pain scale, it's usually agreed that about 1.5 to 2 points is the smallest change that is noticeable for patients. Any change smaller than that is too small to make a difference. [CLICK] Let's assume that the MID for this scale is 5. ASK: Is our MD an important change? [CLICK] Finally, before we and our readers can interpret the numbers, we need to make a judgement about whether we're measuring a good or bad outcome. We've established that our intervention causes less irritability. ASK: Is that a good thing? Is that the effect we hoped our intervention would have? ANSWER: Yes, irritability is a bad thing, [CLICK] so less irritability favours our intervention. It's very important to make sure you have reported all this information about the scale, so that readers can clearly understand your conclusions.

106 Expressing it in words
Mean difference -13
on average, participants with the intervention scored 13 points lower on the irritability scale
on average, the intervention reduced irritability by 13 points on the irritability scale
or for an increase:
Mean difference 13
on average, participants with the intervention scored 13 points higher on the irritability scale
on average, the intervention increased irritability by 13 points on the irritability scale
To express in words our MD of -13, we might say that, on average, participants in the intervention group scored 13 points lower on the irritability scale. Because we know the direction of the scale, we could also say that the intervention reduced irritability, which would also help our readers interpret the score. Hypothetically, if our score had been positive 13, we would say that, on average, participants in the intervention group scored 13 points higher on the irritability scale, or that it increased irritability by 13 points.

107 Study level ↓ Review level ↓
[Diagram: studies A-D each produce outcome data on different scales (e.g. lb, g, kg; or Scale 1, 2, 3), each converted to a study-level effect measure before synthesis at the review level.]
Let's go back to think about our study-level data for a minute. If we can calculate a mean difference for each study, we can use that to compare the results across studies. [CLICK] But what if each of our studies uses a different way of measuring an outcome? For example, in a study of premature babies, one of the outcomes might be weight. Most studies might measure weight in grams, but some might use kilograms or pounds. At the study level that's no problem – we can calculate a mean difference for each study. Thinking ahead to the review stage, though, when we want to compare the results, the mean differences would not be comparable with each other. A mean difference of 50 grams is not directly comparable to a mean difference of 50 pounds. ASK: So what can we do to make the scores more comparable? ANSWER: Since we know the relative size of grams, kilograms and pounds – what we call the 'scale factor' – we can simply choose the most useful scale (e.g. grams) and convert the other scores to grams by multiplying or dividing. [CLICK] However, what if the different measures are not just different numerical scales, but qualitatively different, such as different depression scales, or different quality of life scales? One study might use the Hamilton Rating Scale for Depression, another might use the Beck Depression Inventory, and another might use the Major Depression Inventory. These scales are measuring the same underlying concept – depression – but it isn't easy to tell what scale factor would appropriately convert one scale to another. It's not just about the length of the scale – there are qualitative aspects of how the scale works that determine where a person with a given level of depression would fall on each scale. We can't assume that someone who scored 3 out of 10 on one scale would score 30 out of 100 on another. ASK: So what can we do to make these kinds of scores more comparable?

108 Standardised mean difference
when different scales are used to measure the same outcome
SMD standardises the results: units of standard deviation
does not correct for direction – may need to multiply by -1
SMD = (mean of intervention group – mean of control group) / pooled standard deviation of both groups
When SMD = 0, there is no difference between the groups
The second common effect measure we have for continuous data is the standardised mean difference. We use this effect measure when we have several studies measuring a concept, e.g. depression, but not all using the same measurement scale. When we don't know the scale factor to mathematically convert one score into another, we can standardise the results of each study using their standard deviations, so that when we get to the review stage we can compare the scores. It's important to be careful when doing this, and to make sure the scales are measuring the same concept. It's ok to have some variation – for example, different quality of life scales will have different questions, or you might be combining different measures of weight, such as BMI, weight in kg and skinfold measurements. Although these measures aren't exactly the same, if you're asking a question about the effect on weight, then combining them together will give you a useful answer. Sometimes the scales are too different, and combining them would not give meaningful results. For example, if one study is measuring pain, and another measures a combination of pain and function, these scales are not really measuring the same thing, and the results should not be combined. In some cases, even studies that appear to be measuring the same outcome – e.g. health care costs – may not be comparable, e.g. if the studies comparing costs were from different countries with very different health systems. It's up to you to use your judgement and decide when measures are similar enough to combine and give a meaningful answer to your review question. These decisions are always a trade-off – limiting your results to studies with exactly the same measures may give you a more specific answer, but you may end up with analyses including fewer studies that give you a less complete picture of the evidence. If you have decided your different scales are measuring the same thing, then the SMD is calculated by taking the mean difference for each study and dividing it by the pooled standard deviation of outcomes for all the participants in that study, before being combined in a meta-analysis. This is based on the assumption that differences between the scales (e.g. a 10-point scale vs a 25-point scale) will be reflected in the SDs of the scores measured, and can be adjusted for accordingly. You do not need to perform this calculation yourself – RevMan will do it for you. [Note: more detailed formulae are available in the Handbook and the RevMan Help menu for those interested.] There are some tricks you need to know, so read the Handbook chapters on this before you try it. For example, what should you do if your scales run in different directions? E.g. if you have two depression scales – on one scale, more points means more depression, and on the other scale, more points means less depression. In that case, you have to alter the results so they all run in the same direction before you combine them, by multiplying the results of one scale by -1. This doesn't change the values, just the direction of the scale.
The results of the analysis are then no longer in units on one scale or another – they will be reported in units of standard deviation. Like an ordinary mean difference, if the result is 0, this implies there is no difference in the average score between the two groups. A positive number will mean more points in the intervention group, and a negative number will mean fewer points in the intervention group.
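For those interested, a sketch of the standardisation step, using the pooled-SD form of the SMD (Cohen's d); note that RevMan reports a slightly adjusted version (Hedges' g) with a small-sample correction, so its value will differ slightly:

    import math

    def smd(m1, sd1, n1, m2, sd2, n2):
        """Standardised mean difference: MD divided by the pooled SD."""
        pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                              / (n1 + n2 - 2))
        return (m1 - m2) / pooled_sd

    print(smd(20, 9.1, 65, 33, 8.6, 67))  # ~-1.47, the slide's "-1.5"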

109 Interpreting standardised mean difference

    Irritability score   Mean   SD    N    MD    SMD
    Caffeine              20    9.1   65   -13   -1.5
    Decaf                 33    8.6   67

how should we interpret a score of -1.5?
compare to available SDs
was the study likely to have high or low variation?
difficult for readers to interpret – convert results to a specific scale for reporting
depends on the same factors as mean difference
Going back to our example, the SMD works out to be -1.5 (the detailed calculations are not shown here – but you don't need to calculate this yourself; RevMan will run this calculation for you). ASK: How would you interpret this SMD? In terms of the direction of effect, and whether this is a good or bad result for our intervention, the interpretation is the same as for a MD. It's the same original data for our intervention groups, and a score less than zero still means that we have found a reduction in irritability in the intervention group. Assessing the size of the effect is more difficult with a SMD. Our score means that, on average, irritability was 1.5 standard deviations lower in the intervention group. ASK: How can we tell how big 1.5 standard deviations is? ANSWER: We have some standard deviations for this particular score already available to us. Looking at the original scores, as a rough guide, we can see that the SD was 8.6 in the control group, which is what we're comparing things to. We can calculate that 1.5 standard deviations would be 12.9 points, which is pretty close to our MD result. To interpret that, we would still need to consider the length of the scale, the minimally important difference and those other factors we discussed, just as we would for the MD. The accuracy of this transformation back to points on the scale depends a lot on the range of SDs in your included studies. In a meta-analysis, you might have some studies with high variability, or high SDs, and some studies with low variability. You might then get very different results, depending on the value you choose to calculate with. Another option is to use a rough rule of thumb – for example: 0.2 represents a small effect, 0.5 represents a moderate effect, and anything larger than 0.8 represents a large effect. This rule doesn't take into account what you consider to be a minimally important difference for a particular outcome. Later, at the review stage, even when we're looking across several studies using different scales, we'll always have the original study group data to help us get a rough estimate of how to interpret the SMD. However, you do need to take into consideration factors that might affect whether a particular study was likely to have high or low levels of variation. For example, was this a study of a very diverse patient population, or a very narrowly defined population? Was the study a pragmatic trial, in which you might expect variation in the delivery of the intervention, or was it a very tightly controlled protocol? These factors might explain differences in SD among your included studies, and help you interpret and discuss your results. Although it's possible to do these rough comparisons, we know this is more difficult for readers to understand than simple units on a scale. To help communicate the results of these outcomes, after all your analysis is completed, you may wish to convert the final results back into units on one of the scales used – methods for this are described in Chapter 12, section 12.6 of the Handbook.

110 Normally distributed data
Analysis of continuous data assumes that the outcome data are normally distributed – that is, the mean is in the centre, and the sample measurements are evenly spread either side, to a width of roughly 2 standard deviations. In some cases, that assumption isn’t true, and we have what we call skewed data. [CLICK] For example here we have a lot of values towards the left side of the distribution, and a few values creating a relatively long tail out to the right – this is called positive skew. When data are skewed, the mean and SD aren’t as good a way to describe the distribution of the data. Skew can be expected where most people in a sample will have similar measurements, but some will have either very low or very high scores. For example, length of stay in hospital. Most people might stay one or two days, but a few might stay for several weeks, giving the distribution its tail of high values to the right. On the other hand, no-one can stay in hospital for less than 0 days, so there’s no corresponding tail out to the left. A similar situation might occur with nutritional values – most people might have low baseline levels of a particular nutrient, unless they happen to have eaten something very rich in that nutrient that day, in which case they will have a very high score.

111 Skewed data
indications of skew:
reported as geometric mean, or as median and interquartile range
large SD in relation to the mean
< 2 x SD between mean and highest/lowest possible score
addressing skew:
get statistical advice before proceeding
may be no action required
possible actions may include sensitivity analysis without skewed studies, log transformation or other methods
Because our analysis depends on this assumption that the data are normally distributed, it's important to be aware if you have skewed data in your review. One way to spot it is if the results are reported as a geometric mean based on log-transformed data. Also, medians and interquartile ranges are often reported when data are skewed, although it can be difficult to make assumptions about the data if the authors aren't clear. Alternatively, another way to spot skew is if you have a large SD in relation to the size of the mean, especially where your data have a maximum or minimum value. Normally distributed data are spread roughly two SDs either side of the mean, so if you can't add or subtract two SDs from your mean without exceeding the maximum or minimum value, your data may be skewed. If you do have skewed data, it's important to get some statistical advice before you proceed. It may be that you don't need to do anything to address it, especially if your included studies are large. It may be that some action is appropriate – approaches may include a sensitivity analysis to test the impact of studies with skewed data on your overall result, reporting the results of studies with skewed data separately to the others in your review, log transformation of the data, or other methods. It's very important to get statistical advice before modifying any data, or combining skewed data with non-skewed data, as it's easy to get into trouble, and any inappropriate steps will make your conclusions less reliable.
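The "< 2 x SD between mean and highest/lowest possible score" rule of thumb is easy to automate when screening extracted data (a sketch; this encodes the slide's rule of thumb, not a formal statistical test):

    def may_be_skewed(mean, sd, lowest, highest):
        """Flag possible skew: the mean is within 2 SDs of a scale boundary."""
        return (mean - 2 * sd < lowest) or (mean + 2 * sd > highest)

    # Length of hospital stay: mean 3 days, SD 4, cannot be below 0 days
    print(may_be_skewed(3, 4, 0, float("inf")))  # True – likely positive skew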

112 Session outline
effect measures for continuous outcomes
collecting data for continuous outcomes

113 Collecting data
six numbers needed for meta-analysis:

                    Mean   SD    N
    Intervention     20    9.1   65
    Control          33    8.6   67

We need slightly more information to analyse continuous outcomes than we did for dichotomous outcomes, but it's still very straightforward. We need three numbers for each group in the study: the mean value, the SD and the number of people in each group. [CLICK] As always, try to get the actual number of people in each group who were measured for this particular outcome at each time point if you can, although you may only be able to find the total number of people randomised to each group. The reason we try to use the number actually measured relates to the issue of intention-to-treat analysis. Wherever possible, we want complete data about the whole sample, but when people are missing, we need to make sure that we are aware of that. If we don't have measurements on some people, and we use the total sample size as our N, then we are effectively imputing data – making an assumption that the mean value among the missing people is the same as the mean value among the people remaining in the study. This may or may not be a plausible assumption, so it's important to be clear on the decisions you're making, and get statistical advice if you want to know more about imputation.
Try to collect the actual number measured for each outcome, at each time point

114 Post-intervention vs change from baseline
[Diagram: intervention and control groups each measured at baseline and post-intervention, reported as mean (SD); the MD between groups can be calculated from post-intervention scores or from change-from-baseline scores.]
So we know we're looking for a mean and standard deviation, but there's more than one way to measure the mean for continuous data. The simplest way is to implement your intervention, and then measure the values for each group post-intervention. Sometimes, one or more of your included studies may also measure the values at the start of the study, and report the difference between the baseline and post-intervention scores, otherwise called the change from baseline. If that's the case, the mean and standard deviation of the change score can be collected, and a mean difference can be calculated.

115 Change from baseline data
can increase or decrease precision: reduces between-person variation, but increases error for unstable or imprecise measures
regression (ANCOVA) is better for adjusting for baseline values
work with what is reported in your included studies: either post-intervention or change scores can be used; a mixture can be used (only for MD, not SMD), but it's better to be consistent if possible
avoid selective outcome reporting
change scores require the SD of the change
Reporting the change from baseline is sometimes done to adjust for the baseline values, reducing the variation between individuals in the study and increasing the precision of the estimate. However, this isn't always required. Factoring baseline measures into the analysis can introduce additional error – particularly for unstable or imprecise measures. To calculate a change score we have to measure everyone in the sample twice – and every time we take a measurement we introduce error. If you're concerned about baseline imbalance in a study, it's true that even with randomisation, a particular study may have, for example, a few more older participants, or a few more participants who are less healthy at baseline. Even so, it's not always necessary to take these baseline values into the analysis. At the review level, looking across a number of studies, on average any baseline imbalances will effectively be cancelled out, as your group of studies is likely to have a range of slight imbalances in both directions. Actually, the best way to adjust for baseline differences in an analysis is to use a more advanced method such as regression (ANCOVA) analysis. It's relatively rare to find these analyses in your included studies, but if you do, the study is likely to report an adjusted effect estimate and measure of variance. These can be meta-analysed using the generic inverse variance method, which we will cover in a separate presentation (Non-Standard Data). To some extent we are always dependent on whatever is reported in the studies. If your included studies all use post-intervention data, or all use change data, that's no problem – you can use either. ASK: What should you do if you find a mixture of some studies that report post-intervention data, and some that report change from baseline? [CLICK] In some cases, you might have most of your studies reporting change scores, and just one reporting post-intervention scores. In that case, if you have the baseline measure for that one study, it can be tempting to try to calculate the change score yourself, to be consistent. Be careful – you might be able to calculate the mean change from baseline, but you can't easily calculate the SD of the change (see the sketch after this slide for one imputation approach). In fact, it is possible to use a mixture of change and post-intervention scores in your analysis – this might seem strange, but it's perfectly fine to include both, as they are estimating the same intervention effect: the MD between the groups. If you have a choice, it's best to be consistent if possible and report the same measure for all studies, or at least make as many of your studies as possible consistent. It's very important not to be selective about which you choose based on which gives you the best results – pre-specifying your preference in your protocol can help avoid this. An important note: change and post-intervention data should not be combined in a SMD meta-analysis. The SMD assumes that differences in SDs are due to differences in the measurement scale, but the SDs of change and post-intervention data will also differ from each other because of the reduction in between-person variation, not just the difference in scale. To summarise: it's possible to use a mixture of studies using different measurement scales, and it's possible to use a mixture of change and post-intervention data, but you cannot do both at the same time.
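One imputation approach described in the Handbook derives the SD of the change from the baseline and final SDs plus a correlation coefficient; a sketch (the correlation of 0.5 here is purely illustrative and should be borrowed from a similar study that reports complete data):

    import math

    def sd_of_change(sd_baseline, sd_final, corr):
        """Impute the change-score SD from baseline/final SDs and their correlation."""
        return math.sqrt(sd_baseline**2 + sd_final**2
                         - 2 * corr * sd_baseline * sd_final)

    print(sd_of_change(9.1, 8.6, 0.5))  # ~8.86 with an assumed correlation of 0.5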

116 Other data formats can also be used
statistics other than mean and SD, e.g. standard error, confidence interval, P value, t value, median, interquartile range
clarify with the author if unclear
can often calculate or estimate the mean and SD
overall effect estimate, e.g. MD, ANCOVA, ratio of means, ratio of geometric means
can include in meta-analysis using the generic inverse variance method
need a measure of variance (e.g. SE, 95% CI)
Entering the means and SDs may be the most straightforward way to analyse continuous data, but it's very important to remember that if your study does not report those numbers, it doesn't mean you can't use the data reported in the study. There are other ways the outcome can be reported that are just as useful. In some cases, it might be unclear which statistic you have – for example, where a mean value is reported with a number in brackets after it, that could be an SD or an SE. Check carefully and make sure you know exactly what you have. In these cases, you can clarify with the author, but if no further information is available you may need to use your judgement. For example, if the number is much smaller than the SDs given for other studies – especially if there is no clear reason why this sample may be less variable than the other studies, such as tighter inclusion criteria – this may be an indicator that it is an SE, and not an SD. Sometimes, the data may be reported using something other than the mean and SD – e.g. SE, confidence interval, P value, t value, median, range, interquartile range – and sometimes each of your included studies will report something different. There's no need to worry – in many cases you can use these figures to calculate the statistics you need, or indeed to double-check that the number you have is an SD or not. That's one of the reasons why it's important to collect as much information as possible from the study reports at the data collection stage, allowing you to consider all the available options once you have a complete picture of all the data you have. Instructions on what you can use and how to convert it are included in Chapter 7 of the Handbook. HANDOUT: Analysing continuous data: what can I use? It's important to note that medians and interquartile ranges can only be used on the assumption that the data are not skewed – although skewed data are often the reason why the median and interquartile range are reported. As in all cases with skewed data – get statistical advice; if this is the only data available, it may be preferable to include the study rather than exclude it from your analysis altogether. On the other hand, ranges are too unstable, and should never be used to estimate an SD. Sometimes, it may not be possible to calculate a SD, or obtain clarification from the authors. In that case, it may be possible to 'impute' the data you need, e.g. by borrowing a SD from another study. This should be done very carefully. Any time you make assumptions about data in your analysis, be sure to note what you've done in the review, and conduct a sensitivity analysis to see what impact your assumption may have on the analysis. More about sensitivity analysis will be covered in a separate presentation (Heterogeneity). You also have the alternative not to meta-analyse the data, although you should always present the relevant results from all your included studies – for example in a table, discussed in your narrative description of the results in your review.
Alternatively, your study might not provide separate data for the intervention and control groups at all, but might instead report an overall effect estimate, such as the MD, adjusted results from an ANCOVA analysis, a ratio of means or, as mentioned earlier, a ratio of geometric means for log-transformed data. You can use these effect estimates in a meta-analysis too, as long as they are reported with a measure of variance, such as a standard error or confidence interval. When we enter the usual means and SDs, RevMan uses them to calculate the effect estimate and its variance for each study; studies that report effect estimates have, in effect, done these calculations for you. Because they're in a different format, these studies need to be entered in RevMan using a different outcome type, called the generic inverse variance method, rather than the usual continuous outcome type. An explanation of the generic inverse variance method, and what to do if you have an overall effect estimate instead of separate data for each group, is given in a separate presentation on analysing non-standard data. See Section of the Handbook.
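For intuition, here is a minimal sketch of the fixed-effect generic inverse variance calculation, with hypothetical study estimates. This is not RevMan's code, just the standard formula: each study's effect estimate is weighted by the inverse of its variance, 1/SE².

```python
import math

def inverse_variance_pool(estimates, ses):
    """Fixed-effect generic inverse variance meta-analysis:
    each study is weighted by 1 / SE^2."""
    weights = [1 / se**2 for se in ses]
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci

# hypothetical mean differences and their standard errors from three studies
print(inverse_variance_pool([2.1, 1.4, 3.0], [0.8, 0.5, 1.1]))
```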

117 RevMan Calculator
In RevMan there is a calculator that you can use to convert different measures of variability, such as a standard error, 95% confidence interval or even a mean difference, into the standard deviation that you can then enter into a meta-analysis. In this example, the study reported the means and standard errors for each group. To calculate the standard deviation of each group, enter the mean, sample size and standard error into the appropriate fields (highlighted in bright green); once you have entered these values, the rest of the fields in the calculator are completed automatically. As with any calculator, take care to enter the correct values into the correct fields, so that you don't calculate incorrect standard deviations and carry them into the meta-analysis.
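The relationship the calculator applies in this example is simply SD = SE × √n. A quick check with made-up numbers (not taken from the slide):

```python
import math

# Hypothetical values: a group of 25 participants with a reported SE of 1.5.
se, n = 1.5, 25
sd = se * math.sqrt(n)
print(sd)   # 7.5 -- the value the calculator would fill in as the SD
```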

118 What to include in your protocol
effect measure(s) to be used for continuous data
Going back to the protocol, you should include a brief statement of the effect measures you plan to use for continuous data, such as the MD if all studies use the same scale, and the SMD if you may want to combine studies using different scales. You can also mention additional elements, such as plans to convert statistics into the required formats, and whether you have a preference for post-intervention or change scores. If you know in advance that you will be using more complex methods, such as imputation of SDs or methods to address skewed data, describe them here. Some methods cannot be decided until you know what data are reported in your included studies; where you adopt additional methods later in the review process, these can be added to the Methods section (although any post-hoc changes should be clearly identified and noted in the 'Differences between protocol and review' section).

119 Take home message
mean difference and standardised mean difference compare continuous measures between two groups
for basic analysis of continuous data you need the mean, SD and number of participants in each group
both change and post-intervention data can be combined in your analysis
the required statistics can often be calculated from the reported data

120 Question: What information from each study allows you to meta-analyse a continuous outcome?
mean and 95% confidence interval for each group
numbers of events in each group and sample size for each group
mean difference or mean, standard error and sample size for each group

121 Answer: What information from each study allows you to meta-analyse a continuous outcome?
mean and 95% confidence interval for each group
numbers of events in each group and sample size for each group
mean difference or mean, standard error and sample size for each group
INCORRECT – without knowing the sample size of each group, we cannot meta-analyse mean and 95% CI data
INCORRECT – this information allows us to meta-analyse DICHOTOMOUS data
CORRECT – from a mean and SE per group, the Cochrane review software, RevMan, will calculate the mean difference or standardised mean difference, with a confidence interval as a measure of uncertainty. A mean difference with its SE could also be entered using the generic inverse variance method, along with other continuous effect estimates; in fact, the sample size is not needed for that analysis.
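As a sketch of what that calculation involves (illustrative numbers, not from the quiz): for independent groups, the MD and its confidence interval follow directly from the two group means and standard errors.

```python
import math

m1, se1 = 12.0, 0.9   # intervention group: mean and standard error
m2, se2 = 10.0, 0.7   # control group: mean and standard error

md = m1 - m2                        # mean difference
se_md = math.sqrt(se1**2 + se2**2)  # SE of the difference (independent groups)
ci = (md - 1.96 * se_md, md + 1.96 * se_md)
print(md, se_md, ci)                # 2.0, ~1.14, (~-0.24, ~4.24)
```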

122 Question: Which of the following interpretations of the effect estimate for the outcome (response to treatment) is correct?
antidepressants increased the odds of responding to treatment by 229%
the risk of responding to treatment with antidepressants was 229% of the risk in the placebo group
placebo halved the odds of responding to treatment
antidepressants doubled the odds of responding to treatment
Source: Rayner L, Price A, Evans A, Valsraj K, Higginson IJ, Hotopf M. Antidepressants for depression in physically ill people. Cochrane Database of Systematic Reviews 2010, Issue 3. Art. No.: CD DOI: / CD pub2

123 Answer: Which of the following interpretations of the effect estimate for the outcome (response to treatment) is correct?
antidepressants increased the odds of responding to treatment by 229%
the risk of responding to treatment with antidepressants was 229% of the risk in the placebo group
placebo halved the odds of responding to treatment
antidepressants doubled the odds of responding to treatment
Source: Rayner L, Price A, Evans A, Valsraj K, Higginson IJ, Hotopf M. Antidepressants for depression in physically ill people. Cochrane Database of Systematic Reviews 2010, Issue 3. Art. No.: CD DOI: / CD pub2
INCORRECT – antidepressants increased the odds of responding to treatment by 129%, not 229%
INCORRECT – this meta-analysed effect estimate is an ODDS RATIO, not a risk ratio, so it is incorrect to interpret the effect in terms of 'risk'
INCORRECT – for placebo to have halved the odds of responding to treatment, the odds ratio would have to be approximately 0.5 (i.e. less than 1)
CORRECT – an odds ratio of 2.29 means that the intervention, antidepressants, roughly doubled the odds of responding to treatment
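To see why the last option is the right reading, a small worked example; the 50% placebo response rate here is an assumed baseline for illustration, not a figure from the review.

```python
odds_ratio = 2.29

placebo_prob = 0.50                                      # assumed baseline response
placebo_odds = placebo_prob / (1 - placebo_prob)         # odds = 1.0
treatment_odds = placebo_odds * odds_ratio               # odds roughly doubled: 2.29
treatment_prob = treatment_odds / (1 + treatment_odds)   # ~0.70 response probability
print(treatment_odds, treatment_prob)
```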

124 Resources
The Cochrane Handbook for Systematic Reviews of Interventions
The PRISMA statement for transparent reporting of systematic reviews
PROSPERO database of systematic review protocols
Systematic Reviews Journal

125 Resources
Cochrane Training http://training.cochrane.org/
Slidecast presentations
Webinars
Some areas require an author log-in, but much of the content on this website is free to access

126 References
Higgins JPT, Deeks JJ (editors). Chapter 7: Selecting studies and collecting data. In: Higgins JPT, Green S (editors). Cochrane Handbook for Systematic Reviews of Interventions Version [updated March 2011]. The Cochrane Collaboration. Available from
Deeks JJ, Higgins JPT, Altman DG (editors). Chapter 9: Analysing data and undertaking meta-analyses. In: Higgins JPT, Green S (editors). Cochrane Handbook for Systematic Reviews of Interventions Version [updated March 2011]. The Cochrane Collaboration. Available from
Schünemann HJ, Oxman AD, Higgins JPT, Vist GE, Glasziou P, Guyatt GH. Chapter 11: Presenting results and 'Summary of findings' tables. In: Higgins JPT, Green S (editors). Cochrane Handbook for Systematic Reviews of Interventions Version [updated March 2011]. The Cochrane Collaboration. Available from
Acknowledgements
Based on materials by Sally Hopewell, Julian Higgins, the Cochrane Statistical Methods Group, the Dutch Cochrane Centre and the Australasian Cochrane Centre

