Presentation is loading. Please wait.

Presentation is loading. Please wait.

SOME PROBLEM THAT MIGHT APPEAR IN DATA

Similar presentations


Presentation on theme: "SOME PROBLEM THAT MIGHT APPEAR IN DATA"— Presentation transcript:

1 SOME PROBLEM THAT MIGHT APPEAR IN DATA
CONFOUNDING

2 CONFOUNDING EFFECT It is an extraneous factor, not dependent or main independent variable of interest, that distorts the observed relationship between the dependent and independent variable. Confounding complicates analyses because of the presence of a third factor association with the possible independent variable and the response. Confounding may lead to errors in the conclusion of a study, but, when confounding variables are known, the effect may be fixed.

3 CONFOUNDING EFFECT For example, it is known that smoking is a risk factor for lung cancer. However, in region Z, many men worked in the local asbestos mines. They were therefore exposed to asbestos, which is a known risk for lung cancer. It is also known that, because of the stress miners are exposed to, they tend to smoke more, especially when working underground. Whereas smoking is related to lung cancer, mining is related to smoking as well as to lung cancer. Therefore, there is a triangular relationship between smoking, mining and lung cancer.

4 CONFOUNDING EFFECT Another example: Consider the mortality rate in Florida, which is much higher than in Michigan. Before concluding that Florida is a riskier place to live, one need to consider confounding factors such as age. Florida has a higher proportion of people of retirement age and older people are more likely to die in any given interval of time. Therefore, one must adjust for age before drawing any conclusion. If the result of an analysis differ depending on whether or not a variable is included in the analysis, we say that there is confounding. To check confounding, we can use multiple regression. Regress mortality rates for Florida compared to Michigan adding independent variables, and then add age factor. If the coefficients in the regression change too much, age is a confounding factor.

5 EFFECT OF CONFOUNDING Observed difference between study populations when no real difference exists. Not observed difference between study populations when a true association exists. An underestimation of an effect An overestimation of an effect.

6 Simpsons's Paradox Sometimes averages can be misleading. Sometimes they don’t make any sense. Be careful when averaging different variables. Let’s see the Centerville sign. Entering Centerville Established Population Elevation Average

7 Simpsons's Paradox - When Big Data Sets Go Bad By Smita Skrivanek, Principal Statistician, MoreSteam.com LLC It's a well accepted rule of thumb that the larger the data set, the more reliable the conclusions drawn. Simpson' paradox, however, demonstrates that a great deal of care has to be taken when combining small data sets into a large one. Sometimes conclusions from the large data set are exactly the opposite of conclusion from the smaller sets.  Unfortunately, the conclusions from the large set are also usually wrong.

8 EXAMPLE You’re in charge of a study that compares how two weight loss techniques – Diet and Exercise – affect the weight loss of overweight patients. Overall, you had 240 patients participate in the study, with 120 assigned to a weight‐loss diet and the remaining 120 assigned to a supervised exercise regimen. At the end of 30 days, you measured each group’s weight loss. The data showed that 70 dieters and 57 exercisers lost significant weight, representing 58% in the diet group and only 48% in the exercise group – a significant difference. So, should you conclude that diet is better than exercise?

9 No, and this why Simpson’s Paradox can be so tricky
No, and this why Simpson’s Paradox can be so tricky! When the data are stratified instead by the starting Body Mass Index (BMI) of the participants, as shown below, clearer picture emerges:

10 When examined by BMI group, you can clearly see that the percentage of patients who lost weight in each BMI group was smaller among the dieters than among the exercisers. The surprising variable was the number of obese and severely obese individuals between the diet and exercise groups. Because those numbers were flipped, the overall percentages of successful weight loss are reversed (higher for the diet group). Simpsons’s Paradox at Work: The percentage of patients who lost weight was higher for exercisers among both obese and severely obese patients, but when you aggregate the two groups, the dieters appear to do better.

11 Why Did this Happen? Two factors are at play here. First, there is an overlooked confounding variable (BMI), and second, a disproportionate allocation of BMI levels among the experimental (diet and exercise) groups. We do not know the reason for the disproportionate allocation, but we might guess that the patients somehow self‐selected themselves into the two groups.

12 The proportions of dieters and exercisers in each BMI group

13 The proportions of weight loss and non‐weight loss patients among the different subgroups
It is clear that more exercisers lost weight in each BMI group but that in the aggregated sample the proportions seem to be reversed.

14 EXAMPLE Suppose there are two pilots, Moe and Jill. Moe argues that he is the better pilot of the two, since he managed to land 83% of his last 120 flights on time compared with Jill’s 78%. Let’s look at the data more closely. Here are the results of their last 120 flights, broken down by the time of the day they flew:

15 Time of Day Day Night Overall Pilots Moe 90 out of 100 90% 10 out of 20 50% 100 out of 120 83% Jill 19 out of 20 95% 75 out of 100 75% 94 out of 120 78% Look at the day and night separately. For day flights, Jill had a 95% on time rate, and Moe only a 90% rate. At night, Jill was on time 75% of the time, and Moe only 50%. So, Moe is better “overall”, but Jill is better both during the day. Problem here is the unfair averaging over different groups. Jill has mostly night flights, which are more difficult. So, her average is heavily influenced by her night time average. Moe, on the other hand, benefits from flying mostly during the day, with higher on time percentage. With their different patterns of flying conditions, taking an overall average is misleading.

16 How to Avoid the Paradox
To avoid spurious results, it is always good practice to examine whether the relationship in the aggregated dataset holds up in it subsets, especially when some groups are not equally represented as others in the data. Another way may be to weight the samples according to their sizes. Proper randomization also goes a long way in minimizing the effects of a lurking variable that might have been missed.

17 Unfortunately, statistical analysis tools are just that – tools to help you organize and analyze the observed data. They cannot tell you anything about data that were not observed or not included in the analysis. So it is very important to involve a cross‐functional team and especially subject matter experts and practitioners in the initial planning and selection of the variables to be measured. After they collect the data, the only way to try to avoid this pitfall is to visually and otherwise examine meaningful subsets of the data. If you don’t have the option of planning the study but are given the data from a database and asked to “find what you can”, the lesson of Simpson's Paradox is to always look at the data at several levels of aggregation, as in the example above.

18 CONTROLLING FOR CONFOUNDING AT THE DESIGN STAGE
RANDOMIZATION: All potential confounding variables, both known and unknown, should be equally distributed in the study group. Use random allocation but this can only be used in experimental clinical trials. RESTRICTION: Limit participation in the study of the individuals who are similar in relation to the conformer. e.g. restrict the study for non-smokers hence any potential confounding effect of smoking will be eliminated. DISADVANTAGE: It is difficult to generalize the results of the study to the wider population if the study group is homogeneous. MATCHING: Select controls so that the distribution of potential confounders (e.g. age or smoking) is as similar as possible to that amongst the cases.

19 CONTROLLING FOR CONFOUNDING DURING THE ANALYSIS
STRATIFICATION: It allows the association between X and Y to be examined within different strata of the confounding variable, e.g. age or sex. The strength of the association is initially measured separately within each stratum of the confounding variable. MULTIVARIATE ANALYSIS: As the number of confounders that can be controlled is limited (small numbers can be in a strata), statistical modelling (e.g. logistic regression) is commonly used to control for more than one confounder at the same time.

20 Postoperative wound infection
EXAMPLE We want to examine the possible association between postoperative wound infection with the type of operation: open appendectomy and laparoscopic appendectomy. We have taken 100 cases (patients with wound infection) and 100 controls (patients without wound infection) to see the effect of appendectomy type on the existence of wound infection. The results of this study are shown in Table 1. Table 1. Number of open and laparoscopic appendectomyieas according to postoperative wound infection rate. Appendectomy Postoperative wound infection Total Yes No Open 30 18 48 Laparoscopic 70 82 152 100 200 Odds-ratio=1.95 The OR of 1.95 suggests that the postoperative wound infection is almost 2 times higher for open appendectomy than laparoscopic appendectomy.

21 Postoperative wound infection
In the next step, we must examine whether the observed relation between open appendectomy and postoperative wound infection (outcome variable) is confounded by obesity (confounding variable). Table 2. Contingency tables for wound infection cases and controls by obesity status . Odds-ratio=4.00 Obesity  Postoperative wound infection Total Yes No 50 20 70 80 130 100 200 It seems that, obesity is related to increased risk for postoperative wound infection Table 3. Relation between surgical approach (open vs laparoscopic) and obesity status Obesity Appendectomy Total Open Laparoscopic Yes 35 70 No 13 117 130 48 152 200 Odds-ratio=9.00 We clearly observe that obese patients were more likely than non-obese patients to have an open appendectomy.

22 Table 4. Calculation of odds ratio after stratifying by obesity status
OBESITY - NO Appendectomy Wound infection TOTAL Yes No Open 5 8 13 Laparoscopic 45 72 117 50 80 130 OBESITY - YES 25 10 35 20 70 OR=1 OR=1 When we calculate the OR separately for obese and non-obese patients, we find that the OR is 1 in each stratum, indicating the lack of association between postoperative wound infection and type of surgical approach. We could conclude that the OR of 1.95 in Table 1 was owing to the unbalanced distribution of obesity between cases and controls. Thus, obesity was a confounder, and the association between open appendectomy and postoperative wound infection was spurious.


Download ppt "SOME PROBLEM THAT MIGHT APPEAR IN DATA"

Similar presentations


Ads by Google