ISP 121 Statistics That Deceive

Simpson’s Paradox
It’s a well-accepted rule of thumb that the larger the data set, the better. Simpson’s Paradox demonstrates that a great deal of care has to be taken when combining smaller data sets into a larger one. Sometimes the conclusions from the larger data set are the opposite of the conclusions from the smaller data sets.

Example: Simpson’s Paradox

Average college physics grades for students in an engineering program:

                      HS Physics    None
Number of Students    50            5
Average Grade         80            70

Average college physics grades for students in a liberal arts program:

                      HS Physics    None
Number of Students    5             50
Average Grade         95            85

It appears that in both programs, taking high school physics improves your college physics grade by 10 points.

Example continued
In order to get better results, let’s combine our datasets. In particular, let’s combine all the students that took high school physics. More precisely, combine the students in the engineering program that took high school physics with those students in the liberal arts program that took high school physics. Likewise, combine the students in the engineering program that did not take high school physics with those students in the liberal arts program that did not take high school physics.
But be careful! You can’t just take the average of the two averages, because each dataset has a different number of values.

Example continued

Average college physics grades for students who took high school physics:

               # Students    Avg Grade    Grade Pts
Engineering    50            80           4000
Lib Arts       5             95           475
Total          55                         4475

Average = 4475 / 55 = 81.4

Average college physics grades for students who did not take high school physics:

               # Students    Avg Grade    Grade Pts
Engineering    5             70           350
Lib Arts       50            85           4250
Total          55                         4600

Average = 4600 / 55 = 83.6

Did the students who did not have high school physics actually do better?
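The pooled averages can be checked with a short Python sketch. The group sizes and grades come from the tables above; the function and variable names are illustrative only. Each pooled average is total grade points divided by total students, and the average of the two averages is computed alongside it to show why that shortcut misleads here.

```python
# Group sizes and average grades copied from the example tables.
# Each entry is (number of students, average grade).
took_hs_physics = {"Engineering": (50, 80), "Lib Arts": (5, 95)}
no_hs_physics = {"Engineering": (5, 70), "Lib Arts": (50, 85)}


def pooled_average(groups):
    """Total grade points divided by total students (size-weighted average)."""
    total_students = sum(n for n, _ in groups.values())
    total_points = sum(n * avg for n, avg in groups.values())
    return total_points / total_students


def naive_average(groups):
    """Average of the group averages; ignores group sizes."""
    return sum(avg for _, avg in groups.values()) / len(groups)


print(round(pooled_average(took_hs_physics), 1))  # 81.4
print(round(pooled_average(no_hs_physics), 1))    # 83.6
print(naive_average(took_hs_physics))             # 87.5
print(naive_average(no_hs_physics))               # 77.5
```

The naive averages (87.5 versus 77.5) keep the 10-point advantage for high school physics, while the pooled averages reverse it. The only difference is the weighting by group size.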

The Problem
Two problems with combining the data:
– Each combined table was dominated by one type of student (50 of the 55 students who took high school physics were in engineering, and 50 of the 55 who did not were in liberal arts)
– The engineering students had a more rigorous physics class than the liberal arts students, so the program is a hidden variable
So be very careful when you combine data into a larger set.

IT 121 Statistics That Deceive, Part Two

Tumors and Cancer
Most people associate tumors with cancer, but not all tumors are cancerous.
– Tumors caused by cancer are malignant
– Non-cancerous tumors are benign

Mammograms
Suppose your patient has a breast tumor. Is it cancerous? Probably not. Studies have shown that only about 1 in 100 breast tumors turns out to be malignant. Nonetheless, you order a mammogram. Suppose the mammogram comes back positive. Does the patient have cancer?

Accuracy
Earlier mammogram screening was 85% accurate. A figure of 85% would lead you to think that if you tested positive, there is a pretty good chance that you have cancer. But this is not true.

Actual Results
Consider a study in which mammograms are given to 10,000 women with breast tumors. Assume that 1% of the tumors are malignant (100 women actually have cancer, 9900 have benign tumors).

Actual Results
Mammogram screening correctly identifies 85% of the 100 malignant tumors (85) as malignant. These are called true positives. The other 15% (15 women) get negative results even though they actually have cancer. These are called false negatives.

Actual Results
Mammogram screening correctly identifies 85% of the 9900 benign tumors as benign. Thus it gives negative (benign) results for 85% of 9900, or 8415. These are called true negatives. The other 15% of the 9900 (1485) get positive results, in which the mammogram incorrectly suggests that their tumors are malignant. These are called false positives.

Note: Start with 10,000 samples. 1 in 100 are malignant, so that gives you 100 total malignant. 99 in 100 are benign, so that gives you 9900 total benign. Now we know that a mammogram is 85% accurate, so 85% of 100 is 85 True Positives. Likewise, 85% of 9900 gives you 8415 True Negatives.
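The bookkeeping in this note can be sketched in Python. The function name and parameters below are hypothetical, and the sketch assumes, as the slides do, that the same 85% accuracy applies to both malignant and benign tumors. It uses integer arithmetic so the four counts come out exact.

```python
def confusion_counts(total, malignant_per_100, accuracy_pct):
    """Split `total` tumors into the four screening outcomes.

    Assumes one accuracy figure for both malignant and benign tumors,
    as in the slides. Integer arithmetic keeps the counts exact.
    """
    malignant = total * malignant_per_100 // 100
    benign = total - malignant
    true_pos = malignant * accuracy_pct // 100   # malignant, test positive
    false_neg = malignant - true_pos             # malignant, test negative
    true_neg = benign * accuracy_pct // 100      # benign, test negative
    false_pos = benign - true_neg                # benign, test positive
    return {
        "true_positives": true_pos,
        "false_negatives": false_neg,
        "true_negatives": true_neg,
        "false_positives": false_pos,
    }


counts = confusion_counts(total=10_000, malignant_per_100=1, accuracy_pct=85)
print(counts)
# {'true_positives': 85, 'false_negatives': 15,
#  'true_negatives': 8415, 'false_positives': 1485}
```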

Results
Overall, the mammogram screening gives positive results to 85 women who actually have cancer and to 1485 women who do not have cancer. The total number of positive results is 1570. Only 85 of these are true positives, a fraction of 85/1570.

Results
Thus, the chance that a positive result really means cancer is only 5.4%. Therefore, when your patient’s mammogram comes back positive, you should reassure her that there’s still only a small chance that she has cancer.
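The same 5.4% can be reached without picking a study size by working directly with proportions. The sketch below is one way to write that; the function name is hypothetical, and it assumes a single accuracy figure covers both malignant and benign tumors, as the slides do.

```python
def chance_of_cancer_given_positive(prevalence, accuracy):
    """Share of positive results that are true positives.

    Assumes the test is equally accurate on malignant and benign tumors.
    """
    true_pos_share = accuracy * prevalence                # malignant and flagged
    false_pos_share = (1 - accuracy) * (1 - prevalence)   # benign but flagged
    return true_pos_share / (true_pos_share + false_pos_share)


chance = chance_of_cancer_given_positive(prevalence=0.01, accuracy=0.85)
print(f"{chance:.1%}")  # 5.4%
```

The result is small because benign tumors outnumber malignant ones 99 to 1, so even a 15% error rate on benign tumors produces far more false positives than there are true positives.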

Another Question
Suppose you are a doctor seeing a patient with a breast tumor. Her mammogram comes back negative. Based on the numbers above, what is the chance that she has cancer?

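Answer
Of the 8430 negative results (8415 true negatives plus 15 false negatives), only 15 belong to women who have cancer. The chance is 15/8430, or 0.0018, slightly less than 2 in 1000.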
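As a quick check of this answer, the fraction can be computed exactly with Python's standard `fractions` module. The two counts come from the study of 10,000 women; the variable names are illustrative.

```python
from fractions import Fraction

# Counts of negative mammogram results from the 10,000-woman example.
false_negatives = 15    # malignant tumors the mammogram missed
true_negatives = 8415   # benign tumors correctly reported as benign

negative_results = false_negatives + true_negatives
chance_of_cancer = Fraction(false_negatives, negative_results)

print(negative_results)                   # 8430
print(chance_of_cancer)                   # 1/562
print(round(float(chance_of_cancer), 4))  # 0.0018
```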
