Review Design of experiments, histograms, average and standard deviation, normal approximation, measurement error, and probability.


1 Review Design of experiments, histograms, average and standard deviation, normal approximation, measurement error, and probability

2 Design of experiments Method: Investigators compare the responses of a treatment group with those of a control group. Treatment group: The group of subjects that are given the treatment. Control group: The group of subjects that are not treated. (They are given placebos.) Double-blind experiment: The subjects do not know whether they are in treatment or in control; neither do those who evaluate the responses. (e.g. Doctors evaluate the patients' responses, investigators compare the responses.) This guards against bias, either in responses or in evaluations.

3 Design of experiments Controlled experiments: Investigators assign the subjects to two groups. If the experiment is randomized, then the subjects are assigned at random. Observational study: The subjects assign themselves to different groups; the investigators just watch what happens. Observational studies have a great weakness: confounding. However, controlled experiments minimize this problem.

4 Design of experiments Confounding: The treatment group differs from the control group with respect to factors other than the treatment. The effects of these factors are confounded with the effect of the treatment. These factors are called confounders. Confounders have to be associated with both disease and exposure. Example: An observational study on smoking and related disease. The disease is lung cancer or heart attack. The exposure is smoking. A gene is a confounder if it is related to both lung cancer and smoking.

5 Simpson’s paradox Relationships between percentages in subgroups can be reversed when the subgroups are combined. Example: sex bias in graduate admissions.
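A minimal sketch of how the reversal can arise. The counts below are hypothetical, made up for illustration, and are not the actual graduate admissions figures: women have the higher admission rate in each department, yet the lower rate once the departments are combined, because they apply mostly to the department that is harder to get into.

```python
# Hypothetical admissions counts (invented for illustration only):
# department -> sex -> (number admitted, number of applicants)
data = {
    "A": {"men": (80, 100), "women": (18, 20)},
    "B": {"men": (4, 20), "women": (30, 100)},
}


def rate(admitted, applicants):
    """Admission rate as a percentage."""
    return 100.0 * admitted / applicants


def combined_rate(sex):
    """Admission rate for one sex after pooling all departments."""
    admitted = sum(data[dept][sex][0] for dept in data)
    applicants = sum(data[dept][sex][1] for dept in data)
    return rate(admitted, applicants)


for dept in data:
    print(dept, "men:", rate(*data[dept]["men"]), "women:", rate(*data[dept]["women"]))
print("combined", "men:", combined_rate("men"), "women:", combined_rate("women"))
```

In this made-up table the department is the confounder: it is related both to sex (who applies where) and to the chance of admission.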

6 Cross-sectional vs longitudinal In a cross-sectional study, different subjects are compared to each other at one point in time. (e.g. The HANES is a cross-sectional study.) In a longitudinal study, subjects are followed over time, and compared with themselves at different points in time. Example: In the HANES2, the average height of men appears to decrease after age 20, dropping about two inches in 50 years. Similarly for women. Could we conclude that an average person got shorter at this rate? Not really, because the HANES is a cross-sectional study: the people in the group of age 18-24 are completely different from those in the group of age 65-74. The first group was born around 50 years later than the second group.

7 Histogram What is a histogram? A histogram is a graph that summarizes data. (It is just a summary.) A histogram consists of a set of blocks, and the area of each block represents the percentage of cases in the corresponding class interval. The total area is 100%. To calculate the height: The height represents the crowding in that class interval. It equals the area (the percentage) divided by the length of that interval.

8 Histogram To draw a histogram: A distribution table may help: count the frequency in each class interval, then calculate the percentage. Draw a horizontal axis with the given scale. (Then, in most cases, draw a vertical axis with the density scale.) Compute the height for each class interval. Draw the blocks. Quiz 1 will be a typical example for you.
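The steps above can be sketched in a few lines of Python. The data, the class intervals, and the function name `distribution_table` are hypothetical; the sketch also assumes each class interval includes its left endpoint but not its right one, which is a convention the slides do not state.

```python
def distribution_table(values, edges):
    """Build (left, right, count, percent, height) for each class interval.

    Assumption: an interval contains its left endpoint but not its right one.
    height = percent / interval length, i.e. percent per horizontal unit.
    """
    n = len(values)
    rows = []
    for left, right in zip(edges, edges[1:]):
        count = sum(1 for v in values if left <= v < right)
        percent = 100.0 * count / n
        height = percent / (right - left)
        rows.append((left, right, count, percent, height))
    return rows


# Hypothetical data and class intervals, for illustration only.
values = [1, 2, 2, 3, 5, 6, 8, 12, 15, 18]
edges = [0, 5, 10, 20]

rows = distribution_table(values, edges)
for left, right, count, percent, height in rows:
    print(f"{left}-{right}: count={count}, percent={percent}, height={height}")
```

Note that the last two intervals hold the same percentage of cases, but the wider one gets the lower block: the height measures crowding, not the count.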

9 Ave and SD A list of numbers (usually a data set) can be summarized by its average and standard deviation. The average locates the “center”, and the SD measures the “spread”. Average = sum of entries / number of entries. The SD measures the typical distance of the entries from the average: SD = r.m.s. (root-mean-square) of the deviations from the average.
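As a small sketch of the two formulas, assuming a hypothetical list of four numbers. The SD here divides by the number of entries (not by one less), matching the r.m.s. definition on the slide.

```python
import math


def average(entries):
    """Average = sum of entries / number of entries."""
    return sum(entries) / len(entries)


def sd(entries):
    """SD = root-mean-square of the deviations from the average."""
    avg = average(entries)
    squared_deviations = [(x - avg) ** 2 for x in entries]
    return math.sqrt(sum(squared_deviations) / len(entries))


# Hypothetical list, for illustration only.
entries = [20, 10, 15, 15]
print(average(entries))  # deviations from this average are 5, -5, 0, 0
print(sd(entries))       # square root of (25 + 25 + 0 + 0) / 4
```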

10 Convert to standard units A value is converted to standard units by seeing how many SDs it is above or below the average. Values above the average are given a plus sign; values below the average get a minus sign. The horizontal axis of the graph of the normal curve is in standard units. Many histograms for data are similar in shape to the normal curve, provided they are drawn to the same scale: making the horizontal scales match up involves standard units.
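A minimal sketch of the conversion in both directions. The function names are hypothetical, and the average of 70 and SD of 10 are assumed values chosen for illustration.

```python
def to_standard_units(value, avg, sd):
    """How many SDs the value is above (+) or below (-) the average."""
    return (value - avg) / sd


def from_standard_units(z, avg, sd):
    """Convert a value in standard units back to the original scale."""
    return z * sd + avg


# Assumed for illustration: average 70, SD 10.
print(to_standard_units(85, 70, 10))     # above average, so positive
print(to_standard_units(54, 70, 10))     # below average, so negative
print(from_standard_units(1.3, 70, 10))
```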

11 Example A histogram for the calculus test scores. Average is 70 and SD is 10. Number of students is 200. We convert the horizontal axis into standard units. Then we match the vertical scale by fixing the areas. (Or just multiply by the corresponding factor.) Then we sketch the normal curve. (A bell-shaped curve with center height about 40% per standard unit.)

12 Normal approximation Example: Find the number of scores within 1.6 SDs of the average in the previous example. (Or equivalently, what is the number of scores between 54 and 86?) Solution: From the normal table, we find that the region under the normal curve between -1.6 and 1.6 has an area of 89.04% ≈ 90%. So the number should be about 200 x 90% = 180.
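The table lookup can be checked with Python's standard library (`statistics.NormalDist`, available from Python 3.8). This is only a sketch of the same calculation; the helper name `percent_within` is hypothetical.

```python
from statistics import NormalDist


def percent_within(z):
    """Area (in percent) under the standard normal curve between -z and z."""
    standard_normal = NormalDist()  # average 0, SD 1
    return 100 * (standard_normal.cdf(z) - standard_normal.cdf(-z))


area = percent_within(1.6)   # about 89.04, as in the normal table
estimate = 200 * area / 100  # estimated number of scores between 54 and 86
print(round(area, 2), round(estimate))
```

Without rounding the area up to 90%, the estimate comes out near 178 rather than 180; either figure is only an approximation.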

13 Percentile A percentile is a value of the quantitative variable such that the stated percentage of the data lies at or below it. For example, suppose that in the previous example the 10th percentile is 60. This means that about 10% of the students (the population) scored at or below 60 (the percentile). Exercise: What is the 25th percentile of the list: 1,2,3,4? (See next slide.) A percentile rank is the percentage that goes with a percentile: e.g. 10%. All histograms, whether or not they follow the normal curve, can be summarized using percentiles.

14 The 25th percentile of the list: 1,2,3,4 Correction: what I showed you in class had a mistake. I apologize for that. Solution: The 25th percentile means that about 25% of the entries are below or equal to the percentile, say z. The number of entries making up 25% of the list is 4 x 25% = 1. Hence only one entry is below or equal to z. This implies z = 1. So the 25th percentile of the list is 1. Similarly, the 75th percentile of the list: 8,4,2,9 is 8. (4 x 75% = 3; after ordering, the 3rd entry is 8.) In general, for a discrete data set, like a list, if the number of entries we calculate is not an integer, then the percentile is not defined. For example, the 20th percentile of the list is not defined, since 4 x 20% = 0.8.
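The rule on this slide can be written as a short function. This is a sketch of the slide's convention only (other textbooks and software interpolate instead); the name `list_percentile` is hypothetical, and it returns `None` where the slide says the percentile is not defined.

```python
from fractions import Fraction


def list_percentile(entries, percent):
    """Percentile of a list under the slide's convention.

    k = (number of entries) x percent must be a whole number of entries;
    the percentile is then the k-th entry after ordering.
    Returns None when k is not a positive whole number.
    """
    k = Fraction(len(entries) * percent, 100)  # exact arithmetic, no rounding
    if k.denominator != 1 or k < 1:
        return None
    return sorted(entries)[k.numerator - 1]


print(list_percentile([1, 2, 3, 4], 25))  # 4 x 25% = 1, so the 1st entry
print(list_percentile([8, 4, 2, 9], 75))  # 4 x 75% = 3, so the 3rd entry after ordering
print(list_percentile([1, 2, 3, 4], 20))  # 4 x 20% = 0.8, not defined
```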

15 Percentile approximation Example: In the previous example, if one of the students claims his score is higher than 90.32% of his classmates, use the normal approximation to estimate his score. (Or equivalently, what is roughly the 90th percentile of the distribution of the scores?) Solution: Let z be the value in standard units such that the region to the left of z has area 90.32%. Then the area to the right of z is 100% - 90.32% = 9.68%, and by symmetry so is the area to the left of -z. So the area between -z and z is 90.32% - 9.68% = 80.64%. From the normal table, z = 1.3. So the score of the student is about 1.3 x 10 + 70 = 83.
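The same answer can be cross-checked with the inverse of the normal curve's cumulative area, `NormalDist.inv_cdf` from Python's standard library (Python 3.8+). This is a sketch that replaces the table lookup, not the method used in class.

```python
from statistics import NormalDist

average, sd = 70, 10  # calculus test scores from the example

# Value in standard units with 90.32% of the area to its left.
z = NormalDist().inv_cdf(0.9032)

# Convert from standard units back to a test score.
score = z * sd + average
print(round(z, 2), round(score))
```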

16 Median and Interquartile The median is another way to locate the center of a histogram, with half the area to the left and half to the right. (The 50th percentile.) The interquartile range = 75th percentile - 25th percentile. When the distribution has a long tail, we use the median as the center of the histogram, and we use the interquartile range as a measure of spread.
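A small sketch of why the median and interquartile range are preferred for long-tailed data. The list is hypothetical, with one extreme entry, and its length is chosen so that the 25th and 75th percentiles fall on whole entries under the list convention used earlier.

```python
import statistics

# Hypothetical long-tailed list: one entry is far out in the right tail.
entries = [1, 2, 2, 3, 3, 4, 5, 100]

ordered = sorted(entries)
n = len(ordered)

# 8 x 25% = 2 and 8 x 75% = 6 are whole numbers, so both percentiles are defined.
q1 = ordered[n * 25 // 100 - 1]  # 2nd entry after ordering
q3 = ordered[n * 75 // 100 - 1]  # 6th entry after ordering
iqr = q3 - q1

median = statistics.median(entries)
avg = statistics.mean(entries)
sd = statistics.pstdev(entries)  # r.m.s. of the deviations from the average

print("median:", median, "interquartile range:", iqr)
print("average:", avg, "SD:", round(sd, 1))
```

The single large entry drags the average and SD far from where most of the list sits, while the median and interquartile range barely notice it.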

17 Change of scale Adding the same number to every entry on a list adds that constant to the average; the SD does not change. Multiplying every entry on a list by the same positive number multiplies the average and the SD by that constant. These changes of scale do not change the standard units.
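These three facts can be checked numerically. The list and the constants 7 and 3 below are hypothetical, chosen for illustration.

```python
import math
import statistics


def standard_units(entries):
    """Convert every entry to standard units: (entry - average) / SD."""
    avg = statistics.mean(entries)
    sd = statistics.pstdev(entries)  # r.m.s. of the deviations from the average
    return [(x - avg) / sd for x in entries]


entries = [20, 10, 15, 15]         # hypothetical list
shifted = [x + 7 for x in entries]  # add the same number to every entry
scaled = [3 * x for x in entries]   # multiply every entry by a positive number

print(statistics.mean(entries), statistics.pstdev(entries))
print(statistics.mean(shifted), statistics.pstdev(shifted))  # average +7, same SD
print(statistics.mean(scaled), statistics.pstdev(scaled))    # average and SD x3
print(standard_units(entries))
print(standard_units(shifted))
print(standard_units(scaled))
```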

18 Measurement error Chance errors change from measurement to measurement, sometimes up and sometimes down. Bias affects all measurements the same way, pushing them in the same direction. If there is no bias in a measurement procedure, then the long-run average of repeated measurements should give the exact value of the thing being measured: the chance errors should cancel out. If there is bias, then the long-run average will itself be either too high or too low. Bias cannot be detected just by looking at the measurements themselves.
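A small simulation can illustrate the model "measurement = exact value + bias + chance error". Everything below is assumed for illustration: the exact value of 10, the bias of 0.5, and chance errors that follow the normal curve with SD 1.

```python
import random
import statistics


def measure(exact_value, bias, chance_sd, n, seed=0):
    """Simulate n repeated measurements: exact value + bias + chance error.

    Assumption: chance errors follow the normal curve with average 0.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [exact_value + bias + rng.gauss(0, chance_sd) for _ in range(n)]


unbiased = measure(exact_value=10.0, bias=0.0, chance_sd=1.0, n=100_000)
biased = measure(exact_value=10.0, bias=0.5, chance_sd=1.0, n=100_000)

# The chance errors cancel out in the long-run average; the bias does not.
print(round(statistics.mean(unbiased), 2), round(statistics.mean(biased), 2))

# The spread of the measurements is the same with or without bias.
print(round(statistics.pstdev(unbiased), 2), round(statistics.pstdev(biased), 2))
```

The two sets of measurements have the same spread, so nothing in the biased set by itself reveals the bias; it only shows up when the exact value is known from some other source.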

19 Size of the chance error The likely size of the chance error in a single measurement can be estimated by the SD of repeated measurements. Example: Homework Set 3, problem 5.

20 Probability The probability of something gives the percentage of times the thing is expected to happen, when the basic process is repeated over and over again. Probabilities are between 0% and 100%. Impossibility is represented by 0%, certainty by 100%. The probability of something equals 100% minus the probability of the opposite thing. For example: We draw a ticket from a box with tickets: 1,2,3,4,5. Then the probability of drawing a number 4 or more is 2/5. The probability of drawing a number 3 or less is 1 – 2/5 = 3/5.
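The ticket example can be checked by counting, assuming each ticket is equally likely to be drawn. The helper name `chance` is hypothetical.

```python
from fractions import Fraction

box = [1, 2, 3, 4, 5]


def chance(event, tickets):
    """Fraction of tickets for which the event happens (equally likely tickets)."""
    favorable = sum(1 for ticket in tickets if event(ticket))
    return Fraction(favorable, len(tickets))


p_four_or_more = chance(lambda ticket: ticket >= 4, box)
p_three_or_less = 1 - p_four_or_more  # the opposite event

print(p_four_or_more, p_three_or_less)
```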

21 Formulas for probability The multiplication rule: P(A, B) = P(A|B) x P(B). The conditional probability: P(A|B) = P(A, B) / P(B). Two events are independent if the chances for the second one stay the same no matter how the first one turns out: P(A|B) = P(A). Consequence of independence: P(A, B) = P(A) x P(B). For example, we draw twice at random with replacement from a box with tickets: 1,2,3,4,5. The two draws are independent, so the probability of the first draw being a number 4 or more and the second draw being a number 3 or less is: 2/5 x 3/5 = 6/25 = 0.24.
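The formulas can be verified by listing every equally likely outcome of two draws. The sketch assumes the draws are made with replacement, which is what the product 2/5 x 3/5 requires; B stands for "first draw is 4 or more" and A for "second draw is 3 or less".

```python
from fractions import Fraction
from itertools import product

box = [1, 2, 3, 4, 5]

# All 25 equally likely (first draw, second draw) pairs, with replacement.
pairs = list(product(box, repeat=2))


def chance(event):
    """Fraction of the equally likely pairs for which the event happens."""
    return Fraction(sum(1 for pair in pairs if event(pair)), len(pairs))


p_b = chance(lambda pair: pair[0] >= 4)  # B: first draw is 4 or more
p_a = chance(lambda pair: pair[1] <= 3)  # A: second draw is 3 or less
p_a_and_b = chance(lambda pair: pair[0] >= 4 and pair[1] <= 3)

p_a_given_b = p_a_and_b / p_b  # conditional probability P(A|B)

print(p_a_and_b, p_a_given_b, p_a)
```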

22 Drawing tickets from a box When we draw tickets at random, all tickets in the box have the same chance of being picked. Draws made at random with replacement are independent. Without replacement, the draws are dependent (apart from some extreme cases).
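To see the difference, the two-draw outcomes can be listed both ways. The sketch below assumes a box with tickets 1,2,3,4,5 and asks for the chance that the second draw is 3 or less, given that the first draw is 4 or more.

```python
from fractions import Fraction
from itertools import permutations, product

box = [1, 2, 3, 4, 5]


def chance_second_low(pairs, given_first_high):
    """Chance that the second draw is 3 or less, optionally given first draw >= 4."""
    if given_first_high:
        pairs = [pair for pair in pairs if pair[0] >= 4]
    favorable = sum(1 for pair in pairs if pair[1] <= 3)
    return Fraction(favorable, len(pairs))


with_replacement = list(product(box, repeat=2))  # 25 equally likely pairs
without_replacement = list(permutations(box, 2))  # 20 equally likely pairs

# With replacement: the chance is the same whether or not we condition.
print(chance_second_low(with_replacement, False), chance_second_low(with_replacement, True))

# Without replacement: conditioning on the first draw changes the chance.
print(chance_second_low(without_replacement, False), chance_second_low(without_replacement, True))
```

Without replacement, removing a high ticket on the first draw leaves three low tickets among the four that remain, so the conditional chance rises from 3/5 to 3/4.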

23 Good Luck!

