Statistics - a methodology for collecting, analyzing, interpreting, and drawing conclusions from collected data
Anastasia Kadina GM presentation 6/15/
1. Design: planning and carrying out research studies;
2. Description: summarizing and exploring data;
3. Inference: making predictions and generalizing about phenomena represented by the data.
Homer Simpson: "Aw, you can come up with statistics to prove anything, Kent. 14 percent of all people know that."
Population - the collection of all individuals or items under consideration in a statistical study
Sample - the part of the population from which information is collected
Parameter - a statistical description of the population
[Figure labels: Population, Sample, Statistical Data Analysis]
Variable - a characteristic that varies from one item to another
Quantitative (numerical): Discrete or Continuous
Qualitative (categorical)
Observing the values of the variables yields data
Observation - an individual piece of data
Data set / data matrix - the collection of observations for the variables
Data matrix: k variables measured in a sample of size n
Presenting data
Relative frequency = frequency / total number of observations
[Figure: sample and population distributions]
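The relative-frequency definition can be sketched in a few lines of Python. The function name `relative_frequencies` and the color data are invented for illustration; they are not part of the presentation.

```python
from collections import Counter

def relative_frequencies(observations):
    """Map each distinct value to frequency / total number of observations."""
    total = len(observations)
    counts = Counter(observations)
    return {value: count / total for value, count in counts.items()}

# Hypothetical categorical sample, invented for illustration only.
colors = ["red", "blue", "red", "green", "red", "blue", "red", "blue"]
rel = relative_frequencies(colors)
print(rel)
```

The relative frequencies of a data set always add up to 1, which is a quick sanity check on any frequency table.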
Measures of center (averages)
1. The mode: the value that occurs with the highest frequency.
   Example: 4, 2, 5, 2, 6, 1, 2: mode = 2 (2 occurs with the greatest frequency).
   If the greatest frequency is 1, there is no mode. There can be more than one mode.
2. The median: arrange the observed values of the variable in increasing order.
   a. Number of observations is odd: the median is the value in the middle.
   b. Number of observations is even: the median is the number halfway between the two middle values.
   Example: 2, 5, 7, 8, 9, 11 (6 observations): median = (7 + 8) / 2 = 7.5.
3. The sample mean: the sum of the observed values divided by the number of observations: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
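The three measures of center can be checked against the slide's own examples with Python's standard library. This is a minimal sketch: the helper `modes` is a hypothetical name, written to follow the slide's convention that a data set whose greatest frequency is 1 has no mode, and it assumes a non-empty data set.

```python
from collections import Counter
from statistics import mean, median

def modes(values):
    """Return all modes in increasing order, or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())          # greatest frequency
    if top == 1:                        # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

mode_example = [4, 2, 5, 2, 6, 1, 2]
median_example = [2, 5, 7, 8, 9, 11]

print(modes(mode_example))      # [2]
print(median(median_example))   # 7.5
print(mean(mode_example))       # sum of values / number of observations
```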
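Measures of variability
1. Range: range = max - min
2. Standard deviation: for a variable x, the standard deviation is denoted by $s_x$ (or $\sigma_x$) for a sample and by $\sigma$ for a population:
   Sample: $s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
   Population: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$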
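A short sketch of both measures of variability, assuming the usual conventions: the sample standard deviation divides by n - 1 (`statistics.stdev`) and the population standard deviation divides by N (`statistics.pstdev`). The data values are invented for illustration.

```python
from statistics import pstdev, stdev

# Hypothetical observations, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)   # range = max - min
sample_sd = stdev(data)              # treats data as a sample (divides by n - 1)
population_sd = pstdev(data)         # treats data as the whole population (divides by N)

print(data_range, sample_sd, population_sd)
```

For the same data the sample value is always slightly larger than the population value, because it divides the same sum of squared deviations by a smaller number.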
Z-Score (Standard score)
How many standard deviations a value lies above or below the mean of the data set:
   Sample: $z = \frac{x - \bar{x}}{s_x}$
   Population: $z = \frac{x - \mu}{\sigma}$
For a normal distribution, the probability of an event (the area under the curve) can be looked up in tables by z.
Empirical rule for a symmetric, normal distribution:
   about 68% of the values lie within $\bar{x} \pm s_x$,
   about 95% of the values lie within $\bar{x} \pm 2s_x$,
   about 99.7% of the values lie within $\bar{x} \pm 3s_x$.
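As a rough check of the empirical rule, the area under the standard normal curve within k standard deviations of the mean can be computed with `statistics.NormalDist` instead of a printed table. The helper names `z_score` and `area_within`, and the example numbers, are illustrative assumptions.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    std_normal = NormalDist()            # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

print(z_score(130, 100, 15))             # 2.0
for k in (1, 2, 3):
    print(k, round(area_within(k), 3))   # 0.683, 0.954, 0.997
```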
Z-Score (Standard score)
$z_\alpha$: the value of z for which the area under the standard normal curve to its right is equal to $\alpha$.
To take both tails of the distribution into account, we use $z_{\alpha/2}$.
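The critical value can also be computed rather than read from a table. A minimal sketch, assuming Python 3.8+ for `statistics.NormalDist`; the function name `z_alpha` is hypothetical.

```python
from statistics import NormalDist

def z_alpha(alpha):
    """Value of z with area alpha to its right under the standard normal curve."""
    # Area alpha to the right means area 1 - alpha to the left.
    return NormalDist().inv_cdf(1 - alpha)

alpha = 0.05
print(round(z_alpha(alpha), 3))       # one tail:  1.645
print(round(z_alpha(alpha / 2), 2))   # two tails: 1.96
```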
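Sampling of the population
Random sample - a sample from a finite population is random if it is chosen in such a way that each of the possible samples has the same probability of being selected.
For a random sample of size n from a population of size N:
Mean of the sampling distribution = population mean: $\mu_{\bar{x}} = \mu$
Standard deviation of the sampling distribution (standard error of the mean):
   Infinite population: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
   Finite population: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$
Finite population correction factor: $\sqrt{\frac{N - n}{N - 1}}$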
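The standard error of the mean, with and without the finite population correction, might be computed as follows. This sketch assumes the usual correction factor sqrt((N - n) / (N - 1)); the function name and the numbers are invented for illustration.

```python
from math import sqrt

def standard_error(sigma, n, population_size=None):
    """Standard error of the mean for a sample of size n.

    If population_size (N) is given, apply the finite population
    correction factor sqrt((N - n) / (N - 1)).
    """
    se = sigma / sqrt(n)
    if population_size is not None:
        se *= sqrt((population_size - n) / (population_size - 1))
    return se

se_infinite = standard_error(sigma=10, n=25)                      # 2.0
se_finite = standard_error(sigma=10, n=25, population_size=100)   # smaller than 2.0
print(se_infinite, se_finite)
```

The correction matters only when the sample is a sizable fraction of the population; when N is much larger than n the factor is close to 1.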
Central Limit Theorem
For large samples, the sampling distribution of the mean can be closely approximated by a normal distribution.
Large: sample size n >= 30
$\mu_{\bar{x}} = \mu$
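A small simulation can illustrate the theorem. The choices below are assumptions made for this sketch, not part of the presentation: a skewed exponential population with mean 1 and standard deviation 1, 2000 repeated samples, and a fixed random seed.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable

n = 30                   # "large" sample size by the n >= 30 rule of thumb
num_samples = 2000       # number of repeated samples (arbitrary choice)

# Skewed, clearly non-normal population: exponential with mean 1 and sd 1.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)   # should be close to the population mean, 1
sd_of_means = stdev(sample_means)    # should be close to sigma / sqrt(n)
print(mean_of_means, sd_of_means, 1 / sqrt(n))
```

Even though individual observations are far from normal, a histogram of `sample_means` would look roughly bell-shaped around the population mean.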
Probability and Confidence of Statements
$z_\alpha$ denotes the value of z for which the area under the standard normal curve to its right is equal to $\alpha$.
$z_{\alpha/2}$ is the value such that the area under the standard normal curve between $-z_{\alpha/2}$ and $+z_{\alpha/2}$ is equal to $1 - \alpha$.
$\mu_{\bar{x}} = \mu$
When we use $\bar{x}$ as an estimate of $\mu$, the probability is $1 - \alpha$ that this estimate will be "off" either way by at most $E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$, where $\frac{\sigma}{\sqrt{n}}$ is the standard error.
In general, we make probability statements about future values of random variables (e.g. the potential error of an estimate) and confidence statements once the data have been obtained.
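The maximum error E can be sketched as a short function, assuming a known population standard deviation. The name `margin_of_error` and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Maximum error E = z_(alpha/2) * sigma / sqrt(n), with probability 1 - alpha."""
    alpha = 1 - confidence
    z_half_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return z_half_alpha * sigma / sqrt(n)

# Hypothetical values: sigma = 15, n = 36, 95% confidence.
E = margin_of_error(sigma=15, n=36, confidence=0.95)
print(round(E, 2))   # 4.9
```

Because E shrinks with the square root of n, quadrupling the sample size halves the maximum error.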
Confidence intervals
For large samples (n >= 30) with known $\sigma$.
The probability is $1 - \alpha$ that a random variable having the standard normal distribution will take on a value between $-z_{\alpha/2}$ and $+z_{\alpha/2}$:
   $-z_{\alpha/2} < Z < z_{\alpha/2}$
   $-z_{\alpha/2} < \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} < z_{\alpha/2}$
Confidence interval:
   $\bar{x} - z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$
As we increase the degree of certainty, namely the degree of confidence $1 - \alpha$, the confidence interval becomes wider and thus tells us less about the quantity we are trying to estimate.
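A minimal sketch of the large-sample interval, assuming the population standard deviation is known. The function name and the sample figures are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Large-sample confidence interval for the population mean (sigma known)."""
    alpha = 1 - confidence
    z_half_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_half_alpha * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical sample: mean 50, known sigma 10, n = 100.
low95, high95 = z_confidence_interval(50, 10, 100, confidence=0.95)
low99, high99 = z_confidence_interval(50, 10, 100, confidence=0.99)
print(round(low95, 2), round(high95, 2))   # 48.04 51.96
print(round(low99, 2), round(high99, 2))   # wider than the 95% interval
```

Comparing the two results shows the trade-off described on the slide: the 99% interval is wider than the 95% interval for the same data.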
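Student's t-test
Also good for small samples (n < 30) and/or when the population standard deviation is unknown; the t-distribution has roughly the shape of the normal distribution.
Degrees of freedom: df = n - 1
t-score: $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
Small-sample confidence interval:
   $\bar{x} - t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$
$t_{\alpha/2}$ can be found in the corresponding tables by df and $\alpha$.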
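The small-sample interval can be sketched with the standard library alone if the critical value is taken from a t table, as the slide suggests. The function name and the measurements are invented; the value 2.262 is the standard table entry for df = 9 and a 95% two-sided interval.

```python
from math import sqrt
from statistics import mean, stdev

def t_confidence_interval(sample, t_crit):
    """Small-sample confidence interval for the mean.

    t_crit is t_(alpha/2) for df = len(sample) - 1, looked up in a t table.
    """
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)                 # sample standard deviation (n - 1 divisor)
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical measurements, n = 10, so df = 9.
measurements = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 9.7, 10.3, 10.0, 9.6]
low, high = t_confidence_interval(measurements, t_crit=2.262)
print(round(low, 3), round(high, 3))
```

Libraries such as SciPy can compute the critical value directly; passing it in as a parameter keeps this sketch free of third-party dependencies.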
Error Bars - a graphical representation of the variability of data, used on graphs to indicate the error or uncertainty in a reported measurement
Common Error Bars
Test of Hypotheses
A statistical hypothesis is an assertion about the parameter(s) of a population.
Null hypothesis ($H_0$) - any hypothesis set up primarily to see whether it can be rejected (it is the hypothesis that is directly tested);
Alternative hypothesis ($H_A$) - the hypothesis that we accept when the null hypothesis can be rejected.
A significance test is a way of statistically testing a hypothesis by comparing the data to the values predicted by the hypothesis. Data that fall far from the predicted values provide evidence against the hypothesis.
If the difference between what we expect and what we observe is so small that it may well be attributed to chance, the results are not statistically significant.
The test statistic is a statistic calculated from the sample data to test the null hypothesis. It typically involves a point estimate of the parameter to which the hypotheses refer.
p-value - the probability, when $H_0$ is true, of a test statistic value at least as contradictory to $H_0$ as the value actually observed. The smaller the p-value, the more strongly the data contradict $H_0$.
The p-value is the primary reported result of a significance test. It summarizes the evidence in the data about the null hypothesis.
A moderate to large p-value means that the data are consistent with $H_0$.
Most studies require a very small p-value, such as p <= 0.05, before concluding that the data sufficiently contradict $H_0$ to reject it. In such cases, results are said to be significant at the 0.05 level. This means that if the null hypothesis were true, the chance of getting results as extreme as those in the sample data would be no greater than 5%.
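As one concrete instance of a significance test, a two-sided one-sample z-test might look like the sketch below. It assumes a large sample with known population standard deviation; the function name, the hypothesized mean, and the sample figures are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(x_bar, mu_0, sigma, n):
    """Return (z, p_value) for H0: mu = mu_0 against HA: mu != mu_0."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))    # test statistic
    # Probability, under H0, of a statistic at least this far from 0 in either tail.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: sample mean 52 from n = 100, testing H0: mu = 50 with sigma = 10.
z, p_value = z_test_two_sided(x_bar=52, mu_0=50, sigma=10, n=100)
significant = p_value <= 0.05
print(z, round(p_value, 4), significant)   # 2.0 0.0455 True
```

With these numbers the result is significant at the 0.05 level but would not be at the 0.01 level, which shows why the chosen significance level should be reported alongside the p-value.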