Published by Zoe Holland. Modified over 7 years ago.
1
What is Statistics? Thanks to Texas A&M University at College Station, TX for giving me a wonderful opportunity to advance my teaching of Statistics. A special thanks to Dr. Jim Matis and Dr. Julie H. Carroll for their inspiration and dedication to improving the teaching of statistics at the undergraduate level. Ask students why they would want to learn statistics. Besides the graduation requirement, ask whether they ever read the sports summary statistics after games, watch analysts predict stock market movements, or watch the weather forecast for tomorrow: ALL of these involve statistics. Have they ever participated in a survey or experiment (e.g., phone surveys, internet surveys, medical experiments)? So what is Statistics? See if any volunteers will attempt to answer.
2
Statistics is a collection of procedures and principles for gathering data and analyzing information in order to help people make decisions when faced with uncertainty. Text: Mind On Statistics, by Jessica Utts and Robert Heckard Pg. 1—Stress definition of Statistics.
3
7 Statistical Stories w/Morals
Speedy Drivers; Disaster in the Skies; Dating; Angry Women; Prayer & Pressure; Heart Attacks; Internet Loneliness
Text: Mind On Statistics, by Jessica Utts and Robert Heckard
Humpty Dumpty sat on the wall,
Humpty Dumpty had a great fall.
All the king’s horses
And all the king’s men
Couldn’t put Humpty Dumpty
Together again.
We could all make a moral for this story, such as “Stay focused” or “The higher you get, the greater the fall.” Ch. 1 contains 7 case studies that will be referred to continuously in the textbook. Read the moral first, then the case study.
4
“Data are used to make a judgment about a situation”
What question needs to be answered?
How should we collect data, and how much?
How can we summarize the data?
What decisions or generalizations can be made in regard to the question, based on the data collected?
Stress that there has to be a question first, before we gather data. Today’s presentation covers summarizing data; tomorrow we will discuss how to collect data and how much to collect. The meat of this course is the analysis of data. Every day we will work on generalizations (in the context of the situation). Ex.: How many of you drove to school today? Do you have an established route picked out? Was that route chosen based on personal experience with traffic lights, road construction, and parking space? That experience is data. Your analysis led you to your route. Your generalization is the route you travel to school on a daily basis.
5
Population Data vs. Sample Data
Population: everyone or everything of interest
Parameters: summary measurements of a population, denoted by Greek letters (such as μ for the population mean)
Sample: a representative smaller “subset” of the population
Statistics: summary measurements of a sample, denoted by standard letters (such as p, or x̄ for the sample mean)
Stress that we infer THINGS about the population using statistics gathered from samples. Many times it is impossible to measure an entire population, so sample statistics are used to inform us of what is happening with the population of interest.
6
Data--Types of Variables
Categorical: group of category names with no order. Example: eye color (brown, blue, green)
Ordinal: categorical variable with order. Example: T-shirt size (S, M, L, XL, XXL)
Quantitative: numerical values taken from an individual. Example: weight (117 lbs, 170 lbs, 253 lbs)
Keep this simple. May want to discuss discrete vs. continuous quantitative variables.
7
Summarizing Data w/ Bar graph of Categorical Data
Tables are used to organize data collected by case number and variables. “Fancier” tables can contain percents, totals, etc. Summary statistics can contain information such as average value, proportions, and minimum and maximum values. A bar chart is used with categorical data to display the frequency within the different categories of a variable.
8
Histogram—Univariate quantitative data
A histogram is used to graphically display univariate quantitative data, as the example shows. Smaller data sets can be sketched by hand with 5 to 7 equal-width intervals. (Note: In Stat 303, we will be using the computer to generate graphs.) The vertical axis represents count (frequency), or it could represent percent (relative frequency).
9
Comparative Boxplots—Univariate quantitative data by category
A boxplot can be used to graphically display univariate data. As in the example here, a quick comparison can be made by separating the data by the categorical variable, gender. The five-number summary values (minimum, quartile 1, median, quartile 3, and maximum) are the breaking points in a boxplot. If outliers exist, a modified boxplot plots them as separate points (labeled “Outliers” in the figure) beyond the whiskers rather than including them in the whiskers.
10
Scatterplot—Bivariate quantitative data
From the beginning of USA participation in the Olympics to 1996, the bivariate data represent the relationship between the year in which the USA competed in the Olympics and the first-place distance in the men’s long jump field event. What pattern exists in the graph? A linear trend. In 1920, the first-place finisher would not have won the event in the prior years of 1900, 1904, 1908, and 1912. Remember that, due to the war, no Olympics was held in 1916.
11
Summary Features of Quantitative Variables
Location: center (average)
Spread: variability
Shape: distribution pattern of the data
CSS: Center, Spread, Shape. You need to be able to eyeball this information from a graph. The dotplot has 500 temperatures recorded at the South Pole for 379 months. Approximately where is the center? Median –54.6 or mean –49.4. Approximately how spread out are the data? Overall from –69 to –21, or 48 degrees of variability. Approximately what shape do the data show? Trimodal, representing the “3” seasons (short summer, short fall & spring, and long winter). Any outliers? Potentially, but not certain without further investigation.
12
Location: Center
Mean (μ, x̄): add up all data values and divide by the number of data values
Median: list data values in order, locate the middle data value
Data Set: 19, 20, 20, 21, 22
Mean is 20.4. Median is 20 since it is the middle number of the ranked data values.
The mean and median of a data set may or may NOT be one of the values in the data set. Inform students that the mu symbol (μ) represents the population mean and x bar (x̄) represents the sample mean. Remind students that the data must be ranked prior to finding the median by hand. Ask them to refer to their textbook for step-by-step instructions.
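As a quick check of the slide's example, the two definitions can be sketched in Python. This is only an illustration using the standard library; the variable names are chosen here and are not part of the presentation.

```python
from statistics import median

data = [19, 20, 20, 21, 22]  # data set from the slide

# Mean: add up all data values and divide by the number of data values.
data_mean = sum(data) / len(data)

# Median: rank the data first, then locate the middle value.
# (statistics.median averages the two middle values when the count is even.)
data_median = median(sorted(data))

print(data_mean)
print(data_median)
```

Here the median (20) is one of the data values while the mean (20.4) is not, which matches the point made in the notes.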
13
Describing Spread: Range, Quartiles, and Interquartile Range
Range = maximum – minimum
Q1 (Quartile 1) is the 25th percentile of ordered data, or the median of the lower half of ordered data
Median (Q2) is the 50th percentile of ordered data
Q3 (Quartile 3) is the 75th percentile of ordered data, or the median of the upper half of ordered data
IQR (Interquartile Range) = Q3 – Q1
Any point that falls outside the interval from Q1 – 1.5(IQR) to Q3 + 1.5(IQR) is considered an outlier.
The percentile concept is implied, but stress to students that other percentiles exist, such as the 56th or the 98th percentile, and explain what these mean. The kth percentile means that k% of the ordered data values are at or below that data value. For example, if the median is 100, then 50% of the ordered data values fall at or below 100. Also, (100 – k)% represents the amount of ordered data that falls above the percentile data value. The outlier formula above creates an interval; any data value that falls outside that interval is considered an outlier. We often use this combined with boxplots.
14
Boxplot—5 Number Summary
Computers (x1000): 250, 1000, 2400, 3500, 5400, 8600
Five-number summary: min = 250, Q1 = 1000, median = 2950, Q3 = 5400, max = 8600
Data from The Presence of Computers in American Schools, by Ronald E. Anderson and Amy Ronnkvist, Teaching, Learning, and Computing: 1998 Survey, Report #2, Center for Research on Information Technology and Organizations, The University of California, Irvine and The University of Minnesota.
Stress that 25% of the ordered data falls within each of the four intervals: min to Q1, Q1 to median, median to Q3, and Q3 to max. Although the intervals appear different in length, the amount of data is the same within each interval. The difference in spread indicates a difference in data variation within each interval (not amount of data). Remember the IQR contains the middle 50% of the ordered data set.
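The five-number summary and the 1.5(IQR) outlier rule can be checked against the six values on this slide. The sketch below assumes the "median of the lower/upper half" definition of the quartiles, leaving the middle value out of both halves when the count is odd. Textbooks and software differ on that detail, so other tools may report slightly different quartiles. The function names are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # lower half of the ordered data
    upper = ordered[half + n % 2:]   # upper half (skips the middle value if n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

def outlier_fences(q1, q3):
    """Return the interval (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

computers = [250, 1000, 2400, 3500, 5400, 8600]  # values shown on the slide (x1000)

summary = five_number_summary(computers)
low, high = outlier_fences(summary[1], summary[3])
outliers = [x for x in computers if x < low or x > high]

print(summary)    # min, Q1, median, Q3, max
print(low, high)  # outlier fences
print(outliers)   # values outside the fences
```

For these values the fences are far outside the data, so no point is flagged as an outlier.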
15
Describing Spread: Standard Deviation
Roughly speaking, standard deviation is the average distance values fall from the mean. Let the arrow mark the center, the mean. Each ring measures an average distance from the center. Stress that the rings are of equal width (standard).
16
Population and Sample Standard Deviation
σ²: population variance
s²: sample variance
Be sure to go back over what each letter stands for in both formulas. Remind students what operation is performed by the summation symbol AND that a calculator or computer software will calculate these for them. What is variance? Variance is another measure of spread and is calculated by squaring the standard deviation value. Students may ask why one formula divides by n and the other by n – 1. Later in the course it hopefully will become clearer.
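The two formulas the notes refer to did not survive in this transcript. Assuming they are the standard textbook definitions, with N the population size and n the sample size (notation assumed here, not taken from the slide), they would read:

```latex
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
\qquad\qquad
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
```

Squaring either expression gives the corresponding variance, σ² or s².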
17
Calculated Standard Deviations
Sample Data Set              Mean    Standard Deviation
100, 100, 100, 100, 100      100     0
90, 90, 100, 110, 110        100     10
30, 90, 100, 110, 170        100     50
90, 90, 100, 110, 320        142     99.85
The first data set contains all the same value, so the mean is obvious and hopefully the standard deviation is too. Since there is no variation in the data set, the standard deviation is zero. The second data set has a simple mean to calculate (mentally), but the standard deviation can be calculated using the formula on a side board. The third example has data values that are more spread out, therefore the standard deviation should be higher. The fourth example contains an outlier, which dramatically affects the mean and standard deviation.
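The values in this table can be reproduced with Python's standard library. This is a minimal sketch that treats each data set as a sample, so `statistics.stdev` divides by n – 1. That matches the 10, 50, and 99.85 shown on the slide.

```python
from statistics import mean, stdev

data_sets = [
    [100, 100, 100, 100, 100],
    [90, 90, 100, 110, 110],
    [30, 90, 100, 110, 170],
    [90, 90, 100, 110, 320],
]

# (mean, sample standard deviation) for each data set
results = [(mean(values), stdev(values)) for values in data_sets]

for values, (m, s) in zip(data_sets, results):
    print(f"{values}: mean = {m}, s = {s:.2f}")
```

Swapping `stdev` for `statistics.pstdev` would apply the population formula (dividing by n) and give smaller values for the last three data sets.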
18
Various Distributions
Name several types of distributions, such as the normal and chi-square distributions shown here, and the Student's t and F distributions that students will work with but that are not shown in this slide. Each distribution has specific characteristics. The normal distribution is known as the symmetric bell-shaped curve. Depending on the standard deviation, the normal distribution can be tall or short but still resembles a bell-shaped curve. The mean of a normal distribution shifts the graph along the horizontal axis, and the standard deviation changes the spread of the distribution (but not the general shape of a bell-shaped curve). The chi-square distribution’s shape depends on its degrees of freedom, which will be discussed at a later time.
19
Empirical Rule (figure labels: 34% + 34% = 68%; 47.5% + 47.5% = 95%; 49.85% + 49.85% = 99.7%)
Take time to slowly click through the slide. Stress that the Empirical Rule can ONLY be used with the assumption that the distribution is normal (bell-shaped curve). Sixty-eight percent of the ordered data of a normal distribution lies within one standard deviation of the mean. Ninety-five percent of the ordered data of a normal distribution lies within two standard deviations of the mean. And 99.7% of the ordered data of a normal distribution lies within three standard deviations of the mean. The normal distribution shown in this example has a mean of 0 and a standard deviation of 1.
20
Empirical Rule: 68-95-99.7% Rule
The Empirical Rule is sometimes referred to as the 68-95-99.7% Rule. Again, to use the Empirical Rule, the distribution of the data must be a normal, bell-shaped curve.
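The three percentages are rounded values of areas under the standard normal curve. A short sketch using `statistics.NormalDist` from the Python standard library (3.8 or later) shows where they come from. The variable names are chosen here for illustration.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```

The computed areas are about 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.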
21
Empirical Rule-- Let H~N(69, 2.5²)
What is the likelihood that a randomly selected adult male would have a height less than 69 inches? P(h < 69) = .50 H~N(69, 2.5²) means that H, the variable representing the height of an adult male, has a normal distribution with a population mean of 69 inches and a population standard deviation of 2.5 inches (a variance of 2.5² = 6.25). Start with an easy problem using the empirical rule AND introduce proper probability notation. Explain that P implies probability (likelihood of an event), H represents the random variable of height of an adult male, and P(h < 69) is the question in statistical notation.
22
Empirical Rule—restated
68% of the values fall within 1 standard deviation of the mean in either direction
95% of the values fall within 2 standard deviations of the mean in either direction
99.7% of the values fall within 3 standard deviations of the mean in either direction
Note: If the range ÷ 6 doesn’t roughly equal the standard deviation, the data may contain outliers or have a skewed shape. Rule of Thumb: The range divided by 6 is approximately the standard deviation.
23
Using the Empirical Rule
Let H~N(69, 2.5²) What is the likelihood that a randomly selected adult male will have a height between 64 and 74 inches? P(64 < h < 74) = .95 H~N(69, 2.5²) means that H, the variable representing the height of an adult male, has a normal distribution with a population mean of 69 inches and a population standard deviation of 2.5 inches (a variance of 2.5² = 6.25).
24
Using Empirical Rule-- Let H~N(69, 2.5²)
What is the likelihood that a randomly selected adult male would have a height of less than 66.5 inches? P(h < 66.5) = 1 – (.34 + .50) = 1 – .84 = .16 There are several ways to do this problem. Some students may see it as .50 – .34 to obtain .16. If students obtained an answer of .16 but did it differently, please ask them to discuss their method.
25
Using Empirical Rule-- Let H~N(69, 2.5²)
What is the likelihood that a randomly selected adult male would have a height of greater than 74 inches? P(h > 74) = 1 – (.50 + .475) = 1 – .975 = .025
26
Using Empirical Rule-- Let H~N(69, 2.5²)
What is the probability that a randomly selected adult male would have a height between 64 and 76.5 inches? P(64 < h < 76.5) = .475 + .4985 = .9735
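For comparison, the empirical-rule answers on the last three slides can be set beside the exact normal probabilities. The sketch below uses `statistics.NormalDist` from the Python standard library with the slide's mean of 69 and standard deviation of 2.5. The variable names are illustrative only.

```python
from statistics import NormalDist

height = NormalDist(mu=69, sigma=2.5)  # H ~ N(69, 2.5^2)

# Exact normal probabilities for the three questions
p_below_66_5 = height.cdf(66.5)                     # P(h < 66.5)
p_above_74 = 1 - height.cdf(74)                     # P(h > 74)
p_64_to_76_5 = height.cdf(76.5) - height.cdf(64)    # P(64 < h < 76.5)

# Empirical-rule approximations from the slides
print(f"P(h < 66.5):      exact {p_below_66_5:.4f}  vs. rule .16")
print(f"P(h > 74):        exact {p_above_74:.4f}  vs. rule .025")
print(f"P(64 < h < 76.5): exact {p_64_to_76_5:.4f}  vs. rule .9735")
```

The exact values (about 0.1587, 0.0228, and 0.9759) differ slightly from the slide answers because the empirical rule uses rounded percentages.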
27
Standardized Z-Score To get a Z-score, you need to have 3 things
Observed value of X
Population mean, μ
Population standard deviation, σ
Then follow the formula: z = (x – μ) / σ.
Inform students that z-scores are used to compare similar information. Also, the Z-distribution is the normal distribution with a population mean of 0 and a population standard deviation of 1: Z~N(0, 1).
28
Empirical Rule & Z-Score
About 68% of values in a normally distributed data set have z-scores between –1 and 1; 95% of the values have z-scores between –2 and 2; and 99.7% of the values have z-scores between –3 and 3. Since the Z-distribution is normal, the empirical rule holds.
29
Z-Score & Let H~N(69, 2.5²) What would be the standardized score for an adult male who stood 71.5 inches? Slowly go through the steps on how to calculate the z-score. Remember to explain notation. H~N(69, 2.5²) Z~N(0, 1²)
30
Z-Score & Let H~N(69, 2.5²) What would be the standardized score for an adult male who stood 65.25 inches? h = 65.25 has z = -1.5
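The z-score calculation for the two height questions can be sketched as follows, assuming the standard formula z = (x – μ) / σ with the slide's μ = 69 and σ = 2.5. The function names are hypothetical.

```python
def z_score(x, mu, sigma):
    """Standardize an observed value: how many standard deviations x lies from mu."""
    return (x - mu) / sigma

def value_from_z(z, mu, sigma):
    """Invert the z-score formula to recover the observed value."""
    return mu + z * sigma

MU, SIGMA = 69, 2.5  # H ~ N(69, 2.5^2)

z_for_71_5 = z_score(71.5, MU, SIGMA)         # one standard deviation above the mean
h_for_z_neg = value_from_z(-1.5, MU, SIGMA)   # height that sits 1.5 SDs below the mean

print(z_for_71_5)
print(h_for_z_neg)
```

A height of 71.5 inches standardizes to z = 1, and z = -1.5 corresponds to a height of 65.25 inches.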