What is Statistics? Thanks to Texas A&M University at College Station, TX for giving me a wonderful opportunity to advance my teaching of Statistics.


1 What is Statistics? Thanks to Texas A&M University at College Station, TX for giving me a wonderful opportunity to advance my teaching of Statistics. A special thanks to Dr. Jim Matis and Dr. Julie H. Carroll for their inspiration and dedication to improving the field of teaching statistics at the undergraduate level. Ask students why they would want to learn statistics. Besides the requirement for graduation, ask them if they have ever read the sports summary statistics after games, watched analysts predict stock market movements, or watched the weather news forecast tomorrow's conditions: ALL of these involve statistics. Have they ever participated in a survey or experiment (e.g., phone surveys, internet surveys, medical experiments)? So what is Statistics? See if any volunteers will attempt to answer.

2 Statistics is a collection of procedures and principles for gathering data and analyzing information in order to help people make decisions when faced with uncertainty. Text: Mind On Statistics, by Jessica Utts and Robert Heckard Pg. 1—Stress definition of Statistics.

3 THINK—SHOW—TELL Why? Who? What? When? Where? How?
Text: Mind On Statistics, by Jessica Utts and Robert Heckard Humpty Dumpty sat on the wall, Humpty Dumpty had a great fall. All the king’s horses And all the king’s men Couldn’t put Humpty Dumpty Together again. We could all make a moral for this story such as Stay focused or The higher you get the greater the fall. Ch 1 contains 7 case studies that will be referred to continuously in the textbook. Read the moral first then the case study.

4 “Data are used to make a judgment about a situation”
What question needs to be answered? How should we collect data, and how much? How can we summarize the data? What decisions or generalizations can be made about the question based on the data collected? Stress that there has to be a question first, before we gather data. With today's presentation we will discuss summarizing data. Tomorrow we will discuss data collection and how much data to collect. The meat of this course consists of analysis of data. Every day we will work on generalizations (in the context of the situation). Ex. How many drove to school today? Do you have an established route picked out? Was that particular route decided upon from personal experience with traffic lights, road construction, and parking space? That experience is data. Your analysis drew you to your route. Your generalization is the route you travel to school on a daily basis.

5 Population Data vs. Sample Data
Population: everyone or everything of interest. Parameters: summary measurements (such as p and μ) of the population data. Sample: a representative smaller "subset" of the population. Statistics: summary measurements, denoted by standard letters (such as p̂ and x̄), of the sample data. Stress that we infer things about the population using statistics gathered from samples. Many times it is impossible to measure an entire population, so sample statistics are used to inform us of what is happening with the population of interest.

6 Data--Types of Variables
Categorical Group of category names w/no order Eye Color (brown, blue, green) Quantitative Numerical values taken from an individual Weight (117 lbs, 170 lbs, 253 lbs) Keep this simple. May want to discuss Discrete vs. Continuous quantitative variables.

7 Types of Quantitative (Numerical) Data
Discrete Example: Number of siblings, number of pockets in a pair of jeans, number of free throws made in a season,… Continuous Example: Time, Weight, Height, …because of our limitations of measurement accuracy we often round to the nearest second, ounce, inch,…

8 Summarizing Data w/ Bar graph of Categorical Data
Tables are used to organize the collected data by case number and variable. "Fancier" tables can contain percents, totals, etc. Summary statistics can contain information such as average value, proportions, and minimum and maximum values. A bar chart is used with categorical data to display the frequency within the different categories of the variable.

9 Summarizing Data with Pie Chart for Categorical Data
100%

10 Dotplot for Univariate Quantitative Data

11 Stemplot for Quantitative Data
Ages of Death of U.S. First Ladies
3 | 4, 6
4 | 3
5 | 2, 4, 5, 7, 8
6 | 0, 0, 1, 2, 4, 4, 4, 5, 6, 9
7 | 0, 1, 3, 4, 6, 7, 8, 8
8 | 1, 1, 2, 3, 3, 6, 7, 8, 9, 9
9 | 7
3 | 4 indicates 34 years old
Stem | Leaf (a single digit)
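The stemplot above can also be produced by a short program. The following is only a sketch: the function name `stemplot` is made up for illustration, and the ages are transcribed from the slide. It assumes whole-number data where the stem is the tens digit and the leaf is the ones digit.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows as strings (stem = tens digit, leaf = ones digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    # Walk every stem from smallest to largest so empty stems still get a row.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + ", ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

# Ages transcribed from the slide's stemplot (an assumption of this sketch).
ages = [34, 36, 43, 52, 54, 55, 57, 58,
        60, 60, 61, 62, 64, 64, 64, 65, 66, 69,
        70, 71, 73, 74, 76, 77, 78, 78,
        81, 81, 82, 83, 83, 86, 87, 88, 89, 89, 97]

for row in stemplot(ages):
    print(row)
```

Keeping empty stems in the output matters: dropping them would hide gaps in the distribution.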

12 Split Stemplot
1 | 7
1 | 8, 9, 9, 9, 9, 9
2 | 0, 0, 0, 0, 1, 1, 1, 1, 1, 1
2 | 2, 2, 2, 3, 3
2 | 4, 5
2 |
2 | 8
3 | 0, 1
Stem is split for every 2 leaves: (0, 1), (2, 3), (4, 5), (6, 7), and (8, 9)
Age of 27 students randomly selected from Stat 303 at A&M

13 Split Stemplot
1 |
1 | 7, 8, 9, 9, 9, 9, 9
2 | 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4
2 | 5, 8
3 | 0, 1
3 |
Stem is split for every 5 leaves: (0 thru 4) and (5 thru 9)
Age of 27 students randomly selected from Stat 303 at A&M

14 Back-to-back Stemplot
          Babe Ruth |   | Roger Maris
                    | 0 | 8
                    | 1 | 3, 4, 6
               5, 2 | 2 | 3, 6, 8
               5, 4 | 3 | 3, 9
9, 7, 6, 6, 6, 1, 1 | 4 |
            9, 4, 4 | 5 |
                  0 | 6 | 1
Number of home runs in a season

15 Histogram—Univariate Quantitative data
Frequency Count Univariate Variable Age Histogram is used to graphically display univariate quantitative data as the example shows. Smaller data sets can be sketched by hand with 5 to 7 equal width intervals. (Note: In Stat 303, we will be using the computer to generate graphs.) The vertical axis represent count (frequency) or it could represent percent (relative frequency).

16 Boxplot and Modified Boxplot
“Divides data into 4 quarters” 25% of data in each section

17 Comparative Parallel Boxplots—Univariate quantitative data by category
A boxplot can be used to graphically display univariate data. As in the example here, a quick comparison can be made by separating the data by the categorical variable, gender. The five-number summary (minimum, quartile 1, median, quartile 3, and maximum) gives the breaking points in a boxplot. If outliers exist, the modified boxplot marks them as individual points, and the whiskers stop at the most extreme values that are not outliers.

18 Cumulative Frequency Plot

19 Scatterplot—Bivariate quantitative data
The bivariate data show the relationship between the year in which the USA competed in the Olympics, from the beginning of USA participation through 1996, and the first-place distance in the Men's long jump field event. What pattern exists in the graph? A linear trend. The first-place finisher in 1920 would not have won the event in the prior Olympic years of 1900, 1904, 1908, and 1912. Remember that, due to the war, no Olympics was held in 1916.

20 Summary Features of Quantitative Variables
Center: Location. Spread: Variability. Shape: Distribution pattern of the data. Any unusual features? Explain in context. CSS: Center, Spread, Shape. You need to be able to eyeball this information from a graph. Dotplot has 500 temperatures recorded at the South Pole for 379 months. Approximately where is the center? Median –54.6 or Mean –49.4. Approximately how spread out is the data? Overall from –69 to –21, or 48 degrees of variability. Approximately what shape does the data show? Trimodal, representing the "3" seasons (short summer, short fall & spring, and long winter). Any outliers? Potentially, but we cannot be positive without further investigation.

21 Location: Center. Mean (μ, x̄): add up data values and divide by number of data values. Median: list data values in order, locate middle data value. Data Set: 19, 20, 20, 21, 22. Mean and median of a data set may or may NOT be one of the values in the data set. Inform students that the mu symbol (μ) represents the population mean and x bar (x̄) represents the sample mean. Remind students that the data must be ranked prior to finding the median by hand. Ask them to refer to their textbook for step-by-step instructions. Mean is 20.4, since (19 + 20 + 20 + 21 + 22)/5 = 102/5. Median is 20 since it is the middle number of the ranked (ordered) data values.
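As a quick check of the slide's data set, the mean and median can be computed with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# Data set from the slide.
data = [19, 20, 20, 21, 22]

sample_mean = mean(data)      # sum of values divided by the number of values
sample_median = median(data)  # middle value of the ordered data

print(sample_mean)    # 20.4
print(sample_median)  # 20
```

Here the median (20) is one of the data values, while the mean (20.4) is not.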

22 Robust (Resistant) Statistic
Median is resistant to extreme values (outliers) in data set. Mean is NOT robust against extreme values. Mean is pulled away from the center of the distribution toward the extreme value (“tails of graph”).

23 Of the 2 segments, where’s the Mean with respect to the Median?
Remember the mean is pulled toward extreme values.

24 Where’s the Mean with respect to the Median?

25 Mean or Median?

26 Location—pth Percentile
The pth percentile of a distribution (set of data) is the value such that p percent of the observations fall at or below it. Suppose your Math SAT score is at the 80th percentile of all Math SAT scores. This means that 80% of all test takers scored at or below your score.
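The "at or below" definition can be sketched in a few lines of Python. The helper name `percent_at_or_below` and the list of scores are hypothetical, invented only to illustrate the definition; they are not real SAT data.

```python
def percent_at_or_below(scores, value):
    """Percent of observations that fall at or below the given value."""
    count = sum(1 for s in scores if s <= value)
    return 100 * count / len(scores)

# Hypothetical scores for ten test takers (illustration only).
scores = [400, 450, 500, 520, 550, 580, 600, 640, 700, 760]

# 8 of the 10 scores are at or below 640, so 640 sits at the 80th percentile.
print(percent_at_or_below(scores, 640))  # 80.0
```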

27 Describing Location: Quartiles Spread: Range and Interquartile Range
Range = Maximum – minimum. Q1 (Quartile 1) is the 25th percentile of ordered data, or the median of the lower half of ordered data. Median (Q2) is the 50th percentile of ordered data. Q3 (Quartile 3) is the 75th percentile of ordered data, or the median of the upper half of ordered data. IQR (Interquartile Range) = Q3 – Q1. Any point that falls outside the interval from Q1 – 1.5(IQR) to Q3 + 1.5(IQR) is considered an outlier. The percentile concept is implied, but it should be stressed to students that other percentiles exist, such as the 56th percentile or the 98th percentile, and what these mean. The kth percentile means that k% of the ordered data values are at or below that data value. For example, if the median is 100, then 50% of the ordered data values fall at or below 100. Also, (100 – k)% represents the amount of ordered data that falls above the percentile data value. The outlier formula above creates an interval; any data value that falls outside that interval is considered an outlier. We often use this in combination with boxplots.

28 Summary Statistics
Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
min = 1, Q1 = 3.5, median = 7, Q3 = 10.5, max = 13
Range = Max – min = 13 – 1 = 12
IQR = Q3 – Q1 = 10.5 – 3.5 = 7
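The five-number summary for this data set can be reproduced with the "median of each half" rule for quartiles. The sketch below assumes that rule (with the middle value excluded from both halves when the count is odd); the function name `five_number_summary` is made up for illustration. Software packages often use other quartile conventions and may give slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # With an odd count, the middle value belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

data = list(range(1, 14))  # 1, 2, ..., 13
minimum, q1, med, q3, maximum = five_number_summary(data)

print(minimum, q1, med, q3, maximum)  # 1 3.5 7 10.5 13
print("Range =", maximum - minimum)   # Range = 12
print("IQR =", q3 - q1)               # IQR = 7.0
```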

29 Boxplot—5 Number Summary
Computers (x1000): 250, 1000, 2400, 3500, 5400, 8600. Five-number summary: min = 250, Q1 = 1000, median = 2950, Q3 = 5400, max = 8600. Data from The Presence of Computers in American Schools, by Ronald E. Anderson and Amy Ronnkvist, Teaching, Learning, and Computing: 1998 Survey, Report #2, Center for Research on Information Technology and Organizations, The University of California, Irvine and The University of Minnesota. Stress that 25% of the ordered data falls within the interval from min to Q1, 25% falls within the interval from Q1 to the median, 25% falls within the interval from the median to Q3, and the final 25% falls within the interval from Q3 to max. Although the intervals appear different in length, the amount of data is the same within each interval. The difference in spread indicates a difference in data variation within each interval (not amount of data). Remember that the IQR contains the middle 50% of the ordered data set. IQR = Q3 – Q1 = 5400 – 1000 = 4400

30 Calculating boundaries for potential outliers
Find Q1 and Q3. Q1 = 10, Q3 = 20
Calculate IQR = Q3 – Q1. IQR = 10
Multiply IQR by 1.5. 1.5·IQR = 15
Subtract this from Q1. Q1 – 1.5·IQR = 10 – 15 = –5
Add it to Q3. Q3 + 1.5·IQR = 20 + 15 = 35
These are the boundaries. (–5, 35)
If any data value falls outside of this interval, it is to be considered a potential outlier.
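The steps above can be sketched in Python. The function names `outlier_fences` and `potential_outliers` are hypothetical, and the sample values in the last lines are made up purely to show the check; only Q1 = 10 and Q3 = 20 come from the slide.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) boundaries Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)

def potential_outliers(values, q1, q3):
    """List the values that fall outside the boundaries."""
    lower, upper = outlier_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

# Quartiles from the slide: Q1 = 10, Q3 = 20.
print(outlier_fences(10, 20))  # (-5.0, 35.0)

# Made-up values for illustration: -8 and 40 fall outside (-5, 35).
print(potential_outliers([-8, 3, 12, 18, 40], 10, 20))  # [-8, 40]
```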

31 Describing Spread: Standard Deviation
Roughly speaking, standard deviation is the average distance values fall from the mean (center of graph). Let the arrow mark the center, the mean. Each ring measures an average distance from the center, the mean. Stress that the rings are of equal width (standard).

32 Population and Sample Standard Deviation
σ²: population variance. s²: sample variance. Be sure to go back over what each letter stands for in both formulas. Remind students what operation is performed by the summation symbol AND that a calculator or computer software will calculate these for them. Variance is another measure of spread and is calculated by squaring the standard deviation value. Students may ask why the two formulas differ, one dividing by n and the other by n – 1. Later in the course it hopefully will become clearer. What is Variance???
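The slide's formulas did not survive in this transcript. For reference, the standard textbook definitions are written out below; the notation is an assumption here (N for the population size, n for the sample size, μ and x̄ for the population and sample means) and may differ from the original slide.

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
\qquad\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

Squaring each expression gives the corresponding variance, σ² or s².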

33 Variance = (Standard deviation)²
What is Variance? Variance = (Standard deviation)²

34 Calculated Standard Deviation is a measure of Variation in data
Sample Data Set            Mean    Standard Deviation
100, 100, 100, 100, 100    100     0
90, 90, 100, 110, 110      100     10
30, 90, 100, 110, 170      100     50
90, 90, 100, 110, 320      142     99.85
The first data set contains all the same value, so the mean is obvious and hopefully the standard deviation value is too. Since there is no variation in the data set, the standard deviation is zero. The second data set has a mean that is simple to calculate (mentally), but the standard deviation can be calculated using the formula on a side board. The third example has data values that are more spread out; therefore the standard deviation value should be higher. The fourth example contains an outlier, which dramatically affects the mean and standard deviation.
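The means and standard deviations of the four data sets can be checked with Python's standard `statistics` module. This sketch treats each data set as a sample, so it uses `stdev`, which divides by n – 1; rounding to two decimals is a choice made here for display.

```python
from statistics import mean, stdev

data_sets = [
    [100, 100, 100, 100, 100],
    [90, 90, 100, 110, 110],
    [30, 90, 100, 110, 170],
    [90, 90, 100, 110, 320],   # contains the outlier 320
]

# (mean, sample standard deviation) for each data set
summary = [(mean(d), round(stdev(d), 2)) for d in data_sets]

for d, (m, s) in zip(data_sets, summary):
    print(d, "mean =", m, "sd =", s)
```

The single outlier in the last data set moves the mean from 100 to 142 and raises the standard deviation to about 99.85.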

35 Descriptive Terms Trend

36 Shape----Bell-shaped curve----Symmetric
Descriptive Terms of Sampling Distribution (Histogram) and Model (Red Curve) Shape----Bell-shaped curve----Symmetric

37 Descriptive Terms of Population Models
Skewed Right (or Skewed Left) “Tail” points to right

38 Descriptive Terms of Sampling Distribution
Cluster---Gaps---Potential Outliers

39 Uniform Population Model
Total area under the curve (model) will always equal 1.

40 Various Population Models
Name several types of distributions, such as the normal and chi-square distributions shown here, and the Student's t and F distributions that students will work with but that are not shown in this slide. Each distribution has specific characteristics. The normal distribution is known as the symmetric bell-shaped curve. Depending on the standard deviation, the normal distribution can be tall or short but still resemble a bell-shaped curve. The mean of a normal distribution shifts the graph along the horizontal axis, and the standard deviation changes the spread of the distribution (but not the general shape of a bell-shaped curve). The chi-square distribution's shape is dependent upon its degrees of freedom, which will be discussed at a later time.

