Presentation is loading. Please wait.

Presentation is loading. Please wait.

Unit 2: Some Basics. The whole vs. the part population vs. sample –means (avgs) and “std devs” [defined later] of these are denoted by different letters.

Similar presentations


Presentation on theme: "Unit 2: Some Basics. The whole vs. the part population vs. sample –means (avgs) and “std devs” [defined later] of these are denoted by different letters."— Presentation transcript:

1 Unit 2: Some Basics

2 The whole vs. the part population vs. sample –means (avgs) and “std devs” [defined later] of these are denoted by different letters description vs. inference –mean of a population describes it (partly) –mean of a sample gives a basis to infer (guess) mean of the population

3 Computers are ubiquitous in stats Warning: We’ll compute with Excel because you can see the data and it’s available all over campus... but its stats results are unreliable (see next slide) In the real world, use a statistical program (SAS, SPSS,...)

4 Data Types categorical vs. numerical (vs. ordinal, …) –ex of categorical: red, blue, green –ex of numerical: person’s height in inches –(ex of ordinal: str agree, agree, neutral,...) [numerical: discrete vs. continuous] Types of Statistics graphical numerical

5 Graphs for categorical data pie charts: for fractions of total data in each category bar (Excel: “column”) graphs: for counts or fractions of data in each category (mode = most frequent value)

6 Example: Hair color at NYS Fair

7 Numerical variables: Graphs of frequencies I dot plots stem-and-leaf plots Ex: 79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92

8 Numerical variables: Graphs of frequencies II histogram 79, 91, 59, 52, 94, 74, 75, 87, 67, 35, 91, 89, 96, 92, 92 (Note similarity to dotplot and stem- and-leaf plot) Excel’s “histograms” aren’t, unless bar widths are equal

9 More on Histograms Frequencies are represented by areas, not heights Vertical axis should have “density scale” (in % per horizontal unit, not frequency) So (if widths differ), bar height is % / h.u. Result: total area under histogram = 100%

10 Density scale Weekly salaries in a company Vertical axis is in % per $200 Where is the high point?

11 Reading a histogram: Hrs slept by CU students (Data from questionnaire) HrsAreas 1,2,33x1=33/20 = 15% 4,52x3=66/20 = 30% 61x4=44/20 = 20% 71x3=33/20 = 15% 8,92x2=44/20 = 20% Sum= 20 So maybe 5% slept 1hr, 5% 2hr and 5% 3hr; Or maybe 0% slept 1hr, 7% 2hr, 8% 3hr 1.Which given interval contains the most students? 2.Which 1-hr period contains the most students (i.e., is most “crowded”)? 3.About what % slept 8 hr? 4.About what % slept 3 or 4 hr? 5.If there were 240 surveyed, about how many slept 6-7 hr?

12 Drawing a histogram:Horses’ weights in kg Wts (widths)Counts% / 10 kg 180-200 (20)2084 200-220 (20)803216 220-230 (10)3012 230-240 (10)2510 240-260 (20)40168 260-280 (20)25105 280-310 (30)30124 Sum = 250

13 Frequencies by area?

14 We are given lists of numbers taken from a large number of residents of New York State. The numbers represent 1.the person’s height 2.the distance from the person’s home to the nearest airport 3.the distance from the person’s home to Syracuse 4.the size of the person’s savings From the following histograms, choose the one that you think best approximates the true histogram of each list.

15 Remarks on histograms If data is continuous, need convention about what do with data on boundaries: Does it fall into left or right interval? If data is discrete, it’s natural to center bars on values (which Excel does even when it shouldn’t).

16 Possible traits of a number list “skewed” right or left (the skew is the tail) –incomes are usually skewed right –test scores are often skewed left –IQs are usually symmetric outlier values (far from others – sometimes a rule defines one, but...) unimodal vs. bimodal –ex: heights of a group of men vs. hts of a group of men and women

17

18 Centers of numerical data average [mean] (Greek letter µ or variable with overbar  x ): sum of data divided by number of numbers in list –this is the “arithmetic” mean (others include geometric, harmonic) –“center of gravity” of histogram median: value that is greater than half, less than half of data –half area of histogram to each side –more “resistant” to skew & outliers

19 Weekly salaries in a small factory: 30 workers 17 $200, 5 $400, 6 $500, 1 $2000, 1 $4600 avg = $500, median = $200

20 Test for skew median < avg : skewed to right median > avg : skewed to left –(These can be used as the def of skew left or right)

21 Pop Quiz Is the average or the median likely to be greater for a list of heights of people (including children)? heights of adults? salaries?

22 Measures of spread standard deviation σ (maybe with subscript of variable) = √[Σ( x -  x ) 2 /n] –this is “population std dev”, SD in text, stdevp in Excel –“sample std dev” s = √[Σ( x -  x ) 2 /(n-1)] is SD + in text (much later), stdev in Excel IQR (“inter-quartile range”): difference between third and first quartiles –first quartile: 1/4 of data is less, 3/4 greater –third quartile: [Guess!] –(n-th percentile: n% of data is less, (100-n)% greater) (range: max value – min value)

23 Rough estimates of σ Steps: Estimate avg, estimate deviations of data values from avg, estimate avg of deviations (guess high) Ex: 41, 48, 50, 50, 54, 57. Is σ closest to 0, 5, 10 or 20? Ex: Is σ closest to 0, 3, 8 or 15?

24 Rule of Thumb (best for, but not limited to, bell-shaped data) 68% of values are within 1σ of mean 95% of values are within 2σ of mean almost all values are within 3σ of mean

25 So values within one σ of the average are fairly common,... while values around two σ away from average are mildly surprising,... and values more than three σ away from the average are “a minor miracle”. “z-score” (“std units”): z = ( x –  x ) / σ –the number of σ’s above average –(if negative, below average)

26 Effect of outliers 1,2,3,3,4,5 –avg = 3 –σ = √(10/6) ≈ 1.3 –median = 3 –IQR = 3.75 - 2.25 (Excel values) = 1.5 1,2,3,3,4,100 –avg ≈ 18.8 –σ ≈ 36.3 –median = 3 –IQR = 3.75 - 2.25 (Excel values) = 1.5

27 Changing the list (I): “Linear” changes of variable (i.e., changing units of measure) Basic list: 6, 9, 15 – µ = [6+9+15]/3 = 10 – σ = √[((6-10) 2 +(9-10) 2 +(15-10) 2 )/3] = √14 Add 5: 11, 14, 20 – µ = [11+14+20]/3 = 15 – σ = √[((11-15) 2 +(14-15) 2 +(20-15) 2 )/3)] = √14 Multiply by 12 : 72, 108, 180 – µ = [72+108+180]/3 = 120 – σ = √[((72-120) 2 +(108-120) 2 +(180-120) 2 )/3] = 12√14

28 Changing the list (II): A: 25, 35 B: 40, 40, 50, 50 – µ A = 30, σ A = 5, µ B = 45, σ B = 5 A and B: 25, 35, 40, 40, 50, 50 – µ = 40, σ = 5√3 > 5 Combining two lists Adjoining copies of the average: 6, 9, 15, 10, 10, 10, 10 – µ = (6+9+15+4(10))/7 = 10 – σ = √[((6-10) 2 +(9-10) 2 +(15-10) 2 +4(10-10) 2 )/7] = √(3/7)∙√14 < √14

29 Box-&-whisker plots Uses all of min, first quartile, median, third quartile, max “Normal” just shows them all “Modified” puts limit on whisker length -- 1.5 IQR –3 rd quartile +1.5 IQR = “(inner) fence” whisker ends at last value before or on fence –beyond the fence is an “outlier” reject only “for cause” (?) –beyond 3 rd quartile + 3 IQR (“outer fence” or “bound”) is “extreme outlier”

30

31

32 Normal: Modified:


Download ppt "Unit 2: Some Basics. The whole vs. the part population vs. sample –means (avgs) and “std devs” [defined later] of these are denoted by different letters."

Similar presentations


Ads by Google