Introduction to the Practice of Statistics. Instructor: Alex Kulik. Office: C-11, room 2.03. http://www.im.pwr.wroc.pl/~kulyk/
Your grade: homework (25%), quizzes (25%), midterm 1 (25%). Total grade of 90% = bdb (very good), 70% = db (good), 50% = dst (satisfactory). Late homework and quizzes are not accepted. Class participation is required. Contact the instructor if you expect problems with taking an exam.
Textbook Introduction to the Practice of Statistics, 4th edition, by David S. Moore and George P. McCabe – available in the library of C-11. We will go through Chapters 1-12 omitting Chapter 11.
To do: Get a calculator, especially for tests. Install MS Excel at home; it will occasionally be used for your homework. Regularly visit our web page for the schedule, lecture notes, assignments, solutions, and tables.
Data: We use data to answer scientific questions. Data has variability. To assess the evidence data provide, we need to distinguish signal from noise.
Example: Study the effect of exercise on cholesterol levels. One group exercises and another does not. Is cholesterol reduced by exercise? Consider: people differ; other factors may have an effect; exercise may affect other factors.
What is Statistics? The science of understanding data and making decisions in the face of variability/randomness. The set of methods to analyze the data and to design the experiment in order to extract information and quantify its reliability.
Section 1.1 (Numbering as in the textbook) Data set: Individuals and Variables Individuals – objects described by a set of data (people, animals, things) Variable – characteristic of the individuals
Types of Variables. Variables: Quantitative (Continuous, Discrete); Categorical (Ordinal, Not ordinal).
Types of variables Quantitative (numerical) Continuous: e.g. height, weight, concentration Discrete: e.g. number of customers, flowers Categorical (non-numerical) Ordinal: e.g. choices on a survey: never, rarely, occasionally, often, always Non-ordinal: e.g. shape, race
Example: Information on employees
Exploratory data analysis of variables. Distribution = description of count or percent. Categorical variables: visualize the distribution using a bar chart or pie chart. Quantitative variables: visualize the distribution using a stemplot or histogram.
Education of 25- to 34-year-olds (US)

Education                Count (in millions)   Percent
Less than high school     4.7                  12.3
High school graduate     11.8                  30.7
Some college             10.9                  28.3
Bachelor’s degree         8.5                  22.1
Advanced degree           2.5                   6.6
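As a minimal sketch (Python, with variable names chosen here purely for illustration), the percent column of the education table can be recomputed from the counts. Because the published counts are rounded to 0.1 million, the recomputed percents may differ from the table in the last digit.

```python
# Counts (in millions) copied from the education table.
education_counts = {
    "Less than high school": 4.7,
    "High school graduate": 11.8,
    "Some college": 10.9,
    "Bachelor's degree": 8.5,
    "Advanced degree": 2.5,
}

total = sum(education_counts.values())

# Percent of each category = 100 * count / total.
education_percents = {
    level: 100 * count / total for level, count in education_counts.items()
}

for level, pct in education_percents.items():
    print(f"{level:<24}{pct:6.1f}%")
```

The same dictionary of percents is what a bar graph or pie chart of this variable would display.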
Bar graph of education
Pie chart of education
Distribution of quantitative variables. Individual observations often differ: we observe a cloud rather than a few values. The distribution of a quantitative variable is displayed by a histogram.
Examining distributions. Describe the pattern. Shape: e.g. symmetric or skewed in one direction; the number of modes. Center: e.g. the midpoint. Spread: e.g. the range between the smallest and the largest values. Look for outliers: individual values that do not match the overall pattern.
A glimpse at the distribution. Example: Numbers of home runs that Babe Ruth hit in each of his 15 years (1920-1934) with the New York Yankees: 54 59 35 41 46 25 47 60 54 46 49 46 41 34 22. Stemplot, also called stem-and-leaf plot. Leaf = the last digit. Stem = all but the last digit.
Draw the stem-and-leaf plot: a) write the stems, b) write the leaves for each stem, c) order the leaves on each stem. We can increase the number of stems by splitting each into two, e.g. one with leaves 0 through 4 and one with leaves 5 through 9. We can also round numbers before making a stemplot.
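The three steps can be sketched in Python as follows. This is only an illustration for non-negative whole numbers; the function name `stemplot` is made up here, and stems are not split.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    # Sorting first means the leaves on each stem end up in increasing order.
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # List every stem between the smallest and the largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | " + "".join(str(leaf) for leaf in leaves[stem]))
    return lines

# Babe Ruth's home run counts from the example.
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
ruth_plot = stemplot(ruth)
print("\n".join(ruth_plot))
```

For Ruth's data this prints five stems (2 through 6), with most leaves on stem 4.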
Back-to-back stemplot. Compare Babe Ruth’s home run counts with Mark McGwire’s: 9 9 22 29 32 32 33 39 39 42 49 52 58 65 70
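A back-to-back stemplot shares one column of stems between two data sets. The sketch below assumes the fifteen numbers on this slide are McGwire's counts and reuses Ruth's counts from the earlier example; the helper name `back_to_back_rows` is hypothetical.

```python
def back_to_back_rows(left, right):
    """Return (left_leaves, stem, right_leaves) for each stem, lowest stem first."""
    both = list(left) + list(right)
    rows = []
    for stem in range(min(both) // 10, max(both) // 10 + 1):
        # Left leaves are written in decreasing order so they grow outward from the stem.
        left_leaves = "".join(
            str(v % 10) for v in sorted(left, reverse=True) if v // 10 == stem
        )
        right_leaves = "".join(str(v % 10) for v in sorted(right) if v // 10 == stem)
        rows.append((left_leaves, stem, right_leaves))
    return rows

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
mcgwire = [9, 9, 22, 29, 32, 32, 33, 39, 39, 42, 49, 52, 58, 65, 70]

rows = back_to_back_rows(ruth, mcgwire)
width = max(len(left_leaves) for left_leaves, _, _ in rows)
for left_leaves, stem, right_leaves in rows:
    print(f"{left_leaves:>{width}} | {stem} | {right_leaves}")
```

Ruth's leaves appear to the left of the stems and McGwire's to the right, so the two shapes can be compared row by row.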
Distribution at large: Histograms
Frequency table of the Hispanic data

Class       Count   Percent
0.1-5.0     30      60
5.1-10.0    10      20
10.1-15      4       8
15.1-20
20.1-25      1       2
25.1-30      2       4
30.1-35
35.1-40
Histogram of Percent of Hispanic adults
Histogram, comments: The ranges of the variable are called bins. Bins should be convenient: usually of equal length, covering the whole range of data. The number of bins is a matter of judgement; choose e.g. an integer close to the square root of the number of observations. Frequency histogram = has counts. Relative frequency histogram = has percents.
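To illustrate these comments, here is a small Python sketch that builds the table behind a frequency histogram and a relative frequency histogram. It uses Babe Ruth's home run counts from the stemplot example; the function name `frequency_table` and the choice of bins of width 10 starting at 20 are assumptions made for this illustration.

```python
import math

def frequency_table(values, start, width, n_bins):
    """Return (lower, upper, count, percent) for equal-width bins [lower, upper)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if not 0 <= k < n_bins:
            raise ValueError(f"{v} is outside the range covered by the bins")
        counts[k] += 1
    n = len(values)
    return [
        (start + i * width, start + (i + 1) * width, c, 100 * c / n)
        for i, c in enumerate(counts)
    ]

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

# Rule of thumb from the slide: an integer close to sqrt(n).
suggested_bins = round(math.sqrt(len(ruth)))

# Convenient bins win over the rule of thumb here: five bins of width 10.
table = frequency_table(ruth, start=20, width=10, n_bins=5)
for lower, upper, count, percent in table:
    print(f"{lower}-{upper}: count={count}, percent={percent:.1f}")
```

The counts give the bar heights of the frequency histogram; the percents give the bar heights of the relative frequency histogram. The rule of thumb suggests about four bins for 15 observations, but five bins of width 10 are more convenient to read.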
Labelling the graph is important! The horizontal axis is for the variable. The vertical axis is for the counts/frequencies or relative frequencies/percents. Remember to label the axes precisely as in our examples.
... + 24,800 nanoseconds
Give the frequency table of Newcomb’s data. Classes: 20-24.9, 25-29.9, and so on.
Draw the frequency histogram of Newcomb’s data (then the relative frequency histogram).
Histogram of Newcomb’s data (note left outliers)
Other plots, e.g. time series: may exhibit hidden mechanisms. Trend: persistent, long-term rise or fall. Seasonal variation: a pattern that repeats itself at known regular intervals of time. Less important in this course.
Time plots. Newcomb’s data.
Section 1.2 Describing distributions with numbers: mean, median, quartiles, boxplot, standard deviation, changing the unit of measurement.
Measures of Center. Mean: the arithmetic mean (average) of a data set, denoted by $\bar{x}$: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. The mean is easily influenced by outliers, i.e. it is not resistant.
Median Median is the midpoint of a distribution: Sort the data in increasing order. Median equals the (n+1)/2-th observation if n is odd, and it is the average of the two middle observations if n is even. Median is a resistant measure of center. Outliers do not influence median much.
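The median rule, and the difference in resistance between mean and median, can be checked with a short Python sketch. The data are Babe Ruth's home run counts from the stemplot example; the "typo" data set, in which 60 is recorded as 600, is a made-up illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their number."""
    return sum(values) / len(values)

def median(values):
    """Midpoint of the sorted data, following the rule on the slide."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the (n+1)/2-th observation, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the two middle ones

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

# Hypothetical recording error: 60 entered as 600.
ruth_typo = [600 if v == 60 else v for v in ruth]

print(f"original: mean={mean(ruth):.2f}, median={median(ruth)}")
print(f"with typo: mean={mean(ruth_typo):.2f}, median={median(ruth_typo)}")
```

A single wild value moves the mean a long way while the median stays where it was.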
Mean vs. Median. In a symmetric distribution, mean = median. In a skewed distribution, the mean is further out in the long tail than the median is. Example: The mean price of existing houses sold in 2000 was 176,200. The median price of these houses was 139,000.
Measures of spread. Quartiles: Q2 (second quartile) = median; Q1 (first quartile) = median of the lower “half” of the sorted data; Q3 (third quartile) = median of the upper half of the sorted data. The p-th percentile is a number q such that approximately p percent of the observations are smaller than q. Q1, Q2, Q3 are the 25th, 50th, and 75th percentiles.
The interquartile range and the criterion for outliers. The interquartile range: IQR = Q3 - Q1. An observation is an outlier if it falls more than 1.5*IQR above the third quartile or more than 1.5*IQR below the first quartile. We often remove the outliers from the data.
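The quartile definitions and the 1.5*IQR criterion can be sketched in Python as follows. The function names are made up, and the code assumes that the middle observation is left out of both halves when n is odd. Software packages often use slightly different quartile rules, so their results may differ. The extra season of 95 home runs is hypothetical and serves only to trigger the criterion.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3); needs at least two observations."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                  # observations below the median position
    upper = ordered[len(ordered) - half:]   # observations above the median position
    return median(lower), median(ordered), median(upper)

def outliers(values):
    """Observations more than 1.5*IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
print(quartiles(ruth))         # Q1, median, Q3 for Ruth's data
print(outliers(ruth))          # no value is flagged
print(outliers(ruth + [95]))   # the hypothetical 95 is flagged
```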
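Standard deviation. Deviation of the i-th observation: $x_i - \bar{x}$. Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. Standard deviation: $s = \sqrt{s^2}$.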
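A minimal Python sketch of the computation, assuming the usual sample definition with divisor n - 1 (the convention of the textbook). The function names are illustrative, and the result is cross-checked against `statistics.stdev` from the standard library, which uses the same divisor.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(values))

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
s = sample_sd(ruth)
print(f"s = {s:.2f}")
print(f"statistics.stdev gives {statistics.stdev(ruth):.2f}")

# When all observations are equal there is no spread, so s = 0.
print(sample_sd([5, 5, 5]))
```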
Five-Number Summary: Minimum, Q1, Median, Q3, Maximum. Boxplot: visual representation of the five-number summary.
Statistics: Minicomp. City, Minicomp. Highway, Two-seater City, Highway, w/o outlier
mean: 13.4 25.8 19.2 14.1 23.4
median: 18 25 26 14.5
Q1: 16 23 13 21
Q3: 20 28 27
SD: 2.42 3.16 11.2 11.5 5.07 5.34
Boxplots
Hispanics data: the histogram...
...and a boxplot... Modified boxplot: outliers shown.
Five-Number Summary vs. Standard Deviation. s = 0 when there is no spread. s is not resistant. The five-number summary usually better describes a skewed distribution or a distribution with outliers. Mean and standard deviation are usually used for reasonably symmetric distributions without outliers.
Linear Transformations: $x_{\text{new}} = a + b\,x_{\text{old}}$. Examples: $x_{\text{miles}} = 0.62\,x_{\text{km}}$, $x_{\text{g}} = 28.35\,x_{\text{oz}}$.
Linear transformations do not change the shape of a distribution. They do change the center and the spread, e.g.:

Python   1      2      3      4      5
oz       1.13   1.02   1.23   1.06   1.16
g        32     29     35     30     33
Effect of a linear transformation $x_{\text{new}} = a + b\,x_{\text{old}}$: $\text{mean}_{\text{new}} = a + b\,\text{mean}_{\text{old}}$, $\text{median}_{\text{new}} = a + b\,\text{median}_{\text{old}}$, $\text{SD}_{\text{new}} = |b|\,\text{SD}_{\text{old}}$, $\text{IQR}_{\text{new}} = |b|\,\text{IQR}_{\text{old}}$.
Calculate the mean, median and SD for the weight of pythons:

          in [g]   in [oz]
Mean
Median
SD
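One way to check the answers is the Python sketch below. It uses the python weights in ounces from the table and the exact conversion factor 28.35 (so a = 0, b = 28.35). The table lists grams rounded to whole numbers, so summaries computed from those rounded values will differ slightly. The names `OZ_TO_G` and `summary` are introduced here for illustration only.

```python
import math
import statistics

OZ_TO_G = 28.35  # conversion factor from the linear transformation example

weights_oz = [1.13, 1.02, 1.23, 1.06, 1.16]
weights_g = [OZ_TO_G * w for w in weights_oz]  # x_new = 0 + 28.35 * x_old

def summary(values):
    """Return (mean, median, sample standard deviation)."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.stdev(values),
    )

mean_oz, median_oz, sd_oz = summary(weights_oz)
mean_g, median_g, sd_g = summary(weights_g)

print(f"oz: mean={mean_oz:.3f}, median={median_oz:.3f}, SD={sd_oz:.4f}")
print(f"g:  mean={mean_g:.2f}, median={median_g:.2f}, SD={sd_g:.3f}")

# The rules for linear transformations, with a = 0 and b = 28.35.
print(math.isclose(mean_g, OZ_TO_G * mean_oz))
print(math.isclose(median_g, OZ_TO_G * median_oz))
print(math.isclose(sd_g, abs(OZ_TO_G) * sd_oz))
```

All three checks should come out true: converting the units rescales the center and the spread by the same factor without changing the shape.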