Probability and Statistics Univariate Analysis @ Prof. Liping Fu, University of Waterloo
The Big Picture: 1) Data Collection (Population → Sample data); 2) Exploratory Data Analysis (EDA); 3) Probability; 4) Inference
Exploratory data analysis (EDA): on a single variable (categorical or quantitative); on the relationship between two variables
Categorical Data: Numerical Measures: Relative Frequency Table (by category)
Categorical Data: Graphical Summary - Visualization: Bar Chart and Pie Chart
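A relative frequency table divides the count in each category by the total number of observations. The sketch below is illustrative only: the `modes` sample and the function name `relative_frequency_table` are hypothetical and do not come from the lecture.

```python
from collections import Counter


def relative_frequency_table(values):
    """Return {category: share of observations in that category}."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


# Hypothetical categorical sample (travel mode of 8 commuters).
modes = ["car", "bus", "car", "bike", "car", "bus", "walk", "car"]
table = relative_frequency_table(modes)

# Print one row per category; the shares add up to 1.
for category, share in sorted(table.items()):
    print(f"{category:<6}{share:.3f}")
```

The same shares are the bar heights of a relative frequency bar chart and the slice sizes of a pie chart.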
Quantitative Data: Numerical Measures. Where is the center of the data? Measures of Center. How varied is the data? Measures of Variation.
Measures of Center: sample mean, median, mode
Sample Mean = arithmetic average = 'average': $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Sample Median = 'middle number'
Sample Mode = 'most frequent'
All of them measure the CENTER of the data. In most cases: mean ≈ median ≈ mode
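As a quick illustration, the three measures of center can be computed with Python's standard `statistics` module. The `sample` values below are hypothetical, chosen only so that one value repeats and the mode is well defined.

```python
import statistics

# Hypothetical sample, not from the lecture.
sample = [2, 3, 3, 4, 5, 3, 6]

mean = statistics.mean(sample)      # sum of the values divided by n
median = statistics.median(sample)  # middle value of the sorted data
mode = statistics.mode(sample)      # most frequent value

print(f"mean={mean:.3f} median={median} mode={mode}")
```

For this sample the three values are close to one another, which is the typical situation described above.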
Mean is NOT a good measure in this case...
Measures of Variation
Range: (max - min)
Quartiles:
  Q1: First quartile (one quarter of the data less than this value)
  Q2: Second quartile (median, half point)
  Q3: Third quartile (three quarters of the data less than this value)
Inter-quartile range (IQR) = Q3 - Q1
Sample Variance/Standard Deviation
Frequency distribution (relative, cumulative)
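A minimal sketch of the range, quartiles, and IQR, using the ten tensile strengths from Example 1.1 in these notes. Quartile conventions differ between textbooks and software; the "inclusive" interpolation method of `statistics.quantiles` is an assumption here and may not match the convention used in the lecture.

```python
import statistics

# Tensile strengths of the 10 sampled I-beams (Example 1.1).
strengths = [126, 128, 135, 146, 137, 142, 125, 131, 139, 141]

# Range: largest value minus smallest value.
data_range = max(strengths) - min(strengths)

# Quartiles by linear interpolation between order statistics
# (assumed convention; other methods give slightly different values).
q1, q2, q3 = statistics.quantiles(strengths, n=4, method="inclusive")

# Inter-quartile range.
iqr = q3 - q1

print(f"range={data_range} Q1={q1} Q2={q2} Q3={q3} IQR={iqr}")
```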
Variance ($s^2$) and Standard Deviation ($s$)
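The formula on this slide did not survive extraction. The block below gives the standard definition of the sample variance and standard deviation, assuming the usual $n-1$ divisor and the same symbols as the sample mean ($x_i$, $n$, $\bar{x}$); the lecture may have presented it in a different but equivalent form.

```latex
s^2 = \frac{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}{n - 1},
\qquad
s = \sqrt{s^2}
```

The standard deviation $s$ has the same units as the data, which makes it easier to interpret than $s^2$.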
Distribution Frequency
Graphical Summary - Visualization: Dot Plot; Distribution Charts (Bar Chart, Histogram, Polygon); 5-number Plot (Box Plot)
Visualization: Dot Plot. Clusters, groups, and outliers?
Box Plot / Box-and-Whisker Plot (5-number plot)
Example: HoursOnInternet by male students. Q1 = 2.5, Median = 4.0, Q3 = 6.4, IQR = Q3 - Q1
Whisker limits: Q1 - 1.5 IQR and Q3 + 1.5 IQR
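The whisker limits for this example follow directly from the quartiles on the slide. The sketch below works through the arithmetic; the helper `is_outlier` and the hour values passed to it are hypothetical additions for illustration.

```python
# Quartiles read from the slide (HoursOnInternet, male students).
q1, median, q3 = 2.5, 4.0, 6.4

iqr = q3 - q1                  # inter-quartile range
lower_limit = q1 - 1.5 * iqr   # points below this are flagged
upper_limit = q3 + 1.5 * iqr   # points above this are flagged


def is_outlier(hours):
    """Flag a value that lies outside the 1.5 * IQR limits."""
    return hours < lower_limit or hours > upper_limit


print(f"IQR={iqr:.2f} limits=({lower_limit:.2f}, {upper_limit:.2f})")
print(is_outlier(13.0), is_outlier(10.0))  # hypothetical observations
```

Because hours cannot be negative, the lower limit is not binding here; only unusually large values can be flagged.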
Bar Chart (Discrete Data) Relative Frequency Table
Histogram (Continuous Data) Relative Frequency Table
Visualize Degree of Variation
Visualize Patterns of Distribution
Cumulative Distribution Polygon Cumulative Frequency Table
Summary: EDA on a Single Variable
Categorical. Numerical measures: relative frequency. Graphical tools: bar chart, pie chart.
Quantitative. Numerical measures: mean, median, mode; variance/stdev; quartiles; frequency. Graphical tools: histogram, polygon, box plot.
Descriptive Statistics - A Few Basic Concepts
Example 1.1(a): Suppose we have a batch of 1000 I-beams for building construction, and we want to find out the tensile strengths of these beams. In order to do so, we take at random a set of 10 beams from the batch and test their tensile strengths. The test results are 126, 128, 135, 146, 137, 142, 125, 131, 139, 141.
What is the relationship between the tensile strength of the 10 I-beams and that of the 1000 I-beams?
What can we say (infer) about the tensile strength of the 1000 I-beams from that of the 10 beams?
How to Summarise Data Graphically?
Example 1.1(b): For the 10 I-beams sampled at random from the batch of 1000 in Example 1.1(a), the tensile strength test results are 126, 128, 135, 146, 137, 142, 125, 131, 139, 141.
What can we say about the test results? How are the data varied or distributed?
How to Construct a Histogram (Polygon)?
1. Identify the smallest and largest observed values, and choose a convenient range that includes both.
2. Divide the range into convenient intervals (also called classes or bins). (What is the optimal number of intervals?)
3. Count the number of observations (the frequency of occurrences) that fall within each interval.
4. For a relative frequency histogram, calculate the relative frequency for each interval.
5. Draw vertical bars with heights representing the frequency (frequency histogram) or the relative frequency (relative frequency histogram).
6. Alternatively, draw a dot at the midpoint of each interval with height matching the frequency. Connecting the dots of all intervals with lines gives the frequency polygon.
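The construction steps can be sketched in a few lines of Python using the tensile strengths from Example 1.1. The range 125 to 150 and the interval width of 5 are assumptions made for this illustration; the lecture does not prescribe them, and other choices give a differently shaped histogram.

```python
# Tensile strengths of the 10 sampled I-beams (Example 1.1).
strengths = [126, 128, 135, 146, 137, 142, 125, 131, 139, 141]


def histogram(data, start, width, n_bins):
    """Count observations in intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        index = int((x - start) // width)
        if not 0 <= index < n_bins:
            raise ValueError(f"{x} lies outside the chosen range")
        counts[index] += 1
    return counts


# Assumed range 125..150 split into 5 intervals of width 5.
start, width, n_bins = 125, 5, 5
counts = histogram(strengths, start, width, n_bins)
rel_freq = [c / len(strengths) for c in counts]
midpoints = [start + width * (i + 0.5) for i in range(n_bins)]

# Text version of the frequency histogram, one row per interval.
for i, c in enumerate(counts):
    low = start + width * i
    print(f"[{low}, {low + width})  {'#' * c:<4} {rel_freq[i]:.1f}")
```

Plotting `rel_freq` against `midpoints` and joining the points gives the relative frequency polygon described in the last step.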
How to Construct a Cumulative Relative Frequency Polygon?
Follow the histogram steps (choose the range, divide it into intervals, count the observations) and calculate the relative frequency for each interval.
Calculate the cumulative relative frequency for each interval.
Draw a dot at the midpoint of each interval with height matching the cumulative relative frequency. Connecting the dots of all intervals with lines gives the cumulative relative frequency polygon.
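A minimal sketch of the cumulative relative frequency calculation for the Example 1.1 sample. As in any binning exercise, the range (125 to 150) and interval width (5) are assumptions made for this illustration. The points are placed at interval midpoints, following the description above.

```python
from itertools import accumulate

# Tensile strengths of the 10 sampled I-beams (Example 1.1).
strengths = [126, 128, 135, 146, 137, 142, 125, 131, 139, 141]

# Assumed binning: 5 intervals of width 5 starting at 125.
start, width, n_bins = 125, 5, 5
counts = [0] * n_bins
for x in strengths:
    counts[(x - start) // width] += 1

# Running total of the counts, then divide by n.
cumulative_counts = list(accumulate(counts))
cumulative_rel_freq = [c / len(strengths) for c in cumulative_counts]

# One polygon point per interval: (midpoint, cumulative relative frequency).
midpoints = [start + width * (i + 0.5) for i in range(n_bins)]
polygon_points = list(zip(midpoints, cumulative_rel_freq))

for point in polygon_points:
    print(point)
```

The last point always has height 1.0, because every observation has been counted by the final interval.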
How to Use a Cumulative Relative Frequency Polygon?
Example 1.1(c): For the 10 I-beams sampled at random from the batch of 1000 in Example 1.1(a), the tensile strength test results are 126, 128, 135, 146, 137, 142, 125, 131, 139, 141.
What percent of (sampled) beams have a tensile strength less than 130?
What is the tensile strength that is greater than or equal to the tensile strength of 95% of the sampled beams? (What is the 95th percentile of the tensile strength?)
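Both questions can also be answered directly from the raw data rather than by reading the polygon. The sketch below is one way to do so; the helper names are hypothetical, and the percentile uses linear interpolation between sorted observations, which is only one of several common conventions. A value read off a grouped polygon will generally differ somewhat from these raw-data answers.

```python
# Tensile strengths of the 10 sampled I-beams (Example 1.1).
strengths = [126, 128, 135, 146, 137, 142, 125, 131, 139, 141]


def fraction_below(data, threshold):
    """Share of observations strictly less than the threshold."""
    return sum(x < threshold for x in data) / len(data)


def percentile(data, p):
    """p-th percentile by linear interpolation between order statistics.

    Assumed convention: position (n - 1) * p / 100 in the sorted data.
    """
    ordered = sorted(data)
    position = (len(ordered) - 1) * p / 100
    lower = int(position)
    if lower + 1 >= len(ordered):
        return float(ordered[-1])
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])


share_below_130 = fraction_below(strengths, 130)
p95 = percentile(strengths, 95)

print(f"below 130: {share_below_130:.0%}")
print(f"95th percentile: {p95:.1f}")
```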
How to Summarise Data Numerically?
Example 1.1(d): For the 10 I-beams sampled at random from the batch of 1000 in Example 1.1(a), the tensile strength test results are 126, 128, 135, 146, 137, 142, 125, 131, 139, 141.
Suppose we have another batch of 1000 I-beams and we take a set of 10 beams from it for testing. The test results are 126, 138, 125, 132, 127, 122, 121, 131, 129, 131.
Which batch has a higher tensile strength on average?
Which batch is more uniform or less varied?
If the design standard stipulates that 95% of beams must have a minimum tensile strength of 122, which batch meets the standard?
Tools: cumulative relative frequency polygon, percentile function
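A short sketch comparing the two samples numerically. The names `batch_a`, `batch_b`, and `summarize` are hypothetical. For the design-standard question, the code simply reports the share of sampled beams at or above 122; this is an assumed, deliberately simple reading of the question, and the polygon or a percentile function are alternative routes to the same comparison.

```python
import statistics

# Test results from Example 1.1(d).
batch_a = [126, 128, 135, 146, 137, 142, 125, 131, 139, 141]
batch_b = [126, 138, 125, 132, 127, 122, 121, 131, 129, 131]


def summarize(data, minimum):
    """Mean, sample standard deviation, and share of values >= minimum."""
    return {
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),  # uses the n - 1 divisor
        "share_ok": sum(x >= minimum for x in data) / len(data),
    }


summary_a = summarize(batch_a, minimum=122)
summary_b = summarize(batch_b, minimum=122)

for name, s in (("first batch", summary_a), ("second batch", summary_b)):
    print(f"{name}: mean={s['mean']:.1f} "
          f"stdev={s['stdev']:.2f} share>=122: {s['share_ok']:.0%}")
```

In these samples the first batch has the higher mean, the second batch has the smaller standard deviation, and only the first batch has at least 95% of its sampled beams at or above 122. With only 10 beams per batch, these are statements about the samples, not yet about the full batches.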
Problem with the Mean? Example 1.2: A small company employs four young engineers, who each earn $24,000, and the owner (also an engineer), who gets $114,000. Comment on the claim that on the average the company pays $42,000 to its engineers and, hence, is a good place to work.
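The arithmetic behind Example 1.2 can be checked in a few lines. The claimed average is correct as a mean, but the median and mode show what a typical engineer at the company actually earns.

```python
import statistics

# Four young engineers at $24,000 each and the owner at $114,000.
salaries = [24_000] * 4 + [114_000]

mean_salary = statistics.mean(salaries)      # pulled up by the owner's pay
median_salary = statistics.median(salaries)  # middle of the sorted salaries
mode_salary = statistics.mode(salaries)      # most common salary

print(f"mean={mean_salary} median={median_salary} mode={mode_salary}")
```

A single large value moves the mean far from where most of the data lie, while the median and mode are unaffected by it.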
Think About It: (for next lecture)
For Example 1.1, suppose we pick at random another I-beam from the batch. What is the probability that the tensile strength of that beam is between 130 and 140?
For Example 1.1, what should be the minimum number of observations (size of sample) in order to make our inferences credible?
Suppose we toss a coin. What is the chance of getting a head? Do we need observations in order to answer this question?
How long should a left-turn bay be in order to accommodate left-turning traffic in over 95% of the signal cycles during the peak period?