Data Characterization

1 Data Characterization
Chapter 3 Data Characterization 2/5/2019 BUS304 – Data Characterization

2 Types of Data Measurements
Measurements of Center and Location
Measurements of Variation
?

3 Measurements for Population and Sample
In general, we use the same set of measurements for both population and sample.
Population parameters: numerical measurements for a population. Usually represented using Greek letters or capital English letters: “N” for population size; “$\mu$” for population mean.
Sample statistics: numerical measurements for a sample. Usually represented using lowercase English letters: “n” for sample size; “$\bar{x}$” for sample mean.

4 Most commonly used -- Mean
Sample mean: the “sample average”. Formula: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean: the “population average”. Formula: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Characterizes the center of the data distribution.
The most commonly used data measure.
Ways to compute the mean:
Use a calculator.
Use Excel (function: AVERAGE).
Read the information from a chart.

5 Sensitivity to outliers
Compute the mean for the following 2 groups of data.
Suppose the mayor decides to provide more public facilities to poor communities, and the decision is based on whether the mean income in the community is below $50,000 per year. Does such a decision make sense?
Household income in community a (unit = $10,000), households #1 to #8: 5, 4, 3, ... (remaining values not preserved). Mean = 4.125
Household income in community b (unit = $10,000), households #1 to #8: 5, 4, 3, ..., 100 (remaining values not preserved). Mean = 16

6 Exercise
The manager of a small hotel in Foster City, CA, was asked by the corporate VP to analyze the Sunday night registration information for the past eight weeks. Data on three variables were collected:
x1 = total number of rooms rented
x2 = total dollar revenue from the room rentals
x3 = number of customer complaints that came from guests each Sunday
Tasks:
Create a histogram for the distribution of the number of customer complaints per day.
Calculate the average number of rooms rented, the average revenue, and the average number of complaints per day.
Calculate the average number of complaints per room rented.
Explain the difference between “the average complaints per day” and “the average complaints per room rented” from a managerial perspective.
Week | Rooms Rented | Revenue | Complaints
1 | 22 | $1,870 | ?
2 | 13 | $1,590 | ?
3 | 10 | $1,760 | ?
4 | 16 | $2,345 | ?
5 | 23 | $4,563 | ?
6 | ? | $1,630 | ?
7 | 11 | $2,156 | ?
8 | ? | $1,756 | ?
(? = value not preserved in the transcript)

7 Compute the mean from frequency table
Below is a frequency table showing the number of days the teams take to finish their projects.
How many days on average does a team take to finish one project?
Create a histogram using the data in the table and locate the mean on the graph.
How would you describe the shape of the histogram?
What is the relationship between the mean and the peak?
Use relative frequency to find the mean.
Days to Complete | Frequency | Days × Frequency
5 | 4 | =5*4
6 | 12 | =6*12
7 | 8 | =7*8
8 | 6 | =8*6
9 | 4 | =9*4
10 | 2 | =10*2
Mean = 252 / 36 = 7 days (the slide prints 6.31 days, which does not match this table).
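
The frequency-table calculation can be sketched in Python. This is a minimal illustration, not part of the original slides: the dictionary `days_frequency` holds the table as read from the slide (frequencies 4, 12, 8, 6, 4, 2), and the function names are made up for this example. With these frequencies the products sum to 252 over 36 teams, which gives 7 days rather than the 6.31 days printed on the slide.

```python
def mean_from_frequencies(freq):
    """Mean of a frequency table given as {value: count}."""
    total_count = sum(freq.values())
    weighted_sum = sum(value * count for value, count in freq.items())
    return weighted_sum / total_count


def mean_from_relative_frequencies(freq):
    """Same mean, computed as the sum of value * relative frequency."""
    total_count = sum(freq.values())
    return sum(value * (count / total_count) for value, count in freq.items())


# Days to complete -> number of teams, as read from the slide's table.
days_frequency = {5: 4, 6: 12, 7: 8, 8: 6, 9: 4, 10: 2}

print(mean_from_frequencies(days_frequency))           # 252 / 36
print(mean_from_relative_frequencies(days_frequency))  # same value up to rounding
```

Both functions give the same answer because a relative frequency is just the count divided by the total, so the second version is a weighted mean with the relative frequencies as weights.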

8 Estimating the mean from Histogram
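Treat the histogram as a frequency table and use the mid-value of each range to represent that range.
Mathematical expression: $\bar{x} = \frac{\sum_i f_i m_i}{n}$ if sample, $\mu = \frac{\sum_i f_i m_i}{N}$ if population, where $m_i$ is the mid-value of range $i$ and $f_i$ is its frequency.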
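
A small Python sketch of the mid-value estimate follows. The bins are hypothetical (the slide's histogram is not available), and `estimate_mean_from_histogram` is an illustrative name, not something from the course material.

```python
def estimate_mean_from_histogram(bins):
    """Estimate the mean from (lower, upper, frequency) bins using mid-values."""
    total = sum(freq for _, _, freq in bins)
    weighted = sum(((lower + upper) / 2) * freq for lower, upper, freq in bins)
    return weighted / total


# Hypothetical histogram: ages grouped into 10-year ranges.
hypothetical_bins = [(20, 30, 5), (30, 40, 10), (40, 50, 10), (50, 60, 5)]

print(estimate_mean_from_histogram(hypothetical_bins))
```

The result is only an estimate, because every value in a range is treated as if it sat exactly at the mid-value.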

9 Weighted Mean
The ordinary mean assumes that each piece of information counts equally; a weighted mean does not. E.g. students’ GPA and score calculation.
Weights are subjective. E.g. different instructors assign different weights to homework and exams.
A frequency table can be considered an example of a weighted mean (higher weight when the frequency is higher).
Days to Complete | Frequency | Relative Frequency
5 | 4 | 11.11%
6 | 12 | 33.33%
7 | 8 | 22.22%
8 | 6 | 16.67%
9 | 4 | 11.11%
10 | 2 | 5.56%

10 Exercise
Estimate the mean based on the following histogram.
There are 30 full-time faculty in CoBA. Their average age was 43 in ... In 2008, one new faculty member aged 30 was hired and one faculty member retired at 65. What is the new mean age for CoBA faculty?

11 Variance
A measure of data spread.
Also called “the average of squared deviations from the mean”.
The larger the variance, the fatter the histogram.
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Note the difference!

12 Steps to compute the variance
Identify whether the data are a population or a sample (the formulae are different).
Use the following table to compute the deviations:
Find the mean.
Find the distance of each value from the mean (fill out the 2nd column).
Find the squared distance (the 3rd column).
Add up the 3rd column and divide by the population size, or by the sample size - 1.
Data list | Distance from the mean | Square the distance
5 | 1.167 | (1.167)^2 = 1.36
4 | |
3 | |
2 | |
... | |
If the list of data is a population, what is the population variance?
If the list of data is a sample, what is the sample variance?
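
The steps above can be sketched in Python as follows. The data list here is hypothetical (the slide's full list did not survive in the transcript), and the function name `variance` and its `is_sample` flag are illustrative choices.

```python
def variance(data, is_sample):
    """Variance following the slide's steps: mean, distances, squares, divide."""
    n = len(data)
    mean = sum(data) / n                      # find the mean
    distances = [x - mean for x in data]      # distance of each value from the mean
    squared = [d ** 2 for d in distances]     # squared distance
    divisor = n - 1 if is_sample else n       # n - 1 for a sample, N for a population
    return sum(squared) / divisor


# Hypothetical data list (not the slide's data).
data = [2, 4, 4, 5, 7, 8]

print(variance(data, is_sample=False))  # population variance
print(variance(data, is_sample=True))   # sample variance
```

For the same list the sample variance is always the larger of the two, since the same sum of squares is divided by a smaller number.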

13 Comparing variance vs. histogram
Find the variance for the following groups of sample data.
Compare the mean and variance.
Create the histogram to compare the distribution.
11 12 13 16 17 18 21 14 15 16 17 11 12 19 20

14 What does variance mean?
Variance indicates variation: the larger the variance, the more spread out the data.
It indicates unpredictability.
E.g. weather data: when the weather changes dramatically, it is hard to predict tomorrow’s temperature. (Looking at temperature data, which has the larger variance, Chicago or San Diego?)
Stocks: more risk on returns.
A person’s performance: consistency, emotional…
Other examples?

15 Use frequency table to compute the population variance:
[Histogram with values 14, 15, 16, 17 on the horizontal axis]
Data value | Frequency | Relative Frequency
14 | 1 | 0.125
15 | 3 | 0.375
16 | |
17 | |
Data | Distance | Square
14 | |
15 | |
16 | |
17 | |
Compute the weighted average.
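
As a rough sketch, the weighted-average form of the population variance can be written in Python. Only the frequencies for 14 and 15 are legible in the transcript; the frequencies used below for 16 and 17 (3 and 1) are an assumption made so that the relative frequencies sum to 1, and the function name is illustrative.

```python
def population_variance_from_frequencies(freq):
    """Population variance as the weighted average of squared distances."""
    total = sum(freq.values())
    rel = {value: count / total for value, count in freq.items()}
    mean = sum(value * weight for value, weight in rel.items())
    return sum(weight * (value - mean) ** 2 for value, weight in rel.items())


# Frequencies for 14 and 15 are from the slide; those for 16 and 17 are assumed.
assumed_frequencies = {14: 1, 15: 3, 16: 3, 17: 1}

print(population_variance_from_frequencies(assumed_frequencies))
```

Each squared distance is weighted by its relative frequency, which is the same as dividing the frequency-weighted sum of squares by the population size N.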

16 Standard Deviation
Square root of the variance.
An indicator of data deviation that can be directly compared to the mean.
Exercise: compute the standard deviation from the histogram on slide no. 5 and locate it on the histogram.
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$; sample standard deviation: $s = \sqrt{s^2}$
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$; population standard deviation: $\sigma = \sqrt{\sigma^2}$
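
For a quick check of hand calculations, Python's standard `statistics` module provides both versions of each measure. The data list below is hypothetical and not taken from the slides.

```python
import math
import statistics

# Hypothetical data list (not taken from the slides).
data = [2, 4, 4, 5, 7, 8]

sample_variance = statistics.variance(data)        # divides by n - 1
population_variance = statistics.pvariance(data)   # divides by N

sample_sd = statistics.stdev(data)                 # square root of the sample variance
population_sd = statistics.pstdev(data)            # square root of the population variance

print(sample_variance, sample_sd)
print(population_variance, population_sd)
print(math.isclose(sample_sd, math.sqrt(sample_variance)))
```

Because the standard deviation is in the same units as the data, it can be compared directly with the mean, which the variance (in squared units) cannot.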

17 Empirical Rule
If the data are bell shaped (which is the case most of the time), then
about 68% of all data will fall in the range $\mu \pm 1\sigma$
about 95% of all data will fall in the range $\mu \pm 2\sigma$
about 99.7% of all data will fall in the range $\mu \pm 3\sigma$
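
The three ranges can be computed with a few lines of Python. This sketch assumes the standard form of the rule (mean plus or minus 1, 2, and 3 standard deviations); the exam-score numbers and the function name `empirical_rule_ranges` are hypothetical.

```python
def empirical_rule_ranges(mean, sd):
    """Ranges expected to hold about 68%, 95% and 99.7% of bell-shaped data."""
    return {
        "68%": (mean - sd, mean + sd),
        "95%": (mean - 2 * sd, mean + 2 * sd),
        "99.7%": (mean - 3 * sd, mean + 3 * sd),
    }


# Hypothetical example: exam scores with mean 70 and standard deviation 10.
for share, (low, high) in empirical_rule_ranges(70, 10).items():
    print(f"about {share} of the data between {low} and {high}")
```

The percentages are approximations that hold only when the histogram is roughly bell shaped.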

18 Other Numerical Measures
Median
Mode
Range
Percentiles
Quartiles, Interquartile range

19 Median
The middle value: the value which divides the data in half, with equal sizes above and below.
Steps:
Put your data in an ordered array (sort).
If n (or N) is odd, the median is the middle number (i.e. the $\frac{n+1}{2}$th number).
If n (or N) is even, the median is the average of the two middle numbers (i.e. the average of the $\frac{n}{2}$th and the $(\frac{n}{2} + 1)$th numbers).
Reading information from the chart.
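
The median steps translate directly into a short Python function. The two input lists are hypothetical, and the function is a sketch of the procedure rather than code from the course.

```python
def median(data):
    """Median following the slide's steps: sort, then pick or average the middle."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the (n + 1) / 2-th value, which is index `middle` counting from 0.
        return ordered[middle]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([7, 1, 5]))      # hypothetical odd-sized list
print(median([7, 1, 5, 3]))   # hypothetical even-sized list
```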

20 Sensitivity to outliers
The median is not affected by extreme values.
[Figure labels: Median = 3; Median = 2.5; 4.125; 16; Median = 3]
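
To see the effect of one extreme value, the sketch below compares the mean and the median for two hypothetical income lists that differ in a single household. The numbers are invented for illustration and are not the slide's data.

```python
import statistics

# Hypothetical household incomes in units of $10,000 (not the slide's data).
community_a = [2, 3, 3, 3, 3, 4, 5, 5]
# Same community with one household replaced by a very high income.
community_b = [2, 3, 3, 3, 3, 4, 5, 100]

for name, incomes in (("a", community_a), ("b", community_b)):
    print(name, statistics.mean(incomes), statistics.median(incomes))
```

The single outlier pulls the mean up sharply while the median stays where it was, which is why a decision rule based on the mean alone can be misleading.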

21 Exercise
1 2 3 4 5 6 7 3 5 6 8 10 12 7 5 2 6 3 10 1 2 4 6 8 10 12 14 16 3 5 6 10 12 15 14 18 23 32 12 16 5 10 21 24 25 7

22 Mode
The value that occurs most often.
Steps:
Put your data in an ordered array (sort).
Find the data value(s) that repeat most frequently.
The mode is not affected by extreme values either.
Examples on the slide: No mode! Mode = 5. Mode = 5 and 9. For the categories Boston, Austin, San Diego, Los Angeles: Mode = San Diego.
Reading information from the chart.
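
A minimal Python sketch of finding the mode is shown below. It assumes the convention that a list in which no value repeats has no mode; the input lists are hypothetical, and `modes` is an illustrative name.

```python
from collections import Counter


def modes(data):
    """Return all most-frequent values; an empty list if no value repeats."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)


# Hypothetical lists illustrating one mode, two modes, no mode, and categories.
print(modes([1, 5, 5, 5, 7, 9]))
print(modes([5, 5, 9, 9, 2]))
print(modes([1, 2, 3, 4]))
print(modes(["Boston", "San Diego", "Austin", "San Diego"]))
```

Unlike the mean and median, the mode also works for categorical data such as city names.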

23 Find Mode and Median from Frequency Table
Below is a frequency table showing the number of days the teams take to finish their projects.
Find the mean, median and mode.
Create a histogram and locate the mean, median and mode.
Describe the shape of the histogram, and find the relationship between mean, median and mode.
Days to Complete | Frequency
5 | 4
6 | 12
7 | 8
8 | 6
9 | 4
10 | 2
Mean = 252 / 36 = 7 days (the slide prints 6.31 days, which does not match this table).

24 Shape of a distribution
Symmetric: Mean = Median = Mode
Left-skewed (longer tail extends to the left): Mean < Median < Mode
Right-skewed (longer tail extends to the right): Mode < Median < Mean
Note that the mean is affected the most by extreme values, so the mean typically leans towards the tail compared to the other two measures.

25 Measures of center location
Mean, Median, Mode
The mean is generally used, unless extreme values (outliers) exist. The next most common is the median, since the median is not sensitive to extreme values. The mode is sometimes used when one value has a really large frequency.
Think: Are house prices normally right-skewed or left-skewed? What measure do people normally use to describe the housing market?

26 Range
Simplest measure of variation.
Describes how widely the data spread.
Formula: Range = Maximum Value - Minimum Value
Example: Range = 13

27 Disadvantage of Range
Ignores the way in which data are distributed: two differently shaped data sets on the slide both have Range = 5.
Sensitive to outliers:
1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5 Range = 5 - 1 = 4
1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120 Range = 120 - 1 = 119
Range is affected the most by outliers.
Feb 8, 2006
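
The outlier example can be reproduced with a few lines of Python. The two lists are the ones shown on the slide; the function name `data_range` is an illustrative choice.

```python
def data_range(data):
    """Range = maximum value - minimum value."""
    return max(data) - min(data)


# The two lists from the slide; they differ only in the last value.
without_outlier = [1] * 6 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 6 + [2] * 8 + [3] * 4 + [4, 120]

print(data_range(without_outlier))
print(data_range(with_outlier))
```

Changing a single value out of twenty moves the range from 4 to 119, because the range depends only on the two most extreme values.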

28 Other measures
Percentiles: measure the percentage of data below a value.
E.g. if the 60th percentile is 1240 (SAT score), that means 60% of students got a score less than 1240. Correspondingly, 40% of students got 1240 or higher.
How to find a percentile?
The pth percentile in an ordered array of n values is the value in the ith position, where $i = \frac{p}{100}(n + 1)$

29 Example
Find the 80th percentile from the annual income data.
Steps:
Sort the data.
Find the location for the 80th percentile: $i = \frac{80}{100}(n + 1) = 80.8$ (here n = 100)
Find the 80.8th person’s income.
Where is the 80.8th person? Combine the 80th and 81st numbers:
80th → 62245
81st → 63485
80.8th → 62245*20% + 63485*80% = 63237
The 80.8th value should be in between, and closer to the 81st. The 81st value gets the 80% weight because the decimal is .8.
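
The location-and-interpolation procedure can be sketched in Python. This assumes the location formula i = p/100 × (n + 1), which is consistent with the 80.8 location in the example; the income list below is a hypothetical stand-in (the slide's data set is not available), and `percentile` is an illustrative name.

```python
def percentile(data, p):
    """p-th percentile using location i = p/100 * (n + 1) with interpolation."""
    ordered = sorted(data)
    n = len(ordered)
    i = p / 100 * (n + 1)          # 1-based location, may be fractional
    lower = int(i)                 # whole part: position of the lower neighbour
    fraction = i - lower           # decimal part: weight of the upper neighbour
    if lower < 1:
        return ordered[0]          # location falls before the first value
    if lower >= n:
        return ordered[-1]         # location falls at or after the last value
    return ordered[lower - 1] * (1 - fraction) + ordered[lower] * fraction


# Hypothetical stand-in for 100 sorted incomes: the values 1 to 100.
incomes = list(range(1, 101))
print(percentile(incomes, 80))       # between the 80th and 81st values

# The slide's own interpolation between the 80th and 81st incomes.
print(62245 * 0.20 + 63485 * 0.80)
```

Other textbooks and software use slightly different location formulas, so results near the ends of small data sets can differ from this sketch.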

30 Exercise
Find the 25th percentile.
Find the 50th percentile.
Find the 75th percentile.
Explain the meaning of the 50th percentile. Have you learnt a similar measurement?
How many people have income levels between the 25th and the 50th percentiles?
How many people have income levels between the 50th and the 75th percentiles?

31 Quartiles
The 25th, 50th, and 75th percentiles.
Called the first, second, and third quartiles, respectively.
Written as Q1, Q2, Q3, respectively.
The quartiles split the ranked data into 4 equal groups:
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

32 Example
Find the first quartile in the data sample:

33 Interquartile Range
Recall: Range? Disadvantage of range?
Interquartile Range = Q3 - Q1
Example: Q1 = 13.5, Q3 = 19
Interquartile range = Q3 - Q1 = 19 - 13.5 = 5.5

34 Summary
Understand and compute the following two sets of data measures:
Measures of central tendency: Mean, Median, and Mode
Measures of variation: Range, Variance, and Standard deviation
Other ways to describe data: Percentiles, Quartiles, Interquartile range

