10b. Univariate Analysis Part 2
CSCI N207 Data Analysis Using Spreadsheet
Lingma Acheson, linglu@iupui.edu
Department of Computer and Information Science, IUPUI

The Range
Difference between the minimum and maximum values in a data set.
A larger range usually (but not always) indicates a larger spread or deviation in the values of the data set.
E.g. (73, 66, 69, 67, 49, 60, 81, 71, 78, 62, 53, 87, 74, 65, 74, 50, 85, 45, 63, 100)
Range: 100 – 45 = 55
A single extreme low or high value can throw off the range, e.g. (20, 76, 77, 80, 82, 82, 84, 88, 90, 93, 99, 100)
Range: 100 – 20 = 80
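The two range calculations above can be reproduced with a short Python sketch. The helper name `value_range` is an illustrative choice, not something from the slides; the last line drops the single low value to show how much one outlier inflates the range.

```python
def value_range(values):
    """Return the range of a data set: maximum minus minimum."""
    return max(values) - min(values)

scores = [73, 66, 69, 67, 49, 60, 81, 71, 78, 62,
          53, 87, 74, 65, 74, 50, 85, 45, 63, 100]
with_outlier = [20, 76, 77, 80, 82, 82, 84, 88, 90, 93, 99, 100]

print(value_range(scores))            # 100 - 45 = 55
print(value_range(with_outlier))      # 100 - 20 = 80
# Without the extreme low value (20) the range shrinks to 100 - 76 = 24.
print(value_range(with_outlier[1:]))
```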

Variance
One measure of dispersion (deviation from the mean) of a data set: how far away is each datum from the mean?
Variance is the average squared distance to the mean.
The larger the variance, the farther the data lie from the mean on average.
E.g. 73, 67, 70, 67, 49, 60, 81, 71, 78, 62, 53, 87, 72, 65, 74, 50, 84, 45, 62, 100
Variance = ((73 − 68.5)^2 + (67 − 68.5)^2 + (70 − 68.5)^2 + … + (100 − 68.5)^2) / 20
In general, $\text{Variance} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$, where $\bar{x}$ is the average value of the data set (68.5 here) and $N$ is the number of values.
Excel functions:
VARP() – variance for the whole population (data set is complete); divides by N
VAR() – variance estimated from a sample (data set is a sample); divides by N − 1
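As a rough illustration of the variance calculation for the example data, the following Python sketch computes it by hand and then compares the result with the standard library's `statistics.pvariance` and `statistics.variance`. Treating those two functions as the counterparts of VARP() and VAR() is an assumption made here for illustration; the slides only discuss Excel.

```python
import math
import statistics

data = [73, 67, 70, 67, 49, 60, 81, 71, 78, 62,
        53, 87, 72, 65, 74, 50, 84, 45, 62, 100]

mean = sum(data) / len(data)                      # 68.5
squared_deviations = [(x - mean) ** 2 for x in data]

# Population variance: divide by N (the data set is complete).
population_variance = sum(squared_deviations) / len(data)
# Sample variance: divide by N - 1 (the data set is a sample).
sample_variance = sum(squared_deviations) / (len(data) - 1)

print(mean, population_variance, sample_variance)

# The hand calculation agrees with the library functions.
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))
```

The sample variance is always somewhat larger than the population variance for the same numbers, because the same sum is divided by a smaller count.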

Standard Deviation
Square root of the variance, which undoes the squaring of the distances.
The magnitude of the number is therefore more in line with the values in the data set.
Can be thought of, roughly, as the typical deviation from the mean of a data set.
Standard Deviation = $\sqrt{\text{Variance}}$
Excel functions:
STDEVP() – use this when the data set is complete
STDEV() – use this when the data set is a sample
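A minimal Python sketch of the same idea follows. The slide does not name a data set, so reusing the 20 values from the variance example is an assumption; `statistics.pstdev` and `statistics.stdev` are used here as stand-ins for STDEVP() and STDEV().

```python
import math
import statistics

# Assumed data: the 20 values from the variance example.
data = [73, 67, 70, 67, 49, 60, 81, 71, 78, 62,
        53, 87, 72, 65, 74, 50, 84, 45, 62, 100]

population_sd = statistics.pstdev(data)   # complete data set
sample_sd = statistics.stdev(data)        # data set is a sample

# Each standard deviation is the square root of the matching variance.
assert math.isclose(population_sd, math.sqrt(statistics.pvariance(data)))
assert math.isclose(sample_sd, math.sqrt(statistics.variance(data)))

print(round(population_sd, 2), round(sample_sd, 2))
```

Both results are on the same scale as the original values, which is why the standard deviation is easier to interpret than the variance.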

Frequency Tables
Use a frequency table to observe the distribution.
E.g. consider the following data set: {45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74, 74, 78, 81, 85, 87, 100}
Need to determine how to group the data into different bins.
Category Labels   Frequency
0-50              3
51-60             2
61-70             6
71-80             5
81-90             3
>90               1
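The binning step can be sketched in Python as follows. The function name `frequency_table` and the use of `bisect_left` are illustrative choices rather than anything prescribed by the slides; each bin is treated as including its upper bound, so 50 falls in 0-50 and 60 falls in 51-60.

```python
from bisect import bisect_left

data = [45, 49, 50, 53, 60, 62, 63, 65, 66, 67,
        69, 71, 73, 74, 74, 78, 81, 85, 87, 100]

upper_bounds = [50, 60, 70, 80, 90]   # inclusive upper edge of each closed bin
labels = ["0-50", "51-60", "61-70", "71-80", "81-90", ">90"]

def frequency_table(values, bounds, bin_labels):
    """Count how many values fall into each bin."""
    counts = [0] * len(bin_labels)
    for value in values:
        # bisect_left gives the index of the first bound >= value;
        # values above every bound land in the final (open-ended) bin.
        counts[bisect_left(bounds, value)] += 1
    return dict(zip(bin_labels, counts))

table = frequency_table(data, upper_bounds, labels)
for label, count in table.items():
    print(f"{label:<8}{count}")
```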

Histogram
A histogram is simply a column chart of the frequency table.
Category Labels   Frequency
0-50              3
51-60             2
61-70             6
71-80             5
81-90             3
>90               1
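To make the link between the table and the chart concrete, here is a small Python sketch that draws the frequency table as a text histogram, with one `#` per observation. It lays the bars out horizontally rather than as columns, and the helper name `text_histogram` is hypothetical.

```python
frequency = {"0-50": 3, "51-60": 2, "61-70": 6,
             "71-80": 5, "81-90": 3, ">90": 1}

def text_histogram(table, symbol="#"):
    """Return one line per bin: right-aligned label, then a bar of symbols."""
    width = max(len(label) for label in table)
    return [f"{label:>{width}} | {symbol * count}"
            for label, count in table.items()]

lines = text_histogram(frequency)
print("\n".join(lines))
```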

Data Distribution
Category Labels   Frequency
0-50              1
51-60             3
61-70             5
71-80             5
81-90             3
>90               1

Normal Distributions
The bell curve
– Symmetrical
– Mean ≈ Median

Skewed Distributions
Most of the time, distributions are skewed.
Positively skewed distribution (long tail to the right): typically mean > median
Negatively skewed distribution (long tail to the left): typically mean < median
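The mean-versus-median comparison can be tried on the two data sets used earlier in these slides. The sketch below is only a rule-of-thumb check, not a formal skewness statistic, and the function name `skew_direction` is an illustrative assumption.

```python
import statistics

def skew_direction(values):
    """Rule of thumb: compare the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive"
    if mean < median:
        return "negative"
    return "none apparent"

# One extreme low value (20) drags the mean below the median.
low_outlier = [20, 76, 77, 80, 82, 82, 84, 88, 90, 93, 99, 100]
# The frequency-table data has one high value (100) pulling the mean up.
scores = [45, 49, 50, 53, 60, 62, 63, 65, 66, 67,
          69, 71, 73, 74, 74, 78, 81, 85, 87, 100]

print(skew_direction(low_outlier))   # mean below the median of 83
print(skew_direction(scores))        # mean 68.6 vs median 68
```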

Data Distribution
{45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74, 74, 78, 81, 85, 87, 100}
Average (68.6) and Median (68)
Mode (74)
−1 SD = 55.14, +1 SD = 82.06
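The summary figures for this data set (average 68.6, median 68, mode 74, and a band from 55.14 to 82.06 at one standard deviation either side of the mean) can be checked with a short Python sketch. It assumes the population standard deviation (the STDEVP() case), which is the choice that reproduces the band edges shown; the count of values inside the band is an extra step added here for illustration.

```python
import statistics

data = [45, 49, 50, 53, 60, 62, 63, 65, 66, 67,
        69, 71, 73, 74, 74, 78, 81, 85, 87, 100]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
sd = statistics.pstdev(data)          # population SD, as with STDEVP()

lower, upper = mean - sd, mean + sd   # the -1 SD and +1 SD marks
within = [x for x in data if lower <= x <= upper]
share = len(within) / len(data)

print(mean, median, mode)
print(round(lower, 2), round(upper, 2))
print(f"{len(within)} of {len(data)} values within 1 SD ({share:.0%})")
```

For this small data set the share within one standard deviation comes out somewhat below the 68% expected of a perfectly normal distribution.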

Standard Deviation
With a normal distribution:
mean ± 1*SD covers about 68% of data
mean ± 2*SD covers about 95% of data
mean ± 3*SD covers about 99.7% of data
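These coverage percentages follow from the normal distribution itself: the fraction of values within k standard deviations of the mean is erf(k / √2). The Python sketch below evaluates that expression with `math.erf`; the function name `normal_coverage` is a hypothetical helper.

```python
import math

def normal_coverage(k):
    """Fraction of a normal distribution within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mean ± {k}*SD: {normal_coverage(k):.1%}")
```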
