Daniela Stan, PhD School of CTI, DePaul University


1 Daniela Stan, PhD School of CTI, DePaul University
Data Analysis and Statistical Software I ( )
Quarter: Autumn 02/03
1/18/2019 Daniela Stan - CSC323

2 Outline
- Introduction
- Individuals and Variables
- Exploratory Data Analysis
- Describing Distributions with Graphs
- Describing Distributions with Numbers

3 Introduction
Data:
- numbers, measurements, facts
Where is the data coming from?
- medical field
- automotive industry
- stock market & investment
- census bureau
- customer profiling; examples
Statistics:
- the science of collecting, organizing, and interpreting data; the goal is to gain understanding from data.

4 Individuals and Variables
Individuals:
- are the objects described by a set of data
- individuals ~ cases ~ records
Variable:
- any characteristic of an individual; it can take different values for different individuals.
- variable ~ attribute
- observation ~ value of a variable
Types of Variables:
- Categorical Variables
- Quantitative Variables

5 Individuals and Variables (cont.)
Example of individuals and variables:

Name             Age  Gender  Marital Status
John Smith       25   Male    Single
Joe Doe          32           Married
Phillip Roberts  21
Sarah Lazar      26   Female

The distribution of a variable gives:
- what values the variable takes;
- how often it takes these values: as a count, or as a percent or fraction.
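As a rough illustration of what "distribution" means for a categorical variable, the sketch below tabulates counts and percents. The function name `distribution` and the sample values are hypothetical and are not taken from the slides.

```python
from collections import Counter


def distribution(values):
    """Return {value: (count, percent)} for a categorical variable."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, 100.0 * count / n) for value, count in counts.items()}


# Hypothetical observations of a categorical variable (not from the slides).
marital_status = ["Single", "Married", "Single", "Married", "Single"]

for value, (count, percent) in distribution(marital_status).items():
    print(f"{value}: count={count}, percent={percent:.1f}")
```

The fraction is simply the percent divided by 100, so one table can be read either way.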

6 How Do We Describe the Variables?
Exploratory Data Analysis
Single variable:
- Categorical variable: bar graphs, pie charts
- Numerical variable: stemplots, histograms, five-number summary, standard deviation
Two or more variables:
- Scatterplots, correlation, regression

7 Bar Graphs and Pie Charts
The distribution of the highest level of education for people aged 25 to 34 years:

Education              Count (millions)  Percent
Less than high school   4.7              12.3
High school graduate   11.8              30.7
Some college           10.9              28.3
Bachelor’s degree       8.5              22.1
Advanced degree         2.5               6.6

8 Bar Graphs and Pie Charts (cont.)
Pareto chart
Pie charts require that you include all categories that make up a whole.

9 Stemplots
Stemplot ~ stem-and-leaf plot
To make a stemplot:
1. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
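The three stemplot steps can be sketched in a few lines of Python. This is a minimal illustration, assuming non-negative whole-number observations and a non-empty list. The function name `stemplot` is hypothetical, and printing stems that have no leaves is a convention assumed here rather than something the slide states.

```python
from collections import defaultdict


def stemplot(observations):
    """Return the rows of a stemplot for non-negative whole numbers."""
    leaves = defaultdict(list)
    # Sorting first means each stem's leaves end up in increasing order.
    for x in sorted(observations):
        stem, leaf = divmod(x, 10)  # stem: all but the final digit; leaf: final digit
        leaves[stem].append(leaf)

    rows = []
    # Stems in a column, smallest at the top; empty stems are kept (assumption).
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row}".rstrip())
    return rows


# The four ages from the example table on slide 5.
for row in stemplot([25, 32, 21, 26]):
    print(row)
```

For the four ages this prints one row for stem 2 with leaves 1, 5, 6 and one row for stem 3 with leaf 2.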

10 Stemplots (cont.)
Example: Problem 1.24 (page 29)
Back-to-back stemplot
How do stemplots deal with large data sets?
- Splitting stems: one stem with leaves between 0 and 4, one stem with leaves between 5 and 9
How do stemplots deal with observations having many digits?
- Rounding

11 Stemplots (cont.)
Advantages of stemplots:
- Describe the shape of a distribution for small numbers of observations
Disadvantages:
- Don’t work well with large data sets, since they display the value of every observation
- Divide the observations into groups (stems) determined by the number system rather than by judgment

12 Histograms
A histogram breaks the range of values of a variable into intervals and displays the count or percent of the observations that fall into each interval.
- Count ~ frequency
- Percent ~ relative frequency
Example: Problem 1.34 (page 34)
Disadvantages:
- How many intervals? What width for the histogram intervals?
- The original data cannot be recovered
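To make the count/percent idea concrete, here is a minimal sketch that bins observations into equal-width intervals. The function name `histogram`, its parameters, and the choice of left-closed intervals are assumptions made for illustration. The number and width of intervals are left to the caller, which is exactly the judgment call the slide lists as a disadvantage.

```python
def histogram(values, start, width, bins):
    """Tabulate values into `bins` intervals [lo, hi) of equal width.

    Returns a list of (lo, hi, count, percent) tuples.
    """
    counts = [0] * bins
    for x in values:
        k = int((x - start) // width)  # index of the interval containing x
        if not 0 <= k < bins:
            raise ValueError(f"{x} falls outside the histogram range")
        counts[k] += 1

    n = len(values)
    return [
        (start + k * width, start + (k + 1) * width, count, 100.0 * count / n)
        for k, count in enumerate(counts)
    ]


# The four ages from the example table on slide 5, in intervals of width 5.
for lo, hi, count, percent in histogram([21, 25, 26, 32], start=20, width=5, bins=3):
    print(f"[{lo}, {hi}): frequency={count}, relative frequency={percent:.0f}%")
```

Only the interval counts survive in the result, which is why the original data cannot be recovered from a histogram.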

13 Examining Distributions
In any graph of data, look for:
Overall pattern of a distribution, described by:
- Center ~ midpoint
- Spread ~ range between the smallest and largest value ~ variability
- Shape:
  1. Symmetric or skewed
  2. Unimodal (one major peak/mode) or multimodal
Deviations from the pattern:
- Outliers
Example

14 Examining Distributions (cont.)
The histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa test.
Shape?
- symmetric
- unimodal

15 Outliers
An outlier is an individual value that falls outside the overall pattern.
Outlier ~ extreme observation
How to deal with outliers?
Sources:
- outliers from equipment failure
- errors in recording data
- extraordinary occurrence
Applications

16 Time Plots
A time plot of a variable plots each observation against the time at which it was measured.

17 Time Plots (cont.)
Time series: data sets produced by measurements of a variable taken at regular intervals over time.
Time plots can reveal the main features of a time series, such as:
- Seasonal variation: a pattern that repeats itself at known regular intervals of time
- A trend: a persistent, long-term rise or fall

18 Describing Distributions with Numbers
Measuring center: the mean
mean ~ average value
If the n observations are $x_1, x_2, \ldots, x_n$, their mean is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
or, in more compact notation,
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

19 Describing Distributions (cont.)
Measuring center: the median
M is the number such that half the observations are smaller and the other half are larger; median ~ middle value
To find the median M of a distribution, follow these steps:
1. Arrange all n observations in order of size, from smallest to largest.
2. If n is odd, M is the center observation in the ordered list; the location is (n+1)/2 from the bottom of the list.
3. If n is even, M is the mean of the two center observations in the ordered list; the location is again (n+1)/2 from the bottom of the list, which falls halfway between the two center observations.
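The median steps above translate almost line for line into code. This is a minimal sketch; the function name `median` is chosen for illustration, and a non-empty list of numbers is assumed.

```python
def median(observations):
    """Median M found by the three steps: sort, then take the center."""
    ordered = sorted(observations)  # step 1: smallest to largest
    n = len(ordered)
    location = (n + 1) / 2  # position counted from the bottom of the list

    if n % 2 == 1:
        # Step 2: n is odd, so the location is a whole number.
        return ordered[int(location) - 1]

    # Step 3: n is even, so the location ends in .5 and M is the mean
    # of the two observations on either side of it.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


# The four ages from the example table on slide 5: the location is 2.5.
print(median([25, 32, 21, 26]))
```

With the four ages the ordered list is 21, 25, 26, 32, so M is the mean of 25 and 26.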

20 Describing Distributions (cont.)
Example 1.13 (textbook, page 40)
Data: Fuel economy (miles per gallon) for 2001 model two-seater cars

21 Describing Distributions (cont.)
Calculate median:
1. Arrange the data in increasing order:
2. Find the location of the median: (n+1)/2 = (19+1)/2 = 10, the 10th position

22 Describing Distributions (cont.)
How does the median change if we remove the last observation in the sorted list?
How does the median change if the value of the last observation is changed to 680?
Calculate the mean:
How does the mean change if we remove the outlier?
How does the mean change if the value of the last observation is changed to 680?

23 Describing Distributions (cont.)
Mean versus Median
1. The mean is sensitive to the influence of extreme observations/outliers, or skewed distributions.
2. A resistant measure of any aspect of a distribution is relatively unaffected by changes in the numerical value of a small proportion of the total number of observations, no matter how large these changes are.
3. The mean is not a resistant measure of the center.
4. The median is a resistant measure of the center.
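A small experiment shows what "resistant" means in practice. The data below are hypothetical values invented for this sketch; they are not the fuel-economy data from Example 1.13. Only the idea of changing the last observation to 680 is borrowed from the slides.

```python
from statistics import mean, median

# Hypothetical sorted data with one high outlier (not the textbook data).
data = [21, 23, 24, 25, 26, 28, 68]

# Change only the last observation, as in the slide's question.
altered = data[:-1] + [680]

print(f"original: mean={mean(data):.1f}, median={median(data)}")
print(f"altered:  mean={mean(altered):.1f}, median={median(altered)}")
```

One changed value moves the mean a long way, while the median stays at the same center observation.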

26 Describing Distributions (cont.)
Measuring spread: the quartiles
The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
- The 50th percentile = median, M
- The 25th percentile = first quartile, Q1
- The 75th percentile = third quartile, Q3

27 Describing Distributions (cont.)
To calculate the quartiles:
1. Arrange the observations in increasing order and locate the median M in the list of observations.
2. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
3. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.
Example 1.13: M = ?, Q1 = ?, Q3 = ?
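The quartile rule above can be sketched as follows, assuming at least two observations. The helper names `_median_sorted` and `quartiles` are hypothetical. Note that statistical packages often use other interpolation rules, so their quartiles can differ slightly from this hand method.

```python
def _median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(observations):
    """Return (Q1, M, Q3) using the median-of-halves rule."""
    ordered = sorted(observations)  # step 1: increasing order
    n = len(ordered)
    half = n // 2

    # Observations strictly to the left / right of the median's location.
    # When n is odd, the center observation belongs to neither half.
    left = ordered[:half]
    right = ordered[half + (n % 2):]

    return _median_sorted(left), _median_sorted(ordered), _median_sorted(right)


# Hypothetical data for illustration.
print(quartiles([1, 2, 3, 4, 5, 6, 7]))
```

For seven observations the median sits at position 4, so Q1 is the median of the first three values and Q3 the median of the last three.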

28 Describing Distributions (cont.)
The Five-Number Summary of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from the smallest to the largest.
In symbols, the five-number summary is
Minimum Q1 M Q3 Maximum
A boxplot is a graph of the five-number summary:
- A central box spans the quartiles Q1 and Q3
- A line in the box marks the median M
- Lines extend from the box out to the smallest and largest observations
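A five-number summary can be computed with a short function. This sketch assumes at least two observations and takes each quartile as the median of the observations on one side of the overall median's location. The names `_median_sorted` and `five_number_summary` are hypothetical.

```python
def _median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(observations):
    """Return (Minimum, Q1, M, Q3, Maximum)."""
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    q1 = _median_sorted(ordered[:half])             # left of the median
    q3 = _median_sorted(ordered[half + (n % 2):])   # right of the median
    return ordered[0], q1, _median_sorted(ordered), q3, ordered[-1]


# Hypothetical data for illustration.
minimum, q1, m, q3, maximum = five_number_summary([1, 2, 3, 4, 5, 6, 7])
print(minimum, q1, m, q3, maximum)
```

These five values are all a boxplot needs: the box runs from Q1 to Q3, the inner line sits at M, and the outer lines reach the minimum and maximum.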

29 Describing Distributions (cont.)
Example: Numerical description of shopping data using SPSS

30 Recommended Problems
Chapter 1: Section 1.1
IPS web site:

