Data Analysis and Statistical Software I Quarter: Winter 02/03


1 Data Analysis and Statistical Software I Quarter: Winter 02/03
Daniela Stan Raicu School of CTI, DePaul University 11/24/2018 Daniela Stan - CSC323

2 Outline
- Introduction
- Individuals and Variables
- Exploratory Data Analysis
- Describing Distributions with Graphs
- Describing Distributions with Numbers

3 Introduction
Data:
- numbers, measurements, facts
Where does the data come from?
- medical field
- automotive industry
- stock market & investment
- census bureau
- customer profiling; examples
Statistics:
- the science of collecting, organizing, and interpreting data; the goal is to gain understanding from data.

4 Individuals and Variables
Individuals:
- the objects described by a set of data
- individuals ~ cases ~ records
Variable:
- any characteristic of an individual; it can take different values for different individuals
- variable ~ attribute
- observation ~ value of a variable
Types of Variables:
- Categorical Variables
- Quantitative Variables

5 Individuals and Variables (cont.)
Example of individuals and variables:
Name              Age   Gender   Marital Status
John Smith        25    Male     Single
Joe Doe           32             Married
Phillip Roberts   21
Sarah Lazar       26    Female
The distribution of a variable gives:
- what values the variable takes;
- how often it takes these values, as a count or as a percent or fraction.
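As a minimal sketch of how a distribution can be tabulated, the Python below counts how often each value of a categorical variable occurs and converts the counts to percents. The `marital_status` list is made-up illustration data and `distribution` is a hypothetical helper name; neither comes from the course materials.

```python
from collections import Counter

def distribution(values):
    """Return {value: (count, percent)} for a list of observations."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, 100.0 * count / n) for value, count in counts.items()}

# Hypothetical observations of one categorical variable (not from the slides)
marital_status = ["Single", "Married", "Single", "Single"]

for value, (count, percent) in distribution(marital_status).items():
    print(f"{value:<10} count={count}  percent={percent:.1f}")
```

The same counts are what a bar graph or pie chart would display.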

6 How Do We Describe the Variables?
Exploratory Data Analysis
- Single variable
  - Categorical variable: bar graphs, pie charts
  - Numerical variable: stemplots, histograms, five-number summary, standard deviation
- Two or more variables: scatterplots, correlation, regression

7 Bar Graphs and Pie Charts
The distribution of the highest level of education for people aged 25 to 34 years:
Education               Count (millions)   Percent
Less than high school         4.7            12.3
High school graduate         11.8            30.7
Some college                 10.9            28.3
Bachelor’s degree             8.5            22.1
Advanced degree               2.5             6.6

8 Bar Graphs and Pie Charts (cont.)
- Pareto chart
- Pie charts require that you include all categories that make up a whole.

9 Stemplots
Stemplot ~ stem-and-leaf plot
To make a stemplot:
1. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
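The three steps above can be sketched in Python roughly as follows. This is an illustrative version that assumes non-negative integer observations; the function name `stemplot` and the sample values are hypothetical, and stems with no leaves are printed as empty rows so that gaps in the data stay visible.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot for non-negative integers."""
    leaves = defaultdict(list)
    # Step 1: split each observation into stem (all but last digit) and leaf (last digit).
    # Sorting first means the leaves of every stem arrive in increasing order (step 3).
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    # Step 2: list the stems from smallest to largest, including empty ones.
    low, high = min(leaves), max(leaves)
    width = len(str(high))
    rows = []
    for stem in range(low, high + 1):
        row = f"{stem:>{width}} | " + "".join(str(leaf) for leaf in leaves[stem])
        rows.append(row.rstrip())
    return rows

# Hypothetical sample values
for row in stemplot([203, 210, 212, 215, 220, 260]):
    print(row)
```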

10 Stemplots (cont.)
Example: Problem 1.24 (page 29)
- Back-to-back stemplot
How do stemplots deal with large data sets?
- Splitting stems: one stem with leaves between 0 and 4, and one stem with leaves between 5 and 9
How do stemplots deal with observations that have many digits?
- Rounding

11 Stemplots (cont.)
Advantages of stemplots:
- Describe the shape of a distribution for small numbers of observations
Disadvantages:
- Don’t work well with large data sets, since they display every individual value
- Divide the observations into groups (stems) determined by the number system rather than by judgment

12 Histograms
A histogram breaks the range of values of a variable into intervals and displays the count or percent of the observations that fall into each interval.
- Count ~ frequency
- Percent ~ relative frequency
Example: Problem 1.34 (page 34)
Disadvantages:
- How many intervals? What width for the histogram intervals?
- The original data cannot be recovered
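The counting behind a histogram can be sketched as below. This is a minimal illustration, assuming equal-width intervals that include their left endpoint and exclude their right endpoint; `frequency_table`, its parameters, and the sample values are hypothetical names and data chosen for the example.

```python
def frequency_table(values, start, width, n_bins):
    """Count observations in n_bins equal-width intervals [left, right).

    Returns one (left, right, count, percent) tuple per interval.
    """
    if not values:
        raise ValueError("need at least one observation")
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)  # which interval v falls into
        if not 0 <= index < n_bins:
            raise ValueError(f"{v} falls outside the intervals")
        counts[index] += 1
    total = len(values)
    return [
        (start + i * width, start + (i + 1) * width, c, 100.0 * c / total)
        for i, c in enumerate(counts)
    ]

# Hypothetical observations; 120 lands in [120, 140), not in [100, 120)
for left, right, count, percent in frequency_table([100, 119, 120, 139, 140, 203], 100, 20, 6):
    print(f"{left}-{right}: count={count} percent={percent:.1f}")
```

Changing `width` or `n_bins` changes the picture, which is the "how many intervals?" question raised above; the individual values cannot be read back from the counts.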

13 Example: Weight Data

14 Weight Data: Stemplot (Stem & Leaf)
11 | 009
14 | 08
16 | 555
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
Key: 20|3 means 203 pounds
Stems = 10’s
Leaves = 1’s

15 Weight Data: Frequency Table
* Left endpoint is included in the group, right endpoint is not.

16 Weight Data: Histogram
[Histogram; horizontal axis: Weight, from 100 to 280 in steps of 20]
* Left endpoint is included in the group, right endpoint is not.

17 Examining Distributions
In any graph of data, look for:
Overall pattern of a distribution, described by:
- Center ~ midpoint
- Spread ~ range between the smallest and largest value ~ variability
- Shape:
  1. Symmetric or skewed
  2. Unimodal (one major peak/mode) or multimodal
Deviations from the pattern:
- Outliers
Example

18 Examining Distributions (cont.)
The histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa test.
Shape?
- symmetric
- unimodal

19 Symmetric Histograms Bell-Shaped

20 Symmetric Histograms Mound-Shaped

21 Asymmetric Histograms Skewed to the Left

22 Asymmetric Histograms Skewed to the Right

23 Outliers
An outlier is an individual value that falls outside the overall pattern.
Outlier ~ extreme observation
How to deal with outliers?
Sources:
- outliers from equipment failure
- errors in recording data
- extraordinary occurrence
Applications

24 Time Plots
A time plot of a variable plots each observation against the time at which it was measured.

25 Time Plots (cont.)
Time series: data sets produced by measurements of a variable taken at regular intervals over time.
Time plots can reveal the main features of a time series, such as:
- Seasonal variation: a pattern that repeats itself at known regular intervals of time
- A trend: a persistent, long-term rise or fall

26 Describing Distributions with Numbers
Measuring center: the mean
mean ~ average value
If the n observations are x_1, x_2, …, x_n, their mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
or, in more compact notation,
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

27 Describing Distributions (cont.)
Measuring center: the median
M is the number such that half the observations are smaller and the other half are larger; median ~ middle value
To find the median M of a distribution, follow the steps:
1. Arrange all n observations in order of size, from smallest to largest.
2. If n is odd, M is the center observation in the ordered list; the location is (n+1)/2 from the bottom of the list.
3. If n is even, M is the mean of the two center observations in the ordered list; the location is again (n+1)/2 from the bottom of the list, which falls halfway between the two center observations.
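The steps for finding the median translate almost line for line into code. The sketch below is illustrative only; the function name `median` and the sample lists are chosen for the example, and Python's standard library already provides `statistics.median` for real use.

```python
def median(observations):
    """Median by the sort-and-pick-the-middle rule."""
    ordered = sorted(observations)          # step 1: arrange in increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("no observations")
    middle = n // 2
    if n % 2 == 1:                          # step 2: n odd, take the center observation
        return ordered[middle]
    # step 3: n even, average the two center observations
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([5, 1, 3]))      # odd n: position (3+1)/2 = 2 in the sorted list
print(median([4, 1, 3, 2]))   # even n: position (4+1)/2 = 2.5, between the 2nd and 3rd values
```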

28 Describing Distributions (cont.)
Example 1.13 (textbook, page 40); Data: Fuel economy (miles per gallon) for 2001 model two-seater cars

29 Describing Distributions (cont.)
Calculate median:
1. Arrange the data in increasing order:
2. Find the location of the median: (n+1)/2 = (19+1)/2 = 10, the 10th position

30 Describing Distributions (cont.)
- How does the median change if we remove the last observation in the sorted list?
- How does the median change if the value of the last observation is changed to 680?
Calculate the mean:
- How does the mean change if we remove the outlier?
- How does the mean change if the value of the last observation is changed to 680?

31 Describing Distributions (cont.)
Mean versus Median
1. The mean is sensitive to the influence of extreme observations/outliers, or skewed distributions.
2. A resistant measure of any aspect of a distribution is relatively unaffected by changes in the numerical value of a small proportion of the total number of observations, no matter how large these changes are.
3. The mean is not a resistant measure of the center.
4. The median is a resistant measure of the center.
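A small experiment makes the difference in resistance concrete. The data below are hypothetical, not the textbook's fuel-economy values; the only change between the two lists is that the largest observation is replaced by 680.

```python
from statistics import mean, median

data = [15, 17, 19, 21, 23]        # hypothetical observations
with_outlier = data[:-1] + [680]   # largest value replaced by 680

print("original:     mean =", mean(data), " median =", median(data))
print("with outlier: mean =", mean(with_outlier), " median =", median(with_outlier))
```

The median stays at the middle value of the sorted list in both cases, while the mean is pulled far toward the single extreme observation.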

33 Median versus Average
A recent newspaper article in California said that the median price of single-family homes sold in the past year in the local area was $136,000 and the average price was $149,160. How do you think these values are computed? Which do you think is more useful to someone considering the purchase of a home, the median or the average?
From Seeing Through Statistics, 2nd Edition, by Jessica M. Utts.

35 Describing Distributions (cont.)
Measuring spread: the quartiles
The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
- The 50th percentile = median, M
- The 25th percentile = first quartile, Q1
- The 75th percentile = third quartile, Q3

36 Describing Distributions (cont.)
To calculate the quartiles:
1. Arrange the observations in increasing order and locate the median M in the list of observations.
2. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
3. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.
Example: 1.13 M=?, Q1=?, Q3=?
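The quartile rule above can be sketched as follows. This assumes the convention stated in the steps: when n is odd, the overall median itself belongs to neither half. Statistical packages often use other conventions and can give slightly different quartiles. The function name `quartiles` and the sample data are hypothetical.

```python
from statistics import median

def quartiles(observations):
    """Return (Q1, M, Q3) using the median-of-each-half rule."""
    ordered = sorted(observations)       # step 1: arrange in increasing order
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    lower = ordered[: n // 2]            # positions to the left of the median's location
    upper = ordered[(n + 1) // 2 :]      # positions to the right of the median's location
    return median(lower), median(ordered), median(upper)

print(quartiles([1, 2, 3, 4, 5, 6, 7]))      # odd n: the middle value 4 is in neither half
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))   # even n: the list splits into two equal halves
```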

37 Describing Distributions (cont.)
The Five-Number Summary of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from the smallest to the largest.
In symbols, the five-number summary is: Minimum Q1 M Q3 Maximum
A boxplot is a graph of the five-number summary:
- A central box spans the quartiles Q1 and Q3
- A line in the box marks the median M
- Lines extend from the box out to the smallest and largest observations
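As a rough sketch, the five numbers a boxplot is drawn from can be computed like this. The quartiles are taken as the medians of the lower and upper halves of the sorted data, leaving out the middle observation when n is odd; `five_number_summary` and the sample values are hypothetical.

```python
from statistics import median

def five_number_summary(observations):
    """Return (minimum, Q1, M, Q3, maximum)."""
    ordered = sorted(observations)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    q1 = median(ordered[: n // 2])          # median of the lower half
    q3 = median(ordered[(n + 1) // 2 :])    # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# Hypothetical data: the box would span Q1 to Q3, with a line at M
minimum, q1, m, q3, maximum = five_number_summary([7, 1, 3, 5, 2, 6, 4])
print(minimum, q1, m, q3, maximum)
```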

38 Describing Distributions (cont.)
Example: Numerical Description of shopping data using SPSS

39 Recommended Problems Chapter 1: Section 1.1
IPS web site:

