Exploring Data Descriptive Data

1 Exploring Data Descriptive Data

2 Content
Types of Variables; Describing Data Using Graphical Summaries; Describing the Centre of Quantitative Data; Describing the Spread of Quantitative Data; How Measures of Position Describe Spread

3 Variable
A variable is any characteristic that is recorded for the subjects in a study. Examples: marital status, height, weight, IQ. A variable can be classified as either categorical (e.g. male/female) or quantitative (e.g. age). A quantitative variable is either discrete (e.g. number of children in a family) or continuous (e.g. weight: 70.25 kg).

4 Categorical Variable
A variable is categorical if each observation belongs to one of a set of categories. Examples: gender (male or female); religion (Catholic, Jewish, …); type of residence (apartment, house, …); belief in life after death (yes or no).

5 Quantitative Variable
A variable is called quantitative if its observations take numerical values that represent different magnitudes of the variable. Examples: age; number of brothers/sisters; annual income.

6 Categorical vs. Quantitative
For categorical variables, the percentage of observations in each category is important (e.g. % male, % female). For quantitative variables, the center (a representative value) and the spread (variability) are important (e.g. the average age and the variation around the average age).

7 Discrete Quantitative Variable
A quantitative variable is discrete if its possible values form a set of separate numbers: 0, 1, 2, 3, …. Examples: number of pets in a household; number of children in a family; number of foreign languages spoken by an individual.

8 Continuous Quantitative Variable
A quantitative variable is continuous if its possible values form an interval, so that there are infinitely many possible values between any two values. Measurements are typically continuous. Examples: height/weight; age; blood pressure.

9 Four Types of Scale
Nominal; Ordinal; Interval; Ratio

10 Nominal Scale
The nominal scale is the simplest scale. Numbers or letters assigned to objects serve as labels for identification or classification. For example, names and gender are categorical variables: ‘M’ for male and ‘F’ for female, or ‘1’ for male and ‘2’ for female, or ‘1’ for female and ‘2’ for male. Other examples include marital status, religion, race, colour and employment status.

11 Ordinal Scale
A subset of the nominal scale in which the categories follow an order. An ordinal scale creates an ordered (ranked) relationship. Typical ordinal scales: (i) result of an examination: first, second, third and fail; (ii) quality of products: ‘excellent’, ‘good’, ‘fair’ or ‘poor’; (iii) social class: upper, middle, lower class.

12 Interval Scale
An interval scale indicates order and distance in units. The interval is a measuring tool, but the zero point is arbitrary. Example: in a price index, the value for the base year (say 2010) is usually set to 100. If the price of bread is 40 kn (= 100) in 2010 and 50 kn (= 125) in 2015, we know that the price of bread is 25% higher in 2015. Another example of an interval scale is temperature, where the zero point is arbitrary: 0 degrees is the freezing point of water in Celsius (used in Europe), while 32 degrees is the freezing point in Fahrenheit (used in the US).
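The arithmetic behind the two interval-scale examples can be sketched in a few lines of Python. The function names `price_index` and `celsius_to_fahrenheit` are illustrative choices, not part of the slides. The temperature part shows why ratios are not meaningful when the zero point is arbitrary: the same two temperatures have different ratios in the two units.

```python
def price_index(price, base_price, base_index=100):
    """Index value of a price relative to the base-year price."""
    return base_index * price / base_price


def celsius_to_fahrenheit(celsius):
    """Convert a Celsius temperature to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Bread prices from the slide: 40 kn in 2010 (base year), 50 kn in 2015.
index_2010 = price_index(40, 40)
index_2015 = price_index(50, 40)
print(index_2010, index_2015)          # index values for 2010 and 2015
print(index_2015 - index_2010)         # percentage increase over the base year

# The same pair of temperatures in both units.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)
print(ratio_celsius, ratio_fahrenheit)  # the ratios differ, the ordering does not
```

Differences on an interval scale remain comparable within one unit system, but statements such as "twice as hot" depend on where zero was placed.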

13 Ratio Scale
Ratio scales are absolute rather than relative. If an interval scale has an absolute zero, then it is really a ratio scale. An absolute zero is a point on the scale where the attribute is zero. Examples: age, money and weight are ratio scales because they possess an absolute zero and interval properties. A person can’t have a negative weight or a negative age.

14 Describing data using graphical summaries

15 Frequency Table
A frequency table is a listing of the possible values for a variable, together with the number of observations or the relative frequencies (%) for each value.

16 Be careful to distinguish Proportions & Percentages (Rel. Freq.)
Proportions and percentages are also called relative frequencies. A proportion is a fraction between 0 and 1; a percentage is the proportion multiplied by 100.
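As a rough illustration of a frequency table with counts, proportions and percentages, here is a minimal Python sketch. The responses are made-up data, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter


def frequency_table(observations):
    """Map each value to (count, proportion, percentage)."""
    counts = Counter(observations)
    n = len(observations)
    return {
        value: (count, count / n, 100 * count / n)
        for value, count in counts.items()
    }


# Hypothetical answers to "Do you believe in life after death?"
responses = ["Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]
table = frequency_table(responses)

print(f"{'Value':<6}{'Count':>6}{'Prop.':>8}{'%':>8}")
for value, (count, proportion, percent) in sorted(table.items()):
    print(f"{value:<6}{count:>6}{proportion:>8.2f}{percent:>8.1f}")
```

The proportions sum to 1 and the percentages sum to 100, which is a quick sanity check on any frequency table.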

17 Graphs for Categorical Variables
Use pie charts and bar graphs to summarize categorical variables. Pie chart: a circle having a “slice of pie” for each category. Bar graph: a graph that displays a vertical bar for each category.

18 Pie Charts
A pie chart summarizes a categorical variable. It is drawn as a circle in which each category is a slice. The size of each slice is proportional to the percentage in that category.

19 Bar Graphs
A bar graph summarizes a categorical variable, with a vertical bar for each category. The height of each bar represents either a count or a percentage. It is easier to compare categories with a bar graph than with a pie chart. Bar graphs are called Pareto charts when the bars are ordered from tallest to shortest.

20 Histograms
A histogram is a graph that uses bars to portray the frequencies or relative frequencies of a quantitative variable. Frequency is always on the vertical axis; the intervals are always on the horizontal axis.

21 Constructing a Histogram
Example: Sodium in Cereals. Divide the range of the data into intervals of equal width. Count the number of observations in each interval.

22 Constructing a Histogram
Label the endpoints of the intervals on the horizontal axis. Draw a bar over each value or interval, with height equal to its frequency (or percentage). Label the axes and title the graph (here: Sodium in Cereals).
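The construction steps (equal-width intervals, then a count per interval) can be sketched as follows. The sodium values are invented for illustration, since the transcript does not include the cereal data, and `histogram_counts` is a hypothetical helper. Each interval is treated as including its left endpoint and excluding its right endpoint, which is one common convention.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width intervals.

    Interval i covers [start + i*width, start + (i+1)*width).
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical sodium content (mg per serving) for 13 cereals.
sodium_mg = [0, 70, 125, 130, 140, 150, 170, 180, 200, 200, 210, 220, 290]
start, width, n_bins = 0, 50, 6
counts = histogram_counts(sodium_mg, start, width, n_bins)

# Text version of the histogram: one row per interval, one '#' per observation.
for i, count in enumerate(counts):
    low = start + i * width
    print(f"{low:>3}-{low + width:<3} | {'#' * count}")
```

Changing `width` changes the apparent shape, so it is worth trying a few interval widths before settling on one.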

23 Interpreting Histograms
Assess where a distribution is centered by finding the median. Assess the spread of the distribution. Describe the shape of the distribution: roughly symmetric (the left and right sides are mirror images), skewed to the right, or skewed to the left.

24 Examples of Skewness

25 Shape: Type of Mound
Examples: electricity demand, or demand for seats in a restaurant, at different times of day; height of 10-year-olds.

26 Outlier An outlier falls far from the rest of the data

27 Time Plots
A time plot displays a time series, that is, data collected over time. It plots each observation on the vertical axis against time on the horizontal axis. Points are usually connected. Example: a time plot for 1995 – 2001 of the number of people globally who use the Internet.

28 Describing the Centre of Quantitative Data

29 Mean
The mean is the sum of the observations divided by the number of observations. It is the center of mass of the data.

30 Median
The median is the midpoint of the observations when they are ordered from least to greatest. First order the observations. If the number of observations is odd, the median is the middle observation (99 in the slide's example). If it is even, the median is the average of the two middle observations (100 in the slide's example).
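The odd/even rule for the median can be written out directly. This is a minimal sketch; the data values are hypothetical and were picked only so that the results come out as 99 and 100, the two values quoted on the slide.

```python
def median(observations):
    """Median by the order-then-pick-the-middle rule."""
    ordered = sorted(observations)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single middle observation.
        return ordered[middle]
    # Even count: average of the two middle observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_sample = [95, 101, 99, 103, 97]          # hypothetical, 5 observations
even_sample = [95, 101, 99, 103, 97, 105]    # hypothetical, 6 observations
print(median(odd_sample))
print(median(even_sample))
```

Python's standard library offers the same behaviour through `statistics.median`.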

31 Comparing the Mean and Median
The mean and median of a symmetric distribution are close; the mean is often preferred there because it uses all the data. In a skewed distribution, however, the mean lies farther out in the skewed tail than the median does; the median is then preferred because it better represents a typical observation.

32 Mode
The mode is the value that occurs most often. It corresponds to the highest bar in the histogram. The mode is most often used with categorical data.

33 Resistant Measures
A measure is resistant if extreme observations (outliers) have little, if any, influence on its value. The median is resistant to outliers; the mean is not. Example: a class of 75 people, of whom 72 were absent for 1 day in the year, 2 were absent for 50 days each, and 1 was absent for 100 days. Median = 1 day; Mean = 272/75 ≈ 3.63 days; Mode = 1 day.
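The absence example can be checked with Python's `statistics` module. This sketch takes the counts exactly as listed (72 people with 1 day, 2 people with 50 days, 1 person with 100 days), which gives a total of 272 days over 75 people. It then drops the three extreme values to show how differently the mean and the median react.

```python
from statistics import mean, median, mode

# Days absent per person, taken from the counts listed on the slide.
absences = [1] * 72 + [50] * 2 + [100]

mean_all = mean(absences)
median_all = median(absences)
mode_all = mode(absences)
print(f"mean = {mean_all:.2f}, median = {median_all}, mode = {mode_all}")

# Same class without the three extreme observations.
without_extremes = [days for days in absences if days < 50]
mean_trimmed = mean(without_extremes)
median_trimmed = median(without_extremes)
print(f"mean = {mean_trimmed:.2f}, median = {median_trimmed}")
```

Removing three of 75 observations pulls the mean down to 1 day while the median does not move, which is what "resistant" means in practice.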

34 Describing the Spread of Quantitative Data

35 Range
Range = max – min. The range is strongly affected by outliers. Example: two teams with the same average (mean) height of 2.0 m. First team: 1.8 m, 1.9 m, 2.0 m, 2.1 m, 2.2 m. Second team: 1.5 m, 1.8 m, 2.1 m, 2.1 m, 2.5 m.
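A short sketch of the two-team comparison, using the ten heights listed on the slide. The names `team_a` and `team_b` are labels introduced here, and the assumption is that the first five heights belong to one team and the last five to the other.

```python
from statistics import mean


def data_range(observations):
    """Range = max - min."""
    return max(observations) - min(observations)


team_a = [1.8, 1.9, 2.0, 2.1, 2.2]   # heights in metres
team_b = [1.5, 1.8, 2.1, 2.1, 2.5]

mean_a, mean_b = mean(team_a), mean(team_b)
range_a, range_b = data_range(team_a), data_range(team_b)

print(f"team A: mean = {mean_a:.1f} m, range = {range_a:.1f} m")
print(f"team B: mean = {mean_b:.1f} m, range = {range_b:.1f} m")
```

Both teams share the same mean, yet the second team's range is much larger because it depends only on its shortest and tallest players.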

36 Properties of Sample Standard Deviation
The sample standard deviation s measures the spread of the data. It is zero only when all observations are the same; otherwise, s > 0. As the spread increases, s gets larger. It has the same units as the observations. It is not resistant: strong skewness or outliers greatly increase s.
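These properties can be checked numerically with `statistics.stdev`, which computes the sample standard deviation (divisor n − 1). The sketch reuses the team heights from the range slide under the assumed labels `team_a` and `team_b`; the other two data sets are constructed for illustration.

```python
from statistics import stdev

team_a = [1.8, 1.9, 2.0, 2.1, 2.2]   # heights in metres
team_b = [1.5, 1.8, 2.1, 2.1, 2.5]
all_same = [2.0] * 5                  # constructed: no spread at all
with_outlier = team_a + [3.5]         # constructed: one extreme value added

s_a = stdev(team_a)
s_b = stdev(team_b)
s_same = stdev(all_same)
s_outlier = stdev(with_outlier)

print(f"s (team A)       = {s_a:.3f} m")
print(f"s (team B)       = {s_b:.3f} m")
print(f"s (all same)     = {s_same:.3f} m")
print(f"s (with outlier) = {s_outlier:.3f} m")
```

The more spread-out team has the larger s, identical observations give s = 0, and a single extreme value inflates s considerably.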

37 How Measures of Position Describe Spread

38 Percentile
The pth percentile is a value such that p percent of the observations fall at or below that value. The slide's figure illustrates the 70th percentile.
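One way to read the definition is to count the share of observations at or below a candidate value. The sketch below does just that; the scores are made up, and `percent_at_or_below` is a hypothetical helper, not a standard library function.

```python
def percent_at_or_below(observations, value):
    """Percentage of observations that are less than or equal to value."""
    at_or_below = sum(1 for x in observations if x <= value)
    return 100 * at_or_below / len(observations)


# Hypothetical exam scores for 10 students.
scores = [52, 55, 61, 64, 68, 70, 73, 79, 85, 93]
share = percent_at_or_below(scores, 73)
print(f"{share:.0f}% of the scores are at or below 73")
```

With these scores, 7 of the 10 observations are at or below 73, so 73 satisfies the definition of a 70th percentile.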

39 Finding Quartiles
Quartiles split the data into four parts with the same number of observations in each part. Arrange the data in order. The median is the second quartile, Q2. Q1 is the median of the lower half of the observations. Q3 is the median of the upper half of the observations.
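The median-of-halves procedure can be sketched as below. One detail the slide leaves open is what to do with the middle observation when the count is odd; this sketch assumes it is left out of both halves, which is a common textbook convention but not the only one (software packages often interpolate instead). The data are hypothetical.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumption: with an odd number of observations, the middle
    observation is excluded from both halves. Needs at least 2 values.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


sample = [7, 1, 15, 3, 9, 20, 4, 12, 6]   # hypothetical, 9 observations
q1, q2, q3 = quartiles(sample)
print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
```

Because conventions differ, quartiles from this function may not match those from a statistics package on small data sets.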

40 Measure of Spread: Quartiles
Quartiles divide a ranked data set into four equal parts: 25% of the data are at or below Q1 and 75% above; 50% of the observations are above the median and 50% are below; 75% of the data are at or below Q3 and 25% above. In the slide's example: Q1 = first quartile = 2.2; M = median = 3.4; Q3 = third quartile = 4.35.

41 Calculating Interquartile Range
The interquartile range is the distance between the third and first quartile, giving spread of middle 50% of the data: IQR = Q3 - Q1

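42 Criteria for Identifying an Outlier
An observation is a potential outlier if it falls more than 1.5 x IQR below the first quartile or more than 1.5 x IQR above the third quartile. Example (the slide's figure marks 25, 50 and 75): with Q1 = 25 and Q3 = 75, IQR = 75 - 25 = 50 and 1.5 x IQR = 75, so an observation is a potential outlier if it is below 25 - 75 = -50 or above 75 + 75 = 150.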
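The 1.5 x IQR criterion reduces to two cut-off values, often called fences. Applying the rule as stated to the slide's quartiles (Q1 = 25, Q3 = 75) gives a lower cut-off of 25 - 1.5 x 50 = -50 and an upper cut-off of 75 + 1.5 x 50 = 150. A minimal sketch, with `outlier_fences` and `is_potential_outlier` as hypothetical helper names:

```python
def outlier_fences(q1, q3):
    """Lower and upper cut-offs from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_potential_outlier(value, q1, q3):
    """True if value lies more than 1.5 x IQR outside the quartiles."""
    low, high = outlier_fences(q1, q3)
    return value < low or value > high


low, high = outlier_fences(25, 75)
print(f"potential outliers: below {low} or above {high}")
print(is_potential_outlier(160, 25, 75))   # beyond the upper cut-off
print(is_potential_outlier(-25, 25, 75))   # still inside the cut-offs
```

The word "potential" matters: the rule only flags observations for a closer look; it does not say they are errors.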

43 Five-Number Summary
The five-number summary of a dataset consists of: minimum value; first quartile; median; third quartile; maximum value.

44 Boxplot
The box goes from Q1 to Q3 (its length is the IQR). A line is drawn inside the box at the median (the middle value). Lines go from the lower end of the box to the smallest observation that’s not a potential outlier, and from the upper end of the box to the largest observation that’s not a potential outlier. Potential outliers are shown separately, often with * or +.
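The numbers needed to draw a boxplot follow from the five-number summary and the 1.5 x IQR rule. The sketch below computes them without plotting. It assumes the median-of-halves quartiles (middle observation excluded when the count is odd), uses invented data, and `boxplot_stats` is a hypothetical helper.

```python
from statistics import median


def boxplot_stats(data):
    """Five-number summary, whisker ends and potential outliers."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half:] if n % 2 == 0 else ordered[half + 1:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "five_number": (ordered[0], q1, median(ordered), q3, ordered[-1]),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


sample = [7, 1, 15, 3, 9, 40, 4, 12, 6]   # hypothetical, one extreme value
stats = boxplot_stats(sample)
print("five-number summary:", stats["five_number"])
print("whiskers run from", stats["whiskers"][0], "to", stats["whiskers"][1])
print("potential outliers:", stats["outliers"])
```

Note that the upper whisker ends at the largest non-outlying observation, not at the maximum, whenever a potential outlier is present.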

45 Comparing Distributions using Boxplots
Boxplots do not display the shape of a distribution as clearly as histograms do, but they are useful for making graphical comparisons of two or more distributions.

