Introduction to Statistics

Objectives:
– Understand certain statistical concepts & terminology
– Describe types of measurement scales
– Differentiate between descriptive and inferential statistics
– Identify measures of central tendency and understand their uses
– Identify measures of dispersion

Introduction Statistics - a set of concepts, rules, and procedures that help us to: – organize numerical information in the form of tables, graphs, and charts; – understand statistical techniques underlying decisions that affect our lives and well-being; and – make informed decisions.

Descriptive Statistics Statistics is a branch of mathematics designed to allow people to accomplish two goals: 1. The first is to accurately describe data and trends in data (descriptive statistics). Descriptive statistics: Collection, classification, analysis, and interpretation of data. - Any method or formula which yields some number and tells us about a set of data is referred to as descriptive statistics.

2. The second is to make predictions about future behavior based on current data (predictive statistics). Predictive statistics: using statistics generated from a sample in order to make predictions. This is also often called inferential statistics.

Terminology Data - facts, observations, and information that come from investigations. There are two types of data: 1. Measurement data sometimes called quantitative data -- the result of using some instrument to measure something (e.g., test score, weight).

2. Categorical data, also referred to as frequency or qualitative data. Things are grouped according to some common property (or properties) and the number of members of each group is recorded (e.g., males/females, vehicle type).

Variable - a property of an object or event that can take on different values. For example, college major is a variable that takes on values like mathematics, computer science, English, psychology, etc. Discrete Variable - a variable with a limited number of values, e.g., gender (male/female) or employee level (junior/senior).

Continuous Variable - a variable that can take on many different values: in theory, any value between the lowest and highest points on the measurement scale. Independent Variable - a variable that is manipulated, measured, or selected by the researcher as an antecedent condition to an observed behavior. In a hypothesized cause-and-effect relationship, the independent variable is the cause and the dependent variable is the outcome or effect.

Dependent Variable - a variable that is not under the experimenter's control -- the data. It is the variable that is observed and measured in response to the independent variable. Qualitative Variable - a variable based on categorical data. Quantitative Variable - a variable based on quantitative data.

Types of Measurement Scales 1. Nominal: For qualitative data with distinct categories that have no inherent order. For example, German, French, and Italian are categories but are not ordered in any way.

2. Ordinal: For data with distinct categories in which an ordering (or ranking) is implied, but the distance between categories is not defined. A good example is the Likert scale that you see on many surveys: 1=Strongly disagree; 2=Disagree; 3=Neutral; 4=Agree; 5=Strongly agree.

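3. Interval: For quantitative data with an ordered scale in which the interval between data values is meaningful but there is no true zero. Temperature in degrees Celsius is a common example: the step from 10 to 20 degrees is the same size as the step from 20 to 30, yet 20 degrees is not twice as hot as 10. Contrast this with a merely ordinal scale such as rank in the military. Clearly a major is higher ranked than a captain, but how much higher? Does a major have twice the authority of a captain? It is impossible to say. You can only say the major is higher ranked.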

4. Ratio: For quantitative data which have an inherently defined zero and the ratio of data values is meaningful. Weight in kilograms is a very good example since it has a definite ratio from one weight to another. 50kg is indeed twice as heavy as 25 kg.

Two Types of Statistics
Descriptive statistics of a POPULATION. Relevant notation (Greek): – $\mu$ mean – $N$ population size – $\Sigma$ sum
Inferential statistics of SAMPLES from a population. – Assumptions are made that the sample reflects the population in an unbiased form. Roman notation: – $\bar{X}$ mean – $n$ sample size – $\Sigma$ sum

Measures of Central Tendency These measures tap into the average distribution of a set of scores or values in the data. – Mean – Median – Mode

What is “Mean”? The “mean” of some data is the average score or value, such as the average age of an MPA student or average weight of professors that like to eat donuts. Mean of a sample: $\bar{X} = (\sum X)/n$. Mean of a population: $\mu = (\sum X)/N$.

Mean The mean is the most common measure of central tendency and the one that can be mathematically manipulated. It is defined as $\sum X / N$: the mean is computed by summing all the scores in the distribution ($\sum X$) and dividing that sum by the total number of scores ($N$).

Mean The mean is the balance point in a distribution: if you subtract the mean from each value in the distribution and sum all of these deviation scores, the result will be zero. Example: 2, 5, 8, 10, 12, 17. Mean = 54/6 = 9. The deviations are -7, -4, -1, 1, 3, 8, and their sum is zero.
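The balance-point property can be checked numerically. A minimal Python sketch using the six example values from this slide (the variable names are illustrative, not from the slides):

```python
scores = [2, 5, 8, 10, 12, 17]

# Mean: sum of the scores divided by the number of scores.
mean = sum(scores) / len(scores)

# Deviation of each score from the mean (score minus mean).
deviations = [x - mean for x in scores]

print(mean)             # 9.0
print(deviations)       # [-7.0, -4.0, -1.0, 1.0, 3.0, 8.0]
print(sum(deviations))  # 0.0
```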

Problem of being “mean” The main problem associated with the mean value of some data is that it is sensitive to outliers (extreme values). Example, the average weight of 10 students might be affected if there was one who weighed 200 kg.

The Median Because the mean can be sensitive to extreme values, the median is sometimes more useful and more representative. The median is simply the middle value among the scores of a variable (there is no standard formula for its value; it is found by its position in the ordered data).
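To see how one extreme value moves the mean but barely moves the median, here is a small Python sketch. The nine "typical" weights are made-up illustrative numbers, not data from the slides; only the 200 kg outlier echoes the example above.

```python
from statistics import mean, median

# Hypothetical weights (kg) of nine students; invented for illustration.
typical = [55, 58, 60, 62, 65, 67, 70, 72, 75]
with_outlier = typical + [200]  # add one student who weighs 200 kg

print(round(mean(typical), 1), median(typical))            # 64.9 65
print(round(mean(with_outlier), 1), median(with_outlier))  # 78.4 66.0
```

The mean jumps by more than 13 kg, while the median shifts by only 1 kg.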

The median The median is the score that divides the distribution into halves; half of the scores are above the median and half are below it when the data are arranged in numerical order. The median is also referred to as the score at the 50th percentile in the distribution.

When we have an odd number of observations, the median location, (N + 1)/2, is an integer that gives the position of the median in the numerically ordered distribution. For example, in the distribution of numbers (3 1 5 4 9 9 8) the median location is the 4th value. When applied to the ordered distribution (1 3 4 5 8 9 9), the value 5 is the median: three scores are above 5 and three are below 5.

The Median If there were only 6 values (1 3 4 5 8 9), the median location is half-way between the 3rd and 4th scores (4 and 5), so the median is 4.5.
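The odd and even cases can be sketched in Python as follows. The function name `median_by_location` is hypothetical, and the location rule (n + 1)/2 is an assumption about the formula the slides refer to; it does match both examples (4th value of 7, half-way between the 3rd and 4th of 6).

```python
def median_by_location(values):
    """Median found by position in the ordered data."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the location (n + 1) / 2 is a whole number (1-based).
        location = (n + 1) // 2
        return ordered[location - 1]
    # Even n: the location falls half-way between two neighbours,
    # so average the two middle values.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median_by_location([3, 1, 5, 4, 9, 9, 8]))  # 5
print(median_by_location([1, 3, 4, 5, 8, 9]))     # 4.5
```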

What is the Median?
Boxer            Weight
Schmuggles       165
Bopsey           213
Pallitto         189
Homer            187
Schnickerson     165
Levin            148
Honkey-Doorey    251
Zingers          308
Boehmer          151
Queenie          132
Googles-Boop     199
Calzone          227
Mean weight: 194.6
Weights in rank order: 132 148 151 165 165 187 189 199 213 227 251 308
Rank order and choose the middle value. If the number of values is even, the median is the average of the two middle values. Here there are 12 values, so the median is the average of the 6th and 7th: (187 + 189)/2 = 188.

The Mode Mode - The mode of a distribution is simply defined as the most frequent or common response or value for a variable. Multiple modes are possible: bimodal or multimodal.

Figuring the Mode
Boxer            Weight
Schmuggles       165
Bopsey           213
Pallitto         189
Homer            187
Schnickerson     165
Levin            148
Honkey-Doorey    251
Zingers          308
Boehmer          151
Queenie          132
Googles-Boop     199
Calzone          227
What is the mode? Answer: 165 (it appears twice; every other weight appears once).
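All three measures of central tendency for the boxer data can be computed with Python's standard `statistics` module. This is a minimal sketch; the dictionary name `weights` is illustrative.

```python
from statistics import mean, median, mode

# Boxer weights from the table above.
weights = {
    "Schmuggles": 165, "Bopsey": 213, "Pallitto": 189, "Homer": 187,
    "Schnickerson": 165, "Levin": 148, "Honkey-Doorey": 251,
    "Zingers": 308, "Boehmer": 151, "Queenie": 132,
    "Googles-Boop": 199, "Calzone": 227,
}
values = list(weights.values())

print(round(mean(values), 1))  # 194.6
print(median(values))          # 188.0 (average of 187 and 189)
print(mode(values))            # 165
```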

Percentiles If we know the median, then we can go up or down and rank the data as being above or below certain thresholds. You may be familiar with percentiles from standardized tests: if you scored at the 90th percentile, your score was higher than 90% of the rest of the sample.

To calculate the kth percentile (where k is any number between zero and one hundred), do the following steps:
1. Order all the values in the data set from smallest to largest.
2. Multiply k percent by the total number of values, n. This number is called the index.
3. If the index obtained in Step 2 is not a whole number, round it up to the nearest whole number and go to Step 4a. If the index obtained in Step 2 is a whole number, go to Step 4b.
4a. (Index is not a whole number) Count the values in your data set from left to right (from the smallest to the largest value) until you reach the number indicated by Step 3. The corresponding value in your data set is the kth percentile.
4b. (Index is a whole number) Count the values in your data set from left to right until you reach the number indicated by Step 2. The kth percentile is the average of that corresponding value in your data set and the value that directly follows it.

For example, suppose you have 25 test scores, and in order from lowest to highest they look like this: 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99. To find the 90th percentile for these (ordered) scores, start by multiplying 90% times the total number of scores, which gives 90% × 25 = 0.90 × 25 = 22.5 (the index). Rounding up to the nearest whole number, you get 23. Counting from left to right (from the smallest to the largest value in the data set), you go until you find the 23rd value in the data set. That value is 98, and it is the 90th percentile for this data set. As a check: 22 of the 25 scores fall below 98, and 22/25 = 0.88.

Now say you want to find the 20th percentile. Start by taking 0.20 x 25 = 5 (the index); this is a whole number, so proceed from Step 3 to Step 4b, which tells you the 20th percentile is the average of the 5th and 6th values in the ordered data set (62 and 66). The 20th percentile then comes to (62 + 66) ÷ 2 = 64. The median (the 50th percentile) for the test scores is the 13th score: 77.
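The percentile steps above can be sketched in Python as follows. The function name `kth_percentile` is hypothetical, and restricting k to values strictly between 0 and 100 is an assumption made here so that the "value that directly follows" in Step 4b always exists.

```python
def kth_percentile(values, k):
    """kth percentile using the round-up / average rule described above."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    ordered = sorted(values)                     # Step 1
    n = len(ordered)
    # Step 2: index = k% of n, kept as whole part and remainder
    # so the "is it a whole number?" test avoids float rounding.
    whole, remainder = divmod(k * n, 100)
    whole = int(whole)
    if remainder != 0:
        # Steps 3 and 4a: round up to position whole + 1 (1-based),
        # which is list index `whole` (0-based).
        return ordered[whole]
    # Step 4b: average the value at the index and the one after it.
    return (ordered[whole - 1] + ordered[whole]) / 2

scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]

print(kth_percentile(scores, 90))  # 98
print(kth_percentile(scores, 20))  # 64.0
print(kth_percentile(scores, 50))  # 77
```

Other textbooks and software packages use slightly different percentile rules (for example, linear interpolation), so results from library functions may differ from this method on small data sets.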

Measures of Dispersion Measures of dispersion tell us about variability in the data. Basic question: how much do values differ for a variable from the minimum to maximum, and distance among scores in between. We use: – Range – Standard Deviation – Variance

Remember that we said that in order to draw information from data, i.e. to make an inference, we need to see variability in our variables. Measures of dispersion tell us how much the values of a variable vary around the mean; if they do not vary, it is difficult to infer anything from the data. Dispersion is also known as the spread or range of variability.

The Range Range = highest - lowest: r = h - l, where h is the highest value and l is the lowest. In other words, the range gives us the distance between the minimum and maximum values of a variable. Understanding this statistic is important in understanding your data, especially for management and diagnostic purposes.

Example: Problem: Cheryl took 7 math tests in one marking period. What is the range of her test scores? 89, 73, 84, 91, 87, 77, 94 Solution: Ordering the test scores from least to greatest, we get: 73, 77, 84, 87, 89, 91, 94 highest - lowest = 94 - 73 = 21
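The same calculation as a short Python sketch (sorting is not actually needed to find the range; `max` and `min` are enough):

```python
scores = [89, 73, 84, 91, 87, 77, 94]

# Range = highest value minus lowest value.
score_range = max(scores) - min(scores)

print(score_range)  # 21
```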

The Standard Deviation A standardized measure of distance from the mean. Very useful and something you do read about when making predictions or other statements about the data.

Standard Deviation most popular and important measure of variability a measure of how far all of the individual scores in the distribution are from a standard (mean)

Standard Deviation – low variability: small SD – high variability: large SD

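Formula for Standard Deviation
$S = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}$
where $\sqrt{\ }$ = square root; $\sum$ = sum (sigma); $X$ = score for each point in data; $\bar{X}$ = mean of scores for the variable; $n$ = sample size (number of observations or cases). This is the usual sample form, with $n - 1$ in the denominator; dividing by $n$ instead gives the population form.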

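Example: Calculate the SD for the following values: 4, 2, 5, 8, 6. 1. Calculate the mean: (4 + 2 + 5 + 8 + 6)/5 = 5. 2. Calculate the deviation from the mean for each value in the sample: 4-5=-1, 2-5=-3, 5-5=0, 8-5=3, 6-5=1. 3. Square each deviation and sum the squares: 1 + 9 + 0 + 9 + 1 = 20. 4. Calculate the standard deviation: using the sample form with $n - 1$ in the denominator, $S = \sqrt{20/(5 - 1)} = \sqrt{5} \approx 2.24$ (dividing by $n = 5$ instead gives $\sqrt{4} = 2$).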
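The worked example can be reproduced step by step in Python. This sketch computes both the sample form (dividing by n - 1) and the population form (dividing by n), since which one the original slide intended is an assumption; variable names are illustrative.

```python
import math

values = [4, 2, 5, 8, 6]
n = len(values)

# Step 1: the mean.
mean = sum(values) / n

# Steps 2 and 3: square each deviation from the mean, then sum.
squared_deviations = [(x - mean) ** 2 for x in values]
sum_of_squares = sum(squared_deviations)

# Step 4: divide and take the square root.
sample_sd = math.sqrt(sum_of_squares / (n - 1))  # sample form
population_sd = math.sqrt(sum_of_squares / n)    # population form

print(sum_of_squares)       # 20.0
print(round(sample_sd, 2))  # 2.24
print(population_sd)        # 2.0
```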

Variance Note that this is the same equation as the standard deviation except that no square root is taken: $S^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$. It is not often reported directly in research but instead is a building block for other statistical methods.

Standard Deviation of the Mean, or the standard error (SE) It is the variation in the means of repeated samples. SE = standard deviation divided by the square root of n: $SE = S/\sqrt{n}$.

Coefficient of Variation It measures variability in relation to the mean (or average). Used to compare the relative dispersion of more than one data set. Data to be compared may be in the same units or in different units, and may have similar or different means. CV = standard deviation divided by the mean, multiplied by 100 to express it as a percentage: $CV = (S/M) \times 100\%$.
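Variance, standard error, and coefficient of variation can all be derived from the standard deviation. A minimal Python sketch using the standard `statistics` module, which uses the sample (n - 1) form for `stdev` and `variance`; the data are the five values from the SD example, reused here as an illustration.

```python
import math
from statistics import mean, stdev, variance

values = [4, 2, 5, 8, 6]

s = stdev(values)                 # sample standard deviation
var = variance(values)            # sample variance (no square root)
se = s / math.sqrt(len(values))   # standard error of the mean
cv = s / mean(values) * 100       # coefficient of variation, in percent

print(f"{var:.2f}")  # 5.00
print(f"{se:.2f}")   # 1.00
print(f"{cv:.2f}")   # 44.72
```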

Organizing and Graphing Data

Goal of Graphing? 1. Presentation of descriptive statistics 2. Presentation of evidence 3. Some people understand subject matter better with visual aids 4. Provide a sense of the underlying data-generating process (data pattern)

Normal Distribution – Most widely used continuous distribution – Also known as the Gaussian distribution – Symmetric

Graphing Data: Histograms

Graphing Data: Bar Graph

Pie Charts:

Line Graphs: A Time Series

Frequency Distribution Table

Properties of a Distribution Shape – symmetric vs. skewed – unimodal vs. multimodal Central Tendency – where most of the data are?? – mean, median, and mode Variability (spread) – how similar the scores are?? – range, variance, and standard deviation

Representing a Distribution Often it is helpful to visually represent distributions in various ways. Graphs – continuous variables (histogram, line graph) – categorical variables (pie chart, bar chart) Tables – frequency distribution table.

Shape of a Distribution Symmetrical (normal) – scores are equally distributed about the central tendency (i.e., mean)

Shape of a Distribution Skewed – extreme high or low scores can skew the distribution in either direction – Negative skew – Positive skew

Shape of a Distribution – Unimodal – Multimodal (minor mode, major mode)

Central Tendency Mode: the most frequent score – good for nominal scales (eye color) Median: the middle score – separates the bottom 50% and the top 50% of the distribution – good for skewed distributions (net worth).

Central Tendency Mean: the arithmetic average – add all of the scores and divide by the total number of scores – This is the preferred measure of central tendency (takes all of the scores into account) – population: $\mu = (\sum X)/N$; sample: $\bar{X} = (\sum X)/n$

Central Tendency Is the mean always the best measure of central tendency? No, skew pulls the mean in the direction of the skew

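Central Tendency and Skew If negative skew: typically Mean < Median < Mode (the mean is pulled toward the low tail)

Central Tendency and Skew If positive skew: typically Mode < Median < Mean (the mean is pulled toward the high tail)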

Normal Distribution Gives us a picture of the variability and central tendency.

Normal Distribution

Standard Deviation In a normal distribution, about 2/3 (roughly 68%) of the scores will fall within +/- 1 standard deviation of the mean. For example, with a mean of 6.4 and SD = 3.27, that interval runs from 3.13 (6.4 - 3.27) to 9.67 (6.4 + 3.27).
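The "about 2/3" figure can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This sketch takes 6.4 as the mean and 3.27 as the SD, which is how the slide's figure appears to be labelled; treat those roles as an assumption.

```python
from statistics import NormalDist

mean, sd = 6.4, 3.27  # assumed mean and SD from the slide's figure
dist = NormalDist(mu=mean, sigma=sd)

low, high = mean - sd, mean + sd

# Share of a normal distribution lying within one SD of the mean.
share = dist.cdf(high) - dist.cdf(low)

print(round(low, 2), round(high, 2))  # 3.13 9.67
print(round(share, 4))                # 0.6827
```

The share within one standard deviation is the same for every normal distribution, whatever its mean and SD.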
