Chapter 1: The What and the Why of Statistics


Chapter 1: The What and the Why of Statistics The Research Process Asking a Research Question The Role of Theory Formulating the Hypotheses Independent & Dependent Variables: Causality Independent & Dependent Variables: Guidelines Collecting Data Levels of Measurement Discrete and Continuous Variables Analyzing Data & Evaluating Hypotheses Descriptive and Inferential Statistics Looking at Social Differences

The Research Process (a cycle with THEORY at its center): Asking the Research Question → Formulating the Hypotheses → Collecting Data → Analyzing Data → Evaluating the Hypotheses → back to Asking the Research Question. Steps along the way: examine a social relationship and study the relevant literature; develop a research design; contribute new evidence to the literature and begin again.

Asking a Research Question What is Empirical Research? Research based on information that can be verified by using our direct experience. To answer research questions we cannot rely on reasoning, speculation, moral judgment, or subjective preference. Empirical: “Are women paid less than men for the same types of work?” Not Empirical: “Is racial equality good for society?”

The Role of Theory A theory is an explanation of the relationship between two or more observable attributes of individuals or groups. Social scientists use theory to attempt to establish a link between what we observe (the data) and our understanding of why certain phenomena are related to each other in a particular way.

Formulating the Hypotheses Hypotheses: tentative answers to research questions (subject to empirical verification). A hypothesis is a statement of a relationship between characteristics that vary (variables). Variable: a property of people or objects that takes on two or more values. A variable must include categories that are both exhaustive and mutually exclusive.

Units of Analysis The level of social life on which social scientists focus (individuals, groups). Examples: Individual as unit of analysis: What are your political views? Family as unit of analysis: Who does the housework? Organization as unit of analysis: What is the gender composition? City as unit of analysis: What was the crime rate last year?

Types of Variables IV → DV. Dependent variable: the variable to be explained (the “effect”). Independent variable: the variable expected to account for (the “cause” of) the dependent variable.

Cause and Effect Relationships Cause and effect relationships between variables are not easy to infer in the social sciences. Causal relationships must meet three criteria: (1) the cause has to precede the effect in time; (2) there has to be an empirical relationship between the cause and the effect; (3) this relationship cannot be explained by other factors.

Guidelines for Independent and Dependent Variables The dependent variable is always the property you are trying to explain; it is always the object of the research. The independent variable usually occurs earlier in time than the dependent variable. The independent variable is often seen as influencing, directly or indirectly, the dependent variable.

Example 1 People who attend church regularly are more likely to oppose abortion than people who do not attend church regularly. Identify the IV and DV. Independent variable: church attendance. Dependent variable: attitudes toward abortion. Identify possible control variables: gender; age; religious affiliation (Catholic, Baptist, Islamic…); political party identification. Are the causal arguments sound? E.g., does party ID affect abortion views, or vice versa?

Example 2 The number of books read to a child per day positively affects a child’s word recognition. Identify the IV and DV. Independent variable: number of books read. Dependent variable: word recognition. Identify possible control variables: gender; older siblings; health status; birth order. Are the causal arguments sound? Most likely. It is hard to construct an argument where a 36-month-old child affects the number of books his or her parent reads to him or her.

Collecting Data (research process diagram repeated)

Collecting Data Researchers must decide three things: How to measure the variables of interest How to select the cases for the research What kind of data collection techniques to use

Levels of Measurement: Nominal, Ordinal, Interval-Ratio. Not every statistical operation can be used with every variable. The type of statistical operations we employ will depend on how our variables are measured. Nominal: means “in name only”; also known as categorical or qualitative (examples: gender, religion, type of company such as manufacturing, retail, or health services). Ordinal: e.g., attitudinal variables (views on abortion). Interval-Ratio: we can ask how much more of X (temperature, income, test scores).

Nominal Level of Measurement Numbers or other symbols are assigned to a set of categories for the purpose of naming, labeling, or classifying the observations. Examples: Political Party (Democrat, Republican) Religion (Catholic, Jewish, Muslim, Protestant) Race (African American, Latino, Native American)

Ordinal Level of Measurement Nominal variables that can be ranked from low to high. Example: Social Class (Upper Class, Middle Class, Working Class)

Interval-Ratio Level of Measurement Variables where measurements for all cases are expressed in the same units. (Variables with a natural zero point, such as height and weight, are called ratio variables.) Examples: Age Income SAT scores

Cumulative Property of Levels of Measurement Variables that can be measured at the interval-ratio level of measurement can also be measured at the ordinal and nominal levels. However, variables that are measured at the nominal and ordinal levels cannot be measured at higher levels.
Level            Different or Equivalent   Higher or Lower   How Much Higher
Nominal          Yes                       No                No
Ordinal          Yes                       Yes               No
Interval-ratio   Yes                       Yes               Yes

Cumulative Property of Levels of Measurement There is one exception, though: dichotomous variables. Because there are only two possible values for a dichotomy, we can measure it at the ordinal or the interval-ratio level (e.g., gender). There is no way to get the two values out of order. This gives the dichotomy more power than other nominal-level variables.

Discrete and Continuous Variables Discrete variables: variables that have a minimum-sized unit of measurement, which cannot be subdivided. Example: the number of children per family. Continuous variables: variables that, in theory, can take on all possible numerical values in a given interval. Example: length.

Analyzing Data: Descriptive and Inferential Statistics Population: The total set of individuals, objects, groups, or events in which the researcher is interested. Sample: A relatively small subset selected from a population. Descriptive statistics: Procedures that help us organize and describe data collected from either a sample or a population. Inferential statistics: The logic and procedures concerned with making predictions or inferences about a population from observations and analyses of a sample.

Analyze Data & Evaluate Hypotheses (research process diagram repeated)

Begin the Process Again... (research process diagram repeated: contribute new evidence to the literature and begin again)

Chapter 2: Organization of Information: Frequency Distributions Proportions and Percentages Percentage Distributions Comparisons The Construction of Frequency Distributions Frequency Distributions for Nominal Variables Frequency Distributions for Ordinal Variables Frequency Distributions for Interval-Ratio Variables Cumulative Distributions Rates Reading the Research Literature Basic Principles Tables with a Different Format

Frequency Distributions A table reporting the number of observations falling into each category of the variable.
Identity                               Frequency (f)
Native American                        947,500
Native American of multiple ancestry   269,700
Native American of Indian descent      5,537,600
Total (N)                              6,754,800

Death Penalty Statutes In 1993, 36 states and Washington, D.C. had statutes permitting capital punishment. Of these 36 states, 27 set a minimum age for execution. Assume you are a member of a legal reform group that is trying to get the states that do not have a minimum age for execution to change their laws. You want to prepare a report describing the minimum age for execution in the 27 states that have an established minimum age for execution. (The data are on the following slides.)

Death Penalty Statutes Source: Kathleen Maguire and Ann L. Pastore, eds., Sourcebook of Criminal Justice Statistics. 1994. U.S. Department of Justice, Bureau of Justice Statistics. Washington, D.C.: U.S. Government Printing Office, 1995, pp. 115-116.

Creating a Frequency Distribution
Minimum Age   Tally          Frequency
14            |              1
15            |              1
16            |||||||||      9
17            ||||           4
18            ||||||||||||   12
Total N                      27

Creating a Frequency Distribution
Minimum Age   Frequency
14            1
15            1
16            9
17            4
18            12
Total N       27
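The tallying step above can be sketched in a few lines of Python. This is only an illustration: the state-by-state list is not reproduced in the transcript, so `minimum_ages` is a hypothetical raw list rebuilt from the frequencies on the slide, and `frequency_distribution` is a helper name chosen for this sketch.

```python
from collections import Counter

# Hypothetical raw data: one minimum age per state, rebuilt from the slide's
# frequencies (the original state-by-state list is not shown here).
minimum_ages = [14] * 1 + [15] * 1 + [16] * 9 + [17] * 4 + [18] * 12


def frequency_distribution(values):
    """Count how many observations fall into each category, sorted by category."""
    counts = Counter(values)
    return dict(sorted(counts.items()))


freq = frequency_distribution(minimum_ages)
total_n = sum(freq.values())

# Print a tally column next to each frequency, as on the slide.
for age, f in freq.items():
    print(f"{age:>3}  {'|' * f:<12}  {f:>2}")
print(f"Total N = {total_n}")
```

`Counter` does the tallying; sorting the categories keeps the table in age order.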

Proportions and Percentages Proportion (P): a relative frequency obtained by dividing the frequency in each category by the total number of cases, $P = f / N$. Percentage (%): a relative frequency obtained by dividing the frequency in each category by the total number of cases and multiplying by 100, $\% = (f / N) \times 100$. Here f is the frequency in the category and N is the total number of cases. Proportions and percentages are relative frequencies.

Proportions and Percentages
Minimum Age   Frequency   Proportion    Percentage
14            1           1/27 = .037   3.7
15            1           .037          3.7
16            9           .333          33.3
17            4           .148          14.8
18            12          .444          44.4
Total N       27          1.0           100.0
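As a minimal sketch of the definitions above, the following Python computes the proportion and percentage for each category of the minimum-age distribution. The function name `relative_frequencies` is an assumption made for illustration.

```python
# Frequencies from the minimum-age table.
freq = {14: 1, 15: 1, 16: 9, 17: 4, 18: 12}


def relative_frequencies(freq):
    """Return {category: (proportion, percentage)} using N = sum of frequencies."""
    n = sum(freq.values())
    return {cat: (f / n, 100 * f / n) for cat, f in freq.items()}


rel = relative_frequencies(freq)
for age, (p, pct) in rel.items():
    print(f"{age}  {freq[age]:>2}  {p:.3f}  {pct:.1f}")
```

Rounding is applied only when printing, so the unrounded proportions still sum to 1.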

Percentage Distributions A table showing the percentage of observations falling into each category of the variable.
Minimum Age   Frequency   Percentage
14            1           3.7
15            1           3.7
16            9           33.3
17            4           14.8
18            12          44.4
Total N       27          100.0

Frequency Distributions for Nominal Variables
Gender      Tallies                     Freq. (f)   Percentage
Male        |||||||||||||||             15          37.5
Female      |||||||||||||||||||||||||   25          62.5
Total (N)                               40          100.0
Note: The categories for nominal variables (male, female) need not be listed in any particular order.

Frequency Distributions for Ordinal Variables
Happiness       Tallies                     Freq. (f)   Percentage
Very Happy      |||||||||                   9           22.5
Pretty Happy    |||||||||||||||||||||||||   25          62.5
Not too happy   ||||||                      6           15.0
Total (N)                                   40          100.0
Note: Because the categories or values of ordinal variables are rank-ordered, they must be listed in a way that reflects their rank, from the lowest to the highest or from the highest to the lowest.

Employment Status Example

Employment Status Example

Frequency Distributions for Interval-Ratio Variables
Number of Children   Freq. (f)   Percentage
0                    5           12.5
1                    10          25.0
2                    10          25.0
3                    5           12.5
4                    5           12.5
5                    1           2.5
6                    2           5.0
7 or more            2           5.0
Total (N)            40          100.0

Cumulative Distributions Sometimes we are interested in locating the relative position of a given score in a distribution. Cumulative frequency distribution: a distribution showing the frequency at or below each category (class interval or score) of the variable. Cumulative percentage distribution: a distribution showing the percentage at or below each category (class interval or score) of the variable.

Cumulative Frequency Distribution
Minimum Age   Freq. (f)   Percentage   Cumulative Frequency
14            1           3.7          1
15            1           3.7          2
16            9           33.3         11
17            4           14.8         15
18            12          44.4         27
Total (N)     27          100.0*
* Percentages do not total to 100% due to rounding

Cumulative Percentage Distribution
Minimum Age   Frequency   Percentage   Cumulative Percentage
14            1           3.7          3.7
15            1           3.7          7.4
16            9           33.3         40.7
17            4           14.8         55.5
18            12          44.4         99.9*
Total N       27          100.0
* Does not total to 100% due to rounding
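The cumulative distributions above can be sketched as a running total. In this illustration (the function name `cumulative_distribution` is an assumption), the cumulative percentage is computed from the cumulative frequency rather than by adding already-rounded percentages, so it does not pick up the rounding drift that produces 55.5 and 99.9 on the slide.

```python
from itertools import accumulate

# Frequencies from the minimum-age table.
freq = {14: 1, 15: 1, 16: 9, 17: 4, 18: 12}


def cumulative_distribution(freq):
    """Return rows of (category, frequency, cumulative frequency, cumulative %)."""
    n = sum(freq.values())
    categories = sorted(freq)
    running = accumulate(freq[c] for c in categories)
    return [(c, freq[c], cf, 100 * cf / n) for c, cf in zip(categories, running)]


table = cumulative_distribution(freq)
for age, f, cum_f, cum_pct in table:
    print(f"{age}  {f:>2}  {cum_f:>2}  {cum_pct:5.1f}")
```

With this approach the last row reaches exactly 100.0 percent, and the row for age 17 prints 55.6 rather than the slide's 55.5.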

Rates A number obtained by dividing the number of actual occurrences in a given time period by the number of possible occurrences. What’s the problem with the “rate” computation below?
Marriage rate, 1990 = Number of marriages in 1990 / Total population in 1990
Marriage rate, 1990 = 2,448,000 marriages / 250,000,000 Americans
Marriage rate, 1990 = .0098
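A small sketch of the rate computation on this slide, using the slide's own figures. The `rate` helper and its `per` parameter are assumptions for illustration. Note that the denominator here is the total population, exactly as on the slide; whether that is really the "number of possible occurrences" is the question the slide poses.

```python
def rate(occurrences, possible_occurrences, per=1):
    """Actual occurrences divided by possible occurrences, scaled by `per`."""
    if possible_occurrences <= 0:
        raise ValueError("possible_occurrences must be positive")
    return per * occurrences / possible_occurrences


# Figures from the slide: marriages in 1990 over total population in 1990.
marriage_rate = rate(2_448_000, 250_000_000)
marriages_per_1000 = rate(2_448_000, 250_000_000, per=1000)

print(f"{marriage_rate:.4f}")       # as a proportion
print(f"{marriages_per_1000:.1f}")  # per 1,000 people
```

Swapping in a different denominator (for example, a count of people eligible to marry, which the slide does not supply) only requires changing the second argument.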

Reading Statistical Tables Basic principles for understanding what the researcher is trying to tell you: What is the source of the table? How many variables are presented? What are their names? What is represented by the numbers presented in the first column? In the second column?

Chapter 3: Graphic Presentation The Pie Chart The Bar Graph The Statistical Map The Histogram Statistics in Practice The Frequency Polygon Time Series Charts Distortions in Graphs It is important to choose the appropriate graphs to make statistical information coherent.

The Pie Chart: The Race and Ethnicity of the Elderly Pie chart: a graph showing the differences in frequencies or percentages among categories of a nominal or an ordinal variable. The categories are displayed as segments of a circle whose pieces add up to 100 percent of the total frequencies.

Too many categories can be messy! 2.8% .8% .6% .5% 8.3% 87.7% N = 35,919,174 Figure 3.1 Annual Estimates of U.S. Population 65 Years and Over by Race, 2003

We can reduce some of the categories 4% 8.3% 87.7% N = 35,919,174 Figure 3.2 Annual Estimates of U.S. Population 65 Years and Over, 2003

The Bar Graph: The Living Arrangements and Labor Force Participation of the Elderly Bar graph: a graph showing the differences in frequencies or percentages among categories of a nominal or an ordinal variable. The categories are displayed as rectangles of equal width with their height proportional to the frequency or percentage of the category.

N=13,886,000 Figure 3.3 Living Arrangements of Males (65 and Older) in the United States, 2000

Can display more info by splitting sex Figure 3.4 Living Arrangement of U.S. Elderly (65 and Older) by Gender, 2003

Figure 3.5 Percent of Men and Women 55 Years and Over in the Civilian Labor Force, 2002

The Statistical Map: The Geographic Distribution of the Elderly We can display dramatic geographical changes in American society by using a statistical map. Maps are especially useful for describing geographical variations in variables, such as population distribution, voting patterns, crime rates, or labor force participation.

The Histogram Histogram: a graph showing the differences in frequencies or percentages among categories of an interval-ratio variable. The categories are displayed as contiguous bars, with width proportional to the width of the category and height proportional to the frequency or percentage of that category.

Figure 3.7 Age Distribution of U.S. Population 65 Years and Over, 2000

The following two slides are applications of the histogram. They examine, by gender, age distribution patterns in the U.S. population for 1955 and 2010 (projected). Notice that in both figures, age groups are arranged along the vertical axis, whereas the frequencies (in millions of people) are along the horizontal axis. Each age group is classified by males on the left and females on the right. Because this type of histogram reflects age distribution by gender, it is also called an age-sex pyramid.

The Frequency Polygon Frequency polygon: a graph showing the differences in frequencies or percentages among categories of an interval-ratio variable. Points representing the frequencies of each category are placed above the midpoint of the category and are joined by a straight line.

Source: Adapted from U.S. Bureau of the Census, Center for International Research, International Data Base, 2003. Figure 3.11 Population of Japan, Age 55 and Over, 2000, 2010, and 2020

Time Series Charts Time series chart: a graph displaying changes in a variable at different points in time. It shows time (measured in units such as years or months) on the horizontal axis and the frequencies (percentages or rates) of another variable on the vertical axis.

Source: Federal Interagency Forum on Aging Related Statistics, Older Americans 2004: Key Indicators of Well Being, 2004. Figure 3.12 Percentage of Total U.S. Population 65 Years and Over, 1900 to 2050

Source: U.S. Bureau of the Census, “65+ in America,” Current Population Reports, 1996, Special Studies, P23-190, Table 6-1. Figure 3.13 Percentage Currently Divorced Among U.S. Population 65 Years and Over, by Gender, 1960 to 2040

Distortions in Graphs Graphs not only quickly inform us; they can quickly deceive us. Because we are often more interested in general impressions than in detailed analyses of the numbers, we are more vulnerable to being swayed by distorted graphs. What are graphical distortions? How can we recognize them?

Shrinking and Stretching the Axes: Visual Confusion Probably the most common distortions in graphical representations occur when the distance along the vertical or horizontal axis is altered in relation to the other axis. Axes can be stretched or shrunk to create any desired result.

Shrinking and Stretching the Axes: Visual Confusion

Distortions with Picture Graphs Another way to distort data with graphs is to use pictures to represent quantitative information. The problem with picture graphs is that the visual impression received is created by the picture’s total area rather than by its height (the graphs we have discussed so far rely heavily on height).

Statistics in Practice The following graphs are particularly suitable for making comparisons among groups: - Bar chart - Frequency polygon - Time series chart

Source: Smith, 2003. This bar chart compares elderly males and females who live alone by age, gender, and race or Hispanic origin. It shows that the percentage of elderly who live alone varies not only by age but also by both race and gender. Figure 3.17 Percentage of College Graduates among People 55 years and over by age and sex, 2002

Source: Stoops, Nicole. 2004. “Educational Attainment in the United States: 2003.” Current Population Reports, P20-550. Washington D.C.: U.S. Government Printing Office. This frequency polygon compares years of school completed by black Americans age 25 to 64 and 65 years and older with that of all Americans in the same age groups. Figure 3.18 Years of School Completed in the United States by Race and Age, 2003

Why use charts and graphs?
What do you lose?
- the ability to examine the numeric detail offered by a table
- potentially, the ability to see additional relationships within the data
- potentially, time: often we get caught up in selecting colors and formatting charts when a simply formatted table is sufficient
What do you gain?
- the ability to direct readers’ attention to one aspect of the evidence
- the ability to reach readers who might otherwise be intimidated by the same data in a tabular format
- the ability to focus on the bigger picture rather than perhaps minor technical details
In-class exercise: students pair up, construct a chart based on a table from the text or handed out in class, and then answer the two questions above.

Chapter 4: Measures of Central Tendency What is a measure of central tendency? Measures of Central Tendency Mode Median Mean Shape of the Distribution Considerations for Choosing an Appropriate Measure of Central Tendency

What is a measure of Central Tendency? Numbers that describe what is average or typical of the distribution. You can think of this value as where the middle of a distribution lies.

The Mode The category or score with the largest frequency (or percentage) in the distribution. The mode can be calculated for variables with levels of measurement that are: nominal, ordinal, or interval-ratio.

The Mode: An Example Example: Number of Votes for Candidates for Mayor. The mode, in this case, gives you the “central” response of the voters: the most popular candidate.
Candidate A: 11,769 votes
Candidate B: 39,443 votes
Candidate C: 78,331 votes
The Mode: “Candidate C”

The Median The score that divides the distribution into two equal parts, so that half the cases are above it and half below it. The median is the middle score, or average of middle scores in a distribution.

Median Exercise #1 (N is odd) Calculate the median for this hypothetical distribution:
Job Satisfaction   Frequency
Very High          2
High               3
Moderate           5
Low                7
Very Low           4
TOTAL              21

Median Exercise #2 (N is even) Calculate the median for this hypothetical distribution:
Satisfaction with Health   Frequency
Very High                  5
High                       7
Moderate                   6
Low                        7
Very Low                   3
TOTAL                      28
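One way to work the two median exercises is to walk down the rank-ordered categories, accumulating frequencies until the middle case (or the two middle cases, when N is even) is reached. The sketch below does this; `median_category` is a hypothetical helper name, and the approach is an illustration rather than the textbook's own procedure.

```python
def median_category(ordered_freq):
    """Return the category (or categories) holding the middle case(s).

    ordered_freq: list of (category, frequency) pairs in rank order.
    """
    n = sum(f for _, f in ordered_freq)
    if n == 0:
        raise ValueError("distribution has no cases")
    # 1-based positions of the middle case(s): one if N is odd, two if even.
    positions = [(n + 1) // 2] if n % 2 == 1 else [n // 2, n // 2 + 1]
    found = []
    for pos in positions:
        running = 0
        for category, f in ordered_freq:
            running += f
            if pos <= running:
                found.append(category)
                break
    # Collapse duplicates when both middle cases fall in the same category.
    return list(dict.fromkeys(found))


job_satisfaction = [("Very High", 2), ("High", 3), ("Moderate", 5),
                    ("Low", 7), ("Very Low", 4)]
health_satisfaction = [("Very High", 5), ("High", 7), ("Moderate", 6),
                       ("Low", 7), ("Very Low", 3)]

print(median_category(job_satisfaction))     # middle case is the 11th of 21
print(median_category(health_satisfaction))  # middle cases are the 14th and 15th of 28
```

For Exercise #1 the 11th case falls in "Low"; for Exercise #2 both the 14th and 15th cases fall in "Moderate".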

Finding the Median in Grouped Data

Percentiles A score below which a specific percentage of the distribution falls. Finding percentiles in grouped data:

The Mean The arithmetic average obtained by adding up all the scores and dividing by the total number of scores.

Formula for the Mean $\bar{Y} = \sum Y / N$: “Y bar” equals the sum of all the scores, Y, divided by the number of scores, N.

Calculating the mean with grouped scores $\bar{Y} = \sum fY / N$, where fY = a score multiplied by its frequency

Mean: Grouped Scores

Mean: Grouped Scores

Grouped Data: the Mean & Median Calculate the median and mean for the grouped frequency below. Number of People Age 18 or older living in a U.S. Household in 1996 (GSS 1996)
Number of People   Frequency
1                  190
2                  316
3                  54
4                  17
5                  2
6                  2
TOTAL              581
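A hedged sketch of this exercise: the mean multiplies each score by its frequency, sums the products, and divides by N, and the median locates the middle case in the cumulative frequencies. The function names are assumptions for illustration.

```python
# Grouped frequencies from the household table (GSS 1996).
household = {1: 190, 2: 316, 3: 54, 4: 17, 5: 2, 6: 2}


def grouped_mean(freq):
    """Sum of (score * frequency) divided by the total number of cases."""
    n = sum(freq.values())
    return sum(score * f for score, f in freq.items()) / n


def grouped_median(freq):
    """Score of the middle case, or the average of the two middle cases."""
    n = sum(freq.values())

    def score_at(position):
        running = 0
        for score in sorted(freq):
            running += freq[score]
            if position <= running:
                return score
        raise ValueError("position out of range")

    # For odd N both positions are the same case; for even N they are adjacent.
    return (score_at((n + 1) // 2) + score_at(n // 2 + 1)) / 2


print(f"mean   = {grouped_mean(household):.2f}")
print(f"median = {grouped_median(household):.1f}")
```

Here the sum of score-times-frequency is 1,074 over N = 581, and the 291st case falls in the category of 2 people.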

Shape of the Distribution Symmetrical (mean is about equal to median). Skewed: negatively (example: years of education), mean < median; positively (example: income), mean > median. Bimodal (two distinct modes). Multi-modal (more than 2 distinct modes).

Distribution Shape

Considerations for Choosing a Measure of Central Tendency For a nominal variable, the mode is the only measure that can be used. For ordinal variables, the mode and the median may be used. The median provides more information (taking into account the ranking of categories.) For interval-ratio variables, the mode, median, and mean may all be calculated. The mean provides the most information about the distribution, but the median is preferred if the distribution is skewed.

Central Tendency

Chapter 5: Measures of Variability The Importance of Measuring Variability The Range IQR (Inter-Quartile Range) Variance Standard Deviation Considerations for choosing a measure of variation

The Importance of Measuring Variability Central tendency - Numbers that describe what is typical or average (central) in a distribution Measures of Variability - Numbers that describe diversity or variability in the distribution. These two types of measures together help us to sum up a distribution of scores without looking at each and every score. Measures of central tendency tell you about typical (or central) scores. Measures of variation reveal how far from the typical or central score that the distribution tends to vary.

Notice that both distributions have the same mean, yet they are shaped differently

The Range Range = highest score - lowest score Range – A measure of variation in interval-ratio variables. It is the difference between the highest (maximum) and the lowest (minimum) scores in the distribution. Range is a good thing to look at to make sure your data are as you expect them to be.

Inter-Quartile Range Inter-Quartile Range (IQR) – A measure of variation for interval-ratio data. It indicates the width of the middle 50 percent of the distribution and is defined as the difference between the lower and upper quartiles (Q1 and Q3). IQR = Q3 – Q1
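The range and IQR definitions above can be sketched with Python's standard library. The two score lists are hypothetical, chosen only so that both have the same range but different IQRs. Quartile conventions vary between textbooks and software; this sketch uses `statistics.quantiles` with `method="inclusive"`, which may not match the hand method used in class.

```python
import statistics


def value_range(scores):
    """Highest score minus lowest score."""
    return max(scores) - min(scores)


def interquartile_range(scores):
    """Q3 minus Q1, using the 'inclusive' quartile convention (an assumption)."""
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q3 - q1


# Hypothetical data: same minimum and maximum, very different spread in the middle.
clustered = [1, 5, 5, 5, 5, 5, 5, 5, 9]
spread_out = [1, 2, 3, 4, 5, 6, 7, 8, 9]

for name, scores in (("clustered", clustered), ("spread out", spread_out)):
    print(name, value_range(scores), interquartile_range(scores))
```

Both lists have a range of 8, but the IQR is 0 for the clustered scores and 4 for the spread-out ones, which is why the IQR can be more informative than the range.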

The difference between the Range and IQR (figure comparing two distributions): in one, the values fall together closely; the other shows greater variability. Yet the ranges are equal! This is the importance of the IQR.

The Box Plot The Box Plot is a graphic device that visually presents the following elements: the range, the IQR, the median, the quartiles, the minimum (lowest value), and the maximum (highest value).

Variance Variance – A measure of variation for interval-ratio variables; it is the average of the squared deviations from the mean

Standard Deviation Standard Deviation – A measure of variation for interval-ratio variables; it is equal to the square root of the variance.
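A minimal sketch of the variance and standard deviation as defined above. The scores are hypothetical. The slide describes the variance as the average of the squared deviations (dividing by N); many texts divide by N - 1 when working with a sample, so the `sample` flag below is an assumption added to show both conventions.

```python
import math


def variance(scores, sample=False):
    """Average squared deviation from the mean (divide by N, or N - 1 if sample)."""
    n = len(scores)
    mean = sum(scores) / n
    sum_of_squares = sum((y - mean) ** 2 for y in scores)
    return sum_of_squares / (n - 1 if sample else n)


def standard_deviation(scores, sample=False):
    """Square root of the variance."""
    return math.sqrt(variance(scores, sample))


# Hypothetical scores with mean 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance(scores), standard_deviation(scores))
print(round(variance(scores, sample=True), 3))
```

For these scores the squared deviations sum to 32, so dividing by N = 8 gives a variance of 4 and a standard deviation of 2.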

Find the Mean and the Standard Deviation

Considerations for Choosing a Measure of Variability For nominal variables, you can only use the IQV (Index of Qualitative Variation). For ordinal variables, you can calculate the IQV or the IQR (Inter-Quartile Range), though the IQR provides more information about the variable. For interval-ratio variables, you can use the IQV, IQR, or variance/standard deviation. The standard deviation (also variance) provides the most information, since it uses all of the values in the distribution in its calculation.

Chapter 6: Relationships Between Two Variables: Cross-Tabulation Independent and Dependent Variables Constructing a Bivariate Table Computing Percentages in a Bivariate Table Dealing with Ambiguous Relationships Between Variables Reading the Research Literature Properties of a Bivariate Relationship Elaboration Statistics in Practice

Introduction Bivariate Analysis: A statistical method designed to detect and describe the relationship between two variables. Cross-Tabulation: A technique for analyzing the relationship between two variables that have been organized in a table.

Understanding Independent and Dependent Variables Example: If we hypothesize that English proficiency varies by whether person is native born or foreign born, what is the independent variable, and what is the dependent variable? Independent: nativity Dependent: English proficiency

Constructing a Bivariate Table Bivariate table: A table that displays the distribution of one variable across the categories of another variable. Column variable: A variable whose categories are the columns of a bivariate table. Row variable: A variable whose categories are the rows of a bivariate table. Cell: The intersection of a row and a column in a bivariate table. Marginals: The row and column totals in a bivariate table.

Percentages Can Be Computed in Different Ways: Column Percentages: column totals as base Row Percentages: row totals as base

Support for Abortion by Job Security: Absolute Frequencies
Abortion       Job Find Easy   Job Find Not Easy   Row Total
Yes            24              25                  49
No             20              26                  46
Column Total   44              51                  95

Support for Abortion by Job Security: Column Percentages
Abortion       Job Find Easy   Job Find Not Easy   Row Total
Yes            55%             49%                 52%
No             45%             51%                 48%
Column Total   100% (44)       100% (51)           100% (95)

Support for Abortion by Job Security: Row Percentages
Abortion        Job Find Easy    Job Find Not Easy    Row Total
Yes             49%              51%                  100% (49)
No              43%              57%                  100% (46)
Column Total    46%              54%                  100% (95)
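
The column and row percentages can be reproduced from the absolute frequencies. The following is a minimal Python sketch; the function and variable names are illustrative and are not part of the original slides.

```python
# Absolute frequencies from the "Support for Abortion by Job Security" table.
# Rows: support for abortion (Yes, No); columns: Job Find Easy, Job Find Not Easy.
counts = {
    "Yes": {"Easy": 24, "Not Easy": 25},
    "No": {"Easy": 20, "Not Easy": 26},
}

def column_percentages(table):
    """Percentage each cell within its column (column totals as the base)."""
    columns = list(next(iter(table.values())))
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {r: {c: round(100 * row[c] / totals[c]) for c in columns}
            for r, row in table.items()}

def row_percentages(table):
    """Percentage each cell within its row (row totals as the base)."""
    return {r: {c: round(100 * n / sum(row.values())) for c, n in row.items()}
            for r, row in table.items()}

col_pct = column_percentages(counts)
row_pct = row_percentages(counts)
print(col_pct)
print(row_pct)
```

Rounding to whole percentages mirrors the slides; keeping one decimal place would be an equally reasonable choice.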

Properties of a Bivariate Relationship Does there appear to be a relationship? How strong is it? What is the direction of the relationship?

Existence of a Relationship IV: Number of Traumas DV: Support for Abortion If the number of traumas were unrelated to attitudes toward abortion among women, then we would expect to find equal percentages of women who are pro-choice (or anti-choice), regardless of the number of traumas experienced.

Existence of the Relationship

Determining the Strength of the Relationship A quick method is to examine the percentage difference across the different categories of the independent variable. The larger the percentage difference across the categories, the stronger the association. We rarely see a situation with either a 0 percent or a 100 percent difference.
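
As a small illustration of the percentage-difference method, the sketch below uses the column percentages of "Yes" responses from the job-security table shown earlier in these slides. The variable names are illustrative, and how large a difference counts as "strong" remains a judgment call.

```python
# Column percentages of "Yes" responses from the job-security table in these slides.
pct_yes_easy = 55      # Job Find Easy
pct_yes_not_easy = 49  # Job Find Not Easy

# Percentage-point difference across the categories of the independent variable.
difference = pct_yes_easy - pct_yes_not_easy
print(difference)
```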

Direction of the Relationship Positive relationship: A bivariate relationship between two variables measured at the ordinal level or higher in which the variables vary in the same direction. Negative relationship: A bivariate relationship between two variables measured at the ordinal level or higher in which the variables vary in opposite directions.

A Positive Relationship

A Negative Relationship

Elaboration Elaboration is a process designed to further explore a bivariate relationship; it involves the introduction of control variables. A control variable is an additional variable considered in a bivariate relationship. The variable is controlled for when we take into account its effect on the variables in the bivariate relationship.

Three Goals of Elaboration Elaboration allows us to test for non-spuriousness. Elaboration clarifies the causal sequence of bivariate relationships by introducing variables hypothesized to intervene between the IV and DV. Elaboration specifies the different conditions under which the original bivariate relationship might hold.

Testing for Nonspuriousness Direct causal relationship: a bivariate relationship that cannot be accounted for by other theoretically relevant variables. Spurious relationship: a relationship in which both the IV and DV are influenced by a causally prior control variable and there is no causal link between them. The relationship between the IV and DV is said to be “explained away” by the control variable.

The Bivariate Relationship Between Number of Firefighters and Property Damage: Number of Firefighters (IV) → Property Damage (DV)

Process of Elaboration Partial tables: bivariate tables that display the relationship between the IV and DV while controlling for a third variable. Partial relationship: the relationship between the IV and DV shown in a partial table.

The Process of Elaboration Divide the observations into subgroups on the basis of the control variable. We have as many subgroups as there are categories in the control variable. Re-examine the relationship between the original two variables separately for the control variable subgroups. Compare the partial relationships with the original bivariate relationship for the total group.
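
The steps above can be sketched in Python. The records below are invented purely for illustration, and using the size of the fire as the control variable is an assumption; the slides do not give data for this example.

```python
from collections import Counter, defaultdict

# Hypothetical records, invented for illustration: (control, iv, dv).
# control = size of fire (assumed), iv = number of firefighters, dv = property damage.
records = [
    ("small", "few", "low"), ("small", "few", "low"), ("small", "many", "low"),
    ("large", "many", "high"), ("large", "many", "high"), ("large", "few", "high"),
]

def partial_tables(rows):
    """Build one IV-by-DV frequency table per category of the control variable."""
    tables = defaultdict(Counter)
    for control, iv, dv in rows:
        tables[control][(iv, dv)] += 1
    return tables

tables = partial_tables(records)
for control, table in tables.items():
    print(control, dict(table))
```

There are as many partial tables as there are categories of the control variable (two here). Each one is then compared with the original bivariate table for the whole group.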

Intervening Relationship Intervening variable: a control variable that follows an independent variable but precedes the dependent variable in a causal sequence. Intervening relationship: a relationship in which the control variable intervenes between the independent and dependent variables.

Intervening Relationship: Example Religion (IV) → Preferred Family Size (Intervening Control Variable) → Support for Abortion (DV)

Conditional Relationships Conditional relationship: a relationship in which the control variable’s effect on the dependent variable is conditional on its interaction with the independent variable. The relationship between the independent and dependent variables will change according to the different conditions of the control variable.

Conditional Relationships Another way to describe a conditional relationship is to say that there is a statistical interaction between the control variable and the independent variable.

Conditional Relationships

Conditional Relationships

Chapter 7: Measures of Association for Nominal and Ordinal Variables Proportional Reduction of Error (PRE) Degree of Association For Nominal Variables Lambda For Ordinal Variables Gamma Using Gamma for Dichotomous Variables

Measures of Association Measure of association—a single summarizing number that reflects the strength of a relationship, indicates the usefulness of predicting the dependent variable from the independent variable, and often shows the direction of the relationship.

Take your best guess? If you know nothing else about a person except that he or she lives in the United States and I asked you to guess his or her race/ethnicity, what would you guess? The most common race/ethnicity for U.S. residents (i.e., the mode)! Now, if we know that this person lives in San Diego, California, would you change your guess? With quantitative analyses we are generally trying to predict, or take our best guess at, the value of the dependent variable. One way to assess the relationship between two variables is to consider the degree to which the extra information of the independent variable makes your guess better.

Proportional Reduction of Error (PRE) PRE—the concept that underlies the definition and interpretation of several measures of association. PRE measures are derived by comparing the errors made in predicting the dependent variable while ignoring the independent variable with errors made when making predictions that use information about the independent variable.

Proportional Reduction of Error (PRE) PRE = (E1 – E2) / E1 where: E1 = errors of prediction made when the independent variable is ignored E2 = errors of prediction made when the prediction is based on the independent variable

Two PRE Measures: Lambda & Gamma Appropriate for… Lambda NOMINAL variables Gamma ORDINAL & DICHOTOMOUS NOMINAL variables

Lambda Lambda: an asymmetrical measure of association suitable for use with nominal variables. It may range from 0.0 (meaning the extra information provided by the independent variable does not help prediction) to 1.0 (meaning use of the independent variable results in no prediction errors). It provides us with an indication of the strength of an association between the independent and dependent variables. A lower value represents a weaker association, while a higher value is indicative of a stronger association.

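Lambda Lambda = (E1 – E2) / E1 where: E1 = Ntotal – Nmode of dependent variable E2 = the sum, over the categories of the independent variable, of (N category total – N category mode)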

Example 1: 2000 Vote By Abortion Attitudes
Table 7.2 2000 Presidential Vote by Abortion Attitudes
                Abortion Attitudes (for any reason)
Vote            Yes      No       Row Total
Gore            46       39       85
Bush            41       73       114
Total           87       112      199
Source: General Social Survey, 2002
Step One: Add percentages to the table to get the data in a format that allows you to clearly assess the nature of the relationship.

Example 1: 2000 Vote By Abortion Attitudes
Table 7.2 2000 Presidential Vote by Abortion Attitudes
                Abortion Attitudes (for any reason)
Vote            Yes            No             Row Total
Gore            52.9% (46)     34.8% (39)     42.7% (85)
Bush            47.1% (41)     65.2% (73)     57.3% (114)
Total           100% (87)      100% (112)     100% (199)
Source: General Social Survey, 2002
Now calculate E1: E1 = Ntotal – Nmode = 199 – 114 = 85

Example 1: 2000 Vote By Abortion Attitudes Now calculate E2 from the counts in Table 7.2: E2 = [N(Yes column total) – N(Yes column mode)] + [N(No column total) – N(No column mode)] = [87 – 46] + [112 – 73] = 41 + 39 = 80

Example 1: 2000 Vote By Abortion Attitudes Lambda = [E1 – E2] / E1 = [85 – 80] / 85 = .06

Example 1: 2000 Vote By Abortion Attitudes Lambda = .06 So, we know that taking the voter’s attitude towards abortion into account reduces the errors in predicting the 2000 presidential vote by six percent.
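
The lambda calculation for Table 7.2 can be checked with a short Python sketch. The function name and the dictionary layout (independent-variable categories as outer keys) are illustrative choices, not something the slides prescribe.

```python
def lambda_pre(table):
    """Lambda for a table given as {iv_category: {dv_category: count}}."""
    dv_categories = list(next(iter(table.values())))
    n_total = sum(sum(col.values()) for col in table.values())
    dv_totals = {d: sum(col[d] for col in table.values()) for d in dv_categories}
    # E1: errors when always guessing the overall mode of the dependent variable.
    e1 = n_total - max(dv_totals.values())
    # E2: errors when guessing the mode within each category of the independent variable.
    e2 = sum(sum(col.values()) - max(col.values()) for col in table.values())
    return e1, e2, (e1 - e2) / e1

# Counts from Table 7.2: abortion attitude (IV) by 2000 presidential vote (DV).
vote_by_attitude = {
    "Yes": {"Gore": 46, "Bush": 41},
    "No": {"Gore": 39, "Bush": 73},
}
e1, e2, lam = lambda_pre(vote_by_attitude)
print(e1, e2, round(lam, 2))
```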

EXAMPLE 2: Victim-Offender Relationship and Type of Crime: 1993 Step One: Add percentages to the table to get the data in a format that allows you to clearly assess the nature of the relationship. *Source: Kathleen Maguire and Ann L. Pastore, eds., Sourcebook of Criminal Justice Statistics 1994, U.S. Department of Justice, Bureau of Justice Statistics, Washington, D.C.: USGPO, 1995, p. 343.

Victim-Offender Relationship & Type of Crime: 1993 Now calculate E1 E1 = Ntotal – Nmode = 9,898,980 – 5,045,040 = 4,853,940

Victim-Offender Relationship and Type of Crime: 1993 Now calculate E2 E2 = [N(rape/sexual assault column total) – N(rape/sexual assault column mode)] + [N(robbery column total) – N(robbery column mode)] + [N(assault column total) – N(assault column mode)] = [472,760 – 350,670] + [1,161,900 – 930,860] + [8,264,320 – 4,272,230] = 4,345,220

Victim-Offender Relationship and Type of Crime: 1993 Lambda = [E1 – E2] / E1 = [4,853,940 – 4,345,220] / 4,853,940 = .10 So, we know that ten percent of the errors in predicting the victim-offender relationship (stranger vs. non-stranger) can be reduced by taking into account the type of crime that was committed.

Asymmetrical Measure of Association A measure whose value may vary depending on which variable is considered the independent variable and which the dependent variable. Lambda is an asymmetrical measure of association.
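
To illustrate the asymmetry, the sketch below computes lambda twice from the Table 7.2 counts, once with abortion attitude as the independent variable and once with vote as the independent variable. The function name and data layout are illustrative assumptions.

```python
def lambda_pre(table):
    """Lambda, with outer keys as IV categories and inner keys as DV categories."""
    dv_categories = list(next(iter(table.values())))
    n_total = sum(sum(col.values()) for col in table.values())
    e1 = n_total - max(sum(col[d] for col in table.values()) for d in dv_categories)
    e2 = sum(sum(col.values()) - max(col.values()) for col in table.values())
    return (e1 - e2) / e1

# The same Table 7.2 counts, keyed two ways.
vote_given_attitude = {"Yes": {"Gore": 46, "Bush": 41}, "No": {"Gore": 39, "Bush": 73}}
attitude_given_vote = {"Gore": {"Yes": 46, "No": 39}, "Bush": {"Yes": 41, "No": 73}}

lam_vote = lambda_pre(vote_given_attitude)      # attitude is the IV, vote is the DV
lam_attitude = lambda_pre(attitude_given_vote)  # vote is the IV, attitude is the DV
print(round(lam_vote, 3), round(lam_attitude, 3))
```

The two values differ (about .059 versus .080), which is what "asymmetrical" means here.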

Symmetrical Measure of Association A measure whose value will be the same when either variable is considered the independent variable or the dependent variable. Gamma is a symmetrical measure of association…

Before Computing GAMMA: It is necessary to introduce the concept of paired observations. Paired observations – Observations compared in terms of their relative rankings on the independent and dependent variables.

Tied Pairs Same order pair (Ns) – Paired observations that show a positive association; the member of the pair ranked higher on the independent variable is also ranked higher on the dependent variable.

Tied Pairs Inverse order pair (Nd) – Paired observations that show a negative association; the member of the pair ranked higher on the independent variable is ranked lower on the dependent variable.

Gamma Gamma: a symmetrical measure of association suitable for use with ordinal variables or with dichotomous nominal variables. It can vary from 0.0 (meaning the extra information provided by the independent variable does not help prediction) to ±1.0 (meaning use of the independent variable results in no prediction errors) and provides us with an indication of the strength and direction of the association between the variables. When there are more Ns pairs, gamma will be positive; when there are more Nd pairs, gamma will be negative.

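Gamma Gamma = (Ns – Nd) / (Ns + Nd), where Ns is the number of same order pairs and Nd is the number of inverse order pairs.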
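
Gamma compares same order pairs (Ns) with inverse order pairs (Nd): gamma = (Ns – Nd) / (Ns + Nd). A minimal Python sketch follows. It reuses the 2 x 2 job-security table from earlier in these slides and treats both dichotomies as ordered in the order listed, which is an assumption made only for illustration; the sign of the result depends on that coding.

```python
def gamma(table):
    """Gamma for a table of counts.

    `table` is a list of rows; rows and columns must both be listed
    in a consistent order from one end of the scale to the other.
    """
    ns = nd = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:    # later row and later column: same order pair
                        ns += table[i][j] * table[k][m]
                    elif m < j:  # later row but earlier column: inverse order pair
                        nd += table[i][j] * table[k][m]
    return ns, nd, (ns - nd) / (ns + nd)

# Rows: support for abortion (Yes, No); columns: Job Find Easy, Job Find Not Easy.
ns, nd, g = gamma([[24, 25], [20, 26]])
print(ns, nd, round(g, 2))
```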

Interpreting Gamma The sign depends on the way the variables are coded: + the two “high” values are associated, as are the two “lows” – the “highs” are associated with the “lows” .00 to .24 “no relationship” .25 to .49 “weak relationship” .50 to .74 “moderate relationship” .75 to 1.00 “strong relationship”

Measures of Association Measures of association—a single summarizing number that reflects the strength of the relationship. This statistic shows the magnitude and/or direction of a relationship between variables. Magnitude—the closer to the absolute value of 1, the stronger the association. If the measure equals 0, there is no relationship between the two variables. Direction—the sign on the measure indicates if the relationship is positive or negative. In a positive relationship, when one variable is high, so is the other. In a negative relationship, when one variable is high, the other is low.

Chapter 8: Bivariate Regression and Correlation Overview The Scatter Diagram Two Examples: Education & Prestige Correlation Coefficient Bivariate Linear Regression Line SPSS Output Interpretation Covariance

Overview The appropriate technique depends on the level of measurement of the independent and dependent variables:
Nominal independent variable, nominal dependent variable: considers the distribution of one variable across the categories of another variable. Technique: Lambda. (You already know how to deal with two nominal variables.)
Nominal independent variable, interval dependent variable: considers the difference between the mean of one group on a variable and the mean of another group. Techniques: Confidence Intervals, T-Test. (We will deal with this later in the course.)
Interval independent variable, nominal dependent variable: considers how a change in a variable affects a discrete outcome. Technique: Logistic Regression. (This cell is not covered in this course.)
Interval independent variable, interval dependent variable: considers the degree to which a change in one variable results in a change in another. Techniques: Regression, Correlation. (TODAY!)

General Examples Does a change in one variable significantly affect another variable? Do two scores tend to co-vary positively (high on one score, high on the other; low on one, low on the other)? Do two scores tend to co-vary negatively (high on one score, low on the other; low on one, high on the other)?

Specific Examples Does getting older significantly influence a person’s political views? Does marital satisfaction increase with length of marriage? How does an additional year of education affect one’s earnings?

Scatter Diagrams Scatter Diagram (scatterplot)—a visual method used to display a relationship between two interval-ratio variables. Typically, the independent variable is placed on the X-axis (horizontal axis), while the dependent variable is placed on the Y-axis (vertical axis.)

Scatter Diagram Example The data…

Scatter Diagram Example

A Scatter Diagram Example of a Negative Relationship

Linear Relationships Linear relationship – A relationship between two interval-ratio variables in which the observations displayed in a scatter diagram can be approximated with a straight line. Deterministic (perfect) linear relationship – A relationship between two interval-ratio variables in which all the observations (the dots) fall along a straight line. The line provides a predicted value of Y (the vertical axis) for any value of X (the horizontal axis).

Graph the data below and examine the relationship:

The Seniority-Salary Relationship

Example: Education & Prestige Does education predict occupational prestige? If so, then the higher the respondent’s level of education, as measured by number of years of schooling, the greater the prestige of the respondent’s occupation. Take a careful look at the scatter diagram on the next slide and see if you think that there exists a relationship between these two variables…

Scatterplot of Prestige by Education

Example: Education & Prestige The scatter diagram data can be approximated by a straight line, so a linear relationship exists between these two variables. In addition, since occupational prestige rises as years of education increase, we can also say that the relationship is a positive one.

Take your best guess? If you know nothing else about a person, except that he or she lives in the United States, and I asked you to guess his or her age, what would you guess? The mean age for U.S. residents. Now if I tell you that this person owns a skateboard, would you change your guess? (Of course!) With quantitative analyses we are generally trying to predict, or take our best guess at, the value of the dependent variable. One way to assess the relationship between two variables is to consider the degree to which the extra information of the second variable makes your guess better. If someone owns a skateboard, that is likely to indicate to us that s/he is younger, and we may be able to guess closer to the actual value.

Take your best guess? Similar to the example of age and the skateboard, we can take a much better guess at someone’s occupational prestige, if we have information about her/his years or level of education.

Equation for a Straight Line Y = a + bX where a = intercept, b = slope (rise / run), Y = dependent variable, X = independent variable

Bivariate Linear Regression Equation Ŷ = a + bX Y-intercept (a): The point where the regression line crosses the Y-axis, or the value of Ŷ when X = 0. Slope (b): The change in variable Y (the dependent variable) with a unit change in X (the independent variable). The estimates of a and b are chosen by ordinary least squares (OLS) so that the sum of the squared differences between the observed and predicted values, Σ(Y – Ŷ)², is minimized. Under the standard regression assumptions, these estimates are the Best Linear Unbiased Estimators (BLUE) of the intercept and slope.

SPSS Regression Output (GSS) Education & Prestige

SPSS Regression Output (GSS) Education & Prestige Now let’s interpret the SPSS output...

The Regression Equation Prediction Equation: Ŷ = 6.120 + 2.762(X) This line represents the predicted values for Y for any and all values of X.

Interpreting the regression equation Ŷ = 6.120 + 2.762(X) If a respondent had zero years of schooling, this model predicts that his or her occupational prestige score would be 6.120 points. For each additional year of education, our model predicts a 2.762 point increase in occupational prestige.
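
The interpretation of the intercept and slope can be checked by plugging values into the prediction equation. In this Python sketch only the coefficients 6.120 and 2.762 come from the slides; the function name and the choice of 12 years of education are illustrative.

```python
def predict_prestige(years_of_education):
    """Predicted occupational prestige from the equation 6.120 + 2.762(X)."""
    return 6.120 + 2.762 * years_of_education

at_zero = predict_prestige(0)                         # the intercept
at_twelve = predict_prestige(12)                      # e.g. twelve years of schooling
one_more_year = predict_prestige(13) - at_twelve      # the slope
print(at_zero, round(at_twelve, 3), round(one_more_year, 3))
```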

Ordinary Least Squares Least-squares line (best fitting line) – A line where the error sum of squares, or Σe², is at a minimum. Least-squares method – The technique that produces the least squares line.

Estimating the slope: b The bivariate regression coefficient, or the slope of the regression line, can be obtained from the observed X and Y scores: b = (covariance of X and Y) / (variance of X).

Covariance and Variance Covariance of X and Y: a measure of how X and Y vary together. Covariance will be close to zero when X and Y are unrelated. It will be greater than zero when the relationship is positive and less than zero when the relationship is negative. Variance of X: we have talked a lot about variance in the dependent variable. This is simply the variance for the independent variable.

Estimating the Intercept The regression line always goes through the point corresponding to the mean of both X and Y, by definition. So we utilize this information to solve for a: a = (mean of Y) – b(mean of X).
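
The slope and intercept can be estimated from the covariance and variance as described. The Python sketch below uses a small set of scores invented for illustration (they are not the data behind the slides) and divides by N – 1, which is a common convention and an assumption here; the slope is the same with either divisor as long as it is used in both terms.

```python
# Hypothetical (X, Y) scores, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariance of X and Y, and variance of X (both divided by N - 1).
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)

b = cov_xy / var_x        # slope
a = mean_y - b * mean_x   # intercept: the line passes through (mean_x, mean_y)
print(round(b, 3), round(a, 3))
```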

Back to the original scatterplot:

A Representative Line

Other Representative Lines

Calculating the Regression Equation

Calculating the Regression Equation

The Least Squares Line!

Summary: Properties of the Regression Line Represents the predicted values for Y for any and all values of X. Always goes through the point corresponding to the mean of both X and Y. It is the best fitting line in that it minimizes the sum of the squared deviations. Has a slope that can be positive or negative; null hypothesis is that the slope is zero.

Coefficient of Determination Coefficient of Determination (r²) – A PRE measure reflecting the proportional reduction of error that results from using the linear regression model. It reflects the proportion of the total variation in the dependent variable, Y, explained by the independent variable, X.
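
Read as a PRE measure, the coefficient of determination compares the squared prediction errors made when X is ignored (E1) with those made when the regression line is used (E2). A minimal Python sketch, using scores invented for illustration:

```python
# Hypothetical (X, Y) scores, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# E1: squared errors when every Y is predicted by the mean of Y (X ignored).
e1 = sum((yi - mean_y) ** 2 for yi in y)
# E2: squared errors when Y is predicted from the regression line.
e2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = (e1 - e2) / e1
print(round(r_squared, 3))
```

For these made-up scores, using X reduces the squared prediction error by 60 percent.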

Coefficient of Determination

Coefficient of Determination

The Correlation Coefficient Pearson’s Correlation Coefficient (r): The square root of r², taking the sign of the slope, b. It is a measure of association between two interval-ratio variables. Symmetrical measure: no specification of independent or dependent variables. Ranges from –1.0 to +1.0. The sign (±) indicates direction. The closer the number is to ±1.0, the stronger the association between X and Y.

The Correlation Coefficient r = 0 means that there is no linear association between the two variables. r = +1 means a perfect positive correlation. r = –1 means a perfect negative correlation. [Figures: scatter diagrams of Y against X for r = 0, r = +1, and r = –1]
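
The following Python sketch computes Pearson's r from deviations about the means and illustrates the perfect positive and negative cases as well as the symmetry of the measure. All scores are invented for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed from deviations about the means."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                      # hypothetical scores
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # Y rises with X along a straight line
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # Y falls as X rises along a straight line
r_xy = pearson_r(x, [2, 4, 5, 4, 5])
r_yx = pearson_r([2, 4, 5, 4, 5], x)     # swapping X and Y gives the same r
print(r_pos, r_neg, round(r_xy, 4), round(r_yx, 4))
```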

Chapter 9: The Normal Distribution Properties of the Normal Distribution Shapes of Normal Distributions Standard (Z) Scores The Standard Normal Distribution Transforming Z Scores into Proportions Transforming Proportions into Z Scores Finding the Percentile Rank of a Raw Score Finding the Raw Score for a Percentile

Normal Distributions Normal Distribution – A bell-shaped and symmetrical theoretical distribution, with the mean, the median, and the mode all coinciding at its peak and with frequencies gradually decreasing at both ends of the curve. The normal distribution is a theoretical ideal distribution. Real-life empirical distributions never match this model perfectly. However, many things in life do approximate the normal distribution, and are said to be “normally distributed.”

Scores “Normally Distributed?” Is this distribution normal? There are two things to initially examine: (1) look at the shape illustrated by the bar chart, and (2) calculate the mean, median, and mode.

Scores Normally Distributed! The Mean = 70.07 The Median = 70 The Mode = 70 Since all three are essentially equal, and this is reflected in the bar graph, we can assume that these data are normally distributed. Also, since the median is approximately equal to the mean, we know that the distribution is symmetrical.

The Shape of a Normal Distribution: The Normal Curve

The Shape of a Normal Distribution Notice the shape of the normal curve in this graph. Some normal distributions are tall and thin, while others are short and wide. All normal distributions, though, are wider in the middle and symmetrical.

Different Shapes of the Normal Distribution Notice that the standard deviation changes the relative width of the distribution; the larger the standard deviation, the wider the curve.

Areas Under the Normal Curve by Measuring Standard Deviations

Standard (Z) Scores A standard score (also called Z score) is the number of standard deviations that a given raw score is above or below the mean: Z = (raw score – mean) / (standard deviation).
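
A Z score is the raw score minus the mean, divided by the standard deviation. The Python sketch below applies this to two raw scores, using the mean (70.07) and standard deviation (10.27) that appear in the worked examples in these slides; the function name is illustrative.

```python
def z_score(raw, mean, sd):
    """Number of standard deviations a raw score lies above (+) or below (-) the mean."""
    return (raw - mean) / sd

# Mean 70.07 and standard deviation 10.27 are the values used in the slides' examples.
z_85 = z_score(85, 70.07, 10.27)
z_65 = z_score(65, 70.07, 10.27)
print(round(z_85, 2), round(z_65, 2))
```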

The Standard Normal Table A table showing the area (as a proportion, which can be translated into a percentage) under the standard normal curve corresponding to any Z score or its fraction Area up to a given score

The Standard Normal Table A table showing the area (as a proportion, which can be translated into a percentage) under the standard normal curve corresponding to any Z score or its fraction Area beyond a given score

Finding the Area Between the Mean and a Positive Z Score Using the data presented in Table 10.1, find the percentage of students whose scores range from the mean (70.07) to 85. (1) Convert 85 to a Z score: Z = (85-70.07)/10.27 = 1.45 (2) Look up the Z score (1.45) in Column A, finding the proportion (.4265)

Finding the Area Between the Mean and a Positive Z Score (3) Convert the proportion (.4265) to a percentage (42.65%); this is the percentage of students scoring between the mean and 85 in the course.

Finding the Area Between the Mean and a Negative Z Score Using the data presented in Table 10.1, find the percentage of students scoring between 65 and the mean (70.07) (1) Convert 65 to a Z score: Z = (65-70.07)/10.27 = -.49 (2) Since the curve is symmetrical and negative area does not exist, use .49 to find the area in the standard normal table: .1879

Finding the Area Between the Mean and a Negative Z Score (3) Convert the proportion (.1879) to a percentage (18.79%); this is the percentage of students scoring between 65 and the mean (70.07)

Finding the Area Between 2 Z Scores on the Same Side of the Mean Using the same data presented in Table 10.1, find the percentage of students scoring between 74 and 84. (1) Find the Z scores for 74 and 84: Z = .38 and Z = 1.36 (2) Look up the corresponding areas for those Z scores: .1480 and .4131

Finding the Area Between 2 Z Scores on the Same Side of the Mean (3) To find the highlighted area, subtract the smaller area from the larger area: .4131 – .1480 = .2651. Converted to a percentage, 26.51% of students scored between 74 and 84.

Finding the Area Between 2 Z Scores on Opposite Sides of the Mean Using the same data, find the percentage of students scoring between 62 and 72. (1) Find the Z scores for 62 and 72: Z = (72-70.07)/10.27 = .19 Z = (62-70.07)/10.27 = -.79 (2) Look up the areas between these Z scores and the mean, like in the previous 2 examples: Z = .19 is .0753 and Z = -.79 is .2852 (3) Add the two areas together: .0753 + .2852 = .3605

Finding the Area Between 2 Z Scores on Opposite Sides of the Mean (4) Convert the proportion (.3605) to a percentage (36.05%); this is the percentage of students scoring between 62 and 72.
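
The table lookups in these examples can be approximated in Python with `statistics.NormalDist` (available from Python 3.8). This is a sketch under the assumption that Z scores are rounded to two decimals and areas to four decimals, as with a printed table; the helper names are illustrative.

```python
from statistics import NormalDist

MEAN, SD = 70.07, 10.27  # values used in the slides' worked examples

def z_score(raw):
    """Z score rounded to two decimals, as when using a printed table."""
    return round((raw - MEAN) / SD, 2)

def area_mean_to_z(z):
    """Area between the mean and Z, rounded to four decimals like a table entry."""
    return round(NormalDist().cdf(abs(z)) - 0.5, 4)

mean_to_85 = area_mean_to_z(z_score(85))
# Same side of the mean (74 to 84): subtract the smaller area from the larger.
same_side = area_mean_to_z(z_score(84)) - area_mean_to_z(z_score(74))
# Opposite sides of the mean (62 to 72): add the two areas.
opposite_sides = area_mean_to_z(z_score(62)) + area_mean_to_z(z_score(72))
print(mean_to_85, round(same_side, 4), round(opposite_sides, 4))
```

Working with unrounded Z scores would give slightly different areas than the table-based values in the slides.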

Finding Area Above a Positive Z Score or Below a Negative Z Score Find the percentage of students who did (a) very well, scoring above 85, and (b) those students who did poorly, scoring below 50. (a) Convert 85 to a Z score, then look up the value in Column C of the Standard Normal Table: Z = (85-70.07)/10.27 = 1.45 → 7.35% (b) Convert 50 to a Z score, then look up the value (look for a positive Z score!) in Column C: Z = (50-70.07)/10.27 = -1.95 → 2.56%

Finding Area Above a Positive Z Score or Below a Negative Z Score

Finding a Z Score Bounding an Area Above It Find the raw score that bounds the top 10 percent of the distribution (Table 10.1) (1) 10% = a proportion of .10 (2) Using the Standard Normal Table, look in Column C for .1000, then take the value in Column A; this is the Z score (1.28) (3) Finally convert the Z score to a raw score: Y=70.07 + 1.28 (10.27) = 83.22

Finding a Z Score Bounding an Area Above It (4) 83.22 is the raw score that bounds the upper 10% of the distribution. The Z score associated with 83.22 in this distribution is 1.28

Finding a Z Score Bounding an Area Below It Find the raw score that bounds the lowest 5 percent of the distribution (Table 10.1) (1) 5% = a proportion of .05 (2) Using the Standard Normal Table, look in Column C for .05, then take the value in Column A; this is the Z score (-1.65); negative, since it is on the left side of the distribution (3) Finally convert the Z score to a raw score: Y = 70.07 + (-1.65)(10.27) = 53.12

Finding a Z Score Bounding an Area Below It (4) 53.12 is the raw score that bounds the lower 5% of the distribution. The Z score associated with 53.12 in this distribution is -1.65
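
Converting a Z score back to a raw score is a one-line calculation. The Python sketch below reproduces the two results above from the table Z scores and, for comparison, shows the exact Z value that bounds the lowest 5 percent. The helper name is illustrative.

```python
from statistics import NormalDist

MEAN, SD = 70.07, 10.27  # values used in the slides' worked examples

def raw_score(z):
    """Convert a Z score back to a raw score: Y = mean + Z * standard deviation."""
    return MEAN + z * SD

top_10 = round(raw_score(1.28), 2)     # Z from the table for an upper area of .10
bottom_5 = round(raw_score(-1.65), 2)  # negative Z: left side of the distribution

# Exact Z bounding the lowest 5 percent, for comparison with the table value of -1.65.
exact_z = NormalDist().inv_cdf(0.05)
print(top_10, bottom_5, round(exact_z, 3))
```

The exact value is about -1.645, which falls between two table entries; the slides use -1.65.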

Finding the Percentile Rank of a Score Higher than the Mean Suppose your raw score was 85. You want to calculate the percentile (to see where in the class you rank.) (1) Convert the raw score to a Z score: Z = (85-70.07)/10.27 = 1.45 (2) Find the area beyond Z in the Standard Normal Table (Column C): .0735 (3) Subtract the area from 1.00 for the percentile, since .0735 is only the area not below the score: 1.00 - .0735 = .9265 (proportion of scores below 85)

Finding the Percentile Rank of a Score Higher than the Mean (4) .9265 represents the proportion of scores less than 85 corresponding to a percentile rank of 92.65%

Finding the Percentile Rank of a Score Lower than the Mean Now, suppose your raw score was 65. (1) Convert the raw score to a Z score: Z = (65-70.07)/10.27 = -.49 (2) Find the area beyond Z in the Standard Normal Table, Column C: .3121 (3) Multiply by 100 to obtain the percentile rank: .3121 x 100 = 31.21%

Finding the Percentile Rank of a Score Lower than the Mean
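
Both percentile-rank cases reduce to the proportion of the normal curve lying below the score. A minimal Python sketch, assuming Z is rounded to two decimals as in a table lookup (the function name is illustrative):

```python
from statistics import NormalDist

MEAN, SD = 70.07, 10.27  # values used in the slides' worked examples

def percentile_rank(raw):
    """Percentage of scores below `raw`, with Z rounded to two decimals."""
    z = round((raw - MEAN) / SD, 2)
    return round(100 * NormalDist().cdf(z), 2)

rank_85 = percentile_rank(85)  # score above the mean
rank_65 = percentile_rank(65)  # score below the mean
print(rank_85, rank_65)
```

Using the cumulative area directly avoids having to remember whether to subtract the Column C value from 1.00.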

Finding the Raw Score of a Percentile Higher than 50 Say you need to score in the 95th percentile to be accepted to a particular grad school program. What’s the cutoff for the 95th percentile? (1) Find the area associated with the percentile: 95/100 = .9500 (2) Subtract the area from 1.00 to find the area above & beyond the percentile rank: 1.00 - .9500 = .0500 (3) Find the Z Score by looking in Column C of the Standard Normal Table for .0500: Z = 1.65

Finding the Raw Score of a Percentile Higher than 50 (4) Convert the Z score to a raw score. Y= 70.07 + 1.65(10.27) = 87.02

Finding the Raw Score of a Percentile Lower than 50 What score is associated with the 40th percentile? (1) Find the area below the percentile: 40/100 = .4000 (2) Find the Z score associated with this area. Use Column C, but remember that this is a negative Z score since it is less than the mean; so, Z = -.25 (3) Convert the Z score to a raw score: Y = 70.07 + (-.25)(10.27) = 67.50

Finding the Raw Score of a Percentile Lower than 50

Chapter 10: Sampling and Sampling Distributions Aims of Sampling Basic Principles of Probability Types of Random Samples Sampling Distributions Sampling Distribution of the Mean Standard Error of the Mean The Central Limit Theorem

Sampling Population – A group that includes all the cases (individuals, objects, or groups) in which the researcher is interested. Sample – A relatively small subset from a population.

Notation

Sampling Parameter – A measure (for example, mean or standard deviation) used to describe a population distribution. Statistic – A measure (for example, mean or standard deviation) used to describe a sample distribution.

Sampling: Parameter & Statistic

Probability Sampling Probability sampling – A method of sampling that enables the researcher to specify for each case in the population the probability of its inclusion in the sample.

Random Sampling Simple Random Sample – A sample designed in such a way as to ensure that (1) every member of the population has an equal chance of being chosen and (2) every combination of N members has an equal chance of being chosen. This can be done using a computer, calculator, or a table of random numbers

Population inferences can be made...

...by selecting a representative sample from the population

Random Sampling Systematic random sampling – A method of sampling in which every Kth member (K is a ratio obtained by dividing the population size by the desired sample size) in the total population is chosen for inclusion in the sample after the first member of the sample is selected at random from among the first K members of the population.
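A systematic random sample as defined above might be drawn like this. The sketch assumes the population is held in a Python list that is at least as large as the desired sample; the function name `systematic_sample` and the numbered population are hypothetical.

```python
import random

def systematic_sample(population, sample_size, rng=random):
    """Pick every Kth member after a random start among the first K members."""
    k = len(population) // sample_size  # K = population size / sample size
    start = rng.randrange(k)            # random position within the first K members
    # Step through the list K at a time; trim in case K does not divide evenly.
    return population[start::k][:sample_size]

# Hypothetical population of 100 numbered cases, desired sample of 10 (K = 10).
cases = list(range(1, 101))
print(systematic_sample(cases, 10))
```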

Systematic Random Sampling

Stratified Random Sampling Stratified random sample – A method of sampling obtained by (1) dividing the population into subgroups based on one or more variables central to our analysis and (2) then drawing a simple random sample from each of the subgroups

Stratified Random Sampling Proportionate stratified sample – The size of the sample selected from each subgroup is proportional to the size of that subgroup in the entire population. Disproportionate stratified sample – The size of the sample selected from each subgroup is disproportional to the size of that subgroup in the population.
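A proportionate stratified sample can be sketched as follows. This is a minimal example under stated assumptions: the function name, the `stratum_of` callback, and the 60/40 population are all hypothetical, and because each subgroup's share is rounded, the final sample can be off by a case or two when the proportions do not divide evenly.

```python
import random
from collections import defaultdict

def proportionate_stratified_sample(cases, stratum_of, sample_size, rng=random):
    """Draw a simple random sample from each subgroup, sized by its population share."""
    strata = defaultdict(list)
    for case in cases:                       # (1) divide the population into subgroups
        strata[stratum_of(case)].append(case)
    sample = []
    for members in strata.values():          # (2) sample within each subgroup
        share = len(members) / len(cases)
        n = round(sample_size * share)       # rounding may shift the total slightly
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical population: 60 cases in stratum "A" and 40 in stratum "B".
population = [("A", i) for i in range(60)] + [("B", i) for i in range(40)]
picked = proportionate_stratified_sample(population, lambda case: case[0], 10)
print(sorted(picked))
```

A disproportionate design would replace the `share`-based size with a size chosen by the researcher for each subgroup.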

Disproportionate Stratified Sample

Sampling Distributions Sampling error – The discrepancy between a sample estimate of a population parameter and the real population parameter. Sampling distribution – A theoretical distribution of all possible sample values for the statistic in which we are interested.

Sampling Distributions Sampling distribution of the mean – A theoretical probability distribution of sample means that would be obtained by drawing from the population all possible samples of the same size. If we repeatedly drew samples from a population and calculated the sample means, the distribution of those sample means would approach a normal curve as the number of samples drawn increases. The next several slides demonstrate this. Standard error of the mean – The standard deviation of the sampling distribution of the mean. It describes how much dispersion there is in the sampling distribution of the mean.

Sampling Distributions

Distribution of Sample Means with 21 Samples [Histogram: frequency by sample mean. S.D. = 2.02, Mean of Means = 41.0, Number of Means = 21]

Distribution of Sample Means with 96 Samples [Histogram: frequency by sample mean. S.D. = 1.80, Mean of Means = 41.12, Number of Means = 96]

Distribution of Sample Means with 170 Samples [Histogram: frequency by sample mean. S.D. = 1.71, Mean of Means = 41.12, Number of Means = 170]

The Central Limit Theorem If all possible random samples of size N are drawn from a population with mean $\mu_Y$ and standard deviation $\sigma_Y$, then as N becomes larger, the sampling distribution of sample means becomes approximately normal, with mean $\mu_Y$ and standard deviation $\sigma_Y / \sqrt{N}$.
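A small simulation can make the theorem concrete. The sketch below is an illustration under assumptions that are not in the slides: the population is a skewed exponential distribution with mean 1 and standard deviation 1, and the sample size (50), the number of samples (2000), and the function name are arbitrary choices.

```python
import random
import statistics

def simulate_sample_means(sample_size, num_samples, seed=0):
    """Draw repeated samples from a skewed population and return their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        # Exponential population (rate 1): mean 1, standard deviation 1, right-skewed.
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

means = simulate_sample_means(sample_size=50, num_samples=2000)
print("mean of sample means:", statistics.fmean(means))   # should be near 1
print("S.D. of sample means:", statistics.stdev(means))   # should be near the line below
print("sigma / sqrt(N):     ", 1 / 50 ** 0.5)
```

Even though the population is skewed, the standard deviation of the simulated means should sit close to the population standard deviation divided by the square root of N.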

Chapter 11: Estimation Estimation Defined Confidence Levels Confidence Intervals Confidence Interval Precision Standard Error of the Mean Sample Size Standard Deviation Confidence Intervals for Proportions

Estimation Defined: Estimation – A process whereby we select a random sample from a population and use a sample statistic to estimate a population parameter.

Point and Interval Estimation Point Estimate – A sample statistic used to estimate the exact value of a population parameter Confidence interval (interval estimate) – A range of values defined by the confidence level within which the population parameter is estimated to fall. Confidence Level – The likelihood, expressed as a percentage or a probability, that a specified interval will contain the population parameter.

Estimations Lead to Inferences Take a subset of the population

Estimations Lead to Inferences Try and reach conclusions about the population

Inferential Statistics involves Three Distributions: A population distribution – variation in the larger group that we want to know about. A distribution of sample observations – variation in the sample that we can observe. A sampling distribution – a normal distribution whose mean and standard deviation are unbiased estimates of the parameters and allows one to infer the parameters from the statistics.

The Central Limit Theorem Revisited What does this theorem tell us? Even if a population distribution is skewed, the sampling distribution of the mean is approximately normal when the sample size is large. The mean of the sampling distribution equals the population mean. As the sample size gets larger, the standard error of the mean decreases in size (which means that the variability in the sample estimates from sample to sample decreases as N increases). It is important to remember that researchers do not typically conduct repeated samples of the same population. Instead, they use the knowledge of theoretical sampling distributions to construct confidence intervals around estimates.

Confidence Levels: Confidence Level – The likelihood, expressed as a percentage or a probability, that a specified interval will contain the population parameter. 95% confidence level – there is a .95 probability that a specified interval DOES contain the population mean. In other words, there are 5 chances out of 100 (or 1 chance out of 20) that the interval DOES NOT contain the population mean. 99% confidence level – there is 1 chance out of 100 that the interval DOES NOT contain the population mean.

Constructing a Confidence Interval (CI) The sample mean is the point estimate of the population mean. The sample standard deviation is the point estimate of the population standard deviation. The standard error of the mean makes it possible to state the probability that an interval around the point estimate contains the actual population mean.

What We are Wanting to Do We want to construct an estimate of where the population mean falls based on our sample statistics The actual population parameter falls somewhere on this line This is our Confidence Interval

The Standard Error Standard error of the mean – the standard deviation of the sampling distribution of the mean: $\sigma_{\bar{Y}} = \sigma_Y / \sqrt{N}$

Estimating standard errors Since the standard error is generally not known, we usually work with the estimated standard error: $S_{\bar{Y}} = S_Y / \sqrt{N}$

Determining a Confidence Interval (CI) $CI = \bar{Y} \pm Z(S_{\bar{Y}})$ where: $\bar{Y}$ = sample mean (estimate of $\mu_Y$) Z = Z score for one-half the acceptable error $S_{\bar{Y}}$ = estimated standard error

Confidence Interval Width Confidence Level – Increasing our confidence level from 95% to 99% means we are less willing to draw the wrong conclusion – we take a 1% risk (rather than a 5%) that the specified interval does not contain the true population mean. If we reduce our risk of being wrong, then we need a wider range of values . . . So the interval becomes less precise.

Confidence Interval Width More precise, less confident More confident, less precise

Confidence Interval Z Values

Confidence Interval Width Sample Size – Larger samples result in smaller standard errors, and therefore, in sampling distributions that are more clustered around the population mean. A more closely clustered sampling distribution indicates that our confidence intervals will be narrower and more precise.

Confidence Interval Width Standard Deviation – Smaller sample standard deviations result in smaller, more precise confidence intervals. (Unlike sample size and confidence level, the researcher plays no role in determining the standard deviation of a sample.)
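The three influences on interval width (confidence level, sample size, and standard deviation) can be checked numerically. This is a rough sketch: the function name `ci_half_width` and the input values are hypothetical, and the Z score is taken from Python's standard-library normal curve rather than a printed table.

```python
from statistics import NormalDist

def ci_half_width(sample_sd, n, confidence=0.95):
    """Return Z times the estimated standard error (half the interval width)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * sample_sd / n ** 0.5

# Hypothetical sample: standard deviation 10, N = 100.
print(ci_half_width(10, 100, 0.95))   # baseline
print(ci_half_width(10, 100, 0.99))   # higher confidence -> wider interval
print(ci_half_width(10, 400, 0.95))   # four times the cases -> half the width
print(ci_half_width(5, 100, 0.95))    # smaller standard deviation -> narrower
```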

Example: Sample Size and Confidence Intervals

Example: Sample Size and Confidence Intervals

Example: Hispanic Migration and Earnings From 1980 Census data: Cubans had an average income of $16,368 (Sy = $3,069), N=3895 Mexicans had an average of $13,342 (Sy = $9,414), N=5726 Puerto Ricans had an average of $12,587 (Sy = $8,647), N=5908

Example: Hispanic Migration and Earnings Now, compute the 95% CIs for all three groups: Cubans: standard error = 3069/$\sqrt{3895}$ = 49.17; 95% CI = 16,368 ± 1.96(49.17) = 16,272 to 16,464 Mexicans: s.e. = 9414/$\sqrt{5726}$ = 124.41; 95% CI = 13,342 ± 1.96(124.41) = 13,098 to 13,586

Example: Hispanic Migration and Earnings Puerto Ricans: s.e. = 8647/$\sqrt{5908}$ = 112.5; 95% CI = 12,587 ± 1.96(112.5) = 12,367 to 12,807
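The three intervals can be reproduced with a short script. This is a minimal sketch: the function name `confidence_interval` is hypothetical, while the means, standard deviations, and sample sizes are the ones given on the slide.

```python
from math import sqrt

def confidence_interval(mean, sd, n, z=1.96):
    """Return (estimated standard error, lower bound, upper bound)."""
    se = sd / sqrt(n)                 # estimated standard error of the mean
    return se, mean - z * se, mean + z * se

# (mean income, standard deviation, N) from the slide.
groups = {
    "Cubans": (16368, 3069, 3895),
    "Mexicans": (13342, 9414, 5726),
    "Puerto Ricans": (12587, 8647, 5908),
}

for name, (mean, sd, n) in groups.items():
    se, low, high = confidence_interval(mean, sd, n)
    print(f"{name}: s.e. = {se:.2f}, 95% CI = {low:,.0f} to {high:,.0f}")
```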

Example: Hispanic Migration and Earnings

Confidence Intervals for Proportions Estimating the standard error of a proportion – based on the Central Limit Theorem, a sampling distribution of proportions is approximately normal, with a mean, $\mu_p$, equal to the population proportion, $\pi$, and with a standard error of proportions equal to: $\sigma_p = \sqrt{\pi(1 - \pi)/N}$ Since the standard error of proportions is generally not known, we usually work with the estimated standard error: $s_p = \sqrt{p(1 - p)/N}$

Determining a Confidence Interval for a Proportion $CI = p \pm Z(s_p)$ where: p = observed sample proportion (estimate of $\pi$) Z = Z score for one-half the acceptable error $s_p$ = estimated standard error of the proportion

Confidence Intervals for Proportions Protestants in favor of banning stem cell research: N = 2,188, p = .37 Calculate the estimated standard error: $s_p = \sqrt{(.37)(1 - .37)/2188}$ = .010 Determine the confidence level: let's say we want to be 95% confident. 95% CI = .37 ± 1.96(.010) = .37 ± .020 = .35 to .39

Confidence Intervals for Proportions Catholics in favor of banning stem cell research: N = 880, p = .32 Calculate the estimated standard error: $s_p = \sqrt{(.32)(1 - .32)/880}$ = .016 Determine the confidence level: let's say we want to be 95% confident. 95% CI = .32 ± 1.96(.016) = .32 ± .031 = .29 to .35

Confidence Intervals for Proportions Interpretation: We are 95 percent confident that the true population proportion supporting a ban on stem-cell research is somewhere between .35 and .39 (or between 35.0% and 39.0%) for Protestants, and somewhere between .29 and .35 (or between 29.0% and 35.0%) for Catholics.
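The two proportion intervals can be checked with a few lines of Python. This is an illustrative sketch with a hypothetical function name. It does not round the standard error before multiplying by Z, so the bounds can differ from the slides in the third decimal place.

```python
from math import sqrt

def proportion_ci(p, n, z=1.96):
    """Return (estimated standard error, lower bound, upper bound) for a proportion."""
    se = sqrt(p * (1 - p) / n)        # estimated standard error of the proportion
    return se, p - z * se, p + z * se

# Sample proportions and sample sizes from the slides.
for label, p, n in [("Protestants", 0.37, 2188), ("Catholics", 0.32, 880)]:
    se, low, high = proportion_ci(p, n)
    print(f"{label}: s.e. = {se:.3f}, 95% CI = {low:.2f} to {high:.2f}")
```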

Chapter 12: Testing Hypotheses Overview Research and null hypotheses One and two-tailed tests Errors Testing the difference between two means t tests

Overview
Nominal independent variable, nominal dependent variable: considers the distribution of one variable across the categories of another variable (lambda). You already know how to deal with two nominal variables.
Nominal independent variable, interval dependent variable: considers the difference between the mean of one group on a variable with another group (confidence intervals, t-test). TODAY! Testing the differences between groups.
Interval independent variable, nominal dependent variable: considers how a change in a variable affects a discrete outcome.
Interval independent variable, interval dependent variable: considers the degree to which a change in one variable results in a change in another.

General Examples Is one group scoring significantly higher on average than another group? Is a group statistically different from another on a particular dimension? Is Group A’s mean higher than Group B’s?

Specific Examples Do people living in rural communities live longer than those in urban or suburban areas? Do students from private high schools perform better in college than those from public high schools? Is the average number of years with an employer lower or higher for large firms (over 100 employees) compared to those with fewer than 100 employees?

Testing Hypotheses Statistical hypothesis testing – A procedure that allows us to evaluate hypotheses about population parameters based on sample statistics. Research hypothesis (H1) – A statement reflecting the substantive hypothesis. It is always expressed in terms of population parameters, but its specific form varies from test to test. Null hypothesis (H0) – A statement of “no difference,” which contradicts the research hypothesis and is always expressed in terms of population parameters.

Research and Null Hypotheses One Tail – specifies the hypothesized direction Research Hypothesis: H1: $\mu_2 > \mu_1$, or $\mu_2 - \mu_1 > 0$ Null Hypothesis: H0: $\mu_2 = \mu_1$, or $\mu_2 - \mu_1 = 0$ Two Tail – direction is not specified (more common) H1: $\mu_2 \neq \mu_1$, or $\mu_2 - \mu_1 \neq 0$

One-Tailed Tests One-tailed hypothesis test – A hypothesis test in which the alternative is stated in such a way that the probability of making a Type I error is entirely in one tail of a sampling distribution. Right-tailed test – A one-tailed test in which the sample outcome is hypothesized to be at the right tail of the sampling distribution. Left-tailed test – A one-tailed test in which the sample outcome is hypothesized to be at the left tail of the sampling distribution.

Two-Tailed Tests Two-tailed hypothesis test – A hypothesis test in which the region of rejection falls equally within both tails of the sampling distribution.

Probability Values Z statistic (obtained) – The test statistic computed by converting a sample statistic (such as the mean) to a Z score. The formula for obtaining Z varies from test to test. P value – The probability associated with the obtained value of Z.

Probability Values

Probability Values Alpha (α) – The level of probability at which the null hypothesis is rejected. It is customary to set alpha at the .05, .01, or .001 level.
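The relationship between an obtained Z, its P value, and alpha can be sketched as below. The function names are hypothetical. The one-tailed case assumes the obtained Z already falls in the hypothesized direction, and the decision rule (reject when P is at or below alpha) is the usual convention rather than something spelled out on the slide.

```python
from statistics import NormalDist

def p_value(z_obtained, tails=2):
    """Probability of a Z at least this extreme if the null hypothesis is true."""
    one_tail = 1 - NormalDist().cdf(abs(z_obtained))  # area beyond |Z|
    return tails * one_tail

def reject_null(z_obtained, alpha=0.05, tails=2):
    """Reject H0 when the P value is at or below alpha."""
    return p_value(z_obtained, tails) <= alpha

print(p_value(1.96))             # two-tailed, close to .05
print(p_value(1.645, tails=1))   # one-tailed, close to .05
print(reject_null(1.5))          # P is well above .05
```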

Five Steps to Hypothesis Testing (1) Making assumptions (2) Stating the research and null hypotheses and selecting alpha (3) Selecting the sampling distribution and specifying the test statistic (4) Computing the test statistic (5) Making a decision and interpreting the results

Type I and Type II Errors Type I error (false rejection error) – the probability (equal to α) associated with rejecting a true null hypothesis. Type II error (false acceptance error) – the probability associated with failing to reject a false null hypothesis.
Based on sample results, the decision made is to…
In the population, H0 is...   reject H0            do not reject H0
true                          Type I error (α)     correct decision
false                         correct decision     Type II error

t Test t statistic (obtained) – The test statistic computed to test the null hypothesis about a population mean when the population standard deviation is unknown and is estimated using the sample standard deviation. t distribution – A family of curves, each determined by its degrees of freedom (df). It is used when the population standard deviation is unknown and the standard error is estimated from the sample standard deviation. Degrees of freedom (df) – The number of scores that are free to vary in calculating a statistic.

t distribution

t distribution table

t-test for difference between two means Is the value of $\mu_2 - \mu_1$ significantly different from 0? This test gives you the answer: $t = (\bar{Y}_2 - \bar{Y}_1) / S_{\bar{Y}_2 - \bar{Y}_1}$, that is, the difference between the two means divided by the estimated standard error of the difference. If the absolute value of t is greater than 1.96, the difference between the means is significantly different from zero at an alpha of .05 (or a 95% confidence level). The critical value of t will be higher than 1.96 if the total N is less than 122. See Appendix C for exact critical values when N < 122.

Estimated Standard Error of the difference between two means assuming unequal variances $S_{\bar{Y}_2 - \bar{Y}_1} = \sqrt{S_1^2/N_1 + S_2^2/N_2}$

t-test and Confidence Intervals The t-test is essentially creating a confidence interval around the difference score. Rearranging the t formula, we can calculate the confidence interval around the difference between two means: $CI = (\bar{Y}_2 - \bar{Y}_1) \pm t(S_{\bar{Y}_2 - \bar{Y}_1})$ If this confidence interval includes zero, then we cannot conclude that there is a difference between the means for the two samples.

Why a t score and not a Z score? Use of the Z distribution assumes the population standard error of the difference is known. In practice, we have to estimate it, and so we use a t score. When N gets larger than 50, the t distribution converges with the Z distribution, so the results would be nearly identical regardless of whether you used t or Z. In most sociological studies, you will not need to worry about the distinction between Z and t.

t-Test Example 1 Mean pay according to gender:
         N    Mean Pay   S.D.
Women    46   $10.29     .8766
Men      54   $10.06     .9051
What can we conclude about the difference in wages?

t-Test Example 2 Mean pay according to gender:
         N    Mean Pay   S.D.
Women    57   $9.68      1.0550
Men      51   $10.32     .9461
What can we conclude about the difference in wages?
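One way to work the two wage examples is sketched below, using the unequal-variances standard error. The function name is hypothetical, and treating women as group 1 and men as group 2 is an arbitrary choice that only affects the sign of t. Because both examples have a total N under 122, the exact critical value is slightly above 1.96, so 1.96 is only a rough benchmark here.

```python
from math import sqrt

def t_unequal_variances(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, estimated standard error of the difference between two means)."""
    se_diff = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t = (mean2 - mean1) / se_diff
    return t, se_diff

# Example 1: women (group 1) vs. men (group 2), values from the slide.
t1, se1 = t_unequal_variances(10.29, 0.8766, 46, 10.06, 0.9051, 54)
# Example 2: women (group 1) vs. men (group 2), values from the slide.
t2, se2 = t_unequal_variances(9.68, 1.0550, 57, 10.32, 0.9461, 51)

for label, t in [("Example 1", t1), ("Example 2", t2)]:
    verdict = "exceeds" if abs(t) > 1.96 else "does not exceed"
    print(f"{label}: t = {t:.2f}, which {verdict} 1.96 in absolute value")
```

Under these assumptions the first difference is too small relative to its standard error to call significant, while the second is well beyond the benchmark.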

In-Class Exercise Using these GSS income data, calculate a t-test statistic to determine if the difference between the two group means is statistically significant.

Chapter 13: The Chi-Square Test Chi-Square as a Statistical Test Statistical Independence Hypothesis Testing with Chi-Square The Assumptions Stating the Research and Null Hypothesis Expected Frequencies Calculating Obtained Chi-Square Sampling Distribution of Chi-Square Determining the Degrees of Freedom Limitations of Chi-Square Test

Chi-Square as a Statistical Test Chi-square test: an inferential statistics technique designed to test for significant relationships between two variables organized in a bivariate table. Chi-square requires no assumptions about the shape of the population distribution from which a sample is drawn. It can be applied to nominally or ordinally measured variables.

Statistical Independence Independence (statistical): the absence of association between two cross-tabulated variables. The percentage distributions of the dependent variable within each category of the independent variable are identical.

Hypothesis Testing with Chi-Square Chi-square follows five steps: Making assumptions (random sampling) Stating the research and null hypotheses and selecting alpha Selecting the sampling distribution and specifying the test statistic Computing the test statistic Making a decision and interpreting the results

The Assumptions The chi-square test requires no assumptions about the shape of the population distribution from which the sample was drawn. However, like all inferential techniques it assumes random sampling. It can be applied to variables measured at a nominal and/or an ordinal level of measurement.

Stating Research and Null Hypotheses The research hypothesis (H1) proposes that the two variables are related in the population. The null hypothesis (H0) states that no association exists between the two cross-tabulated variables in the population, and therefore the variables are statistically independent.

H1: The two variables are related in the population. Gender and fear of walking alone at night are statistically dependent.
Afraid   Men      Women    Total
No       83.3%    57.2%    71.1%
Yes      16.7%    42.8%    28.9%
Total    100%     100%     100%

H0: There is no association between the two variables. Gender and fear of walking alone at night are statistically independent.
Afraid   Men      Women    Total
No       71.1%    71.1%    71.1%
Yes      28.9%    28.9%    28.9%
Total    100%     100%     100%

The Concept of Expected Frequencies Expected frequencies (fe): the cell frequencies that would be expected in a bivariate table if the two variables were statistically independent. Observed frequencies (fo): the cell frequencies actually observed in a bivariate table.

Calculating Expected Frequencies fe = (column marginal)(row marginal) / N To obtain the expected frequencies for any cell in any cross-tabulation in which the two variables are assumed independent, multiply the row and column totals for that cell and divide the product by the total number of cases in the table.

Chi-Square (obtained) The test statistic that summarizes the differences between the observed (fo) and the expected (fe) frequencies in a bivariate table.

Calculating the Obtained Chi-Square $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$ where fe = expected frequencies fo = observed frequencies
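Expected frequencies and the obtained chi-square can be computed together, as in this sketch. The function names are hypothetical, and so is the 2 x 2 table of counts: the slides report only percentages for the gender and fear example, so the cell counts below are invented purely for illustration.

```python
def expected_frequencies(observed):
    """fe for each cell = (row marginal)(column marginal) / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Sum of (fo - fe)^2 / fe over every cell of the table."""
    expected = expected_frequencies(observed)
    return sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )

# Hypothetical observed counts (rows: No / Yes; columns: two groups).
observed = [[50, 30],
            [10, 30]]

print(expected_frequencies(observed))
print(chi_square(observed))
```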

The Sampling Distribution of Chi-Square The sampling distribution of chi-square tells the probability of getting values of chi-square, assuming no relationship exists in the population. The chi-square sampling distributions depend on the degrees of freedom. The $\chi^2$ sampling distribution is not one distribution, but is a family of distributions.

The Sampling Distribution of Chi-Square The distributions are positively skewed. The research hypothesis for the chi-square is always a one-tailed test. Chi-square values are always positive. The minimum possible value is zero, with no upper limit to its maximum value. As the number of degrees of freedom increases, the $\chi^2$ distribution becomes more symmetrical.

Determining the Degrees of Freedom df = (r – 1)(c – 1) where r = the number of rows c = the number of columns

Calculating Degrees of Freedom How many degrees of freedom would a table with 3 rows and 2 columns have? (3 – 1)(2 – 1) = 2 2 degrees of freedom
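The degrees-of-freedom rule and the final decision step can be sketched as follows. The function names are hypothetical. The small lookup table holds standard chi-square critical values at alpha = .05 for 1 to 4 degrees of freedom only, so a full chi-square table is still needed for larger tables or other alpha levels.

```python
# Standard chi-square critical values at alpha = .05, keyed by degrees of freedom.
CRITICAL_CHI_SQUARE_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def degrees_of_freedom(rows, columns):
    """df = (r - 1)(c - 1) for a bivariate table."""
    return (rows - 1) * (columns - 1)

def reject_independence(chi_square_obtained, rows, columns):
    """Reject H0 (independence) when the obtained value exceeds the critical value."""
    df = degrees_of_freedom(rows, columns)
    return chi_square_obtained > CRITICAL_CHI_SQUARE_05[df]

print(degrees_of_freedom(3, 2))             # the 3-row, 2-column table from the slide
print(reject_independence(15.0, 2, 2))      # hypothetical obtained chi-square of 15.0
```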

Chapter 14: Analysis of Variance Understanding Analysis of Variance The Structure of Hypothesis Testing with ANOVA Decomposition of SST Assessing the Relationship Between Variables SPSS Applications Reading the Research Literature

ANOVA Analysis of Variance (ANOVA) - An inferential statistics technique designed to test for significant relationship between two variables in two or more samples. The logic is the same as in t-tests, just extended to independent variables with two or more samples.

Understanding Analysis of Variance One-way ANOVA – An analysis of variance procedure using one dependent and one independent variable. ANOVAs examine the differences between samples, as well as the differences within a single sample.

The Structure of Hypothesis Testing with ANOVA Assumptions: (1) Independent random samples are used. Our choice of sample members from one population has no effect on the choice of members from subsequent populations. (2) The dependent variable is measured at the interval-ratio level. Some researchers, however, do apply ANOVA to ordinal level measurements.

The Structure of Hypothesis Testing with ANOVA Assumptions: (3) The population is normally distributed. Though we generally cannot confirm whether the populations are normal, we must assume that the population is normally distributed in order to continue with the analysis. (4) The population variances are equal.

Stating the Research and Null Hypotheses H1: At least one mean is different from the others. H0: μ1 = μ2 = μ3 = μ4

The Structure of Hypothesis Testing with ANOVA Between-Group Sum of Squares $SSB = \sum N_k (\bar{Y}_k - \bar{Y})^2$ This tells us the differences between the groups. $N_k$ = the number of cases in a sample (k represents the number of different samples) $\bar{Y}_k$ = the mean of a sample $\bar{Y}$ = the overall mean

The Structure of Hypothesis Testing with ANOVA Within-Group Sum of Squares $SSW = \sum (Y_i - \bar{Y}_k)^2$ This tells us the variations within our groups; it also tells us the amount of unexplained variance. $N_k$ = the number of cases in a sample (k represents the number of different samples) $\bar{Y}_k$ = the mean of a sample $Y_i$ = each individual score in a sample

Alternative Formula for Calculating the Within-Group Sum of Squares $SSW = \sum Y_i^2 - \sum \frac{(\sum Y_k)^2}{N_k}$ where $Y_i^2$ = the squared scores from each sample, $\sum Y_k$ = the sum of the scores of each sample, and $N_k$ = the total of each sample

The Structure of Hypothesis Testing with ANOVA Total Sum of Squares $SST = \sum (Y_i - \bar{Y})^2 = SSB + SSW$ $N_k$ = the number of cases in a sample (k represents the number of different samples) $Y_i$ = each individual score $\bar{Y}$ = the overall mean

The Structure of Hypothesis Testing with ANOVA Mean Square Between An estimate of the between-group variance obtained by dividing the between-group sum of squares by its degrees of freedom. Mean square between = SSB/dfb where dfb = degrees of freedom between dfb = k – 1 k = number of categories

The Structure of Hypothesis Testing with ANOVA Mean Square Within An estimate of the within-group variance obtained by dividing the within-group sum of squares by its degrees of freedom. Mean square within = SSW/dfw where dfw = degrees of freedom within dfw = N – k N = total number of cases k = number of categories

The F Statistic The ratio of between-group variance to within-group variance: $F = \frac{\text{Mean square between}}{\text{Mean square within}} = \frac{SSB/df_b}{SSW/df_w}$

Definitions F ratio (F statistic) – Used in an analysis of variance, the F statistic represents the ratio of between-group variance to within-group variance F obtained – The test statistic computed by the ratio for between-group to within-group variance. F critical – The F score associated with a particular alpha level and degrees of freedom. This F score marks the beginning of the region of rejection for our null hypothesis.
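The pieces of a one-way ANOVA (SSB, SSW, SST, the two mean squares, and the obtained F) can be assembled in a short sketch. The function name and the three small groups of scores are hypothetical. The obtained F would still need to be compared with the critical F for the chosen alpha, dfb, and dfw from an F table.

```python
from statistics import fmean

def one_way_anova(groups):
    """Return the sums of squares, degrees of freedom, and obtained F."""
    scores = [y for group in groups for y in group]
    grand_mean = fmean(scores)
    # Between-group: each sample mean's squared distance from the overall mean,
    # weighted by the number of cases in that sample.
    ssb = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group: each score's squared distance from its own sample mean.
    ssw = sum((y - fmean(g)) ** 2 for g in groups for y in g)
    # Total: each score's squared distance from the overall mean.
    sst = sum((y - grand_mean) ** 2 for y in scores)
    dfb = len(groups) - 1              # k - 1
    dfw = len(scores) - len(groups)    # N - k
    f_obtained = (ssb / dfb) / (ssw / dfw)
    return {"SSB": ssb, "SSW": ssw, "SST": sst, "dfb": dfb, "dfw": dfw, "F": f_obtained}

# Hypothetical scores for three samples.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

In the output, SST should equal SSB plus SSW, which is the decomposition the slides describe.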

[Table: F distribution, giving critical values of F by alpha, dfb, and dfw]

Example: Obtained vs. Critical F Since the obtained F is beyond the critical F value, we reject the Null hypothesis of no difference

SPSS Example: Bush’s Job Approval

SPSS Example: Clinton’s Job Approval

Reading the Research Literature

Reading the Research Literature