Chapter 1: The What and the Why of Statistics


1 Chapter 1: The What and the Why of Statistics
The Research Process Asking a Research Question The Role of Theory Formulating the Hypotheses Independent & Dependent Variables: Causality Independent & Dependent Variables: Guidelines Collecting Data Levels of Measurement Discrete and Continuous Variables Analyzing Data & Evaluating Hypotheses Descriptive and Inferential Statistics Looking at Social Differences

2 The Research Process
Examine a social relationship, study the relevant literature → Asking the Research Question → Formulating the Hypotheses → Develop a research design → Collecting Data → Analyzing Data → Evaluating the Hypotheses → Contribute new evidence to literature and begin again. THEORY sits at the center of the cycle and is linked to every stage, including Analyzing Data.

3 Asking a Research Question
What is empirical research? Research based on information that can be verified by using our direct experience. To answer research questions we cannot rely on reasoning, speculation, moral judgment, or subjective preference. Empirical: “Are women paid less than men for the same types of work?” Not empirical: “Is racial equality good for society?”

4 The Role of Theory A theory is an explanation of the relationship between two or more observable attributes of individuals or groups. Social scientists use theory to attempt to establish a link between what we observe (the data) and our understanding of why certain phenomena are related to each other in a particular way.

5 Formulating the Hypotheses
Tentative answers to research questions (subject to empirical verification) A statement of a relationship between characteristics that vary (variables) Variable: A property of people or objects that takes on two or more values Must include categories that are both exhaustive and mutually exclusive

6 Units of Analysis The level of social life on which social scientists focus (individuals, groups). Examples: Individual as unit of analysis: What are your political views? Family as unit of analysis: Who does the housework? Organization as unit of analysis: What is the gender composition? City as unit of analysis: What was the crime rate last year?

7 Types of Variables
Dependent variable (DV): the variable to be explained (the “effect”). Independent variable (IV): the variable expected to account for (the “cause” of) the dependent variable. IV → DV

8 Cause and Effect Relationships
Cause and effect relationships between variables are not easy to infer in the social sciences. Causal relationships must meet three criteria:
- The cause has to precede the effect in time.
- There has to be an empirical relationship between the cause and the effect.
- This relationship cannot be explained by other factors.

9 Guidelines for Independent and Dependent Variables
The dependent variable is always the property you are trying to explain; it is always the object of the research. The independent variable usually occurs earlier in time than the dependent variable. The independent variable is often seen as influencing, directly or indirectly, the dependent variable.

10 Example 1
Hypothesis: People who attend church regularly are more likely to oppose abortion than people who do not attend church regularly.
Identify the IV and DV:
- Independent variable: church attendance
- Dependent variable: attitudes toward abortion
Identify possible control variables:
- Gender
- Age
- Religious affiliation (Catholic, Baptist, Islamic…)
- Political party identification
Are the causal arguments sound? For example, does party identification affect abortion views, or vice versa?

11 Example 2
Hypothesis: The number of books read to a child per day positively affects a child’s word recognition.
Identify the IV and DV:
- Independent variable: number of books read
- Dependent variable: word recognition
Identify possible control variables:
- Gender
- Older siblings
- Health status
- Birth order
Are the causal arguments sound? Most likely. It is hard to construct an argument in which a 36-month-old child affects the number of books his or her parent reads to him or her.

12 Collecting Data THEORY Collecting Data
Examine a social relationship, study the relevant literature Ask the Research Question Formulating the Hypotheses Contribute new evidence to literature and begin again Develop a research design THEORY Evaluating the Hypotheses Analyzing Data Collecting Data

13 Collecting Data Researchers must decide three things:
How to measure the variables of interest How to select the cases for the research What kind of data collection techniques to use

14 Levels of Measurement
Not every statistical operation can be used with every variable. The type of statistical operations we employ will depend on how our variables are measured.
- Nominal: means “in name only.” Also known as categorical or qualitative. Examples to ask students for: gender, religion, type of company (manufacturing, retail, health services, etc.)
- Ordinal: e.g., attitudinal variables (views on abortion)
- Interval-ratio: we can ask how much more of X (temperature, income, test scores)

15 Nominal Level of Measurement
Numbers or other symbols are assigned to a set of categories for the purpose of naming, labeling, or classifying the observations. Examples: Political Party (Democrat, Republican) Religion (Catholic, Jewish, Muslim, Protestant) Race (African American, Latino, Native American)

16 Ordinal Level of Measurement
Nominal variables that can be ranked from low to high. Example: Social Class Upper Class Middle Class Working Class

17 Interval-Ratio Level of Measurement
Variables where measurements for all cases are expressed in the same units. (Variables with a natural zero point, such as height and weight, are called ratio variables.) Examples: Age Income SAT scores

18 Cumulative Property of Levels of Measurement
Variables that can be measured at the interval-ratio level of measurement can also be measured at the ordinal and nominal levels. However, variables that are measured at the nominal and ordinal levels cannot be measured at higher levels.

Level            Different or Equivalent   Higher or Lower   How Much Higher
Nominal          Yes                       No                No
Ordinal          Yes                       Yes               No
Interval-ratio   Yes                       Yes               Yes

19 Cumulative Property of Levels of Measurement
There is one exception: dichotomous variables. Because there are only two possible values for a dichotomy (e.g., gender), we can treat it as measured at the ordinal or the interval-ratio level. There is no way to get the two categories out of order. This gives the dichotomy more power than other nominal-level variables.

20 Discrete and Continuous Variables
Discrete variables: variables that have a minimum-sized unit of measurement, which cannot be subdivided. Example: the number of children per family. Continuous variables: variables that, in theory, can take on all possible numerical values in a given interval. Example: length.

21 Analyzing Data: Descriptive and Inferential Statistics
Population: The total set of individuals, objects, groups, or events in which the researcher is interested. Sample: A relatively small subset selected from a population. Descriptive statistics: Procedures that help us organize and describe data collected from either a sample or a population. Inferential statistics: The logic and procedures concerned with making predictions or inferences about a population from observations and analyses of a sample.

22 Analyze Data & Evaluate Hypotheses
Examine a social relationship, study the relevant literature Asking the Research Question Formulating the Hypotheses Contribute new evidence to literature and begin again Develop a research design THEORY Evaluating the Hypotheses Analyzing Data Collecting Data

23 Begin the Process Again...
Examine a social relationship, study the relevant literature Asking the Research Question Formulating the Hypotheses Contribute new evidence to literature and begin again Develop a research design THEORY Evaluating the Hypotheses Analyzing Data Collecting Data

24 Chapter 2: Organization of Information: Frequency Distributions
Proportions and Percentages Percentage Distributions Comparisons The Construction of Frequency Distributions Frequency Distributions for Nominal Variables Frequency Distributions for Ordinal Variables Frequency Distributions for Interval-Ratio Variables Cumulative Distributions Rates Reading the Research Literature Basic Principles Tables with a Different Format

25 Frequency Distributions
A table reporting the number of observations falling into each category of the variable.

Identity                               Frequency (f)
Native American                        ,500
Native American of multiple ancestry   ,700
Native American of Indian descent      5,537,600
Total (N)                              6,754,800

26 Death Penalty Statutes
In 1993, 36 states and Washington, D.C. had statutes permitting capital punishment. Of these 36 states, 27 set a minimum age for execution. Assume you are a member of a legal reform group that is trying to get the states that do not have a minimum age for execution to change their laws. You want to prepare a report describing the minimum age for execution in the 27 states that have an established minimum age for execution. (The data are on the following slides.)

27 Death Penalty Statutes
Source: Kathleen Maguire and Ann L. Pastore, eds., Sourcebook of Criminal Justice Statistics U.S. Department of Justice, Bureau of Justice Statistics. Washington, D.C.: U.S. Government Printing Office, 1995, pp

28 Creating a Frequency Distribution
Minimum Age   Tally          Frequency
14            |              1
15            |              1
16            |||||||||      9
17            ||||           4
18            ||||||||||||   12
Total N                      27

29 Creating a Frequency Distribution
Minimum Age   Frequency
14            1
15            1
16            9
17            4
18            12
Total N       27
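The tally step can also be done programmatically. The sketch below assumes the 27 minimum ages are available as a flat list (rebuilt here from the slide's frequencies, because the state-by-state values are not reproduced in this transcript) and counts them with Python's `collections.Counter`. The variable names are illustrative.

```python
from collections import Counter

# Minimum ages for the 27 states, rebuilt from the slide's frequencies.
# The flat-list layout is an assumption about how the raw data would look.
minimum_ages = [14] * 1 + [15] * 1 + [16] * 9 + [17] * 4 + [18] * 12

# Counter tallies how many times each value occurs.
frequencies = Counter(minimum_ages)

print("Minimum Age  Frequency")
for age in sorted(frequencies):
    print(f"{age:<12} {frequencies[age]}")
print(f"Total (N)    {sum(frequencies.values())}")
```

Sorting the keys before printing keeps the categories in rank order, which matters for ordinal and interval-ratio variables.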

30 Proportions and Percentages
Proportion (P): a relative frequency obtained by dividing the frequency in each category by the total number of cases. Percentage (%): a relative frequency obtained by dividing the frequency in each category by the total number of cases and multiplying by 100. N: total number of cases Proportions and percentages are relative frequencies
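As a minimal sketch of these two definitions, the function below divides each frequency by N to get a proportion and multiplies by 100 to get a percentage. It uses the minimum-age frequencies from the death penalty example; the function and variable names are hypothetical.

```python
def relative_frequencies(freqs):
    """Return {category: (proportion, percentage)} for a frequency table."""
    n = sum(freqs.values())  # N: total number of cases
    return {cat: (f / n, 100 * f / n) for cat, f in freqs.items()}

# Frequencies of minimum age for execution in the 27 states.
min_age_freqs = {14: 1, 15: 1, 16: 9, 17: 4, 18: 12}

rel = relative_frequencies(min_age_freqs)
for age, (proportion, percentage) in rel.items():
    print(f"{age}: P = {proportion:.3f}, % = {percentage:.1f}")
```

Unrounded proportions sum to 1; rounded values printed in a table may sum to slightly less or more.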

31 Proportions and Percentages
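Minimum Age   Frequency   Proportion      Percentage
14            1           1/27 = .037     3.7
15            1           1/27 = .037     3.7
16            9           9/27 = .333     33.3
17            4           4/27 = .148     14.8
18            12          12/27 = .444    44.4
Total N       27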

32 Percentage Distributions
A table showing the percentage of observations falling into each category of the variable.

Minimum Age   Frequency   Percentage
14            1           3.7
15            1           3.7
16            9           33.3
17            4           14.8
18            12          44.4
Total N       27

33 Frequency Distributions for Nominal Variables
Gender Tallies Freq. (f) Percentage Male ||||||||||||||| Female ||||||||||||||||||||||||| Total (N) Note: The categories for nominal variables (male, female) need not be listed in any particular order.

34 Frequency Distributions for Ordinal Variables
Happiness Tallies Freq. (f) Percentage Very Happy ||||||||| Pretty Happy ||||||||||||||||||||||||| Not too happy |||||| Total (N) Note: Because the categories or values of ordinal variables are rank- ordered, they must be listed in a way that reflects their rank – from the lowest to the highest or from the highest to the lowest.

35 Employment Status Example

36 Employment Status Example

37 Frequency Distributions for Interval-Ratio Variables
Number of Children Freq. (f) Percentage 7 or more Total (N)

38 Cumulative Distributions
Sometimes we are interested in locating the relative position of a given score in a distribution. Cumulative frequency distribution: a distribution showing the frequency at or below each category (class interval or score) of the variable. Cumulative percentage distribution: a distribution showing the percentage at or below each category (class interval or score) of the variable.
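A short sketch of both cumulative distributions, using the minimum-age frequencies from the death penalty example. It relies on `itertools.accumulate` for the running totals; the variable names are illustrative, and the percentages are computed from unrounded values, so they may differ slightly from a table built by adding rounded percentages.

```python
from itertools import accumulate

ages = [14, 15, 16, 17, 18]   # categories in rank order
freqs = [1, 1, 9, 4, 12]      # frequency of each category
n = sum(freqs)

# Frequency at or below each category.
cum_freqs = list(accumulate(freqs))
# Percentage at or below each category.
cum_pcts = [100 * cf / n for cf in cum_freqs]

for age, cf, cp in zip(ages, cum_freqs, cum_pcts):
    print(f"{age}: cumulative f = {cf}, cumulative % = {cp:.1f}")
```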

39 Cumulative Frequency Distribution
Minimum Age   Freq. (f)   Percentage   Cumulative Frequency
14            1           3.7          1
15            1           3.7          2
16            9           33.3         11
17            4           14.8         15
18            12          44.4         27
Total (N)     27          99.9*
* Doesn’t total to 100% due to rounding

40 Cumulative Percentage Distribution
Minimum Cumulative Age Frequency Percentage Percentage * Total N * Does not total to 100% due to rounding

41 Rates
Rate: a number obtained by dividing the number of actual occurrences in a given time period by the number of possible occurrences. What’s the problem with the “rate” computation below?
Marriage rate, 1990 = (Number of marriages in 1990) / (Total population in 1990)
Marriage rate, 1990 = 2,448,000 marriages / 250,000,000 Americans
Marriage rate, 1990 = .0098
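The arithmetic on this slide can be checked with a few lines of Python. The per-1,000 scaling is an added convention, not something the slide states. One likely answer to the slide's question, offered here as an assumption, is that the total population is not the number of possible occurrences, because it includes people who cannot marry (children, people already married).

```python
marriages_1990 = 2_448_000
population_1990 = 250_000_000

# Crude rate: actual occurrences divided by the total population.
crude_rate = marriages_1990 / population_1990

# Rates are often scaled to a base such as 1,000 for readability.
per_1000 = crude_rate * 1000

print(f"Crude marriage rate: {crude_rate:.4f}")
print(f"Marriages per 1,000 population: {per_1000:.1f}")
```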

42 Reading Statistical Tables
Basic principles for understanding what the researcher is trying to tell you: What is the source of the table? How many variables are presented? What are their names? What is represented by the numbers presented in the first column? In the second column?

43 Chapter 3: Graphic Presentation
The Pie Chart The Bar Graph The Statistical Map The Histogram Statistics in Practice The Frequency Polygon Time Series Charts Distortions in Graphs It is important to choose the appropriate graphs to make statistical information coherent.

44 The Pie Chart: The Race and Ethnicity of the Elderly
Pie chart: a graph showing the differences in frequencies or percentages among categories of a nominal or an ordinal variable. The categories are displayed as segments of a circle whose pieces add up to 100 percent of the total frequencies.

45 Too many categories can be messy!
Figure 3.1 Annual Estimates of U.S. Population 65 Years and Over by Race, 2003 (N = 35,919,174). Pie slices: 2.8%, .8%, .6%, .5%, 8.3%, 87.7%.

46 We can reduce some of the categories
Figure 3.2 Annual Estimates of U.S. Population 65 Years and Over, 2003 (N = 35,919,174). Pie slices: 4%, 8.3%, 87.7%.

47 The Bar Graph: The Living Arrangements and Labor Force Participation of the Elderly
Bar graph: a graph showing the differences in frequencies or percentages among categories of a nominal or an ordinal variable. The categories are displayed as rectangles of equal width with their height proportional to the frequency or percentage of the category.

48 N=13,886,000 Figure 3.3 Living Arrangements of Males (65 and Older) in the United States, 2000

49 Can display more info by splitting sex
Figure 3.4 Living Arrangement of U.S. Elderly (65 and Older) by Gender, 2003

50 Figure 3.5 Percent of Men and Women 55 Years and Over in the Civilian Labor Force, 2002

51 The Statistical Map: The Geographic Distribution of the Elderly
We can display dramatic geographical changes in American society by using a statistical map. Maps are especially useful for describing geographical variations in variables, such as population distribution, voting patterns, crime rates, or labor force participation.

52

53

54 The Histogram Histogram: a graph showing the differences in frequencies or percentages among categories of an interval-ratio variable. The categories are displayed as contiguous bars, with width proportional to the width of the category and height proportional to the frequency or percentage of that category.

55 Figure 3.7 Age Distribution of U.S. Population 65 Years and Over, 2000

56 The following two slides are applications of the histogram
The following two slides are applications of the histogram. They examine, by gender, age distribution patterns in the U.S. population for 1955 and 2010 (projected). Notice that in both figures, age groups are arranged along the vertical axis, whereas the frequencies (in millions of people) are along the horizontal axis. Each age group is classified by males on the left and females on the right. Because this type of histogram reflects age distribution by gender, it is also called an age-sex pyramid.

57

58 The Frequency Polygon Frequency polygon: a graph showing the differences in frequencies or percentages among categories of an interval-ratio variable. Points representing the frequencies of each category are placed above the midpoint of the category and are joined by a straight line.

59 Figure: Population of Japan, Age 55 and Over, 2000, 2010, and 2020
Source: Adapted from U.S. Bureau of the Census, Center for International Research, International Data Base, 2003.

60 Time Series Charts Time series chart: a graph displaying changes in a variable at different points in time. It shows time (measured in units such as years or months) on the horizontal axis and the frequencies (percentages or rates) of another variable on the vertical axis.

61 Source: Federal Interagency Forum on Aging Related Statistics, Older Americans 2004: Key Indicators of Well Being, 2004. Figure Percentage of Total U. S. Population 65 Years and Over, 1900 to 2050

62 Source: U.S. Bureau of the Census, “65+ in America,” Current Population Reports,
1996, Special Studies, P23-190, Table 6-1. Figure Percentage Currently Divorced Among U.S. Population 65 Years and Over, by Gender, 1960 to 2040

63 Distortions in Graphs Graphs not only quickly inform us; they can quickly deceive us. Because we are often more interested in general impressions than in detailed analyses of the numbers, we are more vulnerable to being swayed by distorted graphs. What are graphical distortions? How can we recognize them?

64 Shrinking and Stretching the Axes: Visual Confusion
Probably the most common distortions in graphical representations occur when the distance along the vertical or horizontal axis is altered in relation to the other axis. Axes can be stretched or shrunk to create any desired result.

65 Shrinking and Stretching the Axes: Visual Confusion

66 Distortions with Picture Graphs
Another way to distort data with graphs is to use pictures to represent quantitative information. The problem with picture graphs is that the visual impression received is created by the picture’s total area rather than by its height (the graphs we have discussed so far rely heavily on height).

67 Statistics in Practice
The following graphs are particularly suitable for making comparisons among groups: - Bar chart - Frequency polygon - Time series chart

68 Source: Smith, 2003. This bar chart compares elderly males and females who live alone by age, gender, and race or Hispanic origin. It shows that the percentage of elderly who live alone varies not only by age but also by both race and gender. Figure 3.17 Percentage of College Graduates among People 55 Years and Over by Age and Sex, 2002

69 Source: Stoops, Nicole. 2004. “Educational Attainment in the United States: 2003.”
Current Population Reports, P Washington D.C.: U.S. Government Printing Office. This frequency polygon compares years of school completed by black Americans age 25 to 64 and 65 years and older with that of all Americans in the same age groups. Figure 3.18 Years of School Completed in the United States by Race and Age, 2003

70 Why use charts and graphs?
What do you lose?
- The ability to examine the numeric detail offered by a table
- Potentially, the ability to see additional relationships within the data
- Potentially, time: we often get caught up in selecting colors and formatting charts when a simply formatted table is sufficient
What do you gain?
- The ability to direct readers’ attention to one aspect of the evidence
- The ability to reach readers who might otherwise be intimidated by the same data in a tabular format
- The ability to focus on the bigger picture rather than on perhaps minor technical details
In-class exercise: students pair up, construct a chart based on a table from the text or handed out in class, and then answer the two questions above.

71 Chapter 4: Measures of Central Tendency
What is a measure of central tendency? Measures of Central Tendency Mode Median Mean Shape of the Distribution Considerations for Choosing an Appropriate Measure of Central Tendency

72 What is a measure of Central Tendency?
Numbers that describe what is average or typical of the distribution You can think of this value as where the middle of a distribution lies.

73 The Mode The category or score with the largest frequency (or percentage) in the distribution. The mode can be calculated for variables with levels of measurement that are: nominal, ordinal, or interval-ratio.

74 The Mode: An Example
Example: Number of Votes for Candidates for Mayor. The mode, in this case, gives you the “central” response of the voters: the most popular candidate.
- Candidate A: 11,769 votes
- Candidate B: 39,443 votes
- Candidate C: 78,331 votes
The mode: “Candidate C”
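Finding the mode amounts to picking the category with the largest frequency. A minimal sketch with the vote counts from this slide (the dictionary layout is an illustrative choice):

```python
votes = {
    "Candidate A": 11_769,
    "Candidate B": 39_443,
    "Candidate C": 78_331,
}

# The mode is the category (not the count) with the largest frequency.
mode = max(votes, key=votes.get)
print(f"Mode: {mode} ({votes[mode]:,} votes)")
```

Note that `max` returns only one category; a distribution with tied largest frequencies would need a check for multiple modes.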

75 The Median The score that divides the distribution into two equal parts, so that half the cases are above it and half below it. The median is the middle score, or average of middle scores in a distribution.

76 Median Exercise #1 (N is odd)
Calculate the median for this hypothetical distribution:

Job Satisfaction   Frequency
Very High          2
High               3
Moderate           5
Low                7
Very Low           4
TOTAL              21

77 Median Exercise #2 (N is even)
Calculate the median for this hypothetical distribution:

Satisfaction with Health   Frequency
Very High                  5
High                       7
Moderate                   6
Low                        7
Very Low                   3
TOTAL                      28
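Both exercises can be worked by locating the middle case (odd N) or the two middle cases (even N) in the cumulative frequencies. The sketch below is one way to do that for ordered categories; the function name and the list-of-pairs layout are assumptions made for illustration.

```python
def median_categories(ordered_freqs):
    """ordered_freqs: list of (category, frequency) pairs in rank order.

    Returns a list with the category holding the middle case (odd N)
    or the categories holding the two middle cases (even N).
    """
    n = sum(f for _, f in ordered_freqs)
    # 1-based positions of the middle case(s).
    middle = [(n + 1) // 2] if n % 2 else [n // 2, n // 2 + 1]
    found = []
    for position in middle:
        cumulative = 0
        for category, f in ordered_freqs:
            cumulative += f
            if position <= cumulative:
                found.append(category)
                break
    return found

job_satisfaction = [("Very High", 2), ("High", 3), ("Moderate", 5),
                    ("Low", 7), ("Very Low", 4)]
health_satisfaction = [("Very High", 5), ("High", 7), ("Moderate", 6),
                       ("Low", 7), ("Very Low", 3)]

print(median_categories(job_satisfaction))     # middle case is the 11th of 21
print(median_categories(health_satisfaction))  # middle cases are the 14th and 15th of 28
```

When the two middle cases of an even-N ordinal distribution fall in the same category, that category is the median; if they fall in different categories, the median lies between them.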

78 Finding the Median in Grouped Data

79 Percentiles A score below which a specific percentage of the distribution falls. Finding percentiles in grouped data:

80 The Mean The arithmetic average obtained by adding up all the scores and dividing by the total number of scores.

81 Formula for the Mean
$\bar{Y} = \frac{\sum Y}{N}$
“Y bar” equals the sum of all the scores, Y, divided by the number of scores, N.

82 Calculating the mean with grouped scores
$\bar{Y} = \frac{\sum fY}{N}$
where: $fY$ = a score multiplied by its frequency

83 Mean: Grouped Scores

84 Mean: Grouped Scores

85 Grouped Data: the Mean & Median
Calculate the median and mean for the grouped frequency distribution below.
Number of People Age 18 or Older Living in a U.S. Household in 1996 (GSS 1996)

Number of People   Frequency
1                  190
2                  316
3                  54
4                  17
5                  2
6                  2
TOTAL              581
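A sketch of this exercise in Python, under the assumption that the frequency table is stored as a dictionary of score to frequency. The mean uses the grouped-score approach (each score multiplied by its frequency, summed, divided by N); the median is the score of the middle case, which is straightforward here because N is odd.

```python
# Number of adults in the household -> frequency (GSS 1996, from the slide).
household = {1: 190, 2: 316, 3: 54, 4: 17, 5: 2, 6: 2}

n = sum(household.values())

# Grouped mean: sum of (score * frequency) divided by N.
mean = sum(score * f for score, f in household.items()) / n

# Median: walk the cumulative frequencies until the middle case is reached.
middle_case = (n + 1) / 2  # N is odd, so there is a single middle case
cumulative = 0
median = None
for score in sorted(household):
    cumulative += household[score]
    if cumulative >= middle_case:
        median = score
        break

print(f"N = {n}, mean = {mean:.2f}, median = {median}")
```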

86 Shape of the Distribution
- Symmetrical (mean is about equal to median)
- Skewed
  - Negatively (example: years of education): mean < median
  - Positively (example: income): mean > median
- Bimodal (two distinct modes)
- Multi-modal (more than two distinct modes)
Draw examples on the board.

87 Distribution Shape

88 Considerations for Choosing a Measure of Central Tendency
For a nominal variable, the mode is the only measure that can be used. For ordinal variables, the mode and the median may be used. The median provides more information (taking into account the ranking of categories.) For interval-ratio variables, the mode, median, and mean may all be calculated. The mean provides the most information about the distribution, but the median is preferred if the distribution is skewed.

89 Central Tendency

90 Chapter 5: Measures of Variability
The Importance of Measuring Variability The Range IQR (Inter-Quartile Range) Variance Standard Deviation Considerations for choosing a measure of variation

91 The Importance of Measuring Variability
Central tendency - Numbers that describe what is typical or average (central) in a distribution Measures of Variability - Numbers that describe diversity or variability in the distribution. These two types of measures together help us to sum up a distribution of scores without looking at each and every score. Measures of central tendency tell you about typical (or central) scores. Measures of variation reveal how far from the typical or central score that the distribution tends to vary.

92 Notice that both distributions have the same mean, yet they are shaped differently

93 The Range Range = highest score - lowest score
Range – A measure of variation in interval-ratio variables. It is the difference between the highest (maximum) and the lowest (minimum) scores in the distribution. Range is a good thing to look at to make sure your data are as you expect them to be.

94 Inter-Quartile Range Inter-Quartile Range (IQR): a measure of variation for interval-ratio data. It indicates the width of the middle 50 percent of the distribution and is defined as the difference between the lower and upper quartiles (Q1 and Q3). IQR = Q3 – Q1
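The range and the IQR can be sketched with Python's standard `statistics` module. The scores below are hypothetical, and quartile conventions differ between textbooks and software; `method="inclusive"` is just one of them, so hand calculations from the text may not match exactly on other data.

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical interval-ratio scores

# Range = highest score - lowest score.
value_range = max(scores) - min(scores)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print(f"Range = {value_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```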

95 The difference between the Range and IQR
[Figure: two distributions with equal ranges. In one, the values fall together closely; the other shows greater variability. The IQR captures this difference, yet the ranges are equal.]

96 The Box Plot The box plot is a graphic device that visually presents the following elements: the range, the IQR, the median, the quartiles, the minimum (lowest value), and the maximum (highest value).

97 Variance Variance – A measure of variation for interval-ratio variables; it is the average of the squared deviations from the mean

98 Standard Deviation Standard Deviation – A measure of variation for interval-ratio variables; it is equal to the square root of the variance.
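The definitions of variance and standard deviation can be followed step by step in code. The scores are hypothetical. The slide describes variance as the average of squared deviations; many textbooks divide by N − 1 for sample data, so the sketch shows both divisors and labels the choice as an assumption.

```python
import math

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical interval-ratio scores
n = len(scores)
mean = sum(scores) / n

# Squared deviation of each score from the mean.
squared_devs = [(y - mean) ** 2 for y in scores]

# Average of squared deviations (divide by N).
variance_n = sum(squared_devs) / n
# Sample version (divide by N - 1), common when the data are a sample.
variance_n_minus_1 = sum(squared_devs) / (n - 1)

# Standard deviation is the square root of the variance.
sd_n = math.sqrt(variance_n)
sd_n_minus_1 = math.sqrt(variance_n_minus_1)

print(f"mean = {mean}, variance (N) = {variance_n}, SD (N) = {sd_n}")
print(f"variance (N-1) = {variance_n_minus_1:.3f}, SD (N-1) = {sd_n_minus_1:.3f}")
```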

99 Find the Mean and the Standard Deviation

100 Considerations for Choosing a Measure of Variability
For nominal variables, you can only use the IQV (Index of Qualitative Variation). For ordinal variables, you can calculate the IQV or the IQR (Inter-Quartile Range), though the IQR provides more information about the variable. For interval-ratio variables, you can use the IQV, the IQR, or the variance/standard deviation. The standard deviation (and the variance) provides the most information, since it uses all of the values in the distribution in its calculation.

101 Chapter 6: Relationships Between Two Variables: Cross-Tabulation
Independent and Dependent Variables Constructing a Bivariate Table Computing Percentages in a Bivariate Table Dealing with Ambiguous Relationships Between Variables Reading the Research Literature Properties of a Bivariate Relationship Elaboration Statistics in Practice

102 Introduction Bivariate Analysis: A statistical method designed to detect and describe the relationship between two variables. Cross-Tabulation: A technique for analyzing the relationship between two variables that have been organized in a table.

103 Understanding Independent and Dependent Variables
Example: If we hypothesize that English proficiency varies by whether a person is native born or foreign born, what is the independent variable, and what is the dependent variable? Independent: nativity. Dependent: English proficiency.

104 Constructing a Bivariate Table
Bivariate table: A table that displays the distribution of one variable across the categories of another variable. Column variable: A variable whose categories are the columns of a bivariate table. Row variable: A variable whose categories are the rows of a bivariate table. Cell: The intersection of a row and a column in a bivariate table. Marginals: The row and column totals in a bivariate table.

105

106

107

108 Percentages Can Be Computed in Different Ways:
Column Percentages: column totals as base Row Percentages: row totals as base
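The two ways of percentaging differ only in the base used for division. The sketch below uses a made-up 2×2 table (the counts and the category labels "Group A" and "Group B" are hypothetical, not the abortion-by-job-security data) to show both.

```python
# Hypothetical bivariate table: rows are DV categories, columns are IV categories.
table = {
    "Yes": {"Group A": 30, "Group B": 20},
    "No":  {"Group A": 10, "Group B": 40},
}
columns = ["Group A", "Group B"]

# Marginals: column totals and row totals.
col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
row_totals = {r: sum(cells.values()) for r, cells in table.items()}

# Column percentages: each cell divided by its column total.
col_pct = {r: {c: 100 * table[r][c] / col_totals[c] for c in columns} for r in table}
# Row percentages: each cell divided by its row total.
row_pct = {r: {c: 100 * table[r][c] / row_totals[r] for c in columns} for r in table}

print("Column %:", col_pct)
print("Row %:   ", row_pct)
```

When the independent variable is the column variable, column percentages let you compare the dependent variable's distribution across categories of the independent variable.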

109 Support for Abortion by Job Security
Absolute Frequencies Support for Abortion by Job Security Abortion Job Find Easy Job Find Not Easy Row Total Yes No Column Total

110 Support for Abortion by Job Security
Column Percentages Support for Abortion by Job Security Abortion Job Find Easy Job Find Not Easy Row Total Yes % % 52% No % % % Column Total % % % (44) (51) (95)

111 Support for Abortion by Job Security
Row Percentages Support for Abortion by Job Security Abortion Job Find Easy Job Find Not Easy Row Total Yes % % 100% (49) No % % % (46) Column Total % % % (95)

112 Properties of a Bivariate Relationship
Does there appear to be a relationship? How strong is it? What is the direction of the relationship?

113 Existence of a Relationship
IV: Number of Traumas DV: Support for Abortion If the number of traumas were unrelated to attitudes toward abortion among women, then we would expect to find equal percentages of women who are pro-choice (or anti-choice), regardless of the number of traumas experienced.

114 Existence of the Relationship

115 Determining the Strength of the Relationship
A quick method is to examine the percentage difference across the different categories of the independent variable. The larger the percentage difference across the categories, the stronger the association. We rarely see a situation with either a 0 percent or a 100 percent difference.

116 Direction of the Relationship
Positive relationship: A bivariate relationship between two variables measured at the ordinal level or higher in which the variables vary in the same direction. Negative relationship: A bivariate relationship between two variables measured at the ordinal level or higher in which the variables vary in opposite directions.

117 A Positive Relationship

118 A Negative Relationship

119 Elaboration Elaboration is a process designed to further explore a bivariate relationship; it involves the introduction of control variables. A control variable is an additional variable considered in a bivariate relationship. The variable is controlled for when we take into account its effect on the variables in the bivariate relationship.

120 Three Goals of Elaboration
Elaboration allows us to test for non-spuriousness. Elaboration clarifies the causal sequence of bivariate relationships by introducing variables hypothesized to intervene between the IV and DV. Elaboration specifies the different conditions under which the original bivariate relationship might hold.

121 Testing for Nonspuriousness
Direct causal relationship: a bivariate relationship that cannot be accounted for by other theoretically relevant variables. Spurious relationship: a relationship in which both the IV and DV are influenced by a causally prior control variable and there is no causal link between them. The relationship between the IV and DV is said to be “explained away” by the control variable.

122 Number of Firefighters → Property Damage
The Bivariate Relationship Between Number of Firefighters and Property Damage: Number of Firefighters (IV) → Property Damage (DV)

123

124 Process of Elaboration
Partial tables: bivariate tables that display the relationship between the IV and DV while controlling for a third variable. Partial relationship: the relationship between the IV and DV shown in a partial table.

125 The Process of Elaboration
Divide the observations into subgroups on the basis of the control variable. We have as many subgroups as there are categories in the control variable. Re-examine the relationship between the original two variables separately for the control variable subgroups. Compare the partial relationships with the original bivariate relationship for the total group.
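The steps above can be sketched in code. The records are entirely hypothetical and are constructed so that the firefighters and property damage relationship disappears once the control variable (size of fire, an assumed control) is held constant; the function name is illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical records: (fire_size, number_of_firefighters, property_damage).
records = (
    [("small", "few", "low")] * 40
    + [("small", "many", "low")] * 10
    + [("large", "few", "high")] * 10
    + [("large", "many", "high")] * 40
)

def column_percentages(pairs):
    """Percentage of each DV category within each IV category."""
    counts = defaultdict(Counter)
    for iv, dv in pairs:
        counts[iv][dv] += 1
    return {
        iv: {dv: 100 * k / sum(dv_counts.values()) for dv, k in dv_counts.items()}
        for iv, dv_counts in counts.items()
    }

# Original bivariate relationship for the total group.
bivariate = column_percentages([(ff, dmg) for _, ff, dmg in records])

# Partial tables: one per category of the control variable.
partials = {
    size: column_percentages([(ff, dmg) for s, ff, dmg in records if s == size])
    for size in ("small", "large")
}

print("Bivariate:", bivariate)
print("Partials: ", partials)
```

In this made-up data the bivariate table shows a strong association, while within each fire-size subgroup the damage percentages are identical for few and many firefighters, which is the pattern expected of a spurious relationship.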

126

127

128 Intervening Relationship
Intervening variable: a control variable that follows an independent variable but precedes the dependent variable in a causal sequence. Intervening relationship: a relationship in which the control variable intervenes between the independent and dependent variables.

129 Intervening Relationship: Example
Religion (IV) → Preferred Family Size (Intervening Control Variable) → Support for Abortion (DV)

130 Conditional Relationships
Conditional relationship: a relationship in which the control variable’s effect on the dependent variable is conditional on its interaction with the independent variable. The relationship between the independent and dependent variables will change according to the different conditions of the control variable.

131 Conditional Relationships
Another way to describe a conditional relationship is to say that there is a statistical interaction between the control variable and the independent variable.

132 Conditional Relationships

133 Conditional Relationships

134 Chapter 7: Measures of Association for Nominal and Ordinal Variables
Proportional Reduction of Error (PRE) Degree of Association For Nominal Variables Lambda For Ordinal Variables Gamma Using Gamma for Dichotomous Variables

135 Measures of Association
Measure of association—a single summarizing number that reflects the strength of a relationship, indicates the usefulness of predicting the dependent variable from the independent variable, and often shows the direction of the relationship.

136 Take Your Best Guess
If you know nothing else about a person except that he or she lives in the United States, and I asked you to guess his or her race/ethnicity, what would you guess? The most common race/ethnicity for U.S. residents (i.e., the mode)! Now, if we know that this person lives in San Diego, California, would you change your guess? With quantitative analyses we are generally trying to predict, or take our best guess at, the value of the dependent variable. One way to assess the relationship between two variables is to consider the degree to which the extra information of the independent variable makes your guess better.

137 Proportional Reduction of Error (PRE)
PRE—the concept that underlies the definition and interpretation of several measures of association. PRE measures are derived by comparing the errors made in predicting the dependent variable while ignoring the independent variable with errors made when making predictions that use information about the independent variable.

138 Proportional Reduction of Error (PRE)
PRE = (E1 – E2) / E1 where: E1 = errors of prediction made when the independent variable is ignored; E2 = errors of prediction made when the prediction is based on the independent variable

139 Two PRE Measures: Lambda & Gamma
Lambda: appropriate for NOMINAL variables. Gamma: appropriate for ORDINAL & DICHOTOMOUS NOMINAL variables.

140 Lambda Lambda – An asymmetrical measure of association suitable for use with nominal variables. It may range from 0.0 (meaning the extra information provided by the independent variable does not help prediction) to 1.0 (meaning use of the independent variable results in no prediction errors). It provides us with an indication of the strength of an association between the independent and dependent variables. A lower value represents a weaker association, while a higher value indicates a stronger association.

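141 Lambda Lambda = (E1 – E2) / E1 where: E1 = Ntotal – Nmode of dependent variable; E2 = the sum, over the categories of the independent variable, of (N category total – N category mode)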

142 Example 1: 2000 Vote By Abortion Attitudes
Table: Presidential Vote by Abortion Attitudes
Abortion Attitudes (for any reason)
Vote     Yes    No     Row Total
Gore     46     39     85
Bush     41     73     114
Total    87     112    199
Source: General Social Survey, 2002
Step One: Add percentages to the table to get the data in a format that allows you to clearly assess the nature of the relationship.

143 Example 1: 2000 Vote By Abortion Attitudes
Table: Presidential Vote by Abortion Attitudes
Abortion Attitudes (for any reason)
Vote     Yes      No       Row Total
Gore     52.9%    34.8%    42.7%
Bush     47.1%    65.2%    57.3%
Total    100%     100%     100%
Source: General Social Survey, 2002
Now calculate E1: E1 = Ntotal – Nmode = 199 – 114 = 85

146 Example 1: 2000 Vote By Abortion Attitudes
Table: Presidential Vote by Abortion Attitudes (same percentage table as on slide 143). Source: General Social Survey, 2002. Now calculate E2: E2 = [N(Yes column total) – N(Yes column mode)] + [N(No column total) – N(No column mode)] = [87 – 46] + [112 – 73] = 41 + 39 = 80

147 Example 1: 2000 Vote By Abortion Attitudes
Table: Presidential Vote by Abortion Attitudes (same percentage table as on slide 143). Source: General Social Survey, 2002. Lambda = [E1 – E2] / E1 = [85 – 80] / 85 = .06

148 Example 1: 2000 Vote By Abortion Attitudes
Table: Presidential Vote by Abortion Attitudes (same percentage table as on slide 143). Source: General Social Survey, 2002. Lambda = .06. So, we know that six percent of the errors in predicting presidential vote can be reduced by taking into account the voter’s attitude towards abortion.
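The Example 1 steps can be sketched in Python. This is only an illustration, not part of the original slides: the function name `lambda_pre` and the nested-dictionary layout are assumptions, and the four cell counts are the ones implied by the column totals, column modes, and percentages quoted in the example.

```python
def lambda_pre(table):
    """Lambda for a crosstab laid out as {iv_category: {dv_category: count}}."""
    # Totals for each category of the dependent variable (the row totals).
    dv_totals = {}
    for column in table.values():
        for dv_value, count in column.items():
            dv_totals[dv_value] = dv_totals.get(dv_value, 0) + count
    n_total = sum(dv_totals.values())
    # E1: errors when we always guess the overall mode of the dependent variable.
    e1 = n_total - max(dv_totals.values())
    # E2: errors when we guess the mode within each independent-variable category.
    e2 = sum(sum(col.values()) - max(col.values()) for col in table.values())
    return e1, e2, (e1 - e2) / e1

# Cell counts implied by the slides (hypothetical layout: attitude -> vote).
vote_by_attitude = {
    "Yes": {"Gore": 46, "Bush": 41},
    "No": {"Gore": 39, "Bush": 73},
}
e1, e2, lam = lambda_pre(vote_by_attitude)
print(e1, e2, round(lam, 2))  # 85 80 0.06
```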

149 EXAMPLE 2: Victim-Offender Relationship and Type of Crime: 1993
Step One: Add percentages to the table to get the data in a format that allows you to clearly assess the nature of the relationship. *Source: Kathleen Maguire and Ann L. Pastore, eds., Sourcebook of Criminal Justice Statistics 1994, U.S. Department of Justice, Bureau of Justice Statistics, Washington, D.C.: USGPO, 1995, p. 343.

150 Victim-Offender Relationship & Type of Crime: 1993
Now calculate E1: E1 = Ntotal – Nmode = 9,898,980 – 5,045,040 = 4,853,940

153 Victim-Offender Relationship and Type of Crime: 1993
Now calculate E2 E2 = [N(rape/sexual assault column total) – N(rape/sexual assault column mode)] + [N(robbery column total) – N(robbery column mode)] + [N(assault column total) – N(assault column mode)] = [472,760 – 350,670] + [1,161,900 – 930,860] + [8,264,320 – 4,272,230] = 4,345,220

154 Victim-Offender Relationship and Type of Crime: 1993
Lambda = [E1 – E2] / E1 = [4,853,940 – 4,345,220] / 4,853,940 = .10. So, we know that ten percent of the errors in predicting the victim-offender relationship (stranger vs. non-stranger) can be reduced by taking into account the type of crime that was committed.
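The arithmetic in Example 2 can be checked with a few lines of Python. This is a sketch only: the variable names are assumptions, and the figures are the column totals, column modes, and overall mode quoted in the example.

```python
# Column totals and column modes quoted in the example (type of crime is the IV).
column_totals = {"rape/sexual assault": 472_760, "robbery": 1_161_900, "assault": 8_264_320}
column_modes = {"rape/sexual assault": 350_670, "robbery": 930_860, "assault": 4_272_230}
n_mode_dependent = 5_045_040  # size of the modal category of the dependent variable

n_total = sum(column_totals.values())
e1 = n_total - n_mode_dependent                                  # errors ignoring type of crime
e2 = sum(column_totals[c] - column_modes[c] for c in column_totals)  # errors using type of crime
lam = (e1 - e2) / e1
print(n_total, e1, e2, round(lam, 2))
```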

155 Asymmetrical Measure of Association
A measure whose value may vary depending on which variable is considered the independent variable and which the dependent variable. Lambda is an asymmetrical measure of association.
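A small Python sketch can make the asymmetry concrete. It is illustrative only: the helper names `lam` and `transpose` are assumptions, and the counts are the ones implied by Example 1. Swapping which variable is treated as dependent changes the value of lambda.

```python
def lam(table):
    """Lambda, predicting the inner (dependent) key from the outer (independent) key."""
    dv_totals = {}
    for col in table.values():
        for dv, n in col.items():
            dv_totals[dv] = dv_totals.get(dv, 0) + n
    e1 = sum(dv_totals.values()) - max(dv_totals.values())
    e2 = sum(sum(col.values()) - max(col.values()) for col in table.values())
    return (e1 - e2) / e1

def transpose(table):
    """Swap the roles of the independent and dependent variables."""
    flipped = {}
    for iv, col in table.items():
        for dv, n in col.items():
            flipped.setdefault(dv, {})[iv] = n
    return flipped

attitude_then_vote = {"Yes": {"Gore": 46, "Bush": 41}, "No": {"Gore": 39, "Bush": 73}}
vote_dependent = lam(attitude_then_vote)                 # vote predicted from attitude
attitude_dependent = lam(transpose(attitude_then_vote))  # attitude predicted from vote
print(round(vote_dependent, 3), round(attitude_dependent, 3))
```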

156 Symmetrical Measure of Association
A measure whose value will be the same when either variable is considered the independent variable or the dependent variable. Gamma is a symmetrical measure of association…

157 Before Computing GAMMA:
It is necessary to introduce the concept of paired observations. Paired observations – Observations compared in terms of their relative rankings on the independent and dependent variables.

158 Tied Pairs Same order pair (Ns) – Paired observations that show a positive association; the member of the pair ranked higher on the independent variable is also ranked higher on the dependent variable.

159 Tied Pairs Inverse order pair (Nd) – Paired observations that show a negative association; the member of the pair ranked higher on the independent variable is ranked lower on the dependent variable.

160 Gamma Gamma – a symmetrical measure of association suitable for use with ordinal variables or with dichotomous nominal variables. It can vary from 0.0 (meaning the extra information provided by the independent variable does not help prediction) to ±1.0 (meaning use of the independent variable results in no prediction errors) and provides us with an indication of the strength and direction of the association between the variables. When there are more Ns pairs, gamma will be positive; when there are more Nd pairs, gamma will be negative.

161 Gamma
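Gamma is conventionally computed as (Ns – Nd) / (Ns + Nd), where Ns is the number of same order pairs and Nd the number of inverse order pairs. The sketch below counts both kinds of pairs in a crosstab; the function name and the 2 x 2 counts are hypothetical and are not taken from the slides.

```python
def gamma(table):
    """Gamma for a crosstab given as a list of rows.

    Rows and columns must both be ordered from the lowest to the highest category.
    """
    n_rows, n_cols = len(table), len(table[0])
    ns = nd = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):      # cells ranked higher on the row variable
                for m in range(n_cols):
                    if m > j:                   # also higher on the column variable
                        ns += table[i][j] * table[k][m]
                    elif m < j:                 # lower on the column variable
                        nd += table[i][j] * table[k][m]
    return ns, nd, (ns - nd) / (ns + nd)

# Hypothetical counts: rows = low/high on one variable, columns = low/high on the other.
hypothetical = [
    [20, 10],
    [5, 15],
]
ns, nd, g = gamma(hypothetical)
print(ns, nd, round(g, 2))  # 300 50 0.71
```

Pairs tied on either variable are ignored, which is why only cells in different rows and different columns contribute.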

162 Interpreting Gamma The sign depends on the way the variables are coded: + means the two “high” values are associated, as are the two “lows”; – means the “highs” are associated with the “lows”. Strength (absolute value): .00 to .24 “no relationship”; .25 to .49 “weak relationship”; .50 to .74 “moderate relationship”; .75 to 1.00 “strong relationship”

163 Measures of Association
Measures of association—a single summarizing number that reflects the strength of the relationship. This statistic shows the magnitude and/or direction of a relationship between variables. Magnitude—the closer to the absolute value of 1, the stronger the association. If the measure equals 0, there is no relationship between the two variables. Direction—the sign on the measure indicates if the relationship is positive or negative. In a positive relationship, when one variable is high, so is the other. In a negative relationship, when one variable is high, the other is low.

164 Chapter 8: Bivariate Regression and Correlation
Overview The Scatter Diagram Two Examples: Education & Prestige Correlation Coefficient Bivariate Linear Regression Line SPSS Output Interpretation Covariance

165 Overview
Nominal independent variable, nominal dependent variable: considers the distribution of one variable across the categories of another variable.
Nominal independent variable, interval dependent variable: considers the difference between the mean of one group on a variable and the mean of another group.
Interval independent variable, nominal dependent variable: considers how a change in a variable affects a discrete outcome.
Interval independent variable, interval dependent variable: considers the degree to which a change in one variable results in a change in another.

166 Overview
Nominal independent variable, nominal dependent variable: Lambda. You already know how to deal with two nominal variables.
Nominal independent variable, interval dependent variable: considers the difference between the mean of one group on a variable and the mean of another group.
Interval independent variable, nominal dependent variable: considers how a change in a variable affects a discrete outcome.
Interval independent variable, interval dependent variable: considers the degree to which a change in one variable results in a change in another.

167 Overview
Nominal independent variable, nominal dependent variable: Lambda. You already know how to deal with two nominal variables.
Nominal independent variable, interval dependent variable: Confidence Intervals, T-Test. We will deal with this later in the course.
Interval independent variable, nominal dependent variable: considers how a change in a variable affects a discrete outcome.
Interval independent variable, interval dependent variable: considers the degree to which a change in one variable results in a change in another. TODAY!

168 Overview
Nominal independent variable, nominal dependent variable: Lambda. You already know how to deal with two nominal variables.
Nominal independent variable, interval dependent variable: Confidence Intervals, T-Test. We will deal with this later in the course.
Interval independent variable, nominal dependent variable: considers how a change in a variable affects a discrete outcome. What about this cell?
Interval independent variable, interval dependent variable: Regression, Correlation. TODAY!

169 Overview
Nominal independent variable, nominal dependent variable: Lambda. You already know how to deal with two nominal variables.
Nominal independent variable, interval dependent variable: Confidence Intervals, T-Test. We will deal with this later in the course.
Interval independent variable, nominal dependent variable: Logistic Regression. This cell is not covered in this course.
Interval independent variable, interval dependent variable: Regression, Correlation. TODAY!

170 General Examples Does a change in one variable significantly affect another variable? Do two scores tend to co-vary positively (high on one score, high on the other; low on one, low on the other)? Do two scores tend to co-vary negatively (high on one score, low on the other; low on one, high on the other)?

171 Specific Examples Does getting older significantly influence a person’s political views? Does marital satisfaction increase with length of marriage? How does an additional year of education affect one’s earnings?

172 Scatter Diagrams Scatter Diagram (scatterplot)—a visual method used to display a relationship between two interval-ratio variables. Typically, the independent variable is placed on the X-axis (horizontal axis), while the dependent variable is placed on the Y-axis (vertical axis.)

173 Scatter Diagram Example
The data…

174 Scatter Diagram Example

175 A Scatter Diagram Example of a Negative Relationship

176 Linear Relationships Linear relationship – A relationship between two interval-ratio variables in which the observations displayed in a scatter diagram can be approximated with a straight line. Deterministic (perfect) linear relationship – A relationship between two interval-ratio variables in which all the observations (the dots) fall along a straight line. The line provides a predicted value of Y (the vertical axis) for any value of X (the horizontal axis).

177 Graph the data below and examine the relationship:

178 The Seniority-Salary Relationship

179 Example: Education & Prestige
Does education predict occupational prestige? If so, then the higher the respondent’s level of education, as measured by number of years of schooling, the greater the prestige of the respondent’s occupation. Take a careful look at the scatter diagram on the next slide and see if you think that there exists a relationship between these two variables…

180 Scatterplot of Prestige by Education

181 Example: Education & Prestige
The scatter diagram data can be approximated by a straight line, so there does appear to be a relationship between these two variables. In addition, since occupational prestige becomes higher as years of education increase, we can also say that the relationship is a positive one.

182 Take your best guess?
If you know nothing else about a person except that he or she lives in the United States, and I asked you to guess his or her age, what would you guess? The mean age for U.S. residents. Now if I tell you that this person owns a skateboard, would you change your guess? (Of course!) With quantitative analyses we are generally trying to predict, or take our best guess at, the value of the dependent variable. One way to assess the relationship between two variables is to consider the degree to which the extra information of the second variable makes your guess better. If someone owns a skateboard, that is likely to indicate to us that s/he is younger, and we may be able to guess closer to the actual value.

183 Take your best guess? Similar to the example of age and the skateboard, we can take a much better guess at someone’s occupational prestige, if we have information about her/his years or level of education.

184 Equation for a Straight Line
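Y = a + bX where a = intercept; b = slope = rise / run; Y = dependent variable; X = independent variable. (Figure: a straight line plotted with X on the horizontal axis and Y on the vertical axis, showing the intercept a and the slope as rise over run.)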

185 Bivariate Linear Regression Equation
Ŷ = a + bX. Y-intercept (a) – The point where the regression line crosses the Y-axis, or the value of Y when X = 0. Slope (b) – The change in variable Y (the dependent variable) with a unit change in X (the independent variable). The estimates of a and b will have the property that the sum of the squared differences between the observed and predicted values, Σ(Y – Ŷ)², is minimized using ordinary least squares (OLS). Thus the regression line represents the Best Linear and Unbiased Estimators (BLUE) of the intercept and slope.

186 SPSS Regression Output (GSS) Education & Prestige

187 SPSS Regression Output (GSS) Education & Prestige
Now let’s interpret the SPSS output...

188 The Regression Equation
Prediction Equation: Ŷ = a + b(X), with a and b taken from the SPSS output. This line represents the predicted values for Y for any and all values of X.

190 Interpreting the regression equation
Ŷ = a + b(X), with a and b taken from the SPSS output. If a respondent had zero years of schooling, this model predicts that his or her occupational prestige score would equal the intercept, a. For each additional year of education, our model predicts an increase of b points in occupational prestige.

191 Ordinary Least Squares
Least-squares line (best fitting line) – A line where the error sum of squares, or Σe², is at a minimum. Least-squares method – The technique that produces the least-squares line.

192 Estimating the slope: b
The bivariate regression coefficient, or the slope of the regression line, can be obtained from the observed X and Y scores: b = (covariance of X and Y) / (variance of X).

193 Covariance and Variance
Covariance of X and Y – a measure of how X and Y vary together. Covariance will be close to zero when X and Y are unrelated. It will be greater than zero when the relationship is positive and less than zero when the relationship is negative. Variance of X – we have talked a lot about variance in the dependent variable. This is simply the variance for the independent variable.

194 Estimating the Intercept
The regression line always goes through the point corresponding to the mean of both X and Y, by definition. So we utilize this information to solve for a: a = Ȳ – b(X̄), where Ȳ and X̄ are the means of Y and X.
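The slope and intercept calculations can be sketched as follows. This is a minimal illustration with made-up numbers: the function name `least_squares` and the education and prestige values are hypothetical, not the GSS data used in the slides. The covariance and variance here divide by N – 1; the choice of divisor cancels out in the slope.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + b*X."""
    x_bar, y_bar = mean(xs), mean(ys)
    n = len(xs)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b = cov_xy / var_x          # slope: covariance of X and Y over variance of X
    a = y_bar - b * x_bar       # intercept: the line passes through (mean X, mean Y)
    return a, b

years_of_education = [8, 10, 12, 14, 16]   # hypothetical data
prestige_score = [30, 35, 42, 45, 53]      # hypothetical data
a, b = least_squares(years_of_education, prestige_score)
print(round(a, 2), round(b, 2))  # 7.4 2.8
```

With these made-up values, each additional year of education raises the predicted prestige score by 2.8 points.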

195 Back to the original scatterplot:

196 A Representative Line

197 Other Representative Lines

198 Calculating the Regression Equation

199 Calculating the Regression Equation

200 The Least Squares Line!

201 Summary: Properties of the Regression Line
Represents the predicted values for Y for any and all values of X. Always goes through the point corresponding to the mean of both X and Y. It is the best fitting line in that it minimizes the sum of the squared deviations. Has a slope that can be positive or negative; null hypothesis is that the slope is zero.

202 Coefficient of Determination
Coefficient of Determination (r²) – A PRE measure reflecting the proportional reduction of error that results from using the linear regression model. It reflects the proportion of the total variation in the dependent variable, Y, explained by the independent variable, X.

203 Coefficient of Determination

204 Coefficient of Determination

205 The Correlation Coefficient
Pearson’s Correlation Coefficient (r) – The square root of r². It is a measure of association between two interval-ratio variables. Symmetrical measure – No specification of independent or dependent variables. Ranges from –1.0 to +1.0. The sign (±) indicates direction. The closer the number is to ±1.0, the stronger the association between X and Y.
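To see how r² works as a PRE measure and how it relates to r, here is a short sketch. The data are hypothetical, and E1 and E2 are taken to be the sums of squared errors without and with the regression line, which is an assumption about notation rather than something stated on the slides.

```python
from math import sqrt
from statistics import mean

xs = [8, 10, 12, 14, 16]      # hypothetical years of education
ys = [30, 35, 42, 45, 53]     # hypothetical prestige scores

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

b = sxy / sxx
a = y_bar - b * x_bar

e1 = syy                                                  # errors when X is ignored (predict mean of Y)
e2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # errors around the regression line
r_squared = (e1 - e2) / e1
r = sxy / sqrt(sxx * syy)     # carries the sign of the slope
print(round(r_squared, 3), round(r, 3))  # 0.986 0.993
```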

208 The Correlation Coefficient
r = 0 means that there is no linear association between the two variables. r = +1 means a perfect positive correlation. r = –1 means a perfect negative correlation. (Figures: scatter diagrams of Y against X illustrating r = 0, r = +1, and r = –1.)

209 Chapter 9: The Normal Distribution
Properties of the Normal Distribution Shapes of Normal Distributions Standard (Z) Scores The Standard Normal Distribution Transforming Z Scores into Proportions Transforming Proportions into Z Scores Finding the Percentile Rank of a Raw Score Finding the Raw Score for a Percentile

210 Normal Distributions Normal Distribution – A bell-shaped and symmetrical theoretical distribution, with the mean, the median, and the mode all coinciding at its peak and with frequencies gradually decreasing at both ends of the curve. The normal distribution is a theoretical ideal distribution. Real-life empirical distributions never match this model perfectly. However, many things in life do approximate the normal distribution, and are said to be “normally distributed.”

211 Scores “Normally Distributed?”
Is this distribution normal? There are two things to initially examine: (1) look at the shape illustrated by the bar chart, and (2) calculate the mean, median, and mode.

212 Scores Normally Distributed!
The Mean = 70.07 The Median = 70 The Mode = 70 Since all three are essentially equal, and this is reflected in the bar graph, we can assume that these data are normally distributed. Also, since the median is approximately equal to the mean, we know that the distribution is symmetrical.

213 The Shape of a Normal Distribution: The Normal Curve

214 The Shape of a Normal Distribution
Notice the shape of the normal curve in this graph. Some normal distributions are tall and thin, while others are short and wide. All normal distributions, though, are wider in the middle and symmetrical.

215 Different Shapes of the Normal Distribution
Notice that the standard deviation changes the relative width of the distribution; the larger the standard deviation, the wider the curve.

216 Areas Under the Normal Curve by Measuring Standard Deviations

217 Standard (Z) Scores A standard score (also called Z score) is the number of standard deviations that a given raw score is above or below the mean.
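In symbols, Z = (raw score – mean) / standard deviation. The sketch below applies this to the course statistics used in the worked examples (mean 70.07, standard deviation 10.27) and uses Python's `statistics.NormalDist` in place of a printed standard normal table; the helper name `z_score` is an assumption.

```python
from statistics import NormalDist

MEAN, SD = 70.07, 10.27   # course statistics quoted in the worked examples

def z_score(raw):
    """Number of standard deviations a raw score lies above or below the mean."""
    return (raw - MEAN) / SD

standard_normal = NormalDist()               # mean 0, standard deviation 1
z = round(z_score(85), 2)                    # rounded to two decimals, as in a printed table
area_mean_to_z = standard_normal.cdf(z) - 0.5
area_beyond_z = 1 - standard_normal.cdf(z)
print(z, round(area_mean_to_z, 4), round(area_beyond_z, 4))  # 1.45 0.4265 0.0735
```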

218 The Standard Normal Table
A table showing the area (as a proportion, which can be translated into a percentage) under the standard normal curve corresponding to any Z score or its fraction Area up to a given score

219 The Standard Normal Table
A table showing the area (as a proportion, which can be translated into a percentage) under the standard normal curve corresponding to any Z score or its fraction Area beyond a given score

220 Finding the Area Between the Mean and a Positive Z Score
Using the data presented in Table 10.1, find the percentage of students whose scores range from the mean (70.07) to 85. (1) Convert 85 to a Z score: Z = (85 – 70.07)/10.27 = 1.45 (2) Look up the Z score (1.45) in Column A and read off the corresponding proportion (.4265)

221 Finding the Area Between the Mean and a Positive Z Score
(3) Convert the proportion (.4265) to a percentage (42.65%); this is the percentage of students scoring between the mean and 85 in the course.

222 Finding the Area Between the Mean and a Negative Z Score
Using the data presented in Table 10.1, find the percentage of students scoring between 65 and the mean (70.07). (1) Convert 65 to a Z score: Z = (65 – 70.07)/10.27 = –.49 (2) Since the curve is symmetrical and negative area does not exist, use .49 to find the area in the standard normal table: .1879

223 Finding the Area Between the Mean and a Negative Z Score
(3) Convert the proportion (.1879) to a percentage (18.79%); this is the percentage of students scoring between 65 and the mean (70.07)

224 Finding the Area Between 2 Z Scores on the Same Side of the Mean
Using the same data presented in Table 10.1, find the percentage of students scoring between 74 and 84. (1) Find the Z scores for 74 and 84: Z = .38 and Z = 1.36 (2) Look up the corresponding areas for those Z scores: .1480 and .4131

225 Finding the Area Between 2 Z Scores on the Same Side of the Mean
(3) To find the highlighted area above, subtract the smaller area from the larger area (.4131 – .1480 = .2651). Now we have the percentage of students scoring between 74 and 84: 26.51%.

226 Finding the Area Between 2 Z Scores on Opposite Sides of the Mean
Using the same data, find the percentage of students scoring between 62 and 72. (1) Find the Z scores for 62 and 72: Z = (72 – 70.07)/10.27 = .19 and Z = (62 – 70.07)/10.27 = –.79 (2) Look up the areas between these Z scores and the mean, as in the previous 2 examples: Z = .19 is .0753 and Z = –.79 is .2852 (3) Add the two areas together: .0753 + .2852 = .3605

227 Finding the Area Between 2 Z Scores on Opposite Sides of the Mean
(4) Convert the proportion (.3605) to a percentage (36.05%); this is the percentage of students scoring between 62 and 72.

228 Finding Area Above a Positive Z Score or Below a Negative Z Score
Find the percentage of students who did (a) very well, scoring above 85, and (b) poorly, scoring below 50. (a) Convert 85 to a Z score, then look up the value in Column C of the Standard Normal Table: Z = (85 – 70.07)/10.27 = 1.45 → 7.35% (b) Convert 50 to a Z score, then look up the value (look for a positive Z score!) in Column C: Z = (50 – 70.07)/10.27 = –1.95 → 2.56%

229 Finding Area Above a Positive Z Score or Below a Negative Z Score

230 Finding a Z Score Bounding an Area Above It
Find the raw score that bounds the top 10 percent of the distribution (Table 10.1). (1) 10% = a proportion of .10 (2) Using the Standard Normal Table, look in Column C for .1000, then take the value in Column A; this is the Z score (1.28) (3) Finally, convert the Z score to a raw score: Y = 70.07 + 1.28(10.27) = 83.22

231 Finding a Z Score Bounding an Area Above It
(4) 83.22 is the raw score that bounds the upper 10% of the distribution. The Z score associated with 83.22 in this distribution is 1.28.

232 Finding a Z Score Bounding an Area Below It
Find the raw score that bounds the lowest 5 percent of the distribution (Table 10.1). (1) 5% = a proportion of .05 (2) Using the Standard Normal Table, look in Column C for .05, then take the value in Column A; this is the Z score (–1.65), negative since it is on the left side of the distribution (3) Finally, convert the Z score to a raw score: Y = 70.07 + (–1.65)(10.27) = 53.12

233 Finding a Z Score Bounding an Area Below It
(4) 53.12 is the raw score that bounds the lower 5% of the distribution. The Z score associated with 53.12 in this distribution is –1.65.
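Going from an area back to a raw score can also be sketched in Python, using `NormalDist.inv_cdf` instead of searching Column C of the table. The function name is an assumption. Because the exact Z is not rounded to two decimals, the raw score can differ from the table-based answer in the second decimal place.

```python
from statistics import NormalDist

MEAN, SD = 70.07, 10.27   # course statistics quoted in the worked examples

def raw_score_for_upper_tail(proportion):
    """Raw score that cuts off the given proportion in the upper tail."""
    z = NormalDist().inv_cdf(1 - proportion)   # Z with `proportion` of the area above it
    return z, MEAN + z * SD                    # Y = mean + Z * standard deviation

z, raw = raw_score_for_upper_tail(0.10)
print(round(z, 2), round(raw, 2))
```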

234 Finding the Percentile Rank of a Score Higher than the Mean
Suppose your raw score was 85. You want to calculate the percentile (to see where in the class you rank). (1) Convert the raw score to a Z score: Z = (85 – 70.07)/10.27 = 1.45 (2) Find the area beyond Z in the Standard Normal Table (Column C): .0735 (3) Subtract the area from 1.00 for the percentile, since .0735 is only the area not below the score: 1.00 – .0735 = .9265 (proportion of scores below 85)

235 Finding the Percentile Rank of a Score Higher than the Mean
(4) .9265 represents the proportion of scores less than 85, corresponding to a percentile rank of 92.65%

236 Finding the Percentile Rank of a Score Lower than the Mean
Now, suppose your raw score was 65. (1) Convert the raw score to a Z score: Z = (65 – 70.07)/10.27 = –.49 (2) Find the area beyond Z in the Standard Normal Table, Column C: .3121 (3) Multiply by 100 to obtain the percentile rank: .3121 x 100 = 31.21%
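Both percentile-rank cases reduce to the area below the raw score, so they can be sketched with a single function. The name `percentile_rank` is an assumption. The results differ slightly from the table-based answers (92.65% and 31.21%) because the Z score is not rounded to two decimals here.

```python
from statistics import NormalDist

scores = NormalDist(mu=70.07, sigma=10.27)   # course statistics from the worked examples

def percentile_rank(raw):
    """Percentage of the distribution falling below the raw score."""
    return 100 * scores.cdf(raw)

above_mean = percentile_rank(85)
below_mean = percentile_rank(65)
print(round(above_mean, 1), round(below_mean, 1))
```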

237 Finding the Percentile Rank of a Score Lower than the Mean

238 Finding the Raw Score of a Percentile Higher than 50
Say you need to score in the 95th percentile to be accepted to a particular grad school program. What’s the cutoff for the 95th percentile? (1) Find the area associated with the percentile: 95/100 = .9500 (2) Subtract the area from 1.00 to find the area above and beyond the percentile rank: 1.00 – .9500 = .0500 (3) Find the Z score by looking in Column C of the Standard Normal Table for .0500: Z = 1.65

239 Finding the Raw Score of a Percentile Higher than 50
(4) Convert the Z score to a raw score: Y = 70.07 + 1.65(10.27) = 87.02

240 Finding the Raw Score of a Percentile Lower than 50
What score is associated with the 40th percentile? (1) Find the area below the percentile: 40/100 = .4000 (2) Find the Z score associated with this area. Use Column C, but remember that this is a negative Z score since it is less than the mean; so, Z = –.25 (3) Convert the Z score to a raw score: Y = 70.07 + (–.25)(10.27) = 67.5

241 Finding the Raw Score of a Percentile Lower than 50

242 Chapter 10: Sampling and Sampling Distributions
Aims of Sampling Basic Principles of Probability Types of Random Samples Sampling Distributions Sampling Distribution of the Mean Standard Error of the Mean The Central Limit Theorem

243 Sampling Population – A group that includes all the cases (individuals, objects, or groups) in which the researcher is interested. Sample – A relatively small subset from a population.

244 Notation

245 Sampling Parameter – A measure (for example, mean or standard deviation) used to describe a population distribution. Statistic – A measure (for example, mean or standard deviation) used to describe a sample distribution.

246 Sampling: Parameter & Statistic

247 Probability Sampling Probability sampling – A method of sampling that enables the researcher to specify for each case in the population the probability of its inclusion in the sample.

248 Random Sampling Simple Random Sample – A sample designed in such a way as to ensure that (1) every member of the population has an equal chance of being chosen and (2) every combination of N members has an equal chance of being chosen. This can be done using a computer, calculator, or a table of random numbers

249 Population inferences can be made...

250 ...by selecting a representative sample from the population

251 Random Sampling Systematic random sampling – A method of sampling in which every Kth member (K is a ratio obtained by dividing the population size by the desired sample size) in the total population is chosen for inclusion in the sample after the first member of the sample is selected at random from among the first K members of the population.
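Simple and systematic random sampling can be sketched with Python's standard `random` module. The function names, the fixed seeds, and the population of 100 case IDs are all hypothetical choices made for illustration.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member, and every combination of n members, is equally likely."""
    return rng.sample(population, n)

def systematic_random_sample(population, n, rng):
    """Pick a random start among the first K members, then take every Kth member."""
    k = len(population) // n            # sampling interval K
    start = rng.randrange(k)            # random position within the first K members
    return [population[start + i * k] for i in range(n)]

rng = random.Random(42)                 # fixed seed only so the sketch is repeatable
population = list(range(1, 101))        # hypothetical population of 100 case IDs
srs = simple_random_sample(population, 10, rng)
systematic = systematic_random_sample(population, 10, rng)
print(sorted(srs))
print(systematic)
```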

252 Systematic Random Sampling

253 Stratified Random Sampling
Stratified random sample – A method of sampling obtained by (1) dividing the population into subgroups based on one or more variables central to our analysis and (2) then drawing a simple random sample from each of the subgroups

254 Stratified Random Sampling
Proportionate stratified sample – The size of the sample selected from each subgroup is proportional to the size of that subgroup in the entire population. Disproportionate stratified sample – The size of the sample selected from each subgroup is disproportional to the size of that subgroup in the population.
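A proportionate stratified sample can be sketched as below. The strata names and sizes are hypothetical, and the function is an illustration rather than a prescribed procedure. Each subgroup contributes a share of the sample equal to its share of the population; with less convenient numbers the rounded shares may not add up exactly to the requested sample size.

```python
import random

def proportionate_stratified_sample(strata, n, rng):
    """Draw a simple random sample from each stratum, sized by the stratum's population share."""
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        share = round(n * len(members) / total)
        sample[name] = rng.sample(members, share)
    return sample

rng = random.Random(7)                  # fixed seed only so the sketch is repeatable
strata = {                              # hypothetical population of 1,000 split into three subgroups
    "group_a": [f"a{i}" for i in range(600)],
    "group_b": [f"b{i}" for i in range(300)],
    "group_c": [f"c{i}" for i in range(100)],
}
sample = proportionate_stratified_sample(strata, 50, rng)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)  # {'group_a': 30, 'group_b': 15, 'group_c': 5}
```

A disproportionate design would replace the computed share with a size chosen by the researcher for each subgroup.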

255 Disproportionate Stratified Sample

256 Sampling Distributions
Sampling error – The discrepancy between a sample estimate of a population parameter and the real population parameter. Sampling distribution – A theoretical distribution of all possible sample values for the statistic in which we are interested.

257 Sampling Distributions
Sampling distribution of the mean – A theoretical probability distribution of sample means that would be obtained by drawing from the population all possible samples of the same size. If we repeatedly drew samples from a population and calculated the sample means, those sample means would be approximately normally distributed (increasingly so as the number of samples drawn increases). The next several slides demonstrate this. Standard error of the mean – The standard deviation of the sampling distribution of the mean. It describes how much dispersion there is in the sampling distribution of the mean.

258 Sampling Distributions

259 Distribution of Sample Means with 21 Samples
(Histogram: frequency by sample means.) S.D. = 2.02; Mean of Means = 41.0; Number of Means = 21

260 Distribution of Sample Means with 96 Samples
(Histogram: frequency by sample means.) S.D. = 1.80; Mean of Means = 41.12; Number of Means = 96

261 Distribution of Sample Means with 170 Samples
(Histogram: frequency by sample means.) S.D. = 1.71; Mean of Means = 41.12; Number of Means = 170

262 The Central Limit Theorem
If all possible random samples of size N are drawn from a population with mean μ_Y and standard deviation σ_Y, then as N becomes larger, the sampling distribution of sample means becomes approximately normal, with mean μ_Y and standard deviation σ_Y / √N.
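The theorem can be explored with a small simulation. The sketch below is an illustration under stated assumptions: the population is an exponential distribution (mean 1, standard deviation 1, strongly skewed), and the sample size, number of samples, and seed are arbitrary choices. It draws a finite number of samples rather than all possible samples, so the results only approximate the theoretical values.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)    # fixed seed only so the sketch is repeatable
N = 50                    # size of each sample
NUM_SAMPLES = 2000        # number of samples drawn

# Skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    mean(rng.expovariate(1.0) for _ in range(N))
    for _ in range(NUM_SAMPLES)
]

mean_of_means = mean(sample_means)            # should be close to the population mean, 1
sd_of_means = stdev(sample_means)             # should be close to sigma / sqrt(N)
expected_standard_error = 1.0 / sqrt(N)
print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_standard_error, 3))
```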

263 Chapter 11: Estimation Estimation Defined Confidence Levels
Confidence Intervals Confidence Interval Precision Standard Error of the Mean Sample Size Standard Deviation Confidence Intervals for Proportions

264 Estimation Defined: Estimation – A process whereby we select a random sample from a population and use a sample statistic to estimate a population parameter.

265 Point and Interval Estimation
Point Estimate – A sample statistic used to estimate the exact value of a population parameter Confidence interval (interval estimate) – A range of values defined by the confidence level within which the population parameter is estimated to fall. Confidence Level – The likelihood, expressed as a percentage or a probability, that a specified interval will contain the population parameter.

266 Estimations Lead to Inferences
Take a subset of the population

267 Estimations Lead to Inferences
Try and reach conclusions about the population

268 Inferential Statistics involves Three Distributions:
A population distribution – variation in the larger group that we want to know about. A distribution of sample observations – variation in the sample that we can observe. A sampling distribution – a normal distribution whose mean and standard deviation are unbiased estimates of the parameters, and which allows one to infer the parameters from the statistics.

269 The Central Limit Theorem Revisited
What does this Theorem tell us?
Even if a population distribution is skewed, the sampling distribution of the mean is approximately normal when the sample size is large.
As the sample size gets larger, the mean of the sampling distribution becomes equal to the population mean.
As the sample size gets larger, the standard error of the mean decreases in size (which means that the variability in the sample estimates from sample to sample decreases as N increases).
It is important to remember that researchers do not typically conduct repeated samples of the same population. Instead, they use the knowledge of theoretical sampling distributions to construct confidence intervals around estimates.

270 Confidence Levels: Confidence Level – The likelihood, expressed as a percentage or a probability, that a specified interval will contain the population parameter. 95% confidence level – there is a .95 probability that a specified interval DOES contain the population mean. In other words, there are 5 chances out of 100 (or 1 chance out of 20) that the interval DOES NOT contain the population mean. 99% confidence level – there is 1 chance out of 100 that the interval DOES NOT contain the population mean.

271 Constructing a Confidence Interval (CI)
The sample mean is the point estimate of the population mean. The sample standard deviation is the point estimate of the population standard deviation. The standard error of the mean makes it possible to state the probability that an interval around the point estimate contains the actual population mean.

272 What We are Wanting to Do
We want to construct an estimate of where the population mean falls based on our sample statistics. The actual population parameter falls somewhere on this line. This is our confidence interval.

273 The Standard Error Standard error of the mean – the standard deviation of the sampling distribution of the mean. Standard error = σy / √N, where σy is the population standard deviation and N is the sample size.

274 Estimating standard errors
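Since the population standard deviation, and therefore the standard error, is generally not known, we usually work with the estimated standard error: Sy / √N, where Sy is the sample standard deviation.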

275 Determining a Confidence Interval (CI)
CI = Ȳ ± Z(SȲ) where: Ȳ = sample mean (estimate of μ); Z = Z score for one-half the acceptable error; SȲ = estimated standard error (Sy / √N)

276 Confidence Interval Width
Confidence Level – Increasing our confidence level from 95% to 99% means we are less willing to draw the wrong conclusion – we take a 1% risk (rather than a 5% risk) that the specified interval does not contain the true population mean. If we reduce our risk of being wrong, then we need a wider range of values, so the interval becomes less precise.

277 Confidence Interval Width
More precise, less confident More confident, less precise

278 Confidence Interval Z Values

279 Confidence Interval Width
Sample Size – Larger samples result in smaller standard errors, and therefore, in sampling distributions that are more clustered around the population mean. A more closely clustered sampling distribution indicates that our confidence intervals will be narrower and more precise.

280 Confidence Interval Width
Standard Deviation – Smaller sample standard deviations result in smaller, more precise confidence intervals. (Unlike sample size and confidence level, the researcher plays no role in determining the standard deviation of a sample.)
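The three influences on interval width (confidence level, sample size, and standard deviation) can be seen side by side in a short sketch. The function name `ci_width` and the input values are hypothetical, and the Z values 1.96 and 2.58 are the usual two-tailed values for 95% and 99% confidence.

```python
import math


def ci_width(z, s, n):
    """Full width of the interval: mean ± z * (s / sqrt(n))."""
    return 2 * z * s / math.sqrt(n)


Z_95, Z_99 = 1.96, 2.58  # conventional Z values for 95% and 99% confidence

base = ci_width(Z_95, s=10.0, n=100)
print("baseline (95%, s=10, N=100):", round(base, 2))
print("99% confidence level:       ", round(ci_width(Z_99, s=10.0, n=100), 2))
print("four times the sample size: ", round(ci_width(Z_95, s=10.0, n=400), 2))
print("half the standard deviation:", round(ci_width(Z_95, s=5.0, n=100), 2))
```

Raising the confidence level widens the interval, while a larger sample or a smaller standard deviation narrows it; note that quadrupling N only halves the width.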

281 Example: Sample Size and Confidence Intervals

282 Example: Sample Size and Confidence Intervals

283 Example: Hispanic Migration and Earnings
From 1980 Census data: Cubans had an average income of $16,368 (Sy = $3,069), N=3895 Mexicans had an average of $13, (Sy = $9,414), N=5726 Puerto Ricans had an average of $12,587 (Sy = $8,647), N=5908

284 Example: Hispanic Migration and Earnings
Now, compute the 95% CIs for all three groups: Cubans: standard error = 3069/√3895 = 49.17; 95% CI = 16,368 ± 1.96(49.17) = 16,272 to 16,464. Mexicans: s.e. = 9414/√5726; 95% CI = 13,098 to 13,586.

285 Example: Hispanic Migration and Earnings
Puerto Ricans: s.e. = 8647/√5908 = 112.5; 95% CI = 12,587 ± 1.96(112.5) = 12,367 to 12,807
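The earnings example can be reproduced with a few lines of code. This is a sketch rather than part of the original slides: the function name `mean_ci` is hypothetical, and only the two groups whose means are fully legible in this transcript (Cubans and Puerto Ricans) are included.

```python
import math


def mean_ci(mean, s, n, z=1.96):
    """Return (estimated standard error, lower bound, upper bound)."""
    se = s / math.sqrt(n)  # estimated standard error: Sy / sqrt(N)
    return se, mean - z * se, mean + z * se


# (mean income, sample standard deviation, N) from the 1980 Census example
groups = {
    "Cubans": (16368, 3069, 3895),
    "Puerto Ricans": (12587, 8647, 5908),
}

for name, (mean, s, n) in groups.items():
    se, low, high = mean_ci(mean, s, n)
    print(f"{name}: s.e. = {se:.2f}, 95% CI = {low:,.0f} to {high:,.0f}")
```

Passing `z=2.58` instead of the default gives the corresponding 99% intervals.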

286 Example: Hispanic Migration and Earnings

287 Confidence Intervals for Proportions
Estimating the standard error of a proportion – based on the Central Limit Theorem, a sampling distribution of proportions is approximately normal, with a mean, μp, equal to the population proportion, π, and with a standard error of proportions equal to: σp = √(π(1 − π)/N). Since the standard error of proportions is generally not known, we usually work with the estimated standard error: sp = √(p(1 − p)/N)

288 Determining a Confidence Interval for a Proportion
CI = p ± Z(sp) where: p = observed sample proportion (estimate of π); Z = Z score for one-half the acceptable error; sp = estimated standard error of the proportion

289 Confidence Intervals for Proportions
Protestants in favor of banning stem cell research: N = 2,188, p = .37. Calculate the estimated standard error: sp = √((.37)(.63)/2,188) = .010. Determine the confidence level: let's say we want to be 95% confident. 95% CI = .37 ± 1.96(.010) = .37 ± .020 = .35 to .39

290 Confidence Intervals for Proportions
Catholics in favor of banning stem cell research: N = 880, p = .32. Calculate the estimated standard error: sp = √((.32)(.68)/880) = .016. Determine the confidence level: let's say we want to be 95% confident. 95% CI = .32 ± 1.96(.016) = .32 ± .031 = .29 to .35

291 Confidence Intervals for Proportions
Interpretation: We are 95 percent confident that the true population proportion supporting a ban on stem-cell research is somewhere between .35 and .39 (or between 35.0% and 39.0%) for Protestants, and somewhere between .29 and .35 (or between 29.0% and 35.0%) for Catholics.
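The two proportion intervals above can be recomputed as follows. This is a minimal sketch: the function name `proportion_ci` is hypothetical, and it assumes the usual estimated standard error of a proportion, √(p(1 − p)/N), with Z = 1.96 for 95% confidence.

```python
import math


def proportion_ci(p, n, z=1.96):
    """Return (estimated standard error, lower bound, upper bound)."""
    se = math.sqrt(p * (1 - p) / n)  # estimated standard error of the proportion
    return se, p - z * se, p + z * se


# (sample proportion favoring a ban, N) from the stem cell research example
samples = {"Protestants": (0.37, 2188), "Catholics": (0.32, 880)}

for name, (p, n) in samples.items():
    se, low, high = proportion_ci(p, n)
    print(f"{name}: s.e. = {se:.3f}, 95% CI = {low:.2f} to {high:.2f}")
```

Because the Catholic sample is smaller, its standard error is larger and its interval is wider.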

292 Chapter 12: Testing Hypotheses
Overview Research and null hypotheses One and two-tailed tests Errors Testing the difference between two means t tests

293 Overview
You already know how to deal with two nominal variables. The approach depends on how the independent and dependent variables are measured:
Nominal independent variable, nominal dependent variable: considers the distribution of one variable across the categories of another variable.
Nominal independent variable, interval dependent variable: considers the difference between the mean of one group on a variable with another group.
Interval independent variable, nominal dependent variable: considers how a change in a variable affects a discrete outcome.
Interval independent variable, interval dependent variable: considers the degree to which a change in one variable results in a change in another.

294 Overview
You already know how to deal with two nominal variables (Lambda).
Nominal independent variable, interval dependent variable: considers the difference between the mean of one group on a variable with another group. TODAY! Testing the differences between groups.
Interval independent variable, nominal dependent variable: considers how a change in a variable affects a discrete outcome.
Interval independent variable, interval dependent variable: considers the degree to which a change in one variable results in a change in another.

295 Overview
You already know how to deal with two nominal variables (Lambda).
Nominal independent variable, interval dependent variable: TODAY! Testing the differences between groups with confidence intervals and the t-test.
Interval independent variable, nominal dependent variable: considers how a change in a variable affects a discrete outcome.
Interval independent variable, interval dependent variable: considers the degree to which a change in one variable results in a change in another.

296 General Examples Is one group scoring significantly higher on average than another group? Is a group statistically different from another on a particular dimension? Is Group A’s mean higher than Group B’s?

297 Specific Examples Do people living in rural communities live longer than those in urban or suburban areas? Do students from private high schools perform better in college than those from public high schools? Is the average number of years with an employer lower or higher for large firms (over 100 employees) compared to those with fewer than 100 employees?

298 Testing Hypotheses Statistical hypothesis testing – A procedure that allows us to evaluate hypotheses about population parameters based on sample statistics. Research hypothesis (H1) – A statement reflecting the substantive hypothesis. It is always expressed in terms of population parameters, but its specific form varies from test to test. Null hypothesis (H0) – A statement of “no difference,” which contradicts the research hypothesis and is always expressed in terms of population parameters.

299 Research and Null Hypotheses
One Tail – specifies the hypothesized direction. Research Hypothesis: H1: μ2 > μ1, or μ2 − μ1 > 0. Null Hypothesis: H0: μ2 = μ1, or μ2 − μ1 = 0. Two Tail – direction is not specified (more common). H1: μ2 ≠ μ1, or μ2 − μ1 ≠ 0

300 One-Tailed Tests One-tailed hypothesis test – A hypothesis test in which the alternative is stated in such a way that the probability of making a Type I error is entirely in one tail of a sampling distribution. Right-tailed test – A one-tailed test in which the sample outcome is hypothesized to be at the right tail of the sampling distribution. Left-tailed test – A one-tailed test in which the sample outcome is hypothesized to be at the left tail of the sampling distribution.

301 Two-Tailed Tests Two-tailed hypothesis test – A hypothesis test in which the region of rejection falls equally within both tails of the sampling distribution.

302 Probability Values Z statistic (obtained) – The test statistic computed by converting a sample statistic (such as the mean) to a Z score. The formula for obtaining Z varies from test to test. P value – The probability associated with the obtained value of Z.

303 Probability Values

304 Probability Values Alpha (α) – The level of probability at which the null hypothesis is rejected. It is customary to set alpha at the .05, .01, or .001 level.

305 Five Steps to Hypothesis Testing
(1) Making assumptions (2) Stating the research and null hypotheses and selecting alpha (3) Selecting the sampling distribution and specifying the test statistic (4) Computing the test statistic (5) Making a decision and interpreting the results

306 Type I and Type II Errors
Type I error (false rejection error) – the probability (equal to α) associated with rejecting a true null hypothesis. Type II error (false acceptance error) – the probability associated with failing to reject a false null hypothesis.
Based on sample results, the decision made is to…
In the population, H0 is ...   reject H0            do not reject H0
true                           Type I error (α)     correct decision
false                          correct decision     Type II error

307 t Test t statistic (obtained) – The test statistic computed to test the null hypothesis about a population mean when the population standard deviation is unknown and is estimated using the sample standard deviation. t distribution – A family of curves, each determined by its degrees of freedom (df). It is used when the population standard deviation is unknown and the standard error is estimated from the sample standard deviation. Degrees of freedom (df) – The number of scores that are free to vary in calculating a statistic.

308 t distribution

309 t distribution table

310 t-test for difference between two means
Is the value of μ2 − μ1 significantly different from 0? This test gives you the answer: t = (the difference between the two means) ÷ (the estimated standard error of the difference). If the absolute value of t is greater than 1.96, the difference between the means is significantly different from zero at an alpha of .05 (or a 95% confidence level). The critical value of t will be higher than 1.96 if the total N is less than 122. See Appendix C for exact critical values when N < 122.

311 Estimated Standard Error of the difference between two means assuming unequal variances

312 t-test and Confidence Intervals
The t-test is essentially creating a confidence interval around the difference score. Rearranging the t formula, we can calculate the confidence interval around the difference between two means: (difference between the two means) ± t × (estimated standard error of the difference). If this confidence interval includes zero, then we cannot conclude that there is a difference between the means for the two groups.
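The link between the t statistic and the confidence interval can be sketched in code. Two assumptions are made here: the estimated standard error takes the usual unequal-variance form, √(S1²/N1 + S2²/N2), because the slide's own formula did not survive in this transcript, and the group statistics are hypothetical values chosen for illustration. The function names are also hypothetical.

```python
import math


def se_difference(s1, n1, s2, n2):
    """Estimated standard error of the difference, unequal variances assumed."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def t_and_ci(mean1, s1, n1, mean2, s2, n2, critical=1.96):
    """Return (t, lower bound, upper bound) for the difference mean2 - mean1."""
    diff = mean2 - mean1
    se = se_difference(s1, n1, s2, n2)
    return diff / se, diff - critical * se, diff + critical * se


# Hypothetical groups; the total N is 200, so 1.96 is a reasonable critical value.
t, low, high = t_and_ci(mean1=10.0, s1=3.0, n1=100, mean2=11.0, s2=4.0, n2=100)
print(f"t = {t:.2f}, 95% CI for the difference = {low:.2f} to {high:.2f}")
print("significant at alpha = .05:", abs(t) > 1.96)
```

The two checks always agree: the absolute value of t exceeds the critical value exactly when the interval excludes zero.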

313 Why a t score and not a Z score?
Use of the Z distribution assumes the population standard error of the difference is known. In practice, we have to estimate it and so we use a t score. When N gets larger than 50, the t distribution converges toward the Z distribution, so the results would be nearly identical regardless of whether you used a t or Z. In most sociological studies, you will not need to worry about the distinction between Z and t.

314 What can we conclude about the difference in wages?
t-Test Example 1 Mean pay according to gender: N Mean Pay S.D. Women 46 $ Men 54 $ What can we conclude about the difference in wages?

315 What can we conclude about the difference in wages?
t-Test Example 2 Mean pay according to gender: N Mean Pay S.D. Women 57 $ Men 51 $ What can we conclude about the difference in wages?

316 In-Class Exercise Using these GSS income data, calculate a t-test statistic to determine if the difference between the two group means is statistically significant.

317 Chapter 13: The Chi-Square Test
Chi-Square as a Statistical Test Statistical Independence Hypothesis Testing with Chi-Square The Assumptions Stating the Research and Null Hypothesis Expected Frequencies Calculating Obtained Chi-Square Sampling Distribution of Chi-Square Determining the Degrees of Freedom Limitations of Chi-Square Test

318 Chi-Square as a Statistical Test
Chi-square test: an inferential statistics technique designed to test for significant relationships between two variables organized in a bivariate table. Chi-square requires no assumptions about the shape of the population distribution from which a sample is drawn. It can be applied to nominally or ordinally measured variables.

319 Statistical Independence
Independence (statistical): the absence of association between two cross-tabulated variables. The percentage distributions of the dependent variable within each category of the independent variable are identical.

320 Hypothesis Testing with Chi-Square
Chi-square follows five steps: (1) Making assumptions (random sampling) (2) Stating the research and null hypotheses and selecting alpha (3) Selecting the sampling distribution and specifying the test statistic (4) Computing the test statistic (5) Making a decision and interpreting the results

321 The Assumptions The chi-square test requires no assumptions about the shape of the population distribution from which the sample was drawn. However, like all inferential techniques it assumes random sampling. It can be applied to variables measured at a nominal and/or an ordinal level of measurement.

322 Stating Research and Null Hypotheses
The research hypothesis (H1) proposes that the two variables are related in the population. The null hypothesis (H0) states that no association exists between the two cross-tabulated variables in the population, and therefore the variables are statistically independent.

323 H1: The two variables are related in the population.
Gender and fear of walking alone at night are statistically dependent.
Afraid   Men      Women    Total
No       83.3%    57.2%    71.1%
Yes      16.7%    42.8%    28.9%
Total    100%     100%     100%

324 H0: There is no association between the two variables.
Gender and fear of walking alone at night are statistically independent.
Afraid   Men      Women    Total
No       71.1%    71.1%    71.1%
Yes      28.9%    28.9%    28.9%
Total    100%     100%     100%

325 The Concept of Expected Frequencies
Expected frequencies (fe): the cell frequencies that would be expected in a bivariate table if the two variables were statistically independent. Observed frequencies (fo): the cell frequencies actually observed in a bivariate table.

326 Calculating Expected Frequencies
fe = (column marginal)(row marginal) / N. To obtain the expected frequencies for any cell in any cross-tabulation in which the two variables are assumed independent, multiply the row and column totals for that cell and divide the product by the total number of cases in the table.

327 Chi-Square (obtained)
The test statistic that summarizes the differences between the observed (fo) and the expected (fe) frequencies in a bivariate table.

328 Calculating the Obtained Chi-Square
χ² = Σ (fo − fe)² / fe where: fe = expected frequencies; fo = observed frequencies
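The calculation of expected frequencies and the obtained chi-square can be sketched as follows. The observed counts are hypothetical (the slides give only percentages for the fear-of-walking example), and the function names are illustrative.

```python
def expected_frequencies(observed):
    """fe for each cell: (row marginal)(column marginal) / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Obtained chi-square: sum of (fo - fe)^2 / fe over all cells."""
    expected = expected_frequencies(observed)
    return sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )


# Hypothetical counts: rows are two categories of the dependent variable,
# columns are two categories of the independent variable.
observed = [[30, 10],
            [20, 40]]

print(expected_frequencies(observed))
print(round(chi_square(observed), 3))
```

When the observed counts match the expected counts exactly, the obtained chi-square is zero, which is what statistical independence looks like in the table.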

329 The Sampling Distribution of Chi-Square
The sampling distribution of chi-square tells the probability of getting values of chi-square, assuming no relationship exists in the population. The chi-square sampling distributions depend on the degrees of freedom. The χ² sampling distribution is not one distribution, but is a family of distributions.

330 The Sampling Distribution of Chi-Square
The distributions are positively skewed. The research hypothesis for the chi-square is always a one-tailed test. Chi-square values are always positive. The minimum possible value is zero, with no upper limit to its maximum value. As the number of degrees of freedom increases, the χ² distribution becomes more symmetrical.

331

332 Determining the Degrees of Freedom
df = (r – 1)(c – 1) where r = the number of rows c = the number of columns

333 Calculating Degrees of Freedom
How many degrees of freedom would a table with 3 rows and 2 columns have? (3 – 1)(2 – 1) = 2 2 degrees of freedom

334 Chapter 14: Analysis of Variance
Understanding Analysis of Variance The Structure of Hypothesis Testing with ANOVA Decomposition of SST Assessing the Relationship Between Variables SPSS Applications Reading the Research Literature

335 ANOVA Analysis of Variance (ANOVA) – An inferential statistics technique designed to test for a significant relationship between two variables in two or more samples. The logic is the same as in t-tests, just extended to independent variables with more than two categories (samples).

336 Understanding Analysis of Variance
One-way ANOVA – An analysis of variance procedure using one dependent and one independent variable. ANOVAs examine the differences between samples, as well as the differences within a single sample.

337 The Structure of Hypothesis Testing with ANOVA Assumptions:
(1) Independent random samples are used. Our choice of sample members from one population has no effect on the choice of members from subsequent populations. (2) The dependent variable is measured at the interval-ratio level. Some researchers, however, do apply ANOVA to ordinal level measurements.

338 The Structure of Hypothesis Testing with ANOVA Assumptions:
(3) The population is normally distributed. Though we generally cannot confirm whether the populations are normal, we must assume that the population is normally distributed in order to continue with the analysis. (4) The population variances are equal.

339 Stating the Research and Null Hypotheses
H1: At least one mean is different from the others. H0: μ1 = μ2 = μ3 = μ4

340 The Structure of Hypothesis Testing with ANOVA Between-Group Sum of Squares
This tells us the differences between the groups. SSB = Σ Nk(Ȳk − Ȳ)² where: Nk = the number of cases in a sample (k represents the number of different samples); Ȳk = the mean of a sample; Ȳ = the overall mean

341 The Structure of Hypothesis Testing with ANOVA Within-Group Sum of Squares
This tells us the variations within our groups; it also tells us the amount of unexplained variance. SSW = Σ(Yi − Ȳk)², summed over every score in every sample, where: Nk = the number of cases in a sample (k represents the number of different samples); Ȳk = the mean of a sample; Yi = each individual score in a sample

342 Alternative Formula for Calculating the Within-Group Sum of Squares
SSW = ΣYi² − Σ[(ΣYk)² / Nk] where Yi² = the squared scores from each sample, ΣYk = the sum of the scores of each sample, and Nk = the total number of cases in each sample

343 The Structure of Hypothesis Testing with ANOVA Total Sum of Squares
SST = Σ(Yi − Ȳ)² where: Nk = the number of cases in a sample (k represents the number of different samples); Yi = each individual score; Ȳ = the overall mean

344 The Structure of Hypothesis Testing with ANOVA Mean Square Between
An estimate of the between-group variance obtained by dividing the between-group sum of squares by its degrees of freedom. Mean square between = SSB/dfb where dfb = degrees of freedom between dfb = k – 1 k = number of categories

345 The Structure of Hypothesis Testing with ANOVA Mean Square Within
An estimate of the within-group variance obtained by dividing the within-group sum of squares by its degrees of freedom. Mean square within = SSW/dfw where dfw = degrees of freedom within dfw = N – k N = total number of cases k = number of categories

346 The F Statistic The ratio of between-group variance to within-group variance: F = mean square between / mean square within

347 Definitions F ratio (F statistic) – Used in an analysis of variance, the F statistic represents the ratio of between-group variance to within-group variance F obtained – The test statistic computed by the ratio for between-group to within-group variance. F critical – The F score associated with a particular alpha level and degrees of freedom. This F score marks the beginning of the region of rejection for our null hypothesis.
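The ANOVA quantities defined above fit together as in the sketch below. The three small samples are hypothetical, and the function name `one_way_anova` is illustrative; the code follows the definitions of SSB, SSW, the two mean squares, and the F ratio given in this chapter.

```python
def one_way_anova(samples):
    """Return SSB, SSW, SST, the two mean squares, and the obtained F."""
    all_scores = [y for sample in samples for y in sample]
    n_total = len(all_scores)
    k = len(samples)  # number of categories (samples)
    grand_mean = sum(all_scores) / n_total
    means = [sum(sample) / len(sample) for sample in samples]

    # Between-group sum of squares: sum of Nk * (sample mean - overall mean)^2
    ssb = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Within-group sum of squares: squared deviations from each sample's own mean
    ssw = sum((y - m) ** 2 for s, m in zip(samples, means) for y in s)
    # Total sum of squares: squared deviations from the overall mean
    sst = sum((y - grand_mean) ** 2 for y in all_scores)

    msb = ssb / (k - 1)        # dfb = k - 1
    msw = ssw / (n_total - k)  # dfw = N - k
    return {"SSB": ssb, "SSW": ssw, "SST": sst,
            "MSB": msb, "MSW": msw, "F": msb / msw}


# Hypothetical scores for three samples of three cases each
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for name, value in result.items():
    print(name, round(value, 3))
```

The obtained F is then compared with the critical F for the chosen alpha and the degrees of freedom (dfb, dfw); note that SSB and SSW add up to SST.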

348 F distribution table: critical F values for a given alpha, by dfb and dfw

349 Example: Obtained vs. Critical F
Since the obtained F is beyond the critical F value, we reject the Null hypothesis of no difference

350 SPSS Example: Bush’s Job Approval

351 SPSS Example: Clinton’s Job Approval

352 Reading the Research Literature

353 Reading the Research Literature

