Inference, Representation, Replication & Programmatic Research. Populations & Samples -- Parameters & Statistics; Descriptive vs. Inferential Statistics; How we evaluate and support our inferences; Identifying & evaluating replications; Extracting replications from larger studies
Just a bit of review … The group that we are hoping our data will represent is the population. The list of folks in that group that we have is the sampling frame. The folks we select from that list with the intent of obtaining data from each are the selected sample. The group from which we actually collect data is the data sample. Some new language … Parameter -- summary of a population characteristic. Statistic -- summary of a sample characteristic. Just two more … Descriptive Statistic -- calculated from sample data to describe the sample. Inferential Statistic -- calculated from sample data to infer about a specific population parameter.
So, let’s put it together -- first a simple one … The purpose of the study was to determine the mean number of science credits taken by all students previously enrolled in Psyc350. We got the list of everybody enrolled last fall and from it selected 20 students to ask about their science coursework. It turned out that 3 had dropped, one was out of town, and another didn’t remember whether they had taken one chem course or two. Sampling -- Identify each of the following... Target population: Psyc350 enrollees. Sampling frame: last fall’s enrollment list. Selected sample: 20 from last fall. Data sample: 15 providing data. Identify each of the following… Mean # science credits from all students previously enrolled in Psyc350: population parameter. Mean # science credits from 15 students enrolled in Psyc350 during the fall semester who were in the data sample: descriptive statistic. Mean # science credits from all students previously enrolled in Psyc350 estimated from the 15 students in the data sample: inferential statistic.
Reviewing descriptive and inferential statistics … Remember that the major difference between descriptive and inferential statistics is intent: what information you intend to get from the statistic. Descriptive statistics are obtained from the sample, used to describe characteristics of the sample, and used to determine if the sample represents the target population by comparing sample statistics and population parameters. Inferential statistics are used to describe, infer, estimate, or approximate characteristics of the target population. Parameters are descriptions of population characteristics; they usually aren’t obtained from the population (we can’t measure everybody); ideally they come from repeated large samplings that produce consistent results, giving us confidence to use them as parameters. Let’s look at an example of the interplay of these three…
First.. Sampling -- Identify each of the following... We used to interview everybody from senior managers to part-time janitorial staff once a year to get a feel for “How things were going?”. Generally we found we had 70% female and 30% male employees, divided about 10% management, 70% clerical and 20% service/janitorial, with an average age of 32.5 (std = 5.0) years and 7.4 (std = 5.0) years seniority. From these folks we usually got an overall satisfaction rating of about 5.2 (std = 1.1) on a 7-point scale. With the current cost increases we can no longer interview everybody. So, this year we had a company that conducts surveys complete the interview using a sample of 120 employees who volunteered in response to a notice in the weekly company newsletter. We were very disappointed to find that the overall satisfaction rating had dropped to 3.1 (std = 1.0). At a meeting to discuss how to improve worker satisfaction, one of the younger managers asked to see the rest of the report and asked us to look carefully at one of the tables… Table 3 -- Sample statistics (from 100 completed interviews): gender: 50 males (50%), 50 females (50%); job: 34 mang. (34%), 30 cler. (30%), 36 ser/jan (36%); age: mean = 21.3, std = 10; seniority: mean = 2.1, std = 6. Target population: our company. Sampling frame: company newsletter. Selected sample: 120 volunteers. Data sample: 100 who completed the survey.
And now … Kinds of “values” -- identify all of each type … (same scenario and Table 3 as in the sampling exercise above): parameters, descriptive statistics, inferential statistics
Of course, the real question is whether the “3.1 rating” is cause for concern… Should we interpret the mean rating of 3.1 as indicating that the folks who work here are much less satisfied than they used to be? Why or why not? Looks bad, doesn’t it? Well, that depends upon whether the sample is representative of the population. Any way to check that? We can compare sample descriptive statistics & population parameters. This sample is not representative of the population: the sample is “too male,” “too managerial & janitorial,” “too young,” and “too short-tenured” compared to the population parameters, so the 3.1 rating may not tell us about satisfaction in the company as a whole.
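The comparison described above can be sketched in a few lines of Python. This is only an illustration: it treats the earlier company-wide figures as the population parameters and the Table 3 counts as the sample descriptive statistics, and the names (population, sample_counts, proportion_gaps, goodness_of_fit) are made up for this sketch, not part of the original report.

```python
# Population parameters (from the years when everybody was interviewed) and
# sample descriptive statistics (Table 3, 100 completed interviews).
population = {"female": 0.70, "male": 0.30,
              "management": 0.10, "clerical": 0.70, "service": 0.20}
sample_counts = {"female": 50, "male": 50,
                 "management": 34, "clerical": 30, "service": 36}
n = 100

def proportion_gaps(counts, params, n):
    """Sample proportion minus population proportion for each category."""
    return {k: counts[k] / n - params[k] for k in params}

def goodness_of_fit(counts, params, n, categories):
    """Chi-square goodness-of-fit: sum of (observed - expected)^2 / expected."""
    return sum((counts[c] - n * params[c]) ** 2 / (n * params[c])
               for c in categories)

gaps = proportion_gaps(sample_counts, population, n)
job_chi2 = goodness_of_fit(sample_counts, population, n,
                           ["management", "clerical", "service"])

for name, gap in gaps.items():
    print(f"{name:>10}: {gap:+.2f}")
print(f"job category chi-square (df = 2): {job_chi2:.2f}")
```

Large gaps (and a large goodness-of-fit value) are the numerical version of “too male” and “too managerial & janitorial.”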
Here’s the point… Our theories, hypotheses and implementations are about populations, but (because we can never, ever collect data from the entire population) our data come from samples !!! We infer that the data and analysis results we obtain from our sample tell us about the relationships between those variables in the population! How do we know our inferences are correct? We can never “know”: there will never be “proof” (only evidence). We check the selection/sampling procedures we used. We check that our sample statistics match known population parameters (when we know those parameters). We check if our results agree with previous results from “the lit”. We perform replication and converging operations research. Here’s another version of this same idea !!!
“Critical Experiment” vs. “Converging Operations” You might be asking yourself, “How can we be sure we ‘got the study right’?” How can we be sure that we … have a sample that represents the target population? … have the best research design? … have good measures, tasks and a good setting? … did the right analyses and made the correct interpretations? Said differently: How can we be sure we’re running the right study in the right way ??? This question assumes the “critical experiment” approach to empirical research: that there is “one correct way to run the one correct study” and the answer to that study will be “proof”. For both philosophical and pragmatic reasons (that will become apparent as we go along) scientific psychologists have abandoned this approach and adopted “converging operations”: the process of running multiple comparable versions of each study and looking for consistency (or determining sources of inconsistencies), also called the Research Loop.
The “Research Loop” (Novel RH:, Replication, Convergence): Library Research -- learning “what is known” about the target behavior; Hypothesis Formation -- based on Lib. Rsh., propose some “new knowledge”; Research Design -- determine how to obtain the data to test the RH:; Data Collection -- carrying out the research design and getting the data; Data Analysis -- data collation and statistical analysis; Hypothesis Testing -- based on design properties and statistical results; Draw Conclusions -- decide how your “new knowledge” changes “what is known” about the target behavior.
“Comparable” studies -- replication. The more similar the studies, the more direct the test of replication and the more meaningful a “failure to replicate” will be. The more differences between the studies, the more “reasons” the results might not agree, and so the less meaningful a “failure to replicate” will be. Ground rules… Same or similar IV (qual vs. quant not important); same or similar DV (qual vs. quant not important); similar population, setting & task/stimulus; note if similar design (e.g., experiment or non-experiment).
“Comparing” studies -- replication. If the studies are comparable, then the comparison is based on effect size (r) and direction/pattern. Direction/pattern: be sure to take DV “direction” into account (e.g., measuring % correct vs. % error, or “depression” vs. “happiness”). Effect size: don’t get too fussy about effect size comparability … Remember .1 = small, .3 = medium, .5 = large. Smaller but in the same direction is still pretty similar. What about differences in “significance”??? If the effect sizes are similar, these are usually just “power” or “sample size” differences, far less important than effect size/direction!
Replication – some classic “conundrums” #1 Imagine there are four previous studies, all procedurally similar, all looking at the relationship between social skills and helping behavior. Here are the results… r(36) = .28, p>.05 r(65) = .27, p<.05 r(72) = .27, p<.05 r(31) = .28, p>.05 Do these studies replicate one another? Yes !!! the studies all found similar effects (size & direction) the differences in significance are due to power & sample size differences
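To see how nearly identical effect sizes can land on opposite sides of p = .05, here is a small Python sketch. It uses the Fisher z normal approximation rather than the exact t test a stats package would report, so the p-values are approximate; the function name approx_p_for_r is just for this illustration.

```python
import math

def approx_p_for_r(r, df):
    """Approximate two-tailed p for H0: rho = 0 via the Fisher z transform.

    df is the value reported in r(df), so N = df + 2.
    """
    n = df + 2
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-tailed normal probability: 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2))

# (df, r) for the four studies on social skills and helping behavior
studies = [(36, 0.28), (65, 0.27), (72, 0.27), (31, 0.28)]
for df, r in studies:
    p = approx_p_for_r(r, df)
    verdict = "p < .05" if p < 0.05 else "p > .05"
    print(f"r({df}) = {r:.2f}: approx p = {p:.3f} ({verdict})")
```

The effect sizes barely move across the four studies; only the sample sizes (and so the power) change which side of .05 each one falls on.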
Replication – some classic “conundrums” #2 Imagine there are four previous studies, all procedurally similar, all looking at the relationship between amount of therapy and decrease in depressive symptoms. Here are the results… r(36) = .25, p>.05 r(42) = .27, p>.05 r(51) = .27, p>.05 r(31) = .25, p>.05 Given these results, what is your “best guess” of the population correlation between amount of therapy and decrease in depressive symptoms? r = .00, since none of the results were significant? Or r ≈ .25 - .27, since this is the consistent answer? I’d go with the .25-.27, but encourage someone to do an a priori power analysis before the next study!!!
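A rough sketch of both steps, in Python: pooling the four correlations into one “best guess,” and then an a priori power analysis for the next study. The pooling uses a Fisher z weighted average and the sample-size formula is the usual normal approximation; both are common textbook shortcuts chosen for this illustration, not something the four studies themselves reported, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def pooled_r(studies):
    """Weighted mean correlation via Fisher z, with weights N - 3 (N = df + 2)."""
    weights = [df + 2 - 3 for df, _ in studies]
    zs = [math.atanh(r) for _, r in studies]
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(mean_z)

def n_for_power(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect a population correlation r (two-tailed)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

# (df, r) for the four studies on amount of therapy and symptom decrease
studies = [(36, 0.25), (42, 0.27), (51, 0.27), (31, 0.25)]
best_guess = pooled_r(studies)
needed = n_for_power(best_guess)
print(f"pooled r = {best_guess:.2f}; approx N for 80% power = {needed}")
```

Under these assumptions the required N comes out at roughly double the largest of the four samples, which is why none of them reached significance even though they agree with one another.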
Remember this one … Researcher #1 acquired 20 computers of each type, had research assistants (working in shifts & following a prescribed protocol) keep each machine working continually for 24 hours & count the number of times each machine failed and was re-booted. Mean failures PC = 5.7, std = 2.1; Mean failures Mac = 3.6, std = 2.1; F(1,38) = 10.26, p = .003; r = √[F / (F + df)] = √[10.26 / (10.26 + 38)] = .46. Researcher #2 acquired 20 computers of each type, had research assistants (working in shifts & following a prescribed protocol) keep each machine working continually for 24 hours or until it failed. PC: 15 failed, 5 not; Mac: 6 failed, 14 not; X2(1) = 8.12, p = .004; r = √(X2 / N) = √(8.12 / 40) = .45. So, by computing effect sizes and effect direction/pattern, we can compare these similar studies (same IV – conceptually similar DV) and see that the results replicate!
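The two effect-size conversions used on this slide, r = √[F / (F + df)] for a two-group F and r = √(X2 / N) for a 2x2 chi-square, can be written as a pair of small Python helpers. The function names are made up for this sketch.

```python
import math

def r_from_f(f, df_error):
    """Effect size r from a two-group F test (1 numerator df)."""
    return math.sqrt(f / (f + df_error))

def r_from_chi2(chi2, n):
    """Effect size r from a 2x2 chi-square test (df = 1), N = total cases."""
    return math.sqrt(chi2 / n)

r1 = r_from_f(10.26, 38)     # Researcher #1: F(1,38) = 10.26
r2 = r_from_chi2(8.12, 40)   # Researcher #2: X2(1) = 8.12, N = 40
print(f"r from F = {r1:.2f}, r from chi-square = {r2:.2f}")
```

Putting both studies on the same r metric is what makes the “do they replicate?” comparison possible, even though one DV is a count and the other is failed vs. not.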
Try this one … Researcher #1 asked each of 60 students whether or not they had completed the 20-problem on-line exam preparation and noted their scores on the exam (%). She used BG ANOVA to compare the mean % of the two groups. Completed Exam Prep = 83%, No Exam Prep = 76%; F(1,58) = 8.32, p = .003; r = √[F / (F + df)] = √[8.32 / (8.32 + 58)] = .35. Researcher #2 asked each of 304 students how many of the on-line exam questions they had completed and noted their scores on the exam (%). She used correlation to test for a relationship between these two quantitative variables. r(301) = .12, p = .042. Comparing the two studies we see that while the effects are in the same direction (better performance is associated with “more” on-line exam practice), the size of the effects in the two studies is very different. Also, the significance of the second effect is due to the huge sample size!
And this one … Researcher #1 interviewed 85 patients from a local clinic and recorded the number of weeks of therapy they had attended and the change in their wellness scores. She used correlation to examine the relationship between these two variables. r(83) = .30, p = .002. Researcher #2 assigned each of 120 patients to receive group therapy or not and noted whether or not they had improved after 24 weeks. She used X2 to examine the relationship between these variables. Therapy: 45 improved, 15 not; Control: 25 improved, 35 not; X2(1) = 13.71, p < .001; r = √(X2 / N) = √(13.71 / 120) = .34. So, by computing effect sizes and effect direction/pattern, we can compare these similar studies (conceptually similar IV & DV) and see that the results show a strong replication!
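If you want to check Researcher #2's numbers from the cell counts themselves, a minimal Python sketch follows. It computes the ordinary Pearson chi-square (no continuity correction) for a 2x2 table and converts it to r; the function name chi2_2x2 and the reading of the table as therapy 45/15 and control 25/35 are assumptions made for this sketch (the chi-square is the same if the table is read the other way around).

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Rows: Therapy, Control; columns: Improved, Not improved
chi2 = chi2_2x2(45, 15, 25, 35)
n = 45 + 15 + 25 + 35
r_study2 = math.sqrt(chi2 / n)
r_study1 = 0.30  # correlation reported by Researcher #1
print(f"X2(1) = {chi2:.2f}, r = {r_study2:.2f} vs. r = {r_study1:.2f}")
```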
Replication & Generalization in k-group Designs -- ANOVA. Most k-group designs are an “expansion” or an extension of an earlier, simpler design. When comparing with a 2-group design, be sure to use the correct conditions. Study #1: Mean failures PC = 5.7, std = 2.1; Mean failures Mac = 3.6, std = 2.1; F(1,38) = 10.26, p = .003; r = √[F / (F + df)] = .46. Study #2: Mean failures IBM = 5.9, std = 2.1; Mean failures Dell = 3.8, std = 2.1; Mean failures Mac = 3.6, std = 2.1; F(2,57) = 10.26, p = .003, MSe = [not shown]. We need to know what “PC” means in the first study! What if “PC” were IBM? What if “PC” were Dell? What if “PC” were something else?
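One way to explore the “What if PC were IBM? What if PC were Dell?” questions is to pull each 2-group comparison out of Study #2 and convert it to r. The Python sketch below does that from the reported means and stds. It assumes 20 machines per group (60 machines total is implied by F(2,57), but equal group sizes are an assumption), and it computes a simple two-group F rather than a pairwise test based on the full-design MSe; the helper names are hypothetical.

```python
import math

def two_group_f(mean1, mean2, sd1, sd2, n):
    """F(1, 2n - 2) for two equal-n groups, from their means and SDs."""
    ms_between = n * (mean1 - mean2) ** 2 / 2   # SS between with 1 df
    ms_error = (sd1 ** 2 + sd2 ** 2) / 2        # pooled within-group variance
    return ms_between / ms_error

def r_from_f(f, df_error):
    """Effect size r from a two-group F test."""
    return math.sqrt(f / (f + df_error))

n = 20                 # ASSUMED per-group n for Study #2
mac = (3.6, 2.1)       # (mean failures, std)
pcs = {"IBM": (5.9, 2.1), "Dell": (3.8, 2.1)}

pairwise_r = {}
for label, (mean, sd) in pcs.items():
    f = two_group_f(mean, mac[0], sd, mac[1], n)
    pairwise_r[label] = r_from_f(f, 2 * n - 2)
    print(f"{label} vs. Mac: F(1,{2 * n - 2}) = {f:.2f}, r = {pairwise_r[label]:.2f}")
```

Under these assumptions the IBM vs. Mac effect is close to Study #1's r = .46, while the Dell vs. Mac effect is near zero, so whether Study #2 “replicates” Study #1 depends entirely on which machines counted as “PC.”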
Replication & Generalization in k-group Designs -- X2. Most k-group designs are an “expansion” or an extension of an earlier, simpler design. Study #1: PC: 15 failed, 5 not; Mac: 6 failed, 14 not; X2(1) = 8.12, p = .004; r = √(X2 / N) = .45. Study #2: IBM: 16 failed, 4 not; Dell: 5 failed, 15 not; Mac: 7 failed, 13 not; X2(2) = 13.79, p = .001. We need to know what “PC” means in the first study! What if “PC” were IBM? What if “PC” were Dell? What if “PC” were something else?
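The same “extract the 2-group comparison” idea works for the X2 version. The Python sketch below takes the Study #2 cell counts, forms the IBM vs. Mac and Dell vs. Mac 2x2 tables, and converts each chi-square to r for comparison with Study #1's r = .45. The helper name and data layout are made up for this illustration.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Study #2 cell counts as (failed, not failed)
study2 = {"IBM": (16, 4), "Dell": (5, 15), "Mac": (7, 13)}
mac_failed, mac_not = study2["Mac"]

effects = {}
for brand in ("IBM", "Dell"):
    failed, not_failed = study2[brand]
    chi2 = chi2_2x2(failed, not_failed, mac_failed, mac_not)
    n = failed + not_failed + mac_failed + mac_not
    effects[brand] = math.sqrt(chi2 / n)
    print(f"{brand} vs. Mac: X2(1) = {chi2:.2f}, r = {effects[brand]:.2f}")
```

If “PC” in Study #1 meant IBM, the extracted effect is about the same size as Study #1's; if it meant Dell, the effect is much smaller and the studies would not look like replications.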