1 THE EFFECT OF CRITERION RELIABILITY ON MEANS AND INTERACTIONS IN META-ANALYSIS LAWRENCE R. JAMES PSYCHOLOGY AND MANAGEMENT GEORGIA INSTITUTE OF TECHNOLOGY

2 META-ANALYSIS Correlations involving the same or very similar predictors and criteria are retrieved from prior studies. This set of validities constitutes a distribution that can be summarized statistically using standard descriptors such as the mean and the variance.

3 VALIDITY GENERALIZATION Archival information pertaining to statistical artifacts that might affect each validity is obtained (e.g., sampling error, reliability of criterion and predictor, range restriction). Distributional summary statistics are corrected for artifacts to provide estimates of the mean true (population) validity and the variance among the mean true (population) validities.

4 WHY VALIDITY GENERALIZATION? Validity generalization is founded on the possibility that true validities from different populations may be equal, and yet the sample validities may vary because of the operation of statistical artifacts (Hunter, Schmidt, & Jackson, 1982). (This is a question of interaction.) There is also the strong likelihood that true validities are underestimated by sample validities due to unreliability and range restriction.

5 RESULTS OF VALIDITY GENERALIZATION Meta-analyses based on validity generalization (VG) procedures continue to be impressive.

6 ILLUSTRATIVE RESULTS General intellectual ability is said to have an average corrected validity of .53 in predicting job performance (Hunter & Hunter, 1984). Structured interviews can attain corrected validities in the .47 to .60 range against job performance (Huffcutt & Arthur, 1994). Perceptual speed has an average corrected validity of .47 against clerical performance (Schmidt, 1992). Integrity tests have average corrected validities of .40 against job performance (applicant samples) and .47 against counterproductive behaviors (all samples) (Ones, Viswesvaran, & Schmidt, 1993).

7 INFERENCES Many VG studies suggest that a single intellectual, cognitive, or personality trait can account for upwards of 16% to 36% of the variance in some aspect of job performance. The days when 16% of the variance (or a validity of .40) was the maximum expected for a trait (Ghiselli & Brown, 1955) are gone, as are the days when validities in the .20s and .30s were commonplace in the reports of "well-done" validity studies.

8 QUESTION What precipitated this boost in validities and accountable variance?

9 BETTER SCIENCE? Improved measurement instruments? More sophisticated sampling techniques? Superior research designs?

10 Well, not really. We still rely on the same measurement procedures, the same small samples, and the same bivariate correlation designs.

11 Then what gave rise to this bountiful enhancement in validities?

12 ENHANCEMENT IN VALIDITIES The boosts in validities come from correcting the observed validities, which have stayed pretty much the same, for attenuation due to unreliability in the criterion (and sometimes the predictor) and direct range restriction in the predictor.

13 WHAT CHANGED? Change was not due to improvements in science. What changed was our historical cautiousness in applying correction equations to validity coefficients.

14 A CULTURE OF CORRECTIONS The genesis of this “culture of corrections” can be traced to desires to estimate relationships devoid of statistical artifacts.

15 A FORERUNNER: LATENT VARIABLES For example, latent variable procedures such as LISREL frame the opportunity to employ estimates of perfectly reliable variables in studies of covariation as a major advance in science.

16 ANOTHER FORM OF LATENT VARIABLE No less dedicated to the pursuit of truth and scientific principle is VG (Schmidt, 1992), the objective being to estimate correlations among true scores (i.e., latent variables) unencumbered by statistical artifacts (e.g., unreliability).

17 RECEPTIVENESS TO CORRECTIONS It is the idea that corrected coefficients give greater insight into scientific truths that engendered the current culture of corrections. Investigators are prone to compute corrected coefficients, and editors, reviewers, and readers tend to be receptive to them.

18 OUR GOALS It is not our intent to stand between scientists and the seeking of truth via corrected coefficients. We do feel that it is reasonable, however, to inquire about the statistical values that are being used to make the corrections. We are specifically interested in corrections for attenuation due to unreliability in criteria assessed via ratings of job performance. We then study the effects these corrections have on the estimates of the mean true validity and the variance among the estimated true validities from separate populations.

19 INTERRATER RELIABILITY FOR RATINGS Viswesvaran, Ones, & Schmidt (1996) concluded that job performance is typically assessed by ratings, the reliability of ratings should be estimated via an interrater reliability analysis, and the mean interrater reliability for job performance ratings over studies is approximately .52.

20 WHERE AND WHEN TO USE .52 If a given study in a VG analysis fails to report criterion reliability, and the criterion is based on ratings, then the best estimate of the missing interrater reliability is .52. If one is using one of the myriad VG equations to estimate means and variances of true correlations, and interrater reliability for ratings is missing from many studies (as is often the case), then .52 is the value to insert into the estimating equations for mean observed criterion reliability.

21 CONSEQUENCE OF USING .52 It is instructive to illustrate the product of using .52 as an estimate of interrater reliability. Using the standard correction for attenuation, an observed validity of .25 becomes .35 (i.e., $.25/\sqrt{.52}$), .30 becomes .42, .35 becomes .49, and .40 becomes .55.
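The four corrected values on this slide can be reproduced with a short script. This is a minimal sketch of the standard correction for attenuation for criterion unreliability only; the function and variable names are illustrative and do not come from the presentation.

```python
import math


def correct_for_attenuation(observed_validity: float, criterion_reliability: float) -> float:
    """Disattenuate an observed validity for criterion unreliability: r / sqrt(r_yy)."""
    return observed_validity / math.sqrt(criterion_reliability)


# Mean interrater reliability for performance ratings cited in the slides.
INTERRATER_RELIABILITY = 0.52

# Observed validities used as examples on the slide.
corrected = {
    r: round(correct_for_attenuation(r, INTERRATER_RELIABILITY), 2)
    for r in (0.25, 0.30, 0.35, 0.40)
}

for observed, estimate in corrected.items():
    print(f"observed {observed:.2f} -> corrected {estimate:.2f}")
```

Because the divisor is the square root of the reliability, every observed validity is multiplied by the same factor (about 1.39 when the reliability is .52).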

22 MAGNITUDE OF INCREASE So, simply by correcting for attenuation based on an interrater reliability of .52, we obtain an 89% increase (i.e., $[.55^2 - .40^2]/.40^2$) in what is regarded as the maximum expected variance accounted for by a single predictor (i.e., .16 to .30).
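For readers who want the arithmetic behind the 89% figure, it can be written out as follows. The only inputs are the observed validity of .40 and its corrected value of .55 from the previous slide; the intermediate rounding shown is an assumption about how the figure was obtained.

```latex
\[
\frac{.55^{2} - .40^{2}}{.40^{2}}
  = \frac{.3025 - .1600}{.1600}
  = \frac{.1425}{.1600}
  \approx .89
\]
```

The squared corrected validity, .3025, is the ".30" quoted as the new maximum expected variance accounted for.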

23 AN ADVANCE IN SCIENCE? To what extent is this 89% increase in maximum expected variance accounted for reflective of science?

24 COMPARISONS TO OTHER VARIABLES Where else in personnel research do we accept, and use, measurement procedures that produce variables with reliabilities of .52? Is it not true that almost every conceivable variable except performance ratings would be cast out of personnel research if its reliability were .52?

25 NUNNALLY & BERNSTEIN, 1994 “A reliability of .80 may not be nearly high enough in making decisions about individuals…. If important decisions are being made with respect to specific test scores, a reliability of .90 is the bare minimum, and a reliability of .95 should be considered the desirable standard.” (p. 265)

26 DESIRABLE STANDARD FOR PERFORMANCE RATINGS If we desire a .95 reliability for the test scores that are used to hire people for jobs, it seems reasonable to expect the same standard of reliability for the ratings that are used to determine whether people keep their jobs.

27 PRACTICAL CONSIDERATIONS Many reliabilities for scores used to make decisions about individuals are not in the .90s. Many, however, are in the .80s. With the exception of performance ratings, almost none are in the .50s.

28 QUESTIONS Why are performance ratings allowed to survive in spite of what most would agree is questionable measurement? How do we allow observed validities to be corrected for unreliability in what appear to be flawed variables, and then act as if these corrected validities actually convey some sort of credible scientific information?

29 QUESTIONS (continued) Does anyone really believe that it makes sense to talk about a "perfectly reliable criterion" when the observed criterion begins with an interrater reliability of .52? How exactly does a variable in which almost one-half of the observed variance is some form of bias or error become perfectly reliable?

30 WHERE IS THE NEW TECHNOLOGY? It would seem that researchers would have instituted the necessary improvements, given that problems with performance ratings were documented as early as 50 years ago in Guilford’s (1954) classic text in psychometrics. Have not hundreds of articles been written on the biases and errors that affect performance ratings, especially after the classic articles on problems with performance ratings written by Feldman and Landy & Farr? We know what the problems are. Why have we not fixed them?

31 IS THE PROBLEM INTRACTABLE? Maybe it is not possible to build ratings that can achieve high interrater reliabilities. If we admit that this is true, then should we also not admit that we cannot justify inserting .52 in corrections for attenuation because we know that “theoretically perfectly reliable” is not going to be even remotely approximated?

32 Is .52 an accurate estimate of interrater reliability? This issue is currently being debated elsewhere (LeBreton, Kaiser, Burgess, Atchley, & James, 2001; Murphy & DeShon, 2000a, 2000b; Schmidt, Viswesvaran, & Ones, 2000). If this estimate is later shown to be inaccurate or ill-founded, then a different debate ensues. However, for now, let us assume that the .52 estimate is legitimate and accurate.

33 THE ISSUE We may then deal with the issue of concern here, which is basing substantive scientific judgments on corrections that employ a below-threshold reliability for a criterion to produce an enhanced, sometimes much enhanced, estimate of corrected validity.

34 Is 40 years of research wrong, and is job satisfaction really correlated with job performance? Judge, Thoresen, Bono, and Patton (2001) used .52 as an estimate of criterion reliability to repudiate 40 years of research findings and previous meta-analyses that concluded that job satisfaction has a low correlation with overall job performance. A mean observed correlation of .18 was corrected to a mean (estimated) true correlation of .30. Correction for unreliability in the criterion accounted for approximately 60% of this increase. The use of .52 in the correction for attenuation was justified by arguing that this approach was “consistent with all contemporary (post-1990) meta-analytic studies involving job performance.” (p. 384)

35 A COMPARISON Had criterion reliability been .85 instead of .52, the corrected correlation would have been approximately .23 (job satisfaction reliability was set at .74). Had the reliabilities for both variables been .85, the corrected correlation would have been approximately .21. Neither of these correlations suggests a substantial linear, additive relationship between job satisfaction and job performance. Are we going to change this conclusion based on corrections engendered by not being able to measure job satisfaction particularly well and performance hardly at all?
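The comparison can be checked with the double correction for attenuation applied to the mean observed correlation of .18. This is a sketch under a stated assumption: it corrects the mean directly, whereas the published meta-analysis may have corrected study by study, so the first value it prints (about .29) is close to but not identical with the reported .30. The function name is illustrative.

```python
import math


def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Correct a correlation for unreliability in both variables: r / sqrt(r_xx * r_yy)."""
    return r_xy / math.sqrt(r_xx * r_yy)


OBSERVED = 0.18          # mean observed satisfaction-performance correlation
SATISFACTION_REL = 0.74  # job satisfaction reliability given on the slide

# Criterion reliability of .52, applied to the mean (assumption: not study-level).
with_052 = disattenuate(OBSERVED, SATISFACTION_REL, 0.52)

# The two counterfactuals described on the slide.
criterion_085 = disattenuate(OBSERVED, SATISFACTION_REL, 0.85)
both_085 = disattenuate(OBSERVED, 0.85, 0.85)

# Share of the .18 -> .30 increase attributable to the criterion correction alone.
criterion_only = disattenuate(OBSERVED, 1.0, 0.52)
criterion_share = (criterion_only - OBSERVED) / (0.30 - OBSERVED)

print(round(with_052, 2), round(criterion_085, 2), round(both_085, 2))
print(round(criterion_share, 2))
```

Under these assumptions the criterion correction alone accounts for a little under 60% of the increase, in line with the "approximately 60%" stated on the previous slide.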

36 STATISTICAL DYSFUNCTIONS OF CORRECTING FOR LOW RELIABILITIES At this juncture, I hope that you realize that we have a problem. We cannot base our science on large corrections engendered by poor measurement. If you have yet to be convinced, then allow me to proceed to demonstrate some unanticipated dysfunctions of inserting low reliabilities into correction equations. Statistics are based on a working paper by James, LeBreton, and Ladd.

37 A SINGLE VG ANALYSIS A meta-analysis is conducted on the correlations between scores on a structured interview and ratings of overall job performance. The mean observed correlation is .35. Mean criterion reliability is set at .52. Mean predictor reliability is set at .80. The ratio between the restricted and unrestricted standard deviations on the predictor is set at .71 (a common value).

38 Result of a Single VG Analysis The estimate of mean true validity is .67 (Raju, Burke, Normand, & Langlois, 1991, Equation 2).
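The .67 estimate can be reproduced with two textbook corrections applied in sequence. This is only a sketch: it assumes disattenuation for both reliabilities followed by the usual correction for direct range restriction on the predictor, which happens to give .67 for these inputs. The slides do not spell out Raju et al.'s Equation 2, so the exact form used there may differ, and the function name is hypothetical.

```python
import math


def estimate_true_validity(r: float, r_xx: float, r_yy: float, u: float) -> float:
    """Assumed two-step correction (not necessarily Raju et al.'s Equation 2).

    r    : mean observed validity
    r_xx : mean predictor reliability
    r_yy : mean criterion reliability
    u    : restricted SD / unrestricted SD on the predictor
    """
    # Step 1: correct for attenuation due to unreliability in both variables.
    r_dis = r / math.sqrt(r_xx * r_yy)
    # Step 2: correct for direct range restriction on the predictor.
    big_u = 1.0 / u
    return big_u * r_dis / math.sqrt(1.0 + r_dis**2 * (big_u**2 - 1.0))


estimate = estimate_true_validity(r=0.35, r_xx=0.80, r_yy=0.52, u=0.71)
print(round(estimate, 2))
```

With these inputs the observed validity of .35 nearly doubles, and most of the gain comes from the .52 in the denominator of the first step.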

39 ADDITIONAL PREDICTORS Three additional predictors are chosen to contribute unique variance to prediction: an intelligence test, an integrity test, and a biographical questionnaire.

40 PSYCHOMETRICS OF SEPARATE PREDICTORS Each additional predictor has an observed validity of .35 against job performance, correlates .20 with each of the other predictors, and has a reliability of .80. The ratio between the restricted and unrestricted standard deviations is again set at .71.

41 RESULTS OF THREE ADDITIONAL VG ANALYSES The estimate of mean true validity in each additional VG analysis is .67.

42 MULTIPLE CORRELATION ANALYSIS Our four separate VG analyses each furnish an impressive increase in validity from .35 to .67. Now let’s compute a multiple correlation by inserting the results of each separate VG analysis into a multiple correlation analysis.

43 RESULTS The squared multiple correlation ($R^2$) is 1.03. We account for more than 100% of the variance in the job performance ratings.

44 COMPARATIVE RESULTS-1 A multiple correlation analysis based on the observed or uncorrected data produces an $R^2$ of .31.
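Both squared multiple correlations (1.03 corrected, .31 observed) can be reproduced with a closed-form expression that holds when all k predictors have the same validity v and the same intercorrelation p: R^2 = k v^2 / (1 + (k - 1) p). This is a sketch with one loudly labeled assumption: the slides do not say how the predictor intercorrelation of .20 was treated in the corrected analysis, and dividing it by the predictor reliability (.20/.80 = .25) is simply the choice that reproduces 1.03.

```python
def squared_multiple_r(validity: float, intercorrelation: float, k: int) -> float:
    """R^2 for k predictors with equal validities and equal intercorrelations."""
    return k * validity**2 / (1.0 + (k - 1) * intercorrelation)


K = 4  # structured interview, intelligence test, integrity test, biodata

# Observed (uncorrected) data: validity .35, intercorrelation .20.
observed_r2 = squared_multiple_r(0.35, 0.20, K)

# Corrected data: validity .67; intercorrelation disattenuated for predictor
# unreliability only (ASSUMPTION, not stated in the slides): .20 / .80 = .25.
corrected_r2 = squared_multiple_r(0.67, 0.20 / 0.80, K)

print(round(observed_r2, 2), round(corrected_r2, 2))
```

A value above 1.0 signals that the corrected correlation matrix is no longer a proper correlation matrix, which is the point the following slides develop.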

45 COMPARATIVE RESULTS-2 If all corrections remained the same except that the performance ratings were given a reliability of .80 rather than .52, then the mean estimated true validity for each of the four variables would have been .54, and the $R^2$ would have been .67.

46 COMPARATIVE RESULTS-3 If all corrections remained the same except that the performance ratings were given a reliability of .70, which is often considered the lower bound for reliability (Nunnally & Bernstein, 1994), then the mean estimated true validity for each of the four variables would have been .57, and the $R^2$ would have been .77.

47 IMPROPER TERRITORY With reasonable values for criterion reliability set by accepted standards in psychometrics, corrected coefficients provide $R^2$ values in proper ranges. When accepted standards are suspended, the $R^2$ may wander off into improper territory.

48 SLIPPERY SLOPE We typically do not see $r^2$ values greater than 1.0 in bivariate studies. Investigators have thus failed to realize that once one begins to suspend judgment about acceptable thresholds for criterion reliability and to allow a value as low as .52 into correction equations, one is on a slippery slope. The multiple correlation analysis picked up on the slippery slope by producing an improper $R^2$. It follows that the bivariate corrections that engendered this improper value have a tenuous foundation.

49 VARIANCES Heretofore we have focused on the mean of a distribution of validities and the estimate of the mean true validity. It is also possible to focus on the variance of a distribution of validities and the estimate of the variance among true validities.

50 ESTIMATED VARIANCE AMONG TRUE VALIDITIES Each sample validity is corrected for artifacts. This provides an estimate of the true validity for the population from which that sample was drawn. The variance among the estimated true validities is calculated. This variance is adjusted for sampling error (Raju et al., 1991). If artifact data are not available for each sample, estimating equations are available.

51 OBSERVED AND CORRECTED VALIDITIES DATA WITH RELIABILITY OF .52

52 OBSERVED AND CORRECTED VALIDITIES DATA WITH RELIABILITY OF .75

53 ESTIMATES OF TRUE VARIANCE

54 KEY IMPLICATIONS Lower criterion reliabilities result in higher estimates of true variance. This means that the interpretation of mean true validity is more likely to be subject to moderation. In other words, use of below-threshold criterion reliabilities to enhance validity makes interpretation of that enhanced validity ambiguous.
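The first implication follows from simple algebra: dividing every validity by the square root of the criterion reliability multiplies the variance of the validities by 1 over that reliability. The sketch below illustrates this with hypothetical observed validities; they are not the data from the preceding slides, whose tables are not reproduced in this transcript. It also omits the sampling-error adjustment and the other artifact corrections that a full VG analysis would apply.

```python
import math
import statistics

# HYPOTHETICAL observed validities, for illustration only.
observed = [0.20, 0.30, 0.35, 0.40, 0.50]


def variance_of_corrected(validities, criterion_reliability):
    """Variance among validities after correcting each for criterion unreliability."""
    corrected = [r / math.sqrt(criterion_reliability) for r in validities]
    return statistics.pvariance(corrected)


var_at_052 = variance_of_corrected(observed, 0.52)
var_at_075 = variance_of_corrected(observed, 0.75)

# The ratio of the two variances equals .75 / .52, whatever the validities are.
print(var_at_052 > var_at_075)
print(round(var_at_052 / var_at_075, 2))
```

Under these simplifying assumptions, moving from a reliability of .75 to .52 inflates the variance among corrected validities by roughly 44%, which is why a lower reliability makes moderation look more likely.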

55 CONCLUSIONS It is time to call a moratorium on the use of low mean interrater reliabilities to enhance estimates of mean true validities in VG analyses. It is time to have a serious debate on how to estimate the reliability of ratings.
