Reliability.


Reliability

Evaluation of Measurement Instruments

Reliability has to do with the consistency of the instrument.
- Internal Consistency (consistency of the items)
- Test-retest Reliability (consistency over time)
- Interrater Reliability (consistency between raters)
- Split-half Methods
- Alternate Forms Methods

Validity of an instrument has to do with its ability to measure what it is supposed to measure and the extent to which it predicts outcomes.
- Face Validity
- Construct & Content Validity
- Convergent & Divergent Validity
- Predictive Validity
- Discriminant Validity

Reliability

Reliability is synonymous with consistency: it is the degree to which test scores for an individual test taker or group of test takers are consistent over repeated applications. No psychological test is completely consistent; however, a measurement that is unreliable is worthless.

For example, a student receives a score of 100 on one intelligence test and 114 on another, or imagine that every time you stepped on a scale it showed a different weight. Would you keep using these measurement tools?

The consistency of test scores is critically important in determining whether a test can provide good measurement.

Reliability (cont.)

Because no unit of measurement is exact, any time you measure something (the observed score), you are really measuring two things:
1. True Score - the part of the observed score that truly represents what you are intending to measure.
2. Error Component - the contribution of other variables that can affect the observed score.

Observed Test Score = True Score + Error of Measurement

For example, if you weigh yourself today at 140 lbs. and tomorrow at 142 lbs., is the 2-pound increase a true measure of weight gain, or could other variables be involved? Other variables may include food intake, placement of the scale, and error in the scale itself.
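To make the decomposition concrete, here is a minimal simulation sketch (Python is assumed, since the slides contain no code; the true weight and the size of the random error are illustrative assumptions): each observation is generated as a fixed true score plus random error, and averaging many observations moves the estimate toward the true score.

```python
import random

def observe(true_score, error_sd=2.0):
    """One measurement: observed score = true score + random error."""
    return true_score + random.gauss(0, error_sd)

true_weight = 140.0  # the quantity we actually intend to measure (illustrative)
observations = [observe(true_weight) for _ in range(1000)]

print(f"Single observation:        {observations[0]:.1f} lbs")
print(f"Mean of 1000 observations: {sum(observations) / len(observations):.1f} lbs")
# Any single observation fluctuates; the mean converges toward the true score.
```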

Why Do Test Scores Vary?

Possible sources of variability in scores (p. 110):
- General ability to comprehend instructions
- Stable response sets (e.g., answering the "C" option more frequently)
- The element of chance in getting a question right
- Conditions of testing
- Unreliability or bias in grading or rating performance
- Motivation
- Emotional strain

Measurement Error

Measurement error is any fluctuation in test scores that results from factors related to the measurement process that are irrelevant to what is being measured. The difference between the observed score and the true score is called the error score:

S_error = S_observed - S_true

Developing better tests with less random measurement error is better than simply documenting the amount of error.

Measurement error is reduced by:
- Writing items clearly
- Making instructions easily understood
- Adhering to proper test administration procedures
- Providing consistent scoring

Determining Reliability

There are several ways that a measurement's reliability can be determined, depending on the type of measurement and the supporting data required. They include:
- Internal Consistency
- Test-retest Reliability
- Interrater Reliability
- Split-half Methods
- Odd-even Reliability
- Alternate Forms Methods

Internal Consistency

Measures the reliability of a test based solely on the number of items on the test and the intercorrelations among the items; in effect, each item is compared to every other item. If a scale is measuring a single construct, then the items on that scale should be highly correlated with one another.

There are two common ways of measuring internal consistency:
1. Cronbach's Alpha: .80 to .95 (excellent), .70 to .80 (very good), .60 to .70 (satisfactory), below .60 (suspect)
2. Item-Total Correlations: the correlation of each item with the remainder of the items (.30 is the minimum acceptable item-total correlation)
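As a sketch of how both indices are computed (Python is assumed, and the small response matrix is made up for illustration; it is not data from the slides):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def item_total_correlations(scores):
    """Correlation of each item with the sum of the remaining items."""
    scores = np.asarray(scores, dtype=float)
    totals = scores.sum(axis=1)
    return [np.corrcoef(scores[:, i], totals - scores[:, i])[0, 1]
            for i in range(scores.shape[1])]

# Illustrative data: 6 respondents answering 4 items on a 1-5 scale
data = [[4, 5, 4, 5],
        [2, 2, 3, 2],
        [5, 4, 5, 5],
        [3, 3, 2, 3],
        [1, 2, 1, 2],
        [4, 4, 4, 3]]

print(f"Cronbach's alpha: {cronbach_alpha(data):.2f}")
print("Item-total correlations:", [round(r, 2) for r in item_total_correlations(data)])
```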

Internal Consistency (cont.)

Internal consistency estimates are a function of:
- The number of items: if we think of each test item as an observation of the behaviour, more items means more observations of the construct.
- The average intercorrelation: the extent to which each item represents an observation of the same thing.

The more often you observe a construct, and the more consistently, the greater the reliability.

Split Half & Odd-Even Reliability

Split-half reliability refers to the correlation between the first half of the measurement and the second half (i.e., we would expect answers to the first half to be similar to answers to the second half).

Odd-even reliability refers to the correlation between the even-numbered items and the odd-numbered items.

In both cases we use a single test to create two tests, eliminating the need for additional items and multiple administrations. Since only one administration is needed and the two halves are determined by the internal components of the test, these are referred to as internal consistency measures.
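A minimal computational sketch of both splits (Python is assumed and the response matrix is made up; the Spearman-Brown step-up is a standard correction applied when correlating half-length tests, and is not mentioned on the slide itself):

```python
import numpy as np

def half_correlation(scores, half_a, half_b):
    """Correlate respondents' totals on two halves of the items."""
    scores = np.asarray(scores, dtype=float)
    return np.corrcoef(scores[:, half_a].sum(axis=1),
                       scores[:, half_b].sum(axis=1))[0, 1]

def spearman_brown(r_half):
    """Step the half-test correlation up to full-test length."""
    return 2 * r_half / (1 + r_half)

# Illustrative data: 5 respondents answering 6 items
data = np.array([[4, 5, 4, 5, 3, 4],
                 [2, 2, 3, 2, 2, 1],
                 [5, 4, 5, 5, 4, 5],
                 [3, 3, 2, 3, 3, 2],
                 [1, 2, 1, 2, 1, 2]])

n = data.shape[1]
first, second = list(range(n // 2)), list(range(n // 2, n))
odd, even = list(range(0, n, 2)), list(range(1, n, 2))

print(f"Split-half reliability: {spearman_brown(half_correlation(data, first, second)):.2f}")
print(f"Odd-even reliability:   {spearman_brown(half_correlation(data, odd, even)):.2f}")
```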

Split Half & Odd-Even Reliability (cont.)

Possible advantages:
- Simplest method; easy to perform
- Time and cost effective

Possible disadvantages:
- There are many ways of splitting a test
- Each split yields a somewhat different reliability estimate
- Which one is the real reliability of the test?

Test-retest Reliability

Test-retest reliability is usually measured by computing the correlation coefficient between the scores from two administrations of the same test.
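A sketch of that computation (Python is assumed; the paired scores below are illustrative, not from the slides):

```python
import numpy as np

# Scores for the same eight respondents on two administrations of the same test
time1 = np.array([85, 92, 78, 88, 95, 70, 82, 90])
time2 = np.array([83, 94, 80, 85, 96, 72, 79, 91])

r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: {r_test_retest:.2f}")
```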

Test-retest Reliability (cont.)

The amount of time allowed between measures is critical: the shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation, because the two observations are related over time. The optimum time between administrations is 2 to 4 weeks.

If a scale is measuring a construct consistently, then there should not be radical changes in the scores between administrations, unless something significant happened. The rationale behind this method is that any difference between the test and retest scores should be due solely to measurement error.

Test-retest Reliability (cont.)

It is hard to specify a single acceptable test-retest correlation, since what is considered acceptable depends on the type of scale, the use of the scale, and the time between testings. For example, it is not always clear whether differences in test scores should be regarded as measurement error or as real change in the characteristic being measured.

Possible sources of difference in scores between tests: experience, the characteristic being measured may change over time (e.g., a reading test), and carryover effects (e.g., remembering the test).

Test-retest Reliability (cont.)

A minimum correlation of at least .50 is expected; the higher the correlation (in a positive direction), the higher the test-retest reliability.

The biggest problem with this type of reliability is what is called the memory effect: a respondent may recall answers from the original test, thereby inflating the reliability estimate. Also, is it practical to administer the test twice?

Interrater Reliability

Whenever you use humans as part of your measurement procedure, you have to worry about whether the results you get are reliable and consistent. People are notorious for their inconsistency: we are easily distracted, we get tired of doing repetitive tasks, we daydream, and we misinterpret.

Interrater Reliability (cont.)

For some scales it is important to assess interrater reliability. Interrater reliability means that if two different raters scored the scale using the scoring rules, they should arrive at the same result. It is usually measured by computing the correlation coefficient between the scores of two raters across the set of respondents. Here the criterion of acceptability is fairly high (e.g., a correlation of at least .90), but what is considered acceptable will vary from situation to situation.
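A sketch of the usual computation (Python is assumed and the ratings are made up; the slide describes a simple correlation, though other indices exist for categorical ratings):

```python
import numpy as np

# Two raters scoring the same ten respondents with the same scoring rules
rater_a = np.array([3, 5, 4, 2, 5, 1, 4, 3, 2, 4])
rater_b = np.array([3, 4, 4, 2, 5, 2, 4, 3, 3, 4])

r_interrater = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"Interrater reliability: {r_interrater:.2f}")  # acceptability criterion is roughly .90
```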

Parallel/Alternate Forms Method

The parallel/alternate forms method refers to administering two alternate forms of the same measurement device and comparing the scores. Both forms are administered to the same person, and the scores are correlated. If the two forms produce the same results, the instrument is considered reliable.

Parallel/Alternate Forms Method (cont.)

A correlation between the two forms is computed, just as in the test-retest method.

Advantages:
- Eliminates the problem of the memory effect.
- Reactivity effects (i.e., the experience of taking the test) are also partially controlled.
- Can sample the content domain more widely than the test-retest method.

Parallel/Alternate Forms Method (cont.)

Possible disadvantages:
- Are the two forms of the test actually measuring the same thing?
- More expensive.
- Requires additional work to develop two measurement tools.

Factors Affecting Reliability
- Administrator Factors
- Number of Items on the Instrument
- The Test Taker
- Heterogeneity of the Items
- Heterogeneity of the Group Members
- Length of Time between Test and Retest

Administrator Factors

Poor or unclear directions given during administration, or inaccurate scoring, can affect reliability. For example, suppose you were told that your scores on a measure of sociability determined your promotion: the results are more likely to reflect what you think the administrators want than what your behavior actually is.

Number of Items on the Instrument

The larger the number of items, the greater the chance of high reliability. For example, it makes sense that twenty questions about your leadership style are more likely to produce a consistent result than four questions. Remedy: use longer tests, or accumulate scores from several short tests.
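One standard way to quantify the effect of test length (not named on the slide, so treat it as an illustrative aside) is the Spearman-Brown prophecy formula, sketched here in Python with made-up numbers:

```python
def spearman_brown_prophecy(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# A 4-item test with reliability .55, lengthened to 20 comparable items (factor of 5)
print(f"{spearman_brown_prophecy(0.55, 5):.2f}")  # about .86
```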

The Test Taker

For example, if you took an instrument in August when you had a terrible flu and then again in December when you were feeling quite good, we might see a difference in your response consistency. If you were under considerable stress of some sort, or if you were interrupted while answering the instrument's questions, you might give different responses.

Heterogeneity

Heterogeneity of the items: the greater the heterogeneity of the items (differences in the kind of questions or in their difficulty), the greater the chance of high reliability coefficients.

Heterogeneity of the group members: the greater the heterogeneity of the group members in the preferences, skills, or behaviors being tested, the greater the chance of high reliability coefficients.

Length of Time between Test and Retest

The shorter the time, the greater the chance of high reliability coefficients. As we have experiences, we tend to adjust our views a little from time to time, so the interval between the first and second administrations is really an "experience" interval: experience happens, and it influences how we see things. Because internal consistency involves no time lapse, one can expect it to yield the highest reliability coefficient.

How High Should Reliability Be?

A highly reliable test is always preferable to a test with lower reliability.
- .80 or greater (Excellent)
- .70 to .80 (Very Good)
- .60 to .70 (Satisfactory)
- Below .60 (Suspect)

A reliability coefficient of .80 indicates that 20% of the variability in test scores is due to measurement error.

Generalizability Theory

A theory of measurement that attempts to determine the sources of consistency and inconsistency in scores. It requires obtaining multiple observations for the same group of individuals on all the variables that might contribute to measurement error (e.g., scores across occasions, across scorers, across alternative forms), and it allows for the evaluation of interaction effects among different types of error sources.

Generalizability Theory (cont.)

It is especially useful when:
1. The conditions of measurement affect test scores.
2. Test scores are used for several different purposes.

For example, measurements involving subjectivity (e.g., interviews, rating scales) involve bias, so human judgement can itself be considered a "condition of measurement." When feasible, generalizability theory is a more thorough procedure for identifying the error components that may enter scores.

Standard Error of Measurement (SEM)

The SEM is a statistic used to build confidence intervals around obtained scores. It is the standard deviation of the hypothetical distribution of scores we would see if someone took the test an infinite number of times, and it allows one to predict the range of fluctuation likely to occur in a single individual's score because of irrelevant, chance factors. It is used in analyzing the reliability of a test and in estimating the "true" score, and it indicates how much variability in test scores can be expected as a result of measurement error.

The SEM is a function of two factors, the reliability of the test and the variability of the test scores:

SEM = SD × √(1 − reliability)
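A sketch of the computation (Python is assumed; the SD and reliability values are illustrative):

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# A test with a score SD of 15 and a reliability of .80
print(f"SEM = {standard_error_of_measurement(sd=15, reliability=0.80):.2f}")  # about 6.71
```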

Standard Error of Measurement (cont.)

The most common use of the SEM is the construction of confidence intervals. The SEM is an estimate of how much error there is in a test and can be interpreted in the same way as a standard deviation: 68% of the time the true score would fall within plus or minus one SEM of the obtained score, so we can be 68% confident that a student's true score lies in that range. Within plus or minus two SEM, the true score would be found approximately 95% of the time. Put another way, if the student took the test 100 times, about 68 of the obtained scores would fall within one SEM of the true score.
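A sketch of building such intervals (Python is assumed; the observed score, SD, and reliability are illustrative):

```python
import math

def sem_band(observed, sd, reliability, n_sem=1):
    """Band of +/- n_sem standard errors of measurement around an obtained score."""
    sem = sd * math.sqrt(1 - reliability)
    return observed - n_sem * sem, observed + n_sem * sem

low68, high68 = sem_band(observed=110, sd=15, reliability=0.80, n_sem=1)
low95, high95 = sem_band(observed=110, sd=15, reliability=0.80, n_sem=2)
print(f"68% band:  {low68:.1f} to {high68:.1f}")   # about 103.3 to 116.7
print(f"~95% band: {low95:.1f} to {high95:.1f}")   # about 96.6 to 123.4
```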