Procedures for Estimating Reliability


CHAPTER 7: Procedures for Estimating Reliability

*TYPES OF RELIABILITY

Test-Retest (2 administrations)
What it is: a measure of stability.
How you do it: administer the same test/measure at two different times to the same group of participants.
Coefficient: r(test1, test2). Ex. IQ test.

Parallel/Alternate (Interitem/Equivalent) Forms (2 administrations)
What it is: a measure of equivalence.
How you do it: administer two different forms of the same test to the same group of participants.
Coefficient: r(testA, testB). Ex. stats test.

Test-Retest with Alternate Forms (2 administrations)
What it is: a measure of stability and equivalence.
How you do it: on Monday, administer form A to the 1st half of the group and form B to the 2nd half; on Friday, administer form B to the 1st half and form A to the 2nd half.
Coefficient: r between the two sets of scores (coefficient of stability and equivalence).

Inter-Rater (1 administration)
What it is: a measure of agreement.
How you do it: have two raters rate behaviors and then determine the amount of agreement between them.
Coefficient: percentage of agreement.

Internal Consistency (1 administration)
What it is: a measure of how consistently each item measures the same underlying construct.
How you do it: correlate performance on each item with overall performance across participants.
Coefficient: Cronbach's Alpha Method, Kuder-Richardson Method, Split-Half Method, Hoyt's Method.

Procedures for Estimating/Calculating Reliability: (1) procedures requiring two test administrations, and (2) procedures requiring one test administration.

Procedures for Estimating Reliability. *Procedures Requiring Two (2) Test Administrations: 1. The Test-Retest Reliability Method measures stability. 2. The Parallel (Alternate/Equivalent) Forms Reliability Method measures equivalence. 3. Test-Retest with Alternate Forms measures stability and equivalence.

Procedures Requiring 2 Test Administrations. 1. Test-Retest Reliability Method: administer the same test to the same group of participants; then the two sets of scores are correlated with each other. The correlation coefficient (r) between the two sets of scores is called the coefficient of stability. The problem with this method is time sampling: factors related to the passage of time become sources of measurement error, e.g., changes in exam conditions such as noise, the weather, illness, fatigue, worry, or mood.

How to Measure Test-Retest Reliability
Class IQ Scores
Student   X (first time)   Y (second time)
John      125              120
Jo        110              112
Mary      130              128
Kathy     122              120
David     115              120
r(first time, second time) = coefficient of stability
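As a quick illustration (not from the slides), the coefficient of stability can be computed directly. A minimal Python sketch using the class IQ scores above; the pearson_r helper is illustrative, not part of the original material:

```python
# Test-retest reliability: correlate the two administrations of the same test.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

first_time = [125, 110, 130, 122, 115]   # X: first administration
second_time = [120, 112, 128, 120, 120]  # Y: second administration
print(round(pearson_r(first_time, second_time), 2))  # coefficient of stability, about 0.89
```

The same computation, applied to two forms instead of two occasions, gives the coefficient of equivalence used in the parallel-forms method below.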

Procedures Requiring 2 Test Administrations. 2. Parallel (Alternate) Forms Reliability Method: different forms of the same test are given to the same group of participants; then the two sets of scores are correlated. The correlation coefficient (r) between the two sets of scores is called the coefficient of equivalence.

Procedures Requiring 2 Test Administrations. 2. Parallel (Alternate) Forms Reliability Method: two test administrations with the same group are required. Test scores may be affected by factors such as motivation, fatigue, or intervening events like practice or learning. For strictly parallel forms, the means and variances of the observed scores are assumed to be equal for the two forms.

How to Measure Parallel Forms Reliability
Class Test Scores
Student   X (Form A)   Y (Form B)
John      95            90
Jo        80            85
Mary      78            82
Kathy     82            88
David     75            72
r(Form A, Form B) = coefficient of equivalence
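If a stats library is handy, the same correlation is a one-liner. A minimal sketch with numpy.corrcoef on the Form A / Form B scores tabled above (numpy is assumed to be installed):

```python
# Parallel-forms reliability: correlate Form A with Form B (coefficient of equivalence).
import numpy as np

form_a = [95, 80, 78, 82, 75]
form_b = [90, 85, 82, 88, 72]
print(round(float(np.corrcoef(form_a, form_b)[0, 1]), 2))  # about 0.77
```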

Procedures Requiring 2 Test Administrations. 3. Test-Retest with Alternate Forms: a combination of the test-retest and alternate-forms reliability methods. On Monday, you administer form A to the 1st half of the group and form B to the 2nd half. On Friday, you administer form B to the 1st half of the group and form A to the 2nd half. The correlation coefficient (r) between the two sets of scores is called the coefficient of stability and equivalence.

Procedures for Estimating Reliability *Procedures Requiring one (1) Test Administration A. Internal Consistency Reliability B. Inter-Rater Reliability

Procedures Requiring 1 Test Administration. *A. Internal Consistency Reliability (ICR): examines the unidimensional nature of a set of items in a test; it tells us how unified the items are in a test or an assessment. Ex. If we administer a 100-item personality test, we want the items to relate to one another and to reflect the same construct (personality); we want them to have item homogeneity. *ICR deals with how unified the items are in a test or an assessment; this is called "item homogeneity."

*A. Internal Consistency Reliability (ICR). *4 different ways to measure ICR: 1. Guttman Split-Half Reliability Method (same as the Spearman-Brown Prophecy Formula); 2. Cronbach's Alpha Method; 3. Kuder-Richardson Method; 4. Hoyt's Method. They are different statistical procedures for calculating the reliability of a test.

Procedures Requiring 1 Test Administration. A. Internal Consistency Reliability (ICR). 1. Guttman Split-Half Reliability Method (most popular): usually used for dichotomously scored exams. First, administer the test; then divide the test items into 2 subtests (there are four popular ways to split them); then find the correlation between the 2 subtests and place it in the formula.

1. Split-Half Reliability Method. Spearman-Brown corrected (whole-test) reliability: reliability of the whole test = (2 x r between halves) / (1 + r between halves), where r between halves is the correlation between the two half-tests.

1. Split-Half Reliability Method. *The 4 popular methods are: 1. Assign all odd-numbered items to form 1 and all even-numbered items to form 2. 2. Rank order the items in terms of their difficulty levels (p-values) based on the responses of the examinees; then assign items with odd-numbered ranks to form 1 and those with even-numbered ranks to form 2.

1. Split-Half Reliability Method. The four popular methods are (continued): 3. Randomly assign items to the two half-test forms. 4. Assign items to half-test forms so that the forms are "matched" in content, e.g., if there are 6 items on reliability, each half will get 3 of them.

1. Split-Half Reliability Method. A high split-half reliability coefficient (e.g., >0.90) indicates a homogeneous test.

1. Split-Half Reliability Method. *Ex. Use the split-half reliability method to calculate the reliability estimate of a test whose two halves correlate at 0.25.

1. Split-Half Reliability Method. Applying the Spearman-Brown formula: reliability = 2(0.25) / (1 + 0.25) = 0.50 / 1.25 = 0.40.

Next: How to calculate the Split Half Reliability Method using SPSS

1. Split Half Reliability Method A=X and B=Y

Calculate the split-half reliability for halves X and Y:
X   Y
1   3
2   6
4   4
5   7
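Putting the two steps together, a minimal Python sketch (the helper names are illustrative, not from the slides) that correlates the halves and then applies the Spearman-Brown correction to the X/Y data above:

```python
# Split-half reliability: correlate the two halves, then apply the Spearman-Brown correction.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5

def spearman_brown(r_half):
    # Step up from half-test reliability to whole-test reliability
    return 2 * r_half / (1 + r_half)

x = [1, 2, 4, 5]  # scores on half A
y = [3, 6, 4, 7]  # scores on half B
r_half = pearson_r(x, y)                 # 0.60
print(round(spearman_brown(r_half), 2))  # 0.75
```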

Procedures Requiring 1 Test Administration. A. Internal Consistency Reliability (ICR). 2. Cronbach's Alpha Method: used for a wide range of scoring formats (both non-dichotomously and dichotomously scored exams). Cronbach's alpha (α) is a preferred statistic. (Lee Cronbach)

Procedures Requiring 1 Test Administration. Cronbach's alpha: α = [k / (k − 1)] x (1 − Σσ²i / σ²x), where k is the number of items, σ²i is the variance of item i, and σ²x is the total test score variance.

Cronbach's α for composite tests: the same formula applies with k equal to the number of tests/subtests, using the subtest variances in place of item variances.

A. Internal Consistency Reliability (ICR). 2. *Cronbach's Alpha Method (coefficient α is a preferred statistic). Ex. Suppose that examinees are tested on 4 essay items and the maximum score for each is 10 points. The item variances are σ²1 = 9, σ²2 = 4.8, σ²3 = 10.2, and σ²4 = 16. If the total score variance is σ²x = 100, use Cronbach's Alpha Method to calculate the internal consistency of this test. A high α coefficient (e.g., >0.90) indicates a homogeneous test.

2. *Cronbach's Alpha Method: α = (4 / 3) x (1 − (9 + 4.8 + 10.2 + 16) / 100) = (4/3) x (1 − 0.40) = 0.80.

Cronbach’s Alpha Method
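A minimal Python sketch of the α formula, using the item variances from the essay-item example above (the function name is illustrative, not from the slides):

```python
# Cronbach's alpha from item variances and the total test-score variance.
def cronbach_alpha(item_variances, total_variance):
    k = len(item_variances)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

item_variances = [9, 4.8, 10.2, 16]  # the four essay items
total_variance = 100                 # total test score variance
print(round(cronbach_alpha(item_variances, total_variance), 2))  # 0.8
```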

Procedures Requiring 1 Test Administration. A. Internal Consistency Reliability (ICR). 3. Kuder-Richardson Method. *The Kuder-Richardson Formula 20 (KR-20), first published in 1937, is a measure of internal consistency reliability for measures with dichotomous choices. It is analogous to Cronbach's α, except Cronbach's α is also used for non-dichotomous tests. For a dichotomous item, the item variance is pq = σ²i. A high KR-20 coefficient (e.g., >0.90) indicates a homogeneous test.

Procedures Requiring 1 Test Administration. KR-20: r = [k / (k − 1)] x (1 − Σpq / σ²x), where k is the number of items, p is the proportion answering an item correctly, q = 1 − p, and σ²x is the total test score variance.

Procedures Requiring 1 Test Administration  

3. *Kuder-Richardson Method (KR-20 and KR-21): see table 7.1 on the next slide, or the data on p. 136.

Variance = square of the standard deviation = 4.08

Procedures Requiring 1 Test Administration. A. Internal Consistency Reliability (ICR). *3. Kuder-Richardson Method (KR-21): used only with dichotomously scored items. It does not require computing each item's variance (you compute the variance once for the total test scores, σ²x = total test score variance); see table 7.1 for the standard deviation and variance. It assumes all items are equal in difficulty.

Procedures Requiring 1 Test Administration. KR-21: r = [k / (k − 1)] x (1 − M(k − M) / (k·σ²x)), where M is the mean of the total test scores.
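A minimal sketch of KR-20 and KR-21 in Python; the 0/1 score matrix below is made up for illustration and is not from table 7.1:

```python
# KR-20 and KR-21 from a small, made-up matrix of dichotomous (0/1) item scores.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]  # rows = examinees, columns = items

k = len(scores[0])                     # number of items
n = len(scores)                        # number of examinees
totals = [sum(row) for row in scores]  # each examinee's total score

mean_total = sum(totals) / n
var_total = sum((t - mean_total) ** 2 for t in totals) / n  # total score variance

# KR-20 needs each item's difficulty p (proportion correct) and q = 1 - p
p = [sum(row[j] for row in scores) / n for j in range(k)]
sum_pq = sum(pi * (1 - pi) for pi in p)
kr20 = (k / (k - 1)) * (1 - sum_pq / var_total)

# KR-21 needs only the mean and variance of total scores (assumes equal item difficulty)
kr21 = (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * var_total))

print(round(kr20, 2), round(kr21, 2))  # 0.8 and about 0.67 for this toy data
```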

Procedures Requiring 1 Test Administration. A. Internal Consistency Reliability (ICR). 4. *Hoyt's (1941) Method: Hoyt used repeated-measures ANOVA to obtain the variances (mean squares, MS) used to calculate Hoyt's coefficient. MS = σ² = S² = variance.

Procedures Requiring 1 Test Administration. Hoyt's coefficient: r = (MS persons − MS residual) / MS persons, where MS residual = MS error.

4. *Hoyt's (1941) Method: MS persons is the variance of the persons' total scores; MS residual (MS error) has its own calculation. SPSS next.
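A rough sketch of the ANOVA bookkeeping behind Hoyt's coefficient in Python; the person-by-item ratings below are made up for illustration:

```python
# Hoyt's reliability via a persons-by-items ANOVA decomposition (made-up ratings).
data = [
    [4, 5, 3, 4],
    [3, 4, 3, 3],
    [5, 5, 4, 5],
    [2, 3, 2, 3],
]  # rows = persons, columns = items

n = len(data)      # persons
k = len(data[0])   # items
grand_mean = sum(sum(row) for row in data) / (n * k)

person_means = [sum(row) / k for row in data]
item_means = [sum(row[j] for row in data) / n for j in range(k)]

ss_persons = k * sum((m - grand_mean) ** 2 for m in person_means)
ss_items = n * sum((m - grand_mean) ** 2 for m in item_means)
ss_total = sum((x - grand_mean) ** 2 for row in data for x in row)
ss_residual = ss_total - ss_persons - ss_items   # leftover person-by-item variation

ms_persons = ss_persons / (n - 1)
ms_residual = ss_residual / ((n - 1) * (k - 1))  # MS residual = MS error

hoyt = (ms_persons - ms_residual) / ms_persons   # numerically equal to Cronbach's alpha
print(round(hoyt, 3))  # about 0.96 for this toy data
```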

Procedures Requiring 1 Test Administration. B. Inter-Rater Reliability: a measure of consistency from rater to rater; it is a measure of agreement between the raters.

Procedures Requiring 1 Test Administration
B. Inter-Rater Reliability
Item   Rater 1   Rater 2
1      4         3
2      3         5
3      5         5
4      4         2
5      1         2
First compute r(rater 1, rater 2); then multiply by 100.

Procedures Requiring 1 Test Administration
B. Inter-Rater Reliability with more than 2 raters (raters 1, 2, and 3):
Calculate r for raters 1 & 2 = .6
Calculate r for raters 1 & 3 = .7
Calculate r for raters 2 & 3 = .8
Mean (µ) = .7 x 100 = 70%
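A minimal Python sketch of the slides' procedure (correlate the raters, then multiply by 100), using the five-item ratings above; numpy is assumed to be available:

```python
# Inter-rater reliability as described on the slide: correlate the two raters'
# ratings, then multiply by 100 (ratings from the five-item example above).
import numpy as np

rater1 = [4, 3, 5, 4, 1]
rater2 = [3, 5, 5, 2, 2]
r = float(np.corrcoef(rater1, rater2)[0, 1])
print(round(r * 100, 1))  # about 45.7 (%), i.e. r is roughly 0.46

# With more than two raters, average the pairwise correlations first:
pairwise = [0.6, 0.7, 0.8]
print(round(sum(pairwise) / len(pairwise) * 100, 1))  # 70.0 (%)
```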

Next: How to calculate inter-rater reliability using SPSS (three raters, 10 questions, on a scale of 1-5): https://www.youtube.com/watch?v=1Avl7DzKmnc How to calculate inter-rater reliability using Excel: https://www.youtube.com/watch?v=fq_LNTPgVF8

κ = Cohen's kappa. Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The first mention of a kappa-like statistic is attributed to Galton (1892).
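A minimal Python sketch of Cohen's kappa for two raters; the category labels below are made up for illustration:

```python
# Cohen's kappa for two raters classifying the same items into categories.
from collections import Counter

rater1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
rater2 = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes"]

n = len(rater1)
p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement

# Chance agreement from each rater's marginal category proportions
c1, c2 = Counter(rater1), Counter(rater2)
categories = set(rater1) | set(rater2)
p_expected = sum((c1[c] / n) * (c2[c] / n) for c in categories)

kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(kappa, 2))  # 0.5 for these made-up labels
```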

CHAPTER 8

CHAPTER 8. *Introduction to Generalizability Theory, Cronbach (1963). Generalizability is another way to calculate the reliability of a test, using ANOVA. Generalizability refers to the degree to which a particular set of measurements of an examinee generalizes to a more extensive set of measurements of that examinee (just like conducting inferential research).

Introduction to Generalizability. Generalizability Coefficient. FYI, in Classical True Score Theory reliability was defined as the ratio of true score variance to observed score variance, where X = T + E: reliability coefficient ρX1X2 = σ²T / σ²X.

In Generalizability Theory, an examinee's Universe Score is defined as the average (mean) of the measurements in the universe of generalization. (The Universe Score is the same as the True Score in classical test theory.)

Introduction to Generalizability. Key Terms. Universe: a universe is a set of measurement conditions more extensive than the conditions under which the sample measurements were obtained. Ex: If you took the Test Construction exam here at AU, then the universe (of generalization) is taking the test construction exams at several other universities:
University   Score
AU           85
FIU          90
FAU          84
NSU          80
UM           88
μ = 85.40 is called the Universe Score (same as the True Score).

Introduction to Generalizability. Key Terms. Universe Score: the same as the True Score in Classical Test Theory; it is the average (mean) of the measurements in the universe of generalization. Ex: If you take the test construction exam at other universities, the mean of your test scores is your Universe Score (see previous slide).

Introduction to Generalizability. *Generalizability Coefficient. The Generalizability Coefficient, p, is defined as the ratio of Universe Score Variance (σ²U) to expected Observed Score Variance (eσ²X). *The Generalizability Coefficient is analogous to the reliability coefficient in classical test theory: Generalizability Coefficient = p = σ²U / eσ²X. Ex. If the expected Observed Score Variance eσ²X = 10 and the Universe Score Variance σ²U = 5, then the Generalizability Coefficient is 5/10 = 0.5.
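In code this is just the ratio of the two variance components; a trivial sketch with the numbers from the example above (the function name is illustrative):

```python
# Generalizability coefficient: universe score variance over expected observed score variance.
def generalizability_coefficient(universe_var, expected_observed_var):
    return universe_var / expected_observed_var

print(generalizability_coefficient(5, 10))  # 0.5, as in the example above
```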

Introduction to Generalizability. Key Term. Facets: a facet is a part or aspect of something; here, a set of measurement conditions used to determine a performance. Ex. next slide.

Introduction to Generalizability. *Facets: Example. If two supervisors want to rate the performance of factory workers under three workloads (heavy, medium, and light), how many sets of measurement conditions (facets) will we have? See next slide.

Introduction to Generalizability. *Facets: Example. If two supervisors (IV1) want to rate the performance of factory workers under three workloads (IV2) [heavy, medium, and light], how many sets of measurement conditions (facets) will we have? Performance is the DV. See next slide.

Introduction to Generalizability. Facets. The two sets of measurement conditions, or the two facets, are: (1) the supervisors (one and two), and (2) the workloads (heavy, medium, and light). Performance is the DV (use a two-way ANOVA). Ex. 2 next slide.

Factorial design (2x3):
              Workload heavy   Workload medium   Workload light
Supervisor 1        10                9                 5
Supervisor 2         4                8                 6

Factorial designs

Introduction to Generalizability. Facets. *A researcher measures students' compositional writing on four occasions. On each occasion, each student writes compositions on two different topics. All compositions are graded by three different raters. How many facets does this design involve? See next slide.

Introduction to Generalizability. Facets. *A researcher measures students' compositional writing on four occasions (IV1). On each occasion, each student writes compositions on two different topics (IV2). All compositions are graded by three different raters (IV3). Students' compositional writing is the DV, so the design involves three facets: occasions, topics, and raters.

Introduction to Generalizability. Facets. *Facets: Example. If four professors (IV1) want to rate the performance of students on four exams (IV2) [Psychology, Math, Stats, and English], how many sets of measurement conditions (facets) will we have?

Introduction to Generalizability. Facets. *Facets: Example. If four professors (IV1) want to rate the performance of students on four exams (IV2) [Psychology, Math, Stats, and English], how many sets of measurement conditions (facets) will we have? Performance is the DV, so there are two facets: the professors and the exams.

Introduction to Generalizability. Key Term. Universe of Generalization: all of the measurement conditions for the second, more extensive set of measurements (the "universe"), such as fatigue, room temperature, specifications, etc. Ex. all of the conditions under which you took your test-construction exams at other universities.

Introduction to Generalizability. Generalizability theory distinguishes between Generalizability Studies (G-Studies) and Decision Studies (D-Studies). *G-Studies: G-Studies are concerned with the extent to which a sample of measurements generalizes to a universe of measurements. It is the study of the generalizability (quality) of the measurement procedure.

Generalizability Studies (G-Studies) and Decision Studies (D-Studies). D-Studies provide data for making decisions about examinees; they are about the adequacy of measurement for those decisions. Ex. next slide.

Generalizability Studies (G-Studies) and Decision Studies (D-Studies). Ex. Suppose we use an achievement test to test 2000 children from public schools and 2000 children from private schools. If we want to know whether this test is equally reliable for both types of schools, then we are dealing with a G-Study (quality of measurement). Ex. We can generalize a test such as the GRE to AU (private) and FIU (public) students who took the exam.

Generalizability Studies (G-Studies) and Decision Studies (D-Studies). However, if we want to compare the means of the students who took the GRE at these different types of institutions (data) and draw a conclusion about differences in the adequacy of the two educational systems, then we are dealing with a D-Study. Ex. Compare the means of AU and FIU students who took the GRE/EPPP exam.

Introduction to Generalizability. *Generalizability Designs: there are 4 different generalizability designs in generalizability theory. In the diagrams that follow, (-) = examinee and (+) = rater or examiner.

Generalizability Designs:
1. _ _ _ _ _ _ _ _ _ _   +
   One rater rates each one of the examinees (classroom exam).
2. _ _ _ _ _ _ _ _ _ _   + + +
   A group of raters rates each one of the examinees (qualifying exam or panel interview).
3. _ _ _ _ _ _ _ _ _ _   + + + + + + + + + +
   One rater rates only one examinee (a different rater for every examinee).
4. _ _ _ _ _   +++++ +++++ +++++ +++++ +++++
   Each examinee is rated by a different group of raters (most expensive).

(Research article on Generalizability) Scoring Performance Assessment Based on Judgments Using Generalizability Theory by Christopher Wing-Tat Chiu https://books.google.com/books?id=QLfRihSocp4C&pg=PA24&lpg=PA24&dq=example+of+One+rater+rates+only+one+examinee&source=bl&ots=m-eURwwXqN&sig=MnEXGWMsJPEu7lJuqHW99WDNWLc&hl=en&sa=X&ved=0ahUKEwjKtt3ZxpTQAhXkhVQKHchoBgYQ6AEIHzAA#v=onepage&q=example%20of%20One%20rater%20rates%20only%20one%20examinee&f=false

ASSIGNMENT: Please take quiz 4B and read chapters 9 and 10.