Measurement Reliability: Objective & Subjective Tests; Standardization & Inter-rater Reliability; Properties of a "Good Item"; Item Analysis; Internal Reliability.


Measurement Reliability
Objective & Subjective tests
Standardization & Inter-rater reliability
Properties of a "good item"
Item Analysis
Internal Reliability
– Spearman-Brown Prophecy Formula -- α & # items
External Reliability
– Test-retest Reliability
– Alternate Forms Reliability

Types of Tests. Most self-report tests used in Psychology and Education are "objective tests": the term "objective" refers to the question format and scoring process. The most commonly used objective format is the Likert-type item -- a statement is provided and the respondent is asked to mark one of the following: Strongly Disagree, Disagree, Neither Agree nor Disagree, Agree, Strongly Agree. Different numbers of selections and "verbal anchors" are used (and don't forget about reverse-keying). This item format is considered "objective" because scoring it requires no interpretation of the respondent's answer.

On the other hand, "subjective tests" require some interpretation or rating to score the response. Probably the most common type is called content coding: respondents are asked to write an answer to a question (usually a sentence to a short paragraph). The rater then either "counts" the number of times each specified category of word or idea occurs, and these counts become the "score" for the question (e.g., counting the attributes "offered" and "sought" in the archival exercise in 350 Lab), or "assigns" all or part of the response to a particular "category" (e.g., determining whether the author was seeking to contact a male or female).

We would need to assess the inter-rater reliability of the scores from these "subjective" items. Have two or more raters score the same set of tests (usually 25-50% of the tests), then assess the consistency of the scores given by different raters using Cohen's Kappa -- a common criterion is .70 or above (see the sketch below). Ways to improve inter-rater reliability: improved standardization of the measurement instrument, instruction in the elements of the standardization, practice with the instrument (with feedback), experience with the instrument, and use of the instrument with the intended population/application.
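The slides do not show the kappa computation itself; here is a minimal Python sketch of it, assuming each rater's category codes are stored in a parallel list (the rater names and data below are hypothetical, not from the original example).

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters assigning each case to a nominal category."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(rater1, rater2)
    # Observed agreement: proportion of cases where the two raters match
    p_observed = np.mean(rater1 == rater2)
    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over categories
    p_chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical codes from two raters content-coding the same 10 responses
rater_a = ["M", "F", "M", "M", "F", "M", "F", "M", "M", "F"]
rater_b = ["M", "F", "M", "F", "F", "M", "F", "M", "M", "M"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.58 -- below the .70 criterion
```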

Properties of a "Good Item". Each item must reflect the construct/attribute of interest ("content validity is assured, not assessed"). Each item should be positively related to the construct/attribute of interest ("positively monotonic"). [Figure: scatterplot of each person's score on the item (item response, lower to higher) against the construct of interest (lower to higher values), with curves illustrating a perfect item, a great item, common items, and bad items.]

But, there's a problem… We don't have scores on "the construct/attribute". So, what do we do? Use our best approximation of each person's construct/attribute score -- which is their composite score on the set of items written to reflect that construct/attribute. Yep -- we use the set of untested items to make decisions about how "good" each of the items is. But, how can this work? We'll use an iterative process -- not a detailed analysis, just looking for "really bad" items.

Process for Item Analysis (a sketch of one pass appears below)
1st Pass: compute a total score from the set of items written to reflect the specific construct/attribute; recode the total score into five ordered categories, dividing the sample into five groups (low to high total scores); for each item, plot the means of the five groups on the item; look for items that are "flat," "quadratic," or "backward"; drop "bad" items -- don't get carried away -- keep all you can.
2nd Pass: compute a new total from the items you kept; re-recode the new total score into 5 categories; replot all the items (including the ones dropped on the 1st pass).
Additional Passes: repeat until "stable."
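A minimal Python sketch of a single pass, assuming item responses sit in a respondents × items numpy array; the function name and data are illustrative, not part of the original lecture.

```python
import numpy as np

def item_analysis_pass(items):
    """One item-analysis pass: total score -> five ordered groups -> mean item
    response within each group. Rows are respondents, columns are items
    (already reverse-keyed where needed). A 'good' item rises steadily from
    the lowest to the highest total-score group; flat, quadratic, or backward
    patterns flag candidates for dropping."""
    total = items.sum(axis=1)
    cutoffs = np.percentile(total, [20, 40, 60, 80])   # quintile boundaries
    group = np.digitize(total, cutoffs)                # group codes 0 (low) .. 4 (high)
    return np.array([items[group == g].mean(axis=0) for g in range(5)])

# Hypothetical data: 200 respondents x 8 Likert-type items scored 1-5
# (purely random here, just to make the sketch runnable -- real questionnaire
# responses would replace this array)
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 8))
print(np.round(item_analysis_pass(responses), 2))      # 5 rows (groups) x 8 columns (items)
```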

Internal Reliability. The question of internal reliability is whether or not the set of items "hangs together" or "reflects a central construct". If each item reflects the same "central construct", then the aggregate (sum or average) of those items ought to provide a useful score on that construct. Ways of Assessing Internal Reliability: Split-half reliability -- the items were randomly divided into two half-tests and the scores of the two half-tests were correlated; high correlations (.7 and higher) were taken as evidence that the items reflect a "central construct". Split-half reliability is "easily" done by hand (before computers) but has been replaced by...

Cronbach's α -- a measure of the consistency with which individual items inter-relate to each other:

    α = (i × R) / (1 + (i − 1) × R),   where i = # items and R = average correlation among the items

From this formula you can see two ways to increase the internal consistency of a set of items: increase the similarity of the items (which will increase their average correlation R), or increase the number of items. α values range from 0 to 1.00 (larger is better); "good" α values are .70 and above.
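Neither computation appears in the original slides; here is a minimal Python sketch of both the split-half correlation and the average-correlation form of α, assuming the same kind of respondents × items array as above (data and names are hypothetical).

```python
import numpy as np

def split_half_reliability(items, seed=0):
    """Correlate total scores on two random halves of the item set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(items.shape[1])
    half1 = items[:, order[::2]].sum(axis=1)    # every other item -> first half-test
    half2 = items[:, order[1::2]].sum(axis=1)   # remaining items  -> second half-test
    return np.corrcoef(half1, half2)[0, 1]

def standardized_alpha(items):
    """Cronbach's alpha computed from the average inter-item correlation."""
    corr = np.corrcoef(items, rowvar=False)      # item x item correlation matrix
    i = corr.shape[0]                            # number of items
    r_bar = (corr.sum() - i) / (i * (i - 1))     # average off-diagonal correlation
    return (i * r_bar) / (1 + (i - 1) * r_bar)

# Hypothetical data: 200 respondents, 8 Likert-type items driven by one latent trait
rng = np.random.default_rng(1)
trait = rng.normal(size=(200, 1))
responses = np.clip(np.round(3 + trait + rng.normal(size=(200, 8))), 1, 5)
print(round(split_half_reliability(responses), 2))
print(round(standardized_alpha(responses), 2))
```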

Assessing  using SPSS Itemcorrected alpha if item-total rdeleted i i i i i i i Coefficient Alpha =.58 Tells the  for this set of items Correlation between each item and a total comprised of all the other items (except that one) negative item-total correlations indicate either... very “poor” item reverse keying problems What the alpha would be if that item were dropped drop items with alpha if deleted larger than alpha don’t drop too many at a time !! Usually do several “passes” rather that drop several items at once.

Assessing  using SPSS Itemcorrected alpha if item-total rdeleted i i i i i i i Coefficient Alpha =.58 Pass #1 All items with “-” item-total correlations are “bad” check to see that they have been keyed correctly if they have been correctly keyed -- drop them notice this is very similar to doing an item analysis and looking for items within a positive monotonic trend

Assessing  using SPSS Itemcorrected alpha if item-total rdeleted i i i i i i Coefficient Alpha =.71 Pass #2, etc Check that there are now no items with “-” item-total corrs Look for items with alpha-if- deleted values that are substantially higher than the scale’s alpha value don’t drop too many at a time probably i7 probably not drop i1 & i5 recheck on next “pass” it is better to drop 1-2 items on each of several “passes”

Whenever we've considered research designs and statistical conclusions, we've always been concerned with "sample size". We know that larger samples (more participants) lead to more reliable estimates of the mean and std, r, F & χ², and to more reliable statistical conclusions (quantified as fewer Type I and II errors). The same principle applies to scale construction -- "more is better" -- but now it applies to the number of items comprising the scale. More (good) items lead to a better scale: they more adequately represent the content/construct domain and provide a more consistent total score (a respondent can change more items before the total changes much). In fact, there is a formulaic relationship between the number of items and α (how we quantify scale reliability): the Spearman-Brown Prophecy Formula.

Here are the two most common forms of the formula. Note: ρX = reliability of the test/scale; ρK = desired (or resulting) reliability; k = the factor by which you must lengthen the test to obtain ρK.

Starting with the reliability of the scale (ρX) and the desired reliability (ρK), estimate by what factor you must lengthen the test to obtain the desired reliability (k):

    k = [ρK × (1 − ρX)] / [ρX × (1 − ρK)]

Starting with the reliability of the scale (ρX), estimate the resulting reliability (ρK) if the test length were increased by a certain factor (k):

    ρK = (k × ρX) / (1 + (k − 1) × ρX)

Examples -- You have a 20-item scale with ρX = .50. How many items would need to be added to increase the scale reliability to .70?

    k = [ρK × (1 − ρX)] / [ρX × (1 − ρK)] = [.70 × (1 − .50)] / [.50 × (1 − .70)] = .35 / .15 = 2.33

k is a multiplicative factor -- NOT the number of items to add. To reach ρK we will need 20 × k = 20 × 2.33 = 46.6 ≈ 47 items, so we must add 27 new items to the existing 20 items. Please Note: This use of the formula assumes that the items to be added are "as good" as the items already in the scale (i.e., have the same average inter-item correlation R). This is unlikely!! You wrote items, discarded the "poorer" ones during the item analysis, and now need to write still more that are as good as the best you've got???

Examples -- You have a 20-item scale with ρX = .50. To what would the reliability increase if we added 30 items?

    k = (# original + # new) / # original = (20 + 30) / 20 = 2.5

    ρK = (k × ρX) / (1 + (k − 1) × ρX) = (2.5 × .50) / (1 + (2.5 − 1) × .50) = 1.25 / 1.75 = .71

Please Note: This use of the formula assumes that the items to be added are "as good" as the items already in the scale (i.e., have the same average inter-item correlation R). This is unlikely!! You wrote items, discarded the "poorer" ones during the item analysis, and now need to write still more that are as good as the best you've got??? So, this is probably an over-estimate of the resulting α if we were to add 30 items.
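For convenience, here is a short Python sketch that reproduces both worked examples; the function names are illustrative, not part of the original slides.

```python
def lengthen_factor(rho_x, rho_k):
    """Factor k by which a test with reliability rho_x must be lengthened to reach rho_k."""
    return (rho_k * (1 - rho_x)) / (rho_x * (1 - rho_k))

def prophesied_reliability(rho_x, k):
    """Predicted reliability after lengthening a test of reliability rho_x by factor k."""
    return (k * rho_x) / (1 + (k - 1) * rho_x)

# Example 1: 20-item scale with rho_x = .50, target reliability .70
k = lengthen_factor(0.50, 0.70)
print(round(k, 2), round(20 * k))        # 2.33, 47 items in the lengthened scale

# Example 2: add 30 items to the same 20-item scale (k = 50/20 = 2.5)
print(round(prophesied_reliability(0.50, (20 + 30) / 20), 2))   # 0.71
```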

External Reliability: Test-Retest Reliability -- consistency of scores across re-testing. The test-retest interval is usually 2 weeks to 6 months. It is assessed using a combination of correlation and a within-groups (paired) t-test. The key to assessing test-retest reliability is to recognize that we depend upon tests to give us the "right score" for each person, and the score can't be "right" if it isn't consistent -- the same score each time. For years, assessment of test-retest reliability was limited to correlational analysis (r > .70 is "good")… but consider the scatterplots shown later.

Alternate Forms Reliability. Sometimes it is useful to have "two versions" of a test -- called alternate forms -- if the test is used for any type of "before vs. after" evaluation; this can minimize "sensitization" and "reactivity". Alternate forms reliability is assessed similarly to test-retest reliability: the two forms are administered (usually at the same time), and we want respondents to get the "same score" from both versions. "Good" alternate forms will have a high correlation and a small (nonsignificant) mean difference.

External Reliability. You can gain substantial information by giving a test-retest of the alternate forms: administer Form A and Form B at Time 1 (Fa-t1, Fb-t1) and again at Time 2 (Fa-t2, Fb-t2). Correlations between the two forms at the same time are alternate-forms evaluations, correlations between the same form across times are test-retest evaluations, and correlations between different forms across different times are mixed evaluations. Usually we find that AltForms > Test-Retest > Mixed. Why?

Evaluating External Reliability. The key to assessing test-retest (or alternate forms) reliability is to recognize that we must assess what we want the measure to tell us. Sometimes we primarily want the measure to "line up" the respondents, so we can compare this ordering with how they line up on some other attribute -- this is what we are doing with most correlational research. If so, then a "reliable measure" is one that lines up respondents the same way each time, and we assess this by simply correlating the test-retest or alternate-forms scores. Other times we are counting on the actual "score" to be the same across times or forms. If so, even r = 1.00 is not sufficient (the means could still differ). "Similar scores" are demonstrated by a combination of a good correlation (similar rank orders) and no mean difference (a similar center to the rankings). A sketch of both checks appears below.
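The slides show the two checks graphically; as a minimal Python sketch (with hypothetical scores and an illustrative function name), both can be computed together like this:

```python
import numpy as np

def external_reliability_summary(time1, time2):
    """Test-retest (or alternate-forms) check: correlation for 'same rank order'
    plus a paired (within-groups) t statistic for 'same mean score'."""
    time1, time2 = np.asarray(time1, dtype=float), np.asarray(time2, dtype=float)
    r = np.corrcoef(time1, time2)[0, 1]
    diff = time2 - time1
    t = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))
    return r, t

# Hypothetical scores for 6 respondents tested twice; a 'good' result shows a
# high r and a t near zero (no systematic shift between the two means)
test   = [42, 35, 50, 28, 39, 45]
retest = [44, 33, 51, 30, 38, 46]
r, t = external_reliability_summary(test, retest)
print(round(r, 2), round(t, 2))
```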

Here's a scatterplot of the test (x-axis) vs. retest (y-axis) data. [Scatterplot of retest scores against test scores: the correlation is strong, and the paired t-test is significant (t = 3.2); the specific r and p values are not reproduced here.] What's "good" about this result? Good test-retest correlation. What's "bad" about this result? Substantial mean difference -- folks tended to have retest scores lower than their test scores.

Here's another. [Scatterplot of retest scores against test scores: the correlation is weak, and the paired t-test is nonsignificant (t = 1.2); the specific r and p values are not reproduced here.] What's "good" about this result? Good mean agreement! What's "bad" about this result? Poor test-retest correlation!

Here's another. [Scatterplot of retest scores against test scores: the correlation is strong, and the paired t-test is nonsignificant (t = 1.2); the specific r and p values are not reproduced here.] What's "good" about this result? Good mean agreement and good correlation! What's "bad" about this result? Not much!