Test Development.

Stages
Test conceptualization - defining the test
Test construction - selecting a measurement scale, developing items
Test tryout
Item analysis
Revising the test

1. Test conceptualization
Defining the scope, purpose, and limits of the test.

Initial questions in test construction
Should the item content be similar or varied?
Should the range of difficulty be narrow or broad? (A test that is too easy produces a ceiling effect; one that is too hard produces a floor effect.)
How many items should be created?

Which domains should be tapped? The test developer may specify content domains and cognitive skills that must be included on the test.
What kind of test item should be used?

2. Test construction

Selecting a scaling method

Levels of measurement: NOIR - nominal, ordinal, interval, ratio.

Scaling methods
Most are rating scales that are summative.
May be unidimensional or multidimensional.

Method of paired comparisons
Also known as forced choice: the test taker must pick one item from each pair presented.
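
As an illustration (not from the slides), forced choices can be turned into a rank order by counting how often each stimulus is chosen across all pairs. A minimal Python sketch with hypothetical stimuli and choices:

```python
from itertools import combinations

# Hypothetical stimuli and one test taker's forced choice for every pair.
stimuli = ["A", "B", "C", "D"]
choices = {("A", "B"): "A", ("A", "C"): "C", ("A", "D"): "A",
           ("B", "C"): "C", ("B", "D"): "D", ("C", "D"): "C"}

# Score each stimulus by the number of pairs in which it was chosen.
wins = {s: 0 for s in stimuli}
for pair in combinations(stimuli, 2):
    wins[choices[pair]] += 1

# Rank order, most preferred first.
print(sorted(wins, key=wins.get, reverse=True))  # ['C', 'A', 'D', 'B']
```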

Comparative scaling
Test takers sort cards or rank items from "least" to "most."

Categorical scaling
Test takers sort stimuli into one of two or more categories. The stimuli are assumed to differ quantitatively, not qualitatively.

Likert-type scales
Response choices are ordered on a continuum from one extreme to the other (e.g., strongly agree to strongly disagree). Likert scaling assumes an interval scale, although this assumption may not hold in practice.
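
A minimal sketch of summative scoring for a Likert-type scale, assuming a 5-point response format; the responses and the choice of reverse-keyed items are hypothetical:

```python
# One test taker's responses on a 5-point scale
# (1 = strongly disagree, 5 = strongly agree).
responses = [4, 2, 5, 1, 3]
reverse_keyed = {1, 3}   # zero-based indices of negatively worded items
SCALE_MAX = 5

# Reverse-score the negatively worded items, then sum (summative model).
scored = [SCALE_MAX + 1 - r if i in reverse_keyed else r
          for i, r in enumerate(responses)]
print(scored, sum(scored))  # [4, 4, 5, 5, 3] 21
```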

Guttman scales
Response choices for each item are various statements that lie on a continuum. Endorsing the most extreme statement reflects endorsement of the milder statements as well.

Method of equal-appearing intervals
Presumed to yield interval-level data.
For a knowledge scale: obtain true/false statements; experts rate each item.
For an attitude scale: judges rate each item on a Likert-type scale, assuming equal intervals.
In both cases, the test taker's total score is based on weighted items, with the weights determined by averaging the experts' ratings.
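
A sketch of the weighting step, assuming each item's scale value is simply the mean of the judges' ratings; all numbers are hypothetical:

```python
# Judges' equal-appearing-interval ratings for each statement
# (e.g., on a 1-11 continuum from mildest to most extreme).
judge_ratings = {
    "item1": [2, 3, 2],     # mild statement
    "item2": [6, 5, 7],
    "item3": [10, 9, 11],   # extreme statement
}
scale_values = {item: sum(r) / len(r) for item, r in judge_ratings.items()}

# The test taker's score reflects the weights of the items endorsed.
endorsed = ["item1", "item3"]
score = sum(scale_values[i] for i in endorsed) / len(endorsed)
print(round(score, 2))  # 6.17
```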

Method of absolute scaling
A way to determine the difficulty level of items. Give the items to several age groups, with one age group acting as the anchor. Item difficulty is assessed by comparing each age group's performance on each item with the anchor group's performance.

Method of empirical keying
Based entirely on empirical findings. The test developer writes a pool of items and administers them to a group known to possess the construct and a group known not to possess it. Items are selected based on how well they distinguish one group from the other.
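
A sketch of empirical item selection along these lines, with hypothetical 0/1 endorsement data and an arbitrary retention cutoff:

```python
# Hypothetical endorsements: rows are people, columns are items.
criterion_group = [[1, 0, 1, 1], [1, 1, 1, 0], [1, 0, 1, 1]]
comparison_group = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 0]]

def endorsement_rates(group):
    """Proportion of the group endorsing each item."""
    return [sum(person[j] for person in group) / len(group)
            for j in range(len(group[0]))]

crit = endorsement_rates(criterion_group)
comp = endorsement_rates(comparison_group)

# Retain items whose endorsement rates best separate the two groups.
diffs = [a - b for a, b in zip(crit, comp)]
keep = [j for j, d in enumerate(diffs) if abs(d) >= 0.40]  # arbitrary cutoff
print(keep)  # items 0 and 3 distinguish the groups best here
```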

Writing the items

Item format
Selected response
Constructed response

Multiple choice
Pros: scoring is fast and objective; a test can sample content broadly.
Cons: guessing is possible; plausible distractors are hard to write.

Matching
Pros: efficient; many associations can be tested in little space.
Cons: limited to factual associations; poorly built lists can cue the answers.

True/False
A forced-choice methodology.
Pros: items are easy to write and score; many can be administered quickly.
Cons: a 50% chance of guessing correctly; tends to test isolated facts.

Fill in
Pros: minimizes guessing; items are quick to write.
Cons: scoring is less objective; tends to reward rote recall.

Short answer (objective item)
Pros: minimizes guessing; relatively easy to construct.
Cons: scoring is slower and somewhat subjective.

Essay
Pros: assesses depth of understanding, organization, and originality.
Cons: scoring is subjective and time-consuming; content sampling is narrow.

Scoring items
Cumulative model: the total score is the sum of the scored responses; a higher score indicates more of the trait.
Class/category model: responses earn placement in a particular class or category.
Ipsative model: scores are compared within the same test taker rather than across people.
Correction for guessing (a standard formula is sketched below).
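
The slide leaves the correction unstated, but a standard correction-for-guessing formula is corrected score = R - W / (k - 1), where R is the number right, W the number wrong (omitted items are excluded), and k the number of response options. A minimal sketch:

```python
def corrected_score(num_right, num_wrong, num_options):
    """Standard correction for guessing: R - W / (k - 1).
    Omitted items count neither as right nor as wrong."""
    return num_right - num_wrong / (num_options - 1)

# 40 right, 12 wrong, 8 omitted on a 60-item four-option test:
print(corrected_score(40, 12, 4))  # 36.0
```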

3. Test tryout
Should be conducted on a group that represents the ultimate population of test takers (those for whom the test is intended).
Good items are reliable and valid, and they discriminate well.

Before item analysis, look at the variability of scores within the test. Floor effect? Ceiling effect?

4. Item analysis helps determine which items should be kept, revised, or deleted.

Item-difficulty index
The proportion of examinees who answer the item correctly. Averaging across items gives a mean item difficulty for the test.

Ideal item difficulty
When using multiple-choice ("multiple guess") items, account for the probability of chance success: optimal item difficulty = (1 + g) / 2, where g is the probability of guessing correctly. For a four-option item, g = .25, so the optimum is (1 + .25) / 2 = .625. The exception to choosing item difficulties around the mid-range is a test built for extreme groups.
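
A minimal sketch of both computations, assuming items scored 0/1; the response matrix is hypothetical:

```python
# Hypothetical 0/1 responses: rows are examinees, columns are items.
responses = [[1, 1, 0, 1],
             [1, 0, 0, 1],
             [1, 1, 1, 0],
             [0, 1, 0, 1]]

# Item-difficulty index: proportion answering each item correctly.
n = len(responses)
p = [sum(row[j] for row in responses) / n for j in range(len(responses[0]))]
print(p)  # [0.75, 0.75, 0.25, 0.75]

# Optimal difficulty for a four-option multiple-choice item (g = .25):
g = 0.25
print((1 + g) / 2)  # 0.625
```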

Item-endorsement index
The proportion of examinees who endorse the item; used where there is no correct answer, as on personality and attitude tests.

Item-reliability index
An indication of internal consistency.
The product of the item's standard deviation and the correlation between the item score and the total scale score.
Items with low reliability indices can be eliminated.

Item-validity index
Correlate the item with a criterion (helps identify predictively useful test items).
The index is the product of the item's standard deviation and the correlation between the item score and the criterion score.
The usefulness of an item also depends on its dispersion and its ability to discriminate.
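
A sketch of both indices as defined above; the scores are hypothetical, and statistics.correlation requires Python 3.10+:

```python
from statistics import pstdev, correlation  # correlation: Python 3.10+

# Hypothetical scores for six examinees: one 0/1 item, the total test
# score, and an external criterion measure.
item = [1, 0, 1, 1, 0, 1]
total = [38, 22, 41, 35, 25, 44]
criterion = [3.4, 2.1, 3.9, 3.0, 2.6, 3.8]

s_item = pstdev(item)
rel_index = s_item * correlation(item, total)      # item-reliability index
val_index = s_item * correlation(item, criterion)  # item-validity index
print(round(rel_index, 3), round(val_index, 3))
```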

Item-discrimination index
How well the item discriminates between high scorers and low scorers on the test. For each item, compare the performance of those in the upper vs. lower performance ranges.
Formula: d = (U - L) / N
U = number of people in the upper range who got the item right
L = number of people in the lower range who got the item right
N = total number of people in the upper OR lower range
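
A minimal sketch of the formula with hypothetical counts, assuming upper and lower groups of equal size (e.g., the top and bottom 27% of total scores):

```python
def discrimination_index(upper_correct, lower_correct, group_size):
    """d = (U - L) / N for one item, where N is the size of one group."""
    return (upper_correct - lower_correct) / group_size

# Of 20 examinees in each group, 17 in the upper group and 6 in the
# lower group answered the item correctly:
print(discrimination_index(17, 6, 20))  # 0.55
```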

Interpreting the IDI
d can vary from -1 to +1.
A negative number means more low scorers than high scorers answered correctly: the item is flawed.
A 0 indicates the item does not discriminate at all.
The closer the IDI is to +1, the better the item separates high from low scorers.
The IDI approach can also be used to examine the pattern of incorrect responses.

Item characteristic curves
A graphic representation of item difficulty and discrimination.
Horizontal axis = ability; vertical axis = probability of a correct response.

An ICC plots the probability of a correct response against the test taker's standing on the entire test. If the curve rises like an inclined slope or an S, the item is doing a good job of separating low and high scorers.
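
A sketch of an S-shaped ICC using the two-parameter logistic form from item response theory; the discrimination (a) and difficulty (b) values are hypothetical:

```python
import math

def icc(theta, a=1.2, b=0.0):
    """2PL item characteristic curve: probability of a correct response
    at ability theta, with discrimination a and difficulty b."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# Probability rises with ability; the steeper the curve around b,
# the better the item discriminates.
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc(theta), 2))
```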

Item fairness
Items should measure the same thing across groups.
Items should have similar ICCs across groups.
Items should have similar predictive validity across groups.

Speed tests
Items are easy and similar; nearly everyone answers them correctly, so scores depend on speed.
Response time can be measured instead.
Traditional item analyses do not apply.

Qualitative item analysis
Test takers' descriptions of the test
Think-aloud administrations
Expert panels

5. Revising the test
Based on the information obtained from the item analysis. New items, and additional tryouts of those items, may be required.

Cross-validation
Once you have a revised test, you need to seek new, independent confirmation of the test's validity. The researcher uses a new sample to determine whether the test predicts the criterion as well as it did in the original sample.

Validity shrinkage
Typically, with cross-validation, you will find that the test predicts the criterion less accurately in the new sample.
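
A simulation sketch of cross-validation and shrinkage: items are selected because they predict the criterion in a development sample (partly capitalizing on chance), so the validity coefficient drops when the same key is scored in a new sample. All data are simulated; statistics.correlation requires Python 3.10+:

```python
import random
from statistics import correlation  # Python 3.10+

random.seed(0)
N_ITEMS, N_PEOPLE = 40, 60

def sample():
    """Simulated 0/1 responses; only the first five items truly
    relate to the criterion, so selection capitalizes on chance."""
    items = [[random.randint(0, 1) for _ in range(N_ITEMS)]
             for _ in range(N_PEOPLE)]
    crit = [sum(row[:5]) + random.gauss(0, 2) for row in items]
    return items, crit

dev_items, dev_crit = sample()
new_items, new_crit = sample()

# "Develop" the test: keep the 10 items that best predict the
# criterion in the development sample.
item_r = [correlation([row[j] for row in dev_items], dev_crit)
          for j in range(N_ITEMS)]
key = sorted(range(N_ITEMS), key=lambda j: item_r[j], reverse=True)[:10]

def score(items):
    return [sum(row[j] for j in key) for row in items]

r_dev = correlation(score(dev_items), dev_crit)
r_new = correlation(score(new_items), new_crit)  # typically lower
print(round(r_dev, 2), round(r_new, 2))
```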

Co-validation
Validating two or more tests on the same sample at the same time.
Co-norming: norming two or more tests on the same sample.
Saves money and is beneficial for tests that are used together.

6. Publishing the test
The final step, which involves development of the test manual.

Production of testing materials
User-friendly testing materials will be better accepted. The layout of the materials should allow for smooth administration.

Technical manual
Summarizes the technical data and references. Item analyses, scale reliabilities, validation evidence, etc., can be found here.

User's manual
Provides instructions for administration, scoring, and interpretation. The Standards for Educational and Psychological Testing recommend that manuals meet several goals (p. 135). Two of the most important: (1) describe the rationale and recommended uses of the test; (2) provide data on reliability and validity.

Testing is big business