Unit 2: Test Worthiness and Making Meaning out of Raw Scores


Unit 2: Test Worthiness and Making Meaning out of Raw Scores Common Assessment Instruments for Today’s World

Test Worthiness: What Does It Take? Four requirements of test worthiness: Validity: the test measures what it is supposed to measure. Reliability: the test score is an accurate measure of the test-taker's true score. Cross-Cultural Fairness: the test result is a true reflection of the individual and not a function of cultural bias inherent in the test. Practicality: the test is appropriate for the situation.

Correlation Coefficient Correlation Coefficient: The relationship between two sets of test scores, ranging from -1.0 to +1.0. Positive Correlation: Tendency for scores to be related in the same direction. Negative Correlation: Tendency for scores to be related in opposite directions (inverse).
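The correlation coefficient described above can be computed directly. A minimal sketch in Python; the two score lists are hypothetical sample data, and `pearson_r` is an illustrative helper, not part of any library:

```python
# Sketch: Pearson correlation coefficient between two sets of scores.
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Average cross-product of deviations, scaled by both SDs (n - 1 form)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    n = len(xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

# Hypothetical data: study hours and test scores for five test-takers.
hours_studied = [2, 4, 6, 8, 10]
test_scores = [65, 70, 78, 85, 95]
print(round(pearson_r(hours_studied, test_scores), 3))  # strong positive correlation
```

Because the scores rise together, the coefficient lands near +1.0; reversing one list would flip its sign.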

Strong Correlation (Relationship) Indication of Strong Relationship: Values near -1.0 or +1.0 indicate a strong relationship. Weak or No Relationship: Values near 0. Scatterplot: Graph showing two or more sets of test scores. Positive correlation: Diagonal line rises from left to right. Negative correlation: Diagonal line falls from left to right.

Scatterplot: Positive Correlation Negative Correlation

Scatterplot: Weak or No Correlation

Coefficient of Determination: Shared Variance Coefficient of Determination: The common factors that account for a relationship; calculated as the correlation coefficient squared. Example: A correlation of .85 was found between tests of depression and anxiety. Square .85: .85 x .85 = .7225; .7225 x 100 = 72.25, or about 72%. This shows that anxiety and depression share a large number of factors, but not all factors.
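The arithmetic in the slide above is just squaring the correlation, which a one-liner verifies:

```python
# Coefficient of determination from the slide's .85 depression/anxiety correlation.
r = 0.85
shared_variance = r ** 2  # proportion of variance the two tests share
print(f"{shared_variance * 100:.2f}% shared variance")  # 72.25% shared variance
```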

Test Worthiness: Validity Validity: The degree to which a test measures what it's supposed to measure. Forms of Validity: Content Validity; Criterion-related Validity (Concurrent Validity, Predictive Validity); Construct Validity (Experimental Design Validity, Convergent Validity, Discriminant Validity).

Validity: Content Validity Content Validity: The content of the test is appropriate for what the test intends to measure Face Validity: The superficial appearance of the test. A valid test may or may not have face validity. *Face validity is not a true measure of validity

Validity: Criterion-related Validity Criterion-related Validity: The relationship between test scores and another standard. Concurrent Validity: Relationship between test scores and another currently obtainable benchmark. Predictive Validity: Relationship between test scores and a future standard. Standard Error of Estimate: The range within which a predicted score is likely to lie. False Positive: The test incorrectly predicts a test-taker will have an attribute or be successful. False Negative: The test incorrectly predicts a test-taker will not have an attribute or be successful.

Validity: Construct Validity Construct Validity: Evidence that an idea or concept is actually being measured by the test (Is the test for intelligence truly measuring intelligence?) Evidence used to measure construct validity: a) Experimental design: Using experimentation to show that a test measures a concept b) Factor analysis: Statistically examining relationship between subscales and larger construct (between individual subject areas and the test as a whole)

Validity: Construct Validity Convergent Validity: Relationship between a test and other similar tests (highly correlated, say in the .75 range). Discriminant Validity: A demonstrated lack of relationship between a test and tests of unrelated concepts (e.g., showing a low correlation between a depression test and a test of an unrelated construct).

Reliability Reliability: The degree to which test scores are free from errors of measurement “Perfect world” scenario: Test is well-made, the environment is optimal, & the test taker is at his/her best Reliability Coefficient: Are test scores consistent and dependable?

Reliability: Measuring Reliability Test-retest Reliability: Relationship between test scores from one test given at two different administrations to the same people The closer the two sets of scores, the more reliable the test Test-retest reliability is more effective in areas that are less likely to change over time

Reliability: Measuring Reliability Alternate Forms Reliability: The relationship between scores from two similar versions of the same test. The examiner designs an alternate, parallel, or equivalent form of the original test and administers this alternate form as the second test. One problem is ensuring that the two forms are truly equivalent.

Reliability: Internal Consistency Internal Consistency: Reliability measured statistically by going “within” the test (how scores on individual items relate to each other or the test as a whole) Types of Internal Consistency: 1) Split-half (odd-even) 2) Cronbach’s Coefficient Alpha 3) Kuder-Richardson

Reliability: Internal Consistency Split-half Reliability: Correlating one half of a test against the other half. Advantages of Split-half: 1) Only one test must be given; 2) No separate alternate form must be created. Disadvantages of Split-half: 1) False reliability if the two halves are not parallel or equivalent; 2) Splitting makes the test half as long (shortening a test may decrease reliability).

Reliability: Internal Consistency Spearman-Brown Formula: A mathematical compensation for the shortened test length created by splitting a test in half. Spearman-Brown reliability = 2r_hh / (1 + r_hh), where r_hh is the split-half reliability estimate. *If a test manual states that split-half reliability was used, check whether the Spearman-Brown formula was applied. If not, the test may be more reliable than the reported figure suggests.
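The Spearman-Brown correction is simple enough to sketch directly; the .70 split-half value below is a hypothetical example:

```python
# Spearman-Brown correction: estimates full-length test reliability
# from a split-half correlation r_hh.
def spearman_brown(r_hh):
    return (2 * r_hh) / (1 + r_hh)

# A split-half estimate of .70 corrects upward for the full-length test.
print(round(spearman_brown(0.70), 3))  # 0.824
```

Note that the corrected value is always at least as large as the split-half estimate, which is why an uncorrected split-half figure understates the full test's reliability.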

Reliability: Internal Consistency Cronbach’s Coefficient Alpha and Kuder-Richardson: Methods that estimate the reliability of all possible split-half combinations by correlating the score for each item on the test with the total test score and finding the average correlation across all items. Kuder-Richardson can only be used with tests that have right and wrong answers (e.g., achievement tests); Coefficient Alpha can be used with tests that have various types of responses (e.g., rating scales).
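A minimal sketch of Cronbach's alpha using its standard variance-based formula (alpha = k/(k-1) × (1 − Σ item variances / total-score variance)); the three-item, four-person rating data is hypothetical:

```python
# Cronbach's coefficient alpha from item-level variances.
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding scores for the same test-takers."""
    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]  # total score per person
    sum_item_var = sum(pvariance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - sum_item_var / pvariance(totals))

# Hypothetical rating-scale responses: 3 items answered by 4 people.
items = [
    [3, 4, 5, 2],
    [3, 5, 4, 2],
    [2, 4, 5, 3],
]
print(round(cronbach_alpha(items), 3))
```

Because the three items rank the four respondents almost identically, alpha comes out high, signaling strong internal consistency.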

Reliability: Item Response Theory Item Response Theory: Examines each item individually for its ability to measure the trait being examined Item Characteristic Curve: Assumes that as people’s abilities increase, their probability of answering an item correctly increases

Reliability: Item Characteristic Curve If the “S” curve flattens out: The item has less ability to discriminate or provide a range of probabilities of a correct or incorrect response. If the “S” curve is steep: The item creates strong differentiation across ability levels. [Figure: item characteristic curve plotting probability of a correct answer (0.0 to 1.0) against IQ/ability (55 to 145).]

Cross-cultural Fairness Cross-cultural Fairness: Degree to which cultural background, class, disability, and gender do not affect test results Tests must be carefully selected to prevent bias Test scores must be interpreted in light of the cultural, ethnic, disability, or linguistic factors that may impact scores

Practicality Practicality: Feasibility considerations in test selection and administration Major Practical Concerns: 1) Time: Amount of time to administer 2) Cost: Budgeting issues 3) Format: Print, type of questions 4) Readability: Understandability 5) Ease of Administration, Scoring, & Interpretation

Selecting & Administering a Good Test 1) Determine the goals of your client. 2) Choose an instrument to reach those goals. 3) Access information about possible instruments from source books on testing, such as the Buros Mental Measurements Yearbook and Tests in Print. 4) Examine the validity, reliability, cross-cultural fairness, and practicality of the possible instruments. 5) Choose an instrument wisely.

Unit 2: Statistical Concepts Making Meaning Out of Raw Scores

Raw Scores are Meaningless Raw Scores: Untreated score before manipulation or processing Norm Group Comparisons Are Helpful: 1) Tells us relative position within the norm group 2) Allows us to compare the results among test- takers 3) Allows us to compare test results on two or more different tests taken by same person

Procedures for Normative Comparisons Frequency Distribution: List of scores & number of times a score occurred Orders a set of scores from highest to lowest & lists corresponding frequency of each score Allows identification of most frequent scores and helps identify where an individual’s score falls relative to the rest of the group
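The frequency-distribution procedure above maps naturally onto a tally; a minimal sketch using hypothetical scores:

```python
# Frequency distribution: tally each score, then list from highest to lowest.
from collections import Counter

scores = [85, 90, 85, 70, 90, 90, 65, 85]
freq = Counter(scores)  # score -> number of times it occurred
for score, count in sorted(freq.items(), reverse=True):
    print(f"{score}: {count}")
```

The sorted listing makes the most frequent scores obvious and shows at a glance where any one person's score sits relative to the group.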

Histograms & Frequency Polygons Histogram: Bar graph of class intervals & frequency of a set of scores Class Intervals: Grouping scores by a pre-determined range Frequency Polygon: Line graph of class intervals & frequency of a set of scores

Cumulative Distributions (Ogive Curve) Cumulative Distribution: Line graph to examine percentile rank of a set of scores Applications: Good for conveying information about percentile rank

Normal Curves & Skewed Curves Normal Curve: Bell-shaped curve that human traits tend to fall along Predictable pattern that occurs whenever we measure human traits and abilities Skewed Curves: Test scores that do not fall along a normal curve Negatively Skewed Curve: Majority of scores at the upper end Positively Skewed Curve: Majority of scores at the lower end

Measures of Central Tendency Central Tendency: Gives you a sense of how close a score is to the middle of the distribution. Three Measures of Central Tendency: 1) Mean: Arithmetic average of all scores (add all scores and divide by the number of scores). 2) Median: Middle score (50% of scores fall above, 50% fall below). 3) Mode: Most frequently occurring score. *In a skewed distribution, the median is a better measure of central tendency.
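The three measures above are available directly in Python's `statistics` module; the score set is hypothetical:

```python
# Mean, median, and mode for a small hypothetical score set.
from statistics import mean, median, mode

scores = [70, 75, 80, 80, 85, 90, 95, 100]
print(mean(scores))    # 84.375 (sum of scores / number of scores)
print(median(scores))  # 82.5 (average of the two middle scores)
print(mode(scores))    # 80 (most frequent score)
```

Here the single score of 100 pulls the mean above the median, a small-scale version of why the median is preferred for skewed distributions.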

Measures of Variability Measures of Variability: How much scores vary in a distribution Three Measures of Variability: 1) Range: Difference between highest & lowest score plus 1 2) Interquartile Range: Middle 50% of scores around the median 3) Standard Deviation: How scores vary from the mean

Measures of Variability: Range Range: Tells you the distance from the highest to lowest score Calculated by subtracting the lowest score from the highest score and adding 1

Measures of Variability: Interquartile Range Interquartile Range: Provides the range of the middle 50% of scores around the median. Useful with skewed curves because it offers a more representative picture of where a large percentage of scores fall. To calculate: Subtract the score that is 1/4 of the way from the bottom from the score that is 3/4 of the way from the bottom and divide by 2. Then add this number to, and subtract it from, the median.
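The procedure above can be sketched as follows. The scores are hypothetical, and taking the quartile scores at the 1/4 and 3/4 index positions of the sorted list is a simplification (texts differ on the exact quartile computation):

```python
# Interquartile-range band around the median, per the steps above.
from statistics import median

scores = sorted([60, 65, 70, 72, 75, 80, 85, 95])
q1 = scores[len(scores) // 4]        # score 1/4 of the way from the bottom
q3 = scores[3 * len(scores) // 4]    # score 3/4 of the way from the bottom
half_spread = (q3 - q1) / 2          # subtract, then divide by 2
mid = median(scores)
# The middle ~50% of scores lies roughly within this band around the median.
print(mid - half_spread, mid + half_spread)
```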

Measures of Variability: Standard Deviation Standard Deviation: Describes how scores vary from the mean. In all normal curves, the percentage of scores between standard deviation units is the same: roughly 68% of people fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three. *Whether scores are adequate is in the “eye of the beholder.”
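A minimal sketch of the standard deviation and the "within one SD" idea, on a small hypothetical score set (with only eight scores the observed fraction will not match the normal-curve 68% exactly):

```python
# Standard deviation of a hypothetical score set, plus the fraction of
# scores that fall within one SD of the mean.
from statistics import mean, pstdev

scores = [85, 90, 95, 100, 100, 105, 110, 115]
m, sd = mean(scores), pstdev(scores)  # population standard deviation
within_one_sd = sum(m - sd <= s <= m + sd for s in scores) / len(scores)
print(round(sd, 2), within_one_sd)
```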

Common Assessments: Situation Specific Developmental Disabilities: Impairment in cognitive, communication, social/emotional, & adaptive (daily living skills) functioning. Assessments Used: 1) Bayley Scales of Infant Development 2) Wechsler Preschool & Primary Scale of Intelligence, 3rd Ed. 3) Wechsler Intelligence Scale for Children, 4th Ed. 4) Autism Diagnostic Observation Schedule 5) Vineland Adaptive Behavior Scale, 2nd Ed.

Common Assessments: Situation Specific Learning Disabilities: Disorders that affect a broad range of academic & functional skills, e.g., speaking, listening, reading, writing, spelling, & completing math calculations; a deficit in one or more of the ways the brain processes information. Assessments: 1) Wechsler Preschool & Primary Scale of Intelligence 2) Wechsler Intelligence Scale for Children, 4th Ed. 3) Wechsler Adult Intelligence Scale, 3rd Ed. 4) Wechsler Individual Achievement Test, 2nd Ed.

Learning Disabilities Assessments, Continued 5) Wechsler Memory Scale, 3rd Ed. 6) Woodcock-Johnson Test of Achievement, 3rd Ed. 7) Comprehensive Test of Phonological Processing 8) Attention Deficit Disorder Evaluation Scale (Home, Self-report, & School version) 9) Beck Depression Inventory, 2nd Ed. 10) Beck Anxiety Inventory

Common Assessments: Situation Specific Attention Deficit/Hyperactivity Disorder 1) Wechsler Intelligence Scale for Children, 4th Ed. 2) Processing Speed Index 3) Wechsler Adult Intelligence Scale, 3rd Ed. 4) Woodcock-Johnson Test of Achievement, 3rd Ed. 5) Understanding Directions Subtest 6) Attention Deficit Disorder Evaluation Scale (Home, Self-report, & School version) 7) Behavior Assessment System for Children, 2nd Ed. (Parent report, Teacher report, Self-report)

Common Assessments: Situation Specific Gifted and Talented Evaluation: Individuals who are so gifted or advanced, they need special provisions to meet their educational needs Assessments 1) Wechsler Preschool & Primary Scale of Intelligence (3rd Ed.) 2) Wechsler Intelligence Scale for Children, 4th Ed. 3) Wechsler Adult Intelligence Scale, 3rd Ed.