LECTURE 06B BEGINS HERE. THIS IS WHERE MATERIAL FOR EXAM 3 BEGINS.
RIGOR OF ASSESSMENT IN NORM-REFERENCED TESTING (HUTCHINSON, 1996)
RIGOR OF ASSESSMENT (PART OF ASSESSING PSYCHOMETRIC ADEQUACY)
Validity
  Extent to which a procedure actually measures what it claims to measure
Reliability
  Consistency of response/performance elicitation
Remember: Can be applied to both norm-referenced and criterion-referenced testing
RIGOR OF ASSESSMENT IN NORM-REFERENCED TESTING: SUBTOPIC = VALIDITY
ASSESSING VALIDITY IN NORM-REFERENCED TESTING
Definition of and evidence for validity
  Extent to which a procedure actually measures what it is supposed to measure
  Defined relative to a specific purpose
    E.g. valid for screening, but not valid for Tx planning
  Issue of the quality and extent of available evidence
    Logical analysis
    Empirical data
TYPES OF VALIDITY (H&P, 2012)
Construct validity
  “Degree to which a test measures the theoretical construct it is intended to measure”
TYPES OF VALIDITY (H&P, 2012)
Content validity
  Degree to which the content of a test is consistent with the purpose of the test
  --appropriateness of items
  --completeness of the item sample
  --the way in which the items assess the content
  Cf. face validity, which has the surface appearance of content validity
TYPES OF VALIDITY (H&P, 2012)
Criterion-related validity
  Degree to which test performance predicts performance on other (external) criteria
  --subtype = predictive
    Ability to predict score on a future test in a related area
  --subtype = concurrent
    Compared to present performance on other tests in a related area
SOURCES OF EVIDENCE OF VALIDITY (HUTCHINSON, 1996)
Evidence used to support the argument that a test is valid for its stated purpose
First source category = Logical evidence
  Test’s purpose well stated
  Construct (theory/framework) well defined
  Good rationale for content of the test, which includes documentation that both easy and hard test items have been included, to discriminate disorder
Key concept: Are the test authors’ logically-based arguments convincing?
SOURCES OF EVIDENCE OF VALIDITY (HUTCHINSON, 1996)
Evidence used to support the argument that a test is valid for its stated purpose
Second source category = Empirical evidence
  Correlation (r), a measure of relationship between ____________________ and _____________________
  Good prediction of group membership with measures of __________________ and _____________________
  Pattern of relationship among sub-test results should match the pattern predicted by the construct
    Via correlation
    Via factor analysis
Key concept: Are the test authors’ empirically-based arguments convincing?
Empirical evidence for validity, using correlation…
Measure of relationship between _____________ and ____________
What are the labels on the axes when one uses correlation as evidence for validity?
Empirical evidence for validity, using correlation…
Measure of relationship between _____________ and ____________
Is the test authors’ empirical argument convincing?
  What evidence is given to describe the relationship between the test of interest and others considered to be similar?
  Note that valid tests should also have low correlations with tests measuring different parameters
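The two checks on this slide can be sketched in Python. This is a minimal illustration only: the three score lists and their names are hypothetical (made up for the example, not taken from any test manual). It shows the pattern we hope to see, namely a high r with a test considered similar and a low r with a test measuring a different parameter.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient (r) between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same eight clients (illustration only, not real data).
new_test = [12, 15, 19, 22, 25, 28, 31, 35]      # the test being evaluated
similar_test = [40, 44, 47, 55, 58, 60, 66, 71]  # established test of a similar construct
different_test = [7, 3, 9, 4, 8, 5, 6, 7]        # test of a different parameter

r_similar = pearson_r(new_test, similar_test)
r_different = pearson_r(new_test, different_test)

print(f"r with similar test:   {r_similar:.2f}")
print(f"r with different test: {r_different:.2f}")
```

In a scatterplot of either comparison, one axis would carry the scores on the test of interest and the other axis the scores on the comparison test.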
Empirical evidence for validity, using measures of sensitivity and specificity…
Sensitivity--the test’s accuracy in correctly identifying the clients WITH the disorder
Specificity--the test’s accuracy in correctly identifying the clients WITHOUT the disorder
Empirical evidence for validity, using measures of sensitivity and specificity… Let’s “visualize” these concepts
Empirical evidence for validity, using measures of sensitivity and specificity…
In the test manual, we’re looking for reports of high specificity and high sensitivity.
Is the test authors’ empirical argument convincing?
  What evidence is given to support the accuracy of this test in classifying subjects into already-established performance categories?
Do you see how this type of evidence for validity is directly related to the purpose of norm-referenced tests?
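As a rough sketch of how these two values could be computed, the Python below compares each client’s already-established category with the test’s classification. The sample of 20 clients and the function name are hypothetical, invented for illustration.

```python
def sensitivity_specificity(has_disorder, test_positive):
    """Return (sensitivity, specificity) from two parallel lists of booleans.

    has_disorder:  True if the client is already known to have the disorder
    test_positive: True if the test classified the client as having the disorder
    """
    pairs = list(zip(has_disorder, test_positive))
    true_pos = sum(1 for d, t in pairs if d and t)
    false_neg = sum(1 for d, t in pairs if d and not t)
    true_neg = sum(1 for d, t in pairs if not d and not t)
    false_pos = sum(1 for d, t in pairs if not d and t)
    sensitivity = true_pos / (true_pos + false_neg)  # accuracy for clients WITH the disorder
    specificity = true_neg / (true_neg + false_pos)  # accuracy for clients WITHOUT the disorder
    return sensitivity, specificity

# Hypothetical sample: 10 clients with the disorder, then 10 without (illustration only).
has_disorder = [True] * 10 + [False] * 10
# The test flags 9 of the 10 with the disorder, and wrongly flags 2 of the 10 without it.
test_positive = [True] * 9 + [False] * 1 + [False] * 8 + [True] * 2

sens, spec = sensitivity_specificity(has_disorder, test_positive)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```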
Empirical evidence for validity, using patterns of correlations among subtests, to see if the patterns fit what the construct would predict (construct in this example = what makes up writing ability?)
Is the test authors’ empirical argument convincing?
  What statistical data support the relationship among separate components of the test or their relationship with the overall construct?
Empirical evidence for validity, using factor analysis of sub-test scores, e.g. to see if patterns of factor loadings follow what the construct of writing ability would predict
  I: “Writer’s development of the work”
  II: “Writer’s fluency with mechanics”
  III: “Sentence structure”
  IV: “Writer’s orientation to the reader”
Is the test authors’ empirical argument convincing?
RIGOR OF ASSESSMENT IN NORM-REFERENCED TESTING: SUBTOPIC = RELIABILITY
Remember…
Reliability
  Consistency of response/performance elicitation (includes consistency of scoring and measurement)
TYPES OF RELIABILITY, AND EVIDENCE FOR THEM
Agreement OR Inter-rater reliability
  Correlation of scores of two raters (good = )*
  Item by item or total score
Stability OR Test-retest reliability
  Correlation of scores from two separate test administrations with the same person, across test-takers (good = )*
(continued…)
Can you see why the authors should optimally provide reliability scores for:
  1) each age group separately?
  2) both normal and disordered groups?
TYPES OF RELIABILITY, AND EVIDENCE FOR THEM (CONT.)
Internal consistency OR split-half reliability
  Split the test into two halves and obtain the correlation between the two sets: Measured as r
    E.g. split top from bottom
    E.g. split even items from odd items
  Test items assigned to two halves through random assignment, and obtain r. Then do this again, and again, and again…
  “Average” all the r’s = Cronbach’s coefficient alpha
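A small Python sketch of both ideas, using hypothetical item scores made up for illustration. The split-half r uses an odd/even split. For coefficient alpha, the sketch does not literally repeat random splits and average them; it substitutes the usual variance-based formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. Treat the “averaging” description on the slide as an informal way of thinking about what alpha summarizes.

```python
from math import sqrt
from statistics import pvariance

def pearson_r(x, y):
    """Pearson correlation coefficient (r) between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical item scores (0-3 points each), illustration only:
# one row per test-taker, one column per test item.
item_scores = [
    [3, 3, 2, 3],
    [2, 2, 2, 1],
    [1, 0, 1, 1],
    [3, 2, 3, 3],
    [0, 1, 0, 1],
    [2, 1, 2, 2],
]

# Split-half: each person's total on items 1 and 3 versus items 2 and 4.
odd_half = [row[0] + row[2] for row in item_scores]
even_half = [row[1] + row[3] for row in item_scores]
r_split_half = pearson_r(odd_half, even_half)

# Coefficient alpha via the variance-based formula (a substitution, see lead-in).
k = len(item_scores[0])
item_variances = [pvariance([row[i] for row in item_scores]) for i in range(k)]
total_variance = pvariance([sum(row) for row in item_scores])
alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

print(f"odd/even split-half r = {r_split_half:.2f}")
print(f"coefficient alpha     = {alpha:.2f}")
```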
Empirical evidence for reliability, using patterns of correlations…
What are the labels on the axes when one uses correlation as evidence for
  --inter-rater reliability?
  --test/retest reliability?
  --split-half reliability?
Transition slide from topic of reliability to topic of Standard Error of Measurement (SEM)
Think: Even when a test is very carefully designed and reliable (consistent) in its ability to measure a construct (e.g. narrative comprehension), a client’s responses to test items may not always reflect a true picture of his underlying ability (e.g. his true ability to understand narrative passages).
Error in measurement cannot be avoided, especially when measuring human performance.
Even with the most reliable test, what are some of the other factors that affect a client’s performance on a test, on a given day?
Observed score = the actual raw score that a test-taker earns
True score = hypothetical “ideal” score that the person would have earned if there were no error in measurement
STANDARD ERROR OF MEASUREMENT (SEM)
If a person took a test 100 times, their scores:
  1) would tend to fall near some central score (represented by a measure of central tendency, such as the average), e.g. 42
  2) would deviate from the central score (due to error of measurement) in a predictable way, with most of them not too far from the center
The “average deviation” (or “average distance”) from the central score is known as the standard deviation, e.g. 2.
This standard deviation (“average deviation”) due to error of measurement is called the standard error of measurement (SEM), e.g. 2 away from 42 (either above or below)
[Figure: distribution of the person’s scores. Horizontal axis: Score (values left blank). Vertical axis: Number of times the person earned the score (few to many).]
Can you fill in the values that would be two SEM away from the average?
STANDARD ERROR OF MEASUREMENT (SEM)
Now, test-makers don’t really calculate SEM by giving people a test 100 times! They calculate SEM using:
  1) estimates of the test’s reliability (at least one of the three types)
  2) the distribution of scores earned by the normative sample
  3) the way in which reliability varies at different score levels
SO, clinicians don’t calculate SEM. SEM is provided in the test manual to help guide us in our interpretation of a client’s score.
[Figure: distribution of the person’s scores. Horizontal axis: Score (values left blank). Vertical axis: Number of times the person earned the score (few to many).]
Can you fill in the values that would be two SEM away from the average?
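To make ingredients 1) and 2) concrete, here is a minimal Python sketch of one widely taught classical formula: SEM = SD × sqrt(1 − reliability), where SD is the standard deviation of the normative sample’s scores. This is an illustration under assumptions. The values below are hypothetical, the formula ignores ingredient 3), and a given test manual may derive its SEM differently.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """Classical formula: SEM = SD * sqrt(1 - reliability coefficient)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * sqrt(1 - reliability)

# Hypothetical values (not from any real test manual):
normative_sd = 15   # standard deviation of scores in the normative sample
reliability = 0.91  # e.g. a reported test-retest reliability coefficient

sem = standard_error_of_measurement(normative_sd, reliability)
print(f"SEM = {sem:.1f}")
```

Note the direction of the relationship: the higher the reliability, the smaller the SEM. With perfect reliability (1.0) the SEM would be 0.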
STANDARD ERROR OF MEASUREMENT (SEM)
68% of the scores would be predicted to fall within one SEM of the average
  e.g. we could predict that 68/100 would fall between 40 and 44
95% of the scores would be predicted to fall within two SEMs of the average
  e.g. we could predict that 95/100 would fall between ____ and ____
[Figure: distribution of the person’s scores. Horizontal axis: Score (values left blank). Vertical axis: Number of times the person earned the score (few to many).]
SEM AND ITS RELATIONSHIP TO CONFIDENCE INTERVALS (See Hutchinson and H&P readings)
Observed score
  The actual raw score that the test-taker earns
True score
  The score that the person would have earned if there were no measurement error
SEM AND ITS RELATIONSHIP TO CONFIDENCE INTERVALS (See Hutchinson and H&P readings)
+1 SEM to -1 SEM = 68% confidence interval.
  We can have 68% confidence that the client’s true score would fall somewhere in this range
+2 SEM to -2 SEM = 95% confidence interval.
  We can have 95% confidence that the client’s true score would fall somewhere in this range
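The arithmetic behind these intervals can be sketched in Python. The client’s observed score and the SEM below are hypothetical values chosen for illustration. In practice the SEM comes from the test manual.

```python
def confidence_interval(observed_score, sem, n_sems):
    """Return (low, high): the observed score minus and plus n_sems SEMs."""
    margin = n_sems * sem
    return observed_score - margin, observed_score + margin

# Hypothetical client and test (illustration only):
observed_score = 85  # the score the client actually earned
sem = 3              # SEM as it might be reported in a test manual

ci_68 = confidence_interval(observed_score, sem, 1)  # -1 SEM to +1 SEM
ci_95 = confidence_interval(observed_score, sem, 2)  # -2 SEM to +2 SEM

print(f"68% confidence interval: {ci_68[0]} to {ci_68[1]}")
print(f"95% confidence interval: {ci_95[0]} to {ci_95[1]}")
```

Notice the trade-off: the 95% interval gives more confidence but covers a wider range of scores than the 68% interval.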
INTERPRETATION OF CONFIDENCE INTERVAL RELATIVE TO CUT-OFF SCORE
How do we interpret performance when the confidence interval:
  a) is completely above the cut-off score?
  b) is completely below the cut-off score?
  c) straddles the cut-off score?
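The three cases a) to c) can be told apart mechanically, as in the Python sketch below. It only reports where the interval sits; the clinical interpretation of each case is still the question posed on the slide. The cut-off and interval values are hypothetical, and treating an interval that just touches the cut-off as “straddles” is an assumption of this sketch.

```python
def position_relative_to_cutoff(low, high, cutoff):
    """Report where a confidence interval (low, high) sits relative to a cut-off score."""
    if low > cutoff:
        return "completely above"
    if high < cutoff:
        return "completely below"
    # Assumption: an interval that touches or contains the cut-off counts as straddling it.
    return "straddles"

cutoff = 80  # hypothetical cut-off score (illustration only)

case_a = position_relative_to_cutoff(82, 88, cutoff)
case_b = position_relative_to_cutoff(70, 76, cutoff)
case_c = position_relative_to_cutoff(79, 91, cutoff)

print("a)", case_a)
print("b)", case_b)
print("c)", case_c)
```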
LECTURE 06B ENDS HERE