Unit 5: Improving and Assessing the Quality of Behavioral Measurement PS 522: Behavioral Measures and Interpretation of Data Lisa R. Jackson, Ph.D.
Indicators of Trustworthy Measurement: Validity
- Directly measures a socially significant behavior
- Measures a dimension of the behavior relevant to the question
- Helps you determine that you are testing what you think you are testing
Indicators of Trustworthy Measurement: Accuracy and Reliability
- Accuracy: observed values match the true values of an event
- Reliability: measurement yields the same values across repeated measurement of the same event
  - A measure of consistency and stability
  - Does it reliably produce the same results over time?
  - Is your assessment reliable within itself?
  - If you use different versions, are they all equally reliable?
Types of Validity: Face Validity
- This is all about how the test looks, not what it really measures
- Does the test appear to measure what the test items suggest it will?
- High face validity can increase test-takers' confidence, interest, and motivation
- Low face validity can disguise the real purpose of the test, reducing self-report bias
  - Useful for measuring something people don't want to admit to
- It is the least scientifically rigorous form of validity
Types of Validity: Content Validity
- How well the test samples behavior representative of the whole universe of behavior it is designed to sample
- Ex: A test of depression needs to measure thoughts, emotions, motivation, and behavior, not just mood
- Ex: An assessment for autism cannot assess only language skills
- Affected by who writes the test
Types of Validity: Criterion-Related Validity
- An external measure: how well does the test measure up against another standard?
- How well can a score on this test be used to infer an individual's likely standing on the criterion?
- Concurrent: how well does a score on this test correlate with an individual's standing on a criterion of interest in the present?
  - Ex: If you score low on one depression test, do you also score low on the Beck Depression Inventory?
- Predictive: how well does a score on this test correlate with an individual's standing on a criterion of interest in the future?
  - Ex: Does the SAT predict college achievement?
Types of Validity: Construct Validity
- How appropriate are conclusions drawn from test scores about an individual's standing on a construct?
- Does this test really assess the construct? This takes time to determine.
- Ex: Over time, the Beck Depression Inventory has acquired construct validity; most experts agree that it tests for depression.
Threats to Measurement Validity
- Indirect measurement: measuring a behavior other than the behavior of interest
  - Ex: Using children's responses to a questionnaire as a measure of how often and how well they get along with their classmates
- Measuring a dimension that is irrelevant or ill suited to the reason for measuring behavior
  - Ex: Using a ruler in a pot of water to measure temperature
  - Ex: Trying to measure reading endurance by counting the number of correct and incorrect words read aloud, but not recording how long the student read
Types of Reliability: Internal Consistency
- This one is really about the questions, not the test as a whole
- The goal is to see whether the questions are consistent with one another
- Do the test items reliably produce the same results?
- Do all of the questions measure the same idea?
Types of Reliability: Split-Half
- Divide one test into two equal halves
- Administer the full test once, then score the two halves separately for each test-taker
- The goal is to judge the internal consistency of the test
- Are the two halves strongly correlated with each other? If they are, the test is reliable.
- One approach is odd/even reliability: questions 1, 3, 5, 7 form one half; questions 2, 4, 6, 8 the other
- Use when it is impractical to create two forms or schedule two administrations; saves time and expense
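As an illustrative sketch (the function names and item scores below are hypothetical, not from the slides), the odd/even split can be computed by correlating each person's odd-item total with their even-item total; the standard Spearman-Brown correction then estimates full-test reliability from the half-test correlation:

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """Correlate odd-item totals with even-item totals across test-takers,
    then apply the Spearman-Brown correction to estimate the reliability
    of the full-length test."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return (2 * r_half) / (1 + r_half)              # Spearman-Brown
```

Each row of `item_scores` is one test-taker's item results (1 = correct, 0 = incorrect). The Spearman-Brown step is needed because correlating two half-length tests underestimates the reliability of the full test.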
Types of Reliability: Test-Retest
- Use the same instrument to measure the same thing at two different points in time
- The goal is to determine the consistency of the test over time
- Does it produce similar results each time? If so, the test is reliable.
- Good for a stable construct such as personality
- Won't work to assess a reading test if subjects improve in reading between administrations
Types of Reliability: Parallel and Alternate Forms
- A make-up test is an example: you might not want to give the same test twice
- Minimizes memory effects when testing the same person more than once
  - The person won't have memorized answers from one testing to the next
- The forms must be equivalent: each form's mean and difficulty must match the original
- Can be time consuming and expensive to develop
Assessing the Reliability of Measurement
- Measurement is reliable when it yields the same values across repeated measures of the same event
- Not the same as accuracy
- Reliable application of the measurement system is important
- Requires permanent products for re-measurement
- Low reliability signals suspect data
Using Interobserver Agreement (IOA) to Assess Behavioral Measurement
- The degree to which two or more independent observers report the same values for the same events
- Uses:
  - Determine the competence of new observers
  - Detect observer drift
  - Judge the clarity of definitions and the measurement system
  - Increase the believability of data
Requisites for IOA
Observers must:
- Use the same observation code and measurement system
- Observe and measure the same participants and events
- Observe and record independently of one another
Methods for Calculating IOA: Event Recording
- Percentage of agreement is the most common way to calculate IOA
- Event recording methods compare:
  - Total count recorded by each observer
  - Mean count-per-interval
  - Exact count-per-interval
  - Trial-by-trial
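To make the first and third event-recording methods concrete, here is a minimal sketch (function names and sample counts are mine): total count IOA divides the smaller session total by the larger, and exact count-per-interval IOA is the percentage of intervals in which both observers recorded exactly the same count:

```python
def total_count_ioa(count_a, count_b):
    """Total count IOA: smaller total / larger total * 100."""
    if count_a == count_b:
        return 100.0  # also covers the 0/0 case
    return min(count_a, count_b) / max(count_a, count_b) * 100

def exact_count_per_interval_ioa(counts_a, counts_b):
    """Percent of intervals in which both observers recorded the same count."""
    agreements = sum(a == b for a, b in zip(counts_a, counts_b))
    return agreements / len(counts_a) * 100

# Observer A counts 45 responses in a session; observer B counts 50:
print(total_count_ioa(45, 50))                                   # 90.0
# Interval-by-interval counts from each observer:
print(exact_count_per_interval_ioa([2, 0, 3, 1], [2, 1, 3, 1]))  # 75.0
```

Exact count-per-interval is the more conservative of the two, since two observers can reach identical session totals while disagreeing interval by interval.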
Methods for Calculating IOA: Timing and Interval Recording
- Timing recording methods:
  - Total duration IOA
  - Mean duration-per-occurrence IOA
  - Latency-per-response IOA
  - Mean IRT-per-response IOA
- Interval recording and time sampling:
  - Interval-by-interval IOA (point-by-point)
  - Scored-interval IOA
  - Unscored-interval IOA
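The three interval-recording calculations can be sketched as follows, assuming each observer's record is a list of 1s (behavior scored in that interval) and 0s (not scored); the function names are mine:

```python
def interval_by_interval_ioa(a, b):
    """Intervals where the observers agree (both scored or both unscored),
    divided by total intervals, times 100."""
    return sum(x == y for x, y in zip(a, b)) / len(a) * 100

def scored_interval_ioa(a, b):
    """Use only intervals where at least one observer scored the behavior;
    agreement means both scored it. Guards against inflated agreement
    for low-rate behaviors."""
    relevant = [(x, y) for x, y in zip(a, b) if x or y]
    return sum(x and y for x, y in relevant) / len(relevant) * 100

def unscored_interval_ioa(a, b):
    """Use only intervals where at least one observer did NOT score the
    behavior; agreement means neither scored it. Guards against inflated
    agreement for high-rate behaviors."""
    relevant = [(x, y) for x, y in zip(a, b) if not (x and y)]
    return sum(not x and not y for x, y in relevant) / len(relevant) * 100
```

For example, with records `[1, 1, 0, 0, 1, 0]` and `[1, 0, 0, 0, 1, 1]`, interval-by-interval IOA is 4/6 (about 66.7%), while scored-interval and unscored-interval IOA are each 2/4 (50%), showing how the stricter variants discount chance agreement.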
Considerations in IOA
- Obtain and report IOA at the same levels at which results will be reported and discussed in the study:
  - For each behavior
  - For each participant
  - In each phase of intervention or baseline
Considerations in IOA
- Believability of data increases as agreement approaches 100%
- There is a history of using 80% agreement as an acceptable benchmark
- The appropriate standard depends on the complexity of the measurement system
Considerations in IOA: Reporting IOA
- Narrative form
- Table
- Graphs
- In all formats, report how, when, and how often IOA was assessed
Assessing the Accuracy and Reliability of Behavioral Measurement
- First, design a good measurement system
- Second, train observers carefully
- Third, evaluate the extent to which the data are accurate and reliable: measure the measurement system
- Accuracy means the observed values match the true values of an event
- You can't base research conclusions or treatment decisions on faulty data
Assessing the Accuracy of Measurement
Four purposes of accuracy assessment:
- Determine whether the data are good enough to make decisions
- Discover and correct measurement errors
- Reveal consistent patterns of measurement error
- Assure consumers that the data are accurate
Accuracy Assessment Procedures
- Measurement is accurate when observed values match true values
- Accuracy is determined by calculating the correspondence of each data point with its true value
- The process for determining true values must differ from the measurement procedures themselves
- Accuracy assessment should be reported in research
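The point-by-point correspondence check can be sketched like this (a hypothetical function with made-up data; the tolerance parameter is my assumption, for timed measures where exact matches are unrealistic):

```python
def accuracy_percentage(observed, true_values, tolerance=0):
    """Percent of observed data points that match their true values,
    optionally within a tolerance (e.g., +/- 1 s for duration measures)."""
    matches = sum(abs(o - t) <= tolerance
                  for o, t in zip(observed, true_values))
    return matches / len(true_values) * 100

# An observer's session counts vs. true values taken from a permanent product:
print(accuracy_percentage([10, 12, 8, 9], [10, 11, 8, 9]))  # 75.0
```

Note that the true values here must come from an independent process (such as re-scoring a permanent product), not from re-running the same measurement procedure, per the slide above.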
Threats to Measurement Accuracy and Reliability
- Inadequate observer training; training should be:
  - Explicit and systematic
  - Preceded by careful observer selection
  - Continued until observers meet a competency standard
  - Ongoing, to minimize observer drift
Threats to Measurement Accuracy and Reliability
- Unintended influences on observers:
  - Observer expectations of what the data should look like
  - Observer reactivity when they are aware that others are evaluating the data
  - Measurement bias
  - Feedback to observers about how their data relate to the goals of intervention
Final Points
- The data you gather are used to help improve the lives of real people
- Others may use your data as the basis of intervention
- Measurements need to be accurate, consistent, and relevant
Questions?
- Thanks for participating! I am sure you have been asking questions here in seminar! Great job!
- If you have more, email me: Ljackson2@kaplan.edu
- These slides are posted in the Doc Sharing area for your review.