Reliability.

Reliability - meanings
Everyday uses: A reliable machine starts and runs continuously after we push the ON button. A reliable employee arrives on time and is rarely absent. A reliable source provides accurate information. A reliable car dealer has been in business for many years and gives good customer service.

Reliability – psychological testing
In testing theory reliability means: Replicability – the score can be replicated. Consistency – the same construct is assessed throughout the test. Reliability is a feature of all measurement tools.

Intuitive perspective
measuring rule scales

True Score Theory
Frameworks of test reliability: classical test theory (CTT), item response theory (IRT), generalizability theory. Most popular: CTT. IRT is gaining popularity.

Classical test theory

Observed score (X)
The observed score (X) is a person's actual score on a test, e.g. 15 correct answers out of 20 on an exam. It may be affected by many factors, either positive or negative. Examples?

True score (T)
The true score is the score a person would get if all sources of unreliability were removed or cancelled out. It is like the average score over many (in theory, infinitely many) administrations of the test. Different conditions may introduce some unreliability. In practice the true score is unobservable.

Error score (E)
The error score is the difference between the observed score and the true score. E may add something to T or subtract from it: X = T + E. Transformations: T = X - E, E = X - T.

Error score
Error has an unsystematic influence on the observed score. Therefore it is random, which means that if we tested someone an infinite number of times: all possible errors would be normally distributed with mean = 0; E would not be in a systematic relationship with T (their correlation = 0); the correlation between two errors = 0.

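Variance of true score
When we consider groups of scores, the observed-score variance is the sum of the true-score variance and the error variance: $\sigma_X^2 = \sigma_T^2 + \sigma_E^2$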

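Definition of reliability
Using the symbols, we define the reliability of a test as $r_{XX} = \sigma_T^2 / \sigma_X^2$, where $\sigma_T^2$ is the true-score variance and $\sigma_X^2$ is the observed-score variance.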

Definition of reliability
Reliability is the proportion of observed score variance that is true variance.
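
The definition can be illustrated with a small simulation. This is only a sketch under assumed values that are not in the slides: true scores and errors are drawn from normal distributions with standard deviations of 10 and 5, so the expected reliability is 100 / (100 + 25) = 0.8.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

N = 20000        # number of simulated examinees (assumption)
TRUE_SD = 10.0   # assumed spread of true scores
ERROR_SD = 5.0   # assumed spread of random error

# Classical test theory: observed score X = true score T + random error E.
true_scores = [random.gauss(50.0, TRUE_SD) for _ in range(N)]
errors = [random.gauss(0.0, ERROR_SD) for _ in range(N)]
observed = [t + e for t, e in zip(true_scores, errors)]

var_t = statistics.pvariance(true_scores)
var_e = statistics.pvariance(errors)
var_x = statistics.pvariance(observed)

# Reliability = proportion of observed variance that is true variance.
reliability = var_t / var_x
theoretical = TRUE_SD**2 / (TRUE_SD**2 + ERROR_SD**2)

print(f"var(X) = {var_x:.1f}, var(T) + var(E) = {var_t + var_e:.1f}")
print(f"simulated reliability = {reliability:.3f}, theoretical = {theoretical:.3f}")
```

With real test data the true scores are unobservable, so this ratio cannot be computed directly; it has to be estimated from observed scores.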

[Figure: total observed variance divided into true variance and error variance.]
Two cases: true variance represents about half of the observed variance; error variance is relatively small.

Link to empirical studies
Alternatively, we can think about reliability in terms of the stability of a score over time and across different conditions. We call a test reliable if the score obtained by an individual is repeatable: regardless of the situation, he/she will always obtain the same score.

Definition with empirical link
Reliability – how stable the score of a test is when we repeat the measurement. In practice there are many methods of determining reliability. All adopt the above definition.

Methods of assessing reliability
Test-retest reliability; alternate form reliability; internal consistency (split-half reliability, Kuder-Richardson formulas, Cronbach’s alpha); inter-rater reliability.

Methods of assessing reliability
The analyses differ for different reliability coefficients. Each method provides different information about the test. It is recommended to use at least two methods.

Test-retest reliability
Administering the same test to the same individuals on two separate occasions. The two occasions are a week to a few months apart. The reliability coefficient is the correlation between the two sets of scores. The higher the correlation, the more reliable the test.
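
As a minimal sketch, the test-retest coefficient is the Pearson correlation between the two administrations. The scores below are hypothetical and only illustrate the calculation; the helper name `pearson_r` is also an assumption of this sketch.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores of six examinees on two occasions.
first_occasion = [12, 15, 9, 18, 14, 11]
second_occasion = [13, 14, 10, 17, 15, 10]

r_test_retest = pearson_r(first_occasion, second_occasion)
print(f"test-retest reliability = {r_test_retest:.2f}")
```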

Sources of unreliability
Changes in personal conditions: fatigue, emotional states, learning, motivation. Stability of a test across various situations: testing conditions, climate changes. Stability of a trait in time.

Stability in time
How long should the time between two measurements be? The content should be forgotten. The construct shouldn’t change. It depends on the test and the underlying construct. The means should be controlled.

Example
[Table: scores of each person on A1, A2, B1 and B2, with a row of means; the cell values are not recoverable.]
Mean = ? Correlation: rA1A2 = ?, rB1B2 = ?

Example
[Table: scores of each person on A1, A2, B1 and B2, with a row of means; the cell values are not recoverable.]
Correlation: rA1A2 = 1, rB1B2 = 1
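
The advice that the means should be controlled can be illustrated with made-up numbers (not the values from the example table, which are not available): if every person gains the same number of points on the second occasion, the correlation is still 1 even though the scores have changed.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical data: everyone scores exactly 5 points higher at retest.
first = [10, 12, 14, 16, 18]
second = [score + 5 for score in first]

r = pearson_r(first, second)
mean_shift = mean(second) - mean(first)

print(f"correlation = {r:.2f}")      # rank order and spacing are preserved
print(f"mean shift  = {mean_shift}")  # but the scores themselves moved
```

The correlation alone does not reveal the shift, so the means of the two administrations need to be compared as well.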

Stability - problems
Children change fast. Intelligence, knowledge: subjects can learn. Trait vs state. It is difficult to test the same people twice. It doesn’t say anything about the content. The first measurement affects the second.

Test-retest - practice
Types of tests it is used for: trait assessment, personality questionnaires, cognitive performance (excluding knowledge of facts), preferences and attitudes.

Alternate form reliability
Avoids the problem of learning and remembering content. Requires two forms of the test. The two forms should be similar in terms of: number of items, time limits, content specification, instructions.

Alternate form reliability
The two forms should be similar in terms of statistics: equal means, equal SDs, equal intercorrelations. They should correlate equally with any external variable.

Administration
The same group of examinees completes both forms. Usually there is some time between the two administrations. The time depends on the test and the construct (as with test-retest). The correlation between the forms is the reliability coefficient.

Sources of unreliability
Stability in time. Sensitivity to contextual factors (e.g. testing conditions). Item content: similarity between the two forms.

Where to use
All tools that measure traits, e.g. personality questionnaires, intelligence tests, knowledge of facts.

Alternate forms - problems
In practice – rarely used. It is difficult to construct two tests. Difficult to find parallel items with very similar content but differently expressed.

Internal consistency
One of the most frequently used approaches. Several methods: split-half reliability, Kuder-Richardson formulas, coefficient alpha.

Split-half reliability
How to avoid two measurements? Measure only once. How is the assumption about replicability of a score violated?

It is as if the two alternate forms were administered in immediate succession. Split-half means that we administer one test. After we collect the tests, we split the items into halves. Then we treat the halves as alternate forms.

How to split a test into halves? It depends on the content of the test. We have a few possibilities: randomly chosen items; odd-even items; a split that respects item content and item statistics.

Random items
Used if we can assume that all test items are equal in terms of content and statistics. E.g. many personality questionnaires: we assume that items are equally important. Then it doesn’t matter how we split the items.

Random choice - example
1. Worry about things. 2. Fear for the worst. 3. Am afraid of many things. 4. Get stressed out easily. 5. Get caught up in my problems. 6. Feel threatened easily.

Split-half: odd-even E.g. some intelligence tests may have increasing difficulty of items. Examinees may be fatigued toward the end of the test. Timing effect affects the second half. Simple split into halves or random selection is not possible.

Split-half: odd-even
The two halves are balanced, e.g. they contain items of similar difficulty.
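
A minimal sketch of an odd-even split, assuming a hypothetical matrix of right/wrong (1/0) item scores with items ordered by increasing difficulty. The data and the helper name `pearson_r` are illustrative assumptions, not part of the slides.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Rows = examinees, columns = items 1..6 (hypothetical 0/1 scores).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]

# Items 1, 3, 5 form one half; items 2, 4, 6 form the other,
# so easy and hard items are spread over both halves.
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

r_halves = pearson_r(odd_half, even_half)
print(f"odd-half scores:  {odd_half}")
print(f"even-half scores: {even_half}")
print(f"correlation between halves = {r_halves:.2f}")
```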

Split-half: item content
Sometimes groups of items are distributed throughout the test. We need to balance the halves with items from the different groups.

Item content - example 1. Feel comfortable around people. 2. Love excitement. 3. Seek adventure. 4. Love action. 5. Make friends easily. 6. Willing to try anything once. 7. Am skilled in handling social situations. 8. Am the life of the party.

Formula Simple correlation between the two halves doesn’t give reliability of the full-length test. It gives the reliability only of one half. The correction must be applied: Spearman-Brown formula.

Spearman-Brown formula
$r_{tt} = \frac{2 r_{hh}}{1 + r_{hh}}$
$r_{tt}$ - reliability of the entire test; $r_{hh}$ - correlation between the two halves

Example
Correlation between halves = 0.5
Reliability of the entire test = 2 x 0.5 / (1 + 0.5) = 1 / 1.5 = 0.67

Example
Correlation between halves = 0.9
Reliability of the entire test = 2 x 0.9 / (1 + 0.9) = 1.8 / 1.9 = 0.95

Conclusions
For correlations between 0 and 1, the Spearman-Brown result is always higher than the initial correlation. The higher the correlation between the halves, the higher the Spearman-Brown result.
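
The correction is easy to check numerically. The sketch below implements the two-halves form of the Spearman-Brown formula, 2r / (1 + r); the function name `spearman_brown` is chosen here for illustration.

```python
def spearman_brown(r_halves: float) -> float:
    """Estimate full-length test reliability from the correlation between two halves."""
    return 2 * r_halves / (1 + r_halves)

for r in (0.5, 0.9):
    print(f"r between halves = {r}: full-test reliability = {spearman_brown(r):.2f}")
# prints 0.67 for r = 0.5 and 0.95 for r = 0.9
```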

Split-half summary
The source of error: inconsistency between the halves, i.e. different content of the two halves. A good method for tests with increasing difficulty, like many intelligence tests.

Internal consistency – other methods
Problem with split-half: the result depends on how we split the test. Several methods have been proposed that avoid this problem. They give a result equal to the mean of the coefficients from all possible split halves of a test.

Cronbach’s alpha
Very widely used. Assumes that all items are equal in terms of: the underlying theory (all items are good representations of the construct) and statistics (e.g. there are no big differences in difficulty). A good fit for most personality tests.

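Formula
$\alpha = \frac{k}{k - 1} \left( 1 - \frac{\sum \sigma_i^2}{\sigma_X^2} \right)$
$k$ - number of items; $\sigma_X^2$ - variance of the total scores; $\sum \sigma_i^2$ - sum of the variances of each item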

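Example

Number of items = 4, variance of the total scores = 25, sum of the item variances = 10

Result
alpha = 4/3 x (1 - 10/25) = 4/3 x 0.6 = 0.80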
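
A minimal sketch of the alpha computation, reading the example values as 4 items, a total-score variance of 25 and a sum of item variances of 10. The function names and the small response matrix are assumptions made for illustration.

```python
import statistics

def alpha_from_variances(k, total_variance, item_variance_sum):
    """Cronbach's alpha from the number of items and the two variance terms."""
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

def cronbach_alpha(responses):
    """Cronbach's alpha from a matrix with rows = examinees, columns = items."""
    k = len(responses[0])
    item_variances = [statistics.variance(column) for column in zip(*responses)]
    total_variance = statistics.variance([sum(row) for row in responses])
    return alpha_from_variances(k, total_variance, sum(item_variances))

# Values read from the example: k = 4, total variance = 25, sum of item variances = 10.
alpha_example = alpha_from_variances(4, 25, 10)
print(f"alpha for the example = {alpha_example:.2f}")  # 4/3 * 0.6 = 0.80

# Hypothetical data: each person answers every item identically,
# so all differences are between people and alpha reaches 1.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(f"alpha for perfectly consistent answers = {cronbach_alpha(consistent):.2f}")
```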

Meaning… The result says to what extent all items represent the same construct. In other words, it describes how homogeneous or heterogeneous the test is. It asks the question: is there one trait/factor underlying the test?

What determines alpha…
alpha = ?

The greater the differences between subjects, the higher alpha. alpha = 1

alpha = ?

The greater the differences between items, the lower alpha. alpha = 0

Number of items. Mean inter-item correlation.
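
The slide's own formula is not available here. One common way to express the dependence on these two quantities is the standardized alpha, k x r / (1 + (k - 1) x r), where k is the number of items and r is the mean inter-item correlation. The sketch below assumes that this is the intended relationship.

```python
def standardized_alpha(k: int, mean_r: float) -> float:
    """Standardized alpha from the number of items and the mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Alpha rises with the number of items at a fixed mean inter-item correlation ...
for k in (5, 10, 20):
    print(f"k = {k:2d}, mean r = 0.3: alpha = {standardized_alpha(k, 0.3):.2f}")

# ... and with the mean inter-item correlation at a fixed number of items.
for mean_r in (0.1, 0.3, 0.5):
    print(f"k = 10, mean r = {mean_r}: alpha = {standardized_alpha(10, mean_r):.2f}")
```

With k = 2 this expression reduces to the Spearman-Brown correction for two halves.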

What determines alpha - summary
The differences between people in the sample. Consistency in answering the questions. Number of items. Inter-item correlations. Alpha provides information about the consistency of a test.

Inter-scorer reliability
Assesses how two or more scorers differ in their evaluation of a test. Some tests are difficult to score. The score depends on the scoring instructions supplied with the test, e.g. projective tests.

Draw a house/tree

Inter-scorer reliability
How to evaluate whether projective tests are reliable? We can examine how objective the rating procedure is. A sample of tests is independently scored by two or more examiners. Then the correlations between scorers are computed.

Summing up
Method: sources of error
Test-retest: changes over time, different testing conditions
Alternate-forms: item sampling, changes over time
Split-half: nature of split
Coefficient alpha: item sampling, test heterogeneity
Interscorer: scorer differences
