Technical Adequacy Session One Part Three
Reliability We all have friends; some are reliable and some are not. With your partner, discuss what a reliable friend is. List three qualities you would use to describe one.
Reliability In layman's terms, reliability means being able to depend on a test's results being accurate. If you took the test again, would you get the same score? Many factors affect reliability.
Error in measurement There are two types of error in measurement: systematic error (bias) and random error.
Bias Generally, bias refers to raising a person's score because they were advantaged in some way. The group that was not advantaged is affected negatively by the bias. For example, if boys score better than girls on multiple-choice questions because of the question format, the boys were advantaged and the girls were disadvantaged.
Random error Random error is very different. It is hard to predict whom it affects and by how much. Reliable tests try to eliminate most types of error.
Reliability Coefficient We can measure how reliable a test is with the reliability coefficient. A test free from error has a perfect coefficient of 1.0. A test filled with error has a coefficient of 0. Since every test has some error, we look for a reliability of around .85 or above.
Types of reliability Item reliability; stability; inter-rater reliability (also called interobserver agreement).
Item reliability Item reliability affects how well a test predicts a student's understanding of a body of knowledge. Imagine a study trying to predict how the population of a country or state will vote in the next election. The prediction is only as good as the sample selected: if the sample comes from one area, it will not be representative of the population. The same concept applies to developing a test. Test developers cannot possibly include every item they would need to test, so the more representative the selected items are of the total knowledge, the more reliable the test.
Item reliability Your goal is for the student's performance on the sample items to be the same as if he or she had taken all of the items (if that were possible). The goal of the test is to generalize from the student's performance to what they know of the entire realm of knowledge in that area. When the test overestimates or underestimates their ability, it is unreliable.
Item reliability There are two main approaches to determining item reliability: alternate form reliability and internal consistency.
Item reliability Alternate form reliability: two forms of a test are developed, each from the same knowledge base but with different questions. You then test a large sample: half take one form and half take the other. The two groups should have similar scores. Scores from the two forms are correlated, and the resulting correlation coefficient is the reliability estimate.
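The correlation step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: it assumes every student has a score on both forms, because a correlation needs paired scores, and the scores and the name `pearson_r` are made up for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: six students, each with a score on Form A and Form B.
form_a = [12, 15, 9, 18, 14, 11]
form_b = [13, 14, 10, 17, 15, 10]

alternate_form_reliability = pearson_r(form_a, form_b)
print(f"Alternate form reliability: {alternate_form_reliability:.2f}")
```

A coefficient close to 1.0 suggests the two forms rank students in nearly the same order; a low coefficient suggests the forms are not interchangeable.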
Item reliability Internal consistency: there are many ways to test internal consistency. One popular way is to develop a test that can be split into two parts with a similar level of difficulty. Administer the test and see how the students did. Say the test was split into a first half and a second half: grade half of the class on the first half and the other half of the class on the second half, then compare the scores. The same comparison can also be done for specific items.
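One common way to put a number on a split-half comparison is sketched below. It makes assumptions beyond the slide: each student is scored on both halves so the two half-scores can be correlated, and the Spearman-Brown formula (a standard correction the slide does not mention) is used to estimate the reliability of the full-length test. The item data and function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(item_scores):
    """Return (half-test correlation, Spearman-Brown full-test estimate).

    item_scores holds one row per student; each row lists 1 (correct)
    or 0 (incorrect) for every item, in test order.
    """
    half = len(item_scores[0]) // 2
    first_half = [sum(row[:half]) for row in item_scores]
    second_half = [sum(row[half:]) for row in item_scores]
    r_half = pearson_r(first_half, second_half)
    # Spearman-Brown: estimate reliability of the full-length test.
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full


# Hypothetical data: six students, six items each.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0],
]

r_half, r_full = split_half_reliability(scores)
print(f"Half-test correlation: {r_half:.2f}")
print(f"Estimated full-test reliability: {r_full:.2f}")
```

The correction is applied because each half is only half as long as the real test, and shorter tests tend to be less reliable.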
Stability In many cases, we expect our tests to produce information that will be the same when the student is tested again later. A child who tests as colorblind should still test as colorblind later in life, since the condition is not curable. If not, the test was unreliable because it is unstable.
Stability A test should produce similar results over time. If you give a set of students a test, wait a while, and then readminister it, it should produce similar results. The more similar the results, the more stable and the more reliable the test.
Stability Stability is not judged by changes that follow an intervention. If you test a child and the test shows a weakness in a certain area, and after you provide an intervention the child does better on the next test, that is not considered a weakness in stability.
Inter-rater reliability Inter-rater reliability is also called interobserver reliability. The concept is simple and easy to understand. It is analogous to a piece of music, a book, or a movie: two people listen to, read, or watch the same thing and come away with different opinions. Watch the next clip. What do you think?
Inter-rater reliability Now watch the next clip and count how many people test the mattress. Do people have similar answers?
Inter-rater reliability Inter-rater reliability needs to be developed in several places and can be measured in several ways. Different raters or observers need to be trained on what to watch for and need clear criteria for what counts as a positive incident of the behavior being observed. If you are looking for out-of-seat behavior, is it standing, squirming, leaning over, or being two feet from the desk?
Inter-rater reliability Inter-rater reliability can be measured in several ways: by comparing two people's scores from the same observation, or by doing an item-by-item analysis and comparing the differences.
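An item-by-item comparison between two observers might look like the sketch below. Percent agreement follows directly from the description above; Cohen's kappa is a standard chance-corrected statistic that the slide does not name, included here as one possible choice. The ratings are hypothetical: 1 means the observer recorded out-of-seat behavior during an interval, 0 means they did not.

```python
def percent_agreement(rater_1, rater_2):
    """Share of items on which the two raters gave the same rating."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("need two equal-length, non-empty rating lists")
    matches = sum(a == b for a, b in zip(rater_1, rater_2))
    return matches / len(rater_1)


def cohens_kappa(rater_1, rater_2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_1)
    observed = percent_agreement(rater_1, rater_2)
    categories = set(rater_1) | set(rater_2)
    # Chance agreement: probability both raters pick the same category
    # if each rated independently at their own base rates.
    expected = sum(
        (rater_1.count(c) / n) * (rater_2.count(c) / n) for c in categories
    )
    if expected == 1:
        return 1.0  # Both raters used a single identical category.
    return (observed - expected) / (1 - expected)


# Hypothetical data: ten observation intervals, two observers.
observer_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
observer_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

agreement = percent_agreement(observer_a, observer_b)
kappa = cohens_kappa(observer_a, observer_b)
print(f"Percent agreement: {agreement:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa comes out lower than raw agreement because some matches would happen by chance even if the observers were not watching the same child.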
Standard Error of Measurement
Standard Error of Measurement Imagine you gave a kindergarten student a test on letter-sound recognition. You developed 100 tests of ten items each. After giving the child about ten of these tests, the scores would be about the same. On some tests he would know the sounds and on some he would not, but the average would be accurate. SEM tries to estimate how much the score would vary from test to test if you only gave him one test; remember, it could be a test he scored well on or a test he scored poorly on. It is a similar concept to standard deviation, but related specifically to error.
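The slide does not give a formula, but a widely used one ties the SEM to the test's standard deviation and its reliability coefficient: SEM = SD x sqrt(1 - reliability). The sketch below applies it; the standard deviation of 15 and reliability of .91 are hypothetical values chosen for illustration.

```python
from math import sqrt


def standard_error_of_measurement(standard_deviation, reliability):
    """SEM = SD * sqrt(1 - reliability), with reliability between 0 and 1."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    if standard_deviation < 0:
        raise ValueError("standard deviation cannot be negative")
    return standard_deviation * sqrt(1 - reliability)


# Hypothetical test: scores have SD 15, reliability coefficient .91.
sem = standard_error_of_measurement(15, 0.91)
print(f"SEM: {sem:.1f} points")
```

A perfectly reliable test (coefficient 1.0) has an SEM of 0, and the SEM grows as reliability drops.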
Estimate of True Scores This is more of a concept than a statistical unit. Imagine you take a fifty-question test and you do not know the answers to ten questions. You guess on them and, being a very lucky person, you get eight right. Those eight answers are not really part of your true score. If you are unlucky, you get a lower score.
Confidence Intervals Because true scores are difficult to obtain, the concept of confidence intervals was created. Combined with the SEM, a confidence interval reports a range that is likely to contain the true score. The level of confidence tells us how certain we are that the score lies within the range.
Confidence Intervals If a child has a score of 90 ± 5, where the ± 5 is a margin based on the SEM, then we are saying the child's score is somewhere between 85 and 95. If we say that a child has a score of 90 ± 5 with a 95% confidence level, we are saying that there is only a 5% chance that the child's true score falls outside the range of 85 to 95. The lower the confidence level, the narrower the range: at an 80% confidence level, the range shrinks to roughly 87 to 93 (assuming normally distributed error).
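How the range changes with the confidence level can be sketched as follows. The sketch assumes measurement error is normally distributed, so the margin is a z-value times the SEM; the observed score of 90 and SEM of 3 are hypothetical, and `confidence_interval` is a name chosen for this example.

```python
from statistics import NormalDist


def confidence_interval(observed_score, sem, confidence):
    """Range around an observed score at the given confidence level.

    Assumes normally distributed error: margin = z * SEM, where z is
    the two-sided critical value for the confidence level.
    """
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sem
    return observed_score - margin, observed_score + margin


# Hypothetical child: observed score 90 on a test with an SEM of 3.
low_95, high_95 = confidence_interval(90, 3, 0.95)
low_80, high_80 = confidence_interval(90, 3, 0.80)
print(f"95% confidence: {low_95:.1f} to {high_95:.1f}")
print(f"80% confidence: {low_80:.1f} to {high_80:.1f}")
```

The 80% range sits inside the 95% range: accepting less certainty buys a narrower band around the score.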
Validity This refers to the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of tests. Tests are often interpreted for uses for which they were not designed. Therefore, validity is a fundamental consideration.
Validity The fundamental question you need to ask is: does the testing process lead to the correct inferences about a specific person?
Validity Assume the following situations. You give an IQ test in English to a person who does not speak English. You give a test that measures cultural items that the person was never exposed to. You use a test designed for national standards that does not align with local standards (for example, in social studies).
Validity Content validity: is the content of the measure representative of the domain of content it is supposed to assess? Experts look at the content and compare it to what they feel it should contain.
Validity Appropriateness of included items: should the questions be here? Do they represent what the test is trying to measure (this is different from content validity)? Are the questions from too high a grade level, such as middle school material on an elementary test? Is the presentation of the items appropriate, and are the questions worded properly?
Validity Content not included: is there important content missing that should be there? How are the items measured: are they multiple choice, or open ended where you must show your work?
Validity Criterion-referenced validity refers to a test's ability to describe a test taker's ability in two ways. Present: concurrent criterion-referenced validity. Future: predictive criterion-referenced validity.
Validity Concurrent criterion-referenced validity: is the test or assessment a good indicator of what students currently know, based on the criterion of the knowledge base? If a child takes an achievement test, is it a valid measure of how well he did in fourth grade?
Validity Predictive criterion-referenced validity: does the test have the ability to predict what it says it will predict? Take a reading readiness test: if a student scores high, does he learn to read easily? If a child scores poorly, does he struggle to learn to read?
Validity Construct validity refers to the extent to which a procedure or test measures the theoretical trait or psychological construct (such as intelligence) that it purports to measure.