1
Automated Scoring in the Next Generation World
Susan Lottridge
NCSA, June 2014
© 2014 Pacific Metrics Corporation
2
Next Generation World (Winter, Burkhardt, Freidhoff, Stimson, & Leslie, 2013)
Computer-based testing
Adaptive testing
Potentially large item pools
Technology-enhanced items
Automated item generation
Personalized learning/formative uses
3
Key Topics in Today's Presentation
Cataloguing constructed response items
Combining scoring sources
Adaptive testing
TEIs/automated item generation
4
Cataloguing CR Items
5
Item Type & Scoring
Knowing the item type helps determine the appropriate scoring approach and whether the item is 'score-able' (Lottridge, Winter & Mugan, 2013).
There is an almost infinite number of ways to create a constructed response item!
6
Constructed Response Items
Covers a very broad range of item types
– When is it different from an essay?
– When is it different from a performance event or task?
Need better definition around these items
– Types
– Structural considerations
– Content
– Score points/rubric
7
Types
Technology-enhanced items
Text-based
– One-word to phrasal typed response
– Single sentence
– Multiple sentence
Constrained entry
– Numbers
– Equations/expressions
8
Structural Considerations (Ferrara et al., 2003; Scalise & Gifford, 2006)
CR items are often multi-part
– What defines a part?
– Parts can differ from entry boxes
– How are parts scored in the rubric?
– How many points per part?
CR items consist of multiple types
– TEI + text 'explain your answer'
– Solution + equation (+ text 'explain your answer')
– Solution + text 'explain your answer'
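To make the cataloguing idea concrete, here is a minimal sketch, in Python, of how a multi-part CR item's structure could be recorded so that scoring approach and score-ability can be reviewed systematically. The class and field names are hypothetical illustrations, not Pacific Metrics' actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical catalogue entry for one constructed-response item.
# Field names are illustrative only, not a production schema.

@dataclass
class ItemPart:
    part_type: str    # e.g. "solution", "equation", "explanation", "TEI"
    entry_boxes: int  # parts can differ from entry boxes
    points: int       # points allocated to this part in the rubric

@dataclass
class CRItem:
    item_id: str
    content_area: str                 # "math", "science", "reading"
    parts: List[ItemPart] = field(default_factory=list)

    @property
    def total_points(self) -> int:
        return sum(p.points for p in self.parts)

    @property
    def structure(self) -> str:
        # Compact label such as "solution + explanation", matching the
        # structure categories counted on the following slides.
        return " + ".join(p.part_type for p in self.parts)

# Example: a two-part math item, "solution + explanation", 2 points total.
item = CRItem("M-001", "math",
              [ItemPart("solution", 1, 1), ItemPart("explanation", 1, 1)])
print(item.structure, item.total_points)
```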
9
Math CR Items
CR Item Structure | Count
Solution + explanation | 28
[Solution + explanation] + [solution + explanation] | 10
[N] Solution + explanation | 9
[N] Solution | 5
Equation + solution | 4
Equation + solution + explanation | 4
TEI with or without labeling | 8
TEI + solution/expression/equation | 4
Other | 3
10
Science CR Items
CR Item Structure | Count
Identify 1 + explain/describe | 7
Identify 2 + explain/describe | 7
Identify 4 + explain/describe | 1
Identify [N] | 11
Explain/describe [N] | 5
[M] [identify [N] + explain/describe [N]] | 4
TEI + explain/describe [N] + identify [N] | 3
Solution + explain/describe [N] + identify [N] | 3
Solution [N] | 1
11
Reading CR Items
CR Item Structure | Count
Identify 1 + 2 details | 22
Identify [N] + [N] details | 9
2 details | 7
[N] details | 5
[N] List 1 + explain/describe | 5
Summary with specifications | 5
Inference/generalization + [N] details | 5
12
Content Considerations
Math
– What is required in 'explain your answer' or 'show your work' responses?
Science (Baxter & Glaser, 1998)
– Content lean to content rich
– Process constrained to process rich
Reading
– Detail versus explanation
– Prediction/generalization
– Summarization
13
Combining Scoring Sources
14
There are many ways to leverage different scoring sources (Lottridge, Winter & Mugan, 2013)
– 100% human + N% computer second read
– Complementary human and computer scoring
– 100% computer with N% human second read
– Blended human and computer scoring
But can we use different computer scoring models to produce better results?
– Adjudication Model
– Ensemble Model
15
Adjudication Model
Two engines independently trained on 5 math CR items
– Each engine underperforms relative to humans
– One engine is 'rule-based'; the other is heavily NLP/machine learning based
Restrict computer scoring to those responses on which the engines agree
– The remaining responses go to humans for scoring
Results show promise
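The routing rule itself is simple. The sketch below shows the agreement check that decides whether the engine score is used or the response is routed to human raters; it is assumed logic based on the slide's description, not the vendors' actual implementation.

```python
from typing import Optional

def adjudicate(engine1_score: int, engine2_score: int) -> Optional[int]:
    """Return the automated score only when the two engines agree;
    otherwise return None to signal that humans should score the response.
    (Illustrative logic inferred from the description above.)"""
    if engine1_score == engine2_score:
        return engine1_score
    return None

# Example: route a small batch of responses.
responses = [(2, 2), (1, 0), (3, 3)]   # (engine 1 score, engine 2 score)
engine_scored = [s for s in (adjudicate(a, b) for a, b in responses) if s is not None]
to_humans = sum(1 for a, b in responses if a != b)
print(len(engine_scored), "engine-scored;", to_humans, "sent to human raters")
```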
17
Exact Agreement Rates (Complete Validation Sample)
Item | N | Human 1 – Human 2 | Engine 1 – Human 2 | Engine 2 – Human 2
1 | 341 | 91% | 76% | 80%
2 | 261 | 88% | 79% | 85%
3 | 340 | 85% | 77% |
4 | 298 | 96% | 88% | 89%
5 | 183 | 94% | 74% | 70%
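For reference, exact agreement in these tables is the proportion of responses on which two scoring sources assign the identical score. A minimal sketch of that computation, under the assumption that scores are held in parallel lists:

```python
def exact_agreement(scores_a, scores_b) -> float:
    """Proportion of responses receiving identical scores from two sources."""
    assert len(scores_a) == len(scores_b)
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)

# Example: 4 of 5 scores match, so exact agreement is 0.80 (80%).
print(exact_agreement([0, 1, 2, 2, 3], [0, 1, 2, 1, 3]))
```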
18
Adjudication Proportions
Item | N | Engine Assigns Score | Humans Assign Score
1 | 676 | 71% | 29%
2 | 539 | 80% | 20%
3 | 706 | 74% | 26%
4 | 598 | 90% | 10%
5 | 372 | 62% | 38%
19
Engine Assigns Score Condition (Exact Agreement Performance)
Item | N | Human 1 – Human 2 | Engine 1/2 – Human 2
1 | 246 | 72% | 74%
2 | 209 | 90% |
3 | 253 | 86% |
4 | 274 | 96% | 92%
5 | 106 | 97% | 91%
20
Humans Assign Score Condition (Exact Agreement Performance)
Item | N | Human 1 – Human 2 | Engine 1 – Human 2 | Engine 2 – Human 2
1 | 95 | 86% | 39% | 52%
2 | 52 | 79% | 31% | 59%
3 | 87 | 85% | 48% | 51%
4 | 24 | 92% | 46% | 54%
5 | 77 | 90% | 52% | 42%
21
Adjudication Summary
When scoring is restricted to responses on which the engines agree, the engines perform similarly to humans.
When the engines do not agree, they perform poorly relative to humans.
This suggests the adjudication criterion is adequate for retaining only those responses that should be scored by automated scoring.
22
Ensemble Model
Combining scores from two different engines to produce a score
– Weighted average
– Optimization via regression
– Other methods (decision trees, etc.)
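A hedged sketch of the first two ensembling approaches named above: a weighted average of the two engine scores, and a regression fit on a sample with human scores and then applied to new responses. The equal weights, score range, and rounding rule are illustrative assumptions, not the values used in the study.

```python
import numpy as np

def ensemble_average(e1, e2, w1=0.5, w2=0.5, max_score=3):
    """Weighted average of two engine scores, rounded to the score scale.
    Equal weights are an assumption for illustration."""
    raw = w1 * np.asarray(e1) + w2 * np.asarray(e2)
    return np.clip(np.rint(raw), 0, max_score).astype(int)

def ensemble_regression(e1_train, e2_train, human_train, e1_new, e2_new, max_score=3):
    """Least-squares regression of human scores on the two engine scores
    ('optimization via regression'), applied to new responses."""
    X = np.column_stack([np.ones(len(e1_train)), e1_train, e2_train])
    coef, *_ = np.linalg.lstsq(X, np.asarray(human_train), rcond=None)
    X_new = np.column_stack([np.ones(len(e1_new)), e1_new, e2_new])
    return np.clip(np.rint(X_new @ coef), 0, max_score).astype(int)

# Example: average the engines' scores for three responses.
print(ensemble_average([1, 2, 3], [2, 2, 1]))   # -> [2 2 2]
```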
23
Results: Exact Agreement Rate with Human Raters
Item | Engine 1 | Engine 2 | Ensemble | Improvement
1 | 59% | 61% | 66% | 5%
2 | 75% | 67% | 76% | 1%
3 | 57% | 58% | 63% | 5%
13 Reading CR items
Ensembling by averaging scores from two engines
10 items exhibited no improvement
3 items exhibited some improvement
24
Adaptive Testing and CRs
Item pools
– Potentially large number of CRs (thousands)
– Low number of examinees per CR (if any)
Impacts on hand scoring and engine scoring
– Training readers and engines
– Requires a large AS staff to train, or a shift from 'expert-based' to 'automated' training models
25
TEIs and Automated Item Generation
Many TEIs/AIG templates are scored 0-1 or are multiple choice (Winter et al., 2013)
– But they often require multiple steps by the examinee
Can we involve item authors in configuring scoring rules to enable partial-credit scoring?
– Expands the usefulness of the item to examinees
– Removes expert scoring labor from the training process
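One way item authors could configure partial-credit rules without scoring-engine expertise is a small declarative rule set evaluated against the captured response. The sketch below is purely hypothetical; the rule format, field names, and TEI example are not drawn from any actual product.

```python
# Hypothetical author-configured partial-credit rules for a drag-and-drop TEI.
# Each rule awards points when the named response fields match the listed keys.
rules = [
    {"points": 2, "require": {"slot_1": "mitochondrion", "slot_2": "ribosome"}},
    {"points": 1, "require": {"slot_1": "mitochondrion"}},
    {"points": 1, "require": {"slot_2": "ribosome"}},
]

def score_tei(response: dict, rules: list) -> int:
    """Award the highest point value among rules whose requirements are met."""
    earned = [r["points"] for r in rules
              if all(response.get(k) == v for k, v in r["require"].items())]
    return max(earned, default=0)

# Example: one of two slots correct earns partial credit.
print(score_tei({"slot_1": "mitochondrion", "slot_2": "nucleus"}, rules))  # -> 1
```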
26
References
Baxter, G., & Glaser, R. (1998). Investigating the cognitive complexity of science assessments. Educational Measurement: Issues & Practice, 17(3), 37-45.
Ferrara, S., Duncan, T., Perie, M., Freed, R., McGovern, J., & Chilukuri, R. (2003, April). Item construct validity: Early results from a study of the relationship between intended and actual cognitive demands in a middle school science assessment. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.
Lottridge, S., Winter, P., & Mugan, L. (2013). The AS decision matrix: Using program stakes and item type to make informed decisions about automated scoring implementations. Pacific Metrics Corporation. Retrieved from http://www.pacificmetrics.com/white-papers/ASDecisionMatrix_WhitePaper_Final.pdf
Scalise, K., & Gifford, B. R. (2006). Computer-based assessment in e-learning: A framework for constructing "intermediate constraint" questions and tasks for technology platforms. Journal of Teaching, Learning and Assessment, 4(6).
Winter, P. C., Burkhardt, A. K., Freidhoff, J. R., Stimson, R. J., & Leslie, S. C. (2013). Astonishing impact: An introduction to five computer-based assessment issues. Michigan Virtual Learning Research Institute. Retrieved from http://media.mivu.org/institute/pdf/astonishing_impact.pdf
27
Questions? Pacific Metrics Corporation
28
1 Lower Ragsdale Drive, Building 1, Suite 150, Monterey, CA 93940
www.pacificmetrics.com
Thank You