Towards Assessing Students' Fine Grained Knowledge: Using an Intelligent Tutor for Assessing. Mingyu Feng, Worcester Polytechnic Institute. August 18th, 2009. Ph.D. Dissertation Committee: Prof. Neil T. Heffernan (WPI), Prof. Carolina Ruiz (WPI), Prof. Joseph E. Beck (WPI), Prof. Kenneth R. Koedinger (CMU)

2 Motivation – the need • Concerns about poor student performance on new state tests • High-stakes standards-based tests are required by the No Child Left Behind (NCLB) Act • Student performance is not satisfactory • Massachusetts (2003: 20% failed 10th grade math on the first try) • Worcester • Secondary teachers are asked to be data-driven • MCAS test reports • Formative assessment and practice tests • Provided by Northwest Evaluation Association, Measured Progress, Pearson Assessments, etc.

3 Motivation – the problems • I: Formative assessment takes time from instruction • NCLB or NCLU (No Child Left Untested)? • Every hour spent assessing students is an hour lost from instruction • Limited classroom time compels teachers to make a choice

4 Motivation – the problems • II: Performance reports are not satisfactory • Teachers want more frequent and more detailed reports • Confrey, J., Valenzuela, A., & Ortiz, A. (2002). Recommendation to the Texas State Board of Education on the Setting of TAKS Standards: A Call to Responsible Action.

5 Main Contributions • Improved the assessment system by taking into account how much assistance students need (WWW'06; ITS'06; EDM'08; UMUAI Journal'09, nominated for James Chen award) • Established a way to track and predict performance longitudinally over multiple years (WWW'06; EDM'08) • Rigorously evaluated the effectiveness of skill models of various granularities (AAAI'06 EDM Workshop; TICL'07; IEEE Journal'09) • Used a data mining approach to evaluate the effectiveness of individual tutoring content (AIED'09) • Used data mining to refine existing skill models (EDM'09; in preparation) • Developed an online reporting system deployed and used by real teachers (AIED'05; Book chapter'07; TICL Journal'06; JILR Journal'07)

6 Roadmap • Motivation • Contributions • Background – ASSISTments • Using the tutoring system as an assessor • Dynamic assessment • Longitudinal modeling • Cognitive diagnostic modeling • Conclusion & general implications

7 ASSISTments System • A web-based tutoring system that assists students in learning mathematics and gives teachers assessment of their students' progress • Teachers like ASSISTments • Students like ASSISTments

8 An ASSISTment • We break multi-step items (original questions) into scaffolding questions • Attempt: the student takes an action to answer a question • Response: the correctness of the student's answer (1/0) • Hint messages: given on demand; suggest what step to do next • Buggy message: a context-sensitive feedback message • Skill: a piece of knowledge required to answer a question

9 Facts about ASSISTments • students have used the system regularly • More than 10 million data records collected • Other features • Learning experiments; authoring tools; account and class management toolkit … • The dissertation uses data of about 1000 students who used ASSISTments during the 2004–05 and 2005–06 school years • AIED'05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K. R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.) Proceedings of the 12th International Conference on Artificial Intelligence in Education. Amsterdam: IOS Press. • Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., & Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges and Almeida (Eds). Intelligent Educational Machines within the Intelligent Systems Engineering Book Series. Springer Berlin / Heidelberg.

10 Roadmap • Motivation • Contributions • Background – ASSISTments • Using the tutoring system as an assessor • Dynamic assessment • Longitudinal modeling • Cognitive diagnostic modeling • Conclusion & general implications

11 A Grade Book Report • Where does this score come from? • JILR Journal: Feng, M. & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of Interactive Learning Research, 18(2). Chesapeake, VA: AACE. • TICL Journal: Feng, M., Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition, and Learning Journal, Vol. 3. Old City Publishing, Philadelphia, PA.

12 Automated Assessment • Big idea: use data collected while a student uses ASSISTments to assess that student • Lots of types of data available • (the last screen used just % correct on original questions) • Lots of other possible measures • Why should we be more complicated?

13 A Grade Book Report • Static – does not distinguish "Tom" and "Jack" → dynamic assessment • Average – ignores development over time → longitudinal modeling • Uninformative – offers little guidance for classroom instruction → cognitive diagnostic assessment

14 Dynamic Assessment – the idea • Dynamic testing began before computerized testing (Brown, Bryant, & Campione, 1983). • Brown, A. L., Bryant, N.R., & Campione, J. C. (1983). Preschool children's learning and transfer of matrices problems: Potential for improvement. Paper presented at the Society for Research in Child Development meetings, Detroit.

15 Dynamic vs. Static Assessment • Developing dynamic testing metrics, computed from tutor logs (see the sketch below) • # attempts • # minutes to come up with an answer; # minutes to complete an ASSISTment • # hint requests; # hint-before-attempt requests; # bottom-out hints • % correct on scaffolds • # problems solved • "Static" measure • correct/wrong on original questions
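To make these metrics concrete, here is a minimal sketch of aggregating one student's log rows into per-student features. The LogRow schema and field names are hypothetical illustrations, not the actual ASSISTments log format.

```python
# Minimal sketch: turn raw tutor-log rows into the dynamic metrics above.
# The LogRow schema is hypothetical, not the real ASSISTments log format.
from dataclasses import dataclass

@dataclass
class LogRow:
    attempts: int    # attempts made on the item
    hints: int       # hint requests on the item
    seconds: float   # time spent on the item
    correct: int     # 1/0 first response on the original question

def dynamic_metrics(rows):
    """Aggregate one student's rows into static + dynamic features."""
    n = len(rows)
    return {
        "percent_correct": sum(r.correct for r in rows) / n,  # static measure
        "avg_attempts": sum(r.attempts for r in rows) / n,
        "avg_hint_requests": sum(r.hints for r in rows) / n,
        "avg_seconds_per_item": sum(r.seconds for r in rows) / n,
        "problem_count": n,
    }

print(dynamic_metrics([LogRow(2, 1, 90.0, 0), LogRow(1, 0, 40.0, 1)]))
```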

16 Dynamic Assessment – data • 2004–05 data • Sept, 2004 – May, 2005 • 391 students • Online data • 267 minutes (sd. = 79); 9 days; 147 items (sd. = 60) • 8th grade MCAS scores (May, 2005) • 2005–06 data • Sept, 2005 – May, 2006 • 616 students • Online data • 196 minutes (sd. = 76); 6 days; 88 items (sd. = 42) • 8th grade MCAS scores (May, 2006)

17 Dynamic Assessment – modeling • Three stepwise linear regression models predicting MCAS score (see the sketch below) • The standard test model: 1-parameter IRT proficiency estimate • The assistance model: all online metrics • The mixed model: 1-parameter IRT proficiency estimate + all online metrics • 1-parameter IRT: one-parameter item response theory model
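A minimal sketch of the three model families on toy data, with plain least squares standing in for the dissertation's stepwise procedure; the array names (irt_theta, online, mcas) and the synthetic data are illustrative only.

```python
# Minimal sketch: fit the standard test, assistance, and mixed models.
import numpy as np

def fit(X, y):
    """Ordinary least squares with an intercept column; returns coefficients."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

rng = np.random.default_rng(0)                 # toy stand-in data
irt_theta = rng.normal(size=(100, 1))          # 1-PL IRT proficiency estimate
online = rng.normal(size=(100, 5))             # dynamic online metrics
mcas = 30 + 8 * irt_theta[:, 0] + online @ rng.normal(size=5) \
       + rng.normal(size=100)                  # synthetic state-test scores

standard_test_model = fit(irt_theta, mcas)                     # IRT only
assistance_model = fit(online, mcas)                           # online metrics
mixed_model = fit(np.column_stack([irt_theta, online]), mcas)  # both
```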

18 Dynamic Assessment – evaluation • Bayesian Information Criterion (BIC) • Widely used model selection criterion • Resolves the overfitting problem by introducing a penalty term for the number of parameters • Formula: BIC = -2 ln L + k ln n, for likelihood L, k parameters, and n observations • Prefer the model with lower BIC • Mean Absolute Deviation (MAD) • Cross-validated prediction error • Function: MAD = (1/n) Σ |actual - predicted| • Prefer the model with lower MAD • Both measures are computed in the sketch below • Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25.
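A minimal sketch of both measures under their standard textbook definitions (the dissertation's exact BIC variant for regression models may differ):

```python
# Minimal sketch of the two evaluation measures (standard definitions).
import numpy as np

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: -2 ln L + k ln n (lower is better)."""
    return -2.0 * log_likelihood + k * np.log(n)

def mad(y_true, y_pred):
    """Mean Absolute Deviation of predictions (lower is better)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```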

19 Dynamic Assessment – results • The standard test model (1-parameter IRT proficiency estimate), the assistance model (all online metrics), and the mixed model (IRT estimate + all online metrics) were compared on MAD, BIC, and correlation with the 8th grade MCAS score • The assistance model outperformed the standard test model (p = 0.001)

20 Dynamic Assessment – what variables are important?

21 Dynamic Assessment – robustness • See if the model can generalize • Test the model on the other year's data

22 Compare Models from Two Years • Which metrics are stable across years? • Regression coefficients compared between the 2004–05 data and the 2005–06 data for: (Constant), IRT_Proficiency_Estimate, Scaffold_Percent_Correct, Avg_Question_Time, Avg_Attempt, Avg_Hint_Request, Question_Count, Avg_Item_Time, Total_Attempt

23 Dynamic Assessment – conclusion • ASSISTments data enables us to assess more accurately • The relative success of the assistance model over the standard test model highlights the power of the dynamic measures • Feng, M., Heffernan, N.T., Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference. New York, NY: ACM Press. Best Student Paper Nominee. • Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing the assessment challenge in an online system that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI journal), 19(3), 2009.

24 Roadmap • Motivation • Contributions • Background – ASSISTments • Using the tutoring system as an assessor • Dynamic assessment • Longitudinal modeling • Cognitive diagnostic modeling • Conclusion & general implications

25 Can we have our cake and eat it, too? • Most large standardized tests are unidimensional or low-dimensional • Yet teachers need fine-grained diagnostic reports (Militello, Sireci, & Schweid, 2008; Wylie & Ciofalo, 2008; Stiggins, 2005) • Can we have our cake and eat it, too? • Militello, M., Sireci, S., & Schweid, J. (2008). Intent, purpose, and fit: An examination of formative assessment systems in school districts. Paper presented at the American Educational Research Association, New York City, NY. • Wylie, E. C., & Ciofalo, J. (2008). Supporting teachers' use of individual diagnostic items. Teachers College Record. • Stiggins, R. (2005). From formative assessment to assessment FOR learning: A path to success in standards-based schools. Phi Delta Kappan, 87(4).

26 Cognitive Diagnostic Assessment • McCalla & Greer (1994) pointed out that the ability to represent and reason about knowledge at various levels of detail is important for robust tutoring • Gierl, Wang & Zhou (2008) proposed that one direction for future research is to increase understanding of how to select an appropriate grain size or level of analysis • Can we use MCAS test results to help select the right grain-sized model from a series of models of different granularities? • McCalla, G. I. and Greer, J. E. (1994). Granularity-based reasoning and belief revision in student models. In Greer, J. E. and McCalla, G. I. (eds), Student Modeling: The Key to Individualized Knowledge-Based Instruction. Springer-Verlag, Berlin. • Gierl, M.J., Wang, C., & Zhou, J. (2008). Using the attribute hierarchy method to make diagnostic inferences about examinees' cognitive skills in Algebra on the SAT. Journal of Technology, Learning, and Assessment, 6(6).

27 Building Skill Models • Four models of increasing granularity, each tagging the same items (see the sketch below) • WPI-1: Math • WPI-5: Patterns, Relations, and Algebra; Geometry; Measurement; Number Sense and Operations; Data Analysis, Statistics and Probability; … • WPI-39 (examples): Using-measurement-formulas-and-techniques; Setting-up-and-solving-equation; Understanding-pattern; Understanding-data-presentation-techniques; Understanding-and-applying-congruence-and-similarity; Converting-from-one-measure-to-another; Understanding-number-representations; … • WPI-78 (examples): Ordering-fractions; Equation-solving; Equation-concept; Inducing-function; Plot-graph; XY-graph; Congruence; Similar-triangles; Perimeter; Area; Circle-graph; Unit-conversion; Equivalent-Fractions-Decimals-Percents; …
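One way to picture the grain sizes is as nested tags on each fine-grained skill. A hypothetical sketch, using names from the hierarchy above:

```python
# Hypothetical sketch: each WPI-78 skill nests inside WPI-39/WPI-5/WPI-1
# categories, so an item's fine-grained tags can be mapped up to any grain.
skill_hierarchy = {
    "Equation-solving": {
        "wpi39": "Setting-up-and-solving-equation",
        "wpi5": "Patterns, Relations, and Algebra",
        "wpi1": "Math",
    },
    "Unit-conversion": {
        "wpi39": "Converting-from-one-measure-to-another",
        "wpi5": "Measurement",
        "wpi1": "Math",
    },
}

def tags_for(item_skills, grain):
    """Map an item's WPI-78 skill tags up to a coarser model."""
    return {skill_hierarchy[s][grain] for s in item_skills}

print(tags_for({"Equation-solving", "Unit-conversion"}, "wpi5"))
```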

29 Cognitive Diagnostic Assessment – data • 2004–05 data • Sept, 2004 – May, 2005 • 447 students • Online data: 7.3 days; 87 items (sd. = 35) • Item-level responses on the 8th grade MCAS test (May, 2005) • 2005–06 data • Sept, 2005 – May, 2006 • 474 students • Online data: 5 days; 51 items (sd. = 24) • Item-level responses on the 8th grade MCAS test (May, 2006) • All online and MCAS items have been tagged with all four skill models

30 Cognitive Diagnostic Assessment – modeling • Fit a mixed-effects logistic regression model • Predict MCAS score • Extrapolate the fitted model in time to the month of the MCAS test • Obtain the probability of getting each MCAS question correct, based upon the skill tagging of the MCAS item • Sum up the probabilities to get the total score (see the sketch below) • Longitudinal model (e.g. Singer & Willett, 2003): logit Pr(X_ijkt = 1) = (β_00 + β_0k + β_0i) + (β_10 + β_1k + β_1i) * t • X_ijkt is the 0/1 response of student i on question j tapping skill k in month t • Month t is the elapsed month in the study: 0 for September, 1 for October, and so on • β_0k and β_1k: respective fixed effects for the baseline and rate of change in the probability of correctly answering a question tapping skill k • β_00 and β_10: the group-average incoming knowledge level and rate of change • β_0i and β_1i: the student's own baseline level of achievement and rate of change
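A minimal numpy sketch of the prediction step, assuming each student's per-skill intercepts and slopes (fixed plus random effects already summed) have been estimated; the numbers are toy values, not fitted ones.

```python
# Minimal sketch: extrapolate a fitted logit trajectory to the MCAS month,
# turn it into per-item probabilities via skill tags, and sum to a score.
import numpy as np

def predict_score(intercepts, slopes, item_skills, t_mcas):
    """intercepts/slopes: one student's per-skill logit parameters;
    item_skills: skill index tagged to each MCAS item; t_mcas: test month."""
    logits = intercepts[item_skills] + slopes[item_skills] * t_mcas
    p_correct = 1.0 / (1.0 + np.exp(-logits))  # inverse logit per item
    return p_correct.sum()                     # expected total score

intercepts = np.array([-0.2, 0.5, 1.0])        # three skills, one student
slopes = np.array([0.10, 0.05, 0.02])          # learning per month
item_skills = np.array([0, 0, 1, 2, 2])        # five MCAS items' skill tags
print(predict_score(intercepts, slopes, item_skills, t_mcas=8))
```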

31 How do I Evaluate Models? • For each student (Mary, Tom, …, Sue, Dick, Harry), compare the real MCAS score against the ASSISTment-predicted score under each skill model (WPI-1, WPI-5, WPI-39, WPI-78) • Average the absolute differences into the MAD • %Error: WPI-1 13.00%, WPI-5 12.85%, WPI-39 12.41%, WPI-78 12.09% • Paired two-sample t-test for significance

32 Comparing Models of Different Granularities • Baseline: 1-parameter IRT model • 2004–05 data, %Error: WPI-1 13.00% > WPI-5 12.85% > WPI-39 12.41% > WPI-78 12.09% (pairwise p = 0.21, p < 0.001, p = 0.006) • 2005–06 data, %Error: WPI-1 19.37% > WPI-5 19.14% > WPI-39 15.10% > WPI-78 14.70% (p < 0.001, p = 0.03)

33 The Effect of Scaffolding – hypothesis • Only using original questions makes it hard to decide which skill to "blame" • Scaffolding questions aid in diagnosis by directly assessing a single skill • Hypotheses • Using responses to scaffolding questions will improve prediction accuracy • Scaffolding questions are more useful for fine-grained models

34 The Effect of Scaffolding – results • %Error with only original questions used: WPI-1 %, WPI-5 %, WPI-39 %, WPI-78 % (both years) • %Error with original + scaffolding questions used • 2004–05 data: WPI-1 13.00%, WPI-5 12.85%, WPI-39 12.41%, WPI-78 12.09% • 2005–06 data: WPI-1 19.37%, WPI-5 19.14%, WPI-39 15.10%, WPI-78 14.70%

35 Cognitive Diagnostic Assessment – usage • Results are presented in a nested structure of different granularities to serve a variety of stakeholders

36 Cognitive Diagnostic Assessment – conclusion • Fine-grained models do the best job estimating student skill level overall • Not necessarily the best for all consumers (e.g. principals) • Need the ability to diagnose (e.g. scaffolding questions) • Scaffolding questions • Help improve overall prediction accuracy • Are more useful for fine-grained models • Feng, M., Heffernan, N.T., Mani, M. & Heffernan, C. (2006). Using Mixed-Effects Modeling to Compare Different Grain-Sized Skill Models. In Beck, J., Aimeur, E., & Barnes, T. (Eds). Educational Data Mining: Papers from the AAAI Workshop. Menlo Park, CA: AAAI Press. • Feng, M., Heffernan, N., Heffernan, C. & Mani, M. (2009). Using mixed-effects modeling to analyze different grain-sized skill models. IEEE Transactions on Learning Technologies, Special Issue on Real-World Applications of Intelligent Tutoring Systems. (Featured article of the issue) • Pardos, Z., Feng, M., Heffernan, N. T. & Heffernan-Lindquist, C. (2007). Analyzing fine-grained skill models using bayesian and mixed effect methods. In Luckin & Koedinger (Eds.) Proceedings of the 13th Conference on Artificial Intelligence in Education. Amsterdam, Netherlands: IOS Press.

37 Future Work – Skill Model Refinement • We found that WPI-78 predicts a state test better than less fine-grained models • However, WPI-78 may contain some mis-taggings • Expert-built models are subject to the risk of "expert blind spot" • Ours was a best guess built in a 7-hour coding session • A best-guess model should be iteratively tested and refined

38 Skill Model Refinement – approaches • Human experts manually update hand-crafted models • (1,000+ items) * (100+ skills) • Not practical to do often • Data mining can help • Skills or items with high residuals • Skills consistently over-predicted or under-predicted • "Un-learned" skills (i.e. negative slopes from mixed-effects models) • Feng, M., Heffernan, N., Beck, J., & Koedinger, K. (2008). Can we predict which groups of questions students will learn from? In Beck & Baker (Eds.). Proceedings of the 1st International Conference on Educational Data Mining. Montreal, 2008.

39 Skill Model Refinement – approaches • Searching for better models automatically • Learning Factor Analysis (LFA) (Koedinger & Junker, 1999) • A semi-automated method • Three parts • Difficulty factors associated with problems • A combinatorial search space created by applying operators (add, split, merge) to the base model • A statistical model that evaluates how well a candidate model fits the data • Can we increase the efficiency of LFA? • Humans identify difficulty factors through task analysis; automated methods then search for better models based upon those factors (a sketch of the split-and-search loop follows)
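A minimal sketch of LFA's split operator and a greedy search over it. Here a model is a dict mapping item to skill, `score` stands in for refitting the statistical model and returning its BIC (lower is better), and the toy scorer in the usage lines is purely for demonstration.

```python
# Minimal sketch of an LFA-style split-and-search loop (illustrative only).

def apply_split(model, skill, factor):
    """Split `skill` into two finer skills according to a binary item factor."""
    return {item: (f"{s}-{factor[item]}" if s == skill else s)
            for item, s in model.items()}

def lfa_search(base_model, factors, score):
    """Greedily apply the best split until no split improves the score."""
    best, best_score = base_model, score(base_model)
    improved = True
    while improved:
        improved = False
        for skill in set(best.values()):
            for factor in factors:
                cand = apply_split(best, skill, factor)
                s = score(cand)
                if s < best_score:
                    best, best_score, improved = cand, s, True
    return best, best_score

model = {"item1": "circle-area", "item2": "circle-area"}
factors = [{"item1": "High", "item2": "Low"}]  # one difficulty factor
toy_score = lambda m: len(set(m.values()))     # stand-in for a real BIC
print(lfa_search(model, factors, toy_score))   # toy score never improves
```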

40 Suggesting Difficulty Factors • Some items in a random sequence cause significantly less learning than others • Hypothesis • Problems that "don't help" students learn might be teaching a different skill (or skills) • Create factor tables, e.g. the skill Circle-area tagged with factor values High, High, High, Low across four items • Preliminary results show some validity • Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.

41 Roadmap • Motivation • Contributions • Background – ASSISTments • Using the tutoring system as an assessor • Dynamic assessment • Longitudinal modeling • Cognitive diagnostic modeling • Conclusion & general implications

42 Conclusion of the Dissertation • The dissertation establishes novel assessment methods to better assess students in tutoring systems • Assess students better by analyzing their learning behaviors when using the tutor • Assess students longitudinally by tracking learning over time • Assess students diagnostically by modeling fine-grained skills

43 Comments from the Education Secretary • Secretary of Education Arne Duncan weighed in (in Feb 2009) on the NCLB Act and called for continuous assessment • Duncan says he is concerned about overtesting but he thinks states could solve the problem by developing better tests. He also wants to help them develop better data management systems that help teachers track individual student progress. "If you have great assessments and real-time data for teachers and parents that say these are [the student's] strengths and weaknesses, that's a real healthy thing," he says. • Ramírez, E., & Clark, K. (Feb. 2009). What Arne Duncan Thinks of No Child Left Behind: The new education secretary talks about the controversial law and financial aid forms. (Electronic version) Retrieved March 8th, 2009.

44 General implications • Continuous assessment systems are possible to build (we built one) • Save classroom instruction time by assessing students during tutoring • Track individual progress and help stakeholders get student performance information • Provide teachers with fine-grained, cognitively diagnostic feedback so they can be "data-driven"

45 A metaphor for this shift • Businesses don't close down periodically to take inventory of stock any more • Bar codes; auto-checkout • Business never stops • Richer information • Committee on the Foundations of Assessment, Board on Testing and Assessment, Center for Education, National Research Council: James W. Pellegrino, Naomi Chudowsky, Robert Glaser (page 284).

46 Acknowledgements • My advisor: Neil Heffernan • Committee members: Ken Koedinger, Carolina Ruiz, Joe Beck • The ASSISTment team • My family • Many more…

Worcester Polytechnic Institute Thanks! Questions?

48 Backup slides

49 Motivation – the problems • III: The "moving" target problem • Testing and instruction have been separate fields of research with their own goals • Psychometric theory assumes a fixed target for measurement • ITS wants student ability to "move"

50 More Contributions • Working systems • The reporting system that gives cognitive diagnostic reports to teachers in a timely fashion • An easy approach to detect the effectiveness of individual tutoring content • AIED'05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K. R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.) Proceedings of the 12th International Conference on Artificial Intelligence in Education. Amsterdam: IOS Press. • Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., & Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges and Almeida (Eds). Intelligent Educational Machines within the Intelligent Systems Engineering Book Series. Springer Berlin / Heidelberg. • JILR Journal: Feng, M. & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of Interactive Learning Research, 18(2). Chesapeake, VA: AACE. • TICL Journal: Feng, M., Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition, and Learning Journal, Vol. 3. Old City Publishing, Philadelphia, PA. • AIED'09: Feng, M., Heffernan, N.T., Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, and Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press.

51 Evidence • (figure: 62%, 50%, 37%)

52 Evidence • 1. Congruence • 2. Perimeter • 3. Equation-Solving

53 Terminology • MCAS • Item/question/problem • Response • Original question • Scaffolding question • Hint message • Bottom-out hint • Buggy message • Attempt • Skill/knowledge component • Skill model/cognitive model/Q-matrix • Single-mapping model • Multi-mapping model

55 The reporting system • I developed the first reporting system for ASSISTments in 2004 that • is online, live, and gives detailed feedback at a grain size suitable for guiding instruction

56 The grade book • "It's spooky; he's watching everything we do." – a student

57 Identifying difficult steps

58 Informing hard skills

59 Linear Regression Model • An approach to modeling the relationship between a dependent variable (y) and one or more explanatory variables (X) • y depends linearly on X • How does linear regression work? • By minimizing the sum of squared errors • Example: linear regression with one independent variable • Stepwise regression • Forward; backward; combination (a sketch of forward selection follows)
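A minimal sketch of forward stepwise selection on top of ordinary least squares; real stepwise procedures use F-tests or p-values where this sketch uses a crude improvement threshold.

```python
# Minimal sketch: forward stepwise selection by sum-of-squared-errors.
import numpy as np

def sse(cols, y):
    """Sum of squared errors of an OLS fit on an intercept plus `cols`."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

def forward_stepwise(X, y, tol=1e-3):
    """Add the predictor that most reduces SSE; stop when the gain is tiny."""
    chosen, remaining = [], list(range(X.shape[1]))
    best = sse([], y)
    while remaining:
        trials = [(sse([X[:, c] for c in chosen] + [X[:, j]], y), j)
                  for j in remaining]
        new_sse, j = min(trials)
        if best - new_sse < tol * best:   # negligible improvement: stop
            break
        best, chosen = new_sse, chosen + [j]
        remaining.remove(j)
    return chosen
```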

60 1-Parameter IRT Model • Item response theory (IRT) models relate the probability of an examinee's response to a test item to an underlying ability through a logistic function • 1-PL IRT model: P(X_ni = 1) = exp(β_n − δ_i) / (1 + exp(β_n − δ_i)), where β_n is the ability of person n and δ_i is the difficulty of item i • I used BILOG-MG to run the model and get estimates of student ability and item difficulty (see the snippet below)
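The 1-PL response probability in code, as a toy illustration (the dissertation's estimates came from BILOG-MG, not from this snippet):

```python
# Minimal sketch: response probability under the 1-PL (Rasch) IRT model.
import math

def p_correct(beta, delta):
    """P(response = 1) given ability beta and item difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

print(p_correct(beta=0.8, delta=-0.2))  # able student, easy item: ~0.73
```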

61 Dynamic assessment - The models

62 Dynamic assessment - The models

63 Dynamic assessment – The models

64 Dynamic assessment - Validation

65 Longitudinal Modeling – data • Average % correct on original questions over time (fake data, for illustration) • What does our real data look like?

68 Longitudinal Modeling – methodology • What do we get from (linear) mixed-effects models? • Average population trajectory for the specified group • Trajectory indicated by two parameters • intercept: γ_00; slope: γ_10 • The average estimated score for the group at time j is γ_00 + γ_10 * time_j • One trajectory for every single student • Each student gets two parameters to vary from the group average • intercept: ζ_0i; slope: ζ_1i • The estimated score for student i at time j is (γ_00 + ζ_0i) + (γ_10 + ζ_1i) * time_j • Singer, J. D. & Willett, J. B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Occurrence. Oxford University Press, New York.

69 Longitudinal Modeling – results • BIC: Bayesian Information Criterion (the lower, the better) • Feng, M., Heffernan, N.T., Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference. New York, NY: ACM Press. Best Student Paper Nominee. • Feng, M., Heffernan, N.T., Koedinger, K.R. (2006b). Predicting State Test Scores Better with Intelligent Tutoring Systems: Developing Metrics to Measure Assistance Required. In Ikeda, Ashley & Chan (Eds.). Proceedings of the 8th International Conference on Intelligent Tutoring Systems. Springer-Verlag: Berlin.

70 Mixed effects models • Individuals in the population are assumed to have their own subject-specific mean response trajectories over time • The mean response is modeled as a combination of population characteristics (fixed effects) and subject-specific effects that are unique to a particular individual (random effects) • It is possible to predict how individual response trajectories change over time • Flexibility in accommodating imbalance in longitudinal data • Methodological features: 1) three or more waves of data; 2) an outcome variable (dependent variable) whose values change systematically over time; 3) a sensible metric for time that is the fundamental predictor in the longitudinal study

71 Sample longitudinal data

72 Comparison of Approaches • Ayers & Junker (2006) • Estimate student proficiency using • 1-PL IRT model • LLTM (linear logistic test model) • Main question difficulty decomposed into K skills • 1-PL IRT fits dramatically better • Only main questions used • Additive, non-temporal • WinBUGS

73 Comparison of Approaches • Pardos et al. (2006) • Conjunctive Bayes nets • Non-temporal • Scaffolding used • Bayes Net Toolbox (Murphy, 2001) • DINA model (Anozie, 2006)

74 Comparison of Approaches • Feng, Heffernan, Mani & Heffernan (2006) • Logistic mixed-effects model (Generalized Linear Mixed-effects Model, GLMM) • Temporal • logit Pr(X_ijkt = 1) = (β_00 + β_0k + β_0i) + (β_10 + β_1k + β_1i) * t, where X_ijkt is the 0/1 response of student i on question j tapping KC k in month t • Month t is the elapsed month in the study; β_0k and β_1k are the respective fixed effects for baseline and rate of change in the probability of correctly answering a question tapping KC k • R lme4 library

75 Comparison of Approaches • Comparing to LLTM in Ayers & Junker (2006) • Student proficiency depends on time • Question difficulty depends on KC and time • Assign only the most difficult skill instead of a full Q-matrix mapping of multiple skills as in LLTM • Scaffolding used to gain identifiability • Ayers & Junker (2006) use regression to predict MCAS after obtaining an estimate of student ability (θ) (MAD = 10.93%) • No such regression step in my work • logit(p = 1) = θ − 0; estimated score = full score * p • Higher MAD, but provides diagnostic information

76 Comparison of Approaches • Comparing to Bayes nets and conjunctive models • Bayes: probability reasoning; conjunctive • GLMM: linear learning; max-difficulty reduction • Computationally much easier and faster • Results are still comparable • GLMM is better than Bayes nets when WPI-1 or WPI-5 is used • GLMM is comparable with Bayes nets when WPI-39 or WPI-78 is used • WPI-39: GLMM 12.41%, Bayes 12.05% • WPI-78: GLMM 12.09%, Bayes 13.75%

77 Cognitive Diagnostic Assessment – BIC results • BIC • The number of data points differs across models • Items tagged with more than one skill are duplicated in the data • Finer-grained models have more multi-mappings, and thus more data points (and higher BIC) • WPI-5 better than WPI-1; WPI-78 better than WPI-39 • Calculate MAD as the evaluation gauge • (BIC table by model, WPI-1 / WPI-5 / WPI-39 / WPI-78, for the 2004–05 and 2005–06 data)

78 Analyzing Instructional Effectiveness • Detect relative instructional effectiveness among items in the same GLOP using learning decomposition (see the sketch below) • (table: per-student, per-item records of response correctness and prior encounters t1–t4) • Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.
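A minimal sketch of the learning-decomposition idea with scipy: fit an exponential learning curve in which prior practice on two item types may be worth different amounts, so the fitted β measures one type's value relative to the other. The data and starting values are toy illustrations, not the dissertation's estimation.

```python
# Minimal sketch: learning decomposition via an exponential learning curve.
import numpy as np
from scipy.optimize import curve_fit

def model(X, A, b, beta):
    """Error rate after t_a practices of type A and t_b of type B."""
    t_a, t_b = X
    return A * np.exp(-b * (t_a + beta * t_b))

t_a = np.array([0, 1, 2, 0, 1, 2, 3, 0])              # prior encounters, type A
t_b = np.array([0, 0, 1, 2, 1, 0, 1, 3])              # prior encounters, type B
error = np.array([.8, .6, .4, .7, .5, .35, .3, .55])  # observed error rates

(A, b, beta), _ = curve_fit(model, (t_a, t_b), error, p0=(0.8, 0.3, 1.0))
print(f"beta = {beta:.2f} (>1 means type B teaches more than type A)")
```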

79 Searching Results • Among 38 GLOPs, LFA found significantly better models for 12 • Shall I be happy? • "Sanity" check: randomly assigned factor tables • Comparison by GLOP size (#items in GLOP (#GLOPs) | learning-suggested factors | random factor table): 2 (11): 5 | 5; 3 (5): ; 4 (7): ; (15): 4 (5, 6, 8, 9) | 1 (5) • Further work needs to be done • Quantitatively measure whether and how data-analysis results can help subject-matter experts • Explore the automatic factor-assigning approach on more data from other systems • Contrast with human experts as a controlled condition

80 Guess which item is the most difficult one? • (model-fit table: log likelihood; Bayesian Information Criterion 1,079.21; num of skills 1, 2; num of parameters 2, 4; coefficients 1.099, 0.100; Item ID; Square-root; Factor-High)