Worcester Polytechnic Institute
Towards Assessing Students' Fine Grained Knowledge: Using an Intelligent Tutor for Assessing
Mingyu Feng, August 18th, 2009
Ph.D. Dissertation Committee: Prof. Neil T. Heffernan (WPI), Prof. Carolina Ruiz (WPI), Prof. Joseph E. Beck (WPI), Prof. Kenneth R. Koedinger (CMU)
Motivation – the need
- Concerns about poor student performance on new state tests
  - High-stakes standards-based tests are required by the No Child Left Behind (NCLB) Act
  - Student performance is not satisfactory (Massachusetts, 2003: 20% failed 10th grade math on the first try; Worcester)
- Secondary teachers are asked to be data-driven
  - MCAS test reports
  - Formative assessment and practice tests, provided by the Northwest Evaluation Association, Measured Progress, Pearson Assessments, etc.
Motivation – problem I: Formative assessment takes time away from instruction
- NCLB or NCLU (No Child Left Untested)?
- Every hour spent assessing students is an hour lost from instruction
- Limited classroom time compels teachers to make a choice
Motivation – problem II: Performance reports are not satisfactory
- Teachers want more frequent and more detailed reports
Confrey, J., Valenzuela, A., & Ortiz, A. (2002). Recommendation to the Texas State Board of Education on the Setting of TAKS Standards: A Call to Responsible Action. http://www.syrce.org/State_Board.htm
Main Contributions
- Improved assessment by taking into account how much assistance students need (WWW'06; ITS'06; EDM'08; UMUAI Journal'09, nominated for the James Chen award)
- Established a way to track and predict performance longitudinally over multiple years (WWW'06; EDM'08)
- Rigorously evaluated the effectiveness of skill models of various granularities (AAAI'06 EDM Workshop; TICL'07; IEEE Journal'09)
- Used a data mining approach to evaluate the effectiveness of individual tutoring content (AIED'09)
- Used data mining to refine existing skill models (EDM'09; in preparation)
- Developed an online reporting system deployed and used by real teachers (AIED'05; Book Chapter'07; TICL Journal'06; JILR Journal'07)
Roadmap
- Motivation
- Contributions
- Background: ASSISTments
- Using a tutoring system as an assessor
  - Dynamic assessment
  - Longitudinal modeling
  - Cognitive diagnostic modeling
- Conclusion & general implications
ASSISTments System
A web-based tutoring system that assists students in learning mathematics and gives teachers assessment of their students' progress.
- Teachers like ASSISTments
- Students like ASSISTments
An ASSISTment
We break multi-step items (the original question) into scaffolding questions.
- Attempt: a student takes an action to answer a question
- Response: the correctness of the student's answer (1/0)
- Hint messages: given on demand; suggest what step to do next
- Buggy message: a context-sensitive feedback message triggered by an anticipated wrong answer
- Skill: a piece of knowledge required to answer a question
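For concreteness, a minimal sketch of how one might represent this structure in code; the class and field names are hypothetical, not the actual ASSISTments schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScaffoldingQuestion:
    skill: str                 # the single skill this scaffold assesses
    hints: List[str]           # on-demand hint messages; the last is the bottom-out hint
    buggy_messages: List[str]  # context-sensitive feedback for anticipated wrong answers

@dataclass
class Assistment:
    original_question: str                # the multi-step item
    scaffolds: List[ScaffoldingQuestion]  # one scaffold per step

@dataclass
class Attempt:
    student_id: str
    question_id: str
    response: int        # correctness of the student's answer (1/0)
    hint_requests: int
    seconds: float
```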
Facts about ASSISTments
- 5000+ students have used the system regularly
- More than 10 million data records collected
- Other features: learning experiments, authoring tools, account and class management toolkit, …
- The dissertation uses data from about 1000 students who used ASSISTments during 2004–2006
AIED'05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K.R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., & Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.), Proceedings of the 12th International Conference on Artificial Intelligence in Education, pp. 555-562. Amsterdam: IOS Press.
Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., & Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges, & Almeida (Eds.), Intelligent Educational Machines, Intelligent Systems Engineering Book Series, pp. 23-49. Springer Berlin / Heidelberg.
Roadmap
- Motivation
- Contributions
- Background: ASSISTments
- Using a tutoring system as an assessor
  - Dynamic assessment
  - Longitudinal modeling
  - Cognitive diagnostic modeling
- Conclusion & general implications
A Grade Book Report
Where does this score come from?
JILR Journal: Feng, M., & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of Interactive Learning Research, 18(2), pp. 207-230. Chesapeake, VA: AACE.
TICL Journal: Feng, M., & Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition, and Learning Journal, Vol. 3. Old City Publishing, Philadelphia, PA.
Automated Assessment
Big idea: use the data collected while a student works in ASSISTments to assess that student.
- Lots of types of data available (the last screen just used % correct on original questions)
- Lots of other possible measures
- Why should we be more complicated?
A Grade Book Report – limitations, and the responses to them
- Static – does not distinguish "Tom" and "Jack" → dynamic assessment
- Average – ignores development over time → longitudinal modeling
- Uninformative – not informative for classroom instruction → cognitive diagnostic assessment
Dynamic Assessment – the idea
Dynamic testing began before computerized testing (Brown, Bryant, & Campione, 1983).
Brown, A. L., Bryant, N.R., & Campione, J. C. (1983). Preschool children's learning and transfer of matrices problems: Potential for improvement. Paper presented at the Society for Research in Child Development meetings, Detroit.
Dynamic vs. Static Assessment
Developing dynamic testing metrics:
- # attempts
- # minutes to come up with an answer; # minutes to complete an ASSISTment
- # hint requests; # hint-before-attempt requests; # bottom-out hints
- % correct on scaffolds
- # problems solved
"Static" measure: correct/wrong on original questions
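As an illustration, a sketch of deriving such per-student metrics from attempt-level logs with pandas; the file and column names are hypothetical:

```python
import pandas as pd

# Hypothetical log format: one row per attempt, with columns
# student, item, response (1/0), hints, bottom_out, seconds, is_original
logs = pd.read_csv("assistment_logs.csv")

metrics = logs.groupby("student").agg(
    total_attempts=("response", "size"),
    avg_seconds_per_question=("seconds", "mean"),
    hint_requests=("hints", "sum"),
    bottom_out_hints=("bottom_out", "sum"),
    problems_solved=("item", "nunique"),
)
# % correct on scaffolds needs a filter first:
metrics["scaffold_pct_correct"] = (
    logs[~logs["is_original"]].groupby("student")["response"].mean()
)
```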
Dynamic Assessment – data
2004-2005 data:
- Sept. 2004 – May 2005; 391 students
- Online data: 267 minutes (sd = 79); 9 days; 147 items (sd = 60)
- 8th grade MCAS scores (May 2005)
2005-2006 data:
- Sept. 2005 – May 2006; 616 students
- Online data: 196 minutes (sd = 76); 6 days; 88 items (sd = 42)
- 8th grade MCAS scores (May 2006)
Dynamic Assessment – modeling
Three linear stepwise regression models predicting the MCAS score:
- The standard test model: 1-parameter IRT proficiency estimate (one-parameter item response theory model)
- The assistance model: all online metrics
- The mixed model: 1-parameter IRT proficiency estimate + all online metrics
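A minimal sketch of how the three models could be built with forward stepwise selection, using BIC as the entry criterion; the dissertation's exact stepwise procedure and variable names may differ:

```python
import numpy as np
import statsmodels.api as sm

def forward_stepwise(X, y):
    """Greedy forward selection: repeatedly add the predictor that
    lowers BIC the most; stop when no addition helps."""
    remaining, chosen, best_bic = list(X.columns), [], np.inf
    while remaining:
        bic, col = min((sm.OLS(y, sm.add_constant(X[chosen + [c]])).fit().bic, c)
                       for c in remaining)
        if bic >= best_bic:
            break
        best_bic = bic
        chosen.append(col)
        remaining.remove(col)
    return sm.OLS(y, sm.add_constant(X[chosen])).fit()

# The three models of the slide (column names hypothetical):
# standard_test = forward_stepwise(data[["irt_proficiency"]], mcas)
# assistance    = forward_stepwise(data[online_metric_cols], mcas)
# mixed         = forward_stepwise(data[["irt_proficiency"] + online_metric_cols], mcas)
```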
Dynamic Assessment – evaluation
Bayesian Information Criterion (BIC):
- Widely used model selection criterion
- Guards against overfitting by adding a penalty term for the number of parameters
- Formula: BIC = k·ln(n) − 2·ln(L̂), where k is the number of parameters, n the number of observations, and L̂ the maximized likelihood
- Prefer the model with lower BIC
Mean Absolute Deviation (MAD):
- Cross-validated prediction error: MAD = (1/n) Σ |predicted − actual|
- Prefer the model with lower MAD
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111-163.
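For reference, minimal implementations of the two gauges: the linear-Gaussian form of BIC, and MAD computed on cross-validated predictions. This is an illustrative sketch, not the dissertation's evaluation code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def bic_linear(n, k, rss):
    """BIC = k*ln(n) - 2*ln(max likelihood); for a Gaussian linear model this
    reduces to n*ln(rss/n) + k*ln(n) up to a constant. Lower is better."""
    return n * np.log(rss / n) + k * np.log(n)

def cross_validated_mad(X, y, folds=5):
    """Mean absolute deviation of out-of-fold predictions. Lower is better."""
    preds = cross_val_predict(LinearRegression(), X, y, cv=folds)
    return float(np.mean(np.abs(preds - y)))
```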
Dynamic Assessment – results

Model                     MAD    BIC    Correlation with 2005 8th grade MCAS
The standard test model   6.40   -295   0.733
The assistance model      5.46   -402   0.821
The mixed model           5.04   -450   0.841

The assistance model beats the standard test model (p = 0.001), and the mixed model improves further (p = 0.001).
Dynamic Assessment – what variables are important?
Dynamic Assessment – robustness
- See whether the model generalizes
- Test the model on the other year's data
Compare Models from Two Years
Which metrics are stable across years?

Metric                     2004-2005   2005-2006
(Constant)                 32.41       43.284
IRT_Proficiency_Estimate   26.83       2.944
Scaffold_Percent_Correct   20.427      21.327
Avg_Question_Time          -0.17       -0.102
Avg_Attempt                -10.5
Avg_Hint_Request                       -3.217
Question_Count                         0.072
Avg_Item_Time                          0.045
Total_Attempt                          -0.044
Dynamic Assessment – conclusion
- ASSISTments data enables us to assess students more accurately
- The relative success of the assistance model over the standard test model highlights the power of the dynamic measures
Feng, M., Heffernan, N.T., & Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference, pp. 307-316. New York, NY: ACM Press. Best Student Paper Nominee.
Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing the assessment challenge in an online system that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI), 19(3).
Roadmap
- Motivation
- Contributions
- Background: ASSISTments
- Using a tutoring system as an assessor
  - Dynamic assessment
  - Longitudinal modeling
  - Cognitive diagnostic modeling
- Conclusion & general implications
Can we have our cake and eat it, too?
Most large standardized tests are unidimensional or low-dimensional. Yet teachers need fine-grained diagnostic reports (Militello, Sireci, & Schweid, 2008; Wylie & Ciofalo, 2008; Stiggins, 2005). Can we have our cake and eat it, too?
Militello, M., Sireci, S., & Schweid, J. (2008). Intent, purpose, and fit: An examination of formative assessment systems in school districts. Paper presented at the American Educational Research Association, New York City, NY.
Wylie, E. C., & Ciofalo, J. (2008). Supporting teachers' use of individual diagnostic items. Teachers College Record. Retrieved October 13, 2008, from http://www.tcrecord.org/PrintContent.asp?ContentID=15363
Stiggins, R. (2005). From formative assessment to assessment FOR learning: A path to success in standards-based schools. Phi Delta Kappan, 87(4), 324-328.
Cognitive Diagnostic Assessment
- McCalla & Greer (1994) pointed out that the ability to represent and reason about knowledge at various levels of detail is important for robust tutoring.
- Gierl, Wang, & Zhou (2008) proposed that one direction for future research is to better understand how to select an appropriate grain size or level of analysis.
- Can we use MCAS test results to help select the right grain-sized model from a series of models of different granularities?
McCalla, G. I., & Greer, J. E. (1994). Granularity-based reasoning and belief revision in student models. In Greer, J. E., & McCalla, G. I. (Eds.), Student Modeling: The Key to Individualized Knowledge-Based Instruction, pp. 39-62. Springer-Verlag, Berlin.
Gierl, M.J., Wang, C., & Zhou, J. (2008). Using the attribute hierarchy method to make diagnostic inferences about examinees' cognitive skills in Algebra on the SAT. Journal of Technology, Learning, and Assessment, 6(6).
Building Skill Models
(A hierarchy diagram relating skill models of four granularities:)
- WPI-1: Math, as a single skill
- WPI-5: the five MCAS strands: Patterns, Relations, and Algebra; Geometry; Measurement; Number Sense and Operations; Data Analysis, Statistics and Probability
- WPI-39: finer skills, e.g., using-measurement-formulas-and-techniques, setting-up-and-solving-equation, understanding-pattern, understanding-data-presentation-techniques, understanding-and-applying-congruence-and-similarity, converting-from-one-measure-to-another, understanding-number-representations, …
- WPI-78: the finest grain, e.g., ordering-fractions, equation-solving, equation-concept, inducing-function, plot-graph, xy-graph, congruence, similar-triangles, perimeter, area, circle-graph, unit-conversion, equivalent-fractions-decimals-percents, …
Cognitive Diagnostic Assessment – data
2004-2005 data: Sept. 2004 – May 2005; 447 students; online data: 7.3 days, 87 items (sd = 35); item-level responses from the 8th grade MCAS test (May 2005)
2005-2006 data: Sept. 2005 – May 2006; 474 students; online data: 5 days, 51 items (sd = 24); item-level responses from the 8th grade MCAS test (May 2006)
All online and MCAS items have been tagged with all four skill models.
Cognitive Diagnostic Assessment – modeling
Fit a mixed-effects logistic regression model, a longitudinal model (e.g., Singer & Willett, 2003), then predict the MCAS score:
- Extrapolate the fitted model in time to the month of the MCAS test
- Obtain the probability of getting each MCAS question correct, based upon the skill tagging of the MCAS item
- Sum up the probabilities to get the total score
Where:
- X_ijkt is the 0/1 response of student i on question j tapping skill k in month t
- t is the elapsed month in the study: 0 for September, 1 for October, and so on
- β_0k and β_1k: fixed effects for the baseline and rate of change in the probability of correctly answering a question tapping skill k
- β_00 and β_10: the group-average incoming knowledge level and rate of change
- β_0 and β_1: the student's own baseline level of achievement and rate of change
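The slide's equation image did not survive extraction; a plausible reconstruction from the variable definitions above (a sketch, not necessarily the exact form used in the dissertation) is:

```latex
\operatorname{logit} P(X_{ijkt} = 1)
  = (\beta_{00} + \beta_{0k} + \beta_{0})
  + (\beta_{10} + \beta_{1k} + \beta_{1})\, t
```

with the student effects β_0, β_1 treated as random and the skill effects β_0k, β_1k as fixed.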
How do I Evaluate Models? (04-05 data)

Student   Real MCAS   WPI-1   WPI-5   WPI-39   WPI-78
Mary      25.00       23.31   22.85   22.18    20.47
Tom       32.00       29.66   29.15   28.67    27.13
…
Sue       29.00       28.46   28.23   27.85    26.26
Dick      28.00       27.41   26.70   26.12    24.30
Harry     22.00       23.33   22.58   22.02    20.14

The absolute differences |real − predicted| per student (e.g., Mary: 1.69, 2.15, 2.82, 4.53) are averaged into:
MAD       4.42    4.37    4.22    4.11
%Error    13.00%  12.85%  12.41%  12.09%
Models are compared with a paired two-sample t-test.
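To illustrate the prediction procedure, a minimal sketch (hypothetical function and data names, not the dissertation's actual code) that sums per-item success probabilities into an expected MCAS score:

```python
import numpy as np

def predict_mcas_score(p_correct, item_skills, test_month):
    """Expected total score: sum of P(correct) over the MCAS items,
    where p_correct(skill, month) comes from the fitted longitudinal
    model extrapolated to the month of the test."""
    return sum(p_correct(skill, test_month) for skill in item_skills.values())

# mad = np.mean(np.abs(predicted_scores - real_scores))  # the evaluation gauge
```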
Comparing Models of Different Granularities

04-05 data   WPI-1    WPI-5    WPI-39   WPI-78
MAD          4.42     4.37     4.22     4.11
%Error       13.00%   12.85%   12.41%   12.09%

05-06 data   WPI-1    WPI-5    WPI-39   WPI-78
MAD          6.58     6.51     4.83     4.99
%Error       19.37%   19.14%   15.10%   14.70%

Paired comparisons on 04-05 data: p = 0.21, p < 0.001, p = 0.006; on 05-06 data: p < 0.001, p = 0.03. For reference, the 1-parameter IRT model scores MAD 4.67 / %Error 13.70% and MAD 4.36 / %Error 12.83% (p = 0.10).
The Effect of Scaffolding – hypothesis
- Using only original questions makes it hard to decide which skill to "blame"
- Scaffolding questions aid diagnosis by directly assessing a single skill
Hypotheses:
- Using responses to scaffolding questions will improve prediction accuracy
- Scaffolding questions are more useful for fine-grained models
The Effect of Scaffolding – results (%Error)

04-05 data   Only original questions   Original + scaffolding questions
WPI-1        14.91%                    13.00%
WPI-5        14.06%                    12.85%
WPI-39       15.29%                    12.41%
WPI-78       17.75%                    12.09%

05-06 data   Only original questions   Original + scaffolding questions
WPI-1        20.05%                    19.37%
WPI-5        19.88%                    19.14%
WPI-39       18.68%                    15.10%
WPI-78       16.91%                    14.70%
Cognitive Diagnostic Assessment – usage
Results are presented in a nested structure of different granularities to serve a variety of stakeholders.
Cognitive Diagnostic Assessment – conclusion
- Fine-grained models do the best job of estimating student skill level overall
  - Not necessarily the best for all consumers (e.g., principals)
  - Need the ability to diagnose (e.g., scaffolding questions)
- Scaffolding questions help improve overall prediction accuracy and are more useful for fine-grained models
Feng, M., Heffernan, N.T., Mani, M., & Heffernan, C. (2006). Using Mixed-Effects Modeling to Compare Different Grain-Sized Skill Models. In Beck, J., Aimeur, E., & Barnes, T. (Eds.), Educational Data Mining: Papers from the AAAI Workshop, pp. 57-66. Menlo Park, CA: AAAI Press.
Feng, M., Heffernan, N., Heffernan, C., & Mani, M. (2009). Using mixed-effects modeling to analyze different grain-sized skill models. IEEE Transactions on Learning Technologies, Special Issue on Real-World Applications of Intelligent Tutoring Systems. (Featured article of the issue)
Pardos, Z., Feng, M., Heffernan, N.T., & Heffernan-Lindquist, C. (2007). Analyzing fine-grained skill models using Bayesian and mixed-effects methods. In Luckin & Koedinger (Eds.), Proceedings of the 13th Conference on Artificial Intelligence in Education, pp. 626-628. Amsterdam, Netherlands: IOS Press.
Future Work – Skill Model Refinement
- We found that WPI-78 predicts a state test better than less fine-grained models
- However, WPI-78 may contain mis-taggings
  - Expert-built models are subject to the risk of "expert blind spot"
  - Ours was a best guess from a 7-hour coding session
- A best-guess model should be iteratively tested and refined
Skill Model Refinement – approaches
- Having human experts manually update hand-crafted models means reviewing (1,000+ items) × (100+ skills): not practical to do often
- Data mining can help, by flagging:
  - Skills or items with high residuals
  - Skills consistently over-predicted or under-predicted
  - "Un-learned" skills (i.e., negative slopes from mixed-effects models)
Feng, M., Heffernan, N., Beck, J., & Koedinger, K. (2008). Can we predict which groups of questions students will learn from? In Beck & Baker (Eds.), Proceedings of the 1st International Conference on Educational Data Mining. Montreal, 2008.
Skill Model Refinement – searching for better models automatically
Learning Factor Analysis (LFA) (Koedinger & Junker, 1999), a semi-automated method with three parts:
- Difficulty factors associated with problems: humans identify difficulty factors through task analysis
- A combinatorial search space formed by applying operators (add, split, merge) to the base model: automated methods search for better models based upon the factors
- A statistical model that evaluates how well a candidate model fits the data
Can we increase the efficiency of LFA?
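A toy sketch of the LFA "split" step under stated assumptions: candidate models are logistic regressions over 0/1 skill and factor indicator columns and are scored by BIC. The real LFA search also applies the add and merge operators and uses its own statistical model:

```python
import numpy as np
import statsmodels.api as sm

def bic_logit(X, y):
    """Fit a logistic model of response correctness and return its BIC."""
    res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    return len(res.params) * np.log(len(y)) - 2 * res.llf

def try_splits(X, y, skill, factors):
    """LFA 'split' operator: cross one skill with each candidate difficulty
    factor, keep whichever variant lowers BIC versus the base model."""
    best_factor, best_bic = None, bic_logit(X, y)
    for f in factors:
        Xf = X.copy()
        Xf[skill + "_x_" + f] = Xf[skill] * Xf[f]  # the skill split by the factor
        b = bic_logit(Xf, y)
        if b < best_bic:
            best_factor, best_bic = f, b
    return best_factor, best_bic
```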
Suggesting Difficulty Factors
- Some items in a random sequence cause significantly less learning than others
- Hypothesis: problems that "don't help" students learn might be teaching a different skill(s)
- Create factor tables, e.g.:

Skill         Factor
Circle-area   High
Circle-area   High
Circle-area   High
Circle-area   Low

- Preliminary results show some validity
Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.
Roadmap
- Motivation
- Contributions
- Background: ASSISTments
- Using a tutoring system as an assessor
  - Dynamic assessment
  - Longitudinal modeling
  - Cognitive diagnostic modeling
- Conclusion & general implications
Conclusion of the Dissertation
The dissertation establishes novel methods to better assess students in tutoring systems:
- Assess students better by analyzing their learning behaviors while they use the tutor
- Assess students longitudinally by tracking learning over time
- Assess students diagnostically by modeling fine-grained skills
Comments from the Education Secretary
Secretary of Education Arne Duncan weighed in (in February 2009) on the NCLB Act and called for continuous assessment: "Duncan says he is concerned about overtesting but he thinks states could solve the problem by developing better tests. He also wants to help them develop better data management systems that help teachers track individual student progress. 'If you have great assessments and real-time data for teachers and parents that say these are [the student's] strengths and weaknesses, that's a real healthy thing,' he says."
Ramírez, E., & Clark, K. (Feb. 2009). What Arne Duncan Thinks of No Child Left Behind: The new education secretary talks about the controversial law and financial aid forms. (Electronic version) Retrieved March 8, 2009, from http://www.usnews.com/articles/education/2009/02/05/what-arne-duncan-thinks-of-no-child-left-behind.html
General Implications
- Continuous assessment systems are possible to build (we built one)
- Save classroom instruction time by assessing students during tutoring
- Track individual progress and help stakeholders get student performance information
- Provide teachers with fine-grained, cognitively diagnostic feedback so they can be "data-driven"
A metaphor for this shift
Businesses don't close down periodically to take inventory of stock any more: bar codes and automatic checkout let business continue non-stop while yielding richer information.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (Eds.) (2001). Knowing What Students Know. Committee on the Foundations of Assessment, Board on Testing and Assessment, Center for Education, National Research Council. (p. 284)
Acknowledgements
My advisor Neil Heffernan; committee members Ken Koedinger, Carolina Ruiz, and Joe Beck; the ASSISTment team; my family; and many more…
Thanks! Questions?
Backup slides
Motivation – problem III: The "moving target" problem
- Testing and instruction have been separate fields of research with their own goals
- Psychometric theory assumes a fixed target for measurement
- An ITS wants student ability to "move"
More Contributions
- Working systems: www.ASSISTment.org
- A reporting system that gives cognitive diagnostic reports to teachers in a timely fashion
- An easy approach to detecting the effectiveness of individual tutoring content
AIED'05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K.R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., & Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.), Proceedings of the 12th International Conference on Artificial Intelligence in Education, pp. 555-562. Amsterdam: IOS Press.
Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., & Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges, & Almeida (Eds.), Intelligent Educational Machines, Intelligent Systems Engineering Book Series, pp. 23-49. Springer Berlin / Heidelberg.
JILR Journal: Feng, M., & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of Interactive Learning Research, 18(2), pp. 207-230. Chesapeake, VA: AACE.
TICL Journal: Feng, M., & Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition, and Learning Journal, Vol. 3. Old City Publishing, Philadelphia, PA.
AIED'09: Feng, M., Heffernan, N.T., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009), pp. 523-530. Amsterdam, Netherlands: IOS Press.
Evidence
(Report screenshot: 62%, 50%, 37%)
Evidence
1. Congruence
2. Perimeter
3. Equation-Solving
Terminology
- MCAS
- Item/question/problem
- Response
- Original question
- Scaffolding question
- Hint message
- Bottom-out hint
- Buggy message
- Attempt
- Skill/knowledge component
- Skill model/cognitive model/Q-matrix
- Single-mapping model
- Multi-mapping model
The reporting system
I developed the first reporting system for ASSISTments in 2004; it is online, live, and gives detailed feedback at a grain size suitable for guiding instruction.
The grade book
"It's spooky; he's watching everything we do." – a student
Identifying difficult steps
Informing hard skills
Linear Regression Model
- An approach to modeling the relationship between a dependent variable (y) and one or more independent variables (X); y depends linearly on X
- How does linear regression work? By minimizing the sum of squared residuals
- (Example of linear regression with one independent variable)
- Stepwise regression: forward, backward, or a combination
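A two-line worked example of the sum-of-squares minimization, on illustrative data:

```python
import numpy as np

# One predictor plus an intercept; minimize ||y - Xb||^2 in closed form.
X = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # ~[0.15, 1.94]: fitted intercept and slope
```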
1-Parameter IRT Model
Item response theory (IRT) relates the probability of an examinee's response to a test item to an underlying ability through a logistic function. The 1-PL IRT model:
P(X_ni = 1) = exp(β_n − δ_i) / (1 + exp(β_n − δ_i))
where β_n is the ability of person n and δ_i is the difficulty of item i. I used BILOG-MG to fit the model and obtain estimates of student ability and item difficulty.
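The dissertation used BILOG-MG; purely as an illustration of the 1-PL model, here is a sketch that computes the response probability and recovers an ability estimate by maximum likelihood when item difficulties are known:

```python
import numpy as np
from scipy.optimize import brentq

def p_correct(beta, delta):
    """1-PL (Rasch): P(X=1) = exp(beta - delta) / (1 + exp(beta - delta))."""
    return 1.0 / (1.0 + np.exp(-(beta - delta)))

def estimate_ability(responses, difficulties):
    """MLE of ability beta given known difficulties: the likelihood equation
    sets the expected raw score equal to the observed raw score."""
    observed = sum(responses)
    def score_gap(beta):
        return sum(p_correct(beta, d) for d in difficulties) - observed
    return brentq(score_gap, -8.0, 8.0)  # assumes 0 < observed < number of items

# estimate_ability([1, 1, 0, 1], [-0.5, 0.2, 1.1, 0.0])  # -> about 1.4
```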
Dynamic assessment – the models
(Slides showing the fitted models)
Dynamic assessment – validation
Longitudinal Modeling – data
Average % correct on original questions over time (fake data, for illustration). What does our real data look like?
Longitudinal Modeling – methodology
What do we get from (linear) mixed-effects models?
- The average population trajectory for the specified group, described by two parameters, an intercept γ_00 and a slope γ_10; the average estimated score for the group at time j is γ_00 + γ_10·t_j
- One trajectory for every single student: each student gets two parameters, an intercept deviation ζ_0i and a slope deviation ζ_1i, to vary from the group average; the estimated score for student i at time j is (γ_00 + ζ_0i) + (γ_10 + ζ_1i)·t_j
Singer, J. D., & Willett, J. B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Occurrence. Oxford University Press, New York.
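A sketch of fitting such a random-intercept, random-slope model in Python with statsmodels; the dissertation's analyses used other tools, and the file and column names here are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per student per month,
# with columns student, month, score
data = pd.read_csv("monthly_scores.csv")

# Fixed effects give the group trajectory; re_formula="~month" adds a
# random intercept and a random slope for every student.
model = smf.mixedlm("score ~ month", data,
                    groups=data["student"], re_formula="~month")
result = model.fit()
print(result.params)  # group intercept and slope, plus variance components
```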
Longitudinal Modeling – results
BIC: Bayesian Information Criterion (the lower, the better)
Feng, M., Heffernan, N.T., & Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference, pp. 307-316. New York, NY: ACM Press. Best Student Paper Nominee.
Feng, M., Heffernan, N.T., & Koedinger, K.R. (2006b). Predicting State Test Scores Better with Intelligent Tutoring Systems: Developing Metrics to Measure Assistance Required. In Ikeda, Ashley, & Chan (Eds.), Proceedings of the 8th International Conference on Intelligent Tutoring Systems, pp. 31-40. Berlin: Springer-Verlag.
Mixed-Effects Models
- Individuals in the population are assumed to have their own subject-specific mean response trajectories over time
- The mean response is modeled as a combination of population characteristics (fixed effects) and subject-specific effects unique to a particular individual (random effects)
- It is possible to predict how individual response trajectories change over time
- Flexible in accommodating imbalance in longitudinal data
- Methodological requirements: 1) three or more waves of data; 2) an outcome (dependent) variable whose values change systematically over time; 3) a sensible metric for time, the fundamental predictor in a longitudinal study
Sample longitudinal data
Comparison of Approaches
Ayers & Junker (2006):
- Estimate student proficiency using a 1-PL IRT model
- LLTM (linear logistic test model): main-question difficulty decomposed into K skills
- The 1-PL IRT model fits dramatically better
- Only main questions used; additive, non-temporal; fit with WinBUGS
Comparison of Approaches
Pardos et al. (2006):
- Conjunctive Bayes nets
- Non-temporal; scaffolding used
- Bayes Net Toolbox (Murphy, 2001)
DINA model (Anozie, 2006)
Comparison of Approaches
Feng, Heffernan, Mani, & Heffernan (2006):
- Logistic mixed-effects model (generalized linear mixed-effects model, GLMM); temporal
- X_ijkt is the 0/1 response of student i on question j tapping KC k in month t, where t is the elapsed month in the study; β_0k and β_1k are the fixed effects for the baseline and rate of change in the probability of correctly answering a question tapping KC k
- Fit with the R lme4 library
Comparison of Approaches
Comparing to the LLTM in Ayers & Junker (2006):
- Student proficiency depends on time; question difficulty depends on KC and time
- Assign only the most difficult skill instead of the full Q-matrix mapping of multiple skills as in LLTM
- Scaffolding used to gain identifiability
- Ayers & Junker (2006) use regression to predict MCAS after obtaining an estimate of student ability θ (MAD = 10.93%); there is no such regression step in my work: logit(p = 1) = θ − 0, and estimated score = full score × p
- Higher MAD, but provides diagnostic information
Comparison of Approaches
Comparing to Bayes nets and conjunctive models:
- Bayes nets: probabilistic reasoning, conjunctive; GLMM: linear learning, max-difficulty reduction
- GLMM is computationally much easier and faster, and the results are still comparable:
  - GLMM is better than Bayes nets when WPI-1 or WPI-5 is used
  - GLMM is comparable with Bayes nets when WPI-39 or WPI-78 is used (WPI-39: GLMM 12.41%, Bayes 12.05%; WPI-78: GLMM 12.09%, Bayes 13.75%)
Cognitive Diagnostic Assessment – BIC results

BIC                  WPI-1      WPI-5      WPI-39     WPI-78
04-05 data           173445.2   170359.9   170581.7   165711.4
(drop to next model)    3085       -222       4870
05-06 data           39210.57   39174.29   54696.4    54299.54
(drop to next model)      36     -15522        399

- The numbers of data points differ across models: items tagged with more than one skill are duplicated in the data, so finer-grained models have more multi-mappings and thus more data points (and higher BIC)
- WPI-5 is better than WPI-1; WPI-78 is better than WPI-39
- I therefore calculate MAD as the evaluation gauge instead
Analyzing Instructional Effectiveness
Detect the relative instructional effectiveness among items in the same GLOP using learning decomposition.
(Example data table: for each student and opportunity t1–t4, the item seen, the number of prior encounters, and the 0/1 correctness, e.g., student Tom across four attempts.)
Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.
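A sketch of the learning-decomposition idea: fit an exponential performance curve in which prior encounters of different item types get different weights, so the fitted weight beta tells how much one type of practice is worth relative to another. The data here is toy data, and the paper's exact model form may differ:

```python
import numpy as np
from scipy.optimize import curve_fit

def decomposed_error(X, A, b, beta):
    """Error rate decays exponentially with weighted prior practice;
    beta is the value of one encounter of the 'target' item type
    relative to an encounter of any other item in the GLOP."""
    n_other, n_target = X
    return A * np.exp(-b * (n_other + beta * n_target))

# Toy data: prior encounters of each type, and 1 = error, 0 = correct
n_other  = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
n_target = np.array([0, 0, 1, 1, 1, 2, 2, 3], dtype=float)
errors   = np.array([1, 1, 0, 1, 0, 0, 0, 0], dtype=float)

(A, b, beta), _ = curve_fit(decomposed_error, (n_other, n_target), errors,
                            p0=[1.0, 0.3, 1.0], maxfev=10000)
print(beta)  # beta > 1 would suggest the target items teach more per encounter
```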
Searching Results
Among 38 GLOPs, LFA found significantly better models for 12. Shall I be happy? As a "sanity" check, compare against randomly assigned factor tables:

#items in GLOP (#GLOPs)   Learning-suggested factors   Random factor table
2 (11)                    5                            5
3 (5)
4 (7)                     3                            1
5-11 (15)                 4 (5, 6, 8, 9)               1 (5)

Further work is needed:
- Quantitatively measure whether and how the data analysis results can help subject-matter experts
- Explore the automatic factor-assigning approach on more data from other systems
- Contrast with human experts as a controlled condition
Guess which item is the most difficult one?

                    1-skill model   2-skill model
Log likelihood      -532.6          -524
BIC                 1,079.2         1,065.99
Num of skills       1               2
Num of parameters   2               4
Coefficients        1.099, 0.137    1.841, 0.100; -0.927, 0.055

Item ID   Square-root   Factor-High
894       1             0
41        1             1
4673      1             1
117       1             1