Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012.

Slides:



Advertisements
Similar presentations
Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 March 26, 2012.
Advertisements

Relationship Mining Sequential Pattern Mining Week 5 Video 4.
Relationship Mining Association Rule Mining Week 5 Video 3.
Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 March 16, 2012.
Instructor: Mircea Nicolescu Lecture 13 CS 485 / 685 Computer Vision.
Multiplication Rule. A tree structure is a useful tool for keeping systematic track of all possibilities in situations in which events happen in order.
Planning under Uncertainty
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Copyright 2004 David J. Lilja1 What Do All of These Means Mean? Indices of central tendency Sample mean Median Mode Other means Arithmetic Harmonic Geometric.
Introduction to Analysis of Algorithms
Optimization via Search CPSC 315 – Programming Studio Spring 2009 Project 2, Lecture 4 Adapted from slides of Yoonsuck Choe.
Chapter 2: Algorithm Discovery and Design
LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002.
This material in not in your text (except as exercises) Sequence Comparisons –Problems in molecular biology involve finding the minimum number of edit.
Sequence similarity.
Complexity (Running Time)
Copyright © Cengage Learning. All rights reserved. CHAPTER 11 ANALYSIS OF ALGORITHM EFFICIENCY ANALYSIS OF ALGORITHM EFFICIENCY.
Chapter 2: Algorithm Discovery and Design
Chapter 2: Algorithm Discovery and Design
© 2006 Pearson Addison-Wesley. All rights reserved10 A-1 Chapter 10 Algorithm Efficiency and Sorting.
Core Methods in Educational Data Mining HUDK4050 Fall 2014.
Fundamentals of Python: From First Programs Through Data Structures
College Algebra Prerequisite Topics Review
Factorial Experiments
Data Structures and Algorithms Graphs Minimum Spanning Tree PLSD210.
Genetic Algorithm.
Chapter Whole Numbers and Their Operations 3 3 Copyright © 2013, 2010, and 2007, Pearson Education, Inc.
Slides are based on Negnevitsky, Pearson Education, Lecture 12 Hybrid intelligent systems: Evolutionary neural networks and fuzzy evolutionary systems.
Chapter 2: Algorithm Discovery and Design Invitation to Computer Science, C++ Version, Third Edition.
Invitation to Computer Science, Java Version, Second Edition.
Core Methods in Educational Data Mining HUDK4050 Fall 2014.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Solving Using Matrices By Helen Chin Kitty Luo Anita La.
Multiplying Whole Numbers © Math As A Second Language All Rights Reserved next #5 Taking the Fear out of Math 9 × 9 81 Single Digit Multiplication.
1 Lesson 8: Basic Monte Carlo integration We begin the 2 nd phase of our course: Study of general mathematics of MC We begin the 2 nd phase of our course:
Major objective of this course is: Design and analysis of modern algorithms Different variants Accuracy Efficiency Comparing efficiencies Motivation thinking.
HAWKES LEARNING SYSTEMS Students Matter. Success Counts. Copyright © 2013 by Hawkes Learning Systems/Quant Systems, Inc. All rights reserved. Section 2.1.
CS 8751 ML & KDDSupport Vector Machines1 Mining Association Rules KDD from a DBMS point of view –The importance of efficiency Market basket analysis Association.
Regular Expressions Chapter 6 1. Regular Languages Regular Language Regular Expression Finite State Machine L Accepts 2.
Chapter 10 Hashing. The search time of each algorithm depend on the number n of elements of the collection S of the data. A searching technique called.
1 Counting by Mapping Lecture 15: Nov 4 A B …… f.
Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 March 16, 2012.
Verification & Validation. Batch processing In a batch processing system, documents such as sales orders are collected into batches of typically 50 documents.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 17, 2012.
CS1Q Computer Systems Lecture 2 Simon Gay. Lecture 2CS1Q Computer Systems - Simon Gay2 Binary Numbers We’ll look at some details of the representation.
Core Methods in Educational Data Mining HUDK4050 Fall 2015.
Special Topics in Educational Data Mining HUDK5199 Spring term, 2013 March 6, 2013.
Core Methods in Educational Data Mining HUDK4050 Fall 2015.
A Brief Maximum Entropy Tutorial Presenter: Davidson Date: 2009/02/04 Original Author: Adam Berger, 1996/07/05
HANGMAN OPTIMIZATION Kyle Anderson, Sean Barton and Brandyn Deffinbaugh.
Fortran: Control Structures Session Three ICoCSIS.
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Association Rule Mining CS 685: Special Topics in Data Mining Jinze Liu.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP Seminar BCB 713 Module Spring 2011.
CS Class 04 Topics  Selection statement – IF  Expressions  More practice writing simple C++ programs Announcements  Read pages for next.
Banaras Hindu University. A Course on Software Reuse by Design Patterns and Frameworks.
Slide 7.1 Saunders, Lewis and Thornhill, Research Methods for Business Students, 5 th Edition, © Mark Saunders, Philip Lewis and Adrian Thornhill 2009.
OPERATORS IN C CHAPTER 3. Expressions can be built up from literals, variables and operators. The operators define how the variables and literals in the.
 Negnevitsky, Pearson Education, Lecture 12 Hybrid intelligent systems: Evolutionary neural networks and fuzzy evolutionary systems n Introduction.
Lesson 8: Basic Monte Carlo integration
Core Methods in Educational Data Mining
COUNTING AND PROBABILITY
Discrete Structures for Computer Science
Objective of This Course
CSc4730/6730 Scientific Visualization
Association Rule Mining
COUNTING AND PROBABILITY
Introduction to TouchDevelop
3 Chapter Whole Numbers and Their Operations
Biologically Inspired Computing: Operators for Evolutionary Algorithms
Presentation transcript:

Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Today’s Class Sequential Pattern Mining

Two categories of SPM (That we’ll discuss today) Approaches Coming from Association Rule Mining MOTIF Extraction

Association Rule Mining Try to automatically find if-then rules within the data set

Sequential Pattern Mining Try to automatically find temporal patterns within the data set

ARM Example If person X buys diapers, Person X buys beer Purchases occur at the same time

SPM Example If person X buys novel Foundation now, Person X buys novel Second Foundation in a later transaction Conclusion: recommend Second Foundation to people who have previously purchased Foundation

SPM Example Many customers rent Star Wars, then the Empire Strikes Back, then Return of the Jedi Doesn’t matter if they rent other stuff in- between

SPM Example Many customers buy flowers, and then buy diapers AND diaper cream several months later

SPM Example Many learners become confused, then game the system, then become frustrated, then complete gaming the system, then become re- engaged

Different Constraints than ARM If-then elements do not need to occur in the same data point Instead – If-then elements should have same user (or other organizing variable) – If elements can be within a certain time window of each other – Then element time should be within a certain window after if times

Sequential Pattern Mining Find all subsequences in data with high support Support calculated as number of sequences that contain subsequence, divided by total number of sequences

Sequential Pattern Mining What are some subsequences with high support? (What is their support?) Chuck: a, abc, ac, de, cef Darlene: af, ab, acd, dabc, ef Egoberto: aef, ab, aceh, d, ae Francine: a, bc, acf, d, abeg

Questions? Comments?

Algorithms for SPM

GSP (Generalized Sequential Pattern) Classic Algorithm (Srikant & Agrawal, 1996)

Data pre-processing Data transformed from individual actions to sequences by user E.g. Bob: {GAMING and BORED, OFF-TASK and BORED, ON-TASK and BORED, GAMING and BORED, GAMING and FRUSTRATED, ON-TASK and BORED}

Data pre-processing In some cases, time also included E.g. Bob: {GAMING and BORED 5:05:20, OFF-TASK and BORED 5:05:40, ON-TASK and BORED 5:06:00, GAMING and BORED 5:06:20, GAMING and FRUSTRATED 5:06:40, ON-TASK and BORED 5:07:00}

Algorithm Take the whole set of sequences of length 1 – May include “ANDed” combinations at same time Find which sequences of length 1 have support over pre-chosen threshold Compose potential sequences out of pairs of sequences of length 1 with acceptable support Find which sequences of length 2 have support over pre-chosen threshold Compose potential sequences out of triplets of sequences of length 1 and 2 with acceptable support Continue until no new sequences found

Let’s execute GPS algorithm With min support = 50%

Let’s execute GPS algorithm With min support = 50% Chuck: a, abc, ac, de, cef Darlene: af, ab, acd, dabc, ef Egoberto: aef, ab, aceh, d, ae Francine: a, bc, acf, d, abeg

Other algorithms Free-Span Prefix-Span Select sub-sets of data to search within Faster, but same basic idea as in GPS

Uses in educational domains

Perera et al. (2009) What were the three ways that Perera et al. (2009) used sequential pattern mining? What did they learn, and how did they use the information?

Perera et al. (2009) 1.Overall uses of collaborative tools by groups 2.Sequences of collaborative tool use by different group members 3.Sequences of access of specific resources by different group members In all cases, they found common patterns and then looked at how support differed for successful and unsuccessful groups

Perera et al. (2009): Important Findings 1.Overall uses of collaborative tools by groups – Successful groups used ticketing system more than the wiki; weaker groups used wiki more – Patterns were particularly strong for group leaders

Perera et al. (2009): Important Findings 2.Sequences of collaborative tool use by different group members – Successful groups characterized by leader opening ticket and other student working on ticket – Successful groups characterized by students other than leader opening ticket, and other students working on ticket

Perera et al. (2009): Important Findings 3.Sequences of access of specific resources by different group members – The best groups had interactions around the same resource by multiple students – The poor groups did no work on tickets before closing them

Questions? Comments?

MOTIF Extraction

Motif Short, recurring pattern in a sequence of categories occurring over time

Motif in Music Short, recurring pattern of notes in a musical composition

Motif in Music What’s the motif? KIY KIY How many times does the motif occur?

Motif in Music What’s the motif? KIY KIY How many times does the motif occur? – Depends on how you define it, right? – And that’s part of the challenge…

Motif in Language Short, recurring pattern of characters in a sequence of characters occurring over time

Motif in Genetics Short, recurring pattern of genes in a sequence of genes occurring over time Typically written as letters

Goal of Motif Extraction Discern a common pattern of characters in a large corpus of characters The characters may vary slightly from case to case

Can you find the motif?

UBSWWDFKLWPRHUC JBDPXBDVEJVMBKK VBDWNLROFVUBFFW OWIFTIENDOXJXIOB AUAAOOXZAABZSBT MUAWSNTVZXSFHMI LFQRKUTFRIENDOV LOMTPOQHJVYYMFJ LWGJMVPKYOZNMSA RUPMFOHPVSPPVPT BAZXVFTPQFQJVBM HLPMOKUOXGRIENDO INUSUNSGDAAICAV XRZZWCDXOVZZJKQ VOVCROMCJTOLXYU HUVRYFREENDOBBGC AQJBVXJCAJLEMAU ONJORIFCGAUGIRN PJGCHBDQIWJJTMQ IQYQHKKBNBVDFPV JJLHWPZAYZIGGEH IGJZRMAAWJBESSS JXZFRIEMDOVZRBJY IRPWYIRJISLFVFF

How would you describe the motif? UBSWWDFKLWPRHUC JBDPXBDVEJVMBKK VBDWNLROFVUBFFW OWIFTIENDOXJXIOB AUAAOOXZAABZSBT MUAWSNTVZXSFHMI LFQRKUTFRIENDOV LOMTPOQHJVYYMFJ LWGJMVPKYOZNMSA RUPMFOHPVSPPVPT BAZXVFTPQFQJVBM HLPMOKUOXGRIENDO INUSUNSGDAAICAV XRZZWCDXOVZZJKQ VOVCROMCJTOLXYU HUVRYFREENDOBBGC AQJBVXJCAJLEMAU ONJORIFCGAUGIRN PJGCHBDQIWJJTMQ IQYQHKKBNBVDFPV JJLHWPZAYZIGGEH IGJZRMAAWJBESSS JXZFRIEMDOVZRBJY IRPWYIRJISLFVFF

Finding motifs Several algorithms

Finding motifs Variant on PROJECTION algorithm (Tompa & Buhler, 2001) used in (Shanabrook et al., 2010) – Only example of motif extraction in educational data mining so far

Big idea For each character string C that could be a motif example (e.g. all character strings of desired length) – Create a set of projections, random variations of C that vary in one or more ways

Big idea For each pair of strings C1 and C2, see how many overlaps there are between their projection matrices Take the pair with the most matches and combine into a motif – Creating multi-example motif if 3+ get added together Repeat until goal number of motifs is found, or until new motif is below criterion goodness

Motif in Education Short, recurring pattern of behaviors in a sequence of behaviors occurring over time Written as letters in Shanabrook et al. (2010)

Detail for education How do you segment student behavior? Could use student’s interaction on an entire problem, and compute letters across whole problem – Might make more sense in tutors with shorter problems Could use student’s interaction on an entire problem, and define letters differently for context within whole problem – Approach used by Shanabrook et al. (2010) Could use “sliding window” of N actions

Behaviors in Shanabrook et al. “hints (a, b, c) – Hints is a measure of the number of hints viewed for this problem. Although each problem has a maximum number of hints, the hint count does not have an upper bound because students can repeat hints and the count will increase at each repeated view. The three categories for hints are: (a) no hints, meaning that thestudent did not use the hint facility for that problem, (b) meaning the student used the hint facility, but was not given the solution, and (c) last hint solved, meaning that the student was given the solution to the problem by the last hint. As described above, this metric combines two values logged by the tutor: the count of hints seen, and an indicator that the final hint giving the answer was seen. The data could have been simply binned low, medium, high hints; however, this would have missed the significance of zero hints and using hints to reveal the problem solution.”

Behaviors in Shanabrook et al. “secFirst (d, e, f) – The seconds to first attempt is an important measure as it is during this time that the student is reading the problem and formulating their response. In previous research [6], five seconds was determined to be a threshold for this metric representing gaming: students who make a first attempt in less than five seconds are considered not working on-task. We divide secFirst into three bins: (d) less than 5 sec, (e) 5 to 30 sec, (f) greater than 30 sec. (d) represents students who are gaming the system, (e) represents a moderate time to the first attempt, (f) represents a long time to the first attempt. The cut at 30 seconds was chosen because it equalizes the distribution of bins (e and f), representing a division between a moderate and a long time to the first attempt.”

Behaviors in Shanabrook et al. “secOther (g, h, i, j, k) – This variable represents actions related to answering the problem after the first attempt was made. While the first attempt includes the problem reading and solution time, subsequent solution attempts could be much quicker and the student could still be making good effort. secOther is categorized in five bins: (g) skip, (h) solved on first, (i) 0 to 1.2 sec, (j) 1.2 to 2.9 sec, (k) greater than 2.9 sec. First, there are two categorical bins, skip and solve on first attempt. These are each determined from an indicator in the log data for that problem. Skipping a problem implies only that students never clicked on a correct answer; they could have worked on the problem and then given up, or immediately skipped to the next problem with only a quick look. Solved on first attempt indicates correctly solving the problem. If neither of the first two bins are indicated in the logs, then the secOther metric measures the mean time for all attempts after the first. The divisions of 1.2 sec and 2.9 sec for the latter three bins were obtained using the mean and one standard deviation above the mean for all tutor usage; (i) less than 1.2 seconds would indicate guessing, (j) would indicate normal attempts, and (k) would indicate a long time between attempts.”

Behaviors in Shanabrook et al. “numIncorrect – (o, p, q) - Each problem has four or five possible answer choices, that we divide into three groups: (o) zero incorrect attempts, indicates either solved on first attempt, skipped problem, or last hint solves problem (defined by the other metrics); (p) indicates choosing the correct answer in the second or third attempt, and (q) obtaining the answer by default in a four answer problem or possibly guessing when there is five answer problem.”

What other constructs could be used? What other kinds of constructs could be used for the atoms of motif analyses in educational analyses? – At this grain-size (e.g. specific actions)

What other constructs could be used? What other kinds of constructs could be used for the atoms of motif analyses in educational analyses? – At other grain-sizes?

Common Motifs {adgo, adip, adiq} {aeho, afho} {ceho} {adgo, aeho} {aeiq aeho aeho aekp aeho aeiq aeho aeip aeho aeip}

Interpretations (Shanabrook et al., 2010) {adgo, adip, adiq} – gaming the system {aeho, afho} – “This student is using the tutor appropriately, but not being challenged.” {ceho} – problem is too difficult {adgo, aeho} – student is skipping problems {aeiq aeho aeho aekp aeho aeiq aeho aeip aeho aeip} – working on-task

Do you agree with interpretations? {adgo, adip, adiq} – gaming the system {aeho, afho} – “This student is using the tutor appropriately, but not being challenged.” {ceho} – problem is too difficult {adgo, aeho} – student is skipping problems {aeiq aeho aeho aekp aeho aeiq aeho aeip aeho aeip} – working on-task

How can researchers form good interpretations?

Questions? Comments?

What else? What else could sequential pattern mining and motif extraction be used for in education? – Beyond Perera et al. and Shanabrook et al.

Asgn. 8 Solutions

Asgn. 9 Questions? Comments?

Next Class Monday, April 29 Learnograms Guest lecturer: Arnon Hershkovitz – World-famous inventor of learnograms Readings: None Assignments Due: None

The End