1
Andy Nguyen, Christopher Piech, Jonathan Huang, Leonidas Guibas. Stanford University
2
Introduction to The Problem
3
MOOC: Massive Open Online Courses. Easy access to educational opportunities at zero cost. Rapid attendance growth leads to an enormous increase in data: tens of thousands of students each submitting multiple homework assignments produce hundreds of thousands of data points. This data needs to be efficiently organized.
4
Search Engines for Student-Submitted Content. Building a searchable database is inevitable and valuable: a search engine can efficiently retrieve submissions of a particular type. Benefits for instructors: query for similar submissions; query for “interesting” submissions. Benefits for students: see how students with different viewpoints think; browse other work for inspiration.
5
Delivering Scalable Human Feedback. Delivering feedback to students is critical for learning. Instructor time is a limited resource. Automated unit tests only return binary correct/incorrect feedback. Proposed solution: peer grading.
6
The Solution
7
The Codewebs Solution. By recognizing “shared parts” of student solutions, instructor feedback can be force-multiplied: detailed feedback for thousands of students with the same effort spent in an ordinary college course.
8
Structured Representations of Syntax. Submissions are parsed into abstract syntax trees (ASTs), much as a compiler does. Nodes in our ASTs are specified by a type and an optional name. Multiple submissions that are distinct as code strings can correspond to the same AST. Each AST is stored explicitly in our database.
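For illustration, here is a minimal Python sketch (not the paper's implementation) of the node representation described above: each node carries a type, an optional name, and an ordered list of children, so code strings that differ only in formatting map to the same tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ASTNode:
    node_type: str                    # e.g. "STATEMENT", "BINARY_OP", "IDENTIFIER"
    name: Optional[str] = None        # e.g. an operator symbol or identifier name
    children: List["ASTNode"] = field(default_factory=list)

# Two code strings that differ only in whitespace, such as "x=1+2" and
# "x = 1 + 2", parse to the same tree:
assignment = ASTNode("STATEMENT", children=[
    ASTNode("ASSIGN", children=[
        ASTNode("IDENTIFIER", "x"),
        ASTNode("BINARY_OP", "+", [ASTNode("CONST", "1"), ASTNode("CONST", "2")]),
    ]),
])
```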
9
AST Example
10
Indexing Code Submissions. The Codewebs engine accepts basic queries in the form of what we call code phrases. Code phrases are subgraphs of ASTs and come in three forms: subtrees, subforests, and contexts.
11
Subtrees
12
Contexts
13
Subforests. In addition to subtrees, we consider subforests, which capture everything in a contiguous sequence of statements. Specifically, we define a subforest to be a consecutive sequence of statement subtrees (i.e., subtrees rooted at STATEMENT nodes).
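The following sketch, building on the ASTNode class above, shows one plausible way to enumerate the three kinds of code phrases. The helper names and the placeholder "HOLE" node used to represent a context are assumptions for illustration, not the paper's exact representation.

```python
def subtrees(node):
    """Yield every subtree (every node together with its descendants)."""
    yield node
    for child in node.children:
        yield from subtrees(child)

def subforests(node):
    """Yield consecutive runs of statement subtrees (simplified: assumes a
    node's STATEMENT children are contiguous, as in a statement list)."""
    stmts = [c for c in node.children if c.node_type == "STATEMENT"]
    for i in range(len(stmts)):
        for j in range(i + 1, len(stmts) + 1):
            yield stmts[i:j]
    for child in node.children:
        yield from subforests(child)

def context_of(root, target):
    """Copy of the tree with `target` replaced by a placeholder hole."""
    def rebuild(node):
        if node is target:
            return ASTNode("HOLE")
        clone = ASTNode(node.node_type, node.name)
        clone.children = [rebuild(c) for c in node.children]
        return clone
    return rebuild(root)
```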
14
An Inverted Index. An inverted index is an index data structure that stores a mapping from content to its locations in a database. In our case the content is code phrases and the locations are the ASTs containing them. This is an efficient way to store and query our data.
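A minimal sketch of this idea: a dictionary mapping a code-phrase key (here, a hash of the phrase) to the list of AST ids that contain it. The function names are illustrative, not the paper's API.

```python
from collections import defaultdict

inverted_index = defaultdict(list)   # phrase_hash -> [ast_id, ...]

def index_phrase(phrase_hash, ast_id):
    """Record that the AST identified by ast_id contains this code phrase."""
    inverted_index[phrase_hash].append(ast_id)

def lookup(phrase_hash):
    """Return the ids of all ASTs that contain the queried code phrase."""
    return inverted_index.get(phrase_hash, [])
```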
15
Building an inverted index. The inverted index is constructed incrementally, adding one AST at a time: we first preprocess all ASTs by anonymizing identifiers that are not recognized as reserved language identifiers or as belonging to the provided starter code for the assignment. Then, for each AST A in the data, we extract all relevant code phrases by iterating over all subtrees, all consecutive sequences of statements, and their contexts. For each code phrase, we append A to its corresponding list in the inverted index.
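A sketch of that construction loop, reusing the helpers from the earlier sketches; anonymize() and phrase_hash() below are simplified stand-ins for the paper's identifier anonymization and code-phrase hashing.

```python
def anonymize(node):
    # Placeholder: the real system renames identifiers that are neither
    # reserved words nor part of the starter code; here we simply strip
    # names from IDENTIFIER nodes.
    clone = ASTNode(node.node_type,
                    None if node.node_type == "IDENTIFIER" else node.name)
    clone.children = [anonymize(c) for c in node.children]
    return clone

def phrase_hash(phrase):
    # Placeholder hash over the phrase's printed form; the paper
    # precomputes hashes far more efficiently (see the hashing slide below).
    return hash(repr(phrase))

def add_to_index(ast_id, ast_root):
    root = anonymize(ast_root)
    for t in subtrees(root):
        index_phrase(phrase_hash(t), ast_id)                   # subtree
        index_phrase(phrase_hash(context_of(root, t)), ast_id) # its context
    for f in subforests(root):
        index_phrase(phrase_hash(tuple(f)), ast_id)            # subforest
    # Contexts of subforests are omitted here for brevity.
```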
16
Streamlining Index Construction
17
Serialized Representation of an AST
18
Efficient Precomputation of Hashes. By exploiting the particular structure of the hash function H, we can efficiently precompute hash codes for all code phrases within an AST. All code phrase hashes can be precomputed in a single pass with a simple dynamic programming approach, after which the hash of any sublist can be computed in constant time.
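The slide's formula is not reproduced here; the sketch below only illustrates the general dynamic-programming idea with a standard polynomial prefix hash over a serialized token list (the paper's actual hash function H may differ). One left-to-right pass precomputes prefix hashes, after which any sublist hash is an O(1) combination of two of them.

```python
B = 1_000_003          # hash base (assumed)
M = (1 << 61) - 1      # large prime modulus (assumed)

def prefix_hashes(tokens):
    """prefix[i] = hash of tokens[:i]; computed in one left-to-right pass."""
    prefix, powers = [0], [1]
    for t in tokens:
        prefix.append((prefix[-1] * B + hash(t)) % M)
        powers.append((powers[-1] * B) % M)
    return prefix, powers

def sublist_hash(prefix, powers, i, j):
    """Hash of tokens[i:j] in O(1), using the precomputed prefix hashes."""
    return (prefix[j] - prefix[i] * powers[j - i]) % M
```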
19
Example
20
Example cont.
28
Solution Space Reduction. We want to reduce the number of ASTs by finding semantic equivalences. Matching source code directly is difficult because there are many syntactically distinct ways of implementing the same functionality (for example, for vs. while loops). We use unit test outcomes to determine whether two ASTs are semantically equivalent, and propose a simple semi-automated method that takes a human-specified code phrase and searches the database for all equivalent ways of writing that code phrase, henceforth an “equivalence class”.
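A sketch of the unit-test-based equivalence check, assuming hypothetical splice() and run_unit_tests() helpers: two fragments count as (probably) equivalent if swapping one for the other inside the same surrounding context leaves every unit test outcome unchanged.

```python
def probably_equivalent(context, fragment_a, fragment_b, run_unit_tests, splice):
    """Evidence of semantic equivalence: identical unit test outcomes when
    either fragment fills the same hole in the same surrounding program."""
    program_a = splice(context, fragment_a)   # fill the context's hole with A
    program_b = splice(context, fragment_b)   # ...and with B
    return run_unit_tests(program_a) == run_unit_tests(program_b)
```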
29
Testing for semantic equivalence
30
Estimating the semantic equivalence probability.
31
Semantic Equivalence Classes. To discover candidate equivalence classes, we use a two-stage process: an instructor annotates a small set of seed subtrees believed to be semantically meaningful, and we then algorithmically extract as many semantically equivalent subtrees as possible for each annotated seed. Advantage: this results in named equivalence classes, which can be used to provide non-cryptic feedback.
32
First Phase Example: Hypothesis subtree
33
Second Phase. We expand the seed subtrees one at a time, building up an equivalence class of subtrees by iterating the following steps: 1. Finding feet that fit the shoe: find candidate subtrees that are potentially equivalent to the seed subtree B by detaching B from its surrounding context. 2. Finding shoes that fit the feet: find other contexts that can plausibly be attached to a subtree that is functionally equivalent to the seed. 3. Repeat until convergence.
34
Second Phase Step 1
35
Step 1 example
36
Second Phase Step 2
37
Second Phase Step 3. We repeat steps (1) and (2), expanding both the set of equivalent subtrees and the set of contexts that we believe may be attached to one of them, until neither set grows.
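A sketch of this alternating expansion loop, assuming hypothetical database queries find_fragments_in() and find_contexts_of() and an equivalence test such as the unit-test check sketched earlier:

```python
def grow_equivalence_class(seed, seed_context,
                           find_fragments_in, find_contexts_of, equivalent):
    # Fragments ("feet") and contexts ("shoes") are assumed to be hashable
    # keys, e.g. code-phrase hashes from the index sketches above.
    fragments = {seed}
    contexts = {seed_context}
    while True:
        before = (len(fragments), len(contexts))
        # Step 1: find feet that fit the shoes.
        for ctx in list(contexts):
            for cand in find_fragments_in(ctx):
                if equivalent(ctx, seed, cand):
                    fragments.add(cand)
        # Step 2: find shoes that fit the feet.
        for frag in list(fragments):
            contexts.update(find_contexts_of(frag))
        # Step 3: stop once neither set grows.
        if (len(fragments), len(contexts)) == before:
            return fragments, contexts
```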
38
Reduction and Reindexing.
39
Reduction and Reindexing cont.
40
Now What?
41
Providing Feedback. Naïve (unit-test-based) feedback vs. Codewebs-based feedback: identifying idea-based errors; extracting equivalent ways of writing the erroneous expression; matching these expressions to student submissions; providing a human-generated message explaining the mistake.
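A sketch of how such feedback could be attached at scale, assuming the inverted-index lookup() sketched earlier and one human-written message per equivalence class of buggy code phrases:

```python
def attach_feedback(buggy_class_hashes, message, lookup):
    """Map every submission containing any phrase in the buggy equivalence
    class to the human-written explanation for that class."""
    feedback = {}                      # ast_id -> list of messages
    for phrase_hash in buggy_class_hashes:
        for ast_id in lookup(phrase_hash):
            feedback.setdefault(ast_id, []).append(message)
    return feedback
```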
42
Feedback Case Study. Many students misunderstood the mathematics involved and wrote the same erroneous expression. The Codewebs system was used to extract further equivalent forms of it. Codewebs found 1208 submissions; unit tests found 1091 submissions. Using both strategies led to a 47% increase over unit tests alone.
43
Empirical Findings
44
Data Collection. The data used was collected from Stanford's online Machine Learning (ML) class. Code for the ML class was predominantly written in Octave, a high-level interpreted language similar to MATLAB. The Codewebs indexer can be run on a personal computer, with the full index fitting in under 6 GB of main memory.
45
Reduction
46
Submission Coverage
47
Running Time
48
Questions?