Automated Essay Grading
Resources: Introduction to Information Retrieval, Manning, Raghavan, Schütze (Chapters 6 and 18); Automated Essay Scoring with e-rater V.2, Attali
Content Vector Analysis (CVA) Essay to be graded Higher quality essay Lower quality essay Higher grade Lower grade
Vector space model: an essay is a vector of weighted terms. Similarity in vector space. Latent Semantic Analysis: dimensionality reduction.
Human scored essays Input essay Most similar
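The idea on this slide, giving the input essay the score of the human-scored essay it most resembles, can be sketched as follows. This is a minimal illustration and not e-rater's actual implementation: the toy essays, the raw-count vectors, and the helper names `vectorize`, `cosine`, and `predict_score` are all assumptions made for the example.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (lowercased, whitespace tokenized)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict_score(input_essay, scored_essays):
    """Return the human score of the most similar scored essay."""
    query = vectorize(input_essay)
    _, best_score = max(
        scored_essays, key=lambda pair: cosine(query, vectorize(pair[0]))
    )
    return best_score

# Hypothetical training data: (essay text, human score).
scored_essays = [
    ("virtualization improves server utilization and lowers cost", 5),
    ("computers are good and fast", 2),
]
predicted = predict_score("virtualization lowers hardware cost", scored_essays)
```

Here the input essay shares three terms with the first scored essay and none with the second, so it inherits the first essay's score.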
Vector representation doesn’t consider the ordering of words in an essay. "John is quicker than Mary" and "Mary is quicker than John" have the same vectors. This is called the bag of words model. In a sense, this is a step back: the positional index was able to distinguish these two documents.
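A short check of the bag of words claim, assuming simple lowercasing and whitespace tokenization (the helper name `bag_of_words` is made up for this sketch):

```python
from collections import Counter

def bag_of_words(text):
    """Map each term to its count, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("John is quicker than Mary")
b = bag_of_words("Mary is quicker than John")

# Both sentences contain exactly the same terms with the same counts.
same_vector = (a == b)
```

The two sentences mean opposite things, yet `same_vector` is true because only term counts are kept.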
The term frequency tf_{t,e} of term t in essay e is defined as the number of times that t occurs in e. We want to use tf when computing match scores between the input essay and the score-specific vocabulary model. But how? Raw term frequency is not what we want: ▪ An essay with 10 occurrences of the term may be more relevant than an essay with one occurrence of the term. ▪ But not 10 times more relevant. Similarity does not increase proportionally with term frequency.
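One common way to dampen raw counts is the log-frequency weighting described in Chapter 6 of Introduction to Information Retrieval: w = 1 + log10(tf) for tf > 0, and 0 otherwise. The slide does not name a specific weighting, so treat this choice as an assumption.

```python
import math

def log_tf_weight(tf):
    """Sublinear term-frequency weight: 1 + log10(tf), or 0 if the term is absent."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

# Weights for terms occurring 0, 1, 10, and 1000 times in an essay.
weights = {tf: log_tf_weight(tf) for tf in (0, 1, 10, 1000)}
```

With this weighting, ten occurrences count roughly twice as much as one occurrence rather than ten times as much.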
Rare terms are more informative than frequent terms (recall stop words). Consider a term in the essay that is rare in the collection of the ith score point (e.g., virtualization). An essay containing this term is very likely to be assigned the score point whose human-scored essay collection contains virtualization. ▪ We want a higher weight for rare terms like virtualization.
Consider a term that is frequent in a collection (e.g., high, increase, line). An essay containing such a term is more likely to be assigned a given score point than one that doesn’t, but it’s not a sure indicator of a match. For frequent terms like high, increase, and line we want positive weights, but lower weights than for rare terms. We will use document frequency (df) to capture this in the score. df_t (≤ N) is the number of documents (essays) that contain the term t, where N is the number of essays in the collection.
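Document frequency is usually turned into a weight through the inverse document frequency, idf_t = log10(N / df_t), as defined in Chapter 6 of Introduction to Information Retrieval. The slide only says that df will be used, so the idf formula, the toy essays, and the helper names below are assumptions for illustration.

```python
import math

def document_frequency(term, essays):
    """Number of essays that contain the term at least once."""
    return sum(1 for essay in essays if term in essay.lower().split())

def idf(term, essays):
    """Inverse document frequency: log10(N / df)."""
    df = document_frequency(term, essays)
    if df == 0:
        return 0.0  # assumption: terms unseen in the collection get zero weight
    return math.log10(len(essays) / df)

# Hypothetical collection of four scored essays.
essays = [
    "virtualization can increase server utilization",
    "the line shows a high increase",
    "prices increase when demand is high",
    "the high line on the chart",
]
idf_rare = idf("virtualization", essays)  # df = 1 of 4 essays
idf_common = idf("high", essays)          # df = 3 of 4 essays
```

The rare term gets the larger weight, while the frequent term still gets a small positive weight.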
So we have a |V|-dimensional vector space. Terms are axes of the space. Essays are points or vectors in this space. Very high-dimensional: hundreds of millions of dimensions. These are very sparse vectors: most entries are zero.
First cut: distance between two points (= distance between the end points of the two vectors). Euclidean distance? Euclidean distance is a bad idea, because it is large for vectors of different lengths.
Why distance is a bad idea Measure angle between two vectors
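A small experiment makes the point concrete: take an essay and a copy of it appended to itself. The two have the same term distribution, yet their Euclidean distance is large, while the angle between them is zero. The sample sentence and helper names are assumptions for this sketch.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two sparse count vectors."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u[t] - v[t]) ** 2 for t in terms))

def cosine(u, v):
    """Cosine of the angle between two non-empty sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

text = "rates increase when demand is high"
essay = Counter(text.split())
doubled = Counter((text + " " + text).split())  # same essay written twice

distance = euclidean(essay, doubled)   # grows with essay length
similarity = cosine(essay, doubled)    # stays at 1 (up to rounding)
```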
A vector can be (length-)normalized by dividing each of its components by its length. For this we use the L2 norm: $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$. Dividing a vector by its L2 norm makes it a unit (length) vector.
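Length normalization can be sketched in a few lines. The example vectors and the helper names `l2_norm` and `normalize` are assumptions; the point is that a vector and a scaled copy of it map to the same unit vector.

```python
import math

def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))

def normalize(vector):
    """Divide each component by the L2 norm; zero vectors are returned unchanged."""
    norm = l2_norm(vector)
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]

short_essay = [3.0, 4.0]
long_essay = [6.0, 8.0]   # same direction, twice the length

unit_short = normalize(short_essay)
unit_long = normalize(long_essay)
```

After normalization both essays have length 1 and identical components, so essay length no longer affects the comparison.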
Two problems that arise when using the vector space model: ▪ Synonymy: many ways to refer to the same object, e.g. car and automobile. This can unfairly penalize an essay. ▪ Polysemy: most words have more than one distinct meaning, e.g. model, python, chip. This can falsely inflate the score.
Example: Vector Space Model (from Lillian Lee)
▪ Document 1: auto engine bonnet tyres lorry boot
▪ Document 2: car emissions hood make model trunk
▪ Document 3: make hidden Markov model emissions normalize
Synonymy: documents 1 and 2 will have a small cosine but are related.
Polysemy: documents 2 and 3 will have a large cosine but are not truly related.
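The cosines for this example can be checked directly. The sketch below assumes the eighteen words on the slide form three documents of six terms each and uses raw term counts; the helper `cosine` is written for the example.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two non-empty sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

doc1 = Counter("auto engine bonnet tyres lorry boot".split())
doc2 = Counter("car emissions hood make model trunk".split())
doc3 = Counter("make hidden Markov model emissions normalize".split())

# Synonymy: both documents are about cars but share no terms.
synonymy_cosine = cosine(doc1, doc2)
# Polysemy: "make", "model", and "emissions" match despite unrelated topics.
polysemy_cosine = cosine(doc2, doc3)
```

The related pair scores lower than the unrelated pair, which is the failure that motivates latent semantic analysis.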
Similar? Prompt
Semantics: relating words with other words. Explicit semantic mapping: using an external knowledge base ▪ WordNet ▪ Ontology. Implicit semantic mapping: extract hidden (latent) semantics ▪ Use implicit co-occurrence ▪ Projection of the essay into an abstract space
Dimensionality reduction through lower order approximation Extracting hidden semantics of a document Semantic space dimension is lower than term space ▪ Remove redundant term dimensions
Prompt-specific training: matrix of terms (Term 1 ... Term t) against prompts (Prompt 1 ... Prompt N). Score-point-specific training: matrix of terms (Term 1 ... Term t) against score points (Score 1 ... Score S).
Smaller eigenvalues have less effect on the product. A lower-rank approximation can be obtained by ignoring the small eigenvalues.
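A tiny worked case shows the effect. The symmetric matrix S = [[2, 1.9], [1.9, 2]] is chosen for this sketch because its eigenvalues (3.9 and 0.1) and orthonormal eigenvectors are known in closed form; the matrix and all helper names are assumptions. Rebuilding S from only the large eigenvalue gives a rank-1 matrix whose error equals the dropped eigenvalue.

```python
import math

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(c, a):
    return [[c * x for x in row] for row in a]

def frobenius(a):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in a for x in row))

s = 1.0 / math.sqrt(2.0)
# (eigenvalue, unit eigenvector) pairs of S, largest eigenvalue first.
eigenpairs = [(3.9, [s, s]), (0.1, [s, -s])]

def reconstruct(pairs):
    """Sum of eigenvalue * (eigenvector outer eigenvector) over the given pairs."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    for value, vec in pairs:
        total = add(total, scale(value, outer(vec, vec)))
    return total

full = reconstruct(eigenpairs)        # recovers S
rank1 = reconstruct(eigenpairs[:1])   # keeps only the large eigenvalue
error = frobenius(add(full, scale(-1.0, rank1)))
```

The rank-1 matrix has 1.95 in every entry, close to S, and the Frobenius error is 0.1, the eigenvalue that was ignored.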
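Matrix diagonalization theorem: let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition $S = U \Lambda U^{-1}$, where the columns of U are the eigenvectors of S and $\Lambda$ is a diagonal matrix whose diagonal entries are the eigenvalues of S.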
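Symmetric diagonalization theorem: let S be a square, symmetric real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition $S = Q \Lambda Q^{T}$, where the columns of Q are the orthogonal, unit-length eigenvectors of S and $\Lambda$ is the diagonal matrix of the eigenvalues of S. All entries of Q are real and $Q^{-1} = Q^{T}$.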
Symmetric diagonal decomposition of the term-term matrix $CC^{T}$, where C is the term-document incidence matrix: entry (i, j) is the number of documents in which both term i and term j co-occur.
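The co-occurrence count mentioned on this slide can be obtained by multiplying a binary term-document matrix by its transpose. The sketch below assumes a 0/1 incidence matrix with terms as rows and documents as columns; the three terms and four documents are made up for illustration.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

terms = ["car", "engine", "model"]
# Rows are terms, columns are documents; 1 means the term occurs in the document.
C = [
    [1, 1, 0, 1],  # car
    [1, 0, 0, 1],  # engine
    [0, 1, 1, 0],  # model
]

# Entry (i, j) counts the documents containing both term i and term j.
cooccurrence = matmul(C, transpose(C))
```

The result is symmetric, its diagonal holds each term's document frequency, and, for instance, "car" and "engine" co-occur in two documents.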