NICK PENDAR AND ELENA COTOS
IOWA STATE UNIVERSITY
THE 3RD WORKSHOP ON INNOVATIVE USE OF NLP FOR BUILDING EDUCATIONAL APPLICATIONS
JUNE 19, 2008
Automatic Identification of Discourse Moves in Scientific Article Introductions

Outline
- Background and motivation
- Discourse move identification
  - Data and annotation scheme
  - Feature selection
  - Sentence representation
  - Classifier
  - Evaluation
  - Inter-annotator agreement
- Further work

Automated evaluation: Background
- Automated essay scoring (AES) in performance-based and high-stakes standardized tests (e.g., ACT, GMAT, TOEFL)
- Automated error detection in L2 output (Burstein and Chodorow, 1999; Chodorow et al., 2007; Han et al., 2006; Leacock and Chodorow, 2003)
- Assessment of various constructs, e.g., topical content, grammar, style, mechanics, syntactic complexity, and deviance or plagiarism (Burstein, 2003; Elliott, 2003; Landauer et al., 2003; Mitchell et al., 2002; Page, 2003; Rudner and Liang, 2002)
- Text organization limited to recognizing the five-paragraph essay format, thesis, and topic sentences
- AntMover (Anthony and Lashkia, 2003)

Automated evaluation: CALI Motivation
- Wide range of possibilities for high-quality evaluation and feedback (Criterion; Burstein, Chodorow, & Leacock, 2004)
- Potential in formative assessment, but the effects of intelligent formative feedback have not been fully investigated
- Warschauer and Ware (2006) call for the development of a classroom research agenda that would help evaluate and guide the application of AES in writing pedagogy
- “the potential of automated essay evaluation for improving student writing is an empirical question, and virtually no peer-reviewed research has yet been published” (Hyland and Hyland, 2006, p. 109)

Automated evaluation: EAP Motivation
- EAP pedagogical approaches (Cortes, 2006; Levis & Levis-Muller, 2003; Vann & Myers, 2001) fail to provide NNSs with sufficient academic writing practice and remediational guidance
- Problem of disciplinarity
- An NLP-based academic discourse evaluation software application could address this drawback
- Such an application has not yet been developed

Automated evaluation: Research Motivation
Long-term research goals:
- design and implementation of IADE (Intelligent Academic Discourse Evaluator)
- analysis of IADE effectiveness for formative assessment purposes

IADE
- Evaluates students’ research article introductions in terms of moves/steps (Swales 1990, 2004)
- Draws from
  - SLA models: interactionist views (Caroll, 1999; Gass, 1997; Long, 1996; Long & Robinson, 1998; Mackey, Gass, & McDonough, 2000; Swain, 1993) and Systemic Functional Linguistics (Martin, 1992; Halliday, 1985)
  - Skill Acquisition Theory of learning (DeKeyser, 2007)
- Is informed by empirical research on the provision of feedback
- Is informed by Evidence Centered Design principles (Mislevy et al., 2006)

Discourse Move Identification
- Approached as a classification problem (similar to Burstein et al., 2003)
  - given a sentence and a finite set of moves and steps, what move/step does the sentence signify?
- ISUAW corpus: 1,623 articles; 1,322,089 words; average length of articles 814.09 words
- Stratified sampling of 401 introduction sections representative of 20 academic disciplines
- Sub-corpus: 267,029 words; average length 665.91 words; 11,149 sentences
- Manual annotation

Discourse Move Identification
- Annotation scheme (Swales, 1990; Swales, 2004)

Discourse Move Identification
- Multiple layers of annotation for cases when the same sentence signified more than one move or more than one step

Feature Selection
- Features that reliably indicate a move/step
- Text-categorization approach (see Sebastiani, 2002)
- Each sentence treated as a data item to be classified and represented as an n-dimensional vector in Euclidean space
- The task of the learning algorithm is to find a function F : S → M that maps the sentences in the corpus S to classes in M = {m1, m2, m3}
- Identification of moves, not yet steps

Feature Selection
- Extraction of word unigrams, bigrams, and trigrams from the annotated corpus
- Preprocessing:
  - All tokens stemmed using the NLTK port of the Porter Stemmer algorithm (Porter, 1980)
  - All numbers in the texts replaced by the string _number_
  - The tokens inside each bigram and trigram alphabetized
  - All n-grams with a frequency of less than five excluded
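The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the number pattern, and the reading of "frequency" as total occurrences in the corpus are assumptions. Stemming is left as a pluggable `stem` argument (identity by default) so the sketch runs without NLTK; `nltk.stem.PorterStemmer().stem` could be passed in instead.

```python
import re
from collections import Counter

# Assumed pattern for "numbers": digits with optional , or . separators.
NUMBER_RE = re.compile(r"^\d+(?:[.,]\d+)*$")

def preprocess(tokens, stem=lambda t: t):
    """Replace numeric tokens with _number_ and stem everything else."""
    return ["_number_" if NUMBER_RE.match(t) else stem(t) for t in tokens]

def ngrams(tokens, n):
    """Yield n-grams with their tokens alphabetized, so word order is ignored."""
    for i in range(len(tokens) - n + 1):
        yield tuple(sorted(tokens[i:i + n]))

def count_ngrams(sentences, n, min_freq=5, stem=lambda t: t):
    """Count n-grams over tokenized sentences and drop those seen < min_freq times."""
    counts = Counter()
    for sent in sentences:
        counts.update(ngrams(preprocess(sent, stem), n))
    return {gram: c for gram, c in counts.items() if c >= min_freq}
```

Because the tokens in each n-gram are sorted, "we show" and "show we" are counted as the same bigram.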

Feature Selection
- Odds ratio
- Conditional probabilities calculated as maximum likelihood estimates
- N-grams with maximum odds ratios selected as features
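The formula from this slide is not preserved in the transcript. As a hedged sketch, the code below uses the odds ratio as it is commonly defined in the text-categorization literature: the odds of a term occurring in a move divided by the odds of it occurring outside that move, with both probabilities estimated by maximum likelihood from sentence counts. The clamping constant `eps` is an assumption added here to avoid division by zero; the slides do not say how zero counts were handled.

```python
def odds_ratio(n_term_in, n_in, n_term_out, n_out, eps=1e-6):
    """Odds ratio of a term for one move.

    n_term_in / n_in   : sentences of the move containing the term / all sentences of the move
    n_term_out / n_out : the same counts for sentences of the other moves
    """
    # Maximum likelihood estimates, clamped away from 0 and 1 (assumption).
    p = min(max(n_term_in / n_in, eps), 1 - eps)
    q = min(max(n_term_out / n_out, eps), 1 - eps)
    return (p * (1 - q)) / ((1 - p) * q)

def top_features(scores, k):
    """Return the k terms with the highest odds ratio."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]
```

A term that is equally likely inside and outside a move scores 1; larger values mean the term is more indicative of that move.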

Sentence Representation
- Each sentence represented as a vector
- Presence or absence of terms in sentences recorded as Boolean values (0 for the absence of the corresponding term, 1 for its presence)
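A minimal sketch of this Boolean representation, assuming the selected features are held in a dictionary that maps each n-gram to its vector position (the names `feature_index` and `to_boolean_vector` are illustrative, not from the slides):

```python
def to_boolean_vector(sentence_ngrams, feature_index):
    """Return a 0/1 vector: 1 where a selected feature occurs in the sentence."""
    vec = [0] * len(feature_index)
    for gram in sentence_ngrams:
        pos = feature_index.get(gram)
        if pos is not None:
            vec[pos] = 1  # presence only; repeated occurrences are not counted
    return vec

# Usage: three selected unigram features, one sentence.
feature_index = {("studi",): 0, ("howev",): 1, ("paper",): 2}
vector = to_boolean_vector([("thi",), ("paper",), ("paper",)], feature_index)
```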

Classifier
- Support Vector Machines (SVM) (Basu et al., 2003; Burges, 1998; Cortes and Vapnik, 1995; Joachims, 1998; Vapnik, 1995)
- Five-fold cross validation
- Machine learning environment: RAPIDMINER (Mierswa et al., 2006)
  - RBF kernel settings chosen by trying a set of different parameter settings on the feature set with 3,000 unigrams
- Parameters not necessarily the best; exhaustive searches will be performed on the other feature sets
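The experiments were run in RAPIDMINER, so the following is not the authors' setup. It only illustrates two ingredients named on the slide: the standard RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2), and a five-fold split of the data. The `gamma` value and the simple unshuffled, unstratified split are assumptions made for the sketch.

```python
import math

def rbf_kernel(x, y, gamma):
    """Standard RBF kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists; fold i tests on items i, i+k, i+2k, ..."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test
```

On Boolean sentence vectors the squared distance is simply the number of features on which two sentences differ, so `gamma` controls how quickly similarity decays with each differing n-gram.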

Evaluation
- Five-fold cross validation was performed on 14 different feature sets

Evaluation
- Accuracy: the proportion of classifications that agreed with the manually assigned labels

Evaluation
- Precision: what proportion of the items assigned to a given category actually belonged to it
- Recall: what proportion of the items actually belonging to a category were labeled correctly
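The three measures as defined on these slides can be computed from parallel lists of manual and predicted labels. This is a small sketch; the function names and the convention of returning 0.0 for an empty category are assumptions.

```python
def accuracy(gold, pred):
    """Proportion of predictions that agree with the manually assigned labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision_recall(gold, pred, label):
    """Precision and recall for one category (e.g. one move)."""
    true_pos = sum(g == label and p == label for g, p in zip(gold, pred))
    assigned = sum(p == label for p in pred)  # items the classifier put in the category
    actual = sum(g == label for g in gold)    # items that really belong to the category
    precision = true_pos / assigned if assigned else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Usage with made-up labels for five sentences (moves 1 to 3).
gold = [1, 1, 2, 2, 3]
pred = [1, 2, 1, 2, 3]
```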

Evaluation
- Trigram models result in the best precision
- Unigram models result in the best recall

Evaluation
- Move 2 is the most difficult to identify, as revealed by error analysis: Move 2 gets misclassified as Move 1
  - Use the relative position of the sentence in the text to disambiguate the move involved
  - See what percentage of Move 2 sentences identified as Move 1 by the system have also been labeled Move 1 by the annotator
- Extracted features are not discipline-dependent

This just in…
Built a model with the top 3,000 unigrams and top 3,000 trigrams
- Precision: 91.14%
- Recall: 82.98%
- Kappa: 87.57

Inter-annotator agreement
- Second annotations on a sample of files across all 20 disciplines (487 sentences)
- k: inter-annotator agreement, k = (P(A) - P(E)) / (1 - P(E))
  - P(A): observed probability of agreement
  - P(E): expected probability of agreement
- Average k = 0.945 over the three moves
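A hedged sketch of the agreement computation for two annotators. It assumes Cohen's formulation, in which P(E) is estimated from each annotator's own label distribution; the slides do not say which estimate of P(E) was used, nor whether k was computed per move and then averaged.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Kappa = (P(A) - P(E)) / (1 - P(E)) for two annotators' label lists.

    Undefined (division by zero) if both annotators use a single identical label.
    """
    n = len(labels_a)
    # P(A): observed proportion of items on which the annotators agree.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P(E): chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_a - p_e) / (1 - p_e)
```

Perfect agreement gives 1, and agreement no better than chance gives 0.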

Further work on IADE
- Ongoing experiments to improve accuracy
  - experimenting with different kernel parameters to find optimal models
- More annotation
- Inter-annotator agreement (3 annotators)
- Identification of steps
- Development of intelligent feedback
- Web interface design

Further research with IADE
- Evaluation of IADE effectiveness (Chapelle, 2001)
  - Learning potential
  - Learner fit
  - Meaning focus
  - Authenticity
  - Impact
  - Practicality
- Process/product research direction: interaction between use and outcome (Warschauer & Ware, 2006)
- Target for evaluation: “what is taught through technology” (Chapelle, 2007, p. 30)

THANK YOU! Questions? Suggestions?
