Slide 1: Using Machine Learning to Annotate Data for NLP Tasks Semi-Automatically
Gerhard B van Huyssteen, Martin J Puttkammer, Suléne Pilon and Hendrik J Groenewald
Centre for Text Technology (CTexT), Research Unit: Languages and Literature in the South African Context, North-West University, Potchefstroom Campus (PUK), South Africa
{Gerhard.VanHuyssteen; Martin.Puttkammer; Sulene.Pilon; Handre.Groenewald}@nwu.ac.za
30 September 2007; Borovets
Slide 2: Overview
– Introduction
– End-User Requirements
– Solution: Design & Implementation
– Evaluation
– Conclusion
Slide 3: Human Language Technologies
– HLTs depend on the availability of linguistic data:
  – Specialized lexicons
  – Annotated and raw corpora
  – Formalized grammar rules
– Creating such resources is expensive and protracted, especially for less-resourced languages
Slide 4: Less-resourced Languages
– "languages for which few digital resources exist; and thus, languages whose computerization poses unique challenges. [They] are languages with limited financial, political, and legal resources…" (Garrett, 2006)
– Implicit in this definition:
  – A lack of human resources (these languages receive little attention in research or discussions)
  – A lack of computational linguists working on these languages
– Research question: How could one facilitate the development of linguistic data by enabling non-experts to collaborate in the computerization of less-resourced languages?
Slide 5: Methodology I
– Empower linguists and mother-tongue speakers to deliver annotated data
  – Of high quality
  – In the shortest possible time
– Scale up the annotation of linguistic data by mother-tongue speakers through:
  – User-friendly environments
  – Bootstrapping
  – Machine learning instead of rule-based techniques
Slide 6: Methodology II
– The general idea:
  – Development of gold standards
  – Development of annotated data
  – Bootstrapping
– With the click of a button:
  – Annotate data
  – Train a machine-learning algorithm
Slide 7: Central Point of Departure I
– Annotators are invaluable resources
– Based on experience with less-resourced languages:
  – Annotators mostly have word-processing skills
  – They are used to a GUI-based environment
  – They usually have limited skills in a computational or programming environment
– In the worst cases, annotators have difficulties with:
  – File management
  – Unzipping
  – Proper encoding of text files
Slide 8: Central Point of Departure II
– Aim of this project: enable annotators to focus on what they are good at, namely enriching data with expert linguistic knowledge
– Training the machine learner occurs automatically
Slide 9: End-user Requirements I
Unstructured interviews with four annotators:
1. What do you find unpleasant about your work as an annotator?
2. What will make your life as an annotator easier?
Slide 10: End-user Requirements II
1. What do you find unpleasant about your work as an annotator?
– Repetitiveness, leading to a lack of concentration/motivation
– Feeling "useless", because they do not see results
Slide 11: End-user Requirements III
2. What will make your life as an annotator easier?
– A friendly environment (i.e. GUI-based, not lists of words)
– Bite-size chunks of data rather than endless lists
– Correcting data rather than annotating from scratch: the program should already suggest a possible annotation
– Click-or-drag interaction
– Reference works need to be available
– Automatic data management
Slide 12: Solution: TurboAnnotate
– A user-friendly annotating environment
  – Bootstrapping with machine learning
  – Creating gold standards/annotated lists
– Inspired by DictionaryMaker (Davel and Peche, 2006) and Alchemist (University of Chicago, 2004)
Slide 13: DictionaryMaker (screenshot)
Slide 14: Alchemist (screenshot)
Slide 15: Simplified Workflow of TurboAnnotate (workflow diagram)
Slide 16: Step 1: Create Gold Standard
– The gold standard is an independent test set for evaluating performance
– 1000 random instances are used (see the sketch after this list)
– The annotator only has to select one data file
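A minimal sketch of this sampling step, assuming the selected data file is a plain-text base list with one word per line; the file names, the fixed seed, and the function name are illustrative, not part of TurboAnnotate:

    import random

    def create_gold_standard(base_list_path, gold_path, n=1000, seed=0):
        # Read the base list: one word per line, blank lines skipped.
        with open(base_list_path, encoding="utf-8") as f:
            words = [line.strip() for line in f if line.strip()]
        random.seed(seed)
        gold = random.sample(words, n)        # n random instances, no duplicates
        gold_set = set(gold)
        rest = [w for w in words if w not in gold_set]  # remainder left for annotation
        with open(gold_path, "w", encoding="utf-8") as f:
            f.write("\n".join(gold) + "\n")
        return gold, rest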
Slide 17: Simplified Workflow of TurboAnnotate (workflow diagram, repeated)
Slide 18: Step 2: Verify Annotations
– New data is sourced from the base list
– It is automatically annotated by the classifier
– It is presented to the annotator in the "Annotate" tab
Slide 19: TurboAnnotate: Annotation Environment (screenshot)
Slide 20: Simplified Workflow of TurboAnnotate (workflow diagram, repeated)
Slide 21: Step 3: Verify Annotated Set
– Bootstrapping, inspired by DictionaryMaker (see the sketch after this list)
– 200 words per chunk; the classifier is trained in the background
– The annotator verifies each instance: click "accept" or correct it
– Verified data serve as training data
– The process iterates until the desired results are achieved
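A sketch of one such iteration, assuming data is exchanged with TiMBL via files of instance lines (formatted as in the Features sketch further on). The Timbl invocation follows the TiMBL manual (-f training file, -t test file, -o output file), but the binary name and flag details should be checked against the installed TiMBL 5.1; the file names and the verify callback are hypothetical:

    import subprocess

    CHUNK = 200  # words per chunk, as on the slide

    def write_lines(path, lines):
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")

    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    def bootstrap_iteration(pool, train_file, verify):
        # Take the next 200-word chunk from the unannotated pool.
        chunk, pool = pool[:CHUNK], pool[CHUNK:]
        write_lines("chunk.data", chunk)
        # Train on all verified data so far and classify the new chunk.
        subprocess.run(["Timbl", "-f", train_file, "-t", "chunk.data",
                        "-o", "chunk.out"], check=True)
        suggested = read_lines("chunk.out")
        # The annotator clicks "accept" or corrects each suggested instance.
        verified = [verify(line) for line in suggested]
        # Verified data become training data for the next iteration.
        write_lines(train_file, read_lines(train_file) + verified)
        return pool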
Slide 22: The Machine Learning System I
– Tilburg Memory-Based Learner (TiMBL):
  – Widely successful and applicable in the field of natural language processing
  – Available for research purposes
  – Relatively easy to use
– On the downside, it performs best with large quantities of data
– For the tasks of hyphenation and compound analysis, however, TiMBL performs well with small quantities of data
Slide 23: The Machine Learning System II
– Default parameter settings are used
– Task-specific feature selection
– Performance is evaluated against the gold standard
– For hyphenation and compound analysis, accuracy is determined at word level, not per instance (see the sketch after this list)
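A minimal sketch of that word-level scoring, under the assumption that each word's classifications are grouped into one label sequence: a word only counts as correct if every instance in it is classified correctly.

    def word_accuracy(gold, predicted):
        # gold, predicted: one label sequence per word, e.g. ["=", "+", "="]
        correct = sum(1 for g, p in zip(gold, predicted) if g == p)
        return correct / len(gold)

    # A single wrong instance makes the whole word count as wrong:
    print(word_accuracy([["=", "+", "="]], [["=", "=", "="]]))  # prints 0.0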
Slide 24: Features I
– All input words are converted to feature vectors:
  – A splitting window
  – Context of 3 positions (left and right)
– Class:
  – Hyphenation: a class indicating a break
  – Compound analysis: 3 possible classes:
    – "+" indicates a word boundary
    – "_" indicates a valence morpheme
    – "=" indicates no break
Slide 25: Features II
Example: eksamenlokaal 'examination room' (feature-vector table; see the sketch below)
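A sketch of how the slide's example could be turned into such vectors: for every potential split point in eksamenlokaal, take three characters of left and right context plus the class label. The padding symbol "#", the dict of boundary positions, and the exact instance layout are assumptions for illustration, not the paper's specification:

    def to_instances(word, boundaries, context=3, pad="#"):
        # One instance per split point: 3 left characters, 3 right characters, class.
        # "#" is an assumed padding symbol ("_" is taken: it marks valence morphemes).
        padded = pad * context + word + pad * context
        instances = []
        for i in range(1, len(word)):                    # split points between characters
            left = padded[i:i + context]                 # 3 characters left of the split
            right = padded[i + context:i + 2 * context]  # 3 characters right of it
            label = boundaries.get(i, "=")               # "=" = no break at this point
            instances.append(list(left) + list(right) + [label])
        return instances

    # eksamenlokaal = eksamen + lokaal: word boundary "+" after the 7th character
    for inst in to_instances("eksamenlokaal", {7: "+"}):
        print(" ".join(inst))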
Slide 26: Parameter Optimisation I
– Large variations in accuracy occur when the parameter settings of MBL algorithms are changed
– Finding the best combination of parameters:
  – Exhaustive searches are undesirable
  – Slow and computationally expensive
Slide 27: Parameter Optimisation II
– Alternative: Paramsearch (Van den Bosch, 2005) delivers combinations of algorithmic parameters that are estimated to perform well
– PSearch:
  – Our own modification of Paramsearch
  – Only run after all data has been annotated
  – Ensures the best possible classifier
Slide 28: Criteria
– Two criteria:
  – Accuracy
  – Human effort (time)
– Evaluated on the tasks of hyphenation and compound analysis for Afrikaans and Setswana
– Four human annotators:
  – Two experienced in annotating
  – Two considered novices in the field
Slide 29: Accuracy
– Two kinds of accuracy:
  – Classifier accuracy
  – Human accuracy
– Expressed as the percentage of correctly annotated words over the total number of words
– The gold standard is excluded from the training data
Slide 30: Classifier Accuracy (Hyphenation)

# Words in Training Data   Accuracy: Afrikaans   Accuracy: Setswana
200                        38.60%                94.50%
600                        54.00%                98.30%
1000                       58.30%                98.80%
2000                       68.50%                98.90%
Slide 31: Human Accuracy
– Two separate unseen datasets of 200 words for each language
– The first dataset was annotated in an ordinary text editor
– The second dataset was annotated with TurboAnnotate
Slide 32: Human Accuracy

Annotation Tool             Accuracy (Hyph)   Time (s) (Hyph)   Accuracy (CA)   Time (s) (CA)
Text Editor (200 words)     93.25%            1325              91.50%          802
TurboAnnotate (200 words)   98.34%            1258              94.00%          748
Slide 33: Human Effort I
Two questions:
– Is it faster to annotate with TurboAnnotate?
– What would the predicted saving on human effort be on a large dataset?
Slide 34: Human Effort II

# Words in Training Set   Time (s) (Hyph)   Time (s) (CA)
0                         1258              748
600                       663               614
2000                      573               582
Slide 35: Human Effort III
– About 1 minute faster to annotate 200 words with TurboAnnotate
– On a larger dataset (40,000 words), that amounts to a difference of only circa 3.5 uninterrupted human hours
– The picture changes when the effect of bootstrapping is considered; extrapolating to 42,967 words (a rough check follows below):
  – A saving of 51 hours (68%) for hyphenation
  – A saving of 9 hours (41%) for compound analysis
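As a rough sanity check on the hyphenation figure (the per-chunk time comes from the Human Effort II table; the arithmetic is ours): without bootstrapping, 42,967 words at 1258 s per 200-word chunk come to 42,967 / 200 × 1258 ≈ 270,000 s ≈ 75 hours, of which the reported 51-hour saving is indeed about 68%.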
Slide 36: Conclusion
– TurboAnnotate helps to increase the accuracy of human annotators
– It saves human effort
Slide 37: Future Work
– Other lexical annotation tasks:
  – Creating lexicons for spelling checkers
  – Creating data for morphological analysis (stemming, lemmatization)
– Improve the GUI
– A network solution
– Active learning
– Experiment with C5.0
Slide 38: Obtaining TurboAnnotate
– Requirements:
  – Linux
  – Perl 5.8
  – Gtk+ 2.10
  – TiMBL 5.1
– Open-source
– Available at http://www.nwu.ac.za/ctext
Slide 39: Acknowledgements
– This work was supported by a grant from the South African National Research Foundation (GUN: FA2004042900059).
– We also acknowledge the inputs and contributions of:
  – Ansu Berg
  – Pieter Nortjé
  – Rigardt Pretorius
  – Martin Schlemmer
  – Wikus Slabbert