
Using Machine Learning to Annotate Data for NLP Tasks Semi-Automatically
Gerhard B van Huyssteen, Martin J Puttkammer, Suléne Pilon and Hendrik J Groenewald
Centre for Text Technology (CTexT), Research Unit: Languages and Literature in the South African Context, North-West University, Potchefstroom Campus (PUK), South Africa
{Gerhard.VanHuyssteen; Martin.Puttkammer; Sulene.Pilon;
30 September 2007; Borovets

Overview
– Introduction
– End-User Requirements
– Solution: Design & Implementation
– Evaluation
– Conclusion

Human Language Technologies
– HLTs depend on the availability of linguistic data:
  – Specialized lexicons
  – Annotated and raw corpora
  – Formalized grammar rules
– Creation of such resources is expensive and protracted
  – Especially for less-resourced languages

Less-resourced Languages
– "languages for which few digital resources exist; and thus, languages whose computerization poses unique challenges. [They] are languages with limited financial, political, and legal resources… " (Garrett, 2006)
– Implicit in this definition:
  – Lack of human resources (little attention in research or discussions)
  – Lack of computational linguists working on these languages
– Research question: How could one facilitate the development of linguistic data by enabling non-experts to collaborate in the computerization of less-resourced languages?

Methodology I
– Empower linguists and mother-tongue speakers to deliver annotated data
  – High quality
  – Shortest possible time
– Escalate the annotation of linguistic data by mother-tongue speakers through:
  – User-friendly environments
  – Bootstrapping
  – Machine learning instead of rule-based techniques

Methodology II
– The general idea:
  – Development of gold standards
  – Development of annotated data
  – Bootstrapping
– With the click of a button:
  – Annotate data
  – Train a machine-learning algorithm

Central Point of Departure I
– Annotators are invaluable resources
– Based on experience with less-resourced languages:
  – Annotators mostly have word-processing skills
  – Used to a GUI-based environment
  – Usually limited skills in a computational or programming environment
– In the worst cases, annotators have difficulty with:
  – File management
  – Unzipping
  – Proper encoding of text files

Central Point of Departure II
– Aim of this project: enabling annotators to focus on what they are good at, namely enriching data with expert linguistic knowledge
– Training the machine learner occurs automatically

End-user Requirements I
– Unstructured interviews with four annotators:
  1. What do you find unpleasant about your work as an annotator?
  2. What will make your life as an annotator easier?

End-user Requirements II
1. What do you find unpleasant about your work as an annotator?
– Repetitiveness
  – Lack of concentration/motivation
– Feeling "useless"
  – Do not see results

End-user Requirements III
2. What will make your life as an annotator easier?
– Friendly environment (i.e. GUI-based, not lists of words)
– Bite-sized chunks of data rather than endless lists
– Rather correct data than annotate from scratch
  – The program should already suggest a possible annotation
  – Click or drag
– Reference works need to be available
– Automatic data management

Solution: TurboAnnotate
– User-friendly annotation environment
  – Bootstrapping with machine learning
  – Creating gold standards/annotated lists
– Inspired by DictionaryMaker (Davel and Peche, 2006) and Alchemist (University of Chicago, 2004)

DictionaryMaker (screenshot)

Alchemist (screenshot)

Simplified Workflow of TurboAnnotate (diagram)

Step 1: Create Gold Standard
– Independent test set for evaluating performance
– 1000 random instances used
– The annotator only has to select one data file
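A minimal sketch of what this step amounts to, assuming a plain one-word-per-line base list; the file names and helper function are illustrative, not TurboAnnotate's actual code:

```python
import random

def split_gold_standard(base_list_path, gold_path, pool_path, n=1000, seed=1):
    """Draw n random words from the base list as an independent gold
    standard; everything else stays in the pool for later annotation."""
    with open(base_list_path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    random.seed(seed)
    gold = set(random.sample(words, n))
    with open(gold_path, "w", encoding="utf-8") as g, \
         open(pool_path, "w", encoding="utf-8") as p:
        for word in words:
            (g if word in gold else p).write(word + "\n")

# e.g. split_gold_standard("afrikaans_wordlist.txt", "gold.txt", "pool.txt")
```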


Step 2: Verify Annotations
– New data sourced from the base list
  – Automatically annotated by the classifier
  – Presented to the annotator in the "Annotate" tab

TurboAnnotate: Annotation Environment (screenshot)


Step 3: Verify Annotated Set
– Bootstrapping, inspired by DictionaryMaker
– 200 words per chunk, trained in the background
– The annotator verifies: click "accept" or correct the instance
– Verified data serve as training data
– Iterative process until the desired results are achieved
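A sketch of this bootstrapping loop; train_classifier, classify, and human_verify are placeholders for TurboAnnotate's background training, classification, and GUI verification steps, not its actual API:

```python
CHUNK_SIZE = 200  # words per chunk, as in TurboAnnotate

def bootstrap(pool, training_data, train_classifier, classify, human_verify):
    """Iteratively annotate the pool: the current classifier pre-annotates
    each 200-word chunk, the annotator accepts or corrects each suggestion,
    and the verified chunk is added to the training data."""
    while pool:
        chunk, pool = pool[:CHUNK_SIZE], pool[CHUNK_SIZE:]
        model = train_classifier(training_data)      # retrained in the background
        suggestions = [(word, classify(model, word)) for word in chunk]
        verified = [human_verify(word, label) for word, label in suggestions]
        training_data.extend(verified)               # verified data become training data
    return training_data
```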

The Machine Learning System I
– Tilburg Memory-Based Learner (TiMBL)
  – Wide success and applicability in the field of natural language processing
  – Available for research purposes
  – Relatively easy to use
– On the downside:
  – Performs best with large quantities of data
  – For the tasks of hyphenation and compound analysis, however, TiMBL performs well with small quantities of data
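TiMBL is normally driven from the command line; a minimal sketch of wrapping a train-and-classify run, where the file names are placeholders and the executable name and flags should be checked against the installed TiMBL version:

```python
import subprocess

def timbl_run(train_file, test_file, out_file):
    """Train TiMBL on train_file, classify test_file, and write the
    predictions to out_file (TiMBL's -f/-t/-o options)."""
    subprocess.run(
        ["Timbl", "-f", train_file, "-t", test_file, "-o", out_file],
        check=True,
    )

# e.g. timbl_run("verified.train", "chunk.test", "chunk.out")
```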

The Machine Learning System II
– Default parameter settings used
– Task-specific feature selection
– Performance is evaluated against the gold standard
  – For hyphenation and compound analysis, accuracy is determined on word level and not per instance
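Word-level accuracy means a word only counts as correct if every one of its character instances is classified correctly. A small illustration of that metric, assuming the instance labels are grouped per word:

```python
def word_level_accuracy(gold, predicted):
    """gold and predicted each hold one list of instance labels per word;
    a word is correct only if *all* its instance labels match."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)

# One wrong instance makes the whole second word wrong: 50.0, not the
# per-instance 75.0
print(word_level_accuracy([["=", "+", "="], ["="]],
                          [["=", "+", "="], ["+"]]))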

Features I
– All input words are converted to feature vectors
  – Splitting window
  – Context: 3 positions (left and right)
– Class
  – Hyphenation: a class indicating a break
  – Compound analysis: 3 possible classes
    – + indicating a word boundary
    – _ indicating a valence morpheme
    – = no break

Features II
– Example: eksamenlokaal 'examination room'
[feature-vector table not preserved in the transcript]
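The transcript does not preserve the exact feature layout, but a sketch under the stated assumptions (focus character, three characters of context on each side, class for the potential split after the focus) could look like this; the '#' padding symbol and the boundary position are illustrative:

```python
def instances(word, boundaries, context=3, pad="#"):
    """One instance per character position: left context, focus character,
    right context, and the class of the potential split *after* the focus
    ('+' word boundary, '_' valence morpheme, '=' no break)."""
    padded = pad * context + word + pad * context
    rows = []
    for i in range(len(word)):
        left = padded[i:i + context]
        focus = word[i]
        right = padded[i + context + 1:i + 2 * context + 1]
        rows.append(list(left) + [focus] + list(right) + [boundaries.get(i, "=")])
    return rows

# eksamen+lokaal: a word boundary after position 6 (the 'n' of 'eksamen')
for row in instances("eksamenlokaal", {6: "+"}):
    print(",".join(row))
```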

Parameter Optimisation I
– Large variations in accuracy occur when the parameter settings of MBL algorithms are changed
– Finding the best combination of parameters:
  – Exhaustive searches are undesirable
  – Slow and computationally expensive

Parameter Optimisation II
– Alternative: Paramsearch (Van den Bosch, 2005)
  – Delivers combinations of algorithmic parameters that are estimated to perform well
– PSearch
  – Our own modification of Paramsearch
  – Only applied after all data have been annotated
  – Ensures the best possible classifier
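Paramsearch's own interface is not shown in the talk; purely to illustrate why exhaustive search is expensive, a naive sweep over two TiMBL parameters (-k nearest neighbours, -w feature weighting) against the gold standard might look like the following, where score_output is a hypothetical helper that parses TiMBL's output file into word-level accuracy:

```python
import itertools
import subprocess

def naive_sweep(train_file, gold_file, score_output):
    """Exhaustive search over a small TiMBL parameter grid: exactly the
    slow, expensive approach that Paramsearch/PSearch avoids by estimating
    promising combinations instead."""
    best = None
    for k, w in itertools.product([1, 3, 5, 7], [0, 1, 2, 3]):
        out = f"gold.k{k}.w{w}.out"
        subprocess.run(
            ["Timbl", "-f", train_file, "-t", gold_file,
             "-k", str(k), "-w", str(w), "-o", out],
            check=True,
        )
        acc = score_output(out)
        if best is None or acc > best[0]:
            best = (acc, {"k": k, "w": w})
    return best  # (best accuracy, parameter setting)
```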

Criteria
– Two criteria:
  – Accuracy
  – Human effort (time)
– Evaluated on the tasks of hyphenation and compound analysis for Afrikaans and Setswana
– Four human annotators:
  – Two highly experienced in annotation
  – Two considered novices in the field

Accuracy
– Two kinds of accuracy:
  – Classifier accuracy
  – Human accuracy
– Expressed as the percentage of correctly annotated words over the total number of words
– Gold standard excluded as training data

Classifier Accuracy (Hyphenation)

# Words in Training Data | Accuracy: Afrikaans | Accuracy: Setswana
…                        | …                   | 94.50%
…                        | …                   | 98.30%
…                        | …                   | 98.80%
…                        | …                   | 98.90%

[training-set sizes and the Afrikaans column were not preserved in the transcript]

Human Accuracy
– Two separate unseen datasets of 200 words for each language
– First dataset annotated in an ordinary text editor
– Second dataset annotated with TurboAnnotate

Human Accuracy

Annotation Tool           | Accuracy (Hyph) | Time (s) (Hyph) | Accuracy (CA) | Time (s) (CA)
Text Editor (200 words)   | 93.25%          | …               | …             | 802
TurboAnnotate (200 words) | 98.34%          | …               | …             | 748

[the hyphenation times and compound-analysis accuracies were not preserved in the transcript]

Human Effort I
– Two questions:
  – Is it faster to annotate with TurboAnnotate?
  – What would the predicted saving on human effort be on a large dataset?

Human Effort II
[table: # Words in Training Set vs. Time (s) (Hyph) and Time (s) (CA); the figures were not preserved in the transcript]

Human Effort III
– 1 minute faster to annotate 200 words with TurboAnnotate
– Larger dataset (40,000 words): a difference of only circa 3.5 uninterrupted human hours
– This picture changes when the effect of bootstrapping is considered
  – Extrapolating to 42,967 words:
    – Saving of 51 hours (68%) for hyphenation
    – Saving of 9 hours (41%) for compound analysis
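As a quick sanity check of the "circa 3.5 hours" figure, assuming the one-minute difference per 200-word dataset scales linearly:

$$\frac{40\,000\ \text{words}}{200\ \text{words/dataset}} \times 1\ \text{min} = 200\ \text{min} \approx 3.3\ \text{hours}$$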

Conclusion
– TurboAnnotate helps to increase the accuracy of human annotators
– Saves human effort

Future Work
– Other lexical annotation tasks:
  – Creating lexicons for spelling checkers
  – Creating data for morphological analysis (stemming, lemmatization)
– Improve the GUI
– Network solution
– Active learning
– Experiment with C5.0

Obtaining TurboAnnotate
– Requirements:
  – Linux
  – Perl 5.8
  – Gtk
  – TiMBL 5.1
– Open-source
– Available at

Acknowledgements
– This work was supported by a grant from the South African National Research Foundation (GUN: FA ).
– We also acknowledge the inputs and contributions of:
  – Ansu Berg
  – Pieter Nortjé
  – Rigardt Pretorius
  – Martin Schlemmer
  – Wikus Slabbert