University of Sheffield NLP
Exercise I
Objective: Implement an ML component based on SVM to identify the following concepts in company profiles: company name; address; fax; phone; web site; industry type; creation date; industry sector; main products; market locations; number of employees; stock exchange listings.

University of Sheffield NLP
Exercise I
Materials: we are working with material in the directory hands-on-resources/ml/entity-learning.
- Training documents: a set of 5 company profiles annotated with the target concepts (corpus/annotated). Each document contains an annotation Mention with a feature class representing the target concept (human annotated). The documents also contain annotations produced by ANNIE, plus an annotation called Entity that wraps up named entities of type Person, Organization, Location, Date, and Address. All annotations are in the default annotation set.
- Test documents (without target concepts and without annotations): a set of company profiles from the same source as the training data (corpus/testing).
- SVM configuration file: learn-company.xml (experiments/company-profile-learning).
Open the configuration file in a text editor to see how the target concept and the linguistic annotations are encoded. Remember that the target concept is encoded as a sub-element of the dataset definition element (in this case we are trying to learn a Mention and its ‘class’ feature); a sketch of the relevant elements follows below.
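For orientation, here is a minimal sketch of the shape such a Batch Learning PR configuration file takes, following the format documented in the GATE user guide. The feature choices and engine options shown are illustrative assumptions, not the literal contents of learn-company.xml:

<?xml version="1.0"?>
<ML-CONFIG>
  <SURROUND value="true"/>
  <!-- SVM engine; these options are illustrative -->
  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
          options="-c 0.7 -t 0 -m 100 -tau 0.4"/>
  <DATASET>
    <!-- one learning instance per Token -->
    <INSTANCE-TYPE>Token</INSTANCE-TYPE>
    <!-- an example linguistic feature: token strings in a window of 2 tokens either side -->
    <ATTRIBUTELIST>
      <NAME>Form</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Token</TYPE>
      <FEATURE>string</FEATURE>
      <RANGE from="-2" to="2"/>
    </ATTRIBUTELIST>
    <!-- the target concept: the CLASS sub-element marks the attribute
         to be learned (here, the class feature of Mention) -->
    <ATTRIBUTE>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Mention</TYPE>
      <FEATURE>class</FEATURE>
      <POSITION>0</POSITION>
      <CLASS/>
    </ATTRIBUTE>
  </DATASET>
</ML-CONFIG>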

University of Sheffield NLP
Exercise I – PART I
1. Run an experiment with the training documents to check the performance of the learning component on annotated data. We will use the GATE GUI for this exercise.
- Load the Batch Learning plug-in using the plug-in manager (it has the name ‘learning’ in the list of plug-ins)
- Create a corpus (ANNOTATED)
- Populate it with the training documents (corpus/annotated), using encoding UTF-8 (you may want to look at one of the documents to see the annotations; the target annotation is Mention)
- Create a Batch Learning PR using the provided configuration file (experiments/company-profile-learning/learn-company.xml) – it should appear in the list of processing resources
- Create a corpus pipeline and add the Batch Learning PR to it
- Set the parameter “learningMode” of the Batch Learning PR to “evaluation”
- Run the corpus pipeline over the ANNOTATED corpus (by setting the corpus parameter)
- When finished, evaluation information will be printed on the GATE console
- Examine the GATE console to see the evaluation results

University of Sheffield NLP
Exercise I – PART I
In this exercise we have tested how to evaluate the learning component over annotated documents. Note that we have provided very few documents for training. According to the configuration file and the number of documents in the corpus, the ML pipeline will execute 2 runs; each run will use 3 documents for training and 2 documents for testing. In each test document, the automatically produced Mention annotations will be compared to the true Mention annotations (gold standard) to compute precision, recall, and f-measure values. The evaluation results reported are an average over the two runs. The evaluation settings that produce this behaviour live in the configuration file; a sketch follows below.
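As a rough illustration, an evaluation element like the following (a hedged sketch; the exact values in learn-company.xml may differ) would yield the behaviour described above over a 5-document corpus:

<!-- sketch only: "holdout" keeps ratio * corpusSize documents for training
     (about 3 of 5 here) and evaluates on the rest, repeated for 2 runs -->
<EVALUATION method="holdout" runs="2" ratio="0.66"/>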

University of Sheffield NLP
Exercise I - PART II
1. Run an experiment to TRAIN the machine learning component
- Create a corpus and populate it with the training data (or use ANNOTATED from the previous steps)
- Create a Batch Learning PR using the provided configuration file (or use the same PR as before)
- Create a corpus pipeline containing the Batch Learning PR (or use the one from before)
- In the corpus pipeline, set the “learningMode” of the Batch Learning PR to “training”
- Set the corpus of the corpus pipeline to the ANNOTATED corpus
- Run the corpus pipeline
Now you have trained the ML component to recognise Mentions.

University of Sheffield NLP
Exercise I – PART III
1. Run an experiment to apply the trained model to unseen documents
- We will use the trained model produced in the previous exercise
- Create a corpus (TEST) and populate it with the test documents (use UTF-8 encoding)
- NOTE: the documents are not annotated, so you need to produce the annotations! The steps below produce them.
- Load the ANNIE system (with defaults)
- Create an ANNIE NE Transducer (call it ENTITY-GRAMMAR) using the grammar file grammars/create_entity.jape (a sketch of what such a grammar looks like follows below)
- Add the ENTITY-GRAMMAR as the last component of ANNIE
- Run ANNIE (+ the new grammar) over the TEST corpus
- Verify that the documents contain the ANNIE annotations + the Entity annotation
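For reference, a JAPE grammar that wraps ANNIE's named entity annotations into a single Entity annotation might look roughly like the sketch below. This is an illustrative reconstruction, not the actual contents of create_entity.jape:

Phase: CreateEntity
Input: Person Organization Location Date Address
Options: control = all

Rule: WrapNamedEntity
({Person} | {Organization} | {Location} | {Date} | {Address}):ne
-->
{
  // Illustrative Java right-hand side: copy the span of the matched
  // annotation into a new Entity annotation whose 'type' feature
  // records the original annotation type.
  gate.AnnotationSet matched = (gate.AnnotationSet) bindings.get("ne");
  gate.Annotation ann = matched.iterator().next();
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("type", ann.getType());
  outputAS.add(ann.getStartNode(), ann.getEndNode(), "Entity", features);
}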

University of Sheffield NLP
Exercise I – PART III
- Take the corpus pipeline created in the previous exercise and change the learningMode parameter of the Batch Learning PR to “application”
- The input annotation set should be empty (the default), because the ANNIE annotations are there; the output annotation set can be any set (including the default)
- Apply (run) the corpus pipeline to the TEST corpus (by setting the corpus)
- Examine the result of the annotation process (see if Mention annotations have been produced)
- Mention annotations should contain a feature class (one of the concepts listed on the first slide) and a feature ‘prob’, which is a probability produced by the ML component; an illustrative example is shown below
Now you have applied a trained model to a set of unseen documents.
With parts I, II, and III you have used the evaluation, training, and application modes of the Batch Learning PR.
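To make the expected output concrete, a Mention produced in application mode might serialize to something like the following in GATE's XML document format (the offsets, class, and probability values here are invented for illustration):

<!-- illustrative only; ids, offsets, and feature values are made up -->
<Annotation Id="42" Type="Mention" StartNode="120" EndNode="135">
  <Feature>
    <Name className="java.lang.String">class</Name>
    <Value className="java.lang.String">company name</Value>
  </Feature>
  <Feature>
    <Name className="java.lang.String">prob</Name>
    <Value className="java.lang.String">0.87</Value>
  </Feature>
</Annotation>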

University of Sheffield NLP
Exercise I – PART IV
1. Run your own experiment: copy the configuration file to another directory and edit it. You may comment out some of the features used, change the context windows, or change the type of ML engine; the kind of edits you might make is sketched below. Chapter 11 of the GATE guide contains enough information on the options you can adjust.
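For example, two edits one might try (illustrative values; the element names follow the Batch Learning PR configuration format, but check them against the GATE guide): widening a feature's context window, and swapping the SVM engine for the Perceptron-based PAUM engine.

<!-- widen the context window of a feature list from 2 to 3 tokens either side -->
<RANGE from="-3" to="3"/>

<!-- replace the SVM engine with PAUM (options are illustrative) -->
<ENGINE nickname="PAUM" implementationName="PAUM"
        options="-p 50 -n 5 -optB 0.3"/>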

University of Sheffield NLP
Exercise II
Objective: Implement an ML component based on SVM to “learn” ANNIE, i.e. to learn to identify the following concepts or named entities: Location, Address, Date, Person, Organization.
Materials (under directory hands-on-resources/ml/entity-learning):
- We will need the GATE GUI and the learning plug-in loaded using the plug-in manager (see previous exercise)
- We will use the testing documents provided in Exercise I
- Before starting, it is better to close all documents and resources of the previous exercise
- The configuration file is learn-nes.xml in experiments/learning-nes. It is very similar to the one used previously, but check the target annotation to be learned (Entity and its type); the essential difference is sketched below.
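Concretely, the class attribute in the dataset definition would now point at the Entity annotation's type feature rather than Mention's class feature. A hedged sketch, not the literal contents of learn-nes.xml:

<!-- the CLASS marker now targets Entity and its 'type' feature -->
<ATTRIBUTE>
  <NAME>Class</NAME>
  <SEMTYPE>NOMINAL</SEMTYPE>
  <TYPE>Entity</TYPE>
  <FEATURE>type</FEATURE>
  <POSITION>0</POSITION>
  <CLASS/>
</ATTRIBUTE>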

University of Sheffield NLP
Exercise II – PART I
1. Annotate the documents
- Create a corpus (CORPUS) and populate it with the test documents (use UTF-8 encoding)
- NOTE: the documents are not annotated, so you need to produce the annotations! The steps below produce them.
- Load the ANNIE system (with defaults)
- Create an ANNIE NE Transducer (call it ENTITY-GRAMMAR) using the grammar file grammars/create_entity.jape
- Add the ENTITY-GRAMMAR as the last component of ANNIE
- Run ANNIE (+ the new grammar) over the CORPUS
- Verify that the documents contain the ANNIE annotations + the Entity annotation

University of Sheffield NLP
Exercise II – PART I
1. Evaluate an SVM to identify ANNIE’s named entities
- Create a Batch Learning PR using the provided configuration file (experiments/learning-nes/learn-nes.xml)
- Create a corpus pipeline and add the Batch Learning PR to it
- Set the parameter “learningMode” of the Batch Learning PR to “evaluation”
- Run the corpus pipeline over the CORPUS corpus (by setting the corpus parameter)
- When finished, evaluation information will be printed on the GATE console
- Examine the GATE console to see the evaluation results
NOTE: For the sake of this exercise we have used the annotations produced by ANNIE as the gold standard and learned a named entity recognition system based on those annotations. Note, however, that real training should be based on human annotations.

University of Sheffield NLP
Exercise II – PART II
1. Train an SVM to learn named entities and apply it to unseen documents
- We will use the documents you annotated (automatically!) in PART I (corpus CORPUS)
- Using the corpus editor, remove from CORPUS the first 5 documents in the list (profile_A, profile_AA, profile_AB, profile_AC, profile_AD)
- Create a corpus called TESTING
- Add to TESTING (using the corpus editor) the documents profile_A, profile_AA, profile_AB, profile_AC, profile_AD – they should be the last 5 in the list!
- Now we have one corpus for training (CORPUS) and one corpus for testing (TESTING)

University of Sheffield NLP
Exercise II – PART II
- We will use the learning corpus pipeline we evaluated in PART I of this exercise
- In the learning corpus pipeline, set the parameter “learningMode” of the Batch Learning PR to “training”
- Run the learning corpus pipeline over the CORPUS corpus (by setting the corpus parameter)
- Now we have a trained model to recognise Entity and its type
- In the learning corpus pipeline, set the parameter “learningMode” of the Batch Learning PR to “application”
- Also set the output annotation set parameter outputASName to “Output” (to hold the annotations produced by the system)
- Run the learning corpus pipeline over the TESTING corpus (by setting the corpus parameter)
- After execution, check the annotations produced on any of the testing documents (Output annotation set)

University of Sheffield NLP
Exercise II – PART III
- On any of the automatically annotated documents from TESTING, you may want to use the Annotation Diff tool to verify how the learner performed in each document, comparing the Entity annotations in the default annotation set with the Entity annotations in the Output annotation set.
- Run your own experiment, varying any of the parameters of the configuration file, modifying or adding new features, etc.