1/(20) Introduction to ANNIE Diana Maynard University of Sheffield March 2004

2/(20) What is ANNIE?
ANNIE is a vanilla information extraction system comprising a set of core processing resources (PRs):
– Tokeniser
– Sentence Splitter
– POS tagger
– Gazetteers
– Semantic tagger (JAPE transducer)
– Orthomatcher (orthographic coreference)

3/(20) ANNIE Pipeline

4/(20) Other Processing Resources
Many additional processing resources are not part of ANNIE itself but come with the default installation of GATE:
– Gazetteer collector
– PRs for Machine Learning
– Various exporters
– Annotation set transfer, etc.

5/(20) Creating a new application from ANNIE
Typically a new application will use most of the core components from ANNIE.
The tokeniser, sentence splitter and orthomatcher are basically language-, domain- and application-independent.
The POS tagger is language-dependent but domain- and application-independent.
The gazetteer lists and JAPE grammars may act as a starting point but will almost certainly need to be modified.
You may also require additional PRs (either existing or new ones).

6/(20) Modifying gazetteers
Gazetteers are plain text files containing lists of names.
Each gazetteer set has an index file listing all the lists, plus features of each list (majorType, minorType and language).
Lists can be modified either internally using Gaze, or externally in your favourite editor.
Gazetteers can also be mapped to ontologies.
To use Gaze and the ontology editor, you need to download the relevant creole files.
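
The index file described on this slide can be read with a few lines of code. The sketch below assumes a colon-separated layout (list file name, then majorType, minorType and language, with the later fields optional). That layout, the sample file names and the helper `parse_index` are illustrative assumptions, not something the slides state.

```python
from typing import List, NamedTuple, Optional


class ListEntry(NamedTuple):
    """One line of a gazetteer index file (assumed colon-separated layout)."""
    list_file: str
    major_type: Optional[str]
    minor_type: Optional[str]
    language: Optional[str]


def parse_index(text: str) -> List[ListEntry]:
    """Parse index lines such as 'city.lst:location:city:en'."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(":")
        # Pad to four fields; missing or empty features become None.
        parts += [""] * (4 - len(parts))
        name, major, minor, lang = (p.strip() or None for p in parts[:4])
        if name is None:
            raise ValueError(f"missing list file name in: {raw!r}")
        entries.append(ListEntry(name, major, minor, lang))
    return entries


# Hypothetical index contents, for illustration only.
SAMPLE_INDEX = """\
city.lst:location:city:en
company_designator.lst:company_designator
"""

entries = parse_index(SAMPLE_INDEX)
```

Each `ListEntry` then carries the features that would be attached to every name in the corresponding list, which is what a grammar rule later tests against.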

7/(20) JAPE grammars
A semantic tagger consists of a set of rule-based JAPE grammars run sequentially.
JAPE is a pattern-matching language.
The LHS of each rule contains patterns to be matched.
The RHS contains details of annotations (and optionally features) to be created.
More complex rules can also be created.

8/(20) Input specifications
The head of each grammar phase needs to contain certain information:
– Phase name
– Inputs
– Matching style
e.g.
  Phase: location
  Input: Token Lookup Number
  Control: appelt

9/(20) Matching algorithms and Rule Priority
3 styles of matching:
– Brill (fire every rule that applies)
– First (shortest rule fires)
– Appelt (use of priorities)
Appelt priority is applied in the following order:
– Starting point of a pattern
– Longest pattern
– Explicit priority (default = -1)
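
The Appelt ordering on this slide amounts to a three-part sort key. The following is a minimal sketch of that selection step only, not GATE's implementation: the `Match` class, the rule names and the offsets are hypothetical, and any tie-breaking beyond the three criteria on the slide is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    rule: str
    start: int           # offset where the matched pattern begins
    end: int             # offset just past the matched pattern
    priority: int = -1   # explicit priority; -1 is the default given on the slide


def appelt_choose(candidates):
    """Pick one match: earliest start, then longest span, then highest priority."""
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda m: (m.start, -(m.end - m.start), -m.priority),
    )


# Hypothetical competing matches over the same text.
candidates = [
    Match("Short", start=0, end=5, priority=50),
    Match("LongLow", start=0, end=12),
    Match("LongHigh", start=0, end=12, priority=25),
    Match("Later", start=3, end=30, priority=100),
]
winner = appelt_choose(candidates)
```

Here `Short` loses despite its higher priority because length is compared before explicit priority, and `Later` loses because it starts further into the text.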

10/(20) NE Rule in JAPE
  Rule: Company1
  Priority: 25
  (
    ( {Token.orthography == upperInitial} )+   //from tokeniser
    {Lookup.kind == companyDesignator}         //from gazetteer lists
  ):match
  -->
  :match.NamedEntity = { kind=company, rule=Company1 }
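
To make the effect of the Company1 rule concrete, the sketch below imitates it in plain Python: one or more upper-initial tokens followed by a company designator become a single NamedEntity. It is a simplification under stated assumptions. Tokens and lookups are merged into one dictionary per word, and the `DESIGNATORS` set stands in for a gazetteer list; neither reflects how GATE stores annotations.

```python
DESIGNATORS = {"Ltd", "Inc", "Corp"}  # hypothetical gazetteer list


def annotate(words):
    """Attach a simplified orthography value and lookup kind to each word."""
    tokens = []
    for w in words:
        upper_initial = w[:1].isupper() and w[1:].islower()
        tokens.append({
            "text": w,
            "orth": "upperInitial" if upper_initial else "other",
            "lookup": "companyDesignator" if w in DESIGNATORS else None,
        })
    return tokens


def find_companies(tokens):
    """Return NamedEntity dicts for: upperInitial token(s), then a designator."""
    found = []
    for j, tok in enumerate(tokens):
        if tok["lookup"] != "companyDesignator":
            continue
        i = j
        # Extend leftwards over consecutive upper-initial tokens.
        while i > 0 and tokens[i - 1]["orth"] == "upperInitial":
            i -= 1
        if i < j:  # the '+' requires at least one token before the designator
            found.append({
                "type": "NamedEntity",
                "kind": "company",
                "rule": "Company1",
                "text": " ".join(t["text"] for t in tokens[i:j + 1]),
            })
    return found


tokens = annotate("we visited Acme Widgets Ltd yesterday".split())
companies = find_companies(tokens)
```

A designator with no upper-initial token in front of it produces nothing, mirroring the `+` in the rule.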

11/(20) LHS of the rule
LHS is expressed in terms of existing annotations, and optionally features and their values.
Any annotation to be used must be included in the input header.
Any annotation not included in the input header will be ignored (e.g. whitespace).
Each annotation is enclosed in curly braces.
Each pattern to be matched is enclosed in round brackets and has a label attached.

12/(20) Macros
Macros look like the LHS of a rule but have no label:
  Macro: NUMBER
  (({Digit})+)
They are used in rules by enclosing the macro name in round brackets:
  ( (NUMBER)+ ):match
It is conventional to name macros in uppercase letters.
Macros hold across an entire set of grammar phases.

13/(20) Contextual information
Contextual information can be specified in the same way, but has no label.
Contextual information will be consumed by the rule.
  ({Annotation1}) ({Annotation2}):match ({Annotation3})

14/(20) RHS of the rule
LHS and RHS are separated by -->
The label matches that on the LHS.
The annotation to be created follows the label.
  ({Annotation1}):match
  -->
  :match.NE = {feature1 = value1, feature2 = value2}

15/(20) Using phases
Grammars usually consist of several phases, run sequentially.
Only one rule within a single phase can fire.
Temporary annotations may be created in early phases and used as input for later phases.
Annotations from earlier phases may need to be combined or modified.
A definition phase (conventionally called main.jape) lists the phases to be used, in order.
Only the definition phase needs to be loaded.
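
The idea of sequential phases with temporary annotations can be sketched as a list of functions applied in order. This is an illustration only: the phase names, the `TempFirstName` type, the name list and the whitespace tokeniser are all hypothetical, and the `MAIN` list merely plays the role that the definition phase plays in a real grammar.

```python
def tokenise(text):
    """Split on single spaces and record character offsets for each token."""
    tokens, pos = [], 0
    for word in text.split(" "):
        tokens.append({"type": "Token", "text": word,
                       "start": pos, "end": pos + len(word)})
        pos += len(word) + 1
    return tokens


def phase_first_names(annotations):
    """Early phase: mark known first names with a temporary annotation."""
    known = {"Diana", "Kalina"}  # hypothetical list
    temp = [{"type": "TempFirstName", "start": a["start"], "end": a["end"]}
            for a in annotations
            if a["type"] == "Token" and a["text"] in known]
    return annotations + temp


def phase_person(annotations):
    """Later phase: a TempFirstName plus the next token becomes a Person."""
    tokens = [a for a in annotations if a["type"] == "Token"]
    people = []
    for t in [a for a in annotations if a["type"] == "TempFirstName"]:
        following = [k for k in tokens if k["start"] == t["end"] + 1]
        if following:
            people.append({"type": "Person", "start": t["start"],
                           "end": following[0]["end"]})
    return annotations + people


def phase_cleanup(annotations):
    """Final phase: drop temporary annotations once they have been used."""
    return [a for a in annotations if not a["type"].startswith("Temp")]


# Stand-in for the definition phase: the phases to run, in order.
MAIN = [phase_first_names, phase_person, phase_cleanup]


def run(annotations, phases=MAIN):
    for phase in phases:
        annotations = phase(annotations)
    return annotations


result = run(tokenise("Diana Maynard wrote ANNIE"))
people = [(a["start"], a["end"]) for a in result if a["type"] == "Person"]
```

Reordering `MAIN` so that `phase_person` runs first yields no Person annotations, which is why the phase order in the definition file matters.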

16/(20) More complex JAPE rules
Any Java code can be used on the RHS of a rule.
This is useful for e.g. feature percolation, ontology population, accessing information not readily available, comparing feature values, deleting existing annotations etc.
There are examples of these in the user guide and in the ANNIE NE grammars.
Most JAPE rules end up being complex!

17/(20) Using JAPE for other tasks
JAPE grammars are not just useful for NE annotation.
They can be a quick and easy way of performing any kind of task where patterns can be easily recognised and a finite-state approach is possible, e.g. transforming one style of markup into another, or deriving features for the learning algorithms.

18/(20) Example rule for deriving features
  Rule: Entity
  (
    {Gpe} |
    {Organization} |
    {Person} |
    {Location} |
    {Facility}
  ):entity
  -->
  {
    // Fetch the annotation bound to the "entity" label on the LHS
    gate.AnnotationSet entityAS = (gate.AnnotationSet)bindings.get("entity");
    gate.Annotation entityAnn = (gate.Annotation)entityAS.iterator().next();
    // Record the original annotation type as a feature
    gate.FeatureMap features = Factory.newFeatureMap();
    features.put("type", entityAnn.getType());
    // Create a generic Entity annotation over the same span
    outputAS.add(entityAnn.getStartNode(), entityAnn.getEndNode(), "Entity", features);
  }

19/(20) Finding Examples
ANNIE for default NE rules: gate/src/gate/resources/creole/NEtransducer/NE/
MUSE for more complex NE rules: muse/src/muse/resources/grammar/main
h-TechSight for ontology population: htechsight/application/grammar
Various other applications generally follow the format: projectname/application/grammar/

20/(20) Conclusion This talk: More information: