Published by Deshaun Beacham. Modified over 10 years ago.
1
ANNIE and JAPE
GATE Training Course, 23 November 2006
Diana Maynard, Andrey Shafirin
2
GATE and Information Extraction
● Basic introduction to IE and GATE
● Overview of ANNIE
● JAPE: rule writing
● JAPE debugger
3
GATE and IE
● IE is one of the core tasks GATE is designed for
● IE is the basis for many other, more complex applications, e.g. semantic annotation
● The cornerstone of IE is Named Entity Recognition
4
A Typical IE System
1. Pre-processing
– format detection
– tokenisation
– word segmentation
– sense disambiguation
– sentence splitting
– POS tagging
2. Named entity detection
– entity detection
– coreference
5
Two Approaches to IE
Knowledge Engineering
● rule based
● developed by experienced language engineers
● makes use of human intuition
● obtains marginally better performance
● development can be very time consuming
● some changes may be hard to accommodate
Learning Systems
● use statistics or other machine learning
● developers do not need language engineering (LE) expertise
● require large amounts of annotated training data
● some changes may require re-annotation of the entire training corpus
6
Named Entity Recognition
● NER involves identification of proper names in texts, and classification into a set of predefined categories of interest.
● Three universally accepted categories: person, location and organisation
● Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc.), email addresses etc.
● Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
7
ANNIE
[Figure: ANNIE IE modules. Input: URL or text in a document format (XML, HTML, SGML, email, …), converted to a GATE Document. Processes: Unicode Tokeniser, FS Gazetteer Lookup, Sentence Splitter, Hepple POS Tagger, Semantic Tagger, Ortho Matcher, Pronominal Coreferencer. Data: Character Class Sequence Rules, Lists, JAPE Sentence Patterns, Brill Rules, Lexicon, JAPE IE Grammar Cascade, JAPE Grammar. Output: GATE Document, XML dump of IE Annotations. NOTE: square boxes are processes, rounded ones are data.]
8
Unicode Tokeniser
Bases tokenisation on Unicode character classes
Language-independent tokenisation
Declarative token specification language, e.g.:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;
Look at the ANNIE English tokeniser and at tokenisers for other languages (in the plugins directory) for more information and examples
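The tokeniser rule above can be read as "one uppercase letter followed by zero or more lowercase letters". A minimal sketch of that test in plain Java, using the JDK's Unicode character classes rather than the GATE tokeniser itself; the class name `TokeniserSketch`, the method `orth` and the fallback label "other" are hypothetical and exist only for illustration.

```java
class TokeniserSketch {
    // Returns "upperInitial" when the word is one UPPERCASE_LETTER followed by
    // zero or more LOWERCASE_LETTER characters; "other" is a placeholder label.
    static String orth(String word) {
        if (word.isEmpty()) {
            return "other";
        }
        if (Character.getType(word.charAt(0)) != Character.UPPERCASE_LETTER) {
            return "other";
        }
        for (int i = 1; i < word.length(); i++) {
            if (Character.getType(word.charAt(i)) != Character.LOWERCASE_LETTER) {
                return "other";
            }
        }
        return "upperInitial";
    }

    public static void main(String[] args) {
        // \u00DC is a non-ASCII uppercase letter, classified the same way.
        for (String w : new String[] {"London", "london", "IBM", "\u00DCnal"}) {
            System.out.println(w + " -> " + orth(w));
        }
    }
}
```

Because the check relies on Unicode general categories instead of ASCII ranges, the same rule applies unchanged to words in other scripts that distinguish case.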
9
Gazetteer
● Set of lists compiled into Finite State Machines
● 60k entries in 80 types, inc.: organization; artifact; location; amount_unit; manufacturer; transport_means; company_designator; currency_unit; date; government_designator; ...
● Each list has attributes MajorType and MinorType (and Language):
city.lst: location: city: english
currency_prefix.lst: currency_unit: pre_amount
currency_unit.lst: currency_unit: post_amount
● Attributes are used as input to JAPE grammars
● List entries may be entities or parts of entities, or they may contain contextual information (e.g. job titles often indicate people)
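The colon-separated list definitions shown above can be split into their attributes with a few lines of code. This is a minimal sketch in plain Java, not the GATE gazetteer loader; the class `GazetteerIndexSketch`, the method `parseEntry` and the map keys are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class GazetteerIndexSketch {
    // Splits a definition line such as "city.lst: location: city: english"
    // into list file name, majorType, minorType and language.
    // Trailing fields that are absent are simply left out of the map.
    static Map<String, String> parseEntry(String line) {
        String[] parts = line.split(":");
        Map<String, String> entry = new LinkedHashMap<>();
        entry.put("list", parts[0].trim());
        if (parts.length > 1) entry.put("majorType", parts[1].trim());
        if (parts.length > 2) entry.put("minorType", parts[2].trim());
        if (parts.length > 3) entry.put("language", parts[3].trim());
        return entry;
    }

    public static void main(String[] args) {
        System.out.println(parseEntry("city.lst: location: city: english"));
        System.out.println(parseEntry("currency_prefix.lst: currency_unit: pre_amount"));
    }
}
```

Every entry found in a list would then carry that list's attributes, which is what lets a JAPE pattern test `Lookup.majorType` or `Lookup.minorType`.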
10
The Named Entity Grammar
● JAPE phases run sequentially and constitute a cascade of FSTs over annotations
● hand-coded rules applied to annotations to identify NEs
● annotations from format analysis, tokeniser, POS tagger and gazetteer modules
● use of contextual information
● rule priority based on pattern length, rule status and rule ordering
● Common entities: persons, locations, organisations, dates, addresses.
11
Orthomatcher
● Orthographic coreference between annotations in the same document, e.g. Mr Brown, James Brown
● Matching rules are invoked between annotations of the same type, or between an existing annotation and an “Unknown” annotation
● The latter is the only case where an annotation type can be changed
● Lookup tables of aliases and exceptions (i.e. overriding of matching rules)
● Also pronominal coreference (see User Guide)
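To make the "Mr Brown, James Brown" example concrete, here is a minimal sketch of a single, hypothetical matching rule in plain Java: two names of the same type are linked when their final tokens are identical. This is not the Orthomatcher's actual rule set; the class `OrthoMatchSketch`, the method `sameSurname` and the title list are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class OrthoMatchSketch {
    // Hypothetical title list: a bare title is never accepted as a surname.
    static final Set<String> TITLES =
            new HashSet<>(Arrays.asList("Mr", "Mrs", "Ms", "Dr"));

    // Returns the last whitespace-separated token of a name.
    static String lastToken(String name) {
        String[] tokens = name.trim().split("\\s+");
        return tokens[tokens.length - 1];
    }

    // Illustrative rule: names corefer when their final tokens are equal
    // and that token is not itself a title.
    static boolean sameSurname(String a, String b) {
        String la = lastToken(a);
        String lb = lastToken(b);
        return !TITLES.contains(la) && la.equals(lb);
    }

    public static void main(String[] args) {
        System.out.println(sameSurname("Mr Brown", "James Brown"));
        System.out.println(sameSurname("Mr Brown", "James Green"));
    }
}
```

A table of aliases and exceptions, as mentioned above, would sit in front of such a rule and override its result for listed pairs.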
12
JAPE: a Jolly And Pleasant Experience
● Grammars (cascades of phases)
– Phases (lists of rules)
● Rules
– LHS (patterns)
– RHS (actions)
● Priority
– Implicit
   ● longest match
   ● first mention
– Explicit
   ● priority
13
LHS of JAPE rules
● The LHS of the rule contains patterns to be matched, in the form of annotations (and optionally their attributes).
● Annotation types to be recognised must be declared at the beginning of the phase
● Annotations may be combined using traditional operators [ | * + ?]
● There is no negative operator
● More than one pattern can be matched in a single rule
● Left and right context (not to be annotated) can be matched
14
Examples of LHS patterns
({Lookup.majorType == location}) :loc
---------------------
({Token.string == "in"} | {Token.string == "by"})
({Year}) :date
--------------------
(
 ({Lookup.majorType == jobtitle}):jobtitle
 {Surname}
):person
15
RHS of JAPE rules
({Lookup.majorType == location}) :loc
-->
:loc.Location = {kind = "city", rule = "Location1"}
----------------------
(
 ({Lookup.majorType == jobtitle}):jobtitle
 {Surname}
):person
-->
:jobtitle.JobTitle = {rule = "PersonJobTitle"},
:person.Person = {kind = "Surname", rule = "PersonJobTitle"}
16
Complex RHS
● JAPE RHS is quite limited in what you can do
● But you can use any Java you like on the RHS of the rule
● Useful for e.g. removing temporary annotations and percolating and manipulating features from previous annotations
● Also means you can use JAPE for many other things apart from just creating annotations, e.g. counting things, manipulating the text, adding annotations to the document, etc.
● And you don’t have to be a Java expert to do it.
● Although it helps to have friends who are…
17
Example of using Java in a rule
Rule: FirstName
({Lookup.majorType == person_first}):person
-->
{
  // Annotations bound to the :person label on the LHS
  gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");
  gate.Annotation personAnn = (gate.Annotation)person.iterator().next();
  // Copy the Lookup's minorType into a "gender" feature
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("gender", personAnn.getFeatures().get("minorType"));
  features.put("rule", "FirstName");
  // Create a FirstPerson annotation over the matched span
  outputAS.add(person.firstNode(), person.lastNode(), "FirstPerson", features);
}
18
Available Java objects
● bindings: binding variables
● doc: GATE Document
● annotations: all GATE Document annotations
● inputAS, outputAS: phase input and output annotations
● ontology
See documentation for more details…
19
JAPE Application modes
● Brill (fires all matches)
● First (shortest match fires)
● Once (Phase exits after first match)
● All (as for Brill, but matching continues from offset following the current one, not from the end of the last match)
● Appelt (priority ordering: longest match fires, then explicit rule priority, then first defined rule fires)
Note that prioritisation only operates within a single phase, not globally
20
Application Modes
[Figure: matches of the pattern {A}+ over a run of A annotations under the Appelt, Once, Brill, First and All modes]
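The difference between the modes can be sketched for the pattern {A}+ in plain Java. This is an illustrative simulation over a string of characters, not GATE code: the class `ApplicationModeSketch` and its methods are hypothetical, spans are written as start-end character offsets, and Once is left out because it only stops the phase after the first match.

```java
import java.util.ArrayList;
import java.util.List;

class ApplicationModeSketch {
    enum Mode { APPELT, BRILL, ALL, FIRST }

    // Number of consecutive 'A' characters starting at pos (0 if none).
    static int runLength(String text, int pos) {
        int n = 0;
        while (pos + n < text.length() && text.charAt(pos + n) == 'A') {
            n++;
        }
        return n;
    }

    // Spans on which a single rule with LHS {A}+ would fire in each mode.
    static List<String> matches(String text, Mode mode) {
        List<String> spans = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int run = runLength(text, pos);
            if (run == 0) {
                pos++;
                continue;
            }
            switch (mode) {
                case APPELT: // only the longest match fires
                    spans.add(pos + "-" + (pos + run));
                    pos += run;
                    break;
                case FIRST: // the shortest match fires
                    spans.add(pos + "-" + (pos + 1));
                    pos += 1;
                    break;
                default: // BRILL and ALL: every match from pos fires
                    for (int len = 1; len <= run; len++) {
                        spans.add(pos + "-" + (pos + len));
                    }
                    // BRILL resumes after the longest match, ALL at the next offset
                    pos += (mode == Mode.BRILL) ? run : 1;
                    break;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            System.out.println(m + ": " + matches("AAA", m));
        }
    }
}
```

For the input "AAA" this gives one span under Appelt, three single-character spans under First, three spans starting at offset 0 under Brill, and six spans under All.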
21
Example: “China Sea”
Rule: Location1
Priority: 25
(
 ({Lookup.majorType == loc_key, Lookup.minorType == pre})?
 {Lookup.minorType == country}
 ({Lookup.majorType == loc_key, Lookup.minorType == post})?
) :locName
-->
:locName.Location = {kind = "location", rule = "Location1"}

Rule: Location2
Priority: 20
({Lookup.minorType == location}) :location
-->
:location.Name = {kind = "location", rule = "GazLocation"}
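How the Appelt ordering picks between two such rules can be sketched in plain Java: longest match first, then explicit priority, then order of definition. This is an illustration, not GATE's implementation; the class `AppeltChoiceSketch` and its `Candidate` type are hypothetical, and the match lengths in `main` assume that Location1 covers "China Sea" (9 characters) while Location2 covers only "China" (5 characters).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class AppeltChoiceSketch {
    // One rule that matched at the current position.
    static final class Candidate {
        final String rule;
        final int length;   // length of the matched span
        final int priority; // explicit Priority value of the rule
        final int order;    // position of the rule in the grammar file

        Candidate(String rule, int length, int priority, int order) {
            this.rule = rule;
            this.length = length;
            this.priority = priority;
            this.order = order;
        }
    }

    // Longest match wins; ties go to higher priority, then to the earlier rule.
    static Candidate choose(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator
                .comparingInt((Candidate c) -> -c.length)
                .thenComparingInt(c -> -c.priority)
                .thenComparingInt(c -> c.order));
        return sorted.get(0);
    }

    public static void main(String[] args) {
        Candidate winner = choose(Arrays.asList(
                new Candidate("Location1", 9, 25, 0),
                new Candidate("Location2", 5, 20, 1)));
        System.out.println(winner.rule);
    }
}
```

Under these assumptions the longer match decides the outcome on its own; the Priority values would only matter if both rules covered the same span.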
22
JAPE Hints and Tricks
● JAPE is quite limited in some respects as to what can be done
– There is no negative operator
– It can be slow if it is badly written, e.g. ({Token})*
– Context is consumed, which can make rule-writing awkward
– Priority can be difficult to set correctly
● But fear not, there is generally a sneaky way around it…
23
How to prevent a pattern from matching
Rule: disablePattern
Priority: 1000
( ) --> {}
● Instead of having a negative operator, we can simply add a high-priority rule which does nothing when fired.
● In Appelt mode this rule is preferred to a lower-priority rule that performs the intended action, so the latter fires only where the blocking pattern does not apply.
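The effect of such a do-nothing rule can be sketched in plain Java. This is an illustration only, with hypothetical names throughout: `blockerMatches` stands in for the high-priority pattern (here assumed to be the capitalised words "The" and "In"), and `nameMatches` stands in for a lower-priority rule that would annotate any capitalised token.

```java
import java.util.ArrayList;
import java.util.List;

class BlockingRuleSketch {
    // Stand-in for the LHS of the high-priority rule with the empty RHS.
    static boolean blockerMatches(String token) {
        return token.equals("The") || token.equals("In");
    }

    // Stand-in for the LHS of the lower-priority rule that does the real work.
    static boolean nameMatches(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    // Returns the tokens that the lower-priority rule ends up annotating.
    static List<String> annotate(String[] tokens) {
        List<String> names = new ArrayList<>();
        for (String t : tokens) {
            if (blockerMatches(t)) {
                continue; // blocker wins on priority and its empty RHS adds nothing
            }
            if (nameMatches(t)) {
                names.add(t);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(annotate(new String[] {"The", "Kellog", "school", "In", "Indiana"}));
    }
}
```

Both stand-in patterns cover a single token here, so the choice between them comes down to priority alone, which is the situation the trick relies on.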
24
How to play with input annotations
Input: Person Organisation VerbWork Split …
Rule: RelationWorkIn
({Person} {VerbWork} {Organisation})
-->
{… /* create annotation of type “Relation” */ …}
● Use existing annotations to find relations
● We ignore Tokens to enable more flexibility, i.e. there could be additional words between the annotations specified
● Split ensures we don’t cross sentence boundaries
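Why leaving Token out of the Input line allows intervening words, and why listing Split stops a match at a sentence boundary, can be sketched in plain Java. The document is modelled here simply as a list of annotation type names in text order; the class `InputFilterSketch` and the method `countRelations` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class InputFilterSketch {
    // Annotation types the phase can see, mirroring the Input line.
    static final Set<String> INPUT =
            new HashSet<>(Arrays.asList("Person", "Organisation", "VerbWork", "Split"));

    // Counts Person VerbWork Organisation sequences among visible annotations.
    static int countRelations(List<String> annotationTypes) {
        // Types not listed as input (e.g. Token) are invisible to the pattern.
        List<String> visible = new ArrayList<>();
        for (String type : annotationTypes) {
            if (INPUT.contains(type)) {
                visible.add(type);
            }
        }
        int count = 0;
        for (int i = 0; i + 2 < visible.size(); i++) {
            if (visible.get(i).equals("Person")
                    && visible.get(i + 1).equals("VerbWork")
                    && visible.get(i + 2).equals("Organisation")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countRelations(Arrays.asList(
                "Person", "Token", "VerbWork", "Token", "Token", "Organisation")));
        System.out.println(countRelations(Arrays.asList(
                "Person", "VerbWork", "Split", "Organisation")));
    }
}
```

In the second call the visible Split annotation sits between VerbWork and Organisation, so the three-annotation pattern no longer finds its elements next to each other.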
25
How to deal with overlapping annotations
● Because matched annotations are consumed, when two annotations overlap (e.g. in gazetteer lists), the second one will never be matched.
● E.g. for the string “hALCAM” with Lookups hAL, ALCAM, and CAM, ALCAM will never be matched
● Solution is to delete the annotations once matched, and then rerun the same grammar phase over the text
● The process may need to be repeated several times (the number of passes is determined by trial and error)
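The “hALCAM” case can be traced with a small simulation in plain Java. It is a sketch under simplifying assumptions, not GATE code: each Lookup is a start/end offset pair, one pass matches left to right and skips anything that starts inside text already consumed, and matched Lookups are deleted before the next pass. The class `OverlapSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class OverlapSketch {
    // One left-to-right pass; returns the lookups that get matched.
    static List<int[]> onePass(List<int[]> lookups) {
        List<int[]> sorted = new ArrayList<>(lookups);
        // Earliest start first; for equal starts, longest span first.
        sorted.sort(Comparator
                .comparingInt((int[] s) -> s[0])
                .thenComparingInt(s -> -(s[1] - s[0])));
        List<int[]> matched = new ArrayList<>();
        int pos = 0; // offset up to which the text has been consumed
        for (int[] span : sorted) {
            if (span[0] >= pos) {
                matched.add(span);
                pos = span[1];
            }
        }
        return matched;
    }

    // Deletes matched lookups and reruns the pass until none are left.
    static int passesNeeded(List<int[]> lookups) {
        List<int[]> remaining = new ArrayList<>(lookups);
        int passes = 0;
        while (!remaining.isEmpty()) {
            remaining.removeAll(onePass(remaining)); // same array objects, so removal works
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        // hAL = 0-3, ALCAM = 1-6, CAM = 3-6 within "hALCAM"
        List<int[]> lookups = Arrays.asList(
                new int[] {0, 3}, new int[] {1, 6}, new int[] {3, 6});
        System.out.println(passesNeeded(lookups));
    }
}
```

In this model the first pass matches hAL and CAM, and ALCAM is only reached on the second pass, once the other two have been deleted.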
26
More examples
● In the GATE User Guide under the section “Useful tricks with JAPE”
● Look in the ANNIE grammars and in the foreign language grammars – there are many examples of little tricks
● Check the GATE mailing list archives
27
Custom Processing Resource for your grammars
1. Java developer extends GATE's default JAPE Transducer, creating a Java class:
package com.yourcompany;
import gate.creole.Transducer;
public class CustomTransducer extends Transducer {}
2. JAPE developer adds a definition in the plugin’s creole.xml:
Resource name: My custom JAPE Transducer
Class: com.yourcompany.CustomTransducer
Parameter types: java.lang.String, java.net.URL, java.lang.String
3. GATE user opens the custom resource in the GATE GUI:
Right-click on “Processing Resources”
In the pop-up menu select “New >” --> “My custom JAPE Transducer”
28
JAPE debugger
● Speeds up the development of JAPE grammars
● Integrated in GATE GUI
● Friendly for non-experts
Allows you to:
● Inspect the pattern matching
● Find overridden rules
● Detect complex inter-rule influence
● And many other things
29
Inspection of pattern matching
30
Overridden rules
31
Inter-rule influence (finding the problem)
32
Inter-rule influence (what is that?)
33
Inter-rule influence (problem synopsis)
Text processed:
… of the J. L. Kellog Graduate School of Management and the Indiana University School of Business …
Conflicting rule:
Rule: NotPersonFull
Priority: 80
// Det + Surname
// This rule was commented course
//J.L. Kellog processed without J.
//17.06.03
(
 {Token.category == DT} |
 {Token.category == PRP} |
 {Token.category == RB}
)
(
 (PREFIX)*
 (UPPER)
 (PERSONENDING)?
):foo
Shadowed rule:
Rule: PersonFullExt
Priority: 100
// F.W. Jones Fred Jones
// Andrew "Flip" Filipowski
// Andrew J. "Flip" Filipowski
//({Token.category == DT})?
(
 ((FIRSTNAME | FIRSTNAMEAMBIG))+
 (INITIALS)?
 ((FIRSTNAME | FIRSTNAMEAMBIG) )*
 (PREFIX)*
 ((UPPER)):surname
 (PERSONENDING)?
):person
-->
34
Coming soon… JAPE4
What JAPE4 IS:
● a new version of the internal language in GATE release 4
● a language based on the original JAPE
● incorporates best practices from JAPE, Jape+ and Japec
● 3-5 times faster than JAPE
What JAPE4 IS NOT:
● an improved version of the original Jape, Jape+ or Japec, but rather a new language
● a language backward compatible with JAPE
In most cases it seems possible to modify original Jape, Jape+ or Japec grammars easily so that they comply with the JAPE4 specification.