ANNIC ANNotations In Context GATE Training Course 27 – 28 April 2006 Niraj Aswani
2 Motivation - I
- The need for efficient corpus indexing and querying arises frequently in both machine learning-based and human-engineered NLP systems.
- Language engineers rely on intuition when writing patterns, trying to strike the ideal balance between specificity and coverage.
- This requires a series of informed guesses, which are then validated by testing the resulting rule set over a corpus. (Isn't it painful?)
3 Motivation - II
- Need: a system that allows querying the information contained in a corpus in more flexible ways than simple full-text search (e.g. identifying share movements like “BT shares ended up 36p”).
- Required: a system that can index and query both linguistic metadata and document content in a flexible way, and that allows the derived rule set to be validated with minimal effort.
4 ANNIC - ANNotations In Context
What can be indexed?
- Documents in any format supported by GATE (e.g. XML, HTML, RTF, text, etc.)
Indexing of linguistic metadata
- Extensive indexing of document content and linguistic information (annotations and features) associated with document content, independent of document format
Powered with?
- Apache Lucene technology
Description
- Full-featured annotation indexing and search engine, developed as part of GATE
5 ANNIC - ANNotations In Context
What is special?
- Indexing and extraction of information from overlapping annotations and features
Result?
- Matching texts in the corpus, displayed within the context of linguistic annotations (and not just text, as is customary for KWIC systems)
Interface?
- Advanced GUI provides a graphical view of annotation mark-up over the text, along with the ability to build new queries interactively
Where to use?
- Can be used as a first step in rule development for NLP systems, as it enables the discovery and testing of patterns in corpora
6 GATE Documents
- The format of a document is analysed and converted into a single unified model of annotations.
- Documents and corpora are encoded in the form of annotations.
- The annotations associated with each document are a structure central to GATE.
- Each annotation consists of
  - start offset
  - end offset
  - a set of features associated with it
  - each feature has a name and a corresponding value
- Various processing resources are available to annotate documents.
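The annotation model described on this slide can be sketched in a few lines of Python. This is only an illustration, not GATE's Java API: the class names `Annotation` and `Document`, the `type` field (implied by patterns such as {Person}), and the `covered_text` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Annotation:
    """One annotation: a typed span of text plus a feature map (hypothetical class)."""
    type: str    # e.g. "Token" or "Organization"
    start: int   # start offset into the document text
    end: int     # end offset (exclusive in this sketch)
    features: Dict[str, Any] = field(default_factory=dict)  # feature name -> value


@dataclass
class Document:
    """Document content plus the annotations layered over it (hypothetical class)."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        # The text an annotation refers to is recovered from its offsets.
        return self.text[ann.start:ann.end]


doc = Document("BT shares ended up 36p")
# Two annotations may cover the same span: here a Token and an Organization.
doc.annotations.append(Annotation("Token", 0, 2, {"string": "BT"}))
doc.annotations.append(Annotation("Organization", 0, 2))
```

Because annotations only point into the text by offset, any number of them can overlap without changing the document content itself.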
7 The Pattern Syntax
- ANNIC allows documents to be indexed together with their annotations and features, and lets users issue queries written as the LHS part of a JAPE pattern/action rule, e.g. {Person} {Token.string=="from"} {Organization}
- JAPE: Java Annotation Patterns Engine in GATE
  - It executes JAPE grammar phases; each phase consists of regular expression pattern/action rules over annotations
  - LHS represents an annotation pattern, e.g. {Title}{Token.orth=="upperInitial"}
  - RHS describes the action to be taken when the pattern is found, e.g. annotate the above pattern as a Person
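To make concrete what a query such as {Person} {Token.string=="from"} {Organization} asks for, here is a minimal Python sketch that scans a list of annotations directly. It is not how ANNIC evaluates queries (ANNIC answers them from its Lucene index), and the tuple layouts and the `satisfies` and `find_matches` functions are hypothetical names introduced only for this example.

```python
from typing import Dict, List, Tuple

# (type, start, end, features): a simplified stand-in for an annotation
Ann = Tuple[str, int, int, Dict[str, str]]
# (type, required features): one {Type.feature=="value"} constraint
Constraint = Tuple[str, Dict[str, str]]


def satisfies(ann: Ann, constraint: Constraint) -> bool:
    """True if the annotation has the wanted type and every wanted feature value."""
    ann_type, _, _, features = ann
    want_type, want_features = constraint
    return ann_type == want_type and all(
        features.get(name) == value for name, value in want_features.items()
    )


def find_matches(text: str, anns: List[Ann], pattern: List[Constraint]) -> List[Tuple[int, int]]:
    """Return (start, end) spans where the constraints match one after another."""
    results: List[Tuple[int, int]] = []

    def extend(pos: int, index: int, span_start: int) -> None:
        if index == len(pattern):
            results.append((span_start, pos))
            return
        for ann in anns:
            _, start, end, _ = ann
            # The next annotation must follow pos with only whitespace in between.
            gap_is_blank = start >= pos and text[pos:start].strip() == ""
            if gap_is_blank and satisfies(ann, pattern[index]):
                extend(end, index + 1, span_start)

    for ann in anns:
        if satisfies(ann, pattern[0]):
            extend(ann[2], 1, ann[1])
    return results


text = "John Smith from Acme Ltd"
anns: List[Ann] = [
    ("Token", 0, 4, {"string": "John"}),
    ("Token", 5, 10, {"string": "Smith"}),
    ("Token", 11, 15, {"string": "from"}),
    ("Token", 16, 20, {"string": "Acme"}),
    ("Token", 21, 24, {"string": "Ltd"}),
    ("Person", 0, 10, {}),
    ("Organization", 16, 24, {}),
]
query: List[Constraint] = [("Person", {}), ("Token", {"string": "from"}), ("Organization", {})]
matches = find_matches(text, anns, query)
```

In this sketch the Person and Organization annotations overlap the Token annotations beneath them, which is why a query can mix annotation types freely.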
8 Kleene Operators
- ANNIC supports two Kleene operators, “+” and “*”
  - ({A})+n matches from one up to n occurrences of annotation {A}
  - ({A})*n matches from zero up to n occurrences of annotation {A}
- Also supports the | (OR) operator
  - {A}({B} | {C}) is equivalent to
    {A}{B} | {A}{C}
  - {A} ({B} | {C})+2 is equivalent to
    ({A} ({B} | {C})) | ({A} ({B} | {C}) ({B} | {C}))
    which in turn is equivalent to
    ({A}{B}) | ({A}{C}) | ({A}{B}{B}) | ({A}{B}{C}) | ({A}{C}{B}) | ({A}{C}{C})
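Because the operators are bounded by n, any such pattern can be unrolled into a finite list of plain annotation sequences, as the expansion on this slide shows. The following Python sketch performs that unrolling; the `Element` layout and the `expand` function are assumptions made for illustration and are not part of ANNIC.

```python
from itertools import product
from typing import List, Sequence, Tuple

# (alternatives, minimum repeats, maximum repeats)
# ({B} | {C})+2 becomes (["B", "C"], 1, 2); ({A})*3 becomes (["A"], 0, 3).
Element = Tuple[Sequence[str], int, int]


def expand(pattern: Sequence[Element]) -> List[Tuple[str, ...]]:
    """Return every flat annotation sequence the bounded pattern can match."""
    per_element = []
    for alternatives, lowest, highest in pattern:
        options = []
        for count in range(lowest, highest + 1):
            # repeat=0 yields a single empty tuple, which covers the "zero" case of "*".
            options.extend(product(alternatives, repeat=count))
        per_element.append(options)
    # Pick one option per element and concatenate them in order.
    return [sum(choice, ()) for choice in product(*per_element)]


# {A} ({B} | {C})+2
sequences = expand([(["A"], 1, 1), (["B", "C"], 1, 2)])
for seq in sequences:
    print(" ".join("{" + name + "}" for name in seq))
```

The six sequences produced for {A} ({B} | {C})+2 correspond to the six alternatives listed on the slide.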
9 ANNIC PRs
ANNIC Index PR
- Allows indexing document content and metadata from a given corpus
- Parameters
  - Corpus (serialized corpus)
  - Base token annotation type (e.g. Token)
  - Annotation types and features to be excluded (e.g. SpaceToken)
  - Index location
10 ANNIC PRs
ANNIC Search PR
- Allows searching over indexed documents
- Parameters
  - Corpus (serialized corpus) OR one or more index locations
  - Limit (maximum number of patterns to return)
  - Context window (number of base tokens to show as context on each side, left and right)
  - Query (JAPE LHS pattern)
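The context window parameter can be illustrated with a short Python sketch: given the base tokens of a document and the token range of a match, it returns the surrounding tokens on each side. The `with_context` function and its arguments are hypothetical and are not the Search PR's actual interface.

```python
from typing import List, Tuple


def with_context(
    tokens: List[str], match_start: int, match_end: int, window: int
) -> Tuple[List[str], List[str], List[str]]:
    """Split tokens into (left context, match, right context).

    match_start and match_end are token indices (end exclusive); window is the
    number of base tokens to show on each side of the match.
    """
    left = tokens[max(0, match_start - window):match_start]
    match = tokens[match_start:match_end]
    right = tokens[match_end:match_end + window]
    return left, match, right


tokens = "Yesterday BT shares ended up 36p in London".split()
left, match, right = with_context(tokens, 1, 6, 1)
print(" ".join(left), "[" + " ".join(match) + "]", " ".join(right))
```

When the match sits near the start or end of the document, the context on that side is simply shorter than the requested window.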
11 ANNIC Viewer
12 ANNIC DEMO QUESTIONS
13 Thank You! This talk: