ANNIC ANNotations In Context GATE Training Course 27 – 28 April 2006 Niraj Aswani.

Presentation transcript:


2 Motivation - I
- The need for efficient corpus indexing and querying arises frequently in both machine learning-based and human-engineered NLP systems.
- Language engineers rely on intuition when writing patterns, trying to strike the ideal balance between specificity and coverage.
- This requires a series of informed guesses, which are then validated by testing the resulting rule set over a corpus. (Isn’t it painful?)

3 Motivation - II
- Needed: a system that can query the information contained in a corpus more flexibly than simple full-text search (e.g. identifying share movements like “BT shares ended up 36p”).
- Required: a system that can index and query both linguistic metadata and document content in a flexible way, and that also allows the derived rule set to be validated with minimal effort.

4 ANNIC - ANNotations In Context
- What can be indexed? Documents in any format supported by GATE (e.g. XML, HTML, RTF, text, etc.)
- Indexing of linguistic metadata: extensive indexing of document content and of the linguistic information (annotations and features) associated with it, independent of document format
- Powered by? Apache Lucene technology
- Description: full-featured annotation indexing and search engine, developed as part of GATE

5 ANNIC - ANNotations In Context
- What is special? Indexing and extraction of information from overlapping annotations and features
- Result? Matching texts in the corpus, displayed within the context of linguistic annotations (and not just text, as is customary for KWIC systems)
- Interface? An advanced GUI provides a graphical view of annotation mark-up over the text, along with the ability to build new queries interactively
- Where to use? As a first step in rule development for NLP systems, since it enables the discovery and testing of patterns in corpora

6 GATE Documents
- The format of a document is analysed and converted into a single unified model of annotations.
- Documents and corpora are encoded in the form of annotations.
- The annotations associated with each document are a structure central to GATE.
- Each annotation consists of
  - a start offset
  - an end offset
  - a set of features associated with it
  - each feature has a name and an associated value
- Various processing resources are available to annotate documents.
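The annotation model on this slide can be sketched in a few lines of Python. This is only an illustration, not GATE's actual Java API: the class name `Annotation`, the helper functions and the sample sentence are assumptions made for the example, and the `type` field is added because GATE annotations also carry a type (such as Token or Person) alongside their offsets and features.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Illustrative stand-in for a GATE annotation (not the real API)."""
    start: int                 # start offset into the document text
    end: int                   # end offset (exclusive)
    type: str                  # annotation type, e.g. "Token" or "Person"
    features: dict = field(default_factory=dict)  # feature name -> value


def covered_text(text, annotation):
    """Return the stretch of document text the annotation spans."""
    return text[annotation.start:annotation.end]


def annotations_at(annotations, offset):
    """Return every annotation covering the given character offset."""
    return [a for a in annotations if a.start <= offset < a.end]


# Hypothetical sample document with overlapping annotations.
text = "Niraj works from Sheffield"
annotations = [
    Annotation(0, 5, "Token", {"string": "Niraj", "orth": "upperInitial"}),
    Annotation(0, 5, "Person"),
    Annotation(6, 11, "Token", {"string": "works", "orth": "lowercase"}),
]

overlapping_types = [a.type for a in annotations_at(annotations, 2)]
```

Because annotations only point at offsets, several of them (here a Token and a Person) can cover the same text without altering it, which is what makes the overlapping-annotation queries described earlier possible.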

7 The Pattern Syntax
- ANNIC indexes documents together with their annotations and features, and lets users issue queries written as the LHS part of a JAPE pattern/action rule, e.g. {Person} {Token.string=="from"} {Organization}
- JAPE: Java Annotation Patterns Engine in GATE
  - It executes the JAPE grammar phases; each phase consists of regular-expression pattern/action rules over annotations
  - The LHS represents an annotation pattern, e.g. {Title}{Token.orth=="upperInitial"}
  - The RHS describes the action to be taken when the pattern is found, e.g. annotate the above pattern as a Person
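To make the query syntax concrete, here is a minimal Python sketch that parses a pattern of the form shown on this slide and matches it against a toy document. It is not how ANNIC or JAPE is implemented: the regular expression, the function names and the simplified document model (one list of annotations per token position, every annotation spanning exactly one token) are all assumptions made for illustration.

```python
import re

# Matches {Type} or {Type.feature=="value"}; a deliberately small subset.
ELEMENT = re.compile(r'\{(\w+)(?:\.(\w+)\s*==\s*"([^"]*)")?\}')


def parse_pattern(pattern):
    """Turn a pattern string into (type, feature, value) constraints."""
    return [m.groups() for m in ELEMENT.finditer(pattern)]


def satisfies(annotation, constraint):
    """Check one annotation against one parsed constraint."""
    ann_type, feature, value = constraint
    if annotation["type"] != ann_type:
        return False
    return feature is None or annotation["features"].get(feature) == value


def find_matches(pattern, positions):
    """Return the token positions at which the whole pattern matches."""
    constraints = parse_pattern(pattern)
    if not constraints:
        return []
    hits = []
    for i in range(len(positions) - len(constraints) + 1):
        if all(any(satisfies(a, c) for a in positions[i + k])
               for k, c in enumerate(constraints)):
            hits.append(i)
    return hits


# Hypothetical document: "Thanks Niraj from Sheffield"
positions = [
    [{"type": "Token", "features": {"string": "Thanks"}}],
    [{"type": "Token", "features": {"string": "Niraj"}},
     {"type": "Person", "features": {}}],
    [{"type": "Token", "features": {"string": "from"}}],
    [{"type": "Token", "features": {"string": "Sheffield"}},
     {"type": "Organization", "features": {}}],
]

query = '{Person} {Token.string=="from"} {Organization}'
constraints = parse_pattern(query)
hits = find_matches(query, positions)
```

With this sample data the query matches once, starting at the second token. Real annotations can span several tokens, which this sketch does not attempt to handle.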

8 Kleene Operators
- ANNIC supports two Kleene operators, “+” and “*”
  - ({A})+n: between one and n occurrences of annotation {A}
  - ({A})*n: between zero and n occurrences of annotation {A}
- It also supports the | (OR) operator
  - {A}({B} | {C}) ⇒ {A}{B} | {A}{C}
  - {A} ({B} | {C})+2 ⇒ ({A} ({B} | {C})) | ({A} ({B} | {C}) ({B} | {C})) ⇒ ({A}{B}) | ({A}{C}) | ({A}{B}{B}) | ({A}{B}{C}) | ({A}{C}{B}) | ({A}{C}{C})
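The expansion on this slide can be reproduced mechanically. The Python sketch below enumerates the explicit annotation sequences behind a bounded “+” or “*” combined with OR; the function names are hypothetical, and this illustrates the semantics only, not ANNIC's actual query processing.

```python
from itertools import product


def bounded_repeat(alternatives, low, high):
    """All sequences built from `low` to `high` picks among `alternatives`."""
    sequences = []
    for count in range(low, high + 1):
        sequences.extend(product(alternatives, repeat=count))
    return sequences


def plus(alternatives, n):
    """({X} | {Y} ...)+n : between one and n occurrences."""
    return bounded_repeat(alternatives, 1, n)


def star(alternatives, n):
    """({X} | {Y} ...)*n : between zero and n occurrences."""
    return bounded_repeat(alternatives, 0, n)


def concat(*parts):
    """Join every combination of one sequence taken from each part."""
    return [sum(combo, ()) for combo in product(*parts)]


# {A} ({B} | {C})+2
expanded = concat([("A",)], plus(["B", "C"], 2))
```

Here `expanded` holds the same six sequences as the slide, in the same order: AB, AC, ABB, ABC, ACB, ACC. Replacing `plus` with `star` adds the bare sequence A, since zero repetitions are then allowed.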

9 ANNIC PRs
ANNIC Index PR
– Allows indexing document content and metadata from a given corpus
– Parameters
  - Corpus (serialized corpus)
  - Base token annotation type (e.g. Token)
  - Annotation features to be excluded (e.g. SpaceToken)
  - Index location

10 ANNIC PRs
ANNIC Search PR
– Allows searching over indexed documents
– Parameters
  - Corpus (serialized corpus) OR one or more index locations
  - Limit (maximum number of patterns)
  - Context window (number of base tokens to show as context on each side, left and right)
  - Query (JAPE LHS pattern)
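The Limit and Context window parameters can be illustrated with a short Python sketch. It assumes matches have already been located as (start, end) positions over a list of base tokens; the function names and the sample sentence are hypothetical, and nothing here reflects the Search PR's real interface.

```python
def in_context(tokens, start, end, window):
    """Split tokens into left context, matched span and right context."""
    left = tokens[max(0, start - window):start]
    match = tokens[start:end]
    right = tokens[end:end + window]
    return left, match, right


def show_results(tokens, spans, window, limit):
    """Return at most `limit` matches, each with `window` tokens of context."""
    return [in_context(tokens, s, e, window) for s, e in spans[:limit]]


# Hypothetical base tokens and two already-found match spans.
tokens = "Yesterday BT shares ended up 36p in London".split()
spans = [(1, 6), (7, 8)]

results = show_results(tokens, spans, window=2, limit=1)
```

With a window of 2 and a limit of 1, only the first match is returned, with “Yesterday” on the left (the document start cuts the left context short) and “in London” on the right.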

11 ANNIC Viewer

12 ANNIC DEMO QUESTIONS

13 Thank You! This talk: