GATE, a General Architecture for Text Engineering. Hamish Cunningham, Kalina Bontcheva, Department of Computer Science, University of Sheffield.

Presentation transcript:

GATE, a General Architecture for Text Engineering
Hamish Cunningham, Kalina Bontcheva
Department of Computer Science, University of Sheffield
University of Leeds, February 20th 2003
Structure of the talk:
Introduction: Software Architecture and GATE
Examples: KT and HLT; indexing football
Ragbag of features and colourful pictures
Demo

2/29 Motivation for Software Infrastructure for Language Engineering
Need for scalable, reusable, and portable HLT solutions
Support for large data, in multiple media, languages, formats, and locations
Lowering the cost of creation of new language processing components
Promoting quantitative evaluation metrics via tools and a level playing field

3/29 Motivation (II): software lifecycle in collaborative research
Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.
Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.
Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.
Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").
Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the world over (or at least the European business travel industry).

4/29 GATE, a General Architecture for Text Engineering
An architecture: a macro-level organisational picture for LE software systems.
A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
Some free components... and wrappers for other people's components
Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
Free software (LGPL). Download at

5/29 Architectural principles
Non-prescriptive, theory neutral (strength and weakness)
Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka)
(Almost) everything is a component, and component sets are user-extendable
Component-based development
An OO way of chunking software: Java Beans
GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering)
The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.

6/29 GATE Language Resources
GATE LRs are documents, ontologies, corpora, lexicons, ...
Documents / corpora: GATE documents loaded from local files or the web...
Diverse document formats: text, HTML, XML, RTF, SGML.
Processing Resources
Algorithmic components known as PRs: beans with execute methods.
All PRs can handle Unicode data by default.
Clear distinction between code and data (simple repurposing)
Freebies with GATE, e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene
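The split between Language Resources (data) and Processing Resources (algorithms with an execute method) can be illustrated with a small sketch. It is written in Python for brevity and is not GATE's Java API: the names Annotation, Document, ProcessingResource, WhitespaceTokeniser, UpperInitialTagger and run_pipeline are all hypothetical, chosen only to show components sharing one document and adding annotations to it in turn.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the document text, with free-form features."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """Language Resource (data): the text plus annotations added so far."""
    text: str
    annotations: list = field(default_factory=list)


class ProcessingResource:
    """Algorithmic component (code): subclasses implement execute()."""

    def execute(self, doc: Document) -> None:
        raise NotImplementedError


class WhitespaceTokeniser(ProcessingResource):
    """Hypothetical PR: adds one Token annotation per whitespace-separated word."""

    def execute(self, doc):
        pos = 0
        for word in doc.text.split():
            start = doc.text.index(word, pos)
            end = start + len(word)
            doc.annotations.append(Annotation(start, end, "Token", {"string": word}))
            pos = end


class UpperInitialTagger(ProcessingResource):
    """Hypothetical PR: marks Tokens whose first character is upper case."""

    def execute(self, doc):
        for ann in doc.annotations:
            if ann.type == "Token":
                ann.features["upperInitial"] = ann.features["string"][:1].isupper()


def run_pipeline(doc, resources):
    """Run each processing resource over the same document, in order."""
    for pr in resources:
        pr.execute(doc)
    return doc
```

Because the components only communicate through the document's annotations, a tokeniser for another language could be swapped in without touching the tagger, which is one reading of the "simple repurposing" point above.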

7/29 Visual Resources

8/29 Applications
GATE has been used for a variety of applications, including:
MUMIS: automatic creation of semantic indexes for multimedia programme material
MUSE: a multi-genre IE system
EMILLE: a 70 million word corpus of Indic languages
Metadata for Medline (at Merck)
ACE: participation in the Automatic Content Extraction programme
HSE: summarisation of health and safety information from company reports
OldBaileyIE: NE recognition on 17th century Old Bailey Court reports
AKT: language technology in knowledge management
AMITIES: call centre automation
Various Medical Informatics and database technology projects
IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian next year)

9/29 Some users… At the time of writing, a representative fraction of GATE users includes: Longman Pearson publishing, UK; Merck KGaA, Germany; Canon Europe, UK; Knight Ridder (the second biggest US news publisher); BBN; Sirma AI Ltd., Bulgaria; the American National Corpus project, US; Imperial College, London, the University of Manchester, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities; the Perseus Digital Library project, Tufts University, US.

10/29 Example 1: the Knowledge Economy and Human Language
Gartner, December 2002:
taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
through 2012 more than 95% of human-to-computer information input will involve textual language
A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language
The challenge: to reconcile these two opposing tendencies

11/29 Closing the Language Loop (1)
[Diagram labels: Human Language; Formal Knowledge (ontologies and instance bases); (A)IE; OIE; CLIE; (M)NLG; Controlled Language; Semantic Web; Semantic Grid; Semantic Web Services]

12/29 Closing the Language Loop (2)
Information Extraction (IE): from NL to formal data
Adaptive IE: learning by example
Ontology-based IE: annotate to user-supplied ontology
Controlled-Language IE: simplify the interface
(Multilingual) Natural Language Generation: documentation
Cross-cutting issues:
Content Extraction vs. Information Extraction
Scaling and robustness - cf. MUSE project
Hybrid learning and knowledge-based systems

13/29 Building IE Components in GATE (1) The ANNIE system – a reusable and easily extendable set of components

14/29 Building IE Components in GATE (2)
JAPE: a Java Annotation Patterns Engine
Light, robust regular-expression-based processing
Cascaded finite state transduction
Low-overhead development of new components
Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+
  {Lookup.kind == companyDesignator}
):companyMatch
-->
:companyMatch.NamedEntity = { kind = company, rule = "Company1" }
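To make the Company1 rule concrete, the sketch below mimics its pattern in plain Python: one or more upper-initial tokens followed by a company designator yields a NamedEntity of kind company. This is an illustration only, not how JAPE is implemented. It simplifies by storing the gazetteer lookup on the token itself, it ignores rule priorities, and the helper names tok and find_companies are hypothetical.

```python
def tok(string, lookup=None):
    """Build a simplified token: its orthography plus an optional gazetteer lookup kind."""
    orthography = "upperInitial" if string[:1].isupper() else "lowercase"
    return {"string": string, "orthography": orthography, "lookup": lookup}


def find_companies(tokens):
    """Find spans matching ( {upperInitial token} )+ {companyDesignator lookup}.

    Returns one dict per match with token-index bounds (end exclusive),
    preferring the longest match from each starting position.
    """
    matches = []
    i = 0
    while i < len(tokens):
        best = None
        j = i
        while j < len(tokens):
            # A designator closes a match once at least one name token precedes it.
            if j > i and tokens[j]["lookup"] == "companyDesignator":
                best = j
            # The run of name tokens ends at the first non-upper-initial token.
            if tokens[j]["orthography"] != "upperInitial":
                break
            j += 1
        if best is None:
            i += 1
        else:
            matches.append({"kind": "company", "rule": "Company1",
                            "start": i, "end": best + 1})
            i = best + 1
    return matches
```

For the tokens of "shares in Acme Widgets Ltd rose", with "Ltd" marked as a company designator, the function returns a single match covering "Acme Widgets Ltd".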

15/29 Populating Ontologies with IE

16/29 Protégé and Ontology Management

17/29 Example 2: the MUMIS project
Multimedia Indexing and Searching Environment
Composite index of a multimedia programme from multiple sources in different languages
ASR, video processing, information extraction (Dutch, English, German), merging, user interface
University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA

18/29 The Whole Picture
[Diagram labels: Sources (EN, DE, NL); IE; Formal Text; Speech Signals; ASR; Transcriptions; Merging; Annotations; Final Annotations; Multimedia Data Base; Video & Audio Signal; User Interface; Query; Results; Ontology & Lexicon]

19/29 User Interface

20/29 Play

21/29 Developing MUMIS Components with GATE
[Diagram labels: GATE; Format Handlers; HTML docs; RTF docs; XML docs; ANNIE; POS tagger; Named entity; Coreference; Event extraction; Custom application 1; Document content; Document metadata; Document format data; Linguistic data; File storage; Relational Database; Oracle/PostgreSQL]

22/29 Ragbag (1): Performance Evaluation
At document level: annotation diff
At corpus level: corpus benchmark tool, tracking the system's performance over time
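A document-level comparison of this kind is commonly summarised as precision, recall and F-measure over a hand-annotated key set and the system's response set. The sketch below shows that arithmetic under a strict-matching assumption (an annotation counts as correct only if its span and type both agree with the key). The function name annotation_diff and the tuple layout are hypothetical, and the slides do not say which scores GATE's own tool reports.

```python
def annotation_diff(key, response):
    """Compare system annotations (response) with a hand-annotated key.

    Each annotation is a (start, end, type) tuple; matching is strict.
    """
    key_set, response_set = set(key), set(response)
    correct = len(key_set & response_set)
    # Precision: share of system annotations that are right.
    precision = correct / len(response_set) if response_set else 0.0
    # Recall: share of key annotations that the system found.
    recall = correct / len(key_set) if key_set else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "correct": correct,
        "missing": len(key_set - response_set),
        "spurious": len(response_set - key_set),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Running the same comparison over every document in a corpus and storing the scores per software version gives the kind of performance tracking over time that the corpus-level bullet describes.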

23/29 Ragbag 2: Regression Testing – Corpus Benchmark Tool

24/29 Ragbag 3: Information Retrieval Based on the Lucene IR engine

25/29 Ragbag 4: Editing Multilingual Data
GATE Unicode Kit (GUK)
Java provides no special support for text input (this may change)
Support for defining additional Input Methods (IMs)
Currently 30 IMs for 17 languages
Pluggable in other applications

26/29 Ragbag 5: Processing Multilingual Data All the visualisation and editing tools for ML LRs use enhanced Java facilities:

27/29 Ragbag 6: Dialogue Systems
GATE is being used in the Amities project for automating call centres
Creation of dialogue processing server components to run in the Galaxy Communicator architecture
Easy adaptation of the portable IE components to work on noisy ASR output
Robustness and speed of GATE components vital for real-time dialogue systems

28/29 Conclusion GATE: an infrastructure that lowers the overhead of creating & embedding robust NLP components Further information: Online demos, tutorials and documentation Software downloads Talks and papers