GATE, a General Architecture for Text Engineering Hamish Cunningham, Kalina Bontcheva Department of Computer Science, University of Sheffield Wednesday.

Presentation transcript:

GATE, a General Architecture for Text Engineering
Hamish Cunningham, Kalina Bontcheva
Department of Computer Science, University of Sheffield
Wednesday October 30th 2002
Next generation web
GATE, language technology infrastructure
1 (20)

A Ubiquitous Permeable Web
The next generation of the web must be:
ubiquitous: semantics for every device, every organisation, every individual;
permeable: allow contextual data to penetrate and persist;
companionable: able to engage with us via multiple natural modalities.
Roles for Language Technology:
discovery of semantics (ubiquity);
mediating between context and personal semantic memories (permeability);
conversing with people and the semantic web (companionableness).
2 (20)

Critical Mass for the Semantic Web
The SW: machine-processable, repurposable data to complement hypertext
But: semantics = % of the Web
How to achieve critical mass? Huge-scale automatic annotation.
Requirements:
Huge scale:
– freely available to all EU citizens
– distributed (over a Grid)
– re-purposeable (delivered as Web Services)
Portability and robustness via:
– simple and therefore shallow HLT methods
– +ve and –ve learning
– analogs of IPSEs for computer-literate users
3 (20)

Motivation for Software Infrastructure for Language Engineering
Need for scalable, reusable, and portable HLT solutions
Support for large data, in multiple media, languages, formats, and locations
Lowering the cost of creation of new language processing components
Promoting quantitative evaluation metrics via tools and a level playing field
4 (20)

Motivation (II): software lifecycle in collaborative research
Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.
Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.
Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.
Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").
Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry).
5 (20)

GATE, a General Architecture for Text Engineering
An architecture: a macro-level organisational picture for LE software systems.
A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
Some free components... and wrappers for other people's components
Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
Free software (LGPL). Download at
6 (20)

Architectural principles
Non-prescriptive, theory neutral (strength and weakness)
Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka)
(Almost) everything is a component, and component sets are user-extendable
Component-based development
An OO way of chunking software: Java Beans
GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering)
The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.
7 (20)

GATE Language Resources
GATE LRs are documents, ontologies, corpora, lexicons, ...
Documents / corpora: GATE documents loaded from local files or the web...
Diverse document formats: text, HTML, XML, RTF, SGML.
Processing Resources
Algorithmic components known as PRs: beans with execute methods.
All PRs can handle Unicode data by default.
Clear distinction between code and data (simple repurposing)
Freebies with GATE, e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene
8 (20)
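The split between language resources (data) and processing resources (beans with an execute method) can be illustrated with a minimal sketch. Everything below is hypothetical and written for illustration only: SketchDocument, SketchProcessingResource and UpperInitialMarker are not GATE classes, and the real GATE API differs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a language resource: it holds data only.
final class SketchDocument {
    final String content;
    final List<String> annotations = new ArrayList<>();

    SketchDocument(String content) {
        this.content = content;
    }
}

// Hypothetical stand-in for a processing resource: a bean with an execute method.
interface SketchProcessingResource {
    void setDocument(SketchDocument document); // bean-style setter for the data
    void execute();                            // runs the algorithm on that data
}

// Marks every whitespace-separated token that starts with an upper-case letter.
final class UpperInitialMarker implements SketchProcessingResource {
    private SketchDocument document;

    @Override
    public void setDocument(SketchDocument document) {
        this.document = document;
    }

    @Override
    public void execute() {
        if (document == null) {
            throw new IllegalStateException("document not set");
        }
        for (String token : document.content.trim().split("\\s+")) {
            if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                document.annotations.add("upperInitial:" + token);
            }
        }
    }
}

final class ComponentSketch {
    // Wires the data and the code together, then returns the annotations produced.
    static List<String> run(String text) {
        SketchDocument doc = new SketchDocument(text);
        SketchProcessingResource pr = new UpperInitialMarker();
        pr.setDocument(doc);
        pr.execute();
        return doc.annotations;
    }

    public static void main(String[] args) {
        System.out.println(run("GATE is developed in Sheffield"));
    }
}
```

Because the document carries no processing logic, the same marker can be pointed at any other document, which is the "simple repurposing" the slide refers to.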

A Language Analysis Example
[Figure: architecture diagram. Labels: HTML docs, RTF docs, XML docs; GATE Format Handlers; ANNIE; POS tagger; Named entity; Coreference; Event extraction; Custom application 1; Document content; Document metadata; Document format data; Linguistic data; File storage; Relational Database (Oracle/PostgreSQL).]


Building IE Components in GATE (1) The ANNIE system – a reusable and easily extendable set of components 11 (20)

Building IE Components in GATE (2)
JAPE: a Java Annotation Patterns Engine
Light, robust regular-expression-based processing
Cascaded finite state transduction
Low-overhead development of new components
Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+
  {Lookup.kind == companyDesignator}
):companyMatch
-->
:companyMatch.NamedEntity = { kind = company, rule = "Company1" }
12 (20)
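What the Company1 rule matches can be approximated in plain Java: one or more upper-initial tokens followed by a company designator. This is a rough sketch and not how JAPE is implemented. The whitespace tokeniser, the DESIGNATORS set (standing in for gazetteer lookups) and the class name CompanyRuleSketch are all assumptions made for illustration, and punctuation is not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CompanyRuleSketch {
    // Assumed stand-in for the gazetteer entries behind Lookup.kind == companyDesignator.
    private static final Set<String> DESIGNATORS =
            new HashSet<>(Arrays.asList("Ltd", "Inc", "plc", "GmbH"));

    // Returns each span of (upper-initial token)+ followed by a designator.
    static List<String> findCompanies(String text) {
        List<String> matches = new ArrayList<>();
        String[] tokens = text.trim().split("\\s+");
        int runStart = -1; // index of the first token in the current upper-initial run
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (DESIGNATORS.contains(token)) {
                // A designator closes the pattern only if at least one token precedes it.
                if (runStart >= 0) {
                    matches.add(String.join(" ", Arrays.copyOfRange(tokens, runStart, i + 1)));
                }
                runStart = -1;
            } else if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                if (runStart < 0) {
                    runStart = i; // start a new run of upper-initial tokens
                }
            } else {
                runStart = -1; // any other token breaks the run
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findCompanies("shares in Acme Widgets Ltd rose while Foo plc fell"));
    }
}
```

Like the rule on the slide, the sketch is greedy: a capitalised word directly before the company name (for example at the start of a sentence) is absorbed into the match.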

Performance Evaluation
At document level – annotation diff
At corpus level – corpus benchmark tool – tracking system's performance over time
13 (20)
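A document-level comparison of this kind is commonly scored with precision, recall and F-measure. The slide does not say which figures the annotation diff tool reports, so the sketch below is an assumption about one reasonable scoring scheme, using strict matching only. The class name AnnotationDiffSketch and the "start-end:Type" string encoding of annotations are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class AnnotationDiffSketch {
    // Compares system output (response) with a gold standard (key).
    // Returns {precision, recall, f1}; an annotation counts as correct
    // only if an identical annotation appears in both sets.
    static double[] score(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<>(key);
        correct.retainAll(response);
        double precision = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double recall = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<>(Arrays.asList(
                "0-4:Organization", "10-19:Location", "25-30:Person", "40-45:Organization"));
        Set<String> response = new HashSet<>(Arrays.asList(
                "0-4:Organization", "10-19:Location"));
        System.out.println(Arrays.toString(score(key, response)));
    }
}
```

In the example, both system annotations are correct but two gold annotations are missed, giving precision 1.0 and recall 0.5. Recording these figures per run is one way to track a system's performance over time at corpus level.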

The Semantic Web and GATE
GATE is being used for development of (semi-)automatic methods for:
linking web pages to ontologies using Information Extraction;
learning and evolving ontologies via IE and lexical semantic network traversal.
14 (20)

Populating Ontologies with IE

Protégé and Ontology Management

Information Retrieval Support Based on the Lucene IR engine 17 (20)

Processing Multilingual Data All the visualisation and editing tools for ML LRs use enhanced Java facilities: 18 (20)

Applications
GATE has been used for a variety of applications, including:
MUMIS: automatic creation of semantic indexes for multimedia programme material
MUSE: a multi-genre IE system
Metadata for Medline (at Merck)
ACE: participation in the Automatic Content Extraction programme
HSE: summarisation of health and safety information from company reports
OldBaileyIE: NE recognition on 17th century Old Bailey Court reports
AKT: language technology in knowledge management
AMITIES: call centre automation
Various Medical Informatics and database technology projects
IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian this autumn)
19 (20)

Conclusion
GATE: an infrastructure that lowers the overhead of creating & embedding robust NLP components
Further information:
Online demos, tutorials and documentation
Software downloads
Talks and papers
20 (20)