GATE, a General Architecture for Text Engineering Hamish Cunningham Department.

Presentation transcript:

GATE, a General Architecture for Text Engineering. Hamish Cunningham, Department of Computer Science, University of Sheffield. ENST, Paris, 20/1/2003. Natural Language Engineering in Sheffield: one of the largest Human Language Technology groups in the EU; 50 staff in Language and Speech Processing and 25 in Information Retrieval, including 6 professors; a focus on scientific method in AI (participation in all the leading quantitative evaluation programmes in the US); a focus on engineering high-quality open-source software for applications and demonstrators.

2(27) GATE is… An architecture: a macro-level organisational picture for LE software systems. A framework: for programmers, GATE is an object-oriented class library that implements the architecture. A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction. Free software (LGPL). Mature, robust software (in development since 1995). Available for download. Comes with: some free components, and wrappers for other people's components. Tools for: evaluation; visualisation/editing; persistence; IR; IE; dialogue; ontologies; etc.

3(27) Applications; languages. GATE has been used for a variety of applications, including: MUMIS: automatic creation of semantic indexes for multimedia programme material; MUSE: a multi-genre IE system; EMILLE: a 70 million word corpus of Indic languages; metadata for Medline (at Merck); creation of metadata for Semantic Web Services, and documentation using NLG; HSE: summarisation of health and safety information from company reports; OldBaileyIE: NE recognition on 17th century Old Bailey Court reports; AKT: language technology in knowledge management; AMITIES: call centre automation; digital libraries / e-philology for ancient languages researchers; various Medical Informatics and database technology projects; IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian next year).

4(27) Some users… At the time of writing, a representative fraction of GATE users includes: Longman Pearson publishing, UK; BT Exact Technologies, UK; Merck KGaA, Germany; Canon Europe, UK; Knight Ridder (the second biggest US news publisher); BBN Technologies, US; Sirma AI Ltd., Bulgaria; Resco AB, Sweden/Finland/Germany; GlaxoSmithKline plc (drug-based navigation of Medline abstracts); Master Foods NV (extraction of commodities events from news); the American National Corpus project, US; Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU universities; the Perseus Digital Library project, Tufts University, US.

5(27) Architectural principles. Non-prescriptive and theory-neutral (a strength and a weakness). Re-use and interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka). (Almost) everything is a component, and component sets are user-extendable. Component-based development: an OO way of chunking software, as in Java Beans. GATE components: CREOLE (Collection of REusable Objects for Language Engineering) = modified Java Beans. The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.

6(27) GATE Language Resources. GATE LRs are documents, ontologies, corpora, lexicons, … Documents / corpora: GATE documents are loaded from local files or the web. Diverse document formats: text, HTML, XML, RTF, SGML. Processing Resources: algorithmic components, known as PRs, are beans with execute methods. All PRs can handle Unicode data by default. Clear distinction between code and data (simple repurposing). Freebies with GATE include: named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene.
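
The split between language resources (data) and processing resources (beans with an execute method) can be illustrated with a minimal sketch. It is written in Python for brevity and does not use the GATE class library; the names Annotation, Document, Tokeniser, Gazetteer and run_pipeline are hypothetical stand-ins chosen for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Stand-off annotation: character offsets, a type and a feature map."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    """Hypothetical language resource: text plus its annotations."""
    text: str
    annotations: list = field(default_factory=list)


class Tokeniser:
    """Hypothetical processing resource: one Token per whitespace-separated word."""

    def execute(self, doc):
        for m in re.finditer(r"\S+", doc.text):
            # Simplification: only the first character is inspected.
            orth = "upperInitial" if m.group()[0].isupper() else "lowercase"
            doc.annotations.append(
                Annotation(m.start(), m.end(), "Token", {"orthography": orth}))


class Gazetteer:
    """Hypothetical processing resource: the word list is data, kept out of the code."""

    def __init__(self, entries):
        self.entries = entries  # maps a word to a lookup kind

    def execute(self, doc):
        tokens = [a for a in doc.annotations if a.type == "Token"]
        for tok in tokens:
            word = doc.text[tok.start:tok.end]
            if word in self.entries:
                doc.annotations.append(
                    Annotation(tok.start, tok.end, "Lookup",
                               {"kind": self.entries[word]}))


def run_pipeline(doc, resources):
    """Run each processing resource over the document, in order."""
    for pr in resources:
        pr.execute(doc)
    return doc


doc = run_pipeline(
    Document("Acme Widgets Ltd hired staff"),
    [Tokeniser(), Gazetteer({"Ltd": "companyDesignator"})],
)
```

Because the gazetteer entries are passed in as data, the same component could be repurposed for another domain by swapping the word list, which is the kind of code/data separation the slide describes.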

7(27) Visual Resources

8(27) Displaying Coreference Information

9(27) Displaying Syntactic Information

10(27) Lexicon Support – WordNet example

11(27) A Language Analysis Example [Diagram labels: GATE format handlers (HTML docs, RTF docs, XML docs, …); ANNIE (POS tagger, named entity, coreference, …); custom application 1 (named entity, event extraction, …); document content, document metadata, document format data, linguistic data; file storage; relational database (Oracle/PostgreSQL).]

12(27) Building IE Components in GATE (1) The ANNIE system – a reusable and easily extendable set of components

13(27) Building IE Components in GATE (2) JAPE: a Java Annotation Patterns Engine. Light, robust regular-expression-based processing; cascaded finite state transduction; low-overhead development of new components. Example rule:

    Rule: Company1
    Priority: 25
    (
      ( {Token.orthography == upperInitial} )+
      {Lookup.kind == companyDesignator}
    ):companyMatch
    -->
    :companyMatch.NamedEntity = { kind = company, rule = "Company1" }
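
To make the effect of the Company1 rule concrete, the following Python sketch reproduces what the pattern matches: one or more tokens with an upper-case initial, followed by a company designator. It is not JAPE and does not model the finite state engine or rule priorities; the token dictionaries, the lookup_kind key and the function apply_company1 are assumptions made for this illustration.

```python
def apply_company1(tokens):
    """Return NamedEntity records for spans matching the Company1 pattern.

    tokens: list of dicts with 'text', 'orthography' and, optionally,
    'lookup_kind' (a simplification: Token and Lookup are merged here).
    Spans are token indices, end-exclusive.
    """
    entities = []
    for d, tok in enumerate(tokens):
        if tok.get("lookup_kind") != "companyDesignator":
            continue
        # Walk left over the contiguous run of upperInitial tokens.
        start = d
        while (start > 0
               and tokens[start - 1].get("orthography") == "upperInitial"
               and tokens[start - 1].get("lookup_kind") != "companyDesignator"):
            start -= 1
        if start < d:  # the pattern needs at least one token before the designator
            entities.append({"start": start, "end": d + 1,
                             "kind": "company", "rule": "Company1"})
    return entities


tokens = [
    {"text": "shares", "orthography": "lowercase"},
    {"text": "in", "orthography": "lowercase"},
    {"text": "Acme", "orthography": "upperInitial"},
    {"text": "Widgets", "orthography": "upperInitial"},
    {"text": "Ltd", "orthography": "upperInitial",
     "lookup_kind": "companyDesignator"},
    {"text": "rose", "orthography": "lowercase"},
]
entities = apply_company1(tokens)
matched_text = " ".join(t["text"] for t in tokens[entities[0]["start"]:entities[0]["end"]])
```

In this example the single entity covers the tokens "Acme Widgets Ltd", and it carries the same kind and rule features that the right-hand side of the rule assigns.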

14(27) Performance Evaluation. At document level: annotation diff. At corpus level: corpus benchmark tool, for tracking a system's performance over time.
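
A document-level comparison of this kind can be sketched as a set difference between a human-annotated key and the system's response, summarised as precision, recall and F1. The sketch below is an illustration in Python, not GATE's own tool: it counts only exact matches of offsets and type (partial overlaps are treated as errors, a simplification), and the function name annotation_diff and the sample offsets are hypothetical.

```python
def annotation_diff(key, response):
    """Compare two collections of (start, end, type) annotations.

    Only exact matches count as correct.
    """
    key_set, resp_set = set(key), set(response)
    correct = len(key_set & resp_set)
    precision = correct / len(resp_set) if resp_set else 0.0
    recall = correct / len(key_set) if key_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "correct": correct,
        "missing": len(key_set - resp_set),    # in the key, not found by the system
        "spurious": len(resp_set - key_set),   # produced by the system, not in the key
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


key = [(0, 16, "Organization"), (24, 30, "Person"), (40, 46, "Location")]
response = [(0, 16, "Organization"), (24, 30, "Location"), (50, 55, "Person")]
scores = annotation_diff(key, response)
```

Here the system gets one of three key annotations exactly right, so precision and recall are both one third; the mistyped span at offsets 24 to 30 counts once as missing and once as spurious.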

15(27) Regression Testing – Corpus Benchmark Tool
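
Regression testing at corpus level amounts to comparing the scores of the current system version against stored scores from an earlier run. The Python sketch below shows one way to flag documents whose score dropped; the function find_regressions, the tolerance parameter and the sample F1 values are assumptions made for this illustration and do not describe the Corpus Benchmark Tool's actual interface.

```python
def find_regressions(baseline, current, tolerance=0.01):
    """Return documents whose score fell by more than `tolerance`.

    baseline, current: dicts mapping a document id to an F1 score.
    Documents missing from `current` are skipped rather than flagged.
    """
    regressions = {}
    for doc_id, old_score in baseline.items():
        new_score = current.get(doc_id)
        if new_score is None:
            continue
        if new_score < old_score - tolerance:
            regressions[doc_id] = (old_score, new_score)
    return regressions


baseline = {"doc1": 0.90, "doc2": 0.80, "doc3": 0.75}
current = {"doc1": 0.91, "doc2": 0.70, "doc3": 0.745}
regressions = find_regressions(baseline, current)
```

With these sample scores only doc2 is reported: doc1 improved, and the small drop on doc3 stays within the tolerance.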

16(27) The Semantic Web and GATE. GATE is being used for development of (semi-)automatic methods for: linking web pages to ontologies using Information Extraction; learning and evolving ontologies via IE and lexical semantic network traversal.

17(27) Populating Ontologies with IE

18(27) Protégé and Ontology Management

19(27) Information Retrieval Support Based on the Lucene IR engine

20(27) Editing Multilingual Data. GATE Unicode Kit (GUK): Java provides no special support for text input (this may change). Support for defining additional Input Methods (IMs); currently 30 IMs for 17 languages. Pluggable in other applications.

21(27) Processing Multilingual Data All the visualisation and editing tools for ML LRs use enhanced Java facilities:

22(27) Dialogue Systems. GATE is being used in the Amities project for automating call centres. Creation of dialogue processing server components to run in the Galaxy Communicator architecture. Easy adaptation of the portable IE components to work on noisy ASR output. Robustness and speed of GATE components are vital for real-time dialogue systems.

23(27) The MUMIS project: Multimedia Indexing and Searching Environment. Composite index of a multimedia programme from multiple sources in different languages. ASR, video processing, information extraction (Dutch, English, German), merging, user interface. Partners: University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA. Yorick Wilks, Hamish Cunningham, Horacio Saggion, Kalina Bontcheva, Diana Maynard, Oana Hamza, Cristian Ursu.

24(27) The Whole Picture [Diagram labels: sources (EN, DE and NL formal texts); speech signals; ASR; transcriptions; IE; annotations; merging; final annotations; ontology & lexicon; multimedia database; video & audio signal; user interface; query; results.]

25(27) User Interface

26(27) Play

27(27) Conclusion. GATE: an infrastructure that lowers the overhead of creating and embedding robust NLP components. Further information: online demos, tutorials and documentation; software downloads; talks and papers.