
Using Human Language Technology for Automatic Annotation and Indexing of Digital Library Content
Kalina Bontcheva, Diana Maynard, Hamish Cunningham, Horacio Saggion
University of Sheffield

The Challenge
- Lower the cost of annotating document collections with metadata and semantic information
- New ways to access digital collections via indexes of events, people, etc.
- The solution: use Human Language Technology (HLT) that requires little or no adaptation to the types of texts being processed

(Semi-)Automatic Annotation with Semantic Information
Old Bailey – 18th century English Collection

Indexing and Search by Semantic Content
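One way to picture indexing by semantic content is an inverted index keyed on annotation type and entity string rather than on raw words. The sketch below is only an illustration under that assumption: the function names, the document ids and the sample annotations are all hypothetical and do not describe how the system in the slides is implemented.

```python
from collections import defaultdict

def build_entity_index(annotated_docs):
    """Map (annotation type, lower-cased entity string) to the set of document ids."""
    index = defaultdict(set)
    for doc_id, annotations in annotated_docs.items():
        for ann_type, surface in annotations:
            index[(ann_type, surface.lower())].add(doc_id)
    return index

def search(index, ann_type, surface):
    """Return the sorted ids of documents that contain the entity with the given type."""
    return sorted(index.get((ann_type, surface.lower()), set()))

# Hypothetical annotations, as an upstream entity recogniser might produce them.
SAMPLE_DOCS = {
    "d1": [("Person", "Bush"), ("Location", "Canada")],
    "d2": [("Person", "Bush")],
    "d3": [("Plant", "bush")],
}
SAMPLE_INDEX = build_entity_index(SAMPLE_DOCS)
```

Because the annotation type is part of the key, a query for the person "Bush" does not return the document that only mentions the plant, which a plain keyword index could not distinguish.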

Information Extraction Technology
- Identify named entities (domain independent)
  - Persons
  - Dates
  - Numbers
  - Organizations
- Identify domain-specific events and terms
  - Players
  - Teams
  - Events: goal, foul, etc.

Question: Which of these tools and Human Language Technology (HLT) can I use in other digital library applications?
- Without modification in any domain
- With domain-specific customisations

Domain-Independent Named Entity Recognition
- Specifically designed for many genres and domains
- Works on a variety of document formats
- Recognises person names, dates, numbers, organisations, monetary expressions, etc.
- Annotations can be exported as document markup (e.g. XML) for further processing and/or storage, or indexed in Oracle
- Multilingual support via Unicode
- Support for distributed documents, e.g. on the WWW

Domain-Independent Named Entity Recognition (2)
- Low-overhead customisation possible by non-computer scientists
- Used successfully in a number of projects, including adaptation to new languages (Bengali, Bulgarian, etc.)
- Publicly available, Java-based modules at gate.ac.uk as part of Sheffield’s General Architecture for Text Engineering (GATE)

Named Entity Annotated Example
President visit
President Bush will visit Canada in the June. Bush is expected to…
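To make the idea of an annotated text concrete, here is a minimal sketch that finds entities with a tiny name list and a month pattern, stores them as character offsets, and exports them as inline XML. It is not GATE code: the gazetteer entries, the month-only date pattern and the element names are assumptions chosen to fit a sentence modelled on the example above.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical toy gazetteer; real name lists are far larger.
GAZETTEER = {
    "Person": ["President Bush", "Bush"],
    "Location": ["Canada"],
}
# Assumption: month names stand in for full date recognition.
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\b"
)

def annotate(text):
    """Return non-overlapping (start, end, type) annotations sorted by start offset."""
    candidates = []
    for ann_type, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                candidates.append((m.start(), m.end(), ann_type))
    for m in DATE_PATTERN.finditer(text):
        candidates.append((m.start(), m.end(), "Date"))
    # Earlier matches first; at the same start offset, the longer match wins.
    candidates.sort(key=lambda a: (a[0], -(a[1] - a[0])))
    result, last_end = [], 0
    for start, end, ann_type in candidates:
        if start >= last_end:  # skip anything overlapping an accepted annotation
            result.append((start, end, ann_type))
            last_end = end
    return result

def to_inline_xml(text, annotations):
    """Wrap each annotated span in an element named after its type."""
    parts, pos = [], 0
    for start, end, ann_type in annotations:
        parts.append(escape(text[pos:start]))
        parts.append("<%s>%s</%s>" % (ann_type, escape(text[start:end]), ann_type))
        pos = end
    parts.append(escape(text[pos:]))
    return "".join(parts)
```

Keeping the annotations as offsets separate from the text means the same result can be written out as markup, as here, or handed to an indexer without re-running recognition.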

Correcting the Computer’s Mistakes
- Less time-consuming than full manual annotation
- 85-90% correctness is sometimes enough
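The slide does not say how correctness is measured. One common choice, assumed here, is to compare the system's annotations with the human-corrected ones and report precision and recall over exact matches of offsets and type. The function name and the sample annotations below are hypothetical.

```python
def precision_recall(system, gold):
    """Exact-match precision and recall for (start, end, type) annotations."""
    system_set, gold_set = set(system), set(gold)
    correct = len(system_set & gold_set)
    precision = correct / len(system_set) if system_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall

# Hypothetical system output: one wrong type and one missed entity.
SYSTEM = [(0, 14, "Person"), (26, 32, "Location"), (36, 40, "Person")]
# The same passage after a human has corrected the annotations.
GOLD = [(0, 14, "Person"), (26, 32, "Location"), (36, 40, "Date"), (42, 46, "Person")]
```

In this sample the annotator only has to fix one label and add one entity, which illustrates why correcting automatic output can be quicker than annotating from scratch.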

Other Human Language Technology
- Automatic speech recognition can be used in combination with IE to annotate sound/video material; results improve with training
- Domain-specific terms and events can be annotated by modifying the linguistic resources of the IE modules or training them on human-marked texts

Building and Customising HLT Modules for New Domains/Applications
- Facilitated by existing tools such as the graphical development environment provided by GATE
- GATE comes with a useful starting set
  - Tokeniser
  - Gazetteer list lookup
  - Sentence detection module
  - Part-of-speech tagging module
  - A pattern-matching engine with grammars
  - Information Retrieval support, etc.
- Try for free from
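The starting set listed above is typically run as a pipeline, where each module adds annotations that later modules can use. The sketch below illustrates that idea only, with three deliberately naive steps; it does not use or imitate the GATE API, and every function name, the dictionary layout and the toy gazetteer are assumptions.

```python
import re
from functools import partial

def tokenise(doc):
    """Store words and single punctuation marks as (start, end) offsets."""
    doc["tokens"] = [(m.start(), m.end())
                     for m in re.finditer(r"\w+|[^\w\s]", doc["text"])]
    return doc

def split_sentences(doc):
    """Naive splitter: a sentence runs up to the next '.', '!' or '?'."""
    spans = []
    for m in re.finditer(r"[^.!?]+[.!?]?", doc["text"]):
        chunk = m.group()
        if chunk.strip():
            lead = len(chunk) - len(chunk.lstrip())  # skip leading whitespace
            spans.append((m.start() + lead, m.end()))
    doc["sentences"] = spans
    return doc

def gazetteer_lookup(doc, gazetteer):
    """Tag single tokens found in the gazetteer; relies on tokenise having run."""
    text = doc["text"]
    doc["lookups"] = [(start, end, gazetteer[text[start:end]])
                      for start, end in doc["tokens"]
                      if text[start:end] in gazetteer]
    return doc

def run_pipeline(text, steps):
    """Apply each processing step in order to a shared document record."""
    doc = {"text": text}
    for step in steps:
        doc = step(doc)
    return doc

# Hypothetical single-word gazetteer for the demonstration.
TOY_GAZETTEER = {"Canada": "Location", "June": "Month"}
PIPELINE = [tokenise, split_sentences,
            partial(gazetteer_lookup, gazetteer=TOY_GAZETTEER)]
```

Customising for a new domain in this picture means swapping the gazetteer or adding a step, rather than rewriting the whole chain.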

Why Are Digital Libraries Good for HLT?
- Digital libraries are challenging for HLT as they require robustness and scalability
- Cultural heritage DLs are particularly challenging as they pose new types of problems
- Example: nouns in 18th century English texts were capitalised, so the NE recogniser had to deal with less reliable orthographic information
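The capitalisation problem can be shown with a small sketch. A common heuristic treats capitalised words that are not sentence-initial as name candidates; on text where ordinary nouns are capitalised too, it over-generates. Both sentences below are invented for illustration and are not taken from the Old Bailey collection, and filtering against a name list is just one possible compensation, not necessarily what the authors did.

```python
import re

def capitalised_candidates(text):
    """Return capitalised words that do not start a sentence."""
    candidates = []
    for m in re.finditer(r"[A-Za-z]+", text):
        word = m.group()
        before = text[:m.start()].rstrip()
        sentence_initial = m.start() == 0 or before.endswith((".", "!", "?"))
        if word[0].isupper() and not sentence_initial:
            candidates.append(word)
    return candidates

def filter_with_name_list(candidates, known_names):
    """Keep only candidates confirmed by a name list (hypothetical fallback)."""
    return [w for w in candidates if w in known_names]

# Invented sentences: modern orthography versus capitalised common nouns.
MODERN = "The prisoner took a watch from John Smith in the street."
OLD_STYLE = "The Prisoner took a Watch from John Smith in the Street."
```

On the modern sentence the heuristic returns only the two name words, while on the old-style sentence it also returns the capitalised common nouns, so capitalisation alone stops being a reliable cue.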

Further information
- Demos: contact me during a coffee break
- Web:
- Try NE recognition online: