1() Multi-Source and MultiLingual Information Extraction Diana Maynard Natural Language Processing Group University of Sheffield, UK BCS-SIGAI Workshop,

Presentation transcript:

1() Multi-Source and MultiLingual Information Extraction Diana Maynard Natural Language Processing Group University of Sheffield, UK BCS-SIGAI Workshop, Nottingham Trent University, 12 September 2003

2() Outline Introduction to Information Extraction (IE) The MUSE system for Named Entity Recognition Multilingual MUSE Future directions

3() IE is not IR. IE pulls facts and structured information from the content of large text collections (usually corpora). IR pulls documents from large text collections (usually the Web) in response to specific keywords.

4() Extraction for Document Access. With traditional query engines, getting the facts can be hard and slow: Where has the Queen visited in the last year? Which places on the East Coast of the US have had cases of West Nile Virus? Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool. Even if results are not always accurate, they can be valuable if linked back to the original text.

5() Extraction for Document Access. For access to news: identify major relations and event types (e.g. within foreign affairs or business news). For access to scientific reports: identify principal relations of a scientific subfield (e.g. pharmacology, genomics).

6() Application Example (1) Ontotext’s KIM query and results

7() Application Example (2)

8() What is Named Entity Recognition? Identification of proper names in texts, and their classification into a set of predefined categories of interest Persons Organisations (companies, government organisations, committees, etc) Locations (cities, countries, rivers, etc) Date and time expressions Various other types as appropriate

9() Basic Problems in NE. Variation of NEs, e.g. John Smith, Mr Smith, John. Ambiguity of NE types: –John Smith (company vs. person) –June (person vs. month) –Washington (person vs. location) –1945 (date vs. time). Ambiguity between common words and proper nouns, e.g. "may".

10() More complex problems in NE Issues of style, structure, domain, genre etc. Punctuation, spelling, spacing, formatting Dept. of Computing and Maths Manchester Metropolitan University Manchester United Kingdom > Tell me more about Leonardo > Da Vinci

11() Two kinds of approaches. Knowledge Engineering: rule based; developed by experienced language engineers; make use of human intuition; require only small amount of training data; development can be very time consuming; some changes may be hard to accommodate. Learning Systems: use statistics or other machine learning; developers do not need LE expertise; require large amounts of annotated training data; some changes may require re-annotation of the entire training corpus.

12() List lookup approach - baseline System that recognises only entities stored in its lists (gazetteers). Advantages - Simple, fast, language independent, easy to retarget (just create lists) Disadvantages - collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
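The list lookup baseline on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the MUSE or GATE gazetteer: the `GAZETTEER` entries and the function name `gazetteer_lookup` are hypothetical, and matching is plain exact-string search.

```python
import re

# Hypothetical gazetteer: entity type -> known names (illustrative entries only).
GAZETTEER = {
    "Location": {"Sheffield", "Sherwood Forest", "Washington"},
    "Organization": {"Goldman Sachs", "Microsoft"},
}

def gazetteer_lookup(text, gazetteer=GAZETTEER):
    """Return sorted (start, end, type, matched_text) tuples for exact list matches."""
    matches = []
    for entity_type, names in gazetteer.items():
        for name in names:
            # Word boundaries stop "Washington" matching inside "Washingtonian".
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                matches.append((m.start(), m.end(), entity_type, m.group()))
    return sorted(matches)

print(gazetteer_lookup("Microsoft opened an office near Sherwood Forest."))
```

The sketch also shows the disadvantages listed on the slide: a name variant that is not in the lists is missed, and "Washington" is always tagged as a location even when it refers to a person.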

13() Shallow Parsing Approach (internal structure) Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g. location: Cap. Word + {City, Forest, Center, River} e.g. Sherwood Forest Cap. Word + {Street, Boulevard, Avenue, Crescent, Road} e.g. Portobello Street
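The internal-evidence rule on this slide (capitalised word followed by a key word) can be approximated with a regular expression. The sketch below is an assumption-laden illustration rather than the grammar used in MUSE: the names `CAP_WORD`, `LOCATION_KEYS`, `STREET_KEYS` and `internal_structure_matches` are hypothetical, and "Cap. Word" is simplified to one upper-case letter followed by lower-case letters.

```python
import re

# Key words taken from the slide; the capitalised-word pattern is a simplification.
LOCATION_KEYS = ("City", "Forest", "Center", "River")
STREET_KEYS = ("Street", "Boulevard", "Avenue", "Crescent", "Road")
CAP_WORD = r"[A-Z][a-z]+"

# One rule: a capitalised word followed by any of the key words.
LOCATION_RULE = re.compile(
    r"\b" + CAP_WORD + r" (?:" + "|".join(LOCATION_KEYS + STREET_KEYS) + r")\b"
)

def internal_structure_matches(text):
    """Return ("Location", matched_text) pairs in order of appearance."""
    return [("Location", m.group()) for m in LOCATION_RULE.finditer(text)]

print(internal_structure_matches("We walked from Portobello Street to Sherwood Forest."))
```

A rule this simple cannot tell whether the capitalised word really belongs to the name, which is the kind of ambiguity the next slide describes.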

14() Problems with the shallow parsing approach Ambiguously capitalised words (first word in sentence) [All American Bank] vs. All [State Police] Semantic ambiguity "John F. Kennedy" = airport (location) "Philip Morris" = organisation Structural ambiguity [Cable and Wireless] vs. [Microsoft] and [Dell] [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]

15() Shallow Parsing Approach with Context. Use of context-based patterns is helpful in ambiguous cases. "David Walton" and "Goldman Sachs" are indistinguishable. But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly.
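The "[Person] of [Organization]" pattern from this slide can be sketched as follows. This is a hedged illustration only: the function name `organisations_from_context` and the `CAP_SEQUENCE` approximation of an organisation name (a run of capitalised words) are assumptions, and the list of known persons is supplied by hand rather than by an earlier recognition step.

```python
import re

# Assumption: an organisation name is a run of capitalised words.
CAP_SEQUENCE = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"

def organisations_from_context(text, known_persons):
    """Apply the "[Person] of [Organization]" pattern for each known person."""
    found = []
    for person in known_persons:
        pattern = re.compile(re.escape(person) + r" of (" + CAP_SEQUENCE + r")")
        for m in pattern.finditer(text):
            # Group 1 is the capitalised sequence that follows "of".
            found.append(m.group(1))
    return found

text = "David Walton of Goldman Sachs said rates may rise."
print(organisations_from_context(text, ["David Walton"]))
```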

16() Identification of Contextual Information Use KWIC index and concordancer to find windows of context around entities Search for repeated contextual patterns of either strings, other entities, or both Manually post-edit list of patterns, and incorporate useful patterns into new rules Repeat with new entities
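A keyword-in-context (KWIC) view of the kind described on this slide can be sketched as below. The function `kwic`, the token-offset span format and the example sentence are all hypothetical; a real concordancer would work over an annotated corpus rather than a hand-built token list.

```python
def kwic(tokens, entity_spans, window=2):
    """Return (left context, entity label, right context) rows.

    entity_spans holds (start, end, label) token offsets, with end exclusive.
    """
    rows = []
    for start, end, label in entity_spans:
        left = tokens[max(0, start - window):start]
        right = tokens[end:end + window]
        rows.append((" ".join(left), "[" + label + "]", " ".join(right)))
    return rows

tokens = "Mr Jones joined Acme as chief executive".split()
for row in kwic(tokens, [(3, 4, "ORGANIZATION")]):
    print(row)
```

Replacing the entity with its label, as done here, makes repeated contexts such as "joined [ORGANIZATION] as" easier to spot when the rows are sorted or counted.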

17() Examples of context patterns: [PERSON] earns [MONEY]; [PERSON] joined [ORGANIZATION]; [PERSON] left [ORGANIZATION]; [PERSON] joined [ORGANIZATION] as [JOBTITLE]; [ORGANIZATION]'s [JOBTITLE] [PERSON]; [ORGANIZATION] [JOBTITLE] [PERSON]; the [ORGANIZATION] [JOBTITLE]; part of the [ORGANIZATION]; [ORGANIZATION] headquarters in [LOCATION]; price of [ORGANIZATION]; sale of [ORGANIZATION]; investors in [ORGANIZATION]; [ORGANIZATION] is worth [MONEY]; [JOBTITLE] [PERSON]; [PERSON], [JOBTITLE]

18() Caveats Patterns are only indicators based on likelihood Can set priorities based on frequency thresholds Need training data for each domain More semantic information would be useful (e.g. to cluster groups of verbs)
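Setting priorities from frequency thresholds, as this slide suggests, might look like the following sketch. The function name `rank_patterns`, the `min_count` parameter and the sample counts are hypothetical; the slide does not say how thresholds were chosen.

```python
from collections import Counter

def rank_patterns(observed_patterns, min_count=2):
    """Keep patterns seen at least min_count times, most frequent first."""
    counts = Counter(observed_patterns)
    kept = [(pattern, n) for pattern, n in counts.items() if n >= min_count]
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))

observed = (
    ["[PERSON] joined [ORGANIZATION]"] * 3
    + ["price of [ORGANIZATION]"] * 2
    + ["[PERSON] left [ORGANIZATION]"]
)
for pattern, count in rank_patterns(observed):
    print(count, pattern)
```

Patterns below the threshold are dropped rather than trusted, which reflects the caveat that a pattern is only an indicator based on likelihood.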

19() MUSE – MUlti-Source Entity Recognition. An IE system developed within GATE. Performs NE and coreference on different text types and genres. Uses knowledge engineering approach with hand-crafted rules. Performance rivals that of machine learning methods. Easily adaptable.

20() MUSE Modules: document format and genre analysis, tokenisation, sentence splitting, POS tagging, gazetteer lookup, semantic grammar, orthographic coreference, nominal and pronominal coreference.

21() Switching Controller. Rather than have a fixed chain of processing resources, choices can be made automatically about which modules to use. Texts are analysed for certain identifying features which are used to trigger different modules. For example, texts with no case information may need different POS tagger or gazetteer lists. Not all modules are language-dependent, so some can be reused directly.
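The idea of choosing modules from features of the text can be sketched as follows. This is not the GATE switching controller: the module names are placeholder strings, and `has_case_information` is an assumed, deliberately crude feature test based on the slide's example of texts with no case information.

```python
def has_case_information(text):
    """True if the text mixes upper- and lower-case letters."""
    letters = [c for c in text if c.isalpha()]
    return any(c.isupper() for c in letters) and any(c.islower() for c in letters)

def choose_modules(text):
    """Build a processing chain; module names are hypothetical placeholders."""
    pipeline = ["tokeniser", "sentence_splitter"]
    if has_case_information(text):
        pipeline += ["pos_tagger", "gazetteer"]
    else:
        # Single-case text: switch to resources that do not rely on capitalisation.
        pipeline += ["pos_tagger_caseless", "gazetteer_caseless"]
    pipeline.append("semantic_grammar")
    return pipeline

print(choose_modules("JOHN SMITH JOINED ACME"))
print(choose_modules("John Smith joined Acme"))
```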

22() Multilingual MUSE. MUSE has been adapted to deal with different languages. Currently systems for English, French, German, Romanian, Bulgarian, Russian, Cebuano, Hindi, Chinese, Arabic. Separation of language-dependent and language-independent modules and sub-modules. Annotation projection experiments.

23() IE in Surprise Languages Adaptation to an unknown language in a very short timespan Cebuano: –Latin script, capitalisation, words are spaced –Few resources and little work already done –Medium difficulty Hindi: –Non-Latin script, different encodings used, no capitalisation, words are spaced –Many resources available –Medium difficulty

24() What does multilingual NE require? Extensive support for non-Latin scripts and text encodings, including conversion utilities –Automatic recognition of encoding –Occupied up to 2/3 of the TIDES Hindi effort Bilingual dictionaries Annotated corpus for evaluation Internet resources for gazetteer list collection (e.g., phone books, yellow pages, bi-lingual pages)
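One very simple way to approximate automatic recognition of encoding is trial decoding against a list of candidate encodings. The sketch below is an assumption, not the method used in the TIDES Hindi work: the function name `guess_encoding` and the default candidate list are hypothetical, and practical detectors usually rely on statistical evidence rather than on the first decode that succeeds.

```python
def guess_encoding(raw_bytes, candidates=("ascii", "utf-8")):
    """Return the first candidate encoding that decodes raw_bytes, else None."""
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None

hindi_bytes = "\u0939\u093f\u0928\u094d\u0926\u0940".encode("utf-8")
print(guess_encoding(hindi_bytes))
print(guess_encoding(b"caf\xe9", candidates=("utf-8", "latin-1")))
```

The order of candidates matters: permissive single-byte encodings such as Latin-1 accept any byte sequence, so they belong at the end of the list.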

25() GATE Unicode Kit (GUK) Complements Java’s facilities Support for defining Input Methods (IMs) currently 30 IMs for 17 languages Pluggable in other applications (e.g. JEdit) Editing Multilingual Data

26() Processing Multilingual Data All processing, visualisation and editing tools use GUK

27() Future directions Tools and techniques –Further incorporation of ML methods –Annotation projection experiments –Automatic pattern generation –Tools for morphological analysis and parsing Applications –Electronic text corpus of Sumerian literature –Tools for semantic web –Bioinformatics