Processing of large document collections Part 10 (Information extraction: multilingual IE, IE from web, IE from semi-structured data) Helena Ahonen-Myka.

Presentation transcript:

Processing of large document collections Part 10 (Information extraction: multilingual IE, IE from web, IE from semi-structured data) Helena Ahonen-Myka Spring 2006

2 Multilingual IE
assume we have documents in two languages (English and French), and the user requires templates to be filled in one of the languages (English) from documents in either language
– “Gianluigi Ferrero a assisté à la réunion annuelle de Vercom Corp à Londres.”
– “Gianluigi Ferrero attended the annual meeting of Vercom Corp in London.”

3 Both texts should produce the same template fill:
:=
– organisation: 'Vercom Corp'
– location: 'London'
– type: 'annual meeting'
– present:
:=
– name: 'Gianluigi Ferrero'
– organisation: UNCLEAR

4 Multilingual IE: three ways of addressing the problem
Solution 1
– a full French-English machine translation (MT) system translates all the French texts into English
– an English IE system then processes both the translated texts and the original English texts to extract English template structures
– in general (n languages): the solution requires a separate full IE system for each target language (here: English) and a full MT system for each language pair

5 Multilingual IE: three ways of addressing the problem
Solution 2
– separate IE systems process the French and English texts, producing templates in the original source language
– a 'mini' French-English MT system then translates the lexical items occurring in the French templates
– in general: the solution requires a separate full IE system for each language and a mini-MT system for each language pair
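
The 'mini' MT step of solution 2 can be sketched as a lexicon lookup over the slot fillers of a template. This is only a minimal illustration: the slot names, the French template and the two-entry lexicon `FR_EN_LEXICON` are assumptions made up for the example, not part of the slides.

```python
# Hypothetical bilingual lexicon for the mini-MT step; entries are illustrative.
FR_EN_LEXICON = {
    "réunion annuelle": "annual meeting",
    "Londres": "London",
}


def translate_template(template, lexicon):
    """Translate slot fillers found in the lexicon; leave others (e.g. names) unchanged."""
    return {slot: lexicon.get(value, value) for slot, value in template.items()}


# A template as a French IE system might produce it (assumed slot names).
french_template = {
    "organisation": "Vercom Corp",
    "location": "Londres",
    "type": "réunion annuelle",
}

english_template = translate_template(french_template, FR_EN_LEXICON)
print(english_template)
```

Only the lexical items that occur in templates need to be covered, which is why this approach can get by with far less than a full MT system.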

6 Multilingual IE: three ways of addressing the problem
Solution 3
– a general IE system, with separate French and English front ends
– French and English texts are analyzed (syntax/semantics)
– a language-independent representation of the input text (discourse model) is produced
– the discourse model is a part of a domain model (ontology)
    knowledge of the domain (entities, events, …) is described as concepts and relations
    concepts are related via mappings to lexical items in multiple language-specific lexicons

7 Multilingual IE: three ways of addressing the problem
Solution 3 (continued)
– the required information is extracted from the discourse model, and the mappings from concepts to the English lexicon are used to produce templates with English lexical items
– in general: the solution requires a separate syntactic/semantic analyser for each language, and the construction of mappings between the domain model and a lexicon for each language
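
The idea behind solution 3 can be illustrated with a toy sketch in which language-independent concepts are mapped to lexical items in per-language lexicons. Everything here is an assumption for illustration: the concept names, the `DOMAIN_MODEL` and `LEXICONS` tables, and the "front end", which is reduced to substring matching instead of real syntactic/semantic analysis. Proper names such as the person and organisation are not handled.

```python
# Hypothetical domain model: concept -> template slot that the concept fills.
DOMAIN_MODEL = {"ANNUAL_MEETING": "type", "LONDON": "location"}

# Hypothetical language-specific lexicons: concept -> lexical item.
LEXICONS = {
    "en": {"ANNUAL_MEETING": "annual meeting", "LONDON": "London"},
    "fr": {"ANNUAL_MEETING": "réunion annuelle", "LONDON": "Londres"},
}


def analyse(text, lang):
    """Toy front end: return the concepts whose lexical item occurs in the text."""
    return {concept for concept, word in LEXICONS[lang].items() if word in text}


def fill_template(concepts, target_lang):
    """Produce slot fillers in the target language from language-independent concepts."""
    return {DOMAIN_MODEL[c]: LEXICONS[target_lang][c] for c in concepts}


FRENCH = "Gianluigi Ferrero a assisté à la réunion annuelle de Vercom Corp à Londres."
ENGLISH = "Gianluigi Ferrero attended the annual meeting of Vercom Corp in London."

print(fill_template(analyse(FRENCH, "fr"), "en"))
print(fill_template(analyse(ENGLISH, "en"), "en"))
```

Adding a further language in this sketch means adding one lexicon and one front end, without touching the domain model.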

8 IE from web
problem setting: data is extracted from a web site and transformed into a structured format (database records, XML documents)
the resulting structured data can then be used to build new applications without having to deal with heterogeneous structures
– e.g., price comparisons
challenges:
– thousands of changing, heterogeneous sources
– scalability: speed is important, so no complex processing is possible

9 IE from web
a wrapper is a piece of software that can translate an HTML document into a structured form
critical problem:
– How to define a set of extraction rules that precisely define how to locate the information on the page?
for any item to be extracted, one needs an extraction rule to locate both the beginning and the end of the item
– extraction rules should work for all of the pages in the source
– both HTML markup and text content can be used
– linguistic analysis is secondary

10 Example: country codes
Some Country Codes
Congo 242
Egypt 20
Belize 501
Spain 34
Extract: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}
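
A wrapper for the country-code example can be sketched with one left and one right delimiter per attribute. The HTML markup of the slide did not survive in this transcript, so the `PAGE` string below, with `<B>` around country names and `<I>` around codes, is an assumption, as are the function and variable names.

```python
# Assumed markup: the transcript lost the HTML tags, so this page is hypothetical.
PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Belize</B> <I>501</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
    "</BODY></HTML>"
)


def lr_wrapper(page, delimiters):
    """Extract tuples using one (left, right) delimiter pair per attribute.

    Scans the page left to right; stops when a left or right delimiter
    can no longer be found, discarding any incomplete tuple.
    """
    records = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


print(lr_wrapper(PAGE, [("<B>", "</B>"), ("<I>", "</I>")]))
```

The rule uses only markup as delimiters and no linguistic analysis, which keeps extraction fast but ties the wrapper to the page layout.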

11 Learning from examples
input: a set of web pages in which the data to be extracted is annotated
– the user provides the initial set of annotated examples
– the system can suggest additional pages for the user to annotate
output: a set of extraction rules that describe how to locate the desired information on a web page
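
One very simple way to derive an extraction rule from annotated examples is to take the longest context shared by all annotated items: the common suffix of the text before each item as the left delimiter, and the common prefix of the text after each item as the right delimiter. The sketch below illustrates only this idea; it is not the specific learning algorithm the slides refer to, and the page markup and function names are assumptions.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def learn_delimiters(page, annotated_items):
    """Learn a (left, right) delimiter pair from items annotated on a page.

    Each annotated item is located by its first occurrence in the page.
    """
    lefts, rights = [], []
    for item in annotated_items:
        start = page.index(item)
        lefts.append(page[:start])
        rights.append(page[start + len(item):])
    return common_suffix(lefts), os.path.commonprefix(rights)


# Hypothetical annotated page: the user marked two country names.
page = "<B>Congo</B> <I>242</I><BR><B>Spain</B> <I>34</I><BR>"
left, right = learn_delimiters(page, ["Congo", "Spain"])
print(repr(left), repr(right))
```

With few examples the learned delimiters tend to be longer than necessary, which is one reason a system might ask the user to annotate additional pages.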

12 IE from semi-structured text
Capitol Hill - 1 br twnhme. Fplc D/W W/D. Undrgrnd Pkg incl $675. 3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995. (206) (This ad last ran on 08/03/97.)

13 IE from semi-structured text
2 templates extracted:
– Rental:
    Neighborhood: Capitol Hill
    Bedrooms: 1
    Price: 675
– Rental:
    Neighborhood: Capitol Hill
    Bedrooms: 3
    Price: 995

14 IE from semi-structured text
the sample text (rental ad) is neither grammatical nor rigidly structured
– we cannot use a natural language parser as we did before
– simple rules that might work for structured text do not work here

15 Rule for neighborhood, number of bedrooms and associated price
Pattern:: * ( Nghbr ) * ( Digit ) ' ' Bdrm * '$' ( Number )
Output:: Rental {Neighborhood $1} {Bedrooms $2} {Price $3}
– assuming the semantic classes Nghbr (neighborhood names for the city) and Bdrm (BR, Br, br, bedrooms, …)
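
The rule above can be approximated with a regular expression. This is a hedged sketch, not the rule language of the original system: the members of the `NGHBR` and `BDRM` classes are assumed, the two numbers missing from the ad text in this transcript are filled in from the extracted templates on slide 13, and carrying the neighborhood over to later matches in the same ad is an assumption about how the second template is obtained.

```python
import re

# Hypothetical semantic classes; a real system would use much larger lists.
NGHBR = r"(?:Capitol Hill|Queen Anne|Ballard)"
BDRM = r"(?:bedrooms?|brs?|bdrm|bds?)"

# * ( Nghbr ) * ( Digit ) ' ' Bdrm * '$' ( Number )
RULE = re.compile(
    rf"({NGHBR}).*?(\d) {BDRM}\b.*?\$(\d+)",
    re.IGNORECASE | re.DOTALL,
)
# Tail of the rule, re-applied to the rest of the ad (assumption).
TAIL = re.compile(rf"(\d) {BDRM}\b.*?\$(\d+)", re.IGNORECASE | re.DOTALL)


def extract_rentals(ad):
    """Return one Rental template per (bedrooms, price) pair found in the ad."""
    first = RULE.search(ad)
    if first is None:
        return []
    neighborhood = first.group(1)
    rentals = [{
        "Neighborhood": neighborhood,
        "Bedrooms": int(first.group(2)),
        "Price": int(first.group(3)),
    }]
    # Later units in the same ad inherit the neighborhood of the first match.
    for match in TAIL.finditer(ad, first.end()):
        rentals.append({
            "Neighborhood": neighborhood,
            "Bedrooms": int(match.group(1)),
            "Price": int(match.group(2)),
        })
    return rentals


AD = (
    "Capitol Hill - 1 br twnhme. Fplc D/W W/D. Undrgrnd Pkg incl $675. "
    "3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995."
)
for rental in extract_rentals(AD):
    print(rental)
```

The pattern skips arbitrary text between the slots, which is what lets it cope with the ungrammatical, abbreviated ad.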

16 Other trends in IE
semi-supervised and unsupervised methods
ACE (Automatic Content Extraction) evaluations
– extraction of general relations from text: a person is in a location; a person has some social relation to another person, etc.
cross-document processing
– e.g. error correction, when the slot values for all templates are known
backtracking in the process
– currently, errors made on the earlier levels propagate to the later levels
– could one backtrack, correct the errors made earlier, and then start again?