Published by Kelley Walton. Modified over 8 years ago.
1
NCSR “Demokritos”, Institute of Informatics & Telecommunications
CROSSMARC: CROSS-lingual Multi Agent Retail Comparison
WP3: Multilingual and Multimedia Fact Extraction
Vangelis Karkaletsis
CROSSMARC Meeting, Edinburgh, March 27-28, 2001
2
CROSSMARC Kick-off Meeting, Edinburgh, March 27, 2001
WP3: Objectives
Combine named-entity recognition, wrapper-induction techniques, and language-based information extraction to generate a multimedia fact extraction engine.
Incorporate mechanisms to handle the multimedia context of product descriptions.
Incorporate mechanisms to allow rapid adaptation to new product types.
3
WP3: Duration and Person-months
Start date: month 10
End date: month 28
Person-months per participant: NCSR: 17; EDIN: 14; RTV: 20; ICDC: 15
4
WP3: Version 1 of Fact Extraction
Development of a wrapper induction module that will be combined with the multi-lingual named-entity recognition and cross-lingual name matching of WP2.
The module will be able to correlate product names and features expressed in four languages (via cross-lingual name matching), and capture multimedia aspects of product descriptions (tables, hypertext links, banners, etc.).
The wrapper produced will be applicable only to rigidly structured product descriptions.
5
WP3: Version 2 of Fact Extraction
Extends Version 1 using language-based techniques from information extraction.
Is able to handle product descriptions written in freer form, in four languages.
Incorporates more advanced techniques to handle multimedia aspects.
Covers only the 1st domain.
6
WP3: Version 3 of Fact Extraction
Extends Version 2, addressing rapid adaptation to new product domains.
Incorporates mechanisms to re-train fact extraction for new domains with minimal human intervention.
Will be applied to port the fact extraction technology to the 2nd product domain.
7
WP3: Milestones
Month 18: Version 1 of fact extraction
Month 24: Version 2 of fact extraction
Month 28: Version 3 of fact extraction
8
WP3: Version 1 of Fact Extraction
Wrappers:
A wrapper is a general procedure for extracting relevant data from a particular set of semi-structured Web pages and returning the results in a self-describing structured representation, suitable for further processing.
Wrappers should execute quickly because they are usually used online to satisfy user queries.
Wrappers should be able to cope with the changing and unstable nature of the Web: network failures, ill-formed documents, changes in layout, etc.
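The requirements on this slide can be illustrated with a minimal Python sketch. It is not the CROSSMARC implementation: the class `CellCollector`, the function `wrap`, the field names `name` and `price`, and the sample page with its price values are all hypothetical. It reads table cells from a page whose `<td>` tags are never closed and returns a self-describing JSON record with a status field.

```python
import json
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new <td> implicitly ends any cell that was left open.
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def wrap(page, fields=("name", "price")):
    """Hypothetical wrapper: groups consecutive cells into named records."""
    parser = CellCollector()
    try:
        parser.feed(page)
        parser.close()
    except Exception as exc:
        # Report failure in the output instead of crashing the caller.
        return json.dumps({"status": "error", "reason": str(exc)})
    values = [c.strip() for c in parser.cells if c.strip()]
    step = len(fields)
    records = [dict(zip(fields, values[i:i + step]))
               for i in range(0, len(values) - step + 1, step)]
    return json.dumps({"status": "ok", "records": records})


# Made-up, ill-formed page: no </td> or </tr> tags at all.
page = "<table><tr><td>Athlon 1GHz<td>250<tr><td>Duron 800<td>120</table>"
result = wrap(page)
```

Using a tolerant parser instead of exact string matching is one way to survive ill-formed documents; it does not by itself address layout changes, which is where wrapper induction comes in.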
9
WP3: Version 1 of Fact Extraction: Example
10
WP3: Version 1 of Fact Extraction
Wrappers: an example
Processor and Motherboard
AMD® Athlon 1GHz with DFI AK74 Socket A Motherboard
A wrapper can be written that uses the positions of particular strings (tags) to delimit the extracted text (e.g., )
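A delimiter-based wrapper of this kind could look like the following Python sketch. The markup of the original example page is not preserved in the slide, so the `<b>` and `<p>` tags below are assumptions chosen only for illustration, and `extract_between` is a hypothetical helper name.

```python
def extract_between(page, left, right):
    """Return every substring of `page` that sits between `left` and `right`."""
    results = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break  # unmatched left delimiter: stop rather than guess
        results.append(page[start:end].strip())
        pos = end + len(right)
    return results


# Assumed markup around the product text from the slide.
page = ("<b>Processor and Motherboard</b>"
        "<p>AMD Athlon 1GHz with DFI AK74 Socket A Motherboard</p>")

features = extract_between(page, "<b>", "</b>")
values = extract_between(page, "<p>", "</p>")
```

Such a wrapper is fast, since it only scans for fixed strings, but it breaks as soon as the site changes its tags, which is why it suits only rigidly structured pages.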
11
WP3: Version 1 of Fact Extraction
Wrappers:
Manually created wrappers are costly to maintain.
Wrapper induction is a technique for automatically generating wrappers using inductive learning [Kushmerick 1997].
Wrapper induction systems:
WIEN (Wrapper Induction Environment) (Kushmerick 1997)
STALKER (Muslea 1998-1999)
SoftMealy (Hsu 1998)
12
WP3: Version 1 of Fact Extraction
A simplified schema for wrapper induction
Training: Collection of Web pages for a specific domain -> Annotation tool -> Annotated set of Web pages -> Creation of training vectors -> Machine Learning Technique -> Wrapper
Application: New Web pages -> Wrapper -> Extracted information
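The schema can be made concrete with a toy Python sketch. It is a deliberately simplified stand-in for the learning step, not the algorithm of WIEN, STALKER, or SoftMealy: it "learns" one left and one right delimiter as the longest context shared by all annotated examples. The function names and the sample pages (including the product names in them) are made up for illustration.

```python
import os


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def induce_wrapper(annotated_pages):
    """Learn (left, right) delimiters from (page, target) training pairs."""
    lefts, rights = [], []
    for page, target in annotated_pages:
        start = page.index(target)  # position of the annotated text
        lefts.append(page[:start])
        rights.append(page[start + len(target):])
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("no common delimiters found in the training pages")
    return left, right


def apply_wrapper(wrapper, page):
    """Extract the text between the learned delimiters, or None if absent."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end != -1 else None


# Annotated training set: each page is paired with the text a human marked.
training = [
    ("<h1>Shop A</h1><td>CPU:</td><td><b>Athlon 1GHz</b></td><p>in stock</p>",
     "Athlon 1GHz"),
    ("<h1>Shop A offers</h1><td>CPU:</td><td><b>Pentium III 800MHz</b></td>"
     "<p>sold out</p>",
     "Pentium III 800MHz"),
]
wrapper = induce_wrapper(training)

# A new, unseen page from the same (hypothetical) site.
new_page = ("<h1>Weekly deals</h1><td>CPU:</td><td><b>Duron 750MHz</b></td>"
            "<p>in stock</p>")
extracted = apply_wrapper(wrapper, new_page)
```

The annotated pairs play the role of the "annotated set of Web pages", `induce_wrapper` stands in for the machine learning technique, and `apply_wrapper` is the resulting wrapper run on new pages. When the site layout changes, only the training set needs to be refreshed, not hand-written extraction code.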
13
WP3: Workplan for Version 1
Month 12: Development of the annotation tool (this forms part of the annotation tool that will be developed for the needs of WP1 and WP2)
Month 12: Specification of the learning techniques that will be used for wrapper induction
Month 14: Annotation of the Web pages (creation of the training and testing corpus) for the 1st domain in the 4 languages
Month 16: 1st version of the wrapper induction module
Month 18: Final version of the wrapper induction module