NCSR “Demokritos”, Institute of Informatics & Telecommunications
CROSSMARC: CROSS-lingual Multi Agent Retail Comparison
WP3: Multilingual and Multimedia Fact Extraction
Vangelis Karkaletsis
CROSSMARC Meeting, Edinburgh, March 27-28, 2001
CROSSMARC Kick-off Meeting, Edinburgh, March 27
WP3: Objectives
Combine named-entity recognition, wrapper-induction techniques, and language-based information extraction to generate a multimedia fact extraction engine.
Incorporate mechanisms to handle the multimedia context of product descriptions.
Incorporate mechanisms to allow rapid adaptation to new product types.
WP3: Duration and Person-months
Start date: month 10
End date: month 28
Person-months per participant: NCSR: 17; EDIN: 14; RTV: 20; ICDC: 15
WP3: Version 1 of Fact Extraction
Development of a wrapper induction module that will be combined with the multi-lingual named-entity recognition and cross-lingual name matching of WP2.
The module will be able to correlate product names and features expressed in four languages (via cross-lingual name matching), and capture multimedia aspects of product descriptions (tables, hypertext links, banners, etc.).
The wrapper produced will be applicable only to rigidly structured product descriptions.
WP3: Version 2 of Fact Extraction
Extends Version 1 using language-based techniques from information extraction.
Is able to handle product descriptions written in freer form, in four languages.
Incorporates more advanced techniques to handle multimedia aspects.
Covers only the 1st domain.
WP3: Version 3 of Fact Extraction
Extends Version 2, addressing rapid adaptation to new product domains.
Incorporates mechanisms to re-train fact extraction for new domains with minimal human intervention.
Will be applied to port the fact extraction technology to the 2nd product domain.
WP3: Milestones
Month 18: Version 1 of fact extraction
Month 24: Version 2 of fact extraction
Month 28: Version 3 of fact extraction
WP3: Version 1 of Fact Extraction
Wrappers:
A wrapper is a general procedure for extracting relevant data from a particular set of semi-structured Web pages and returning the results in a self-describing structured representation, suitable for further processing.
Wrappers should execute quickly, because they are usually used online to satisfy user queries.
Wrappers should be able to cope with the changing and unstable nature of the Web: network failures, ill-formed documents, changes in layout, etc.
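The two requirements above (a self-describing result and tolerance of ill-formed documents) can be illustrated with a minimal Python sketch. It is not part of the CROSSMARC design: the class `CellCollector`, the function `wrap`, the alternating label/value table layout, and the sample page are all assumptions made for illustration. The sketch relies on the standard library's `html.parser`, which does not require closing tags to be present.

```python
import json
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, even if closing tags are missing."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new cell starts; an unclosed previous cell is implicitly ended.
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "tr"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def wrap(page):
    """Hypothetical wrapper: reads cells as alternating label/value pairs."""
    parser = CellCollector()
    try:
        parser.feed(page)
        parser.close()
    except Exception as exc:  # report the failure instead of crashing
        return {"status": "error", "reason": str(exc), "facts": {}}
    cells = [c.strip() for c in parser.cells]
    facts = dict(zip(cells[0::2], cells[1::2]))
    return {"status": "ok", "facts": facts}


# Assumed, deliberately ill-formed page: several </td> and </tr> tags are missing.
page = "<table><tr><td>Processor<td>AMD Athlon 1GHz</tr><tr><td>Price</td><td>999</table>"
result = wrap(page)
print(json.dumps(result, indent=2))
```

Returning field names together with values, plus a status flag, is one simple way to make the output self-describing for downstream processing; a real wrapper would also have to handle network failures and layout changes, which this sketch does not attempt.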
WP3: Version 1 of Fact Extraction
Example
WP3: Version 1 of Fact Extraction
Wrappers: An example
Processor and Motherboard
AMD® Athlon 1GHz with DFI AK74 Socket A Motherboard
A wrapper can be written that uses the positions of particular strings (tags) to delimit the extracted text (e.g., )
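A delimiter-based wrapper of this kind can be sketched in a few lines of Python. The markup below is hypothetical: the tags of the original page are not preserved here, so `<b>`, `</b>`, and `<td>` are assumptions chosen only to make the idea concrete, as is the helper name `extract_between`.

```python
def extract_between(page, left, right):
    """Return every substring of `page` found between `left` and `right`."""
    results = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        results.append(page[start:end].strip())
        pos = end + len(right)
    return results


# Hypothetical markup around the product description from the slide.
page = (
    "<tr><td><b>Processor and Motherboard</b></td>"
    "<td>AMD Athlon 1GHz with DFI AK74 Socket A Motherboard</td></tr>"
)

feature = extract_between(page, "<b>", "</b>")
value = extract_between(page, "</td><td>", "</td>")
print(feature, value)
```

The wrapper knows nothing about processors or motherboards; it depends only on the position of the delimiting strings, which is why it works on rigidly structured pages and breaks when the layout changes.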
WP3: Version 1 of Fact Extraction
Wrappers:
Manually created wrappers are costly to maintain.
Wrapper induction is a technique for automatically generating wrappers using inductive learning (Kushmerick 1997).
Wrapper induction systems:
WIEN (Wrapper Induction Environment) (Kushmerick 1997)
STALKER (Muslea )
SoftMealy (Hsu 1998)
WP3: Version 1 of Fact Extraction
Figure: A simplified schema for wrapper induction. Components shown: collection of Web pages for a specific domain; annotation tool; annotated set of Web pages; creation of training vectors; machine learning technique; wrapper; new Web pages; extracted information.
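The pipeline in the schema (annotated pages in, wrapper out, wrapper applied to new pages) can be illustrated with a deliberately simple Python sketch. It is loosely in the spirit of left/right delimiter wrappers and is not the WIEN, STALKER, or SoftMealy algorithm, nor the learning technique chosen by the project: it replaces the training-vector and machine-learning steps with a plain longest-common-context computation. The function names, the sample pages, and the `class='cpu'` attribute are assumptions for illustration.

```python
import os


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_wrapper(examples):
    """Learn (left, right) delimiters from (page, annotated_value) pairs."""
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)  # position of the annotated value
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("no common delimiters found in the annotated pages")
    return left, right


def apply_wrapper(wrapper, page):
    """Extract the value delimited by the learned strings, or None."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    if end == -1:
        return None
    return page[start:end]


# Hypothetical annotated training pages: (page source, annotated processor name).
training = [
    ("<html><td class='cpu'>AMD Athlon 1GHz</td><td>999</td></html>",
     "AMD Athlon 1GHz"),
    ("<html><p>Offer</p><td class='cpu'>Intel Pentium III 800MHz</td></html>",
     "Intel Pentium III 800MHz"),
]

wrapper = induce_wrapper(training)
new_page = "<html><h1>Shop</h1><td class='cpu'>AMD Duron 750MHz</td><td>x</td></html>"
extracted = apply_wrapper(wrapper, new_page)
print(wrapper, extracted)
```

With only two training pages the learned delimiters pick up incidental context (here a stray `>` and `<` from the neighbouring tags), which hints at why the size and variety of the annotated corpus matter for the quality of the induced wrapper.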
WP3: Workplan for Version 1
Month 12: Development of the annotation tool (this forms part of the annotation tool that will be developed for the needs of WP1 and WP2)
Month 12: Specification of the learning techniques that will be used for wrapper induction
Month 14: Annotation of the Web pages (creation of the training and testing corpus) for the 1st domain in the 4 languages
Month 16: 1st version of the wrapper induction module
Month 18: Final version of the wrapper induction module