1 NCSR “Demokritos”, Institute of Informatics & Telecommunications
CROSSMARC: CROSS-lingual Multi Agent Retail Comparison
WP3: Multilingual and Multimedia Fact Extraction
Vangelis Karkaletsis
CROSSMARC Meeting, Edinburgh, March 27-28, 2001

2 CROSSMARC Kick-off Meeting, Edinburgh, March 27, 2001
WP3: Objectives
• Combine named-entity recognition, wrapper-induction techniques, and language-based information extraction to generate a multimedia fact extraction engine.
• Incorporate mechanisms to handle the multimedia context of product descriptions.
• Incorporate mechanisms to allow rapid adaptation to new product types.

3 WP3: Duration and Person-months
• Start date: month 10
• End date: month 28
• Person-months per participant: NCSR: 17; EDIN: 14; RTV: 20; ICDC: 15

4 WP3: Version 1 of Fact Extraction
• Development of a wrapper induction module that will be combined with the multi-lingual named-entity recognition and cross-lingual name matching of WP2.
• The module will be able to correlate product names and features expressed in four languages (via cross-lingual name matching), and capture multimedia aspects of product descriptions (tables, hypertext links, banners, etc.).
• The wrapper produced will be applicable only to rigidly structured product descriptions.

5 WP3: Version 2 of Fact Extraction
• extends Version 1 using language-based techniques from information extraction
• is able to handle product descriptions written in freer form, in four languages
• incorporates more advanced techniques to handle multimedia aspects
• covers only the 1st domain

6 WP3: Version 3 of Fact Extraction
• extends Version 2, addressing rapid adaptation to new product domains
• incorporates mechanisms to re-train fact extraction for new domains with minimal human intervention
• will be applied to port the fact extraction technology to the 2nd product domain

7 WP3: Milestones
Month 18: Version 1 of fact extraction
Month 24: Version 2 of fact extraction
Month 28: Version 3 of fact extraction

8 WP3: Version 1 of Fact Extraction
Wrappers:
• A wrapper is a general procedure that extracts relevant data from a particular set of semi-structured Web pages and returns the results in a self-describing structured representation suitable for further processing.
• Wrappers should execute quickly because they are usually used online to satisfy user queries.
• Wrappers should be able to cope with the changing and unstable nature of the Web, such as network failures, ill-formed documents, and changes in layout.
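To make two of these requirements concrete (a self-describing result and tolerance of ill-formed documents), here is a minimal sketch in Python. It is not the CROSSMARC implementation: the class `CellCollector`, the function `wrap_page`, the JSON layout, and the assumption that table cells alternate between a feature name and its value are all hypothetical choices made for illustration. Network failures are not handled here.

```python
import json
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.cells = []        # text of each cell seen so far
        self._in_cell = False  # True while we are inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new <td> implicitly ends a cell that was never closed.
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def wrap_page(html):
    """Return feature/value pairs as a self-describing JSON string."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    cells = [c.strip() for c in parser.cells if c.strip()]
    # Assumption: cells alternate feature, value, feature, value, ...
    facts = [{"feature": k, "value": v}
             for k, v in zip(cells[0::2], cells[1::2])]
    return json.dumps({"facts": facts}, ensure_ascii=False)


# Ill-formed input: no </td> tags and no </table>.
page = "<table><tr><td>Processor and Motherboard<td>AMD Athlon 1GHz</tr>"
print(wrap_page(page))
```

The standard-library `HTMLParser` does not require balanced tags, which is why the sketch can still recover both cells from the broken markup.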

9 WP3: Version 1 of Fact Extraction: Example

10 WP3: Version 1 of Fact Extraction
Wrappers – An example:
Processor and Motherboard
AMD® Athlon 1GHz with DFI AK74 Socket A Motherboard
A wrapper can be written that uses the positions of particular strings (tags) to delimit the extracted text (e.g., )
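A hand-written wrapper of this kind can be sketched in a few lines of Python. The tags on the original slide were not preserved, so the markup in `page` below is hypothetical, as is the helper name `extract_between`; only the two product strings come from the slide.

```python
def extract_between(page, left, right):
    """Return every substring of `page` found between `left` and `right`."""
    results = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        results.append(page[start:end].strip())
        pos = end + len(right)
    return results


# Hypothetical markup: the real tags were lost from the slide.
page = (
    "<tr><td><b>Processor and Motherboard</b></td>"
    "<td>AMD® Athlon 1GHz with DFI AK74 Socket A Motherboard</td></tr>"
)

labels = extract_between(page, "<td><b>", "</b></td>")
values = extract_between(page, "</td><td>", "</td></tr>")
print(labels, values)
```

Such a wrapper works only while the page keeps exactly this layout, which is the "rigidly structured" limitation noted for Version 1.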

11 WP3: Version 1 of Fact Extraction
Wrappers:
• Manually created wrappers are costly to maintain.
• Wrapper induction is a technique for automatically generating wrappers using inductive learning (Kushmerick 1997).
Wrapper Induction Systems:
• WIEN (Wrapper Induction Environment) (Kushmerick 1997)
• STALKER (Muslea 1998-1999)
• SoftMealy (Hsu 1998)
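The idea of inducing a wrapper from labelled examples can be illustrated with a deliberately simplified sketch. It learns one left and one right delimiter as the longest text shared by all examples immediately before and after the labelled value. This is only in the spirit of delimiter-based wrappers; it is not the actual algorithm of WIEN, STALKER, or SoftMealy, and the function names and sample pages are hypothetical.

```python
def common_prefix(strings):
    """Longest string that every element of `strings` starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        n = 0
        while n < min(len(prefix), len(s)) and prefix[n] == s[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def induce_delimiters(examples):
    """Learn (left, right) from (page, labelled_value) pairs."""
    befores, afters = [], []
    for page, value in examples:
        i = page.index(value)  # assumes the value occurs in the page
        befores.append(page[:i])
        afters.append(page[i + len(value):])
    return common_suffix(befores), common_prefix(afters)


def apply_wrapper(page, left, right):
    """Extract the text between the learned delimiters, or None."""
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end != -1 else None


# Hypothetical annotated training pages.
examples = [
    ("Offer 1 <b>Processor:</b> <i>AMD Athlon 1GHz</i> (in stock)",
     "AMD Athlon 1GHz"),
    ("Offer 2 <b>Processor:</b> <i>Intel Pentium III 800MHz</i> (sold out)",
     "Intel Pentium III 800MHz"),
]
left, right = induce_delimiters(examples)
new_page = "Offer 3 <b>Processor:</b> <i>AMD Duron 750MHz</i> (in stock)"
print(repr(left), repr(right), apply_wrapper(new_page, left, right))
```

With only two examples the learned delimiters also absorb incidental context (here the space before `<b>` and the opening parenthesis after `</i>`), which shows why the size and variety of the annotated corpus matter.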

12 WP3: Version 1 of Fact Extraction
Figure: A simplified schema for wrapper induction.
Training: Collection of Web pages for a specific domain → Annotation tool → Annotated set of Web pages → Creation of training vectors → Machine Learning Technique → Wrapper
Extraction: New Web pages → Wrapper → Extracted information
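The "creation of training vectors" step could look roughly like the following Python sketch. The slides do not specify the features or the learning technique, so the tokenisation, the feature names, and the function `training_vectors` are assumptions chosen for illustration: each token of an annotated page becomes a small feature dictionary, labelled `True` if it lies inside the annotated value.

```python
import re

# A token is either a whole tag or a run of non-space, non-tag characters.
TOKEN = re.compile(r"<[^>]+>|[^\s<]+")


def training_vectors(page, annotated_value):
    """Turn one annotated page into (features, label) pairs, one per token."""
    span_start = page.index(annotated_value)
    span_end = span_start + len(annotated_value)
    matches = list(TOKEN.finditer(page))
    tokens = [m.group() for m in matches]
    vectors = []
    for i, m in enumerate(matches):
        features = {
            "token": tokens[i],
            "is_tag": tokens[i].startswith("<"),
            "has_digit": any(ch.isdigit() for ch in tokens[i]),
            "prev": tokens[i - 1] if i > 0 else "<START>",
            "next": tokens[i + 1] if i + 1 < len(tokens) else "<END>",
        }
        # Positive example if the token lies fully inside the annotation.
        label = span_start <= m.start() and m.end() <= span_end
        vectors.append((features, label))
    return vectors


# Hypothetical annotated page: the annotator marked the processor name.
page = "<td><b>Processor</b></td><td>AMD Athlon 1GHz</td>"
vecs = training_vectors(page, "AMD Athlon 1GHz")
for features, label in vecs:
    print(label, features)
```

Vectors like these would then be passed to whichever learning technique is selected; the choice of learner is left open here, as it is on the slide.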

13 WP3: Workplan for Version 1
• Month 12: Development of the annotation tool (this forms part of the annotation tool that will be developed for the needs of WP1 and WP2)
• Month 12: Specification of the learning techniques that will be used for wrapper induction
• Month 14: Annotation of the Web pages (creation of the training and testing corpus) for the 1st domain in the 4 languages
• Month 16: 1st version of the wrapper induction module
• Month 18: Final version of the wrapper induction module

