Published by Frank Bruce. Modified over 9 years ago.
1
Project Overview
Vangelis Karkaletsis, NCSR “Demokritos”
Frascati, July 17, 2002
(IST-2000-25366)
http://www.iit.demokritos.gr/skel/crossmarc/
2
© CROSSMARC, Frascati, July 17, 2002
Objectives
CROSSMARC develops commercial-strength technology for information extraction from web pages that:
- employs state-of-the-art language engineering tools and techniques
- can be used to process pages written in several languages
- can be adapted semi-automatically to new product types
3
CROSSMARC Consortium
C  National Centre for Scientific Research “Demokritos”  EL
P  VeltiNet A.E.  EL
P  University of Edinburgh  UK
P  Università di Roma Tor Vergata  I
P  Informatique CDC  F
P  Lingway  F
Start Date: March 1, 2001. End Date: August 31, 2003.
4
CROSSMARC big picture
[Architecture diagram. Labels: WEB; Focused Crawling; Domain-specific Web sites; Domain-specific Spidering; Domain Ontology; Web Pages Collection; XHTML pages; NERC-FE; Multilingual NERC and Name Matching; XHTML pages with NE annotations; Multilingual and Multimedia Fact Extraction; XML pages; Insertion into the database; Products Database; User Interface; End user]
5
Focused Crawling
- Exploitation of standard search engines
- Exploitation of a language identification module
- Exploitation of the page filtering module of the web spidering tool
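One plausible way to combine the three components listed above is a chain of filters over candidate URLs. The sketch below is an assumption about how they might fit together, not a description of the CROSSMARC crawler: `focused_crawl`, the toy language identifier, the toy page filter, and the example URLs are all hypothetical stand-ins.

```python
from typing import Callable, Dict, Iterable, List, Set


def focused_crawl(
    candidate_urls: Iterable[str],
    fetch: Callable[[str], str],
    identify_language: Callable[[str], str],
    is_relevant: Callable[[str], bool],
    languages: Set[str],
) -> List[str]:
    """Keep candidate URLs whose page is in a supported language and passes the page filter."""
    kept = []
    for url in candidate_urls:
        text = fetch(url)
        # Language identification step: drop pages in unsupported languages.
        if identify_language(text) not in languages:
            continue
        # Page filtering step: keep only pages judged relevant to the domain.
        if is_relevant(text):
            kept.append(url)
    return kept


# Toy stand-ins (hypothetical): a dict plays the role of search-engine hits plus fetching.
PAGES: Dict[str, str] = {
    "http://example.org/a": "laptop with Intel Pentium III processor",
    "http://example.org/b": "ordinateur portable avec processeur",
    "http://example.org/c": "holiday photos from the beach",
}


def toy_language_id(text: str) -> str:
    return "fr" if "ordinateur" in text else "en"


def toy_page_filter(text: str) -> bool:
    return "laptop" in text or "ordinateur portable" in text


kept = focused_crawl(PAGES, PAGES.__getitem__, toy_language_id, toy_page_filter, {"en"})
```

Passing the three components as callables keeps the chain independent of any particular search engine, language identifier, or filter.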
6
Domain-specific Spidering
[Figure; visible label: Domain Ontology]
7
Domain Ontology
Three levels of specialization: a product has a set of features, with several attributes ranging over some values.
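The product, feature, and attribute levels map naturally onto nested records. The following is a minimal sketch, assuming hypothetical class names (`Product`, `Feature`, `Attribute`) and an illustrative entry; none of these identifiers are taken from the CROSSMARC ontology schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    name: str
    values: List[str] = field(default_factory=list)  # values the attribute ranges over


@dataclass
class Feature:
    name: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class Product:
    name: str
    features: List[Feature] = field(default_factory=list)


# Illustrative entry only; the names below are assumptions, not the real ontology.
laptop = Product(
    name="laptop",
    features=[
        Feature(
            name="processor",
            attributes=[
                Attribute(name="processor name", values=["Intel Pentium III"]),
            ],
        )
    ],
)


def allowed_values(product: Product, feature_name: str, attribute_name: str) -> List[str]:
    """Walk product -> feature -> attribute and return the attribute's values."""
    for feature in product.features:
        if feature.name == feature_name:
            for attribute in feature.attributes:
                if attribute.name == attribute_name:
                    return attribute.values
    return []
```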
8
Lexicon
Surface representations (strings) of ontology concepts are organized in four different lexicons (one for each language). Each lexicon is organized as a set of nodes grouping together synonyms or regular expressions of an ontology concept.
Example node, with its ontology concept reference:
Intel Pentium III
Pentium III
P3
PIII
(Intel|INTEL)?(S)?(Pentium|PENTIUM|P|Pent|PENT)(S)?[\.]?[\-]?(S)?(3|III)
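A lexicon node of this kind can be sketched as a concept reference plus a list of synonyms and an optional regular expression. Two assumptions are made here: the `(S)?` groups in the slide's pattern are read as optional whitespace (`\s?`), which is what the listed synonyms require, and the names `LexiconNode` and `lookup` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Pattern


@dataclass
class LexiconNode:
    concept: str                             # reference to the ontology concept
    synonyms: List[str]                      # literal surface forms
    pattern: Optional[Pattern[str]] = None   # regular expression for further variants

    def matches(self, text: str) -> bool:
        if text in self.synonyms:
            return True
        return self.pattern is not None and self.pattern.fullmatch(text) is not None


PENTIUM_III = LexiconNode(
    concept="Intel Pentium III",
    synonyms=["Intel Pentium III", "Pentium III", "P3", "PIII"],
    # Assumption: "(S)?" on the slide stands for optional whitespace.
    pattern=re.compile(
        r"(Intel|INTEL)?\s?(Pentium|PENTIUM|P|Pent|PENT)\s?\.?-?\s?(3|III)"
    ),
)


def lookup(nodes: List[LexiconNode], text: str) -> Optional[str]:
    """Return the ontology concept whose node matches the surface string, if any."""
    for node in nodes:
        if node.matches(text):
            return node.concept
    return None
```

For instance, `lookup([PENTIUM_III], "INTEL Pentium-III")` resolves to the same concept as `"P3"`, which is the normalization a lexicon of this shape makes possible.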
9
Integration of monolingual IE systems
- 4 languages are currently supported: English, Greek, Italian, French
- Each partner brings into the project its own NERC and language processing tools
- An example: the Hellenic IE system
10
Example: the Hellenic IE system (Input)
11
Example: the Hellenic IE system (Output)
12
Combination of Wrapper Induction techniques with traditional IE tools …
Wrapper = Set of extraction rules
[Diagram labels: Initial pages; GUI; Fact-annotated pages; Wrapper Induction System; Wrapper; New pages; Extracted Data]
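To make "wrapper = set of extraction rules" concrete, here is a toy sketch of one well-known style of wrapper induction, in which each rule is a pair of left and right delimiters learned from annotated pages. The slides do not say which induction algorithm CROSSMARC uses, so this is a swapped-in illustration; the function names, field names, and HTML snippets are all hypothetical.

```python
import os
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, str]  # (left delimiter, right delimiter)


def common_prefix(strings: List[str]) -> str:
    return os.path.commonprefix(strings)


def common_suffix(strings: List[str]) -> str:
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def induce_rule(examples: List[Tuple[str, str]]) -> Rule:
    """Learn delimiters from (page, annotated value) pairs."""
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)
        lefts.append(page[:start])                 # text before the annotated value
        rights.append(page[start + len(value):])   # text after the annotated value
    return common_suffix(lefts), common_prefix(rights)


def apply_rule(rule: Rule, page: str) -> Optional[str]:
    """Extract the text between the rule's delimiters, or None if they are absent."""
    left, right = rule
    start = page.find(left)
    if start < 0:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end >= 0 else None


# Fact-annotated training pages (toy data).
p1 = "<td>CPU:</td><td>Pentium III</td><td>RAM:</td><td>128 MB</td>"
p2 = "<td>CPU:</td><td>Celeron</td><td>RAM:</td><td>64 MB</td>"

# The wrapper is the set of induced rules, one per field.
wrapper: Dict[str, Rule] = {
    "cpu": induce_rule([(p1, "Pentium III"), (p2, "Celeron")]),
    "ram": induce_rule([(p1, "128 MB"), (p2, "64 MB")]),
}

new_page = "<td>CPU:</td><td>Athlon</td><td>RAM:</td><td>256 MB</td>"
extracted = {name: apply_rule(rule, new_page) for name, rule in wrapper.items()}
```

Delimiter rules like these only generalize across pages that share a layout, which is one reason to combine them with linguistic IE tools, as the slide title suggests.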
13
CROSSMARC End User Interface
The UI is generated dynamically from the domain ontology and lexicons (apart from some static parts) …
14
Other CROSSMARC Tools
- Corpus Formation Tool
- Corpus collection and annotation methodology
- Web Annotator
- Ontology maintenance tools
- NERC-based Demarcation tool
15
Corpus formation (for the needs of page filtering)
[Diagram labels: Positive pages; Ontology; 1 or more Lexicon(s); Corpus formation tool; Unidentified pages; Similar-to-positive pages; Manual classification; Negative pages]
16
Web Annotator (for the needs of training and testing NERC and FE)
17
Adding an IE system for a new language...
- add the language-specific lexicon for the domain, according to the lexicon schema and the domain ontology
- train the web spidering tool using the language-specific lexicon and corpus
- use the web annotator to create a corpus for training and testing the IE system
- configure the IE system to accept as input the XHTML pages provided by the spidering tool
- configure the IE system to output an XML document for the product description(s) found in the page, according to the FE schema
- connect to the IE remote invocation system (IERI)
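The output step in the list above could look roughly like the following sketch, which serializes extracted product descriptions to XML with Python's standard library. The FE schema is not shown in these slides, so the element and attribute names (`products`, `product`, `fact`, `lang`, `name`) are placeholders and should not be read as the real schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def products_to_xml(products: List[Dict[str, str]], language: str) -> str:
    """Serialize extracted product descriptions; element names are hypothetical."""
    root = ET.Element("products", attrib={"lang": language})
    for product in products:
        node = ET.SubElement(root, "product")
        for name, value in product.items():
            # One element per extracted fact, keyed by the feature name.
            fact = ET.SubElement(node, "fact", attrib={"name": name})
            fact.text = value
    return ET.tostring(root, encoding="unicode")


xml_doc = products_to_xml([{"processor": "Intel Pentium III"}], "en")
```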
18
Adding a new domain...
- add the new domain-specific ontology and lexicons for the languages supported, according to the ontology and lexicon schema
- add the domain-specific FE schema
- train and test the web spidering tool (page filtering, link scoring) using the provided tools
- create the domain-specific training and testing corpus following the corpus collection methodology and using the web annotator tool
- train the monolingual IE systems in the new domain (combining linguistic information with wrapper induction)
19
Evaluation Task
- Web Spidering
- Information Extraction
- End-user interface
20
Web Spidering Tool
21
Remote Invocation of monolingual IE systems
22
CROSSMARC Evaluation Questionnaire