Project Overview Vangelis Karkaletsis NCSR “Demokritos” Frascati, July 17, 2002 (IST-2000-25366)


© CROSSMARC, Frascati, July 17, 2002
Objectives
CROSSMARC develops commercial-strength technology for information extraction from web pages that
• employs state-of-the-art language engineering tools and techniques
• can be used to process pages written in several languages
• can be adapted semi-automatically to new product types

CROSSMARC Consortium
C   National Centre for Scientific Research “Demokritos”   EL
P   VeltiNet A.E.                                          EL
P   University of Edinburgh                                UK
P   Università di Roma Tor Vergata                         I
P   Informatique CDC                                       F
P   Lingway                                                F
Start Date: March 1, 2001, End Date: August 31, 2003

CROSSMARC big picture
[Architecture diagram. Labels: WEB; Focused Crawling; Domain-specific Web sites; Domain-specific Spidering; Domain Ontology; Web Pages Collection; XHTML pages; NERC-FE; Multilingual NERC and Name Matching; XHTML pages (with NE annotations); Multilingual and Multimedia Fact Extraction; XML pages; Insertion into the database; Products Database; User Interface; End user]

Focused Crawling
• Exploitation of standard search engines
• Exploitation of a language identification module
• Exploitation of the page filtering module of the web spidering tool
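
One way the three modules could be chained is sketched below. This is only an illustration: `focused_crawl`, the toy language identifier, the toy page filter, the laptop keywords, and the example URLs are all hypothetical stand-ins, not CROSSMARC components.

```python
from typing import Callable, Iterable, List, Set, Tuple

Page = Tuple[str, str]  # (url, page text)


def focused_crawl(
    search_results: Iterable[Page],
    identify_language: Callable[[str], str],
    is_relevant: Callable[[str], bool],
    languages: Set[str],
) -> List[str]:
    """Keep URLs of pages in a supported language that pass the page filter."""
    kept = []
    for url, text in search_results:
        # Step 2: discard pages written in an unsupported language.
        if identify_language(text) not in languages:
            continue
        # Step 3: discard pages the filter judges off-domain.
        if is_relevant(text):
            kept.append(url)
    return kept


# Toy stand-ins (assumptions, for illustration only).
def toy_identify_language(text: str) -> str:
    return "it" if " il " in f" {text.lower()} " else "en"


def toy_is_relevant(text: str) -> bool:
    lowered = text.lower()
    return "laptop" in lowered or "notebook" in lowered


# Step 1 would come from a standard search engine; here it is a fixed list.
hits = [
    ("http://example.com/a", "Cheap laptop offers"),
    ("http://example.com/b", "Cooking recipes"),
    ("http://example.com/c", "Il notebook piu venduto"),
]
english_only = focused_crawl(hits, toy_identify_language, toy_is_relevant, {"en"})
english_and_italian = focused_crawl(
    hits, toy_identify_language, toy_is_relevant, {"en", "it"}
)
```

The language check runs before the page filter because the filter is assumed to be language-specific.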

Domain-specific Spidering Domain Ontology

Domain Ontology
Three levels of specialization:
• a product has
• a set of features with
• several attributes ranging over
• some values
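
The product / feature / attribute / value hierarchy can be pictured as nested records. The sketch below is a minimal illustration under that reading; the class names, the `allowed_values` helper, and the laptop example are assumptions and do not reproduce the actual CROSSMARC ontology schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    name: str
    values: List[str] = field(default_factory=list)  # values the attribute ranges over


@dataclass
class Feature:
    name: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class Product:
    name: str
    features: List[Feature] = field(default_factory=list)

    def allowed_values(self, feature: str, attribute: str) -> List[str]:
        """Return the values of one attribute, or [] if it is not defined."""
        for f in self.features:
            if f.name == feature:
                for a in f.attributes:
                    if a.name == attribute:
                        return a.values
        return []


# Hypothetical example domain.
laptop = Product(
    name="laptop",
    features=[
        Feature(
            name="processor",
            attributes=[Attribute("model", ["Pentium III", "Pentium 4"])],
        )
    ],
)
```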

Lexicon
Surface representations (strings) of ontology concepts are organized in four different lexicons (one for each language). Each lexicon is organized as a set of nodes grouping together synonyms or regular expressions of an ontology concept.
Ontology concept reference:
• Intel Pentium III
• Pentium III
• P3
• PIII
• (Intel|INTEL)?(S)?(Pentium|PENTIUM|P|Pent|PENT)(S)?[\.]?[\-]?(S)?(3|III)
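
A lexicon node of this kind could be modelled as follows. This is a sketch, not the CROSSMARC lexicon format: `LexiconNode`, `lookup`, and the concept reference string are hypothetical, and the `(S)` in the slide's pattern is assumed to stand for a whitespace character, so it is written as `(\s)` here.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: "(S)" on the slide is read as "(\s)"; the rest is unchanged.
PENTIUM_III = re.compile(
    r"(Intel|INTEL)?(\s)?(Pentium|PENTIUM|P|Pent|PENT)(\s)?[\.]?[\-]?(\s)?(3|III)"
)


@dataclass
class LexiconNode:
    concept_ref: str  # hypothetical identifier of the ontology concept
    synonyms: List[str] = field(default_factory=list)
    patterns: List[re.Pattern] = field(default_factory=list)

    def matches(self, surface: str) -> bool:
        """True if the string is a listed synonym or fully matches a pattern."""
        if surface in self.synonyms:
            return True
        return any(p.fullmatch(surface) for p in self.patterns)


def lookup(lexicon: List[LexiconNode], surface: str) -> Optional[str]:
    """Map a surface string to the concept reference of the first matching node."""
    for node in lexicon:
        if node.matches(surface):
            return node.concept_ref
    return None


english_lexicon = [
    LexiconNode(
        concept_ref="processor/pentium-iii",
        synonyms=["Intel Pentium III", "Pentium III", "P3", "PIII"],
        patterns=[PENTIUM_III],
    )
]
```

With one lexicon per language, the same concept reference would be shared across languages while the synonyms and patterns differ.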

Integration of monolingual IE systems
• 4 languages are currently supported: English, Greek, Italian, French
• Each partner brings to the project its own NERC and language processing tools
• An example: the Hellenic IE system

Example: the Hellenic IE system (Input)

Example: the Hellenic IE system (Output)

Combination of Wrapper Induction techniques with traditional IE tools …
Wrapper = set of extraction rules
[Diagram labels: Initial pages; GUI; Fact-annotated pages; Wrapper Induction System; Wrapper; New pages; Extracted Data]
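
To make "wrapper = set of extraction rules" concrete, here is a toy delimiter-based induction: each rule is a (left delimiter, right delimiter) pair learned from fact-annotated pages and then applied to new pages. This simple technique is swapped in purely for illustration; it is not the CROSSMARC wrapper induction system, and the function names and sample pages are hypothetical.

```python
import os
import re
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (left delimiter, right delimiter)


def common_suffix(strings: List[str]) -> str:
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def induce_rule(examples: List[Tuple[str, str]]) -> Rule:
    """Learn delimiters from (page, annotated value) pairs."""
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)  # first occurrence of the annotated value
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    return common_suffix(lefts), os.path.commonprefix(rights)


def apply_wrapper(wrapper: Dict[str, Rule], page: str) -> Dict[str, str]:
    """Extract one value per field using the learned delimiters."""
    extracted = {}
    for field_name, (left, right) in wrapper.items():
        pattern = re.escape(left) + r"(.*?)" + re.escape(right)
        match = re.search(pattern, page, re.DOTALL)
        if match:
            extracted[field_name] = match.group(1)
    return extracted


# Fact-annotated pages: the price value is marked in each page.
annotated = [
    ("<html><b>Price:</b> <i>999 EUR</i><br>A</html>", "999 EUR"),
    ("<body><p>Offer</p><b>Price:</b> <i>1250 EUR</i><br>B</body>", "1250 EUR"),
]
wrapper = {"price": induce_rule(annotated)}

new_page = "<div><b>Price:</b> <i>1100 EUR</i><br>C</div>"
extracted = apply_wrapper(wrapper, new_page)
```

Rules like these depend only on page layout, which is presumably why the slide pairs them with traditional, linguistically informed IE tools.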

CROSSMARC End User Interface
The UI is generated dynamically from the domain ontology and lexicons (apart from some static parts) …
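
As an illustration of generating a UI from the ontology and lexicons, the sketch below renders a search form whose structure comes from an ontology and whose labels come from a language-specific lexicon. The dictionary layouts, the identifiers, the Italian labels, and `render_search_form` are all assumptions made for this example.

```python
from html import escape
from typing import Dict, List

Ontology = Dict[str, Dict[str, List[str]]]  # feature -> attribute -> value ids


def render_search_form(ontology: Ontology, lexicon: Dict[str, str]) -> str:
    """Build an HTML form; identifiers without a lexicon entry are shown as-is."""
    lines = ["<form>"]
    for feature, attributes in ontology.items():
        legend = escape(lexicon.get(feature, feature))
        lines.append(f"  <fieldset><legend>{legend}</legend>")
        for attribute, values in attributes.items():
            label = escape(lexicon.get(attribute, attribute))
            lines.append(f'    <label>{label} <select name="{escape(attribute)}">')
            for value in values:
                text = escape(lexicon.get(value, value))
                lines.append(f'      <option value="{escape(value)}">{text}</option>')
            lines.append("    </select></label>")
        lines.append("  </fieldset>")
    lines.append("</form>")
    return "\n".join(lines)


# Hypothetical ontology fragment and Italian lexicon.
ontology = {"processor": {"model": ["pentium-iii", "pentium-4"]}}
italian_lexicon = {
    "processor": "Processore",
    "model": "Modello",
    "pentium-iii": "Pentium III",
    "pentium-4": "Pentium 4",
}
form_html = render_search_form(ontology, italian_lexicon)
```

Switching the lexicon changes the labels without touching the form structure, which matches the idea of one ontology shared across languages.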

Other CROSSMARC Tools
• Corpus Formation Tool
• Corpus collection and annotation methodology
• Web Annotator
• Ontology maintenance tools
• NERC-based Demarcation tool

Corpus formation (for the needs of page filtering)
[Diagram labels: Positive pages; Unidentified pages; Ontology; 1 or more Lexicon(s); Corpus formation tool; Similar-to-positive pages; Manual classification; Negative pages]
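
One possible reading of this diagram is that the tool picks out unidentified pages that resemble the positive ones, so that manual classification can then separate them into positives and negatives. The sketch below follows that reading with a simple technique chosen for illustration: cosine similarity over counts of lexicon terms. The function names, threshold, lexicon terms, and sample pages are assumptions, not the actual tool.

```python
import math
import re
from collections import Counter
from typing import List, Set


def term_vector(text: str, lexicon_terms: Set[str]) -> Counter:
    """Count occurrences of lexicon terms in a page (lowercased tokens)."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t in lexicon_terms)


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_to_positive(
    positive_pages: List[str],
    unidentified_pages: List[str],
    lexicon_terms: Set[str],
    threshold: float = 0.5,
) -> List[str]:
    """Return unidentified pages whose term profile resembles the positives."""
    centroid = Counter()
    for page in positive_pages:
        centroid.update(term_vector(page, lexicon_terms))
    return [
        page
        for page in unidentified_pages
        if cosine(term_vector(page, lexicon_terms), centroid) >= threshold
    ]


terms = {"laptop", "pentium", "ram"}
positives = [
    "Laptop with Pentium processor and 128 MB RAM",
    "Pentium laptop on sale",
]
unidentified = ["New laptop: Pentium, more RAM", "Weather forecast for Rome"]
candidates = similar_to_positive(positives, unidentified, terms)
```

The returned candidates would then go to manual classification.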

Web Annotator (for the needs of training and testing NERC and FE)

© CROSSMARC, Frascati, July 17, 2002  add the language-specific lexicon for the domain, according to the lexicon schema and the domain ontology  train the web spidering tool using the language-specific lexicon and corpus  use the web annotator to create corpus for training and testing the IE system  configure the IE system in order to accept as input the XHTML pages provided by the spidering tool  configure the IE system in order to output an XML document for the product description(s) found in the page, according to the FE schema  Connect to the IE remote invocation system (IERI) Adding an IE systems for a new language...

© CROSSMARC, Frascati, July 17, 2002  add the new domain-specific ontology and lexicons for the languages supported according to the ontology and lexicon schema  add the domain-specific FE schema  train and test the web spidering tool (page filtering, link scoring) using the provided tools  create the domain-specific training and testing corpus following the corpus collection methodology and using the web annotator tool  train the monolingual IE systems in the new domain (combining linguistic information with wrapper induction) Adding a new domain...

Evaluation Task
• Web Spidering
• Information Extraction
• End-user interface

Web Spidering Tool

Remote Invocation of monolingual IE systems

CROSSMARC Evaluation Questionnaire