Institute of Informatics & Telecommunications NCSR “Demokritos”

Presentation transcript:

Institute of Informatics & Telecommunications, NCSR “Demokritos”. Cross-lingual Information Extraction from Web pages: The use of a general-purpose Text Engineering Platform. Georgios Petasis, Vangelis Karkaletsis, Constantine D. Spyropoulos. RANLP, September 10-12, 2003.

Contents: Extracting Information from WEB pages; Overview of CROSSMARC; Overview of Ellogon; The role of Ellogon in CROSSMARC; Conclusions. (Slide footer: Cross-lingual Information Extraction from Web pages, 10 September 2003)

IE from WEB pages: Identified Web Sites; Delimiter-based Methods; Wrapper Induction.
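The slide names delimiter-based methods without showing one. As a minimal sketch of the general idea only (not the CROSSMARC or Ellogon implementation), a delimiter-based wrapper can be written as a function that returns whatever text sits between a fixed left and right delimiter. The function name `extract_between`, the sample markup and the product values are all invented for illustration.

```python
from typing import List, Tuple


def extract_between(page: str, left: str, right: str) -> List[str]:
    """Return every substring that lies between a left and a right delimiter."""
    results: List[str] = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break  # no further left delimiter
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break  # left delimiter without a closing right delimiter
        results.append(page[start:end].strip())
        pos = end + len(right)
    return results


# Hypothetical product listing; the markup is an assumption for illustration.
PAGE = (
    "<tr><td><b>Laptop A</b></td><td><i>999 EUR</i></td></tr>"
    "<tr><td><b>Laptop B</b></td><td><i>1299 EUR</i></td></tr>"
)

names = extract_between(PAGE, "<b>", "</b>")
prices = extract_between(PAGE, "<i>", "</i>")
records: List[Tuple[str, str]] = list(zip(names, prices))
```

A wrapper of this kind depends on the page keeping a fixed layout. The CROSSMARC objective of handling pages without a standardised format is what motivates going beyond it.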

CROSSMARC objective: Implement technology for information extraction from web pages that: can operate on pages without a standardised format (structured, semi-structured, free text); can be used to process web pages written in several languages; can be adapted semi-automatically to new domains.

CROSSMARC technologies: The system developed: exploits language technology and machine learning methods; exploits domain-specific ontologies and language-specific lexica; employs localisation and user modelling techniques; implements an open and multi-agent architecture.

CROSSMARC consortium: EL: National Centre for Scientific Research “Demokritos” (Coordinator), Velti S.A.; UK: University of Edinburgh; I: Universita di Roma Tor Vergata; F: Lingway, Informatique CDC.

CROSSMARC Architecture (diagram not included in the transcript)

Ellogon Architecture (diagram labels): Language Processing Components; Graphical Interface; Services: Internet (HTTP, FTP, SOAP), Operating System Services (ActiveX, COM, DDE), Database Connectivity (ODBC), …; Collection – Document Manager; Storage Format Abstraction Layer: XML, Ellogon Databases, Operating System, ???

Ellogon Key Features: Visualisation of textual/HTML/XML data; Support for lexical resources; Annotated corpora creation; Annotated data comparison; Transformation of linguistic information into vectors; Annotated corpora modifications; Stand-alone application creation.

Ellogon in CROSSMARC: Web Pages Collection. Focused Crawler: identifies web sites that are of relevance to the particular domain. Domain Specific Spider: identifies web pages of interest within the retrieved web sites (Site Navigation, Page Filtering, Link Scoring).
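The spider's three functions (site navigation, page filtering, link scoring) can be illustrated with a small best-first traversal. This is a sketch under stated assumptions, not the CROSSMARC spider: the keyword weights in `DOMAIN_TERMS`, the threshold, the in-memory `SITE` dictionary and the function names are all hypothetical, and scoring by keyword weights is a stand-in for whatever scoring method the project actually used.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical domain vocabulary with weights (an assumption for illustration).
DOMAIN_TERMS: Dict[str, float] = {
    "laptop": 2.0, "notebook": 2.0, "price": 1.0, "offer": 1.0,
}

# url -> (page text, [(anchor text, target url), ...])
Site = Dict[str, Tuple[str, List[Tuple[str, str]]]]


def score_text(text: str) -> float:
    """Sum the weights of the domain terms that occur in the text."""
    return sum(DOMAIN_TERMS.get(word, 0.0) for word in text.lower().split())


def spider(site: Site, start: str, page_threshold: float = 2.0,
           max_pages: int = 10) -> List[str]:
    """Navigate a site best-first and return the pages that pass the filter."""
    frontier: List[Tuple[float, str]] = [(0.0, start)]
    seen = {start}
    kept: List[str] = []
    visited = 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)  # site navigation: best link first
        visited += 1
        text, links = site[url]
        if score_text(text) >= page_threshold:  # page filtering
            kept.append(url)
        for anchor, target in links:
            if target in site and target not in seen:
                seen.add(target)
                # link scoring: negate so the highest score is popped first
                heapq.heappush(frontier, (-score_text(anchor), target))
    return kept


# Toy in-memory site standing in for pages fetched over HTTP.
SITE: Site = {
    "/": ("welcome to our shop",
          [("laptop offers", "/laptops"), ("about us", "/about")]),
    "/laptops": ("laptop price list", [("notebook x price", "/laptops/x")]),
    "/about": ("company history", []),
    "/laptops/x": ("notebook x best price offer", []),
}

interesting = spider(SITE, "/")
```

Here the links with the most promising anchor text are followed first, and only pages whose own text scores above the threshold are kept for information extraction.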

Ellogon in CROSSMARC: Information Extraction. Interesting Web Pages; Information Extraction Remote Invocation (IERI); CROSSMARC Multilingual IE, with the stages XML Conversion, NERC, Demarcation (NERC based Demarcator) and Fact Extraction; per-language components: English IE (ENERC, EFE), French IE (FNERC, FFE), Greek IE (HNERC, HFE), Italian IE (INERC, IFE); Data Inserter; Products Database.
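The flow suggested by the diagram labels (language-specific NERC, then demarcation, then fact extraction, then insertion into a products database) can be sketched as a dispatch on the page language. Everything below is a toy stand-in: the language codes, the token-based "NERC", the grouping rule and the record layout are assumptions made for illustration, not the behaviour of the actual ENERC/FNERC/HNERC/INERC or fact extraction modules.

```python
from typing import Callable, Dict, List

Entity = Dict[str, str]
Fact = Dict[str, str]


def make_nerc(lang: str) -> Callable[[str], List[Entity]]:
    """Build a toy recogniser: numeric tokens are PRICE, all others MODEL."""
    def nerc(text: str) -> List[Entity]:
        return [
            {"text": tok, "type": "PRICE" if tok.isdigit() else "MODEL",
             "lang": lang}
            for tok in text.split()
        ]
    return nerc


def demarcate(entities: List[Entity]) -> List[List[Entity]]:
    """Group entities into product descriptions; each MODEL opens a new one."""
    products: List[List[Entity]] = []
    for entity in entities:
        if entity["type"] == "MODEL" or not products:
            products.append([])
        products[-1].append(entity)
    return products


def extract_facts(group: List[Entity]) -> Fact:
    """Turn one product description into a flat record."""
    fact: Fact = {}
    for entity in group:
        fact.setdefault(entity["type"].lower(), entity["text"])
    fact["lang"] = group[0]["lang"]
    return fact


# One recogniser per language; the codes "en", "fr", "el", "it" are assumed.
NERC_BY_LANG = {lang: make_nerc(lang) for lang in ("en", "fr", "el", "it")}


def process(page: Dict[str, str], database: List[Fact]) -> None:
    """Run NERC, demarcation and fact extraction, then insert the records."""
    nerc = NERC_BY_LANG.get(page["lang"])
    if nerc is None:
        raise ValueError(f"no IE system for language {page['lang']!r}")
    for group in demarcate(nerc(page["text"])):
        database.append(extract_facts(group))  # data inserter


products_db: List[Fact] = []
process({"lang": "en", "text": "LaptopA 999 LaptopB 1299"}, products_db)
process({"lang": "el", "text": "LaptopC 850"}, products_db)
```

Keeping demarcation and insertion language-independent while swapping only the recogniser mirrors the way the slide separates shared stages from the four language-specific IE systems.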

Ellogon in CROSSMARC: Retrain whole HNERC system.

Ellogon in CROSSMARC

Conclusions: The CROSSMARC approach covers the whole process, from the identification of web sites and web pages of interest to the extraction of information from them and its presentation. Ellogon offered functionalities that facilitated the development, deployment and exploitation of core CROSSMARC components.

Future Plans: After the completion of the project, integrate all CROSSMARC components and customisation facilities within Ellogon, aiming at the development of an Ellogon-based platform for cross-lingual information management from web pages.

Useful Links: CROSSMARC: http://www.iit.demokritos.gr/skel/crossmarc ; Ellogon: http://www.iit.demokritos.gr/skel/Ellogon

Balkan Conference on Informatics BCI 2003: http://www.iit.demokritos.gr/skel/bci03_workshop/