Institute of Informatics & Telecommunications, NCSR “Demokritos”. Cross-lingual Information Extraction from Web pages: The use of a general-purpose Text Engineering Platform. Georgios Petasis, Vangelis Karkaletsis, Constantine D. Spyropoulos. RANLP, September 10-12, 2003
Contents: Extracting Information from Web pages; Overview of CROSSMARC; Overview of Ellogon; The role of Ellogon in CROSSMARC; Conclusions
Cross-lingual Information Extraction from Web pages 10 September 2003
IE from Web pages: Wrapper Induction [diagram labels: Identified Web Sites; Delimiter-based Methods; Wrapper Induction]
CROSSMARC objective: Implement technology for information extraction from web pages that can operate on pages without a standardised format (structured, semi-structured, free text); can be used to process web pages written in several languages; can be adapted semi-automatically to new domains.
CROSSMARC technologies: The system developed exploits language technology and machine learning methods; exploits domain-specific ontologies and language-specific lexica; employs localisation and user modelling techniques; implements an open and multi-agent architecture.
CROSSMARC consortium: National Centre for Scientific Research “Demokritos” (Coordinator) and Velti S.A. (EL); University of Edinburgh (UK); Universita di Roma Tor Vergata (I); Lingway and Informatique CDC (F)
CROSSMARC Architecture [figure]
Ellogon Architecture [diagram labels: …; Language Processing Components; Graphical Interface; Services; Internet (HTTP, FTP, SOAP); Operating System Services (ActiveX, COM, DDE); Database Connectivity (ODBC); …; Collection – Document Manager; Storage Format Abstraction Layer; XML; Ellogon Databases; Operating System; ???]
Ellogon Key Features: Visualisation of textual/HTML/XML data; Support for lexical resources; Annotated corpora creation; Annotated data comparison; Transformation of linguistic information into vectors; Annotated corpora modifications; Stand-alone application creation
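As an illustration of what "transformation of linguistic information into vectors" can look like, here is a minimal sketch in Python. It does not use Ellogon's API: the token dictionaries, the `pos` attribute and the function name `build_vectors` are all assumptions made for this example. Each annotated token becomes a one-hot part-of-speech vector followed by two surface features.

```python
def build_vectors(tokens):
    """Turn annotated tokens into numeric feature vectors.

    `tokens` is a list of dicts with hypothetical keys "text" and "pos".
    Returns the ordered POS tag inventory and one vector per token:
    one-hot POS tag, then [starts with capital, is all digits].
    """
    pos_tags = sorted({t["pos"] for t in tokens})
    vectors = []
    for t in tokens:
        one_hot = [1 if t["pos"] == tag else 0 for tag in pos_tags]
        surface = [int(t["text"][0].isupper()), int(t["text"].isdigit())]
        vectors.append(one_hot + surface)
    return pos_tags, vectors


# Hypothetical annotated tokens, as a tagger might produce them.
TOKENS = [
    {"text": "Dell", "pos": "NNP"},
    {"text": "costs", "pos": "VBZ"},
    {"text": "999", "pos": "CD"},
]
```

Vectors of this kind are the usual input to machine learning components such as a trainable named-entity recogniser.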
Ellogon in CROSSMARC: Web Pages Collection. Focused Crawler: identifies web sites that are of relevance to the particular domain. Domain Specific Spider: identifies web pages of interest within the retrieved web sites (Site Navigation; Page Filtering; Link Scoring).
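The three spider functions named on this slide (site navigation, page filtering, link scoring) can be sketched roughly as follows. This is not the CROSSMARC implementation: the keyword-overlap scoring, the `DOMAIN_TERMS` vocabulary, the threshold and the in-memory `SITE` structure are all assumptions chosen to keep the example self-contained.

```python
import heapq

# Hypothetical domain vocabulary; the slides do not list the actual terms.
DOMAIN_TERMS = {"laptop", "notebook", "processor", "price"}


def score_text(text):
    """Fraction of domain terms that occur in the text (0.0 to 1.0)."""
    words = set(text.lower().split())
    return len(words & DOMAIN_TERMS) / len(DOMAIN_TERMS)


def spider(site, start, page_threshold=0.5, max_pages=10):
    """Best-first navigation of one site.

    `site` maps a URL to (page_text, [(anchor_text, target_url), ...]).
    Returns the URLs of pages that pass the filter, in visit order.
    """
    frontier = [(-1.0, start)]  # max-heap emulated with negated scores
    seen = {start}
    interesting = []
    visited = 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        visited += 1
        text, links = site[url]
        # Page filtering: keep pages with enough domain vocabulary.
        if score_text(text) >= page_threshold:
            interesting.append(url)
        # Link scoring: rank outgoing links by their anchor text.
        for anchor, target in links:
            if target in site and target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score_text(anchor), target))
    return interesting


# Toy site used only for illustration.
SITE = {
    "/": ("welcome to our shop",
          [("laptop and notebook offers", "/laptops"), ("about us", "/about")]),
    "/laptops": ("laptop notebook price list", [("processor details", "/cpu")]),
    "/about": ("company history", []),
    "/cpu": ("processor price laptop", []),
}
```

Calling `spider(SITE, "/")` follows the highest-scoring links first and keeps only the pages whose text passes the filter. A trained classifier could replace `score_text` in both roles.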
Ellogon in CROSSMARC: Information Extraction. CROSSMARC Multilingual IE [diagram labels: Interesting Web Pages; XML Conversion; Information Extraction Remote Invocation (IERI); English IE (ENERC, EFE); French IE (FNERC, FFE); Greek IE (HNERC, HFE); Italian IE (INERC, IFE); NERC based Demarcator; Data Inserter; Products Database]. Stages: NERC; Demarcation; Fact Extraction.
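The stage order on this slide (NERC, then demarcation, then fact extraction, then insertion into a products database) can be illustrated with a minimal Python sketch. Everything below is a stand-in: the regular-expression recogniser, the entity kinds `MODEL` and `PRICE`, the language codes and the list used as a database are assumptions for illustration, not the actual CROSSMARC modules.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    kind: str   # e.g. "MODEL" or "PRICE"
    text: str
    start: int  # character offset in the page text


def toy_nerc(text):
    """Stand-in NERC: model names like 'X200' and prices like '999 EUR'."""
    patterns = {"MODEL": r"\b[A-Z]\d{3}\b", "PRICE": r"\b\d+ EUR\b"}
    found = [Entity(kind, m.group(), m.start())
             for kind, pattern in patterns.items()
             for m in re.finditer(pattern, text)]
    return sorted(found, key=lambda e: e.start)


def demarcate(entities):
    """Stand-in demarcator: start a new product description at each MODEL."""
    products = []
    for entity in entities:
        if entity.kind == "MODEL" or not products:
            products.append([])
        products[-1].append(entity)
    return products


def extract_facts(product):
    """Stand-in fact extraction: the first entity of each kind fills a slot."""
    facts = {}
    for entity in product:
        facts.setdefault(entity.kind.lower(), entity.text)
    return facts


# One NERC per language; here all four share the same toy implementation.
NERC_BY_LANGUAGE = {lang: toy_nerc for lang in ("en", "fr", "el", "it")}


def run_ie(page_text, language, database):
    """Dispatch a page to its language's modules and insert the results."""
    entities = NERC_BY_LANGUAGE[language](page_text)
    for product in demarcate(entities):
        database.append({"language": language, **extract_facts(product)})
    return database
```

For example, `run_ie("Laptop X200 costs 999 EUR. Laptop Z310 costs 1200 EUR.", "en", [])` yields two product records. Keeping a separate recogniser per language behind one dispatch table mirrors the role the slide gives to remote invocation of the language-specific IE systems.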
Ellogon in CROSSMARC: Retrain whole HNERC system
Conclusions: The CROSSMARC approach covers the whole process, from the identification of web sites and web pages of interest to the extraction of information from them and its presentation. Ellogon offered functionality that facilitated the development, deployment and exploitation of core CROSSMARC components.
Future Plans: After the project's completion, integrate all CROSSMARC components and customisation facilities within Ellogon, with the aim of developing an Ellogon-based platform for cross-lingual information management from web pages.
Useful Links: CROSSMARC: http://www.iit.demokritos.gr/skel/crossmarc; Ellogon: http://www.iit.demokritos.gr/skel/Ellogon
Balkan Conference on Informatics, BCI 2003: http://www.iit.demokritos.gr/skel/bci03_workshop/