1
Institute of Informatics & Telecommunications NCSR “Demokritos”
Cross-lingual Information Extraction from Web pages: The use of a general-purpose Text Engineering Platform
Georgios Petasis, Vangelis Karkaletsis, Constantine D. Spyropoulos
RANLP, September 10-12, 2003
2
Contents
Extracting Information from Web pages
Overview of CROSSMARC
Overview of Ellogon
The role of Ellogon in CROSSMARC
Conclusions
Cross-lingual Information Extraction from Web pages September 2003
3
IE from Web pages
[Diagram labels: Identified Web Sites; Wrapper Induction; Delimiter-based Methods]
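The slide names delimiter-based methods without showing one. As a minimal sketch of the general idea (not the CROSSMARC implementation), a delimiter-based wrapper can be pictured as a pair of left and right strings that bracket each value on a page. The function name `extract_between` and the sample markup are hypothetical, invented for illustration.

```python
from typing import List


def extract_between(page: str, left: str, right: str) -> List[str]:
    """Return every substring of `page` found between `left` and `right`."""
    values = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end].strip())
        pos = end + len(right)
    return values


# Hypothetical product listing; the markup is an assumption, not taken from the slides.
PAGE = (
    "<li><b>Laptop A</b> <i>1200 EUR</i></li>"
    "<li><b>Laptop B</b> <i>950 EUR</i></li>"
)

names = extract_between(PAGE, "<b>", "</b>")
prices = extract_between(PAGE, "<i>", "</i>")
records = list(zip(names, prices))
```

Wrapper induction would learn such delimiter pairs from labelled example pages instead of hard-coding them, which is why wrappers of this kind are tied to the layout of one site.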
4
CROSSMARC objective
Implement technology for information extraction from web pages that:
can operate on pages without a standardised format (structured, semi-structured, free text);
can be used to process web pages written in several languages;
can be adapted semi-automatically to new domains.
5
CROSSMARC technologies
The system developed:
exploits language technology and machine learning methods;
exploits domain-specific ontologies and language-specific lexica;
employs localisation and user modelling techniques;
implements an open and multi-agent architecture.
6
CROSSMARC consortium
EL: National Centre for Scientific Research “Demokritos” (Coordinator); Velti S.A.
UK: University of Edinburgh
I: Università di Roma Tor Vergata
F: Lingway; Informatique CDC
7
CROSSMARC Architecture
8
Ellogon Architecture
[Diagram labels, as extracted]
Language Processing Components …
Graphical Interface
Services: Internet (HTTP, FTP, SOAP); Operating System Services (ActiveX, COM, DDE); Database Connectivity (ODBC); …
Collection – Document Manager
Storage Format Abstraction Layer: XML; Ellogon Databases; ???
Operating System
9
Ellogon Key Features
Visualisation of textual/HTML/XML data
Support for lexical resources
Annotated corpora creation
Annotated data comparison
Transformation of linguistic information into vectors
Annotated corpora modifications
Stand-alone application creation
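To make "transformation of linguistic information into vectors" concrete, here is a rough sketch of how annotations over a text could be turned into numeric feature vectors for a machine learning component. This is not Ellogon's API; the `Annotation` class, the feature set, and the sample text are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """A typed span over the text with free-form attributes (hypothetical model)."""
    type: str
    start: int
    end: int
    attributes: Dict[str, str] = field(default_factory=dict)


def token_features(text: str, ann: Annotation) -> List[float]:
    """Map one token annotation to a small, fixed-length feature vector."""
    token = text[ann.start:ann.end]
    return [
        float(token[:1].isupper()),                 # starts with a capital letter
        float(token.isdigit()),                     # consists only of digits
        float(len(token)),                          # token length
        float(ann.attributes.get("pos") == "NNP"),  # tagged as proper noun
    ]


# Invented example text and part-of-speech tags.
TEXT = "Toshiba Satellite 2003"
TOKENS = [
    Annotation("token", 0, 7, {"pos": "NNP"}),
    Annotation("token", 8, 17, {"pos": "NNP"}),
    Annotation("token", 18, 22, {"pos": "CD"}),
]

vectors = [token_features(TEXT, ann) for ann in TOKENS]
```

Each row of `vectors` can then be paired with a label (for example, a named-entity class) to train or retrain a classifier.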
10
Ellogon in CROSSMARC: Web Pages Collection
Focused Crawler: identifies web sites that are of relevance to the particular domain
Domain Specific Spider: identifies web pages of interest within the retrieved web sites
Spider components: Site Navigation; Page Filtering; Link Scoring
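The slide lists link scoring and page filtering without detail. One simple way to picture them, offered only as a sketch and not as the CROSSMARC spider, is to score each link by the share of domain terms in its anchor text and to visit the best-scoring links first. The term list, the scoring formula, and the function names below are assumptions.

```python
import heapq
from typing import List, Tuple

# Assumed domain vocabulary; a real system would derive this from its domain resources.
DOMAIN_TERMS = {"laptop", "notebook", "price", "offer"}


def score_link(anchor_text: str) -> float:
    """Fraction of anchor-text words that are domain terms."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in DOMAIN_TERMS)
    return hits / len(words)


def crawl_order(links: List[Tuple[str, str]], threshold: float = 0.0) -> List[str]:
    """Return URLs best-first, dropping links that score at or below `threshold`."""
    heap: List[Tuple[float, int, str]] = []
    for index, (url, anchor) in enumerate(links):
        score = score_link(anchor)
        if score > threshold:
            # Negate the score because heapq is a min-heap; index breaks ties stably.
            heapq.heappush(heap, (-score, index, url))
    ordered = []
    while heap:
        ordered.append(heapq.heappop(heap)[2])
    return ordered


# Invented links of the form (URL, anchor text).
LINKS = [
    ("/about", "About us"),
    ("/laptops", "Laptop offers"),
    ("/notebooks", "Notebook price list"),
]

order = crawl_order(LINKS)
```

Here the threshold plays the role of a crude filter and the heap gives the navigation order; a trained classifier could replace `score_link` without changing the rest.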
11
Ellogon in CROSSMARC: Information Extraction
[Diagram labels: CROSSMARC Multilingual IE]
Input: Interesting Web Pages
Information Extraction Remote Invocation (IERI)
Stages: NERC; Demarcation; Fact Extraction
English IE: ENERC, EFE
French IE: FNERC, FFE
Greek IE: HNERC, HFE
Italian IE: INERC, IFE
Other components: NERC based Demarcator; XML Conversion; Data Inserter; Products Database
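The diagram suggests that each page is routed to a language-specific chain of NERC, demarcation, and fact extraction. The sketch below shows only that dispatch pattern. The stage bodies are toy stubs, one stub is shared across all four languages for brevity (the real system has separate modules per language), and every name in the code is hypothetical.

```python
from typing import Callable, Dict, List

Stage = Callable[[dict], dict]


def nerc(page: dict) -> dict:
    """Stub NERC: treat capitalised words as named entities."""
    page["entities"] = [w for w in page["text"].split() if w[:1].isupper()]
    return page


def demarcate(page: dict) -> dict:
    """Stub demarcation: assume one product description per page."""
    page["products"] = [page["entities"]]
    return page


def extract_facts(page: dict) -> dict:
    """Stub fact extraction: join each product's entities into a name."""
    page["facts"] = [{"name": " ".join(p)} for p in page["products"]]
    return page


# One pipeline per language code; a real setup would plug in language-specific stages.
PIPELINES: Dict[str, List[Stage]] = {
    lang: [nerc, demarcate, extract_facts] for lang in ("en", "fr", "el", "it")
}


def run_ie(page: dict) -> List[dict]:
    """Route a page to the pipeline for its language and return the extracted facts."""
    stages = PIPELINES.get(page["lang"])
    if stages is None:
        raise ValueError(f"no IE pipeline for language {page['lang']!r}")
    for stage in stages:
        page = stage(page)
    return page["facts"]


facts = run_ie({"lang": "en", "text": "new Toshiba Satellite laptop"})
```

Keeping the stages behind a common interface is what would let one language's module be retrained or replaced without touching the others.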
12
Ellogon in CROSSMARC
Retrain whole HNERC system
13
Ellogon in CROSSMARC
14
Ellogon in CROSSMARC
15
Ellogon in CROSSMARC
16
Conclusions
The CROSSMARC approach covers the whole process, from the identification of web sites and web pages of interest to the extraction of information from them and its presentation.
Ellogon offered functionalities that facilitated the development, deployment and exploitation of core CROSSMARC components.
17
Future Plans
After the project is completed, integrate all CROSSMARC components and customisation facilities within Ellogon, with the aim of developing an Ellogon-based platform for cross-lingual information management from web pages.
18
Useful Links
CROSSMARC: http://www.iit.demokritos.gr/skel/crossmarc
Ellogon
19
Balkan Conference on Informatics
BCI 2003