Corpus Formation [CFT]
Web Pages Annotation [Web Annotator]
Web Sites Detection [NEACrawler]
Web Pages Collection [NEAC]
IE Remote Invocation [IERI]
Kostas Stamatakis, Vangelis Karkaletsis, Georgios Paliouras
Heraklion, June 24, 2003
Corpus Formation (CFT: Corpus Formation Tool)
  Customization for the 2nd domain completed.
  It was used for the formation of the 2nd domain corpus.
Web Pages Annotation (Web Annotator)
  Customization for the 2nd domain completed.
  It was used for corpus annotation according to the guidelines.
CFT: Corpus Formation Tool
  Input: Web sites, locally saved
  Output: corpus of positive pages, negative (but similar) pages, and other pages
  Use: training of the Page Filtering and Link Scoring modules
Web Annotator
  Input: XHTML page
  Output: XHTML page + surrogate text file (annotations)
  Use: training of the IE systems
Big picture: NEACrawler
  WEB
  Focused Crawling: domain-specific Web sites
  Domain-specific Spidering (Web Pages Collection): XHTML pages
  Multilingual NERC and Name Matching: XHTML pages with NE annotations
  Multilingual and Multimedia Fact Extraction: XML pages
  Insertion into the database: Products Database
  User Interface: End user
  Domain Ontology
  NERC-FE: the NERC and Fact Extraction stages
NEACrawler: Web Site Detection
  Input: Web directories, keywords
  Output: URL lists of FIT websites
  Step 1: The crawler runs.
  Step 2: The URL list is split (based on language in the current version; to be deactivated in the final version).
  Step 3: Light spidering validates each website, deciding whether it is FIT or not.
Big picture: NEAC (pipeline diagram repeated, with NEAC at the Web Pages Collection stage)
NEAC: Web Pages Collection
  Input: URL list
  Output: XHTML pages
NEAC: Web Pages Collection (page processing)
  Input: URL list
  Output: XHTML pages
  Queue: the URL list; one URL is processed at a time
  Connection: URLs that fail go to Error URLs
  Content Processing: Meta, TIDY
  Page Filtering Module: OK, save page; NOT OK, ignore page
  Link Processing and Link Scoring Module: new interesting links are added to the queue
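The page-processing flow above can be sketched as a small crawl loop. This is only an illustration under stated assumptions, not the NEAC implementation: the function names `fetch`, `is_relevant` and `score_links` are hypothetical stand-ins for the Connection step, the Page Filtering Module and the Link Scoring Module, and the use of a priority queue ordered by link score is an assumption (the slide only shows a queue).

```python
import heapq


def collect_pages(seed_urls, fetch, is_relevant, score_links, max_pages=100):
    """Focused-collection loop: fetch, filter, save, then queue scored links.

    fetch(url) returns a page or raises OSError (hypothetical Connection step).
    is_relevant(page) returns True/False (hypothetical page filter).
    score_links(page) yields (url, score) pairs (hypothetical link scorer).
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    queue = [(-1.0, url) for url in seed_urls]
    heapq.heapify(queue)
    seen = set(seed_urls)
    saved, error_urls = {}, []

    while queue and len(saved) < max_pages:
        _, url = heapq.heappop(queue)
        try:
            page = fetch(url)
        except OSError:
            error_urls.append(url)  # connection NOT OK
            continue
        if is_relevant(page):
            saved[url] = page  # page filter OK: save page
        # Assumption: links are examined on every fetched page,
        # and only links with a positive score count as "interesting".
        for link, score in score_links(page):
            if score > 0 and link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-score, link))
    return saved, error_urls
```

Passing the three steps in as functions keeps the loop independent of any particular filter or scorer, so rule-based and machine-learning-based filters can be swapped without touching it.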
Navigation Schema
  URL
    NO FRAMES: OK
    FRAMES: split frames
  Navigation elements: LINKS, FORMS, IMAGE MAP, JAVA SCRIPT
  Element types: TEXT LINK, IMAGE LINK, SELECT LIST, SEARCH BOX, TEXT CONSTANTS, OTHER
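To make the navigation categories concrete, the following sketch counts some of them in a page using Python's standard `html.parser`. The mapping from tags to categories (for example, treating a text `input` as a search box) is an assumption made for illustration; the slides do not say how NEAC recognises each case.

```python
from collections import Counter
from html.parser import HTMLParser


class NavigationCounter(HTMLParser):
    """Counts navigation elements by category (illustrative mapping only)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._anchor_kind = None  # kind of the currently open <a>, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "frame":
            self.counts["frame"] += 1
        elif tag == "a" and "href" in attrs:
            href = (attrs["href"] or "").strip().lower()
            if href.startswith("javascript:"):
                self.counts["javascript"] += 1
            else:
                self._anchor_kind = "text link"
        elif tag == "img" and self._anchor_kind is not None:
            # An image inside an open anchor makes it an image link.
            self._anchor_kind = "image link"
        elif tag == "area":
            self.counts["image map"] += 1
        elif tag == "select":
            self.counts["select list"] += 1
        elif tag == "input" and (attrs.get("type") or "text").lower() == "text":
            self.counts["search box"] += 1

    def handle_endtag(self, tag):
        # The anchor is counted once it closes, when its kind is known.
        if tag == "a" and self._anchor_kind is not None:
            self.counts[self._anchor_kind] += 1
            self._anchor_kind = None


def count_navigation(xhtml):
    parser = NavigationCounter()
    parser.feed(xhtml)
    parser.close()
    return parser.counts
```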
Big picture: IERI (pipeline diagram repeated, with IERI, IE System Remote Invocation, added after Web Pages Collection)
IERI: IE Remote Invocation
  Input: XHTML pages
  Output: XML files
Agent-based Architecture
What is new
  Spider, CFT, Web Annotator: customisation to the 2nd domain
    Rule-based page filtering
    Machine-learning-based page filtering
    Rule-based link scoring
    Customised CFT
    Customised Web Annotator
  Evaluation of ML-based page filtering for all 4 languages
  Crawler and Spider run in both GUI and command-line mode, as well as web-based applications
  XML logs to activate the corresponding agents
Rule-based page filtering
  Customisation to a new domain involves:
    Creation of a primary group of terms (using regular expressions), e.g. Skills, Salary, Experience
    Creation of a secondary group of terms (using regular expressions), e.g. S/W developer, Accountant, Master's degree
  A page gets a positive score if terms from both groups are found within the page.
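The two-group rule can be sketched in a few lines of Python. The example terms are the ones quoted on the slide; the exact patterns, the case-insensitive matching and the function name `is_positive` are assumptions made for illustration, not the project's actual rules.

```python
import re

# Example terms from the slide; the patterns themselves are illustrative.
PRIMARY_TERMS = [r"\bskills?\b", r"\bsalary\b", r"\bexperience\b"]
SECONDARY_TERMS = [r"\bs/w developer\b", r"\baccountant\b", r"\bmaster['’]s degree\b"]

# Each group is compiled into one alternation, matched case-insensitively.
PRIMARY = re.compile("|".join(PRIMARY_TERMS), re.IGNORECASE)
SECONDARY = re.compile("|".join(SECONDARY_TERMS), re.IGNORECASE)


def is_positive(page_text):
    """A page is positive only if both term groups match somewhere in it."""
    return bool(PRIMARY.search(page_text)) and bool(SECONDARY.search(page_text))
```

Requiring a hit from both groups keeps pages that mention only generic terms (such as "experience") from being accepted on their own.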
Page Filtering Evaluation Results: 1st Domain (H, ML)
  Precision: 0,95 0,87 0,73 0,98 0,97
  Recall: 1,00 0,90 0,99 0,92 0,96 0,20 0,91
  F-measure: 0,93 0,81 0,33 0,94
Page Filtering Evaluation Results: 2nd Domain (ML)
  Precision: 0,94 0,92 0,88 0,80
  Recall: 0,82 0,74 0,79 0,68
  F-measure: 0,83
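For reference, the F-measure combines precision and recall as sketched below. The slides do not state which weighting was used, so the balanced form (beta = 1) is an assumption here.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 is balanced)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# A precision of 0.88 with a recall of 0.79 gives approximately 0.83.
print(round(f_measure(0.88, 0.79), 2))
```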
Rule-based link scoring
  Customisation to a new domain involves:
    Specification of five levels of term groups (a different score is allocated to each level)
  For each link, the following are examined:
    The text of the link
    The text in the context of the link
  The score of a link is calculated from the terms found, according to the level they belong to and the place where they are found (inside the link or in its context).
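A minimal sketch of such a scorer follows. Only the overall scheme (five term levels, link text versus context) comes from the slide; the terms, the per-level scores and the two location weights are hypothetical values chosen for illustration.

```python
import re

# Hypothetical terms for five levels (level 1 is the most indicative).
LEVEL_TERMS = {
    1: [r"\bjobs?\b", r"\bvacanc(?:y|ies)\b"],
    2: [r"\bcareers?\b"],
    3: [r"\brecruitment\b"],
    4: [r"\bemployment\b"],
    5: [r"\bopportunit(?:y|ies)\b"],
}
# Hypothetical score per level and weight per location.
LEVEL_SCORES = {1: 5.0, 2: 4.0, 3: 3.0, 4: 2.0, 5: 1.0}
LINK_WEIGHT = 1.0      # term found inside the link text
CONTEXT_WEIGHT = 0.5   # term found only in the surrounding context


def score_link(link_text, context_text):
    """Sum level scores of matched terms, weighted by where each term occurs."""
    score = 0.0
    for level, patterns in LEVEL_TERMS.items():
        for pattern in patterns:
            term = re.compile(pattern, re.IGNORECASE)
            if term.search(link_text):
                score += LEVEL_SCORES[level] * LINK_WEIGHT
            elif term.search(context_text):
                score += LEVEL_SCORES[level] * CONTEXT_WEIGHT
    return score
```

In this sketch a term counts once per link, with the link text taking precedence over the context, so a link whose anchor text says "Current job vacancies" outranks one that merely sits near the word "careers".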
Pending issues
  Focused Web Crawler (NEACrawler: EDIN Crawler + NEAC-light)
    Evaluation in the 2nd domain.
    Removal of the Language Identification Module.
  NEAC
    Incorporate the RTV WebXimmler XHTML conversion module.
    Incorporate the EDIN Language Identification Module (LIM). LIM examines each visited page at runtime and adds a language meta-tag when saving the page. Based on this tag, the IERI application invokes the proper monolingual IE system.
    Cover more navigation cases (forms, JavaScript, DHTML, Flash).
    Evaluation of link scoring.
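The language-based dispatch described for LIM and IERI could look roughly like the sketch below. The meta-tag name (`language`), the language codes and the IE system identifiers are all assumptions for illustration; the slides do not specify the tag format or how the IE systems are addressed.

```python
from html.parser import HTMLParser

# Hypothetical mapping from language code to a monolingual IE system.
IE_SYSTEMS = {
    "en": "ie-system-en",
    "fr": "ie-system-fr",
}


class LanguageMetaReader(HTMLParser):
    """Reads the first <meta name="language" content="..."> tag of a page."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name == "language" and self.language is None:
            self.language = (attrs.get("content") or "").strip().lower() or None


def select_ie_system(xhtml):
    """Return the IE system for the page's language tag, or None if unknown."""
    reader = LanguageMetaReader()
    reader.feed(xhtml)
    reader.close()
    return IE_SYSTEMS.get(reader.language)
```

Returning `None` for a missing or unknown tag leaves it to the caller to decide whether to skip the page or fall back to a default system.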