Published by Flora Clarke. Modified over 6 years ago.
1
Corpus Formation [CFT]
Web Pages Annotation [Web Annotator]
Web Sites Detection [NEACrawler]
Web Pages Collection [NEAC]
IE Remote Invocation [IERI]
Kostas Stamatakis, Vangelis Karkaletsis, Georgios Paliouras
Heraklion, June 24, 2003
2
Corpus Formation (CFT: Corpus Formation Tool)
Customization for the 2nd domain completed. It was used for the formation of the 2nd domain corpus.
Web Pages Annotation (Web Annotator)
Customization for the 2nd domain completed. It was used for corpus annotation according to the guidelines.
3
CFT: Corpus Formation Tool
Input: Web sites saved locally (positive pages + other pages)
Tool: CFT
Output: Corpus (positive pages + negative but similar pages)
Use: training of the Page Filtering & Link Scoring modules
4
Web Annotator
Input: XHTML page
Output: XHTML + TXT (surrogate text file with the annotations)
Use: IE Systems training
5
Big picture: NEACrawler
WEB -> Focused Crawling (NEACrawler) -> Domain-specific Web sites
-> Domain-specific Spidering (Web Pages Collection) -> XHTML pages
-> Multilingual NERC and Name Matching -> XHTML pages with NE annotations
-> Multilingual and Multimedia Fact Extraction -> XML pages (NERC-FE)
-> Insertion into the database -> Products Database -> User Interface -> End user
Shared resource: Domain Ontology
6
NEACrawler: Web Site Detection
Input: Web directories, keywords
Tool: NEACrawler
Output: URL lists of FIT Web sites
Step 1: The crawler runs.
Step 2: The URL list is split (by language in the current version; this is to be deactivated in the final version).
Step 3: Light spidering validates each Web site, i.e. whether it is FIT or not.
7
Big picture: NEAC
WEB -> Focused Crawling -> Domain-specific Web sites
-> Domain-specific Spidering (Web Pages Collection, NEAC) -> XHTML pages
-> Multilingual NERC and Name Matching -> XHTML pages with NE annotations
-> Multilingual and Multimedia Fact Extraction -> XML pages (NERC-FE)
-> Insertion into the database -> Products Database -> User Interface -> End user
Shared resource: Domain Ontology
8
NEAC: Web Pages Collection
Input: URL list
Tool: NEAC
Output: XHTML pages
9
NEAC: Web Pages Collection
Input: URL list
Output: XHTML pages
Queue: the URL list is queued and processed one URL at a time.
Connection: OK -> Content Processing; NOT OK -> Error URLs.
Content Processing: Meta, TIDY.
Page Processing (Page Filtering Module): Save page or Ignore page.
Link Processing (Link Scoring Module): New interesting links.
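The flow on this slide can be sketched as a simple queue-driven loop. This is only an illustration under stated assumptions: the function name `collect_pages`, its parameters, the FIFO queue, the score threshold, and the choice to follow links from ignored pages as well are all hypothetical and not taken from NEAC itself. The fetching, filtering, and link scoring steps are passed in as functions so the loop stays independent of any concrete module.

```python
from collections import deque


def collect_pages(seed_urls, fetch, is_positive, score_links,
                  threshold=0.0, max_pages=100):
    """Process queued URLs one at a time; return (saved_pages, error_urls).

    fetch(url)        -> page text, or raises OSError if the connection fails
    is_positive(text) -> bool, the page filtering decision
    score_links(text) -> iterable of (url, score) for links found in the page
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    saved, errors = {}, []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        try:
            text = fetch(url)            # Connection step
        except OSError:
            errors.append(url)           # NOT OK -> error URLs
            continue
        if is_positive(text):            # Page filtering: save or ignore
            saved[url] = text
        for link, score in score_links(text):   # Link scoring
            if score > threshold and link not in seen:
                seen.add(link)           # new interesting link
                queue.append(link)
    return saved, errors
```

In a test setting, `fetch` can be a function that reads from an in-memory dictionary, which makes the loop easy to exercise without network access.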
10
Navigation Schema
URL: NO FRAMES, or FRAMES (split frames) -> OK
Navigation elements: LINKS, FORMS
Element types: IMAGE MAP, JAVA SCRIPT, TEXT LINK, IMAGE LINK, SELECT LIST, SEARCH BOX, TEXT CONSTANTS, OTHER
11
Big picture: IERI
WEB -> Focused Crawling -> Domain-specific Web sites
-> Domain-specific Spidering (Web Pages Collection) -> XHTML pages
-> IE System Remote Invocation (IERI)
-> Multilingual NERC and Name Matching -> XHTML pages with NE annotations
-> Multilingual and Multimedia Fact Extraction -> XML pages (NERC-FE)
-> Insertion into the database -> Products Database -> User Interface -> End user
Shared resource: Domain Ontology
12
IERI: IE Remote Invocation
Input: XHTML
Tool: IERI
Output: XML files
13
Agent-based Architecture
14
What is new
Spider, CFT, Web Annotator:
  Rule-based page filtering
  Machine learning based page filtering
  Rule-based link scoring
Customisation to the 2nd domain:
  Customised CFT
  Customised Web Annotator
  Evaluation of ML-based page filtering for all 4 languages
Crawler and Spider run in both GUI and command-line mode, as well as web-based applications
XML logs to activate the corresponding agents
15
Rule-based page filtering
Customisation to a new domain involves:
  Creation of a primary group of terms (using regular expressions), e.g. Skills, Salary, Experience
  Creation of a secondary group of terms (using regular expressions), e.g. S/W developer, Accountant, Master's degree
A page gets a positive score if terms from both groups are found within the page.
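The two-group rule can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the patterns are built from the example terms on the slide, and the function names and the case-insensitive matching are assumptions.

```python
import re

# Hypothetical term groups, derived from the examples on the slide.
PRIMARY_TERMS = [r"\bskills?\b", r"\bsalary\b", r"\bexperience\b"]
SECONDARY_TERMS = [
    r"\b(s/w|software)\s+developer\b",
    r"\baccountant\b",
    r"\bmaster['’]?s\s+degree\b",
]


def matches_any(patterns, text):
    """Return True if at least one regular expression matches the text."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)


def is_positive_page(text):
    """A page is positive only if terms from both groups occur in it."""
    return matches_any(PRIMARY_TERMS, text) and matches_any(SECONDARY_TERMS, text)
```

Requiring a hit in both groups means that a page mentioning only "salary" or only "accountant" is not accepted on its own.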
16
Page Filtering Evaluation Results: 1st Domain
H ML
Precision (%): 0.95 0.87 0.73 0.98 0.97
Recall (%): 1.00 0.90 0.99 0.92 0.96 0.20 0.91
F-measure (%): 0.93 0.81 0.33 0.94
17
Page Filtering Evaluation Results: 2nd Domain
ML
Precision (%): 0.94 0.92 0.88 0.80
Recall (%): 0.82 0.74 0.79 0.68
F-measure (%): 0.83
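The slides report precision, recall and F-measure without defining them. The sketch below assumes the standard definitions and the balanced F-measure (harmonic mean of precision and recall); the function name and the counts used to call it are hypothetical and are not taken from the evaluation.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-measure) from page classification counts."""
    predicted = true_pos + false_pos   # pages the filter accepted
    relevant = true_pos + false_neg    # pages that are actually positive
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / relevant if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, a filter that accepts two pages, one of them correctly, and misses no positive page has a precision of 0.5 and a recall of 1.0.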
18
Rule-based link scoring
Customisation to a new domain involves the specification of five levels of term groups (a different score is allocated to each level).
In each link the following are examined:
  Text of the link
  Text in the context of the link
The score of a link is calculated from the terms found, according to the level they belong to and the place where they are found (inside the link or in its context).
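One possible reading of this scoring scheme is sketched below. Everything specific in it is an assumption made for illustration: the terms in each level, the level scores, the two location weights, and the additive combination are invented, since the slide gives only the general principle.

```python
import re

# Hypothetical configuration: five term levels, each with its own score.
LEVELS = [
    (5.0, [r"\bjobs?\b", r"\bvacanc(y|ies)\b"]),
    (4.0, [r"\bcareers?\b"]),
    (3.0, [r"\brecruitment\b"]),
    (2.0, [r"\bemployment\b"]),
    (1.0, [r"\bopportunit(y|ies)\b"]),
]
# Assumed weights for the place where a term is found.
LINK_TEXT_WEIGHT = 1.0
CONTEXT_WEIGHT = 0.5


def level_score(text):
    """Sum the scores of all levels that have at least one term in the text."""
    total = 0.0
    for score, patterns in LEVELS:
        if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
            total += score
    return total


def score_link(link_text, context_text):
    """Combine term level and location (link text vs. surrounding context)."""
    return (LINK_TEXT_WEIGHT * level_score(link_text)
            + CONTEXT_WEIGHT * level_score(context_text))
```

With these assumed weights, a term inside the link text counts twice as much as the same term in the surrounding context.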
19
Pending issues
Focused Web Crawler (NEACrawler: EDIN Crawler + NEAC-light)
  Evaluation in the 2nd domain
  Removal of the Language Identification Module
NEAC
  Incorporate the RTV WebXimmler XHTML conversion module
  Incorporate the EDIN Language Identification Module (LIM). LIM examines each visited page at runtime and adds a language meta-tag when saving the page. According to this information, the IERI application invokes the proper monolingual IE system.
  Cover more navigation cases (forms, JavaScript, DHTML, Flash)
  Evaluation of link scoring
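The language-based dispatch described for LIM and IERI could look roughly like the following sketch. It assumes that the language is stored as `<meta name="language" content="...">`; the slide does not give the actual tag format, and the function names and the dictionary of IE systems are hypothetical.

```python
from html.parser import HTMLParser


class LanguageMetaParser(HTMLParser):
    """Collect the content of the first <meta name="language"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.language is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "language":
            value = (attributes.get("content") or "").strip().lower()
            self.language = value or None


def page_language(xhtml):
    """Return the language code stored in the page, or None if it is absent."""
    parser = LanguageMetaParser()
    parser.feed(xhtml)
    return parser.language


def select_ie_system(xhtml, ie_systems):
    """Return the IE system registered for the page's language, or None."""
    return ie_systems.get(page_language(xhtml))
```

A caller would map language codes to the monolingual IE systems it knows about and skip, or log, pages for which no system is registered.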