Published by Coleen Ray. Modified over 9 years ago.
1
Crawling the Hidden Web
Authors: Sriram Raghavan, Hector Garcia-Molina
VLDB 2001
Speaker: Karthik Shekar
2
Deep Web / Hidden Web
Content hidden behind search forms / registration portals.
Dynamically generated in response to a query.
Size: ~550 times that of the publicly indexable Web (PIW), based on a study in 2000.
Importance: quality content.
3
User form interaction
4
Crawler form interaction
Components of HiWE (Hidden Web Exposer):
▫ Internal form representation
▫ Task-specific database
▫ Matching function
▫ Response analysis
5
HiWE Architecture
LVS table – task-specific database
Form Analyzer, Form Processor, Response Analyzer – take care of the form processing & submission operations
Parser, Crawl Manager, URL List – parts of the basic PIW crawler
6
Internal Form Representation
F = ({Elements}, S, M)
S – submission information, e.g. submission URL
M – meta information, e.g. web site hosting the form, number of inlinks [in HiWE it is Ф]
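A minimal sketch of how the triple F = ({Elements}, S, M) might be held in memory. The class names, field names, and example URL are hypothetical and chosen only for illustration; they are not taken from HiWE.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FormElement:
    # Hypothetical fields: the descriptive label shown next to the element
    # and the element's domain dom(e). An empty domain stands for free text.
    label: str
    domain: List[str] = field(default_factory=list)


@dataclass
class Form:
    elements: List[FormElement]                          # {Elements}
    submission: Dict[str, str]                           # S, e.g. submission URL
    meta: Dict[str, str] = field(default_factory=dict)   # M, optional meta information


# Example form with one select-style element and one free-text element.
form = Form(
    elements=[
        FormElement("Company type", ["Public", "Private"]),
        FormElement("Company name"),
    ],
    submission={"url": "http://example.com/search", "method": "GET"},
)
```

Keeping M as an optional mapping lets a crawler that ignores meta information simply leave it empty.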
7
Label – Value Set (LVS) Table
Each row – (L, V)
V – fuzzy/graded set of values for the label L
$M_V$ – membership function; assigns a weight to each $v_i$ in V
$M_V(v_i)$ – crawler's confidence that assigning $v_i$ to an element with label L is semantically correct
8
Label – Value Set Table
Ways to populate the table:
▫ Explicit initialization: feeding in the data at start-up
▫ Built-in entries: date, time, etc.
▫ Wrapped data sources: retrieve data from other sources by querying
   Type 1 query: return a set of values for a given set of labels
   Type 2 query: for a set of values, return other values belonging to the same set
9
Computing weights on each $v_i$
Built-in & explicit values: weight = 1
For values which the crawler picks up:
▫ Label(e) is extracted and there is no entry in the LVS – a new row (label(e), dom(e)) is added, with $M_{dom(e)}(x) = 1$ if $x \in dom(e)$, and 0 otherwise
▫ Label(e) is extracted and there is an entry (label(e), V) in the LVS – the entry is modified to (label(e), $V \cup dom(e)$), with $M_{V \cup dom(e)}(x) = \max(M_V(x), M_{dom(e)}(x))$
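The two cases above can be sketched with the LVS table held as a dictionary from label to a value-to-weight mapping. The function name and the dictionary layout are assumptions for illustration, and the lookup uses exact label equality, whereas the crawler itself matches labels approximately.

```python
def add_extracted_values(lvs, label, dom):
    """Fold the finite domain dom(e) of a labeled element into the LVS table.

    lvs maps label -> {value: weight}. Hypothetical layout, not HiWE's own.
    """
    # M_dom(e)(x) = 1 for every x in dom(e), 0 otherwise.
    m_dom = {x: 1.0 for x in dom}

    # Case 1: no row for this label yet, so start from an empty fuzzy set.
    # Case 2: a row (label, V) exists and is extended to V U dom(e).
    row = lvs.setdefault(label, {})
    for x, w in m_dom.items():
        # Fuzzy union: take the max of the two membership values.
        row[x] = max(row.get(x, 0.0), w)
    return lvs


lvs = {"state": {"CA": 0.6}}
add_extracted_values(lvs, "state", ["CA", "NY"])   # existing row is extended
add_extracted_values(lvs, "city", ["Paris"])       # new row is created
```

Because values read from a form element enter with weight 1, any value that appears in dom(e) ends up with weight 1 after the union, while values already in V but absent from dom(e) keep their old weight.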
10
Computing weights on each $v_i$
▫ Label(e) could not be retrieved – for each row (L, V), calculate a score given by
   $\frac{\sum_{x \in dom(e)} M_V(x)}{|dom(e)|}$
   Find the row with the max score – $(L_{max}, V_{max})$
   Replace the row with $(L_{max}, V_{max} \cup D')$ [where D' is a new set built from dom(e) such that $M_{D'}(x) = \text{max-score} \cdot M_{dom(e)}(x)$]
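A small sketch of the unlabeled case, again assuming the LVS table is a dictionary from label to a value-to-weight mapping. The function name is hypothetical, the score is read as the sum of memberships divided by |dom(e)|, and both the table and dom(e) are assumed to be non-empty.

```python
def merge_unlabeled_domain(lvs, dom):
    """Attach dom(e) to the best-scoring LVS row when label(e) is unknown."""
    dom = list(dom)

    def score(row):
        # Average membership of the element's values in this row's fuzzy set.
        return sum(row.get(x, 0.0) for x in dom) / len(dom)

    best_label = max(lvs, key=lambda label: score(lvs[label]))
    max_score = score(lvs[best_label])

    row = lvs[best_label]
    for x in dom:
        # M_D'(x) = max-score * M_dom(e)(x), and M_dom(e)(x) = 1 for x in dom(e).
        # The fuzzy union with V_max again takes the max membership.
        row[x] = max(row.get(x, 0.0), max_score * 1.0)
    return best_label, max_score


lvs = {"state": {"CA": 1.0, "NY": 1.0}, "color": {"red": 1.0}}
best, s = merge_unlabeled_domain(lvs, ["CA", "NY", "TX", "WA"])
```

In this example half of the element's values are already known states, so the new values TX and WA join the "state" row with the discounted weight 0.5 rather than full confidence.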
11
Label Matching
Normalization of all labels (case folding, stemming, stop-word removal)
Computing edit distance
Word ordering (e.g. "Company type" & "type of company")
Block edit distance is used
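The matching steps can be illustrated with a simplified stand-in. The sketch below does case folding and stop-word removal, skips stemming, and uses plain Levenshtein distance on alphabetically sorted tokens to tolerate word reordering. This is not the block edit distance named on the slide; the stop-word list and function names are assumptions.

```python
import re

# Illustrative stop-word list; a real crawler would use a fuller one.
STOP_WORDS = {"of", "the", "a", "an", "for", "in"}


def normalize(label):
    """Case-fold, drop stop words, and sort tokens to ignore word order."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(sorted(tokens))


def edit_distance(a, b):
    """Classic Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def label_distance(l1, l2):
    return edit_distance(normalize(l1), normalize(l2))


d_same = label_distance("Company type", "type of company")
d_near = label_distance("Company Name", "company names")
```

Sorting tokens is a crude way to make "Company type" and "type of company" compare as equal; a block edit distance handles such moves directly without discarding the order of the remaining words.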
12
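Ranking value assignments
Aggregation functions
▫ Fuzzy conjunction: $\rho_{fuz} = \min_{i=1..n} M_{V_i}(v_i)$
▫ Average: $\rho_{avg} = \frac{1}{n} \sum_{i=1..n} M_{V_i}(v_i)$
▫ Probabilistic: $\rho_{prob} = 1 - \prod_{i=1..n} (1 - M_{V_i}(v_i))$, where $M_{V_i}(v_i)$ is the likelihood that the assignment is useful
$\rho_{fuz} \le \rho_{avg} \le \rho_{prob}$ (increasingly aggressive)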
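As a worked example, the three aggregation functions can be written directly from their definitions, taking "average" to mean the arithmetic mean of the weights. The function names are illustrative, and each input list holds the weights $M_{V_i}(v_i)$ of one candidate value assignment.

```python
def rho_fuz(weights):
    """Fuzzy conjunction: the assignment is only as good as its weakest value."""
    return min(weights)


def rho_avg(weights):
    """Arithmetic mean of the individual weights."""
    return sum(weights) / len(weights)


def rho_prob(weights):
    """Probability that at least one value is useful, treating weights
    as independent likelihoods."""
    miss = 1.0
    for w in weights:
        miss *= (1.0 - w)
    return 1.0 - miss


weights = [0.5, 1.0]
ranks = (rho_fuz(weights), rho_avg(weights), rho_prob(weights))
```

For the weights 0.5 and 1.0 the three ranks come out as 0.5, 0.75, and 1.0, which shows why the fuzzy conjunction is the most conservative choice and the probabilistic function the most aggressive.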
13
LITE
Layout-based Information Extraction
Based on the physical layout of the page
Reason: semantic information is not always reflected in the HTML markup
14
LITE & form analysis
▫ Pruning
▫ Identify text closest to the form element – candidates
▫ Rank the candidates
▫ Choose the highest-ranked candidate as the label
▫ Perform post-processing
15
+/-
Simple simulation of the user interaction with the form
Learning-based operational model
Task/application-specific crawler
Efficient label extraction method
Re-use of existing modules
Coverage is a challenge
Execution time would depend on the look up…