
Crawling the Hidden Web Authors: Sriram Raghavan, Hector Garcia-Molina VLDB 2001 Speaker: Karthik Shekar 1.




1 Crawling the Hidden Web
Authors: Sriram Raghavan, Hector Garcia-Molina (VLDB 2001)
Speaker: Karthik Shekar

2 Deep Web / Hidden Web
Content hidden behind search forms or registration portals.
Dynamically generated in response to a query.
Size: roughly 550 times that of the publicly indexable Web (PIW), based on a study from 2000.
Importance: high-quality content.

3 User form interaction

4 Crawler form interaction
Components of HiWE (Hidden Web Exposer):
▫Internal form representation
▫Task-specific database
▫Matching function
▫Response analysis

5 HiWE Architecture
LVS (Label Value Set) table: the task-specific database.
Form Analyzer, Form Processor, Response Analyzer: handle form processing and submission.
Parser, Crawl Manager, URL List: parts of the basic PIW crawler.

6 Internal Form Representation
F = ({Elements}, S, M)
S: submission information, e.g. the submission URL.
M: meta-information, e.g. the web site hosting the form, the number of inlinks. [In HiWE, M = ∅.]
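The triple above maps naturally onto a small record type. The following is a minimal sketch, not HiWE's actual data structure: the class names, the field names, and the example URL are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FormElement:
    """One form element; names here are illustrative, not from HiWE."""
    label: str                                        # text describing the element
    domain: List[str] = field(default_factory=list)   # dom(e); empty for free-text inputs


@dataclass
class Form:
    """F = ({Elements}, S, M)."""
    elements: List[FormElement]
    submission: Dict[str, str]                        # S, e.g. the submission URL
    meta: Dict[str, str] = field(default_factory=dict)  # M, left empty by default


form = Form(
    elements=[
        FormElement("Company type", ["Public", "Private"]),
        FormElement("Company name"),                  # free-text input, no fixed domain
    ],
    submission={"url": "http://example.com/search", "method": "GET"},  # hypothetical
)
```

Leaving `meta` empty by default mirrors the slide's note that M is empty in HiWE.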

7 Label Value Set (LVS) Table
Each row is a pair (L, V).
V: a fuzzy (graded) set of values for the label L.
$M_V$: the membership function, which assigns a weight to each $v_i$ in V.
$M_V(v_i)$: the crawler's confidence that assigning $v_i$ to an element with label L is semantically correct.

8 Label Value Set (LVS) Table
Ways to populate the table:
▫Explicit initialization
  ▪Feeding in the data at start-up
▫Built-in entries
  ▪Date, time, etc.
▫Wrapped data sources
  ▪Retrieve data from other sources by querying
  ▪Type 1 query: return a set of values for a given set of labels
  ▪Type 2 query: for a set of values, return other values belonging to the same set

9 Computing weights for each $V_i$
Built-in and explicitly initialized values get weight 1.
For values that the crawler picks up from a form element e:
▫Label(e) is extracted and the LVS table has no entry for it: a new row (label(e), dom(e)) is added, with $M_{dom(e)}(x) = 1$ if $x \in dom(e)$ and 0 otherwise.
▫Label(e) is extracted and the LVS table already has an entry (label(e), V): the entry is changed to (label(e), $V \cup dom(e)$), with $M_{V \cup dom(e)}(x) = \max(M_V(x), M_{dom(e)}(x))$.
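The two labeled cases can be sketched in a few lines. This is an illustrative sketch only: it assumes the LVS table is a dictionary from label to a value-to-weight dictionary, and the function name `add_labeled_domain` is hypothetical.

```python
from typing import Dict, Iterable

FuzzySet = Dict[str, float]     # value -> membership weight M_V(value)
LVSTable = Dict[str, FuzzySet]  # label -> fuzzy value set (assumed layout)


def add_labeled_domain(lvs: LVSTable, label: str, domain: Iterable[str]) -> None:
    """Merge dom(e) into the row for label(e), creating the row if needed."""
    row = lvs.setdefault(label, {})          # no entry yet: start a new row
    for x in domain:
        # M_dom(e)(x) = 1 for x in dom(e); the union keeps the max weight.
        row[x] = max(row.get(x, 0.0), 1.0)


lvs: LVSTable = {"state": {"California": 1.0, "Texas": 0.6}}
add_labeled_domain(lvs, "state", ["Texas", "Ohio"])               # existing row
add_labeled_domain(lvs, "company type", ["Public", "Private"])    # new row
```

After these calls the weight of "Texas" rises from 0.6 to 1, since the crawler has now seen it under a matching label.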

10 Computing weights for each $V_i$
▫Label(e) could not be extracted: for each row (L, V), compute the score
$$\frac{\sum_{x \in dom(e)} M_V(x)}{|dom(e)|}$$
Find the row with the maximum score, $(L_{max}, V_{max})$.
Replace that row with $(L_{max}, V_{max} \cup D')$, where D' is a new fuzzy set built from dom(e) such that $M_{D'}(x) = \text{max-score} \cdot M_{dom(e)}(x)$.
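The unlabeled case can be sketched as follows. The dictionary layout and the function name are hypothetical, and combining $V_{max}$ with D' by taking the larger weight is an assumption carried over from the union rule used in the labeled case.

```python
from typing import Dict, Iterable, Optional, Tuple

FuzzySet = Dict[str, float]     # value -> membership weight
LVSTable = Dict[str, FuzzySet]  # label -> fuzzy value set (assumed layout)


def add_unlabeled_domain(lvs: LVSTable, domain: Iterable[str]) -> Optional[Tuple[str, float]]:
    """Attach dom(e) to the best-matching row when label(e) is unknown."""
    dom = list(domain)
    if not dom or not lvs:
        return None

    def score(values: FuzzySet) -> float:
        # Sum of M_V(x) over x in dom(e), divided by |dom(e)|.
        return sum(values.get(x, 0.0) for x in dom) / len(dom)

    best_label = max(lvs, key=lambda label: score(lvs[label]))
    best_score = score(lvs[best_label])
    row = lvs[best_label]
    for x in dom:
        # M_D'(x) = max-score * M_dom(e)(x), and M_dom(e)(x) = 1 here.
        # Assumption: the union keeps the larger of the two weights.
        row[x] = max(row.get(x, 0.0), best_score)
    return best_label, best_score


lvs: LVSTable = {
    "state": {"California": 1.0, "Texas": 1.0},
    "color": {"Red": 1.0},
}
result = add_unlabeled_domain(lvs, ["California", "Texas", "Ohio", "Utah"])
```

Here half of the domain already appears in the "state" row, so the score is 0.5 and the two unseen values enter that row with weight 0.5.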

11 Label Matching
Normalize all labels (case folding, stemming, stop-word removal).
Compute the edit distance between the normalized labels.
Word ordering must be handled (e.g. "Company type" and "type of company"), so a block edit distance is used.
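A rough sketch of the matching pipeline is shown below. It is not the block edit distance from the talk: as a simpler stand-in, it sorts the words of each label and then applies plain Levenshtein distance. The stop-word list and the plural-stripping rule are illustrative assumptions, not a real stemmer.

```python
import re

STOP_WORDS = {"of", "the", "a", "an", "and", "for", "in"}  # illustrative list


def normalize(label: str) -> str:
    """Case-fold, drop stop words, crudely stem, and sort the words."""
    words = [w for w in re.findall(r"[a-z0-9]+", label.lower()) if w not in STOP_WORDS]
    # Crude plural stripping stands in for a real stemming algorithm.
    words = [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]
    # Sorting removes word-order differences (stand-in for block edit distance).
    return " ".join(sorted(words))


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


d = edit_distance(normalize("Company type"), normalize("Type of company"))
```

With this normalization the two example labels from the slide become identical, so `d` is 0.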

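12 Ranking value assignments
Aggregation functions:
▫Fuzzy conjunction: $\rho_{fuz} = \min_{i=1..n} M_{V_i}(v_i)$
▫Average: $\rho_{avg}$, the mean of the $M_{V_i}(v_i)$
▫Probabilistic: $\rho_{prob} = 1 - \prod_{i=1..n} (1 - M_{V_i}(v_i))$, where $M_{V_i}(v_i)$ is treated as the likelihood that the assignment is useful
$\rho_{fuz} \le \rho_{avg} \le \rho_{prob}$: the functions become more aggressive from left to right.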
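The three aggregation functions are easy to compare on a concrete assignment. The sketch below assumes the per-element weights $M_{V_i}(v_i)$ are already available as a list; the function names and the sample weights are illustrative.

```python
from typing import Sequence


def rho_fuz(weights: Sequence[float]) -> float:
    """Fuzzy conjunction: the weakest element decides the rank."""
    return min(weights)


def rho_avg(weights: Sequence[float]) -> float:
    """Average of the per-element weights."""
    return sum(weights) / len(weights)


def rho_prob(weights: Sequence[float]) -> float:
    """Probabilistic: 1 minus the product of (1 - weight)."""
    product = 1.0
    for w in weights:
        product *= 1.0 - w
    return 1.0 - product


weights = [0.9, 0.5, 0.8]   # made-up weights for a three-element form
ranks = (rho_fuz(weights), rho_avg(weights), rho_prob(weights))
```

For these weights the ranks are about 0.5, 0.73, and 0.99, which illustrates why the fuzzy conjunction is the most conservative choice and the probabilistic function the most aggressive.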

13 LITE: Layout-based Information Extraction
Based on the physical layout of the page.
Reason: semantic information is not always reflected in the HTML markup.

14 LITE & form analysis
▫Pruning
▫Identify the pieces of text closest to the form element (the candidates)
▫Rank the candidates
▫Choose the highest-ranked candidate as the label
▫Perform post-processing
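The candidate selection steps above can be sketched with a purely geometric ranking. This is a simplified illustration, not the LITE heuristic itself: it assumes each text piece already has rendered page coordinates, ranks only by Euclidean distance, and treats stripping a trailing colon as the post-processing step. All names and coordinates are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextPiece:
    text: str
    x: float   # assumed center coordinates of the rendered text
    y: float


def pick_label(element_xy: Tuple[float, float], pieces: List[TextPiece],
               max_candidates: int = 3) -> Optional[str]:
    """Keep the closest text pieces as candidates and return the best one."""
    ex, ey = element_xy
    # Rank by distance to the form element; closest first.
    ranked = sorted(pieces, key=lambda p: math.hypot(p.x - ex, p.y - ey))
    candidates = ranked[:max_candidates]
    if not candidates:
        return None
    # Post-processing (assumed): trim whitespace and a trailing colon.
    return candidates[0].text.strip().rstrip(":")


pieces = [
    TextPiece("Search our database", 150, 20),
    TextPiece("Company name:", 120, 100),
    TextPiece("Copyright 2001", 150, 400),
]
label = pick_label((200, 100), pieces)
```

A fuller ranking would likely weigh more than distance, but the distance-only version shows why layout can recover a label that the markup does not tie to the element.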

15 +/-
Simple simulation of the user interaction with the form
Learning-based operational model
Task/application specific crawler
Efficient label extraction method
Re-use of existing modules
Coverage is a challenge
Execution time would depend on the look up…

