Incorporating Site-level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy KDD 2009 Jiang-Ming Yang, Rui Cai, Chunsong Wang, Hua Huang,


1 Incorporating Site-level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy
KDD 2009
Jiang-Ming Yang, Rui Cai, Chunsong Wang, Hua Huang, Lei Zhang, Wei-Ying Ma
Date: 2009/9/21
Speaker: Yi-Lin, Hsu
Advisor: Dr. Koh, Jia-ling

2 Outline
– Introduction
  – Crawling in Web forums
– Main task
  – Sitemap Construction
  – List Reconstruction & Classification
  – Timestamp Extraction
  – Prediction model
  – Bandwidth control
– Experiment
– Conclusion & Future Work

3 Crawling in Web Forums
Fetching forum data from various forum sites is the fundamental step of most related web applications.
To satisfy the application requirements, an ideal forum crawler should make a tradeoff between its completeness and timeliness.

4 Crawling in Web Forums
Completeness: identify updated discussion threads and download the newly published content (e.g., for online Q&A services).

5 Crawling in Web Forums
Timeliness: efficiently discover and download newly published discussion threads.

6 Characteristics of Forums
[Figure: an index page and a post page]

7 Incremental Crawling
General webpages: treat each page independently, i.e., page-wise.
Forum pages: take pagination into account, i.e., list-wise.

8 System Overview
The proposed solution consists of two parts: offline mining and online crawling.
Offline mining, purposes:
– Distinguish index pages from post pages
– Concatenate pages into lists by following pagination

9 System Overview
The proposed solution consists of two parts: offline mining and online crawling.
Online crawling:
– Identify index pages and post pages
– Predict the update frequency
– Balance bandwidth by adjusting the numbers of the various lists in the crawling queue

10 Main Task
1. Sitemap Construction
2. List Construction & Classification
3. Timestamp Extraction
4. Prediction Models
5. Bandwidth Control

11 Step 1: Sitemap Construction

12 Forum Sitemap
Sitemap: a directed graph consisting of a set of vertices and corresponding arcs.
– Each vertex represents a kind of page in that forum.
– Each arc denotes the link relation between two vertices.
http://forums.asp.net
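
As a rough illustration of the definition above, a sitemap can be held as an adjacency map from each vertex to the vertices it links to. This is only a sketch: the class name `Sitemap` and the vertex names (`board_list`, `thread_list`, `thread`) are invented for the example and are not taken from the slides or from forums.asp.net.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: each vertex is a kind of page, each arc a link relation."""

    def __init__(self):
        # vertex name -> set of vertex names it links to
        self.arcs = defaultdict(set)

    def add_arc(self, src, dst):
        self.arcs[src].add(dst)
        self.arcs[dst]  # touching the key registers dst as a vertex too

    def vertices(self):
        return set(self.arcs)

    def successors(self, vertex):
        return set(self.arcs.get(vertex, ()))


# Hypothetical vertex names, for illustration only.
sitemap = Sitemap()
sitemap.add_arc("board_list", "thread_list")
sitemap.add_arc("thread_list", "thread_list")  # pagination within the same kind of page
sitemap.add_arc("thread_list", "thread")
```

A self-loop such as `thread_list -> thread_list` is one simple way to record that pages of the same kind link to each other through pagination.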

13 Page Clustering
[Figure: page clusters, each labeled V]

15 Step 2: List Construction & Classification

16 Identify Index & Post Nodes
An SVM-based classifier
Site-independent features:
– Node size
– Link structure

17 List Reconstruction
Given a new page:
1. Classify it into a node
2. Detect pagination links
3. Find the link order
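
The slides do not say how the link order is found. The sketch below covers only the last step, under the assumption that pagination detection has already reduced each page to a single "next page" link; the function name `reconstruct_list` and the URLs are hypothetical.

```python
def reconstruct_list(next_link):
    """Order the pages of one list by following 'next page' links.

    next_link maps a page URL to the URL of the following page,
    or to None for the last page. (Assumed input shape.)
    """
    targets = {dst for dst in next_link.values() if dst is not None}
    # The first page is the only one that no other page points to.
    heads = [url for url in next_link if url not in targets]
    if len(heads) != 1:
        raise ValueError("expected exactly one first page")

    ordered, seen = [], set()
    url = heads[0]
    while url is not None and url not in seen:  # 'seen' guards against link cycles
        ordered.append(url)
        seen.add(url)
        url = next_link.get(url)
    return ordered


# Hypothetical thread with three pages, given in arbitrary order.
pages = {"t?p=2": "t?p=3", "t?p=1": "t?p=2", "t?p=3": None}
thread_pages = reconstruct_list(pages)
```

Treating the ordered pages as one unit is what lets the crawler reason about a whole thread or board rather than about isolated pages.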

18 Step 3: Timestamp Extraction
YYYY/MM/DD

19 Timestamp Extraction
Distinguish real timestamps from noise.
The temporal order can help!
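
One way to read "the temporal order can help" is that the real timestamps of consecutive records on a page tend to be sorted, while noisy dates (for example, user join dates) are not. The sketch below scores candidate date fields by how consistently ordered they are. It is an assumption-laden illustration, not the authors' method: the field names, the `YYYY/MM/DD` parsing format, and the scoring rule are all made up for the example.

```python
from datetime import datetime


def order_score(stamps):
    """Fraction of adjacent pairs that follow the dominant temporal direction."""
    if len(stamps) < 2:
        return 0.0
    pairs = list(zip(stamps, stamps[1:]))
    descending = sum(a >= b for a, b in pairs)
    ascending = sum(a <= b for a, b in pairs)
    return max(descending, ascending) / len(pairs)


def pick_timestamp_field(candidates, fmt="%Y/%m/%d"):
    """Return the name of the candidate field whose dates are best ordered."""
    best_name, best_score = None, -1.0
    for name, values in candidates.items():
        try:
            stamps = [datetime.strptime(v, fmt) for v in values]
        except ValueError:
            continue  # field does not parse as dates in this format
        score = order_score(stamps)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical fields extracted from the records of one index page.
candidates = {
    "last_post": ["2008/06/03", "2008/06/02", "2008/05/30", "2008/05/28"],
    "join_date": ["2005/01/10", "2007/03/02", "2004/11/21", "2006/08/15"],
    "views": ["12", "40", "7", "3"],
}
chosen = pick_timestamp_field(candidates)
```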

20 Step 4: Prediction Models

21 Prediction Model
40% of threads stay active for no longer than 24 hours.
70% of threads stay active for no longer than 3 days.
A thread becomes static after a few days with no discussion activity.

22 Feature Extraction

23 Regression Model
Predict when the next new record arrives.
CT: current time
LT: last (re-)visit time by the crawler
Linear regression, advantages:
– Lightweight computational cost
– Efficient for online processing
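
The transcript does not preserve the features or the exact regression formula, so the following is only a minimal sketch of the general idea: fit a line by ordinary least squares, use it to predict the gap until the next new record, and compare that gap with CT - LT. The single feature (average gap between past records, in minutes), the training numbers, and the revisit rule are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def should_revisit(ct, lt, predicted_gap):
    """Assumed rule: revisit once the time since the last visit reaches the predicted gap."""
    return (ct - lt) >= predicted_gap


# Made-up training data: feature = average past gap, target = observed next gap (minutes).
past_gaps = [10.0, 20.0, 30.0, 40.0]
next_gaps = [12.0, 22.0, 32.0, 42.0]
a, b = fit_line(past_gaps, next_gaps)

predicted = a * 25.0 + b  # prediction for a list whose average past gap is 25 minutes
```

Fitting and predicting are both a handful of arithmetic operations, which matches the slide's point that linear regression is cheap enough to run online for every list in the queue.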

24 Step 5: Bandwidth Control

25 Bandwidth Control
Index and post pages are quite different:

                        Index     Post
Quantity                < 10%     > 90%
Avg. Update Frequency   high      low
Num. Re-crawl Pages     small     large

Post pages block the bandwidth.
New threads cannot be discovered in time.
A simple but practical solution
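
The transcript does not spell out the "simple but practical solution". One plausible reading, sketched here purely as an assumption, is to reserve a fixed share of the crawl budget for index lists so that post pages cannot crowd them out. The function names, the 50% share, and the spill-over rule are hypothetical.

```python
def split_budget(total_pages, index_share):
    """Split the page budget between index lists and post lists."""
    index_budget = int(total_pages * index_share)
    return index_budget, total_pages - index_budget


def schedule(index_queue, post_queue, total_pages, index_share):
    """Take pages from each queue within its own budget; unused budget spills over."""
    index_budget, post_budget = split_budget(total_pages, index_share)
    picked_index = index_queue[:index_budget]
    picked_post = post_queue[:post_budget]

    spare = total_pages - len(picked_index) - len(picked_post)
    if spare > 0:
        # Offer leftover capacity to post pages first, then to index pages.
        extra_post = post_queue[len(picked_post):len(picked_post) + spare]
        picked_post = picked_post + extra_post
        spare -= len(extra_post)
        start = len(picked_index)
        picked_index = picked_index + index_queue[start:start + spare]
    return picked_index, picked_post


# Hypothetical queues: few index pages waiting, many post pages waiting.
index_queue = ["i1", "i2", "i3"]
post_queue = ["p%d" % k for k in range(20)]
picked_index, picked_post = schedule(index_queue, post_queue, total_pages=10, index_share=0.5)
```

With a reserved share, the frequently updated index lists are always revisited, and any budget they do not use is handed back to the much larger pool of post pages.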

26 Experiment Setup
18 web forums in diverse categories
– March 1999 ~ June 2008
– 990,476 pages
– 5,407,854 posts
Simulation
– Repeatable and controllable
Comparison
– List-wise strategy (LWS)
– LWS with bandwidth control (LWS + BC)
– Curve-fitting policy (CF)
– Bound-based policy (BB, WWW 2008)
– Oracle (most ideal case)

27 Measurements
Bandwidth utilization
– I_new: #pages with new information
– I_B: #pages crawled
Coverage
– I_crawl: #new posts crawled
– I_all: #new posts published on forums
Timeliness
– ∆t_i: #minutes between publish and download
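
The slide lists the quantities but the formulas did not survive in the transcript. A natural reading, assumed here, is that bandwidth utilization is I_new / I_B, coverage is I_crawl / I_all, and timeliness is the mean of the ∆t_i values. The sketch below computes the three measures under that assumption; the input numbers are invented.

```python
def bandwidth_utilization(i_new, i_b):
    """Assumed ratio: pages with new information over all pages crawled."""
    return i_new / i_b


def coverage(i_crawl, i_all):
    """Assumed ratio: new posts crawled over new posts published."""
    return i_crawl / i_all


def timeliness(delays_minutes):
    """Assumed average of the delays between publish and download, in minutes."""
    return sum(delays_minutes) / len(delays_minutes)


# Invented numbers, for illustration only.
util = bandwidth_utilization(i_new=1500, i_b=3000)
cov = coverage(i_crawl=990, i_all=1000)
avg_delay = timeliness([10, 20, 30])
```

Under this reading, higher is better for the first two measures and lower is better for timeliness.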

28 Performance Comparison
Experiment results
Bandwidth: 3000 pages / day

29 Performance Comparison (Cont.)
Comparison with various bandwidths

30 Performance Comparison (Cont.)
Detailed performance of index and post pages
Bandwidth: 3000 pages / day

31 Conclusions and Future Work
A list-wise strategy for incremental crawling of web forums
– Takes user behavior statistics into account
– The new strategy is 260% faster than the compared methods and also achieves a high coverage ratio
Future work
– A stronger prediction model than linear regression

