Incorporating Site-level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy
KDD 2009. Jiang-Ming Yang, Rui Cai, Chunsong Wang, Hua Huang, Lei Zhang, Wei-Ying Ma
Date: 2009/9/21. Speaker: Yi-Lin Hsu. Advisor: Dr. Jia-ling Koh
Outline
- Introduction: Crawling in Web Forums
- Main Tasks: Sitemap Construction, List Construction & Classification, Timestamp Extraction, Prediction Model, Bandwidth Control
- Experiments
- Conclusion & Future Work
Crawling in Web Forums
Fetching forum data from various forum sites is the fundamental step of most related web applications. To satisfy these applications' requirements, an ideal forum crawler should make a tradeoff between COMPLETENESS and TIMELINESS.
Crawling in Web Forums
Completeness: identify updated discussion threads and download their newly published content (required by applications such as online Q&A services).
Crawling in Web Forums
Timeliness: efficiently discover and download newly published discussion threads.
Characteristics of Forums
(Example screenshots of an index page and a post page.)
Incremental Crawling
General web pages: treat each page independently, i.e., page-wise.
Forum pages: take pagination into account, i.e., list-wise.
System Overview
The proposed solution consists of two parts: offline mining and online crawling.
Offline mining:
- Distinguish index pages from post pages
- Concatenate pages into lists by following pagination links
System Overview
Online crawling:
- Identify index pages and post pages
- Predict the update frequency of each list
- Balance bandwidth by adjusting the numbers of the various lists in the crawling queue
Main Tasks
1. Sitemap Construction
2. List Construction & Classification
3. Timestamp Extraction
4. Prediction Models
5. Bandwidth Control
Step 1: Sitemap Construction
Forum Sitemap
A sitemap is a directed graph consisting of a set of vertices and corresponding arcs: each vertex represents a class of pages in that forum, and each arc denotes the link relation between two vertices. (Example: http://forums.asp.net)
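As a rough illustration (not the paper's implementation), such a sitemap fits in a very small structure; the page-class names below (board_index, thread_index, post_page) are hypothetical:

```python
# Minimal sketch of a forum sitemap as a directed graph: vertices are page
# classes, arcs are observed link relations between classes. Illustrative
# only; the class names are assumptions, not taken from the paper.
from collections import defaultdict

class Sitemap:
    def __init__(self):
        self.arcs = defaultdict(set)   # vertex -> set of vertices it links to

    def add_arc(self, src, dst):
        self.arcs[src].add(dst)

sitemap = Sitemap()
sitemap.add_arc("board_index", "thread_index")  # board page lists its threads
sitemap.add_arc("thread_index", "post_page")    # thread list links to posts
sitemap.add_arc("post_page", "post_page")       # pagination within a thread
print(dict(sitemap.arcs))
```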
Page Clustering
(Diagram: the forum's pages grouped into sitemap vertices.)
Step 2: List Construction & Classification
Identify Index & Post Nodes
An SVM-based classifier with site-independent features:
- Node size
- Link structure
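A hedged sketch of what such a classifier could look like, assuming scikit-learn; the two numeric features and all training values are invented for illustration and merely stand in for the slide's node-size and link-structure features:

```python
# Sketch of an SVM-based node classifier (index vs. post), assuming
# scikit-learn. Feature values are made up for illustration.
from sklearn.svm import SVC

# Each row: [pages in the node, average outlinks per page]
X_train = [[120, 300], [150, 280], [15, 40], [20, 35]]
y_train = ["index", "index", "post", "post"]

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

print(clf.predict([[130, 290]]))  # -> ['index'] for an index-like node
```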
List Reconstruction
Given a new page:
1. Classify it into a node
2. Detect pagination links
3. Find out the link order
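One illustrative way to carry out steps 2 and 3, assuming pagination links expose a numeric page parameter in the URL; the parameter name "page" and the URLs are assumptions, not from the paper:

```python
# Order pagination links of a list by their page number (illustrative sketch).
import re

def order_pagination_links(links):
    """Return pagination links sorted by the numeric 'page' parameter."""
    numbered = []
    for url in links:
        m = re.search(r"[?&]page=(\d+)", url)
        if m:
            numbered.append((int(m.group(1)), url))
    return [url for _, url in sorted(numbered)]

links = ["t123.aspx?page=3", "t123.aspx?page=1", "t123.aspx?page=2"]
print(order_pagination_links(links))  # pages 1, 2, 3 in order
```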
Step 3: Timestamp Extraction
Timestamp Extraction
Distinguish real timestamps from noise; the temporal order of the records can help.
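A minimal sketch of how temporal order could separate real post timestamps from date-like noise such as member registration dates; the field names and dates are invented, and this is not necessarily the paper's exact algorithm:

```python
# Among candidate timestamp fields aligned with record order, prefer the one
# whose values best follow the temporal order of the posts (illustrative).
from datetime import datetime

def monotonic_score(values):
    """Fraction of consecutive pairs that are non-decreasing in time."""
    pairs = list(zip(values, values[1:]))
    if not pairs:
        return 0.0
    return sum(a <= b for a, b in pairs) / len(pairs)

def pick_timestamp_field(candidates):
    """candidates: {field_name: [datetime, ...] in record order}."""
    return max(candidates, key=lambda f: monotonic_score(candidates[f]))

posts = {
    "post_time": [datetime(2008, 6, 1, 9), datetime(2008, 6, 1, 10), datetime(2008, 6, 2, 8)],
    "join_date": [datetime(2005, 3, 2), datetime(2001, 7, 9), datetime(2007, 1, 1)],
}
print(pick_timestamp_field(posts))  # -> 'post_time'
```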
Step 4: Prediction Models
Prediction Model
40% of threads stay active for no longer than 24 hours, and 70% for no longer than 3 days. A thread becomes static after a few days without discussion activity.
Feature Extraction
Regression Model
Predict when the next new record arrives.
CT: current time; LT: last (re-)visit time by the crawler.
Linear regression is used; advantages:
- Lightweight computational cost
- Efficient for online processing
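A minimal sketch, assuming scikit-learn: a linear regression that maps simple thread-activity features to the expected gap until the next new record; the features and numbers are invented for illustration:

```python
# Fit a linear regression predicting the gap (hours) until the next new post.
from sklearn.linear_model import LinearRegression

# Hypothetical features: [#posts so far, hours since last post at visit time]
X = [[5, 1.0], [12, 0.5], [3, 30.0], [40, 0.2]]
y = [2.0, 1.0, 48.0, 0.5]      # observed hours until the next post arrived

model = LinearRegression().fit(X, y)

# At crawl time, predict the gap for a live list and schedule the re-visit
# around LT + predicted_gap (LT = last visit time, per the slide's notation).
print(model.predict([[8, 2.0]]))
```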
Step 5: Bandwidth Control
Bandwidth Control
Index and post pages are quite different:
                          Index    Post
  Quantity                < 10%    > 90%
  Avg. update frequency   high     low
  Num. re-crawl pages     small    large
Post pages block the bandwidth, so new threads cannot be discovered in time. A simple but practical solution is adopted (see the sketch below).
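The slide does not spell the solution out; one plausible scheme, consistent with the earlier slide about adjusting the numbers of the various lists in the crawling queue, is to crawl index lists first up to a reserved share of the daily page budget and let post lists fill the remainder. The 30% share and the list names below are assumptions, not numbers from the paper:

```python
# Hedged sketch of one possible bandwidth-control scheme (not necessarily the
# paper's): crawl index lists first, capped at a reserved share of the daily
# budget, then fill the remaining budget with post lists.
def schedule(index_lists, post_lists, daily_budget, index_share=0.3):
    """Each list is (list_id, pages_to_recrawl); returns the list ids to crawl."""
    index_quota = int(daily_budget * index_share)
    picked, used = [], 0
    for pool, cap in ((index_lists, index_quota), (post_lists, daily_budget)):
        for list_id, pages in pool:
            if used + pages > cap:
                break
            picked.append(list_id)
            used += pages
    return picked

print(schedule([("board_a", 200), ("board_b", 400)],
               [("thread_1", 900), ("thread_2", 800)],
               daily_budget=3000))   # -> all four lists fit within the budget
```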
Experiment Setup
- 18 web forums in diverse categories, March 1999 ~ June 2008: 990,476 pages, 5,407,854 posts
- Simulation: repeatable and controllable
- Methods compared: list-wise strategy (LWS), LWS with bandwidth control (LWS + BC), curve-fitting policy (CF), bound-based policy (BB, WWW 2008), and Oracle (the ideal case)
Measurements
- Bandwidth utilization: I_new / I_B, where I_new = #pages with new information and I_B = #pages crawled
- Coverage: I_crawl / I_all, where I_crawl = #new posts crawled and I_all = #new posts published on the forums
- Timeliness: average Δt_i, where Δt_i = #minutes between a post being published and being downloaded
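A tiny worked example of these three metrics under the ratio definitions above; the numbers are illustrative, not results from the paper:

```python
# Compute the three evaluation metrics from illustrative counts.
def bandwidth_utilization(i_new, i_b):
    return i_new / i_b          # pages with new information / pages crawled

def coverage(i_crawl, i_all):
    return i_crawl / i_all      # new posts crawled / new posts published

def avg_timeliness(delta_minutes):
    return sum(delta_minutes) / len(delta_minutes)   # mean publish-to-download delay

print(bandwidth_utilization(2100, 3000))   # 0.7
print(coverage(4800, 5000))                # 0.96
print(avg_timeliness([30, 90, 240]))       # 120.0 minutes
```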
Performance Comparison
Experiment results (bandwidth: 3,000 pages / day).
Performance Comparison (Cont.)
Comparison under various bandwidth budgets.
Performance Comparison (Cont.)
Detailed performance on index and post pages (bandwidth: 3,000 pages / day).
Conclusions and Future Work
- A list-wise strategy for incremental crawling of web forums, taking user behavior statistics into account.
- The new strategy is about 260% faster than the baseline methods while also achieving a high coverage ratio.
Future work:
- A stronger prediction model than linear regression