Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy
Jiang-Ming Yang, Rui Cai, Lei Zhang, and Wei-Ying Ma (Microsoft Research, Asia)
Chun-song Wang (University of Wisconsin-Madison)
Hua Huang (Beijing University of Posts and Telecommunications)
Web Forums
– Categories: Recreation, Sports, Games, Computers, Arts, Society, Science, Health
– Web Search, Q & A, Social Networks
– Forums are a huge resource of human knowledge!
Forum Data Crawling and Mining
– Crawling: iRobot: Sitemap Reconstruction (WWW 2008); Exploring Traversal Strategy (SIGIR 2008); Incremental Crawling (KDD 2009)
– Data Parsing: Automatic Data Parsing (WWW 2009)
– Content Mining: Expert Finding & Junk Detection (SIGIR 2009); User Behavior in Forums (KDD 2009)
Characteristics of Forums
– Two kinds of pages: index pages and post pages (example screenshots)
Incremental Crawling
– General web pages: treat each page independently, i.e., page-wise
– Forum pages: consider pagination, i.e., list-wise
Our Solution
– Incorporating site-level knowledge
  – How many kinds of pages a website contains
  – How the various pages are linked with each other
– Purposes
  – Distinguish index and post pages
  – Concatenate pages into lists by following pagination links
– Pipeline: Sitemap Construction → List Construction & Classification → Timestamp Extraction → Prediction Models → Bandwidth Control
Pipeline – Current step: Sitemap Construction
Forum Sitemap
– A sitemap is a directed graph consisting of a set of vertices and the links between them
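To make the graph structure concrete, here is a minimal sketch of a sitemap data type in Python. The representation (integer cluster ids as vertices, adjacency sets as links) is an illustrative assumption, not the paper's implementation.

```python
# Minimal sitemap sketch: vertices are page clusters, directed edges record
# which clusters link to which. Names and types are illustrative.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Sitemap:
    vertices: set = field(default_factory=set)                      # cluster ids
    edges: defaultdict = field(default_factory=lambda: defaultdict(set))

    def add_link(self, src_cluster: int, dst_cluster: int) -> None:
        """Register a directed link between two page clusters."""
        self.vertices.update((src_cluster, dst_cluster))
        self.edges[src_cluster].add(dst_cluster)
```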
Page Layout Clustering
– Forum pages are generated from a database and templates
– Layout is a robust way to describe a template
  – Layout can be characterized by the HTML elements along different DOM paths (e.g., repetitive patterns)
– Reference: Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008.
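A minimal sketch of layout-based clustering, assuming each page is represented by the multiset of root-to-element tag paths in its DOM tree and that a page joins a cluster when its cosine similarity to the cluster representative exceeds a threshold. This illustrates the idea rather than iRobot's exact features.

```python
# Layout clustering sketch: pages with similar DOM-path distributions are
# grouped into the same sitemap node. Threshold and features are assumptions.
from collections import Counter
from lxml import html

def dom_path_features(page_html: str) -> Counter:
    """Describe a page's layout as counts of root-to-element tag paths."""
    tree = html.fromstring(page_html)
    paths = Counter()
    for el in tree.iter():
        if not isinstance(el.tag, str):       # skip comments / processing instructions
            continue
        path = "/".join(a.tag for a in reversed(list(el.iterancestors()))) + "/" + el.tag
        paths[path] += 1
    return paths

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def cluster_pages(pages, threshold=0.85):
    """pages: iterable of (page_id, html_string).
    Greedy single-pass clustering: a page joins the first cluster whose
    representative is similar enough, otherwise starts a new cluster."""
    clusters = []                              # list of (representative, [page_ids])
    for pid, page_html in pages:
        feats = dom_path_features(page_html)
        for rep, members in clusters:
            if cosine(rep, feats) >= threshold:
                members.append(pid)
                break
        else:
            clusters.append((feats, [pid]))
    return clusters
```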
Link Analysis
– References:
  – Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008.
  – Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai and Lei Zhang. Exploring Traversal Strategy for Web Forum Crawling. In Proceedings of SIGIR 2008.
Pipeline – Current step: List Construction & Classification
Identify Index & Post Nodes
– An SVM-based classifier
  – Site-independent
  – Features: node size, link structure, keywords
– Node classification is more robust than page classification
  – Robust to noise on individual pages
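As a sketch of how such a classifier could be set up with scikit-learn: each sitemap node is summarized into a small feature vector and fed to an SVM. The particular features, keyword lists, and node dictionary layout below are assumptions for illustration, not the paper's exact feature set.

```python
# Site-independent index/post node classifier sketch. Each node is a dict of
# aggregate statistics; the keyword lists are hypothetical examples.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

INDEX_KEYWORDS = ("forum", "board", "threads")     # assumed keyword cues
POST_KEYWORDS = ("reply", "quote", "posted by")

def node_features(node):
    """node: dict with aggregate statistics for one sitemap node."""
    text = node["text"].lower()
    return [
        node["num_pages"],                          # node size
        node["avg_outlinks"],                       # link structure
        node["avg_inlinks"],
        node["avg_links_to_same_node"],             # pagination-like self links
        sum(text.count(k) for k in INDEX_KEYWORDS), # keyword hits
        sum(text.count(k) for k in POST_KEYWORDS),
    ]

def train_node_classifier(nodes, labels):
    """labels: 1 for index node, 0 for post node."""
    X = [node_features(n) for n in nodes]
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X, labels)
    return clf
```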
List Reconstruction
– Given a new page:
  1. Classify it into a sitemap node
  2. Detect its pagination links
  3. Determine the order of those links
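A minimal sketch of steps 2 and 3: detect pagination links and order them by the page number embedded in the URL. Real forums vary widely, so the URL patterns below are illustrative assumptions.

```python
# Pagination detection sketch: find links whose URLs carry a page number and
# return them in page order. The regex patterns are assumed examples.
import re
from urllib.parse import urljoin
from lxml import html

PAGE_PATTERNS = [
    re.compile(r"[?&]page=(\d+)", re.I),
    re.compile(r"-(\d+)\.html?$", re.I),
]

def pagination_links(page_url: str, page_html: str):
    """Return pagination links found on the page, sorted by page number."""
    tree = html.fromstring(page_html)
    found = []
    for a in tree.xpath("//a[@href]"):
        href = urljoin(page_url, a.get("href"))
        for pat in PAGE_PATTERNS:
            m = pat.search(href)
            if m:
                found.append((int(m.group(1)), href))
                break
    # step 3: order the links so the crawler can concatenate pages into a list
    return [url for _, url in sorted(set(found))]
```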
Pipeline – Current step: Timestamp Extraction (e.g., YYYY/MM/DD)
Timestamp Extraction
– Distinguish real timestamps from noise
– The temporal order of records can help!
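One way to exploit temporal order, sketched below: collect the date-like strings in each record and keep the candidate field whose values best follow the records' order on the page. The date format and the assumption that records appear in chronological order are illustrative, not the paper's algorithm.

```python
# Timestamp selection sketch: prefer the date field that is most consistently
# ordered across a list's records. Date format is an assumed example.
from datetime import datetime
import re

DATE_RE = re.compile(r"\d{4}/\d{2}/\d{2}(?: \d{2}:\d{2})?")

def candidate_dates(record_text: str):
    """All date-like strings in one record, parsed to datetime."""
    out = []
    for s in DATE_RE.findall(record_text):
        fmt = "%Y/%m/%d %H:%M" if " " in s else "%Y/%m/%d"
        out.append(datetime.strptime(s, fmt))
    return out

def pick_timestamp_field(records):
    """records: list of record texts in page order.
    Return the index of the date field that best respects that order."""
    per_record = [candidate_dates(r) for r in records]
    n_fields = min(len(c) for c in per_record) if per_record else 0
    best_field, best_score = None, -1.0
    for i in range(n_fields):
        seq = [c[i] for c in per_record]
        ordered = sum(1 for a, b in zip(seq, seq[1:]) if a <= b)
        score = ordered / max(len(seq) - 1, 1)
        if score > best_score:
            best_field, best_score = i, score
    return best_field
```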
Pipeline – Current step: Prediction Models
Feature Extraction
– Features that describe a list's update frequency
  – List-dependent & list-independent (site-level statistics)
  – Absolute & relative
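A sketch of a few such features for a single list, assuming its record timestamps and a site-level average inter-arrival time are available. The concrete feature definitions are illustrative, not the paper's exact set.

```python
# Update-frequency feature sketch for one list (e.g., a thread), assuming
# `timestamps` are datetime objects sorted ascending and `site_avg_interval`
# is a site-level average gap in seconds.
def list_features(timestamps, crawl_time, site_avg_interval):
    gaps = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
    avg_gap = sum(gaps) / len(gaps) if gaps else float("inf")
    age = (crawl_time - timestamps[-1]).total_seconds() if timestamps else float("inf")
    return {
        # list-dependent, absolute features
        "avg_interval": avg_gap,
        "time_since_last_record": age,
        "num_records": len(timestamps),
        # relative feature: compare the list against site-level statistics
        "rel_interval": avg_gap / site_avg_interval if site_avg_interval else 1.0,
    }
```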
Regression Model
– Predict when the next new record arrives
  – CT: current time
  – LT: last (re-)visit time by the crawler
– Linear regression
  – Advantages: lightweight computational cost; efficient for online processing
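A minimal sketch of the linear-regression step: fit ordinary least squares on pairs of feature vectors and observed gaps until the next new record, then predict the next arrival as an offset from the current time (CT). The feature names follow the hypothetical list_features() sketch above; the exact regression target and features in the paper may differ.

```python
# Ordinary least-squares fit and prediction sketch. Gap unit is seconds.
from datetime import timedelta
import numpy as np

FEATURE_ORDER = ["avg_interval", "time_since_last_record",
                 "num_records", "rel_interval"]

def fit_linear_model(feature_dicts, gaps_to_next_record):
    X = np.array([[f[k] for k in FEATURE_ORDER] for f in feature_dicts], dtype=float)
    X = np.hstack([X, np.ones((len(X), 1))])           # bias column
    y = np.array(gaps_to_next_record, dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)          # least-squares weights
    return w

def predict_next_arrival(w, features, current_time):
    """Predict when the next new record arrives, as an offset from CT."""
    x = np.array([features[k] for k in FEATURE_ORDER] + [1.0])
    return current_time + timedelta(seconds=float(x @ w))
```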
Pipeline – Current step: Bandwidth Control
Bandwidth Control
– Index and post pages are quite different:

                          Index     Post
  Quantity                < 10 %    > 90 %
  Avg. Update Frequency   high      low
  Num. Re-crawl Pages     small     large

– Post pages block the bandwidth
  – New threads cannot be discovered in time
  – A simple but practical solution (sketched below)
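The slide does not spell out the solution, so the sketch below shows one simple policy consistent with the idea: reserve a fixed share of the daily page budget for index pages so new threads are discovered in time, and spend the rest on post pages ordered by predicted next-update time. The 30% share and the queue layout are assumed values for illustration.

```python
# Simple bandwidth-control sketch: split the daily budget between index and
# post pages, crawling the soonest-to-update URLs first in each queue.
def allocate_budget(index_queue, post_queue, daily_budget, index_share=0.3):
    """Queues are lists of (predicted_next_update_time, url)."""
    index_budget = int(daily_budget * index_share)
    chosen = [url for _, url in sorted(index_queue)[:index_budget]]
    remaining = daily_budget - len(chosen)
    chosen += [url for _, url in sorted(post_queue)[:remaining]]
    return chosen
```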
Experiment Setup
– 18 web forums in diverse categories
  – March 1999 ~ June 2008
  – 990,476 pages and 5,407,854 posts
– Simulation: repeatable and controllable
– Comparison
  – List-wise strategy (LWS)
  – LWS with bandwidth control (LWS + BC)
  – Curve-fitting policy (CF)
  – Bound-based policy (BB, WWW 2008)
  – Oracle (the ideal case)
Measurements
– Bandwidth utilization
  – I_new: #pages with new information
  – I_B: #pages crawled
– Coverage
  – I_crawl: #new posts crawled
  – I_all: #new posts published on the forums
– Timeliness
  – t_i: #minutes between publish and download
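Written out, the first two measures are most naturally the ratios below; the simple averaging of the per-post delays t_i in the timeliness measure is an assumption about how they are aggregated.

```latex
\text{Bandwidth Utilization} = \frac{I_{\text{new}}}{I_{B}}, \qquad
\text{Coverage} = \frac{I_{\text{crawl}}}{I_{\text{all}}}, \qquad
\text{Timeliness} = \frac{1}{I_{\text{crawl}}} \sum_{i=1}^{I_{\text{crawl}}} t_i
```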
Performance Comparison
– Warm-up stage
– Bandwidth: 3,000 pages / day
Performance Comparison (Cont.)
– Comparison under various bandwidth settings
Performance Comparison (Cont.)
– Detailed performance on index and post pages
– Bandwidth: 3,000 pages / day
Conclusions and Future Work
– Targeted web forums, a specific but interesting field
– Developed an effective solution for incremental forum crawling
  – Integrates site-level knowledge
  – Includes some practical engineering implementations
– Future work
  – Improve the timestamp extraction algorithm
  – Explore prediction models stronger than linear regression