Incorporating Site-level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy KDD 2009 Jiang-Ming Yang, Rui Cai, Chunsong Wang, Hua Huang,

Presentation transcript:

Incorporating Site-level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy
KDD 2009
Jiang-Ming Yang, Rui Cai, Chunsong Wang, Hua Huang, Lei Zhang, Wei-Ying Ma
Date: 2009/9/21
Speaker: Yi-Lin, Hsu
Advisor: Dr. Koh, Jia-ling

Outline
- Introduction
  - Crawling in Web Forums
- Main Task
  - Sitemap Construction
  - List Reconstruction & Classification
  - Timestamp Extraction
  - Prediction Model
  - Bandwidth Control
- Experiment
- Conclusion & Future Work

Crawling in Web Forums
- Fetching forum data from various forum sites is the fundamental step of most related web applications.
- To satisfy application requirements, an ideal forum crawler should make a tradeoff between COMPLETENESS and TIMELINESS.

Crawling in Web Forums
- Completeness: identify updated discussion threads and download the newly published content (e.g., for online Q&A services).

Crawling in Web Forums
- Timeliness: efficiently discover and download newly published discussion threads.

Characteristics of Forums
[Figure: an index page and a post page]

Incremental Crawling
- General web pages: each page is treated independently, i.e., page-wise
- Forum pages: pagination is taken into account, i.e., list-wise

System Overview
The proposed solution consists of two parts: offline mining and online crawling.
Offline mining, purposes:
- Distinguish index pages from post pages
- Concatenate pages into lists by following pagination

System Overview
The proposed solution consists of two parts: offline mining and online crawling.
Online crawling:
- Identify index pages and post pages
- Predict the update frequency
- Balance bandwidth by adjusting the number of lists of each kind in the crawling queue

Main Task
1. Sitemap Construction
2. List Construction & Classification
3. Timestamp Extraction
4. Prediction Models
5. Bandwidth Control

Main Task, step 1 of 5: Sitemap Construction

Forum Sitemap
A sitemap is a directed graph consisting of a set of vertices and the corresponding arcs:
- each vertex represents one kind of page in the forum
- each arc denotes the link relation between two vertices
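The directed-graph definition above can be illustrated with a minimal sketch. The class name `Sitemap`, its methods, and the vertex labels (`board_index`, `thread_list`, `post_page`) are hypothetical names chosen for illustration; the slides do not specify a concrete data structure.

```python
class Sitemap:
    """Directed graph: a vertex is a kind of page, an arc is a link relation."""

    def __init__(self):
        # Maps each vertex to the set of vertices it links to.
        self.arcs = {}

    def add_vertex(self, vertex):
        self.arcs.setdefault(vertex, set())

    def add_arc(self, source, target):
        self.add_vertex(source)
        self.add_vertex(target)
        self.arcs[source].add(target)

    def vertices(self):
        return set(self.arcs)

    def successors(self, vertex):
        return set(self.arcs.get(vertex, ()))


# Hypothetical forum layout: a board index links to thread lists,
# which paginate (self-arc) and link to post pages.
sitemap = Sitemap()
sitemap.add_arc("board_index", "thread_list")
sitemap.add_arc("thread_list", "thread_list")
sitemap.add_arc("thread_list", "post_page")
sitemap.add_arc("post_page", "post_page")
```

A self-arc is one simple way to represent pagination within the same kind of page; other encodings are equally possible.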

Page Clustering
[Figure: page clustering diagram; nodes labeled V]


Main Task, step 2 of 5: List Construction & Classification

Identify Index & Post Nodes
- An SVM-based classifier
- Site-independent features:
  - Node size
  - Link structure

List Reconstruction
Given a new page:
1. Classify it into a node
2. Detect pagination links
3. Find the link order
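Steps 2 and 3 above can be sketched as follows, assuming the pagination links have already been detected and are available as a mapping from each page URL to the URL of the next page. The function name `reconstruct_list`, the mapping format, and the example URLs are assumptions made for this sketch, not details from the slides.

```python
def reconstruct_list(first_url, next_link):
    """Concatenate paginated pages into one ordered list.

    first_url: URL of the first page of the list.
    next_link: dict mapping a page URL to the next page's URL
               (absent or None when there is no further page).
    """
    ordered, seen = [], set()
    url = first_url
    # Stop at the end of the chain, or if a link loops back to a visited page.
    while url is not None and url not in seen:
        ordered.append(url)
        seen.add(url)
        url = next_link.get(url)
    return ordered


# Hypothetical thread split across three pages.
links = {"thread?id=7&p=1": "thread?id=7&p=2",
         "thread?id=7&p=2": "thread?id=7&p=3"}
pages = reconstruct_list("thread?id=7&p=1", links)
```

The `seen` set guards against pagination links that form a cycle, which would otherwise make the loop run forever.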

Main Task, step 3 of 5: Timestamp Extraction (e.g., YYYY/MM/DD)

Timestamp Extraction
- Distinguish real timestamps from noise
- The temporal order can help!
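One plausible reading of "the temporal order can help" is that real record timestamps appear in sorted order down a page, while other date-like fields (for example, user join dates) do not. The sketch below follows that assumption; the field names, the `%Y/%m/%d` format, and the monotonicity test are illustrative choices, not the authors' stated method.

```python
from datetime import datetime


def parse_dates(strings, fmt="%Y/%m/%d"):
    """Parse every string with fmt; return None if any string fails."""
    parsed = []
    for text in strings:
        try:
            parsed.append(datetime.strptime(text, fmt))
        except ValueError:
            return None
    return parsed


def is_monotonic(values):
    """True if values are entirely non-decreasing or entirely non-increasing."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def pick_timestamp_fields(candidates):
    """Keep candidate fields whose values all parse and follow a temporal order.

    candidates: dict mapping a field name to its values in page order.
    """
    kept = []
    for name, strings in candidates.items():
        dates = parse_dates(strings)
        if dates and len(dates) > 1 and is_monotonic(dates):
            kept.append(name)
    return kept


# Hypothetical fields extracted from three consecutive posts on one page.
fields = {
    "post_time": ["2009/09/01", "2009/09/03", "2009/09/07"],
    "join_date": ["2005/01/10", "2008/07/02", "2006/03/15"],
    "signature": ["n/a", "n/a", "n/a"],
}
real = pick_timestamp_fields(fields)
```

Here only `post_time` survives: `join_date` parses but is unordered, and `signature` does not parse as a date.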

Main Task, step 4 of 5: Prediction Models

Prediction Model
- 40% of threads stay active for no longer than 24 hours
- 70% of threads stay active for no longer than 3 days
- A thread becomes static after a few days without discussion activity

Feature Extraction
[Figure not preserved in transcript]

Regression Model
- Predict when the next new record arrives
  - CT: current time
  - LT: last (re-)visit time by the crawler
- Linear regression
- Advantages:
  - Lightweight computational cost
  - Efficient for online processing
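As a rough illustration of why linear regression is cheap enough for online use, the sketch below fits a one-feature least-squares line and uses it to decide whether a list is due for a revisit. The single feature (here, a list's average gap between recent records), the `is_due` rule comparing CT - LT with the predicted gap, and all function names are assumptions; the features and formula on the original slides were not preserved.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_gap(model, feature):
    """Predicted time until the next new record, in the units used for training."""
    slope, intercept = model
    return slope * feature + intercept


def is_due(model, feature, ct, lt):
    """Assumed rule: revisit once CT - LT reaches the predicted gap."""
    return (ct - lt) >= predict_gap(model, feature)


# Hypothetical training data: feature = average past gap (hours),
# target = observed gap until the next record (hours).
model = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Prediction costs one multiplication and one addition per list, which is what makes this kind of model attractive inside a crawl scheduler.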

Main Task, step 5 of 5: Bandwidth Control

Bandwidth Control
Index and post pages are quite different:

                        Index     Post
Quantity                < 10 %    > 90 %
Avg. Update Frequency   high      low
Num. Re-crawl Pages     small     large

- Post pages block the bandwidth
- New threads cannot be discovered in time
- A simple but practical solution
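The slide does not spell out the "simple but practical solution". One possible reading, sketched below, is to reserve a share of the page budget for index lists so that post pages cannot crowd them out. The function `allocate_bandwidth`, the `index_share` parameter, and its default of 0.5 are hypothetical; only the 3000 pages/day budget comes from the experiment slides.

```python
def allocate_bandwidth(budget, index_demand, post_demand, index_share=0.5):
    """Split a page budget between index lists and post lists.

    budget:       total pages that may be crawled in the period.
    index_demand: index pages waiting in the queue.
    post_demand:  post pages waiting in the queue.
    index_share:  fraction of the budget reserved for index pages (assumed).
    """
    # Index pages get up to their reserved share first.
    index_quota = min(index_demand, int(budget * index_share))
    # Post pages take what remains, up to their demand.
    post_quota = min(post_demand, budget - index_quota)
    # Any budget still unused goes back to index pages.
    leftover = budget - index_quota - post_quota
    index_quota += min(leftover, index_demand - index_quota)
    return index_quota, post_quota


# With few index pages queued, most of the budget still goes to post pages.
quota = allocate_bandwidth(3000, index_demand=200, post_demand=10000)
```

Because unused quota flows to the other page type, the full budget is spent whenever total demand exceeds it.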

Experiment Setup
- 18 web forums in diverse categories
- March 1999 ~ June ,476 pages
- 5,407,854 posts
- Simulation: repeatable and controllable
- Comparison:
  - List-wise strategy (LWS), LWS with bandwidth control (LWS + BC)
  - Curve-fitting policy (CF)
  - Bound-based policy (BB, WWW 2008)
  - Oracle (most ideal case)

Measurements
- Bandwidth Utilization
  - I_new: #pages with new information
  - I_B: #pages crawled
- Coverage
  - I_crawl: #new posts crawled
  - I_all: #new posts published on forums
- Timeliness
  - ∆t_i: #minutes between publish and download
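The formulas themselves did not survive in the transcript. Given the symbol definitions above, a natural reading is that bandwidth utilization is I_new / I_B, coverage is I_crawl / I_all, and timeliness is the mean of the ∆t_i values. The sketch below computes the metrics under that assumption; the function names and sample numbers are illustrative.

```python
def bandwidth_utilization(i_new, i_b):
    """Assumed form: share of crawled pages that carried new information."""
    return i_new / i_b


def coverage(i_crawl, i_all):
    """Assumed form: share of newly published posts that were crawled."""
    return i_crawl / i_all


def timeliness(delays_minutes):
    """Assumed form: mean minutes between publication and download."""
    return sum(delays_minutes) / len(delays_minutes)


# Made-up numbers for one simulated day at 3000 pages/day.
utilization = bandwidth_utilization(i_new=1500, i_b=3000)
covered = coverage(i_crawl=900, i_all=1000)
avg_delay = timeliness([10, 20, 30])
```

Under these definitions, higher is better for utilization and coverage, while lower is better for timeliness.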

Performance Comparison
- Experiment results
- Bandwidth: 3000 pages / day

Performance Comparison (Cont.)
- Comparison under various bandwidths

Performance Comparison (Cont.)
- Detailed performance on index and post pages
- Bandwidth: 3000 pages / day

Conclusions and Future Work
- A list-wise strategy for incremental crawling of web forums
- Takes user behavior statistics into account
- The new strategy is 260% faster than the compared methods and also achieves a high coverage ratio
- Future work: a stronger prediction model than linear regression