Exploring Traversal Strategy for Web Forum Crawling Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai Microsoft Research Asia, Beijing SIGIR 08 2009. 1. 14.

Presentation transcript:

Exploring Traversal Strategy for Web Forum Crawling Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai Microsoft Research Asia, Beijing SIGIR Summarized and Presented by Gihyun Gong, IDS Lab.

Copyright © 2008 by CEBT

Outline
- Motivation
- Web Forum Structure
- Our System
  - Crawling Process
  - Traversal Strategy
    - Skeleton link identification
    - Page-flipping link detection
- Evaluation

What is the Web Forum?
- Forum structures
  - User Groups
    - Posters
    - Moderators
    - Admin
  - Post
  - Thread

Characteristics
- Structure
  - Thread-based (not page-based)
  - Many related/shortcut links
  - Pre-defined templates
  - Links with permission control
- Contents
  - User-created content
  - Changes frequently
  - Weight and quality of replies
  - Suitable for discussing a topic

Why Do We Crawl Web Forums?
- Web forums are a huge resource of human knowledge
  - Over 20% of search results are from web forums
  - Related outside links (sites) can be obtained
  - Topic-based structure
[Figure: a web forum made up of topics, each containing posts]

The Limitation of Generic Crawlers
- In general crawling, each page is treated independently and each link is treated indiscriminately
  - Leads to more than 50% useless pages
  - Ignores the relationships between pages from the same thread
- Forum crawling needs a site-level perspective and a careful selection of links

Related Work and Limitations
- Basic strategies
  - Breadth-first
    - Hard to select a proper crawling depth for each site
  - Shallow/deep crawling strategy
    - Shallow crawling cannot ensure access to all valuable content; deep crawling may result in too many duplicate and invalid pages
- Enhanced strategies
  - On-line page importance computation
    - Calculates a partial PageRank to estimate the importance of a page
    - Not suitable for forum sites
  - Deep web crawling
    - Forums are a kind of deep web, but these crawlers focus on how to prepare appropriate queries to probe
  - Focused crawling
    - Semantic topics in forums are too diverse to be simply characterized with a list of terms

Web Forum Structure - Sitemap
- A sitemap is a directed graph
  - A set of vertices and links
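As a rough illustration of the sitemap as a directed graph, the sketch below stores vertices and links in an adjacency map. The helper build_sitemap and the vertex names (entry, board_list, and so on) are hypothetical and are not taken from the slides.

```python
from collections import defaultdict


def build_sitemap(links):
    """Build a directed graph from (source_vertex, target_vertex) pairs."""
    graph = defaultdict(set)
    for src, dst in links:
        graph[src].add(dst)
        graph[dst]  # touch the target so vertices without out-links also appear
    return dict(graph)


# Hypothetical vertices of a small forum sitemap (assumed for illustration).
links = [
    ("entry", "board_list"),
    ("board_list", "thread_list"),
    ("thread_list", "thread"),
    ("thread", "thread"),        # a loop-back link, e.g. to the next page
    ("thread", "user_profile"),
]
sitemap = build_sitemap(links)

for vertex, targets in sitemap.items():
    print(vertex, "->", sorted(targets))
```

A set per vertex is used so that repeated links between the same two vertices collapse into one edge.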

Web Forum Structure - Skeleton Link
- The most important links supporting the structure

Web Forum Structure - Page Flipping
- A kind of loop-back link in the sitemap

Site-Level Perspective
- Understand the organization structure
- Find an optimal traversal strategy
[Figure: The site-level perspective of "forums.asp.net"]

Process

Random Sampling
- Randomly sample some pages from a given site
- Try to push as many unseen URLs as possible onto the queue, and randomly pop a URL from the front or the end of the queue
  - To cover as many unseen URL patterns as possible
- 1,000 pages will be enough
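The sampling step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function sample_pages, the get_links callback (which stands in for fetching a page and extracting its links), the 50/50 choice between the two ends of the queue, and the toy site are all assumptions.

```python
import random
from collections import deque


def sample_pages(entry_url, get_links, max_pages=1000, seed=None):
    """Sample up to max_pages URLs, popping randomly from either end of the queue."""
    rng = random.Random(seed)
    queue = deque([entry_url])
    seen = {entry_url}
    sampled = []
    while queue and len(sampled) < max_pages:
        # Assumption: front and end of the queue are chosen with equal probability.
        url = queue.popleft() if rng.random() < 0.5 else queue.pop()
        sampled.append(url)
        for link in get_links(url):
            if link not in seen:  # push only unseen URLs
                seen.add(link)
                queue.append(link)
    return sampled


# Hypothetical in-memory site used in place of real HTTP requests.
toy_site = {
    "/": ["/board/1", "/board/2"],
    "/board/1": ["/thread/1", "/thread/2", "/"],
    "/board/2": ["/thread/3"],
    "/thread/1": [],
    "/thread/2": ["/thread/1"],
    "/thread/3": [],
}
pages = sample_pages("/", lambda u: toy_site.get(u, []), max_pages=4, seed=0)
print(pages)
```

Popping from the front behaves like breadth-first traversal and popping from the end like depth-first traversal, so mixing the two is one plausible way to reach both shallow and deep URL patterns within a small sample.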

Sitemap Construction

Traversal Strategy Exploring

Skeleton Links
- Skeleton links are the most important links supporting the structure of a forum site
- Crawlers crawl as many unique pages as possible in a given forum site by following skeleton links
- Skeleton links point to all valuable pages without introducing redundant and valueless pages
- How to identify?
  - Aim at all unique pages without duplicates
  - An optimal set of skeleton links leads to the most unique pages and few duplicates
  - Search skeleton links for each valuable vertex
    - Level by level: inspired by user browsing behavior
    - Find an optimal combination of links

How to Identify Skeleton Links
- Coverage
- Informativeness

Page-Flipping Links
- Crawlers can completely download a long discussion thread that is divided into several pages by following page-flipping links
- Page-flipping links are a kind of loop-back link in the sitemap; however, not all loop-back links are page-flipping links
- How to detect?
  - For page-flipping links, if there is a path from page A to page B, there must be a path following the same type of links from B to A
  - Page-flipping links have a larger connectivity score
[Figure: Page A and Page B connected to each other by hyperlinks]

[Figure: An illustration of the characteristics of page-flipping links]
- Connectivity = 722 / 890 = 0.81
- Connectivity = 108 / 1153 = 0.09
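The slides report connectivity values but do not spell out the formula. One reading consistent with the description, assumed here, is the fraction of ordered page pairs (A, B) joined by a path of one link type for which a path of the same type also leads from B back to A. The sketch below implements that assumed definition; the function names and the toy link sets are hypothetical.

```python
from collections import defaultdict, deque


def reachable(adj, start):
    """Return all pages reachable from start by following one or more links."""
    seen = set()
    queue = deque(adj.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(adj.get(node, ()))
    return seen


def connectivity(edges):
    """Assumed score: share of forward-reachable pairs that are also reachable in reverse."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
    reach = {a: reachable(adj, a) for a in list(adj)}
    forward = [(a, b) for a, targets in reach.items() for b in targets if a != b]
    if not forward:
        return 0.0
    mutual = sum(1 for a, b in forward if a in reach.get(b, ()))
    return mutual / len(forward)


# Hypothetical "next/previous page" links inside one thread: every page can get back.
flip_edges = [("p1", "p2"), ("p2", "p1"), ("p2", "p3"), ("p3", "p2")]
# Hypothetical "related thread" links: one-directional, nothing leads back.
related_edges = [("t1", "t2"), ("t2", "t3"), ("t1", "t3")]

flip_score = connectivity(flip_edges)
related_score = connectivity(related_edges)
print(flip_score, related_score)

# The ratios quoted on the slide, rounded to two decimals.
print(round(722 / 890, 2), round(108 / 1153, 2))
```

Under this reading, a link type whose score is close to 1 behaves like page flipping, while a low score suggests an ordinary loop-back link.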

Crawling
- Start from the given entry page
- Map a new page to an existing layout vertex
- Follow the explored traversal strategy for out-links from that page
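The three steps above might look roughly like the following loop. This is a sketch under stated assumptions rather than the system from the talk: the crawl function, the classify and fetch_links callbacks, the link-type labels, and the strategy table (which link types to follow from each layout vertex) are all hypothetical stand-ins.

```python
from collections import deque


def crawl(entry_url, fetch_links, classify, strategy):
    """Crawl from entry_url, following only the link types the strategy allows.

    fetch_links(url) -> list of (link_type, target_url) pairs
    classify(url)    -> name of the layout vertex the page maps to
    strategy         -> dict: layout vertex -> set of link types to follow
    """
    queue = deque([entry_url])
    seen = {entry_url}
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        vertex = classify(url)                # map the page to a layout vertex
        allowed = strategy.get(vertex, set())
        for link_type, target in fetch_links(url):
            if link_type in allowed and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical toy forum: each page lists (link_type, target) pairs.
site = {
    "/": [("board", "/b1"), ("login", "/login")],
    "/b1": [("thread", "/t1"), ("profile", "/u1")],
    "/t1": [("next_page", "/t1?p=2"), ("profile", "/u1")],
    "/t1?p=2": [("next_page", "/t1")],
}


def classify(url):
    if url == "/":
        return "entry"
    if url.startswith("/b"):
        return "board"
    if url.startswith("/t"):
        return "thread"
    return "other"


# Assumed strategy: skeleton links down to threads, then page-flipping links.
strategy = {"entry": {"board"}, "board": {"thread"}, "thread": {"next_page"}}
crawled = crawl("/", lambda u: site.get(u, []), classify, strategy)
print(crawled)
```

In this toy run the login and profile links are never followed, which mirrors the idea of skipping redundant and valueless pages.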

Experimental Setup
- Comparison experiments on eight forums from diverse categories
  - Mirror pages: crawled by a real commercial crawler
  - Structure-driven: crawled by the structure-driven crawler proposed at SIGIR'06
  - Our method: crawled by a crawler using our traversal strategy

Evaluation Criteria
- Coverage
- Informativeness

Effectiveness and Efficiency
- Effectiveness
- Efficiency

Evaluation of Page-Flipping Detection

Conclusions
- A complete solution is proposed for automatically exploring an appropriate traversal strategy for a given target forum site
  - Skeleton link identification
  - Page-flipping link detection
- Future work directions
  - Incremental crawling
  - Forum page segmentation