Published by Allan Weaver. Modified over 9 years ago.
1
Exploring Traversal Strategy for Web Forum Crawling
Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai
Microsoft Research Asia, Beijing
SIGIR 08
2009. 1. 14
Summarized and Presented by Gihyun Gong, IDS Lab.
2
Copyright 2008 by CEBT
Outline
Motivation
Web Forum Structure
Our System
Crawling Process
Traversal Strategy
– Skeleton link identification
– Page-flipping link detection
Evaluation
3
What is the Web Forum?
4
What is the Web Forum?
Forum structures
User Groups
– Posters
– Moderators
– Admin
Post
Thread
5
Characteristics
Structure
– Thread-based (not page-based)
– Many related/shortcut links
– Pre-defined templates
– Links with permission control
Contents
– User-created content
– Changes frequently
– Weight and quality of replies
– Suitable for discussing a topic
6
Why Do We Crawl the Web Forum?
Web forums are a huge resource of human knowledge
– Over 20% of search results are from web forums
– Crawling can also surface related external links (sites)
– Topic-based structure
[Figure: a web forum organized as topics, each containing posts]
7
The Limitation of Generic Crawlers
In general crawling, each page is treated independently, and each link is treated indiscriminately
– This leads to more than 50% useless pages
– Relationships between pages from the same thread are ignored
Forum crawling needs a site-level perspective and a careful selection of links
8
Related Work and Limitations
Basic strategy
Breadth-first
– Hard to select a proper crawling depth for each site
Shallow/Deep crawling strategy
– Shallow crawling cannot ensure access to all valuable content; deep crawling may result in too many duplicated and invalid pages
Enhanced strategy
On-line page importance computation
– Calculates a partial PageRank to estimate the importance of a page
– Not suitable for forum sites
Deep web crawling
– Forums are a kind of deep Web, but this crawler focused on how to prepare appropriate queries to probe
Focused crawling
– Semantic topics in forums are too diverse to be simply characterized with a list of terms
10
Web Forum Structure – Sitemap
The sitemap is a directed graph
– A set of vertices and links
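As a rough illustration of "a directed graph with a set of vertices and links", the sketch below stores a sitemap as an adjacency structure. The class name `Sitemap`, the vertex names, and the link labels are all hypothetical and chosen only for this example; the slides do not specify a data structure.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: vertices are groups of pages, arcs are link types."""

    def __init__(self):
        # source vertex -> {target vertex -> set of link labels}
        self._out = defaultdict(lambda: defaultdict(set))
        self.vertices = set()

    def add_link(self, source, target, label):
        """Record a directed link of type `label` from source to target."""
        self.vertices.update((source, target))
        self._out[source][target].add(label)

    def out_links(self, source):
        """Return sorted (target, label) pairs leaving `source`."""
        return sorted(
            (target, label)
            for target, labels in self._out.get(source, {}).items()
            for label in labels
        )


# Hypothetical vertices and link labels, for illustration only.
sitemap = Sitemap()
sitemap.add_link("board_list", "thread_list", "board_title")
sitemap.add_link("thread_list", "thread_page", "thread_title")
sitemap.add_link("thread_page", "thread_page", "next_page")  # loop-back link
```

A link from a vertex to itself, such as `next_page` here, is the kind of loop-back link that the later slides examine for page flipping.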
11
Web Forum Structure – Skeleton Link
The most important links supporting the structure
12
Web Forum Structure – Page Flipping
A kind of loop-back link in the sitemap
13
Site-Level Perspective
Understand the organization structure
Find an optimal traversal strategy
[Figure: the site-level perspective of "forums.asp.net"]
15
Process
16
Random Sampling
17
Random Sampling
Randomly sample some pages from a given site
Push as many unseen URLs as possible to the queue, and randomly pop a URL from the front or the end of the queue
– The goal is to cover as many unseen URL patterns as possible
1,000 pages will be enough
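The sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenters' implementation: the function name `sample_pages`, the `fetch_links` callback, the fixed 50/50 choice between the two queue ends, and the toy site are all hypothetical.

```python
import random
from collections import deque


def sample_pages(entry_url, fetch_links, max_pages=1000, seed=0):
    """Sample up to max_pages URLs, popping randomly from either queue end."""
    rng = random.Random(seed)
    queue = deque([entry_url])
    seen = {entry_url}
    sampled = []
    while queue and len(sampled) < max_pages:
        # Assumption: front and end of the queue are equally likely.
        url = queue.popleft() if rng.random() < 0.5 else queue.pop()
        sampled.append(url)
        for link in fetch_links(url):
            if link not in seen:  # push only unseen URLs
                seen.add(link)
                queue.append(link)
    return sampled


# Toy site standing in for real HTTP fetching (hypothetical URLs).
toy_site = {
    "/": ["/board/1", "/board/2"],
    "/board/1": ["/thread/1", "/thread/2", "/"],
    "/board/2": ["/thread/3"],
}
pages = sample_pages("/", lambda url: toy_site.get(url, []), max_pages=4)
```

Popping from the front behaves like breadth-first search and popping from the end like depth-first search, so mixing the two spreads the sample across both shallow and deep parts of the site.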
18
Sitemap Construction
19
Traversal Strategy Exploring
20
Skeleton Links
Skeleton links are the most important links supporting the structure of a forum site
– Crawlers crawl as many unique pages as possible in a given forum site by following skeleton links
– Skeleton links point to all valuable pages without introducing redundant and valueless ones
How to identify?
Aim at all unique pages without duplicates
– An optimal set of skeleton links leads to most unique pages and few duplicates
Search skeleton links for each valuable vertex
– Level by level: inspired by user browsing behavior
– Find an optimal combination of links
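One way to picture "a combination of links that reaches most unique pages with few duplicates" is a greedy selection over candidate link types. The sketch below is only an illustration: the slides do not say how the optimal combination is searched, so the greedy rule, the function name `pick_skeleton_links`, and the example link types and page IDs are all assumptions.

```python
def pick_skeleton_links(candidates, min_gain=1):
    """Greedily pick link types that add the most not-yet-covered pages.

    candidates: {link type: set of valuable page IDs reached through it}
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while remaining:
        # Gain = pages this link type reaches that no chosen link covers yet.
        best = max(remaining, key=lambda name: (len(remaining[name] - covered), name))
        gain = len(remaining[best] - covered)
        if gain < min_gain:
            break  # every remaining link type would only add duplicates
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical link types on a thread-list page and the pages they reach.
candidates = {
    "thread_title": {1, 2, 3, 4},
    "last_post_shortcut": {2, 4},
    "print_view": {1, 2, 3},
}
chosen, covered = pick_skeleton_links(candidates)
```

In this toy input, `thread_title` alone already reaches every page, so the shortcut and print-view links would contribute only duplicates and are left out.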
21
How to Identify Skeleton Links
Coverage
Informativeness
22
Page-Flipping Links
Crawlers can completely download a long discussion thread divided into several pages by following page-flipping links
Page-flipping links are a kind of loop-back link in the sitemap. However, not all loop-back links are page-flipping ones
How to detect?
For page-flipping links, if there is a path from page A to B, there must be a path following the same type of links from B to A
– Page-flipping links have a larger connectivity score
[Figure: Page A and Page B connected to each other by hyperlinks]
23
An illustration of the characteristics of page-flipping links
Connectivity = 722 / 890 = 0.81
Connectivity = 108 / 1153 = 0.09
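A connectivity score of this kind can be sketched as the fraction of links of one type for which a return path exists over the same type of links. The slides do not state exactly what the numerator and denominator count, so this definition, the function names, and the toy link lists are assumptions made for illustration.

```python
from collections import defaultdict, deque


def reachable(graph, start):
    """Return all nodes reachable from start by following one or more links."""
    found = set()
    frontier = deque(graph.get(start, ()))
    while frontier:
        node = frontier.popleft()
        if node not in found:
            found.add(node)
            frontier.extend(graph.get(node, ()))
    return found


def connectivity(links):
    """Fraction of links (a, b) for which a path leads back from b to a."""
    if not links:
        return 0.0
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
    returning = sum(1 for a, b in links if a in reachable(graph, b))
    return returning / len(links)


# Hypothetical links of a single type, as (source page, target page) pairs.
flip_links = [("p1", "p2"), ("p2", "p1"), ("p2", "p3"), ("p3", "p2")]
other_links = [("t1", "t2"), ("t3", "t4"), ("t5", "t6"), ("t6", "t5")]
flip_score = connectivity(flip_links)
other_score = connectivity(other_links)
```

Under this definition every page-number link in the toy thread has a way back, while only half of the other links do, which mirrors the gap between the 0.81 and 0.09 values on the slide.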
25
Crawling
Start from the given entry page
Map a new page to an existing layout vertex
Follow the explored traversal strategy for out-links from that page
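The three steps above can be sketched as a small crawl loop. Everything here is hypothetical scaffolding: the `fetch` and `classify` callbacks, the strategy table, and the toy pages are assumptions, and the URL-prefix classifier is only a stand-in for mapping a page to a layout vertex.

```python
from collections import deque


def crawl(entry_url, fetch, classify, strategy):
    """Crawl from entry_url, following only link types the strategy allows.

    fetch(url)    -> list of (link_label, target_url) pairs found on the page
    classify(url) -> name of the sitemap vertex the page is mapped to
    strategy      -> {vertex: set of link labels to follow from that vertex}
    """
    queue = deque([entry_url])
    seen = {entry_url}
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        allowed = strategy.get(classify(url), set())
        for label, target in fetch(url):
            if label in allowed and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Toy forum standing in for real pages (hypothetical URLs and labels).
toy_pages = {
    "/": [("board_title", "/board/1"), ("login", "/login")],
    "/board/1": [("thread_title", "/thread/7"), ("sort", "/board/1?sort=date")],
    "/thread/7": [("next_page", "/thread/7?page=2"), ("author_profile", "/user/3")],
    "/thread/7?page=2": [("prev_page", "/thread/7")],
}


def classify(url):
    """Stand-in for layout matching: map a URL to a vertex by its prefix."""
    if url == "/":
        return "entry"
    if url.startswith("/board/"):
        return "board"
    if url.startswith("/thread/"):
        return "thread"
    return "other"


strategy = {
    "entry": {"board_title"},   # skeleton link
    "board": {"thread_title"},  # skeleton link
    "thread": {"next_page"},    # page-flipping link
}
visited = crawl("/", lambda url: toy_pages.get(url, []), classify, strategy)
```

In this toy run the login, sort, and author-profile links are never followed, so the crawl visits only the entry page, the board, and both pages of the thread.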
27
Experimental Setup
Conduct experiments on eight forums from diverse categories
– Mirror pages: crawled by a real commercial crawler
– Structure-driven: crawled by the structure-driven crawler proposed in SIGIR’06
– Our method: crawled by a crawler using our traversal strategy
28
Evaluation Criteria
Coverage
Informativeness
29
Effectiveness and Efficiency
Effectiveness
30
Effectiveness and Efficiency
Efficiency
31
Evaluation of Page-Flipping Detection
32
Conclusions
A complete solution is proposed to automatically explore an appropriate traversal strategy for a given target forum site
– Skeleton link identification
– Page-flipping link detection
Future work directions
– Incremental crawling
– Forum page segmentation