Exploring Traversal Strategy for Web Forum Crawling Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai, Lei Zhang and Wei-Ying Ma Chinese Academy of Sciences.

Slides:



Advertisements
Similar presentations
iRobot: An Intelligent Crawler for Web Forums
Advertisements

TWO STEP EQUATIONS 1. SOLVE FOR X 2. DO THE ADDITION STEP FIRST
You have been given a mission and a code. Use the code to complete the mission and you will save the world from obliteration…
1. XP 2 * The Web is a collection of files that reside on computers, called Web servers. * Web servers are connected to each other through the Internet.
Advanced Piloting Cruise Plot.
1 Vorlesung Informatik 2 Algorithmen und Datenstrukturen (Parallel Algorithms) Robin Pomplun.
Chapter 1 The Study of Body Function Image PowerPoint
1 Copyright © 2013 Elsevier Inc. All rights reserved. Appendix 01.
STATISTICS HYPOTHESES TEST (II) One-sample tests on the mean and variance Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering National.
Effective Change Detection Using Sampling Junghoo John Cho Alexandros Ntoulas UCLA.
UNITED NATIONS Shipment Details Report – January 2006.
DCV: A Causality Detection Approach for Large- scale Dynamic Collaboration Environments Jiang-Ming Yang Microsoft Research Asia Ning Gu, Qi-Wei Zhang,
Incorporating Site-Level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy Jiang-Ming Yang, Rui Cai, Lei Zhang, and Wei-Ying Ma Microsoft.
Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma.
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Reducing Fractions. Factor A number that is multiplied by another number to find a product. Factors of 24 are (1,2, 3, 4, 6, 8, 12, 24).
2 pt 3 pt 4 pt 5 pt 1 pt 2 pt 3 pt 4 pt 5 pt 1 pt 2 pt 3 pt 4 pt 5 pt 1 pt 2 pt 3 pt 4 pt 5 pt 1 pt 2 pt 3 pt 4 pt 5 pt 1 pt ShapesPatterns Counting Number.
FACTORING ax2 + bx + c Think “unfoil” Work down, Show all steps.
Year 6 mental test 5 second questions
Year 6 mental test 10 second questions
Projects in Computing and Information Systems A Student’s Guide
REVIEW: Arthropod ID. 1. Name the subphylum. 2. Name the subphylum. 3. Name the order.
1 Competitive Privacy: Secure Analysis on Integrated Sequence Data Raymond Chi-Wing Wong 1, Eric Lo 2 The Hong Kong University of Science and Technology.
DOROTHY Design Of customeR dRiven shOes and multi-siTe factorY Product and Production Configuration Method (PPCM) ICE 2009 IMS Workshops Dorothy Parallel.
Information Systems Today: Managing in the Digital World
Randomized Algorithms Randomized Algorithms CS648 1.
1 Undirected Breadth First Search F A BCG DE H 2 F A BCG DE H Queue: A get Undiscovered Fringe Finished Active 0 distance from A visit(A)
Green Eggs and Ham.
VOORBLAD.
15. Oktober Oktober Oktober 2012.
1 Breadth First Search s s Undiscovered Discovered Finished Queue: s Top of queue 2 1 Shortest path from s.
BIOLOGY AUGUST 2013 OPENING ASSIGNMENTS. AUGUST 7, 2013  Question goes here!
Factor P 16 8(8-5ab) 4(d² + 4) 3rs(2r – s) 15cd(1 + 2cd) 8(4a² + 3b²)
Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.
© 2012 National Heart Foundation of Australia. Slide 2.
Adding Up In Chunks.
Understanding Generalist Practice, 5e, Kirst-Ashman/Hull
Chapter 5 Test Review Sections 5-1 through 5-4.
Model and Relationships 6 M 1 M M M M M M M M M M M M M M M M
25 seconds left…...
XP New Perspectives on Browser and Basics Tutorial 1 1 Browser and Basics Tutorial 1.
Subtraction: Adding UP
H to shape fully developed personality to shape fully developed personality for successful application in life for successful.
Presenteren wij ………………….
Januar MDMDFSSMDMDFSSS
REGISTRATION OF STUDENTS Master Settings STUDENT INFORMATION PRABANDHAK DEFINE FEE STRUCTURE FEE COLLECTION Attendance Management REPORTS Architecture.
Analyzing Genes and Genomes
We will resume in: 25 Minutes.
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 12 View Design and Integration.
Connecting LANs, Backbone Networks, and Virtual LANs
Intracellular Compartments and Transport
PSSA Preparation.
Xiao Zhang and Wenliang Du Dept. of Electrical Engineering & Computer Science Syracuse University.
Essential Cell Biology
CPSC 322, Lecture 5Slide 1 Uninformed Search Computer Science cpsc322, Lecture 5 (Textbook Chpt 3.5) Sept, 14, 2012.
RollCaller: User-Friendly Indoor Navigation System Using Human-Item Spatial Relation Yi Guo, Lei Yang, Bowen Li, Tianci Liu, Yunhao Liu Hong Kong University.

New Opportunities for Load Balancing in Network-Wide Intrusion Detection Systems Victor Heorhiadi, Michael K. Reiter, Vyas Sekar UNC Chapel Hill UNC Chapel.
Exploring Traversal Strategy for Web Forum Crawling Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai Microsoft Research Asia, Beijing SIGIR
Incorporating Site-level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy KDD 2009 Jiang-Ming Yang, Rui Cai, Chunsong Wang, Hua Huang,
Presentation transcript:

Exploring Traversal Strategy for Web Forum Crawling Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai, Lei Zhang and Wei-Ying Ma Chinese Academy of Sciences Microsoft Research, Asia February 15, 2014

Outline Motivation & Challenge Our Solution – System Overview – Traversal Strategy Skeleton link identification Page-flipping link detection Evaluation 2

Outline Motivation & Challenge Our Solution – System Overview – Traversal Strategy Skeleton link identification Page-flipping link detection Evaluation 3

Why Web Forum Web forum is a huge resource of human knowledge – Over 20% search results are from web forums – Leverage the power of users and communities Forum sites have complex link structures – Many shortcut links – Links with permission control – Page-flipping links 4

The Limitation of Generic Crawlers In general crawling, each page is treated independently, and each link is treated indiscriminately – Lead to more than 50% useless pages – Ignore the relationships between pages from a same thread Forum crawling needs a site-level perspective and a careful selection of links 5

Outline Motivation & Challenge Our Solution – System Overview – Traversal Strategy Skeleton link identification Page-flipping link detection Evaluation 6

What is Site-Level Perspective? Understand the organization structure Find our an optimal Traversal strategy 7 The site-level perspective of "forums.asp.net"

Random Sampling Randomly sample some pages from a given site Adopt a combined strategy of breadth-first and depth-first using a double-ended queue Try to cover as many as possible unseen URL patterns 1,000 pages are enough 10

Sitemap Construction A sitemap is a directed graph consisting of a set of vertices and the corresponding links Cluster pages into vertices with the same page layout Link = its URL pattern + its location More details about the first two parts, please refer to our previous work : iRobot: An Intelligent Crawler for Web Forums, in WWW08 12

Why Skeleton Links Crawlers crawl as many as possible unique pages in a given forum site by following skeleton links Skeleton links are the most important links supporting the structure of a forum site Skeleton links point to all valuable pages without introducing redundant and valueless 14

15 Example of skeleton links from forums.asp.net

How to Identify Skeleton Links Aim at all unique pages without duplicates An optimal set of skeleton links leads to most unique pages and few duplicates Search skeleton links for each valuable vertex – Level by level: Inspired by user browsing behavior – Find an optimal combination of links Optimal result comes out after exhausting all! 16

17 An illustration of the search process of skeleton links Pruning while searching for optimism – Selected but introduce many duplicate pages – Rejected but cause coverage drop significantly

Why Page-Flipping Links Crawlers can completely download a long discussion thread divided into several pages by following page- flipping links Page-flipping links are a kind of loop-back links in the sitemap. However, not all loop-back links are page- flipping ones 18

19 Example of page-flipping links from forums.asp.net

How to Detect Page-Flipping Links For page-flipping links, if there is a path from page A to B, there must be a path follow the same type of links from B to A Page-flipping links have larger connectivity score 20

21 An illustration of the characteristics of page-flipping links Connectivity = 722 / 890 = 0.81 Connectivity = 108 / 1153 = 0.09

Crawling From the given entry page Map a new page to an existing layout vertex Follow the explored traversal strategy for out- links from that page 23

Outline Motivation & Challenge Our Solution – System Overview – Traversal Strategy Skeleton link identification Page-flipping link detection Evaluation 24

Experimental Setup Contract experiments in eight forums from diverse categories – Mirror pages: Crawled by a real commerce crawler – Structure-driven: Crawled by structure-driven crawler proposed in SIGIR06 – Our method: Crawled by crawler using our traversal strategy 25

Evaluation Criteria 26 Coverage Informativeness

Effectiveness and Efficiency Effectiveness 27

Effectiveness and Efficiency Efficiency 28

Evaluation of Page-Flipping Detection 29

Conclusions A complete solution to automatically explore an appropriate traversal strategy to a given target forum site is proposed – Skeleton link identification – Page-flipping link detection More future work directions – Incremental crawling – Forum page segmentation 30

Thanks! 31