FOCUSED CRAWLING

Context
● World Wide Web growth.
● Inktomi crawler:
   Hundreds of Sun Sparc workstations;
   75 GB RAM, 1 TB disk;
   Over 10M pages crawled.
● Still only 30-40% of the Web crawled.
● Long refresh cycles (weeks up to a month).
● Low-precision results for crafty (highly specific) queries.
● Burden of indexing millions of pages.
● Inefficient location of relevant topic-specific resources when using keyword queries.

Why Focused?
● Better to cover a single galaxy than the whole universe.
● Work is done on a relatively narrow segment of the Web.
● Respectable coverage at a rapid rate (thanks to the narrowness of the segment of interest).
● Small investment in hardware.
● Low network resource usage.

Core Elements
● Focused crawler = example-driven automatic porthole generator.
● Guided by a classifier and a distiller:
   The former recognizes relevance using examples embedded in a topic taxonomy.
   The latter identifies topical vantage points (hubs) on the Web.
● Based on a canonical topic taxonomy with examples.

Operation Synopsis
1. Taxonomy creation.
2. Example collection.
3. Taxonomy selection and refinement.
4. Interactive exploration.
5. Training.
6. Resource discovery.
7. Distillation.
8. Feedback.

Taxonomy Creation
● Pre-train the classifier with:
   A canonical taxonomy,
   Corresponding examples.

Example Collection
● Collect URLs of interest (e.g. while browsing).
● Import the collected URLs.

Taxonomy Selection and Refinement
● Propose the most common classes where the examples fit best.
● Mark classes as GOOD.
● Refine the taxonomy, i.e.:
   Refine categories, and/or
   Move documents from one category to another.
● Integration time required by major changes:
   A few hours for 260,000 Yahoo! documents.
● Smaller changes (moving documents) are interactive.

Interactive Exploration
● Propose URLs found in a small neighbourhood of the examples.
● Examine and include some of these examples.

Training
● Integrate the refinements into the statistical class model (a classifier-specific action).

Distillation
● Identify relevant hubs by running (intermittently and/or concurrently) a topic distillation algorithm.
● Raise the visit priorities of the hubs and their immediate neighbours.

Feedback
● Report most popular sites and resources.
● Mark results as useful/useless.
● Send feedback to classifier and distiller.

Snapshot

Some definitions...
● G = directed hypertext graph.
● C = tree-shaped hierarchical topic directory.
● D(c) = example documents associated with topic node c ∈ C.
● C* = subset of topics marked good, representing the user's interest.
✔ Remarks:
 1. A good topic is not an ancestor of another good topic.
 2. For a web page p, R_C*(p) = relevance of p with respect to C*, which must be furnished to the system.
 3. R_root(p) = 1; R_c0(p) = ∑_i R_ci(p), where {c_i} are the children of c_0.
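To make the recurrence concrete, here is a minimal Python sketch (class and topic names are illustrative, not from the paper) that rolls leaf-level topic probabilities up a taxonomy tree, so that a parent's relevance is the sum of its children's and R_root(p) = 1 whenever the leaf probabilities sum to 1.

    # Sketch: roll leaf-level topic probabilities up a taxonomy tree so that
    # R_c0(p) = sum of R_ci(p) over the children c_i, and R_root(p) = 1.
    # Class/field names here are illustrative, not prescribed by the paper.

    class TopicNode:
        def __init__(self, name, children=None, good=False):
            self.name = name
            self.children = children or []   # empty list => leaf topic
            self.good = good                 # marked GOOD by the user

    def rollup_relevance(node, leaf_probs):
        """leaf_probs maps leaf-topic name -> P(leaf | page); returns {name: R(page)}."""
        scores = {}
        def visit(n):
            if not n.children:
                r = leaf_probs.get(n.name, 0.0)
            else:
                r = sum(visit(c) for c in n.children)
            scores[n.name] = r
            return r
        visit(node)
        return scores

    # Example: a tiny taxonomy; leaf probabilities sum to 1, so R(root) == 1.
    root = TopicNode("root", [
        TopicNode("recreation", [TopicNode("cycling", good=True), TopicNode("gardening")]),
        TopicNode("business",   [TopicNode("mutual_funds")]),
    ])
    R = rollup_relevance(root, {"cycling": 0.6, "gardening": 0.3, "mutual_funds": 0.1})
    relevance_to_Cstar = sum(R[c] for c in ("cycling",))  # sum over GOOD topics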

Crawler in Terms of Graphs
● Start by visiting all pages ∈ D(C*).
● Inspect V = set of visited pages.
● Choose unvisited pages from the crawl frontier.
● GOAL: visit as many relevant and as few irrelevant pages as possible, i.e.:
   Find V ⊇ D(C*), V reachable from D(C*), such that ∑_{v ∈ V} R(v) / |V| is maximized.
   The goal is attainable due to citations: relevant pages tend to link to other relevant pages.
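The exact optimization cannot be solved online, so in practice the crawler approximates it greedily. A minimal best-first sketch, assuming hypothetical fetch(), extract_links() and relevance() helpers, that keeps a frontier ordered by estimated relevance and reports the average relevance of the visited set:

    import heapq

    # Sketch of a best-first ("greedy") approximation to the stated goal: keep a
    # frontier ordered by estimated relevance and always expand the most promising
    # URL next. fetch(), extract_links(), and relevance() are assumed helpers.

    def focused_crawl(seed_urls, relevance, fetch, extract_links, budget=10_000):
        frontier = [(-1.0, url) for url in seed_urls]   # seeds get maximal priority
        heapq.heapify(frontier)
        visited, total_relevance = set(), 0.0
        while frontier and len(visited) < budget:
            neg_priority, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            page = fetch(url)
            r = relevance(page)                          # R(v) from the classifier
            total_relevance += r
            for link in extract_links(page):
                if link not in visited:
                    heapq.heappush(frontier, (-r, link)) # soft-focus style priority
        return visited, total_relevance / max(len(visited), 1)  # average relevance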

Classification
● Definitions:
   good(c) = c is marked as good.
   For a document d:
    ● P(root|d) = 1;
    ● P(c|d) = P(parent(c)|d) · P(c|d, parent(c));
    ● P(c|d, parent(c)) = P(c|parent(c)) · P(d|c) / ∑_i P(c_i|parent(c)) · P(d|c_i), where the c_i are the children of parent(c) (c and its siblings);
    ● P(d|c) depends on the document generation model;
    ● P(c|parent(c)) = prior distribution of documents over the children of parent(c).
● Steps of the generation model:
   Pick a leaf node c* according to the prior probabilities.
   Class c* has a die with as many faces as there are unique tokens t ∈ U.
   Face t turns up with probability θ(c*, t).
   The length n(d) is chosen arbitrarily by the generator.
   Flip the die n(d) times and write the token corresponding to each face.
   If token t occurs n(d, t) times => P(d|c*) ∝ ∏_t θ(c*, t)^n(d,t).
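A small sketch of this hierarchical posterior under the multinomial ("die") model; the taxonomy dictionary, the per-class token distributions theta (assumed available for internal classes as well) and the class priors are assumed inputs, not anything prescribed by the paper:

    import math
    from collections import Counter

    # Sketch of the hierarchical naive-Bayes posterior described above, under the
    # multinomial model: log P(d|c) = sum_t n(d,t) * log theta[c][t] (up to a term
    # independent of c). theta, prior, and the taxonomy dict are assumed inputs.

    def log_p_doc_given_class(tokens, theta_c, eps=1e-9):
        counts = Counter(tokens)
        return sum(n * math.log(theta_c.get(t, eps)) for t, n in counts.items())

    def posterior_over_children(tokens, children, theta, prior):
        """P(c|d, parent) for each child c, via Bayes' rule with the class priors."""
        log_scores = {c: math.log(prior[c]) + log_p_doc_given_class(tokens, theta[c])
                      for c in children}
        m = max(log_scores.values())
        weights = {c: math.exp(s - m) for c, s in log_scores.items()}  # avoid underflow
        z = sum(weights.values())
        return {c: w / z for c, w in weights.items()}

    def posterior_over_leaves(tokens, tree, theta, prior, p_parent=1.0):
        """Recurse down the taxonomy: P(c|d) = P(parent|d) * P(c|d, parent)."""
        leaves = {}
        cond = posterior_over_children(tokens, list(tree.keys()), theta, prior)
        for c, subtree in tree.items():
            p_c = p_parent * cond[c]
            if subtree:                      # internal node: keep descending
                leaves.update(posterior_over_leaves(tokens, subtree, theta, prior, p_c))
            else:                            # leaf topic
                leaves[c] = p_c
        return leaves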

Remarks on Classification
● Documents are seen as bags of words, without order information or inter-term correlation.
● During crawling the task is the reverse of generation: infer the class from the document.
● Two types of focus are possible with the classifier:
   Hard focus:
    ● Find the leaf c* with the highest probability;
    ● If ∃ an ancestor of c* s.t. good(ancestor) => allow future visits of the links ∈ d;
    ● Else prune the crawl at d.
   Soft focus:
    ● Page relevance R(d) = ∑_{good(c)} P(c|d);
    ● Assign priority of each neighbour of d = R(d);
    ● If multiple paths lead to a page => take the maximum of the relevances;
    ● When a neighbour is visited => update its score.
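A minimal sketch of the two focus modes, assuming the classifier returns leaf posteriors as a dictionary and that the GOOD set and an ancestors() helper come from the taxonomy (all names illustrative):

    # Sketch of the two focus modes above. leaf_posteriors is the classifier output
    # {topic: P(topic|d)}; ancestors() and the GOOD set are assumed to come from the
    # taxonomy; names are illustrative.

    def hard_focus_allows_expansion(leaf_posteriors, good, ancestors):
        c_star = max(leaf_posteriors, key=leaf_posteriors.get)   # best leaf
        # Expand out-links only if c* or one of its ancestors is marked GOOD.
        return c_star in good or any(a in good for a in ancestors(c_star))

    def soft_focus_priority(leaf_posteriors, good):
        # R(d) = sum of P(c|d) over GOOD topics; used as the out-links' priority.
        return sum(p for c, p in leaf_posteriors.items() if c in good)

    def update_frontier_priority(frontier_priority, url, r_from_new_parent):
        # A page reachable along several paths keeps the maximum relevance seen so far.
        frontier_priority[url] = max(frontier_priority.get(url, 0.0), r_from_new_parent)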

Distillation
● Goal: identify hubs.
● Base idea (hub/authority scoring, as in HITS):
   Each node v ∈ Web has two scores a(v), h(v), updated as:
    ● h(u) = ∑_{(u,v) ∈ E} a(v)   (1)
    ● a(v) = ∑_{(u,v) ∈ E} h(u)   (2)
    ● E = edge set (adjacency matrix).
● Enhancements:
   Non-unit edge weights;
   Forward and backward weight matrices E_F and E_B:
    ● E_F[u,v] = R(v) prevents leakage of prestige from relevant hubs to irrelevant authorities;
    ● E_B[u,v] = R(u) prevents a relevant authority from reflecting prestige onto irrelevant hubs;
   ρ = threshold for including relevant authorities in the graph.
● Steps:
   Construct the edge set E, only for pages on different sites, with forward and backward edge weights.
   Apply (1) and (2) iteratively, always restricting authorities using ρ.
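A sketch of the weighted hub/authority iteration, with the relevance-based edge weights E_F and E_B folded into the updates; the ρ filter, iteration count, and normalization are illustrative choices, not the paper's exact procedure:

    import math
    from collections import defaultdict

    # Sketch of the weighted hub/authority iteration. edges is a list of (u, v)
    # links between pages on different sites; relevance maps page -> R(page).
    # rho and iterations are illustrative parameters.

    def weighted_hits(edges, relevance, rho=0.1, iterations=50):
        # Keep only edges whose target is a sufficiently relevant authority.
        edges = [(u, v) for (u, v) in edges if relevance.get(v, 0.0) >= rho]
        nodes = {n for e in edges for n in e}
        hub = {n: 1.0 for n in nodes}
        auth = {n: 1.0 for n in nodes}
        for _ in range(iterations):
            # a(v) = sum over in-links of h(u), weighted by E_B[u,v] = R(u)
            new_auth = defaultdict(float)
            for u, v in edges:
                new_auth[v] += relevance.get(u, 0.0) * hub[u]
            # h(u) = sum over out-links of a(v), weighted by E_F[u,v] = R(v)
            new_hub = defaultdict(float)
            for u, v in edges:
                new_hub[u] += relevance.get(v, 0.0) * new_auth[v]
            # Normalize to keep the scores bounded.
            za = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
            zh = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
            auth = {n: new_auth[n] / za for n in nodes}
            hub = {n: new_hub[n] / zh for n in nodes}
        return hub, auth   # top hubs get their crawl priority raised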

Integration with the Crawler
● One watchdog thread:
   Inspects new work from the crawl frontier (stored on disk);
   Passes new work to the worker threads (using shared memory buffers).
● Many worker threads:
   Save details of newly explored pages in per-worker disk structures;
   Invoke the classifier for each new page.
● Stop the workers, collect and integrate the results into a central pool (priority queue).
   Soft crawling -> URLs ordered by:
    ● (# page-fetches ascending, R descending).
   Hard crawling -> surviving URLs ordered by:
    ● # page-fetches ascending.
● Populate the link graph.
● Periodically stop the crawler and execute the distiller => revisit the obtained hubs + visit unvisited pages pointed to by the hubs.
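The two orderings can be expressed as tuple keys in a priority queue. A small sketch, assuming each frontier entry records its page-fetch count and (for soft crawling) the relevance R inherited from the citing page; the field names are hypothetical:

    import heapq

    # Sketch of the central-pool ordering described above. Each frontier entry is
    # assumed to carry the number of times its page has already been fetched and,
    # for soft crawling, the relevance score R inherited from the citing page.

    def soft_key(entry):
        # Fewer fetches first; among equals, higher relevance first.
        return (entry["num_fetches"], -entry["relevance"])

    def hard_key(entry):
        # Hard focus already pruned irrelevant links, so order by fetch count only.
        return (entry["num_fetches"],)

    def rebuild_pool(entries, soft=True):
        key = soft_key if soft else hard_key
        pool = [(key(e), e["url"]) for e in entries]
        heapq.heapify(pool)          # smallest key = next URL to crawl
        return pool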

Integration

Evaluation
● Performance parameters:
   Precision (relevance);
   Quality of resource discovery.
● Synopsis:
   Experimental setup;
   Harvesting rate of relevant pages;
   Acquisition robustness;
   Resource discovery robustness;
   Good resources remoteness;
   Effect of distillation on crawling.

Experimental Setup
● Crawler = C++ application.
● Operating through a firewall.
● Crawler run with relatively few threads.
● Up to 12 example web pages used per category.
● About 6,000 URLs returned per hour.
● 20 topics (gardening, mutual funds, cycling, etc.).

Harvesting Rate of Relevant Pages
● Goal: a high relevant-page acquisition rate.
● A low harvest rate means time is spent merely eliminating irrelevant pages => better to use an ordinary crawl instead.
● 3 crawls done:
  ✔ Same sample set containing a few dozen relevant URLs.
   Unfocused:
    ● All out-links registered for exploration;
    ● No use of R, except for measurement => only a slight slowdown.
   Soft:
    ● Probably more robust than hard crawling, BUT needs more care against unwanted topic diffusion.
    ● Problem: distinguishing between a noisy and a systematic drop in relevance.
   Hard.
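Harvest rate can be tracked as the fraction of recently fetched pages that are relevant. A minimal sketch, with an assumed relevance threshold and window size used only for illustration:

    from collections import deque

    # Sketch: harvest rate = fraction of fetched pages that are relevant, tracked
    # over a moving window so the three crawl variants can be compared over time.
    # The relevance threshold and window size are illustrative choices.

    class HarvestRateMonitor:
        def __init__(self, window=1000, relevance_threshold=0.5):
            self.window = deque(maxlen=window)
            self.threshold = relevance_threshold

        def record(self, page_relevance):
            self.window.append(1 if page_relevance >= self.threshold else 0)

        def harvest_rate(self):
            return sum(self.window) / len(self.window) if self.window else 0.0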

Harvesting Rate Example

Acquisition Robustness
● Goal: maintain a proper acquisition rate without being too sensitive to the start set.
● Tests:
   2 disjoint sets, each containing 30% of the starting URLs chosen at random.
   For each subset, launch a separate focused crawler.
  ✔ Robustness assessed by measuring the overlap of the crawled URLs.
  ✔ Generous visits to new IP addresses and also a steady increase in overlapping IP addresses.
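Overlap between the two crawls can be measured at both the URL and the server (host) level. A small sketch; the normalization by the smaller set is an illustrative choice, not necessarily the one used in the paper:

    from urllib.parse import urlparse

    # Sketch: overlap measures between two crawls started from disjoint seed sets,
    # reported both at the URL level and at the server (host) level.

    def url_overlap(crawl_a, crawl_b):
        a, b = set(crawl_a), set(crawl_b)
        return len(a & b) / min(len(a), len(b)) if a and b else 0.0

    def server_overlap(crawl_a, crawl_b):
        hosts_a = {urlparse(u).netloc for u in crawl_a}
        hosts_b = {urlparse(u).netloc for u in crawl_b}
        return len(hosts_a & hosts_b) / min(len(hosts_a), len(hosts_b)) if hosts_a and hosts_b else 0.0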

URL Overlap

Server Overlap

Resource Discovery Robustness
● 2 sets of crawlers launched from different random samples.
● The popularity/quality (distillation) algorithm run for 50 iterations.
● Server overlap measured.
● Result: the most popular sites are identified by both sets of crawlers, although different sample sets were used.

Good Resources Remoteness
● Is any real exploration done?
● Non-trivial work is done by the focused crawler, i.e. pursuing certain paths while pruning others.
● A large number of servers found 10 links away and beyond from the starting set.
● Millions of pages lie within a distance of 10 links.
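Link distance from the start set can be measured with a breadth-first search over the crawled link graph. A minimal sketch, assuming the graph is available as a dictionary from URL to crawled out-links:

    from collections import deque
    from urllib.parse import urlparse

    # Sketch: minimum link distance of every crawled page (and its server) from the
    # seed set, via breadth-first search over the crawled link graph. link_graph is
    # an assumed dict: url -> list of out-link urls restricted to crawled pages.

    def link_distances(seed_urls, link_graph):
        dist = {u: 0 for u in seed_urls}
        queue = deque(seed_urls)
        while queue:
            u = queue.popleft()
            for v in link_graph.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    def servers_by_distance(dist):
        per_server = {}
        for url, d in dist.items():
            host = urlparse(url).netloc
            per_server[host] = min(d, per_server.get(host, d))
        return per_server   # e.g. count how many servers are 10 or more links away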

Remoteness Example

Effect of Distillation on Crawling
● A relevant page may be abandoned due to misclassification (e.g. the page consists mostly of images, or the classifier makes a mistake).
● The distiller reveals top hubs => new unvisited URLs to pursue.

Conclusion
● Strengths:
   Steady collection of relevant resources;
   Robustness to different starting conditions;
   Localization of good resources;
   Immunity to noise;
   Learning specialization from examples;
   Filtering done at the data-acquisition level rather than as post-processing;
   Crawling done to greater depths thanks to the prioritized crawl frontier.
● Still to go:
   At what specificity can a focused crawl be sustained? i.e. how do harvest rates depend on the topic?
   Sociology of citations between topics => insights into how the Web evolves.