Part IV - Advanced Techniques
- High-performance crawling
- Recrawling and focused crawling
- Link-based ranking (Pagerank, HITS)
- Structural analysis of the web graph
- Optimizing query execution
- Parallel search engines and scaling
- Meta search engines
- Document clustering and duplicate detection

1 - High-performance crawling: 100 million pages or more
- experience from a research project at Poly
- need a high-performance crawler: >100 pages/sec
- robust operation over several weeks: the crawler will crash, and the system will have to be modified
- controlled operation: other users on campus, remember the 30-second rule, things will happen

Networking performance
- server/DNS/network latency (0.2-1 seconds minimum): must open hundreds of connections simultaneously
- 150 pages/sec ≈ 2 MB/s ≈ 40% of max. T-3 capacity
- DNS can become a bottleneck
- 10-20% additional accesses for robots.txt
- data must be streamed to disk
- OS limits and overheads: networking, files

Crawler Architectures
- Google crawler "BackRub" (see WWW'98 paper): Python downloaders on several machines, up to 100 pages per second
- Internet Archive Crawler
- Mercator/Atrax (DEC/AltaVista) (see Mercator paper): 2 GB machine with RAID, implemented in Java (many performance issues), up to 200 pages/sec, detailed discussion of data structure sizes, optimized DNS caching
- PolyBot (Shkapenyuk/Suel, ICDE 2002)

PolyBot crawler (ICDE 2002, with V. Shkapenyuk)
- distributed implementation in C++
- manager handles several request streams with priorities
- manager handles DNS, exclusion, and frontier
- 300 pages/sec (and more)

Crawling Strategy and Download Rate
- crawling strategy: "What page to download next?"
- download rate: "How many pages per second?"
- different scenarios require different strategies
- lots of recent work on crawling strategy
- little published work on optimizing download rate (main exception: Mercator from DEC/Compaq/HP?)
- somewhat separate issues
- building a slow crawler is (fairly) easy...

System Requirements
- flexibility (different crawling strategies)
- scalability (high performance at low cost)
- robustness (odd server content/behavior, crashes)
- crawling etiquette and speed control (robot exclusion, 30-second intervals, domain-level throttling, speed control for other users)
- manageable and reconfigurable (interface for statistics and control, system setup)

PolyBot System Architecture
- separation of crawling strategy and basic system
- collection of scalable distributed services (DNS, downloading, scheduling, strategy)
- designed for clusters and wide-area distributed operation
- optimized per-node performance
- no random disk accesses (no per-page access)

Basic Architecture, revisited
- application issues requests to the manager
- manager does DNS and robot exclusion
- manager schedules the URL on a downloader
- downloader gets the file and puts it on disk
- application is notified of new files
- application parses new files for hyperlinks
- application sends data to the storage component (indexing is done later)

System components
- downloader: optimized HTTP client written in Python (everything else in C++)
- DNS resolver: uses an asynchronous DNS library
- manager uses Berkeley DB and STL for external and internal data structures
- manager does robot exclusion by generating requests to downloaders and parsing the files
- application does parsing and handling of URLs (has this page already been downloaded?)

Scaling the system
- small system on previous pages: 3-5 workstations and pages/sec peak
- can scale up by adding downloaders and DNS resolvers
- at pages/sec, the application becomes a bottleneck
- at 8 downloaders, the manager becomes a bottleneck
- need to replicate application and manager
- hash-based technique (Internet Archive crawler) partitions URLs and hosts among application parts
- data transfer via the file system (NFS)

Scaling up
- 20 machines, 1500 pages/sec? depends on crawl strategy
- hash URLs to nodes based on site (because of robot exclusion); see the sketch below
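
The site-based hashing above can be sketched in a few lines; the helper name and the choice of hash are illustrative, not PolyBot's actual code:

    import hashlib
    from urllib.parse import urlparse

    def node_for_url(url: str, num_nodes: int) -> int:
        # Hash only the host, so every URL of a given site lands on the same
        # node and robots.txt handling / politeness for that site stays local.
        host = urlparse(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_nodes

    # Both URLs of the same site map to the same crawler node:
    assert node_for_url("http://www.example.com/a.html", 20) == \
           node_for_url("http://www.example.com/b/c.html", 20)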

Crawling Application: Data Structures and Techniques
- parsing using the pcre library
- NFS eventually becomes a bottleneck
- URL-seen problem:
  - need to check if a file has been parsed or downloaded before
  - after 20 million pages, we have "seen" over 100 million URLs
  - each URL is 50 to 75 bytes on average
- options: compress URLs in main memory, or use disk
  - prefix + Huffman coding (DEC, JY01) or Bloom filter (Archive)
  - disk access with caching (Mercator)
  - we use lazy/bulk operations on disk

Implementation of the URL-seen check:
- while fewer than a few million URLs have been seen, keep them in main memory
- then write the URLs to a file in alphabetic, prefix-compressed order
- collect new URLs in memory and periodically perform a bulk check by merging the new URLs into the file on disk
When is a newly parsed URL downloaded? Reordering the request stream:
- want to space out requests to the same subdomain
- needed due to load on small domains and due to security tools
- sort URLs with the hostname reversed (e.g., com.amazon.www), and then "unshuffle" the stream: provable load balance
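
A minimal sketch of the reordering step above; the pile count and the use of in-memory Python lists are illustrative simplifications (the real system works on URL files on disk):

    from urllib.parse import urlparse

    def reversed_host(url: str) -> str:
        host = urlparse(url).netloc.lower()
        return ".".join(reversed(host.split(".")))   # www.amazon.com -> com.amazon.www

    def reorder(urls, piles=1000):
        # Sort so URLs of the same site are adjacent, then "unshuffle": deal the
        # sorted list round-robin into `piles` piles and concatenate them, so URLs
        # that were adjacent end up roughly len(urls)/piles positions apart.
        ordered = sorted(urls, key=reversed_host)
        return [ordered[j] for i in range(piles) for j in range(i, len(ordered), piles)]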

Manager: Challenges and Techniques
- large stream of incoming URL request files
- goal: schedule URLs roughly in the order they arrive, while observing the time-out rule (30 seconds) and maintaining high speed
- must do DNS and robot exclusion "right before" download
- keep requests on disk as long as possible! otherwise, structures grow too large after a few million pages (performance killer)

Manager Data Structures: when to insert new URLs into internal structures?

URL Loading Policy
- read a new request file from disk whenever there are fewer than x hosts in the ready queue
- choose x > speed * timeout (e.g., 100 pages/sec * 30 sec)
- the number of current host data structures is x + speed * timeout + n_down + n_transit, which is usually < 2x
- nice behavior for the BDB caching policy
- performs reordering only when necessary!
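
A toy version of the loading rule, assuming a manager loop that tracks ready hosts; the names (ready_hosts, pending_files, parse_request_file) are hypothetical stand-ins for the manager's real structures:

    SPEED = 100                  # target pages/sec
    TIMEOUT = 30                 # politeness interval per host, in seconds
    X = 2 * SPEED * TIMEOUT      # threshold; any x > SPEED * TIMEOUT satisfies the rule

    def maybe_load(ready_hosts, pending_files, parse_request_file):
        """ready_hosts: dict host -> list of URLs ready to be scheduled.
        pending_files: list of request-file paths waiting on disk.
        parse_request_file: callable yielding (host, url) pairs for one file."""
        while len(ready_hosts) < X and pending_files:
            path = pending_files.pop(0)              # read one more request file
            for host, url in parse_request_file(path):
                ready_hosts.setdefault(host, []).append(url)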

Experimental Results
- crawl of 120 million pages over 19 days
- 161 million HTTP requests
- 16 million robots.txt requests
- 138 million successful non-robots requests
- 17 million HTTP errors (401, 403, 404, etc.)
- 121 million pages retrieved
- slow during the day, fast at night
- many downtimes due to attacks, crashes, revisions
- "slow tail" of requests at the end (4 days)

Experimental Results ctd.
[Figure: bytes in / bytes out / frames out on the Poly T-3 connection over 24 hours of 5/28/01 (courtesy of AppliedTheory)]

Experimental Results ctd.
- sustaining performance:
  - you will find out when data structures hit disk
  - I/O efficiency is vital
- speed control is tricky:
  - vary the number of connections based on feedback
  - also an upper bound on connections
  - complicated interactions in the system
  - not clear what we should want
- other configuration: 140 pages/sec sustained on 2 Ultra10 machines with 60 GB EIDE and 1 GB/768 MB; similar for Linux on Intel

Other Work on Parallel Crawlers
- Atrax: recent distributed extension to Mercator
  - combines several Mercators
  - URL hashing, and off-line URL check (as we do)
- UbiCrawler, Nutch crawler
- P2P crawlers (grub.org and others)
- Cho/Garcia-Molina (WWW 2002)
  - study of the overhead/quality tradeoff in parallel crawlers
  - difference: we scale services separately, and focus on single-node performance
  - in our experience, parallel overhead is low

2 - Optimized Query Processing
How to optimize query throughput in large search engines. The ranking function is often a combination of term-based ranking and a global ordering such as Pagerank.
Approaches:
- index compression: use fast and good codes to decrease index size
- list caching: keep inverted lists in main memory as much as possible
- result caching: if the same query has recently been issued, reuse the result
- query pruning: compute the best k results without scanning entire lists

Query processing in search engines (single-node case)
- inverted index: a data structure for supporting text queries, like the index in a book
- example inverted index (term -> list of document IDs), built by indexing the documents on disk:
  aalborg: 3452, 11437, ...
  arm: 4, 19, 29, 98, 143, ...
  armada: 145, 457, 789, ...
  armadillo: 678, 2134, 3970, ...
  armani: 90, 256, 372, 511, ...
  ...
  zz: 602, 1189, 3209, ...
- Boolean queries, e.g., (zebra AND armadillo) OR alligator, are evaluated as unions/intersections of lists
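
A minimal sketch of evaluating the Boolean query above against in-memory postings lists; real engines keep compressed lists on disk, but the union/intersection logic is the same idea (the document IDs below are made up):

    index = {
        "zebra":     [4, 98, 678, 3970],
        "armadillo": [678, 2134, 3970],
        "alligator": [29, 678, 5000],
    }

    def intersect(a, b):               # AND: IDs present in both sorted lists
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    def union(a, b):                   # OR: merge two lists without duplicates
        return sorted(set(a) | set(b))

    # (zebra AND armadillo) OR alligator
    print(union(intersect(index["zebra"], index["armadillo"]), index["alligator"]))
    # -> [29, 678, 3970, 5000]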

Ranking in search engines
- scoring function: assigns a score to each document with respect to a given query
- top-k queries: return the k documents with the highest scores
- example: the cosine measure for a query with terms t_0 to t_{m-1}
- can be implemented by computing the score for all documents that contain any of the query words (union of inverted lists)
- in the case of search engines: often intersection instead of union
- in large collections, the lists are many MB long for average queries
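
The cosine formula itself did not survive extraction; a standard form for a query q with terms t_0, ..., t_{m-1} and a document d, written in LaTeX, is:

    \cos(q, d) = \frac{\sum_{i=0}^{m-1} w_{q,t_i}\, w_{d,t_i}}{\lVert q \rVert\, \lVert d \rVert},
    \qquad \lVert d \rVert = \sqrt{\textstyle\sum_{t} w_{d,t}^2}

where the w's are term weights (e.g., tf-idf); only documents containing at least one query term can receive a nonzero score, which is why the inverted lists above come into play.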

Combining Term-Based Methods and Pagerank
- recall the cosine measure
- Pagerank assigns a fixed score to each page (independent of the query)
- naïve way of integrating the Pagerank value: add it to the term-based score
- works for any global ordering of page scores (e.g., based on traffic)
- but some more details remain

Using Pagerank in ranking
- how to combine/add the Pagerank score and the cosine score? (addition)
- use PR or log(PR)?
- normalize using the mean of the top 100 in the list (Richardson/Domingos)
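
A hedged sketch of that combination step; the log transform, the top-100 normalization, and the mixing weight alpha are illustrative choices, not the exact scheme from the slides:

    import math

    def combined_score(cosine, pagerank, mean_top100_pr, alpha=1.0, use_log=True):
        """cosine: term-based score for (query, doc); pagerank: query-independent
        score of the doc; mean_top100_pr: mean Pagerank of the top-100 docs in the
        list, used to bring the two parts onto comparable scales; alpha: assumed
        mixing weight."""
        pr = pagerank / mean_top100_pr
        if use_log:
            pr = math.log(1.0 + pr)          # dampen the heavy Pagerank tail
        return cosine + alpha * pr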

Query Processing in Parallel Search Engines
[Figure: LAN cluster with local index organization; each node holds a subset of the pages plus its index, and a query integrator broadcasts each query and combines the results]
- local index organization: every node stores and indexes a subset of the pages
- every query is broadcast to all nodes by the query integrator (QI)
- every node supplies its top-10, and the QI computes the global top-10
- note: we don't really need the top-10 from all nodes, maybe only the top-2
- low-cost cluster architecture (usually with additional replication)
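
The query integrator's merge step is simple; a sketch assuming each node returns its local top-k as (score, doc_id) pairs:

    import heapq

    def global_top_k(per_node_results, k=10):
        """per_node_results: one list of (score, doc_id) pairs per node, each
        being that node's local top-k; the integrator keeps the k highest
        scores across all nodes."""
        all_hits = (hit for node in per_node_results for hit in node)
        return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

    node_a = [(0.92, "doc17"), (0.85, "doc4"), (0.61, "doc99")]
    node_b = [(0.88, "doc8"), (0.72, "doc23"), (0.70, "doc5")]
    print(global_top_k([node_a, node_b], k=3))
    # -> [(0.92, 'doc17'), (0.88, 'doc8'), (0.85, 'doc4')]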

Related Work on top-k Queries
- IR: optimized evaluation of cosine measures (since the 1980s)
- DB: top-k queries for multimedia databases (Fagin 1996)
- does not consider combinations of term-based and global scores
- Brin/Page 1998: fancy lists in Google

Related Work (IR)
- basic idea: "presort entries in each inverted list by contribution to cosine"
- also process inverted lists from the shortest to the longest list
- various schemes, either reliable or probabilistic
- most closely related:
  - Persin/Zobel/Sacks-Davis 1993/96
  - Anh/Moffat 1998, Anh/de Kretser/Moffat 2001
- typical assumptions: many keywords per query, OR semantics

Related Work (DB) (Fagin 1996 and others)
- motivation: searching multimedia objects by several criteria
- typical assumptions: few attributes, OR semantics, random access
- FA (Fagin's algorithm), TA (Threshold algorithm), others
- formal bounds for k lists if the lists are independent
- term-based ranking: presort each list by contribution to cosine

Related Work (Google) (Brin/Page 1998)
- "fancy lists" optimization in Google
- create an extra, shorter inverted list for "fancy matches" (matches that occur in the URL, anchor text, title, bold face, etc.)
- note: fancy matches can be modeled by higher weights in the term-based vector space model
- no details given or numbers published
[Figure: inverted lists for "chair" and "table", each split into a fancy list and the rest of the list with the other matches]
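
One plausible way to use such two-tier lists (a guess for illustration only, since no details were published): score the short fancy lists first and fall back to the full lists only when they do not produce enough candidates.

    def fancy_first_topk(fancy_lists, full_lists, score, k=10):
        """fancy_lists / full_lists: dict term -> list of doc IDs, where a fancy
        list holds matches in the title, URL, anchor text, etc.; score(doc, terms)
        is any term-based scoring function."""
        terms = set(fancy_lists) | set(full_lists)
        candidates = set()
        for term in terms:                        # tier 1: short fancy lists
            candidates.update(fancy_lists.get(term, []))
        if len(candidates) < k:                   # tier 2: the remaining matches
            for term in terms:
                candidates.update(full_lists.get(term, []))
        return sorted(candidates, key=lambda d: score(d, terms), reverse=True)[:k]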

Throughput and Latency for 16 Nodes
- top-10 queries on 16 machines with 120 million pages
- up to 10 queries/sec with reliable pruning
- up to 20 queries/sec with the first-30 scheme
- note: reliable pruning is not implemented in a purely incremental manner

3 - Link-based ranking techniques
- Pagerank (Brin & Page / Google): "the significance of a page depends on the significance of those referencing it"
- HITS (Kleinberg / IBM): "Hubs and Authorities"

Pagerank
- initialize the rank value of each node to 1/n (0.2 for 5 nodes)
- a node with k outgoing links transmits a 1/k fraction of its current rank value over each outgoing edge to its neighbors
- iterate this process many times until it converges
- NOTE: this is a random walk on the link graph; Pagerank is the stationary distribution of this random walk
[Figure: 5-node example graph with nodes initialized to 0.2 and edges labeled with the transmitted fractions (1, 1/2, ...)]
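
A minimal sketch of this iteration; the graph representation and iteration count are illustrative, and the random jump and the handling of nodes without out-links come on the next slides:

    def pagerank(graph, iterations=50):
        """graph: dict node -> list of out-neighbors (every node is assumed to
        have at least one out-link in this simplified version)."""
        nodes = list(graph)
        rank = {u: 1.0 / len(nodes) for u in nodes}        # start at 1/n
        for _ in range(iterations):
            new_rank = {u: 0.0 for u in nodes}
            for u in nodes:
                share = rank[u] / len(graph[u])            # 1/k of current rank per out-link
                for v in graph[u]:
                    new_rank[v] += share
            rank = new_rank
        return rank

    g = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}          # small strongly connected example
    print(pagerank(g))                                     # roughly A: 0.4, B: 0.2, C: 0.4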

HITS attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. It is based on two mutually recursive facts:
- Hubs point to lots of authorities.
- Authorities are pointed to by lots of hubs.
[Figure: hubs on one side pointing to authorities on the other]
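
The mutual recursion is usually written as the following update equations (standard HITS formulation, in LaTeX; not reproduced from the slides), with both score vectors renormalized after each round:

    a(p) = \sum_{q \to p} h(q), \qquad h(p) = \sum_{p \to q} a(q)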

Convergence
[Figure: example graphs illustrating a leak and a sink]
- dealing with leaks: pruning, or adding back links
- dealing with sinks: add a random jump to escape the sink; with probability b = 0.15, jump to a random node
- assume: if in a leak, always take the random jump
- in this case, the iteration always converges to a unique solution!
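
With the random jump included, each iteration takes the familiar damped form (standard formulation, consistent with the jump probability b = 0.15 above):

    r(v) = \frac{b}{n} + (1 - b) \sum_{u \to v} \frac{r(u)}{\mathrm{outdeg}(u)}, \qquad b = 0.15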

Matrix notation
- stationary distribution: a vector x with xA = x
- A is primitive, and x is an eigenvector of A
- computed using Jacobi or Gauss-Seidel iteration
[Figure: the example graph and its transition matrix A, with entries 0, 1/2, and 1 (exact values not recoverable)]
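
As a concrete (made-up) illustration, for the 3-node graph A->B, A->C, B->C, C->A used in the earlier Pagerank sketch, the row-stochastic matrix and its stationary vector are:

    A = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix},
    \qquad x = \left( \tfrac{2}{5}, \tfrac{1}{5}, \tfrac{2}{5} \right), \qquad xA = x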

(4) Mathematical Techniques: other iterative techniques for link analysis
"HITS" (Jon Kleinberg)
- query-dependent: first get 100 docs using term-based techniques
- build a subgraph on these nodes and their neighbors
- run the iterative process on this subgraph
- each node gets a hub score and an authority score
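
A compact sketch of the HITS iteration on such a subgraph; the normalization and iteration count are conventional choices, not taken from the slides:

    import math

    def hits(graph, iterations=50):
        """graph: dict node -> list of out-neighbors (the query-specific subgraph).
        Returns (hub, authority) score dicts."""
        nodes = set(graph) | {v for outs in graph.values() for v in outs}
        hub = {u: 1.0 for u in nodes}
        auth = {u: 1.0 for u in nodes}
        for _ in range(iterations):
            # authority update: sum of hub scores of the pages pointing to a node
            auth = {v: sum(hub[u] for u in graph if v in graph[u]) for v in nodes}
            norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
            auth = {v: a / norm for v, a in auth.items()}
            # hub update: sum of authority scores of the pages a node points to
            hub = {u: sum(auth[v] for v in graph.get(u, [])) for u in nodes}
            norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
            hub = {u: h / norm for u, h in hub.items()}
        return hub, auth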

(4) Mathematical Techniques: "HITS" (Jon Kleinberg)
- Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. In-degree (the number of pointers to a page) is one simple measure of authority; however, in-degree treats all links as equal.
- Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities).

(4) Mathematical Techniques
- For a query Q, let the set of documents returned by a standard search engine be called the root set R.
- Initialize the base set S to R.
- Add to S all pages pointed to by any page in R.
- Add to S all pages that point to any page in R.
[Figure: root set R nested inside the base set S]
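
A hedged sketch of this expansion; search, out_links, and in_links stand in for a search engine API and a link database, which the slides assume but do not specify:

    def build_base_set(query, search, out_links, in_links, root_size=100):
        """search(query, n) -> top-n result pages (the root set R);
        out_links(p) -> pages p points to; in_links(p) -> pages pointing to p."""
        R = set(search(query, root_size))
        S = set(R)                        # initialize base set S to R
        for p in R:
            S.update(out_links(p))        # pages pointed to by pages in R
            S.update(in_links(p))         # pages that point to pages in R
        return S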

(4) Mathematical Techniques
- To limit computational expense: limit the number of root pages to the top 100 pages retrieved for the query.
- To eliminate purely navigational links: eliminate links between two pages on the same host.
- To eliminate "non-authority-conveying" links: allow only m (m ≈ 4-8) pages from a given host as pointers to any individual page.

(4) Mathematical Techniques: Pagerank vs. HITS
- Pagerank does not attempt to capture the distinction between hubs and authorities; it ranks just by authority
- Pagerank is applied to the entire web rather than a local neighborhood of result pages
- HITS is sensitive to local topology
- HITS is noisy! (In the example graph, not shown here, node 4's authority score vanishes to zero compared to those of nodes 2 and 5.)

(4) Mathematical Techniques: other iterative techniques for link analysis
- many other techniques, based on Markov chains, SVD, principal component analysis
- the web graph is the result of billions of decisions by millions of independent web page creators (or not!)
- data mining of web graphs and web usage patterns
- the techniques are not really new:
  - analysis of social networks (Katz 1953)
  - citation analysis (Garfield)
- security-related applications: call pattern analysis