CS276B Text Retrieval and Mining Winter 2005 Lecture 7.

Plan for today Review search engine history (slightly more technically than in the first lecture) Web crawling/corpus construction Distributed crawling Connectivity servers

Evolution of search engines First generation – uses only "on-page" text data: word frequency, language, Boolean matching (AV, Excite, Lycos, etc.) Second generation – uses off-page, web-specific data: link (or connectivity) analysis, click-through data, anchor text (how people refer to this page); made popular by Google, but now used by everyone Third generation – answers "the need behind the query": semantic analysis (what is this about?), focus on the user's need rather than on the query, context determination, helping the user (UI, spell checking, query refinement, query suggestion, syntax-driven feedback, context help, context transfer, etc.), integration of search and text analysis; still evolving

Connectivity analysis Idea: mine hyperlink information of the Web Assumptions: Links often connect related pages A link between pages is a recommendation “people vote with their links” Classic IR work (citations = links) a.k.a. “Bibliometrics” [Kess63, Garf72, Smal73, …]. See also [Lars96]. Much Web related research builds on this idea [Piro96, Aroc97, Sper97, Carr97, Klei98, Brin98,…]

Third generation search engine: answering "the need behind the query" Semantic analysis: query language determination, auto filtering, different ranking (if the query is in Japanese, do not return English pages) Hard & soft matches: personalities (triggered on names), cities (travel info, maps), medical info (triggered on names and/or results), stock quotes, news (triggered on stock symbols), company info, …

Answering “the need behind the query” Context determination spatial (user location/target location) query stream (previous queries) personal (user profile) explicit (vertical search, family friendly) Context use Result restriction Ranking modulation

Spatial context – geo-search Geo-coding Geometrical hierarchy (squares) Natural hierarchy (country, state, county, city, zip-codes) Geo-parsing Pages (infer from phone nos, zip, etc) Queries (use dictionary of place names) Users Explicit (tell me your location) From IP data Mobile phones In its infancy, many issues (display size, privacy, etc)

Helping the user UI Spell checking Query completion …

Crawling

Crawling Issues How to crawl? Quality: “Best” pages first Efficiency: Avoid duplication (or near duplication) Etiquette: Robots.txt, Server load concerns How much to crawl? How much to index? Coverage: How big is the Web? How much do we cover? Relative Coverage: How much do competitors have? How often to crawl? Freshness: How much has changed? How much has really changed? (why is this a different question?)

Basic crawler operation Begin with known “seed” pages Fetch and parse them Extract URLs they point to Place the extracted URLs on a queue Fetch each URL on the queue and repeat
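A minimal single-machine sketch of this loop in Python (the regex-based link extraction, the timeout, and the page limit are simplifying assumptions; production crawlers are distributed and use real HTML parsers):

```python
import re
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=100):
    """Toy single-machine crawler: fetch, parse, extract URLs, enqueue them."""
    frontier = deque(seed_urls)      # queue of URLs still to be fetched
    seen = set(seed_urls)            # every URL ever queued, to avoid duplicates
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                 # unreachable or non-decodable page: skip it
        fetched += 1
        # Crude link extraction; a real crawler would use an HTML parser
        # and resolve relative URLs against the page's base URL.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)

# Hypothetical usage:
# crawl(["http://example.com/"])
```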

Simple picture – complications Web crawling isn’t feasible with one machine All of the above steps must be distributed Even non-malicious pages pose challenges Latency/bandwidth to remote servers vary Robots.txt stipulations How “deep” should you crawl a site’s URL hierarchy? Site mirrors and duplicate pages Malicious pages Spam pages (Lecture 1, plus others to be discussed) Spider traps – incl. dynamically generated ones Politeness – don’t hit a server too often

Robots.txt Protocol for giving spiders (“robots”) limited access to a website A website announces its requests about what can(not) be crawled: for a site, create a file /robots.txt at the root of the server This file specifies access restrictions

Robots.txt example No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
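A small sketch of honoring this policy with Python's standard urllib.robotparser, parsing the example rules directly rather than fetching them from a site:

```python
from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /yoursite/temp/",
    "",
    "User-agent: searchengine",
    "Disallow:",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)   # parse the robots.txt lines above

# The generic crawler is blocked from /yoursite/temp/, but "searchengine" is not.
print(rp.can_fetch("*", "/yoursite/temp/page.html"))             # False
print(rp.can_fetch("searchengine", "/yoursite/temp/page.html"))  # True
```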

Crawling and Corpus Construction Crawl order Distributed crawling Filtering duplicates Mirror detection

Where do we spider next? [Diagram: the Web, split into URLs already crawled and parsed vs. URLs waiting in the queue]

Crawl Order Want best pages first Potential quality measures: Final In-degree Final Pagerank What’s this?

Crawl Order Want best pages first Potential quality measures: Final In-degree Final Pagerank (a measure of page quality we’ll define later in the course) Crawl heuristics: Breadth First Search (BFS) Partial Indegree Partial Pagerank Random walk
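A sketch of one of these heuristics, a frontier ordered by partial in-degree (the max-heap with lazy updates and the tie-breaking counter are illustrative implementation choices, not from the slides):

```python
import heapq
import itertools
from collections import defaultdict

class PartialIndegreeFrontier:
    """Frontier that pops the URL with the highest in-degree seen so far."""
    def __init__(self):
        self._inlinks = defaultdict(int)   # URL -> inlinks discovered so far
        self._heap = []                    # entries of (-indegree, tiebreak, url)
        self._counter = itertools.count()  # FIFO tie-break among equal scores

    def add_link(self, target_url):
        """Record one newly discovered link pointing at target_url."""
        self._inlinks[target_url] += 1
        # Lazy update: push a fresh entry; stale entries are skipped on pop.
        heapq.heappush(self._heap,
                       (-self._inlinks[target_url], next(self._counter), target_url))

    def pop(self, crawled):
        """Return the best not-yet-crawled URL, or None if the frontier is empty."""
        while self._heap:
            neg_deg, _, url = heapq.heappop(self._heap)
            if url in crawled or -neg_deg != self._inlinks[url]:
                continue                   # already fetched, or a stale heap entry
            return url
        return None
```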

BFS & Spam (Worst case scenario) BFS depth = 2 Normal avg outdegree = URLs on the queue including a spam page. Assume the spammer is able to generate dynamic pages with 1000 outlinks Start Page Start Page BFS depth = URLs on the queue 50% belong to the spammer BFS depth = million URLs on the queue 99% belong to the spammer

[Plot: overlap with the best x% of pages by in-degree, as a function of the x% crawled under ordering metric O(u); Stanford WebBase crawl, 179K pages, 1998 [Cho98]]

Web Wide Crawl (328M pages, 2000) [Najo01] BFS crawling brings in high quality pages early in the crawl

Queue of URLs to be fetched What constraints dictate which queued URL is fetched next? Politeness – don’t hit a server too often, even from different threads of your spider How far into a site you’ve crawled already Most sites, stay at ≤ 5 levels of URL hierarchy Which URLs are most promising for building a high-quality corpus This is a graph traversal problem: Given a directed graph you’ve partially visited, where do you visit next?
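A minimal sketch of two of these constraints, per-host politeness and a URL-depth cap (the 30-second delay and the exact way of counting depth are illustrative assumptions):

```python
import time
from urllib.parse import urlparse

MIN_DELAY_SECONDS = 30    # assumed per-host politeness interval
MAX_DEPTH = 5             # "stay at <= 5 levels of URL hierarchy"

_last_hit = {}            # host -> time of last approved fetch

def ok_to_fetch(url):
    """Return True if url respects the politeness and depth constraints."""
    parsed = urlparse(url)
    depth = len([seg for seg in parsed.path.split("/") if seg])
    if depth > MAX_DEPTH:
        return False      # too deep in this site's URL hierarchy
    last = _last_hit.get(parsed.netloc)
    if last is not None and time.time() - last < MIN_DELAY_SECONDS:
        return False      # too soon to hit this host again
    _last_hit[parsed.netloc] = time.time()
    return True
```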

Where do we spider next? [Diagram repeated: the Web, split into URLs already crawled and parsed vs. URLs waiting in the queue]

Where do we spider next? Keep all spiders busy Keep spiders from treading on each others’ toes Avoid fetching duplicates repeatedly Respect politeness/robots.txt Avoid getting stuck in traps Detect/minimize spam Get the “best” pages What’s best? Best for answering search queries

Where do we spider next? Complex scheduling optimization problem, subject to all the constraints listed Plus operational constraints (e.g., keeping all machines load-balanced) Scientific study – limited to specific aspects Which ones? What do we measure? What are the compromises in distributed crawling?

Parallel Crawlers We follow the treatment of Cho and Garcia-Molina, which raises a number of questions in a clean setting for further study Setting: we have a number of c-procs (c-proc = crawling process) Goal: we wish to spider the best pages with minimum overhead What do these mean?

Distributed model Crawlers may be running in diverse geographies – Europe, Asia, etc. Periodically update a master index Incremental update so this is “cheap” Compression, differential update etc. Focus on communication overhead during the crawl Also results in dispersed WAN load

c-procs crawling the web [Diagram: URLs crawled, URLs in the queues of several c-procs, and a newly discovered URL labeled "Which c-proc gets this URL?"] Communication: by URLs passed between c-procs.

Measurements Overlap = (N − I)/I, where N = total number of pages fetched and I = number of distinct pages fetched
Coverage = I/U, where U = total number of web pages
Quality = sum, over downloaded pages, of their importance; importance of a page = its in-degree
Communication overhead = number of URLs the c-procs exchange per downloaded page
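A direct transcription of these measures as functions; the numbers in the usage example are made up, purely to show the formulas in action:

```python
def overlap(n_fetched, n_distinct):
    """(N - I) / I: how much redundant fetching the c-procs did."""
    return (n_fetched - n_distinct) / n_distinct

def coverage(n_distinct, n_total_web):
    """I / U: fraction of the (known) web that was downloaded."""
    return n_distinct / n_total_web

def quality(indegrees_of_downloaded):
    """Sum of importance (here: in-degree) over downloaded pages."""
    return sum(indegrees_of_downloaded)

def communication_overhead(n_urls_exchanged, n_pages_downloaded):
    """URLs exchanged between c-procs per downloaded page."""
    return n_urls_exchanged / n_pages_downloaded

# Toy numbers, purely illustrative:
print(overlap(n_fetched=1_200, n_distinct=1_000))           # 0.2
print(coverage(n_distinct=1_000, n_total_web=40_000_000))   # 2.5e-05
print(communication_overhead(n_urls_exchanged=3_000, n_pages_downloaded=1_000))  # 3.0
```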

Crawler variations c-procs are independent: fetch pages oblivious to each other Static assignment: web pages partitioned statically a priori, e.g., by URL hash … more to follow Dynamic assignment: a central co-ordinator splits URLs among c-procs

Static assignment Firewall mode: each c-proc only fetches URLs within its partition – typically a domain; inter-partition links are not followed Crossover mode: a c-proc may follow inter-partition links into another partition; possibility of duplicate fetching Exchange mode: c-procs periodically exchange URLs they discover in another partition
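A sketch of static assignment by hashing a URL's host, with the firewall/exchange choice for URLs that land in another partition (the partition count, hash function, and helper names are assumptions for illustration; hashing the host rather than the full URL keeps a whole site inside one partition):

```python
import hashlib
from urllib.parse import urlparse

NUM_CPROCS = 4   # assumed number of crawling processes

def assign_cproc(url):
    """Statically map a URL to a c-proc by hashing its host."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CPROCS

def handle_extracted_url(url, my_id, local_queue, outgoing):
    """Decide what to do with a newly extracted URL."""
    owner = assign_cproc(url)
    if owner == my_id:
        local_queue.append(url)          # within my partition: crawl it myself
    else:
        # Firewall mode would simply drop the URL here;
        # exchange mode batches it up for the owning c-proc.
        outgoing.setdefault(owner, []).append(url)
```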

Experiments 40M URL graph – Stanford Webbase Open Directory (dmoz.org) URLs as seeds Should be considered a small Web

Summary of findings Cho/Garcia-Molina detail many findings We will review some here, both qualitatively and quantitatively You are expected to understand the reason behind each qualitative finding in the paper You are not expected to remember quantities in their plots/studies

Firewall mode coverage The price of crawling in firewall mode

Crossover mode overlap Demanding coverage drives up overlap

Exchange mode communication Communication overhead, measured per downloaded URL, grows sublinearly

Connectivity servers

Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01] Support for fast queries on the web graph Which URLs point to a given URL? Which URLs does a given URL point to? Stores mappings in memory from URL to outlinks, URL to inlinks Applications Crawl control Web graph analysis Connectivity, crawl optimization Link analysis More on this later
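A toy, uncompressed sketch of those two in-memory mappings (real connectivity servers compress the adjacency data heavily, as the next slides describe):

```python
from collections import defaultdict

class ConnectivityServer:
    """Answers 'what does u point to?' and 'who points to u?' from memory."""
    def __init__(self):
        self.outlinks = defaultdict(set)   # URL -> URLs it points to
        self.inlinks = defaultdict(set)    # URL -> URLs pointing to it

    def add_link(self, src, dst):
        self.outlinks[src].add(dst)
        self.inlinks[dst].add(src)

    def links_from(self, url):
        return self.outlinks.get(url, set())

    def links_to(self, url):
        return self.inlinks.get(url, set())

# cs = ConnectivityServer()
# cs.add_link("http://a.example/", "http://b.example/")
# cs.links_to("http://b.example/")   -> {"http://a.example/"}
```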

Most recent published work Boldi and Vigna Webgraph – a set of algorithms and a Java implementation Fundamental goal – maintain node adjacency lists in memory For this, compressing the adjacency lists is the critical component

Adjacency lists The set of neighbors of a node Assume each URL is represented by an integer Properties exploited in compression: similarity (between lists), locality (many links from a page go to “nearby” pages) Use gap encodings in sorted lists, as we did for postings in CS276A; the distribution of gap values is highly skewed, which is what makes the gaps cheap to encode
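A sketch of the gap step plus a simple variable-byte coder that benefits from the skewed gap distribution (the actual WebGraph codes are considerably more sophisticated, and the neighbor ids below are made up):

```python
def gaps(sorted_neighbors):
    """Turn a sorted list of neighbor ids into first-id + gaps."""
    return [sorted_neighbors[0]] + [
        b - a for a, b in zip(sorted_neighbors, sorted_neighbors[1:])
    ]

def vbyte_encode(numbers):
    """Variable-byte encode: 7 data bits per byte, high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)   # collect 7-bit groups, least significant first
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80             # mark the final (least-significant) byte
        out.extend(reversed(chunk))  # emit most-significant group first
    return bytes(out)

# Adjacency list of some node, as integer URL ids (made-up numbers):
neighbors = [100007, 100009, 100010, 100100, 103000]
print(gaps(neighbors))                     # [100007, 2, 1, 90, 2900]
print(len(vbyte_encode(gaps(neighbors))))  # 8 bytes, vs. 20 bytes of raw 4-byte ids
```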

Storage Boldi/Vigna report getting down to an average of ~3 bits/link (URL to URL edge) For a 118M node web graph

Resources www2002.org/CDROM/refereed/108/index.html www2004.org/proceedings/docs/1p595.pdf