Crawlers Padmini Srinivasan Computer Science Department Department of Management Sciences

Basics
What is a crawler? HTTP client software that sends out an HTTP request for a page and reads the response.
- Timeouts
- How much to download?
- Exception handling and error handling
- Collect statistics: time-outs, etc.
- Follows the Robot Exclusion Protocol (de facto standard, 1994 onwards)
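A minimal sketch of the fetch step, using only the Python standard library. The timeout, size cap, and user-agent string are illustrative choices, not values prescribed in the lecture.

```python
import urllib.request
import urllib.error

MAX_BYTES = 500_000   # "how much to download?" -- cap the body size (assumed value)
TIMEOUT = 10          # seconds before giving up on a slow server (assumed value)

def fetch(url, stats):
    """Fetch one page, handling timeouts and errors, and collect statistics."""
    req = urllib.request.Request(url, headers={"User-Agent": "CourseCrawler/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=TIMEOUT) as resp:
            return resp.read(MAX_BYTES)           # read at most MAX_BYTES
    except urllib.error.HTTPError:
        stats["http_errors"] = stats.get("http_errors", 0) + 1
    except urllib.error.URLError:                 # covers DNS failures and timeouts
        stats["timeouts_or_net_errors"] = stats.get("timeouts_or_net_errors", 0) + 1
    except Exception:
        stats["other_errors"] = stats.get("other_errors", 0) + 1
    return None
```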

Tippie web site
# robots.txt
# Rules for all robots accessing the site.
User-agent: *
Disallow: /error-pages/
Disallow: /includes/
Disallow: /Redirects/
Disallow: /scripts/
Disallow: /CFIDE/
# Individual folders that should not be indexed
Disallow: /vaughan/Board/
Disallow: /economics/mwieg/
Disallow: /economics/midwesttheory/
Disallow: /undergraduate/scholars/
Sitemap:

Robots.txt
Allow all robots to crawl everything:
User-agent: *
Disallow:

Ban one particular robot from the whole site:
User-agent: BadBot
Disallow: /

Allow only Google, excluding everyone else:
User-agent: Google
Disallow:
User-agent: *
Disallow: /

Legally binding? No. But it has been used as evidence in legal cases.
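A polite crawler checks these rules before fetching. A short sketch with the standard-library urllib.robotparser; the site URL and user-agent name are only examples.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                   # fetch and parse the robots.txt file

if rp.can_fetch("CourseCrawler", "https://example.com/scripts/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```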

Types of crawlers
- Get everything? Broad crawlers.
- Get everything on a topic? Preferential, topical, focused, thematic crawlers. What are your objectives behind the crawl?
- Keep it fresh: when does one run it? Fetch new pages versus re-check old ones?
- How does one evaluate performance? Occasionally? Continuously? What is the gold standard?

Design

Crawler Parts: Frontier
- List of "to be visited" URLs
- FIFO (first in, first out) queue, or a priority queue (preferential crawling)
- When the Frontier is full: does this happen? What to do?
- When the Frontier is empty: does this happen? After 10,000 pages crawled at an average of 7 links per page, there can be 60,000 URLs in the frontier. How so? Unique URLs?
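A minimal frontier sketch under these assumptions: a priority queue for preferential crawling plus a "seen" set for de-duplication. The capacity limit and the drop-when-full policy are illustrative choices.

```python
import heapq

class Frontier:
    MAX_FRONTIER = 100_000      # assumed cap for when the frontier fills up

    def __init__(self):
        self.heap = []          # (negated score, URL); heapq pops the smallest item
        self.seen = set()       # every URL ever added, so each enters only once

    def add(self, url, score):
        if url in self.seen or len(self.heap) >= self.MAX_FRONTIER:
            return              # skip duplicates; drop new URLs when full
        self.seen.add(url)
        heapq.heappush(self.heap, (-score, url))

    def next_url(self):
        if not self.heap:
            return None         # frontier empty: crawl ends (or is re-seeded)
        _, url = heapq.heappop(self.heap)
        return url
```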

Crawler Parts: History
- Time-stamped listing of visited URLs (taken out of the frontier first)
- Can keep other information too: quality estimate, update frequency, rate of errors (in accessing the page), last update date, anything you want to track related to fetching the page
- Fast lookup: hashing scheme on the URL itself
- Canonicalize URLs first:
  - Lowercasing
  - Remove anchor (fragment) references
  - Remove tildes
  - Add or subtract trailing /
  - Remove default pages: index.html
  - Normalize paths: remove parent pointers (..) in the URL
  - Normalize port numbers: drop defaults (80)
- Spider traps: very long URLs; limit URL length
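A rough canonicalization routine following the steps listed above. Real crawlers need more care (query strings, percent-encoding, host aliases); this is only a sketch, and the length limit is an assumed value.

```python
from urllib.parse import urlparse, urlunparse
import posixpath

MAX_URL_LEN = 256            # guard against spider traps with very long URLs (assumed)

def canonicalize(url):
    p = urlparse(url)
    scheme = p.scheme.lower()
    netloc = p.netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                     # drop the default port
    path = posixpath.normpath(p.path or "/")     # resolve ../ parent pointers
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]        # remove the default page
    if not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        path += "/"                              # normalize the trailing slash
    clean = urlunparse((scheme, netloc, path, "", p.query, ""))  # drop the #fragment
    return clean if len(clean) <= MAX_URL_LEN else None
```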

Crawler Parts: Page Repository
- Keep all of it? Some of it? Just the anchor texts? You decide.
- Parse the web page:
  - Index and store information (if creating a search engine of some kind). What to index? How to index? How to store? Stopwords, stemming, phrases, tag tree evidence (DOM), noise!
  - Extract URLs
- Google's initial design: shown next time.
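A sketch of the "extract URLs" step using the standard-library HTML parser; it also records anchor text, which a repository might keep even if it discards the full page. The class and attribute names are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []            # (absolute URL, anchor text) pairs
        self._in_link = False      # are we currently inside an <a> tag?

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._in_link = True
                self.links.append([urljoin(self.base_url, href), ""])

    def handle_data(self, data):
        if self._in_link and self.links:
            self.links[-1][1] += data      # accumulate anchor text

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

# usage:
#   ex = LinkExtractor("http://example.com/")
#   ex.feed(html_text)
#   print(ex.links)
```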

And… but why do crawlers actually work?
- Topical locality hypothesis: an on-topic page tends to link to other on-topic pages. Empirical test: two pages that are topically similar have a higher probability of linking to each other than two random pages on the web (Davison, 2000). This is also why browsing works.
- Status locality: high-status web pages are more likely to link to other high-status pages than to low-status pages. Rationale from social theories: relationship asymmetry in social groups and the spontaneous development of social hierarchies.

Crawler Algorithms
- Naïve best-first crawler (and Best-N-first crawler)
- SharkSearch crawler (from FishSearch)
- Focused crawler
- Context Focused crawler
- InfoSpiders
- Utility-biased web crawlers

Naïve Best-First Crawler
- Compute the cosine between the page and the query/description; use it as the score of the URLs extracted from that page
- Term frequency (TF) and inverse document frequency (IDF) weights
- Multi-threaded generalization: Best-N-first crawler (N = 256)
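A small sketch of this scoring step: cosine similarity between TF-IDF vectors of a crawled page and of the topic description. The tiny IDF table is a stand-in; a real crawler would estimate IDF from a corpus.

```python
import math
from collections import Counter

def tfidf(tokens, idf):
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 1.0) for t in tf}

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

idf = {"crawler": 2.0, "web": 1.2, "sports": 2.5}          # illustrative values only
query_vec = tfidf("focused web crawler".split(), idf)
page_vec = tfidf("a web crawler visits web pages".split(), idf)
score = cosine(page_vec, query_vec)    # assigned to the URLs extracted from the page
```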

Naïve best-first crawler, continued
- Use a Bayesian classifier to score URLs (Chakrabarti et al.)
- An SVM works better (Pant and Srinivasan, 2005); naïve Bayes tends to produce skewed scores
- Use PageRank to score URLs? How to compute it? Only partial data is available; computing it on the crawled data gives poor results
- Later: the utility-biased crawler

Shark Search Crawler
Derived from the earlier Fish Search (de Bra et al.): depth bound; anchor text; link context; inherited scores.
- score(u) = g * inherited(u) + (1 - g) * neighbourhood(u)
- inherited(u) = x * sim(p, q) if sim(p, q) > 0, else inherited(p), with x < 1
- neighbourhood(u) = b * anchor(u) + (1 - b) * context(u), with b < 1
- context(u) = 1 if anchor(u) > 0, else sim(aug_context, q)
Depth bound: limits how far the crawler travels within a subspace once no more 'relevant' information is being found.
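A direct transcription of the formulas above into code, as a sketch. sim(p, q), the anchor similarity, and the augmented-context similarity are assumed to be computed elsewhere (e.g. with the TF-IDF cosine shown earlier); g, x, and b are tunable constants.

```python
def shark_score(sim_parent_query, inherited_parent, anchor_sim, context_sim,
                g=0.5, x=0.9, b=0.8):
    # inherited(u): decay the parent's relevance, or pass its inherited score on
    if sim_parent_query > 0:
        inherited = x * sim_parent_query
    else:
        inherited = inherited_parent
    # context(u): a relevant anchor makes the link context maximally relevant
    context = 1.0 if anchor_sim > 0 else context_sim
    # neighbourhood(u): blend anchor-text and surrounding-text evidence
    neighbourhood = b * anchor_sim + (1 - b) * context
    # score(u): blend inherited and neighbourhood evidence
    return g * inherited + (1 - g) * neighbourhood
```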

Focused Crawler
Chakrabarti et al. (Stanford/IIT)
- Topic taxonomy
- User-provided sample URLs
- Classify these onto the taxonomy: Prob(c|url), with Prob(root|url) = 1
- The user iterates, selecting and deselecting categories
- Mark the 'good' categories
- When a page is crawled: relevance(page) = sum of Prob(c|page) over the good categories; this score is assigned to the extracted URLs
- When crawling:
  - Soft mode: use this relevance score to rank URLs
  - Hard mode: find the leaf node with the highest score; if it or any of its ancestors is marked relevant, add the page's URLs to the frontier, else do not
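A sketch of the two modes under stated assumptions: class_probs maps each taxonomy category to Prob(c|page), good is the set of user-marked categories, leaves is the set of leaf categories, and ancestors maps a leaf to its ancestor categories. These names are illustrative, not from the original system.

```python
def relevance(class_probs, good):
    # soft mode: summed probability over the good categories, used to rank URLs
    return sum(p for c, p in class_probs.items() if c in good)

def hard_mode_keep(class_probs, leaves, ancestors, good):
    # hard mode: keep the page's out-links only if the best leaf
    # or one of its ancestors is marked good
    best_leaf = max(leaves, key=lambda c: class_probs.get(c, 0.0))
    return best_leaf in good or any(a in good for a in ancestors[best_leaf])
```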

Context Focused Crawler
A rather different strategy:
- The topical locality hypothesis is used somewhat explicitly here
- Classifiers estimate the distance from a crawled page to a relevant page; this estimate scores the URLs

Context Graph

Levels: L
- Probability(page is in class, i.e., level x), x = 1, 2, 3 (other)
- Bayes' theorem: Prob(L1|page) = Prob(page|L1) * Prob(L1) / Prob(page)
- Prior: Prob(L1) = 1/L (L = number of levels)
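A small sketch of this level assignment, assuming likelihoods[i] stands for Prob(page|Li), e.g. produced by a classifier trained on layer i of the context graph, and a uniform prior 1/L as on the slide.

```python
def level_posteriors(likelihoods):
    L = len(likelihoods)
    prior = 1.0 / L
    joint = [lik * prior for lik in likelihoods]    # Prob(page|Li) * Prob(Li)
    evidence = sum(joint)                           # Prob(page)
    return [j / evidence for j in joint] if evidence else [prior] * L

# URLs extracted from the page are queued with priority based on the most
# probable level: pages estimated to be closer to relevant pages rank higher.
```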

Utility-Biased Crawler
- Considers both topical and status locality
- Estimates status via local properties
- Combines the two using one of several functions. One: the Cobb-Douglas function
  Utility(URL) = topicality^a * status^b, with a + b = 1
- If a page is twice as high in topicality and twice as high in status, its utility is twice as high as well
- Increases in topicality (or status) cause smaller and smaller increases in utility as the topicality (or status) grows
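The Cobb-Douglas combination written out as a sketch; the default exponent is an arbitrary example. With a + b = 1 the function has constant returns to scale, which is the "twice as high" property above, while raising only one factor gives diminishing returns.

```python
def utility(topicality, status, a=0.7):
    b = 1 - a
    return (topicality ** a) * (status ** b)

# doubling both topicality and status doubles the utility
assert abs(2 * utility(0.4, 0.2) - utility(0.8, 0.4)) < 1e-9
```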

Estimating Status ~ the cool part
Local properties, combined with the M5' decision tree algorithm:
- Information volume
- Information location
- Information specificity
- Information brokerage
- Link ratio: # links / # words
- Quantitative ratio: # numbers / # words
- Domain traffic: 'reach' data for the domain, obtained from the Alexa Web Information Service
(Pant & Srinivasan, 2010, ISR)
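A rough sketch of two of the cheap local features; the full model in Pant & Srinivasan (2010) is a regression tree over several such features plus the Alexa traffic data, which is not reproduced here.

```python
import re

def link_ratio(html, text):
    n_links = html.lower().count("<a ")          # crude count of out-links
    return n_links / (len(text.split()) or 1)

def quantitative_ratio(text):
    tokens = text.split()
    n_numbers = sum(bool(re.fullmatch(r"[\d.,%$]+", t)) for t in tokens)
    return n_numbers / (len(tokens) or 1)
```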

Utility-Biased Crawler, continued
Cobb-Douglas function: Utility(URL) = topicality^a * status^b, with a + b = 1
- Should a be fixed ("one size fits all"), or should it vary based on the subspace?
- Pick a target topicality level d and update a = a + delta * (d - t), keeping 0 <= a <= 1
  - t: average estimated topicality of the last 25 pages fetched
  - delta: a step size (0.01)
- Example: a = 0.7, delta = 0.01, d = 0.7, t = 0.9: a becomes 0.7 + 0.01 * (0.7 - 0.9) = 0.698
- Example: a = 0.7, delta = 0.01, d = 0.7, t = 0.4: a becomes 0.7 + 0.01 * (0.7 - 0.4) = 0.703
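The adaptive update as a sketch, using the names from the slide (d, t, delta); the worked values assume d = 0.7, as the slide's arithmetic implies.

```python
def update_a(a, d, t, delta=0.01):
    a = a + delta * (d - t)
    return min(1.0, max(0.0, a))          # clip to 0 <= a <= 1

update_a(0.7, d=0.7, t=0.9)   # recent pages very topical -> a drops to about 0.698
update_a(0.7, d=0.7, t=0.4)   # recent pages off-topic    -> a rises to about 0.703
```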

It’s a matter of balance

Crawler Evaluation
- What are good pages? Web scale is daunting
- User-based crawls are short; but what about web agents?
- Page importance can be assessed via:
  - Presence of query keywords
  - Similarity of the page to the query/description
  - Similarity to seed pages (a held-out sample)
  - A classifier (not the same one used in the crawler)
  - Link-based popularity (but within the topic?)

Summarizing Performance
- Precision
  - Relevance is Boolean (yes/no): harvest rate = # of good pages / total # of pages crawled
  - Relevance is continuous: average relevance over the crawled set
- Recall
  - Target recall against held-out seed pages H: |H ∩ pages crawled| / |H|
- Robustness
  - Start the same crawler on disjoint seed sets; examine the overlap of the fetched pages
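Sketches of these summary measures. The robustness function uses Jaccard overlap of the two fetched sets, which is one reasonable way to quantify "overlap", not necessarily the exact measure used in the lecture.

```python
def harvest_rate(relevant_flags):
    # Boolean relevance judgments (True/False) over the crawled pages
    return sum(relevant_flags) / len(relevant_flags)

def average_relevance(scores):
    # continuous relevance scores over the crawled pages
    return sum(scores) / len(scores)

def target_recall(crawled, held_out):
    # fraction of the held-out target pages H that the crawl found
    return len(set(crawled) & set(held_out)) / len(held_out)

def robustness(crawl_a, crawl_b):
    a, b = set(crawl_a), set(crawl_b)
    return len(a & b) / len(a | b)        # Jaccard overlap of fetched pages
```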

Sample Performance Graph

Summary
- Crawler architecture
- Crawler algorithms
- Crawler evaluation

Assignment 1
- Run two crawlers for 5000 pages each.
- Start both with the same set of seed pages for a topic.
- Look at the overlap and report it over time (robustness).