CSCI 5417 Information Retrieval Systems Jim Martin Lecture 17 10/25/2011

Today Finish topic models intro Start on web search

What if? What if we just have the documents but no class assignments? But assume we do have knowledge about the number of classes involved. Can we still use probabilistic models? In particular, can we use naïve Bayes? Yes, via EM (Expectation Maximization)

EM
1. Given some model, like NB, make up some class assignments randomly.
2. Use those assignments to generate model parameters P(class) and P(word|class).
3. Use those model parameters to re-classify the training data.
4. Go to 2.
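A minimal sketch of this loop in Python (hard EM for naïve Bayes over bag-of-words documents; the function name, add-one smoothing, and the stop-when-stable rule are illustrative choices, not prescribed by the slides):

import random
from collections import Counter

def hard_em_nb(docs, n_classes, n_iters=10, seed=0):
    # docs: list of token lists; returns one class label per document
    rng = random.Random(seed)
    vocab = {w for d in docs for w in d}
    labels = [rng.randrange(n_classes) for _ in docs]         # 1. random class assignments
    for _ in range(n_iters):
        priors = Counter(labels)                              # 2. estimate P(class) ...
        word_counts = [Counter() for _ in range(n_classes)]
        for label, doc in zip(labels, docs):
            word_counts[label].update(doc)                    # ... and counts for P(word|class)
        new_labels = []                                       # 3. re-classify the training data
        for doc in docs:
            def score(c):
                denom = sum(word_counts[c].values()) + len(vocab)   # add-one smoothing
                s = priors[c] / len(docs)
                for w in doc:
                    s *= (word_counts[c][w] + 1) / denom
                return s
            new_labels.append(max(range(n_classes), key=score))
        if new_labels == labels:                              # 4. go to 2 until nothing changes
            break
        labels = new_labels
    return labels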

Naïve Bayes Example (EM)
Doc   Category
D1    ?
D2    ?
D3    ?
D4    ?
D5    ?

Naïve Bayes Example (EM)
Doc   Words                      Category (random initial assignment)
D1    {China, soccer}            Sports
D2    {Japan, baseball}          Politics
D3    {baseball, trade}          Sports
D4    {China, trade}             Politics
D5    {Japan, Japan, exports}    Sports

Sports (.6)          Politics (.4)
baseball   2/13      baseball   2/10
China      2/13      China      2/10
exports    2/13      exports    1/10
Japan      3/13      Japan      2/10
soccer     2/13      soccer     1/10
trade      2/13      trade      2/10
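The tables above can be checked mechanically; a short script (documents and initial assignments copied from the slide, add-one smoothing assumed) that reproduces the 2/13, 3/13, ..., 2/10 estimates:

from collections import Counter

docs = {"D1": ["China", "soccer"], "D2": ["Japan", "baseball"],
        "D3": ["baseball", "trade"], "D4": ["China", "trade"],
        "D5": ["Japan", "Japan", "exports"]}
labels = {"D1": "Sports", "D2": "Politics", "D3": "Sports",
          "D4": "Politics", "D5": "Sports"}          # the random initial assignments

vocab = sorted({w for ws in docs.values() for w in ws})
counts = {"Sports": Counter(), "Politics": Counter()}
for d, ws in docs.items():
    counts[labels[d]].update(ws)

for cls, cnt in counts.items():
    prior = sum(1 for d in docs if labels[d] == cls) / len(docs)
    denom = sum(cnt.values()) + len(vocab)           # add-one smoothing over the 6-word vocab
    print(cls, f"(prior {prior})")
    for w in vocab:
        print(f"  {w}: {cnt[w] + 1}/{denom}")        # e.g. Japan: 3/13 under Sports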

Naïve Bayes Example (EM) Use these counts to reassess the class membership for D1 to D5. Reassign them to new classes. Recompute the tables and priors. Repeat until happy

Topics
Doc                        Category
{China, soccer}            Sports
{Japan, baseball}          Sports
{baseball, trade}          Sports
{China, trade}             Politics
{Japan, Japan, exports}    Politics
What's the deal with trade?

Topics
Doc                              Category
{China_1, soccer_2}              Sports
{Japan_1, baseball_2}            Sports
{baseball_2, trade_2}            Sports
{China_1, trade_1}               Politics
{Japan_1, Japan_1, exports_1}    Politics
{basketball_2, strike_3}

Topics So let's propose that instead of assigning documents to classes, we assign each word token in each document to a class (topic). Then we can compute some new probabilities to associate with words, topics and documents: the distribution of topics in a doc, the distribution of topics overall, and the association of words with topics.

Topics Example. A document like {basketball_2, strike_3} can be said to be .5 about topic 2, .5 about topic 3, and 0 about the rest of the possible topics (we may want to worry about smoothing later). For a collection as a whole we can get a topic distribution (prior) by summing the words tagged with a particular topic and dividing by the number of tagged tokens.
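A small illustration of both computations, assuming token-level topic tags like the ones above (the second toy document is the {basketball_2, strike_3} example):

from collections import Counter

tagged_docs = [
    [("China", 1), ("soccer", 2)],
    [("basketball", 2), ("strike", 3)],
]

def doc_topic_dist(doc):
    # distribution of topics within one document
    c = Counter(topic for _, topic in doc)
    n = sum(c.values())
    return {t: k / n for t, k in c.items()}

print(doc_topic_dist(tagged_docs[1]))        # {2: 0.5, 3: 0.5}

# collection-wide topic distribution (prior): tagged tokens per topic / total tagged tokens
all_tags = Counter(t for doc in tagged_docs for _, t in doc)
total = sum(all_tags.values())
print({t: k / total for t, k in all_tags.items()})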

Problem With “normal” text classification the training data associates a document with one or more topics. Now we need to associate topics with the (content) words in each document This is a semantic tagging task, not unlike part-of-speech tagging and word-sense tagging It’s hard, slow and expensive to do right

Topic modeling Do it without the human tagging Given a set of documents And a fixed number of topics (given) Find the statistics that we need

Graphical Models Notation: Take 2 [Plate-notation diagram: a Category node generating words w_1, w_2, w_3, w_4, ..., w_n, drawn compactly as a single node w_i inside a plate replicated n times.]

Unsupervised NB Now suppose that Cat isn't observed That is, we don't have category labels for each document Then we need to learn two distributions: P(Cat) and P(w|Cat) How do we do this? We might use EM Alternative: Bayesian methods

Bayesian document categorization [Plate diagram: priors over P(Cat) and P(w|Cat); for each of the D documents, a category Cat generates that document's n_d words w_1 ... w_{n_d}.]

Latent Dirichlet Allocation: Topic Models (Blei, Ng, & Jordan, 2001; 2003)
θ(d) ~ Dirichlet(α)      distribution over topics for each document
z_i ~ Discrete(θ(d))     topic assignment for each word
φ(j) ~ Dirichlet(β)      distribution over words for each topic (j = 1..T)
w_i ~ Discrete(φ(z_i))   word generated from assigned topic
α, β: Dirichlet priors
[Plate diagram: for each of D documents, θ(d) generates a topic assignment z_i and a word w_i for each of the document's N_d tokens; the T topic-word distributions φ(j) sit in their own plate.]
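A minimal sketch of this generative story with NumPy (the topic count, vocabulary size, document lengths, and hyperparameter values are made up for illustration; fitting LDA means inverting this process with inference such as Gibbs sampling or variational methods, which is not shown):

import numpy as np

rng = np.random.default_rng(0)

T, V, D = 3, 8, 5            # topics, vocabulary size, documents (toy sizes)
alpha, beta = 0.1, 0.01      # Dirichlet hyperparameters
doc_lengths = [10, 12, 8, 15, 9]

phi = rng.dirichlet([beta] * V, size=T)      # phi(j) ~ Dirichlet(beta), one row per topic

corpus = []
for d in range(D):
    theta = rng.dirichlet([alpha] * T)       # theta(d) ~ Dirichlet(alpha)
    doc = []
    for _ in range(doc_lengths[d]):
        z = rng.choice(T, p=theta)           # z_i ~ Discrete(theta(d))
        w = rng.choice(V, p=phi[z])          # w_i ~ Discrete(phi(z_i))
        doc.append((w, z))
    corpus.append(doc)

print(corpus[0])     # (word id, topic assignment) pairs for the first toy document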

Given That What could you do with it? Browsing/exploring a collection and individual documents is the basic task

Visualize the topics

Visualize documents

Break

Brief History of Web Search Early keyword-based engines: Altavista, Excite, Infoseek, Inktomi, Lycos, ca. Sponsored search ranking: WWWW (Colorado/McBryan) -> Goto.com (morphed into Overture.com -> Yahoo! -> ???) Your search ranking depended on how much you paid Auction for keywords: casino was an expensive keyword!

Brief history 1998+: Link-based ranking introduced by Google Perception was that it represented a fundamental improvement over existing systems Great user experience in search of a business model Meanwhile Goto/Overture’s annual revenues were nearing $1 billion Google adds paid-placement “ads” to the side, distinct from search results 2003: Yahoo follows suit, acquiring Overture (for paid placement) and Inktomi (for search)

Web search basics [Diagram: the user issues queries to the search engine; a web spider crawls the Web, the indexer builds the indexes, and results are served from those indexes together with ad indexes.]

User Needs Need [Brod02, RL04] Informational – want to learn about something (~40% / 65%), e.g. low hemoglobin Navigational – want to go to that page (~25% / 15%), e.g. United Airlines Transactional – want to do something (web-mediated) (~35% / 20%): access a service (Seattle weather, Mars surface images), downloads, shop (Canon S410) Gray areas: find a good hub, exploratory search “see what’s there” (car rental Brazil)

How far do people look for results? (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

Users’ empirical evaluation of results Quality of pages varies widely Relevance is not enough Other desirable qualities Content: Trustworthy, diverse, non-duplicated, well maintained Web readability: display correctly & fast No annoyances: pop-ups, etc Precision vs. recall On the web, recall seldom matters What matters Precision at 1? Precision at k? Comprehensiveness – must be able to deal with obscure queries Recall matters when the number of matches is very small

Users’ empirical evaluation of engines Relevance and validity of results UI – Simple, no clutter, error tolerant Trust – Results are objective Coverage of topics for polysemic queries Pre/Post process tools provided Mitigate user errors (auto spell check, search assist,…) Explicit: Search within results, more like this, refine... Anticipative: related searches, suggest, instant search Deal with idiosyncrasies Web specific vocabulary Impact on stemming, spell-check, etc Web addresses typed in the search box

The Web as a Document Collection No design/co-ordination Distributed content creation, linking, democratization of publishing Content includes truth, lies, obsolete information, contradictions … Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases)… Scale much larger than previous text collections … but corporate records are catching up Growth – slowed down from initial “volume doubling every few months” but still expanding Content can be dynamically generated

Web search engine pieces Spider (a.k.a. crawler/robot) – builds corpus Collects web pages recursively For each known URL, fetch the page, parse it, and extract new URLs Repeat Additional pages from direct submissions & other sources The indexer – creates inverted indexes Usual issues wrt which words are indexed, capitalization, support for Unicode, stemming, support for phrases, language issues, etc. Query processor – serves query results Front end – query reformulation, word stemming, capitalization, optimization of Booleans, phrases, wildcards, spelling, etc. Back end – finds matching documents and ranks them

Search Engine: Three sub-problems 1. Match ads to query/context 2. Generate and Order the ads 3. Pricing on a click-through (1 and 2 are IR problems; 3 is an economics problem)

The trouble with search ads… They cost real money. Search Engine Optimization: “Tuning” your web page to rank highly in the search results for select keywords Alternative to paying for placement Thus, intrinsically a marketing function Performed by companies, webmasters and consultants (“Search engine optimizers”) for their clients Some perfectly legitimate, some very shady

Basic crawler operation Begin with known “seed” pages Fetch and parse them Extract URLs they point to Place the extracted URLs on a queue Fetch each URL on the queue and repeat
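A minimal sketch of this loop using only the Python standard library (seed URLs, the page limit, and the error handling are illustrative; politeness, robots.txt, distribution, and duplicate content are ignored here and taken up in the following slides):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # collects the href targets of <a> tags, resolved against the page's URL
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)              # queue of URLs still to fetch
    seen = set(seed_urls)
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                         # skip pages that cannot be fetched
        max_pages -= 1
        parser = LinkExtractor(url)
        parser.feed(html)                    # parse the page and extract its URLs
        for link in parser.links:
            if link not in seen:             # place new URLs on the queue and repeat
                seen.add(link)
                frontier.append(link)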

Crawling picture [Diagram: seed pages initialize the URL frontier; URLs from the frontier are crawled and parsed, newly extracted links join the frontier, and the rest remains the unseen Web.]

Simple picture – complications Effective Web crawling isn’t feasible with one machine All of the above steps need to be distributed Even non-malicious pages pose challenges Latency/bandwidth to remote servers vary Webmasters’ stipulations How “deep” should you crawl a site’s URL hierarchy? Site mirrors and duplicate pages Malicious pages Spam pages Spider traps – including dynamically generated ones Politeness – don’t hit a server too often

What any crawler must do Be Polite: Respect implicit and explicit politeness considerations for a website Only crawl pages you’re allowed to Respect robots.txt Be Robust: Be immune to spider traps and other malicious behavior from web servers

What any crawler should do Be capable of distributed operation: designed to run on multiple distributed machines Be scalable: designed to increase the crawl rate by adding more machines Performance/efficiency: permit full use of available processing and network resources

What any crawler should do Fetch important stuff first Pages with “higher quality” Continuous operation: Continue to fetch fresh copies of a previously fetched page Extensible: Adapt to new data formats, protocols, etc.

Updated crawling picture [Diagram: as before, but several crawling threads pull URLs from a shared URL frontier; seed pages and newly extracted links feed the frontier, crawled-and-parsed URLs accumulate on one side and the unseen Web remains on the other.]

URL frontier Can include multiple pages from the same host Must avoid trying to fetch them all at the same time Must try to keep all crawling threads busy
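One way to sketch both constraints, assuming a simple fixed per-host delay (the 2-second gap and the data structures are illustrative):

import time
from collections import deque
from urllib.parse import urlparse

class PoliteFrontier:
    # holds per-host queues so no single host is hit too often
    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay
        self.queues = {}       # host -> deque of URLs waiting to be fetched
        self.next_ok = {}      # host -> earliest time that host may be contacted again

    def add(self, url):
        host = urlparse(url).netloc
        self.queues.setdefault(host, deque()).append(url)

    def get(self):
        # return a URL from any host whose politeness delay has elapsed, else None,
        # so an idle crawling thread can move on to a different host
        now = time.monotonic()
        for host, q in self.queues.items():
            if q and self.next_ok.get(host, 0.0) <= now:
                self.next_ok[host] = now + self.min_delay
                return q.popleft()
        return None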

Explicit and implicit politeness Explicit politeness: specifications from webmasters on what portions of site can be crawled robots.txt Implicit politeness: even with no specification, avoid hitting any site too often

Robots.txt Protocol for giving spiders (“robots”) limited access to a website, originally from 1994 Website announces its request on what can(not) be crawled For a URL, create a file URL/robots.txt This file specifies access restrictions

Robots.txt example No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
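The standard library's robotparser can apply a file like this; a small check that parses the example above from memory (the site and page URLs are placeholders):

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())    # parse the example robots.txt without fetching it

# every robot except "searchengine" is kept out of /yoursite/temp/
print(rp.can_fetch("SomeBot", "http://www.example.com/yoursite/temp/page.html"))       # False
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))  # True
print(rp.can_fetch("SomeBot", "http://www.example.com/other/page.html"))               # True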

Processing steps in crawling Pick a URL from the frontier (which one?) Fetch the document at the URL Parse the document Extract links from it to other docs (URLs) Check if document has content already seen If not, add to indexes For each extracted URL Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.) Check if it is already in the frontier (duplicate URL elimination)

Basic crawl architecture [Diagram: the URL frontier feeds a fetch module (which uses DNS) pulling pages from the WWW; fetched pages are parsed, checked against document fingerprints ("content seen?"), passed through the URL filter (which consults robots filters), and surviving links go through duplicate URL elimination (against the URL set) back into the URL frontier.]

DNS (Domain Name Server) A lookup service on the internet Given a URL, retrieve its IP address Service provided by a distributed set of servers – thus, lookup latencies can be high (even seconds) Common OS implementations of DNS lookup are blocking: only one outstanding request at a time Solutions DNS caching Batch DNS resolver – collects requests and sends them out together
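A minimal sketch of the caching idea (the batch resolver is omitted; socket.gethostbyname here stands in for whatever blocking lookup the OS provides, and the hostname is a placeholder):

import socket

_dns_cache = {}

def resolve(host):
    # cache lookups so repeated URLs on the same host skip the slow DNS round trip
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)   # blocking lookup
    return _dns_cache[host]

# the first call pays the DNS latency; later calls for the same host hit the cache
print(resolve("www.example.com"))
print(resolve("www.example.com"))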

Parsing: URL normalization When a fetched document is parsed, some of the extracted links are relative URLs E.g., at http://en.wikipedia.org/wiki/Main_Page we have a relative link to /wiki/Wikipedia:General_disclaimer which is the same as the absolute URL http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer Must expand such relative URLs URL shorteners (bit.ly, etc) are a new problem
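Expanding a relative link against the page it was found on is exactly what urllib.parse.urljoin does; using the Wikipedia link from the slide:

from urllib.parse import urljoin

base = "http://en.wikipedia.org/wiki/Main_Page"
relative = "/wiki/Wikipedia:General_disclaimer"

print(urljoin(base, relative))
# http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer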

Content seen? Duplication is widespread on the web If the page just fetched is already in the index, do not further process it This is verified using document fingerprints or shingles
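A sketch of the check using a hash of the page text as the fingerprint (shingling, which also catches near-duplicates, is not shown):

import hashlib

seen_fingerprints = set()

def content_already_seen(page_text):
    # return True if an identical page body has been processed before
    fp = hashlib.md5(page_text.encode("utf-8")).hexdigest()   # document fingerprint
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False

print(content_already_seen("hello web"))   # False: first time this content is seen
print(content_already_seen("hello web"))   # True: exact duplicate, skip further processing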

Filters and robots.txt Filters – regular expressions for URLs to be crawled/not Once a robots.txt file is fetched from a site, need not fetch it repeatedly Doing so burns bandwidth, hits web server Cache robots.txt files

Duplicate URL elimination For a non-continuous (one-shot) crawl, test to see if an extracted+filtered URL has already been passed to the frontier
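For a one-shot crawl, a set of every URL ever handed to the frontier is enough for this test; a tiny sketch (names are illustrative):

urls_seen = set()

def add_to_frontier(url, frontier):
    # enqueue a URL only the first time it survives extraction and filtering
    if url not in urls_seen:
        urls_seen.add(url)
        frontier.append(url)

frontier = []
add_to_frontier("http://www.example.com/a", frontier)
add_to_frontier("http://www.example.com/a", frontier)   # duplicate, silently dropped
print(frontier)                                          # contains one copy only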