
Slide 1 Lecture 9: Unstructured Data Information Retrieval –Types of Systems, Documents, Tasks –Evaluation: Precision, Recall Search Engines (Google) –Architecture –Web Crawling –Query Processing Inverted Indexes PageRank (!) Most of the IR portion of this material is taken from the course "Information retrieval on the Internet" by Maier and Price, taught at PSU in alternate years.

Slide 2 Learning Objectives LO9.1 Given a transition matrix, draw a transition graph, and vice versa. LO9.2 Given a transition matrix and a residence vector, decide whether it is the PageRank for that matrix.

Slide 3 Information Retrieval (IR)
The study of Unstructured Data is called Information Retrieval (IR). A Database refers to Structured Data.
              DBMS                                  IR
Target:       Structured Data: rows in tables       Unstructured Data: documents, media, etc.
Queries:      SQL                                   Keyword
Matching:     precise                               approximate
Results:      unordered list (unless specified)     list ordered by matching priority

Slide 4 General types of IR systems Web Pages Full text documents Bibliographies Distributed variations –Metasearch –Virtual document collections

Slide 5 Types of Documents in IR Systems Hyperlinked or not Format –HTML –PDF –Word Processed –Scanned OCR Type –Text –Multimedia –Semistructured, e.g., XML Static or Dynamic

Slide 6 Types of tasks in IR systems Find –an overview –a fact/answer a question –comprehensive information – a known item (document, page or site) –a site to execute a transaction (e.g., buy a book, download a file)

Slide 7 Evaluation How can we evaluate performance of an IR system? –System perspective –User perspective User perspective: Relevance –(How well) does a document satisfy a user's need? Ideally, an IR system will retrieve exactly those items that satisfy the user's needs, no more, no less. More: wastes user's time Less: user misses valuable information

Slide 8 Notation In response to a user’s query: The IR system reTrieves a set of documents T. The user knows the set of reLevant documents L. |X| denotes the number of documents in X. Ideally, T = L, no more (no junk), no less (nothing missing).

Slide 9 The big picture
[Venn diagram: the retrieved set T, the relevant set L, and their overlap T ∩ L]
Retrieved, Not Relevant = Junk. Relevant, Not Retrieved = Missing.
Precision = |T ∩ L| / |T| = fraction of retrieved items that were relevant; equals 1 if there is no junk (all retrieved items were relevant).
Recall = |T ∩ L| / |L| = fraction of relevant items that were retrieved; equals 1 if nothing is missing (all relevant items were retrieved).
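
A minimal Python sketch of these two measures, assuming the retrieved and relevant document ids are available as sets (on the real Web the relevant set L is unknowable, as the next slide notes):

    def precision_recall(retrieved, relevant):
        """Compute precision and recall from two sets of document ids."""
        overlap = retrieved & relevant          # T intersect L
        precision = len(overlap) / len(retrieved) if retrieved else 0.0
        recall = len(overlap) / len(relevant) if relevant else 0.0
        return precision, recall

    # Example: 3 of the 4 retrieved documents are relevant, out of 6 relevant overall.
    print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7}))   # (0.75, 0.5)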

Slide 10 Context Precision and Recall were created for IR systems that retrieved from a small collection of items. In that case one could calculate T and L. Web search engines do not fit this model well; T and L are huge. Recall does not make sense in this setting, but we can still apply precision by measuring it over just the first 10 results displayed: the fraction of those results that are relevant.

Slide 11 Experiment Compute precision for Google, Bing and Yahoo for this query: –Paris Hilton Hotel Precision = fraction of retrieved items that are relevant Google Bing Yahoo

Slide 12 Search Engine Architecture How often do you google? What happens when you google? – Average time: half a second We need a crawler to create the indexes and docs. –Notice that the web crawler creates the docs. –From the docs, the indexes are created and the docs are given ranks… cf. later slides. Let's study the Web Crawler Algorithm (WCA) –Page 1143 of the handout

Slide 13 Web Crawler Algorithm
Input: a set of popular URLs S
Output: a repository of visited web pages R
Method:
1. If S is empty, end.
2. Select a URL p from S to crawl; delete p from S.
3. Fetch p* (the page that p points to).
4. If p* is in R, return to step (1); else add p* to R, and add to S all outlinks of p* unless they are already in R or S.
5. Return to step (1).
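
A minimal Python sketch of this loop, assuming a hypothetical fetch(url) helper that downloads a page and returns its content together with its outlink URLs (error handling, politeness delays, and the termination limits of the next slide are omitted):

    import hashlib

    def crawl(seed_urls, fetch):
        """Web Crawler Algorithm sketch; fetch(url) is assumed to return (content, outlinks)."""
        S = set(seed_urls)              # frontier: URLs still to crawl
        R = {}                          # repository: URL -> page content
        seen_pages = set()              # hashes of pages already in R (cf. slide 15)
        while S:                        # step 1: stop when the frontier is empty
            p = S.pop()                 # step 2: select a URL and remove it from S
            content, outlinks = fetch(p)                 # step 3: fetch the page p points to
            digest = hashlib.sha1(content.encode()).hexdigest()
            if digest in seen_pages:                     # step 4: duplicate page (e.g., a mirror)
                continue
            seen_pages.add(digest)
            R[p] = content
            for link in outlinks:                        # add unseen outlinks to the frontier
                if link not in R and link not in S:
                    S.add(link)
        return R                        # step 5 is simply the loop back to step 1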

Slide 14 WCA: Terminating Search Limit the number of pages crawled –Total number of pages, or –Pages per site Limit the depth of the crawl

Slide 15 WCA: Managing the Repository Don't add duplicates to S –Need an index on S, probably hash Don't add duplicates to R –Cannot happen since we search each URL only once? A page can come from >1 URL; mirror sites –So use hash table of pages in R

Slide 16 WCA: Select Next Page in S? Can use Random Search Better: Most Important First –Can consider first set of pages to be most important As pages are added, make them less important Breadth first search –Can do a simplified PageRank (cf. later) calculation

Slide 17 WCA: Faster, Faster Multiprogramming, Multiprocessing –Must manage locks on S With billions of URLs, this becomes a bottleneck So assign each process to a host/site, not a URL –This can amount to a denial-of-service attack on the host, so throttle down and take on several sites, organized by hash buckets –R also has bottleneck problems, which can be handled with locks

Slide 18 On to Query Processing Very different from structured data: no SQL, parser, optimizer Input is boolean combination of keywords –data [and] base –data OR base Google's goal is an engine that "understands exactly what you mean and gives you back exactly what you want "

Slide 19 Inverted Indexes When the crawl is complete, the search engine builds, for each and every word, an inverted index. An inverted index is a list of all documents containing that word –The index may be a bit vector –It may also contain the location(s) of the word in the document Word: any word in any language, plus misspellings, plus any sequence of characters surrounded by punctuation!  Hundreds of millions of words  Farms of PCs, e.g. near Bonneville Dam, to hold all this data
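
A minimal Python sketch of building such an index, assuming the repository is a dict mapping document ids to their text (a real engine would also record word positions and compress the lists):

    from collections import defaultdict

    def build_inverted_index(repository):
        """Map each word to the sorted list of document ids that contain it."""
        index = defaultdict(set)
        for doc_id, text in repository.items():
            for word in text.lower().split():
                index[word.strip('.,!?;:()"')].add(doc_id)
        return {word: sorted(ids) for word, ids in index.items()}

    docs = {1: "data base systems", 2: "web data mining", 3: "data or base"}
    index = build_inverted_index(docs)
    print(index["data"])   # [1, 2, 3]
    print(index["base"])   # [1, 3]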

Slide 20 Mechanics of Query Processing
1. The relevant inverted indexes are found.
–Typically the indexes are in memory; otherwise this could take the full half second.
2. If they are bit vectors, they are ANDed or ORed, then materialized, then the lists are handled.
The result is many URLs. The next step is to determine their rank so the highest ranked URLs can be delivered to the user.
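
A minimal Python sketch of step 2, assuming each word's index is stored as a bit vector with one bit per document (plain integers serve as the bit vectors here):

    def bitvector(doc_ids):
        """Encode a posting list as an integer bit vector: bit i is set iff document i matches."""
        bits = 0
        for d in doc_ids:
            bits |= 1 << d
        return bits

    def materialize(bits, num_docs):
        """Turn a bit vector back into the list of matching document ids."""
        return [d for d in range(num_docs) if bits & (1 << d)]

    num_docs = 4
    data_bits = bitvector([1, 2, 3])    # documents containing "data"
    base_bits = bitvector([1, 3])       # documents containing "base"
    print(materialize(data_bits & base_bits, num_docs))   # AND query "data base"  -> [1, 3]
    print(materialize(data_bits | base_bits, num_docs))   # OR query "data OR base" -> [1, 2, 3]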

Slide 21 Ranking Pages Indexes have returned pages. Which ones are most relevant to you? There are many criteria for ranking pages; here are some no-brainers (except !) –Presence of all words –All words close together –Words in important locations and formats on the page –! Words near anchor text of links in reference pages But the killer criterion is PageRank

Slide 22 PageRank Intuition You need to find a plumber. How do you do it? 1.Call plumbers and talk to them 2.! Call friends and ask for plumber references Then choose plumbers who have the most references 3.!! Call friends who know a lot about plumbers (important friends) and ask them for plumber references Then choose plumbers who have the most references from important people. Technique 1 was used before Google. Google introduced technique 2 to search engines Google also introduced technique 3 Techniques 2, and especially 3, wiped out the competition. The big challenge: determine which pages are important

Slide 23 What does this mean for pages? 1.Most search engines look for pages containing the word "plumber" 2.Google searches for pages that are linked to by pages containing "plumber". 3.Google searches for pages that are linked to by important pages containing "plumber". A web page is important if many important pages link to it. –This is a recursive equation. –Google solves it by imagining a web walker.

Slide 24 The Web Walker From page p, the walker follows a random link in p –Note that all links in p have equal weight The walker walks for a very, very long time. A residence vector [y a m] describes the percentage of time that the walker spends on each page –What does the vector [1/3 1/3 1/3] mean? In steady state, the residence vector will be (the 1st draft of) the PageRank Observe: pages with many in-links are visited often Observe: important pages are visited most often

Slide 25 Stochastic Transition Matrix
To describe the page walker's moves, we use a stochastic transition matrix.
–Stochastic = each column sums to 1
There are 3 web pages: Yahoo, Amazon and Microsoft. Rows and columns are ordered Y, A, M; column j gives the probabilities of moving from page j.
              Y   A   M
    Matrix =  ½   ½   0
              ½   0   1
              0   ½   0
This matrix means that the Yahoo page has 2 outlinks, to Yahoo (a self-link) and to Amazon, etc.
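
A minimal NumPy sketch of building this column-stochastic matrix from the link structure it encodes (Y→{Y, A}, A→{Y, M}, M→{A}), and checking that each column sums to 1:

    import numpy as np

    pages = ["Y", "A", "M"]
    outlinks = {"Y": ["Y", "A"], "A": ["Y", "M"], "M": ["A"]}   # link structure from the slide

    n = len(pages)
    M = np.zeros((n, n))
    for j, src in enumerate(pages):                  # column j describes moves out of page j
        for dst in outlinks[src]:
            M[pages.index(dst), j] = 1.0 / len(outlinks[src])

    print(M)                      # [[0.5 0.5 0. ] [0.5 0.  1. ] [0.  0.5 0. ]]
    print(M.sum(axis=0))          # each column sums to 1, so the matrix is stochastic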

Slide 26 Transition Graph Each Transition Matrix corresponds to a Transition Graph. For the matrix above, the edges are: Y→Y (½), Y→A (½), A→Y (½), A→M (½), M→A (1).

Slide 27 LO9.1: Transition Graph* What is the Transition Graph for this Matrix (rows and columns ordered Y, A, M)?
    0   ½   ⅔
    ⅓   0   ⅓
    ⅔   ½   0

Slide 28 Solving for Page Rank
For small-dimension matrices it is simple to calculate the PageRank using Gaussian Elimination. Remember [y, a, m] is the fraction of time the walker spends at each site. Since it is a probability distribution, y + a + m = 1. Since the walker has reached steady state,
    y       ½ ½ 0     y
    a   =   ½ 0 1  ·  a
    m       0 ½ 0     m
that is: y = ½y + ½a, a = ½y + m, m = ½a.

Slide 29 Solving, ctd
Solving such small systems is easy, but in reality the matrix dimension is the number of pages on the web, so it is in the billions. There is a simpler way, called relaxation. Start with a distribution, typically equal values, and transform it by the matrix:
    ½ ½ 0     1/3       2/6
    ½ 0 1  ·  1/3   =   3/6
    0 ½ 0     1/3       1/6
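
A minimal NumPy sketch of relaxation (power iteration) on this matrix, starting from the uniform distribution as on the slide:

    import numpy as np

    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])

    v = np.full(3, 1/3)               # start with equal values
    for step in range(10):            # a handful of iterations is enough for this matrix
        v = M @ v                     # one relaxation step: transform v by the matrix
        print(step + 1, v)

    # v converges to [0.4, 0.4, 0.2], i.e. [2/5, 2/5, 1/5]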

Slide 30 Solving, ctd
If we repeat this only 5-10* times the vector converges to values very close to [2/5, 2/5, 1/5]. Check that this is a solution:
    ½ ½ 0     2/5       2/5
    ½ 0 1  ·  2/5   =   2/5
    0 ½ 0     1/5       1/5
This solution gives the PageRank of each page on the Web. It is also called the eigenvector of the matrix with eigenvalue one. Does this agree with our intuition about Page Rank?
*For real web values, at most 100 iterations suffice

Slide 31 LO9.2: Identify Solution Is [3/8, 1/4, 3/8] a solution for this transition matrix?
    0   ½   ⅔
    ⅓   0   ⅓
    ⅔   ½   0

Slide 32 A Spider Trap
Let's look at a more realistic example called a spider trap. Now Microsoft's only link is to itself:
          ½ ½ 0
    M  =  ½ 0 0
          0 ½ 1
The Transition Graph is: Y→Y (½), Y→A (½), A→Y (½), A→M (½), M→M (1).
M represents any set of web pages that does not have a link outside the set.

Slide 33 A Spider Trap
The Page Rank is [0, 0, 1]:
    ½ ½ 0     0       0
    ½ 0 0  ·  0   =   0
    0 ½ 1     1       1
Relaxation arrives at this vector because a random walker eventually arrives at M and stays there in a loop. This Page Rank vector violates the Page Rank principle that inlinks should determine importance.
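
A short NumPy sketch showing relaxation draining all of the rank into the trap (same iteration as before, using the spider-trap matrix assumed above):

    import numpy as np

    M_trap = np.array([[0.5, 0.5, 0.0],
                       [0.5, 0.0, 0.0],
                       [0.0, 0.5, 1.0]])    # Microsoft links only to itself

    v = np.full(3, 1/3)
    for _ in range(50):
        v = M_trap @ v

    print(v.round(3))    # approaches [0, 0, 1]: the trap captures all the importance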

Slide 34 A Dead End
A similar example, called a dead end: now Microsoft has no outlinks at all, so its column is all zeros:
          ½ ½ 0
    M  =  ½ 0 0
          0 ½ 0
The Transition Graph is: Y→Y (½), Y→A (½), A→Y (½), A→M (½); M has no outgoing links.
M represents any set of web pages that does not have out-links.

Slide 35 A Dead End, ctd A dead end matrix is not stochastic, because the column for M sums to 0 rather than 1. The only vector satisfying v = Mv for a dead end matrix is the zero vector. Relaxation arrives at the zero vector because a random walker eventually arrives at M and then has nowhere to go.

Slide 36 What to do?
In these cases, which happen all the time on the web, the web walker algorithm does not identify which pages are truly important. But we can tweak the algorithm to do so: every 5th step, or so, the walker jumps to a random page on the web. Then the walk (spider trap example) becomes
                    ½ ½ 0                   ⅓
    P_new  =  0.8 · ½ 0 0 · P_old  +  0.2 · ⅓
                    0 ½ 1                   ⅓
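
A short NumPy sketch of this tweaked iteration, verifying that it converges to the solution given on slide 38 (7/33, 5/33, 21/33):

    import numpy as np

    M_trap = np.array([[0.5, 0.5, 0.0],
                       [0.5, 0.0, 0.0],
                       [0.0, 0.5, 1.0]])
    teleport = np.full(3, 1/3)            # uniform jump distribution
    beta = 0.8                            # probability of following a link

    v = np.full(3, 1/3)
    for _ in range(100):
        v = beta * (M_trap @ v) + (1 - beta) * teleport

    print(v)                              # ~[0.2121, 0.1515, 0.6364]
    print(np.array([7, 5, 21]) / 33)      # matches 7/33, 5/33, 21/33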

Slide 37 Teleporter Now our tweaked random walker is a teleporter. With probability 80%* s/he follows a random link from the current page, as before. But with probability 20% s/he teleports to a random page with uniform probability. –It could be anywhere on the web, even the current page If s/he is at a dead end, with 100% probability s/he teleports to a random page with uniform probability. *The 80%/20% split is a tunable parameter

Slide 38 Solving the Teleporter Equation The equation on slide 36 describes the teleporter's walk. It can be solved using relaxation or Gaussian elimination. The solution is (7/33, 5/33, 21/33). It gives unreasonably high importance to M, but does recognize that Y is more important than A.