Web Information Retrieval. Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Notes revised by X. Meng for SEU, May 2014.



Acknowledgement Contents of lectures and projects are extracted and organized from many sources, including Professor Manning’s lecture notes for the textbook, and notes and examples from others, including Professor Bruce Croft of UMass, Professor Raymond Mooney of UT Austin, Professor David Yarowsky of Johns Hopkins University, Professor David Grossman of IIT, and Professor Brian Davison of Lehigh.

In this segment: Web search engine architecture –Text collection –Text processing –Indexing –Ranking –User interface –Storage Link analysis –Hubs and authorities: HITS –PageRank

Search Engine Architecture

System Architecture

Collecting Text Basic algorithm: –Initialize queue with URLs of known seed pages –Repeat: take a URL from the queue; fetch and parse the page specified by the URL; extract URLs from the page; add the URLs to the queue –Until the stop condition is met Possible stop conditions: page count reached, time limit reached, or a combination of these conditions Fundamental assumption: the Web is well connected.
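The crawl loop above can be sketched in a few lines of Python. This is a minimal sketch, not a real crawler: the get_links callback stands in for the fetch-and-parse step, and the toy WEB dict is a made-up in-memory stand-in for actual pages.

```python
from collections import deque

def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first crawl: repeatedly take a URL from the queue,
    "fetch and parse" it via get_links, and enqueue newly seen URLs."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:  # stop condition: page count
        url = queue.popleft()                  # take a URL from the queue
        visited.append(url)
        for link in get_links(url):            # extract URLs from the page
            if link not in seen:               # add unseen URLs to the queue
                seen.add(link)
                queue.append(link)
    return visited

# Toy in-memory "web" standing in for real fetch-and-parse.
WEB = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}

print(crawl(["a"], lambda u: WEB.get(u, [])))  # ['a', 'b', 'c', 'd']
```

A time-limit stop condition would simply add a clock check to the while condition.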

Processing Text Parsing text Removing stop words Selecting index terms Stemming terms

Processing Text – parsing text In English, words are separated by delimiters such as spaces In Chinese, it is more challenging to separate words (or characters): the two characters 和尚 can be treated as one word meaning ‘monk,’ or as a sequence of two words meaning ‘and’ ( 和, 以及 ) and ‘still, not yet’ ( 尚未 ).

Chinese: No White Space

Processing Text – selecting index terms Converting documents to a collection of index terms, which is typically a subset of all terms. Why? –Matching the exact string of characters typed by the user is too restrictive, i.e., it doesn’t work very well in terms of effectiveness –Not all words are of equal value in a search –Sometimes it is not clear where words begin and end; it is not even clear what a word is in some languages –e.g., Chinese, Korean

Processing Text – removing stop words Function words, e.g., the, so, have little meaning on their own So do words with high occurrence frequencies in a collection, e.g., computer in a computer science textbook These are treated as stopwords (i.e., removed) –reduces index space, improves response time, and can improve effectiveness Some special cases have to be considered –e.g., “to be or not to be”

Processing Text – removing stop words A stopword list can be created from high-frequency words or based on a standard list Lists are customized for applications, domains, and even parts of documents –e.g., “click” is a good stopword for anchor text The best policy is to index all words in documents and make decisions about which words to use at query time

Text Processing – stemming Many morphological variations of words –inflectional (plurals, tenses) –derivational (making verbs into nouns, etc.) In most cases, these have the same or very similar meanings Stemmers attempt to reduce morphological variations of words to a common stem –usually involves removing suffixes Can be done at indexing time or as part of query processing (like stopwords)
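As a toy illustration of suffix removal (this is not the Porter algorithm or any real stemmer, just a sketch of the idea), one might strip a few common English suffixes:

```python
def simple_stem(word):
    """Strip a few common English suffixes; a toy sketch, far less
    careful than real stemmers such as Porter's."""
    word = word.lower()
    for suffix, replacement in [("sses", "ss"), ("ies", "y"),
                                ("ing", ""), ("ed", ""), ("s", "")]:
        # only strip if a reasonably long stem (>= 3 chars) remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

print([simple_stem(w) for w in ["connected", "connecting", "ponies"]])
# ['connect', 'connect', 'pony']
```

Note the sketch's limits: “connections” stems only to “connection,” so real stemmers apply several passes of more careful rules.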

Building Inverted Index After selecting index terms and applying a stemming algorithm, if necessary, an inverted index is built (Figure: a simple inverted indexing system)
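A minimal sketch of such an index: each term maps to a sorted list of document IDs (its postings). The three sample documents and the whitespace tokenizer are illustrative assumptions; a real system would also store positions and frequencies and compress the postings.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> text.
    Returns a dict mapping term -> sorted list of doc_ids (postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # trivial whitespace tokenizer
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog",
}
index = build_inverted_index(docs)
print(index["quick"])  # [1, 3]
print(index["dog"])    # [2, 3]
```

Conjunctive queries then reduce to intersecting postings lists, which is why they are kept sorted.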

Ranking Many results may “match” the query terms How to rank these results? Possible solutions: –No ranking –Ranking by the number of times a query term appears in documents –Ranking by some weighted average –…
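The second option, ranking by query-term counts, is easy to sketch (the sample documents and whitespace tokenizer are made up for illustration):

```python
from collections import Counter

def rank_by_term_count(query, docs):
    """Score each document by the total number of occurrences of the
    query terms in it; return doc ids sorted best-first."""
    terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        scores[doc_id] = sum(counts[t] for t in terms)
    return sorted(docs, key=lambda d: scores[d], reverse=True)

docs = {
    1: "web search engines rank web pages",
    2: "ranking by counting is naive",
    3: "web web web",
}
print(rank_by_term_count("web", docs))  # [3, 1, 2]
```

The example also shows the weakness of raw counts: document 3 wins by repetition, which is why weighted schemes (and eventually link analysis) are needed.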

User Interface Display one huge list 1 … n? Display k results per page? Allow the user to jump to certain pages directly? Allow user feedback for refining the search results?

Experiment with Code Let’s now examine the code example for a simple server program that can take a user input

Link Analysis 18

Link Analysis Links are a key component of the Web Important for navigation, but also for search –e.g., a hyperlink whose anchor text reads “Example website” –“Example website” is the anchor text; the URL the link points to is the destination link –both are used by search engines

Anchor Text Used as a description of the content of the destination page –i.e., collection of anchor text in all links pointing to a page can be used as an additional text field Anchor text tends to be short, descriptive, and similar to query text Retrieval experiments have shown that anchor text has significant impact on effectiveness for some types of queries –i.e., more than PageRank

Authorities Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. In-degree (number of pointers to a page) is one simple measure of authority. However, in-degree treats all links as equal. Should links from pages that are themselves authoritative count more?

Hubs Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities). Hub pages for IR are included in the course home page.

HITS Hyperlink-Induced Topic Search Algorithm developed by Jon Kleinberg of Cornell in 1998. Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. Based on mutually recursive facts: –Hubs point to lots of authorities. –Authorities are pointed to by lots of hubs.

Hubs and Authorities Together they tend to form a bipartite graph: hubs on one side pointing to authorities on the other.

HITS Algorithm Computes hubs and authorities for a particular topic specified by a normal query. First determines a set of relevant pages for the query, called the base set S. Then analyzes the link structure of the web subgraph defined by S to find authority and hub pages in this set.

Constructing a Base Subgraph For a specific query Q, let the set of documents returned by a standard search engine (e.g., Google) be called the root set R. Initialize S to R. Add to S all pages pointed to by any page in R. Add to S all pages that point to any page in R.

Base Limitations To limit computational expense: –Limit the number of root pages to the top 200 pages retrieved for the query. –Limit the number of “back-pointer” pages to a random set of at most 50 pages returned by a “reverse link” query. To eliminate purely navigational links: –Eliminate links between two pages on the same host. To eliminate “non-authority-conveying” links: –Allow only m (m ≈ 4–8) pages from a given host as pointers to any individual page.

Authorities and In-Degree Even within the base set S for a given query, the nodes with highest in-degree are not necessarily authorities (may just be generally popular pages like Facebook or Amazon). True authority pages are pointed to by a number of hubs (i.e., pages that point to lots of authorities).

Iterative Algorithm Use an iterative algorithm to slowly converge on a mutually reinforcing set of hubs and authorities. Maintain for each page p ∈ S: –Authority score: a_p (vector a) –Hub score: h_p (vector h) Initialize all a_p = h_p = 1 Maintain normalized scores: Σ_p a_p² = 1 and Σ_p h_p² = 1

HITS Update Rules Authorities are pointed to by lots of good hubs: a_p = Σ_{q: q→p} h_q Hubs point to lots of good authorities: h_p = Σ_{q: p→q} a_q

Illustrated Update Rules For example, a_4 = h_1 + h_2 + h_3 and h_4 = a_5 + a_6 + a_7

HITS Iterative Algorithm Initialize for all p ∈ S: a_p = h_p = 1 For i = 1 to k: –For all p ∈ S: a_p = Σ_{q: q→p} h_q (update authority scores) –For all p ∈ S: h_p = Σ_{q: p→q} a_q (update hub scores) –For all p ∈ S: a_p = a_p/c, with c chosen so that Σ_p a_p² = 1 (normalize a) –For all p ∈ S: h_p = h_p/c, with c chosen so that Σ_p h_p² = 1 (normalize h)

Convergence The algorithm converges to a fixed point if iterated indefinitely. Define A to be the adjacency matrix for the subgraph defined by S: –A_ij = 1 for i, j ∈ S iff i → j The authority vector a converges to the principal eigenvector of AᵀA; the hub vector h converges to the principal eigenvector of AAᵀ. In practice, 20 iterations produce fairly stable results.

Results (late 1990s) Authorities for query “Java”: –java.sun.com –comp.lang.java FAQ Authorities for query “search engine”: –Yahoo.com –Excite.com –Lycos.com –Altavista.com Authorities for query “Gates”: –Microsoft.com –roadahead.com

Result Comments In most cases, the final authorities were not in the initial root set generated using Altavista. Authorities were brought in from linked and reverse-linked pages, and HITS then computed their high authority scores.

Finding Similar Pages Using Link Structure Given a page P, let R (the root set) be t (e.g., 200) pages that point to P. Grow a base set S from R. Run HITS on S. Return the best authorities in S as the best similar pages for P. This finds authorities in the “link neighborhood” of P.

Similar Page Results Given “honda.com”: –toyota.com –ford.com –bmwusa.com –saturncars.com –nissanmotors.com –audi.com –volvocars.com

HITS for Clustering An ambiguous query can result in the principal eigenvector only covering one of the possible meanings. Non-principal eigenvectors may contain hubs & authorities for other meanings. Example: “jaguar”: –Atari video game (principal eigenvector) –NFL football team (2nd non-principal eigenvector) –Automobile (3rd non-principal eigenvector)

In-Class Work Working with your neighbor(s), compute the first round of results from the HITS algorithm for the graph represented by the following adjacency matrix (A_ij = 1 iff page i points to page j): row 1: 0 1 0 0; row 2: 0 0 0 1; row 3: 1 1 0 0; row 4: 1 0 1 0

Work in Progress Initially a1 = a2 = a3 = a4 = 1 and h1 = h2 = h3 = h4 = 1 Authority updates: a1 = h3 + h4 = 2, a2 = h1 + h3 = 2, a3 = h4 = 1, a4 = h2 = 1 Normalization: c² = Σ a_i² = 4 + 4 + 1 + 1 = 10, so c = √10 ≈ 3.16 a1/c = a2/c = 2/3.16 ≈ 0.63; a3/c = a4/c = 1/3.16 ≈ 0.32
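The in-class computation can be checked with a short script. This sketch assumes the 4-node graph implied by the worked authority scores (edges 1→2, 2→4, 3→1, 3→2, 4→1, 4→3) and follows the slide algorithm: update authorities from hub scores, update hubs from the new authority scores, then normalize each vector to unit length.

```python
import math

# adjacency: edges[i] lists the pages that page i points to (1-indexed)
edges = {1: [2], 2: [4], 3: [1, 2], 4: [1, 3]}
a = {p: 1.0 for p in edges}   # authority scores, initialized to 1
h = {p: 1.0 for p in edges}   # hub scores, initialized to 1

# one HITS round
a = {p: sum(h[q] for q in edges if p in edges[q]) for p in a}  # authority update
h = {p: sum(a[q] for q in edges[p]) for p in h}                # hub update
ca = math.sqrt(sum(v * v for v in a.values()))                 # normalize a
ch = math.sqrt(sum(v * v for v in h.values()))                 # normalize h
a = {p: v / ca for p, v in a.items()}
h = {p: v / ch for p, v in h.items()}

print({p: round(v, 2) for p, v in a.items()})
# {1: 0.63, 2: 0.63, 3: 0.32, 4: 0.32}
```

The printed authority scores match the hand computation above.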

PageRank Billions of web pages, some more informative than others Links can be viewed as information about the popularity (authority?) of a web page –can be used by ranking algorithm Inlink count could be used as simple measure Link analysis algorithms like PageRank provide more reliable ratings –less susceptible to link spam

Random Surfer Model Browse the Web using the following algorithm: –Choose a random number r between 0 and 1 –If r < λ: Go to a random page –If r ≥ λ: Click a link at random on the current page –Start again PageRank of a page is the probability that the “random surfer” will be looking at that page –links from popular pages will increase PageRank of pages they point to

Dangling Links Random jumps prevent getting stuck on pages that –have no links –contain only links that no longer point to other pages –have links forming a loop Links that point to the first two types of pages are called dangling links –they may also be links to pages that have not yet been crawled

PageRank PageRank (PR) of page C = PR(A)/2 + PR(B)/1 More generally, PR(u) = Σ_{v ∈ B_u} PR(v)/L_v –where B_u is the set of pages that point to u, and L_v is the number of outgoing links from page v (not counting duplicate links)

PageRank We don’t know the PageRank values at the start Assume equal values (1/3 in this case), then iterate (here A points to B and C, B points to C, and C points to A): –first iteration: PR(C) = 0.33/2 + 0.33 = 0.5, PR(A) = 0.33, and PR(B) = 0.17 –second: PR(C) = 0.33/2 + 0.17 = 0.33, PR(A) = 0.5, PR(B) = 0.17 –third: PR(C) = 0.42, PR(A) = 0.33, PR(B) = 0.25 Converges to PR(C) = 0.4, PR(A) = 0.4, and PR(B) = 0.2

PageRank Taking the random page jump into account, there is a 1/3 chance of going to any particular page when r < λ PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1) More generally, PR(u) = λ/N + (1 − λ) · Σ_{v ∈ B_u} PR(v)/L_v –where N is the number of pages and λ is typically 0.15

PageRank Algorithm Let S be the total set of pages Let E(p) = λ/|S| for all p ∈ S (for some 0 < λ < 1, e.g., 0.15) Initialize R(p) = 1/|S| for all p ∈ S Until ranks do not change (much) (convergence): –For each p ∈ S: R′(p) = Σ_{q: q→p} R(q)/N_q + E(p), where N_q is the number of outgoing links from q –For each p ∈ S: R(p) = cR′(p), with c chosen so the ranks sum to 1 (normalize)

In-Class Work Working with your neighbor(s), compute the first round of results from the PageRank algorithm for the graph specified by the following adjacency matrix (A_ij = 1 iff page i points to page j): row 1: 0 0 1 1; row 2: 1 0 1 0; row 3: 0 0 0 1; row 4: 0 1 0 0

Work in Progress λ = 0.15, |S| = 4, E(p) = λ/|S| = 0.0375 Out-link counts: N_q = (2, 2, 1, 1) Initially R = 1/|S| = 0.25 for every page R′(1) = R_2/N_2 + E(p) = 0.25/2 + 0.0375 = 0.1625 R′(2) = R_4/N_4 + E(p) = 0.25/1 + 0.0375 = 0.2875 R′(3) = R_1/N_1 + R_2/N_2 + E(p) = 0.25/2 + 0.25/2 + 0.0375 = 0.2875 R′(4) = R_1/N_1 + R_3/N_3 + E(p) = 0.25/2 + 0.25/1 + 0.0375 = 0.4125
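A few lines reproduce this first PageRank round. The graph here (edges 1→3, 1→4, 2→1, 2→3, 3→4, 4→2) is the one implied by the worked update rules, and the script computes R′ before normalization, exactly as in the in-class exercise:

```python
lam = 0.15
edges = {1: [3, 4], 2: [1, 3], 3: [4], 4: [2]}  # out-links per page
n = len(edges)
r = {p: 1.0 / n for p in edges}  # initial rank R(p) = 1/|S| = 0.25
e = lam / n                      # E(p) = lambda/|S| = 0.0375

# one update: R'(p) = sum over in-links q -> p of R(q)/N_q, plus E(p)
r_new = {p: e + sum(r[q] / len(edges[q]) for q in edges if p in edges[q])
         for p in edges}
print({p: round(v, 4) for p, v in r_new.items()})
# {1: 0.1625, 2: 0.2875, 3: 0.2875, 4: 0.4125}
```

Iterating the same update (with normalization) until the values stop changing gives the converged PageRank vector.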

Link Quality Link quality is affected by spam and other factors –e.g., link farms created to increase PageRank –trackback links in blogs can create loops –links from the comments section of popular blogs Blog services modify comment links to contain a rel="nofollow" attribute –e.g., a comment link with anchor text “Come visit my web page.”

Trackback Links

Information Extraction Automatically extract structure from text –annotate document using tags to identify extracted structure Named entity recognition –identify words that refer to something of interest in a particular application –e.g., people, companies, locations, dates, product names, prices, etc.

Named Entity Recognition Example showing semantic annotation of text using XML tags Information extraction also includes document structure and more complex features such as relationships and events

Named Entity Recognition Rule-based –Uses lexicons (lists of words and phrases) that categorize names e.g., locations, peoples’ names, organizations, etc. –Rules are also used to verify or find new entity names e.g., “<number> <word> street” for addresses “<street address>, <city>” or “in <city>” to verify city names “<street address>, <city>, <state>” to find new cities “<title> <name>” to find new names

Named Entity Recognition Rules are either developed manually by trial and error or learned using machine learning techniques Statistical –uses a probabilistic model of the words in and around an entity –probabilities estimated using training data (manually annotated text) –Hidden Markov Model (HMM) is one approach

HMM for Extraction Resolve ambiguity in a word using context –e.g., “marathon” is a location or a sporting event, “boston marathon” is a specific sporting event Model context using a generative model of the sequence of words –Markov property: the next word in a sequence depends only on a small number of the previous words

HMM for Extraction Markov Model describes a process as a collection of states with transitions between them –each transition has a probability associated with it –next state depends only on current state and transition probabilities Hidden Markov Model –each state has a set of possible outputs –outputs have probabilities

HMM Sentence Model Each state is associated with a probability distribution over words (the output)

HMM for Extraction Could generate sentences with this model To recognize named entities, find the sequence of “labels” that gives the highest probability for the sentence –only the outputs (words) are visible or observed –states are “hidden” –e.g., the Viterbi algorithm is used for recognition
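A tiny Viterbi decoder illustrates the idea. Everything here is a made-up toy model, not a trained one: two states (“B” for background words, “L” for location/event words) and hand-picked start, transition, and emission probabilities.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the words."""
    # best[t][s] = (prob of best path ending in state s at time t, backpointer)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.01), None) for s in states}]
    for t in range(1, len(words)):
        best.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s].get(words[t], 0.01), p)
                for p in states)
            best[t][s] = (prob, prev)
    # backtrack from the best final state
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(words) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return list(reversed(path))

states = ["B", "L"]  # background vs. location/event (toy label set)
start_p = {"B": 0.8, "L": 0.2}
trans_p = {"B": {"B": 0.8, "L": 0.2}, "L": {"B": 0.6, "L": 0.4}}
emit_p = {"B": {"the": 0.5, "boston": 0.1, "marathon": 0.1},
          "L": {"the": 0.01, "boston": 0.4, "marathon": 0.3}}

print(viterbi(["the", "boston", "marathon"], states, start_p, trans_p, emit_p))
# ['B', 'L', 'L']
```

With these toy numbers the decoder labels “boston marathon” as an entity span while “the” stays background, showing how context (the transition from “boston”) resolves the ambiguity of “marathon.”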

Named Entity Recognition Accurate recognition requires about 1M words of training data (1,500 news stories) –may be more expensive than developing rules for some applications Both rule-based and statistical can achieve about 90% effectiveness for categories such as names, locations, organizations –others, such as product name, can be much worse

Internationalization 2/3 of the Web is in English About 50% of Web users do not use English as their primary language Many (maybe most) search applications have to deal with multiple languages –monolingual search: search in one language, but with many possible languages –cross-language search: search in multiple languages at the same time

Internationalization Many aspects of search engines are language-neutral Major differences: –Text encoding (converting to Unicode) –Tokenizing (many languages have no word separators) –Stemming Cultural differences may also impact interface design and features provided

Chinese “Tokenizing”
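A classic baseline for tokenizing text that has no word separators is greedy forward maximum matching against a dictionary. This is a sketch with a toy four-entry dictionary (real segmenters use much larger lexicons and statistical models); note how it resolves the 和尚 ambiguity mentioned earlier by always preferring the longest dictionary entry.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word starting there (single char as fallback)."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary: 和尚 "monk", 和 "and", 尚未 "not yet", 未 "not"
dictionary = {"和尚", "和", "尚未", "未"}
print(max_match("和尚未", dictionary))  # ['和尚', '未']
```

The greedy choice picks 和尚 + 未 (“monk” + “not”) and never considers the alternative reading 和 + 尚未 (“and” + “not yet”), which is exactly why statistical segmenters outperform pure dictionary matching.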