Hyperlink Analysis for the Web

Information Retrieval
Input: Document collection
Goal: Retrieve documents or text with information content that is relevant to the user's information need
Two aspects:
1. Processing the collection
2. Processing queries (searching)

Classic information retrieval
Ranking is a function of query term frequency within the document (tf) and across all documents (idf)
This works because of the following assumptions in classical IR:
– Queries are long and well specified: "What is the impact of the Falklands war on Anglo-Argentinean relations"
– Documents (e.g., newspaper articles) are coherent, well authored, and usually about one topic
– The vocabulary is small and relatively well understood
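A minimal sketch of the tf-idf scoring described above, in Python (the toy corpus and function name are illustrative, not from the slides):

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Score each document by the summed tf-idf of the query terms.
    docs: list of token lists; raw term frequency, idf = log(N / df)."""
    n = len(docs)
    tfs = [Counter(doc) for doc in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())                 # document frequency per term
    return [sum(tf[t] * math.log(n / df[t]) for t in query_terms if t in tf)
            for tf in tfs]

docs = [["falklands", "war", "argentina"],
        ["war", "history"],
        ["argentina", "football"]]
print(tf_idf_scores(["falklands", "war"], docs))  # doc 0 scores highest
```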

Web information retrieval
None of these assumptions hold:
– Queries are short: 2.35 terms on average
– Huge variety in documents: language, quality, duplication
– Huge vocabulary: hundreds of millions of terms
– Deliberate misinformation
Ranking is a function of the query terms and of the hyperlink structure

Connectivity-based ranking
Ranking returned documents:
– Query-dependent ranking
– Query-independent ranking
Hyperlink analysis
– Idea: Mine the structure of the web graph
– Each web page is a node
– Each hyperlink is a directed edge

Query-dependent ranking
Assigns a score that measures the quality and relevance of a selected set of pages to a given user query.
The basic idea is to build a query-specific graph, called a neighborhood graph, and perform hyperlink analysis on it.

Building a neighborhood graph
A start set of documents matching the query is fetched from a search engine (say, the top 200 matches).
The start set is augmented by its neighborhood: the set of documents that either hyperlink to or are hyperlinked to by documents in the start set.
– Since the indegree of nodes can be very large, in practice only a limited number of these documents (say, 50) is included.
Each document in the start set and the neighborhood is modeled by a node. There is an edge from node A to node B if and only if document A hyperlinks to document B.
– Hyperlinks between pages on the same Web host can be omitted.

[Figure: Neighborhood graph. The query results form the start set (Result 1 … Result n); pages they link to form the forward set (f1 … fs) and pages linking to them form the back set (b1 … bm). There is an edge for each hyperlink, but no edges within the same host. This is the subgraph associated with each query.]
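A sketch of this construction in Python; `search`, `out_links`, and `in_links` are assumed helpers standing in for the search engine and a connectivity index:

```python
from urllib.parse import urlparse

def build_neighborhood_graph(query, search, out_links, in_links,
                             start_size=200, max_in=50):
    """Start set = top search results; augment with forward and back
    neighbors (back set capped per page); omit same-host edges."""
    start = list(search(query))[:start_size]
    nodes = set(start)
    for page in start:
        nodes.update(out_links(page))            # forward set
        nodes.update(in_links(page)[:max_in])    # back set, capped
    host = lambda url: urlparse(url).netloc
    edges = {(a, b) for a in nodes for b in out_links(a)
             if b in nodes and host(a) != host(b)}
    return nodes, edges
```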

HITS [Kleinberg '98]
Goal: Given a query, find:
– Good sources of content (authorities)
– Good sources of links (hubs)

Intuition
Authority comes from in-edges. Being a good hub comes from out-edges.
Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities.

[Figure: a page p with in-edges from pages q1 … qk, which determine its authority score A, and out-edges to pages r1 … rk, which determine its hub score H.]

HITS details
Iteratively update, for every node p:
– Authority score: A(p) = sum of H(q) over all pages q that link to p
– Hub score: H(p) = sum of A(r) over all pages r that p links to
Normalize the A and H vectors after each iteration.

HITS
Kleinberg proved that the H and A vectors will eventually converge, i.e., that termination is guaranteed.
– In practice we found the vectors to converge in about 10 iterations.
Documents are ranked by hub and authority scores respectively.
The algorithm does not claim to find all relevant pages, since there may be some that have good content but have not been linked to by many authors.
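A minimal sketch of this iteration in Python (the toy graph is illustrative):

```python
import math

def hits(nodes, edges, iterations=10):
    """Iterate the authority/hub updates over (from, to) edge pairs,
    normalizing both score vectors after each pass."""
    A = {n: 1.0 for n in nodes}
    H = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        A = {n: sum(H[q] for (q, p) in edges if p == n) for n in nodes}
        H = {n: sum(A[r] for (p, r) in edges if p == n) for n in nodes}
        for S in (A, H):                     # normalize to unit length
            norm = math.sqrt(sum(v * v for v in S.values())) or 1.0
            for n in S:
                S[n] /= norm
    return A, H

nodes = {"a", "b", "c"}
edges = {("a", "c"), ("b", "c"), ("c", "a")}
A, H = hits(nodes, edges)
print(max(A, key=A.get))   # "c": linked to by both other pages
```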

Problems with the HITS algorithm (1)
Only a relatively small part of the Web graph is considered, so adding edges to a few nodes can change the resulting hub and authority scores considerably.
– It is relatively easy to manipulate these scores.

Problems with the HITS algorithm (2)
We often find that the neighborhood graph contains documents not relevant to the query topic. If these nodes are well connected, the topic drift problem arises:
– The most highly ranked authorities and hubs tend not to be about the original topic.
– For example, when running the algorithm on the query "jaguar and car", the computation drifted to the general topic "car" and returned the home pages of different car manufacturers as top authorities, and lists of car manufacturers as the best hubs.

Improvements
To avoid "undue weight" being given to the opinion of a single person:
– All the documents on a single host should have the same influence on the document they are connected to as a single document would.
Ideas:
– If there are k edges from documents on a first host to a single document on a second host, give each edge an authority weight of 1/k.
– If there are l edges from a single document on a first host to a set of documents on a second host, give each edge a hub weight of 1/l.
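A sketch of these edge weights in Python (`host` is an assumed URL-to-host helper, as in the neighborhood-graph sketch):

```python
from collections import Counter

def edge_weights(edges, host):
    """Authority weight 1/k when k docs on one host all point at the same
    target; hub weight 1/l when one doc points at l docs on the same host."""
    k = Counter((host(a), b) for (a, b) in edges)   # per (source host, target)
    l = Counter((a, host(b)) for (a, b) in edges)   # per (source, target host)
    auth_w = {(a, b): 1.0 / k[(host(a), b)] for (a, b) in edges}
    hub_w = {(a, b): 1.0 / l[(a, host(b))] for (a, b) in edges}
    return auth_w, hub_w
```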

Improvements
With these weights, the update rules become:
– A(p) = sum over pages q linking to p of auth_weight(q, p) · H(q)
– H(p) = sum over pages r that p links to of hub_weight(p, r) · A(r)

Improvements
To solve the topic drift problem, content analysis can be used.
Ideas:
– Eliminate non-relevant nodes from the graph
– Regulate the influence of a node based on its relevance

Improvements
Computing relevance weights for nodes:
– The documents in the start set are used to define a broader query, and every document in the graph is matched against this query.
– Specifically, the concatenation of the first 1000 words from each start-set document is taken as the query Q, and similarity(Q, D) is computed for every document D.
All nodes whose weights are below a threshold are pruned.

Improvements
Regulating the influence of a node:
– Let W[n] be the relevance weight of a node n.
– W[n] · A[n] is used instead of A[n] when computing the hub scores.
– W[n] · H[n] is used instead of H[n] when computing the authority scores.
This reduces the influence of less relevant nodes on the scores of their neighbors.
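As a sketch, inside the iteration loop of the hits() sketch above, the regulated updates weight each neighbor's contribution by its relevance (W is an assumed dict of relevance weights):

```python
# Each in-neighbor q contributes W[q] * H[q] to p's authority score,
# and each out-neighbor r contributes W[r] * A[r] to p's hub score.
A = {n: sum(W[q] * H[q] for (q, p) in edges if p == n) for n in nodes}
H = {n: sum(W[r] * A[r] for (p, r) in edges if p == n) for n in nodes}
```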

Google's approach
Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
⇒ Quality of a page is related to its in-degree
Recursion: Quality of a page is related to
– its in-degree, and to
– the quality of the pages linking to it
⇒ PageRank [BP '98]

Definition of PageRank
Consider the following infinite random walk (surf):
– Initially the surfer is at a random page
– At each step, the surfer proceeds
   – to a randomly chosen web page with probability d
   – to a randomly chosen successor of the current page with probability 1-d
The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.

PageRank (cont.)
By the random walk theorem, PageRank is the stationary probability of this Markov chain, i.e.,
PR(p) = d/n + (1 − d) · Σ over edges (q, p) of PR(q) / outdegree(q)
where n is the total number of nodes in the graph.

PageRank (cont.)
[Figure: pages A and B both link to page P; A has 4 out-links and B has 3.]
PageRank of P = d/n + (1 − d) · (1/4 · PageRank of A + 1/3 · PageRank of B)
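A minimal power-iteration sketch of this definition (d is the jump probability, as in the slides; the toy link structure mirrors the figure, with A having 4 out-links and B having 3; treating dangling pages as jumping uniformly is an added assumption):

```python
def pagerank(out_links, d=0.15, iterations=50):
    """PR(p) = d/n + (1-d) * sum over predecessors q of PR(q)/outdeg(q)."""
    nodes = list(out_links)
    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        nxt = {p: d / n for p in nodes}
        for q in nodes:
            succ = out_links[q]
            if succ:
                share = (1 - d) * pr[q] / len(succ)
                for p in succ:
                    nxt[p] += share
            else:                            # dangling page: uniform jump
                for p in nodes:
                    nxt[p] += (1 - d) * pr[q] / n
        pr = nxt
    return pr

out_links = {"A": ["P", "x1", "x2", "x3"], "B": ["P", "y1", "y2"],
             "P": ["A"], "x1": [], "x2": [], "x3": [], "y1": [], "y2": []}
print(round(pagerank(out_links)["P"], 3))
```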

PageRank
Used in Google's ranking function
Query-independent
Summarizes the "web opinion" of the page's importance

PageRank vs. HITS
PageRank:
– Computation: once for all documents and queries (offline)
– Query-independent: requires combination with query-dependent criteria
– Hard to spam
HITS:
– Computation: required for each query
– Query-dependent
– Relatively easy to spam
– Quality depends on the quality of the start set

We want top-ranking documents to be both relevant and authoritative

Relevance is modeled by cosine scores.
Authority is typically a query-independent property of a document.
– Assign a query-independent quality score in [0,1] to each document d.
– Denote this by g(d).

Net score
Consider a simple total score combining cosine relevance and authority:
net-score(q, d) = g(d) + cosine(q, d)
– Can use some other linear combination than an equal weighting
Now we seek the top K docs by net score.
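A one-line sketch of the combination (`g` and `cosine` are assumed callables; `alpha` generalizes the equal weighting mentioned above):

```python
def net_score(q, d, g, cosine, alpha=0.5):
    """Linear mix of static quality and query relevance; alpha = 0.5
    reproduces the equal weighting g(d) + cosine(q, d) up to scale."""
    return alpha * g(d) + (1 - alpha) * cosine(q, d)
```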

Top K by net score – fast methods
First idea: order all postings by g(d)
Key: this is a common ordering for all postings
Thus, we can concurrently traverse query terms' postings for
– Postings intersection
– Cosine score computation

Why order postings by g(d)?
Under g(d)-ordering, top-scoring docs are likely to appear early in postings traversal
In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early
– short of computing scores for all docs in postings
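A sketch of this traversal under stated assumptions: each postings list is already sorted by decreasing g(d), `cosine` is an assumed per-doc relevance score, and the set-based intersection stands in for a real postings merge:

```python
import heapq

def intersect(postings):
    """Yield docs present in every list; since all lists share the
    g(d)-ordering, docs come out best-quality-first."""
    rest = [set(pl) for pl in postings[1:]]
    for doc in postings[0]:
        if all(doc in s for s in rest):
            yield doc

def top_k_net_scores(postings, g, cosine, k=10, scan_limit=10000):
    """Keep a min-heap of the current best k; stop after scan_limit docs,
    which is reasonable because high-g docs appear early."""
    heap = []
    for seen, doc in enumerate(intersect(postings), start=1):
        heapq.heappush(heap, (g(doc) + cosine(doc), doc))
        if len(heap) > k:
            heapq.heappop(heap)          # drop the current worst
        if seen >= scan_limit:           # time-bound early stop
            break
    return sorted(heap, reverse=True)
```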

Champion lists in g(d)-ordering
Can combine champion lists with g(d)-ordering
Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_t,d
Seek top-K results from only the docs in these champion lists

High and low lists
For each term, we maintain two postings lists called high and low
– Think of high as the champion list
When traversing postings on a query, only traverse high lists first
– If we get more than K docs, select the top K and stop
– Else proceed to get docs from the low lists
Can be used even for simple cosine scores, without global quality g(d)
A means for segmenting the index into two tiers
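A sketch of the two-tier traversal, assuming `high` and `low` map each term to its postings list and scoring happens afterwards:

```python
def tiered_candidates(terms, high, low, k):
    """Collect candidates from the high (champion) tier first; fall back
    to the low tier only if fewer than k candidates were found."""
    candidates = set().union(*(set(high[t]) for t in terms))
    if len(candidates) < k:
        candidates |= set().union(*(set(low[t]) for t in terms))
    return candidates    # score and rank only these docs
```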

Impact-ordered postings
We only want to compute scores for docs whose wf_t,d is high enough
We sort each postings list by wf_t,d
Now: not all postings are in a common order!
How do we compute scores in order to pick off the top K?
– Two ideas follow

1. Early termination
When traversing t's postings, stop early after either
– a fixed number r of docs, or
– wf_t,d drops below some threshold
Take the union of the resulting sets of docs
– one from the postings of each query term
Compute scores only for docs in this union
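A sketch of this idea, assuming postings[t] is sorted by decreasing wf[t][d]:

```python
def early_termination_candidates(terms, postings, wf, r=1000, threshold=0.1):
    """Cut each impact-ordered list after r docs or once wf drops below
    the threshold, then union the survivors across query terms."""
    union = set()
    for t in terms:
        for rank, d in enumerate(postings[t]):
            if rank >= r or wf[t][d] < threshold:
                break
            union.add(d)
    return union    # full scores are computed only for these docs
```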

2. idf-ordered terms
When considering the postings of query terms, look at them in order of decreasing idf
– High-idf terms are likely to contribute most to the score
As we update the score contribution from each query term
– Stop if doc scores are relatively unchanged
Can apply to cosine or other net scores
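A sketch of idf-ordered processing (`contrib` is an assumed per-term, per-doc score contribution; summing the latest term's total contribution is a cheap stand-in for "scores relatively unchanged"):

```python
def idf_ordered_scores(terms, postings, idf, contrib, eps=1e-3):
    """Accumulate scores term by term, highest idf first; stop once a
    term's total contribution is negligible."""
    scores = {}
    for t in sorted(terms, key=lambda t: -idf[t]):
        delta = 0.0
        for d in postings[t]:
            c = contrib(t, d)
            scores[d] = scores.get(d, 0.0) + c
            delta += abs(c)
        if delta < eps:      # remaining low-idf terms barely matter
            break
    return scores
```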

Other applications
Web page collection
– The crawling process usually starts from a set of source Web pages. The Web crawler follows the source pages' hyperlinks to find more Web pages.
– This process is repeated on each new set of pages and continues until no more new pages are discovered or until a predetermined number of pages has been collected.
– The crawler has to decide in which order to collect hyperlinked pages that have not yet been crawled.
– The crawlers of different search engines make different decisions, and so collect different sets of Web documents. A crawler might try to preferentially crawl "high quality" Web pages, as sketched below.
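A sketch of a quality-first crawl in Python; `fetch`, `extract_links`, and `quality` are assumed helpers (quality could, for instance, be an estimated PageRank):

```python
import heapq

def crawl(seeds, fetch, extract_links, quality, max_pages=1000):
    """Always fetch the highest-estimated-quality uncrawled URL next."""
    frontier = [(-quality(u), u) for u in seeds]   # max-heap via negation
    heapq.heapify(frontier)
    seen, collected = set(seeds), []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        collected.append(page)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-quality(link), link))
    return collected
```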

Other applications
Web page categorization: geographical scope
– Whether a given Web page is of interest only to people in a given region or is of nation- or worldwide interest is an interesting problem for hyperlink analysis. For example, a weather-forecasting page is interesting only to the region it covers, while the Internal Revenue Service Web page may be of interest to U.S. taxpayers throughout the world.
– A page's hyperlink structure also reflects its range of interest. Local pages are mostly hyperlinked to by pages from the same region, while hyperlinks to pages of nationwide interest are roughly uniform throughout the country.
– This information lets search engines tailor query results to the region the user is in.

References
M. Henzinger, "Hyperlink Analysis for the Web," IEEE Internet Computing, 2001.
J. Cho, H. García-Molina, and L. Page, "Efficient Crawling through URL Ordering," Proc. Seventh Int'l World Wide Web Conf., Elsevier Science, New York, 1998.
S. Chakrabarti et al., "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text," Proc. Seventh Int'l World Wide Web Conf., Elsevier Science, New York, 1998.
K. Bharat and M. Henzinger, "Improved Algorithms for Topic Distillation in Hyperlinked Environments," Proc. 21st Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR 98), ACM Press, New York, 1998.
L. Page et al., "The PageRank Citation Ranking: Bringing Order to the Web," working paper, Stanford Digital Library Technologies, Stanford Univ., Palo Alto, Calif.
I. Varlamis et al., "THESUS, a Closer View on Web Content Management Enhanced with Link Semantics," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 6, June 2004.

Appendix

Random Walks
Random walk = discrete-time stochastic process over a graph G = (V, E) with a transition probability matrix P
– The random walk is at one node at any time, making node transitions at time steps t = 1, 2, …, with P_ij being the probability of going to node j when at node i
– The initial node is chosen according to some probability distribution q^(0) over V

Random Walks (cont.)
q^(t) = row vector whose i-th component is the probability that the chain is in node i at time t
q^(t+1) = q^(t) P  =>  q^(t) = q^(0) P^t
A stationary distribution is a probability distribution q such that q = q P (steady-state behavior)
Example:
– If P_ij = 1/degree(i) when (i, j) is in G and 0 otherwise, then q_i = degree(i)/2m
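A small Python check of this example on a toy undirected graph (the graph itself is illustrative):

```python
def is_stationary(adj):
    """Verify q_i = degree(i)/(2m) satisfies q = qP when
    P_ij = 1/degree(i) for each edge (i, j)."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2          # edge count
    q = {i: len(adj[i]) / (2 * m) for i in adj}
    qP = {j: sum(q[i] / len(adj[i]) for i in adj if j in adj[i]) for j in adj}
    return all(abs(q[j] - qP[j]) < 1e-12 for j in adj)

# Triangle plus one pendant node: degrees 3, 2, 2, 1
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(is_stationary(adj))   # True
```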

Random Walks (cont.)
Theorem: Under certain conditions:
– There exists a unique stationary distribution q with q_i > 0 for all i
– Let N(i, t) be the number of times the random walk visits node i in t steps. Then the fraction of steps the walk spends at i converges to q_i, i.e.,
lim_{t→∞} N(i, t) / t = q_i