Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented 19.11.2000 by Benjy Weinberger.

Slides:



Advertisements
Similar presentations
Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
Advertisements

1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
Information Networks Link Analysis Ranking Lecture 8.
Graphs, Node importance, Link Analysis Ranking, Random walks
Our purpose Giving a query on the Web, how can we find the most authoritative (relevant) pages?
Authoritative Sources in a Hyperlinked environment Jon M. Kleinberg
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Web Search – Summer Term 2006 VI. Web Search - Ranking (cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Link Analysis Ranking. How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would.
Introduction to PageRank Algorithm and Programming Assignment 1 CSC4170 Web Intelligence and Social Computing Tutorial 4 Tutor: Tom Chao Zhou
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
1 Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presented by Yongqiang Li Adapted from
Authoritative Sources in a Hyperlinked Environment Hui Han CSE dept, PSU 10/15/01.
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.
ICS 278: Data Mining Lecture 15: Mining Web Link Structure
Authoritative Sources in a Hyperlinked Environment By: Jon M. Kleinberg Presented by: Yemin Shi CS-572 June
Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presented By: Talin Kevorkian Summer June
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 April 2, 2006
Multimedia Databases SVD II. SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies SVD properties More case.
Advances & Link Analysis
Link Analysis, PageRank and Search Engines on the Web
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Link Structure and Web Mining Shuying Wang
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Prestige (Seeley, 1949; Brin & Page, 1997; Kleinberg,1997) Use edge-weighted, directed graphs to model social networks Status/Prestige In-degree is a good.
Link Analysis HITS Algorithm PageRank Algorithm.
CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
The PageRank Citation Ranking: Bringing Order to the Web Larry Page etc. Stanford University, Technical Report 1998 Presented by: Ratiya Komalarachun.
HITS – Hubs and Authorities - Hyperlink-Induced Topic Search A on the left is an authority A on the right is a hub.
Presented By: - Chandrika B N
Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg Presentation by Julian Zinn.
Google’s Billion Dollar Eigenvector Gerald Kruse, PhD. John ‘54 and Irene ‘58 Dale Professor of MA, CS and I T Interim Assistant Provost Juniata.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
Presented by, Lokesh Chikkakempanna Authoritative Sources in a Hyperlinked environment.
Link Analysis on the Web An Example: Broad-topic Queries Xin.
Overview of Web Ranking Algorithms: HITS and PageRank
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
Analysis of Link Structures on the World Wide Web and Classified Improvements Greg Nilsen University of Pittsburgh April 2003.
Authoritative Sources in a Hyperlinked Environment Jon M. Kleinberg ACM-SIAM Symposium, 1998 Krishna Venkateswaran 1.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.
1 CS 430: Information Discovery Lecture 5 Ranking.
CS 440 Database Management Systems Web Data Management 1.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
HITS Hypertext-Induced Topic Selection
Methods and Apparatus for Ranking Web Page Search Results
Search Engines and Link Analysis on the Web
Link-Based Ranking Seminar Social Media Mining University UC3M
Greg Nilsen University of Pittsburgh April 2003
Text & Web Mining 9/22/2018.
Lecture 22 SVD, Eigenvector, and Web Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CS 440 Database Management Systems
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Authoritative Sources in a Hyperlinked environment Jon M. Kleinberg
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Presentation transcript:

Order Out of Chaos Analyzing the Link Structure of the Web for Directory Compilation and Search. Presented by Benjy Weinberger

The Search Problem The WWW is a vast collection of information: over 1 billion text pages plus a multitude of multimedia files. Over a million new resources are added every day. How do we find the information we need in such a large collection? Search is the most common activity on the web after .

Types of Solutions There are basically two types of solution to the “search problem”: –Directories are static lists of web pages, ordered in a topic taxonomy. –Search Engines attempt to match web sites to a user’s query in real time. The distinction between the two types is becoming more and more fuzzy.

Types of Searches Two main kinds of searches: –Specific queries. E.g. “does Netscape 4.0 support the Java 1.1 code-signing API?” –Broad-topic queries. E.g. “find out information about Che Guevara.” Other types of searches exist, such as: –Similar-page queries. E.g. “find pages ‘similar’ to

Different Queries - Different Challenges Specific Queries: The difficulty is mostly that there are very few pages with the required information - the “needle in a haystack” problem. Broad-topic Queries: The difficulty is that there are too many relevant pages for humans to digest.

Broad-topic Queries Center around the notion of an authority - a page containing high-quality, authoritative information on the relevant topic. Problem: we are asking a computer to make a judgment on abstract and subjective qualities of “relevance” and “authority”.

Manual Solutions Manually edited catalogues of pages, ordered in a taxonomy of topics. Problems: –Completely unscalable. Only a tiny fraction of the web can be catalogued. –Subjective. Decisions made by individual editors.

Manual Solutions Nonetheless, solution is highly popular. Yahoo is still the #1 site on the web in terms of visits. The Open Directory Project is has over 2,000,000 sites in 250,000 categories, edited by tens of thousands of volunteers.

Automated Solutions - 1st Wave “Relevance” assumed to be associated with existence of keywords. “Authority” assumed to be associated with frequency and placement of keywords. –The more a page mentions the word the higher that page’s score. –Font size, proximity to other keywords and place in page may increase score.

Automated Solutions - 1st Wave Solution implemented by search engines such as AltaVista, HotBot, Lycos, Northern Light and many others. These solutions are inadequate: –The determination of “relevance” is inaccurate. –The determination of “authority” is inaccurate. –Don’t work for non-textual resources. Result: users still end up wading through thousands of results.

The Web as a Graph But... the web is more than just a set of pages. The hyperlinks between pages induce a digraph structure on the web. Hyperlinks are created by human site authors and therefore can imply some human determination of authority. But how can we mine this information for search?

Naive Approach Determine relevance by keyword matching. Determine authority by number of inbound links. Heuristic: –Find all pages matching keywords. –Rank them by in-degree.

Naive Approach - Problems Good authorities may not contain keywords. –Example: honda.com, toyota.com, ford.com do not contain the term “car manufacturers”. Doesn’t look at who is linking to the page. A highly popular page will be an authority on ANY query string it contains. –Example: cnn.com, yahoo.com, amazon.com.

Where Do We Stand The hyperlink structure of the web is constructed by the deliberate annotations of millions of web site authors. How can we mine this seemingly random link structure for underlying meaning?

Short Mathematical Interlude Given a square matrix A, an eigenvalue is a scalar with the property that there exists a vector such that. The vector is called an eigenvector. The normalized eigenvector corresponding to the largest eigenvalue is called the principal eigenvector.

Short Mathematical Interlude A stochastic matrix is a matrix with non- negative entries such that the entries in each row sum to 1. A stochastic matrix always has a principal eigenvector (with eigenvalue 1): All eigenvalues are <= 1 in absolute value.

Short Mathematical Interlude For any matrix A, and are symmetric. A symmetric real n-by-n matrix has n real eigenvalues (with multiplicity). For any vector x not orthogonal to the principal eigenvector y,

Short Mathematical Interlude The adjacency matrix of a directed graph G is the matrix A such that:

Google High-quality, highly popular search engine. Developed by Sergey Brin and Larry Page at Stanford. Later became a commercial venture. Combines new ranking algorithm with state-of-the-art engineering solutions for scalability and performance.

Google Offline processing assigns each site on the web a PageRank - a global ranking based on link structure analysis. Relevance determined by keywords. Authority determined by PageRank.

PageRank Intuitive description: a page has high rank if the sum of the ranks of pages pointing to it is high. This is a circular definition, represented mathematically by: where n is the total number of pages, deg(q) is q’s out degree and d is a damping factor.

PageRank PR is a probability vector on the web pages (0 <= PR(p) <= 1 and the sum is 1). PR corresponds to a random surfer model: –A surfer clicks at random through the web, without hitting “back”. When she gets bored, which happens with probability 1-d, she jumps to a random page. –PR(p) is the probability of hitting page p in this “random surf”.

PageRank We define a matrix A for which: Note that A is a stochastic matrix. The PageRank vector is an eigenvector of

PageRank It is not hard to see, using the properties of stochastic matrices, that B has a principal eigenvector. We can initialize PR to all ones and apply B iteratively until it converges to this eigenvector.

Google Advantages: –Superior results to other engines. –Responds to any query, not just predetermined topics. Disadvantages: –PageRank is global, not contextual. –Discriminates against “orphan” pages.

HITS HITS - Developed by Jon Kleinberg et al within the CLEVER project at IBM Almaden. Introduced the twin notions of Hubs and Authorities.

HITS - Hubs & Authorities An authority is a web site containing high- quality, highly-relevant information (with respect to the given search query). A hub is a web site that points to many related authorities. Intuitively, hubs help “pull” authorities together into a dense cluster of related sites.

HITS Hub (resp. authority) weights determine how good a hub (resp. authority) a site is. –The idea is to first find a focused subgraph of the web relevant to the specified topic. –Then the link structure of this subgraph is analyzed to find the hub and authority weights for each site, thus identifying the top authorities for this topic. –Processing requirement per topic implies that uses are mainly for directory compilation.

HITS The focused subgraph is created by first taking the t highest-ranked pages from a text-based search engine as a root set R. R is expanded into the base set S by taking all sites pointing to or pointed at by a site in R. Note that while R may fail to contain some “important” authorities, S will probably contain them.

HITS We have a circular definition: –A good authority is a site pointed to by many good hubs. –A good hub is a site that points to many good authorities. This is the mutually reinforcing relationship. How can we break this circularity?

HITS A static mathematical formulation: –For a web site p let h(p) be its hub weight and a(p) be its authority weight. We take the vectors a, h to be normalized. –We would like to find vectors a, h such that (up to normalization):

HITS If A is the adjacency matrix of the graph in question (the focused subgraph) then the constraints can be written as: By substitution we see that a is defined by:

HITS In other words, a is an eigenvector of B: B is the co-citation matrix: B(i,j) is the number of sites that jointly point to both i and j. B is symmetric and has n orthogonal unit eigenvectors. Which one do we take?

HITS Let’s look at the problem dynamically: –We initialize a(p) = h(p) = 1 for all p. –We iterate the following operations: And renormalize after each iteration.

HITS The eigenvectors of B are precisely the stationary points of this process. For any stationary point a, and therefore the authority weights are reinforced by the factor. Conclusion: The principal eigenvector represents the “densest cluster” within the focused subgraph.

HITS As we have said, by initializing a(p)=h(p)=1, a will converge to the principal eigenvector of B. –Initializing differently may lead to convergence to a different eigenvector. –In practice convergence is achieved after only iterations.

HITS The non-principal eigenvectors may represent other, less dense subclusters of the focused subgraph. Other subclusters can occur when: –A query string has several meanings –A concept is relevant to several communities. –The topic is highly polarized, involving groups that are not likely to link to one another (such as “abortion”).

HITS Unlike the principal eigenvector, the other eigenvectors have both positive and negative entries. Each non-principal eigenvector provides us with two densely connected clusters: The “most positive” and “most negative” entries of the vector. The larger the eigenvalue, the more “relevant” the corresponding eigenvector.

Spectral Filtering A generalization of HITS. Developed by Chakrabarti, Dom, Gibson, Raghavan and others. Some of these worked with Kleinberg on CLEVER at IBM Almaden. Used commercially in Quiver to mine community knowledge.

Spectral Filtering We have two kinds of entities: hubs and authorities. These can both be of the same type (such as web sites, cited documents) or of different types (people and products, bookmark folders and web sites). Note that we distinguish, a priori, between hubs and authorities.

Spectral Filtering The link structure consists of directed links from hubs to authorities. In mathematical terms we have a bipartite digraph (H;A;E) with all edges directed from H to A.

Spectral Filtering - Affinities For each hub i and authority j let a(i,j) >= 0 be the affinity of i for j. E.g: The percentage of common terms between documents or the number of visits a user makes to a bookmark. A=a(i,j) is the affinity matrix.

Spectral Filtering In SF the mutually reinforcing relationship becomes: Or:

Spectral Filtering Note that HITS is simply SF with each web site represented as both a hub and an authority, and with the following simple affinities:

Spectral Filtering The spectral analysis used in HITS applies here, and the principal eigenvector is used to determine the highest ranked authorities.

Spectral Filtering Notes: –The links from hubs to authorities represent the opinions of many people regarding the authorities. –SF mines this information for results reflecting the opinions of these people as a group. –The SF algorithm “pulls” the densely linked cluster together. –SF is relatively insensitive to “noise”.

References [1] J.Kleinberg. Authoritative Sources in a Hyperlinked Environment. Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, [2] S.Chakrabarti, B.Dom, D.Gibson, R.Kumar, P.Raghavan, S.Rajagopalan, A.Tomkins. Spectral Filtering for Resource Discovery. ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, [3] S.Brin, L.Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proc. 7th International WWW Conference, [4] S.Brin, R.Motwani,L.Page,T.Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Manuscript.

End Thank you for listening!