Roberto Battiti, Mauro Brunato

Slides:



Advertisements
Similar presentations
Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.
Advertisements

1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
Information Networks Link Analysis Ranking Lecture 8.
Graphs, Node importance, Link Analysis Ranking, Random walks
Link Analysis: PageRank
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Introduction to PageRank Algorithm and Programming Assignment 1 CSC4170 Web Intelligence and Social Computing Tutorial 4 Tutor: Tom Chao Zhou
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Multimedia Databases SVD II. Optimality of SVD Def: The Frobenius norm of a n x m matrix M is (reminder) The rank of a matrix M is the number of independent.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
ICS 278: Data Mining Lecture 15: Mining Web Link Structure
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 April 2, 2006
15-853Page :Algorithms in the Real World Indexing and Searching III (well actually II) – Link Analysis – Near duplicate removal.
Multimedia Databases SVD II. SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies SVD properties More case.
Link Analysis, PageRank and Search Engines on the Web
Link Structure and Web Mining Shuying Wang
Link Analysis. 2 HITS - Kleinberg’s Algorithm HITS – Hypertext Induced Topic Selection For each vertex v Є V in a subgraph of interest: A site is very.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Information Retrieval
Prestige (Seeley, 1949; Brin & Page, 1997; Kleinberg,1997) Use edge-weighted, directed graphs to model social networks Status/Prestige In-degree is a good.
Link Analysis HITS Algorithm PageRank Algorithm.
Overview of Web Data Mining and Applications Part I
CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure.
Chapter 8 Web Structure Mining Part-1 1. Web Structure Mining Deals mainly with discovering the model underlying the link structure of the web Deals with.
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
Stochastic Approach for Link Structure Analysis (SALSA) Presented by Adam Simkins.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Presented By: - Chandrika B N
R OBERTO B ATTITI, M AURO B RUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Feb 2014.
Piyush Kumar (Lecture 2: PageRank) Welcome to COT5405.
Google’s Billion Dollar Eigenvector Gerald Kruse, PhD. John ‘54 and Irene ‘58 Dale Professor of MA, CS and I T Interim Assistant Provost Juniata.
X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox Associate Dean for.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
1 CS 430: Information Discovery Lecture 9 Term Weighting and Ranking.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Google PageRank Algorithm
- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.
Chapter 6: Link Analysis
1 CS 430: Information Discovery Lecture 5 Ranking.
Roberto Battiti, Mauro Brunato
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Automated Information Retrieval
PageRank & Random Walk “The important of a Web page is depends on the readers interest, knowledge and attitudes…” –By Larry Page, Co-Founder of Google.
Quality of a search engine
HITS Hypertext-Induced Topic Selection
Search Engines and Link Analysis on the Web
Link-Based Ranking Seminar Social Media Mining University UC3M
Text & Web Mining 9/22/2018.
Roberto Battiti, Mauro Brunato
Roberto Battiti, Mauro Brunato
Roberto Battiti, Mauro Brunato
PageRank & Random Walk “The important of a Web page is depends on the readers interest, knowledge and attitudes…” –By Larry Page, Co-Founder of Google.
Degree and Eigenvector Centrality
Roberto Battiti, Mauro Brunato
Lecture 22 SVD, Eigenvector, and Web Search
Roberto Battiti, Mauro Brunato
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Roberto Battiti, Mauro Brunato
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CSE 538 Chapter 7: Social Network Analysis (From Bing Liu’s Book)
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
COMP5331 Web databases Prepared by Raymond Wong
Presentation transcript:

Roberto Battiti, Mauro Brunato Roberto Battiti, Mauro Brunato. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligent-optimization.org/LIONbook © Roberto Battiti and Mauro Brunato , 2015, all rights reserved. Slides can be used and modified for classroom usage, provided that the attribution (link to book website) is kept.

Text and web mining – part II Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified. (Vannevar Bush, 1945)

Using hyperlinks to rank web pages Problem: given a query, how to retrieve a set of high quality and relevant pages from the Web. In scientific communities, a paper is considered of good quality if it is cited by other good quality papers (citation analysis in Bibliometrics) Analogy: a candidate for employment is valued if many other valued people are prepared to recommend him

Using hyperlinks to rank web pages (2) Prestige in social networks: recommendations from (or relationship with) high-rank individuals (above) are more effective to reach a high rank than recommendations by low-rank ones (below).

Using hyperlinks to rank web pages (3) After a seminal paper by Marchiori about the importance of hyper-information (information in the hyperlinks), Larry Page and Sergey Brin developed the PageRank algorithm Same social networks principles, by substituting “recommendations” and “citations” with hyperlinks

Using hyperlinks to rank web pages (4) The prestige of a page is related to how many pages of prestige link to it. Note the recursive definition: to calculate the prestige, one needs to start from prestige values of other pages, and so on… Start with initial prestige values (random?) and iterate, if prestige values converge the problem is solved.

Using hyperlinks to rank web pages (5) What guarantee that the process converges, hopefully to the same limiting distribution, not depending on the initial distribution of values? The solution to this problem is related to basic linear algebra concepts of eigenvalues and eigenvectors, as well as Markov chains.

Using hyperlinks to rank web pages (6) Let’s use linear definitions To calculate rank of a page: Examine incoming links (the hyperlinks of other pages pointing to the given page). Each incoming link from page i contributes an addendum equal to the rank of i divided by the number of i’s outgoing links

Using hyperlinks to rank web pages (7) Recalculating the rank of a page in PageRank. The initial rank is distributed along the outgoing links (adapted from the original paper).

Using hyperlinks to rank web pages (8) New rank values pk at iteration k are obtained by a linear transformation of the previous values through a matrix M (derived by analyzing hyperlinks) After k recalculations: Matrix power

Using hyperlinks to rank web pages (9) Assume a basis of eigenvectors of M exists let be the n eigenvalues, and v1; v2; …; vn the corresponding eigenvectors M vi = λi vi λ1 is the dominant eigenvalue, so that |λ1| > |λ j| for j > 1. The initial vector p0 can be written as a linear combination of the basis vectors:

Using hyperlinks to rank web pages (10) By linearity and definition of eigenvectors: All terms tends to zero apart from the one proportional to the dominant eigenvector. A simple iteration of matrix multiplication, after starting from almost arbitrary initial conditions extracts the dominant eigenvector! Power method for obtaining the dominant eigenvector by the power of a matrix: Mk)

A surprising connection with web surfing Modeling the movement of a web surfer on the various web pages Let E be the adjacency matrix of the web: if and only if there is a link from page u to page v. What is the probability of the surfer being at page v after one step? Let be the out-degree of page u After defining

A surprising connection with web surfing (2) One obtains: After i steps: (again, iteration of matrix multiplication) If E is irreducible and aperiodic, p converges to the largest eigenvector the prestige (rank) of a page can be interpreted also as probability that a random surfer following links will be found at a given page

A surprising connection with web surfing (3) Real-world transition matrices: Web is not strongly connected, and that random walks can be trapped into cycles. “damping factor” corresponding to a user with an arbitrary probability d of going to a random page (even unconnected) at every step The transition becomes (1 is the identity matrix)

A surprising connection with web surfing (4) The eigenvector corresponding to the largest eigenvalue can be obtained: Normalization to avoid very large components and numerical problems. One is not interested in absolute prestige values but in relative ones

Identifying hubs and authorities: HITS A different analysis of the web. In a scientific community good articles are either seminal (i.e., many others reference to them) or surveys (i.e., they reference to many others). In the web. pages may be authorities or hubs. For example portals are very good hubs. Two score measures, called hubness and authority:

Identifying hubs and authorities: HITS (2) HITS algorithm (Hyperlink-Induced Topic Search) Given query q, let Rq be the root set returned by an IR system. The computation is performed only on this result set. The expanded set is formed by adding all nodes linked to the root set: Let Eq be the induced link subset, Gq = (Vq;Eq).

Identifying hubs and authorities: HITS (3) Authority and hub values are defined in terms of one another hub score hu proportional to sum of referred authorities, authority score au proportional to the sum of referring hubs

Identifying hubs and authorities: HITS (4) The iterated method in HITS: The top-ranking authorities and hubs are reported to the user. The principal eigenvector identifies the largest dense bipartite subgraph.

Clustering in web mining During web search, to avoid overloading the user, identify groups of closely related documents, show only a small number of representative prototypes. Queries can be ambiguous. For example, star: movie stars, or celestial objects? Mutual similarities in term vector space can help clustering.

Clustering in web mining (2) Retrieved pages can be characterized either internally by some intrinsic property (e.g., terms contained, coordinates in TF-IDF space) or externally by a measure of distance between pairs. Examples are: Euclidean distance, dot product, Jaccard coefficient. After defining the metric, the usual bottom-up or top-down clustering techniques can be used (see chapter on clustering).

Gist The Web is a vast expanse of data, some of it structured, some partially structured or not at all. Crawling and indexing are systematic methods to visit web pages, harvest the information contained therein and prepare data structures for searching, information retrieval and ranking. By transforming text into vectors of data (e.g., frequencies of selected words as in the vector-space model) some traditional ML techniques can be reused, but the richer amount of structure in web documents permits amore focused analysis.

Gist (2) Web-mining find explicit relationships between documents (web links), infer implicit ones (by clustering), rank the most relevant pages or identify the most relevant and well-connected persons in a network of people. Abstraction helps to use similar tools for networks of pages and networks of people. As a notable example, the use of hyperlinks and linear algebra tools (eigenvectors and eigenvalues), previously used to rank researchers in bibliometrics, leads to a very powerful technique to rank web pages, now at the basis of Google search-engine technology.