(C) 2003, The University of Michigan1 Information Retrieval Handout #8 February 25, 2005.

(C) 2003, The University of Michigan2 Course Information Instructor: Dragomir R. Radev Office: 3080, West Hall Connector Phone: (734) Office hours: M & Th 12-1 or via Course page: Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

(C) 2003, The University of Michigan3 Models of the Web

(C) 2003, The University of Michigan4 Size
The Web is the largest repository of data and it grows exponentially.
–320 Million Web pages [Lawrence & Giles 1998]
–800 Million Web pages, 15 TB [Lawrence & Giles 1999]
–8 Billion Web pages indexed [Google 2005]
Amount of data
–roughly 200 TB [Lyman et al. 2003]

(C) 2003, The University of Michigan5 Bow-tie model of the Web
Components (pages): SCC 56 M, IN 44 M, OUT 44 M, TEND 44 M, DISC 17 M
24% of pages reachable from a given page
[Broder & al. WWW 2000, Dill & al. VLDB 2001]

(C) 2003, The University of Michigan6 Power laws
Web site size (Huberman and Adamic 1999)
Power-law connectivity (Barabási and Albert 1999): exponents 2.45 for the out-degree and 2.1 for the in-degree
Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdős, collaboration graph of actors, metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002).
All values of gamma are around 2-3.

(C) 2003, The University of Michigan7 Small-world networks
Diameter = average length of the shortest path between all pairs of nodes.
Example: Milgram experiment (1967)
–Kansas/Omaha --> Boston (42/160 letters)
–diameter = 6
Albert et al.: the average distance d between two vertices grows with $\log_{10} n$. For $n = 10^9$, d=
Six degrees of separation
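
The slide's definition of diameter (average shortest-path length over all pairs of nodes) can be sketched with breadth-first search. This is a minimal illustration, not the method used in the cited studies; the function names and the toy graph are hypothetical.

```python
from collections import deque

def bfs_distances(graph, source):
    """Hop counts from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def average_path_length(graph):
    """Mean shortest-path length over all ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# Toy undirected path graph a - b - c - d (an assumption for illustration).
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
avg = average_path_length(path_graph)
```

Running one BFS per node costs O(n(n + m)) time, which is fine for small graphs but would need sampling on a Web-scale graph.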

(C) 2003, The University of Michigan8 Clustering coefficient
Cliquishness (c): the fraction of links that actually exist between the $k_v (k_v - 1)/2$ pairs of neighbors of a node v.
Examples (table columns: n, k, d, d_rand, C, C_rand): Actors, Power grid, C. elegans
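
As a rough sketch of the cliquishness definition above, the following counts how many of the k_v(k_v - 1)/2 neighbor pairs of a node are themselves linked. The graph `toy` and the function names are assumptions made for illustration; nodes with fewer than two neighbors are given a coefficient of 0 by convention here.

```python
def local_clustering(graph, v):
    """Fraction of the k_v(k_v - 1)/2 neighbor pairs of v that are linked."""
    nbrs = list(graph[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention chosen for this sketch
    links = 0
    for i in range(k):
        for j in range(i + 1, k):
            if nbrs[j] in graph[nbrs[i]]:
                links += 1
    return links / (k * (k - 1) / 2)

def average_clustering(graph):
    """Mean of the local coefficients over all nodes."""
    return sum(local_clustering(graph, v) for v in graph) / len(graph)

# Hypothetical undirected graph stored as adjacency sets.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
c_a = local_clustering(toy, "a")
c_avg = average_clustering(toy)
```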

(C) 2003, The University of Michigan9 Models of the Web
Erdős/Rényi 59, 60; Barabási/Albert 99; Watts/Strogatz 98; Kleinberg 98; Menczer 02; Radev 03
Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology

(C) 2003, The University of Michigan10 Self-triggerability across hyperlinks
Document closures for information retrieval
Self-triggerability [Mosteller & Wallace 84]
Word-frequency models: Poisson distribution; Two-Poisson [Bookstein & Swanson 74]; Negative Binomial, K-mixture [Church & Gale 95]
Triggerability across hyperlinks?
(Figure: p vs. p’ plots for the words "by, with, from" and "photo, dream, path")

(C) 2003, The University of Michigan11 Evolving Word-based Web
Observations:
–Links are made based on topics
–Topics are expressed with words
–Words are distributed very unevenly (Zipf, Benford, self-triggerability laws)
Model
–Pick n
–Generate n lengths according to a power-law distribution
–Generate n documents using a trigram model
Model (cont’d)
–Pick words in decreasing order of r.
–Generate hyperlinks with random directionality
Outcome
–Generates power-law degree distributions
–Generates topical communities
–Natural variation of PageRank: LexRank

(C) 2003, The University of Michigan12 Social network analysis for IR

(C) 2003, The University of Michigan13 Social networks
Induced by a relation
Symmetric or not
Examples:
–Friendship networks
–Board membership
–Citations
–Power grid of the US
–WWW

(C) 2003, The University of Michigan14 Krebs 2004

(C) 2003, The University of Michigan15 Prestige and centrality
Degree centrality: how many neighbors each node has.
Closeness centrality: how close a node is to all of the other nodes.
Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes.
Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.
Prestige = same as centrality but for directed graphs.
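
The first two measures above can be sketched in a few lines. This is one possible formulation and not necessarily the one used in the lecture: closeness is taken here as (number of reachable nodes) / (sum of shortest-path distances). The names `degree_centrality`, `closeness_centrality`, and the graph `star` are hypothetical.

```python
from collections import deque

def degree_centrality(graph):
    """Number of neighbors of each node."""
    return {v: len(nbrs) for v, nbrs in graph.items()}

def closeness_centrality(graph, v):
    """Reachable-node count divided by the total BFS distance from v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical star graph: one hub linked to three leaves.
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
degrees = degree_centrality(star)
hub_closeness = closeness_centrality(star, "hub")
leaf_closeness = closeness_centrality(star, "x")
```

In the star graph the hub scores highest on both measures, which matches the intuition that it is the most central node.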

(C) 2003, The University of Michigan16 Graph-based representations Square connectivity (incidence) matrix Graph G (V,E)

(C) 2003, The University of Michigan17 Markov chains
A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.
Path = sequence $(x_0, x_1, \ldots, x_n)$, with $x_i = x_{i-1} E$.
The probability of a path can be computed as a product of probabilities for each step i.
Random walk = find $x_j$ given $x_0$, E, and j.
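
A minimal sketch of the two computations described above, assuming the distribution is a row vector multiplied on the right by the kernel E. The kernel values and the function names are made up for illustration.

```python
def step(x, E):
    """One step of the chain: row vector x times kernel E."""
    n = len(E)
    return [sum(x[i] * E[i][j] for i in range(n)) for j in range(n)]

def distribution_after(x0, E, j):
    """Distribution after j steps, starting from x0."""
    x = list(x0)
    for _ in range(j):
        x = step(x, E)
    return x

def path_probability(x0, E, path):
    """Initial probability of the first state times each transition probability."""
    prob = x0[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= E[a][b]
    return prob

# Hypothetical 3-state kernel; every row sums to 1.
E = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
x0 = [1.0, 0.0, 0.0]
x2 = distribution_after(x0, E, 2)
prob_012 = path_probability(x0, E, [0, 1, 2])
```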

(C) 2003, The University of Michigan18 Stationary solutions
The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
–E is stochastic
–E is irreducible
–E is aperiodic
To make these conditions true:
–All rows of E add up to 1 (and no value is negative)
–Make sure that E is strongly connected
–Make sure that E is not bipartite
Example: PageRank [Brin and Page 1998]: use “teleportation”
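
One common way to satisfy all three conditions at once is to mix the link-following kernel with a uniform jump, in the spirit of the teleportation idea mentioned above. The sketch below is an assumption-laden illustration: the damping value 0.85 is a commonly quoted choice and is not taken from these slides, and the function name and toy adjacency matrix are hypothetical.

```python
def with_teleportation(adjacency, damping=0.85):
    """Row-stochastic kernel mixing link-following with uniform jumps."""
    n = len(adjacency)
    kernel = []
    for row in adjacency:
        out = sum(row)
        new_row = []
        for a in row:
            # A node without out-links jumps uniformly (assumption of this sketch).
            follow = a / out if out else 1.0 / n
            new_row.append(damping * follow + (1.0 - damping) / n)
        kernel.append(new_row)
    return kernel

# Hypothetical chain 0 -> 1 -> 2, where node 2 has no out-links.
adjacency = [[0, 1, 0],
             [0, 0, 1],
             [0, 0, 0]]
kernel = with_teleportation(adjacency)
```

Because every entry of the resulting kernel is strictly positive, the chain is irreducible and aperiodic, and each row sums to 1, so it is stochastic.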

(C) 2003, The University of Michigan Example This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph. t=0 t=1

(C) 2003, The University of Michigan20 Eigenvectors
An eigenvector is an implicit “direction” for a matrix.
$Mv = \lambda v$, where v is non-zero, though $\lambda$ can be any complex number in principle.
The largest eigenvalue of a stochastic matrix E is real: $\lambda_1 = 1$.
For $\lambda_1$, the left (principal) eigenvector is p, the right eigenvector = 1.
In other words, $E^T p = p$.
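
The two eigenvector statements can be checked numerically on a small example. The 2-state kernel below and its stationary vector (5/6, 1/6) are made up for illustration; the helper names are hypothetical.

```python
def transpose_times(E, p):
    """Compute E^T p, i.e. the left action of p on E."""
    n = len(E)
    return [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]

def times(E, v):
    """Compute E v, the ordinary matrix-vector product."""
    return [sum(e * x for e, x in zip(row, v)) for row in E]

# Hypothetical row-stochastic kernel and its stationary distribution.
E = [[0.9, 0.1],
     [0.5, 0.5]]
p = [5 / 6, 1 / 6]

left = transpose_times(E, p)    # should reproduce p (eigenvalue 1)
right = times(E, [1.0, 1.0])    # should reproduce the all-ones vector
```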

(C) 2003, The University of Michigan21 Computing the stationary distribution
Solution for the stationary distribution:
function PowerStatDist (E):
begin
    p(0) = u; (or p(0) = [1, 0, …, 0])
    i = 0;
    repeat
        i = i + 1;
        p(i) = E^T p(i-1);
        L = ||p(i) - p(i-1)||_1;
    until L < ε
    return p(i)
end
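
The pseudocode above translates fairly directly into Python. This sketch assumes that u is the uniform distribution and adds an iteration cap, which the slide does not have; the tolerance, the cap, and the example kernel are all assumptions.

```python
def power_stat_dist(E, eps=1e-10, max_iter=10000):
    """Power iteration for the stationary distribution of a row-stochastic E."""
    n = len(E)
    p = [1.0 / n] * n  # p(0) = u, assumed uniform
    for _ in range(max_iter):
        # p(i) = E^T p(i-1)
        nxt = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        # L1 distance between successive iterates
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < eps:
            break
    return p

# Hypothetical 2-state kernel whose stationary distribution is (5/6, 1/6).
E = [[0.9, 0.1],
     [0.5, 0.5]]
p = power_stat_dist(E)
```

Each iteration touches every entry of E once, so on a sparse link graph the cost per iteration is proportional to the number of links.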

(C) 2003, The University of Michigan Example t=0 t=1 t=10

(C) 2003, The University of Michigan23 How Google works
–Crawling
–Anchor text
–Fast query processing
–PageRank

(C) 2003, The University of Michigan24 More about PageRank
Named after Larry Page, co-founder of Google (and UM alum).
Reading: “The anatomy of a large-scale hypertextual web search engine” by Brin and Page.
Independent of the query, although more recent work by Haveliwala (WWW 2002) introduced a topic-based PageRank.

(C) 2003, The University of Michigan25 HITS
Query-dependent model (Kleinberg 97)
Hubs and authorities (e.g., cars, Honda)
Algorithm
–obtain the root set using the input query
–expand the root set by radius one
–run iterations on the hub and authority scores together
–report top-ranking authorities and hubs
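
The iteration step of the algorithm above can be sketched as follows. Root-set retrieval and expansion are not modeled; the input is assumed to be the already expanded link graph. The fixed iteration count, the L2 normalization, and all names are choices made for this illustration.

```python
import math

def hits(links, iterations=50):
    """Mutually reinforcing hub and authority scores on a directed link graph."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing to the node.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages the node points to.
        hub = {n: sum(auth[d] for d in links.get(n, ())) for n in nodes}
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical expanded set: h1 and h2 act as hubs, a1 and a2 as authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
```

In this toy graph a1 is cited by both hubs and ends up with the higher authority score, while h1 points to both authorities and ends up as the stronger hub.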

(C) 2003, The University of Michigan26 The link-content hypothesis
Topical locality: a page is similar to the page that points to it.
Davison (TF*IDF similarity, 100K pages)
–0.31 same domain
–0.23 linked pages
–0.19 sibling
–0.02 random
Menczer (373K pages, non-linear least squares fit): fitted parameters 1.8 and 0.6
Chakrabarti (focused crawling): prob. of losing the topic
[Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001]

(C) 2003, The University of Michigan27 Document closures for Q&A (figure labels: P, LP; terms: capital, Madrid, spain, capital)

(C) 2003, The University of Michigan28 Document closures for IR (figure labels: P, LP; terms: Physics, Physics Department, University of Michigan)

(C) 2003, The University of Michigan29 Language models
Conditional probability distributions over word sequences
Example: $p(\text{“Paris”} \in d_j) = ?$, $p(\text{“Paris”} \in d_j \mid d_j \text{ on Europe}) = ?$
Training models: assume a parametric form, then maximize the probability of an existing text

(C) 2003, The University of Michigan30 Link-based language models
In the absence of other information, $p(w_i \in p) = 1/d(w_j)$
Link information: $p(w_i \in p \mid p_1 \to p \wedge w_i \in p_1) = p(w_i \in p) \cdot R_i$
Conjecture: $R_i > 1$

(C) 2003, The University of Michigan31 Experimental setup
–2-Gigabyte wt2g corpus
–247,491 Web documents
–3,118,248 links
–948,036 unique words (after Porter-style stemming)
–ALE (automatic link extrapolator)

(C) 2003, The University of Michigan32 Experiment one: setup
For each stemmed word in wt2g, we compute the following numbers:
–PagesContainingWord = how many pages in the collection contain the word
–OutgoingLinks = the total number of outgoing links in all the pages that contain the word
–LinkedPagesContainingWord = how many of the linked pages contain the word
For the latter two measures, only the links inside the collection were considered.

(C) 2003, The University of Michigan33 The link effect R
The word “each”:
p = 55654/ = .225
p’ = 15815/46163 = .343
R = p’/p = .343/.225 = 1.524
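
The arithmetic for the word “each” can be reproduced as a small sketch. Two readings are assumed here and are not stated on the slide: that p divides PagesContainingWord by the collection size of 247,491 documents reported in the experimental setup (this reproduces the .225 shown), and that p’ divides LinkedPagesContainingWord by OutgoingLinks. The function and parameter names are hypothetical.

```python
def link_effect(pages_containing_word, total_pages,
                linked_pages_containing_word, outgoing_links):
    """Return (p, p_linked, R) where R = p_linked / p."""
    p = pages_containing_word / total_pages
    p_linked = linked_pages_containing_word / outgoing_links
    return p, p_linked, p_linked / p

# total_pages = 247491 is an assumption taken from the setup slide.
p, p_linked, R = link_effect(55654, 247491, 15815, 46163)
```

Computed from the unrounded ratios, R comes out slightly below the slide's 1.524, which was obtained by dividing the rounded values .343 and .225.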

(C) 2003, The University of Michigan34 Establishing values for R

(C) 2003, The University of Michigan35

(C) 2003, The University of Michigan36 Linear fit for the 2000 lowest-IDF words (axes: p, p’)

(C) 2003, The University of Michigan37 Cluster One (axes: p, p’; words: by, with, from)

(C) 2003, The University of Michigan38 Cluster Two (axes: p, p’; words: photo, dream, path)