1 Information Retrieval, Handout #8, February 25, 2005. (C) 2003, The University of Michigan

2 Course Information
Instructor: Dragomir R. Radev (radev@si.umich.edu)
Office: 3080, West Hall Connector
Phone: (734) 615-5225
Office hours: M 11-12 & Th 12-1, or via email
Course page: http://tangra.si.umich.edu/~radev/650/
Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

3 Models of the Web

4 Size
The Web is the largest repository of data, and it grows exponentially.
– 320 million Web pages [Lawrence & Giles 1998]
– 800 million Web pages, 15 TB [Lawrence & Giles 1999]
– 8 billion Web pages indexed [Google 2005]
Amount of data:
– roughly 200 TB [Lyman et al. 2003]

5 Bow-tie model of the Web (Broder et al. WWW 2000, Dill et al. VLDB 2001)
Figure: SCC 56 M pages, IN 44 M, OUT 44 M, TENDRILS 44 M, DISCONNECTED 17 M.
Only about 24% of pages are reachable from a given page.

6 Power laws
– Web site size (Huberman and Adamic 1999)
– Power-law connectivity (Barabasi and Albert 1999): exponents 2.45 for out-degree and 2.1 for in-degree
– Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdos, collaboration graph of actors, metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002)
All values of the exponent gamma are around 2-3.
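A minimal sketch (not from the lecture) of how a power-law exponent such as the in-degree value of 2.1 can be estimated from a degree sequence; the synthetic Pareto data and the log-log fit of the complementary CDF are assumptions, and maximum-likelihood estimators are usually preferred in practice.

```python
import numpy as np

# Hypothetical degree sequence drawn from a continuous power law with density
# exponent gamma = 2.1 (the reported in-degree exponent); the data are synthetic.
rng = np.random.default_rng(0)
degrees = rng.pareto(2.1 - 1.0, size=100_000) + 1.0   # P(X > x) ~ x^{-(gamma - 1)}

# Fit the complementary CDF on a log-log scale: log P(X >= x) ~ -(gamma - 1) log x.
x = np.sort(degrees)
ccdf = 1.0 - np.arange(len(x)) / len(x)
mask = (x >= 2) & (x <= 1000)                          # fit over the tail only
slope, _ = np.polyfit(np.log(x[mask]), np.log(ccdf[mask]), 1)
print(f"estimated gamma ~ {1.0 - slope:.2f}")          # roughly 2.1
```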

7 Small-world networks
Diameter = average length of the shortest path between all pairs of nodes.
Example: the Milgram experiment (1967)
– Kansas/Omaha --> Boston (42 of 160 letters arrived)
– diameter = 6 ("six degrees of separation")
Albert et al. 1999: the average distance between two vertices is d = 0.35 + 2.06 log10(n). For n = 10^9, d = 18.89.
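A quick arithmetic check of the Albert et al. formula quoted above; nothing here beyond the formula itself.

```python
import math

def avg_distance(n):
    """Average Web-graph distance per Albert et al.: d = 0.35 + 2.06 * log10(n)."""
    return 0.35 + 2.06 * math.log10(n)

print(round(avg_distance(10 ** 9), 2))  # 18.89, as on the slide
```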

8 Clustering coefficient
Cliquishness (C): the fraction of the k_v (k_v - 1)/2 possible pairs of neighbors of a node v that are actually connected.
Examples:
              n        k      d      d_rand   C      C_rand
Actors        225,226  61     3.65   2.99     0.79   0.00027
Power grid    4,941    2.67   18.7   12.4     0.08   0.005
C. elegans    282      14     2.65   2.25     0.28   0.05
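A minimal sketch of the clustering coefficient as defined on this slide: for a node v with k_v neighbors, C_v is the number of edges among those neighbors divided by k_v(k_v - 1)/2. The small example graph is made up.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of the k_v*(k_v-1)/2 possible neighbor pairs that are actually linked."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)

# Hypothetical undirected graph as an adjacency-set dictionary.
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2},
    4: {1},
}
print(clustering_coefficient(adj, 1))  # neighbors {2,3,4}: one of three pairs linked -> 0.333...
```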

9 Models of the Web
– Erdös/Rényi 59, 60
– Barabási/Albert 99
– Watts/Strogatz 98
– Kleinberg 98
– Menczer 02
– Radev 03
Evolving networks are a fundamental object of statistical physics, social networks, mathematical biology, and epidemiology.

10 Self-triggerability across hyperlinks
Document closures for information retrieval.
– Self-triggerability [Mosteller & Wallace 84]: Poisson distribution
– Two-Poisson [Bookstein & Swanson 74]
– Negative Binomial, K-mixture [Church & Gale 95]
Triggerability across hyperlinks?

11 Evolving Word-based Web
Observations:
– Links are made based on topics
– Topics are expressed with words
– Words are distributed very unevenly (Zipf, Benford, self-triggerability laws)
Model:
– Pick n
– Generate n lengths according to a power-law distribution
– Generate n documents using a trigram model
– Pick words in decreasing order of r
– Generate hyperlinks with random directionality
Outcome:
– Generates power-law degree distributions
– Generates topical communities
– Natural variation of PageRank: LexRank
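A loose sketch of the flavor of this model, under assumptions the slide does not spell out: a Zipfian unigram sampler stands in for the trigram model, and the linking rule (connect documents that share a rare word, with random direction) is hypothetical.

```python
import random

random.seed(0)

n = 200                                          # number of documents
vocab = list(range(1, 1001))                     # word ranks 1..1000
zipf_weights = [1.0 / r for r in vocab]          # Zipfian unigram stand-in for the trigram model

# Document lengths drawn from a power-law distribution, as in the model.
lengths = [int(5 * random.paretovariate(1.5)) for _ in range(n)]

# Each document is represented as the set of word ranks it contains.
docs = [set(random.choices(vocab, weights=zipf_weights, k=length)) for length in lengths]

# Hypothetical linking rule: documents that share a rare word (rank > 100) are
# connected by a hyperlink with random directionality.
edges = set()
for i in range(n):
    for j in range(i + 1, n):
        if any(rank > 100 for rank in docs[i] & docs[j]):
            edges.add((i, j) if random.random() < 0.5 else (j, i))

in_degree = [0] * n
for _, dst in edges:
    in_degree[dst] += 1
print("documents:", n, "links:", len(edges), "max in-degree:", max(in_degree))
```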

12 Social network analysis for IR

13 Social networks
Induced by a relation; symmetric or not.
Examples:
– Friendship networks
– Board membership
– Citations
– Power grid of the US
– WWW

14 Krebs 2004 (figure)

15 Prestige and centrality
– Degree centrality: how many neighbors each node has
– Closeness centrality: how close a node is to all of the other nodes
– Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes
– Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects
Prestige: the same measures, but for directed graphs.
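A minimal sketch of the four centrality measures listed above, computed with networkx on a made-up graph; the library choice and the example edges are assumptions, not part of the lecture.

```python
import networkx as nx

# Hypothetical undirected graph.
G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])

print(nx.degree_centrality(G))       # how many neighbors each node has (normalized)
print(nx.closeness_centrality(G))    # how close a node is to all other nodes
print(nx.betweenness_centrality(G))  # how often a node lies on shortest paths between others
print(nx.eigenvector_centrality(G))  # neighbors weighted by their own centrality
```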

16 Graph-based representations
Graph G(V,E) on nodes 1-8 and its square connectivity (incidence) matrix (figure).
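A minimal sketch of building such a connectivity matrix; the edge list below is a placeholder, since the slide's actual edges do not survive in the transcript.

```python
import numpy as np

# Placeholder edge list over nodes 1..8 (the slide's real edges are not recoverable).
edges = [(1, 2), (1, 3), (3, 4), (5, 6), (5, 7), (6, 8)]

A = np.zeros((8, 8), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = 1   # directed edge i -> j; also set A[j-1, i-1] for an undirected graph

print(A)
```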

17 Markov chains
A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.
Path = a sequence (x_0, x_1, ..., x_n) with x_i = x_{i-1} E.
The probability of a path can be computed as a product of probabilities for each step i.
Random walk = find x_j given x_0, E, and j.
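A minimal sketch of the definitions on this slide: one step of the chain multiplies the current distribution by E, and a path's probability is the product of its transition probabilities. The 3-state kernel is a made-up example.

```python
import numpy as np

# Hypothetical 3-state Markov kernel E (rows sum to 1) and initial distribution x0.
E = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
x0 = np.array([1.0, 0.0, 0.0])

# Random walk: distribution after j steps is x_j = x_0 E^j.
j = 4
xj = x0 @ np.linalg.matrix_power(E, j)
print(xj)

# Probability of a particular path (state sequence), given it starts in state 0.
path = [0, 1, 2, 2]
prob = np.prod([E[a, b] for a, b in zip(path, path[1:])])
print(prob)  # 0.5 * 0.5 * 0.6 = 0.15
```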

18 Stationary solutions
The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
– E is stochastic
– E is irreducible
– E is aperiodic
To make these conditions true:
– All rows of E add up to 1 (and no value is negative)
– Make sure that E is strongly connected
– Make sure that E is not bipartite
Example: PageRank [Brin and Page 1998]: use "teleportation".
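A minimal sketch of the "teleportation" fix: mixing the link matrix with a uniform jump yields a stochastic, irreducible, aperiodic kernel. The 4-page matrix is made up, and the damping factor 0.85 is the value conventionally used for PageRank rather than anything stated on the slide.

```python
import numpy as np

# Row-normalized link matrix of a hypothetical 4-page web (dangling row set to uniform).
E = np.array([[0.00, 0.50, 0.50, 0.00],
              [0.00, 0.00, 1.00, 0.00],
              [1.00, 0.00, 0.00, 0.00],
              [0.25, 0.25, 0.25, 0.25]])

d = 0.85                                     # damping factor; 1 - d is the teleportation probability
n = E.shape[0]
G = d * E + (1 - d) * np.ones((n, n)) / n    # strongly connected and aperiodic by construction

# The stationary distribution of G is the PageRank vector (left eigenvector for eigenvalue 1).
p = np.ones(n) / n
for _ in range(100):
    p = p @ G
print(p, p.sum())
```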

19 Example
This graph E has a second graph E' (not drawn) superimposed on it: E' is the uniform transition graph. (Figure: the 8-node graph with walk distributions at t=0 and t=1.)

20 Eigenvectors
An eigenvector is an implicit "direction" for a matrix: Mv = λv, where v is non-zero, though λ can be any complex number in principle.
The largest eigenvalue of a stochastic matrix E is real: λ_1 = 1.
For λ_1, the left (principal) eigenvector is p and the right eigenvector is the all-ones vector 1. In other words, E^T p = p.

21 Computing the stationary distribution
Solution for the stationary distribution:

function PowerStatDist(E):
begin
  p(0) = u   (or p(0) = [1, 0, ..., 0])
  i = 1
  repeat
    p(i) = E^T p(i-1)
    L = ||p(i) - p(i-1)||_1
    i = i + 1
  until L < epsilon
  return p(i)
end
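A runnable Python version of the pseudocode above, assuming numpy and a concrete tolerance; the 3-state kernel is only a test case.

```python
import numpy as np

def power_stat_dist(E, eps=1e-10):
    """Power iteration for the stationary distribution p of a stochastic matrix E (rows sum to 1)."""
    n = E.shape[0]
    p = np.ones(n) / n                         # p(0) = u, the uniform distribution
    while True:
        p_next = E.T @ p                       # p(i) = E^T p(i-1)
        if np.abs(p_next - p).sum() < eps:     # L1 distance between successive iterates
            return p_next
        p = p_next

E = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
print(power_stat_dist(E))
```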

22 Example (figure: the same 8-node graph with walk distributions at t=0, t=1, and t=10)

23 How Google works
– Crawling
– Anchor text
– Fast query processing
– PageRank

24 More about PageRank
Named after Larry Page, founder of Google (and UM alum).
Reading: "The anatomy of a large-scale hypertextual Web search engine" by Brin and Page.
Independent of the query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank).

25 HITS
Query-dependent model (Kleinberg 97).
Hubs and authorities (e.g., cars, Honda).
Algorithm:
– obtain the root set using the input query
– expand the root set by radius one
– run iterations on the hub and authority scores together
– report the top-ranking authorities and hubs
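A minimal sketch of the hub/authority iterations on a hypothetical adjacency matrix for an already-expanded root set; the root-set construction steps are omitted.

```python
import numpy as np

# Adjacency matrix of a hypothetical expanded root set: A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0]], dtype=float)

hubs = np.ones(A.shape[0])
auths = np.ones(A.shape[0])
for _ in range(50):
    auths = A.T @ hubs             # a page is a good authority if good hubs point to it
    hubs = A @ auths               # a page is a good hub if it points to good authorities
    auths /= np.linalg.norm(auths)
    hubs /= np.linalg.norm(hubs)

print("authorities:", auths.round(3))
print("hubs:       ", hubs.round(3))
```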

26 The link-content hypothesis
Topical locality: a page is similar to the page that points to it.
Davison (TF*IDF, 100K pages):
– 0.31 same domain
– 0.23 linked pages
– 0.19 sibling
– 0.02 random
Menczer (373K pages, non-linear least-squares fit with parameters 1.8 and 0.6).
Chakrabarti (focused crawling): probability of losing the topic.
References: Van Rijsbergen 1979, Chakrabarti et al. WWW 1999, Davison SIGIR 2000, Menczer 2001.

27 Document closures for Q&A (figure: pages P and LP with the words "capital", "Madrid", "Spain")

28 Document closures for IR (figure: pages P and LP with "Physics", "Physics Department", "University of Michigan")

29 Language models
Conditional probability distributions over word sequences.
Example: p("Paris" ∈ d_j) = ? p("Paris" ∈ d_j | d_j is on Europe) = ?
Training models: assume a parametric form, then maximize the probability of an existing text.
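A minimal sketch of the training idea mentioned above: a maximum-likelihood unigram model (the simplest parametric form) estimated from a toy document and then used to score words. The toy text is an assumption.

```python
from collections import Counter

# Toy document standing in for d_j (an assumption; any text would do).
doc = "paris is the capital of france paris is in europe".split()
counts = Counter(doc)
total = sum(counts.values())

def p(word):
    """Maximum-likelihood unigram estimate of p(word in d_j)."""
    return counts[word] / total

print(p("paris"))   # 2 occurrences out of 10 tokens -> 0.2
print(p("madrid"))  # 0.0: unseen words get zero probability without smoothing
```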

30 Link-based language models
In the absence of other information, p(w_i ∈ p) = 1/d(w_j).
Link information: p(w_i ∈ p | p_1 → p ∧ w_i ∈ p_1) = p(w_i ∈ p) · R_i; conjecture: R_i > 1.

31 Experimental setup
– 2-gigabyte wt2g corpus
– 247,491 Web documents
– 3,118,248 links
– 948,036 unique words (after Porter-style stemming)
– ALE (automatic link extrapolator)

32 Experiment one: setup
For each stemmed word in wt2g, we compute the following numbers:
– PagesContainingWord = how many pages in the collection contain the word
– OutgoingLinks = the total number of outgoing links in all the pages that contain the word
– LinkedPagesContainingWord = how many of the linked pages contain the word
For the latter two measures, only the links inside the collection were considered.

33 The link effect R
The word "each": p = 55654/247491 = .225, p' = 15815/46163 = .343, R = p'/p = .343/.225 = 1.524.
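The slide's arithmetic reproduced as a short check; the variable names are mine, and the counts come straight from the slide.

```python
pages_containing_word = 55654          # pages containing "each"
pages_in_collection = 247491
linked_pages_containing_word = 15815
outgoing_links = 46163                 # links out of the pages containing "each"

p = pages_containing_word / pages_in_collection
p_prime = linked_pages_containing_word / outgoing_links
R = p_prime / p
print(f"p = {p:.3f}, p' = {p_prime:.3f}, R = {R:.3f}")
# p = 0.225, p' = 0.343, R ~ 1.52 (the slide divides the rounded values and reports 1.524)
```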

34 Establishing values for R

36 Linear fit for the 2000 lowest-IDF words (figure: p' plotted against p)

37 Cluster One (figure: p' vs. p for the words "by", "with", "from")

38 Cluster Two (figure: p' vs. p for the words "photo", "dream", "path")

