1 Information Retrieval, Handout #8, February 25, 2005. (C) 2003, The University of Michigan

2 Course Information
Instructor: Dragomir R. Radev (radev@si.umich.edu)
Office: 3080, West Hall Connector
Phone: (734) 615-5225
Office hours: M 11-12 & Th 12-1, or via email
Course page: http://tangra.si.umich.edu/~radev/650/
Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

3 Models of the Web

4 Size
The Web is the largest repository of data, and it grows exponentially.
– 320 million Web pages [Lawrence & Giles 1998]
– 800 million Web pages, 15 TB [Lawrence & Giles 1999]
– 8 billion Web pages indexed [Google 2005]
Amount of data:
– roughly 200 TB [Lyman et al. 2003]

5 Bow-tie model of the Web (Broder et al. WWW 2000, Dill et al. VLDB 2001)
Figure: SCC 56 M pages, IN 44 M, OUT 44 M, TENDRILS 44 M, DISCONNECTED 17 M.
Only about 24% of pages are reachable from a given page.

6 Power laws
– Web site size (Huberman and Adamic 1999)
– Power-law connectivity (Barabasi and Albert 1999): exponents 2.45 for out-degree and 2.1 for in-degree
– Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdos, collaboration graph of actors, metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002)
All values of the exponent gamma are around 2-3.
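A minimal sketch (not from the lecture) of how a power-law exponent such as the in-degree value of 2.1 can be estimated from a degree sequence; the synthetic Pareto data and the log-log fit of the complementary CDF are assumptions, and maximum-likelihood estimators are usually preferred in practice.

```python
import numpy as np

# Hypothetical degree sequence drawn from a continuous power law with density
# exponent gamma = 2.1 (the reported in-degree exponent); the data are synthetic.
rng = np.random.default_rng(0)
degrees = rng.pareto(2.1 - 1.0, size=100_000) + 1.0   # P(X > x) ~ x^{-(gamma - 1)}

# Fit the complementary CDF on a log-log scale: log P(X >= x) ~ -(gamma - 1) log x.
x = np.sort(degrees)
ccdf = 1.0 - np.arange(len(x)) / len(x)
mask = (x >= 2) & (x <= 1000)                          # fit over the tail only
slope, _ = np.polyfit(np.log(x[mask]), np.log(ccdf[mask]), 1)
print(f"estimated gamma ~ {1.0 - slope:.2f}")          # roughly 2.1
```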

7 Small-world networks
Diameter = average length of the shortest path between all pairs of nodes.
Example: the Milgram experiment (1967)
– Kansas/Omaha --> Boston (42 of 160 letters arrived)
– diameter = 6 ("six degrees of separation")
Albert et al. 1999: the average distance between two vertices is d = 0.35 + 2.06 log10(n). For n = 10^9, d = 18.89.
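A quick arithmetic check of the Albert et al. formula quoted above; nothing here beyond the formula itself.

```python
import math

def avg_distance(n):
    """Average Web-graph distance per Albert et al.: d = 0.35 + 2.06 * log10(n)."""
    return 0.35 + 2.06 * math.log10(n)

print(round(avg_distance(10 ** 9), 2))  # 18.89, as on the slide
```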

8 Clustering coefficient
Cliquishness (C): the fraction of the k_v (k_v - 1)/2 possible pairs of neighbors of a node v that are actually connected.
Examples:
              n        k      d      d_rand   C      C_rand
Actors        225,226  61     3.65   2.99     0.79   0.00027
Power grid    4,941    2.67   18.7   12.4     0.08   0.005
C. elegans    282      14     2.65   2.25     0.28   0.05
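A minimal sketch of the clustering coefficient as defined on this slide: for a node v with k_v neighbors, C_v is the number of edges among those neighbors divided by k_v(k_v - 1)/2. The small example graph is made up.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of the k_v*(k_v-1)/2 possible neighbor pairs that are actually linked."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)

# Hypothetical undirected graph as an adjacency-set dictionary.
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2},
    4: {1},
}
print(clustering_coefficient(adj, 1))  # neighbors {2,3,4}: one of three pairs linked -> 0.333...
```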

9 Models of the Web
– Erdös/Rényi 59, 60
– Barabási/Albert 99
– Watts/Strogatz 98
– Kleinberg 98
– Menczer 02
– Radev 03
Evolving networks are a fundamental object of statistical physics, social networks, mathematical biology, and epidemiology.

10 Self-triggerability across hyperlinks
Document closures for information retrieval.
– Self-triggerability [Mosteller & Wallace 84]: Poisson distribution
– Two-Poisson [Bookstein & Swanson 74]
– Negative Binomial, K-mixture [Church & Gale 95]
Triggerability across hyperlinks?

11 Evolving Word-based Web
Observations:
– Links are made based on topics
– Topics are expressed with words
– Words are distributed very unevenly (Zipf, Benford, self-triggerability laws)
Model:
– Pick n
– Generate n lengths according to a power-law distribution
– Generate n documents using a trigram model
– Pick words in decreasing order of r
– Generate hyperlinks with random directionality
Outcome:
– Generates power-law degree distributions
– Generates topical communities
– Natural variation of PageRank: LexRank
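A loose sketch of the flavor of this model, under assumptions the slide does not spell out: a Zipfian unigram sampler stands in for the trigram model, and the linking rule (connect documents that share a rare word, with random direction) is hypothetical.

```python
import random

random.seed(0)

n = 200                                          # number of documents
vocab = list(range(1, 1001))                     # word ranks 1..1000
zipf_weights = [1.0 / r for r in vocab]          # Zipfian unigram stand-in for the trigram model

# Document lengths drawn from a power-law distribution, as in the model.
lengths = [int(5 * random.paretovariate(1.5)) for _ in range(n)]

# Each document is represented as the set of word ranks it contains.
docs = [set(random.choices(vocab, weights=zipf_weights, k=length)) for length in lengths]

# Hypothetical linking rule: documents that share a rare word (rank > 100) are
# connected by a hyperlink with random directionality.
edges = set()
for i in range(n):
    for j in range(i + 1, n):
        if any(rank > 100 for rank in docs[i] & docs[j]):
            edges.add((i, j) if random.random() < 0.5 else (j, i))

in_degree = [0] * n
for _, dst in edges:
    in_degree[dst] += 1
print("documents:", n, "links:", len(edges), "max in-degree:", max(in_degree))
```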

12 Social network analysis for IR

13 Social networks
Induced by a relation; symmetric or not.
Examples:
– Friendship networks
– Board membership
– Citations
– Power grid of the US
– WWW

14 Krebs 2004 (figure)

15 Prestige and centrality
– Degree centrality: how many neighbors each node has
– Closeness centrality: how close a node is to all of the other nodes
– Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes
– Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects
Prestige: the same measures, but for directed graphs.
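A minimal sketch of the four centrality measures listed above, computed with networkx on a made-up graph; the library choice and the example edges are assumptions, not part of the lecture.

```python
import networkx as nx

# Hypothetical undirected graph.
G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])

print(nx.degree_centrality(G))       # how many neighbors each node has (normalized)
print(nx.closeness_centrality(G))    # how close a node is to all other nodes
print(nx.betweenness_centrality(G))  # how often a node lies on shortest paths between others
print(nx.eigenvector_centrality(G))  # neighbors weighted by their own centrality
```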

16 Graph-based representations
Graph G(V,E) on nodes 1-8 and its square connectivity (incidence) matrix (figure).
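A minimal sketch of building such a connectivity matrix; the edge list below is a placeholder, since the slide's actual edges do not survive in the transcript.

```python
import numpy as np

# Placeholder edge list over nodes 1..8 (the slide's real edges are not recoverable).
edges = [(1, 2), (1, 3), (3, 4), (5, 6), (5, 7), (6, 8)]

A = np.zeros((8, 8), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = 1   # directed edge i -> j; also set A[j-1, i-1] for an undirected graph

print(A)
```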

17 Markov chains
A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.
Path = a sequence (x_0, x_1, ..., x_n) with x_i = x_{i-1} E.
The probability of a path can be computed as a product of probabilities for each step i.
Random walk = find x_j given x_0, E, and j.
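A minimal sketch of the definitions on this slide: one step of the chain multiplies the current distribution by E, and a path's probability is the product of its transition probabilities. The 3-state kernel is a made-up example.

```python
import numpy as np

# Hypothetical 3-state Markov kernel E (rows sum to 1) and initial distribution x0.
E = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
x0 = np.array([1.0, 0.0, 0.0])

# Random walk: distribution after j steps is x_j = x_0 E^j.
j = 4
xj = x0 @ np.linalg.matrix_power(E, j)
print(xj)

# Probability of a particular path (state sequence), given it starts in state 0.
path = [0, 1, 2, 2]
prob = np.prod([E[a, b] for a, b in zip(path, path[1:])])
print(prob)  # 0.5 * 0.5 * 0.6 = 0.15
```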

18 Stationary solutions
The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
– E is stochastic
– E is irreducible
– E is aperiodic
To make these conditions true:
– All rows of E add up to 1 (and no value is negative)
– Make sure that E is strongly connected
– Make sure that E is not bipartite
Example: PageRank [Brin and Page 1998]: use "teleportation".
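A minimal sketch of the "teleportation" fix: mixing the link matrix with a uniform jump yields a stochastic, irreducible, aperiodic kernel. The 4-page matrix is made up, and the damping factor 0.85 is the value conventionally used for PageRank rather than anything stated on the slide.

```python
import numpy as np

# Row-normalized link matrix of a hypothetical 4-page web (dangling row set to uniform).
E = np.array([[0.00, 0.50, 0.50, 0.00],
              [0.00, 0.00, 1.00, 0.00],
              [1.00, 0.00, 0.00, 0.00],
              [0.25, 0.25, 0.25, 0.25]])

d = 0.85                                     # damping factor; 1 - d is the teleportation probability
n = E.shape[0]
G = d * E + (1 - d) * np.ones((n, n)) / n    # strongly connected and aperiodic by construction

# The stationary distribution of G is the PageRank vector (left eigenvector for eigenvalue 1).
p = np.ones(n) / n
for _ in range(100):
    p = p @ G
print(p, p.sum())
```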

19 Example
This graph E has a second graph E' (not drawn) superimposed on it: E' is the uniform transition graph. (Figure: the 8-node graph with walk distributions at t=0 and t=1.)

20 Eigenvectors
An eigenvector is an implicit "direction" for a matrix: Mv = λv, where v is non-zero, though λ can be any complex number in principle.
The largest eigenvalue of a stochastic matrix E is real: λ_1 = 1.
For λ_1, the left (principal) eigenvector is p and the right eigenvector is the all-ones vector 1. In other words, E^T p = p.

21 Computing the stationary distribution
Solution for the stationary distribution:

function PowerStatDist(E):
begin
  p(0) = u   (or p(0) = [1, 0, ..., 0])
  i = 1
  repeat
    p(i) = E^T p(i-1)
    L = ||p(i) - p(i-1)||_1
    i = i + 1
  until L < epsilon
  return p(i)
end
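A runnable Python version of the pseudocode above, assuming numpy and a concrete tolerance; the 3-state kernel is only a test case.

```python
import numpy as np

def power_stat_dist(E, eps=1e-10):
    """Power iteration for the stationary distribution p of a stochastic matrix E (rows sum to 1)."""
    n = E.shape[0]
    p = np.ones(n) / n                         # p(0) = u, the uniform distribution
    while True:
        p_next = E.T @ p                       # p(i) = E^T p(i-1)
        if np.abs(p_next - p).sum() < eps:     # L1 distance between successive iterates
            return p_next
        p = p_next

E = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
print(power_stat_dist(E))
```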

22 Example (figure: the same 8-node graph with walk distributions at t=0, t=1, and t=10)

23 How Google works
– Crawling
– Anchor text
– Fast query processing
– PageRank

24 More about PageRank
Named after Larry Page, founder of Google (and UM alum).
Reading: "The anatomy of a large-scale hypertextual Web search engine" by Brin and Page.
Independent of the query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank).

25 HITS
Query-dependent model (Kleinberg 97).
Hubs and authorities (e.g., cars, Honda).
Algorithm:
– obtain the root set using the input query
– expand the root set by radius one
– run iterations on the hub and authority scores together
– report the top-ranking authorities and hubs
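A minimal sketch of the hub/authority iterations on a hypothetical adjacency matrix for an already-expanded root set; the root-set construction steps are omitted.

```python
import numpy as np

# Adjacency matrix of a hypothetical expanded root set: A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0]], dtype=float)

hubs = np.ones(A.shape[0])
auths = np.ones(A.shape[0])
for _ in range(50):
    auths = A.T @ hubs             # a page is a good authority if good hubs point to it
    hubs = A @ auths               # a page is a good hub if it points to good authorities
    auths /= np.linalg.norm(auths)
    hubs /= np.linalg.norm(hubs)

print("authorities:", auths.round(3))
print("hubs:       ", hubs.round(3))
```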

26 The link-content hypothesis
Topical locality: a page is similar to the page that points to it.
Davison (TF*IDF, 100K pages):
– 0.31 same domain
– 0.23 linked pages
– 0.19 sibling
– 0.02 random
Menczer (373K pages, non-linear least-squares fit with parameters 1.8 and 0.6).
Chakrabarti (focused crawling): probability of losing the topic.
References: Van Rijsbergen 1979, Chakrabarti et al. WWW 1999, Davison SIGIR 2000, Menczer 2001.

27 Document closures for Q&A (figure: pages P and LP with the words "capital", "Madrid", "Spain")

28 Document closures for IR (figure: pages P and LP with "Physics", "Physics Department", "University of Michigan")

29 Language models
Conditional probability distributions over word sequences.
Example: p("Paris" ∈ d_j) = ? p("Paris" ∈ d_j | d_j is on Europe) = ?
Training models: assume a parametric form, then maximize the probability of an existing text.
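A minimal sketch of the training idea mentioned above: a maximum-likelihood unigram model (the simplest parametric form) estimated from a toy document and then used to score words. The toy text is an assumption.

```python
from collections import Counter

# Toy document standing in for d_j (an assumption; any text would do).
doc = "paris is the capital of france paris is in europe".split()
counts = Counter(doc)
total = sum(counts.values())

def p(word):
    """Maximum-likelihood unigram estimate of p(word in d_j)."""
    return counts[word] / total

print(p("paris"))   # 2 occurrences out of 10 tokens -> 0.2
print(p("madrid"))  # 0.0: unseen words get zero probability without smoothing
```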

30 Link-based language models
In the absence of other information, p(w_i ∈ p) = 1/d(w_j).
Link information: p(w_i ∈ p | p_1 → p ∧ w_i ∈ p_1) = p(w_i ∈ p) · R_i; conjecture: R_i > 1.

31 Experimental setup
– 2-gigabyte wt2g corpus
– 247,491 Web documents
– 3,118,248 links
– 948,036 unique words (after Porter-style stemming)
– ALE (automatic link extrapolator)

32 Experiment one: setup
For each stemmed word in wt2g, we compute the following numbers:
– PagesContainingWord = how many pages in the collection contain the word
– OutgoingLinks = the total number of outgoing links in all the pages that contain the word
– LinkedPagesContainingWord = how many of the linked pages contain the word
For the latter two measures, only the links inside the collection were considered.

33 The link effect R
The word "each": p = 55654/247491 = .225, p' = 15815/46163 = .343, R = p'/p = .343/.225 = 1.524.
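The slide's arithmetic reproduced as a short check; the variable names are mine, and the counts come straight from the slide.

```python
pages_containing_word = 55654          # pages containing "each"
pages_in_collection = 247491
linked_pages_containing_word = 15815
outgoing_links = 46163                 # links out of the pages containing "each"

p = pages_containing_word / pages_in_collection
p_prime = linked_pages_containing_word / outgoing_links
R = p_prime / p
print(f"p = {p:.3f}, p' = {p_prime:.3f}, R = {R:.3f}")
# p = 0.225, p' = 0.343, R ~ 1.52 (the slide divides the rounded values and reports 1.524)
```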

34 Establishing values for R

36 Linear fit for the 2000 lowest-IDF words (figure: p' plotted against p)

37 Cluster One (figure: p' vs. p for the words "by", "with", "from")

38 Cluster Two (figure: p' vs. p for the words "photo", "dream", "path")

