7CCSMWAL Algorithmic Issues in the WWW


1 7CCSMWAL Algorithmic Issues in the WWW
Lecture PageRank

2 WWW Search Architecture

3 Search Engines
The figure on the previous page is from the book Google’s PageRank and Beyond (Langville & Meyer). For the original description of Google, see The Anatomy of a Large-Scale Hypertextual Web Search Engine by Brin & Page. A search engine performs: Web crawling (breadth-first search or otherwise), indexing of the retrieved documents, and searching in response to a query. See e.g. the Wikipedia article on Web search engines for a summary of what is around: Google, Bing, Yahoo... Yandex, Baidu.

4 Query-Independent Modules
Crawler
Collects and categorizes data from the Web. The crawling software creates virtual robots, called spiders, which constantly access the Web, gather new information and web pages, and return them for storage in a central repository.

5 Query-Independent Modules
Indexes
Vital descriptors are extracted to create a compressed description of each page, which is stored in various indexes. Different types of indexes:
Content index: keywords, title, and anchor text, stored using an inverted file
Structure index: hyperlink information (the Web graph)
Special-purpose indexes: e.g. for images or pdf files

6 Query-Dependent Modules
Initiated when a user enters a query; the search engine must respond in real time.
Query Module: consults the content index (and other indexes) to find which pages are relevant to the query terms.
Ranking Module: ranks the relevant pages. Vital because there are too many relevant pages. Usually done by combining two scores, the content score and the popularity score. The popularity score is determined from the Web hyperlink structure and is calculated before query time.

7 Ranking Webpages by Popularity
The popularity of a webpage is calculated independently of the page content. Two well-known methods:
PageRank, by Brin and Page (1998), implemented in the Google search engine
HITS (Hypertext Induced Topic Search), by Kleinberg (1998), later adopted by the search engine Teoma (ask.com)

8 Link Analysis Systems
Models that exploit the Web hyperlink structure, e.g., PageRank and HITS. View a hyperlink as a recommendation: the more in-links a webpage has, the better (more important) it is. The importance also depends on the importance of the source webpages.

9 PageRank
Each webpage Pi has a score, denoted r(Pi), called the pagerank of Pi. The value of r(Pi) is the sum of the normalized pageranks of all webpages pointing into Pi:
r(Pi) = Σ_{Pj in BPi} r(Pj) / |Pj|
where BPi is the set of webpages pointing to Pi, and |Pj| is the number of out-links from page Pj. The normalized pagerank of Pj is r(Pj) / |Pj|: the pagerank of Pj is shared equally among all the webpages Pj points to.

10 Example
Suppose the pageranks of P1, P2, P3, and P4 are known:
Pages:    P1   P2   P3   P4
Pagerank: 3.5  1.2  4.2  1.0
With P1, P2, P3 pointing into P5 and having 2, 1, and 3 out-links respectively, the pagerank of P5 is computed as r(P5) = 3.5/2 + 1.2/1 + 4.2/3 = 4.35.

11 Computation of Pagerank
Problem: in the beginning all pageranks are unknown, so how do we determine the first pagerank values?
Solution: give an initial pagerank to every webpage, e.g., 1/n where n is the total number of webpages. Then calculate the pageranks iteratively: use the pagerank formula to update the pagerank of every webpage, and repeat this step a number of times until the pagerank values are stable (converge).

12 Iterative Procedure
Let rk(Pi) be the PageRank of page Pi at iteration k. Start with r0(Pi) = 1/n for all pages Pi. At iteration k+1, the pagerank of every page Pi is updated using the pageranks at iteration k:
r_{k+1}(Pi) = Σ_{Pj in BPi} r_k(Pj) / |Pj|
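As a sketch, the update rule above can be implemented directly in Python (pagerank_iterative is our name for the helper; the 3-vertex digraph used to try it out is the one from the later "Another example" slide, whose final PageRank is (4/9, 2/9, 3/9)):

```python
def pagerank_iterative(edges, n, iterations=100):
    """Plain (unadjusted) PageRank by repeated application of the update rule."""
    out_links = {i: [] for i in range(1, n + 1)}
    for i, j in edges:
        out_links[i].append(j)
    r = {i: 1.0 / n for i in range(1, n + 1)}   # r0(Pi) = 1/n
    for _ in range(iterations):
        new_r = {i: 0.0 for i in range(1, n + 1)}
        for i, links in out_links.items():
            for j in links:                      # Pi shares r(Pi) equally
                new_r[j] += r[i] / len(links)    # among its out-links
        r = new_r
    return r

# Strongly connected, aperiodic digraph from the "Another example" slide:
r = pagerank_iterative([(1, 2), (1, 3), (2, 1), (2, 3), (3, 1)], 3)
```

On this graph the values converge to (4/9, 2/9, 3/9); on graphs that are not strongly connected and aperiodic, as the next slides show, the iteration may lose mass or fail to settle.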

13 Does this work?
Digraph with edges (1,2), (2,2): vertex 2 gets all the PageRank.
Digraph with edges (1,2), (2,3), (3,1): what you get depends on the initial allocation; the structure is multipartite (periodic).
Digraph with edges (1,1), (1,2), (2,1): all vertices converge to the same final PageRank values, independent of the initial allocation.

14 Example
Consider the graph on the right (the digraph with edges (1,2), (1,3), (3,1), (3,2), (3,5), (4,5), (4,6), (5,4), (5,6), (6,4)).
Iteration 0: r0(Pi) = 1/6 for i = 1, 2, ..., 6
Iteration 1:
r1(P1) = r0(P3)/3 = .0556
r1(P2) = r0(P1)/2 + r0(P3)/3 = .1389
r1(P3) = r0(P1)/2 = .0833
r1(P4) = r0(P5)/2 + r0(P6) = .25
r1(P5) = r0(P3)/3 + r0(P4)/2 = .1389
r1(P6) = r0(P4)/2 + r0(P5)/2 = .1667

15 Example (cont)
After 20 iterations, the ranking is:
Rank 1: P4
Rank 2: P6
Rank 3: P5
Rank 4: P2
Ranks 5-6: P1, P3

16 Where is the PageRank in the example?
At each step your PageRank is what is passed to you from your in-neighbours. A total of 0.4 was passed to vertex 2 and lost, as vertex 2 has no out-links and passed nothing on. The remaining 0.6 is shared among vertices 4, 5, 6. The digraph is not strongly connected; PageRank is only ‘conserved’ in strongly connected digraphs (which also need to be ‘not multipartite’, i.e. aperiodic). PageRank only works on the right type of graphs.

17 Another example
Vertices 1, 2, 3; directed edges (1,2), (1,3), (2,1), (2,3), (3,1).
Final PageRank: 4/9, 2/9, 3/9

18 Step by step

19 Exercise 1: What is the final PageRank in a digraph with vertices 1, 2 and:
(a) edges (1,1), (1,2), (2,2)
(b) edges (1,1), (1,2), (2,1)
The answer (after you have tried it) is in ANSWERS at PageRank-Ex1answ.pdf

20 PageRank for undirected graphs
For directed graphs D, we calculate PageRank iteratively. For undirected graphs G, PageRank has a closed formula: for vertex v, r(v) = d(v)/2m, where d(v) is the degree of v and m is the number of edges in the graph. This formula holds for graphs G which are connected and aperiodic (i.e. not bipartite).

21 Matrix Representation of the Summation Equations
Using matrices and matrix multiplication, we get a simpler representation of the iterative procedure, and we can compute the PageRank of all pages at once in each iteration.

22 Matrix and Vector notation
Matrices are like tables of data (numbers). An n × m matrix consists of n rows and m columns; each of the nm entries is a number. A vector is a matrix with either a single row or a single column: a row vector is a 1 × m matrix, and a column vector is an n × 1 matrix. Examples: a 3 × 1 (column) vector, a 2 × 3 matrix, a 1 × 4 (row) vector.

23 Matrix Multiplication
An n × m matrix A multiplied with an r × s matrix B, denoted A × B (or just AB): we must have m = r, i.e., the number of columns of A equals the number of rows of B. Let C = AB. C will be an n × s matrix. Let Aij, Bij, and Cij denote the entry at row i, column j of A, B, and C, respectively. Then Cij = Σ_{k=1 to m} Aik * Bkj.

24 Example
C = A × B, with A the 3 × 4 matrix with rows (0, 1, 1, 2), (3, 0, 0, 1), (0, 1, 0, 0), and B the 4 × 2 matrix with columns (0, 3, 4, 1) and (1, 0, 0, 1):
C11 = 0*0 + 1*3 + 1*4 + 2*1 = 9
C21 = 3*0 + 0*3 + 0*4 + 1*1 = 1
C31 = 0*0 + 1*3 + 0*4 + 0*1 = 3
C12 = 0*1 + 1*0 + 1*0 + 2*1 = 2
C22 = 3*1 + 0*0 + 0*0 + 1*1 = 4
C32 = 0*1 + 1*0 + 0*0 + 0*1 = 0

25 Scalar Multiplication and Matrix Addition
Scalar multiplication involves a number k multiplying a matrix M: every entry of M is multiplied by k. Matrix addition is the addition of two matrices of the same size: corresponding entries of the two matrices are added.

26 PageRank Vector
For n pages, all n PageRank values can be represented in a 1 × n row vector, denoted π:
π = ( r(P1) r(P2) ... r(Pn) )
Let π(k) denote the vector at iteration k of the iterative procedure. At iteration 0 (uniform initialization), π(0) = ( 1/n 1/n ... 1/n ).

27 Normalized Hyperlink Matrix H
Represent the hyperlinks of the graph in weighted adjacency matrix form. For a page Pi, the proportion of Pi’s PageRank passed to another page Pj, denoted Hij, is 1/|Pi| if Pi has a hyperlink to Pj, and 0 otherwise. All values Hij are stored in an n × n matrix H (n is the number of vertices), where Hij is stored in the entry at row i, column j. H is also called the transition matrix.

28 Example
For the graph on the right, the H matrix is (rows P1..P6):
( 0    1/2  1/2  0    0    0   )
( 0    0    0    0    0    0   )
( 1/3  1/3  0    0    1/3  0   )
( 0    0    0    0    1/2  1/2 )
( 0    0    0    1/2  0    1/2 )
( 0    0    0    1    0    0   )
E.g., H56 = 1/2: since there is a link from P5 to P6 and P5 has two out-links, the pagerank P5 contributes to P6 is r(P5)/2.

29 Update PageRank at Iteration k
Performed by a single matrix multiplication:
π(k+1) = π(k) H    (dimensions: (1 × n) = (1 × n)(n × n))
To verify against the previous formula: the i-th entry of π(k) H is Σ_j π(k)_j Hji, and Hji = 1/|Pj| if there is a link from Pj to Pi, otherwise Hji = 0.

30 Re-run the Example
π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
π(1) = π(0) H:
π(1)_1 = 1/6 * 1/3 = .0556
π(1)_2 = 1/6 * 1/2 + 1/6 * 1/3 = .1389
π(1)_3 = 1/6 * 1/2 = .0833
π(1)_4 = 1/6 * 1/2 + 1/6 * 1 = .25
π(1)_5 = 1/6 * 1/3 + 1/6 * 1/2 = .1389
π(1)_6 = 1/6 * 1/2 + 1/6 * 1/2 = .1667
π(1) = (.0556 .1389 .0833 .25 .1389 .1667)

31 Re-run the Example
π(1) = π(0) H = (.0556 .1389 .0833 .25 .1389 .1667)

32 Problem…
Rank sinks: pages that accumulate more and more pagerank at each iteration. For the graph on the right, pages 4, 5 & 6 are the rank sinks, while pages 1, 2 & 3 get zero PageRank. E.g., after 20 iterations, π(20) ≈ (0 0 0 .2667 .1333 .2). It is difficult to rank pages 1, 2 & 3 if they all have zero PageRank.

33 Final PageRank
Initial allocation (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). Final values (0, 0, 0, 0.2667, 0.1333, 0.2). Total 0.6; the missing 0.4 was lost via vertex 2, so the vector does not add up to 1. Note: one approach is to normalize the final answer (divide by 0.6), giving (0, 0, 0, 0.444, 0.222, 0.333).

34 Other Problems
The iterative approach is a heuristic (no guarantee).
Will the iterative process continue indefinitely, or will it converge to some stable values?
Under what circumstances, or properties of H, is convergence guaranteed?
Will it converge to something that makes sense in the context of the PageRank problem?
Will it converge to just one vector, or to multiple vectors?
Does the convergence depend on the starting vector?
If it converges eventually, how long is “eventually”? How many iterations are needed for convergence? When should we stop?

35 The truth about unadjusted PageRank
If the digraph is strongly connected and aperiodic (i.e. not multipartite), unadjusted PageRank is meaningful and can be calculated from the normalized hyperlink matrix H by solving the equation
π = π H,  where π = ( r(P1) r(P2) ... r(Pn) ),
subject to r(P1) + r(P2) + ... + r(Pn) = 1.

36 Example
Edges (1,2), (2,1), (1,1). H has rows (1/2, 1/2) and (1, 0). Solve π = π H:
π(1) = 1/2 π(1) + π(2),  π(2) = 1/2 π(1) + 0 π(2),  π(1) + π(2) = 1.
This means π(1) + 1/2 π(1) = 1, so π(1) = 2/3, π(2) = 1/3.
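Solving π = π H together with the normalization constraint is a small linear system; a NumPy sketch (stationary is our helper name):

```python
import numpy as np

def stationary(H):
    """Solve pi = pi H subject to the entries of pi summing to 1."""
    n = H.shape[0]
    # Stack the n equations (H^T - I) pi^T = 0 with the constraint sum(pi) = 1.
    A = np.vstack([H.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Edges (1,2), (2,1), (1,1): page 1 splits its rank between pages 1 and 2.
H = np.array([[0.5, 0.5],
              [1.0, 0.0]])
pi = stationary(H)   # (2/3, 1/3)
```

The same function applied to the exercise on the next slide, edges (1,2), (1,3), (3,1), (2,3), gives (2/5, 1/5, 2/5).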

37 Exercise
Solve π = π H for edges (1,2), (1,3), (3,1), (2,3).
Ans: (2/5, 1/5, 2/5)

38 Random Web-Surfer Model
Brin and Page proposed this model to explain the meaning of PageRank. A web surfer arrives at a page with several out-links, chooses one at random, and moves down the hyperlink to the new page. The surfer continues this random decision process indefinitely. The proportion of time the random surfer spends on a given page is a measure of the relative importance of that page: final PageRank value = proportion of time on the page.

39 Adjustments to the Basic Setting
Dangling pages, e.g., pdf files, image files, data tables, pages with no hyperlinks, etc.: the random surfer can’t proceed forward from these pages.
Adjustment (teleporting): the random surfer, after entering a dangling node, can now jump to any page at random (i.e., with equal probability).

40 Re-run the Example
π(0) = (1/6 1/6 1/6 1/6 1/6 1/6), and π(1) = π(0) H = (.0556 .1389 .0833 .25 .1389 .1667), as computed before.

41 Adjustments to the Basic Setting
The rows of the hyperlink matrix H that are all zeros are replaced by rows of (1/n 1/n ... 1/n). For the bad-example graph, the 2nd row of the hyperlink matrix is changed. H’ denotes the modified matrix.

42 Adjustments to the Basic Setting
Allow teleporting to any page at any time. With probability α, the random surfer follows one of the hyperlinks on the current page; with probability 1 − α, they randomly select a page (out of the n pages) to teleport to. The modified hyperlink matrix is called the Google matrix G:
G = α H’ + (1 − α)(1/n) I
where I here is an n × n matrix with all entries equal to 1.

43 Simple example
Edges (1,2), (2,2), n = 2; put α = 2/3.
H has rows (0, 1) and (0, 1); H’ = H (no zero rows).
G = (2/3) H’ + (1/3)(1/2) I, so G has rows (1/6, 5/6) and (1/6, 5/6).
Exercise: for edges (1,2), (2,1), (1,1), find G.
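The construction of G can be sketched in Python (google_matrix is our helper name; the slide's convention that I denotes the all-ones matrix is followed):

```python
import numpy as np

def google_matrix(edges, n, alpha):
    """G = alpha * H' + (1 - alpha) * (1/n) * (all-ones matrix)."""
    H = np.zeros((n, n))
    for i, j in edges:
        H[i - 1, j - 1] = 1.0
    for row in H:
        s = row.sum()
        row[:] = row / s if s > 0 else 1.0 / n   # zero rows -> (1/n ... 1/n)
    return alpha * H + (1 - alpha) / n * np.ones((n, n))

# The slide's example: edges (1,2), (2,2) with alpha = 2/3.
G = google_matrix([(1, 2), (2, 2)], n=2, alpha=2 / 3)
# G has rows (1/6, 5/6) and (1/6, 5/6), matching the slide.
```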

44 Example
For the example graph, G = α H’ + (1 − α)(1/n) I with α = 0.9.

45 Google’s Adjusted PageRank Method
π(k+1) = π(k) G
With the Google matrix, the pagerank vector converges to a stable value…

46 More detail

47 Google’s Adjusted PageRank Method
After 500 iterations, the stable vector π is obtained. The interpretation of π_1 = .03721 is that 3.721% of the time the random surfer visits page 1. The pages in the example graph can then be ranked, with page 4 the most important and page 1 the least:
Rank 1: P4
Rank 2: P6
Rank 3: P5

48 Effect of teleporting
No PageRank is lost, and there is always a unique answer. The solution is π = π G, subject to the entries of π summing to 1 and all entries ≥ 0. This follows from the theory of Markov processes.

49 Exercise 2
Digraph with vertices 1, 2 and:
D1: edges (1,1), (1,2), (2,2)
D2: edges (1,1), (1,2), (2,1)
What is the final Google PageRank with α = 1/2?
Answers: D1: (1/3, 2/3); D2: (3/5, 2/5)
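Both answers can be checked by power iteration on the Google matrix (a sketch; google_pagerank is our helper name):

```python
import numpy as np

def google_pagerank(edges, n, alpha, iterations=200):
    """Power iteration pi(k+1) = pi(k) G on the Google matrix."""
    H = np.zeros((n, n))
    for i, j in edges:
        H[i - 1, j - 1] = 1.0
    for row in H:
        s = row.sum()
        row[:] = row / s if s > 0 else 1.0 / n   # dangling-page fix (H')
    G = alpha * H + (1 - alpha) / n * np.ones((n, n))
    pi = np.full(n, 1.0 / n)
    for _ in range(iterations):
        pi = pi @ G
    return pi

d1 = google_pagerank([(1, 1), (1, 2), (2, 2)], n=2, alpha=0.5)  # (1/3, 2/3)
d2 = google_pagerank([(1, 1), (1, 2), (2, 1)], n=2, alpha=0.5)  # (3/5, 2/5)
```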

50 More detail on exercise D1

51 Convergence
How many iterations are required to obtain a converged pagerank? Brin and Page (1998) assumed α = 0.85 and reported fast convergence. How does the value of α affect the rate of convergence? In general, as α → 1, the expected number of iterations required by the power method increases dramatically.

52 Meaning of α
α is the probability that the surfer doesn’t get bored and continues to click on page links; 1 − α is the probability of a jump to a random web page; 1/(1 − α) is the average number of steps before a jump to a random page. With α = 0.85 (Google), 1/(1 − α) = 100/15 ≈ 6.67.

53 The α Factor: Effect on convergence
Suppose a tolerance of 10^-10 (the results are correct to 10 significant digits).
α       Number of iterations
0.5         34
0.75        81
0.8        104
0.85       142
0.9        219
0.95       449
0.99     2,292
0.999   23,015
α should be relatively large to reduce the effect of random teleporting, which is artificial. α = 0.85 is a choice balancing the computational requirement against the quality of the output.

54 Another look at the Google matrix
The modified hyperlink matrix is the Google matrix G = α H’ + (1 − α)(1/n) I, where I is an n × n matrix with all entries equal to 1. The teleportation matrix is E = (1/n) I.

55 The Teleportation Matrix E
The standard teleportation matrix is E = (1/n) I = (1/n) eT e, where eT is a column vector of all 1’s and e is a row vector of all 1’s, so that eT e = I (the all-ones matrix). The probability of teleporting to a page is 1/n, the same for all pages.
G = α H’ + (1 − α)(1/n) I

56 Personalization
Each surfer can have their own preference over pages. Adjusted teleportation matrix: E = eT v, where v (called the personalization vector) defines the probabilities of a surfer s teleporting to the individual pages. The probability of a particular surfer s teleporting to page i is psi, with Σ_{i=1 to n} psi = 1.

57 Personalization
Different personalization vectors produce different pagerankings, using the same pagerank computation but with G = α H’ + (1 − α) E, where E is defined by the personalization vector. However, it is infeasible for search engines to store a personalization vector for each surfer.

58 Topic-Sensitive PageRank
Similar to personalized pagerank, but with only a limited number of topics (i.e., “kinds of persons”). Suppose all webpages are classified into 16 topics (a webpage can belong to more than one topic): Arts, Business, Computers, Games, Health, Home, Kids & Teens, News, Recreation, Reference, Regional, Science, Shopping, Society, Sports, World.

59 Topic-Sensitive PR
For each topic j, there is a topic vector vj (which works the same way as the personalization vector). Let Tj be the set of webpages belonging to topic j and |Tj| the size of the set (number of pages). The topic j vector vj = (p1 p2 ... pn) classifies pages for topic j:
pi = 1/|Tj| if page i ‘belongs to’ topic j, and pi = 0 otherwise, so that p1 + p2 + ... + pn = 1.
The teleportation matrix for topic j is defined as E(j) = eT vj, where eT = (1, 1, ..., 1)T.
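Building a topic vector and its teleportation matrix can be sketched as (topic_teleportation is our helper name; pages are 0-indexed here):

```python
import numpy as np

def topic_teleportation(topic_pages, n):
    """v has 1/|Tj| on the pages of topic j, 0 elsewhere; E(j) = e^T v."""
    v = np.zeros(n)
    v[list(topic_pages)] = 1.0 / len(topic_pages)
    E = np.outer(np.ones(n), v)   # every row of E equals v
    return v, E

# A toy topic containing pages 0 and 2 out of n = 4 pages:
v, E = topic_teleportation({0, 2}, n=4)   # v = (1/2, 0, 1/2, 0)
```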

60 Topic-Sensitive PageRank
The teleportation matrix for topic j is E(j) = eT vj. Use it in the Google formula G(j) = α H’ + (1 − α) E(j) to compute the pagerank of webpages with respect to topic j. If there are k topics (e.g., 16), then each webpage has k different pagerank values. Let p(j)(i) denote the pagerank of webpage i with respect to topic j.

61 Query-Specific PageRank
At query time, the topic-specific pageranks are combined, based on the topics of the query terms, to form a composite pagerank for the pages matching the query. Let P(Tj | query) denote the probability that a given query belongs to topic j; assuming the terms are independent, it can be estimated as
P(Tj | query) ∝ P(Tj) * P(term1 | Tj) * P(term2 | Tj) * ... * P(termk | Tj)
where the query consists of k terms term1, term2, ..., termk. P(Tj) = |Tj| / |T| is the probability that the surfer is interested in topic j, estimated as the proportion of pages with that topic. P(termi | Tj) is the probability that termi occurs in webpages of topic j, estimated as the proportion of webpages of topic j containing termi.

62 Query-Specific PageRank
Let p(j)(i) denote the pagerank of webpage i w.r.t. topic j. The query-specific pagerank for webpage i is calculated as Σ_{all topics j} P(Tj | query) * p(j)(i). Haveliwala (2002) studied topic-sensitive pagerank with 16 topics: Arts, Business, Computers, Games, Health, Home, Kids & Teens, News, Recreation, Reference, Regional, Science, Shopping, Society, Sports, World.
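The combination step is just a weighted sum over topics; a sketch with illustrative (made-up) numbers, where topic_probs plays the role of P(Tj | query) and topic_ranks[j] is the topic-j pagerank vector:

```python
def query_pagerank(topic_probs, topic_ranks):
    """Composite pagerank: sum over topics j of P(Tj|query) * p(j)(i)."""
    n = len(topic_ranks[0])
    return [sum(p * ranks[i] for p, ranks in zip(topic_probs, topic_ranks))
            for i in range(n)]

# Two topics, three pages (illustrative values only):
probs = [0.8, 0.2]
ranks = [[0.5, 0.3, 0.2],   # pageranks w.r.t. topic 1
         [0.1, 0.1, 0.8]]   # pageranks w.r.t. topic 2
combined = query_pagerank(probs, ranks)
```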

63 Example
Some queries and their corresponding P(Tj | query) in the study of Haveliwala.

64 Back Button Model
More realistic. The original random surfer model assumes that the surfer randomly selects a new page to go to when they arrive at a dangling webpage. The back button model assumes the random surfer can always return to the page they came from, by pressing the browser’s back button. This “bounce-back” feature can be modelled by adding new links and vertices to the Web graph. How?

65 Modification of the WWW Graph
Split a dangling vertex into k copies (bounce-back vertices), where k is the in-degree of the vertex. Each copy takes one in-link from a source vertex and gets a new bounce-back link back to that source vertex, modelling the back button.

66 Modification of H: An Example
[Figure: the 6-page example graph.]

67 Modification of H
[Figure: the modified graph, in which the dangling vertices are split into bounce-back copies labelled 53, 63 and 64.]

68 Example (with α = 0.85)
The original pagerank gives the ranking: P3, P1/P2/P4/P6, P5. The new pagerank, after combining the pageranks of the copies 63 & 64, gives the new ranking: P3, P4, P6, P1/P2, P5.

69 Further Topics Please read the remaining slides in your own time

70 User adjustment to the Hyperlink Matrix H
A standard hyperlink matrix H assumes that surfers choose any out-link with equal probability. However, we may not wish to assume all out-links of a page are equally probable; the choice may depend on the page content or the link anchor text. E.g., links to content-filled pages may be given more probabilistic weight than links to brief advertisement pages. A practical approach to filling in H’s elements is to use access logs to find actual surfer tendencies.

71 Example
A webmaster may find that surfers on page P1 are twice as likely to follow the hyperlink to P2 as the one to P3.

72 Accelerating Computation of PageRank
Details of the computation of PageRank: reduce the work per iteration, and reduce the total number of iterations. Techniques: the adaptive power method, aggregation, and efficient vector-matrix multiplication.

73 Adaptive Power Method
Termination criterion: stop when the current pagerank π(k) (at the k-th iteration) is “close” to the stable pagerank π. But we do not know π! In practice, check whether π(k) is close to π(k−1). The distance between π(k) and π(k−1) is measured by the 1-norm
||π(k) − π(k−1)||_1 = Σ_{i=1 to n} |π(k)_i − π(k−1)_i|
where π(k)_i and π(k−1)_i are the pageranks of Pi at iterations k and k−1, respectively, and |π(k)_i − π(k−1)_i| is the absolute value of their difference, i.e., the positive value regardless of sign.

74 Termination Criterion Example
π(k) = (0.198 0.199 0.200 0.201 0.202), π(k−1) = (0.202 0.201 0.200 0.199 0.198)
||π(k) − π(k−1)||_1 = |0.198 − 0.202| + |0.199 − 0.201| + |0.200 − 0.200| + |0.201 − 0.199| + |0.202 − 0.198| = 0.004 + 0.002 + 0 + 0.002 + 0.004 = 0.012
When ||π(k) − π(k−1)||_1 < ε (a predefined threshold), the iterative process terminates and π(k) is taken as the final pagerank; otherwise, the process continues.
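The termination check can be sketched in a few lines (the two vectors below reproduce the componentwise differences of this example):

```python
def l1_distance(a, b):
    """||a - b||_1: the sum of absolute componentwise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

pi_k      = [0.198, 0.199, 0.200, 0.201, 0.202]
pi_k_prev = [0.202, 0.201, 0.200, 0.199, 0.198]
dist = l1_distance(pi_k, pi_k_prev)   # 0.012
epsilon = 0.01                         # a predefined threshold
converged = dist < epsilon             # False here: keep iterating
```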

75 Adaptive Approach
Some pages converge to their pagerank values faster than other pages. Apply the termination criterion to individual pages: “lock” the pagerank of Pi if |π(k)_i − π(k−1)_i| < ε (a predefined value, e.g., ε = 10^-3), and only update the unlocked pageranks in the next iteration. Terminate the iterative process when all pageranks are locked. Kamvar et al. (2003) found in their experiments that a 17% speedup is achieved in computing the pagerank using this adaptive approach.

76 Aggregation Heuristic
Divide the pagerank computation into two stages, merging all pages from a host into one vertex:
1. Only inter-host hyperlinks are considered, and hostrank values are computed.
2. For each host, only intra-host hyperlinks are considered, and local pageranks for the pages within the same host are computed.
The method is called BlockRank.

77 PageRank Approximation
Let H be the set of hosts and |H| the number of hosts. Let Hi be the set of pages in host i and |Hi| the number of pages in Hi. The method finds: one 1 × |H| hostrank vector, and |H| local pagerank vectors, each 1 × |Hi| in size. The global pagerank of the k-th page of host j is approximated by the product of the j-th element of the hostrank vector and the k-th element of the j-th local pagerank vector.

78 Example
The numbers on the edges are the probabilities that the surfer takes those arcs (i.e., empirical transition probabilities, not the random surfer model). Assume P1, P2, P3 & P7 are in one host (Host 1) and P4, P5 & P6 are in another host (Host 2). [Figure: the two hosts, with an inter-host arc of probability 0.04 from Host 1 to Host 2.]

79 Example (cont)
In the 2-vertex graph of hosts, the associated hyperlink matrix has rows (0.96, 0.04) for Host 1 and (0, 1) for Host 2: at (P3 of) Host 1 there is probability 0.04 of going to Host 2 and probability 0.96 of staying in Host 1, while at Host 2 there is no way of going to Host 1. With α = 0.9, the stable hostrank obtained is (0.3676, 0.6324): 36.76% of the time the surfer is visiting pages of Host 1.

80 Example (cont)
The local pagerank vectors are computed for each host. Host 1: the link from P3 to P4 is ignored; with α = 0.9 we obtain a local pagerank vector over the pages P1, P2, P3, P7.

81 Example (cont)
Host 2: we obtain a local pagerank vector (1/3 1/3 1/3).

82 Example (cont)
Disaggregation: multiply the hostrank and the local pagerank to approximate the global pagerank. Each page of Host 2 gets approximated pagerank 0.6324 × 1/3 = 0.2108, so P4, P5, P6 rank highest; the Host 1 pages share the remaining 0.3676 (with values such as 0.0614, 0.1167 and 0.1280). The approximation can be compared with the exact pagerank computed for the original graph.

83 Efficient Vector-Matrix Multiplication
Pagerank computation involves many vector-matrix multiplications of the form y = x H. Thus y(i) = x H(i), where H(i) is the i-th column of H.

84 Efficient Vector-Matrix Multiplication
It is more efficient to represent the hyperlink matrix H as an adjacency list, since the matrix is sparse (i.e., it has lots of zero entries):
An array deg[i], for i = 1 to n, storing the out-degree of each webpage
An array L[ ] of pairs (i, j) where page i has a hyperlink to page j; the length of the array is the total number of hyperlinks
A hyperlink entry Hij = 1/deg[i] if the pair (i, j) exists, otherwise Hij = 0 (assuming the most basic hyperlink matrix)
Even so, the array L may still be too large to fit in memory.

85 Efficient Vector-Matrix Multiplication
Input: x, deg[ ], and L[ ]
Output: y = x H
Assumption: deg[ ], x[ ] and y[ ] fit into memory, but L[ ] does not.
Algorithm:
  For each j: y[j] = 0
  While L is not exhausted:
    Read a batch of pairs (i, j) from L into memory
    For each i in the batch:
      z = 1 / deg[i]
      For each j such that (i, j) is in the batch:
        y[j] = y[j] + z * x[i]
The performance of the algorithm depends on how many batches of pairs (i, j) are read into memory.

86 Efficient Vector-Matrix Multiplication
Example: L = (1,2), (1,3), (3,1), (3,2), (3,5), (4,5), (4,6), (5,4), (5,6), (6,4) and deg[ ] = 2, 0, 3, 2, 2, 1 (the 6-page example graph). Suppose L can be read in 2 batches of 5 pairs, and x[ ] = 1/6, 1/6, 1/6, 1/6, 1/6, 1/6.
Initialization: y[ ] = 0, 0, 0, 0, 0, 0
After the first batch: y[ ] = 1/18, 5/36 (= 1/12 + 1/18), 1/12, 0, 1/18, 0
After the second batch (starting from that y[ ]): y[ ] = 1/18, 5/36, 1/12, 1/4, 5/36, 1/6
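The batched algorithm and this example can be sketched as follows (batched_vecmat is our helper name; exact fractions make the check against the values above exact):

```python
from fractions import Fraction

def batched_vecmat(x, deg, L, batch_size):
    """y = x H, reading the link list L in batches of batch_size pairs."""
    y = [Fraction(0)] * len(x)
    for start in range(0, len(L), batch_size):
        for i, j in L[start:start + batch_size]:   # one batch in "memory"
            y[j - 1] += x[i - 1] / deg[i - 1]      # Hij = 1/deg[i]
    return y

L = [(1, 2), (1, 3), (3, 1), (3, 2), (3, 5),
     (4, 5), (4, 6), (5, 4), (5, 6), (6, 4)]
deg = [2, 0, 3, 2, 2, 1]           # page 2 is dangling, but never occurs as i
x = [Fraction(1, 6)] * 6
y = batched_vecmat(x, deg, L, batch_size=5)
# y = [1/18, 5/36, 1/12, 1/4, 5/36, 1/6]
```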

87 Efficient Vector-Matrix Multiplication
When even deg[ ], x[ ] and y[ ] are too large to fit in memory: divide the arrays into β blocks, where each block fits in memory. L[ ] is divided into β blocks, where L[b] = {(i, j) : j in block b}, each sorted first by i and then by j.
Algorithm:
  For b = 1 to β:
    For each j in block b: y[j] = 0
    While L[b] is not exhausted:
      Read a batch of pairs (i, j) from L[b] into memory
      For each i in the batch:
        z = 1 / deg[i]
        For each j such that (i, j) is in the batch:
          y[j] += z * x[i]
The performance depends on the number of disk accesses across the β blocks. It is not guaranteed that deg[i] and x[i] are in memory, but L[b] is sorted in a way that minimizes the disk accesses to deg[ ] and x[ ] within a batch of L[b].

88 Efficient Vector-Matrix Multiplication
Example: β = 2, with indices 1..3 in block 1 and indices 4..6 in block 2.
L[1] = {(1,2), (1,3), (3,1), (3,2)}
L[2] = {(3,5), (4,5), (4,6), (5,4), (5,6), (6,4)}
deg[ ] = 2, 0, 3, 2, 2, 1 and x = 1/6, 1/6, 1/6, 1/6, 1/6, 1/6
Initialization: y[ ] = 0, 0, 0, 0, 0, 0
After the first block: y[ ] = 1/18, 5/36, 1/12, 0, 0, 0
After the second block: y[ ] = 1/18, 5/36, 1/12, 1/4, 5/36, 1/6

