Published by Shannon Phillips. Modified over 8 years ago.
1
CS 540 Database Management Systems
Web Data Management
Some slides are due to Kevin Chang
2
How is the Web different from a database of documents?
3
Hypertext vs. text: many additional clues
  – graph vs. set
  – anchor text vs. text: what do others say about you?
Geographically distributed vs. centralized
  – so you need to build a crawler
Precision is valued more than recall
  – quality is more important than quantity, especially for “broad” queries
Spamming
4
Web data and query
Data model
  – directed graph
  – nodes: Web pages
  – links: hyperlinks
  – all nodes belong to the same type
Query is a set of terms
Answer
  – ranked list of relevant and important pages
  – quantifying a subjective quality
More complex models
  – e.g., assigning types to pages
5
Web search before Hubs-Authorities / PageRank
Web as a set of documents
Relevance: content-based retrieval
  – documents match queries by contents
  – q: ‘clinton’: pages containing more occurrences of ‘clinton’ rank higher
Importance???
  – contents: what documents say about themselves
  – much spam and unreliable information in the results
Directory services were used
  – Yahoo! was one of the leaders!
  – Google co-founders were told “nobody will use a keyword interface”.
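The content-based ranking described on this slide can be sketched as counting query-term occurrences. This is a minimal illustration only, not how any particular engine worked; the function name `rank_by_term_count` and the sample pages are hypothetical.

```python
def rank_by_term_count(pages, query_terms):
    """Rank page names by how often the query terms occur in their text.

    pages: dict mapping a (hypothetical) page name to its text.
    query_terms: list of lowercase terms.
    """
    scores = {}
    for name, text in pages.items():
        words = text.lower().split()
        scores[name] = sum(words.count(term) for term in query_terms)
    # Highest score first; ties are broken by page name for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical pages: "spam" simply repeats the query term.
pages = {
    "news": "clinton visits the senate",
    "spam": "clinton clinton clinton clinton buy now",
    "other": "database management systems",
}
ranking = rank_by_term_count(pages, ["clinton"])
print(ranking)
```

In this sketch the page that merely repeats the term ranks first, which is the weakness of relying on what documents say about themselves.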
6
Hubs and Authorities
An intuitive/informal definition:
  – authorities: highly regarded, authoritative pages
  – hubs: pages that refer you to authorities
A recursive definition: mutually reinforcing relationships
  – hub: a page that links to many authorities
  – authority: a page that is linked to by many hubs
7
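Web: Adjacency Matrix
Web: $G = (V, E)$
  – $V = \{x, y, z\}$, $|V| = n$
  – $E = \{(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)\}$
  – $A$: $n \times n$ matrix: $A_{ij} = 1$ if page $i$ links to page $j$, 0 if not
Rows are source nodes, columns are target nodes:
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$
[Figure: graph on nodes x, y, z]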
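As a small sketch, the matrix on this slide can be built from the edge list. The variable names (`nodes`, `edges`, `index`) are illustrative choices, and the node order x, y, z is assumed.

```python
# Nodes and edges of the example graph, in the order x, y, z.
nodes = ["x", "y", "z"]
edges = [("x", "x"), ("x", "y"), ("x", "z"),
         ("y", "z"), ("z", "x"), ("z", "y")]

index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# A[i][j] = 1 if page i links to page j, 0 otherwise.
A = [[0] * n for _ in range(n)]
for source, target in edges:
    A[index[source]][index[target]] = 1

for row in A:
    print(row)
```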
8
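Transposed Adjacency Matrix
Adjacency matrix $A$:
  – what does row $j$ represent?
Transpose $A^t$:
  – what does row $j$ represent?
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad A^t = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$
[Figure: graph on nodes x, y, z]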
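One way to explore the two questions is to read the rows off in code. In this sketch, with the node order x, y, z assumed and the helper name `names_in_row` made up for illustration, row j of A lists the pages that j links to, and row j of the transpose lists the pages that link to j.

```python
nodes = ["x", "y", "z"]
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

# Transpose: rows of A_t are the columns of A.
A_t = [list(column) for column in zip(*A)]


def names_in_row(matrix, j):
    """Return the node names whose entry in row j is 1."""
    return [nodes[k] for k, bit in enumerate(matrix[j]) if bit]


out_links = {name: names_in_row(A, j) for j, name in enumerate(nodes)}
in_links = {name: names_in_row(A_t, j) for j, name in enumerate(nodes)}
print(out_links)
print(in_links)
```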
9
Hubbiness and Authority
Hubbiness: a vector $h$
  – $h_i$ is a value representing the “hubbiness” of page $i$
Authority: a vector $a$
  – $a_i$ is a value representing the “authority” of page $i$
Mutually recursive definition: in terms of $h$ and $a$
  – $h_x = ?$
  – $a_x = ?$
[Figure: graph on nodes x, y, z]
10
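Hubbiness
Hubbiness:
  – $h_x = a_x + a_y + a_z$
  – $h_y = a_z$
  – $h_z = a_x + a_y$
$h = \alpha A a$
  – $A$: links-to nodes
  – $a$: their authority weights
  – $\alpha$: scaling factor to normalize
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$
[Figure: graph on nodes x, y, z]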
11
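Authority
Authority:
  – $a_x = h_x + h_z$
  – $a_y = h_x + h_z$
  – $a_z = h_x + h_y$
$a = \beta A^t h$
  – $A^t$: linked-from nodes
  – $h$: their hub weights
  – $\beta$: scaling factor
$$A^t = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$
[Figure: graph on nodes x, y, z]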
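A single round of the two updates can be sketched as two matrix-vector products. The sketch assumes both scaling factors are 1 and starts from authority weights (1, 1, 1); the helper names `mat_vec` and `transpose` are illustrative.

```python
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


a = [1, 1, 1]                 # initial authority weights (assumed)
h = mat_vec(A, a)             # h = A a: sum authority of linked-to pages
a = mat_vec(transpose(A), h)  # a = A^t h: sum hubbiness of linking pages
print(h, a)
```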
12
Finding Hubbiness and Authority
Recursive definition:
  – $a = \beta A^t h$, $h = \alpha A a$
Authority: $a = \alpha\beta (A^t A) a$
  – $a$ is an eigenvector of $A^t A$
Hubbiness: $h = \alpha\beta (A A^t) h$
  – $h$ is an eigenvector of $A A^t$
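The substitution behind these two statements can be written out step by step. The symbol $\lambda$ is introduced here only to name the eigenvalue; it does not appear on the slides.

```latex
\begin{aligned}
a &= \beta A^t h = \beta A^t (\alpha A a) = \alpha\beta\,(A^t A)\,a, \\
h &= \alpha A a = \alpha A (\beta A^t h) = \alpha\beta\,(A A^t)\,h, \\
\text{so}\quad (A^t A)\,a &= \lambda a, \qquad (A A^t)\,h = \lambda h,
\qquad \lambda = \frac{1}{\alpha\beta}.
\end{aligned}
```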
13
Computing Hubbiness and Authority
Computation: by “relaxation”
  – start from some initial values of $a$ and $h$: $z = (1, 1, \ldots, 1)$, $a_0 = z$, $h_0 = z$
  – repeat until fixpoint: apply the equations
      $a_i = \alpha\beta (A^t A) a_{i-1}$
      $h_i = \alpha\beta (A A^t) h_{i-1}$
  – fixpoint: $a_i \approx a_{i-1}$, $h_i \approx h_{i-1}$
Convergence:
  – for $a$: $A^t A$ is symmetric (and $z$ is “right”), so relaxation will converge to the principal eigenvector of $A^t A$
  – for $h$: similarly, the principal eigenvector of $A A^t$
14
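Computing Hubbiness and Authority
Assume $\alpha = 1$, $\beta = 1$, initial $h = a = (1, 1, 1)$
  – note: $A^t A$ and $A A^t$ are both symmetric matrices
$$a = (A^t A)\,a = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix} a, \qquad h = (A A^t)\,h = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix} h$$
Iterations (rows are pages x, y, z):
a:   1    2    3    4
x    1    5    24   114
y    1    5    24   114
z    1    4    18   84
h:   1    2    3    4
x    1    6    28   132
y    1    2    8    36
z    1    4    20   96
Will converge, e.g., with some scaling:
  – $a \to (1.36, 1.36, 1)$ (or $(0.63, 0.63, 0.46)$ as a unit vector)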
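The relaxation can be sketched as a short loop. This is a minimal sketch rather than a reference implementation: it rescales both vectors to unit length after every step (one possible choice of scaling factors) and runs a fixed number of rounds instead of testing for a fixpoint. The function name `hits` and the parameter `rounds` are illustrative.

```python
import math

A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]


def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def unit(vector):
    """Rescale a vector to Euclidean length 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def hits(matrix, rounds=50):
    """Alternate h = A a and a = A^t h, normalizing after each step."""
    a = [1.0] * len(matrix)
    h = [1.0] * len(matrix)
    matrix_t = transpose(matrix)
    for _ in range(rounds):
        h = unit(mat_vec(matrix, a))
        a = unit(mat_vec(matrix_t, h))
    return a, h


a, h = hits(A)
print([round(v, 2) for v in a])
print([round(v, 2) for v in h])
```

The rounded authority vector agrees with the unit vector quoted on the slide.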
15
Google: PageRank
Reference: http://www7.scu.edu.au/
  – S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, J. M. Kleinberg: Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. WWW7 / Computer Networks 30(1-7): 65-74 (1998)
  – S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
Google.com:
  – in the Stanford Digital Libraries project, 1996-98, around the same time as Kleinberg’s paper
  – tried to sell to Infoseek in 1997
  – founded in 1998 by Brin and Page
16
PageRank: importance of pages
PageRank (or importance): defined recursively
  – a page P is important if important pages link to it
  – importance of P: each page that links to P contributes a proportional share of its own importance
Example:
  – $r_x = \frac{1}{2} r_x + \frac{1}{2} r_z$
  – $r_y = \frac{1}{2} r_z$
  – $r_z = \frac{1}{2} r_x + 1\, r_y$
Random-surfer interpretation:
  – the surfer randomly follows links to navigate
  – PageRank = the probability that the surfer will visit the page
[Figure: graph on nodes x, y, z]
17
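Computing PageRank
Importance-propagation equation:
$$r = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix} r$$
  – linked-from ($A^t$) or links-to matrix ($A$)?
  – column-normalized: column x is all that x points to; the sum of each column = 1
Computation: by relaxation (rows are pages x, y, z)
r:   1    2     3     …   fixpoint
x    1    1     5/4   …   6/5
y    1    1/2   3/4   …   3/5
z    1    3/2   1     …   6/5
[Figure: graph on nodes x, y, z]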
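The relaxation on this slide can be reproduced with a few lines of code. This sketch uses exact fractions for the first steps so the values can be compared with the table, then switches to floating point to approach the fixpoint; the names `step`, `history`, and `approx` are illustrative.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Column-normalized matrix from the slide (order x, y, z).
M = [[half, 0, half],
     [0, 0, half],
     [half, 1, 0]]


def step(matrix, r):
    """One application of r = M r."""
    return [sum(m * v for m, v in zip(row, r)) for row in matrix]


# Exact values for the first iterations, starting from (1, 1, 1).
r = [Fraction(1), Fraction(1), Fraction(1)]
history = [r]
for _ in range(2):
    r = step(M, r)
    history.append(r)
print(history)

# Many more rounds in floating point to approximate the fixpoint.
approx = [1.0, 1.0, 1.0]
for _ in range(200):
    approx = step(M, approx)
print(approx)
```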
18
Problems: Dead Ends
Dead ends:
  – a page without successors has nowhere to send its importance
  – eventually, what would happen to $r$?
Example:
  – $r_a = 0\, r_a + 0\, r_b$
  – $r_b = 1\, r_a + 0\, r_b$
[Figure: graph on nodes x, y, z; example graph on nodes a, b]
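Iterating the two example equations shows what happens to r. This is a small sketch that assumes a start of (1, 1); the names `step` and `trace` are illustrative.

```python
def step(r_a, r_b):
    """Apply the dead-end example equations once."""
    return 0 * r_a + 0 * r_b, 1 * r_a + 0 * r_b


r = (1, 1)
trace = [r]
for _ in range(3):
    r = step(*r)
    trace.append(r)
print(trace)
```

In this sketch all importance drains out of the system after two steps.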
19
Problems: Spider Trap
Spider traps:
  – a group of pages without out-of-group links will trap a spider inside
  – what would happen to $r$?
Example:
  – $r_a = \frac{1}{2} r_a + 0\, r_b$
  – $r_b = \frac{1}{2} r_a + 1\, r_b$
Solutions??
[Figure: graph on nodes x, y, z; example graph on nodes a, b]
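The same kind of experiment illustrates the spider trap. The sketch below assumes a start of (1, 1) and a fixed 60 rounds; `step` is an illustrative name.

```python
def step(r_a, r_b):
    """Apply the spider-trap example equations once."""
    return 0.5 * r_a + 0 * r_b, 0.5 * r_a + 1 * r_b


r = (1.0, 1.0)
for _ in range(60):
    r = step(*r)
r_a, r_b = r
print(r_a, r_b)
```

Here the total importance is preserved, but page b, which links only to itself, ends up holding essentially all of it.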
20
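Solutions: surfer’s random jump (teleportation)
Surfer can randomly jump to a new page
  – without following links
  – $d$: damping factor (set to 0.85 in the paper); the $(1 - d)$ term models the probability of randomly jumping to this page
Another interpretation:
  – “tax” the importance of each page and distribute it to all pages
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
  – $T_1, \ldots, T_n$: the pages that link to $A$; $C(T)$: the number of links going out of $T$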
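The formula can be applied iteratively, as in the following sketch. The two-page graph (a links to a and b, b links only to itself), the function name `damped_pagerank`, and the fixed number of rounds are assumptions made for illustration.

```python
def damped_pagerank(out_links, d=0.85, rounds=200):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    out_links: dict mapping each page to the list of pages it links to.
    """
    pr = {page: 1.0 for page in out_links}
    for _ in range(rounds):
        new_pr = {}
        for page in out_links:
            incoming = sum(
                pr[source] / len(targets)
                for source, targets in out_links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical spider-trap graph: b links only to itself.
trap = {"a": ["a", "b"], "b": ["b"]}
pr = damped_pagerank(trap)
print(pr)
```

With the random jump, page a keeps a nonzero score in this sketch even though page b never links out.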
21
Anti-Spamming
Spamming:
  – an attempt to create artifacts that “please” search engines
  – so that the ranking will be high
  – e.g., commercial “search engine optimization” services
Google anti-spam device:
  – unlike other search engines, tends to believe what others say about you through links and anchor texts
  – recursive importance also works: importance (not just links) propagates
  – still not a perfect solution: suggestions?
22
PageRank and Hub/Authority influence
Connected DB/DM with link analysis
Web, social networks, biological networks, …
  – information network, graph DB
Typical problems
  – finding similar nodes (items)
  – community detection / node clustering
  – …
23
Web as a database
Active and challenging research area
Information extraction
  – finding entities and relationships from pages
Information integration
  – integrating data from multiple websites
Easier-to-use query interfaces
  – natural-language queries / question answering
24
What you should know
Web data and query model
Hubs and authorities
PageRank formula and algorithm
Dead ends and spider traps
Teleportation