CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.


2 How is the Web different from a database of documents?

3
Hypertext vs. text: many additional clues
  – graph vs. set
  – anchor text vs. text: what do others say about you?
Geographically distributed vs. centralized
  – so you need to build a crawler
Precision is valued more than recall
  – quality is more important than quantity, especially for “broad” queries
Spamming

4 Web data and query
Data model
  – directed graph
  – nodes: Web pages
  – links: hyperlinks
  – all nodes belong to the same type
Query is a set of terms
Answer
  – ranked list of relevant and important pages
  – quantifying a subjective quality
More complex models
  – e.g., assigning types to pages

5 Web search before Hubs-Authorities / PageRank
Web as a set of documents
Relevance: content-based retrieval
  – documents match queries by their contents
  – q: ‘clinton’ → rank pages with more occurrences of ‘clinton’ higher
Importance?
  – contents: what documents say about themselves
  – results contain a lot of spam and unreliable information
Directory services were used
  – Yahoo! was one of the leaders
  – Google co-founders were told “nobody will use a keyword interface”
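To make the weakness of purely content-based ranking concrete, here is a minimal sketch of ranking by term count. The function name `term_frequency_rank` and the three toy documents are made up for illustration; real engines of that era used more refined scoring.

```python
from collections import Counter

def term_frequency_rank(docs, query_terms):
    """Rank document ids by how often the query terms occur in each document."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        scores[doc_id] = sum(counts[t.lower()] for t in query_terms)
    # Highest count first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy corpus: the "spam" page simply repeats the query term.
docs = {
    "news": "clinton gave a speech on trade policy",
    "bio": "bill clinton biography the clinton years",
    "spam": "clinton clinton clinton clinton buy now",
}
ranking = term_frequency_rank(docs, ["clinton"])
print(ranking)
```

Because a page controls its own contents, the page that repeats the term most often ranks first, which is the spam problem noted on the slide.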

6 Hubs and Authorities
An intuitive/informal definition:
  – authorities: highly regarded, authoritative pages
  – hubs: pages that refer you to authorities
A recursive definition: mutually reinforcing relationships
  – hub: a page that links to many authorities
  – authority: a page that is linked to by many hubs

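7 Web: Adjacency Matrix
Web: $G = (V, E)$
  – $V = \{x, y, z\}$, $|V| = n$
  – $E = \{(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)\}$
  – $A$: $n \times n$ matrix with $A_{ij} = 1$ if page $i$ links to page $j$, and 0 otherwise
Rows are source nodes and columns are target nodes, in the order x, y, z:
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$
[Figure: example graph with nodes x, y, z]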
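As a small illustration, the adjacency matrix can be built directly from the edge list E. This is a sketch in plain Python lists; the variable names are chosen here and are not part of the slides.

```python
nodes = ["x", "y", "z"]
edges = [("x", "x"), ("x", "y"), ("x", "z"),
         ("y", "z"), ("z", "x"), ("z", "y")]

# Map each node name to its row/column position.
index = {name: i for i, name in enumerate(nodes)}

# A[i][j] = 1 if page i links to page j, 0 otherwise.
A = [[0] * len(nodes) for _ in nodes]
for src, dst in edges:
    A[index[src]][index[dst]] = 1

for name, row in zip(nodes, A):
    print(name, row)
```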

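8 Transposed Adjacency Matrix
Adjacency matrix $A$:
  – what does row j represent?
Transpose $A^T$:
  – what does row j represent?
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$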

9 Hubbiness and Authority
Hubbiness: a vector $h$
  – $h_i$ is a value representing the “hubbiness” of page $i$
Authority: a vector $a$
  – $a_i$ is a value representing the “authority” of page $i$
Mutually recursive definition in terms of $h$ and $a$:
  – $h_x = ?$
  – $a_x = ?$
[Figure: example graph with nodes x, y, z]

10 Hubbiness
Hubbiness:
  – $h_x = a_x + a_y + a_z$
  – $h_y = a_z$
  – $h_z = a_x + a_y$
$h = \alpha A a$
  – $A$: links-to nodes
  – $a$: their authority weights
  – $\alpha$: scaling factor to normalize
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

11 Authority
Authority:
  – $a_x = h_x + h_z$
  – $a_y = h_x + h_z$
  – $a_z = h_x + h_y$
$a = \beta A^T h$
  – $A^T$: linked-from nodes
  – $h$: their hub weights
  – $\beta$: scaling factor
$$A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$
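One update step can be checked numerically. The sketch below takes α = β = 1 and an arbitrary starting authority vector (both are assumptions made only for illustration), then applies h = A a followed by a = A^T h.

```python
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

a = [1.0, 2.0, 3.0]                # arbitrary authority weights for x, y, z
h = matvec(A, a)                   # h_x = a_x + a_y + a_z, h_y = a_z, h_z = a_x + a_y
a_next = matvec(transpose(A), h)   # a_x = h_x + h_z, a_y = h_x + h_z, a_z = h_x + h_y

print(h, a_next)
```

Each entry of `h` sums the authority of the pages a node links to, and each entry of `a_next` sums the hubbiness of the pages that link to a node.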

12 Finding Hubbiness and Authority
Recursive definition:
  – $a = \beta A^T h$, $h = \alpha A a$
Authority: $a = \alpha\beta\,(A^T A)\,a$
  – $a$ is an eigenvector of $A^T A$
Hubbiness: $h = \alpha\beta\,(A A^T)\,h$
  – $h$ is an eigenvector of $A A^T$
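The substitution behind the eigenvector claim, written out step by step. The symbol λ for the eigenvalue is introduced here for convenience and does not appear on the slide.

```latex
\begin{align*}
a &= \beta A^T h = \beta A^T(\alpha A a) = \alpha\beta\,(A^T A)\,a, \\
h &= \alpha A a = \alpha A(\beta A^T h) = \alpha\beta\,(A A^T)\,h, \\
\text{so}\quad (A^T A)\,a &= \lambda a, \qquad (A A^T)\,h = \lambda h,
\qquad \lambda = \frac{1}{\alpha\beta}.
\end{align*}
```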

13 Computing Hubbiness and Authority
Computation: by “relaxation”
  – start from some initial values of a and h: $z = (1, 1, \ldots, 1)$, $a_0 = z$, $h_0 = z$
  – repeat until fixpoint, applying the equations
      $a_i = \alpha\beta\,(A^T A)\,a_{i-1}$
      $h_i = \alpha\beta\,(A A^T)\,h_{i-1}$
  – fixpoint: $a_i \approx a_{i-1}$, $h_i \approx h_{i-1}$
Convergence:
  – for a: $A^T A$ is symmetric (and z is “right”), so relaxation will converge to the principal eigenvector of $A^T A$
  – for h: similarly, the principal eigenvector of $A A^T$
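A minimal sketch of the relaxation loop on the three-page example. It assumes that the scaling αβ is realized by rescaling each vector to unit Euclidean length after every step; the slide leaves the choice of scaling open. The function names and tolerance are illustrative.

```python
import math

A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalize(v):
    """Scale v to unit Euclidean length (plays the role of alpha * beta)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def hubs_and_authorities(A, tol=1e-12, max_iter=1000):
    At = transpose(A)
    a = normalize([1.0] * len(A))
    h = normalize([1.0] * len(A))
    for _ in range(max_iter):
        new_a = normalize(matvec(At, matvec(A, a)))   # a_i = (A^T A) a_{i-1}, rescaled
        new_h = normalize(matvec(A, matvec(At, h)))   # h_i = (A A^T) h_{i-1}, rescaled
        change = max(abs(p - q) for p, q in zip(new_a + new_h, a + h))
        a, h = new_a, new_h
        if change < tol:                              # fixpoint reached
            break
    return a, h

a, h = hubs_and_authorities(A)
print([round(x, 2) for x in a])
print([round(x, 2) for x in h])
```

Without the rescaling the entries would grow without bound, so the loop would never satisfy a numeric fixpoint test even though the direction of the vector converges.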

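14 Computing Hubbiness and Authority
Assume $\alpha = 1$, $\beta = 1$, and initial $h = a = (1, 1, 1)$
  – note: $A^T A$ and $A A^T$ are both symmetric matrices
$$a = (A^T A)\,a = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix} a, \qquad h = (A A^T)\,h = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix} h$$
Successive values without rescaling:

iteration   1   2   3    4
a_x         1   5   24   114
a_y         1   5   24   114
a_z         1   4   18   84

iteration   1   2   3    4
h_x         1   6   28   132
h_y         1   2   8    36
h_z         1   4   20   96

Will converge, e.g., with some scaling:
  – $a \to (1.36, 1.36, 1)$, or $(0.63, 0.63, 0.46)$ as a unit vector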
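The iteration values in this example can be reproduced with a few lines of integer arithmetic. This is only a check of the worked numbers; the matrices are the products A^T A and A A^T for the three-page example, and the variable names are chosen here.

```python
AtA = [[2, 2, 1],
       [2, 2, 1],
       [1, 1, 2]]
AAt = [[3, 1, 2],
       [1, 1, 0],
       [2, 0, 2]]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

a, h = [1, 1, 1], [1, 1, 1]
a_steps, h_steps = [a], [h]
for _ in range(3):                 # iterations 2, 3, 4 with alpha = beta = 1
    a = matvec(AtA, a)
    h = matvec(AAt, h)
    a_steps.append(a)
    h_steps.append(h)

# Rescale the last authority vector so that its z component is 1.
a_scaled = [round(x / a[2], 2) for x in a]

print(a_steps)
print(h_steps)
print(a_scaled)
```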

15 Google: PageRank
Reference: http://www7.scu.edu.au/
  – S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, J. M. Kleinberg: Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. WWW7 / Computer Networks 30(1-7): 65-74 (1998)
  – S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
Google.com:
  – in the Stanford Digital Libraries project 1996-98, around the same time as Kleinberg’s paper
  – tried to sell to Infoseek in 1997
  – founded in 1998 by Brin and Page

16 PageRank: importance of pages
PageRank (or importance), defined recursively:
  – a page P is important if important pages link to it
  – importance of P: contributed proportionally by the pages that link to it
Example:
  – $r_x = \tfrac{1}{2} r_x + \tfrac{1}{2} r_z$
  – $r_y = \tfrac{1}{2} r_z$
  – $r_z = \tfrac{1}{2} r_x + 1\,r_y$
Random-surfer interpretation:
  – the surfer randomly follows links to navigate
  – PageRank = the probability that the surfer will visit the page
[Figure: example graph with nodes x, y, z]
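The random-surfer interpretation can be illustrated by simulation. In the sketch below the out-links are inferred from the example equations (x links to x and z, y links to z, z links to x and y); this is an assumption, since the figure itself is not preserved. The function name, step count, and seed are arbitrary.

```python
import random

# Out-links inferred from the example equations (assumed, see above).
out_links = {"x": ["x", "z"], "y": ["z"], "z": ["x", "y"]}

def simulate_surfer(out_links, steps, seed=0):
    """Follow uniformly random out-links and return the visit frequency of each page."""
    rng = random.Random(seed)
    page = "x"
    visits = {p: 0 for p in out_links}
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

freq = simulate_surfer(out_links, steps=200_000)
print(freq)
```

Solving the three example equations with the ranks summing to 1 gives r = (0.4, 0.2, 0.4), and the simulated visit frequencies should land close to those values.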

17 Computing PageRank
Importance-propagation equation:
$$r = M r, \qquad M = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix}$$
  – linked-from ($A^T$) or links-to ($A$) matrix?
  – column-normalized: column x is all that x points to; each column sums to 1
Computation: by relaxation

iteration   1   2     3     …   fixpoint
r_x         1   1     5/4   …   6/5
r_y         1   1/2   3/4   …   3/5
r_z         1   3/2   1     …   6/5
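A minimal sketch of the relaxation for this example, starting from r = (1, 1, 1) and repeatedly applying the column-normalized matrix. The function name, tolerance, and iteration cap are illustrative choices.

```python
# Column-normalized link matrix for the example; each column sums to 1.
M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 1.0, 0.0]]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def pagerank_relaxation(M, tol=1e-12, max_iter=10_000):
    r = [1.0] * len(M)
    for _ in range(max_iter):
        new_r = matvec(M, r)
        if max(abs(p - q) for p, q in zip(new_r, r)) < tol:
            return new_r               # fixpoint reached
        r = new_r
    return r

first_step = matvec(M, [1.0, 1.0, 1.0])
r = pagerank_relaxation(M)
print(first_step)
print([round(x, 4) for x in r])
```

Because every column sums to 1, the total importance stays at 3 throughout, which is why no rescaling step is needed here.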

18 Problems: Dead Ends
Dead ends:
  – a page without successors has nowhere to send its importance
  – eventually, what would happen to r?
Example:
  – $r_a = 0\,r_a + 0\,r_b$
  – $r_b = 1\,r_a + 0\,r_b$
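Iterating the two example equations shows what happens to r. In this sketch page a links only to b and b has no out-links, as read off the equations; the starting vector (1, 1) is an assumption.

```python
# Link matrix for the dead-end example: a -> b, and b has no out-links,
# so the column for b is all zeros.
M = [[0.0, 0.0],
     [1.0, 0.0]]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

r = [1.0, 1.0]
history = [r]
for _ in range(2):
    r = matvec(M, r)
    history.append(r)

print(history)
```

The importance that reaches b is never passed on, so the total drains away and r ends up as the zero vector.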

19 Problems: Spider Trap
Spider traps:
  – a group of pages without out-of-group links will trap a spider inside
  – what would happen to r?
Example:
  – $r_a = \tfrac{1}{2} r_a + 0\,r_b$
  – $r_b = \tfrac{1}{2} r_a + 1\,r_b$
Solutions?
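The same kind of check for the spider-trap example. Here a links to itself and to b, while b links only to itself, as read off the equations; the starting vector (1, 1) and the step count are assumptions.

```python
# Link matrix for the spider-trap example: a -> a, a -> b, b -> b.
M = [[0.5, 0.0],
     [0.5, 1.0]]

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

r = [1.0, 1.0]
for _ in range(60):
    r = matvec(M, r)

print(r)
```

The total importance is preserved, but all of it accumulates in the trap page b while a decays toward zero.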

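20 Solutions: surfer’s random jump (teleportation)
Surfer can randomly jump to a new page
  – without following links
  – d: damping factor (set to 0.85 in the paper)
  – the $(1 - d)$ term models the probability of randomly jumping to this page
Another interpretation:
  – “tax” the importance of each page and distribute it to all pages
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
where $T_1, \ldots, T_n$ are the pages that link to $A$ and $C(T)$ is the number of out-links of page $T$.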
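A minimal sketch of the damped formula, applied to a two-page spider trap (a links to itself and to b, b links only to itself). The function name, the fixed iteration count, and the dictionary representation are illustrative choices, and the sketch assumes every page has at least one out-link, so dead ends would need separate handling.

```python
def pagerank(out_links, d=0.85, iterations=200):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pr = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        new_pr = {}
        for page in out_links:
            # Sum the shares sent by every page T that links to this page.
            incoming = sum(pr[t] / len(out_links[t])
                           for t in out_links if page in out_links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Assumed example graph: a spider trap where b keeps all of its importance.
trap = {"a": ["a", "b"], "b": ["b"]}
pr = pagerank(trap)
print(pr)
```

With the random jump, page a keeps a nonzero score (about 0.26) instead of decaying to zero, although b still collects most of the importance.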

21 Anti-Spamming
Spamming:
  – attempts to create artifacts that “please” search engines
  – so that the ranking will be high
  – e.g., commercial “search engine optimization” services
Google anti-spam device:
  – unlike other search engines, it tends to believe what others say about you through links and anchor text
  – recursive importance also works: importance (not just links) propagates
  – still not a perfect solution: suggestions?

22 PageRank and Hub/Authority influence
Connected DB/DM with link analysis
Web, social networks, biological networks, …
  – information networks, graph DBs
Typical problems
  – finding similar nodes (items)
  – community detection / node clustering
  – …

23 Web as a database
Active and challenging research area
Information extraction
  – finding entities and relationships from pages
Information integration
  – integrating data from multiple websites
Easier-to-use query interfaces
  – natural-language queries / question answering

24 What you should know
Web data and query model
Hubs and authorities
PageRank formula and algorithm
Dead ends and spider traps
Teleportation
