Published by Crystal Harris. Modified over 8 years ago.
1
CS 440 Database Management Systems: Web Data Management
2
How is the Web different from a database of documents?
3
Hypertext vs. text: a lot of additional clues
  – graph vs. set
  – anchor text vs. text: what do others say about you?
Geographically distributed vs. centralized
  – so you need to build a crawler
Precision is valued more than recall
  – quality is more important than quantity, especially for “broad” queries
Spamming, hoaxes, and more …
Web scale is super-huge
  – scalability is the key
4
Web data and query
Data model
  – directed graph
  – nodes: Web pages
  – links: hyperlinks
  – all nodes belong to the same type
Query: a set of terms
Answer
  – ranked list of relevant and important pages
  – quantifying a subjective quality
This is the basic data/query model
  – more complex models exist, e.g., assigning types to pages
5
Web search before Google
Web as a set of documents
Relevance: content-based retrieval
  – documents match queries by contents
  – q: 'clinton' ranks pages with more occurrences of 'clinton' higher
Importance???
  – contents: what documents say about themselves
  – much spam and unreliable information in the results
Directory services were used
  – Yahoo! was one of the leaders
  – Google co-founders were told “nobody will use a keyword interface”
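To make the weakness of purely content-based ranking concrete, here is a minimal sketch of ranking by term count. The function name `rank_by_term_count` and the three toy pages are invented for illustration; real engines of that era used more refined scoring, so treat this only as a rough model of the idea on the slide.

```python
from collections import Counter

def rank_by_term_count(pages, query_terms):
    """Rank page names by how often the query terms occur in their text."""
    scores = {}
    for name, text in pages.items():
        counts = Counter(text.lower().split())
        scores[name] = sum(counts[t.lower()] for t in query_terms)
    # Highest count first; pages that never mention a term are dropped.
    ranked = sorted(scores, key=lambda n: scores[n], reverse=True)
    return [n for n in ranked if scores[n] > 0]

# Hypothetical toy pages, invented for illustration only.
pages = {
    "news": "clinton gave a speech today",
    "spam": "clinton clinton clinton buy cheap watches clinton",
    "other": "weather report for tuesday",
}
ranking = rank_by_term_count(pages, ["clinton"])
print(ranking)  # ['spam', 'news']
```

The page that merely repeats the query term wins, which is why a ranking based only on what documents say about themselves is easy to spam.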
6
Google: PageRank
From the Stanford Digital Libraries project, 1996-98
Paper published in 1998: S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
Tried to sell to Infoseek in 1997
Founded in 1998 by Brin and Page
7
Web: Adjacency Matrix
Web: G = {V, E}
  – V = {x, y, z}, |V| = n
  – E = {(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)}
  – A: n x n matrix: A_ij = 1 if page i links to page j, 0 if not
Rows are source nodes, columns are target nodes:
         x  y  z
     x   1  1  1
 A = y   0  0  1
     z   1  1  0
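The construction above can be sketched in a few lines of Python. The variable names (`nodes`, `edges`, `index`) are illustrative choices, and a plain list of lists stands in for the matrix; the edge set is the one given on the slide.

```python
# Nodes and edges exactly as listed on the slide.
nodes = ["x", "y", "z"]
edges = [("x", "x"), ("x", "y"), ("x", "z"),
         ("y", "z"), ("z", "x"), ("z", "y")]

# Map each node name to its row/column position.
index = {name: i for i, name in enumerate(nodes)}

# A[i][j] = 1 if page i links to page j, 0 otherwise.
A = [[0] * len(nodes) for _ in nodes]
for src, dst in edges:
    A[index[src]][index[dst]] = 1

for name, row in zip(nodes, A):
    print(name, row)
```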
8
Transposed Adjacency Matrix
Adjacency matrix A:
  – what does row j represent?
Transpose A^t:
  – what does row j represent?
           x  y  z
       x   1  1  1
 A   = y   0  0  1
       z   1  1  0

           x  y  z
       x   1  0  1
 A^t = y   1  0  1
       z   1  1  0
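One way to answer the two questions is to read the rows back as lists of page names. The sketch below assumes the three-page matrix from the slide; the helper `names` is a hypothetical convenience, not part of the original material. Row j of A lists the pages that j links to, and row j of A^t lists the pages that link to j.

```python
nodes = ["x", "y", "z"]
A = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

# Transpose: rows of A_t are the columns of A.
A_t = [list(col) for col in zip(*A)]

def names(row):
    """Translate a 0/1 row into the node names whose entry is 1."""
    return [nodes[k] for k, bit in enumerate(row) if bit]

# Row j of A: pages that j links to (out-links).
out_links = {nodes[j]: names(A[j]) for j in range(len(nodes))}
# Row j of A_t: pages that link to j (back-links).
in_links = {nodes[j]: names(A_t[j]) for j in range(len(nodes))}

print(out_links)
print(in_links)
```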
9
PageRank: importance of pages
PageRank (or importance) is defined recursively
  – a page P is important if important pages link to it
  – importance of P: contributed proportionally by the pages that link to it
Example (a graph in which x links to x and z, y links to z, and z links to x and y):
  – r_x = 1/2 r_x + 1/2 r_z
  – r_y = 1/2 r_z
  – r_z = 1/2 r_x + 1 r_y
Random-surfer interpretation:
  – surfer randomly follows links to navigate
  – PageRank = the probability that the surfer will visit the page
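The random-surfer reading can be checked with a small simulation. This is a rough sketch: the link structure is the one implied by the example equations (x links to x and z, y links to z, z links to x and y), and the function name `simulate_surfer`, the step count, and the seed are arbitrary choices made for illustration.

```python
import random

# Out-links implied by the example equations.
links = {"x": ["x", "z"], "y": ["z"], "z": ["x", "y"]}

def simulate_surfer(links, steps, seed=0):
    """Follow uniformly random out-links and return visit frequencies."""
    rng = random.Random(seed)
    page = rng.choice(sorted(links))
    visits = {p: 0 for p in links}
    for _ in range(steps):
        page = rng.choice(links[page])
        visits[page] += 1
    return {p: visits[p] / steps for p in links}

freq = simulate_surfer(links, steps=200_000)
print(freq)
```

The frequencies should come out at roughly 0.4 for x, 0.2 for y, and 0.4 for z, the same 2 : 1 : 2 proportions that solve the three equations.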
10
Computing PageRank
Importance-propagation equation: r = M r
        1/2   0   1/2
 r =     0    0   1/2    r
        1/2   1    0
Is M built from the linked-from matrix (A^t) or the links-to matrix (A)?
  – M is column-normalized: column x holds everything that x points to
  – sum of each column = 1
Computation: by relaxation
 iteration:   1     2     3    …   fixpoint
 r_x          1     1    5/4   …   6/5
 r_y          1    1/2   3/4   …   3/5
 r_z          1    3/2    1    …   6/5
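The relaxation can be reproduced with exact fractions, which makes the first rows of the table easy to compare. This is a minimal sketch; the names `M`, `step`, and `history`, and the choice of 50 iterations, are assumptions made for the example.

```python
from fractions import Fraction as F

# Column-normalized matrix from the slide (rows and columns ordered x, y, z).
M = [[F(1, 2), F(0), F(1, 2)],
     [F(0),    F(0), F(1, 2)],
     [F(1, 2), F(1), F(0)]]

def step(M, r):
    """One relaxation step: return the matrix-vector product M r."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]

r = [F(1), F(1), F(1)]      # iteration 1: every page starts with importance 1
history = [r]
for _ in range(50):
    r = step(M, r)
    history.append(r)

print([str(v) for v in history[1]])   # iteration 2: ['1', '1/2', '3/2']
print([str(v) for v in history[2]])   # iteration 3: ['5/4', '3/4', '1']
print([round(float(v), 3) for v in r])
```

After enough steps the vector settles close to (6/5, 3/5, 6/5), and the total importance stays at 3 throughout because every column of M sums to 1.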
11
Problems: Dead Ends
Dead ends:
  – a page without successors has nowhere to send its importance
  – eventually, what would happen to r?
Example (a links to b; b has no out-links):
  – r_a = 0 r_a + 0 r_b
  – r_b = 1 r_a + 0 r_b
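Iterating the two example equations shows what happens to r. The sketch below hard-codes the coefficients from the slide; the function name `iterate_dead_end` and the starting vector (1, 1) are assumptions.

```python
def iterate_dead_end(r_a, r_b, steps):
    """Repeatedly apply r_a = 0 r_a + 0 r_b and r_b = 1 r_a + 0 r_b."""
    trace = [(r_a, r_b)]
    for _ in range(steps):
        # Both right-hand sides use the values from the previous step.
        r_a, r_b = 0 * r_a + 0 * r_b, 1 * r_a + 0 * r_b
        trace.append((r_a, r_b))
    return trace

trace = iterate_dead_end(1, 1, steps=3)
print(trace)  # [(1, 1), (0, 1), (0, 0), (0, 0)]
```

The importance that reaches b is never passed on, so all of it leaks out of the system and r ends up at zero.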
12
Problems: Spider Trap
Spider traps:
  – a group of pages without out-of-group links will trap a spider inside
  – what would happen to r?
Example (a links to a and b; b links only to itself):
  – r_a = 1/2 r_a + 0 r_b
  – r_b = 1/2 r_a + 1 r_b
Solutions??
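The same kind of iteration answers the question for the spider trap. This sketch uses exact fractions and the coefficients from the example; the name `iterate_trap` and the starting vector (1, 1) are assumptions.

```python
from fractions import Fraction as F

def iterate_trap(r_a, r_b, steps):
    """Repeatedly apply r_a = 1/2 r_a and r_b = 1/2 r_a + 1 r_b."""
    trace = [(r_a, r_b)]
    for _ in range(steps):
        # Both right-hand sides use the values from the previous step.
        r_a, r_b = F(1, 2) * r_a, F(1, 2) * r_a + r_b
        trace.append((r_a, r_b))
    return trace

trace = iterate_trap(F(1), F(1), steps=3)
for r_a, r_b in trace:
    print(r_a, r_b)
```

Starting from (1, 1), r_a halves at every step while r_b grows toward 2, so the trap page b ends up absorbing all the importance.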
13
Solutions: surfer’s random jump (teleportation)
Surfer can randomly jump to a new page
  – without following links
  – d: damping factor (set to 0.85 in the paper)
  – the (1-d) term models the probability of randomly jumping to this page
Another interpretation:
  – “tax” the importance of each page and distribute it to all pages
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  – T1, ..., Tn: the pages that link to A; C(T): the number of links going out of T
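To see the formula at work, the sketch below applies it repeatedly to a two-page spider trap in which a links to a and b, and b links only to itself. The function name `pagerank`, the dictionary representation, the starting values, and the iteration count are assumptions made for the example; it is not meant as a production implementation and it does not treat dead ends specially.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # assumed starting value for every page
    for _ in range(iterations):
        new = {}
        for a in pages:
            # Each page T that links to A contributes PR(T) / C(T).
            incoming = sum(pr[t] / len(links[t]) for t in pages if a in links[t])
            new[a] = (1 - d) + d * incoming
        pr = new
    return pr

# Spider trap: a links to a and b; b links only to itself.
trap = {"a": ["a", "b"], "b": ["b"]}
pr = pagerank(trap)
print({p: round(v, 3) for p, v in pr.items()})
```

With d = 0.85 the trap page b still collects most of the importance (about 1.74 of the total 2), but a keeps a nonzero share of about 0.26 instead of decaying to zero.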
14
Anti-Spamming
Spamming:
  – attempt to create artifacts to “please” search engines
  – so that ranking will be high
  – e.g., commercial “search engine optimization” services
Google anti-spam device:
  – unlike other search engines, tends to believe what others say about you through links and anchor texts
  – recursive importance also works: importance (not just links) propagates
  – still, not a perfect solution
15
PageRank influence
A basic building block for modern link analysis algorithms
Web, social networks, biological networks, …
  – information networks, graph DBs
Typical problems
  – finding similar nodes (items)
  – community detection / node clustering
  – keyword search
  – …
16
Web as a database
Active and challenging research area
Information extraction
  – finding entities and relationships from pages
Information integration
  – integrating data from multiple websites
Easier-to-use query interfaces
  – natural-language queries / question answering
17
What you should know
Web data and query model
PageRank formula and algorithm
Dead ends and spider traps
Teleportation