CS 440 Database Management Systems Web Data Management 1.

2 How is the Web different from a database of documents?

3 Hypertext vs. text: many additional clues
– graph vs. set
– anchor text vs. text: what do others say about you?
Geographically distributed vs. centralized
– so you need to build a crawler
Precision is valued more than recall
– quality matters more than quantity, especially for “broad” queries
Spamming, hoaxes, and more …
Web scale is huge
– scalability is the key

4 Web data and query
Data model: directed graph
– nodes: Web pages
– links: hyperlinks
– all nodes belong to the same type
Query: a set of terms
Answer: ranked list of relevant and important pages
– quantifying a subjective quality
This is the basic data/query model
– more complex models exist, e.g., assigning types to pages

5 Web search before Google
Web as a set of documents
Relevance: content-based retrieval
– documents match queries by contents
– q: ‘clinton’ → pages with more occurrences of ‘clinton’ rank higher
Importance???
– contents: what documents say about themselves
– many spam pages and much unreliable information in the results
Directory services were used
– Yahoo! was one of the leaders
– Google co-founders were told “nobody will use a keyword interface”

6 Google: PageRank
From the Stanford Digital Libraries project, 1996-98
Paper published in 1998: S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998)
Tried to sell to Infoseek in 1997
Google founded in 1998 by Brin and Page

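7 Web: Adjacency Matrix
Web: G = (V, E)
– V = {x, y, z}, |V| = n
– E = {(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)}
– A: n × n matrix with $A_{ij} = 1$ if page i links to page j, 0 if not
– rows are source nodes, columns are target nodes (both in the order x, y, z)

$$
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
$$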
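The construction of A from the edge set can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `adjacency_matrix` and the index order x, y, z are assumptions made here.

```python
# Nodes and edges copied from the slide; the index order x, y, z is an assumption.
nodes = ["x", "y", "z"]
edges = [("x", "x"), ("x", "y"), ("x", "z"), ("y", "z"), ("z", "x"), ("z", "y")]


def adjacency_matrix(nodes, edges):
    """Return A as a list of rows, with A[i][j] = 1 if node i links to node j."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for source, target in edges:
        A[index[source]][index[target]] = 1
    return A


A = adjacency_matrix(nodes, edges)
for name, row in zip(nodes, A):
    print(name, row)
```

Each printed row lists the out-links of one source node, which matches the row/column convention stated on the slide.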

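8 Transposed Adjacency Matrix
Adjacency matrix A:
– what does row j represent?
Transpose $A^T$:
– what does row j represent?

$$
A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\qquad
A^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
$$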
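One way to explore the two questions on this slide is to read the rows of both matrices back as lists of page names. The sketch below assumes the node order x, y, z; the helper names `transpose` and `row_as_names` are illustrative only.

```python
nodes = ["x", "y", "z"]

# Adjacency matrix from the slide: A[i][j] = 1 if page i links to page j.
A = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]


def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*M)]


def row_as_names(M, j):
    """Names of the pages whose entry in row j is 1."""
    return [nodes[k] for k, value in enumerate(M[j]) if value]


A_t = transpose(A)

# Row j of A lists the pages that j links to (its out-links);
# row j of the transpose lists the pages that link to j (its in-links).
out_links = {nodes[j]: row_as_names(A, j) for j in range(len(nodes))}
in_links = {nodes[j]: row_as_names(A_t, j) for j in range(len(nodes))}
print("out-links:", out_links)
print("in-links: ", in_links)
```

The in-link view is the one PageRank needs, because a page's importance is collected from the pages that point to it.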

9 PageRank: importance of pages
PageRank (or importance) is defined recursively
– a page P is important if important pages link to it
– importance of P: contributed proportionally by the pages that link to it (its back-links)
Example:
– $r_x = \tfrac{1}{2} r_x + \tfrac{1}{2} r_z$
– $r_y = \tfrac{1}{2} r_z$
– $r_z = \tfrac{1}{2} r_x + 1 \cdot r_y$
Random-surfer interpretation:
– the surfer navigates by following links at random
– PageRank = the probability that the surfer will visit the page
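The random-surfer interpretation can be checked with a small simulation. This is a sketch under stated assumptions: the link structure below (x links to x and z, y links to z, z links to x and y) is inferred from the coefficients of the example equations because the slide's figure is not available, and the function name, step count, and seed are arbitrary choices.

```python
import random

# Link structure inferred from the example equations (an assumption):
# x -> x, z;  y -> z;  z -> x, y.
links = {"x": ["x", "z"], "y": ["z"], "z": ["x", "y"]}


def simulate_surfer(links, steps, seed=0):
    """Follow random out-links for `steps` clicks and return visit frequencies."""
    rng = random.Random(seed)
    page = rng.choice(sorted(links))
    visits = {name: 0 for name in links}
    for _ in range(steps):
        page = rng.choice(links[page])
        visits[page] += 1
    return {name: count / steps for name, count in visits.items()}


frequencies = simulate_surfer(links, steps=200_000)
print(frequencies)
```

The visit frequencies should come out close to 0.4, 0.2, and 0.4 for x, y, and z, which is the solution of the three example equations scaled so that the values sum to 1.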

10 Computing PageRank
Importance-propagation equation:

$$
r = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix} r
$$

Is this the linked-from matrix ($A^T$) or the links-to matrix (A)?
– column-normalized: column x holds everything that x points to
– the sum of each column = 1
Computation: by relaxation

iteration   1     2     3     …    fixpoint
r_x         1     1     5/4   …    6/5
r_y         1     1/2   3/4   …    3/5
r_z         1     3/2   1     …    6/5
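The relaxation on this slide can be reproduced with exact fractions. This is a minimal sketch: the matrix and the all-ones starting vector come from the slide, while the function names `propagate` and `relax` and the choice of 100 steps are assumptions made for illustration.

```python
from fractions import Fraction as F

# Column-normalized matrix from the slide, rows and columns in the order x, y, z.
M = [
    [F(1, 2), F(0), F(1, 2)],
    [F(0), F(0), F(1, 2)],
    [F(1, 2), F(1), F(0)],
]


def propagate(M, r):
    """One relaxation step: return the matrix-vector product M r."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]


def relax(M, r, steps):
    """Apply `steps` relaxation steps and return every intermediate vector."""
    history = [r]
    for _ in range(steps):
        r = propagate(M, r)
        history.append(r)
    return history


history = relax(M, [F(1), F(1), F(1)], steps=100)
for k in range(3):
    print("iteration", k + 1, [str(v) for v in history[k]])
print("after 100 steps:", [round(float(v), 6) for v in history[-1]])
```

The first three vectors match the first three columns of the iteration table, and the later vectors approach the fixpoint (6/5, 3/5, 6/5). Because every column sums to 1, the total importance stays at 3 throughout.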

11 Problems: Dead Ends
Dead ends:
– a page without successors has nowhere to send its importance
– eventually, what would happen to r?
Example (a links to b; b has no out-links):
– $r_a = 0 \cdot r_a + 0 \cdot r_b$
– $r_b = 1 \cdot r_a + 0 \cdot r_b$
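Iterating the two example equations shows what happens to r. The sketch below assumes a starting vector of all ones, which the slide does not specify, and uses illustrative names (`M_dead`, `propagate`, `trace`).

```python
from fractions import Fraction as F

# Coefficients of the dead-end example: column a sends everything to b,
# column b is all zeros because b has no successors.
M_dead = [
    [F(0), F(0)],
    [F(1), F(0)],
]


def propagate(M, r):
    """One propagation step: return the matrix-vector product M r."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]


r = [F(1), F(1)]  # assumed starting importance for (a, b)
trace = [r]
for _ in range(3):
    r = propagate(M_dead, r)
    trace.append(r)

for step, vector in enumerate(trace):
    print(step, [str(v) for v in vector])
```

Under these assumptions the importance leaks out of the graph: after two steps both entries are 0, so the ranking carries no information.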

12 Problems: Spider Traps
Spider traps:
– a group of pages with no links leaving the group will trap a spider inside
– what would happen to r?
Example (a links to a and b; b links only to itself):
– $r_a = \tfrac{1}{2} r_a + 0 \cdot r_b$
– $r_b = \tfrac{1}{2} r_a + 1 \cdot r_b$
Solutions??
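The same kind of iteration answers the question for the spider-trap example. As before, the all-ones starting vector and the names `M_trap`, `propagate`, and `trace` are assumptions for illustration.

```python
from fractions import Fraction as F

# Coefficients of the spider-trap example: a splits its importance between
# a and b, while b keeps all of its own importance.
M_trap = [
    [F(1, 2), F(0)],
    [F(1, 2), F(1)],
]


def propagate(M, r):
    """One propagation step: return the matrix-vector product M r."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]


r = [F(1), F(1)]  # assumed starting importance for (a, b)
trace = [r]
for _ in range(10):
    r = propagate(M_trap, r)
    trace.append(r)

for step in (0, 1, 2, 10):
    print(step, [str(v) for v in trace[step]])
```

Here the total importance is preserved, but it drains into the trap: the value for a halves at every step while b approaches the full total of 2.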

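13 Solutions: surfer’s random jump
The surfer can randomly jump to a new page without following links
– d: damping factor (set to 0.85 in the paper)
– with probability d the surfer follows a link; 1 − d models the probability of randomly jumping to a page
Another interpretation:
– “tax” the importance of each page and distribute it to all pages
Teleportation:

$$
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)
$$

where $T_1, \dots, T_n$ are the pages that link to A and $C(T_i)$ is the number of links going out of $T_i$.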
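The teleportation formula can be applied by repeated substitution. The following is a minimal sketch rather than a production implementation: the function name `pagerank`, the dictionary representation of links, the starting value of 1.0 per page, and the fixed iteration count are all assumptions. The example graph is the two-page spider trap (a links to a and b, b links only to itself).

```python
def pagerank(links, d=0.85, iterations=200):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to; every target
    must also appear as a key. C(T) is len(links[T]).
    """
    pr = {page: 1.0 for page in links}  # assumed starting value
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in links}
        for source, targets in links.items():
            for target in targets:
                new_pr[target] += d * pr[source] / len(targets)
        pr = new_pr
    return pr


# Spider-trap example: a -> a, b;  b -> b.
trap = {"a": ["a", "b"], "b": ["b"]}
ranks = pagerank(trap)
print(ranks)
```

With d = 0.85 the trap page b still ranks higher, but a keeps a nonzero score (about 0.26 of the total of 2) instead of draining to 0, which is the effect of the random jump.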

14 Anti-Spamming
Spamming:
– attempts to create artifacts that “please” search engines so that a page ranks high
– e.g., commercial “search engine optimization” services
Google anti-spam device:
– unlike other search engines, it tends to believe what others say about you through links and anchor text
– recursive importance also helps: importance (not just links) propagates
– still, not a perfect solution

15 PageRank influence
A basic building block for modern link analysis algorithms
Web, social networks, biological networks, …
– information networks, graph databases
Typical problems
– finding similar nodes (items)
– community detection / node clustering
– keyword search
– …

16 Web as a database
Active and challenging research area
Information extraction
– finding entities and relationships in pages
Information integration
– integrating data from multiple websites
Easier-to-use query interfaces
– natural-language queries / question answering

17 What you should know
Web data and query model
PageRank formula and algorithm
Dead ends and spider traps
Teleportation
