CS 440 Database Management Systems Web Data Management 1.


2 How is the Web different from a database of documents?

3 Hypertext vs. text: a lot of additional clues
– graph vs. set
– anchor text vs. text: what do others say about you?
Geographically distributed vs. centralized
– so you need to build a crawler
Precision is valued more than recall
– quality is more important than quantity, especially for “broad” queries
Spamming, hoaxes, and more …
Web scale is super-huge
– scalability is the key

4 Web data and query
Data model
– directed graph
– nodes: Web pages
– links: hyperlinks
– all nodes belong to the same type
Query is a set of terms
Answer
– ranked list of relevant and important pages
– quantifying a subjective quality
Basic data/query model
– more complex models, e.g., assigning types to pages

5 Web search before Google
Web as a set of documents
Relevance: content-based retrieval
– documents match queries by contents
– q: ‘clinton’ → pages with more occurrences of ‘clinton’ rank higher
Importance???
– contents: what documents say about themselves
– much spam and unreliable information in the results
Directory services were used
– Yahoo! was one of the leaders
– Google co-founders were told “nobody will use a keyword interface”
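A minimal sketch of the content-based ranking described above, where a page scores higher simply by containing the query term more often. The two sample documents and the function names `content_score` and `rank` are hypothetical, made up for illustration; the point is only that a page can inflate its own score by repeating a term.

```python
def content_score(document, query_terms):
    """Count how many times the query terms occur in the document text."""
    words = document.lower().split()
    return sum(words.count(term.lower()) for term in query_terms)


def rank(documents, query_terms):
    """Return document names sorted by descending content score."""
    return sorted(
        documents,
        key=lambda name: content_score(documents[name], query_terms),
        reverse=True,
    )


# Hypothetical documents: the second one just repeats the query term.
docs = {
    "news": "clinton visits stanford",
    "spam": "clinton clinton clinton buy now",
}
ranking = rank(docs, ["clinton"])
print(ranking)
```

Because the score depends only on what a page says about itself, the repetitive page wins, which is the weakness that link-based importance is meant to address.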

6 Google: PageRank
From the Stanford Digital Libraries project
Published the paper in 1997: S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): (1998)
Tried to sell to Infoseek in 1997
Founded in 1998 by Brin and Page

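7 Web: Adjacency Matrix
Web: G = {V, E}
– V = {x, y, z}, |V| = n
– E = {(x, x), (x, y), (x, z), (y, z), (z, x), (z, y)}
– A: n x n matrix: A_ij = 1 if page i links to page j, 0 if not
– rows are source nodes, columns are target nodes (both ordered x, y, z)

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

[Figure: directed graph on x, y, z]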
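As a sketch of the definition above, the adjacency matrix can be built directly from the edge set E. The helper name `adjacency_matrix` and the index order x, y, z are assumptions made for this illustration.

```python
# Pages and edges as listed on the slide; the index order x, y, z is an assumption.
pages = ["x", "y", "z"]
edges = [("x", "x"), ("x", "y"), ("x", "z"), ("y", "z"), ("z", "x"), ("z", "y")]


def adjacency_matrix(pages, edges):
    """Return A with A[i][j] = 1 if page i links to page j, else 0."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    A = [[0] * n for _ in range(n)]
    for source, target in edges:
        A[index[source]][index[target]] = 1
    return A


A = adjacency_matrix(pages, edges)
for page, row in zip(pages, A):
    print(page, row)  # each row lists the targets of one source page
```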

8 Transposed Adjacency Matrix
Adjacency matrix A:
– what does row j represent?
Transpose A^T:
– what does row j represent?
[Figure: graph on x, y, z with the matrices A and A^T]
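One way to answer the two questions is to read the rows off in code. This sketch assumes the matrix A that follows from the edge set E on the previous slide (rows and columns ordered x, y, z); `transpose` and `linked_pages` are hypothetical helper names.

```python
pages = ["x", "y", "z"]

# A[i][j] = 1 if page i links to page j (derived from the edge set E; an assumption here).
A = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]


def transpose(M):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(column) for column in zip(*M)]


def linked_pages(matrix, j):
    """Names of the pages marked with 1 in row j of the matrix."""
    return [pages[k] for k, value in enumerate(matrix[j]) if value]


At = transpose(A)
out_links_of_y = linked_pages(A, 1)   # row of A: pages that y links to
in_links_of_y = linked_pages(At, 1)   # row of A^T: pages that link to y
print(out_links_of_y, in_links_of_y)
```

Row j of A lists the forward links of page j, while row j of A^T lists its back-links, which is the direction importance flows in PageRank.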

9 PageRank: importance of pages
PageRank (or importance), defined recursively
– a page P is important if important pages link to it
– importance of P: contributed proportionally by the pages that link to it (its back-links)
Example:
– r_x = 1/2 r_x + 1/2 r_z
– r_y = 1/2 r_z
– r_z = 1/2 r_x + 1 r_y
Random-surfer interpretation:
– surfer randomly follows links to navigate
– PageRank = the probability that the surfer will visit the page
[Figure: graph on x, y, z]
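As a worked check, the three example equations can be solved by hand. The equations alone fix only the ratios between the ranks, so the normalization r_x + r_y + r_z = 3 (one unit of importance per page) is an assumption made here to pin down the scale.

```latex
\begin{align*}
r_x &= \tfrac12 r_x + \tfrac12 r_z &&\Rightarrow\; r_x = r_z \\
r_y &= \tfrac12 r_z \\
r_z &= \tfrac12 r_x + r_y = \tfrac12 r_z + \tfrac12 r_z = r_z &&\text{(consistent, no new constraint)} \\
r_x + r_y + r_z &= \tfrac52 r_z = 3 &&\Rightarrow\; r_z = \tfrac65,\quad r_x = \tfrac65,\quad r_y = \tfrac35
\end{align*}
```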

10 Computing PageRank
Importance-propagation equation, with r = (r_x, r_y, r_z):

$$r = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix} r$$

– linked-from (A^T) or links-to matrix (A)?
– column-normalized: column x is all that x points to; sum of each column = 1
Computation: by relaxation

iteration   1    2     3     …   fixpoint
r_x         1    1     5/4   …   6/5
r_y         1    1/2   3/4   …   3/5
r_z         1    3/2   1     …   6/5
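The relaxation can be sketched as repeated matrix-vector multiplication starting from r = (1, 1, 1). The matrix entries are the ones on the slide; the function names `step` and `relax` and the choice of 100 iterations are assumptions for illustration. Exact fractions are used so the first iterates can be compared with the table digit for digit.

```python
from fractions import Fraction as F

# Column-normalized matrix from the slide: column j holds the shares page j sends out.
M = [
    [F(1, 2), F(0), F(1, 2)],
    [F(0),    F(0), F(1, 2)],
    [F(1, 2), F(1), F(0)],
]


def step(M, r):
    """One relaxation step: r_new = M r."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]


def relax(M, r, iterations):
    """Apply `step` repeatedly and return every iterate, starting with r itself."""
    history = [r]
    for _ in range(iterations):
        r = step(M, r)
        history.append(r)
    return history


history = relax(M, [F(1), F(1), F(1)], 100)
for k in (0, 1, 2):
    print(k + 1, [str(value) for value in history[k]])
print("approx. fixpoint", [float(value) for value in history[-1]])
```

Because every column sums to 1, the total importance stays at 3 in every iteration; only its distribution over the pages changes.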

11 Problems: Dead Ends
Dead ends:
– a page without successors has nowhere to send its importance
– eventually, what would happen to r?
Example:
– r_a = 0 r_a + 0 r_b
– r_b = 1 r_a + 0 r_b
[Figure: graph with nodes x, y, z, a, b]
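A small sketch of what happens to r in the two-page example, assuming the same relaxation as before and a start vector of (1, 1). The names `step` and `M_dead_end` are hypothetical.

```python
def step(M, r):
    """One relaxation step: r_new = M r."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]


# Matrix for the example equations: a sends everything to b, b sends nothing anywhere.
M_dead_end = [
    [0, 0],
    [1, 0],
]

r = [1, 1]
totals = [sum(r)]
for _ in range(3):
    r = step(M_dead_end, r)
    totals.append(sum(r))

print(r, totals)
```

Column b sums to 0 instead of 1, so importance leaks out at every step until the whole vector is zero.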

12 Problems: Spider Trap
Spider traps:
– a group of pages without out-of-group links will trap a spider inside
– what would happen to r?
Example:
– r_a = 1/2 r_a + 0 r_b
– r_b = 1/2 r_a + 1 r_b
Solutions??
[Figure: graph with nodes x, y, z, a, b]
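Iterating the two example equations directly shows where the importance ends up. This is a sketch assuming a start of r_a = r_b = 1 and ten iterations, both chosen only for illustration.

```python
from fractions import Fraction

r_a, r_b = Fraction(1), Fraction(1)
for _ in range(10):
    # Simultaneous update of both example equations:
    # r_a = 1/2 r_a + 0 r_b   and   r_b = 1/2 r_a + 1 r_b
    r_a, r_b = r_a / 2, r_a / 2 + r_b

print(r_a, r_b, r_a + r_b)
```

The total stays at 2, but r_a halves at every step, so in the limit page b (the trap) holds all of the importance.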

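13 Solutions: surfer’s random jump
Surfer can randomly jump to a new page
– without following links
– d: damping factor (set to 0.85 in the paper); the surfer follows a link with probability d and jumps to a random page with probability 1 - d
Another interpretation:
– “tax” the importance of each page and distribute it to all pages
Teleportation:

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$

– T_1, …, T_n are the pages that link to A; C(T) is the number of links going out of T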
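A minimal sketch of the teleportation formula, applied to the spider-trap example (a links to a and b, b links only to itself). The function name `pagerank`, the dictionary representation of the graph, the start value of 1 per page, and the fixed iteration count are assumptions for illustration, not the method used in any production system.

```python
def pagerank(links, d=0.85, iterations=200):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A.

    `links` maps each page to the list of pages it links to.
    """
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Sum the shares sent by every page T that links to `page`.
            incoming = sum(
                pr[t] / len(links[t]) for t in links if page in links[t]
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Spider-trap example: a links to itself and to b, b links only to itself.
trap = {"a": ["a", "b"], "b": ["b"]}
ranks = pagerank(trap)
print(ranks)
```

With d = 0.85 the trap page b still collects most of the importance, but a keeps a nonzero share, because every page receives the constant (1 - d) term regardless of its links.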

14 Anti-Spamming
Spamming:
– attempt to create artifacts to “please” search engines
– so that ranking will be high
– e.g., commercial “search engine optimization” services
Google anti-spam device:
– unlike other search engines, tends to believe what others say about you through links and anchor texts
– recursive importance also helps: importance (not just links) propagates
– still, not a perfect solution

15 PageRank influence
A basic block for modern link analysis algorithms
Web, social networks, biological networks, …
– information network, graph DB
Typical problems
– finding similar nodes (items)
– community detection / node clustering
– keyword search
– …

16 Web as a database
Active and challenging research area
Information extraction
– finding entities and relationships from pages
Information integration
– integrating data from multiple websites
Easier to use query interfaces
– natural-language queries / question answering

17 What you should know
Web data and query model
PageRank formula and algorithm
Dead ends and spider traps
Teleportation