
(A taste of) Data Management Over the Web

Web R&D The web has revolutionized our world
– Relevant research areas include databases, networks, security…
– Also data structures and architecture, complexity, image processing, natural language processing, user interface design…
Lots of research in each of these directions
– Specialized conferences for web research
– Lots of companies
This course will focus on Web Data

Web Data The web has revolutionized our world
Data is everywhere
– Web pages, images, movies, social data, likes and dislikes…
Constitutes a great potential
But also a lot of challenges
– Web data is huge, unstructured, and dirty…
Just the ingredients of a fun research topic!

Ingredients
– Representation & Storage: standards (HTML, HTTP), compact representations, security…
– Search and Retrieval: crawling, inferring information from text…
– Ranking: what's important and what's not; Google PageRank, Top-K algorithms, recommendations…

Challenges
– Huge: over 14 billion pages indexed by Google
– Unstructured: but we do have some structure, such as HTML links, or friendships in social networks…
– Dirty: a lot of the data is incorrect, inconsistent, contradictory, or just irrelevant…

Course Goal Introducing a selection of fun topics in web data management Allowing you to understand some state-of-the-art notions, algorithms, and techniques As well as the main challenges and how we approach them

Course outline
– Ranking: HITS and PageRank
– Data representation: XML, HTML
– Crawling
– Information Retrieval and Extraction, Wikipedia example
– Aggregating ranks and Top-K algorithms
– Recommendations: Collaborative Filtering for recommending movies in Netflix
– Other topics (time permitting): Deep Web, advertisements…
The course is partly based on: Web Data Management and Distribution, Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart; and on a course by Pierre Senellart (and others) at Télécom ParisTech

Course requirement A small final project It will involve understanding of 2 or 3 of the subjects studied, and some implementation It will be given next Monday

Ranking

Why Ranking? Huge number of pages Huge even if we filter according to relevance – Keep only pages that include the keywords A lot of the pages are not informative – And anyway it is impossible for users to go through 10K results

How to rank? Observation: links are very informative! Instead of a collection of Web Pages, we have a Web Graph!! This is important for discovering new sites (see crawling), but also for estimating the importance of a site CNN.com has more links to it than my homepage…

Authority and Hubness Authority: a site is very authoritative if it receives many citations; citations from important sites weigh more than citations from less-important sites A(v) = the authority of v Hubness measures the importance of a site as a directory: a good hub is a site that links to many authoritative sites H(v) = the hubness of v

HITS (Kleinberg ’99) Recursive dependency: a(v) = Σ_{u→v} h(u) (sum over the in-links of v) h(v) = Σ_{v→u} a(u) (sum over the out-links of v) Normalize according to the sum of authority / hubness values We can show that a(v) and h(v) converge
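The recursion above can be sketched on a small graph. The four pages and their links here are an invented example, not taken from the slides:

```python
# Minimal HITS sketch on a toy directed graph (hypothetical example).
graph = {          # node -> list of nodes it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
nodes = list(graph)
auth = {v: 1.0 for v in nodes}
hub = {v: 1.0 for v in nodes}

for _ in range(50):
    # a(v) = sum of h(u) over pages u that link to v
    auth = {v: sum(hub[u] for u in nodes if v in graph[u]) for v in nodes}
    # h(v) = sum of a(u) over pages u that v links to
    hub = {v: sum(auth[u] for u in graph[v]) for v in nodes}
    # normalize so the values stay bounded
    sa, sh = sum(auth.values()), sum(hub.values())
    auth = {v: x / sa for v, x in auth.items()}
    hub = {v: x / sh for v, x in hub.items()}
```

Page "c", cited by three pages, ends up with the highest authority, while "a", which links to two good authorities, ends up as the best hub.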

Random Surfer Model Consider a “random surfer” At each point he chooses an out-link uniformly at random and clicks on it P(W) = P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn)) where W1,…,Wn are the pages linking to W, and O(Wi) is the number of out-edges of Wi

Recursive definition PageRank reflects the probability of being at a web page (PR(W) = P(W)) Then: PR(W) = PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn)) How to solve?

EigenVector! PR (as a row vector) is the left eigenvector of the stochastic transition matrix – I.e. the adjacency matrix normalized so that every row sums to 1 The Perron–Frobenius theorem ensures that such a vector exists Unique if the matrix is irreducible – Can be guaranteed by small perturbations
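As a concrete toy illustration, the rank vector can be read off an eigendecomposition. The three-page graph below is invented, and the matrix is written in the column-stochastic convention (columns sum to 1), in which PR is the right eigenvector for eigenvalue 1 – the transpose of the row formulation:

```python
import numpy as np

# Toy 3-page web (hypothetical): page 0 links to 1 and 2,
# page 1 links to 2, page 2 links to 0.
# Column j = out-links of page j, normalized to sum to 1.
A = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

# PageRank = eigenvector of A for eigenvalue 1 (Perron-Frobenius).
vals, vecs = np.linalg.eig(A)
v = np.real(vecs[:, np.argmax(np.real(vals))])
pr = v / v.sum()   # normalize to a probability distribution
```

For this graph the solution is PR ≈ (0.4, 0.2, 0.4): pages 0 and 2 sit on the main cycle and share the top rank.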

Problems A random surfer may get stuck in one component of the graph May get stuck in loops “Rank Sink” Problem – Many Web pages have no outlinks

Damping Factor Add some probability d of “jumping” to a uniformly random page Now P(W) = (1−d)·[P(W1)·(1/O(W1)) + … + P(Wn)·(1/O(Wn))] + d·(1/N) where N is the number of pages in the index

How to compute PR? Analytical methods – Can we solve the equations? – In principle yes, but the matrix is huge! – Not a realistic solution for web scale Approximations

A random surfer algorithm Start from an arbitrary page Toss a coin to decide whether to follow a link or to jump to a random page Then toss another coin to decide which link to follow / which page to jump to Keep a record of the frequency of the web pages visited The frequency of each page converges to its PageRank
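The coin-tossing procedure above can be sketched as a short Monte Carlo simulation. The 3-page graph is an invented example, and d plays the role of the jump probability from the damping-factor slide:

```python
import random

# Monte Carlo sketch of the random-surfer algorithm on a toy 3-page graph
# (hypothetical example); d is the jump probability (damping factor).
graph = {0: [1, 2], 1: [2], 2: [0]}
d = 0.15
steps = 200_000

random.seed(0)                      # reproducible run
counts = {v: 0 for v in graph}
page = 0                            # start from an arbitrary page
for _ in range(steps):
    counts[page] += 1
    if random.random() < d:
        page = random.choice(list(graph))    # jump to a random page
    else:
        page = random.choice(graph[page])    # follow a random out-link

freq = {v: c / steps for v, c in counts.items()}
```

After enough steps, the visit frequencies approximate the (damped) PageRank values of the three pages.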

Power method Start with some arbitrary rank row vector R_0 Compute R_i = R_{i−1} · A If we ever reach the eigenvector, we stay there Theorem: the process converges to the eigenvector! Convergence is in practice pretty fast (~100 iterations)
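The same toy, column-stochastic matrix as before can illustrate the power method, here with the damping term folded into each update. Note that with a column-stochastic matrix the update is A·r rather than the slide's row form r·A; the graph is an invented example:

```python
import numpy as np

# Power-method sketch for damped PageRank on a toy column-stochastic
# matrix (hypothetical 3-page graph: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0).
A = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])
d = 0.15                      # jump probability (damping factor)
N = A.shape[0]
r = np.full(N, 1.0 / N)       # arbitrary starting rank vector
for _ in range(100):          # ~100 iterations suffice in practice
    r = (1 - d) * (A @ r) + d / N
```

Each update keeps r a probability distribution, so no renormalization step is needed along the way.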

Other issues Accelerating Computation Distributed PageRank Mixed Model (Incorporating "static" importance) Personalized PageRank

XML

HTML (HyperText Markup Language) Used for presentation Standardized by the W3C (1999) Describes the structure and content of a (web) document HTML is an open format – Can be processed by a variety of tools

HTTP Application protocol
Client request:
GET /MarkUp/ HTTP/1.1
Host:
Server response:
HTTP/1.1 200 OK
Two main HTTP methods: GET and POST

GET
URL:
Corresponding HTTP GET request:
GET /search?q=BGU HTTP/1.1
Host:
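The request above can be assembled by hand as a raw string. The host below is a placeholder, since the slide's actual URL and Host value were elided:

```python
# Building the slide's GET request by hand; "www.example.com" is a
# placeholder host (the real one is not given in the slide).
host = "www.example.com"
path = "/search?q=BGU"
request = (
    f"GET {path} HTTP/1.1\r\n"   # request line: method, path, version
    f"Host: {host}\r\n"          # Host header is mandatory in HTTP/1.1
    "Connection: close\r\n"
    "\r\n"                       # empty line terminates the headers
)
```

Sending these bytes over a TCP connection to port 80 of the host is all an HTTP/1.1 GET amounts to.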

POST Used for submitting forms
POST /php/test.php HTTP/1.1
Host:
Content-Type: application/x-www-form-urlencoded
Content-Length: 100
…

Status codes HTTP response always starts with a status code followed by a human-readable message (e.g., 200 OK) The first digit indicates the class of the response:
– 1 Information
– 2 Success
– 3 Redirection
– 4 Client-side error
– 5 Server-side error
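The class table translates directly into a lookup on the first digit of the code:

```python
# Map a status code to its class via its first digit (the slide's table).
CLASSES = {
    1: "Information",
    2: "Success",
    3: "Redirection",
    4: "Client-side error",
    5: "Server-side error",
}

def status_class(code: int) -> str:
    return CLASSES[code // 100]   # integer division keeps the first digit
```

For example, `status_class(200)` returns `"Success"` and `status_class(404)` returns `"Client-side error"`.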

Authentication HTTPS is a variant of HTTP that adds encryption, cryptographic authentication, session tracking, etc. It should be used instead of plain HTTP to transmit sensitive data HTTP also supports basic authentication:
GET … HTTP/1.1
Authorization: Basic dG90bzp0aXRp

Cookies Key/value pairs, that a server asks a client to store and retransmit with each HTTP request (for a given domain name). Can be used to keep information on users between visits Often what is stored is a session ID – Connected, on the server side, to all session information

Crawling

Basics of Crawling Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web
Basic crawling algorithm:
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (next slide)
4. Repeat on each found URL
Problem: the web is huge!
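The four steps can be sketched as a queue-based loop. The toy web and the `fetch_and_extract_urls` helper are invented stand-ins for real HTTP fetching and link extraction:

```python
from collections import deque

# Hypothetical toy web: URL -> list of URLs it links to.
TOY_WEB = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://c.example"],
    "http://c.example": ["http://a.example"],
}

def fetch_and_extract_urls(url):
    # Stand-in for "retrieve and process the page": just read its links.
    return TOY_WEB.get(url, [])

def crawl(seed_urls, limit=100):
    frontier = deque(seed_urls)          # 1. start from seed URLs
    seen = set(seed_urls)
    order = []
    while frontier and len(order) < limit:
        url = frontier.popleft()
        order.append(url)                # 2. retrieve and process the page
        for new_url in fetch_and_extract_urls(url):  # 3. discover new URLs
            if new_url not in seen:      # the web is huge: visit each URL once
                seen.add(new_url)
                frontier.append(new_url) # 4. repeat on each found URL
    return order
```

Using a FIFO queue gives breadth-first order; swapping in a stack would make the same loop depth-first.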

Discovering new URLs Browse the “internet graph” (following, e.g., hyperlinks) Referrer URLs Site maps (sitemap.org)

The internet graph At least billions of nodes = pages At least 140 billion edges = links

Graph-browsing algorithms Depth-first Breadth-first Combinations…

Duplicates Identifying duplicates or near-duplicates on the Web to prevent multiple indexing Trivial duplicates: same resource at the same canonized URL Exact duplicates: identification by hashing Near-duplicates (timestamps, tip of the day, etc.): more complex!

Near-duplicate detection Edit distance – A good measure of similarity – Does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!) Shingles: two documents are similar if they mostly share the same succession of k-grams
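A word-level shingling sketch, comparing shingle sets with Jaccard similarity; the two sample sentences are invented:

```python
# Shingle-based near-duplicate detection (sketch).
def shingles(text, k=3):
    # Set of word-level k-grams of the text.
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1, s2):
    # Jaccard similarity of two shingle sets.
    return len(s1 & s2) / len(s1 | s2)

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox leaps over the lazy dog"
sim = jaccard(shingles(d1), shingles(d2))
```

Here a single changed word breaks three of each document's seven 3-shingles, giving a similarity of 0.4; identical documents score 1.0. Unlike edit distance, shingle sets can be hashed and compared at collection scale.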

Crawling ethics robots.txt at the root of a Web server:
User-agent: *
Allow: /searchhistory/
Disallow: /search
Per-page exclusion via a robots meta tag (de facto standard) Per-link exclusion via rel="nofollow" (de facto standard), e.g., <a href="…" rel="nofollow">Toto</a> Avoid Denial of Service (DoS): wait 100ms/1s between two successive requests to the same Web server
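Python's standard library can evaluate exactly the robots.txt rules shown above; the crawler name and the host in the URLs are placeholders:

```python
import urllib.robotparser

# Check the slide's robots.txt rules; host and crawler name are placeholders.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
])

ok_history = rp.can_fetch("MyCrawler", "http://example.com/searchhistory/")
ok_search = rp.can_fetch("MyCrawler", "http://example.com/search?q=x")
```

A well-behaved crawler runs this check before each fetch: `/searchhistory/` is allowed, while `/search` is off-limits.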