Web search engines Paolo Ferragina Dipartimento di Informatica Università di Pisa

Two main difficulties
The Web:
- Size: more than tens of billions of pages
- Language and encodings: hundreds…
- Distributed authorship: spam, format-less pages, …
- Dynamic: in one year, 35% of pages survive and 20% remain untouched
The User:
- Query composition: short (2.5 terms on average) and imprecise
- Query results: 85% of users look at just one result page
- Several needs: informational, navigational, transactional
Extracting “significant data” is difficult!! Matching “user needs” is difficult!!

Evolution of Search Engines
First generation -- use only on-page, web-text data
- Word frequency and language
Second generation -- use off-page, web-graph data
- Link (or connectivity) analysis
- Anchor text (how people refer to a page)
Third generation -- answer “the need behind the query”
- Focus on “user need”, rather than on the query
- Integrate multiple data sources
- Click-through data
Fourth generation -- Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]
(Timeline annotations on the slide: AltaVista, Excite, Lycos, etc.; 1998: Google; Google, Yahoo, MSN, ASK, …)

This is a search engine!!!

The web-graph: properties Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 19.1 and 19.2

The Web’s Characteristics
Size:
- 1 trillion pages available (Google, 7/2008)
- 50 billion static pages
- 5-40 KB per page => terabytes & terabytes
- Size grows every day!!
Change:
- 8% new pages, 25% new links per week
- Life time of a page is about 10 days

The Bow Tie

Some definitions
- Weakly connected component (WCC): a set of nodes such that from any node you can reach any other node via an undirected path.
- Strongly connected component (SCC): a set of nodes such that from any node you can reach any other node via a directed path.

Find the CORE (the largest SCC)
Iterate the following process:
- Pick a random vertex v
- Compute O(v), the set of all nodes reachable from v
- Compute I(v), the set of all nodes that reach v
- Compute SCC(v) := I(v) ∩ O(v)
- Check whether it is the largest SCC found so far
If the CORE contains about ¼ of the vertices, then after 20 iterations the probability of never having picked a vertex of the CORE is (3/4)^20 ≈ 0.3% < 1%.
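A minimal Python sketch of this randomized procedure, assuming the graph is given as an in-memory adjacency dictionary (the dictionary "graph" at the bottom is an invented toy example): O(v) is a BFS on the graph, I(v) is a BFS on the transposed graph, and their intersection is the SCC containing v.

from collections import deque
import random

def bfs_reach(adj, start):
    # Return the set of nodes reachable from start via directed edges in adj.
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def find_core(adj, iterations=20, seed=0):
    # Randomized search for the largest SCC (the CORE).
    rnd = random.Random(seed)
    nodes = list(adj)
    # Transposed graph: needed to compute I(v), the nodes that reach v.
    radj = {u: [] for u in adj}
    for u, succs in adj.items():
        for v in succs:
            radj.setdefault(v, []).append(u)
    best = set()
    for _ in range(iterations):
        v = rnd.choice(nodes)
        O = bfs_reach(adj, v)    # nodes reachable from v
        I = bfs_reach(radj, v)   # nodes that reach v
        scc = O & I              # the SCC containing v
        if len(scc) > len(best):
            best = scc
    return best

# Toy graph: nodes 0-3 form a cycle, node 4 dangles off it.
graph = {0: [1], 1: [2], 2: [3], 3: [0, 4], 4: []}
print(find_core(graph))   # -> {0, 1, 2, 3}

Since each draw lands in the CORE with probability roughly 1/4, the chance that 20 independent draws all miss it is (3/4)^20, i.e. about 0.3%.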

Compute SCCs
Classical algorithm (Kosaraju):
1) DFS(G)
2) Transpose G into G^T
3) DFS(G^T), visiting the vertices in decreasing order of the time at which their visit ended in step 1
4) Every resulting DFS tree is an SCC
DFS is hard to compute on disk: no locality.

DFS (classical approach)

main() {
  foreach vertex v do color[v] = WHITE endFor
  foreach vertex v do
    if (color[v] == WHITE) DFS(v);
  endFor
}

DFS(u: vertex)
  color[u] = GRAY
  d[u] ← time;  time ← time + 1
  foreach v in succ[u] do
    if (color[v] = WHITE) then
      p[v] ← u
      DFS(v)
  endFor
  color[u] ← BLACK
  f[u] ← time;  time ← time + 1
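For completeness, a compact Python sketch of the classical SCC algorithm above, written iteratively to avoid recursion limits on large graphs; the dictionary g at the end is an invented toy graph, and every node is assumed to appear as a key of the adjacency dictionary.

def kosaraju_sccs(adj):
    # Step 1: DFS on G, recording vertices by increasing finishing time.
    order, color = [], {u: 'WHITE' for u in adj}
    for s in adj:
        if color[s] != 'WHITE':
            continue
        color[s] = 'GRAY'
        stack = [(s, iter(adj[s]))]
        while stack:
            u, it = stack[-1]
            advanced = False
            for v in it:
                if color[v] == 'WHITE':
                    color[v] = 'GRAY'
                    stack.append((v, iter(adj[v])))
                    advanced = True
                    break
            if not advanced:
                color[u] = 'BLACK'
                order.append(u)
                stack.pop()
    # Step 2: transpose G.
    radj = {u: [] for u in adj}
    for u in adj:
        for v in adj[u]:
            radj.setdefault(v, []).append(u)
    # Steps 3-4: visit G^T in decreasing finishing time; each tree is an SCC.
    seen, sccs = set(), []
    for s in reversed(order):
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in radj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sccs.append(comp)
    return sccs

g = {'a': ['b'], 'b': ['c'], 'c': ['a', 'd'], 'd': []}
print(kosaraju_sccs(g))   # -> [['a', 'c', 'b'], ['d']]

Both passes are plain DFS traversals, which is exactly the part that is hard to run on disk-resident graphs, as the next slide notes.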

Semi-External DFS
Key observation: if the bit array fits in internal memory, then a DFS takes |V| + |E|/B disk accesses.
Data structures:
- Bit array of nodes (visited or not)
- Array of successors (the stack of the DFS recursion)
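A rough Python sketch of the idea, under the assumption that the successor lists live in a binary file on disk (the file layout, file name and helper functions below are invented for illustration) while the visited marks fit in RAM: one seek per vertex plus a sequential read of its successor list, roughly |V| + |E|/B accesses.

import struct

def write_graph(path, adj, n):
    # Store successor lists on "disk": for each vertex, out-degree then successors (32-bit ints).
    offsets = []
    with open(path, 'wb') as f:
        for u in range(n):
            offsets.append(f.tell())
            succs = adj.get(u, [])
            f.write(struct.pack('<I', len(succs)))
            f.write(struct.pack('<%dI' % len(succs), *succs))
    return offsets

def semi_external_dfs(path, offsets, n, root=0):
    # Depth-first traversal keeping only one mark per vertex in memory;
    # successor lists are fetched from the file on demand.
    visited = bytearray(n)          # the in-memory "bit" array (one byte per node here)
    order = []
    with open(path, 'rb') as f:
        def succ(u):
            f.seek(offsets[u])
            (d,) = struct.unpack('<I', f.read(4))
            return list(struct.unpack('<%dI' % d, f.read(4 * d))) if d else []
        stack = [root]
        visited[root] = 1
        while stack:
            u = stack.pop()
            order.append(u)
            for v in succ(u):
                if not visited[v]:
                    visited[v] = 1
                    stack.append(v)
    return order

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}          # invented toy graph
offs = write_graph('graph.bin', adj, 4)
print(semi_external_dfs('graph.bin', offs, 4))    # -> [0, 2, 3, 1]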

Observing the Web graph
We do not know which fraction of it we actually know.
The only way to discover the graph structure of the Web as hypertext is via large-scale crawls.
Warning: the picture might be distorted by
- size limitations of the crawl
- crawling rules
- perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?
The Web is the largest artifact ever conceived by humans.
Exploiting the structure of the Web helps in:
- Crawl strategies
- Search
- Spam detection
- Discovering communities on the web
- Classification/organization
- Predicting the evolution of the Web
- Sociological understanding

Many other large graphs…
- Physical network graph: V = routers, E = communication links
- The “cosine” graph (undirected, weighted): V = static web pages, E = semantic distance between pages
- Query-log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
- Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (Facebook, address book, e-mail, …)

The size of the web Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 19.5

What is the size of the web?
Issues:
- The web is really infinite (dynamic content, e.g. calendars)
- The static web contains syntactic duplication, mostly due to mirroring (~30%)
- Some servers are seldom connected
Who cares?
- Media, and consequently the users
- Engine design

What can we attempt to measure?
- The relative sizes of search engines
  - Document extension: e.g. engines index pages not yet crawled, by indexing their anchor text
  - Document restriction: all engines restrict what is indexed (first n words, only relevant words, etc.)
- The coverage of a search engine relative to another particular crawling process

Relative Size from Overlap (Sec. 19.5)
Given two engines A and B:
- Sample URLs randomly from A, check if they are contained in B, and vice versa
- Each test involves: (i) sampling, (ii) checking
A ∩ B = (1/2) * Size A
A ∩ B = (1/6) * Size B
(1/2) * Size A = (1/6) * Size B  =>  Size A / Size B = (1/6)/(1/2) = 1/3
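In code, the estimate boils down to comparing the two measured overlap fractions. A small Python sketch; the two containment-test functions are placeholders that in practice would be backed by the query-based checking of the next slides.

def relative_size(sample_from_A, sample_from_B, contained_in_A, contained_in_B):
    # frac_A_in_B estimates |A ∩ B| / |A|, frac_B_in_A estimates |A ∩ B| / |B|.
    frac_A_in_B = sum(contained_in_B(u) for u in sample_from_A) / len(sample_from_A)
    frac_B_in_A = sum(contained_in_A(u) for u in sample_from_B) / len(sample_from_B)
    # |A ∩ B| = frac_A_in_B * |A| = frac_B_in_A * |B|  =>  |A|/|B| = frac_B_in_A / frac_A_in_B
    return frac_B_in_A / frac_A_in_B

# With the numbers of the slide: half of A's sample is in B, one sixth of B's sample is in A.
print(relative_size(['u%d' % i for i in range(6)],
                    ['v%d' % i for i in range(6)],
                    contained_in_A=lambda u: u in ('v0',),              # 1/6 of B's sample
                    contained_in_B=lambda u: u in ('u0', 'u1', 'u2')))  # 1/2 of A's sample
# -> 0.333..., i.e. Size A / Size B = 1/3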

Sampling URLs
Ideal strategy: generate a random URL and check for containment in each index.
Problem: random URLs are hard to find!
- Approach 1: generate a random URL contained in a given engine. Suffices for the estimation of relative sizes.
- Approach 2: random walks or random IP addresses. In theory, these might give a true estimate of the size of the web (as opposed to just the relative sizes of the indexes).

Random URLs from random queries
Generate a random query: how?
- Lexicon: 400,000+ words from a web crawl
- Conjunctive queries: w1 AND w2, e.g. vocalists AND rsi
Get 100 result URLs from engine A.
Choose a random URL as the candidate to check for presence in engine B (next slide).
This distribution induces a probability weight W(p) for each page p.
Conjecture: W(SE_A) / W(SE_B) ≈ |SE_A| / |SE_B|
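A hedged sketch of the query-generation and URL-sampling step in Python: the small lexicon list and the fake_search function below are made-up placeholders, and "search" stands for whatever API call is used to query engine A.

import random

def random_conjunctive_query(lexicon, rnd=random):
    # Pick two distinct lexicon words and form a conjunctive query.
    w1, w2 = rnd.sample(lexicon, 2)
    return '%s AND %s' % (w1, w2)

def sample_url_from_engine(lexicon, search, k=100, rnd=random):
    # Issue a random conjunctive query, take (up to) the top-k results,
    # and return one result URL uniformly at random, or None if no results.
    query = random_conjunctive_query(lexicon, rnd)
    results = search(query)[:k]
    return rnd.choice(results) if results else None

lexicon = ['vocalists', 'rsi', 'lexicon', 'crawl', 'indexing']   # stand-in for the 400,000+ word lexicon
fake_search = lambda q: ['http://example.org/%d' % i for i in range(5)]
print(sample_url_from_engine(lexicon, fake_search))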

Query-based checking
“Strong query” to check whether an engine B has a document D:
- Download D and get its list of words
- Use 8 low-frequency words as an AND query to B
- Check if D is present in the result set
Problems:
- Near-duplicates
- Redirects
- Engine time-outs
- Is an 8-word query good enough?
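A minimal sketch of building the strong query for a downloaded document D, assuming a table of corpus word frequencies is available; both the tokenizer and the frequency table below are simplified placeholders.

import re
from collections import Counter

def strong_query(document_text, corpus_freq, k=8):
    # Pick the k rarest words of D (w.r.t. corpus frequencies) and AND them together.
    words = set(re.findall(r'[a-z]+', document_text.lower()))
    # Unknown words get frequency 0, i.e. they are considered the rarest.
    rare = sorted(words, key=lambda w: corpus_freq.get(w, 0))[:k]
    return ' AND '.join(rare)

doc = "Web graphs are stored with gap encoding and copy lists in the WebGraph library"
freq = Counter({'web': 900, 'the': 1000, 'and': 990, 'are': 800, 'with': 700,
                'in': 950, 'graphs': 40, 'stored': 60, 'gap': 15, 'encoding': 25,
                'copy': 70, 'lists': 55, 'webgraph': 3, 'library': 120})
print(strong_query(doc, freq))
# -> webgraph AND gap AND encoding AND graphs AND lists AND stored AND copy AND library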

Advantages & disadvantages
Statistically sound under the induced weight.
Biases induced by the random queries:
- Query bias: favors content-rich pages in the language(s) of the lexicon
- Ranking bias (solution: use conjunctive queries & fetch all results)
- Checking bias: duplicates
- Query restriction bias: the engine might not deal properly with an 8-word conjunctive query
- Malicious bias: sabotage by the engine
- Operational problems: time-outs, failures, engine inconsistencies, index modification

Random IP addresses
- Generate random IP addresses
- Find a web server at the given address, if there is one
- Collect all pages from that server
- From these, choose a page at random
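A rough sketch of the IP-sampling step only, checking for a web server by attempting a TCP connection on port 80; actually fetching and enumerating a site's pages would require a crawler on top of this, which is omitted here.

import random
import socket

def random_ipv4(rnd=random):
    # Return a random dotted-quad IPv4 address.
    return '.'.join(str(rnd.randint(0, 255)) for _ in range(4))

def has_web_server(ip, timeout=1.0):
    # Heuristic check: does something accept TCP connections on port 80 at this IP?
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False

def sample_servers(trials=1000, rnd=random):
    # Generate random IPs and keep those that answer on port 80.
    return [ip for ip in (random_ipv4(rnd) for _ in range(trials)) if has_web_server(ip)]

# print(sample_servers(10))   # very few random IPs host a reachable web server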

Advantages & disadvantages
Advantages:
- Clean statistics
- Independent of crawling strategies
Disadvantages:
- Many hosts might share one IP address, or not accept requests
- No guarantee that all pages are linked to the root page, and thus can be collected
- The power law for #pages/host generates a bias towards sites with few pages

Conclusions
- No sampling solution is perfect
- Lots of new ideas, but the problem is getting harder
- Quantitative studies are fascinating and a good research problem

The web-graph: storage Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.4

Definition
Directed graph G = (V, E): V = URLs, E = (u,v) if u has a hyperlink to v. Isolated URLs are ignored (no IN & no OUT links).
Three key properties:
- Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1

The in-degree distribution
[Plots: AltaVista crawl, 1999; WebBase crawl, 2001]
The in-degree follows a power-law distribution. This is true also for the out-degree, the size of the components, …

Definition
Directed graph G = (V, E): V = URLs, E = (u,v) if u has a hyperlink to v. Isolated URLs are ignored (no IN, no OUT links).
Three key properties:
- Skewed distribution: the probability that a node has x links is 1/x^α, with α ≈ 2.1
- Locality: usually most of the hyperlinks (about 80%) point to other URLs on the same host
- Similarity: pages that are close in lexicographic order tend to share many outgoing links
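These properties can be checked empirically on a crawl. A small Python sketch measuring the locality property as the fraction of intra-host links; the edge list below is invented.

from urllib.parse import urlsplit

def locality(edges):
    # Fraction of hyperlinks (u, v) whose endpoints live on the same host.
    same = sum(urlsplit(u).netloc == urlsplit(v).netloc for u, v in edges)
    return same / len(edges)

edges = [   # hypothetical edge list: (u, v) means u has a hyperlink to v
    ('http://di.unipi.it/a', 'http://di.unipi.it/b'),
    ('http://di.unipi.it/a', 'http://di.unipi.it/c'),
    ('http://di.unipi.it/b', 'http://www.google.com/'),
    ('http://di.unipi.it/c', 'http://di.unipi.it/a'),
]
print(locality(edges))   # -> 0.75, i.e. 75% intra-host links in this toy example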

A Picture of the Web Graph
[Plot with axes i and j] 21 million pages, 150 million links

URL-sorting
[Example: sorted URL list from the Stanford and Berkeley hosts]

Front-coding
Front-compression of URLs + delta encoding of IDs.
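A sketch of front-coding a lexicographically sorted URL list: each URL is stored as the length of the prefix it shares with the previous URL, plus its remaining suffix. The URLs below are invented examples.

def front_code(sorted_urls):
    # Encode each URL as (shared-prefix length with previous URL, remaining suffix).
    coded, prev = [], ''
    for url in sorted_urls:
        lcp = 0
        while lcp < min(len(prev), len(url)) and prev[lcp] == url[lcp]:
            lcp += 1
        coded.append((lcp, url[lcp:]))
        prev = url
    return coded

def front_decode(coded):
    # Invert front_code.
    urls, prev = [], ''
    for lcp, suffix in coded:
        url = prev[:lcp] + suffix
        urls.append(url)
        prev = url
    return urls

urls = sorted([
    'http://www.stanford.edu/',
    'http://www.stanford.edu/class/',
    'http://www.stanford.edu/dept/',
    'http://www.berkeley.edu/',
])
coded = front_code(urls)
print(coded)   # e.g. (0, 'http://www.berkeley.edu/'), (11, 'stanford.edu/'), (24, 'class/'), ...
assert front_decode(coded) == urls

URL-sorting makes consecutive URLs share long prefixes (same host, same directories), which is what front-coding exploits.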

The WebGraph library
Successor list S(x) = {s1, s2, …, sk} is stored as the gap list {s1 - x, s2 - s1 - 1, …, sk - s(k-1) - 1}.
- Uncompressed adjacency list
- Adjacency list with compressed gaps (exploits locality)
Only the first gap s1 - x can be negative; negative entries are mapped to non-negative codes (x ≥ 0 -> 2x, x < 0 -> 2|x| - 1, as in the examples of the intervals slide below).
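A sketch of this gap encoding in Python; the to_nonneg mapping for the possibly negative first gap is the one just described and should be read as an assumption about the concrete format, not as the library's exact code.

def to_nonneg(x):
    # Map an integer to a non-negative code: x >= 0 -> 2x, x < 0 -> 2|x| - 1.
    return 2 * x if x >= 0 else 2 * (-x) - 1

def gap_encode(x, successors):
    # S(x) = {s1, ..., sk} sorted  ->  {v(s1 - x), s2 - s1 - 1, ..., sk - s(k-1) - 1}
    s = sorted(successors)
    gaps = [to_nonneg(s[0] - x)]                 # only the first gap can be negative
    gaps += [s[i] - s[i - 1] - 1 for i in range(1, len(s))]
    return gaps

# Hypothetical node 100 with successors close to it (locality keeps the gaps small):
print(gap_encode(100, [98, 101, 102, 110]))      # -> [3, 2, 0, 7]

Locality (most links point within the same host, hence to nearby IDs after URL-sorting) makes these gaps small, so they compress well with universal codes.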

Copy-lists
- Uncompressed adjacency list
- Adjacency list with copy lists (exploits similarity)
Node y is encoded with respect to a reference node x: each bit of y’s copy list tells whether the corresponding successor of x is also a successor of y.
The reference index is chosen in [0, W] so as to give the best compression; reference chains are possibly limited in length.
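A sketch of encoding a successor list against a reference list via a copy list (one bit per successor of the reference), plus the leftover "extra" nodes; the two lists below are invented examples.

def encode_with_copy_list(reference_succ, succ):
    # Copy list: bit i = 1 iff the i-th successor of the reference is also a successor here.
    # Extra nodes: the successors not covered by the reference.
    succ_set = set(succ)
    copy_list = [1 if s in succ_set else 0 for s in reference_succ]
    extra = [s for s in succ if s not in set(reference_succ)]
    return copy_list, extra

def decode_with_copy_list(reference_succ, copy_list, extra):
    copied = [s for s, bit in zip(reference_succ, copy_list) if bit]
    return sorted(copied + extra)

ref  = [13, 15, 16, 17, 18, 19, 23, 24, 203]     # successors of the reference node x
mine = [13, 15, 16, 17, 18, 19, 22, 23, 316]     # successors of the node y being encoded
bits, extra = encode_with_copy_list(ref, mine)
print(bits, extra)    # -> [1, 1, 1, 1, 1, 1, 1, 0, 0] [22, 316]
assert decode_with_copy_list(ref, bits, extra) == sorted(mine)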

Copy-blocks = RLE(Copy-list)
- Adjacency list with copy lists
- Adjacency list with copy blocks (RLE on the bit sequences)
Rules:
- The first copy block is 0 if the copy list starts with 0
- The last block is omitted (we know the total length)
- The length is decremented by one for all blocks after the first (those are always ≥ 1)
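A sketch of turning a copy list into copy blocks by run-length encoding, following the rules above; the convention that blocks alternate starting with a run of 1 (copied) bits is an assumption consistent with "the first block is 0 if the copy list starts with 0".

def copy_blocks(copy_list):
    # Run-length encode the copy list into alternating blocks, first block = run of 1s.
    runs, i = [], 0
    while i < len(copy_list):
        j = i
        while j < len(copy_list) and copy_list[j] == copy_list[i]:
            j += 1
        runs.append(j - i)
        i = j
    if copy_list and copy_list[0] == 0:
        runs = [0] + runs              # make the first block a (length-0) run of 1s
    runs = runs[:-1]                   # the last block is implicit (total length is known)
    # Every block after the first is at least 1, so its length is stored decremented by one.
    return [runs[0]] + [r - 1 for r in runs[1:]] if runs else []

bits = [1, 1, 1, 1, 1, 1, 1, 0, 0]     # e.g. the copy list from the previous sketch
print(copy_blocks(bits))               # -> [7]
print(copy_blocks([0, 1, 1, 1, 0, 1])) # -> [0, 0, 2, 0]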

Extra-nodes: compressing intervals
- Adjacency list with copy blocks
- Intervals: exploit consecutivity among the extra nodes; each interval is stored as its left extreme and its length
- Interval lengths are decremented by L_min = 2
- Residuals: stored as differences between consecutive residuals, or from the source node
Examples on the slide: 0 = (15-15)*2 (positive), 2 = (23-19)-2 (jump >= 2), 600 = (316-16)*2, 12 = (22-16)*2 (positive), 3018 (difference)
WebGraph is a Java and C++ library achieving ≈ 3 bits/edge.
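A simplified sketch of the extra-node encoding: maximal runs of consecutive IDs of length at least L_min become intervals (left extreme plus length), and the remaining residuals are difference-encoded starting from the source node. The real WebGraph format adds further details (decrements, non-negative mappings of the differences) that are omitted here; the input below is invented.

L_MIN = 2   # minimum interval length, as on the slide

def split_extra_nodes(x, extra):
    # Split the sorted extra nodes of x into intervals and gap-encoded residuals.
    extra = sorted(extra)
    intervals, residuals, i = [], [], 0
    while i < len(extra):
        j = i
        while j + 1 < len(extra) and extra[j + 1] == extra[j] + 1:
            j += 1
        run = extra[i:j + 1]
        if len(run) >= L_MIN:
            # Store left extreme and length (the length would then be decremented by L_MIN).
            intervals.append((run[0], len(run)))
        else:
            residuals.extend(run)
        i = j + 1
    # Residuals: first difference is w.r.t. the source x, then w.r.t. the previous residual.
    gaps, prev = [], x
    for r in residuals:
        gaps.append(r - prev)
        prev = r
    return intervals, gaps

# Hypothetical node 16 whose extra successors contain the consecutive run 17, 18, 19, 20:
print(split_extra_nodes(16, [13, 17, 18, 19, 20, 22, 316]))
# -> ([(17, 4)], [-3, 9, 294])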