1 Massive Data Sets: Theory & Practice Ziv Bar-Yossef IBM Almaden Research Center.



2 What are Massive Data Sets?
Technology: the World-Wide Web, IP packet flows, phone call logs
Science: genomic data, astronomical sky surveys, weather data
Business: credit card transactions, billing records, supermarket sales
Scale: gigabytes, terabytes, petabytes
Characteristics: huge, distributed, dynamic, heterogeneous, noisy, unstructured / semi-structured

3 Nontraditional Challenges
Traditionally: cope with the complexity of the problem
Massive data sets: cope with the complexity of the data
New challenges:
How to efficiently compute on massive data sets?
–Restricted access to the data
–Not enough time to read the whole data
–Tiny fraction of the data can be held in main memory
How to find desired information in the data?
How to summarize the data?
How to clean the data?

4 Computational Models for Massive Data Sets
Sampling: query a small number of data elements
Data streams: stream through the data; limited main memory storage
Sketching: compress data chunks into small “sketches”; compute over the sketches

5 Outline of the Talk
Web statistics
Sampling lower bounds
Hamming distance sketching
Template detection
(The topics span both “Theory” and “Practice”.)

6 Web Statistics (with A. Berg, S. Chien, J. Fakcharoenphol, D. Weitz, VLDB 2000)
What fraction of the web is covered by Google?
Which is the largest country domain on the web?
What is the percentage of French language pages?
How large is the web?
[Figure: the “BowTie” structure of the web [Broder et al, 2000], with the crawlable web marked]

7 Our Approach
Straightforward solution:
–Crawl the crawlable web
–Generate statistics based on the crawl
Drawbacks:
–Expensive
–Complicated implementation
–Slow
–Inaccurate
Our approach: uniform sampling by random walks
–Random walk on an undirected & regular version of the crawlable web
Advantages:
–Provably uniform samples from the crawlable web
–Runs on a desktop PC in a couple of days

8 Undirected Regular Random Walk
Follow a random out-link or a random in-link at each step
Use weighted self loops to even out page degrees: w(v) = deg_max − deg(v)
Fact: A random walk on a connected (non-bipartite) undirected regular graph converges to a uniform limit distribution.
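The self-loop construction on this slide can be tried out on a toy graph. The following is a minimal sketch, not the authors' crawler: the graph `adj`, the step count, and the function name `regular_walk` are hypothetical, and the graph is assumed to be already undirected (every link stored in both directions).

```python
import random
from collections import Counter

def regular_walk(adj, start, steps, rng):
    """Random walk on an undirected graph, made regular with self loops.

    Each node v gets w(v) = deg_max - deg(v) self loops, so every node
    has the same total degree deg_max.
    """
    deg_max = max(len(nbrs) for nbrs in adj.values())
    v = start
    visits = Counter()
    for _ in range(steps):
        # With probability deg(v)/deg_max follow a real edge;
        # otherwise take one of the self loops and stay at v.
        if rng.random() < len(adj[v]) / deg_max:
            v = rng.choice(adj[v])
        visits[v] += 1
    return visits

# Hypothetical toy graph (assumption): node 0 is a high-degree "hub".
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2], 4: [0]}
steps = 200_000
visits = regular_walk(adj, 0, steps, random.Random(0))
freqs = {v: visits[v] / steps for v in adj}
print(freqs)  # each node should be visited roughly 1/5 of the time
```

Without the self loops, the same walk would visit node 0 about four times as often as node 4, which is the degree bias the construction removes.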

9 Convergence Rate (“Mixing Time”)
Theorem: Mixing time = O(log(N)/δ) (N = graph size, δ = transition matrix’s spectral gap)
Experiment (based on a crawl):
–For the web, δ ≈ …
–Mixing time: 3.3 million steps
–Self loop steps are free
–29,999 out of 30,000 steps are self loop steps
–Actual mixing time is only 110 steps
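The last figure on this slide follows from the two before it. As a worked check, using only the numbers quoted on the slide: if only 1 step in 30,000 follows a real link, the number of non-self-loop steps is

```latex
\[
  3.3 \times 10^{6} \ \text{steps} \times \frac{1}{30{,}000}
  \;=\; \frac{3{,}300{,}000}{30{,}000}
  \;=\; 110 \ \text{steps}.
\]
```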

10 Realization of the Random Walk
Problems:
–The in-links of pages are not readily available
–The degree of pages is not available
Available sources of in-links:
–Previously visited nodes
–Reverse link services of search engines
Experiments indicate samples are still nearly uniform.

11 Top 20 Internet Domains (summer 2003)

12 Search Engine Coverage (summer 2000)

13 Subsequent Extensions
Focused Sampling (with T. Kanungo and R. Krauthgamer, 2003)
–“Focused statistics” about web communities:
  Statistics about the .uk domain
  Statistics about pages on bicycling
  Statistics about Arabic language pages
–Based on a sophisticated extension of the above random walk.
Study of the web’s decay (with A. Broder, R. Kumar, and A. Tomkins, 2003)
–A measure for how well-maintained web pages are.
–Based on a random walk idea.

14 Sampling Lower Bounds (STOC 2003)
Q1. How many samples are needed to estimate:
–The fraction of pages covered by Google?
–The number of distinct web-sites?
–The distribution of languages on the web?
A1. A “recipe” for obtaining sampling lower bounds for symmetric functions.
Q2. Can we save samples by sampling non-uniformly?
A2. For “symmetric” functions, uniform sampling is the best possible. (“symmetric”: invariant under permutations of data elements)

15 Optimality of Uniform Sampling (with R. Kumar and D. Sivakumar, STOC 2001)
Theorem: When estimating symmetric functions, uniform sampling is the best possible.
Proof idea: [Figure: the original algorithm queries chosen positions of the input x_1,…,x_8 (x_2, x_7, x_5 in the example); the simulation runs it on a permuted input π(x).]

16 Preliminaries
f: A^n → B: symmetric function
ε: approximation parameter
Pairwise “disjoint inputs” a, b, c [Figure: f(a), f(b), f(c) in B]
“Sample distribution” μ_x of an input x, e.g., μ_x(1) = 1/2, μ_x(2) = 1/3, μ_x(3) = 1/6

17 The Lower Bound Recipe
x_1,…,x_m: “pairwise disjoint” inputs
μ_1,…,μ_m: “sample distributions” on x_1,…,x_m
Theorem: Any algorithm approximating f requires q samples, where q = Ω(log m / JS(μ_1,…,μ_m))
(JS is the Jensen–Shannon divergence; 0 ≤ JS(μ_1,…,μ_m) ≤ log m)
Proof steps:
–Reduction from statistical classification
–Lower bound for statistical classification

18 Reduction from Statistical Classification
f: A^n → B: symmetric function
a, b, c: pairwise “disjoint inputs” [Figure: f(a), f(b), f(c) in B]
Statistical classification: Given uniform samples from x ∈ { a, b, c }, decide whether x = a or x = b or x = c.
Can be solved by any sampling algorithm approximating f

19 The “Election Problem”
Input: a sequence x of n votes to k parties
Vote distribution μ_x, e.g. (n = 18, k = 6): 7/18, 4/18, 3/18, 2/18, 1/18
Want to get μ s.t. ||μ − μ_x|| < ε.
Theorem: A poll of size Ω(k/ε²) is required for solving the election problem.
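To get a feel for the k/ε² scale in this theorem, one can simulate a poll. The sketch below rests on several assumptions that are not on the slide: the vote counts are hypothetical, the norm is taken to be L1, and the constant 10 in the poll size is arbitrary. It illustrates the setting only and does not demonstrate the lower bound itself.

```python
import random
from collections import Counter

def poll_estimate(votes, sample_size, k, rng):
    """Estimate the vote distribution from a uniform sample with replacement."""
    counts = Counter(rng.choice(votes) for _ in range(sample_size))
    return [counts[party] / sample_size for party in range(k)]

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical election (assumption): k = 6 parties, n = 18,000 votes.
k = 6
vote_counts = [7000, 4000, 3000, 2000, 1000, 1000]
votes = [party for party, c in enumerate(vote_counts) for _ in range(c)]
true_dist = [c / len(votes) for c in vote_counts]

eps = 0.1
sample_size = round(10 * k / eps ** 2)  # poll size on the order of k / eps^2
estimate = poll_estimate(votes, sample_size, k, random.Random(1))
error = l1_distance(estimate, true_dist)
print(sample_size, round(error, 4))
```

Shrinking `sample_size` well below k/ε² makes the error exceed ε on most runs, which is the behaviour the lower bound says cannot be avoided.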

20 Combinatorial Designs
A family of subsets B_1,…,B_m of a universe U s.t.
1. Each of them constitutes half of U.
2. The intersection of each two of them is relatively small.
Fact: There exist designs of size exponential in |U|.

21 Proof of the Lower Bound for the Election Problem
Step 1: Identification of a set S of pairwise disjoint inputs:
–B_1,…,B_m ⊆ {1,…,k}: a design of size m = 2^{Ω(k)}.
–S = { x_1,…,x_m }, where in x_i:
  ½ + ε of the votes are split among parties in B_i.
  ½ − ε of the votes are split among parties in B_i^c.
Step 2: JS(μ_1,…,μ_m) = O(ε²).
By our theorem, # of queries is at least Ω(k/ε²).

22 Hamming Distance Sketching (with T.S. Jayram and R. Kumar, 2003)
[Figure: Alice holds x and Bob holds y; they send sketches σ(x) and σ(y) to a Referee, who decides whether Ham(x,y) ≤ k or Ham(x,y) > k.]

23 Hamming Distance Sketching
Applications:
–Maintenance of large crawls
–Comparison of large files over the network
Previous schemes: Sketches of size O(k²) [Kushilevitz, Ostrovsky, Rabani, 98], [Yao 03]
Lower bound: Ω(k)
Our scheme: Sketches of size O(k log k)

24 Preliminaries
Balls and Bins:
–When throwing n balls into n/log n bins, then with high probability the fullest bin has O(log n) balls.
–When throwing n balls into n² bins, then with high probability no two balls fall into the same bin.
Using KOR scheme, can assume Ham(x,y) ≤ 2k.
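The first balls-and-bins fact is easy to check empirically. This is a small simulation sketch; the value of n, the use of the natural logarithm, and the helper name `max_load` are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def max_load(n_balls, n_bins, rng):
    """Throw n_balls uniformly at random into n_bins; return the fullest bin's load."""
    loads = Counter(rng.randrange(n_bins) for _ in range(n_balls))
    return max(loads.values())

n = 10_000
n_bins = int(n / math.log(n))          # n / log n bins (natural log assumed)
fullest = max_load(n, n_bins, random.Random(2))
average = n / n_bins                   # about log n balls per bin on average
print(n_bins, round(average, 2), fullest)
```

The fullest bin stays within a small constant factor of the average load of about log n, which is what lets the first hashing level bound the Hamming distance inside each bin.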

25 First Level Hashing
Hash the positions of x and y into k/log k bins, splitting x into x_1, x_2, x_3, … and y into y_1, y_2, y_3, …
Ham(x,y) = Σ_i Ham(x_i, y_i)
∀i, Ham(x_i, y_i) ≤ O(log k)

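26 Second Level Hashing
Hash the positions of x_3 and y_3 into log² k bins, giving bits α_{3,1},…,α_{3,6} for x_3 and β_{3,1},…,β_{3,6} for y_3 (six bins are drawn in the figure)
α_{3,j} = β_{3,j} iff # of “pink positions” in the j-th bin is even (presumably the positions where x_3 and y_3 differ, highlighted in the figure).
If no collisions, Ham(α_3, β_3) = Ham(x_3, y_3)
If collisions, Ham(α_3, β_3) ≤ Ham(x_3, y_3)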
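The per-bin bits on this slide behave like parities, which the following sketch makes concrete. It is an illustration under assumptions, not the authors' exact scheme: the bit strings, the number of bins, and the random bin assignment `bin_of` (standing in for a hash function shared by both parties) are all hypothetical.

```python
import random

def parity_sketch(bits, bin_of, n_bins):
    """XOR together the bits whose positions fall into each bin."""
    sketch = [0] * n_bins
    for pos, bit in enumerate(bits):
        sketch[bin_of[pos]] ^= bit
    return sketch

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

rng = random.Random(3)
length, n_bins = 64, 16
x = [rng.randrange(2) for _ in range(length)]
y = list(x)
for pos in rng.sample(range(length), 4):   # flip 4 positions, so Ham(x, y) = 4
    y[pos] ^= 1

# Both parties must use the same assignment of positions to bins.
bin_of = [rng.randrange(n_bins) for _ in range(length)]
sx = parity_sketch(x, bin_of, n_bins)
sy = parity_sketch(y, bin_of, n_bins)
print(hamming(x, y), hamming(sx, sy))
```

Two sketch bits differ exactly when an odd number of differing positions land in that bin, so the sketch distance never exceeds the true distance and equals it when no two differing positions collide.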

27 The Sketch
For each i = 1,…,k/log k, repeat second level hashing t = O(log k) times, obtaining (α_i^1, β_i^1),…,(α_i^t, β_i^t).
Probability of collision: a small constant
With probability at least 1 − 1/k, Ham(x_i, y_i) = max_j Ham(α_i^j, β_i^j)
σ(x) = { α_i^j | i = 1,…,k/log k, j = 1,…,t }
σ(y) = { β_i^j | i = 1,…,k/log k, j = 1,…,t }
Referee decides Ham(x,y) ≤ k if and only if Σ_i max_j Ham(α_i^j, β_i^j) ≤ k

28 Other Sketching Results
A sketching scheme for the edit distance
–Leads to the first almost-linear time approximation algorithm for the edit distance.
Sketch lower bounds for (compressed) pattern matching.

29 Template Detection (with S. Rajagopalan, WWW 2002)
Template – Master HTML shell page used for composing new pages.
Our contributions:
–Efficient algorithm for template detection
–Application to improvement of search engine precision

30 Templates are Bad for Web IR
Pose a significant source of “noise” in web pages
–Their content is not related to the topics of pages in which they reside
–Create spurious linkage to unimportant pages
Extremely common
–Became standard in website design

31 Pagelets [Chakrabarti 01]
Pagelet – a region in a page that:
–has a single theme
–is not nested within a bigger region with the same theme
Examples: navigational bar pagelet, search pagelet, directory pagelet, news headlines pagelet

32 Template Definition
Template = a collection of pagelets that:
1. Belong to the same website.
2. Are nearly-identical.

33 Template Detection
Template Detection Problem: Given a set of pages S, find all the templates in S.
Template Detection Algorithm:
Group the pages in S according to website.
For each website w:
–For each page p ∈ w:
  Partition p into pagelets p_1,…,p_k
  Compute a “shingle” sketch for each pagelet [Broder et al. 1997]
–Group the resulting pagelets by their sketches.
–Output all the pagelet groups of size > 1.
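The grouping step of the algorithm can be sketched in a few lines. The code below is a simplified illustration under assumptions, not the authors' implementation: pages are assumed to be already partitioned into pagelets, `shingle_sketch` is a hypothetical min-hash style stand-in for the shingle sketch of [Broder et al. 1997], and the site names and texts are made up.

```python
import hashlib
from collections import defaultdict

def shingle_sketch(text, w=3, size=4):
    """Sketch a pagelet by the `size` smallest hashes of its w-word shingles."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles)
    return tuple(hashes[:size])

def detect_templates(pagelets):
    """pagelets: iterable of (website, page, pagelet_text) triples."""
    groups = defaultdict(list)
    for site, page, text in pagelets:
        # Group by website first, then by sketch within the website.
        groups[(site, shingle_sketch(text))].append((page, text))
    # A template is a group of size > 1 within one website.
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical input (assumption): three pages on two sites.
pagelets = [
    ("a.example", "a.example/1", "Home Products About Contact"),
    ("a.example", "a.example/1", "Review of mountain bikes for beginners"),
    ("a.example", "a.example/2", "Home Products About Contact"),
    ("a.example", "a.example/2", "Weather forecast for the coming weekend"),
    ("b.example", "b.example/1", "Home Products About Contact"),
]
templates = detect_templates(pagelets)
print(templates)
```

Only the navigation bar shared by the two pages of `a.example` is reported; the identical text on `b.example` is not grouped with it because templates are defined per website.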

34 HITS & Clever [Kleinberg 1997, Chakrabarti et al. 1998]
Hubs: h(p) = Σ_{q ∈ out(p)} a(q)
Authorities: a(p) = Σ_{q ∈ in(p)} h(q)
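The two update rules on this slide can be iterated directly. The following is a minimal sketch: the toy link graph is hypothetical, and the normalization after each update is an assumption added to keep the scores bounded (the slide shows only the two sums).

```python
def normalize(scores):
    """Scale scores to unit Euclidean length (assumed normalization step)."""
    norm = sum(v * v for v in scores.values()) ** 0.5
    return {p: v / norm for p, v in scores.items()} if norm > 0 else scores

def hits(out_links, iterations=50):
    nodes = sorted(set(out_links) | {q for qs in out_links.values() for q in qs})
    hub = {p: 1.0 for p in nodes}
    auth = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all pages q that link to p
        auth = {p: 0.0 for p in nodes}
        for q, targets in out_links.items():
            for p in targets:
                auth[p] += hub[q]
        auth = normalize(auth)
        # h(p) = sum of a(q) over all pages q that p links to
        hub = normalize({p: sum(auth[q] for q in out_links.get(p, ())) for p in nodes})
    return hub, auth

# Hypothetical link graph (assumption): three hub pages, two authority pages.
out_links = {"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1", "a2"]}
hub, auth = hits(out_links)
best_authority = max(auth, key=auth.get)
print(best_authority, {p: round(s, 3) for p, s in hub.items()})
```

In this toy graph the page linked from all three hubs ends up as the top authority, and the hubs that point to both authorities score higher than the one that points to a single page.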

35 “Template” Clever
Hubs – all the non-templatized constituent pagelets of pages in the base set.
Authorities – all pages in the base set.
[Figure legend: page, pagelet, templatized pagelet]

36 Classical Clever vs. Template Clever

37 Template Proliferation

38 Summary
Web data mining via random walks on the web graph:
–Web statistics
–Focused statistics
–Web decay
Sampling lower bounds
–Optimality of uniform sampling for symmetric functions
–A “recipe” for lower bounds
Sketching of string distance measures
–Hamming distance
–Edit distance
Template detection

39 Some of My Other Work
Database
–Semi-structured data and XML
Computational Complexity
–Communication complexity
–Pseudo-randomness and de-randomization
–Space-bounded computations
–Parallel computation complexity
Algorithm Design
–Data stream algorithms
–Internet auctions


41 Web Statistics (with A. Berg, S. Chien, J. Fakcharoenphol, D. Weitz, VLDB 2000)
What fraction of the web is covered by Google?
Which is the largest country domain on the web?
What is the percentage of porn pages?
How large is the web?
[Figure: the “BowTie” structure of the web [Broder et al, 2000], with components SCC, IN, and OUT and the crawlable web marked]

42 Straightforward Random Walk
Follow a random out-link at each step
Gets stuck in sinks and in dense web communities
Biased towards popular pages
Converges slowly, if at all
[Figure: example graph with nodes yahoo.com and amazon.com]

43 Undirected Regular Random Walk
Follow a random out-link or a random in-link at each step
Use weighted self loops to even out page degrees: w(v) = deg_max − deg(v)
Fact: A random walk on a connected (non-bipartite) undirected regular graph converges to a uniform limit distribution.
[Figure: example graph with nodes yahoo.com and amazon.com]

44 Evaluation: Bias towards High Degree Nodes
[Plot: percent of nodes from walk, by deciles of nodes ordered by degree (high degree to low degree)]

45 Evaluation: Bias towards the Search Engines
[Plot: estimate of search engine size against search engine size (30%, 50%)]

46 Link-Based Web IR Applications
Search and ranking
–HITS and Clever [Kleinberg 1997, Chakrabarti et al. 1998]
–PageRank [Brin and Page 1998]
–SALSA [Lempel and Moran 2000]
Similarity search
–Co-Citation [Dean and Henzinger 1999]
Categorization
–Hyperclass [Chakrabarti, Dom, Indyk 1998]
Focused crawling
–FOCUS [Chakrabarti, van den Berg, Dom 1999]
…

47 Hypertext IR Principles
Underlying principles of link analysis:
Relevant Linkage Principle [Kleinberg 1997]
–p links to q ⇒ q is relevant to p
Topical Unity Principle [Kessler 1963, Small 1973]
–q_1 and q_2 are co-cited in p ⇒ q_1 and q_2 are related to each other
Lexical Affinity Principle [Maarek et al. 1991]
–The closer the links to q_1 and q_2 are, the stronger the relation between them.

48 Example: HITS & Clever [Kleinberg 1997, Chakrabarti et al. 1998]
Relevant Linkage Principle
–All links propagate score from hubs to authorities and vice versa.
Topical Unity Principle
–Co-cited authorities propagate score to each other.
Lexical Affinity Principle (Clever)
–Text around links is used to weight relevance of the links.
Hubs: h(p) = Σ_{q ∈ out(p)} a(q)
Authorities: a(p) = Σ_{q ∈ in(p)} h(q)