Algorithms and data structures for big data, what's next? Paolo Ferragina, University of Pisa.


1 Algorithms and data structures for big data, what's next? Paolo Ferragina, University of Pisa

2 Is Big Data a buzzword?

3 Big Data vs Grid Computing

4 VLDB has existed since 1992

5 Big data, big impact!

6 Big data are everywhere!

7 NoSQL: HyperTable, Cassandra, Hadoop, Cosmos [Proc. OSDI 2006]

8

9

10 From macro- to micro-users
Energy is related to time and memory accesses in an intricate manner, so the interplay between algorithms and memory levels is key for everyday users, not only for big players.

11 Our driving moral... Big steps come from theory... but do NOT forget practice ;-)

12 Our running example

13 (String) Dictionary Problem
Given a dictionary D of K strings, of total length N, store them so that prefix searches for a pattern P are supported efficiently.
Exact search: hashing

14 (Compacted) Trie [Fredkin, CACM 1960]
Dictionary: systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo
[Figure: compacted trie over the dictionary; edge labels: s, y, z, stile, zyg, etic, ial, y, aibelyite, czecin, omo]
Lexicographic search: P = systo
Performance: search in O(|P|) time, space O(N)
Dominated the string-matching scene in the 80s-90s; the best known is the Suffix Tree
Software engineers objected:
  Search: random memory accesses
  Space: pointers + strings
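A minimal sketch of prefix search over the slide's dictionary. For brevity it uses a plain (uncompacted) trie built from Python dicts, which is an assumption of this example and not the compacted structure shown on the slide; the names `TrieNode`, `build_trie` and `prefix_search` are hypothetical.

```python
class TrieNode:
    """One node of a plain (uncompacted) trie: one child per character."""
    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    """Insert every word character by character."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def prefix_search(root, pattern):
    """Return, in lexicographic order, all stored words starting with pattern."""
    node = root
    for ch in pattern:              # O(|P|) steps to reach the locus of P
        node = node.children.get(ch)
        if node is None:
            return []
    found = []
    stack = [(node, pattern)]
    while stack:                    # collect every word below the locus
        cur, prefix = stack.pop()
        if cur.is_word:
            found.append(prefix)
        for ch, child in cur.children.items():
            stack.append((child, prefix + ch))
    return sorted(found)


DICTIONARY = ["systile", "syzygetic", "syzygial", "syzygy",
              "szaibelyite", "szczecin", "szomo"]
```

Each step down the trie follows a pointer to a node that may live anywhere in memory, which is the "random memory accesses" objection raised on the slide.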

15 Timeline: theory and practice...
60: Trie
70-80: Suffix Tree
90: What about software engineers?

16 What did systems implement? They used the compacted trie, of course, but with 2 additional concerns raised by large data.

17 First issue: space concern
Strings:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html...
Front Coding (length of the prefix shared with the previous string, then the remaining suffix):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
...
0 http://checkmate.com/All/Natural/Washcloth.html...
[Slide annotation: 45%]
Likewise, systile syzygetic syzygial syzygy… becomes systile 2,zygetic 5,ial 5,y
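Front coding as described on this slide can be sketched as follows: each string is replaced by the length of the prefix it shares with its predecessor plus the remaining suffix. The function names `shared_prefix_len`, `front_encode` and `front_decode` are hypothetical, and the input is assumed to be already sorted.

```python
def shared_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    limit = min(len(a), len(b))
    i = 0
    while i < limit and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings):
    """Encode each string as (shared prefix length, remaining suffix)."""
    encoded = []
    previous = ""
    for s in sorted_strings:
        k = shared_prefix_len(previous, s)
        encoded.append((k, s[k:]))
        previous = s
    return encoded


def front_decode(pairs):
    """Rebuild the strings by copying k characters from the previous one."""
    decoded = []
    previous = ""
    for k, suffix in pairs:
        s = previous[:k] + suffix
        decoded.append(s)
        previous = s
    return decoded
```

On the slide's small example, `front_encode(["systile", "syzygetic", "syzygial", "syzygy"])` yields the pairs (0, systile), (2, zygetic), (5, ial), (5, y). Note that decoding a string requires decoding all the strings before it in the same run.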

18 Second issue: disk memory
[Figure: CPU, internal memory, and a disk whose tracks are accessed in blocks of size B]
2 main features:
  Seek time = I/Os are costly
  Blocked access = B items per I/O
Count I/Os
Why are strings challenging? Strings may be arbitrarily long

19 2-level indexing
Disk: front-coded buckets of size B: ….0systile 2zygetic 5ial 5y | 0szaibelyite 2czecin 2omo….
Internal memory: compacted trie (CT) on a sample of the strings (systile, szaibelyite)
2 advantages:
  Search: typically 1 disk access
  Space: front coding over buckets
One main limitation: sampling rate & lengths of sampled strings, i.e. a trade-off between speed and space (because of the bucket size)
(Prefix) B-tree
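The 2-level scheme can be sketched as below. Two simplifications are assumptions of this example: the in-memory sample is a sorted Python list searched with `bisect` (the slide uses a compacted trie), and the "disk" is a list of front-coded buckets. All function names are hypothetical.

```python
from bisect import bisect_right


def shared_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    limit = min(len(a), len(b))
    i = 0
    while i < limit and a[i] == b[i]:
        i += 1
    return i


def build_two_level(sorted_strings, bucket_size):
    """Split the sorted strings into front-coded buckets; sample each bucket's first string."""
    samples, buckets = [], []
    for start in range(0, len(sorted_strings), bucket_size):
        chunk = sorted_strings[start:start + bucket_size]
        samples.append(chunk[0])            # kept in internal memory
        previous, coded = "", []
        for s in chunk:
            k = shared_prefix_len(previous, s)
            coded.append((k, s[k:]))
            previous = s
        buckets.append(coded)               # stands in for one disk block
    return samples, buckets


def decode_bucket(coded):
    """Undo front coding inside one bucket."""
    previous, out = "", []
    for k, suffix in coded:
        previous = previous[:k] + suffix
        out.append(previous)
    return out


def lookup(samples, buckets, query):
    """Search the in-memory sample, then fetch and scan a single bucket."""
    b = bisect_right(samples, query) - 1    # last sample <= query
    if b < 0:
        return False
    return query in decode_bucket(buckets[b])   # the only "disk access"
```

With the seven strings of the running example and `bucket_size=4`, the samples are systile and szaibelyite and the two buckets match the front-coded runs shown on the slide. Larger buckets shrink the sample but lengthen the scan, which is the speed versus space trade-off the slide mentions.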

20 Timeline: theory and practice...
60: Trie
70-80: Suffix Tree
90: 2-level indexing
Space + hierarchical memory: do we need to trade space for I/Os?
1995: String B-tree

21 An old idea: Patricia Trie [Morrison, J.ACM 1968]
Disk: ….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….
[Figure: Patricia trie over the strings on disk; edge labels: s, y, z, stile, zyg, etic, ial, y, aibelyite, czecin, omo; node labels: 0, 1, 2, 2, 5]

22 A new (lexicographic) search [Ferragina-Grossi, J.ACM 1999]
Disk: ….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….
[Figure: Patricia trie keeping only the first character of each edge label (s, y, z, s, z, e, i, y, a, c, o); node labels: 0, 1, 2, 2, 5]
Lexicographic search: P = syzytea
Search(P):
  Phase 1: tree navigation
  Phase 2: compute LCP
  Phase 3: tree navigation
Output: the lexicographic position of P
Only 1 string is checked on disk
Trie space depends on the number of strings, NOT on their length

23 The String B-tree [Ferragina-Grossi, J.ACM 1999]
[Figure: B-tree over pointers to the strings, each node indexed by a Patricia trie (PT)]
Search(P): O((p/B) log_B K) I/Os for the lexicographic position of P, plus O(occ/B) I/Os
Check 1 string = O(p/B) I/Os, over O(log_B K) levels
It is dynamic...
> 15 US patents cite it!!
Knuth, vol. 3, p. 489: elegant

24 I/O-aware algorithms & data structures [CACM 1988] [2006] Huge literature!! I/Os were the main concern

25 Timeline: theory and practice...
60: Trie
70-80: Suffix Tree
90: 2-level indexing
1995: String B-tree
1999: Cache-oblivious solutions, aka parameter-free algo+ds
Not just 2 memory levels
[Figure: memory hierarchy: CPU registers, L1, L2, RAM, cache, HD, net]
Anywhere, anytime, anyway... I/O-optimal!!

26 Timeline: theory and practice...
60: Trie
70-80: Suffix Tree
90: 2-level indexing
1995: String B-tree
1999: Cache-oblivious data structures (not just 2 memory levels)
Compressed data structures (space)

27 A challenging question [Ken Church, AT&T 1995]
Software engineers use squeezing heuristics that compress data and still support fast access to them.
Can we automate and guarantee the process?

28 Opportunistic Data Structures with Applications, P. Ferragina, G. Manzini (...now, J.ACM 2005)
Aka: compressed self-indexes
Space for text + index ≈ space for the compressed text only (H_k)
Query/decompression time: theoretically (quasi-)optimal

29 The big (unconscious) step... [Burrows-Wheeler, 1994]
Let us be given a text T = mississippi#
Rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (first column, middle, last column):
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
The last column is highly compressible, but…

30 The big (unconscious) step... [Burrows-Wheeler, 1994]
For T = mississippi#, the last column of the sorted rotation matrix is bwt(T) = ipssm#pissii
bzip2 = BWT + other simple compressors
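The transform on these slides can be sketched naively by sorting all rotations. This is a quadratic-space illustration, not how bzip2 implements it; the names `bwt` and `inverse_bwt` are hypothetical, and the text is assumed to end with a unique marker `#` that sorts before every other character.

```python
from collections import Counter


def bwt(text):
    """Sort all rotations of text and return the last column."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last, end_marker="#"):
    """Rebuild the text from its BWT by walking the last-to-first mapping."""
    counts = Counter(last)
    first_row = {}                  # first row whose rotation starts with each char
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    seen = Counter()
    lf = []                         # lf[i]: row that starts with the char last[i]
    for ch in last:
        lf.append(first_row[ch] + seen[ch])
        seen[ch] += 1

    row = last.index(end_marker)    # the row equal to the text itself
    out = []
    for _ in range(len(last)):      # emits the text from its last char backwards
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))
```

The runs of equal characters in the output (for instance `ss` and `ii` in `ipssm#pissii`) are what the later stages of a BWT-based compressor exploit.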

31 From practice to theory... [Ferragina-Manzini, IEEE FOCS 2000]
FM-index = BWT is searchable... or Suffix Array is compressible
Space = |T| H_k + o(|T|) bits
Search(P) = O(p + occ * polylog(|T|))
Nowadays tons of papers: theory & experiments [Navarro-Makinen, ACM Comp. Surveys 2007]
For T = mississippi#:
sa(T)  sorted rotation  bwt(T)
12     #mississipp      i
11     i#mississip      p
8      ippi#missis      s
5      issippi#mis      s
2      ississippi#      m
1      mississippi      #
10     pi#mississi      p
9      ppi#mississ      i
7      sippi#missi      s
4      sissippi#mi      s
6      ssippi#miss      i
3      ssissippi#m      i
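To illustrate what "BWT is searchable" means, here is a hedged sketch of backward search over an uncompressed BWT. It keeps the full suffix array and counts characters by scanning, so it shows the search logic only and none of the compressed-space machinery of the FM-index. The names `build_fm`, `backward_search` and `locate` are hypothetical; positions are 0-based, whereas the slide's sa(T) is 1-based.

```python
from collections import Counter


def build_fm(text):
    """Return (suffix array, BWT, first-row table) for a text ending in a unique '#'."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:] + text[:i])
    bwt = "".join(text[i - 1] for i in sa)      # char preceding each sorted rotation
    counts = Counter(text)
    first_row = {}                              # number of chars smaller than ch
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    return sa, bwt, first_row


def backward_search(bwt, first_row, pattern):
    """Return the half-open row range [lo, hi) of rotations prefixed by pattern.

    The pattern is assumed not to contain the end marker.
    """
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):                # one step per pattern char
        if ch not in first_row:
            return 0, 0
        lo = first_row[ch] + bwt.count(ch, 0, lo)   # naive rank by scanning
        hi = first_row[ch] + bwt.count(ch, 0, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi


def locate(sa, bwt, first_row, pattern):
    """Return the sorted 0-based starting positions of pattern in the text."""
    lo, hi = backward_search(bwt, first_row, pattern)
    return sorted(sa[row] for row in range(lo, hi))
```

For T = mississippi#, searching `ssi` narrows the range to the last two rows of the table above, i.e. text positions 2 and 5 (0-based). A real FM-index replaces the scanning `count` calls with compressed rank structures and samples the suffix array instead of storing it.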

32 Compressed & searchable data formats
After our FOCS 2000 paper about texts, we nowadays find compressed indexes for: trees, labeled trees and graphs, functions, integer sets, geometry, images...

33 From theory to practice… December 2003

34 ACM J. on Experimental Algorithmics, 2009

35 > 10^3 faster than Smith-Waterman; > 10^2 faster than SOAP & Maq

36 What about the Web ? [Ferragina-Manzini, ACM WSDM 2010]

37 An XML excerpt (element values only):
Donald E. Knuth; The TeXbook; Addison-Wesley; 1986
Donald E. Knuth; Ronald W. Moore; An Analysis of Alpha-Beta Pruning; 293-326; 1975; 6; Artificial Intelligence...
IEEE FOCS 2005, WWW 2006, J. ACM 2009, US Patent 2012

38 A tree interpretation
XML document exploration corresponds to tree navigation
XML document search corresponds to labeled subpath searches
XBW transform

39 XBW Transform: some performance figures
[Plot: number of searches per second, on larger and larger datasets]
Xerces is better on smaller files and worse on larger files
Xerces uses 10x space

40 Where we are nowadays
60: Trie
70-80: Suffix Tree
90: 2-level indexing
1995: String B-tree
1999: Cache-oblivious data structures
Compressed data structures
Something is known... yet very preliminary
Lower bounds derived from geometry: text search = 2D range search

41 New food for research...
Solid-state disks: no mechanical parts... very fast reads, but slow writes & wear leveling (40Gb, about 100$) [E. Gal, S. Toledo, ACM Comp. Surv., 2005] [Ajwani et al, WEA 2009]
Self-adjusting or weighted design: the time of the operations depends on some (un/known) distribution
Challenge: no pointers, self-adjust (perf) vs compression (space) [Ferragina et al, ESA 2011]

42 The energy challenge IEEE Computer, 2007

43 Browsing a web site
JavaScript framework: Prototype | Dojo | jQuery
Chrome (best choice): 1.5%
FireFox: 2.5% | 4.8% | 4.3%
IE: 10.2% | 8.5% | 11%
The most used!

44

45 Yet today, it is a problem...
Apple is still working on the battery life problem: "The recent iOS software update addressed many of the battery issues that some customers experienced on their iOS 5 devices. We continue to investigate a few remaining issues." (Nov 2011, wired.com)
Windows 8's power hygiene: the scheduler will ignore the unused software (Feb 2012, MSDN)

46 Energy-aware algo+ds?
Locality pays off
Memory-level impacts
I/Os and compression are obviously important, BUT here there is a new twist

47 MIPS per Watt? Battery life!!
Who cares whether your application:
1. is y% slower than optimal, but it is more energy efficient?
2. takes x% more space than optimal, but it is more energy efficient?
Idea: multi-objective optimization in data-structure design
Approach it in a principled way

48 A preliminary step
Took inspiration from BigTable (Google),...
Design a compressed storage scheme that can trade, in a principled way, space vs decompression time [vs energy efficiency]
Requirements: gzip-like compression [like Google's Snappy, or LZ4]
Goal: fix the space occupancy, then find the best compression that achieves that space and minimizes the decompression time (or vice versa)
Example: [abrac] adabra -> [abrac] (a) (d) (abra), i.e. copy back, new char, copy back
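The "copy back" and "new char" phrases in the example above can be made concrete with a small gzip-like (LZ77-style) decoder. This is a sketch under assumptions: the phrase format (`("char", c)` and `("copy", distance, length)`) and the name `lz_decode` are hypothetical and do not reflect the encoding used by the authors, Snappy or LZ4.

```python
def lz_decode(phrases):
    """Decode a sequence of ("char", c) and ("copy", distance, length) phrases."""
    out = []
    for phrase in phrases:
        if phrase[0] == "char":
            out.append(phrase[1])               # new char: emit it as is
        else:
            _, distance, length = phrase
            start = len(out) - distance         # copy back from already decoded text
            for i in range(length):             # char by char, so overlaps work
                out.append(out[start + i])
    return "".join(out)


# One possible parsing of "abracadabra", mirroring the slide:
# [abrac] is already available, then (a) copy back, (d) new char, (abra) copy back.
EXAMPLE_PARSING = [
    ("char", "a"), ("char", "b"), ("char", "r"), ("char", "a"), ("char", "c"),
    ("copy", 2, 1),     # "a", copied from 2 positions back
    ("char", "d"),
    ("copy", 7, 4),     # "abra", copied from 7 positions back
]
```

The same text admits many parsings: shorter copy distances tend to be cheaper to encode and to decode, while longer phrases reduce the phrase count, which is where the space versus decompression-time trade-off comes from.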

49 A preliminary step...
Modeled as a Constrained Shortest Path problem:
  Nodes = one per char of the text to be compressed
  Edges = single char or copy-back substrings
  2 edge weights = decompression time (t) and compressed space (c)
  LZ-parsing = path from 1 to 12
NP-hard in general. This special case is POLY: O(n^3), but n is huge and m might be n^2
We solved it heuristically (Lagrangian Dual) and provably (Path Swap)
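A minimal sketch of the Lagrangian idea on a toy parsing graph, assuming nodes are numbered 0..n and every edge goes forward. It minimizes time + lambda * space for a bisected lambda and keeps the fastest path that fits the space budget. This is a generic heuristic written for illustration, not the authors' algorithm; it may miss the true optimum, and the function names and edge weights are hypothetical.

```python
def relaxed_shortest_path(n, edges, lam):
    """Return (time, space) of a path 0 -> n minimizing time + lam * space.

    edges are (u, v, time, space) tuples with u < v, so sorting by source
    processes the nodes in topological order.
    """
    inf = float("inf")
    cost = [inf] * (n + 1)
    totals = [None] * (n + 1)
    cost[0], totals[0] = 0.0, (0, 0)
    for u, v, t, c in sorted(edges):
        if cost[u] == inf:
            continue
        candidate = cost[u] + t + lam * c
        if candidate < cost[v]:
            cost[v] = candidate
            totals[v] = (totals[u][0] + t, totals[u][1] + c)
    return totals[n]


def fastest_within_budget(n, edges, budget, rounds=40):
    """Heuristically find a fast path whose total space is at most budget."""
    t, c = relaxed_shortest_path(n, edges, 0.0)
    if c <= budget:
        return (t, c)                   # the fastest path already fits
    lo, hi = 0.0, 1.0
    while relaxed_shortest_path(n, edges, hi)[1] > budget:
        hi *= 2                         # penalize space more until feasible
        if hi > 1e9:
            return None                 # no feasible path found
    best = relaxed_shortest_path(n, edges, hi)
    for _ in range(rounds):             # bisect on the multiplier
        mid = (lo + hi) / 2
        t, c = relaxed_shortest_path(n, edges, mid)
        if c <= budget:
            hi = mid
            if t < best[0]:
                best = (t, c)
        else:
            lo = mid
    return best


# Toy graph over a 4-char text (hypothetical weights):
# single-char edges are fast but large, copy edges are slower but smaller.
TOY_EDGES = [
    (0, 1, 1, 8), (1, 2, 1, 8), (2, 3, 1, 8), (3, 4, 1, 8),
    (0, 2, 3, 10), (2, 4, 3, 10),
    (0, 4, 10, 12),
]
```

On the toy graph, a budget of 24 selects the two medium copies (time 6, space 20), a budget of 12 forces the single long copy (time 10, space 12), and a budget of 40 allows the all-literal path (time 4, space 32).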

50 A preliminary step...

51 We mainly commented:
String Matching: RAM model, char comparisons and time
1990s: Data Bases: hierarchical memories and I/Os
2000s: Data Compression: space reduction in indexes and entropy space-bounds
Graph Theory: space reduction in compressors
Optimization: multi-objective design and joules
2010s: Computational Geometry: lower bounds on indexes; new upper bounds on I/Os, entropy
Nowadays…

52 A quote to conclude
"The distance between theory and practice is closer in theory than in practice" [Y. Matias, Google]
Big steps come from theory... but do NOT forget practice ;-)

53 That's all!

