Parallel Database Systems: Analyzing LOTS of Data. Jim Gray, Microsoft Research.


Slide 1: Parallel Database Systems: Analyzing LOTS of Data. Jim Gray, Microsoft Research.

Slide 2: Outline
• Why parallelism:
  – technology push
  – application pull
• Parallel database techniques:
  – partitioned data
  – partitioned and pipelined execution
  – parallel relational operators

Slide 3: Main Message
• Technology trends give many processors and storage units, inexpensively.
• To analyze large quantities of data:
  – sequential (regular) access patterns are 100x faster
  – parallelism is 1000x faster (trades time for money)
  – relational systems show many parallel algorithms

Slide 4: Databases Store ALL Data Types
• The old world: millions of objects, 100-byte objects (e.g. a People table with just Name and Address columns).
• The new world: billions of objects, big objects (1 MB), objects with behavior (methods) (e.g. a People table with Name, Address, Papers, Picture, and Voice columns).
• Drivers: the paperless office, the Library of Congress online, all information online (entertainment, publishing, business): the Information Network, Knowledge Navigator, "information at your fingertips".

Slide 5: Moore's Law
• Each of these doubles every 18 months (a 60% increase per year): microprocessor speeds, chip density, magnetic disk density, communications bandwidth.
• One-chip memory size grows from 2 MB to 32 MB; WAN bandwidth is approaching LAN bandwidth.
[Figure: memory chip capacity, 1 Kbit to 256 Mbit, 1970 to 2000]

Slide 6: Magnetic Storage Cheaper than Paper
• File cabinet: cabinet (4 drawer) $250, paper (24,000 sheets) $250, space (2x3 @ $10/ft^2) $180; total $700, about 3 cents/sheet.
• Disk: a 4 GB disk costs $500.
  – ASCII: 2 million pages, 0.025 cents/sheet (100x cheaper)
  – Image: 200 thousand pages, 0.25 cents/sheet (10x cheaper)
• Conclusion: store everything on disk.

Slide 7: Disc and Tape Storage Trends
• $/byte got 10^4 better; $/access got 10^3 better; capacity grew 10^3; latency down 10x; bandwidth up 10x.
[Figures: bytes per dollar, and accesses per unit time, for RAM, disk, and tape, 1970 to 2000]

Slide 8: What's a Terabyte? (250 K$ of disk @ $0.25/MB)
• 1 terabyte is:
  – 1,000,000,000 business letters (150 miles of bookshelf)
  – 100,000,000 book pages (15 miles of bookshelf)
  – 50,000,000 FAX images (7 miles of bookshelf)
  – 10,000,000 TV pictures, MPEG (10 days of video)
  – 4,000 LandSat images
• The Library of Congress (in ASCII) is about 25 TB.
• In 1980 a terabyte was 200 M$ of disc (10,000 discs) or 5 M$ of tape silo (10,000 tapes).
• In 1995 it is 250 K$ of magnetic disc (70 discs), 500 K$ of optical disc robot (250 platters), or 50 K$ of tape silo (50 tapes).
• A "Terror Byte"!

Slide 9: Today's Storage Hierarchy: Speed and Capacity vs Cost Tradeoffs
[Figures: size (bytes) vs access time, and price ($/MB) vs access time, for cache, main memory, secondary disc, online tape, nearline tape, and offline tape]

Slide 10: Storage Latency: How Far Away Is the Data?

Slide 11: Thesis: Many Little Will Win over Few Big
[Figure: mainframe (1 M$), mini (100 K$), micro (10 K$), nano, and shrinking disc form factors: 14", 9", 5.25", 3.5", 2.5", 1.8"]

Slide 12: Implications of Hardware Trends
• Large disc farms will be inexpensive (10 K$/TB).
• Large RAM databases will be inexpensive (1 K$/GB).
• Processors will be inexpensive.
• So the building block will be a processor with large RAM, lots of disc, and lots of network bandwidth: a CyberBrick™ (roughly 1K SPECint CPU, 2 GB RAM, 500 GB disc).

Slide 13: Implication of Hardware Trends: Clusters
• Future servers are CLUSTERS of processors and discs.
• Distributed database techniques make clusters work.
[Figure: cluster of nodes, each a CPU with 5 GB RAM and 50 GB disc]

Slide 14: Scaleables: Uneconomic So Far
• A slice: a processor, memory, and a few disks.
• The slice price of scaleables carries a 5x to 10x markup:
  – Teradata: 70 K$ for an Intel P5 + 32 MB + 4 disks
  – Tandem: 100 K$ for a MipsCo R4000 + 64 MB + 4 disks
  – Intel: 75 K$ for an i860 + 32 MB + 2 disks
  – IBM/SP2: 100 K$ for an R6000 + 64 MB + 8 disks
• A Compaq slice price is less than 10 K$.
• What is the problem? Proprietary interconnect, proprietary packaging, proprietary software (vendorIX).

Slide 15: Summary
• Technology trends point to pipeline and partition parallelism:
  – lots of bytes and bandwidth per dollar
  – lots of latency
  – lots of MIPS per dollar
  – lots of processors
• Putting it together: scaleable networks and platforms.
  – Build clusters of commodity processors and storage.
  – Commodity interconnect is key (the S of PMS); traditional interconnects give 100 K$/slice.
  – A commodity cluster operating system is key.
  – Fault isolation and tolerance are key.
  – Automatic parallel programming is key.

Slide 16: The Hardware Is in Place... and Then a Miracle Occurs
• SNAP: Scaleable Network And Platforms: a commodity distributed OS built on commodity platforms over a commodity network interconnect.
• The "miracle" (the ?) is what enables parallel applications.

Slide 17: The Software Challenge
• Automatic data placement (partitioning: random or organized)
• Automatic parallel programming (process placement)
• Parallel concepts, algorithms, and tools
• Parallel query optimization
• Execution techniques: load balancing, checkpoint/restart, pacing, multi-programming

Slide 18: Parallelism: Performance Is the Goal
• The goal is to get 'good' performance.
• Law 1: a parallel system should be faster than the serial system.
• Law 2: a parallel system should give near-linear scaleup or near-linear speedup (or both).

Slide 19: Kinds of Parallel Execution
• Pipeline parallelism: one sequential operator feeds its output directly into the next.
• Partition parallelism: inputs are split N ways, many copies of the same sequential program run on the pieces, and outputs are merged M ways.
[Figure: chains of "Any Sequential Program" boxes arranged as a pipeline and as partitioned copies]
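
Not from the deck: a minimal Python sketch of the two forms, under the simplifying assumption that an "operator" is a generator over rows. Pipeline parallelism chains sequential operators; partition parallelism runs the same plan over disjoint pieces and merges the outputs. The names (scan, filter_op, project, run_partitioned) are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Pipeline parallelism: each operator is a generator consuming the previous one's stream.
def scan(rows):
    for r in rows:
        yield r

def filter_op(stream, pred):
    for r in stream:
        if pred(r):
            yield r

def project(stream, cols):
    for r in stream:
        yield {c: r[c] for c in cols}

# Partition parallelism: split the input N ways, run the same plan on each piece, merge.
def run_partitioned(rows, plan, n_parts=4):
    parts = [rows[i::n_parts] for i in range(n_parts)]        # round-robin split
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = list(pool.map(lambda p: list(plan(p)), parts))  # same plan on every partition
    return [r for part in results for r in part]               # merge the M outputs

if __name__ == "__main__":
    rows = [{"id": i, "val": i * 10} for i in range(100)]
    plan = lambda p: project(filter_op(scan(p), lambda r: r["val"] > 500), ["id"])
    print(len(run_partitioned(rows, plan)))                    # 49 rows survive the filter
```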

Slide 20: The Perils of Parallelism
• Startup: creating processes, opening files, optimization.
• Interference: device contention (CPU, disc, bus) and logical contention (locks, hotspots, servers, log, ...).
• Skew: if tasks get very small, variance exceeds service time (the job finishes only when the slowest task does).

Slide 21: Why Parallel Access to Data?
• At 10 MB/s it takes about 1.2 days to scan a terabyte; with 1,000-way parallelism the scan takes about 1.5 minutes.
• Parallelism: divide a big problem into many smaller ones to be solved in parallel, so aggregate bandwidth scales with the number of devices.
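
A quick check of those figures (assuming the table being scanned is 1 TB, which is the scale the rest of the deck uses):

```python
TB = 1e12                      # bytes in a terabyte (decimal)
rate = 10e6                    # one disc streams at 10 MB/s
serial = TB / rate             # 100,000 seconds for a serial scan
print(serial / 86400)          # ~1.16 days, the "1.2 days" on the slide
parallel = serial / 1000       # 1,000 discs scanning their partitions in parallel
print(parallel / 60)           # ~1.7 minutes, roughly the slide's 1.5-minute scan
```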

Slide 22: Data Flow Programming: Prefetch and Postwrite Hide Latency
• Can't wait for the data to arrive (2,000 years!).
• Use memory that gets the data in advance (~100 MB/s).
• Solution: pipeline from storage (tape, disc, ...) to the CPU cache; pipeline results to the destination.

Slide 23: Parallelism: Speedup and Scaleup
• Speedup: same job (100 GB), more hardware, less time.
• Batch scaleup: bigger job (100 GB grows to 1 TB), more hardware, same time.
• Transaction scaleup: more clients and more servers (1K clients grows to 10K clients), same response time.
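
A hedged sketch of the usual definitions behind these pictures (the slide does not spell them out, and the timings in the example are made up):

```python
def speedup(t_on_small_system, t_on_big_system):
    # Same job on N times more hardware: ideal (linear) speedup is N.
    return t_on_small_system / t_on_big_system

def scaleup(t_small_job_small_system, t_big_job_big_system):
    # N times bigger job on N times more hardware: ideal (linear) scaleup is 1.0.
    return t_small_job_small_system / t_big_job_big_system

# Example: a 100 GB scan takes 1000 s on 1 node and 110 s on 10 nodes,
# while a 1 TB scan on those 10 nodes takes 1100 s.
print(speedup(1000, 110))    # ~9.1x: near-linear speedup
print(scaleup(1000, 1100))   # ~0.91: near-linear scaleup
```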

Slide 24: Benchmark Buyer's Guide
• Things to ask: When does it stop scaling? Ask for throughput numbers, not ratios.
• Standard benchmarks allow comparison to other systems and to a sequential baseline.
• Ratios and non-standard benchmarks are red flags.

Slide 25: Why Are Relational Operators So Successful for Parallelism?
• The relational data model gives uniform operators on uniform data streams, closed under composition.
• Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data; sequential data in and out: pure dataflow.
• Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation.
• The result: AUTOMATIC PARALLELISM.

Slide 26: Database Systems "Hide" Parallelism
• Automate system management via tools: data placement, data organization (indexing), periodic tasks (dump / recover / reorganize).
• Automatic fault tolerance: duplexing and failover, transactions.
• Automatic parallelism: among transactions (locking) and within a transaction (parallel execution).

Slide 27: Automatic Parallel Object-Relational DB
Example query with temporal, spatial, and image predicates:

  select image
  from landsat
  where date between 1970 and 1990
    and overlaps(location, :Rockies)
    and snow_cover(image) > .7;

• The landsat table has date, location, and image columns (rows from 1/2/72 to 4/8/95, locations such as 33N 120W to 34N 120W).
• Assign one process per processor/disk: each finds images with the right date and location, analyzes the image, and returns it if it is at least 70% snow.

Slide 28: Automatic Data Partitioning
• Split a SQL table across a subset of nodes and disks. Partition within the set by:
  – Range: good for equi-joins, range queries, group-by
  – Hash: good for equi-joins
  – Round robin: good for spreading load
• Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning. (A sketch of the three schemes follows.)
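
A minimal sketch (mine, not the deck's) of the three schemes as row-to-node placement functions; the node count and range split points are illustrative assumptions.

```python
import hashlib

N_NODES = 4
RANGE_SPLITS = ["G", "N", "S"]             # assumed split points: A-F, G-M, N-R, S-Z

def range_partition(key: str) -> int:
    # Good for equi-joins, range queries, and group-by on the partitioning key.
    for node, split in enumerate(RANGE_SPLITS):
        if key < split:
            return node
    return N_NODES - 1                     # last range catches everything else

def hash_partition(key: str) -> int:
    # Good for equi-joins; a stable hash so every node computes the same placement.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_NODES

_rr_counter = 0
def round_robin_partition(_key: str) -> int:
    # Ignores the key entirely; simply spreads load evenly across nodes.
    global _rr_counter
    node = _rr_counter % N_NODES
    _rr_counter += 1
    return node

for name in ["Adams", "Gray", "Nolan", "Stone"]:
    print(name, range_partition(name), hash_partition(name), round_robin_partition(name))
```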

Slide 29: Index Partitioning
• Hash indices partition by hash (e.g. buckets 0..9, 10..19, 20..29, 30..39, 40..).
• B-tree indices partition as a forest of trees, one tree per key range (e.g. A..C, D..F, G..M, N..R, S..Z).
• A primary index clusters the data with it.

Slide 30: Secondary Index Partitioning
• In shared-nothing systems, secondary indices are problematic.
• Partition by base-table key ranges:
  – Insert: completely local (but what about uniqueness?)
  – Lookup: examines ALL index trees
  – A unique index therefore involves a lookup on every insert.
• Partition by secondary key ranges (the Teradata solution):
  – Insert: two nodes (base and index)
  – Lookup: two nodes (index, then base)
  – Uniqueness is easy.

Slide 31: Data Rivers: Split + Merge Streams
• Producers add records to the river; consumers take records from the river. Each side is purely sequential programming.
• The river does flow control and buffering, and does the partitioning and merging of data records (N producers to M consumers: N x M data streams).
• River = Split/Merge in Gamma = the Exchange operator in Volcano. (A toy sketch follows.)
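
A toy sketch of the idea, not Gamma's or Volcano's actual code: producers and consumers stay sequential, while the "river" hashes each record to a consumer queue and gets crude flow control from bounded queues. The sizes and the end-of-stream protocol are my assumptions.

```python
import queue
import threading

N_PRODUCERS, M_CONSUMERS = 3, 2
DONE = object()                                       # end-of-stream marker

# The river: one bounded queue per consumer; bounding provides back-pressure.
river = [queue.Queue(maxsize=100) for _ in range(M_CONSUMERS)]

def producer(pid):
    for i in range(10):                               # purely sequential code
        record = (pid, i)
        river[hash(record[1]) % M_CONSUMERS].put(record)  # the river partitions the data
    for q in river:                                   # tell every consumer we are done
        q.put(DONE)

def consumer(cid, out):
    done_seen = 0
    while done_seen < N_PRODUCERS:                    # merge the N producer streams
        rec = river[cid].get()
        if rec is DONE:
            done_seen += 1
        else:
            out.append(rec)                           # purely sequential consumption

results = [[] for _ in range(M_CONSUMERS)]
threads = [threading.Thread(target=producer, args=(p,)) for p in range(N_PRODUCERS)]
threads += [threading.Thread(target=consumer, args=(c, results[c])) for c in range(M_CONSUMERS)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(len(r) for r in results))                   # all 30 records arrive, partitioned by key
```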

Slide 32: Partitioned Execution
• Spreads computation and I/O among processors.
• Partitioned data gives NATURAL parallelism.

Slide 33: N x M-Way Parallelism
• N inputs, M outputs, no bottlenecks.
• Partitioned data plus partitioned and pipelined data flows.

Slide 34: Picking Data Ranges
• Disk partitioning:
  – For range partitioning, sample the load on the disks; cool hot disks by making their ranges smaller.
  – For hash partitioning, cool hot disks by mapping some buckets to other disks.
• River partitioning:
  – Use hashing and assume a uniform distribution.
  – If range partitioning, sample the data and use a histogram to level the bulk (see the sketch below).
• Teradata, Tandem, and Oracle use these tricks.
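
A minimal sketch, mine rather than any vendor's, of the histogram trick: sample the key values and pick range boundaries at equi-depth quantiles, so each partition gets roughly the same number of rows even when the data is skewed. The sample size and skewed distribution are assumptions.

```python
import random

def pick_range_splits(sample, n_partitions):
    """Equi-depth splits: boundaries at the 1/n, 2/n, ... quantiles of a sample."""
    ordered = sorted(sample)
    step = len(ordered) // n_partitions
    return [ordered[i * step] for i in range(1, n_partitions)]

# Skewed keys: most values are small, a few are huge.
keys = [random.expovariate(1.0) for _ in range(100_000)]
splits = pick_range_splits(random.sample(keys, 1_000), n_partitions=4)

# Check the balance: count how many keys land in each range.
counts = [0] * 4
for k in keys:
    node = sum(k >= s for s in splits)     # index of the range the key falls in
    counts[node] += 1
print(splits)
print(counts)                              # roughly 25,000 keys per partition
```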

Slide 35: Blocking Operators = Short Pipelines
• An operator is blocking if it produces no output until it has consumed all its input.
• Examples: sort, aggregates, hash-join (which reads all of one operand).
• Blocking operators kill pipeline parallelism and make partition parallelism all the more important.
[Figure: a database-load template (scan, sort runs, merge runs, table insert, index inserts) has three blocked phases]

Slide 36: Simple Aggregates (sort or hash?)
• Simple aggregates (count, min, max, ...) can use indices: they are more compact and sometimes already carry aggregate information.
• GROUP BY aggregates:
  – Scan in category order if possible (use indices).
  – Else, if the categories fit in RAM, use an in-RAM category hash table (sketched below).
  – Else, sort a temporary file by category and do the math in the merge step.
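
A small sketch (my own, not the slide's) of the in-RAM hash-table path for a GROUP BY aggregate; the rows and column names are made up.

```python
from collections import defaultdict

def group_by_sum(rows, group_col, value_col):
    # One hash-table entry per category; assumes the distinct categories fit in RAM.
    totals = defaultdict(float)
    for row in rows:                       # a single sequential scan of the input
        totals[row[group_col]] += row[value_col]
    return dict(totals)

rows = [
    {"state": "CA", "sales": 10.0},
    {"state": "WA", "sales": 4.0},
    {"state": "CA", "sales": 6.5},
]
print(group_by_sum(rows, "state", "sales"))   # {'CA': 16.5, 'WA': 4.0}
```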

Slide 37: Parallel Aggregates
• For each aggregate function we need a decomposition strategy:
  – count(S) = Σ count(s(i)), and likewise for sum()
  – avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
  – and so on...
• For groups: sub-aggregate the groups close to the source, then drop the sub-aggregates into a hash river (combined at the consumers, as sketched below).
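
A minimal sketch of that decomposition: each partition computes local (sum, count) pairs per group, and a combine step adds them up, which is exactly what the slide's identities for count, sum, and avg require. Everything else here (row shapes, names) is illustrative.

```python
from collections import defaultdict

def local_subaggregate(rows, group_col, value_col):
    # Runs near the data source, once per partition: (sum, count) per group.
    acc = defaultdict(lambda: [0.0, 0])
    for row in rows:
        acc[row[group_col]][0] += row[value_col]
        acc[row[group_col]][1] += 1
    return acc

def combine(partials):
    # The consumer side of the hash river: add up sums and counts per group.
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for g, (s, c) in part.items():
            total[g][0] += s
            total[g][1] += c
    return {g: {"sum": s, "count": c, "avg": s / c} for g, (s, c) in total.items()}

p1 = local_subaggregate([{"g": "A", "v": 2.0}, {"g": "B", "v": 3.0}], "g", "v")
p2 = local_subaggregate([{"g": "A", "v": 4.0}], "g", "v")
print(combine([p1, p2]))   # A: sum 6.0, count 2, avg 3.0; B: sum 3.0, count 1, avg 3.0
```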

Slide 38: Parallel Sort
• M input streams (a scan or other source) feed a range- or hash-partitioned river; each of the N outputs runs sub-sorts that generate runs, then merges the runs.
• The disk and merge steps are not needed if the sort fits in memory.
• Scales "linearly" because sort is N log N work: log(10^12) / log(10^6) = 12/6 = 2, so a million-times-bigger sort on proportionally more hardware is only about 2x slower.
• Sort is the benchmark from hell for shared-nothing machines: network traffic equals disk bandwidth, and there is no data filtering at the source.
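
Not the deck's code: a toy range-partitioned sort in Python. Each "node" sorts its own key range independently, and the concatenation of the sorted ranges is globally sorted; the split points are assumed to come from sampling, as on slide 34, and the in-memory sort stands in for the runs-plus-merge a real system would do.

```python
import bisect

def parallel_sort(rows, splits, key=lambda x: x):
    """Range-partition the input, sort each range locally, concatenate in range order."""
    parts = [[] for _ in range(len(splits) + 1)]
    for r in rows:                                   # the "river": route each row to its range
        parts[bisect.bisect_right(splits, key(r))].append(r)
    for p in parts:                                  # each node sorts its range independently
        p.sort(key=key)
    return [r for p in parts for r in p]             # outputs are already in global order

data = [42, 7, 99, 3, 57, 18, 76, 64]
print(parallel_sort(data, splits=[25, 50, 75]))      # [3, 7, 18, 42, 57, 64, 76, 99]
```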

Slide 39: Hash Join: Combining Two Tables
• Hash the smaller table into N buckets (hope N = 1).
  – If N = 1: read the larger table and probe the in-memory hash of the smaller one.
  – Else: hash the outer table to disk as well, then do a bucket-by-bucket hash join.
• Purely sequential data behavior; always beats sort-merge and nested-loop joins unless the data is already clustered.
• Good for equi-, outer, and exclusion joins.
• Lots of papers; products are just appearing (what went wrong?).
• Hashing reduces skew. (An in-memory sketch follows.)
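
A minimal in-memory sketch of the N = 1 case: build a hash table on the smaller table, then stream the larger table past it. The table and column names are hypothetical.

```python
from collections import defaultdict

def hash_join(small, large, key):
    # Build phase: hash the smaller table into memory.
    buckets = defaultdict(list)
    for row in small:
        buckets[row[key]].append(row)
    # Probe phase: stream the larger table past the hash table (equi-join).
    for row in large:
        for match in buckets.get(row[key], []):
            yield {**match, **row}

depts = [{"dept": 1, "dname": "Tools"}, {"dept": 2, "dname": "Toys"}]
emps  = [{"dept": 1, "ename": "Ann"}, {"dept": 2, "ename": "Bob"}, {"dept": 1, "ename": "Cal"}]
print(list(hash_join(depts, emps, "dept")))   # three joined rows
```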

Slide 40: Parallel Hash Join
• ICL implemented hash join with bitmaps in the CAFS machine (1976)!
• Kitsuregawa pointed out the parallelism benefits of hash join in the early 1980s (it partitions beautifully). We ignored them! (Why?)
• But now everybody is doing it (or promises to do it).
• Hashing minimizes skew and requires little thinking for redistribution; hashing uses massive main memory.

Slide 41: Exotic Joins (Exclusion, Cartesian, Outer, ...)
• Exclusion join is used for NOT IN and DIFFERENCE queries.
• Outer join is a "lossless" join: a {left, right} row appears with a null sibling if no matching sibling is found.
• Cartesian product is often a mistake (a missing WHERE clause), but it is also important for the small-table / large-table optimization (also called a STAR schema).
  – Each small table is used as a pick list, representing a mapping (name -> code) or set membership (e.g. holidays).
  – Best plan: restrict the small tables, form the Cartesian product of all the small tables, and send it to each partition of the large table for a hash join (sketched below).
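
A toy sketch of that star plan; all table names, column names, and values are invented, and the hash join is done inline on a composite key for one partition of the large table.

```python
from itertools import product

# Restricted pick lists (already filtered by their local predicates).
stores   = [{"store": 7, "region": "West"}]
holidays = [{"day": "1995-12-25"}, {"day": "1995-07-04"}]

# Step 1: Cartesian product of the (small) restricted dimension tables.
dimension_rows = [{**a, **b} for a, b in product(stores, holidays)]

# Step 2: ship that product to every partition of the large table and hash-join it there.
fact_partition = [
    {"store": 7, "day": "1995-12-25", "sales": 900.0},
    {"store": 7, "day": "1995-01-02", "sales": 120.0},
]

def key(row):                      # composite join key on (store, day)
    return (row["store"], row["day"])

built = {key(d): d for d in dimension_rows}          # build on the small product
result = [{**built[key(f)], **f} for f in fact_partition if key(f) in built]
print(result)                      # only the holiday sale at store 7 survives
```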

Slide 42: What Systems Work This Way
• Shared nothing: Teradata (400 nodes, 80x12), Tandem (110 nodes), IBM SP2 / DB2 (128 nodes), Informix/SP2 (100 nodes), ATT & Sybase (8x14 nodes).
• Shared disk: Oracle (170 nodes), Rdb (24 nodes).
• Shared memory: Informix (9 nodes), RedBrick (? nodes).

Slide 43: Outline
• Why parallelism:
  – technology push
  – application pull
• Parallel database techniques:
  – partitioned data
  – partitioned and pipelined execution
  – parallel relational operators

Slide 44: Main Message
• Technology trends give many processors and storage units, inexpensively.
• To analyze large quantities of data:
  – sequential (regular) access patterns are 100x faster
  – parallelism is 1000x faster (trades time for money)
  – relational systems show many parallel algorithms

