PennySort Award Ceremony, Beijing, China, 23 October 2006

1 PennySort Award Ceremony, Beijing, China, 23 October 2006

2 Outline
Penny Sort history and award
What I have been doing

3 Benchmark History
[Timeline figure, 1970 to 2010]
Timeline labels: Wisconsin (Bitton, Boral, DeWitt, Turbyfill); IBM TP 1-7 (CA and Tony Lukes); Debit Credit (Gray, Datamation, Anon et al); TPC-A; MCC (Boral &...); TPC-B; TPC-C; TPC-W; ?; Teradata (Bollinger &...); TPC-D; Sort; PennySort; MinuteSort; TPC-H

4 A Short History of Sort
April Fools 1985: Datamation Sort
 –Sort 1M 100-byte records
 –An IO benchmark: 15-min to 1 hr!
1993: {Minute | Penny} x {Daytona | Indy}
1998: TeraByte Sort
Web site: http://research.Microsoft.com/barc/SortBenchmark/

5 Ground Rules
How much can you sort for a penny (or in a minute)?
 –Hardware cost
 –Depreciated over 3 years
 –1M$ system gets about 1 second,
 –1K$ system gets about 1,000 seconds.
 –Time (seconds) = 946,080 / SystemPrice ($)
Input and output are disk resident
Input is
 –100-byte records (random data)
 –key is first 10 bytes.
Must create output file and fill with sorted version of input file.
Daytona (product) and Indy (special) categories
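The penny budget follows from the 3-year depreciation rule: 3 years is 94,608,000 seconds, so one penny (0.01$) buys 946,080 / price seconds. A minimal sketch of that arithmetic, assuming 365-day years; the function name is illustrative and not part of the benchmark rules:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600             # 31,536,000
DEPRECIATION_SECONDS = 3 * SECONDS_PER_YEAR    # 94,608,000 seconds in 3 years


def penny_budget_seconds(system_price_dollars: float) -> float:
    """Seconds of machine time that one penny buys on a system of this price."""
    if system_price_dollars <= 0:
        raise ValueError("system price must be positive")
    return 0.01 * DEPRECIATION_SECONDS / system_price_dollars


# The two reference points from the slide, plus the 1,469$ GpuTeraSort system.
for price in (1_000_000, 1_000, 1_469):
    print(f"{price:>9,}$ system: {penny_budget_seconds(price):8.1f} s per penny")
```

For the 1,469$ system this gives about 644 seconds, which matches the elapsed time reported for GpuTeraSort.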

6 1998 PennySort
Hardware
 –266 MHz Intel PPro
 –64 MB SDRAM (10ns)
 –Dual Fujitsu DMA 3.2GB EIDE disks
Software
 –NT workstation 4.3
 –NT 5 sort
Performance
 –sort 15 M 100-byte records (~1.5 GB)
 –Disk to disk
 –elapsed time 820 sec, cpu time = 404 sec

7 2004 Daytona Terabyte Sort
NEC Express/5800/1320Xd
32x Itanium2 1.5 GHz, 128GB
900 disk TPC-C machine
Striped across 20 HBA
 –Read and write at 3.5 GBps
 –Sort 34GB in 60 seconds.
 –Sort 1 TB in 33 minutes
[Figure: Input Phase of 1 TB nSort]

8 1999 Sort Records / 2006 Sort Records
Categories: Daytona and Indy
Penny
 –590 M records (55 GB) in 644 seconds. GpuTeraSort. 1,469$ system: 3 GHz Pentium IV, 2 GB RAM, 7800GT Nvidia graphics card, 9x80GB SATA disks (4 data and 5 “runs”), WindowsXP. Naga Govindaraju, Ritesh Kumar, Dinesh Manocha, Jim Gray; U. North Carolina at Chapel Hill, USA
 –344 million records (32 GB) in 1,679 seconds. Bytes-Split-Index Sort (BSIS). $760 system: 1.8 GHz AMD, 1 GB RAM, 4x80GB SATA disks, WindowsXP. Xing Huang and BinHeng Song, School of Software, Tsinghua U., Beijing, China; Bo Huang, Math&CS, Hunan U. of Technology, Zhuzhou, China
Minute
 –40 GB (400 million records). NeoSort (pdf, MSword). Windows, Fujitsu 32 Itanium2, 128 SAN disks. Chris Nyberg, Charles Koester, Ordinal Technology
 –(2005) 116GB (125 M records) in 58.7 seconds. SCS (pdf). Linux, 80 Itanium2, 2,520 SAN disks. Jim Wyllie, IBM Almaden Research
TeraByte
 –(2004) 33 minutes. Nsort (pdf, word, htm). Windows, 32 Itanium2, 2,350 SAN disks. Chris Nyberg, Charles Koester, Ordinal Technology
 –(2005) 435 seconds (7.25 minutes). SCS (pdf). Linux, 80 Itanium2, 2,520 SAN disks. Jim Wyllie, IBM Almaden Research

9 Bytes Split Index Sort (BSIS)
Xing Huang & BinHeng Song, Tsinghua; Bo Huang, Hunan U. of Technology
A radix-partition sort. Then merge the partitions.
344 million records (32 GB) in 1,679 seconds
$760 system: 1.8 GHz AMD, 1 GB RAM, 4x80GB SATA disks, WindowsXP
Phase 1: 66 MB/s, Phase 2: 28 MB/s
See http://research.microsoft.com/barc/SortBenchmark/BSIS-PennySort_2006.pdf
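The slide describes BSIS only as a radix partition followed by a merge. The toy below illustrates that general idea in memory and is not the BSIS implementation: it assumes partitioning on the first key byte only, and because such partitions are already in key order relative to each other, the "merge" reduces to concatenation. The function name and constants are illustrative.

```python
import os

KEY_LEN = 10       # benchmark rule: the key is the first 10 bytes
RECORD_LEN = 100   # benchmark rule: 100-byte records


def radix_partition_sort(records):
    """Partition on the leading key byte, sort each partition, then join them."""
    partitions = [[] for _ in range(256)]
    for rec in records:
        # Phase 1: distribute records by the first byte of the key.
        partitions[rec[0]].append(rec)
    out = []
    for part in partitions:
        # Phase 2: sort within a partition; partitions are visited in key
        # order, so appending them one after another yields sorted output.
        part.sort(key=lambda r: r[:KEY_LEN])
        out.extend(part)
    return out


records = [os.urandom(RECORD_LEN) for _ in range(1000)]
result = radix_partition_sort(records)
print(len(result), "records sorted")
```

A disk-based version would write each partition to its own run file in phase 1 and read the runs back in phase 2, which is where the two per-phase throughput figures on the slide would apply.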

10 Sort 100 byte records (minute / penny) Shows We Hit Memory Ceiling in 1995
http://research.microsoft.com/barc/SortBenchmark/
Sort recs/s/cpu plateaued in 1995

11 Technology Trends: CPU and GPU
[Chart: Log of Relative Processing Power, 2002 to 2008, for CPU and GPU]
Chart labels: 2.2 GHz, 4.4 GHz, 31 GHz, 0.8 GHz, 1.6 GHz, 11.2, 4.2; Corporate DT SW Requirements; Moore’s Law Trajectory; CPU: Value, Leading Edge, Mobile, Mainstream Desktop, DT ‘Replacement’, Enthusiast / Specialty; Cooling (Cost) Limitations; GPU: Moore’s Law 3 for 18 mo, then Moore’s Law trajectory; Graphics Req’mts (enhanced experience); Leading Edge, Value / UMA

12 Moore’s Wall: Chip Heat Death
Processor power density going to infinity.
Solution: stabilize clock at ~5GHz
Multi-core (aka MTA) (1,000 core?)

13 GPU TeraSort
Naga Govindaraju, Ritesh Kumar, Dinesh Manocha, U. North Carolina at Chapel Hill
Use GPU for Phase 1 bitonic sort
590 M records (55 GB) in 644 seconds
1,469$ system: 3 GHz Pentium IV, 2 GB RAM, 7800GT Nvidia graphics card, 9x80GB SATA disks (4 data and 5 “runs”), WindowsXP
Phase 1: 185 MB/s, Phase 2: 150 MB/s
See http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2005-183
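For readers unfamiliar with bitonic sort, here is a minimal CPU-side sketch of the sorting network, assuming a power-of-two input length. It is not the GpuTeraSort code. The point it illustrates is that every compare-exchange in the inner `for` loop is independent of the others in the same step, which is the property that lets a GPU run them in parallel.

```python
def bitonic_sort(values):
    """Return a sorted copy of `values`; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:                  # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:              # compare-exchange distance for this step
            for i in range(n):     # each iteration is independent of the others
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([7, 3, 6, 0, 5, 1, 4, 2]))
```

The network performs the same sequence of comparisons regardless of the data, at the cost of O(n log² n) comparisons in total.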

14 Sort 100 byte records (minute / penny) Shows We Hit Memory Ceiling in 1995
http://research.microsoft.com/barc/SortBenchmark/
Sort recs/s/cpu plateaued in 1995
Had to get GPU to get better memory bandwidth
SIGMOD 2006 GpuTeraSort: GPU has a better memory architecture, so finally more records/second

15 2006 PennySort Price Breakdown
[Pie charts: BSIS ($760) and GpuTeraSort ($1470)]
Motherboard 16%; CPU 12%; GPU 18%; RAM 10%; Disk controller 6%; Disks 33%; Case, power, fan 3%; Assembly 2%

16 Sort Performance/Price improved
Based on parallelism and “commodity”, not per-cpu performance.

17 Musings: PennySort=TBsort
2 pass so 3TB of disk
= 8 disks if 400GB/disk
= 0.5GBps (if each disk = 65 MBps)
So, 6000 seconds (3TB/0.5GBps)
So, node can cost 200$
Costs 10x that today
maybe in 5 years?
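The chain of estimates on this slide can be replayed as a short calculation. This is a sketch under the slide's own assumptions (8 disks, 65 MB/s per disk, 3 TB of disk traffic) plus the 946,080 / price penny budget; the function and parameter names are illustrative.

```python
PENNY_SECONDS_TIMES_PRICE = 946_080   # 0.01$ of a 3-year depreciation, in s*$


def tb_pennysort_node(disks=8, disk_mb_s=65.0, io_tb=3.0):
    """Return (aggregate MB/s, elapsed seconds, affordable node price in $)."""
    aggregate_mb_s = disks * disk_mb_s              # 8 x 65 = 520 MB/s
    elapsed_s = io_tb * 1_000_000 / aggregate_mb_s  # TB -> MB, then divide
    affordable_price = PENNY_SECONDS_TIMES_PRICE / elapsed_s
    return aggregate_mb_s, elapsed_s, affordable_price


rate, seconds, price = tb_pennysort_node()
print(f"{rate:.0f} MB/s aggregate, {seconds:.0f} s elapsed, about {price:.0f}$ per node")
```

Without the slide's rounding to 0.5 GBps the elapsed time comes out a little under 6,000 seconds and the affordable node price somewhat below the slide's round 200$ figure.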

18 Musings: MinuteSort=TBsort
Sorts 1TB in 1 Minute
1 pass so 1TB of ram
266Gbps bisection bandwidth
1 pass so 2TB of IO in 60 sec => 600 disks
=> ~80 nodes: 8 disks 2GB ram
=> interconnect with 10Gbps Ethernet
or 300 nodes at 1Gbps Ethernet.
doable today
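The sizing above can be sanity-checked with a few lines of arithmetic. The sketch assumes the same 65 MB/s per disk used on the previous slide, which this slide does not restate; the names are illustrative.

```python
import math


def minutesort_sizing(data_tb=1.0, seconds=60.0, disk_mb_s=65.0, disks_per_node=8):
    """Return (disk I/O in MB/s, disks, nodes, one-way network Gbps)."""
    io_mb_s = 2 * data_tb * 1_000_000 / seconds     # read 1 TB and write 1 TB
    disks = math.ceil(io_mb_s / disk_mb_s)
    nodes = math.ceil(disks / disks_per_node)
    one_way_gbps = data_tb * 8_000 / seconds        # move every byte once
    return io_mb_s, disks, nodes, one_way_gbps


io_mb_s, disks, nodes, gbps = minutesort_sizing()
print(f"{io_mb_s:,.0f} MB/s of disk I/O -> {disks} disks -> {nodes} nodes")
print(f"{gbps:.0f} Gbps to move 1 TB across the network once in 60 s")
```

Under these assumptions the minimum is a bit over 500 disks, so the slide's 600 disks and ~80 nodes leave some headroom, and the 266 Gbps bisection figure is twice the one-way rate computed here.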

19 What I Have Been Doing
Traveling & Talking
Helping Build the SkyServer and the Virtual Observatory
Doing spatial geometry in SQL (no kidding)!
Trying to get all science literature and data online and interlinked.
and…
 –to blob or not to blob
 –disk reliability

20 To Blob or Not To Blob
For objects X smaller than 1MB
 Select X into x from T where key = 123
is faster than
 h = open(X); read(h,x,n); close(h)
So, blob beats file for objects < 1MB (on SQL Server – what about other DBs?)
Because DB is CISC and FS is RISC
Most things are less than 1MB
DB should work to make this 10MB
File system should borrow ideas from DB.
“To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?” Rusty Sears, Catharine Van Ingen, Jim Gray, MSR-TR-2006-45, April 2006
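The two access paths being compared look roughly like this. The sketch uses SQLite from Python's standard library as a stand-in, not SQL Server, so it only shows the shape of each path and does not reproduce the measurement; the table `T` and column names follow the slide, while the 256 KB object size and the file names are arbitrary choices.

```python
import os
import sqlite3
import tempfile

payload = os.urandom(256 * 1024)   # one object, below the slide's 1 MB threshold

with tempfile.TemporaryDirectory() as tmp:
    # Database path: the object is stored in a BLOB column and read by key.
    db = sqlite3.connect(os.path.join(tmp, "objects.db"))
    db.execute("CREATE TABLE T (key INTEGER PRIMARY KEY, X BLOB)")
    db.execute("INSERT INTO T (key, X) VALUES (?, ?)", (123, payload))
    db.commit()
    (from_db,) = db.execute("SELECT X FROM T WHERE key = ?", (123,)).fetchone()
    db.close()

    # File path: the object is stored in its own file: open, read, close.
    path = os.path.join(tmp, "X.bin")
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        from_file = f.read()

print(len(from_db), len(from_file))
```

Timing each path over many objects of varying size would be the natural next step for answering the slide's "what about other DBs?" question.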

21 How Often do Disks Fail?
Observed failure rates.
System            | Source  | Type        | Part Years | Fails | Fails/Year
TerraServer SAN   | Barclay | SCSI 10krpm |        858 |    24 |  2.8%
                  |         | controllers |         72 |     2 |  2.8%
                  |         | san switch  |          9 |     1 | 11.1%
TerraServer Brick | Barclay | SATA 7krpm  |        138 |    10 |  7.2%
Web Property 1    | anon    | SCSI 10krpm |     15,805 |   972 |  6.0%
                  |         | controllers |        900 |   139 | 15.4%
Web Property 2    | anon    | PATA 7krpm  |     22,400 |   740 |  3.3%
                  |         | motherboard |      3,769 |    66 |  1.7%

22 What About Bit Error Rates
Uncorrectable Errors on Read (UERs)
 –Quoted uncorrectable bit error rates 10^-13 to 10^-15
 –That’s 1 error in 1TB to 1 error in 100TB
 –WOW!!!
We moved 1.5 PB looking for errors
Saw 5 UER events
 –3 real, 3 of them were masked by retry
Many controller fails and system security reboots
Conclusion:
 –UER not a useful metric – want mean time to data loss
 –UER better than advertised.
Empirical Measurements of Disk Failure Rates and Error Rates, Jim Gray, Catharine van Ingen, Microsoft Technical Report MSR-TR-2005-166
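The conversion from a quoted bit error rate to "one error per N terabytes", and the number of errors those rates would predict for 1.5 PB, can be checked as follows. This is a sketch assuming decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes); the function names are illustrative.

```python
def terabytes_per_error(bit_error_rate: float) -> float:
    """Average data read, in TB, per uncorrectable bit error."""
    bits_per_error = 1.0 / bit_error_rate
    return bits_per_error / 8 / 1e12


def expected_errors(bytes_moved: float, bit_error_rate: float) -> float:
    """Errors predicted by the quoted rate for this much data."""
    return bytes_moved * 8 * bit_error_rate


moved = 1.5e15   # 1.5 PB, as on the slide
for rate in (1e-13, 1e-15):
    print(f"rate {rate:.0e}: one error per {terabytes_per_error(rate):.2f} TB, "
          f"about {expected_errors(moved, rate):.0f} errors expected in 1.5 PB")
```

The quoted rates predict between roughly a dozen and more than a thousand errors for 1.5 PB, so observing 5 events is consistent with the conclusion that UER is better than advertised.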

23 So, You Want to Copy a Petabyte?
Today, that’s 4,000 disks (read 2k write 2k)
Takes ~4 hours if they run in parallel, but…
Probably not one file.
You will see a few UERs.
What’s the best strategy?
How fast can you move a Petabyte from CERN to Pasadena?
Is sneaker-net fastest and cheapest?

24 UER things I wish I knew
Better statistics from larger farms, and more diversity.
What is the UER on a LAN, WAN?
What is the UER over time:
 –for a file on disk
 –for a disk
What’s the best replication strategy?
 –Symmetric (1+1)+(1+1) or triplex (1+1) + 1

