Presentation is loading. Please wait.

Presentation is loading. Please wait.

KOCSEA’09, Las Vegas, December 2009 -1- COMPUTER SCIENCE DEPARTMENT Flash Memory Database Systems and In-Page Logging Bongki Moon Department of Computer.

Similar presentations


Presentation on theme: "KOCSEA’09, Las Vegas, December 2009 -1- COMPUTER SCIENCE DEPARTMENT Flash Memory Database Systems and In-Page Logging Bongki Moon Department of Computer."— Presentation transcript:

1 KOCSEA’09, Las Vegas, December 2009 -1- COMPUTER SCIENCE DEPARTMENT Flash Memory Database Systems and In-Page Logging Bongki Moon Department of Computer Science University of Arizona University of Arizona Tucson, AZ 85721, U.S.A. bkmoon@cs.arizona.edu In collaboration with Sang-Won Lee (SKKU), Chanik Park (Samsung) Flash Talk

2 KOCSEA’09, Las Vegas, December 2009 -2- COMPUTER SCIENCE DEPARTMENT Magnetic Disk vs Flash SSD Samsung FlashSSD 128 GB 2.5/1.8 inch Seagate ST340016A 40GB,7200rpm Champion for 50 years New challengers! Intel X25-M Flash SSD 80GB 2.5 inch

3 KOCSEA’09, Las Vegas, December 2009 -3- COMPUTER SCIENCE DEPARTMENT Past Trend of Disk From 1983 to 2003 [Patterson, CACM 47(10) 2004]   Capacity increased about 2500 times (0.03 GB  73.4 GB)   Bandwidth improved 143.3 times (0.6 MB/s  86 MB/s)   Latency improved 8.5 times (48.3 ms  5.7 ms) Year19831990199419982003 Product CDC 94145-36 Seagate ST41600 Seagate ST15150 Seagate ST39102 Seagate ST373453 Capacity 0.03 GB 1.4 GB 4.3 GB 9.1 GB 73.4 GB RPM3600540072001000015000 Bandwidth (MB/sec) 0.6492486 Media diameter 5.255.253.53.02.5 Latency (msec) 48.317.112.78.85.7

4 KOCSEA’09, Las Vegas, December 2009 -4- COMPUTER SCIENCE DEPARTMENT I/O Crisis in OLTP Systems I/O becomes bottleneck in OLTP systemsI/O becomes bottleneck in OLTP systems  Process a large number of small random I/O operations Common practice to close the gapCommon practice to close the gap  Use a large disk farm to exploit I/O parallelism Tens or hundreds of disk drives per processor coreTens or hundreds of disk drives per processor core (E.g.) IBM Power 596 Server : 172 15k-RPM disks per processor core(E.g.) IBM Power 596 Server : 172 15k-RPM disks per processor core  Adopt short-stroking to reduce disk latency Use only the outer tracks of disk plattersUse only the outer tracks of disk platters  Other concerns are raised too Wasted capacity of disk drivesWasted capacity of disk drives Increased amount of energy consumptionIncreased amount of energy consumption Then, what happens 18 months later?   To catch up with Moore’s law and balance CPU and I/O (Amdhal’s law), the number of spindles should be doubled again?

5 KOCSEA’09, Las Vegas, December 2009 -5- COMPUTER SCIENCE DEPARTMENT Flash News in the Market Sun Oracle Exadata Storage Server [Sep 2009]   Each Exadata cell comes with 384 GB flash cache MySpace dumped disk drives [Oct 2009]   Went all flash, saving power by 99% Google Chrome ditched disk drives [Nov 2009]   SSD is the key to 7-sec boot time Gordon at UCSD/SDSC [Nov 2009]   64 TB RAM, 256 TB Flash, 4 PB Disks IBM hooked up with Fusion-io [Dec 2009]   SSD storage appliance for System X server line

6 KOCSEA’09, Las Vegas, December 2009 -6- COMPUTER SCIENCE DEPARTMENT Characteristics of NAND Flash No mechanical latencyNo mechanical latency  Flash memory is an electronic device without moving parts  Provides uniform random access speed without seek/rotational latency Very low latency, independently of physical location of dataVery low latency, independently of physical location of data Asymmetric read & write speedAsymmetric read & write speed  Read speed is typically at least twice faster than write speed (E.g.) Samsung 16 Gbits SLC NAND chips: 80  sec vs 200  sec (2 KB)(E.g.) Samsung 16 Gbits SLC NAND chips: 80  sec vs 200  sec (2 KB) No in-place updateNo in-place update  No data item or page can be updated in place before erasing it first. An erase unit (typically 128 KB) is much larger than a page (2 KB).An erase unit (typically 128 KB) is much larger than a page (2 KB). (E.g.) Samsung 16 Gbits SLC NAND chips: 1.5 msec (128 KB)(E.g.) Samsung 16 Gbits SLC NAND chips: 1.5 msec (128 KB)  Write (and erase) optimization is critical

7 KOCSEA’09, Las Vegas, December 2009 -7- COMPUTER SCIENCE DEPARTMENT Flash for Database, Really? Immediate benefit for some DB operationsImmediate benefit for some DB operations  Reduce commit-time delay by fast logging  Reduce read time for multi-versioned data  Reduce query processing time (sort, hash) What about the Big Fat Tables?What about the Big Fat Tables?  Random scattered I/O is very common in OLTP Slow random writes by flash SSD can handle this?Slow random writes by flash SSD can handle this?

8 KOCSEA’09, Las Vegas, December 2009 -8- COMPUTER SCIENCE DEPARTMENT Transactional Log SQL Queries System Buffer Cache Database Table space Temporary Table Space Transaction (Redo) Log Rollback Segments

9 KOCSEA’09, Las Vegas, December 2009 -9- COMPUTER SCIENCE DEPARTMENT Commit-time Delay by Logging Write Ahead Log (WAL)   A committing transaction force-writes its log records   Makes it hard to hide latency   With a separate disk for logging No seek delay, but … Half a revolution of spindle on average 4.2 msec (7200RPM), 2.0 msec (15k RPM)   With a Flash SSD: about 0.4 msec Commit-time delay remains to be a significant overhead  Group-commit helps but the delay doesn’t go away altogether. How much commit-time delay?  On average, 8.2 msec (HDD) vs 1.3 msec (SDD) : 6-fold reduction TPC-B benchmark with 20 concurrent users. SQL BufferLog Buffer DB LOG pi T1T1 T2T2 …TnTn

10 KOCSEA’09, Las Vegas, December 2009 -10- COMPUTER SCIENCE DEPARTMENT Rollback Segments SQL Queries System Buffer Cache Database Table space Temporary Table Space Transaction (Redo) Log Rollback Segments

11 KOCSEA’09, Las Vegas, December 2009 -11- COMPUTER SCIENCE DEPARTMENT MVCC Rollback Segments Multi-version Concurrency Control (MVCC)   Alternative to traditional Lock-based CC   Support read consistency and snapshot isolation   Oracle, PostgresSQL, Sybase, SQL Server 2005, MySQL Rollback Segments   Each transaction is assigned to a rollback segment   When an object is updated, its current value is recorded in the rollback segment sequentially (in append-only fashion)   To fetch the correct version of an object, check whether it has been updated by other transactions

12 KOCSEA’09, Las Vegas, December 2009 -12- COMPUTER SCIENCE DEPARTMENT MVCC Write Pattern Write requests from TPC-C workload  Concurrent transactions generate multiple streams of append-only traffic in parallel (apart by approximately 1 MB)  HDD moves disk arm very frequently  SSD has no negative effect from no in-place update limitation

13 KOCSEA’09, Las Vegas, December 2009 -13- COMPUTER SCIENCE DEPARTMENT MVCC Read Performance To support MV read consistency, I/O activities will increase   A long chain of old versions may have to be traversed for each access to a frequently updated object Read requests are scattered randomly   Old versions of an object may be stored in several rollback segments   With SSD, 10-fold read time reduction was not surprising 100B …C 50A 100 A (2) A: 100 - > 50 200 A (1) A: 200 - > 100 Rollback segment T1T1 T2T2 T0T0

14 KOCSEA’09, Las Vegas, December 2009 -14- COMPUTER SCIENCE DEPARTMENT Database Table Space SQL Queries System Buffer Cache Database Table space Temporary Table Space Transaction (Redo) Log Rollback Segments

15 KOCSEA’09, Las Vegas, December 2009 -15- COMPUTER SCIENCE DEPARTMENT Workload in Table Space TPC-C workload   Exhibit little locality and sequentiality Mix of small/medium/large read-write, read-only (join)   Highly skewed 84% (75%) of accesses to 20% of tuples (pages) Write caching not as effective as read caching   Physical read/write ratio is much lower that logical read/write ratio All bad news for flash memory SSD   Due to the No in place update and Asymmetric read/write speeds   In-Page Logging (IPL) approach [SIGMOD’07]

16 KOCSEA’09, Las Vegas, December 2009 -16- COMPUTER SCIENCE DEPARTMENT In-Page Logging (IPL) Key Ideas of the IPL ApproachKey Ideas of the IPL Approach  Changes written to log instead of updating them in place Avoid frequent write and erase operationsAvoid frequent write and erase operations  Log records are co-located with data pages No need to write them sequentially to a separate log regionNo need to write them sequentially to a separate log region Read current data more efficiently than sequential loggingRead current data more efficiently than sequential logging  DBMS buffer and storage managers work together

17 KOCSEA’09, Las Vegas, December 2009 -17- COMPUTER SCIENCE DEPARTMENT Design of the IPL Logging on Per-Page basis in both Memory and FlashLogging on Per-Page basis in both Memory and Flash  An In-memory log sector can be associated with a buffer frame in memory  Allocated on demand when a page becomes dirty  An In-flash log segment is allocated in each erase unit The log area is shared by all the data pages in an erase unit Flash Memory Database Buffer in-memory data page (8KB) update-in-place in-memory log sector (512B) log area (8KB): 16 sectors Erase unit: 128KB 15 data pages (8KB each) ….

18 KOCSEA’09, Las Vegas, December 2009 -18- COMPUTER SCIENCE DEPARTMENT IPL Write Buffer Pool Flash Memory Update / Insert / Delete Data Block Area update-in-place physiological log Page : 8KB Sector : 512B Block : 128KB Data pages in memoryData pages in memory  Updated in place, and  Physiological log records written to its in-memory log sector In-memory log sector is written to the in-flash log segment, whenIn-memory log sector is written to the in-flash log segment, when  Data page is evicted from the buffer pool, or  The log sector becomes full When a dirty page is evicted, the content is not written to flash memoryWhen a dirty page is evicted, the content is not written to flash memory  The previous version remains intact  Data pages and their log records are physically co-located in the same erase unit

19 KOCSEA’09, Las Vegas, December 2009 -19- COMPUTER SCIENCE DEPARTMENT IPL Read When a page is read from flash, the current version is computed on the flyWhen a page is read from flash, the current version is computed on the fly Buffer Pool Apply the “physiological action” to the copy read from Flash (CPU overhead) Flash Memory Read from Flash  Original copy of P i  All log records belonging to P i (IO overhead) Re-construct the current in-memory copy PiPi log area (8KB): 16 sectors data area (120KB): 15 pages

20 KOCSEA’09, Las Vegas, December 2009 -20- COMPUTER SCIENCE DEPARTMENT IPL Merge When all free log sectors in an erase unit are consumedWhen all free log sectors in an erase unit are consumed  Log records are applied to the corresponding data pages  The current data pages are copied into a new erase unit Consumes, erases, and releases only one erase unitConsumes, erases, and releases only one erase unit Physical Flash Block log area (8KB): 16 sectors B old B new clean log area 15 up-to-date data pages Merge Can be Erased

21 KOCSEA’09, Las Vegas, December 2009 -21- COMPUTER SCIENCE DEPARTMENT Industry Response Common in Enterprise Class SSDs   Multi-channel, inter-command parallelism Thruput than bandwidth, write-followed-by-read pattern   Command queuing (SATA-II NCQ)   Large RAM Buffer (with super-capacitor backup) Even up to 1 MB per GB Write-back caching, controller data (mapping, wear leveling)   Fat provisioning (up to ~20% of capacity) Impressive improvement Prototype/Product EC SSD X-25M 15k-RPM Disk Read (IOPS) 1050020000450 Write (IOPS) 25001200450

22 KOCSEA’09, Las Vegas, December 2009 -22- COMPUTER SCIENCE DEPARTMENT EC-SSD Architecture Parallel/interleaved operations   8 channels, 2 packages/channel, 4 chips/package   Two-plane page write, block erase, copy-back operations Host I/F (SATA-II) ARM9 ECC Flash Controller DRAM (128MB) Main Controller Flash package Flash package 8 channels NAND

23 KOCSEA’09, Las Vegas, December 2009 -23- COMPUTER SCIENCE DEPARTMENT Concluding Remarks Recent advances cope with random I/O betterRecent advances cope with random I/O better  Write IOPS 100x higher than early SSD prototype  TPS: 1.3~2x higher than RAID0-8HDDs for read-write TPC-C workload, with much less energy consumed Write still lags behind   IOPS Disk < IOPS SSD-Write << IOPS SSD-Read   IOPS SSD-Read / IOPS SSD-Write = 4 ~ 17 A lot more issues to investigate  Flash-aware buffer replacement, I/O scheduling, Energy  Fluctuation in performance, Tiered storage architecture  Virtualization, and much more …

24 KOCSEA’09, Las Vegas, December 2009 -24- COMPUTER SCIENCE DEPARTMENT Question? For more Flash Memory work of Ours   In-Page Logging [SIGMOD’07]   Logging, Sort/Hash, Rollback [SIGMOD’08]   SSD Architecture and TPC-C [SIGMOD’09]   In-Page Logging for Indexes [CIKM’09] Even More?   www.cs.arizona.edu/~bkmoon


Download ppt "KOCSEA’09, Las Vegas, December 2009 -1- COMPUTER SCIENCE DEPARTMENT Flash Memory Database Systems and In-Page Logging Bongki Moon Department of Computer."

Similar presentations


Ads by Google