DATABASE OPERATORS AND SOLID STATE DRIVES Geetali Tyagi (2012037) Mahima Malik (2012053) Shrey Gupta (2012098) Vedanshi Kataria (2012117)


1 DATABASE OPERATORS AND SOLID STATE DRIVES Geetali Tyagi (2012037) Mahima Malik (2012053) Shrey Gupta (2012098) Vedanshi Kataria (2012117)

2 Introduction ■ SSDs have rich internal parallelism – Chance to improve I/O bandwidth through parallel processing ■ Higher IOPS (Input/Output Operations Per Second) ■ Traditional query processing algorithms were designed mainly around the mechanical traits of HDDs – Simply replacing HDDs with SSDs yields very little benefit – Redesign algorithms such as scan and join to take full advantage of SSDs

3 SSD Internal Architecture ■ Multiple channels, each shared by a set of flash memory packages ■ Two levels of parallelism – Channel-level parallelism ■ Each channel can be operated independently and simultaneously – Package-level parallelism ■ Operations on flash memory packages attached to the same channel can also be interleaved

4 Some Terms ■ A domain is a set of flash memories that share a specific set of resources (e.g. channels) – Can be further partitioned into sub-domains ■ A chunk is a logical data page with a unique logical address. ■ Every domain has its own ScanBuffer to store the scan results for that domain.

5 ParaScan ■ Most SSDs adopt a RAID-0-like striped data storage mechanism – with consecutive Logical Block Addresses for the striped chunks ■ Distribute the table uniformly across domains to maximize the benefit of parallelism. ■ Example: Assuming 20 domains, the chunk whose logical address is 20*n (n=0,1,2...) is in the 1st domain, the chunk whose logical address is 20*n+1 (n=0,1,2...) is in the 2nd domain, and so on. ■ In the best case, ParaScan is twice as fast as a traditional table scan on SSD and 4 times as fast as a traditional table scan on HDD.
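The striping example above can be sketched in a few lines of Python. This is only an illustration of the round-robin mapping the slide describes; the function names and the 0-based domain index (so the slide's "1st domain" is index 0) are assumptions made for the sketch, and the real layout depends on the SSD.

```python
NUM_DOMAINS = 20  # taken from the slide's example; real devices differ


def domain_of_chunk(chunk_addr, num_domains=NUM_DOMAINS):
    """Return the 0-based domain index of a chunk under round-robin striping."""
    return chunk_addr % num_domains


def chunks_in_domain(domain, total_chunks, num_domains=NUM_DOMAINS):
    """List the logical chunk addresses that land in one domain."""
    return list(range(domain, total_chunks, num_domains))


if __name__ == "__main__":
    print(domain_of_chunk(40))       # chunk 20*2 falls in the 1st domain (index 0)
    print(chunks_in_domain(1, 100))  # chunks 20*n+1 fall in the 2nd domain (index 1)
```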

6 [Diagram: Chunks 0 to 3 distributed across Domain 0 and Domain 1]

7 ParaScan ■ Domain scan – Read data chunks one by one from a single domain and then put them into its own ScanBuffer, allowing multiple domain scans to be executed in parallel without interference ■ Multi-domain parallel scan – Multiple threads ■ Each in charge of one or more individual domain scans – Entire scan buffer is also divided into several ScanBuffers so that each scan thread can use one ScanBuffer ■ Performance of multi-domain parallel scan depends on the concurrency level – the maximal number of physical threads supported by the processor – the maximal queue depth supported by the SSD

8 Multi-Domain Parallel Scan [Diagram: the SSD holds Domain #0 (Chunk #0, Chunk #20, ..., Chunk #200), Domain #1 (Chunk #1, Chunk #21, ..., Chunk #201), ..., Domain #19 (Chunk #19, Chunk #39, ..., Chunk #219). ParaScan runs one domain scan per domain, and each domain scan fills its own ScanBuffer in the buffer (ScanBuffer #0 to ScanBuffer #19, each holding Page #0, Page #1, ...)]
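A minimal sketch of the multi-domain parallel scan, under loud assumptions: an in-memory list stands in for the SSD, chunk `i` is assumed to live in domain `i % num_domains`, and `domain_scan` / `para_scan` are hypothetical names. Python threads only illustrate the structure (one private ScanBuffer per domain scan); they do not reproduce the I/O parallelism of a real device.

```python
from concurrent.futures import ThreadPoolExecutor


def domain_scan(storage, domain, num_domains):
    """Read one domain's chunks one by one into that domain's own ScanBuffer."""
    scan_buffer = []
    for addr in range(domain, len(storage), num_domains):
        scan_buffer.append(storage[addr])
    return scan_buffer


def para_scan(storage, num_domains, num_threads):
    """Run all domain scans on a pool of scan threads.

    Each thread may handle one or more domain scans. Because every domain
    scan writes only to its own ScanBuffer, no locking is needed here.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(domain_scan, storage, d, num_domains)
            for d in range(num_domains)
        ]
        return [f.result() for f in futures]  # ScanBuffer #0 .. #(num_domains-1)


if __name__ == "__main__":
    storage = [f"chunk{i}" for i in range(10)]
    for d, buf in enumerate(para_scan(storage, num_domains=4, num_threads=2)):
        print(d, buf)
```

In this sketch `num_threads` plays the role of the concurrency level, which the slides say is bounded by the processor's thread count and the SSD's queue depth.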

9 ParaHashJoin - Parallel Hash Join ■ A two-way equi-join consists of three phases – ParaHash, MiniJoin, and Fetch. ■ 3x faster than a traditional hash join on SSD [Diagram: Table R and Table S pass through ParaScan, then ParaHash, MiniJoin, and Fetch; the last three phases make up ParaHashJoin]

10 ParaHash ▪ Buffer - divided into a scan area and a hash area. ▪ Multiple hash threads - calculate the hash values of the records in the ScanBuffers – Each thread is assigned to one ScanBuffer ▪ Based on the hash values, the threads put the join attributes and RIDs into the corresponding hash buckets in parallel – Hash function: Hash-value = join_attr & (B-1) ▪ Concurrency control – Each bucket maintains a lightweight lock – A bitmap is used to check whether hash index records with the specified join attribute exist in the bucket

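11 ParaHashJoin ParaHash [Diagram: the table is read into the scan area of the buffer, and its records are hashed into buckets 0, 1, 2, ..., B-2, B-1 in the hash area]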
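The ParaHash phase can be sketched roughly as follows. Several things here are assumptions rather than details from the slides: the class and function names are hypothetical, `B` is required to be a power of two (so that `join_attr & (B-1)` spreads non-negative integer keys over all `B` buckets), a plain `threading.Lock` per bucket stands in for the per-bucket concurrency control, and the bitmap is omitted.

```python
import threading


class ParaHashTable:
    """Hash area: B buckets of (join_attr, RID) pairs, one lock per bucket."""

    def __init__(self, num_buckets):
        # Assumption: B is a power of two, so `& (B - 1)` acts like `% B`.
        if num_buckets <= 0 or num_buckets & (num_buckets - 1):
            raise ValueError("num_buckets must be a power of two")
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]
        self.locks = [threading.Lock() for _ in range(num_buckets)]

    def bucket_of(self, join_attr):
        """Hash function from the slide: Hash-value = join_attr & (B-1)."""
        return join_attr & (self.num_buckets - 1)

    def insert(self, join_attr, rid):
        b = self.bucket_of(join_attr)
        with self.locks[b]:  # only threads hitting the same bucket contend
            self.buckets[b].append((join_attr, rid))


def para_hash(scan_buffers, num_buckets):
    """Start one hash thread per ScanBuffer; all threads fill one shared table."""
    table = ParaHashTable(num_buckets)

    def hash_buffer(scan_buffer):
        for join_attr, rid in scan_buffer:
            table.insert(join_attr, rid)

    threads = [threading.Thread(target=hash_buffer, args=(buf,))
               for buf in scan_buffers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table
```

Locking per bucket rather than per table keeps hash threads from blocking each other unless they target the same bucket at the same moment.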

12 ParaHashJoin MiniJoin ▪ Input - the ParaHash tables of R and S ▪ Each bucket is read into memory to generate join results - {join_attr, RID R, RID S} – Two passes are required to generate the MiniJoin results – one pass for ParaScan and one pass for MiniJoin ▪ If enough memory is available to hold the hash table of R (the smaller table) – Table R goes through ParaScan and then ParaHash; table S only needs ParaScan and can probe the hash table of R directly for join results – Only one pass is required
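A rough sketch of the MiniJoin step, assuming both tables were hashed with the same number of buckets so that matching keys always sit at the same bucket index. The function name and the in-memory list-of-lists representation of the ParaHash tables are assumptions for illustration.

```python
from collections import defaultdict


def mini_join(buckets_r, buckets_s):
    """Join two ParaHash tables bucket by bucket.

    Each bucket is a list of (join_attr, RID) pairs. The result is a list of
    (join_attr, rid_r, rid_s) triples; no full records are fetched here.
    """
    results = []
    for bucket_r, bucket_s in zip(buckets_r, buckets_s):
        # Index the R side of this bucket by join attribute.
        rids_by_attr = defaultdict(list)
        for join_attr, rid_r in bucket_r:
            rids_by_attr[join_attr].append(rid_r)
        # Probe with the S side of the same bucket.
        for join_attr, rid_s in bucket_s:
            for rid_r in rids_by_attr.get(join_attr, ()):
                results.append((join_attr, rid_r, rid_s))
    return results
```

Only join attributes and RIDs travel through this phase, which is why a separate Fetch phase is needed afterwards to materialize the output rows.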

13 ParaHashJoin Fetch ▪ Uses the RIDs in the MiniJoin output index to fetch the necessary attributes and produce the final join results. ▪ TID Hash Join approach – For each join result, fetch the needed data pages to generate the final join result – This approach is reasonable if all the pages of the result fit in memory – When memory is insufficient, some pages may be loaded multiple times, increasing the cost of loading pages ▪ Sort-based fetching approach – Sort the MiniJoin results by the RIDs of the outer table – Load the needed pages in that sorted order to produce the final join results
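The sort-based fetching approach might look like the sketch below. It makes several assumptions not stated in the slides: a RID is a `(page_no, slot)` tuple, `load_outer_page` is a hypothetical callback that reads one page of the outer table, only the outer table's records are resolved, and a single page is held in memory at a time.

```python
def sort_based_fetch(mini_join_results, load_outer_page):
    """Resolve outer-table RIDs after sorting, so each page is loaded once.

    mini_join_results: iterable of (join_attr, rid_outer, rid_inner), where
    rid_outer is a (page_no, slot) tuple (an assumed RID layout).
    """
    ordered = sorted(mini_join_results, key=lambda row: row[1])
    current_page_no, current_page = None, None
    output = []
    for join_attr, rid_outer, rid_inner in ordered:
        page_no, slot = rid_outer
        if page_no != current_page_no:
            # Sorting groups all results on the same page together,
            # so this load happens once per distinct outer page.
            current_page = load_outer_page(page_no)
            current_page_no = page_no
        output.append((join_attr, current_page[slot], rid_inner))
    return output
```

Without the sort, results that alternate between pages would force the same page to be reloaded whenever memory cannot hold all of them, which is the extra cost the slide attributes to the TID Hash Join approach.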

14 ParaAggr (Parallel Aggregation) ■ Parallel implementation of aggregation operations (sum, max, min, count, average). ■ Two phases: – SubAggr: Multiple threads, one for each ParaScan thread or ScanBuffer. – TotAggr: Combines the results of all SubAggr instances. ■ ParaAggr on SSD is 3-4 times faster than traditional single-threaded aggregation on HDD.

15 ParaAggr – Working Diagram [Diagram: Domain #1 to Domain #n each feed a ParaScan; each ParaScan feeds a SubAggr; all SubAggr results feed a single TotAggr]

16 ParaAggr - Example ■ Query: SELECT count(*) FROM Employee WHERE dept='Sales' ■ ParaScan step: Scan records in all domains in parallel. Forward those satisfying the WHERE clause to SubAggr.

17 ParaAggr - Example ■ Query: SELECT count(*) FROM Employee WHERE dept='Sales' ■ SubAggr step: Count the ParaScan results in parallel SubAggr threads.

18 ParaAggr - Example ■ Query: SELECT count(*) FROM Employee WHERE dept='Sales' ■ TotAggr step: Sum the counts produced by the SubAggr operations.
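The count example can be sketched as a two-phase aggregation. The function names, the dictionary record layout, and the use of a thread pool are assumptions for illustration; the ScanBuffers are plain in-memory lists rather than the output of a real ParaScan.

```python
from concurrent.futures import ThreadPoolExecutor


def sub_aggr_count(scan_buffer, predicate):
    """SubAggr: count the records in one ScanBuffer that satisfy the WHERE clause."""
    return sum(1 for record in scan_buffer if predicate(record))


def para_aggr_count(scan_buffers, predicate):
    """Run one SubAggr per ScanBuffer, then let TotAggr sum the partial counts."""
    with ThreadPoolExecutor(max_workers=max(1, len(scan_buffers))) as pool:
        partial_counts = list(
            pool.map(lambda buf: sub_aggr_count(buf, predicate), scan_buffers)
        )
    return sum(partial_counts)  # TotAggr


if __name__ == "__main__":
    buffers = [
        [{"dept": "Sales"}, {"dept": "HR"}],
        [{"dept": "Sales"}],
    ]
    # SELECT count(*) FROM Employee WHERE dept='Sales'
    print(para_aggr_count(buffers, lambda r: r["dept"] == "Sales"))
```

The same split works for sum, max, and min by swapping the partial and final operations; average needs each SubAggr to return a (sum, count) pair so that TotAggr can divide the totals.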

19 Summary ■ The rich internal parallelism of SSDs is exploited for faster query processing. ■ The algorithms use parallel threads and SSD domains to speed up processing. – E.g. ParaScan, ParaHashJoin, ParaSort, ParaAggr. ■ Para-SSD algorithms are much faster than traditional HDD algorithms.

20 References ■ Y. Fan, W. Lai, X. Meng, Optimizing Database Operators by Exploiting Internal Parallelism of Solid State Drives

