1
Trumping the Multicore Memory Hierarchy with Hi-Spade Phillip B. Gibbons Intel Labs Pittsburgh April 30, 2010 Keynote talk at 10th SIAM International Conference on Data Mining
2
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 2 Abstract Data-intensive applications demand effective use of the cache/memory/storage hierarchy of the target computing platform(s) in order to achieve high performance. Algorithm designers and application/system developers, however, often tend towards one of two extremes: (i) they ignore the hierarchy, programming to the API view of “memory + I/O” and often ignoring parallelism; or (ii) they are (pain)fully aware of all the details of the hierarchy, and hand-tune to a given platform. The former often results in poor performance, while the latter demands high programmer effort for code that requires dedicated use of the platform and is not portable across platforms. Moreover, two recent trends—pervasive multi-cores and pervasive flash—provide both new challenges and new opportunities for maximizing performance. In the Hi-Spade (hierarchy-savvy parallel algorithm design) project, we are developing a hierarchy-savvy approach to algorithm design and systems for these emerging parallel hierarchies. The project seeks to create abstractions, tools and techniques that (i) assist programmers and algorithm designers in achieving effective use of emerging hierarchies, and (ii) lead to systems that better leverage the new capabilities these hierarchies provide. Our abstractions seek a sweet spot between ignoring and (pain)fully aware that exposes only what must be exposed for high performance, while our techniques deliver that good performance across a variety of platforms and sharing scenarios. Key enablers of our approach include novel thread schedulers and effective use of available flash devices. This talk summarizes our progress to date towards achieving our goals and the many challenges that remain. (hidden slide)
3
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 3 Hi-Spade: Outline / Take-Aways Hierarchies are important but challenging Hi-Spade vision: Hierarchy-savvy algorithms & systems Smart thread schedulers enable simple, hierarchy-savvy abstractions Flash-savvy (database) systems maximize benefits of Flash devices Ongoing work w/ many open problems
4
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 4 For Good Performance, Must Use the Hierarchy Effectively. Hierarchy: cache / memory / storage (CPU, L1, L2 cache, main memory, magnetic disks). Performance: running/response time, throughput, power. Data-intensive applications stress the hierarchy.
5
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 5 Clear Trend: Hierarchy Getting Richer More levels of cache Pervasive Multicore New memory / storage technologies –E.g., Pervasive use of Flash These emerging hierarchies bring both new challenges & new opportunities
6
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 6 New Trend: Pervasive Multicore. [Diagram: several CPUs, each with a private L1, sharing an L2 cache, main memory, and magnetic disks.] Much harder to use the hierarchy effectively. Challenges: cores compete for the hierarchy; hard to reason about parallel performance; hundred cores coming soon; cache hierarchy design in flux; hierarchies differ across platforms. Opportunity: rethink apps & systems to take advantage of more CPUs on chip.
7
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 7 New Trend: Pervasive Flash. [Diagram: flash devices join the hierarchy alongside the shared L2 cache, main memory, and magnetic disks.] A new type of storage in the hierarchy. Challenges: performance quirks of Flash; technology in flux, e.g., the Flash Translation Layer (FTL). Opportunity: rethink apps & systems to take advantage.
8
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 8 E.g., Xeon 7500 Series MP Platform: 4 sockets; each socket has 8 cores (2 HW threads per core) with 32KB L1 and 256KB L2 per core and a 24MB shared L3 cache; up to 1 TB main memory; magnetic disks & flash devices attached.
9
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 9 How Hierarchy is Treated Today. Algorithm designers & application/system developers tend towards one of two extremes. Ignorant: API view of memory + I/O; parallelism often ignored; performance iffy. (Pain)-fully aware: hand-tuned to the platform; effort high, not portable, limited sharing scenarios. Or they focus on one or a few aspects, but without a comprehensive view of the whole.
10
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 10 From SDM’10 Call for Papers “Extracting knowledge requires the use of sophisticated, high-performance and principled analysis techniques and algorithms, based on sound theoretical and statistical foundations. These techniques in turn require powerful visualization technologies; implementations that must be carefully tuned for performance; software systems that are usable by scientists, engineers, and physicians as well as researchers; and infrastructures that support them.”
11
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 11 The Hierarchy-Savvy parallel algorithm design (Hi-Spade) project seeks to enable a hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies. “Hierarchy-savvy”: a sweet spot between ignorant and (pain)fully aware; ignore what can be ignored, focus on what must be exposed for good performance, robust across many platforms & resource-sharing scenarios. http://www.pittsburgh.intel-research.net/projects/hi-spade/
12
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 12 Hierarchy-Savvy Sweet Spot. [Plot: performance vs. programming effort on Platform 1 and Platform 2; the hierarchy-savvy approach sits between Ignorant and (Pain)-Fully Aware, delivering good, robust performance for modest effort.]
13
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 13 Hi-Spade Research Scope Agenda: Create abstractions, tools & techniques that Assist programmers & algorithm designers in achieving effective use of emerging hierarchies Lead to systems that better leverage the new capabilities these hierarchies provide A hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies Theory / Systems / Applications
14
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 14 Hi-Spade Collaborators Intel Labs Pittsburgh: Shimin Chen (co-PI) Carnegie Mellon: Guy Blelloch, Jeremy Fineman, Robert Harper, Ryan Johnson, Ippokratis Pandis, Harsha Vardhan Simhadri, Daniel Spoonhower Microsoft Research: Suman Nath EPFL: Anastasia Ailamaki, Manos Athanassoulis, Radu Stoica University of Pittsburgh: Panos Chrysanthis, Alexandros Labrinidis, Mohamed Sharaf
15
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 15 Hi-Spade: Outline Hierarchies are important but challenging Hi-Spade vision: Hierarchy-savvy algorithms & systems Smart thread schedulers enable simple, hierarchy-savvy abstractions Flash-savvy (database) systems maximize benefits of Flash devices Ongoing work w/ many open problems
16
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 16 Abstract Hierarchy: Target Platform. [Diagram: the general abstraction is a tree of caches; the specific example is the Xeon 7500 hierarchy.]
17
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 17 Abstract Hierarchy: Simplified View What yields good hierarchy performance? Spatial locality: use what’s brought in –Popular sizes: Cache lines 64B; Pages 4KB Temporal locality: reuse it Constructive sharing: don’t step on others’ toes How might one simplify the view? Approach 1: Design to a 2 or 3 level hierarchy (?) Approach 2: Design to a sequential hierarchy (?) Approach 3: Do both (??)
18
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 18 Sequential Hierarchies: Simplified View. External Memory Model (see [J.S. Vitter, ACM Computing Surveys, 2001]): a simple model with only 2 levels, a main memory of size M and an external memory accessed in blocks of size B, i.e., only 1 “cache”; the goal is to minimize I/Os. Can be a good choice if the bottleneck is the last level.
19
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 19 Sequential Hierarchies: Simplified View. Ideal Cache Model [Frigo et al., FOCS’99]: a twist on the EM model in which M & B are unknown to the algorithm, keeping the model simple. Key algorithm goal: good performance for any M & B, which encourages hierarchical locality. Key goal: guaranteed good cache performance at all levels of the hierarchy. Single CPU only (all caches shared).
20
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 20 Example Paradigms Achieving Key Goal. Scan: e.g., computing the sum of N items incurs N/B misses, for any B (optimal). Divide-and-Conquer: e.g., matrix multiply C = A*B on quadrants, where C11 = A11*B11 + A12*B21, C12 = A11*B12 + A12*B22, C21 = A21*B11 + A22*B21, C22 = A21*B12 + A22*B22. Divide: recursively compute A11*B11, …, A22*B22. Conquer: compute the 4 quadrant sums. Uses a recursive Z-order layout. Incurs O(N^2/B + N^3/(B*√M)) misses (optimal).
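The divide-and-conquer pattern on this slide translates directly into code. The following is a minimal, illustrative Python sketch of a cache-oblivious recursive matrix multiply (not the talk's actual implementation): it assumes n is a power of two, uses a simple cutoff to a triple loop, and recurses on index ranges only, omitting the Z-order memory layout.

```python
# Minimal sketch of cache-oblivious divide-and-conquer matrix multiply.
# Assumes n is a power of two; recursion on index ranges only (the recursive
# Z-order layout mentioned on the slide is omitted for brevity).

CUTOFF = 32  # below this size, fall back to the straightforward triple loop

def matmul_add(A, B, C, ar, ac, br, bc, cr, cc, n):
    """C[cr:cr+n, cc:cc+n] += A[ar:ar+n, ac:ac+n] * B[br:br+n, bc:bc+n]."""
    if n <= CUTOFF:
        for i in range(n):
            for k in range(n):
                a = A[ar + i][ac + k]
                for j in range(n):
                    C[cr + i][cc + j] += a * B[br + k][bc + j]
        return
    h = n // 2
    # C11 += A11*B11 + A12*B21, and similarly for the other quadrants;
    # in a parallel runtime the eight recursive products could be forked.
    for (ci, cj, ai, aj, bi, bj) in [
        (0, 0, 0, 0, 0, 0), (0, 0, 0, h, h, 0),   # C11
        (0, h, 0, 0, 0, h), (0, h, 0, h, h, h),   # C12
        (h, 0, h, 0, 0, 0), (h, 0, h, h, h, 0),   # C21
        (h, h, h, 0, 0, h), (h, h, h, h, h, h),   # C22
    ]:
        matmul_add(A, B, C,
                   ar + ai, ac + aj, br + bi, bc + bj, cr + ci, cc + cj, h)

def matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    matmul_add(A, B, C, 0, 0, 0, 0, 0, 0, n)
    return C
```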
21
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 21 Multicore Hierarchies: Possible Views. Design to the Tree-of-Caches abstraction: Multi-BSP Model [L.G. Valiant, ESA’08] – 4 parameters per level: cache size, fanout, latency/sync cost, transfer bandwidth – bulk-synchronous. Our Goal: approach the simplicity of the Ideal Cache Model – the hierarchy-savvy sweet spot – without requiring bulk-synchrony.
22
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 22 Multicore Hierarchies: Key Challenge. The theory underlying the Ideal Cache Model falls apart once parallelism is introduced: good performance for any M & B on 2 levels DOES NOT imply good performance at all levels of the hierarchy. Key reason: caches are not fully shared (e.g., CPUs with private L1s under a shared L2). What’s good for CPU1 is often bad for CPU2 & CPU3, e.g., all want to write block B at ≈ the same time.
23
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 23 Multicore Hierarchies, Key New Dimension: Scheduling. The scheduling of parallel threads has a LARGE impact on cache performance. Key reason: caches are not fully shared. Recall our problem scenario: all CPUs want to write block B at ≈ the same time. We can mitigate (but not solve) this if we can schedule the writes to be far apart in time.
24
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 24 Key Enabler: Fine-Grained Threading Coarse Threading popular for decades –Spawn one thread per core at program initialization –Heavy-weight O.S. threads –E.g., Splash Benchmark Better Alternative: –System supports user-level light-weight threads –Programs expose lots of parallelism –Dynamic parallelism: forking can be data-dependent –Smart runtime scheduler maps threads to cores, dynamically as computation proceeds
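As a rough illustration of the contrast (not the actual runtime described in the talk), the sketch below exposes many more tasks than cores and lets a pool assign them to cores dynamically, instead of spawning one heavyweight thread per core; the workload and chunking are illustrative assumptions.

```python
# Illustrative contrast: coarse threading (one static chunk per core) vs.
# fine-grained tasks mapped to cores dynamically by a runtime (here, a process
# pool). A sketch only; real light-weight threading runtimes are far cheaper.
from concurrent.futures import ProcessPoolExecutor
import os

def work(chunk):
    return sum(x * x for x in chunk)

def parallel_sum_squares(data, tasks_per_core=16):
    ncores = os.cpu_count() or 4
    ntasks = ncores * tasks_per_core          # expose lots of parallelism
    step = max(1, len(data) // ntasks)
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ProcessPoolExecutor(max_workers=ncores) as pool:
        # The pool hands chunks to cores as workers free up, so load imbalance
        # among chunks is smoothed out automatically as computation proceeds.
        return sum(pool.map(work, chunks))

if __name__ == "__main__":
    print(parallel_sum_squares(list(range(100_000))))
```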
25
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 25 Cache Uses Among Multiple Threads. Destructive: threads compete for the limited on-chip cache and “flood” the off-chip pins. Constructive: threads share a largely overlapping working set. [Diagram: cores with private L1s over a shared L2 cache and interconnect.] Slide thanks to Shimin Chen
26
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 26 Smart Thread Schedulers. Work Stealing: give priority to tasks in the local work queue; good for private caches. Parallel Depth-First (PDF) [JACM’99, SPAA’04]: give priority to the earliest ready tasks in the sequential schedule, carrying sequential locality over to parallel locality; good for shared caches.
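A toy sketch of the two selection policies (hypothetical data structures, not the provably good schedulers from the cited papers): work stealing prefers the most recently spawned task in a core's own deque, while PDF prefers the ready task that comes earliest in the one-processor depth-first schedule.

```python
# Toy sketch of two ready-task selection policies (hypothetical structures).
# Each task carries seq: its position in the sequential depth-first schedule.
import random
from collections import deque, namedtuple

Task = namedtuple("Task", ["seq", "name"])

def ws_pick(core, deques):
    """Work stealing: pop newest local task; if empty, steal oldest elsewhere."""
    if deques[core]:
        return deques[core].pop()          # LIFO on own deque (good for private caches)
    victims = [c for c in deques if c != core and deques[c]]
    if victims:
        return deques[random.choice(victims)].popleft()   # steal the oldest task
    return None

def pdf_pick(ready):
    """Parallel depth-first: earliest ready task in the sequential schedule."""
    if not ready:
        return None
    ready.sort(key=lambda t: t.seq)        # keeps cores on nearby data (good for shared caches)
    return ready.pop(0)

# Tiny demo with a few ready tasks.
deques = {0: deque([Task(3, "c"), Task(7, "g")]), 1: deque([Task(1, "a")])}
print(ws_pick(0, deques))                                      # Task(seq=7, name='g')
print(pdf_pick([Task(3, "c"), Task(1, "a"), Task(7, "g")]))    # Task(seq=1, name='a')
```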
27
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 27 Parallel Merge Sort: WS vs. PDF. [Plot: cache hit/miss/mixed breakdown for Work Stealing (WS) vs. Parallel Depth First (PDF) on 8 cores, with the shared cache sized at 0.5 * (source array size + destination array size).]
28
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 28 Private vs. Shared Caches. 3-level multicore model (private L1s, shared L2, main memory). Designed a new scheduler (Controlled-PDF) with provably good cache performance for a class of divide-and-conquer algorithms [SODA’08]. The results require exposing the working-set size of each recursive subproblem.
29
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 29 Low-Span + Ideal Cache Model. Observation: guarantees on cache performance depend on the computation’s span S (the length of the critical path). E.g., work-stealing on a single level of private caches: Thm: for any computation with fork-join parallelism, there are O(M*P*S/B) more misses on P cores than on 1 core. Approach: design parallel algorithms with low span and good performance on the Ideal Cache Model. Thm: for any computation with fork-join parallelism, for each level i of a hierarchy of private caches, there are only O(M_i*P*S/B_i) more misses than on 1 core. Low span S + good miss bound [SPAA’10].
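For instance, a divide-and-conquer sum has work O(N), span O(log N), and roughly N/B misses in the Ideal Cache Model. A minimal sketch (written sequentially, with the parallel forks indicated in comments) is below; the cutoff value is an arbitrary illustrative choice.

```python
# Low-span divide-and-conquer sum: work O(N), span O(log N), ~N/B cache misses.
# Written sequentially; in a fork-join runtime the two recursive calls would be
# forked in parallel, which is what keeps the span logarithmic.

def dc_sum(a, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    n = hi - lo
    if n <= 4096:                       # base case: a simple cache-friendly scan
        return sum(a[lo:hi])
    mid = lo + n // 2
    left = dc_sum(a, lo, mid)           # fork: left half  (parallel in a real runtime)
    right = dc_sum(a, mid, hi)          # fork: right half (parallel in a real runtime)
    return left + right                 # join

print(dc_sum(list(range(1_000_000))))   # 499999500000
```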
30
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 30 Challenge of General Case Tree-of-Caches Each subtree has a given amount of compute & cache resources To avoid cache misses from migrating tasks, would like to assign/pin task to a subtree But any given program task may not match both –E.g., May need large cache but few processors
31
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 31 Hi-Spade: Outline Hierarchies are important but challenging Hi-Spade vision: Hierarchy-savvy algorithms & systems Smart thread schedulers enable simple, hierarchy-savvy abstractions Flash-savvy (database) systems maximize benefits of Flash devices Ongoing work w/ many open problems
32
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 32 Flash Superior to Magnetic Disk on Many Metrics Energy-efficient Smaller Higher throughput Less cooling cost Lighter More durable
33
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 33 Flash-Savvy Systems. Simply replacing some magnetic disks with Flash devices WILL improve performance. However, much of the performance is left on the table, because systems are not tuned to Flash characteristics. Flash-savvy systems maximize the benefits of the platform’s flash devices: what is best offloaded to flash? There are many papers in this area; we discuss only our results.
34
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 34 NAND Flash Chip Properties. A chip is organized into blocks (64-128 pages) of pages (512-2048 B). Reads and writes are per page, erases are per block, and a page can be written only once after its block is erased. Expensive operations: in-place updates and random writes. An in-place update requires: 1. copy, 2. erase, 3. write, 4. copy, 5. erase. Measured access times: reads cost roughly 0.4-0.6 ms whether sequential or random; writes cost about 0.4 ms sequentially but about 127 ms for random writes.
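The sketch below models just the constraints listed on this slide (pages are write-once between block erases, so an in-place update forces a copy, erase, and rewrite of the block); the class and its parameters are hypothetical, for illustration only.

```python
# Toy model of one NAND flash block: pages are write-once between erases,
# so an in-place update requires copy -> erase -> write back (as on the slide).

class FlashBlock:
    def __init__(self, pages_per_block=64):
        self.pages = [None] * pages_per_block
        self.written = [False] * pages_per_block
        self.erase_count = 0

    def write_page(self, i, data):
        if self.written[i]:
            raise ValueError("page already written since last erase")
        self.pages[i], self.written[i] = data, True

    def erase(self):
        self.pages = [None] * len(self.pages)
        self.written = [False] * len(self.written)
        self.erase_count += 1            # real devices have limited erase cycles

    def update_page_in_place(self, i, data):
        live = list(self.pages)          # 1. copy live pages elsewhere
        self.erase()                     # 2. erase the whole block
        live[i] = data
        for j, d in enumerate(live):     # 3. write everything back
            if d is not None:
                self.write_page(j, d)

blk = FlashBlock()
blk.write_page(0, "A")
blk.update_page_in_place(0, "A'")        # expensive: whole-block erase + rewrite
print(blk.pages[0], "erases:", blk.erase_count)
```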
35
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 35 Energy to Maintain a Random Sample. [Plot: energy consumed on a Lexar CF card; our algorithm uses “semi-random” writes in place of random writes [VLDB’08].]
36
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 36 Quirks of Flash (Mostly) Hidden by SSD Firmware Intel X25-M SSD Random writes & in-place updates no longer slow
37
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 37 Flash Logging (1/3) Transactional logging: major bottleneck Today, OLTP Databases can fit into main memory (e.g., in TPCC, 30M customers < 100GB) In contrast, must flush redo log to stable media at commit time Log access pattern: small sequential writes Ill-suited for magnetic disks: incur full rotational delays Alternative solutions are expensive or complicated Exploiting flash devices for logging [SIGMOD’09] Slide thanks to Shimin Chen
38
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 38 Flash Logging (2/3). USB flash drives are a good match: USB ports are widely available; the drives are inexpensive, so multiple devices can be used for better performance; and they are hot-pluggable, which helps cope with limited erase cycles. Multiple USB flash drives achieve better performance at a lower price than a single SSD. Our solution, FlashLogging: an unconventional array design, outlier detection & hiding, and efficient recovery. [Diagram: database workers place log records into an in-memory log buffer and request queue, which the logging interface writes out to the flash-drive array.] Slide thanks to Shimin Chen
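To make the "array of cheap flash drives" idea concrete, here is a minimal sketch of striping log appends round-robin across several devices and forcing a record to stable media at commit. It is an assumption-laden illustration, not FlashLogging itself: the file paths are hypothetical, and the paper's outlier detection/hiding and recovery machinery are omitted.

```python
# Minimal sketch: stripe log writes round-robin across several flash devices
# and fsync at commit. Paths are hypothetical; outlier handling and recovery
# from the FlashLogging paper are omitted.
import os

class StripedLog:
    def __init__(self, device_paths):
        self.files = [open(p, "ab", buffering=0) for p in device_paths]
        self.next = 0
        self.seq = 0                                  # global order for recovery

    def append(self, record: bytes):
        f = self.files[self.next]                     # round-robin across devices
        self.next = (self.next + 1) % len(self.files)
        self.seq += 1
        f.write(self.seq.to_bytes(8, "little") + record + b"\n")
        return f

    def commit(self, record: bytes):
        f = self.append(record)
        os.fsync(f.fileno())                          # force the record to stable media

log = StripedLog(["/tmp/log0.bin", "/tmp/log1.bin"])  # e.g., two USB flash drives
log.commit(b"txn 42: UPDATE accounts ...")
```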
39
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 39 Up to 5.7X improvements over disk based logging Up to 98% of ideal performance Multiple USB flash drives achieve better performance than a single SSD, at fraction of the price Flash Logging (3/3) Slide thanks to Shimin Chen
40
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 40 PR-Join for Online Aggregation Data warehouse and business intelligence –Fast growing multi-billion dollar market Interactive ad-hoc queries –Important for detecting new trends –Fast response times hard to achieve One promising approach: Online aggregation –Provides early representative results for aggregate queries (sum, avg, etc), i.e., estimates & statistical confidence intervals –Problem: Queries with joins are too slow Our goal: A faster join for online aggregation
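For context on what “early representative results” means, here is a small, generic sketch (not taken from the paper) of an online SUM estimate with a normal-approximation confidence interval over a growing random sample; the table, batch size, and 95% z-value are illustrative assumptions.

```python
# Generic online-aggregation sketch (not from the paper): estimate SUM(x) over a
# table of N rows from a growing uniform random sample, with a ~95% normal-
# approximation confidence interval that shrinks as more rows are processed.
import math
import random

def online_sum(table, batch=1000, z=1.96):
    N = len(table)
    rows = random.sample(table, N)          # process rows in random order
    n, s, ss = 0, 0.0, 0.0                  # count, sum, sum of squares
    for i in range(0, N, batch):
        for x in rows[i:i + batch]:
            n += 1
            s += x
            ss += x * x
        mean = s / n
        var = max(ss / n - mean * mean, 0.0)
        est = N * mean                      # running estimate of the total
        half_width = z * N * math.sqrt(var / n)
        yield n, est, half_width

table = [random.gauss(100, 15) for _ in range(100_000)]
for n, est, hw in online_sum(table, batch=20_000):
    print(f"after {n} rows: SUM ~= {est:,.0f} +/- {hw:,.0f}")
```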
41
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 41 Design Space. [Plot: early representative result rate (low to high) vs. total I/O cost (low to high) for Hash Ripple, SMS, GRACE, and Ripple joins; PR-Join targets a high early result rate with low total I/O cost.] Slide thanks to Shimin Chen
42
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 42 Background: Ripple Join. A join B: find the matching records of A and B; the join checks all pairs of records from A and B. For each ripple: read new records from A and B and check them for matches; read the spilled records and check them for matches with the new records; then spill the new records to disk. Problem: the ripple width is limited by the memory size.
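The per-ripple steps above translate into a short sketch (simulating “spill to disk” with in-memory lists; the function and its parameters are hypothetical helpers, not the paper's code).

```python
# Sketch of a (block) ripple join following the steps on the slide: each ripple
# reads new records from A and B, matches them against the new and previously
# spilled records, then "spills" the new records (simulated with lists here).

def ripple_join(A, B, key_a, key_b, ripple=100):
    spilled_a, spilled_b = [], []
    pos = 0
    while pos < max(len(A), len(B)):
        new_a = A[pos:pos + ripple]
        new_b = B[pos:pos + ripple]
        pos += ripple
        # new_a x (new_b + spilled_b) and new_b x spilled_a cover each pair once
        for a in new_a:
            for b in new_b + spilled_b:
                if key_a(a) == key_b(b):
                    yield (a, b)
        for b in new_b:
            for a in spilled_a:
                if key_a(a) == key_b(b):
                    yield (a, b)
        spilled_a.extend(new_a)             # in the real algorithm: spill to disk
        spilled_b.extend(new_b)

A = [(i, "a%d" % i) for i in range(500)]
B = [(i % 50, "b%d" % i) for i in range(500)]
matches = list(ripple_join(A, B, key_a=lambda r: r[0], key_b=lambda r: r[0]))
print(len(matches))   # early matches arrive ripple by ripple
```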
43
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 43 PR-Join: Partitioned expanding Ripple Join [Sigmod’10]. Idea: multiplicatively expanding ripples give a higher result rate while keeping results representative. To allow a ripple width larger than memory, the inputs are hash partitioned on the join key so that each partition fits in memory, and results are reported per partitioned ripple.
44
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 44 PR-Join leveraging SSD Setting: 10GB joins 10GB, 500MB memory Inputs on HD; SSD for temp storage Near-optimal total I/O cost Higher early result rate
45
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 45 Concurrent Queries & Updates in Data Warehouse Data Warehouse queries dominated by table scans –Sequential scan on HD Updates are delayed to avoid interfering –E.g., Mixing random updates with TPCH queries would incur 2.9X query slowdown –Thus, queries are on stale data
46
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 46 Concurrent Queries & Updates in Data Warehouse. Our Approach: cache updates on SSD; queries take the updates into account on-the-fly; updates are periodically migrated to HD in batch. Improves query latency by 2X and update throughput by 70X. [Diagram: 1. incoming updates land on the SSD; 2. query processing merges the related SSD updates into each table (range) scan over the main data on disks; 3. updates migrate to the disks in batch.]
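A minimal sketch of the merge-during-scan idea, with in-memory stand-ins for the disk-resident table and the SSD update cache; the data layout and helper names are assumptions for illustration, not the system's actual structures.

```python
# Sketch of taking cached updates into account during a range scan: main data
# lives on "disk" (a sorted list here), recent updates are cached on "SSD"
# (a dict here; value None means delete), and the scan merges them on the fly.
import bisect

def range_scan(table, update_cache, lo, hi):
    """table: sorted list of (key, value); update_cache: key -> value or None."""
    seen = set()
    start = bisect.bisect_left(table, (lo,))
    for key, value in table[start:]:
        if key > hi:
            break
        seen.add(key)
        if key in update_cache:              # apply the cached update on the fly
            if update_cache[key] is not None:
                yield key, update_cache[key]
        else:
            yield key, value
    for key, value in sorted(update_cache.items()):   # newly inserted keys
        if lo <= key <= hi and key not in seen and value is not None:
            yield key, value

table = [(k, "old%d" % k) for k in range(0, 100, 2)]      # main data on disk
cache = {10: "new10", 11: "new11", 20: None}              # updates cached on SSD
print(list(range_scan(table, cache, 8, 22)))
# key 10 updated, key 20 deleted, inserted key 11 reported after the scanned rows
```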
47
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 47 Hi-Spade: Outline Hierarchies are important but challenging Hi-Spade vision: Hierarchy-savvy algorithms & systems Smart thread schedulers enable simple, hierarchy-savvy abstractions Flash-savvy (database) systems maximize benefits of Flash devices Ongoing work w/ many open problems
48
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 48 Publications (1) Cache Hierarchy & Schedulers: 1. PDF scheduler for shared caches [SPAA’04] 2. Scheduling for constructive sharing [SPAA’07] 3. Controlled-PDF scheduler [SODA’08] 4. Combinable MBTs [SPAA’08] 5. Semantic space profiling & visualization [ICFP’08] 6. Scheduling beyond nested parallelism [SPAA’09] 7. Low depth paradigm & algorithms [SPAA’10] 8. Parallel Ideal Cache model [under submission]
49
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 49 Semantic Space Profiling [ICFP’08]. [Figure: heap-use DAGs for Matrix Multiply showing peak memory use under two schedulers (breadth-first vs. work stealing), plus an allocation-point breakdown for 2 cores under the breadth-first scheduler.]
50
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 50 Publications (2) Flash-savvy database systems: 1. Semi-random writes [VLDB’08] 2. Flash-based transactional logging [Sigmod’09] 3. PR-Join for online aggregation [Sigmod’10] 4. I/O scheduling for transactional I/Os [under submission] 5. Concurrent warehousing queries & updates [under submission]
51
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 51 Many Open Problems Hierarchy-savvy ideal: Simplified view + thread scheduler that will rule the world New tools & architectural features that will help Extend beyond MP platform to cluster/cloud Richer class of algorithms: Data mining, etc. Hierarchy-savvy scheduling for power savings PCM-savvy systems: How will Phase Change Memory change the world?
52
© Phillip B. Gibbons SDM’10 keynote Hi-Spade 52 Hi-Spade: Conclusions Hierarchies are important but challenging Hi-Spade vision: Hierarchy-savvy algorithms & systems Smart thread schedulers enable simple, hierarchy-savvy abstractions Flash-savvy (database) systems maximize benefits of Flash devices Ongoing work w/ many open problems