Heterogeneous Multi-Core Processors
Jeremy Sugerman
GCafe, May 3, 2007
Context
Exploring the CPU and GPU future relationship
–Joint work and thinking with Kayvon
–Much kibitzing from Pat, Mike, Tim, Daniel
Vision and opinion, not experiments and results
–More of a talk than a paper
–The value is more conceptual than algorithmic
–Wider gcafe audience appeal than our near-term, elbows-deep plans to dive into GPU guts
Outline
Introduction
CPU “Special Feature” Background
Compute-Maximizing Processors
Synthesis, with Extensions
Questions for the Audience…
Introduction
Multi-core is the status quo for forthcoming CPUs.
A variety of emerging (for “general purpose”) architectures try to offer a discontinuous performance boost over traditional CPUs
–GPU, Cell SPEs, Niagara, Larrabee, …
CPU vendors have a history of co-opting special purpose units for targeted performance wins:
–FPU, SSE/Altivec, VT/SVM
CPUs should co-opt entire “compute” cores!
Introduction
Industry is already exploring hybrid models
–Cell: 1 PowerPC core and 8 SPEs
–AMD Fusion: slideware CPU + GPU
–Intel Larrabee: weirder, NDA-encumbered
The programming model for communication between cores deserves to be architecturally defined.
Tighter integration than the current “host + accelerator” model eases porting and improves efficiency.
Work queues / buffers allow integrated coordination with decoupled execution.
Outline
Introduction
CPU “Special Feature” Background
Compute-Maximizing Processors
Synthesis, with Extensions
Questions for the Audience…
CPU “Special Features”
CPUs are built for general purpose flexibility…
… but have always stolen fixed function units in the name of performance.
–Old CPUs had schedulers and malloc burned in!
–CISC instructions really were faster
–Hardware-managed TLBs and caches
–Arguably, all virtual memory support
CPU “Special Features”
More relevantly, dedicated hardware has been adopted for domain-specific workloads…
… when the domain was sufficiently large / lucrative / influential,
… and the increase in performance over software implementation / emulation was BIG,
… and the cost in “design budget” (transistors, power, area, etc.) was acceptable.
Examples: FPUs, SIMD and non-temporal accesses, CPU virtualization
Outline
Introduction
CPU “Special Feature” Background
Compute-Maximizing Processors
Synthesis, with Extensions
Questions for the Audience…
Compute-Maximizing Processors
“Important” common apps are FLOP-hungry
–Video processing, rendering
–Physics / game “physics”
–Even OS compositing managers!
HPC apps are FLOP-hungry too
–Computational bio, finance, simulations, …
All can soak up vastly more compute than current CPUs can deliver.
All can utilize thread or data parallelism.
Hence the increased interest in custom / non-“general” processors.
Compute-Maximizing Processors
Or “throughput oriented” processors
Packed with ALUs / FPUs
Application-specified parallelism replaces the focus on single-thread ILP
Available in many flavours:
–SIMD
–Highly threaded cores
–Large numbers of tiny cores
–Stream processors
Real-life examples generally mix and match.
Compute-Maximizing Processors
Offer an order-of-magnitude potential performance boost…
… if the workload sustains high processor utilization.
Mapping / porting algorithms is a labour-intensive and complex effort. This is intrinsic.
Within any design budget, a BIG performance win comes at a cost…
If it didn’t, the CPU designers would steal it.
Compute-Maximizing Programming
Generally offered as off-board “accelerators”
–Data is “tossed over the wall” and back
–Only portions of computations achieve a speedup if offloaded
–Accelerators mono-task, one kernel at a time
Applications are sliced into successive, statically defined phases separated by resorting, repacking, or converting entire datasets.
Limited to a single dataset-wide feed-forward pipeline.
Effectively back to batch processing; the sketch below makes this concrete.
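To make the batch model concrete, here is a minimal C++ sketch of the “host + accelerator” pattern described above. The accelUpload / accelLaunch / accelDownload functions and the Element type are hypothetical stand-ins for illustration, not any real accelerator API.

    #include <cstddef>
    #include <vector>

    struct Element { float payload; };  // hypothetical per-item data

    // Hypothetical accelerator API; illustrative names, not a real SDK.
    void accelUpload(const void* src, std::size_t bytes);   // host -> accelerator
    void accelLaunch(const char* kernel, std::size_t n);    // run one kernel to completion
    void accelDownload(void* dst, std::size_t bytes);       // accelerator -> host
    void repackForPhaseB(std::vector<Element>& data);       // host-side reshuffle (stub)

    void processDataset(std::vector<Element>& data) {
        // Phase A: toss the entire dataset over the wall, mono-task one
        // kernel, and copy everything back.
        accelUpload(data.data(), data.size() * sizeof(Element));
        accelLaunch("kernelA", data.size());
        accelDownload(data.data(), data.size() * sizeof(Element));

        // The host resorts / repacks the whole dataset before the next
        // statically defined phase can start: the batch-processing cost.
        repackForPhaseB(data);

        // Phase B: another dataset-wide, feed-forward pass.
        accelUpload(data.data(), data.size() * sizeof(Element));
        accelLaunch("kernelB", data.size());
        accelDownload(data.data(), data.size() * sizeof(Element));
    }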
Outline
Introduction
CPU “Special Feature” Background
Compute-Maximizing Processors
Synthesis, with Extensions
Questions for the Audience…
Synthesis
Add at least one compute-max core to CPUs
–Workloads that use it get a BIG performance win
–Programmers are struggling to get any performance from having more normal cores
–Being “on-chip”, architected, and ubiquitous is huge for application use of compute-max
Compute core exposed as a programmable, independent, multithreaded execution engine (sketched below)
–A lot like adding (only!) fragment shaders
–Largely agnostic about the hardware “flavour”
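To give “a lot like adding (only!) fragment shaders” a concrete shape, here is a minimal sketch of what a kernel for such an engine might look like: a pure per-element function the hardware runs across many threads at once. The WorkElement type and the signature are assumptions for illustration, not a proposed interface.

    // A compute-max kernel is shader-like: a pure function applied
    // independently to each work element. Parallelism comes from the
    // application submitting many elements, not from single-thread ILP.
    struct WorkElement {        // hypothetical per-thread payload
        float input[4];
        float output[4];
    };

    // Runs once per element; thousands of instances execute concurrently,
    // scheduled by the compute core rather than the OS.
    void exampleKernel(WorkElement& e) {
        for (int i = 0; i < 4; ++i)
            e.output[i] = e.input[i] * e.input[i];  // arbitrary per-element math
    }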
Extensions
Unified address space
–Coherency is nice, but still valuable without it
Multiple kernels “bound” (loaded) at a time
–All part of the same application, for now
“Work” delivered to compute cores through work queues (see the sketch after this list)
–Dequeuing batches / schedules for coherence, not necessarily FIFO
–Compute and CPU cores can insert onto remote queues
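A minimal sketch of how an application might see these extensions, assuming hypothetical bindKernel / createQueue / enqueue primitives. The ray-tracing kernel names and the Ray type are only examples; nothing here is a real or proposed API, just the shape of the interface.

    #include <cstddef>
    #include <cstdint>

    using KernelId = std::uint32_t;
    using QueueId  = std::uint32_t;

    // Hypothetical primitives for the architecturally defined queue model.
    KernelId bindKernel(const char* name);     // several kernels stay bound at once
    QueueId  createQueue(KernelId consumer);   // items on this queue run 'consumer'
    void     enqueue(QueueId q, const void* item, std::size_t bytes);

    struct Ray { float origin[3], dir[3]; };   // example payload
    Ray makePrimaryRay();

    void setup() {
        // Multiple kernels bound simultaneously, all from one application.
        KernelId intersect = bindKernel("intersectRay");
        KernelId shade     = bindKernel("shadeHit");

        QueueId intersectQ = createQueue(intersect);
        QueueId shadeQ     = createQueue(shade);
        (void)shadeQ;  // filled later, by the intersect kernel itself

        // Any core, CPU or compute, may insert onto a remote queue; the
        // hardware dequeues in coherence-friendly batches, not strict FIFO.
        Ray r = makePrimaryRay();
        enqueue(intersectQ, &r, sizeof(r));
    }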
Extensions
CLAIM: Queues break the “batch processing” straitjacket and still expose enough coherent parallelism to sustain compute-max utilization.
The first part is easy (sketched below):
–An obvious per-data-element state machine
–Dynamic insertion of new “work”
–Instead of sitting idle as the live thread count in a “pass” drops, a core can pull in “work” from other “passes” (queues).
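Continuing the hypothetical primitives from the previous sketch, the per-data-element state machine might look like this: each element records which pass it is in, and finishing one pass dynamically enqueues work for the next, so a core never has to wait on a dataset-wide barrier.

    // Each element carries its own state; completing one pass enqueues the
    // next, so "passes" overlap instead of forming global barriers.
    enum class Stage { Intersect, Shade, Done };

    struct PathElement {
        Stage stage;
        Ray   ray;    // payload type from the previous sketch
    };

    bool traceHitsSomething(const Ray& r);  // stub for illustration

    // Body of the bound 'intersect' kernel, run per dequeued element.
    void intersectBody(PathElement& e, QueueId shadeQ) {
        if (traceHitsSomething(e.ray)) {
            e.stage = Stage::Shade;
            enqueue(shadeQ, &e, sizeof(e));   // dynamically inserted work
        } else {
            e.stage = Stage::Done;
        }
        // As the intersect queue drains, this core can pull batches from
        // shadeQ instead of idling until the "pass" officially ends.
    }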
Extensions
CLAIM: Queues break the “batch processing” straitjacket and still expose enough coherent parallelism to sustain compute-max utilization.
The second part is more controversial:
–“Lots” of data quantized into a “few” states should offer plentiful, easy coherence…
–… if the workload as a whole has coherence
–Basically a pigeonhole argument (see the sketch below)
–Also mitigates SIMD performance constraints
–Coherence can be built / specified dynamically
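The pigeonhole intuition in code: since a batch is always drawn from a single queue, every element in it runs the same kernel, which is exactly what SIMD lanes want, and with many elements quantized into few states (queues), some queue must hold a full batch. The scheduler loop below is a sketch under the same assumed primitives as the earlier sketches.

    // Sketch of a coherence-seeking dequeue policy. A batch always comes
    // from one queue, so all of its elements run the same bound kernel;
    // SIMD-friendly by construction. With "lots" of elements in a "few"
    // queues, some queue must hold a full batch (pigeonhole).
    constexpr std::size_t kBatch = 32;  // e.g., one SIMD width of work

    std::size_t deepestQueue(const QueueId* qs, std::size_t n);           // stub
    std::size_t dequeueBatch(QueueId q, PathElement* out, std::size_t n); // not FIFO
    void        runBoundKernel(QueueId q, PathElement& e);                // stub
    bool        anyWorkPending(const QueueId* qs, std::size_t n);         // stub

    void computeCoreLoop(const QueueId* queues, std::size_t numQueues) {
        while (anyWorkPending(queues, numQueues)) {
            // Prefer the deepest queue: most likely to fill a coherent batch.
            QueueId q = queues[deepestQueue(queues, numQueues)];
            PathElement batch[kBatch];
            std::size_t n = dequeueBatch(q, batch, kBatch);
            for (std::size_t i = 0; i < n; ++i)
                runBoundKernel(q, batch[i]);  // same kernel across the batch
        }
    }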
Outline
Introduction
CPU “Special Feature” Background
Compute-Maximizing Processors
Synthesis, with Extensions
Questions for the Audience…
Audience Participation
Do you believe my argument conceptually?
–For the heterogeneous / hybrid CPU in general?
–For queues and multiple kernels?
What persuades you that 3 x86 cores + compute is preferable to quad x86?
–What app / class of apps, and how much of a win? 10x? 5x?
How skeptical are you that queues can match the performance of multi-pass / batching?
What would you find a compelling flexibility / expressiveness justification for adding queues?
–Performance wins from regaining coherence in existing branching/looping shaders?
–New algorithms if shaders and CPU threads can dynamically insert additional “work”?