Slide 1: Hybrid PC architecture
Jeremy Sugerman, Kayvon Fatahalian
Slide 2: Trends
– Multi-core CPUs
– Generalized GPUs: Brook, CTM, CUDA
– Tighter CPU-GPU coupling: PS3, Xbox 360, AMD “Fusion” (faster bus, but the GPU is still treated as a batch coprocessor)
Slide 3: CPU-GPU coupling
Important apps (game engines) exhibit workloads suited to both CPU- and GPU-style cores.
– GPU friendly: geometry processing, shading, physics (fluids/particles)
– CPU friendly: IO, AI/planning, collisions, adaptive algorithms
Slide 4: CPU-GPU coupling
Current: coarse-granularity interaction
– Control: the CPU launches a batch of work and waits for results before sending more commands (multi-pass)
– Necessitates algorithmic changes
The GPU is a slave coprocessor
– Limited mechanisms to create new work
– The CPU must deliver LARGE batches
– The CPU sends GPU commands via a “driver” model
Slide 5: Fundamentally different cores
“CPU” cores
– Small number (tens) of HW threads
– Software (OS) thread scheduling
– Memory system prioritizes minimizing latency
“GPU” cores
– Many HW threads (>1000), hardware scheduled
– Minimal per-thread state, kept on-chip: shared PC, wide SIMD execution, small register file, no thread stack
– Memory system prioritizes throughput
(Not clear: synchronization, SW-managed memory, isolation, resource constraints)
Slide 6: GPU as a giant scheduler
[Figure: the Direct3D pipeline drawn as a scheduling problem. A command buffer feeds IA → VS → GS → RS → PS → OM, with stages connected by on-chip queues and data/output streams held in off-chip buffers. Each stage amplifies work at a characteristic rate: 1-to-1, 1-to-N (bounded), 1-to-(0 or X) (X static), or 1-to-N (unbounded).]
Slide 7: GPU as a giant scheduler
[Figure: GPU microarchitecture. A hardware scheduler plus thread scoreboard dispatches from on-chip queues (command, vertex, primitive, and fragment queues) to the processing cores running VS/GS/PS; fixed-function IA, RS, and the read-modify-write OM surround them, with data held in off-chip buffers.]
Slide 8: GPU as a giant scheduler
The rasterizer (plus the input command processor) is a domain-specific HW work scheduler:
– Millions of work items per frame
– On-chip queues of work
– Thousands of HW threads active at once
– CPU threads (via API commands), GS programs, and fixed-function logic generate work
– The pipeline describes dependencies
What is the work here?
– Vertices
– Geometric primitives
– Fragments
– In the future: rays?
Each category has well-defined resource requirements (sketched below).
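To make that last point concrete, each class of work could carry a statically known resource descriptor; this is a hypothetical C++ encoding of the idea, not anything specified in the talk:

```cpp
#include <cstdint>

// Each class of GPU work carries statically known resource needs, which is
// what lets a fixed-function scheduler pack thousands of HW threads safely.
// Field names are illustrative, not from the slides.
enum class WorkClass { Vertex, Primitive, Fragment, Ray /* future */ };

struct ResourceRequirements {
    std::uint32_t registers_per_thread;  // register file footprint
    std::uint32_t max_outputs;           // amplification bound (e.g., GS output limit)
    bool          needs_gather;          // requires random-access reads?
};
```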
Slide 9: The project
Investigate making “GPU” cores first-class execution engines in a multi-core system.
Add: fine-granularity interaction between cores, where work processing on any core can create new work (for any other core).
Hypothesis: scheduling work (actions) is the key problem
– Keeping state on-chip
Drive an architecture simulation with an interactive graphics pipeline augmented with ray tracing.
Slide 10: Our architecture
– Multi-core processor = some “CPU”-style + some “GPU”-style cores
– Unified system address space
– “Good” interconnect between cores
– Actions (work) on any core can create new work
– Potentially: software-managed configurable L2; synchronization/signaling primitives across actions
Slide 11: Need a new scheduler
The GPU HW scheduler leverages highly domain-specific information:
– It knows dependencies
– It knows the resources used by threads
We need to move to a more general-purpose HW/SW scheduler, yet still do okay.
Questions:
– What scheduling algorithms? (one illustrative policy is sketched below)
– What information is needed to make decisions?
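One candidate policy consistent with the minimize-on-chip-state goal, offered purely as an assumption-laden illustration and not as the project's answer: always dispatch from the most-downstream non-empty queue, so in-flight state drains before new work is admitted upstream.

```cpp
#include <cstddef>
#include <vector>

// Illustrative policy only. Queues are assumed ordered by pipeline depth,
// so queues[i] feeds queues[i+1]. Servicing the most-downstream non-empty
// queue first drains in-flight state instead of letting it accumulate.
int pick_queue(const std::vector<std::size_t>& queue_depths) {
    for (int i = static_cast<int>(queue_depths.size()) - 1; i >= 0; --i) {
        if (queue_depths[i] > 0) return i;  // deepest runnable stage
    }
    return -1;  // all queues empty: nothing to dispatch
}
```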
Slide 12: Programming model = queues
Model the system as a collection of work queues (see the sketch below):
– Create work = enqueue
– SW-driven dispatch of “CPU” core work
– HW-driven dispatch of “GPU” core work
– Application code does not dequeue
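A minimal sketch of this model, assuming a C++ embedding (all names are ours; the talk does not specify an API). Application code can only enqueue; only the dispatcher, software on CPU cores and hardware on GPU cores, ever dequeues:

```cpp
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical work item: a kernel to run plus its argument record.
struct WorkItem {
    std::function<void(const void*)> kernel;
    const void* args;
};

// A system work queue. Application code may only enqueue; dequeue is
// reserved for the (SW or HW) dispatcher.
class WorkQueue {
public:
    // Create work = enqueue, callable from work running on any core.
    void enqueue(WorkItem item) { items_.push_back(std::move(item)); }

private:
    friend class Dispatcher;  // only the scheduler may dequeue
    std::deque<WorkItem> items_;
};

// SW-driven dispatch loop standing in for a "CPU" core; on a "GPU" core
// the equivalent loop would be implemented by the hardware scheduler.
class Dispatcher {
public:
    explicit Dispatcher(std::vector<WorkQueue*> queues)
        : queues_(std::move(queues)) {}

    void run_one() {
        for (WorkQueue* q : queues_) {
            if (!q->items_.empty()) {
                WorkItem item = std::move(q->items_.front());
                q->items_.pop_front();
                item.kernel(item.args);  // the kernel may itself enqueue
                return;
            }
        }
    }

private:
    std::vector<WorkQueue*> queues_;
};
```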
Slide 13: Benefits of queues
– Describe classes of work: associate queues with environments (GPU (no gather); GPU + gather; GPU + create work (bounded); CPU; CPU + SW-managed L2), as encoded in the sketch below
– Opportunity to coalesce/reorder work: fine-grained creation, bulk execution
– Describe dependencies
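A hypothetical encoding of the slide's environment classes; the taxonomy is from the slide, the C++ representation is an assumption of ours:

```cpp
#include <cstdint>

// Binding a queue to an environment tells the scheduler which cores can
// run its work and what capabilities that work may use.
enum class Environment {
    GpuNoGather,       // GPU (no gather)
    GpuGather,         // GPU + gather
    GpuCreateBounded,  // GPU + create work (bounded)
    Cpu,               // CPU
    CpuManagedL2,      // CPU + SW-managed L2
};

struct QueueDescriptor {
    Environment   env;         // where this queue's work may execute
    std::uint32_t max_fanout;  // bound on work one item may create (0 = none)
};
```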
Slide 14: Decisions
– Granularity of work: enqueue individual elements or batches?
– “Coherence” of work (batching state changes): associate kernels/resources with queues (as part of the environment)?
– Constraints on enqueue: fail gracefully in case of work explosion (sketched below)
– Scheduling policy: minimize state (the size of the queues); how to understand dependencies
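For instance, graceful failure under work explosion might look like a bounded enqueue that reports failure instead of blocking, letting the producer spill to an off-chip buffer or throttle itself. A sketch under those assumed semantics:

```cpp
#include <cstddef>
#include <deque>
#include <utility>

// Bounded queue: enqueue reports failure rather than blocking or aborting,
// so the producer can spill or throttle when work explodes.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    bool try_enqueue(T item) {
        if (items_.size() >= capacity_) return false;  // graceful failure
        items_.push_back(std::move(item));
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<T> items_;
};
```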
Slide 15: First steps
– Coarse architecture simulation; hello world = run CPU + GPU threads, with GPU threads creating other threads (see below)
– Identify GPU ISA additions
– Establish what information the scheduler needs: what are the “environments”?
– Eventually drive the simulation with a hybrid renderer
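A toy, self-contained rendering of that hello-world milestone; the round-robin loop and all names are stand-ins for the real architectural simulator, which the talk does not describe:

```cpp
#include <cstdio>
#include <deque>
#include <functional>

// Toy simulation of the milestone: CPU and "GPU" work items coexist, and
// a GPU work item creates new work for both core types.
std::deque<std::function<void()>> cpu_queue, gpu_queue;

void gpu_child()  { std::printf("GPU child thread ran\n"); }
void cpu_notify() { std::printf("CPU consumed GPU-created work\n"); }

void gpu_hello() {
    std::printf("GPU thread ran\n");
    gpu_queue.push_back(gpu_child);   // GPU work creating GPU work
    cpu_queue.push_back(cpu_notify);  // ...and creating CPU work
}

int main() {
    cpu_queue.push_back([] { std::printf("CPU thread ran\n"); });
    gpu_queue.push_back(gpu_hello);
    // Round-robin dispatch standing in for the simulator's scheduler.
    while (!cpu_queue.empty() || !gpu_queue.empty()) {
        if (!cpu_queue.empty()) { cpu_queue.front()(); cpu_queue.pop_front(); }
        if (!gpu_queue.empty()) { gpu_queue.front()(); gpu_queue.pop_front(); }
    }
}
```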
Slide 16: Evaluation
Compare against architectural alternatives:
1. Multi-pass rendering (very coarse-grained) with a domain-specific scheduler
   – Paper: “GPU” microarchitecture comparison with our design
   – Scheduling resources
   – On-chip state vs. performance tradeoff
   – On-chip bandwidth
2. Many-core homogeneous CPU
Slide 17: Summary
Hypothesis: elevating “GPU” cores to first-class execution engines is a better way to build a hybrid system
– For apps with dynamic/irregular components
– For performance
– For ease of programming
Allow all cores to generate new work by adding to system queues.
Scheduling the work in these queues is the key issue (goal: keep the queues on chip).
Slide 18: Three fronts
GPU microarchitecture
– GPU work creating GPU work
– A generalization of the DirectX 10 GS
CPU-GPU integration
– GPU cores as first-class execution environments (dump the driver model)
– A unified view of work throughout the machine
– Any core can create work for any other core
GPU resource management
– Ability to correctly manage/virtualize GPU resources
– Window manager