Computer Networks Lab, Korea University
Se-Hee Whang
Overview
- Today's trend: chip multiprocessors (CMPs); large-scale CMPs need finer-grain parallelism
- Existing fine-grain schedulers:
  - Software-only implementations: flexible, but slow and not scalable
  - Hardware-only implementations: fast, but inflexible and costly
- This paper's approach: hardware-aided user-level schedulers
  - H/W: Asynchronous Direct Messages (ADM)
  - S/W: scalable message-passing schedulers
- ADM schedulers are as fast as H/W schedulers and as flexible as S/W schedulers
Contents
- Introduction
- Asynchronous Direct Messages (ADM)
- ADM Schedulers
- Evaluation
Fine-Grain Parallelism
- Divide work into small tasks and distribute them across threads
- Advantages: exposes more parallelism; avoids load imbalance
- Disadvantages: large scheduling overhead; poor locality
Work-Stealing Scheduler
- One task queue per thread
- Threads enqueue and dequeue tasks from their own queues
- If a thread runs out of work, it tries to steal tasks from another thread's queue
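The scheme above can be sketched in a few lines. This is a minimal illustrative model, not the implementation from the talk: the class name, the choice of a random victim, and the LIFO-pop / FIFO-steal convention are assumptions made for clarity.

```python
# Minimal work-stealing sketch: each thread owns a deque, pops its own
# tasks LIFO from the tail, and steals FIFO from the head of a victim's
# deque when its own queue is empty. (Illustrative only.)
import random
from collections import deque

class WorkStealingPool:
    def __init__(self, n_threads):
        self.queues = [deque() for _ in range(n_threads)]

    def push(self, tid, task):
        self.queues[tid].append(task)        # owner enqueues at the tail

    def pop(self, tid):
        q = self.queues[tid]
        if q:
            return q.pop()                   # owner dequeues LIFO
        return self.steal(tid)               # out of work: try to steal

    def steal(self, tid):
        victims = [i for i in range(len(self.queues))
                   if i != tid and self.queues[i]]
        if not victims:
            return None                      # no work anywhere
        return self.queues[random.choice(victims)].popleft()  # steal from head
```

Stealing from the opposite end of the victim's deque is the usual convention: it takes the oldest (often largest) task and avoids contending with the owner's end.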
Components of Work-Stealing Schedulers
- Queues
- Policies: an effective load-rebalancing method
- Communication
Comparing the Two Schemes
- S/W schedulers: queues and policies are cheap, but communication through shared memory is increasingly expensive
- H/W scheduler, Carbon [ISCA '07]:
  - One hardware LIFO task queue per core, with special instructions to enqueue/dequeue tasks
  - Global Task Unit (GTU): centralized queues for fast stealing
  - Local Task Units: one small task buffer per core to hide GTU latency
  - Useless if the hardwired policy does not match the application
Approaches to Fine-Grain Scheduling
- S/W only (OpenMP, TBB, Cilk, X10, ...): S/W queues & policies, S/W communication. High-overhead, flexible, no extra H/W.
- H/W only (Carbon, GPUs, ...): H/W queues & policies, H/W communication. Low-overhead, inflexible, special-purpose H/W.
- H/W aided (Asynchronous Direct Messages, ADM): S/W queues & policies, H/W communication. Low-overhead, flexible, general-purpose H/W.
Contents
- Introduction
- Asynchronous Direct Messages (ADM)
- ADM Schedulers
- Evaluation
Asynchronous Direct Messages (ADM)
- Messaging between threads, tailored to scheduling and control
- Requirements:
  - Short messages, for low overhead
  - Sends from / receives to registers, independent of the coherence protocol
  - Simultaneous communication and computation, via asynchronous messages with user-level interrupts
  - General-purpose H/W: a generic interface allows reuse
ADM Microarchitecture
- One ADM unit per core
- Receive buffer holds incoming messages until the thread dequeues them
- Send buffer holds sent messages pending acknowledgement
- Thread ID Translation Buffer translates thread ID → core ID on sends
- Small structures (16-32 entries) → scalable
ADM Instruction Set Architecture

  Instruction        Description
  adm_send r1, r2    Sends a message of (r1) words (0-6) to the thread with ID (r2)
  adm_peek r1, r2    Returns the source and message length at the head of the receive buffer
  adm_rx r1, r2      Dequeues the message at the head of the receive buffer
  adm_ei / adm_di    Enable / disable receive interrupts

- Send and receive are atomic (single instructions)
- A send completes when the message is copied to the send buffer
- A receive blocks if the buffer is empty; peek does not block, enabling polling
- A user-level interrupt is generated when a message is received
  - No stack switching; the handler saves only the registers it uses → fast
  - Interrupts can be disabled to preserve atomicity with respect to message reception
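The buffer semantics above can be modeled in software. The sketch below is an assumption-laden illustration of the receive-buffer behavior (non-blocking peek, dequeuing rx, a send that fails when the buffer is full), not the actual ISA or microarchitecture; the class name, method names, and default capacity are invented for this example.

```python
# Illustrative model of ADM receive-buffer semantics (not the real hardware):
# peek reports the head message without removing it, rx dequeues it, and a
# send is rejected (NACK) when the small fixed-size buffer is full.
from collections import deque

class ADMUnit:
    def __init__(self, capacity=16):         # small structure, e.g. 16 entries
        self.rx_buffer = deque()
        self.capacity = capacity

    def send(self, src_tid, words):
        if len(self.rx_buffer) >= self.capacity:
            return False                     # buffer full: sender must retry
        self.rx_buffer.append((src_tid, tuple(words)))
        return True

    def peek(self):
        # Non-blocking: source and length at the head, or None if empty
        if not self.rx_buffer:
            return None
        src, words = self.rx_buffer[0]
        return (src, len(words))

    def rx(self):
        # Dequeue the message at the head of the receive buffer
        return self.rx_buffer.popleft() if self.rx_buffer else None
```

A polling loop would call peek until it returns non-None, then rx; the interrupt-driven path instead runs a handler as soon as a message arrives.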
Contents
- Introduction
- Asynchronous Direct Messages (ADM)
- ADM Schedulers
- Evaluation
ADM Schedulers
- Message-passing schedulers: replace another parallel runtime's scheduler, relieving the application programmer of this responsibility
- Threads can play two roles:
  - Worker: executes the parallel phase, enqueues and dequeues tasks
  - Manager: controls task stealing and parallel-phase termination
- Centralized scheduler: one manager controls all workers
Centralized Scheduler: Updates
- The manager keeps an approximate task count for each worker
- Workers notify the manager only when their task count crosses exponentially spaced thresholds
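The exponential-threshold rule can be sketched as follows. This is a guess at one reasonable realization (thresholds at powers of two), made up for illustration; the talk does not specify the exact threshold sequence.

```python
# Sketch of exponential-threshold updates: a worker messages the manager
# only when its task count crosses a power-of-two boundary, so the manager
# tracks approximate counts with far fewer messages than one per task.
import math

def notify_points(counts):
    """Return the subsequence of task counts at which the worker would
    send an update to the manager (power-of-two levels, illustrative)."""
    def level(n):
        return -1 if n == 0 else int(math.log2(n))
    sent = []
    last = level(0)                  # worker starts with an empty queue
    for c in counts:
        if level(c) != last:         # crossed a threshold up or down
            sent.append(c)
            last = level(c)
    return sent
```

For a queue growing from 1 to 8 tasks, only the counts 1, 2, 4, and 8 trigger a message; the in-between counts are absorbed, which is what keeps the manager's view approximate but cheap.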
Centralized Scheduler: Steals
- The manager directs steals from the worker with the most tasks, balancing load
Hierarchical Scheduler
- Centralized scheduler: does all communication through messages
  - Enables directed stealing and task prefetching
  - Does not scale beyond about 16 threads
- Hierarchical scheduler: workers and managers form a tree
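The tree organization can be sketched concretely. The fanout of 4, the node naming, and the parent-map representation below are all assumptions for illustration, not the talk's parameters:

```python
# Sketch of the hierarchical scheme: workers are the leaves, each group of
# `fanout` nodes reports to a manager one level up, and the managers
# themselves are grouped until a single root manager remains.
def build_tree(n_workers, fanout=4):
    """Return a dict mapping each node name to its parent manager
    (the root maps to None). Names 'w0..', 'm0..' are illustrative."""
    parents = {}
    level = [f"w{i}" for i in range(n_workers)]
    m = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fanout):
            mgr = f"m{m}"
            m += 1
            for node in level[i:i + fanout]:
                parents[node] = mgr          # group reports to this manager
            next_level.append(mgr)
        level = next_level                   # managers are grouped in turn
    parents[level[0]] = None                 # root has no parent
    return parents
```

With 8 workers and fanout 4 this yields two first-level managers and one root, so update and steal traffic at any manager stays bounded by the fanout rather than the total thread count.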
Hierarchical Scheduler: Steals
- A steal can involve multiple levels of the tree
- Scales to large thread counts
Contents
- Introduction
- Asynchronous Direct Messages (ADM)
- ADM Schedulers
- Evaluation
Evaluation: Methodology
- Simulated machine: tiled CMP
  - 32, 64, or 128 in-order, dual-thread SPARC cores (64 to 256 threads)
  - 3-level cache hierarchy, directory coherence
- Benchmarks:
  - Loop-parallel: canneal, cg, gtfold
  - Task-parallel: maxflow, mergesort, ced, hashjoin
Evaluation: Results
- Carbon and ADM both show small overheads that scale
- ADM matches Carbon → no need for a hardware scheduler
- Scheduling policies should not all be fixed in hardware