Computer Networks Lab, Korea University
Se-Hee Whang
Overview
- Today's trend: chip multiprocessors (CMPs); large-scale CMPs need finer-grain parallelism
- Existing fine-grain schedulers:
  - Software-only implementations: flexible, but slow and not scalable
  - Hardware-only implementations: fast, but inflexible and costly
- This paper's approach: hardware-aided user-level schedulers
  - H/W: Asynchronous Direct Messages (ADM)
  - S/W: scalable message-passing schedulers
- ADM schedulers are as fast as H/W schedulers and as flexible as S/W schedulers
Contents
- Introduction
- Asynchronous Direct Messages (ADM)
- ADM Schedulers
- Evaluation
Fine-Grain Parallelism
- Divide work into small tasks and distribute them across threads
- Advantages: exposes more parallelism; avoids load imbalance
- Disadvantages: large scheduling overhead; poor locality
Work-Stealing Scheduler
- One task queue per thread
- Threads enqueue and dequeue tasks from their own queues
- If a thread runs out of work, it tries to steal tasks from another thread's queue
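The scheme above can be sketched in a few lines. This is a minimal illustrative model, not the implementation from the talk: the class name, the choice of a random victim, and the LIFO-pop / FIFO-steal convention are assumptions made for clarity.

```python
# Minimal work-stealing sketch: each thread owns a deque, pops its own
# tasks LIFO from the tail, and steals FIFO from the head of a victim's
# deque when its own queue is empty. (Illustrative only.)
import random
from collections import deque

class WorkStealingPool:
    def __init__(self, n_threads):
        self.queues = [deque() for _ in range(n_threads)]

    def push(self, tid, task):
        self.queues[tid].append(task)        # owner enqueues at the tail

    def pop(self, tid):
        q = self.queues[tid]
        if q:
            return q.pop()                   # owner dequeues LIFO
        return self.steal(tid)               # out of work: try to steal

    def steal(self, tid):
        victims = [i for i in range(len(self.queues))
                   if i != tid and self.queues[i]]
        if not victims:
            return None                      # no work anywhere
        return self.queues[random.choice(victims)].popleft()  # steal from head
```

Stealing from the opposite end of the victim's deque is the usual convention: it takes the oldest (often largest) task and avoids contending with the owner's end.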
Components of Work-Stealing Schedulers
- Queues
- Policies: an effective load-rebalancing method
- Communication
Comparing the Two Schemes
- S/W schedulers: queues and policies are cheap, but communication through shared memory is increasingly expensive
- H/W scheduler, Carbon [ISCA '07]:
  - One hardware LIFO task queue per core, with special instructions to enqueue/dequeue tasks
  - Global Task Unit (GTU): centralized queues for fast stealing
  - Local Task Units: one small task buffer per core to hide GTU latency
  - Useless if the hardwired policy does not match the application
Approaches to Fine-Grain Scheduling
- S/W only (OpenMP, TBB, Cilk, X10, ...): S/W queues & policies, S/W communication. High-overhead, flexible, no extra H/W.
- H/W only (Carbon, GPUs, ...): H/W queues & policies, H/W communication. Low-overhead, inflexible, special-purpose H/W.
- H/W aided (Asynchronous Direct Messages, ADM): S/W queues & policies, H/W communication. Low-overhead, flexible, general-purpose H/W.
Contents
- Introduction
- Asynchronous Direct Messages (ADM)
- ADM Schedulers
- Evaluation
Asynchronous Direct Messages (ADM)
- Messaging between threads, tailored to scheduling and control
- Requirements:
  - Short messages, for low overhead
  - Sends from / receives to registers, independent of the coherence protocol
  - Simultaneous communication and computation, via asynchronous messages with user-level interrupts
  - General-purpose H/W: a generic interface allows reuse
ADM Microarchitecture
- One ADM unit per core
- Receive buffer holds incoming messages until the thread dequeues them
- Send buffer holds sent messages pending acknowledgement
- Thread ID Translation Buffer translates thread ID → core ID on sends
- Small structures (16-32 entries) → scalable
ADM Instruction Set Architecture

  Instruction        Description
  adm_send r1, r2    Sends a message of (r1) words (0-6) to the thread with ID (r2)
  adm_peek r1, r2    Returns the source and message length at the head of the receive buffer
  adm_rx r1, r2      Dequeues the message at the head of the receive buffer
  adm_ei / adm_di    Enable / disable receive interrupts

- Send and receive are atomic (single instructions)
- A send completes when the message is copied to the send buffer
- A receive blocks if the buffer is empty; peek does not block, enabling polling
- A user-level interrupt is generated when a message is received
  - No stack switching; the handler saves only the registers it uses → fast
  - Interrupts can be disabled to preserve atomicity with respect to message reception
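The buffer semantics above can be modeled in software. The sketch below is an assumption-laden illustration of the receive-buffer behavior (non-blocking peek, dequeuing rx, a send that fails when the buffer is full), not the actual ISA or microarchitecture; the class name, method names, and default capacity are invented for this example.

```python
# Illustrative model of ADM receive-buffer semantics (not the real hardware):
# peek reports the head message without removing it, rx dequeues it, and a
# send is rejected (NACK) when the small fixed-size buffer is full.
from collections import deque

class ADMUnit:
    def __init__(self, capacity=16):         # small structure, e.g. 16 entries
        self.rx_buffer = deque()
        self.capacity = capacity

    def send(self, src_tid, words):
        if len(self.rx_buffer) >= self.capacity:
            return False                     # buffer full: sender must retry
        self.rx_buffer.append((src_tid, tuple(words)))
        return True

    def peek(self):
        # Non-blocking: source and length at the head, or None if empty
        if not self.rx_buffer:
            return None
        src, words = self.rx_buffer[0]
        return (src, len(words))

    def rx(self):
        # Dequeue the message at the head of the receive buffer
        return self.rx_buffer.popleft() if self.rx_buffer else None
```

A polling loop would call peek until it returns non-None, then rx; the interrupt-driven path instead runs a handler as soon as a message arrives.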
Contents
- Introduction
- Asynchronous Direct Messages (ADM)
- ADM Schedulers
- Evaluation
ADM Schedulers
- Message-passing schedulers: replace another parallel runtime's scheduler, relieving the application programmer of this responsibility
- Threads can play two roles:
  - Worker: executes the parallel phase, enqueues and dequeues tasks
  - Manager: controls task stealing and parallel-phase termination
- Centralized scheduler: one manager controls all workers
Centralized Scheduler: Updates
- The manager keeps an approximate task count for each worker
- Workers notify the manager only when their task count crosses exponentially spaced thresholds
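The exponential-threshold rule can be sketched as follows. This is a guess at one reasonable realization (thresholds at powers of two), made up for illustration; the talk does not specify the exact threshold sequence.

```python
# Sketch of exponential-threshold updates: a worker messages the manager
# only when its task count crosses a power-of-two boundary, so the manager
# tracks approximate counts with far fewer messages than one per task.
import math

def notify_points(counts):
    """Return the subsequence of task counts at which the worker would
    send an update to the manager (power-of-two levels, illustrative)."""
    def level(n):
        return -1 if n == 0 else int(math.log2(n))
    sent = []
    last = level(0)                  # worker starts with an empty queue
    for c in counts:
        if level(c) != last:         # crossed a threshold up or down
            sent.append(c)
            last = level(c)
    return sent
```

For a queue growing from 1 to 8 tasks, only the counts 1, 2, 4, and 8 trigger a message; the in-between counts are absorbed, which is what keeps the manager's view approximate but cheap.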
Centralized Scheduler: Steals
- The manager directs steals from the worker with the most tasks, balancing load
Hierarchical Scheduler
- Centralized scheduler: does all communication through messages
  - Enables directed stealing and task prefetching
  - Does not scale beyond about 16 threads
- Hierarchical scheduler: workers and managers form a tree
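The tree organization can be sketched concretely. The fanout of 4, the node naming, and the parent-map representation below are all assumptions for illustration, not the talk's parameters:

```python
# Sketch of the hierarchical scheme: workers are the leaves, each group of
# `fanout` nodes reports to a manager one level up, and the managers
# themselves are grouped until a single root manager remains.
def build_tree(n_workers, fanout=4):
    """Return a dict mapping each node name to its parent manager
    (the root maps to None). Names 'w0..', 'm0..' are illustrative."""
    parents = {}
    level = [f"w{i}" for i in range(n_workers)]
    m = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fanout):
            mgr = f"m{m}"
            m += 1
            for node in level[i:i + fanout]:
                parents[node] = mgr          # group reports to this manager
            next_level.append(mgr)
        level = next_level                   # managers are grouped in turn
    parents[level[0]] = None                 # root has no parent
    return parents
```

With 8 workers and fanout 4 this yields two first-level managers and one root, so update and steal traffic at any manager stays bounded by the fanout rather than the total thread count.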
Hierarchical Scheduler: Steals
- A steal can involve multiple levels of the tree
- Scales to large thread counts
Contents
- Introduction
- Asynchronous Direct Messages (ADM)
- ADM Schedulers
- Evaluation
Evaluation: Methodology
- Simulated machine: tiled CMP
  - 32, 64, or 128 in-order, dual-thread SPARC cores (64 to 256 threads)
  - 3-level cache hierarchy, directory coherence
- Benchmarks:
  - Loop-parallel: canneal, cg, gtfold
  - Task-parallel: maxflow, mergesort, ced, hashjoin
Evaluation: Results
- Carbon and ADM both show small overheads that scale
- ADM matches Carbon → no need for a hardware scheduler
- Scheduling policies should not all be fixed in hardware