Architectural Support for Fine-Grained Parallelism on Multi-core Architectures
Sanjeev Kumar, Christopher J. Hughes, and Anthony Nguyen (Corporate Technology Group, Intel Corporation)
By Duygu Akman
2 Keywords: Multi-core Architectures, RMS applications, Large-Grain / Fine-Grain Parallelism, Architectural Support
3 Multi-core Architectures (MCAs) offer higher performance than uniprocessor systems and reduce communication latency while increasing bandwidth between cores. Applications need thread-level parallelism to benefit.
4 Thread-Level Parallelism One common approach is to partition a program into parallel tasks and let software schedule the tasks onto different threads. This is useful only if tasks are large enough that the software overhead is negligible (e.g., scientific applications). RMS (Recognition, Mining and Synthesis) applications mostly have small tasks.
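The sketch below illustrates this software approach: a shared queue of tasks that worker threads pull from, with a lock taken on every enqueue and dequeue. It is a minimal illustration, not the schedulers evaluated in the paper; the task bodies, task count, and thread count are placeholders.

```cpp
// Minimal sketch of software task scheduling (not the paper's scheduler):
// a shared queue that worker threads pull tasks from. Every enqueue and
// dequeue takes a lock, which is the per-task overhead that dominates
// when tasks are small.
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::deque<std::function<void()>> tasks;
    std::mutex m;

    // Partition the "program" into parallel tasks (placeholder work here).
    for (int i = 0; i < 32; ++i)
        tasks.push_back([i] { volatile int x = i * i; (void)x; });

    // Software schedules the tasks onto threads: each worker repeatedly
    // locks the queue, takes one task, and runs it.
    auto worker = [&] {
        for (;;) {
            std::function<void()> t;
            {
                std::lock_guard<std::mutex> g(m);
                if (tasks.empty()) return;
                t = std::move(tasks.front());
                tasks.pop_front();
            }
            t();  // if the task is tiny, the locking above dominates its cost
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```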
5 Reasons for Fine-Grain Parallelism MCAs can now be found even in home machines, which also suggests that fine-grained parallelism is necessary. Some applications need good performance on platforms with varying numbers of cores -> fine-grain tasks. In multiprogramming, the number of cores assigned to an application can change during execution, so the available parallelism must be maximized -> fine-grain parallelism.
6 Example (8-core MCA)
7 Two cases: the application is partitioned into 8 equal-sized tasks, or into 32 equal-sized tasks. In a parallel section, when a core finishes its tasks, it waits for the other cores -> wasted resources.
8 Example (8-core MCA) With 4 or 8 cores assigned to the application, all cores are fully utilized. With 6 cores, the first case wastes more resources (same performance as with 4 cores) than the second case, which is finer-grained.
9 The problem is even worse with a larger number of cores: with only 64 tasks, there is no performance improvement between 32 cores and 63 cores! More tasks are needed -> fine-grain parallelism.
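A quick back-of-the-envelope check of these examples, assuming equal-sized tasks and no scheduling overhead: with T tasks on C cores, the slowest core runs ceil(T / C) tasks, which determines the parallel completion time.

```cpp
// Reproduces the slides' utilization argument under the assumption of
// equal-sized tasks and zero scheduling overhead.
#include <cstdio>

int main() {
    const int tasks[] = {8, 32, 64};
    const int cores[] = {4, 6, 8, 32, 63, 64};
    for (int t : tasks) {
        for (int c : cores) {
            int time = (t + c - 1) / c;                // ceil(t / c) "task units"
            double utilization = double(t) / (double(time) * c);
            std::printf("%2d tasks on %2d cores: time %d, utilization %.0f%%\n",
                        t, c, time, 100.0 * utilization);
        }
    }
    // e.g. 8 tasks on 6 cores still take 2 task units (same as on 4 cores),
    // and 64 tasks show no improvement between 32 and 63 cores.
}
```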
10 Contribution Proposes a hardware technique to accelerate dynamic task scheduling on MCAs: hardware queues that cache tasks and implement the scheduling policies, plus a task prefetcher on each core to hide the latency of accessing the queues.
11 Workloads Parallelized and analyzed RMS applications from areas including simulation for computer games, financial analytics, image processing, and computer vision. Some modules of these applications have large-grained parallelism -> insensitive to task-queuing overhead. But a significant number of modules have to be parallelized at a fine granularity to achieve better performance.
12 Architectural Support for Fine-Grained Parallelism There is overhead when task queuing is handled by software; if tasks are small, this overhead can be a significant fraction of the total execution time. The contribution is adding hardware to MCAs to accelerate task queues: it provides very fast access to the storage for tasks and performs fast task scheduling.
13 Proposed Hardware An MCA chip where the cores are connected to a cache hierarchy by an on-die network. Two separate hardware components: a Local Task Unit (LTU) per core and a single Global Task Unit (GTU).
14 Proposed Hardware
15 Global Task Unit The GTU implements the scheduling algorithm and holds enqueued tasks in hardware queues, with one hardware queue per core. Since the queues are physically close to each other, scheduling is fast. The GTU is physically centralized, and the connection between the GTU and the cores uses the same on-die interconnect as the caches.
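A minimal single-threaded software model of the GTU's queues is sketched below. The per-core queues and the LIFO ordering for a core's own tasks come from the slides; the fallback of taking the oldest task from another core's queue when the local queue is empty is an assumption, since the slides only say that the GTU implements the scheduling algorithm. The GlobalTaskUnit name and Task type are illustrative.

```cpp
// Behavioural model of the GTU's per-core hardware queues (a sketch,
// not the actual hardware or the paper's exact scheduling policy).
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

using Task = int;  // stand-in for a task descriptor

class GlobalTaskUnit {
public:
    explicit GlobalTaskUnit(std::size_t cores) : queues_(cores) {}

    void enqueue(std::size_t core, Task t) { queues_[core].push_back(t); }

    std::optional<Task> dequeue(std::size_t core) {
        // LIFO for the core's own tasks, as stated on slide 18.
        if (!queues_[core].empty()) {
            Task t = queues_[core].back();
            queues_[core].pop_back();
            return t;
        }
        // Assumed policy: otherwise take the oldest task from any other queue.
        for (auto& q : queues_) {
            if (!q.empty()) {
                Task t = q.front();
                q.pop_front();
                return t;
            }
        }
        return std::nullopt;  // no work anywhere
    }

private:
    std::vector<std::deque<Task>> queues_;  // one hardware queue per core
};
```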
16 Global Task Unit The disadvantage of the GTU is that as the number of cores increases, the average communication latency between a core and the GTU also increases. This latency is hidden by the prefetchers in the LTUs.
17 Local Task Unit Each core has a small piece of hardware to communicate with the GTU. If a core waited to contact the GTU until the thread running on it finished its current task, the thread would have to stall for the GTU access latency. The LTU therefore also has a task prefetcher and a small buffer to hide the latency of accessing the GTU.
18 Local Task Unit On a dequeue, if there is a task in the LTU's buffer, that task is returned to the thread and a prefetch for the next available task is sent to the GTU. On an enqueue, the task is placed in the LTU's buffer. Since the proposed hardware uses a LIFO ordering of tasks for a given thread, if the buffer is already full, the oldest task in the buffer is sent to the GTU.
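The buffer behaviour described here can be modelled as below. This is a sketch: the buffer capacity of 2 and the GTUPort interface are assumptions, and the prefetch is shown synchronously, whereas in hardware it overlaps with the thread's execution of the returned task.

```cpp
// Behavioural model of an LTU's enqueue/dequeue handling (a sketch).
#include <cstddef>
#include <deque>
#include <optional>

using Task = int;

struct GTUPort {                                 // stand-in for the core's link to the GTU
    virtual void send(Task t) = 0;               // enqueue a task in the GTU
    virtual std::optional<Task> fetch() = 0;     // ask the GTU for a task
    virtual ~GTUPort() = default;
};

class LocalTaskUnit {
public:
    explicit LocalTaskUnit(GTUPort& gtu, std::size_t capacity = 2)
        : gtu_(gtu), capacity_(capacity) {}

    // Dequeue: return a buffered task if present and prefetch the next one,
    // so the thread does not stall for the GTU access latency.
    std::optional<Task> dequeue() {
        if (buffer_.empty()) return gtu_.fetch();   // miss: must go to the GTU
        Task t = buffer_.back();                    // LIFO: newest task first
        buffer_.pop_back();
        if (auto next = gtu_.fetch()) buffer_.push_back(*next);  // prefetch
        return t;
    }

    // Enqueue: place the task in the buffer; if full, evict the oldest
    // buffered task to the GTU, preserving LIFO order for the local thread.
    void enqueue(Task t) {
        if (buffer_.size() == capacity_) {
            gtu_.send(buffer_.front());
            buffer_.pop_front();
        }
        buffer_.push_back(t);
    }

private:
    GTUPort& gtu_;
    std::size_t capacity_;
    std::deque<Task> buffer_;
};
```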
19 Experimental Evaluation The benchmarks are from the RMS (Recognition, Mining and Synthesis) application domain and cover a wide range of different areas. All benchmarks are parallelized.
20 These benchmarks are straightforward to parallelize: each parallel loop simply specifies a range of indices and the granularity of tasks.
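The loop-level pattern looks roughly like the sketch below; the parallel_for name, the one-thread-per-chunk scheduling, and the example loop body are illustrative, not the paper's API.

```cpp
// Sketch of a loop-level parallel section: an index range plus a task
// granularity (chunk size). A real runtime would schedule the chunks onto
// a fixed thread pool instead of spawning a thread per chunk.
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

void parallel_for(int begin, int end, int grain,
                  const std::function<void(int, int)>& body) {
    std::vector<std::thread> workers;
    for (int lo = begin; lo < end; lo += grain) {
        int hi = std::min(lo + grain, end);
        workers.emplace_back(body, lo, hi);   // one task per chunk of `grain` indices
    }
    for (auto& t : workers) t.join();
}

int main() {
    std::vector<float> a(1024, 1.0f);
    // Range [0, 1024) split into tasks of 128 indices each -> 8 equal-sized tasks.
    parallel_for(0, static_cast<int>(a.size()), 128,
                 [&](int lo, int hi) { for (int i = lo; i < hi; ++i) a[i] *= 2.0f; });
}
```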
21 Task-level parallelism is more general than loop-level parallelism: each parallel section starts with a set of initial tasks, and any task may enqueue other tasks.
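A hypothetical example of this task-level pattern (not one of the paper's benchmarks): a task covering a large index range enqueues two child tasks, while a small one does the work directly; workers keep running until no tasks are queued or in flight.

```cpp
// Hypothetical task-level parallel section: one initial task, and any task
// may enqueue further tasks.
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Range { int lo, hi; };

int main() {
    std::deque<Range> queue;
    std::mutex m;
    std::condition_variable cv;
    std::atomic<int> pending{0};          // tasks enqueued but not yet finished
    std::atomic<long long> total{0};

    auto enqueue = [&](Range r) {
        pending.fetch_add(1);
        { std::lock_guard<std::mutex> g(m); queue.push_back(r); }
        cv.notify_one();
    };

    enqueue({0, 1 << 16});                // the initial task set (one task here)

    auto worker = [&] {
        for (;;) {
            Range r;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return !queue.empty() || pending.load() == 0; });
                if (queue.empty()) return;            // nothing queued or in flight
                r = queue.back(); queue.pop_back();   // LIFO, as in the proposal
            }
            if (r.hi - r.lo > 1024) {                 // large task: enqueue two children
                int mid = r.lo + (r.hi - r.lo) / 2;
                enqueue({r.lo, mid});
                enqueue({mid, r.hi});
            } else {                                  // small task: do the actual work
                long long s = 0;
                for (int i = r.lo; i < r.hi; ++i) s += i;
                total.fetch_add(s);
            }
            if (pending.fetch_sub(1) == 1) {          // last outstanding task finished
                std::lock_guard<std::mutex> g(m);     // take the lock so waiters can't miss it
                cv.notify_all();
            }
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    std::printf("sum = %lld\n", total.load());
}
```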
22 Benchmarks In some of these benchmarks, the task size is small, so the task-queue overhead must be small to effectively exploit the available parallelism. In some, parallelism is limited. In some, task sizes are highly variable, so very efficient task management is needed for good load balancing.
23 Results The results show the performance benefit of the proposed hardware for the loop-level and task-level benchmarks when running with 64 cores. The hardware proposal is compared with the best optimized software implementations and with an idealized implementation (Ideal), in which tasks bypass the LTUs and are sent directly to/from the GTU with zero interconnect latency, and the GTU processes them instantly, without any latency.
24 Results
25 Results
26 Results The graphs show the speedup over one-thread execution using the Ideal implementation. There are multiple bars for each benchmark; each bar corresponds to a different data set shown in the benchmark tables.
27 Results For the loop-level benchmarks, the proposed hardware executes 88% faster on average than the optimized software implementation and only 3% slower than Ideal. For the task-level benchmarks, the proposed hardware is on average 98% faster than the best software version and is within 2.7% of Ideal.
28 Conclusion In order to benefit from the growing compute resources of MCAs, applications must expose their thread-level parallelism to the hardware. Previous work has proposed software implementations of dynamic task schedulers, but applications with small tasks, such as the RMS applications, achieve poor parallel speedups using software dynamic task scheduling, because the overhead of the scheduler is large for these applications.
29 Conclusion To enable good parallel scaling even for applications with very small tasks, a hardware scheme is proposed. It consists of relatively simple hardware and is tolerant of growing on-die latencies; therefore, it is a good solution for scalable MCAs. When the proposed hardware, the optimized software task schedulers, and an idealized hardware task scheduler are compared, we see that, for the RMS benchmarks, the hardware gives large performance benefits over the software schedulers and comes very close to the idealized hardware scheduler.