Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC
Agenda Introduction Basic concepts Sample results and analysis
Who is EEMBC? Industry standards consortium focused on benchmarks for the embedded market. Formed in 1997, it includes most embedded silicon and tools vendors. Provides standards for automotive, networking, office automation, consumer devices, telecom, Java, multicore and more.
CoreMark – Multicore Scalability. Information provided by Cavium for CN58XX.
Multicore Scalability: IP Forwarding. Information provided by Cavium.
MultiBench A suite of benchmarks from EEMBC, targeted at multicore in general. Help decide how best to use a system. Help select the best processor and/or system for the job.
Why MultiBench? If cores were cars [figure: "? =" comparison]
Workloads and Work Items Multiple algorithms. Multiple datasets. Decomposition. [Diagram: a workload made up of Work Item A0, Work Item A1 and Work Item B0, with concurrency within an item.]
Work Items and Workers Threads working on the same work item are collectively referred to as workers.
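The two levels of concurrency described here (several work items in flight, and several workers sharing one item) can be sketched in Python. This is only a structural illustration, not MultiBench code: the function names and the sum-of-squares "kernel" are hypothetical, and Python threads will not show real speedup because of the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def process_slice(item, worker_id, num_workers):
    """One worker handles every num_workers-th element of the work item."""
    return sum(x * x for x in item[worker_id::num_workers])


def run_item(item, num_workers):
    """Concurrency within an item: several workers share one work item."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = pool.map(
            lambda w: process_slice(item, w, num_workers), range(num_workers)
        )
        return sum(parts)


def run_workload(items, concurrent_items, workers_per_item):
    """Concurrency across items: several work items are in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrent_items) as pool:
        return list(pool.map(lambda it: run_item(it, workers_per_item), items))


if __name__ == "__main__":
    # Two work items processed in parallel, two workers per item.
    print(run_workload([[1, 2, 3], [4, 5]], concurrent_items=2, workers_per_item=2))
```

Setting `workers_per_item=1` and raising `concurrent_items` corresponds to processing multiple items in parallel; the opposite setting corresponds to speeding up a single item.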
Workload Characteristics Important to understand inherent characteristics of a workload. Determine which workloads are most relevant for you. Valuable information, along with the algorithm description, for analyzing performance results.
Classification with 8 characteristics Correlation based feature subset selection + Genetic analysis. 8 data points for 80% accuracy in performance prediction.
Tying it together Take a couple of workloads and analyze results on a few platforms, using characteristics to draw conclusions. rotate-4Ms1 (one image at a time) and rotate-4Ms1w1 (multiple images in parallel): same kernel with different run rules. 90deg image rotation.
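For readers unfamiliar with the kernel, a 90-degree rotation of a row-major image can be sketched as follows. This is a minimal illustration and not the EEMBC implementation; the function name and the flat-list image layout are assumptions. It shows why such a kernel is memory intensive: the source is read sequentially, while consecutive writes land `height` elements apart in the destination.

```python
def rotate90_clockwise(src, height, width):
    """Rotate a row-major flat image (height x width) 90 degrees clockwise.

    The result is a row-major flat image of size width x height.
    """
    dst = [0] * (height * width)
    for r in range(height):
        for c in range(width):
            # Reads walk src sequentially; each step in c moves the write
            # position by `height` elements (one full destination row).
            dst[c * height + (height - 1 - r)] = src[r * width + c]
    return dst


if __name__ == "__main__":
    # 2 x 3 image:   1 2 3
    #                4 5 6
    print(rotate90_clockwise([1, 2, 3, 4, 5, 6], height=2, width=3))
```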
The platforms 3 core processor, 2 HW threads / core (soft core, tested on FPGA). 8 core processor, 4 HW threads / core. Many-core processor (> 8). GCC on all platforms. Same OS type (Linux) on all platforms. Same ISA. Load balance left to the OS to decide.
3-Core Image Rotation Speedup Here we are using parallelism to speed up processing of one image.
Analysis for 3 Core? Overall performance benefit for full configuration is 2.7x vs 2.1x. However, with 3 workers active, a system with L2 is almost twice as efficient as the one without. Not bad for a memory intensive workload. Use L2? 2 or 3 cores? Depends on the headroom you need for other applications…
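One common way to compare such results is parallel efficiency, defined as speedup divided by the number of active hardware contexts. The sketch below applies that definition to the two full-configuration speedups quoted above. The count of 6 contexts (3 cores with 2 HW threads each) is an assumption about what "full configuration" means, and the function name is hypothetical.

```python
def parallel_efficiency(speedup, contexts):
    """Speedup per active hardware context (1.0 would be perfect scaling)."""
    if contexts <= 0:
        raise ValueError("contexts must be positive")
    return speedup / contexts


if __name__ == "__main__":
    # Assumed: full configuration = 3 cores x 2 HW threads = 6 contexts.
    for speedup in (2.7, 2.1):
        eff = parallel_efficiency(speedup, 6)
        print(f"{speedup}x on 6 contexts: efficiency {eff:.2f}")
```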
Performance Results - Workers “many core” device Best performance at 5 cores active. Likely due to sync and/or cache coherency effects.
Performance Results - Streams “many core” device Best performance at 3 cores active. Likely due to contention for memory.
Analysis – Many Core Device? Assuming a part of our target application shares similar characteristics with this kernel, we can speed up processing of a single stream by allocating ~4 cores per stream, and can efficiently process 2-3 streams at a time.
Platform Bottlenecks? Cache coherence and synchronization issues above 4 workers exposed for this type of workload (memory intensive). Memory contention exposed for multiple streams with that type of access: 30% memory instructions * 3 streams saturate the memory, and above that memory contention kills performance. Splurge for the many-core version? What will you run on the other cores?
8 core with 4 hardware threads / core Hardware threads enable 4x speedup.
8 core with 4 hardware threads / core Multiple streams scale even more (5.5x). Take care not to oversubscribe.
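A simple guard against oversubscription is to compare the total number of software threads with the number of hardware contexts. The sketch below assumes that each stream runs a fixed number of workers and that one software thread per hardware context is the limit; both the function names and that policy are illustrative assumptions, not MultiBench run rules.

```python
def hardware_contexts(cores, threads_per_core):
    """Total hardware threads the platform can run at once."""
    return cores * threads_per_core


def is_oversubscribed(streams, workers_per_stream, cores, threads_per_core):
    """True when more software threads are requested than hardware contexts exist."""
    return streams * workers_per_stream > hardware_contexts(cores, threads_per_core)


if __name__ == "__main__":
    # The 8 core, 4 HW threads / core platform offers 32 contexts.
    print(hardware_contexts(8, 4))
    print(is_oversubscribed(streams=8, workers_per_stream=4, cores=8, threads_per_core=4))
    print(is_oversubscribed(streams=9, workers_per_stream=4, cores=8, threads_per_core=4))
```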
IP Reassembly IP-reassembly workload over 4M, one platform actually drops in performance! Is it architecture or software that makes scaling difficult? [Chart series: 3 Core, Different ISA, Many Core]
Summary Use your multiple cores wisely! Understanding the capabilities of your platform is a key to your ability to utilize them, as much as understanding your code. Join EEMBC to use state of the art benchmarks or help define the next generation. More at
Questions?
Let us look at MD5 (a different workload in the suite) Control – extremely low (mostly int ops). Memory access pattern – sequential. Memory ops – 20%. Typical for a computationally intensive workload. Same platforms as before.
Speedup – 3 Core >3x for multiple streams (250% increase in performance)! 60% speedup for a single stream.
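The slide mixes two ways of quoting a gain: a speedup factor ("3x") and a percentage increase ("250%"). The conversion is simple arithmetic, sketched below with hypothetical helper names. A 250% increase corresponds to a 3.5x factor, which is consistent with the ">3x" claim, and a 60% speedup corresponds to a 1.6x factor.

```python
def percent_increase(speedup_factor):
    """Convert a speedup factor (e.g. 3.5 for 3.5x) to a percentage increase."""
    return (speedup_factor - 1.0) * 100.0


def speedup_factor(percent):
    """Convert a percentage increase (e.g. 60 for 60%) to a speedup factor."""
    return 1.0 + percent / 100.0


if __name__ == "__main__":
    print(percent_increase(3.5))   # multiple streams on the 3 core platform
    print(speedup_factor(60))      # single stream on the 3 core platform
```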
More than 3x on 3 cores? Virtual CPU (thread) able to squeeze more performance for very little additional silicon. Only one of the 30 benchmarks in the suite did not gain performance from utilizing HW thread technology.
Performance Results “many core” Synchronization overhead comes into effect! Memory contention affirmed.
8 core with 4 threads / core Higher compute load makes hardware threads shine with 9x speedup on an 8 core system. Even single stream performance scales up to 5x.
Backup - Architect
Suite Analyzed A standard subset of MultiBench. All workloads limited to 4M working set size per context activated. 1 Context – 4M needed. 4 Contexts – 16M will be needed. Standardized run rules and marks capturing performance and scalability of a platform.
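The memory requirement scales linearly with the number of active contexts, as the two examples on the slide show. A minimal sketch of that arithmetic follows; the constant and function names are hypothetical, and "M" is kept as the unit exactly as the slide states it.

```python
WORKING_SET_PER_CONTEXT_M = 4  # per-context limit from the slide, in "M"


def total_working_set_m(contexts):
    """Total working set needed when `contexts` contexts are active."""
    if contexts < 0:
        raise ValueError("contexts must be non-negative")
    return contexts * WORKING_SET_PER_CONTEXT_M


if __name__ == "__main__":
    for n in (1, 4):
        print(f"{n} context(s): {total_working_set_m(n)}M needed")
```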
What information? ILP. Dynamic and static instruction distribution. Memory profile (static and dynamic). Cache effects. Predictability. Synchronization events. … more available and analyzed as the industry adds new tools.
Why MultiBench? Multicore is everywhere. Current metrics misleading (rate, DMIPS, etc). Judging performance potential is much more complex (as if benchmarking was not complex enough). Hence our focus on benchmarking embedded multicore solutions. Need workloads close to real life.
Important Workload Characteristics Memory: 35% of the instructions are memory operations; with moderate memory activity, any memory bottlenecks will be multicore related. Control: extremely predictable, so any performance bottlenecks are not related to pipeline bubbles. Strides: read access is sequential or nearly so, while write access has a stride of ~4K. Combined with high cache reuse and the nature of the algorithm, this suggests cache coherency traffic. Sync: once per ~4K of data. Other: for this workload, the other characteristics do not provide additional insights.