Uncovering the Multicore Processor Bottlenecks
Server Design Summit
Shay Gal-On, Director of Technology, EEMBC


Agenda
- Introduction
- Basic concepts
- Sample results and analysis

Who is EEMBC?
An industry standards consortium focused on benchmarks for the embedded market. Formed in 1997, it includes most embedded silicon and tools vendors. Provides benchmark standards for automotive, networking, office automation, consumer devices, telecom, Java, multicore, and more.

CoreMark – Multicore Scalability
Information provided by Cavium for the CN58XX.

Multicore Scalability: IP Forwarding
Information provided by Cavium.

MultiBench
- A suite of benchmarks from EEMBC, targeted at multicore in general.
- Helps decide how best to use a system.
- Helps select the best processor and/or system for the job.

Why MultiBench?
If cores were cars... (slide graphic)

Workloads and Work Items
- Multiple algorithms
- Multiple datasets
- Decomposition
(Diagram: a workload decomposes into work items such as Work Item A0, Work Item A1, and Work Item B0, with concurrency within an item.)

Work Items and Workers
The collection of threads working on the same work item is referred to as that item's workers.
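To make the terminology concrete, here is a minimal sketch in C of the worker model, assuming pthreads and a made-up kernel (illustrative only, not MultiBench source): one work item, processed cooperatively by a fixed set of worker threads.

```c
/* Hypothetical sketch of the work-item/worker model (not MultiBench source).
 * A "work item" is one dataset plus an algorithm; "workers" are the threads
 * that cooperate on that single item. */
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_WORKERS 4
#define ITEM_SIZE   (1 << 20)

typedef struct {
    int   *data;    /* the work item's dataset */
    size_t begin;   /* this worker's slice of the item */
    size_t end;
} worker_arg_t;

static int item_data[ITEM_SIZE];

static void *worker(void *p) {
    worker_arg_t *a = p;
    for (size_t i = a->begin; i < a->end; i++)
        a->data[i] *= 2;   /* stand-in for the real kernel */
    return NULL;
}

int main(void) {
    pthread_t    tid[NUM_WORKERS];
    worker_arg_t args[NUM_WORKERS];
    size_t       chunk = ITEM_SIZE / NUM_WORKERS;

    for (int w = 0; w < NUM_WORKERS; w++) {
        args[w] = (worker_arg_t){ item_data, w * chunk, (w + 1) * chunk };
        pthread_create(&tid[w], NULL, worker, &args[w]);
    }
    for (int w = 0; w < NUM_WORKERS; w++)
        pthread_join(tid[w], NULL);
    puts("work item done");
    return 0;
}
```

In these terms, raising NUM_WORKERS adds concurrency within one work item, while running several such items side by side adds concurrency across items.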

Workload Characteristics
- Important to understand the inherent characteristics of a workload.
- Determine which workloads are most relevant for you.
- Valuable information, along with the algorithm description, to analyze performance results.

Classification with 8 characteristics
Correlation-based feature subset selection + genetic analysis: 8 data points suffice for 80% accuracy in performance prediction.

Tying it together
Take a couple of workloads and analyze results on a few platforms, using their characteristics to draw conclusions (see the sketch below):
- rotate-4Ms1 (one image at a time)
- rotate-4Ms1w1 (multiple images in parallel)
- Same kernel with different run rules: 90-degree image rotation.
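As a rough sketch of what such a kernel looks like (illustrative, not the EEMBC source; the rotation direction and pixel type are assumptions), a 90-degree clockwise rotation reads the source image sequentially but writes with a large stride:

```c
/* Sketch of a 90-degree clockwise rotation (illustrative; not the EEMBC
 * kernel). dst is H x W when src is W x H. Reads are sequential; writes
 * stride by `rows` pixels, which is what stresses caches and memory. */
#include <stddef.h>
#include <stdint.h>

void rotate90(const uint32_t *src, uint32_t *dst, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + (rows - 1 - r)] = src[r * cols + c];
}
```

Under a rotate-4Ms1-style run rule, workers would split the r loop of a single image; under rotate-4Ms1w1, each worker would take a whole image, so the strided writes of several images compete for memory at once.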

The platforms
- 3-core processor, 2 HW threads / core (soft core, tested on FPGA)
- 8-core processor, 4 HW threads / core
- Many-core processor (> 8 cores)
- GCC on all platforms.
- Same OS type (Linux) on all platforms.
- Same ISA.
- Load balancing left to the OS.

3-Core Image Rotation Speedup
Here we are using parallelism to speed up processing of one image.

Analysis for 3 Cores?
- Overall performance benefit for the full configuration is 2.7x vs. 2.1x. However, with 3 workers active, a system with an L2 cache is almost twice as efficient as one without. Not bad for a memory-intensive workload.
- Use L2? 2 or 3 cores? Depends on the headroom you need for other applications.
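As a reading aid, parallel efficiency here can be taken as speedup divided by active hardware contexts. A trivial helper, plugging in the quoted full-configuration speedups (3 cores x 2 HW threads = 6 contexts; the per-worker numbers behind the "twice as efficient" observation are on the chart, not quoted here):

```c
/* Parallel efficiency = speedup / active hardware contexts.
 * Numbers below are the full-configuration speedups quoted on the slide. */
#include <stdio.h>

static double efficiency(double speedup, int contexts) {
    return speedup / contexts;
}

int main(void) {
    printf("with L2:    %.2f\n", efficiency(2.7, 6));  /* 0.45 */
    printf("without L2: %.2f\n", efficiency(2.1, 6));  /* 0.35 */
    return 0;
}
```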

Performance Results – Workers, “many core” device
- Best performance at 5 cores active
- Likely due to synchronization and/or cache coherency effects

Performance Results – Streams, “many core” device
- Best performance at 3 cores active
- Likely due to contention for memory

Analysis – Many-Core Device?
Assuming a part of our target application shares similar characteristics with this kernel, we can speed up processing of a single stream by allocating ~4 cores per stream, and can efficiently process 2-3 streams at a time.

Platform Bottlenecks?
- Cache coherence and synchronization issues above 4 workers are exposed by this type of (memory-intensive) workload.
- Memory contention is exposed by multiple streams with this type of access: 30% memory instructions x 3 streams saturate the memory, and above that memory contention kills performance.
- Splurge for the many-core version? What will you run on the other cores?
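A back-of-envelope model of that saturation point, where every constant except the 30% memory-op fraction is an assumed placeholder:

```c
/* Back-of-envelope memory-pressure estimate. All constants are hypothetical
 * placeholders; the slide states only that ~30% of instructions are memory
 * ops and that ~3 streams saturate the memory system. */
#include <stdio.h>

int main(void) {
    double ipc          = 1.0;     /* instructions/cycle per core (assumed) */
    double freq_hz      = 1.0e9;   /* core clock (assumed) */
    double mem_fraction = 0.30;    /* memory ops per instruction (from slide) */
    double bytes_per_op = 8.0;     /* average access size (assumed) */
    double sustain_bw   = 8.0e9;   /* sustainable DRAM bandwidth, B/s (assumed) */

    for (int streams = 1; streams <= 5; streams++) {
        double demand = streams * ipc * freq_hz * mem_fraction * bytes_per_op;
        printf("%d stream(s): demand %.1f GB/s %s\n", streams, demand / 1e9,
               demand > sustain_bw ? "(saturated)" : "");
    }
    return 0;
}
```

With these placeholder numbers, demand crosses the sustainable bandwidth between 3 and 4 streams, which is the shape of the behavior the slide describes.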

8 core with 4 hardware threads / core
- Hardware threads enable a 4x speedup.

8 core with 4 hardware threads / core
- Multiple streams scale even further (5.5x).
- Take care not to oversubscribe.
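A common guard against oversubscription (a sketch; MultiBench itself controls worker counts through its run rules) is to size the worker pool to the number of online hardware contexts rather than hard-coding it:

```c
/* Sketch: size a worker pool to the number of online hardware contexts
 * instead of a hard-coded count, to avoid oversubscription. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long hw_contexts = sysconf(_SC_NPROCESSORS_ONLN);
    if (hw_contexts < 1)
        hw_contexts = 1;   /* conservative fallback if the query fails */
    printf("launching %ld workers\n", hw_contexts);
    /* ... create hw_contexts worker threads here ... */
    return 0;
}
```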

IP Reassembly
- On the IP-reassembly workload over 4M, one platform actually drops in performance!
- Is it the architecture or the software that makes scaling difficult?
(Chart compares the 3-core, different-ISA, and many-core platforms.)

Summary
- Use your multiple cores wisely!
- Understanding the capabilities of your platform is as much a key to utilizing them as understanding your code.
- Join EEMBC to use state-of-the-art benchmarks or help define the next generation.
More at

Questions?

Let us look at MD5 (a different workload in the suite)
- Control – extremely low (mostly integer ops)
- Memory access pattern – sequential
- Memory ops – 20%
- Typical for a computationally intensive workload.
- Same platforms as before.
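For flavor, a minimal sketch of the "multiple independent streams" shape of this workload, assuming one pthread per stream and OpenSSL's MD5 (illustrative only; not the MultiBench code):

```c
/* Sketch: one worker per independent stream, each computing MD5 over its
 * own buffer. Illustrative only; not the MultiBench workload. Compile with
 * -lcrypto (OpenSSL; MD5() is deprecated in OpenSSL 3 but still available). */
#include <openssl/md5.h>
#include <pthread.h>
#include <stdio.h>

#define STREAMS    4
#define STREAM_LEN (4 * 1024 * 1024)

static unsigned char bufs[STREAMS][STREAM_LEN];
static unsigned char digests[STREAMS][MD5_DIGEST_LENGTH];

static void *hash_stream(void *p) {
    size_t s = (size_t)p;
    /* Sequential reads, almost no branching: compute-bound per stream. */
    MD5(bufs[s], STREAM_LEN, digests[s]);
    return NULL;
}

int main(void) {
    pthread_t tid[STREAMS];
    for (size_t s = 0; s < STREAMS; s++)
        pthread_create(&tid[s], NULL, hash_stream, (void *)s);
    for (size_t s = 0; s < STREAMS; s++)
        pthread_join(tid[s], NULL);
    printf("first byte of digest 0: %02x\n", digests[0][0]);
    return 0;
}
```

Because each stream is independent and compute-bound, this shape is exactly where hardware threads and extra cores should pay off, which is what the following slides show.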

Speedup – 3 Core
- >3x for multiple streams (a 250% increase in performance)!
- 60% speedup for a single stream.

More than 3x on 3 cores?
- Virtual CPUs (hardware threads) are able to squeeze out more performance for very little additional silicon.
- Only one of the 30 benchmarks in the suite did not gain performance from utilizing HW thread technology.

Performance Results – “many core”
- Synchronization overhead comes into effect!
- Memory contention affirmed.

8 core with 4 threads / core
- The higher compute load makes hardware threads shine, with a 9x speedup on an 8-core system.
- Even single-stream performance scales up to 5x.

Backup - Architect

Suite Analyzed
- A standard subset of MultiBench.
- All workloads limited to a 4M working-set size per activated context.
  - 1 context – 4M needed.
  - 4 contexts – 16M will be needed.
- Standardized run rules and marks capturing the performance and scalability of a platform.

What information?
- ILP
- Dynamic and static instruction distribution
- Memory profile (static and dynamic)
- Cache effects
- Predictability
- Synchronization events
- ... more available and analyzed as the industry adds new tools

Why MultiBench?
- Multicore is everywhere.
  - Current metrics are misleading (rate, DMIPS, etc.)
  - Judging performance potential is much more complex (as if benchmarking was not complex enough).
  - Hence our focus on benchmarking embedded multicore solutions.
  - Need workloads close to real life.

Important Workload Characteristics
- Memory
  - 35% of the instructions are memory operations: moderate memory activity, so any memory bottlenecks will be multicore related.
- Control
  - Extremely predictable: any performance bottlenecks are not related to pipeline bubbles.
- Strides
  - Read access is sequential or nearly so, while write access has a stride of ~4K. Combined with high cache reuse and the nature of the algorithm, this produces cache coherency traffic (see the sketch after this list).
- Sync
  - Once per ~4K of data.
- Other?
  - For this workload, the other characteristics do not provide additional insights.
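The stride numbers line up with the rotation kernel sketched earlier. Assuming an image of 1024 rows of 4-byte pixels and one synchronization point per output row (the slide states only the ~4K figures; both assumptions are mine), the arithmetic works out as follows:

```c
/* Illustration of the access pattern described above. ROWS and PIXEL_BYTES
 * are assumed values chosen so the arithmetic matches the slide's ~4K
 * figures; the per-row sync is likewise an assumption. */
#include <stdio.h>

#define ROWS        1024
#define PIXEL_BYTES 4

int main(void) {
    /* Column-major writes (as in the 90-degree rotation sketched earlier)
     * land one full column apart: */
    printf("write stride: %d bytes\n", ROWS * PIXEL_BYTES);  /* ~4K */
    /* Reads walk the source row sequentially: */
    printf("read stride:  %d bytes\n", PIXEL_BYTES);
    /* One assumed synchronization point per ~4K of data processed: */
    printf("sync every:   %d bytes\n", ROWS * PIXEL_BYTES);
    return 0;
}
```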