A research CAT
Simone Campanoni simonec@eecs.northwestern.edu
Sequential programs are not accelerating like they used to: once core frequency scaling stalled and the multicore era began, a growing performance gap opened between the platform and a sequential program running on it (performance, log scale, 1992 to 2004 and beyond)
Multicores are underutilized
Single application: not enough explicit parallelism; developing parallel code is hard; sequentially-designed code is still ubiquitous
Multiple applications: only a few CPU-intensive applications run concurrently on client devices
Parallelizing compiler: Exploit unused cores to accelerate sequential programs
Non-numerical programs need to be parallelized
Parallelize loops to parallelize a program: 99% of execution time is spent in loops (outermost loops shown on the time line)
DOACROSS parallelism: loop iterations (iteration 0, 1, 2, ...) are distributed across cores
Each iteration is split into a sequential segment (the loop-carried updates c=f(c); d=f(d)) and a parallel segment (work()) that can overlap with the other iterations (a sketch of such a loop follows)
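To make the notation on these slides concrete, here is a minimal sequential C sketch of the kind of loop DOACROSS targets; f, work, and the loop bound are illustrative stand-ins, not code from the benchmarks.

```c
/* Illustrative sketch (not from the benchmarks): a loop whose body mixes
 * loop-carried updates (c = f(c); d = f(d)) with work that only depends on
 * the current iteration (work(i)).  DOACROSS keeps the loop-carried part in
 * iteration order and overlaps the rest across cores. */
#include <stdio.h>

static long f(long v)    { return v * 3 + 1; }      /* carries a dependence        */
static long work(long i) { return (i * i) % 7919; } /* independent per iteration   */

int main(void) {
    long c = 1, d = 2, acc = 0;
    for (long i = 0; i < 1000; i++) {
        c = f(c);            /* sequential segment: needs c from iteration i-1 */
        d = f(d);            /* sequential segment: needs d from iteration i-1 */
        acc += work(i);      /* parallel segment: independent across iterations */
    }
    printf("c=%ld d=%ld acc=%ld\n", c, d, acc);
    return 0;
}
```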
HELIX: DOACROSS for multicore [Campanoni et al., CGO 2012; DAC 2012; IEEE Micro 2012]
The compiler wraps each sequential segment in a Wait/Signal pair (Wait 0 ... Signal 0 around c=f(c), Wait 1 ... Signal 1 around d=f(d)), so cores execute each segment of consecutive iterations in order while the parallel segments (work()) overlap (a minimal sketch follows)
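A minimal pthreads sketch of the Wait/Signal scheme described above, assuming spin-waiting on per-segment iteration counters; this is my own simplification for illustration, not the code HELIX actually generates.

```c
/* Simplified HELIX-style DOACROSS: iterations are assigned to threads
 * round-robin; each sequential segment is entered in iteration order via
 * Wait/Signal, implemented here as a per-segment counter of completed
 * iterations.  Spin-waiting keeps the example short. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    1000

static long f(long v)    { return v * 3 + 1; }
static long work(long i) { return (i * i) % 7919; }

static long c = 1, d = 2;
static atomic_long acc = 0;
static atomic_long seg0_done = 0;   /* iterations that finished segment 0 */
static atomic_long seg1_done = 0;   /* iterations that finished segment 1 */

static void wait_for(atomic_long *seg, long iter) {
    while (atomic_load(seg) != iter) ;       /* Wait: spin until it is our turn   */
}
static void signal_done(atomic_long *seg, long iter) {
    atomic_store(seg, iter + 1);             /* Signal: let iteration i+1 enter   */
}

static void *worker(void *arg) {
    long tid = (long)arg;
    for (long i = tid; i < NITER; i += NTHREADS) {   /* round-robin iterations */
        wait_for(&seg0_done, i);
        c = f(c);                                    /* sequential segment 0   */
        signal_done(&seg0_done, i);

        wait_for(&seg1_done, i);
        d = f(d);                                    /* sequential segment 1   */
        signal_done(&seg1_done, i);

        atomic_fetch_add(&acc, work(i));             /* parallel segment       */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("c=%ld d=%ld acc=%ld\n", c, d, (long)atomic_load(&acc));
    return 0;
}
```

The output matches the sequential sketch because every sequential segment still executes in iteration order; only the independent work overlaps.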
Parallelize loops to parallelize a program: 99% of time is spent in loops, from innermost to outermost loops on the time line
Parallelize loops to parallelize a program: outermost loops give high coverage but are harder to analyze and need more communication; innermost loops are easier to analyze and communicate less but cover less time; HELIX sits between the two extremes
HELIX: DOACROSS for multicore [Campanoni et al., CGO 2012; DAC 2012; IEEE Micro 2012]
Small Loop Parallelism: HELIX speedup on SPEC INT over the sequential baseline on a 4-core Intel Nehalem, compared with ICC, Microsoft Visual Studio, and conventional DOACROSS; HELIX-RC and HELIX-UP build on HELIX
Outline
Small Loop Parallelism and HELIX [CGO 2012, DAC 2012, IEEE Micro 2012]
HELIX-RC: Architecture/Compiler Co-Design [ISCA 2014]
HELIX-UP: Unleash Parallelization [CGO 2015]
SLP challenge: short loop iterations
Across the SPEC CPU Int benchmarks, most parallelized loop iterations last only tens of clock cycles, comparable to or shorter than the adjacent-core communication latency
A compiler-architecture co-design to efficiently execute short iterations
Compiler: identify the latency-critical code in each small loop (the code that generates shared data, i.e., the sequential segments between Wait 0/Signal 0 and Wait 1/Signal 1) and expose that information to the architecture
Architecture: a ring cache that reduces the communication latency on the critical path
Light-weight enhancement of today's multicore architecture
Without it, shared data produced in one iteration (e.g., Store X on core 0 in iteration 0) reaches the consuming iteration on another core (Load X on core 1 in iteration 1) through the private L1s and the last-level cache, costing 75-260 cycles; HELIX-RC attaches a ring node to each core's L1
With the ring cache, stores executed inside a sequential segment (between Wait 0 and Signal 0) are proactively forwarded from ring node to ring node, so the Load in the next iteration hits in the consuming core's local node (a toy model follows)
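A toy, single-threaded C model of the forwarding idea, assuming one small direct-mapped node per core; the type and function names (ring_node_t, ring_store, ring_load) are invented for illustration and do not describe the real hardware.

```c
/* Toy model: a store to shared data inside a sequential segment is forwarded
 * hop by hop around the ring, so the core that consumes it in the next
 * iteration hits locally instead of paying a trip to the last-level cache. */
#include <stdio.h>

#define NCORES 4
#define NLINES 8

typedef struct { long addr[NLINES]; long val[NLINES]; int used[NLINES]; } ring_node_t;
static ring_node_t node[NCORES];

/* Producer on core `src` stores `val` to shared address `addr`:
 * the value is written locally, then forwarded to every other node. */
static void ring_store(int src, long addr, long val) {
    for (int hop = 0; hop < NCORES; hop++) {
        ring_node_t *n = &node[(src + hop) % NCORES];
        int slot = (int)(addr % NLINES);            /* direct-mapped for simplicity */
        n->addr[slot] = addr; n->val[slot] = val; n->used[slot] = 1;
    }
}

/* Consumer on core `dst` loads `addr`: hit in the local node if forwarded. */
static int ring_load(int dst, long addr, long *out) {
    int slot = (int)(addr % NLINES);
    ring_node_t *n = &node[dst];
    if (n->used[slot] && n->addr[slot] == addr) { *out = n->val[slot]; return 1; }
    return 0;                                       /* miss: would fall back to the LLC */
}

int main(void) {
    long x;
    ring_store(0, 0x100, 42);                       /* core 0, iteration i, inside Wait/Signal */
    if (ring_load(1, 0x100, &x))                    /* core 1, iteration i+1 */
        printf("core 1 hit locally: %ld\n", x);
    return 0;
}
```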
Ring cache hit rate: 98%
The importance of HELIX-RC: results on numerical and non-numerical programs
Outline
Small Loop Parallelism and HELIX [CGO 2012, DAC 2012, IEEE Micro 2012]
HELIX-RC: Architecture/Compiler Co-Design [ISCA 2014]
HELIX-UP: Unleash Parallelization [CGO 2015]
HELIX and its limitations
Performance is lower than you would like, inconsistent across architectures (4 cores: Nehalem 2.77x, Bulldozer 2.31x, Haswell 1.68x), and sensitive to dependence-analysis accuracy (78% accuracy: 1.19x; 79% accuracy: 1.61x)
What can we do to improve it?
Opportunity: relax program semantics Some workloads tolerate output distortion Output distortion is workload-dependent
Relaxing transformations remove performance bottlenecks
Sequential bottleneck: when a dependence forces instructions into a sequential segment, relaxing (dropping) that dependence lets the threads execute those instructions in parallel, trading output distortion for speedup
Relaxing transformations remove performance bottlenecks: sequential bottlenecks, communication bottlenecks, and data-locality bottlenecks
For each code region there is a spectrum of options: no relaxing transformation (no output distortion, baseline performance) up through relaxing transformations 1, 2, ..., k (increasing output distortion, up to maximum performance); a concrete sketch follows
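As a concrete, deliberately simple illustration of trading distortion for parallelism, the sketch below relaxes a shared histogram update: the exact version serializes the update the way a sequential segment would, while the relaxed version drops the synchronization and may lose a few increments. This is my own example with a single on/off knob, not one of HELIX-UP's actual transformations.

```c
/* Exact vs. relaxed update of a shared histogram.  The relaxed path has a
 * data race (undefined behavior in strict C); it is shown only to illustrate
 * how dropping a dependence removes a sequential bottleneck at the cost of
 * a small output distortion (lost updates). */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    (1 << 20)
#define NBINS    16

static long hist[NBINS];                       /* shared output                  */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int relaxed;                            /* knob: 0 = exact, 1 = relaxed   */

static void *worker(void *arg) {
    long tid = (long)arg;
    for (long i = tid; i < NITER; i += NTHREADS) {
        int bin = (int)((i * 2654435761u) % NBINS);   /* stand-in for real work */
        if (relaxed) {
            hist[bin]++;                       /* unsynchronized: may lose updates */
        } else {
            pthread_mutex_lock(&lock);         /* the "sequential segment"         */
            hist[bin]++;
            pthread_mutex_unlock(&lock);
        }
    }
    return NULL;
}

int main(void) {
    for (relaxed = 0; relaxed <= 1; relaxed++) {
        long total = 0;
        for (int b = 0; b < NBINS; b++) hist[b] = 0;
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
        for (long i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        for (int b = 0; b < NBINS; b++) total += hist[b];
        printf("%s: %ld of %d updates kept\n", relaxed ? "relaxed" : "exact", total, NITER);
    }
    return 0;
}
```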
Design space of HELIX-UP
A configuration assigns a relaxing transformation to each code region (e.g., transformation 3 to code region 1, transformation 5 to code region 2); each configuration is a point in the performance / energy-saved / output-distortion space
The user provides output distortion limits, the system finds the best configuration, and the parallelized code runs with that configuration
Pruning the design space
Empirical observation: transforming a code region affects only the loop it belongs to
With 50 loops, 2 code regions per loop, and 2 transformations per code region: complete space = 2^100 configurations; pruned space = 50 * 2^2 = 200 (the sketch below just computes the two sizes)
How well does HELIX-UP perform?
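A small sketch of the arithmetic behind the pruning, using the numbers from the slide (50 loops, 2 regions per loop, 2 choices per region); because a transformation only affects its own loop, loops can be tuned independently.

```c
/* Compare the size of the full cross-product space with the per-loop,
 * independently tuned space implied by the empirical observation. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const int loops = 50, regions_per_loop = 2, choices_per_region = 2;
    const int total_regions = loops * regions_per_loop;

    /* Complete space: every combination of choices over all 100 regions. */
    double complete = pow((double)choices_per_region, (double)total_regions);

    /* Pruned space: choices^regions per loop, explored one loop at a time. */
    long per_loop = 1;
    for (int r = 0; r < regions_per_loop; r++) per_loop *= choices_per_region;
    long pruned = (long)loops * per_loop;

    printf("complete space: %.3g configurations\n", complete);   /* ~1.27e30 */
    printf("pruned space:   %ld configurations\n", pruned);      /* 200      */
    return 0;
}
```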
HELIX-UP unblocks extra parallelism with small output distortions (Nehalem, 6 cores, 2 threads per core; baseline: HELIX with no relaxing transformations)
Performance/distortion tradeoff for 256.bzip2: speedup as a function of allowed output distortion (%), with HELIX as the reference
Run-time code tuning: static HELIX-UP decides how to transform the code based on profile data averaged over inputs; the runtime instead reacts to transient bottlenecks by adjusting the code accordingly (see the sketch below)
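A hedged sketch of such a tuning loop, assuming a single integer knob per region, a stall-fraction signal, and a distortion estimate; the function names, thresholds, and stubbed measurements are invented for illustration, so this is not the HELIX-UP runtime.

```c
/* The runtime periodically samples how long threads stall at Wait and
 * raises or lowers the knob to chase transient bottlenecks, staying within
 * the user-provided distortion budget. */
#include <stdio.h>

#define MAX_KNOB 3                         /* 0 = exact, MAX_KNOB = most relaxed */

static int    knob = 0;                    /* read by the parallelized code       */
static double distortion_budget = 0.05;    /* user-provided limit (5%)            */

/* Stub models standing in for lightweight instrumentation of the parallel
 * code; in a real system these would be measured, not simulated. */
static double sample_wait_stall_fraction(int step) {
    return (step < 5) ? 0.40 : 0.02;       /* a transient bottleneck, then relief */
}
static double estimate_output_distortion(int k) {
    return 0.01 * k;                        /* assume distortion grows with the knob */
}

static void tune_once(int step) {
    double stall = sample_wait_stall_fraction(step);
    if (stall > 0.25 && knob < MAX_KNOB &&
        estimate_output_distortion(knob + 1) <= distortion_budget) {
        knob++;                             /* stalling at Wait: relax more        */
    } else if (stall < 0.05 && knob > 0) {
        knob--;                             /* bottleneck gone: restore precision  */
    }
    printf("step %d: stall=%.2f -> knob=%d\n", step, stall, knob);
}

int main(void) {
    for (int step = 0; step < 10; step++)
        tune_once(step);
    return 0;
}
```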
Adapting code at run time unlocks more parallelism (256.bzip2: speedup vs. output distortion %, HELIX as the reference)
HELIX-UP improves more than just performance: robustness to DDG inaccuracies and consistent performance across platforms
Relaxing transformations make parallelization robust to DDG inaccuracies (256.bzip2): increasing DDG inaccuracy lowers HELIX performance but has no impact on HELIX-UP
Relaxing transformations for consistent performance: behavior as the communication latency increases
So, what’s next?
Next work: real-system evaluation of HELIX-RC, multi-program scenarios, and speculation for SLP
Small Loop Parallelism and HELIX: parallelism hides in small loops
HELIX-RC: Architecture/Compiler Co-Design: irregular programs require low latency
HELIX-UP: Unleash Parallelization: tolerating distortions boosts parallelization
Thank you!
Future work: application-specific parallelism and communication across the stack (programming language, compiler, architecture, circuit); my contribution sits at the compiler and the HW/compiler interface
Irregular data consumption: consumers of shared data vary in both distance and number, which motivates proactively broadcasting shared data
Breakdown of the remaining overhead
How to adapt code? Which code to adapt? When to adapt code?
Setting knobs statically is good enough for regular workloads (183.equake: speedup vs. output distortion %, HELIX as the reference)
HELIX-UP runtime is “good enough” 256.bzip2
Conventional transformations are still important
HELIX-UP and Related Work with no output distortion
HELIX-UP and Related Work with output distortion
Relaxing transformations remove performance bottlenecks: a sequential bottleneck is a code region executed sequentially; each such region gets a knob ranging from no output distortion (baseline performance) to maximum output distortion (maximum performance)
Small Loop Parallelism: a compiler-level granularity, between coarse-grain TLP exposed at the programming-language level and fine-grain ILP exploited at the architecture and circuit levels; the finer the granularity, the more frequent the communication
The HELIX flow: we have made many enhancements to existing code analyses, e.g., the DDG, induction variables, etc.
Simulations show that single-cycle latency is possible for adjacent-core communication
Simulator overview and accuracy: the parallel x86 workload runs on a Pin-based virtual machine that models the cores, the un-core, and the ring cache (RC), on top of the host OS
Brief description of the SPEC2K benchmarks used in our experiments
Non-numerical:
  gzip & bzip2: compression
  vpr & twolf: place & route CAD
  parser: syntactic parser of English
  mcf: combinatorial optimization (network flow, scheduling)
Numerical:
  art: neural network simulation (image recognition, machine learning)
  equake: seismic wave propagation
  mesa: emulates graphics on the CPU
  ammp: computational chemistry (e.g., atoms moving)
Characteristics of parallelized benchmarks (HELIX+RC)