Cache-Aware Partitioning of Multi-Dimensional Iteration Spaces


Cache-Aware Partitioning of Multi-Dimensional Iteration Spaces
Arun Kejariwal (Yahoo! Inc., Santa Clara, CA)
Alexandru Nicolau, Utpal Banerjee, Alexander V. Veidenbaum (UC Irvine, CA)
Constantine D. Polychronopoulos (University of Illinois, Urbana, IL)
Presenter: Olga Golovanevsky

Outline
- Motivation
- Motivating examples
- The techniques
- Results
- Conclusion

Motivation
- Multi-cores are becoming ubiquitous; examples include Intel's Sandy Bridge, IBM's Cell and POWER, and Sun's UltraSPARC T* family
- The number of cores is expected to increase, making large-scale hardware parallelism available
- Software challenges:
  - Thread-level application parallelization
  - How to map threads onto different cores
  - Load balancing
  - Data affinity

Application Parallelization
- Loops account for most of application run-time
- Loop classification (see the sketch below):
  - DOALL: no loop-carried dependence; amenable to auto-parallelization; iterations execute in parallel on different threads
  - Non-DOALL: thread-synchronization support is needed for parallelization
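For concreteness (my illustration, not from the slides), a minimal C/OpenMP sketch of the two loop classes; the function names scale and prefix_sum are hypothetical:

    /* DOALL: each iteration writes a distinct element and reads only
       iteration-local data, so there is no loop-carried dependence and
       the iterations may execute in parallel on different threads. */
    void scale(double *a, const double *b, double s, long n) {
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] = s * b[i];
    }

    /* Non-DOALL: iteration i reads the value written by iteration i-1,
       so the iterations cannot simply be distributed across threads
       without synchronization. */
    void prefix_sum(double *a, long n) {
        for (long i = 1; i < n; i++)
            a[i] += a[i - 1];
    }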

Parallel Execution of DOALL Loops
- Auto-parallelized, or
- Directive-driven parallelization (e.g., OpenMP pragmas)
- Issue with parallel execution: load balancing
  - How should the iteration space be partitioned for best performance?
  - The naïve way: partition the iteration space uniformly among the threads (sketched below)
  - This does not yield the best performance!
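The naïve uniform partitioning corresponds to what OpenMP's default static schedule does; a sketch of the equivalent manual chunking (uniform_partition is a hypothetical helper, not from the paper):

    #include <omp.h>

    /* Naïve uniform partitioning: thread t gets a contiguous block of
       roughly n/nthreads outer-loop iterations, regardless of how much
       work (or how many cache misses) each iteration actually incurs. */
    void uniform_partition(long n, void (*body)(long)) {
        #pragma omp parallel
        {
            int t        = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            long chunk = (n + nthreads - 1) / nthreads; /* ceil(n/nthreads) */
            long lo = t * chunk;
            long hi = (lo + chunk < n) ? lo + chunk : n;
            for (long i = lo; i < hi; i++)
                body(i); /* one outer-loop iteration */
        }
    }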

Iteration Space Partitioning
- Why is it non-trivial?
  - Non-rectangular geometry of the iteration space (illustrated below)
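To make the geometry issue concrete (an illustrative example, not taken from the slides): in a triangular loop nest, a uniform split of the outer loop is badly imbalanced.

    /* Triangular iteration space: outer iteration i executes i inner
       iterations, so total work is ~N*N/2.  A uniform split of the
       outer range [0, N) between two threads gives the first thread
       ~N*N/8 inner iterations and the second ~3*N*N/8, a 3x imbalance. */
    void triangular(double **A, long N) {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            for (long j = 0; j < i; j++)
                A[i][j] = 0.5 * (A[i][j] + A[j][i]);
    }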

Iteration Space Partitioning (contd.)
- Why is it non-trivial?
  - Use of indirect referencing leads to a non-uniform cache-miss profile
  - [Figure: variation in L1 cache misses across the iteration space for 462.libquantum and 435.gromacs]
- Example from 435.gromacs (indirect array accesses through jjnr):

    do k = nj0, nj1
       jnr = jjnr(k) + 1
       j3  = 3*jnr - 2
       ...
       faction(j3)   = faction(j3)   - tx11
       faction(j3+1) = faction(j3+1) - ty11
       faction(j3+2) = faction(j3+2) - tz11
    end do

Iteration Space Partitioning (contd.)
- Why is it non-trivial?
  - Non-perfect multi-way loops: the outermost loop may contain multiple loops at the same nesting level
  - Conditional execution of inner loops
- Example (the outer k loop contains two inner loop nests; the slide's figure shows the k axis split between threads T1 and T2):

    do k = 2, nk-1
       ...
       do j = 1, ny            ! first loop
          do i = 1, nx
             ... = A(k,j,i)    ! read
          end do
       end do
       do j = 2, ny-1          ! second loop
          do i = 2, nx-1
             A(k,j,i) = ...    ! write
          end do
       end do
    end do

Iteration Space Partitioning (contd.)
- Why is it non-trivial?
  - Presence of conditionals in the loop body leads to a non-uniform workload distribution
  - [Figure: variation in instructions retired across the iteration space for 403.gcc and 434.zeusmp]
- Example from 434.zeusmp (data-dependent conditionals inside a triply nested loop):

    do 90 k = ks, ke
       do 80 j = js, je
          do 70 i = is, ie
             ...
             if ((rin .lt. rsq) .and. (rout .le. rsq)) then
                ...
             endif
             if ((rin .lt. rsq) .and. (rout .gt. rsq)) then
                ...
             endif
    70    continue
    80 continue
    90 continue

How to Partition? Guiding factors:
- Partition the outermost loop: this minimizes scheduling overhead
- Geometry-aware: model the iteration space as a convex polytope; loop indices are affine functions of outer loop indices
- Cache-aware:
  - account for the non-uniform cache-miss profile across the iteration space
  - account for the non-uniform workload distribution across the iteration space

Algorithm: high-level steps
- Obtain the cache-miss profile
- Obtain the workload distribution
- Compute the total volume of the iteration space, weighted by cache misses and instructions retired
- Given n threads, compute n-1 breakpoints along the axis corresponding to the outermost loop, such that
  - each breakpoint delimits a set, and
  - each set has equal weighted volume
- Map each set onto a different thread (a sketch of the breakpoint computation follows this list)
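A minimal sketch of the breakpoint computation (my own illustration, assuming a per-outer-iteration weight array already derived from the measured cache-miss and instructions-retired profiles):

    #include <stdlib.h>

    /* Given weight[i] for each of n_iter outer-loop iterations, compute
       n_threads-1 breakpoints so that each contiguous set of iterations
       has (approximately) equal weighted volume.  Returns an array b of
       length n_threads-1; thread t executes iterations [b[t-1], b[t]). */
    long *compute_breakpoints(const double *weight, long n_iter, int n_threads) {
        double total = 0.0;
        for (long i = 0; i < n_iter; i++)
            total += weight[i];

        long *b = malloc((n_threads - 1) * sizeof *b);
        double target = total / n_threads; /* weighted volume per thread */
        double acc = 0.0;
        int t = 0;
        for (long i = 0; i < n_iter && t < n_threads - 1; i++) {
            acc += weight[i];
            if (acc >= (t + 1) * target)
                b[t++] = i + 1; /* breakpoint after iteration i */
        }
        while (t < n_threads - 1) /* degenerate case: trailing zero weights */
            b[t++] = n_iter;
        return b;
    }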

Experimental Setup
- Use built-in hardware performance counters (see the sketch below):
  - MEM_LOAD_RETIRED.L1D_MISS to obtain the cache-miss profile
  - INST_RETIRED.ANY to obtain the instructions-retired profile
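One way to collect such profiles (my sketch, not the authors' harness; it assumes PAPI 5.x and that the platform exposes native events under the names shown, which you can check with the papi_native_avail utility):

    #include <papi.h>

    /* Hypothetical helper: sample L1D misses and retired instructions
       around one outer-loop iteration.  The event-name strings are
       platform-specific assumptions. */
    void profile_iteration(void (*body)(void), long long out[2]) {
        int evset = PAPI_NULL;
        PAPI_library_init(PAPI_VER_CURRENT); /* idempotent after first call */
        PAPI_create_eventset(&evset);
        PAPI_add_named_event(evset, "MEM_LOAD_RETIRED:L1D_MISS");
        PAPI_add_named_event(evset, "INST_RETIRED:ANY_P");
        PAPI_start(evset);
        body();                /* one outer-loop iteration */
        PAPI_stop(evset, out); /* out[0] = L1D misses, out[1] = instructions */
    }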

Kernel Set

Results (contd.)
- Two metrics are computed:
  - Speedup (%) = (t_co - t_ca) / t_ca * 100, where t_co is the run time under cache-oblivious partitioning and t_ca the run time under the proposed cache-aware partitioning (e.g., t_co = 10 s and t_ca = 8 s yield a 25% speedup)
  - Deviation: the difference between the proposed technique and the worst case

Thank You!

Results (contd.)
- Performance variation with different partitioning planes (3 threads)

Results
- Performance variation with different partitioning planes
- A kernel from 178.galgel: a nested, non-perfect, multi-way DOALL loop
- 2 threads