Presentation transcript:

pOSKI: A Library to Parallelize OSKI
Ankit Jain
Berkeley Benchmarking and OPtimization (BeBOP) Project
bebop.cs.berkeley.edu
EECS Department, University of California, Berkeley
April 28, 2008

Outline
pOSKI Goals
OSKI Overview
–(Slides adapted from Rich Vuduc's SIAM CSE 2005 Talk)
pOSKI Design
Parallel Benchmark
MPI-SpMV

pOSKI Goals
Provide a simple serial interface to exploit the parallelism in sparse kernels (focus on SpMV for now)
Target multicore architectures
Hide the complex process of parallel tuning while exposing its cost
Use heuristics, where possible, to limit the search space
Design it to be extensible so it can be used in conjunction with other parallel libraries (e.g., ParMETIS)
Take Sam's work and present it in a distributable, easy-to-use format

Outline
pOSKI Goals
OSKI Overview
–(Slides adapted from Rich Vuduc's SIAM CSE 2005 Talk)
pOSKI Design
Parallel Benchmark
MPI-SpMV

OSKI: Optimized Sparse Kernel Interface
Sparse kernels tuned for the user's matrix & machine
–Hides the complexity of run-time tuning
–Low-level BLAS-style functionality: sparse matrix-vector multiply (SpMV), triangular solve (TrSV), …
–Includes fast locality-aware kernels: A^T*A*x, …
–Target: cache-based superscalar uniprocessors
Faster than standard implementations
–Up to 4x faster SpMV, 1.8x TrSV, 4x A^T*A*x
Written in C (can be called from Fortran)
Note: all speedups listed are from sequential platforms in 2005
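For concreteness, a minimal sketch of calling OSKI for SpMV from C, based on the published OSKI interface (quick-start style); exact names and signatures may differ between OSKI versions:

```c
#include <oski/oski.h>

/* y <- alpha*A*x + beta*y on a small CSR matrix, using OSKI's
 * BLAS-style interface.  Sketch only; error checking omitted. */
int main(void)
{
    int    Aptr[] = {0, 2, 3, 5};            /* 3x3 CSR row pointers */
    int    Aind[] = {0, 2, 1, 0, 2};         /* column indices       */
    double Aval[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    double x[]    = {1.0, 1.0, 1.0};
    double y[]    = {0.0, 0.0, 0.0};

    oski_Init();

    oski_matrix_t  A  = oski_CreateMatCSR(Aptr, Aind, Aval, 3, 3,
                                          SHARE_INPUTMAT, 1,
                                          INDEX_ZERO_BASED);
    oski_vecview_t xv = oski_CreateVecView(x, 3, STRIDE_UNIT);
    oski_vecview_t yv = oski_CreateVecView(y, 3, STRIDE_UNIT);

    /* Sparse matrix-vector multiply: y <- 1.0*A*x + 0.0*y */
    oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);

    oski_DestroyMat(A);
    oski_DestroyVecView(xv);
    oski_DestroyVecView(yv);
    oski_Close();
    return 0;
}
```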

How OSKI Tunes (Overview)
[Flowchart] Library install-time (offline): 1. build for the target architecture; 2. benchmark. This produces benchmark data, heuristic models, and generated code variants.
Application run-time: given the matrix, a workload from program monitoring, and tuning history, OSKI 1. evaluates the heuristic models and 2. selects a data structure & code variant; the user receives a matrix handle for kernel calls.
Extensibility: advanced users may write & dynamically add "code variants" and "heuristic models" to the system.

Cost of Tuning
Non-trivial run-time tuning cost: up to ~40 mat-vecs
–Dominated by conversion time
Design point: the user calls the "tune" routine explicitly
–Exposes the cost
–Tuning time is limited using an estimated workload, provided by the user or inferred by the library
The user may save tuning results
–To apply on future runs with a similar matrix
–Stored in a "human-readable" format
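A sketch of the explicit tuning step just described, again following the published OSKI interface: workload hints, an explicit tune call, and saving the chosen transformations in human-readable form for reuse on a similar matrix. Exact routine names may vary by OSKI version.

```c
#include <oski/oski.h>

/* Explicit tuning for an already-created handle (see the earlier OSKI
 * sketch for how A, xv, yv are built).  Sketch only. */
void tune_for_spmv(oski_matrix_t A, oski_vecview_t xv, oski_vecview_t yv)
{
    /* Tell OSKI the expected workload (roughly 500 SpMVs with these
     * operands) so it can judge whether tuning will pay off. */
    oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv, 500);

    /* The user pays the (exposed) tuning cost here, explicitly. */
    oski_TuneMat(A);

    /* Save the chosen transformations, a human-readable OSKI-Lua
     * string, so a later run with a similar matrix can reapply them
     * via oski_ApplyMatTransforms() without re-searching. */
    char *xforms = oski_GetMatTransforms(A);
    if (xforms) {
        /* ... write xforms to a file for future runs ... */
    }
}
```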

Optimizations Available in OSKI
Optimizations for SpMV (bold → heuristics):
–Register blocking (RB): up to 4x over CSR
–Variable block splitting: 2.1x over CSR, 1.8x over RB
–Diagonals: 2x over CSR
–Reordering to create dense structure + splitting: 2x over CSR
–Symmetry: 2.8x over CSR, 2.6x over RB
–Cache blocking: 3x over CSR
–Multiple vectors (SpMM): 7x over CSR
–And combinations…
Sparse triangular solve
–Hybrid sparse/dense data structure: 1.8x over CSR
Higher-level kernels
–AA^T*x, A^T*A*x: 4x over CSR, 1.8x over RB
–A^ρ*x: 2x over CSR, 1.5x over RB
Note: all speedups listed are from sequential platforms in 2005
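To illustrate the register-blocking idea behind the first optimization above, here is a hand-written sketch (not OSKI's generated code) of the SpMV inner loop for a 2x2 block compressed sparse row (BCSR) matrix; the four block values and two partial sums stay in registers across the inner loop. Array names are illustrative.

```c
/* y += A*x for A stored in 2x2 BCSR:
 *   brow_ptr[i] .. brow_ptr[i+1]-1 index the blocks of block-row i,
 *   bcol_ind[b] is the block-column of block b,
 *   bval holds each 2x2 block contiguously (row-major). */
void bcsr_2x2_spmv(int n_brows, const int *brow_ptr, const int *bcol_ind,
                   const double *bval, const double *x, double *y)
{
    for (int i = 0; i < n_brows; i++) {
        double y0 = y[2*i], y1 = y[2*i + 1];      /* accumulate in registers */
        for (int b = brow_ptr[i]; b < brow_ptr[i+1]; b++) {
            const double *blk = bval + 4*b;
            int j = 2 * bcol_ind[b];
            double x0 = x[j], x1 = x[j + 1];
            y0 += blk[0]*x0 + blk[1]*x1;
            y1 += blk[2]*x0 + blk[3]*x1;
        }
        y[2*i] = y0;  y[2*i + 1] = y1;
    }
}
```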

Outline
pOSKI Goals
OSKI Overview
–(Slides adapted from Rich Vuduc's SIAM CSE 2005 Talk)
pOSKI Design
Parallel Benchmark
MPI-SpMV

How pOSKI Tunes (Overview)
[Flowchart] Library install-time (offline): as in OSKI, the library is built for the target architecture, producing benchmark data and generated code variants; pOSKI additionally runs a parallel benchmark and stores parallel benchmark data.
Application run-time (online): the matrix is load-balanced into submatrices; a parallel heuristic model is evaluated for each submatrix; each submatrix is then tuned by OSKI (heuristic models are evaluated and a data structure & code variant is selected, yielding an OSKI matrix handle for kernel calls); the per-submatrix handles are accumulated into a single pOSKI matrix handle returned to the user for kernel calls. Tuning history is kept, as in OSKI.

Where the Optimizations Occur

Optimization            OSKI        P-OSKI
Load Balancing / NUMA   –           yes
Register Blocking       yes         –
Cache Blocking          yes         –
TLB Blocking            (future)    (currently)

Current Implementation
The Serial Interface
–Represents the S→P composition of the ParLab proposal: the parallelism is hidden under the covers
–Each serial-looking function call triggers a set of parallel events
–Manages its own thread pool, supporting up to the number of threads supported by the underlying hardware
–Manages thread and data affinity
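To make the serial-looking interface concrete, here is a hypothetical usage sketch. The poski_* names and signatures below are illustrative assumptions (the presentation does not spell out the API); they simply mirror the OSKI calls shown earlier while hiding the thread pool.

```c
/* Hypothetical pOSKI usage.  The calls look serial, but each one may
 * fan work out across the library's internal thread pool.  All poski_*
 * names and signatures here are illustrative, not an actual API. */
void example(int *Aptr, int *Aind, double *Aval, int nrows, int ncols,
             double *x, double *y)
{
    poski_Init();                                   /* create and pin the thread pool   */

    poski_matrix_t A  = poski_CreateMatCSR(Aptr, Aind, Aval, nrows, ncols);
    poski_vec_t    xv = poski_CreateVec(x, ncols);
    poski_vec_t    yv = poski_CreateVec(y, nrows);

    poski_TuneMat(A);                               /* parallel benchmark + per-thread  */
                                                    /* OSKI tuning of each submatrix    */
    poski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);  /* SpMV runs in parallel under the  */
                                                    /* covers                           */
    poski_DestroyMat(A);
    poski_DestroyVec(xv);
    poski_DestroyVec(yv);
    poski_Close();                                  /* tear down the thread pool        */
}
```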

Additional Future Interface
The Parallel Interface
–Represents the P→P composition of the ParLab proposal
–Meant for expert programmers
–Can be used to share threads with other parallel libraries
–No guarantees of thread or data affinity management
–Example use: y = A^T*A*x codes, which alternate between an SpMV step and a preconditioning step; threads can be shared between P-OSKI (for SpMV) and some parallel preconditioning library
–Example use: UPC code with an explicitly parallel execution model; the user partitions the matrix based on information P-OSKI would not be able to infer

Thread and Data Affinity (1/3)
Cache-coherent non-uniform memory access (ccNUMA): memory access times vary on modern multi-socket, multicore architectures.
Modern OSes use a 'first touch' policy when allocating memory.
Thread migration between locality domains is expensive.
–In ccNUMA, a locality domain is a set of processor cores together with locally connected memory which can be accessed without resorting to a network of any kind.
For now, we have to deal with these OS policies ourselves. The ParLab OS group is trying to solve these problems in order to hide such issues from the programmer.
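Thread migration is typically avoided by pinning each pool thread to a core in its locality domain. A minimal sketch using the Linux/glibc affinity call (pthread_setaffinity_np is a GNU extension); this is an illustration, not pOSKI's actual code:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core so the OS cannot migrate it
 * to a different locality domain after its data has been placed. */
static int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```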

Thread and Data Affinity (2/3)
The problem with malloc() and free():
–malloc() first looks for free pages on the heap and only then requests new pages from the OS
–If the available free pages reside in a different locality domain, malloc() still hands them out
–Auto-tuning codes are malloc()- and free()-intensive, so this is a serious problem

Thread and Data Affinity (3/3)
The solution: managing our own memory
–One large chunk (heap) per locality domain, allocated at the beginning of tuning
–The size of this heap is controlled by user input through an environment variable [P_OSKI_HEAP_IN_GB=2]
–Rare case: the allocated space is not big enough. Then: stop all threads; free all allocated memory; grow the amount of space significantly across all threads and locality domains; print a strong warning to the user
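Below is a sketch of the kind of per-locality-domain arena described on this slide: one large region reserved up front (sized from the P_OSKI_HEAP_IN_GB variable mentioned above), first-touched by a thread pinned to that domain, and then handed out with a bump pointer so tuning never goes back to malloc()/free(). This illustrates the idea and is not pOSKI's actual allocator.

```c
#include <stdlib.h>
#include <string.h>

/* One arena per locality domain: a bump-pointer allocator on top of a
 * region that was first-touched by a thread pinned to that domain. */
typedef struct {
    char  *base;
    size_t size;
    size_t used;
} arena_t;

/* Called once by the pinned owner thread of each locality domain. */
static int arena_init(arena_t *a)
{
    const char *env = getenv("P_OSKI_HEAP_IN_GB");    /* e.g. "2" */
    size_t gb = env ? (size_t)atoi(env) : 1;
    a->size = gb << 30;
    a->used = 0;
    a->base = malloc(a->size);
    if (!a->base) return -1;
    /* First touch from the owning thread, so the OS maps the pages in
     * this thread's locality domain. */
    memset(a->base, 0, a->size);
    return 0;
}

static void *arena_alloc(arena_t *a, size_t bytes)
{
    bytes = (bytes + 63) & ~(size_t)63;        /* keep 64-byte alignment            */
    if (a->used + bytes > a->size)
        return NULL;                           /* rare case: grow / warn the user   */
    void *p = a->base + a->used;
    a->used += bytes;
    return p;
}
```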

Outline
pOSKI Goals
OSKI Overview
–(Slides adapted from Rich Vuduc's SIAM CSE 2005 Talk)
pOSKI Design
Parallel Benchmark
MPI-SpMV

Justification
OSKI's benchmarking:
–Single-threaded
–All of the machine's memory bandwidth is given to this one thread
pOSKI's benchmarking:
–Benchmarks 1, 2, 4, …, threads (up to the hardware limit) in parallel
–Each thread uses up memory bandwidth, which resembles run-time conditions more accurately
–When each instance of OSKI chooses appropriate data structures and algorithms, it uses the data from this parallel benchmark
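A sketch of what "benchmarking in parallel" means here: every pool thread runs the benchmark kernel at the same time, behind a barrier, so each measurement reflects the bandwidth a thread actually gets when all of its siblings are active. This is an illustrative pthreads skeleton with a simple streaming kernel standing in for SpMV, not pOSKI's benchmark code.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 4            /* swept over 1, 2, 4, ... up to the hardware limit */
#define N (1 << 22)           /* per-thread working set (doubles) */

static pthread_barrier_t bar;

static void *bench(void *arg)
{
    long id = (long)arg;
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; }   /* first touch locally */

    pthread_barrier_wait(&bar);                                /* start together      */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) a[i] += 2.0 * b[i];           /* stand-in kernel     */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_barrier_wait(&bar);                                /* stop together       */

    double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("thread %ld: %.2f GB/s\n", id, 3.0 * N * sizeof(double) / sec / 1e9);
    free(a); free(b);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&bar, NULL, NTHREADS);
    for (long t = 0; t < NTHREADS; t++) pthread_create(&tid[t], NULL, bench, (void *)t);
    for (long t = 0; t < NTHREADS; t++) pthread_join(tid[t], NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}
```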

Results (1/2)
Takeaways:
1. The parallel benchmark performs at worst 2% worse than the regular benchmark, but can perform as much as 13% better.
2. Incorporating a NUMA_MALLOC interface within OSKI is of utmost importance, because without it performance is unpredictable (STATUS: in progress).
3. Superscalar speedups of > 4x: why?

Results (2/2)
Justifies the need for search.
Heuristics are needed to reduce this search, since the multicore search space is expanding exponentially.

Outline
pOSKI Goals
OSKI Overview
–(Slides adapted from Rich Vuduc's SIAM CSE 2005 Talk)
pOSKI Design
Parallel Benchmark
MPI-SpMV

Goals
Target: multi-node, multicore architectures
Design: build an MPI layer on top of pOSKI
–MPI is a starting point
Tuning parameters:
–Balance of pthreads and MPI tasks (Rajesh has found that for collectives the balance is not always clear)
–Identifying whether there are potential performance gains from assigning some of the threads (or cores) to handle only the sending/receiving of messages
Status:
–Just started; an initial version should be ready in the next few weeks
Future work:
–Explore UPC for communication
–Distributed load balancing, workload generation
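One plausible shape for such an MPI layer is sketched below: a 1-D row decomposition in which each rank owns a block of rows and a replicated copy of the source vector, performs the local SpMV with the node's threads, and gathers the result with MPI_Allgatherv. The decomposition and the local_spmv() helper are illustrative assumptions, not the planned implementation.

```c
#include <mpi.h>

/* Hypothetical: the node-local, threaded (pOSKI) SpMV on this rank's rows. */
void local_spmv(const void *A_local, const double *x, double *y);

/* One iteration of distributed y = A*x with a 1-D row decomposition.
 * Each rank owns local_rows rows of A and the matching slice of y;
 * x is replicated on every rank.  counts/displs describe each rank's
 * slice for the gather. */
void distributed_spmv(const void *A_local, int local_rows,
                      double *x_full, double *y_local,
                      const int *counts, const int *displs, MPI_Comm comm)
{
    /* Local, multithreaded SpMV on this node's block of rows. */
    local_spmv(A_local, x_full, y_local);

    /* If the next iteration needs x = y (e.g., a power iteration),
     * gather every rank's slice back into the replicated vector. */
    MPI_Allgatherv(y_local, local_rows, MPI_DOUBLE,
                   x_full, counts, displs, MPI_DOUBLE, comm);
}
```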

Questions? pOSKI Goals OSKI Overview pOSKI Design Parallel Benchmark MPI-SpMV

Extra Slides Motivation for Tuning

Motivation: The Difficulty of Tuning
[Figure: matrix spy plot] nnz = 1.5 M; kernel: SpMV; source: NASA structural analysis problem with an 8x8 dense substructure

Speedups on Itanium 2: The Need for Search
[Figure: register-blocking performance profile (Mflop/s); the reference point and the best point, 4x2 blocking, are marked]

Extra Slides Some Current Multicore Machines

Rad Lab Opteron

Niagara 2 (Victoria Falls)

NERSC Power5 [Bassi]

Cell Processor

Extra Slides SpBLAS and OSKI Interfaces

SpBLAS Interface
Create a matrix handle
Assert matrix properties
Insert matrix entries
Signal the end of matrix creation ← Tune here
Call operations on the handle
Destroy the handle
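For reference, a sketch of that call sequence in the standard Sparse BLAS C binding; note that there is no explicit tune call, so any tuning has to happen implicitly, e.g. when matrix creation ends. Constant and routine names follow the standard and may differ slightly across implementations.

```c
#include "blas_sparse.h"   /* standard Sparse BLAS C binding header */

/* y <- alpha*A*x + y, building A one entry at a time. */
void spblas_example(int m, int n, int nnz,
                    const int *I, const int *J, const double *V,
                    const double *x, double *y, double alpha)
{
    blas_sparse_matrix A = BLAS_duscr_begin(m, n);     /* 1. create matrix handle     */
    /* 2. (optionally assert properties here, e.g. symmetry, via BLAS_ussp) */
    for (int k = 0; k < nnz; k++)                      /* 3. insert matrix entries    */
        BLAS_duscr_insert_entry(A, V[k], I[k], J[k]);
    BLAS_duscr_end(A);                                 /* 4. end of creation: the     */
                                                       /*    natural place to tune    */
    BLAS_dusmv(blas_no_trans, alpha, A, x, 1, y, 1);   /* 5. operate on the handle    */
    BLAS_usds(A);                                      /* 6. destroy the handle       */
}
```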

OSKI Interface
The basic OSKI interface has a subset of the matrix-creation interface of the Sparse BLAS, exposes the tuning step explicitly, and supports a few extra kernels (e.g., A^T*A*x).
The OSKI interface was designed with the intent of implementing the Sparse BLAS using OSKI under the hood.

Extra Slides Other Ideas for pOSKI

Challenges of a Parallel Automatic Tuner
The search space increases exponentially with the number of parameters.
Parallelization across architectural parameters:
–Across multiple threads
–Across multiple cores
–Across multiple sockets
Parallelizing the data of a given problem (see the row-partitioning sketch below):
–Across rows, across columns, or checkerboard
–Based on user input in v1
–Future versions can integrate ParMETIS or other graph partitioners
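As a concrete example of the "across rows" decomposition above, the sketch below splits a CSR matrix into one block of consecutive rows per thread, balancing by nonzero count rather than by row count (an illustration, not pOSKI's partitioner):

```c
/* Split rows [0, n) into nthreads consecutive chunks with roughly
 * equal numbers of nonzeros, using the CSR row-pointer array.
 * Rows row_start[t] .. row_start[t+1]-1 are owned by thread t. */
void partition_rows_by_nnz(const int *rowptr, int n, int nthreads, int *row_start)
{
    int nnz = rowptr[n];
    int t = 0;
    row_start[0] = 0;
    for (int i = 0; i <= n; i++) {
        /* advance to the next thread each time we pass its share of nnz */
        while (t + 1 < nthreads &&
               rowptr[i] >= (long long)nnz * (t + 1) / nthreads) {
            row_start[++t] = i;
        }
    }
    while (t + 1 < nthreads) row_start[++t] = n;   /* handle empty tails */
    row_start[nthreads] = n;
}
```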

A Memory Footprint Minimization Heuristic
The problem: the search space is too large, so auto-tuning takes too long.
The rate of increase in aggregate memory bandwidth over time is not as fast as the rate of increase in processing power per machine.
Our two-step tuning process:
–Calculate the top 20% most memory-efficient configurations on thread 0
–Each thread finds its optimal block size for its sub-matrix from the list produced in step 1
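A sketch of the footprint side of this heuristic: estimate the BCSR storage for each candidate r x c block size from its fill ratio, then keep only the most memory-efficient ~20% as the shortlist each thread searches. The fill-ratio table, array bounds, and byte accounting below are illustrative assumptions.

```c
#include <stdlib.h>

typedef struct { int r, c; double bytes; } candidate_t;

static int by_bytes(const void *a, const void *b)
{
    double d = ((const candidate_t *)a)->bytes - ((const candidate_t *)b)->bytes;
    return (d > 0) - (d < 0);
}

/* Keep the most memory-efficient ~20% of the r x c candidates.
 * fill[r][c] is the estimated fill ratio (explicitly stored zeros
 * included) for that block size; nnz/nrows describe thread 0's matrix. */
int shortlist_block_sizes(double fill[5][5], long nnz, long nrows,
                          candidate_t *out /* room for >= 16 entries */)
{
    candidate_t cand[16];
    int k = 0;
    for (int r = 1; r <= 4; r++)
        for (int c = 1; c <= 4; c++) {
            long nblocks = (long)(fill[r][c] * nnz / (r * c) + 0.5);
            /* BCSR footprint: 8 bytes per stored value, 4 per block
             * column index, 4 per block-row pointer. */
            double bytes = 8.0 * nblocks * r * c + 4.0 * nblocks
                         + 4.0 * (nrows / r + 1);
            cand[k++] = (candidate_t){ r, c, bytes };
        }
    qsort(cand, k, sizeof cand[0], by_bytes);
    int keep = (k + 4) / 5;                 /* top ~20% */
    for (int i = 0; i < keep; i++) out[i] = cand[i];
    return keep;
}
```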