Fast Number Crunching, Fast Time to Market with Scala

By Richard Gomes

About me
Richard Gomes
Brazilian living in the UK since 2006
Passion for finance
Special interest in High Performance Computing (HPC)
I like photography, go-karting and table tennis
rgomes@jquantlib.org
T: frgomes

Objectives
High Performance Computing (HPC) with Scala:
Putting in context
Thinking parallel
How it works in C (one slide!)
How it works in Scala
Pros and cons

Putting in Context
What it is about:
parallelism and parallel architectures
hundreds or thousands of processing elements (PEs)
general-purpose GPUs
how to use GPUs with Scala
What it is NOT about:
multithreading
multi-core CPUs

Putting in Context
Applicability of High Performance Computing (HPC):
Geology: gas and oil prospecting
Meteorology: weather simulation
Physics: fluid dynamics, high-energy physics, ...
Biology: protein structure, genome sequencing
Media: computer graphics
Finance: price forecasting

Putting in Context
Scala gaining momentum:
Language maturity
Tooling maturity
Performance improvements
Parallel collections
Recent tooling support for HPC
260+ positions on itjobswatch.co.uk in the last 12 months

Thinking Parallel
Standard deviation, computed sequentially:

    // calculate mean
    float sum = 0;
    for (int i = 0; i < n; i++) sum += cells[i];
    float mean = sum / n;

    // calculate stddev
    sum = 0;
    for (int i = 0; i < n; i++) {
        float x = cells[i] - mean;
        sum += x * x;
    }
    float stddev = (float) Math.sqrt(sum / n);

Thinking Parallel
Identify sequential code → big logical blocks
Identify loops → candidates for execution in parallel
Turn sequential code into parallel code
Implement using parallel primitives
Benchmarks
Process: Design → Develop → Test → Tune

Thinking Parallel
Identify sequential code:
Calculation of mean
Calculation of stddev
Identify loops:
One loop when mean is calculated
One loop when stddev is calculated
Turn sequential code into parallel code:
How could the loops be performed in parallel?

Thinking Parallel
Let's suppose we have psum(), a parallel version of summation.

It was:

    // calculate mean
    int n = cells.length;
    float mean = psum(cells) / n;

    // calculate stddev
    float sum = 0;
    for (int i = 0; i < n; i++) {
        float x = cells[i] - mean;
        sum += x * x;
    }
    float stddev = (float) Math.sqrt(sum / n);

It now looks like:

    // calculate mean
    int n = cells.length;
    float mean = psum(cells) / n;

    // calculate stddev: each iteration is independent of the others,
    // so this loop is itself a candidate for parallel execution
    for (int i = 0; i < n; i++) {
        float x = cells[i] - mean;
        cells[i] = x * x;
    }
    float sum = psum(cells);
    float stddev = (float) Math.sqrt(sum / n);

Thinking Parallel in Scala

    // parallel sum (a plain placeholder for now; a truly parallel
    // version comes later, via ScalaCL)
    def psum(cells: Array[Float]): Float = cells.sum

    def mean(cells: Array[Float]): Float = psum(cells) / cells.length

    def f(cell: Float, mean: Float): Float = {
      val x = cell - mean
      x * x
    }

    def stddev(cells: Array[Float]): Float = {
      val m = mean(cells)
      math.sqrt(psum(cells.map(f(_, m))) / cells.length).toFloat
    }
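An aside not on the original slide: on the JVM, a genuinely parallel psum() can be sketched with Scala's parallel collections. This runs across CPU cores, whereas the rest of the talk pushes the work onto the GPU via ScalaCL:

    // Sketch of a parallel psum() using Scala parallel collections
    // (Scala 2.x, where .par is available on arrays out of the box).
    // The array is split into chunks, each chunk is summed on its
    // own core, and the partial sums are combined.
    def psum(cells: Array[Float]): Float = cells.par.sum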

How it works in C/C++?
Function f must be:
implemented as a kernel function
compiled by a special-purpose compiler
uploaded into the GPU
Data must be:
moved from the CPU to the GPU
moved from the GPU back to the CPU
Code must be aware of GPU specs
More info:
http://nvidia.com/cuda
http://amd.com/stream
http://khronos.org/opencl

How it works in Scala?
Introducing ScalaCL. It is a compiler plugin:
provides bytecode optimizations
generates and compiles the kernel code for you
handles kernel code uploading
handles data transfers between the CPU and GPUs
It is also a GPU-aware library:
introduces CLArray
introduces the CLCollection hierarchy
http://code.google.com/p/scalacl
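As a taste, here is a minimal sketch based on the CLArray API shown on the next slides (Context.best and the .cl conversion come from those slides; treat the exact calls as illustrative rather than definitive):

    import scalacl._

    object HelloScalaCL {
      def main(args: Array[String]): Unit = {
        implicit val context = Context.best  // pick the best OpenCL device

        // .cl copies a regular array into GPU memory as a CLArray
        val v = Array(1f, 2f, 3f, 4f).cl

        // with the compiler plugin, this map becomes an OpenCL kernel
        // executed on the GPU; the sum runs there as well
        val result = v.map(x => x * x).sum
        println(result)
      }
    }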

How it works in Scala?
ScalaCL benefits:
100% Scala code
Hides GPU tooling details
Hides implementation details
Implements sequential and parallel collection interfaces
Works well in Eclipse and IntelliJ
http://code.google.com/p/scalacl

How it works in Scala?

    package org.squantlib.math.statistics

    import scala.math._
    import scalacl._

    class Stats {
      private implicit val context = Context.best // run on GPU

      def $mean(v: CLArray[Float]): Float = v.sum / v.length

      def $variance(v: CLArray[Float], m: Float): Float =
        v.par.map(x => (x - m) * (x - m)).sum / v.length

      def $stddev(v: CLArray[Float], m: Float): Float =
        sqrt($variance(v, m)).toFloat

      // (class continues on the next slide)

How it works in Scala?

      // interface with the regular Array type
      def mean(v: Array[Float]): Float = $mean(v.cl)

      def variance(v: Array[Float], m: Float): Float = $variance(v.cl, m)

      def stddev(v: Array[Float], m: Float): Float = $stddev(v.cl, m)
    }

http://code.google.com/p/scalacl
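Putting the two slides together, usage from ordinary Scala code would look something like this (my sketch; the sample data is made up):

    // Callers pass plain arrays; the .cl conversion and the GPU
    // kernels stay hidden inside Stats.
    val stats = new Stats
    val cells = Array.tabulate(1000000)(i => (i % 100).toFloat)

    val m = stats.mean(cells)
    val s = stats.stddev(cells, m)
    println("mean = " + m + ", stddev = " + s)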

How it works in Scala?
Benchmarks:
Depend on CPU and GPU capabilities
Depend on the algorithm
Depend on implementation techniques
My benchmarks:
Easily: 10x faster
With refinements: around 100-300 times faster
Maximum: ~500 times faster
http://code.google.com/p/scalacl

How it works in Scala
Process: Design → Develop → Test → Tune
Stress testing: high volumes, 100+ repetitions
Build benchmarks
Back to the design step:
Try parallel collections
Try sequential collections
Try alternative approaches and algorithms
http://code.google.com/p/scalacl
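For the "build benchmarks" step, a crude timing harness along these lines is enough to start with (my sketch, not from the talk):

    // Warm up the JIT, then time many repetitions and report the
    // best observed run, in milliseconds.
    def time[A](reps: Int)(body: => A): Double = {
      (1 to 3).foreach(_ => body)
      val runs = (1 to reps).map { _ =>
        val t0 = System.nanoTime
        body
        (System.nanoTime - t0) / 1e6
      }
      runs.min
    }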

Pros and cons of ScalaCL
Pros:
100% Scala: no low-level C or low-level tooling
Scala-specific bytecode optimizations
Excellent performance improvements
Multiple approaches... in a fraction of the time of C/C++
Cons:
Still incipient: may contain bugs
Missing features
Small community
http://code.google.com/p/scalacl

Thanks