EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB.py OS/HW f() h() Specializer.c PLL Interp Productivity app.so cc/ld $ $ SEJITS logical.

Slides:

Advertisements

Similar presentations

Lecture Computer Science I - Martin Hardwick Strings #include using namespace std; int main () { string word; cout

Advertisements

AP Computer Science Anthony Keen. Computer 101 What happens when you turn a computer on? –BIOS tries to start a system loader –A system loader tries to.

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB P A R A L L E L C O M P U T I N G L A B O R A T O R Y EECS Electrical Engineering.

Dependency Test in Loops By Amala Gandhi. Data Dependence Three types of data dependence: 1. Flow (True) dependence : read-after-write int a, b, c; a.

EXAMPLES (Arrays). Example Many engineering and scientific applications represent data as a 2-dimensional grid of values; say brightness of pixels in.

Factorial Preparatory Exercise #include using namespace std; double fact(double); int fact(int); int main(void) { const int n=20; ofstream my_file("results.txt");

ELEC 206 Computer Applications for Electrical Engineers Dr. Ron Hayne

EE3190 Optical Sensing and Imaging, Fall 2004 © Timothy J. Schulz EE3190 Optical Sensing and Imaging Computing PSFs with a digital computer.

School of Engineering & Technology Computer Architecture Pipeline.

1 Convert this problem into our standard form: Minimize 3 x x 2 – 6 x 3 subject to 1 x x 2 – 3 x 3 ≥ x x 2 – 6 x 3 ≤ 5 7 x 1.

Riyadh Philanthropic Society For Science Prince Sultan College For Woman Dept. of Computer & Information Sciences CS 102 Computer Programming II (Lab:

C++ Basics March 10th. A C++ program //if necessary include headers //#include void main() { //variable declaration //read values input from user //computation.

Tinaliah, S. Kom.. * * * * * * * * * * * * * * * * * #include using namespace std; void main () { for (int i = 1; i

Triana Elizabeth, S.Kom. #include using namespace std; void main () { for (int i = 1; i

Selective, Embedded Just-in- Time Specialization (SEJITS) As a platform for implementing communication-avoiding algorithms accessible from Python.

Lecture 8 Analysis of Recursive Algorithms King Fahd University of Petroleum & Minerals College of Computer Science & Engineering Information & Computer.

P A R A L L E L C O M P U T I N G L A B O R A T O R Y EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB Tuning Stencils Kaushik Datta.

General Computer Science for Engineers CISC 106 Lecture 33 Dr. John Cavazos Computer and Information Sciences 05/11/2009.

General Computer Science for Engineers CISC 106 Lecture 09 James Atlas Computer and Information Sciences 9/25/2009.

Презентація за розділом “Гумористичні твори”

Центр атестації педагогічних працівників 2014

Галактики і квазари.

Характеристика ІНДІЇ.

Процюк Н.В. вчитель початкових класів Боярської ЗОШ І – ІІІ ст №4

C++ Control Flow Mark Hennessy Dept. Computer Science NUI Maynooth C++ Workshop 18 th -22 nd Sept 2006.

C#: Statements and Methods Based on slides by Joe Hummel.

Computer Science and Engineering Parallel and Distributed Processing CSE 8380 February 8, 2005 Session 8.

FEN 2012 UCN Technology: Computer Science1 C# - Introduction Language Fundamentals in Brief.

COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.

CS179: GPU Programming Lecture 11: Lab 5 Recitation.

Functions Pass by Reference Alina Solovyova-Vincent Department of Computer Science & Engineering University of Nevada, Reno Fall 2005.

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 12 Parallel.

C++ Tutorial Hany Samuel and Douglas Wilhelm Harder Department of Electrical and Computer Engineering University of Waterloo Copyright © 2006 by Douglas.

# Author: David Foltz # # Notice: Orginal Code Idea for program taken from Python v3.2.2 Documentation def Fibonacci(): """Print a Fibonacci series with.

My Special Number by. My Special Number My special number is This number is special to me because.

Духовні символи Голосіївського району

IB Computer Science – Logic

Computing Science 1P Large Group Tutorial 13 Simon Gay Department of Computing Science University of Glasgow 2006/07.

Computer Engineering 2 nd Semester Dr. Rabie A. Ramadan 3.

General Computer Science for Engineers CISC 106 Lecture 12 James Atlas Computer and Information Sciences 08/03/2009.

COM S 228 Algorithms and Analysis Instructor: Ying Cai Department of Computer Science Iowa State University Office: Atanasoff 201.

Int fact (int n) { If (n == 0) return 1; else return n * fact (n – 1); } 5 void main () { Int Sum; : Sum = fact (5); : } Factorial Program Using Recursion.

1 st Semester Module 7 Arrays อภิรักษ์ จันทร์สร้าง Aphirak Jansang Computer Engineering Department.

Parallel Computing Chapter 3 - Patterns R. HALVERSON MIDWESTERN STATE UNIVERSITY 1.

Think What will be the output?

Optimum Interior Loop Teacher Student Student Student Student Student

1 مقايسه دانشگاه صنعتی شريف و دانشگاه برکلی در رشته‌های مهندسی و علوم کامپيوتر نيما احمدی پور اناری دانشکده مهندسی کامپيوتر دانشگاه صنعتی شريف جهت ارائه.

Проф. д-р Васил Цанов, Институт за икономически изследвания при БАН

ЗУТ ПРОЕКТ на Закон за изменение и допълнение на ЗУТ

О Б Щ И Н А С И Л И С Т Р А П р о е к т Б ю д ж е т г.

Електронни услуги на НАП

Боряна Георгиева – директор на

РАЙОНЕН СЪД - БУРГАС РАБОТНА СРЕЩА СЪС СЪДЕБНИТЕ ЗАСЕДАТЕЛИ ПРИ РАЙОНЕН СЪД – БУРГАС 21 ОКТОМВРИ 2016 г.

Сътрудничество между полицията и другите специалисти в България

Съобщение Ръководството на НУ “Христо Ботев“ – гр. Елин Пелин

НАЦИОНАЛНА АГЕНЦИЯ ЗА ПРИХОДИТЕ

ДОБРОВОЛЕН РЕЗЕРВ НА ВЪОРЪЖЕНИТЕ СИЛИ НА РЕПУБЛИКА БЪЛГАРИЯ

Съвременни софтуерни решения

ПО ПЧЕЛАРСТВО ЗА ТРИГОДИШНИЯ

от проучване на общественото мнение,

Васил Големански Ноември, 2006

Програма за развитие на селските райони

ОПЕРАТИВНА ПРОГРАМА “АДМИНИСТРАТИВЕН КАПАЦИТЕТ”

БАЛИСТИКА НА ТЯЛО ПРИ СВОБОДНО ПАДАНЕ В ЗЕМНАТА АТМОСФЕРА

МЕДИЦИНСКИ УНИВЕРСИТЕТ – ПЛЕВЕН

Стратегия за развитие на клъстера 2015

Моето наследствено призвание

Правна кантора “Джингов, Гугински, Кючуков & Величков”

Безопасност на движението

2D Arrays Just another data structure Typically use a nested loop

Presentation transcript:

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB.py OS/HW f() h() Specializer.c PLL Interp Productivity app.so cc/ld $ $ SEJITS logical flow Selective Embedded JIT ASP.py Specialization Human-expert- authored templates & strategy engine

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB.py OS/HW f() h() Specializer.c PLL Interp Productivity app.so cc/ld $ $ SEJITS logical flow ASP.py Asp DB

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB Stencil example: input  Sum the distance-1 neighbors in a 2D stencil 3 import stencil_grid import numpy class ExampleKernel(object): def kernel(self, in_grid, out_grid): for x in out_grid.interior_points(): for y in in_grid.neighbors(x, 1): out_grid[x] = out_grid[x] + in_grid[y] in_grid = StencilGrid([5,5]) in_grid.data = numpy.ones([5,5]) out_grid = StencilGrid([5,5]) ExampleKernel().kernel(in_grid, out_grid) from stencil_kernel import *

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB Neighbor iterator expanded Kernel function lowered from Python to C++ Specialized to dimensions of input data Stencil kernel example: unoptimized sequential C++ #include #include "arrayobject.h" namespace private_namespace_000 { void kernel(PyObject *in_grid, PyObject *out_grid) { for (int i = 1; i < 4; i++) { for (int j = 1; j < 4; j++) { #define _out_grid_array_macro(_d0,_d1) ((_d1) + 5 * (_d0)) #define _in_grid_array_macro(_d0,_d1) ((_d1) + 5 * (_d0)) double* _my_out_grid = (double *) PyArray_DATA(out_grid); double* _my_in_grid = (double *) PyArray_DATA(in_grid); int idxout = _out_grid_array_macro(i,j); int idxin = _in_grid_array_macro(i+1,j); _my_out_grid[idxout] = (_my_out_grid[idxout] + _my_in_grid[idxin]); idxin = _in_grid_array_macro(i-1,j); _my_out_grid[idxout] = (_my_out_grid[idxout] + _my_in_grid[idxin]); idxin = _in_grid_array_macro(i,j+1); _my_out_grid[idxout] = (_my_out_grid[idxout] + _my_in_grid[idxin]); idxin = _in_grid_array_macro(i,j-1); _my_out_grid[idxout] = (_my_out_grid[idxout] + _my_in_grid[idxin]); }

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB C++ with loop unrolling by 4 void kernel_unroll_4(PyObject *in_grid, PyObject *out_grid) { for (int _EwzZcfet = 1; (_EwzZcfet <= 256); _EwzZcfet = (_EwzZcfet + 1)) cilk_for (int _HmEYIyVk = 1; (_HmEYIyVk <= 256); _HmEYIyVk += 1) for (int _QXalOpkr = 1; (_QXalOpkr <= 256); _QXalOpkr = (_QXalOpkr + 4)) { #define _out_grid_array_macro(_d0,_d1,_d2) (((_d2) * (_d1)) * (_d0)) #define _in_grid_array_macro(_d0,_d1,_d2) (((_d2) * (_d1)) * (_d0)) double* _my_out_grid = (double *) PyArray_DATA(out_grid); double* _my_in_grid = (double *) PyArray_DATA(in_grid);; int x; x = _in_grid_array_macro(_EwzZcfet, _HmEYIyVk, _QXalOpkr); { int y; y = _in_grid_array_macro((_EwzZcfet + 1), (_HmEYIyVk + 0), (_QXalOpkr + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + -1), (_HmEYIyVk + 0), (_QXalOpkr + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 1), (_QXalOpkr + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + -1), (_QXalOpkr + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 0), (_QXalOpkr + 1)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 0), (_QXalOpkr + -1)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); } x = _in_grid_array_macro(_EwzZcfet, _HmEYIyVk, (_QXalOpkr+1)); { int y; y = _in_grid_array_macro((_EwzZcfet + 1), (_HmEYIyVk + 0), ((_QXalOpkr+1) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + -1), (_HmEYIyVk + 0), ((_QXalOpkr+1) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 1), ((_QXalOpkr+1) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + -1), ((_QXalOpkr+1) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 0), ((_QXalOpkr+1) + 1)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 0), ((_QXalOpkr+1) + -1)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); } x = _in_grid_array_macro(_EwzZcfet, _HmEYIyVk, (_QXalOpkr+2)); { int y; y = _in_grid_array_macro((_EwzZcfet + 1), (_HmEYIyVk + 0), ((_QXalOpkr+2) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + -1), (_HmEYIyVk + 0), ((_QXalOpkr+2) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 1), ((_QXalOpkr+2) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + -1), ((_QXalOpkr+2) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 0), ((_QXalOpkr+2) + 1)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 0), ((_QXalOpkr+2) + -1)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); } x = _in_grid_array_macro(_EwzZcfet, _HmEYIyVk, (_QXalOpkr+3)); { int y; y = _in_grid_array_macro((_EwzZcfet + 1), (_HmEYIyVk + 0), ((_QXalOpkr+3) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + -1), (_HmEYIyVk + 0), ((_QXalOpkr+3) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 1), ((_QXalOpkr+3) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + -1), ((_QXalOpkr+3) + 0)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 0), ((_QXalOpkr+3) + 1)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); y = _in_grid_array_macro((_EwzZcfet + 0), (_HmEYIyVk + 0), ((_QXalOpkr+3) + -1)); _my_out_grid[x] = (_my_out_grid[x] + ((1.0 / 6.0) * _my_in_grid[y])); } 5

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB Fully optimized, parallel C++ int c2; for (c2=chunkOffset_2;c2<=255;c2+=256) { int c1; for (c1=chunkOffset_1;c1<=255;c1+=8) { int c0; for (c0=chunkOffset_0;c0<=255;c0+=256) { int b2; for (b2=c2 + threadOffset_2;b2<=c ;b2+=256) { int b1; for (b1=c1 + threadOffset_1;b1<=c1 + 7;b1+=8) { int b0; for (b0=c0 + threadOffset_0;b0<=c ;b0+=256) { int k; for (k=b2 + 1;k<=b ;k+=2) { int j; for (j=b1 + 1;j<=b1 + 8;j+=2) { int i; for (i=b0 + 1;i<=b ;i+=8) { Anext[_Anext_Index(i - 1,j - 1,k - 1)] = fac * A0[_A0_Index(i - 1,j - 1,k - 1)] + fac2 * (A0[_A0_Index(i,j - 1,k - 1)] + A0[_A0_Index(i - 2,j - 1,k - 1)] + A0[_A0_Index(i - 1,j,k - 1)] + A0[_A0_Index(i - 1,j - 2,k - 1)] + A0[_A0_Index(i - 1,j - 1,k)] + A0[_A0_Index(i - 1,j - 1,k - 2)]); Anext[_Anext_Index(i,j - 1,k - 1)] = fac * A0[_A0_Index(i,j - 1,k - 1)] + fac2 * (A0[_A0_Index(i + 1,j - 1,k - 1)] + A0[_A0_Index(i - 1,j - 1,k - 1)] + A0[_A0_Index(i,j,k - 1)] + A0[_A0_Index(i,j - 2,k - 1)] + A0[_A0_Index(i,j - 1,k)] + A0[_A0_Index(i,j - 1,k - 2)]); Anext[_Anext_Index(i + 1,j - 1,k - 1)] = fac * A0[_A0_Index(i + 1,j - 1,k - 1)] + fac2 * (A0[_A0_Index(i + 2,j - 1,k - 1)] + A0[_A0_Index(i,j - 1,k - 1)] + A0[_A0_Index(i + 1,j,k - 1)] + A0[_A0_Index(i + 1,j - 2,k - 1)] + A0[_A0_Index(i + 1,j - 1,k)] + A0[_A0_Index(i + 1,j - 1,k - 2)]); Anext[_Anext_Index(i + 2,j - 1,k - 1)] = fac * A0[_A0_Index(i + 2,j - 1,k - 1)] + fac2 * (A0[_A0_Index(i + 3,j - 1,k - 1)] + A0[_A0_Index(i + 1,j - 1,k - 1)] + A0[_A0_Index(i + 2,j,k - 1)] + A0[_A0_Index(i + 2,j - 2,k - 1)] + A0[_A0_Index(i + 2,j - 1,k)] + A0[_A0_Index(i + 2,j - 1,k - 2)]); ~70 similar lines } 6

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB Phase 1: Python AST => Domain AST def kernel(self, in_grid, out_grid): for x in out_grid.interior_points(): for y in in_grid.neighbors(x, 1): out_grid[x] = out_grid[x] + in_grid[y] def visit_for(self, node): if (node.isFunctionCall("interior_points")): grid = self.visit(node.iter.func.value).id body = map(self.visit, node.body) target = self.visit(node.target) newnode = StencilKernel.StencilInterior(grid, body, target) return newnode … Visit each AST node of Python call to specializer (available via Python’s introspection) Convert calls to interior_points and neighbors to domain- specific, target-independent AST nodes for later processing Note use of Python in manipulating/traversing AST

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB AST phase 1 8 Call interior_point s For x iter Call neighbors(x,1) For y iter := out_grid[x]+ in_grid[y] StencilInterior out_grid, x StencilNeighbor in_grid, y, 1 := out_grid[x]+ in_grid[y] Generic=>Domain-Specific for x in out_grid.interior_points for y in in_grid.neighbors out_grid[x] = out_grid[x] + in_grid[y]

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB out_grid[x] = out_grid[x] + in_grid[y] AST phase 2: Optimize DAST 9 For ( i over GridSize) := + Initialize variables … …… := +… …… +… …… +… …… Unrolled Neighbor Loop For ( j over GridSize) StencilInterior out_grid, x StencilNeighbor in_grid, y, 1 := out_grid[x] + in_grid[y] LoopUnroller().unroll(node, factor)

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB AST phase 3 & 4: bind platform, generate/compile/cache code  Transform to platform-specific AST (PAST)  optimizations can be reused across "similar" platforms (eg, SIMDize generically, let compiler map onto ISA) 10 For ( i over GridSize) := Initialize variables := Unrolled Neighbor Loop For ( j over GridSize) CilkFor(...) := Initialize variables := block scope {... } CilkFor(...)

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB Who writes what Application Specializer ASP core CodePy Kernel Python AST Target AST ASP Module Utilities Compiled module Kernel call Input data Kernel call Cache hit App author (PLL) Specializer author (ELL) SEJITS team 3 rd party library

EECS Electrical Engineering and Computer Sciences B ERKELEY P AR L AB Results summary CA matrix powersStencilGMM Performance3x–9x faster than optimized C++ 53% of achievable peak mem BW Variant selection outperforms hand- coded CUDA on large problems PortabilityOpenMP; CILK + soon OpenMP, CILK+CUDA, CILK+ specializer LOC (logic + templates) app LOC reductionN/A~20x~15x SEJITS benefitDelivery vehicle for first autotuned CA- Akx Inlining functionsStrategy selection hides hetero- geneous HW App integrationLinear solversCAT scan analysisSpeech recognition 12