V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies,

Presentation transcript:

V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. Sustran University of Belgrade Oskar Mencer Imperial College, London Oliver Pell Maxeler Technologies, London and Palo Alto Michael Flynn Stanford University, Palo Alto 1/72

For Big Data algorithms and for the same hardware price as before, achieving: a) speed-up, b) monthly electricity bills reduced 20 times, c) size 20 times smaller. The major issues of engineering are design cost and design complexity. Remember, the economy has its own rules: production count and market demand! 2/72

1. BigData 2. WORM 3. Tolerance to latency 4. Over 95% of run time in loops 5. Reusability of data (e.g., $x + x^2 + x^3 + x^4 + \dots$) 6. Skills. Use a tractor, not a Ferrari, to drive over a plowed field 3/72

Absolutely all results achieved with: a) All hardware produced in Europe, specifically UK b) All software generated by programmers of EU and WB 4/72

ControlFlow (MultiFlow and ManyFlow): Top500 ranks using Linpack (Japanese K, IBM Sequoia, Cray Titan, …) DataFlow: Coarse Grain (HEP) vs. Fine Grain (Maxeler) 5/72

Compiling below the machine code level brings speedups, as well as lower power, size, and cost. The price to pay: the machine is more difficult to program. Consequently: ideal for WORM applications :) Examples using Maxeler: GeoPhysics (20-40), Banking ( , with JP Morgan 20%), M&C (New York City), Datamining (Google), … 6/72

9/72 Why Java? Minimal Kolmogorov Complexity, etc…

Assumptions:
1. Software includes enough parallelism to keep all cores busy.
2. The only limiting factor is the number of cores.
$t_{GPU} = N \cdot N_{OPS} \cdot C_{GPU} \cdot T_{clkGPU} / N_{coresGPU}$
$t_{CPU} = N \cdot N_{OPS} \cdot C_{CPU} \cdot T_{clkCPU} / N_{coresCPU}$
$t_{DF} = N_{OPS} \cdot C_{DF} \cdot T_{clkDF} + (N - 1) \cdot T_{clkDF} / N_{DF}$
12/72
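The three timing formulas on this slide can be evaluated directly. The sketch below is only an illustration: the slide does not define its symbols, so it assumes N is the number of data items, NOPS the operations per item, C the clock cycles per operation, Tclk the clock period, Ncores the core count, and NDF the number of parallel dataflow pipelines. The function names and all numeric values are hypothetical.

```python
def t_control_flow(n, n_ops, cycles_per_op, t_clk, n_cores):
    """Run time for the CPU or GPU formula: all work is divided among the cores."""
    return n * n_ops * cycles_per_op * t_clk / n_cores


def t_dataflow(n, n_ops, cycles_per_op, t_clk, n_df):
    """Run time for the dataflow formula: pipeline fill time for the first
    item, then the remaining N - 1 items stream out at n_df per clock."""
    fill = n_ops * cycles_per_op * t_clk
    stream = (n - 1) * t_clk / n_df
    return fill + stream


if __name__ == "__main__":
    # Hypothetical workload and machine parameters, chosen only for illustration.
    n, n_ops = 1_000_000, 100
    t_cpu = t_control_flow(n, n_ops, 1, 1 / 3.0e9, 8)      # 8 cores at 3 GHz
    t_gpu = t_control_flow(n, n_ops, 1, 1 / 1.0e9, 2000)   # 2000 cores at 1 GHz
    t_df = t_dataflow(n, n_ops, 1, 1 / 200.0e6, 1)         # 1 pipeline at 200 MHz
    print(f"CPU {t_cpu:.6f} s, GPU {t_gpu:.6f} s, DataFlow {t_df:.6f} s")
```

Under these assumptions the dataflow time is dominated by the streaming term once N is large, which is why the formula tolerates a much slower clock.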

DualCore? Which way are the horses going? 13/72

Is it possible to use 2000 chickens instead of two horses? 14/72 What is better, real and anecdotal?

2 x 1000 chickens (CUDA and rCUDA) 15/72

How about ants? 16/72 Data

Marmalade Big Data Input Results 17/72

Factor: 20 to 200. MultiCore/ManyCore: Machine Level Code. Dataflow: Gate Transfer Level 18/72

Factor: 20. MultiCore/ManyCore vs. Dataflow 19/72

Factor: 20. MultiCore/ManyCore: Data Processing, Process Control. DataFlow: Data Processing, Process Control 20/72

MultiCore: Explain what to do, to the driver Caches, instruction buffers, and predictors needed ManyCore: Explain what to do, to many sub-drivers Reduced caches and instruction buffers needed DataFlow: Make a field of processing gates: 1C+2nJava+3Java No caches, etc. (300 students/year: BGD, BCN, LjU, ICL,…) 21/72

MultiCore: Business as usual. ManyCore: More difficult. DataFlow: Much more difficult. Debugging both application and configuration code 22/72

MultiCore/ManyCore: Several minutes. DataFlow: Several hours for the real hardware. Good news: fortunately, only several minutes for the simulator, and several seconds for reload (90% due to DRAM inertia). The simulator supports both the large JPMorgan machine and the smallest “University Support” machine. 23/72

MultiCore: Horse stable ManyCore: Chicken house DataFlow: Ant hole 25/72

MultiCore: Haystack ManyCore: Cornbits DataFlow: Crumbs 26/72

27/72 Small Data: Toy Benchmarks (e.g., Linpack)

28/72 Medium Data (benchmarks favoring NVidia, compared to Intel,…)

29/72 Big Data

Maxeler Hardware 30/72
CPUs plus DFEs: Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM
DFEs shared over Infiniband: Up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
MaxWorkstation: Desktop development system
MaxCloud: On-demand scalable accelerated compute resource, hosted in London

31/72 Major Classes of Algorithms, from the Computational Perspective
1. Coarse grained, stateful: Business – CPU requires DFE for minutes or hours
2. Fine grained, transactional with shared database: DM – CPU utilizes DFE for ms to s – Many short computations, accessing common database data
3. Fine grained, stateless transactional: Science (Phy,...) – CPU requires DFE for ms to s – Many short computations

Long runtime, but: Memory requirements change dramatically based on modelled frequency Number of DFEs allocated to a CPU process can be easily varied to increase available memory Streaming compression Boundary data exchanged over chassis MaxRing 32/72 Coarse Grained: Modeling

DFE DRAM contains the database to be searched CPUs issue transactions find(x, db) Complex search function – Text search against documents – Shortest distance to coordinate (multi-dimensional) – Smith Waterman sequence alignment for genomes Any CPU runs on any DFE that has been loaded with the database – MaxelerOS may add or remove DFEs from the processing group to balance system demands – New DFEs must be loaded with the search DB before use 33/72 Fine Grained, Shared Data: Monitoring
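As a rough illustration of the `find(x, db)` transaction, the sketch below implements one of the search functions the slide lists, shortest distance to a multi-dimensional coordinate. It is plain Python standing in for the DFE kernel: the database layout (a list of coordinate tuples), the return value, and the sample data are assumptions, not the Maxeler interface.

```python
import math


def find(x, db):
    """Return (index, distance) of the database entry closest to x.

    db is assumed to be a list of coordinate tuples with the same
    dimensionality as x; on the real system it would sit in DFE DRAM.
    """
    best_index, best_dist = -1, math.inf
    for i, point in enumerate(db):
        d = math.dist(x, point)  # Euclidean distance, Python 3.8+
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


if __name__ == "__main__":
    # Hypothetical three-entry database of 2-D coordinates.
    db = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
    print(find((3.0, 3.0), db))
```

Each call reads the database without modifying it, which fits the slide's point that any DFE loaded with the database can serve any CPU's transaction.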

34/72 Fine Grained, Stateless: The BSOP Control
Analyse > 1,000,000 scenarios
Many CPU processes run on many DFEs
≈50x MPC-X vs. multi-core x86 node
Each transaction executes on any DFE in the assigned group atomically
Diagram (repeated for each CPU/DFE pair): the CPU supplies market and instruments data; the DFE loops over instruments, runs the random number generator and sampling of underliers, and prices instruments using Black Scholes; instrument values return for tail analysis on CPU
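The stages named on this slide (loop over instruments, random sampling of underliers, Black Scholes pricing, tail analysis) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the Maxeler implementation: it assumes every instrument is a European call, that each scenario applies a lognormal shock to the spot price, and that tail analysis means averaging the worst 1% of scenario values. All function names and parameters are hypothetical.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Closed-form Black Scholes price of a European call option."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def scenario_values(instruments, n_scenarios, shock_vol, seed=0):
    """The part the slide places on the DFE: sample underliers, price instruments."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_scenarios):
        total = 0.0
        for spot, strike, rate, vol, maturity in instruments:
            shocked = spot * math.exp(rng.gauss(0.0, shock_vol))  # assumed shock model
            total += black_scholes_call(shocked, strike, rate, vol, maturity)
        values.append(total)
    return values


def tail_value(values, fraction=0.01):
    """The part the slide places on the CPU: mean of the worst `fraction` of scenarios."""
    ordered = sorted(values)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / k


if __name__ == "__main__":
    book = [(100.0, 100.0, 0.05, 0.2, 1.0), (50.0, 55.0, 0.05, 0.3, 0.5)]
    values = scenario_values(book, 10_000, 0.1)
    print(f"mean {sum(values) / len(values):.4f}, 1% tail {tail_value(values):.4f}")
```

Each scenario is independent of the others, which is what makes the workload stateless and lets any DFE in the group execute any transaction.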

35/72 Selected Examples: Business, Mathematics, GeoPhysics, etc.

An MIS Example: Credit Derivatives

Climber Tether Orbital station HW

Seismic Imaging Running on MaxNode servers - 8 parallel compute pipelines per chip - 150MHz => low power consumption! - 30x faster than microprocessors An Implementation of the Acoustic Wave Equation on FPGAs T. Nemeth †, J. Stefani †, W. Liu †, R. Dimond ‡, O. Pell ‡, R.Ergas § † Chevron, ‡ Maxeler, § Formerly Chevron, SEG /72

Performance of one MAX2 card vs. 1 CPU core Land case (8 params), speedup of 230x Marine case (6 params), speedup of 190x The CRS Results CPU Coherency MAX2 Coherency 41/72

DM for Monitoring and Control in Seismic processing. Velocity independent / data driven method to obtain a stack of traces, based on 8 parameters. Search for every sample of each output trace. Trace Stacking: Speed-up 217. P. Marchetti et al, 2010. Parameters: emergence angle & azimuth; Normal Wave front parameters ($K_{N,11}$; $K_{N,12}$; $K_{N,22}$); NIP Wave front parameters ($K_{NIP,11}$; $K_{NIP,12}$; $K_{NIP,22}$) 45/72

This is about algorithmic changes, to maximize the algorithm to architecture match: data choreography, process modifications, pipeline utilization, and decision precision. The winning paradigm of Big Data ExaScale? 47/72 Conclusion: Nota Bene

Revisiting the Top 500 SuperComputers benchmarks Our paper in Communications of the ACM Revisiting all major Big Data DM algorithms Massive static parallelism at low clock frequencies Concurrency and communication Concurrency between millions of tiny cores difficult, “jitter” between cores will harm performance at synchronization points Reliability and fault tolerance x fewer nodes, failures much less often Memory bandwidth and FLOP/byte ratio Optimize data choreography, data movement, and the algorithmic computation New architecture of n-Programming paradigms 48/72

FP7: 49/72 The SAB goal: Out of box thinking!

FP7: 50/72 The SAB goal: Seed for new proposals! The vision of Alkis Konstantellos

51/72 DAFNE: Leader MISANU

52/72 DAFNE = South (MaxCode) + North (BigData) MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma, IJS, FRI, IRB, QPLAN, Bogazici, U of Istanbul, U of Bucharest, U of Arad, U of Tuzla, Technion, Maxeler Israel, IPSI 52/72 UK Sweden Norway Denmark Germany France Austria Swiss Poland Hungary

53/72 The DAFNE Map

54/72 The DATAMAN Siena + BSC + Imperial College + Maxeler + Belgrade 46/72

55/72 The TriPeak: Essence MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM) Maxeler = A FineGrain DataFlow (FPGA) How about a happy marriage? MontBlanc (ompSS) and Maxeler (an accelerator) In each happy marriage, it is known who does what :) The Big Data DM algorithms: What part goes to MontBlanc and what to Maxeler? 55/72

56/72 TriPeak: Core of the Symbiotic Success An intelligent DM algorithmic scheduler, partially implemented for compile time, and partially for run time. At compile time: Checking what part of code fits where (MontBlanc or Maxeler): LoC 1M vs 2K vs 20K At run time: Rechecking the compile time decision, based on the current data values. 56/72

58/72 Maxeler: Research (Google: good method) Structure of a Typical Research Paper: Scenario #1 [Comparison of Platforms for One Algorithm] Curve A: MultiCore of approximately the same PurchasePrice Curve B: ManyCore of approximately the same PurchasePrice Curve C: Maxeler after a direct algorithm migration Curve D: Maxeler after algorithmic improvements Curve E: Maxeler after data choreography Curve F: Maxeler after precision modifications Structure of a Typical Research Paper: Scenario #2 [Ranking of Algorithms for One Application] CurveSet A: Comparison of Algorithms on a MultiCore CurveSet B: Comparison of Algorithms on a ManyCore CurveSet C: Comparison on Maxeler, after a direct algorithm migration CurveSet D: Comparison on Maxeler, after algorithmic improvements CurveSet E: Comparison on Maxeler, after data choreography CurveSet F: Comparison on Maxeler, after precision modifications 58/72

59/72 Maxeler Research in Serbia: Special Issue of IPSI Transactions Journal KG: Blood Flow, Tijana Djukic and Prof. Filipovic NS: Combinatorial Math, Prof. Senk and Ivan Stanojevic MISANU: The SAT Math, Zivojin Sustran and Prof. Ognjanovic ETF: Meteorology, Radomir Radojicic and Marko Stankovic ETF: Physics (Gross Pitaevskii 3D real), Sasa Stojanovic ETF: Physics (Gross Pitaevskii 3D imaginary), Lena Parezanovic 59/72

60/72 Maxeler Research WorldWide: Special Issue of Advances in SCI Stanford, Texas, Imperial, Maxeler, ETF, MF, MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma, IJS, FRI, … 60/72

61/72 © H. Maurer 61

62/72 Maxeler: Teaching (Google: prof vm) TEACHING, VLSI, PowerPoints, Maxeler: Maxeler Veljko Explanations, August 2012 Maxeler Veljko Anegdotic, Maxeler Oskar Talk, August 2012 Maxeler Forbes Article Flyer by JP Morgan Flyer by Maxeler HPC Tutorial Slides by Sasha and Veljko: Practice (Current Update) Paper, unconditionally accepted for Advances in Computers by Elsevier Paper, unconditionally accepted for Communications of the ACM Tutorial Slides by Oskar: Theory (7 parts) Slides by Jacob, New York Slides by Jacob, Alabama Slides by Sasha: Practice (Current Update) Maxeler in Meteorology Maxeler in Mathematics Examples generated in Belgrade and Worldwide THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN, with an example 62/72

63/72 Maxeler PreConference Tutorials (2013) Google: IEEE HiPeak, Berlin, Germany, January 2013 ACM iSAC, Coimbra, Portugal, March 2013 IEEE MECO, Budva, Montenegro, June 2013 ACM ISCA, Tel Aviv, Israel, June /72

64/72 Maxeler InHouse Tutorials (2013) 64/72

65/72 © H. Maurer 65

66/72 Maxeler University Program Members

67/72 How to Become a Family Member? Options to consider: a. MAX-UP free of charge b. Purchasing a university-level machine (min about $10K) c. Purchasing a JPM-level machine (slowly approaching $100M), or at least a Schlumberger-level machine (slowly moving above $10M) 67/72

68/72 Good to Know! Maxeler employs close to 100 people, GBR and USA: a. Maxeler cash burn per year = about $10M b. If a university-level machine is sold at the 100% profit margin, the company life of Maxeler is extended for about 2 hours. c. If a university-level machine is sold at the 1% profit margin, the company life of Maxeler is extended for 1 minute. Our past or ongoing FP7 projects requiring Maxeler speeds: a. ProSense b. ARTreat c. HiPEAC 68/72

69/72 The Educational Mission The reality: a. University-level machines are sold at the ZERO profit margin! b. Only the Xilinx costs, handling, and shipping. c. Support for students doing theses is practically unlimited! Important note: a. Total number of accredited universities in the whole world? b. As per WeboMetrics, about c. Consequently, all universities of the world together bring only: minutes of extra life, or about two weeks of extra life. Conclusion: This is a chance for those who jump in first :) 69/72

70/72 Our Work Impacting Maxeler Milutinovic, V., Knezevic, P., Radunovic, B., Casselman, S., Schewel, J., Obelix Searches Internet Using Customer Data, IEEE COMPUTER, July 2000 (impact factor 2.205/2010). Milutinovic, V., Cvetkovic, D., Mirkovic, J., Genetic Search Based on Multiple Mutation Approaches, IEEE COMPUTER, November 2000 (impact factor 2.205/2010). Milutinovic, V., Ngom, A., Stojmenovic, I., STRIP --- A Strip Based Neural Network Growth Algorithm for Learning Multiple-Valued Functions, IEEE TRANSACTIONS ON NEURAL NETWORKS, March 2001, Vol.12, No.2, pp Jovanov, E., Milutinovic, V., Hurson, A., Acceleration of Nonnumeric Operations Using Hardware Support for the Ordered Table Hashing Algorithms, IEEE TRANSACTIONS ON COMPUTERS, September 2002, Vol.51, No.9, pp (impact factor 1.822/2010). 70/72

71/72 Maxeler Impacting Our Work Tafa, Z., Rakocevic, G., Mihailovic, Dj., Milutinovic, V., Effects of Interdisciplinary Education On Technology-driven Application Design IEEE Transactions on Education, August 2011, pp (impact factor 1.328/2010). Tomazic, S., Pavlovic, V., Milovanovic, J., Sodnik, J., Kos, A., Stancin, S., Milutinovic, V., Fast File Existence Checking in Archiving Systems ACM Transactions on Storage (TOS) TOS Homepage archive, Volume 7 Issue 1, June 2011, ACM New York, NY, USA. Jovanovic, Z., Milutinovic, V., FPGA Accelerator for Floating-Point Matrix Multiplication, IEE Computers & Digital Techniques, 2012, 6, (4), pp Flynn, M., Mencer, O., Milutinovic, V., Rakocevic, G., Stenstrom, P., Trobec, R., and Valero, M., Moving from Petaflops (on Simple Benchmarks) to Petadata per Unit of Time and Power (On Sophisticated Benchmarks) Communications of the ACM, May 2013 (impact factor 1.919/2010). 71/72

72/72 © H. Maurer 72 Q&A