High Performance Computing: Introduction to classes of computing, SISD, MISD, SIMD, MIMD, Conclusion


High Performance Computing: Introduction to classes of computing, SISD, MISD, SIMD, MIMD, Conclusion

Classes of computing
A computation consists of:
- a stream of instructions (operations)
- a stream of data
We can then abstractly classify computing systems into the following classes, based on the number of instruction and data streams each handles:
- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MISD: Multiple Instructions, Single Data
- MIMD: Multiple Instructions, Multiple Data

High Performance Computing: Introduction to classes of computing, SISD, MISD, SIMD, MIMD, Conclusion

SISD: Single Instruction, Single Data
- One stream of instructions
- One stream of data
- Scalar pipelining keeps the CPU utilized most of the time
- Superscalar pipelines increase throughput, aiming at more than one instruction per cycle (IPC > 1, i.e. CPI < 1)
- Further improvement comes from raising the operating frequency

SISD

Example: A = A + 1
Assembly (GCC inline asm, AT&T syntax):
asm("movl %1, %%eax\n\t"
    "addl $1, %%eax\n\t"
    "movl %%eax, %0"
    : "=m"(A)
    : "m"(A)
    : "eax");

SISD bottlenecks
- The level of parallelism is low:
  - data dependencies
  - control dependencies
- Improvements are limited to pipelining, superscalar, and superpipelined designs (see the sketch below)
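To make these dependency limits concrete, here is a minimal C sketch (the variable and function names are illustrative, not from the slides): a serial chain of data dependencies cannot be overlapped, independent statements can, and a branch creates a control dependency.

int dependency_demo(int x, int y, int z)
{
    /* Data dependencies: a serial chain where each result feeds the
     * next, so even a superscalar core must run these one at a time. */
    int a = x + 1;
    int b = a * 2;   /* waits for a */
    int c = b - 3;   /* waits for b */

    /* Independent operations: a superscalar core can issue all three
     * in the same cycle. */
    int p = x + 1;
    int q = y * 2;
    int r = z - 3;

    /* Control dependency: the branch must resolve (or be predicted)
     * before the work that follows it can safely complete. */
    if (p > 0)
        q += r;

    return c + q;
}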

High Performance Computing: Introduction to classes of computing, SISD, MISD, SIMD, MIMD, Conclusion

MISD: Multiple Instructions, Single Data
- Multiple streams of instructions
- A single stream of data
- Multiple functional units operate on the same data
- Can be seen as a list of instructions, or one complex instruction, per operand (as in CISC)
- Has received less attention than the other classes

MISD

Stream #1:
  Load  R0,%1
  Add   $1,R0
  Store R0,%1
Stream #2:
  Load  R0,%1
  Mul   %1,R0
  Store R0,%1

MISD (one complex instruction):
  ADD_MUL_SUB $1,$4,$7,%1
SISD equivalent:
  Load  R0,%1
  Add   $1,R0
  Mul   $4,R0
  Sub   $7,R0
  Store %1,R0
(A C version of the same computation follows below.)
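In C terms, the contrast looks like this (a sketch; ADD_MUL_SUB is the slides' hypothetical complex instruction, and the constants 1, 4, 7 are taken from its operands):

int add_mul_sub(int A)
{
    /* CISC/MISD-style: one "complex instruction" does all three steps. */
    int fused = ((A + 1) * 4) - 7;

    /* SISD-style: the same computation as a sequence of simple steps. */
    int r = A;
    r = r + 1;   /* ADD $1 */
    r = r * 4;   /* MUL $4 */
    r = r - 7;   /* SUB $7 */

    /* fused == r: both forms compute the same value. */
    return r;
}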

MISD bottlenecks
- Low level of parallelism
- Heavy synchronization
- High bandwidth required
CISC bottleneck
- High complexity

High Performance Computing: Introduction to classes of computing, SISD, MISD, SIMD, MIMD, Conclusion

SIMD: Single Instruction, Multiple Data
- A single instruction stream
- Multiple data streams
- Each instruction operates on multiple data elements in parallel
- Fine-grained parallelism

SIMD

A wide variety of applications can be solved by SIMD parallel algorithms, but only those problems that can be divided into sub-problems which are all solved simultaneously by the same sequence of instructions. Such algorithms are typically easy to implement.

SIMD application examples
- Ordinary desktop and business applications: word processors, databases, the OS, and many more
- Multimedia applications: 2D and 3D image processing, games, etc.
- Scientific applications: CAD, simulations

Examples of CPUs with SIMD extensions
- Intel Pentium 4 & AMD Athlon (x86): 8 x 128-bit SIMD registers
- G5 (vector CPU with SIMD extension): 32 x 128-bit registers
- PlayStation 2: two vector units with SIMD extensions

SIMD operations

SIMD instruction support
- Loads and stores
- Integer instructions
- Floating-point instructions
- Logical and arithmetic instructions
- Optional additional instructions: cache/prefetch hints to match the different locality characteristics of different applications (see the intrinsics sketch below)
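As a concrete illustration, here is a minimal C sketch using SSE2 intrinsics (the intrinsics are real, but this fragment is not from the original slides): a SIMD load, a packed integer add, a SIMD store, and a prefetch hint requesting non-temporal locality.

#include <emmintrin.h>   /* SSE2: 128-bit integer SIMD */
#include <xmmintrin.h>   /* SSE: _mm_prefetch */

void add4(const int *a, const int *b, int *out)
{
    /* Cache hint: fetch upcoming data with non-temporal locality. */
    _mm_prefetch((const char *)(a + 4), _MM_HINT_NTA);

    __m128i va  = _mm_loadu_si128((const __m128i *)a);  /* SIMD load */
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);  /* SIMD load */
    __m128i sum = _mm_add_epi32(va, vb);                /* 4 adds in one instruction */
    _mm_storeu_si128((__m128i *)out, sum);              /* SIMD store */
}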

Intel MMX: 8 x 64-bit registers

Intel SSE: 8 x 128-bit registers

AMD K8: 16 x 128-bit registers

G5: 32 x 128-bit registers

Example of a SIMD operation: adding two sets of four 32-bit integers
V1 = {1,2,3,4}, V2 = {5,5,5,5}

SIMD code:
  VecLoad v0,%0   (ptr to vector 1)
  VecLoad v1,%1   (ptr to vector 2)
  VecAdd  v1,v0
or, in MMX-style instructions:
  Movq  mm0,%0    (ptr to vector 1)
  Movq  mm1,%1    (ptr to vector 2)
  Paddd mm1,mm0
Result: V2 = {6,7,8,9}
Total: 2 loads + 1 add = 3 instructions

SISD code:
  Push %ecx            (set up counter register)
  Mov  %eax,%0         (ptr to vector 1)
  Mov  %ebx,%1         (ptr to vector 2)
.LOOP
  Add  %ebx,%eax       (v2[i] = v1[i] + v2[i])
  Add  $4,%eax         (v1++)
  Add  $4,%ebx         (v2++)
  Add  $1,%ecx         (counter++)
  Branch to .LOOP while counter < 4
Result: V2 = {6,7,8,9}
Total: 3 loads + 4 x (3 adds) = 15 instructions
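The same V1 + V2 addition can also be written in C using GCC/Clang vector extensions (a sketch under that assumption; the compiler emits a single SIMD add covering all four lanes):

typedef int v4si __attribute__((vector_size(16)));  /* 4 x 32-bit integers */

v4si add_vectors(void)
{
    v4si v1 = {1, 2, 3, 4};
    v4si v2 = {5, 5, 5, 5};
    v2 = v1 + v2;   /* one SIMD add; v2 is now {6, 7, 8, 9} */
    return v2;
}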

SIMD matrix multiplication: scalar C code (no MMX)

#include <stdint.h>

int16_t vect[Y_SIZE];
int16_t matr[Y_SIZE][X_SIZE];
int16_t result[X_SIZE];
int32_t accum;
int i, j;

for (i = 0; i < X_SIZE; i++) {
    accum = 0;
    for (j = 0; j < Y_SIZE; j++)
        accum += vect[j] * matr[j][i];
    result[i] = accum;
}

SIMD matrix multiplication: C-style pseudocode with MMX

for (i = 0; i < X_SIZE; i += 4) {
    accum = {0,0,0,0};                       /* four packed 32-bit accumulators */
    for (j = 0; j < Y_SIZE; j += 2)
        accum += MULT4x2(&vect[j], &matr[j][i]);
    result[i..i+3] = accum;
}

MULT4x2():
  movd      mm7, [esi]        ; load two elements from the input vector
  punpckldq mm7, mm7          ; duplicate input vector: v0:v1:v0:v1
  movq      mm0, [edx+0]      ; load first line of matrix (4 elements)
  movq      mm6, [edx+2*ecx]  ; load second line of matrix (4 elements)
  movq      mm1, mm0          ; transpose matrix to column representation
  punpcklwd mm0, mm6          ; mm0 keeps columns 0 and 1
  punpckhwd mm1, mm6          ; mm1 keeps columns 2 and 3
  pmaddwd   mm0, mm7          ; multiply and add the 1st and 2nd columns
  pmaddwd   mm1, mm7          ; multiply and add the 3rd and 4th columns
  paddd     mm2, mm0          ; accumulate 32-bit results for columns 0/1
  paddd     mm3, mm1          ; accumulate 32-bit results for columns 2/3

SIMD matrix multiplication: MMX pseudocode with the loop unrolled (four independent accumulator groups help hide instruction latency)

for (i = 0; i < X_SIZE; i += 16) {
    accum = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};
    for (j = 0; j < Y_SIZE; j += 2) {
        accum[0..3]   += MULT4x2(&vect[j], &matr[j][i]);
        accum[4..7]   += MULT4x2(&vect[j], &matr[j][i+4]);
        accum[8..11]  += MULT4x2(&vect[j], &matr[j][i+8]);
        accum[12..15] += MULT4x2(&vect[j], &matr[j][i+12]);
    }
    result[i..i+15] = accum;
}

SIMD matrix multiplication performance (figure). Source: Intel's Matrix Multiply Application Note.

SIMD MMX performance (figure). Source: "Does the Pentium MMX Live up to the Expectations?"

High Performance Computing: Introduction to classes of computing, SISD, MISD, SIMD, MIMD, Conclusion

MIMD: Multiple Instructions, Multiple Data
- Multiple streams of instructions
- Multiple streams of data
- Medium-grained parallelism
- Used to solve in parallel those problems that lack the regular structure required by the SIMD model
- Implemented in cluster or SMP systems
- Each execution unit operates asynchronously on its own instructions and data, which may be sub-problems of a single problem

MIMD requires
- Synchronization
- Inter-process communication
- Parallel algorithms, which are difficult to design, analyze, and implement (a minimal threaded sketch follows below)
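Here is a minimal MIMD sketch using POSIX threads (illustrative, not from the slides): each thread runs its own instruction stream over its own slice of the data, and pthread_join is the explicit synchronization point where sub-results are combined.

#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4

/* Each thread works on its own sub-problem: summing one quarter
 * of a shared array. */
static int data[400];
static long partial[N_THREADS];

static void *worker(void *arg)
{
    long id = (long)arg;                 /* which slice is ours */
    long sum = 0;
    for (int i = id * 100; i < (id + 1) * 100; i++)
        sum += data[i];
    partial[id] = sum;                   /* private result slot: no lock needed */
    return NULL;
}

int main(void)
{
    pthread_t t[N_THREADS];
    for (int i = 0; i < 400; i++)
        data[i] = i;

    for (long id = 0; id < N_THREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);

    long total = 0;
    for (int id = 0; id < N_THREADS; id++) {
        pthread_join(t[id], NULL);       /* synchronization point */
        total += partial[id];
    }
    printf("total = %ld\n", total);      /* 0+1+...+399 = 79800 */
    return 0;
}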

MIMD

MIMD systems
- MPP supercomputers
- High-performance single processors
- Multiprocessors (MP)
- Clusters, networks
- Mixtures of everything: clusters of high-performance MP nodes

Examples of MPP machines
- Earth Simulator (2002)
- Cray C90
- Cray X-MP

Cray X-MP (Gflop-range): multiprocessor with 2 or 4 Cray-1-like processors, shared memory

Cray C90: about 1 Gflop per processor, 8 or more processors

The Earth Simulator
- Operational in late 2002
- The result of a 5-year design and implementation effort
- Power equivalent to the top 15 US machines

The Earth Simulator in detail
- 640 nodes
- 8 vector processors per node, 5120 in total
- 8 Gflops per processor, 40 Tflops in total
- 16 GB memory per node, 10 TB in total
- 2800 km of cables
- 320 cabinets (2 nodes each)
- Cost: $350 million

Earth Simulator

High Performance Computing: Introduction to classes of computing, SISD, MISD, SIMD, MIMD, Conclusion

Conclusion: the massively parallel processing age
- Vector & SIMD: 256-bit, or even 512-bit, wide
- MIMD: parallel programming, distributed programming
- Quantum computing!!!
- Software development is slower than hardware development

Appendix: references
- Michael J. Flynn, Member, IEEE, "Very High-Speed Computing Systems"
- "Into the Fray With SIMD"
- "Understanding SIMD"
- Intel, "Matrix Multiply Application Note"
- Dror Feitelson, Hebrew University, "Parallel Computing Systems"
- "Does the Pentium MMX Live up to the Expectations?"

End of Talk ^_^ Thank you!
High Performance Computing