STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.)

Parallelization. Four types of computing, classified by what is issued per clock cycle:
–Instruction stream: single or multiple
–Data stream: single or multiple
Single Instruction Single Data (SISD): serial computing
Single Instruction Multiple Data (SIMD): multiple processing elements, GPUs
Multiple Instruction Single Data (MISD): rare in practice (e.g., redundant or pipelined designs)
Multiple Instruction Multiple Data (MIMD): cluster computing, multi-core CPUs, multi-threading, message passing (e.g., IBM SP-x on a hypercube; Intel's single-chip Xeon Phi, promoted as a future direction for supercomputing)

Grid Computing & Cloud
–Not necessarily parallel
–Primary focus is utilizing CPU cycles across what are just networked CPUs; a middleware layer makes node utilization transparent
–A major focus: avoid data transfer – run the code where the data reside
–Another focus: load balancing
–Message-passing parallelization is possible: MPI, PVM, etc.
–Community-specific grids: CERN, Bio-grid, cardiovascular grid, etc.
–Cloud: data-archiving focus, but really a commercial version of the Grid; CPU utilization is under-sold but coming up – expect the service-oriented software business model to pick up
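
Since the slide names MPI as the message-passing option, here is a minimal, hedged MPI sketch (it assumes an MPI installation with mpic++ and mpirun; the payload value and rank roles are illustrative, not from the slides): rank 0 sends an integer to rank 1 over the network.

// Minimal MPI point-to-point sketch: rank 0 sends, rank 1 receives.
// Typical build/run (installation-dependent): mpic++ ping.cpp -o ping && mpirun -np 2 ./ping
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                      // need at least two ranks to pass a message
        if (rank == 0) std::printf("run with at least 2 ranks\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        int payload = 42;                // data lives in rank 0's local memory
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d over the 'pipe'\n", payload);
    }

    MPI_Finalize();
    return 0;
}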

RAM Memory Utilization
Two types are feasible:
Shared memory:
–Fast, possibly on-chip; no message-passing time; no dependency on a 'pipe' and its possible failure
–But consistency must be controlled explicitly, which may cause deadlock, which in turn needs a deadlock detection-and-breaking mechanism – added overhead
Distributed local memory:
–Communication overhead; 'pipe' failure is a practical concern
–Good model where threads are independent of each other
–Most general model for parallelization
–Easy to code, with a well-established library (MPI)
–Scaling up is easy – from on-chip to over-the-globe
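
A minimal shared-memory sketch in standard C++ (the counter workload and thread count are illustrative): several threads update one shared location, and consistency has to be enforced explicitly with a lock – exactly the overhead, and potential deadlock source, named above.

#include <mutex>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    long counter = 0;            // shared state, visible to every thread
    std::mutex m;                // explicit consistency control

    auto worker = [&]() {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lock(m);   // without this, the updates race
            ++counter;
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();

    std::printf("counter = %ld\n", counter);       // 400000 only because of the lock
    return 0;
}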

Threading Types
Two types are feasible:
–Static threading: the OS controls it; typical for single-core CPUs (why would one do it? – for the OS itself), but multi-core CPUs also use it when the compiler can guarantee safe execution
–Dynamic threading: the program controls it explicitly; threads are created and destroyed as needed – this is the parallel-computing model used here
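
A small C++ sketch of the dynamic-threading idea (task count and the work done per task are illustrative): the program itself decides when worker threads come and go.

#include <thread>
#include <vector>
#include <cstdio>

// Dynamic threading sketch: the program, not a fixed OS-level assignment,
// decides when workers are created and destroyed.
void handle(int task_id) {
    std::printf("task %d handled by its own short-lived thread\n", task_id);
}

int main() {
    std::vector<std::thread> workers;
    for (int id = 0; id < 8; ++id)          // threads created as work arrives
        workers.emplace_back(handle, id);
    for (auto& w : workers) w.join();       // and destroyed when the work is done
    return 0;
}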

Multi-threaded Fibonacci
Serial recursive version:
Fib(n)
  if n <= 1 then return n
  x = Fib(n-1)
  y = Fib(n-2)
  return x + y
Complexity: Θ(φ^n), where φ is the golden ratio ≈ 1.618
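
A direct C++ transcription of the serial pseudocode (the naive recursion is kept on purpose, to match the stated exponential cost; the test value 30 is illustrative):

#include <cstdint>
#include <cstdio>

// Naive serial Fibonacci, mirroring the pseudocode:
// T(n) = T(n-1) + T(n-2) + Θ(1), which grows as Θ(φ^n) with φ ≈ 1.618.
std::uint64_t fib(unsigned n) {
    if (n <= 1) return n;
    std::uint64_t x = fib(n - 1);
    std::uint64_t y = fib(n - 2);
    return x + y;
}

int main() {
    std::printf("fib(30) = %llu\n", static_cast<unsigned long long>(fib(30)));
    return 0;
}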

Fibonacci with dynamic threading:
Fib(n)
  if n <= 1 then return n
  x = spawn Fib(n-1)
  y = Fib(n-2)
  sync
  return x + y
Parallel execution of the spawned threads is optional: the scheduler decides (programmer, script translator, compiler, OS)
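
One way to sketch spawn/sync in standard C++ is with std::async and future::get; this is an approximation of the CLRS/Cilk model, not the book's own runtime, and the serial cutoff of 20 is an illustrative tuning knob. With the default launch policy the runtime may run the "spawned" call in parallel or defer it – echoing the point that parallel execution is optional and left to the scheduler.

#include <cstdint>
#include <cstdio>
#include <future>

std::uint64_t fib_serial(unsigned n) {
    return n <= 1 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

std::uint64_t fib_par(unsigned n) {
    if (n < 20) return fib_serial(n);              // stop spawning tiny tasks
    auto x = std::async(fib_par, n - 1);           // "spawn Fib(n-1)"
    std::uint64_t y = fib_par(n - 2);              // continue in this thread
    return x.get() + y;                            // "sync", then combine
}

int main() {
    std::printf("fib(32) = %llu\n",
                static_cast<unsigned long long>(fib_par(32)));
    return 0;
}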

GPU-type parallelization: ideal time ~ critical path length
–The more balanced the spawn tree, the shorter the critical path
–A spawn or data-collection node is counted as one time unit – this is the message-passing view
–Note: GPU/SIMD uses a different model – each thread does the same work (a kernel), and the data go to shared memory
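
The kernel-style model in the last bullet can be imitated in plain C++ (this is only an analogy of the GPU/SIMD idea, not GPU code; the array size and thread count are illustrative): every thread runs the same kernel, only the index range differs, and all of them operate on one shared array.

#include <cstddef>
#include <thread>
#include <vector>
#include <cstdio>

// Each "thread" applies the same operation (the kernel) to its slice of shared data.
void kernel(std::vector<float>& data, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        data[i] = data[i] * 2.0f + 1.0f;           // same work everywhere
}

int main() {
    std::vector<float> data(1 << 20, 1.0f);        // the "shared memory"
    const unsigned nthreads = 4;
    const std::size_t chunk = data.size() / nthreads;

    std::vector<std::thread> ts;
    for (unsigned t = 0; t < nthreads; ++t)
        ts.emplace_back(kernel, std::ref(data), t * chunk, (t + 1) * chunk);
    for (auto& th : ts) th.join();

    std::printf("data[0] = %.1f\n", data[0]);      // 3.0
    return 0;
}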

Terminologies/Concepts
For P available processors, define T_inf, TP, T1: the running times with unlimited processors, with P processors, and with a single serial processor, respectively
–Ideal parallelization: TP = T1 / P
–Real situation: TP >= T1 / P
–T_inf is the theoretical minimum feasible, so TP >= T_inf
–Speedup factor: T1 / TP, with T1 / TP <= P
–Linear speedup: T1 / TP = O(P) [e.g., 3P + c]
–Perfect linear speedup: T1 / TP = P
–My preferred factor would be TP / T1 (inverse speedup: a slowdown factor?) – linear O(P); quadratic O(P^2), …, exponential O(k^P), k > 1

Terminologies/Concepts
For P available processors, with T_inf, TP, T1 as before (unlimited processors down to a single serial processor):
–Parallelism factor: T1 / T_inf – serial time divided by ideal parallelized time
–Note: this is a property of your algorithm, independent of the actual configuration available to you
–T1 / T_inf < P implies NOT linear speedup
–T1 / T_inf << P implies the processors are underutilized
–We want to be close to P: T1 / T_inf → P, in the limit
–Slackness factor: (T1 / T_inf) / P, i.e., T1 / (T_inf · P)
–We want slackness → 1, the minimum feasible – i.e., we want no slack
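
A small numeric sketch of these definitions; the timings are hypothetical round numbers chosen so that the slide's target of slackness ≈ 1 holds, not measurements.

#include <cstdio>

int main() {
    // Hypothetical numbers: work T1 = 1024, span T_inf = 128, P = 8 processors.
    const double T1 = 1024.0, Tinf = 128.0, P = 8.0;

    const double parallelism = T1 / Tinf;          // 8: matches P, the slide's target
    const double slackness   = parallelism / P;    // 1: no slack
    const double TP_lower    = (T1 / P > Tinf) ? T1 / P : Tinf;  // TP >= max(T1/P, T_inf) = 128

    std::printf("parallelism = %.0f, slackness = %.0f, TP >= %.0f\n",
                parallelism, slackness, TP_lower);
    return 0;
}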