ECE 669 Parallel Computer Architecture
Lecture 7: Resource Balancing
February 19, 2004

Outline
° Last time: qualitative discussion of balance
° Need for analytic approach
° Tradeoffs between computation, communication, and memory
° Generalize from programming model for now
° Evaluate for grid computations
  - Jacobi
  - Ocean

Designing Parallel Computers: An Integrated Approach
[Figure: THE SYSTEM. Languages (functional, memory-based, data-parallel, object-oriented) and applications sit above the machine, a hardware substrate of communication, memory, and processing resources, which exposes hardware primitives, e.g.: add operation, read, send, synchronize, HW barrier.]

Designing Parallel Computers: An Integrated Approach (continued)
[Figure: the same stack with the compiler system and runtime system inserted: language front ends feed a compiler that emits an intermediate form (e.g., sends...) targeting the machine's hardware primitives.]

Hardware/Software Interaction
° Hardware architecture
  - Principles in the design of processors, memory, communication
  - Choosing primitive operations to be supported in HW
  - Function of available technology
° Compiler + runtime technology
  - Hardware-software tradeoffs -- where to draw the line
  - Compiler-runtime optimizations for parallel processing
[Figure: languages compile through an intermediate form (IF) at the HW-SW boundary down to machine mechanisms, all constrained by technology.]

Lesson from RISCs
[Figure: a high-level abstraction (a high-level op, e.g., the ubiquitous POLYF) compiles through an intermediate form (IF) onto simple hardware-supported mechanisms.]
Simple hardware-supported mechanisms:
- Add reg-reg
- Reg-reg move
- Load/store
- Branch
- Jump
- Jump & link
- Compare & branch
- Unaligned loads
- Reg-reg shifts

Binary Compatibility
° HW can change -- but must run existing binaries
  - Mechanisms are important
° vs. language-level compatibility vs. IF compatibility
  - Pick the best mechanisms given current technology -- fast!
  - Have the compiler backend tailored to the current implementation
  - Mechanisms are not inviolate!
° Key: "compile & run" -- compile time must be small, so have some IF

Choosing hardware mechanisms
° To some extent... but it is becoming more scientific
° Metrics for choosing mechanisms:
  - Performance
  - Simplicity
  - Scalability -- match physical constraints
  - Universality
  - Cost-effectiveness (balance)
  - Disciplined use of mechanisms
  - Because the Dept. of Defense said so!
° For inclusion, a mechanism must be:
  - Justified by frequency of use
  - Easy to implement
  - Implementable using off-the-shelf parts

Hardware Design Issues
° Per node: storage for state information, operations on the state, communication of information
° Questions:
  - How much storage, comm bandwidth, processing?
  - What operations to support (on the state)?
  - What mechanisms for communication?
  - What mechanisms for synchronization?
° Let's look at:
  - Applying the metrics to make the major decisions -- e.g., memory vs. comm BW vs. processing
  - A quick look at previous designs
[Figure: nodes, each built from memory, communication, and processing resources.]

Example use of metrics
° Performance
  - Support for floating point, justified by frequency of use
  - Caches speed memory ops
° Simplicity
  - Multiple simple mechanisms -- SW synthesizes complex forms
  - E.g., barrier built from primitive F/E bits and send ops
° Universality
  - Must not preclude certain ops (precluding is the same as taking a heavy performance hit)
  - E.g., without fast message send, remote process invocation is very expensive
° Discipline
  - Avoid proliferation of mechanisms. When multiple options are available, try to stick to one.
  - E.g., software prefetch, write overlapping through weak ordering, and rapid context switching all allow latency tolerance.
° Cost effectiveness, scalability...

Cost effectiveness and the notion of balance
° Balanced design -> every machine resource is utilized to the fullest during computation
° Otherwise -> apportion dollars from the underutilized resource to the more heavily used resources
° Two fundamental issues:
  - Balance: choosing the size, speed, etc. of each resource so that no idle time results from a mismatch!
  - Overlapping: implementing each resource so that it can overlap its operation completely with the operation of the other resources

Consider the basic pipeline
[Figure: a three-stage pipeline of resources X, Y, and Z operating on a stream of items ..., c, b, a.]
° Overlapping
  - X, Y, Z must be able to operate concurrently -- that is, when X is performing OP_x on c, Y must be able to perform OP_y on b, and Z must be able to perform OP_z on a.
° Balance
  - To avoid wastage or idle time in X, Y, or Z, design each so that:
    TIME(OP_x) = TIME(OP_y) = TIME(OP_z)
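A small Python sketch (not from the original slides; the function name and stage times are illustrative) of why the balance condition matters: a synchronous pipeline completes one item per max stage time, so for a fixed total amount of work per item, equal stage times minimize total time.

```python
# Illustrative sketch: steady-state pipeline throughput is limited by the
# slowest stage, so a balanced design makes all stage times equal.

def pipeline_time(stage_times, n_items):
    """Cycles to push n_items through a synchronous pipeline.

    The first item takes the sum of all stage times to emerge; after
    that, one item completes every max(stage_times) cycles.
    """
    fill = sum(stage_times)            # latency of the first item
    beat = max(stage_times)            # steady-state initiation interval
    return fill + (n_items - 1) * beat

print(pipeline_time([1, 3, 2], 1000))  # 3003 cycles: stage Y is a bottleneck
print(pipeline_time([2, 2, 2], 1000))  # 2004 cycles: same work, balanced
```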

Overlap in multiprocessors
° Processors are able to process other data while the communication network is busy servicing previously issued requests.
[Figure: a node -- processor (P), memory (M), and communication interface -- where the processor continues with an add while an earlier load is still outstanding in the network.]

Balance in multiprocessors
° A machine is balanced if each resource is used fully
  - For a given problem size
  - For a given algorithm
° Let's work out an example... N-body:
  - For each body, compute the force from the N - 1 others.

Consider a single node
° Application parameters (each a function of P, the number of nodes, and N, the problem size):
  - R_P: number of processing ops
  - R_M: number of memory words
  - R_C: number of comm. words
° Machine parameters (a function of the available budget):
  - p: processing rate (ops/sec)
  - m: memory size (words)
  - c: communication rate (words/sec)
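To make the parameters concrete, here is a hedged Python sketch for the naive N-body step (the per-node requirements R_P = N^2/P, R_M = N/P, R_C = N are taken from the "Solutions to Balance Questions" slide later in the lecture; the function names are invented for illustration).

```python
# Sketch: per-node requirements for naive all-pairs N-body, and the
# balance test "compute time equals comm time, memory exactly full".

def nbody_requirements(N, P):
    R_P = N**2 / P   # processing ops per node (all-pairs force terms)
    R_M = N / P      # memory words per node (locally stored bodies)
    R_C = N          # comm words per node (must see all N bodies)
    return R_P, R_M, R_C

def is_balanced(N, P, p, m, c):
    R_P, R_M, R_C = nbody_requirements(N, P)
    return R_P / p == R_C / c and R_M == m

print(is_balanced(N=1000, P=10, p=10e6, m=100, c=0.1e6))  # True
```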

Questions
° 1. Given N, P: find p, m, c for balance.
° 2. Suppose p = 10 × 10^6, m = 0.1 × 10^6, c = 0.1 × 10^6, P = 10. For what N do we have balance?
° 3. Suppose p = 2 × 10 × 10^6: how do we rebalance by changing N?
° 4. Given a fixed budget D and size N, find the optimal machine, given the per-node costs of p, m, c, and P: proc K_p, memory K_m, comm K_c.

Issue of "size efficiency" -- SE
° A machine that requires a larger problem size to achieve balance has a lower SE than a machine that achieves balance with a smaller problem size.

Machine A (N-body, naive):         Machine B (N-body, naive):
  p = 10 × 10^6, c = 0.1 × 10^6      p = 10 × 10^6, c = 0.01 × 10^6
  m = 100, P = 10                    m = 1000, P = 10
  Balanced for N = 1,000             Balanced for N = 10,000
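A quick Python check of the two machines, using the balance conditions derived later in the lecture (p/c = N/P and m = N/P); it reproduces the balanced problem sizes claimed above.

```python
# For naive N-body, balance requires p/c = N/P, so N = P * (p/c),
# and the memory needed per node is m = N/P.
for name, p, c, P in [("A", 10e6, 0.1e6, 10), ("B", 10e6, 0.01e6, 10)]:
    N = P * (p / c)
    print(f"Machine {name}: balanced at N = {N:.0f}, m = {N / P:.0f} words")
# Machine A: balanced at N = 1000, m = 100 words
# Machine B: balanced at N = 10000, m = 1000 words
```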

Intuition
° For typical problems, the ratio
  (comm. requirements per node) / (proc. requirements per node)
  goes down as N increases (or as P decreases).
° So, machines that provide a higher ratio of comm to compute power tend to have higher SE.
° What about memory? Machines with small comm-to-compute ratios tend to provide more memory per node. We now know why!
° However, the RIGHT SE is:
  - Problem dependent
  - Relative-cost dependent as well
° Think of any counterexamples?

Scalability
° What does it mean to say a system is scalable?
° TRY: a scalable architecture enjoys speedup proportional to P, the number of nodes:
  Speedup = T(1)/T(P) ∝ P for a scalable architecture
° If the problem size is fixed at N, T(P) will not decrease beyond some P [assuming some unit of computation, for example an add, beyond which we do not attempt to parallelize algorithms].

Scalability
° N is the problem size.
° S_R(N): asymptotic speedup for machine R,
  S_R(N) = (serial running time) / (parallel running time),
  computed using as many nodes as necessary to yield the best running time (make problems very large).
° S_I(N): asymptotic speedup on an ideal machine, the EREW PRAM.
° Define:
  ψ(N) = S_R(N) / S_I(N)
       = (asymptotic speedup on machine R) / (asymptotic speedup on EREW PRAM)
° Intuitively, ψ(N): the fraction of the parallelism inherent in a given algorithm that can be exploited by any machine of that architecture, as a function of problem size N.
° Intuitively, S_R(N): the maximum speedup achievable on any machine of the given architecture.

Intuition
° S_R(N): the maximum speedup achievable on any-sized machine of the given architecture.
° ψ(N) = S_R(N)/S_I(N): the fraction of the parallelism inherent in a given algorithm that can be exploited by any machine of that architecture, as a function of problem size N.

Scalability
° Example: the (by now) famous Jacobi relaxation.
° 2-D mesh:
  ψ(N) = S_R(N)/S_I(N) = 1
° i.e., the mesh is 1-scalable for the Jacobi relaxation step.
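A short justification (a reconstruction, assuming the unit-cost model of the next slide and one grid point per node, i.e., P = N):

```latex
% Why the 2-D mesh is 1-scalable for one Jacobi relaxation step:
% with P = N, each node updates its point from its mesh neighbors
% in constant time.
\[
T_{seq}(N) = \Theta(N), \qquad T_{par}(N) = \Theta(1)
\;\Rightarrow\; S_R(N) = \Theta(N)
\]
\[
S_I(N) = \Theta(N) \text{ (EREW PRAM)}
\;\Rightarrow\; \psi(N) = \frac{S_R(N)}{S_I(N)} = 1
\]
```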

1-D Array of nodes for Jacobi
[Figure: a linear array of P nodes; each node holds N/P grid points and performs N/P ops per relaxation step.]
° Model: 1 op = 1 cycle; 1 comm hop = 1 cycle.

Scalability
° Parallel time per relaxation step (computation plus communication):
  T_par = N/P + √P
° Ideal speedup on any number of procs: find the best P:
  ∂T_par/∂P = 0  ⟹  P = (2N)^(2/3) ∝ N^(2/3), so T_par ∝ N^(1/3)
° With T_seq = N:
  S_R(N) = N / N^(1/3) = N^(2/3)
° So, ψ(N) = S_R(N)/S_I(N) = N^(2/3)/N = N^(-1/3)
° So, the 1-D array is N^(-1/3)-scalable for Jacobi.
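The optimum can be checked numerically; a small Python sketch (using the cost model T_par = N/P + √P reconstructed above):

```python
# Sweep P to confirm the analytic optimum: P_opt tracks (2N)^(2/3)
# and the resulting T_par grows as N^(1/3).
import math

def best_P(N):
    T = lambda P: N / P + math.sqrt(P)
    return min(range(1, N + 1), key=T)

for N in (10**3, 10**4, 10**5):
    P = best_P(N)
    print(f"N={N}: P_opt={P} (analytic ~{(2 * N) ** (2 / 3):.0f}), "
          f"T_par={N / P + math.sqrt(P):.1f} (~N^(1/3)={N ** (1 / 3):.1f})")
```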

Solutions to Balance Questions
° 1. For the N-body example, the per-node requirements are:
  R_P = N^2/P, R_M = N/P, R_C = N
° For balance, processing time must equal communication time, with memory full:
  R_P/p = R_C/c, R_M = m
  and the run time is T = R_P/p = N^2/(Pp)
° So: N^2/(Pp) = N/c
° This yields the p/c ratio: p/c = N/P

Detailed Example
Given p = 10 × 10^6, c = 0.1 × 10^6, m = 0.1 × 10^6, P = 10:
  p/c = N/P, or (10 × 10^6)/(0.1 × 10^6) = N/10, or N = 1000 for balance.
Also, R_M = m:
  N/P = 1000/10 = 100 = m
A memory size of m = 100 words per node yields a balanced machine.
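The same arithmetic in Python (values from the slide):

```python
p, c, m_available, P = 10e6, 0.1e6, 0.1e6, 10

N = P * (p / c)      # balance condition p/c = N/P  ->  N = 1000
m_needed = N / P     # memory-full condition R_M = m  ->  m = 100
print(N, m_needed)   # 1000.0 100.0: far less than the 0.1e6 words provisioned
```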

Twice as fast processor
From p/c = N/P: if p' = 2p, then N' = 2N and m' = 2m.
Double the problem size (and the memory per node) to rebalance.

Find Optimal Machine
° a. Optimize (minimize the run time):
  T = N^2/(Pp)
° b. Subject to the cost constraint:
  D = (K_p + K_m + K_c) × P = (pK_ps + mK_ms + cK_cs) × P
° c. At the optimum, the balance constraints are satisfied:
  p/c = N/P, m = N/P
  This makes the solution easier to find, but is not strictly needed.
° (The optimization process should discover c. if not supplied.)

Eliminate unknowns by substitution
° Substitute m = N/P and c = pP/N into the cost constraint:
  D = (pK_ps + mK_ms + cK_cs) × P = pPK_ps + NK_ms + (P^2/N)pK_cs
° Solve for p:
  p = (D - NK_ms) / (PK_ps + (P^2/N)K_cs)
° Substitute into T = N^2/(Pp):
  T = N^2 (K_ps + (P/N)K_cs) / (D - NK_ms)
° T is minimized when P = 1!
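A numeric confirmation of the P = 1 conclusion (a sketch; the unit costs K_ps, K_ms, K_cs and budget D are illustrative values, not from the slides):

```python
# For a fixed budget D and problem size N, rebalance at each P
# (m = N/P, c = pP/N), spend the remainder on p, and evaluate T.
K_ps, K_ms, K_cs = 1.0, 0.01, 2.0
N, D = 10_000, 1e6

def run_time(P):
    p = (D - N * K_ms) / (P * K_ps + (P**2 / N) * K_cs)  # from the cost eqn
    return N**2 / (P * p)

times = {P: round(run_time(P), 2) for P in (1, 2, 4, 8, 16, 32)}
print(min(times, key=times.get), times)  # P = 1 gives the smallest T
```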

Summary
° Balance between communication, memory, and computation
° Different measures of scalability
° A problem often has a specific optimal machine configuration
° Balance can be shown analytically for a variety of machines
° Meshes, cubes, and linear arrays are appropriate for various problems