1 Lecture 21: Core Design, Parallel Algorithms Today: ARM Cortex A-15, power, sort and matrix algorithms.

Slides:

Advertisements

Similar presentations

Chapter 5: Tree Constructions

Advertisements

Lecture 19: Parallel Algorithms

Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Chapter 4 Retiming.

Lecture 3: Parallel Algorithm Design

Dense Matrix Algorithms. Topic Overview Matrix-Vector Multiplication Matrix-Matrix Multiplication Solving a System of Linear Equations.

CIS December '99 Introduction to Parallel Architectures Dr. Laurence Boxer Niagara University.

CS 7810 Lecture 16 Simultaneous Multithreading: Maximizing On-Chip Parallelism D.M. Tullsen, S.J. Eggers, H.M. Levy Proceedings of ISCA-22 June 1995.

1 Parallel Algorithms II Topics: matrix and graph algorithms.

Advanced Topics in Algorithms and Data Structures An overview of the lecture 2 Models of parallel computation Characteristics of SIMD models Design issue.

1 Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1)

CS 7810 Lecture 12 Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors D. Brooks et al. IEEE Micro, Nov/Dec.

1 Lecture 15: DRAM Design Today: DRAM basics, DRAM innovations (Section 5.3)

1 Lecture 25: Parallel Algorithms II Topics: matrix, graph, and sort algorithms Tuesday presentations:  Each group: 10 minutes  Describe the problem,

Lecture 21: Parallel Algorithms

1 Lecture 26: Storage Systems Topics: Storage Systems (Chapter 6), other innovations Final exam stats:  Highest: 95  Mean: 70, Median: 73  Toughest.

1 Lecture 19: Core Design Today: issue queue, ILP, clock speed, ILP innovations.

Interconnection Network PRAM Model is too simple Physically, PEs communicate through the network (either buses or switching networks) Cost depends on network.

CS 7810 Lecture 14 Reducing Power with Dynamic Critical Path Information J.S. Seng, E.S. Tune, D.M. Tullsen Proceedings of MICRO-34 December 2001.

1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )

Defining Wakeup Width for Efficient Dynamic Scheduling A. Aggarwal, O. Ergin – Binghamton University M. Franklin – University of Maryland Presented by:

CS 7810 Lecture 3 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin.

1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )

1 Parallel Algorithms III Topics: graph and sort algorithms.

1 Lecture 13: Cache Innovations Today: cache access basics and innovations, DRAM (Sections )

Models of Parallel Computation Advanced Algorithms & Data Structures Lecture Theme 12 Prof. Dr. Th. Ottmann Summer Semester 2006.

1 Lecture 24: Parallel Algorithms I Topics: sort and matrix algorithms.

DAST 2005 Week 4 – Some Helpful Material Randomized Quick Sort & Lower bound & General remarks…

Lecture 7: Power.

CS 7810 Lecture 21 Threaded Multiple Path Execution S. Wallace, B. Calder, D. Tullsen Proceedings of ISCA-25 June 1998.

1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power ISCA workshops Sign up for class presentations.

Computer Algorithms Lecture 11 Sorting in Linear Time Ch. 8

Modern VLSI Design 4e: Chapter 4 Copyright  2008 Wayne Wolf Topics n Interconnect design. n Crosstalk. n Power optimization.

ENGG 6090 Topic Review1 How to reduce the power dissipation? Switching Activity Switched Capacitance Voltage Scaling.

CS668- Lecture 2 - Sept. 30 Today’s topics Parallel Architectures (Chapter 2) Memory Hierarchy Busses and Switched Networks Interconnection Network Topologies.

1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory unit using a variety of interconnection.

Network Aware Resource Allocation in Distributed Clouds.

CS Lecture 4 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin.

1 Computer Architecture Research Overview Rajeev Balasubramonian School of Computing, University of Utah

1 CS/EE 6810: Computer Architecture Class format:  Most lectures on YouTube *BEFORE* class  Use class time for discussions, clarifications, problem-solving,

Modern VLSI Design 3e: Chapter 4 Copyright  1998, 2002 Prentice Hall PTR Topics n Interconnect design. n Crosstalk. n Power optimization.

Super computers Parallel Processing By Lecturer: Aisha Dawood.

+ CS 325: CS Hardware and Software Organization and Architecture Memory Organization.

ELEC692 VLSI Signal Processing Architecture Lecture 2 Pipelining and Parallel Processing.

Computer Architecture Lecture 32 Fasih ur Rehman.

Chapter One Introduction to Pipelined Processors

Basic Linear Algebra Subroutines (BLAS) – 3 levels of operations Memory hierarchy efficiently exploited by higher level BLAS BLASMemor y Refs. FlopsFlops/

Super computers Parallel Processing

Data Structures and Algorithms in Parallel Computing Lecture 10.

COMP541 Memories II: DRAMs

Computer Organization Yasser F. O. Mohammad 1. 2 Lecture 1: Introduction Today’s topics:  Why computer organization is important  Logistics  Modern.

1 Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Sorting Lower Bounds n Beating Them. Recap Divide and Conquer –Know how to break a problem into smaller problems, such that –Given a solution to the smaller.

1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power Sign up for class presentations.

COMP8330/7330/7336 Advanced Parallel and Distributed Computing Tree-Based Networks Cache Coherence Dr. Xiao Qin Auburn University

Lower Bounds & Sorting in Linear Time

COMP541 Memories II: DRAMs

Computer Organization and Architecture + Networks

Lecture 3: Parallel Algorithm Design

Lecture 18: Core Design, Parallel Algos

Lecture 23: Interconnection Networks

Lecture: SMT, Cache Hierarchies

Lecture 16: Parallel Algorithms I

Lecture 22: Parallel Algorithms

Lecture: SMT, Cache Hierarchies

Lecture 17: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.

High Performance Computing & Bioinformatics Part 2 Dr. Imad Mahgoub

Lecture: SMT, Cache Hierarchies

Lecture 19: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.

Lower Bounds & Sorting in Linear Time

Presentation transcript:

1 Lecture 21: Core Design, Parallel Algorithms Today: ARM Cortex A-15, power, sort and matrix algorithms

Power/Energy Basics Energy = Power x time Power = Dynamic power + Leakage power Dynamic Power =  C V 2 f  switching activity factor C capacitances being charged V voltage swing f processor frequency

3 Guidelines Dynamic frequency scaling (DFS) can impact power, but has little impact on energy Optimizing a single structure for power/energy is good for overall energy only if execution time is not increased A good metric for comparison: ED (because DVFS is an alternative way to play with the E-D trade-off) Clock gating is commonly used to reduce dynamic energy, DFS is very cheap (few cycles), DVFS and power gating are more expensive (micro-seconds or tens of cycles, fewer margins, higher error rates) 2

Criticality Metrics Criticality has many applications: performance and power; usually, more useful for power optimizations QOLD – instructions that are the oldest in the issueq are considered critical  can be extended to oldest-N  does not need a predictor  young instrs are possibly on mispredicted paths  young instruction latencies can be tolerated  older instrs are possibly holding up the window  older instructions have more dependents in the pipeline than younger instrs

Other Criticality Metrics QOLDDEP: Producing instructions for oldest in q ALOLD: Oldest instr in ROB FREED-N: Instr completion frees up at least N dependent instrs Wake-Up: Instr completion triggers a chain of wake-up operations Instruction types: cache misses, branch mpreds, and instructions that feed them

6 Parallel Algorithms – Processor Model High communication latencies  pursue coarse-grain parallelism (the focus of the course so far) Next, focus on fine-grain parallelism VLSI improvements  enough transistors to accommodate numerous processing units on a chip and (relatively) low communication latencies Consider a special-purpose processor with thousands of processing units, each with small-bit ALUs and limited register storage

7 Sorting on a Linear Array Each processor has bidirectional links to its neighbors All processors share a single clock (asynchronous designs will require minor modifications) At each clock, processors receive inputs from neighbors, perform computations, generate output for neighbors, and update local storage input output

8 Control at Each Processor Each processor stores the minimum number it has seen Initial value in storage and on network is “  ”, which is bigger than any input and also means “no signal” On receiving number Y from left neighbor, the processor keeps the smaller of Y and current storage Z, and passes the larger to the right neighbor

9 Sorting Example

10 Result Output The output process begins when a processor receives a non- , followed by a “  ” Each processor forwards its storage to its left neighbor and subsequent data it receives from right neighbors How many steps does it take to sort N numbers? What is the speedup and efficiency?

11 Output Example

12 Bit Model The bit model affords a more precise measure of complexity – we will now assume that each processor can only operate on a bit at a time To compare N k-bit words, you may now need an N x k 2-d array of bit processors

13 Comparison Strategies Strategy 1: Bits travel horizontally, keep/swap signals travel vertically – after at most 2k steps, each processor knows which number must be moved to the right – 2kN steps in the worst case Strategy 2: Use a tree to communicate information on which number is greater – after 2logk steps, each processor knows which number must be moved to the right – 2Nlogk steps Can we do better?

14 Strategy 2: Column of Trees

15 Pipelined Comparison Input numbers:

16 Complexity How long does it take to sort N k-bit numbers? (2N – 1) + (k – 1) + N (for output) (With a 2d array of processors) Can we do even better? How do we prove optimality?

17 Lower Bounds Input/Output bandwidth: Nk bits are being input/output with k pins – requires  (N) time Diameter: the comparison at processor (1,1) influences the value of the bit stored at processor (N,k) – for example, N-1 numbers are and the last number is either 00…0 or 10…0 – it takes at least N+k-2 steps for information to travel across the diameter Bisection width: if processors in one half require the results computed by the other half, the bisection bandwidth imposes a minimum completion time

18 Counter Example N 1-bit numbers that need to be sorted with a binary tree Since bisection bandwidth is 2 and each number may be in the wrong half, will any algorithm take at least N/2 steps?

19 Counting Algorithm It takes O(logN) time for each intermediate node to add the contents in the subtree and forward the result to the parent, one bit at a time After the root has computed the number of 1’s, this number is communicated to the leaves – the leaves accordingly set their output to 0 or 1 Each half only needs to know the number of 1’s in the other half (logN-1 bits) – therefore, the algorithm takes  (logN) time Careful when estimating lower bounds!

20 Matrix Algorithms Consider matrix-vector multiplication: y i =  j a ij x j The sequential algorithm takes 2N 2 – N operations With an N-cell linear array, can we implement matrix-vector multiplication in O(N) time?

21 Matrix Vector Multiplication Number of steps = ?

22 Matrix Vector Multiplication Number of steps = 2N – 1

23 Matrix-Matrix Multiplication Number of time steps = ?

24 Matrix-Matrix Multiplication Number of time steps = 3N – 2

25 Complexity The algorithm implementations on the linear arrays have speedups that are linear in the number of processors – an efficiency of O(1) It is possible to improve these algorithms by a constant factor, for example, by inputting values directly to each processor in the first step and providing wraparound edges (N time steps)

26 Solving Systems of Equations Given an N x N lower triangular matrix A and an N-vector b, solve for x, where Ax = b (assume solution exists) a 11 x 1 = b 1 a 21 x 1 + a 22 x 2 = b 2, and so on…

27 Equation Solver

28 Equation Solver Example When an x, b, and a meet at a cell, ax is subtracted from b When b and a meet at cell 1, b is divided by a to become x

29 Complexity Time steps = 2N – 1 Speedup = O(N), efficiency = O(1) Note that half the processors are idle every time step – can improve efficiency by solving two interleaved equation systems simultaneously

30 Title Bullet