Spatial Computation Thesis committee: Seth Goldstein Peter Lee Todd Mowry Babak Falsafi Nevin Heintze Ph.D. Thesis defense, December 8, 2003 SCS Mihai.

Slides:



Advertisements
Similar presentations
1
Advertisements

Feichter_DPG-SYKL03_Bild-01. Feichter_DPG-SYKL03_Bild-02.
1 Vorlesung Informatik 2 Algorithmen und Datenstrukturen (Parallel Algorithms) Robin Pomplun.
© 2008 Pearson Addison Wesley. All rights reserved Chapter Seven Costs.
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
Chapter 1 The Study of Body Function Image PowerPoint
Processes and Operating Systems
1 Copyright © 2013 Elsevier Inc. All rights reserved. Chapter 1 Embedded Computing.
Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 6 Author: Julia Richards and R. Scott Hawley.
Author: Julia Richards and R. Scott Hawley
1 Copyright © 2013 Elsevier Inc. All rights reserved. Appendix 01.
1 Copyright © 2013 Elsevier Inc. All rights reserved. Chapter 3 CPUs.
Properties Use, share, or modify this drill on mathematic properties. There is too much material for a single class, so you’ll have to select for your.
UNITED NATIONS Shipment Details Report – January 2006.
RXQ Customer Enrollment Using a Registration Agent (RA) Process Flow Diagram (Move-In) Customer Supplier Customer authorizes Enrollment ( )
Dataflow: A Complement to Superscalar Mihai Budiu – Microsoft Research Pedro V. Artigas – Carnegie Mellon University Seth Copen Goldstein – Carnegie Mellon.
Optimizing Memory Accesses for Spatial Computation Mihai Budiu, Seth Goldstein CGO 2003.
Mihai Budiu Microsoft Research – Silicon Valley joint work with Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein Carnegie Mellon University.
Mihai Budiu Microsoft Research – Silicon Valley Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein Carnegie Mellon University Spatial Computation.
On The Energy Efficiency of Computation Mihai Budiu CMU CS CALCM Seminar Feb 17, 2004 Note: this version fixes some errors in the ASH performance graphs.
On the Critical Path of (Parallel) Computations Mihai Budiu March 30, 2005.
1 RA I Sub-Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Casablanca, Morocco, 20 – 22 December 2005 Status of observing programmes in RA I.
Properties of Real Numbers CommutativeAssociativeDistributive Identity + × Inverse + ×
FACTORING ax2 + bx + c Think “unfoil” Work down, Show all steps.
Year 6 mental test 10 second questions
REVIEW: Arthropod ID. 1. Name the subphylum. 2. Name the subphylum. 3. Name the order.
Table 12.1: Cash Flows to a Cash and Carry Trading Strategy.
Advance Nano Device Lab. Fundamentals of Modern VLSI Devices 2 nd Edition Yuan Taur and Tak H.Ning 0 Ch9. Memory Devices.
PP Test Review Sections 6-1 to 6-6
EU market situation for eggs and poultry Management Committee 20 October 2011.
EU Market Situation for Eggs and Poultry Management Committee 21 June 2012.
Bellwork Do the following problem on a ½ sheet of paper and turn in.
CS 6143 COMPUTER ARCHITECTURE II SPRING 2014 ACM Principles and Practice of Parallel Programming, PPoPP, 2006 Panel Presentations Parallel Processing is.
2 |SharePoint Saturday New York City
IP Multicast Information management 2 Groep T Leuven – Information department 2/14 Agenda •Why IP Multicast ? •Multicast fundamentals •Intradomain.
Exarte Bezoek aan de Mediacampus Bachelor in de grafische en digitale media April 2014.
BEEF & VEAL MARKET SITUATION "Single CMO" Management Committee 18 April 2013.
VOORBLAD.
1 public class Newton { public static double sqrt(double c) { double epsilon = 1E-15; if (c < 0) return Double.NaN; double t = c; while (Math.abs(t - c/t)
Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.
1 RA III - Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Buenos Aires, Argentina, 25 – 27 October 2006 Status of observing programmes in RA.
Factor P 16 8(8-5ab) 4(d² + 4) 3rs(2r – s) 15cd(1 + 2cd) 8(4a² + 3b²)
Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.
1..
CONTROL VISION Set-up. Step 1 Step 2 Step 3 Step 5 Step 4.
© 2012 National Heart Foundation of Australia. Slide 2.
Understanding Generalist Practice, 5e, Kirst-Ashman/Hull
1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt Synthetic.
Model and Relationships 6 M 1 M M M M M M M M M M M M M M M M
25 seconds left…...
Analyzing Genes and Genomes
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
Essential Cell Biology
Intracellular Compartments and Transport
PSSA Preparation.
Essential Cell Biology
Datorteknik TopologicalSort bild 1 To verify the structure Easy to hook together combinationals and flip-flops Harder to make it do what you want.
Mani Srivastava UCLA - EE Department Room: 6731-H Boelter Hall Tel: WWW: Copyright 2003.
Immunobiology: The Immune System in Health & Disease Sixth Edition
1 Chapter 13 Nuclear Magnetic Resonance Spectroscopy.
Energy Generation in Mitochondria and Chlorplasts
Murach’s OS/390 and z/OS JCLChapter 16, Slide 1 © 2002, Mike Murach & Associates, Inc.
Spatial Computation Computing without General-Purpose Processors Mihai Budiu Carnegie Mellon University July 8, 2004.
Compiling Application-Specific Hardware Mihai Budiu Seth Copen Goldstein Carnegie Mellon University.
Spatial Computation Mihai Budiu CMU CS CALCM Seminar, Oct 21, 2003.
Computing Without Processors Thesis Proposal Mihai Budiu July 30, 2001 This presentation uses TeXPoint by George Necula Thesis Committee: Seth Goldstein,
Presentation transcript:

Spatial Computation Thesis committee: Seth Goldstein Peter Lee Todd Mowry Babak Falsafi Nevin Heintze Ph.D. Thesis defense, December 8, 2003 SCS Mihai Budiu CMU CS

2 Spatial Computation Thesis committee: Seth Goldstein Peter Lee Todd Mowry Babak Falsafi Nevin Heintze Ph.D. Thesis defense, December 8, 2003 SCS A model of general-purpose computation based on Application-Specific Hardware.

3 Thesis Statement Application-Specific Hardware (ASH): can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors.

4 Outline Introduction Compiling for ASH Media processing on ASH ASH vs. superscalar processors Conclusions

5 CPU Problems Complexity Power Global Signals Limited ILP

6 Design Complexity from Michael Flynns FCRC 2003 talk

7 Communication vs. Computation 5ps20ps gate wire Power consumption on wires is also dominant

8 Our Approach: ASH Application-Specific Hardware

Programs Resource Binding Time CPUASH

10 Hardware Interface CPUASH ISA software hardware software hardware gates virtual ISA

11 Application-Specific Hardware C program Compiler Dataflow IR Reconfigurable/custom hw

12 Contributions Compilation Computer architecture Reconfigurable computing Embedded systems Asynchronous circuits High-level synthesis Dataflow machines Nanotechnology theory systems

13 Outline Introduction CASH: Compiling for ASH Media processing on ASH ASH vs. superscalar processors Conclusions

14 Computation = Dataflow Operations ) functional units Variables ) wires No interpretation x = a & 7;... y = x >> 2; Programs & a 7 >> 2 x Circuits

15 Basic Operation + data valid ack latch

16 + Asynchronous Computation data valid ack latch

17 Distributed Control Logic +- ack rdy FSM asynchronous control short, local wires

18 Forward Branches if (x > 0) y = -x; else y = b*x; * xb0 y ! -> Conditionals ) Speculation critical path

19 Control Flow ) Data Flow data predicate Merge (label) Gateway data Split (branch) p !

20 i +1 < * + sum 0 Loops int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; ! ret

21 no speculation sequencing of side-effects Predication and Side-Effects Load addr data pred token to memory

22 Thesis Statement Application-Specific Hardware: can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors.

23 Outline Introduction CASH: Compiling for ASH –An optimization on the SIDE Media processing on ASH ASH vs. superscalar processors Conclusions skip to

24 Availability Dataflow Analysis y y = a*b;... if (x) { = a*b; }

25 Dataflow Analysis Is Conservative if (x) {... y = a*b; } = a*b; y?y?

26 Static Instantiation, Dynamic Evaluation flag = false; if (x) {... y = a*b; flag = true; } = flag ? y : a*b;

27 SIDE Register Promotion Impact Loads Stores % reduction

28 Outline Introduction CASH: Compiling for ASH Media processing on ASH ASH vs. superscalar processors Conclusions

29 Performance Evaluation ASH LSQ limited BW L1 8K L2 1/4M Mem CPU: 4-way OOO Assumption: all operations have the same latency.

30 Media Kernels, vs 4-way OOO

31 Media Kernels, IPC

32 Speed-up IPC Correlation

33 Low-Level Evaluation C CASH core Verilog back-end Synopsys, Cadence P/R Results shown so far. All results in thesis. Results in the next two slides. ASIC 180nm std. cell library, 2V ~1999 technology

34 Area Reference: P4 in 180nm has 217mm 2

35 Power vs 4-way OOO superscalar, 600 Mhz, with clock gating (Wattch), ~ 6W

36 Thesis Statement Application-Specific Hardware: can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors.

37 Outline Introduction CASH: Compiling for ASH Media processing on ASH –dataflow pipelining ASH vs. superscalar processors Conclusions skip to

38 Pipelining i + <= * + sum pipelined multiplier (8 stages) int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; cycle=1

39 Pipelining i + <= * + sum cycle=2

40 Pipelining i + <= * + sum cycle=3

41 Pipelining i + <= * + sum cycle=4

42 Pipelining i + <= i=1 i=0 + sum cycle=5 pipeline balancing

43 Outline Introduction CASH: Compiling for ASH Media processing on ASH ASH vs. superscalar processors Conclusions

44 This Is Obvious! ASH runs at full dataflow speed, so CPU cannot do any better (if compilers equally good).

45 SpecInt95, ASH vs 4-way OOO

46 Predicted not taken Effectively a noop for CPU! Predicted taken. Branch Prediction for (i=0; i < N; i++) {... if (exception) break; } i + < 1 & ! exception result available before inputs ASH crit path CPU crit path

47 SpecInt95, perfect prediction

48 ASH Problems Both branch and join not free Static dataflow (no re-issue of same instr)(no re-issue of same instr) Memory is far Fully static – No branch prediction – No dynamic unrolling – No register renaming Calls/returns not lenient...

49 Thesis Statement Application-Specific Hardware: can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors.

50 Outline Introduction +CASH: Compiling for ASH +Media processing on ASH +ASH vs. superscalar processors =Conclusions

51 low power simple verification? specialized to app. unlimited ILP simple hardware no fixed window economies of scale highly optimized branch prediction control speculation full-dataflow global signals/decision Strengths

52 Conclusions Compiling around the ISA is a fruitful research approach. Distributed computation structures require more synchronization overhead. Spatial Computation efficiently implements high-ILP computation with very low power.

53 Backup Slides Control logic Pipeline balancing Lenient execution Dynamic Critical Path Memory PRE Critical path analysis CPU + ASH

54 Control Logic C C Reg rdy in ack in rdy out ack out data in data out backback to talk

55 Last-Arrival Events + data valid ack Event enabling the generation of a result May be an ack Critical path=collection of last-arrival edges

56 Dynamic Critical Path 3. Some edges may repeat 2. Trace back along last-arrival edges 1. Start from last node backback to analysis

57 Critical Paths if (x > 0) y = -x; else y = b*x; * xb0 y ! ->

58 Lenient Operations if (x > 0) y = -x; else y = b*x; * xb0 y ! -> Solve the problem of unbalanced paths backback to talk

59 Pipelining i + <= * i=1 i=0 + sum cycle=6

60 Pipelining i + <= * + sum is loop sums loop Long latency pipe predicate cycle=7

61 Predicate ack edge is on the critical path. Pipelining i + <= * + sum critical path is loop sums loop

62 Pipelinine balancing i + <= * + sum is loop sums loop decoupling FIFO cycle=7

63 Pipelinine balancing i + <= * + sum is loop sums loop critical path decoupling FIFO backback to presentation

64 Register Promotion …=*p (p2) *p=… (p1) …=*p *p=… (p1) (p2 Æ : p1) Load is executed only if store is not

65 Register Promotion (2) …=*p (p2) *p=… (p1) …=*p (false) *p=… (p1) When p2 ) p1 the load becomes dead......i.e., when store dominates load in CFG back

66 ¼ PRE...=*p (p1)...=*p (p2)...=*p (p1 Ç p2) This corresponds in the CFG to lifting the load to a basic block dominating the original loads

67 Store-store (1) *p=... (p2) *p=… (p1) *p=... (p2) *p=… (p1 Æ : p2) When p1 ) p2 the first store becomes dead......i.e., when second store post-dominates first in CFG

68 Store-store (2) *p=... (p2) *p=… (p1) *p=... (p2) *p=… (p1 Æ : p2) Token edge eliminated, but......transitive closure of tokens preserved back

69 A Code Fragment for(i = 0; i < 64; i++) { for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; Y[i] = X[j].q; } SpecINT95:124.m88ksim:init_processor, stylized

70 Dynamic Critical Path for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; load predicate loop predicate sizeof(X[j]) definition

71 MIPS gcc Code LOOP: L1: beq $v0,$a1,EXIT ; X[j].r == i L2: addiu $v1,$v1,20 ; &X[j+1].r L3: lw $v0,0($v1) ; X[j+1].r L4: addiu $a0,$a0,1 ; j++ L5: bne $v0,$a3,LOOP ; X[j+1].r == 0xF EXIT: L1 ! L2 ! L3 ! L5 ! L1 4-instructions loop-carried dependence for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;

72 If Branch Prediction Correct L1 ! L2 ! L3 ! L5 ! L1 Superscalar is issue-limited! 2 cycles/iteration sustained for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; LOOP: L1: beq $v0,$a1,EXIT ; X[j].r == i L2: addiu $v1,$v1,20 ; &X[j+1].r L3: lw $v0,0($v1) ; X[j+1].r L4: addiu $a0,$a0,1 ; j++ L5: bne $v0,$a3,LOOP ; X[j+1].r == 0xF EXIT:

73 Critical Path with Prediction Loads are not speculative for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;

74 Prediction + Load Speculation ~4 cycles! Load not pipelined (self-anti-dependence) ack edge for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;

75 OOO Pipe Snapshot IFDAEXWBCT L5 L1 L2 L1 L2 L3 L4 L1 L3 L5 L3 L2 L1 L3 register renaming LOOP: L1: beq $v0,$a1,EXIT ; X[j].r == i L2: addiu $v1,$v1,20 ; &X[j+1].r L3: lw $v0,0($v1) ; X[j+1].r L4: addiu $a0,$a0,1 ; j++ L5: bne $v0,$a3,LOOP ; X[j+1].r == 0xF EXIT:

76 Unrolling? for(i = 0; i < 64; i++) { for (j = 0; X[j].r != 0xF; j+=2) { if (X[j].r == i) break; if (X[j+1].r == 0xF) break; if (X[j+1].r == i) break; } Y[i] = X[j].q; } when 1 iteration backback to talk

77 Ideal Architecture High-ILP computation Low ILP computation + OS + VM CPUASH Memory back