University of Michigan Advanced Computer Architecture Laboratory StageWeb: Interweaving Pipeline Stages into a Wearout and Variation Tolerant CMP Fabric.


1 University of Michigan Advanced Computer Architecture Laboratory
StageWeb: Interweaving Pipeline Stages into a Wearout and Variation Tolerant CMP Fabric
Shantanu Gupta, Amin Ansari, Shuguang Feng, Scott Mahlke
University of Michigan - Ann Arbor
June 29, 2010

2 Reliability Threats
► Transient faults: due to cosmic rays and alpha particles (increase exponentially with the number of devices on chip)
► Silicon defects: manufacturing defects and device wear-out (negative bias temperature instability (NBTI), oxide breakdown, electromigration)
► Process variation: random and systematic variations (intra-die ILD thickness; speed binning on a die)
[Figure: grid of cores (C); axis label: Frequency]

3 Fault Tolerance Aspects
► Detect and Diagnose: Has anything gone wrong? Figure out the cause.
► Reconfigure: Isolate the broken components.
► Recover: Resume execution from a safe point.

4 Reconfiguring a Multi-core
► At the coarsest level, cores can be disabled.
► Rumor has it that industry already does this: IBM Cell with 7 SPEs, AMD Tri-Core.
► Core disabling can't scale to higher failure rates!
[Figure: grids of cores (C) at Year 1, Year 3, Year 5, and Year 7]

5 Reconfiguration Granularity (for 100% area overhead, i.e. redundancy)
► CORE level (lower complexity): ElastIC, DT'06; Reunion, MICRO'06; Configurable Isolation, ISCA'07. (-) Poor MTTF gains. (+) Easy to implement. MTTF ↑ 100%.
► STAGE level: (+) Good MTTF gains. (+) Circuit / architectural boundary. (+) Full coverage. MTTF ↑ 170%.
► MODULE level (better resource utilization): Online Diagnosis of Hard Faults, MICRO'05; Ultra Low-Cost Defect Protection, ASPLOS'06. (+) Best MTTF gains. (-) Complex implementation. MTTF ↑ 200%.
[Figure: pipeline with stages labeled FETCH, DEC, EXEC, WB, MEM]
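The difference between core-level and stage-level reconfiguration can be illustrated with a small counting sketch. It is not taken from the slides: the stage names follow the 4-stage slices shown later in the deck, the function names are hypothetical, and it assumes any healthy stage can be connected to any other healthy stage (no interconnect limits or faults).

```python
# Hypothetical sketch: count usable pipelines under two reconfiguration granularities.
# Assumption: full connectivity between stages of neighboring types.
STAGES = ("fetch", "decode", "issue", "exmem")


def pipelines_core_level(cores):
    """Core-level disabling: a core is usable only if all of its own stages work."""
    return sum(all(core[s] for s in STAGES) for core in cores)


def pipelines_stage_level(cores):
    """Stage-level sharing: the scarcest healthy stage type limits the pipeline count."""
    return min(sum(core[s] for core in cores) for s in STAGES)


def make_core(broken=()):
    """Build a core description; True marks a healthy stage."""
    return {s: s not in broken for s in STAGES}


# Four cores, each with one broken stage; three of the four faults hit different stage types.
cores = [
    make_core(broken=("fetch",)),
    make_core(broken=("decode",)),
    make_core(broken=("issue",)),
    make_core(broken=("fetch",)),
]

core_level = pipelines_core_level(cores)    # every core has a fault, so none survive
stage_level = pipelines_stage_level(cores)  # limited by the two healthy fetch stages
```

In this toy case core-level disabling leaves no working core, while stage-level sharing can still assemble two pipelines from the surviving stages.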

6 CMP Fabric
[Figure: four cores (Core 0 to Core 3), each a pipeline of Stage1, Stage2, Stage3, ..., StageN, with latches between stages]

7 The StageNet (SN) Fabric
[Figure: four StageNet slices (SNS), each with Stage1, Stage2, Stage3, ..., StageN; crossbar switches (inputs, outputs) between stages; a configuration manager; wearout sensors (delay, temperature, current)]

8 A 4-Slice SN chip
[Figure: four slices, each with Fetch, Decode, Issue, and Ex/Mem stages, plus a configuration manager]

9 Performance Comparison: Pipeline vs. SN Slice
► SN slice: > 5X slowdown, due to:
1. Control stall
2. Data forwarding
3. Transmission delays
[Figure: commit timelines for instructions 1 to 10 (with a branch, BR, and a register dependency) on a 5-stage pipeline and on an SN slice. Block diagrams: 5-stage pipeline with Gen PC, branch predictor, Fetch, Decode, Issue, register file, Ex/Mem, WB, latches, branch resolution, bypass, and register wb paths; SN slice with Gen PC, branch predictor, Fetch, Decode, Issue, register file, and Ex/Mem separated by double buffers]

10 SN Slice Microarchitecture [MICRO'08]
1. Control handling: Stream ID (SID)
► Control flow handling
► Eliminates flush signals
2. Data forwarding: Bypass $
► Stores previous results
► Fully associative structure
► Emulates data forwarding
3. Transmission delays: Macro-ops
► Send instruction bundles
► Amortizes transfer delay
► Increases system utilization
[Figure: SN slice with Gen PC, branch predictor, Fetch, Decode, Issue, register file, and Ex/Mem separated by double buffers, extended with SID, a macro-op generator, and the bypass $; example macro-ops built from ST, LD, +, /, >>, <<, & operations]
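The bypass $ bullets (stores previous results, fully associative, emulates data forwarding) can be sketched as a small software model. This is only an illustration under stated assumptions: the class name, the capacity, the keying by destination register, and the oldest-first replacement policy are all hypothetical and are not described on the slide.

```python
from collections import OrderedDict


class BypassCache:
    """Hypothetical model of a small fully associative cache of recent results."""

    def __init__(self, capacity=4):
        self.capacity = capacity      # assumed size; the slide does not give one
        self.entries = OrderedDict()  # destination register -> most recent result

    def record(self, reg, value):
        """Remember the result an instruction just produced for `reg`."""
        self.entries.pop(reg, None)   # a newer write replaces any older entry
        self.entries[reg] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry (assumed policy)

    def lookup(self, reg, register_file):
        """Return the freshest value: a cached result if present, else the register file."""
        if reg in self.entries:
            return self.entries[reg]
        return register_file[reg]


# The register file still holds a stale r1 because the write-back has not arrived yet.
register_file = {"r1": 0, "r2": 7}
cache = BypassCache(capacity=2)
cache.record("r1", 42)
fresh_r1 = cache.lookup("r1", register_file)      # served from the bypass cache
untouched_r2 = cache.lookup("r2", register_file)  # falls back to the register file
```

A dependent instruction reads through `lookup`, so it sees the new value of `r1` without waiting for the register write-back to cross the interconnect.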

11 SN Slice Performance [MICRO'08]
[Figure: bar chart of normalized runtime (axis 0 to 6) for 3des, g721decode, g721encode, idct, rawcaudio, rawdaudio, rijndael, mcf, eqn, grep, wc, and the mean. Series: SNS + StreamID; SNS + StreamID + Bypass$; SNS + StreamID + Bypass$ + MOPs. Annotation: 10% slowdown]

12 SN System - scaling to 100+ cores?
1. Crossbars don't scale well due to wiring / layout complexity
► Area
► Delay
► Power
2. Interconnection prone to failures
► Single point of failure
► Links have no redundancy
[Figure: many slices (F, D, I, E/M stages) connected through shared crossbars]

13 StageWeb: Scaling to 100+ cores
► In a large many-core system, small groups of cores can form SN islands
► What's the right size for an SN island?
[Figure: traditional many-core; SN island; StageWeb many-core]

14 StageWeb: Scaling to 100+ cores (continued)
► Unfortunately, a single crossbar can't scale to 8-10 pipelines!
[Figure: chart with regions labeled "Good scaling" and "Poor scaling"]

15 Interconnection Alternatives
1. Connectivity
a) Single
b) Single + Front-Back
c) Overlap
d) Overlap + Front-Back
[Figure: slices of Fetch, Decode, Issue, and Ex/Mem stages, split into front-end and back-end and grouped into Island 1 through Island 4]

16 Interconnection Alternatives
1. Connectivity
a) Single
b) Single + Front-Back
c) Overlap
d) Overlap + Front-Back
2. Reliability
a) Crossbar
b) Crossbar with spares
c) Fault-tolerant crossbar
[Figure: slices of Fetch, Decode, Issue, and Ex/Mem stages grouped into Island 1 through Island 4; input/output diagrams of the three crossbar types]

17 Interconnection Configuration
► Faults in stages, crossbar ports, or links force a reconfiguration.
[Figure: slices of Fetch, Decode, Issue, and Ex/Mem stages in Island 1 and Island 2]

18 Interconnection Configuration
Single crossbar configuration
► Local to every island
[Figure: pipelines formed within Island 1 and within Island 2 from Fetch, Decode, Issue, and Ex/Mem stages]

19 Interconnection Configuration
Overlap crossbar configuration
► Sweep islands, forming pipelines opportunistically
[Figure: pipelines formed across overlapping Island 1, Island 2, and Island 3 from Fetch, Decode, Issue, and Ex/Mem stages]
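The two configuration styles (local to every island versus sweeping overlapping islands) can be sketched with one greedy routine. This is a simplified model, not the configuration algorithm from the talk: the function name, the window parameters `span` and `step`, and the leftmost-first choice are assumptions, and crossbar or link faults are ignored.

```python
# Hypothetical sketch: greedy pipeline formation over islands of adjacent slices.
STAGES = ("fetch", "decode", "issue", "exmem")


def form_pipelines(slices, span, step):
    """Sweep windows of `span` adjacent slices, advancing by `step` slices.

    step == span models disjoint islands (single crossbar configuration);
    step < span models overlapping islands.
    Returns a list of pipelines, each mapping stage name -> slice index.
    """
    used = set()  # (slice index, stage) pairs already assigned to a pipeline
    pipelines = []
    for start in range(0, len(slices) - span + 1, step):
        window = range(start, start + span)
        while True:
            pick = {}
            for stage in STAGES:
                free = [i for i in window
                        if slices[i][stage] and (i, stage) not in used]
                if not free:
                    break  # this window cannot supply a complete pipeline
                pick[stage] = free[0]
            else:
                # Every stage type was found: commit the pipeline and try again.
                used.update((i, s) for s, i in pick.items())
                pipelines.append(pick)
                continue
            break
    return pipelines


def make_slice(broken=()):
    """Build a slice description; True marks a healthy stage."""
    return {s: s not in broken for s in STAGES}


slices = [make_slice(), make_slice(("fetch",)), make_slice(("decode",)), make_slice()]
single = form_pipelines(slices, span=2, step=2)   # islands {0,1} and {2,3}
overlap = form_pipelines(slices, span=2, step=1)  # islands {0,1}, {1,2}, {2,3}
```

With disjoint islands each island is limited by its one missing stage, while the overlapping sweep can pair the leftover stages of slice 1 with the healthy fetch stage of slice 2.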

20 StageWeb Benefits
1. Scalability
► Scaling SN to benefit 100+ core systems
2. Interconnection Reliability
► Handling faults in crossbars and links
3. Process Variation
► Slower components can be isolated in a multi-core chip

21 Mitigating Process Variation
► Severe process variation and lifetime wearout can result in a disparity of health for various resources
► StageNet can effectively isolate strong/weak resources
[Figure: slices of Fetch, Decode, Issue, and Ex/Mem stages labeled Fast, Medium, or Slow, regrouped by speed; axis label: Frequency]
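Isolating strong and weak resources can be sketched numerically. The example below is hypothetical: it assumes a pipeline runs at the frequency of its slowest stage, that any stage can be paired with any other (no interconnect limits), and the frequency values are made up for illustration.

```python
# Hypothetical sketch: regrouping stages by speed under process variation.

def core_frequencies(slices):
    """Fixed cores: each core is limited by its own slowest stage."""
    return [min(s.values()) for s in slices]


def regrouped_frequencies(slices):
    """Pair the fastest stage of each type together, then the next fastest, and so on."""
    stage_names = slices[0].keys()
    ranked = [sorted((s[name] for s in slices), reverse=True) for name in stage_names]
    return [min(column) for column in zip(*ranked)]


def count_meeting(frequencies, target):
    """Number of pipelines that can run at or above the target frequency."""
    return sum(f >= target for f in frequencies)


# Made-up per-stage frequencies in GHz; each slice has exactly one slow stage.
slices = [
    {"fetch": 1.0, "decode": 0.7, "issue": 1.0, "exmem": 0.9},
    {"fetch": 0.7, "decode": 1.0, "issue": 0.9, "exmem": 1.0},
    {"fetch": 0.9, "decode": 0.9, "issue": 0.7, "exmem": 1.0},
    {"fetch": 1.0, "decode": 1.0, "issue": 1.0, "exmem": 0.7},
]

fixed = count_meeting(core_frequencies(slices), target=0.9)
regrouped = count_meeting(regrouped_frequencies(slices), target=0.9)
```

Here no fixed core reaches the 0.9 GHz target, because each has one slow stage, while regrouping collects the four slow stages into a single slow pipeline and leaves three that meet the target.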

22 Evaluation
► OpenRISC 1200 cores (4-stage in-order)
► 12 configurations compared, 64 cores each (4 interconnections × 3 crossbar types)
Interconnections: Single; Single + Front/Back; Overlapping; Overlapping + Front/Back
Crossbar types: without spares; with spares; fault-tolerant
► Experiments
Lifetime evaluations: throughput and total work
Process variation: speed binning on a die

23 Lifetime Reliability Evaluations
Monte Carlo simulation with 300+ lifetime experiments, where each lifetime experiment involves:
► Assigning a time-to-failure to all stages
► Killing components at their failure times
► Reconfiguring the system to isolate broken components
► Repeating this until no logical pipeline can be formed
Cumulative work and throughput are recorded
► Number of cores: 64
► Technology node: 90 nm
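The lifetime experiment loop above can be sketched as follows. This is a minimal model, not the authors' simulator: the Weibull time-to-failure distribution and its `scale` and `shape` values are assumptions (the slide does not name a distribution), interconnect faults are ignored, each live pipeline is assumed to deliver unit throughput, and all function names are hypothetical.

```python
import random

NUM_STAGES = 4  # assumption: Fetch, Decode, Issue, Ex/Mem per slice


def count_pipelines(ttf, now, share_stages):
    """Logical pipelines that can be formed from the stages still alive at time `now`."""
    if share_stages:
        # Stage-level sharing: the scarcest stage type limits the count.
        return min(sum(row[s] > now for row in ttf) for s in range(NUM_STAGES))
    # Core-level disabling: a slice counts only if all of its stages are alive.
    return sum(all(t > now for t in row) for row in ttf)


def lifetime_work(num_slices, share_stages, rng, scale=10.0, shape=2.0):
    """One lifetime experiment; returns cumulative work in pipeline-time units."""
    # Assign a time-to-failure to every stage (assumed Weibull distribution).
    ttf = [[rng.weibullvariate(scale, shape) for _ in range(NUM_STAGES)]
           for _ in range(num_slices)]
    work, now = 0.0, 0.0
    # Kill components in failure order, reconfiguring after each failure.
    for failure_time in sorted(t for row in ttf for t in row):
        alive = count_pipelines(ttf, now, share_stages)
        if alive == 0:
            break  # no logical pipeline can be formed any more
        work += alive * (failure_time - now)
        now = failure_time
    return work


def mean_work(share_stages, runs=300, num_slices=64, seed=0):
    """Average cumulative work over many lifetime experiments."""
    rng = random.Random(seed)
    return sum(lifetime_work(num_slices, share_stages, rng) for _ in range(runs)) / runs
```

Calling `mean_work(True)` and `mean_work(False)` with the same seed compares stage-level sharing against core-level disabling on identical failure times; the numbers from this toy model are not expected to match the results reported in the talk.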

24 Cumulative Work
[Figure: cumulative work plot. Annotation: ~70% more work!]

25 Cumulative Work (area neutral)
Best StageWeb configuration:
► 52 cores
► Overlapping interconnection network
► 6 adjacent slices connected by each crossbar
► Fault-tolerant crossbars
[Figure: cumulative work plot for the area-neutral comparison. Annotation: 52 cores]

26 Throughput over time

27 Mitigating Process Variation
For a given frequency target, StageWeb can operate:
1. More cores, OR
2. Same # of cores at lower voltage
[Figure: frequency (Freq) plot; residual labels "27" and "45"]

28 Conclusions
Architectural innovations will be crucial in tackling technological uncertainties
StageWeb is a potential solution
► Allows fine-grained isolation of failures
► Most reliability gains from grouping 8-10 pipelines
► Scalable to 100+ cores
StageWeb can also mitigate process variation by grouping together faster and slower parts

29 Thank You
http://cccp.eecs.umich.edu

30 Backup slides

31 Impact of Defects on CMP Yield

32 Overlapping Network

33 Simple + 2nd-Level Crossbars

34 Overlapping + 2nd-Level Crossbar

35 Interconnection Alternatives
1. Connectivity
a) Simple
b) Simple + Front-Back
c) Overlap
d) Overlap + Front-Back
2. Reliability
a) Crossbar
b) Crossbar with spares
c) Fault-tolerant crossbar
[Figure: slices of Fetch, Decode, Issue, and Ex/Mem stages, split into front-end and back-end, in Island 1 and Island 2; input/output diagrams of the three crossbar types]

36 SN System Level Issues
1. Crossbars don't scale well due to wiring / layout complexity
► Area
► Delay
► Power
2. Interconnection prone to failures
► Single point of failure
► Links have no redundancy
[Figure: many slices (F, D, I, E/M stages) connected through shared crossbars]

