Slide 1: A Distributed Control Path Architecture for VLIW Processors
Hongtao Zhong, Kevin Fan, Scott Mahlke, and Michael Schlansker*
Advanced Computer Architecture Laboratory, Electrical Engineering and Computer Science, University of Michigan
*HP Laboratories
Slide 2: Motivation
VLIW scaling problems:
► Centralized resources
► Highly ported structures
► Wire delays
[Figure: a centralized instruction fetch/decode unit and register file feeding many FUs]
Slide 3: Multicluster VLIW
► Distribute register files
► Cluster the function units
► Distribute data caches
► Clusters communicate through an interconnection network
► Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC
[Figure: two clusters, each with FUs and a register file, joined by an interconnection network; instruction fetch/decode remains centralized]
Slide 4: Control Path Scaling Problem
Larger I-cache latency
► Long wires for control-signal distribution
Code compression
► Hardware cost and power
► Grow quadratically with the number of FUs
[Figure: a centralized PC, I-cache, and align/shift network expanding compressed instructions (inserting NOPs) into a wide IR]
Slide 5: Straightforward Approach
► Distribute I-fetch, in spirit similar to the distribution of the data path
► Local communication of control signals
► Reduces latency, hardware cost, and power
► Used in the Multiflow Trace 14/300 processors
[Figure: each cluster gets its own PC, I-cache, and IR alongside its FUs and register file]
Slide 6: DVLIW Approach
Simple distribution has problems:
► Does not support code compression
► The PC is still a centralized resource
[Figure: DVLIW gives each cluster its own PC and align/shift logic in addition to its I-cache and IR]
Slide 7: DVLIW Execution Model
► Clusters execute in lock-step: when one cluster stalls, all clusters stall
► Clusters collectively execute one thread
► Each cluster runs its own instruction stream
► The compiler orchestrates the execution of the streams and manages communication
► Lightweight synchronization
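The lock-step model above can be sketched as a small simulation; the function and the per-MultiOp stall model are illustrative assumptions, not from the talk. The key property is that a stall in any one cluster (e.g. an I-cache miss) delays every cluster, so all streams stay aligned on the same logical MultiOp.

```python
# Minimal lock-step sketch: all clusters issue MultiOp i in the same cycle,
# and a stall anywhere stalls everyone. Unit latency per MultiOp assumed.

def run_lockstep(streams, stall_cycles):
    """streams: one op list per cluster, all the same length.
    stall_cycles: MultiOp index -> extra cycles some cluster stalls there."""
    trace, cycle = [], 0
    for i in range(len(streams[0])):
        cycle += stall_cycles.get(i, 0)                 # all clusters wait together
        trace.append((cycle, [s[i] for s in streams]))  # issue MultiOp i everywhere
        cycle += 1
    return trace
```

For example, `run_lockstep([["a0", "a1"], ["b0", "b1"]], {1: 2})` issues the second MultiOp on both clusters at cycle 3: neither cluster runs ahead while the other is stalled.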
Slide 8: DVLIW Benefits
Completely decentralized architecture:
► Distributed data path
► Distributed control path
Supports arbitrary code compression
Exploits ILP on a multi-core-style system:
► Good for embedded applications
► Low cost
► Compiler support
Slide 9: DVLIW Architecture
[Figure: four VLIW clusters around a banked L2; each cluster contains its own PC and next-PC/branch-target logic, L1 I-cache, align/shift network, IR, register files, FUs, and L1 D-cache, with links to the other clusters and the banked L2]
Slide 10: Code Organization
► Code for each cluster is consecutive in memory
► Operations in the same MultiOp are stored in different memory locations
► Each cluster computes its own next PC
[Figure: a conventional VLIW stores a MultiOp's operations (A1–A5, B1–B4) contiguously under one PC; DVLIW splits them across per-cluster streams, each with its own PC]
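The two layouts can be contrasted with a toy address calculation, assuming unit-size operations and equal-width clusters (the real design allows variable widths and compressed encodings, so these formulas are an illustration only). In a conventional VLIW one shared PC walks over all operations of a MultiOp; in DVLIW each cluster's stream is laid out consecutively and each cluster steps its own PC.

```python
# Illustrative address layout under the stated simplifying assumptions.

def conventional_addr(multiop, cluster, n_clusters):
    # Operations of one MultiOp sit side by side under a single shared PC.
    return multiop * n_clusters + cluster

def dvliw_addr(multiop, cluster, stream_len):
    # Each cluster's code is consecutive; the cluster's own next PC is just
    # its stream base plus the MultiOp index.
    return cluster * stream_len + multiop
```

With 2 clusters, MultiOp 1 occupies adjacent words 2 and 3 in the conventional layout, but words 1 and 4 in the per-cluster DVLIW layout, which is why a single centralized PC cannot address DVLIW code.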
Slide 11: Branch Mechanism
Maintain correct execution order:
► All clusters transfer control in the same cycle
► All clusters branch to the same logical MultiOp
Unbundled branch, as in HPL-PD:
  PBR btr1, TARGET
  CMPP pr0, (x>100)?
  BR btr1, pr0
► Each cluster specifies its own target; the compare result is broadcast to all clusters; the branch is replicated in each cluster
Slide 12: Branch Handling Example
Conventional VLIW:
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  br btr1, pr0
DVLIW, cluster 0:
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  bcast pr0
  br btr1, pr0
DVLIW, cluster 1:
  pbr btr1, BB2'
  br btr1, pr0
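The DVLIW example above can be modeled behaviorally (helper names here are mine, not an actual ISA): cluster 0 evaluates the compare, `bcast` makes the predicate visible to all clusters, and each cluster's replicated branch uses its own cluster-local target (BB2 on cluster 0, BB2' on cluster 1), so every cluster redirects in the same cycle to the same logical MultiOp.

```python
# Behavioral sketch of one DVLIW branch cycle under the stated assumptions.

def branch_cycle(x, targets, fallthroughs):
    """targets/fallthroughs: one per-cluster address each (BB2 vs BB2', etc.)."""
    pr0 = x > 100                          # cmpp pr0, (x>100)? on cluster 0
    # bcast pr0: every cluster sees the same predicate this cycle
    return [tgt if pr0 else ft             # br btr1, pr0, replicated per cluster
            for tgt, ft in zip(targets, fallthroughs)]
```

Either all clusters take their (different) targets or all fall through together, which is exactly the "same cycle, same logical MultiOp" invariant.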
Slide 13: Sleep Mode
► Some blocks are idle after distribution
► Put the idle cluster into sleep mode: compiler managed, saves energy, reduces code size
► Mode changes happen at block boundaries
[Figure: cluster 1 executes SLEEP and later WAKE while cluster 0 continues branching]
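A simplified model of sleep mode, with SLEEP/WAKE written as markers in the stream, is sketched below. This is an assumption for illustration: the real encoding and wake-up mechanism differ (a sleeping cluster stops fetching, so it cannot fetch its own wake). The point shown is that a sleeping cluster issues nothing until its wake point at a block boundary.

```python
# Toy sleep-mode model: SLEEP stops a cluster issuing, WAKE (at a block
# boundary) resumes it. Markers and mechanism are illustrative only.

def execute(streams):
    awake = [True] * len(streams)
    issued = [[] for _ in streams]
    for i in range(len(streams[0])):
        for c, stream in enumerate(streams):
            op = stream[i]
            if op == "WAKE":
                awake[c] = True            # mode change at a block boundary
            if not awake[c]:
                continue                   # sleeping cluster issues nothing
            if op == "SLEEP":
                awake[c] = False
            elif op != "WAKE":
                issued[c].append(op)
    return issued
```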
Slide 14: Experimental Setup
► Trimaran toolset
► Processor configuration: 4 clusters; 2 INT, 1 FP, 1 MEM, 1 BR per cluster; 16 KB L1 I-cache total; perfect data cache assumed
► Power model: Verilog for the instruction align/shift logic, a wire model, and the CACTI cache model
► 21 benchmarks from MediaBench and SPECint2000
Slide 15: Change in Global Communication Bits
[Figure: per-benchmark results for MediaBench and SPECint]
Slide 16: Normalized Energy Consumption on the Control Path
► Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy)
[Figure: per-benchmark bars, with annotated savings of 40%, 67%, 80%, and 21%]
Slide 17: Normalized Code Size
► Baseline: conventional VLIW with compressed encoding
► Traditional method (single PC): 7x increase
► DVLIW: 40% increase
Slide 18: Result Summary
DVLIW benefits:
► Order-of-magnitude reduction in global communication
► 40% savings in control path energy
► 5x code size reduction vs. simple distribution
Small overhead for ILP execution on a CMP:
► 3% increase in execution cycles
► 4% increase in I-cache stalls
Slide 19: Conclusions
DVLIW removes the last centralized resource in a multicluster VLIW:
► Fully distributed control path
► Scalable architecture
► More energy efficient
A stylized CMP architecture:
► Exploits ILP
► Multiple instruction streams
► Compiler orchestrated
Slide 20: Thank You
For more information: http://cccp.eecs.umich.edu