Yuchun Ma Joint Work with Jason Cong, Yongxiang Liu, Glenn Reinman, and Yan Zhang International Center for Design on Nanotechnologies Workshop.


2 Outline
- Micro-architecture Design
- 3-D IC Technology
- 3D Architecture Exploration with 2D blocks
- 3D Architecture Design with cubic folded blocks
  - 3D cubic packing algorithm
  - 3D architecture exploration with folded blocks
- Pipelining Optimization with Throughput-Aware Floorplanning
- Summary and Future Work

4 Superscalar Processors
- Superscalar processing is the ability of a microprocessor to issue multiple instructions into multiple pipelines, so that instructions that do not depend on each other can be computed in parallel.

5 Alpha 21264

6 Performance of a microprocessor
- Performance is measured as the time taken to complete a given task, which depends on:
  - Operating systems
  - Compiler optimizations
  - Workload used for studying the performance
  - Microprocessor organization
- Typically, processor performance is measured in MIPS or BIPS (millions or billions of instructions per second).

8 Motivations of 3-D ICs
- Alternative ways for device integration as we approach the limit of CMOS scaling
- Interconnect length/delay reduction
  - System performance improvement [Black04]
  - Power reduction [Black04]
- Integration of heterogeneous technologies
- No existing flow to evaluate 3D implementations of architectures systematically
  - Performance
  - Thermal [Black04]

9 Technology background
- Wafer-bonding 3D IC technologies
  - With flipping the top layer
  - Without flipping the top layer
[Figure: a 3D IC example with two device layers: (a) with flipping the top layer; (b) without flipping the top layer]

10 Thermal Resistive Network [Wilkerson04]
- Circuit stack partitioned into tiles
- Tiles connected through thermal resistances
  - Lateral resistances: fixed
  - Vertical resistances: proportional to 1/#via
- Heat sources modeled as current sources
  - Current value = power
- Heat sinks modeled as ground nodes
- Thermal vias
  - After floorplanning, we can further reduce the temperature by thermal via insertion.
[Figure: (a) tile stack array; (b) single tile stack with power sources P1 to P5, resistances R1 to R5, and lateral resistance R_lateral]
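
The resistive analogy on this slide can be worked through by hand for a single tile stack. The sketch below is a minimal illustration, assuming a one-dimensional stack with no lateral heat flow, so that each resistance carries all the power injected at or above it. The function names, the ambient temperature, and the resistance values are hypothetical and not taken from the slides; the 1/(number of vias) scaling follows the slide's description of vertical resistances.

```python
def stack_temperatures(powers, resistances, ambient=0.0):
    """Temperatures of the layers in one tile stack, modeled as a series chain.

    powers[i]      : heat injected at layer i (W); layer 0 is nearest the heat sink.
    resistances[i] : thermal resistance below layer i (K/W); resistances[0]
                     connects layer 0 to the heat sink (the "ground" node).
    """
    if len(powers) != len(resistances):
        raise ValueError("need one resistance per layer")
    temps = []
    temperature = ambient
    for i, r in enumerate(resistances):
        # Like current in a series circuit: all heat from layer i and above
        # must flow through resistance i on its way to the sink.
        heat_flow = sum(powers[i:])
        temperature += r * heat_flow
        temps.append(temperature)
    return temps


def via_resistance(r_single_via, n_vias):
    """Assumed model: n parallel vias divide the vertical resistance by n."""
    if n_vias < 1:
        raise ValueError("need at least one via")
    return r_single_via / n_vias


# Two layers, 2 W and 1 W, with hypothetical resistances of 1 and 4 K/W.
no_extra_vias = stack_temperatures([2.0, 1.0], [1.0, 4.0], ambient=40.0)
# Same stack after placing 4 vias between the layers.
with_vias = stack_temperatures([2.0, 1.0], [1.0, via_resistance(4.0, 4)], ambient=40.0)
print(no_extra_vias, with_vias)
```

In this toy case only the upper layer cools when vias are added, because the resistance between the lower layer and the heat sink is unchanged.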

12 MEVA-3D
- An Automated Design Flow for 3D Architecture Evaluation (MEVA-3D)
  - Evaluate 3D implementations of micro-architectures systematically and study them from both performance and thermal perspectives.
- MEVA-3D Flow
  - Automated 2D/3D floorplanning: reduce the latency along critical loops in the micro-architecture by considering interconnect pipelining at a given target frequency.
  - Thermal evaluation: resistive network model considering white space and thermal via insertion.
  - 3D router

13 3D Architecture Evaluation with Physical Planning
- Optimize
  - BIPS (not IPC or frequency)
    - Consider interconnect pipelining based on early floorplanning for critical paths
    - Use IPC sensitivity model [Jagannathan05]
  - Area/wirelength
  - Temperature
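
Why BIPS rather than IPC or frequency alone: BIPS is the product of IPC and clock frequency in GHz, and a higher frequency can force extra pipeline cycles onto long wires, which lowers IPC. The sketch below illustrates that trade-off. It assumes a simple linear sensitivity model in which each extra cycle on a critical loop removes a fixed fraction of IPC; this is a stand-in for illustration, not the model of [Jagannathan05], and the loop names and sensitivity values are hypothetical.

```python
def bips(base_ipc, freq_ghz, extra_cycles, sensitivity):
    """Billions of instructions per second = IPC x frequency (GHz).

    extra_cycles : dict mapping a critical loop to the wire-pipelining
                   cycles added to it by the floorplan.
    sensitivity  : dict mapping the same loops to the assumed fractional
                   IPC loss per extra cycle (hypothetical values).
    """
    loss = sum(sensitivity[loop] * n for loop, n in extra_cycles.items())
    ipc = base_ipc * max(0.0, 1.0 - loss)
    return ipc * freq_ghz


sens = {"wakeup": 0.10, "branch_mispredict": 0.02}

# Lower frequency: wires fit in one cycle, so no IPC loss.
slow = bips(1.0, 3.0, {"wakeup": 0, "branch_mispredict": 0}, sens)
# Higher frequency: one extra cycle appears on the wakeup loop.
fast = bips(1.0, 4.0, {"wakeup": 1, "branch_mispredict": 0}, sens)
print(slow, fast)
```

With these made-up numbers the frequency rises by a third but BIPS rises by only a fifth, which is the effect a floorplanner that shortens critical loops tries to recover.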

14 Design Example
- An out-of-order superscalar processor micro-architecture with 4 banks of L2 cache in 70 nm technology
- Critical paths

15 Baseline Processor Parameters

16 2D vs 3D Layout
- 2D EV6-like core: BIPS = 2.75
- 3D EV6-like core (2 layers): BIPS = 2.94
- Wakeup loop: the extra cycle is eliminated.
- Branch misprediction resolution loop and L2 cache access latency: some of the extra cycles are eliminated.
- Assumes two device layers.

17 Simulation Results
- The 3D architecture outperforms the 2D design by about 11.7% when the frequency is 4 GHz.

18 Performance for the micro-architecture with 2D and 3D layout at different target frequencies
- 3D integration can improve performance by 11% by eliminating most of the wire latencies present in the 2D layout.

19 Maximum On-Chip Temperature
- HS denotes a heat sink; 3D integration allows thermal vias to be inserted to reduce the temperature.
- 3D integration shows a temperature increase of over 4.78 degrees on average. After thermal via insertion, the maximum on-chip temperature is reduced by about 62% on average.

21 3D Design w/ Component Folding and Stacking
- Explore 3D design of architectural structures that are
  - Timing/throughput critical
  - Expensive in terms of power consumption and/or thermal output
- Possible candidates for 3D component folding
  - Instruction scheduling window: the issue queue can be partitioned into multiple levels via matchlines or taglines.
  - On-chip caches: regular structure lends itself to a wide range of partitionings.
  - Register file: thermally critical resource that also has a regular structure.

22 3D Architectural Block Design and Modeling
- First explore how to design blocks in 3D
  - Wordline folding: fold the block horizontally
  - Port partitioning: extend ports to different layers
- Tools
  - CACTI: caches and cache-like structures, register files
  - HSpice: issue queue
- Then explore the design space for a microprocessor with these blocks

23 3D Issue Queue
[Figure: (a) 2D issue queue with 4 taglines; (b) block folding; (c) port partitioning]
- Block folding
  - Fold the entries and place them on different layers
  - Effectively shortens the tag lines
- Port partitioning
  - Place tag lines and ports on multiple layers, thus reducing both the height and width of the ISQ.
  - The reduction in tag and matchline wires can help reduce both power and delay.

24 Benefits from IQ folding
- Maximum delay reduction of 50%, maximum area reduction of 90%, and maximum power reduction of 40%
- Legend: nL = n layers; FB = folding banks; TP = tag/port partitioning

25 Improvements for blocks
- Port folding performs better than wordline folding for area (72% vs. 51%).
- Wordline folding is more effective in reducing the block delay (13% vs. 5%).
- Port folding also performs better in reducing power (13% vs. 5%).

26 3D packing with folded blocks
- Exploring vertical integration in microprocessor design requires considering both physical design and architecture.
  - True 3D packing
  - Architectural alternative selection
    - The number of layers in folded blocks
    - The partitioning method: block folding or port partitioning

27 3D Corner Block List Representation
- (S, L, T) composes a 3D CBL.
  - S: a record of block names
  - L: corner cubic block orientation (X-, Y-, or Z-oriented)
  - T: the sequence {T_n, T_{n-1}, ..., T_2} recording the number of attached tri-branches covered by each corner cubic block
- Example: S = { }, L = (Y, Z, Y, X), T = (10, 110, 10, 1110)
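
The (S, L, T) triple can be held in a small record type. The sketch below is one possible encoding, not the authors' implementation. It assumes that for n blocks both L and T have n-1 entries (as in the slide's example, where each has four), and that each T entry is a run of 1s closed by a 0 whose count of 1s gives the number of covered tri-branches, which is how the example values read. The block names b1 to b5 are hypothetical, since the slide's S is not recoverable.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class CornerBlockList3D:
    S: List[str]  # block names, in packing order
    L: List[str]  # orientation of each corner block: "X", "Y" or "Z"
    T: List[str]  # unary strings such as "110" (assumed encoding)

    def validate(self) -> None:
        n = len(self.S)
        if len(self.L) != n - 1 or len(self.T) != n - 1:
            raise ValueError("L and T must each have len(S) - 1 entries")
        if any(o not in ("X", "Y", "Z") for o in self.L):
            raise ValueError("orientations must be X, Y or Z")
        if any(re.fullmatch(r"1*0", t) is None for t in self.T):
            raise ValueError("each T entry must be 1s followed by a single 0")

    def covered_counts(self) -> List[int]:
        """Number of tri-branches covered per entry = number of leading 1s."""
        return [len(t) - 1 for t in self.T]


cbl = CornerBlockList3D(
    S=["b1", "b2", "b3", "b4", "b5"],   # hypothetical names
    L=["Y", "Z", "Y", "X"],             # from the slide
    T=["10", "110", "10", "1110"],      # from the slide
)
cbl.validate()
print(cbl.covered_counts())
```

Keeping the three lists separate mirrors the slide's notation and makes it cheap to perturb one list at a time, which is the usual way such representations are explored by a stochastic packer.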

28 Packings with folded blocks

30 Performance
- On average, multi-layer (3D) block configurations have 11% lower temperature and a 14% improvement in BIPS.

31 Temperatures
- Temperatures can be kept below 100 degrees with thermal vias inserted.

32 Temperature profile (1 layer; 2 layers with no vias inserted)

33 Temperature profile (2 layers with thermal vias)

35 Micro-architecture Pipelining Optimization
- Previous works assume that the blocks are designed separately for a given clock frequency, and that wire pipelining is then carried out on the global wires of the circuit.
  - Sub-optimal because of possibly unused slack in the block pipeline designs
- We propose a novel optimization methodology that combines architecture pipelining with physical design, so that block pipelining and interconnect pipelining can be considered simultaneously.
[Figure: blocks A and B pipelined with pre-designed blocks vs. with path-based pipelining]

36 Simultaneous Block and Interconnect Pipelining
- We define path-based pipelining as the Simultaneous Block and Interconnect Pipelining (SBIP) problem.
  - Represent the micro-architecture design by a path graph G(V, E).
  - The delay between any two flip-flops (FFs) along the same path must be less than the clock period.
  - The performance of the architecture can be evaluated by the weighted sum of the number of FFs on each edge e_i (n_{e_i}) along the paths.
  - The objective is therefore to find a feasible solution with the optimal performance.
[Figure: example path graph with nodes A, B, C, D, E and A', B', C', D', E']

37 MILP Formulation
- We define a(P, v) as the arrival time at node v along path P, which is the longest delay from a flip-flop to node v along path P.
- Given the clock period $T_{clk}$ and the set of paths P, we can formulate the problem as the following MILP:

  Minimize the weighted sum of $n_{e_i}$ along the paths, subject to

  $0 \le a(P_i, v) \le T_{clk}, \quad \forall v \in V \text{ and } P_i \text{ passing through } v$  (1)
  $n_{e_i} \ge 0, \quad \forall e_i \in E$  (2)
  $a(P_i, v) \ge a(P_i, u) + d_{e_i} - T_{clk} \cdot n_{e_i}, \quad \forall e_i \in E$, where $e_i$ is a connection from node u to node v along path $P_i$  (3)
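
Constraints (1) to (3) can be checked directly for a candidate assignment on a single path. The sketch below is an illustrative checker, not a MILP solver. It assumes the path starts at a flip-flop (arrival time 0) and takes the smallest arrival time that satisfies (1) and (3) at each node; the names `delays`, `ffs`, and `t_clk` are hypothetical stand-ins for the edge delays d, the flip-flop counts n, and the clock period.

```python
def arrival_times(delays, ffs, t_clk):
    """Smallest arrival times along one path that satisfy (1) and (3).

    delays[k] : delay of the k-th edge on the path
    ffs[k]    : number of flip-flops placed on that edge (constraint (2): >= 0)
    """
    if len(delays) != len(ffs):
        raise ValueError("need one flip-flop count per edge")
    arrivals = [0.0]  # assumed: the path starts at a flip-flop
    for d, n in zip(delays, ffs):
        if n < 0:
            raise ValueError("flip-flop counts must be non-negative")
        # Constraint (3), clamped below by the lower bound of constraint (1).
        arrivals.append(max(0.0, arrivals[-1] + d - t_clk * n))
    return arrivals


def is_feasible(delays, ffs, t_clk):
    """True if every arrival time respects the upper bound in constraint (1)."""
    return all(a <= t_clk for a in arrival_times(delays, ffs, t_clk))


# Two edges with delays 1.5 and 0.75 under a clock period of 1.0.
print(is_feasible([1.5, 0.75], [1, 0], 1.0))  # False: the second node arrives at 1.25
print(is_feasible([1.5, 0.75], [1, 1], 1.0))  # True
```

A full formulation would add one such set of constraints per path and hand the flip-flop counts to an integer solver; this checker only verifies a given assignment.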

38 Graph-based heuristic algorithm
- Traverse the graph to decide the optimal insertion of flip-flops such that the weighted sum of the cycle counts of the paths is minimized
  - Dynamic scanning for combinational circuits
  - Slacks along paths are used to compute the optimal positions for FFs.
  - Near-optimal method for sequential circuits: break each cycle into a path from s to t
- Throughput-aware floorplanning with pipelining
  - The path-based pipelining design guides the block design to optimize the performance of the whole design.
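
The idea of scanning a path and placing flip-flops only where the accumulated delay requires them can be shown on a single chain. The sketch below is a simplified greedy scan written for illustration; it is not the authors' graph-based heuristic, which handles general path graphs and weights. It assumes the path starts at a flip-flop, that an edge may hold several flip-flops, and that each flip-flop removes one clock period from the arrival time, as in the MILP constraints. The names are hypothetical.

```python
import math


def greedy_ff_insertion(delays, t_clk):
    """Scan one path and add the fewest flip-flops each edge needs.

    delays : edge delays in path order
    t_clk  : clock period (must be positive)
    Returns the number of flip-flops placed on each edge.
    """
    if t_clk <= 0:
        raise ValueError("clock period must be positive")
    ffs = []
    arrival = 0.0  # assumed: the path starts at a flip-flop
    for d in delays:
        arrival += d
        # Smallest n with arrival - n * t_clk <= t_clk.
        n = max(0, math.ceil((arrival - t_clk) / t_clk))
        arrival -= n * t_clk
        ffs.append(n)
    return ffs


print(greedy_ff_insertion([1.5, 0.75], 1.0))  # [1, 1]
print(greedy_ff_insertion([0.5, 0.25], 1.0))  # [0, 0]
```

Deferring flip-flops until the slack is used up is what lets block delay and wire delay share a cycle, in contrast to pipelining the wires after the blocks have been fixed.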

39 Experimental Results
- We compare the wire-pipelining results (WP), the solutions obtained from the MILP solver (MILP), the ideal upper bound used in [6][8] (UB), and our graph-based heuristic approach (GH).
- Impact of frequencies
  - Path-based pipelining gives about a 27% performance improvement over wire pipelining.

40 Integrated with floorplanning optimization
[Table: area (mm^2), wirelength (mm), and BIPS at each frequency (GHz) for UB+post_MILP and GH]
- Comparison
  - MILP approach as a post-process at the end of floorplanning
  - Our approach integrated with throughput-driven floorplanning

41 Summary
- 3D Architecture Exploration
  - Coupled with 3D physical planning
  - Considers both 3D component stacking and folding
- MEVA-3D can systematically evaluate a 3D architecture from both the performance side and the thermal side.
- We propose an optimization methodology that combines architecture pipelining with physical design, simultaneously optimizing the pipeline design and the physical packing in terms of system throughput. System performance improves by about 27% over wire pipelining.

42 Ongoing Work
- 3D multi-core architecture design and implementation
- Deep pipeline design in the microarchitecture with interconnect considered
- The slack in a 3D design may be used to enlarge blocks and obtain better performance.

Thank You!