Floorplanning Optimization with Trajectory Piecewise-Linear Model for Pipelined Interconnects C. Long, L. J. Simonson, W. Liao and L. He EDA Lab, EE Dept.

Presentation transcript:

Floorplanning Optimization with Trajectory Piecewise-Linear Model for Pipelined Interconnects C. Long, L. J. Simonson, W. Liao and L. He EDA Lab, EE Dept. UCLA DAC 2004

Outline
– Motivation
– Background
– Trajectory piecewise-linear CPI model
– CPI-aware floorplanning
– Experiment results
– Conclusion and discussions

Motivation
Traditional design flow
– Architecture optimization: minimize CPI
– Floorplanning optimization: maximize clock frequency
– Architectural optimization is separated from physical optimization under the assumption that layout does NOT change CPI.
[Figure: traditional design flow with boxes labeled ISA, Configuration, Performance evaluation, Architecture optimization, Floorplanning optimization]

Traditional Flow
A few years ago:
– Clock rates were much lower
    More time for a signal to reach its destination
    Inductance was less of a factor in delay
– Interconnect delay was smaller
    Less resistance
    Lower aspect ratio meant less capacitance
– Inter-module communication took less than one cycle
Interconnect length was used to determine the clock period (just clock it faster until it doesn't work)
Floorplanning had no impact on the cycle-by-cycle operation (CPI) of the processor

A New Interconnect Centric Reality
Now:
– Clock rates have increased by an order of magnitude
    My P2 from 1998 runs at 400 MHz. The Prescott P4 will be 4.0 GHz by the fourth quarter of '04 and has 31 pipeline stages for integer operations, some of which are due exclusively to interconnect pipelining
– Interconnects have longer delay with higher aspect ratio
– Die size is the same
– A signal can take up to ten clock cycles to travel from one corner of a chip to the opposite corner in 90nm technology
– Inter-module communication is likely to take more than one cycle
Clock period is now a constraint, not an objective
– Interconnect is pipelined when it cannot meet the constraint
A pipelined interconnect delays the cycle in which a signal arrives
– Changes the cycle-by-cycle behavior (CPI) of the system
– Determined by floorplanning

How to solve this problem?
Evaluate performance during floorplanning optimization
– Efficiency of the evaluation is the key
– Cycle-accurate simulation is too slow for this purpose
[Figure: design flow with boxes labeled ISA, Configuration, Performance evaluation, Architecture optimization, Floorplanning optimization]

Contributions of our work
We have pointed out that interconnect latency has a significant impact on architecture performance and that it is critical to consider it during floorplanning
We have developed an efficient table-based cycles-per-instruction (CPI) model
– Called the trajectory piecewise-linear (TPWL) model, with error less than 3.0%
We have integrated the TPWL CPI model with floorplan optimization
– Reduces CPI by up to 28.57% with a small area overhead of 5.72%

Background
Architecture and partitioning
– A superscalar implementation of the MIPS instruction set
– Similar to Alpha
– Twelve blocks

Block     Area (mm^2)    Block     Area (mm^2)
IALU1     1.00           IALU2     1.00
IALU3     1.00           IMULT     1.00
F_ADD     1.94           F_MULT    2.07
RUU       3.04           Decode    1.44
Branch    2.27           L2        75.6
IL1       8.99           DL1       10.03

Bus Latency Vectors
Interface between physical level and architecture level
Twelve buses
Bus latency vectors (B)
– E.g., B = {3, 4, 7, …}
– Characterize a floorplan as a vector containing the latency of each interconnect

Bus id    Terminals         Bus id    Terminals
1         IALU1, RUU        7         IL1, L2
2         IALU2, RUU        8         DL1, L2
3         IALU3, RUU        9         Branch, IL1
4         IMULT, RUU        10        Decode, Branch
5         FPADD, RUU        11        LSQ, DL1
6         FPMUL, RUU        12        Decode, RUU
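As a minimal sketch of how a floorplan could be turned into a bus latency vector, the snippet below assumes Manhattan distance between block centers and a fixed distance that a signal can travel per clock cycle. The slides do not state how latencies are derived from wire lengths, so the function name, the `reach_per_cycle_mm` parameter and the coordinates are hypothetical.

```python
import math

def bus_latency_vector(centers, buses, reach_per_cycle_mm):
    """Return the latency in cycles of each bus, in bus-id order.

    centers: block name -> (x, y) center in mm (hypothetical floorplan output)
    buses: list of (block_a, block_b) terminal pairs
    reach_per_cycle_mm: assumed distance a signal covers in one clock cycle
    """
    latencies = []
    for a, b in buses:
        (xa, ya), (xb, yb) = centers[a], centers[b]
        length = abs(xa - xb) + abs(ya - yb)  # Manhattan wire length
        # A bus that fits within one cycle still costs one cycle.
        latencies.append(max(1, math.ceil(length / reach_per_cycle_mm)))
    return latencies

# Illustrative coordinates only; not taken from the presentation.
centers = {"IALU1": (1.0, 1.0), "RUU": (4.0, 1.0),
           "IL1": (0.0, 6.0), "L2": (9.0, 10.0)}
buses = [("IALU1", "RUU"), ("IL1", "L2")]
B = bus_latency_vector(centers, buses, reach_per_cycle_mm=2.0)
print(B)
```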

Miss Events and Performance Loss
Types of miss events
– Data cache miss
– Instruction cache miss
– TLB miss
– Branch misprediction
Other sources of performance loss
– Data dependencies
– Resource contention

Measuring Performance
No hardware to measure
Need a model of the hardware
– Simulate the execution of the machine
– Two types of simulation
Trace-driven simulation
– Shade generates the instruction and address trace, dinero models the cache, etc.
– Fast: 10s of instructions on the host machine per instruction on the target machine
– Inaccurate
    good for I-cache performance loss measurement
    bad for D-cache performance loss measurement
    poor for branch misprediction performance loss
    very bad for data dependency performance loss
Execution-driven simulation
– State of the target hardware is maintained and updated in memory as each instruction is processed
– Slow: ~1000s of instructions on the host machine per instruction on the target machine
– Cycle-accurate: true to the cycle-by-cycle behavior of the hardware

Cycle Accurate Simulation
Given B, compute CPI
– Modify the architecture according to B
    Change the configuration file
    Insert buffers between modules
– Measure CPI for a subset of the SPEC2000 benchmark suite
    Floating point benchmarks: equake and mesa
    Integer benchmarks: gzip, vortex and mcf
– Take the arithmetic mean of these benchmarks as the CPI for B
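The final averaging step can be illustrated with a short sketch. The per-benchmark CPI values below are placeholders chosen for the example, not results from the presentation, and `cpi_for_vector` is a hypothetical helper name.

```python
from statistics import mean

def cpi_for_vector(per_benchmark_cpi):
    """Arithmetic mean of per-benchmark CPIs measured for one bus latency vector B."""
    return mean(per_benchmark_cpi.values())

# Placeholder numbers for illustration only.
measured = {"equake": 1.2, "mesa": 0.8, "gzip": 1.0, "vortex": 0.9, "mcf": 2.1}
cpi_B = cpi_for_vector(measured)
print(round(cpi_B, 3))
```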

CPI Models
A CPI model estimates CPI as a function of parameters of interest such as interconnect latency, architecture configuration, etc.
CPI models in the literature
– Static simulation [Nussbaum'01]
    Based on a single detailed simulation
    Generates a synthetic instruction trace
    Takes advantage of cache and branch prediction statistics
– Statistical sampling of cycle-accurate simulation
    Sampling instead of truncating: selectively measuring in detail only an appropriate benchmark subset
    Configuring a systematic sampling simulation run to achieve a desired confidence in estimates
– More efficient than cycle-accurate simulation but still slow; none of them considers interconnect latency

Traditional floorplanning
Optimize floorplan via simulated annealing (SA) algorithm
– Objective function: [formula missing from transcript]
– Moves
    Change the position or shape of blocks
– Cooling scheme
    Initial temperature
    Constant cooling rate
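The annealing loop described on this slide can be sketched generically as follows. This is an illustrative skeleton with a geometric (constant-rate) cooling schedule, not the authors' tool: the parameter defaults are arbitrary, and in a floorplanner `cost` would combine area and wire length in some form the transcript does not preserve. The usage example minimizes a toy one-dimensional function purely to show the control flow.

```python
import math
import random

def anneal(state, cost, neighbor, t0=10.0, cooling=0.95,
           steps_per_temp=20, t_min=1e-3, seed=0):
    """Simulated annealing with a constant cooling rate.

    cost(state) -> float, neighbor(state, rng) -> new candidate state.
    Returns the best state seen and its cost.
    """
    rng = random.Random(seed)
    cur, cur_cost = state, cost(state)
    best, best_cost = cur, cur_cost
    t = t0
    while t > t_min:
        for _ in range(steps_per_temp):
            cand = neighbor(cur, rng)
            delta = cost(cand) - cur_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cur, cur_cost = cand, cur_cost + delta
                if cur_cost < best_cost:
                    best, best_cost = cur, cur_cost
        t *= cooling  # constant cooling rate
    return best, best_cost

# Toy stand-in for a floorplan: an integer position, moved by +/-1.
toy_cost = lambda x: (x - 3) ** 2
toy_move = lambda x, rng: x + rng.choice((-1, 1))
best, best_cost = anneal(20, toy_cost, toy_move)
print(best, best_cost)
```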

Floorplanning considering CPI
Based on simulated annealing
– Objective function: [formula missing from transcript]
    Extended from the traditional floorplanning framework
    Key is to estimate CPI efficiently
– Moves and cooling schedule remain the same
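Since the objective function itself is not preserved in the transcript, the sketch below assumes one plausible form, a weighted sum of area, wire length and estimated CPI. The weights, the helper names and the dictionary-based toy floorplan are all hypothetical.

```python
def make_cpi_aware_cost(area, wire_length, bus_latencies, estimate_cpi,
                        w_area=1.0, w_wire=1.0, w_cpi=1.0):
    """Build a cost function for SA (assumed weighted-sum form).

    area, wire_length: floorplan -> float
    bus_latencies: floorplan -> bus latency vector B
    estimate_cpi: B -> estimated CPI (e.g. from a table-based model)
    """
    def cost(floorplan):
        b = bus_latencies(floorplan)
        return (w_area * area(floorplan)
                + w_wire * wire_length(floorplan)
                + w_cpi * estimate_cpi(b))
    return cost

# Toy floorplan record with made-up numbers.
fp = {"area": 120.0, "wire": 40.0, "B": (2, 7)}
cost = make_cpi_aware_cost(
    area=lambda f: f["area"],
    wire_length=lambda f: f["wire"],
    bus_latencies=lambda f: f["B"],
    estimate_cpi=lambda b: 1.0 + 0.05 * sum(b),  # stand-in CPI model
    w_cpi=100.0,
)
value = cost(fp)
print(value)
```

Because only the cost function changes, the same moves and cooling schedule can be reused, which matches the slide's last bullet.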

Trajectory of SA
The path that SA follows during optimization is a trajectory in the solution space
– We only need to accurately estimate CPI in the area where the trajectory travels
The trajectory of SA with objective of area, wire length and CPI is close to that of area and wire length only
[Figure: two SA trajectories in the Bus1/Bus2 latency plane, one for "Area and wire length" and one for "Area, wire length and CPI"]

Trajectory Piecewise-linear CPI Model
Build a piecewise-linear model for a small solution region around the trajectories of SA
– Three phases: sampling, collecting and simulating
– An example for a 2-dimensional bus vector
[Figure: axes Latency (bus1) and Latency (bus2); simulation points marked]

TPWL: Sampling
Sample a complete simulated annealing process with objective of area and total wire length to obtain a set of bus latency vectors (points in n dimensions)
[Figure: axes Latency (bus1) and Latency (bus2)]

TPWL: Collecting
Collect all the points obtained in the sampling phase in as few "balls" as possible (TPC problem)
[Figure: axes Latency (bus1) and Latency (bus2)]
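One way to picture the collecting phase is a greedy covering heuristic: repeatedly pick the sampled point whose ball covers the most still-uncovered points. This is a generic set-cover style sketch, not necessarily how the authors solve the TPC problem; the Euclidean metric, the fixed `radius` and the sample points are assumptions.

```python
import math

def greedy_ball_cover(points, radius):
    """Cover all points with balls of a given radius centered on sampled points.

    Greedy heuristic: each round picks the point whose ball covers the
    largest number of uncovered points. Returns the chosen centers.
    """
    uncovered = set(range(len(points)))
    centers = []
    while uncovered:
        best_i, best_cov = None, set()
        for i, p in enumerate(points):
            cov = {j for j in uncovered if math.dist(p, points[j]) <= radius}
            if len(cov) > len(best_cov):
                best_i, best_cov = i, cov
        centers.append(points[best_i])
        uncovered -= best_cov
    return centers

# Illustrative 2-D bus latency vectors (bus1, bus2).
samples = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (10, 1)]
centers = greedy_ball_cover(samples, radius=1.5)
print(centers)
```

Fewer balls mean fewer cycle-accurate simulations in the next phase, at the cost of larger extrapolation distances from each center.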

TPWL: Simulating
Obtain CPI by cycle-accurate simulation for the centers of the "balls"
Build a CPI table indexed by these center points
[Figure: axes Latency (bus1) and Latency (bus2); simulation points at ball centers]

CPI estimation under TPWL model
Based on each entry, the CPI of a target B can be estimated by first-order expansion
For each entry, a weight is calculated based on the distance between the target B and the entry in the CPI table
The final estimate is the weighted sum of the estimates based on each entry
[Figure: target point B and table entries B1 to B5 at distances d1 to d5]
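The estimation procedure above can be sketched as follows. The slide says only that weights depend on distance, so the normalized inverse-distance weighting used here is an assumption, as is the idea that each table entry stores a slope vector (CPI sensitivity per bus) alongside its center and simulated CPI.

```python
import math

def estimate_cpi(target, table, eps=1e-12):
    """Weighted sum of first-order expansions around each table entry.

    table: list of (center, cpi, slope) where slope[k] is the assumed
    change in CPI per extra cycle of latency on bus k near that center.
    """
    estimates, weights = [], []
    for center, cpi, slope in table:
        d = math.dist(target, center)
        if d < eps:
            return cpi  # target coincides with a simulated point
        # First-order expansion: CPI_i + slope_i . (B - B_i)
        est = cpi + sum(s * (t - c) for s, t, c in zip(slope, target, center))
        estimates.append(est)
        weights.append(1.0 / d)  # assumed inverse-distance weight
    total = sum(weights)
    return sum(w * e for w, e in zip(weights, estimates)) / total

# Made-up two-entry table over (bus1, bus2) latencies.
table = [((2, 2), 1.0, (0.1, 0.05)),
         ((6, 2), 1.5, (0.1, 0.05))]
cpi_hat = estimate_cpi((4, 2), table)
print(round(cpi_hat, 4))
```

Entries close to the target dominate the result, so the estimate degrades gracefully as B moves away from the region where simulations were run.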

CPI-aware Floorplanning: Overview
Integrate the TPWL CPI model with a traditional floorplanning tool
[Flowchart labels, in order: Start; Floorplanning; Trajectory sampling; "Balls" to cover trajectory; Solve the TPC problem; CPI table; Cycle-accurate simulation; Floorplanning considering CPI; Integrate into floorplanning]

Iterative TPWL model
When the trajectory with objective of area and total wire length is significantly different from the trajectory with objective of area, total wire length and CPI, an iterative TPWL model is needed
[Figure: trajectories in the Bus1/Bus2 latency plane for "Area and wire length" and "Area, wire length and CPI", at iteration = 1 and iteration = 2]

Iterative TPWL Model
Iteratively expand the CPI table to build an iterative TPWL (iTPWL) model
– Based on the TPWL model, but from the second iteration on, the objective of SA is area, total wire length and CPI
– Improves the accuracy of CPI estimation and the quality of the final floorplan
[Flowchart labels, in order: Start; Floorplanning; Trajectory sampling; "Balls" to cover trajectory; Solve the TPC problem; CPI table; Cycle-accurate simulation; Floorplanning considering CPI; Integrate into floorplanning]
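The iteration can be outlined as a loop that alternates annealing, covering and simulating, growing the CPI table each round. Everything in this sketch is a placeholder: the four callables stand in for the floorplanner, the TPC solver, the cycle-accurate simulator and the model builder, and the fixed iteration count is an assumption.

```python
def build_itpwl_table(run_sa, cover, simulate, estimator_from, iterations=2):
    """Iteratively expand a CPI table (hypothetical driver loop).

    run_sa(estimator): returns sampled bus latency vectors; estimator is
        None on the first pass (area and wire length only).
    cover(trajectory): returns ball centers for the sampled points.
    simulate(center): returns the simulated CPI for one center.
    estimator_from(table): builds a CPI estimator from the table.
    """
    table = {}
    estimator = None
    for _ in range(iterations):
        trajectory = run_sa(estimator)
        for center in cover(trajectory):
            if center not in table:  # simulate each center only once
                table[center] = simulate(center)
        estimator = estimator_from(table)
    return table

# Stubs that only demonstrate the control flow.
calls = []

def run_sa(estimator):
    return [(1, 2), (2, 2)] if estimator is None else [(2, 2), (3, 1)]

def simulate(center):
    calls.append(center)
    return 1.0 + 0.1 * sum(center)

table = build_itpwl_table(run_sa, cover=lambda traj: traj,
                          simulate=simulate, estimator_from=lambda t: dict(t))
print(sorted(table))
```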

Summary on TPWL CPI Model
Originally proposed for modeling non-linear systems [Rewienski'03]
– Outperforms other techniques based on quadratic reduction
TPWL model is suitable for floorplanning optimization
– The trajectory of SA with objective of area, total wire length and CPI is close to that with objective of area and total wire length only
– When these two trajectories are not close, the iTPWL model is employed to improve the accuracy
Contribution of this paper on TPWL model
– Introduce the TPC problem
– Expand the TPWL model to the iTPWL model

Experiment results
Verification of CPI models
– Error of TPWL model: 2.62%
– Error of iTPWL model: 1.66%

Impact of models on final floorplans
Comparison of the floorplans obtained with the access ratio model, the sensitivity rate model, the TPWL model and the iTPWL model, with objective of area, total wire length and CPI
– Access ratio: use the access ratio of interconnects to represent the impact on system performance
– Sensitivity rate: estimate CPI based on a first-order expansion around the original point

Floorplanning with iTPWL Model
Comparison between floorplans obtained with different objectives

Running time
SimpleScalar simulation times to build up the TPWL and iTPWL models

Conclusion and discussion
Proposed an accurate CPI model with less than 3.0% error
The CPI-aware floorplanner reduces CPI by up to 28.57% with a small area overhead of 5.72%
Expanded the TPWL model to improve the accuracy of estimation
The accuracy of the iTPWL model leads to high-quality floorplanning solutions and enables us to develop good heuristics, such as access ratio, to minimize CPI without explicit CPI calculation
Plan to apply this model to architecture changes