1
Optimization-Based Models For Pruning Processor Design-Space
Optimization-Based Models For Pruning Processor Design-Space
Nilay Vaish
Committee Members: Michael C. Ferris, Mark D. Hill, Jeffrey T. Linderoth, Michael M. Swift, David A. Wood (advisor)
04/05/2017
2
Summary Designing processors is becoming increasingly hard
This thesis explores optimization-based models for pruning/exploring the design space, through three case studies: exploring the design space of cache hierarchies, and on-chip resource placement and distribution (bandwidth-optimized and latency-optimized).
3
Microprocessors Big role in IT revolution
Performance used to scale as suggested by Moore [Moo65]. Moore's law is still fine, but single-threaded performance is tapering off.
4
Single Threaded CPU Performance For SPECint
Performance on SPECint. Source: J. Preshing.
5
Microprocessors Single Core Multi Core Many Core
P = C·V²·f, and f_max is roughly linear in V. Reduce voltage and frequency, and run tasks in parallel: more cores, each operated at lower voltage and frequency. Exploit application-level parallelism for better performance.
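The power relation above can be checked with a tiny numeric sketch (values illustrative, not from the slides): since f_max scales roughly with V, halving voltage and frequency together cuts per-core dynamic power by about 8×, which is why many slower cores can fit in one fast core's power budget.

```python
# Dynamic CMOS power: P = C * V^2 * f (all values illustrative units).
def dynamic_power(c, v, f):
    """Capacitance * voltage^2 * frequency."""
    return c * v * v * f

full = dynamic_power(c=1.0, v=1.0, f=1.0)     # one core at full V and f
scaled = dynamic_power(c=1.0, v=0.5, f=0.5)   # halve V, so halve f as well
print(full / scaled)  # 8.0
```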
6
= 4455 possible spatial arrangements (includes symmetric ones)
Design Challenges More components mean more design possibilities; e.g., 4455 possible spatial arrangements (including symmetric ones). Architectural simulators are slow: about 100,000 simulated instructions / second. More simulated cores means less simulation throughput.
7
Optimization-based models and algorithms sufficient for pruning
Our Thesis Optimization-based models and algorithms are sufficient for pruning. They provide solutions that perform better than previously proposed ones, reduce the time to explore the design space compared to brute-force, randomized, or genetic-algorithm-based search procedures, allow solving problems of larger scale and complexity, and provide information about optimality.
8
Our Case Studies Designing cache hierarchy (partly done before preliminary exam) Resource placement and distribution Bandwidth-optimized (mostly done before preliminary exam) Latency-optimized (most recent work)
9
Problem Statement Input: cache and physical memory designs, processor design (cores and interconnect topology), resource constraints, workload behavior Output: a hierarchy of caches well suited for the processor
10
Variability in Cache Hierarchies Across Designs
| Processor           | Cores | Level 1 (Instruction) | Level 1 (Data) | Level 2                                  | Level 3                  | Memory Controllers |
|---------------------|-------|-----------------------|----------------|------------------------------------------|--------------------------|--------------------|
| Cavium ThunderX     | 48    | 78 KB                 | 32 KB          | 16 MB shared                             | –                        | 6 DDR3/4           |
| Cavium ThunderX2    | 54    | 64 KB                 | 40 KB          | 32 MB shared                             | –                        | –                  |
| Intel Xeon E v4     | 24    | –                     | –              | 256 KB                                   | 60 MB                    | 4 DDR4 channels    |
| Intel Xeon Phi      | 72    | –                     | –              | 1 MB shared by two cores                 | 16 GB (HBM)              | 2 DDR4, 6 channels |
| Mellanox Tile-Gx72  | –     | –                     | –              | –                                        | 18 MB                    | 4 DDR3             |
| Oracle M7           | 32    | 16 KB                 | –              | 256 KB I$, 256 KB D$, shared by four cores | 8 MB, shared by four cores | 4 DDR4, 16 channels |
11
Example Design Space Infeasible Maximal Dominated
(Axes: expected access latency vs. expected dynamic energy)
12
So many dimensions to optimize! What to do?
Energy Latency Power Area Levels
13
Possible Approach: Exhaustive Simulation
As per Cacti [MBJ09], more than 10^12 cache designs are possible. These can be algorithmically and heuristically pruned to several thousand designs, but that still leaves ~10^10 three-level hierarchies. Architectural simulators are slow: about 100,000 simulated instructions / second. Design parameters: size, banks, associativity, nspd, wire type, ndwl, ndbl, ndcm, ndsam_lev1, ndsam_lev2.
14
Possible Approach: Continuous Modeling
Several mathematical models exist for designing cache hierarchies [JCSM96, PPG11, OLLC, SHK+]. Common feature: continuous functions for latency, program behavior, and cost / energy / power.
15
Possible Approach: Continuous Modeling
Requires algebraic functional forms
16
Dynamic Energy Vs Size
17
Possible Approach: Continuous Modeling
Requires algebraic functional forms Designs not physically realizable
18
Gap Between Continuous And Discrete Solutions
(Plot: continuous optimum vs. discrete optimum, on axes X and Y)
19
Desired Features Trade-off among different metrics: access time, dynamic energy consumption, static power dissipation, chip area requirements Obtain physically realizable designs Optimize for shared caches, on-chip network.
20
Our Approach Consists of two parts
Discrete Modeling of the design space Dynamic Programming + Multi-dimensional divide and conquer to compute hierarchies on the Pareto optimal frontier
21
Discrete Model Prior work uses rules of thumb to fit continuous functions to data obtained from simulation and from tools like Cacti. We instead use the obtained data directly; there is no need to continuously model design parameters that are actually discrete.
22
Dynamic Programming We observed two structural properties of solutions under the discrete model. Optimal Substructure: every optimal hierarchy for n levels contains an optimal hierarchy for the first n−1 levels. Overlapping Subproblems: different n-level hierarchies mostly differ in their nth level. So compute the Pareto frontier of (n−1)-level hierarchies once and reuse it while computing n-level hierarchies.
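The two properties above can be sketched in miniature. The following is an illustrative toy, not the thesis's implementation: designs are hypothetical (latency, energy) pairs where lower is better in both dimensions, and each level extension reuses only the previous frontier.

```python
# Toy dynamic-programming sketch: keep only the Pareto frontier of
# (n-1)-level hierarchies, then extend it with every candidate nth-level
# cache and prune dominated combinations. All numbers are illustrative.
def pareto(points):
    """Keep only non-dominated (latency, energy) points (lower is better)."""
    keep = []
    for p in points:
        if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points):
            keep.append(p)
    return keep

def extend(frontier_prev, next_level):
    # Combine each frontier hierarchy with each candidate next-level cache,
    # summing the metrics, then keep only the non-dominated results.
    combos = [(h[0] + c[0], h[1] + c[1]) for h in frontier_prev for c in next_level]
    return pareto(combos)

level1 = pareto([(4, 1), (2, 3), (1, 5), (3, 3)])  # (3, 3) is dominated
level2 = [(10, 2), (8, 6)]
frontier = extend(level1, level2)
```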
23
Find The Latency Optimal Cache Hierarchy
Cache designs (access latency, global miss ratio):
Level 1: (4ns, 1/9), (4ns, 1/8), (2ns, 1/5), (2ns, 1/4), (1ns, 1/4)
Level 2: (40ns, 1/100), (20ns, 1/50)
Physical Memory: 50ns
24
Choose Level-1 Maximal Designs
Eliminate dominated designs. Level 1 candidates: (4ns, 1/9), (4ns, 1/8), (2ns, 1/5), (2ns, 1/4), (1ns, 1/4); the maximal ones are (4ns, 1/9), (2ns, 1/5), (1ns, 1/4). Level 2: (40ns, 1/100), (20ns, 1/50). Physical Memory: 50ns.
25
Compute Level-1 × Level-2 Maximal Designs
Compute expected access times, e.g. combining with the 40ns Level 2: 4 + 40/9 = 76/9 ≈ 8.44 and 2 + 40/5 = 10. Level 1 maximal designs: (4ns, 1/9), (2ns, 1/5), (1ns, 1/4). Level 2: (40ns, 1/100), (20ns, 1/50). Physical Memory: 50ns.
26
Choose Level-1 × Level-2 Maximal Designs
Maximal Level-1 × Level-2 combinations retained. Level 2 designs: (40ns, 1/100), (20ns, 1/50). Physical Memory: 50ns.
27
Choose Level-1 × Level-2 × Memory Designs
Expected access times including physical memory:
(4ns, 1/9) Level 1 with (40ns, 1/100) Level 2: 4 + 40/9 + 50/100 = 161/18 ≈ 8.94
(2ns, 1/5) Level 1 with (20ns, 1/50) Level 2: 2 + 20/5 + 50/50 = 7
(1ns, 1/4) Level 1 with (20ns, 1/50) Level 2: 1 + 20/4 + 50/50 = 7
Physical Memory: 50ns
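The arithmetic in the worked example above can be checked mechanically. A small sketch using exact fractions (the function name is illustrative): expected latency is the level-1 latency plus each lower level's latency weighted by the global miss ratio of the levels above it, exactly as in the slides.

```python
# Verify the expected-access-time arithmetic from the example hierarchy.
from fractions import Fraction as F

def expected_latency(latencies, global_miss):
    """latencies: [t1, t2, ..., t_mem]; global_miss[i] is the fraction of
    accesses that miss in every level up to i+1 (the slides' global miss ratio)."""
    total = F(latencies[0])
    for m, t in zip(global_miss, latencies[1:]):
        total += F(m) * t
    return total

# 2ns L1 (global miss 1/5), 20ns L2 (global miss 1/50), 50ns memory -> 7ns
seven = expected_latency([2, 20, 50], [F(1, 5), F(1, 50)])
# 4ns L1 (1/9), 40ns L2 (1/100), 50ns memory -> 161/18, about 8.94ns
worse = expected_latency([4, 40, 50], [F(1, 9), F(1, 100)])
```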
28
Proposed Solution: Short Description
Choose scenarios containing multiple applications. Collect data on cache behavior of applications Select individual cache designs maximal for the chosen metrics Add private levels, one at a time. Keep only feasible and maximal designs around Design shared levels, one at a time. Analyze shared cache behavior, on-chip network behavior. Select feasible and maximal designs Account for the physical memory
29
Evaluation of Our Method
Comparison with a continuous model for a single-core hierarchy; cumulative distribution of true positives for 8-core designs.
30
Two Kinds Of Errors Dominated Infeasible Maximal False -ve False +ve
(Axes: expected access latency vs. expected dynamic energy)
31
Estimating True Positives
(Design-space plot: dominated, infeasible, and maximal designs, with false negatives and false positives; loss in performance shown on axes of expected access latency vs. expected dynamic energy)
32
Continuous Model
s_i, t_i: size and access delay of the cache at level i
static_i, dynamic_i, area_i: static power, dynamic energy, and area of the cache at level i
p_i: probability of accessing the cache at level i
a_0, a_1, b_0, b_1, c_0, c_1, d_0, d_1, α, β: constants obtained by fitting linear functions to data from Cacti and the applications under consideration.
33
Comparing Discrete and Continuous Models
Solved the two models with the same set of parameters. Output from the continuous model was rounded to the nearest feasible cache design. Simulations were carried out with SPEC CPU applications using gem5. The optimal hierarchy was computed by also simulating hierarchies that are off the Pareto-optimal frontier.
34
Improvement With Discrete Model Over Continuous Model (Inorder Core)
35
Improvement With Discrete Model Over Continuous Model (Out-of-Order Core)
36
Experiments With 8-core Design
Assumed the architecture shown: 8 cores connected via a crossbar, one or two levels of private cache, and one shared level. Computed and simulated hierarchies for the SPEC CPU2006 benchmark suite with a detailed core and memory system.
37
Distribution of true Pareto-optimal (Homogeneous Workload)
(Plot marks the ideal point.)
38
Distribution of true Pareto-optimal (Mixed Workload)
39
Conclusion There is a need for a structured approach to designing cache hierarchies. We developed a discrete model, observed two structural properties, and proposed a dynamic programming + multi-dimensional divide and conquer algorithm. It performs better than continuous models and approximations proposed in the literature.
40
Our Case Studies Designing cache hierarchy (partly done before preliminary exam) Resource placement and distribution Bandwidth-optimized (mostly done before preliminary exam) Latency-optimized (most recent work)
41
Problem Statement Input: a processor design (cores and on-chip network topology), a set of memory controllers, on-chip network resource budgets, and a traffic pattern. Output: a ‘suitable’ placement of memory controllers in the on-chip network plus the design of the network. (Figure legend: processor core, network link, memory controller, memory channel, memory (DRAM))
42
Why Should We Care? Data needs to move between the processor chip and memory, but bandwidth is limited and must be shared by on-chip components. (Figure legend: processor core, network link, memory controller, memory channel, memory (DRAM), message packet)
43
Why Should We Care? Controllers managing off-chip accesses should be placed strategically. Similarly, the on-chip network needs to be designed according to the traffic patterns. The combined problem is more challenging, but yields potentially better designs.
44
What Has Been Done Before?
Abts et al. [AEJK+] solved the controller placement problem using: experience, exhaustive simulation of smaller designs, and a Genetic Algorithm (GA)-based approach. (Figure: ‘diamond’ and ‘diagonal’ placements; legend: processor core and caches, memory controller)
45
What Has Been Done Before?
Mishra et al. [MVD] designed a heterogeneous mesh network with two types of links (wide and narrow) and two types of routers (big and small), using exhaustive simulation and extrapolation. (Figure legend: big router, small router, wide link, narrow link)
46
Our Approach The design space is large: with n tiles and m memory controllers, there are C(n, m) ways to place the controllers; for 64 tiles and 16 controllers, about 4.89 × 10^14 ways. Distributing on-chip network resources along with placing controllers is even harder. We use mathematical optimization to solve the problem.
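The combinatorial count above is a quick check away with Python's standard library (the "n choose m" placements of m memory controllers among n tiles):

```python
# Sanity-check the quoted design-space size: C(64, 16) placements.
import math

ways = math.comb(64, 16)
print(f"C(64, 16) = {ways} ≈ {ways:.2e}")  # about 4.89e+14
```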
47
Optimization Model: indices, inputs, and variables
(x, y): coordinates on a 2-D plane
l: a link
Path(x, y, x′, y′): the set of links l used on the path from (x, y) to (x′, y′)
Ω_{x,y,x′,y′}: weight of traffic from (x, y) to (x′, y′)
Inputs: link budget, number of memory controllers
BW_l: link l's bandwidth
BW: load per unit bandwidth
I_{x,y}: binary variable denoting whether a memory controller is placed at (x, y)
Load(l): traffic on l due to communication between controllers and cores
48
Analysis Of The Model Mixed Integer Non-linear Non-convex
Possible to linearize the formulation
49
diagonal (Mishra et al.)
Solution Design: diagonal (Mishra et al.) vs. com-opt. (Figure legend: big router + memory controller, small router, narrow link, wide link, memory controller)
50
Evaluation Methodology
gem5 simulator: 64 out-of-order cores, mesh interconnect. 45 multi-programmed workloads generated using SPEC CPU2006. Simulations run until each core executed at least 25,000,000 instructions. Compares the diagonal and com-opt designs.
51
Evaluation Results Speedup obtained by com-opt over diagonal
On average, com-opt improves performance over diagonal by about 10%.
52
What’s Lacking Prior models focused on minimizing the maximum traffic bandwidth over a link, but applications are not necessarily bound by communication bandwidth. We need models that can help design processors for latency-sensitive applications.
53
Our Contribution Developed two optimization models with different objectives: minimize maximum latency, and minimize average latency. The models need only a single input parameter: the traffic input rate. Non-linear models did not work well, but linearized versions did; an algebraic queueing model may not be essential.
54
MinMax Optimization Model
BW_l: link l's bandwidth
LoadOnLink(l): traffic on l due to communication between controllers and cores
μ: average rate of transmission at a link
λ: rate of traffic input from any core in the processor
λ_l: load per unit bandwidth of link l
W_l: average waiting time at link l
Z_{x,y,x′,y′}: delay for the path from (x, y) to (x′, y′)
55
Specifics Assume each link acts as an M/D/1 server. Estimate the average waiting time for a flit at link l as W_l = λ_l / (2 μ (μ − λ_l)). Latency in traversing link l: 1/μ + W_l. Latency over the path from (x, y) to (x′, y′): Z_{x,y,x′,y′} ≥ I_{x,y} · Σ_{l ∈ Path(x,y,x′,y′)} (1/μ + W_l). Z = maximum latency over all paths. Objective: minimize Z.
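The M/D/1 waiting-time estimate above is easy to sanity-check numerically. A minimal sketch (rates illustrative) showing the sharp growth as utilization approaches 1:

```python
# Mean waiting time in an M/D/1 queue: W = lam / (2 * mu * (mu - lam)).
def md1_wait(lam, mu):
    """Average wait given arrival rate lam and service rate mu."""
    if lam >= mu:
        raise ValueError("unstable: arrival rate must stay below service rate")
    return lam / (2 * mu * (mu - lam))

# Waiting time grows sharply as utilization rho = lam/mu approaches 1:
for rho in (0.25, 0.5, 0.75, 0.9):
    print(rho, md1_wait(rho, 1.0))
```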
56
Linearizing Constraints
The non-linear model was unable to compute an optimal design or explore the design space. Several constraints are non-linear, e.g. the average waiting time at a link: W = λ / (2 μ (μ − λ)).
57
Plots For Queueing Latency And Piecewise Linear Estimations
58
Linearizing Constraints
The non-linear model was unable to compute an optimal design or explore the design space. Several constraints are non-linear, e.g. the average waiting time at a link: W = λ / (2 μ (μ − λ)). The waiting-time constraint is linearized using a piecewise-linear function and features of the modeling language.
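One standard way to build such a piecewise-linear estimate is to lower-bound the convex waiting-time curve by the maximum of a few tangent lines, which a MILP solver can handle. This is a hypothetical sketch under that assumption, not the thesis's actual formulation; breakpoints are illustrative.

```python
# Piecewise-linear lower bound on the convex M/D/1 wait W(lam), mu fixed.
def md1_wait(lam, mu=1.0):
    return lam / (2 * mu * (mu - lam))

def tangent(lam0, mu=1.0):
    """(slope, intercept) of the tangent to W at lam0; dW/dlam = 1/(2(mu-lam)^2)."""
    slope = 1.0 / (2 * (mu - lam0) ** 2)
    intercept = md1_wait(lam0, mu) - slope * lam0
    return slope, intercept

def piecewise_wait(lam, breakpoints=(0.0, 0.5, 0.8, 0.9), mu=1.0):
    # Max over tangent lines: exact at each breakpoint, never above the curve.
    return max(s * lam + b for s, b in (tangent(x, mu) for x in breakpoints))
```

Because W is convex in λ, every tangent underestimates it, so the maximum of tangents is a valid lower bound that tightens as breakpoints are added.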
59
Linearized MinMax Model
60
Results With Linearized Model
Solved the model for different values of traffic intensity ρ = λ/μ. Found designs that lie between the center and diamond placements.
61
Simulation Performance Of Different Designs
62
Conclusion Developed two new optimization-based models for placing memory controllers and designing the on-chip network. Designers need to set only a single parameter to explore the design space. Found designs with performance in between that of center and diamond; the finding was confirmed with synthetic simulations. The use of a piecewise-linear model shows an algebraic queueing model is not necessary.
63
To Conclude Designing processors is getting harder. We need better tools.
We believe optimization-based models are suitable for pruning; we explored a few of them in this thesis. More research is needed on what else might be beneficial, e.g. developing a multi-scale model and estimating error bounds.
64
Acknowledgements Prof Wood Committee members
Fellow students and friends Family members
65
Bibliography
[AEJK+] Dennis Abts, Natalie D. Enright Jerger, John Kim, Dan Gibson, and Mikko H. Lipasti, Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs, ISCA '09.
[BBB+] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood, The gem5 simulator, SIGARCH Comput. Archit. News 39 (2011), 1-7.
[Ben80] Jon Louis Bentley, Multidimensional divide-and-conquer, Commun. ACM 23 (1980), no. 4.
[DGR+74] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc, Design of ion-implanted MOSFET's with very small physical dimensions, IEEE Journal of Solid-State Circuits 9 (1974), no. 5.
[JCSM96] Bruce L. Jacob, Peter M. Chen, Seth R. Silverman, and Trevor N. Mudge, An Analytical Model for Designing Memory Hierarchies, IEEE Trans. Comput. 45 (1996), no. 10.
66
Bibliography
[MBJ09] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi, Cacti 6.0: A tool to model large caches, HP Laboratories (2009).
[Moo65] Gordon E. Moore, Cramming more components onto integrated circuits, Electronics 38 (1965), no. 8.
[MVD] Asit K. Mishra, N. Vijaykrishnan, and Chita R. Das, A Case for Heterogeneous On-Chip Interconnects for CMPs, ISCA '11.
[OLLC] Taecheol Oh, Hyunjin Lee, Kiyeon Lee, and Sangyeun Cho, An Analytical Model to Study Optimal Area Breakdown between Cores and Caches in a Chip Multiprocessor, ISVLSI '09, IEEE Computer Society.
[PPG11] Pablo Prieto, Valentin Puente, and Jose-Angel Gregorio, Multilevel Cache Modeling for Chip-Multiprocessor Systems, IEEE Comput. Archit. Lett. 10 (2011), no. 2.
[SHK+] Guangyu Sun, Christopher J. Hughes, Changkyu Kim, Jishen Zhao, Cong Xu, Yuan Xie, and Yen-Kuang Chen, Moguls: A Model to Explore the Memory Hierarchy for Bandwidth Improvements, ISCA '11.
67
Classic CMOS Dennard Scaling: the Science behind Moore’s Law
Source: The Future of Computing Performance: Game Over or Next Level?, National Academies Press, 2011.
Scaling: Voltage: V/a; Oxide: t_OX/a
Results: Power/circuit: 1/a²; Power density: ~constant
National Research Council (NRC) – Computer Science and Telecommunications Board (CSTB.org), Mark D. Hill
68
Post-classic CMOS Dennard Scaling
Post-Dennard CMOS scaling (classic → post-classic):
Scaling: Voltage: V/a → V; Oxide: t_OX/a (unchanged)
Results: Power/circuit: 1/a² → 1; Power density: ~constant → a²
National Research Council (NRC) – Computer Science and Telecommunications Board (CSTB.org), Mark D. Hill
69
How We Linearize The Model
Form of the non-linear constraint: bilinear, the product of a continuous variable and a bounded discrete variable.
70
Benefits Of Using Optimization
Scalable Theoretical performance bounds Flexible
71
Dynamic Programming Technique for solving optimization problems in which decisions are made over multiple stages. In each stage, a decision is made, some reward is received, and possibly new information about the system under consideration is revealed. Aim: maximize the sum total of all the rewards received. Algorithm: step forward / backward in time.
72
Dynamic Programming : Drawback
Not possible in many situations due to the curse of dimensionality: a ginormous number of stages, designs or policies, state variables, or scenarios.
73
How To Select Individual Cache Designs
Need a tool (like Cacti) to estimate a design's performance on the chosen metrics: access latency, dynamic energy, static power, chip area. Trillions of designs are possible as per Cacti, so we need to prune. Compute maximal designs using the divide and conquer algorithm [Ben80]; time complexity O(n log^(k−1) n) for n points in k-dimensional space. About 0.07% of the designs produced by Cacti are maximal.
74
Variation in Cache Miss Rates Across Applications
75
Designing Shared Levels Of Hierarchy
Expected access time is computed using queueing theory. Miss probabilities are computed using the equilibrium state of the shared caches. On-chip network performance is also computed using queueing theory.
76
Why Not Scalarize Scalarize the objective by combining the metrics: min Σ_{i=1}^{p} λ_i f_i(x) subject to x ∈ X. It is not clear which λ_i to choose. Each choice yields only one maximal design; it is not possible to generate all maximal designs.
77
Scalarization (design-space plot: dominated, infeasible, and maximal designs; axes: expected access latency vs. expected dynamic energy)
78
Why Not Scalarize Scalarize the objective by combining the metrics: min Σ_{i=1}^{p} λ_i f_i(x) subject to x ∈ X. It is not clear which λ_i to choose. Each choice yields only one maximal design; it is not possible to generate all maximal designs.
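The scalarization weakness above can be demonstrated in a few lines. This is an illustrative toy with hypothetical (latency, energy) designs: a Pareto-optimal point lying above the lower convex hull is never selected by any weighting.

```python
# Weighted-sum scalarization picks one optimum per weight vector and can
# never recover Pareto-optimal points off the lower convex hull.
designs = [(1, 10), (4, 8), (10, 1)]  # (latency, energy); none dominates another

def scalarize(weights):
    """Return the design minimizing the weighted sum of its metrics."""
    return min(designs, key=lambda d: sum(w * m for w, m in zip(weights, d)))

# Sweep several weightings; (4, 8) lies above the hull of the other two,
# so no weighting ever picks it even though it is Pareto-optimal.
found = {scalarize((w, 1 - w)) for w in (0.1, 0.3, 0.5, 0.7, 0.9)}
```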
79
Example Design Space Dominated Infeasible Maximal
(Axes: expected access latency vs. expected dynamic energy)
80
Multi-Dimensional Divide and Conquer
Algorithm for computing the Pareto frontier [Ben80]: divide into two halves, compute the Pareto frontier of each half, then combine the two frontiers into the complete frontier. Time complexity: O(n log^(k−1) n) for n points in k-dimensional space.
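A 2-D sketch of the divide-and-conquer idea from [Ben80], minimizing both coordinates (the helper name and test points are illustrative): split on x, solve each half recursively, then keep right-half points only if their y beats every y on the left frontier, since right-half points always have larger x.

```python
# Toy 2-D Pareto frontier via divide and conquer (lower is better in x and y).
def frontier(points):
    """Non-dominated points, returned sorted by x."""
    pts = sorted(points)
    if len(pts) <= 1:
        return pts
    mid = len(pts) // 2
    left, right = frontier(pts[:mid]), frontier(pts[mid:])
    best_y = min(y for _, y in left)   # left half holds the smaller x values
    return left + [(x, y) for x, y in right if y < best_y]

pts = [(1, 9), (3, 8), (2, 7), (5, 6), (4, 4), (6, 2)]
result = frontier(pts)  # (3, 8) and (5, 6) are dominated
```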
81
Distribution of virtual channels
Solution Design Distribution of virtual channels
82
Synthetic Evaluation