Virtual Cluster Scheduling Through the Scheduling Graph
Josep M. Codina, Jesús Sánchez, Antonio González
Intel Barcelona Research Center, Intel Labs - UPC

Virtual Cluster Scheduling Through the Scheduling Graph
Josep M. Codina, Jesús Sánchez, Antonio González
Intel Barcelona Research Center, Intel Labs - UPC
CGO'07, San Jose, California - March 2007

2 Clustered Architectures
- Semiconductor technology is continuously improving
  - New technologies pack more logic into a single chip
  - Exploit more ILP: more functional units, registers, etc.
  - Faster clock cycles
- Current and future challenges in processor design
  - Delay in the transmission of signals
  - Power consumption
- Clustering: divide the system into semi-independent units
  - Each unit is a cluster
  - Fast intra-cluster interconnects
  - Slow inter-cluster interconnects
- Common trend in commercial VLIW processors
  - Equator's MAP1000, TI's TMS320C6x, ADI's TigerSHARC, HP/ST's Lx, …

3 Overview of the Architecture
[Figure: clustered VLIW processor. Labels: CLUSTER 1, CLUSTER 2, ..., CLUSTER N; INT, FP and MEM units; REGISTER FILE; DATA CACHE; register buses; MAIN MEMORY.]

4 Clustered VLIW Processors
- Performance relies on the compiler
- Code generation
  - Instruction scheduling
  - Register allocation
  - Cluster assignment: hide the delay due to inter-cluster communications
- Phase-ordering problem
  - Decisions made for one task constrain the possible decisions on the others
- Single-phase approach

5 Phase-Ordering Alternatives
- Previous work
  - First assign, then schedule
    - Accurate information about the assignment when scheduling
    - However, the schedule is constrained by the assignment
  - Instructions scheduled and assigned at the same time
    - Partially alleviates the ordering constraints
    - However, no information from one task is available when performing the other
- Our approach
  - Perform both tasks at the same time, but delay the decisions aimed at assignment
  - Accurate scheduling information when performing the final assignment
  - First, instructions are scheduled
    - A partial assignment is built from the consequences of the scheduling decisions
    - A scheduling decision that is not appropriate for assignment can be discarded
  - Then, the final assignment is performed

6 Talk Outline
- Proposed algorithm
  - Overview
  - Scheduling Graph
  - Virtual Clusters
  - Deduction Process
- Performance evaluation
- Conclusions

7 Proposal Overview
- Superblock scheduling
  - Single entry, multiple exits
- GOAL: minimize the Average Weighted Completion Time (AWCT)
  - Cycles between the entry and each exit, weighted by the exit probability
- Our scheme enumerates AWCT values
[Figure: data dependence graph with branches B0, B1, B2 and instructions I0 to I4]
Example: instructions B and I are fully pipelined; Latency(B) = 3, Latency(I) = 2; issue width: 2 I, 1 B.
Estart(B0) = 3, Estart(B1) = 6, Estart(B2) = 8: MinAWCT = 0.1 * 3 + 0.2 * 6 + 0.7 * 8 = 7.1
Estart(B0) = 3, Estart(B1) = 7, Estart(B2) = 8: AWCT = 0.1 * 3 + 0.2 * 7 + 0.7 * 8 = 7.3
Estart(B0) = 3, Estart(B1) = 7, Estart(B2) = 9: AWCT = 0.1 * 3 + 0.2 * 7 + 0.7 * 9 = 8
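The AWCT arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: the function name `awct` is hypothetical, and the exit probabilities 0.1, 0.2 and 0.7 are an assumption inferred from the three totals shown on the slide (7.1, 7.3 and 8), not values stated explicitly in the transcript.

```python
def awct(exit_cycles, exit_probs):
    """Average weighted completion time: each exit cycle weighted by its exit probability."""
    if len(exit_cycles) != len(exit_probs):
        raise ValueError("one probability per exit is required")
    return sum(cycle * prob for cycle, prob in zip(exit_cycles, exit_probs))


# Assumed exit probabilities for B0, B1, B2 (inferred from the slide's totals).
PROBS = [0.1, 0.2, 0.7]

best = awct([3, 6, 8], PROBS)    # Estart(B0)=3, Estart(B1)=6, Estart(B2)=8
second = awct([3, 7, 8], PROBS)  # B1 delayed by one cycle
third = awct([3, 7, 9], PROBS)   # B1 and B2 delayed by one cycle

print(round(best, 2), round(second, 2), round(third, 2))
```

Delaying the most likely exit (B2) costs far more AWCT than delaying B1, which is why the enumeration visits 7.1, then 7.3, then 8.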

8 Proposal Overview
- Superblock scheduling
  - Single entry, multiple exits
- GOAL: minimize the Average Weighted Completion Time (AWCT)
  - Cycles between the entry and each exit, weighted by the exit probability
- Our scheme enumerates AWCT values
- Single-phase approach to scheduling and cluster assignment
  - Delays the cluster assignment decisions
  - More scheduling information is available when making assignment decisions
  - The impact of scheduling on assignment is discovered and managed
- Main ingredients
  1. Scheduling Graph: describes all possible schedules
  2. Virtual Clusters: enable delaying the cluster assignment by keeping a partial assignment
  3. Deduction Process: discovers most of the consequences of any decision made

9 Ingredient 1: Scheduling Graph
- Describes all possible schedules
- Contains all feasible combinations between instruction pairs that may overlap
[Figure: possible overlaps of a B and an I instruction, assuming B < I]
- Whether a combination is feasible depends on
  - Dependences
  - Resources
  - For a particular AWCT, estart and lstart
- Undirected graph
  - Same nodes as the DDG
  - An edge (v, w) means that the execution of v and w can overlap
  - The label on every edge is the set of combinations
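One way to read the edge labels is as relative cycle offsets between two instructions whose executions overlap. The sketch below works under that assumption: a combination is taken to be d = cycle(v) - cycle(w), and the function names and the dictionary layout are hypothetical. Dependences and resource limits, which the slide also lists as feasibility criteria, are left out for brevity.

```python
def overlap_offsets(lat_v, lat_w):
    """Offsets d = cycle(v) - cycle(w) for which the executions of v and w overlap."""
    # v occupies [cv, cv + lat_v) and w occupies [cw, cw + lat_w);
    # the intervals intersect exactly when -lat_v < cv - cw < lat_w.
    return list(range(-lat_v + 1, lat_w))


def feasible_combinations(v, w):
    """Keep only the offsets reachable inside both [estart, lstart] windows."""
    lowest = v["estart"] - w["lstart"]
    highest = v["lstart"] - w["estart"]
    return [d for d in overlap_offsets(v["lat"], w["lat"]) if lowest <= d <= highest]


# Latencies from the slides: Latency(B) = 3, Latency(I) = 2.
branch_vs_inst = overlap_offsets(3, 2)
inst_vs_inst = overlap_offsets(2, 2)

# Hypothetical scheduling windows for one B and one I instruction.
b = {"lat": 3, "estart": 3, "lstart": 4}
i = {"lat": 2, "estart": 4, "lstart": 6}
label = feasible_combinations(b, i)

print(branch_vs_inst, inst_vs_inst, label)
```

Under this reading, tightening estart and lstart for a given AWCT shrinks the label on an edge, and an edge whose label becomes empty disappears from the graph.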

10 Scheduling Based on SG
- Choose some combinations while discarding others
- Chosen combinations create complex instructions
- Schedule each complex instruction in a cycle
[Figure: data dependence graph and scheduling graph over branches B0, B1, B2 and instructions I0 to I4, followed by the resulting complex instructions]
Instructions B and I are fully pipelined; Latency(B) = 3, Latency(I) = 2; issue width: 2 I, 1 B.

Edges     Comb
1, 2      -1, 0, 1
3, 4, 6   -2, -1, 0, 1
5, 7      -2, -1

Cyc  FU1  FU2  Br
0    I0
1
2    I1   I2
3
4    I3        B0
5
6              B1
7    I4
8
9              B2

11 Ingredient 2: Virtual Clusters
- Virtual cluster (VC)
  - Set of instructions to be mapped into the same physical cluster
  - Multiple virtual clusters can be mapped into the same physical cluster
  - However, not every pair of virtual clusters can share a physical cluster
    - There may not be enough resources to accommodate both VCs in the same physical cluster
- VCG: undirected graph
  - Each node is a virtual cluster
  - When an edge (VC1, VC2) exists, VC1 and VC2 are incompatible
    - VC1 and VC2 must be mapped into different physical clusters
- The VCG is managed by the deduction process
  - Clusters are fused
  - Clusters become incompatible
  - Communications are added
    - When a producer-consumer pair belongs to incompatible clusters
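The three VCG operations named on this slide (fusing clusters, marking clusters incompatible, and detecting a required communication) can be sketched with a union-find structure plus a set of incompatibility edges. The class and method names are hypothetical, and union-find is a choice made for this illustration; the slides do not say how the authors implement the graph.

```python
class VirtualClusterGraph:
    """Toy VCG: nodes are virtual clusters, edges mean 'must use different physical clusters'."""

    def __init__(self, instructions):
        # Initially each instruction is alone in its own virtual cluster.
        self.parent = {inst: inst for inst in instructions}
        self.incompatible = set()  # each edge is a frozenset of two VC representatives

    def find(self, inst):
        """Return the representative of the virtual cluster containing inst."""
        while self.parent[inst] != inst:
            self.parent[inst] = self.parent[self.parent[inst]]  # path halving
            inst = self.parent[inst]
        return inst

    def are_incompatible(self, a, b):
        return frozenset((self.find(a), self.find(b))) in self.incompatible

    def mark_incompatible(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            raise ValueError("instructions already share a virtual cluster")
        self.incompatible.add(frozenset((ra, rb)))

    def fuse(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if frozenset((ra, rb)) in self.incompatible:
            raise ValueError("cannot fuse incompatible virtual clusters")
        self.parent[rb] = ra
        # Edges that pointed at the absorbed cluster now point at the fused one.
        self.incompatible = {
            frozenset(ra if rep == rb else rep for rep in edge)
            for edge in self.incompatible
        }

    def needs_communication(self, producer, consumer):
        """A producer-consumer pair in incompatible clusters needs a communication."""
        return self.are_incompatible(producer, consumer)


vcg = VirtualClusterGraph(["I0", "I1", "I2"])
vcg.mark_incompatible("I0", "I1")
vcg.fuse("I1", "I2")
print(vcg.needs_communication("I2", "I0"))
```

Because I2 was fused with I1, it inherits the incompatibility with I0, so a value flowing from I2 to I0 has to cross clusters.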

12 Ingredient 3: Deduction Process
- Every decision considered is submitted to the deduction process
  - Discovers most of the consequences of any decision
  - Improves the knowledge available to make appropriate decisions
  - Anticipates invalid decisions
  - Avoids invalid schedules in advance
- Process based on rules
  - Interaction between resources and dependences
  - Cluster assignment
- A rule
  - Takes a decision or a change of the state as input
  - Examines the current state
  - Concludes mandatory changes to apply to the state
[Figure: a decision and the scheduling state enter the deduction process, which produces a new scheduling state]
[Figure: example with instructions I0, I1, I2 and virtual clusters VC1 and VC2. The rule concludes that a communication is required, either between I1 and I0 or between I2 and I0.]

13 Ingredient 3: Deduction Process
- Every decision considered is submitted to the deduction process
  - Discovers most of the consequences of any decision
  - Improves the knowledge available to make appropriate decisions
  - Anticipates invalid decisions
  - Avoids invalid schedules in advance
- Process based on rules
  - Interaction between resources and dependences
  - Cluster assignment
- A rule
  - Takes a decision or a change of the state as input
  - Examines the current state
  - Concludes mandatory changes to apply to the state
- Changes feed back into the process
  - Consequences of consequences are discovered
  - The process finishes when there is no change left to be treated
[Figure: a decision and the scheduling state enter the deduction process, which produces a new scheduling state]
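The feedback loop described here (rules produce changes, changes are fed back, and the process stops when nothing is left) is a worklist fixpoint. A minimal sketch follows, assuming a rule is a function that returns a list of new changes or `None` when it detects an invalid state. The names `deduce` and `propagate_estart` and the state layout are hypothetical; the only rule shown is a simple estart propagation along dependences, not the authors' rule set.

```python
from collections import deque


def deduce(state, initial_changes, rules):
    """Apply rules until no change is pending; return False if a rule finds a contradiction."""
    pending = deque(initial_changes)
    while pending:
        change = pending.popleft()
        for rule in rules:
            consequences = rule(state, change)
            if consequences is None:  # the decision leads to an invalid state
                return False
            pending.extend(consequences)  # consequences of consequences
    return True


def propagate_estart(state, change):
    """Example rule: a later estart of a producer pushes the estart of its successors."""
    kind, node = change
    if kind != "estart":
        return []
    new_changes = []
    for succ, latency in state["succs"].get(node, []):
        bound = state["estart"][node] + latency
        if bound > state["lstart"][succ]:
            return None  # the successor can no longer meet its latest start
        if bound > state["estart"][succ]:
            state["estart"][succ] = bound
            new_changes.append(("estart", succ))
    return new_changes


def make_state(lstart_b0):
    # Hypothetical chain I0 -> I1 -> B0 with latency 2 on each dependence.
    return {
        "succs": {"I0": [("I1", 2)], "I1": [("B0", 2)]},
        "estart": {"I0": 0, "I1": 2, "B0": 4},
        "lstart": {"I0": 10, "I1": 10, "B0": lstart_b0},
    }


relaxed = make_state(lstart_b0=10)
relaxed["estart"]["I0"] = 1  # decision under study: start I0 one cycle later
ok_relaxed = deduce(relaxed, [("estart", "I0")], [propagate_estart])

tight = make_state(lstart_b0=4)
tight["estart"]["I0"] = 1
ok_tight = deduce(tight, [("estart", "I0")], [propagate_estart])

print(ok_relaxed, relaxed["estart"], ok_tight)
```

The sketch mutates the state in place, so a caller that wants to discard a rejected decision would run it on a copy of the state.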

14 Algorithm Overview
[Flowchart boxes: DDG, Compute Scheduling Graph, Compute Virtual Clusters Graph, Compute minAWCT, Set AWCT = minAWCT, Set Scheduling State for AWCT, Find a Schedule for AWCT, Valid Schedule (YES/NO), Increase AWCT, Deduction Process]
- Compute SG
  - Dependences
  - Resources

15 Algorithm Overview
[Flowchart repeated from slide 14]
- Compute VCG
  - Each instruction has its own VC

16 Algorithm Overview
[Flowchart repeated from slide 14]
- Set Scheduling State
  - The AWCT constrains the cycles in which instructions can be scheduled, and therefore the SG
  - The DP is used to obtain an accurate initial state
- Enumerate AWCT
  - minAWCT is enhanced through the DP

17 Algorithm Overview
[Flowchart repeated from slide 14]
- Find a Schedule
  - Select candidates
  - Study each candidate
  - Take a decision on a candidate
- Candidates
  1. Combination
  2. Complex instruction
  3. Pair of virtual clusters
- The DP provides knowledge of the consequences of a candidate
- Simple, widely used heuristics select among the candidates based on the outcome of the DP
  - Number of communications
  - Compact code
- The success of the decision making relies on the DP

18 Algorithm Overview
[Flowchart repeated from slide 14]
- A schedule is valid if
  - All virtual clusters have been mapped
  - All combinations have been chosen or discarded
  - All instructions have been scheduled in one cycle
  - A combination has been chosen for all pairs of overlapping instructions

19 Algorithm Overview
[Flowchart repeated from slide 14]
- Increase AWCT
  - The next valid AWCT value is considered
- Enumerate AWCT
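The outer loop of the algorithm overview (start from minAWCT, try to find a valid schedule, otherwise move to the next AWCT value) can be sketched as a small driver. The function and parameter names are hypothetical, and `find_schedule` and `next_awct` are stand-ins supplied by the caller; the toy callables below only illustrate the control flow and reuse the AWCT values from the slide 7 example.

```python
def schedule_superblock(min_awct, next_awct, find_schedule, max_tries=1000):
    """Enumerate AWCT values in increasing order until a valid schedule is found."""
    awct = min_awct
    for _ in range(max_tries):
        schedule = find_schedule(awct)  # returns None when no valid schedule exists
        if schedule is not None:
            return awct, schedule
        awct = next_awct(awct)  # the next valid AWCT value is considered
    raise RuntimeError("no valid schedule found within max_tries")


# Toy stand-ins: three candidate AWCT values, the first of which is infeasible.
candidates = [7.1, 7.3, 8.0]


def toy_next_awct(current):
    return candidates[candidates.index(current) + 1]


def toy_find_schedule(target):
    return {"awct": target} if target >= 7.3 else None


found_awct, found_schedule = schedule_superblock(7.1, toy_next_awct, toy_find_schedule)
print(found_awct, found_schedule)
```

Because AWCT values are visited in increasing order, the first valid schedule found is also the best one the search can produce.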

20 Experimental Environment
- CARS
  - Single-phase approach
  - List scheduling that gives priority to instructions on the critical path of the DG
  - Schedules and assigns instructions at the same time
  - For each instruction
    1. The scheduling cycle for each cluster is computed
    2. The cluster that allows the instruction to be scheduled in the earliest cycle is selected
    3. The instruction is assigned and scheduled in the selected cluster
- In contrast to our approach
  - It does not study the consequences before making a decision
  - It simply updates the estart of all successors in the scheduling state as a consequence of a decision
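The three per-instruction steps of the baseline can be illustrated with a greedy loop. This is a simplified sketch of the steps listed on the slide, not the CARS implementation: it assumes one issue slot per cluster per cycle, a fixed inter-cluster latency with no bus contention, and an instruction order already sorted by priority. All names are hypothetical.

```python
def greedy_assign_and_schedule(order, preds, latency, n_clusters, bus_latency):
    """Assign and schedule each instruction in the cluster that allows the earliest cycle."""
    cycle, cluster = {}, {}
    busy = [set() for _ in range(n_clusters)]  # cycles already used in each cluster
    for inst in order:
        best = None
        for c in range(n_clusters):
            # Step 1: earliest cycle in cluster c, given where the producers were placed.
            ready = 0
            for producer in preds.get(inst, []):
                arrival = cycle[producer] + latency[producer]
                if cluster[producer] != c:
                    arrival += bus_latency  # value must cross clusters
                ready = max(ready, arrival)
            while ready in busy[c]:
                ready += 1
            # Step 2: keep the cluster with the earliest cycle (ties go to the lower index).
            if best is None or ready < best[0]:
                best = (ready, c)
        # Step 3: the instruction is assigned and scheduled in the selected cluster.
        cycle[inst], cluster[inst] = best
        busy[best[1]].add(best[0])
    return cycle, cluster


# Hypothetical example: I2 consumes the results of I0 and I1.
cycles, clusters = greedy_assign_and_schedule(
    order=["I0", "I1", "I2"],
    preds={"I2": ["I0", "I1"]},
    latency={"I0": 2, "I1": 2, "I2": 2},
    n_clusters=2,
    bus_latency=1,
)
print(cycles, clusters)
```

In the example, spreading I0 and I1 across two clusters looks best when each is placed, but it forces a communication before I2 can start. A decision made without studying its consequences in this way is the behaviour the slide contrasts with the deduction process.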

21 Experimental Environment
- IMPACT compiler
- Profiling information on
  - The superblock exit probabilities
  - The execution frequency of each superblock
- Configurations: three different ones
  - 2 clusters, 1 interconnect bus with 1-cycle latency
  - 4 clusters, 1 interconnect bus with 1-cycle latency
  - 4 clusters, 1 interconnect bus with 2-cycle latency
  - Each cluster can execute 1 Int, 1 FP, 1 Mem and 1 Branch
  - Perfect memory
  - Unconstrained number of registers
- Benchmarks: 7 SpecInt95 and 7 MediaBench

22 Performance Results
- Our approach performs better than CARS for all benchmarks and configurations
- Similar trends when comparing the speedups obtained with SpecInt and MediaBench
- The more aggressive the architecture, the higher the benefits of our approach
  - Especially when exploiting the resources involves extra complexity (e.g. bus latency 2)

23 Conclusions
- Single-phase scheduling and cluster assignment
  - Delaying the cluster assignment
- Key features
  - Scheduling Graphs
  - Virtual Clusters
  - Deduction Process
- Our approach, applied to superblocks, performs better than CARS
  - Average speedup close to 10% for 4 clusters and 1 bus with latency 2
  - Up to 14% for some programs
- Improvements come from
  - More information on the effects of all decisions made
  - Reducing the probability of making erroneous decisions
  - Allowing a better interaction between scheduling and assignment
