CAPS team: Compilers and Architectures for Superscalar and Embedded Processors

CAPS project 2: CAPS members
- 2 INRIA researchers: A. Seznec, P. Michaud
- 2 professors: F. Bodin, J. Lenfant
- 11 Ph.D. students: R. Amicel, R. Dolbeau, A. Monsifrot, L. Bertaux, K. Heydemann, L. Morin, G. Pokam, A. Djabelkhir, A. Fraboulet, O. Rochecouste, E. Toullec
- 3 engineers: S. Bihan, P. Villalon, J. Simonnet

CAPS project 3: CAPS themes
- Two interacting activities:
  - high-performance microprocessor architecture
  - performance-oriented compilation

CAPS project 4: The CAPS grail
- Performance at the best cost
- Progress in computer science and in applications is driven by performance

CAPS project 5: The CAPS path to the grail
- Defining the tradeoffs between:
  - what should be done through hardware
  - what can be done by the compiler
- for maximum performance
- or for minimum cost
- or for minimum size, power, ...

CAPS project 6: The need for high-performance processors
- Current applications:
  - general purpose: scientific computing, multimedia, databases, ...
  - embedded systems: cell phones, automotive, set-top boxes, ...
- Future applications:
  - don't worry: users have a lot of imagination!
- New software engineering techniques are CPU hungry:
  - reusability, generality
  - portability, extensibility (indirections, virtual machines)
  - safety (run-time verifications)
  - encryption/decryption

CAPS project 7: CAPS "ancient" background
- "Ancient" background in hardware and software management of ILP:
  - decoupled pipeline architectures
  - OPAC, a hardware matrix floating-point coprocessor
  - software pipelining for LIW processors
- "Supercomputing" background:
  - interleaved memories
  - Fortran-S

CAPS project 8: CAPS background in architecture
- Solid knowledge of microprocessor architecture:
  - technology watch on microprocessors
  - A. Seznec worked with the Alpha Development Group
- Research in cache architecture
- Research in branch prediction mechanisms

CAPS project 9: CAPS background in compilers
- Software optimizations for cache memories:
  - numerical algorithms on dense structures
  - optimizing data layout
- Many prototype environments for parallel compilers:
  - CT++ (with CEA): image-processing C++ library for a SIMD architecture
  - Menhir: a parallel compiler for MATLAB
  - IPF (with Thomson-LER): Fortran compiler for image processing on the MasPar
  - Sage (with Indiana): infrastructure for source-level transformation

CAPS project 10: We build on
- SALTO: System for Assembly-Language Transformations and Optimizations
  - retargetable assembly source-to-source preprocessor
  - Erven Rohou's Ph.D.
- TSF:
  - scripting language for program transformations on top of ForeSys (Simulog)
  - Yann Mevel's Ph.D.

CAPS project 11: SALTO overview
- Assembly source-to-source preprocessor (a toy sketch of the concept follows below)
- Fine-grain machine description
- Independent from compilers
[Diagram: a transformation tool is built on the SALTO C++ interface, which is driven by a machine description and operates on assembly language]
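To make the source-to-source idea concrete, the toy pass below is driven by a small table that stands in for a machine description and annotates each recognized instruction with its latency. This is only an illustration of the concept; it is not SALTO's actual C++ API, and the mnemonics and latencies in the table are invented.

    /* Toy illustration of a machine-description-driven assembly pass:
     * a small table stands in for the machine description, and the
     * pass annotates each recognized mnemonic with its latency.
     * This is NOT the SALTO API; it only illustrates the idea. */
    #include <stdio.h>
    #include <string.h>

    struct insn_desc { const char *mnemonic; int latency; };

    /* Miniature "machine description" (invented values). */
    static const struct insn_desc md[] = {
        { "add", 1 }, { "mul", 3 }, { "ld", 2 }, { "st", 1 },
    };

    static int lookup_latency(const char *mnemonic)
    {
        for (size_t i = 0; i < sizeof(md) / sizeof(md[0]); i++)
            if (strcmp(md[i].mnemonic, mnemonic) == 0)
                return md[i].latency;
        return -1;
    }

    int main(void)
    {
        char line[256];
        /* Read assembly text on stdin, echo it annotated on stdout. */
        while (fgets(line, sizeof(line), stdin)) {
            char mnem[32];
            line[strcspn(line, "\n")] = '\0';
            if (sscanf(line, "%31s", mnem) == 1) {
                int lat = lookup_latency(mnem);
                if (lat > 0) {
                    printf("%s\t; latency %d\n", line, lat);
                    continue;
                }
            }
            printf("%s\n", line);
        }
        return 0;
    }

A real SALTO pass manipulates a structured program representation produced from the machine description rather than raw text.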

CAPS project 12: Compiler activities
- Code optimizations for embedded applications:
  - infrastructures rather than compilers
  - optimizing compilation strategies rather than new code optimizations
- Global constraints:
  - performance / code size / low power (starting)
- Focus on interactive rather than fully automatic tools:
  - code tuning
  - case-based reasoning
  - assembly code optimizations

CAPS project 13: Computer-aided hand tuning
- Automatic optimization has many shortcomings:
  - instead, provide the user with a testbed to hand-tune applications
- Target applications:
  - Fortran codes and embedded C applications
- Our approach:
  - case-based reasoning
  - static code analysis and pattern matching
  - profiling
  - learning techniques
  - the user remains ultimately responsible

CAPS project 14: CAHT prototype, built on
- Foresys: interactive Fortran front-end (from Simulog)
- TSF: scripting language for program transformations
- Sage++: infrastructure for source-level transformation

CAPS project 15: ATLLAS: Analysis and Tuning tool for Low-Level Assembly and Source code (with Thomson Multimedia)
- ATLLAS objectives:
  - Has the compiler done a good job?
  - Try to match source and optimized assembly at a fine grain
- Development/analysis environment:
  - models for both source and assembly
  - global and local analyses (WCET, ...) at both levels
  - interactive environment for code visualization and manual/automatic analysis and optimization
- Built using SALTO and Sage++:
  - retargetable across compilers and architectures

CAPS project 16: ATLLAS tuning method
[Flowchart: the source code is compiled and profiled; ATLLAS matches source and assembly code, runs analyses and evaluations, and displays both graphically; the user then applies semi-automatic or manual optimizations at the source or assembly level, iterating until the result is good enough]

CAPS project 17: ALISE: Assembly-Level Infrastructure for Software Enhancement (with STMicroelectronics)
- ALISE:
  - enhanced SALTO for code optimization:
    - better integration with code generation: interface with the front-end, interface for profiling data
    - targets global optimization based on component software optimization engines
- Answers a real need from industry:
  - a retargetable infrastructure

CAPS project 18: ALISE
- Environment for:
  - global assembly code optimization
  - providing optimization alternatives
- Support for new embedded processors:
  - ISAs with ILP support (VLIW, EPIC)
  - predicated instructions
  - functional unit clusters, ...

CAPS project 19: ALISE architecture
[Diagram: an architecture description is translated into an architecture model; program text is parsed into an intermediate representation (IR); optimization components (Opt 1 ... Opt n) operate on the IR through a high-level API; the IR is then emitted as assembly to produce the optimized program; additional interfaces connect to external infrastructures and to a graphical user interface]

CAPS project 20: Preprocessor for media processors (MEDEA+ Mesa project)
- Multimedia instructions exist on embedded and general-purpose processors, but:
  - there is no consensus on multimedia instructions among manufacturers: saturated arithmetic or not, different instructions, ...
  - multimedia instructions are not well handled by compilers, yet performance depends heavily on them

CAPS project 21: Preprocessor for media processors: our approach
- C source-to-source preprocessor
- User-oriented idiom recognition (an example idiom is sketched below):
  - easy to retarget
  - target-dedicated recognition
- Exploiting loop parallelism:
  - vectorization techniques
  - multiprocessor systems
- Available soon
- Collaboration with STMicroelectronics
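As an example of the kind of idiom such a preprocessor looks for, the plain C loop below implements an 8-bit saturating add with an explicit clamp; a target-dedicated recognizer could replace it with a single SIMD saturating-add operation where the ISA provides one. The function and array names are made up for the illustration.

    /* A typical saturated-add idiom that a media preprocessor could
     * recognize and map to a SIMD saturating-add instruction.
     * Names and sizes are hypothetical. */
    #include <stdio.h>
    #include <stdint.h>

    #define N 16

    /* Plain C idiom: add with saturation at 255. */
    static void saturated_add_u8(const uint8_t *a, const uint8_t *b,
                                 uint8_t *out, int n)
    {
        for (int i = 0; i < n; i++) {
            int s = a[i] + b[i];          /* widen to avoid wraparound   */
            out[i] = (s > 255) ? 255 : s; /* clamp: the saturation part  */
        }
    }

    int main(void)
    {
        uint8_t a[N], b[N], r[N];
        for (int i = 0; i < N; i++) { a[i] = 250; b[i] = (uint8_t)i; }
        saturated_add_u8(a, b, r, N);
        for (int i = 0; i < N; i++)
            printf("%d ", r[i]);
        printf("\n");
        return 0;
    }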

CAPS project 22: Iterative compilation
- Embedded systems:
  - compile time is not critical
  - performance / code size / power are critical
  - one can often rely on profiling
- Classical compilers apply local optimizations, but the constraints are GLOBAL (a driver sketch follows below)
- Proof of concept for code size (Rohou's Ph.D.)
- New Ph.D. starting in September 2000
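An iterative compilation process of this kind can be driven by a very simple loop: compile with several option sets, measure the global metric of interest (here, object-file size), and keep the best variant. The sketch below is a minimal illustration under assumed names (cc as the compiler, kernel.c as the input); it is not the CAPS infrastructure, and a real driver would also fold in profiled execution time and power estimates.

    /* Minimal sketch of an iterative-compilation driver: try several
     * optimization flag sets and keep the one yielding the smallest
     * object file. "cc", "kernel.c" and the flag sets are assumptions. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    int main(void)
    {
        const char *flags[] = { "-O1", "-O2", "-O3", "-Os" };
        const int nflags = sizeof(flags) / sizeof(flags[0]);
        long best_size = -1;
        int best = -1;

        for (int i = 0; i < nflags; i++) {
            char cmd[256];
            snprintf(cmd, sizeof(cmd),
                     "cc %s -c kernel.c -o kernel.o", flags[i]);
            if (system(cmd) != 0)        /* skip variants that fail */
                continue;

            struct stat st;
            if (stat("kernel.o", &st) != 0)
                continue;

            printf("%-4s -> %ld bytes\n", flags[i], (long)st.st_size);
            if (best_size < 0 || st.st_size < best_size) {
                best_size = st.st_size;
                best = i;
            }
        }

        if (best >= 0)
            printf("best flag set: %s (%ld bytes)\n", flags[best], best_size);
        return 0;
    }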

CAPS project 23: High-performance instruction set simulation
- Embedded processors:
  - parallel development of silicon, ISA, compiler and applications
- Need for flexible instruction set simulation:
  - high performance
  - simulation of large codes
  - debugging
  - retargetable, to experiment with new ISAs and various microarchitecture options
- First results: up to 50x faster than an ad hoc simulator

CAPS project 24: ABSCISS: Assembly-Based System for Compiled Instruction Set Simulation
[Tool flow: C source is compiled by tmcc into TriMedia assembly; tmas assembles it into a TriMedia binary for the tmsim reference simulator; in parallel, ABSCISS takes the TriMedia assembly and an architecture description and generates C/C++ source, which gcc compiles into a compiled simulator (a toy generated-code sketch follows below)]
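The speed advantage of compiled instruction set simulation comes from translating the target program once, ahead of time, into host code, instead of fetching and decoding every instruction at simulation time. The toy example below is not ABSCISS output; the three "target instructions" and the register file are invented to show what generated simulator code for one basic block can look like.

    /* Toy illustration of compiled simulation: a short target sequence
     *     r1 = r2 + r3 ; r4 = r1 << 2 ; store r4 -> mem[r5]
     * is translated once into straight-line C, instead of being
     * decoded at every simulated execution. State is invented. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t reg[8];
    static uint32_t mem[64];

    /* Generated code for one translated basic block. */
    static void simulate_block0(void)
    {
        reg[1] = reg[2] + reg[3];          /* add  r1, r2, r3 */
        reg[4] = reg[1] << 2;              /* shl  r4, r1, #2 */
        mem[reg[5] % 64] = reg[4];         /* st   r4, [r5]   */
    }

    int main(void)
    {
        reg[2] = 10; reg[3] = 5; reg[5] = 8;
        simulate_block0();
        printf("mem[8] = %u\n", mem[8]);   /* expect (10+5)<<2 = 60 */
        return 0;
    }

Because the translated block is ordinary C, the host compiler can optimize across the simulated instructions, which is where most of the speedup comes from.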

CAPS project 25: Enabling superscalar processor simulation
- Complete out-of-order microprocessor simulation:
  - is much slower than real hardware
  - cannot simulate realistic applications in full, only slices
  - even fast-mode emulation is slow (50-100x): simulation is generally limited to slices at the beginning of the application; how representative are they?
- Calvin2 + DICE:
  - combines direct execution with simulation
  - really fast mode: 1-2x slowdown
  - enables simulating slices distributed over the whole application

CAPS project 26 DICE Host ISA Emulator User analysis routines Calvin2 + DICE Original code SPARC V9 assembly code calvin2 Static Code Annotation Tool checkpoint Switching event Emulation mode Switching event

CAPS project 27: Moving tools to IA-64
- A new 64-bit ISA from Intel/HP:
  - Explicitly Parallel Instruction Computing
  - predicated execution (a small C illustration follows below)
  - advanced loads (i.e. speculative loads)
  - a very interesting platform for research!
- Porting SALTO and the Calvin2+DICE approach to IA-64
- Exploring the new tradeoffs enabled by the instruction set:
  - predicting the predicates?
  - advanced loads versus predicting dependencies
  - ultimate out-of-order execution versus the compiler
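The following plain C fragment illustrates what predicated execution buys: a data-dependent, hard-to-predict branch (first version) versus a branch-free form (second version) that a predicating compiler can map onto predicated instructions or conditional moves. No claim is made about the exact IA-64 code the ported tools would produce.

    /* Branchy vs. branch-free forms of the same loop body. */
    #include <stdio.h>

    /* Branchy version: one hard-to-predict branch per element. */
    static int count_clamp_branchy(int *a, int n, int limit)
    {
        int clamped = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] > limit) {
                a[i] = limit;
                clamped++;
            }
        }
        return clamped;
    }

    /* Branch-free version: both outcomes are computed and selected,
     * which is essentially what predication does in hardware. */
    static int count_clamp_predicated(int *a, int n, int limit)
    {
        int clamped = 0;
        for (int i = 0; i < n; i++) {
            int over = a[i] > limit;       /* the "predicate" */
            a[i] = over ? limit : a[i];
            clamped += over;
        }
        return clamped;
    }

    int main(void)
    {
        int x[] = { 3, 9, 1, 12, 7 }, y[] = { 3, 9, 1, 12, 7 };
        printf("%d %d\n",
               count_clamp_branchy(x, 5, 6),
               count_clamp_predicated(y, 5, 6));   /* both print 2 */
        return 0;
    }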

CAPS project 28: Low power, compilation, architecture, ... (just beginning :-)
- Power consumption is becoming a major issue:
  - embedded and general purpose
- Compilation (setting up a collaboration with STMicroelectronics/Stanford/Milan):
  - Is it different from performance optimization?
  - global constraint optimization
  - instruction set architecture support?
- Architecture:
  - high-order bits are generally null, ...
  - registers and memory
  - ALUs

CAPS project 29: Caches and branch predictors
- International visibility of CAPS in architecture:
  - the skewed-associative cache (an index-skewing sketch follows below)
  - the decoupled sectored cache
  - multiple-block-ahead branch prediction
  - the skewed branch predictor
- Continuing recurrent work on these topics:
  - multiple-block-ahead prediction, and complexity/accuracy tradeoffs
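The core idea of the skewed-associative cache is that each way indexes its sets with a different function of the address, so two blocks that conflict in one way rarely conflict in the others. The sketch below uses simple XOR-based mappings chosen for illustration; they are not the exact functions from the CAPS publications.

    /* Sketch of skewed-associative indexing: each way uses a different
     * mapping from address to set. The XOR-based mappings below are
     * illustrative choices only. */
    #include <stdio.h>
    #include <stdint.h>

    #define SETS_PER_WAY 256            /* 8 index bits per way */
    #define BLOCK_BITS   6              /* 64-byte blocks       */

    /* Way 0: classical modulo indexing. */
    static unsigned index_way0(uint32_t addr)
    {
        return (addr >> BLOCK_BITS) & (SETS_PER_WAY - 1);
    }

    /* Way 1: skewed indexing, mixing in higher address bits. */
    static unsigned index_way1(uint32_t addr)
    {
        uint32_t lo = (addr >> BLOCK_BITS) & (SETS_PER_WAY - 1);
        uint32_t hi = (addr >> (BLOCK_BITS + 8)) & (SETS_PER_WAY - 1);
        return lo ^ hi;                 /* a different permutation per way */
    }

    int main(void)
    {
        /* Two addresses 256 blocks apart: same set in way 0,
         * different sets in way 1. */
        uint32_t a = 0x00010000, b = a + (SETS_PER_WAY << BLOCK_BITS);
        printf("way0: %u vs %u\n", index_way0(a), index_way0(b));
        printf("way1: %u vs %u\n", index_way1(a), index_way1(b));
        return 0;
    }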

CAPS project 30: Simultaneous multithreading
- Sharing the functional units among several processes
- Among the first groups working on this topic:
  - S. Hily's Ph.D.
  - SMT behavior is well understood for independent threads
  - the focus is now on parallel threads from a single application
- Current research directions:
  - speculative multithreading: ultimate performance with a single thread through thread prediction
  - performance/complexity tradeoffs: SMT / CMP / hybrid

CAPS project 31: "Enlarging" the instruction window (supported by Intel)
- In an out-of-order processor, fireable instructions are chosen within a window of a few tens of RISC-like instructions.
- The limitations are:
  - the size of the window
  - the number of physical registers
- Prescheduling:
  - separate data-flow scheduling from resource arbitration
  - coarser units of work?
- Reducing the number of physical registers:
  - how to detect when a physical register is dead? (a reference-counting sketch follows below)
  - per-group validation? revisiting the CISC/RISC war?
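To make the dead-register question concrete, one classical bookkeeping scheme (shown purely as an illustration, not as the CAPS proposal) reference-counts the consumers of each physical register: the register can be recycled once every in-flight reader has picked up its value and the architectural register has been remapped by a retired instruction.

    /* Illustration of one textbook-style way to detect dead physical
     * registers: reference counting of pending readers plus a
     * "remapped" flag set when the redefining instruction retires. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NPREGS 8

    struct preg {
        int  readers;      /* in-flight instructions still to read it */
        bool remapped;     /* architectural register redefined and the
                              redefining instruction has retired       */
        bool free;
    };

    static struct preg pr[NPREGS];

    static bool is_dead(int p)
    {
        return pr[p].remapped && pr[p].readers == 0;
    }

    static void try_release(int p)
    {
        if (!pr[p].free && is_dead(p)) {
            pr[p].free = true;
            printf("p%d released\n", p);
        }
    }

    int main(void)
    {
        /* p3 holds a value with two pending readers. */
        pr[3].readers = 2;

        pr[3].readers--; try_release(3);       /* first reader issues      */
        pr[3].remapped = true; try_release(3); /* new mapping retires      */
        pr[3].readers--; try_release(3);       /* last reader issues: dead */
        return 0;
    }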

CAPS project 32: An unwritten rule of superscalar processor design
- For general-purpose registers: any physical register can be the source or the result of any instruction executed on any functional unit.

CAPS project 33: 4-cluster WSRS architecture (supported by Intel)
[Diagram: four clusters C0-C3, each paired with a register-file slice S0-S3]
- Half the read ports, one fourth of the write ports
- Register file: silicon area x 1/8, power x 1/2, access time x 0.6 (a back-of-the-envelope check follows below)
- Additional gains on the bypass network and the selection logic
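The register-file figures can be sanity-checked with the usual first-order model in which the area of a bit cell grows roughly quadratically with its number of ports, since each port adds a word line and a bit line. The numbers below assume a baseline of 8 read and 4 write ports (an assumption, not taken from the slide) and only estimate the per-cell ratio; the total area also depends on how the registers are partitioned or replicated across the four clusters.

    % Back-of-the-envelope check, assuming a baseline of r = 8 read
    % and w = 4 write ports (assumed, not from the slide):
    \[
      \frac{A_{\text{cell, cluster}}}{A_{\text{cell, baseline}}}
      \;\approx\;
      \left(\frac{r/2 + w/4}{r + w}\right)^{2}
      \;=\;
      \left(\frac{4 + 1}{12}\right)^{2}
      \;\approx\; 0.17
    \]

This is in the same range as the "area x 1/8" claim, while power and access time improve less because they scale closer to linearly with wire length and switched capacitance.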

CAPS project 34: Multiprocessor on a chip
- Not just replicating board-level solutions!
- A way to manage a large on-chip cache capacity:
  - how can a sequential application efficiently use a distributed cache?
  - architectural support for distributing a sequential application over several processors?
  - how should instructions and data be distributed?

CAPS project 35: HIPSOR: HIgh Performance SOftware Random number generation
- Need for unpredictable random number generation:
  - sequences that cannot be reproduced
- State of the art:
  - < 100 bit/s using the operating system
  - 75 Kbit/s using the hardware generator of the Pentium III
- The internal state of a superscalar processor cannot be reproduced:
  - use this state to generate unpredictable random numbers

CAPS project 36: HIPSOR (2)
- Thousands of unmonitorable internal states, modified by OS interrupts
- The hardware clock counter is used to probe these states indirectly (see the sketch below)
- Combined with inline pseudo-random number generation
- 100 Mbit/s of unpredictable random numbers
- ARC INRIA with the CODES project-team
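The principle can be sketched in a few lines of C: read the hardware cycle counter around operations whose timing depends on cache and branch-predictor state (state that interrupts keep perturbing), and fold the observed jitter into a simple pseudo-random walk. The code below is only a simplified illustration of that idea, not the HIPSOR generator; it assumes an x86 machine with GCC or Clang for __rdtsc().

    /* Simplified sketch of the entropy-gathering principle: the cycle
     * counter indirectly observes cache and branch-predictor state,
     * and its low-order jitter is folded into a pseudo-random walk. */
    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    #define TABLE_SIZE 4096

    static uint32_t table[TABLE_SIZE];   /* walked to stress the caches */

    static uint32_t gather_word(void)
    {
        static uint32_t state = 0x9e3779b9u;
        uint32_t out = 0;

        for (int bit = 0; bit < 32; bit++) {
            uint64_t t0 = __rdtsc();
            /* Data-dependent walk: timing depends on microarchitectural state. */
            state = state * 1664525u + 1013904223u;
            table[state % TABLE_SIZE] += (uint32_t)t0;
            uint64_t dt = __rdtsc() - t0;

            state ^= (uint32_t)dt;               /* inject timing jitter */
            out = (out << 1) | (state & 1u);
        }
        return out;
    }

    int main(void)
    {
        for (int i = 0; i < 4; i++)
            printf("%08x\n", gather_word());
        return 0;
    }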