Faculty of Electrical Engineering Czech Technical University in Prague


VLIW-DLX Simulator
Milos Becvar and Stanislav Kahanek
Faculty of Electrical Engineering, Czech Technical University in Prague

Presentation Outline
- Undergraduate Comp. Arch. Course
- Experience with WinDLX
- VLIW-DLX Simulation Model
- VLIW-DLX Simulator Features
- Example of Program
- Planned use in comp. arch. Course
- Future work

X36APS Course Content
Undergraduate course intended for CS/CE students, a follow-up to the digital design and basic computer organization courses. 90 minutes of lecture + 90 minutes of lab/seminar per week. 200-300 students per semester.
- Introduction, computer performance – 1 lecture
- ISA – 2 lectures
- Pipelining of RISC – 2 lectures
- Memory subsystem – 2 lectures
- Intro. to ILP: superscalar, VLIWs – 2 lectures
- Data parallelism, vector computers – 1 lecture
- Multiprocessors, coherency on SMP – 2 lectures

X36APS Seminars and Labs
The goal is to complement lectures with additional experience with the presented topics:
- Using visualization simulators (WinDLX, HDLDLX, SMPCache)
- Running benchmarks and evaluating various trade-offs (SPEC benchmarks, Dinero)
- “Table and chalk” seminars on topics where simulators are not available (cache design, vector computers)
Visualization simulators prove to be the most efficient way for students to interact with the topics.

Good Experience with WinDLX

WinDLX in X36APS Course
- Used to demonstrate the correspondence between C source code and the assembly program in the DLX ISA, and the importance of GCC optimization (1 week in class)
- Used to demonstrate loop unrolling to improve speed of execution on DLX (1 week in class)
- Matrix multiplication program (3 weeks homework)

Matrix Multiplication Program
Write a program for the DLX processor that computes the product of two square matrices of dimension N. Optimize this program for the given processor parameters to achieve the lowest possible execution time.
- Competition to achieve the best result limits cheating.
- To earn full points, students have to unroll the right loop and schedule instructions to eliminate stalls.
- Register constraints are necessary to prohibit a brute-force solution (e.g. completely eliminating the inner loop by unrolling it 10 times).

Result rating (for N=10):
Clock cycles    | Points
> 10800         | 1
9800 ... 10800  | 2
8800 ... 9800   | 3
7800 ... 8800   | 4
6800 ... 7800   | 5
< 6800          | 6
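The unrolling idea behind the assignment can be illustrated outside DLX. The following is a minimal Python sketch, not the students' DLX solution: it assumes the loop worth unrolling is the inner (k) loop and unrolls it by a factor of 2 with two independent accumulators. The function names and the unroll factor are illustrative choices, not taken from the slides.

```python
def matmul_naive(a, b):
    """Reference triple loop for square matrices given as lists of lists."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unrolled2(a, b):
    """Same product with the inner loop unrolled by 2 (assumed unroll factor)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc0 = 0.0  # accumulator for even k
            acc1 = 0.0  # accumulator for odd k, independent of acc0
            k = 0
            while k + 1 < n:
                acc0 += a[i][k] * b[k][j]
                acc1 += a[i][k + 1] * b[k + 1][j]
                k += 2
            if k < n:  # leftover element when n is odd
                acc0 += a[i][k] * b[k][j]
            c[i][j] = acc0 + acc1
    return c
```

The two accumulators do not depend on each other, which is what gives an instruction scheduler room to fill stall cycles. With floating-point data the summation order differs from the naive loop, so results may differ in the last bits.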

VLIW-DLX Goals
A tool similar to WinDLX to illustrate the basics of VLIWs.
- Show the relationship between VLIWs and scalar pipelines.
- Show the relationship between software elimination of hazards by inserting NOPs into code and the hardware solution by pipeline interlocks and stalls.
- Show that the speedup achievable by extending pipeline width is limited, and show the sources of these limitations.
- Demonstrate software pipelining algorithm efficiency for VLIW and superscalar processors.

Requirements for VLIW-DLX Simulator
- Similar ISA to DLX
- Similar GUI/features to the WinDLX GUI
- Visualization of pipeline similar to WinDLX
- Must run in both Windows and Linux environments (hence it is in Java)

VLIW-DLX Architecture

VLIW-DLX Features
- Currently no forwarding; all data transfers go through a unified register file (Int. and FP registers).
- RAW and WAW hazards possible.
- Multiple write conflicts possible (later operation wins).
- Single branch allowed per VLIW instruction, in pipeline slot 1.
- The VLIW instruction following a branch (in the delay slot) is always executed (the branch is executed in the ID stage).
- Number and type of pipeline slots can be easily modified in the simulator code.
- Operations are all DLX instructions except double-precision FP instructions and division; new operations can be added easily.

VLIW-DLX Instructions
bnez r3, loop   lf f3,0(r2)   sf -16(r2),f10   nop   multf f2,f1,f2 ;;
subi r3,r3,4   lf f5,4(r2)   sf -12(r2),f11   multf f4,f1,f4 ;;
- DLX instruction = VLIW-DLX operation
- VLIW-DLX instruction = group of 5 DLX instructions
- ;; = VLIW-DLX instruction delimiter
- Pipeline 1 (Integer, Branch)
- Pipeline 2 (Integer, Load)
- Pipeline 3 (Load / Store)
- Pipeline 4 (Floating Point)
- Pipeline 5 (Multiplication)
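The slot-to-operation-type mapping above can be sketched as a small checker. This is a hypothetical Python illustration, not the simulator's actual Java code: the slot classes follow the pipeline labels on the slide, while the opcode classification table and the function name are assumptions made here and cover only the opcodes that appear in the examples.

```python
# Operation classes accepted by each pipeline slot, as labelled on the slide.
SLOT_CLASSES = [
    {"integer", "branch"},  # Pipeline 1
    {"integer", "load"},    # Pipeline 2
    {"load", "store"},      # Pipeline 3
    {"fp"},                 # Pipeline 4
    {"mult"},               # Pipeline 5
]

# Assumed classification of opcodes (hypothetical, not from the simulator).
OPCODE_CLASS = {
    "bnez": "branch", "subi": "integer", "addui": "integer", "subui": "integer",
    "lf": "load", "sf": "store", "addf": "fp", "multf": "mult",
}


def check_vliw_instruction(ops):
    """Return a list of slot violations for one VLIW instruction (list of 5 strings)."""
    if len(ops) != len(SLOT_CLASSES):
        return [f"expected {len(SLOT_CLASSES)} operations, got {len(ops)}"]
    errors = []
    for slot, op in enumerate(ops, start=1):
        opcode = op.split()[0]
        if opcode == "nop":  # a nop is allowed in any slot
            continue
        if OPCODE_CLASS.get(opcode) not in SLOT_CLASSES[slot - 1]:
            errors.append(f"slot {slot}: '{opcode}' not allowed")
    return errors


first = ["bnez r3, loop", "lf f3,0(r2)", "sf -16(r2),f10", "nop", "multf f2,f1,f2"]
print(check_vliw_instruction(first))  # prints []
```

Keeping the slot description in a plain list mirrors the slide's remark that the number and type of slots can be changed easily.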

VLIW-DLX Instructions
- Simple HW-oriented representation of VLIW instructions.
- Position in the instruction corresponds to the pipeline slot; the allowed operation type is checked by the compiler.
- Explicit nops must be included in unused instruction slots (the bundle concept or instruction compression is not used).
- Exchange of values between two registers is possible in a single VLIW instruction without an intermediate storage register.

Exchange of r1 and r2:
add r1,r2,r0   add r2,r1,r0   nop   nop   nop ;;
Full 5-slot NOP:
nop   nop   nop   nop   nop ;;
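The register exchange works because every operation in one VLIW instruction reads its sources before any of them writes a result. A minimal Python sketch of that read-then-write behaviour follows; it models only `add`, and the function name, the tuple encoding, and the interpretation of "later operation wins" as "higher-numbered slot wins" are assumptions for illustration.

```python
def execute_vliw_adds(regs, ops):
    """Execute one VLIW instruction made of add operations.

    regs: dict mapping register name to value.
    ops:  one entry per slot, either None (nop) or (dest, src1, src2).
    """
    # Phase 1: all operations read the register state from before the instruction.
    results = []
    for op in ops:
        if op is None:
            continue
        dest, src1, src2 = op
        results.append((dest, regs[src1] + regs[src2]))

    # Phase 2: results are written back in slot order, so on a write
    # conflict the later slot overwrites the earlier one (assumed ordering).
    new_regs = dict(regs)
    for dest, value in results:
        new_regs[dest] = value
    return new_regs


before = {"r0": 0, "r1": 5, "r2": 9}
swap = [("r1", "r2", "r0"), ("r2", "r1", "r0"), None, None, None]
after = execute_vliw_adds(before, swap)
print(after["r1"], after["r2"])  # prints 9 5
```

Executed sequentially on a scalar machine, the same two adds would leave both registers holding the old value of r2, which is why a scalar swap needs a temporary register.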

VLIW-DLX Simulator Features
- Memory view
- Pipeline view
- Register view
- Source code editor

VLIW-DLX Simulator Features
- Shows which stage a given operation is in.
- Same colors as WinDLX.

VLIW-DLX Simulator Features Help shows which operations are now allowed in a given pipeline slot

VLIW-DLX Simulator Features
The simulator shows the register values read by a given operation in the ID stage. This helps to track RAW and WAW dependences.

VLIW-DLX Demonstration Program
XPK (X Plus K):
    float x[100], k;
    for (i=0; i<100; i++)
        x[i] += k;

XPK on Scalar DLX (main loop)
loop: lf    f1, 0(r1)
      addf  f1, f1, f0
      addui r1, r1, 4
      subui r3, r3, 1
      sf    -4(r1), f1
      bnez  r3, loop

XPK on WinDLX
Same latency as VLIW-DLX, no forwarding; 2 multicycle FP adders are used to simulate a single pipelined FP adder.

                   | Trivial | 2x unrolled | 4x unrolled | Software pipelined
Instruction count  | 604     | 454         | 379         | 371
Cycles             | 1104    | 654         | 429         | 422
RAW stalls         | 400     | 150         |             |
Control stalls     | 99      | 49          | 24          | 22
Structural stalls  | 1       |             | 26          | 29
IPC                | 0.55    | 0.69        | 0.88        |
CPI                | 1.83    | 1.44        | 1.13        | 1.14
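The IPC and CPI rows follow directly from the instruction and cycle counts. A short Python sketch of the arithmetic, using the counts from the table above (the function and dictionary names are illustrative):

```python
def ipc_cpi(instructions, cycles):
    """Return (IPC, CPI) rounded to two decimals."""
    return round(instructions / cycles, 2), round(cycles / instructions, 2)


# (instruction count, cycles) per program variant, taken from the table.
RUNS = {
    "trivial": (604, 1104),
    "2x unrolled": (454, 654),
    "4x unrolled": (379, 429),
    "software pipelined": (371, 422),
}

for name, (instructions, cycles) in RUNS.items():
    ipc, cpi = ipc_cpi(instructions, cycles)
    print(f"{name:20s} IPC={ipc:.2f} CPI={cpi:.2f}")
```

IPC and CPI are reciprocals, so either one alone carries the same information; the table lists both for convenience.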

Trivial XPK on VLIW-DLX
                 | Pipeline 1       | Pipeline 2    | Pipeline 3    | Pipeline 4
#1               | loop: nop        | lf f1, 0(r1)  |               |
#2               |                  |               |               |
#3               |                  |               |               |
#4               |                  |               |               | addf f1,f1,f0
#5               | subui r3,r3,1    |               |               |
#6               | addui r1,r1,4    |               |               |
#7               | bnez r3, loop    |               |               |
#8 (delay slot)  |                  |               | sf 0(r1), f1  |

2x Unrolled XPK on VLIW-DLX
                 | Pipeline 1       | Pipeline 2    | Pipeline 3    | Pipeline 4
#1               | loop: nop        | lf f1, 0(r1)  |               |
#2               |                  | lf f2, 4(r1)  |               |
#3               |                  |               |               |
#4               |                  |               |               | addf f1,f1,f0
#5               |                  |               |               | addf f2,f2,f0
#6               | subui r3,r3,2    |               |               |
#7               | addui r1,r1,8    |               |               |
#8               | bnez r3, loop    |               | sf 0(r1), f1  |
#9 (delay slot)  |                  |               | sf 4(r1), f2  |
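One way to compare the two schedules is to count VLIW instructions in the loop body per array element processed. The sketch below is only that simple ratio, using the body lengths shown in the tables (8 instructions for 1 element, 9 instructions for 2 elements); it ignores pipeline fill and drain, so it is not a prediction of the simulator's cycle counts. The function name is hypothetical.

```python
def vliw_instructions_per_element(body_length, elements_per_iteration):
    """Loop-body VLIW instructions issued per processed array element."""
    return body_length / elements_per_iteration


trivial = vliw_instructions_per_element(8, 1)
unrolled_2x = vliw_instructions_per_element(9, 2)
print(trivial, unrolled_2x)  # prints 8.0 4.5
```

Unrolling helps here because the second element's load, add, and store reuse instruction slots that the trivial version leaves empty.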

Software-Pipelined XPK on VLIW-DLX (prolog and epilog not shown)
                 | Pipeline 1       | Pipeline 2     | Pipeline 3     | Pipeline 4
#1               | loop: nop        | lf f1,28(r1)   | sf 0(r1),f1    | addf f1,f1,f0
#2               | addui r1,r1,16   | lf f1,32(r1)   | sf 4(r1),f1    |
#3               | bnez r3,loop     | lf f1,36(r1)   | sf 8(r1),f1    |
#4 (delay slot)  | subui r3,r3,4    | lf f1,24(r1)   | sf -4(r1),f1   |
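The structure of a software-pipelined loop (prolog, steady-state kernel, epilog) can be shown on XPK at a high level. The Python sketch below is a conceptual illustration under stated assumptions, not the schedule from the table: it overlaps the load of element i, the add of element i-1, and the store of element i-2 in one kernel step, whereas the slide's version depends on specific VLIW-DLX latencies and offsets. The function name is hypothetical.

```python
def xpk_software_pipelined(x, k):
    """Return a new list with k added to every element, using a 3-stage software pipeline."""
    n = len(x)
    out = list(x)
    if n < 2:  # too short to pipeline, handle directly
        return [v + k for v in x]

    # Prolog: fill the pipeline (load and add element 0, load element 1).
    loaded = out[0]
    added = loaded + k
    loaded = out[1]

    # Kernel: each step works on three different iterations at once.
    for i in range(2, n):
        out[i - 2] = added   # store stage of iteration i-2
        added = loaded + k   # add stage of iteration i-1
        loaded = out[i]      # load stage of iteration i

    # Epilog: drain the two iterations still in flight.
    out[n - 2] = added
    out[n - 1] = loaded + k
    return out
```

In the kernel, the three statements touch different iterations and are independent of one another's results within the same step, which is what allows a VLIW machine to issue them in a single instruction.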

XPK Loop Performance

Pipeline Efficiency of VLIW-DLX

VLIW-DLX in X36APS Course
- Students will be introduced to VLIW-DLX within a single seminar.
- They will try to implement a simple loop (similar to XPK) to learn how to use the tool and software pipelining.
- They will be assigned a slightly more complex homework (SAXPY loop, matrix mult. kernel).

Summary and Future Work
- VLIW-DLX is a simple tool for introducing VLIWs to undergraduate students.
- It can be easily integrated into a course based on DLX, and also on MIPS processors.
- A similar tool is planned to replace the aging WinDLX simulator. It will also support vector instructions.
- Our goal is to introduce all these concepts to undergraduate students within a common ISA framework.