Lecture 3: Towards Prac 1, Golden Measure, Temporal and Spatial Computing, Benchmarking. Lecturer: Simon Winberg. Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

 Prac issues
 Seminar planning
 Temporal & spatial computing
 Benchmarking
 Power
BTW: Quiz 1 is NEXT Thursday! Licensing details are on the last slide.

 Each seminar is run by a seminar group.
 Everyone is to read each assigned reading.
 Recommended: make notes for yourself (or underline/highlight important points; if you want to resell your book, don't do this).
 Write down questions or comments (the classmates running the seminar would probably welcome these).

 Your seminar needs to include:
 3x important take-home messages (of which students will hopefully remember at least 1)
 1x point that you collectively decided was most interesting
Extra seasoning: you are by all means welcome to add tasks, surveys, handouts, etc., that may encourage participation and/or benefit your classmates' learning experience.

Seminar presentation timing & marking guide

Structure of Seminar Presentation                                                | Mark
Introduction of group and topic (~1 min)                                         |   5
Summary presentation (~10 min)                                                   |  20
Visual aids / use of images / mindmaps / etc.                                    |  20
Reflections (5-10 min), including group's viewpoints / comments / critique       |  15
Facilitation and direction of class discussion & response to questions (10 min)  |  15
Quality of questions posed by the presenters                                     |  10
Wrapping up / conclusion (2 min)                                                 |   5
Participation of all members                                                     |  10
TOTAL                                                                            | 100

Look for the Seminar Marking Guide under Resources.

 ~30 students
 About 10 seminars (excluding tomorrow's)
 Groups to be determined
 Use Sign-Up in Vula to specify your group members (preferably 3 students per group, max. 4)

Prac 1 Issues EEE4084F Digital Systems

 Procedure:
 Develop / study the algorithm
 Implement it
 Performance test: initially a "feel-good" confirmation (graphs, etc.), then speed, memory, latency, etc. comparisons with the "golden measure"

 Golden measure:
 A (usually) sequential solution that you develop as the 'yardstick'
 A solution that runs slowly and isn't optimized, but that you know gives an excellent result
 E.g., a solution written in Octave or MATLAB; verify it is correct using graphs, inspecting values, checking by hand with a calculator, etc.

 Sequential / serial (serial.c)
 A non-parallelized code solution
 Generally, you can call your code solutions parallel.c (or para1.c, para2.c if you have multiple versions)
 You can also include some test data (if it isn't too big, <1 MB), e.g. gold.csv or serial.csv, and paral1.csv

 Part A
 An example program that helps get you started more quickly
 E.g., Part A of Prac 1 gives a sample program using Pthreads and loading an image
 Part B
 The main part, where you provide a parallelized solution

 Reports should be short!
 Preferably around one or two pages long (you could add appendices, e.g., additional screenshots)
 Discuss your observations and results.
 Prac number, names & student numbers on the 1st page
 Does not need to be fancy (e.g., point form is OK for prac reports)
 Where applicable (e.g., for Prac 1), you can include an image or two of the solution to illustrate/clarify the discussion

 Very important:
 Show the error stats and timing results you got.
 Use standard deviation where applicable; you may need to be inventive in some cases (e.g., a standard deviation between two images)
 I want to see the real time it took, and
 The speedup factor for the different methods and the types of tests applied:

  speedup = T_p1 / T_p2

where T_p1 is the run time of the original non-parallel program, T_p2 the run time of the optimized or parallel program, and μ denotes the average over repeated runs.
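
As a minimal sketch of how the averaging, standard deviation, and speedup calculations might look in C (the run count, placeholder timing values, and helper names here are illustrative assumptions, not part of the prac handout):

  #include <stdio.h>
  #include <math.h>

  #define RUNS 10  /* number of repeated timing runs (illustrative) */

  /* Mean of an array of timing samples */
  static double mean(const double *t, int n) {
      double s = 0.0;
      for (int i = 0; i < n; i++) s += t[i];
      return s / n;
  }

  /* Sample standard deviation of the timing samples */
  static double stddev(const double *t, int n) {
      double m = mean(t, n), s = 0.0;
      for (int i = 0; i < n; i++) s += (t[i] - m) * (t[i] - m);
      return sqrt(s / (n - 1));
  }

  int main(void) {
      /* Placeholder measurements, for illustration only; in a prac these
         would come from timing the serial and parallel programs. */
      double t_serial[RUNS]   = {2.01, 1.98, 2.03, 2.00, 1.99,
                                 2.02, 2.01, 1.97, 2.04, 2.00};
      double t_parallel[RUNS] = {0.55, 0.52, 0.54, 0.53, 0.56,
                                 0.52, 0.55, 0.54, 0.53, 0.52};
      double speedup = mean(t_serial, RUNS) / mean(t_parallel, RUNS);
      printf("speedup = %.2f (stddev: serial %.4f s, parallel %.4f s)\n",
             speedup, stddev(t_serial, RUNS), stddev(t_parallel, RUNS));
      return 0;
  }

Compile with -lm for sqrt(). Reporting the standard deviation alongside the averaged speedup shows how repeatable the measurements are.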

Temporal and Spatial Computation

  Temporal Computation            | Spatial Computation
  The traditional paradigm        | Suited to hardware
  Typical of programmers          | Possibly more intuitive?
  Things done over time steps     | Things related in a space

Temporal example (code):

  A = input("A = ? ");
  B = input("B = ? ");
  C = input("B multiplier ? ");
  X = A + B * C
  Y = A - B * C

Spatial example (dataflow): inputs A?, B?, C? feed a multiplier computing B * C, whose output feeds both an adder (giving X) and a subtractor (giving Y).

Which do you think is easier to make sense of? The spatial form can provide a clearer indication of relative dependencies.

 Being able to comprehend and extract the parallelism, or properties of concurrency, from a process or algorithm is essential to accelerating computation.
 The Reconfigurable Computing (RC) advantage:
 A computing platform that can adapt to the concurrency inherent in a particular application, in order to accelerate computation for that specific application.

 "Don't lose sight of the forest for the trees..."
 Generally, the main objective is to make the system faster, use less power, and use fewer resources.
 Most code doesn't need to be parallel.
 The important questions are...

 Should you bother to design a parallel algorithm?
 Is your parallel solution better than a simpler approach, especially if that approach is easier to read and share?
 The major telling factor is: real-time performance, or "wall-clock time".

 It is generally most accurate to use a built-in timer that is directly related to real time (e.g., if the timer measures 1 s, then 1 s elapsed in the real world).
 Technique:

  unsigned long long start; // store start time
  unsigned long long end;   // store end time
  start = read_the_timer(); // e.g. time()
  // DO PROCESSING
  end = read_the_timer();   // e.g. time()
  // Output the time measurement (end - start), or save it to an
  // array if printing will interfere with the times.

 Note: to avoid overflow, use unsigned variables. See file: Cycle.c
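
For sub-second measurements, time() is too coarse (1 s resolution). One higher-resolution option, sketched below as an assumption rather than the prac's prescribed method (the cycle counter in Cycle.c is another option), is the POSIX clock_gettime() call:

  #include <stdio.h>
  #include <time.h>   /* clock_gettime, CLOCK_MONOTONIC (POSIX) */

  /* Return current wall-clock time in seconds as a double */
  static double now_seconds(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + ts.tv_nsec * 1e-9;
  }

  int main(void) {
      double start = now_seconds();
      /* Dummy workload standing in for the processing being timed */
      volatile double sum = 0.0;
      for (long i = 0; i < 100000000L; i++) sum += i * 0.5;
      double end = now_seconds();
      printf("elapsed: %.6f s\n", end - start);
      return 0;
  }

CLOCK_MONOTONIC is preferable to CLOCK_REALTIME here because it is not affected by system clock adjustments mid-measurement. On older glibc versions you may need to link with -lrt.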

Power concerns (a GST perspective)

Computation Design Trends

[Figure: Intel performance graph]

For the past decades, the main means of increasing computer performance has been producing faster processors, which included packing more transistors into smaller spaces. Moore's law has been holding pretty well when measured in terms of transistors (e.g., a doubling of the number of transistors), but this trend has drawbacks, and it seems to be slowing.

[Figure: calculations per second per $1,000, trend over time]

[Figure: illustration of demand for computers (Intel perspective), unknown license. Source: alphabytesoup.files.wordpress.com/2012/07/computer-timeline.gif]

Computation Design Trends: Power Concerns

Processors are getting too power hungry: there are too many transistors that need power. Also, the size of transistors can't come down by much more; it might not be possible to have transistors smaller than a few atoms, and how would you connect them up? The industry is now tending towards multi-core processors. Sure, that can keep doubling the transistors every 2-3 years (and the power), but what of performance? For scale, a dual-core Intel system with a GPU and LCD monitor draws about 220 watts.

[Figure: power over time, with projections; obviously we've since seen the reality isn't as bad]

 Matrix operations are commonly used to demonstrate and teach parallel coding.
 The scalar product (or dot product) and matrix multiplication are the 'usual suspects':

  Vector scalar product:  sum = Σ_k a[k] · b[k]
  Matrix multiplication:  C[i,j] = Σ_k A[i,k] · B[k,j]

 Both of these operations can be successfully implemented as deeply parallel solutions.

 Attempt a pseudo-code solution for parallelizing both:
 the scalar vector product algorithm, and
 the matrix multiplication algorithm.
 Assume you would want to implement your solution in C (i.e., your pseudo-code should follow C-type operations).
 Next, consider how you would do it in hardware on an FPGA (draw a schematic).
If time is too limited, just try the scalar product. If you have more time and are really keen, then by all means experiment with writing and testing real code to check that your suggested solution is valid.

Suggested function prototypes

Scalar product:

  float scalarprod (float* a, float* b, int n) {
    // a, b = input vectors of length n
    // Function returns the scalar product
  }

Matrix multiply:

  void matrix_multiply (float** A, float** B, float** C, int n) {
    // A, B = input matrices of size n x n floats
    // C = output matrix of size n x n floats
  }
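
As a point of reference, here is one possible sequential (golden measure) filling-in of these prototypes. This is a sketch only; the loop ordering and accumulator variables are my own choices, not prescribed by the prac:

  /* Scalar (dot) product of vectors a and b of length n */
  float scalarprod (float* a, float* b, int n) {
      float sum = 0.0f;
      for (int k = 0; k < n; k++)
          sum += a[k] * b[k];
      return sum;
  }

  /* C = A * B for n x n matrices: C[i][j] = sum over k of A[i][k]*B[k][j] */
  void matrix_multiply (float** A, float** B, float** C, int n) {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++) {
              float sum = 0.0f;
              for (int k = 0; k < n; k++)
                  sum += A[i][k] * B[k][j];
              C[i][j] = sum;
          }
  }

Note that each C[i][j] is independent of the others, which is exactly the property a parallel version exploits.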

Scalarprod.c: golden measure / sequential solution

  t0 = CPU_ticks(); // get initial tick value
  // Do processing...
  // First initialize the vectors
  for (i = 0; i < VECTOR_LEN; i++) {
    a[i] = random_f();
    b[i] = random_f();
  }
  sum = 0;
  for (i = 0; i < VECTOR_LEN; i++) {
    sum = sum + (a[i] * b[i]);
  }
  t1 = CPU_ticks(); // get final tick value
  // Time elapsed = t1 - t0
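
Since Prac 1's Part A introduces Pthreads, a parallel counterpart to the golden measure above might partition the vector across threads, each computing a partial sum. This is only a sketch; the thread count, chunking scheme, and names (worker, chunk_t) are illustrative assumptions, not the prac's prescribed solution:

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NUM_THREADS 4       /* illustrative thread count */
  #define VECTOR_LEN  1000000

  static float a[VECTOR_LEN], b[VECTOR_LEN];

  typedef struct { int start, end; float partial; } chunk_t;

  /* Each thread computes the dot product over its own slice */
  static void* worker(void* arg) {
      chunk_t* c = (chunk_t*)arg;
      float sum = 0.0f;
      for (int i = c->start; i < c->end; i++)
          sum += a[i] * b[i];
      c->partial = sum;
      return NULL;
  }

  int main(void) {
      pthread_t threads[NUM_THREADS];
      chunk_t chunks[NUM_THREADS];
      int step = VECTOR_LEN / NUM_THREADS;

      for (int i = 0; i < VECTOR_LEN; i++) {  /* fill with test data */
          a[i] = (float)rand() / RAND_MAX;
          b[i] = (float)rand() / RAND_MAX;
      }

      for (int t = 0; t < NUM_THREADS; t++) {
          chunks[t].start = t * step;
          /* Last thread takes any remainder of the vector */
          chunks[t].end = (t == NUM_THREADS - 1) ? VECTOR_LEN : (t + 1) * step;
          pthread_create(&threads[t], NULL, worker, &chunks[t]);
      }

      float sum = 0.0f;                        /* combine partial sums */
      for (int t = 0; t < NUM_THREADS; t++) {
          pthread_join(threads[t], NULL);
          sum += chunks[t].partial;
      }
      printf("scalar product = %f\n", sum);
      return 0;
  }

Compile with gcc -pthread. Joining before summing avoids the need for a mutex, since each thread writes only to its own chunk_t.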

 Thursday lecture  Timing  Programming Models

Image sources:
 Gold bar: Wikipedia (open commons)
 IBM Blade: CC BY 2.0
 Takeaway, Clock, Factory and smoke: public domain (CC0)
 Forest of trees: Attribution-NonCommercial-ShareAlike 2.0 Generic (CC BY-NC-SA 2.0)
 Moore's Law graph, processor families per supercomputer over years: all Creative Commons, commons.wikimedia.org

Disclaimers and copyright/licensing details:
I have tried to follow the correct practices concerning copyright and licensing of material, particularly the image sources used in this presentation. I have put much effort into making this material open access so that it can benefit others in their teaching and learning practice. Any mistakes or omissions with regard to these issues I will correct when notified. To the best of my understanding, the material in these slides can be shared according to the Creative Commons "Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)" license, which is why I selected that license for this presentation (it's not because I particularly want my slides referenced, but rather to acknowledge the sources and generosity of others who have provided free material, such as the images I have used).