Parallel Computing for Every Student
Prof. Ran Ginosar, Electrical Engineering and Computer Science, Technion

Presentation transcript:

1 Parallel Computing for Every Student. Prof. Ran Ginosar, Electrical Engineering and Computer Science, Technion.

2 Contents: Why parallel computing? Theory. Architecture. Algorithms. What to do.

3 Why parallel computing? Everyone is doing it – search "parallel computing high school". Faster computation. Computation at lower power. Computation at lower energy.

4 Theory (1): the PRAM model – Parallel Random Access Machine – intended for one algorithm at a time – "parallel" reading and writing (concurrently, simultaneously) – in each time unit, every processor performs a computation or every processor accesses memory. [Figure: processors P0, P1, P2, …, Pn connected to a shared memory.]

5 Example: summing the elements of an array (n = 8).
Serial: a single processor accumulates A1 + A2 + … + A8. T1 = 7 additions, O(n), with P = 1.
Parallel: a binary addition tree – round 1 computes (A1+A2), (A3+A4), (A5+A6), (A7+A8) in parallel, then two more rounds. TP = 3, O(log n), with P = 4, O(n) processors.
Speedup = T1 / TP = 7/3 ≈ 2.33, O(n / log n).
Total possible work: P · TP = 12, O(n log n). Total work actually performed: W = 7, O(n).
Efficiency: EP = W / (P · TP) = 7/12, O(1 / log n).
But: we forgot to account for the cost of memory access...
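(For concreteness, a minimal sketch of the same log-time tree summation in plain C with OpenMP – an illustration only, not the Plurality task syntax shown later; it assumes n is a power of two.)

#include <stdio.h>

#define N 8

int main(void) {
    int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    /* log2(N) = 3 rounds; within each round, all additions run in parallel */
    for (int s = 1; s < N; s *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < N; i += 2 * s)
            a[i] += a[i + s];      /* round 1: 4 additions, round 2: 2, round 3: 1 */
    }
    printf("sum = %d\n", a[0]);    /* 7 additions in total, as in the serial case  */
    return 0;
}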

6 Exchange: swap the elements of A[ ] with those of B[ ]. Input: A[ ], B[ ], n = length of the arrays.

SERIAL
int main( ) {
    int i;
    for (i = 0; i < n; i++) {
        int e = A[i];
        A[i] = B[i];
        B[i] = e;
    }
}

PARALLEL
Duplicable task() // n copies; $ is the index of this copy, 0 … n-1
{
    int x;
    x = A[$];
    A[$] = B[$];
    B[$] = x;
}

How many processors? What is the computation time? The speedup? The efficiency? FINE GRANULARITY

7 Optimal Parallel Exchange

int y[n], x[n];
Duplicable Task first-A ( )                   { x[$] = A[$]; }
Duplicable Task first-B ( )                   { y[$] = B[$]; }
Duplicable Task second-A ( first-A, first-B ) { A[$] = y[$]; }
Duplicable Task second-B ( first-A, first-B ) { B[$] = x[$]; }

On how many processors? What happens if there are fewer processors? More processors? What is the computation time? The speedup? The efficiency? What is the meaning of FINE GRANULARITY??
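(For readers without the Plurality toolchain, a hedged C/OpenMP rendering of the same schedule; the barrier between the two phases plays the role of the (first-A, first-B) dependencies. The array size is an illustrative assumption.)

enum { n = 8 };                     /* illustrative size; n, A, B as on the slide */
int A[n], B[n], x[n], y[n];

void exchange(void) {
    #pragma omp parallel
    {
        /* phase 1 = first-A and first-B: copy both arrays out */
        #pragma omp for nowait
        for (int i = 0; i < n; i++) x[i] = A[i];
        #pragma omp for             /* implicit barrier here = the dependency */
        for (int i = 0; i < n; i++) y[i] = B[i];
        /* phase 2 = second-A and second-B: write back, swapped */
        #pragma omp for nowait
        for (int i = 0; i < n; i++) A[i] = y[i];
        #pragma omp for
        for (int i = 0; i < n; i++) B[i] = x[i];
    }
}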

8 Theory (2): what does concurrent memory access mean?
– EREW: exclusive read, exclusive write – this concerns accesses to exactly the same variable!
– CREW: concurrent read, exclusive write
– ERCW
– CRCW: concurrent read and concurrent write
The algorithm must guarantee a correct computation – the hardware does not protect against the foolish programmer. Is the PRAM model practical?
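(To make "the algorithm must guarantee correctness" concrete: in the common-CRCW variant, concurrent writes are legal only if all writers store the same value. A minimal, hedged C/OpenMP sketch of the classic one-step parallel OR, which on an EREW machine would need O(log n) steps:)

#include <stdio.h>

#define N 8

int main(void) {
    int a[N] = {0, 0, 0, 1, 0, 0, 1, 0};
    int any = 0;                        /* the single shared output cell */
    #pragma omp parallel for            /* one PRAM time step, N "processors" */
    for (int i = 0; i < N; i++)
        if (a[i]) {
            #pragma omp atomic write    /* all concurrent writers store the SAME */
            any = 1;                    /* value -- the algorithm guarantees it  */
        }
    printf("OR = %d\n", any);
    return 0;
}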

9 Synchronization.
PRAM synchronization – all processors perform the same operation at the same time.
Asynchrony – each processor advances at its own pace, up to a synchronization point (example later).
Synchronization in segments – BSP = Bulk Synchronous Parallel – each processor runs at its own pace, but the synchronization points are shared (= barriers) – within each green segment no memory is shared – information is exchanged at the barrier points. [Figure: BSP supersteps separated by barriers.]
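(A hedged sketch of the BSP superstep pattern in C/OpenMP; the function names and the number of supersteps are illustrative placeholders, not a real API.)

#include <omp.h>

#define NSTEPS 4                        /* illustrative number of supersteps */

void local_compute(int tid, int step) { /* private data only: the "green" segment */ }
void exchange_info(int tid, int step) { /* shared memory is touched only here     */ }

void bsp_run(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (int step = 0; step < NSTEPS; step++) {
            local_compute(tid, step);   /* each thread at its own pace...       */
            #pragma omp barrier         /* ...until the shared barrier          */
            exchange_info(tid, step);   /* information exchanged at the barrier */
            #pragma omp barrier         /* before the next superstep            */
        }
    }
}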

10 Other models: a network of computers, each node a processor with its own memory (P+M).

11 How should the processors and memories be organized on a chip?

12 The Nahalal model: what is shared is close to everyone. [Figure: processors P0 through P7 arranged around a common center.]

13 Architecture

14 PLURALITY: a start-up company in Israel, the result of Technion research (since the 1980s).

15 Architecture, Part I: the shared memory.
– "Anti-local" addressing by interleaving: MANY banks / ports, negligible conflicts, fine granularity.
– NO PRIVATE MEMORY: tightly coupled memory, equi-distant from every processor (1 cycle each way), over a fast combinational NoC.
[Figure: processors connected through a P-to-M resolving NoC to the shared memory, with an external memory behind it.]

16 Architecture, Part II: the scheduler.
– Low-latency parallel scheduling enables fine granularity.
– The memory system is as in Part I: "anti-local" interleaved addressing, MANY banks / ports, negligible conflicts, fine granularity, NO PRIVATE MEMORY, tightly coupled equi-distant memory (1 cycle each way), fast combinational NoC.
[Figure: a scheduler is added, connected to the processors through a P-to-S scheduling NoC alongside the P-to-M resolving NoC.]

17 Actual layout (40 nm): 32 processors, 1 MByte data cache, 64 KB instruction cache, and the sync/scheduler unit.

18 Programming model. Compile into – a task-dependency graph = the 'task map' – the task codes. Task maps are loaded into the scheduler; tasks are loaded into memory.

Task template:
regular / duplicable task xxx ( dependencies ) join/fork
{ … $ … }

[Figure: processors, shared memory, P-to-M resolving NoC, scheduler, and P-to-S scheduling NoC, as in the architecture slides.]

19 Fine Grain Parallelization

SERIAL
for (i = 0; i < 10000; i++) { a[i] = b[i]*c[i]; }

PARALLEL
duplicable task XX(…) // 10000 copies
{ a[$] = b[$]*c[$]; }

All tasks, or any subset, can be executed in parallel.
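(The same loop in standard C with OpenMP, for comparison; each iteration corresponds to one instance of the duplicable task. A sketch, assuming float arrays.)

void vec_mul(int n, float a[], const float b[], const float c[]) {
    /* every iteration is independent, so all 10000 "task copies"
       may run in parallel, in any order, on any number of cores */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}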

20 Task map example (2D FFT). [Figure: a task graph combining duplicable tasks, conditional tasks, and join/fork tasks.]

21 Another task map (linear solver)

22 Linear Solver: simulation snapshots

23 Architectural Benefits
– Shared, uniform (equi-distant) memory: no worry about which core does what; no core has an advantage because it already holds the data.
– Many-bank memory + fast P-to-M NoC: low latency, no bottleneck in accessing shared memory.
– Fast scheduling of tasks to free cores (many at once): enables fine-grain data parallelism, which is impossible in other architectures due to task-scheduling overhead and data locality.
– Any core can do any task equally well on short notice: scales automatically.
– Programming model: intuitive to programmers, easy for an automatic parallelizing compiler.

24 Analysis

25 Two (analytic) approaches to many cores: 1) How many cores can fit into a fixed-size VLSI chip? 2) One core is of fixed size; how many such cores can be integrated? The following analysis employs approach (1).

26 Analysis

27 Analysis: power ∝ √N, performance ∝ √N, frequency ∝ 1/√N, hence performance / power ≈ constant. [Figure: performance, power, and frequency plotted from K to 4K cores.]
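(Worked arithmetic from the stated proportionalities, assuming nothing beyond them; going from K to 4K cores:)

\frac{f_{4K}}{f_{K}} = \frac{1/\sqrt{4K}}{1/\sqrt{K}} = \frac{1}{2}, \qquad
\frac{\mathrm{perf}_{4K}}{\mathrm{perf}_{K}} = \frac{\sqrt{4K}}{\sqrt{K}} = 2, \qquad
\frac{\mathrm{power}_{4K}}{\mathrm{power}_{K}} = 2

So quadrupling the core count halves each core's frequency, doubles total performance and total power, and leaves performance per watt unchanged.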

28 What to do? Teach parallel computing – a little architecture – many algorithms. Use the Plurality system for software development, simulation, and performance measurement – based on GCC and Eclipse – for C and C++ – available free of charge to every teacher and student, downloadable from the website – includes examples and instructions – enables hands-on experimentation.

29 Summary. A parallel computer (built of small processors) is more efficient than a single powerful computer – provided that we know how to program it efficiently – and that is possible only with a simple machine and a clever algorithm. Parallel programming – is easy for a teacher to learn – is suitable for teaching in high school – is essential to learn in high school.
