COSC513 Operating System Research Paper
Fundamental Properties of Programming for Parallelism
Student: Feng Chen (134192)


Conditions of Parallelism
Exploiting parallelism requires progress in three key areas:
- Computation models
- Inter-processor communication
- System integration
Tradeoffs exist among time, space, performance, and cost factors.

Data and resource dependences
- Flow dependence: an execution path exists from S1 to S2 and at least one output of S1 feeds in as an input to S2.
- Antidependence: S2 follows S1 and the output of S2 overlaps the input to S1.
- Output dependence: S1 and S2 produce the same output variable.
- I/O dependence: the same file is referenced by more than one I/O statement.
- Unknown dependence: the index is itself indexed (indirect addressing), there is no loop variable in the index, the loop index is nonlinear, etc.
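As a small high-level illustration (the statements and variable names are mine, not from the paper), the C fragment below contains a flow dependence, an antidependence, and an output dependence of the kinds defined above:

    #include <stdio.h>

    int main(void) {
        int a, b = 5, c;

        a = b + 1;   /* writes a, reads b                                      */
        c = a * 2;   /* reads the a written above      -> flow dependence      */
        b = 7;       /* overwrites the b read earlier  -> antidependence       */
        c = 0;       /* overwrites the c written above -> output dependence    */

        printf("%d %d %d\n", a, b, c);   /* prints 6 7 0 */
        return 0;
    }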

Example of data dependence
S1: Load R1, A      / move mem(A) to R1
S2: Add R2, R1      / R2 = (R1) + (R2)
S3: Move R1, R3     / move (R3) to R1
S4: Store B, R1     / move (R1) to mem(B)
- S2 is flow-dependent on S1
- S3 is antidependent on S2
- S3 is output-dependent on S1
- S2 and S4 are totally independent
- S4 is flow-dependent on S1 and S3

Example of I/O dependence
S1: Read(4), A(i)     / read array A from tape unit 4
S2: Rewind(4)         / rewind tape unit 4
S3: Write(4), B(i)    / write array B onto tape unit 4
S4: Rewind(4)         / rewind tape unit 4
S1 and S3 are I/O-dependent on each other because they reference the same file (tape unit 4). This ordering must not be violated during execution; otherwise, errors occur.
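A loose modern analogue (mine, not from the paper) replaces tape unit 4 with an ordinary file; the file name and record contents are illustrative:

    #include <stdio.h>

    int main(void) {
        char a[64];
        FILE *unit4 = fopen("unit4.dat", "r+");   /* stands in for tape unit 4 */
        if (!unit4) return 1;

        fgets(a, sizeof a, unit4);    /* S1: read a record from the file     */
        rewind(unit4);                /* S2: rewind                          */
        fputs("B-record\n", unit4);   /* S3: write a record to the same file */
        rewind(unit4);                /* S4: rewind                          */

        /* S1 and S3 are I/O-dependent: swapping them would make S1 read the
           data written by S3 rather than the file's original contents.      */
        fclose(unit4);
        return 0;
    }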

Control dependence
- The situation in which the order of execution of statements cannot be determined before run time.
- Different paths taken after a conditional branch may change data dependences.
- Control dependence may also exist between operations performed in successive iterations of a loop.
- Control dependence often prohibits parallelism from being exploited.

Example of control dependence
Successive iterations of this loop are control-independent:

    for (i = 0; i < N; i++) {
        a[i] = c[i];
        if (a[i] < 0)
            a[i] = 1;
    }

Example of control dependence
The following loop has control-dependent iterations:

    for (i = 1; i < N; i++) {
        if (a[i-1] == 0)
            a[i] = 0;
    }

Resource dependence
- Concerned with conflicts in using shared resources, such as integer units, floating-point units, registers, and memory areas.
- ALU dependence: the ALU is the conflicting resource.
- Storage dependence: each task must work on independent storage locations or use protected access to a shared writable memory area.
Detection of parallelism requires checking these various dependence relations.

Bernstein’s conditions for parallelism
Define Ii as the input set of a process Pi and Oi as the output set of Pi. Two processes P1 and P2 can execute in parallel (denoted P1 || P2) under the conditions:
- I1 ∩ O2 = ∅
- I2 ∩ O1 = ∅
- O1 ∩ O2 = ∅
Note that I1 ∩ I2 ≠ ∅ does not prevent parallelism: two processes may read the same data.

Bernstein’s conditions for parallelism
- The input set is also called the read set or the domain of a process.
- The output set is also called the write set or the range of a process.
- A set of processes can execute in parallel if Bernstein’s conditions are satisfied on a pairwise basis; that is, P1 || P2 || … || Pk if and only if Pi || Pj for all i ≠ j.

Bernstein’s conditions for parallelism
- The parallelism relation is commutative: Pi || Pj implies Pj || Pi.
- The relation is not transitive: Pi || Pj and Pj || Pk do not necessarily imply Pi || Pk.
- Associativity: Pi || Pj || Pk implies (Pi || Pj) || Pk = Pi || (Pj || Pk).

Bernstein’s conditions for parallelism
- For n processes, there are 3n(n-1)/2 conditions to check; violation of any of them prohibits parallelism collectively or partially.
- Statements or processes that depend on run-time conditions (IF statements or conditional branches) are not transformed into parallel form.
- Dependence analysis can be conducted at the code, subroutine, process, task, and program levels; higher-level dependence can be inferred from that of the subordinate levels.

Example of parallelism using Bernstein’s conditions
P1: C = D * E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
Assuming no pipelining is used, five steps are needed for sequential execution.

Example of parallelism using Bernstein’s conditions
[Figure: dataflow graphs for P1–P5, comparing sequential execution in five time steps with parallel execution in three steps (P1 first; then P2, P3, and P5 concurrently; then P4).]

Example of parallelism using Bernstein’s conditions
- There are 10 pairs of statements to check against Bernstein’s conditions.
- Several pairs satisfy the conditions individually, but collectively only P2 || P3 || P5 is possible, because P2 || P3, P3 || P5, and P2 || P5 all hold pairwise.
- If two adders are available simultaneously, the parallel execution requires only three steps.
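The pairwise check can be mechanized. The C sketch below (mine, not from the paper) encodes each statement's read set and write set as a string of variable names and prints every pair that satisfies Bernstein's conditions; it reports P1 || P5, P2 || P3, P2 || P5, P3 || P5, and P4 || P5, and of these only P2, P3, and P5 are parallel with one another, which is why P2 || P3 || P5 is the largest group that can run concurrently.

    #include <stdio.h>
    #include <string.h>

    /* Return 1 if two sets of single-letter variable names share no element. */
    static int disjoint(const char *s, const char *t) {
        for (; *s; s++)
            if (strchr(t, *s))
                return 0;
        return 1;
    }

    /* Bernstein's conditions for Pi || Pj:
       Ii ∩ Oj = ∅,  Ij ∩ Oi = ∅,  Oi ∩ Oj = ∅                             */
    static int bernstein(const char *in_i, const char *out_i,
                         const char *in_j, const char *out_j) {
        return disjoint(in_i, out_j) &&
               disjoint(in_j, out_i) &&
               disjoint(out_i, out_j);
    }

    int main(void) {
        /* Read and write sets of the five statements above:
           P1: C = D * E   P2: M = G + C   P3: A = B + C
           P4: C = L + M   P5: F = G / E                                   */
        const char *in[]  = { "DE", "GC", "BC", "LM", "GE" };
        const char *out[] = { "C",  "M",  "A",  "C",  "F"  };

        for (int i = 0; i < 5; i++)
            for (int j = i + 1; j < 5; j++)
                if (bernstein(in[i], out[i], in[j], out[j]))
                    printf("P%d || P%d\n", i + 1, j + 1);
        return 0;
    }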

Implementation of parallelism
- Special hardware and software support is needed to implement parallelism.
- A distinction must be drawn between hardware parallelism and software parallelism.
- Parallelism cannot be achieved for free.

Hardware parallelism
- Often a function of cost and performance tradeoffs.
- A processor that issues k instructions per machine cycle is called a k-issue processor.
- A conventional processor takes one or more machine cycles to issue a single instruction: a one-issue processor.
- A multiprocessor system built with n k-issue processors should be able to handle a maximum of nk threads of instructions simultaneously.

Software parallelism
- Defined by the control and data dependences of programs.
- A function of the algorithm, the programming style, and compiler optimization.
- The two most cited types of parallel programming are:
  - Control parallelism: appears in the form of pipelining and multiple functional units.
  - Data parallelism: similar operations are performed over many data elements by multiple processors; practiced in SIMD and MIMD systems.

Hardware vs. Software parallelism
Software parallelism:
- Eight instructions in total: 4 loads (L), 2 multiplies (X), 1 addition (+), and 1 subtraction (-), producing the results A and B.
- Theoretically, the computation can be accomplished in 3 cycles (steps): the four loads in Step 1, the two multiplies in Step 2, and the add and subtract in Step 3.
[Figure: software-parallelism dataflow over Steps 1–3, with the four loads feeding the two multiplies, whose results feed the add and the subtract that produce A and B.]
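The figure behind this slide is not reproduced here, but the stated instruction mix implies a computation of the following shape, sketched below with illustrative operand names and values (it is assumed that the add and the subtract each combine the two products):

    #include <stdio.h>

    int main(void) {
        /* Step 1: the four loads (placeholder values). */
        double p = 1.0, q = 2.0, r = 3.0, s = 4.0;

        /* Step 2: the two multiplies. */
        double x1 = p * q;
        double x2 = r * s;

        /* Step 3: the add and the subtract produce A and B. */
        double A = x1 + x2;
        double B = x1 - x2;

        printf("A = %g, B = %g\n", A, B);
        return 0;
    }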

Hardware vs. Software parallelism
Hardware parallelism (Example 1):
- The same computation is executed by a 2-issue processor that can perform one memory access and one arithmetic operation simultaneously.
- The computation now needs 7 cycles (steps): a mismatch between hardware and software parallelism.
[Figure: the eight instructions scheduled on the 2-issue processor over Steps 1–7.]

Hardware vs. Software parallelism
Hardware parallelism (Example 2):
- The same computation is executed on a dual-processor system in which each processor is single-issue.
- 6 cycles are needed to execute the resulting 12 instructions: 2 store operations and 2 load operations are inserted for inter-processor communication through the shared memory.
- The S instructions are the added stores for inter-processor communication.
[Figure: the two processors' instruction schedules over Steps 1–6, producing A on one processor and B on the other.]