GC16/3011 Functional Programming
Lecture 22: The Four-Stroke Reduction Engine

Contents
- Motivation
- Model for Parallel Graph Reduction
- Parallelism and Tasks
- FSRE representation, synchronisation and scheduling
- Two-stroke reduction
- Four-stroke reduction
- Summary

Motivation
- Previously: abstract/theoretical treatment
- This lecture: a real graph reducer
- Details of the Four-Stroke Reduction Engine (FSRE)

Model for PGR
[diagram: a shared graph and a shared task pool, with agents taking tasks from the pool]

Each task:
- has access to any part of the graph
- performs reductions in normal order
- reduces a subgraph to (weak head) normal form
  - overwrites the root node of the redex (with an indirection to the result) as an indivisible operation, as sketched below
  - then simply "dies"
- may anticipate the need for the value of a subgraph
  - places a task for that subgraph in the task pool ("sparking")
- is executed by an agent (a physical processor)
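
The indivisible root update is the one write other tasks can observe. Below is a minimal Haskell sketch with a made-up node type; `atomicWriteIORef` stands in for whatever atomic store a real engine would use.

```haskell
import Data.IORef

-- Sketch (representation and names are mine, not FSRE's) of the
-- indivisible update: the root of the redex is overwritten with an
-- indirection to the result in a single atomic store.
data Node = AppNode (IORef Node) (IORef Node)  -- application cell
          | Ind (IORef Node)                   -- indirection to the result
          | Value Int                          -- evaluated constant

finish :: IORef Node -> IORef Node -> IO ()
finish root result = atomicWriteIORef root (Ind result)

main :: IO ()
main = do
  result <- newIORef (Value 42)
  dummy  <- newIORef (Value 0)
  root   <- newIORef (AppNode dummy dummy)
  finish root result   -- other tasks now reach the result via the IND
```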

Parallelism and tasks
- Sparking can be conservative or speculative
  - Speculative sparking needs careful management
  - The FSRE uses conservative sparking
- For (e1 e2), e1 may not yet be evaluated
  - So e1 could be evaluated in parallel with e2
  - This extends to many arguments evaluated in parallel
  - But only those we know will be needed
- Parallelism annotations advise when and what to spark (see the sketch below)
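
For comparison, GHC's `par` combinator (from the `parallel` package) expresses exactly this kind of conservative spark: in the sketch below, `x` is sparked only because `(+)` is known to need it. This is a modern analogue, not the FSRE's own mechanism.

```haskell
import Control.Parallel (par, pseq)  -- from the "parallel" package

-- Because (+) needs both arguments, sparking x is conservative:
-- the parallel work is never wasted.
f :: Integer -> Integer -> Integer
f x y = x `par` (y `pseq` (x + y))

main :: IO ()
main = print (f (fib 33) (fib 34))
  where
    fib :: Integer -> Integer
    fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)
```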

We want to detect parallelism in three cases:
- f x y = x + y
  - f will always evaluate x and y
  - Could annotate either the function f or the application nodes ((f x) y)
- ((if e1 f g) e2)
  - We don't know which function is used until runtime
  - So annotate the functions
- f x y = y 3 x
  - f is not strict in x if y doesn't use x
  - But for the application (f e +) the expression e WILL be used
  - So annotate the application nodes (see the sketch below)
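
A hypothetical pair of data types (names are mine, not from the FSRE) makes the two annotation sites concrete: strictness recorded on the function versus a spark flag on each application node.

```haskell
-- Case 1/2: per-argument strictness annotating the function itself.
data Fun = Fun
  { strictIn :: [Bool]   -- is the function strict in each argument?
  , body     :: Expr
  }

-- Case 3: a spark flag annotating an individual application node.
data Expr
  = AppN Bool Expr Expr  -- Bool: spark the argument at this node?
  | FunRef Fun           -- reference to a (possibly annotated) function
  | Var String
  | Const Int
```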

FSRE representation
- A node (or cell) has a tag, a left field and a right field
- Tags denote application, lambda, constant, parallelism annotations, "paint" (see later), etc.
- A "task" is just two pointers (B and F)
- Graph traversal is achieved using pointer reversal (no stack required)
- The current state of a suspended task is held in the graph itself
- Reversed pointers are made inaccessible to other tasks (because the nodes are "painted"; see later)
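
A rough Haskell rendering of this layout, with invented names; a real engine packs tags and fields into machine words, but the shape is the same.

```haskell
import Data.IORef

-- One tag plus two fields per cell; a "painted" variant of each tag
-- marks a node that some task currently owns.
data Tag = AppT | LamT | ConstT | IndT | SparkT   -- ordinary tags
         | Painted Tag                            -- painted version
  deriving Show

data Cell = Cell
  { tag   :: IORef Tag
  , left  :: IORef Field
  , right :: IORef Field
  }

data Field = PtrF Cell | IntF Int | NameF String

-- A task is nothing more than two pointers: B runs back up the
-- reversed spine, F points forward at the node being examined.
data Task = Task { backB :: IORef Field, fwdF :: IORef Field }
```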

FSRE synchronisation
- What if two tasks attempt to evaluate a common subgraph?
- Mutual exclusion is not required, but is desirable to prevent duplicated work
[diagram: two tasks (*) meeting at a shared application node (@) of function g]

FSRE synchronisation (2)
- As a task traverses the graph, it "paints" every node it is working on (using special versions of the tags)
- After working on a section of the graph, it "unpaints" those nodes
- If a task attempts to access a node that has been "painted" by another task, it blocks until the node is unpainted
- Tasks are blocked and later resumed with no explicit communication between agents or tasks
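
Painting acts as a per-node lock encoded in the tag. As a minimal analogue (an assumption of this sketch, not the FSRE's implementation), an `MVar` holding the node's contents gives the same block-until-unpainted behaviour.

```haskell
import Control.Concurrent.MVar

-- A task takes the node's contents while it works; any other task
-- touching the node blocks until the contents are put back.
type NodeLock a = MVar a

paint :: NodeLock a -> IO a          -- blocks if another task holds the node
paint = takeMVar

unpaint :: NodeLock a -> a -> IO ()  -- restoring the contents wakes a waiter
unpaint = putMVar
```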

FSRE synchronisation (3)
- A task (the parent) sparks a subtask (the child) to evaluate a subgraph
- Later, the parent accesses the subgraph to get its value. The subgraph may be in one of three states:
  - Already evaluated: the parent uses the value
  - Being evaluated: the subgraph is "painted" and the parent blocks until it is "unpainted"
  - Evaluation not yet started: the parent evaluates the subgraph itself ("painting" the nodes), and the child will later block or die
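
A hypothetical demand loop makes the three cases explicit. This sketch polls where the real engine would block on the painted node's queue, and its state test is not atomic as the real one must be.

```haskell
import Data.IORef

data NodeState = Evaluated Int        -- child finished
               | PaintedSt            -- someone is evaluating it now
               | Unevaluated (IO Int) -- not started: here is the work

demand :: IORef NodeState -> IO Int
demand ref = do
  st <- readIORef ref
  case st of
    Evaluated v    -> return v      -- use the child's result
    PaintedSt      -> demand ref    -- stand-in for blocking on the node
    Unevaluated go -> do            -- parent does the work itself;
      writeIORef ref PaintedSt      -- the child will later block or die
      v <- go
      writeIORef ref (Evaluated v)
      return v
```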

FSRE synchronisation (4)
- A task is blocked when it accesses a "painted" node:
  - It is placed on a queue of blocked tasks
  - The queue is attached to the node that caused the block
  - It uses the reversed pointer, so there is no extra memory overhead!
- When the node is "unpainted", all tasks in the queue for that node are sent back to the task pool
- Block on unwind, resume on rewind
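
One way to picture the queue-in-the-node trick (the representation here is illustrative, not the FSRE's actual layout):

```haskell
-- A painted node carries its queue of blocked tasks in the field freed
-- by pointer reversal, so blocking needs no extra memory.
data Task = Task Int            -- stand-in for a (B, F) pointer pair

data Node a
  = Clean a                     -- ordinary, unpainted node
  | PaintedNode a [Task]        -- contents plus the queue of blocked tasks

-- Unpainting hands every blocked task back to be re-enqueued in the pool.
unpaint :: Node a -> (Node a, [Task])
unpaint (PaintedNode x q) = (Clean x, q)
unpaint n                 = (n, [])
```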

[diagrams: a task's pointers F and B' during spine traversal, and a blocked task (B', F') held on the queue Q of a painted node]

FSRE scheduling
- Too many sparked tasks and the task pool fills up:
  - Ignore newly sparked tasks!
  - Discard already-sparked tasks!
  - (Safe: parents always check on their children and do the work themselves if a child doesn't)
- NB: RESUMED tasks can never be ignored or discarded (a resumed task may itself be a parent)
- Always schedule resumed tasks first
- Switch between LIFO and FIFO scheduling to control the amount of parallelism (less/more) in the system (see the sketch below)
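
A sketch of the pool discipline, with an invented capacity bound and task representation:

```haskell
import Data.Sequence (Seq, (<|), (|>))
import qualified Data.Sequence as Seq

data Task = Resumed Int | Sparked Int   -- payload abstracted to Int

capacity :: Int
capacity = 1024   -- hypothetical bound

-- Resumed tasks are always admitted and taken first; fresh sparks are
-- dropped when the pool is full, which is safe because the parent will
-- redo the work itself.
enqueue :: Bool -> Task -> Seq Task -> Seq Task
enqueue _ t@(Resumed _) pool = t <| pool          -- never dropped, runs first
enqueue lifo t@(Sparked _) pool
  | Seq.length pool >= capacity = pool            -- pool full: ignore the spark
  | lifo                        = t <| pool       -- LIFO: less parallelism
  | otherwise                   = pool |> t       -- FIFO: more parallelism
```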

Two-stroke reduction
- "Inlet"
  - Unwind down the spine to find the leftmost outermost function
  - Use pointer reversal and "paint" the nodes
  - If parallelism annotations are found in application nodes, spark tasks to evaluate those arguments
  - The task might block on the way down, so don't remember the arguments
  - If the leftmost outermost function is a lambda (or a primitive with no strict arguments), use two-stroke reduction; if it is a primitive operator, use four-stroke reduction
- "Exhaust"
  - Get the parallelism information and the number of arguments
  - Rewind (and unpaint) up the spine to find the root of the redex
  - Overwrite the root with an IND (indirection) to the result of the reduction
  - Then go to "Inlet" again! (See the functional sketch below.)
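
A purely functional sketch of the two strokes: the list of collected arguments plays the role of the reversed spine pointers, and wrapping the result in an indirection stands in for overwriting the root in place.

```haskell
data Expr = App Expr Expr | Lam String Expr | Var String
          | Const Int | Ind Expr

-- Inlet: unwind down the spine to the leftmost outermost function,
-- collecting the arguments passed on the way (no separate stack).
unwind :: Expr -> [Expr] -> (Expr, [Expr])
unwind (App f a) args = unwind f (a : args)
unwind (Ind e)   args = unwind e args
unwind hd        args = (hd, args)

-- Exhaust: beta-reduce and make the root an indirection to the result.
twoStroke :: Expr -> Expr
twoStroke e = case unwind e [] of
  (Lam x body, a : rest) -> Ind (foldl App (subst x a body) rest)
  _                      -> e   -- already in (weak head) normal form

-- Substitution; capture-avoiding renaming is omitted for brevity.
subst :: String -> Expr -> Expr -> Expr
subst x a (Var y)   | x == y = a
subst x a (App f b)          = App (subst x a f) (subst x a b)
subst x a (Ind b)            = Ind (subst x a b)
subst x a (Lam y b) | x /= y = Lam y (subst x a b)
subst _ _ e                  = e
```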

Four-stroke reduction
- "Inlet" – same as before
- "Compression"
  - Get the parallelism information and the number of strict arguments
  - Rewind (and unpaint) up the spine to the topmost strict argument, sparking the strict arguments on the way up
- "Power"
  - Unwind (and paint) the spine again, checking the evaluation of all strict arguments one at a time
- "Exhaust" – same as before (sequential sketch below)
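
Extending the same idea to a strict primitive gives a sequential stand-in for the four strokes. Here `force` evaluates the strict arguments one at a time, where the real engine would spark them during Compression and merely check their results during Power. Constructor names are mine.

```haskell
data Expr = App Expr Expr | Prim String | Const Int | Ind Expr

unwind :: Expr -> [Expr] -> (Expr, [Expr])   -- Inlet
unwind (App f a) args = unwind f (a : args)
unwind (Ind e)   args = unwind e args
unwind hd        args = (hd, args)

force :: Expr -> Int                         -- Power: evaluate strict args
force e = case unwind e [] of
  (Const n, [])      -> n
  (Prim "+", [a, b]) -> force a + force b
  _                  -> error "stuck term"

fourStroke :: Expr -> Expr                   -- Exhaust: root becomes an IND
fourStroke e = Ind (Const (force e))

main :: IO ()
main = case fourStroke (App (App (Prim "+") (Const 2)) (Const 3)) of
  Ind (Const n) -> print n   -- prints 5
  _             -> error "unexpected result"
```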

Summary
- Motivation
- Model for Parallel Graph Reduction
- Parallelism and Tasks
- FSRE representation, synchronisation and scheduling
- Two-stroke reduction
- Four-stroke reduction