Extending GRAMPS Shaders
Jeremy Sugerman
June 2, 2009, FLASHG


2 Agenda
 GRAMPS Reminder (quick!)
 Reductions
 Reductions and more with GRAMPS Shaders

3 GRAMPS Reminder
[Diagram: a ray tracer pipeline (Camera → Intersect → Shade → FB Blend, linked by the Ray Queue, Ray Hit Queue, and Fragment Queue) beside a Map-Reduce pipeline (Initial Tuples → Produce → Intermediate Tuples → Combine (Optional) → Sort → Reduce → Final Tuples). Legend: Thread Stage, Shader Stage, Queue, Stage Output, Push Output.]

4 GRAMPS Shaders
 Facilitate data parallelism
 Benefits: auto-instancing, queue management, implicit parallelism, mapping to 'shader cores'
 Constraints: 1 input queue, 1 input element and 1 output element per queue (plus push)
 Effectively limits kernels to "map"-like usage

5 Reductions
 Central to Map-Reduce (duh), and to many parallel apps
 Strict form: sequential, requires arbitrary buffering
 – E.g., computing a median, depth-order transparency
 Associativity and commutativity enable parallel, incremental reductions
 – In practice, these cover many of the reductions actually used (all of Brook / GPGPU, most of Map-Reduce)

6 Logarithmic Parallel Reduction

7 Simple GRAMPS Reduction
[Diagram: Generate: 0..MAX → Barrier / Sum(Input) (Reduce) → Validate / Consume Data]
 Strict reduction
 All stages are threads, no shaders

8 Strict Reduction Program

sumThreadMain(GrEnv *env) {
    sum = 0;
    /* Block for entire input */
    GrReserve(inputQ, -1);
    for (i = 0; i < numPackets; i++) {
        sum += input[i];
    }
    GrCommit(inputQ, numPackets);
    /* Write sum to buffer or outputQ */
}

9 Incremental / Partial Reduction

sumThreadMain(GrEnv *env) {
    sum = 0;
    /* Consume one packet at a time */
    while (GrReserve(inputQ, 1) != NOMORE) {
        sum += input[0];
        GrCommit(inputQ, 1);
    }
    /* Write sum to buffer or outputQ */
}

Note: Still single threaded!

10 Shaders for Partial Reduction?
 Appeal:
 – Stream and GPU languages offer support
 – Take advantage of shader cores
 – Remove programmer boilerplate
 – Automatic parallelism and instancing
 Obstacles:
 – Where to store the partial / incremental result
 – Multiple input elements (spanning packets)
 – Detecting termination
 – Proliferation of stage / program types

11 Shader Enhancements
 Stage / kernel takes N inputs per invocation
 – Must handle receiving fewer than N (down to 1)
 Invocation reduces all input to a single output
 – Stored as an output key?
 GRAMPS can (will) merge input across packets
 – No guarantees on shared packet headers!
 Not a completely new type of shader
 General filtering, not just GPGPU reduce

12 GRAMPS Shader Reduction
[Diagram: Generate: 0..MAX → Barrier / Sum(Input) (Reduce) → Validate / Consume Data, with Sum's output queue cycling back into Sum]
 Combination of N:1 shader and graph cycle (in-place)
 Input "Queue" to Validate only gets NOMORE

13 Scheduling Reduction Shaders
 Highly correlated with graph cycles
 – Given a reduction, preempt upstream stages when under footprint pressure
 Free space in the input queue gates possible parallelism
 – 1/Nth of the queue free is the most that can be used
 – One free entry is the minimum required for forward progress
 Logarithmic versus linear reduction is entirely a scheduler / GRAMPS decision

14 Other Thoughts
 (As mentioned) Enables filtering. What else?
 How interesting are graphs without loops?
 Are there other alternatives? Would a separate "reduce" / "combine" stage be better?
 Questions?