Extending GRAMPS Shaders
Jeremy Sugerman
PPL Retreat, 5 June 2009

Reminder: GRAMPS
GRAMPS is a run-time for producer-consumer parallelism built from stages and queues. The "Shader" stage is the primary tool for data-parallelism; it is effectively "Map".
–Benefits: auto-instancing, automatic queue management, an implicit parallelism model, and execution on 'shader cores'
–Constraints: 1 input queue, 1 input element per instance
[Graph legend: Thread Stage, Shader Stage, Queue, Stage Output, Push Output]
[Example graph: Map-Reduce as a GRAMPS pipeline, with Map, Split, Combine (Optional), and Sort stages carrying Initial Tuples through Intermediate Tuples to Final Tuples]
[Example graph: a ray tracer, Camera → Ray Queue → Intersect → Ray Hit Queue → Shade → Fragment Queue → FB Blend]

Map-Reduce Reductions
Reductions are central to Map-Reduce (duh) and to data-parallel apps generally.
–The strict form requires sequential processing and unlimited buffering. E.g., finding a median, depth-order transparency.
–Associativity and commutativity enable parallel, incremental reductions. In practice, these cover many of the reductions actually used (all of Brook / GPGPU, most of Map-Reduce).
[Diagram: logarithmic parallel reduction, a tree of partial reductions feeding a single Reduce node]

Reductions in GRAMPS
(Currently) requires Thread stages.
[Execution graph: Generate: 0..MAX → Sum(Input) (Reduce) → 'Barrier' queue → Validate / Consume Data]

Strict Reduction:

    sumThreadMain(GrEnv *env) {
        sum = 0;
        /* Block until the entire input is available */
        GrReserve(inputQ, -1);
        for (i = 0; i < numPackets; i++) {
            sum += input[i];
        }
        GrCommit(inputQ, numPackets);
        /* Write sum to a buffer or output queue */
    }

Note: even the incremental version (shown at the end of this transcript) is still single threaded! (Until / unless manually parallelized.)

Enabling GRAMPS Shaders for Partial Reductions
Appeal:
–Stream and GPU languages already allow reduce
–Important to utilize the shader cores
–Removes / simplifies programmer boilerplate
–Enables automatic parallelism and instancing
Obstacles:
–Where to store the partial / running result
–Handling multiple inputs (spanning packets)
–Detecting / signaling termination
–Avoiding a proliferation of stage types

Shader Environment Enhancements
–A stage / kernel takes N inputs at a time (and must handle fewer than N being available)
–An invocation reduces N:1 (stored as an output key?)
–GRAMPS can (and will) merge input across packets, so there are no guarantees about per-packet shared headers!
–The result is not a completely new stage type, and not just GPGPU reduce

Reduction Using Extended Shaders
Sum() is now a Shader stage: an N:1 shader and a graph cycle reduce in place, in parallel. The 'Barrier' queue only ever receives NOMORE, once Generate is done and Sum has reduced the data to a single item.
[Execution graph: Generate: 0..MAX → Sum(Input) (Reduce) with a loopback cycle, plus a 'Barrier' queue into Validate / Consume Data]

Scheduling Reduction Shaders
Reduction shaders are highly correlated with execution graph cycles.
–Scheduling heuristic for such cycles: pre-empt upstream stages to contain the working set.
Free space in the loopback queue caps parallelism:
–Maximal parallelism requires 1/Nth of the queue free.
–A single free slot is sufficient to guarantee forward progress.
Note: logarithmic versus linear reduction has become application transparent and entirely a GRAMPS run-time / scheduler decision.
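For concreteness, here is a minimal sketch of what the Sum() kernel might look like under these shader extensions. The entry-point shape and parameter names are assumptions invented for this sketch; the slides only specify the behavior (up to N inputs per invocation, one output, possibly fewer than N available):

    /* Hypothetical N:1 reduction shader body, not the actual GRAMPS API.
     * The run-time invokes many instances in parallel, each over up to N
     * elements, and routes each partial sum back through the graph cycle
     * until a single item remains. numInputs may be less than N near the
     * end of the stream. */
    void sumShaderMain(int numInputs, const int input[], int *output)
    {
        int sum = 0;
        for (int i = 0; i < numInputs; i++) {
            sum += input[i];
        }
        *output = sum;    /* exactly one output: the N:1 reduction */
    }

Because addition is associative and commutative, the run-time may group and order these invocations however it likes, which is exactly what lets the logarithmic-versus-linear choice become a scheduler decision.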
Incremental / Partial Reduction (the Thread-stage variant referenced above):

    sumThreadMain(GrEnv *env) {
        sum = 0;
        /* Consume one packet at a time */
        while (GrReserve(inputQ, 1) != NOMORE) {
            sum += input[0];
            GrCommit(inputQ, 1);
        }
        /* Write sum to a buffer or output queue */
    }

Concluding, Other Thoughts
–Data-parallel reductions are an important case; it is nice to include them by generalizing the existing Shader stage rather than adding a distinct Reduce stage.
–In addition to reductions, non-recursive filtering is also now possible. It remains to be seen how interesting that is.
–What are the "next" data-parallel operations after map and reduce? Are there more that run well on commodity many-core hardware?
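The filtering remark can be made concrete with the same machinery: an N-input shader with no loopback cycle that pushes only the elements passing a predicate. As before, this is a hypothetical sketch, with the entry-point shape and names invented for illustration:

    /* Hypothetical non-recursive filter: N inputs in, between 0 and N
     * outputs pushed, and no graph cycle, so nothing feeds back. */
    static int keepPositive(int x) { return x > 0; }

    int filterShaderMain(int numInputs, const int input[], int output[])
    {
        int numOutputs = 0;
        for (int i = 0; i < numInputs; i++) {
            if (keepPositive(input[i]))
                output[numOutputs++] = input[i];   /* push-style output */
        }
        return numOutputs;    /* anywhere from 0 to numInputs */
    }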