Stampede: A Cluster Programming Middleware for Interactive Stream-Oriented Applications
Umakishore Ramachandran, Rishiyur Nikhil, James Matthew Rehg, Yavor Angelov, Arnab Paul, Sameer Adhikari, Kenneth Mackenzie, Nissim Harel, Kathleen Knobe
IEEE Transactions on Parallel and Distributed Systems, November 2003

Introduction
- New application domains: interactive vision, multimedia collaboration, animation
  - Interactive
  - Process temporal data
  - High computational requirements
  - Exhibit task and data parallelism
  - Dynamic: unpredictable at compile time
- Stampede: a programming system to enable their execution on SMPs and clusters
  - Support for task and data parallelism
  - Temporal data handling, buffer management
  - High-level data sharing: space-time memory

Example: Smart Kiosk
- Public device for providing information and entertainment
- Interacts with multiple people
- Capable of initiating interaction
- I/O: video cameras, microphones, touch screens, infrared, speakers, …

Kiosk application characteristics
- Tasks have different computational requirements
  - Higher-level tasks may be more expensive
  - May not run as often (data dependent)
- Multiple (heterogeneous) time-correlated data sets
- Tasks have different priorities
  - e.g., interacting with a customer vs. looking for new customers
- Input may not be accessed in strict order
  - e.g., skip all but the most recent data
  - May need to re-analyze earlier data
- Claim: streams and lists are not expressive enough

Space-time memory
- Distributed shared data structures for temporal data
  - STM channel: random access
  - STM queue: FIFO access
  - STM register: cluster-wide shared variable
- Unique system-wide names
- Threads attach and detach dynamically
- Threads communicate only via STM

STM channels

STM channel API
- Channels support bounded or unbounded size
- Separate API for typed access; hooks for marshalling/unmarshalling
- Timestamp wildcards
  - Request the newest/oldest item in the channel
  - Newest value not previously read
- Get/put (see the sketch below)
  - Blocking or non-blocking operation
  - Timestamps can be out of order
  - Copy-in, copy-out semantics
  - Get can be called on an item 0 to #connections times
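A minimal sketch of this style of channel access, assuming a hypothetical C interface: the names stm_attach, stm_get, stm_put, stm_consume, the STM_NEWEST_UNSEEN wildcard, and the application stubs are all invented for illustration and are not the actual Stampede API.

```c
/* Illustrative only: every stm_* name, type, and constant below is a
 * placeholder, not the real Stampede interface. */
#include <stddef.h>

typedef int ts_t;                          /* virtual timestamp */
typedef struct stm_conn stm_conn_t;        /* a thread's connection to a channel */

extern stm_conn_t *stm_attach(const char *channel_name);
extern int  stm_put(stm_conn_t *c, ts_t ts, const void *buf, size_t len);
extern int  stm_get(stm_conn_t *c, ts_t want, ts_t *got, void **buf, size_t *len);
extern void stm_consume(stm_conn_t *c, ts_t ts);

#define STM_NEWEST_UNSEEN (-1)             /* wildcard: newest item not yet read */

/* Application stubs. */
extern void  *analyze_frame(const void *frame, size_t len);
extern size_t result_len(const void *result);

/* One iteration of a pipeline thread: pull the newest unseen frame, process it,
 * and publish the result under the same timestamp so downstream threads can
 * correlate it with items from other streams. */
void process_one(stm_conn_t *frames_in, stm_conn_t *results_out)
{
    void  *frame;
    size_t len;
    ts_t   ts;

    /* Blocking get with a timestamp wildcard; items are copied out, so the
     * thread works on a private buffer (copy-in/copy-out semantics). */
    if (stm_get(frames_in, STM_NEWEST_UNSEEN, &ts, &frame, &len) != 0)
        return;

    void *result = analyze_frame(frame, len);

    /* Copy-in put under the input's timestamp; timestamps put into a channel
     * need not arrive in order. */
    stm_put(results_out, ts, result, result_len(result));

    /* Mark the input item consumed on this connection so the runtime can
     * garbage collect it once every connection has done the same. */
    stm_consume(frames_in, ts);
}
```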

STM queue
- Supports data parallelism
- Put/get behave as enqueue/dequeue
  - Get: items are retrieved exactly once
  - Put: multiple items with the same timestamp can be added
- Used for partitioning data items (e.g., regions in a frame); see the sketch below
- The runtime adds a ticket to give each item a unique id
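A sketch of the data-parallel pattern the queue enables, in the same hypothetical style as the channel sketch above; the stm_q_* names, the stripe type, and the detection stub are assumptions, not the paper's code.

```c
/* Placeholder queue and channel interfaces, matching the sketch above. */
#include <stddef.h>

typedef int ts_t;
typedef struct stm_queue stm_queue_t;
typedef struct stm_conn  stm_conn_t;

extern int stm_q_put(stm_queue_t *q, ts_t ts, const void *buf, size_t len);
extern int stm_q_get(stm_queue_t *q, ts_t *ts, void **buf, size_t *len);
extern int stm_put(stm_conn_t *c, ts_t ts, const void *buf, size_t len);

/* Application-level types and stubs. */
typedef struct { int first_row, last_row; const void *pixels; } stripe_t;
extern stripe_t frame_stripe(const void *frame, int r, int nstripes);
extern void     detect_in_stripe(const stripe_t *s, void *partial_result);

/* Producer: split one frame into nstripes horizontal stripes and enqueue each
 * under the same timestamp. Multiple items may share a timestamp; the runtime's
 * ticket keeps each item distinct. */
void put_stripes(stm_queue_t *work, ts_t ts, const void *frame, int nstripes)
{
    for (int r = 0; r < nstripes; r++) {
        stripe_t s = frame_stripe(frame, r, nstripes);
        stm_q_put(work, ts, &s, sizeof s);
    }
}

/* Worker: each get dequeues a distinct stripe exactly once, so adding worker
 * threads (or nodes) adds data parallelism without changing the producer. */
void worker_loop(stm_queue_t *work, stm_conn_t *partials_out)
{
    void  *item;
    size_t len;
    ts_t   ts;
    char   partial[256];                   /* illustrative fixed-size result */

    while (stm_q_get(work, &ts, &item, &len) == 0) {
        detect_in_stripe((const stripe_t *)item, partial);
        stm_put(partials_out, ts, partial, sizeof partial);
    }
}
```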

Garbage collection
- How to determine that an STM item is no longer needed?
  - The consume API call indicates this for a connection
- Queues
  - Items have an implicit reference count of 1
  - GC after consume
- Channels
  - The number of consumers is unknown
    - Threads can skip items
    - New connections can be created dynamically
  - Reachability is determined via timestamps
    - GC an item if it cannot be accessed by any current or future connection
  - System: an item is not GCed until it is marked consumed by all connections
  - Application: must mark each item consumed (can mark timestamp ranges)

GC and timestamps
- Threads propagate input timestamps to their outputs
- Threads at a data source (e.g., a camera) generate timestamps
- Virtual time: per thread, application specific (e.g., frame number)
- Visibility: per thread, the minimum of the thread's virtual time and the item timestamps from all of its connections
- Rules
  - Put: item timestamp >= visibility
  - Create thread: child's virtual time >= visibility
  - Attach: items with timestamps < visibility are implicitly consumed
  - Set virtual time: any value >= visibility; either infinity or the thread must guarantee advancement
- Global minimum timestamp, ts_min: the minimum of
  - the virtual times of all threads
  - the timestamps of items on all queues
  - the timestamps of unconsumed items on all input connections of all channels
- Items with timestamps < ts_min can be garbage collected (see the sketch below)
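A sketch of the ts_min rule stated above. The bookkeeping structures (thread, queue, and connection lists) are placeholders for whatever state the runtime actually keeps; only the three minimizations follow the slide.

```c
/* Placeholder runtime bookkeeping; only the ts_min rule itself comes from the slide. */
#include <limits.h>

typedef int ts_t;

typedef struct thread_info { ts_t virtual_time; struct thread_info *next; } thread_info_t;
typedef struct item_info   { ts_t ts; int consumed; struct item_info *next; } item_info_t;
typedef struct queue_info  { item_info_t *items; struct queue_info *next; } queue_info_t;
typedef struct conn_info   { item_info_t *items; struct conn_info *next; } conn_info_t;

static ts_t min_ts(ts_t a, ts_t b) { return a < b ? a : b; }

/* Any item with timestamp < ts_min can never be requested again (puts, thread
 * creation, and virtual-time advances are all constrained to >= visibility),
 * so such items are garbage. */
ts_t compute_ts_min(thread_info_t *threads, queue_info_t *queues, conn_info_t *in_conns)
{
    ts_t m = INT_MAX;

    for (thread_info_t *t = threads; t; t = t->next)        /* 1. thread virtual times */
        m = min_ts(m, t->virtual_time);

    for (queue_info_t *q = queues; q; q = q->next)           /* 2. items on all queues */
        for (item_info_t *i = q->items; i; i = i->next)
            m = min_ts(m, i->ts);

    for (conn_info_t *c = in_conns; c; c = c->next)           /* 3. unconsumed items on   */
        for (item_info_t *i = c->items; i; i = i->next)       /* channel input connections */
            if (!i->consumed)
                m = min_ts(m, i->ts);

    return m;
}
```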

Code samples

People tracker for the Smart Kiosk
- Track multiple moving targets based on color
- Goals: low latency, keep up with the frame rate
- Application: color-based tracking against multiple color models (figure shows Model 1 and Model 2)

Mapping to Stampede
- Expected bottleneck: target detection
- Data parallelized by color models and by frame regions (horizontal stripes)
- Placement on the cluster
  - 1 node: all threads except the inner DPS
  - N nodes: 1 inner DPS each

Color tracking results
- Setup: 17-node cluster (Dell 8450s)
  - 8 CPUs/node: 550 MHz Pentium III Xeon
  - 4 GB memory/node
  - 2 MB L2 cache/CPU
  - Gigabit Ethernet
  - OS: Linux; Stampede used CLF messaging
- Data: one 30 fps video stream, 8 color models
- The bottleneck was the histogram thread

Application: video textures
- Batch video processing: generate a video loop from a set of frames
- Randomly transition between computed cut points, or create a loop of a specified length
- Calculate the best places to cut via pairwise frame comparison (see the sketch below)
- Comparisons are independent: lots of parallelism
- Problem: data distribution; don't send every frame everywhere
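The slide does not say how two frames are compared; a common choice for this kind of cut-point search is a per-pixel distance, sketched below under that assumption (the sum-of-squared-differences metric and the packed-RGB layout are illustrative, not necessarily what the application used).

```c
/* Hedged sketch of one pairwise frame comparison: sum of squared differences
 * over 24-bit packed-RGB frames. Metric and layout are assumptions. */
#include <stdint.h>
#include <stddef.h>

#define WIDTH  640
#define HEIGHT 480

/* A low distance between frames i and j suggests that jumping from the
 * neighborhood of i to the neighborhood of j makes a visually smooth cut. */
double frame_distance(const uint8_t *a, const uint8_t *b)
{
    double sum = 0.0;
    for (size_t p = 0; p < (size_t)WIDTH * HEIGHT * 3; p++) {
        double d = (double)a[p] - (double)b[p];
        sum += d * d;
    }
    return sum;
}
```

Because each (i, j) pair needs only frames i and j, the N(N-1)/2 comparisons can run on any node that holds those two frames; the data-distribution question is which node should hold which frames.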

Mapping to Stampede (figure: work mapped onto cluster nodes)

Decentralized data distribution
- Single-source scheme: fetches all images
- Decentralized scheme: fetches a subset and reuses images ("tiling with chaining"); see the sketch below
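The slide gives only the name "tiling with chaining"; the sketch below illustrates just the general tiling idea, that a block of the pairwise-comparison matrix needs only the frames spanning its rows and columns, so no node has to fetch every frame. The tile size and loop structure are assumptions, not the paper's scheme.

```c
/* Illustrative tiling of the N x N pairwise-comparison matrix (upper triangle).
 * A node assigned tile (bi, bj) needs only frames [bi*T, bi*T+T) and
 * [bj*T, bj*T+T), i.e. at most 2*T frames rather than all N. */
#include <stdio.h>

#define N 316          /* total frames, as in the experiment */
#define T 32           /* tile edge in frames; an illustrative choice */

int main(void)
{
    int  tiles = (N + T - 1) / T;
    long work  = 0;

    for (int bi = 0; bi < tiles; bi++) {
        for (int bj = bi; bj < tiles; bj++) {
            /* Frame ranges this tile touches. */
            int lo_i = bi * T, hi_i = (bi + 1) * T < N ? (bi + 1) * T : N;
            int lo_j = bj * T, hi_j = (bj + 1) * T < N ? (bj + 1) * T : N;

            /* Enumerate the i < j pairs inside the tile; diagonal tiles keep
             * only their upper-triangular part. */
            for (int i = lo_i; i < hi_i; i++)
                for (int j = (lo_j > i + 1 ? lo_j : i + 1); j < hi_j; j++)
                    work++;                 /* one frame-pair comparison */
        }
    }

    printf("comparisons covered by the tiling: %ld (expect N(N-1)/2 = %d)\n",
           work, N * (N - 1) / 2);
    return 0;
}
```

Presumably the "chaining" part orders tiles so that consecutive tiles share a frame range, matching the slide's "reuses images"; that ordering is not shown here.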

Stripe size experiment
- Tune the image comparison for the L2 cache size
  - Compare image regions (stripes) rather than whole images
  - Find the stripe size (number of rows) such that comparisons fit in cache
- Measure single-node speedup as a function of stripe size and number of worker threads
- Setup: cluster as before
- Data: 316 frames, 640x480, 24-bit color (~900 KB per frame)
- Comparisons: N(N-1)/2 = 49770 (arithmetic checked in the sketch below)
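A quick arithmetic check of the numbers on this slide, plus the working-set argument behind striping; the assumption that one comparison streams through two full frames at a time is mine, not the slide's.

```c
/* Back-of-the-envelope check: 316 frames, 640x480 at 24-bit color, pairwise
 * comparisons, 2 MB of L2 cache per CPU (all taken from the slides). */
#include <stdio.h>

int main(void)
{
    const int  N           = 316;
    const long row_bytes   = 640L * 3;           /* 3 bytes per pixel */
    const long frame_bytes = row_bytes * 480;    /* 921,600 B, i.e. 900 KB */
    const long l2_bytes    = 2L * 1024 * 1024;   /* 2 MB L2 per CPU */

    printf("pairwise comparisons N(N-1)/2 = %d\n", N * (N - 1) / 2);   /* 49770 */
    printf("frame size = %ld KB\n", frame_bytes / 1024);               /* 900 KB */

    /* Two whole frames occupy ~1800 KB, close to the entire 2 MB L2, so
     * whole-image comparison leaves little room and becomes memory-bound.
     * Comparing k-row stripes keeps the working set at 2*k*row_bytes. */
    printf("two whole frames = %ld KB (vs. %ld KB of L2)\n",
           2 * frame_bytes / 1024, l2_bytes / 1024);

    const int k = 64;                            /* an illustrative stripe height */
    printf("two %d-row stripes = %ld KB\n", k, 2 * k * row_bytes / 1024);
    return 0;
}
```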

Stripe size results (plot: run times in seconds; whole-image comparison hits the memory bottleneck)

Data distribution experiment
- Single-source vs. decentralized data distribution
- Measure speedup as a function of the number of nodes and threads per node
- Tile size varies with the number of nodes
  - Larger tiles: better compute-to-communication ratio
  - Smaller tiles: better load balancing
- Compare to the algorithm-limited speedup
  - Assumes no communication costs
  - Shows the effect of load imbalances
- Setup: as before; full image comparisons

Data distribution results
- Single source is a bottleneck: as the number of nodes increases, communication time exceeds computation time
- 1-thread vs. 8-thread gap: the communication for the initial tile fetch has no computation to overlap with