Published by Anthony Strickland. Modified over 9 years ago.
1
Stampede: A Cluster Programming Middleware for Interactive Stream-oriented Applications
Umakishore Ramachandran, Rishiyur Nikhil, James Matthew Rehg, Yavor Angelov, Arnab Paul, Sameer Adhikari, Kenneth Mackenzie, Nissim Harel, Kathleen Knobe
IEEE Transactions on Parallel and Distributed Systems, November 2003
2
Introduction
- New application domains: interactive vision, multimedia collaboration, animation
  - Interactive
  - Process temporal data
  - High computational requirements
  - Exhibit task & data parallelism
  - Dynamic: unpredictable at compile time
- Stampede: programming system to enable execution on SMPs/clusters
  - Support for task, data parallelism
  - Temporal data handling, buffer management
  - High level data sharing: space-time memory
3
Example: Smart Kiosk
- Public device for providing information, entertainment
- Interact with multiple people
- Capable of initiating interaction
- I/O: video cameras, microphones, touch screens, infrared, speakers, …
4
Kiosk application characteristics
- Tasks have different computational requirements
  - Higher level tasks may be more expensive
  - May not run as often (data dependent)
- Multiple (heterogeneous) time-correlated data sets
- Tasks have different priorities
  - e.g., interacting with customer vs. looking for new customers
- Input may not be accessed in strict order
  - e.g., skip all but most recent data
  - May need to re-analyze earlier data
- Claim: streams, lists not expressive enough
5
Space-time memory (STM)
- Distributed shared data structures for temporal data
  - STM channel: random access
  - STM queue: FIFO access
  - STM register: cluster-wide shared variable
- Unique system-wide names
- Threads attach, detach dynamically
- Threads communicate only via STM
6
STM channels
7
STM channel API
- Channels support bounded/unbounded size
- Separate API for typed access, hooks for marshalling, unmarshalling
- Timestamp wildcards
  - Request newest/oldest item in channel
  - Newest value not previously read
- Get/put
  - Blocking/nonblocking operation
  - Timestamps can be out of order
  - Copy-in, copy-out semantics
  - Get can be called on an item 0 to #conn times
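The channel semantics listed above can be illustrated with a toy, single-process model. This is only a sketch: `ToyChannel`, the wildcard constants, and the connection ids are hypothetical names, not the Stampede API, and it models only non-blocking, unbounded operation. "Newest value not previously read" is assumed to mean "the newest item, unless this connection has already read it".

```python
import copy

# Hypothetical timestamp wildcards (not the real Stampede constants).
NEWEST = "newest"
OLDEST = "oldest"
NEWEST_UNSEEN = "newest_unseen"


class ToyChannel:
    """Toy model of a timestamp-indexed channel with random access."""

    def __init__(self):
        self._items = {}  # timestamp -> stored copy of the item
        self._read = {}   # connection id -> set of timestamps already read

    def put(self, ts, value):
        # Copy-in: the channel keeps its own copy; timestamps may arrive
        # in any order.
        self._items[ts] = copy.deepcopy(value)

    def get(self, conn, ts):
        """Non-blocking get; returns (timestamp, value) or None."""
        if not self._items:
            return None
        seen = self._read.setdefault(conn, set())
        if ts == NEWEST:
            ts = max(self._items)
        elif ts == OLDEST:
            ts = min(self._items)
        elif ts == NEWEST_UNSEEN:
            ts = max(self._items)
            if ts in seen:
                return None
        if ts not in self._items:
            return None
        seen.add(ts)
        # Copy-out: the caller receives a private copy, and the item stays
        # in the channel for other connections.
        return ts, copy.deepcopy(self._items[ts])
```

Because a get does not remove the item, each connection may read a given timestamp, or skip it entirely, which matches the "0 to #conn times" bullet.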
8
STM queue
- Supports data parallelism
- Put/get behave as enqueue/dequeue
  - Get: items retrieved exactly once
  - Put: multiple items w/ same timestamp can be added
- Used for partitioning data items (regions in frame)
- Runtime adds ticket for unique id
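A minimal sketch of the queue behavior described above, assuming a single process and non-blocking calls. `ToyQueue` and its methods are hypothetical names for illustration; the ticket is modeled as a simple running counter, which is an assumption about how the runtime disambiguates items sharing a timestamp.

```python
from collections import deque
from itertools import count


class ToyQueue:
    """Toy FIFO in which several items may carry the same timestamp."""

    def __init__(self):
        self._fifo = deque()
        self._tickets = count()  # assumed ticket scheme: a running counter

    def put(self, ts, value):
        # The ticket makes (ts, ticket) unique even when ts repeats.
        ticket = next(self._tickets)
        self._fifo.append((ts, ticket, value))
        return ticket

    def get(self):
        # Dequeue: each item is handed to exactly one caller.
        return self._fifo.popleft() if self._fifo else None


# Example: one frame (timestamp 7) split into three regions for workers.
q = ToyQueue()
for region in ("top", "middle", "bottom"):
    q.put(7, region)
```

Each worker that calls `get()` receives a different region of the same frame, which is how a queue supports data parallelism.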
9
Garbage collection
- How to determine if an STM item is no longer needed?
  - Consume API call indicates this for a connection
- Queues
  - Items have implicit reference count of 1
  - GC after consume
- Channels
  - Number of consumers unknown
    - Threads can skip items
    - New connections can be created dynamically
  - Reachability via timestamps
    - GC if item cannot be accessed by any current or future connection
    - System: item not GCed until marked consumed by all connections
    - Application: must mark each item consumed (can mark timestamp ranges)
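The channel rule "not GCed until marked consumed by all connections" can be sketched as follows. `GcChannel` is a hypothetical toy class, not Stampede code. As a simplification it only marks items that are already present when `consume` is called, and it ignores future connections.

```python
class GcChannel:
    """Toy channel that frees an item once every connection consumed it."""

    def __init__(self):
        self.items = {}     # timestamp -> value
        self.consumed = {}  # connection id -> set of consumed timestamps

    def attach(self, conn):
        self.consumed[conn] = set()

    def put(self, ts, value):
        self.items[ts] = value

    def consume(self, conn, lo, hi=None):
        """Mark timestamps lo..hi (inclusive) as consumed for one connection."""
        hi = lo if hi is None else hi
        self.consumed[conn].update(ts for ts in self.items if lo <= ts <= hi)
        self._collect()

    def _collect(self):
        # An item is garbage only when all attached connections consumed it.
        dead = [ts for ts in self.items
                if all(ts in marks for marks in self.consumed.values())]
        for ts in dead:
            del self.items[ts]
            for marks in self.consumed.values():
                marks.discard(ts)
```

With two connections attached, an item survives a consume from one of them and is dropped only after the second connection consumes it too.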
10
GC and timestamps
- Threads propagate input timestamps to output
- Threads at data source (e.g. camera) generate timestamps
- Virtual time: per thread, application specific (e.g. frame number)
- Visibility: per thread, minimum of virtual time & item timestamps from all connections
  - Put: item timestamp >= visibility
  - Create thread: child virtual time >= visibility
  - Attach: items < visibility implicitly consumed
  - Set virtual time: any value >= visibility; either infinity, or the thread must guarantee advancement
- Global minimum timestamp, ts_min. Minimum of:
  - Virtual time of all threads
  - Timestamps of items on all queues
  - Timestamps of unconsumed items on all input connections of all channels
- Items with timestamps < ts_min can be garbage collected
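The visibility and ts_min definitions above reduce to a few minimum computations. The sketch below assumes timestamps are plain numbers and that the runtime can enumerate the three inputs; the function names are hypothetical, and `math.inf` stands in for an "infinity" virtual time.

```python
import math


def visibility(virtual_time, input_item_timestamps):
    """Per-thread visibility: min of virtual time and input item timestamps."""
    return min([virtual_time, *input_item_timestamps])


def global_min_timestamp(thread_virtual_times,
                         queue_item_timestamps,
                         unconsumed_channel_timestamps):
    """ts_min: the minimum over the three sources listed on the slide."""
    return min([*thread_virtual_times,
                *queue_item_timestamps,
                *unconsumed_channel_timestamps],
               default=math.inf)


def collectable(item_timestamps, ts_min):
    """Items strictly older than ts_min can no longer be reached."""
    return [ts for ts in item_timestamps if ts < ts_min]
```

For example, with thread virtual times 10 and infinity, a queue item at timestamp 8, and unconsumed channel items at 9 and 11, ts_min is 8, so any item with a timestamp below 8 is collectable.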
11
Code samples
12
People tracker for Smart Kiosk
- Track multiple moving targets based on color
- Goals: low latency, keep up with frame rate
- Application: color-based tracking (figure labels: Model 1, Model 2)
13
Mapping to Stampede
- Expected bottleneck: target detection
- Data parallelize by color models, frame regions (horizontal stripes)
- Placement on cluster
  - 1 node: all threads except inner DPS
  - N nodes: 1 inner DPS each
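The two-way decomposition above (by color model and by horizontal stripe) can be sketched as a work-item generator. The function names, the frame height of 480 rows, and the stripe count are assumptions for illustration; the slides do not give the stripe count used in the tracker.

```python
def stripe_bounds(height, n_stripes):
    """Split rows [0, height) into n_stripes contiguous horizontal stripes."""
    base, extra = divmod(height, n_stripes)
    bounds, start = [], 0
    for i in range(n_stripes):
        rows = base + (1 if i < extra else 0)  # spread leftover rows evenly
        bounds.append((start, start + rows))
        start += rows
    return bounds


def work_items(n_models, height, n_stripes):
    """One (model, first_row, end_row) item per model and stripe."""
    return [(m, r0, r1)
            for m in range(n_models)
            for (r0, r1) in stripe_bounds(height, n_stripes)]


# Assumed example: 8 color models, a 480-row frame, 4 stripes per frame.
items = work_items(8, 480, 4)
```

Each item is independent of the others, so items could be placed on a queue and picked up by whichever worker is free.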
14
Color tracking results
- Setup: 17 node cluster (Dell 8450s)
  - 8 CPUs/node: 550 MHz P3 Xeon
  - 4 GB memory/node
  - 2 MB L2 cache/CPU
  - Gigabit ethernet
  - OS: Linux
  - Stampede used CLF messaging
- Data: 1 MB/frame @ 30 fps, 8 models
- Bottleneck was histogram thread
15
Application: video textures
- Batch video processing: generate video loop from set of frames
  - Randomly transition between computed cut points, or create loop of specified length
- Calculate best places to cut: pairwise frame comparison
  - Comparisons independent: lots of parallelism
- Problem: data distribution (don’t send every frame everywhere)
16
Mapping to Stampede
[Figure label: Cluster nodes]
17
Decentralized data distribution
- Fetches all images
- Fetches a subset and reuses images: “tiling with chaining”
18
Stripe size experiment
- Tune image comparison for L2 cache size
  - Compare image regions rather than whole images
  - Find stripe size (#rows) s.t. comparisons fit in cache
- Measure single node speedup as a function of stripe size, number of worker threads
- Setup: cluster as before
- Data: 316 frames, 640x480, 24 bit color (~900 KB)
  - comparisons = N(N-1)/2 = 49770
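The figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative, and `stripe_pair_bytes` is a hypothetical helper that only counts the pixel data of the two stripes being compared; it ignores anything else that competes for the cache.

```python
n_frames = 316
width, height = 640, 480
bytes_per_pixel = 3  # 24-bit color

# Size of one frame: 640 * 480 * 3 bytes = 921600 bytes = 900 KB.
frame_bytes = width * height * bytes_per_pixel
frame_kb = frame_bytes / 1024

# Every unordered pair of frames is compared once: N(N-1)/2.
comparisons = n_frames * (n_frames - 1) // 2


def stripe_pair_bytes(rows):
    """Pixel bytes touched when comparing two stripes of `rows` rows each."""
    return 2 * rows * width * bytes_per_pixel


print(frame_bytes, frame_kb, comparisons)
```

Shrinking `rows` shrinks the working set of a single comparison proportionally, which is the knob the stripe size experiment varies.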
19
Stripe size results
[Figure annotations: Memory bottleneck; (seconds); Whole image comparison]
20
Data distribution experiment
- Single-source vs. decentralized data distribution
- Measure speedup as a function of nodes, threads/node
- Tile size varies with number of nodes
  - Larger tiles: better compute/communication ratio
  - Smaller tiles: better load balancing
- Compare to algorithm-limited speedup
  - No communication costs
  - Shows effect of load imbalances
- Setup: as before
  - Full image comparisons
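The tile size tradeoff can be illustrated by tiling the set of frame pairs and counting, per tile, the comparisons performed against the frames that must be fetched. This is a generic square tiling written for illustration, with hypothetical function names; it is not necessarily the tiling scheme the authors used.

```python
def tiles(n_frames, tile):
    """Yield (rows, cols) index ranges that together cover all pairs i < j."""
    starts = range(0, n_frames, tile)
    for a in starts:
        for b in starts:
            if b >= a:  # upper triangle only, including diagonal tiles
                yield (range(a, min(a + tile, n_frames)),
                       range(b, min(b + tile, n_frames)))


def tile_cost(rows, cols):
    """Return (comparisons done, distinct frames needed) for one tile."""
    comparisons = sum(1 for i in rows for j in cols if i < j)
    frames = len(set(rows) | set(cols))
    return comparisons, frames


# 316 frames with 64-frame tiles: the tiles cover every pair exactly once.
total = sum(tile_cost(r, c)[0] for r, c in tiles(316, 64))
```

An off-diagonal tile of side T performs T*T comparisons on 2T fetched frames, so the compute/communication ratio grows with T, while fewer, larger tiles leave less room to balance load across nodes.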
21
Data distribution results
- Single source bottleneck: as #nodes ↑, communication time > computation time
- 1-thread vs. 8-thread performance: communication for initial tile fetch
  - No computation overlap