Massive Streaming Data Analytics: A Case Study with Clustering Coefficients
David Ediger, Karl Jiang, Jason Riedy, David A. Bader
Georgia Institute of Technology, Atlanta, GA, USA

STINGER Data Structure
Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation
General-purpose data structure for dynamic graphs
Efficient edge insertion/deletion (updates) with concurrent readers (analysis)

STINGER Data Structure
Array of linked lists, which may have empty slots (from deleting edges)
Additional stored info not in paper
Efficient updates
Concurrent reads (no locking)

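The "array of linked lists with empty slots" layout can be sketched as follows. This is a minimal, single-threaded illustration and not the STINGER implementation: the names `BlockedAdjacency`, `EdgeBlock`, `BLOCK_SIZE`, and `EMPTY` are hypothetical, and the lock-free concurrent reads mentioned on the slide are not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BLOCK_SIZE = 4   # hypothetical number of edge slots per block
EMPTY = -1       # marker for a slot freed by an edge deletion


@dataclass
class EdgeBlock:
    """One node of a vertex's linked list: a fixed-size array of edge slots."""
    slots: List[int] = field(default_factory=lambda: [EMPTY] * BLOCK_SIZE)
    next: Optional["EdgeBlock"] = None


class BlockedAdjacency:
    """Array (indexed by vertex) of linked lists of edge blocks."""

    def __init__(self, num_vertices: int) -> None:
        self.heads: List[Optional[EdgeBlock]] = [None] * num_vertices

    def insert(self, u: int, v: int) -> bool:
        """Insert directed edge u -> v; return False if it already exists."""
        free = None  # first empty slot seen, as (block, index)
        block = self.heads[u]
        while block is not None:
            for i, w in enumerate(block.slots):
                if w == v:
                    return False
                if w == EMPTY and free is None:
                    free = (block, i)
            block = block.next
        if free is None:
            # No reusable slot: prepend a fresh block to the list.
            new_block = EdgeBlock(next=self.heads[u])
            self.heads[u] = new_block
            free = (new_block, 0)
        blk, i = free
        blk.slots[i] = v
        return True

    def delete(self, u: int, v: int) -> bool:
        """Delete edge u -> v by marking its slot empty; the block stays allocated."""
        block = self.heads[u]
        while block is not None:
            for i, w in enumerate(block.slots):
                if w == v:
                    block.slots[i] = EMPTY
                    return True
            block = block.next
        return False

    def neighbors(self, u: int) -> List[int]:
        """Collect all occupied slots across the blocks of vertex u."""
        out: List[int] = []
        block = self.heads[u]
        while block is not None:
            out.extend(w for w in block.slots if w != EMPTY)
            block = block.next
        return out
```

Deleting an edge only marks its slot empty, so a later insertion can reuse the slot without allocating a new block. That is one plausible reading of why the slide describes updates as efficient.
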
Assumptions for parallelism
Single streaming source for inserts/deletes
Changes are scattered widely
– Batches are sufficiently independent
Analysis kernels have small range
– A graph change requires access only to local portions of the graph and affects only a small portion of the output

Case Study: Updating Clustering Coefficients
Clustering coefficients measure the density of closed triangles
One way of determining if a graph is a small-world graph

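As an illustration of "density of closed triangles", the sketch below uses the standard local clustering coefficient, 2·T(v) / (d(v)·(d(v) − 1)), where T(v) is the number of triangles through v and d(v) is its degree. The slide's own formula did not survive extraction, so this definition is an assumption. The function names and the dict-of-sets graph representation are also hypothetical.

```python
from itertools import combinations
from typing import Dict, Set

Graph = Dict[int, Set[int]]  # undirected graph: vertex -> set of neighbors


def local_clustering(adj: Graph, v: int) -> float:
    """Fraction of neighbor pairs of v that are themselves connected."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0  # no neighbor pairs, so no triangles are possible
    closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * closed / (d * (d - 1))


def triangles_closed_by(adj: Graph, u: int, v: int) -> int:
    """Triangles a new edge (u, v) would close: the common neighbors of u and v."""
    return len(adj[u] & adj[v])


# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 0.
example: Graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
```

Here `triangles_closed_by` reads only the neighbor sets of the two endpoints, which matches the assumption that a graph change touches only a local portion of the graph.
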
Bloom filter
Consider an edge list represented as a bit array (1 bit per edge) => O(n) storage space
A Bloom filter is a bit array with an arbitrary, smaller number of bits
A hash function maps a vertex to a specific bit
Small number of bits == high collision rate
To reduce false positives, use k independent hash functions to set multiple bits

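The scheme of k hash functions setting multiple bits can be sketched as below. This is a minimal illustration, not the code used in the study: the class name `BloomFilter` and its parameters are hypothetical, and the k hash functions are simulated by salting SHA-256 with the hash index, which is an assumption made for convenience.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size bit array with k hash functions; may report false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array, all zeros

    def _positions(self, vertex: int) -> Iterator[int]:
        # Simulate k hash functions by prefixing the hash index to the input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{vertex}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, vertex: int) -> None:
        """Set the k bits selected by the hash functions."""
        for pos in self._positions(vertex):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, vertex: int) -> bool:
        """True if all k bits are set: 'possibly present'. False means definitely absent."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(vertex))


# Summarize a neighbor list with far fewer bits than one per vertex.
neighbors = BloomFilter(num_bits=1024, num_hashes=3)
for v in (7, 42, 1000):
    neighbors.add(v)
```

A membership test such as `42 in neighbors` never gives a false negative. A positive answer can still be a collision, so it has to be treated as "possibly present".
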
Testbed
Massively multithreaded Cray XMT
– 64 Threadstorm processors, each running at 500 MHz
– Each has 128 hardware streams maintaining a thread context
– Context switches occur every cycle
– 512 GiB globally addressable shared memory (holds 2 billion vertices and 17 billion edges)
Synthetic data
– 16 million vertices, ~500 million edges