Integrating and Optimizing Transactional Memory in a Data Mining Middleware
Vignesh Ravi and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio
Outline
– Motivation
– Software Transactional Memory
– Shared-Memory Parallelization Schemes
– Transactional Locking II (TL2)
– Hybrid Replicated STM Scheme
– FREERIDE Processing Structure
– Experimental Results
– Conclusions
Motivation
– Availability of large data for analysis: on the scale of terabytes and petabytes
– Advent of multi-core and many-core architectures: Intel's Polaris 80-core chip, Larrabee many-core architecture
– Programmability challenge: coarse-grained parallelization gives insufficient performance; fine-grained parallelization is better left to experts
– Need for a transparent, scalable shared-memory parallelization technique
Software Transactional Memory (STM)
– Maps the notion of concurrent database transactions onto concurrent thread operations
– Programmer: identifies critical sections, tags them as transactions, launches multiple threads
– Transactions run as atomic, isolated operations
– Data races handled automatically
– Guarantees absence of deadlock!
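A minimal C++ sketch of this programming model: the critical section is tagged as a transaction and multiple threads are launched. The run_transaction wrapper is hypothetical (here backed by a single global lock only so the sketch behaves correctly); a real STM such as RSTM instead logs reads and writes optimistically and retries on conflict.

    #include <mutex>
    #include <thread>
    #include <vector>

    // Hypothetical wrapper standing in for an STM "atomic" block. NOT the RSTM
    // interface: a single global lock is used only so this sketch is correct;
    // a real STM tracks reads/writes and retries transactions on conflict.
    template <typename F>
    void run_transaction(F critical_section) {
        static std::mutex global;
        std::lock_guard<std::mutex> guard(global);
        critical_section();
    }

    static long shared_counter = 0;

    int main() {
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t) {
            threads.emplace_back([] {
                for (int i = 0; i < 1000; ++i) {
                    // The programmer only tags the critical section as a transaction;
                    // data races inside it are handled by the STM runtime.
                    run_transaction([] { ++shared_counter; });
                }
            });
        }
        for (auto& th : threads) th.join();
        return 0;
    }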
Contributions
FREERIDE Processing Structure (Framework for Rapid Implementation of Datamining Engines)

    {* Outer sequential loop *}
    While( ) {
        {* Reduction loop *}
        Foreach( element e ) {
            (i, val) = compute(e)
            RObj(i) = Reduc(RObj(i), val)
        }
    }

– Map-reduce is two-stage; FREERIDE is one-stage, with the intermediate structure exposed
– Better performance than map-reduce [Cluster '09]
– Middleware API: process each data instance, reduce the result into the reduction object, local combination from all threads if needed
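The pseudocode above is the generalized reduction structure. A minimal C++ sketch of that structure follows; the names compute, reduce, and ReductionObject are illustrative only, not the actual FREERIDE API.

    #include <utility>
    #include <vector>

    struct ReductionObject {
        std::vector<double> cells;                       // RObj(i) in the pseudocode
        explicit ReductionObject(std::size_t n) : cells(n, 0.0) {}
    };

    // compute(e): map an element to the reduction-object index it updates
    // and the value to fold in (illustrative rule only).
    std::pair<std::size_t, double> compute(double element, std::size_t num_cells) {
        std::size_t index = static_cast<std::size_t>(element) % num_cells;
        return {index, element};
    }

    // Reduc(RObj(i), val): an associative, commutative combination.
    double reduce(double current, double value) { return current + value; }

    void reduction_loop(const std::vector<double>& chunk, ReductionObject& robj) {
        for (double e : chunk) {                               // Foreach(element e)
            auto [i, val] = compute(e, robj.cells.size());
            robj.cells[i] = reduce(robj.cells[i], val);        // RObj(i) = Reduc(RObj(i), val)
        }
    }

    // The outer sequential loop (While) repeats reduction_loop over data chunks;
    // with multiple threads, each thread processes a chunk and the per-thread
    // reduction objects are combined locally if replication is used.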
Shared-Memory Parallelization Techniques
Context: FREERIDE (Framework for Rapid Implementation of Datamining Engines)
– Replication-based (lock-free): full-replication (f-r)
– Lock-based: full-locking, cache-sensitive locking (cs-l)
(Figure: full locking vs. cache-sensitive locking; each reduction element is paired with a lock)
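A sketch of how the two lock-based layouts differ, assuming a 64-byte cache line and a 1-byte spinlock; the sizes and field layout are illustrative, not the FREERIDE implementation.

    #include <atomic>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    // Full locking: one lock per reduction element, but locks and elements live
    // in separate arrays, so an update may touch two different cache lines.
    struct FullLocking {
        std::vector<std::mutex> locks;      // locks[i] guards elements[i]
        std::vector<double>     elements;
        explicit FullLocking(std::size_t n) : locks(n), elements(n, 0.0) {}
    };

    // Cache-sensitive locking: a small lock is packed into the same cache line
    // as the elements it guards, so locking and updating cost a single line.
    struct alignas(64) CacheLineGroup {
        std::atomic_flag lock = ATOMIC_FLAG_INIT;  // 1-byte spinlock for this group
        double elements[7] = {};                   // 7 * 8 = 56 bytes; group fits in one 64-byte line
    };

    void update(CacheLineGroup& group, int slot, double val) {
        while (group.lock.test_and_set(std::memory_order_acquire)) { /* spin */ }
        group.elements[slot] += val;
        group.lock.clear(std::memory_order_release);
    }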
Motivation for STM Integration
– Potential downsides of existing schemes [CCGRID '09]:
  – Full-replication: very high memory requirements
  – Cache-sensitive locking: tuned for a specific cache architecture; risk of introducing bugs and deadlocks when porting
– Advantages of STM:
  – Leverage the large body of STM work: easier programmability, no deadlocks!
  – Provide transparent integration: the programmer need not deal with STM details
– What do we need? Use the easy programmability of STM while achieving competitive performance
Transactional Locking II (TL2)
– Word-based, lock-based STM algorithm
– Faster than non-blocking STM techniques
– API: STMBeginTransaction(), STMWrite(), STMRead(), STMCommit()
– We used the Rochester STM implementation (RSTM-TL2)
– Downside of STM: a large number of conflicts leads to a large number of aborts
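The reduction update RObj(i) = Reduc(RObj(i), val) expressed as a TL2-style transaction, using the API names listed on the slide. The exact signatures below are assumptions for illustration, not RSTM's actual interface.

    #include <cstddef>

    // Assumed signatures, for illustration only.
    void   STMBeginTransaction();
    double STMRead(double* addr);              // read is logged and validated at commit
    void   STMWrite(double* addr, double v);   // write is buffered until commit
    bool   STMCommit();                        // returns false on conflict (abort)

    // One reduction update, retried until the transaction commits.
    void transactional_update(double* robj, std::size_t i, double val) {
        for (;;) {
            STMBeginTransaction();
            double current = STMRead(&robj[i]);
            STMWrite(&robj[i], current + val);
            if (STMCommit()) break;   // an abort simply retries the transaction
        }
    }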
Optimization: Hybrid Replicated STM (rep-stm)
– Best of two worlds: replication and STM
– Replicated STM: group 'n' threads into 'm' groups; keep 'm' copies of the reduction object; each group of threads has a private copy; the n/m threads within a group share their copy using STM
– Advantages of replicated STM: fewer reduction-object copies, lower merge overhead, and fewer conflicts than pure STM
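A sketch of the replicated-STM layout: 'm' copies of the reduction object for 'n' threads, with the n/m threads in each group sharing one copy through STM transactions. Class and function names are illustrative, not the actual implementation.

    #include <cstddef>
    #include <vector>

    struct ReplicatedSTM {
        std::size_t num_groups;                          // m
        std::vector<std::vector<double>> copies;         // one reduction-object copy per group

        ReplicatedSTM(std::size_t m, std::size_t robj_size)
            : num_groups(m), copies(m, std::vector<double>(robj_size, 0.0)) {}

        // Each thread updates only its group's copy; the n/m threads sharing a
        // copy synchronize their updates through STM transactions (as in the
        // TL2-style transactional_update sketch above).
        std::vector<double>& copy_for(std::size_t thread_id) {
            return copies[thread_id % num_groups];
        }

        // Final merge combines only m copies rather than one per thread, so
        // memory use and merge overhead both drop relative to full replication.
        std::vector<double> merge() const {
            std::vector<double> result(copies[0].size(), 0.0);
            for (const auto& copy : copies)
                for (std::size_t i = 0; i < copy.size(); ++i) result[i] += copy[i];
            return result;
        }
    };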
Experimental Goals and Setup
Setup:
– Intel Xeon E5345 processors, two quad-core chips (8 cores), each core at 2.33 GHz
– 6 GB main memory, 8 MB L2 cache
Goals:
– Compare f-r, cs-l, TL2, and rep-stm for three data mining algorithms: k-means, Expectation Maximization (E-M), and Principal Component Analysis (PCA)
– Evaluate different read-write mixes
– Evaluate conflicts and aborts
Parallel Efficiency of PCA
– Principal Component Analysis, 8.5 GB data
– Best result: rep-stm (6.1x speedup)
– Observations: all techniques are competitive; STM overheads are 2.3%
– PCA specific: the computation for finding the co-variance matrix is high, which amortizes the cost of revalidation and lock acquire/release
Parallel Efficiency of EM
– Expectation-Maximization (EM), 6.4 GB data
– Best result: cs-l (~5x speedup)
– Observations: the STM schemes are competitive and show better scalability; the difference between stm-TL2 and rep-stm is not observed with 8 cores
– EM specific: the computation between updates is high; again, the initial overhead is high
Canonical Loop: Parallel Efficiency for Read-Write Mixes
– Canonical loop: a synthetic computation that follows the generalized reduction structure
– Different workloads with varying read/write mixes
– All results are from 8 threads
– Interesting: a different technique wins for each workload
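A sketch of what such a synthetic canonical loop can look like; the write_fraction parameter and the update rule are assumptions used only to show how the read/write mix can be varied, not the benchmark used in the experiments.

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // Generalized-reduction loop with a tunable read/write mix.
    void canonical_loop(std::vector<double>& robj,
                        const std::vector<double>& data,
                        double write_fraction) {            // e.g. 0.1 = 10% writes
        for (double e : data) {
            std::size_t i = static_cast<std::size_t>(e) % robj.size();
            if (static_cast<double>(std::rand()) / RAND_MAX < write_fraction) {
                robj[i] += e;                   // write: must go through a lock or an STM transaction
            } else {
                volatile double r = robj[i];    // read-only access to the reduction object
                (void)r;
            }
        }
    }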
Evaluation of Conflicts and Aborts
– Same canonical loop; compare the rate of aborts for stm-TL2 and rep-stm
– Demonstrates the advantage of rep-stm over stm-TL2 for large numbers of threads
– In all cases, for rep-stm: the rate of growth of aborts is much slower, and aborts are reduced by 40-55%
Conclusions
– Transparent use of STM schemes
– Developed hybrid replicated-STM to reduce memory requirements and conflicts/aborts
– TL2 and rep-stm are competitive with the highly-tuned locking scheme
– rep-stm significantly reduces the number of aborts compared with TL2
Thank You! Questions?
Contacts: Vignesh Ravi, Gagan Agrawal
Parallel Efficiency of K-means
– K-means clustering, 6 GB data, k = 250
– Best result: f-r (6.57x speedup)
– STM overheads: 15.3% (revalidation of reads/writes, lock acquire/release)
– K-means specific: the computation between updates is quite low