Integrating and Optimizing Transactional Memory in a Data Mining Middleware
Vignesh Ravi and Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University, Columbus, Ohio 43210
Outline
Motivation
Software Transactional Memory
Shared-Memory Parallelization Schemes
Transactional Locking II (TL2)
Hybrid Replicated STM Scheme
FREERIDE Processing Structure
Experimental Results
Conclusions
Motivation
Availability of large data for analysis
– On the scale of terabytes and petabytes
Advent of multi-core and many-core architectures
– Intel's Polaris 80-core chip
– Larrabee many-core architecture
Programmability challenge
– Coarse-grained parallelization: performance not sufficient
– Fine-grained parallelization: better left to experts
Need for a transparent, scalable shared-memory parallelization technique
Software Transactional Memory (STM)
Maps concurrent transactions in databases to concurrent thread operations
Programmer:
– Identifies critical sections
– Tags them as transactions
– Launches multiple threads
Transactions run as atomic and isolated operations
Data races are handled automatically
Guarantees absence of deadlock!
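To make the programming model concrete, here is a minimal C++ sketch of what "tagging a critical section as a transaction" looks like from the programmer's side. The atomically() wrapper below is a hypothetical stand-in implemented with a single mutex so the example runs; a real STM runtime would instead execute the body speculatively and retry it on conflict.

// Minimal sketch of the STM programming model described on this slide.
// NOTE: atomically() is a hypothetical stand-in (a single global mutex);
// a real STM runtime would run the body speculatively and retry on conflict.
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

static std::mutex stm_stub;           // stand-in for the STM runtime

template <typename Body>
void atomically(Body body) {          // programmer tags a critical section
    std::lock_guard<std::mutex> g(stm_stub);
    body();                           // runs as an atomic, isolated unit
}

int main() {
    std::vector<long> histogram(16, 0);   // shared reduction state
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([&, t] {
            for (int i = 0; i < 100000; ++i)
                atomically([&] { histogram[(t + i) % 16] += 1; });  // tagged transaction
        });
    for (auto& th : threads) th.join();
    long total = 0;
    for (long v : histogram) total += v;
    std::printf("total updates = %ld\n", total);   // 400000 when races are handled
    return 0;
}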
Contributions
FREERIDE Processing Structure
(Framework for Rapid Implementation of Datamining Engines)

{* Outer sequential loop *}
While ( ) {
    {* Reduction loop *}
    Foreach (element e) {
        (i, val) = compute(e)
        RObj(i) = Reduc(RObj(i), val)
    }
}

Map-reduce: two-stage
FREERIDE: one-stage, with the intermediate structure (the reduction object) exposed; better performance than map-reduce [Cluster '09]

Middleware API:
– Process each data instance
– Reduce the result into the reduction object
– Local combination from all threads if needed
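The C++ sketch below fills in the generalized reduction structure above for a k-means-style kernel. The names (ReductionObject, compute, reduce) are illustrative only and are not the actual FREERIDE middleware API; the point is that every data element is mapped to a slot and folded into the reduction object.

// Illustrative instance of the generalized reduction loop shown above,
// applied to a k-means-style kernel. Names are assumptions, not FREERIDE's API.
#include <array>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

constexpr int K = 4;

struct ReductionObject {                  // one accumulator slot per cluster
    std::array<double, K> sum{};
    std::array<long,   K> count{};
};

// compute(e): map a data instance to (i, val) -- the slot and its contribution
std::pair<int, double> compute(double e, const std::array<double, K>& centers) {
    int best = 0;
    for (int k = 1; k < K; ++k)
        if (std::fabs(e - centers[k]) < std::fabs(e - centers[best])) best = k;
    return {best, e};
}

// Reduc(RObj(i), val): fold the contribution into the reduction object
void reduce(ReductionObject& robj, int i, double val) {
    robj.sum[i]   += val;
    robj.count[i] += 1;
}

int main() {
    std::array<double, K> centers{0.0, 2.5, 5.0, 7.5};
    std::vector<double> data{0.1, 2.4, 5.2, 7.7, 2.6, 4.9};

    ReductionObject robj;
    for (double e : data) {                    // the reduction loop
        auto [i, val] = compute(e, centers);
        reduce(robj, i, val);
    }
    for (int k = 0; k < K; ++k)
        std::printf("cluster %d: count=%ld\n", k, robj.count[k]);
    return 0;
}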
Shared-Memory Parallelization Techniques
Context: FREERIDE (Framework for Rapid Implementation of Datamining Engines)
Replication-based (lock-free):
– Full replication (f-r)
Lock-based:
– Full locking
– Cache-sensitive locking (cs-l)
[Figure: Full Locking vs. Cache-Sensitive Locking — layout of locks and reduction elements]
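The sketch below illustrates the idea behind cache-sensitive locking as contrasted with full locking: instead of pairing every reduction element with its own lock, one lightweight lock is co-located in a cache line with the group of elements it guards. This is an illustrative reconstruction under assumed parameters (64-byte lines, a one-byte spinlock), not the paper's implementation.

// Illustrative sketch of cache-sensitive locking: one spinlock per cache line,
// guarding the reduction elements that share that line (full locking would
// instead keep one lock per element). Assumes a 64-byte cache line; C++17.
#include <atomic>
#include <cstddef>
#include <vector>

constexpr std::size_t CACHE_LINE      = 64;   // assumed cache-line size
constexpr std::size_t ELEMS_PER_GROUP = 7;    // elements co-located with one lock

struct alignas(CACHE_LINE) CacheSensitiveGroup {
    std::atomic_flag lock = ATOMIC_FLAG_INIT;  // 1-byte spinlock
    double values[ELEMS_PER_GROUP] = {};       // elements sharing this line
};

class CacheSensitiveReduction {
public:
    explicit CacheSensitiveReduction(std::size_t n)
        : groups_((n + ELEMS_PER_GROUP - 1) / ELEMS_PER_GROUP) {}

    void add(std::size_t i, double val) {
        CacheSensitiveGroup& g = groups_[i / ELEMS_PER_GROUP];
        while (g.lock.test_and_set(std::memory_order_acquire)) { /* spin */ }
        g.values[i % ELEMS_PER_GROUP] += val;  // element update under the group lock
        g.lock.clear(std::memory_order_release);
    }

private:
    std::vector<CacheSensitiveGroup> groups_;
};

int main() {
    CacheSensitiveReduction robj(1000);
    robj.add(42, 3.14);   // guarded by the lock sharing element 42's cache line
    return 0;
}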
Motivation for STM Integration
Potential downsides of existing schemes [CCGRID '09]:
– Full replication: very high memory requirements
– Cache-sensitive locking: tuned for a specific cache architecture; risk of introducing bugs and deadlocks when porting
Advantages of STM:
– Leverages a large body of STM work
– Easier programmability
– No deadlocks!
– Transparent integration: the programmer need not bother with STM details
What do we need?
– The easy programmability of STM
– Competitive performance
Transactional Locking II (TL2)
Word-based, lock-based STM algorithm
Faster than non-blocking STM techniques
API:
– STMBeginTransaction()
– STMRead()
– STMWrite()
– STMCommit()
We used the Rochester STM implementation (RSTM-TL2)
Downside of STM: a large number of conflicts leads to a large number of aborts
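As a sketch of how a reduction update could be expressed through the TL2-style calls named on this slide, consider the following. The signatures and the stub bodies are assumptions made only so the example compiles; the actual RSTM-TL2 interface and semantics differ, and a real commit can fail and abort the transaction.

// Sketch of wrapping a reduction update in the TL2-style API named above.
// Signatures and stub bodies are illustrative assumptions, not RSTM-TL2's API.
#include <cstddef>
#include <cstdio>

// Stand-in stubs so the sketch compiles; a real TL2 runtime would keep
// per-transaction read/write sets and a global version clock instead.
void   STMBeginTransaction() {}
double STMRead(double* addr)              { return *addr; }
void   STMWrite(double* addr, double val) { *addr = val; }
bool   STMCommit()                        { return true; }  // a real commit may fail

// Transactionally apply one (i, val) contribution to the reduction object,
// retrying whenever the commit aborts because of a conflict.
void reduce_transactional(double* reduction_obj, std::size_t i, double val) {
    for (;;) {
        STMBeginTransaction();
        double old = STMRead(&reduction_obj[i]);
        STMWrite(&reduction_obj[i], old + val);
        if (STMCommit())          // validates reads, publishes writes atomically
            return;
        // conflict detected: another thread updated reduction_obj[i]; retry
    }
}

int main() {
    double robj[4] = {};
    reduce_transactional(robj, 2, 1.5);
    std::printf("robj[2] = %.1f\n", robj[2]);
    return 0;
}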
Optimization – Hybrid Replicated STM (rep-stm)
Best of two worlds: replication and STM
Replicated STM:
– Group 'n' threads into 'm' groups
– Keep 'm' copies of the reduction object
– Each group of threads has a private copy
– The n/m threads within a group share their copy using STM
Advantages of Replicated STM:
– Reduces the number of reduction object copies
– Reduces merge overhead
– Also reduces conflicts within the STM
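The following minimal sketch shows the hybrid layout: n threads share m copies of the reduction object, one per group, so conflicts can only arise among the n/m threads of a group, and the m copies are merged at the end. The per-group std::mutex is only a stand-in for the intra-group STM transactions; the structure and names are illustrative.

// Minimal sketch of the replicated-STM layout: m reduction-object copies,
// intra-group synchronization only, plus a final merge. The per-group mutex
// stands in for STM transactions; names and sizes are illustrative.
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    const int n = 8;            // threads
    const int m = 2;            // reduction-object copies (groups)
    const int slots = 16;

    std::vector<std::vector<long>> copies(m, std::vector<long>(slots, 0));
    std::vector<std::mutex> group_sync(m);   // stand-in for intra-group STM

    std::vector<std::thread> threads;
    for (int tid = 0; tid < n; ++tid)
        threads.emplace_back([&, tid] {
            int g = tid % m;                 // the copy this thread's group owns
            for (int i = 0; i < 100000; ++i) {
                std::lock_guard<std::mutex> guard(group_sync[g]);
                copies[g][(tid + i) % slots] += 1;   // conflicts only within the group
            }
        });
    for (auto& t : threads) t.join();

    // Final merge: combine the m private copies into one global reduction object.
    std::vector<long> global(slots, 0);
    for (int g = 0; g < m; ++g)
        for (int s = 0; s < slots; ++s) global[s] += copies[g][s];

    long total = 0;
    for (long v : global) total += v;
    std::printf("total = %ld (expected %d)\n", total, n * 100000);
    return 0;
}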
Experimental Setup and Goals
Setup:
– Intel Xeon E5345 processors, two quad-cores (8 cores), 2.33 GHz per core
– 6 GB main memory, 8 MB L2 cache
Goals:
– Compare f-r, cs-l, TL2, and rep-stm for three data mining algorithms: K-means, Expectation Maximization (EM), and Principal Component Analysis (PCA)
– Evaluate different read-write mixes
– Evaluate conflicts and aborts
Parallel Efficiency of PCA
Principal Component Analysis, 8.5 GB data
Best result: rep-stm (6.1x speedup)
Observations:
– All techniques are competitive
PCA-specific:
– Computation for finding the co-variance matrix is high, which amortizes the revalidation and lock acquire/release costs
– STM overheads: 2.3%
Parallel Efficiency of EM
Expectation-Maximization (EM), 6.4 GB data
Best result: cs-l (~5x speedup)
Observations:
– The STM schemes are competitive and show better scalability
– The difference between stm-TL2 and rep-stm is not observed with 8 cores
EM-specific:
– Computation between updates is high
– Again, the initial overhead is high
Canonical Loop – Parallel Efficiency for Read-Write Mixes
Canonical loop:
– Synthetic computation that follows the generalized reduction structure
– Different workloads with varying read/write mixes
– All results from 8 threads
Interesting observation: a different technique wins for each workload
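For intuition, here is a sketch of what such a synthetic canonical loop can look like: a generalized reduction whose fraction of writes is dialed by a parameter, so different read/write mixes stress the synchronization schemes differently. The structure and parameter names are assumptions, not the paper's actual benchmark.

// Sketch of a synthetic canonical loop with a tunable read/write mix.
// write_frac and the loop structure are illustrative assumptions only.
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const std::size_t elements   = 1000000;
    const std::size_t slots      = 1024;
    const double      write_frac = 0.25;    // fraction of iterations that update

    std::mt19937 rng(42);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<std::size_t> pick(0, slots - 1);

    std::vector<double> reduction_obj(slots, 0.0);
    double read_sum = 0.0;

    for (std::size_t e = 0; e < elements; ++e) {     // the reduction loop
        std::size_t i = pick(rng);
        if (coin(rng) < write_frac)
            reduction_obj[i] += 1.0;        // write: the contended critical section
        else
            read_sum += reduction_obj[i];   // read-only access
    }
    std::printf("writes ~%.0f%%, read_sum = %.1f\n", write_frac * 100, read_sum);
    return 0;
}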
Evaluation of Conflicts and Aborts
Same canonical loop
Compares the rate of aborts for stm-TL2 and rep-stm
Demonstrates the advantage of rep-stm over stm-TL2 for large numbers of threads
In all cases, for rep-stm:
– The rate of growth of aborts is much slower
– Aborts are reduced by 40-55%
Conclusions
Transparent use of STM schemes in the middleware
Developed hybrid replicated STM to reduce
– Memory requirements
– Conflicts/aborts
TL2 and rep-stm are competitive with the highly tuned locking scheme
rep-stm significantly reduces the number of aborts compared with TL2
Thank You! Questions?
Contacts:
Vignesh Ravi – raviv@cse.ohio-state.edu
Gagan Agrawal – agrawal@cse.ohio-state.edu
Parallel Efficiency of K-means
K-means clustering, 6 GB data, k=250
Best result: f-r (6.57x speedup)
STM overheads: 15.3%
– Revalidating reads and writes
– Acquiring and releasing locks
K-means-specific:
– Computation between updates is quite low