Design and Implementation of Signatures in Transactional Memory Systems Daniel Sanchez August 2007 University of Wisconsin-Madison.

Slides:



Advertisements
Similar presentations
Virtual Hierarchies to Support Server Consolidation Michael Marty and Mark Hill University of Wisconsin - Madison.
Advertisements

A Search Memory Substrate for High Throughput and Low Power Packet Processing Sangyeun Cho, Michel Hanna and Rami Melhem Dept. of Computer Science University.
Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison.
1 A Hybrid Adaptive Feedback Based Prefetcher Santhosh Verma, David Koppelman and Lu Peng Louisiana State University.
Lecture 12 Reduce Miss Penalty and Hit Time
LEVERAGING ACCESS LOCALITY FOR THE EFFICIENT USE OF MULTIBIT ERROR-CORRECTING CODES IN L2 CACHE By Hongbin Sun, Nanning Zheng, and Tong Zhang Joseph Schneider.
A Scalable and Reconfigurable Search Memory Substrate for High Throughput Packet Processing Sangyeun Cho and Rami Melhem Dept. of Computer Science University.
Hardware Transactional Memory for GPU Architectures Wilson W. L. Fung Inderpeet Singh Andrew Brownsword Tor M. Aamodt University of British Columbia In.
Lecture 6 Hashing. Motivating Example Want to store a list whose elements are integers between 1 and 5 Will define an array of size 5, and if the list.
1 Hash-Based Indexes Module 4, Lecture 3. 2 Introduction As for any index, 3 alternatives for data entries k* : – Data record with key value k – –Choice.
Hash-Based Indexes The slides for this text are organized into chapters. This lecture covers Chapter 10. Chapter 1: Introduction to Database Systems Chapter.
Chapter 11 (3 rd Edition) Hash-Based Indexes Xuemin COMP9315: Database Systems Implementation.
Cuckoo Filter: Practically Better Than Bloom
Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.
Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached Bohua Kou Jing gao.
Cuckoo Hashing : Hardware Implementations Adam Kirsch Michael Mitzenmacher.
Hit or Miss ? !!!.  Cache RAM is high-speed memory (usually SRAM).  The Cache stores frequently requested data.  If the CPU needs data, it will check.
The Performance of Spin Lock Alternatives for Shared-Memory Microprocessors Thomas E. Anderson Presented by David Woodard.
WCED: June 7, 2003 Matt Ramsay, Chris Feucht, & Mikko Lipasti University of Wisconsin-MadisonSlide 1 of 26 Exploring Efficient SMT Branch Predictor Design.
November 29, 2005Christopher Tuttle1 Linear Scan Register Allocation Massimiliano Poletto (MIT) and Vivek Sarkar (IBM Watson)
Database Implementation Issues CPSC 315 – Programming Studio Spring 2008 Project 1, Lecture 5 Slides adapted from those used by Jennifer Welch.
CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.
Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department.
Performance Evaluation of IPv6 Packet Classification with Caching Author: Kai-Yuan Ho, Yaw-Chung Chen Publisher: ChinaCom 2008 Presenter: Chen-Yu Chaug.
Solving the Protein Threading Problem in Parallel Nocola Yanev, Rumen Andonov Indrajit Bhattacharya CMSC 838T Presentation.
Software-Based Cache Coherence with Hardware-Assisted Selective Self Invalidations Using Bloom Filters Authors : Thomas J. Ashby, Pedro D´ıaz, Marcelo.
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Nov 9, 2005 Topic: Caches (contd.)
Adaptive Cache Compression for High-Performance Processors Alaa R. Alameldeen and David A.Wood Computer Sciences Department, University of Wisconsin- Madison.
1 The Mystery of Cooperative Web Caching 2 b b Web caching : is a process implemented by a caching proxy to improve the efficiency of the web. It reduces.
Lecture 6 Hashing. Motivating Example Want to store a list whose elements are integers between 1 and 5 Will define an array of size 5, and if the list.
Linear Scan Register Allocation POLETTO ET AL. PRESENTED BY MUHAMMAD HUZAIFA (MOST) SLIDES BORROWED FROM CHRISTOPHER TUTTLE 1.
Managing Multi-Configuration Hardware via Dynamic Working Set Analysis By Ashutosh S.Dhodapkar and James E.Smith Presented by Kyriakos Yioutanis.
CPU Cache Prefetching Timing Evaluations of Hardware Implementation Ravikiran Channagire & Ramandeep Buttar ECE7995 : Presentation.
Pınar Tözün Anastasia Ailamaki SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads Islam Atta Andreas Moshovos.
Building Expressive, Area-Efficient Coherence Directories Michael C. Huang Guofan Jiang Zhejiang University University of Rochester IBM 1 Lei Fang, Peng.
NoSQL Databases Oracle - Berkeley DB. Content A brief intro to NoSQL About Berkeley Db About our application.
EazyHTM: Eager-Lazy Hardware Transactional Memory Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris,
2013/01/14 Yun-Chung Yang Energy-Efficient Trace Reuse Cache for Embedded Processors Yi-Ying Tsai and Chung-Ho Chen 2010 IEEE Transactions On Very Large.
Thread-Level Speculation Karan Singh CS
Notary: Hardware Techniques to Enhance Signatures Luke Yen Collaborator: Prof. Stark C. Draper Advisor: Prof. Mark D. Hill University of Wisconsin, Madison.
Predicting Coherence Communication by Tracking Synchronization Points at Run Time Socrates Demetriades and Sangyeun Cho 45 th International Symposium in.
Computer Architecture Memory organization. Types of Memory Cache Memory Serves as a buffer for frequently accessed data Small  High Cost RAM (Main Memory)
L/O/G/O Cache Memory Chapter 3 (b) CS.216 Computer Architecture and Organization.
Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation Jason Bosko March 5 th, 2008 Based on “Managing Distributed, Shared L2 Caches through.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Hash-Based Indexes Chapter 10.
Implementing Signatures for Transactional Memory Daniel Sanchez, Luke Yen, Mark Hill, Karu Sankaralingam University of Wisconsin-Madison.
Ronny Krashinsky Erik Machnicki Software Cache Coherent Shared Memory under Split-C.
Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill & David A. Wood Presented by: Eduardo Cuervo.
© 2008 Multifacet ProjectUniversity of Wisconsin-Madison Pathological Interaction of Locks with Transactional Memory Haris Volos, Neelam Goyal, Michael.
The Evicted-Address Filter
HARD: Hardware-Assisted lockset- based Race Detection P.Zhou, R.Teodorescu, Y.Zhou. HPCA’07 Shimin Chen LBA Reading Group Presentation.
컴퓨터교육과 이상욱 Published in: COMPUTER ARCHITECTURE LETTERS (VOL. 10, NO. 1) Issue Date: JANUARY-JUNE 2011 Publisher: IEEE Authors: Omer Khan (Massachusetts.
Prophet/Critic Hybrid Branch Prediction B B B
Speaker : Kyu Hyun, Choi. Problem: Interference in shared caches – Lack of isolation → no QoS – Poor cache utilization → degraded performance.
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.
Memory Hierarchy— Five Ways to Reduce Miss Penalty.
Improving Cache Performance using Victim Tag Stores
Algorithmic Improvements for Fast Concurrent Cuckoo Hashing
CSC 4250 Computer Architectures
Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper, Dan Sorin§, Mark.
Two Ideas of This Paper Using Permissions-only Cache to deduce the rate at which less-efficient overflow handling mechanisms are invoked. When the overflow.
Lecture 19: Transactional Memories III
EMOMA- Exact Match in One Memory Access
Cache - Optimization.
Lecture 23: Transactional Memory
Lois Orosa, Rodolfo Azevedo and Onur Mutlu
A Novel Cache-Utilization Based Dynamic Voltage Frequency Scaling (DVFS) Mechanism for Reliability Enhancements *Yen-Hao Chen, *Yi-Lun Tang, **Yi-Yu Liu,
Presentation transcript:

Design and Implementation of Signatures in Transactional Memory Systems Daniel Sanchez August 2007 University of Wisconsin-Madison

2 Outline  Introduction and motivation  Bloom filters  Bloom signatures  Area & performance evaluation  Influence of system parameters  Novel signature schemes (brief overview)  Conclusions

Signature-based conflict detection  Signatures: Represent an arbitrarily large set of elements in bounded amount of state (bits) Approximate representation, with false positives but no false negatives  Signature-based CD: Use signatures to track read/write sets of a transaction Pros: □ Transactions can be unbounded in size □ Independence from caches, eases virtualization Cons: □ False conflicts -> Performance degradation 3

Motivation of this study  Signatures play an important role in TM performance. Poor signatures cause lots of unnecessary stalls and aborts.  Signatures can take significant amount of area Can we find area-efficient implementations? Adoption of TM much easier if the area requirements are small!  Signature design space exploration incomplete in other TM proposals 4

Summary of results  Previously proposed TM signatures are either true Bloom (1 filter, k hash functions) or parallel Bloom (k filters, 1 hash function each). Performance-wise, True Bloom = Parallel Bloom Parallel Bloom about 8x more area-efficient  New Bloom signature designs that double the performance and are more robust  Pressure on signatures greatly increases with the number of cores; directory can help  Three novel signature designs 5

Outline 6  Introduction and motivation  Bloom filters  Bloom signatures  Area & performance evaluation  Influence of system parameters  Novel signature schemes (brief overview)  Conclusions

Bloom filters 7 h1h1 h2h2 Bit field (m bits) Hash functions Address Hash values {0,…,m-1}

Bloom filters 8 Add 0x2a83ff00 h1h1 h2h2 38

Bloom filters 9 Add 0x2a8ab3f4 h1h1 h2h2 122

Bloom filters 10 Test 0x2a8a83f4 h1h1 h2h2 102 False

Bloom filters 11 h1h1 h2h2 38 True Test 0x2a83ff00

Bloom filters 12 h1h1 h2h2 28 True (false positive!) Test 0xff83ff48

Outline 13  Introduction and motivation  Bloom filters  Bloom signatures True Bloom signatures Parallel Bloom signatures  Area & performance evaluation  Influence of system parameters  Novel signature schemes (brief overview)  Conclusions Design Implementation

True Bloom signature - Design  True Bloom signature = Signature implemented with a single Bloom filter 14  Easy insertions and tests for membership  Probability of false positives:  Design dimensions Size of the bit field (m) Number of hash functions (k) Type of hash functions

Number of hash functions 15

Types of hash functions  Addresses neither independent nor uniformly distributed (key assumptions to derive P FP (n))  But can generate hash values that are almost uniformly distributed and uncorrelated with good (universal/almost universal) hash functions  Hash functions considered: 16 Bit-selection H3H3 (inexpensive, low quality)(moderate, higher quality)

True Bloom signature – Implementation  Divide bit field in words, store in small SRAM Insert: Raise wordline, drive appropriate bitline to 1, leave the rest floating Test: Raise wordline, check the value at bitline  k hash functions => k read, k write ports 17 Problem Size of SRAM cell increases quadratically with # ports!

Parallel Bloom signatures - Design  Use k Bloom filters of size m/k, with independent hash functions 18  Probability of false positives: Same as true Bloom!

Parallel Bloom signature - Implementation 19  Highly area-efficient SRAMs  Same performance as true Bloom! (in theory)

Outline 20  Introduction and motivation  Bloom filters  Bloom signatures  Area & performance evaluation Area evaluation True vs. Parallel Bloom in practice Type of hash functions Variability in hash functions  Influence of system parameters  Novel signature schemes (brief overview)  Conclusions

Area evaluation  SRAM: Area estimations using CACTI 4Kbit signature, 65nm 21 k=1k=2k=4 True Bloom0.031 mm mm mm 2 Parallel Bloom0.031 mm mm mm 2  8x area savings for four hash functions!  Hash functions:  Bit selection has no extra cost  Four hardwired H 3 require ≈25% of SRAM area

Performance evaluation  System organization: 32 in-order single-issue cores Private split 32KB, 4-way L1 caches Shared unified 8MB, 8-way L2 cache High-bandwidth crossbar Signature checks are broadcast (no directory) Base conflict resolution protocol with write-set prediction  Benchmarks: btree, raytrace, vacation barnes, delaunay, and full set of results in report 22

True vs. Parallel Bloom signatures  Bottom line: True ≈ parallel if we use good enough hash functions 23 vacation bit-selection vacation H 3 Graph format Solid lines = Parallel Bloom Dashed lines = True Bloom Different colors = Different number of hash functions Execution times are always normalized

Bit-selection vs. fixed H 3  H 3 clearly outperforms bit-selection for k≥2  Only 2Kbit signatures with 4+ H 3 functions cause no degradation over all the benchmarks 24 btree bit-selection btree H 3

The benefits of variability  Variable H 3 : Reconfigure hash functions after each commit/abort Constant aliases -> Transient aliases Adds robustness 25 btree fixed H 3 btree var. H 3

The benefits of variability  Variable H 3 : Reconfigure hash functions after each commit/abort Constant aliases -> Transient aliases Adds robustness 26 raytrace fixed H 3 raytrace var. H 3

Conclusions on Bloom signature evaluation  Parallel Bloom enables high number of hash functions “for free”  Type of hash functions used matters a lot (but was neglected in previous analysis)  Variability adds robustness  Should use: About four H 3 or other high quality hash functions Variability if the TM system allows it Size… depends on system configuration 27

Outline 28  Introduction and motivation  Bloom filters  Bloom signatures  Area & performance evaluation  Influence of system parameters Number of cores Conflict resolution protocol  Novel signature schemes (brief overview)  Conclusions

Number of cores & using a directory  Pressure increases with #cores  Directory helps, but still requires to scale the signatures with the number of cores 29 btree vacation Constant signature size (256 bits) Number of cores in the x-axis !

Effect of conflict resolution protocol  Protocol choice fairly orthogonal to signatures  False conflicts boost existing pathologies in btree/raytrace -> Hybrid policy helps even more than with perfect signatures 30 btree raytrace vacation (Parallel Bloom, fixed H 3, k=2) Constant signature type (H 3, k=2) Execution times not normalized !

Overview of novel signature schemes  Cuckoo-Bloom signatures Adapts cuckoo hashing for HW implementation Keeps a hash table for small sets, morphs into a Bloom filter dynamically as the size grows Significant complexity, performance advantage not clear  Hash-Bloom signatures Simpler hash-table based approach Morphs to a Bloom filter more gradually than Cuckoo-Bloom Outperforms Bloom signatures for both small and write sets, in theory and practice  Adaptive Bloom signatures Bloom signatures + set size predictors + scheme to select the best number of hash functions 31

Conclusions  Bloom signatures should always be implemented as parallel Bloom with ≈4 good hash functions, some variability if allowed Overall good performance, simple/inexpensive HW  Increasing #cores makes signatures more critical Hinders scalability! Using directory helps, but doesn’t solve  Hybrid conflict resolution helps with signatures  There are alternative schemes that outperform Bloom signatures 32

Thanks for your attention Any questions?

Backup – Hash function analysis 34  Hash value distributions for btree, 512-bit parallel Bloom with 2 hash functions bit-selection fixed H 3

Backup - Conflict resolution in LogTM-SE  Base: Stall requester by default, abort if it is stalling an older Tx and stalled bt an older Tx  Pathologies: DuelingUpgrades: Two Txs try to read-modify-update same block concurrently -> younger aborts StarvingWriter: Difficult for a Tx to write to a widely shared block FutileStall: Tx stalls waiting for other that later aborts  Solutions: Write-set prediction: Predict read-modify-updates, get exclusive access directly (targets DuelingUpgrades) Hybrid conflict resolution: Older writer aborts younger readers (targets StarvingWriter, FutileStall) 35

Backup – Cuckoo-Bloom signatures 36 vacationbtree

Backup – Hash-Bloom signatures 37 vacation

Backup – Adaptive Bloom signatures 38 vacationraytrace