Notary: Hardware Techniques to Enhance Signatures Luke Yen Collaborator: Prof. Stark C. Draper Advisor: Prof. Mark D. Hill University of Wisconsin, Madison.

Presentation transcript:

Notary: Hardware Techniques to Enhance Signatures Luke Yen Collaborator: Prof. Stark C. Draper Advisor: Prof. Mark D. Hill University of Wisconsin, Madison MICRO-41 - November 11,

Executive Summary
Tackle 2 problems with hardware signatures:
Problem 1: The best signature hashing (i.e., H3) has high area & power overheads
Solution 1: Use entropy analysis to guide lower-cost hashing (Page-Block-XOR, PBX) that performs similarly to H3
– Ex: 160 gates for H3 vs 20 gates for PBX
Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs
Solution 2: Avoid inserting private stack addrs; propose a privatization interface for higher performance
10/21/2015 University of Wisconsin-Madison

Outline
Signature background
Entropy
Entropy results & PBX
Privatization
Methodology & workloads
Results
Conclusions & Future Work

Signature background
Signatures (hardware Bloom filters) are used to summarize a transaction's read- and write-sets and to detect conflicts with them
– Inspired by the Bulk system [Ceze, ISCA'06]
– Implemented in LogTM-SE [Yen, HPCA'07]
– Can have false positives, but never false negatives
– Also proposed for non-TM purposes (e.g., SC violation detection, atomicity violation detection, race recording)
Ex: Use k Bloom filters of size m/k, with independent hash functions
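The "k Bloom filters of size m/k" organization can be sketched in software. This is a minimal illustration, not the hardware design from the talk: the class name `Signature`, the method names, and the salted BLAKE2 hash standing in for the hardware hash functions are all assumptions.

```python
import hashlib


class Signature:
    """Partitioned Bloom filter: k banks of m/k bits, one hash per bank."""

    def __init__(self, m_bits=2048, k=2):
        assert m_bits % k == 0
        self.k = k
        self.bank_bits = m_bits // k
        self.banks = [0] * k  # each bank is an int used as a bit vector

    def _index(self, bank, addr):
        # Assumption: a salted software hash replaces the hardware hash.
        digest = hashlib.blake2b(
            addr.to_bytes(8, "little"), salt=bytes([bank]), digest_size=8
        ).digest()
        return int.from_bytes(digest, "little") % self.bank_bits

    def insert(self, addr):
        # Set one bit in every bank.
        for bank in range(self.k):
            self.banks[bank] |= 1 << self._index(bank, addr)

    def may_contain(self, addr):
        # True only if the addressed bit is set in all banks.
        return all(
            (self.banks[bank] >> self._index(bank, addr)) & 1
            for bank in range(self.k)
        )
```

Because insertion only ever sets bits, an inserted address always tests positive (no false negatives), while an address that was never inserted can still test positive if other insertions happened to set its bits.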

Signature hash functions
Which hash function is best? [Sanchez, MICRO'07]
– Bit-selection? Hash simply decodes some number of input bits
– H3? Each bit of a hash value is an XOR of (on avg.) half of the input address bits
[Figure: LogTM-SE w/ 2kb signatures]
Result: H3 is better with >=2 hash functions
However, H3 uses many multi-level XOR trees
Can we improve this?
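The two hash families compared above can be sketched as follows. This is a software illustration under stated assumptions: `make_h3` and `bit_select` are hypothetical names, and the random masks stand in for the fixed wiring that a hardware H3 implementation would use.

```python
import random


def make_h3(n_in_bits, n_out_bits, seed=0):
    """Build an H3-style hash: each output bit is the XOR (parity) of a
    randomly chosen subset of input bits, on average half of them."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(n_in_bits) for _ in range(n_out_bits)]

    def h3(addr):
        value = 0
        for i, mask in enumerate(masks):
            parity = bin(addr & mask).count("1") & 1  # XOR of selected bits
            value |= parity << i
        return value

    return h3


def bit_select(addr, lo, width):
    """Bit-selection hash: take `width` address bits starting at bit `lo`."""
    return (addr >> lo) & ((1 << width) - 1)
```

Each H3 output bit corresponds to one XOR tree over the selected input bits, which is where the gate cost comes from; bit-selection needs no XOR gates at all.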

H3 implementation
Num XOR: [formula not recoverable from transcript]
Ex: 2kb signatures, k=2, c=10, 32-bit addr = 160 XOR gates per signature
Can we reduce the total gate count?

Entropy overview
Not all address bits have equal randomness
– Ex: High-order address bits are unlikely to change if the working set size is small
Key insight: If input bits are random and those bits are used as inputs to hash functions, random hash values result
– Use entropy to measure bit randomness
Entropy: a measure of the uncertainty of a random variable x

Entropy formally defined
Entropy = $-\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$
p(x_i) = the probability of the occurrence of value x_i
N = number of sample values random variable x can take on
Entropy = amount of information required on average to describe the outcome of variable x (in bits)
– Ex: What is the best possible lossless compression?
Entropy value of an n-bit field:
– min, 0 bits: the n-bit field has a constant value
– max, n bits: all bit patterns in the n-bit field are equally likely
– Other cases fall in between
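The definition above can be checked numerically. The sketch below estimates p(x_i) from sample frequencies; the function name `entropy_bits` is an assumption made for illustration.

```python
import math
from collections import Counter


def entropy_bits(samples):
    """Shannon entropy, in bits, of the empirical distribution of samples."""
    counts = Counter(samples)
    total = len(samples)
    # sum of p * log2(1/p) equals -sum of p * log2(p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


constant_field = [5] * 8         # field never changes
uniform_field = list(range(8))   # every 3-bit pattern equally likely

print(entropy_bits(constant_field))  # minimum: 0 bits
print(entropy_bits(uniform_field))   # maximum for a 3-bit field: 3 bits
```

The two inputs correspond to the minimum and maximum cases for an n-bit field with n = 3.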

Our measures of entropy
For our workloads, we care about:
Q1: What is the best achievable entropy?
– Global entropy: upper bound on the entropy of the address
Q2: How does entropy change within an address?
– Local entropy: entropy of a bit-field within the address
[Figure: global entropy is measured over address bits 31-6; local entropy is measured over a bit-field within bits 31-6 whose position is set by NSkip]

Entropy results
Workloads to be described later
Global entropy is at most 16 bits
Bit-window for local entropy is 16 bits wide (NSkip from 0-10)
– Smaller windows (<16b) may not reach the global entropy value
– Larger windows (>16b) hide some fine-grain info
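A sliding-window measurement of this kind can be sketched as below. The interpretation of NSkip as the number of bits skipped above the 6-bit block offset is an assumption, as are the function names and the synthetic address trace; the talk's actual traces come from the simulated workloads.

```python
import math
from collections import Counter


def entropy_bits(samples):
    """Shannon entropy, in bits, of the empirical distribution of samples."""
    counts = Counter(samples)
    total = len(samples)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def local_entropy(addrs, nskip, window=16, block_bits=6):
    """Entropy of a `window`-bit field that starts `nskip` bits above the
    block offset (assumed meaning of NSkip)."""
    mask = (1 << window) - 1
    fields = [(a >> (block_bits + nskip)) & mask for a in addrs]
    return entropy_bits(fields)


# Synthetic trace: 1024 consecutive 64-byte blocks (a 64kB working set).
trace = [block * 64 for block in range(1024)]
for nskip in (0, 4, 10):
    print(nskip, local_entropy(trace, nskip))
```

For this synthetic trace the measured entropy falls as the window slides towards the high-order bits, because those bits never change within a small working set.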

Entropy results summary
More entropy results are in our MICRO paper
In summary, for our workloads entropy decreases monotonically when moving towards high-order bits
– We calculate the average entropy across the entire workload's execution
– May miss entropy changes due to program phase behavior
Our Page-Block-XOR (PBX) hash takes advantage of this overall trend

Page-Block-XOR (PBX)
Motivated by 3 findings:
– (1) Lower-order bits have the most entropy
   Follows from our entropy results
– (2) XORing two bit-fields produces random hash values
   From prior work on XOR hashing (e.g., data placement in caches, DRAM)
– (3) Bit-field overlaps can lead to higher false positives
   Correlation between the two bit-fields can reduce the range of hash values produced (worse for larger signatures)

PBX implementation
For 2kb signatures with 2 hash functions:
– 20 XOR gates for PBX vs 160 XOR gates for H3!
PPN and Cache-index fields are not tied to system params: use entropy to find two non-overlapping bit-fields with high randomness
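The core PBX idea, XORing two non-overlapping bit-fields, can be sketched for a single hash function. The field positions below (bits 6-15 and bits 16-25) and the name `pbx_hash` are hypothetical choices for illustration; the talk selects the fields from entropy measurements, and the transcript does not say how the second hash function differs from the first.

```python
def pbx_hash(addr, out_bits=10, low_field_start=6, high_field_start=16):
    """XOR a low-order bit-field with a non-overlapping higher bit-field.

    One 2-input XOR per output bit, so a 10-bit hash needs 10 XOR gates.
    """
    assert high_field_start >= low_field_start + out_bits  # no overlap
    mask = (1 << out_bits) - 1
    low_field = (addr >> low_field_start) & mask    # cache-index-like bits
    high_field = (addr >> high_field_start) & mask  # page-number-like bits
    return low_field ^ high_field
```

With one XOR gate per output bit, two such 10-bit hashes would use 20 gates, which is consistent with the count quoted for 2kb signatures with 2 hash functions.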

Summary thus far
Problem 1: H3 has high area & power overheads
Solution 1: Use entropy analysis to guide lower-cost PBX
– Ex: 160 gates for H3 vs 20 gates for PBX
Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs
Solution 2: To be described

Motivation
False conflicts are caused by thread-private addrs
– Avoid these conflicts by not inserting such addrs into the thread's signatures

Privatization solutions
Two solutions proposed:
– (1) Remove private stack references from sigs.
   Very little work for programmer/compiler
   Benefits depend on the fraction of stack addresses versus all transactional references
– (2) Language-level interface (e.g., private_malloc(), shared_malloc())
   Even higher performance boost
   For skilled programmers
   WARNING: Incorrectly marking shared objects as private can lead to program errors!

Page-based implementation
Each page is assigned a status, private or shared
– Invariant: A page is shared if any object on it is shared
If the stack is private, the library marks stack pages as private
If using the privatization heap functions, mark heap pages accordingly

OS support
OS allocates different physical page frames for shared and private pages
– Sets a per-frame bit in the translation entry if shared
– Reduces the number of page frames used by packing objects with the same status together
Signatures insert memory addresses of transactional references to shared pages
– Query the page sharing bit in the HW TLB & the current transactional status
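The filtering decision described above can be sketched in a few lines. This is an illustration only: `TlbEntry`, `page_is_shared`, and `record_reference` are hypothetical names, and a Python set stands in for the hardware signature.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    frame: int
    shared: bool  # hypothetical per-frame sharing bit set by the OS


def page_is_shared(object_statuses):
    """Page invariant: a page is shared if any object on it is shared."""
    return any(status == "shared" for status in object_statuses)


def record_reference(signature, addr, entry, in_transaction):
    """Insert addr only for transactional references to shared pages.

    Returns True if the address was inserted.
    """
    if in_transaction and entry.shared:
        signature.add(addr)
        return True
    return False
```

References to private pages never reach the signature, so they cannot set bits that later cause spurious conflicts.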

Methodology
Full-system simulation using Simics and Wisconsin GEMS timing modules
Transistor-level design for area & power of XOR gates
CACTI for Bloom filter bit array area & power
Simulated system
– Single-chip CMP
– 16 single-threaded, in-order cores
– 32kB, 4-way private L1 I & D, write-back
– 8MB, 8-way shared L2 cache
– MESI directory protocol
– Signatures from 64b-64kb (8B-8kB) & "Perfect"

Workloads
Micro-benchmarks
– BTree: read and write ops on shared tree
– Sparse Matrix: algorithm from dense column vector multiplication kernel
SPLASH-2 apps
– Barnes & Raytrace: exert most signature pressure
Stanford STAMP apps
– Vacation, Genome, Delaunay, Bayes, Labyrinth
DNS server
– BIND

PBX vs H3 area & power
Area & power overheads (2kb, k=4):

Type of overhead | Bloom filter bit array | H3 hash | PBX hash | H3 sig. | PBX sig. | % savings for PBX sig.
Area (mm^2)      | 2.70e-2                | 8.10e-3 | 4.70e-4  | 3.50e-2 | 2.70e-2  | 23
Power (mW)       | 1.80e2                 | 1.04e…  | …        | …e2     | 1.81e2   | 4.7
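As a quick check of the savings column, assuming the area row reads 3.50e-2 mm^2 for the H3 signature and 2.70e-2 mm^2 for the PBX signature, the reported figure follows from the relative difference:

\[
\frac{3.50\times10^{-2} - 2.70\times10^{-2}}{3.50\times10^{-2}} \approx 0.229 \approx 23\%
\]

The same reading is consistent with the H3 signature area being the bit array plus the H3 hash logic: 2.70e-2 + 8.10e-3 ≈ 3.5e-2 mm^2.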

PBX vs H3 execution time
PBX performs similarly to H3
Additional workload results in paper

Privatization results summary
Removing private stack references from signatures did not help much
– Most addr references are not to the stack
– Most likely because we ran with the SPARC ISA; other ISAs (e.g., x86) would likely see more benefit
Privatization interface helps four workloads
– The remainder either do not have private heap structures or do not have a high transactional duty cycle

Privatization interface results

Conclusions
Tackle 2 problems with signature designs:
– (1) Area and power overheads of H3 hashing
   E.g., 160 XOR gates for H3, 20 for PBX
– (2) False conflicts due to signature bits set by private memory references
Our solutions:
– (1) Use entropy analysis to guide the hashing function (PBX), a low-cost alternative that performs similarly to H3
– (2) Prevent private stack references from entering signatures, and propose a privatization interface for heap allocations
Notary can be applied to non-TM uses:
– PBX hashing can directly transfer
– Privatization may transfer if addr filtering applies

Future Work
Dynamic entropy calculation:
– How to adapt PBX hashing to entropy changes over time?
Dynamic privatization characteristics:
– How common is it for objects to change sharing status (i.e., from private to shared, and vice versa)?

BACKUP SLIDES

Privatization interface

Privatization function | Usage
shared_malloc(size), private_malloc(size) | Dynamic allocation of shared and private memory objects
shared_free(ptr), private_free(ptr) | Frees up memory allocated by shared or private allocators
privatize_barrier(num_threads, ptr, size), publicize_barrier(num_threads, ptr, size) | Program threads come to a common point to privatize or publicize an object. Must be used outside of transactions

Dynamic privatization
Dynamically switch from private to shared, and vice versa
If transitioning from private -> shared, it is safe to mark the page as shared (at a cost in performance)
If transitioning from shared -> private, the default policy is to disallow the change if other shared objects exist on the same page
Otherwise, trap to user software and let the programmer call shared_free(), followed by private_malloc() on the object

Bit-field overlaps harmful for PBX

Removing stack refs doesn’t help significantly

Entropy of commercial workloads

Signature Operation Example
Program: xbegin; LD A; ST B; LD C; LD D; ST C; …
Hash function(s) insert A, C, D into the read signature (R) and B, C into the write signature (W)
External ST E: E aliases with an inserted address. FALSE POSITIVE: CONFLICT!
External ST F: NO CONFLICT
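The example can be replayed with a deliberately tiny signature so that aliasing is easy to see. Everything concrete here is an assumption chosen for illustration: the numeric addresses assigned to A-F, the 8-bit signature size, and the low-bits hash.

```python
def toy_hash(addr):
    # Assumption: 8-bit signature indexed by the low 3 address bits.
    return addr & 0x7


class ToySignature:
    def __init__(self):
        self.bits = 0

    def insert(self, addr):
        self.bits |= 1 << toy_hash(addr)

    def test(self, addr):
        return bool((self.bits >> toy_hash(addr)) & 1)


# Hypothetical addresses; E is picked to hash to the same bit as A.
A, B, C, D, E, F = 0, 1, 2, 3, 8, 5

read_sig, write_sig = ToySignature(), ToySignature()
program = [("LD", A), ("ST", B), ("LD", C), ("LD", D), ("ST", C)]
for op, addr in program:
    (read_sig if op == "LD" else write_sig).insert(addr)


def external_store_conflicts(addr):
    # A remote store conflicts with anything this transaction read or wrote.
    return read_sig.test(addr) or write_sig.test(addr)


print(external_store_conflicts(E))  # True: E aliases with A (false positive)
print(external_store_conflicts(F))  # False: no conflict
```

E was never accessed by the transaction, yet it is reported as a conflict because it maps to a bit that A already set; F maps to an unset bit and passes.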

Type of Hash Functions
In real programs, addresses are neither independent nor uniformly distributed (key assumptions to derive P_FP(n))
But good (universal/almost universal) hash functions can generate hash values that are almost uniformly distributed and uncorrelated
Hash functions considered:
– Bit-selection (inexpensive, low quality)
– H3 [Carter, CSS79] (moderate, higher quality)