Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization for Multiprocessors
Marcelo Cintra and Josep Torrellas
University of Edinburgh / University of Illinois at Urbana-Champaign
Intl. Symp. on High Performance Computer Architecture, February
Speculative Parallelization
- Assume no dependences and execute threads in parallel
- Track data accesses at run time
- Detect violations
- Squash offending threads and restart them

Example loop with non-analyzable accesses:
  for(i=0; i<100; i++) {
    ... = A[L[i]] + ...
    A[K[i]] = ...
  }
[Figure: iterations J, J+1, and J+2 run speculatively in parallel; iteration J writes A[5] after iteration J+2 has already read A[5], a RAW violation]
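As a rough illustration of the run-time tracking this slide describes, the sketch below records per-thread access bits and squashes a more speculative reader when a predecessor writes a word that reader already consumed. All names (spec_load, spec_store, squash_and_restart) and the per-word granularity are illustrative assumptions, not the hardware interface.

/* Illustrative sketch of run-time dependence tracking for speculative
 * parallelization: remember speculative loads, and on a store check whether
 * any more speculative thread already read the same location (RAW violation).
 * Names and granularity are assumptions for illustration only. */
#include <stdbool.h>

#define N_THREADS 16
#define N_WORDS   1024

static bool loaded[N_THREADS][N_WORDS];   /* per-thread load bits  */
static bool stored[N_THREADS][N_WORDS];   /* per-thread store bits */

static void squash_and_restart(int thread)
{
    /* Roll back the thread's speculative state and re-execute it
     * (not modeled in this sketch). */
    (void)thread;
}

int spec_load(int thread, int addr, const int *mem)
{
    loaded[thread][addr] = true;          /* remember the speculative read */
    return mem[addr];
}

void spec_store(int thread, int addr, int value, int *mem)
{
    stored[thread][addr] = true;
    mem[addr] = value;
    /* Any more speculative thread that already read this word consumed a
     * stale value: a RAW violation, so squash it. */
    for (int t = thread + 1; t < N_THREADS; t++)
        if (loaded[t][addr])
            squash_and_restart(t);
}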
Squashing in Speculative Parallelization
In speculative parallelization:
- Threads may get squashed
- Dependence violations are statically unpredictable
- False sharing may cause further squashes
Squashing
[Figure: execution timeline of threads i through i+j+2; the producer thread writes (Wr) after a consumer thread has already read (Rd), so the consumer's possibly correct and already correct work is discarded and squash overhead is added]
Squashing is very costly
Contribution: Eliminate Squashes
- Framework of hardware mechanisms to eliminate squashes
- Based on learning and prediction of violations
- Improvement: average speedup of 4.3 over plain squash-and-retry for 16 processors
Outline
- Motivation and Background
- Overview of Framework
- Mechanisms
- Implementation
- Evaluation
- Related Work
- Conclusions
Types of Dependence Violations

Type                            | Avoiding Squashes                     | Mechanism
No dependence                   | Disambiguate line addresses           | Plain Speculation
False                           | Disambiguate word addresses           | Delay&Disambiguate
Same-word, predictable value    | Compare values produced and consumed  | ValuePredict
Same-word, unpredictable value  | Stall thread, release when safe       | Stall&Release / Stall&Wait
Learning and Predicting Violations
- Monitor violations at directory
- Remember data lines causing violations
- Count violations and choose mechanism (see the sketch below)
[Figure: state diagram over the mechanisms Plain Speculation, Delay&Disambiguate, ValuePredict, Stall&Release, and Stall&Wait; a potential violation moves a line out of Plain Speculation, further violations and squashes escalate the mechanism, and aging returns the line toward Plain Speculation]
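A hedged sketch of the per-line mechanism selection. It uses the VPT fields listed in the backup slides (State Bits, Squash Counter, ThrsStallR, ThrsStallW, AgePeriod); the exact escalation and aging policy below is an assumption, and the diagram's escalation on a first potential violation is folded into the squash handler for brevity.

/* Sketch of VPT-style mechanism selection per line. The escalation order and
 * aging policy are assumptions; field names follow the slides. */
enum mechanism { PLAIN_SPEC, DELAY_DISAMB, VALUE_PREDICT, STALL_RELEASE, STALL_WAIT };

struct vpt_state {
    enum mechanism mech;      /* State Bits: mechanism currently in use     */
    int            squashes;  /* Squash Counter under the current mechanism */
    int            commits;   /* commits seen since the last state change   */
};

void on_squash(struct vpt_state *s, int thrs_stall_r, int thrs_stall_w)
{
    s->squashes++;
    if (s->mech == PLAIN_SPEC)
        s->mech = DELAY_DISAMB;                 /* try per-word disambiguation */
    else if (s->mech == DELAY_DISAMB)
        s->mech = VALUE_PREDICT;                /* same-word: try value reuse  */
    else if (s->squashes >= thrs_stall_w)
        s->mech = STALL_WAIT;                   /* ThrsStallW reached          */
    else if (s->squashes >= thrs_stall_r)
        s->mech = STALL_RELEASE;                /* ThrsStallR reached          */
}

void on_commit(struct vpt_state *s, int age_period)
{
    if (++s->commits >= age_period) {           /* AgePeriod commits elapsed   */
        s->mech = PLAIN_SPEC;                   /* age back toward plain speculation */
        s->squashes = 0;
        s->commits  = 0;
    }
}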
Delay&Disambiguate
- Assume potential violation is false
- Let speculative read proceed
- Remember unresolved potential violation
- Perform a late, or delayed, per-word disambiguation when the consumer thread becomes non-speculative
  – No per-word access information at directory
  – No word addresses on memory operations
- Squash if same-word violation (see the sketch below)
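A minimal sketch of the delayed per-word check, assuming the mask fields of the Pending Transactions Buffer described later (the consumer's per-word load bits versus the words modified by predecessor threads); field names and widths are illustrative.

/* Sketch of delayed per-word disambiguation: only when the consumer becomes
 * non-speculative are its per-word load bits compared against the words
 * modified by predecessor threads. Field names and widths are illustrative. */
#include <stdint.h>
#include <stdbool.h>

struct pending_txn {
    uint64_t line_tag;        /* line with an unresolved potential violation */
    int      consumer_id;     /* thread that speculatively read the line     */
    uint16_t load_mask;       /* words of the line read by the consumer      */
    uint16_t modified_mask;   /* words written by predecessor threads        */
};

/* Called when the consumer thread becomes non-speculative.
 * Returns true if the consumer must be squashed. */
bool delayed_disambiguation(const struct pending_txn *t)
{
    /* Overlap: a true same-word RAW violation, so squash.
     * No overlap: the line-level conflict was only false sharing. */
    return (t->load_mask & t->modified_mask) != 0;
}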
Delay&Disambiguate (Successful)
[Figure: same timeline, but the speculative read proceeds; only delayed-disambiguation overhead is added and all work remains useful]
Delay&Disambiguate (Unsuccessful)
[Figure: timeline where the delayed disambiguation finds a same-word violation; the consumer is squashed, so the squash overhead is merely postponed]
ValuePredict
- Predict value based on past observed values
  – Assume value is the same as last value written (last-value prediction, value reuse, or silent store)
  – More complex predictors are possible
- Provide predicted value to consumer thread
- Remember predicted value
- Compare predicted value against correct value when consumer thread becomes non-speculative
- Squash if mispredicted value (see the sketch below)
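A sketch of the validation step under last-value prediction, using the same illustrative mask fields as above plus the Predicted Line Value kept in the pending-transaction entry; the word count per line is an assumption.

/* Sketch of ValuePredict validation: the values supplied to the consumer at
 * load time are compared, once the consumer is non-speculative, against the
 * values actually committed by predecessors. Names and widths are illustrative. */
#include <stdint.h>
#include <stdbool.h>

#define WORDS_PER_LINE 16

/* Returns true if the consumer must be squashed. Only words the consumer
 * read (load_mask) that a predecessor also wrote (modified_mask) matter. */
bool validate_value_prediction(const uint32_t predicted[WORDS_PER_LINE],
                               const uint32_t committed[WORDS_PER_LINE],
                               uint16_t load_mask, uint16_t modified_mask)
{
    uint16_t checked = load_mask & modified_mask;
    for (int w = 0; w < WORDS_PER_LINE; w++)
        if ((checked >> w) & 1u)
            if (predicted[w] != committed[w])
                return true;      /* misprediction: squash the consumer       */
    return false;                 /* prediction (or silent store) held: no squash */
}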
Stall&Release
- Stall consumer thread when attempting to read data
- Release consumer thread when a predecessor commits a modified copy of the data to memory
  – Provided no intervening thread has a modified version
- Squash released thread if a later violation is detected
Stall&Release (Successful)
[Figure: timeline where the consumer's read stalls until the producer's write is committed; only stall overhead is added and no work is wasted]
Stall&Wait
- Stall consumer thread when attempting to read data
- Release consumer thread only when it becomes non-speculative (see the sketch below for both stall variants)
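A compact, self-contained sketch of the release conditions for the two stall mechanisms; the boolean inputs stand in for information the directory would actually track, and all names are illustrative.

/* Sketch of the release conditions for Stall&Release vs. Stall&Wait.
 * The inputs are simplified stand-ins for directory state; the names are
 * illustrative and not the paper's interface. */
#include <stdbool.h>

/* Stall&Release: the stalled consumer may resume once a predecessor has
 * committed a modified copy of the line to memory and no intervening thread
 * still holds a modified version. The released thread can still be squashed
 * if a later violation is detected. */
bool stall_release_may_resume(bool predecessor_committed_line,
                              bool intervening_modified_copy)
{
    return predecessor_committed_line && !intervening_modified_copy;
}

/* Stall&Wait: the stalled consumer resumes only when it has become the
 * non-speculative (head) thread, so this read can no longer cause a squash. */
bool stall_wait_may_resume(bool consumer_is_non_speculative)
{
    return consumer_is_non_speculative;
}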
Baseline Architecture
[Figure: nodes with processor + caches, memory, and directory controller, connected by a network]
Conventional support for speculative parallelization:
- Speculation module with per-line access bits
- Per-word access bits in caches
Speculation Module
Global Memory Disambiguation Table (GMDT) (Cintra, Martínez, and Torrellas – ISCA’00)
- One entry per memory line speculatively touched
- Load and Store bits per line per thread
- Mapping of threads to processors and ordering of threads
[Entry layout: Line Tag | Valid Bit | Load Bits | Store Bits]
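A rough C sketch of a GMDT row as described on this slide. The field widths, the 16-thread limit, and the helper that flags potential violators are illustrative assumptions; the sketch also ignores the thread-to-processor mapping and ordering information the GMDT keeps.

/* Sketch of one GMDT row: one entry per speculatively touched memory line,
 * with a load bit and a store bit per speculative thread. Widths and the
 * thread limit are assumptions for illustration. */
#include <stdint.h>
#include <stdbool.h>

#define MAX_SPEC_THREADS 16

struct gmdt_entry {
    uint64_t line_tag;     /* address of the memory line              */
    bool     valid;        /* entry in use                            */
    uint16_t load_bits;    /* bit t set: thread t loaded the line     */
    uint16_t store_bits;   /* bit t set: thread t stored to the line  */
};

/* Example directory-side check on a store by thread `writer`: any more
 * speculative thread that already loaded the line is a potential (line-level)
 * RAW violator. Assumes thread IDs map directly to bit positions. */
uint16_t potential_violators(const struct gmdt_entry *e, int writer)
{
    uint16_t later_threads = (uint16_t)(~0u << (writer + 1));
    return e->load_bits & later_threads;
}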
Enhanced Architecture
[Figure: the baseline node organization extended with the two new modules]
New modules:
- VPT (Violation Prediction Table): monitors and learns violations, and enforces our mechanisms
- LDE (Late Disambiguation Engine): uses local per-word info to support Delay&Disambiguate and ValuePredict
Violation Prediction Table (VPT)
- Entries for lines recently causing potential violations
- Appended to every row in the GMDT
[Row layout: GMDT fields (Line Tag | Valid Bit | Load Bits | Store Bits) followed by VPT fields (State Bits | Valid Bit | Squash Counter | SavedSquash Bit)]
VPT Pending Transactions Buffer
- Entries for pending transactions on VPT lines
[Entry layout: Valid Bit | Line Tag | Thread ID | State Bits | Modified Mask | Predicted Line Value]
- 2K-entry VPT: < 2 Kbytes
- 128-entry Pending Transactions Buffer: < 10 Kbytes
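A hedged C sketch combining the VPT extension of a GMDT row with a Pending Transactions Buffer entry, using the fields named on these slides and described in the backup slides; all widths are guesses.

/* Sketch of the VPT fields appended to a GMDT row and of one Pending
 * Transactions Buffer entry. Field widths are illustrative guesses. */
#include <stdint.h>
#include <stdbool.h>

enum mechanism { PLAIN_SPEC, DELAY_DISAMB, VALUE_PREDICT, STALL_RELEASE, STALL_WAIT };

struct vpt_fields {                  /* appended to every GMDT row            */
    bool           valid;
    enum mechanism state;            /* State Bits: mechanism in use          */
    uint8_t        squash_counter;   /* squashes under the current mechanism  */
    bool           saved_squash;     /* a squash was saved by this mechanism  */
};

struct pending_txn_entry {           /* one unresolved potential violation    */
    bool           valid;
    uint64_t       line_tag;         /* line with the unresolved violation    */
    uint8_t        thread_id;        /* consumer thread involved              */
    enum mechanism state;            /* mechanism used for this transaction   */
    uint16_t       modified_mask;    /* words modified by predecessor threads */
    uint32_t       predicted[16];    /* values provided to the consumer       */
};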
Operation under Delay&Disambiguate (no same-word violation)
[Figure walkthrough: thread i+j+k loads word 1 of a line and thread i stores word 3 of the same line, so the VPT allocates a D&D pending-transaction entry; after the predecessor threads commit, thread i+j+k issues a delayed disambiguation request; the load bits and the modified mask do not overlap, so there is no squash]
Operation under Delay&Disambiguate (same-word violation)
[Figure walkthrough: as above, but a predecessor thread also stores the word that thread i+j+k had loaded; when the delayed disambiguation request arrives, the load bits and the modified mask overlap and thread i+j+k is squashed]
Simulation Environment
- Execution-driven simulation
- Detailed superscalar processor model
- Coherent + speculative memory back-end
- Directory-based multiprocessor: 4 nodes
  – Node: speculative CMP + 1 MB L2 + 2K-entry GMDT + memory
  – CMP: 4 x (processor + 32 KB L1)
  – Processor: 4-issue, dynamic
Applications
- Applications dominated by non-analyzable loops with cross-iteration dependences (average of 60% of sequential time)
  – TRACK (PERFECT)
  – EQUAKE, WUPWISE (SPECfp2000)
  – DSMC3D, EULER (HPF2)
- Non-analyzable loops and accesses identified by the Polaris parallelizing compiler
- Results shown for the non-analyzable loops only
Delay&Disambiguate Performance
- Our scheme gets very close to the oracle
- Plain line-based protocol is slow
- Most squashes eliminated with Delay&Disambiguate
Complete Framework Performance
- Framework gets very close to the oracle
- Complete framework performs as well as or better than each mechanism alone
Related Work
Other dynamic learning, prediction, and specialized handling of dependences for speculative parallelization:
– Synchronization: Multiscalar; TLDS
   - Mostly tailored to CMPs
   - Learning based on instruction addresses (Multiscalar)
– Value prediction: Clustered Speculative; TLDS
   - We use the compiler to eliminate trivial dependences (like TLDS)
   - We use floating-point applications
– Mixed word/line speculation: I-ACOMA
   - We never require word addresses on memory operations
Conclusions
- Framework of hardware mechanisms eliminates most squashes
- Very good performance of line-based speculative machine with framework:
  – 4.3 times faster than a line-based speculative machine without the framework
  – 1.2 times faster than an expensive word-based speculative machine
- Delay&Disambiguate has the largest impact, but its combination with Stall&Wait is best
- ValuePredict did not work well in our environment
VPT Fields
- State Bits: the mechanism currently in use for the line
- Squash Counter: number of squashes to the line incurred with the current mechanism
- SavedSquash Bit: whether a squash was saved by the current mechanism
- ThrsStallR: number of squashes that triggers the Stall&Release mechanism
- ThrsStallW: number of squashes that triggers the Stall&Wait mechanism
- AgePeriod: number of commits after which the state of the line is aged
Pending Transactions Buffer Fields
- Line Tag: address of the line with an unresolved violation
- Thread ID: consumer thread with an unresolved violation
- State Bits: mechanism used
- Modified Mask: bitmap of all modifications to words of the line by predecessor threads
- Predicted Line Value: actual data values provided to the consumer thread
Operation under ValuePredict (correct prediction)
[Figure walkthrough: thread i+j+k loads word 1 and the VPT records the value it consumed; thread i later stores word 1 with value 25, matching that value; after the predecessors commit, the delayed disambiguation request compares the values, they are equal, and there is no squash]
Operation under ValuePredict (misprediction)
[Figure walkthrough: as above, but thread i stores word 1 with value 26, which differs from the value thread i+j+k consumed; the delayed disambiguation request detects the mismatch and thread i+j+k is squashed]
Summary
+ Delay&Disambiguate offers word-level disambiguation without word addresses in most memory transactions
+ Simple implementation combines all the different mechanisms seamlessly
+ Learning/prediction policy is effective when a subset of the speculative data tends to be accessed with dependences
- Learning/prediction policy is not so effective when a subset of the speculative instructions tends to cause dependences
- Learning/prediction policy reacts to previous behavior but cannot extrapolate/anticipate behavior
Simulation Environment

Processor Parameter                   | Value
Issue width                           | 4
Instruction window size               | 64
No. functional units (Int, FP, Ld/St) | 3, 2, 2
No. renaming registers (Int, FP)      | 32, 32
No. pending memory ops. (Ld, St)      | 8, 16

Memory Parameter                      | Value
L1, L2, VC size                       | 32 KB, 1 MB, 64 KB
L1, L2, VC assoc.                     | 2-way, 4-way, 8-way
L1, L2, VC line size                  | 64 B, 64 B, 64 B
L1, L2, VC latency                    | 1, 12, 12 cycles
L1, L2, VC banks                      | 2, 3, 2
Local memory latency                  | 75 cycles
2-hop memory latency                  | 290 cycles
3-hop memory latency                  | 360 cycles
GMDT size                             | 2K entries
GMDT assoc.                           | 8-way
GMDT/VPT lookup                       | 20 cycles
Pend. Trans. Buffer size              | 128 entries
Pend. Trans. Buffer scan              | 3 cycles/entry
Application Characteristics
[Table: application, non-analyzable loops, % of sequential time, and types of RAW dependences (same-word and/or false)]
- TRACK: nlfilt_300
- EQUAKE: smvp_1195
- DSMC3D: move3_100
- EULER: dflux_[100,200], psmoo_20, eflux_[100,200,300]
- WUPWISE: muldeo_200’, muldoe_200’
Squash Behavior
Stall&Wait Performance
- Some speedup, but significantly limited by stall time
- Our scheme gets close to the oracle most of the time
Related Work
Other hardware-based speculative parallelization:
– Multiscalar (Wisconsin); Hydra (Stanford); Clustered Speculative (UPC); TLDS (CMU); MAJC (Sun); Superthreaded (Minnesota); Illinois
– Mostly tailored to CMPs
– Mostly word-based speculation
– Mostly squash-and-retry