MICRO-50 Swamit Tannu Zachary Myers Douglas Carmean Prashant Nair

Slides:



Advertisements
Similar presentations
CMSC 611: Advanced Computer Architecture Cache Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from.
Advertisements

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.
CML Efficient & Effective Code Management for Software Managed Multicores CODES+ISSS 2013, Montreal, Canada Ke Bai, Jing Lu, Aviral Shrivastava, and Bryce.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.
University of Michigan Electrical Engineering and Computer Science 1 A Distributed Control Path Architecture for VLIW Processors Hongtao Zhong, Kevin Fan,
Department of Computer Science iGPU: Exception Support and Speculative Execution on GPUs Jaikrishnan Menon, Marc de Kruijf Karthikeyan Sankaralingam Vertical.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
Processor history / DX/SX SX/DX Pentium 1997 Pentium MMX
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU Presented by: Ahmad Lashgar ECE Department, University of Tehran.
1 Layers of Computer Science, ISA and uArch Alexander Titov 20 September 2014.
Basic Microcomputer Design. Inside the CPU Registers – storage locations Control Unit (CU) – coordinates the sequencing of steps involved in executing.
Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor Mark Gebhart 1,2 Stephen W. Keckler 1,2 Brucek Khailany 2 Ronny Krashinsky.
Dong Hyuk Woo Nak Hee Seong Hsien-Hsin S. Lee
1 Computer Architecture Research Overview Rajeev Balasubramonian School of Computing, University of Utah
High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.
Chapter 2 Data Manipulation Yonsei University 1 st Semester, 2015 Sanghyun Park.
Embedded System Lab. 정범종 A_DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters H. Wang et al. VEE, 2015.
M U N - February 17, Phil Bording1 Computer Engineering of Wave Machines for Seismic Modeling and Seismic Migration R. Phillip Bording February.
Lecture 2: Computer Architecture: A Science ofTradeoffs.
Chapter 5 Memory III CSE 820. Michigan State University Computer Science and Engineering Miss Rate Reduction (cont’d)
Computer Architecture Lecture 26 Past and Future Ralph Grishman November 2015 NYU.
Lx: A Technology Platform for Customizable VLIW Embedded Processing.
Chapter 2 Data Manipulation © 2007 Pearson Addison-Wesley. All rights reserved.
Simultaneous Multi-Layer Access Improving 3D-Stacked Memory Bandwidth at Low Cost Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Samira Khan, Onur Mutlu.
SPRING 2012 Assembly Language. Definition 2 A microprocessor is a silicon chip which forms the core of a microcomputer the concept of what goes into a.
Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.
Computer Organization and Architecture Lecture 1 : Introduction
These slides are based on the book:
Improving Multi-Core Performance Using Mixed-Cell Cache Architecture
Auburn University COMP8330/7330/7336 Advanced Parallel and Distributed Computing Parallel Hardware Dr. Xiao Qin Auburn.
CST 303 COMPUTER SYSTEMS ARCHITECTURE (2 CREDITS)‏
CS203 – Advanced Computer Architecture
Gwangsun Kim, Jiyun Jeong, John Kim
QUANTUM COMPUTING: Quantum computing is an attempt to unite Quantum mechanics and information science together to achieve next generation computation.
ESE532: System-on-a-Chip Architecture
Reza Yazdani Albert Segura José-María Arnau Antonio González
Evaluating Register File Size
Lecture 1: CS/ECE 3810 Introduction
For Massively Parallel Computation The Chaotic State of the Art
Efficient Complex Operators for Irregular Codes
Edexcel GCSE Computer Science Topic 15 - The Processor (CPU)
Stateless Combinational Logic and State Circuits
Read Only Memory July 22, 2018.
Ultimate Theoretical Models of Nanocomputers
Architecture & Organization 1
Overview Introduction General Register Organization Stack Organization
Wireless Sensor Network
Ke Bai and Aviral Shrivastava Presented by Bryce Holton
Methodology of a Compiler that Compresses Code using Echo Instructions
Foundations of Computer Science
The University of Texas at Austin
Introduction, Focus, Overview
Milad Hashemi, Onur Mutlu, Yale N. Patt
Architecture & Organization 1
PACE: Power-Aware Computing Engines
High Throughput LDPC Decoders Using a Multiple Split-Row Method
Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt
CS 213 Lecture 11: Multiprocessor 3: Directory Organization
OSU Quantum Information Seminar
MICRO-2018 Gururaj Saileshwar1 Prashant Nair1 Prakash Ramrakhyani2
Computer Evolution and Performance
What is Computer Architecture?
What is Computer Architecture?
Co-designed Virtual Machines for Reliable Computer Systems
Introduction, Focus, Overview
Introduction COE 301 Computer Organization Prof. Aiman El-Maleh
Husky Energy Chair in Oil and Gas Research
What Are Performance Counters?
Funded by the Horizon 2020 Framework Programme of the European Union
Magic-State Functional Units
Presentation transcript:

Taming the Instruction Bandwidth of Quantum Computers via Hardware Managed Error Correction MICRO-50 Swamit Tannu Zachary Myers Douglas Carmean Prashant Nair Moinuddin Qureshi

Quantum Computers enable solutions to important problems Why Quantum Computers? Execution Time Classical Computer Problem Size Quantum Computer Billion Years Molecule and Material Simulations Quantum computers provide large speedup for problems in material science, machine learning, and medicine Days Quantum Computers enable solutions to important problems

Quantum Bits: Background State of a Classical Bit  1 or 0 two points on sphere State of a Quantum Bit Any point on the sphere Quantum Bit Classical Bit Quantum computer use quantum bits (qubits) to encode the information.

Quantum Instruction: Background Quantum Instructions manipulate the state of qubit By rotating state vector in the Hilbert space

Organization of Quantum Computer Control Processor Qubits Quantum Computer Control Processor -- Interface between Qubits & Programmer

Todays Quantum Computer Dilution Refrigerator 5 Qubit Chip (IBM) Qubits 5 Qubit Chip (IBM) 20mK Ref: IBM Quantum Experience

Cryogenic Control Processor Qubits 20mK 300K (27oC) Qubits 20mK Control Processor 4K (-269o C) Metal Wires  Thermal Leakage Superconducting wires  Low Leakage Cryogenic Control Processor is essential for scalable Quantum Computer (Ref: Cryogenic Control Architecture for Large-Scale Quantum Computing by Hornibrook et. al)

Scalable Organization

Outline Background Problem  Instruction Bandwidth Bloat due to Error Correction Design  HW managed error correction Challenges Evaluation & Conclusion

Quantum bits are fickle Even at 20mK, qubits can lose state 1 Classical Bit Quantum Bit Need Error Correction to protect Quantum bits

Quantum Error Correction Copying qubits is not allowed Measurement destroys the qubit state Quantum Error Correction can protect qubits No Copy Control Processor Qubits Continuous Quantum Error is essential to protect the state

Programmable Quantum Error Correction Cost Error Rate Programmable Quantum Error Correction Different QECC Designed New error correction codes  QECC needs to be programmable Software managed QECC  Compiler inserts QECC instructions in regular instruction stream Different Error Correction Designs Triangle Code Planer Codes Twist Surface Code Concatenation Code

Baseline Architecture Uncorrectable Errors !! Control Processor Qubits

Problem: Instruction Bandwidth Bottleneck Shor’s Algorithm Instruction Bandwidth (B/s) Total Bandwidth Non-QECC Bandwidth Number of Qubits Can’t use i-cache -- delay in delivery of QECC instructions results in error SW-QECC + no i-cache instruction bandwidth must scale linearly with number of qubits Can’t use i-cache 128 bit 256 bit 1024 bit QECC Bandwidth Bloat – 10,000X Instruction Bandwidth bottleneck  99.99% of instructions are QECC

QECC Instruction Bandwidth Bloat Realistic Quantum workloads require substantially large number of qubits Large number of qubits  must support large instruction bandwidth QECC instruction bloat is dominant in all large scale quantum workloads

Software-Managed QECC Programmability BW Efficiency Software-Managed QECC Hardcoded QECC Goal: To enable programmable Quantum Error Correction without bandwidth bloat

QECC can be executed independently without any global synchronization Insight QECC can be executed independently without any global synchronization QECC is simple enough to manage in hardware using programmable microcode

QuEST alleviates instruction bandwidth by issuing QECC ops locally QuEST Architecture Master Controller MCE continuously issues QECC µops from local µcode Master controller issues regular instructions. MCE decodes it to µops Control Processor µ-Code Engine Qubits QuEST alleviates instruction bandwidth by issuing QECC ops locally

MCE Microarchitecture QuEST alleviates instruction bandwidth by issuing QECC ops locally

Quantum Execution Unit Dedicated QECC microcode continuously run QECC Microcode Design Logical Microcode  supplies logical µops QECC Microcode  supplies QECC µops to all qubits Mask Bits  selects between QECC µop and Logical µops Quantum Execution Unit Dedicated QECC microcode continuously run QECC

Microcode Memory Capacity Requirements C bits Log(N) bits QECC µops NO need of opcode Qubits Microcode memory capacity requirement increase linearly with no. of qubits

Limited Memory Capacity at 4 Kelvin CMOS Frequency JJ-Memory Capacity 4 Kb/cm2 JJ-logic Data Energy 105 Device Density 30 sec JJ – Works at 4K Ref: Beyond CMOS Superconducting Digital Circuits by MIT Lincoln Labs Capacity of JJ-based memory is limited

Memory Capacity Limits Number of Qubits Capacity of state of the art JJ based Memory 2 mins 4Kb µcode memory can only support about 35 qubits

Insight- Repetition in QECC Instructions A B C D Unit Cell Unit Cell A B C D QECC µops Unit cell need clearification Why what is it ? Animation for unit cell Qubit1 Qubit12 QECC have spatially repeating instruction blocks -- Unit Cell Unit-Cell µops can generate QECC µops for all the qubits QECC instruction patterns are leveraged to minimize µcode capacity

Unit-Cell Enables Constant Storage Capacity of state of the art JJ based Memory O(1) 2 mins QECC instruction patterns are leveraged to minimize µcode capacity

Number of Qubits Serviced 90x UNIT CELL UNIT CELL FIFO Redudant RAM For 4Kb Memory Budget Microcode can service up to 2000 qubits with 4Kb of microcode

QuEST reduces global bandwidth demand by seven orders of magnitude Evaluations Instruction BW Demand Reduction in Global 107x Why we get this benefit ? Most of the instructions stem from QECC we handle QECC in microcode and that reduces the instruction bandwidth demand by several orders of magnitude. This graph shows Reduction in Global Bandwidth demand on the y axis for quantum workloads including Shors algorithm. On average Quest reduces instruction bandwidth demand by seven order of magnitude QuEST reduces global bandwidth demand by seven orders of magnitude

Software-Managed QECC Conclusion Programmability BW Efficiency Software-Managed QECC Hardcoded QECC QuEST (µcode)

Questions? Want to know more about quantum … Thank you and I will be happy to take questions

Backup slides