Compiler Management of Communication and Parallelism for Quantum Computation
Authors: Jeff Heckey, Shruti Patil, Ali Javadi Abhari, Adam Holmes, Daniel Kudrow, Kenneth R. Brown, Diana Franklin, Frederic T. Chong, Margaret Martonosi (ASPLOS '15)
Slides by: George Li

Objective
- Proposes a multi-SIMD (single instruction, multiple data) scalable quantum computing architecture using microwave control fanout, quantum teleportation, and global/local memories
- Efficient use of the Multi-SIMD architecture requires orchestrating qubits to maximize parallelism and reduce qubit move/teleport operations, so communication-aware scheduling techniques are the main focus

Ion Trap Technology
- Trapped-ion systems appeared to be the most promising candidate for large-scale quantum computing (as of 2015)
- Well-characterized qubits; long decoherence times; ability to initialize, measure, and perform a universal set of gates
- Laser phase tuning must be extremely precise, and the optics are bulky, limiting systems to roughly 6-12 lasers
- Experiments with microwaves show potential multi-qubit control scalability of 10-100 qubits per microwave emitter; an additional fanout level could create SIMD regions of ~100-10,000 qubits
- Microwave control does not support measurement, so laser-controlled measurement is assumed

Quantum Teleportation
- An EPR (Einstein-Podolsky-Rosen) pair q2/q3 is pre-distributed between communication regions before teleportation; q3 serves as the destination
- Qubits q1 and q2 are measured; the classical results are transmitted to q3's region, where X and Z corrections recover the state
- q1's state is destroyed in the process
- Teleportation requires up to 4x the communication cost of moving a single qubit
- All gates are assumed to have the same latency
- The scheduling algorithm should therefore minimize teleport operations
- Teleportation allows a simplified global memory that makes long-distance communication constant-cost
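For reference, a minimal sketch of the standard teleportation protocol as a gate list, using gate names from the QASM set described on the Compiler Framework slide; the apply/measure callbacks are hypothetical stand-ins for a simulator backend.

```python
def teleport(q1, q2, q3, apply, measure):
    """Teleport q1's state onto q3, consuming the EPR pair (q2, q3)."""
    apply("CNOT", q1, q2)   # entangle the source with its half of the pair
    apply("H", q1)          # rotate q1 into the measurement basis
    m1, m2 = measure(q1), measure(q2)  # q1's state is destroyed here
    # Only the two classical bits cross between regions; the destination
    # applies Pauli corrections to recover the original state on q3.
    if m2:
        apply("X", q3)
    if m1:
        apply("Z", q3)
    return q3
```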

Multi-SIMD Architecture Model
- k SIMD compute regions with d qubits each; k is limited by the number of distinct microwave signals, d by the achievable fanout without disturbing the system
- Microwave pulses control qubit states
- Each region has a teleportation unit and local memory; a global memory simplifies scheduling
- Teleportation uses classical communication channels, which reduces decoherence compared to repeatedly moving qubits across the physical fabric

Multi-SIMD Architecture (cont.)
- The k SIMD regions can serve as active computation as well as passive storage for qubits
- Active qubits must be moved between regions; idle qubits are moved to global memory
- Using local memory could reduce this overhead
- The centralized global memory model assumes constant teleportation latency
- Constant cost favors parallelism: memory access takes the same time for serial and parallel schedules
- In the simplified model, a local memory access takes 1 compute cycle and a global memory access takes 4
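An illustrative parameter record for the Multi-SIMD(k,d) model; the two latencies come from this slide, while the k and d values here are placeholders, not numbers from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MultiSIMD:
    k: int = 4                # SIMD compute regions (microwave-signal limited)
    d: int = 1024             # qubits per region (fanout limited)
    local_latency: int = 1    # compute cycles for a local-memory access
    global_latency: int = 4   # compute cycles for a global-memory teleport
```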

Compiler Framework
- Builds on ScaffCC: translates algorithms written in Scaffold into LLVM IR, then into QASM
- Loop constructs and state-based conditional statements allow parallelism to be expressed and controlled
- QASM uses qubits, cbits, and a universal instruction set: Pauli gates (X, Y, Z), CNOT, H, Phase (S), and π/8 (T)
- QECC sub-operations below the logical level are not optimized
- Classical-To-Quantum-Gates (CTQG) is integrated into the flow; it decomposes arithmetic operations and if-then-else constructs into QASM
- A decomposition stage translates rotation and Toffoli gates into approximate equivalents; scheduling is targeted after this step
- Full unrolling of all gate operations (using iteration counts) is available with Scaffold, but produces code too large for these benchmarks
- The LLVM intermediate representation of ScaffCC is extended to analyze modules as leaf and non-leaf hierarchies, allowing fine-grained scheduling within leaf modules and coarse-grained scheduling across them

Leaf Module Flattening Example
- The example contains two dependent Toffoli operations
- Each Toffoli operation is decomposed into QASM gates
- The modular (coarse-grained) case does not detect inter-blackbox parallelism
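For concreteness, the textbook exact decomposition of a Toffoli gate into the H/T/CNOT subset of the QASM gate set above (Tdg denotes the inverse of T); the gate-list representation here is my own.

```python
def toffoli(a, b, c):
    """Standard Clifford+T decomposition of Toffoli(a, b, c), c the target."""
    return [
        ("H", c),
        ("CNOT", b, c), ("Tdg", c),
        ("CNOT", a, c), ("T", c),
        ("CNOT", b, c), ("Tdg", c),
        ("CNOT", a, c),
        ("T", b), ("T", c), ("H", c),
        ("CNOT", a, b), ("T", a), ("Tdg", b),
        ("CNOT", a, b),
    ]
```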

Flattening Results
- Flattening Threshold (FTh): resource estimation counts the number of gates in each module
- A module with fewer gates than the threshold is "flattened" (inlined)
- About 80% of modules were flattened
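A hedged sketch of the flattening decision; the Module fields and helper name are placeholders, and ScaffCC's actual resource-estimation pass is more involved than a stored gate count.

```python
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    gate_count: int   # from resource estimation

def modules_to_flatten(modules, fth):
    """Return the modules whose gate count falls below the threshold FTh."""
    return [m for m in modules if m.gate_count < fth]
```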

Execution and Scheduling Model
- The Multi-SIMD(k,d) model allows 1 to k distinct gates to execute simultaneously in one logical timestep, each applied to 1 to d qubits
- Gate operations are assumed to take one logical timestep, constrained by the longest operation time
- The compiler treats called modules as blackboxes, so all active qubits are flushed to global memory across calls
- Ancilla qubits are assumed to be generated in global memory and teleported to the SIMD region that needs them
- Teleportation between SIMD regions must also be scheduled
- Schedulers must account for data movement costs: teleportation takes 4 timesteps, so a naive schedule can run up to 5x longer than the computed critical path
- A schedule is a list of sequential timesteps, each containing an array of k+1 SIMD region entries
- The 0th entry lists the qubits to be moved, with their sources and destinations; the rest list the operations to be performed in each region
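A sketch of the schedule representation just described: a list of timesteps, each holding k+1 entries, with entry 0 for qubit moves and entries 1..k for the gates in each SIMD region. The field names are mine.

```python
def empty_timestep(k):
    return {
        "moves": [],                        # entry 0: (qubit, src, dst) triples
        "regions": [[] for _ in range(k)],  # entries 1..k: per-region gate lists
    }

schedule = [empty_timestep(k=4) for _ in range(2)]
schedule[0]["regions"][0].append(("H", "q0"))            # gate in region 1
schedule[1]["moves"].append(("q0", "region1", "global")) # teleport to global memory
```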

Scheduling with Ready Critical Path (RCP)
- Critical-path scheduling prioritizes the tasks that take the longest and schedules other tasks alongside them so the next dependency is not delayed
- It identifies the longest chain of successive dependent operations and its total time to find the critical path
- RCP maintains a ready list rather than a free list: only operations whose dependencies are met are queued
- The ready list is an unsorted list of operations
- Priority evaluates operation type, movement cost, and slack (the graph distance between uses of a qubit)
- Each factor can be weighted, but all weights are set to 1 in this work
- Visualization of the critical path method: https://en.wikipedia.org/wiki/Critical_path_method#/media/File:SimpleAONwDrag3.png
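One plausible shape for the RCP priority function, assuming (per the slide) unit weights on the three factors; the exact formula in the paper's implementation may combine them differently.

```python
def rcp_priority(op_type_match, move_cost, slack,
                 w_op=1, w_move=1, w_slack=1):
    """Higher is better: favor matching gate types, cheap moves, low slack."""
    return w_op * int(op_type_match) - w_move * move_cost - w_slack * slack
```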

Ready Critical Path Algorithm
- At each timestep, a priority is calculated for each (SIMD region, operation type) pairing
- Once all priorities are computed, the pairing with the highest priority is scheduled
- That SIMD region is removed from the available regions and the ready list is updated
- This repeats until all operations in the ready list are scheduled or all regions are occupied
- The ready list is then refilled with operations ready for the next timestep, until the queue is empty
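A self-contained sketch of the RCP loop under simplifying assumptions: unit-latency gates, priority reduced to critical-path length, and any ready operation assignable to any free region. All names are mine; the paper's priority is richer, as noted above.

```python
def rcp_schedule(ops, succ, indeg, cp_len, k):
    """ops: all operations; succ: op -> successors; indeg: op -> unmet deps;
    cp_len: op -> longest distance to a sink; k: SIMD regions per timestep."""
    ready = [o for o in ops if indeg[o] == 0]    # deps met: goes on ready list
    steps = []
    while ready:
        ready.sort(key=lambda o: cp_len[o], reverse=True)
        step, ready = ready[:k], ready[k:]       # fill up to k SIMD regions
        steps.append(step)
        for o in step:                           # release newly ready successors
            for s in succ[o]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
    return steps
```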

Longest Path First Search (LPFS)
(Illustration: linearization of the DAG)
- LPFS dedicates l SIMD regions, where l < k, to the l longest paths
- The algorithm starts by finding the critical path in the LLVM-generated DAG (directed acyclic graph)
- Longest path in a DAG: http://www.mathcs.emory.edu/~cheung/Courses/171/Syllabus/11-Graph/Docs/longest-path-in-dag.pdf

LPFS Algorithm
- Starting at the top, tag each node with its furthest distance from the top
- Find the largest depth from the bottom, then trace back and record the path
- Multiple paths can be found sequentially, starting at arbitrary positions in the execution
- Longest paths are returned as ordered lists of operations, stored in an array of size l, one per statically scheduled region
- Operations not on any longest path are added to a free list and scheduled in any available SIMD region in the range [l+1, k], in the order they become ready
- Two additional options: SIMD and Refill
  - SIMD scheduling lets a region dedicated to a longest path also execute free-list operations of the same gate type
  - Refill handles longest paths of unequal length: when one path finishes, its reserved region is refilled with another longest path
- Experiments were run with l = 1 and both SIMD and Refill enabled
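A sketch of the longest-path extraction LPFS depends on: tag each node with its depth from the top in topological order, then trace back from the deepest node. The names are mine, and the nodes are assumed to arrive topologically sorted.

```python
def longest_path(topo_nodes, succ):
    """topo_nodes: nodes in topological order; succ: node -> successor list."""
    dist = {n: 0 for n in topo_nodes}
    parent = {}
    for u in topo_nodes:                 # tag: furthest distance from the top
        for v in succ[u]:
            if dist[u] + 1 > dist[v]:
                dist[v] = dist[u] + 1
                parent[v] = u
    end = max(topo_nodes, key=dist.get)  # deepest node in the DAG
    path = [end]
    while path[-1] in parent:            # trace back and record the path
        path.append(parent[path[-1]])
    return path[::-1]
```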

Scheduling Algorithm
- Because the benchmarks are so large, a coarse-grained scheduler stitches together the optimized leaf-module schedules produced by RCP and LPFS
- The algorithm targets k SIMD regions; operations are scheduled in priority order
- Flexible blackbox dimensions are considered for parallelizable modules to find the best combination (see the sketch after this list)
- For each operation Fi in the priority-ordered set:
  - Check whether dependencies allow scheduling at the current timestep; if not, serialize Fi into its own schedule
  - Find the length l and width w of the operation
  - If it fits within the constraints, add it to the schedule of the current block
  - If not, try different combinations of length and width to see if any fit; if so, add it to the block's schedule, otherwise serialize the operation into its own box
  - Merge the current box into the total box dimensions
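A heavily hedged sketch of the fit-or-serialize test described above, treating a leaf-module schedule as a box of length l (timesteps) by width w (regions); the fitting rule and all names here are my paraphrase of the slide, not the paper's code.

```python
def stitch(box, op_shapes, k):
    """box = (length, width) so far; op_shapes = candidate (l, w) foldings of Fi."""
    length, width = box
    for l, w in op_shapes:               # try each length/width combination
        if width + w <= k:               # fits beside the current contents
            return max(length, l), width + w
    l, w = op_shapes[0]                  # no fit: serialize into its own box
    return length + l, max(width, w)
```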

Scheduling Communication with Local Memory
- To further reduce communication overhead, a rudimentary optimization uses the local memories attached to SIMD regions
- It runs after the communication-aware algorithm has mapped computations to SIMD regions and communication schedules have been computed
- The pass determines whether some qubits can be moved to local memory instead of teleported to global memory, based on when each qubit's next computation is scheduled
- Local memory is also constrained by capacity
- A qubit can be stored in global memory, the source's local memory, the destination's local memory, or an idle SIMD region
- Unless the source SIMD region is idle, qubits are moved to global memory
- After each compute cycle, qubits not scheduled for the next operation are moved to global or scratchpad memory; qubits reused immediately are left in place
- Local moves are derived from the leaf-module schedules, and the schedule is modified to reflect them; if any region requires a move to/from global memory, the schedule waits 4 cycles to avoid a race condition
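A sketch of the parking decision this pass makes: the 1-cycle local / 4-cycle global latencies come from the architecture slides, while the decision rule and capacity bookkeeping are my paraphrase.

```python
def park(qubit, dest_region, reused_next_cycle, local_free):
    """Decide where `qubit` waits until its next scheduled operation.
    local_free maps a region to its remaining local-memory capacity."""
    if reused_next_cycle:
        return "in place"                  # next op is here: don't move it
    if local_free.get(dest_region, 0) > 0:
        local_free[dest_region] -= 1
        return f"local:{dest_region}"      # 1-cycle access near the next use
    return "global"                        # constant-latency 4-cycle teleport
```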

Benchmarks Used
The benchmarks modularly describe quantum circuits acting on logical qubits, ranging from 10^7 to 10^12 gates.
- Grover's Search: amplitude amplification to search a database of 2^n elements; parameterized by n
- Binary Welded Tree: uses a quantum random-walk algorithm to find a path between the entry and exit nodes; parameterized by height n and time s to find a solution
- Ground State Estimation: quantum phase estimation to estimate the ground-state energy of a molecule; parameterized by the size of the molecule in terms of molecular weight M
- Triangle Finding Problem: finds a triangle in a dense, undirected graph; parameterized by the number of nodes n
- Boolean Formula: computes a winning strategy for a game of Hex; parameterized by board size (x, y)
- Class Number: computational algebraic number theory; computes the class group of a quadratic number field; parameterized by p, the number of digits after the radix point in the floating-point numbers used
- Secure Hash Algorithm 1: a message is decrypted using the SHA-1 function as the oracle in Grover's search; parameterized by message size n
- Shor's Factoring Algorithm: factorization using the quantum Fourier transform; parameterized by n, the number of bits in the number to factor

Results
- CTQG-generated modules are unoptimized and highly serialized locally
- For example, Shor's algorithm has long chains of non-parallelizable rotations that require many SIMD regions to parallelize, making its performance sensitive to k

Benchmark Speedup Results
- Without optimizing for communication latency, the estimated critical-path runtime for Shor's suggests a 10x speedup, while the actual runtime achieves only 2-3x
- With communication-aware scheduling, the estimated critical-path runtime is still achievable
- A communication-aware scheduler increases runtime speedups compared to a naive movement model
- The GSE benchmark shows the largest improvement

Speedup Results (continued)
- Data-level parallelism captures most of the available parallelism, even with relatively small k
- Even though an unbounded amount of data parallelism is assumed, reducing d to 32 results in only marginal changes in effectiveness
- A purely logical-level approach yields an optimized architecture that can even accommodate additional error-correction qubits

Conclusion
- The proposed Multi-SIMD architecture incorporates the physical constraints of quantum computing and works in concert with the compiler
- Ion-trap technology with microwave control and quantum teleportation serves as a practical basis
- The design is scalable and maximizes parallelism while minimizing communication overhead
- The hierarchical analysis scheme lets the scheduling algorithms parallelize at both coarse and fine granularity
- Produces observable speedups when evaluated against sequential execution with a naive communication model

Open Problems
- Levels below the logical level are not explored
- Speedup is measured only against naive sequential execution
- Further local and global memory optimizations and latency analysis remain open
- The proposed target architecture rests on theoretical assumptions about practicality

Questions?