Sketching high-performance implementations of bitstream programs. Armando Solar-Lezama, Rastislav Bodik UC Berkeley.

Slides:

Advertisements

Similar presentations

Chapter 1 An Overview of Computers and Programming Languages.

Advertisements

1.2 Row Reduction and Echelon Forms

Linear Equations in Linear Algebra

The Future of Correct Software George Necula. 2 Software Correctness is Important ► Where there is software, there are bugs ► It is estimated that software.

Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.

StreamBit: Sketching high-performance implementations of bitstream programs Armando Solar-Lezama, Rastislav Bodik UC Berkeley.

Algorithms and Problem Solving-1 Algorithms and Problem Solving.

Formal Methods. Importance of high quality software ● Software has increasingly significant in our everyday activities - manages our bank accounts - pays.

Honors Compilers The Course Project Feb 28th 2002.

Algorithms and Problem Solving. Learn about problem solving skills Explore the algorithmic approach for problem solving Learn about algorithm development.

Microprocessors Introduction to ia64 Architecture Jan 31st, 2002 General Principles.

Chapter 2: Impact of Machine Architectures What is the Relationship Between Programs, Programming Languages, and Computers.

Charles Kime & Thomas Kaminski © 2008 Pearson Education, Inc. (Hyperlinks are active in View Show mode) Chapter 3 – Combinational Logic Design Part 1 –

Programming. Software is made by programmers Computers need all kinds of software, from operating systems to applications People learn how to tell the.

1.3 Executing Programs. How is Computer Code Transformed into an Executable? Interpreters Compilers Hybrid systems.

CHAPTER 1: INTORDUCTION TO C LANGUAGE

1 Programming Languages Examples: C, Java, HTML, Haskell, Prolog, SAS Also known as formal languages Completely described and rigidly governed by formal.

Principles of Programming Chapter 1: Introduction  In this chapter you will learn about:  Overview of Computer Component  Overview of Programming 

SEC(R) 2008 Intel® Concurrent Collections for C++ - a model for parallel programming Nikolay Kurtov Software and Services.

GENERAL CONCEPTS OF OOPS INTRODUCTION With rapidly changing world and highly competitive and versatile nature of industry, the operations are becoming.

High level & Low level language High level programming languages are more structured, are closer to spoken language and are more intuitive than low level.

Charles Kime & Thomas Kaminski © 2004 Pearson Education, Inc. Terms of Use (Hyperlinks are active in View Show mode) Terms of Use Lecture 12 – Design Procedure.

Chapter 1 Introduction Dr. Frank Lee. 1.1 Why Study Compiler? To write more efficient code in a high-level language To provide solid foundation in parsing.

TMF1013 : Introduction To Computing Lecture 1 : Fundamental of Computer ComputerFoudamentals.

Language processors (Chapter 2) 1 Course Overview PART I: overview material 1Introduction 2Language processors (tombstone diagrams, bootstrapping) 3Architecture.

Today  Table/List operations  Parallel Arrays  Efficiency and Big ‘O’  Searching.

Chapter 3 Developing an algorithm. Objectives To introduce methods of analysing a problem and developing a solution To develop simple algorithms using.

Enabling Refinement with Synthesis Armando Solar-Lezama with work by Zhilei Xu and many others*

Lesson 3 CDT301 – Compiler Theory, Spring 2011 Teacher: Linus Källberg.

SE: CHAPTER 7 Writing The Program

INTRODUCTION TO COMPUTING CHAPTER NO. 04. Programming Languages Program Algorithms and Pseudo Code Properties and Advantages of Algorithms Flowchart (Symbols.

Charles Kime & Thomas Kaminski © 2008 Pearson Education, Inc. (Hyperlinks are active in View Show mode) Chapter 3 – Combinational Logic Design Part 1 –

Introduction to Problem Solving. Steps in Programming A Very Simplified Picture –Problem Definition & Analysis – High Level Strategy for a solution –Arriving.

CS221 Algorithm Basics. What is an algorithm? An algorithm is a list of instructions that transform input information into a desired output. Each instruction.

Developing an Algorithm. Simple Program Design, Fourth Edition Chapter 3 2 Objectives In this chapter you will be able to: Introduce methods of analyzing.

Ted Pedersen – CS 3011 – Chapter 10 1 A brief history of computer architectures CISC – complex instruction set computing –Intel x86, VAX –Evolved from.

The Software Development Process

CS162 Week 5 Kyle Dewey. Overview Announcements Reactive Imperative Programming Parallelism Software transactional memory.

Data Structure and Algorithms. Algorithms: efficiency and complexity Recursion Reading Algorithms.

Data Structures and Algorithms Dr. Tehseen Zia Assistant Professor Dept. Computer Science and IT University of Sargodha Lecture 1.

Chapter 1 Computers, Compilers, & Unix. Overview u Computer hardware u Unix u Computer Languages u Compilers.

Radix Sort and Hash-Join for Vector Computers Ripal Nathuji 6.893: Advanced VLSI Computer Architecture 10/12/00.

1 Languages and Compilers (SProg og Oversættere) Bent Thomsen Department of Computer Science Aalborg University With acknowledgement to Norm Hutchinson.

CMPSC 16 Problem Solving with Computers I Spring 2014 Instructor: Tevfik Bultan Lecture 4: Introduction to C: Control Flow.

Recursion Unrolling for Divide and Conquer Programs Radu Rugina and Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology.

Concurrency and Performance Based on slides by Henri Casanova.

Lecture #1: Introduction to Algorithms and Problem Solving Dr. Hmood Al-Dossari King Saud University Department of Computer Science 6 February 2012.

PROGRAMMING FUNDAMENTALS INTRODUCTION TO PROGRAMMING. Computer Programming Concepts. Flowchart. Structured Programming Design. Implementation Documentation.

INTRODUCTION TO COMPUTER PROGRAMMING(IT-303) Basics.

1 1.2 Linear Equations in Linear Algebra Row Reduction and Echelon Forms © 2016 Pearson Education, Ltd.

1 A simple parallel algorithm Adding n numbers in parallel.

Introduction to Problem Solving Programming is a problem solving activity. When you write a program, you are actually writing an instruction for the computer.

Introduction to Computer Programming Concepts M. Uyguroğlu R. Uyguroğlu.

Computer Systems Architecture Edited by Original lecture by Ian Sunley Areas: Computer users Basic topics What is a computer?

Software Engineering Algorithms, Compilers, & Lifecycle.

 Problem Analysis  Coding  Debugging  Testing.

CHAPTER 2 GC101 Program’s algorithm 1. COMMUNICATING WITH A COMPUTER  Programming languages bridge the gap between human thought processes and computer.

Algorithms and Problem Solving

Overview Part 1 – Design Procedure Beginning Hierarchical Design

Compiler Construction

Programming by Sketching for Bit-Streaming Programs

STUDY AND IMPLEMENTATION

Programming Fundamentals (750113) Ch1. Problem Solving

Linear Equations in Linear Algebra

Algorithms and Problem Solving

The Challenge of Cross - Language Interoperability

Armando Solar-Lezama, Rastislav Bodik UC Berkeley

Linear Equations in Linear Algebra

Presentation transcript:

Sketching high-performance implementations of bitstream programs. Armando Solar-Lezama, Rastislav Bodik UC Berkeley

Bitstream programs bitstream programs: a growing domain –crypto, compression, NSA/BitTwiddle, coding in general. bitstream algorithms easy to state –e.g., “Drop every third bit in the bit stream.” but bitstream programs hard to implement –because efficient bit manipulations are hard to code Can only work with word-size arrays of bits Exponentially many ways of accomplishing the same task SLOW O(n) FAST O(log n)

Current development process A collaborative experience: 1.domain expert writes a high-level algorithm, in C/Fortran, 2.system expert tunes its performance, often drastically turning the algorithm into ugly low-level code. Now, if the original algorithm needs to be modified: –introduce changes into optimized code (error-prone), or –rewrite the algorithm and repeat tuning (time-consuming).

Our development process A (better) collaborative experience: 1.domain expert writes a clean algorithm, in a clean DSL, 2.system expert optimizes the implementation by writing a (reusable) transformation specification in a TSL. If the original algorithm is modified, then –simply reapply the transformation specification. The transformation spec: think of it as … –sequence of program edits, or –high level description of the desired implementation, or –an optimizer tailored to the algorithm

Challenges, and our solution (overview) 1.How to make the transformation reusable? –Transformation should apply after the algorithm changes. 2.How to simplify transformation development? –Should be much easier than editing the program by hand. –Should not allow you to introduce bugs Our solution: a transformation sketch –system expert sketches the transformation, –details filled in automatically. ? functionalityfull implementation +  DSL: StreamItTSL: transf. specTSL with sketching

DSL: StreamIt High-level bitstream algorithms written in StreamIt. Example: “drop every third bit”. filter dropThird { Work push 2 pop 3 { for (int i=0; i<3; ++i) { x = peek(i); if (i<2) push(x); pop(); }  3 2 consumes a 3-bit chunk of input; produces a 2-bit of output. xyzxyz xyxy x =

Naïve Compilation of BitStream Program The example filter operates on 3- and 2-bit chunks. –unsuitable for the target machine’s instructions So, how to generate code for the example filter? –transform the filter to operate on word-size chunks! –Decompose new filters into filters corresponding to machine instructions –Such low-level program can be expressed in StreamIt!

The development strategy We offer several methods for “lowering” filters into low-level form (i.e., for implementing them with target code): 1.Manual-transformation: M(P) For any filter P, the system expert could manually write a TSL spec to make P low level. 2.Auto-transformation: N(P) For any filter P, the system knows how to produce a naïve lowering TSL spec N. N is thus a simple code generator for P

Transforming into low-level form (1) unroll 4x to make input/output a multiple of W=4 bits

Transforming into low-level form (2) decompose into filters operating on W=4 bits of input rrobin 4,4, or

Transforming into low-level form (3) decompose into filters producing W=4 bits of output. rrobin 4,4, or rrobin 4,4, or duplicate cat

Implementing a basic filter (4) decompose word-size filter into available instructions duplicate or t1 = in AND 1100 t2 = in SHIFTL 1 t3 = t2 AND 0010 out = t1 OR t3 in

The development strategy We offer several methods for “lowering” filters into low-level form (i.e., for implementing them with target code): 1.Manual-transformation: M(P) For any filter P, the filter system expert could manually bring it to low level form. 2.Auto-transformation: N(P) For any filter P, the system knows how to produce a naïve lowering TSL spec N. N is thus a simple code generator for P 3.Half-way transformation. N(T(P)) System expert provides a spec T that transforms P so that N generates better code. If written well, T brings P closer to low-level code (gives it a good structure) so that N can do a perfect job generating the code.

TSL: Transformation spec language TSL specification: –describes how to transform a filter into a semantically equivalent one with a different structure. –so that you don’t need to transform filters manually. Example. TSL spec for the transformation you just saw: f = Unroll[4](dropThird); f = ColSplit[4](f); f.f_1 = RowSplit[4](f.f_1); f.f_2 = RowSplit[4](f.f_2); f.f_3 = RowSplit[4](f.f_3); TSL Spec can also be thought of as an implementation specification. –Every statement in the above specification provides the system more details about the implementation. –This simple specification can be generated automatically –better TSL specs written by system expert, with help of sketches

Half way transformations The expert guides lowering by imparting structure to the filter –If the system expert has an algorithm in mind to implement a particular filter, the expert can decompose the filter into a sequence of filters, each one implementing one step of the algorithm. So, specifying an efficient bit manipulation often boils down to specifying a decomposition of the filter’s matrix. –filter = filter1 x filter2 This can be done hierarchically, –Detail is only added where necessary.

F.F_1 Half way transformations-example System Expert provides high level decomposition System Takes care of Lowering F.F_1, F.F_2 and F.F_3 Correctness is guaranteed as long as F = [F.F_3]x[F.F_2]x[F.F_1] Fully Specifying F.F._1, F.F_2 and F.F_3 is still too difficult. We would like to be able to sketch them F F.F_2 F.F_3 F.F_1 F.F_2 F.F_3 Half way transformation to specify FAST bit shifting algorithm

The development strategy We offer several methods for “lowering” filters into low-level form (i.e., for implementing them with target code): 1.Manual-transformation: M(P) For any filter P, the filter system expert could manually bring it to low level form. 2.Auto-transformation: N(P) For any filter P, the system knows how to produce a naïve lowering TSL spec N. N is thus a simple code generator for P 3.Half-way transformation. N(T(P)) System expert provides a spec T that transforms P so that N generates better code. If written well, T brings P closer to low-level code (gives it a good structure) so that N can do a perfect job generating the code. 4.Half-specified transformation. N(s[T](P)) System expert provides a sketch of a transformation T. The compiler completes the sketch by requiring that the completed transformation (s[T]) produces a filter semantically equivalent to P.

Sketching a transformation A transformation sketch: –Specifies the number of stages in the decomposition –Gives constraints on terms of the decomposition –System derives a decomposition satisfying the constraints and semantically equivalent to original filter Example. A fragment of TSL for FAST compaction: filter = [shift(1:16 by 0 || 1)] x [shift(1:16 by 0 || 2)] x [shift(1:16 by 0 || 4)] functionality sketch ???????????????? ???????????????? ???????????????? ???????????????? full implementation

Sketching a transformation The PermutFactor function specifies each step as a list of constraints –PermutFactor[ [constrList] [constrList] …] Constraints of 3 types: –Type 1: specific shift amount shift( bitList by x) –Type 2: undetermined shift amount shift( bitList by ?) –Type 3: limited choice of shift amount shift( bitList by a || b || …) System ignores constraints on discarded bits

Sketching a transformation-example First we must unroll to get a multiple of the word size –Unroll[16](filter) We want to use the fast algorithm for compacting the bits in each word The second word can only pack half the bits at a time, since half will go to one word and half to another

Sketching a transformation-example If we first pack all bits within a word, we can better exploit the parallelism afforded by the fast algorithm. We want to sketch this transformation.

Sketching a transformation-example Sketch of the transformation –PermutFactor[[][]](filter) Sketch of the transformation –PermutFactor[[] [shift(1:16 by ?), shift(17:32 by ?), shift(33:48 by ?)] ](filter) Sketch of the transformation –PermutFactor[[shift(1:2 by 0), shift(17:18 by 0), shift(33:34 by 0)] [shift(1:16 by ?), shift(17:32 by ?), shift(33:48 by ?)] ](filter) After this is done, we can proceed hierarchically

Sketching a transformation-example We can select a specific part of the algorithm and add more detail –Specification of fast bit packing algorithm within each word PermutFactor[ [shift(1:16 by 0 || 1)], [shift(1:16 by 0 || 2)], [shift(1:16 by 0 || 4)] ]( F_i );

Complete TSL spec for FAST WSIZE=16; subsequence = Unroll[WSIZE](subsequence); subsequence = PermutFactor[ [shift(1:2 by 0), shift(17:18 by 0), shift(33:34 by 0)], [shift(1:16 by ?), shift(17:32 by ?), shift(33:48 by ?)] ] ( subsequence ); subsequence.subsequence_1=DiagSplit[WSIZE](subsequence); for(i=0; i<3; ++i) { bsequence.subsequence_1.filter(i) = PermutFactor[ [shift(1:16 by 0 || 1)], [shift(1:16 by 0 || 2)], [shift(1:16 by 0 || 4)] ]( subsequence.subsequence_1.filter(i) ); } Size: 13 lines

Reusability of the TSL specs A TSL spec is reusable under a given change to the program if the changed program can still be profitably transformed with the TSL spec. Experiment: 1.Start with the program P = “drop every third bit”, 2.Write the FAST spec for it, 3.Modify P to “drop the second bit in each three bits.” no change in the original TSL spec needed. 4.Modify P to “drop every fourth bit”. Only minor change in the TSL spec needed.

Performance Gain: bit compaction On a pentium III processor running at 1.5 GHz, the unoptimized code took seconds to process.127 Mb of data 100 times. On the same machine, optimized version took seconds, a performance gain of 39%.

Performance Gain: Permutation from DES On the same machine the unoptimized code took.102 seconds to process the same amount of data. After a simple TSL specification DESIP=PermutFactor[ [ shift(1:2:31 by -33), shift(2:2:32 by 0), shift(33:2:63 by 0), shift(34:2:64 by 33) ],[]](DESIP); Optimized version took.073 seconds, a speedup of 28%

Conclusions Our system allows the separation of the algorithm specification from performance tuning through the use of TSL System expert only needs to specify high level transformations, system can take care of the details Sketching the transformations makes them reusable and easier to write.