GUSTO: General architecture design Utility and Synthesis Tool for Optimization
Qualifying Exam for Ali Irturk, University of California, San Diego

Thesis Objective
- Design of a novel tool, GUSTO, for automatic generation and optimization of application specific matrix computation architectures from a given Matlab algorithm;
- Showing the effectiveness of my tool by rapid architectural production of various signal processing, computer vision and financial computation algorithms.

Motivation
- Matrix computations lie at the heart of most scientific computational tasks: wireless communication, financial computation, computer vision.
- Matrix inversion is required in:
  - equalization algorithms, to remove the effect of the channel on the signal,
  - the mean variance framework, to solve a constrained maximization problem,
  - the optical flow computation algorithm, for motion estimation.

Motivation
- There are a number of tools that translate Matlab algorithms to a hardware description language;
- However, we believe that the majority of these tools take the wrong approach;
- We take a more focused approach, developing a tool that specifically targets matrix computation algorithms.

Computing Platforms
- ASICs: exceptional performance; long time to market; substantial costs.
- DSPs, GPUs, CELL BE: ease of development; fast time to market; low performance.
- FPGAs: ease of development; fast time to market; ASIC-like performance.

Field Programmable Gate Arrays
- FPGAs are ideal platforms: high processing power, flexibility, low non-recurring engineering (NRE) cost.
- If used properly, these features enhance performance and throughput significantly.
- BUT! Few tools exist that can aid the designer with the many system, architectural and logic design choices.

GUSTO: General architecture design Utility and Synthesis Tool for Optimization
An easy-to-use tool for more efficient design space exploration and development.
(Block diagram: the algorithm, matrix dimensions, bit width, resource allocation and mode are given to GUSTO, which produces the required HDL files.)
GUSTO: An Automatic Generation and Optimization Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, under review, Transactions on Embedded Computing Systems.

Outline
- Motivation
- GUSTO: Design Tool and Methodology
- Applications
  - Matrix Decomposition Methods
  - Matrix Inversion Methods
  - Mean Variance Framework for Optimal Asset Allocation
- Future Work
- Publications

GUSTO Design Flow
(Flow diagram.) Inputs: the algorithm, matrix dimensions, type and number of arithmetic resources, and data representation. GUSTO performs algorithm analysis, instruction generation, resource allocation and error analysis, then generates the architecture from its design library of arithmetic units. Mode 1 produces a general purpose architecture (dynamic scheduling); Mode 2 applies resource trimming and scheduling to produce an application specific architecture (static scheduling). Both are evaluated with Xilinx and Mentor Graphics tools for area, latency, throughput and simulation results.

GUSTO Modes
- Mode 1 of GUSTO generates a general purpose architecture and its datapath (instruction controller, memory controller, and arithmetic units built from adders and multipliers). It can be used to explore other algorithms, but does not lead to high-performance results.
- Mode 2 of GUSTO creates a scheduled, static, application specific architecture. It simulates the Mode 1 architecture to collect scheduling information and define the usage of resources.

Matrix Multiplication Core Design
(GUSTO design flow diagram; first step: Algorithm Analysis.)

Matrix Multiplication Core Design: Algorithm Analysis
Explicit loop form:
for i = 1:n
  for j = 1:n
    for k = 1:n
      Temp = A(i,k) * B(k,j);
      C(i,j) = C(i,j) + Temp;
    end
  end
end
Built-in function form: C = A * B

Matrix Multiplication Core Design
(GUSTO design flow diagram; current step: Instruction Generation.)

Matrix Multiplication Core Design: Instruction Generation
Each statement becomes a three-operand instruction of the form [Operation, Destination, Operand 1, Operand 2]:
C(1,1) = A(1,1) * B(1,1)   ->  [mul, C(1,1), A(1,1), B(1,1)]
Temp   = A(1,2) * B(2,1)   ->  [mul, temp, A(1,2), B(2,1)]
C(1,1) = C(1,1) + Temp     ->  [add, C(1,1), C(1,1), temp]
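As an illustration of this step, a minimal MATLAB sketch that unrolls the three nested loops into such an instruction list; the function name and cell-array encoding are illustrative, not GUSTO's internal data structures. Calling unroll_matmul(2) reproduces the three instructions shown above for C(1,1).
function instrs = unroll_matmul(n)
% Unroll C = A*B for n x n matrices into three-operand instructions in the
% slide's format: {operation, destination, operand 1, operand 2}.
instrs = cell(0, 4);
for i = 1:n
  for j = 1:n
    for k = 1:n
      Cij = sprintf('C(%d,%d)', i, j);
      Aik = sprintf('A(%d,%d)', i, k);
      Bkj = sprintf('B(%d,%d)', k, j);
      if k == 1                                      % first product initializes C(i,j)
        instrs(end+1, :) = {'mul', Cij, Aik, Bkj};
      else                                           % later products go through a temporary
        instrs(end+1, :) = {'mul', 'temp', Aik, Bkj};
        instrs(end+1, :) = {'add', Cij, Cij, 'temp'};
      end
    end
  end
end
end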

Matrix Multiplication Core Design
(Architecture diagram: the generated instructions drive the instruction controller, the memory controller and the arithmetic units with their adders and multipliers.)

Matrix Multiplication Core Design
(GUSTO design flow diagram; current step: Resource Allocation.)

Matrix Multiplication Core Design: Resource Allocation
(Architecture diagram: the user-specified number of arithmetic units, adders and multipliers determines the generated datapath.)

Matrix Multiplication Core Design
(GUSTO design flow diagram; current step: Error Analysis.)

Matrix Multiplication Core Design: Error Analysis
GUSTO's fixed point arithmetic results (using variable bit widths) are compared against MATLAB's floating point arithmetic results (single/double precision) on user defined input data. Error analysis metrics:
1) Mean error
2) Peak error
3) Standard deviation of error
4) Mean percentage error
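A minimal MATLAB sketch of these four metrics, assuming fl holds the floating point reference results and fx the fixed point results of the generated architecture on the same input data (both variables are filled with toy values here):
fl  = randn(4);                              % floating point reference (example data)
fx  = fl + 2^-12 * randn(4);                 % fixed point results (example: quantization-like noise)
err = fl(:) - fx(:);
mean_err = mean(err);                        % 1) mean error
peak_err = max(abs(err));                    % 2) peak error
std_err  = std(err);                         % 3) standard deviation of error
mpe      = mean(abs(err ./ fl(:))) * 100;    % 4) mean percentage error (as a percentage)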

Matrix Multiplication Core Design: Error Analysis (results figure)

Matrix Multiplication Core Design
(GUSTO design flow diagram; current step: Architecture Generation, Mode 1.)

Matrix Multiplication Core Design: Architecture Generation
General Purpose Architecture (Mode 1): dynamic scheduling, dynamic memory assignments, full connectivity between the instruction controller, memory controller and arithmetic units (adders, multipliers).

Matrix Multiplication Core Design
(GUSTO design flow diagram; current step: Architecture Generation, Mode 2.)

Matrix Multiplication Core Design: Architecture Generation
Application Specific Architecture (Mode 2): static scheduling, static memory assignments, only the required connectivity between the instruction controller, memory controller and arithmetic units.

GUSTO Trimming Feature
(Connectivity diagrams for arithmetic units A and B and the memory.) During the simulation runs, GUSTO records which source outputs (Out_A, Out_B, Out_mem1, Out_mem2) actually drive each input port (In_A1, In_A2, In_B1, In_B2, In_mem1); connections that are never exercised are trimmed away in the application specific architecture.
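A small MATLAB sketch of the idea, assuming the simulation produces a log of which source drove each input port; the port names follow the figure, while the sim_log structure is purely illustrative:
% After simulating the scheduled instructions, keep only the source -> input-port
% connections that were actually exercised; everything else can be trimmed.
sources = {'Out_A', 'Out_B', 'Out_mem1', 'Out_mem2'};
inputs  = {'In_A1', 'In_A2', 'In_B1', 'In_B2', 'In_mem1'};
used    = false(numel(inputs), numel(sources));        % connectivity matrix

% Hypothetical simulation log: each entry is {input port, source that drove it}.
sim_log = { {'In_A1','Out_mem1'}, {'In_A2','Out_mem2'}, {'In_A1','Out_A'} };

for k = 1:numel(sim_log)
    used(strcmp(sim_log{k}{1}, inputs), strcmp(sim_log{k}{2}, sources)) = true;
end
% Entries of 'used' that stay false correspond to wires and multiplexer inputs
% that the Mode 2 (application specific) architecture does not need.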

Matrix Multiplication Core Results
(Area in slices and throughput chart for three single-core designs, each with its own instruction and memory controllers: Design 1 with 4 adders and 4 multipliers, Design 2 with 2 adders and 4 multipliers, Design 3 with 2 adders and 2 multipliers.)
Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).

Hierarchical Datapaths
- Unfortunately, this flat organization of the architecture does not provide a complete design space for exploring better design alternatives.
- It also does not scale well with the complexity of the algorithms: the number of instructions, the number of functional units, and the internal storage and communication all grow, limiting the optimization performance.
- To overcome these issues, we incorporate hierarchical datapaths and heterogeneous architecture generation options into GUSTO.

Matrix Multiplication Core Results: Hierarchical Designs
(Area in slices and throughput chart.) Designs 4-9 are built hierarchically from small cores, each containing its own instruction controller, memory controller, adders and multipliers: Design 4 uses a single core of type A_1 and Design 5 uses 16 A_1 cores; Design 6 uses a single A_2 core and Design 7 uses 8 A_2 cores; Design 8 uses a single A_4 core and Design 9 uses 4 A_4 cores.
Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).

Matrix Multiplication Core Results: Heterogeneous Designs
(Area in slices and throughput chart for Designs 10-12, which combine different core types, e.g. A_1, A_2 and A_4 cores, in one architecture.)
Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).

Outline
- Motivation
- GUSTO: Design Tool and Methodology
- Applications
  - Matrix Decomposition Methods
  - Matrix Inversion Methods
  - Mean Variance Framework for Optimal Asset Allocation
- Future Work
- Publications

Matrix Decompositions: QR, LU and Cholesky
- QR: the given matrix A is factored as A = Q R, where Q is an orthogonal matrix and R is an upper triangular matrix.
- LU: A = L U, where L is a lower triangular matrix and U is an upper triangular matrix.
- Cholesky: A = G G^T, where G is the unique lower triangular matrix (the Cholesky triangle) and G^T is its transpose.

Matrix Inversion
The inverse A^-1 of a given matrix A satisfies A A^-1 = I, the identity matrix. Full matrix inversion is costly!
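For reference, a MATLAB sketch of inversion through QR decomposition using built-in functions; the example matrix is arbitrary and this is only the mathematical recipe, not the generated fixed point hardware:
A      = [4 2 1; 2 5 3; 1 3 6];     % example matrix
[Q, R] = qr(A);
Ainv   = R \ Q';                    % A = Q*R  =>  inv(A) = inv(R)*Q'
max(max(abs(Ainv*A - eye(3))))      % residual check, close to machine precision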

Results: Inflection Point Analysis
(Chart of results versus matrix size.)
Automatic Generation of Decomposition based Matrix Inversion Architectures, Ali Irturk, Bridget Benson and Ryan Kastner, In Proceedings of the IEEE International Conference on Field-Programmable Technology (ICFPT), December 2008.

Results: Inflection Point Analysis
Parameters explored: serial and parallel implementations; bit widths of 16, 32 and 64 bits; matrix sizes from 2 × 2 up to 8 × 8.

Results: Inflection Point Analysis for the Decomposition Methods (chart)

Results: Inflection Point Analysis for Matrix Inversion (chart)
An FPGA Design Space Exploration Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, In Proceedings of the IEEE Symposium on Application Specific Processors (SASP), June 2008.

Results: Finding the Optimal Hardware for the Decomposition Methods
Decrease in area from the general purpose architecture (Mode 1) to the application specific architecture (Mode 2): QR 94%, LU 83%, Cholesky 86%.
Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC 2009), April 2009.

Results: Finding the Optimal Hardware for the Decomposition Methods
Increase in throughput from the general purpose architecture (Mode 1) to the application specific architecture (Mode 2): QR 68%, LU 16%, Cholesky 14%.

Results: Finding the Optimal Hardware for Matrix Inversion (using QR)
Moving from Mode 1 to Mode 2 gives an average 59% decrease in area and a 3× increase in throughput.

Results: Architectural Design Alternatives for Matrix Inversion (charts)

Results: Comparison with Previously Published Work on Matrix Inversion
(Table comparing our analytic implementations A, B and C with Eilert et al., and our QR based method with Edman et al. and Karkooti et al., in terms of bit width, data type (fixed or floating point), device type (Virtex 2 / Virtex 4), slices, DSP48s, BRAMs and throughput.)
J. Eilert, D. Wu, D. Liu, "Efficient Complex Matrix Inversion for MIMO Software Defined Radio", IEEE International Symposium on Circuits and Systems (2007).
F. Edman, V. Öwall, "A Scalable Pipelined Complex Valued Matrix Inversion Architecture", IEEE International Symposium on Circuits and Systems (2005).
M. Karkooti, J.R. Cavallaro, C. Dick, "FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm", Asilomar Conference on Signals, Systems and Computers (2005).

Results: Comparison with Previously Published Work on Matrix Inversion
(Table comparing our QR, LU and Cholesky based inversion cores with Edman et al. and Karkooti et al. in terms of bit width, data type, device type, slices, DSP48s, BRAMs and throughput.)
F. Edman, V. Öwall, "A Scalable Pipelined Complex Valued Matrix Inversion Architecture", IEEE International Symposium on Circuits and Systems (2005).
M. Karkooti, J.R. Cavallaro, C. Dick, "FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm", Asilomar Conference on Signals, Systems and Computers (2005).

Outline
- Motivation
- GUSTO: Design Tool and Methodology
- Applications
  - Matrix Decomposition Methods
  - Matrix Inversion Methods
  - Mean Variance Framework for Optimal Asset Allocation
- Future Work
- Publications

Asset Allocation
- Asset allocation is the core part of portfolio management.
- An investor can minimize the risk of loss and maximize the return of a portfolio by diversifying its assets.
- Determining the best allocation requires solving a constrained optimization problem: Markowitz's mean variance framework.

Asset Allocation
Increasing the number of assets provides significantly more efficient allocations.

High Performance Computing
- Higher numbers of assets and more complex diversification require significant computation.
- Adding FPGAs to existing high performance computers can boost application performance and design flexibility.
- Related work on FPGA-accelerated finance: Zhang et al. and Morris et al. (single option pricing); Kaganov et al. (credit derivative pricing); Thomas et al. (interest rate and value-at-risk simulations).
- We are the first to propose hardware acceleration of the mean variance framework using FPGAs.
FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.

The Mean Variance Framework
(Overview diagram.) The framework consists of the computation of the required inputs, the expected prices E{M} and expected covariance Cov{M} (5 phases); the computation of the efficient frontier, expected return versus standard deviation (risk) over the candidate allocations; and the computation of the optimal allocation, the highest utility portfolio on the frontier.

Identification of Bottlenecks
(Runtime chart with # of Portfolios = 100 and # of Scenarios = 100,000.)

The Mean Variance Framework
(Overview diagram; focus: the computation of the optimal allocation, the highest utility portfolio.)

Hardware Architecture for MVF Step 2
(Block diagram.) A random number generator feeds a Monte Carlo block; for each candidate allocation α = [α_1, α_2, ..., α_Ns] the objective value ψ_α = α × M is computed from the market vector M, which requires N_s multiplications, and the question "Is this allocation the best?" is answered in the expected return / standard deviation (risk) plane.
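A behavioral MATLAB sketch of what this step computes, assuming Ns securities, Np candidate allocations and Nm Monte Carlo scenarios of the market vector M; the exponential utility and certainty-equivalent index follow the HARA slides in the backup material, and all sizes and values here are toy assumptions:
Ns = 5; Np = 50; Nm = 1e4; zeta = 10;            % securities, portfolios, scenarios
mu  = 0.05 + 0.10*rand(Ns, 1);                   % expected market vector E{M} (toy values)
Sig = cov(randn(10*Ns, Ns));                     % toy covariance Cov{M}
M   = mu + chol(Sig, 'lower') * randn(Ns, Nm);   % Ns x Nm market scenarios

alpha = rand(Ns, Np);                            % candidate allocations (columns)
alpha = alpha ./ sum(alpha, 1);                  % normalize to a unit budget

psi = alpha' * M;                                % objective psi_alpha = alpha' * M (Np x Nm)
u   = -exp(-psi / zeta);                         % exponential (HARA) utility
CE  = -zeta * log(-mean(u, 2));                  % certainty-equivalent satisfaction index
[~, best] = max(CE);                             % "Is this allocation the best?"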

Hardware Architecture for MVF Step 2
(Datapath diagrams for the objective value ψ.)

Hardware Architecture for MVF Step 2
Exploitable parallelism: N_s parallel multipliers, N_m parallel Monte Carlo blocks, N_m parallel utility calculation blocks, and N_p parallel satisfaction function calculation blocks.

Results: Mean Variance Framework Step 2
(Speedup chart for 100,000 scenarios and 50 portfolios, comparing 10 satisfaction blocks built from one Monte Carlo block with 10 multipliers and 10 utility function calculator blocks against one with 20 multipliers and 20 utility function calculator blocks.)
FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.

Outline
- Motivation
- GUSTO: Design Tool and Methodology
- Applications
  - Matrix Decomposition Methods
  - Matrix Inversion Methods
  - Mean Variance Framework for Optimal Asset Allocation
- Future Work
- Publications

Thesis Outline and Future Work
1. Introduction
2. Comparison of FPGAs, GPUs and CELLs
   - Possible journal paper,
   - GPU implementation of Face Recognition for a journal paper.
3. GUSTO Fundamentals
4. Super GUSTO
   - Journal paper on hierarchical design and heterogeneous core design,
   - Employing different instruction scheduling algorithms and analyzing their effects on the implemented architectures.
5. Small code applications of GUSTO
   - Matrix decomposition cores (QR, LU, Cholesky) with different architectural choices,
   - Matrix inversion cores (analytic, QR, LU, Cholesky) with different architectural choices,
   - Design of adaptive weight calculation cores.
6. Large code applications of GUSTO
   - Mean Variance Framework Step 2 implementation,
   - Short Preamble Processing Unit implementation,
   - Optical flow computation algorithm implementation.
7. Conclusions
8. Future Work
9. References

Outline
- Motivation
- GUSTO: Design Tool and Methodology
- Applications
  - Matrix Decomposition Methods
  - Matrix Inversion Methods
  - Mean Variance Framework for Optimal Asset Allocation
- Future Work
- Publications

Publications
[15] An Optimized Algorithm for Leakage Power Reduction of Embedded Memories on FPGAs Through Location Assignments, Shahnam Mirzaei, Yan Meng, Arash Arfaee, Ali Irturk, Timothy Sherwood, Ryan Kastner, working paper for IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[14] Xquasher: A Tool for Efficient Computation of Multiple Linear Expressions, Arash Arfaee, Ali Irturk, Ryan Kastner, Farzan Fallah, under review, Design Automation Conference (DAC 2009), July 2009.
[13] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009), July 2009.
[12] Energy Benefits of Reconfigurable Hardware for use in Underwater Sensor Nets, Bridget Benson, Ali Irturk, Junguk Cho, Ryan Kastner, under review, 16th Reconfigurable Architectures Workshop (RAW 2009), May 2009.
[11] Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC 2009), April 2009.
[10] FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.
[9] GUSTO: An Automatic Generation and Optimization Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, under review (2nd round of reviews), Transactions on Embedded Computing Systems.
[8] Automatic Generation of Decomposition based Matrix Inversion Architectures, Ali Irturk, Bridget Benson and Ryan Kastner, In Proceedings of the IEEE International Conference on Field-Programmable Technology (ICFPT), December 2008.

Publications
[7] Survey of Hardware Platforms for an Energy Efficient Implementation of Matching Pursuits Algorithm for Shallow Water Networks, Bridget Benson, Ali Irturk, Junguk Cho, and Ryan Kastner, In Proceedings of the Third ACM International Workshop on UnderWater Networks (WUWNet), in conjunction with ACM MobiCom 2008, September 2008.
[6] Design Space Exploration of a Cooperative MIMO Receiver for Reconfigurable Architectures, Shahnam Mirzaei, Ali Irturk, Ryan Kastner, Brad T. Weals and Richard E. Cagley, In Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), July 2008.
[5] An FPGA Design Space Exploration Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, In Proceedings of the IEEE Symposium on Application Specific Processors (SASP), June 2008.
[4] An Optimization Methodology for Matrix Computation Architectures, Ali Irturk, Bridget Benson, and Ryan Kastner, unsubmitted manuscript.
[3] FPGA Implementation of Adaptive Weight Calculation Core Using QRD-RLS Algorithm, Ali Irturk, Shahnam Mirzaei and Ryan Kastner, unsubmitted manuscript.
[2] An Efficient FPGA Implementation of Scalable Matrix Inversion Core using QR Decomposition, Ali Irturk, Shahnam Mirzaei and Ryan Kastner, unsubmitted manuscript.
[1] Implementation of QR Decomposition Algorithms using FPGAs, Ali Irturk, MS Thesis, Department of Electrical and Computer Engineering, University of California, Santa Barbara, June. Advisor: Ryan Kastner.

Thank You

Matrix Inversion
Use decomposition methods for analytic simplicity and computational convenience.
- Decomposition methods: QR, LU, Cholesky, etc.
- Analytic method.

Matrix Inversion using QR Decomposition
The given matrix A is decomposed as A = Q R, with Q an orthogonal matrix and R an upper triangular matrix, so that A^-1 = R^-1 Q^T.

Matrix Inversion using QR Decomposition
Three different QR decomposition methods: Gram-Schmidt orthonormalization, Givens rotations, Householder reflections.
(Datapath diagram: columns of the matrix are fetched from memory, and a Euclidean norm unit is used in the orthonormalization; R(i,j) is the entry at the intersection of the i-th row with the j-th column.)
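For reference, a compact MATLAB sketch of classical Gram-Schmidt orthonormalization, one of the three methods listed; it only illustrates the arithmetic involved (projections and a Euclidean norm) and is not the fixed point core itself:
function [Q, R] = gram_schmidt_qr(A)
% Classical Gram-Schmidt QR: A = Q*R with Q orthogonal, R upper triangular.
[n, m] = size(A);
Q = zeros(n, m);  R = zeros(m, m);
for j = 1:m
    v = A(:, j);
    for i = 1:j-1
        R(i, j) = Q(:, i)' * A(:, j);     % projection coefficient
        v = v - R(i, j) * Q(:, i);        % remove the component along q_i
    end
    R(j, j) = norm(v);                    % Euclidean norm
    Q(:, j) = v / R(j, j);
end
end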

Matrix Inversion using the Analytic Method
The analytic method uses the adjoint matrix Adj(A) and the determinant of the given matrix: A^-1 = Adj(A) / det(A). (Diagram includes the determinant of a 2 × 2 matrix as the basic building block.)

Adjoint Matrix
(Diagram of the adjoint matrix calculation.) Each cofactor is computed by a cofactor calculation core; for a 4 × 4 matrix, cofactor C_11 is built from products of the entries A_22, A_23, A_24 with the 2 × 2 terms A_33 A_44 - A_34 A_43, A_32 A_44 - A_34 A_42 and A_32 A_43 - A_33 A_42.
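A direct MATLAB sketch of the analytic method, building the cofactor matrix from the determinants of the minors exactly as A^-1 = Adj(A)/det(A) prescribes; MATLAB's det is used for the minors purely for brevity, so this is a software reference rather than the cofactor calculation core itself:
function Ainv = analytic_inverse(A)
% Analytic inversion: inv(A) = adj(A) / det(A), with adj(A) built from cofactors.
n = size(A, 1);
C = zeros(n);                               % cofactor matrix
for i = 1:n
    for j = 1:n
        minor = A;  minor(i, :) = [];  minor(:, j) = [];
        C(i, j) = (-1)^(i+j) * det(minor);  % cofactor C_ij
    end
end
Ainv = C' / det(A);                         % adjugate = C', then scale by 1/det(A)
end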

Different Implementations of the Analytic Approach
(Diagrams of Implementations A, B and C, built from the cofactor calculation core.)

Matrix Inversion using LU Decomposition
The given matrix A is decomposed as A = L U, with L a lower triangular matrix and U an upper triangular matrix.

Matrix Inversion using LU Decomposition
(Derivation slides: the inverse is obtained from the triangular factors by forward and back substitution.)
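The corresponding software recipe for the LU route, written as a MATLAB sketch: factor once, then recover the inverse column by column with a forward and a back substitution (the pivoting returned by lu is included for numerical sanity; this is not the generated architecture):
% Matrix inversion via LU: A = L*U (with row pivoting P), so A*x_k = e_k is
% solved for each unit vector e_k by two triangular solves.
A = [4 2 1; 2 5 3; 1 3 6];       % example matrix
n = size(A, 1);
[L, U, P] = lu(A);
I    = eye(n);
Ainv = zeros(n);
for k = 1:n
    y = L \ (P * I(:, k));       % forward substitution
    Ainv(:, k) = U \ y;          % back substitution
end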

Matrix Inversion using Cholesky Decomposition
The given matrix A is decomposed as A = G G^T, where G is the unique lower triangular matrix (the Cholesky triangle) and G^T is its transpose.

Matrix Inversion using Cholesky Decomposition
(Derivation slides.)

The Mean Variance Framework
(Overview diagram: computation of the required inputs E{M} and Cov{M} in 5 phases, computation of the efficient frontier, and computation of the optimal allocation, the highest utility portfolio.)

The Mean Variance Framework
(Overview diagram; focus: the computation of the required inputs.)

Computation of Required Inputs
Known data: publicly available data (prices, covariance), the number of securities, the reference allocation, the investment horizon, the investor objective, the time the investment is made, and the estimation interval. The five phases:
1) Detect the invariants,
2) Determine the distribution of the invariants,
3) Project the invariants to the investment horizon,
4) Compute the expected return and the covariance matrix,
5) Compute the expected return and the covariance matrix of the market vector.

Computation of Required Inputs
Investor objectives: absolute wealth, relative wealth, net profits. In each case the objective value is ψ_α = α × M for an allocation α = [α_1, α_2, ..., α_Ns] and market vector M.

Computation of Required Inputs, Phase 5
The objective value is ψ_α = α × M, where the market vector M is a transformation of the market prices at the investment horizon: M ≡ a + B P_(T+τ).
Standard investor objectives:
- Absolute wealth: specific form ψ_α = W_(T+τ)(α); generalized form a ≡ 0, B ≡ I_N, so ψ_α = α' P_(T+τ).
- Relative wealth: specific form ψ_α = W_(T+τ)(α) - γ(α) W_(T+τ)(β); generalized form a ≡ 0, B ≡ K, so ψ_α = α' K P_(T+τ).
- Net profits: specific form ψ_α = W_(T+τ)(α) - w_T(α); generalized form a ≡ -p_T, B ≡ I_N, so ψ_α = α' (P_(T+τ) - p_T).

Computation of Required Inputs
- Each phase requires assumptions: which invariants, their distribution, the estimation interval, ...
- Our assumptions: compounded returns of stocks as the market invariants, 3 years of known data, a 1 week estimation interval, and a 1 year investment horizon.
- Phase 5 is a good candidate for hardware implementation.

The Mean Variance Framework
(Overview diagram; focus: the computation of the efficient frontier.)

MVF Step 1: Computation of the Efficient Frontier
Inputs: E{M}, Cov{M}, current prices, number of portfolios, number of securities, budget. For each risk level v the efficient allocation is
α(v) ≡ argmax α' E{M}, v ≥ 0, subject to α ∈ constraints and α' Cov{M} α = v,
tracing out the efficient frontier of expected return E{ψ_α} versus risk Var{ψ_α}.

MVF Step 1: Computation of the Efficient Frontier
α(v) ≡ argmax α' E{M}, v ≥ 0, subject to α ∈ constraints and α' Cov{M} α = v.
(Risk-return chart showing the efficient frontier and the unachievable risk-return space; an investor does NOT want to be in the region away from the frontier!)

The Mean Variance Framework
(Overview diagram; focus: the computation of the optimal allocation.)

MVF Step 2: Computing the Optimal Allocation
Inputs: current prices, number of securities, number of portfolios, number of scenarios, satisfaction index. For each candidate allocation the question "Is this allocation the best?" is answered in the expected return / standard deviation (risk) plane, and the highest utility portfolio is selected as the optimal allocation.

MVF Step 2: Computing the Optimal Allocation
- Satisfaction indices represent all the features of a given allocation with one single number and quantify the investor's satisfaction.
- Classes of satisfaction indices: certainty-equivalent, quantile, and coherent indices.
- Certainty-equivalent satisfaction indices are represented by the investor's utility function and objective, u(ψ); we use the Hyperbolic Absolute Risk Aversion (HARA) class of utility functions.
- Utility functions: exponential, quadratic, power, logarithmic, linear.

MVF Step 2: Computing the Optimal Allocation
The HARA class of utility functions consists of specific forms of the Arrow-Pratt risk aversion model, defined by A(ψ) = ψ / (γψ² + ζψ + η), where we take η = 0.
- Exponential utility (ζ > 0 and γ ≡ 0): u(ψ) = -e^(-(1/ζ)ψ)
- Quadratic utility (ζ > 0 and γ ≡ -1): u(ψ) = ψ - (1/2ζ)ψ²
- Power utility (ζ ≡ 0 and γ ≥ 1): u(ψ) = ψ^(1 - 1/γ)
- Logarithmic utility (limit γ → 1): u(ψ) = ln(ψ)
- Linear utility (limit γ → ∞): u(ψ) = ψ
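The same utility functions written as MATLAB function handles, a convenient form for sweeping ζ and γ in software before fixing one of them in hardware; the parameter values are arbitrary examples:
zeta = 10; gamma = 2;                           % example risk-aversion parameters
u_exp   = @(psi) -exp(-(1/zeta) * psi);         % exponential utility (zeta>0, gamma=0)
u_quad  = @(psi) psi - (1/(2*zeta)) * psi.^2;   % quadratic utility   (zeta>0, gamma=-1)
u_power = @(psi) psi.^(1 - 1/gamma);            % power utility       (zeta=0, gamma>=1)
u_log   = @(psi) log(psi);                      % logarithmic utility (limit gamma -> 1)
u_lin   = @(psi) psi;                           % linear utility      (limit gamma -> inf)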

Identification of Bottlenecks
In terms of computation time, the most important variables are the number of securities, the number of portfolios, and the number of scenarios. (Overview diagram of the three framework steps.)

Identification of Bottlenecks
"# of Securities" dominates computation time over "# of Portfolios". (Chart with # of Scenarios = 100,000.)

Identification of Bottlenecks
"# of Portfolios" dominates computation time over "# of Scenarios". (Chart with # of Securities = 100.)

Identification of Bottlenecks
(Further runtime charts: # of Portfolios = 100 and # of Scenarios = 100,000; # of Scenarios = 100,000 and # of Securities = 100; # of Portfolios = 100 and # of Securities = 100.)

The Mean Variance Framework
(Overview diagram; focus: phase 5 of the computation of the required inputs.)

Generation of Required Inputs, Phase 5
(Market vector calculator IP core: multipliers, adder/subtractors and multiplexers controlled by cntrl_a and cntrl_b compute M from p_T, β', K, I_N and P_(T+τ).) The control inputs select the objective: absolute wealth uses a ≡ 0, B ≡ I_N (M = P_(T+τ)); relative wealth uses a ≡ 0, B ≡ K (M = K P_(T+τ)); net profits uses a ≡ -p_T, B ≡ I_N (M = P_(T+τ) - p_T). For example, absolute wealth corresponds to cntrl_a = 0, cntrl_b = 0 and relative wealth to cntrl_a = 1, cntrl_b = 0.
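A behavioral MATLAB model of this selection, assuming P_Ttau holds the projected prices at the horizon, p_T the current prices and K the relative-wealth transformation; the hardware makes the same choice with its cntrl_a / cntrl_b inputs, and all values below are placeholders:
% Behavioral model of the market vector calculator: M = a + B * P_Ttau.
N         = 4;
p_T       = [10; 20; 30; 40];                 % current prices (example values)
P_Ttau    = p_T .* (1 + 0.02*randn(N, 1));    % projected prices at the horizon (example)
K         = eye(N);                           % relative-wealth transformation (placeholder)
objective = 'net_profits';                    % 'absolute' | 'relative' | 'net_profits'

switch objective
    case 'absolute'      % a = 0,    B = I_N   ->  M = P_Ttau
        a = zeros(N,1);  B = eye(N);
    case 'relative'      % a = 0,    B = K     ->  M = K * P_Ttau
        a = zeros(N,1);  B = K;
    case 'net_profits'   % a = -p_T, B = I_N   ->  M = P_Ttau - p_T
        a = -p_T;        B = eye(N);
end
M = a + B * P_Ttau;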

Generation of Required Inputs, Phase 5
(Step-by-step build-up of the market vector calculator datapath: multipliers, adder/subtractors, the I_N and K paths, and the cntrl_a / cntrl_b multiplexers.)

The Mean Variance Framework
(Overview diagram; focus: the computation of the efficient frontier.)

Hardware Architecture for MVF Step 1
α(v) ≡ argmax α' E{M}, v ≥ 0, subject to α ∈ constraints and α' Cov{M} α = v.
A popular approach to solving such constrained maximization problems is the Lagrangian multiplier method.
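As a software point of reference for what each frontier point involves, the textbook Lagrangian (closed form) solution of the mean variance problem with a target return and a full-budget constraint; this is the standard derivation, not necessarily the exact formulation implemented in the cores, and the example numbers are arbitrary:
% One point of the efficient frontier: minimize alpha'*Sig*alpha subject to
% mu'*alpha = r and sum(alpha) = 1, solved with Lagrange multipliers.
Ns  = 5;
mu  = [0.04; 0.06; 0.08; 0.10; 0.12];        % expected returns (example)
Sig = diag([0.02 0.03 0.05 0.08 0.12]);      % covariance (example)
r   = 0.07;                                  % target expected return

one = ones(Ns, 1);
Si  = Sig \ [one, mu];                       % [Sig^-1 * 1, Sig^-1 * mu]
A = one' * Si(:,1);  B = one' * Si(:,2);  C = mu' * Si(:,2);
D = A*C - B^2;
alpha = ((C - B*r) * Si(:,1) + (A*r - B) * Si(:,2)) / D;   % optimal weights
risk  = sqrt(alpha' * Sig * alpha);          % corresponding standard deviation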

Hardware Architecture for MVF Step 1
For each given risk level, a number of functions equal to the number of securities must be computed to determine the efficient allocation.

Hardware Architecture for MVF Step 1
(Block diagram: N_p parallel cores, each computing an allocation [α_1, α_2, ..., α_Ns] from E{M}, Cov{M} and a risk level v.)

The Mean Variance Framework
(Overview diagram; focus: the computation of the optimal allocation.)

Hardware Architecture for MVF Step 2
(Block diagram: a random number generator feeds the Monte Carlo block; each allocation α_1 = [α_11, α_12, ..., α_1Ns] requires N_s multiplications.) The utility functions shown are the exponential utility (ζ > 0, γ ≡ 0), u(ψ) = -e^(-(1/ζ)ψ), and the quadratic utility (ζ > 0, γ ≡ -1), u(ψ) = ψ - (1/2ζ)ψ².

Hardware Architecture for MVF Step 2
(Datapath diagrams for the objective value ψ.)

Hardware Architecture for MVF Step 2
Exploitable parallelism: N_s parallel multipliers, N_m parallel Monte Carlo blocks, N_m parallel utility calculation blocks, and N_p parallel satisfaction function calculation blocks.

Results: Generation of Required Inputs, Phase 5
With N_s arithmetic resources running in parallel, the speedup reaches 629× for 50 securities.

Results: Mean Variance Framework Step 2
(Speedup chart for 100,000 scenarios and 50 portfolios, comparing 10 satisfaction blocks built from one Monte Carlo block with 10 multipliers and 10 utility function calculator blocks against one with 20 multipliers and 20 utility function calculator blocks.)

Conclusion
- The Mean Variance Framework's inherent parallelism makes it an ideal candidate for an FPGA implementation;
- We are bound by hardware resources rather than by the parallelism the Mean Variance Framework offers;
- However, there are many different architectural choices for implementing the Mean Variance Framework's steps.