Co-processors for speeding up drug design algorithms Advait Jain Priyanka Jindal Pulkit Gambhir Under the guidance of: Prof. M Balakrishnan Prof. Kolin.


Co-processors for speeding up drug design algorithms
Advait Jain, Priyanka Jindal, Pulkit Gambhir
Under the guidance of: Prof. M Balakrishnan, Prof. Kolin Paul

Objective
To design FPGA-based hardware accelerators for speeding up the energy minimization process.

Approach to the problem
- Familiarization with the code
- Software profiling
  - Identifying bottleneck procedures/loops
  - Compiler-level optimizations
- H/W - S/W partitioning
  - Where to partition
  - APIs to export
- Hardware Design
- Performance Analysis

Overall Control Flow

Bottleneck Functions

Split Up code (table; most cell values not preserved in the transcript)
Columns: Eval_Energy_for_step (%), Diff_Energy (%)
Rows: Non-bonded pairs, Dihedrals, Angles, Bonded

Bottleneck Functions
- Iterate over list of bonds {O(N) elements}
- Iterate over list of angles {O(N) elements}
- Iterate over list of dihedrals {O(N) elements}
- Iterate over list of non-bonded pairs {O(N^2) elements}
Functions: Eval energy, Eval Energy for step, Diff energy

Molecule Size v/s Time (log plot) Average Slope = 2.03
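A slope near 2 on a log-log plot is what an O(N^2) pair loop would predict. The sketch below shows one way such a slope could be estimated and how fast the pair count grows. The timings are synthetic (exactly quadratic) and the function names are hypothetical; none of this is the project's measured data.

```python
import math

def max_pairs(n):
    """Upper bound on the number of atom pairs for n atoms."""
    return n * (n - 1) // 2

def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic timings that grow exactly quadratically (illustration only).
sizes = [100, 200, 400, 800, 1600]
times = [1e-6 * n * n for n in sizes]
slope = loglog_slope(sizes, times)
```

For a 2008-atom molecule, `max_pairs(2008)` gives 2,015,028, slightly above the VanderList length of 2,008,417 reported on a later slide.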

Energy v/s CG Steps (plot; marker: "We are here")

Non-bonded List
Node structure:
- Float A, B, C (4*3 bytes)
- Int a1, a2
C is a function of the charges q1 and q2 of the atoms.
- 471,282 distinct Cs (index fits in 3 bytes)
A, B are a function of the radius and epsilon of the atoms.
- 192 distinct pairs of A,B (index fits in 1 byte)

New Data Structure
- Vector of distinct Cs
- Vector of distinct (A,B) pairs
- 3D coordinates of atoms
New Node structure:
- Int a1, a2
- Unsigned common_index (3 bytes + 1 byte)
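A minimal sketch of how one 32-bit common_index could carry both indices, assuming 3 bytes for the C index and 1 byte for the (A,B) index. Which field sits in the high byte is an assumption, and the function names are hypothetical.

```python
C_BITS = 24   # 3 bytes: index into the vector of distinct Cs
AB_BITS = 8   # 1 byte: index into the vector of distinct (A,B) pairs

def pack_common_index(c_index, ab_index):
    """Pack both indices into one unsigned 32-bit value.
    Assumption: the (A,B) index occupies the top byte."""
    if not 0 <= c_index < (1 << C_BITS):
        raise ValueError("C index does not fit in 3 bytes")
    if not 0 <= ab_index < (1 << AB_BITS):
        raise ValueError("(A,B) index does not fit in 1 byte")
    return (ab_index << C_BITS) | c_index

def unpack_common_index(common_index):
    """Return (c_index, ab_index) from a packed value."""
    return common_index & ((1 << C_BITS) - 1), common_index >> C_BITS

# Largest indices implied by 471,282 distinct Cs and 192 distinct (A,B) pairs.
packed = pack_common_index(471_281, 191)
```

Both counts quoted for the non-bonded list fit: 471,282 is below 2^24 and 192 is below 2^8.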

Result of new data structure
Molecule Size: 2008
VanderList: 2,008,417
AB_Vander list: 136
C_Vanderlist: 21,651
- Old Data Structure: 2,008,417 * 20 ~ 40 MB
- New Data Structure: 2,008,417 * … + … * … + 21,651 * 4 ~ 24 MB
- Projected Data Structure: 2,008,417 * … + … * … + 21,651 * 4 ~ 16 MB
Improvement in cache performance
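The footprint arithmetic can be checked with a short sketch. The 20-byte old node (three floats plus two ints) comes from the slides. The 12-byte and 8-byte node sizes and the 8-byte (A,B) entry are assumptions chosen because they reproduce the quoted ~24 MB and ~16 MB totals.

```python
NODES = 2_008_417      # VanderList length
DISTINCT_AB = 136      # AB_Vander list length
DISTINCT_C = 21_651    # C_Vanderlist length

OLD_NODE = 20          # 3 floats (A, B, C) + 2 ints (a1, a2)
NEW_NODE = 12          # assumption: 2 ints + 4-byte common_index
PROJECTED_NODE = 8     # assumption: inferred from the ~16 MB total
AB_ENTRY = 8           # assumption: two 4-byte floats per (A,B) pair
C_ENTRY = 4            # one 4-byte float per distinct C

def to_mb(num_bytes):
    return num_bytes / 1e6

shared = DISTINCT_AB * AB_ENTRY + DISTINCT_C * C_ENTRY
old_bytes = NODES * OLD_NODE
new_bytes = NODES * NEW_NODE + shared
projected_bytes = NODES * PROJECTED_NODE + shared
```

Under these assumptions the two lookup vectors add under 0.1 MB, so the saving comes almost entirely from the smaller node.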

Sorting to improve performance
- Consecutive nodes of the van-der list can point randomly anywhere in the C and (A,B) vectors
- Scope for further improving cache performance
- Radix sort on the van-der list
  - First bucket sort on the C-index
  - Second stable bucket sort on the (A,B)-index
- Sequential access of the (A,B) vector
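The two-pass radix sort can be sketched as follows. The node layout `(a1, a2, c_index, ab_index)` and the function names are hypothetical; only the pass order (C-index first, then a stable pass on the (A,B)-index) comes from the slide.

```python
def stable_bucket_sort(nodes, key, num_buckets):
    """Stable bucket sort: nodes with equal keys keep their input order."""
    buckets = [[] for _ in range(num_buckets)]
    for node in nodes:
        buckets[key(node)].append(node)
    return [node for bucket in buckets for node in bucket]

def radix_sort_vander_list(nodes, num_c, num_ab):
    """Least-significant-key-first radix sort over the two indices."""
    by_c = stable_bucket_sort(nodes, lambda n: n[2], num_c)
    # The second pass is stable, so nodes sharing an (A,B) index stay ordered by C.
    return stable_bucket_sort(by_c, lambda n: n[3], num_ab)

# Hypothetical nodes: (a1, a2, c_index, ab_index)
nodes = [(0, 1, 5, 2), (0, 2, 1, 0), (1, 2, 3, 2), (1, 3, 1, 1), (2, 3, 0, 0)]
sorted_nodes = radix_sort_vander_list(nodes, num_c=6, num_ab=3)
```

After the second pass the list is ordered by (A,B)-index with C-index as the tie-breaker, so the (A,B) vector is read sequentially and accesses to the C vector are ascending within each (A,B) group.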

Cache Profiling (unsorted vs sorted)
Test case: molecule of size 413 atoms with 25 SD and 100 CG steps

Unsorted:
- L1D refs: 1,773,145,080 (Rd: 1,451,802,230; Wr: 321,342,785)
- L1D misses: 44,016,787 (Rd: 43,429,781 (3%); Wr: 587,006 (0.1826%))
- L2 refs: 44,754,341 (Rd: 44,167,335; Wr: 587,006)

Sorted:
- L1D refs: 1,842,686,500 (Rd: 1,495,124,238; Wr: 347,562,262)
- L1D misses: 29,287,877 (Rd: 28,470,590 (1.9%); Wr: 817,287 (0.235%))
- L2 refs: 30,152,893 (Rd: 29,335,606; Wr: 817,287)
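The quoted percentages can be reproduced from the raw counts. This sketch assumes the first group of numbers is the unsorted run and the second the sorted run, as the slide title suggests.

```python
def miss_rate(misses, refs):
    """Miss rate as a percentage of references."""
    return 100.0 * misses / refs

# L1D read misses / L1D read refs (assumption: first group = unsorted)
unsorted_rd = miss_rate(43_429_781, 1_451_802_230)
sorted_rd = miss_rate(28_470_590, 1_495_124_238)

# Relative drop in total L1D misses after sorting
reduction = 100.0 * (44_016_787 - 29_287_877) / 44_016_787
```

On these counts, sorting removes roughly a third of the L1D misses even though the total number of references rises slightly.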

Converting to single precision
- All the code is written in double precision
- Double precision is difficult to replicate in hardware
- Need to test the feasibility of conversion to single precision
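A small illustration of why the conversion needs testing: single precision keeps about 7 significant decimal digits, so a small change in a large value can vanish. The energy values below are hypothetical, and `to_float32` is a helper written for this sketch.

```python
import struct

def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Hypothetical energies before and after a small step.
e_before = 1.0e8
e_after = 1.0e8 + 1.0

diff_double = e_after - e_before                          # 1.0
diff_single = to_float32(e_after) - to_float32(e_before)  # 0.0
```

Near 1e8 adjacent single-precision values are 8 apart, so a difference of 1 is rounded away entirely.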

Single Precision (call-flow diagram)
Functions: minEnergyCG(), diffEnergy(), evalEnergy_for_step(), moveStep()
Annotations: precision lost here; instability introduced here, resulting in NaN

Single Precision
- Removed the instability
  - Parabolic interpolation is replaced by lnsearch() whenever the points are colinear.
- Time taken to evaluate the energy increased.
  - Increase in the number of calls to evalEnergy_for_step().
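A sketch of why colinear points break parabolic interpolation and how a fallback can guard against it. The three-point vertex formula is the standard one; the function names, the tolerance `eps`, and the `line_search` callback standing in for lnsearch() are assumptions, not the project's code.

```python
def parabolic_min(a, fa, b, fb, c, fc, eps=1e-12):
    """Abscissa of the vertex of the parabola through (a,fa), (b,fb), (c,fc).
    Returns None when the points are (nearly) colinear, because the
    denominator then vanishes and the result would be inf or NaN."""
    p = (b - a) * (fb - fc)
    q = (b - c) * (fb - fa)
    den = p - q
    if abs(den) <= eps * max(abs(p), abs(q), 1.0):
        return None
    num = (b - a) * p - (b - c) * q
    return b - 0.5 * num / den

def next_step(points, line_search):
    """Use the parabolic estimate, or fall back to the supplied line search."""
    x = parabolic_min(*points)
    return line_search() if x is None else x

# f(x) = (x - 2)^2 sampled at 0, 1, 3: the vertex is at x = 2.
vertex = parabolic_min(0.0, 4.0, 1.0, 1.0, 3.0, 1.0)
# f(x) = 2x + 1 sampled at 0, 1, 2: colinear, so the fallback is used.
fallback = next_step((0.0, 1.0, 1.0, 3.0, 2.0, 5.0), lambda: 0.25)
```

In single precision, points that are distinct in double can round to colinear values, which is one way an unguarded division could produce NaN.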

Slow Float Vs Double: Time Plot

Control Flow

Single Precision (Molecule Size: 2008, SD: 100, CG: 150)
Table 1: # of calls to EvalEnergyforStep()
- Double: 642
- Slow Float: 893
- From minEnergyCG(): 450
- From lnSearch(): …
Table 2 (Double vs Slow Float; values not preserved in the transcript)
- # of calls to lnSearch()
- evalEnergyForStep() per lnSearch()

Reducing the number of Calls
- minEnergyCG: parabolic interpolation; the question is which 3 points to choose.
- Lnsearch: iteratively calculates the step size. When to stop the iteration is determined by 2 tolerances.
- What we did:
  - Points for parabolic interpolation are chosen further apart.
  - Increased the tolerances until the time to minimize the energy was the same as for double, then profiled to check the actual energy.

Fast Float Vs Double: Time Plot

Fast Float Vs Double: Energy Plot

Our conclusions from this exercise
- Located the source of instability.
- However, converting to float increased the time required for the code to run.
- Increasing the tolerances made the code fast again.
- The energy computed in float did not agree well with the double computation.

Feedback from SCF-Bio team
- They are interested primarily in “relaxing” the molecule.
- Actual energy is not of any consequence.
- To check the float code, the metric should be the error between the molecular structures (float vs double).

New Checking Methodology
- The start structure is relaxed twice: once in double, once in float.
- Metric: RMS distance between the double relaxed structure and the float relaxed structure.
- Acceptance: < 0.5
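A minimal sketch of the RMS distance check, assuming both structures list the same atoms in the same order and that no superposition or alignment step is applied first. The coordinates are hypothetical and the function name is not from the project.

```python
import math

def rms_distance(coords_a, coords_b):
    """Root-mean-square distance between corresponding atoms of two structures."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("structures must be non-empty and have the same atom count")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical two-atom structures relaxed in double and in float.
double_relaxed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
float_relaxed = [(0.0, 0.0, 0.3), (1.0, 0.4, 0.0)]

rms = rms_distance(double_relaxed, float_relaxed)
accepted = rms < 0.5   # acceptance threshold from the slide
```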

RMS Distance vs CG Steps (plot; marker: "We are here")

Comparison with new metric

Tasks completed this semester
- Software Profiling
  - No. of calls
  - Cache misses
  - Effect of parameters
- Control Flow Analysis
  - Flow Diagram
  - Data parallelism
  - Floating point precision requirement
- Exploring H/W Options
  - Platform Selection
  - S/W H/W Partitioning

Ongoing work + next semester
- Setting up building blocks
  - ZBT RAM access
  - PCI Interface
  - Floating Point Unit
- Combining blocks for a simple implementation
- Refining the implementation
  - Multiple compute engines
  - Multiple PCI cards