Presentation transcript:

PERFORMANCE ANALYSIS

Throughput & Scalability
• Software platform: 3.4 GHz Intel Xeon processor
• Hardware platform: Xilinx Virtex-2 Pro 100 FPGA
• Measures exclusively the generation and bounding of the entire tree space
• Input data sets with search-space sizes from 10,395 trees (8-leaf trees) to 316,234,143,225 trees (14-leaf trees)
• Speedup increases as input size increases
• The 16-core accelerator processes larger data sets as much as 40x faster

End-to-End Utilization
• Experiments performed on 13-leaf input data sets (13,749,310,575 possible trees)
• Operating on the 16-core parallel architecture
• Maximum utilization of approximately 77%
• The FPGA prunes more trees in the branch-and-bound search for almost every data set

PERFORMANCE ANALYSIS cont.

End-to-End Speedup
• Execution time includes communication costs between the FPGA and the host machine
• The FPGA consistently outperforms GRAPPA
• Speedups range from 2.14x to 13.71x
• Experiments with higher utilization have faster execution times
• Deviations in speedup are mostly due to differences in pruning rate and utilization
• Higher pruning rates contribute to faster execution times

OBJECTIVE

We describe an FPGA-based co-processor architecture that performs a high-speed branch-and-bound search of the space of phylogenetic trees corresponding to the number of input taxa. This co-processor architecture is designed to accelerate maximum-parsimony phylogeny reconstruction for gene-order and sequence data and is amenable to both exhaustive and heuristic tree searches. The architecture exposes coarse-grained parallelism by dividing the search space among parallel processing elements, and each processing element exposes fine-grained parallelism by exploiting memory parallelism within the lower-bound computation.

BACKGROUND

Phylogeny Reconstruction

Phylogenetic analysis is the study of evolutionary lineage among a set of species.
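The search-space sizes quoted in the performance analysis (10,395 trees for 8 leaves, 13,749,310,575 for 13 leaves, 316,234,143,225 for 14 leaves) agree with the standard count of unrooted binary trees on n labelled leaves, (2n-5)!!. The short Python check below is an illustrative sketch and not part of the poster; the function name is hypothetical.

```python
def count_unrooted_binary_trees(num_leaves):
    """Return (2n-5)!!, the number of unrooted binary trees on n labelled leaves."""
    if num_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Leaf k (k = 4..n) can attach to any of the 2k-5 edges
    # of the tree that already holds k-1 leaves.
    for k in range(4, num_leaves + 1):
        count *= 2 * k - 5
    return count


for n in (8, 13, 14):
    print(n, count_unrooted_binary_trees(n))
# 8 10395
# 13 13749310575
# 14 316234143225
```

Each extra leaf multiplies the search space by the next odd number, which is why going from 13 to 14 leaves alone increases the number of trees by a factor of 23.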
A phylogeny (or phylogenetic tree) is an unrooted binary tree in which each vertex represents information associated with a species and each edge represents a series of evolutionary events that effectively transformed one species into another. In general, the problem of phylogenetic reconstruction can be summarized as finding the phylogeny that most closely resembles the true evolutionary history of the input species. GRAPPA is an exhaustive search method that moves systematically through the space of all possible phylogenetic trees to find the tree with the lowest sum of edge lengths.

A Cluster-On-A-Chip Architecture For High Throughput Phylogeny Search
Tiffany Mintz and Dr. Jason Bakos
Department of Computer Science & Engineering, University of South Carolina, Columbia, SC

CONCLUSIONS
• Successfully demonstrated the use of heterogeneous computing with an FPGA accelerator to enhance the performance of a branch-and-bound computation
• The branch-and-bound approach for optimizing tree search is further accelerated on the FPGA
• Processing extremely large data sets is made feasible through a dual-focused parallelized FPGA architecture that encompasses both fine-grained and coarse-grained parallelism

[Table: Execution Time (sec) for Software, FPGA 1 PE, and FPGA 16 PEs, and FPGA Speedup (1 PE, 16 PEs, 1 to 16 PE), by # of Leaves and # of Trees]

[Table: per-input Execution Time (sec) for GRAPPA and FPGA, FPGA Speedup, # of Trees Scored, and % of Trees Scored (GRAPPA vs. FPGA). Surviving % of Trees Scored entries: 0.0212%, 0.0076%, 0.0103%, 0.0045%, 0.0068%, 0.0016%, 0.0096%, 0.0085%, 0.0293%]

FPGA ACCELERATOR

Tree Generation
• Tree represented by a list of integer edge orderings
• Tree begins as an initial three-edge structure
• Two new vertices are added at each step until a complete structure is generated
• At each stage of construction, the tree is separated into 3 parts
  • Prefix: segment before the new edges
  • Insertion: the new edges
  • Suffix: segment after the new edges
• Parallelism within the design allows 4 edges to be processed simultaneously

Core Architecture
• The controller is a finite state machine that implements most of the tree generation and branch-and-bound functionality
• Three sets of block RAM (BRAM)
  • Distance matrix storage
  • Stack
  • Result
• Lower bounds are computed in parallel with tree generation
• Trees are scored by the host machine only if their lower bound does not exceed the global upper bound
• The tree space is divided equally among 16 cores
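The stepwise construction and the equal division of the tree space among cores can be sketched in software. The Python below is a minimal illustration under stated assumptions: it identifies each tree by a mixed-radix index whose digits say which existing edge receives the next leaf, which is an assumed encoding and not necessarily the poster's edge-ordering representation, and all function names are hypothetical.

```python
from math import prod


def edge_choices(num_leaves):
    """Edges available when inserting leaves 4..n into the growing tree: 3, 5, 7, ..."""
    return [2 * k - 5 for k in range(4, num_leaves + 1)]


def decode_tree(index, num_leaves):
    """Map a tree index to its list of insertion-edge choices (mixed-radix digits)."""
    choices = []
    for radix in edge_choices(num_leaves):
        index, digit = divmod(index, radix)
        choices.append(digit)
    return choices


def partition(total, num_cores):
    """Split range(total) into num_cores contiguous ranges whose sizes differ by at most 1."""
    base, extra = divmod(total, num_cores)
    ranges, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


total = prod(edge_choices(8))      # size of the 8-leaf search space
ranges = partition(total, 16)      # one contiguous index range per core
first_on_core_1 = decode_tree(ranges[1].start, 8)
```

Because every index decodes to a distinct sequence of insertion choices, each core can enumerate its own range independently, which is one simple way to obtain the coarse-grained parallelism described above.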