Group May 09-06 Bryan McCoy Kinit Patel Tyson Williams Advisor/Client: Zhao Zhang.

Presentation transcript:

Group May 09-06
Bryan McCoy, Kinit Patel, Tyson Williams
Advisor/Client: Zhao Zhang

What is Bioinformatics?
• Genetic sequencing
• Massive amounts of data
• Many simple operations
• Perfect for distributed computing

Problem
• Current solutions are not realistically feasible
  • Too expensive
    ○ Supercomputers
    ○ High-powered servers
  • Too slow
    ○ Some inputs can take several days
• Need for high-speed, low-cost solutions

Our Solution
• Cell processor
  • Based on Phase 1
• Cluster of PlayStation 3s
• MPI
  • Message Passing Interface

IBM Cell Broadband Engine
• 1 Power Processing Element (PPE)
• 8 Synergistic Processing Elements (SPEs)
  • Only 6 SPEs are accessible on a PlayStation 3
• 4 high-speed rings for processor communication

DNAPenny
• Compares DNA strands from different species
• Score indicates the evolutionary similarity between two species
• Branch and bound search algorithm
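The branch and bound idea can be sketched on a toy problem. The example below is not DNAPenny's tree search: it solves a small assignment problem with a made-up cost table, and every name in it (`cost`, `search`, `branch_and_bound`) is hypothetical. What it shares with a parsimony search is the pruning rule: a partial score can only grow, so any partial solution that already scores as badly as the best complete one can be abandoned.

```c
#include <limits.h>

#define N 4  /* toy problem size (hypothetical) */

/* Toy cost table: cost[i][j] is the cost of giving task j to worker i. */
static const int cost[N][N] = {
    {9, 2, 7, 8},
    {6, 4, 3, 7},
    {5, 8, 1, 8},
    {7, 6, 9, 4},
};

static int best;      /* best complete score found so far (the bound) */
static long visited;  /* number of search nodes expanded */

/* Depth-first search over partial assignments.
   used is a bitmask of the tasks already taken. */
static void search(int worker, unsigned used, int partial)
{
    visited++;
    if (partial >= best)
        return;              /* bound: this subtree cannot beat best */
    if (worker == N) {
        best = partial;      /* complete and strictly better */
        return;
    }
    for (int j = 0; j < N; j++) {
        if (used & (1u << j))
            continue;        /* task j is already assigned */
        search(worker + 1, used | (1u << j), partial + cost[worker][j]);
    }
}

/* Returns the minimum total cost over all complete assignments. */
int branch_and_bound(void)
{
    best = INT_MAX;
    visited = 0;
    search(0, 0u, 0);
    return best;
}
```

Calling `branch_and_bound()` returns the cheapest total, and `visited` shows how many nodes were expanded. An unpruned search of this tree would expand 65 nodes.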

Functional Requirements
• FR1. Ported applications shall run on the Cell B.E.
• FR2. The results returned shall be the same as the original program.
• FR3. The applications shall return their runtime.
• FR4. The applications shall execute in parallel on multiple Cell B.E.s.

Non-Functional Requirements
• NF1. The Cells shall all run on the Linux OS.
• NF2. The runtimes of the ported applications shall be shorter than those of the original applications.
• NF3. The ported application shall be coded in the C language.

Market Survey
• Results of the survey point to large speedups for computationally intensive programs.
• Dr. Gaurav Khanna at the University of Massachusetts Dartmouth used a cluster of 8 PS3s to replace a supercomputer.
• Universitat Pompeu Fabra, in Barcelona, deployed a BOINC system called PS3GRID in 2007 for collaborative biological computing.

Risk Assessment
• Slow network speed
• Software support
• Limited RAM
• Hardware failure
  • Lower-quality entertainment hardware
• Limited prior experience
• Software development schedule

Resource Requirements
• 3 PlayStation 3s
• High-performance network switch
• Cell programming books
• Front node (desktop computer)
• Time

Software Environment
• Use Fedora 9 as the OS, as it is currently supported by the Cell SDK 3.1.
• Use the command line for the user interface.
• Use the IBM XLC compiler and/or the current GCC compiler.

Hardware Environment
• 3 PlayStation 3s
• High-speed crossbar switch
• Private network
• Front node (desktop computer)
  • Proxy server
  • Network File System (NFS)

I/O
• Input
  • Inputs are DNA sequences stored in a text file.
  • Text is a ClustalW alignment organized in Phylip format, a standard format for biological applications.
• Output
  • Outputs are
    ○ The parsimony score
    ○ The best trees
    ○ The execution time
  • The score and best trees are output to the screen and to text files.
  • The execution time is output to a CSV (Comma Separated Value) file.
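Writing the execution time to a CSV file might look roughly like the following. This is a sketch only: the two-column layout (`label,seconds`), the function names, and the use of `clock()` are assumptions, not details from the project. Note that `clock()` reports CPU time rather than wall-clock time.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Seconds of CPU time between two clock() readings. */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Format one CSV row, "<label>,<seconds>\n", into buf.
   Returns the row length, or -1 if the label contains a comma
   or the buffer is too small. */
int format_runtime_row(char *buf, size_t len, const char *label, double seconds)
{
    if (strchr(label, ',') != NULL)
        return -1;  /* a comma would break the column layout */
    int n = snprintf(buf, len, "%s,%.3f\n", label, seconds);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Append one row to the CSV file at path. Returns 0 on success. */
int append_runtime(const char *path, const char *label, double seconds)
{
    char row[128];
    if (format_runtime_row(row, sizeof row, label, seconds) < 0)
        return -1;
    FILE *f = fopen(path, "a");
    if (f == NULL)
        return -1;
    int ok = fputs(row, f) >= 0;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}
```

A caller would read `clock()` before and after the search, then pass `elapsed_seconds(start, end)` to `append_runtime` with a label naming the code revision.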

Work Breakdown Structure
Port Apps to Cluster PS3s
• Problem Definition
  • Research Cell/B.E.
  • Research Bioperf Suite
  • Research Distributed Parallel Algorithms
  • Research Previously Done Work
• End Product Design
  • Design Requirements
  • Design Process
  • Design Documents
• Considerations and Selections
  • Decide Which Linux to Install
  • Decide Which Applications to Port
• End Product Implementation
  • Hardware Implementation
  • Prototyping Implementation
  • Software Implementation
• End Product Testing
  • Ensure Correctness of Output Results
  • Benchmarking
• Final Documentation and Demonstration
  • Create Final Report
  • Create Project Poster
  • Prepare for Presentation

Work Schedule
• Gantt chart

Deliverables
• Source Code
• Compiled Executable
• Runtime Comparisons
• Final Report
• Poster
• Final Presentation

Costs
• Time
  • Approximately 555 man hours total.
  • Freely donated.
  • Total cost $0.
• Equipment
  • 3 PlayStation 3s
    ○ Provided by client
  • Crossbar router
    ○ Provided by client
  • Standard desktop computer
    ○ Provided by department
  • Total cost $0.

Development: Initial Overview
• Use MPI to distribute the program to the multiple PlayStations.
• Each PlayStation would search one branch of the tree.
• 1 function (supplement) took 90% of the runtime
  • Phase 1 ported this function to the SPEs

Development: Difficulties
• Found a bug in supplement.
• The bug did not affect results but did affect runtime.
• We contacted the original developer, Dr. Felsenstein at the University of Washington, who fixed the bug.
• The fix significantly improved runtime.
• However, the fix negated all work done by Phase 1, as supplement no longer took a significant amount of runtime.

Development: Reworking
• After the bug fix, no single function took a significant amount of runtime.
• We decided to distribute branches of the tree search to different processors.
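One simple way to split the top-level branches among processors is a round-robin rule. The slides do not say how branches were actually assigned, so the rule below and all function names are assumptions. The sketch leaves out MPI itself so that it compiles on its own. In an MPI program, `rank` and `size` would come from `MPI_Comm_rank` and `MPI_Comm_size`, and the final minimum could be computed with `MPI_Reduce` using `MPI_MIN`.

```c
#include <limits.h>

/* Round-robin ownership (an assumption): branch b belongs to
   the processor whose rank equals b % size. */
int owns_branch(int branch, int rank, int size)
{
    return branch % size == rank;
}

/* Number of top-level branches a given rank will search. */
int branches_for_rank(int n_branches, int rank, int size)
{
    int count = 0;
    for (int b = 0; b < n_branches; b++)
        if (owns_branch(b, rank, size))
            count++;
    return count;
}

/* Combine the best score found by each rank. Lower is better for
   parsimony. Stands in for a reduction across processors. */
int best_score(const int *rank_scores, int size)
{
    int best = INT_MAX;
    for (int r = 0; r < size; r++)
        if (rank_scores[r] < best)
            best = rank_scores[r];
    return best;
}
```

With 7 branches and 3 processors, rank 0 would search branches 0, 3, and 6, while ranks 1 and 2 would each search two. A static split like this is easy to reason about but can leave processors idle when branches differ greatly in cost.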

Development: Results
• Completed our goals
  • Divided work among 3 PlayStation 3s.
  • Produced faster code than a comparable sequential environment.
• Due to time constraints, we were not able to port the code to the SPEs.

Testing
• Used a script to test multiple inputs.
• Averaged the runtimes.
• Used several different code revisions and machines to provide comparisons.
• Projected the speedup that could be attained if the code were ported to the SPEs.
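The averaging and comparison steps amount to two small calculations, sketched below. The function names are hypothetical, and speedup is taken in the usual sense of baseline runtime divided by new runtime.

```c
#include <stddef.h>

/* Arithmetic mean of n measured runtimes, in seconds.
   Returns 0.0 when there are no measurements. */
double mean_runtime(const double *runs, size_t n)
{
    if (n == 0)
        return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += runs[i];
    return sum / (double)n;
}

/* Speedup of a new revision over a baseline: values above 1.0
   mean the new revision is faster. */
double speedup(double baseline_sec, double new_sec)
{
    return baseline_sec / new_sec;
}
```

For example, a revision that averages 4 seconds against a 10-second baseline has a speedup of 2.5.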

Results: Actual
• Our current code is [...] times faster than it was at the beginning of the semester.
• Surpassed our original projections, which assumed the use of the SPEs.

[Table: Code revision | Runtime (sec) | X Speedup (compared to desktop) | # of available cores. Rows: Original (Core 2); With Bug Fixes (Core 2); Original (1 PPE); With Bug Fixes (1 PPE); MPI with Bug Fixes (3 PPEs); MPI with Bug Fixes (3 PPEs, 18 SPEs) (Projected); Original Projections]

Results: MPI
• The speedup for MPI was [...]
• Excellent speedup for 3 nodes.

Results: Comparison
• Our final code came close to the performance of a high-powered desktop.
  • Core 2 Quad at 2.66 GHz
• Our projected results indicate a speedup of 6.4.

Results: Projected
• Using all SPEs, the speedup should be [...]
  • Assuming SPEs run as fast as the PPEs
    ○ Before SPE vectorization
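A projection of this kind can be sketched as a simple scaling calculation. The version below assumes perfectly linear scaling and equally fast cores, which matches the stated assumption that SPEs run as fast as the PPEs. It is not necessarily the formula the team used, and the function names are hypothetical.

```c
/* Cores in a cluster of PlayStation 3s: 1 PPE plus the 6 SPEs
   that are accessible on each console. */
int ps3_cores(int consoles)
{
    return consoles * (1 + 6);
}

/* Projected runtime if work measured on cores_used cores were
   spread evenly over cores_total cores. Assumes linear scaling.
   Returns -1.0 for non-positive core counts. */
double projected_runtime(double measured_sec, int cores_used, int cores_total)
{
    if (cores_used <= 0 || cores_total <= 0)
        return -1.0;
    return measured_sec * (double)cores_used / (double)cores_total;
}
```

Under these assumptions, three consoles give 21 cores, so a run measured on the 3 PPEs alone would be projected to take one seventh as long. Real scaling would likely be lower because of communication and load imbalance.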

Conclusions
• Achieved our goal of using MPI to get runtime improvement.
• Contributed a major fix to a widely used application.
• Surpassed our initial runtime goal.
• Projected results show an even larger runtime improvement still possible.

Acknowledgements
• May08-24 group (Phase 1)
  • Kyle Byerly
  • Shannon McCormick
  • Matt Rohlf
  • Bryan Venteicher
• DNAPenny author
  • Dr. Felsenstein
• Advisor
  • Zhao Zhang
• Environment help
  • Steve Nystrom

Questions?