IBP-BLAST: Using Logistical Networking to distribute BLAST databases over a wide area network Ravi Kosuri Advisor: Dr. Erich J. Baker.


Outline of presentation Introduction Related Works Solution Experimental Results Conclusions and Future work

Introduction Bioinformatics "The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information." - Fredj Tekaia at the Institut Pasteur

Sequence similarity  Biological sequences show complex patterns of similarity.  Sequences are similar because of evolution.  Understanding sequence evolution may be tantamount to understanding evolution itself. Comparative sequence analysis  Powerful tool for discovering biological function.  Useful tool for discovery of Gene Regulatory Networks (GRNs)

BLAST is the most widely used sequence analysis tool. Why is BLAST the standard?  Fast  Reliable from a rigorous statistical and software development point of view.  Flexible. Can be adapted to a variety of sequence analysis scenarios.

Usage of BLAST  NCBI WWW Server Website of the National Center for Biotechnology Information at the National Library of Medicine, Bethesda, MD. Most widely used public sequence analysis facility in the world. Provides similarity searching against all currently available sequences (GenBank/DDBJ/EMBL).

Advantages of the NCBI server  Web interface. Can be accessed from any machine connected to the Internet through a browser.  Server maintains all databases and assumes responsibility for using the most current versions of databases.  Server is a centralized authority and has access to resources from various public institutions (like DDBJ and EMBL)  Access to vast storage and supercomputing resources available at the server.

Disadvantages  Centralized. Does not scale in terms of response time as the number of users/queries increases.  Not set up to do batch queries on sets of genes. The Time of Execution (TOE) rule: 1st request – current time; 2nd request – current time + 60 s; 3rd request – current time + … s; 4th request – current time + … s

Growth of users using the NCBI web resource

Growth of GenBank and BLAST searches

Local BLAST BLAST tools are available for free download from NCBI, and can be installed locally. Advantages of Local BLAST Processing is done at the local machine. Slowdown due to global traffic is not a problem. As a result, it is possible to perform batch queries. Local software using optimizations like parallel processing can be used to improve response times. Private databases can be used.

Disadvantages All the databases have to be maintained locally (close to 0.5 TB and growing rapidly). This strains local storage and computation resources. The databases change constantly as new sequences are added every day from projects across the world. The client is responsible for keeping the local databases current. The growing number of private users puts stress on the public FTP servers.

Is it possible to get the best of both approaches, perhaps with some tradeoffs? Problem Can users with limited storage and computational resources perform batch queries without relying on external data warehouses and without the burden of maintaining the entire complement of databases locally? Associated issues: performance and scalability of such a solution.

Proposal A globally scalable distributed solution is promising in terms of addressing the problems mentioned.

Related Works Focus on distributing the computation over a cluster or tightly coupled processors. Solutions scale with the number of processors, but not with the number of users. Address restricted domains/scales. Not scalable to WAN scales.

TurboBLAST  Parallel implementation of BLAST suitable for execution on networked clusters of heterogeneous machines. BeoBLAST  Distributes BLAST and PSI-BLAST over nodes of a Beowulf cluster.

HT-BLAST  Diverges from other parallel solutions that use some form of standard BLAST by modifying the BLAST code itself. mpiBLAST  Parallelization of NCBI-BLAST using MPI. dBLAST  Wrapper for running BLAST in parallel or distributed environments.

Solution A globally scalable distributed solution addresses these problems:  Centralized server issue is resolved  Storage and computation are distributed.  Scalable with number of users. Globally scalable solution  Globally scalable storage model  Globally scalable computation model Focus on the distributed storage model.

Methodology to distribute BLAST databases over a wide area network Leverage distributed storage technologies developed for the field of Logistical Networking

Digression: Introduction to Logistical Networking “Coordinated scheduling of data transmission and storage within a unified communications fabric” Globally scalable storage structure. Makes storage available as a shared resource of the network by following the same paradigm as the Internet. Based on a highly generic, best-effort service which provides the foundation on which all higher layer services are built.

The Network storage stack “Unbrokered” access to >35TB of publicly available storage End of digression

Methodology to distribute BLAST databases Use of a centralized directory structure. A central server maintains references to BLAST databases, which are distributed over the IBP infrastructure. Clients use the references obtained from the central server to access the databases
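The directory idea above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory index; the class names (`ChunkRef`, `DatabaseEntry`, `Directory`), the version field, and the reference strings are illustrative only and do not come from the IBP-BLAST implementation. The point it shows is that the central server holds references, not data.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChunkRef:
    """Reference to one database chunk stored on the network (hypothetical layout)."""
    index: int            # position of the chunk within the database
    replicas: list[str]   # opaque references to replicated copies of the chunk


@dataclass
class DatabaseEntry:
    """What the central server keeps for one BLAST database."""
    name: str
    version: int
    chunks: list[ChunkRef] = field(default_factory=list)


class Directory:
    """Central index: stores references only, never the database contents."""

    def __init__(self) -> None:
        self._entries: dict[str, DatabaseEntry] = {}

    def publish(self, entry: DatabaseEntry) -> None:
        # Publishing a newer entry under the same name replaces the old references.
        self._entries[entry.name] = entry

    def lookup(self, name: str) -> DatabaseEntry:
        return self._entries[name]


directory = Directory()
directory.publish(
    DatabaseEntry(
        name="ecoli",
        version=1,
        chunks=[
            ChunkRef(0, ["depot-a/ref0", "depot-b/ref0"]),  # placeholder references
            ChunkRef(1, ["depot-a/ref1"]),
        ],
    )
)
entry = directory.lookup("ecoli")
```

A client would take `entry.chunks`, pick any replica of each chunk, and download the data directly from storage, so the directory never carries the database traffic itself.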

High level schema for IBP-BLAST

High-level schema for UploadingServer

Chunk upload Server DBList
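The transcript does not describe how a database is partitioned before chunk upload. As one possible sketch, assuming chunks are formed from whole sequences and balanced by residue count, a greedy split could look like this; `split_fasta` and the sample records are hypothetical.

```python
def split_fasta(records, n_chunks):
    """Split (header, sequence) records into n_chunks lists balanced by residue count.

    Greedy assignment: each record goes to the currently smallest chunk, so no
    sequence is ever cut in the middle.
    """
    chunks = [[] for _ in range(n_chunks)]
    sizes = [0] * n_chunks
    # Placing the longest sequences first tends to give a better balance.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        smallest = sizes.index(min(sizes))
        chunks[smallest].append((header, seq))
        sizes[smallest] += len(seq)
    return chunks, sizes


records = [
    ("seq1", "ACGT" * 5),  # 20 residues
    ("seq2", "AC" * 3),    # 6 residues
    ("seq3", "G" * 12),    # 12 residues
    ("seq4", "T" * 4),     # 4 residues
]
chunks, sizes = split_fasta(records, 2)
```

Keeping sequences intact means each chunk is itself a valid sequence database that the unmodified BLAST application can search.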

XNDClient

Merge algorithm Merges the intermediate output files produced by runs on the chunks. Most of the merge is a simple text merge; only the scores and e-values need to be reconstructed. Scores do not change. E-values: the formula for scaled e-values is obtained by approximating the effective length of the complete database, in the Karlin-Altschul equation, by the sum of the effective lengths of its chunks. Requires all chunks of a database.
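One plausible reading of the e-value reconstruction, offered as a sketch and not as the formula from the slides: in the Karlin-Altschul equation E = K · m' · n' · exp(−λS), the e-value is proportional to the effective database length n'. If K, λ, and the effective query length m' are assumed to be the same for every chunk, an e-value computed against one chunk can be rescaled by the ratio of the summed chunk lengths to that chunk's length. The function names and the hit tuple layout below are hypothetical.

```python
def scale_evalue(chunk_evalue, chunk_eff_len, all_chunk_eff_lens):
    """Rescale an e-value computed against one chunk to the whole database.

    E is proportional to the effective database length, and the full database's
    effective length is approximated by the sum over all chunks.
    """
    return chunk_evalue * sum(all_chunk_eff_lens) / chunk_eff_len


def merge_hits(hits_per_chunk, chunk_eff_lens):
    """Merge per-chunk hit lists into one list ordered by scaled e-value.

    hits_per_chunk[i] is a list of (subject_id, score, evalue) tuples from chunk i.
    Scores are copied unchanged; only e-values are rescaled.
    """
    merged = []
    for hits, eff_len in zip(hits_per_chunk, chunk_eff_lens):
        for subject_id, score, evalue in hits:
            scaled = scale_evalue(evalue, eff_len, chunk_eff_lens)
            merged.append((subject_id, score, scaled))
    # Best hits first: smallest e-value, ties broken by higher score.
    return sorted(merged, key=lambda h: (h[2], -h[1]))


chunk_eff_lens = [1000, 3000]
hits_per_chunk = [
    [("a", 50, 1e-5)],  # hit found in chunk 0
    [("b", 60, 3e-6)],  # hit found in chunk 1
]
merged = merge_hits(hits_per_chunk, chunk_eff_lens)
```

The sum in `scale_evalue` runs over every chunk, which is why the merge needs results (or at least effective lengths) for all chunks of a database before any e-value can be finalized.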

Formula for calculating the scaled e-values

Salient features of this approach Server ensures that the databases are current. Client is freed from maintaining the databases locally. Selective distribution of private or restricted databases is possible. Integration with client-side software. Avoids maintaining large numbers of redundant database copies locally at each client.

Superior to FTP servers or mirroring. Reliability and fault tolerance provided through replication. Striping of databases would enable databases to fit into local memory of clients. Uses the unmodified NCBI BLAST application. Framework for a globally scalable distributed solution.

Optimizations/Variations  Caching model Clients can cache frequently used databases locally. Version numbers on the databases are used to determine whether the cached copies are current with the global copies.  Overlay Network model A parallel, scalable computation model. Servents (or nodes) distribute computation at the application layer using the underlying distributed database structure. Jobs are distributed as queries + chunk reference. All the chunks can be downloaded in parallel and BLAST queries run in parallel.
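The caching check can be sketched as a version comparison. This is a minimal illustration and assumes integer version numbers and a plain dictionary as the client cache. The names `needs_refresh` and `fetch` and the `download` callback are hypothetical.

```python
def needs_refresh(cache, name, directory_version):
    """Return True when the cached copy is missing or older than the global copy."""
    cached_version = cache.get(name)
    return cached_version is None or cached_version < directory_version


def fetch(cache, name, directory_version, download):
    """Use the cached copy when it is current; otherwise download and record the version."""
    if needs_refresh(cache, name, directory_version):
        download(name)                  # stand-in for pulling the chunks from storage
        cache[name] = directory_version
        return "downloaded"
    return "cache hit"


cache = {"ecoli": 3}   # database name -> version held locally
downloads = []         # records which databases were actually fetched
first = fetch(cache, "ecoli", 3, downloads.append)       # versions match
second = fetch(cache, "ecoli", 4, downloads.append)      # global copy is newer
third = fetch(cache, "est_mouse", 1, downloads.append)   # not cached yet
```

Only the version number has to be exchanged with the central server on each query, so a current cache avoids the database transfer entirely.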

Overlay Network model
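As a rough sketch of distributing jobs as query + chunk reference pairs, the following uses a local thread pool to stand in for overlay nodes. This is an assumption made for illustration: real servents would be separate machines, and `run_job` here returns a placeholder string where a node would download the chunk and run BLAST.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(job):
    """Stand-in for one node's work: fetch the chunk, run BLAST, return hits."""
    query, chunk_ref = job
    # A real node would download chunk_ref and invoke BLAST; this is a placeholder result.
    return f"{query} vs {chunk_ref}"


def distribute(queries, chunk_refs, max_workers=4):
    """Create one job per (query, chunk) pair and run the jobs in parallel."""
    jobs = [(q, c) for q in queries for c in chunk_refs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, jobs))  # map preserves the job order


results = distribute(["q1", "q2"], ["chunk0", "chunk1"])
```

Because every job carries its own chunk reference, any node can execute any job without holding the whole database, and the per-chunk results can then be combined by the merge step.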

Experimental Results Machines used  Linux machines - and, fire, wind, earth Databases used  E. coli – 4.5 MB, 400 seq., 4.6 million bp.  D. melanogaster – 119 MB, 1170 seq., million bp.  EST mouse – 2.3 GB, 4 million seq., 1.8 billion bp

Systems System 1 – Local BLAST setup. System 2 – Basic IBP-BLAST. System 3 – Caching enabled IBP-BLAST. System 4 – Overlay Network model.

Test 1: Effect of batch queries on performance of the systems.

Test 2: Effect of choice of the number of chunks of a database on the performance of the system.

Test 3: Effect of increasing the number of nodes on response time of the overlay network system.

Erratum in the write-up: the number of chunks is 8, not 4.

Test 4: Accuracy of the approximated e-values calculated by the merge algorithm.

Mean absolute difference and standard deviation of the absolute differences (reported on three result slides; numeric values not preserved in the transcript).

Issues and Future Work 1. Centralized Server scheme Some form of central authority is required as the databases contain scientific data (unlike music files) and are updated frequently. Since the central server is only an index for the references of the databases and not a repository, the load on the server is considerably lower. Time to Live (TTL) on XND files can overcome temporary server failures. Central server is expected to be an authority and can be expected to possess considerable resources to prevent failures.

2. Degree of replication  Amount of replication determines availability of databases.  Could be adjusted according to the popularity of the database.  Future work could address discovery of the location of specific requests and stage the databases closer to locations with high density of requests.

3. Overlay model issues  Arbitrary unavailability of nodes.  Deployment of reliable high-performance servers linked in an overlay network.  Use of the developing technologies for scalable sharable computing, based on the IBP paradigm.

4. Choice of the number of chunks  Depends on the computation model.  Server will require an estimate of network conditions and the sizes of the databases.

Summary Presented a case for a scalable distributed solution at WAN scales for BLAST. Proposed the use of the existing IBP infrastructure as a distributed storage model. Proposed a centralized directory schema utilizing the storage model. Showed that such a system would permit batch queries and that a distributed computation model would also allow scalability with respect to the number of queries/users. To be published in the 12th International Conference on Intelligent Systems for Molecular Biology / 3rd European Conference on Computational Biology (ISMB/ECCB 2004), Glasgow, Scotland.

Questions/Comments/ Suggestions?