IBP-BLAST: Using Logistical Networking to Distribute BLAST Databases Over a Wide Area Network
Ravi Kosuri (1), Jay Snoddy (2, 3), Stefan Kirov (2), Erich Baker (1, *)


(1) Department of Computer Science, Baylor University, Waco, TX
(2) Graduate School in Genome Science and Technology, University of Tennessee-Oak Ridge National Laboratory, Oak Ridge, TN
(3) Oak Ridge National Laboratory, Oak Ridge, TN
(*) Corresponding author

ABSTRACT

BLAST, the Basic Local Alignment Search Tool, is one of the most important software packages used in biological sequence analysis. To improve upon the standard client-server architecture, recent work has focused on distributed implementations that facilitate simultaneous searches of multiple databases with multiple queries. However, most of these approaches address restricted domains and scales. Much remains to be done toward distributing BLAST at Internet scale, making the full power of BLAST tools available to average users without the burden of maintaining local versions of the databases or relying on single data warehouses. To address these issues, we present a methodology for distributing BLAST databases that leverages technologies developed in the field of logistical networking.

Logistical Networking

Logistical networking is the coordinated scheduling of data transmission and storage within a unified communications resource fabric. The integration of networking and storage is achieved by providing storage as a shared resource of the network, analogous to the current Internet model of providing bandwidth as a shared resource. We use IBP (the Internet Backplane Protocol) to implement the logistical networking component. It has the following characteristics:
- The network storage stack is analogous to the TCP/IP stack.
- IBP provides a uniform, application-independent interface to storage.
- It is designed by analogy with the IP design paradigm.
- More than 35 terabytes of storage space are available on over 250 locally maintained depots spread across the US and 20 other countries.

[Figure: The Network Storage Stack]

The Methodology

Generalized Approach
- Centralized directory schema, resembling the Napster model of file sharing.
- The server stripes, partitions, replicates, uploads and maintains the BLAST databases on the storage depots in the IBP infrastructure, and maintains a directory of references to them.
- On a user request, the client obtains the xnd files (i.e., references to the database "chunks") associated with a database from the server through a predetermined protocol.
- The client then downloads the BLAST database chunks from the network and runs local BLAST queries on them.
- A merge of the intermediate results produces final results that mirror those obtained from individual BLAST runs on the complete databases.

Server
- Downloads the sequence database complement from a centralized data warehouse (e.g., the FTP site at NCBI).
- Stripes, formats and uploads the databases into the IBP network.
- Repeats this periodically (once every 24 hours) to keep the databases current with the daily updates released by NCBI.
- Maintains a local mirror of the databases to cover any temporary unavailability of the central repositories.

Upload
- Uses the LoRS upload tool to upload each chunk of a database.
- Each chunk is replicated and fragmented according to the parameters specified.
- The chunks are stored in the IBP infrastructure as IBP allocations, which are managed as byte arrays independent of the attributes of the underlying access layer.

[Figure panels: XND; Client. The Client schema represents the complete process at the client after the xnd files are obtained from the server.]

Merge Algorithm

Intermediate output files produced by BLAST runs on individual chunks need to be merged to produce a single output file for each query. The merge algorithm produces results mirroring those produced when a complete database is used.
- Most parameters require only simple transformations for the merge (e.g., the length of the database is the sum of the lengths of the individual chunks).
- Parameters such as lambda, K and H do not change.
- The most important part of the merge is the reconstruction of the e-value of each alignment:

$$E_{\text{transformed}} = \frac{\sum_{j} m'_j}{m'_i} \, E_i$$

where
- $i$ is the index of the chunk in which the alignment was found,
- $\sum_{j} m'_j$ is the sum of the effective lengths of all chunks of the database,
- $m'_i$ is the effective length of chunk $i$,
- $E_{\text{transformed}}$ is the scaled e-value, and
- $E_i$ is the e-value associated with chunk $i$.

The formula is obtained by approximating the effective length of the complete database by the sum of the effective lengths of the chunks in the Karlin-Altschul equation.

Optimizations

1. Caching model
- Caching databases improves the response time of the system.
- The cached databases are updated only when they become stale.
- Version numbers for the databases at the server and client keep the cached copies current with the databases in the IBP infrastructure.

2. Overlay Network Model
- The model was designed to show that deploying a reliable distributed computation model over the proposed storage model improves response time and scales with the number of queries and users.
- An application-level overlay network is used as a distributed computation structure over the underlying IBP model.
- The overlay model exploits the independence of the xnd files to download the database chunks in parallel and run BLAST queries on a set of independent cooperating machines located anywhere on the Internet.
- An idealized model that assumes no node failures is used.
- The query-initiating machine obtains the xnd files from the IBP-BLAST server and distributes them among the participating peers.
- The peers process the request (i.e., perform a batch query locally on the available chunk) and send the intermediate output files back to the initiating machine.
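
The distribution step of the overlay model could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the round-robin assignment, the thread pool standing in for remote peers, and the names `assign_chunks`, `scatter_gather` and `run_blast` are all assumptions, since the poster does not say how xnd files are divided among peers. Like the idealized model, the sketch assumes no node failures.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_chunks(xnd_files, peers):
    """Assign xnd references to peers round-robin (an assumed policy)."""
    if not peers:
        raise ValueError("at least one peer is required")
    assignment = {peer: [] for peer in peers}
    for index, xnd in enumerate(xnd_files):
        assignment[peers[index % len(peers)]].append(xnd)
    return assignment


def run_on_peer(peer, xnd_files, run_blast):
    """Stand-in for a peer running the batch query on each chunk it was given."""
    return [run_blast(peer, xnd) for xnd in xnd_files]


def scatter_gather(xnd_files, peers, run_blast):
    """Distribute chunks, run them in parallel, and collect intermediate results.

    Threads here only simulate independent machines; no failure handling
    is attempted.
    """
    assignment = assign_chunks(xnd_files, peers)
    with ThreadPoolExecutor(max_workers=len(peers)) as pool:
        futures = [
            pool.submit(run_on_peer, peer, files, run_blast)
            for peer, files in assignment.items()
        ]
        results = []
        for future in futures:  # gather in peer order
            results.extend(future.result())
    return results


# Hypothetical chunk references and peers; the fake run_blast just echoes its input.
demo_assignment = assign_chunks(["a.xnd", "b.xnd", "c.xnd"], ["peer1", "peer2"])
demo_results = scatter_gather(
    ["a.xnd", "b.xnd", "c.xnd"],
    ["peer1", "peer2"],
    lambda peer, xnd: (peer, xnd),
)
```

In a real deployment `run_blast` would download the chunk named by the xnd file and invoke BLAST on the peer; here it is a placeholder so the control flow can be followed end to end.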

The initiating machine then merges the intermediate files using the merge algorithm.

Experimental Analysis

Dell PowerEdge 1550 systems with dual Pentium 4 processors and 1 GB RAM, with a 10/100 Mbps connection to the ECS backbone at Baylor University.

Databases used
- Escherichia coli genome database (4.5 MB, 400 sequences, approx. 4.6 million nucleotides)
- Drosophila melanogaster genome database (119 MB, 1170 sequences, approx. [...] million nucleotides)
- EST database for the mouse genome (2.3 GB, approx. 4 million sequences, approx. 1.8 billion nucleotides)

Systems compared
- System 1 (local BLAST setup): locally installed BLAST tools used on local databases.
- System 2 (basic IBP-BLAST system): the basic IBP-BLAST system with the server and client set up on different machines.
- System 3 (caching-enabled IBP-BLAST system): the basic IBP-BLAST system with the caching optimization enabled.
- System 4 (overlay network model): servents running on 4 machines.

Results
- The caching model and the overlay model perform better as the size of the databases increases.
- The systems scale with the number of queries.

Discussion and Future Work
- The server keeps the databases current.
- The client is freed from the burden of maintaining the databases locally.
- Selective distribution of private or restricted databases is possible: the server can restrict access to the xnd files in its centralized directory.
- The proposed approach is superior to FTP servers or database mirroring in flexibility of use and scalability.
- Reliability and availability of the databases are ensured through replication.
- The proposed approach uses the unmodified serial NCBI BLAST application. Some existing implementations use modified versions to distribute BLAST, which leads to problems in validating the results obtained.
- The system provides a framework for a future globally scalable distributed solution.
- Future work involves identifying and developing a methodology for deploying a globally scalable computation structure.
- The emerging concept of sharing computation in the network, analogous to the IBP paradigm of sharing storage, appears promising in this direction.

Specific Aims

The explosive growth of community biological databases has outpaced our ability to adequately maintain and access biological data. To reduce the load on centralized servers such as NCBI, we will attempt to:
- Develop a globally scalable distributed solution for the storage of BLAST databases over a wide area network (WAN).
- Implement a supporting computational structure for the local analysis of distributed databases.
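
As an illustration of the e-value rescaling described under Merge Algorithm, the following sketch multiplies each per-chunk e-value by the ratio of the summed effective lengths to the effective length of the chunk that produced it, then sorts the hits. It is a hedged example only: the `Hit` record, the function names and the effective lengths are made up for illustration, and parsing of real BLAST output is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment from a BLAST run on a single chunk (hypothetical record)."""
    subject_id: str
    chunk: int      # index i of the chunk that produced the hit
    evalue: float   # e-value reported against that chunk alone


def rescale_evalue(evalue, chunk_length, total_length):
    """Scale a per-chunk e-value to the whole database: E_i * (sum_j m'_j / m'_i)."""
    if chunk_length <= 0:
        raise ValueError("effective chunk length must be positive")
    return evalue * total_length / chunk_length


def merge_hits(hits, effective_lengths):
    """Rescale every hit, then sort by the rescaled e-value (best first)."""
    total = sum(effective_lengths.values())
    rescaled = [
        Hit(h.subject_id, h.chunk,
            rescale_evalue(h.evalue, effective_lengths[h.chunk], total))
        for h in hits
    ]
    return sorted(rescaled, key=lambda h: h.evalue)


# Two chunks with made-up effective lengths of 1000 and 3000.
lengths = {0: 1000.0, 1: 3000.0}
hits = [Hit("seqA", 0, 0.001), Hit("seqB", 1, 0.0015)]
merged = merge_hits(hits, lengths)
for hit in merged:
    print(hit.subject_id, hit.chunk, hit.evalue)
```

Because the hit from the smaller chunk is scaled by a larger factor, `seqA` ends up ranked after `seqB` even though its per-chunk e-value was lower, which is the kind of reordering the merge has to get right to mirror a run on the complete database.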