IBP-BLAST: Using Logistical Networking to distribute BLAST databases over a wide area network Ravi Kosuri Advisor: Dr. Erich J. Baker.


1 IBP-BLAST: Using Logistical Networking to distribute BLAST databases over a wide area network Ravi Kosuri Advisor: Dr. Erich J. Baker

2 Outline of presentation Introduction Related Works Solution Experimental Results Conclusions and Future work

3 Introduction Bioinformatics "The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information." - Fredj Tekaia at the Institut Pasteur

4 Sequence similarity  Biological sequences show complex patterns of similarity.  Sequences are similar because of evolution.  Understanding sequence evolution may be tantamount to understanding evolution itself. Comparative sequence analysis  Powerful tool for discovering biological function.  Useful tool for discovery of Gene Regulatory Networks (GRNs)

5 BLAST is the most widely used sequence analysis tool. Why is BLAST the standard?  Fast  Reliable from a rigorous statistical and software development point of view.  Flexible. Can be adapted to a variety of sequence analysis scenarios.

6 Usage of BLAST  NCBI WWW Server Website of the National Center for Biotechnology Information at the National Library of Medicine, Bethesda, MD. The most widely used public sequence analysis facility in the world. Provides similarity searching against all currently available sequences (GenBank/DDBJ/EMBL).

7 Advantages of the NCBI server  Web interface. Can be accessed from any machine connected to the Internet through a browser.  The server maintains all databases and assumes responsibility for using the most current versions of the databases.  The server is a centralized authority and has access to resources from various public institutions (like DDBJ and EMBL).  Access to the vast storage and supercomputing resources available at the server.

8 Disadvantages  Centralized. Does not scale in terms of response time as the number of users/queries increases.  Not set up for batch queries on sets of genes. The Time of Execution (TOE) rule: 1st request – current time 2nd request – current time + 60 s 3rd request – current time + 120 s 4th request – current time + 180 s
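
The TOE rule above can be sketched as a simple scheduler. This is a minimal illustration only; the function name and the linear 60-second step are taken from the slide, but the code is not NCBI's actual implementation:

```python
def toe_delay_seconds(request_number: int) -> int:
    """Delay before the n-th consecutive request is executed, per the
    Time of Execution (TOE) rule: each additional request in a batch
    is pushed back by a further 60 seconds."""
    if request_number < 1:
        raise ValueError("request_number starts at 1")
    return 60 * (request_number - 1)

# The 4th request in a batch is scheduled 180 s after the current time,
# which is why server-side batch querying does not scale for a user.
```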

9 Growth of users using the NCBI web resource

10 Growth of GenBank and BLAST searches

11 Local BLAST BLAST tools are available for free download from NCBI and can be installed locally. Advantages of Local BLAST: Processing is done on the local machine. Slowdown due to global traffic is not a problem. As a result, it is possible to perform batch queries. Local software using optimizations such as parallel processing can be used to improve response times. Private databases can be used.

12 Disadvantages All the databases have to be maintained locally (close to 0.5 TB and growing rapidly). This puts a strain on local storage and computation resources. The databases are constantly changing as new sequences are added every day from projects across the world. The client is responsible for maintaining the concurrency of the local databases. The growing number of private users puts stress on the public FTP servers.

13 Is it possible to get the best of both approaches, perhaps with some tradeoffs? Problem: Is it possible for users with limited storage and computational resources to perform batch queries, without relying on external data warehouses and without the burden of maintaining the entire complement of databases locally? Associated issues: the performance and scalability of such a solution.

14 Proposal A globally scalable distributed solution is promising in terms of addressing the problems mentioned.

15 Related Works Existing approaches focus on distributing the computation over a cluster of tightly coupled processors. These solutions scale with the number of processors, but not with the number of users. They address restricted domains/scales and are not scalable to WAN scales.

16 TurboBLAST  Parallel implementation of BLAST suitable for execution on networked clusters of heterogeneous machines. BeoBLAST  Distributes BLAST and PSI-BLAST jobs over the nodes of a Beowulf cluster.

17 HT-BLAST  Diverges from other parallel solutions, which wrap some form of standard BLAST, by modifying the BLAST code itself. mpiBLAST  Parallelization of NCBI-BLAST using MPI. dBLAST  Wrapper for running BLAST in parallel/distributed environments.

18 Solution A globally scalable distributed solution addresses these problems:  The centralized-server issue is resolved.  Storage and computation are distributed.  Scales with the number of users. A globally scalable solution requires:  a globally scalable storage model  a globally scalable computation model. The focus here is on the distributed storage model.

19 Methodology to distribute BLAST databases over a wide area network Leverage distributed storage technologies developed for the field of Logistical Networking

20 Digression - Introduction to Logistical Networking “Coordinated scheduling of data transmission and storage within a unified communications fabric.” A globally scalable storage structure. Makes storage available as a shared resource of the network by following the same paradigm as the Internet. Based on a highly generic, best-effort service that provides the foundation on which all higher-layer services are built.

21 The Network storage stack “Unbrokered” access to >35TB of publicly available storage End of digression

22 Methodology to distribute BLAST databases Use of a centralized directory structure. A central server maintains references to BLAST databases, which are distributed over the IBP infrastructure. Clients use the references obtained from the central server to access the databases.
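
The directory scheme above can be sketched as follows. This is a hypothetical toy: the `DIRECTORY` mapping, the `ibp://` reference strings, and the function names are illustrative assumptions, standing in for the server's actual XND/exNode machinery:

```python
# Toy in-memory "directory server": the server holds only references
# to database chunks stored in IBP depots, not the data itself, so the
# load on the server is far lower than on a full repository.
DIRECTORY = {
    "ecoli.nt": ["ibp://depotA/ecoli.0", "ibp://depotB/ecoli.1"],
}

def lookup_chunks(db_name: str) -> list:
    """Client-side lookup: resolve a database name to chunk references."""
    try:
        return DIRECTORY[db_name]
    except KeyError:
        raise KeyError(f"database {db_name!r} not registered") from None

refs = lookup_chunks("ecoli.nt")
# The client would now fetch each referenced chunk directly from the
# IBP depots and run unmodified NCBI BLAST against it.
```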

23 High-level schema for IBP-BLAST

24 High-level schema for UploadingServer

25 Chunk upload Server DBList

26 XNDClient

27 Merge algorithm Merges the intermediate output files produced by runs on the chunks. Most of the merge is a simple text merge. Reconstruction of scores and E-values: Scores do not change. E-values: the formula for scaled E-values is obtained by approximating the effective length of the complete database as the sum of the effective lengths of the chunks, in the Karlin-Altschul equation. Requires all chunks of a database.
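
A minimal sketch of the merge step, under the approximation described above. Since the Karlin-Altschul expectation E = K·m·n′·e^(−λS) is linear in the effective database length n′, a per-chunk E-value can be rescaled by a length ratio; the function names and tuple layout are illustrative assumptions, not the thesis code:

```python
def scale_evalue(e_chunk, chunk_eff_lengths, chunk_index):
    """Rescale a per-chunk E-value to the full database, approximating
    the full database's effective length by the sum of the chunks'
    effective lengths. E is linear in the effective database length,
    so the rescaling reduces to a length ratio."""
    total = sum(chunk_eff_lengths)
    return e_chunk * (total / chunk_eff_lengths[chunk_index])

def merge_hits(per_chunk_hits, chunk_eff_lengths):
    """Merge hit lists from each chunk: scores are unchanged, E-values
    are rescaled, and the combined list is re-sorted by E-value."""
    merged = []
    for i, hits in enumerate(per_chunk_hits):
        for name, score, evalue in hits:
            merged.append((name, score,
                           scale_evalue(evalue, chunk_eff_lengths, i)))
    merged.sort(key=lambda hit: hit[2])  # best (smallest) E-value first
    return merged
```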

28 Formula for calculating the scaled e-values
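
The formula itself appears only as an image in the original slides; the reconstruction below is an assumption based on the description on the previous slide (K and λ are the Karlin-Altschul parameters, m the query's effective length, n′_i the effective length of chunk i, and c the number of chunks):

```latex
% Karlin--Altschul expectation for hits scoring at least S against a
% database of effective length n':
E = K \, m \, n' \, e^{-\lambda S}
% Approximating the full database's effective length by the sum of the
% chunk effective lengths, an E-value E_i reported against chunk i is
% rescaled to the full database as:
E_{\mathrm{db}} \;\approx\; E_i \cdot \frac{\sum_{j=1}^{c} n'_j}{n'_i}
```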

29 Salient features of this approach The server ensures the concurrency of the databases. The client is freed from maintaining the databases locally. Selective distribution of private or restricted databases is possible. Integration with client-side software. Maintaining large numbers of redundant copies of databases locally at each client is avoided.

30 Superior to FTP servers or mirroring. Reliability and fault tolerance are provided through replication. Striping of databases would enable them to fit into the local memory of clients. Uses the unmodified NCBI BLAST application. A framework for a globally scalable distributed solution.

31 Optimizations/Variations  Caching model Clients can cache frequently used databases locally. Version numbers on the databases are used to determine whether the cached copies are concurrent with the global copies.  Overlay Network model A parallel, scalable computation model. Servents (or nodes) distribute computation at the application layer using the underlying distributed database structure. Jobs are distributed as queries + chunk references. All the chunks can be downloaded in parallel and the BLAST queries run in parallel.
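
The caching model's version check can be sketched as below. This is a hypothetical illustration: the cache layout and the `server_version`/`download` callables are assumptions standing in for the real server query and IBP transfer:

```python
def get_database(name, cache, server_version, download):
    """Return a database, preferring a local cached copy.

    `cache` maps database name -> {"version": int, "data": bytes}.
    `server_version(name)` asks the central server for the current
    version number; `download(name)` fetches the database via IBP.
    A cached copy is reused only if its version number matches the
    global copy's, i.e. it is concurrent with the global copy.
    """
    current = server_version(name)
    entry = cache.get(name)
    if entry is not None and entry["version"] == current:
        return entry["data"]          # cache hit: copy is concurrent
    data = download(name)             # cache miss or stale: re-fetch
    cache[name] = {"version": current, "data": data}
    return data
```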

32 Overlay Network model

33 Experimental Results Machines used  Linux machines - and, fire, wind, earth Databases used  E. coli – 4.5 MB, 400 seq., 4.6 million bp  D. melanogaster – 119 MB, 1,170 seq., 122.6 million bp  EST mouse – 2.3 GB, 4 million seq., 1.8 billion bp

34 Systems System 1 – Local BLAST setup. System 2 – Basic IBP-BLAST. System 3 – Caching enabled IBP-BLAST. System 4 – Overlay Network model.

35 Test 1: Effect of batch queries on performance of the systems.

36

37

38

39 Test 2: Effect of choice of the number of chunks of a database on the performance of the system.

40

41

42

43 Test 3: Effect of increasing the number of nodes on response time of the overlay network system.

44 Error in writeup: the number of chunks is 8 (not 4).

45 Test 4: Accuracy of the approximated e-values calculated by the merge algorithm.

46 Mean absolute difference = 0.057304 Standard deviation of the absolute differences = 0.051847

47 Mean absolute difference = 0.030473 Standard deviation of the absolute differences = 0.031604

48 Mean absolute difference = 0.014312 Standard deviation of the absolute differences = 0.02011

49 Issues and Future Work 1. Centralized Server scheme Some form of central authority is required as the databases contain scientific data (unlike music files) and are updated frequently. Since the central server is only an index for the references of the databases and not a repository, the load on the server is considerably lower. Time to Live (TTL) on XND files can overcome temporary server failures. Central server is expected to be an authority and can be expected to possess considerable resources to prevent failures.

50 2. Degree of replication  Amount of replication determines availability of databases.  Could be adjusted according to the popularity of the database.  Future work could address discovery of the location of specific requests and stage the databases closer to locations with high density of requests.

51 3. Overlay model issues  Arbitrary unavailability of nodes.  Deployment of reliable high-performance servers linked in an overlay network.  Use of the developing technologies for scalable sharable computing, based on the IBP paradigm.

52 4. Choice of the number of chunks  Depends on the computation model.  The server will require an estimate of network conditions and the sizes of the databases.
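
One plausible heuristic for this choice, shown only as an assumption and not as the rule the system actually uses: pick enough chunks that each chunk fits in a client's memory, which ties in with the striping advantage noted earlier:

```python
import math

def choose_chunk_count(db_size_bytes, client_memory_bytes, min_chunks=1):
    """Illustrative heuristic (an assumption, not the thesis' policy):
    split the database into enough chunks that each chunk fits within
    the memory a client can devote to BLAST. Network conditions could
    further raise the count for a parallel overlay computation."""
    return max(min_chunks, math.ceil(db_size_bytes / client_memory_bytes))
```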

53 Summary Presented a case for a scalable distributed solution at WAN scales for BLAST. Proposed the use of the existing IBP infrastructure as a distributed storage model. Proposed a centralized directory schema utilizing the storage model. Showed that such a system would permit batch queries, and that a distributed computation model would also allow scalability with respect to the number of queries/users. To be published in the 12th International Conference on Intelligent Systems for Molecular Biology / 3rd European Conference on Computational Biology (ISMB/ECCB 2004), Glasgow, Scotland.

54 Questions/Comments/ Suggestions?

