1
Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases
Vinay D. Shet
CMSC 838 Presentation
Authors: Allison Waugh, Glenn A. Williams, Liping Wei, and Russ B. Altman
2
CMSC 838T – Presentation
Motivation
- Biological databases are growing at a very high rate
  - The Protein Data Bank (PDB) grew from 5811 entries to 12110 in three years
- Computational tools are required to access and analyze this data efficiently
  - Typical data analyses
    - Linear scans across the database searching for a feature of interest
    - "All-versus-all" comparisons within the database
- High-performance distributed computing resources can play an important role in these analyses
  - The authors use a distributed computing environment, Legion, to enable large-scale analysis of the PDB
3
Motivation
- Similar to the evaluation of the threaded BLAST project
  - We run threaded BLAST on a Sun SMP with 24 processors
- The authors run a program called FEATURE on the Legion framework
  - Can access hundreds of CPUs worldwide
  - Can spawn sequential versions of FEATURE on all of them
4
Talk Overview
- Overview of talk
  - Motivation
  - Background
    - Legion
    - FEATURE
  - Methods
    - Experiments
  - Results
  - Discussion
  - Related work
  - Observations
5
Background
- Legion (Worldwide Virtual Computer)
  - Metacomputing environment comprising geographically distributed, heterogeneous collections of workstations and supercomputers
  - Connects resources to make up a single, worldwide, virtual computer
  - Coordinates a large number of parallel jobs on a mixture of processors: SMPs, MPPs, and PCs on any network
  - Legion provides the software infrastructure so that a system of heterogeneous, geographically distributed, high-performance machines can interact seamlessly
  - No manual installation of binaries over multiple platforms (Legion does it automatically)
6
Background
- Legion
  - LAM: MPI implementation for workstation clusters
  - Legion supports transparent scheduling, data management, fault tolerance, site autonomy, a single file name space, efficient scheduling, comprehensive resource management, and a wide range of security options
7
Background
- FEATURE
  - Site characterization and recognition system
    - A site is a microenvironment distinguished by some structural or functional role
  - Identifies functional or structural sites of interest in a query protein
8
Background
- FEATURE
  - Measures spatial distributions of chemical and physical properties to create a statistical model of a microenvironment
  - Compares regions of the query protein with known sites and control non-sites, and assigns scores indicating the likelihood of a region being a site
  - Produces a list of potential site locations with corresponding scores
  - Has been used to recognize ion, ligand, and enzyme binding sites
  - FEATURE is a typical data-driven algorithm requiring large data storage and efficient data analysis
  - Requires 12 hours on a single processor to evaluate 580 non-redundant PDB entries
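A rough back-of-the-envelope check shows why a single processor is not enough. The sketch below scales the quoted figure (12 hours for 580 entries) linearly to the 12110 PDB entries mentioned in the Motivation slide. The linear scaling is an assumption made here for illustration, since per-entry cost surely varies with protein size, and the function name is hypothetical.

```python
def extrapolate_hours(sample_hours, sample_entries, target_entries):
    """Scale a measured runtime linearly to a larger number of entries.

    Assumes every entry costs the same on average (an illustrative
    simplification, not a claim made in the slides).
    """
    return sample_hours / sample_entries * target_entries


# 12 hours for 580 entries is quoted in the slides; 12110 is the PDB size
# quoted in the Motivation slide.
hours = extrapolate_hours(12, 580, 12110)
print(f"{hours:.0f} hours (~{hours / 24:.1f} days)")  # 251 hours (~10.4 days)
```

Under that assumption a full sequential scan would take on the order of ten days, which is the kind of workload the authors hand to Legion.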
9
Methods
- FEATURE was run on all protein entries in the May 2000 PDB
- Searched for potential calcium binding sites
  - FEATURE has 90% sensitivity and 100% specificity for this task
- Three experiments conducted
  - Sequential scan of a PDB subset using a single processor
  - Comprehensive scan of the PDB using the Legion system with up to 50 processors
  - Set of Legion runs using a constant PDB subset but a varying number of processors
- Input parameters to FEATURE and the statistical model for Ca remained constant
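For readers less familiar with the two accuracy figures quoted above, here is a minimal sketch of how sensitivity and specificity are computed from classification counts. The counts in the example are made up purely to reproduce 90% and 100%; they are not data from the paper.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real sites that are recognized as sites."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-sites that are correctly rejected."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts chosen only for illustration:
# 9 of 10 known sites found, 0 of 50 control non-sites flagged.
print(sensitivity(true_pos=9, false_neg=1))   # 0.9
print(specificity(true_neg=50, false_pos=0))  # 1.0
```

A specificity of 100% means no control non-site was scored as a site, while a sensitivity of 90% means one in ten real sites was missed.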
10
Methods
- Experiments
  - Sequentially scanned an arbitrary 726 proteins from the PDB
    - Runs made on a single-processor Sun E450 machine with a 300 MHz UltraSPARC CPU
  - Comprehensive scan of all proteins (10,996 total) in the PDB
    - Maximum number of processors: 50
    - FEATURE code compiled for various platforms so binaries can run on different machines across Legion
  - Scanned a subset of proteins with a varying number of processors
    - Arbitrarily selected 4997 proteins for each run
    - Varied the number of processors using values 20, 40, 60, and 80
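To make the third experiment concrete, the sketch below splits 4997 proteins across 20, 40, 60, and 80 workers and reports how many proteins each worker would receive. The slides do not say how the authors divided the work, so the even, static partition is an assumption, and the protein identifiers are placeholders rather than real PDB codes. No Legion API is used.

```python
def partition(items, n_workers):
    """Split items into n_workers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` workers each take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Placeholder identifiers; the real runs used PDB entries.
protein_ids = [f"protein_{i:04d}" for i in range(4997)]

for n in (20, 40, 60, 80):
    sizes = [len(chunk) for chunk in partition(protein_ids, n)]
    print(n, min(sizes), max(sizes))
# 20 249 250
# 40 124 125
# 60 83 84
# 80 62 63
```

Each sequential FEATURE instance would then scan only its own chunk, which is why the per-worker load shrinks as processors are added.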
11
Results
- FEATURE reported six run-time failures due to non-standard PDB file formats in the sequential run
- FEATURE also encountered run-time assertion failures, illegal instructions, or segmentation faults during the second experiment
12
Results
13
Discussion
- FEATURE performance deteriorates once the number of processors exceeds 60
  - The optimal maximum number is constrained by
    - the client's process table, which keeps track of each Legion process spawned
    - the amount of memory available to support spawned processes
  - Thus, even if Legion contains hundreds of nodes, users cannot use them all
- Also, Legion provides minimal fault tolerance (if any instance fails, the user must wait until everything has finished to re-spawn it)
- The authors maintained a local copy of the database but concede that this is not a realistic situation, as updates to the PDB occur frequently
  - Consumes a lot of disk space
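The fault-tolerance complaint can be illustrated with a small sketch of the alternative: resubmitting a failed instance as soon as its failure is observed instead of waiting for the whole batch. This uses Python's standard `concurrent.futures` as a stand-in and does not model Legion's actual interface. `run_with_retries` and `flaky_scan` are hypothetical names, and the simulated crash is contrived for the demo.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_retries(task, inputs, max_workers=4, max_attempts=3):
    """Run task(x) for each input, resubmitting a failed input when its failure is seen."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(task, x): (x, 1) for x in inputs}
        while pending:
            # Iterate over a snapshot; retries submitted below are picked up
            # by the next pass of the while loop.
            for fut in as_completed(list(pending)):
                x, attempt = pending.pop(fut)
                try:
                    results[x] = fut.result()
                except Exception as exc:
                    if attempt < max_attempts:
                        pending[pool.submit(task, x)] = (x, attempt + 1)
                    else:
                        failures[x] = exc
    return results, failures


attempts = {}


def flaky_scan(protein_id):
    """Stand-in for one sequential scan; IDs ending in '7' crash on the first try."""
    attempts[protein_id] = attempts.get(protein_id, 0) + 1
    if protein_id.endswith("7") and attempts[protein_id] == 1:
        raise RuntimeError("simulated crash")
    return f"scanned {protein_id}"


results, failures = run_with_retries(flaky_scan, [f"p{i}" for i in range(20)])
print(len(results), len(failures))  # 20 0
```

The retried scans run alongside the remaining work, so a single crash does not add a second full pass at the end.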
14
Related Work
- Threaded BLAST and MPI BLAST
  - The authors' work is similar to threaded BLAST
  - MPI BLAST is a parallelized version of BLAST, so a single query can be split across multiple processors
  - FEATURE is not truly parallelized
15
Observations
- Running CPU-intensive tasks over many processors is definitely useful
  - However, Legion does not scale well, as performance degrades beyond 60 processors
- They have not utilized true parallelism in FEATURE
  - It seems to me that there is a lot of potential to parallelize FEATURE, given that many potential sites can be examined simultaneously
  - What would the performance gain of a parallelized version be?
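One standard way to answer the performance question is to report speedup and parallel efficiency from measured run times. The sketch below shows the two formulas. The timings are hypothetical placeholders, not results from the paper or from these slides.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_processors):
    """Speedup per processor; 1.0 would be perfect linear scaling."""
    return speedup(t_serial, t_parallel) / n_processors


# Hypothetical timings in hours, NOT measurements from the paper.
t1, t20 = 100.0, 8.0
print(speedup(t1, t20), efficiency(t1, t20, 20))  # 12.5 0.625
```

Efficiency falling as processors are added would be consistent with the degradation noted beyond 60 processors.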
16
Questions