Download presentation
Presentation is loading. Please wait.
Published byRoberta Townsend Modified over 9 years ago
1
Distributed Protein Structure Analysis By Jeremy S. Brown Travis E. Brown
2
The Problem An exhaustive search of proteins against a known database Each string is between 400 and 600 characters long Comparing a 10,000 strings against 1,000,000 random strings would take 44 days with a 1Ghz processor
3
Why an exhaustive search? Initial intent was to analyze proteins to determine 3-Dimensional structure Exhaustive search is required to ensure that the match that found is the best match
4
Solution Distribute the search among many PCs to obtain an answer faster. Solution raises more problems, however
5
How to Distribute Distributing search strings is not enough Also distribute the search space Must find efficient way to distribute 1GB of data without duplication
6
Program Details Client/Server architecture Uses proprietary protocol over TCP/IP to distribute data Server uses SQL database to store a list of ‘jobs’ and ‘known’ sequences
7
Program Details Server issues a single ‘job’ to each client upon request Client may also request a batch of data for comparison. Server marks which data has been sent to clients and avoids resending that data to new clients
8
The Server
9
The Client
10
Problems Server is slow at updating its database. This is only seen once for each client however.
11
Performance Analysis 1 client = 44 days (best algorithm) 2 clients = 26 days 3 clients = 19 days Adding more than 1 client increases time almost linearly, though distribution is expensive
12
Graphs
13
Graphs
14
Graphs
15
Graphs
16
Notes about the graphs Graphs do not include initial distribution, since this is only done once per client If search data distribution were to be included, efficiency would start at about 70% and increase to ~90% over time
17
Verifying Data Accuracy Add an entry into the search table with a known score Ensure that the result returned by the client is the known entry in the database
18
Lessons Learned Reading and writing single elements to an SQL database can be very expensive. Even the best designs aren’t perfect, especially when the problem is not fully understood.
19
Notes Biggest problem was distribution of data Distribution was very costly, so we tried to reuse data that we already distributed Program is pluggable, so any comparison algorithm can be used
20
Question/Comments
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.