Parallel Computational Biochemistry
Frank Dehne, www.dehne.net
Proteins, DNA, etc.
  DNA encodes the information necessary to produce proteins
  Proteins are the main molecular building blocks of life (for example, structural proteins and enzymes)
  Proteins are formed from a chain of molecules called amino acids
  The DNA sequence encodes the amino acid sequence that constitutes the protein
  Twenty amino acids are found in proteins, denoted by A, C, D, E, F, G, H, I, ...
Multiple Sequence Alignment
Databases of Biological Sequences
  >BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
  MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
  DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
  SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
  WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
  YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
  KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
  GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
  LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
  YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
  KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
  NCBI: 14,976,310 sequences; 15,849,921,438 nucleotides
  Swiss-Prot: 104,559 sequences; 38,460,707 residues
  PDB: 17,175 structures
Sequence comparison
  Compare one sequence (target) to many sequences (database search)
  Compare more than two sequences simultaneously
Applications
  Phylogenetic analysis
  Identification of conserved motifs and domains
  Structure prediction
Phylogenetic Analysis
Structure Prediction
  [Figure: genomic sequences, then protein sequences (example entry "> RICIN GLYCOSIDASE", shown with the same residue sequence as the BGAL_SULSO example), then protein structures]
Our Contributions
  Parallel min vertex cover for improved sequence alignments (to appear in Journal of Computer and System Sciences)
  Parallel Clustal W (ICCSA 2003)
  In progress: “Clustal XP” portal at
Clustal W
Progressive Alignment
  1. Do pairwise alignment of all sequences and calculate the distance matrix
  2. Create a guide tree based on this pairwise distance matrix
  3. Align progressively, following the guide tree
    –start by aligning the most closely related pairs of sequences
    –at each step, align two sequences, or one sequence to an existing subalignment
  [Figure: guide tree over S.cerevisiae [1], C.elegans [2], Drosophila [3], Human [4], Mouse [5]]
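The first two steps, and the merge order that drives the third, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Clustal W itself: normalized edit distance stands in for Clustal W's pairwise alignment scores, average-linkage clustering stands in for its guide-tree construction, and the names `edit_distance`, `distance_matrix` and `merge_order` are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed stand-in for a pairwise alignment score)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def distance_matrix(seqs: dict) -> dict:
    """Step 1: distances for all pairs, normalized by the longer sequence."""
    return {
        frozenset((x, y)): edit_distance(seqs[x], seqs[y]) / max(len(seqs[x]), len(seqs[y]))
        for x, y in combinations(seqs, 2)
    }


def merge_order(seqs: dict) -> list:
    """Steps 2-3: repeatedly merge the two closest clusters (average linkage).

    Returns the merges in order; each merge is where a progressive aligner
    would align two sequences or two existing subalignments.
    """
    dist = distance_matrix(seqs)

    def cluster_distance(c1, c2):
        return sum(dist[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    order = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        order.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return order


if __name__ == "__main__":
    toy = {"human": "ACDEFGHIKL", "mouse": "ACDEFGHIKM",
           "fly": "ACDQFGWIKL", "yeast": "MNPQRSTVWY"}
    for left, right in merge_order(toy):
        print(left, "+", right)
```

The all-pairs loop in `distance_matrix` is the expensive part, and every pair is independent of the others, which is why the pairwise stage is a natural target for parallelization.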
Parallel Clustal
  Parallel pairwise (PW) alignment matrix
  Parallel guide tree calculation
  Parallel progressive alignment
  [Figure: guide tree over S.cerevisiae [1], C.elegans [2], Drosophila [3], Human [4], Mouse [5]]
Relative Speedup
Clustal XP vs. SGI
  SGI data taken from "Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL" by Dmitri Mikhailov, Haruna Cofer, and Roberto Gomperts
Parallel Clustal - Improvements
  Optimization of input parameters
    –scoring matrices, gap penalties
    –requires many repetitive Clustal W calculations with various input parameters
  Minimum Vertex Cover
    –use minimum vertex cover to remove erroneous sequences and to identify clusters of highly similar sequences
Minimum Vertex Cover
  Conflict Graph
    –vertex: sequence
    –edge: conflict (e.g. alignment with very poor score)
  TASK: remove the smallest number of gene sequences that eliminates all conflicts
  NP-complete
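A small sketch of the conflict-graph formulation follows. It assumes that a conflict means a pairwise score below a threshold (the slides only say "very poor score"), and it uses exhaustive search, which is usable only on toy inputs. The function names and the score table are hypothetical.

```python
from itertools import combinations


def conflict_graph(scores: dict, threshold: float) -> set:
    """One edge per sequence pair whose score is below the threshold (assumed rule)."""
    return {frozenset(pair) for pair, score in scores.items() if score < threshold}


def is_vertex_cover(edges: set, cover: set) -> bool:
    """True if every conflict edge has at least one endpoint in `cover`."""
    return all(edge & cover for edge in edges)


def brute_force_min_cover(edges: set) -> set:
    """Smallest set of sequences whose removal eliminates all conflicts.

    Tries all subsets by increasing size, so it is exponential in |V|.
    """
    vertices = sorted(set().union(*edges)) if edges else []
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            if is_vertex_cover(edges, set(candidate)):
                return set(candidate)
    return set(vertices)


if __name__ == "__main__":
    # Hypothetical pairwise scores: sequence "a" aligns poorly with everything.
    scores = {("a", "b"): 5, ("a", "c"): 3, ("a", "d"): 4,
              ("b", "c"): 50, ("b", "d"): 60, ("c", "d"): 70}
    edges = conflict_graph(scores, threshold=10)
    print(brute_force_min_cover(edges))
```

The exhaustive search is only there to make the task concrete; its cost is what motivates the fixed-parameter approach on the following slides.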
FPT Algorithms
  Phase 1: Kernelization
    –reduce problem to size f(k)
  Phase 2: Bounded Tree Search
    –exhaustive tree search; exponential in f(k)
Kernelization
  Buss's Algorithm for k-vertex cover
  Let G=(V,E) and let S be the subset of vertices with degree greater than k (every such vertex must belong to any vertex cover of size at most k).
  Remove S and all incident edges: G -> G', k -> k' = k - |S|.
  IF G' has more than k x k' edges THEN no k-vertex cover exists
  ELSE start bounded tree search on G'
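A minimal sketch of this kernelization step, using the standard strict threshold for Buss's rule (degree greater than k). The edge-set representation and the name `buss_kernel` are assumptions made for illustration; this is not the code-s or code-p implementation.

```python
from collections import Counter


def buss_kernel(edges, k):
    """Single-pass Buss kernelization for k-vertex cover.

    Returns (forced, reduced_edges, k_prime), or None when no vertex cover
    of size at most k can exist.
    """
    edges = {frozenset(e) for e in edges}
    degree = Counter(v for e in edges for v in e)

    # A vertex with more than k neighbours must be in the cover: leaving it
    # out would force all of its neighbours in, which already exceeds k.
    forced = {v for v, d in degree.items() if d > k}
    k_prime = k - len(forced)
    if k_prime < 0:
        return None

    # G': drop the forced vertices together with their incident edges.
    reduced = {e for e in edges if not (e & forced)}

    # Every remaining vertex has degree at most k, so k' vertices can cover
    # at most k * k' edges.
    if len(reduced) > k * k_prime:
        return None
    return forced, reduced, k_prime


if __name__ == "__main__":
    star_plus_edge = [(0, 1), (0, 2), (0, 3), (0, 4), (5, 6)]
    print(buss_kernel(star_plus_edge, k=2))
```

In the example the hub vertex 0 has degree 4 > 2, so it is forced into the cover, leaving the single edge (5, 6) and k' = 1 for the tree search.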
Bounded Tree Search
Case 1: simple path of length 3
  remove selected vertices from G'; k' -= 2
Case 2: 3-cycle
  remove selected vertices from G'; k' -= 2
Case 3: simple path of length 2
  remove v1, v2 from G'; k' -= 1
Case 4: simple path of length 1
  remove v, v1 from G'; k' -= 1
Sequential Tree Search
  Depth first search
    –backtrack when k' = 0 and G' is not empty ("dead end")
    –stop when a solution is found (G' = {}, k' >= 0)
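The backtrack and stop conditions can be seen in a compact depth-first search. Note the simplification: instead of the four structural cases on the preceding slides, this sketch uses the textbook two-way branching on a single edge (one of its endpoints must be in the cover). The function name is hypothetical.

```python
def vertex_cover_search(edges, k):
    """Depth-first bounded tree search for a vertex cover of size at most k.

    Returns a cover as a set, or None if none exists within the budget.
    """
    edges = {frozenset(e) for e in edges}
    if not edges:
        return set()      # solution found: G' = {} and k' >= 0
    if k == 0:
        return None       # dead end: edges remain but the budget is spent

    u, v = tuple(next(iter(edges)))   # any uncovered edge {u, v}
    for pick in (u, v):               # branch: put u or v into the cover
        remaining = {e for e in edges if pick not in e}
        sub = vertex_cover_search(remaining, k - 1)
        if sub is not None:
            return sub | {pick}
    return None                       # both branches failed: backtrack


if __name__ == "__main__":
    triangle = [(1, 2), (2, 3), (1, 3)]
    print(vertex_cover_search(triangle, 1))  # a triangle needs two vertices
    print(vertex_cover_search(triangle, 2))
```

The search tree has depth at most k and two children per node, so its size depends on k rather than on the size of the graph.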
Parallel Tree Search
  Basic Idea:
    –build the top log p levels of the search tree (T')
    –every processor starts a depth-first search at one leaf of T'
    –randomize the depth-first search by selecting a random child
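The structure of this idea can be sketched as follows. Several assumptions apply: Python threads stand in for the processors (they show the decomposition, not a real speedup, whereas the implementation described later uses MPI), branching is the simple two-way edge rule rather than the four cases, and all names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def branch(edges, k, cover):
    """Children of a search-tree node: take either endpoint of one uncovered edge."""
    u, v = tuple(next(iter(edges)))
    return [({e for e in edges if w not in e}, k - 1, cover | {w}) for w in (u, v)]


def randomized_dfs(edges, k, cover, rng):
    """Depth-first search that visits the children in random order."""
    if not edges:
        return cover          # solution found
    if k == 0:
        return None           # dead end
    children = branch(edges, k, cover)
    rng.shuffle(children)
    for child in children:
        found = randomized_dfs(*child, rng)
        if found is not None:
            return found
    return None


def parallel_cover(edges, k, p, seed=0):
    """Expand the top log2(p) levels, then search each leaf of T' independently."""
    frontier = [({frozenset(e) for e in edges}, k, frozenset())]
    for _ in range(max(p, 1).bit_length() - 1):     # floor(log2 p) levels
        next_level = []
        for node in frontier:
            if node[0] and node[1] > 0:
                next_level.extend(branch(*node))
            else:
                next_level.append(node)             # already solved or dead
        frontier = next_level

    with ThreadPoolExecutor(max_workers=max(p, 1)) as pool:
        futures = [pool.submit(randomized_dfs, e, kk, c, random.Random(seed + i))
                   for i, (e, kk, c) in enumerate(frontier)]
        for future in futures:
            result = future.result()
            if result is not None:
                return set(result)
    return None


if __name__ == "__main__":
    five_cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    print(parallel_cover(five_cycle, k=3, p=4))
```

Because the leaves of T' partition the search space, a solution is found whenever one exists, regardless of the random choices; randomization only affects how quickly some worker reaches it.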
Analysis: Balls-in-bins
  sequential depth-first search path
    –total length: L, #solutions: m
    –expected sequential time (rand. distr.): L/(m+1)
  parallel search path
    –expected parallel time (rand. distr.): p + L/(p(m+1))
  expected speedup: p / (1 + (m+1)/L)
  if m << L then expected speedup is approximately p
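The balls-in-bins model can be checked numerically with a small Monte Carlo sketch. The assumptions are mine: the m solutions are placed uniformly at random on a search path of length L, the path is cut into p equal segments with one processor scanning each, p divides L, and the additive start-up term for building the top of the tree is ignored. Function names are hypothetical.

```python
import random


def first_hit_times(L, m, p, rng):
    """One trial: steps until the first solution, sequentially and with p scanners."""
    solutions = rng.sample(range(L), m)      # m distinct random positions
    sequential = min(solutions) + 1          # one scan from the start of the path
    chunk = L // p                           # segment length per processor
    parallel = min(s % chunk for s in solutions) + 1  # earliest hit in any segment
    return sequential, parallel


def estimate_speedup(L, m, p, trials, seed=0):
    """Ratio of mean sequential time to mean parallel time over many trials."""
    rng = random.Random(seed)
    seq_total = par_total = 0
    for _ in range(trials):
        seq, par = first_hit_times(L, m, p, rng)
        seq_total += seq
        par_total += par
    return seq_total / par_total


if __name__ == "__main__":
    print(estimate_speedup(L=100_000, m=10, p=8, trials=4000, seed=1))
```

With m much smaller than L, the mean sequential time is close to L/(m+1) and the estimated ratio comes out close to p, in line with the conclusion on the slide.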
Simulation Experiment
  L = 1,000,000
Implementation
  test platform:
    –32 node HPCVL Beowulf cluster
    –each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk
    –gcc and LAM/MPI on Red Hat Linux 7.2
  code-s: Sequential k-vertex cover
  code-p: Parallel k-vertex cover
Test Data
  Protein sequences
    –same protein from several hundred species
    –each protein sequence is a few hundred amino acid residues in length
    –obtained from the National Center for Biotechnology Information (
  Somatostatin
    –neuropeptide involved in the regulation of many functions in different organ systems
    –Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255
  WW
    –small protein domain that binds proline-rich sequences in other proteins and is involved in cellular signaling
    –Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318
  Kinase
    –large family of enzymes involved in cellular regulation
    –Clustal Threshold = 16, |V| = 647, |E| = , k = 497, k' = 397
  SH2 (src-homology domain 2)
    –involved in targeting proteins to specific sites in cells by binding to phosphotyrosine
    –Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397
  Thrombin
    –protease involved in the blood coagulation cascade; promotes blood clotting by converting fibrinogen to fibrin
    –Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413
  PHD (pleckstrin homology domain)
    –involved in cellular signaling
    –Clustal Threshold = 10, |V| = 670, |E| = , k = 603, k' = 603
  Random Graph
    –|V| = 220, |E| = 2155, k = 122, k' = 122
  Grid Graph
    –|V| = 289, |E| = 544, k = 145, k' = 145
  |VC| ~ |V| / 2, k' = k
Sequential Times
  Kinase, SH2, Thrombin: n/a
Code-p on Virtual Proc.
Parallel Times
Speedup: Somatostatin
Speedup: WW
Speedup: Rand. Graph
Speedup: Grid Graph
Clustal W + Parallel Clustal … Parallel FPT MVC
  Clustal XP Web Portal
  Clustal XP in progress
  X: Extended, P: Parallel
Clustal XP