March 26, 2007
Phyloinformatics of Neuraminidase at Micro and Macro Levels using Grid-enabled HPC Technologies
B. Schmidt (UNSW)
D.T. Singh (Genvea Biosciences)
R. Trehan, T. Bretschneider (NTU, Singapore)

Contents
H5N1 Genetics
H5N1 Phyloinformatics
Design Principles of Quascade
H5N1 Phyloinformatics with Quascade
Results
Conclusion and Future Work

H5N1 Genetics
Belongs to the Influenza A virus type
Segmented RNA genome
8 genes, 11 proteins
Classification based on:
–Hemagglutinin (HA): 15 subtypes
–Neuraminidase (NA): 9 subtypes
Genetic variations in HA/NA
Genetic drift
–Point mutations
–1918 Spanish flu
Genetic shift
–Reassortment of the segmented genome
–1957, 1968, 1997 pandemics
–2003 Z strain of H5N1

H5N1 Phyloinformatics
Essential to monitor newly emerging strains
–Molecular evolution at gene and genome level
–Phylogenetic analysis for determining the origin of new strains
Phylogenetics
–How fast do proteins evolve?
–What is the best method to measure the evolution?
–How to obtain the best phylogenetic tree?
Phylogenetic algorithms
–Character-based: Maximum Parsimony, Maximum Likelihood (ML)
–Distance-based: UPGMA, Neighbor-Joining (NJ)
–Bayesian MCMC-based: MrBayes, BEAST

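To make the distance-based family concrete, here is a minimal Python sketch of UPGMA over a simple p-distance (fraction of differing aligned positions). It is an illustration only, not the ProtDist implementation used in the workflow; the function names and the toy sequences in `TOY_SEQS` are hypothetical, and real analyses use evolutionary distance models rather than raw p-distances.

```python
from itertools import combinations

# Hypothetical, already-aligned toy sequences (not real NA data).
TOY_SEQS = {
    "A": "MNPNQKII",
    "B": "MNPNQKIT",
    "C": "MSPSQRIT",
    "D": "MSPSQRVT",
}


def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_table(seqs):
    """Pairwise p-distances, keyed by frozenset({name_i, name_j})."""
    return {
        frozenset((m, n)): p_distance(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }


def upgma(names, dist):
    """Build a UPGMA tree topology as nested tuples of taxon names."""
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance to the merged cluster is the leaf-count-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


tree = upgma(list(TOY_SEQS), distance_table(TOY_SEQS))
```

The sketch returns only the topology; branch lengths are omitted to keep it short. Neighbor-Joining differs in how it selects the pair to merge (it corrects for each cluster's average distance to all others), so it does not assume a constant evolutionary rate the way UPGMA does.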
Quascade – User Interface
[Figure: example user-interface screenshot with callouts "Communication" and "Processing pipeline"]
A data-flow tool in which each black box represents a Java object; the objects can run on different computers
Objects are assigned to available computers automatically (or manually if required)
Communication between objects is handled transparently
Objects are configured before run-time

Object Features
[Figure: three boxes, each labeled "Java Object"]
Coding in regular Java / C / C++
Persistent: activated whenever all data inputs are present
No explicit messaging protocol required
No distributed computing concepts need to be understood
Objects automatically or manually assigned to computers / CPU cores

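The firing rule described above (a persistent object is activated whenever all of its data inputs are present) can be sketched in a few lines of Python. This is not Quascade's API: the `Node` class, its methods, and the demo wiring are hypothetical names chosen for illustration, and the sketch runs in a single process rather than across computers.

```python
class Node:
    """A persistent processing object that fires once all inputs are present."""

    def __init__(self, name, func, input_ports):
        self.name = name
        self.func = func
        self.input_ports = list(input_ports)
        self.pending = {}   # port name -> value waiting to be consumed
        self.targets = []   # (downstream node, downstream port) pairs
        self.outputs = []   # results produced so far, kept for inspection

    def connect(self, node, port):
        """Route this node's output to an input port of another node."""
        self.targets.append((node, port))

    def receive(self, port, value):
        if port not in self.input_ports:
            raise KeyError(f"{self.name} has no input port {port!r}")
        self.pending[port] = value
        # Firing rule: run only when every input port holds a value.
        if len(self.pending) == len(self.input_ports):
            args = [self.pending[p] for p in self.input_ports]
            self.pending.clear()  # the node persists and waits for the next set
            result = self.func(*args)
            self.outputs.append(result)
            for node, target_port in self.targets:
                node.receive(target_port, result)


def build_demo():
    """Wire two single-input nodes into one two-input node (toy pipeline)."""
    upper = Node("upper", str.upper, ["text"])
    length = Node("length", len, ["text"])
    report = Node("report", lambda s, n: f"{s}:{n}", ["text", "count"])
    upper.connect(report, "text")
    length.connect(report, "count")
    return upper, length, report
```

In this toy pipeline, `report` stays idle after `upper` delivers its result and runs only once `length` has delivered too. The caller never sends messages to `report` directly, which mirrors the "no explicit messaging protocol" point on the slide.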
Phyloinformatics Workflow with Quascade

Parallelized Phyloinformatics Workflow

Data and Algorithms
Core Set
–22 H5N1 NA sequences from Swiss-Prot and TrEMBL
Medium Set
–581 H5N1 NA sequences from UniProt
Large Set
–909 Influenza A NA sequences from UniProt
ProtDist
–NJ
–UPGMA
ProtPars
ProtML
MrBayes

Runtime and Scalability (NA Bird Flu Protein)
[Figure: two charts, "Distance-based workflow" and "MP workflow", each showing processing time [h] per sequence set (one label reads "581 sequences"; the other is truncated) for 1 processor and 25 processors]

MrBayes – Tree, Core Set

Analysis and Observations
Clustering possibilities
–Temporal, host-based, geographical
Algorithms
–MrBayes and ProtML are the most consistent in their performance
–Both are too compute-intensive for the larger “macro” sets
Observed pattern
–All phylograms yielded geography-based clustering rather than time-based clustering
–Host ranges along clustered clades vary
–The same strain, with identical NA sequences, can infect different hosts
–NA may not be the sole factor determining the diverse host range
–Glycan site acquisition or loss seems to play a critical role in the molecular evolution of H5N1 NA
–Identification of “bridging isolates” may help in rapid monitoring and in developing a global-scale warning system for H5N1

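The observation that identical NA sequences occur in different hosts is the kind of check that is easy to script. The Python sketch below groups records by sequence and reports those seen in more than one host; the record layout, function names, and the entries in `TOY_RECORDS` are hypothetical and do not come from the data sets used in the study.

```python
from collections import defaultdict

# Hypothetical toy records (not real isolates).
TOY_RECORDS = [
    {"isolate": "toy-1", "host": "chicken", "seq": "MNPNQKII"},
    {"isolate": "toy-2", "host": "human", "seq": "MNPNQKII"},
    {"isolate": "toy-3", "host": "duck", "seq": "MSPSQRIT"},
]


def hosts_by_sequence(records):
    """Map each distinct sequence to the set of hosts it was isolated from."""
    groups = defaultdict(set)
    for rec in records:
        groups[rec["seq"]].add(rec["host"])
    return dict(groups)


def shared_across_hosts(records):
    """Return only the sequences that occur in more than one host."""
    return {
        seq: hosts
        for seq, hosts in hosts_by_sequence(records).items()
        if len(hosts) > 1
    }


shared = shared_across_hosts(TOY_RECORDS)
```

With the toy records, only the first sequence is reported, because it appears in both a chicken and a human record. Swapping `host` for a region or collection-year field would give the same kind of tally for the geographic and temporal groupings.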
Conclusion and Future Work
Quascade
–New graphical data-flow tool for designing pipelines / workflows that are grid-enabled automatically
–Supports implicit high-performance parallelization
–Supports persistent components
–Can be used with Java / C / C++ code or application binaries
H5N1 Phyloinformatics
–Can take advantage of the workflow system and HPC
–Can be easily used and modified by biologists
–Uses H5N1 NA sequences to better understand the evolution of H5N1
–Analysis of H5N1 NA data with different algorithms indicates spatial clustering based on geographical distribution rather than temporal or host-based clustering
Future work
–Studies in conjunction with other proteins such as HA and polymerase, and also at gene and genome level