A Grid implementation of the sliding window algorithm for protein similarity searches facilitates whole proteome analysis on continuously updated databases.

Slides:



Advertisements
Similar presentations
Multiple Alignment Anders Gorm Pedersen Molecular Evolution Group
Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Measuring the degree of similarity: PAM and blosum Matrix
Lecture 8 Alignment of pairs of sequence Local and global alignment
Space/Time Tradeoff and Heuristic Approaches in Pairwise Alignment.
Introduction to Bioinformatics Burkhard Morgenstern Institute of Microbiology and Genetics Department of Bioinformatics Goldschmidtstr. 1 Göttingen, March.
Sequence Similarity Searching Class 4 March 2010.
Sequence alignment SEQ1: VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKK VADALTNAVAHVDDPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHA SLDKFLASVSTVLTSKYR.
Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases Vinay D. Shet CMSC 838 Presentation Authors: Allison Waugh, Glenn.
Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Sequence Comparison Intragenic - self to self. -find internal repeating units. Intergenic -compare two different sequences. Dotplot - visual alignment.
Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal and MULTICLUSTAL Arunesh Mishra CMSC 838 Presentation Authors : Dmitri Mikhailov,
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Sequences are related Darwin: all organisms are related through descent with modification Related molecules have similar functions in different organisms.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright  1996, All rights reserved.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
© Wiley Publishing All Rights Reserved. Searching Sequence Databases.
Sequence Alignment.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU.
BIONFORMATIC ALGORITHMS Ryan Tinsley Brandon Lile May 9th, 2014.
DynamicBLAST on SURAgrid: Overview, Update, and Demo John-Paul Robinson Enis Afgan and Purushotham Bangalore University of Alabama at Birmingham SURAgrid.
Computational Biology, Part 3 Sequence Alignment Robert F. Murphy Copyright  1996, All rights reserved.
NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)
BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.
BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
CSIU Submission of BLAST jobs via the Galaxy Interface Rob Quick Open Science Grid – Operations Area Coordinator Indiana University.
Scoring Matrices April 23, 2009 Learning objectives- 1) Last word on Global Alignment 2) Understand how the Smith-Waterman algorithm can be applied to.
Sequence alignment SEQ1: VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKK VADALTNAVAHVDDPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHA SLDKFLASVSTVLTSKYR.
Construction of Substitution Matrices
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
REMINDERS 2 nd Exam on Nov.17 Coverage: Central Dogma of DNA Replication Transcription Translation Cell structure and function Recombinant DNA technology.
1 Large-Scale Profile-HMM on the Grid Laurent Falquet Swiss Institute of Bioinformatics CH-1015 Lausanne, Switzerland Borrowed from Heinz Stockinger June.
Basic terms:  Similarity - measurable quantity. Similarity- applied to proteins using concept of conservative substitutions Similarity- applied to proteins.
Using SWARM service to run a Grid based EST Sequence Assembly Karthik Narayan Primary Advisor : Dr. Geoffrey Fox 1.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
11/15/04PittGrid1 PittGrid: Campus-Wide Computing Environment Hassan Karimi School of Information Sciences Ralph Roskies Pittsburgh Supercomputing Center.
Sequence Alignment.
Construction of Substitution matrices
Typically, classifiers are trained based on local features of each site in the training set of protein sequences. Thus no global sequence information is.
Biosequence Similarity Search on the Mercury System Praveen Krishnamurthy, Jeremy Buhler, Roger Chamberlain, Mark Franklin, Kwame Gyang, and Joseph Lancaster.
Copyright OpenHelix. No use or reproduction without express written consent1.
MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Bioinformatics Overview
Sequence Based Analysis Tutorial
Pairwise Sequence Alignment
30% grade = class presentations
Lecture #7: FASTA & LFASTA
Basic Local Alignment Search Tool (BLAST)
Pairwise Alignment Global & local alignment
Sequence Alignment Algorithms Morten Nielsen BioSys, DTU
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool (BLAST)
Sequence Analysis Alan Christoffels
It is the presentation about the overview of DOT MATRIX and GAP PENALITY..
Presentation transcript:

A Grid implementation of the sliding window algorithm for protein similarity searches facilitates whole proteome analysis on continuously updated databases Jorge Andrade Department of Biotechnology, Royal Institute of Technology (KTH), Stockholm, Sweden.

Bionformatics Bioinformatics involves the integration of computers, software tools, and databases in an effort to address biological questions

The Blast algorithm The BLAST programs (Basic Local Alignment Search Tools) are a set of sequence comparison algorithms introduced in 1990 that are used to search sequence databases for optimal local alignments to a query. GATGCCATAGAGCTGTAGTCGTACCCT < — — > CTAGAGAGC-GTAGTCAGAGTGTCTTTGAGTTCC Seq. A Seq. B Simple Dot Plot Manual alignment

Alignment scores: match vs. mismatch Simple scoring scheme (too simple in fact…): Matching amino acids:5 Mismatch:0 Scoring example: K A W S A D V : : : : : K D W S A E V = 25

Protein substitution matrices A 5 R -2 7 N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V BLOSUM50 matrix: Positive scores on diagonal (identities) Similar residues get higher scores Dissimilar residues get smaller (negative) scores

Pairwise alignments 43.2% identity; Global alignment score: alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA : :.:.:. : : ::::.. : :.::: :....: :..: : ::: :. beta VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL.::.::::: :.....::.: ::.:: ::.::: ::.::.. :..:: :. beta KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF alpha PAEFTPAVHASLDKFLASVSTVLTSKYR :::: :.:..:.:.:...:. ::. beta GKEFTPPVQAAYQKVVAGVANALAHKYH

Why Compare Sequences? What biologists do with blastp? -Predicting a protein function -Predicting a protein 3-D structure -Finding protein family members -Antibody recognition site Select a unique fragmet of a protein Express that protein fragment in laboratory Immunice protein fragment to rabbit Rabbit create the anti bodies PrEST Validation of antibodies ( no crossbinding) Color label atibodies Antibody on differen tissues, binding to protein.

Sliding window protein similarity search The protein fragments, denoted Protein Epitope Signature Tags (PrESTs), comprise 100 to 150 amino acids (2). PrEST design is based on the selection of a protein region with as low as possible similarity to protein regions from other genes. This is important to avoid cross-reactivity of the resulting antibody.

Graphical representation Graphical representation where the identity of a 51 amino acid fragment of the target protein to all other human proteins from other genes is displayed as a color coded line at the middle position of the fragment on the protein. Green color code implies 60-80% and red >80% identity

The problem When using the complete Ensembl human protein data set (version 31.35) with sequences as input, the runtime on a single up-to-date workstation is 1300 hours. This task comprises a total of 15,193,041 blastp searches 15,193,041 blastp searches8 weeks Ensembl is a continuously updated database, generally once a moth.

The solution

Grid – Blast Architecture To develop and implement this in a Grid environment, we joined the Swegrid / NorduGrid virtual organization. We were granted by Swedish National Infrastructure for Computing (SNIC) to have access to ~600 nodes, 1000 h/month through the different Swedish clusters.

Grid broker Local Grid Proxy-server swegrid cluster / nodes grid_blast.pl pw_blast foreach query{ pw_blast} pw_blast pw-blast pw_blast

Results Runtime comparative analysis: *single CPU 1Ghz speed/512Mb RAM, ** local cluster with 5 processor units each 1Ghz speed/512Mb RAM, *** Swegrid environment with access to ~600 remote CPUs with similar or better hardware. The Grid analysis was performed by submitting the sequence in file split into 300 atomistic jobs. The runtime for the analysis of the complete Ensembl human protein data (33869 protein sequences) was reduced from 1304 hours on a single CPU to 22,3 hours on the Grid. The analysis has been repeated several times. The exact Grid runtime can vary depending to different Grid conditions but the overall performance relative to a single CPU is marginally affected.

Proper number of Grid Jobs Proper number of Grid jobs. The chart shows the runtime needed for three different size input data sizes, 500, and sequences long input files. The time needed to submit the complete set of jobs to the Grid nodes has to be approximate the same as the time needed for a single node to run a single atomistic part of the complete set of jobs

CONCLUSION If the time for submitting the complete set of jobs to the Grid exceeds the time to execute a single atomistic job, the data input has been sub-optimally split into. Grid implementations for computer intensive Bioinformatics tasks represents an economical and time efficient alternative. A local TEMPORARY installation of the executable and database upon submission, makes it very suitable for dynamic environments, avoids the need for a predefined environment, and does not leave/take up space on the computer between runs.