Improved Alignment of Protein Sequences Based on Common Parts David Hoksza Charles University in Prague Department of Software Engineering Czech Republic.

Presentation transcript:

Improved Alignment of Protein Sequences Based on Common Parts David Hoksza Charles University in Prague Department of Software Engineering Czech Republic

ISBRA
Presentation Outline
- Similarity search in protein sequence databases
- Smith-Waterman algorithm
- Common parts
  - basic algorithm
  - reversed sequences
  - inexact search
- Experiments
- Conclusion

Similarity Measures
Comparing two strings of amino acids:
- Hamming distance
  - defined for sequences of equal length
  - number of non-identical positions
- edit distance
  - minimal number of insert/update/delete operations needed to convert one sequence into the other
- weighted edit distance
  - takes into account the probability of updating one letter to another
  - scoring (substitution) matrices: PAM, BLOSUM, …
  - different costs for opening and extending a gap
- global/local alignment
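The two unweighted measures on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and unit costs are assumed for every insert, update and delete.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of non-identical positions of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Minimal number of insert/update/delete operations turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletes
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # update x to y (free if equal)
            ))
        prev = curr
    return prev[-1]
```

The weighted edit distance replaces the unit costs with substitution-matrix entries and gap penalties, which is what the alignment algorithms on the next slides do.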

Global Alignment
- Global alignment
  - aligns whole sequences
  - weighted edit distance
- Needleman-Wunsch
  - optimal alignment between 2 sequences a and b
  - distance matrix δ
  - gap cost σ
  - s_{i,j}: value of the optimal alignment of the prefixes of a and b of lengths i and j
  - s_{0,j} = j·σ, s_{i,0} = i·σ
  - s_{i,j} = max{ s_{i,j-1} + σ, s_{i-1,j} + σ, s_{i-1,j-1} + δ(a_i, b_j) }
    (adding gap to a, adding gap to b, align a_i and b_j)
  - s_{|a|,|b|}: value of the optimal alignment
  - time complexity O(|a||b|)
- Example (BLOSUM 62, gap cost -1):
    NPHGIIMGLAE
    --HG--LGL
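A minimal Python sketch of the Needleman-Wunsch fill, using the boundary conditions from the slide and the three choices it names (adding gap to a, adding gap to b, align a_i and b_j). The scoring function `toy_delta` is an assumption that stands in for BLOSUM 62, and the function names are hypothetical.

```python
def toy_delta(x: str, y: str) -> int:
    """Assumed toy scoring: +1 for a match, -1 for a mismatch (not BLOSUM 62)."""
    return 1 if x == y else -1


def needleman_wunsch(a: str, b: str, delta=toy_delta, sigma: int = -1) -> int:
    """Value of the optimal global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        s[i][0] = i * sigma          # s_{i,0} = i * sigma
    for j in range(1, cols):
        s[0][j] = j * sigma          # s_{0,j} = j * sigma
    for i in range(1, rows):
        for j in range(1, cols):
            s[i][j] = max(
                s[i][j - 1] + sigma,                          # adding gap to a
                s[i - 1][j] + sigma,                          # adding gap to b
                s[i - 1][j - 1] + delta(a[i - 1], b[j - 1]),  # align a_i and b_j
            )
    return s[len(a)][len(b)]         # s_{|a|,|b|}
```

The two nested loops visit every cell once, which gives the O(|a||b|) running time stated on the slide.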

Local Alignment
- Local alignment
  - best global alignment over all pairs of subsequences of a and b
- Smith-Waterman
  - modification of Needleman-Wunsch that allows a “free ride” from the start by incorporating a zero value
  - s_{0,j} = 0, s_{i,0} = 0
  - max(s_{i,j}): value of the optimal alignment
  - gap extending: σ, gap opening: ρ
- Example (BLOSUM 62, gap cost -11):
    NPHGIIMGLAE
    HGL
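The zero value can be illustrated with a short Python sketch. It is a simplification of what the slide describes: a single linear gap cost σ is assumed instead of separate opening (ρ) and extending (σ) costs, and `toy_delta` is an assumed stand-in for BLOSUM 62.

```python
def toy_delta(x: str, y: str) -> int:
    """Assumed toy scoring: +2 for a match, -1 for a mismatch (not BLOSUM 62)."""
    return 2 if x == y else -1


def smith_waterman(a: str, b: str, delta=toy_delta, sigma: int = -1) -> int:
    """Value of the best local alignment of a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)        # s_{0,j} = 0
    best = 0
    for x in a:
        curr = [0]                   # s_{i,0} = 0
        for j, y in enumerate(b, start=1):
            cell = max(
                0,                          # "free ride": restart the alignment here
                curr[j - 1] + sigma,        # gap in a
                prev[j] + sigma,            # gap in b
                prev[j - 1] + delta(x, y),  # align the two residues
            )
            curr.append(cell)
            best = max(best, cell)   # the result is the maximum over all cells
        prev = curr
    return best
```

Only the previous row is kept, because each cell depends on its left, upper and upper-left neighbours only.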

Speeding-up Database Search
- non-rigorous search
  - heuristic approaches trading off accuracy for speed
    - BLAST, FASTA
- rigorous search
  - indexing
    - weighted edit distance is not a metric in general → MAMs (metric access methods) not applicable
    - turning the distance into a metric: limited to q-grams
  - parallelism
    - running more alignments concurrently
      - MPSrch
    - the distance computation itself
      - FPGA (field-programmable gate arrays)
      - instructions for parallelism

Common Parts of Alignment Matrices
1. align s_i with the query sequence
2. replace s_i with s_{i+1}
3. start the alignment from the (n+1)st row, where n is the length of the prefix shared by s_i and s_{i+1}
- do the same with the h and v matrices
- the algorithm stays intact
- pre-step: sorting
- prefix ratio (PR): speed-up
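The three steps can be sketched in Python as follows. Several things here are assumptions rather than the presented implementation: a linear gap cost is used (so there are no separate h and v matrices), `toy_delta` stands in for BLOSUM 62, all names are hypothetical, and the prefix ratio is taken to be the fraction of matrix rows that are reused instead of recomputed.

```python
def toy_delta(x: str, y: str) -> int:
    """Assumed toy scoring: +2 for a match, -1 for a mismatch (not BLOSUM 62)."""
    return 2 if x == y else -1


def common_prefix_len(s: str, t: str) -> int:
    """Length of the longest common prefix of s and t."""
    n = 0
    for x, y in zip(s, t):
        if x != y:
            break
        n += 1
    return n


def search_with_common_prefixes(query, database, delta=toy_delta, sigma=-1):
    """Local alignment score of every database sequence against the query.

    Returns (scores, prefix_ratio). Rows of the matrix follow the database
    sequence, so sequences sharing a prefix share the first rows.
    """
    scores = {}
    rows = [[0] * (len(query) + 1)]  # rows[i]: matrix row after i residues
    best = [0]                       # best[i]: maximum cell in rows[0..i]
    previous, reused, total = "", 0, 0
    for seq in sorted(database):     # pre-step: sorting
        n = common_prefix_len(previous, seq)
        del rows[n + 1:]             # keep rows 0..n, shared with the previous sequence
        del best[n + 1:]
        for x in seq[n:]:            # start the alignment from the (n+1)st row
            prev, curr = rows[-1], [0]
            for j, y in enumerate(query, start=1):
                curr.append(max(0, curr[j - 1] + sigma, prev[j] + sigma,
                                prev[j - 1] + delta(x, y)))
            rows.append(curr)
            best.append(max(best[-1], max(curr)))
        scores[seq] = best[-1]
        reused += n
        total += len(seq)
        previous = seq
    return scores, (reused / total if total else 0.0)
```

Because the reused rows are exactly the rows a fresh run would compute, the scores are identical to those of the unmodified algorithm; only the amount of work changes.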

Reversed Sequences
- the score of an alignment is independent of the direction of the alignment
  - possibility of aligning according to suffixes (prefixes of reversed sequences)
  - division of the database into 2 groups (prefixes, suffixes) by a greedy algorithm:
    1. building stage: divide a given percentage of the database randomly, and the rest so that PR increases in every step
    2. shifting stage: move a random sequence to the opposite group if doing so would increase the overall PR
    - repeat step 2 n times
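A rough Python sketch of the shifting stage only. The slide does not define PR precisely, so the sketch assumes it is the total length of prefixes shared by neighbouring sequences in sorted order divided by the total sequence length, with the suffix group reversed before sorting. All function names are hypothetical, and recomputing PR from scratch for each move is chosen for brevity, not speed.

```python
import random


def common_prefix_len(s: str, t: str) -> int:
    """Length of the longest common prefix of s and t."""
    n = 0
    for x, y in zip(s, t):
        if x != y:
            break
        n += 1
    return n


def reused_rows(group) -> int:
    """Total prefix length shared by neighbours of the sorted group."""
    ordered = sorted(group)
    return sum(common_prefix_len(s, t) for s, t in zip(ordered, ordered[1:]))


def overall_pr(prefix_group, suffix_group) -> float:
    """Assumed overall PR of a database divided into the two groups."""
    total = sum(map(len, prefix_group)) + sum(map(len, suffix_group))
    if total == 0:
        return 0.0
    reversed_group = [s[::-1] for s in suffix_group]  # suffixes become prefixes
    return (reused_rows(prefix_group) + reused_rows(reversed_group)) / total


def shifting_stage(prefix_group, suffix_group, steps, rng=random):
    """Move random sequences to the opposite group when that increases PR."""
    prefix_group, suffix_group = list(prefix_group), list(suffix_group)
    current = overall_pr(prefix_group, suffix_group)
    for _ in range(steps):
        if rng.random() < 0.5:
            source, target = prefix_group, suffix_group
        else:
            source, target = suffix_group, prefix_group
        if not source:
            continue
        k = rng.randrange(len(source))
        seq = source.pop(k)
        target.append(seq)
        candidate = overall_pr(prefix_group, suffix_group)
        if candidate > current:
            current = candidate          # keep the move
        else:
            target.pop()                 # undo the move
            source.insert(k, seq)
    return prefix_group, suffix_group, current
```

Since a move is kept only when it raises the overall PR, the result depends on the starting division, which is presumably why a building stage precedes it.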

Inexact Search
- bigger database (more sequences) → higher PR
- split sequences
  → increase of database size proportional to the number of splits
  → inaccuracy: sequences whose alignment spreads over the split might no longer be in the result
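The inaccuracy can be seen on a tiny Python example. The slide does not say how sequences are split, so fixed-length chopping is assumed here, together with a linear gap cost and an assumed toy scoring function in place of BLOSUM 62. All names are hypothetical.

```python
def toy_delta(x: str, y: str) -> int:
    """Assumed toy scoring: +2 for a match, -1 for a mismatch (not BLOSUM 62)."""
    return 2 if x == y else -1


def chop(sequence: str, piece_length: int) -> list[str]:
    """Split a sequence into consecutive pieces of at most piece_length residues."""
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    return [sequence[i:i + piece_length]
            for i in range(0, len(sequence), piece_length)]


def local_score(a: str, b: str, delta=toy_delta, sigma: int = -1) -> int:
    """Smith-Waterman score with a linear gap cost."""
    prev, best = [0] * (len(b) + 1), 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(max(0, curr[j - 1] + sigma, prev[j] + sigma,
                            prev[j - 1] + delta(x, y)))
        best = max(best, max(curr))
        prev = curr
    return best


whole = local_score("AAHGLAA", "HGL")                  # alignment found in full
pieces = max(local_score(p, "HGL") for p in chop("AAHGLAA", 4))
print(whole, pieces)  # the chopped score is lower: the match spans the split
```

Here the query matches a region that straddles the split point, so no single piece reaches the score of the unsplit sequence, and the sequence could drop out of a thresholded result.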

Experimental Results
- UniProt DB
  - max. sequence length 3000 (99.9% of UniProt)
  - random subsets: 1,000, 5,000, …
  - semantically motivated subsets: archaea, bacteria, fungi, human, invertebrates, mammals, plants, rodents, vertebrates, viruses
- Testing of
  - prefix ratio of
    - the basic solution
    - reversed sequences
    - chopped sequences

Experiments - Prefix Ratio of Random Subsets and Taxonomic Divisions

Experiments – Reversed Sequences
[Figure legend: after the building stage; after the shifting stage; without reversed sequences]

Experiments – Chopped Sequences

Conclusion
- We have proposed
  - a simple method for speeding up the database search of protein sequences by using common prefixes and suffixes
    - easy implementation with current methods
  - rigorous and non-rigorous versions of the algorithm
- We implemented
  - a modification of the Smith-Waterman algorithm
- Experimental results
  - we have shown up to 20% speed-up with the rigorous version of the algorithm