Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.


Sequence Alignment
Motivation:
- Storing, retrieving and comparing DNA sequences in databases.
- Comparing two or more sequences for similarities.
- Searching databases for related sequences and subsequences.
- Exploring frequently occurring patterns of nucleotides.
- Finding informative elements in protein and DNA sequences.
- Various experimental applications (reconstruction of DNA, etc.).

Seq. Align. and Protein Function
Similar sequences produce similar proteins. If Protein1 and Protein2 share more than 25% sequence identity, do they have a similar 3D structure? A similar function?

Exact Pattern Matching
Given a pattern P of length m and a (longer) string T of length n, find all occurrences of P in T.
- Naïve algorithm: O(m*n)
- Boyer-Moore, Knuth-Morris-Pratt: O(n+m)
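As an illustration of the O(n+m) bound, here is a minimal sketch of Knuth-Morris-Pratt search. The function name `kmp_search` and the variable names are illustrative choices, not taken from the slides.

```python
def kmp_search(pattern, text):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    # failure[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of pattern[:i+1].
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = failure[k - 1]  # allow overlapping occurrences
    return hits
```

The text pointer never moves backwards, so the scan costs O(n) after O(m) preprocessing of the pattern.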

Alignment - inexact matching
- Substitution: replacing a sequence base by another.
- Insertion: inserting a base (letter) or several bases into the sequence.
- Deletion: deleting a base (or more) from the sequence.
(Insertion and deletion are the reverse of one another.)

Seq. Align. Score
Commonly used matrices: PAM250, BLOSUM62

Global Alignment INPUT: Two sequences S and T of roughly the same length. QUESTION: What is the maximum similarity between them? Find one of the best alignments.

The IDEA
Given s[1…n] and t[1…m], to align s[1…i] with t[1…j] we have three choices:
* align s[1…i-1] with t[1…j-1] and match s[i] with t[j]
* align s[1…i] with t[1…j-1] and match a space with t[j]
* align s[1…i-1] with t[1…j] and match s[i] with a space

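Recursive Relation (Needleman-Wunsch, 1970)
Define: scoring matrix $m(a,b)$ for $a,b \in \Sigma \cup \{-\}$.
Define: $H_{i,j}$, the best score of an alignment between $s[1 \dots i]$ and $t[1 \dots j]$.
For $1 \le i \le n$, $1 \le j \le m$:
$$H_{i,j} = \max \begin{cases} H_{i-1,j-1} + m(s_i, t_j) \\ H_{i,j-1} + m(-, t_j) \\ H_{i-1,j} + m(s_i, -) \end{cases}$$
Boundary conditions: $H_{0,0} = 0$, $H_{i,0} = \sum_{k=1}^{i} m(s_k, -)$, $H_{0,j} = \sum_{k=1}^{j} m(-, t_k)$.
Optimal alignment score $= H_{n,m}$.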

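[Figure: DP table for s and t showing the three possible moves into a cell: s3 aligned with t4, s3 aligned with a space, and a space aligned with t4.]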
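The global-alignment recurrence can be sketched in a few lines. This is a minimal score-only version, assuming a simple match/mismatch/gap scheme in place of a full scoring matrix m(a,b). The function name and default values are illustrative, not from the slides.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Fill the (n+1) x (m+1) table H and return H[n][m]."""
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: a prefix aligned against nothing costs one gap per symbol.
    for i in range(1, n + 1):
        H[i][0] = H[i - 1][0] + gap
    for j in range(1, m + 1):
        H[0][j] = H[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                H[i - 1][j - 1] + sub,  # s[i] matched with t[j]
                H[i][j - 1] + gap,      # space matched with t[j]
                H[i - 1][j] + gap,      # s[i] matched with a space
            )
    return H[n][m]
```

Recovering one of the best alignments, rather than just its score, would additionally require a traceback from H[n][m] to H[0][0], which is omitted here.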

Local Alignment
INPUT: Two sequences S and T.
QUESTION: What is the maximum similarity between a substring (contiguous subsequence) of S and a substring of T? Find the most similar substrings.

Recursive Relation (Smith-Waterman, 1981)
For $1 \le i \le n$, $1 \le j \le m$:
$$H_{i,j} = \max \begin{cases} H_{i-1,j-1} + m(s_i, t_j) \\ H_{i,j-1} + m(-, t_j) \\ H_{i-1,j} + m(s_i, -) \\ 0 \end{cases}$$
Boundary conditions: $H_{i,0} = 0$, $H_{0,j} = 0$.
Optimal alignment score $= \max_{i,j} H_{i,j}$.
*Penalties should be negative*
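A minimal score-only sketch of the local recurrence follows, again assuming a simple match/mismatch/gap scheme instead of a full scoring matrix. The names and default values are illustrative.

```python
def local_alignment_score(s, t, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score, i.e. the maximum over all H[i][j]."""
    n, m = len(s), len(t)
    # First row and first column stay 0: an alignment may start anywhere.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a fresh alignment here
                H[i - 1][j - 1] + sub,
                H[i][j - 1] + gap,
                H[i - 1][j] + gap,
            )
            best = max(best, H[i][j])
    return best
```

The extra 0 option is what lets a poor prefix be discarded. It only works as intended when mismatch and gap scores are negative, as the slide notes.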

Sequence Alignment Complexity
Time: O(n*m)
Space: O(n*m) (an algorithm using O(min(n,m)) space exists: Hirschberg's algorithm)
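To see why linear space is enough for the score, note that each row of the table depends only on the previous row. The sketch below keeps two rows and assumes a symmetric match/mismatch/gap scheme; names and defaults are illustrative. It returns only the score. Recovering the alignment itself in linear space needs a divide-and-conquer scheme such as Hirschberg's, which is not shown.

```python
def global_score_linear_space(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment score using O(min(n, m)) memory."""
    # The scoring scheme is symmetric, so we may swap the sequences
    # to keep the shorter one along the row.
    if len(t) > len(s):
        s, t = t, s
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, curr[j - 1] + gap, prev[j] + gap)
        prev = curr
    return prev[len(t)]
```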

Ends-free alignment
INPUT: Two sequences S and T (possibly of different length).
QUESTION: Find one of the best alignments between subsequences of S and T when at least one of these subsequences is a prefix of the original sequence and one (not necessarily the other) is a suffix.

Gap Alignment
Definition: A gap is a maximal contiguous run of spaces in a single sequence within a given alignment. The length of a gap is the number of indel operations in it. A gap penalty function measures the cost of a gap as a (possibly nonlinear) function of its length.
Gap penalty
INPUT: Two sequences S and T (possibly of different length).
QUESTION: Find one of the best alignments between the two sequences using the gap penalty function.
Affine gap: $W_{total} = W_g + q W_s$, where $W_g$ is the weight to open the gap, $W_s$ is the weight to extend the gap, and $q$ is the length of the gap.
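The affine formula can be applied to one row of a finished alignment by treating each maximal run of '-' as a single gap. The sketch below assumes negative weights (penalties). The function names and default weights are illustrative, not from the slides.

```python
import re


def affine_gap_cost(length, w_open, w_extend):
    """W_total = W_g + q * W_s for a single gap of length q."""
    return w_open + length * w_extend


def total_gap_cost(aligned_row, w_open=-10, w_extend=-1):
    """Sum the affine cost over every maximal run of '-' in one aligned row."""
    return sum(
        affine_gap_cost(len(run), w_open, w_extend)
        for run in re.findall(r"-+", aligned_row)
    )
```

With these weights one gap of length 3 costs -13, while three separate gaps of length 1 cost -33, so the affine model favours fewer, longer gaps.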

What Kind of Alignment to Use? The same protein from the different organisms. Two different proteins sharing the same function. Protein domain against a database of complete proteins. Protein against a database of small patterns (functional units)...

Sequence Alignment vs. Database Task: Given a query sequence and millions of database records, find the optimal alignment between the query and a record ACTTTTGGTGACTGTAC

Sequence Alignment vs. Database Tool: Given two sequences, there exists an algorithm to find the best alignment. Naïve Solution: Apply algorithm to each of the records, one by one

Sequence Alignment vs. Database
Problem: An exact algorithm is too slow to run millions of times (even a linear-time algorithm will run slowly on a huge DB).
Solution:
- Run in parallel (expensive).
- Use a fast (heuristic) method to discard irrelevant records, then apply the exact algorithm to the remaining few.

Sequence Alignment vs. Database
General Strategy of Heuristic Algorithms:
- Homologous sequences are expected to contain at least some short ungapped segments (possibly with substitutions, but without ins/dels).
- Preprocess the DB into a fast-access data structure of short segments.

FASTA Idea
A good alignment probably matches some identical ‘words’ (ktups).
Example (matching words of size 4):
Database record: ACTTGTAGATACAAAATGTG
Aligned query sequence: A-TTGTCG-TACAA-ATCTGT

Dictionaries of Words
ACTTGTAGATAC is translated to the dictionary: ACTT, CTTG, TTGT, TGTA, …
Dictionaries of well-aligned sequences share words.
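A word dictionary of this kind can be sketched as a mapping from each word to the positions where it starts. The function name `word_dictionary` and the 0-based positions are illustrative choices.

```python
from collections import defaultdict


def word_dictionary(seq, k=4):
    """Map every length-k word of seq to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return dict(index)
```

For the slide's sequence ACTTGTAGATAC the first keys are ACTT, CTTG, TTGT, TGTA, in the order the words appear.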

FASTA Stage I
Prepare a dictionary for each DB sequence (in advance).
Upon query:
- Prepare a dictionary for the query sequence.
- For each DB record:
  - Find matching words.
  - Search for long diagonal runs of matching words.
  - Init-1 score: longest run.
  - Discard the record if the score is low.
[Figure: dot plot with * = matching word; axes: position in query, position in DB record.]
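The search for diagonal runs can be sketched as follows. A word matching at query position i and record position j lies on diagonal j - i, and a run is a series of matching words at consecutive positions on one diagonal. This is a simplification: it counts consecutive exact word matches only, whereas a real init-1 score would also weight the run. The function name is illustrative.

```python
from collections import defaultdict


def longest_diagonal_run(query, record, k=4):
    """Length (in words) of the longest run of matching k-words on one diagonal."""
    # Dictionary of the record: word -> start positions.
    index = defaultdict(list)
    for j in range(len(record) - k + 1):
        index[record[j:j + k]].append(j)

    # Group the query positions of all matching words by diagonal j - i.
    diagonals = defaultdict(set)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            diagonals[j - i].add(i)

    best = 0
    for hits in diagonals.values():
        for i in hits:
            if i - 1 not in hits:  # i is the first word of a run
                length = 1
                while i + length in hits:
                    length += 1
                best = max(best, length)
    return best
```

A record whose longest run falls below a chosen cutoff would be discarded at this stage.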

FASTA Stage II
- A good alignment is a path through many runs, with short connections.
- Assign weights to runs (+) and connections (-).
- Find a path of maximum weight.
- Init-n score: total path weight.
- Discard the record if the score is low.

FASTA Stage III
- Improve Init-1: apply an exact algorithm around the Init-1 diagonal within a band of a given width.
- Opt-score: the new weight.
- Discard the record if the score is low.
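Restricting an exact algorithm to a band can be sketched as a DP that only fills cells close to a diagonal. For simplicity this sketch assumes the band is centred on the main diagonal (|i - j| <= band) and uses a global match/mismatch/gap score. In Stage III the band would instead be centred on the Init-1 diagonal. Names and defaults are illustrative.

```python
def banded_global_score(s, t, band, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells with |i - j| <= band."""
    NEG = float("-inf")  # marks cells outside the band as unreachable
    n, m = len(s), len(t)
    H = [[NEG] * (m + 1) for _ in range(n + 1)]
    H[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                sub = match if s[i - 1] == t[j - 1] else mismatch
                best = max(best, H[i - 1][j - 1] + sub)
            if j > 0:
                best = max(best, H[i][j - 1] + gap)
            if i > 0:
                best = max(best, H[i - 1][j] + gap)
            H[i][j] = best
    return H[n][m]
```

Only about (2 * band + 1) * n cells are computed, which is where the speed-up over the full table comes from. If the lengths differ by more than the band, the end cell is unreachable and the result is -inf.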

FASTA final stage Apply an exact algorithm to surviving records, computing the final alignment score.

BLAST (Basic Local Alignment Search Tool)
Approximate Matches
BLAST: words are allowed to match inexactly.
Example: in the polypeptide sequence IHAVEADREAM, the 4-long word HAVE starting at position 2 may match HAVE, RAVE, HIVE, HALE, …

Approximate Matches
For each word of length w from a database, generate all similar words.
‘Similar’ means: score(word, word’) > T.
Store all similar words in a look-up table.
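Generating the similar words can be sketched by brute force: enumerate every candidate of the same length and keep those scoring above T. The scoring function `toy_score` is a hypothetical stand-in for a real substitution matrix, and enumeration over the whole alphabet is only practical for short words.

```python
from itertools import product


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix entry."""
    return 2 if a == b else -1


def neighborhood(word, alphabet, score, threshold):
    """All words w' over alphabet with score(word, w') > threshold."""
    similar = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total > threshold:
            similar.append("".join(candidate))
    return similar
```

With `toy_score` and T = 2, the word ACG keeps itself and every word that differs from it in exactly one position.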

DB search
1) For each word of length w from the query sequence, generate all similar words.
2) Access the DB.
3) Extend each hit as far as possible to obtain a High-scoring Segment Pair (HSP) with score(HSP) > V.
Example sequences:
THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEWASNINETEEN
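Extending a hit without gaps can be sketched as walking left and right from the seed and remembering the best-scoring extension in each direction. The stopping rule is an assumption of this sketch, not something the slides specify: the walk gives up once the running score falls more than `x_drop` below its best. The match/mismatch scores are likewise illustrative.

```python
def extend_hit(query, record, qpos, rpos, w, match=1, mismatch=-1, x_drop=2):
    """Extend a length-w seed at (qpos, rpos) in both directions, ungapped.

    Returns (score, query_start, record_start, length) of the segment pair.
    """
    def score(a, b):
        return match if a == b else mismatch

    def extend(step, q, r):
        best = run = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= r < len(record):
            run += score(query[q], record[r])
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > x_drop:  # assumed stopping rule
                break
            q += step
            r += step
        return best, best_len

    seed = sum(score(query[qpos + k], record[rpos + k]) for k in range(w))
    right, right_len = extend(1, qpos + w, rpos + w)
    left, left_len = extend(-1, qpos - 1, rpos - 1)
    return (seed + left + right, qpos - left_len, rpos - left_len,
            left_len + w + right_len)
```

A caller would keep the segment pair only if its score exceeds the cutoff V.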

DB search (2)
4) Perform DP around the HSP, starting from the seed pair. At each step the alignment score should be > T.
[Figure: s-query against s-db, with the starting point (seed pair) marked.]

Homework
Write a CGI script (using Perl) that performs pairwise Local/Global Alignment for DNA sequences. All I/O is via HTML only.
Input:
1. Choice of Local/Global alignment.
2. Two sequences (text boxes).
3. Values for match, mismatch, ins/dels.
4. Number of iterations for computing random scores.
Output:
1. Alignment score.
2. z-score value: z = (score - average) / standard deviation.
Remarks:
1) You are allowed to use only linear space.
2) To compute the z-score, perform random shuffling:
   srand(time| $$); # init; $$ is the process id
   int(rand($i)); # returns a random integer in [0, $i-1]
3) Shuffling is done in non-overlapping windows of 10 bases. The number of shuffles for each window is random in [0,10].