BNFO 602 Lecture 2 Usman Roshan.

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Sequence Alignment.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
BNFO 602 Multiple sequence alignment Usman Roshan.
CIS786, Lecture 7 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.
Lecture 1 BNFO 601 Usman Roshan. Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Dynamic programming.
Heuristic alignment algorithms and cost matrices
1-month Practical Course Genome Analysis (Integrative Bioinformatics & Genomics) Lecture 3: Pair-wise alignment Centre for Integrative Bioinformatics VU.
Expected accuracy sequence alignment
Lecture 1 BNFO 240 Usman Roshan. Course overview Perl progamming language (and some Unix basics) Sequence alignment problem –Algorithm for exact pairwise.
BNFO 602, Lecture 2 Usman Roshan Some of the slides are based upon material by David Wishart of University.
BNFO 602 Lecture 2 Usman Roshan. Sequence Alignment Widely used in bioinformatics Proteins and genes are of different lengths due to error in sequencing.
BNFO 240 Usman Roshan. Last time Traceback for alignment How to select the gap penalties? Benchmark alignments –Structural superimposition –BAliBASE.
BNFO 602, Lecture 3 Usman Roshan Some of the slides are based upon material by David Wishart of University.
Introduction to bioinformatics
Lecture 1 BNFO 601 Usman Roshan. Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Sequence alignment.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Lecture 4 BNFO 235 Usman Roshan. IUPAC Nucleic Acid symbols.
Sequence Alignment III CIS 667 February 10, 2004.
BNFO 602 Lecture 2 Usman Roshan. Bioinformatics problems Sequence alignment: oldest and still actively studied Genome-wide association studies: new problem,
BNFO 602 Multiple sequence alignment Usman Roshan.
BNFO 235 Lecture 5 Usman Roshan. What we have done to date Basic Perl –Data types: numbers, strings, arrays, and hashes –Control structures: If-else,
BNFO 136 Sequence alignment Usman Roshan. Pairwise alignment X: ACA, Y: GACAT Match=8, mismatch=2, gap-5 ACA---ACA---ACAACA---- GACATGACATGACATG--ACAT.
Finding the optimal pairwise alignment We are interested in finding the alignment of two sequences that maximizes the similarity score given an arbitrary.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Pairwise alignment Computational Genomics and Proteomics.
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Developing Pairwise Sequence Alignment Algorithms
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU.
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Expected accuracy sequence alignment Usman Roshan.
A Table-Driven, Full-Sensitivity Similarity Search Algorithm Gene Myers and Richard Durbin Presented by Wang, Jia-Nan and Huang, Yu- Feng.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Construction of Substitution matrices
Dynamic programming with more complex models When gaps do occur, they are often longer than one residue.(biology) We can still use all the dynamic programming.
Expected accuracy sequence alignment Usman Roshan.
Local alignment and BLAST Usman Roshan BNFO 601. Local alignment Global alignment recursions: Local alignment recursions.
Evaluation of protein alignments Usman Roshan BNFO 236.
Lecture 1 BNFO 601 Usman Roshan. Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Dynamic programming.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
BLAST BNFO 236 Usman Roshan. BLAST Local pairwise alignment heuristic Faster than standard pairwise alignment programs such as SSEARCH, but less sensitive.
Pairwise Sequence Alignment and Database Searching
Lecture 1 BNFO 601 Usman Roshan.
The ideal approach is simultaneous alignment and tree estimation.
Sequence comparison: Local alignment
Biology 162 Computational Genetics Todd Vision Fall Aug 2004
LSM3241: Bioinformatics and Biocomputing Lecture 4: Sequence analysis methods revisited Prof. Chen Yu Zong Tel:
Local alignment and BLAST
BNFO 136 Sequence alignment
BNFO 236 Smith Waterman alignment
BNFO 602 Lecture 2 Usman Roshan.
Intro to Alignment Algorithms: Global and Local
Affine gaps for sequence alignment
Sequence Alignment Algorithms Morten Nielsen BioSys, DTU
Sequence alignment with Needleman-Wunsch
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

BNFO 602 Lecture 2 Usman Roshan

DNA Sequence Evolution -3 mil yrs -2 mil yrs -1 mil yrs today TAGGCCTT (Human) TAGCCCTTA (Monkey) A_C_CTT (Cat) A_CACTTC (Lion) _G_GCTT (Mouse) _GGGCTT TAGACCTT A_CACTT AAGGCTT T_GACTT AAGACTT AAGACTT T_GACTT AAGGCTT _GGGCTT TAGACCTT A_CACTT ACCTT (Cat) ACACTTC (Lion) TAGCCCTTA (Monkey) TAGGCCTT (Human) GGCTT (Mouse) _GGGCTT TAGACCTT A_CACTT AAGGCTT T_GACTT AAGACTT T_GACTT AAGGCTT AAGACTT AAGACTT

Sequence alignments They tell us about Function or activity of a new gene/protein Structure or shape of a new protein Location or preferred location of a protein Stability of a gene or protein Origin of a gene or protein Origin or phylogeny of an organelle Origin or phylogeny of an organism And more…

Pairwise alignment X: ACA, Y: GACAT Match=8, mismatch=2, gap-5 ACA-- -ACA- --ACA ACA---- GACAT GACAT GACAT G--ACAT 8+2+2-5-5 -5+8+8+8-5 -5-5+2+2+2 2-5-5-5-5-5-5 Score = 2 14 -4 -28

Optimal alignment An alignment can be specified by the traceback matrix. How do we determine the traceback for the highest scoring alignment? Needleman-Wunsch algorithm for global alignment First proposed in 1970 Widely used in genomics/bioinformatics Dynamic programming algorithm

Needleman-Wunsch Input: X = x1x2…xn, Y=y1y2…ym (X is seq2 and Y is seq1) Define V to be a two dimensional matrix with len(X)+1 rows and len(Y)+1 columns Let V[i][j] be the score of the optimal alignment of X1…i and Y1…j. Let m be the match cost, mm be mismatch, and g be the gap cost.

Dynamic programming Initialization: for i = 1 to len(seq2) { V[i][0] = i*g; } For i = 1 to len(seq1) { V[0][i] = i*g; } Recurrence: for i = 1 to len(seq2){ for j = 1 to len(seq1){ V[i-1][j-1] + m(or mm) V[i][j] = max { V[i-1][j] + g V[i][j-1] + g if(maximum is V[i-1][j-1] + m(or mm)) then T[i][j] = ‘D’ else if (maximum is V[i-1][j] + g) then T[i][j] = ‘U’ else then T[i][j] = ‘L’ }

Example V G A C A T A C T Input: seq2: ACA seq1: GACAT m = 5 mm = -4 gap = -20 seq2 is lined along the rows and seq2 is along the columns G A C A T -20 -40 -60 -80 -100 -4 -15 -35 -55 -75 -24 -8 -10 -30 -50 -44 -19 -12 -5 -25 A C T L U D

Affine gap penalties Affine gap model allows for long insertions in distant proteins by charging a lower penalty for extension gaps. We define g as the gap open penalty (first gap) and e as the gap extension penalty (additional gaps) Alignment: ACACCCT ACACCCC ACCT T AC CTT Score = 0 Score = 0.9 Trivial cost matrix: match=+1, mismatch=0, gapopen=-2, gapextension=-0.1

Affine penalty recurrence M(i,j) denotes alignments of x1..i and y1..j ending with a match/mismatch. E(i,j) denotes alignments of x1..i and y1..j such that yj is paired with a gap. F(i,j) defined similarly. Recursion takes O(mn) time where m and n are lengths of x and y respectively.

How do we pick gap parameters?

Structural alignments Recall that proteins have 3-D structure.

Structural alignment - example 1 Alignment of thioredoxins from human and fly taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals. PDB ids are 3TRX and 1XWC.

Structural alignment - example 2 Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively. Computer generated aligned proteins

Structural alignments We can produce high quality manual alignments by hand if the structure is available. These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

Benchmark alignments Protein alignment benchmarks BAliBASE, SABMARK, PREFAB, HOMSTRAD are frequently used in studies for protein alignment. Proteins benchmarks are generally large and have been in the research community for sometime now. BAliBASE 3.0

Biologically realistic scoring matrices PAM and BLOSUM are most popular PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

PAM We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j Examine a set of closely related sequences which are easy to align---for PAM 1572 mutations between 71 families Compute probabilities of change and background probabilities by simple counting

Expected accuracy alignment The dynamic programming formulation allows us to find the optimal alignment defined by a scoring matrix and gap penalties. This may not necessarily be the most “accurate” or biologically informative. We now look at a different formulation of alignment that allows us to compute the most accurate one instead of the optimal one.

Posterior probability of xi aligned to yj Let A be the set of all alignments of sequences x and y, and define P(a|x,y) to be the probability that alignment a (of x and y) is the true alignment a*. We define the posterior probability of the ith residue of x (xi) aligning to the jth residue of y (yj) in the true alignment (a*) of x and y as Do et. al., Genome Research, 2005

Expected accuracy of alignment We can define the expected accuracy of an alignment a as The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm Do et. al., Genome Research, 2005

Example for expected accuracy True alignment AC_CG ACCCA Expected accuracy=(1+1+0+1+1)/4=1 Estimated alignment ACC_G Expected accuracy=(1+1+0.1+0+1)/4 ~ 0.75

Estimating posterior probabilities If correct posterior probabilities can be computed then we can compute the correct alignment. Now it remains to estimate these probabilities from the data PROBCONS (Do et. al., Genome Research 2006): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et. al. 1998) Probalign (Roshan and Livesay, Bioinformatics 2006): estimate probabilities using partition function dynamic programming matrices

Example

Local alignment Global alignment recurrence: Local alignment recurrence

Local alignment traceback Let T(i,j) be the traceback matrices and m and n be length of input sequences. Global alignment traceback: Begin from T(m,n) and stop at T(0,0). Local alignment traceback: Find i*,j* such that T(i*,j*) is the maximum over all T(i,j). Begin traceback from T(i*,j*) and stop when T(i,j) <= 0.

BLAST Local pairwise alignment heuristic Faster than standard pairwise alignment programs such as SSEARCH, but less sensitive. Online server: http://www.ncbi.nlm.nih.gov/blast

BLAST Given a query q and a target sequence, find substrings of length k (k-mers) of score at least t --- also called hits. k is normally 3 to 5 for amino acids and 12 for nucleotides. Extend each hit to a locally maximal segment. Terminate the extension when the reduction in score exceeds a pre-defined threshold Report maximal segments above score S.

Finding k-mers quickly Preprocess the database of sequences: For each sequence in the database store all k-mers in hash-table. This takes linear time Query sequence: For each k-mer in the query sequence look up the hash table of the target to see if it exists Also takes linear time