Pairwise alignment Computational Genomics and Proteomics.

Presentation transcript:

Pairwise alignment Computational Genomics and Proteomics

Genetic code

Searching for similarities
• What is the function of the new gene?
• The "lazy" investigation:
  – Find the set of similar proteins
  – Identify similarities and differences
  – For long proteins: identify domains

Is similarity really interesting?

• Common ancestry is more interesting: it makes it more likely that genes share the same function
• Homology: sharing a common ancestor
  – A binary property (yes/no)
  – It's a nice tool: when a known gene G is homologous to an unknown gene X, we gain a lot of information about X
[Figure: tree with common ancestor Z and descendants X and Y]

How to determine similarity
Frequent evolutionary events:
1. Substitution
2. Insertion, deletion
3. Duplication
4. Inversion
We'll use only substitution, insertion and deletion (events 1 and 2).
[Figure: evolution at work. The common ancestor Z is usually extinct; its descendants X and Y are available]

Alignment - the problem
Operations:
  – Substitution
  – Insertion
  – Deletion

    GTACT--C
    GGA-TGAC

    algorithmsforgenomes
    ||||||||||
    algorithms4genomes

    algorithmsforgenomes
                 |||||||
      algorithms4genomes

    algorithmsforgenomes
    ||||||||||?  |||||||
    algorithms4--genomes

Many ways to align
• Which alignment is better?
  – Reflecting evolution ≈
  – Most probable ≈
  – Simplest

    1) GTA     2) GT-A     3) GTA
       GCA        G-CA        GCA

Scoring
• Should give reasonable alignments
• And have to assign scores to:
  – Substitution (or match/mismatch)
  – Gap penalty
      Linear: g(k) = αk
      Affine: g(k) = β + αk
      Concave, e.g.: g(k) = log(k)
• The score for an alignment is the sum of the scores of all columns

    GTA--     G-T-A     GTACT--C
    --ATG     -ATG-     GGA-TGAC
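The column-sum rule can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and the values +1 (match), -1 (mismatch) and -2 (per gap column, i.e. a linear gap penalty) are assumptions chosen for the example, not values fixed by this slide.

```python
def score_alignment(row_x, row_y, match=1, mismatch=-1, gap=-2):
    """Sum the per-column scores of a gapped pairwise alignment.

    The default scores are illustrative assumptions; `gap` is the score of
    one gap column, so a gap of length k contributes k * gap (linear model).
    """
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap        # column containing a gap
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # substitution
    return total


print(score_alignment("GTACT--C", "GGA-TGAC"))
```

Under these assumed scores the third example alignment has four matches, one mismatch and three gap columns.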

Substitution matrices
• Define a score for match/mismatch of letters
• DNA
  – Simple: [matrix not shown]
  – Used in genome alignments: [matrix not shown]

Substitution matrices for amino acids
Amino acids are not equal:
1. Some are easily substituted because they are similar in:
   – biochemical properties
   – structure
2. Some mutations occur more often due to similar codons
The two above give us substitution matrices.
[Figure legend: orange: nonpolar and hydrophobic; green: polar and hydrophilic; magenta box: acidic; light blue box: basic]

BLOSUM 62 substitution matrix

Constant vs. affine gap scoring

Alignment 1:
    Seq1     G   T   A   -   -
    Seq2     -   -   A   T   G
    Const   -2  -2   1  -2  -2   (SUM = -7)
    Affine   …                   (SUM = -7)

Alignment 2:
    Seq1     G   -   T   -   A
    Seq2     -   A   T   G   -
    Const   -2  -2   1  -2  -2   (SUM = -7)
    Affine  -3  -3   1  -3  -3   (SUM = -11)

    Gap scoring   intro   extension
    Constant       -2       -2
    Affine         -3       …

…and +1 for match
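The comparison can be checked with a small Python sketch. The function name `score_with_gaps` is hypothetical, and the affine extension score of -1 is an assumption: it is the value that, together with an introduction score of -3 and +1 per match, reproduces the sums -7 and -11 quoted above. A gap run is counted separately for each sequence, so a gap in Seq1 directly followed by a gap in Seq2 opens a new gap.

```python
def score_with_gaps(row_x, row_y, gap_open, gap_extend, match=1, mismatch=-1):
    """Score an alignment where the first column of a gap run costs
    gap_open and every further column of the same run costs gap_extend."""
    total = 0
    prev_gap = None  # row ("x" or "y") holding the gap in the previous column
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            side = "x" if a == "-" else "y"
            # Extend only if the previous column had a gap in the same row.
            total += gap_extend if side == prev_gap else gap_open
            prev_gap = side
        else:
            total += match if a == b else mismatch
            prev_gap = None
    return total


for rows in (("GTA--", "--ATG"), ("G-T-A", "-ATG-")):
    constant = score_with_gaps(*rows, gap_open=-2, gap_extend=-2)
    affine = score_with_gaps(*rows, gap_open=-3, gap_extend=-1)  # -1 is assumed
    print(rows, "constant:", constant, "affine:", affine)
```

Constant scoring cannot tell the two alignments apart, while affine scoring prefers the one with fewer, longer gaps.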

Global alignment. The Needleman-Wunsch algorithm
• Goal: find the maximal scoring alignment
• Scores: m match, -s mismatch, -g for insertion/deletion
• Dynamic programming
  – Solve smaller subproblem(s)
  – Iteratively extend the solution
• The score of the best alignment of X[1…i] and Y[1…j] is called M[i, j]

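The algorithm
• Goal: find the maximal scoring alignment
• Scores: m match, -s mismatch, -g for insertion/deletion
• The score of the best alignment of X[1…i] and Y[1…j] is called M[i, j]
• 3 ways to extend the alignment:
  1. X[i] aligned with Y[j], after the best alignment of X[1…i-1] and Y[1…j-1]: M[i-1, j-1] + m (match) or M[i-1, j-1] - s (mismatch)
  2. Y[j] aligned with a gap, after the best alignment of X[1…i] and Y[1…j-1]: M[i, j-1] - g
  3. X[i] aligned with a gap, after the best alignment of X[1…i-1] and Y[1…j]: M[i-1, j] - g
• M[i, j] is the maximum of these three values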

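The algorithm – final equation

$$
M[i,j] = \max \begin{cases}
M[i,\,j-1] - g \\
M[i-1,\,j] - g \\
M[i-1,\,j-1] + \mathrm{score}(X[i], Y[j])
\end{cases}
$$

Corresponds to (last column of the alignment):
  – M[i, j-1] - g: Y[j] aligned with a gap (X 1…X i - over Y 1…Y j-1 Y j)
  – M[i-1, j] - g: X[i] aligned with a gap (X 1…X i-1 X i over Y 1…Y j -)
  – M[i-1, j-1] + score(X[i], Y[j]): X[i] aligned with Y[j] (X 1…X i-1 X i over Y 1…Y j-1 Y j)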

Example: global alignment of two sequences
• Align two DNA sequences:
  – GAGTGA
  – GAGGCGA (note the length difference)
• Parameters of the algorithm:
  – Match: score(A,A) = 1
  – Mismatch: score(A,T) = -1
  – Gap: g = 2 (each gap column scores -2)

$$
M[i,j] = \max \begin{cases}
M[i,\,j-1] - 2 \\
M[i-1,\,j] - 2 \\
M[i-1,\,j-1] \pm 1
\end{cases}
$$

The algorithm. Step 1: init
• Create the matrix
• Initiation
  – 0 at [0,0]
  – Apply the equation…

The algorithm. Step 1: init
• Initiation of the matrix:
  – 0 at pos [0,0]
  – Fill in the first row using the horizontal rule: M[0, j] = M[0, j-1] - 2
  – Fill in the first column using the vertical rule: M[i, 0] = M[i-1, 0] - 2

The algorithm. Step 2: fill in
• Continue filling in the matrix, remembering from which cell each result comes (arrows)

The algorithm. Step 2: fill in
• We are done…
• Where's the result?

The algorithm. Step 3: traceback
• Start at the last cell of the matrix
• Go against the direction of the arrows
• Sometimes the value may be obtained from more than one cell (which one?)

The algorithm. Step 3: traceback
• Extract the alignments

    a) GAGT-GA
       GAGGCGA

    b) GA-GTGA
       GAGGCGA
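The three steps (init, fill in, traceback) can be put together in a short Python sketch. It is a minimal illustration rather than a reference implementation: the function name `needleman_wunsch` is hypothetical, `gap` is the positive penalty g subtracted in the recurrence, and ties in the traceback are broken by an arbitrary choice (diagonal first, then up, then left), so only one of the equally good alignments is returned.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=2):
    """Global alignment score and one optimal alignment of x and y."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]

    # Step 1: init. First column and first row consist of gaps only.
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] - gap
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] - gap

    # Step 2: fill in, taking the best of the three ways to extend.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # X[i] aligned with Y[j]
                          M[i - 1][j] - gap,     # X[i] aligned with a gap
                          M[i][j - 1] - gap)     # Y[j] aligned with a gap

    # Step 3: traceback from the last cell, re-deriving the arrows.
    row_x, row_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:
                row_x.append(x[i - 1])
                row_y.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and M[i][j] == M[i - 1][j] - gap:
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return M[n][m], "".join(reversed(row_x)), "".join(reversed(row_y))


score, row_x, row_y = needleman_wunsch("GAGTGA", "GAGGCGA")
print(score)
print(row_x)
print(row_y)
```

For the example sequences the optimal score is 2: any optimal alignment has five matches, one mismatch and one gap, which is what alignments a) and b) show.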

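Affine gaps

$$
M[i,j] = \mathrm{score}(X[i], Y[j]) + \max \begin{cases}
M[i-1,\,j-1] \\
I_x[i-1,\,j-1] \\
I_y[i-1,\,j-1]
\end{cases}
$$

$$
I_x[i,j] = \max \begin{cases}
M[i,\,j-1] + g_o \\
I_x[i,\,j-1] + g_e
\end{cases}
$$

$$
I_y[i,j] = \max \begin{cases}
M[i-1,\,j] + g_o \\
I_y[i-1,\,j] + g_e
\end{cases}
$$

Penalties:
  – g_o: gap opening (e.g. -8)
  – g_e: gap extension (e.g. -2)
The three matrices correspond to the last column of the alignment:
  – M[i, j]: X[i] aligned with Y[j]
  – I_x[i, j]: Y[j] aligned with a gap
  – I_y[i, j]: X[i] aligned with a gap
Homework: think about the boundary values M[*,0], I[*,0], etc.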
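A minimal Python sketch of the three-matrix recurrence follows. The boundary values are left open above, so the ones used here are an assumption: a leading gap run of length k scores g_o + (k - 1) * g_e, and every cell that cannot be reached is set to minus infinity. The function name `affine_global_score` and the +1/-1 match/mismatch scores are also assumptions.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, gap_open=-8, gap_extend=-2):
    """Global alignment score with affine gaps (gap scores are negative)."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # X[i] aligned with Y[j]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # Y[j] aligned with a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # X[i] aligned with a gap

    # Assumed boundary values: only leading gap runs are possible.
    M[0][0] = 0
    for j in range(1, m + 1):
        Ix[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        Iy[i][0] = gap_open + (i - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(M[i][j - 1] + gap_open, Ix[i][j - 1] + gap_extend)
            Iy[i][j] = max(M[i - 1][j] + gap_open, Iy[i - 1][j] + gap_extend)

    # The alignment may end in any of the three states.
    return max(M[n][m], Ix[n][m], Iy[n][m])


print(affine_global_score("AACCTT", "AATT"))
print(affine_global_score("AACCTT", "AATT", gap_open=-3, gap_extend=-1))
```

In both calls the best alignment is AACCTT over AA--TT: four matches and a single gap of length two, scored as one opening plus one extension.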

Variation on global alignment
• Global alignment: the previous algorithm is called global alignment because it uses all letters from both sequences.

    CAGCACTTGGATTCTCGG
    CAGC-----G-T----GG

• Semi-global alignment: don't penalize start/end gaps (omit the start/end gaps of the sequences).

    CAGCA-CTTGGATTCTCGG
    ---CAGCGTGG

  – Applications of semi-global alignment:
    – Finding a gene in a genome
    – Placing a marker onto a chromosome
    – One sequence much longer than the other
  – Danger! Really bad alignments for divergent sequences

Semi-global alignment
• Ignore start gaps
  – First row/column set to 0
• Ignore end gaps
  – Read the result from the last row/column
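The two changes to the global algorithm can be sketched in Python as follows. The function name `semi_global_score` and the scores (+1 match, -1 mismatch, positive gap penalty 2) are assumptions for illustration. "Read the result from the last row/column" is interpreted here as taking the maximum over all cells of the last row and the last column.

```python
def semi_global_score(x, y, match=1, mismatch=-1, gap=2):
    """Best alignment score when start and end gaps are not penalized."""
    n, m = len(x), len(y)
    # Ignore start gaps: the first row and first column stay at 0.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,
                          M[i - 1][j] - gap,
                          M[i][j - 1] - gap)
    # Ignore end gaps: the alignment may stop anywhere in the last row/column.
    best_in_last_row = max(M[n])
    best_in_last_col = max(row[m] for row in M)
    return max(best_in_last_row, best_in_last_col)


print(semi_global_score("TTTGAGTTT", "GAG"))   # short sequence inside a long one
print(semi_global_score("AAAGGG", "GGGTTT"))   # end of x overlaps start of y
```

Both calls find the three matching letters without paying for the unaligned flanks, which a global alignment would penalize as gaps.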

• A heuristic
  – Heuristic: a rule of thumb that often helps in solving a certain class of problems, but makes no guarantees. Perkins, D. N. (1981), The Mind's Best Work.