Sequence Alignments.

Slides:



Advertisements
Similar presentations
CS4030: Bio-Computing Revision Lecture. DNA Replication Prior to cell division, all the genetic instructions must be copied so that each new cell will.
Advertisements

Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Global Sequence Alignment by Dynamic Programming.
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
Previous Lecture: Probability
1 ALIGNMENT OF NUCLEOTIDE & AMINO-ACID SEQUENCES.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Definitions Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. May or may not be biologically.
Sequence Alignments and Database Searches Introduction to Bioinformatics.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Introduction to Bioinformatics Burkhard Morgenstern Institute of Microbiology and Genetics Department of Bioinformatics Goldschmidtstr. 1 Göttingen, March.
Sequence Similarity Searching Class 4 March 2010.
Sequencing and Sequence Alignment
Summer Bioinformatics Workshop 2008 Sequence Alignments Chi-Cheng Lin, Ph.D. Associate Professor Department of Computer Science Winona State University.
C T C G T A GTCTGTCT Find the Best Alignment For These Two Sequences Score: Match = 1 Mismatch = 0 Gap = -1.
Introduction to Bioinformatics
Sequence Alignments Chi-Cheng Lin, Ph.D. Associate Professor Department of Computer Science Winona State University – Rochester Center
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Sequence Comparison Intragenic - self to self. -find internal repeating units. Intergenic -compare two different sequences. Dotplot - visual alignment.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright  1996, All rights reserved.
Sequence Alignments Introduction to Bioinformatics.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Pairwise alignment Computational Genomics and Proteomics.
Roadmap The topics:  basic concepts of molecular biology  more on Perl  overview of the field  biological databases and database searching  sequence.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Local alignment
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignments and Dynamic Programming BIO/CS 471 – Algorithms for Bioinformatics.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Pairwise & Multiple sequence alignments
Computational Biology, Part 3 Sequence Alignment Robert F. Murphy Copyright  1996, All rights reserved.
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Content of the previous class Introduction The evolutionary basis of sequence alignment The Modular Nature of proteins.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Click to edit Master subtitle style 12/22/10 Analyzing Sequences.
Construction of Substitution Matrices
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Arun Goja MITCON BIOPHARMA
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Last lecture summary. Sequence alignment What is sequence alignment Three flavors of sequence alignment Point mutations, indels.
Welcome to Introduction to Bioinformatics
Sequence comparison: Local alignment
Biology 162 Computational Genetics Todd Vision Fall Aug 2004
Pairwise sequence Alignment.
Pairwise Sequence Alignment
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Find the Best Alignment For These Two Sequences
Pairwise Alignment Global & local alignment
Dynamic Programming Finds the Best Score and the Corresponding Alignment O Alignment: Start in lower right corner and work backwards:
BIOINFORMATICS Sequence Comparison
Sequence alignment BI420 – Introduction to Bioinformatics
Basic Local Alignment Search Tool
Sequence Analysis Alan Christoffels
Presentation transcript:

Sequence Alignments

DNA Replication Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set DNA polymerase is the enzyme that copies DNA Reads the old strand in the 3´ to 5´ direction

Over time, genes accumulate mutations Environmental factors Radiation Oxidation Mistakes in replication or repair Deletions, Duplications Insertions Inversions Point mutations

Deletions Codon deletion: ACG ATA GCG TAT GTA TAG CCG… Effect depends on the protein, position, etc. Almost always deleterious Sometimes lethal Frame shift mutation: ACG ATA GCG TAT GTA TAG CCG… ACG ATA GCG ATG TAT AGC CG?… Almost always lethal

Why align sequences? The draft human genome is available Automated gene finding is possible Gene: AGTACGTATCGTATAGCGTAA What does it do? One approach: Is there a similar gene in another species? Align sequences with known genes Find the gene with the “best” match

Are there other sequences like this one? 1) Huge public databases - GenBank, Swissprot, etc. 2) Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes 3) Similarity searching is based on alignment 4) BLAST and FASTA provide rapid similarity searching a. rapid = approximate (heuristic) b. false + and - scores

Similarity ≠ Homology 1) 25% similarity ≥ 100 AAs is strong evidence for homology 2) Homology is an evolutionary statement which means “descent from a common ancestor” common 3D structure usually common function homology is all or nothing, you cannot say "50% homologous"

Global vs Local similarity 1) Global similarity uses complete aligned sequences - total % matches GCG GAP program, Needleman & Wunch algorithm 2) Local similarity looks for best internal matching region between 2 sequences GCG BESTFIT program, Smith-Waterman algorithm, BLAST and FASTA 3) dynamic programming optimal computer solution, not approximate

Search with Protein, not DNA Sequences 1) 4 DNA bases vs. 20 amino acids - less chance similarity 2) can have varying degrees of similarity between different AAs - # of mutations, chemical similarity, PAM matrix 3) protein databanks are much smaller than DNA databanks

Similarity is Based on Dot Plots 1) two sequences on vertical and horizontal axes of graph 2) put dots wherever there is a match 3) diagonal line is region of identity (local alignment) 4) apply a window filter - look at a group of bases, must meet % identity to get a dot

Simple Dot Plot

Dot plot filtered with 4 base window and 75% identity

Dot plot of real data

Indels Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known: ACGTCTGATACGCCGTATCGTCTATCT ACGTCTGAT---CCGTATCGTCTATCT

The Genetic Code Substitutions are mutations accepted by natural selection. Synonymous: CGC  CGA Non-synonymous: GAU  GAA

Comparing two sequences Point mutations, easy: ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGATTCGCCCTATCGTCTATCT Indels are difficult, must align sequences: ACGTCTGATACGCCGTATAGTCTATCT CTGATTCGCATCGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT ----CTGATTCGC---ATCGTCTATCT

Scoring a sequence alignment Match score: +1 Mismatch score: +0 Gap penalty: –1 ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Matches: 18 × (+1) Mismatches: 2 × 0 Gaps: 7 × (– 1) Score = +11

Origination and length penalties We want to find alignments that are evolutionarily likely. Which of the following alignments seems more likely to you? ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGAT-------ATAGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT AC-T-TGA--CG-CGT-TA-TCTATCT We can achieve this by penalizing more for a new gap, than for extending an existing gap  

Scoring a sequence alignment (2) Match/mismatch score: +1/+0 Origination/length penalty: –2/–1 ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Matches: 18 × (+1) Mismatches: 2 × 0 Origination: 2 × (–2) Length: 7 × (–1) Score = +7

Scoring Similarity 1) Can only score aligned sequences 2) DNA is usually scored as identical or not 3) modified scoring for gaps - single vs. multiple base gaps (gap extension) 4) AAs have varying degrees of similarity a. # of mutations to convert one to another b. chemical similarity c. observed mutation frequencies 5) PAM matrix calculated from observed mutations in protein families

DNA Scoring Matrix A T C G 1 A T C G 5 -4 A T C G 1 -5 -1 Identity A T C G 5 -4 A T C G 1 -5 -1 Identity BLAST Transition/Transversion

The PAM 250 scoring matrix

How can we find an optimal alignment? Finding the alignment is computationally hard: ACGTCTGATACGCCGTATAGTCTATCT CTGAT---TCG—CATCGTC--T-ATCT C(27,7) gap positions = ~888,000 possibilities It’s possible, as long as we don’t repeat our work! Dynamic programming: The Needleman & Wunsch algorithm

What is the optimal alignment? ACTCG ACAGTAG Match: +1 Mismatch: 0 Gap: –1

The dynamic programming concept Suppose we are aligning: ACTCG ACAGTAG Last position choices: G +1 ACTC G ACAGTA G -1 ACTC - ACAGTAG - -1 ACTCG G ACAGTA

We can use a table Suppose we are aligning: A with A… A -1 A -1

Needleman-Wunsch: Step 1 Each sequence along one axis Mismatch penalty multiples in first row/column 0 in [1,1]

Needleman-Wunsch: Step 2 Vertical/Horiz. move: Score + (simple) gap penalty Diagonal move: Score + match/mismatch score Take the MAX of the three possibilities

Needleman-Wunsch: Step 2 (cont’d) Fill out the rest of the table likewise…

Needleman-Wunsch: Step 2 (cont’d) Fill out the rest of the table likewise… The optimal alignment score is calculated in the lower-right corner

But what is the optimal alignment To reconstruct the optimal alignment, we must determine of where the MAX at each step came from…

A path corresponds to an alignment = GAP in top sequence = GAP in left sequence = ALIGN both positions One path from the previous table: Corresponding alignment (start at the end): AC--TCG ACAGTAG Score = +2

Semi-global alignment Suppose we are aligning: GCG GGCG Which do you prefer? G-CG -GCG GGCG GGCG Semi-global alignment allows gaps at the ends for free.

Semi-global alignment Semi-global alignment allows gaps at the ends for free. Initialize first row and column to all 0’s Allow free horizontal/vertical moves in last row and column

Local alignment Global alignments – score the entire alignment Semi-global alignments – allow unscored gaps at the beginning or end of either sequence Local alignment – find the best matching subsequence CGATG AAATGGA This is achieved by allowing a 4th alternative at each position in the table: zero, if alternative neg. Smith-Waterman Algorithm (1981).

Local alignment Mismatch = –1 this time CGATG AAATGGA

Practice Problem Find an optimal alignment for these two sequences: GCGGTT GCGT Match: +1 Mismatch: 0 Gap: –1