Presentation is loading. Please wait.

Presentation is loading. Please wait.

Sequence Alignments Introduction to Bioinformatics.

Similar presentations


Presentation on theme: "Sequence Alignments Introduction to Bioinformatics."— Presentation transcript:

1 Sequence Alignments Introduction to Bioinformatics

2 Intro to Bioinformatics – Sequence Alignment2 Sequence Alignments  Cornerstone of bioinformatics  What is a sequence? Nucleotide sequence Amino acid sequence  Pairwise and multiple sequence alignments We will focus on pairwise alignments  What alignments can help Determine function of a newly discovered gene sequence Determine evolutionary relationships among genes, proteins, and species Predicting structure and function of protein Acknowledgement: This notes is adapted from lecture notes of both Wright State University’s Bioinformatics Program and Professor Laurie Heyer of Davidson College with permission.

3 Intro to Bioinformatics – Sequence Alignment3 DNA Replication  Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set  DNA polymerase is the enzyme that copies DNA Reads the old strand in the 3´ to 5´ direction

4 Intro to Bioinformatics – Sequence Alignment4 Over time, genes accumulate mutations  Environmental factors Radiation Oxidation  Mistakes in replication or repair  Deletions, Duplications  Insertions, Inversions  Translocations  Point mutations

5 Intro to Bioinformatics – Sequence Alignment5  Codon deletion: ACG ATA GCG TAT GTA TAG CCG… Effect depends on the protein, position, etc. Almost always deleterious Sometimes lethal  Frame shift mutation: ACG ATA GCG TAT GTA TAG CCG… ACG ATA GCG ATG TAT AGC CG?… Almost always lethal Deletions

6 Intro to Bioinformatics – Sequence Alignment6 Indels  Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known: ACGTCTGATACGCCGTATCGTCTATCT ACGTCTGAT---CCGTATCGTCTATCT

7 Intro to Bioinformatics – Sequence Alignment7 The Genetic Code Substitutions Substitutions are mutations accepted by natural selection. Synonymous: CGC  CGA Non-synonymous: GAU  GAA

8 Intro to Bioinformatics – Sequence Alignment8 Comparing Two Sequences  Point mutations, easy: ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGATTCGCCCTATCGTCTATCT  Indels are difficult, must align sequences: ACGTCTGATACGCCGTATAGTCTATCT CTGATTCGCATCGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT ----CTGATTCGC---ATCGTCTATCT

9 Intro to Bioinformatics – Sequence Alignment9 Why Align Sequences?  The draft human genome is available  Automated gene finding is possible  Gene: AGTACGTATCGTATAGCGTAA What does it do?What does it do?  One approach: Is there a similar gene in another species? Align sequences with known genes Find the gene with the “best” match

10 Intro to Bioinformatics – Sequence Alignment10 Gaps or No Gaps  Examples

11 Intro to Bioinformatics – Sequence Alignment11 Scoring a Sequence Alignment  Match score:+1  Mismatch score:+0  Gap penalty:–1 ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT  Matches: 18 × (+1)  Mismatches: 2 × 0  Gaps: 7 × (– 1) Score = +11

12 Intro to Bioinformatics – Sequence Alignment12 Origination and Length Penalties  We want to find alignments that are evolutionarily likely.  Which of the following alignments seems more likely to you? ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGAT-------ATAGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT AC-T-TGA--CG-CGT-TA-TCTATCT  We can achieve this by penalizing more for a new gap, than for extending an existing gap  

13 Intro to Bioinformatics – Sequence Alignment13 Scoring a Sequence Alignment (2)  Match/mismatch score:+1/+0  Origination/length penalty:–2/–1 ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT  Matches: 18 × (+1)  Mismatches: 2 × 0  Origination: 2 × (–2)  Length: 7 × (–1) Score = +7

14 Intro to Bioinformatics – Sequence Alignment14 How can we find an optimal alignment?  Finding the alignment is computationally hard: ACGTCTGATACGCCGTATAGTCTATCT CTGAT---TCG-CATCGTC--T-ATCT  C(27,7) gap positions = ~888,000 possibilities  It’s possible, as long as we don’t repeat our work!  Dynamic programming: The Needleman & Wunsch algorithm

15 Intro to Bioinformatics – Sequence Alignment15 Dynamic Programming  Technique of solving optimization problems Find and memorize solutions for subproblems Use those solutions to build solutions for larger subproblems Continue until the final solution is found  Recursive computation of cost function in a non-recursive fashion

16 Intro to Bioinformatics – Sequence Alignment16 Global Sequence Alignment  Needleman-Wunsch algorithm  Suppose we are aligning: A with A … D i,j = max{ D i-1,j + d(A i, –), D i,j-1 + d(–, B j ), D i-1,j-1 + d(A i, B j ) } i-1 j-1 j i

17 Intro to Bioinformatics – Sequence Alignment17 Dynamic Programming (DP) Concept  Suppose we are aligning: CACGA CCGA

18 Intro to Bioinformatics – Sequence Alignment18 DP – Recursion Perspective  Suppose we are aligning: ACTCG ACAGTAG  Last position choices: G+1ACTC GACAGTA G-1ACTC -ACAGTAG --1ACTCG GACAGTA

19 Intro to Bioinformatics – Sequence Alignment19 What is the optimal alignment?  ACTCG ACAGTAG  Match: +1  Mismatch: 0  Gap: –1

20 Intro to Bioinformatics – Sequence Alignment20 Needleman-Wunsch: Step 1  Each sequence along one axis  Mismatch penalty multiples in first row/column  0 in [1,1] (or [0,0] for the CS-minded)

21 Intro to Bioinformatics – Sequence Alignment21 Needleman-Wunsch: Step 2  Vertical/Horiz. move: Score + (simple) gap penalty  Diagonal move: Score + match/mismatch score  Take the MAX of the three possibilities

22 Intro to Bioinformatics – Sequence Alignment22 Needleman-Wunsch: Step 2 (cont’d)  Fill out the rest of the table likewise…

23 Intro to Bioinformatics – Sequence Alignment23 Needleman-Wunsch: Step 2 (cont’d)  Fill out the rest of the table likewise…  The optimal alignment score is calculated in the lower-right corner

24 Intro to Bioinformatics – Sequence Alignment24 But what is the optimal alignment  To reconstruct the optimal alignment, we must determine of where the MAX at each step came from…

25 Intro to Bioinformatics – Sequence Alignment25 A path corresponds to an alignment  = GAP in top sequence  = GAP in left sequence  = ALIGN both positions  One path from the previous table:  Corresponding alignment (start at the end): AC--TCG ACAGTAG Score = +2

26 Intro to Bioinformatics – Sequence Alignment26 Algorithm Analysis  Brute force approach If the length of both sequences is n, number of possibility = C(2n, n) = (2n)!/(n!) 2  2 2n / (  n) 1/2, using Sterling’s approximation of n! = (2  n) 1/2 e -n n n. O(4 n )  Dynamic programming O(mn), where the two sequence sizes are m and n, respectively O(n 2 ), if m is in the order of n

27 Intro to Bioinformatics – Sequence Alignment27 Practice Problem  Find an optimal alignment for these two sequences: GCGGTT GCGT  Match: +1  Mismatch: 0  Gap: –1

28 Intro to Bioinformatics – Sequence Alignment28 Practice Problem  Find an optimal alignment for these two sequences: GCGGTT GCGT GCGGTT GCG-T- Score = +2

29 Intro to Bioinformatics – Sequence Alignment29 Semi-global alignment  Suppose we are aligning: GCG GGCG  Which do you prefer? G-CG-GCG GGCGGGCG  Semi-global alignment allows gaps at the ends for free.

30 Intro to Bioinformatics – Sequence Alignment30 Semi-global alignment  Semi-global alignment allows gaps at the ends for free.  Initialize first row and column to all 0’s  Allow free horizontal/vertical moves in last row and column

31 Intro to Bioinformatics – Sequence Alignment31 Local alignment  Global alignments – score the entire alignment  Semi-global alignments – allow unscored gaps at the beginning or end of either sequence  Local alignment – find the best matching subsequence  CGATG AAATGGA  This is achieved by allowing a 4 th alternative at each position in the table: zero.

32 Intro to Bioinformatics – Sequence Alignment32 Local Sequence Alignment  Why local sequence alignment? Subsequence comparison between a DNA sequence and a genome Protein function domains Exons matching  Smith-Waterman algorithm D i,j = max{ D i-1,j + d(A i, –), D i,j-1 + d(–,B j ), D i-1,j-1 + d(A i,B j ), 0 } Initialization: D 1,j = , D i,1 = 

33 Intro to Bioinformatics – Sequence Alignment33 Local alignment  Score: Match = 1, Mismatch = -1, Gap = -1 CGATG AAATGGA

34 Intro to Bioinformatics – Sequence Alignment34 Local alignment  Another example

35 Intro to Bioinformatics – Sequence Alignment35 More Example  Align ATGGCCTC ACGGCTC Mismatch  = -3 Gap  = -4 --ACGGCTC 0-4-8-12-16-20-24-32 1-3 -- A T G G C C T C -4 -8 -12 -16 -20 -24 -28 -32 -7-11-15-19-23 -2-6-10-14 -18 -7-6-5-9-13-17 -11-10-50-4-8-12 -15-10-9-41-3-7 -19-14-13-8-3-2 -23-18-17-12-7-2-5 -27-22-21-16-11-6 Global Alignment: ATGGCCTC ACGGC-TC

36 Intro to Bioinformatics – Sequence Alignment36 More Example Local Alignment: ATGGCCTC ACGG CTC or ATGGCCTC ACGGC TC --ACGGCTC 00000000 10 0 A T G G C C T C 0 0 0 0 0 0 0 0 00000 000010 0011000 0012000 0100301 0100101 0000020 0100103

37 Intro to Bioinformatics – Sequence Alignment37 Scoring Matrices for DNA Sequences  Transition: A  G C  T  Transversion: a purine (A or G) is replaced by a pyrimadine (C or T) or vice versa

38 Intro to Bioinformatics – Sequence Alignment38 Scoring Matrices for Protein Sequence  PAM (Percent Accepted Mutations) 250

39 Intro to Bioinformatics – Sequence Alignment39 Scoring Matrices for Protein Sequence  BLOSUM (BLOcks SUbstitution Matrix) 62

40 Intro to Bioinformatics – Sequence Alignment40 Using Protein Scoring Matrices  Divergence BLOSUM 80 BLOSUM 62 BLOSUM 45 PAM 1 PAM 120 PAM 250 Closely related Distantly related Less divergent More divergent Less sensitive More sensitive  Looking for Short similar sequences → use less sensitive matrix Long dissimilar sequences → use more sensitive matrix Unknown → use range of matrices  Comparison PAM – designed to track evolutionary origin of proteins BLOSUM – designed to find conserved regions of proteins


Download ppt "Sequence Alignments Introduction to Bioinformatics."

Similar presentations


Ads by Google