Reminder - Structure of a genome. Human genome: 3x10^9 bp; ~30,000 genes; ~200,000 exons; ~23 Mb coding; ~15 Mb noncoding; pre-mRNA, transcription, splicing, translation.


1 Reminder - Structure of a genome
- Human genome: 3x10^9 bp
- ~30,000 genes, ~200,000 exons
- ~23 Mb coding, ~15 Mb noncoding
- A gene: transcription -> pre-mRNA -> splicing -> mature mRNA -> translation -> protein

2 Sequence Alignment
- We assume a link between the linear information stored in a DNA, RNA or amino-acid sequence and the protein function determined by its three-dimensional structure.
- We want to compare the linear sequences of various genes in order to deduce function, phylogeny, structure, origin...
- A high level of similarity is taken as evidence of homology.

3 The Problem
- Biological problem: finding a way to compare and represent similarity or dissimilarity between biomolecular sequences (DNA, RNA or amino acid).
- Computational problem: finding a way to perform inexact or approximate matching of subsequences within strings of characters.
- Statistical problem: how to estimate the validity of our results.

4 Course plan (for the next three weeks)
- Details of biology
- Estimate of computation time
- Dynamic programming algorithm for full and local alignment
- Statistical analysis of results
- Dot matrices and heuristics for alignment
- Distance matrices and information theory
- (MSA)

5 Homology
- Similarity due to descent from a common ancestor
- Homologous sequences can be identified through sequence alignment
- Thus it is possible to predict or infer structure or function from primary sequence analysis

6 Gaps
- Sequences may have diverged from a common ancestor through mutations:
  - Substitution (AAGC -> AAGT)
  - Insertion (AAG -> AAGT)
  - Deletion (AAGC -> AAG)
- The latter two operations result in gaps ( _ )
- K contiguous spaces = one gap of length K (K > 0)

7 Similarity and Alignment
- Similarity has two aspects:
  - Quantitative aspect: a similarity measure, i.e. a number that represents the degree of similarity. Example: a score indicating a 10% match between two DNA sequences.
  - Qualitative aspect: an alignment, i.e. a mutual arrangement of two sequences that shows where the two sequences are similar and where they differ.
- An optimal alignment is one that exhibits the most correspondences and the fewest differences.
- Example:
    a b c d e _ h z
    a b w d e f h _

8 The Edit Distance between two strings
- Definition: the edit distance between two strings is the minimum number of edit operations (insertions, deletions and substitutions) needed to transform the first string into the second. For emphasis, note that matches are not counted.
- Example: AATT and AATG have distance 1 (one substitution).
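The definition above can be computed with a small dynamic-programming table. The following is a minimal sketch, not taken from the slides; the function name `edit_distance` is a hypothetical choice, and every operation is assumed to cost 1.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))  # transforming "" into b[:j] takes j insertions
    for i, ca in enumerate(a, start=1):
        cur = [i]  # transforming a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute; a match costs nothing
            ))
        prev = cur
    return prev[-1]


print(edit_distance("AATT", "AATG"))  # the slide's example: one substitution
```

Only two rows of the table are kept at a time, which is enough when just the distance (and not the edit transcript) is needed.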

9 String alignment
- An edit transcript is a way to represent a particular transformation of one string into another
  - Emphasizes point mutations in the model
- An alignment displays a relationship between two strings
  - Global alignment means that, for each string, the entire string is involved in the alignment
- Examples:
  (1) A A G C A
      A A _ C _
  (2) GSAQVKGHGKKVADAL ....
      ++ ++++H+ KV +   ....
      NNPELQAHAGKVFKLV ....

10 Alignment vs. Edit Transcript
- Essentially equivalent:
  - Two opposing (mismatched) characters in an alignment -> a substitution in the edit transcript
  - A gap or space in the first string of an alignment -> an insertion of the opposing character
  - A gap or space in the second string -> a deletion of the opposing character
- Product (alignment) vs. process (edit transcript)

11 Gap cost or penalty functions
- Observation: one gap of length k is more probable than k gaps of length 1
  - The cause might be a single mutational event
  - Separated gaps probably arose from different events
- Gap penalty functions:
  - Linear cost: treats both cases uniformly
  - Affine cost: it is common to use a higher cost h for opening a gap and a lower cost g for extending it
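To make the difference concrete, here is a small sketch of the two penalty functions. The function names are hypothetical, and the affine form assumes that the opening cost h covers the first gap position and that each further position costs g (conventions differ between textbooks). The values -8 and -3 are used purely as an illustration.

```python
def linear_gap(k: int, d: int) -> int:
    """Linear cost: every gap position costs d, however the positions are grouped."""
    return k * d


def affine_gap(k: int, h: int, g: int) -> int:
    """Affine cost (assumed convention): h for the first position, g for each extension."""
    return h + (k - 1) * g


# One gap of length 3 versus three separate gaps of length 1.
print(linear_gap(3, -8), 3 * linear_gap(1, -8))          # same cost under a linear penalty
print(affine_gap(3, -8, -3), 3 * affine_gap(1, -8, -3))  # the single long gap is cheaper
```

Under the affine function a single long gap is penalised less than several short ones, which matches the observation that one mutational event is more probable than several.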

12 Pairwise Sequence Alignment
- Example: align
    HEAGAWGHEE
    PAWHEAE
- Which one is better?
    HEAGAWGHE-E
    P-A--W-HEAE
  or
    HEAGAWGHE-E
    --P-AW-HEAE

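13 Example
- Scoring matrix (columns A E G H W; entries as they survive in the transcript):
    A: 5 0 -2 -3
    E: 6 -3 0
    H: -2 0 10 -3
    P: -2 -4
    W: -3 15
- Gap penalty: -8
- Gap extension: -3
- Alignments:
    HEAGAWGHE-E
    P-A--W-HEAE

    HEAGAWGHE-E
    --P-AW-HEAE
- Score of the second alignment, charging -8 for each of its five gap positions:
  (-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1
- Exercise: Calculate for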
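The column-by-column scoring can be sketched as follows. This is an illustrative helper, not part of the slides: `score_alignment` and `PAIR_SCORES` are hypothetical names, only the residue pairs that occur in the second alignment are listed (values as they appear in the slide's sum), and every gap position is assumed to cost -8, so the gap extension value is not used.

```python
# Scores for the residue pairs that occur in the second alignment.
# Keys are the two residues in alphabetical order.
PAIR_SCORES = {"AP": -1, "AA": 5, "WW": 15, "HH": 10, "EE": 6}


def score_alignment(top: str, bottom: str, pair_scores: dict, gap: int = -8) -> int:
    """Sum the column scores of an alignment given as two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap  # linear cost: each gap position is charged separately
        else:
            total += pair_scores["".join(sorted(a + b))]
    return total


print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE", PAIR_SCORES))
```

The alignment has eleven columns, five of which contain a gap, so the total is 41 from the six residue pairs minus 40 from the gaps.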

14 Formal Description
- Problem: PairSeqAlign
- Input:
  - Two sequences x, y
  - Scoring matrix s
  - Gap penalty d
  - Gap extension penalty e
- Output: the optimal sequence alignment

15 How Difficult Is This?
- Given two sequences of length m and n:
  - How many alignments are there? f(m,n)
  - How many non-equivalent alignments are there? g(m,n)

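16 f(n,m)
- f(n,m) = f(n-1,m) + f(n,m-1) + f(n-1,m-1)
- Each term corresponds to one choice for the last column of the alignment: a letter of the first sequence over a gap, a gap over a letter of the second sequence, or two letters aligned to each other.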
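The recurrence can be evaluated directly once base cases are fixed. The sketch below assumes f(n,0) = f(0,m) = 1 (an empty sequence can only be aligned against gaps), which the slide does not state; the memoisation via `functools.lru_cache` keeps the recursion from recomputing the same cells.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def f(n: int, m: int) -> int:
    """Number of alignments of sequences of lengths n and m under the slide's recurrence."""
    if n == 0 or m == 0:
        return 1  # assumed base case: only one way to align against an empty sequence
    return f(n - 1, m) + f(n, m - 1) + f(n - 1, m - 1)


for n in (1, 2, 3, 5, 10):
    print(n, f(n, n))
```

Even for short sequences the count grows very quickly, which is why enumerating alignments is not an option.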

17 f(n,m)
[Diagram: table cell f(n,m) with its three neighbouring cells f(n,m-1), f(n-1,m-1) and f(n-1,m)]

18 g(n,m)

19 g(n,m)
[Diagram: table cell g(n,m) with its three neighbouring cells g(n,m-1), g(n-1,m-1) and g(n-1,m)]

20 So what?
- So at n = 20, we have over 120 billion possible alignments
- We want to be able to align much, much longer sequences
  - Some proteins have 1000 amino acids
  - Genes can have several thousand base pairs
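The order of magnitude can be checked quickly. The formula for g is not preserved in this transcript, so the sketch below rests on an assumption: that the count in question is the binomial coefficient C(2n, n), a common way of counting alignments of two length-n sequences when the order of adjacent gaps is ignored. The function name is hypothetical.

```python
import math


def alignment_count(n: int) -> int:
    """Assumed count of non-equivalent alignments of two length-n sequences: C(2n, n)."""
    return math.comb(2 * n, n)


print(f"{alignment_count(20):,}")  # 137,846,528,820
```

Under that assumption the count for n = 20 is about 138 billion, consistent with the "over 120 billion" on the slide.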

21 Dynamic Programming
- General algorithmic development technique
- Reuses the results of previous computations
  - Store intermediate results in a table for reuse
  - Look up earlier results in the table and build from them

22 Global Alignment
- Needleman-Wunsch, 1970
- Idea: build up an optimal alignment from optimal alignments of subsequences
- Example: starting from
    HEAG
    --P-     score -25
  there are three ways to extend by one column:
    HEAGA
    --P-A    -20  (top and bottom: add score from table)
    HEAGA
    --P--    -33  (gap with bottom)
    HEAG-
    --P-A    -33  (gap with top)

23 Global Alignment
- Notation:
  - x_i: i-th letter of string x
  - y_j: j-th letter of string y
  - x_1..i: prefix of x from letters 1 through i
  - F: matrix of optimal scores; F(i,j) is the optimal score for lining up x_1..i with y_1..j
  - d: gap penalty
  - s: scoring matrix

24 Global Alignment
- The work is to build up F
- Initialize: F(0,0) = 0, F(i,0) = i*d, F(0,j) = j*d (with d given as a negative score, e.g. d = -8)
- Fill from top left to bottom right using the recursive relation

25 Global Alignment
- F(i,j) is the maximum of three candidates, one per neighbouring cell:
  - F(i-1,j-1) + s(x_i, y_j): move ahead in both
  - F(i-1,j) + d: x_i aligned to a gap
  - F(i,j-1) + d: y_j aligned to a gap
- While building the table, keep track of where each optimal score came from (reverse arrows)
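The initialisation and the fill step can be sketched as below. This is an illustration rather than the lecture's own code: `nw_fill` and `toy_score` are hypothetical names, d is passed as a negative score, and the +6 / -2 match and mismatch values are arbitrary stand-ins for a real scoring matrix.

```python
def nw_fill(x: str, y: str, s, d: int):
    """Return the table F where F[i][j] is the best score for aligning x[:i] with y[:j]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        F[i][0] = i * d  # x[:i] aligned entirely to gaps
    for j in range(1, len(y) + 1):
        F[0][j] = j * d  # y[:j] aligned entirely to gaps
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # move ahead in both
                F[i - 1][j] + d,                          # x_i aligned to a gap
                F[i][j - 1] + d,                          # y_j aligned to a gap
            )
    return F


def toy_score(a: str, b: str) -> int:
    """Hypothetical scoring function: +6 for a match, -2 for a mismatch."""
    return 6 if a == b else -2


table = nw_fill("AATT", "AATG", toy_score, d=-8)
print(table[-1][-1])  # score of the best global alignment
```

Passing the scoring function as a parameter keeps the fill step independent of any particular substitution matrix.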

26 Example
           H    E    A    G    A    W    G    H    E    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
 P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
 A  -16
 W  -24
 H  -32
 E  -40
 A  -48
 E  -56

27 Completed Table
           H    E    A    G    A    W    G    H    E    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
 P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
 A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
 W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
 H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
 E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
 A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
 E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Score: Gap -8, Error -2, Fit +6

28 Traceback
           H    E    A    G    A    W    G    H    E    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
 P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
 A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
 W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
 H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
 E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
 A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
 E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Resulting alignment:
    HEAGAWGHE-E
    --P-AW-HEAE
- Trace arrows back from the lower right to the top left
  - Diagonal: both letters
  - Up: gap in the upper sequence
  - Left: gap in the lower sequence
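Fill and traceback can be combined into one short sketch. It is an illustration under stated assumptions, not the lecture's own code: `needleman_wunsch`, `sub` and `PAIR` are hypothetical names, the substitution scores are assumed to be the BLOSUM50 values for the residues in this example (not all of them survive in the transcript), and a linear gap score d = -8 is used. Instead of storing arrows, the traceback re-derives each step by checking which neighbour produced the cell's value.

```python
# Assumed BLOSUM50 values for the residue pairs that can occur in this example;
# keys are the two residues in alphabetical order.
PAIR = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PW": -4, "WW": 15,
}


def sub(a: str, b: str) -> int:
    """Symmetric lookup of the substitution score for residues a and b."""
    return PAIR["".join(sorted(a + b))]


def needleman_wunsch(x: str, y: str, d: int = -8):
    """Return (score, aligned_x, aligned_y) for a global alignment with linear gap score d."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),
                          F[i - 1][j] + d,
                          F[i][j - 1] + d)
    # Traceback from the last cell to F[0][0], rebuilding one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1   # gap in y
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1   # gap in x
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, ax, ay = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)
print(ax)
print(ay)
```

When several neighbours tie, this sketch prefers the diagonal step, so it reports just one of possibly several optimal alignments.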

29 Summary
- Uses recursion to fill in a table of intermediate results
- Uses O(nm) space and time: an O(n^2) algorithm when both sequences have length about n
- Feasible for moderate-sized sequences, but not for aligning whole genomes

