Pairwise sequence comparison


1 Pairwise sequence comparison
Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington

2 One-minute responses
Need more clarification on the alignment of sequences.
More precise definitions, included on slides.
Python programming very useful.
A lot to learn at once.
I did not get the Python part, but it’s easy when I try it myself.
Please write on the board sometimes.
I liked the part about evolutionary theory.
I understood about 50% of the lecture.
You speak too fast.
I did not understand Moore’s law.
I liked the biology video and the analogy with Moore’s law.
The Python code is different from what we are used to.
Give a practical example of BLAST usage.
More explanation on sys.argv.
I like the summary of the previous lecture.
Sometimes you talk a long time before asking for questions.
Biology of cells and genomes wasn’t clear enough.
Take time explaining the biology concepts.
The last part of the lecture was not clear.
Give more examples.
The class could go a bit slower on difficult concepts.
It is a good habit to accept comments.
We may forget what you are saying because we are not taking notes.
I didn’t understand the mechanism by which we can physically read the DNA sequence.
Response: This topic is outside the scope of this class. If you’d like to read about it, check out “Overview of DNA sequencing strategies.”

3

4

5 Other questions
How many assignments will we do?
Will there be a test?
Is there a lot of mathematics here, or just Python and biology?
Is a theoretical understanding of genetics and biochemistry vital for studying bioinformatics?
Are we going to write this feedback every day?
How can I make a dictionary from a .txt file?
When you compare two protein sequences, how do you guess where the first codon begins?
What is evolution? Is it dogma?

6 Outline
Responses from last class
Sequence alignment: motivation, scoring alignments
Python

7 Revision
What is the difference between the DNA and RNA alphabets? In RNA, “T” is changed to “U.”
What is a codon, and why is it significant? A set of three adjacent RNA nucleotides. One codon codes for a single amino acid.
What is the universal genetic code? A set of rules for translating codons into amino acids.
What is the purpose of aligning two DNA or protein sequences? To infer common ancestry, function or structure.

8 Sequence comparison overview
Problem: Find the “best” alignment between a query sequence and a target sequence. To solve this problem, we need a method for scoring alignments, and an algorithm for finding the alignment with the best score. The alignment score is calculated using a substitution matrix, and gap penalties. The algorithm for finding the best alignment is dynamic programming.

9 A simple alignment problem.
Problem: find the best pairwise alignment of GAATC and CATAC.

10 Scoring alignments
We need a way to measure the quality of a candidate alignment. Alignment scores consist of two parts: a substitution matrix, and a gap penalty.
Candidate alignments of GAATC (top row) and CATAC (bottom row):
GAATC    GAAT-C    -GAAT-C    GAATC-
CATAC    C-ATAC    C-A-TAC    CA-TAC

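11 Scoring aligned bases
A hypothetical substitution matrix:
     A    C    G    T
A   10   -5    0   -5
C   -5   10   -5    0
G    0   -5   10   -5
T   -5    0   -5   10
Example: the ungapped alignment of GAATC over CATAC has two matching columns (A/A and C/C) and three mismatches (G/C, A/T, T/A), so its score is -5 + 10 - 5 - 5 + 10 = 5.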

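12 BLOSUM62
(Table: the BLOSUM62 substitution matrix, with rows and columns labeled A R N D C Q E G H I L K M F P S T W Y V B Z X. The matrix entries are not preserved in this transcript.)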

13 Scoring gaps
Linear gap penalty: every gap character receives a score of d.
Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e.
Linear example, d=-4:
GAAT-C
CA-TAC    score = 17
Affine example, d=-4, e=-1:
G--AATC
CATA--C   score = 5
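The two examples above can be checked with a short script. This is only a sketch: the function names are made up for illustration, and the substitution scores (10 for a match, 0 for A/G and C/T, -5 for any other mismatch) are an assumption reconstructed from the values on the slides.

```python
def substitution_score(a, b):
    """Assumed hypothetical matrix: 10 match, 0 for A/G or C/T, -5 otherwise."""
    if a == b:
        return 10
    if {a, b} in ({"A", "G"}, {"C", "T"}):
        return 0
    return -5


def score_alignment(top, bottom, gap_open=-4, gap_extend=None):
    """Score two equal-length aligned strings that use '-' for gaps.

    With gap_extend=None every gap character costs gap_open (linear penalty).
    Otherwise the first character of each gap run costs gap_open and every
    further character of the same run costs gap_extend (affine penalty).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    if gap_extend is None:
        gap_extend = gap_open
    total = 0
    previous_gap_row = None  # row that held the gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            total += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            total += substitution_score(a, b)
            previous_gap_row = None
    return total


print(score_alignment("GAAT-C", "CA-TAC"))             # linear, d=-4
print(score_alignment("G--AATC", "CATA--C", -4, -1))   # affine, d=-4, e=-1
```

Under these assumptions the two calls reproduce the slide's scores of 17 and 5.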

14 A simple alignment problem.
Problem: find the best pairwise alignment of GAATC and CATAC.
Use a linear gap penalty of -4.
Use the hypothetical substitution matrix from slide 11.

15 How many possibilities?
GAATC    GAAT-C    -GAAT-C    GAATC-    GAAT-C    GA-ATC
CATAC    C-ATAC    C-A-TAC    CA-TAC    CA-TAC    CATA-C
How many different alignments of two sequences of length n exist? Too many to enumerate!
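To get a feel for "too many," one way to count is to treat every alignment as a path of diagonal, vertical and horizontal steps through a grid, so that the count for lengths (n, m) is the sum of the counts for (n-1, m-1), (n-1, m) and (n, m-1). The sketch below uses that convention, which is an assumption: it counts alignments that differ only in the order of adjacent gaps as distinct. The function name is illustrative.

```python
def count_alignments(n, m):
    """Count global alignments of sequences of lengths n and m.

    Each alignment is a path made of diagonal, vertical and horizontal
    steps, so the counts follow a three-term recurrence.
    """
    # With one sequence empty there is exactly one alignment (all gaps).
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            counts[i][j] = (counts[i - 1][j - 1]   # align two characters
                            + counts[i - 1][j]     # gap in the second sequence
                            + counts[i][j - 1])    # gap in the first sequence
    return counts[n][m]


for length in (1, 2, 5, 10, 20):
    print(length, count_alignments(length, length))
```

For two sequences of length 5 this gives 1683 alignments, and the count grows exponentially with the length.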

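16 DP matrix
The value in position (i,j) is the score of the best alignment of the first i positions of the first sequence versus the first j positions of the second sequence.
(Figure: DP matrix with the highlighted cell holding -8, the score of the partial alignment -G- over CAT.)

17 DP matrix
Moving horizontally in the matrix introduces a gap in the sequence along the left edge.
(Figure: a horizontal move from the cell holding -8 gives -12 and the partial alignment -G-A over CAT-.)

18 DP matrix
Moving vertically in the matrix introduces a gap in the sequence along the top edge.
(Figure: a vertical move from the cell holding -8 gives -12 and the partial alignment -G-- over CATA.)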

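19 Initialization
(Figure: the DP matrix before any alignment has been scored.)

20 Introducing a gap
(Figure: aligning G with a gap, G over -, scores -4.)

21 DP matrix
(Figure: aligning a gap with C, - over C, scores -4.)

22 DP matrix
(Figure: DP matrix with the entries -4 and -8 filled in.)

23 DP matrix
(Figure: aligning G with C, G over C, scores -5.)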

24 Three legal moves A diagonal move aligns a character from the left sequence with a character from the top sequence. A vertical move introduces a gap in the sequence along the top edge. A horizontal move introduces a gap in the sequence along the left edge.

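25 DP matrix
(Figure: DP matrix with the gap-only entries -4, -8, -12, -16 and -20 filled in, plus the G/C entry of -5. Five gaps over CATAC, ----- over CATAC, score -20.)

26 DP matrix
(Figure: the same matrix with the next cell to fill marked “?”.)

27 DP matrix
Three candidate alignments for the cell marked “?”:
-G    G-    --G
CA    CA    CA-
-4    -9    -12
The best of the three, -4, is entered in the cell.

28 DP matrix
(Figure: the matrix with the next cell to fill marked “?”.)

29 DP matrix
(Figure: the matrix with that cell filled in.)

30 DP matrix
(Figure: the matrix with the next cell to fill marked “?”.)

31 DP matrix
(Figure: the matrix with further entries filled in, including -9, 5, 1, 2 and -2.)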

32 Find the optimal alignment, and its score.
(Figure: the partially filled DP matrix, with a cell marked “?”.)

33 DP matrix
The completed matrix, with GAATC along the top edge and CATAC along the left edge:
            G     A     A     T     C
       0   -4    -8   -12   -16   -20
C     -4   -5    -9   -13   -12    -6
A     -8   -4     5     1    -3    -7
T    -12   -8     1     0    11     7
A    -16  -12     2    11     7     6
C    -20  -16    -2     7    11    17
The score of the optimal alignment is the bottom-right entry, 17.

34 DP in equation form
Align sequences x and y.
F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.
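The equation itself is not preserved in this transcript. The standard global-alignment recurrence, written with the symbols defined above and with d as a negative score (as in d=-4), would read as follows; treat it as a reconstruction rather than the slide's exact notation.

```latex
\begin{aligned}
F(0,0) &= 0, \qquad F(i,0) = i\,d, \qquad F(0,j) = j\,d, \\
F(i,j) &= \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{diagonal move: align } x_i \text{ with } y_j \\
F(i-1,\,j) + d & \text{align } x_i \text{ with a gap} \\
F(i,\,j-1) + d & \text{align } y_j \text{ with a gap}
\end{cases}
\end{aligned}
```

Each of the three cases corresponds to one of the three legal moves in the DP matrix.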

35 DP in equation form

36 Summary
Scoring a pairwise alignment requires a substitution matrix and gap penalties.
Dynamic programming is an efficient algorithm for finding the optimal alignment.
Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions.
DP iteratively fills in the matrix using a simple mathematical rule.
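The points above can be put together in a small script. This is a minimal sketch, not the course's reference solution: the function names are illustrative, the gap penalty is linear, and the substitution scores (10 match, 0 for A/G and C/T, -5 otherwise) are an assumption reconstructed from the DP values on the slides.

```python
def substitution_score(a, b):
    """Assumed hypothetical matrix: 10 match, 0 for A/G or C/T, -5 otherwise."""
    if a == b:
        return 10
    if {a, b} in ({"A", "G"}, {"C", "T"}):
        return 0
    return -5


def global_align(x, y, gap=-4, score=substitution_score):
    """Return (best score, aligned x, aligned y) using a linear gap penalty."""
    n, m = len(x), len(y)
    # F[i][j] = best score for the first i characters of x vs first j of y.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Trace back from the last cell to recover one optimal alignment.
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


best, top, bottom = global_align("GAATC", "CATAC")
print(best)
print(top)
print(bottom)
```

For GAATC and CATAC the best score is 17. Several alignments can tie for the best score, and the traceback reports only one of them.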

37 A simple example
Find the optimal alignment of AAG and AGC.
Use a gap penalty of d=-5.
Use a substitution matrix over A, C, G, T with the entries 2, -7 and -5.
(Figure: an empty DP matrix for AAG versus AGC.)

38 Some useful Python tidbits

39 sys.argv
sys.argv is a list containing the strings given on the command line.
Write a program that adds up all of the numbers on the command line.
> add-numbers.py 1 2 3
6
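One possible solution to this exercise is sketched below. The function names are illustrative, and the choice to accept decimal numbers as well as integers is an assumption.

```python
import sys


def add_numbers(args):
    """Return the sum of the numbers given as strings in args."""
    return sum(float(arg) for arg in args)


def main(argv):
    # argv[0] is the program name; the numbers start at argv[1].
    total = add_numbers(argv[1:])
    # "%g" prints 6.0 as 6, so whole numbers look like the example output.
    print("%g" % total)


# When saved as add-numbers.py, the script would end with: main(sys.argv)
```

Every element of sys.argv is a string, so each argument has to be converted before it can be added.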

40 sys.stdout versus sys.stderr
sys.stdout and sys.stderr are two different streams to print to.
Use sys.stdout for the primary output of your program.
Use sys.stderr to report errors or give status updates.

41 write() versus print()
The write() function differs from print():
write() does not automatically add an end-of-line.
write() requires that you specify the name of the file or stream to be written to.
write() only accepts a single string.
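The three differences can be seen side by side in a small sketch. The function name report and its parameters are made up for illustration; the out and err parameters exist only so the streams can be swapped out.

```python
import sys


def report(score, out=None, err=None):
    """Write a status line to the error stream and the score to the output stream."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err

    # write() is called on a specific stream and takes exactly one string,
    # so the newline has to be written explicitly.
    err.write("Scoring alignment...\n")

    # print() accepts several values, converts them to strings, separates
    # them with spaces, and adds the end-of-line itself.
    print("score:", score, file=out)

    # The same line with write(): build the single string by hand.
    out.write("score: " + str(score) + "\n")


report(17)
```

Run from a terminal, both streams appear on the screen, but only the two score lines are captured if the output is redirected to a file with ">".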

42 %
The “%” operator substitutes values into a string based on the presence of format strings.
%s = string
%d = integer
%g = float
Place “%” between a string and a tuple of values.
>>> "%s %s %s" % ("larry", "curly", "moe")
'larry curly moe'
>>> "%d + %d" % (21, 15)
'21 + 15'

43 Using sys.stderr
Write a program to divide one number by a second number.
> divide.py 8 3
Print the usage message if exactly two numbers are not given.
> ./divide.py
USAGE: divide.py <value1> <value2>
Divide <value1> by <value2> and report the result.
Stop and print an error if the second number is zero.
> divide.py 8 0
Divide by zero error.
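A possible shape for this program is sketched below. Sending the usage message to sys.stderr, returning a nonzero status on failure, and formatting the result with %g are assumptions; the exercise does not specify them. The out and err parameters are illustrative and exist only so the streams can be replaced.

```python
import sys

USAGE = """USAGE: divide.py <value1> <value2>
Divide <value1> by <value2> and report the result.
"""


def main(argv, out=None, err=None):
    """Divide argv[1] by argv[2]; return 0 on success and 1 on any error."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err

    # The program name plus exactly two values makes three entries.
    if len(argv) != 3:
        err.write(USAGE)
        return 1

    numerator = float(argv[1])
    denominator = float(argv[2])
    if denominator == 0:
        err.write("Divide by zero error.\n")
        return 1

    # Only the primary result goes to standard output.
    out.write("%g\n" % (numerator / denominator))
    return 0


# When saved as divide.py, the script would end with: sys.exit(main(sys.argv))
```

Keeping error text on sys.stderr means that redirecting the output to a file captures only the result.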

44 Using write() and %
Write a program that prints the command line arguments with “+” signs between, and then reports their sum.
> add-numbers2.py 1 2 3
1 + 2 + 3 = 6
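One way to approach this exercise is sketched below. The function names are illustrative, and accepting decimal numbers is an assumption.

```python
import sys


def format_sum(args):
    """Return a line such as '1 + 2 + 3 = 6' for the given number strings."""
    numbers = [float(arg) for arg in args]
    # "%g" drops the trailing ".0" from whole numbers.
    expression = " + ".join("%g" % number for number in numbers)
    # "%" takes a tuple holding one value per format string.
    return "%s = %g\n" % (expression, sum(numbers))


def main(argv):
    # write() takes a single string, so the whole line is built first.
    sys.stdout.write(format_sum(argv[1:]))


# When saved as add-numbers2.py, the script would end with: main(sys.argv)
```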

45 One-minute response
At the end of each class:
Write for about one minute.
Provide feedback about the class.
Was part of the lecture unclear?
What did you like about the class?
Do you have unanswered questions?
Sign your name.
I will begin the next class by responding to the one-minute responses.

