Pairwise sequence comparison


Pairwise sequence comparison Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington thabangh@gmail.com

One-minute responses Need more clarification on the alignment of sequences. More precise definitions, included on slides. Python programming very useful. A lot to learn at once. I did not get the Python part, but it’s easy when I try it myself. Please write on the board sometimes. I liked the part about evolutionary theory. I understood about 50% of the lecture. You speak too fast. I did not understand Moore’s law. I liked the biology video and the analogy with Moore’s law. The Python code is different from what we are used to. Give a practical example of BLAST usage. More explanation on sys.argv. I like the summary of the previous lecture. Sometimes you talk a long time before asking for questions. Biology of cells and genomes wasn’t clear enough. Take time explaining the biology concepts. The last part of the lecture was not clear. Give more examples. The class could go a bit slower on difficult concepts. It is a good habit to accept comments. We may forget what you are saying because we are not taking notes. I didn’t understand the mechanism by which we can physically read the DNA sequence. This topic is outside the scope of this class to cover. If you’d like to read about this, check out “Overview of DNA sequencing strategies.”

Other questions How many assignments will we do? Will there be a test? Is there a lot of mathematics here, or just Python and biology? Are theoretical understanding of genetics and biochemistry vital for studying bioinformatics? Are we going to write this feedback every day? How can I make a dictionary from a .txt file? When you compare two protein sequences, how do you guess where the first codon begins? What is evolution? Is it dogma?

Outline Responses from last class Sequence alignment Python Motivation Scoring alignments Python

Revision What is the difference between the DNA and RNA alphabets? In RNA, “T” is changed to “U.” What is a codon, and why is it significant? A set of three adjacent RNA nucleotides. One codon codes for a single amino acid. What is the universal genetic code? A set of rules for translating codons into amino acids. What is the purpose of aligning two DNA or protein sequences? To infer common ancestry, function or structure.

Sequence comparison overview Problem: Find the “best” alignment between a query sequence and a target sequence. To solve this problem, we need a method for scoring alignments, and an algorithm for finding the alignment with the best score. The alignment score is calculated using a substitution matrix, and gap penalties. The algorithm for finding the best alignment is dynamic programming.

A simple alignment problem. Problem: find the best pairwise alignment of GAATC and CATAC.

Scoring alignments
GAATC    GAAT-C    -GAAT-C    GAATC-
CATAC    C-ATAC    C-A-TAC    CA-TAC
We need a way to measure the quality of a candidate alignment. Alignment scores consist of two parts: a substitution matrix, and a gap penalty.

Scoring aligned bases
A hypothetical substitution matrix:
     A    C    G    T
A   10   -5    0   -5
C   -5   10   -5    0
G    0   -5   10   -5
T   -5    0   -5   10
GAATC
 |  |
CATAC
-5 + 10 + -5 + -5 + 10 = 5
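As a minimal sketch of the column-by-column scoring above, the following Python function adds up substitution scores for an ungapped alignment. The function names are hypothetical. The scores of 0 for A/G and C/T pairs are an assumption reconstructed from the DP matrix values later in the lecture; this example only touches the 10 and -5 entries.

```python
PURINES = {"A", "G"}


def substitution_score(a, b):
    """Score one aligned pair of bases using the assumed matrix."""
    if a == b:
        return 10
    # Assumption: A<->G and C<->T score 0, every other mismatch scores -5.
    if (a in PURINES) == (b in PURINES):
        return 0
    return -5


def score_ungapped(x, y):
    """Sum the substitution scores of an alignment that has no gaps."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(substitution_score(a, b) for a, b in zip(x, y))


print(score_ungapped("GAATC", "CATAC"))  # -5 + 10 + -5 + -5 + 10 = 5
```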

BLOSUM62 [Figure: the BLOSUM62 amino acid substitution matrix, with rows and columns A R N D C Q E G H I L K M F P S T W Y V B Z X]

Scoring gaps
Linear gap penalty: every gap character receives a score of d.
Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e.
Linear, d=-4:
GAAT-C
CA-TAC
-5 + 10 + -4 + 10 + -4 + 10 = 17
Affine, d=-4, e=-1:
G--AATC
CATA--C
-5 + -4 + -1 + 10 + -4 + -1 + 10 = 5
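The two gap schemes can be sketched in Python as follows. This is an illustration, not code from the lecture: the function names are hypothetical, and the substitution scores (10 for a match, 0 for A/G or C/T, -5 otherwise) are an assumption about the hypothetical matrix. Passing e=None selects the linear penalty.

```python
PURINES = {"A", "G"}


def substitution_score(a, b):
    """Assumed scores: 10 match, 0 for A<->G or C<->T, -5 otherwise."""
    if a == b:
        return 10
    return 0 if (a in PURINES) == (b in PURINES) else -5


def score_alignment(top, bottom, d=-4, e=None):
    """Score a gapped alignment given as two equal-length strings.

    Linear penalty (e is None): every gap character scores d.
    Affine penalty: the first character of a gap scores d, later ones e.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    previous_gap_row = None  # row that held a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            extending = e is not None and previous_gap_row == gap_row
            total += e if extending else d
            previous_gap_row = gap_row
        else:
            total += substitution_score(a, b)
            previous_gap_row = None
    return total


print(score_alignment("GAAT-C", "CA-TAC", d=-4))          # linear: 17
print(score_alignment("G--AATC", "CATA--C", d=-4, e=-1))  # affine: 5
```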

A simple alignment problem. Problem: find the best pairwise alignment of GAATC and CATAC. Use a linear gap penalty of -4. Use the following substitution matrix:
     A    C    G    T
A   10   -5    0   -5
C   -5   10   -5    0
G    0   -5   10   -5
T   -5    0   -5   10

How many possibilities?
GAATC    GAAT-C    -GAAT-C    GAATC-    GAAT-C    GA-ATC
CATAC    C-ATAC    C-A-TAC    CA-TAC    CA-TAC    CATA-C
How many different alignments of two sequences of length n exist? Too many to enumerate!
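To get a feel for "too many", here is a rough Python sketch that counts alignments. It assumes each alignment is built column by column from three kinds of column (base over base, base over gap, gap over base) and that every distinct sequence of columns counts as a separate alignment. The function name is hypothetical.

```python
def count_alignments(m, n):
    """Count the alignments of a length-m sequence with a length-n sequence.

    counts[i][j] is the number of ways to align the first i characters of
    one sequence with the first j characters of the other. The last column
    is either a pair of characters, or a character over a gap, or a gap
    over a character, so the three smaller counts are added together.
    """
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (counts[i - 1][j - 1]
                            + counts[i - 1][j]
                            + counts[i][j - 1])
    return counts[m][n]


for n in (1, 2, 5, 10, 20):
    print(n, count_alignments(n, n))
```

For two sequences of length 5, such as GAATC and CATAC, this gives 1683 alignments, and the count grows very quickly with n.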

DP matrix The value in position (i,j) is the score of the best alignment of the first i positions of the first sequence versus the first j positions of the second sequence. Example: the cell for G versus CAT holds -8, the score of the alignment -G- / CAT.

DP matrix Moving horizontally in the matrix introduces a gap in the sequence along the left edge. Example: moving right from the cell holding -8 (-G- / CAT) gives -G-A / CAT-, with a score of -12.

DP matrix Moving vertically in the matrix introduces a gap in the sequence along the top edge. Example: moving down from the cell holding -8 (-G- / CAT) gives -G-- / CATA, with a score of -12.

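Initialization GAATC runs along the top edge of the matrix and CATAC along the left edge. The top-left cell holds 0.

Introducing a gap Moving one cell right from the top-left corner aligns G with a gap (G / -), for a score of -4.

DP matrix Moving one cell down from the top-left corner aligns a gap with C (- / C), for a score of -4.

DP matrix A second gap character brings the score to -8.

DP matrix Moving diagonally from the top-left corner aligns G with C (G / C), for a score of -5.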

Three legal moves A diagonal move aligns a character from the left sequence with a character from the top sequence. A vertical move introduces a gap in the sequence along the top edge. A horizontal move introduces a gap in the sequence along the left edge.

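DP matrix The first row and the first column hold multiples of the gap penalty: -4 -8 -12 -16 -20. The bottom cell of the first column, -20, is the score of ----- / CATAC. The cell for G versus C holds -5.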

DP matrix What value goes in the cell for G versus CA? There are three candidates:
-G / CA (diagonal move): -4 + 0 = -4
G- / CA (vertical move): -5 + -4 = -9
--G / CA- (horizontal move): -8 + -4 = -12
The best of the three is -4.

DP matrix Continuing in the same way fills the first two columns, read from top to bottom. Under G: -4 -5 -4 -8 -12 -16. Under the first A: -8 -9 5 1 2 -2.

DP matrix Find the optimal alignment, and its score.

DP matrix
          G    A    A    T    C
     0   -4   -8  -12  -16  -20
C   -4   -5   -9  -13  -12   -6
A   -8   -4    5    1   -3   -7
T  -12   -8    1    0   11    7
A  -16  -12    2   11    7    6
C  -20  -16   -2    7   11   17
The optimal score is 17, in the bottom-right cell.

DP in equation form Align sequence x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

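DP in equation form
\[
F(0,0) = 0, \qquad
F(i,j) = \max \begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) + d \\
F(i,\,j-1) + d
\end{cases}
\]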
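The rule for filling the matrix (take the best of a diagonal move, a vertical move and a horizontal move) can be sketched in Python as below. This is one possible implementation under stated assumptions, not the lecture's own code: the function names are hypothetical, the substitution scores (10 for a match, 0 for A/G or C/T, -5 otherwise) are reconstructed from the matrix values in the slides, and the traceback simply prefers diagonal moves when several moves tie.

```python
PURINES = {"A", "G"}


def s(a, b):
    """Assumed substitution scores: 10 match, 0 A<->G or C<->T, else -5."""
    if a == b:
        return 10
    return 0 if (a in PURINES) == (b in PURINES) else -5


def fill(x, y, d=-4):
    """Return F, where F[i][j] is the best score for x[:i] versus y[:j]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        F[i][0] = i * d  # x[:i] aligned entirely against gaps
    for j in range(1, len(y) + 1):
        F[0][j] = j * d  # y[:j] aligned entirely against gaps
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # diagonal
                          F[i - 1][j] + d,   # vertical: gap in y
                          F[i][j - 1] + d)   # horizontal: gap in x
    return F


def traceback(F, x, y, d=-4):
    """Walk back from the bottom-right cell to recover one best alignment."""
    i, j = len(x), len(y)
    aligned_x, aligned_y = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


x, y = "CATAC", "GAATC"  # x runs down the left edge, y along the top edge
F = fill(x, y)
print(F[len(x)][len(y)])  # 17
aligned_x, aligned_y = traceback(F, x, y)
print(aligned_y)
print(aligned_x)
```

Several different alignments reach the score of 17 in this example, so the alignment printed depends on how ties are broken during the traceback.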

Summary Scoring a pairwise alignment requires a substitution matrix and gap penalties. Dynamic programming is an efficient algorithm for finding the optimal alignment. Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions. DP iteratively fills in the matrix using a simple mathematical rule.

A simple example
Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5.
     A    C    G    T
A    2   -7   -5   -7
C   -7    2   -7   -5
G   -5   -7    2   -7
T   -7   -5   -7    2

Some useful Python tidbits

sys.argv
sys.argv is a list containing the strings given on the command line.
Write a program that adds up all of the numbers on the command line.
> add-numbers.py 1 2 3
6
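One possible solution to the exercise, as a sketch. The function name add_numbers is hypothetical. The function reads sys.argv by default, and the last line passes an explicit list to simulate the command line "add-numbers.py 1 2 3" so that the example runs anywhere.

```python
import sys


def add_numbers(argv=None):
    """Sum the numbers on the command line; argv[0] is the program name."""
    if argv is None:
        argv = sys.argv
    # Command-line arguments arrive as strings, so convert each one first.
    return sum(float(argument) for argument in argv[1:])


# In a real script this would be: print("%g" % add_numbers())
print("%g" % add_numbers(["add-numbers.py", "1", "2", "3"]))  # 6
```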

sys.stdout versus sys.stderr sys.stdout and sys.stderr are two different streams to print to. Use sys.stdout for the primary output of your program. Use sys.stderr to report errors or give status updates.
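A small sketch of the two streams in use. The function name report and its messages are made up for illustration. The result goes to sys.stdout and the status message goes to sys.stderr, so redirecting the program's output to a file captures only the result.

```python
import sys


def report(values, out=sys.stdout, err=sys.stderr):
    """Write a status update to err and the primary result to out."""
    err.write("Read %d values.\n" % len(values))  # status update
    out.write("%g\n" % sum(values))               # primary output


report([1, 2, 3])
```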

write() versus print() The write() function differs from print(): write() does not automatically add an end-of-line; write() requires that you specify the name of the file or stream to be written to; write() only accepts a single string.
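The three differences can be seen side by side in this short sketch, written for Python 3. The function name show_difference is hypothetical. Both calls produce the same line of output.

```python
import sys


def show_difference(stream=sys.stdout):
    """Write the same line twice, once with print() and once with write()."""
    # print() accepts several values, separates them with a space,
    # and adds the end-of-line itself.
    print("score", 17, file=stream)
    # write() is called on a specific stream, takes exactly one string,
    # and needs the end-of-line spelled out.
    stream.write("score " + str(17) + "\n")


show_difference()
```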

% The “%” operator substitutes values into a string based on the presence of format strings.
%s = string
%d = integer
%g = float
Place “%” between a string and a tuple of values.
>>> "%s %s %s" % ("larry", "curly", "moe")
'larry curly moe'
>>> "%d + %d" % (21, 15)
'21 + 15'

Using sys.stderr
Write a program to divide one number by a second number.
> divide.py 8 3
2.66667
Print the usage message unless exactly two numbers are given.
> ./divide.py
USAGE: divide.py <value1> <value2>
Divide <value1> by <value2> and report the result.
Stop and print an error if the second number is zero.
> divide.py 8 0
Divide by zero error.
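A sketch of one way to write divide.py. The structure (a main function that returns an exit status and takes the streams as parameters) is an assumption made so the example is easy to try out, not something the exercise requires. The last line simulates the command line "divide.py 8 3".

```python
import sys

USAGE = ("USAGE: divide.py <value1> <value2>\n"
         "Divide <value1> by <value2> and report the result.\n")


def main(argv=None, out=sys.stdout, err=sys.stderr):
    """Divide argv[1] by argv[2]; return 0 on success and 1 on error."""
    if argv is None:
        argv = sys.argv
    if len(argv) != 3:  # program name plus exactly two numbers
        err.write(USAGE)
        return 1
    numerator, denominator = float(argv[1]), float(argv[2])
    if denominator == 0:
        err.write("Divide by zero error.\n")
        return 1
    out.write("%g\n" % (numerator / denominator))
    return 0


# In a real script this would be: sys.exit(main())
main(["divide.py", "8", "3"])  # writes 2.66667
```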

Using write() and %
Write a program that prints the command line arguments with “+” signs between, and then reports their sum.
> add-numbers2.py 1 2 3
1 + 2 + 3 = 6
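One possible solution, again as a sketch with a hypothetical function name. It uses the string method join() to place the plus signs, which is a choice made here rather than something the exercise prescribes. The last line simulates the arguments 1 2 3.

```python
import sys


def format_sum(arguments):
    """Return a line such as '1 + 2 + 3 = 6' for a list of number strings."""
    total = sum(float(argument) for argument in arguments)
    # %s takes the joined arguments and %g takes the numeric total.
    return "%s = %g\n" % (" + ".join(arguments), total)


# In a real script this would be: sys.stdout.write(format_sum(sys.argv[1:]))
sys.stdout.write(format_sum(["1", "2", "3"]))  # 1 + 2 + 3 = 6
```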

One-minute response At the end of each class: Write for about one minute. Provide feedback about the class. Was part of the lecture unclear? What did you like about the class? Do you have unanswered questions? Sign your name. I will begin the next class by responding to the one-minute responses.