Lecture 2: Sequence Alignment BMI/IBGP 705 Kun Huang Department of Biomedical Informatics Ohio State University.

Presentation transcript:

Lecture 2: Sequence Alignment BMI/IBGP 705 Kun Huang Department of Biomedical Informatics Ohio State University

Major issues in genomics. Homology. Alignment as an optimization problem. Dynamic programming. BLAST. Tools and examples (in the lab session).

“I think …” Charles Darwin

Homology: A Working Definition. Sequences or structures that share a common ancestor.

Homology Defined. "The same organ in different animals under a variety of form and function." Sir Richard Owen, Lectures on the Comparative Anatomy and Physiology of the Invertebrate Animals. "The mechanism of homology is heredity." Allan Boyden, Homology and Analogy: A century after the definitions of "homologue" and "analogue" of Richard Owen. "Homology is a relation bearing on recency of common ancestry." Olivier Rieppel, Homology and logical fallacy.

Sequence Homology. Genes in separate species that derive from the same ancestral gene in the last common ancestor of those two species are orthologs. Related genes that result from a gene duplication event within a single genome, and are likely to have diverged in function, are paralogs. Both orthologs and paralogs are homologs, a general term that covers both types of relationship.

Recognizing Sequence Homology. Relies primarily on understanding random sequence similarity. Only by knowing what random similarity looks like can we tell when two sequences are significantly similar. Understanding mutational regularity and sequence evolution increases the significance: (1) closely related sequences: transitions/transversions; (2) distantly related sequences: PAM mutation probabilities. Even distantly related sequences can be recognized. "Significant similarity" is not a definition of homology.

Databases: GenBank, EMBL, DDBJ, SWISSPROT, …

Major issues in genomics. Homology. Format. Search. Alignment as an optimization problem. Dynamic programming. BLAST. Tools and examples.

Aligning Text Strings
Strings: T C A T G and C A T T G
2 matches, 0 gaps:
    T C A T G
          | |
    C A T T G
3 matches, 2 end gaps:
    T C A T G
      | | |
      C A T T G
4 matches, 1 inserted gap:
    T C A - T G
      | |   | |
      C A T T G
4 matches, 1 inserted gap:
    T C A T - G
      | | |   |
      C A T T G
Optimal solution: with respect to what criteria / cost function?

Alignment as an Optimization Problem. Optimization criteria / cost function. Parameters to be adjusted. Search algorithm / process: exhaustive testing; suboptimal solutions; computational cost / complexity. Statistical significance.

Alignment as an Optimization Problem: optimization criteria / cost function. What sort of alignment should be considered? Scoring system (maximize the score). Additive model: based on probability compared with random sequence (PAM, BLOSUM); assumes that positions are independent; more complicated cases exist. Gap penalty for a gap of length g: linear, s = -gd; affine, s = -d - (g-1)e, where d is the penalty for opening a gap and e is the penalty for each position that extends it.
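
The two gap penalties on this slide can be written out as a minimal sketch. The function names and the numeric values in the usage note are illustrative assumptions, not taken from the lecture.

```python
def linear_gap(g: int, d: int) -> int:
    """Score contribution of a gap of length g under a linear penalty: s = -g*d."""
    return -g * d


def affine_gap(g: int, d: int, e: int) -> int:
    """Score contribution of a gap of length g under an affine penalty:
    s = -d - (g-1)*e, with d the gap-open and e the gap-extension penalty."""
    if g == 0:
        return 0  # no gap, no penalty
    return -d - (g - 1) * e
```

With hypothetical values d = 8 for the linear model and d = 12, e = 2 for the affine model, a gap of length 3 scores -24 and -16 respectively: the affine model makes opening a gap expensive but extending it cheap.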

Alignment as an Optimization Problem: parameters to be adjusted. Shift; number of gaps; position of gaps.
3 matches, 2 end gaps:
    T C A T G
      | | |
      C A T T G
4 matches, 1 inserted gap:
    T C A - T G
      | |   | |
      C A T T G

Alignment as an Optimization Problem: search algorithm / process. Exhaustive testing: try all possible configurations of the parameters. E.g., for sequence a with length m and sequence b with length n, try all m+n shifts. In O(·) notation the number of shifts is O(m+n); comparing the overlapping characters at each shift takes up to min(m, n) steps, so the total running time is O((m+n)·min(m, n)).
2 matches, 0 gaps:
    T C A T G
          | |
    C A T T G
3 matches, 2 end gaps:
    T C A T G
      | | |
      C A T T G
0 matches, 0 gaps: T C A T G and C A T T G at a shift where no aligned positions agree.
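
A minimal sketch of the exhaustive ungapped search described on this slide: slide one string past the other and count matches at every shift. The function names and the offset convention (b[0] placed under a[offset]) are assumptions made for illustration.

```python
def count_matches(a: str, b: str, offset: int) -> int:
    """Count identical characters when b[0] is placed under a[offset]."""
    matches = 0
    for j, ch in enumerate(b):
        i = j + offset
        if 0 <= i < len(a) and a[i] == ch:
            matches += 1
    return matches


def best_shift(a: str, b: str) -> tuple[int, int]:
    """Try every shift with at least one overlapping position.

    Returns (matches, offset) for the best shift; ties go to the larger offset.
    """
    offsets = range(-(len(b) - 1), len(a))
    return max((count_matches(a, b, k), k) for k in offsets)


print(best_shift("TCATG", "CATTG"))  # (3, 1): three matches, one-position shift
```

For the slide's strings the unshifted comparison (offset 0) gives 2 matches and a shift of one position gives 3, which is the best that can be done without gaps.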

Alignment as an Optimization Problem: search algorithm / process. Computational cost / complexity: what if we allow gaps?
2 matches, 0 gaps:
    T C A T G
          | |
    C A T T G
3 matches, 2 end gaps:
    T C A T G
      | | |
      C A T T G
0 matches, 0 gaps: T C A T G and C A T T G at a shift where no aligned positions agree.

Many possible alignments to consider. Without gaps, there are n+m possible alignments between sequences of length n and m. Once we start allowing gaps, there are many possible arrangements to consider:
    abcbcd   abcbcd   abcbcd
    |||  |   |  |||   ||  ||
    abc--d   a--bcd   ab--cd
This becomes a very large number when we allow mismatches, since we then need to look at every possible pairing between elements: there are roughly n^m possible alignments.

Exponential computations get big fast. If n = m = 100, there are 100^100 = 10^200 (a 1 followed by 200 zeros) different alignments. And 100 amino acids is a small protein!
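
The size of this number is easy to check with Python's arbitrary-precision integers. The rate of one billion alignments scored per second is a hypothetical figure chosen only to give a sense of scale.

```python
n = m = 100
count = n ** m                     # the slide's rough estimate n^m
assert count == 10 ** 200

digits = len(str(count))           # a 1 followed by 200 zeros

rate = 10 ** 9                     # hypothetical: alignments scored per second
seconds_per_year = 60 * 60 * 24 * 365
years = count // (rate * seconds_per_year)

print(digits)                      # 201
print(len(str(years)))             # 184, i.e. more than 10^183 years
```

Even at that rate, enumerating every alignment is out of the question, which is why a smarter search procedure is needed.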

Alignment as an Optimization Problem: statistical significance. Not only are there many possible gapped alignments, but introducing too many gaps makes nonsense alignments possible:
    s--e-----qu---en--ce
    sometimesquipsentice
We need to distinguish between alignments that occur due to homology and those that could be expected to be seen just by chance. Define a score function that accounts for statistical significance (logarithmic scale: multiplication of odds becomes addition of scores).
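
A small sketch of why a logarithmic scale is convenient: the odds that each aligned pair arises from homology rather than chance multiply across columns, so their logarithms add. The pair and background frequencies below are made-up numbers used only for illustration.

```python
import math


def log_odds(p_ab: float, q_a: float, q_b: float) -> float:
    """Log-odds score (in bits) of seeing residues a and b aligned.

    p_ab: frequency of the pair in alignments of related sequences
    q_a, q_b: background frequencies of a and b in random sequence
    """
    return math.log2(p_ab / (q_a * q_b))


# Hypothetical (p_ab, q_a, q_b) values for two alignment columns.
columns = [(0.02, 0.05, 0.05), (0.001, 0.05, 0.04)]

total_score = sum(log_odds(p, qa, qb) for p, qa, qb in columns)
product_of_odds = math.prod(p / (qa * qb) for p, qa, qb in columns)

# Adding the scores is the same as taking the log of the multiplied odds.
assert math.isclose(total_score, math.log2(product_of_odds))
```

A pair seen more often than chance predicts gets a positive score and a pair seen less often gets a negative one, which is the form PAM and BLOSUM entries take.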

Major issues in genomics. Homology. Alignment as an optimization problem. Dynamic programming. BLAST. Tools and examples.

Dot matrix sequence comparison. Write one sequence across the top of the matrix and the other down the left side, then put a dot in row i, column j wherever the character in row i equals the character in column j. Examples below: DNA and amino acid sequences of the phage λ cI (vertical axis) and phage P22 c2 (horizontal axis) repressors.
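
A minimal text-mode sketch of the dot matrix idea. The function name is an assumption, and the two short strings are borrowed from the dynamic programming example later in the lecture rather than from the repressor figures.

```python
def dot_matrix(top: str, left: str) -> list[str]:
    """Return one text row per character of `left`.

    A '*' marks every column whose character in `top` equals the row's
    character; '.' marks everything else.
    """
    return ["".join("*" if r == c else "." for c in top) for r in left]


top, left = "ATGCT", "ACCGCT"
print("  " + top)
for label, row in zip(left, dot_matrix(top, left)):
    print(label + " " + row)
```

Runs of matches show up as diagonal lines of marks; real dot plot tools usually add a window size and stringency filter to suppress isolated chance matches.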

Dynamic programming. The name comes from an operations research task, and has nothing to do with writing programs. "Programming" here means using a tabular structure for computing. The key idea is to start aligning the sequences left to right; once a prefix is optimally aligned, nothing about the remainder of the alignment changes the alignment of the prefix. We construct a matrix of possible alignment scores (n×m^2 calculations in the worst case; with a fixed penalty per gap position each cell depends only on its three neighbours, so n×m calculations suffice) and then "traceback" to find the optimal alignment. The method is called Needleman-Wunsch (for global matching) or Smith-Waterman (for local matching).

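Dynamic programming matrix. Each cell has the score for the best aligned sequence prefix up to that position. Example: ATGCT vs. ACCGCT. Match: +2, mismatch: 0, gap: -1.
[Table: matching matrix with columns Gap, A, T, G, C, T and rows Gap, A, C, C, G, C, T; the cell values did not survive in the transcript.]
Note: this is the matching matrix, NOT the dynamic programming matrix!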

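Dynamic programming matrix for ATGCT (across the top) vs. ACCGCT (down the side), filled in with match +2, mismatch 0, gap -1. The match score used for the diagonal step is shown in parentheses.
         Gap   A      T      G      C      T
    A    -1    2 (2)  1 (0)  0      -1     -2
    C    -2    1 (0)  2 (0)  1      2      1
    C    -3    0      1 (0)  2      3      2
    G    -4    -1     0      3 (2)  2      3
    C    -5    -2     -1     2      5 (2)  4
    T    -6    -3     0      1      4      7 (2)
Candidate alignments of the prefixes AT and AC, one for each way of reaching a cell:
    A T      A T _      A _ T
    A C      A _ C      A C _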

Optimal alignment by traceback. We "traceback" a path that gets us the highest score. If we don't have "end gap" penalties, then take any path from the last row or column to the first. Otherwise the path must include the top-left and bottom-right corners.
         Gap   A      T      G      C      T
    A    -1    2 (2)  1 (0)  0      -1     -2
    C    -2    1 (0)  2 (0)  1      2      1
    C    -3    0      1 (0)  2      3      2
    G    -4    -1     0      3 (2)  2      3
    C    -5    -2     -1     2      5 (2)  4
    T    -6    -3     0      1      4      7 (2)
Two optimal alignments:
    AT-GCT      A-TGCT
    ACCGCT      ACCGCT
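
The matrix fill and traceback can be sketched in a few lines. This is a minimal global alignment with the slide's scoring (match +2, mismatch 0, gap -1) and penalized end gaps; the function name and the tie-breaking order in the traceback (diagonal first, then a gap in the top sequence) are assumptions, so it reports only one of the equally good alignments.

```python
def needleman_wunsch(top: str, left: str, match=2, mismatch=0, gap=-1):
    """Global alignment by dynamic programming.

    Returns (F, aligned_top, aligned_left), where F[i][j] is the best score
    for aligning left[:i] with top[:j].
    """
    n, m = len(left), len(top)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: leading gaps in top
        F[i][0] = i * gap
    for j in range(1, m + 1):          # first row: leading gaps in left
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if left[i - 1] == top[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align the two characters
                          F[i - 1][j] + gap,     # gap in top
                          F[i][j - 1] + gap)     # gap in left

    # Traceback from the bottom-right corner to the top-left corner.
    out_top, out_left = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if left[i - 1] == top[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_top.append(top[j - 1]); out_left.append(left[i - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_top.append("-"); out_left.append(left[i - 1])
            i -= 1
        else:
            out_top.append(top[j - 1]); out_left.append("-")
            j -= 1
    return F, "".join(reversed(out_top)), "".join(reversed(out_left))


F, a, b = needleman_wunsch("ATGCT", "ACCGCT")
print(F[-1][-1])   # 7
print(a)           # A-TGCT
print(b)           # ACCGCT
```

The other optimal alignment, AT-GCT over ACCGCT, has the same score of 7; a traceback that prefers a different move at ties would return it instead.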

Dynamic programming. Global alignment: an alignment of two or more sequences that matches as many characters as possible in all of the sequences (Needleman-Wunsch algorithm). Local alignment: an alignment that includes only the best matching, highest-scoring regions in two or more sequences (Smith-Waterman algorithm). Difference: all the scores are kept in the dynamic programming matrix for global alignment; only the positive scores are kept in the dynamic programming matrix for local alignment, and the negative ones are converted to zero.
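
The difference described here amounts to one extra term in the recurrence. Below is a minimal sketch of local alignment scoring that returns only the best score, not the alignment itself. The function name and the scoring values (match +2, mismatch -1, gap -1) are assumptions: a negative mismatch score is used instead of the earlier example's 0 so that poorly matching regions are actually cut off.

```python
def smith_waterman_score(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Best local alignment score between a and b.

    Same recurrence as global alignment, except that every cell is floored
    at zero and the answer is the largest cell anywhere in the matrix.
    """
    best = 0
    prev = [0] * (len(b) + 1)          # first row is all zeros
    for i in range(1, len(a) + 1):
        cur = [0]                      # first column is all zeros
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,              # negative scores are reset to zero
                       prev[j - 1] + s,
                       prev[j] + gap,
                       cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


print(smith_waterman_score("TTTGCTAAA", "CCCGCTCCC"))  # 6, from the shared GCT
```

Only two rows of the matrix are kept because the sketch does not trace back; recovering the aligned region would require the full matrix and a traceback that stops at the first zero.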

Major issues in genomics. Homology. Alignment as an optimization problem. Dynamic programming. BLAST. Tools and examples.

Sequence alignment (BLAST) The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

BLAST: Algorithm Intuition. The BLAST algorithm is a heuristic search method that seeks words of length W (default = 3 in blastp) that score at least T when aligned with the query and scored with a substitution matrix. Words in the database that score T or greater are extended in both directions in an attempt to find a locally optimal ungapped alignment or HSP (high-scoring pair) with a score of at least S or an E value lower than the specified threshold. HSPs that meet these criteria will be reported by BLAST, provided they do not exceed the cutoff value specified for the number of descriptions and/or alignments to report.
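
The seed-and-extend idea can be illustrated with a heavily simplified sketch. It is not BLAST's implementation: it seeds on exact word matches only (no neighborhood threshold T), uses a toy match/mismatch score on DNA, and stops extending once the running score drops more than `x_drop` below the best seen. All names and default values are assumptions.

```python
from collections import defaultdict


def pair_score(x: str, y: str, match=5, mismatch=-4) -> int:
    return match if x == y else mismatch


def index_words(seq: str, w: int) -> dict:
    """Map every length-w word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def extend_hit(query, subject, qi, si, w, x_drop=10):
    """Extend a word hit without gaps in both directions.

    Returns (score, query_start, query_end, subject_start) of the best
    segment that contains the seed.
    """
    best = sum(pair_score(query[qi + k], subject[si + k]) for k in range(w))
    q_start, q_end = qi, qi + w
    run, i, j = best, qi + w, si + w               # extend to the right
    while i < len(query) and j < len(subject) and best - run <= x_drop:
        run += pair_score(query[i], subject[j])
        i += 1; j += 1
        if run > best:
            best, q_end = run, i
    run, i, j = best, qi - 1, si - 1               # extend to the left
    while i >= 0 and j >= 0 and best - run <= x_drop:
        run += pair_score(query[i], subject[j])
        if run > best:
            best, q_start = run, i
        i -= 1; j -= 1
    return best, q_start, q_end, si - (qi - q_start)


def find_hsps(query, subject, w=4, min_score=30):
    """Return the distinct high-scoring pairs reaching at least min_score."""
    index = index_words(subject, w)
    hsps = set()
    for qi in range(len(query) - w + 1):
        for si in index.get(query[qi:qi + w], []):
            hsp = extend_hit(query, subject, qi, si, w)
            if hsp[0] >= min_score:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)


print(find_hsps("AAAATGCGTACCC", "GGGATGCGTATTT"))  # [(35, 3, 10, 3)]
```

Four different seed words fall inside the shared region ATGCGTA, and each one extends to the same segment, so only one HSP is reported.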

BLAST – Algorithm Intuition Databases are pre-indexed by the words. Without gaps: Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J., J. Mol. Biol. (1990) 215: With gaps: Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., Lipman, D. J., Nucleic Acids Research (1997) 25(17):

BLAST: Scoring Matrices. DNA scoring matrix (substitution matrix):
         A    T    G    C
    A    5   -4   -4   -4
    T   -4    5   -4   -4
    G   -4   -4    5   -4
    C   -4   -4   -4    5
Example:
    ATTTAGCCG
    ACTTGGCCT
Score = 5 x 6 + (-4) x 3 = 18 (6 matches, 3 mismatches)
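
The arithmetic on this slide can be reproduced with a short sketch; the function name is an assumption, and the scores +5 and -4 are the ones used in the slide's own calculation.

```python
def score_ungapped(a: str, b: str, match=5, mismatch=-4) -> int:
    """Sum the substitution scores over an ungapped alignment of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


print(score_ungapped("ATTTAGCCG", "ACTTGGCCT"))  # 18 = 5*6 + (-4)*3
```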

BLAST: Scoring Matrices. Choosing a scoring matrix is relatively easy for DNA and harder for protein. PAM (Percent Accepted Mutation) matrices: predicted matrices, most sensitive for alignments of sequences with evolutionarily related homologs. The greater the number in the matrix name, the greater the expected evolutionary (mutational) distance, i.e. PAM30 would be used for alignments expected to be more closely related in evolution than an alignment performed using the PAM250 matrix. BLOSUM (Blocks Substitution Matrix): calculated matrices, most sensitive for local alignment of related sequences, ideal when trying to identify an unknown nucleotide sequence. BLOSUM62 is the default matrix set by the BLAST search tool.

BLAST: Parameters. Word size: for MegaBLAST, word sizes between W = 16 and 64 can be used. Expect (E) value: a statistically based notion that compares the match with what random sequence would produce; it is the number of alignments scoring at least as well that would be expected by chance. The larger the score, the smaller the E value and the more significant the result. Percent identity; match/mismatch scores.
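
The slide gives no formula for the Expect value. The sketch below assumes the standard Karlin-Altschul form E = K·m·n·e^(-λS), where S is the raw score, m the query length and n the database length. The default values of `lam` and `k` are illustrative placeholders; real values depend on the scoring matrix and gap penalties in use.

```python
import math


def expect_value(raw_score: float, query_len: int, db_len: int,
                 lam: float = 0.318, k: float = 0.134) -> float:
    """Expected number of chance alignments scoring at least raw_score.

    Assumes E = K * m * n * exp(-lambda * S); lam and k are illustrative.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


low = expect_value(40, query_len=300, db_len=10**8)
high = expect_value(80, query_len=300, db_len=10**8)
assert high < low   # a larger score gives a smaller E value
```

The formula also shows why the same score is less significant in a larger database: E grows in proportion to the search space m·n.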

BLAST: Program Selection. Nucleotide: quickly search for highly similar sequences (megablast); quickly search for divergent sequences (discontiguous megablast); nucleotide-nucleotide BLAST (blastn); search for short, nearly exact matches; search trace archives with megablast or discontiguous megablast. Protein: protein-protein BLAST (blastp); position-specific iterated and pattern-hit initiated BLAST (PSI- and PHI-BLAST); search for short, nearly exact matches; search the conserved domain database (rpsblast); protein homology by domain architecture (cdart).