Welcome to Introduction to Bioinformatics

Presentation transcript:

Welcome to Introduction to Bioinformatics I. Scenario 4: Sequence alignment
- Bring up course web site
- Go to Scenario 4
- Open the first sequence alignment notes

Scenario 3: Our Story
You: Our first defense at CDC
Outbreak: . . . Anthrax?
Samples: Confirm agent, identify strain

Scenario 3: Our Story Toxin gene-specific primers

Scenario 3: Our Story If DNA from bacterium with toxin gene If DNA NOT from bacterium with toxin gene? PCR

Scenario 3: Our Story If DNA from bacterium with toxin gene If DNA NOT from bacterium with toxin gene? PCR (no product)

Scenario 3: Our Story
DG47 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
>gi|16031490|emb|AJ413935.1|BAN413935 Bacillus anthracis partial lef gene, isolate Microsoft-6259
Length = 2417
Score = 155 bits (78), Expect = 2e-35
Identities = 138/158 (87%)
Strand = Plus / Plus
Query: 1    aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1267 aatattgacgctttactacatcagtccatcggaagtacgttgtataataaaatatatctg 1326
Query: 61   tatgaaaacatgaatataaataacttaacagcaacgttaggtgccgatttagtagattcc 120
            |||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||
Sbjct: 1327 tatgaaaacatgaatataaataacctaacagcaacgttaggtgccgatttagtagattcc 1386
Query: 121  acagataatacaaaaattaatcgaggtatattcaatga 158
            ||||||||||||||||||||||||||||||||||||||
Sbjct: 1387 acagataatacaaaaattaatcgaggtatattcaatga 1424

Scenario 3: Our Story PCR Toxin gene present

Scenario 3: Our Story DG47 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG Do it!

Scenario 3: Our Story Maybe it’s not from the toxin gene??

Scenario 3: Our Story
DG47 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG
Translate
NIDALLHQSIGSTLYNKIYLYENMNINNLTATLGADLVDSTDNTKINRGIFNEFKKNFKYSIS
Do it!
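The translation step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in class: it assumes the standard genetic code and reading frame 1, and the names `CODON_TABLE` and `translate` are made up for this example. The input is the 64-nucleotide DG47 fragment shown on the slide.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Unknown codons (for example ones containing N) become X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# The DG47 fragment as printed on the slide (64 nucleotides).
dg47_fragment = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG"
print(translate(dg47_fragment))
```

The printed peptide should agree with the start of the amino acid sequence on the slide; the slide's full translation comes from the complete 190-nucleotide PCR product.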

DG47 nucleotide sequence: Matches nothing in GenBank
DG47 amino acid sequence: 100% match to toxin gene

Scenario 3: Our Story Compare nucleotide sequences by hand DG47 vs lef Do it!

Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47         1  AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
                |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
lef gene  1831  AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG

DG47        61  TATGAAAACATGAATATAAATAACTTAACAGCAACGTTAGGTGCCGATTTAGTAGATTCC
                |||||||| |||||||| |||||| | ||||||||  ||||||| |||||||| ||||||
lef gene  1891  TATGAAAATATGAATATCAATAACCTTACAGCAACCCTAGGTGCGGATTTAGTTGATTCC

DG47       121  ACAGATAATACAAAAATTAATCGAGGTATATTCAATGAGTTCAAAAAAAATTTCAAATAC
                || |||||||| ||||||||| ||||||| |||||||| |||||||||||||||||||||
lef gene  1951  ACTGATAATACTAAAATTAATAGAGGTATTTTCAATGAATTCAAAAAAAATTTCAAATAT

DG47       181  AGTATTTCTA
                ||||||||||
lef gene  2011  AGTATTTCTA

89% identical!
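Counting matches by hand is easy to get wrong, so here is a small sketch of the same bookkeeping in Python. It assumes the two sequences are already aligned with no gaps and have equal length; the function name `percent_identity` is invented for this example. Only the first 60-column row of the DG47 vs lef alignment is used, so the figure it reports is for that row, not for the whole 190-nucleotide comparison.

```python
def percent_identity(seq_a, seq_b):
    """Return (matching positions, percent identity) for two aligned, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches, 100.0 * matches / len(seq_a)


# First row of the hand alignment: DG47 positions 1-60 vs lef positions 1831-1890.
dg47_row1 = "AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG"
lef_row1 = "AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG"

matches, pct = percent_identity(dg47_row1, lef_row1)
print(f"{matches} of {len(dg47_row1)} positions match ({pct:.1f}%)")
```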

Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47 AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTGTATG + lef gene
Sequence 1: lcl|PCR Product DG47 (Length 190)
Sequence 2: lcl|M29081: Bacillus anthracis lethal factor (lef) gene, 1831-2020 (Length 190)
No significant similarity was found

Scenario 3: Our Story
DG47         1  AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
                |||||||| |||||| ||||||| ||||| |||||||| ||||| |||||||| ||| ||
lef gene  1831  AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG
89% identical!
Why can’t Blast figure out what you can plainly see?
Sequence 1: lcl|PCR Product DG47 (Length 190)
Sequence 2: lcl|M29081: Bacillus anthracis lethal factor (lef) gene, 1831-2020 (Length 190)
No significant similarity was found

Scenario 3: How does Blast work?
Clearly we need to understand more about how sequence alignment really works!
- Theory behind nucleotide vs nucleotide Blast
- Working BlastN program
- Theory behind protein-protein Blast
- How to get Blast to do what you want

“Flavours” of sequence alignment
Global Alignment
 - Needleman-Wunsch algorithm
 - Compares two sequences across their whole length
 - Mostly only useful when you already know sequences might be similar
 - Not useful for comparing a short query to an entire genome
 - Not discussed further in this class
Local Alignment
 - Allows alignment of subsequences of the target and the query
 - Usually what we want; the query can be searched against entire genomes or large databases

Crude Local Alignment Methods
The “Dot Matrix” method (Gibbs and McIntyre, 1970)
 - Represents the query and target sequences as a matrix (a two-dimensional array) using a sliding window of similarity
 - The human eye can powerfully distinguish the identity line from the noise

The “Dot Matrix” method (Gibbs and McIntyre, 1970)
Normally a “window size” and a “stringency” are specified. For example, if the window size is 8 and the stringency is 6, a dot is only placed if at least 6 of the current 8 positions in the query match the target.
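The window and stringency rule can be sketched as follows. This is an illustrative implementation only, not the program used in class: the function names `dot_matrix` and `render` are invented here, and a dot is indexed by the start positions of the two windows (0-based).

```python
def dot_matrix(query, target, window, stringency):
    """Return the set of (query_start, target_start) window pairs that earn a dot."""
    dots = set()
    for i in range(len(query) - window + 1):
        for j in range(len(target) - window + 1):
            # Count matching positions inside the two aligned windows.
            matches = sum(query[i + k] == target[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(query, target, window, stringency):
    """Draw the dot matrix as text: '*' for a dot, '.' for an empty box."""
    dots = dot_matrix(query, target, window, stringency)
    width = len(target) - window + 1
    lines = ["  " + " ".join(target[:width])]
    for i in range(len(query) - window + 1):
        row = " ".join("*" if (i, j) in dots else "." for j in range(width))
        lines.append(query[i] + " " + row)
    return "\n".join(lines)


# Compare a short sequence with itself using window = 2 and stringency = 2.
print(render("GTAATA", "GTAATA", window=2, stringency=2))
```

Comparing a sequence with itself gives the full identity diagonal plus a pair of off-diagonal dots wherever a window repeats (here the two copies of TA). Note that every window pair is still examined, which is the m*n cost that makes the method slow on long sequences.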

The “Dot Matrix” method (Gibbs and McIntyre, 1970)
[Figure: dot matrix involving the sequence G T A A T A, with window = 2 and stringency = 2]

Problems with the Dot Matrix method
- Requires human supervision!
- A memory and processor time pig (a complete m*n matrix is calculated each time)
- No explicit handling of gaps
- No good quantitative score of alignment quality

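The Smith-Waterman Algorithm (no gaps version)
Match Extension = +1
NoMatch Penalty = -2
Negative values are reset to zero!!

      G  G  T  A  A  T  A
   G  1  1  .  .  .  .  .
   G  1  2  .  .  .  .  .
   T  .  .  3  .  .  1  .
   A  .  .  .  4  1  .  2
   C  .  .  .  .  2  .  .
   T  .  .  1  .  .  3  .
   A  .  .  .  2  1  .  4

(Empty boxes, shown as dots, hold zero.)
Download SmithWaterman1.py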
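The scoring rule on this slide can be sketched in Python as below. This is not the course's SmithWaterman1.py, which is not reproduced in this transcript; it is an independent minimal sketch. The function name `sw_no_gaps` is made up, and the example sequences GGTACTA and GGTAATA are an assumption read off the letters and scores that survive from the slide's matrix.

```python
def sw_no_gaps(query, target, match=1, mismatch=-2):
    """Fill the gapless Smith-Waterman matrix; row and column 0 are a zero border."""
    scores = [[0] * (len(target) + 1) for _ in range(len(query) + 1)]
    for i, q_base in enumerate(query, start=1):
        for j, t_base in enumerate(target, start=1):
            step = match if q_base == t_base else mismatch
            # Extend the diagonal run; negative values are reset to zero.
            scores[i][j] = max(0, scores[i - 1][j - 1] + step)
    return scores


query, target = "GGTACTA", "GGTAATA"  # assumed example sequences
matrix = sw_no_gaps(query, target)

print("    " + "  ".join(target))
for base, row in zip(query, matrix[1:]):
    print(base + "  " + " ".join(f"{value:2d}" for value in row[1:]))
print("best score:", max(max(row) for row in matrix))
```

Each box depends only on the box diagonally above and to the left, so without gaps the matrix is just a set of independent diagonal runs.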

Smith-Waterman – Dynamic Programming
An optimal alignment can be found starting from the highest scoring box and working backwards.
Dynamic Programming is a method for recording the solutions to subproblems, then working backwards to find an overall solution.
If we incorporate gaps, we must start keeping track of this “traceback” pathway.

The Smith-Waterman Algorithm (with gaps)
Match Extension = +1
NoMatch Penalty = -2
Gap Penalty = -3
Take the Max of: 0; adding Query Gap; adding Target Gap; Match/No match
[Figure: partially filled scoring matrix with G G T A A T A across the top and G, G, T, A, C, T, A down the side]
Download SmithWaterman2.py
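A gapped version with traceback might look like the sketch below. It is not the course's SmithWaterman2.py; it is an independent illustration using the slide's scores (match +1, mismatch -2, gap -3). The function name `smith_waterman`, the pointer labels, the tie-breaking order, and the example sequences GGTACTA and GGTAATA are all assumptions made for this example.

```python
def smith_waterman(query, target, match=1, mismatch=-2, gap=-3):
    """Local alignment with linear gap penalty; returns (score, aligned query, aligned target)."""
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    move = [[None] * cols for _ in range(rows)]  # traceback pointers
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == target[j - 1] else mismatch
            candidates = [
                (0, None),                          # start a fresh alignment here
                (score[i - 1][j - 1] + step, "diag"),  # match / no match
                (score[i - 1][j] + gap, "up"),      # query base against a gap in the target
                (score[i][j - 1] + gap, "left"),    # target base against a gap in the query
            ]
            # max() keeps the first candidate on ties, so zero wins over a zero-scoring move.
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Work backwards from the highest scoring box until a zero box is reached.
    i, j = best_pos
    q_aln, t_aln = [], []
    while move[i][j] is not None:
        if move[i][j] == "diag":
            q_aln.append(query[i - 1]); t_aln.append(target[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            q_aln.append(query[i - 1]); t_aln.append("-")
            i -= 1
        else:
            q_aln.append("-"); t_aln.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(q_aln)), "".join(reversed(t_aln))


print(smith_waterman("GGTACTA", "GGTAATA"))
```

When several boxes share the top score, this sketch traces back from the first one found in row order; other implementations may report a different, equally good local alignment.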

Problems with Smith-Waterman
Still a pig! Memory and processor time requirements are huge when the query and/or the database gets large (a complete m*n matrix is still calculated each time!!)
Do we really need to calculate the whole matrix?

BlastN – “word” based heuristics
Notice that in a typical S-W matrix, most of the boxes are empty!!!
What if we find exact matches of some seed words, then just work in the area surrounding these seeds trying to extend the alignment?
This is exactly the heuristic that Blast employs to avoid calculating the whole matrix! (see figure on page 6 of Alignment notes)

BlastN Procedure
- Identify the subsequences of size word in the query
- Filter the query sequence for repetitive “low complexity” sequences
- Find the exact matches in the target of all the words
- Use a modified S-W to extend the hits around the seed words
- Score and report on the best matches
More on scoring in the next class!!!
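The seed-and-extend idea can be sketched as follows. This is a teaching sketch under stated assumptions, not the BlastN implementation: low-complexity filtering and the final scoring statistics are left out, the extension is ungapped along the seed's diagonal rather than a modified S-W, and the `xdrop` cutoff, the function names `find_seeds` and `extend_seed`, and the example sequences are all assumptions made for this example.

```python
from collections import defaultdict


def find_seeds(query, target, word):
    """Return (query_pos, target_pos) for every exact word-sized match."""
    index = defaultdict(list)  # word -> start positions in the query
    for qi in range(len(query) - word + 1):
        index[query[qi:qi + word]].append(qi)
    seeds = []
    for ti in range(len(target) - word + 1):
        for qi in index.get(target[ti:ti + word], []):
            seeds.append((qi, ti))
    return seeds


def extend_seed(query, target, qi, ti, word, match=1, mismatch=-2, xdrop=3):
    """Extend a seed without gaps in both directions.

    Stops in a direction once the running score falls more than xdrop below
    the best score seen. Returns (best score, query start, query end).
    """
    best = word * match
    right = left = 0

    run, step = best, 0
    while qi + word + step < len(query) and ti + word + step < len(target):
        same = query[qi + word + step] == target[ti + word + step]
        run += match if same else mismatch
        step += 1
        if run > best:
            best, right = run, step
        elif best - run > xdrop:
            break

    run, step = best, 0
    while qi - step - 1 >= 0 and ti - step - 1 >= 0:
        same = query[qi - step - 1] == target[ti - step - 1]
        run += match if same else mismatch
        step += 1
        if run > best:
            best, left = run, step
        elif best - run > xdrop:
            break

    return best, qi - left, qi + word + right


query, target = "GGTACTA", "GGTAATA"  # assumed toy sequences
for qi, ti in find_seeds(query, target, word=3):
    score, start, end = extend_seed(query, target, qi, ti, word=3)
    print(f"seed at query {qi}, target {ti}: score {score}, query[{start}:{end}] = {query[start:end]}")
```

Only the diagonals that pass through a seed are ever scored, which is how the heuristic avoids filling the whole m*n matrix. The price is that a real similarity containing no exact word-sized match is never found.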