Pairwise Alignment: How do we tell whether two sequences are similar? BIO520 Bioinformatics, Jim Lund. Assigned reading: Ch 4.1-4.7, Ch 5.1, get what you can.

Pairwise Alignment: How do we tell whether two sequences are similar? BIO520 Bioinformatics, Jim Lund. Assigned reading: Ch 4.1-4.7, Ch 5.1, get what you can out of 5.2, 5.4

Pairwise alignment DNA:DNA polypeptide:polypeptide The BASIC Sequence Analysis Operation

Alignments Pairwise sequence alignments –One-to-One –One-to-Database Multiple sequence alignments –Many-to-Many

Origins of Sequence Similarity Homology –common evolutionary descent Chance –Short similar segments are very common. Similarity in function –Convergence (very rare)

Visual sequence comparison: Dotplot

Visual sequence comparison: Filtered dotplot 4 bp window, 75% identity cutoff

Visual sequence comparison: Dotplot 4 bp window, 75% identity cutoff
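A minimal sketch of how a filtered dotplot with a 4 bp window and a 75% identity cutoff could be computed. The function names `filtered_dotplot` and `render` are illustrative, not from the slides, and the text rendering is only a stand-in for the graphical plots shown in the lecture.

```python
def filtered_dotplot(seq_a, seq_b, window=4, min_identity=0.75):
    """Return the set of (i, j) window start positions that pass the filter.

    A dot is placed at (i, j) when the window starting at seq_a[i] and the
    window starting at seq_b[j] are identical in at least `min_identity`
    of their positions.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if matches / window >= min_identity:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the dotplot as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + seq_b]
    for i, base in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(base + " " + row)
    return "\n".join(lines)


dots = filtered_dotplot("GAACAAT", "GAATAAT")
print(render("GAACAAT", "GAATAAT", dots))
```

Similar regions show up as runs of dots along a diagonal; the single mismatch between these two sequences is tolerated because each 4 bp window still reaches 3/4 identity.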

Dotplots of sequence rearrangements

Assessing similarity GAACAAT |||||||7/7 OR 100% GAACAAT GAACAAT | 1/7 or 14% GAACAAT Which is BETTER? How do we SCORE?

Similarity GAACAAT ||||||| 7/7 OR 100% GAACAAT ||| ||| 6/7 OR 86% GAATAAT MISMATCH

Mismatches GAACAAT ||| ||| 6/7 OR 86% GAATAAT GAACAAT ||| ||| 6/7 OR 86% GAAGAAT

Terminal Mismatch GAACAATttttt ||| ||| aaaccGAATAAT 6/7 OR 86%

INDELS GAAgCAAT ||| ||||7/7 OR 100% GAA*CAAT

Indels, cont’d GAAgCAAT ||| |||| GAA*CAAT GAAggggCAAT ||| |||| GAA****CAAT

Similarity Scoring Common Method: Terminal mismatches (0) Match score (1) Mismatch penalty (-3) Gap penalty (-1) Gap extension penalty (-1) DNA Defaults

DNA Scoring GGGGGGAGAA 2 |||||*|*|| 8(1)+2(-3)= 2 GGGGGAAAAAGGGGG GGGGGGAGAA--GGG 3 |||||*|*|| ||| 11(1)+2(-3)+1(-1)+1(-1)= 3 GGGGGAAAAAGGGGG
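The arithmetic on this slide can be reproduced with a short scoring function. This is a sketch under the convention the slide's sums imply: the first position of a gap costs the gap penalty (-1) and each further position costs the gap extension penalty (-1). Terminal overhangs score 0, so only the aligned columns are passed in. The function name `score_alignment` is illustrative.

```python
def score_alignment(top, bottom, match=1, mismatch=-3,
                    gap_open=-1, gap_extend=-1):
    """Score two equal-length aligned strings, using '-' for gap positions.

    Simplification: a run of gap columns is treated as one gap even if it
    switches from one sequence to the other.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            # First gap column pays gap_open, later columns pay gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# Ungapped alignment from the slide: 8(1) + 2(-3)
print(score_alignment("GGGGGGAGAA", "GGGGGAAAAA"))
# Gapped alignment from the slide: 11(1) + 2(-3) + 1(-1) + 1(-1)
print(score_alignment("GGGGGGAGAA--GGG", "GGGGGAAAAAGGGGG"))
```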

Absurdity of Low Gap Penalty GATCGCTACGCTCAGC A.C.C..C..T Perfect similarity, Every time!

Sequence alignment algorithms Local alignment –Smith-Waterman Global alignment –Needleman-Wunsch

Alignment Programs Local alignment (Smith-Waterman) –BLAST (simplified Smith-Waterman ) –FASTA (simplified Smith-Waterman ) –BESTFIT (GCG program) Global alignment (Needleman-Wunsch) –GAP

Local vs. global alignment 10 gaggc 15 ||||| 3 gaggc 7 1 gggggaaaaagtggccccc 19 || |||| || 1 gggggttttttttgtggtttcc 22 Global alignment: alignment of the full length of the sequences Local alignment: alignment of regions of substantial similarity
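The difference between the two alignment types can be sketched with one dynamic-programming routine that computes only the optimal score. It assumes a simple linear gap penalty (`gap=-2` is an illustrative value, not a default from the slides). The global (Needleman-Wunsch style) variant charges for unaligned ends and reads the score from the last cell; the local (Smith-Waterman style) variant never lets a cell drop below 0 and takes the best cell anywhere in the matrix.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-3, gap=-2):
    """Return the optimal global or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                # Local alignment: restart rather than carry a negative score.
                cell = max(cell, 0)
            H[i][j] = cell
            best = max(best, cell)
    return best if local else H[-1][-1]


a, b = "ttttgaggctttt", "aagaggcaa"
print("local :", alignment_score(a, b, local=True))
print("global:", alignment_score(a, b, local=False))
```

Here the shared `gaggc` core dominates the local score, while the global score is dragged down by the unrelated flanking sequence.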

Local vs. global alignment

BLAST Algorithm Look for local alignments, High-scoring Segment Pairs (HSPs). Find words (W) shared by query and subject with score > T. Extend the local alignment until the score falls X below its maximum. Keep HSPs with scores > S. Find multiple HSPs per query if present. Compute the expectation value (E value) using Karlin-Altschul statistics.
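The seed-and-extend idea can be sketched in a few lines. This is a heavily simplified illustration, not the BLAST implementation: it uses exact word matches as seeds (so the neighborhood-word threshold T is left out), extends without gaps, and stops an extension once the running score has dropped more than X below the best score seen. The names `find_hsps` and `extend` and the parameter values are assumptions chosen for the example.

```python
def extend(query, subject, i, j, W, X, match, mismatch):
    """Ungapped X-drop extension of the word hit query[i:i+W] == subject[j:j+W]."""

    def walk(qi, sj, step):
        # Returns (best score gain, number of columns kept) in one direction.
        gain = best = kept = columns = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            gain += match if query[qi] == subject[sj] else mismatch
            columns += 1
            if gain > best:
                best, kept = gain, columns
            elif best - gain > X:
                break
            qi += step
            sj += step
        return best, kept

    right_gain, right_len = walk(i + W, j + W, +1)
    left_gain, left_len = walk(i - 1, j - 1, -1)
    score = W * match + left_gain + right_gain
    return (score, i - left_len, j - left_len, left_len + W + right_len)


def find_hsps(query, subject, W=4, X=5, S=6, match=1, mismatch=-3):
    """Return HSPs as (score, query_start, subject_start, length), best first."""
    # Index every word of length W in the subject.
    index = {}
    for j in range(len(subject) - W + 1):
        index.setdefault(subject[j:j + W], []).append(j)
    hsps = set()
    for i in range(len(query) - W + 1):
        for j in index.get(query[i:i + W], []):
            hsps.add(extend(query, subject, i, j, W, X, match, mismatch))
    # Keep only segment pairs scoring at least S.
    return sorted((h for h in hsps if h[0] >= S), reverse=True)


print(find_hsps("GAACAATGG", "TTTGAACAATCC"))
```

Several word hits on the same diagonal usually extend to the same segment pair, which is why the results are collected in a set.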

BLAST statistical significance: assessing the likelihood a match occurs by chance. Karlin-Altschul statistic: E = k m N exp(-Lambda S), where m = size of query sequence, N = size of database, k = search space scaling parameter, Lambda = scoring scaling parameter, S = BLAST HSP score. Low E -> good match
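A small numeric sketch of the Karlin-Altschul formula. The values used for k and Lambda below are placeholders chosen for illustration only; real values depend on the scoring matrix and gap penalties and are reported by BLAST in its output.

```python
import math


def e_value(score, query_len, db_len, k, lam):
    """Expected number of chance HSPs scoring at least `score`: E = k*m*N*exp(-lam*S)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Hypothetical parameters, for illustration only.
K_PARAM, LAMBDA = 0.1, 0.3
for s in (30, 50, 80):
    print(s, e_value(s, query_len=300, db_len=1_000_000, k=K_PARAM, lam=LAMBDA))
```

E falls exponentially as the score rises and grows linearly with the size of the search space (m times N), so the same score is less significant against a larger database.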

BLAST statistical significance: Rule of thumb for a good match: Nucleotide match E < 1e-6 Identity > 70% Protein match E < 1e-3 Identity > 25%

Protein Similarity Scoring Identity - Easy WEAK Alignments Chemical Similarity –L vs I, K vs R… Evolutionary Similarity –How do proteins evolve? –How do we infer similarities?

BLOSUM62

Single-base evolution changes the encoded AA. Starting from CAU=H: 3rd-base changes CAC=H, CAA=Q, CAG=Q; 2nd-base changes CGU=R, CCU=P, CUU=L; 1st-base changes UAU=Y, GAU=D, AAU=N
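The nine codons on this slide are the single-base neighbors of CAU. A short sketch that enumerates them from the standard genetic code; the table is built from the usual 64-letter amino acid string in UCAG order, and the function name `single_base_neighbors` is illustrative.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def single_base_neighbors(codon):
    """Map every codon one substitution away from `codon` to its amino acid."""
    neighbors = {}
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                neighbors[mutant] = CODON_TABLE[mutant]
    return neighbors


print("CAU =", CODON_TABLE["CAU"])
for mutant, aa in single_base_neighbors("CAU").items():
    print(mutant, "=", aa)
```

Only one of the nine substitutions (CAC) leaves histidine unchanged, which is part of why the observed pattern of amino acid replacements is constrained by the genetic code.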

Substitution Matrices Two main classes: PAM-Dayhoff BLOSUM-Henikoff

PAM-Dayhoff Built from closely related proteins, substitutions constrained by evolution and function, “accepted” by evolution (Point Accepted Mutation=PAM). 1 PAM = 1% divergence. PAM120=closely related proteins, PAM250=divergent proteins

BLOSUM- Henikoff&Henikoff Built from ungapped alignments in proteins: “BLOCKS” Merge blocks at given % similar to one sequence Calculate “target” frequencies BLOSUM62=62% similar blocks –good general purpose BLOSUM30 –Detects weak similarities, used for distantly related proteins

BLOSUM62

Gapped alignments No general theory for significance of matches!! Gap cost: G+L(n), where G = gap opening penalty, L = gap extension penalty, n = gap length –indel mutations rare –variation in gap length “easy”, G > L

Real Alignments

Phylogeny

Cow-to-Pig Protein

Cow-to-Pig cDNA 80% Identity (88% at aa!)

DNA similarity reflects polypeptide similarity

Coding vs Non-coding Regions 90% in coding (70% in non-coding)

Third Base of Codon is Hypervariable

Cow-to-Fish Protein 42% identity, 51% similarity

Cow-to-Fish DNA 48% similarity

Protein vs. DNA Alignments Polypeptide similarity > DNA. Coding DNA > Non-coding. 3rd base of codon hypervariable. Moderate distance -> poor DNA similarity

Rules of Thumb DNA-DNA similarities –50% significant if “long” –E < 1e-6, 70% identity Protein-protein similarities –80% end-end: same structure, same function –30% over domain, similar function, structure overall similar –15-30% “twilight zone” –Short, strong match…could be a “motif”

Basic BLAST Family BLASTN –DNA to DNA database BLASTP –protein to protein database BLASTX –DNA (translated) to protein database TBLASTN –protein to DNA database (translated) TBLASTX –DNA (translated) to DNA database (translated)

DNA Databases nr (non-redundantish merge of Genbank, EMBL, etc…) –EXCLUDES HTGS0,1,2, EST, GSS, STS, PAT, WGS est (expressed sequence tags) htgs (high throughput genome seq.) gss (genome survey sequence) vector, yeast, ecoli, mito chromosome (complete genomes) And more

Protein Databases nr (non-redundant Swiss-prot, PIR, PRF, PDB, Genbank CDS) swissprot ecoli, yeast, fly month And more

BLAST Input Program Database Options - see more Sequence –FASTA –gi or accession#

BLAST Options Algorithm and output options –# descriptions, # alignments returned –Probability cutoff –Strand Alignment parameters –Scoring Matrix PAM30, PAM70, BLOSUM45, BLOSUM62, BLOSUM80 –Filter (low complexity) PPPPP->XXXXX

Extended BLAST Family Gapped Blast (default) PSI-Blast (Position-specific iterated blast) –“self” generated scoring matrix PHI BLAST (motif plus BLAST) BLAST2 client (align two seqs) megablast (genomic sequence) rpsblast (search for domains)