Sequence Similarity The bioinformatics for molecular biologists lecture series.


Sequence similarity
- One of the two major database searching strategies
- Types of sequence similarity comparison:
  - Accurate alignment of two sequences
  - Heuristic comparison between one query sequence and a database of target sequences
  - Alignment of multiple sequences of the same biological function or biological origin

Why similarity: similarity and homology
- Similarity refers to the likeness or % identity between two sequences
- Similarity means sharing a statistically significant number of bases or amino acids
- Similarity does not imply homology
- Homology refers to shared ancestry
- Two sequences are homologous if they are derived from a common ancestral sequence
- Homology implies similarity

Similarity vs. homology
- Similarity can be quantified
- It is correct to say that two sequences are 30% identical
- It is generally incorrect to say that two sequences are 30% similar
- It is correct to say that two sequences have a similarity score of 200
- The definition of the similarity score determines the “best” or the “correct” alignment
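Percent identity can be computed directly from a pairwise alignment. The following is a minimal Python sketch, assuming the two sequences are already aligned to equal length with "-" marking gaps. The function name percent_identity and the choice of denominator (the number of alignment columns) are illustrative assumptions; other conventions give different values.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns that hold the same residue in both rows.

    Assumes both strings are already aligned (equal length, '-' is a gap).
    The denominator is the number of alignment columns (an assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both residues match and neither is a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


identity = percent_identity("APPLESANDORANGES", "APPLES---ORANGES")
print(f"{identity:.2f}% identical")
```

The result is a statement about identity ("x% identical"), which is the quantity the slide says can be reported correctly.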

Similarity vs. homology
- Homology cannot be quantified
- If two sequences have a high percentage identity, it is OK to say they are homologous
- It is incorrect to say two sequences are 40% homologous
- It is incorrect to say two sequences have a homology score of 150
- Two types of homology: orthologous and paralogous
- Homology is usually inferred rather than observed

Orthologs
- Ortho = exact
- Orthologs are the result of speciation, for example “Hemoglobin A” in human and in mouse
- When the genes are orthologous, the history of the gene reflects the history of the species
- Orthology implies conserved function

Paralogs
- Para = in parallel
- Paralogs are homologous sequences that arose by a mechanism such as gene duplication
- When both copies have descended side by side during the history of an organism, e.g. Hemoglobin A and B, the genes are paralogous
- They have distinct but related functions

Analogous

Homology vs. analogy
- Homology: similarity in characteristics resulting from shared ancestry
- Analogy: similarity of structure between two species that are not closely related; attributable to convergent evolution

Summary of concepts
- Similarity: an observation
- Homology: a biological relationship
- Similarity is a result of homology or analogy
- Molecular similarity is likely a result of homology
- Molecular similarity is frequently used to infer a biological relationship, i.e. homology

How to measure similarity
- Score an alignment
  - Score the matching positions
  - Score the gaps
- Total score = score of matching positions + score of gaps
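The total score can be sketched in a few lines of Python. This is only an illustration: the function name alignment_score and the values match=1, mismatch=-1 and gap=-2 are arbitrary assumptions, and real tools use a scoring matrix and more elaborate gap penalties.

```python
def alignment_score(aligned_a: str, aligned_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Total score = score of aligned positions + score of gaps.

    Assumes two pre-aligned strings of equal length with '-' for gaps.
    The default scores are illustrative, not taken from any standard matrix.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    position_score = 0
    gap_score = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            gap_score += gap          # one linear penalty per gapped column
        elif x == y:
            position_score += match
        else:
            position_score += mismatch
    return position_score + gap_score


print(alignment_score("APPLESANDORANGES", "APPLES---ORANGES"))
```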

Scoring matrix For DNA

Scoring protein alignment
- E: glutamic acid
- Q: glutamine
- T: threonine

Point Accepted Mutation (PAM)
- Margaret Dayhoff et al. (1970s)
- First to assemble sequences into a protein sequence atlas: families and superfamilies based on sequence similarity
- Tables of the frequency of changes (mutations) observed in the sequences of a family were derived
- PAM tables: percent amino acid mutations accepted by evolutionary selection. They show the probability that one amino acid changes into any other in these families
- A score above zero assigned to two amino acids indicates that the two replace each other more often than expected by chance alone, i.e. they are functionally exchangeable
- A negative score (below zero) indicates that the two amino acids are rarely interchangeable, e.g. a basic amino acid for an acidic one, or one with an aromatic side chain for one with an aliphatic side chain
- 1 PAM: an average change in 1% of all amino acid positions
- 100 PAMs (the 1 PAM matrix raised to the power of 100) does not mean that every residue has changed
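The last point can be illustrated with a toy example. The sketch below uses a made-up two-letter alphabet (not the real 20 x 20 Dayhoff matrix) in which each residue has a 1% chance of being replaced per PAM step; raising that matrix to the power of 100 shows that a large fraction of positions is still unchanged. The names mat_mul, mat_pow and toy_pam1 are hypothetical helpers for this illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "1 PAM" matrix for a two-letter alphabet (an assumption for illustration):
# each residue stays the same with probability 0.99 per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam100 = mat_pow(toy_pam1, 100)

# Probability that a position shows its original residue after 100 steps.
print(round(toy_pam100[0][0], 3))
```

Because changes can hit the same position repeatedly or revert, the probability of seeing the original residue after 100 steps stays well above zero.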

PAM250

BLOSUM
- BLOSUM (Blocks Substitution Matrix)
- Substitution matrices derived from the observed frequencies of amino acid replacements in highly conserved regions of ungapped local alignments (Henikoff and Henikoff, PNAS 1992)
- The number indicates the percent identity within the set, e.g. BLOSUM62 is built from sequences clustered at 62% identity
- The data for the substitution scores in these matrices come from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins [Ref: Henikoff, S., and Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA 89: ]
- The BLAST server from NCBI and the search servers from EBI use different versions of the BLOSUM matrix for protein similarity searches and alignments
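Substitution scores of this kind are log-odds values: the observed frequency of a residue pair is compared with the frequency expected by chance. The Python sketch below is a simplified illustration, not the exact procedure from the paper. The function name log_odds_score, the example frequencies and the scale factor of 2 (half-bit units) are assumptions made for the example, and the expected frequency is taken simply as the product of the two background frequencies.

```python
import math


def log_odds_score(observed_pair_freq: float, freq_a: float, freq_b: float,
                   scale: float = 2.0) -> int:
    """Rounded log-odds score for one amino acid pair.

    observed_pair_freq: how often the pair is seen aligned (hypothetical value)
    freq_a, freq_b:     background frequencies of the two residues
    scale:              2.0 gives half-bit units (an assumption here)
    """
    expected = freq_a * freq_b  # simplified chance expectation
    return round(scale * math.log2(observed_pair_freq / expected))


# Made-up frequencies: pair seen four times more often than chance.
print(log_odds_score(0.01, 0.05, 0.05))
# Made-up frequencies: pair seen four times less often than chance.
print(log_odds_score(0.000625, 0.05, 0.05))
```

A pair seen more often than chance gets a positive score and a pair seen less often gets a negative one.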

Choice of scoring matrices For DNA Identity matrix For protein

Scoring gaps
- Introducing gaps may improve the alignment

  Without gaps:
    APPLESANDORANGES
    ||||||
    APPLESORANGES

  With gaps:
    APPLESANDORANGES
    ||||||   |||||||
    APPLES---ORANGES

- Introducing too many gaps is not meaningful

  Without gaps:
    ATCCTACTCATCAT
    |||     |  |
    ATCTACTACTACTG

  With gaps:
    ATCCTACTCA-T-CAT-
    ||| |||| | | | |
    ATC-TACT-ACTAC-TG

- Affine gap penalty: Penalty = a + bx, where a and b are constants and x is the gap length; a is usually big
- Typical a and b: protein (11, 1), DNA (5, 2)
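The affine penalty is easy to sketch in Python. The function names affine_gap_penalty and total_gap_penalty are hypothetical; the defaults use the protein values (11, 1) from the slide, and the code follows the slide's formula a + bx for every run of gap characters. Some tools count the first gap position differently.

```python
import re


def affine_gap_penalty(gap_length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Penalty = a + b*x with a = gap_open, b = gap_extend, x = gap_length."""
    if gap_length < 1:
        raise ValueError("gap_length must be at least 1")
    return gap_open + gap_extend * gap_length


def total_gap_penalty(aligned_seq: str, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Sum the affine penalty over every run of '-' in one aligned sequence."""
    return sum(affine_gap_penalty(len(run), gap_open, gap_extend)
               for run in re.findall(r"-+", aligned_seq))


# One gap of length 3 is cheaper than three separate gaps of length 1.
print(total_gap_penalty("APPLES---ORANGES"))
print(total_gap_penalty("ATC-TACT-ACTAC-TG"))
```

Because a is large compared with b, one long gap costs less than several short ones, which discourages alignments scattered with gaps.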

Sequence alignment
- Find the best alignment in terms of score
- Types of sequence similarity comparison:
  - Accurate alignment of two sequences
  - Heuristic comparison between one query sequence and a database of target sequences
  - Alignment of multiple sequences of the same biological function or biological origin

Accurate alignment
- Global alignment (Needleman-Wunsch): alignment of two complete sequences
- Local alignment (Smith-Waterman): alignment of the most similar fragments in two sequences
- Both are based on dynamic programming, run in O(MN) time for sequences of lengths M and N, and give the optimal solution

Global alignment
- Earliest pairwise alignment method
- Easily detectable similarity along the entire length of the sequences
- E.g. trypsin and quinone oxidoreductase / zeta crystallin alignments
- Needleman-Wunsch algorithm (1970)
- Optimised over the entire length of the query sequence
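A minimal Python sketch of global alignment by dynamic programming, in the spirit of Needleman-Wunsch. It is a simplification under stated assumptions: a linear gap penalty instead of an affine one, and arbitrary scores (match=1, mismatch=-1, gap=-2) instead of a substitution matrix. The function name needleman_wunsch is chosen for the example.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1]); aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


best, top, bottom = needleman_wunsch("APPLESANDORANGES", "APPLESORANGES")
print(best)
print(top)
print(bottom)
```

The table has (M+1) x (N+1) cells and each is filled in constant time, which is where the O(MN) cost comes from.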

Local alignment
- Many proteins are composed of mosaics of domains
- The gene sequences match at the domain level rather than over the entire length of the protein
- They share sequence similarity at localised regions in the gene, primarily at the domains
- Local alignment algorithm used: Smith-Waterman (1981)
- Optimised for local optimal alignments
- Most useful for database searching
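A minimal Python sketch of local alignment in the spirit of Smith-Waterman. As with any short example it makes simplifying assumptions: a linear gap penalty, arbitrary scores (match=2, mismatch=-1, gap=-2) rather than a substitution matrix, and only one best-scoring local alignment is returned. The function name smith_waterman and the test sequences are made up for illustration.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; scores never drop below zero, so an alignment can
    # start anywhere in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    aligned_a, aligned_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1]); aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


# Two sequences that share only one "domain" (ORANGES).
print(smith_waterman("XXXORANGESXXX", "APPLESORANGES"))
```

Only the shared region is reported; the unrelated flanking residues are left out of the alignment, which is the behaviour wanted when sequences match at the domain level.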

Heuristic algorithm
- BLAST: Basic Local Alignment Search Tool

Types of BLAST
- blastn: search a nucleotide database using a nucleotide query
- blastp: search a protein database using a protein query
- blastx: search a protein database using a translated nucleotide query
- tblastn: search a translated nucleotide database using a protein query
- tblastx: search a translated nucleotide database using a translated nucleotide query

Blastn
- Blastn: the general algorithm
- Megablast: compares a query to closely related sequences; it is very fast and works best if the target percent identity is 95% or more
- Discontiguous megablast: intended for cross-species comparisons of the same gene

Blastp
- Blastp: the general algorithm
- PSI-BLAST: allows the user to build a PSSM (position-specific scoring matrix) using the results of the first blastp run
- PHI-BLAST: BLAST plus motif scan
- DELTA-BLAST: searches first against CDD (the Conserved Domain Database) to get a PSSM, then uses it to search the entire database

Typical blast result Domain hit

Typical blast result Sequence hit

Typical blast result Sequence hit

Blast Search Parameters A BLAST search can be limited to the result of an Entrez query against the database chosen.

Search scope limit by query
- protease NOT hiv1[organism]
- 1000:2000[slen] (sequence length)
- Mus musculus[organism] AND biomol_mrna[properties]
- 10000:100000[mlwt] (molecular weight)
- all[filter] NOT environmental sample[filter] NOT metagenomes[orgn]

BLAST heads up
- For short amino acid sequences (20-40 residues), 50% identity can happen by chance
- Similarity can be present even in the absence of homology, e.g. in low-complexity, transmembrane and coiled-coil regions

More details
- Choice of programs
- The BLAST documentation
- The statistics behind BLAST scores

Next generation sequencing
- Millions of short reads
- The entire human genome for $5000
- Wide applications
- 454: longer reads
- Illumina: shorter reads but higher throughput
- Single-end / paired-end

Genome re-sequencing
- Genetic variation detection
- Single nucleotide polymorphism
- Copy number variation
- 1000 Genomes Project (human) and 1001 Genomes Project (Arabidopsis)
- Focused re-sequencing
- Exon capture chips
- ChIP-seq
- Methylomes

Transcriptome sequencing
- Expression level
- ncRNA transcripts
- Novel transcripts
- Alternative splicing
- ENCODE project: the Encyclopedia of DNA Elements

Next generation sequencing analysis
- Aligning short but similar sequence reads to a reference genome
- BWA and Bowtie
- FASTQ (fq) and SAMtools
- Tablet

Fastq
- First line: sequence ID (starts with "@")
- Second line: sequence
- Third line: "+", optionally followed by anything
- Fourth line: quality
- For Illumina, base call accuracy by quality score: Q30: 99.9%; Q20: 99%; Q10: 90%
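The four-line layout and the quality scale can be sketched in Python. The sketch assumes Phred+33 quality encoding (quality character code minus 33) and a record that is not line-wrapped; the function names and the example read are made up for illustration. The accuracy figures follow from accuracy = 1 - 10^(-Q/10).

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (seq_id, sequence, phred_scores).

    Assumes Phred+33 encoding and exactly four lines per record.
    """
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly four lines")
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    phred = [ord(ch) - 33 for ch in quality]  # Phred+33 assumption
    return header[1:], sequence, phred


def base_call_accuracy(q: float) -> float:
    """Probability that a base call with Phred quality q is correct."""
    return 1.0 - 10 ** (-q / 10)


# Hypothetical record with qualities 40, 20, 10 and 30.
record = ["@read1\n", "ACGT\n", "+\n", "I5+?\n"]
seq_id, seq, quals = parse_fastq_record(record)
for base, q in zip(seq, quals):
    print(base, q, f"{base_call_accuracy(q):.4f}")
```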

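SAM tools
- SAM: Sequence Alignment/Map
- BAM: compressed binary version of a SAM file
- Fields of an alignment line: read name; flag for matching status; target (reference) name; position; mapping confidence (mapping quality); CIGAR string, e.g. 49M for 49 bp matched; RNEXT (reference name of the mate); PNEXT (position of the mate); template length; sequence; quality; optional flags (tags)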
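The field order of a SAM alignment line can be made concrete with a small Python parser. This is a sketch for illustration only: the class name SamRecord, the function name parse_sam_line and the example line are made up, and real work would normally go through SAMtools or a dedicated library. The line is tab-separated with eleven mandatory fields followed by optional tags.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag for matching status
    rname: str   # target (reference) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # e.g. "49M" for 49 bp matched
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # template length
    seq: str     # read sequence
    qual: str    # base qualities
    tags: List[str]  # optional fields


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-separated SAM alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], fields[11:])


# Hypothetical alignment line for a 4-base read.
example = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(example)
is_unmapped = bool(rec.flag & 0x4)  # flag bit 0x4 marks an unmapped read
print(rec.rname, rec.pos, rec.cigar, is_unmapped)
```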

Tablet

Summary Similarity search