Types of homology BLAST

Slides:

Advertisements

Similar presentations

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.

MCB 5472 Blast, Psi BLAST, Perl: Arrays, Loops J. Peter Gogarten Office: BPB 404 phone: ,

1 Orthologs: Two genes, each from a different species, that descended from a single common ancestral gene Paralogs: Two or more genes, often thought of.

Sequence Analysis Millions of entries of protein and nucleotide data now in databases - How to convert this to useful information? Sequence analysis -

Bioinformatics Tutorial I BLAST and Sequence Alignment.

BLAST Sequence alignment, E-value & Extreme value distribution.

Last lecture summary.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.

Introduction to Bioinformatics

Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.

Heuristic alignment algorithms and cost matrices

Sequence analysis course

BLAST Basic Local Alignment Search Tool. BLAST החכה BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence [[רצף.

Bioinformatics and Phylogenetic Analysis

Introduction to bioinformatics

Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences Database searching for sequences Multiple sequence alignment Protein classification.

Similar Sequence Similar Function Charles Yan Spring 2006.

1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.

Sequence alignment, E-value & Extreme value distribution

BLAST Basic Local Alignment Search Tool. BLAST החכה BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence [[רצף.

Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.

Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.

An Introduction to Bioinformatics

Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.

Protein Evolution and Sequence Analysis Protein Evolution and Sequence Analysis.

Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty closely) This lecture is designed to introduce you to.

Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)

NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)

Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.

Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.

Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.

Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.

Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?

BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.

1 P6a Extra Discussion Slides Part 1. 2 Section A.

Comp. Genomics Recitation 3 The statistics of database searching.

Construction of Substitution Matrices

Sequence Alignment Csc 487/687 Computing for bioinformatics.

Basic terms:  Similarity - measurable quantity. Similarity- applied to proteins using concept of conservative substitutions Similarity- applied to proteins.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.

Finding Sequence Similarities >query AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAG CTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGA GTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACG.

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

Sequence Alignment.

Construction of Substitution matrices

Sequence Search Abhishek Niroula Department of Experimental Medical Science Lund University

Finding Sequence Similarities >query AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAG CTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGA GTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACG.

Step 3: Tools Database Searching

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

What is BLAST? Basic BLAST search What is BLAST?

Sequence Similarity The bioinformatics for molecular biologists lecture series.

Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,

Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,

Last lecture summary. Sequence alignment What is sequence alignment Three flavors of sequence alignment Point mutations, indels.

Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.

9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.

What is BLAST? Basic BLAST search What is BLAST?

Pairwise Sequence Alignment and Database Searching

Sequence similarity, BLAST alignments & multiple sequence alignments

Lecture 3.1 BLAST.

Blast Basic Local Alignment Search Tool

Basics of BLAST Basic BLAST Search - What is BLAST?

BLAST program selection guide

BLAST Anders Gorm Pedersen & Rasmus Wernersson.

What do you with a whole genome sequence?

Basic Local Alignment Search Tool

BLAST Slides adapted & edited from a set by

Sequence alignment, E-value & Extreme value distribution

BLAST Slides adapted & edited from a set by

Presentation transcript:

Types of homology BLAST MCB 5472 Lecture 3 Feb 10/14 Types of homology BLAST

Homology references “Homology a personal view on some of the problems” Fitch WM (2000) Trends Genet. 16: 227-231 “Orthologs, paralogs, and evolutionary genomics” Koonin EV (2005) Annu. Rev. Genet. 39: 309-338

What is homology? Owen 1843: “the same organ in different animals under every variety of form and function” Huxley (post Darwin): homology evidence of evolution Similarity is due to descent from a common ancestor

What is homology? Homology is a statement about shared ancestry Two things either share a common ancestor (are homologous) or do not

These are all homologs (common ancestor) Common ancestor of species 1, 2, 3 Common ancestor of species 2, 3 a b c Species 1 Species 2 Species 3 These are all homologs (common ancestor)

Ohno 1970: “Evolution by Gene Duplication” New genes arise by gene duplication One copy retains ancestral function Other copy diverges functionally “Homolog” as a single term therefore is a sloppy fit What kind of ancestor to homologs share?

Fitch 1970: “Orthologs” and “Paralogs” “Orthologs”: genes related by vertical descent

These are all homologs (common ancestor) Common ancestor of species 1, 2, 3 Common ancestor of species 2, 3 a b c Species 1 Species 2 Species 3 These are all homologs (common ancestor) These are all orthologs (vertical descent)

Homology and Function Homology and function are two different concepts Strict orthology and functional conservation often correlate but this is not absolute Basis for annotating genomes based on similarity to previous work

Fitch 1970: “Orthologs” and “Paralogs” “Orthologs”: genes related by vertical descent “Paralogs”: gene related by gene duplication

Genes a, b, c & d are homologs (common ancestor) Common ancestor of species 1, 2, 3 Common ancestor of species 2, 3 Duplication in species 1 a b c d Species 1 Species 2 Species 3 Genes a, b, c & d are homologs (common ancestor) Genes a & b are paralogs (related by duplication)

Orthology/paralogy is somewhat relative Depends on the depth of duplication relative to common ancestry “Co-orthologs”: paralogs formed in a lineage after speciation, relative to other lineages (Koonin 2005)

Genes a & b are paralogs (related by duplication) Common ancestor of species 1, 2, 3 Common ancestor of species 2, 3 Duplication in species 1 a b c d Species 1 Species 2 Species 3 Genes a & b are paralogs (related by duplication) Genes a & b are co-orthologs of genes c & d (duplication followed speciation)

Genes a & b are paralogs (duplication) Common ancestor of species 1, 2, 3 Duplication in ancestor of species 2 & 3 Duplication in species 1 Common ancestor of species 2, 3 a a b c d e f Genes a & b are paralogs (duplication) Genes a & b are co-orthologs of c, d, e & f (duplication followed speciation)

Genes c & d are orthologs (common ancestor) Common ancestor of species 1, 2, 3 Duplication in ancestor of species 2 & 3 Duplication in species 1 Common ancestor of species 2, 3 a a b c d e f Genes c & d are orthologs (common ancestor) Genes e & f are orthologs (common ancestor) Genes c & d are paralogs of genes e & f (duplication preceded speciation)

Genes c is a paralog of gene f even though it doesn’t seem so Common ancestor of species 1, 2, 3 Duplication in ancestor of species 2 & 3 Duplication in species 1 Common ancestor of species 2, 3 (Loss of d) (Loss of e) a a b c f Genes c is a paralog of gene f even though it doesn’t seem so (duplication still preceded speciation followed by extinction)

Xenologs Bacteria exchange DNA between distant relatives by horizontal gene transfer (HGT) Increasingly recognized in eukaroytes too Gene tree does not match species tree

Gene c is a xenolog relative to the others Common ancestor of species 1, 2, 3 Common ancestor of species 2, 3 HGT from species 2 ancestor to species 1 a b c d Species 1 Species 2 Species 1 Species 3 Gene c is a xenolog relative to the others

Other “-logs” Inparalogs: duplication follows speciation Outparalogs: duplication precedes speciation Synlogs: arising from organism fusion

Orthology & paralogy can get quite complicated when multiple duplications happened at different moments in time Gene loss & HGT can always confound – one often has to rely on external evidence to recreate speciation E.g., other genes not thought to be horizontally transferred, average signal of multiple genes

Discuss: how are these genes related to each other? Three possibilities Species 1 Species 2 Species 3

How to determine orthologs Most detailed: phylogenetic trees Can be computationally expensive Reciprocal BLAST hit (RBH/BBH) Simplest, computationally cheap, less accurate & more complicated with many genomes More complicated RBH clustering OrthoMCL, Inparanoid

RBH orthologs Genome A Genome B Best matches in both directions - ortholog Best matches in both directions - ortholog Best matches in only 1 direction - not ortholog Genome A Genome B Different matches in each direction - not ortholog Different matches in each direction - not ortholog

BLAST Standard method to identify homologous sequences Not for comparing two sequences directly; use NEEDLE instead for this (global vs. local alignment methods) Requires database to query sequence against Probably the most common scientific experiment

Different BLAST types BLASTn: nucleotide vs nucleotide BLASTp: protein vs protein BLASTx: protein vs translated nucleotide tBLASTn: translated nucleotide vs protein tBLASTx: translated nucleotide vs translated nucleotide Nucleotides translated in all six open reading frames

Implimentations blastall: older command line version Atschul et al. 1990 J. Mol. Biol. 215:403-410 BLAST+: newer command line version Camacho et al. 2008 BMC Bioinformatics 10:421 Faster than blastall Web BLAST: www.blast.ncbi.nlm.nih.gov/Blast.cgi Web version of BLAST+

Databases All BLAST queries are done vs. a database Examples: NCBI’s “nr” queries against all of GenBank WebBLAST has preformatted databases for different taxonomic groups, other NCBI divisions (e.g., Refseq, Genomes) Command line allows custom databases e.g., lab genomes

WebBLAST Common genome databases Different BLAST flavors

WebBLAST (BLASTn) Input sequence Database BLAST type Megablast optimized for short sequences vs. BLASTn

WebBLAST (BLASTn) parameters

BLAST: Step 1 Break sequence into words Goal: exact word matches Protein: 2-3 amino acids Nucleotide: 16-256 nucleotides Goal: exact word matches Computational speedup http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001014

Substitution matrices Evolutionarily, some substitutions are more common than others Some amino acids are common (e.g., Leu) and some are rare (e.g., Trp) Some substitutions are more feasible than others (e.g., Leu -> Ile vs. Leu -> Arg) Substitution matrices therefore weight alignments by these probabilities

BLOSUM matrices Alignments of a set of divergent reference sequences BLOSUM62: sequences 62% identical BLOSUM80: sequences 80% identical Substitution frequency calculated for each reference set and used to derive substitution matrix Henikoff & Henikoff (1992) PNAS 89:10915-10919 Also: M. Dayhoff’s PAM matrices from 1978

BLOSUM62 matrix http://upload.wikimedia.org/wikipedia/commons/5/52/BLOSUM62.gif

BLAST: Step 2 Use substitution matrix to find synonymous words about some scoring threshold http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001014

BLAST: Step 3 Find matching words in the database Extend word matches between query and matching sequence in both directions until extension score drops below threshold First without gaps http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001014

BLAST: Step 4 If initial alignment good enough, redo with gaps and calculate statistics http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001014

BLAST score 𝑆= 𝑀 𝑖𝑗 −𝑐𝑂−𝑑𝐺 Penalty for opening gap Per-residue for gap extension 𝑆= 𝑀 𝑖𝑗 −𝑐𝑂−𝑑𝐺 Score Sum of scores from distance matrix # of gaps Total length of gaps Gap opening penalty typically significantly larger than gap extension penalty Why?

Questions: Why do gap opening and extension penalties differ? Why is BLAST a local aligner vs. global

Local alignment Sequence extensions do not necessarily extend to sequence ends Domains vs entire proteins Can be multiple query->reference matches i.e., alignment can be broken, each with own statistics Can be multiple reference matches to the same query

Sequence masking Low-complexity regions can arise convergently Small hydrophobic amino acids common in transmembrane helices Violates homology assumption, therefore often excluded from BLAST search

Comparing BLAST scores Different BLASTs can use different parameters, e.g., matrices & gap penalties “Bit scores” normalize for this 𝑆 ′ = λ𝑆− ln 𝐾 / ln 2 Bit score Matrix penalty Gap penalty Score

E-values What is the likelihood that the sequence similarity is due to chance vs. actual homology? Larger databases are more likely to include chance matches

E-values 𝐸= 𝑛×𝑚 / 2 𝑆 ′ Length of the query sequence Bit score E-value 𝐸= 𝑛×𝑚 / 2 𝑆 ′ E-value Total # of residues in the database

E-values The E-value represents the likelihood of a random match >= the calculated score Smaller E-values therefore reflect greater probability of true homology Typically 1e-5 operationally used as a threshold for considering sequences as homologous

Summary Wednesday: applying BLAST Next week: expanding from one->many sequence comparisons to many->many