Independent scientist

Presentation transcript:

Multiple alignment and database search Robert Edgar, Independent scientist robert@drive5.com www.drive5.com

Multiple alignment in 16S world Curated 16S multiple alignments: RDP, SILVA, Greengenes MSA methods for 16S: ARB, Infernal, NAST

Why multiple alignment for 16S? Computational efficiency Alignment is often most expensive step Pre-computed alignment faster Some claim more accurate than pair-wise I disagree

Global alignment Most MSAs are global Curated 16S alignments are global CLUSTALW, MUSCLE, MAFFT, PROBCONS etc. Assumes approximate homology over full length No rearrangements: duplications, inversions, translocations Models only these mutations: substitutions; few short, non-overlapping indels
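The mutation model on this slide (substitutions plus a few short indels, no rearrangements) is what a standard global dynamic-programming alignment scores. A minimal sketch of the Needleman-Wunsch score recurrence follows; the match, mismatch and gap values are illustrative assumptions, not parameters used by any of the tools named above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty.

    Scoring values are assumptions chosen for illustration only.
    """
    # prev[j] = best score aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # substitution or match
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]
```

Every residue must be placed in one left-to-right path through the matrix, which is why a duplication, inversion or translocation cannot be expressed in this model.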

Issues with global MSA Rearrangements Tandem duplications Churn (hyper-variable regions) Scaling to large datasets

Domain rearrangements Multiple alignment of different nicotinamide nucleotide transhydrogenase sequences related to the bovine protein (NNTM). ENTHI, Entamoeba histolytica; EIMTE, Eimeria tenella; CAEEL, Caenorhabditis elegans; ACEAT, Acetabularia acetabulum; NEUCR, Neurospora crassa. The ancestor of the orthologs in the protozoan branch has undergone a circular permutation: note that the motif H-I-J appears on the left in the protozoans Entamoeba and Eimeria but on the right in the other species. Letters are used for brevity; the corresponding ProDom IDs are shown under the alignment. Weiner J 3rd, Thomas G, Bornberg-Bauer E. (2005), Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics. 2005 Apr 1;21(7):932-7.

Tandem duplications Short tandem duplications most common source of short insertions in human genome No evidence for tandems in 16S Common in proteins

Tandem duplications & arrays Gamma crystallin, tandem duplication of greek key domain. Ribonuclease inhibitor, tandem array of leucine-rich repeats.

Alignments with tandems F T R E P R E P T E R N L E F T R E P R E P T E R N L E F T R E P T E R N L E F T R E P R E P T E R N Tandems cause 1:n homology Cannot be represented in conventional alignment format

What is a correct alignment? Per-residue homology Two residues aligned iff homologous Homology can be ambiguous (churn) Cannot be determined experimentally Structural similarity Two residues aligned iff same position in structure Structural correspondence is fuzzy Not well-defined... ...but can be useful if limitations are understood

Churn in hyper-variable regions
HISTORY: A B C, then deletion gives A - C, then substitution gives A - D, then insertion gives A E D
ALIGNMENT: either
A B C
A E D
or
A B - C
A - E D
Are B and E homologous?

Churn in hyper-variable regions

Alignment by structure Alignable Ambiguous Not alignable Gradual transition from alignable to ambiguous. Distantly related (low %id) structures are ambiguous Methods disagree on alignments Structure benchmarks cannot measure specificity SABmark, FSA nonsense

Structure methods disagree A Godzik (1996), The structural alignment between two proteins: is there a unique answer? Protein Sci. 1996 Jul;5(7):1325-38

Homology but diverged structure From SABmark benchmark [van Walle et al, 2004] MUSCLE aligns conserved 10mer (red), allegedly incorrect

Protein MSA benchmarks BALIBASE published in 1999 Started "benchmark war" CLUSTALW has ~40k citations PREFAB Pair-wise structure alignments SABmark OXBENCH Multiple structure alignments

Nucleotide MSA benchmark BRALIBASE, based on solved RNA structures Only credible nucleotide benchmark Too "easy", hard to discriminate methods

MSA methods CLUSTALW (1994): still most widely used, but newer methods definitively better; MUSCLE (2004) and MAFFT (2004) are faster and more accurate. T-COFFEE (1999): pioneered consistency; now PROBCONS (2004) is faster and more accurate. PROBCONS: most accurate, but MUSCLE & MAFFT scale better.

Diminishing returns Last few years Many new methods Claims: ~2% better on benchmarks Validation problems, especially BALIBASE Method A better than B only on average CLUSTALW ≥ PROBCONS on 1/3 of sets Is 2% real or significant in practice? IMO not proven, dubious

PRANK Published in Science

PRANK PRANK less accurate than CLUSTALW

Significant advances in MSA since 2004

BALIBASE v3 90% aligned by sequence, not structure! Some sets have zero or one structure Aligned by sequence only Not independent of sequence methods Comparing my sequence method against theirs Many structures have unclear homology So gold standard sequence alignment not possible Many regions are definitively not homologous

BALIBASE v3 Structures unknown except for SH2 Same domain in many sets, not independent (p-values) Grossly violates global alignment assumptions Some published validations use full-length sequences Complete nonsense! Edgar RC (2010), Quality measures for protein alignment benchmarks, NAR 2010 Apr;38(7):2145-53.

BALIBASE v3 Edgar RC (2010), Quality measures for protein alignment benchmarks, NAR 2010 Apr;38(7):2145-53.

Pair-wise or multiple? Multiple pros: can be more accurate; pre-computed alignment saves expensive step. Multiple cons: can be less accurate; accuracy degrades with number of sequences; each new sequence adds ~Nε errors for some ε, total ~N²ε; exponentially more difficult to reconcile diverged regions; does not scale to very large datasets.
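The quadratic total follows from summing the per-sequence error estimate. A short worked version, assuming (as the slide does) that the k-th sequence added contributes about kε errors:

```latex
\[
  E(N) \;\approx\; \sum_{k=1}^{N} k\,\varepsilon
       \;=\; \varepsilon\,\frac{N(N+1)}{2}
       \;\sim\; \tfrac{1}{2}\,N^{2}\varepsilon
\]
```

The constant factor of one half is dropped on the slide; the point is that the error count grows as the square of the number of sequences.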

When / why is multiple better? Accuracy decreases with distance A C Pair-wise A B C Multiple Transitive alignment Intermediate sequences Only if highly variable rates B A C
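Transitive alignment through an intermediate sequence can be sketched as composing two pair-wise alignments: A is aligned to B, B to C, and residues of A and C are paired whenever they share a residue of B. The function names and the gapped-string representation (with `-` as the gap character) are assumptions made for illustration.

```python
def aligned_pairs(row_x, row_y):
    """Map residue index in x to residue index in y for each aligned column."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    pairs = {}
    i = j = 0  # ungapped positions in x and y
    for cx, cy in zip(row_x, row_y):
        if cx != "-" and cy != "-":
            pairs[i] = j
        if cx != "-":
            i += 1
        if cy != "-":
            j += 1
    return pairs


def transitive_pairs(a_to_b, b_to_c):
    """Pair residues of A and C that are aligned to the same residue of B."""
    return {i: b_to_c[j] for i, j in a_to_b.items() if j in b_to_c}


a_to_b = aligned_pairs("AC-GT", "ACTGT")
b_to_c = aligned_pairs("ACTGT", "A-TGT")
a_to_c = transitive_pairs(a_to_b, b_to_c)
```

Residues of A whose partner in B is gapped against C drop out of the result, so the intermediate helps only where both pair-wise alignments are reliable.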

USEARCH Pair-wise alignments on the fly Scales to very large databases New paradigm in database search Edgar (2010), Search and clustering orders of magnitude faster than BLAST, Bioinformatics. Fast heuristics identify top candidate hits Finds top hit, or top few hits Often tens to thousands of times faster than BLAST Heuristic local or global alignments BLAST-like algorithm for alignment
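The idea of using a fast heuristic to pick a few candidate hits before aligning can be illustrated with a shared-word filter. This is a generic sketch, not USEARCH's actual implementation: the word length, the ranking rule and the function names are all assumptions.

```python
def words(seq, k):
    """Set of distinct length-k words (k-mers) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_candidates(query, database, k=8, top=3):
    """Return names of up to `top` database sequences sharing the most words.

    `database` maps sequence name to sequence. Sequences sharing no word
    with the query are dropped; ties are broken by name.
    """
    query_words = words(query, k)
    scored = []
    for name, seq in database.items():
        shared = len(query_words & words(seq, k))
        if shared > 0:
            scored.append((shared, name))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top]]


db = {"x": "ACGTACGTAC", "y": "TTTTTTTTTT", "z": "ACGTTTTTTT"}
candidates = rank_candidates("ACGTACGTAC", db, k=4)
```

Only the short candidate list would then be passed to a pair-wise aligner, which is where the speed-up over aligning against every database sequence comes from.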

drive5.com/usearch

Pair-wise vs. multiple
Regions, left to right: Conserved, Hyper-variable, Conserved
NAST
7000004128189679 ACATGCAAGTCGAACGCTGAAGC-CCAGCTTGCTGGGTGG-AT-----------GAGTGGCGAACGGGTGAGTAA
7000004128189554 ACATGCAAGTCGAACG-------AAG--------------CATCTTCGGATGCTTAGTGGCGAACGGGTGAGTAA
4 gaps, 2/6 identities
MUSCLE
7000004128189679 ACATGCAAGTCGAACGCTG-AAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA
7000004128189554 ACATGCAAGTCGAACGAAGCAT---CTTCGGAT----GCTTAG--TGGCGAACGGGTGAGTAA
4 gaps, 4/17 identities
USEARCH
7000004128189679 ACATGCAAGTCGAACG--------AAGCATCTTCGGATGCTTAGTGGCGAACGGGTGAGTAA
7000004128189554 ACATGCAAGTCGAACGCTGAAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA
1 gap, 9/18 identities
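Gap and identity counts like those quoted on this slide can be computed directly from a gapped pair of rows. A minimal sketch, assuming a gap means a maximal run of `-` in one row; it counts over the whole alignment, whereas the slide's figures appear to cover only part of it, so the numbers will not match the slide exactly.

```python
def alignment_stats(row_x, row_y):
    """Return (gaps, identities, aligned_columns) for a gapped pair.

    A gap is a maximal run of '-' in one row. Columns where both rows
    hold a residue count as aligned; identical residues as identities.
    """
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    gaps = identities = aligned = 0
    prev_state = None
    for cx, cy in zip(row_x, row_y):
        if cx == "-" and cy == "-":
            continue  # empty column, carries no information
        if cx == "-":
            state = "gap_x"
        elif cy == "-":
            state = "gap_y"
        else:
            state = "match"
        if state == "match":
            aligned += 1
            if cx == cy:
                identities += 1
        elif state != prev_state:
            gaps += 1  # a new gap run opens here
        prev_state = state
    return gaps, identities, aligned
```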

USEARCH speed and accuracy RFAM test RFAM database has ~200,000 RNAs Classified into ~1,400 families Extract 1,000 to use as query Remainder is search database True positive if hit in same family False positive if hit in different family Families may in fact be distantly related
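The scoring rule of this test (a hit is a true positive if it falls in the query's family, a false positive otherwise) can be written down in a few lines. The data structures and names below are hypothetical; the slide does not say how the benchmark was scripted.

```python
def score_hits(top_hits, family_of):
    """Count true and false positives for a family-based search test.

    top_hits maps each query name to the name of its reported hit, or
    None if nothing was reported. family_of maps sequence names to
    family labels.
    """
    true_pos = false_pos = 0
    for query, hit in top_hits.items():
        if hit is None:
            continue  # no hit reported: neither a TP nor an FP
        if family_of[query] == family_of[hit]:
            true_pos += 1
        else:
            false_pos += 1
    return true_pos, false_pos


family_of = {"q1": "tRNA", "q2": "5S", "q3": "tRNA", "d1": "tRNA", "d2": "5S"}
top_hits = {"q1": "d1", "q2": "d1", "q3": None}
tp, fp = score_hits(top_hits, family_of)
```

As the slide notes, families may in fact be distantly related, so some hits counted as false positives under this rule could be genuine homologs.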

Benchmarks at drive5.com

RFAM results

RFAM results

RFAM results