Robert Edgar Independent scientist robert@drive5.com www.drive5.com Multiple alignment and database search
Multiple alignment in 16S world Curated 16S multiple alignments: RDP SILVA Greengenes MSA methods for 16S: ARB Infernal NAST
Why multiple alignment for 16S? Computational efficiency Alignment is often most expensive step Pre-computed alignment faster Some claim more accurate than pair-wise I disagree
Global alignment Most MSAs are global Curated 16S alignments are global CLUSTALW, MUSCLE, MAFFT, PROBCONS etc. Approximate homology over full length No rearrangements (duplications, inversions, translocations) Models only these mutations: substitutions; few short, non-overlapping indels
Issues with global MSA Rearrangements Tandem duplications Churn (hyper-variable regions) Scaling to large datasets
Domain rearrangements Multiple alignment of different nicotinamide nucleotide transhydrogenase sequences related to the bovine protein (NNTM). ENTHI, Entamoeba histolytica; EIMTE, Eimeria tenella; CAEEL, Caenorhabditis elegans; ACEAT, Acetabularia acetabulum; NEUCR, Neurospora crassa. The ancestor of the orthologs in the protozoan branch has undergone a circular permutation: note that the motif H-I-J appears on the left in the protozoans Entamoeba and Eimeria but on the right in the other species. Letters are used for brevity; the corresponding ProDom IDs are shown under the alignment. Weiner J 3rd, Thomas G, Bornberg-Bauer E. (2005), Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics. 2005 Apr 1;21(7):932-7.
Tandem duplications Short tandem duplications most common source of short insertions in human genome No evidence for tandems in 16S Common in proteins
Tandem duplications & arrays Gamma crystallin, tandem duplication of greek key domain. Ribonuclease inhibitor, tandem array of leucine-rich repeats.
Alignments with tandems F T R E P R E P T E R N L E F T R E P R E P T E R N L E F T R E P T E R N L E F T R E P R E P T E R N Tandems cause 1:n homology Cannot be represented in conventional alignment format
What is a correct alignment? Per-residue homology Two residues aligned iff homologous Homology can be ambiguous (churn) Cannot be determined experimentally Structural similarity Two residues aligned iff same position in structure Structural correspondence is fuzzy Not well-defined... ...but can be useful if limitations are understood
Churn in hyper-variable regions HISTORY ALIGNMENT Are B and E homologous? A B C Deletion A B C A - C A E D Substitution A - D A B - C Insertion A - E D A E D
Alignment by structure Alignable Ambiguous Not alignable Gradual transition from alignable to ambiguous. Distantly related (low %id) structures are ambiguous Methods disagree on alignments Structure benchmarks cannot measure specificity SABmark, FSA nonsense
Structure methods disagree A Godzik (1996), The structural alignment between two proteins: is there a unique answer? Protein Sci. 1996 Jul;5(7):1325-38
Homology but diverged structure From SABmark benchmark [van Walle et al, 2004] MUSCLE aligns conserved 10mer (red), allegedly incorrect
Protein MSA benchmarks BALIBASE published in 1999 Started "benchmark war" CLUSTALW has ~40k citations PREFAB Pair-wise structure alignments SABmark OXBENCH Multiple structure alignments
Nucleotide MSA benchmark BRALIBASE, based on solved RNA structures Only credible nucleotide benchmark Too "easy", hard to discriminate methods
MSA methods CLUSTALW (1994): Still most widely used. Newer methods definitively better. MUSCLE (2004), MAFFT (2004) faster and more accurate. T-COFFEE (1999): Pioneered consistency. Now PROBCONS (2004) is faster and more accurate. PROBCONS: Most accurate, but MUSCLE & MAFFT better scaling.
Diminishing returns Last few years Many new methods Claims: ~2% better on benchmarks Validation problems, especially BALIBASE Method A better than B only on average CLUSTALW ≥ PROBCONS on 1/3 of sets Is 2% real or significant in practice? IMO not proven, dubious
PRANK Published in Science
PRANK PRANK less accurate than CLUSTALW
Significant advances in MSA since 2004
BALIBASE v3 90% aligned by sequence, not structure! Some sets have zero or one structure Aligned by sequence only Not independent of sequence methods Comparing my sequence method against theirs Many structures have unclear homology So gold standard sequence alignment not possible Many regions are definitively not homologous
BALIBASE v3 Structures unknown except for SH2 Same domain in many sets, not independent (p-values) Grossly violates global alignment assumptions Some published validations use full-length sequences Complete nonsense! Edgar RC (2010), Quality measures for protein alignment benchmarks, NAR 2010 Apr;38(7):2145-53.
Pair-wise or multiple? Multiple pros: Can be more accurate Pre-computed alignment saves expensive step Multiple cons: Can be less accurate Accuracy degrades with number of sequences Each new sequence adds ~Nε errors for some ε, total N²ε Exponentially more difficult to reconcile diverged regions Does not scale to very large datasets
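One way to read the error estimate on this slide, assuming (this is an assumption, the slide gives only orders of magnitude) that the k-th sequence added to the alignment contributes about kε new errors:

```latex
\sum_{k=1}^{N} k\,\varepsilon \;=\; \varepsilon\,\frac{N(N+1)}{2} \;\approx\; \frac{N^{2}\varepsilon}{2}
```

Under that assumption the total grows quadratically in N, which agrees with the slide's N²ε up to a constant factor.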
When / why is multiple better? Accuracy decreases with distance A C Pair-wise A B C Multiple Transitive alignment Intermediate sequences Only if highly variable rates B A C
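The transitive idea (align A to C through an intermediate B) can be sketched as follows. This is a minimal illustration, not the method of any particular program: the function names `aligned_pairs` and `compose` are hypothetical, and alignments are assumed to be given as two gapped rows of equal length with `-` as the gap character.

```python
def aligned_pairs(row_x, row_y):
    """Return (i, j) index pairs of residues aligned to each other.

    i and j are 0-based positions in the ungapped sequences X and Y.
    Columns containing a gap in either row produce no pair.
    """
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    pairs = []
    i = j = 0
    for x, y in zip(row_x, row_y):
        if x != "-" and y != "-":
            pairs.append((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs


def compose(pairs_ab, pairs_bc):
    """Transitive A-C pairs: a~b and b~c implies a~c.

    Residues of A whose partner in B is unaligned in C are dropped.
    """
    b_to_c = dict(pairs_bc)
    return [(a, b_to_c[b]) for a, b in pairs_ab if b in b_to_c]


# Toy example (made-up sequences): A-B and B-C alignments give A-C.
ab = aligned_pairs("AC-GT", "ACTGT")
bc = aligned_pairs("ACTGT", "A-TGT")
ac = compose(ab, bc)
```

The composed pairs are only as good as the two input alignments, which is why an intermediate sequence helps mainly when A-B and B-C are each easier to align than A-C directly.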
USEARCH Pair-wise alignments on the fly Scales to very large databases New paradigm in database search Edgar (2010), Search and clustering orders of magnitude faster than BLAST, Bioinformatics. Fast heuristics identify top candidate hits Finds top hit, or top few hits Often 10s - 1000s x faster than BLAST Heuristic local or global alignments BLAST-like algorithm for alignment
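The "fast heuristics identify top candidate hits" step can be illustrated with a generic shared-word filter. This is a sketch of the general idea only, not USEARCH's implementation: the function names, the word length `k`, and the toy database are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Set of distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def top_candidates(query, database, k=8, max_hits=3):
    """Rank database sequences by number of distinct words shared with query.

    database maps a sequence name to its sequence. Only the top-ranked
    candidates would then be passed on to a (more expensive) alignment step.
    """
    query_words = kmers(query, k)
    scored = [(len(query_words & kmers(seq, k)), name)
              for name, seq in database.items()]
    # Most shared words first; break ties by name for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for shared, name in scored[:max_hits] if shared > 0]


# Toy database (made-up sequences).
toy_db = {"x": "ACGTACGTAC", "y": "TTTTTTTTTT", "z": "ACGTTTTTTT"}
candidates = top_candidates("ACGTACGTAC", toy_db, k=4)
```

Because word counting is much cheaper than alignment, a filter of this kind lets a search align the query against only a handful of candidates rather than the whole database.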
drive5.com/usearch
Pair-wise vs. multiple
Regions (left to right): Conserved, Hyper-variable, Conserved

NAST
7000004128189679 ACATGCAAGTCGAACGCTGAAGC-CCAGCTTGCTGGGTGG-AT-----------GAGTGGCGAACGGGTGAGTAA
                 ||||||||||||||||                         ||            ||||||||||||||||||||
7000004128189554 ACATGCAAGTCGAACG-------AAG--------------CATCTTCGGATGCTTAGTGGCGAACGGGTGAGTAA
4 gaps, 2/6 identities

MUSCLE
7000004128189679 ACATGCAAGTCGAACGCTG-AAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA
                 ||||||||||||||||  | |    |  |             |  ||||||||||||||||||
7000004128189554 ACATGCAAGTCGAACGAAGCAT---CTTCGGAT----GCTTAG--TGGCGAACGGGTGAGTAA
4 gaps, 4/17 identities

USEARCH
7000004128189679 ACATGCAAGTCGAACG--------AAGCATCTTCGGATGCTTAGTGGCGAACGGGTGAGTAA
                 ||||||||||||||||         ||| |  | ||  | | ||||||||||||||||||||
7000004128189554 ACATGCAAGTCGAACGCTGAAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA
1 gap, 9/18 identities
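The gap and identity counts quoted for the USEARCH alignment on this slide can be checked with a short script. This is a minimal sketch: the function names are hypothetical, and the column range 24 to 42 (0-based, end exclusive) for the ungapped part of the hyper-variable region is read off the alignment by eye, not taken from the slide.

```python
def count_gap_runs(row):
    """Number of maximal runs of '-' in one alignment row."""
    runs = 0
    previous_was_gap = False
    for ch in row:
        is_gap = ch == "-"
        if is_gap and not previous_was_gap:
            runs += 1
        previous_was_gap = is_gap
    return runs


def region_identity(row_a, row_b, start, end):
    """(identical columns, total columns) in columns [start, end).

    Columns with a gap in either row never count as identities.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    identical = sum(1 for a, b in zip(row_a[start:end], row_b[start:end])
                    if a == b and a != "-")
    return identical, end - start


# The two rows of the USEARCH alignment, copied from the slide.
row1 = "ACATGCAAGTCGAACG--------AAGCATCTTCGGATGCTTAGTGGCGAACGGGTGAGTAA"
row2 = "ACATGCAAGTCGAACGCTGAAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA"

gaps = count_gap_runs(row1) + count_gap_runs(row2)   # 1
ident = region_identity(row1, row2, 24, 42)          # (9, 18)
```

The same functions can be applied to the other two alignments, although how the slide counts gaps and delimits the region there is not stated, so the numbers may not line up exactly.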
USEARCH speed and accuracy RFAM test RFAM database has ~200,000 RNAs Classified into ~1,400 families Extract 1,000 to use as query Remainder is search database True positive if hit in same family False positive if hit in different family Families may in fact be distantly related
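The test design above can be sketched in a few lines. This is an illustration under stated assumptions, not the script used for the benchmark: `split_queries`, `score_hits`, the `family_of` mapping, and the toy identifiers are all hypothetical names introduced here.

```python
import random


def split_queries(sequence_ids, n_queries, seed=0):
    """Hold out n_queries random sequences; the remainder is the database."""
    rng = random.Random(seed)
    queries = set(rng.sample(sorted(sequence_ids), n_queries))
    database = [s for s in sequence_ids if s not in queries]
    return sorted(queries), database


def score_hits(hits, family_of):
    """Count true and false positives among (query_id, target_id) hits.

    A hit is a true positive if query and target are in the same family,
    and a false positive otherwise.
    """
    true_pos = false_pos = 0
    for query_id, target_id in hits:
        if family_of[query_id] == family_of[target_id]:
            true_pos += 1
        else:
            false_pos += 1
    return true_pos, false_pos


# Toy data (made up): two queries, one correct hit and one wrong-family hit.
family_of = {"q1": "F1", "q2": "F2", "t1": "F1", "t2": "F3"}
tp, fp = score_hits([("q1", "t1"), ("q2", "t2")], family_of)
queries, database = split_queries(["a", "b", "c", "d", "e"], 2)
```

As the slide notes, families may in fact be distantly related, so some hits counted as false positives by this rule could be genuine homologs.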
Benchmarks at drive5.com
RFAM results