1
Multiple alignment and database search
Robert Edgar, Independent scientist
2
Multiple alignment in 16S world
Curated 16S multiple alignments: RDP, SILVA, Greengenes
MSA methods for 16S: ARB, Infernal, NAST
3
Why multiple alignment for 16S?
Computational efficiency
  Alignment is often the most expensive step
  Pre-computed alignment is faster
Some claim it is more accurate than pair-wise
  I disagree
4
Global alignment
Most MSAs are global
  Curated 16S alignments are global
  CLUSTALW, MUSCLE, MAFFT, PROBCONS etc.
Approximate homology over full length
No rearrangements
  Duplications, inversions, translocations
Models only these mutations:
  Substitutions
  Few short, non-overlapping indels
5
Issues with global MSA
Rearrangements
Tandem duplications
Churn (hyper-variable regions)
Scaling to large datasets
6
Domain rearrangements
Multiple alignment of different nicotinamide nucleotide transhydrogenase sequences related to the bovine protein (NNTM). ENTHI, Entamoeba histolytica; EIMTE, Eimeria tenella; CAEEL, Caenorhabditis elegans; ACEAT, Acetabularia acetabulum; NEUCR, Neurospora crassa. The ancestor of the orthologs in the protozoan branch has undergone a circular permutation: note that the motif H-I-J appears on the left in the protozoans Entamoeba and Eimeria but on the right in the other species. Letters are used for brevity; the corresponding ProDom IDs are shown under the alignment.
Weiner J 3rd, Thomas G, Bornberg-Bauer E (2005), Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics Apr 1;21(7):932-7.
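A circular permutation can be illustrated with a small check on domain order. This is only a sketch of the idea, not the method of Weiner et al.; the domain letters other than H-I-J and the function name `is_circular_permutation` are hypothetical.

```python
def is_circular_permutation(domains_a, domains_b):
    """Return True if domains_b is a non-trivial rotation of domains_a."""
    a, b = list(domains_a), list(domains_b)
    if len(a) != len(b) or a == b:
        return False
    doubled = a + a  # every rotation of a is a window of a + a
    n = len(a)
    return any(doubled[i:i + n] == b for i in range(1, n))


# Hypothetical domain orders: only the H-I-J motif comes from the figure.
other_species = ["A", "B", "C", "H", "I", "J"]  # H-I-J on the right
protozoan = ["H", "I", "J", "A", "B", "C"]      # H-I-J on the left

print(is_circular_permutation(other_species, protozoan))
```

A global alignment cannot place H-I-J opposite H-I-J here without breaking the colinearity of the remaining domains.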
7
Tandem duplications
Short tandem duplications are the most common source of short insertions in the human genome
No evidence for tandems in 16S
Common in proteins
8
Tandem duplications & arrays
Gamma crystallin: tandem duplication of Greek key domain.
Ribonuclease inhibitor: tandem array of leucine-rich repeats.
9
Alignments with tandems
[Figure: alternative alignments of LEFTREPREPTERN and LEFTREPTERN, where the REP unit is tandem-duplicated in the longer sequence]
Tandems cause 1:n homology
Cannot be represented in conventional alignment format
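The 1:n problem can be made concrete with the toy sequences LEFTREPREPTERN and LEFTREPTERN, as read from the slide's example. The sketch below tries every position for a single contiguous gap and counts identities; the function name `gap_placements` is hypothetical, and identity counting stands in for a real alignment score.

```python
def gap_placements(long_seq, short_seq):
    """Try every position for one contiguous gap and count identities."""
    gap_len = len(long_seq) - len(short_seq)
    results = []
    for pos in range(len(short_seq) + 1):
        row = short_seq[:pos] + "-" * gap_len + short_seq[pos:]
        identities = sum(x == y for x, y in zip(long_seq, row))
        results.append((identities, row))
    return results


long_seq = "LEFTREPREPTERN"   # REP duplicated in tandem
short_seq = "LEFTREPTERN"     # single REP

placements = gap_placements(long_seq, short_seq)
best = max(score for score, _ in placements)
for score, row in placements:
    if score == best:
        print(long_seq)
        print(row)
        print(score, "identities")
```

Several gap placements tie for the best score, including one that pairs the single REP with the first copy and one that pairs it with the second. A conventional alignment has to pick one, although the single REP is homologous to both copies.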
10
What is a correct alignment?
Per-residue homology
  Two residues aligned iff homologous
  Homology can be ambiguous (churn)
  Cannot be determined experimentally
Structural similarity
  Two residues aligned iff same position in structure
  Structural correspondence is fuzzy
  Not well-defined...
  ...but can be useful if limitations are understood
11
Churn in hyper-variable regions
Are B and E homologous?
HISTORY
  A B C
  Deletion:     A - C
  Substitution: A - D
  Insertion:    A E D
ALIGNMENT (two candidates)
  A B C      A B - C
  A E D      A - E D
12
Churn in hyper-variable regions
13
Alignment by structure
Alignable / Ambiguous / Not alignable
Gradual transition from alignable to ambiguous
Distantly related (low %id) structures are ambiguous
  Methods disagree on alignments
Structure benchmarks cannot measure specificity
  SABmark, FSA nonsense
14
Structure methods disagree
A Godzik (1996), The structural alignment between two proteins: is there a unique answer? Protein Sci Jul;5(7):
15
Homology but diverged structure
From SABmark benchmark [van Walle et al, 2004]
MUSCLE aligns conserved 10mer (red), allegedly incorrect
16
Protein MSA benchmarks
BALIBASE published in 1999
  Started "benchmark war"
  CLUSTALW has ~40k citations
PREFAB
  Pair-wise structure alignments
SABmark
OXBENCH
  Multiple structure alignments
17
Nucleotide MSA benchmark
BRALIBASE, based on solved RNA structures Only credible nucleotide benchmark Too "easy", hard to discriminate methods
18
MSA methods
CLUSTALW (1994)
  Still most widely used
  Newer methods definitively better
  MUSCLE (2004), MAFFT (2004) faster and more accurate
T-COFFEE (1999)
  Pioneered consistency
  Now PROBCONS (2004) is faster and more accurate
PROBCONS
  Most accurate, but MUSCLE & MAFFT better scaling
19
Diminishing returns
Last few years
  Many new methods
  Claims: ~2% better on benchmarks
Validation problems, especially BALIBASE
Method A better than B only on average
  CLUSTALW ≥ PROBCONS on 1/3 of sets
Is 2% real or significant in practice?
  IMO not proven, dubious
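The gap between "better on average" and "better on every set" can be shown with a small per-set comparison. The scores below are made-up toy numbers, not benchmark results, and `compare_methods` is a hypothetical helper.

```python
def compare_methods(scores_a, scores_b):
    """Return (mean of B minus A, fraction of sets where A >= B)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(scores_a)
    mean_diff = sum(b - a for a, b in zip(scores_a, scores_b)) / n
    frac_a_ge_b = sum(a >= b for a, b in zip(scores_a, scores_b)) / n
    return mean_diff, frac_a_ge_b


# Toy per-set accuracy scores (illustrative only).
older_method = [0.80, 0.70, 0.90]
newer_method = [0.85, 0.78, 0.88]

mean_diff, frac = compare_methods(older_method, newer_method)
print(f"mean improvement: {mean_diff:.3f}")
print(f"older method at least as good on {frac:.0%} of sets")
```

Reporting the per-set fraction alongside the mean shows how often the nominally weaker method still ties or wins.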
20
PRANK Published in Science
21
PRANK PRANK less accurate than CLUSTALW
22
Significant advances in MSA since 2004
23
BALIBASE v3
90% aligned by sequence, not structure!
  Some sets have zero or one structure
  Aligned by sequence only
Not independent of sequence methods
  Comparing my sequence method against theirs
Many structures have unclear homology
  So gold standard sequence alignment not possible
Many regions are definitively not homologous
24
BALIBASE v3
Structures unknown except for SH2
Same domain in many sets, not independent (p-values)
Grossly violates global alignment assumptions
Some published validations use full-length sequences
  Complete nonsense!
Edgar RC (2010), Quality measures for protein alignment benchmarks, NAR 2010 Apr;38(7):
25
BALIBASE v3
Edgar RC (2010), Quality measures for protein alignment benchmarks, NAR 2010 Apr;38(7):
26
Pair-wise or multiple?
Multiple pros
  Can be more accurate
  Pre-computed alignment saves expensive step
Multiple cons
  Can be less accurate
  Accuracy degrades with number of sequences
    Each new sequence adds ~Nε errors for some ε, total ~N²ε
    Exponentially more difficult to reconcile diverged regions
  Does not scale to very large datasets
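One way to read the quadratic error estimate: if the k-th sequence added to the alignment contributes about kε new errors, the total after N sequences is εN(N+1)/2, which grows as N². The sketch below checks that arithmetic; the per-step model and the value of ε are assumptions for illustration.

```python
def total_errors(n_sequences, eps):
    """Sum the per-sequence error contributions k * eps for k = 1..N."""
    return sum(k * eps for k in range(1, n_sequences + 1))


eps = 0.01  # hypothetical per-pair error rate
for n in (10, 100, 1000):
    exact = total_errors(n, eps)
    closed_form = eps * n * (n + 1) / 2
    print(n, round(exact, 2), round(closed_form, 2))
```

Under this model a tenfold increase in the number of sequences gives roughly a hundredfold increase in accumulated errors.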
27
When / why is multiple better?
Accuracy decreases with distance
Transitive alignment
  Intermediate sequences
  Only if highly variable rates
[Figure: pair-wise aligns A to C directly; multiple aligns A to C through intermediate B; tree of A, B, C]
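Transitive alignment through an intermediate sequence can be sketched as a composition of residue pairings: A is aligned to B, B to C, and A to C pairs are read off through B. The toy sequences and the helper names `aligned_pairs` and `compose` are hypothetical.

```python
def aligned_pairs(row_x, row_y):
    """Convert two gapped rows into (index_in_x, index_in_y) residue pairs."""
    pairs, i, j = [], 0, 0
    for cx, cy in zip(row_x, row_y):
        if cx != "-" and cy != "-":
            pairs.append((i, j))
        if cx != "-":
            i += 1
        if cy != "-":
            j += 1
    return pairs


def compose(pairs_ab, pairs_bc):
    """Map A residues to C residues through their shared partner in B."""
    b_to_c = dict(pairs_bc)
    return [(a, b_to_c[b]) for a, b in pairs_ab if b in b_to_c]


# Toy alignments: A vs B, then B vs C.
pairs_ab = aligned_pairs("ACGT-A", "ACGTTA")
pairs_bc = aligned_pairs("ACGTTA", "AC-TTA")
print(compose(pairs_ab, pairs_bc))
```

Residues of A whose partner in B falls opposite a gap in C drop out of the composed alignment, so the result is only as complete as the two closer alignments allow.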
28
USEARCH
Pair-wise alignments on the fly
Scales to very large databases
New paradigm in database search
  Edgar (2010), Search and clustering orders of magnitude faster than BLAST, Bioinformatics.
Fast heuristics identify top candidate hits
  Finds top hit, or top few hits
  Often 10s s x faster than BLAST
Heuristic local or global alignments
  BLAST-like algorithm for alignment
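The idea of a fast heuristic that picks a few candidate hits before any alignment is done can be sketched with shared k-mer counting. This is an illustrative stand-in, not a description of USEARCH internals; the function names, `k`, `max_hits` and the toy database are all assumptions.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def top_candidates(query, database, k=4, max_hits=2):
    """Rank database sequences by number of k-mers shared with the query."""
    q = kmers(query, k)
    scored = [(len(q & kmers(seq, k)), name) for name, seq in database.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for shared, name in scored[:max_hits] if shared > 0]


# Toy database (hypothetical sequences).
database = {
    "s1": "ACGTACGTAC",
    "s2": "TTTTTTTTTT",
    "s3": "ACGTTTTTTT",
}
print(top_candidates("ACGTACGTAC", database))
```

Only the returned candidates would then be aligned pair-wise, which is where the time saving over aligning against every database sequence comes from.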
29
drive5.com/usearch
30
Pair-wise vs. multiple
Regions: Conserved / Hyper-variable / Conserved

NAST
ACATGCAAGTCGAACGCTGAAGC-CCAGCTTGCTGGGTGG-AT GAGTGGCGAACGGGTGAGTAA
|||||||||||||||| || ||||||||||||||||||||
ACATGCAAGTCGAACG AAG CATCTTCGGATGCTTAGTGGCGAACGGGTGAGTAA
4 gaps, 2/6 identities

MUSCLE
ACATGCAAGTCGAACGCTG-AAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA
||||||||||||||||  | |    |  |             |  ||||||||||||||||||
ACATGCAAGTCGAACGAAGCAT---CTTCGGAT----GCTTAG--TGGCGAACGGGTGAGTAA
4 gaps, 4/17 identities

USEARCH
ACATGCAAGTCGAACG--------AAGCATCTTCGGATGCTTAGTGGCGAACGGGTGAGTAA
||||||||||||||||         ||| |  | ||  | | ||||||||||||||||||||
ACATGCAAGTCGAACGCTGAAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA
1 gap, 9/18 identities
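Gap and identity counts like those quoted on the slide can be computed from any pair of gapped rows. The sketch below counts over the whole alignment rather than over the hyper-variable region only, so its totals are not directly comparable with the slide's per-region figures; `alignment_stats` is a hypothetical helper, and the two rows are the MUSCLE rows from the slide.

```python
import re


def alignment_stats(row_a, row_b):
    """Return (gap runs, identical columns, aligned columns) for two rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    gap_runs = len(re.findall(r"-+", row_a)) + len(re.findall(r"-+", row_b))
    aligned = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    identities = sum(x == y for x, y in aligned)
    return gap_runs, identities, len(aligned)


muscle_a = "ACATGCAAGTCGAACGCTG-AAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA"
muscle_b = "ACATGCAAGTCGAACGAAGCAT---CTTCGGAT----GCTTAG--TGGCGAACGGGTGAGTAA"

gaps, identities, columns = alignment_stats(muscle_a, muscle_b)
print(f"{gaps} gaps, {identities}/{columns} identities over the full alignment")
```

Restricting the rows to the hyper-variable columns before calling the function would give per-region figures in the style of the slide.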
31
USEARCH speed and accuracy
RFAM test
RFAM database has ~200,000 RNAs
  Classified into ~1,400 families
Extract 1,000 to use as query
  Remainder is search database
True positive if hit in same family
False positive if hit in different family
  Families may in fact be distantly related
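The test protocol can be sketched in a few lines: hold out a random set of queries, search the remainder, and score each hit by family membership. The sequence IDs, family labels, hit list and function names below are hypothetical; a real run would take them from RFAM and from the search tool's output.

```python
import random


def split_queries(ids, n_queries, seed=0):
    """Hold out n_queries random IDs as queries; the rest form the database."""
    rng = random.Random(seed)
    queries = set(rng.sample(sorted(ids), n_queries))
    database = [i for i in ids if i not in queries]
    return sorted(queries), database


def score_hits(hits, family_of):
    """Count hits in the same family (true) and in a different one (false)."""
    tp = fp = 0
    for query, target in hits:
        if family_of[query] == family_of[target]:
            tp += 1
        else:
            fp += 1
    return tp, fp


# Toy family labels and hits (hypothetical).
family_of = {"q1": "fam1", "q2": "fam2", "d1": "fam1", "d2": "fam2", "d3": "fam3"}
hits = [("q1", "d1"), ("q2", "d3"), ("q2", "d2")]
print(score_hits(hits, family_of))
```

Because families may in fact be distantly related, a hit counted as a false positive here is not necessarily a biological error.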
32
Benchmarks at drive5.com
33
RFAM results
34
RFAM results
35
RFAM results