Independent scientist

Robert Edgar, independent scientist
Multiple alignment and database search

Multiple alignment in 16S world
Curated 16S multiple alignments: RDP, SILVA, Greengenes.
MSA methods for 16S: ARB, Infernal, NAST.

Why multiple alignment for 16S?
Computational efficiency: alignment is often the most expensive step, so a pre-computed alignment is faster.
Some claim it is more accurate than pair-wise alignment; I disagree.

Global alignment
Most MSAs are global. Curated 16S alignments are global, as are those from CLUSTALW, MUSCLE, MAFFT, PROBCONS etc.
Approximate homology over the full length.
No rearrangements (duplications, inversions, translocations).
Models only these mutations: substitutions, and a few short, non-overlapping indels.
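
The global model described on this slide (substitutions plus a few short indels, homology over the full length) is what a Needleman-Wunsch style dynamic program scores. The sketch below is a minimal illustration only; the function name and the match, mismatch and gap values are assumptions chosen for the example, not parameters used by any of the tools named above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score under a linear gap penalty.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                            prev[j] + gap,               # gap in b
                            curr[j - 1] + gap))          # gap in a
        prev = curr
    return prev[-1]
```

Because every residue must be placed in one left-to-right order, a rearranged pair such as `global_alignment_score("LEFTRIGHT", "RIGHTLEFT")` can only be scored as mismatches and gaps, which is the limitation the slide points at.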

Issues with global MSA
Rearrangements; tandem duplications; churn (hyper-variable regions); scaling to large datasets.

Domain rearrangements
Multiple alignment of different nicotinamide nucleotide transhydrogenase sequences related to the bovine protein (NNTM). ENTHI, Entamoeba histolytica; EIMTE, Eimeria tenella; CAEEL, Caenorhabditis elegans; ACEAT, Acetabularia acetabulum; NEUCR, Neurospora crassa. The ancestor of the orthologs in the protozoan branch has undergone a circular permutation: note that the motif H-I-J appears on the left in the protozoans Entamoeba and Eimeria but on the right in the other species. Letters are used for brevity; the corresponding ProDom IDs are shown under the alignment.
Weiner J 3rd, Thomas G, Bornberg-Bauer E (2005), Rapid motif-based prediction of circular permutations in multi-domain proteins, Bioinformatics Apr 1;21(7):932-7.

Tandem duplications
Short tandem duplications are the most common source of short insertions in the human genome.
No evidence for tandems in 16S.
Common in proteins.

Tandem duplications & arrays
Gamma crystallin: tandem duplication of the Greek key domain. Ribonuclease inhibitor: tandem array of leucine-rich repeats.

Alignments with tandems
F T R E P R E P T E R N L E F T R E P R E P T E R N L E F T R E P T E R N L E F T R E P R E P T E R N
Tandems cause 1:n homology.
This cannot be represented in the conventional alignment format.

What is a correct alignment?
Per-residue homology: two residues are aligned iff they are homologous.
Homology can be ambiguous (churn) and cannot be determined experimentally.
Structural similarity: two residues are aligned iff they occupy the same position in the structure.
Structural correspondence is fuzzy and not well-defined, but it can be useful if its limitations are understood.

Churn in hyper-variable regions
[Figure: two columns. HISTORY: A B C, then a deletion gives A - C, a substitution gives A - D, and an insertion gives A E D. ALIGNMENT: A B C over A E D, or A B - C over A - E D.]
Are B and E homologous?

Churn in hyper-variable regions

Alignment by structure
[Figure labels: alignable, ambiguous, not alignable.]
Gradual transition from alignable to ambiguous.
Distantly related (low %id) structures are ambiguous.
Methods disagree on alignments.
Structure benchmarks cannot measure specificity.
SABmark, FSA nonsense.

Structure methods disagree
A Godzik (1996), The structural alignment between two proteins: is there a unique answer? Protein Sci Jul;5(7):

Homology but diverged structure
From the SABmark benchmark [van Walle et al., 2004]. MUSCLE aligns a conserved 10-mer (red), allegedly incorrectly.

Protein MSA benchmarks
BALIBASE: published in 1999, started the "benchmark war". CLUSTALW has ~40k citations.
PREFAB: pair-wise structure alignments.
SABmark.
OXBENCH: multiple structure alignments.

Nucleotide MSA benchmark
BRALIBASE, based on solved RNA structures.
Only credible nucleotide benchmark.
Too "easy": hard to discriminate between methods.

MSA methods
CLUSTALW (1994): still most widely used, but newer methods are definitively better; MUSCLE (2004) and MAFFT (2004) are faster and more accurate.
T-COFFEE (1999): pioneered consistency; PROBCONS (2004) is now faster and more accurate.
PROBCONS: most accurate, but MUSCLE and MAFFT scale better.

Diminishing returns
Last few years: many new methods.
Claims: ~2% better on benchmarks.
Validation problems, especially with BALIBASE.
Method A is better than B only on average: CLUSTALW ≥ PROBCONS on 1/3 of sets.
Is 2% real or significant in practice? IMO not proven, dubious.

PRANK
Published in Science.

PRANK
PRANK is less accurate than CLUSTALW.

Significant advances in MSA since 2004

BALIBASE v3
90% aligned by sequence, not structure!
Some sets have zero or one structure, so they are aligned by sequence only.
Not independent of sequence methods: comparing my sequence method against theirs.
Many structures have unclear homology, so a gold-standard sequence alignment is not possible.
Many regions are definitively not homologous.

BALIBASE v3
Structures unknown except for SH2.
Same domain in many sets, so the sets are not independent (p-values).
Grossly violates global alignment assumptions.
Some published validations use full-length sequences. Complete nonsense!
Edgar RC (2010), Quality measures for protein alignment benchmarks, NAR 2010 Apr;38(7):

BALIBASE v3
Edgar RC (2010), Quality measures for protein alignment benchmarks, NAR 2010 Apr;38(7):

Pair-wise or multiple?
Multiple pros:
Can be more accurate.
Pre-computed alignment saves an expensive step.
Multiple cons:
Can be less accurate.
Accuracy degrades with the number of sequences: each new sequence adds ~Nε errors for some ε, for a total of ~N²ε.
Exponentially more difficult to reconcile diverged regions.
Does not scale to very large datasets.
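
The quadratic growth in errors can be written out as a worked sum. This is only a sketch under the slide's own assumption that adding the k-th sequence contributes about kε new errors; the index k is introduced here for illustration.

```latex
\[
\sum_{k=1}^{N} k\,\varepsilon
  \;=\; \varepsilon\,\frac{N(N+1)}{2}
  \;\approx\; \frac{N^{2}\varepsilon}{2}
  \;\sim\; N^{2}\varepsilon
\]
```

The factor of 1/2 is dropped in the order-of-magnitude statement: doubling the number of sequences roughly quadruples the expected number of errors.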

When / why is multiple better?
Accuracy decreases with distance.
[Figure labels: pair-wise, A and C; multiple, A, B and C.]
Transitive alignment through intermediate sequences.
Only if rates are highly variable.
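
Transitive alignment through an intermediate sequence can be sketched as composing two pair-wise alignments: A to B, then B to C. The code below is a minimal illustration that assumes each pair-wise alignment is given as two gapped rows of equal length with "-" as the gap character; the function names are hypothetical.

```python
def residue_map(row_x, row_y):
    """Map residue indices of x to residue indices of y for every column
    in which both rows contain a residue ('-' marks a gap)."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    mapping = {}
    i = j = 0  # running residue indices in x and y
    for cx, cy in zip(row_x, row_y):
        if cx != "-" and cy != "-":
            mapping[i] = j
        if cx != "-":
            i += 1
        if cy != "-":
            j += 1
    return mapping


def transitive_map(a_to_b, b_to_c):
    """Compose A->B and B->C residue maps into an A->C map.

    A residue of A is paired with a residue of C only when the B residue
    it aligns to is itself aligned to something in C.
    """
    return {i: b_to_c[j] for i, j in a_to_b.items() if j in b_to_c}
```

For example, `transitive_map(residue_map("ACG-T", "ACGGT"), residue_map("ACGGT", "A-GGT"))` pairs residues of A with residues of C only where both pair-wise alignments agree on a path through B.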

USEARCH
Pair-wise alignments on the fly.
Scales to very large databases.
New paradigm in database search: Edgar (2010), Search and clustering orders of magnitude faster than BLAST, Bioinformatics.
Fast heuristics identify top candidate hits.
Finds the top hit, or the top few hits.
Often 10s s x faster than BLAST.
Heuristic local or global alignments.
BLAST-like algorithm for alignment.
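
The idea of using a fast heuristic to pick a few candidate hits before aligning can be illustrated with a shared-word count. This is a generic sketch and not USEARCH's actual implementation; the function names, the word length `k` and the `max_candidates` cutoff are all assumptions made for the example.

```python
from collections import Counter


def kmers(seq, k):
    """Set of distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def top_candidates(query, database, k=4, max_candidates=2):
    """Rank database sequences by the number of distinct k-mers they
    share with the query and return the names of the best few.

    database maps sequence name -> sequence string.
    """
    query_words = kmers(query, k)
    shared = Counter()
    for name, seq in database.items():
        count = len(query_words & kmers(seq, k))
        if count:  # ignore sequences with no word in common
            shared[name] = count
    return [name for name, _ in shared.most_common(max_candidates)]
```

Only the returned candidates would then be aligned pair-wise, which is where the speed-up over aligning the query against every database sequence comes from.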

drive5.com/usearch

Pair-wise vs. multiple
[Figure labels: conserved, hyper-variable, conserved.]
NAST
ACATGCAAGTCGAACGCTGAAGC-CCAGCTTGCTGGGTGG-AT GAGTGGCGAACGGGTGAGTAA
ACATGCAAGTCGAACG AAG CATCTTCGGATGCTTAGTGGCGAACGGGTGAGTAA
4 gaps, 2/6 identities
MUSCLE
ACATGCAAGTCGAACGCTG-AAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA
ACATGCAAGTCGAACGAAGCAT---CTTCGGAT----GCTTAG--TGGCGAACGGGTGAGTAA
4 gaps, 4/17 identities
USEARCH
ACATGCAAGTCGAACG AAGCATCTTCGGATGCTTAGTGGCGAACGGGTGAGTAA
ACATGCAAGTCGAACGCTGAAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA
1 gap, 9/18 identities
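
Gap and identity counts like those quoted for each method can be recomputed from a pair of aligned rows. The sketch below assumes "-" is the gap character and counts a gap as one maximal run of gap characters in either row; the function name is hypothetical, and restricting the count to the hyper-variable region is left to the caller.

```python
def alignment_stats(row_x, row_y):
    """Return (gap_runs, identities, aligned_pairs) for two aligned rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    gap_runs = identities = aligned_pairs = 0
    prev = None  # row holding a gap in the previous column: "x", "y" or None
    for cx, cy in zip(row_x, row_y):
        state = "x" if cx == "-" else "y" if cy == "-" else None
        if state is None:
            aligned_pairs += 1
            if cx == cy:
                identities += 1
        elif state != prev:
            gap_runs += 1  # a new run starts when the gapped row changes
        prev = state
    return gap_runs, identities, aligned_pairs
```

Slicing both rows to the same column range before calling the function gives per-region figures in the form "identities / aligned pairs".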

USEARCH speed and accuracy
RFAM test
The RFAM database has ~200,000 RNAs, classified into ~1,400 families.
Extract 1,000 to use as queries; the remainder is the search database.
True positive if the hit is in the same family.
False positive if the hit is in a different family.
Families may in fact be distantly related.
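
The test protocol can be sketched as a query/database split followed by a family comparison for each hit. This is a minimal illustration, not the benchmark code from drive5.com; the function names, the `family_of` and `top_hit` dictionaries and the fixed random seed are assumptions made for the example.

```python
import random


def split_queries(family_of, n_queries, seed=0):
    """Pick n_queries sequences at random as queries; the rest form the
    search database. family_of maps sequence name -> family name."""
    names = sorted(family_of)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    queries = set(rng.sample(names, n_queries))
    database = [name for name in names if name not in queries]
    return sorted(queries), database


def score_hits(top_hit, family_of):
    """Count (true_positives, false_positives).

    top_hit maps query name -> name of its reported hit, or None if the
    search returned nothing for that query.
    """
    true_pos = false_pos = 0
    for query, hit in top_hit.items():
        if hit is None:
            continue
        if family_of[hit] == family_of[query]:
            true_pos += 1   # hit in the same family
        else:
            false_pos += 1  # hit in a different family
    return true_pos, false_pos
```

Since families may in fact be distantly related, a "false positive" under this definition is not necessarily a wrong hit.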

Benchmarks at drive5.com

RFAM results
