Independent scientist


1 Independent scientist
Robert Edgar, Independent scientist
Multiple alignment and database search

2 Multiple alignment in 16S world
Curated 16S multiple alignments: RDP, SILVA, Greengenes
MSA methods for 16S: ARB, Infernal, NAST

3 Why multiple alignment for 16S?
Computational efficiency
  Alignment is often the most expensive step
  Pre-computed alignment is faster
Some claim multiple alignment is more accurate than pair-wise
  I disagree

4 Global alignment
Most MSAs are global
  Curated 16S alignments are global
  CLUSTALW, MUSCLE, MAFFT, PROBCONS etc.
Approximate homology over full length
No rearrangements
  Duplications, inversions, translocations
Models only these mutations:
  Substitutions
  Few short, non-overlapping indels
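To make the global-alignment model concrete, here is a minimal sketch of a Needleman-Wunsch score with a linear gap penalty. The scoring values (match, mismatch, gap) are illustrative assumptions, not parameters of any tool named on this slide. Only substitutions and indels can be scored, so a rearranged sequence is penalised heavily even when every segment is homologous.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of a vs b (linear gap penalty, assumed scores)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: substitute/match, gap in b, gap in a.
            curr.append(max(sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# A circular permutation (hypothetical toy sequences) keeps every segment
# but scores far below the unrearranged comparison.
identical = global_alignment_score("ABCDEFGHIJ", "ABCDEFGHIJ")
permuted = global_alignment_score("ABCDEFGHIJ", "HIJABCDEFG")
```

Comparing `identical` with `permuted` shows why rearrangements violate the assumptions listed above.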

5 Issues with global MSA
Rearrangements
Tandem duplications
Churn (hyper-variable regions)
Scaling to large datasets

6 Domain rearrangements
Multiple alignment of different nicotinamide nucleotide transhydrogenase sequences related to the bovine protein (NNTM). ENTHI, Entamoeba histolytica; EIMTE, Eimeria tenella; CAEEL, Caenorhabditis elegans; ACEAT, Acetabularia acetabulum; NEUCR, Neurospora crassa. The ancestor of the orthologs in the protozoan branch has undergone a circular permutation: note that the motif H-I-J appears on the left in the protozoans Entamoeba and Eimeria but on the right in the other species. Letters are used for brevity; the corresponding ProDom IDs are shown under the alignment.
Weiner J 3rd, Thomas G, Bornberg-Bauer E. (2005), Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics Apr 1;21(7):932-7.

7 Tandem duplications
Short tandem duplications are the most common source of short insertions in the human genome
No evidence for tandems in 16S
Common in proteins

8 Tandem duplications & arrays
Gamma crystallin, tandem duplication of Greek key domain.
Ribonuclease inhibitor, tandem array of leucine-rich repeats.

9 Alignments with tandems
FTREPREPTERNLE
FTREPREPTERNLE
FTREPTERNLE
FTREPREPTERN
Tandems cause 1:n homology
Cannot be represented in conventional alignment format
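As a small illustration of the 1:n problem, the sketch below finds adjacent repeated units in a sequence by brute force. It is a toy detector written for this example, not a method from the talk. In the slide's sequence the unit REP occurs twice in tandem, so the single REP in the shorter sequence is homologous to both copies, and a column-based alignment can pair it with only one of them.

```python
def find_tandem_repeats(seq, min_unit=2):
    """Return (start, unit) for every unit immediately followed by a copy of itself."""
    hits = []
    n = len(seq)
    for unit in range(min_unit, n // 2 + 1):
        for start in range(0, n - 2 * unit + 1):
            first = seq[start:start + unit]
            second = seq[start + unit:start + 2 * unit]
            if first == second:
                hits.append((start, first))
    return hits


with_tandem = find_tandem_repeats("FTREPREPTERNLE")   # [(2, 'REP')]
without_tandem = find_tandem_repeats("FTREPTERNLE")   # []
```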

10 What is a correct alignment?
Per-residue homology
  Two residues aligned iff homologous
  Homology can be ambiguous (churn)
  Cannot be determined experimentally
Structural similarity
  Two residues aligned iff same position in structure
  Structural correspondence is fuzzy
  Not well-defined...
  ...but can be useful if limitations are understood

11 Churn in hyper-variable regions
HISTORY
  A B C
  Deletion: A - C
  Substitution: A - D
  Insertion: A E D
ALIGNMENT
  A B C over A E D
  A B - C over A - E D
Are B and E homologous?

12 Churn in hyper-variable regions

13 Alignment by structure
Alignable, Ambiguous, Not alignable
Gradual transition from alignable to ambiguous
Distantly related (low %id) structures are ambiguous
  Methods disagree on alignments
Structure benchmarks cannot measure specificity
  SABmark, FSA nonsense

14 Structure methods disagree
A Godzik (1996), The structural alignment between two proteins: is there a unique answer? Protein Sci Jul;5(7):

15 Homology but diverged structure
From SABmark benchmark [van Walle et al, 2004]
MUSCLE aligns conserved 10mer (red), allegedly incorrect

16 Protein MSA benchmarks
BALIBASE published in 1999
  Started "benchmark war"
  CLUSTALW has ~40k citations
PREFAB
  Pair-wise structure alignments
SABmark
OXBENCH
  Multiple structure alignments

17 Nucleotide MSA benchmark
BRALIBASE, based on solved RNA structures
Only credible nucleotide benchmark
Too "easy", hard to discriminate methods

18 MSA methods
CLUSTALW (1994)
  Still most widely used
  Newer methods definitively better
  MUSCLE (2004), MAFFT (2004) faster and more accurate
T-COFFEE (1999)
  Pioneered consistency
  Now PROBCONS (2004) is faster and more accurate
PROBCONS
  Most accurate, but MUSCLE & MAFFT better scaling

19 Diminishing returns
Last few years
  Many new methods
  Claims: ~2% better on benchmarks
Validation problems, especially BALIBASE
Method A better than B only on average
  CLUSTALW ≥ PROBCONS on 1/3 of sets
Is 2% real or significant in practice?
  IMO not proven, dubious

20 PRANK
Published in Science

21 PRANK
PRANK less accurate than CLUSTALW

22 Significant advances in MSA since 2004

23 BALIBASE v3
90% aligned by sequence, not structure!
  Some sets have zero or one structure
  Aligned by sequence only
  Not independent of sequence methods
  Comparing my sequence method against theirs
Many structures have unclear homology
  So gold standard sequence alignment not possible
  Many regions are definitively not homologous

24 BALIBASE v3
Structures unknown except for SH2
Same domain in many sets, not independent (p-values)
Grossly violates global alignment assumptions
Some published validations use full-length sequences
  Complete nonsense!
Edgar RC (2010), Quality measures for protein alignment benchmarks, NAR 2010 Apr;38(7):

25 BALIBASE v3
Edgar RC (2010), Quality measures for protein alignment benchmarks, NAR 2010 Apr;38(7):

26 Pair-wise or multiple?
Multiple pros
  Can be more accurate
  Pre-computed alignment saves expensive step
Multiple cons
  Can be less accurate
  Accuracy degrades with number of sequences
  Each new sequence adds ~Nε errors for some ε, total N²ε
  Exponentially more difficult to reconcile diverged regions
  Does not scale to very large datasets
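One way to read the error estimate on this slide, assuming the k-th sequence added to the alignment contributes about kε new errors (an assumption made here for illustration; the slide states only the per-sequence and total figures):

```latex
\[
E(N) \;\approx\; \sum_{k=1}^{N} k\,\varepsilon
     \;=\; \frac{N(N+1)}{2}\,\varepsilon
     \;=\; O\!\left(N^{2}\varepsilon\right)
\]
```

Under that reading the total grows quadratically in the number of sequences N, which matches the N²ε total up to a constant factor.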

27 When / why is multiple better?
Accuracy decreases with distance
Pair-wise: A, C
Multiple: A, B, C
Transitive alignment
  Intermediate sequences
  Only if highly variable rates
(Figure labels: B, A, C)
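The idea of transitive alignment through an intermediate sequence can be sketched as a composition of residue-to-residue maps. The functions and toy sequences below are hypothetical and written for this example: an A-to-B alignment and a B-to-C alignment are each reduced to a map of aligned positions, and A is paired with C wherever both steps are defined.

```python
def aligned_pairs(row_x, row_y, gap="-"):
    """Map residue index in x to residue index in y for gap-free columns."""
    pairs = {}
    i = j = 0
    for cx, cy in zip(row_x, row_y):
        if cx != gap and cy != gap:
            pairs[i] = j
        if cx != gap:
            i += 1
        if cy != gap:
            j += 1
    return pairs


def transitive_pairs(a_to_b, b_to_c):
    """Compose A->B and B->C maps; residues lost at either step are dropped."""
    return {i: b_to_c[j] for i, j in a_to_b.items() if j in b_to_c}


a_to_b = aligned_pairs("ACGT", "A-GT")   # A = ACGT, B = AGT
b_to_c = aligned_pairs("AGT", "AG-")     # B = AGT,  C = AG
a_to_c = transitive_pairs(a_to_b, b_to_c)
```

The composed map can only be as good as its weaker step, which is consistent with the slide's caveat that the intermediate helps only in some cases.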

28 USEARCH
Pair-wise alignments on the fly
Scales to very large databases
New paradigm in database search
  Edgar (2010), Search and clustering orders of magnitude faster than BLAST, Bioinformatics.
Fast heuristics identify top candidate hits
  Finds top hit, or top few hits
  Often 10s s x faster than BLAST
Heuristic local or global alignments
  BLAST-like algorithm for alignment
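The slide does not spell out the heuristics, so the following is only a generic sketch of how candidate hits can be ranked cheaply before any alignment is computed: count the k-mers a query shares with each database sequence and keep the best few. The function names, the value of k and the toy database are assumptions for illustration, and this should not be read as the USEARCH implementation.

```python
def kmers(seq, k):
    """Set of distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def top_candidates(query, database, k=8, max_hits=3):
    """Rank database entries by shared k-mers with the query; drop zero-share entries."""
    query_words = kmers(query, k)
    scored = []
    for name, seq in database.items():
        shared = len(query_words & kmers(seq, k))
        if shared > 0:
            scored.append((shared, name))
    # Most shared words first; ties broken by name for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:max_hits]]


toy_db = {"x": "ACGTACGT", "y": "TTTTTTTT", "z": "ACGTTTTT"}
candidates = top_candidates("ACGTAC", toy_db, k=3)
```

Only the returned candidates would then be aligned pair-wise, which is what keeps the expensive step off most of the database.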

29 drive5.com/usearch

30 Pair-wise vs. multiple
Regions: Conserved, Hyper-variable, Conserved
NAST
ACATGCAAGTCGAACGCTGAAGC-CCAGCTTGCTGGGTGG-AT GAGTGGCGAACGGGTGAGTAA
ACATGCAAGTCGAACG AAG CATCTTCGGATGCTTAGTGGCGAACGGGTGAGTAA
4 gaps, 2/6 identities
MUSCLE
ACATGCAAGTCGAACGCTG-AAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA
ACATGCAAGTCGAACGAAGCAT---CTTCGGAT----GCTTAG--TGGCGAACGGGTGAGTAA
4 gaps, 4/17 identities
USEARCH
ACATGCAAGTCGAACG AAGCATCTTCGGATGCTTAGTGGCGAACGGGTGAGTAA
ACATGCAAGTCGAACGCTGAAGCCCAGCTTGCTGGGTGGATGAGTGGCGAACGGGTGAGTAA
1 gap, 9/18 identities
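The gap and identity figures quoted for each method can be computed from a pair of aligned rows along these lines. This is a minimal sketch that assumes gaps are written as "-" and counts a run of consecutive gaps in one row as a single gap; the slide's identity fractions appear to refer to the hyper-variable region only, so the rows passed in would need to be trimmed to that region to compare.

```python
def alignment_stats(row_x, row_y, gap="-"):
    """Return (identities, aligned_columns, gap_runs) for two aligned rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    identities = aligned = gap_runs = 0
    prev_gap_row = None  # which row, if any, had a gap in the previous column
    for cx, cy in zip(row_x, row_y):
        gap_row = "x" if cx == gap else "y" if cy == gap else None
        if gap_row is None:
            aligned += 1
            identities += int(cx == cy)
        elif gap_row != prev_gap_row:
            gap_runs += 1  # a new run starts in this row
        prev_gap_row = gap_row
    return identities, aligned, gap_runs


toy = alignment_stats("AC--GT", "ACTTGA")   # (3, 4, 1)
```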

31 USEARCH speed and accuracy
RFAM test
  RFAM database has ~200,000 RNAs
  Classified into ~1,400 families
  Extract 1,000 to use as query
  Remainder is search database
True positive if hit in same family
False positive if hit in different family
  Families may in fact be distantly related
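The test protocol described on this slide can be sketched as follows. The function names, the random seed and the data layout (a dict from sequence id to family, and a dict from query id to its top hit) are assumptions made for this example; the actual benchmark scripts are not shown in the talk.

```python
import random


def split_queries(ids, n_queries, seed=0):
    """Hold out n_queries ids as queries; the remainder is the search database."""
    rng = random.Random(seed)
    queries = set(rng.sample(sorted(ids), n_queries))
    database = set(ids) - queries
    return queries, database


def score_top_hits(top_hit, family_of):
    """Count true positives, false positives and queries with no hit."""
    tp = fp = no_hit = 0
    for query, target in top_hit.items():
        if target is None:
            no_hit += 1
        elif family_of[query] == family_of[target]:
            tp += 1  # hit in the same family
        else:
            fp += 1  # hit in a different family
    return tp, fp, no_hit


family_of = {"q1": "F1", "q2": "F2", "q3": "F1", "t1": "F1", "t2": "F2"}
counts = score_top_hits({"q1": "t1", "q2": "t1", "q3": None}, family_of)
```

Because families may in fact be distantly related, a count of false positives from this scheme is an upper bound on true errors rather than an exact figure.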

32 Benchmarks at drive5.com

33 RFAM results

34 RFAM results

35 RFAM results

