Dr Tan Tin Wee Director Bioinformatics Centre

Basic Overview of Bioinformatics Tools and Biocomputing Applications III

More BioComputational Tools
Phylogenetic Analysis
Multiple Sequence Alignment
Profile Searching
Sensitivity, Specificity and Probabilities in the Prediction of Functions

Phylogenetic Analysis
Assumption: evolutionary descent
Divergence
Phylogenetic tree
Rooted and unrooted trees
(Figure: example tree with species X, Y, A and B)

Rooted and Unrooted Trees
Rooted: the ancestral state of the evolved organism or gene is known. The tree branches at bifurcation points until it reaches the terminal branches, or tips/leaves.
Unrooted: the tree represents the branching order, but does not indicate the root or the last common ancestor.

Phylogenetic inference for genes
Still in its infancy; an inexact science
Computational tools are based on general mathematical and statistical principles
Phylogenetic reconstructions may conflict with common sense
Causes: incorrect sequence alignments, inadequate models
Sites within sequences evolve at different rates (unequal rate effects)

Some algorithms
Maximum parsimony
Maximum likelihood
Distance methods, e.g. UPGMA, paralinear (LogDet) distances
Software packages:
PAUP (Phylogenetic Analysis Using Parsimony)
PHYLIP (Phylogeny Inference Package)
MacClade, GAMBIT, MEGA/METREE
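As an illustration of the distance methods named above, here is a minimal sketch of UPGMA clustering in Python. It is not taken from any of the listed packages: the function name `upgma`, the taxon labels and the distance values are all hypothetical, and branch lengths are left out for brevity.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance between labels a and b
    """
    # Each cluster is keyed by its subtree (a label or a nested tuple);
    # the value is the number of taxa it contains.
    sizes = {name: 1 for name in names}
    d = dict(dist)
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))

# Hypothetical distance matrix for four taxa (illustrative values only).
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
tree = upgma(names, dist)
print(tree)
```

With these made-up distances, A and B are joined first, then C, then D. UPGMA assumes a constant rate of change along all lineages, which is one reason its trees can mislead when rates are unequal.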

Limitations
Inspection of sequence alignments
Removal of deviant sequences from the phylogenetic inference
Different genes analysed produce different trees
"Bootstrapping" for estimating statistical significance may still lead to errors in interpretation
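The idea behind bootstrapping can be sketched as follows: alignment columns are resampled with replacement, the analysis is repeated on each replicate, and the fraction of replicates that recover a grouping is reported as its support. The Python below is a toy version only. The alignment is hypothetical, and "closest pair by p-distance" stands in for a full tree-building step.

```python
import random
from itertools import combinations

def p_distance(s, t):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(s, t)) / len(s)

def resample_columns(alignment, rng):
    """One bootstrap replicate: draw alignment columns with replacement."""
    n = len(next(iter(alignment.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

def pair_support(alignment, pair, replicates=200, seed=1):
    """Fraction of replicates in which `pair` is the closest pair of sequences."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        rep = resample_columns(alignment, rng)
        closest = min(
            combinations(rep, 2),
            key=lambda p: p_distance(rep[p[0]], rep[p[1]]),
        )
        hits += set(closest) == set(pair)
    return hits / replicates

# Hypothetical gap-free alignment (illustrative sequences only).
aln = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTTC",
    "C": "ACGAACCTTC",
    "D": "TCGAACCTTG",
}
print(pair_support(aln, ("A", "B")))
```

A high support value only says the grouping is stable under resampling of this alignment. It does not correct for a wrong alignment or an inadequate model, which is why interpretation errors remain possible.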

Uses
Molecular taxonomy
16S and 23S rRNA analysis for bacterial classification
18S rRNA analysis of nematodes and Drosophila
Epidemiological analysis of strain variation, e.g. in infectious pathogens

Multiple Sequence Analysis
Gather a set of sequences of putative similarity or homology
Pairwise comparison of every pair of sequences in the set
Build a "tree" of similarity
Realign all sequences based on the "ancestral" sequence, padding with gaps etc.
Used for generating "profiles"

Use
Detection of conserved and variable regions
Infer gene functions
Variable segments: infer that they are dispensable to function, or are antigenic variants
Motifs can be used to analyse an unknown sequence and infer possible function or relatedness
Motifs as a basis for annotation of genome project sequences
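Detecting conserved and variable regions from a multiple alignment can be sketched in a few lines of Python. The score used here (fraction of sequences sharing the most common residue in a column) is one simple choice among many, and the alignment, function names and thresholds are hypothetical.

```python
from collections import Counter

def column_conservation(alignment):
    """Per column: fraction of sequences that share the most common residue.

    Gap characters, if present, are counted like any other symbol.
    """
    scores = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores.append(counts.most_common(1)[0][1] / len(column))
    return scores

def conserved_runs(scores, threshold=1.0, min_len=3):
    """Return (start, end) pairs, end exclusive, of runs scoring >= threshold."""
    runs, start = [], None
    # The -inf sentinel closes a run that reaches the end of the alignment.
    for i, s in enumerate(scores + [float("-inf")]):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs

# Hypothetical three-sequence protein alignment (illustrative only).
aln = [
    "MKTAYDRYLG",
    "MRSAYDRYIG",
    "MKSGYDRYVG",
]
scores = column_conservation(aln)
print(conserved_runs(scores))
```

Runs of fully conserved columns are candidate motifs; columns with low scores mark the variable segments.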

Software
CLUSTALW
Profile software based on Hidden Markov Model (HMM) statistical models, e.g. HMMer, HMMPro, META-MEME, PROBE, BLOCKS

Example: C. elegans genome project
Several large gene families with sequence homology, function unknown
Now classified as putative G-protein coupled receptors (GPCRs)
Need to detect significant similarity between putative worm GPCRs and experimentally known GPCRs in other species

Process
Select a typical unknown sequence
BLAST search against the nr database
Inspect hits and E-values
Top-scoring hit: mitochondrial L11 ribosomal protein, E = 0.002 (not low enough to be trusted for annotation)
The rest of the top scorers are all nematode-specific unknown sequences
Compare with PSI-BLAST iterative searching at NCBI
Similarity with mammalian GPCRs, or with the high-scoring mt rL11 protein?

Further analysis
Gather all nematode-specific sequences
WormPep database of non-redundant sequences
Discard sequences that are abnormally long or short
Multiple sequence alignment using CLUSTALW
Generate a profile of the multiple alignment using HMMer
Use the profile to search the database again
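The step of discarding abnormally long or short sequences before alignment can be sketched as a simple length filter. The slides do not say what cutoff was used, so the rule below (keep sequences within 50% of the median length) is an assumption, and the sequence names and lengths are hypothetical.

```python
from statistics import median

def filter_by_length(seqs, tolerance=0.5):
    """Keep sequences whose length lies within +/- tolerance of the median.

    seqs: dict mapping sequence name -> sequence string
    tolerance: assumed cutoff; 0.5 means 50% to 150% of the median length
    """
    med = median(len(s) for s in seqs.values())
    lo, hi = med * (1 - tolerance), med * (1 + tolerance)
    return {name: s for name, s in seqs.items() if lo <= len(s) <= hi}

# Hypothetical family: three typical members, one fragment, one fusion.
family = {
    "seq1": "M" * 300,
    "seq2": "M" * 320,
    "seq3": "M" * 310,
    "fragment": "M" * 50,
    "fusion": "M" * 900,
}
kept = filter_by_length(family)
print(sorted(kept))
```

Removing fragments and fusions before running the alignment keeps the resulting profile from being dominated by gap columns.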

Results
Similarity at a significant level detected with mammalian GPCRs
Find that the L11 protein has a very significant high score, E = 5x10^-49
Pitfalls of PSI-BLAST: significance of match to the training set during iteration
Finally, the L11 protein may be wrongly annotated and not based on experimental results

A. Sensitivity and Specificity of a Fairly Good Test
N = 100; total real +ve = 73; total real -ve = 27
                      Known gold standard
Exptal test result    +ve     -ve
+ve                   70      2
-ve                   3       25
Specificity = 25/(2+25) = 0.93: picked up 25 of the 27 negatives; very specific, low false positives
Sensitivity = 70/(70+3) = 0.96: able to pick up 70 of the 73 known positives; quite sensitive, low false negatives

B. Increase Sensitivity but Lower Specificity of a Test
N = 100; total real +ve = 73; total real -ve = 27
                      Known gold standard
Exptal test result    +ve     -ve
+ve                   72      13
-ve                   1       14
Specificity = 14/(13+14) = 0.52: picked up 14 of the 27 negatives; not very specific, high false positives
Sensitivity = 72/(72+1) = 0.99: able to pick up 72 of the 73 known positives; super sensitive, low false negatives

C. Increase Specificity of a Test but Sensitivity May Drop
N = 100; total real +ve = 73; total real -ve = 27
                      Known gold standard
Exptal test result    +ve     -ve
+ve                   50      0
-ve                   23      27
Specificity = 27/(0+27) = 1.0: picked up 27 of the 27 negatives; completely specific
Raising the threshold until there are zero false positives makes the true positives drop
Sensitivity = 50/(50+23) = 0.68: able to pick up only 50 of the 73 known positives; not very sensitive, high false negatives
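The arithmetic in cases A, B and C can be checked with a short Python sketch. The counts are read from the three tables (true positives, false positives, false negatives, true negatives); the function and variable names are only illustrative.

```python
def sensitivity(tp, fn):
    """Fraction of real positives that the test calls positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of real negatives that the test calls negative."""
    return tn / (tn + fp)

# Counts from the three cases, each with N = 100 (73 real +ve, 27 real -ve).
cases = {
    "A": dict(tp=70, fp=2, fn=3, tn=25),
    "B": dict(tp=72, fp=13, fn=1, tn=14),
    "C": dict(tp=50, fp=0, fn=23, tn=27),
}

results = {
    name: (
        round(sensitivity(c["tp"], c["fn"]), 2),
        round(specificity(c["tn"], c["fp"]), 2),
    )
    for name, c in cases.items()
}
for name, (sens, spec) in results.items():
    print(name, "sensitivity", sens, "specificity", spec)
```

The rounded values agree with the figures quoted in each case: 0.96 and 0.93 for A, 0.99 and 0.52 for B, 0.68 and 1.0 for C.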

Trade-off involved
If the test threshold is set high, so that all the noise disappears, you may also miss some true positives and get a lot of false negatives, so the test is not very sensitive (Case C)
If the test threshold is set low, so that you capture as many of the positives as you can (high sensitivity), non-specific false positive hits start appearing (Case B)

Computational Predictions of Gene Function
Sensitivity and specificity have similar trade-offs here. Cutoff threshold values have to be empirically determined, or arbitrarily chosen, depending on the situation.
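One way to determine a cutoff empirically is to score a set of cases whose true status is known and tabulate sensitivity and specificity at several candidate thresholds. The Python below is a minimal sketch of that idea. The scores and labels are invented for illustration, and a higher score is assumed to mean a stronger hit (with E-values the direction would be reversed, since lower is better).

```python
def sweep(scored, thresholds):
    """Tabulate sensitivity and specificity at each threshold.

    scored: list of (score, is_real_positive) pairs; a hit is called
            positive when its score is >= the threshold
    Returns a list of (threshold, sensitivity, specificity) tuples.
    """
    positives = sum(1 for _, truth in scored if truth)
    negatives = len(scored) - positives
    rows = []
    for t in thresholds:
        tp = sum(1 for s, truth in scored if truth and s >= t)
        tn = sum(1 for s, truth in scored if not truth and s < t)
        rows.append((t, tp / positives, tn / negatives))
    return rows

# Hypothetical benchmark: (score, known to be a real positive?)
scored = [
    (95, True), (80, True), (72, True), (60, False),
    (55, True), (40, False), (30, False), (20, True),
]
rows = sweep(scored, thresholds=[25, 50, 75])
for t, sens, spec in rows:
    print(f"threshold={t} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the threshold in this toy data moves the result from the Case B end (sensitive, not specific) towards the Case C end (specific, not sensitive); where to stop depends on which kind of error is more costly for the annotation at hand.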
