Dr Tan Tin Wee, Director, Bioinformatics Centre
Basic Overview of Bioinformatics Tools and Biocomputing Applications III
More BioComputational Tools
- Phylogenetic analysis
- Multiple sequence alignment
- Profile searching
- Sensitivity, specificity and probabilities in the prediction of functions
Phylogenetic Analysis
- Assumption: evolutionary descent
- Divergence
- Phylogenetic tree
- Rooted and unrooted trees
[Figure: phylogenetic tree; labels: Species, X, Y, A, B]
Rooted and Unrooted Trees
- Rooted: the ancestral state of the evolved organism or gene is known. The tree branches at bifurcation points until it reaches the terminal branches (tips or leaves).
- Unrooted: the tree represents the branching order but does not indicate the root, i.e. the position of the last common ancestor.
Phylogenetic Inference for Genes
- Still in its infancy; an inexact science
- Computational tools are based on general mathematical and statistical principles
- Phylogenetic reconstructions may conflict with common sense
- Incorrect sequence alignments, inadequate models
- Sites within sequences evolve at different rates (unequal rate effects)
Some Algorithms
- Maximum parsimony
- Maximum likelihood
- Distance methods, e.g. UPGMA, paralinear (LogDet) distances
Software packages
- PAUP: Phylogenetic Analysis Using Parsimony
- PHYLIP: Phylogeny Inference Package
- MacClade, GAMBIT, MEGA/METREE
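As an illustration of a distance method, UPGMA can be sketched in a few lines. This is a minimal sketch, not the implementation used by any of the packages listed: the function name `upgma`, the dictionary-of-pairs input format and the example distances are all hypothetical, and branch lengths are not computed.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa with UPGMA and return the topology as nested tuples.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance between a and b
    """
    clusters = {name: 1 for name in names}   # cluster -> number of taxa in it
    d = dict(dist)                           # working copy of the distances
    while len(clusters) > 1:
        # Join the two clusters that are currently closest.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))

# Hypothetical distances for four taxa (illustration only).
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("A", "D")): 10,
    frozenset(("B", "C")): 6, frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
tree = upgma(names, dist)
print(tree)
```

With these distances A and B are joined first, then C, then D. UPGMA assumes a constant rate of evolution across lineages, which is one reason its trees can be misleading when rates are unequal.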
Limitations
- Requires inspection of sequence alignments
- Requires removal of deviant sequences from the phylogenetic inference
- Different genes analysed can produce different trees
- "Bootstrapping" for estimating statistical significance may still lead to errors in interpretation
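The resampling step behind bootstrapping can be sketched as follows. This is a minimal sketch under stated assumptions: the alignment is a list of equal-length strings, the function name `bootstrap_alignment` and the toy sequences are hypothetical, and the tree-building step that would be run on each replicate is left out.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: alignment columns resampled with replacement."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in columns) for seq in alignment]

# Toy alignment (hypothetical sequences, illustration only).
alignment = ["ACGTAC", "ACGTTC", "ACCTAC"]
rng = random.Random(0)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

A tree would then be inferred from each replicate, and the support for a grouping reported as the fraction of replicate trees that contain it. A high value shows that the data support the grouping consistently, not that the grouping is historically correct.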
Uses
- Molecular taxonomy
- 16S and 23S rRNA analysis for bacterial classification
- 18S rRNA analysis of nematodes and Drosophila
- Epidemiological analysis of strain variation, e.g. in infectious pathogens
Multiple Sequence Analysis
- Gather a set of sequences of putative similarity or homology
- Compare every pair of sequences in the set
- Build a "tree" of similarity
- Realign all sequences based on the "ancestral" sequence, padding with gaps where needed
- Used for generating "profiles"
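The pairwise comparison step can be illustrated with a global alignment score (Needleman-Wunsch). This is a minimal sketch, not the procedure of any particular program: the scoring values (+1 match, -1 mismatch, -1 gap), the function name `nw_score` and the three sequences are assumptions made for illustration.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (Needleman-Wunsch, linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

# Hypothetical sequences (illustration only).
seqs = {"s1": "GATTACA", "s2": "GATTTACA", "s3": "CATGACA"}
scores = {
    pair: nw_score(seqs[pair[0]], seqs[pair[1]])
    for pair in combinations(seqs, 2)
}
closest = max(scores, key=scores.get)  # the pair that would be joined first
print(scores, closest)
```

The table of pairwise scores is what the "tree" of similarity is built from: the most similar pair is aligned first and the remaining sequences are added in order of decreasing similarity.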
Use
- Detection of conserved and variable regions
- Inferring gene functions
- Variable segments: inferred to be dispensable to function, or to be antigenic variants
- Motifs can be used to analyse an unknown sequence and infer possible function or relatedness
- Motifs serve as a basis for annotating genome project sequences
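Detecting conserved and variable regions amounts to scanning the alignment column by column. A minimal sketch, assuming the alignment is a list of equal-length strings with "-" for gaps; the function name `conserved_columns` and the toy sequences are hypothetical.

```python
def conserved_columns(alignment):
    """Indices of columns where every sequence has the same residue and no gap."""
    return [
        i for i, col in enumerate(zip(*alignment))
        if len(set(col)) == 1 and col[0] != "-"
    ]

# Toy protein alignment (hypothetical, illustration only).
alignment = ["MKT-LLV", "MKSALLV", "MKTALIV"]
conserved = conserved_columns(alignment)
variable = [i for i in range(len(alignment[0])) if i not in conserved]
print(conserved, variable)
```

Runs of conserved columns are candidate motifs; real tools use residue similarity and statistical weighting instead of strict identity.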
Software
- CLUSTALW
- Profile software based on statistical models such as Hidden Markov Models (HMMs), e.g. HMMer, HMMPro, META-MEME, PROBE, BLOCKS
Example
- C. elegans genome project: several large gene families related by sequence homology, function unknown
- Now classified as putative G-protein coupled receptors (GPCRs)
- Task: detect significant similarity between the putative worm GPCRs and experimentally known GPCRs in other species
Process
- Select a typical unknown sequence
- BLAST search against the nr database
- Inspect hits and E-values
- Top-scoring hit: mitochondrial L11 ribosomal protein, E = 0.002 (not low enough to be trusted for annotation)
- The rest of the top scorers are all nematode-specific unknown sequences
- Compare with PSI-BLAST iterative searching at NCBI
- Is the similarity with mammalian GPCRs or with the high-scoring mt rL11 protein?
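Inspecting hits against an E-value cutoff can be sketched as a simple filter. This is a minimal sketch: the cutoff of 1e-5 is an assumed value, not one given in the slides, and every hit except the L11 entry is hypothetical.

```python
def trusted_hits(hits, max_evalue=1e-5):
    """Keep only hits whose E-value is at or below the cutoff (cutoff is an assumption)."""
    return [(name, evalue) for name, evalue in hits if evalue <= max_evalue]

hits = [
    ("mitochondrial ribosomal protein L11", 2e-3),  # E-value quoted on the slide
    ("hypothetical hit X", 1e-30),                  # made-up value for illustration
    ("hypothetical hit Y", 0.5),                    # made-up value for illustration
]
trusted = trusted_hits(hits)
print(trusted)
```

The E-value is the number of hits with at least that score expected by chance in a database of that size, so E = 0.002 is suggestive but falls short of a cutoff as strict as the one assumed here.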
Further Analysis
- Gather all nematode-specific sequences (WormPep database of non-redundant sequences)
- Discard abnormally long or short sequences
- Multiple sequence alignment using CLUSTALW
- Generate a profile of the multiple alignment using HMMer
- Use the profile to search the database again
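The step of discarding abnormally long or short sequences can be sketched as a filter around the median length. This is a minimal sketch: the 30% tolerance, the function name `filter_by_length` and the sequence names are assumptions, since the slides do not say how "abnormal" was defined.

```python
from statistics import median

def filter_by_length(seqs, tolerance=0.3):
    """Keep sequences whose length is within tolerance * median of the median length."""
    med = median(len(s) for s in seqs.values())
    return {
        name: s for name, s in seqs.items()
        if abs(len(s) - med) <= tolerance * med
    }

# Hypothetical sequences; only their lengths matter for this illustration.
seqs = {
    "seq1": "A" * 300,
    "seq2": "A" * 310,
    "seq3": "A" * 320,
    "fragment": "A" * 50,   # abnormally short
    "fusion": "A" * 900,    # abnormally long
}
kept = filter_by_length(seqs)
print(sorted(kept))
```

Removing fragments and fused or mispredicted genes before alignment keeps them from introducing long gaps that would distort the profile.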
Results
- Similarity at a significant level detected with mammalian GPCRs
- The L11 protein is found to have a very significant score, E = 5 x 10^-49
- Pitfall of PSI-BLAST: the significance of a match to the training set during iteration
- Finally, the L11 protein may be wrongly annotated, with an annotation not based on experimental results
A. Sensitivity and Specificity of a Fairly Good Test
N = 100; total real +ve = 73; total real -ve = 27

                            Known gold standard
                            +ve      -ve
Experimental test  +ve       70        2
result             -ve        3       25

- Sensitivity = 70/(70+3) = .96: able to pick up 70 of the 73 known positives; quite sensitive, low false negatives
- Specificity = 25/(2+25) = .93: picked up 25 of the 27 negatives; very specific, low false positives
B. Increase Sensitivity but Lower Specificity of a Test
N = 100; total real +ve = 73; total real -ve = 27

                            Known gold standard
                            +ve      -ve
Experimental test  +ve       72       13
result             -ve        1       14

- Sensitivity = 72/(72+1) = .99: able to pick up 72 of the 73 known positives; super sensitive, low false negatives
- Specificity = 14/(13+14) = .52: picked up 14 of the 27 negatives; not very specific, high false positives
C. Increase Specificity of a Test but Sensitivity May Drop
N = 100; total real +ve = 73; total real -ve = 27

                            Known gold standard
                            +ve      -ve
Experimental test  +ve       50        0
result             -ve       23       27

- Specificity = 27/(0+27) = 1.0: picked up 27 of the 27 negatives; completely specific
- Raising the threshold until there are zero false positives also makes the true positives drop
- Sensitivity = 50/(50+23) = .68: able to pick up only 50 of the 73 known positives; not very sensitive, high false negatives
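The figures for cases A, B and C can be checked with a short calculation. A minimal sketch; the function names and dictionary layout are illustrative, and the counts are taken from the three tables (with zero false positives in case C, as implied by 27/(0+27)).

```python
def sensitivity(tp, fn):
    """Fraction of real positives that the test picks up."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of real negatives that the test correctly rejects."""
    return tn / (tn + fp)

# tp/fp/fn/tn counts from cases A, B and C (N = 100 in each).
cases = {
    "A": {"tp": 70, "fp": 2, "fn": 3, "tn": 25},
    "B": {"tp": 72, "fp": 13, "fn": 1, "tn": 14},
    "C": {"tp": 50, "fp": 0, "fn": 23, "tn": 27},
}
results = {
    name: (round(sensitivity(c["tp"], c["fn"]), 2),
           round(specificity(c["tn"], c["fp"]), 2))
    for name, c in cases.items()
}
for name, (sens, spec) in results.items():
    print(name, "sensitivity", sens, "specificity", spec)
```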
Trade-off Involved
- If the threshold of the test is set high so that all the noise disappears, you may also miss some true positives and get many false negatives, so the test is not so sensitive (Case C)
- If the threshold of the test is set low so that you capture as many of the positives as you can, i.e. high sensitivity, non-specific false-positive hits start appearing (Case B)
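The trade-off can be made concrete by moving a threshold over a set of scored items. This is a minimal sketch: the scores and labels are made up for illustration, and the function name `rates_at` is hypothetical.

```python
def rates_at(threshold, scored):
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    tp = sum(1 for score, real in scored if real and score >= threshold)
    fn = sum(1 for score, real in scored if real and score < threshold)
    tn = sum(1 for score, real in scored if not real and score < threshold)
    fp = sum(1 for score, real in scored if not real and score >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# (score, is a real positive) pairs; made-up data for illustration only.
scored = [
    (0.9, True), (0.8, True), (0.7, True), (0.6, False),
    (0.5, True), (0.4, False), (0.3, False), (0.2, False),
]
low = rates_at(0.25, scored)    # low threshold: every positive is caught
high = rates_at(0.65, scored)   # high threshold: no false positives remain
print("low threshold:", low)
print("high threshold:", high)
```

With the low threshold, sensitivity is 1.0 but specificity falls to 0.25 (as in Case B); with the high threshold, specificity is 1.0 but sensitivity falls to 0.75 (as in Case C).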
Computational Predictions of Gene Function
- Sensitivity and specificity have similar trade-offs
- Cutoff threshold values have to be determined empirically or chosen arbitrarily, depending on the situation