Published by Helena Webster. Modified over 9 years ago.
1
Automatic methods for functional annotation of sequences Petri Törönen
2
What, Why, How?
Functional annotation of a sequence (seq.)
– Defining the description line
– Mapping the seq. to functional categories
Simple solutions are error-prone
The exercises review some of the available tools
3
Old, simple way
Run a Sequence Search (SS), such as BLAST, with your sequence
Find the best match
Transfer all the information from the best match to your sequence
Is everything done? Finished?
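The best-match transfer described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Hit` record, the hit names, the scores and the `max_evalue` cutoff are all hypothetical and are not taken from BLAST or from any tool in this presentation.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """One row of a sequence search result (hypothetical, simplified)."""
    subject_id: str
    evalue: float
    bitscore: float
    description: str


def best_hit_annotation(hits: Sequence[Hit], max_evalue: float = 1e-5) -> Optional[str]:
    """Return the description line of the best significant hit, or None."""
    # Keep only hits that pass the (assumed) significance cutoff.
    significant = [h for h in hits if h.evalue <= max_evalue]
    if not significant:
        return None  # no significant matches found
    # "Best match" is taken here to mean the highest bit score.
    best = max(significant, key=lambda h: h.bitscore)
    return best.description


# Made-up result list: the top hit carries no useful information.
hits = [
    Hit("hit_A", 1e-80, 310.0, "hypothetical protein"),
    Hit("hit_B", 1e-75, 295.0, "serine/threonine protein kinase"),
    Hit("hit_C", 1e-70, 280.0, "serine/threonine protein kinase"),
]
print(best_hit_annotation(hits))
```

With this made-up list the transferred description is the uninformative one from the top hit, although the next two hits agree on a more useful description.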
4
Problems
The first hit is an unknown seq.
The first hit is a misannotated seq. – an increasing problem!
No significant matches are found
Strong but only local matches => impurities in the search
Impurities in the query seq.
5
Why is manual analysis hard?
Large size of the gene lists (SS result list)
False positives among the observed results
6
Why is manual analysis hard?
Each gene can have multiple functions
- the important common theme among the genes can easily go unnoticed
Requires detailed knowledge of the genes
Varying representations of the same function in description lines
Objectivity
7
Gene Ontology (GO)
A controlled vocabulary of the roles of gene products in cells and of the associations between those roles
The roles can be applied to all organisms
Three main hierarchies: biological process, cellular component and molecular function
Currently includes about 19,000 classes (= roles)
- usually only a small portion of these classes is in use for one organism (example: chloroplast-related functions are important only in plants)
www.geneontology.org
8
Structure of GO
GO graph: a hierarchical structure of linked nodes
- each node represents one class that is part of its parent class
Directed Acyclic Graph (DAG)
- a tree-like structure where branches can also merge when going from parent nodes to child nodes
Genes can be linked to many classes in the GO structure
[Figure labels: starting node (root of the hierarchical structure); more detailed classes; less detailed classes]
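Because branches can merge, a class may have more than one parent, and a gene linked to a detailed class also belongs to every less detailed class above it. The sketch below shows that idea on a toy graph. The class names (`A` to `D`) and the `PARENTS` table are invented for illustration and are not real GO classes.

```python
# Toy DAG: each class maps to the set of its direct parent classes.
# Class "C" has two parents, which a plain tree would not allow.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A", "B"},
    "D": {"C"},
}


def ancestors(cls, parents):
    """Return every class reachable by walking from cls toward the root."""
    seen = set()
    stack = list(parents[cls])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(direct_classes, parents):
    """Expand a gene's direct classes with all of their ancestor classes."""
    expanded = set(direct_classes)
    for cls in direct_classes:
        expanded |= ancestors(cls, parents)
    return expanded


# A gene annotated only to the detailed class "D" ...
print(sorted(propagate({"D"}, PARENTS)))
```

Counting class members for over-representation usually relies on this kind of propagation, so that a less detailed class also counts the genes annotated to its more detailed descendants.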
9
How GO helps
GO provides a terminology for presenting the known information about a gene
GO classifies genes according to their known/predicted functions
Classes represent varying levels of detail
The classifications can be used to find over-represented functions in the results
10
How GO helps
Look for over-represented GO classes in the gene list
We would like to ask: what is the probability of observing, by chance, the number of class members that we see in the cluster?
The solution from statistics is sampling without replacement
Sampling without replacement answers questions such as: in how many ways can 8 balls be selected from the whole data so that two of them are white and the rest are black?
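Sampling without replacement leads to the hypergeometric distribution, and the ball example can be worked through directly. The sketch below uses only the Python standard library. The urn size (20 balls, 5 of them white) is an assumed number chosen for illustration; the slide itself gives only the draw of 8 balls with 2 white.

```python
from math import comb


def ways(white_total, black_total, white_drawn, n_drawn):
    """Number of ways to draw n_drawn balls with exactly white_drawn white."""
    return comb(white_total, white_drawn) * comb(black_total, n_drawn - white_drawn)


def hypergeom_pmf(N, K, n, k):
    """P(exactly k class members in a sample of n), given K members among N."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def overrep_pvalue(N, K, n, k):
    """P(k or more class members in the sample) under random sampling."""
    upper = min(K, n)
    return sum(hypergeom_pmf(N, K, n, i) for i in range(k, upper + 1))


# Assumed urn: 20 balls in total, 5 white and 15 black; draw 8, observe 2 white.
N, K, n, k = 20, 5, 8, 2
print(ways(K, N - K, k, n))       # number of favourable selections
print(hypergeom_pmf(N, K, n, k))  # probability of exactly 2 white
print(overrep_pvalue(N, K, n, k)) # probability of 2 or more white
```

For a gene list, `N` would be the number of genes in the whole data, `K` the number of genes in the GO class, `n` the size of the gene list and `k` the number of list members in the class. A small `overrep_pvalue` then suggests the class is over-represented.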
11
Methods that predict protein function
Methods that summarize the SS result list
Methods that use profile searches
Methods that use sequence features
Methods based on sequence patterns
Methods based on sequence phylogeny
12
SS list summarization
Consensus analysis of the SS list
- Do the SS
- Look for repeatedly occurring descriptions / GO classes
- Over-representation of GO classes (BLAST2GO)
Tools performing this:
- Our method PANNZER (Koskinen et al., unpubl.)
- BLAST2GO (http://www.blast2go.org/start_blast2go)
- ConFunc
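A very simple form of consensus analysis is to count how often each GO class occurs among the hits and keep the classes shared by a large enough fraction of them. The sketch below is a generic illustration only. It is not the scoring used by PANNZER, BLAST2GO or ConFunc, and the `min_fraction` threshold and the example annotations are assumptions.

```python
from collections import Counter


def consensus_classes(hit_annotations, min_fraction=0.5):
    """Return classes found in at least min_fraction of the hits.

    hit_annotations: one collection of class names per hit in the SS list.
    The result is ordered by decreasing count, then alphabetically.
    """
    n_hits = len(hit_annotations)
    if n_hits == 0:
        return []
    # Count each class at most once per hit.
    counts = Counter(c for classes in hit_annotations for c in set(classes))
    kept = [c for c, n in counts.items() if n / n_hits >= min_fraction]
    return sorted(kept, key=lambda c: (-counts[c], c))


# Made-up annotations for four hits of one query.
hit_annotations = [
    {"kinase activity", "ATP binding"},
    {"kinase activity"},
    {"ATP binding", "kinase activity"},
    {"membrane"},
]
print(consensus_classes(hit_annotations))
```

Unlike the best-match transfer, a single unknown or misannotated hit does not decide the outcome here, because a class must recur across several hits to be reported.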
13
Profile search methods
Use profile searches instead of SS
Some positions in the seq. are more conserved than others
PFAM http://pfam.sanger.ac.uk/
ConFunc http://www.sbg.bio.ic.ac.uk/~confunc/
14
ConFunc in detail
Run a BLAST search with the query seq.
Obtain a result list
The seqs in the result list are clustered into groups of seqs with similar function (same GO classes)
Each cluster is used as a seed for a profile search
Test how well the query seq. matches each profile
Use the link: http://www.sbg.bio.ic.ac.uk/confunc/indextemp.cgi
15
Sequence feature methods
Look for sequence features
Features: secondary structure, protein domains
Compare sequences by checking which features they have in common
Methods that do this: FACT http://www.cibiv.at/FACT/
FACT offers limited search possibilities
16
Sequence pattern methods
Pattern => a frequently observed short motif from a seq. DB
InterProScan
BioDictionary from IBM Computational Biology (http://cbcsrv.watson.ibm.com/Tpa.html)
– Extraction of most of the patterns from Swiss-Prot
– Linking of each pattern to the keywords seen in the seqs where the pattern occurred
– The query seq. is linked to keywords via the patterns it contains
17
Phylogeny based methods
In short: include the species tree in the annotation of the sequences
Evolutionary distance is taken into account
Compara from ENSEMBL http://www.ebi.ac.uk/GOA/compara_go_annotations.html
18
Tip for testing the tools
For testing with a purely random sequence: http://www.bioinformatics.org/sms2/random_protein.html
For testing with a partially random sequence: http://www.bioinformatics.org/sms2/mutate_protein.html
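If the web tools are not at hand, similar test sequences can be produced locally. The sketch below is a stand-in written for illustration and is not how the linked tools work: it assumes uniform residue frequencies, substitutions only (no insertions or deletions) and a fixed random seed.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def random_protein(length, rng):
    """Purely random sequence with uniform residue frequencies (assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def mutate_protein(seq, fraction, rng):
    """Substitute a given fraction of positions with a different residue."""
    n_mut = round(len(seq) * fraction)
    positions = rng.sample(range(len(seq)), n_mut)
    residues = list(seq)
    for pos in positions:
        # Choose among the 19 residues that differ from the current one.
        alternatives = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(alternatives)
    return "".join(residues)


rng = random.Random(0)  # fixed seed so the test sequences are reproducible
original = random_protein(50, rng)
mutated = mutate_protein(original, 0.2, rng)
print(original)
print(mutated)
```

A purely random sequence should receive no confident annotation from a well-behaved tool, while a partially mutated real protein shows how quickly the predictions degrade as the similarity drops.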