
Genome Annotation. Md. Imtiyaz Hassan, Ph.D.


1 Genome Annotation Md. Imtiyaz Hassan, Ph.D.

2 Genome Sequencing (http://www.genomesonline.org/)
Genome projects, as of:               6/25/04   7/25/05   7/20/07   6/29/08
Genome projects (total)                  1128      1496      2680      3825
Complete                                  199       274       579       827
  of which eukaryotes                      28        36        49        94
Prokaryotic genomes in progress           508       728      1285      1932
Eukaryotic genomes in progress            421       494       721       936
Genome sizes:
Nanoarchaeum equitans (archaebacterium; small)        500 kb
Bacillus anthracis (anthrax)                        5,228 kb
S. cerevisiae (yeast)                              12,069 kb
Arabidopsis thaliana                              115,428 kb
Drosophila melanogaster (fruit fly)               137,000 kb
Anopheles gambiae (malaria mosquito)              278,000 kb
Oryza sativa (rice)                               420,000 kb
Mus musculus (mouse)                            2,493,000 kb
Homo sapiens (human)                            2,900,000 kb

3 Genome Annotation
Annotation is the process of interpreting raw sequence data into useful biological information. Annotations describe the genome and transform raw genome sequences into biological information by integrating computational analyses, other biological data and biological expertise.
– Old days: one gene done by one lab = lots of information.
– Now: many genes = superficial and incomplete information about many genes.
– Features could be repeats, genes, promoters, protein domains, and so on.
– Features can be linked to other databases, e.g. Pfam or PubMed.

4 Genome sequencing helps in:
– identifying new genes (“gene discovery”)
– looking at chromosome organization and structure
– finding gene regulatory sequences
– comparative genomics
These in turn lead to advances in:
– medicine
– agriculture
– biotechnology
– understanding evolution and other basic science questions

5 Because of the vast amounts of data that are generated, we need new approaches:
– high-throughput assays
– robotics
– high-speed computing
– statistics
– bioinformatics

6 Functional genomic elements being identified by the ENCODE pilot phase. Source: The ENCODE Project Consortium, Science 306, 636-640 (2004). Published by AAAS.

7 Annotation of eukaryotic genomes
[Figure: genomic DNA → (transcription) → unprocessed RNA → (RNA processing) → mature mRNA (poly-A tail) → (translation) → nascent polypeptide → (folding) → active enzyme → function (Reactant A → Product B). Annotation approaches shown: ab initio gene prediction, comparative gene prediction, functional identification.]

8 How many genes?
– Consortium: 35,000 genes?
– Celera: 30,000 genes?
– Affymetrix: 60,000 human genes on GeneChips?
– Incyte and HGS: over 120,000 genes?
– GenBank: 49,000 unique gene coding sequences?
– UniGene: > 89,000 clusters of unique ESTs?

9 Current consensus (in flux …)
– 15,000 known genes (similarity to previously isolated genes and expressed sequences from a large variety of different organisms)
– 17,000 predicted (GenScan, GeneFinder, GRAIL)
– Based on and limited to previous knowledge

10 The Annotation Process: DNA sequence → analysis software → annotator → useful information

11 A Common Mistake! Protein sequence → annotator → BLAST → function

12 Protein Families, Motifs & Domains
– BLAST and FASTA
– Sequence alignment
– Domains
– Prosite
– Pfam/HMMs
– SignalP/TMHMM

13 BLAST: Local Alignment
A local alignment suggests the presence of a common domain between two proteins. However, common domains can be conserved between proteins with very different functions, e.g. ATP binding is common to many proteins.
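To make "local alignment" concrete, here is a minimal Python sketch of the Smith-Waterman dynamic programme, which finds the best-scoring matching region between two sequences. It is not how BLAST is implemented (BLAST uses heuristics and substitution matrices); the scoring values (match +2, mismatch -1, gap -2) and the two toy protein sequences are assumptions chosen only for illustration.

```python
def local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which is what makes it local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two invented proteins that share only an ATP-binding-like stretch.
print(local_alignment("MSLQRGAPGSGKTTLAWEV", "PPYHDGAPGSGKTFFCCNM"))
# -> (16, 'GAPGSGKT', 'GAPGSGKT')
```

The two toy sequences are unrelated outside the shared stretch, yet the local alignment still reports a strong hit, which is exactly why a shared domain alone does not establish a shared function.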

14 BLAST/FASTA
– FASTA is a global alignment tool.
– BLAST is local.
– BLAST → FASTA: reduces sensitivity, increases specificity.

15 Using FASTA: Global Alignment
Annotation gained from homology hits is only as good as the annotation you are transferring, e.g. there are two different genes called ESAG2 in SWALL. Small changes in “your gene” might confer functional differences.

16 FASTA, 10^-5: low-scoring hits can give good alignments

17 10^-8: high-scoring hits can give poor alignments
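Because the score alone can mislead in both directions, one option is to look at how much of the query the alignment covers and how similar it is before transferring any annotation. The sketch below is an assumption-laden illustration: the `Hit` record, the category labels and all three thresholds are hypothetical, not values recommended by the slides. The example numbers are the FASTA hit quoted later in the deck for PfLtest.01 (E() 7.1e-21, 44.388% identity in 196 aa, query length 227 aa).

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject: str
    evalue: float
    identity_pct: float  # percent identity over the aligned region
    aligned_len: int     # number of query residues in the alignment
    query_len: int


def coverage(hit):
    """Fraction of the query covered by the alignment."""
    return hit.aligned_len / hit.query_len


def classify(hit, max_evalue=1e-5, min_identity=30.0, min_coverage=0.7):
    """Rough triage of a hit; every threshold here is an illustrative assumption."""
    if hit.evalue > max_evalue:
        return "weak: inspect alignment by eye"
    if coverage(hit) < min_coverage:
        return "partial: possible shared domain only"
    if hit.identity_pct < min_identity:
        return "distant: treat as putative"
    return "full-length: candidate for transfer"


hit = Hit("TR:O97242", 7.1e-21, 44.388, 196, 227)
print(round(coverage(hit), 2), classify(hit))
```

A hit with an excellent E-value but low coverage falls into the "partial" class, which corresponds to the case where a high-scoring hit gives a poor alignment.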


20 The big problem with searching public databases is… there is a need to reduce the number of sequences we search and to prevent bad annotation from spreading.

21 Protein Families, Motifs & Domains
– Proteins with common functions have some common features.
– Domains and motifs come from conserved residues.
– Families can be grouped, and profiles and HMMs derived.
– There is more to life than BLAST.

22 Sequence Alignment
Sequence alignments allow us to see which residues are important to a family of proteins. This lets us make motifs, profiles, fingerprints and HMMs to define families.
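As a small illustration of how an alignment exposes the important residues, the following Python sketch measures per-column conservation and writes a consensus in which variable positions become `x`. The four aligned sequences are invented toy data, and the function names and the conservation threshold are assumptions for this example only.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column, return (most common residue, fraction of rows carrying it)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result


def consensus(alignment, threshold=1.0):
    """Keep residues conserved in at least `threshold` of the rows; mask the rest as 'x'."""
    return "".join(
        residue if fraction >= threshold and residue != "-" else "x"
        for residue, fraction in column_conservation(alignment)
    )


toy_alignment = ["GAPGSGKT", "GSPGAGKS", "GQSGVGKT", "GPPGTGKS"]
print(consensus(toy_alignment))        # -> GxxGxGKx
print(consensus(toy_alignment, 0.75))  # -> GxPGxGKx
```

The fully conserved columns are the natural starting point for a motif; lowering the threshold admits positions that are conserved in most, but not all, family members.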

23 Domains
A domain is a functional part of a protein. It may contain amino acid sequence motifs that can be used to identify it. A set of more than one motif is known as a fingerprint.

24 Domains: fingerprints, blocks, domain alignment, Prosite motifs, Pfam (HMMs)

25 Overview Profile DB (1)
Identifying functional motifs and structural domains by comparing sequences against the PROSITE, BLOCKS, SMART, Pfam, CDD, ProDom, TrEMBL and InterPro databases.
– Prosite patterns: http://www.expasy.ch/prosite/
– Prosite profiles
– Pfam, a database of HMMs for domains and families: http://www.sanger.ac.uk/Software/Pfam/index.shtml
– SMART: http://smart.embl-heidelberg.de/
– Prints
– TIGRFAMs
– BLOCKS
Alignment databases:
– ProDom (Protein Domain Database): http://www.toulouse.inra.fr/prodom.html
– PIR-ALN
– ProtoMap
– Domo
– ProClass

26 Overview Profile DB (2)
Integrated pattern databases:
– MetaFam
– IProClass
– InterPro
– CDD (Conserved Domain Database): http://www.ncbi.nih.gov/Structure/
– CDD Search, DART

27 Prosite http://us.expasy.org/prosite/
Maintained at the Swiss Institute of Bioinformatics. All motifs are checked for false positives and fine-tuned. Sometimes a family can be defined by more than one expression.
Fingerprints and BLOCKS automatically scan proteins for a number of motifs: http://bioinf.man.ac.uk/dbbrowser/PRINTS/ http://blocks.fhcrc.org/help/
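A Prosite pattern is essentially a regular expression over amino acids, so scanning a protein for one can be sketched in a few lines of Python. The converter below handles the common syntax only (`x` for any residue, `[..]` allowed residues, `{..}` forbidden residues, `(n)` or `(n,m)` repeats, `<` and `>` terminus anchors outside brackets); the helper names are hypothetical and this is not Prosite's own scanning software. The two patterns used are the N-glycosylation site and the ATP/GTP-binding site motif A (P-loop); the protein sequences are invented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite pattern such as 'N-{P}-[ST]-{P}.' into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)


def find_motifs(pattern, protein):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(protein)]


print(prosite_to_regex("[AG]-x(4)-G-K-[ST]."))                  # -> [AG].{4}GK[ST]
print(find_motifs("[AG]-x(4)-G-K-[ST].", "MSLQRGAPGSGKTTLAWEV"))  # -> [(6, 'GAPGSGKT')]
print(find_motifs("N-{P}-[ST]-{P}.", "MKNGSAQ"))                # -> [(3, 'NGSA')]
```

Short patterns like the N-glycosylation site match by chance very often, which is one reason motif hits are treated as supporting evidence rather than proof of function.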

28 Pfam
Pfam 7.0 contains a total of 3360 families. Pfam is a database of two parts:
–Pfam A: curated
–Pfam B: automatically generated
All HMMs have a seed alignment, which is added to using the HMMER package.
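The idea of deriving a model from a seed alignment and then scoring new sequences against it can be sketched with a much simpler stand-in for a profile HMM: a position-specific log-odds matrix with no insert or delete states. This is not what HMMER computes; the seed sequences, the uniform background frequencies and the pseudocount are all assumptions made for illustration, and the seed is assumed to be ungapped and to use only the 20 standard residues.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seed, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*seed):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_window(profile, window):
    """Sum the column scores for a window of the same width as the profile."""
    return sum(column[aa] for column, aa in zip(profile, window))


def best_hit(profile, sequence):
    """Return (score, 0-based start) of the best-scoring window in the sequence."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


seed = ["GAPGSGKT", "GSPGAGKS", "GQSGVGKT", "GPPGTGKS"]  # invented seed alignment
profile = build_profile(seed)
print(best_hit(profile, "MSLQRGAPGSGKTTLAWEV"))
```

Adding newly found members back into the alignment and rebuilding the profile is the same iterate-and-extend idea the slide describes for Pfam seed alignments, in a much reduced form.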

29 Pfam http://www.sanger.ac.uk/Software/Pfam// http://pfam.wustl.edu/

30 Interpro curation http://www.ebi.ac.uk/interpro/

31 Gene Ontology http://www.geneontology.org/

32 TMHMM http://www.cbs.dtu.dk/services/TMHMM/
Transmembrane domains: membrane-bound proteins
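TMHMM itself is a hidden Markov model, but the underlying signal, a run of hydrophobic residues long enough to span a membrane, can be illustrated with a much cruder sliding-window average of Kyte-Doolittle hydropathy values. The sketch below is that simpler technique, not TMHMM; the window of 19 residues and the cutoff of 1.6 are commonly quoted values that are treated here as assumptions, and the test sequence is invented. Non-standard residue codes are not handled.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of the given width (empty if too short)."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """1-based start positions of windows hydrophobic enough to be membrane-spanning."""
    return [
        i + 1
        for i, mean in enumerate(hydropathy_profile(protein, window))
        if mean >= threshold
    ]


soluble = "DEKR" * 6   # invented charged, soluble stretch
helix = "L" * 19       # invented hydrophobic stretch
print(candidate_tm_windows(soluble + helix + soluble))
```

A positive window only suggests a membrane-spanning segment; like the other tools on these slides, it is evidence for the annotator to weigh, not a functional assignment.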

33 SignalP
Signal peptides: secreted/targeted proteins
What is a signal peptide? Any protein that has to be targeted to a specific part of the cell requires a signal peptide. The signal peptide ensures that the protein is translated at the ER, where it can enter the secretory pathway. I.e., the signal peptide suggests a cellular (or extracellular) location other than the cytoplasm.

34 Using secondary databases for functional assignments
– Better, more detailed, professional annotation.
– More powerful and sensitive search methods: HMMs, profiles, weight matrices.
– Not as good coverage.

35 Protein Secondary Structure
– CATH (Class, Architecture, Topology, Homology): http://www.biochem.ucl.ac.uk/dbbrowser/cath/
– SCOP (Structural Classification of Proteins), a hierarchical database of protein folds: http://scop.mrc-lmb.cam.ac.uk/scop
– FSSP, fold classification using structure-structure alignment of proteins: http://www2.ebi.ac.uk/fssp/fssp.html
– TOPS, cartoon representation of topology showing helices and strands: http://tops.ebi.ac.uk/tops/

36 The Gene Prediction Process: DNA sequence → analysis software (Prosite, TMHMM, Pfam, SignalP, FASTA, BLAST) → annotator → functional assignments

37 Slide Break – EMBL Features

38 More… More on gene prediction Gene Finding Genome Comparison and Further Genome Analysis

39 Genome Annotation
– Genome databases
– The GenBank/EMBL file format
– Editing GenBank/EMBL files with Artemis
– The annotation process
– Common pitfalls

40 Public Databases
GenBank, EMBL and DDBJ. All databases update each other automatically.

41 EMBL and TrEMBL
Patricia Rodriguez-Tomé, Peter J. Stoehr, Graham N. Cameron and Tomas P. Flores, "The European Bioinformatics Institute (EBI) databases", Nucleic Acids Res. 24:6-13, 1996.
EMBL currently contains 14366182 entries.

42 EMBL File
Contains:
A header file containing:
–Information about the sequence
–Organism
–Authors
–References
–Comments
A feature table containing:
–Sequence features and co-ordinates

43 Header File
ID   PFMAL1P4 standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
SV   AL031747.8
XX
DT   24-SEP-1998 (Rel. 57, Created)
DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
XX
DE   Plasmodium falciparum DNA from MAL1P4
XX
KW   HTG; rifin; telomere; var; var-like hypothetical protein.
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
OC   Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
XX
RN   [1]
RA   Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
RA   Quail M., Rajandream M., Barrell B.;
RT   ;
RL   Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases.
RL   P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome
RL   Trust Genome Campus, Hinxton, Cambridge CB10 1S.
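Because every EMBL header line starts with a two-letter code followed by its content from column 6, the header is easy to read programmatically. The following Python sketch groups lines by code and pulls out a few fields; the function names are hypothetical, the sample is a shortened copy of the header on this slide, and real entries have more line types than this handles.

```python
from collections import defaultdict

SAMPLE_HEADER = """\
ID   PFMAL1P4 standard; DNA; INV; 66441 BP.
XX
AC   AL031747;
XX
DE   Plasmodium falciparum DNA from MAL1P4
XX
KW   HTG; rifin; telomere; var; var-like hypothetical protein.
XX
OS   Plasmodium falciparum (malaria parasite P. falciparum)
"""


def parse_embl_header(text):
    """Group header lines by their two-letter code; stop where the feature table starts."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code = line[:2]
        if not line.strip() or code == "XX":   # blank lines and XX spacers carry no data
            continue
        if code in ("FH", "FT", "SQ"):         # feature table or sequence: header is over
            break
        fields[code].append(line[5:].strip())
    return dict(fields)


def accession(fields):
    return fields["AC"][0].split(";")[0]


def keywords(fields):
    joined = " ".join(fields.get("KW", [])).rstrip(".")
    return [kw.strip() for kw in joined.split(";") if kw.strip()]


def sequence_length(fields):
    # The last ';'-separated item of the ID line looks like '66441 BP.'
    return int(fields["ID"][0].rstrip(".").split(";")[-1].split()[0])


header = parse_embl_header(SAMPLE_HEADER)
print(accession(header), sequence_length(header), keywords(header))
```

For real work an established parser is the safer choice; this sketch only shows how regular the flat-file layout is.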

44 EMBL File Feature Table
A feature is anything that can have a coordinate on a DNA sequence. Feature keys:
3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal, attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding, misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript, primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator, transit_peptide, tRNA, unsure, V_region, V_segment, variation

45 Feature qualifiers
Additional information about a feature:
/allele="text"
/citation=[number]
/codon=(seq:"text",aa: )
/codon_start=
/db_xref=" : "
/EC_number="text"
/evidence=
/exception="text"
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/number=unquoted
/product="text"
/protein_id=" "
/pseudo
/standard_name="text"
/translation="text"
/transl_except=(pos:,aa: )
/transl_table
/usedin=accnum:feature_label

46 Features

47 Annotation in Artemis
FT   CDS             732..1415
FT                   /db_xref="IPR002038"
FT                   /gene="PfLtest.01"
FT                   /label=PfLtest.01
FT                   /note="PfLtest.01. len=227aa. Asp-rich protein.Predicted
FT                   by Genefinder, Phat and GlimmerM. Similar to Plasmodium
FT                   falciparum hypothetical 132.2 kDa protein TR:O97242
FT                   (EMBL:AL034558) (1114 aa) fasta scores: E(): 7.1e-21,
FT                   44.388% id in 196 aa."
FT                   /product="Asp-rich hypothetical protein"
FT                   /colour=10
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00001.out"
FT   misc_feature    complement(1855..1871)
FT                   /fasta_file="fasta/TEST100.tab.seq.00105.out"
FT   CDS             3151..4821
FT                   /gene="PfLtest.02"
FT                   /label=PfLtest.02
FT                   /note="PfLtest.02. len=556aa. Predicted by Genefinder,
FT                   Phat and GlimmerM. Unknown hypothetical protein"
FT                   /product="unknown hypothetical protein"
FT                   /colour=8
FT                   /fasta_file="fasta/sanger_100kb.embl.seq.00002.out"
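The feature-table lines above follow a fixed layout: `FT`, then a feature key in column 6 with its location, then indented `/qualifier=value` lines that may continue over several lines. A minimal Python sketch of a reader for that layout is shown below. It is not how Artemis parses files; the function name is hypothetical, only simple `start..end` and `complement(start..end)` locations are handled (no joins), and the sample is a shortened copy of the slide's entries.

```python
import re

SAMPLE_FT = """\
FT   CDS             3151..4821
FT                   /gene="PfLtest.02"
FT                   /note="PfLtest.02. len=556aa. Predicted by Genefinder,
FT                   Phat and GlimmerM. Unknown hypothetical protein"
FT                   /product="unknown hypothetical protein"
FT   misc_feature    complement(1855..1871)
"""


def parse_feature_table(text):
    """Return a list of features with key, start, end, strand and qualifiers."""
    features = []
    last_qualifier = None
    for line in text.splitlines():
        if not line.startswith("FT") or not line[2:].strip():
            continue
        stripped = line[2:].strip()
        if len(line) > 5 and line[5] != " ":
            # A feature key sits in column 6; the location follows it.
            key, location = stripped.split(None, 1)
            span = re.search(r"(\d+)\.\.(\d+)", location)
            if span is None:
                raise ValueError("unsupported location: " + location)
            features.append({
                "key": key,
                "start": int(span.group(1)),
                "end": int(span.group(2)),
                "strand": -1 if location.startswith("complement(") else 1,
                "qualifiers": {},
            })
            last_qualifier = None
        elif stripped.startswith("/"):
            name, _, value = stripped[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
            last_qualifier = name
        elif last_qualifier is not None:
            # Continuation line of a multi-line qualifier value.
            qualifiers = features[-1]["qualifiers"]
            qualifiers[last_qualifier] = (qualifiers[last_qualifier] + " " + stripped).rstrip('"')
    return features


for feature in parse_feature_table(SAMPLE_FT):
    print(feature["key"], feature["start"], feature["end"], feature["strand"])
```

The coordinates are consistent with the note text: 3151..4821 is 1671 bases, i.e. 557 codons, which is the stated 556 amino acids plus a stop codon.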

48 CDS features CDS stands for coding sequence and is used to denote genes and pseudogenes. These features are automatically translated on submission and the protein added to the protein databases.
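What "automatically translated" amounts to can be sketched in Python: take the CDS coordinates, reverse-complement if the feature is on the complementary strand, and read codons with a genetic code table. The sketch assumes the standard genetic code and a simple, unspliced CDS with 1-based inclusive coordinates; the function names and the toy DNA are invented, and the databases' own translation handles many cases this does not (for example alternative translation tables and exceptions).

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate_cds(dna, start, end, strand=1):
    """Translate the CDS at 1-based inclusive coordinates; unknown codons become 'X'."""
    cds = dna[start - 1:end].upper()
    if strand == -1:
        cds = reverse_complement(cds)
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    protein = "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3))
    # Drop the terminal stop codon, as in a deposited protein sequence.
    return protein[:-1] if protein.endswith("*") else protein


print(translate_cds("CCATGGCTAAATGAGG", 3, 14))      # -> MAK
print(translate_cds("TCATTTAGCCAT", 1, 12, strand=-1))  # -> MAK
```

Because the translation is derived mechanically from the coordinates, a wrong CDS boundary in the feature table turns directly into a wrong protein in the protein databases.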

49 /note
The note field contains all the evidence for a gene call, plus anything else:
–Similarity (fasta or blast)
–Domain/motif information (pfam, tmhmm etc)
–Unusual features (repeats, aa richness)

50 /product
The name of the gene product, e.g. alcohol dehydrogenase.
Unless there is proof we must qualify:
–Putative
–Possible
Always be conservative, e.g. "putative dehydrogenase" or "dehydrogenase-like protein".
This is the only piece of annotation added to the protein databases.

51 Naming protocols
– Hypothetical protein: unknown function and no homology.
– Conserved hypothetical protein: unknown function WITH homology.
– Alcohol dehydrogenase-like: looks a bit like it, but may not be.
– Putative alcohol dehydrogenase: probably an alcohol dehydrogenase.
– Alcohol dehydrogenase: this has previously been characterised and shown to be alcohol dehydrogenase in this organism.
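The naming protocol reads naturally as a small decision procedure, sketched here in Python. The function name, its parameters and the three evidence labels ("weak", "strong", "experimental") are assumptions introduced for this example; how an annotator decides which label applies is exactly the judgement the slides ask for and is not encoded here.

```python
def product_name(has_homology, hit_function=None, evidence="weak"):
    """Choose a /product name following the conservative naming protocol.

    has_homology: whether any database homologue was found.
    hit_function: function of the homologue, or None if it is itself uncharacterised.
    evidence:     'weak', 'strong' or 'experimental' (characterised in this organism).
    """
    if not has_homology:
        return "hypothetical protein"
    if hit_function is None:
        return "conserved hypothetical protein"
    if evidence == "experimental":
        return hit_function
    if evidence == "strong":
        return "putative " + hit_function
    if evidence == "weak":
        return hit_function + "-like protein"
    raise ValueError("unknown evidence level: " + evidence)


print(product_name(False))
print(product_name(True))
print(product_name(True, "alcohol dehydrogenase", "weak"))
print(product_name(True, "alcohol dehydrogenase", "strong"))
print(product_name(True, "alcohol dehydrogenase", "experimental"))
```

Defaulting to the weakest evidence level keeps the function conservative: an unqualified name is only returned when the caller explicitly states that the protein has been characterised in this organism.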

52 /gene
The gene name, e.g. ADH1.
– Only transfer a gene name if it is meaningful.
– Never transfer a gene name like PfB0024.
– Is it a gene family? Make sure two genes have the same name.

53 Transitive Annotation
Also known as annotation catastrophe. Junk in = junk out: mis-annotations spread through incorrect database submissions.


