Genome Annotation Rosana O. Babu.


1 Genome Annotation Rosana O. Babu

2 Sequence to Annotation

3 Input 1: Variant Annotation

4 Input 2: Structural Annotation
Structural annotation was conducted using AUGUSTUS (version 2.5.5) with the Magnaporthe_grisea species model. However, a model trained for Oomycetes has to be developed to obtain accurate results.

5 Input 3: Functional Annotation

6 Genome Annotation
The process of identifying the locations of genes and coding regions in a genome and determining what those genes do. It means finding the structural elements and attaching their related functions to each genome location.

7 Genome Annotation
Gene structure prediction: identifying elements (introns/exons, CDS, start, stop) in the genome.
Gene function prediction: attaching biological information to these elements, e.g. which protein an exon codes for.

8 Structural Annotation
Homology-based methods: reference (pairwise alignments).
Ab initio methods: codon usage table.

9 Eukaryote genome annotation
[Diagram: Genome -> transcription -> primary transcript -> RNA processing -> processed mRNA (m7G cap, ATG ... STOP, AAAn tail) -> translation -> polypeptide -> protein folding -> folded protein -> functional activity (enzyme activity, A -> B). Annotation steps marked alongside: find locus, find exons using transcripts, find exons using peptides, find function.]

10 Prokaryote genome annotation
[Diagram: Genome -> transcription -> primary transcript -> RNA processing -> processed RNA (several START ... STOP coding sequences) -> translation -> polypeptide -> protein folding -> folded protein -> functional activity (enzyme activity, A -> B). Annotation steps marked alongside: find locus, find CDS, find function.]

11 Genome annotation - workflow
[Workflow: genome sequence -> repeats -> masked or un-masked genome sequence -> structural annotation (gene finding: protein-coding genes, ncRNAs, introns) -> functional annotation -> viewed and released in a genome viewer.]
Notes: Describe genome annotation as a process and emphasize its ongoing nature. There is no real end point to the annotation process (only artificially defined ones), so it is best to think of any release as a 'best guess' annotation.

12 Genome Repeats & features
Repeat features named on the slide: microsatellite, minisatellite, tandem repeat, short tandem repeat, SSR.
Repeats are polymorphic between individuals/populations.
Percentage of repetitive sequences in different organisms:
Genome              Genome size (Mb)   % Repeat
Aedes aegypti       1,300              ~70
Anopheles gambiae   260                ~30
Culex pipiens       540                ~50

13 Finding repeats as a preliminary to gene prediction
Repeat discovery:
Literature and public databanks (homology-based approaches).
Automated approaches (e.g. RepeatScout or RECON).
Tandem repeats: Tandem, TRF.
Use RepeatMasker to search the genome and mask the sequence.
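
To make the idea of a tandem repeat concrete, here is a minimal Python sketch that finds perfect short tandem repeats with a regular expression. It is only an illustration and is not how TRF, RepeatScout, RECON or RepeatMasker work; the function name find_ssrs and its thresholds are assumptions made for this example.

```python
import re

def find_ssrs(seq, min_unit=1, max_unit=6, min_repeats=4):
    """Find perfect tandem repeats (hypothetical helper, not a real tool's API).

    Returns (start, end, unit, copies) tuples with 0-based, end-exclusive
    coordinates. Imperfect repeats are not detected.
    """
    seq = seq.upper()
    # A unit of min_unit..max_unit bases (shortest tried first), followed by
    # at least (min_repeats - 1) further exact copies of that unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        copies = (m.end() - m.start()) // len(unit)
        hits.append((m.start(), m.end(), unit, copies))
    return hits

if __name__ == "__main__":
    for hit in find_ssrs("GGCACACACACATT"):
        print(hit)
```

The reported intervals could then be passed to a masking step; real pipelines would rely on the dedicated tools listed above.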

14 Positions/locations are not affected by masking
Masked sequence
A repeat-masked sequence is an artificial construction in which regions thought to be repetitive are marked with X's. It is widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of TEs in the final annotation set. Positions/locations are not affected by masking.
>my sequence
atgagcttcgatagcgatcagctagcgatcaggctactattggcttctctagactcgtctatctctattagctatcatctcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggctgatcttaggtcttctgatcttct
>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct

15 Types of Masking- Hard or Soft?
Sometimes we want to mark up repetitive sequence but not exclude it from downstream analyses. This is achieved using a format known as soft-masking, in which repeats are written in lower case.
>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTGGCTTCTCTAGACTCGTCTATCTCTATTAGTATCATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT
>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTggcttctctagactcgtctatctctattagtatcATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT
>my sequence (hardmasked)
atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
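
The difference between hard and soft masking can be sketched in a few lines of Python. This is an illustrative sketch, not RepeatMasker's implementation; the function name mask and the interval convention (0-based, end-exclusive) are assumptions. It also shows why positions are unaffected: every base is replaced one-for-one, so the length never changes.

```python
def mask(seq, repeats, soft=True):
    """Mask repeat intervals in a sequence (hypothetical helper).

    repeats: iterable of (start, end) pairs, 0-based and end-exclusive.
    soft=True  -> repeat bases are written in lower case (soft-masking).
    soft=False -> repeat bases are replaced with 'x' (hard-masking).
    """
    out = list(seq.upper())
    for start, end in repeats:
        for i in range(start, min(end, len(out))):
            out[i] = out[i].lower() if soft else "x"
    return "".join(out)

if __name__ == "__main__":
    seq = "ATGAGCTTCGAT"
    repeats = [(3, 7)]  # made-up repeat coordinates
    print(mask(seq, repeats, soft=True))
    print(mask(seq, repeats, soft=False))
```

Because both outputs have the same length as the input, gene coordinates predicted on a masked sequence apply directly to the unmasked one.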

16 Genome annotation - workflow
[Workflow: genome sequence -> map repeats -> masked or un-masked genome sequence -> gene finding / structural annotation (protein-coding genes, ncRNAs, introns) -> functional annotation -> viewed and released in a genome viewer.]
Notes: Describe genome annotation as a process and emphasize its ongoing nature. There is no real end point to the annotation process (only artificially defined ones), so it is best to think of any release as a 'best guess' annotation.

17 Structural annotation
Identification of genomic elements:
Open reading frames and their localization.
Coding regions.
Location of regulatory motifs.
Start/stop codons.
Splice sites.
Non-coding regions/RNAs.

18 Structural annotation- Gene-finding
Start/stop codons and splice sites.
CDS (intron/exon): coding sequence for a protein-coding gene prediction (not necessarily continuous in a genomic context).
ORF: open reading frame, a sequence devoid of stop codons.
ncRNA.
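
As an illustration of the ORF definition, the following Python sketch scans the three forward reading frames for stretches that begin with ATG and run to the first in-frame stop codon. It assumes the standard stop codons, ignores the reverse strand and introns, and the name find_orfs and the min_codons threshold are invented for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for ATG..stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; end includes the stop codon.
    min_codons counts the codons before the stop (ATG included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs

if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

A real gene finder combines such signals with coding potential and splice-site models rather than relying on ORFs alone, especially in eukaryotes where the CDS is interrupted by introns.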

19 Methods
Similarity: similarity between sequences, which does not necessarily imply any evolutionary linkage.
Ab initio prediction: prediction of gene structure from first principles using only the genome sequence.

20 Gene finding: ab initio and similarity

21 Gene-finding resources for homology-based methods
Transcript: cDNA sequences, EST sequences.
Peptide: non-redundant (nr) protein database, protein sequence data, mass spectrometry data.
Genome: other genomic sequence.

22 Ab initio prediction
[Diagram: signals located on the genome: coding potential, ATG and stop codons, splice sites.]

23 Genefinding - ab initio predictions
Use compositional features of the DNA sequence to define coding segments (essentially exons):
ORFs; coding bias; splice site consensus sequences; start and stop codons.
Methods: training sets are required; each feature is assigned a log-likelihood score; dynamic programming is used to find the highest-scoring path.
Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh.
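
The combination of log-likelihood scores and dynamic programming can be sketched with a two-state model (non-coding vs. coding) decoded by the Viterbi algorithm. This is a toy example, not the model used by Augustus or any tool named above; all probabilities are invented for illustration, whereas a real gene finder would estimate them from a training set and use many more states (exons, introns, splice sites).

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding

# Toy parameters, invented for this sketch (assumption, not trained values).
EMIT = {
    "N": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}
TRANS = {
    "N": {"N": 0.9, "C": 0.1},
    "C": {"N": 0.1, "C": 0.9},
}
START = {"N": 0.5, "C": 0.5}

def viterbi(seq):
    """Return the highest-scoring state path for a non-empty ACGT string."""
    seq = seq.upper()
    # Log score of the best path ending in each state at the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back-pointers, one dict per position after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

if __name__ == "__main__":
    print(viterbi("AT" * 5 + "GC" * 6 + "AT" * 5))
```

Working in log space turns products of probabilities into sums, which is why each feature is given a log-likelihood score before the highest-scoring path is searched for.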

24 Genefinding - similarity
Use known coding sequence to define coding regions: EST sequences, peptide sequences.
Problem: handling fuzzy alignment regions around splice sites.
Examples: EST2Genome, exonerate, genewise.
Gene finding - comparative: use two or more genomic sequences to predict genes based on conservation of exon sequences.
Examples: Twinscan and SLAM.

25 Genefinding - non-coding RNA genes
Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples.
tRNAscan: uses an HMM and covariance model for prediction of tRNA genes.
Rfam: a suite of HMMs trained against a large number of different RNA genes.

26 Gene-finding omissions
Alternative isoforms: currently there is no good method for predicting alternative isoforms; they are only created where supporting transcript evidence is present.
Pseudogenes: each genome project has a fuzzy definition of pseudogenes; they are badly curated/described across the board.
Promoters: rarely a priority for a genome project; some algorithms exist but are usually not integrated into an annotation set.

27 Practical- structural annotation
Eukaryotes: AUGUSTUS (gene model)
  ~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial --singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=true --progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea our_genome.fasta > structural_annotation.gff
Prokaryotes: PRODIGAL (codon usage table)
  ~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt
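
Once a GFF file such as structural_annotation.gff exists, a quick sanity check is to count the feature types it contains. The Python sketch below assumes standard tab-separated GFF with nine columns and the feature type in the third; the function name count_features is made up for this example, and the exact feature names a given AUGUSTUS or Prodigal version writes may differ.

```python
from collections import Counter

def count_features(gff_lines):
    """Count feature types (column 3) in GFF lines (hypothetical helper).

    Comment lines starting with '#' and lines without nine tab-separated
    columns are skipped.
    """
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        counts[cols[2]] += 1
    return counts

if __name__ == "__main__":
    # Made-up example records, not real AUGUSTUS output.
    example = [
        "##gff-version 3",
        "contig1\tAUGUSTUS\tgene\t1\t900\t0.8\t+\t.\tID=g1",
        "contig1\tAUGUSTUS\tCDS\t1\t300\t0.9\t+\t0\tParent=g1.t1",
        "contig1\tAUGUSTUS\tCDS\t400\t900\t0.9\t+\t0\tParent=g1.t1",
    ]
    for feature, n in count_features(example).items():
        print(feature, n)
```

For a real run, an open file handle can be passed directly, e.g. count_features(open("structural_annotation.gff")).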

28 Structural Annotation
Structural annotation was conducted using AUGUSTUS (version 2.5.5) with the Magnaporthe_grisea species model. However, a dedicated model has to be developed to obtain accurate results.

29 Functional annotation

30 Functional annotation
Attaching biological information to genomic elements:
Biochemical function.
Biological function.
Regulation and interactions involved.
Expression.
Apply known structural information to the predicted protein sequence.

31 Genome annotation - workflow
[Workflow: genome sequence -> map repeats -> masked or un-masked genome sequence -> gene finding / structural annotation (protein-coding genes, ncRNAs, introns) -> functional annotation -> viewed and released in a genome viewer.]
Notes: Describe genome annotation as a process and emphasize its ongoing nature. There is no real end point to the annotation process (only artificially defined ones), so it is best to think of any release as a 'best guess' annotation.

32 Genome annotation
[Diagram: Genome -> transcription -> primary transcript -> RNA processing -> processed mRNA (m7G cap, ATG ... STOP, AAAn tail) -> translation -> polypeptide -> protein folding -> folded protein -> functional activity (enzyme activity, A -> B). Annotation step marked: find function.]

33 Functional annotation – Homology Based
Predicted exons/CDS/ORFs are searched against the non-redundant protein database (NCBI, SwissProt) to look for similarities.
Visually assess the top 5-10 hits to identify whether these have been assigned a function.
Functions are then assigned.

34 Functional annotation - Other features
Other features which can be determined:
Signal peptides.
Transmembrane domains.
Low complexity regions.
Various binding sites, glycosylation sites, etc.
Protein domains.
See for a good list of possible prediction algorithms

35 Functional annotation - Other features (Ontologies)
Use of ontologies to annotate gene products.
Gene Ontology (GO): cellular component, molecular function, biological process.

36 Practical - FUNCTIONAL ANNOTATION
Homology-based method:
Set up a BLAST database (nucleotide/protein).
BLAST genome.fasta against it for annotations (nucleotide/protein).
Filter the BLAST hits using an E-value cutoff of 0.01 (nucleotide/protein).
Further filter for the best BLAST hits (5-15) and assign functions.
Remove positive-strand BLAST hits; remove negative-strand BLAST hits.
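
The filtering steps can be sketched in Python, assuming the BLAST results were saved in the standard 12-column tabular format (-outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The sketch also assumes that hits are kept when their E-value is at or below the 0.01 cutoff; the function name best_hits and the top_n default are illustrative choices, not part of the original practical.

```python
from collections import defaultdict

def best_hits(blast_lines, max_evalue=0.01, top_n=5):
    """Keep the top_n hits per query with E-value <= max_evalue.

    Hits are ranked by ascending E-value, then descending bit score.
    Returns {query: [(subject, evalue, bitscore), ...]}.
    """
    hits = defaultdict(list)
    for line in blast_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip comments or malformed lines
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue <= max_evalue:
            hits[query].append((subject, evalue, bitscore))
    return {
        q: sorted(h, key=lambda x: (x[1], -x[2]))[:top_n]
        for q, h in hits.items()
    }

if __name__ == "__main__":
    # Made-up example hits, not real BLAST output.
    example = [
        "g1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
        "g1\tP2\t50.0\t100\t50\t0\t1\t100\t1\t100\t0.5\t30",
        "g1\tP3\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t220",
    ]
    for query, ranked in best_hits(example).items():
        print(query, [subject for subject, _, _ in ranked])
```

The descriptions of the retained subject sequences would then be inspected to assign a function, as in the visual assessment of top hits described for the homology-based approach.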

37 Functional annotation- output
August 2008 Bioinformatics tools for Comparative Genomics of Vectors

38 Conclusion
Annotation accuracy is only as good as the supporting data available at the time of annotation, so updating the information is necessary.
Gene predictions will change over time as new data become available (ESTs, related genomes) that are more similar than previous ones.
Functional assignments will change over time as new data become available (characterization of hypothetical proteins).

39 Thank You

