Gene Structure Annotation David Swarbreck ASPB Plant Biology, June 29, 2008, Merida.


Outline: Overview of TAIR8 (data availability, assembly updates, transposable elements); Plans for TAIR9 (gene confidence, alternative gene models, utilising comparative, proteomic and transcriptome data); New GBrowse tracks

TAIR8 Release: 33,282 total genes (38,963 gene models); 1291 new genes (2009 new gene models); 50 obsolete genes (65 deleted gene models); Merge 41, Split updated structures, 625 CDS updates; 23% (7380) TAIR7 genes updated. Source of updates: submissions from the community (reviewed by TAIR); in-house manual annotation; computational pipeline (PASA)

TAIR8 Release: 33,282 total genes (38,963 gene models); 1291 (681) new genes (2009 new gene models); 50 obsolete genes (65 deleted gene models); Merge 41, Split updated structures, 625 CDS updates; 23% (7380) (32% 10098) TAIR7 genes updated

Genome Annotation Portal ion_data.jsp


Sequences and information, TAIR FTP: ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR8_genome_release/ Contents: sequences; GFF/XML/NCBI.tbl; updates; conversion files; associations
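As a way of checking a downloaded GFF file against the release totals, the sketch below counts features by type (column 3 of a tab-delimited GFF line). It is a minimal illustration, not TAIR's own tooling: the function name and the sample rows are assumptions made for the example and are not copied from the release files.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count features per type (GFF column 3), skipping comments and short lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:  # a GFF record has nine tab-separated columns
            continue
        counts[fields[2]] += 1
    return counts


# Illustrative rows only (hypothetical content in GFF layout).
sample = [
    "##gff-version 3",
    "Chr1\tTAIR8\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010",
    "Chr1\tTAIR8\tmRNA\t3631\t5899\t.\t+\t.\tID=AT1G01010.1;Parent=AT1G01010",
    "Chr1\tTAIR8\texon\t3631\t3913\t.\t+\t.\tParent=AT1G01010.1",
]
counts = count_feature_types(sample)
print(dict(counts))
```

For a real file, the same function can be given an open file handle, since it only iterates over lines.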

Browse the genome Seqviewer Data types

Browse the genome GBrowse Data types >50 tracks

Changes made for TAIR8: assembly updates (removal of sequence contamination, single base pair errors); addition of transposable elements

Assembly updates: genome assembly unchanged since TIGR5 (prior to TAIR8). Removal of sequence contamination: vector = NCBI VecScreen, Webcutter 2.0; E. coli = Megablast vs. E. coli (nr); rice = community. Vector/E. coli = 12 regions; rice = 2 regions. Equivalent number of Ns substituted. 8 genes set to obsolete, 2 modified
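Substituting an equivalent number of Ns keeps every downstream coordinate unchanged. The sketch below shows that idea on a toy sequence; the function name and the 1-based inclusive coordinate convention are assumptions for illustration, not a description of the pipeline TAIR used.

```python
def mask_regions(sequence, regions):
    """Replace each 1-based inclusive (start, end) region with the same number of Ns."""
    bases = list(sequence)
    for start, end in regions:
        if start < 1 or end > len(bases) or start > end:
            raise ValueError(f"region {start}-{end} is outside the sequence")
        # Same-length replacement, so positions after the region do not shift.
        bases[start - 1:end] = "N" * (end - start + 1)
    return "".join(bases)


masked = mask_regions("ACGTACGTAC", [(3, 5)])
print(masked)
```

Because the masked sequence has the same length as the input, existing annotation coordinates outside the masked regions remain valid.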

Assembly updates: single base pair errors. Solexa read data (Columbia) supplied by Joe Ecker's lab (Salk Institute). 1425 bases changed (called 2 or greater, % of time consensus base is called is >=75%), no minority read support/no Ler support. Confirmed base changes where they overlap current annotation
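The filtering criteria on this slide can be sketched as a small function. The slide text is terse, so the reading used here is an assumption: the consensus read base must be called at least twice, must account for at least 75% of calls at the position, and the reference base must have no read support and no Ler support. The function and parameter names are hypothetical.

```python
from collections import Counter


def propose_base_change(reference_base, read_bases, ler_supports_reference=False,
                        min_reads=2, min_fraction=0.75):
    """Return the replacement base if the reads support a change, else None."""
    counts = Counter(read_bases)
    if not counts:
        return None
    consensus, n_consensus = counts.most_common(1)[0]
    if consensus == reference_base:
        return None  # reads agree with the current assembly
    if n_consensus < min_reads or n_consensus / len(read_bases) < min_fraction:
        return None  # consensus call is too weak
    if counts[reference_base] > 0 or ler_supports_reference:
        return None  # assumed meaning of "no minority read support/no Ler support"
    return consensus


print(propose_base_change("A", "GGGG"))
print(propose_base_change("A", "GGGA"))
```

The first call proposes a change because all four reads agree on a non-reference base; the second does not, because one read still supports the reference base.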

Assembly updates: single base pair errors. 1425 bases changed; 157 gene model protein sequences updated; 518 had either protein/CDS, mRNA or genomic sequence updated

Assembly updates - GBrowse Gaps

Transposable Elements (TE) & TE-genes: 31,060 elements, 339 families, 17 superfamilies. Hadi Quesneville, Institut Jacques Monod (Buisine et al., Genomics, 2008). Combines evidence from multiple homology-based predictions. TE-gene annotation: gene encoded within a transposable element, e.g. helicase, transposase etc. TAIR7: no defined type (ncRNA, protein coding, pseudogene). TAIR7: not all TE-genes have TE descriptions

HELITRON4 family DNA transposon Unknown pseudogenes Overlapping TEs Protein alignments Transposable Element

Identifying TE-genes. Categorization as TE-gene: by % overlap with TE (100, >70, >50, below 50); similarity to set of known TE-proteins; manual review; additional checks (description, GO terms, publications, transcript evidence). 3900 AGI genes were reclassified (720 previously classed as protein coding)
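The overlap categories listed above can be illustrated with a short sketch. It assumes that the percentage is measured over the gene's length and that coordinates are 1-based and inclusive; placing a value of exactly 50 in the lowest bin is also an assumption, since the slide does not say. Function names are hypothetical.

```python
def percent_overlap(gene, te):
    """Percent of the gene's length covered by one TE (1-based inclusive coordinates)."""
    g_start, g_end = gene
    t_start, t_end = te
    overlap = min(g_end, t_end) - max(g_start, t_start) + 1
    return 100.0 * max(overlap, 0) / (g_end - g_start + 1)


def overlap_bin(percent):
    """Map a percent overlap onto the slide's bins: 100, >70, >50, below 50."""
    if percent >= 100:
        return "100"
    if percent > 70:
        return ">70"
    if percent > 50:
        return ">50"
    return "below 50"  # assumption: exactly 50 is grouped here


gene = (101, 200)
for te in [(1, 300), (121, 300), (131, 300), (181, 300)]:
    print(te, overlap_bin(percent_overlap(gene, te)))
```

In the process described on the slide, the bin is only one input; similarity to known TE proteins, manual review and the additional checks follow.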

Associating TE to TE-genes: overlap single TE >75%; 2940 TE-genes associated; 960 TE-genes unassociated
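A minimal sketch of the association rule follows. It assumes "overlap single TE >75%" means that one TE on its own covers more than 75% of the TE-gene, that all features lie on the same chromosome, and that coordinates are 1-based and inclusive. All identifiers in the example are hypothetical.

```python
def associate_te_genes(genes, tes, min_percent=75.0):
    """Link each TE-gene to the single TE covering most of it, if coverage > min_percent."""
    associated, unassociated = {}, []
    for gene_id, (g_start, g_end) in genes.items():
        gene_length = g_end - g_start + 1
        best_id, best_percent = None, 0.0
        for te_id, (t_start, t_end) in tes.items():
            overlap = min(g_end, t_end) - max(g_start, t_start) + 1
            percent = 100.0 * max(overlap, 0) / gene_length
            if percent > best_percent:
                best_id, best_percent = te_id, percent
        if best_percent > min_percent:
            associated[gene_id] = best_id
        else:
            unassociated.append(gene_id)
    return associated, unassociated


genes = {"geneA": (100, 199), "geneB": (500, 599)}  # hypothetical TE-genes
tes = {"te1": (90, 210), "te2": (560, 700)}         # hypothetical TEs
associated, unassociated = associate_te_genes(genes, tes)
print(associated, unassociated)
```

Here geneA lies entirely within te1 and is associated with it, while geneB is only 40% covered by te2 and stays unassociated.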

Transposons & TAIR TE given ID AT2TE ,189 TEs, 3900 TE-genes

Transposons & TAIR

Plans for TAIR9

Gene confidence score. Why assign a confidence score? Differentiates well supported, partially supported and non-supported models. Allows TAIR users to target particular categories: for further experimentation, for use as a reference set, for computational analysis. Allows TAIR to target partially supported genes. Provides a measure with which to monitor improvement

Gene confidence outline. Categories of evidence: transcript (cDNA/EST); protein; conservation; proteomic data; transcriptome data (MPSS etc). Rankings within category. Assign confidence score/rank to model + exons

Transcript exon rankings - internal: splice sites confirmed by transcript; transcript only overlaps exon; intermediates
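The ranking of internal exons by transcript support can be sketched as below. The slide names the two extremes (splice sites confirmed, overlap only) and mentions intermediates without defining them; treating "one splice site confirmed" as the intermediate case is an assumption, as are the function names and labels. Each transcript alignment is given as a list of (start, end) blocks.

```python
def intron_adjacent_edges(blocks):
    """Return block starts and ends that border an intron within one alignment."""
    blocks = sorted(blocks)
    starts = {start for start, _ in blocks[1:]}   # every block except the first
    ends = {end for _, end in blocks[:-1]}        # every block except the last
    return starts, ends


def rank_internal_exon(exon, alignments):
    """Rank an internal exon by how well transcript alignments support its boundaries."""
    exon_start, exon_end = exon
    start_confirmed = end_confirmed = overlapped = False
    for blocks in alignments:
        starts, ends = intron_adjacent_edges(blocks)
        start_confirmed = start_confirmed or exon_start in starts
        end_confirmed = end_confirmed or exon_end in ends
        overlapped = overlapped or any(s <= exon_end and e >= exon_start for s, e in blocks)
    if start_confirmed and end_confirmed:
        return "both splice sites confirmed"
    if start_confirmed or end_confirmed:
        return "one splice site confirmed"  # assumed intermediate category
    if overlapped:
        return "overlap only"
    return "no transcript support"


exon = (200, 300)
print(rank_internal_exon(exon, [[(100, 150), (200, 300), (400, 450)]]))
print(rank_internal_exon(exon, [[(220, 280)]]))
```

A block edge only counts as a splice site when another block of the same alignment lies beyond it, so a transcript that merely starts or ends at an exon boundary does not confirm that boundary.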

Transcript exon rankings - external

Transcript Model rankings Intermediates

Gene confidence outline. Provide evidence ranks on web pages/GFF, e.g. (category: rank) transcript (cDNA/EST): 7; protein: 2; conservation: 2; proteomic data: 0; transcriptome data (MPSS etc): 0. Include overall rank (incorporating all evidence). Associate general description to each overall rank, e.g. Confirmed, partially confirmed or Platinum, Gold, Silver etc. Exon ranks included in GFF file

Alternative gene annotations: Eugene (transcript, proteins +), Sebastien Aubourg; Gnomon (transcript, proteins), Souvorov (NCBI); Aceview (transcript), Thierry-Mieg (NCBI); Hanada et al 2007 (3633 predicted genes). Identify possible corrections

Utilising comparative, proteomic and transcriptome data. Existing annotation: ab initio + transcript. Advancements in sequencing technology. Proteomic data (mass spec); comparative data; transcriptome data (MPSS, SAGE)

Proteomic data: high-density Arabidopsis proteome map (Baerenfaller et al., 2008). Verification of gene structure at the level of translation. Not all transcripts expressed at protein level: transcribed pseudogenes, NMD targets. Aid locus classification. Help identify missing genes/exons, coding exons, TSS, incorrect start codon

Comparative data: cross-species transcript/peptide alignments; genomic alignments (LBL): Populus trichocarpa, Oryza sativa, Medicago truncatula, Physcomitrella patens, Selaginella moellendorffii

VISTA plot GBrowse track

Transcriptome data: sequence-based signature methods (MPSS, SAGE etc). Identify intergenic expression, alternative exons, anti-sense expression
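How mapped signatures could point to intergenic or anti-sense expression can be sketched as a simple classifier. This is an illustration under stated assumptions: each signature is reduced to one genomic position and a strand, genes are (start, end, strand) tuples on a single chromosome, and the function name and labels are hypothetical. Detecting alternative exons would need exon-level coordinates and is not covered.

```python
def classify_signature(position, strand, genes):
    """Classify a mapped signature as sense, antisense or intergenic relative to genes."""
    hits = [gene for gene in genes if gene[0] <= position <= gene[1]]
    if not hits:
        return "intergenic"
    if any(gene_strand == strand for _, _, gene_strand in hits):
        return "sense"
    return "antisense"


genes = [(100, 500, "+"), (800, 1200, "-")]  # hypothetical annotated genes
for position, strand in [(300, "+"), (300, "-"), (650, "+")]:
    print(position, strand, classify_signature(position, strand, genes))
```

Signatures labelled intergenic or antisense are the ones that would flag candidate missing genes or anti-sense transcripts for closer review.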

Transcriptome data

A collective approach. Utilise alt. gene predictions, comparative alignments, transcriptome and proteomic data: complements individual strategies. Gene confidence: identify weakly supported genes. Comparing across data types: identifies potential gene updates; allows us to prioritize updates. Combined manual and computational approach

Orthologs and Gene Families

Variation

Promoter Elements

Methylation

Decorated Fasta file