The Ensembl Gene Set: The “Genebuild” (21 April 2008)


2 of 32 Outline
- The GeneBuild (determining the Ensembl gene set)
- What does it mean for the scientist?
- ‘Annotation pipeline’ vs ‘manual curation’
- Pseudogenes
- ncRNAs
- The CCDS project

3 of 32 Introduction: What is available? I) Sequence assemblies from genome sequencing efforts

4 of 32 Gene Sequencing: the Assembly. This generates clones, vs new sequencing methods.

5 of 32 Clones Available. Human: tilepath clones (used in the assembly). Ciona intestinalis: shotgun assembly.

6 of 32 ContigView: Clones and Contigs [Screenshot labels: Contigs; Clones (plate/well numbers); Ensembl Transcripts]

7 of 32 Task: View the tilepath clone in ContigView for the region containing the human BRCA2 gene. Hint: Start with a search for the BRCA2 gene.

8 of 32 The Ensembl Geneset: How does Ensembl use mRNA and protein information along with the sequence assembly to define distinct genes on the genome? [Diagram labels: Protein; Sequence Assembly; Ensembl Geneset]

9 of 32 Once the Assembly is Imported… Proteins and mRNAs are aligned to it. These sequences have been submitted to databases such as UniProt/Swiss-Prot (manually curated) and RefSeq (partially manually curated).

10 of 32 The Biological Evidence
All Ensembl gene predictions are based on experimental evidence:
- UniProt/Swiss-Prot: a manually curated database and therefore of highest accuracy
- NCBI RefSeq: a partially manually curated database
- UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features
- EMBL / GenBank / DDBJ: primary nucleotide sequence repository
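The ordering of evidence sources above can be sketched in code. This is only an illustration: the numeric ranks, the function names and the example accessions are hypothetical, and the sketch does not claim to reproduce how the Genebuild pipeline actually weighs evidence.

```python
# Hypothetical ranks that follow the order given on the slide
# (lower number = more heavily curated). Not Ensembl's own scoring.
CURATION_RANK = {
    "UniProt/Swiss-Prot": 0,  # manually curated
    "NCBI RefSeq": 1,         # partially manually curated
    "UniProt/TrEMBL": 2,      # automatically annotated CDS translations
    "EMBL/GenBank/DDBJ": 3,   # primary nucleotide repository
}


def sort_by_curation(evidence):
    """Sort (accession, source) pairs from most to least curated source."""
    return sorted(evidence, key=lambda item: CURATION_RANK[item[1]])


def best_evidence(evidence):
    """Return the pair that comes from the most curated source."""
    return sort_by_curation(evidence)[0]


# Made-up accessions, for illustration only.
example = [
    ("nucleotide_X", "EMBL/GenBank/DDBJ"),
    ("protein_B", "UniProt/TrEMBL"),
    ("protein_A", "UniProt/Swiss-Prot"),
    ("mrna_C", "NCBI RefSeq"),
]
print(best_evidence(example))
```

With the made-up list above, the Swiss-Prot entry is returned first because it carries the lowest rank.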

11 of 32 Database Relationship [Diagram labels: Individual Lab’s Submission; EMBL-Bank, GenBank, DDBJ; NCBI RefSeq; UniProt (Swiss-Prot, TrEMBL)]

12 of 32 [Diagram labels: Sequence (assembly); Proteins (e.g. Swiss-Prot); mRNA; EST; EMBL-Bank, GenBank, DDBJ; Manual annotation (HAVANA); EST genes; Ensembl Genebuild]

13 of 32 What is an Ensembl gene based on? Ensembl genes may be based on multiple proteins/mRNAs. Why do I want to know?…

14 of 32 Task: Look at the evidence for the human EPO gene. What was this gene based on? Hint: Go to Exon Information from the GeneView page.

15 of 32 EPO gene supporting evidence

16 of 32 Species-Specific GeneBuilds
- Pan troglodytes genes are built by projection from human genes.
- Zebrafish has many gene duplications.
- Homo sapiens genes must have protein evidence, not just mRNA.

17 of 32 Task: When was the chimpanzee (Pan troglodytes) Genebuild performed? Can you find information as to how genes were annotated? Hint: Look on the chimpanzee index page.

18 of 32 External Gene Set: VEGA/Havana. Available for human, zebrafish, mouse and dog. Havana transcripts are shown in blue or gold… What are Havana transcripts?

19 of 32 Automatic vs Manual Annotation
Automatic Annotation (Ensembl Genebuild):
- Quick
- Uses unfinished sequence or shotgun assembly
- Consistent annotation
Manual Annotation (Havana):
- Flexible, can deal with inconsistencies
- Most rules have exceptions
- Consults publications as well as databases
- ‘Out of the ordinary’ biology
- However: slow, and needs finished sequence

20 of 32 Havana and Ensembl match: When Havana (manual curation) and Ensembl (automatic methods) predict the same transcript, base pair for base pair, the transcripts are merged and coloured gold.
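A minimal sketch of the "same transcript, base pair for base pair" test follows. It assumes a transcript can be represented as a list of (start, end) exon coordinates and ignores chromosome, strand and CDS boundaries; the function names and all coordinates are hypothetical and this is not the actual Ensembl merge code.

```python
def exon_signature(exons):
    """Return a hashable, order-independent form of a transcript's exons."""
    return tuple(sorted(exons))


def merge_annotations(havana, ensembl):
    """Compare two annotation sets given as {name: [(start, end), ...]}.

    Returns (havana_name, ensembl_name, label) tuples, where label is
    'merged' only when the exon coordinates are identical.
    """
    havana_by_sig = {exon_signature(ex): name for name, ex in havana.items()}
    results = []
    matched = set()
    for name, exons in ensembl.items():
        sig = exon_signature(exons)
        if sig in havana_by_sig:
            results.append((havana_by_sig[sig], name, "merged"))
            matched.add(sig)
        else:
            results.append((None, name, "ensembl"))
    for sig, name in havana_by_sig.items():
        if sig not in matched:
            results.append((name, None, "havana"))
    return results


# Made-up coordinates: H1 and E1 are identical, H2 differs by one exon end.
havana = {"H1": [(100, 200), (300, 400)], "H2": [(100, 200), (300, 450)]}
ensembl = {"E1": [(300, 400), (100, 200)], "E2": [(500, 600)]}
results = merge_annotations(havana, ensembl)
for row in results:
    print(row)
```

Only the H1/E1 pair is labelled 'merged' here; H2 differs by a single exon boundary and therefore stays a Havana-only transcript, which mirrors the strictness of an exact match.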

21 of 32 Manually-curated gene sets in Ensembl
- Vega (Havana): Homo sapiens, Danio rerio, Mus musculus and Canis familiaris
- WormBase: Caenorhabditis elegans
- FlyBase: Drosophila melanogaster
- SGD: Saccharomyces cerevisiae

22 of 32 Consensus coding sequences (CCDS)
- Collaboration between NCBI, UCSC, Ensembl and Havana to agree on a coding sequence for a transcript.
- The long-term aim is to have a single gene set for human.
- The genebuild pipeline has been modified to retain these CDSs.

23 of 32 What Can Go Wrong?
I) A gap in the assembly: the gene might not be found in Ensembl.
II) Fused genes: the gene might be associated with two names. [Figure label: BLAST hit (Swiss-Prot entry)]
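One way to picture the fused-gene case is a gene model whose span is covered by two differently named protein hits that do not overlap each other. The sketch below flags that situation; it is a hypothetical heuristic with made-up names and coordinates, not a check taken from the Genebuild pipeline.

```python
def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def possible_fusion(gene_span, hits):
    """Return the two hit names if the gene looks like a fusion, else [].

    hits: list of (name, start, end) protein alignments on the genome.
    """
    inside = [h for h in hits if overlaps(gene_span, (h[1], h[2]))]
    for i, a in enumerate(inside):
        for b in inside[i + 1:]:
            different_name = a[0] != b[0]
            disjoint = not overlaps((a[1], a[2]), (b[1], b[2]))
            if different_name and disjoint:
                return sorted({a[0], b[0]})
    return []


# Made-up example: one gene model spanning two separate protein hits.
gene = (1000, 9000)
hits = [("PROT_A", 1000, 4000), ("PROT_B", 5000, 9000)]
print(possible_fusion(gene, hits))
```

Two hits from the same protein, or two hits that overlap, are not flagged, since those are the ordinary situations for a single gene.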

24 of 32 Outline
- The genome sequence
- The Genebuild
- ‘Manual curation’ by Havana
- Other: EST gene set, pseudogenes, ncRNAs

25 of 32 Expressed Sequence Tags vs ‘cDNA’
ESTs are annotated separately. Why?
- mRNA and cDNA used in the GeneBuild: sequenced to a high standard, often complete.
- EST: lower quality sequence. ‘One shot’ sequencing of cDNA from the 5’ and 3’ ends creates the EST sequence. ESTs are only […] nucleotides long. Low quality fragment, with a sequence error of ~2%.
BUT ESTs confer useful expression information:
- Discovery of new genes, esp. in diseased organisms
- Tissue type
- Timing/developmental stage
- Samples more transcripts and variants
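The ~2% error figure can be turned into a quick back-of-the-envelope calculation. The sketch below assumes errors occur independently at each base, which is a simplification, and the read length of 100 in the example is an arbitrary illustrative value, not the EST length from the slide.

```python
def expected_errors(length, error_rate=0.02):
    """Expected number of miscalled bases in a read of the given length."""
    return length * error_rate


def prob_error_free(length, error_rate=0.02):
    """Probability that a read contains no errors (independent errors assumed)."""
    return (1.0 - error_rate) ** length


# Illustrative length only; chosen for round numbers.
read_length = 100
print(expected_errors(read_length))
print(round(prob_error_free(read_length), 3))
```

Under these assumptions a 100-base read carries about 2 errors on average and has roughly a 13% chance of being error-free, which suggests why ESTs are kept out of the main GeneBuild and handled as a separate gene set.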

26 of 32 Where Can I See This EST Geneset? In ContigView, choose ‘EST genes’. [Screenshot label: EST track]

27 of 32 Pseudogenes: ‘False’ Genes
- Unprocessed: produced by gene duplication and rearrangement
- Processed: produced by reverse transcription and re-integration of mRNA [Diagram labels: mRNA (AAAAAA); pseudogene (AAAAAA)]

28 of 32 ncRNAs (non-coding RNAs): What types are in Ensembl?
- tRNA (transfer RNA)
- rRNA (ribosomal RNA)
- scRNA (small cytoplasmic RNA)
- snRNA (small nuclear RNA)
- snoRNA (small nucleolar RNA)
- miRNA (microRNA)

29 of 32 ncRNAs (2 types)
I) RNA with low sequence homology can be identified through conserved secondary structure (search the genome using Rfam patterns).
II) High sequence conservation (miRNA): BLAST alignment, then ‘RNA fold’ applied to make sure the sequences can fold (hairpin).
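To give a feel for what "can fold into a hairpin" means, here is a deliberately naive complementarity check. It is not the ‘RNA fold’ step named on the slide and does no thermodynamic prediction: it only counts how many bases at the two ends of a sequence can pair with each other around an unpaired loop. The function names, the minimum stem length and the minimum loop size are all assumptions made for this sketch.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def stem_pairs(seq, loop=4):
    """Count consecutive end-to-end base pairs, keeping >= `loop` unpaired bases."""
    seq = seq.upper().replace("T", "U")
    i, j = 0, len(seq) - 1
    count = 0
    # A pair (i, j) is allowed only if it encloses at least `loop` bases.
    while j - i - 1 >= loop and (seq[i], seq[j]) in PAIRS:
        count += 1
        i += 1
        j -= 1
    return count


def can_form_hairpin(seq, min_stem=4, loop=4):
    """True if the sequence ends can form a stem of at least `min_stem` pairs."""
    return stem_pairs(seq, loop) >= min_stem


print(can_form_hairpin("GGGGAAAACCCC"))  # four G-C pairs around an AAAA loop
print(can_form_hairpin("AAAAAAAAAAAA"))  # no complementary ends
```

The first toy sequence passes and the second does not. A real check would use a dedicated folding program that scores bulges, internal loops and folding energy, none of which this sketch attempts.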

30 of 32 ncRNAs… where can I see them? Find them in ContigView, or use BioMart.

31 of 32 Summary: Ensembl Genes
- All Ensembl genes are based on biological evidence (protein and mRNA).
- One Ensembl gene may come from proteins and mRNAs in various databases.
- Havana (manually curated) genes are incorporated into the Ensembl geneset, merged for human.
- The CCDS set strives for consensus coding sequences across databases.
- Pseudogenes and RNAs are annotated, along with a separate EST gene set.

32 of 32 For more on GeneBuild: Help and Documentation (About Ensembl)