1 of 42 Browsing Genes and Genomes with Ensembl Bert Overduin Ensembl User Support EMBL Outstation European Bioinformatics Institute Wellcome Trust Genome Campus Hinxton, Cambridge, UK
2 of 42 Course Schedule
- Introduction
- Website walk-through
- Coffee
- Exercises
- BioMart
- Lunch
- Exercises
- GeneBuild
- Tea
- Variations / Compara
- Exercises
3 of 42 Ensembl Workshops
4 of 42 EMBL-EBI Hinxton, Cambridge
5 of 42 Wellcome Trust Genome Campus Hinxton, Cambridge © John Freebrey (
6 of 42
7 of 42 © Sean T. McHugh ( Cambridge
8 of 42 A Bit of History
Sequenced genomes:
  Year  Organism                Genome size
  1995  Haemophilus influenzae  1.8 Mb
  1996  Yeast                   12 Mb
  1998  C. elegans              100 Mb
  1999  Fruit fly               125 Mb
  2000  Arabidopsis             115 Mb
  2001  Human (draft)
  2002  Mouse                   2.6 Gb
  2004  Human ("finished")      3 Gb
9 of 42 A Bit of History
10 of 42 Annotation
Wikipedia: Genome annotation is the process of attaching biological information to sequences. It consists of two main steps:
1. identifying elements on the genome, a process called Gene Finding, and
2. attaching biological information to these elements.
Automatic annotation tools try to perform all this by computer analysis, as opposed to manual annotation, which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.
11 of 42 Ensembl - Goals
- Provide automatic annotation of genomic sequence
- Integrate other biological data
- Make data available to all via the web
12 of 42 Ensembl - Organisation
- Joint project between European Bioinformatics Institute (EMBL-EBI) and Wellcome Trust Sanger Institute
- Started in 1999 for the Human Genome Project
- Funded primarily by the Wellcome Trust, additional funding by EMBL, EU, NIH-NIAID, BBSRC and MRC
- Team of ca. 40 people, led by Ewan Birney (EBI) and Tim Hubbard (Sanger)
- Uses the largest dedicated computer system in biology in Europe
13 of 42 Genome Browsers
- Ensembl Genome Browser
- NCBI Map Viewer
- UCSC Genome Browser
14 of 42 NCBI Map Viewer
15 of 42 UCSC Genome Browser
16 of 42 Ensembl Genome Browser
17 of 42 What Distinguishes Ensembl from the UCSC and NCBI Browsers?
- Automatic annotation for those species for which no manually curated gene set exists
- Direct database access and programmatic access via the Perl API
- Not only the data, but also the software source code is open source
18 of 42 Caveats
- While genome browsers can be very useful tools, they do not provide the definitive answer to every question!
- Data is fluid
19 of 42 Which Species Are Available?
- 36 chordates, ranging from mammals to 'primitive' chordates (Ciona intestinalis and Ciona savignyi)
- 3 key eukaryote model organisms:
  - fruitfly (Drosophila melanogaster)
  - nematode (Caenorhabditis elegans)
  - yeast (Saccharomyces cerevisiae)
- 2 insect pathogen vectors:
  - malaria mosquito (Anopheles gambiae)
  - yellow fever / dengue mosquito (Aedes aegypti)
20 of 42 Species in Ensembl
[Figure: Ensembl species placed on a geological timescale (Cambrian to Tertiary, MYBP); groups labelled include fishes, amphibians, reptiles, birds, mammals (placentals, marsupials, monotremes) and non-vertebrates]
21 of 42 More Species to Come ...
Oikopleura, Gorilla, Zebra finch, Orangutan, Marmoset, Amphioxus, Acorn worm, Hyrax, Megabat, Dolphin, Tarsier, Kangaroo rat, Chinese pangolin, Two-toed sloth, Llama, Flying lemur
22 of 42 Which Data Are Available?
- Genomic sequence
- Gene/transcript/peptide models
- External references
- Mapped cDNAs, peptides, microarray probes, BAC clones etc.
- Other features of the genome: cytogenetic bands, markers, repeats etc.
- Comparative data: orthologues and paralogues, protein families, whole genome alignments, syntenic regions
- Variation data: SNPs
- Regulatory data: "best guess" set of regulatory elements
- Data from external sources (DAS)
23 of 42 Gene/Transcript/Peptide Models
- Manual annotation
  - For parts of genomes: human, dog, mouse, zebrafish ("Vega genes")
  - For complete genomes: fruitfly (FlyBase), C. elegans (WormBase), yeast (SGD)
- Automatic predictions ("Ensembl genes")
- EST predictions
- Ab initio predictions (GENSCAN, SNAP)
24 of 42 Biological Evidence
All Ensembl gene predictions are based on experimental evidence:
- UniProt/Swiss-Prot: a manually curated database and therefore of highest accuracy
- NCBI RefSeq: a partially manually curated database
- UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features
- EMBL / GenBank / DDBJ: primary nucleotide sequence repository
25 of 42 The Ensembl Genebuild
Genome assembly + Computer programs + Experimental evidence -> Ensembl Genes
26 of 42 Ensembl Identifiers
  ENSG###  Ensembl Gene ID
  ENST###  Ensembl Transcript ID
  ENSP###  Ensembl Peptide ID
  ENSE###  Ensembl Exon ID
  ENSF###  Ensembl Family ID
  ENSR###  Ensembl Regulatory Feature ID
For species other than human, a species code is added after "ENS": MUS for mouse (Mus musculus), giving ENSMUSG###; DAR for zebrafish (Danio rerio), giving ENSDARG###; etc.
For imported genes, Ensembl uses the original identifiers.
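The identifier scheme on this slide can be taken apart mechanically. The following Python sketch is only an illustration: the function name `parse_ensembl_id` and the `EnsemblId` type are hypothetical, and the pattern assumes that "###" stands for a run of digits and that species codes are always three uppercase letters, as in the MUS and DAR examples.

```python
import re
from typing import NamedTuple, Optional

# Feature-type letters as listed on the slide.
FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "peptide",
    "E": "exon",
    "F": "family",
    "R": "regulatory feature",
}

# Assumption: "ENS", an optional three-letter species code,
# one feature-type letter, then one or more digits.
_ID_PATTERN = re.compile(r"^ENS([A-Z]{3})?([GTPEFR])(\d+)$")


class EnsemblId(NamedTuple):
    species_code: Optional[str]  # None means no species code, i.e. human
    feature_type: str
    number: str


def parse_ensembl_id(identifier: str) -> Optional[EnsemblId]:
    """Split an Ensembl-style ID into its parts, or return None if it does not fit."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        return None
    species_code, type_letter, number = match.groups()
    return EnsemblId(species_code, FEATURE_TYPES[type_letter], number)
```

Under these assumptions, an ID such as `ENSMUSG00000017167` would be split into the species code `MUS`, the feature type `gene` and the numeric part, while an imported identifier (a gene symbol, for instance) would return `None`.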
27 of 42 Access to Genome Annotation
- Release web site
- Pre-Release
- Archive
- BioMart
- Downloads: ftp://ftp.ensembl.org/
- MySQL interface: ensembldb.ensembl.org
- Perl API
28 of 42 Pre! and Archive! Sites
29 of 42 BioMart Data Mining Tool
30 of 42 Downloads
ftp://ftp.ensembl.org/pub
- FASTA files: plain sequence
  - DNA (assembly, masked and unmasked)
  - cDNA (Ensembl and ab initio predictions)
  - Peptides (Ensembl and ab initio predictions)
  - RNA (non-coding RNA predictions)
- Flatfiles: annotated 1 Mb slices
  - EMBL format
  - GenBank format
- MySQL: database table dumps
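Once a FASTA file has been downloaded, it can be read with a few lines of code. The sketch below is a minimal Python reader, not an Ensembl tool: the function name `read_fasta` is hypothetical, and it assumes the usual FASTA layout of a header line starting with ">" followed by one or more sequence lines, with the record ID taken as the first word of the header.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record ID: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited word of the header.
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated until the next header.
            records[current_id] += line
    return records


# Made-up records for illustration only; real files would be opened with
# open(...) or, for compressed downloads, gzip.open(..., "rt").
example = [">seq1 example header", "ATGC", "GGTA", ">seq2", "TTAA"]
sequences = read_fasta(example)
```

The reader holds every sequence in memory, which is fine for cDNA or peptide sets; for whole-chromosome DNA files a streaming approach would probably be a better fit.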
31 of 42 MySQL
SQL = Structured Query Language
Needed:
- MySQL client program
- Ability to write MySQL queries
- Knowledge of database schema
32 of 42 Perl API
API = Application Programming Interface
Needed:
- BioPerl modules
- Ensembl modules
- Ability to code in Perl
For more information (installation instructions, tutorials, documentation etc.):
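As a rough illustration of what a first query might look like, the Python sketch below only assembles the command line for the standard `mysql` client; it does not connect to anything. The host is the one named in this deck. The `anonymous` user name and the `homo_sapiens_core` database-name prefix are assumptions not stated on the slides, and the helper name `build_mysql_command` is hypothetical.

```python
from typing import List

ENSEMBL_HOST = "ensembldb.ensembl.org"  # host named in this deck
ENSEMBL_USER = "anonymous"              # assumption: public read-only account


def build_mysql_command(query: str,
                        host: str = ENSEMBL_HOST,
                        user: str = ENSEMBL_USER) -> List[str]:
    """Build an argument list that runs one SQL statement with the mysql client."""
    return [
        "mysql",
        f"--host={host}",
        f"--user={user}",
        f"--execute={query}",
    ]


# Assumption: human core databases have names starting with "homo_sapiens_core".
command = build_mysql_command("SHOW DATABASES LIKE 'homo_sapiens_core%'")
```

On a machine with the MySQL client installed and network access, the resulting list could be passed to `subprocess.run(command, check=True)`. Writing anything more specific than `SHOW DATABASES` or `SHOW TABLES` requires knowledge of the database schema, as noted on the slide.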
33 of 42 Ensembl BLAST
- WU-BLAST 2.0: search against assemblies, Ensembl predictions or ab initio predictions
- BLAT (BLAST-like Alignment Tool) and SSAHA2 (Sequence Search and Alignment by Hashing Algorithm): very fast search against assemblies for (almost) exact DNA-DNA matches
- Search against one or multiple species
- Search a maximum of 30 sequences simultaneously
34 of 42 Ensembl Accounts
- Personalise Ensembl by saving bookmarks, view configurations and homepage preferences in a user account
- Share bookmarks and configurations by setting up groups
Please note that all Ensembl data remains freely accessible. It is not necessary to register in order to gain access to Ensembl data!
35 of 42 Website Statistics On average 1,000,000 page impressions / week Top 3 species: Top 3 countries:
36 of 42 Ensembl – Open Source Data and software freely available More than 50 installs worldwide Academia and industry Local or available via the web Mirrors with Ensembl data, e.g. or user projects with own data
37 of 42 Powered by Ensembl
38 of 42 What If I Need Help? Helpdesk: Workshops on use of the browser or the API Mailing lists: ‘Geek for a week’ program Animated tutorials
39 of 42 Ensembl Team
- Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
- Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
- Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
- Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Dace Ruklisa, Daniel Zerbino
- Vectorbase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
- Zebrafish Annotation: Kerstin Howe, Tina Eyre, Ian Sealy
- Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Bronwen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Felix Kokocinski, Jan-Hinnerck Vogel, Simon White
- Comparative Genomics: Javier Herrero, Benoit Ballester, Kathryn Beal, Stephen Fitzgerald, Albert Vilella
- Web Team: James Smith, Fiona Cunningham, Anne Parker, Bethan Pritchard, Stephen Rice, Steve Trevanion
- Outreach & QC: Xosé M Fernández, Bert Overduin, Michael Schuster, Giulietta Spudich
- Distributed Annotation System (DAS): Eugene Kulesha, Andy Jenkinson
- BioMart: Arek Kasprzyk, Syed Haider, Richard Holland, Damian Smedley
- Database Schema and Core API: Glenn Proctor, Andreas Kähäri, Ian Longden, Patrick Meidl
40 of 42 Ensembl Team on the river Cam, 2006
41 of 42 Ewan Birney
42 of 42 Q & A: Questions and Answers