Lecture 7.11 The Ensembl Database Erin Pleasance Steven Jones Canada’s Michael Smith Genome Sciences Centre, Vancouver.

Presentation transcript:


What is Ensembl?
Public annotation of mammalian and other genomes
Open source software
Relational database system
The future of genomic bioinformatics?

The Ensembl Project
“Ensembl is a joint project between EMBL European Bioinformatics Institute and the Sanger Institute to develop a software system which produces and maintains automatic annotation on eukaryotic genomes. Ensembl is primarily funded by the Wellcome Trust.”

The Ensembl Project
“The main aim of this campaign is to encourage scientists across the world - in academia, pharmaceutical companies, and the biotechnology and computer industries - to use this free information.” - Dr. Mike Dexter, Director of the Wellcome Trust

Goal: An Accessible, Annotated Genome
[Figure: diagram of ContigView as “what we want in the end”]

Ensembl Software System
Uses BioPerl extensively
The free MySQL database
Entire Ensembl code base is freely available under an Apache open source license
Mainly written in Perl, with extensions in C
Some viewers have been written in Java (e.g. Apollo)

Ensembl Genome Annotation
Utilizes raw DNA sequence data from public sources
Creates a tracking database (the “Ensembl database”)
Joins the sequences, based on a sequence scaffold or “Golden Path”
Automatically finds genes and other features of the sequence
Associates sequence and features with data from other sources
Provides a publicly accessible web-based interface to the database

The Genome Problem
The problem with the genome (particularly human) is that it is “large, complicated, and opaque to analysis” (Ewan Birney, Ensembl)
Genome features to identify include:
–Genes: protein coding, RNA, pseudogenes
–Regulatory elements
–SNPs, repeats, etc.

DNA sequence in Ensembl
Sequences are determined in fragments (contigs)
Features cross boundaries between fragments
Entire sequence is too large and changes too much (constantly updated and reassembled) to be stored as one long database entry

DNA sequence in Ensembl
Core design feature is the “virtual contig” (VC) object
Allows genome sequence to be accessed as a single large contiguous sequence even though it is stored as a collection of fragments
VC object handles reading and writing features to the DNA sequence
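
The virtual contig idea can be illustrated with a minimal Python sketch. This is not the Ensembl implementation (which is written in Perl); the class names `ContigPlacement` and `VirtualContig`, the 1-based coordinates, and the use of 'N' for unsequenced gaps are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ContigPlacement:
    """One stored fragment and where it sits on the assembled chromosome."""
    name: str
    chrom_start: int  # 1-based position of the fragment's first base (assumed convention)
    sequence: str


class VirtualContig:
    """Presents a set of placed fragments as one contiguous sequence."""

    def __init__(self, placements):
        self.placements = sorted(placements, key=lambda p: p.chrom_start)

    def subseq(self, start, end):
        """Return bases start..end (1-based, inclusive); gaps are filled with 'N'."""
        out = []
        pos = start
        for p in self.placements:
            p_end = p.chrom_start + len(p.sequence) - 1
            if p_end < pos or p.chrom_start > end:
                continue  # fragment lies entirely outside what is still needed
            if p.chrom_start > pos:
                out.append("N" * (p.chrom_start - pos))  # gap between fragments
                pos = p.chrom_start
            stop = min(end, p_end)
            out.append(p.sequence[pos - p.chrom_start: stop - p.chrom_start + 1])
            pos = stop + 1
            if pos > end:
                break
        if pos <= end:
            out.append("N" * (end - pos + 1))  # region past the last fragment
        return "".join(out)

    def to_contig(self, pos):
        """Map a chromosome position to (fragment name, 1-based offset), or None in a gap."""
        for p in self.placements:
            if p.chrom_start <= pos < p.chrom_start + len(p.sequence):
                return p.name, pos - p.chrom_start + 1
        return None


vc = VirtualContig([
    ContigPlacement("contig_2", 7, "GGCC"),
    ContigPlacement("contig_1", 1, "ACGT"),
])
region = vc.subseq(3, 8)   # spans the end of contig_1, a gap, and the start of contig_2
where = vc.to_contig(8)    # a feature at chromosome position 8 lives on contig_2
```

A caller asks for chromosome coordinates only; which fragment holds each base is resolved inside the object, which is the point of the design.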

Ensembl Gene Build System
Three-part gene build system
–“Best in genome” matches for known genes
–Alignment of homologous genes
–Ab initio gene finding
Genes predicted on repeat-masked DNA
All genes predicted based on experimental (available sequence) evidence

“Best in genome” predictions
Find known proteins from SPTREMBL on genome using pmatch
Incorporate cDNAs using exonerate and EST_genome
–Align with gaps placed preferentially at splice consensus sites
–Allows prediction of 5’ and 3’ UTRs
Refine predictions using genewise

“Best in genome” predictions
[Figure: ContigView of a “best in genome” gene with associated evidence. Labels: known gene (p53), proteins aligned, cDNAs aligned, UTRs predicted, UniGene clusters aligned]
Alignments shown in ContigView

Homology predictions
Align homologous proteins using BLAST, genewise
–Paralogs (from same organism)
–Orthologs (from closely related organisms)
Assemble novel genes

Ab initio gene predictions
Use Genscan to identify novel exons
Confirm exons by BLAST to known proteins, mRNAs, UniGene clusters
Based on ab initio predictions but require homology evidence
[Figure: ContigView of homology gene with associated evidence. Labels: novel gene, Genscan predictions, proteins aligned, UniGene clusters aligned]

Pseudogenes
Many pseudogenes also predicted

Ensembl Gene Build System
Resulting “Ensembl genes” are highly accurate with low false positive rates
Ensembl human gene identifiers are 95% stable between builds
[Figure: snapshot or stats on genes]

Ensembl EST genes
ESTs are not accurate enough to produce Ensembl genes, but are important, especially for identifying alternative transcripts
Create an independent set of “EST genes”
[Figure labels: known gene, UniGene clusters aligned, EST genes]

Ensembl EST genes
Map ESTs to genome using Exonerate, BLAST, and EST2Genome
Define transcripts by merging redundant ends, setting splice sites to common ends
–Finds splice sites and defines UTRs
–Alternative transcript predicted if at least one alternatively spliced EST exists
Process transcripts with Genomewise to find longest ORF for each
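
The last step, finding the longest ORF in a transcript, can be sketched in a few lines of Python. This is a plain three-frame scan written for illustration and is not how Genomewise works internally; the function name `longest_orf` and the rule that an ORF must run from ATG to an in-frame stop codon are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based and half-open, so transcript[start:end] is the ORF
    including its stop codon. Returns None if no complete ORF is found.
    """
    seq = transcript.upper()
    best = None
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i  # first start codon opens the ORF in this frame
            elif orf_start is not None and codon in STOP_CODONS:
                candidate = (orf_start, i + 3)
                if best is None or candidate[1] - candidate[0] > best[1] - best[0]:
                    best = candidate
                orf_start = None  # look for a further ORF downstream
    return best


orf = longest_orf("CCATGAAATGA")  # toy transcript, invented for the example
```

Sequence on either side of the reported ORF would correspond to the 5’ and 3’ UTRs of the transcript.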

Ensembl EST genes
Evidence for genes shown (ExonView)

Manual gene annotation: Otter
Manual annotation is done with applications such as Apollo
Otter database/server allows manual annotations to be integrated with automated annotations

Manually curated genes: VEGA
Chromosomes 6, 7, 13, 14, 20 and 22 contain manually curated genes from the VEGA database

Gene information in Ensembl: GeneView

Transcript information in Ensembl: TransView

Protein information in Ensembl: ProteinView

Comparative genomics in Ensembl
Gene orthologue pairs: Human, Mouse, Rat, Fugu, Zebrafish, C. elegans, C. briggsae, Fly, Mosquito
DNA homology: Human, Mouse, Rat

Comparative genomics in Ensembl: Gene orthologs
Gene ortholog pairs shown in GeneView
Calculated by BLAST (reciprocal best BLAST hits, or BLAST + synteny)
dN/dS = nonsynonymous/synonymous change (measure of selection)
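
The reciprocal best hit rule can be sketched as follows. The input format (a dictionary of BLAST scores per query) and all gene names are hypothetical; the sketch covers only the reciprocal-best-hit case and leaves out the BLAST + synteny route.

```python
def best_hits(scores):
    """Map each query to its top-scoring subject.

    scores: {query_gene: {subject_gene: blast_score}}
    """
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores: human genes searched against mouse, and the reverse search.
human_vs_mouse = {
    "hA": {"mA": 900, "mB": 300},
    "hB": {"mB": 750, "mA": 200},
    "hC": {"mB": 400},
}
mouse_vs_human = {
    "mA": {"hA": 880},
    "mB": {"hB": 760, "hC": 410},
}
pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

Here hC's best hit is mB, but mB's best hit is hB, so hC is left without an ortholog call under this rule.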

Comparative genomics in Ensembl: DNA homology
DNA homology shown in ContigView
Mouse and rat homology

Comparative genomics in Ensembl: Synteny
Large-scale homology shown in SyntenyView
–Synteny = homologous sequence blocks, in same order and orientation
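
One way to make the "same order and orientation" definition concrete is to group homologous anchors into collinear runs. The sketch below is a deliberately strict toy version, not the method behind SyntenyView: anchors are assumed to be given as rank positions in each genome, and a block is broken by any change of strand or any skipped rank.

```python
def synteny_blocks(anchors):
    """Group homologous anchors into collinear blocks.

    anchors: list of (order_in_a, order_in_b, strand), where the first two
    values are rank positions along one chromosome of each genome and strand
    is +1 (same orientation) or -1 (inverted). Consecutive anchors join a
    block when they are adjacent in genome A, share a strand, and the rank in
    genome B advances by one in the direction the strand implies.
    """
    blocks = []
    for anchor in sorted(anchors):
        if blocks:
            prev = blocks[-1][-1]
            same_strand = anchor[2] == prev[2]
            step_a = anchor[0] - prev[0]
            step_b = anchor[1] - prev[1]
            if same_strand and step_a == 1 and step_b == anchor[2]:
                blocks[-1].append(anchor)
                continue
        blocks.append([anchor])
    return blocks


# Invented anchors: a forward run, an inverted run, then an isolated anchor.
anchors = [(1, 10, 1), (2, 11, 1), (3, 12, 1), (4, 20, -1), (5, 19, -1), (6, 30, 1)]
blocks = synteny_blocks(anchors)
```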

Other features in Ensembl
Menus provide other feature options
Features such as SNPs and markers have special views

Other data sources in Ensembl
Ensembl incorporates gene and feature info from many other data sources
[Figure labels: OMIM, SwissProt]

Other data sources in Ensembl: Link out

The Distributed Annotation System
Allows viewing third-party annotation of the genomic scaffold
Users can choose the annotation they are interested in
Features are viewed in consistent user interface/display
Allows specialized feature annotation and the comparison of different methodologies
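
The core idea, several independent annotation sources layered on one shared coordinate system, can be sketched without any network code. The following Python is only a conceptual illustration and does not implement the DAS protocol; the source names, feature tuples, and the function `features_in_region` are all invented.

```python
def features_in_region(sources, selected, start, end):
    """Collect features from the selected sources that overlap start..end.

    sources:  {source_name: [(feature_start, feature_end, label), ...]}
    selected: names of the sources the user has switched on
    Returns (feature_start, feature_end, source_name, label) tuples sorted by
    position, so features from different providers can be drawn on one display.
    """
    hits = []
    for name in selected:
        for f_start, f_end, label in sources.get(name, []):
            if f_start <= end and f_end >= start:  # inclusive overlap test
                hits.append((f_start, f_end, name, label))
    return sorted(hits)


# Hypothetical annotation providers on the same reference coordinates.
sources = {
    "ensembl_genes": [(100, 500, "gene X")],
    "lab_snps": [(150, 150, "snp 1"), (900, 900, "snp 2")],
    "repeats": [(10, 50, "repeat")],
}
view = features_in_region(sources, ["ensembl_genes", "lab_snps"], 120, 600)
```

Because every source reports positions on the same scaffold, annotations produced by different methods can be compared directly in one view.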

DAS: Selecting data

GeneDAS
GeneDAS allows exchange of annotations at the gene level
–e.g. access to SwissProt annotations from GeneView

DAS: Add your own annotations
Anyone can add data and upload it to a DAS server for others to view

Sequence similarity searching
Two search methods
–SSAHA: very fast, good for identifying near-exact DNA-DNA matches
–BLAST: slower but more accurate, can do DNA or protein searches
Can search against any species
Can search against genomic sequence, cDNAs (Ensembl or Genscan), or protein sequences
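
Why a hash-based search is fast for near-exact matches can be seen from a small k-mer index. The sketch below is a simplified illustration in the spirit of SSAHA, not the SSAHA program itself: it indexes every overlapping k-mer of the reference and lets query k-mers vote for an alignment offset. The function names and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Hash every overlapping k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def best_offset(index, query, k):
    """Return (reference_offset, supporting_kmers) for the most-supported placement.

    Each query k-mer found in the index votes for the reference position where
    the query would have to start. Returns None if no k-mer matches.
    """
    votes = Counter()
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            votes[i - j] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]


reference = "TTGACCATGCAAGTCCGATA"   # toy genome fragment
index = build_kmer_index(reference, k=4)
hit = best_offset(index, "ATGCAAGT", k=4)
```

Lookup cost depends on the query length rather than the genome length, which is what makes the approach quick. A query with many mismatches shares few exact k-mers with the reference, which is one reason a more sensitive tool such as BLAST is still needed.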


Show alignment [A], sequence [S], or ContigView [C]
Hits relative to genome

BLAST results

Data Mining with EnsMart
EnsMart organizes data from Ensembl into a query-optimized database
Allows very fast, cross-data source querying
Accessible from:
–Ensembl website (MartView)
–Stand-alone application (MartExplorer)
–Command-line interface (MartShell)
Extremely powerful for data mining

Data mining with EnsMart
Example queries:
–Mouse homologues for human disease genes
–Coding SNPs for all novel kinases
–Genes on chromosome 1 expressed in liver
–Ensembl genes mapped to RefSeq identifiers
–Upstream sequence for all Ensembl genes mapped to U95A chip
–Disease-related genes between markers (e.g. D10S255 and D10S259)
–Transmembrane proteins with an Ig-MHC domain (IPR003006) on chromosome 2
–Genes with associated coding SNPs on chromosomal band 5q35.3

Choose focus: gene set or SNPs
Choose organism (any species in Ensembl)

Filter genes based on info about:
–Region
–Genes
–Diseases
–Expression patterns
–Multi-species comparisons
–Protein domains and families
–SNPs

Choose output type:
–Features (genes with associated info)
–SNPs
–Structures (of genes, e.g. exons)
–Sequences
Choose what information to output
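
The three steps (choose a focus, apply filters, choose output) follow a filter-then-project pattern that can be sketched in Python. This is not the EnsMart software or its interface; the records, gene identifiers, field names, and the function `mart_query` are all invented. It models the example query "genes on chromosome 1 expressed in liver".

```python
# Hypothetical gene records; identifiers and values are made up for illustration.
GENES = [
    {"gene_id": "GENE_A", "chromosome": "1", "expressed_in": {"liver", "kidney"}, "refseq": "REFSEQ_1"},
    {"gene_id": "GENE_B", "chromosome": "1", "expressed_in": {"brain"}, "refseq": None},
    {"gene_id": "GENE_C", "chromosome": "2", "expressed_in": {"liver"}, "refseq": "REFSEQ_3"},
]


def mart_query(records, filters, attributes):
    """Keep records that pass every filter, then output only the chosen attributes."""
    return [
        tuple(record[name] for name in attributes)
        for record in records
        if all(passes(record) for passes in filters)
    ]


def on_chromosome_1(gene):
    return gene["chromosome"] == "1"


def expressed_in_liver(gene):
    return "liver" in gene["expressed_in"]


# Focus: genes. Filters: region + expression. Output: identifier and RefSeq mapping.
rows = mart_query(GENES, [on_chromosome_1, expressed_in_liver], ["gene_id", "refseq"])
```

Keeping filters and output attributes independent is what lets one interface answer very different questions: any combination of filters can be paired with any output type.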

Multiple programming interfaces now exist for Ensembl

Another example of how to utilize the Ensembl database: Sockeye

Apollo: Java viewer

Ensembl updates
Monthly
Include:
–Changes in genome builds (with new annotations)
–Changes in code or database schema
–Additional views and tools on website

Pre-Ensembl
Full annotation can take weeks
Pre-Ensembl site provides in-progress annotation
–Placement of known proteins
–Ab initio gene predictions
–Repeat masking
–BLAST and SSAHA searching

Ensembl Software System
Software can be accessed by FTP
Can also be accessed through CVS (Concurrent Versions System)
Possible to set up a mirror of the entire Ensembl system

Further Information
The Ensembl Project:
VEGA: vega.sanger.ac.uk
EnsMart:
Distributed Annotation System:
Human Genome Central Resources:
References:
–Ensembl: Hubbard et al, NAR 30 (1), Clamp et al, NAR 31 (1), Birney et al, NAR 32, D468-D470.
–EnsMart: Birney et al, Genome Res. 14,