An Introduction to ENSEMBL Cédric Notredame

The Top 5 Surprises in the Human Genome Map
1. The blue gene exists in 3 genotypes: Straight Leg, Loose Fit and Button-Fly.
2. Tiny villages of Hobbits actually live in our DNA and produce minute quantities of wool -- which we've been ignorantly referring to as "navel lint" and throwing away for centuries.
3. It's nearly impossible to re-fold it along the original creases.
4. Beer-drinking gene conveniently located next to bathroom-locating gene.
And the Number 1 Surprise in the Human Genome Map...
5. Now that there's a map, male scientists will attempt to cure diseases by randomly throwing stuff into beakers, stubbornly refusing to use the map or ask for directions -- all the while insisting the cure is right around the next corner.

ENSEMBL: Our Scope
- What is ENSEMBL?
- Searching Genes in ENSEMBL
- Viewing Genes in ENSEMBL
- Doing Research With ENSEMBL
- Where do ENSEMBL Genes Come From?

Accessing Genomes
Genome sequences are becoming available very rapidly
– Large and difficult to handle computationally
– Everyone expects to be able to access them immediately
Bench Biologists
– Has my gene been sequenced?
– What are the genes in this region?
– Where are all the GPCRs?
– Connect the genome to other resources
Research Bioinformatics
– Give me a dataset of human genomic DNA
– Give me a protein dataset

What is It?
Set of high quality gene predictions
– From known human mRNAs aligned against genome
– From similar protein and mRNAs aligned against genome
– From Genscan predictions confirmed via BLAST of protein, cDNA and EST databases
Initial functional annotation from Interpro
Integration with external resources (SNPs, SAGE, OMIM)
Comparative analysis
– DNA sequence alignment
– Protein orthologs

Mr ENSEMBL ? Richard Durbin (ACEDB) Ewan Birney (EBI)

Challenges?
Scale and data flow
– Mainly engineering problems
Presentation, ease of use
– Mainly engineering problems
Algorithmic
– Partly engineering
– Partly research

ENSEMBL Home

HelpDesk / Suggestions
Help!
- Context-sensitive help pages: click
- Access other documentation via the generic home page
- The helpdesk

Finding What You Need

Human homepage

Text search

BLAST/SSAHA

BLAST/SSAHA ????

Changing Angle…

Anchor View Map View

Contig View
- Chromosome
- Overview (1 Mb): genes and markers
- Detailed View (100 kb): genes, ESTs, CpG etc.
- Configuration

Contig View close-up Evidence Transcripts red & black (Ensembl predictions) Customising & short cuts Pop-up menu

Cyto View

Marker View

SNP View

Synteny View

Dotter View

Gene View

Gene-View

Trans View

Exon-View

Protein-View

CDK-like Family-View

CDK-like Family-View

The Right View On My Gene
- Where is my gene? Map View, Cyto View, Contig View
- How many transcripts for my gene? Gene View, Exon View
- What is the function of my gene? Protein View, SNP View, Family View
- How does my gene compare with other species? Synteny View, Dotter View

Getting The Stuff Back Home

Export-View

Data Mining with EnsMart
The aim of EnsMart is to integrate Ensembl data into a single, multi-species, query-optimised database
– Requirement for cross-database joins removed
– Query-optimised schema improves speed of data retrieval
Examples
– Coding SNPs for all novel GPCRs
– The sequence in the 5kb upstream region of known proteases between D1S2806 and D1S2907
– Mouse homologues of human disease genes containing transmembrane domain located between 1p23 and 1q23
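The "no cross-database joins" idea can be illustrated with a toy, in-memory example: every gene is one pre-joined row, so a question such as "coding SNPs for all novel GPCRs in a region" becomes a single pass of filters. This is only a sketch. `GeneRow`, its fields, `select_genes` and the sample rows are hypothetical and are not the EnsMart schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRow:
    """One pre-joined row per gene (hypothetical fields, not the EnsMart schema)."""
    gene_id: str
    chromosome: str
    start: int
    end: int
    family: str
    known: bool
    coding_snps: tuple = ()


def select_genes(rows, chromosome, region_start, region_end, family, novel_only=False):
    """Return rows overlapping the region that belong to the given family."""
    hits = []
    for row in rows:
        in_region = (row.chromosome == chromosome
                     and row.start <= region_end
                     and row.end >= region_start)
        if not in_region or row.family != family:
            continue
        if novel_only and row.known:
            continue
        hits.append(row)
    return hits


# Made-up rows, for illustration only.
ROWS = [
    GeneRow("G1", "1", 1_000, 5_000, "GPCR", known=False, coding_snps=("snp_a",)),
    GeneRow("G2", "1", 8_000, 9_000, "GPCR", known=True, coding_snps=("snp_b",)),
    GeneRow("G3", "2", 1_000, 2_000, "GPCR", known=False),
    GeneRow("G4", "1", 2_000, 3_000, "protease", known=False),
]

novel_gpcrs = select_genes(ROWS, "1", 0, 10_000, "GPCR", novel_only=True)
coding_snps = [snp for row in novel_gpcrs for snp in row.coding_snps]
```

Because each row already carries its family, status and SNPs, the query needs no joins. That is the trade-off a query-optimised schema makes: redundancy in storage in exchange for retrieval speed.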

EnsMart I

EnsMart II

Asking Questions With ENSEMBL

Asking Questions
1. Selecting and downloading genes using functional and evolutionary criteria
2. Comparing two pieces of genome

Asking A Question with ENSMART: What Do You Want?
All the human genes
- Involved in cell death
- Associated with a disease
- With a homologue in mouse and chicken

Which Species?

Select the region Where? What kind of Gene ?

Select the kind of data Choose An Evolutionary Trace What Kind of Function ?

Select the kind of data Control of Genetic Variation Control of Regulatory Region Control of Biochemical Function

Human Gene Cell Death: 1133 genes
Human Gene Cell Death, Mouse: 1106 genes
Human Gene Cell Death, Chicken: 880 genes
Human Gene Cell Death, C. elegans: 338 genes

Asking A Question with ENSMART: How Do You Want it Packed?
I would like
- Chromosome information
- The ID of my sequences
- The corresponding OMIM ID
- The corresponding chicken ID

Asking A Question with ENSMART: How Do You Want it Packed?
Come to think of it…
- I’d like to take a look at the 5’ upstream regions

Asking A Question with ENSMART: What Do You Want?
I want to know if the mouse and the human genomes are conserved around the human gene SNX5

Where Do ENSEMBL Genes Come From? Genebuild

Evaluating genes and transcripts
- Ensembl gene set
- Ensembl EST genes
- Ab initio predictions
- Manual curation (Vega / Sanger)
- Gene models from other groups
- Known v. novel genes
- Gene names & descriptions

The Aim…

Ensembl transcript predictions evidence other groups’ models manual curation Overview…

Automatic Gene Annotation (flow diagram)
- Inputs: human proteins, other proteins, cDNAs, ESTs, other evidence
- Tools: Pmatch, Exonerate, Genewise, Est2Genome, Genscan exons
- Steps: Add UTRs, Merge
- Outputs: Ensembl genes, EST genes

ENSEMBL Geneset
- Place all available species-specific proteins to make transcripts
- Place similar proteins to make transcripts
- Use mRNA data to add UTRs
- Build transcripts using cDNA evidence
- Build additional transcripts using Genscan + homology evidence
- Combine annotations to make genes with alternative transcripts

Getting Genes from Known Proteins
- Human protein sequences: SwissProt/TrEMBL/RefSeq
- pmatch* v. assembly
- blast and Miniseq
- Genewise
*R. Durbin, unpublished

Adding the UTRs
- Proteins (Genewise): phases, no UTRs
- cDNAs (Est2Genome): UTRs, no phases
- Combined: translatable gene with UTRs
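A minimal sketch of how a protein-based model (coding exons, no UTRs) and a cDNA alignment (exons with UTRs, no phases) could be combined. It is not the Ensembl implementation. It assumes forward-strand coordinates and that the cDNA has already been matched to this gene. `add_utrs` and its inputs are hypothetical names.

```python
def add_utrs(cds_exons, cdna_exons):
    """Extend a coding-only model with UTR sequence taken from a cDNA alignment.

    Both arguments are lists of (start, end) exons on the forward strand.
    Returns (exons_with_utrs, (coding_start, coding_end)).
    """
    cds = sorted(cds_exons)
    cdna = sorted(cdna_exons)
    coding_start, coding_end = cds[0][0], cds[-1][1]

    new_first_start = coding_start
    new_last_end = coding_end
    for start, end in cdna:
        # A cDNA exon overlapping the first coding exon may extend it 5'.
        if start <= cds[0][1] and cds[0][0] <= end:
            new_first_start = min(new_first_start, start)
        # A cDNA exon overlapping the last coding exon may extend it 3'.
        if start <= cds[-1][1] and cds[-1][0] <= end:
            new_last_end = max(new_last_end, end)

    exons = list(cds)
    exons[0] = (new_first_start, exons[0][1])
    exons[-1] = (exons[-1][0], new_last_end)

    # cDNA exons lying entirely outside the extended model become UTR-only exons.
    upstream = [e for e in cdna if e[1] < new_first_start]
    downstream = [e for e in cdna if e[0] > new_last_end]
    return upstream + exons + downstream, (coding_start, coding_end)


model, coding_span = add_utrs(
    cds_exons=[(200, 300), (400, 500)],
    cdna_exons=[(50, 100), (180, 300), (400, 560)],
)
```

The coding span is returned unchanged, so the translation (and its phases) still comes from the protein alignment, while the exon boundaries gain the untranslated sequence seen in the cDNA.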

Gene Build is Protein-Based
DNA-DNA alignments don’t give translatable genes
Protein-level alignments give frameshifts and splice sites
Genewise (Ewan Birney)
– Protein – genomic alignment
– Has splice site model
– Penalises stop codons
– Allows for frameshifts

Making Genes
Combine results of all Genewises and Genscans:
- Group transcripts which share exons
- Reject non-translating transcripts
- Remove duplicate exons
- Attach supporting evidence
- Write genes to database
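The grouping step can be sketched as a clustering problem: transcripts that share an exon, directly or through a chain of other transcripts, end up in the same gene. The sketch below uses a small union-find and assumes that "sharing an exon" means identical (start, end) coordinates; the real Genebuilder criteria may differ. `cluster_transcripts` and the sample data are hypothetical.

```python
def cluster_transcripts(transcripts):
    """Group transcripts that share at least one exon.

    transcripts: dict mapping transcript name -> list of (start, end) exons.
    Returns a list of sets of transcript names, one set per gene.
    """
    parent = {name: name for name in transcripts}

    def find(name):
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    exon_owner = {}
    for name, exons in transcripts.items():
        for exon in set(exons):  # duplicate exons within a transcript count once
            if exon in exon_owner:
                parent[find(name)] = find(exon_owner[exon])
            else:
                exon_owner[exon] = name

    genes = {}
    for name in transcripts:
        genes.setdefault(find(name), set()).add(name)
    return sorted(genes.values(), key=sorted)


genes = cluster_transcripts({
    "t1": [(100, 200), (300, 400)],
    "t2": [(300, 400), (500, 600)],
    "t3": [(500, 600), (700, 800)],
    "t4": [(1000, 1100)],
})
```

Here `t1` and `t3` share no exon with each other, but both share one with `t2`, so all three fall into one gene with three alternative transcripts, and `t4` forms a gene of its own.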

A Typical Human Release: NCBI 34 (Dec 2003)
- NCBI 34 assembly, released Dec 2003
- Ensembl genes: 21,787 ( in release 35)
- Ensembl coding transcripts: 31,609 (plus 1,744 pseudogenes)
- Ensembl exons: 225,897
- Input human seqs: 48,176 proteins; 86,918 cDNAs
Transcripts made from:
– Human proteins with (without) UTRs: 68% (19%)
– Non-human proteins with (without) UTRs: 2% (9%)
– cDNA alignment only: 0.8%

Manual Vs Automatic Annotation
Genes
- Sensitivity: ~90% of manual genes are in Ensembl
- Specificity: ~75% of Ensembl genes are in the manual sets
Exon bps
- Sensitivity: ~70% of manual bps are in Ensembl exons (90% of coding bps)
- Specificity: ~80% of Ensembl bps are in manual exons
Alternative transcripts per gene manual
Figures are for the gene build on NCBI 33 (human) and manual annotation for chromosomes 6, 14 & 14
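The two measures on this slide can be written down directly, treating the manual annotation as the reference set. Sensitivity is the fraction of manual genes recovered by the automatic build. Specificity, as used here and in gene-prediction work generally, is the fraction of automatic genes confirmed by the manual set (elsewhere often called precision). The function name and the gene identifiers below are made up for illustration.

```python
def sensitivity_specificity(predicted, manual):
    """Return (sensitivity, specificity) of a predicted set against a manual set."""
    predicted, manual = set(predicted), set(manual)
    if not predicted or not manual:
        raise ValueError("both sets must be non-empty")
    shared = predicted & manual
    sensitivity = len(shared) / len(manual)      # manual genes that were predicted
    specificity = len(shared) / len(predicted)   # predictions confirmed manually
    return sensitivity, specificity


# Toy example: 10 manual genes, 12 predictions, 9 in common.
manual_genes = {f"m{i}" for i in range(10)}
predicted_genes = {f"m{i}" for i in range(9)} | {"x1", "x2", "x3"}
sn, sp = sensitivity_specificity(predicted_genes, manual_genes)
```

With these toy sets the result mirrors the order of magnitude quoted on the slide: 9 of 10 manual genes are found, and 9 of 12 predictions are confirmed.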

Each Genebuild is a Story…
Data availability
– Hard evidence in mouse, rat, human
– Similarity build more important for other species
Structural issues
– Zebrafish: many similar genes near each other; genome from different haplotypes
– C. briggsae: very dense genome; short introns
– Mosquito: many single-exon genes; genes within genes
Configuration files provide flexibility

Life in Release 2003
Table columns: Species, Gene number, Exons/gene
- Homo sapiens
- Mus musculus
- Rattus norvegicus
- Danio rerio (zebra fish)
- Caenorhabditis briggsae (nematode)
- Anopheles gambiae (mosquito)

Using ESTs (flow diagram)
- Inputs: human proteins, other proteins, cDNAs, ESTs, other evidence
- Tools: Pmatch, Exonerate, Genewise, Est2Genome, Genscan exons
- Steps: Add UTRs, Merge
- Outputs: Ensembl genes, EST genes

Using ESTs
EST analysis
- Map to genome using Est2Genome (determine strand, splicing)
- Map ESTs using Exonerate (determine coverage, % identity and location in genome)
- Filter on % identity and depth (5.5 million ESTs from dbEST, mapping of about 1/3)

Exonerate
- Exonerate positions cDNA sequences to assembly contigs (golden path contigs, cDNA hits)
- Store hits as Ensembl FeaturePairs in database

EST2Genome (diagram): Blast and Est2Genome; virtual contig; cDNA hits; Filter; Blast & Miniseq; Est_genome

Reconstructing Alternative Splicing
- ESTs
- Merge ESTs according to consecutive exon overlap and set splice ends
- Genomewise
- Alternative transcripts with translation and UTRs
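As a rough illustration of the merging idea only, the sketch below collapses overlapping exon alignments from several ESTs into one non-redundant exon set. It is a deliberate simplification. It does not check that overlaps are consecutive along each EST, does not set splice ends, and is not the ClusterMerge or Genomewise algorithm. `merge_exons` and the coordinates are hypothetical.

```python
def merge_exons(est_alignments):
    """Merge overlapping exons from several EST alignments.

    est_alignments: list of ESTs, each a list of (start, end) exons.
    Returns a sorted list of merged, non-overlapping (start, end) exons.
    """
    intervals = sorted(exon for est in est_alignments for exon in est)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous merged exon: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


merged = merge_exons([
    [(100, 200), (300, 400)],
    [(150, 220), (300, 380), (500, 600)],
])
```

Two partial ESTs covering the same locus are thus combined into a longer structure than either supports alone, which is the motivation for merging before building EST genes.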

Display of EST Evidence
- Human ESTs
- EST transcripts
- Display limited to 7 at any one point; full data accessible in the databases

Ab initio Genscan predictions Genscan prediction Evidence supporting Genscan exons

Manual Curation: VErtebrate Genome Annotation

Sanger / Vega manual curation Manual Curation: VEGA

Other Gene-Models
- Other models as ‘DAS sources’
- Turn on DAS sources
- FASTAView display

Known Vs novel transcripts
- Naming takes place after the gene build is completed
- Transcripts/proteins mapped to SwissProt, RefSeq and SPTrEMBL entries
- If mapped = ‘known’; if not = ‘novel’
- Require high sequence similarity, but allow incomplete coverage
Note:
- Difficult for families of closely-related genes
- Wrongly annotated pseudogenes may also cause problems
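The known/novel rule can be sketched as a simple classifier: a transcript is "known" if at least one mapping to an external entry is similar enough, whatever fraction of the entry it covers. The slide gives no threshold, so the 95% identity cut-off below is purely an assumption, as are the `Mapping` type, `classify` and the placeholder accessions.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """A transcript-to-database mapping (hypothetical structure)."""
    source: str       # e.g. "SwissProt", "RefSeq", "SPTrEMBL"
    accession: str
    identity: float   # percent identity of the aligned part
    coverage: float   # percent of the external entry that is covered


def classify(mappings, min_identity=95.0):
    """Return 'known' if any mapping is similar enough, else 'novel'.

    Coverage is deliberately not tested: incomplete coverage is allowed.
    The default threshold is an assumed value, not one given in the slides.
    """
    for mapping in mappings:
        if mapping.identity >= min_identity:
            return "known"
    return "novel"


partial_hit = [Mapping("SwissProt", "ACC_1", identity=99.0, coverage=40.0)]
weak_hit = [Mapping("RefSeq", "ACC_2", identity=70.0, coverage=100.0)]
```

Ignoring coverage is what makes closely-related gene families difficult: a paralogue can satisfy the identity test over a shared domain and be labelled "known" under the wrong entry.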

Gene Names and Descriptors
Names and descriptions
- Names taken from mapped database entries
- Official HGNC (HUGO) name used if available (or equivalent for other species)
- Otherwise SwissProt > RefSeq > SPTrEMBL
- Novel transcripts have only Ensembl stable ids
- Genes named after ‘best-named’ transcript
- Gene description taken from mapped database entries (source given)
Hints:
- Orthology can provide useful confirmation
- If no description, check for any Family description
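The naming priority above (HGNC, then SwissProt > RefSeq > SPTrEMBL, then the stable id) amounts to a ranked lookup, with the gene taking the name of its best-ranked transcript. The sketch below follows that order as stated on the slide. The function names, the dictionary layout and all identifiers are hypothetical.

```python
SOURCE_PRIORITY = ("HGNC", "SwissProt", "RefSeq", "SPTrEMBL")


def name_transcript(stable_id, xrefs):
    """Return (name, rank) for a transcript; a lower rank means a better source.

    xrefs maps a source name to the name found in that source.
    Novel transcripts (no xrefs) keep their stable id and get the worst rank.
    """
    for rank, source in enumerate(SOURCE_PRIORITY):
        if source in xrefs:
            return xrefs[source], rank
    return stable_id, len(SOURCE_PRIORITY)


def name_gene(gene_stable_id, transcripts):
    """Name a gene after its 'best-named' transcript.

    transcripts maps a transcript stable id to its xrefs dictionary.
    """
    best_name, best_rank = gene_stable_id, len(SOURCE_PRIORITY)
    for stable_id, xrefs in transcripts.items():
        name, rank = name_transcript(stable_id, xrefs)
        if rank < best_rank:
            best_name, best_rank = name, rank
    return best_name


# Made-up identifiers, for illustration only.
gene_name = name_gene("GENE_0001", {
    "TRANS_0001": {"RefSeq": "REFSEQ_NAME"},
    "TRANS_0002": {"HGNC": "HGNC_NAME", "SwissProt": "SWISSPROT_NAME"},
    "TRANS_0003": {},
})
```

A gene whose transcripts are all novel falls through to its own stable id, which matches the slide's statement that novel transcripts have only Ensembl stable ids.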

Stability…

Geneview and Exonview
- Evidence used to build the transcript links to ExonView
- Mapping to external databases
- Links to putative orthologues
- Transcript name
- Gene name & description
- Alternative transcripts

Compressed tracks Expanded tracks Evidence Tracks in ContigView

Future Directions
- Improved pseudogene annotation, for all species
- Upstream regulatory elements: using CpG islands, Eponine predictions, motifs to aid in prediction of transcription start sites
- Improve use of cDNAs: can already use to add alternatively spliced transcripts
- Improve UTR extension
- Make use of comparative data
- Non coding RNAs: currently filtered out of build sets

ENSEMBL
- Finding the right data: ENSMART and BLAST
- The central view of ENSEMBL: ContigView
- Genome comparison: Synteny View
- ENSEMBL incorporates all the evidence into its gene models

Genebuild overview (flow diagram)
- Protein evidence: Human Proteins, Other Proteins, Pmatch, Genewise genes, Genewise genes with UTRs
- cDNA evidence: Human cDNAs, Exonerate, Aligned cDNAs, cDNA genes
- Ab initio evidence: Supported genscans (optional)
- Build: Genebuilder, Preliminary gene set, Gene Combiner, Core Ensembl genes, Pseudogenes, Final set + pseudogenes
- EST evidence: Human ESTs, Est2Genome, Aligned ESTs, ClusterMerge, Ensembl EST genes

Annotation Stages
Place all known genes
– Map all AVAILABLE species-specific proteins in the genome and find gene structure using Genewise
Annotate novel genes
– Use protein from other species to build new transcripts based on homology
– Use AVAILABLE mRNAs to add UTRs to the built transcripts
– Use further homology to proteins, mRNAs and ESTs to build transcripts using Genscan exons
Combine annotations

Manual Vs Automatic Annotation
Numbers are for NCBI33 genebuild
Gene locus level (table: Sn and Sp per chromosome)
- ENSEMBL predictions cover 90% or more of manually annotated gene structures, with around 75% of the predictions covered by a manual annotation
Exon level, based on transcript pairs (table: Sn and Sp per chromosome, for coding exons only and for all exons)
- UTR exon predictions are less accurate than coding exons
- 92% of coding exons and 80% of all exons are exact matches