European Bioinformatics Institute The Ensembl Database Schema.

Slides:



Advertisements
Similar presentations
“BioMart is a query-oriented data management system developed jointly by the Ontario Institute for Cancer Research (OICR) and the.
Advertisements

SRI International Bioinformatics 1 Genome Browser Markus Krummenacker Bioinformatics Research Group SRI, International Q
Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
The Ensembl API European Bioinformatics Institute Hinxton, Cambridge, UK.
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology.
Peter Tsai, Bioinformatics Institute.  University of California, Santa Cruz (UCSC)  A rapid and reliable display of any requested portion of genomes.
Lecture 14 Genome sequencing projects
Structural Biology and Biocomputing Programme 1 Osvaldo Graña, CNIO Distributed Annotation System (DAS) part I Osvaldo Graña VIII.
9 Genomics and Beyond Brief Chapter Outline
Tutorial 7 Genome browser. Free, open source, on-line broswer for genomes Contains ~100 genomes, from nematodes to human. Many tools that can be used.
Genome Browsers Carsten O. Daub Omics Science Center RIKEN, Japan May 2008.
Genome Related Biological Databases. Content DNA Sequence databases Protein databases Gene prediction Accession numbers NCBI website Ensembl website.
Assembly.
Data Mining in Ensembl with EnsMart. 2 of 24 All genes from a candidate region Genes with a particular protein domain Members of a protein family Genes.
CSE182-L12 Gene Finding.
How to access genomic information using Ensembl August 2005.
Human Genome Project. Basic Strategy How to determine the sequence of the roughly 3 billion base pairs of the human genome. Started in Various side.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProt Jennifer McDowall, Ph.D. Senior InterPro Curator Protein Sequence Database:
Compartmentalized Shotgun Assembly ? ? ? CSA Two stated motivations? ?
Comprehensive Microbial Resource Bioinformatics Visualization Workshop Owen White May 30, 2002.
Data retrieval BioMart Data sets on ftp site MySQL queries of databases Perl API access to databases Export View.
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Doug Brutlag Professor Emeritus Biochemistry & Medicine (by courtesy) Genome Databases Computational Molecular Biology Biochem 218 – BioMedical Informatics.
Mouse Genome Sequencing
Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources.
European Bioinformatics Institute The Ensembl Database Schema.
EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
GENOME-CENTRIC DATABASES Daniel Svozil. NCBI Gene Search for DUT gene in human.
UCSC Genome Browser 1. The Progress 2 Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools.
The Ensembl API European Bioinformatics Institute Hinxton, Cambridge, UK.
SIZE SELECT SHEAR Shotgun DNA Sequencing (Technology) DNA target sample LIGATE & CLONE Vector End Reads (Mates) SEQUENCE Primer.
EnsEMBL Opening up the whole Genome Philip Lijnzaad
Browsing the Genome Using Genome Browsers to Visualize and Mine Data.
Data Mining in Ensembl with BioMart Nov,
Alastair Kerr, Ph.D. WTCCB Bioinformatics Core An introduction to DNA and Protein Sequence Databases.
Linkage and Mapping. Figure 4-8 For linked genes, recombinant frequencies are less than 50 percent.
Biological databases Exercises. Discovery of distinct sequence databases using ensembl.
Building WormBase database(s). SAB 2008 Wellcome Trust Sanger Insitute Cold Spring Harbor Laboratory California Institute of Technology ● RNAi ● Microarray.
Curation Tools Gary Williams Sanger Institute. SAB 2008 Gene curation – prediction software Gene prediction software is good, but not perfect. Out of.
NCBI Genome Workbench Chuong Huynh NIH/NLM/NCBI Sao Paulo, Brasil July 15, 2004 Slides from Michael Dicuccio’s Genome Workbench.
Human Genome.
SRI International Bioinformatics 1 Genome Browser Tomer Altman Bioinformatics Research Group SRI, International August 19th, 2009.
VectorBase Vectorbase probe mapping. VectorBase Automatic Annotation browser Array data CHADO Manual Annotation XML vectorbase Automatic Annotation.
Genome annotation and search for homologs. Genome of the week Discuss the diversity and features of selected microbial genomes. Link to the paper describing.
Do not reproduce without permission 1 Gerstein.info/talks (c) (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu Gerstein Lab Aims in ModENCODE.
EBI is an Outstation of the European Molecular Biology Laboratory. Gautier Koscielny VectorBase Meeting 08 Feburary 2012, EBI VectorBase Text Search Engine.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
A guided tour of Ensembl This quick tour will give you an outline view of what Ensembl is all about. You will learn: –Why we need Ensembl –What is in the.
Tools in Bioinformatics Genome Browsers. Retrieving genomic information Previous lesson(s): annotation-based perspective of search/data Today: genomic-based.
1 of 28 Evaluating Genes and Transcripts (“Genebuild”)
What is BLAST? Basic BLAST search What is BLAST?
GENBANK FILE FORMAT LOCUS –LOCUS NAME Is usually the first letter of the genus and species name, followed by the accession number –SEQUENCE LENGTH Number.
Welcome to the combined BLAST and Genome Browser Tutorial.
The Bovine Genome Database Abstract The Bovine Genome Database (BGD, facilitates the integration of bovine genomic data. BGD is.
Genome Analysis. This involves finding out the: order of the bases in the DNA location of genes parts of the DNA that controls the activity of the genes.
Data Loading into Ensembl Database TGAC Browser
What is BLAST? Basic BLAST search What is BLAST?
Introduction to Genes and Genomes with Ensembl
Human Genome Project.
The Ensembl Database Steven Jones August 18, 2004
Data Mining with BioMart
Basics of BLAST Basic BLAST Search - What is BLAST?
Experimental Verification Department of Genetic Medicine
Pre-genomic era: finding your own clones
UniProt: Universal Protein Resource
GEP Annotation Workflow
Access to Sequence Data and Related Information
Ensembl Genome Repository.
Vector NTI Introduction
Presentation transcript:

European Bioinformatics Institute The Ensembl Database Schema

Requirements for the schema Store data for human genome … and all the other genomes we have … and all the genomes we might get Flexible to add more data Easy to adapt to new genome Responds fast enough for web site display and pipelined genebuild

System Context Java API (EnsJ) www Pipeline Apollo Mart DB MartShell MartView Other Scripts & Applications Ensembl DBs Perl API

Sequence regions Everything which represents a length of nucleotide sequence is a sequence region. –chromosome, BAC-clone, supercontig, scaffold, contig … Sequence regions of the same type belong to the same coordinate system. –“1”, “2”, and “3” are sequence regions with coordinate system “chromosome” Sequence regions have names and lengths.

Sequence regions intcoord_system_id intlength varcharname intseq_region_id seq_region mediumtextsequence intseq_region_id dna “default_version”, “sequence_level” attrib varcharversion intrank varcharname intcoord_system_id coord_system …n

Example sequence region Chromosome, 1, 200MB Clone, AL , 132KB NT_contig, NT_ , 17MB Contig, AC ,

Coordinate system The coord_system describes the type of the sequence region –Name (“chromosome”, “contig”,… ) –Version (eg. NCBI35, ZFISH3) –Internal id (coord_system_id) –Attrib – ( default, sequence_level ) –rank (1..n) If you have 2 coordinate systems with the same name, choose a “default” one. They need to have different versions (NCBI34, NCBI35). The lower the rank, the bigger the sequence region. Choose 1 for your biggest regions (chromosomes). Only one coordinate system is allowed to contain sequence regions with actual sequence attached. Flag it with Attrib = sequence_level.

Coordinate system “contig” –Contiguous sequence. –“N”s should be rare and of short length. –Can serve as your basic sequence holder “clone” –Should have a real BAC or PAC or maybe YAC behind it. –Might not be contiguous “supercontig” –Assembled from smaller contiguous sequences. –May have small gaps (eg between read pairs) “chromosome” –Use it only for real chromosomes. –or for alternative sequences of reference chromosomes. “chunk” –Artificial coordinate system to hold sequence regions for technical reasons. –Create, when none of the other coordinate systems can hold your sequence (eg. You only have full length chromosomes as coordinate system but they are to long to store) –or when you have 2 real sequence containing coordinate systems.

Assemblies An assembly defines how sequence regions in one coordinate system are made up of sequence regions from another coordinate system. For example human chromosomes are assembled from a “tiling path” of BAC clones. Assembly information stored in Ensembl makes it possible to obtain features or sequence from arbitrary sequence regions.

Assemblies Chromosome 17 BAC clones Gap A row in the assembly table references an assembled and component sequence region. How a piece of the assembled sequence region is made from a piece of a component region is defined by a pair of coordinates and an orientation. Gaps are represented by the absence of assembly information.

The assembly table intori intcmp_end intcmp_start intasm_end intasm_start intcmp_seq_region_id intasm_seq_region_id assembly intcoord_system_id intlength varcharname intseq_region_id seq_region mediumtextsequence intseq_region_id dna “default_version”, “sequence_level” attrib varcharversion intrank varcharname intcoord_system_id coord_system n n …n

Sequence region attributes Arbitrary attributes may be associated with a sequence region via the seq_region_attrib table. –sanger ids for certain clones. –htg phases for clones.

The seq_region_attrib table intori intcmp_end intcmp_start intasm_end intasm_start intcmp_seq_region_id intasm_seq_region_id assembly intcoord_system_id intlength varcharname intseq_region_id seq_region varcharvalue intattrib_type_id intseq_region_id seq_region_attrib varcharname textdescription varcharcode intattrib_type_id attrib_type mediumtextsequence intseq_region_id dna mediumblobsequence textn_line intseq_region_id dnac “default_version”, “sequence_level” attrib varcharversion intrank varcharname intcoord_system_id coord_system n n …n 1 0..n 1

Features Features are annotation information placed on the genome. A feature is stored as a position on a sequence region.

A standard feature varchargff_source varchargff_feature varcharprogram_file varcharparameters varcharmodule varcharmodule_version varcharlogic_name varchardb varchardb_version varchardb_file varcharprogram varcharprogram_version datetimecreated intanalysis_id analysis tinyintseq_region_strand intsimple_feature_id intseq_region_id intseq_region_start intseq_region_end varchardisplay_label doublescore intanalysis_id simple_feature int usually has a any_feature_id contains a sequence position with or without strand on a sequence region varchar usually contains a string to display contains any number of other attributes.. int usually links to the analysis responsible for calculating it any feature 1 0..n intanalysis_id textdescription varchardisplay_label analysis_description

other Features doublescore doubleevalue floatperc_ident inthit_start inthit_end varcharhit_name intanalysis_id intprotein_align_feature_id textcigar_line Sequence position protein_align_feature tinyinthit_strand doublescore doubleevalue floatperc_ident inthit_start inthit_end varcharhit_name intanalysis_id intdna_align_feature_id textcigar_line Sequence position dna_align_feature doublescore intrepeat_start intrepeat_end intrepeat_consensus_id intanalysis_id intrepeat_feature_id Sequence position repeat_feature intprediction_exon_id intprediction_transcript_id tinyintstart_phase doublescore doublep_value intexon_rank Sequence position prediction_exon intprediction_transcript_id intanalysis_id Sequence position prediction_transcript intrepeat_consensus_id varcharrepeat_name varcharrepeat_class textconsensus repeat_consensus 1..n 1 1

Genes are features sequence == seq_region exons alternative spliced transcripts.. with different translations ACGTTTCA have stable ids ENSE0001ENSE0002 ENSE0003ENSE0004 ENST0001 ENST0002 ENSP0003 ENSP0004 stable ids

tinyintphase tinyintend_phase intseq_region_id intseq_region_start intseq_region_end tinyintseq_region_strand intexon_id exon Textdescription varcharsource varcharbiotype enum( NOVEL, KNOWN, PREDICTED,.. ) status intanalysis_id intdisplay_xref_id intseq_region_id intseq_region_start intseq_region_end tinyintseq_region_strand intgene_id gene enum(PUTATIVE, NOVEL, KNOWN. status varcharbiotype textdescription intgene_id intdisplay_xref_id intseq_region_id intseq_region_start intseq_region_end tinyintseq_region_strand inttranscript_id transcript inttranscript_id intseq_start intstart_exon_id intseq_end intend_exon_id inttranslation_id translation 1 0..n n 1 datemodified_date datecreated_date varcharstable_id intversion intexon_id exon_stable_id datemodified_date datecreated_date varcharstable_id intversion intgene_id gene_stable_id datemodified_date datecreated_date varcharstable_id intversion inttranscript_id transcript_stable_id datemodified_date datecreated_date varcharstable_id intversion inttranslation_id translation_stable_id inttranscript_id intrank intexon_id exon_transcript 1 1..n 1

External references Ensembl objects can reference objects in other databases. –eg a SWISSPROT identifier, GO identifier, Refseq, HUGO, … External references are used for display ids in Genes and Transcripts. These links are provided directly in Gene and Transcript.

External references intensembl_id “Translation”, “Gene”, “Transcript” ensembl_object_type intxref_id intobject_xref_id object_xref doublescore doubleevalue intanalysis_id inttarget_identity inthit_start inthit_end inttranslation_start inttranslation_end textcigar_line intquery_identity intobject_xref_id identity_xref varchardescription varchardbprimary_acc intxref_id varcharversion varchardisplay_label intexternal_db_id xref varcharrelease “KNOWN”, “PRED”, “ORTH”,… status varchardbname intexternal_db_id external_db “IC”,”IDA”,”IEA”,”IEP”,”IGI”,”IMP,”IPI”,… linkage_type intobject_xref_id go_xref varcharsynonym intxref_id external_synonym 1 1..n 1 0..n n

Misc features intmisc_set_id intmisc_feature_id misc_feature_misc_set varcharcode textdescription intmax_length varcharname intmisc_set_id misc_set intmisc_feature_id intseq_region_end tinyintseq_region_strand intseq_region_start intseq_region_id misc_feature varcharvalue intattrib_type_id intmisc_feature_id misc_attrib varcharname textdescription varcharcode intattrib_type_id attrib_type are features with user definable attributes it can belong to a set to provide a trackname for the feature. a misc feature can be in more than one set n 1 1 1

Archive tables Record of old predictions You can get –history of stable IDs –how genes and transcripts were merged and split from older prediction to newer prediction. –old peptide sequences –how an older gene prediction was made up from transcripts/translations.

Archive tables A mapping session describes the event when a set of ids is mapped from an older database to a newer database –Version number are a relatively new addition, so you need the mapping session to uniquely specify a Gene. A stable id event for a gene states that some part of the old gene is to be found in the new gene. –Same for Transcript. The gene archive records the gene structure of the older gene, when the gene has changed during a mapping session. The peptide archive records the peptide sequence of the old version of the peptide, when the peptide changes.

Archive tables varcharmd5_checksum mediumtextpeptide_seq intpeptide_archive_id peptide_archive 1 1..n 1 varcharold_release varcharnew_release varcharold_assembly varcharnew_assembly varcharold_db_name varcharnew_db_name intmapping_session_id mapping_session datetime created smallinttranscript_version varchartranslation_stable_id smallinttranslation_version intpeptide_archive_id smallintgene_version varchartranscript_stable_id varchargene_stable_id gene_archive intmapping_session_id 1..n 1 “Translation”, “Gene”, “Transcript” type varcharnew_stable_id varcharold_stable_id intmapping_session_id smallintnew_version smallintold_version stable_id_event floatscore

Meta information Meta table contains general key-value pairs –eg. species name –taxonomy id –which coordinates can be mapped and how future additions likely Meta_coord says which feature is stored in which coordinate system –more than 1 entry possible –no feature retrieval without it. varcharmeta_value varcharmeta_key intmeta_id meta intmax_length intcoord_system_id varchartable_name meta_coord

Protein features Features can be add to the peptide sequence hit_id is usually Pfam, Prosite, prints identifier. interpro table links these to interpro ids. xrefs have further information for them. intanalysis_id doublescore inthit_end varcharhit_id doubleevalue floatperc_ident intseq_start intprotein_feature_id inthit_start intseq_end inttranslation_id protein_feature varcharid varcharinterpro_ac interpro

Density feature Density features assign numeric values to regions. –GC content –gene count –repeat coverage The blocksize and value_type enable interpolation by API floatdensity_value intseq_region_id intdensity_feature_id intseq_region_end intseq_region_start intdensity_type_id density_feature intblock_size “sum”, “ratio” value_type intanalysis_id intdensity_type_id density_type 1 1..n

Oligo features replace Affy feature tables Oligo arrays contain probes (oligonucleotides) in probesets. The same probe (and probeset) is part of many oligo arrays. –The same probe will have a different row with the same oligo_probe_id and a different name. The same probe will have only one feature per hit, even if it is represented on many arrays. –It can have many hits though. intanalysis_id intoligo_probe_id tinyintmismatches intseq_region_id intoligo_feature_id intseq_region_end intseq_region_start oligo_feature 1 1..n intoligo_array_id varcharname intprobe_setsize intparent_array_id oligo_array 1 0..n enumtype varcharprobeset varcharname intoligo_array_id intoligo_probe_id oligo_probe textdescription smallintlength

Karyotype bands Karyotype table defines the banding pattern of the chromosomes and how to draw the ideogram. A single band is just like any other feature in the database. band naming convention depends on species and resolution. stain could be (“acen”, “gvar”, “gpos25”, “gpos50”, “gpos75”, “gpos100”, “gneg”...) varcharstain intseq_region_start intkaryotype_id varcharband intseq_region_end intseq_region_id karyotype

Supporting Evidence Exons can be linked to features. These are alignment features that were used as evidence when the exon was created. –supporting evidence The same can be done for whole transcripts. intfeature_id “protein_align_feature”, “dna_align_feature” feature_type intexon_id supporting_feature intfeature_id “protein_align_feature”, “dna_align_feature” feature_type inttranscript_id transcript_supporting_feature

New tables intseq_region_id intseq_region_start intseq_region_end “HAP”, “PAR” exc_type intexc_seq_region_id intexc_seq_region_end intexc_seq_region_start intori intassembly_exception_id assembly_exception intgene_id intalt_allele_id alt_allele intattrib_type_id varcharvalue inttranscript_id transcript_attrib intattrib_type_id varcharvalue inttranslation_id translation_attrib

Ensembl Core Software Team: Glenn Proctor Patrick Meidl Craig Melsopp Ian Longden The Rest of the Ensembl Team. Acknowledgements