EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API.



Outline
The Ensembl Core databases and Perl API
Documentation & Help
(1) Data Objects, Object Adaptors, Database Adaptors & The Registry
(2) Coordinate Systems & Slices
(3) Features
(4) Genes, Transcripts, Exons & Translations
(5) External References
(6) Coordinate Mappings

The Ensembl Core databases
The Ensembl Core databases store genomic sequence; assembly information; gene, transcript and protein models; cDNA and protein alignments; cytogenetic bands, markers, repeats, CpG islands etc.; and external references.
Database names encode species, group, software version, assembly version and data version, e.g. homo_sapiens_core_52_36n: species homo_sapiens, group core, software version 52, assembly version 36, data version n.
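The naming scheme can be taken apart mechanically. The following is a minimal sketch in plain Perl, not part of the Ensembl API: the function name parse_db_name and the regular expression are assumptions made for illustration, and the pattern only covers two-word species names such as homo_sapiens.

```perl
use strict;
use warnings;

# Split an Ensembl Core database name into its parts.
# The pattern is an illustrative assumption; it expects a two-word species name.
sub parse_db_name {
    my ($name) = @_;
    my ( $species, $group, $software, $assembly, $data ) =
        $name =~ /^([a-z]+_[a-z]+)_([a-z]+)_(\d+)_(\d+)([a-z]?)$/
        or return;
    return {
        species          => $species,
        group            => $group,
        software_version => $software,
        assembly_version => $assembly,
        data_version     => $data,
    };
}

my $parts = parse_db_name('homo_sapiens_core_52_36n');
printf "%-17s %s\n", $_, $parts->{$_} for sort keys %{$parts};
```

Names that do not match the pattern make the function return nothing, so a caller can skip databases it does not recognise.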

The Ensembl Core Perl API
Used to retrieve data from and store data in the Ensembl Core databases.
Written in Object-Oriented Perl.
Partly based on and compatible with BioPerl objects.
Used by the Ensembl analysis and annotation pipeline and the Ensembl web code.
Robust and well-supported.
Forms the basis for the other Ensembl APIs.

Documentation & Help
Installation instructions, web-browsable version of the POD (Perldoc) and tutorial:
Inline Perl POD (Plain Old Documentation)
ensembl-dev mailing list:
Ensembl helpdesk:

Data Objects
Data Objects model biological entities, e.g. Genes, Transcripts, Translations, …
Each Data Object encapsulates information from one or a few specific MySQL tables.
Data Objects are retrieved from and stored in the database using Object Adaptors.

Object Adaptors
Object Adaptors are Data Object factories.
Each Object Adaptor is responsible for creating Data Objects of only one particular type.

Database Adaptors
Database Adaptors are Object Adaptor factories.
Database Adaptors are used to connect to a single database.

The Registry
The Registry is a container for all Database Adaptors.
The Registry handles all database connections.
The Registry is an Object Adaptor factory.
The Registry can be initialised via a configuration file or by automatically discovering databases on an RDBMS instance.

System Architecture
[Diagram: the Ensembl Registry holds the DBAdaptors for the Human and Mouse Core and Variation databases. A Core DBAdaptor provides Object Adaptors such as GeneAdaptor and MarkerAdaptor; a Variation DBAdaptor provides VariationAdaptor and GenotypeAdaptor. Each Object Adaptor creates the matching Data Objects (Gene, Marker, Variation, Genotype).]
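The factory chain (Registry, Database Adaptor, Object Adaptor, Data Object) can be mimicked in a few lines of plain Perl. This sketch does not use the Ensembl API: every Toy:: package, the in-memory "rows" and the gene names are invented for illustration, and the real classes do considerably more.

```perl
use strict;
use warnings;

# Toy stand-ins for the real classes; all package names are hypothetical.
package Toy::Gene;
sub new       { my ( $class, %args ) = @_; return bless {%args}, $class }
sub stable_id { return $_[0]{stable_id} }

package Toy::GeneAdaptor;    # Object Adaptor: creates Gene objects only
sub new { my ( $class, $db ) = @_; return bless { db => $db }, $class }

sub fetch_all {
    my ($self) = @_;
    return [ map { Toy::Gene->new( stable_id => $_ ) } @{ $self->{db}{rows} } ];
}

package Toy::DBAdaptor;      # Database Adaptor: one database, many Object Adaptors
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }

sub get_adaptor {
    my ( $self, $type ) = @_;
    my $adaptor_class = "Toy::${type}Adaptor";
    return $adaptor_class->new($self);
}

package Toy::Registry;       # Registry: container for all Database Adaptors
my %dbs;

sub add_db {
    my ( $class, $species, $group, $db ) = @_;
    $dbs{ lc "$species/$group" } = $db;
    return;
}

sub get_adaptor {
    my ( $class, $species, $group, $type ) = @_;
    return $dbs{ lc "$species/$group" }->get_adaptor($type);
}

package main;

# Register one made-up database, then ask the registry for a gene adaptor.
Toy::Registry->add_db( 'Human', 'Core',
    Toy::DBAdaptor->new( rows => [ 'GENE_A', 'GENE_B' ] ) );
my $gene_adaptor = Toy::Registry->get_adaptor( 'human', 'core', 'Gene' );
my @ids = map { $_->stable_id } @{ $gene_adaptor->fetch_all };
print "$_\n" for @ids;
```

The calling code only ever talks to the registry and the adaptor it hands back, which is the same shape as the real get_adaptor call used in the code examples of this workshop.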

Code Example

    # Obtain the Ensembl Gene IDs for all human genes
    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    # Connect to the public Ensembl database server
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
    my $genes = $gene_adaptor->fetch_all;

    # Take the genes off the list one by one and print their stable IDs
    while ( my $gene = shift @{$genes} ) {
        print $gene->stable_id, "\n";
    }

Code Example
OUTPUT: one Ensembl Gene ID (prefix ENSG) per line, for every human gene.

Exercise 1
(a) Load all databases and print their names.
(b) What is the name of the human core database?
There are several solutions possible! Use Perldoc!

Coordinate Systems
Sequences stored in Ensembl are associated with Coordinate Systems.
Coordinate Systems vary from species to species, e.g. human: chromosome, supercontig, clone, contig; zebrafish: chromosome, scaffold, contig.
Sequence information is stored directly in the database for the ‘sequence level’ Coordinate System.
The Coordinate System of the highest level in a given region is the ‘top level’ Coordinate System.
Features are stored in a single Coordinate System.

Coordinate Systems
[Diagram: a chromosome (top level) covered by a tiling path of clones, which are in turn made up of contigs (sequence level).]

Slices
A Slice Data Object represents an arbitrary region of a genome.
Slices are not directly stored in the database.
Slices are used to obtain sequences or features from a specific region in a specific coordinate system.

Code Example

    # Obtain a slice covering the entire human Y chromosome
    my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
    my $slice = $slice_adaptor->fetch_by_region( 'chromosome', 'Y' );

    # Print coordinate system, sequence region, start-end and strand
    printf( "Slice: %s %s %s-%s (%s)\n",
        $slice->coord_system_name,
        $slice->seq_region_name,
        $slice->start,
        $slice->end,
        $slice->strand );

OUTPUT:

    Slice: chromosome Y (1)

Exercise 2
(a) Obtain the names of the coordinate systems for rat.
(b) Obtain a slice covering the first 10 Mb of chromosome 20 of human and print its sequence.
(c) Obtain a slice covering the human gene with Ensembl Gene ID ‘ENSG ’ with 2 kb of flanking sequence and print its sequence.
(d) Print the name, start, end and strand of the obtained slices as well as their coordinate system.
If you want to output your sequences to a file, have a look at Bio::SeqIO.

Features
Features are Data Objects with a defined location on the genome.
All Features have a start, end, strand and slice.
The start coordinate of a Feature is always less than its end coordinate, irrespective of the strand on which it is located (exception: insertion features).
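The start/end rule has a practical consequence: the biological 5' end of a feature is its start on the forward strand but its end on the reverse strand. The sketch below illustrates this with plain hashes rather than real Feature objects; the helper names feature_length and five_prime_end are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```perl
use strict;
use warnings;

# A feature reduced to a plain hash; the field names mirror the slide
# (start, end, strand), the helper functions are hypothetical.
sub feature_length {
    my ($f) = @_;
    return $f->{end} - $f->{start} + 1;    # 1-based, inclusive coordinates
}

# The 5' end is 'start' on the forward strand and 'end' on the reverse strand.
sub five_prime_end {
    my ($f) = @_;
    return $f->{strand} >= 0 ? $f->{start} : $f->{end};
}

my $forward = { start => 100, end => 250, strand => 1 };
my $reverse = { start => 100, end => 250, strand => -1 };

for my $f ( $forward, $reverse ) {
    printf "strand %2d: length %d, 5' end at %d\n",
        $f->{strand}, feature_length($f), five_prime_end($f);
}
```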

Features
Some examples of Features: Gene, Transcript and Exon; ProteinFeature; PredictionTranscript and PredictionExon; DNAAlignFeature and ProteinAlignFeature; RepeatFeature; MarkerFeature; OligoFeature; SimpleFeature; MiscFeature.

Code Example

    # Obtain all markers on human chromosome 1
    my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
    my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '1' );
    my $markers = $slice->get_all_MarkerFeatures;

    # Print the slice each marker is on and the slice covering the marker itself
    while ( my $marker = shift @{$markers} ) {
        printf( "%s\t%s\n",
            $marker->slice->name,
            $marker->feature_Slice->name );
    }

OUTPUT:

    chromosome:NCBI36:1:1: :1    chromosome:NCBI36:1:1237:1488:1
    chromosome:NCBI36:1:1: :1    chromosome:NCBI36:1:2585:2812:1
    chromosome:NCBI36:1:1: :1    chromosome:NCBI36:1:4284:5085:1
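The slice names printed in the marker example appear to follow a colon-separated pattern: coordinate system, assembly version, sequence region, start, end and strand. A minimal sketch in plain Perl of how such a name could be split into fields; parse_slice_name and the field names are assumptions for illustration, not API methods.

```perl
use strict;
use warnings;

# Split a slice name of the form shown in the marker output into named fields.
sub parse_slice_name {
    my ($name) = @_;
    my %slice;
    @slice{qw(coord_system version seq_region start end strand)} =
        split /:/, $name, 6;
    return \%slice;
}

my $slice = parse_slice_name('chromosome:NCBI36:1:1237:1488:1');
printf "%s %s %d-%d (%d), %d bp\n",
    @{$slice}{qw(coord_system seq_region start end strand)},
    $slice->{end} - $slice->{start} + 1;
```

With real Slice objects there is no need to parse the name: the accessor methods used in the slice example (coord_system_name, seq_region_name, start, end, strand) return the same information.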

Exercise 3
(a) Obtain all the CpG islands on the first 5 Mb of dog chromosome 20. Print the total number of CpG islands and the position and sequence of each CpG island.
(b) Obtain all the protein alignment features on the first 5 Mb of dog chromosome 20. Print for each alignment the name of the aligned protein, the start and end coordinates of the matching region on the protein and on the genome, and the name of the analysis resulting in the alignment.
Hint: CpG islands are stored as SimpleFeatures with logic_name ‘cpg’.

Genes, Transcripts & Exons
Genes, Transcripts and Exons are Feature Data Objects.
A Gene is a grouping of Transcripts which share any (partially) overlapping Exons.
A Transcript is a set of Exons.
Introns are not explicitly defined in the database.
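Because introns are not stored, they have to be derived from the exons of a transcript: an intron is the gap between two consecutive exons. The sketch below shows the arithmetic in plain Perl; introns_from_exons is a hypothetical helper and the exon coordinates are made up.

```perl
use strict;
use warnings;

# Derive introns from a transcript's exons: an intron fills the gap
# between two consecutive exons. Coordinates are 1-based and inclusive.
sub introns_from_exons {
    my (@exons) = sort { $a->[0] <=> $b->[0] } @_;
    my @introns;
    for my $i ( 1 .. $#exons ) {
        push @introns, [ $exons[ $i - 1 ][1] + 1, $exons[$i][0] - 1 ];
    }
    return @introns;
}

# Made-up [start, end] pairs for the exons of one transcript
my @exons   = ( [ 100, 200 ], [ 301, 400 ], [ 551, 600 ] );
my @introns = introns_from_exons(@exons);
print join( ', ', map { "$_->[0]-$_->[1]" } @introns ), "\n";
```

A transcript with n exons yields n - 1 introns, and a single-exon transcript yields none.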

Translations
Translations are not Feature Data Objects.
Translations define the Untranslated Region (UTR) and Coding Sequence (CDS) composition of Transcripts.
Protein sequences are not stored in the database, but computed on the fly using Transcript(!) objects.
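Computing a protein "on the fly" amounts to cutting the coding sequence out of the spliced transcript and translating it codon by codon. The sketch below shows the idea in plain Perl with the standard genetic code; the cDNA, the CDS positions and the function translate_cds are invented for illustration, and unlike a real protein sequence the result keeps the stop codon as '*'.

```perl
use strict;
use warnings;

# Standard genetic code, indexed with bases in the order T, C, A, G.
my $AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %BASE_INDEX  = ( T => 0, C => 1, A => 2, G => 3 );

# Translate a coding sequence codon by codon; unknown codons become 'X'.
sub translate_cds {
    my ($cds) = @_;
    my $protein = '';
    for ( my $i = 0 ; $i + 3 <= length($cds) ; $i += 3 ) {
        my @idx = map { $BASE_INDEX{$_} } split //, uc substr( $cds, $i, 3 );
        if ( grep { !defined $_ } @idx ) { $protein .= 'X'; next; }
        $protein .= substr( $AMINO_ACIDS, 16 * $idx[0] + 4 * $idx[1] + $idx[2], 1 );
    }
    return $protein;
}

# Hypothetical spliced transcript: 5' UTR, coding sequence, 3' UTR.
my $cdna = 'GGCACC' . 'ATGGCCTGGTAA' . 'TTTAAA';
my ( $cds_start, $cds_end ) = ( 7, 18 );    # 1-based positions on the cDNA
my $cds = substr( $cdna, $cds_start - 1, $cds_end - $cds_start + 1 );
print "CDS:     $cds\n";
print "Protein: ", translate_cds($cds), "\n";
```

The two CDS positions play the role of the Translation here: they say which part of the transcript is coding and leave everything outside as UTR.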

Exercise 4
(a) Obtain the gene with Ensembl Gene ID ‘ENSG ’ and its transcripts. Print the total number of exons in the gene and the number of exons in each individual transcript. Why do the found numbers disagree with each other?
(b) Print for each transcript of the above gene the coding sequence and the protein sequence.

External References
External References (Xrefs) are cross references of Ensembl Genes, Transcripts or Translations with identifiers from other databases, e.g. HGNC, WikiGenes, UniProtKB/Swiss-Prot, RefSeq, MIM etc.

Code Example

    # Obtain external references for Ensembl gene ENSG
    my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG ' );

    # Xrefs attached directly to the gene
    my $gene_xrefs = $gene->get_all_DBEntries;
    print "Xrefs on the gene: \n\n";
    while ( my $gene_xref = shift @{$gene_xrefs} ) {
        printf( "%s: %s\n", $gene_xref->dbname, $gene_xref->display_id );
    }

    # Xrefs on the gene and on its transcripts and translations
    my $all_xrefs = $gene->get_all_DBLinks;
    print "\nXrefs on the gene, transcript and protein: \n\n";
    while ( my $all_xref = shift @{$all_xrefs} ) {
        printf( "%s: %s\n", $all_xref->dbname, $all_xref->display_id );
    }

Code Example
Output:

    Xrefs on the gene:

    HGNC: BRCA2
    DBASS3: BRCA2
    UCSC: uc001uub.1
    HGNC_curated_gene: BRCA2

    Xrefs on the gene, transcript and protein:

    shares_CDS_with_OTTT: OTTHUMT
    AFFY_HC_G110: 1503_at
    AFFY_HG_U95A: 1503_at
    AFFY_HG_U95Av2: 1503_at
    AFFY_HC_G110: 1990_g_at
    AFFY_HG_U95A: 1990_g_at
    AFFY_HG_U95Av2: 1990_g_at
    AFFY_HuGeneFL: X95152_rna1_at
    AFFY_HG_Focus: _at
    AFFY_HG_U133A: _at
    AFFY_HG_U133A_2: _at
    AFFY_HG_U133_Plus_2: _at
    AFFY_HG_U133A: _s_at
    AFFY_HG_U133A_2: _s_at
    AFFY_HG_U133_Plus_2: _s_at
    AFFY_U133_X3P: g _3p_a_at
    AFFY_U133_X3P: _3p_s_at
    AFFY_U133_X3P: Hs S1_3p_at
    AFFY_HC_G110: 1989_at
    AFFY_HG_U95A: 1989_at
    AFFY_HG_U95Av2: 1989_at
    RefSeq_dna: NM_
    HGNC: BRCA2
    UniGene: Hs
    AgilentCGH: A_14_P
    AgilentCGH: A_14_P
    AgilentProbe: A_23_P99452
    Codelink: GE60169
    Illumina_V1: GI_ S
    Illumina_V2: ILMN_
    HGNC_curated_transcript: BRCA2-001
    CCDS: CCDS
    EntrezGene: BRCA2
    MIM_MORBID:
    MIM_MORBID:
    MIM_MORBID:
    MIM_GENE:
    MIM_MORBID:
    MIM_MORBID:
    RefSeq_peptide: NP_
    Uniprot/SPTREMBL: A1YBP1_HUMAN
    EMBL: DQ
    protein_id: ABI
    Uniprot/SPTREMBL: B2ZAH0_HUMAN
    EMBL: EU
    protein_id: ACD
    EMBL: AL
    Uniprot/SPTREMBL: Q5TBJ7_HUMAN
    EMBL: AL
    protein_id: CAI
    protein_id: CAI
    Uniprot/SPTREMBL: Q8IU64_HUMAN
    EMBL: AY
    protein_id: AAN
    EMBL: AF
    protein_id: AAN
    EMBL: AF
    protein_id: AAN
    EMBL: AF
    protein_id: AAN
    EMBL: AF
    protein_id: AAN
    EMBL: AF
    protein_id: AAN
    EMBL: AF
    protein_id: AAN
    EMBL: AF
    protein_id: AAN
    EMBL: AF
    protein_id: AAN
    EMBL: AF
    protein_id: AAN
    EMBL: AF
    protein_id: AAN
    EMBL: AF
    protein_id: AAN
    EMBL: AF
    protein_id: AAN

Exercise 5
(a) Obtain the Ensembl gene(s) that correspond(s) to UniProtKB/Swiss-Prot entry BRCA2_HUMAN. Print its Ensembl Gene ID, name and description.
(b) Obtain all external references for the above gene. Print their names and databases.

Coordinate Mappings
The API provides the means to convert between any related coordinate systems in the database.
The Feature methods transfer, transform and project and the Slice method project are used to map features between coordinate systems.

Transfer
Transfer moves a feature on a slice in a given coordinate system to another slice in the same or another coordinate system.
Transfer needs the feature to be defined in the requested coordinate system, i.e. it cannot overlap an undefined region.
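For the simplest case, two slices in the same coordinate system, a transfer is just a change of origin. The sketch below is a toy model in plain Perl, not the API's transfer method: the function, the hash layout and all coordinates are assumptions, and it handles forward-strand slices on the same sequence region only.

```perl
use strict;
use warnings;

# Toy transfer within one coordinate system. Feature coordinates are
# relative to the slice the feature is on; slices are forward-strand only.
sub transfer {
    my ( $feature, $to_slice ) = @_;
    my $from_slice = $feature->{slice};

    # The feature is not defined on a different sequence region.
    return if $from_slice->{seq_region} ne $to_slice->{seq_region};

    # Slice-relative -> sequence-region coordinates -> new slice-relative.
    my $offset = $from_slice->{start} - $to_slice->{start};
    return {
        %{$feature},
        start => $feature->{start} + $offset,
        end   => $feature->{end} + $offset,
        slice => $to_slice,
    };
}

my $slice_a = { seq_region => '1', start => 1001, end => 2000 };
my $slice_b = { seq_region => '1', start => 1051, end => 3000 };
my $slice_y = { seq_region => 'Y', start => 1,    end => 5000 };

my $feature = { start => 101, end => 150, strand => 1, slice => $slice_a };
my $moved   = transfer( $feature, $slice_b );
printf "on slice B: %d-%d\n", $moved->{start}, $moved->{end};
print "cannot be placed on Y\n" unless transfer( $feature, $slice_y );
```

The feature keeps its position on the chromosome; only the numbers used to describe it change, because they are counted from the start of a different slice.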

Transfer
[Diagram with panels labelled Chr 1 and Chr Y.]

Transform
Like transfer, but transform places the feature on a slice that spans the entire sequence that the feature is on in the requested coordinate system.


Project
Project doesn’t move a feature, but it provides a definition of where a feature or slice lies in another coordinate system.
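A projection can be pictured as cutting a region into one segment per underlying sequence it overlaps. The sketch below models this in plain Perl with a made-up assembly table; it is not the API's project method, the clone names and coordinates are invented, and only forward-orientated clones are handled to keep it short.

```perl
use strict;
use warnings;

# Made-up assembly: each row maps a chromosome range onto a clone range
# (forward orientation only, to keep the sketch short).
my @assembly = (
    { chr_start => 1,    chr_end => 1000, clone => 'clone_A', clone_start => 1 },
    { chr_start => 1001, chr_end => 1200, clone => 'clone_B', clone_start => 1 },
    { chr_start => 1201, chr_end => 2000, clone => 'clone_C', clone_start => 51 },
);

# Return one segment per clone that the chromosome range overlaps.
sub project_to_clones {
    my ( $start, $end ) = @_;
    my @segments;
    for my $row (@assembly) {
        my $s = $start > $row->{chr_start} ? $start : $row->{chr_start};
        my $e = $end < $row->{chr_end}     ? $end   : $row->{chr_end};
        next if $s > $e;    # no overlap with this clone
        my $delta = $row->{clone_start} - $row->{chr_start};
        push @segments, {
            from_start => $s - $start + 1,    # relative to the projected region
            from_end   => $e - $start + 1,
            clone      => $row->{clone},
            to_start   => $s + $delta,
            to_end     => $e + $delta,
        };
    }
    return \@segments;
}

my $projection = project_to_clones( 950, 1250 );
for my $seg ( @{$projection} ) {
    printf "%d-%d projects to %s:%d-%d\n",
        @{$seg}{qw(from_start from_end clone to_start to_end)};
}
```

Each segment pairs a stretch of the original region (from_start, from_end) with its location in the other coordinate system, which mirrors the from_start, from_end and to_Slice accessors used in the workshop's projection example.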


Code Example

    # Project gene ENSG to the clone coordinate system
    my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG ' );
    my $projection = $gene->project( 'clone' );

    # Each segment links a part of the gene to a slice on a clone
    foreach my $segment ( @{$projection} ) {
        my $to_slice = $segment->to_Slice;
        printf( "%s %s-%s projects to %s %s:%s-%s(%s)\n",
            $gene->stable_id,
            $segment->from_start,
            $segment->from_end,
            $to_slice->coord_system_name,
            $to_slice->seq_region_name,
            $to_slice->start,
            $to_slice->end,
            $to_slice->strand );
    }

Code Example
Output:

    ENSG projects to clone AC : (-1)
    ENSG projects to clone AC : (-1)
    ENSG projects to clone AC : (-1)

Exercise 6
(a) Obtain a gene located on clone ‘AL ’ and print out its coordinate system and gene coordinates. Then transform the gene to ‘toplevel’ and again print out the coordinate system and gene coordinates.

Other Ensembl Core APIs Ruby (by Jan Aerts): Python (by Jenny Qing Qian):

Acknowledgements
The Ensembl Core Team: Glenn Proctor, Andreas Kahari, Daniel Rios, Ian Longden