The Ensembl API European Bioinformatics Institute Hinxton, Cambridge, UK.

Slides:



Advertisements
Similar presentations
ACCESSING AND EXTRACTING CHIP-SEQ AND TF-GENE INTERACTIONS FROM PAZAR
Advertisements

Database System Concepts and Architecture
The Ensembl API European Bioinformatics Institute Hinxton, Cambridge, UK.
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology.
Automatic Information Retrieval from Bioinformatics Websites Kang Peng.
Data Mining in Ensembl with EnsMart. 2 of 24 All genes from a candidate region Genes with a particular protein domain Members of a protein family Genes.
DT228/3 Web Development Databases. Database Almost all web application on the net access a database e.g. shopping sites, message boards, search engines.
ASP.NET Programming with C# and SQL Server First Edition
Data retrieval BioMart Data sets on ftp site MySQL queries of databases Perl API access to databases Export View.
FALL 2005CSI 4118 – UNIVERSITY OF OTTAWA1 Part 4 Web technologies: HTTP, CGI, PHP,Java applets)
Systems Analysis – Analyzing Requirements.  Analyzing requirement stage identifies user information needs and new systems requirements  IS dev team.
Nucleotide sequence alignments in Compara Stephen Fitzgerald
1 HTML and CGI Scripting CSC8304 – Computing Environments for Bioinformatics - Lecture 10.
Chapter 33 CGI Technology for Dynamic Web Documents There are two alternative forms of retrieving web documents. Instead of retrieving static HTML documents,
1 Ensembl Modules and MySQL. SQL and Database Tables Quick Examples 2.
2013Dr. Ali Rodan 1 Handout 1 Fundamentals of the Internet.
An Improved Approach to Generating Configuration Files from a Database Jon Finke Rensselaer Polytechnic Institute.
Tutorial 1 Getting Started with Adobe Dreamweaver CS3
Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources.
European Bioinformatics Institute The Ensembl Database Schema.
EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Edinburgh, 24 February 2009 Ensembl Developers Workshop Core API.
Taverna and my Grid Basic overview and Introduction Tom Oinn
EBI is an Outstation of the European Molecular Biology Laboratory. Bert Overduin Daniel Rios Stephen Fitzgerald Edinburgh, 24 & 25 February 2009 Ensembl.
Innovation & Supplementary Material Eleonora Presani – Elsevier
Use cases for Tools at the Bovine Genome Database Apollo and Bovine QTL viewer.
Taverna and my Grid Open Workflow for Life Sciences Tom Oinn
Engr. M. Fahad Khan Lecturer Software Engineering Department University Of Engineering & Technology Taxila.
Ensembl Steve Searle Joint project leader, Ensembl Genebuild team.

Apollo Future Plans Nomi Harris, BDGP/FlyBase GMOD Meeting, Cambridge April 27, 2004.
EnsEMBL Opening up the whole Genome Philip Lijnzaad
1 of 38 Data Mining in Ensembl with BioMart. 2 of 38 Simple Text-based Search Engine.
Analysis of the RNAseq Genome Annotation Assessment Project by Subhajyoti De.
Browsing the Genome Using Genome Browsers to Visualize and Mine Data.
The Glance Project ATLAS Management January 2012.
Copyright © 2010 Certification Partners, LLC -- All Rights Reserved Perl Specialist.
Lesson Overview 3.1 Components of the DBMS 3.1 Components of the DBMS 3.2 Components of The Database Application 3.2 Components of The Database Application.
1 MSCS 237 Overview of web technologies (A specific type of distributed systems)
Alastair Kerr, Ph.D. WTCCB Bioinformatics Core An introduction to DNA and Protein Sequence Databases.
ITGS Databases.
Building WormBase database(s). SAB 2008 Wellcome Trust Sanger Insitute Cold Spring Harbor Laboratory California Institute of Technology ● RNAi ● Microarray.
Sept 2008 Ensembl Funcgen Perl API Nathan Johnson EBI - Wellcome Trust Genome Campus, UK Funcgen.
Saving State on the WWW. The Issue  Connections on the WWW are stateless  Every time a link is followed is like the first time to the server — it has.
SRI International Bioinformatics 1 Genome Browser Tomer Altman Bioinformatics Research Group SRI, International August 19th, 2009.
VectorBase Vectorbase probe mapping. VectorBase Automatic Annotation browser Array data CHADO Manual Annotation XML vectorbase Automatic Annotation.
David Lawrence 7/8/091Intro. to PHP -- David Lawrence.
Copyright © 2003 ProsoftTraining. All rights reserved. Perl Fundamentals.
EBI is an Outstation of the European Molecular Biology Laboratory. Gautier Koscielny VectorBase Meeting 08 Feburary 2012, EBI VectorBase Text Search Engine.
Creation and Maintenance of GeneKeyDB Research being conducted by Kevin Kastner Under the direction of Dr. Erich Baker.
Cross Language Clone Analysis Team 2 February 3, 2011.
A guided tour of Ensembl This quick tour will give you an outline view of what Ensembl is all about. You will learn: –Why we need Ensembl –What is in the.
Tools in Bioinformatics Genome Browsers. Retrieving genomic information Previous lesson(s): annotation-based perspective of search/data Today: genomic-based.
1 of 28 Evaluating Genes and Transcripts (“Genebuild”)
Personalizing Web Sites Nasrullah. Understanding Profile The ASP.NET application service that enables you to store and retrieve information about users.
The Bovine Genome Database Abstract The Bovine Genome Database (BGD, facilitates the integration of bovine genomic data. BGD is.
Data Loading into Ensembl Database TGAC Browser
FILES AND EXCEPTIONS Topics Introduction to File Input and Output Using Loops to Process Files Processing Records Exceptions.
Amazon Web Services. Amazon Web Services (AWS) - robust, scalable and affordable infrastructure for cloud computing. This session is about:
GeneConnect Use Cases and Design August 3, GeneConnect Database IDs are linked by Direct Annotation, Inferred Annotation, or Sequence Alignment.
Designing, Executing and Sharing Workflows with Taverna 2.4 Different Service Types Katy Wolstencroft Helen Hulme myGrid University of Manchester.
Introduction to Genes and Genomes with Ensembl
The Ensembl Database Steven Jones August 18, 2004
Data Mining with BioMart
Display of Near Optimal Sequence Alignments
Web Systems Development (CSC-215)
Topics Introduction to File Input and Output
Ensembl Genome Repository.
Developing a Model-View-Controller Component for Joomla Part 3
Topics Introduction to File Input and Output
Welcome - webinar instructions
Presentation transcript:

The Ensembl API European Bioinformatics Institute Hinxton, Cambridge, UK

What is the API? Wikipedia defines an Application Programming Interface as: “... a set of definitions of the ways in which one piece of computer software communicates with another. It is a method of achieving abstraction, usually (but not necessarily) between lower-level and higher-level software. One of the primary purposes of an API is to provide a set of commonly- used functions.” The Ensembl API is a framework for applications that need to access or store data in Ensembl's databases.

Java API (EnsJ) System Context www Pipeline Apollo Mart DB MartShell MartView Other Scripts & Applications Ensembl DBs Perl API

The Perl API Written in Object-Oriented Perl. Used to retrieve data from and to store data in Ensembl databases. Foundation for the Ensembl Pipeline and Ensembl Web interface.

Why use an API? Uniform method of access to the data. Avoid writing the same thing twice: reusable in different systems. Reliable: lots of hard work, testing and optimisation already done. Insulates developers to underlying changes at a lower level (i.e. the database).

Data Objects Information is obtained from the API in the form of Data Objects. A Data Object represents a piece of data that is (or can be) stored in the database. Exon Transcript ExonMarkerGene

Data Objects – Code Example # print out the start, end and strand of a transcript print $transcript->start(), '-', $transcript->end(), '(',$transcript->strand(), “)\n”; # print out the stable identifier for an exon print $exon->stable_id(), “\n”; # print out the name of a marker and its primer sequences print $marker->display_marker_synonym()->name, “\n”; print “left primer: ”, $marker->left_primer(), “\n”; print “right primer: ”, $marker->right_primer(), “\n”; # set the start and end of a simple feature $simple_feature->start(10); $simple_feature->end(100);

Object Adaptors Object Adaptors are factories for Data Objects. Data Objects are retrieved from and stored in the database using Object Adaptors. Each Object Adaptor is responsible for creating objects of only one particular type. Data Adaptor fetch, store, and remove methods are used to retrieve, save, and delete information in the database.

Object Adaptors – Code Example # fetch a gene by its internal identifier $gene = $gene_adaptor->fetch_by_dbID(1234); # fetch a gene by its stable identifier $gene = $gene_adaptor->fetch_by_stable_id('ENSG '); # store a transcript in the database $transcript_adaptor->store($transcript); # remove an exon from the database $exon_adaptor->remove($exon); # get all transcripts having a specific interpro

The DBAdaptor The Database Adaptor is a factory for Object Adaptors. It is used to connect to the database and to obtain Object Adaptors. DB DBAdaptor GeneAdaptorMarkerAdaptor... Gene Mark er …

The DBAdaptor – Code Example use Bio::EnsEMBL::DBSQL::DBAdaptor; # connect to the database: $dbCore = Bio::EnsEMBL::DBSQL::DBAdaptor->new (-host => ‘ensembldb.ensembl.org’, -dbname => ‘homo_sapiens_core_41_36c’, -species => 'human', -group => 'core', -user => ‘anonymous’); # get a GeneAdaptor $gene_adaptor = $dbCore->get_GeneAdaptor(); # get a TranscriptAdaptor $transcript_adaptor = $dbCore->get_TranscriptAdaptor();

The Registry Is a central store of all the DBAdaptors Automatically adds DBAdaptors on creation Adaptors can be retrieved by using the species, database type and Adaptor type. Enables you to specify configuration scripts Very useful for multi species scripts.

Registry configuration file use Bio::EnsEMBL::DBSQL::DBAdaptor; use Bio::EnsEMBL::Utils::ConfigRegistry; my $reg = "Bio::EnsEMBL::Registry"; = ('H_Sapiens', 'homo sapiens', 'human', 'Homo_Sapient',"Homo sapiens"); Bio::EnsEMBL::Utils::ConfigRegistry-> add_alias( -species => "Homo_sapiens", -alias => new Bio::EnsEMBL::DBSQL::DBAdaptor( -species => "Homo_sapiens", -group => "core", -host => 'ensembldb.ensembl.org', -user => 'anonymous', -dbname => 'homo_sapiens_core_41_36c');

Registry usage use Bio::EnsEMBL::Registry; my $reg = "Bio::EnsEMBL::Registry"; $reg->load_all(); $gene_adaptor = $reg->get_adaptor("human","core","Gene"); $gene = $gene_adaptor-> fetch_by_stable_id('ENSG ');

load_registry_from_db Loads all the databases for the correct version of this API *Ensures you use the correct API and Database *Lazy loads so no database connections are made until requested Use -verbose option to print out the databases loaded

Registry Usage use Bio::EnsEMBL::Registry; my $reg = "Bio::EnsEMBL::Registry"; $reg->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous'); $gene_adaptor = $reg->get_adaptor("human","core","Gene");

Coordinate Systems Ensembl stores features and sequence in a number of coordinate systems. Example coordinate systems: chromosome, clone, scaffold, contig Chromosome CoordSystem 1, 2,..., X, Y Clone CoordSystem AC105024, AC107993, etc.

Coordinate Systems Regions in one coordinate system may be constructed from a tiling path of regions from another coordinate system. Chromosome 17 BAC clones Gap

Coordinate Systems – Code Example # get a coordinate system adaptor $csa = Bio::EnsEMBL::Registry-> get_adaptor(“human”,”core”,”coordsystem”); # print out all of the coord systems in the db $coord_systems = $csa->fetch_all(); foreach $cs { print $cs->name(), ' ',$cs->version, “\n”; } chromosome NCBI36 supercontig clone contig Sample output:

Slices A Slice Data Object represents an arbitrary region of a genome. Slices are not directly stored in the database. A Slice is used to request sequence or features from a specific region in a specific coordinate system. chr20 Clone AC022035

Slices – Code Example # get the slice adaptor $slice_adaptor = $reg->get_adaptor(“human”,”core”,”slice”); # fetch a slice on a region of chromosome 12 $slice = $slice_adaptor->fetch_by_region('chromosome', '12', 1e6, 2e6); # print out the sequence from this region print $slice->seq(); # get all clones in the database and print out their foreach $slice { print $slice->seq_region_name(), “\n”; }

Features Features are Data Objects with associated genomic locations. All Features have start, end, strand and slice attributes. Features are retrieved from Object Adaptors using limiting criteria such as identifiers or regions (slices). chr20 Clone AC022035

Features in the database Gene Transcript Exon PredictionTranscript PredictionExon DnaAlignFeature ProteinAlignFeature SimpleFeature MarkerFeature QtlFeature MiscFeature KaryotypeBand RepeatFeature AssemblyExceptionFeature DensityFeature

A Complete Code Example use Bio::EnsEMBL::Registry; my $reg = "Bio::EnsEMBL::Registry"; $reg->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous'); $slice_ad = $reg->get_adaptor(“human”,”core”,”slice”); $slice = $slice_ad->fetch_by_region('chromosome', 'X', 1e6, 10e6); foreach $sf { $start = $sf->start(); $end = $sf->end(); $strand = $sf->strand(); $score = $sf->score(); print “$start-$end($strand) $score\n”; }

Genes, Transcripts, Exons Genes, Transcripts and Exons are objects that can be used just like any other Feature object. A Gene is a set of alternatively spliced Transcripts. A Transcript is a set of Exons

Genes, Transcripts, Exons – Code Example # fetch a gene by its stable identifier $gene = $gene_adaptor->fetch_by_stable_id('ENSG '); # print out the gene, its transcripts, and its exons print “Gene: “, get_string($gene), “\n”; foreach $transcript { print “ Transcript: “, get_string($transcript), “\n”; foreach $exon { print “ Exon: “, get_string($exon), “\n”; } # helper function: returns location and stable_id string sub get_string { $feature = shift; $stable_id = $feature->stable_id(); $seq_region = $feature->slice()->seq_region_name(); $start = $feature->start(); $end = $feature->end(); $strand = $feature->strand(); return “$stable_id $seq_region:$start-$end($strand)”; }

Genes, Transcripts, Exons – Sample Output Gene: ENSG : (1) Transcript: ENST : (1) Exon: ENSE : (1) Exon: ENSE : (1) Exon: ENSE : (1) Transcript: ENST : (1) Exon: ENSE : (1) Exon: ENSE : (1) Exon: ENSE : (1) Transcript: ENST : (1) Exon: ENSE : (1) Exon: ENSE : (1) Exon: ENSE : (1) Exon: ENSE : (1)

Translations Translations are not Features. A Translation object defines the UTR and CDS of a Transcript. Peptides are not stored in the database, they are computed on the fly using Translation objects. Translation

Translations – Code Example # fetch a transcript from the database $transcript = $transcript_adaptor->fetch_by_stable_id('ENST '); # obtain the translation of the transcript $translation = $transcript->translation(); # print out the translation info print “Translation: “, $translation->stable_id(), “\n”; print “Start Exon: “,$translation->start_Exon()->stable_id(),“\n”; print “End Exon: “, $translation->end_Exon()->stable_id(), “\n”; # start and end are coordinates within the start/end exons print “Start : “, $translation->start(), “\n”; print “End : “, $translation->end(), “\n”; # print the peptide which is the product of the translation print “Peptide : “, $transcript->translate()->seq(), “\n”;

Translations – Sample Output Translation: ENSP Start Exon: ENSE End Exon: ENSE Start : 3 End : 22 Peptide : SERRPRCPGTPLAGRMADPGPDPESESESVFPREVGLFADSYSEKSQFCFCGHVLTITQN FGSRLGVAARVWDAALSLCNYFESQNVDFRGKKVIELGAGTGIVGILAALQGAYGLVRET EDDVIEQELWRGMRGACGHALSMSTMTPWESIKGSSVRGGCYHH

Coordinate Transformations The API provides the means to convert between any related coordinate systems in the database. Feature methods transfer, transform, project can be used to move features between coordinate systems. Slice method project can be used to move features between coordinate systems.

Feature::transfer The Feature method transfer moves a feature from one Slice to another. The Slice may be in the same coordinate system or a different coordinate system. Chr20 Chr17 AC Chr17 ChrX

Feature::transfer – Code Example # fetch an exon from the database $exon = $exon_adaptor->fetch_by_stable_id('ENSE '); print “Exon is on slice: “, $exon->slice()->name(), “\n”; print “Exon coords: “, $exon->start(), '-', $exon->end(), “\n”; # transfer the exon to a small slice just covering it $exon_slice = $slice_adaptor->fetch_by_Feature($exon); $exon = $exon->transfer($exon_slice); print “Exon is on slice: “, $exon->slice()->name(), “\n”; print “Exon coords: “, $exon->start(), '-', $exon->end(), “\n”; Sample output: Exon is on slice: chromosome:NCBI35:12:1: :1 Exon coords: Exon is on slice: chromosome:NCBI35:12: : :1 Exon coords: 1-302

Feature::transform Transform is like the transfer method but it does not require a Slice argument, only the name of a Coordinate System. Chr20 Chr17 AC Chr17 AC099811

Feature::transform – Code Example # fetch a prediction transcript from the database $pta = $db->get_PredictionTranscriptAdaptor(); $pt = $pta->fetch_by_dbID(1834); print “Prediction Transcript: “, $pt->stable_id(), “\n”; print “On slice: “, $pt->slice()->name(), “\n”; print “Coords: “, $pt->start(), '-', $pt->end(), “\n”; # transform the prediction transcript to the supercontig system $pt = $pt->transform('supercontig'); print “On slice: “, $pt->slice()->name(), “\n”; print “Coords: “, $pt->start(), '-', $pt->end(), “\n”; Sample output: Prediction Transcript: GENSCAN On slice: contig::AL :1:185571:1 Coords: On slice: supercontig::NT_008470:1: :1 Coords:

Feature::project Project does not move a feature, but it provides a definition of where a feature lies in another coordinate system. It is useful to determine where a feature that spans multiple sequence regions lies. Chr17 AC Chr17 AC099811

Feature::project – Code Example # fetch a transcript from the database $tr = $transcript_adaptor->fetch_by_stable_id('ENST '); $cs = $tr->slice()->coord_system(); print “Coords: “, $cs->name(), ' ', $tr->slice->seq_region_name(),' ', $tr->start(), '-', $tr->end(), '(', $tr->strand(), “)\n”; # project to the clone coordinate system foreach $segment { ($from_start, $from_end, $pslice) print “$from_start-$from_end projects to clone region: “; print $pslice->seq_region_name(), ' ', $pslice->start(), '-', $pslice->end(), “(“, $pslice->strand(), ”)\n”; }

Feature::project – Sample Output Coords: chromosome (-1) projects to clone region: AL (-1)

Slice::project Slice::project acts exactly like Feature::project, except that it works with a Slice instead of a Feature. It is used to determine how one coordinate system is made of another. Chr17 AC Chr17 AC099811

Slice::project – Code Example # fetch a slice on a region of chromosome 5 $slice = $slice_adaptor->fetch_by_region('chromosome', '5', 10e6, 12e6); # project to the clone coordinate system foreach $segment { ($from_start, $from_end, $pslice) print “$from_start-$from_end projects to clone region: “; print $pslice->seq_region_name(), ' ', $pslice->start(), '-', $pslice->end(), “(“, $pslice->strand(), ”)\n”; }

Slice::project – Sample Output projects to clone region: AC (1) projects to clone region: AC (1) projects to clone region: AC (1) projects to clone region: AC (-1) projects to clone region: AC (-1) projects to clone region: AC (-1) projects to clone region: AC (1) projects to clone region: AC (-1) projects to clone region: AC (1) projects to clone region: AC (1) projects to clone region: AC (-1) projects to clone region: AC (1) projects to clone region: AC (-1) projects to clone region: AC (1) projects to clone region: AC (1) projects to clone region: AC (1) projects to clone region: AC (-1) projects to clone region: AC (- 1)

External References Ensembl cross references its Gene predictions with identifiers from other databases. Q9UBS0 ENSP RPS6KB2 ENSP BRCA2 P51587 Ensembl SwissProt HUGO

External References – Code Example # fetch a transcript using an external identifier ($transc) print “TTN is Ensembl transcript “,$transc->stable_id(),“\n”; # get all external identifiers for a translation $transl = $transc->translation(); print “External identifiers for “, $transl->stable_id(), “:\n”; foreach my $db_entry { print $db_entry->dbname(), “:”, $db_entry->display_id(),”\n”; }

External References – Sample Output TTN is Ensembl transcript ENST External identifiers for ENSP : GO:GO: GO:GO: GO:GO: GO:GO: GO:GO: GO:GO: GO:GO: GO:GO: GO:GO: GO:GO: Uniprot/SPTREMBL:Q10465 EMBL:X90569 protein_id:CAA GO:GO: GO:GO: GO:GO: Uniprot/SPTREMBL:Q8WZ42 EMBL:AJ protein_id:CAD

Getting More Information perldoc – Viewer for inline API documentation. Also online at: Tutorial document: ensembl-dev mailing list:

Acknowledgements Ensembl Core Software Team: Glenn Proctor Patrick Meidl Ian Longden The Rest of the Ensembl Team.