The Ensembl Database Steven Jones August 18, 2004

Presentation transcript:

The Ensembl Database www.ensembl.org Steven Jones August 18, 2004 Lecture 3.1 (c) 2004 CGDN

What is Ensembl?
• Public annotation of the mammalian genomes
• Open source software
• Relational database system
• The future of genomic bioinformatics?

The Ensembl Project
“Ensembl is a joint project between EMBL European Bioinformatics Institute and the Sanger Institute to develop a software system which produces and maintains automatic annotation on eukaryotic genomes. Ensembl is primarily funded by the Wellcome Trust”

The Ensembl Genome Annotation
• Utilises raw DNA sequence data from public sources
• Creates a tracking database (the “Ensembl” database)
• Joins the sequences, based on a sequence scaffold or “Golden Path”
• Automatically finds genes and other features of the sequence
• Provides a publicly accessible web-based interface to the database

Ensembl Software System
• Makes extensive use of BioPerl (www.bioperl.org)
• Uses the free MySQL database
• The entire Ensembl code base is freely available under an Apache open source license
• Mainly written in Perl, with extensions in C
• Some viewers have been written in Java (e.g. Apollo)

Ensembl Software System
• The core design feature is the “virtual contig” (VC) object.
• It allows genome sequence to be accessed as a single large contiguous sequence, although it is stored in the database as a collection of fragments.
• The VC object handles reading and writing to the DNA sequence data.

Ensembl Software System
• Storing each sequence as a single large entry would be impractical: e.g. the whole database entry would need to be changed if a single base changed.
• Because the constituent DNA is stored as fragments, it is easy to move between database versions and assemblies.
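The virtual contig idea can be sketched in a few lines of plain Perl. This is a minimal illustration under stated assumptions, not the Ensembl implementation: the fragment names, the sequences and the virtual_subseq function are all made up for the example, and the fragments are assumed to be non-overlapping and sorted by start position.

```perl
use strict;
use warnings;

# Hypothetical fragments (not real Ensembl data). Each one records where it
# starts on the assembled coordinate system (1-based) and its own DNA string.
# Assumed to be non-overlapping and sorted by start.
my @fragments = (
    { id => 'fragA', start => 1,  seq => 'ACGTACGTAC' },    # covers 1..10
    { id => 'fragB', start => 11, seq => 'GGGTTTAAAC' },    # covers 11..20
    { id => 'fragC', start => 21, seq => 'CCATGCATGG' },    # covers 21..30
);

# Return the bases from $from to $to (inclusive, 1-based) on the assembled
# coordinate system by stitching together the fragments that overlap the range.
sub virtual_subseq {
    my ($frags, $from, $to) = @_;
    my $out = '';
    for my $f (@$frags) {
        my $f_start = $f->{start};
        my $f_end   = $f_start + length($f->{seq}) - 1;
        next if $f_end < $from || $f_start > $to;       # fragment outside range
        my $s = $from > $f_start ? $from : $f_start;    # start of the overlap
        my $e = $to   < $f_end   ? $to   : $f_end;      # end of the overlap
        $out .= substr($f->{seq}, $s - $f_start, $e - $s + 1);
    }
    return $out;
}

# A request that spans two fragments reads like one contiguous sequence.
print virtual_subseq(\@fragments, 8, 13), "\n";    # prints TACGGG

# Changing a single base only touches the one fragment that holds it.
substr($fragments[1]{seq}, 0, 1, 'T');
print virtual_subseq(\@fragments, 8, 13), "\n";    # prints TACTGG
```

The caller only ever works in assembled coordinates; which fragment a base lives in is hidden inside the function, which is the point of the virtual contig design.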

Ensembl Software System
• Software can be accessed by FTP
• Can also be accessed through CVS (Concurrent Versions System)
• Possible to set up a mirror of the entire Ensembl system

Ensembl Now Supports a Number of Organisms

The Chromosome Overview

Entering through Disease Genes via the OMIM database

The Ensembl Gene Report

Expanding Annotation Features, e.g. Unigene and Human mRNA similarities

Ensembl links out to other databases to access individual entries

Direct Access to Ensembl
We will use the underlying Ensembl data. For reasons of simplicity and space we will use the lite version of the Ensembl database.

{xhost01}/home/pubseq/acedb> mysql -u anonymous -h kaka.sanger.ac.uk
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 11928 to server version: 3.23.32

Type 'help' for help.

mysql> use homo_sapiens_lite_8_30b ;
Database changed
mysql> show tables ;
+-----------------------------------+
| Tables_in_homo_sapiens_lite_8_30b |
+-----------------------------------+
| cpg                               |
| eponine                           |
| gene                              |
| gene_xref                         |
| genscan                           |
| repeat                            |
| repeat_types                      |
| snp                               |
| transcript                        |
| trna                              |
+-----------------------------------+
10 rows in set (0.15 sec)

Explore the Database

mysql> describe gene ;
+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| id            | int(10) unsigned |      | PRI | NULL    | auto_increment |
| db            | varchar(40)      |      | MUL |         |                |
| type          | varchar(40)      |      |     |         |                |
| gene_id       | int(10) unsigned |      | MUL | 0       |                |
| gene_name     | varchar(40)      |      | MUL | unknown |                |
| chr_name      | varchar(20)      |      | MUL |         |                |
| chr_start     | int(10) unsigned |      |     | 0       |                |
| chr_end       | int(10) unsigned |      |     | 0       |                |
| chr_strand    | tinyint(4)       |      |     | 0       |                |
| description   | varchar(255)     |      |     |         |                |
| external_db   | varchar(40)      |      |     |         |                |
| external_name | varchar(40)      |      |     |         |                |
+---------------+------------------+------+-----+---------+----------------+
12 rows in set (0.14 sec)

mysql> select * from gene limit 10 ;
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
| id | db   | type     | gene_id | gene_name       | chr_name | chr_start | chr_end   | chr_strand | description | external_db | external_name |
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
|  1 | embl | standard |       1 | AB001523.1.1.1  | 21       |  41942232 |  42033272 |          1 |             | protein_id  | BAA21099.1    |
|  2 | embl | standard |       2 | AB001523.1.1.2  | 21       |  42037154 |  42038833 |          1 |             | protein_id  | BAA21100.1    |
|  3 | embl | standard |       3 | AB015355.1.1.1  | 12       |  51605465 |  51627823 |         -1 |             | SPTREMBL    | O94801        |
|  4 | embl | pseudo   |       4 | AB019437.1.1.31 | 14       | 104129078 | 104129346 |         -1 |             | EMBL        | AB019437.1.1  |
|  5 | embl | standard |       5 | AB019437.1.1.23 | 14       | 104166399 | 104166849 |         -1 |             | protein_id  | BAA75024.1    |
|  6 | embl | standard |       6 | AB019437.1.1.15 | 14       | 104214187 | 104214630 |         -1 |             | protein_id  | BAA75022.1    |
|  7 | embl | pseudo   |       7 | AB019437.1.1.24 | 14       | 104163156 | 104163428 |         -1 |             |             |               |
|  8 | embl | standard |       8 | AB019437.1.1.16 | 14       | 104205297 | 104205735 |         -1 |             | protein_id  | BAA75023.1    |
|  9 | embl | pseudo   |       9 | AB019437.1.1.1  | 14       | 104323128 | 104323419 |         -1 |             |             |               |
| 10 | embl | pseudo   |      10 | AB019437.1.1.25 | 14       | 104157422 | 104157916 |         -1 |             |             |               |
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
10 rows in set (0.16 sec)

mysql> select gene_name, chr_start, (chr_end-chr_start) AS Length, description
    -> from gene
    -> where chr_name = "2" and chr_start > 20000000 and chr_end < 20500000
    -> limit 10 ;
+--------------------+-----------+--------+-----------------------------------------------------+
| gene_name          | chr_start | Length | description                                         |
+--------------------+-----------+--------+-----------------------------------------------------+
| ENSG00000118965    |  20001311 |  79861 |                                                     |
| ENSestG00000028348 |  20004285 |  18142 |                                                     |
| ENSestG00000028342 |  20023444 |  14129 |                                                     |
| ENSestG00000028189 |  20024182 |  33984 |                                                     |
| ENSestG00000028339 |  20044953 |  36215 |                                                     |
| ENSestG00000028192 |  20081242 |  14009 |                                                     |
| ENSestG00000028199 |  20083093 |   2338 |                                                     |
| ENSG00000132031    |  20083093 |  20642 | MATRILIN-3 PRECURSOR. [Source:SWISSPROT;Acc:O15232] |
| ENSestG00000028338 |  20085340 |   6263 |                                                     |
| ENSestG00000028337 |  20093048 |  10687 |                                                     |
+--------------------+-----------+--------+-----------------------------------------------------+
10 rows in set (0.15 sec)
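To make clear what the query above computes, the same selection can be mimicked in plain Perl on a few in-memory rows. This is only a sketch and does not talk to a database: the first two rows are taken from the output above (with chr_end worked out as chr_start + Length), and the third row is invented to show a gene that the filter rejects. Note that Length here is simply chr_end minus chr_start, as in the query; if both coordinates are inclusive, the number of bases spanned would be one more than that.

```perl
use strict;
use warnings;

# Rows shaped like the gene table. The first two come from the query output
# above; the third (HYPOTHETICAL_1) is made up for illustration.
my @genes = (
    { gene_name => 'ENSG00000118965', chr_name => '2',
      chr_start => 20001311, chr_end => 20081172, description => '' },
    { gene_name => 'ENSG00000132031', chr_name => '2',
      chr_start => 20083093, chr_end => 20103735,
      description => 'MATRILIN-3 PRECURSOR.' },
    { gene_name => 'HYPOTHETICAL_1', chr_name => '2',
      chr_start => 30000000, chr_end => 30001000, description => '' },
);

# Equivalent of:
#   where chr_name = "2" and chr_start > 20000000 and chr_end < 20500000
my @hits = grep {
    $_->{chr_name} eq '2'
        && $_->{chr_start} > 20_000_000
        && $_->{chr_end}   < 20_500_000
} @genes;

# Equivalent of:
#   select gene_name, chr_start, (chr_end-chr_start) AS Length, description
for my $g (@hits) {
    my $length = $g->{chr_end} - $g->{chr_start};
    printf "%-18s %9d %6d %s\n",
        $g->{gene_name}, $g->{chr_start}, $length, $g->{description};
}
```

In practice the database does this filtering far more efficiently, since chr_name is indexed in the table (the MUL key shown by describe), so the Perl version is only a way of reading the query.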

Using the Perl API
• API means “application programming interface”.
• This gives us a seamless way to access the MySQL database from Perl code.
• A Java API also exists for Ensembl.

A typical use of the API would involve:

1) Connecting to the database

use Bio::EnsEMBL::DBSQL::DBAdaptor;

This line has to be in all your Ensembl scripts.

my $host = 'kaka.sanger.ac.uk';
my $user = 'anonymous';
my $dbname = 'current';

These are the all-important variables telling Perl where and what your database is. And now we connect:

my $db = new Bio::EnsEMBL::DBSQL::DBAdaptor(-host => $host, -user => $user, -dbname => $dbname);

2) Use function calls to access the underlying data

my $clone = $db->get_Clone('AC005663');

The function returns a clone object (in Perl, an object is typically a blessed reference to an associative array), so its data is read through method calls:

print "Clone is " . $clone->id . "\n";

Which functions you can use and what data they return is outlined in the documentation.

3) We can get the API to return a large amount of data in an array

my @contigs = $clone->get_all_Contigs;

We now have an array of contig objects which are very useful for obtaining information. Say we want to get the sequence for each contig:

foreach my $contig (@contigs) {
    my $seqobj = $contig->primary_seq;
    my $length = $contig->length;
    my $id = $contig->id;
    print $seqobj->seq . "\n";
}

Note: These specific examples are taken from the Ensembl tutorial http://www.ensembl.org/Docs/ensembl_tutorial.pdf
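The three steps above rely on Perl objects: a call such as get_Clone hands back an object, and methods on it can return whole lists of further objects. The following self-contained sketch shows that pattern without needing the Ensembl libraries or a database connection. Toy::Clone and Toy::Contig are hypothetical stand-ins written for this example, not Ensembl classes, and the contig names and sequences are made up; only the method names id, length and get_all_Contigs are borrowed from the examples above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a contig object: a blessed hash with accessors.
package Toy::Contig;
sub new {
    my ($class, %args) = @_;
    return bless { id => $args{id}, seq => $args{seq} }, $class;
}
sub id     { $_[0]{id} }
sub seq    { $_[0]{seq} }
sub length { CORE::length($_[0]{seq}) }    # CORE:: avoids calling ourselves

# Hypothetical stand-in for a clone object that owns a list of contigs.
package Toy::Clone;
sub new {
    my ($class, %args) = @_;
    return bless { id => $args{id}, contigs => $args{contigs} || [] }, $class;
}
sub id              { $_[0]{id} }
sub get_all_Contigs { @{ $_[0]{contigs} } }    # returns a list of objects

package main;

# Build a toy clone by hand; with the real API this would come from the database.
my $clone = Toy::Clone->new(
    id      => 'AC005663',
    contigs => [
        Toy::Contig->new(id => 'toy_contig_1', seq => 'ACGT'),
        Toy::Contig->new(id => 'toy_contig_2', seq => 'GGCCAA'),
    ],
);

print "Clone is " . $clone->id . "\n";

# Same loop shape as step 3: one object per contig, queried through methods.
my @contigs = $clone->get_all_Contigs;
foreach my $contig (@contigs) {
    print $contig->id, "\t", $contig->length, "\t", $contig->seq, "\n";
}
```

Because the data sits behind method calls, a script does not need to know how the object stores it, which is what lets the real API fetch it from MySQL on demand.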

Further Information
• The Ensembl Project: www.ensembl.org
• Ensembl Trace Server: trace.ensembl.org
• Ensembl Distributed Annotation Server: servlet.sanger.ac.uk/das
• Human Genome Central Resources: www.ensembl.org/genome/central
• Distributed Annotation System: www.biodas.org