The Ensembl Database www.ensembl.org Steven Jones August 18, 2004 Lecture 3.1 (c) 2004 CGDN
What is Ensembl?
• Public annotation of the mammalian genomes
• Open source software
• Relational database system
• The future of genomic bioinformatics?
The Ensembl Project
“Ensembl is a joint project between EMBL European Bioinformatics Institute and the Sanger Institute to develop a software system which produces and maintains automatic annotation on eukaryotic genomes. Ensembl is primarily funded by the Wellcome Trust.”
The Ensembl Genome Annotation
• Utilises raw DNA sequence data from public sources
• Creates a tracking database (the “Ensembl” database)
• Joins the sequences, based on a sequence scaffold or “Golden Path”
• Automatically finds genes and other features of the sequence
• Provides a publicly accessible web-based interface to the database
Ensembl Software System
• Makes extensive use of BioPerl (www.bioperl.org)
• Uses the free MySQL database
• The entire Ensembl code base is freely available under an Apache open source license
• Mainly written in Perl, with extensions in C
• Some viewers have been written in Java (e.g. Apollo)
Ensembl Software System
• The core design feature is the “virtual contig” (VC) object.
• It allows the genome sequence to be accessed as a single large contiguous sequence, although it is stored in the database as a collection of fragments.
• The VC object handles reading and writing of the DNA sequence data.
Ensembl Software System
• Storing each sequence as a single large entry would be impractical: for example, the whole database entry would need to be rewritten if a single base changed.
• Because the constituent DNA is stored as fragments, it is easy to move between database versions and assemblies.
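The virtual contig idea can be illustrated with a small sketch. This is not Ensembl code: it is a Python stand-in (the real system is written in Perl), and the class names, fragment names, sequences, and the choice to pad gaps with N are all assumptions made for illustration. It only shows how a region can be read as one contiguous string while the DNA stays stored in separate fragments.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Fragment:
    """One stored piece of DNA and its position on the assembled sequence."""
    name: str   # hypothetical fragment identifier
    start: int  # 1-based start position on the assembly
    seq: str    # the DNA stored for this fragment

    @property
    def end(self) -> int:
        return self.start + len(self.seq) - 1


class VirtualContig:
    """Presents non-overlapping fragments as one contiguous sequence."""

    def __init__(self, fragments):
        self.fragments = sorted(fragments, key=lambda f: f.start)
        self._starts = [f.start for f in self.fragments]

    def subseq(self, start: int, end: int) -> str:
        """Return bases start..end (1-based, inclusive).

        Positions not covered by any fragment are returned as 'N'
        (an assumption of this sketch).
        """
        if start < 1 or end < start:
            raise ValueError("need 1 <= start <= end")
        pieces = []
        pos = start
        # Index of the last fragment that starts at or before pos.
        i = max(bisect_right(self._starts, pos) - 1, 0)
        while pos <= end and i < len(self.fragments):
            frag = self.fragments[i]
            if frag.end < pos:
                # Fragment lies entirely before the current position.
                i += 1
            elif frag.start > pos:
                # Gap before the next fragment: pad with N.
                gap_end = min(frag.start - 1, end)
                pieces.append("N" * (gap_end - pos + 1))
                pos = gap_end + 1
            else:
                # Copy the overlapping part of this fragment.
                take_to = min(frag.end, end)
                pieces.append(frag.seq[pos - frag.start:take_to - frag.start + 1])
                pos = take_to + 1
                i += 1
        if pos <= end:
            # Requested region runs past the last fragment.
            pieces.append("N" * (end - pos + 1))
        return "".join(pieces)


if __name__ == "__main__":
    vc = VirtualContig([
        Fragment("frag_a", 1, "ACGTACGT"),
        Fragment("frag_b", 9, "GGCC"),
        Fragment("frag_c", 15, "TTTT"),
    ])
    print(vc.subseq(7, 16))
```

Correcting a base in this model means replacing one small fragment and leaving the others untouched, which is the practical advantage the slide describes.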
Ensembl Software System
• The software can be downloaded by FTP
• It can also be accessed through CVS (Concurrent Versions System)
• It is possible to set up a mirror of the entire Ensembl system
Ensembl Now Supports a Number of Organisms
The Chromosome Overview
Entering through Disease Genes via the OMIM database
The Ensembl Gene Report
Expanding Annotation Features, e.g. Unigene and Human mRNA similarities
Ensembl links out to other databases to access individual entries
Direct Access to Ensembl
We will use the underlying Ensembl data directly. For reasons of simplicity and space, we will use the “lite” version of the Ensembl database.
{xhost01}/home/pubseq/acedb> mysql -u anonymous -h kaka.sanger.ac.uk
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 11928 to server version: 3.23.32
Type 'help' for help.
mysql> use homo_sapiens_lite_8_30b ;
Database changed
mysql> show tables ;
+-----------------------------------+
| Tables_in_homo_sapiens_lite_8_30b |
+-----------------------------------+
| cpg                               |
| eponine                           |
| gene                              |
| gene_xref                         |
| genscan                           |
| repeat                            |
| repeat_types                      |
| snp                               |
| transcript                        |
| trna                              |
+-----------------------------------+
10 rows in set (0.15 sec)
Explore the Database
mysql> describe gene ;
+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| id            | int(10) unsigned |      | PRI | NULL    | auto_increment |
| db            | varchar(40)      |      | MUL |         |                |
| type          | varchar(40)      |      |     |         |                |
| gene_id       | int(10) unsigned |      | MUL | 0       |                |
| gene_name     | varchar(40)      |      | MUL | unknown |                |
| chr_name      | varchar(20)      |      | MUL |         |                |
| chr_start     | int(10) unsigned |      |     | 0       |                |
| chr_end       | int(10) unsigned |      |     | 0       |                |
| chr_strand    | tinyint(4)       |      |     | 0       |                |
| description   | varchar(255)     |      |     |         |                |
| external_db   | varchar(40)      |      |     |         |                |
| external_name | varchar(40)      |      |     |         |                |
+---------------+------------------+------+-----+---------+----------------+
12 rows in set (0.14 sec)
mysql> select * from gene limit 10 ;
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
| id | db   | type     | gene_id | gene_name       | chr_name | chr_start | chr_end   | chr_strand | description | external_db | external_name |
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
|  1 | embl | standard |       1 | AB001523.1.1.1  | 21       |  41942232 |  42033272 |          1 |             | protein_id  | BAA21099.1    |
|  2 | embl | standard |       2 | AB001523.1.1.2  | 21       |  42037154 |  42038833 |          1 |             | protein_id  | BAA21100.1    |
|  3 | embl | standard |       3 | AB015355.1.1.1  | 12       |  51605465 |  51627823 |         -1 |             | SPTREMBL    | O94801        |
|  4 | embl | pseudo   |       4 | AB019437.1.1.31 | 14       | 104129078 | 104129346 |         -1 |             | EMBL        | AB019437.1.1  |
|  5 | embl | standard |       5 | AB019437.1.1.23 | 14       | 104166399 | 104166849 |         -1 |             | protein_id  | BAA75024.1    |
|  6 | embl | standard |       6 | AB019437.1.1.15 | 14       | 104214187 | 104214630 |         -1 |             | protein_id  | BAA75022.1    |
|  7 | embl | pseudo   |       7 | AB019437.1.1.24 | 14       | 104163156 | 104163428 |         -1 |             |             |               |
|  8 | embl | standard |       8 | AB019437.1.1.16 | 14       | 104205297 | 104205735 |         -1 |             | protein_id  | BAA75023.1    |
|  9 | embl | pseudo   |       9 | AB019437.1.1.1  | 14       | 104323128 | 104323419 |         -1 |             |             |               |
| 10 | embl | pseudo   |      10 | AB019437.1.1.25 | 14       | 104157422 | 104157916 |         -1 |             |             |               |
+----+------+----------+---------+-----------------+----------+-----------+-----------+------------+-------------+-------------+---------------+
10 rows in set (0.16 sec)
mysql> select gene_name, chr_start, (chr_end-chr_start) AS Length, description from gene where chr_name = "2" and chr_start > 20000000 and chr_end < 20500000 limit 10 ;
+--------------------+-----------+--------+-----------------------------------------------------+
| gene_name          | chr_start | Length | description                                         |
+--------------------+-----------+--------+-----------------------------------------------------+
| ENSG00000118965    |  20001311 |  79861 |                                                     |
| ENSestG00000028348 |  20004285 |  18142 |                                                     |
| ENSestG00000028342 |  20023444 |  14129 |                                                     |
| ENSestG00000028189 |  20024182 |  33984 |                                                     |
| ENSestG00000028339 |  20044953 |  36215 |                                                     |
| ENSestG00000028192 |  20081242 |  14009 |                                                     |
| ENSestG00000028199 |  20083093 |   2338 |                                                     |
| ENSG00000132031    |  20083093 |  20642 | MATRILIN-3 PRECURSOR. [Source:SWISSPROT;Acc:O15232] |
| ENSestG00000028338 |  20085340 |   6263 |                                                     |
| ENSestG00000028337 |  20093048 |  10687 |                                                     |
+--------------------+-----------+--------+-----------------------------------------------------+
10 rows in set (0.15 sec)
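The region query above can be tried offline with a small sketch. It is written in Python with an in-memory SQLite database, not MySQL, so nothing here touches the Sanger server. The table holds only five of the gene columns, and the rows are copied from the outputs shown above, with chr_end for the chromosome 2 genes derived as chr_start + Length. The helper names and the added ORDER BY clause (which makes the row order deterministic) are assumptions of the sketch.

```python
import sqlite3

# Rows taken from the slide output: (gene_name, chr_name, chr_start, chr_end, description).
# chr_end for the chromosome 2 genes is derived as chr_start + Length.
ROWS = [
    ("ENSG00000118965", "2", 20001311, 20081172, ""),
    ("ENSestG00000028199", "2", 20083093, 20085431, ""),
    ("ENSG00000132031", "2", 20083093, 20103735,
     "MATRILIN-3 PRECURSOR. [Source:SWISSPROT;Acc:O15232]"),
    ("AB001523.1.1.1", "21", 41942232, 42033272, ""),
]


def build_demo_db():
    """Create an in-memory table with a subset of the 'gene' columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene ("
        "gene_name TEXT, chr_name TEXT, "
        "chr_start INTEGER, chr_end INTEGER, description TEXT)"
    )
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def genes_in_region(conn, chr_name, region_start, region_end, limit=10):
    """Genes lying entirely inside (region_start, region_end) on one chromosome."""
    sql = (
        "SELECT gene_name, chr_start, (chr_end - chr_start) AS length, description "
        "FROM gene "
        "WHERE chr_name = ? AND chr_start > ? AND chr_end < ? "
        "ORDER BY chr_start, gene_name "
        "LIMIT ?"
    )
    # Placeholders (?) keep the values out of the SQL text itself.
    return conn.execute(sql, (chr_name, region_start, region_end, limit)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for row in genes_in_region(conn, "2", 20000000, 20500000):
        print(row)
```

The chromosome 21 gene is filtered out by the chr_name condition, just as genes outside the 20.0 to 20.5 Mb window would be by the two coordinate conditions.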
Using the Perl API
• API stands for “application programming interface”.
• It gives us a seamless way to access the MySQL database from Perl code.
• A Java API also exists for Ensembl.
A typical use of the API involves:
1) Connecting to the database
    use Bio::EnsEMBL::DBSQL::DBAdaptor;
This line has to be in all your Ensembl scripts.
    my $host   = 'kaka.sanger.ac.uk';
    my $user   = 'anonymous';
    my $dbname = 'current';
These all-important variables tell Perl where your database is and which one to use. And now we connect:
    my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(-host => $host, -user => $user, -dbname => $dbname);
2) Using function calls to access the underlying data
    my $clone = $db->get_Clone('AC005663');
The call returns a clone object (a Perl object, typically built on an associative array, i.e. a hash), whose methods we can then call:
    print "Clone is " . $clone->id . "\n";
Which functions you can use, and what data they return, is outlined in the documentation.
3) We can get the API to return a large amount of data in an array
    my @contigs = $clone->get_all_Contigs;
We now have an array of contig objects, which are very useful for obtaining information. Say we want to get the sequence for each contig:
    foreach my $contig (@contigs) {
        my $seqobj = $contig->primary_seq;
        my $length = $contig->length;
        my $id     = $contig->id;
        print $seqobj->seq . "\n";
    }
Note: these specific examples are taken from the Ensembl tutorial, http://www.ensembl.org/Docs/ensembl_tutorial.pdf
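The three steps follow one pattern: an adaptor hands back an object, and that object hands back a list of further objects to loop over. The sketch below mimics the pattern with toy Python classes so that it can be run without a database. It is not the Ensembl API: ToyAdaptor, Clone, Contig, contig_report, the contig identifiers, and the sequences are all hypothetical, and only the clone accession AC005663 comes from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Contig:
    """Toy stand-in for a contig object: an identifier plus its sequence."""
    id: str
    seq: str

    @property
    def length(self) -> int:
        return len(self.seq)


@dataclass
class Clone:
    """Toy stand-in for a clone object that owns several contigs."""
    id: str
    contigs: list = field(default_factory=list)

    def get_all_contigs(self):
        # Return a copy so callers cannot alter the clone's own list.
        return list(self.contigs)


class ToyAdaptor:
    """Toy stand-in for a database adaptor: looks clones up by identifier."""

    def __init__(self, clones):
        self._clones = {clone.id: clone for clone in clones}

    def get_clone(self, clone_id):
        if clone_id not in self._clones:
            raise KeyError(f"no clone with id {clone_id!r}")
        return self._clones[clone_id]


def contig_report(clone):
    """Collect (id, length, sequence) for every contig of a clone."""
    return [(c.id, c.length, c.seq) for c in clone.get_all_contigs()]


if __name__ == "__main__":
    # Made-up contig identifiers and sequences, for illustration only.
    db = ToyAdaptor([
        Clone("AC005663", [Contig("contig_1", "ACGT"), Contig("contig_2", "GGCCTA")]),
    ])
    clone = db.get_clone("AC005663")
    print("Clone is", clone.id)
    for contig_id, length, seq in contig_report(clone):
        print(contig_id, length, seq)
```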
Further Information
• The Ensembl Project: www.ensembl.org
• Ensembl Trace Server: trace.ensembl.org
• Ensembl Distributed Annotation Server: servlet.sanger.ac.uk/das
• Human Genome Central Resources: www.ensembl.org/genome/central
• Distributed Annotation System: www.biodas.org