Published by Melanie Cook. Modified over 6 years ago.
1
The Ensembl Database www.ensembl.org Steven Jones August 18, 2004
Lecture 3.1 (c) 2004 CGDN
2
What is Ensembl?
• Public annotation of the mammalian genomes
• Open source software
• Relational database system
• The future of genomic bioinformatics?
3
The Ensembl Project
“Ensembl is a joint project between EMBL European Bioinformatics Institute and the Sanger Institute to develop a software system which produces and maintains automatic annotation on eukaryotic genomes. Ensembl is primarily funded by the Wellcome Trust.”
4
The Ensembl Genome Annotation
• Utilises raw DNA sequence data from public sources
• Creates a tracking database (the “Ensembl” database)
• Joins the sequences, based on a sequence scaffold or “Golden Path”
• Automatically finds genes and other features of the sequence
• Provides a publicly accessible web-based interface to the database
5
Ensembl Software System
• Makes extensive use of BioPerl
• Uses the free MySQL database
• The entire Ensembl code base is freely available under an Apache open source license
• Mainly written in Perl, with extensions in C
• Some viewers have been written in Java (e.g. Apollo)
6
Ensembl Software System
• The core design feature is the “virtual contig” (VC) object
• It allows genome sequence to be accessed as a single large contiguous sequence, although it is stored in the database as a collection of fragments
• The VC object handles reading and writing of the DNA sequence data
7
Ensembl Software System
• Storing each sequence as a single large entry would be impractical: e.g. the whole database entry would need to be changed if a single base changed
• Because the constituent DNA is stored as fragments, it is easy to move between database versions and assemblies
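The virtual contig idea can be sketched in a few lines. This is a minimal illustration in Python, not Ensembl's Perl implementation: the Fragment and VirtualContig names, the 0-based coordinates and the toy sequences are all assumptions made for the example, and the fragments are assumed to tile the assembly with no gaps or overlaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One stored piece of DNA and where it sits on the assembled sequence."""
    name: str
    start: int      # 0-based offset of this fragment on the assembly
    sequence: str


class VirtualContig:
    """Presents a set of stored fragments as one contiguous sequence."""

    def __init__(self, fragments):
        # Keep fragments ordered by their position on the assembly.
        self.fragments = sorted(fragments, key=lambda f: f.start)

    def length(self):
        last = self.fragments[-1]
        return last.start + len(last.sequence)

    def subseq(self, start, end):
        """Return assembly bases in the half-open range [start, end)."""
        pieces = []
        for frag in self.fragments:
            frag_end = frag.start + len(frag.sequence)
            # Skip fragments that do not overlap the requested window.
            if frag_end <= start or frag.start >= end:
                continue
            lo = max(start, frag.start) - frag.start
            hi = min(end, frag_end) - frag.start
            pieces.append(frag.sequence[lo:hi])
        return "".join(pieces)


# Toy fragments (invented data), deliberately given out of order.
fragments = [
    Fragment("frag_b", 8, "GGCCTTAA"),
    Fragment("frag_a", 0, "ACGTACGT"),
    Fragment("frag_c", 16, "TTTTCCCC"),
]
vc = VirtualContig(fragments)
window = vc.subseq(6, 18)   # spans all three fragments
print(vc.length(), window)
```

The caller asks for a window on the assembly and never sees the fragment boundaries. Correcting a base would mean replacing one small fragment rather than rewriting the whole sequence.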
8
Ensembl Software System
• The software can be downloaded by FTP
• It can also be accessed through CVS (Concurrent Versions System)
• It is possible to set up a mirror of the entire Ensembl system
9
Ensembl Now Supports a Number of Organisms
11
The Chromosome Overview
12
Entering through Disease Genes via the OMIM database
13
The Ensembl Gene Report
17
Expanding Annotation Features, e.g. Unigene and Human mRNA similarities
18
Ensembl links out to other databases to access individual entries
25
Direct Access to Ensembl
We will use the underlying Ensembl data. For reasons of simplicity and space we will use the lite version of the Ensembl database.

    {xhost01}/home/pubseq/acedb> mysql -u anonymous -h kaka.sanger.ac.uk
    Welcome to the MySQL monitor. Commands end with ; or \g.
    Your MySQL connection id is to server version:
    Type 'help' for help.

    mysql> use homo_sapiens_lite_8_30b ;
    Database changed
    mysql> show tables ;
    +-----------------------------------+
    | Tables_in_homo_sapiens_lite_8_30b |
    +-----------------------------------+
    | cpg                               |
    | eponine                           |
    | gene                              |
    | gene_xref                         |
    | genscan                           |
    | repeat                            |
    | repeat_types                      |
    | snp                               |
    | transcript                        |
    | trna                              |
    +-----------------------------------+
    10 rows in set (0.15 sec)
26
Explore the Database

    mysql> describe gene ;
    +---------------+------------------+------+-----+---------+----------------+
    | Field         | Type             | Null | Key | Default | Extra          |
    +---------------+------------------+------+-----+---------+----------------+
    | id            | int(10) unsigned |      | PRI | NULL    | auto_increment |
    | db            | varchar(40)      |      | MUL |         |                |
    | type          | varchar(40)      |      |     |         |                |
    | gene_id       | int(10) unsigned |      | MUL |         |                |
    | gene_name     | varchar(40)      |      | MUL | unknown |                |
    | chr_name      | varchar(20)      |      | MUL |         |                |
    | chr_start     | int(10) unsigned |      |     |         |                |
    | chr_end       | int(10) unsigned |      |     |         |                |
    | chr_strand    | tinyint(4)       |      |     |         |                |
    | description   | varchar(255)     |      |     |         |                |
    | external_db   | varchar(40)      |      |     |         |                |
    | external_name | varchar(40)      |      |     |         |                |
    +---------------+------------------+------+-----+---------+----------------+
    12 rows in set (0.14 sec)
27
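    mysql> select * from gene limit 10 ;
    +----+------+----------+---------+-----------+----------+-----------+---------+------------+-------------+-------------+---------------+
    | id | db   | type     | gene_id | gene_name | chr_name | chr_start | chr_end | chr_strand | description | external_db | external_name |
    +----+------+----------+---------+-----------+----------+-----------+---------+------------+-------------+-------------+---------------+
    |  1 | embl | standard |         | AB        |          |           |         |            |             | protein_id  | BAA           |
    |  2 | embl | standard |         | AB        |          |           |         |            |             | protein_id  | BAA           |
    |  3 | embl | standard |         | AB        |          |           |         |            |             | SPTREMBL    | O             |
    |  4 | embl | pseudo   |         | AB        |          |           |         |            |             | EMBL        | AB            |
    |  5 | embl | standard |         | AB        |          |           |         |            |             | protein_id  | BAA           |
    |  6 | embl | standard |         | AB        |          |           |         |            |             | protein_id  | BAA           |
    |  7 | embl | pseudo   |         | AB        |          |           |         |            |             |             |               |
    |  8 | embl | standard |         | AB        |          |           |         |            |             | protein_id  | BAA           |
    |  9 | embl | pseudo   |         | AB        |          |           |         |            |             |             |               |
    | 10 | embl | pseudo   |         | AB        |          |           |         |            |             |             |               |
    +----+------+----------+---------+-----------+----------+-----------+---------+------------+-------------+-------------+---------------+
    10 rows in set (0.16 sec)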
28
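    mysql> select gene_name, chr_start, (chr_end-chr_start) AS Length, description from gene where chr_name = "2" and chr_start > and chr_end < limit 10 ;
    +-----------+-----------+--------+-----------------------------------------------------+
    | gene_name | chr_start | Length | description                                         |
    +-----------+-----------+--------+-----------------------------------------------------+
    | ENSG      |           |        |                                                     |
    | ENSestG   |           |        |                                                     |
    | ENSestG   |           |        |                                                     |
    | ENSestG   |           |        |                                                     |
    | ENSestG   |           |        |                                                     |
    | ENSestG   |           |        |                                                     |
    | ENSestG   |           |        |                                                     |
    | ENSG      |           |        | MATRILIN-3 PRECURSOR. [Source:SWISSPROT;Acc:O15232] |
    | ENSestG   |           |        |                                                     |
    | ENSestG   |           |        |                                                     |
    +-----------+-----------+--------+-----------------------------------------------------+
    10 rows in set (0.15 sec)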
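The same kind of region query can be tried without access to the Sanger server. The sketch below uses Python's built-in sqlite3 module in place of MySQL, with a cut-down gene table holding only the columns the query needs. The rows and the coordinate window are invented for illustration and are not Ensembl data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gene (
        id INTEGER PRIMARY KEY,
        gene_name TEXT,
        chr_name TEXT,
        chr_start INTEGER,
        chr_end INTEGER,
        description TEXT
    )
    """
)
# Invented rows for illustration only; these are not real Ensembl records.
rows = [
    ("toy_gene_1", "2", 1500, 4200, ""),
    ("toy_gene_2", "2", 9000, 9800, "example description"),
    ("toy_gene_3", "2", 50000, 61000, ""),
    ("toy_gene_4", "7", 2000, 3000, ""),
]
conn.executemany(
    "INSERT INTO gene (gene_name, chr_name, chr_start, chr_end, description) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)

# Hypothetical window on chromosome 2 (assumed values, chosen for the toy data).
region_start, region_end = 1000, 10000
query = """
    SELECT gene_name, chr_start, (chr_end - chr_start) AS Length, description
    FROM gene
    WHERE chr_name = ? AND chr_start > ? AND chr_end < ?
    ORDER BY chr_start
    LIMIT 10
"""
hits = conn.execute(query, ("2", region_start, region_end)).fetchall()
for hit in hits:
    print(hit)
```

Only genes lying entirely inside the window are returned, because the query tests both chr_start and chr_end. Passing the bounds as parameters, rather than pasting them into the SQL string, keeps the query reusable for other regions.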
29
Using the Perl API
• API means “application programming interface”
• This gives us a seamless way to access the MySQL database from Perl code
• A Java API also exists for Ensembl
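What an API layer of this kind does can be shown with a small sketch: the calling code works with objects and method calls, and the SQL stays hidden inside one class. The example is in Python with an in-memory SQLite database. ToyGeneStore, Gene and the single record are hypothetical names and data made up for the illustration; they are not part of the Ensembl Perl or Java APIs.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Gene:
    """Plain object handed back to callers instead of a raw database row."""
    name: str
    chr_name: str
    start: int
    end: int

    def length(self):
        return self.end - self.start


class ToyGeneStore:
    """Hypothetical API layer: callers use methods, the SQL stays in here."""

    def __init__(self, conn):
        self.conn = conn

    def fetch_by_name(self, name):
        row = self.conn.execute(
            "SELECT gene_name, chr_name, chr_start, chr_end "
            "FROM gene WHERE gene_name = ?",
            (name,),
        ).fetchone()
        # Return None when no gene with that name exists.
        return Gene(*row) if row else None


# Toy in-memory database with one invented record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_name TEXT, chr_name TEXT, "
    "chr_start INTEGER, chr_end INTEGER)"
)
conn.execute("INSERT INTO gene VALUES ('toy_gene', '2', 100, 350)")

store = ToyGeneStore(conn)
gene = store.fetch_by_name("toy_gene")
print(gene.name, gene.chr_name, gene.length())
missing = store.fetch_by_name("no_such_gene")
```

If the table layout changes between database releases, only the class that holds the SQL needs updating; scripts written against the methods keep working.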
30
A typical use of the API would involve:
1) Connecting to the database

    use Bio::EnsEMBL::DBSQL::DBAdaptor;

This line has to be in all your Ensembl scripts.

    my $host   = 'kaka.sanger.ac.uk';
    my $user   = 'anonymous';
    my $dbname = 'current';

These are the all-important variables telling Perl where and what your database is. And now we connect:

    my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
        -host   => $host,
        -user   => $user,
        -dbname => $dbname,
    );
31
2) Use function calls to access the underlying data

    my $clone = $db->get_Clone('AC005663');

The function returns a clone object (in Perl, typically a blessed reference to an associative array), whose methods can then be called:

    print "Clone is " . $clone->id . "\n";

Which functions you can use and what data they return are outlined in the documentation.
32
3) We can get the API to return a large amount of data in an array

    my @contigs = $clone->get_all_Contigs;

We now have an array of contig objects, which are very useful for obtaining information. Say we want to get the sequence for each contig:

    foreach my $contig (@contigs) {
        my $seqobj = $contig->primary_seq;
        my $length = $contig->length;
        my $id     = $contig->id;
        print $seqobj->seq . "\n";
    }

Note: These specific examples are taken from the Ensembl tutorial.
33
Further Information
• The Ensembl Project: www.ensembl.org
• Ensembl Trace Server: trace.ensembl.org
• Ensembl Distributed Annotation Server: servlet.sanger.ac.uk/das
• Human Genome Central Resources
• Distributed Annotation System