Published by Randolph Lambert. Modified over 9 years ago.
1
EnsEMBL: Opening up the whole Genome
Philip Lijnzaad
lijnzaad@ebi.ac.uk
2
Overview
what
how (science, hardware, software)
results
families and descriptions
tour
people
3
What is EnsEMBL
Automatic annotation of the complete Human Genome
–genes
–other: markers, SNPs, homologies, etc.
completely open
–data, software, discussions
portable, downloadable
‘the Linux of the Human Genome’
4
From... TCTTCTCCTTCAAGGCATCCAGGTTACCCCGGACAATAAGAGGGGAACAAGCTCTTTGTT TTGCCAAGCGGTGGAAGCTTCAGGAAAGGTGCCCGGCCCCTTAGGAGGAAAACCGGGGAA CAAGACCCGCAGTTTTTGCCTTCCCAACTTCCAGTGGGCCCAAAAAAACTTGGGGCGCCC AGGGTCCCCAAAAGAGAGAGCCACGCTGGGGCCGGGTTCCTGCTTTTAATATCCAGGAAA AGGGGGGGAGGGGTATTCCCCCTTCCTCATTAAGATAAAAGACTCCCCCTCGTACTTATG GGTCCTTTACGGTTGGGCATGGGGCGAAAAAAGGGAGCGCCCCGGTGGACTTAATCGTAT TTTAACACACCCCCCGGGATATTTAAAGTCGGGGTAGGGCTGTTTGAAAATATTCAATGT GGGGGGCTTTTTGACACGCCCGTTTATATTGTTCTGGGACGCGCGTGAGGGGGGTAGACA AGAGGTGTGTAAGCCGTGCTTTATTATCCTCGCGTAGACACGCGTTAGCATGTAGTGGTG TTACCTGGTCGCGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCTT CTCTACTAAAAACCCAAAAATTTGCCAGACACGTGGAGAGCGAGACTTCATCTCAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCTAAGAGTTGTTATTTCTGAGGTAGAAT AACTAATGATCTTATCTTCTCTTTTTTCTTTTCTTCAAGATGGGGTCTTGCTTTGTCACC CAGGCCAGAGGGCAGTGGCACAATCATAGTTCACTGCAGCTTCAAACACCTGAACTCAAG CAATCTTCCCCGCTCATACTGCTCCCCAGCACCAGGAGCTGGGACTACAGGCACACGTCA CCACATCCGGCTAATTTTTTTTTTCTTTTGGTGGGTAGAGACGGGGGCCTCACTATGTTG
5
… to: MHSSGSSGKGAGPLRGKTSGTEPADFALPSSRGGPGKLRCYQTNLSSFS SPRKGVSQTGTPVCEEDGDAGLGIRQGGKAPVTPRGRGRRGRPPSRTTG TRETAVPGPLGIEDISPNLSPDDKSFSRVVPRVPDSTRRTDVGAGALRR SDSPEIPFQAAAGPSDGLDASSPGNSFVGLRVVAKWSSNGYFYSGKITR DVGAGKYKLLFDDGYECDVLGKDILLCDPIPLDTEVTALSEDEYFSAGV VKGHRKESGELYYSIEKEGQRKWYKRMAVILSLEQGNRLREQYGLGPYE AVTPLTKAADISLDNLVEGKRKRRSNVSSPATPTASSSSSTTPTRKITE SPRASMGVLSGKRKLITSEEERSPAKRGRKSATVKPGAVGAGEFVSPCE SGDNTGEPSALEEQRGPLPLNKTLFLGYAFLLTMATTSDKLASRSKLPD GPTGSSEEEEEFLEIPPFNKQYTESQLRAGAGYILEDFNEAQCNTAYQC
6
Take: Draft human genome
–clones and contigs from public databases
–not finished: errors, gaps
–Golden Path: assembly of all contigs into (nearly) complete chromosomes
7
Then:
Get rid of repeats
Targeted searches
–pmatch to ‘find back’ known proteins from SWISSPROT, SP-TrEMBL and RefSeq
–GeneWise and EST2Genome to build the genes
fill in coding sequences and UTRs
8
And then:
Similarity searches
–GenScan on raw contigs
–its peptides are searched against protein, mRNA and EST databases
–genes are built using GeneWise on promising regions
–additionally, exons can be used
All predictions supported by evidence!
9
Add cross references:
HUGO (HGNC)
SwissProt/TrEMBL, RefSeq
EMBL
OMIM
LocusLink
InterPro
10
Add yet more
GeneTribe families
Gene descriptions
Markers
SNPs
external annotations (EMBL)
mouse traces
...
11
Hardware
360 Alphas: DS10, dual EV6 processors, 1 GB memory
200 other nodes
10 days to do a complete BLAST + gene build
~30 million jobs
~30 GB
12
Software
Digital Unix
Apache
relational database (MySQL)
mostly Perl, some C and Java
–BioPerl, BioJava, BioCORBA
LSF
AltaVista
13
Software (2) Wiki Web CVS (~100 Mb) Code review, data review Testing conventions Interfaces VirtualContigs CORBA/Java
14
IDs for genes, transcripts, exons, peptides, families
ENSXnnnnnnnnnnn (e.g. ENSG00000067369)
–X denotes which type:
G = gene
T = transcript
E = exon
P = peptide (translation)
F = family
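The ID scheme on this slide can be checked mechanically. The following is a minimal sketch, not EnsEMBL's own code: the function name `parse_ensembl_id` is hypothetical, and the fixed width of 11 digits is an assumption read off the pattern and the example ID on the slide.

```python
import re

# Type letters exactly as listed on the slide.
ID_TYPES = {
    "G": "gene",
    "T": "transcript",
    "E": "exon",
    "P": "peptide",
    "F": "family",
}

# "ENS", one type letter, then 11 digits (assumed width, taken from the
# slide's example ENSG00000067369).
_ID_PATTERN = re.compile(r"^ENS([GTEPF])(\d{11})$")


def parse_ensembl_id(identifier):
    """Return (type_name, number) for a well-formed ID, or None otherwise."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        return None
    letter, digits = match.groups()
    return ID_TYPES[letter], int(digits)


print(parse_ensembl_id("ENSG00000067369"))
```

Keeping the type letter in the identifier means a tool can tell a gene from a transcript without a database lookup, which is presumably why the scheme is laid out this way.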
15
IDs (2)
IDs should be stable
difficult, because underlying data keeps being refined!
ID mapping
version numbers
16
Results
Latest release: 1.1 (17 July)
–Web code version: 1.1.1 (1 Aug.)
April 2001 dataset
4,318,661,441 basepairs
143,479 exons
23,931 transcripts
21,921 genes (‘confirmed’)
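Two rough ratios follow from the counts on this slide. The snippet below is only back-of-the-envelope arithmetic on the slide's own numbers; the helper name `ratio` is made up for illustration, and the exons-per-transcript figure assumes each exon is counted once, so it is approximate if transcripts share exons.

```python
# Counts copied from the Results slide (April 2001 dataset).
EXONS = 143_479
TRANSCRIPTS = 23_931
GENES = 21_921


def ratio(numerator, denominator):
    """Plain division, rounded to two decimals for display."""
    return round(numerator / denominator, 2)


exons_per_transcript = ratio(EXONS, TRANSCRIPTS)
transcripts_per_gene = ratio(TRANSCRIPTS, GENES)
print(exons_per_transcript, transcripts_per_gene)  # prints: 6.0 1.09
```

In other words, the build averages about six exons per transcript and only slightly more than one transcript per gene.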
17
Errors
Missing data
Misassembly
Misidentification (pseudo-gene, paralog)
Sequencing errors
–in Human Genome Data
–in supporting databases
Bugs
GenScan tuning
GeneWise tuning
18
Gene Families
Cluster EnsEMBL peptides together with SwissProt and SP-TrEMBL
–vertebrate
GeneTRIBE - Automatic Protein Family detection using Markov Clustering. Enright, van Dongen & Ouzounis (in preparation)
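Markov Clustering, as named on this slide, alternates two steps on a column-stochastic matrix built from a similarity graph: expansion (squaring the matrix) and inflation (raising each entry to a power and renormalising each column). The sketch below is a generic pure-Python toy version of that idea and is not GeneTRIBE's code. The function names, the added self-loops, the inflation value of 2.0 and the fixed iteration count are all assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = m[i][j] / total
    return out


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation step: element-wise power, then renormalise columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def markov_cluster(adjacency, inflation=2.0, iterations=50):
    """Cluster the nodes of a graph given as a symmetric 0/1 matrix."""
    n = len(adjacency)
    # Add self-loops (a common convention) before normalising.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each row that still carries weight lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy input: two disconnected triangles (hypothetical similarity graph).
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(two_triangles))  # prints: [[0, 1, 2], [3, 4, 5]]
```

For protein families the matrix entries would come from pairwise sequence similarity scores rather than 0/1 edges, and a real run needs sparse matrices; neither is attempted here.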
19
Family descriptions
distill consensus descriptions
–using SwissProt DE-lines
–may not work => unknown
Transfer peptide’s family assignment to gene
–resolve conflicts: choose family that has best description
–unknown < hypothetical < fragment < cDNA
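The conflict rule on this slide can be read as a ranking of descriptions. Here is a minimal sketch under that reading, assuming that anything not matching the four weak keywords counts as the best kind of description. The function names, the keyword matching by substring, and the example family IDs and descriptions are all hypothetical.

```python
# Weakest first, following the slide: unknown < hypothetical < fragment < cDNA.
_WEAK_KEYWORDS = ["unknown", "hypothetical", "fragment", "cdna"]


def description_rank(description):
    """Return a rank; higher means a more informative description.

    If several weak keywords occur, the weakest one decides the rank.
    Descriptions with none of them get the top rank (an assumption).
    """
    text = description.lower()
    for rank, keyword in enumerate(_WEAK_KEYWORDS):
        if keyword in text:
            return rank
    return len(_WEAK_KEYWORDS)


def choose_family(candidates):
    """Pick the (family_id, description) pair with the best description."""
    return max(candidates, key=lambda pair: description_rank(pair[1]))


# Hypothetical conflict: one gene whose peptides fall into three families.
candidates = [
    ("ENSF00000000001", "UNKNOWN"),
    ("ENSF00000000002", "HYPOTHETICAL PROTEIN"),
    ("ENSF00000000003", "ZINC FINGER PROTEIN"),
]
print(choose_family(candidates)[0])  # prints: ENSF00000000003
```

On a tie, `max` keeps the first candidate it sees; the slide does not say how ties are broken.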
20
Family statistics:
13,811 families
–7,284 ‘unknown’ description
128,828 members
–21,894 ENS genes
–23,867 ENS peptides
21
Family statistics (2)
6759    1 member
3457    2-10 members
215     10-100
4       > 100
max is 483 (zinc finger)
22
Gene descriptions
Use SwissProt DE-line if known
use Family if not
Statistics:
–18,053 descriptions
13,202 from SwissProt
4,851 from family description
–3,868 still UNKNOWNs
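The fallback order on this slide, and the arithmetic behind its statistics, can be sketched as follows. The function name `describe_gene` and the example description are hypothetical; the counts are the slide's own, and the total is compared with the 21,921 genes reported on the Results slide.

```python
def describe_gene(swissprot_de, family_description):
    """Return (description, source) using the slide's fallback order."""
    if swissprot_de:
        return swissprot_de, "SwissProt"
    if family_description and family_description.upper() != "UNKNOWN":
        return family_description, "family"
    return "UNKNOWN", None


# Consistency check on the slide's numbers.
FROM_SWISSPROT = 13_202
FROM_FAMILY = 4_851
STILL_UNKNOWN = 3_868

described = FROM_SWISSPROT + FROM_FAMILY
total_genes = described + STILL_UNKNOWN
print(described, total_genes)  # prints: 18053 21921

# Made-up inputs, only to show the fallback.
print(describe_gene(None, "ZINC FINGER PROTEIN"))
```

The two sources add up to the 18,053 descriptions quoted, and adding the UNKNOWNs gives 21,921, the gene count from the Results slide, so roughly 82% of genes carry a description.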
23
Entry points
http://www.ensembl.org
ID search
text search
OMIM disease
Browse chromosomes
BLAST
25
TextSearch
26
DiseaseView
27
BLAST/SSAHA
28
MapView
29
MarkerView
30
ContigView
31
ContigView (2)
32
ContigView (3)
33
DAS annotations
34
Apollo
35
ExportView
36
GeneView
37
GeneView (2)
38
ProteinView
39
ProteinView (2)
40
ExpressionView
41
DomainView
42
FamilyView
43
Recent developments
HelpDesk
DAS
–Adding annotations from anywhere
Apollo
–Genome viewer
Expression data
–SAGE
44
Future
Better genes!
Alignments
Other genomes
–Comparative Genomics
CORBA/Java
More protein-structural links
–Scop profiles
IGI/IPI
Entity infra-structure
45
Links
http://www.ensembl.org
–dev.ensembl.org
http://www.ensembl.org/genome/central
http://genome.ucsc.edu
http://compbio.ornl.gov/channel
http://ncbi.nlm.gov/genome/guide/human
http://www.biodas.org
http://www.bio{perl,xml,corba,python,java}.org
46
Acknowledgements
Ewan Birney, Michele Clamp, Tim Hubbard, Tony Cox, Elia Stupka, Arek Kasprzyk, Arne Stabenau, James Stalker, James Cuff, James Smith, Simon Potter, Manu Mongin, Val Curwen, Guy Slater, Richard Durbin, Craig Melsopp, Alistair Rust, Chris Mungall, Jim Kent and many, many more
47
Join!
http://www.ensembl.org
mailing lists
–ensembl-dev@ebi.ac.uk
–ensembl-announce@ebi.ac.uk
–(see http://www.ensembl.org/Dev/Lists)