EnsEMBL Opening up the whole Genome Philip Lijnzaad
Overview what how (science, hardware software) results families and descriptions tour people
What is EnsEMBL Automatic Annotation of complete Human Genome –genes –other: markers, SNPs, homologies, etc. completely open –data, software, discussions portable, downloadable ‘the Linux of the Human Genome’
From... TCTTCTCCTTCAAGGCATCCAGGTTACCCCGGACAATAAGAGGGGAACAAGCTCTTTGTT TTGCCAAGCGGTGGAAGCTTCAGGAAAGGTGCCCGGCCCCTTAGGAGGAAAACCGGGGAA CAAGACCCGCAGTTTTTGCCTTCCCAACTTCCAGTGGGCCCAAAAAAACTTGGGGCGCCC AGGGTCCCCAAAAGAGAGAGCCACGCTGGGGCCGGGTTCCTGCTTTTAATATCCAGGAAA AGGGGGGGAGGGGTATTCCCCCTTCCTCATTAAGATAAAAGACTCCCCCTCGTACTTATG GGTCCTTTACGGTTGGGCATGGGGCGAAAAAAGGGAGCGCCCCGGTGGACTTAATCGTAT TTTAACACACCCCCCGGGATATTTAAAGTCGGGGTAGGGCTGTTTGAAAATATTCAATGT GGGGGGCTTTTTGACACGCCCGTTTATATTGTTCTGGGACGCGCGTGAGGGGGGTAGACA AGAGGTGTGTAAGCCGTGCTTTATTATCCTCGCGTAGACACGCGTTAGCATGTAGTGGTG TTACCTGGTCGCGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCTT CTCTACTAAAAACCCAAAAATTTGCCAGACACGTGGAGAGCGAGACTTCATCTCAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCTAAGAGTTGTTATTTCTGAGGTAGAAT AACTAATGATCTTATCTTCTCTTTTTTCTTTTCTTCAAGATGGGGTCTTGCTTTGTCACC CAGGCCAGAGGGCAGTGGCACAATCATAGTTCACTGCAGCTTCAAACACCTGAACTCAAG CAATCTTCCCCGCTCATACTGCTCCCCAGCACCAGGAGCTGGGACTACAGGCACACGTCA CCACATCCGGCTAATTTTTTTTTTCTTTTGGTGGGTAGAGACGGGGGCCTCACTATGTTG
… to: MHSSGSSGKGAGPLRGKTSGTEPADFALPSSRGGPGKLRCYQTNLSSFS SPRKGVSQTGTPVCEEDGDAGLGIRQGGKAPVTPRGRGRRGRPPSRTTG TRETAVPGPLGIEDISPNLSPDDKSFSRVVPRVPDSTRRTDVGAGALRR SDSPEIPFQAAAGPSDGLDASSPGNSFVGLRVVAKWSSNGYFYSGKITR DVGAGKYKLLFDDGYECDVLGKDILLCDPIPLDTEVTALSEDEYFSAGV VKGHRKESGELYYSIEKEGQRKWYKRMAVILSLEQGNRLREQYGLGPYE AVTPLTKAADISLDNLVEGKRKRRSNVSSPATPTASSSSSTTPTRKITE SPRASMGVLSGKRKLITSEEERSPAKRGRKSATVKPGAVGAGEFVSPCE SGDNTGEPSALEEQRGPLPLNKTLFLGYAFLLTMATTSDKLASRSKLPD GPTGSSEEEEEFLEIPPFNKQYTESQLRAGAGYILEDFNEAQCNTAYQC
Take: Draft human genome –clones and contigs from public databases –not finished: errors, gaps –Golden Path: assembly of all contigs into (nearly) complete chromosomes
Then: Get rid of repeats Targeted searches –pmatch to locate known proteins from SWISSPROT, SP-TrEMBL and RefSeq –GeneWise and EST2Genome to build the genes: fill in coding sequences and UTRs
And then: Similarity searches –GenScan on raw contigs –its peptides are searched against protein, mRNA and EST databases –genes are built using GeneWise on promising regions –additionally, exons can be used All predictions supported by evidence!
Add cross references: HUGO (HGNC) SwissProt/Trembl, RefSeq EMBL OMIM LocusLink InterPro
Add yet more GeneTribe families Gene descriptions Markers SNPs external annotations (EMBL) mouse traces...
Hardware 360 Alphas: DS10, dual EV6 processors, 1 GB memory 200 other nodes 10 days to do a complete blast + gene build ~30 million jobs ~30 GB
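A rough sense of scale can be worked out from the figures on this slide. The sketch below is back-of-envelope arithmetic only; it assumes all 560 nodes stay busy for the full 10 days and that each node runs one job at a time, neither of which the slide states.

```python
# Back-of-envelope throughput from the slide's figures.
NODES = 360 + 200        # Alphas plus other nodes (from the slide)
JOBS = 30_000_000        # ~30 million jobs (from the slide)
DAYS = 10                # one complete blast + gene build (from the slide)

SECONDS = DAYS * 24 * 60 * 60

# Jobs finished per second across the whole farm.
jobs_per_second = JOBS / SECONDS

# Assumption (not from the slide): one job at a time per node, all nodes busy.
node_seconds_per_job = NODES * SECONDS / JOBS

print(f"{jobs_per_second:.1f} jobs/s across the farm")
print(f"{node_seconds_per_job:.1f} node-seconds per job on average")
```

Under those assumptions the farm completes about 35 jobs per second, and an average job takes about 16 seconds of node time.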
Software Digital Unix Apache relational database (MySQL) mostly perl, some C and Java –BioPerl, BioJava, BioCORBA LSF AltaVista
Software (2) Wiki Web CVS (~100 Mb) Code review, data review Testing conventions Interfaces VirtualContigs CORBA/Java
IDs for genes, transcripts, exons, peptides, families ENSXnnnnnnnnnnn ( eg: ENSG ) –X denotes which type: G = gene T = transcript E = exon P = peptide (translation) F = family
IDs (2) IDs should be stable –difficult, because underlying data keeps being refined! ID mapping version numbers
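The ID layout above can be checked mechanically. The following is a minimal sketch, not EnsEMBL code: the function name `parse_stable_id`, the `StableId` tuple and the example number are made up for illustration, and the digit count (11) is read off the pattern ENSXnnnnnnnnnnn.

```python
import re
from typing import NamedTuple

# Type letters as listed on the slide.
TYPE_CODES = {
    "G": "gene",
    "T": "transcript",
    "E": "exon",
    "P": "peptide",
    "F": "family",
}

# "ENS", one type letter, then 11 digits (assumed from the eleven n's).
STABLE_ID = re.compile(r"ENS([GTEPF])(\d{11})")


class StableId(NamedTuple):
    kind: str     # e.g. "gene"
    number: int   # numeric part without leading zeros


def parse_stable_id(text: str) -> StableId:
    """Split an ID of the form ENSXnnnnnnnnnnn into type and number."""
    match = STABLE_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid ID: {text!r}")
    return StableId(TYPE_CODES[match.group(1)], int(match.group(2)))


# Hypothetical ID, used only to exercise the parser.
print(parse_stable_id("ENSG00000000042"))
```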
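One way ID mapping and version numbers could work together is sketched below. This illustrates the idea only and is not the EnsEMBL mapping procedure: it assumes each prediction carries some key that identifies "the same" gene across releases (here a made-up locus label). An unchanged sequence keeps its ID and version, a changed sequence keeps the ID with a bumped version, and an unmatched prediction gets a fresh ID.

```python
def carry_over_ids(old_release, new_predictions, next_number):
    """Assign IDs to new predictions, reusing old IDs where possible.

    old_release:     {stable_id: (key, sequence, version)}
    new_predictions: [(key, sequence), ...]
    next_number:     first unused numeric ID
    Returns {stable_id: (key, sequence, version)} for the new release.
    """
    # Index the old release by the (hypothetical) matching key.
    by_key = {key: (sid, seq, ver)
              for sid, (key, seq, ver) in old_release.items()}
    result = {}
    for key, seq in new_predictions:
        if key in by_key:
            sid, old_seq, ver = by_key[key]
            if seq != old_seq:
                ver += 1                      # same ID, refined data
        else:
            sid = f"ENSG{next_number:011d}"   # brand-new gene
            next_number += 1
            ver = 1
        result[sid] = (key, seq, ver)
    return result


# Made-up data: locusA unchanged, locusB refined, locusC new.
old = {
    "ENSG00000000001": ("locusA", "ATG", 1),
    "ENSG00000000002": ("locusB", "CCC", 1),
}
new = [("locusA", "ATG"), ("locusB", "CCCG"), ("locusC", "TTT")]
mapped = carry_over_ids(old, new, next_number=3)
for stable_id, (key, seq, version) in mapped.items():
    print(stable_id, version, key)
```

Old IDs with no match in the new predictions simply drop out of the result; a real system would also have to record such retirements, as well as merges and splits.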
Results Latest release: 1.1 (17 July) –Web code version: (1 Aug.) April 2001 dataset 4,318,661,441 basepairs 143,479 exons 23,931 transcripts 21,921 genes (‘confirmed’)
Errors Missing data Misassembly Misidentification (pseudo-gene, paralog) Sequencing errors –in Human Genome Data –in supporting databases Bugs GenScan tuning GeneWise tuning
Gene Families Cluster EnsEMBL peptides together with SwissProt and SP-TrEMBL –vertebrate GeneTRIBE - Automatic Protein Family detection using Markov Clustering. Enright, van Dongen & Ouzounis (in preparation)
Family descriptions distill consensus descriptions –using SwissProt DE-lines –may not work => unknown Transfer peptide’s family assignment to gene –resolve conflicts: choose family that has best description –unknown < hypothetical < fragment < cDNA
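The conflict rule can be sketched as a ranking function. This is one reading of the slide, not the actual implementation: it assumes the ordering unknown < hypothetical < fragment < cDNA runs from worst to best, that any description containing none of these words ranks above all four, and that matching is a case-insensitive substring test. The family IDs and descriptions are made up.

```python
# Keywords from the slide, worst first.
RANKED_KEYWORDS = ["unknown", "hypothetical", "fragment", "cdna"]


def description_rank(description: str) -> int:
    """Higher is better; descriptions with no keyword rank highest."""
    text = description.lower()
    for rank, keyword in enumerate(RANKED_KEYWORDS):
        if keyword in text:
            return rank
    return len(RANKED_KEYWORDS)


def best_family(families):
    """Pick the (family_id, description) pair with the best description.

    On ties the first candidate wins (an arbitrary choice).
    """
    return max(families, key=lambda family: description_rank(family[1]))


# A gene whose peptides fall into three different (made-up) families.
candidates = [
    ("ENSF00000000001", "UNKNOWN"),
    ("ENSF00000000002", "HYPOTHETICAL PROTEIN"),
    ("ENSF00000000003", "ZINC FINGER PROTEIN"),
]
print(best_family(candidates))
```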
Family statistics: 13,811 families –7284 ‘unknown’ description 128,828 members –21,894 ENS genes –23,867 ENS peptides
Family statistics (2) member members > 100 max is 483 (zinc finger)
Gene descriptions Use SwissProt DE-line if known Use Family if not Statistics: –18,053 descriptions from SwissProt –4,851 from family description –3,868 still UNKNOWN
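The fallback order above amounts to a short function. This is a minimal sketch with hypothetical names; it assumes a missing SwissProt DE-line is represented as `None` or an empty string, and that a family without a usable description carries the literal text UNKNOWN.

```python
from typing import Optional


def gene_description(swissprot_de: Optional[str],
                     family_description: Optional[str]) -> str:
    """SwissProt DE-line first, family description second, else UNKNOWN."""
    if swissprot_de:
        return swissprot_de
    if family_description and family_description.upper() != "UNKNOWN":
        return family_description
    return "UNKNOWN"


# Made-up descriptions covering the three cases in the statistics.
print(gene_description("EXAMPLE KINASE", "KINASE FAMILY"))  # from SwissProt
print(gene_description(None, "KINASE FAMILY"))              # from family
print(gene_description(None, "UNKNOWN"))                    # still UNKNOWN
```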
Entry points ID search text search OMIM disease Browse chromosomes BLAST
TextSearch
DiseaseView
BLAST/SSAHA
MapView
MarkerView
ContigView
ContigView (2)
ContigView(3)
DAS annotations
Apollo
ExportView
GeneView
GeneView (2)
ProteinView
ProteinView (2)
ExpressionView
DomainView
FamilyView
Recent developments HelpDesk DAS –Adding annotations from anywhere Apollo –Genome viewer Expression data –SAGE
Future Better genes! Alignments Other genomes –Comparative Genomics CORBA/Java More protein-structural links –Scop profiles IGI/IPI Entity infra-structure
Links –dev.ensembl.org
Acknowledgements Ewan Birney, Michele Clamp, Tim Hubbard, Tony Cox, Elia Stupka, Arek Kasprzyk, Arne Stabenau, James Stalker, James Cuff, James Smith, Simon Potter, Manu Mongin, Val Curwen, Guy Slater, Richard Durbin, Craig Melsopp, Alistair Rust, Chris Mungall, Jim Kent and many, many more
Join! mailing lists –(see )