EnsEMBL Opening up the whole Genome Philip Lijnzaad


1 EnsEMBL Opening up the whole Genome Philip Lijnzaad lijnzaad@ebi.ac.uk

2 Overview: what; how (science, hardware, software); results; families and descriptions; tour; people

3 What is EnsEMBL? Automatic annotation of the complete human genome –genes –other: markers, SNPs, homologies, etc. Completely open –data, software, discussions. Portable, downloadable. ‘The Linux of the Human Genome’

4 From... TCTTCTCCTTCAAGGCATCCAGGTTACCCCGGACAATAAGAGGGGAACAAGCTCTTTGTT TTGCCAAGCGGTGGAAGCTTCAGGAAAGGTGCCCGGCCCCTTAGGAGGAAAACCGGGGAA CAAGACCCGCAGTTTTTGCCTTCCCAACTTCCAGTGGGCCCAAAAAAACTTGGGGCGCCC AGGGTCCCCAAAAGAGAGAGCCACGCTGGGGCCGGGTTCCTGCTTTTAATATCCAGGAAA AGGGGGGGAGGGGTATTCCCCCTTCCTCATTAAGATAAAAGACTCCCCCTCGTACTTATG GGTCCTTTACGGTTGGGCATGGGGCGAAAAAAGGGAGCGCCCCGGTGGACTTAATCGTAT TTTAACACACCCCCCGGGATATTTAAAGTCGGGGTAGGGCTGTTTGAAAATATTCAATGT GGGGGGCTTTTTGACACGCCCGTTTATATTGTTCTGGGACGCGCGTGAGGGGGGTAGACA AGAGGTGTGTAAGCCGTGCTTTATTATCCTCGCGTAGACACGCGTTAGCATGTAGTGGTG TTACCTGGTCGCGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCTT CTCTACTAAAAACCCAAAAATTTGCCAGACACGTGGAGAGCGAGACTTCATCTCAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCTAAGAGTTGTTATTTCTGAGGTAGAAT AACTAATGATCTTATCTTCTCTTTTTTCTTTTCTTCAAGATGGGGTCTTGCTTTGTCACC CAGGCCAGAGGGCAGTGGCACAATCATAGTTCACTGCAGCTTCAAACACCTGAACTCAAG CAATCTTCCCCGCTCATACTGCTCCCCAGCACCAGGAGCTGGGACTACAGGCACACGTCA CCACATCCGGCTAATTTTTTTTTTCTTTTGGTGGGTAGAGACGGGGGCCTCACTATGTTG

5 … to: MHSSGSSGKGAGPLRGKTSGTEPADFALPSSRGGPGKLRCYQTNLSSFS SPRKGVSQTGTPVCEEDGDAGLGIRQGGKAPVTPRGRGRRGRPPSRTTG TRETAVPGPLGIEDISPNLSPDDKSFSRVVPRVPDSTRRTDVGAGALRR SDSPEIPFQAAAGPSDGLDASSPGNSFVGLRVVAKWSSNGYFYSGKITR DVGAGKYKLLFDDGYECDVLGKDILLCDPIPLDTEVTALSEDEYFSAGV VKGHRKESGELYYSIEKEGQRKWYKRMAVILSLEQGNRLREQYGLGPYE AVTPLTKAADISLDNLVEGKRKRRSNVSSPATPTASSSSSTTPTRKITE SPRASMGVLSGKRKLITSEEERSPAKRGRKSATVKPGAVGAGEFVSPCE SGDNTGEPSALEEQRGPLPLNKTLFLGYAFLLTMATTSDKLASRSKLPD GPTGSSEEEEEFLEIPPFNKQYTESQLRAGAGYILEDFNEAQCNTAYQC

6 Take: Draft human genome –clones and contigs from public databases –not finished: errors, gaps –Golden Path: assembly of all contigs into (nearly) complete chromosomes

7 Then: Get rid of repeats. Targeted searches –pmatch to locate known proteins from SWISSPROT, SP-TrEMBL and RefSeq in the genome –GeneWise and EST2Genome to build the genes: fill in coding sequences and UTRs

8 And then: Similarity searches –GenScan on raw contigs –its peptides are searched against protein, mRNA and EST databases –genes are built using GeneWise on promising regions –additionally, exons can be used. All predictions supported by evidence!

9 Add cross references: HUGO (HGNC), SwissProt/TrEMBL, RefSeq, EMBL, OMIM, LocusLink, InterPro

10 Add yet more: GeneTribe families, gene descriptions, markers, SNPs, external annotations (EMBL), mouse traces...

11 Hardware: 360 Alphas (DS10, dual EV6 processors, 1 GByte memory); 200 other nodes; 10 days to do a complete BLAST + gene build; ~30 million jobs; ~30 GB

12 Software: Digital Unix; Apache; relational database (MySQL); mostly Perl, some C and Java –BioPerl, BioJava, BioCORBA; LSF; AltaVista

13 Software (2) Wiki Web CVS (~100 Mb) Code review, data review Testing conventions Interfaces VirtualContigs CORBA/Java

14 IDs for genes, transcripts, exons, peptides, families: ENSXnnnnnnnnnnn (e.g. ENSG00000067369) –X denotes the type: G = gene, T = transcript, E = exon, P = peptide (translation), F = family
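
The ID scheme on this slide can be illustrated with a minimal Python sketch. It assumes the layout shown in the example (the prefix ENS, one type letter, then 11 digits); the names `parse_ensembl_id` and `format_ensembl_id` are hypothetical helpers for illustration, not part of the EnsEMBL code base (which is mostly Perl).

```python
import re
from typing import NamedTuple

# Type letters as listed on the slide.
ID_TYPES = {
    "G": "gene",
    "T": "transcript",
    "E": "exon",
    "P": "peptide",
    "F": "family",
}

# Assumed layout: "ENS", one type letter, 11 digits (as in ENSG00000067369).
_ID_PATTERN = re.compile(r"ENS([GTEPF])(\d{11})")


class EnsemblId(NamedTuple):
    kind: str    # e.g. "gene"
    number: int  # numeric part without leading zeros


def parse_ensembl_id(identifier: str) -> EnsemblId:
    """Split an identifier into its type and number (hypothetical helper)."""
    match = _ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not an ENSXnnnnnnnnnnn identifier: {identifier!r}")
    return EnsemblId(ID_TYPES[match.group(1)], int(match.group(2)))


def format_ensembl_id(type_letter: str, number: int) -> str:
    """Build an identifier, zero-padding the number to 11 digits."""
    if type_letter not in ID_TYPES:
        raise ValueError(f"unknown type letter: {type_letter!r}")
    return f"ENS{type_letter}{number:011d}"


if __name__ == "__main__":
    print(parse_ensembl_id("ENSG00000067369"))
    print(format_ensembl_id("T", 42))
```

Keeping the type letter inside the identifier means a bare ID is enough to tell which kind of object it refers to, without a database lookup.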

15 IDs (2): IDs should be stable. Difficult, because the underlying data keeps being refined! ID mapping; version numbers

16 Results: Latest release: 1.1 (17 July) –Web code version: 1.1.1 (1 Aug.). April 2001 dataset: 4,318,661,441 base pairs; 143,479 exons; 23,931 transcripts; 21,921 genes (‘confirmed’)
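
As a rough sanity check, the averages implied by these counts can be computed directly. This is a minimal Python sketch using only the figures on the slide; the variable names are illustrative. If exons are shared between transcripts, the first ratio understates the true number of exons per transcript.

```python
# Counts from the release 1.1 results slide (April 2001 dataset).
BASE_PAIRS = 4_318_661_441
EXONS = 143_479
TRANSCRIPTS = 23_931
GENES = 21_921

# Simple ratios; these are averages over the whole dataset.
exons_per_transcript = EXONS / TRANSCRIPTS
transcripts_per_gene = TRANSCRIPTS / GENES
base_pairs_per_gene = BASE_PAIRS / GENES

if __name__ == "__main__":
    print(f"{exons_per_transcript:.2f} exons per transcript")
    print(f"{transcripts_per_gene:.2f} transcripts per gene")
    print(f"{base_pairs_per_gene:,.0f} base pairs of sequence per gene")
```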

17 Errors: Missing data; misassembly; misidentification (pseudogene, paralog); sequencing errors –in human genome data –in supporting databases; bugs; GenScan tuning; GeneWise tuning

18 Gene Families: Cluster EnsEMBL peptides together with SwissProt and SP-TrEMBL –vertebrate. GeneTRIBE: automatic protein family detection using Markov clustering. Enright, van Dongen & Ouzounis (in preparation)

19 Family descriptions: Distill consensus descriptions –using SwissProt DE-lines –may not work => ‘unknown’. Transfer peptide’s family assignment to gene –resolve conflicts: choose the family that has the best description –unknown < hypothetical < fragment < cDNA
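
The conflict-resolution rule on this slide (prefer the family with the best description, ranked unknown < hypothetical < fragment < cDNA) might be sketched as follows. This is a Python illustration under stated assumptions: quality is judged by simple keyword matching, any description containing none of the four keywords is assumed to rank above all of them, and the function names and family IDs are made up for the example.

```python
from typing import Mapping

# Keywords from worst to best, as ordered on the slide.
RANKED_KEYWORDS = ["unknown", "hypothetical", "fragment", "cdna"]


def description_rank(description: str) -> int:
    """Return a quality rank; higher is better.

    Assumption: the worst keyword found determines the rank, and a
    description with none of the keywords outranks all of them.
    """
    text = description.lower()
    for rank, keyword in enumerate(RANKED_KEYWORDS):
        if keyword in text:
            return rank
    return len(RANKED_KEYWORDS)


def choose_family(candidates: Mapping[str, str]) -> str:
    """Pick the family ID whose description ranks best.

    `candidates` maps family ID -> consensus description. On a tie the
    first candidate in iteration order wins (an arbitrary choice here).
    """
    return max(candidates, key=lambda fam: description_rank(candidates[fam]))


if __name__ == "__main__":
    # Hypothetical family IDs and descriptions for a gene whose
    # peptides fell into two different families.
    conflict = {
        "ENSF00000000001": "UNKNOWN",
        "ENSF00000000002": "ZINC FINGER PROTEIN",
    }
    print(choose_family(conflict))
```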

20 Family statistics: 13,811 families –7,284 with ‘unknown’ description; 128,828 members –21,894 ENS genes –23,867 ENS peptides

21 Family statistics (2): 6759 families with 1 member; 3457 with 2-10 members; 215 with 10-100; 4 with > 100; max is 483 (zinc finger)

22 Gene descriptions: Use SwissProt DE-line if known; use family description if not. Statistics: –18,053 descriptions: 13,202 from SwissProt, 4,851 from family description –3,868 still UNKNOWN
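
The fallback order on this slide can be sketched in a few lines of Python. The function name and the treatment of an ‘UNKNOWN’ family description as uninformative are assumptions made for illustration; the counts are the ones on the slide.

```python
from typing import Optional


def gene_description(swissprot_de: Optional[str],
                     family_description: Optional[str]) -> str:
    """Prefer the SwissProt DE-line, then the family description.

    Assumption: an empty or 'UNKNOWN' family description carries no
    information, so the gene stays UNKNOWN in that case.
    """
    if swissprot_de:
        return swissprot_de
    if family_description and family_description.upper() != "UNKNOWN":
        return family_description
    return "UNKNOWN"


# Counts from the slide: descriptions by source, and genes left UNKNOWN.
FROM_SWISSPROT = 13_202
FROM_FAMILY = 4_851
STILL_UNKNOWN = 3_868

described = FROM_SWISSPROT + FROM_FAMILY   # 18,053 descriptions
total_genes = described + STILL_UNKNOWN    # 21,921 genes

if __name__ == "__main__":
    print(gene_description(None, "ZINC FINGER PROTEIN"))
    print(described, total_genes)
```

The two sources add up to the 18,053 descriptions quoted, and adding the 3,868 UNKNOWNs gives 21,921, the gene count reported for release 1.1.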

23 Entry points (http://www.ensembl.org): ID search, text search, OMIM disease, browse chromosomes, BLAST

24

25 TextSearch

26 DiseaseView

27 BLAST/SSAHA

28 MapView

29 MarkerView

30 ContigView

31 ContigView (2)

32 ContigView (3)

33 DAS annotations

34 Apollo

35 ExportView

36 GeneView

37 GeneView (2)

38 ProteinView

39 ProteinView (2)

40 ExpressionView

41 DomainView

42 FamilyView

43 Recent developments: HelpDesk; DAS –adding annotations from anywhere; Apollo –genome viewer; expression data –SAGE

44 Future: Better genes! Alignments; other genomes –comparative genomics; CORBA/Java; more protein-structural links –SCOP profiles; IGI/IPI; entity infrastructure

45 Links http://www.ensembl.org –dev.ensembl.org http://www.ensembl.org/genome/central http://genome.ucsc.edu http://compbio.ornl.gov/channel http://ncbi.nlm.gov/genome/guide/human http://www.biodas.org http://www.bio{perl,xml,corba,python,java}.org

46 Acknowledgements: Ewan Birney, Michele Clamp, Tim Hubbard, Tony Cox, Elia Stupka, Arek Kasprzyk, Arne Stabenau, James Stalker, James Cuff, James Smith, Simon Potter, Manu Mongin, Val Curwen, Guy Slater, Richard Durbin, Craig Melsopp, Alistair Rust, Chris Mungall, Jim Kent and many, many more

47 Join! http://www.ensembl.org mailing lists –ensembl-dev@ebi.ac.uk –ensembl-announce@ebi.ac.uk –(see http://www.ensembl.org/Dev/Lists )

