EnsEMBL Opening up the whole Genome Philip Lijnzaad


Overview
what
how (science, hardware, software)
results
families and descriptions
tour
people

What is EnsEMBL
Automatic Annotation of complete Human Genome
–genes
–other: markers, SNPs, homologies, etc.
completely open
–data, software, discussions
portable, downloadable
‘the Linux of the Human Genome’

From... TCTTCTCCTTCAAGGCATCCAGGTTACCCCGGACAATAAGAGGGGAACAAGCTCTTTGTT TTGCCAAGCGGTGGAAGCTTCAGGAAAGGTGCCCGGCCCCTTAGGAGGAAAACCGGGGAA CAAGACCCGCAGTTTTTGCCTTCCCAACTTCCAGTGGGCCCAAAAAAACTTGGGGCGCCC AGGGTCCCCAAAAGAGAGAGCCACGCTGGGGCCGGGTTCCTGCTTTTAATATCCAGGAAA AGGGGGGGAGGGGTATTCCCCCTTCCTCATTAAGATAAAAGACTCCCCCTCGTACTTATG GGTCCTTTACGGTTGGGCATGGGGCGAAAAAAGGGAGCGCCCCGGTGGACTTAATCGTAT TTTAACACACCCCCCGGGATATTTAAAGTCGGGGTAGGGCTGTTTGAAAATATTCAATGT GGGGGGCTTTTTGACACGCCCGTTTATATTGTTCTGGGACGCGCGTGAGGGGGGTAGACA AGAGGTGTGTAAGCCGTGCTTTATTATCCTCGCGTAGACACGCGTTAGCATGTAGTGGTG TTACCTGGTCGCGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCTT CTCTACTAAAAACCCAAAAATTTGCCAGACACGTGGAGAGCGAGACTTCATCTCAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCTAAGAGTTGTTATTTCTGAGGTAGAAT AACTAATGATCTTATCTTCTCTTTTTTCTTTTCTTCAAGATGGGGTCTTGCTTTGTCACC CAGGCCAGAGGGCAGTGGCACAATCATAGTTCACTGCAGCTTCAAACACCTGAACTCAAG CAATCTTCCCCGCTCATACTGCTCCCCAGCACCAGGAGCTGGGACTACAGGCACACGTCA CCACATCCGGCTAATTTTTTTTTTCTTTTGGTGGGTAGAGACGGGGGCCTCACTATGTTG

… to: MHSSGSSGKGAGPLRGKTSGTEPADFALPSSRGGPGKLRCYQTNLSSFS SPRKGVSQTGTPVCEEDGDAGLGIRQGGKAPVTPRGRGRRGRPPSRTTG TRETAVPGPLGIEDISPNLSPDDKSFSRVVPRVPDSTRRTDVGAGALRR SDSPEIPFQAAAGPSDGLDASSPGNSFVGLRVVAKWSSNGYFYSGKITR DVGAGKYKLLFDDGYECDVLGKDILLCDPIPLDTEVTALSEDEYFSAGV VKGHRKESGELYYSIEKEGQRKWYKRMAVILSLEQGNRLREQYGLGPYE AVTPLTKAADISLDNLVEGKRKRRSNVSSPATPTASSSSSTTPTRKITE SPRASMGVLSGKRKLITSEEERSPAKRGRKSATVKPGAVGAGEFVSPCE SGDNTGEPSALEEQRGPLPLNKTLFLGYAFLLTMATTSDKLASRSKLPD GPTGSSEEEEEFLEIPPFNKQYTESQLRAGAGYILEDFNEAQCNTAYQC

Take: Draft human genome
–clones and contigs from public databases
–not finished: errors, gaps
–Golden Path: assembly of all contigs into (nearly) complete chromosomes

Then:
Get rid of repeats
Targeted searches
–pmatch to ‘find back’ known proteins from SWISSPROT, SP-TrEMBL and RefSeq
–GeneWise and EST2Genome to build the genes: fill in coding sequences and UTRs

And then:
Similarity searches
–GenScan on raw contigs
–its peptides are searched against protein, mRNA and EST databases
–genes are built using GeneWise on promising regions
–additionally, exons can be used
All predictions supported by evidence!

Add cross references:
HUGO (HGNC)
SwissProt/TrEMBL, RefSeq
EMBL
OMIM
LocusLink
InterPro

Add yet more
GeneTribe families
Gene descriptions
Markers
SNPs
external annotations (EMBL)
mouse traces
...

Hardware
360 Alphas: DS10, dual EV6 processors, 1 GByte memory
200 other nodes
10 days to do a complete BLAST + gene build
~ 30 million jobs
~ 30 GB

Software
Digital Unix
Apache
relational database (MySQL)
mostly Perl, some C and Java
–BioPerl, BioJava, BioCORBA
LSF
AltaVista

Software (2) Wiki Web CVS (~100 Mb) Code review, data review Testing conventions Interfaces VirtualContigs CORBA/Java

IDs for genes, transcripts, exons, peptides, families
ENSXnnnnnnnnnnn ( eg: ENSG )
–X denotes which type:
G = gene
T = transcript
E = exon
P = peptide (translation)
F = family
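The identifier scheme above can be sketched as a small parser. This is only an illustration, assuming the format is exactly `ENS`, one type letter, and eleven digits as the `ENSXnnnnnnnnnnn` pattern suggests; the function name `parse_stable_id` and the example ID are hypothetical, not part of the EnsEMBL code base.

```python
import re
from typing import NamedTuple

# Type letters as listed on the slide.
ID_TYPES = {
    "G": "gene",
    "T": "transcript",
    "E": "exon",
    "P": "peptide",
    "F": "family",
}

# Assumption: "ENS" + one type letter + exactly eleven digits.
ID_PATTERN = re.compile(r"ENS([GTEPF])(\d{11})")


class StableId(NamedTuple):
    kind: str    # e.g. "gene"
    number: int  # numeric part without leading zeros


def parse_stable_id(text: str) -> StableId:
    """Split an ENSXnnnnnnnnnnn identifier into its type and number."""
    match = ID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not an ENSXnnnnnnnnnnn identifier: {text!r}")
    return StableId(ID_TYPES[match.group(1)], int(match.group(2)))


if __name__ == "__main__":
    # Hypothetical ID, made up for illustration only.
    print(parse_stable_id("ENSG00000000042"))
```

Keeping the type letter inside the ID means a bare identifier is enough to tell which kind of object to look up.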

IDs (2)
IDs should be stable
difficult, because underlying data keeps being refined!
ID mapping
version numbers
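One way to picture ID mapping with version numbers is the sketch below. The slide does not say how EnsEMBL decides which new prediction corresponds to which old ID, so that step is assumed to have happened already: each new prediction arrives with the old ID it was mapped to, or `None`. The function name `assign_ids`, the rule "bump the version when the sequence changes", and all IDs are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

# stable ID -> (version, sequence)
Release = Dict[str, Tuple[int, str]]


def assign_ids(
    old: Release,
    predictions: List[Tuple[Optional[str], str]],
    prefix: str = "ENSG",
) -> Release:
    """Carry stable IDs from an old release over to new predictions.

    Each prediction is (mapped_old_id_or_None, sequence).
    Assumed policy: same ID and same sequence keeps its version,
    same ID with a changed sequence gets version + 1,
    unmapped predictions get a fresh ID at version 1.
    """
    next_number = max((int(i[len(prefix):]) for i in old), default=0) + 1
    new_release: Release = {}
    for mapped_id, sequence in predictions:
        if mapped_id in old:
            version, old_sequence = old[mapped_id]
            if sequence != old_sequence:
                version += 1  # content was refined, ID stays stable
            new_release[mapped_id] = (version, sequence)
        else:
            new_id = f"{prefix}{next_number:011d}"
            next_number += 1
            new_release[new_id] = (1, sequence)
    return new_release


if __name__ == "__main__":
    old = {"ENSG00000000001": (1, "ATG"), "ENSG00000000002": (3, "CCC")}
    new = [("ENSG00000000001", "ATG"), ("ENSG00000000002", "CCCT"), (None, "GGG")]
    for stable_id, (version, _) in assign_ids(old, new).items():
        print(stable_id, "version", version)
```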

Results
Latest release:,1.1 (17. July)
–Web code version: (1 Aug.)
April 2001 dataset
4,318,661,441 basepairs
143,479 exons
23,931 transcripts
21,921 genes (‘confirmed’)
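A few crude ratios follow from the counts on this slide. The sketch below only divides the reported totals; it assumes nothing about how exons are shared between transcripts, so the figures are rough indicators, not per-gene measurements. The function name `release_ratios` is made up for this example.

```python
from typing import Dict


def release_ratios(
    basepairs: int, exons: int, transcripts: int, genes: int
) -> Dict[str, float]:
    """Divide the release totals to get simple summary ratios."""
    return {
        "exons_per_transcript": exons / transcripts,
        "transcripts_per_gene": transcripts / genes,
        "basepairs_per_gene": basepairs / genes,
    }


# Totals as given on the Results slide (April 2001 dataset).
ratios = release_ratios(
    basepairs=4_318_661_441,
    exons=143_479,
    transcripts=23_931,
    genes=21_921,
)

if __name__ == "__main__":
    for name, value in ratios.items():
        print(f"{name}: {value:.2f}")
```

With these inputs the ratios come out at roughly six exons per transcript, about 1.09 transcripts per gene, and on the order of 197 kb of draft sequence per confirmed gene.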

Errors
Missing data
Misassembly
Misidentification (pseudo-gene, paralog)
Sequencing errors
–in Human Genome Data
–in supporting databases
Bugs
GenScan tuning
GeneWise tuning

Gene Families
Cluster EnsEMBL peptides together with SwissProt and SPTrembl
–vertebrate
GeneTRIBE - Automatic Protein Family detection using Markov Clustering. Enright, van Dongen & Ouzounis (in preparation)
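The slide names Markov Clustering as the method behind the families. The sketch below shows the general shape of that algorithm (alternating expansion and inflation on a column-stochastic matrix) in plain Python. It is not the GeneTRIBE code: the similarity matrix, the inflation value of 2.0, the added self-loops and the function name `markov_cluster` are all assumptions for illustration.

```python
from typing import FrozenSet, List, Set

Matrix = List[List[float]]


def _normalise_columns(m: Matrix) -> Matrix:
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def _multiply(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def markov_cluster(
    similarity: Matrix,
    inflation: float = 2.0,
    max_rounds: int = 100,
    eps: float = 1e-6,
) -> Set[FrozenSet[int]]:
    """Cluster nodes of a weighted similarity graph with basic MCL."""
    n = len(similarity)
    if n == 0:
        return set()
    # Add self-loops, then make the matrix column-stochastic.
    m = _normalise_columns(
        [
            [float(similarity[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
    )
    for _ in range(max_rounds):
        expanded = _multiply(m, m)  # expansion: take a two-step random walk
        inflated = _normalise_columns(  # inflation: favour strong flows
            [[v ** inflation for v in row] for row in expanded]
        )
        change = max(
            abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n)
        )
        m = inflated
        if change < eps:
            break
    # Rows that still carry flow are attractors; their columns form a cluster.
    clusters: Set[FrozenSet[int]] = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > eps)
        if members:
            clusters.add(members)
    return clusters


if __name__ == "__main__":
    # Hypothetical peptide similarity graph: peptides 0-2 and 3-4 are related.
    graph = [
        [0, 1, 1, 0, 0],
        [1, 0, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 1, 0],
    ]
    print(markov_cluster(graph))
```

In a real setting the matrix would hold pairwise sequence-similarity scores, and a sparse matrix representation would be needed at the scale of well over a hundred thousand family members.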

Family descriptions
distill consensus descriptions
–using SwissProt DE-lines
–may not work => unknown
Transfer peptide’s family assignment to gene
–resolve conflicts: choose family that has best description
–unknown < hypothetical < fragment < cDNA
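The conflict rule above (unknown < hypothetical < fragment < cDNA) can be read as a ranking of description quality. A minimal sketch, assuming the ranking is applied by keyword matching and that any description containing none of the four keywords counts as better than all of them; the function names and family IDs are hypothetical.

```python
from typing import Dict

# Weakest first, in the order given on the slide.
WEAK_KEYWORDS = ["unknown", "hypothetical", "fragment", "cdna"]


def description_score(description: str) -> int:
    """Higher is better; the weakest keyword found decides the score."""
    text = description.lower()
    for rank, keyword in enumerate(WEAK_KEYWORDS):
        if keyword in text:
            return rank
    # Assumption: a description with none of the keywords is the best kind.
    return len(WEAK_KEYWORDS)


def best_family(families: Dict[str, str]) -> str:
    """Pick the family ID whose description scores highest.

    `families` maps family ID -> consensus description for all families
    that the peptides of one gene were assigned to.
    """
    return max(families, key=lambda fam: description_score(families[fam]))


if __name__ == "__main__":
    candidates = {
        "ENSF00000000001": "UNKNOWN",
        "ENSF00000000002": "HYPOTHETICAL PROTEIN",
        "ENSF00000000003": "ZINC FINGER PROTEIN",
    }
    print(best_family(candidates))
```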

Family statistics:
13,811 families
–7,284 ‘unknown’ description
128,828 members
–21,894 ENS genes
–23,867 ENS peptides

Family statistics (2) member members > 100 max is 483 (zinc finger)

Gene descriptions
Use SwissProt DE-line if known
use Family if not
Statistics:
–18053 descriptions from SwissProt 4851 from family description
–3868 still UNKNOWNs
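The fallback rule on this slide can be sketched as a two-step lookup. The function name `describe_gene` and the convention of returning the source alongside the text are assumptions for illustration. As a side note, 18053 + 3868 equals the 21,921 genes reported under Results, which suggests that 18053 is the total number of described genes, although the slide text does not make this explicit.

```python
from typing import Optional, Tuple


def describe_gene(
    swissprot_de: Optional[str],
    family_description: Optional[str],
) -> Tuple[str, Optional[str]]:
    """Return (description, source) for a gene.

    Rule from the slide: use the SwissProt DE-line if known,
    otherwise fall back to the family description.
    Assumption: a family description of "UNKNOWN" counts as missing.
    """
    if swissprot_de:
        return swissprot_de, "SwissProt"
    if family_description and family_description.strip().upper() != "UNKNOWN":
        return family_description, "family"
    return "UNKNOWN", None


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    print(describe_gene("ZINC FINGER PROTEIN 42", "ZINC FINGER PROTEIN"))
    print(describe_gene(None, "ZINC FINGER PROTEIN"))
    print(describe_gene(None, "UNKNOWN"))
```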

Entry points
ID search
text search
OMIM disease
Browse chromosomes
BLAST

TextSearch

DiseaseView

BLAST/SSAHA

MapView

MarkerView

ContigView

ContigView (2)

ContigView(3)

DAS annotations

Apollo

ExportView

GeneView

GeneView (2)

ProteinView

ProteinView (2)

ExpressionView

DomainView

FamilyView

Recent developments
HelpDesk
DAS
–Adding annotations from anywhere
Apollo
–Genome viewer
Expression data
–SAGE

Future
Better genes!
Alignments
Other genomes
–Comparative Genomics
CORBA/Java
More protein-structural links
–SCOP profiles
IGI/IPI
Entity infrastructure

Links –dev.ensembl.org

Acknowledgements
Ewan Birney, Michele Clamp, Tim Hubbard, Tony Cox, Elia Stupka, Arek Kasprzyk, Arne Stabenau, James Stalker, James Cuff, James Smith, Simon Potter, Manu Mongin, Val Curwen, Guy Slater, Richard Durbin, Craig Melsopp, Alistair Rust, Chris Mungall, Jim Kent and many, many more

Join! mailing lists –(see )