The Zebrafish Genome Sequencing Project: Bioinformatics resources
Kerstin Howe, Mario Caccamo, Ian Sealy

Bioinformatics resources: outline
- clone mapping, sequencing and manual annotation
- genome assemblies and automated annotation
- integrated ZF-Models data and tools

Clone mapping and sequencing: mapping
- 2 BAC Tuebingen libraries
- 1 BAC and 1 cosmid library from a single Tuebingen double-haploid fish
- end sequencing, RH mapping, fingerprinting
- pieced together according to fingerprints, marker mapping, sequence alignment
- currently ~2500 ctgs

Clone mapping and sequencing: sequencing pipeline
- select clones based on position in fpc contig
- subcloning
- sequencing
- automatic assembly/pre-finishing (back to sequencing if necessary)
- finishing
- QC
- automated analysis pipeline
- manual annotation
- submission to EMBL

Manual annotation
- automated analysis pipeline (on unfinished and finished sequence): RepeatMasker, CpG island prediction, Genscan, FGenesh, halfwise (Pfam), EPCR, Blast (ESTs, cDNAs, proteins)
- manual annotation: gene structures, remarks (gene names, function, similarities), other features
- otter: mysql database in 'ensembl style', acedb or apollo front end, open to users from the 'outside'
- EMBL

Annotation policy
- follows guidelines for human annotation (havana team, Sanger Institute)
- no "guesses": annotations solely based on supporting evidence
- annotation of:
  - CDSs and UTRs / transcripts
  - splice variants
  - pseudogenes
  - poly A features
  - transposons
  - repeats
- approved nomenclature (SI:clone.number)
- collaboration with ZFIN
  - existing ZFIN records are reported
  - ZFIN provides new records for newly found genes

Manual annotation
[figure: annotation view with tracks for DNA, repeats, CpG island, Genscan, FGenesH, proteins, ESTs, mRNAs]

vega.sanger.ac.uk

Vega contigview

Vega geneview

When to use what
- go to vega.sanger.ac.uk if you need
  - highly reliable sequence
  - highly reliable annotation (with your input)
  - 'your gene' stable over time (TILLING)
- go to Ensembl if you need
  - the whole genome
  - comparative data
  - ZF-Models microarray or insertional mutagenesis data
  - complicated searches (BioMart)

Zebrafish Genome Project: assembly release (Zv5)
[figure: two routes feeding the assembly]
- whole genome shotgun sequencing: WGS reads, WGS assembly (contig, supercontig)
- clone mapping and sequencing: clone libraries, map (fpc ctg, tile path, BACs), sequencing of (un)finished clones, ~ 8,000 finished clones (~1 Gb)
- integration of clones + ctgs with WGS contigs, using markers (T51): 1.63 Gb
- automatic annotation, manual annotation

WGS assembly
Phusion assembler - High Performance Assembly Group (Zemin Ning et al.)
[figure: reads, group reads, phrap, contig, read-pair tracker, supercontig; contigs A, B and C are ordered into a supercontig with gaps written as NNNNNNNN]
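The last step on this slide, ordering contigs by read pairs and padding the joins with N, can be illustrated with a short sketch. This is not the Phusion read-pair tracker: the function names, the link format and the fixed 8-base gap are assumptions made only for illustration.

```python
# Minimal sketch (assumption): each link (left, right) means that read pairs
# place contig `right` directly after contig `left`. Not the Phusion code.

def order_contigs(links):
    """Chain pairwise links into one linear contig order."""
    successor = dict(links)
    placed_after = set(successor.values())
    starts = [name for name in successor if name not in placed_after]
    if len(starts) != 1:
        raise ValueError("links do not form a single linear chain")
    order = [starts[0]]
    while order[-1] in successor:
        nxt = successor[order[-1]]
        if nxt in order:
            raise ValueError("links contain a cycle")
        order.append(nxt)
    return order


def build_supercontig(contigs, links, gap="N" * 8):
    """Join contig sequences in link order, writing each gap as a run of N."""
    return gap.join(contigs[name] for name in order_contigs(links))


# Hypothetical toy data: three short contigs and two read-pair links.
contigs = {"A": "ACGT", "B": "GGCC", "C": "TTAA"}
links = [("B", "A"), ("A", "C")]
supercontig = build_supercontig(contigs, links)
```

A real scaffolder would also estimate each gap size from the insert length of the spanning read pairs and handle contig orientation; both are left out here.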

Read grouping

continuous base hash - k=12
  ATGGCGTGCAGTCCATGTTCGGATCA
  ATGGCGTGCAGT
   TGGCGTGCAGTC
    GGCGTGCAGTCC
     GCGTGCAGTCCA

gap hash k=12 (4x3) - dealing with variation
  ATGGCGTGCAGTCCATGTTCGGATCA
  ATGGCGTGCAGTCCATGT
   TGGCGTGCAGTCCATGTT
    GGCGTGCAGTCCATGTTC
     GCGTGCAGTCCATGTTCG

k-mer word hashing
[figure: word distribution, k-mer occurrence frequency; annotated regions: seq. errors, ~7, repeats]
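The two hashing schemes on this slide can be sketched in a few lines. The continuous hash takes every 12-base word of a read; the gap hash samples 12 bases from an 18-base window. The slide only says "4x3", so the mask below (four blocks of three sampled bases separated by two skipped bases) is an assumption, as are all function names.

```python
from collections import Counter

READ = "ATGGCGTGCAGTCCATGTTCGGATCA"  # example read from the slide


def continuous_words(seq, k=12):
    """All overlapping k-base words of seq (continuous base hash)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_words(seq, mask="111001110011100111"):
    """Words sampled at the '1' positions of mask (gap hash).

    The default mask is an assumed reading of 'k=12 (4x3)': 12 sampled
    bases spread over an 18-base window.
    """
    keep = [i for i, flag in enumerate(mask) if flag == "1"]
    span = len(mask)
    return ["".join(seq[start + i] for i in keep)
            for start in range(len(seq) - span + 1)]


def word_counts(reads, k=12):
    """k-mer occurrence frequency across a set of reads."""
    counts = Counter()
    for read in reads:
        counts.update(continuous_words(read, k))
    return counts


# A substitution at a skipped position changes the continuous word
# but leaves the first gapped word untouched.
variant = READ[:3] + "A" + READ[4:]
```

In the occurrence histogram, words seen only once or twice tend to come from sequencing errors and words seen far more often than the read depth tend to come from repeats, which is why the word distribution is inspected before reads are grouped.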

Zebrafish Genome Project: assembly release (Zv5)
[figure: same overview as before]
- whole genome shotgun sequencing: WGS reads, WGS assembly
- clone mapping and sequencing: clone libraries, map, sequencing of (un)finished clones, ~ 7,000 finished clones (~1 Gb)
- integration, using markers (T51)
- automatic annotation, manual annotation

Integration
[figure: fpc contig, WGS supercontig, markers, cDNA, BAC ends and BACs (finished clone accessions such as BX005153) combined into Zv5 scaffolds labelled scaffoldn.1, scaffoldn.3, scaffoldn.5, scaffoldn.7]

Assemblies

                              Zv5                    Zv4                    Zv3                  Zv2
release date
assembly total length [bp]    1,630,306,866          1,592,025,686          1,459,115,486        1,452,210,772
scaffolds                     16,214                 21,333                 58,339               83,470
finished clones               4,519 (699 Mb)         2,828 (443 Mb)         1,502 (263 Mb)       -
scaffolds in chr 1-25         1,749                  1,892                  1,490                -
scaffolds in fpc contigs      265 (chrU)             694 (chrU)             1,842                5,677
NA scaffolds                  14,676                 18,747                 54,798               77,793
sum(length) chr 1-25 [bp]     1,200,129,620 (73%)    1,097,507,810 (69%)    718,270,423 (49%)    -
sum(length) ctgs              183,993,739 (11%)      176,222,396 (11%)      365,271,659 (25%)    1,143,459,008
sum(length) NAs               246,183,507 (16%)      318,295,480 (20%)      335,615,307 (23%)    308,751,764
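As a quick consistency check on the table, the three length categories for Zv5 should add up to the assembly total, and each percentage is that category's share of the total. The sketch below uses only the Zv5 figures from the slide; the variable and function names are made up for illustration, and the computed shares may differ by a point from the rounded values printed on the slide.

```python
# Zv5 figures copied from the Assemblies table (lengths in bp).
ZV5_TOTAL = 1_630_306_866
ZV5_PARTS = {
    "chr 1-25": 1_200_129_620,
    "fpc contigs": 183_993_739,
    "NA scaffolds": 246_183_507,
}


def shares(parts, total):
    """Percentage of the total assembly length held by each category."""
    return {name: 100 * length / total for name, length in parts.items()}


zv5_shares = shares(ZV5_PARTS, ZV5_TOTAL)
for name, pct in zv5_shares.items():
    print(f"{name}: {pct:.1f}%")
```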

Automatic annotation
[figure: Ensembl gene build]
- Zebrafish proteins, other proteins: Genewise, Genewise genes
- Zebrafish cDNAs: Exonerate, aligned cDNAs, Genewise genes with UTRs
- Zebrafish ESTs: Exonerate, aligned ESTs, ClusterMerge, Ensembl EST genes
- supported ab initio (optional)
- Genebuilder: final set

Ensembl

Contigview

Geneview

Searching Ensembl

BioMart: start, filter, output

Do's and Don'ts
- go elsewhere (Ensembl) if you
  - want to know about the whole genome
  - need comparative data
  - need ZF-Models microarray or insertional mutagenesis data
  - need to do complicated searches
- go to Vega if you
  - need highly reliable sequence
  - need highly reliable annotation
  - need 'your gene' stable over time (TILLING)

DAS
[figure: a DAS client (genome browser with the reference sequence and local storage) fetches XML from several DAS servers, each with its own remote storage]
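The client side of this picture can be sketched as two small steps: build a request for the features on a segment, then read the XML that comes back. The server address and source name below are hypothetical, and the sample response is hand-written in the general shape of a DAS features document rather than taken from a real server; nothing is fetched over the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def features_url(server, source, segment, start, stop):
    """URL for a DAS 'features' request on one segment."""
    query = urlencode({"segment": f"{segment}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


def parse_features(xml_text):
    """Extract id, label, type and coordinates of each FEATURE element."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": feature.get("id"),
            "label": feature.get("label"),
            "type": feature.find("TYPE").get("id"),
            "start": int(feature.findtext("START")),
            "end": int(feature.findtext("END")),
        }
        for feature in root.iter("FEATURE")
    ]


# Hand-written sample response (assumption, for illustration only).
SAMPLE = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/zebrafish/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="f1" label="example_gene">
        <TYPE id="gene">gene</TYPE>
        <START>1200</START>
        <END>1800</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""

url = features_url("http://das.example.org", "zebrafish", "1", 1000, 2000)
features = parse_features(SAMPLE)
```

Because every server answers in the same XML layout, a client can overlay annotation from several remote sources on one reference sequence without copying their data locally.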

SNPs and Indels

Ensembl releases