Kerstin Howe, Mario Caccamo, Ian Sealy
The Zebrafish Genome Sequencing Project: Bioinformatics resources
Bioinformatics resources: outline
- clone mapping, sequencing and manual annotation in Vega
- genome assemblies and automated annotation in Ensembl
- integrated ZF-Models data and tools
Clone mapping and sequencing: mapping
- 2 BAC Tuebingen libraries
- 1 BAC and 1 cosmid library from a single Tuebingen double-haploid fish
- end sequencing, RH mapping, fingerprinting
- pieced together according to fingerprints, marker mapping and sequence alignment
- currently ~2,500 ctgs
Clone mapping and sequencing: sequencing pipeline
- select clones based on position in fpc contig
- subcloning
- sequencing
- automatic assembly/pre-finishing (back to sequencing if necessary)
- finishing
- QC
- automated analysis pipeline
- manual annotation
- submission to EMBL
Manual annotation
- unfinished sequence, finished sequence
- automated analysis pipeline: RepeatMasker, CpG island prediction, Genscan, FGenesh, halfwise (Pfam), EPCR, Blast (ESTs, cDNAs, proteins)
- manual annotation: gene structures, remarks (gene names, function, similarities), other features
- otter: mysql database in 'ensembl style', acedb or apollo front end, open to users from the 'outside'
- EMBL
Annotation policy
- follows guidelines for human annotation (Havana team, Sanger Institute)
- no "guesses": annotations solely based on supporting evidence
- annotation of: CDSs and UTRs / transcripts, splice variants, pseudogenes, poly A features, transposons, repeats
- approved nomenclature (SI:clone.number)
- collaboration with ZFIN: existing ZFIN records are reported; ZFIN provides new records for newly found genes
Manual annotation (figure; tracks shown: DNA, repeats, CpG island, Genscan, FGenesH, proteins, ESTs, mRNAs)
vega.sanger.ac.uk
Vega contigview
Vega geneview
When to use what
- go to vega.sanger.ac.uk if you need: highly reliable sequence; highly reliable annotation (with your input); ‘your gene’ stable over time (TILLING)
- go to Ensembl if you need: the whole genome; comparative data; ZF-Models microarray or insertional mutagenesis data; complicated searches (BioMart)
Zebrafish Genome Project: assembly release (Zv5)
- clone mapping and sequencing: clone libraries, map, (un)finished clones; BACs, fpc ctg, tile path; sequencing, ~8,000 finished clones (~1 Gb)
- whole genome shotgun sequencing: WGS reads, WGS assembly; contig, supercontig
- integration; markers (T51); clones + ctgs, contigs
- finish clone
- 1.63 Gb
- automatic annotation, manual annotation
WGS assembly
- Phusion assembler, High Performance Assembly Group (Zemin Ning et al.)
- steps: reads, group reads, contig, supercontig
- tools shown: phrap, read-pair tracker
- contigs A, B, C placed in a supercontig, with gaps written as N (gap NNNNNNNN)
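The supercontig step on this slide can be illustrated with a minimal sketch. It is not Phusion's code: it assumes the order of contigs A, B and C has already been decided (for example from read pairs) and only shows how ordered contigs can be joined with runs of N. The function name `build_supercontig`, the toy sequences and the fixed gap length of 8 are hypothetical.

```python
def build_supercontig(contigs, gap_length=8):
    """Join ordered contig sequences into one supercontig, padding each gap with N."""
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    gap = "N" * gap_length
    return gap.join(contigs)


# Toy contigs standing in for A, B and C (made-up sequences, not real data).
contig_a = "ATGGCGTGCAGT"
contig_b = "CCATGTTCGGAT"
contig_c = "CAATGGCGTGCA"

supercontig = build_supercontig([contig_a, contig_b, contig_c])
print(supercontig)
```

In a real assembly the gap length would be estimated per gap (for instance from insert sizes) rather than fixed; the constant here only keeps the sketch short.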
Read grouping: k-mer word hashing
- continuous base hash, k=12
  read: ATGGCGTGCAGTCCATGTTCGGATCAATGGCGTGCAGT
  words: TGGCGTGCAGTC, GGCGTGCAGTCC, GCGTGCAGTCCA
- gap hash, k=12 (4x3), dealing with variation
  read: ATGGCGTGCAGTCCATGTTCGGATCA
  windows: ATGGCGTGCAGTCCATGT, TGGCGTGCAGTCCATGTT, GGCGTGCAGTCCATGTTC, GCGTGCAGTCCATGTTCG
- k-mer occurrence frequency (word distribution plot; labels: ~7, repeats, seq. errors)
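The two hashing schemes named on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the Phusion implementation: the function names are hypothetical, and the gap layout (4 blocks of 3 bases with 2 skipped bases between blocks, giving an 18-base window) is an assumption inferred from the "4x3" label and the 18-base windows in the slide's example.

```python
from collections import Counter


def contiguous_kmers(seq, k=12):
    """Yield every k-base word from a window sliding one base at a time."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def gapped_kmers(seq, block=3, blocks=4, skip=2):
    """Yield words made of `blocks` runs of `block` bases, skipping `skip` bases between runs.

    With the defaults each word has 12 bases sampled from an 18-base window
    (assumed layout: 3 kept, 2 skipped, 3 kept, 2 skipped, 3 kept, 2 skipped, 3 kept).
    """
    span = block * blocks + skip * (blocks - 1)
    step = block + skip
    for i in range(len(seq) - span + 1):
        window = seq[i:i + span]
        yield "".join(window[j:j + block] for j in range(0, span, step))


def kmer_counts(reads, k=12):
    """Count how often each contiguous k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(contiguous_kmers(read, k))
    return counts


# Reads copied from the slide.
read = "ATGGCGTGCAGTCCATGTTCGGATCAATGGCGTGCAGT"
gap_read = "ATGGCGTGCAGTCCATGTTCGGATCA"

print(list(contiguous_kmers(read))[1:4])
print(next(gapped_kmers(gap_read)))
print(kmer_counts([read]).most_common(1))
```

Because a gapped word ignores the skipped positions, a base difference that falls in a skipped position leaves the word unchanged, which is one way such a hash can tolerate variation. The occurrence counts are the raw material for a word-distribution plot like the one on the slide.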
Zebrafish Genome Project: assembly release (Zv5)
- clone mapping and sequencing: clone libraries, map, (un)finished clones; sequencing, ~7,000 finished clones (~1 Gb)
- whole genome shotgun sequencing: WGS reads, WGS assembly
- integration; markers (T51)
- automatic annotation, manual annotation
Integration (figure; labels: Zv5 scaffoldn, Zv5 scaffoldn.1, Zv5 scaffoldn.3, Zv5 scaffoldn.5, Zv5 scaffoldn.7; clone accessions BX, BX005153; fpc contig; WGS supercontig; marker; cDNA; bacends; BACs)
Assemblies
                             Zv5                  Zv4                  Zv3                Zv2
release date
assembly total length [bp]   1,630,306,866        1,592,025,686        1,459,115,486      1,452,210,772
scaffolds                    16,214               21,333               58,339             83,470
finished clones              4,519 (699 Mb)       2,828 (443 Mb)       1,502 (263 Mb)     -
scaffolds in chr 1-25        1,749                1,892                1,490              -
scaffolds in fpc contigs     265 (chrU)           694 (chrU)           1,842              5,677
NA scaffolds                 14,676               18,747               54,798             77,793
sum(length) chr 1-25 [bp]    1,200,129,620 (73%)  1,097,507,810 (69%)  718,270,423 (49%)  -
sum(length) ctgs             183,993,739 (11%)    176,222,396 (11%)    365,271,659 (25%)  1,143,459,008
sum(length) NAs              246,183,507 (16%)    318,295,480 (20%)    335,615,307 (23%)  308,751,764
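As a worked check of the Zv5 column, the three length categories (chr 1-25, ctgs, NAs) can be added up and expressed as shares of the assembly total. The sketch below uses only the Zv5 figures from the slide; the variable and function names are hypothetical. Shares computed this way may differ by a point from the rounded percentages shown on the slide.

```python
# Zv5 figures copied from the slide, in base pairs.
zv5_total = 1_630_306_866
zv5_parts = {
    "chr 1-25": 1_200_129_620,
    "ctgs": 183_993_739,
    "NAs": 246_183_507,
}


def shares(parts, total):
    """Return each category's length as a percentage of the total length."""
    return {name: 100.0 * length / total for name, length in parts.items()}


# The three categories account for the whole Zv5 assembly length.
print(sum(zv5_parts.values()) == zv5_total)

for name, pct in shares(zv5_parts, zv5_total).items():
    print(f"{name}: {pct:.1f}%")
```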
Automatic Annotation
- Zebrafish Proteins, Other Proteins: Genewise, Genewise genes
- Zebrafish cDNAs: Exonerate, Aligned cDNAs; Genewise genes with UTRs
- Zebrafish ESTs: Exonerate, Aligned ESTs; ClusterMerge, Ensembl EST genes
- Genebuilder; Supported ab initio (optional); Final set
Ensembl
Contigview
Geneview
Searching Ensembl
BioMart: start, filter, output
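The start, filter, output steps have a direct counterpart in BioMart's XML query format, where a Dataset element selects the starting data, Filter elements restrict it, and Attribute elements choose the output columns. The sketch below only builds such a query as text and sends nothing over the network. The helper name `build_biomart_query` is hypothetical, and the dataset, filter and attribute names are assumptions that should be checked against the mart actually being queried.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart-style XML query string (start = dataset, filter, output = attributes)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
    })
    # Start: choose the dataset to query.
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filter: restrict which records are returned.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Output: choose which columns come back.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed names, for illustration only.
xml_query = build_biomart_query(
    dataset="drerio_gene_ensembl",
    filters={"chromosome_name": "1"},
    attributes=["ensembl_gene_id", "start_position", "end_position"],
)
print(xml_query)
```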
Dos and Don'ts
- go elsewhere (Ensembl) if you: want to know about the whole genome; need comparative data; need ZF-Models microarray or insertional mutagenesis data; need to do complicated searches
- go to Vega if you: need highly reliable sequence; need highly reliable annotation; need ‘your gene’ stable over time (TILLING)
DAS (figure)
- DAS client: genome browser with local storage
- DAS servers (three shown), each with remote storage
- reference sequence
- XML passed between DAS servers and DAS client
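On the client side, the XML returned by a DAS server has to be turned into features a genome browser can draw. The sketch below assumes the DAS 1.x features layout (DASGFF, GFF, SEGMENT, FEATURE elements). The sample response, the example.org URL and the function name `parse_das_features` are made up for illustration, and no server is contacted.

```python
import xml.etree.ElementTree as ET

# Made-up response in the assumed DAS 1.x features layout.
SAMPLE_DAS_RESPONSE = """<DASGFF>
  <GFF version="1.0" href="http://example.org/das/example/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="feat1" label="example gene">
        <TYPE id="gene">gene</TYPE>
        <START>1200</START>
        <END>1800</END>
        <ORIENTATION>+</ORIENTATION>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>
"""


def parse_das_features(xml_text):
    """Extract one dictionary per FEATURE element from a DAS features response."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "label": feature.get("label"),
                "type": feature.findtext("TYPE", default="").strip(),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
                "orientation": feature.findtext("ORIENTATION", default="0").strip(),
            })
    return features


for feat in parse_das_features(SAMPLE_DAS_RESPONSE):
    print(feat)
```

A client that merges annotation from several servers could call such a parser once per server response and overlay the results on the same reference coordinates.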
SNPs and Indels
Ensembl releases