The Zebrafish Genome Sequencing Project: Bioinformatics resources
Kerstin Howe, Mario Caccamo, Ian Sealy

Bioinformatics resources: outline
  clone mapping, sequencing and manual annotation
  genome assemblies and automated annotation
  integrated ZF-Models data and tools

Clone mapping and sequencing: mapping
  2 BAC Tuebingen libraries
  1 BAC and 1 cosmid library from a single Tuebingen double-haploid fish
  end sequencing, RH mapping, fingerprinting
  pieced together according to fingerprints, marker mapping, sequence alignment
  currently ~2500 ctgs

Clone mapping and sequencing: sequencing pipeline
  select clones based on position in fpc contig
  subcloning
  sequencing
  automatic assembly/pre-finishing (back to sequencing if necessary)
  finishing
  QC
  automated analysis pipeline
  manual annotation
  submission to EMBL

Manual annotation
  input: unfinished sequence, finished sequence
  automated analysis pipeline: RepeatMasker, CpG island prediction, Genscan, FGenesh, halfwise (Pfam), EPCR, Blast (ESTs, cDNAs, proteins)
  manual annotation: gene structures, remarks (gene names, function, similarities), other features
  otter: mysql database in 'ensembl style', acedb or apollo front end, open to users from the 'outside'
  EMBL

annotation policy
  follows guidelines for human annotation (havana team, Sanger Institute)
  no "guesses", annotations solely based on supporting evidence
  annotation of:
    CDSs and UTRs / transcripts
    splice variants
    pseudogenes
    poly A features
    transposons
    repeats
  approved nomenclature (SI:clone.number)
  collaboration with ZFIN
    existing ZFIN records are reported
    ZFIN provides new records for newly found genes

Manual annotation (screenshot; track labels: DNA repeats, CpG island, Genscan, FGenesH, proteins, ESTs, mRNAs)

vega.sanger.ac.uk

Vega contigview

Vega geneview

www.sanger.ac.uk/Projects/D_rerio

when to use what
  go to vega.sanger.ac.uk if you need
    highly reliable sequence
    highly reliable annotation (with your input)
    ‘your gene’ stable over time (TILLING)
  go to www.ensembl.org if you need
    the whole genome
    comparative data
    ZF-Models microarray or insertional mutagenesis data
    complicated searches (BioMart)

Zebrafish Genome Project
  clone mapping and sequencing: clone libraries, map, sequencing, (un)finished clones; ~8,000 finished clones (~1 Gb)
  whole genome shotgun sequencing: WGS reads, WGS assembly
  integration: markers (T51); tile path of BACs in fpc ctg; supercontig, contig; clones+ctgs, contigs, finish clone
  assembly release (Zv5): 1.63 Gb
  automatic annotation, manual annotation

WGS assembly
  Phusion assembler, High Performance Assembly Group (Zemin Ning et al.)
  reads -> group reads -> contig (phrap) -> supercontig (read-pair tracker)
  contigs A, B, C are ordered into a supercontig; each gap between contigs is written as a run of Ns (NNNNNNNN)
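The final step on this slide, ordered contigs joined into a supercontig with each gap written as a run of N, can be illustrated with a minimal sketch. The function name, the toy sequences and the gap sizes are hypothetical; this is not the Phusion or read-pair tracker code, only the string layout the slide depicts.

```python
def build_supercontig(contigs, gap_sizes):
    """Join ordered contigs, padding each gap with a run of 'N'.

    contigs   -- contig sequences in their supercontig order (e.g. A, B, C)
    gap_sizes -- estimated gap length between each neighbouring pair
                 (hypothetical values; in practice derived from read pairs)
    """
    if not contigs:
        return ""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size per pair of neighbours")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between two contigs
        parts.append(contig)
    return "".join(parts)


# Toy contigs A, B, C with made-up gap estimates of 8 and 3 bases.
supercontig = build_supercontig(["ACGT", "GGCC", "TTAA"], [8, 3])
```

Keeping the gaps as N runs preserves the estimated distances between contigs, so coordinates on the supercontig stay roughly meaningful even where no sequence is known.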

Read grouping
  continuous base hash, k=12
    ATGGCGTGCAGTCCATGTTCGGATCAATGGCGTGCAGT
     TGGCGTGCAGTC
      GGCGTGCAGTCC
       GCGTGCAGTCCA
  gap hash, k=12 (4x3), dealing with variation
    ATGGCGTGCAGTCCATGTTCGGATCA
    ATGGCGTGCAGTCCATGT
     TGGCGTGCAGTCCATGTT
      GGCGTGCAGTCCATGTTC
       GCGTGCAGTCCATGTTCG
  k-mer word hashing
    word distribution by k-mer occurrence frequency (plot labels: seq. errors, ~7, repeats)
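The two hashing schemes on this slide can be sketched in a few lines. The contiguous version takes every overlapping 12-base word. For the gapped version the slide only gives "k=12 (4x3)" and 18-base windows, so the layout used here (four sampled blocks of three bases, separated by two skipped bases) is an assumption, as are all function names. This is an illustration of the idea, not the Phusion implementation.

```python
from collections import Counter


def contiguous_kmers(seq, k=12):
    """All overlapping k-base words of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_kmers(seq, blocks=4, block_len=3, skip=2):
    """Gapped words: sample `blocks` blocks of `block_len` bases from each
    window, skipping `skip` bases between blocks (assumed layout)."""
    span = blocks * block_len + (blocks - 1) * skip  # 18 with the defaults
    step = block_len + skip
    words = []
    for i in range(len(seq) - span + 1):
        window = seq[i:i + span]
        words.append("".join(window[j:j + block_len]
                             for j in range(0, span, step)))
    return words


def word_frequencies(reads, k=12):
    """Count how often each contiguous k-mer occurs across a set of reads."""
    counts = Counter()
    for read in reads:
        counts.update(contiguous_kmers(read, k))
    return counts


read = "ATGGCGTGCAGTCCATGTTCGGATCA"      # sequence from the slide
variant = read[:3] + "A" + read[4:]      # one substitution at a skipped base
```

With this layout a substitution that falls on a skipped position leaves the gapped word unchanged, which is one way to read "dealing with variation". Words that occur very rarely or very often in `word_frequencies` would correspond to the "seq. errors" and "repeats" ends of the word distribution.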

Zebrafish Genome Project
  clone mapping and sequencing: clone libraries, map, sequencing, (un)finished clones; ~7,000 finished clones (~1 Gb)
  whole genome shotgun sequencing: WGS reads, WGS assembly
  integration: markers (T51)
  assembly release (Zv5)
  automatic annotation, manual annotation

Integration
  evidence: fpc contig, WGS supercontig, marker, cDNA, bacends, BACs
  clones: BX005049.6, BX005057.8, BX005153, BX005123.6 (also shown grouped as BX005153 + BX005057.8 and BX005049.6 + BX005123.6)
  scaffold labels: Zv5 scaffoldn; Zv5 scaffoldn.3, Zv5 scaffoldn.5, Zv5 scaffoldn.7, Zv5 scaffoldn.1

Assemblies

assembly                     Zv5                  Zv4                  Zv3                Zv2
release date                 27.05.05             12.07.04             27.11.03           03.04.03
total length [bp]            1,630,306,866        1,592,025,686        1,459,115,486      1,452,210,772
scaffolds                    16,214               21,333               58,339             83,470
finished clones              4,519 (699 Mb)       2,828 (443 Mb)       1,502 (263 Mb)     -
scaffolds in chr 1-25        1,749                1,892                1,490              -
scaffolds in fpc contigs     265 (chrU)           694 (chrU)           1,842              5,677
NA scaffolds                 14,676               18,747               54,798             77,793
sum(length) chr 1-25 [bp]    1,200,129,620 (73%)  1,097,507,810 (69%)  718,270,423 (49%)  -
sum(length) ctgs             183,993,739 (11%)    176,222,396 (11%)    365,271,659 (25%)  1,143,459,008
sum(length) NAs              246,183,507 (16%)    318,295,480 (20%)    335,615,307 (23%)  308,751,764
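The length rows of the Assemblies slide can be cross-checked with a little arithmetic: for Zv5 and Zv4 the three categories (chr 1-25, fpc contigs, NA) add up exactly to the total length. A minimal sketch, using only figures from the slide; the dictionary layout and function name are illustrative choices.

```python
# Lengths in bp, copied from the Assemblies slide (Zv5 and Zv4 columns).
ASSEMBLIES = {
    "Zv5": {"total": 1_630_306_866, "chr": 1_200_129_620,
            "ctgs": 183_993_739, "na": 246_183_507},
    "Zv4": {"total": 1_592_025_686, "chr": 1_097_507_810,
            "ctgs": 176_222_396, "na": 318_295_480},
}


def placement_fractions(row):
    """Fraction of the total assembly length in each placement category."""
    return {key: row[key] / row["total"] for key in ("chr", "ctgs", "na")}


for name, row in ASSEMBLIES.items():
    placed = row["chr"] + row["ctgs"] + row["na"]
    fractions = placement_fractions(row)
    print(name, placed == row["total"],
          {key: f"{value:.1%}" for key, value in fractions.items()})
```

Computing the fractions to one decimal place makes it easy to compare them with the whole-number percentages printed on the slide.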

Automatic Annotation
  Zebrafish Proteins, Other Proteins -> Genewise -> Genewise genes
  Zebrafish cDNAs -> Exonerate -> Aligned cDNAs
  Genewise genes + Aligned cDNAs -> Genewise genes with UTRs
  Genewise genes with UTRs + Supported ab initio (optional) -> Genebuilder -> Final set
  Zebrafish ESTs -> Exonerate -> Aligned ESTs -> ClusterMerge -> Ensembl EST genes

Ensembl

Contigview

Geneview

Searching Ensembl

Biomart: start, filter, output

Do’s and Don’ts
  go elsewhere (Ensembl) if you
    want to know about the whole genome
    need comparative data
    need ZF-Models microarray or insertional mutagenesis data
    need to do complicated searches
  go to Vega if you
    need highly reliable sequence
    need highly reliable annotation
    need ‘your gene’ stable over time (TILLING)

DAS
  components: reference sequence; genome browser (DAS client) with local storage; three DAS servers, each with its own remote storage
  the DAS servers send XML to the DAS client
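The client side of the DAS slide, one browser collecting XML from several servers and laying the annotations onto a shared reference sequence, can be sketched as below. The two XML documents are made-up sample data, loosely modelled on a DAS features response; the element names should be checked against the DAS specification, and the server names and function names are hypothetical. No network access is involved.

```python
import xml.etree.ElementTree as ET

# Made-up responses standing in for two remote DAS servers.
GENES_XML = """<DASGFF><GFF><SEGMENT id="1" start="1" stop="1000">
  <FEATURE id="gene_a" label="hypothetical gene">
    <TYPE id="gene"/><START>300</START><END>700</END>
  </FEATURE>
</SEGMENT></GFF></DASGFF>"""

SNPS_XML = """<DASGFF><GFF><SEGMENT id="1" start="1" stop="1000">
  <FEATURE id="snp_b" label="hypothetical SNP">
    <TYPE id="snp"/><START>450</START><END>450</END>
  </FEATURE>
  <FEATURE id="snp_a" label="hypothetical SNP">
    <TYPE id="snp"/><START>120</START><END>120</END>
  </FEATURE>
</SEGMENT></GFF></DASGFF>"""


def parse_features(xml_text, source):
    """Extract features from one server's XML, tagging each with its source."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "source": source,
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": feature.find("TYPE").get("id"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


def merge_sources(responses):
    """Combine features from all servers, ordered along the reference."""
    merged = []
    for source, xml_text in responses.items():
        merged.extend(parse_features(xml_text, source))
    return sorted(merged, key=lambda f: (f["segment"], f["start"], f["end"]))


tracks = merge_sources({"genes_server": GENES_XML, "snps_server": SNPS_XML})
```

Because every feature carries reference coordinates, the client can merge sources it has never seen before without any server knowing about the others.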

SNPs and Indels

Ensembl releases
