Kerstin Howe, Mario Caccamo, Ian Sealy The Zebrafish Genome Sequencing Project Bioinformatics resources.

1 Kerstin Howe, Mario Caccamo, Ian Sealy: The Zebrafish Genome Sequencing Project. Bioinformatics resources

2 Bioinformatics resources outline
  clone mapping, sequencing and manual annotation
  genome assemblies and automated annotation
  integrated ZF-Models data and tools

3 Clone mapping and sequencing: mapping
  2 BAC Tuebingen libraries
  1 BAC and 1 cosmid library from a single Tuebingen double-haploid fish
  end sequencing, RH mapping, fingerprinting
  pieced together according to fingerprints, marker mapping, sequence alignment
  currently ~2500 ctgs

4 Clone mapping and sequencing: sequencing pipeline
  select clones based on position in fpc contig
  subcloning
  sequencing
  automatic assembly/pre-finishing (back to sequencing if necessary)
  finishing
  QC
  automated analysis pipeline
  manual annotation
  submission to EMBL

5 Manual annotation
  input: unfinished sequence, finished sequence
  automated analysis pipeline: RepeatMasker, CpG island prediction, Genscan, FGenesh, halfwise (Pfam), EPCR, Blast (ESTs, cDNAs, proteins)
  manual annotation: gene structures, remarks (gene names, function, similarities), other features
  otter: mysql database in 'ensembl style', acedb or apollo front end, open to users from the 'outside'
  EMBL

6 Annotation policy
  follows guidelines for human annotation (havana team, Sanger Institute)
  no "guesses": annotations are based solely on supporting evidence
  annotation of: CDSs and UTRs / transcripts, splice variants, pseudogenes, poly A features, transposons, repeats
  approved nomenclature (SI:clone.number)
  collaboration with ZFIN: existing ZFIN records are reported; ZFIN provides new records for newly found genes

7 Manual annotation (tracks shown: DNA, repeats, CpG island, Genscan, FGenesH, proteins, ESTs, mRNAs)

8 vega.sanger.ac.uk

9 Vega contigview

10 Vega geneview

11 www.sanger.ac.uk/Projects/D_rerio

12

13 When to use what
  go to vega.sanger.ac.uk if you need:
    highly reliable sequence
    highly reliable annotation (with your input)
    ‘your gene’ stable over time (TILLING)
  go to www.ensembl.org if you need:
    the whole genome
    comparative data
    ZF-Models microarray or insertional mutagenesis data
    complicated searches (BioMart)

14 Zebrafish Genome Project
  clone mapping and sequencing: clone libraries, map (fpc ctg), tile path BACs, sequencing, (un)finished clones; ~8,000 finished clones (~1 Gb)
  whole genome shotgun sequencing: WGS reads, WGS assembly (contig, supercontig)
  integration: clones + ctgs, contigs, finish clone, markers (T51)
  assembly release (Zv5): 1.63 Gb
  automatic annotation, manual annotation

15 WGS assembly
  Phusion assembler - High Performance Assembly Group (Zemin Ning et al.)
  reads, group reads, phrap, contig, read-pair tracker, supercontig
  contigs A, B, C are placed in a supercontig; a gap is represented as NNNNNNNN
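As a small illustration of the last step on slide 15, the Python sketch below joins already ordered contigs into one supercontig sequence and pads each gap with a run of N. The function name and the default gap length of 8 (taken from the eight Ns on the slide) are assumptions for illustration; how Phusion actually orders contigs and sizes gaps from read pairs is not shown here.

```python
def build_supercontig(contigs, gap_len=8):
    """Join ordered contig sequences, padding every gap with gap_len Ns.

    Returns the supercontig sequence and the (start, end) placement of each
    contig within it (0-based, end exclusive). Hypothetical helper.
    """
    parts = []
    placements = []
    pos = 0
    for i, contig in enumerate(contigs):
        if i > 0:
            # Unknown sequence between neighbouring contigs is written as N.
            parts.append("N" * gap_len)
            pos += gap_len
        placements.append((pos, pos + len(contig)))
        parts.append(contig)
        pos += len(contig)
    return "".join(parts), placements
```

The returned placements make it possible to map a position in the supercontig back to the contig it came from.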

16 Read grouping
  continuous base hash, k=12
    ATGGCGTGCAGTCCATGTTCGGATCA
    ATGGCGTGCAGT
     TGGCGTGCAGTC
      GGCGTGCAGTCC
       GCGTGCAGTCCA
  gap hash, k=12 (4x3), dealing with variation
    ATGGCGTGCAGTCCATGTTCGGATCA
    ATGGCGTGCAGTCCATGT
     TGGCGTGCAGTCCATGTT
      GGCGTGCAGTCCATGTTC
       GCGTGCAGTCCATGTTCG
  k-mer word hashing
  [Figure: word distribution, k-mer occurrence frequency; labels: ~7, repeats, seq. errors]
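The two hashing schemes on slide 16 can be sketched in Python as follows. The continuous hash takes every 12-base word of a read. For the gap hash the slide only states "k=12 (4x3)" and shows 18-base windows, so the sampling pattern used here (four blocks of 3 bases separated by 2-base gaps, which spans 18 bases) is an assumption chosen to fit those numbers and may differ from what Phusion does. All function names are hypothetical.

```python
from collections import Counter

# Example read from slide 16.
SEQ = "ATGGCGTGCAGTCCATGTTCGGATCA"

def continuous_words(seq, k=12):
    """Every overlapping k-base word of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def gapped_words(seq, block=3, n_blocks=4, gap=2):
    """Words of block * n_blocks sampled bases per window.

    Assumed pattern: n_blocks blocks of `block` bases, with `gap`
    unsampled bases between consecutive blocks.
    """
    span = n_blocks * block + (n_blocks - 1) * gap
    offsets = [b * (block + gap) + j
               for b in range(n_blocks) for j in range(block)]
    return ["".join(seq[i + o] for o in offsets)
            for i in range(len(seq) - span + 1)]

def word_counts(reads, k=12):
    """Occurrence count of each continuous k-base word across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(continuous_words(read, k))
    return counts
```

Under the assumed pattern, positions 3, 4, 8, 9, 13 and 14 of each window are not sampled, so a single base difference at one of them leaves that window's word unchanged. Tabulating `word_counts` over all reads gives the kind of k-mer occurrence distribution the slide refers to.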

17 Zebrafish Genome Project
  clone mapping and sequencing: clone libraries, map, sequencing, (un)finished clones; ~7,000 finished clones (~1 Gb)
  whole genome shotgun sequencing: WGS reads, WGS assembly
  integration: markers (T51)
  assembly release (Zv5)
  automatic annotation, manual annotation

18 Integration
  [Diagram labels: Zv5 scaffoldn; BACs BX005049.6, BX005057.8, BX005153, BX005123.6; fpc contig; WGS supercontig; marker; cDNA; bacends; Zv5 scaffoldn.1, Zv5 scaffoldn.3, Zv5 scaffoldn.5, Zv5 scaffoldn.7]

19 Assemblies
                              Zv5                   Zv4                   Zv3                   Zv2
release date assembly         27.05.05              12.07.04              27.11.03              03.04.03
total length [bp]             1,630,306,866         1,592,025,686         1,459,115,486         1,452,210,772
scaffolds                     16,214                21,333                58,339                83,470
finished clones               4,519 (699 Mb)        2,828 (443 Mb)        1,502 (263 Mb)        -
scaffolds in chr 1-25         1,749                 1,892                 1,490                 -
scaffolds in fpc contigs      265 (chrU)            694 (chrU)            1,842                 5,677
NA scaffolds                  14,676                18,747                54,798                77,793
sum(length) chr 1-25 [bp]     1,200,129,620 (73%)   1,097,507,810 (69%)   718,270,423 (49%)     -
sum(length) ctgs              183,993,739 (11%)     176,222,396 (11%)     365,271,659 (25%)     1,143,459,008
sum(length) NAs               246,183,507 (16%)     318,295,480 (20%)     335,615,307 (23%)     308,751,764
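The length rows of the table on slide 19 can be related to each other with a little arithmetic. The Python sketch below copies the Zv5 and Zv4 figures from the table and computes the share of each assembly placed on chromosomes 1-25. The dictionary layout and function names are assumptions for illustration only.

```python
# Lengths in bp, copied from the Zv5 and Zv4 columns of the table.
ASSEMBLIES = {
    "Zv5": {"total": 1_630_306_866, "chr": 1_200_129_620,
            "ctgs": 183_993_739, "na": 246_183_507},
    "Zv4": {"total": 1_592_025_686, "chr": 1_097_507_810,
            "ctgs": 176_222_396, "na": 318_295_480},
}

def component_sum(name):
    """Sum of the chr 1-25, fpc contig and NA lengths of one assembly."""
    a = ASSEMBLIES[name]
    return a["chr"] + a["ctgs"] + a["na"]

def placed_fraction(name):
    """Fraction of the total assembly length placed on chr 1-25."""
    a = ASSEMBLIES[name]
    return a["chr"] / a["total"]

if __name__ == "__main__":
    for name in ASSEMBLIES:
        print(name, component_sum(name), f"{placed_fraction(name):.1%}")
```

For both releases the three components add up to the total length given in the table.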

20 Automatic Annotation
  Zebrafish proteins, other proteins: Genewise, Genewise genes
  Zebrafish cDNAs: Exonerate, aligned cDNAs, Genewise genes with UTRs
  Zebrafish ESTs: Exonerate, aligned ESTs, ClusterMerge, Ensembl EST genes
  Genebuilder, supported ab initio (optional): final set

21 Ensembl

22 Contigview

23 Geneview

24 Searching Ensembl

25 Biomart: start, filter, output

26

27 Dos and Don'ts
  go elsewhere (Ensembl) if you:
    want to know about the whole genome
    need comparative data
    need ZF-Models microarray or insertional mutagenesis data
    need to do complicated searches
  go to Vega if you:
    need highly reliable sequence
    need highly reliable annotation
    need ‘your gene’ stable over time (TILLING)

28 DAS
  [Diagram labels: reference sequence; genome browser (DAS client) with local storage; three DAS servers, each with remote storage; XML]

29 SNPs and Indels

30 Ensembl releases

