Stickleback Seg Dup Analysis 1.Genome 2.Parameters for Pipeline 3.Analysis 4.Files and images are at

Stickleback Seg Dup Analysis
1. Genome
2. Parameters for Pipeline
3. Analysis
4. Files and images are at
The data is in directory

Stickleback Genome
The genome (v1.0) was downloaded from UCSC. Total length is 463,354,448 bp, including a chrUn of 62,550,211 bp. Ensembl gene annotations were also downloaded from UCSC.

Seg Dup detection pipelines
WGAC (whole-genome assembly comparison): detects Seg Dups in a genomic assembly by looking for homologous sequence pairs (>1 kb in length, >90% identity).
WSSD (whole-genome shotgun sequence detection): detects Seg Dups in given sequences based on the depth of coverage of WGS (whole-genome shotgun) reads. Depth of coverage > average + 3 SD.
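The WGAC criteria above can be sketched as a simple filter. This is a minimal illustration, not the pipeline's actual code: the Alignment record, its field names, and the example pairs are hypothetical, and identity is assumed to be stored as a fraction.

```python
from dataclasses import dataclass

# Hypothetical record for one pairwise alignment; field names are illustrative.
@dataclass
class Alignment:
    chrom1: str
    chrom2: str
    length_bp: int
    identity: float  # assumed to be a fraction between 0 and 1

MIN_LENGTH_BP = 1000   # ">1kb in length" from the slide
MIN_IDENTITY = 0.90    # ">90% identity" from the slide

def is_wgac_pair(aln):
    """True if the alignment passes both WGAC cutoffs."""
    return aln.length_bp > MIN_LENGTH_BP and aln.identity > MIN_IDENTITY

def classify(aln):
    """Label a pair as intra- or interchromosomal."""
    return "intra" if aln.chrom1 == aln.chrom2 else "inter"

# Made-up example pairs, not real stickleback data.
pairs = [
    Alignment("chrI", "chrI", 5200, 0.96),
    Alignment("chrI", "chrUn", 800, 0.99),    # too short
    Alignment("chrII", "chrIV", 12000, 0.88), # identity too low
    Alignment("chrUn", "chrUn", 1500, 0.93),
    Alignment("chrV", "chrX", 3000, 0.95),
]
kept = [p for p in pairs if is_wgac_pair(p)]
kinds = [classify(p) for p in kept]
print(len(kept), kinds)
```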

Parameters and Notes for WGAC pipeline
Repeats
– Standard repeat coordinates were reverse-generated from the soft-masked data.
– Secondary repeat masking was done with two repeat libraries, ab_initio_lib.txt and supplemental_lib.txt.
– Repeat-mask results for all three libraries were combined and sorted, then used for both pipelines.
BLAST parsing seeds in WGAC pipeline
– The seed size is 500 bp.

Result from WGAC Pipeline
Total pairs of SD detected (>1 kb and >90% identity): 152,272
Interchromosomal pairs: 63,744
Intrachromosomal pairs: 88,528
chrUn intra: 81,641
chrUn inter and intra:
Total NR: 40,573,574 bp
Notes: In general, the number of WGAC pairs is too high (10%) for a stickleback genome of only about 400 Mb. 92% of all intrachromosomal WGAC pairs and 81% of all pairs have at least one sequence of the pair on chrUn. This is expected, since chrUn contains a high percentage of redundant, poorly assembled sequences. Our analysis also suggests that potential repeats not covered by the repeat libraries may also be detected as WGAC pairs (see next slide).
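The counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this deck (the genome length comes from the Stickleback Genome slide); the variable names are illustrative.

```python
# Pair counts and NR space as quoted on the WGAC result slide.
inter_pairs = 63_744
intra_pairs = 88_528
chrun_intra = 81_641
nr_bp = 40_573_574

# Genome length as quoted on the Stickleback Genome slide.
genome_bp = 463_354_448

total_pairs = inter_pairs + intra_pairs
chrun_intra_share = chrun_intra / intra_pairs  # fraction of intra pairs on chrUn
nr_share = nr_bp / genome_bp                   # fraction of genome in WGAC NR space

print(f"total pairs: {total_pairs:,}")
print(f"chrUn share of intra pairs: {chrun_intra_share:.1%}")
print(f"NR space as share of genome: {nr_share:.1%}")
```

The chrUn share of intrachromosomal pairs comes out at about 92%, matching the note, and the NR space is a little under 9% of the assembly.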

Repeats?
Since repeats might be an issue, I set up a filter to determine how many WGAC pairs may be affected. If I use >10 hits, 400 bp boundary overlap, and hit length <10 kb, 60% of WGAC pairs are affected. I then generated the NR space of these hits: a total of 7,481,640 bp from 103,157 pairs, out of the total WGAC set (152,272 pairs, 40,473,574 bp). That is 2/3 of the hits but only 1/5 of the total NR space, which I think is very reasonable: a high proportion of the WGAC pairs affects only a small proportion of the NR space. These sequence intervals should also be detected by WSSD if they are repeats. However, I have not yet removed them from AllDup (the merge of WGAC and WSSD), because many of them have high-frequency hits on chrUn. At this stage we do not know whether they are redundant sequences or real seg dups, but we can pull them out at any time based on their coordinates. If I use >20 hits, 400 bp on boundary, and hit length <10 kb, 30% of WGAC can be
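The "NR space" of a set of hits is the number of bases left after overlapping intervals are collapsed, which is why many pairs can map to a small NR space. A minimal sketch of that calculation follows, assuming half-open (chrom, start, end) coordinates; the function names and example intervals are hypothetical, not taken from the pipeline.

```python
def merge_intervals(intervals):
    """Collapse overlapping or touching (chrom, start, end) intervals.

    Coordinates are assumed to be half-open: start inclusive, end exclusive.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]

def nr_bases(intervals):
    """Total non-redundant bases covered by the intervals."""
    return sum(end - start for _, start, end in merge_intervals(intervals))

# Made-up hits: five intervals, but heavy overlap shrinks the NR space.
hits = [
    ("chrI", 100, 500),
    ("chrI", 400, 900),
    ("chrI", 2000, 2500),
    ("chrUn", 0, 300),
    ("chrUn", 100, 250),
]
print(merge_intervals(hits))
print(nr_bases(hits))
```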

General analysis of WGAC length and identity distribution
1. Length distribution peaked at inter, with 92% of intra on chrUn.
2. Identity distribution peaked at 96%. Few are higher than 99%.

General analysis: NR distribution by chromosome. High SD in chrUn.

General view showing all WGAC pairs on all chromosomes. SD is concentrated on the smaller supercontigs of chrUn.

Global image showing the inter and intra pairs of at least 5 kb and above 90% identity, without chrUn. Red indicates interchromosomal pairs and blue indicates intrachromosomal pairs.

Global image showing the inter and intra pairs of 10 kb and 90% identity, without chrUn. Red indicates interchromosomal pairs and blue indicates intrachromosomal pairs.

Global image showing the inter and intra WGAC pairs of 10 kb and 90% identity, with chrUn included. Red indicates interchromosomal pairs and blue indicates intrachromosomal pairs.

WSSD analysis
Downloaded the WGS reads (about 6 million).
Downloaded the stickleback finished BACs. These BACs are used to determine the threshold for WGS depth of coverage.
For a 5 kb window, the average number of reads is 78, with SD 27. The threshold for a 5 kb window is 125; for a 1 kb window it is 25 (average + 3 SD).
Repeat masking of the stickleback genome: I used the standard library, ab_initio_lib.txt and supplemental_lib.txt. In addition, I added the potential repeats detected in the WGAC process, i.e. regions with more than 20 hit pairs in the same region.
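The average + 3 SD rule can be sketched as follows. This is only an illustration under stated assumptions: the function names and the window counts are hypothetical, and the sample standard deviation is used because the slide does not say which estimator was applied. Note that plugging the slide's 5 kb figures (mean 78, SD 27) into mean + 3 SD gives 159, not the quoted 125, so the quoted cutoffs may come from a different calculation or data set than this sketch assumes.

```python
from statistics import mean, stdev

def threshold_from_stats(avg, sd, n_sd=3.0):
    """Depth cutoff as average + n_sd * SD."""
    return avg + n_sd * sd

def depth_threshold(control_counts, n_sd=3.0):
    """Derive the cutoff from read counts in control windows
    (for example, windows over finished BACs)."""
    return threshold_from_stats(mean(control_counts), stdev(control_counts), n_sd)

def flag_windows(counts, threshold):
    """Indices of windows whose read count exceeds the cutoff."""
    return [i for i, c in enumerate(counts) if c > threshold]

# Made-up read counts per window, not real stickleback data.
control = [70, 80, 90, 60, 100]
cutoff = depth_threshold(control)
flagged = flag_windows([75, 130, 90, 200], cutoff)

print(round(cutoff, 2), flagged)
print(threshold_from_stats(78, 27))  # the slide's 5 kb mean and SD
```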

WSSD result
A total of 729 regions covering 22,324,144 bp were found in wssdGE10K_nogap.tab (which has a 10 kb cutoff); 251 of them are on chrUn.
wssd.tab contains 850 regions with 23,116,317 bases in total. It has 125 more regions and less than 1 Mb of extra sequence compared with the 10 kb hits.
A summary table of the WGAC intersect with WSSD is at acCMPwssd.xls

Union of WSSD and WGAC; gene intersect with Seg Dups
First, a non-redundant union of WGAC and WSSD was generated (AllDup.tab). The genes were then intersected with AllDup to identify genes that overlap duplicated space in the genome. 3135 Ensembl genes were identified. Both data sets are at
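The gene intersect can be sketched as an overlap test between gene coordinates and the union of the two interval sets. This is a minimal, quadratic-time illustration with hypothetical gene names and intervals, assuming half-open coordinates; it is not the method used to produce AllDup.tab.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def genes_in_dups(genes, dups):
    """Names of genes that overlap any duplicated interval.

    genes: list of (name, chrom, start, end)
    dups:  list of (chrom, start, end); overlapping entries are fine here,
           since only presence of an overlap is tested.
    """
    by_chrom = {}
    for chrom, start, end in dups:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = []
    for name, chrom, start, end in genes:
        if any(overlaps(start, end, s, e) for s, e in by_chrom.get(chrom, [])):
            hits.append(name)
    return hits

# Made-up intervals and genes for illustration only.
wgac = [("chrI", 1000, 5000), ("chrII", 200, 900)]
wssd = [("chrI", 4000, 8000), ("chrUn", 0, 10000)]
all_dup = wgac + wssd  # union of both sets, left unmerged for overlap testing

genes = [
    ("geneA", "chrI", 500, 1200),    # overlaps a WGAC interval
    ("geneB", "chrI", 9000, 9500),   # outside all intervals
    ("geneC", "chrUn", 300, 700),    # overlaps a WSSD interval
    ("geneD", "chrIII", 100, 400),   # chromosome has no duplications
]
dup_genes = genes_in_dups(genes, all_dup)
print(dup_genes)
```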

General view of WGAC and WSSD on the chromosomes
WSSD: black, above the chromosome line
WGAC 5k, 94%: black, below the chromosome line
WGAC 10k: brown, below the chromosome line

Summary table 1
Columns: total, chrN, chrUn, No. nr interval, file
wssd (bp): wssdGE10K_nogap.tab
wgac (bp): data/wgac/NRspace
AllDup (bp): data/allDup.tab
Genome (bp)
repeats ? (bp): data/repeathitMerge

The intersect between WSSD and WGAC
Columns: chrom, size, allWGAC, gt94WGAC_ge10K, WSSD, Shared, gt94WGAC_ge10K_WGAConly, WSSDonly, <=94%WGAC, <=94%WGAC+shared
Rows: chrI, chrII, chrIII, chrIV, chrIX, chrUn, chrV, chrVI, chrVII, chrVIII, chrX, chrXI, chrXII, chrXIII, chrXIV, chrXIX, chrXV, chrXVI, chrXVII, chrXVIII, chrXX, chrXXI, total

Summary
Stickleback Seg Dups have been detected using two independent pipelines, WGAC and WSSD. Since each pipeline is based on its own mechanism, we expect the majority of the intervals to be consistent, with some variation.
From the results of the two pipelines, two sets of genomic intervals were generated for Seg Dups.
– The first set consists of the genomic intervals detected by both WGAC and WSSD, i.e. the intersect of WGAC and WSSD. This set represents the most conservative estimate of Seg Dups in the genome.
– The second set is the union of the WGAC and WSSD intervals (AllDup.tab), which represents the largest estimate of Seg Dups in the genome.
– A list of genes intersecting each set was also generated. With AllDup (the union of WGAC and WSSD) there are 3153 genes in total. With the WGAC and WSSD intersect there are 1267 genes in total.
A list of intervals with the potential to be repeats was also generated. These are regions with a high frequency of hits within a defined boundary (>10 hits; 60% of total WGAC pairs and 1/5 of the WGAC NR intervals). / repeathitMerge
ChrUn contigs contribute a great deal to the total SD in both WGAC and WSSD. The identity distribution analysis shows that the identity of most pairs is less than 99%, suggesting they may contain true SDs that are hard to assemble. How many of them are true SDs remains to be determined.
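The conservative set described above is an interval intersection. A minimal sketch of one way to compute it is given below, assuming each input list is already merged (no overlaps within a list) and uses half-open coordinates; the function name and example intervals are hypothetical.

```python
def intersect(a, b):
    """Intersect two interval sets given as (chrom, start, end) tuples.

    Each input is assumed to be non-overlapping within itself, for example
    the output of a prior merge step. Coordinates are half-open.
    """
    a, b = sorted(a), sorted(b)
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a != chrom_b:
            # Advance whichever list is behind in chromosome order.
            if chrom_a < chrom_b:
                i += 1
            else:
                j += 1
            continue
        start, end = max(start_a, start_b), min(end_a, end_b)
        if start < end:
            out.append((chrom_a, start, end))
        # Advance the interval that finishes first.
        if end_a <= end_b:
            i += 1
        else:
            j += 1
    return out

# Made-up intervals for illustration only.
wgac = [("chrI", 100, 500), ("chrI", 800, 1200), ("chrII", 0, 300)]
wssd = [("chrI", 400, 900), ("chrII", 500, 700)]

shared = intersect(wgac, wssd)
shared_bp = sum(end - start for _, start, end in shared)
print(shared, shared_bp)
```

The union (the largest estimate) would instead concatenate both lists and merge overlapping intervals.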