Stickleback Seg Dup Analysis
1. Genome
2. Parameters for Pipeline
3. Analysis
4. Files and images are at http://eichlerlab.gs.washington.edu/help/linchen/stickleback/sticklebackwgac.html
5. The data is in directory http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/

Stickleback Genome
The genome (v1.0) was downloaded from UCSC. Total length is 463,354,448 bp, which includes a chrUn of 62,550,211 bp. A total of 29,101 gene annotations from the Ensembl gene annotation were also downloaded from UCSC.

Seg Dup detection pipelines
WGAC: detects Seg Dups in a genomic assembly by looking for homologous pairs (>1 kb in length, >90% identity).
WSSD: detects Seg Dups in given sequences based on the depth of coverage of WGS (whole-genome shotgun) reads. Depth of coverage > average + 3 SD.
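As a rough illustration of the WGAC criteria above, the following sketch filters pairwise alignments by length and identity and labels each surviving pair as inter- or intrachromosomal. The record layout (`AlignmentPair`) and the function names are assumptions made for this example; they are not the pipeline's actual data structures.

```python
from typing import NamedTuple


class AlignmentPair(NamedTuple):
    """One pairwise alignment between two genomic intervals (hypothetical layout)."""
    chrom_a: str
    chrom_b: str
    length_bp: int    # aligned length in base pairs
    identity: float   # fraction of identical bases, 0.0 to 1.0


MIN_LENGTH_BP = 1000  # ">1kb in length" from the slide
MIN_IDENTITY = 0.90   # ">90% identity" from the slide


def is_wgac_candidate(pair: AlignmentPair) -> bool:
    """Return True if the pair passes both WGAC cutoffs (strictly greater than)."""
    return pair.length_bp > MIN_LENGTH_BP and pair.identity > MIN_IDENTITY


def pair_type(pair: AlignmentPair) -> str:
    """Label a pair as 'intra' (same chromosome) or 'inter' (different chromosomes)."""
    return "intra" if pair.chrom_a == pair.chrom_b else "inter"
```

Both cutoffs are treated as strict inequalities here because the slide writes them with ">"; whether the pipeline includes the boundary values is not stated.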

Parameters and Notes for WGAC pipeline
Repeats
– Standard repeat coordinates were reverse-generated from the soft-masked data.
– Secondary repeat masking was done using two repeat libraries, ab_initio_lib.txt and supplemental_lib.txt.
– Repeat-masking results for all three libraries were combined and sorted, then used for both pipelines.
BLAST parsing seeds in WGAC pipeline
– The seed size is 500 bp.

Result from WGAC Pipeline
Total pairs of SD detected (>1 kb and >90% identity): 152,272
Interchromosomal pairs: 63,744
Intrachromosomal pairs: 88,528
chrUn intra: 81,641
chrUn inter and intra: 123,278
Total NR: 40,573,574 bp
Notes: In general, the number of WGAC pairs is too high (10%) for a stickleback genome of only about 400 Mb. 92% of all intrachromosomal WGAC pairs and 81% of all pairs have at least one sequence of the pair on chrUn. This result is expected, since chrUn contains a high percentage of redundant, poorly assembled sequence. Our analysis also suggests that potential repeats not covered by the repeat libraries may also be detected as WGAC pairs (see next slide).
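The percentages quoted in the notes can be rechecked from the counts on this slide. The short sketch below does that arithmetic; the variable names are chosen for this example, and the genome length is taken from the genome slide.

```python
# Counts as reported on the WGAC result slide.
TOTAL_PAIRS = 152_272
INTER_PAIRS = 63_744
INTRA_PAIRS = 88_528
CHRUN_INTRA_PAIRS = 81_641
CHRUN_ANY_PAIRS = 123_278    # inter + intra pairs with a chrUn member
NR_BP = 40_573_574           # non-redundant WGAC space
GENOME_BP = 463_354_448      # total genome length from the genome slide


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


summary = {
    "intra pairs on chrUn (%)": percent(CHRUN_INTRA_PAIRS, INTRA_PAIRS),
    "all pairs touching chrUn (%)": percent(CHRUN_ANY_PAIRS, TOTAL_PAIRS),
    "NR space as share of genome (%)": percent(NR_BP, GENOME_BP),
}

for label, value in summary.items():
    print(f"{label}: {value}")
```

The inter and intra counts add up to the reported total, and the two chrUn shares come out at roughly 92% and 81%, matching the notes.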

Repeats?
Since repeats might be an issue, I set up a filter to determine how many WGAC pairs may be affected. Using >10 hits, 400 bp boundary overlap, and hit length <10 kb, 60% of WGAC is affected. I then generated the NR space of these hits: a total of 7,481,640 bp from 103,157 pairs, out of a total WGAC of 152,272 pairs and 40,573,574 bp. This is about 2/3 of the hits but only about 1/5 of the total NR space, which I think is very reasonable, because the high proportion of WGAC pairs affects only a small proportion of the NR space. These sequence intervals should also be detected by WSSD if they are repeats. However, I have not yet removed them from AllDup (the merge of WGAC and WSSD), because many of them have high-frequency hits on chrUn. At this stage we do not know whether they are redundant sequences or real seg dups, but we can pull them out at any time based on the coordinates. If I use >20 hits, 400 bp on boundary, hit length <10 kb, 30% of WGAC can be
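One possible reading of the filter described on this slide is sketched below: an interval is flagged as a potential repeat when more than a given number of hits share both of its boundaries to within 400 bp and the interval is shorter than 10 kb. This interpretation of "400bp on boundary", the `(chrom, start, end)` tuple layout, and all function names are assumptions for illustration; the actual filter may work differently.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end); hypothetical layout


def count_similar_boundaries(intervals: List[Interval], tolerance_bp: int = 400) -> List[int]:
    """For each interval, count intervals (itself included) on the same chromosome
    whose start and end both lie within tolerance_bp of its own."""
    counts = []
    for chrom, start, end in intervals:
        n = sum(
            1
            for c, s, e in intervals
            if c == chrom
            and abs(s - start) <= tolerance_bp
            and abs(e - end) <= tolerance_bp
        )
        counts.append(n)
    return counts


def flag_potential_repeats(
    intervals: List[Interval],
    min_hits: int = 10,
    tolerance_bp: int = 400,
    max_length_bp: int = 10_000,
) -> List[Interval]:
    """Return intervals with more than min_hits boundary-sharing hits and length < max_length_bp."""
    counts = count_similar_boundaries(intervals, tolerance_bp)
    return [
        iv
        for iv, n in zip(intervals, counts)
        if n > min_hits and (iv[2] - iv[1]) < max_length_bp
    ]
```

The pairwise comparison is quadratic in the number of intervals, which is fine for a small demonstration but would need sorting or binning at the scale of 152,272 pairs.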

General analysis of WGAC length and identity distribution
1. Length distribution peaked at inter, with 92% of intra on chrUn.
2. Identity distribution peaked at 96%. Few are higher than 99%.

General analysis: NR distribution by chromosome. SD is high in chrUn.

General view showing all WGAC on all chromosomes. SDs are concentrated on the smaller supercontigs of chrUn.

Global image showing the inter- and intrachromosomal pairs of 5 kb and above 90% identity, without chrUn. Red indicates interchromosomal pairs and blue indicates intrachromosomal pairs.

Global image showing the inter- and intrachromosomal pairs of 10 kb and 90% identity, without chrUn. Red indicates interchromosomal pairs and blue indicates intrachromosomal pairs.

Global image showing the inter- and intrachromosomal WGAC pairs of 10 kb and 90% identity, with chrUn included. Red indicates interchromosomal pairs and blue indicates intrachromosomal pairs.

WSSD analysis
Downloaded the WGS reads (about 6 million).
Downloaded the finished stickleback BACs. These BACs are used to determine the threshold for WGS depth of coverage. For a 5 kb window, the average number of reads is 78, with SD 27. The threshold for a 5 kb window is 125; for a 1 kb window it is 25 (average + 3 SD).
Repeat masking of the stickleback genome: I used the standard library, ab_initio_lib.txt, and supplemental_lib.txt. In addition, I added the potential repeats detected in the WGAC process, i.e. regions showing more than 20 hit pairs in the same region.
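The "average + 3 SD" rule can be sketched as follows. The function names and the list-of-counts input are assumptions for this example; how the pipeline actually tabulates read counts per window is not described in the slides.

```python
from statistics import mean, stdev
from typing import List, Sequence


def threshold_from_summary(average: float, sd: float, n_sd: float = 3.0) -> float:
    """Depth threshold as average + n_sd standard deviations."""
    return average + n_sd * sd


def threshold_from_counts(read_counts: Sequence[float], n_sd: float = 3.0) -> float:
    """Same rule, with average and sample SD estimated from per-window read counts."""
    return threshold_from_summary(mean(read_counts), stdev(read_counts), n_sd)


def windows_above(read_counts: Sequence[float], threshold: float) -> List[int]:
    """Indices of windows whose read count exceeds the threshold."""
    return [i for i, count in enumerate(read_counts) if count > threshold]
```

With the 5 kb figures on this slide (average 78, SD 27), `threshold_from_summary(78, 27)` gives 159, whereas the slide states a threshold of 125. The source does not say how the two are reconciled, so this sketch should be read only as an illustration of the stated rule.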

WSSD result
http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/wssd/
A total of 729 regions covering 22,324,144 bp were found in wssdGE10K_nogap.tab (which has a 10 kb cutoff); 251 of them are on chrUn. wssd.tab contains 850 regions with 23,116,317 bp in total. It has 121 more regions and less than 1 Mb of extra sequence compared with the 10 kb hits.
A summary table of the WGAC intersect with WSSD is at http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/wgacCMPwssd.xls

Union of WSSD and WGAC; gene intersect with Seg Dups
First, a non-redundant union of WGAC and WSSD was generated (AllDup.tab). The gene list was then intersected with AllDup to identify genes that overlap duplicated space in the genome. 3135 Ensembl genes were identified. Both data sets are at http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/
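A minimal sketch of the two steps described above, building a non-redundant union of intervals and then listing the genes that overlap it, might look like this. It assumes half-open `(chrom, start, end)` coordinates and simple in-memory lists; the function names are illustrative and the real AllDup.tab format may differ.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]   # (chrom, start, end), half-open; assumed layout
Gene = Tuple[str, str, int, int]  # (name, chrom, start, end); assumed layout


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or abutting intervals into a non-redundant set."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def genes_overlapping(genes: List[Gene], dups: List[Interval]) -> List[str]:
    """Names of genes that overlap at least one duplicated interval."""
    hits = []
    for name, chrom, g_start, g_end in genes:
        if any(c == chrom and g_start < d_end and d_start < g_end
               for c, d_start, d_end in dups):
            hits.append(name)
    return hits
```

For example, `merge_intervals(wgac + wssd)` on two interval lists yields the union set, which can then be passed as `dups` to `genes_overlapping`. The linear scan per gene keeps the sketch short; an interval tree or sorted sweep would be more appropriate for roughly 29,000 genes.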

The general view of WGAC and WSSD on chromosomes
WSSD: black, above the chromosome line
WGAC 5k 94%: black, below the chromosome line
WGAC 10k: brown, below the chromosome line

Summary table 1
                 total       chrN        chrUn      No. nr interval   file
wssd (bp)        22324144    13574716    8749428    729               wssdGE10K_nogap.tab
wgac (bp)        40573574    21017679    19555895   7387              data/wgac/NRspace
AllDup (bp)      45608440    24390195    21218245   5934              data/allDup.tab
Genome (bp)      463354448   400804237   62550211
repeats ? (bp)   7481640     1741266     5740374                      data/repeathitMerge

The intersect between WSSD and WGAC
chrom     size       allWGAC   gt94WGAC_ge10K  WSSD      Shared    gt94WGAC_ge10K_WGAConly  WSSDonly  <=94%WGAC  <=94%WGAC+shared
chrI      28185914   1275120   315356          709840    195481    119875                   514359    193013     388494
chrII     23295652   713095    144114          234515    77007     67107                    157508    72943      149950
chrIII    16798506   1041842   435522          821969    389684    45838                    432285    108184     497868
chrIV     32632948   2093860   476191          1589484   379805    96386                    1209679   306309     686114
chrIX     20249479   1389579   610360          1004524   490770    119590                   513754    100388     591158
chrUn     62550211   19483869  10809499        8749428   4789618   6019881                  3959810   630260     5419878
chrV      12251397   591969    178851          393826    166869    11982                    226957    50079      216948
chrVI     17083675   621495    177632          245111    128778    48854                    116333    87014      215792
chrVII    27937443   1480355   521853          861056    469264    52589                    391792    175038     644302
chrVIII   19368704   824600    245027          274801    119937    125090                   154864    62353      182290
chrX      15657440   1274186   735451          1039477   611552    123899                   427925    79609      691161
chrXI     16706052   1336848   499828          1152246   474664    25164                    677582    149606     624270
chrXII    18401067   1002589   455231          721761    436954    18277                    284807    91092      528046
chrXIII   20083130   1001618   315089          508170    174381    140708                   333789    93504      267885
chrXIV    15246461   472042    95357           221539    60401     34956                    161138    53894      114295
chrXIX    20240660   918086    240950          635973    212904    28046                    423069    83718      296622
chrXV     16198764   578995    173468          303978    101413    72055                    202565    64444      165857
chrXVI    18115788   1216252   462619          810223    375762    86857                    434461    165325     541087
chrXVII   14603141   278408    54942           45597     24201     30741                    21396     21509      45710
chrXVIII  16282716   827757    320585          572537    273969    46616                    298568    78890      352859
chrXX     19732071   916472    277129          556990    193012    84117                    363978    147507     340519
chrXXI    11717487   1062424   717376          871099    665531    51845                    205568    58839      724370
total     463354448  40417128  18278097        22324144  10811957  7466140                  11512187  2873518    13685475
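The derived columns in this table appear to follow simple arithmetic relations: the WGAC-only column equals gt94WGAC_ge10K minus Shared, WSSDonly equals WSSD minus Shared, and the last column equals <=94%WGAC plus Shared. The sketch below checks these relations on a few rows copied from the table; the tuple layout and names are assumptions for this example, and the relations are inferred from the numbers rather than stated in the slides.

```python
# Selected rows from the table, in the order:
# (gt94WGAC_ge10K, WSSD, Shared, WGAConly, WSSDonly, le94WGAC, le94WGAC_plus_shared)
ROWS = {
    "chrI":   (315356, 709840, 195481, 119875, 514359, 193013, 388494),
    "chrUn":  (10809499, 8749428, 4789618, 6019881, 3959810, 630260, 5419878),
    "chrXXI": (717376, 871099, 665531, 51845, 205568, 58839, 724370),
    "total":  (18278097, 22324144, 10811957, 7466140, 11512187, 2873518, 13685475),
}


def check_row(row):
    """Return True if the derived columns match the inferred relations."""
    gt94, wssd, shared, wgac_only, wssd_only, le94, le94_plus_shared = row
    return (
        gt94 - shared == wgac_only
        and wssd - shared == wssd_only
        and le94 + shared == le94_plus_shared
    )


for chrom, row in ROWS.items():
    print(chrom, "consistent" if check_row(row) else "INCONSISTENT")
```

A check of this kind is also a convenient way to confirm that column boundaries were recovered correctly when the table is re-read from flattened text.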

Summary
Stickleback Seg Dups have been detected using two independent pipelines, WGAC and WSSD. Since each pipeline is based on its own mechanism, we expect the majority of the intervals to be consistent, with some variation.
From the results of the two pipelines, two sets of genomic intervals were generated for Seg Dups.
– The first set consists of the genomic intervals detected by both WGAC and WSSD, i.e. the intersect between WGAC and WSSD. This set represents the most conservative estimate of Seg Dups in the genome. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/wssd_wgac_intersect
– The second set is the union of the WGAC and WSSD intervals (AllDup.tab), which represents the largest estimate of Seg Dups in the genome. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/allDup.tab
– A list of genes intersecting each set was also generated. With AllDup (the union of WGAC and WSSD), there are 3153 genes in total. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/gene_in_alldup With the intersect of WGAC and WSSD, there are 1267 genes in total. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/gene_in_wssd_wgac_intersect
A list of intervals with the potential to be repeats was also generated. These are regions with a high frequency of hits within the defined boundary (>10 hits; 60% of total WGAC pairs and 1/5 of WGAC NR intervals). http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/repeathitMerge
ChrUn contigs contribute a great deal to the total SD in both WGAC and WSSD.
The identity distribution analysis shows that the identity of the pairs is less than 99%, suggesting that they may contain true SDs that are hard to assemble. How many of them are true SDs remains to be determined.
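The conservative set described above is an interval intersection. A minimal two-pointer sketch is given below, assuming each input is already sorted and non-overlapping, with half-open `(chrom, start, end)` coordinates; the function name and tuple layout are illustrative and not taken from the pipeline.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open; assumed layout


def intersect_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, non-overlapping interval lists with a two-pointer sweep."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a != chrom_b:
            # Advance whichever list is behind in chromosome sort order.
            if chrom_a < chrom_b:
                i += 1
            else:
                j += 1
            continue
        lo, hi = max(start_a, start_b), min(end_a, end_b)
        if lo < hi:
            out.append((chrom_a, lo, hi))
        # Advance the interval that ends first; the other may still overlap later ones.
        if end_a < end_b:
            i += 1
        else:
            j += 1
    return out


def total_bp(intervals: List[Interval]) -> int:
    """Total number of bases covered by a non-overlapping interval list."""
    return sum(end - start for _, start, end in intervals)
```

Both inputs must use the same chromosome ordering (plain string order here) for the sweep to be correct.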
