CTCF Peaks.


All CTCF Peaks
15 datasets
Cells: CH12, ER4, G1E, HPC7, LSK, NEU, TCD4, TCD8, ERY, MON
123,387 merged peaks
HPC7 has a more stringent cutoff (.01) for peaks

Peak length

Separating peaks by number of datasets

Intersect peak categories with ccREs
Split peaks into low (1-4), mid (5-12) and high (13-15) by number of datasets and find overlaps with ccREs
  Low (1-4):     26,550 / 83,816 = .32
  Mid (5-12):    18,384 / 26,416 = .70
  High (13-15):  12,833 / 13,154 = .98
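The split and overlap fractions on this slide could be reproduced along these lines. This is a minimal sketch, not the pipeline actually used: the tuple layouts, the function names, and the half-open interval convention are all assumptions, and the brute-force overlap test stands in for whatever interval tool produced the real counts.

```python
def category(n_datasets):
    """Bin a peak by the number of datasets it was called in (cutoffs from the slide)."""
    if 1 <= n_datasets <= 4:
        return "low"
    if 5 <= n_datasets <= 12:
        return "mid"
    if 13 <= n_datasets <= 15:
        return "high"
    raise ValueError(f"expected 1-15 datasets, got {n_datasets}")


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def fraction_overlapping(peaks, ccres):
    """Fraction of peaks per category that overlap any ccRE.

    peaks: list of (chrom, start, end, n_datasets)   # hypothetical layout
    ccres: list of (chrom, start, end)               # hypothetical layout
    """
    totals = {"low": 0, "mid": 0, "high": 0}
    hits = {"low": 0, "mid": 0, "high": 0}
    for chrom, start, end, n_datasets in peaks:
        cat = category(n_datasets)
        totals[cat] += 1
        # Brute-force scan; fine for a sketch, too slow for >100k peaks.
        if any(c == chrom and overlaps(start, end, s, e) for c, s, e in ccres):
            hits[cat] += 1
    return {cat: (hits[cat] / totals[cat] if totals[cat] else None) for cat in totals}
```

A category with no peaks returns None rather than a fraction, so an empty bin is not mistaken for a zero overlap rate.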

Dominant IDEAS states for called peaks
Peak found in:
  Low (1-4 datasets):      139,177 states
  Medium (5-12 datasets):  209,074 states
  High (13-15 datasets):   187,955 states
Sorted by state count in the High grouping
One state per peak for each dataset in which the peak was called
States with CTCF signal: 7 C, 13 CN, 24 PENCA, 25 T C, 26 CNE T
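The slide does not say how the dominant state was chosen. One plausible reading, sketched below, is the state that covers the most bases of the peak in a given cell type. The segment layout (start, end, state), the half-open coordinates, and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def dominant_state(peak_start, peak_end, segments):
    """Return the state covering the most bases of the peak, or None if nothing overlaps.

    segments: list of (start, end, state) for one cell type, half-open coordinates
              (hypothetical layout; IDEAS itself reports fixed-size windows).
    """
    covered = Counter()
    for seg_start, seg_end, state in segments:
        overlap = min(seg_end, peak_end) - max(seg_start, peak_start)
        if overlap > 0:
            covered[state] += overlap
    if not covered:
        return None
    # most_common breaks ties by first-seen order; that choice is arbitrary here.
    return covered.most_common(1)[0][0]
```

For a peak spanning 1000-1600 with 200 bases of state 0 and 400 bases of state 7, this returns 7.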

All states in all peaks (except 0's when other states present)
All states within the peak set are counted; state 0 is counted only when it is the only state
Peak found in:
  Low (1-4 datasets):      1,216,173 states
  Medium (5-12 datasets):  566,212 states
  High (13-15 datasets):   405,088 states
CTCF states are often present in peaks but are not always the dominant state in the peak, especially state 24 (PENCA).
This count also uses states from cell types in which the region is not called as a peak.
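The counting rule for state 0 can be written down directly. The sketch below assumes the same hypothetical (start, end, state) segment layout with half-open coordinates; it collects every state touching the peak and drops state 0 whenever any other state is present.

```python
def states_in_peak(peak_start, peak_end, segments):
    """Set of states overlapping the peak; state 0 is kept only if it is the only state.

    segments: list of (start, end, state) for one cell type (hypothetical layout).
    """
    states = {
        state
        for seg_start, seg_end, state in segments
        if min(seg_end, peak_end) > max(seg_start, peak_start)
    }
    if len(states) > 1:
        states.discard(0)  # quiescent state is ignored when anything else is present
    return states
```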

Low category peaks in MultiView track

High category peaks in MultiView track

TAD boundaries
Number of peaks overlapping end base of TADs
  Low   185 / 83,816 = .002
  Mid   141 / 26,416 = .005
  High  139 / 13,154 = .011
Number of peaks overlapping end 20kb of TADs
  Low   6,128 / 83,816 = .07
  Mid   2,603 / 26,416 = .10
  High  1,719 / 13,154 = .13
Boundaries computed as start + 1 and end - 1 (single base), and as start - 20,000 to start and end - 20,000 to end (20kb).
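The boundary definitions on this slide translate into a few lines of code. This is a sketch under stated assumptions: coordinates are treated as closed intervals on a single chromosome, the window is clamped at 0 (the slide does not say what happens near the chromosome start), and all function names are hypothetical.

```python
def tad_boundary_regions(tad_start, tad_end, flank=20_000):
    """Boundary regions as described on the slide.

    Single-base boundaries: start + 1 and end - 1.
    Windows: start - flank to start, and end - flank to end.
    """
    single_base = [(tad_start + 1, tad_start + 1), (tad_end - 1, tad_end - 1)]
    windows = [(max(0, tad_start - flank), tad_start), (tad_end - flank, tad_end)]
    return single_base, windows


def overlaps_closed(a_start, a_end, b_start, b_end):
    """True if two closed intervals [start, end] share at least one base."""
    return a_start <= b_end and b_start <= a_end


def count_peaks_at_boundaries(peaks, tads, flank=20_000):
    """Count peaks overlapping any single-base boundary and any boundary window.

    peaks, tads: lists of (start, end) on the same chromosome (assumed layout).
    """
    base_hits = window_hits = 0
    for p_start, p_end in peaks:
        hit_base = hit_window = False
        for t_start, t_end in tads:
            bases, windows = tad_boundary_regions(t_start, t_end, flank)
            hit_base = hit_base or any(overlaps_closed(p_start, p_end, s, e) for s, e in bases)
            hit_window = hit_window or any(overlaps_closed(p_start, p_end, s, e) for s, e in windows)
        base_hits += hit_base
        window_hits += hit_window
    return base_hits, window_hits
```

Dividing each count by the number of peaks in a category gives fractions of the kind listed above.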

Motifs from SeqUnwinder

Motifs from SeqUnwinder

Motifs from SeqUnwinder
Ongoing run adding in ccREs

Other checks
ORegAnno regulatory regions: 17,188
  Intersect high peaks: 509 CTCF peaks
  Intersect low peaks: 1,882 CTCF peaks
ORegAnno transcription factor binding sites: 397,782
  Intersect high peaks: 5,926 CTCF peaks
  Intersect low peaks: 20,846 CTCF peaks
RefSeq functional elements: 1,968
  Intersect high peaks: 71 RefSeq elements
  Intersect low peaks: 294 RefSeq elements
ORegAnno sites are tested

Cell specific peaks: HPC7 (13,141 peaks)
Used all peaks as background.
HPC7 has only 1 replicate, and these peaks were found only in this dataset. It has one of the largest peak numbers.
The number of cell specific peaks mostly reflects the total number of peaks called. (G1E has only 1!)

Cell specific peaks: ERY (1,406 replicated peaks)
I required these to be in both replicates.

Cell specific peaks MON (1,307 peaks)

CTCF peaks found in all 15 datasets (7,116 peaks)
I raised the FDR cutoff for these from .05 to .1. Without this there were no terms.

Peak length by category
                 Average length   Median
  Low            580              493
  Mid            975              819
  High           1364             1192
  High unmerged  735              689

Worst case: 63 peaks merged into one.
9 merged peaks contain 40 or more peaks; 84 contain 30 or more.
These would have been in the mid category without the single large peak merging them.
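The bridging effect described here falls out of a standard interval merge. The sketch below is not the merge actually used for these slides; it assumes peaks on one chromosome given as (start, end, dataset), and it merges peaks that overlap or touch. One long peak from a single dataset is enough to chain otherwise separate peaks into one merged region.

```python
def merge_peaks(peaks):
    """Merge overlapping or touching peaks and track what went into each merged region.

    peaks: list of (start, end, dataset) on one chromosome (hypothetical layout).
    Returns a list of dicts with start, end, n_peaks and the set of contributing datasets.
    """
    merged = []
    for start, end, dataset in sorted(peaks):
        if merged and start <= merged[-1]["end"]:
            region = merged[-1]
            region["end"] = max(region["end"], end)
            region["n_peaks"] += 1
            region["datasets"].add(dataset)
        else:
            merged.append({"start": start, "end": end, "n_peaks": 1, "datasets": {dataset}})
    return merged


# Dataset "B" has one long peak that bridges two separate peaks from dataset "A".
example = [(100, 300, "A"), (250, 900, "B"), (850, 1000, "A"), (2000, 2100, "C")]
regions = merge_peaks(example)
```

In the example, the first merged region spans 100-1000 and absorbs three input peaks from two datasets, which is the same pattern as the worst case on the slide at a much smaller scale.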

Same region with IDEAS
Should I use a window slid over the merged peaks, rather than the peaks themselves?
If so, what size? Homer peaks: 150, IDEAS windows: 200, ?
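If a sliding window were used, a merged peak could be tiled as sketched below. The window and step sizes are open questions on the slide; the default of 200 simply mirrors the IDEAS window size mentioned there and is not a recommendation. Coordinates are assumed half-open, and the last tile is truncated at the peak end.

```python
def tile_peak(start, end, window=200, step=200):
    """Split a merged peak [start, end) into windows of at most `window` bases.

    step == window gives non-overlapping tiles; step < window gives a sliding window.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    tiles = []
    pos = start
    while pos < end:
        tiles.append((pos, min(pos + window, end)))
        pos += step
    return tiles
```

Each tile could then be assigned its own dataset count or IDEAS state, so a long merged peak no longer behaves as a single unit.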

THE END

Second largest peak in high category

Slides from before filtering HPC7 peaks and looking for non-zero states

All CTCF Peaks
15 datasets
Cells: CH12, ER4, G1E, HPC7, LSK, NEU, TCD4, TCD8, ERY, MON
141,678 merged peaks

Barplot
About half of the peaks found in a single dataset come from HPC7.
The majority of the peaks found in only 2 datasets are in the datasets with the highest numbers of peaks.

CTCF peaks intersected with ccREs
Split peaks into low (2-4), mid (5-12) and high (13-15) by number of datasets and find overlaps with ccREs
  Low (2-4):     14,315 / 38,794 = .37
  Mid (5-12):    18,600 / 26,880 = .69
  High (13-15):  12,836 / 13,158 = .98
CTCF peaks found in many cell types are very likely to also be ccREs.

Patterns in replicated peaks (47,961) are dominated by peak counts in the cell types
The most common pattern is all cell types.
If only 1 is missing, it is most commonly the cells with the fewest peaks.
If only 2 are missing, it is most often MON, then G1E, and CH12.
The most common pattern with 4 missing is MON + G1E.
The most common patterns with fewer than half the cells include the cells with the greatest numbers of peaks: both ERY, HPC7, TCD4, TCD8.
Looking at peaks found in both ERY and ER4, the most common patterns are all or nearly all cell types.
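Pattern tallies like these can be produced by counting, for each peak, which cell types are missing. The sketch below assumes each peak is represented by the list of cell types it was called in; the names and layout are hypothetical, and the toy input is not real data.

```python
from collections import Counter


def missing_pattern_counts(peak_cells, all_cells):
    """Count how often each set of missing cell types occurs across peaks.

    peak_cells: one list of cell-type names per peak (hypothetical layout).
    all_cells:  every cell type in the study.
    Keys are sorted tuples of the missing cell types; () means found in all.
    """
    counts = Counter()
    for cells in peak_cells:
        missing = tuple(sorted(set(all_cells) - set(cells)))
        counts[missing] += 1
    return counts


# Toy input for illustration only.
toy_counts = missing_pattern_counts(
    [["ERY", "MON", "G1E"], ["ERY", "G1E"], ["ERY", "G1E"], ["ERY"]],
    ["ERY", "MON", "G1E"],
)
```

Calling `toy_counts.most_common()` lists the patterns from most to least frequent, which is the ordering the slide summarizes in words.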

CTCF peaks and IDEAS states
Out of 141,678 peaks:
  50,641 are in state 0 in nearly all used cell types (quiescent)
  4,653 are in state 1 in nearly all used cell types (transcribed)
Other frequently seen and conserved states are 7 (CTCF), 2 (heterochromatin), 10 (active promoter-like), 15 (promoter-like)
Leaving 81,421 peaks with mixed states

Counts of peaks in conserved states

State 7 peaks
State 7 (CTCF) peaks are called in most datasets
Found 1,714 peaks
  min 3, max 15
  average 12.4
  median 13
  first quartile 11
  third quartile 15

State 0 peaks
State 0 peaks are mostly found in few datasets
Found 50,641 peaks
  min 1, max 15
  average 3
  median 1
  first quartile 1
  third quartile 3
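Summaries like the ones for state 7 and state 0 peaks can be computed from the per-peak dataset counts with the standard library. The quartile method used for the slides is not stated, so the "inclusive" method below is an assumption, and the function name is hypothetical.

```python
import statistics


def summarize(dataset_counts):
    """Min, max, average, median and quartiles of the number of datasets per peak.

    dataset_counts: one integer per peak (needs at least two values).
    """
    q1, median, q3 = statistics.quantiles(dataset_counts, n=4, method="inclusive")
    return {
        "min": min(dataset_counts),
        "max": max(dataset_counts),
        "average": statistics.mean(dataset_counts),
        "median": median,
        "first_quartile": q1,
        "third_quartile": q3,
    }
```

Different quartile conventions can shift the first and third quartile slightly on small inputs, so exact agreement with the slide values should not be expected without knowing the method used.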