Transcriptional and post-transcriptional regulation of gene expression
[Figure labels: DNA, mRNA, protein; Pol II, activation, repression; 3’UTR, translation, localization, stability]
Where does each transcription factor bind in the genome, in each cell type, at a given time? Near which genes? What is the cis-regulatory code of each factor? Does it require any co-factors? [Figure labels: DNA, activation, repression]
ChIP-seq Genome Analyzer II (Solexa) Transcription factor of interest Antibody
Control: input DNA Genome Analyzer II (Solexa)
ChIP DNA fragment (both strands shown), average length ~250 bp, of which 25-40 bp are sequenced:
ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGAACTGATTAGTGAATTC
TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTTGACTAATCACTTAAG
BCL6 ChIP-seq
- Lymphoma cell line (OCI-Ly1)
- Solexa/Illumina: 6 lanes for ChIP, 1 for input DNA, 1 for QC
- 36 nt long sequences
- 32 million reads
- Aligned/mapped to hg18 with Eland
- Melnick lab at WCMC
Read mapping with Eland
Reference Human Genome (hg18): AAAAATTCTCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATCTTACAAGATGTAAATATACCCAAGATG
Solexa read (exact match): AAAATACGCGTATTCTCCCAAAACAATATC
Solexa read (1 mismatch): AAAATACGCCTATTCTCCCAAAACAATATC
Solexa read (2 mismatches): AAAATACGCCTATTCTCCCATAACAATATC
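The three reads above differ from the reference at 0, 1 and 2 positions. A minimal Python sketch of that idea follows; it is a brute-force scan for illustration only, not Eland's actual algorithm. The function names (`count_mismatches`, `best_hit`) are hypothetical, and the reference string assumes the trailing "G" on the slide is a line-wrap of the same sequence.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def best_hit(read, reference, max_mismatches=2):
    """Scan every offset of the reference (forward strand only) and return
    (position, mismatches) for the best alignment, or None if no alignment
    has at most max_mismatches mismatches."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mm = count_mismatches(read, reference[pos:pos + len(read)])
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best


# Sequences copied from the slides above.
REFERENCE = ("AAAAATTCTCCCAAAACAAAAAAATACGCGTATTCTCCCAAAACAATATC"
             "TTACAAGATGTAAATATACCCAAGATG")
READS = [
    "AAAATACGCGTATTCTCCCAAAACAATATC",
    "AAAATACGCCTATTCTCCCAAAACAATATC",
    "AAAATACGCCTATTCTCCCATAACAATATC",
]

for read in READS:
    print(read, best_hit(read, REFERENCE))
```

A real aligner indexes the genome rather than scanning it, and also checks the reverse complement of each read.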
Reads can map to multiple locations/chromosomes Solexa Read 1 Solexa Read 2 Reference Human Genome (hg18)
Reads map to one strand or the other Solexa Read 1 Solexa Read 2 hg18
>HWI-EAS83_30UCEAAXX:1:2:915:1011 AGGTCACAAAACAAGTCCTAACAAATTTAAGAGTAT U011362chr8.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:826:1245 GTCAGAAAAATCCTTTTTATTATATAAACAATACAT U2001chr5.fa FDD15G20G
>HWI-EAS83_30UCEAAXX:1:2:900:945 GTCATCAAACTCCAAGGATTCTGTTTTCAACATACT U0110chr18.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:1037:1118 GAAAGTGATTAGCAGATTGTCATTTAATAATTGTCT U2001chr1.fa FDD18G28G
>HWI-EAS83_30UCEAAXX:1:2:898:874 GATAAATTTTTTCCTACAATCTTAAATTATTACACA U1010chr3.fa RDD10C
>HWI-EAS83_30UCEAAXX:1:2:918:928 AAAAATTAAACAATTCTAAAAATATTTTTATCTTAA U2001chr2.fa RDD18C31G
>HWI-EAS83_30UCEAAXX:1:2:1324:4 GCACATGTCATACTCTTTCTAGCTCTCTTATTTTTC U0100chr8.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:899:1015 AAATTAATGTAAAAAATAGGATACTGAATTGTGATA U1010chr10.fa FDD30G
>HWI-EAS83_30UCEAAXX:1:2:909:926 GTAGTTAACAATAATTTATTTTATACTTCAAAATTC U10117chrX.fa RDD7A
>HWI-EAS83_30UCEAAXX:1:2:701:1702 GTCAGAATTAATTAATCAAAACACCAAATGTACTTC U0100chr12.fa FDD
>HWI-EAS83_30UCEAAXX:1:2:996:1003 ATTTTGACTTTATTATTTTTTCTTCAATGTTTTTAA NM000
>HWI-EAS83_30UCEAAXX:1:2:884:1090 GAAAGTACATCAAATACATATTATATACTTTACATA R2002
>HWI-EAS83_30UCEAAXX:1:2:911:937 AATCCATATACATTTCTTTTTAATCATTTCCTCTTT U1010chr11.fa FDD20G
>HWI-EAS83_30UCEAAXX:1:2:1517:330 GTGAGTTTCTTAATCCTGAGTTCTAATTTTATTTCA R
>HWI-EAS83_30UCEAAXX:1:2:904:1031 ACATTTTATAAATTTTTAATTTCATTTTAATTTATA NM000
>HWI-EAS83_30UCEAAXX:1:2:1291:1469 GTTTTTAAAATCAACACTTTTATTATAGAAGTAGCA U0101chr12.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:1697:828 GTACTGATGTAAACTTGGTAAAAACATTGACATAAA U0100chr14.fa FDD
>HWI-EAS83_30UCEAAXX:1:2:1415:583 GAAGAAAATGACTATGTCAAAATATTATCTCTCAAT U0100chr5.fa FDD
>HWI-EAS83_30UCEAAXX:1:2:1561:1653 GTTTTACTGATTTTCTTACTTACTAAACTACCTGTT U0100chr7.fa FDD
>HWI-EAS83_30UCEAAXX:1:2:1579:943 AATGATACGGCGACCACCGACAGGTTCAGAGTTCTA NM000
>HWI-EAS83_30UCEAAXX:1:2:1705:268 GAGAATTATTCAGAAGTCAAATCTGTGCTTAGTTTA U2001chr5.fa RDD3G7C
>HWI-EAS83_30UCEAAXX:1:2:1489:318 GTATGTATCATATATATTTATGTATCATATATATTT R1032
>HWI-EAS83_30UCEAAXX:1:2:1003:1113 GATTGCTCCATTATTTGTTAAAAACATAGTAAAATA NM000
>HWI-EAS83_30UCEAAXX:1:2:895:1072 ATGAGATCAGTACTTCAAAGAGATATCTGCACTCCC U0119chr12.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:853:1178 GTTAGTCCCAATATTCCATTAATCCCAATAAATATA U2001chr6.fa FDD15G19G
>HWI-EAS83_30UCEAAXX:1:2:1432:972 GAGATAATAATAGCAGTTATGGCATCGAGATAATTT U0100chr2.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:1718:341 GTAGAGGGCACACATCACAAACAAGTTTCTGAGAAT R2003
>HWI-EAS83_30UCEAAXX:1:2:1171:302 GAATATCCACTTGCAGACTTTACAAACAAATTTTTT R2004
>HWI-EAS83_30UCEAAXX:1:2:1055:1126 GGCAGATGAAACTTCTATACACTATATTTTAGCCAG U0100chr13.fa FDD
>HWI-EAS83_30UCEAAXX:1:2:971:1371 GAAAGAAAAACTATTGAAAAAATAGTTACTTTCCAA U0100chr1.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:1774:614 GTGTAGATGATATCGAGGGCATTAGAAGTAAATAGC U0100chr5.fa FDD
>HWI-EAS83_30UCEAAXX:1:2:1207:808 GAGAGGAAATAATAAAGATAAAAGTAGAAAAAGTGA U0100chr1.fa FDD
>HWI-EAS83_30UCEAAXX:1:2:1680:815 GATAATTATGTTGTTGTAATTATTGTTTGTTTTTTT U0100chr15.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:1688:260 GTTGACAATCCAGCTGTCATAGAAACTGACTATTTT U0100chr12.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:1051:916 AAAAATTCTCCCAAAACAACAAGATGTAAATATACC U0100chr3.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:1771:308 GTTCTTACACTGATATGAAGAAATACCTGAGACTGG U01267chr2.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:911:917 GAGAAACACACATATTTTTGTAAGTGCCATCACATC U1010chr7.fa RDD18C
>HWI-EAS83_30UCEAAXX:1:2:1105:348 GTATTATCTAACACACAAGATGATGTTTGTTTTTAT NM000
>HWI-EAS83_30UCEAAXX:1:2:1048:857 GAGTGTAGAAAATTTTCTGCCCTAAAATATTTGTTA U1010chr6.fa FDD13G
>HWI-EAS83_30UCEAAXX:1:2:743:1729 GTATCCTAAAGTGTATCTTATGTTTTTTCATCTTCT U1010chr12.fa RDD9C
>HWI-EAS83_30UCEAAXX:1:2:1287:64 AATAAAACAAATTCCAATGGCTTAGATTCTACTTAA U2001chr10.fa RDD15C20C
>HWI-EAS83_30UCEAAXX:1:2:940:1059 AAATGGTCATACTTCCCAAAGCGATCTACAGATTCA U10129chr3.fa RDD19C
>HWI-EAS83_30UCEAAXX:1:2:898:1061 ACATTTCCACATTTCTGTGGAAGCCTCACAATCATT R2002
>HWI-EAS83_30UCEAAXX:1:2:913:932 ATTAATCAACAGCAACATTAATCAACTGAATCAACA U0100chr2.fa RDD
>HWI-EAS83_30UCEAAXX:1:2:43:1647 GAATAAATAATCAAAACATATAATACATTTTTTTAT U1010chr5.fa FDD32G
>HWI-EAS83_30UCEAAXX:1:2:1412:731 ATATACACATATATATACATATATATATACACATAT R
>HWI-EAS83_30UCEAAXX:1:2:1389:1196 GAGAAGGAAATGTGTTTTCTAAGTTTCTTTATCTTC U1010chr4.fa FDD32G
>HWI-EAS83_30UCEAAXX:1:2:1264:1479 GTGTAGGAAAGAAAAAAGGAGGTTGTGTAGAAAAGA U0100chr2.fa FDD
>HWI-EAS83_30UCEAAXX:1:2:38:890 TTTATTTAAATCTTTTAAAAANTTTTTTCCAACAAA NM000
>HWI-EAS83_30UCEAAXX:1:2:1341:1065 GATACATATACACAAAGTAAAACTATTCAGCCTCTA U0100chr17.fa FDD
>HWI-EAS83_30UCEAAXX:1:2:1132:929 GAGTTGTATTAATCTTAAATTGATAATTTACCATAT U1010chr10.fa FDD24G
>HWI-EAS83_30UCEAAXX:1:2:1758:275 GCATTTTAACAAAATCACCATATCTGGGTAACCATT U1010chr21.fa RDD18C
>HWI-EAS83_30UCEAAXX:1:2:914:1000 GAAAGCACTTTATAATAAAACAACATTGGAGCACCT U1010chr8.fa FDD16G
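Number of reads per Eland type (% of reads): U0, U1, U2 (unique match with 0, 1 or 2 mismatches); R0, R1, R2 (repeated match with 0, 1 or 2 mismatches); NM (no match); QC (failed quality control)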
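A per-type tally like the one on this slide can be sketched in a few lines of Python. This assumes a tab-separated Eland file in which the third column holds the match code (U0, U1, U2, R0, R1, R2, NM, QC). The function names and the toy records are made up for illustration and are not taken from the data set above.

```python
from collections import Counter


def count_eland_types(lines):
    """Tally the match code (assumed to be the third tab-separated field)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def as_percentages(counts):
    """Convert raw counts to percentages of all reads."""
    total = sum(counts.values())
    return {code: 100.0 * n / total for code, n in counts.items()}


# Toy records (hypothetical names, sequences and positions).
toy_lines = [
    ">read1\tACGTACGT\tU0\t1\t0\t0\tchr1.fa\t100\tF\tDD",
    ">read2\tTTGTACGA\tU0\t1\t0\t0\tchr2.fa\t250\tR\tDD",
    ">read3\tGGGTACGT\tU1\t0\t1\t0\tchr3.fa\t300\tF\tDD\t3C",
    ">read4\tACGTNNNN\tNM\t0\t0\t0",
]

counts = count_eland_types(toy_lines)
print(as_percentages(counts))
```

For a real file, pass an open file handle instead of the list, for example `count_eland_types(open(path))` with your own `path`.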
Peak detection
- Calculate the read count at each position (bp) in the genome
- Determine if the read count is greater than expected
Peak detection
We need to correct for input DNA reads (control):
- non-uniformly distributed (form peaks too)
- vastly different numbers of reads between ChIP and input
Peak detection using ChIPseeqer
Expected read count = total number of reads * extended fragment length / chr length
[Figure: read count along the genome, compared with the expected read count]
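The formula above can be paired with a per-position read count as in the following Python sketch. It assumes reads are given as (start, end, strand) with 0-based, half-open coordinates, and that each read is extended to the fragment length in its own direction (from the start on the + strand, backwards from the end on the - strand). That extension rule is a common convention and an assumption here, not a description of ChIPseeqer's internals. The function names are hypothetical.

```python
def extended_coverage(reads, frag_len, chr_len):
    """Per-bp read count after extending every read to frag_len."""
    diff = [0] * (chr_len + 1)          # difference array
    for start, end, strand in reads:
        if strand == "+":
            lo, hi = start, start + frag_len
        else:
            lo, hi = end - frag_len, end
        lo, hi = max(lo, 0), min(hi, chr_len)   # clip to the chromosome
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    coverage, running = [], 0
    for delta in diff[:chr_len]:        # prefix sum gives the read count
        running += delta
        coverage.append(running)
    return coverage


def expected_read_count(n_reads, frag_len, chr_len):
    """Expected count per position if reads were spread uniformly."""
    return n_reads * frag_len / chr_len


# Toy example: two 36-nt reads on a 1000-bp "chromosome".
toy_reads = [(10, 46, "+"), (200, 236, "-")]
cov = extended_coverage(toy_reads, frag_len=250, chr_len=1000)
lam = expected_read_count(len(toy_reads), 250, 1000)
print(cov[100], lam)
```

The difference array keeps the cost linear in the number of reads plus the chromosome length, instead of touching 250 positions per read.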
Is the observed read count at a given genomic position greater than expected?
x = observed read count, λ = expected read count
The Poisson distribution [Figure: frequency vs. read count]
Is the observed read count at a given genomic position greater than expected?
x = 10 reads (observed), λ = 0.5 reads (expected)
The Poisson distribution gives: P(X>=10) = 1.7 x 10^-10, log10 P(X>=10) = -9.77, -log10 P(X>=10) = 9.77
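The numbers on this slide can be checked with a short Python sketch. It sums the Poisson upper tail directly, using P(X = k) = λ^k e^(-λ) / k!, because computing 1 - CDF loses precision for p-values this small. The function names are illustrative only.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_upper_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam)."""
    if x <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    if x <= lam:
        # Below the mean the tail is large; the complement is accurate enough.
        return max(0.0, 1.0 - sum(poisson_pmf(k, lam) for k in range(x)))
    term = poisson_pmf(x, lam)
    total = term
    k = x
    # Above the mean the terms shrink monotonically, so stop once negligible.
    while term > total * 1e-17:
        k += 1
        term *= lam / k
        total += term
    return total


p = poisson_upper_tail(10, 0.5)
print(p, -math.log10(p))
```

With x = 10 and λ = 0.5 this gives a p-value of about 1.7 x 10^-10, that is, -log10 p of about 9.77, matching the slide.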
[Figure: tracks along the genome showing read count, expected read count, input reads and -Log(p)]
Expected read count = total number of reads * extended frag len / chr len
[Figure: INPUT and ChIP tracks along genome positions (bp), each showing read count, expected read count and -Log(P) (P_i for input, P_c for ChIP); bottom track: Log(P_c) - Log(P_i), with threshold]
Normalized peak score (at each bp): R = -log10 [ P(X_ChIP) / P(X_input) ]
- Will detect peaks with high read counts in ChIP, low in input
- Also works when there is no input DNA!
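A minimal Python sketch of the normalized score follows. It assumes both p-values are Poisson upper tails, each computed with the expected count of its own library, and that a missing input sample is treated as P(X_input) = 1. Both are assumptions made for illustration, and all names are hypothetical.

```python
import math


def poisson_upper_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), via the complement of the lower sum.
    Adequate for a sketch; very small p-values need direct tail summation."""
    if x <= 0:
        return 1.0
    lower = sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
                for k in range(x))
    return max(1.0 - lower, 1e-300)     # floor avoids log10(0)


def normalized_score(chip_count, chip_expected,
                     input_count=None, input_expected=None):
    """R = -log10 P(X_ChIP) - ( -log10 P(X_input) )."""
    chip_term = -math.log10(poisson_upper_tail(chip_count, chip_expected))
    if input_count is None:
        return chip_term                # no input DNA: P(X_input) taken as 1
    input_term = -math.log10(poisson_upper_tail(input_count, input_expected))
    return chip_term - input_term


# High ChIP count, no reads in input: strong peak.
print(normalized_score(10, 0.5, 0, 0.5))
# Same enrichment in ChIP and input: score of 0, so no peak is called.
print(normalized_score(10, 0.5, 10, 0.5))
```

Subtracting the input term is what suppresses regions where the input DNA itself forms a peak.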
Non-mappable fraction of the genome
We enumerated all 30-mers, counted the number of occurrences of each, and calculated the non-unique fraction of the genome.
[Table: non-unique fraction per chromosome; recoverable entries: first chromosome listed (=12%), chrM 4628/, chrY (=74%)]
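The enumeration described on this slide can be sketched in Python on a toy sequence. The sketch counts every k-mer in memory and ignores the reverse strand, which is a simplification. A whole genome with k = 30 would need a more memory-efficient approach. The function name is hypothetical.

```python
from collections import Counter


def non_unique_fraction(seq, k):
    """Fraction of k-mer start positions whose k-mer occurs more than once."""
    n_positions = len(seq) - k + 1
    if n_positions <= 0:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n_positions))
    non_unique = sum(c for c in counts.values() if c > 1)
    return non_unique / n_positions


# Toy example with k = 4: "ACGT" occurs twice among the 7 k-mers.
toy_fraction = non_unique_fraction("ACGTACGTTT", 4)
print(toy_fraction)
```

A read that starts at a non-unique position cannot be placed unambiguously, which is why such positions are counted as non-mappable.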
Peak detection
- Determine all genomic regions with R>=15
- Merge peaks separated by less than 100bp
- Output all peaks with length >= 100bp
- Process 23M reads in <7mins
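The three steps listed above (threshold, merge, filter by length) can be sketched as follows. The sketch works on a plain Python list of per-bp scores for one chromosome. The function name and the default parameter values mirror the slide but are otherwise hypothetical, and this is not ChIPseeqer's implementation.

```python
def call_peaks(scores, threshold=15.0, max_gap=100, min_len=100):
    """Return (start, end) peaks, 0-based and half-open.

    1. find runs of positions with score >= threshold
    2. merge runs separated by fewer than max_gap bp
    3. keep merged regions that are at least min_len bp long
    """
    runs, start = [], None
    for pos, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(scores)))

    merged = []
    for run_start, run_end in runs:
        if merged and run_start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], run_end)
        else:
            merged.append((run_start, run_end))

    return [(s, e) for s, e in merged if e - s >= min_len]


# Toy score track: two nearby runs that merge, plus one short isolated run.
toy_scores = [0.0] * 1000
for lo, hi in [(100, 160), (200, 280), (600, 650)]:
    for i in range(lo, hi):
        toy_scores[i] = 20.0

peaks = call_peaks(toy_scores)
print(peaks)
```

Merging before filtering matters: two short runs 40 bp apart survive as one peak, while the isolated 50-bp run is dropped.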
[Genome browser tracks: ChIP reads, input reads, detected peaks]
BCL6: 18,814 peaks; 80% are within 20 kb of a known gene
Where does each transcription factor bind in the genome, in each cell type, at a given time? Near which genes? What is the cis-regulatory code of each factor? Does it require any co-factors? [Figure labels: DNA, activation, repression]
Regulatory Sequence Discovery using FIRE
Discovering regulatory sequences associated with peak regions
Each sequence is labeled by whether it is a true TF binding peak (Yes: target regions; No: random regions) and by whether the motif is present or absent in it.
Motif correlation is quantified using the mutual information
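The mutual information between motif presence and peak status can be computed from a 2x2 table of counts, as in this Python sketch. Measuring the result in bits (log base 2) is an assumption, since the slides do not state the unit. The function name is hypothetical.

```python
import math


def mutual_information(table):
    """Mutual information (bits) of a contingency table of counts.

    table[i][j] = number of sequences with row label i and column label j,
    e.g. rows = motif absent/present, columns = random/target region.
    """
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            if n:   # empty cells contribute nothing
                mi += (n / total) * math.log2(n * total
                                              / (row_sums[i] * col_sums[j]))
    return mi


# Motif present in every target and absent from every random region.
print(mutual_information([[50, 0], [0, 50]]))
# Motif equally frequent in both sets: it carries no information.
print(mutual_information([[25, 25], [25, 25]]))
```

A motif that is enriched or depleted in peaks both yield positive mutual information; the sign of the association has to be read from the table itself.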
Motif Search Algorithm
k-mers ranked by mutual information (MI), from highly informative (MI=0.081, MI=0.045, MI=0.040, ...) to not informative:
CTCATCG, TCATCGC, AAAATTT, GATGAGC, AAAAATT, ATGAGCT, TTGCCAC, TGCCACC, ATCTCAT, ACGCGCG, CGACGCG, TACGCTA, ACCCCCT, CCACGGC, TTCAAAA, AGACGCG, CGAGAGC, CTTATTA
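The ranking step can be sketched in Python as follows. Every k-mer found in the sequences is scored by the mutual information between its presence and the target/random label, then the list is sorted. The sketch ignores reverse complements and uses tiny toy sequences. The names are hypothetical and this is not FIRE's code.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (bits) between two paired lists of labels."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def rank_kmers(seqs, labels, k):
    """Score every observed k-mer by MI(presence; label), best first."""
    kmers = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    scored = [(mutual_information([km in s for s in seqs], labels), km)
              for km in kmers]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Toy data: label 1 = target (peak) region, label 0 = random region.
toy_seqs = ["TTACGTTT", "GGACGTGG", "TTTTTTTT", "GGGGGGGG"]
toy_labels = [1, 1, 0, 0]

ranking = rank_kmers(toy_seqs, toy_labels, k=4)
for mi, kmer in ranking[:3]:
    print(kmer, round(mi, 3))
```

Only the top of such a ranking is passed on to the optimization step; most k-mers sit near zero.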
Optimizing k-mers into more informative degenerate motifs
ATCCGTACA → ATCC[C/G]TACA: which character increases the mutual information by the largest amount?
Candidates at this position: A/G, T/G, C/G, A/C/G, A/T/G, C/G/T
[Figure: motif matched against target regions (true TF binding peak: yes) and random regions (true TF binding peak: no)]
Optimizing k-mers into more informative degenerate motifs
ATCC[C/G]TACA
Candidates at the next position: A/C, T/C, C/G, A/C/G, A/T/C, C/G/T
[Figure: motif matched against target regions (true TF binding peak: yes) and random regions (true TF binding peak: no)]
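The optimization step can be illustrated with a greedy Python sketch. For each position of a seed k-mer, it tries every set of two or three bases that contains the original base, and keeps a change only if it raises the mutual information. This single left-to-right pass is a simplification chosen for illustration. FIRE's actual search procedure may differ, and all names here are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations

BASES = "ACGT"


def mutual_information(xs, ys):
    """Mutual information (bits) between two paired lists of labels."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def motif_mi(motif, seqs, labels):
    """MI between motif presence and label; motif is a list of base sets."""
    pattern = re.compile("".join("[%s]" % chars for chars in motif))
    return mutual_information([bool(pattern.search(s)) for s in seqs], labels)


def degenerate_options(base):
    """All sets of 2 or 3 bases that include the original base."""
    others = [b for b in BASES if b != base]
    return ["".join(sorted((base,) + combo))
            for size in (1, 2) for combo in combinations(others, size)]


def optimize_motif(seed, seqs, labels):
    """One greedy pass over the seed; keep changes that increase the MI."""
    motif = list(seed)
    best_mi = motif_mi(motif, seqs, labels)
    for pos in range(len(motif)):
        for option in degenerate_options(seed[pos]):
            trial = motif[:pos] + [option] + motif[pos + 1:]
            trial_mi = motif_mi(trial, seqs, labels)
            if trial_mi > best_mi:
                motif, best_mi = trial, trial_mi
    return motif, best_mi


# Toy data: both targets carry the site, differing at the fifth position.
toy_seqs = ["TTATCCGTACATT", "GGATCCCTACAGG", "TTTTTTTTTTTTT", "GGGGGGGGGGGGG"]
toy_labels = [1, 1, 0, 0]

motif, mi = optimize_motif("ATCCGTACA", toy_seqs, toy_labels)
print("".join(c if len(c) == 1 else "[%s]" % "/".join(c) for c in motif), mi)
```

On the toy data the seed matches only one target; allowing C or G at the fifth position lets it match both, which raises the mutual information.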
[Figure labels: motif change; conservation with S. bayanus; similarity to ChIP-chip RAP1 motif; mutual information]
Conditional mutual information I(X;Y|Z)
Highly informative k-mers, ranked by MI (MI=0.081, MI=0.045, ...): CTCATCG, TCATCGC, AAAATTT, GCTCATC, AAAAATT, ATGAGCT, TTGCCAC, TGCCACC, ATCTCAT
Optimize? Only optimize a k-mer if I(k-mer; expression | motif) is large enough, for all motifs optimized so far.
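The conditional mutual information used as a filter here is I(X;Y|Z) = Σ p(x,y,z) log [ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]. A small Python sketch follows, with X = presence of the candidate k-mer, Y = the label (peak or expression group) and Z = presence of an already optimized motif. The function name and the toy vectors are hypothetical, and the unit (bits) is an assumption.

```python
import math
from collections import Counter


def conditional_mutual_information(xs, ys, zs):
    """I(X;Y|Z) in bits, estimated from three paired lists of labels."""
    n = len(xs)
    c_xyz = Counter(zip(xs, ys, zs))
    c_xz = Counter(zip(xs, zs))
    c_yz = Counter(zip(ys, zs))
    c_z = Counter(zs)
    return sum(
        c / n * math.log2(c * c_z[z] / (c_xz[(x, z)] * c_yz[(y, z)]))
        for (x, y, z), c in c_xyz.items()
    )


labels = [1, 1, 0, 0]

# Candidate k-mer that duplicates an optimized motif: nothing new, I = 0.
redundant = conditional_mutual_information([1, 1, 0, 0], labels, [1, 1, 0, 0])

# Candidate that is informative while the optimized motif is unrelated.
novel = conditional_mutual_information([1, 1, 0, 0], labels, [0, 1, 0, 1])

print(redundant, novel)
```

A k-mer that is only a shifted or overlapping copy of an optimized motif scores near zero and is skipped, which keeps the final motif set non-redundant.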
[Figure: discovered motifs, with enrichment and depletion, and motif co-occurrence analysis]
FIRE automatically compares discovered motifs to known motifs in TRANSFAC and JASPAR
ChIPseeqer: an integrated framework for ChIP-seq data analysis
- ChIPseeqer (peak detection)
- ChIPseeqer2Track (for Genome Browser)
- ChIPseeqer2FIRE (+ motif analysis)
- ChIPseeqer2iPAGE (+ pathway analysis)
- ChIPseeqer2cons (conservation analysis)
Installing and setting up programs
Install ChIPseeqer and FIRE, then execute the following commands:
    export FIREDIR=/Applications/FIRE-1.1
    export PATH=$PATH:$FIREDIR
    export CHIPSEEQERDIR=/Applications/ChIPseeqer-1.0
    export PATH=$PATH:$CHIPSEEQERDIR:$CHIPSEEQERDIR/SCRIPTS
    chmod +x $CHIPSEEQERDIR/ChIP*
    chmod +x $CHIPSEEQERDIR/SCRIPTS/*.pl
Peak Detection
- Input file: CTCF.bed
    cd ~/Desktop/elemento
  Or download from: seq/
- U0 reads in BED format (check by typing wc -l CTCF.bed; view by typing more CTCF.bed, and q to exit)
- No input DNA for this experiment
Peak Detection
Step 1. Split the big read file into one file per chromosome
    split_bed_or_mit_files.pl CTCF.bed
Expected output:
    Opening CTCF.bed
    Current directory = .
    Creating ./reads.chr1
    …
Peak Detection
Step 2. Detect peaks
    ChIPseeqer --chipdir=. --t=15 --fraglen=250 --format=bed -outfile=CTCF_peaks_t15.txt
Expected output:
    Processing reads in chrY... done.
    Processing reads in chrX... done.
    Processing reads in chr9... done.
    Processing reads in chr8... done.
Step 3. Count how many peaks were found
    wc -l CTCF_peaks_t15.txt
Making a Genome Browser track
Command lines:
    cd JuliaChild
    wc -l CTCF_peaks_t15.txt
    ChIPseeqer2track --targets=CTCF_peaks_t15.txt --trackname="CTCF peaks"
Expected output:
    CTCF_peaks_t15.txt.wgl.gz created.
To check that the file was created:
    ls
Making a Genome Browser track
Making FIRE input files
Command line (type instructions below as one single line):
    ChIPseeqer2FIRE --targets=CTCF_peaks_t15.txt --genome=wg.fa --suffix=CTCF_peaks_t15_FIRE
wg.fa is also available from: (decompress with gunzip wg.fa.gz)
Expected output:
    Extracting sequences... Done.
    Extracting randomly selected sequences... Done.
    CTCF_peaks_t15_FIRE.txt and CTCF_peaks_t15_FIRE.seq have been generated.
    …
FIRE analysis
Command line (type instructions below as one single line):
    fire.pl --expfile=CTCF_peaks_t15_FIRE.txt --fastafile_dna=CTCF_peaks_t15_FIRE.seq --nodups=1 --minr=2 --species=human --dorna=0 --dodnarna=0
Expected output:
    Extracting sequences... Done.
    Extracting randomly selected sequences... Done.
    CTCF_peaks_t15_FIRE.txt and CTCF_peaks_t15_FIRE.seq have been generated.
    …
FIRE main output file
    open CTCF_peaks_t15_FIRE.txt_FIRE/DNA/CTCF_peaks_t15_FIRE.txt.summary.pdf
[Figure labels: peak sequences; randomly selected sequences]