ChIP-seq Olivier Elemento, PhD TA: Jenny Giannopoulou, PhD Institute for Computational Biomedicine CSHL High Throughput Data Analysis Workshop, June 2012.



Plan
1. ChIP-seq
2. Quality Control of ChIP-seq data
3. ChIP-seq Peak detection
4. Peak Analysis and Interpretation
5. A few interesting ChIP-seq papers

1. ChIP-seq

ChIP-seq
[Figure labels: transcription factor of interest (or histone modification); antibody; Illumina sequencing]

Control: input DNA (Illumina)
Can use IgG as additional control

ChIP-seq methodology
– Identify a ChIP-grade antibody and determine its specificity (Western blot, histone peptide array)
– Optimize conditions using single-locus ChIP-PCR (positive and negative controls)
– Sequence the ChIP product using 1 Illumina lane per sample (no TruSeq ChIP-seq), single end
– Sequence input/IgG as control
Example: assessing the specificity of a commercial H3K9me3 antibody using histone peptide arrays (K. Bunting & B. Swed, WCMC); Abcam H3K9me3 rabbit polyclonal (ab8898)

[Figure: double-stranded DNA fragment, average length ~ 170bp]
ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGAACTGATTAGTGAATTC
TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTTGACTAATCACTTAAG


BWA tutorial (for aligning single-end reads to a genome)
Get genome, e.g., from UCSC
Combine into 1 file
– tar zvfx chromFa.tar.gz
– cat *.fa > wg.fa
Indexing the genome
– bwa index -p hg19bwaidx -a bwtsw wg.fa
Align ChIP reads to reference genome
– bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
Convert to SAM format
– bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam
Align input reads to same reference genome
– bwa aln -t 4 hg19bwaidx s_4_sequence.txt.gz > s_4_sequence.txt.bwa
Convert to SAM format
– bwa samse hg19bwaidx s_4_sequence.txt.bwa s_4_sequence.txt.gz > s_4_sequence.txt.sam

Reads can map to multiple locations/chromosomes
[Figure: Read 1 and Read 2 against the reference human genome (hg18)]

Reads map to one strand or the other
[Figure: Read 1 and Read 2 against hg18]

SAM format
DH1608P1_0130:6:1103:10579:166379#TTAGGC 16 chr M * 0 0 GGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG eb`XXYbZdadee^ceV]X][ccTcc^ebeece eeeWbeeeeeeeceeaee XX:Z:NM_017871,32 NM:i:0 MD:Z:51
DH1608P1_0130:6:1102:3415:150915#TTAGGC 16 chr M * 0 0 GGGCGGGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBac]bbbceedaeddeZceeea_ba_\_eee eeeedaeeee XX:Z:NM_017871,32 NM:i:1 MD:Z:5T45
DH1608P1_0130:6:1102:13118:62644#TTAGGC 16 chr M * 0 0 GGGCGTGCCTCGGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBBBBBBBBBBB`XTbSa`cffegdggeccbe effdeggggg XX:Z:NM_017871,32 NM:i:2 MD:Z:7A3T39
DH1608P1_0130:6:1203:3012:157120#TTAGGC 16 chr M * 0 0 AAGGCCGTGACTCTGATCTCAGCCCTCGTCTCCGCCGCGCTCCCGGACCCG BBBBBBBB^`QWZZ]UXYSZSTFRU]Z__SO[adcc[acdV \`Y]YWY][_ XX:Z:NM_017871,34 NM:i:3 MD:Z:4G17G1A26
DH1608P1_0130:6:2206:4445:12756#TTAGGC 16 chr M3487N50M * 0 0 CCAAAGGGTGTGACTCTGATCTCGGGCATCGTCTCCGCCGCGCTCCCGGAC BBBBBBBBBBBBBBBBBBBBBBBB`YdddYdc\ cacaNddddcdddaeeee XX:Z:NM_017871,37 NM:i:3 MD:Z:2C5C14A27
DH1608P1_0130:6:2203:7903:43788#TTAGGC 16 chr M3487N50M * 0 0 CCCAAGGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGAC adbe[fbcbccb_cb^cb^^c^edgegggggdf ggefffgfbfggggegeg XX:Z:NM_017871,37 NM:i:0 MD:Z:51
– MD tag, e.g., MD:Z:4T46 = 4 matches, 1 mismatch (T is the reference base at that position), 46 matches
– CIGAR string, e.g., 5M3487N46M = 5bp-long aligned block, 3487bp skipped in the reference (e.g., an intron), 46bp-long aligned block
– XT tag, e.g., XT:A:U = unique mapper; XT:A:R = more than 1 high-scoring match (repeat)
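To make the CIGAR and MD examples above concrete, here is a minimal Python sketch that splits both strings into their components. The function names are illustrative (they do not come from the slides or from any SAM library), and the MD handling assumes the standard SAM convention that letters in the MD value are reference bases.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def reference_span(cigar):
    """Reference bases covered by the alignment (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

def parse_md(md):
    """Split an MD value (without the 'MD:Z:' prefix) into tokens:
    int = run of matches, single letter = mismatched reference base,
    '^' followed by letters = bases deleted from the reference."""
    tokens = []
    for run, deletion, base in re.findall(r"(\d+)|(\^[A-Z]+)|([A-Z])", md):
        if run:
            tokens.append(int(run))
        elif deletion:
            tokens.append(deletion)
        else:
            tokens.append(base)
    return tokens

print(parse_cigar("5M3487N46M"))     # blocks of 5 and 46 bp around a 3487 bp skip
print(reference_span("5M3487N46M"))  # 5 + 3487 + 46 reference bases
print(parse_md("4T46"))              # 4 matches, reference T mismatched, 46 matches
```

For real data, an established SAM/BAM parser is a safer choice than hand-written regular expressions.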

Quality Control

Clonal reads

Fragment size analysis

Fragment size analysis using opposite strand autocorrelation
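The slide gives only the name of the method, so the following is a rough Python sketch of one common way to do it, not necessarily the procedure used in the workshop. It assumes we already have the 5' end positions of forward-strand reads and of reverse-strand reads on one chromosome, and it looks for the shift that brings the most forward 5' ends onto reverse 5' ends; that shift is taken as the fragment length estimate. All names are hypothetical.

```python
from collections import Counter

def strand_cross_correlation(fwd_5p, rev_5p, max_shift):
    """For each shift d in 0..max_shift, count pairs (forward read, reverse read)
    whose 5' ends are exactly d bp apart."""
    fwd = Counter(fwd_5p)
    rev = Counter(rev_5p)
    scores = {}
    for d in range(max_shift + 1):
        scores[d] = sum(n * rev.get(pos + d, 0) for pos, n in fwd.items())
    return scores

def estimate_fragment_length(fwd_5p, rev_5p, max_shift=500):
    """Shift with the highest forward/reverse overlap (smallest shift wins ties)."""
    scores = strand_cross_correlation(fwd_5p, rev_5p, max_shift)
    return max(scores, key=scores.get)

# Toy data: three fragments of 170 bp plus one unrelated reverse read
fwd = [100, 200, 300]
rev = [270, 370, 470, 250]
print(estimate_fragment_length(fwd, rev, max_shift=300))
```

On real data the counts are usually binned or smoothed, and duplicate (clonal) reads are removed first, since they can create a spurious peak at the read length.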

Fragment size analysis

GC-content analysis

GC-content analysis

Other QC measures
Number of peaks:
– 0 or very few peaks, even at permissive peak calling thresholds = bad experiment
Motif enrichment
– is the expected motif enriched in peaks?

ChIP-seq peak calling

MACS

The Poisson distribution (MACS)
λ = expected # of reads within an interval of 2d bp
Estimate d based on high quality peaks
# in R, P(X>=5|λ=0.001) is
1-sum(dpois(0:4, 0.001))

BayesPeak

BayesPeak (Bayesian Hidden Markov Models) Parameters estimated using Bayesian treatment Observed variable Hidden states

BayesPeak

Peak detection using ChIPseeqer (Elemento and Giannopoulou, 2011)

A nice peak

Not all peaks are that nice

Peak detection
– Calculate read count at each position (bp) in the genome (we don't use a sliding window)
– Determine if the read count is greater than expected (at each position, bp)

Peak detection
We need to correct for input DNA reads (control)
- non-uniformly distributed (form peaks too)
- vastly different numbers of reads between ChIP and input

[Figure: read count and expected read count along the genome]
Expected read count = total number of reads * extended fragment length / chr length
Use Bioanalyzer to get the fragment length (remove adapter lengths)

Is the observed read count at a given genomic position greater than expected?
x = observed read count
λ = expected read count
Read counts follow a Poisson distribution
[Figure: the Poisson distribution; x-axis read count, y-axis frequency]

Is the observed read count at a given genomic position greater than expected?
x = 10 reads (observed)
λ = 0.5 reads (expected)
The Poisson distribution:
P(X>=10) = 1.7 x 10^-10
log10 P(X>=10) = -9.77
-log10 P(X>=10) = 9.77
# in R, P(X>=10|λ=0.5) is
1-sum(dpois(0:9, 0.5))
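The same Poisson tail probability can be checked outside R. The Python sketch below is an illustration only (the function name is made up); it sums the upper tail directly instead of computing 1 minus the lower tail, which avoids cancellation when the tail is very small, as in the MACS example P(X>=5|λ=0.001).

```python
import math

def poisson_upper_tail(x, lam):
    """Return P(X >= x) for X ~ Poisson(lam) by summing the upper tail directly."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if x <= 0:
        return 1.0
    # First term P(X = x), computed in log space to avoid overflow in x!
    term = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    total = 0.0
    k = x
    while True:
        total += term
        k += 1
        term *= lam / k  # P(X = k) obtained from P(X = k - 1)
        # Stop once terms are decreasing (k > lam) and no longer change the sum
        if k > lam and term <= total * 1e-17:
            break
    return total

p = poisson_upper_tail(10, 0.5)
print(p, -math.log10(p))  # about 1.7e-10, i.e. -log10 p of about 9.77
```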

[Figure: read count, expected read count, input reads, and -Log(p) along the genome]
Expected read count = total number of reads * extended frag len / chr len

[Figure: for ChIP and for INPUT, read count and expected read count at each genome position (bp) give -Log(P_c) and -Log(P_i); the difference [-Log(P_c)] - [-Log(P_i)] is compared with a threshold]

Normalized peak score (at each bp)
R = -log10 [ P(X_ChIP) / P(X_input) ] = [-log10 P(X_ChIP)] - [-log10 P(X_input)]
Will detect peaks with high read counts in ChIP, low in input
Works when there is no input DNA (x=0)
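As a small worked example of the normalized score, the sketch below takes the two Poisson tail probabilities at one position as inputs and returns R. The function name and the numbers are illustrative, not taken from ChIPseeqer.

```python
import math

def normalized_peak_score(p_chip, p_input):
    """R = [-log10 P(X_ChIP)] - [-log10 P(X_input)] at one genomic position."""
    return -math.log10(p_chip) + math.log10(p_input)

# Strong ChIP signal, mild input signal
print(normalized_peak_score(1e-20, 1e-3))

# No input reads at this position: P(X_input >= 0) = 1, so the input term is 0
print(normalized_peak_score(1e-20, 1.0))
```

With zero input reads the input term vanishes and R reduces to the ChIP-only score, which is why the score still works without input DNA.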

Mappability

Non-mappable fraction of the genome
[Table: non-unique positions / chromosome length for each chromosome (chr…, chrX, chrY, chrM); surviving percentages: 12% for the first chromosome listed, 74% for chrY]
We enumerated all 30-mers, counted the number of occurrences of each, and calculated the non-unique fraction of the genome
Unique/mappable fraction = 1 – non-unique fraction
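The k-mer enumeration described above can be sketched in a few lines of Python. This is a toy version under stated assumptions: it holds the whole sequence in memory, considers only the forward strand, and uses a hypothetical function name; a genome-scale computation would need a more memory-efficient approach.

```python
from collections import Counter

def non_unique_fraction(seq, k=30):
    """Fraction of k-mer start positions whose k-mer occurs more than once in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    counts = Counter(kmers)
    non_unique = sum(1 for kmer in kmers if counts[kmer] > 1)
    return non_unique / len(kmers)

# Toy example with k=4: ACGT occurs twice among the 7 k-mers
frac = non_unique_fraction("ACGTACGTTT", k=4)
print(frac, 1 - frac)  # non-unique fraction, mappable fraction
```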

[Figure: read count, expected read count, and -Log(p) along the genome]
Expected read count = total number of reads * extended frag len / (chr len * mappable fraction)
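The expected read count formula, with and without the mappability correction, can be written as one small Python function. The parameter names and example numbers are illustrative assumptions; only the formula itself comes from the slides.

```python
def expected_read_count(total_reads, frag_len, chr_len, mappable_fraction=1.0):
    """Expected number of extended reads covering one position:
    total reads * extended fragment length / (chr length * mappable fraction)."""
    if not 0 < mappable_fraction <= 1:
        raise ValueError("mappable_fraction must be in (0, 1]")
    return total_reads * frag_len / (chr_len * mappable_fraction)

# 1M reads, 170 bp fragments, 100 Mb chromosome
print(expected_read_count(1_000_000, 170, 100_000_000))
# Same, but only 88% of the chromosome is uniquely mappable
print(expected_read_count(1_000_000, 170, 100_000_000, mappable_fraction=0.88))
```

Dividing by the mappable fraction raises the expected count, because the same reads are spread over fewer usable positions.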

Peak detection
– Determine all genomic regions with R>=15
– Merge peaks separated by less than 100bp
– Output all peaks with length >= 100bp
– Process 23M reads in <5mins
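The three steps above (threshold, merge, filter by length) can be sketched as follows. This is a simplified illustration, not ChIPseeqer's actual code: it assumes the per-bp scores R for one chromosome are already in a Python list, and it reports peaks as half-open intervals (start, end).

```python
def call_peaks(scores, threshold=15.0, max_gap=100, min_length=100):
    """Return (start, end) intervals, end exclusive, from per-bp scores."""
    # 1. Contiguous runs of positions with score >= threshold
    runs = []
    start = None
    for pos, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(scores)))

    # 2. Merge runs separated by fewer than max_gap bp
    merged = []
    for run_start, run_end in runs:
        if merged and run_start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], run_end)
        else:
            merged.append((run_start, run_end))

    # 3. Keep only peaks that are at least min_length bp long
    return [(s, e) for s, e in merged if e - s >= min_length]

# Toy example: two 60 bp runs 40 bp apart merge into one 160 bp peak;
# an isolated 50 bp run is dropped
scores = [0.0] * 1000
for lo, hi in [(100, 160), (200, 260), (600, 650)]:
    for i in range(lo, hi):
        scores[i] = 20.0
print(call_peaks(scores))
```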

BCL6 ChIP-seq
– Lymphoma cell line (OCI-Ly1)
– Illumina: 6 GA2x lanes for ChIP, 1 for input DNA, 1 for QC
– 36nt long sequences
– 32 million reads
– Aligned/mapped to hg18 with BWA
– With Melnick lab at WCMC

[Figure: ChIP reads, input reads, detected peaks]
BCL6: 18,814 peaks
80% are within 20kb of a known gene

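Loading peaks into GRange
# split each SAM file into its own directory (CHIP/ and INPUT/)
system("split_samfile s_1_sequence.txt.sam -outdir CHIP/")
system("split_samfile s_2_sequence.txt.sam -outdir INPUT/")
# call peaks with ChIPseeqer
system("ChIPseeqer.bin -chipdir CHIP -inputdir INPUT -t 15 -fold 2 -outfile peaks.txt")
# read the peak table; dataFolder is assumed to be defined and to end with a path separator
tpeaks = read.table(paste(dataFolder, "peaks.txt", sep = ""), header = F)
# build a RangedData object: one range per peak, grouped by chromosome
peaks = RangedData(ranges = IRanges(start = tpeaks[, 2], end = tpeaks[, 3]), space = tpeaks[, 1], summit = tpeaks[, 6], score = tpeaks[, 5])
...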

Other peak finders

Promoter-based analysis (not peak-based)
For all TSS: maximum peak height (h1, h2, h3, h4, h5, …) in the 2kb promoter

4. Peak analysis and interpretation

Gene-based peak annotation

Integration of multiple peak lists: RangedData in R

Conservation analysis
fixedStep chrom=chr1 start= step=
fixedStep chrom=chr1 start= step=

What is the cis-regulatory code of each factor? Do they require any co-factors?
[Figure labels: DNA, Activation, Repression]

Discovering regulatory sequences associated with peak regions
[Table: true TF binding peak? (Yes = target regions, No = random regions) against motif (Present / Absent)]
Motif correlation is quantified using the mutual information

Motif Search Algorithm
Each k-mer is scored by MI, on a scale from highly informative to not informative
k-mers shown: CTCATCG TCATCGC AAAATTT GATGAGC AAAAATT ATGAGCT TTGCCAC TGCCACC ATCTCAT ACGCGCG CGACGCG TACGCTA ACCCCCT CCACGGC TTCAAAA AGACGCG CGAGAGC CTTATTA ...
MI values shown: MI=0.081, MI=0.045, MI=0.040
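To illustrate how a k-mer can be scored by mutual information, the sketch below computes MI from a 2x2 table of region counts: motif present or absent, in target (peak) regions or random regions. It is a generic textbook calculation in bits, with a made-up function name; the exact estimator and units used by FIRE may differ.

```python
import math

def mutual_information(peak_with, peak_without, random_with, random_without):
    """MI (in bits) between motif presence and region type, from region counts."""
    table = [[peak_with, peak_without], [random_with, random_without]]
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("empty contingency table")
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            n = table[i][j]
            if n == 0:
                continue  # a zero cell contributes nothing to the sum
            p_joint = n / total
            p_indep = (row_sums[i] / total) * (col_sums[j] / total)
            mi += p_joint * math.log2(p_joint / p_indep)
    return mi

# Motif found only in peaks: maximally informative for a balanced table
print(mutual_information(50, 0, 0, 50))
# Motif equally frequent in peaks and random regions: not informative
print(mutual_information(25, 25, 25, 25))
```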

Motif co-occurrence analysis of discovered motifs (enrichment / depletion)
FIRE automatically compares discovered motifs to known motifs in TRANSFAC and JASPAR

5. A few interesting papers

First ChIP-seq paper

Epigenetic modifications at enhancer regions

Chromatin states

Nucleosome localization

Whole-genome nucleosome location mapping in B cells (Yanwen Jiang, PhD)
Principal Component Analysis of nucleosome profiles