Transcription factor binding sites and gene regulatory network. Victor Jin, Department of Biomedical Informatics, The Ohio State University
Transcription in higher eukaryotes. Gene expression: chromatin structure; initiation of transcription; processing of the transcript; transport to the cytoplasm; mRNA translation; mRNA stability; protein activity and stability.
Transcriptional regulation. Genome-wide mRNA transcript data (e.g. microarrays). [Figure labels: nuclear membrane; binding site/motif CCG__CCG]
Transcriptional regulation: learning problems. (1) Understand which regulators control which target genes. (2) Discover motifs representing regulatory elements. [Figure labels: nuclear membrane; binding site/motif CCG__CCG]
Some common approaches: cluster-first motif discovery. Cluster genes by expression profile, annotation, … to find potentially coregulated genes. Find overrepresented motifs in promoter sequences of similar genes (algorithms: MEME, Consensus, Gibbs sampler, AlignACE, …) (Spellman et al. 1998)
Training data: features. [Figure labels: regulator expression; promoter sequence; label; feature vector]
What is a PWM? Transcription factor binding sites (TFBSs) are usually slightly variable in their sequences. A position weight matrix (PWM) captures this variability by scoring how likely each base is at each index position of the motif. [Table: row and column labels N, C, A, G, T, Con, Pos; cell values not recoverable]
PWM for ERE
Position frequency matrix (PFM) (also known as raw count matrix)
acggcagggTGACCc
aGGGCAtcgTGACCc
cGGTCGccaGGACCt
tGGTCAggcTGGTCt
aGGTGGcccTGACCc
cTGTCCctcTGACCc
aGGCTAcgaTGACGt
cagggagtgTGACCc
gagcatgggTGACCa
aGGTCAtaacgattt
gGAACAgttTGACCc
cGGTGAcctTGACCc
gGGGCAaagTGACTg
Given N sequence fragments of fixed length, one can assemble a position frequency matrix (the number of times a particular nucleotide appears at a given position). A normalized PFM, in which each column adds up to a total of one, is a matrix of probabilities for observing each nucleotide at each position.
Position weight matrix (PWM) (also known as position-specific scoring matrix)
The PFM should be converted to log-scale for efficient computational analysis. To eliminate zero counts before log-conversion, and to correct for small samples of binding sites, a sampling correction, known as pseudocounts, is added to each cell of the PFM.
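The PFM construction described above can be sketched in Python. This is a minimal illustration using the 13 sequences listed on this slide; the names `ERE_SITES`, `build_pfm`, and `normalize_pfm` are hypothetical, and treating upper- and lower-case letters alike is an assumption made here.

```python
# Sequences as listed on the slide (case is ignored in this sketch).
ERE_SITES = [
    "acggcagggTGACCc", "aGGGCAtcgTGACCc", "cGGTCGccaGGACCt",
    "tGGTCAggcTGGTCt", "aGGTGGcccTGACCc", "cTGTCCctcTGACCc",
    "aGGCTAcgaTGACGt", "cagggagtgTGACCc", "gagcatgggTGACCa",
    "aGGTCAtaacgattt", "gGAACAgttTGACCc", "cGGTGAcctTGACCc",
    "gGGGCAaagTGACTg",
]

BASES = "ACGT"


def build_pfm(sites):
    """Count how often each base occurs at each position (raw count matrix)."""
    sites = [s.upper() for s in sites]
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pfm = {b: [0] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm


def normalize_pfm(pfm):
    """Divide each count by its column sum so every column adds up to one."""
    length = len(pfm["A"])
    col_sums = [sum(pfm[b][i] for b in BASES) for i in range(length)]
    return {b: [pfm[b][i] / col_sums[i] for i in range(length)] for b in BASES}


pfm = build_pfm(ERE_SITES)
freqs = normalize_pfm(pfm)
```

Each column of `pfm` sums to the number of sequences, and each column of `freqs` sums to one, matching the definitions above.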
Converting a PFM into a PWM
Position weight matrix for ERE:
A 0.58 -0.44 -0.98 -1.21 -2.29 1.22 -0.60 -2.96 1.62 -0.72
C -1.49 -0.30 1.39 0.78 0.34 0.25 1.76 0.46
G 0.16 1.31 1.44 -0.17 -0.06 0.65 1.79 -0.64
T 0.96 -0.78 1.73 -1.84 0.23
For each matrix element, compute a log-scale weight from: the raw count (PFM matrix element) of nucleotide b in column i; N, the number of sequences used to create the PFM (= column sum); the pseudocounts (correction for small sample size); and p(b), the background frequency of nucleotide b.
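One common way to combine the terms defined above is w(b,i) = log2( ((f(b,i) + s(b)) / (N + sum of s over the four bases)) / p(b) ), where f and s are names assumed here for the raw count and the pseudocount. The sketch below follows that form; the uniform background of 0.25 and the total pseudocount of 1 split by background frequency are illustrative choices, not values taken from the slides, and `pfm_to_pwm` is a hypothetical name.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, background=None, pseudocount=1.0):
    """Convert raw counts to log2-odds weights against a background model.

    pfm: dict mapping each base to a list of counts, one per position.
    background: dict of background frequencies p(b); uniform if omitted
        (an assumption of this sketch).
    pseudocount: total pseudocount per column, split as s(b) = pseudocount * p(b)
        (also an assumption of this sketch).
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    length = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for i in range(length):
        n = sum(pfm[b][i] for b in BASES)  # column sum N
        for b in BASES:
            s_b = pseudocount * background[b]
            # Corrected probability; never zero because s_b > 0.
            prob = (pfm[b][i] + s_b) / (n + pseudocount)
            pwm[b].append(math.log2(prob / background[b]))
    return pwm


# Toy two-column PFM built from four sites (hypothetical data).
toy_pfm = {"A": [4, 0], "C": [0, 0], "G": [0, 4], "T": [0, 0]}
toy_pwm = pfm_to_pwm(toy_pfm)
```

Bases that never occur in a column receive a finite negative weight rather than minus infinity, which is the purpose of the pseudocount.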
Scoring putative EREs by scanning the promoter with the PWM
Candidate site: G G G T C A G C A T G G C C A
A 0.58 -0.44 -0.98 -1.21 -2.29 1.22 -0.60 -2.96 1.62 -0.72
C -1.49 -0.30 1.39 0.78 0.34 0.25 1.76 0.46
G 0.16 1.31 1.44 -0.17 -0.06 0.65 1.79 -0.64
T 0.96 -0.78 1.73 -1.84 0.23
Absolute score of the site = 11.57
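Scanning can be sketched as sliding a window of the motif width along the sequence and summing, for each window, the weight of the observed base at each position. The example below is a minimal sketch with a hypothetical three-column PWM (not the ERE matrix), it scans the forward strand only, and the names `score_site` and `scan_sequence` are assumptions made for illustration.

```python
def score_site(pwm, site):
    """Absolute score of one site: sum of the weights of its bases."""
    return sum(pwm[base][i] for i, base in enumerate(site.upper()))


def scan_sequence(pwm, sequence):
    """Score every window of motif width; return a list of (start, score)."""
    width = len(pwm["A"])
    sequence = sequence.upper()
    return [
        (start, score_site(pwm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical toy PWM that favors the site "ACG".
TOY_PWM = {
    "A": [1.0, -1.0, -1.0],
    "C": [-1.0, 1.0, -1.0],
    "G": [-1.0, -1.0, 1.0],
    "T": [-1.0, -1.0, -1.0],
}

hits = scan_sequence(TOY_PWM, "TTACGTT")
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

In practice a score threshold decides which windows are reported as putative sites, and the reverse complement strand is usually scanned as well.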
Yeast ESR: biological validation. Universal stress repressor motif. Xbp1: universal stress repressor; tbp1: TATA box; hap1: hypoxia stress; cbf1: cell cycle regulator; gcn4: amino acid, nitrogen stress; STRE element.
Previous work: “structure learning”. Graphical models (and other methods): learn the structure of a “regulatory network”, “regulatory modules”, etc.; fit an interpretable model to training data; model a small number of genes or clusters of genes. Many computational and statistical challenges; often used for qualitative hypotheses rather than prediction (Pe’er et al. 2001; Segal et al. 2003, 2004)
Signaling networks in a cell
Network inference. Regulator-motif associations in nodes can have different meanings: direct binding, indirect effect, or co-occurrence. Other data (e.g. ChIP-chip) are needed to confirm a binding relationship between regulator and target. Still, statistically significant regulator-target relationships can be determined from the regulation program. [Figure labels: P, Mp, TF, MTF, M]
Example: oxygen sensing and regulatory network
Binding data for regulatory networks. ChIP-chip: genome-wide protein-DNA binding data, i.e. which promoters are bound by a TF? Investigate the regulatory network model: use ChIP-chip data in place of motifs (no motif discovery). Features: (regulator, TF-occupancy) pairs. [Figure labels: P1, P2, TF]
Inferring regulatory networks from the combination of expression data and binding data
An extended ER regulatory network in MCF7 cells. [Figure: network nodes] FOS, MYC, CEBP, XBP1, RXRA, HSF2, PNN, NRIP1, TXNDC, IVNS1ABP, BATF, HES1, CHAF1B, CSDE1, CUTL1, PURB, ADAR, C14ORF43, SP3, DDX20, ELF3, TXNIP, PAWR, BRIP1, FOXP4, ZNF394, BAZ1B, STRAP, ASCC3, MKL2, GTF2I, RUVBL1, RFC1, ZNF500, TTF2, RAB18, ZKSCAN1, MSX2, LASS2, HDAC1, ZBTB41, TBX2, THRAP1, VPS72, TLE3, BHLHB2, ZNF38, ZNF239, DNMT1, HIF1A, HEY2, CCNL1, BRF1
Signaling molecules: networks. Find all signaling molecules (SMs) that associate as regulators with a particular TF’s ChIP occupancy in ADT features, e.g. hypothesis: the Glc7 phosphatase complex interacts with Hsf1 in regulation of Hsf1 targets (interaction supported in literature). [Figure labels: Hsf1; Gac1, Gip1, Sds22 (Glc7 phosphatase complex); TF; SM; mRNA]
http://motif.bmi.ohio-state.edu/ChIPMotifs/ [Workflow figure] Input data: FASTA file; control data (optional); contact info. Ab initio motif discovery programs: Weeder, MaMf, MEME. Statistical methods: bootstrap re-sampling, Fisher test. STAMP matching: known or novel motifs. Results: SeqLog, PWM, P-value.
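The workflow lists a Fisher test among its statistical methods. How the tool applies it is not stated on the slide; the sketch below assumes a one-sided Fisher exact test on a 2x2 table of sequences with and without a motif hit in the ChIP set versus the control set. The function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(chip_hit, chip_miss, ctrl_hit, ctrl_miss):
    """One-sided Fisher exact test for motif enrichment in the ChIP set.

    Returns the probability, under the hypergeometric null, of seeing at
    least `chip_hit` motif-containing sequences in the ChIP set given the
    fixed row and column totals of the 2x2 table.
    """
    n_chip = chip_hit + chip_miss
    n_ctrl = ctrl_hit + ctrl_miss
    n_hit = chip_hit + ctrl_hit
    total = n_chip + n_ctrl
    denominator = comb(total, n_hit)
    upper = min(n_chip, n_hit)
    # Sum the tail of the hypergeometric distribution from the observed count up.
    tail = sum(
        comb(n_chip, k) * comb(n_ctrl, n_hit - k)
        for k in range(chip_hit, upper + 1)
    )
    return tail / denominator


# Hypothetical counts: 3 of 4 ChIP sequences and 1 of 4 control sequences hit.
p_value = fisher_exact_greater(3, 1, 1, 3)
```

A small p-value suggests the motif occurs more often in the ChIP sequences than expected from the control set; with counts this small the test has little power, so the toy numbers are for illustration only.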
http://motif.bmi-ohio-state.edu/HRTBLDb
Software demo: W-ChIPMotifs, HRTargetDB