MOTIF ENRICHMENT ANALYSIS IN CO-EXPRESSED GENE SETS AND HIGH-THROUGHPUT SEQUENCE SETS
Wyeth Wasserman
Jan. 18, 2012
opossum.cisreg.ca/oPOSSUM3
Welcome
If you encounter any technical difficulties during the webinar
–Type a report using the chat option
Slide presentation ~20 min
Questions will be compiled as they are submitted and answered during the final Q&A/discussion period
During the discussion session, audience members will be able to speak
Webinar Format
Introduction
Walk-Through
Summary
Q&A
INTRODUCTION
Overview
Given co-expressed gene sets, what are the key mediators of co-expression?
–Focus on TFs
Web-based software system for motif enrichment analysis
–Co-expressed genes or sequences
–Multiple sets of analysis methods
–Available for human, mouse, fly, worm, yeast
Motif Enrichment Analysis
Finds over-represented TFBS in co-expressed gene sets
[Figure: Background vs. Target sets; example p-values p=0.04, p=0.55, p=0.66]
What do we need?
Region selection
–Where to look for enriched binding sites
–Use conservation filter to restrict search space
TFBS profiles to search for
–Need a pool of validated profiles
Scoring metrics for enrichment
–How to measure motif over-representation
Conserved Region Selection
[Figure: phastCons score plotted against genomic position along a gene, with a threshold line and conserved regions CR1, CR2, CR3, CR4]
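The conservation filter can be illustrated with a minimal sketch. It assumes per-base conservation scores are already available as a list, and that a conserved region is simply a contiguous run of positions at or above the threshold. The function name `conserved_regions`, the `min_length` parameter, and the example scores are hypothetical and are not taken from oPOSSUM, which may define regions differently.

```python
from typing import List, Tuple


def conserved_regions(
    scores: List[float], threshold: float, min_length: int = 1
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges whose scores are all >= threshold.

    Hypothetical helper: runs shorter than min_length are discarded.
    """
    regions = []
    start = None  # start index of the run currently being extended, if any
    for pos, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            # The run ended at the previous position; keep it if long enough.
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        regions.append((start, len(scores)))
    return regions


# Made-up conservation scores for ten positions.
example_scores = [0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.95, 0.9, 0.3, 0.85]
print(conserved_regions(example_scores, threshold=0.6, min_length=2))
# prints [(1, 4), (6, 8)]
```

Raising the threshold or `min_length` shrinks the search space, which is the trade-off the conservation filter controls.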
TFBS Profiles
JASPAR 2010: Portales-Casamar et al. Nucleic Acids Research
Expanded collection of TFBS profiles
–130 vertebrate profiles
–105 insect profiles
–5 nematode profiles
–177 yeast profiles
–PBM (104), PBM_HOMEO (176), PBM_BHLH (19)
Standardized 2-level TF classification (class, family)
Scoring Metrics
Z scores
–Based on the number of occurrences of the TFBS relative to background
–Normalized for sequence length
–Simple binomial distribution model
Fisher scores
–Fisher exact probability test
–Fisher score = -log(Fisher p-value)
–Based on the number of genes containing the TFBS relative to background
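A rough sketch of a Z score under a simple binomial model follows. It is an illustration and not oPOSSUM's exact formula. It assumes each nucleotide position is an independent trial, that the per-position hit rate is estimated from the background, and that the expected target count is that rate scaled by the target sequence length. The function name and the sequence lengths in the example are hypothetical.

```python
import math


def z_score(target_hits: int, target_length: int,
            background_hits: int, background_length: int) -> float:
    """Z score of the target hit count under a binomial model (illustrative).

    Assumption: the per-position hit probability is estimated from the
    background, and the target is treated as target_length independent trials.
    """
    rate = background_hits / background_length       # background hit rate
    expected = rate * target_length                  # length-normalized expectation
    std_dev = math.sqrt(target_length * rate * (1.0 - rate))
    return (target_hits - expected) / std_dev


# Hit counts as in the walk-through results; both lengths are invented here.
z = z_score(target_hits=41, target_length=20_000,
            background_hits=2127, background_length=2_000_000)
print(f"Z score: {z:.2f}")
```

A positive value means the motif occurs more often in the target than its length alone would predict from the background rate.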
Additional Metric for Sequence-Based Analysis
KS scores
–Kolmogorov-Smirnov test
–Compares the empirical distribution of the distances of the binding sites from the maximum point of confidence (MPC) to the background
–Expect real binding sites to be centered around the MPC
–KS score = -log(KS test p-value)
[Figure: foreground and background distributions of binding site positions around the MPC]
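The KS score can be sketched as a two-sample Kolmogorov-Smirnov test on site-to-MPC distances. This minimal version assumes the distances are already computed, uses the asymptotic Kolmogorov series for the p-value, and takes the natural logarithm. All function names and the example distances are hypothetical, and oPOSSUM's own implementation may differ.

```python
import math
from typing import Sequence


def ks_statistic(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Largest absolute gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Advance both samples past v so ties are handled together.
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d: float, n: int, m: int) -> float:
    """Asymptotic two-sample p-value (an approximation for small samples)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series converges slowly here and the value is ~1
        return 1.0
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(1.0, max(0.0, total))


def ks_score(foreground: Sequence[float], background: Sequence[float]) -> float:
    """KS score = -log(KS test p-value); natural log is assumed here."""
    d = ks_statistic(foreground, background)
    p = ks_pvalue(d, len(foreground), len(background))
    return math.inf if p == 0.0 else -math.log(p)


# Invented distances: foreground sites cluster near the MPC, background spreads out.
fg = list(range(20))
bg = list(range(0, 500, 10))
print(f"D = {ks_statistic(fg, bg):.2f}, KS score = {ks_score(fg, bg):.1f}")
```

A high score indicates that the foreground distances are distributed differently from the background, consistent with sites concentrated around the MPC.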
Analysis Methods
WALK-THROUGH
Human SSA - Input
Human SSA - Results
TF: HNF1A
JASPAR ID: MA
Class: Helix-Turn-Helix
Family: Homeo
Tax Group: Vertebrates
IC:
GC Content: 0.259
Target Gene Hits: 19
Target Gene Non-Hits: 36
Background Gene Hits: 1113
Background Gene Non-Hits: 3887
Target TFBS Hits: 41
Target TFBS Nucleotide Rate:
Background TFBS Hits: 2127
Background TFBS Nucleotide Rate: 0.009
Z-score:
Fisher score: 3.646
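The gene-level counts on the results page are enough to sketch how a Fisher score of the form -log(p-value) can be computed. The sketch below makes three assumptions that the slides do not confirm: the test is one-sided (enrichment in the target), the target and background genes form the two rows of a 2x2 table, and the logarithm is natural. The function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math


def fisher_p_one_sided(target_hits: int, target_non_hits: int,
                       background_hits: int, background_non_hits: int) -> float:
    """One-sided Fisher exact p-value: P(at least target_hits hits in the target).

    Computed from the hypergeometric distribution with all margins fixed.
    """
    n = target_hits + target_non_hits            # genes in the target set
    hits_total = target_hits + background_hits   # genes containing the TFBS
    total = n + background_hits + background_non_hits
    upper = min(hits_total, n)
    tail = sum(
        math.comb(hits_total, k) * math.comb(total - hits_total, n - k)
        for k in range(target_hits, upper + 1)
    )
    return tail / math.comb(total, n)


def fisher_score(target_hits: int, target_non_hits: int,
                 background_hits: int, background_non_hits: int) -> float:
    """Fisher score = -log(Fisher p-value); natural log is an assumption."""
    p = fisher_p_one_sided(target_hits, target_non_hits,
                           background_hits, background_non_hits)
    return -math.log(p)


# Gene hit counts from the results page.
score = fisher_score(19, 36, 1113, 3887)
print(f"Fisher score: {score:.3f}")
```

Because the score is a negative log of a p-value, larger values indicate stronger over-representation of genes containing the TFBS.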
oPOSSUM methods
Human aCSA - Input
Human aCSA - Results
TFBS Cluster Analysis
[Figure labels: TFBS Profile, Cluster]
TFBS Cluster Analysis (TCA)
[Figure: gene with conserved regions CR1, CR2, CR3, CR4; individual TFBSs are merged into TFBS cluster hits]
Overrepresentation analysis based on merged TFBS cluster hits
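The merge step can be sketched as interval merging. This assumes that hits from all profiles in one TFBS cluster are pooled and that overlapping hits collapse into a single cluster hit. The function name, the inclusive coordinate convention, and the example coordinates are hypothetical, and oPOSSUM's actual merge rule may differ.

```python
from typing import List, Tuple


def merge_cluster_hits(hits: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping (start, end) hits, with inclusive coordinates.

    Assumption: the input pools the predicted sites of every profile that
    belongs to the same TFBS cluster.
    """
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous cluster hit: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Invented coordinates for sites predicted by profiles of one cluster.
sites = [(100, 110), (105, 118), (300, 312), (118, 125)]
print(merge_cluster_hits(sites))
# prints [(100, 125), (300, 312)]
```

Counting merged hits rather than individual sites keeps similar profiles in a cluster from inflating the hit count for the same stretch of sequence.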
Human TCA - TFBS cluster selection
Human TCA - Results
TFCluster Info Page
Seq SSA - Input
Seq SSA - Results
KS score
Seq TCA - Input
SUMMARY
oPOSSUM-3
Web-based system for motif enrichment analysis in co-expressed gene sets and sequences from high-throughput experiments
Important functionalities
–Gene-based vs. Sequence-based
–Single site vs. Anchored combination site
–Individual vs. clusters of TFBS profiles
–Human, mouse, fly, worm and yeast
Development Team
Version 1: Ho Sui, SJ; Mortimer, JR; Arenillas, DJ; Brumm, J; Walsh, CJ; Kennedy, BP; Wasserman, WW
CSA: Huang, S; Fulton, DL; Arenillas, DJ; Perco, P; Ho Sui, SJ; Mortimer, JR; Wasserman, WW
Version 2: Ho Sui, SJ; Fulton, DL; Arenillas, DJ; Kwon, AT; Wasserman, WW
Version 3: Kwon, AT; Arenillas, DJ; Worsley Hunt, R; Wasserman, WW
QUESTIONS & ANSWERS
Please take a moment to type questions/comments into the chat box. The questions will be answered shortly.