1 Harvard Medical School
Mapping Transcription Mechanisms from Multimodal Genomic Data
Hsun-Hsien Chang, Michael McGeachie, and Marco F. Ramoni
Children’s Hospital Informatics Program
Harvard-MIT Division of Health Sciences and Technology
Harvard Medical School
March 10, 2010
2 Information Flow in Multimodal Genomic Data
Genetic Variants
–100k to 1000k SNPs
–250k copy number variations (CNVs)
–250k methylation measurements
Transcripts
–50k mRNA expression levels
–50k microRNA expression levels
–1.5M exon expression / splicing
[Figure: arrow labeled "Information" between Genetic Variants and Transcripts]
3 Expression Quantitative Trait Loci (eQTLs)
Connection from variant to expression is an information channel
–A DNA locus that modulates the expression level of a gene is an eQTL
Cis (trans) eQTLs are genetic variants located close to (far away from) the genes they modulate.
Identifying cis-eQTLs is easier
–Focusing on cis-eQTLs reduces the search space
–Trans-eQTLs?
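As a minimal sketch of the cis/trans distinction above, the following Python function labels a SNP-gene pair by genomic distance. The 1 Mb window, the function name `classify_eqtl`, and all coordinates are illustrative assumptions; the slides do not state a distance cutoff.

```python
def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
                  window=1_000_000):
    """Label a SNP-gene pair as 'cis' or 'trans'.

    The 1 Mb default window is an assumed convention, not a value
    taken from the slides.
    """
    # A SNP on a different chromosome is always trans.
    if snp_chrom != gene_chrom:
        return "trans"
    # Same chromosome: cis if the SNP lies within `window` bases of the gene.
    if gene_start - window <= snp_pos <= gene_end + window:
        return "cis"
    return "trans"


# Hypothetical coordinates, for illustration only.
near = classify_eqtl("21", 1_500_000, "21", 2_000_000, 2_050_000)
far = classify_eqtl("21", 30_000_000, "21", 2_000_000, 2_050_000)
other = classify_eqtl("5", 2_010_000, "21", 2_000_000, 2_050_000)
```

Restricting the search to pairs labeled cis is what shrinks the search space; trans candidates still require the full all-pairs comparison.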
4 Clinical Study on Pediatric Leukemia
Cancer: based on genetic modification (variants) and cellular malfunction (gene expression)
Identification of eQTLs helps explain molecular mechanisms in cancer and provides biological insight.
Clinical study of acute lymphoblastic leukemia (ALL)
–The most common malignancy in children, accounting for nearly one third of all pediatric cancers.
–A few cases are associated with inherited genetic syndromes (e.g., Down syndrome, Bloom syndrome, Fanconi anemia), but the cause remains unknown.
Data
–29 patients.
–Genotyped 100,000 SNPs (Affymetrix Human Mapping 100K).
–Profiled 50,000 gene expression levels (Affymetrix HG-U133 Plus 2.0).
5 Challenges in Finding eQTLs
Compare the distribution of each variant to the levels of each expression measurement
–Computational: testing all pairs of variants vs. expressions is costly; expression levels are usually discretized (Pensa et al., BioKDD, 2004)
–Multiple testing considerations
Understanding
–Too many associations to test via laboratory science
Computational methods of biological discovery
–Want to summarize the main informational (biological) pathways
Answer: Use transcriptional information
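To give a sense of scale for the all-pairs cost and the multiple testing burden, here is a rough calculation using the study's array sizes (100,000 SNPs and 50,000 expression measurements). The Bonferroni correction, the 0.05 significance level, and the function name `pairwise_test_burden` are assumptions chosen for illustration, not a procedure stated in the slides.

```python
def pairwise_test_burden(n_variants, n_transcripts, alpha=0.05):
    """Return (number of pairwise tests, Bonferroni per-test threshold).

    Bonferroni and alpha=0.05 are illustrative assumptions.
    """
    n_tests = n_variants * n_transcripts
    per_test_threshold = alpha / n_tests
    return n_tests, per_test_threshold


# Array sizes from the pediatric ALL study described in the slides.
n_tests, threshold = pairwise_test_burden(100_000, 50_000)
print(f"{n_tests:,} tests, per-test threshold {threshold:.1e}")
```

With only 29 patients, a per-test threshold this strict is hard to reach, which is one motivation for summarizing associations rather than testing each pair in isolation.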
6 Transcriptional Information Channel
[Figure: SNP X → Transcription Channel → expression Y]
SNPs are modeled as binomial variables.
Expressions are modeled as log-normal variables.
Mutual information (MI) quantifies the information flow from X to Y.
Higher MI is achieved by larger σ² and smaller σ_k², i.e., when expression level Y is more likely modulated by SNP X.
Information theory: measures entropy, H(X)
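The slide's equation did not survive extraction, so the following is only a sketch of one estimator consistent with the statement that MI grows with larger σ² and smaller σ_k². It assumes σ² is the overall variance of log-expression, σ_k² is the variance within genotype group k, and both the marginal and the per-genotype distributions are treated as Gaussian on the log scale, giving I(X;Y) ≈ ½ ln σ² − Σ_k p_k ½ ln σ_k². The function name, the variance floor, and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def transcriptional_information(genotypes, log_expr, min_var=1e-12):
    """Approximate MI (in nats) between a discrete SNP and log-expression.

    Assumed form: 0.5*ln(var_total) - sum_k p_k * 0.5*ln(var_k),
    using maximum-likelihood variances. `min_var` is a floor that
    avoids log(0) for constant groups.
    """
    def variance(values):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        return max(var, min_var)

    # Split log-expression values by genotype state k.
    groups = defaultdict(list)
    for g, y in zip(genotypes, log_expr):
        groups[g].append(y)

    n = len(log_expr)
    mi = 0.5 * math.log(variance(log_expr))       # marginal entropy term
    for ys in groups.values():
        p_k = len(ys) / n                          # genotype frequency
        mi -= p_k * 0.5 * math.log(variance(ys))   # conditional entropy term
    return mi


# Toy data (hypothetical): expression tracks genotype in the first case only.
mi_modulated = transcriptional_information(
    [0, 0, 0, 1, 1, 1], [1.0, 1.1, 0.9, 3.0, 3.1, 2.9])
mi_independent = transcriptional_information(
    [0, 1, 0, 1, 0, 1], [1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
```

Because a Gaussian has the largest entropy for a given variance, treating the marginal of Y as Gaussian tends to overstate the true mutual information of the mixture; the sketch is meant to show the direction of the effect, not to reproduce the authors' exact formula.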
7 Transcript Y is modulated by SNP X: [figure]
Transcript Y is independent of SNP X: [figure]
8 Transcriptional Information Map
[Figure: map linking SNPs X1 to X9 and transcripts Y1 to Y9]
9 ALL Transcriptional Information Map of Chr21
10 Cluster Genes and SNPs into Networks
[Figure: SNPs X1 to X9 and transcripts Y1 to Y9 grouped into clusters]
11 Cluster Genes and SNPs into Networks
[Figure: clusters containing X1, Y1, Y2, X3, X4, Y9, X8]
We can further infer the optimal modulation patterns using Bayesian networks.
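One plausible way to group linked SNPs and transcripts is to take the connected components of the SNP-transcript link graph. The sketch below does this with a small union-find; the slides do not specify the clustering procedure, so the method, the function name `cluster_links`, and the example links between the node labels are all assumptions.

```python
def cluster_links(links):
    """Group nodes into clusters (connected components) from SNP-transcript links."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then follow parents
        # to the root with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for snp, transcript in links:
        root_a, root_b = find(snp), find(transcript)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    # Sort for a deterministic output order.
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical links; the node labels follow the slide's figure,
# but the edges themselves are invented for illustration.
links = [("X1", "Y1"), ("X1", "Y2"), ("X3", "Y2"), ("X4", "Y9"), ("X8", "Y9")]
clusters = cluster_links(links)
```

Each resulting cluster can then be analyzed on its own, which keeps the later network inference step small.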
12 Bayesian Networks
Bayesian networks are directed acyclic graphs:
–Nodes correspond to random variables.
–Directed arcs encode the conditional probabilities of the target nodes given the source nodes.
–The distribution of X depends on (A,B): p(X|A,B)
–p(Z|X,Y) is independent of (A,B)
[Figure: example network with nodes A, B, X, Y, Z]
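The two independence statements above can be checked numerically on a toy network. The sketch below assumes the arcs A→X, B→X, X→Z, and Y→Z, treats Y as a root node (the slide does not show Y's parents), and uses made-up binary probability tables. It builds the joint as p(A) p(B) p(X|A,B) p(Y) p(Z|X,Y) and confirms that conditioning on (A,B) does not change p(Z|X,Y).

```python
from itertools import product

# Hypothetical probability tables for binary variables (illustration only).
p_a = {0: 0.7, 1: 0.3}
p_b = {0: 0.6, 1: 0.4}
p_y = {0: 0.5, 1: 0.5}                      # assumption: Y has no parents
p_x1_given_ab = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}
p_z1_given_xy = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.95}


def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, x, y, z):
    """Joint probability from the network factorization."""
    return (p_a[a] * p_b[b] * bern(p_x1_given_ab[(a, b)], x)
            * p_y[y] * bern(p_z1_given_xy[(x, y)], z))


def prob_z1(x, y, a=None, b=None):
    """P(Z=1 | X=x, Y=y), optionally also conditioning on A=a and B=b."""
    num = den = 0.0
    for aa, bb, zz in product((0, 1), repeat=3):
        if a is not None and aa != a:
            continue
        if b is not None and bb != b:
            continue
        pr = joint(aa, bb, x, y, zz)
        den += pr
        if zz == 1:
            num += pr
    return num / den


total = sum(joint(*v) for v in product((0, 1), repeat=5))
max_gap = max(abs(prob_z1(x, y, a, b) - prob_z1(x, y))
              for x, y, a, b in product((0, 1), repeat=4))
```

Here `max_gap` stays at floating-point noise because Z's only parents are X and Y, whereas the distribution of X changes with every setting of (A,B).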
13 Infer Bayesian Networks in Individual Clusters
[Figure: cluster with nodes X1, Y1, Y2, X3, X4, Y9, X8]
Step 1: Use the transcriptional information map (TIM) as the initial network.
Step 2: Infer SNP-SNP connections with a Bayesian network.
14 A Bayesian Network Inferred from Chr21 TIM
15 Information Theoretic Network Analysis
Find hubs, motifs, guilds, etc.
–Abstract edges
–Global patterns -> local patterns
–Reveal emergent properties
–Information theoretic approach using data compression
Alterovitz G and Ramoni MF, “Discovering biological guilds through topological abstraction,” AMIA Annu Symp Proc, pp. 1-5, 2006.
16 Identified Fundamental Components
Reference: Alterovitz and Ramoni, AMIA Annu Symp Proc, pp. 1-5, 2006.
17 Identification of Cis- and Trans-eQTLs
RIPK4, 21q22.3
–Related to Down syndrome
–RIPK4 has 5 (trans) SNPs in q11.2 (shown in blue in the figure) affecting its expression.
[Figure: RIPK4]
18 Identification of Cis- and Trans-eQTLs
CYYR1, 21q21.1
–Recently discovered.
–Encodes a cysteine and tyrosine-rich protein.
–A recent study found a correlation with neuroendocrine tumors.
–The TIM shows CYYR1 modulated by SNPs across the q arm of chromosome 21.
–DSCAM is related to Down syndrome
–Does the DSCAM-CYYR1 interaction lead to ALL?
[Figure: DSCAM]
19 Complete TIM Algorithm
Input: Genetic Variant, Transcript
Step 1: Compute Transcriptional Information
Step 2: Group Linked SNPs and Transcripts (Cluster 1 ... Cluster N)
Step 3: Infer Network in Individual Clusters (Cluster 1 ... Cluster N)
Step 4: Network Topology Analysis and Summary
20 Transcriptional Information Maps
Make large multimodal genetic datasets amenable to transcriptional analysis
Identify
–Modulation patterns between genetic variants and transcripts.
–Cis- and trans-eQTLs.
Analysis of pediatric ALL helps generate biological hypotheses regarding a connection to Down syndrome
21 Questions?
Thanks to Prof. Marco F. Ramoni, Dr. Hsun-Hsien Chang, Dr. Gil Alterovitz, Children’s Hospital Informatics Program, Brigham and Women’s Hospital