Boolean Analysis of High-Throughput Biological Datasets


1 Boolean Analysis of High-Throughput Biological Datasets
Debashis Sahoo
PhD Candidate, Electrical Engineering, Stanford University
Integrative Cancer Biology Program (ICBP), Stanford University

2 Outline
Introduction
Research Contributions
StepMiner
BooleanNet
Conclusion

3 Introduction
George Boole
Two values: High/Low, 1/0
Boolean operations: AND, OR, NOT
Implication: x → y
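The implication operation on this slide can be illustrated with a minimal Python sketch. The function name `implies` is an assumption made for illustration and does not come from the slides; High is mapped to True and Low to False.

```python
def implies(x: bool, y: bool) -> bool:
    """Material implication: x -> y is violated only when x is high and y is low."""
    return (not x) or y


# Enumerate all four (x, y) combinations with the value of x -> y.
truth_table = [(x, y, implies(x, y)) for x in (False, True) for y in (False, True)]
```

Only one of the four combinations (x high, y low) violates the implication, which is why an implication between two genes shows up later in the talk as a single sparse quadrant in their scatterplot.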

4 Molecular Biology

5 Microarrays
High-throughput gene expression measurement.
The expression of every gene is assigned a real-valued number.

6 Microarray Analysis
Pearson’s correlation
Hierarchical clustering
Significance analysis of microarrays (SAM)

7 Research Direction
What can you learn by studying Boolean relationships between genes?
Are there any fundamental Boolean relationships in gene expression that are conserved?
Is it possible to gain new insight into human diseases?
Can we generate hypotheses to test in a model organism?

8 Outline
Introduction
Research Contributions
StepMiner
BooleanNet
Conclusion

9 Research Contributions
New method for analyzing timecourse microarray datasets
  Developed a statistical test for fitting step functions.
  Developed a tool called “StepMiner” [Sahoo et al. NAR 2007].
StepMiner applied to biology
  Mouse osteosarcoma [Wu et al. PLoS Genetics 08]
  Mouse lymphoma [Shachaf et al. Cancer Research 08]
  Prostate cancer [In preparation]

10 Research Contributions
New method for analyzing large microarray datasets
  Developed a statistical test for discovering Boolean relationships between pairs of genes.
  Developed a tool called “BooleanNet” [Sahoo et al. RECOMB Sat 2007].
  Developed a web interface for BooleanNet.
BooleanNet applied to biology
  Predicts developmentally regulated genes. [In preparation]
  Predicts highly conserved regulatory relationships associated with the DREAM complex. [In preparation]

11 Outline
Introduction
Research Contributions
StepMiner
BooleanNet
Conclusion

12 Timecourse Analysis
Timecourse microarray data
  Temporal progression of transcriptional behavior
Characteristics of the data
  Voluminous, small number of time points, full of noise
Current analysis techniques
  Hierarchical clustering [Eisen et al. 1998]
  Significance analysis (SAM) [Storey et al. 2003]
  Clustering for timecourse data [Ernst et al. 2005, Tavazoie et al. 1999]
  Model based [Storey et al. 2005]
Notes: Time courses show the genetic response to a stimulus. The problem is that there are too many genes. Biologists would like simple answers to the most basic questions: which genes change, in what direction do they go, and when do they change?

13 StepMiner
Directly answers:
  Which genes turn “on”/“off”?
  When do they change?
Algorithm:
  Fit step functions using adaptive regression.
  Group genes that change at the same time and in the same direction.
[Figure: gene expression over time with a fitted step] [Sahoo et al. 07]

14 StepMiner Algorithm
Step  Error
3     7.87
4     3.72
5     7.61


16 Regression Statistic
Compute the degrees of freedom m:
  Adaptive one-step: m = 3
  Adaptive two-step: m = 4
Compute the F-statistic using the adjusted degrees of freedom m, for n timepoints:
  $$F = \frac{(SS_{TOT} - SSE)/(m - 1)}{SSE/(n - m)}$$
Compute p-values from the F-distribution with (m – 1, n – m) degrees of freedom.
Compute the FDR:
  Random permutations of timepoints
  Ratio of the expected to the observed number of significant genes
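The F-statistic above can be sketched in a few lines of Python. This is a minimal illustration, not StepMiner's actual code: the function name `f_statistic` and the toy timecourse are assumptions, and the p-value and FDR steps are left out to keep the sketch free of external dependencies.

```python
def f_statistic(values, fitted, m):
    """F = ((SSTOT - SSE) / (m - 1)) / (SSE / (n - m)) for a fitted step function.

    values: measured expression at each of the n timepoints
    fitted: value of the fitted step function at each timepoint
    m:      adjusted degrees of freedom (3 for one-step, 4 for two-step)
    """
    n = len(values)
    mean = sum(values) / n
    ss_tot = sum((v - mean) ** 2 for v in values)
    sse = sum((v - f) ** 2 for v, f in zip(values, fitted))
    if sse == 0:
        return float("inf")  # perfect fit: the ratio is unbounded
    return ((ss_tot - sse) / (m - 1)) / (sse / (n - m))


# Toy timecourse (made-up numbers) with a one-step fit: 1.0 for three points, then 3.0.
values = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
fitted = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
f_value = f_statistic(values, fitted, m=3)
```

A large F value means the step explains most of the variance in the timecourse; the slide then converts it to a p-value with the F-distribution on (m – 1, n – m) degrees of freedom.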

17 Outline
Introduction
Research Contributions
StepMiner
BooleanNet
Conclusion

18 StepMiner on Mouse Osteosarcoma
Experiment overview
Specific biological question: Which MYC target genes are responsible for tumor maintenance?
[Figure: conditional MYC expression (Tet-O c-MYC with tTA). MYC ON: no doxycycline. MYC OFF: plus doxycycline (DOX). Readouts: proliferation, microarrays (7 arrays).]
Notes: MYC is a well-known oncogene that causes human lymphoma, a deadly type of blood cancer. Some MYC target genes are responsible for cancer.
[The Felsher Lab, Wu et al. PLoS Genetics 08]

19 Tumor Maintenance Genes
Hypothesis: Expression of tumor maintenance genes follows the cell proliferation phenotype.
[Figure: time course (hours); gene classes Permanently Repressed (PR) and Permanently Induced (PI).]
[The Felsher Lab, Wu et al. PLoS Genetics 08]

20 StepMiner on Mouse Osteosarcoma
[Figure labels: PI, PR, MYC]
[The Felsher Lab, Wu et al. PLoS Genetics 08]

21 Tumor Maintenance Genes
[Figure label: MYC]
[The Felsher Lab, Wu et al. PLoS Genetics 08]

22 Permanent Changes in Ribosomal Proteins
[The Felsher Lab, Wu et al. PLoS Genetics 08]

23 MYC, Master of Ribosome Synthesis
Sci. STKE, 8 March 2005, Vol. 2005, Issue 274, p. tw89
Grewal et al. 2005; Grandori et al. 2005; Oskarsson and Trumpp 2005; Arabi et al. 2005
Notes: “Master regulator” may not be well defined.
Expression of dMyc is both necessary and sufficient to control rRNA synthesis and ribosome biogenesis during larval development.
Expression of c-Myc correlates with increased synthesis of rRNAs and their precursors, as indicated by labelling c-Myc-overexpressing primary human fibroblasts.
RNA interference was used to determine whether or not pre-rRNA synthesis is sensitive to the level of endogenous c-Myc. RT-PCR showed that pre-rRNA expression is decreased significantly by c-Myc depletion from HeLa cells using two different small interfering RNAs (siRNAs; Fig. 1f,g).

24 Permanent Changes in Ribosomal Proteins
[The Felsher Lab, Wu et al. PLoS Genetics 08]

25 Outline
Introduction
Research Contributions
StepMiner
BooleanNet
Conclusion

26 Background
Analysis of large gene expression datasets
  Clustering [Eisen et al. 1998]
  Co-expression [Arkin and Ross 1995; Allocco et al. 2004; Day et al. 2007; Jordan et al. 2004; Lee et al. 2004; Tavazoie et al. 1999]
  Bayesian analysis [Friedman et al. 2000; Lee et al. 2006; Pe'er et al. 2001; Segal et al. 2001]
  Mutual information [Basso et al. 2005; Butte and Kohane 2000; Margolin et al. 2006; Wang et al ]
Notes: Problem and motivation. The datasets are huge and it is difficult to extract knowledge from them. There are many existing techniques. Boolean analysis might be useful because other techniques miss patterns such as an L-shaped scatterplot.

27 Limitations
Current approaches [Basso et al. 2005]:
  Explore symmetric relationships only
  Are hard to extend across species
  Do not scale to very large datasets
  Are not easy to interpret

28 BooleanNet
Overview of the Boolean analysis:
  Get data: GEO [Edgar et al. 02]
  Normalize: RMA [Irizarry et al. 03]
  Determine thresholds
  Discover Boolean relationships
  Biological interpretation

29 Determine threshold
A threshold is determined for each gene.
The arrays are sorted by gene expression.
StepMiner is used to determine the threshold [Sahoo et al. 07].
[Figure: CDH expression across the sorted arrays, with High, Intermediate and Low regions marked around the threshold.]
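The thresholding step can be sketched as follows. This is a hedged illustration rather than the BooleanNet implementation: the names `step_threshold` and `call_level` are hypothetical, the threshold is assumed to be the midpoint between the two fitted step levels, and the intermediate band of ±0.5 around it is taken from the "Dynamic Range Consideration" slide (total width 1).

```python
def step_threshold(expression):
    """Sort the expression values, fit the best one-step function, and
    return the midpoint between the low and high levels (an assumption)."""
    ordered = sorted(expression)
    best_sse, best_levels = float("inf"), None
    for k in range(1, len(ordered)):  # k = number of arrays on the low side
        low, high = ordered[:k], ordered[k:]
        low_mean = sum(low) / len(low)
        high_mean = sum(high) / len(high)
        sse = (sum((v - low_mean) ** 2 for v in low)
               + sum((v - high_mean) ** 2 for v in high))
        if sse < best_sse:
            best_sse, best_levels = sse, (low_mean, high_mean)
    return (best_levels[0] + best_levels[1]) / 2


def call_level(value, threshold, half_width=0.5):
    """Label one measurement as low, intermediate, or high."""
    if value < threshold - half_width:
        return "low"
    if value > threshold + half_width:
        return "high"
    return "intermediate"


# Made-up log-expression values for one gene across six arrays.
threshold = step_threshold([4.0, 4.2, 3.8, 8.0, 8.2, 7.8])
```

Points that fall in the intermediate band are not counted in any quadrant, which keeps measurements close to the threshold from producing spurious relationships.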

30 Discovering Boolean Relationships
Analyze pairs of genes.
Analyze the four different quadrants.
Identify sparse quadrants.
Record the Boolean relationships:
  ACPP high → GABRB1 low
  GABRB1 high → ACPP low
[Figure: scatterplot of GABRB1 versus ACPP, with quadrants numbered 1 to 4.]
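The quadrant analysis can be sketched as a simple counting step. The function name `quadrant_counts`, the level labels, and the sample data are assumptions for illustration; the count names a00, a01, a10, a11 follow the "Statistical Tests" slides (first index for gene A, second for gene B, 0 = low, 1 = high).

```python
from collections import Counter


def quadrant_counts(a_levels, b_levels):
    """Count arrays per quadrant; arrays with an intermediate call are skipped."""
    counts = Counter()
    for a, b in zip(a_levels, b_levels):
        if "intermediate" in (a, b):
            continue
        counts[(a, b)] += 1
    return {
        "a00": counts[("low", "low")],
        "a01": counts[("low", "high")],
        "a10": counts[("high", "low")],
        "a11": counts[("high", "high")],
    }


# Made-up level calls for two genes across six arrays.
gene_a = ["low", "low", "high", "high", "intermediate", "low"]
gene_b = ["low", "high", "low", "low", "high", "high"]
counts = quadrant_counts(gene_a, gene_b)
```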

31 Statistical Tests
Compute the expected number of points under the independence model.
Compute the maximum likelihood estimate of the error rate.
[Figure: 2×2 table of counts a00, a01, a10, a11 for genes A and B.]
$$\text{statistic} = \frac{\text{expected} - \text{observed}}{\sqrt{\text{expected}}}$$
$$\text{error rate} = \frac{1}{2}\left(\frac{a_{00}}{a_{00} + a_{01}} + \frac{a_{00}}{a_{00} + a_{10}}\right)$$
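A minimal sketch of the two tests for the low-low quadrant (a00) is shown below. The function name is hypothetical, the expected count follows the independence model spelled out on the backup "Statistical Tests" slide, and the square root in the denominator is an assumption where the slide text is garbled.

```python
import math


def low_low_quadrant_tests(a00, a01, a10, a11):
    """Return (statistic, error_rate) for the quadrant where A and B are both low."""
    total = a00 + a01 + a10 + a11
    n_a_low = a00 + a01  # arrays where gene A is low
    n_b_low = a00 + a10  # arrays where gene B is low
    if n_a_low == 0 or n_b_low == 0:
        raise ValueError("need at least one low call for each gene")
    expected = (n_a_low / total) * (n_b_low / total) * total
    observed = a00
    statistic = (expected - observed) / math.sqrt(expected)
    error_rate = 0.5 * (a00 / n_a_low + a00 / n_b_low)
    return statistic, error_rate


# Made-up counts: the low-low quadrant is empty although 25 points are expected.
statistic, error_rate = low_low_quadrant_tests(0, 50, 50, 0)
```

A large statistic says the quadrant holds far fewer points than independence predicts; a small error rate says the few points that are present form a small fraction of each gene's low calls.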

32 Sparse Quadrant Identification
[Figure: scatterplot of GABRB1 versus ACPP, with (statistic, error rate) shown per quadrant.]
Quadrant  Statistic  Error rate
1         0.2        0.92
2         -0.5       0.57
3         -1.3       0.51
4         3.7        0.00

33 Sparse Quadrant Identification
A quadrant is sparse when statistic > 3 and error rate < 0.1.
Quadrants (statistic, error rate): 1 (0.2, 0.92); 2 (-0.5, 0.57); 3 (-1.3, 0.51); 4 (3.7, 0.00).
Quadrant 4 is sparse: ACPP high → GABRB1 low.
[Figure: scatterplot of GABRB1 versus ACPP.]
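The decision rule can be sketched for all four quadrants at once. This is an illustrative sketch under assumptions: the function name `sparse_quadrants` is hypothetical, and the slides give the formulas only for a00, so applying them to the other quadrants with the matching row and column totals is an extrapolation. The cutoffs 3 and 0.1 are the ones shown on this slide.

```python
import math


def sparse_quadrants(a00, a01, a10, a11, min_statistic=3.0, max_error=0.1):
    """Return the names of quadrants passing statistic > 3 and error rate < 0.1."""
    counts = {"a00": a00, "a01": a01, "a10": a10, "a11": a11}
    total = sum(counts.values())
    a_totals = {"0": a00 + a01, "1": a10 + a11}  # gene A low / high
    b_totals = {"0": a00 + a10, "1": a01 + a11}  # gene B low / high
    sparse = []
    for name, observed in counts.items():
        row = a_totals[name[1]]  # second character of the name is A's level
        col = b_totals[name[2]]  # third character of the name is B's level
        if row == 0 or col == 0:
            continue  # no calls at this level, so the test is undefined
        expected = row * col / total
        statistic = (expected - observed) / math.sqrt(expected)
        error_rate = 0.5 * (observed / row + observed / col)
        if statistic > min_statistic and error_rate < max_error:
            sparse.append(name)
    return sparse


# Made-up counts in which only the high-high quadrant is empty.
result = sparse_quadrants(50, 50, 50, 0)
```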

34 Four Asymmetric Boolean Relationships
Relationship      Example
A low → B low     PTPRC low → CD19 low
A low → B high    FAM60A low → NUAK1 high
A high → B low    XIST high → RPS4Y1 low
A high → B high   COL3A1 high → SPARC high
[Figure: four scatterplots for the gene pairs above. Sparse quadrants are highlighted.]

35 Two Symmetric Boolean Relationships
[Figure: two scatterplots, labeled Opposite and Equivalent; axis genes CCNB2, EED, BUB1B, XTP7.]
Notes: Sometimes the two genes are well correlated, but Boolean relationships can be strong in other cases as well.

36 Boolean Relationships
There are six possible Boolean relationships:
  A low → B low
  A low → B high
  A high → B low
  A high → B high
  Equivalent
  Opposite
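The mapping from sparse quadrants to the six relationships can be sketched as below. The function name `classify_relationship` and the string labels are assumptions for illustration; quadrant names use the a00, a01, a10, a11 convention of the "Statistical Tests" slides (first index for A, second for B, 0 = low, 1 = high).

```python
def classify_relationship(sparse):
    """Translate a set of sparse quadrant names into a Boolean relationship."""
    sparse = frozenset(sparse)
    if sparse == {"a01", "a10"}:
        return "equivalent"  # A and B are high together and low together
    if sparse == {"a00", "a11"}:
        return "opposite"    # A is high exactly when B is low
    single = {
        "a00": "A low -> B high",   # low-low is empty
        "a01": "A low -> B low",    # low-high is empty
        "a10": "A high -> B high",  # high-low is empty
        "a11": "A high -> B low",   # high-high is empty
    }
    if len(sparse) == 1:
        return single[next(iter(sparse))]
    return None  # no relationship, or a combination not covered by the six


relationship = classify_relationship({"a11"})
```

With the ACPP and GABRB1 example, the sparse high-high quadrant gives "A high -> B low", matching ACPP high → GABRB1 low.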

37 Size of The Boolean Networks
lowhigh highlow lowlow highhigh Equivalent Opposite ICBP, Stanford University

38 Conserved Boolean Networks
Find orthologs between human, mouse and fly using the EUGene database [Gilbert, 02].
Search for orthologous gene pairs that have the same Boolean relationship.
[Figure: network sizes. Fly 17M, Human 208M, Mouse 336M; overlaps 4M and 41K.]
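The search for conserved relationships can be sketched as a lookup through an ortholog map. Everything here is hypothetical: the function name, the tuple layout, and the placeholder gene names; no EUGene access is shown, and the data is made up.

```python
def conserved_relationships(human_rels, mouse_rels, human_to_mouse):
    """Keep human relationships whose ortholog pair has the same relationship in mouse.

    human_rels, mouse_rels: sets of (gene_a, gene_b, relationship) tuples
    human_to_mouse: dict mapping a human gene to its mouse ortholog
    """
    conserved = set()
    for gene_a, gene_b, rel in human_rels:
        mouse_a = human_to_mouse.get(gene_a)
        mouse_b = human_to_mouse.get(gene_b)
        if mouse_a is None or mouse_b is None:
            continue  # at least one gene has no known ortholog
        if (mouse_a, mouse_b, rel) in mouse_rels:
            conserved.add((gene_a, gene_b, rel))
    return conserved


# Placeholder genes and relationships, for illustration only.
human = {("HUMAN_A", "HUMAN_B", "low->low"), ("HUMAN_C", "HUMAN_D", "high->low")}
mouse = {("mouse_a", "mouse_b", "low->low"), ("mouse_c", "mouse_d", "low->high")}
orthologs = {"HUMAN_A": "mouse_a", "HUMAN_B": "mouse_b",
             "HUMAN_C": "mouse_c", "HUMAN_D": "mouse_d"}
conserved = conserved_relationships(human, mouse, orthologs)
```

Using sets of tuples makes each membership test constant time on average, which matters when the networks hold hundreds of millions of relationships.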

39 Web Interface
Easy navigation and visualization of the Boolean network
Search for genes
Scatterplot for a pair of genes
Retrieve all Boolean relationships for a gene
Build a subnetwork from a collection of genes


43 BooleanNet Reveals Known Biology
Category         Genes
Gender           XIST, RPS4Y1
Tissue           ACPP, GABRB1
Development      HOXD3, HOXA13 (UBX) (Antp)
Differentiation  PTPRC, CD19

44 Outline
Introduction
Research Contributions
StepMiner
BooleanNet
Conclusion

45 Prediction of Developmentally Regulated Genes
HSC: stem cell that makes blood cells
B cell: a type of blood cell that makes antibodies
The goal is to identify genes that are expressed at the intermediate stages of development.

46 Prediction of Developmentally Regulated Genes
Boolean relationships follow known developmental relationships.
  KIT: hematopoietic stem cell marker
  CD19: mature B cell marker
Boolean relationships:
  KIT high → CD19 low
  CD19 high → KIT low
Notes: Problem statement and motivation: inferring gene expression during development from gene expression in “other” cell types.

47 Computational Discovery of Human B Cell Precursors
Boolean interpolation: KIT high → A low → CD19 low
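Boolean interpolation can be sketched as a filter over a set of discovered relationships: keep every gene A for which both links of the chain hold. The function name, the relationship encoding, and the placeholder gene names below are assumptions for illustration and are not the talk's data.

```python
def interpolate(relationships, start, end):
    """Find genes A with `start high -> A low` and `A low -> end low`.

    relationships: set of (source, relation, target) tuples, where relation
    is a string such as "high->low" or "low->low".
    """
    after_start = {target for source, rel, target in relationships
                   if source == start and rel == "high->low"}
    before_end = {source for source, rel, target in relationships
                  if target == end and rel == "low->low"}
    return sorted(after_start & before_end)


# Placeholder relationships: only GENE_A satisfies both links of the chain.
toy = {
    ("KIT", "high->low", "GENE_A"),
    ("GENE_A", "low->low", "CD19"),
    ("KIT", "high->low", "GENE_B"),
    ("GENE_C", "low->low", "CD19"),
    ("KIT", "high->low", "CD19"),
}
candidates = interpolate(toy, "KIT", "CD19")
```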

48 Interpolation Between KIT and CD19
19 predicted genes: WASPIP, TRAF3IP3, ZC3HAV1, SEPT6, BACH2, TBC1D1, PTPRCAP, NUP153, ITGB7, CD53, CD72, ZC3H12D, ARHGAP4, LY9, ITGA4, CENTB1, LAT2, SEPT1, IL12RB1
Published recently: [Sintes et al. Tissue Antigen 2007]

49 Experimental Validation (In Mice)
Cell sorting [Deepta Bhattacharya]
qPCR on 14 genes [Jun Seita]
Different stages of B cell development: HSC, MPP fl-, MPP fl+, CLP, Frac A [Pre-ProB], Frac B [Pro-B], Frac C [Large Pre-B], Frac D [small Pre-B], Frac E [Immature B], T1, T2, Mature B, GC

50 qPCR Results
[Figure: qPCR results; two entries are marked with X.]

51 Analysis of qPCR Data
Determine a threshold to call “low” and “high” expression levels, using the HSC and MPP FL- expression levels.
Test for:
  Genes that turn “on” between the “MPP FL+” and “Frac C” stages
  Genes that are “on” at the mature B cell stage
12/14 genes pass this test (FDR 1%).

52 More B Cell Precursors
[Figure labels: AICDA low, AICDA high]
Boolean interpolation: KIT high → A low → AICDA low

53 Interpolation Between KIT and AICDA
52 predicted genes: BLK, BTLA, CCR6, CCR7, CD180, CD19, CD22, CD53, CD79A, CD80, CD84, CD86, CENTB1, CLEC2D, DENND1C, DOK3, FAIM3, FCER2, FCRLM1, FLT3LG, GCET2, GPR132, IL21R, IRF4, ITGA4, ITGB7, KIAA0674, KYNU, LAT2, LCK, LNPEP, LY9, MAP4K2, MGC10986, NCOA3, PAX5, PIK3R5, RAB8B, RAC2, RHOF, SLAMF1, SLAMF7, SP100, SP110, SPIB, SYK, TBX21, TRAF1, TREML2, WASPIP, ZC3H12D, ZNFN1A3
Notes: Check BCL6, LMO2, BCL2 (negative control). BCL6 and BCL2 are inversely related (DLBCL hypothesis). 6 genes.


55 Analysis of Predicted Genes
Total number of genes predicted: 62
33 genes have been knocked out in mice. [Literature]
18 genes have defects in B cell function and B cell differentiation.
2 genes are known prognostic markers of B cell lymphomas: WASPIP and GCET2.

56 Conclusion
Results

57 Conclusion
StepMiner reports important gene activity in timecourse microarray datasets.
BooleanNet discovers both symmetric and asymmetric relationships.
The Boolean implication network is a platform for new biological hypotheses.
  It predicts genes related to B cell development.
Boolean implication is easy to interpret.

58 Conclusion
[Figure labels: A high, B low, StepMiner, Boolean Analysis, Gene Regulation, Human Diseases]

59 Future Work
Application of StepMiner and BooleanNet to other types of datasets
Improvement of the statistical tests
  Threshold for BooleanNet
Prediction of genes in less well-characterized developmental pathways

60 Grand Challenge
Understand gene regulatory relationships
  In normal and disease processes
Build models of cellular processes
Build models of different developmental processes

61 Acknowledgements
David L. Dill, Teresa Meng, Robert Tibshirani, Andrew J. Gentles
Sylvia K. Plevritis, Dean W. Felsher, Irving L. Weissman, Joseph Lipsick, James D. Brooks
Judy Polenta, Maggie Bos
The Dill Lab: Jacob, Eric
The Sylvia Lab: Pong, Andrew, Ray, Slava, Emily, Stephanie
The Felsher Lab: Natalie, Cathy
The Fujitsu Lab: Jawahar, Amit, Subbu
The Lipsick Lab: Hong, Laura, Wai
The Brooks Lab: Suvarna
The Weissman Lab: Deepta, Jun
Friends at Stanford, friends from undergrad
Parents, brother and sister
Funding: ICBP Program (NIH grant: 5U56CA )

62 The END

63 Adaptive Regression Method
Transition edges are called “knots”.
For each possible knot position:
  Levels of constant segments are the means of the relevant measurements.
  Compute the SSE (sum-of-squares error).
Choose the knot position that minimizes the SSE.
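The one-step case of this procedure can be sketched directly from the bullets above. The function name `fit_one_step` and the sample timecourse are assumptions for illustration; the two-step fit and the statistical test are not shown.

```python
def fit_one_step(values):
    """Try every knot position and return (knot, low_level, high_level, sse)
    for the single-step function with the smallest sum-of-squares error.

    knot is the index of the first measurement after the transition.
    """
    best = None
    for knot in range(1, len(values)):
        before, after = values[:knot], values[knot:]
        level_before = sum(before) / len(before)  # mean of the first segment
        level_after = sum(after) / len(after)     # mean of the second segment
        sse = (sum((v - level_before) ** 2 for v in before)
               + sum((v - level_after) ** 2 for v in after))
        if best is None or sse < best[3]:
            best = (knot, level_before, level_after, sse)
    return best


# Made-up timecourse that steps up between the third and fourth timepoints.
knot, before_level, after_level, sse = fit_one_step([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])
```

This mirrors the Step/Error table on the "StepMiner Algorithm" slide: each candidate knot gets an error, and the knot with the smallest error is chosen.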

64 Statistical Tests
Compute the expected number of points under the independence model.
Compute the maximum likelihood estimate of the error rate.
[Figure: 2×2 table of counts a00, a01, a10, a11 for genes A and B.]
nAlow = (a00 + a01), nBlow = (a00 + a10)
total = a00 + a01 + a10 + a11, observed = a00
expected = (nAlow / total * nBlow / total) * total
$$\text{error rate} = \frac{1}{2}\left(\frac{a_{00}}{a_{00} + a_{01}} + \frac{a_{00}}{a_{00} + a_{10}}\right)$$

65 Dynamic Range Consideration
Width of the intermediate region is 1 (2 fold-change).
More than 5% low or high points for each gene.
Number of points in the intermediate region is less than 2/3 of the total number of points.

