Exhaustive Signature Algorithm

Exhaustive Signature Algorithm Guy Harari

Outline
- ISA biclustering algorithm
- Bimax biclustering algorithm
- Exhaustive Signature Algorithm
- Results and future work

ISA algorithm
- Developed by Sven Bergmann in 2003.
- Goal: find genes and conditions with correlated expression.
- Frequently used, compared against, and improved upon.
- Good results on real data.

ISA - details
- Input: an expression matrix and an initial gene set.
- Compute a normalized matrix by normalizing each column.
- For each condition, z-test the average normalized expression over the gene subset against the average expression in that condition.
- If the result is above a threshold, select the condition.
- Do the same for the genes, using the resulting condition set.
- Repeat until the gene set converges.
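The iteration above can be sketched in plain Python. This is a minimal illustration, not the original ISA implementation: the function names, the default thresholds `t_cond` and `t_gene`, the one-sided test, and the `sqrt(n)` scaling used as the z-statistic of a mean are all assumptions made here for the example.

```python
import math

def zscore(values):
    """Standardize a list to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    sd = math.sqrt(var) or 1.0  # constant vectors map to all zeros
    return [(v - mean) / sd for v in values]

def normalize_columns(matrix):
    """Z-score each condition (column) across genes."""
    cols = [zscore(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]

def normalize_rows(matrix):
    """Z-score each gene (row) across conditions."""
    return [zscore(list(row)) for row in matrix]

def z_of_mean(values):
    """z-statistic of a sample mean, assuming a standard normal null."""
    return math.sqrt(len(values)) * sum(values) / len(values)

def isa(matrix, seed_genes, t_cond=2.0, t_gene=2.0, max_iter=100):
    """Alternate condition and gene selection until the gene set is stable."""
    by_col = normalize_columns(matrix)
    by_row = normalize_rows(matrix)
    n_genes, n_conds = len(matrix), len(matrix[0])
    genes = set(seed_genes)
    conds = set()
    for _ in range(max_iter):
        if not genes:
            return set(), set()
        # Select conditions where the gene subset is unusually high.
        conds = {c for c in range(n_conds)
                 if z_of_mean([by_col[g][c] for g in genes]) > t_cond}
        if not conds:
            return set(), set()
        # Select genes that are unusually high over those conditions.
        new_genes = {g for g in range(n_genes)
                     if z_of_mean([by_row[g][c] for c in conds]) > t_gene}
        if new_genes == genes:  # converged
            break
        genes = new_genes
    return genes, conds
```

Starting from a partial seed, the loop should grow the gene set until it matches a planted block, provided the thresholds are low enough for the data at hand.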

ISA - drawbacks
- An initial gene set must be given.
- Few biclusters are found for any specific parameter value.
- Parameter values are hard to optimize.
- Expression values aren’t normally distributed.
- Genes might not be independent.

Exhaustive approach
- Use the Bimax algorithm to find seeds.
- For each seed, apply ISA with random parameters.
- Drop similar seeds while running.
- Drop similar biclusters from ISA.
- Observation: applying the algorithm separately for positive and negative values improves results.

Bimax algorithm
- Input: expression matrix.
- Binarize the matrix (value 1 for the b% highest and lowest values).
- Goal: find all submatrices that:
  - contain only 1’s;
  - are inclusion-maximal.
- Method:
  - Drop areas of the matrix with 0’s only.
  - Recursively apply Bimax to the other areas.
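The binarization step and the two conditions on a Bimax bicluster can be illustrated as follows. This is only a sketch: it reads "b% highest and lowest" as b% in each tail (an assumption, since the slide does not say), the helper names are hypothetical, and the recursive divide-and-conquer search itself is not shown.

```python
def binarize(matrix, b):
    """Set a cell to 1 if it lies in the top b% or bottom b% of all values.

    Assumption: b% is taken from each tail separately. Ties at the cutoffs
    may mark slightly more than b% of the cells.
    """
    values = sorted(v for row in matrix for v in row)
    k = max(1, round(len(values) * b / 100))
    low_cut, high_cut = values[k - 1], values[-k]
    return [[1 if (v <= low_cut or v >= high_cut) else 0 for v in row]
            for row in matrix]

def is_all_ones(binary, rows, cols):
    """True if the submatrix rows x cols contains only 1's."""
    return all(binary[r][c] == 1 for r in rows for c in cols)

def is_inclusion_maximal(binary, rows, cols):
    """True if rows x cols is all 1's and no row or column can be added."""
    if not is_all_ones(binary, rows, cols):
        return False
    for r in range(len(binary)):
        if r not in rows and is_all_ones(binary, [r], cols):
            return False
    for c in range(len(binary[0])):
        if c not in cols and is_all_ones(binary, rows, [c]):
            return False
    return True
```

The maximality check is brute force and is meant only to make the definition concrete, not to be used for enumeration.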

Bimax - illustration

Bimax - drawbacks
- Information loss due to binarization.
- The binarization parameter is hard to control.
- Runtime depends linearly on the number of biclusters.
- Usually returns millions of biclusters.
- Poor results on real data.

Exhaustive Signature Algorithm
- Apply Bimax to the input expression matrix.
- Keep biclusters that:
  - do not overlap with other biclusters;
  - have a low p-value w.r.t. a bicluster score.
- Sort the resulting biclusters by size.
- Beginning with the largest, apply ISA to each one.
- Keep new biclusters that do not overlap with previous ones.
- Stop if more than N biclusters are found.
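The selection loop in the last four steps might look like the sketch below. It assumes the Bimax seeds have already been filtered by overlap and p-value, represents a bicluster as a `(genes, conditions)` pair of sets, and takes the ISA refinement and the overlap measure as caller-supplied functions. The names `esa_select`, `refine`, `overlap`, `max_overlap` and `max_biclusters` are hypothetical and do not come from the slides.

```python
def bicluster_size(bicluster):
    """Size of a bicluster as number of genes times number of conditions."""
    genes, conds = bicluster
    return len(genes) * len(conds)

def esa_select(seeds, refine, overlap, max_overlap=0.5, max_biclusters=100):
    """Refine seeds from largest to smallest, keeping non-overlapping results.

    seeds:   iterable of (genes, conds) pairs, already filtered upstream
    refine:  function mapping a seed to a refined bicluster (e.g. an ISA run)
    overlap: function returning a similarity in [0, 1] for two biclusters
    """
    kept = []
    for seed in sorted(seeds, key=bicluster_size, reverse=True):
        candidate = refine(seed)
        if not candidate[0] or not candidate[1]:
            continue  # refinement collapsed to an empty bicluster
        if all(overlap(candidate, prev) <= max_overlap for prev in kept):
            kept.append(candidate)
        if len(kept) > max_biclusters:  # "stop if more than N found"
            break
    return kept
```

Passing the refinement and overlap functions in as arguments keeps the loop independent of any particular ISA implementation or overlap definition.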

ESA – details
- Overlaps: use the Jaccard index, take the larger.
- Score: average absolute Pearson correlation between gene pairs.
- P-value:
  - Randomize the input matrix using edge shuffling.
  - Apply ESA to the randomized matrix.
  - Keep the score distribution of all biclusters found.
  - P-value = right tail of the score distribution of the resulting biclusters.
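The three quantities named above can be written out in a few lines. This sketch makes several assumptions: the function names are hypothetical, the null scores are supplied by the caller (the edge-shuffling randomization is not implemented here), and the p-value is taken as the plain fraction of null scores at or above the observed score.

```python
import math
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def bicluster_score(matrix, genes, conds):
    """Average absolute Pearson correlation over all gene pairs,
    computed on the bicluster's conditions only."""
    conds = sorted(conds)
    profiles = [[matrix[g][c] for c in conds] for g in sorted(genes)]
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    return sum(abs(pearson(p, q)) for p, q in pairs) / len(pairs)

def empirical_p_value(score, null_scores):
    """Right-tail p-value: fraction of null scores >= the observed score."""
    return sum(1 for s in null_scores if s >= score) / len(null_scores)
```

Taking the absolute value means perfectly anti-correlated gene pairs score as high as perfectly correlated ones.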

ESA – details
- Observation: anti-correlated genes usually do not pass enrichment tests simultaneously.
- So apply ESA separately to positive and negative expression values.
- Also change ISA:
  - For the positive run, test: score > threshold
  - For the negative run, test: –score > threshold
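The modified ISA test reduces to a sign flip, as the sketch below shows. The helper `split_by_sign` is an assumption about how the two runs could be fed: it zeroes out values of the opposite sign, which the slides do not specify. Both function names are hypothetical.

```python
def passes_threshold(score, threshold, sign):
    """Sign-aware selection test.

    sign=+1 gives the positive run (score > threshold);
    sign=-1 gives the negative run (-score > threshold).
    """
    return sign * score > threshold

def split_by_sign(matrix):
    """Return (positive_part, negative_part) of an expression matrix.

    Assumption: values of the opposite sign are replaced by 0.
    """
    positive = [[v if v > 0 else 0 for v in row] for row in matrix]
    negative = [[v if v < 0 else 0 for v in row] for row in matrix]
    return positive, negative
```

With this test, a strongly negative score is selected only in the negative run, so up-regulated and down-regulated genes end up in separate biclusters.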

ESA - experiments
- Algorithms applied: SAMBA, Bimax, ISA, ESA and ESANP (negative and positive values separately).
- Datasets:
  - Gasch 2001 (yeast heat shock)
  - Whitfield 2002 (human cell cycle)
- Evaluation: GO, TF and KEGG enrichment tests.

Results – Yeast, GO

Results – Yeast, TF

Results – Yeast, KEGG

Results – Human, GO

Results – Human, KEGG

Conclusions
- ESA exploits both Bimax’s power and ISA’s accuracy.
- ESA avoids ISA’s parameter selection.
- ESA avoids ISA’s seed generation.
- ESA reduces the number of biclusters from Bimax.
- ESA shows good results on real data.

Future work
- Test the algorithm on other datasets.
- Initialize the binarization parameter automatically.
- Evaluate results with other criteria.
- Avoid bias towards large biclusters.