Exploratory Tools for Follow-up Studies to Microarray Experiments. Kaushik Sinha, Ruoming Jin, Gagan Agrawal, Helen Piontkivska. Ohio State and Kent State Universities.

1 Exploratory Tools for Follow-up Studies to Microarray Experiments. Kaushik Sinha, Ruoming Jin, Gagan Agrawal, Helen Piontkivska. Ohio State and Kent State Universities.

2 Overall Motivation. Biological literature is vast, so tools are needed to find interesting patterns in it. Specific example: genes are identified from DNA microarray and other gene and protein assays. Next step: What is known about these genes? How are these genes related to each other, or to other genes identified in similar studies? Which other genes are most similar?

3 Outline. Hypergraph Mining; Similarity Measures; Evaluation and Observations.

4 Hypergraph Mining: Motivating Example. A microarray experiment suggests that a small set of genes is related to a disease. To confirm this, we search the existing literature, expecting related genes to appear together. However, suppose genes A and C are related and both are weakly related to another term B. In the literature, one would expect A and C to appear together, and/or A and B to appear together and B and C to appear together. How do we efficiently conclude that A and C are actually related?

5 Hypergraph Mining: Basic Motivation. The goal is to find useful "transitive relations" (hyperedges) among genes. Example (gene-disease relationship): gene A is related to a term B; term B is related to a gene C; is gene A related to gene C? Gene source: microarray experiments. Information source: online literature abstracts.

6 Formal Problem Definition. Given: a dictionary $K_T$; a set $K_M$ of user-provided keywords ($K_T \supseteq K_M$); and a collection of literature abstracts, each represented as a set of words from the dictionary. Task: find hyperedges exceeding a user-defined threshold, each of which involves a set of keywords from $K_M$ that are potentially connected by another set of linking words from $K_T \setminus K_M$.

7 Relationship to Work on Frequent Pattern Mining. Frequent itemset mining: each document abstract can be represented as a transaction with several keywords, and the task is to find sets of keywords that often appear together. This cannot capture cross relationships. Differences: How do we define support? How do we prune the search space?

8 Solution Approach. Define total weight = support + cross support. Support: the set of keywords appears together in one document. Cross support: the set of keywords can be partitioned so that each partition appears in a different document, with common linking words. Issue: the downclosure property does not hold for total weight, but a modified downclosure property can be defined.
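
The slide gives only an informal definition of cross support, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes abstracts are Python sets of words, that cross support counts pairs of distinct abstracts which together cover the keyword set (neither one alone) and share at least one linking word, and that only two-way partitions are considered. The function names and the toy data are hypothetical.

```python
from itertools import combinations

def support(keywords, abstracts):
    """Number of abstracts that contain every keyword."""
    ks = set(keywords)
    return sum(1 for doc in abstracts if ks <= doc)

def cross_support(keywords, abstracts, linking_words):
    """ASSUMED definition: count pairs of distinct abstracts that jointly
    cover the keyword set (so a two-way partition exists), where neither
    abstract contains all keywords, and the two share a linking word."""
    ks = set(keywords)
    count = 0
    for d1, d2 in combinations(abstracts, 2):
        part1, part2 = ks & d1, ks & d2
        covers = (part1 | part2) == ks
        together = ks <= d1 or ks <= d2   # that case belongs to support
        linked = bool(d1 & d2 & linking_words)
        if covers and part1 and part2 and not together and linked:
            count += 1
    return count

def total_weight(keywords, abstracts, linking_words):
    """Total weight = support + cross support, as on the slide."""
    return (support(keywords, abstracts)
            + cross_support(keywords, abstracts, linking_words))

# Hypothetical toy corpus: A and C co-occur once, and are also linked
# through the term "apoptosis" in two separate abstracts.
abstracts = [
    {"geneA", "geneC"},
    {"geneA", "apoptosis"},
    {"geneC", "apoptosis"},
    {"geneB"},
]
linking = {"apoptosis"}
print(total_weight({"geneA", "geneC"}, abstracts, linking))
```

In this toy corpus the pair (geneA, geneC) gets one unit of support from the first abstract and one unit of cross support from the second and third abstracts, which is the kind of indirect evidence plain frequent itemset mining would miss.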

9 Idea. Support satisfies the downclosure property. Let $X$ be a set and $\Omega$ its power set. A function $f : \Omega \to \mathbb{R}^+$ satisfies the downclosure property if, for all $A, B \in \Omega$ with $A \supseteq B$, $f(B) \ge f(A)$. Cross support can be designed to stay below a particular value, i.e., it is bounded. Form a function $h = f + g$, where $f$ satisfies the downclosure property and $g$ is bounded. Then $h$ satisfies a modified downclosure property: for any $\theta \ge 0$, if $h(K_n) \ge \theta$ then $f(K_{n-1}) \ge \max\{0, \theta - \sup(g)\}$. This property can be used to devise an efficient algorithm.
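
The modified property follows in a few steps from the definitions on this slide. The derivation below is a sketch that assumes $K_{n-1}$ denotes any subset of $K_n$ with $n-1$ keywords and that $\sup(g)$ is finite; neither assumption is spelled out in the slide.

```latex
\begin{align*}
h(K_n) \ge \theta
  &\implies f(K_n) \ge \theta - g(K_n) \ge \theta - \sup(g)
  && \text{since } h = f + g \text{ and } g \le \sup(g), \\
  &\implies f(K_{n-1}) \ge f(K_n) \ge \theta - \sup(g)
  && \text{by downclosure of } f,\ K_{n-1} \subset K_n, \\
  &\implies f(K_{n-1}) \ge \max\{0,\ \theta - \sup(g)\}
  && \text{since } f \ge 0 .
\end{align*}
```

Read as a pruning rule: a candidate set of $n$ keywords can reach total weight $\theta$ only if each of its $(n-1)$-subsets has support at least $\max\{0, \theta - \sup(g)\}$.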

10 Outline. Hypergraph Mining; Similarity Measures; Evaluation and Observations.

11 Similarity Measure among Sets of Genes. Given two lists of gene names, we need to find the most similar genes based on occurrences in literature abstracts. Standard statistical approach: each file containing gene names can be treated as a discrete random variable (DRV) that takes several values (gene names). For two such files $X, Y$ and any pair $(x, y)$, the joint probability mass function is $p(x, y) = P(X = x, Y = y)$, computed from online abstracts based on co-occurrence.

12 Probability Computation. Assume file $X$ has $n$ gene names $x_i$, $i \in \{1, \dots, n\}$, and file $Y$ has $m$ gene names $y_j$, $j \in \{1, \dots, m\}$. Let $M(i,j)$ be the number of times $(x_i, y_j)$ appear together in transactions (article abstracts). Then $p(x_i, y_j) = M(i,j) / \sum_i \sum_j M(i,j)$.
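
The computation above can be sketched in a few lines of Python. This is an illustrative sketch, assuming each abstract is a set of words and each co-occurrence is counted once per abstract; the function names and the toy data are hypothetical.

```python
def cooccurrence_counts(xs, ys, abstracts):
    """M[i][j] = number of abstracts containing both xs[i] and ys[j]."""
    return [
        [sum(1 for doc in abstracts if x in doc and y in doc) for y in ys]
        for x in xs
    ]

def joint_pmf(M):
    """p(x_i, y_j) = M(i, j) / sum of all entries of M."""
    total = sum(sum(row) for row in M)
    if total == 0:
        raise ValueError("no co-occurrences: the pmf is undefined")
    return [[m / total for m in row] for row in M]

# Hypothetical toy data: two gene lists and four abstracts.
xs = ["g1", "g2"]
ys = ["h1", "h2"]
abstracts = [
    {"g1", "h1"},
    {"g1", "h1", "h2"},
    {"g2", "h2"},
    {"g2"},
]
M = cooccurrence_counts(xs, ys, abstracts)
p = joint_pmf(M)
print(M)  # [[2, 1], [0, 1]]
print(p)  # [[0.5, 0.25], [0.0, 0.25]]
```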

13 Expectation Computation. Define $Z = g(X, Y)$, where $g : X \times Y \to [0, \infty)$. Clearly, $Z$ is a random variable, with expectation $E(Z) = E(g(X, Y)) = \sum_i \sum_j g(x_i, y_j)\, M(i,j) / M_t$, where $M_t = \sum_i \sum_j M(i,j)$. The expected value of $Z$ can be used directly as a similarity measure; different choices of $g$ give rise to different similarity measures.

14 Some Choices of Function g. First choice: $g(x_i, y_j) = M(i,j)$. This leads to the similarity measure $s_{e1} = \sum_i \sum_j M(i,j)^2 / M_t$. Second choice: $g(x_i, y_j) = \mathrm{tot\_length}(x_i, y_j)$, the sum of the lengths of the transactions in which $(x_i, y_j)$ co-occur. The idea is that the longer the transaction, the higher the chance of it containing related linking keywords. This leads to the similarity measure $s_{e2} = \sum_i \sum_j \mathrm{tot\_length}(x_i, y_j)\, M(i,j) / M_t$.
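
Both measures are instances of the same weighted sum, which the following sketch makes explicit. It assumes abstracts are sets of words and takes the length of a transaction to be the number of words in the abstract; the function names and the toy data are hypothetical.

```python
def cooccurrence_tables(xs, ys, abstracts):
    """Return (M, L): M[i][j] counts the abstracts containing xs[i] and
    ys[j]; L[i][j] sums the lengths of those abstracts (tot_length)."""
    M = [[0] * len(ys) for _ in xs]
    L = [[0] * len(ys) for _ in xs]
    for doc in abstracts:
        for i, x in enumerate(xs):
            if x not in doc:
                continue
            for j, y in enumerate(ys):
                if y in doc:
                    M[i][j] += 1
                    L[i][j] += len(doc)
    return M, L

def expected_similarity(G, M):
    """E[g(X, Y)] = sum_ij G[i][j] * M[i][j] / M_t; 0.0 if M_t is 0."""
    m_t = sum(sum(row) for row in M)
    if m_t == 0:
        return 0.0
    weighted = sum(
        g * m for g_row, m_row in zip(G, M) for g, m in zip(g_row, m_row)
    )
    return weighted / m_t

# Hypothetical toy data.
xs = ["g1", "g2"]
ys = ["h1", "h2"]
abstracts = [{"g1", "h1"}, {"g1", "h1", "h2"}, {"g2", "h2"}, {"g2"}]

M, L = cooccurrence_tables(xs, ys, abstracts)
s_e1 = expected_similarity(M, M)  # first choice:  g = M(i, j)
s_e2 = expected_similarity(L, M)  # second choice: g = tot_length
print(s_e1, s_e2)  # 1.5 3.75
```

Passing the weight table `G` separately from the counts `M` keeps the expectation generic, so a third choice of $g$ only needs a new table.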

15 Extending the Notion Towards Gene Ranking. The measure can be extended to rank the genes in a list $Y$ by similarity to the genes in list $X$. Instead of treating $Y$ as a random variable, for each $y_j \in Y$ consider a random variable $U_j$ that takes only the value $y_j$. Compute the similarity measure between $X$ and $U_j$ for all $j \in \{1, \dots, m\}$, then sort the genes in list $Y$ by decreasing similarity.
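
A minimal sketch of this ranking, using the first choice $g = M(i,j)$. It assumes that for the pair $(X, U_j)$ the normalizer is the total of column $j$ of the co-occurrence table, which is what the earlier definitions give when $Y$ is replaced by a single-valued variable; genes with no co-occurrences are given a score of 0 by convention. The function name and the toy data are hypothetical.

```python
def rank_genes(xs, ys, abstracts):
    """Score each y in ys against the whole list xs and sort descending.

    For a fixed y the score is sum_i M(i)^2 / sum_i M(i), where M(i) is
    the number of abstracts in which xs[i] and y co-occur.
    """
    scores = []
    for y in ys:
        column = [
            sum(1 for doc in abstracts if x in doc and y in doc) for x in xs
        ]
        col_total = sum(column)
        score = sum(m * m for m in column) / col_total if col_total else 0.0
        scores.append((y, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: h3 never co-occurs with any gene in xs.
xs = ["g1", "g2"]
ys = ["h3", "h2", "h1"]
abstracts = [{"g1", "h1"}, {"g1", "h1", "h2"}, {"g2", "h2"}, {"g2"}]

ranking = rank_genes(xs, ys, abstracts)
print(ranking)  # [('h1', 2.0), ('h2', 1.0), ('h3', 0.0)]
```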

16 Datasets. Two sets of 21 and 31 genes were used. These genes are differentially expressed between prostate epithelial and stromal cells in prostate cancer patients (Dr. Gail Frazer's lab, Kent State University). A standard dictionary of 300 genes, as reported in the literature, was also used; these genes were significantly up- or down-regulated in tumor and adjacent normal tissues when compared with normal donor tissue. Each literature abstract was represented in bag-of-words format, where each word comes from a dataset or the dictionary, or is a GO term.

17 Results: Hypergraph Mining. Results show the linking GO terms and linking genes from the dictionary for the 21 and 31 datasets, obtained by hypergraph mining.

18 Results: Similarity Measures. Four sets of 300 genes each (A, B, C, D) were formed. A is the dictionary of 300 genes mentioned before; B, C, and D were randomly chosen from SuperArray's DNA microarray experiments. The task is to identify which of A, B, C, D is most similar to the 21 or 31 dataset. As one would expect, A is most similar to the 21 dataset, as shown below. The results also show that a naïve similarity measure, such as $s_1$, fails to capture this. Sometimes the tool discovers an interesting result: for the 31 dataset, the randomly chosen list C was most similar. This was justified by checking the functionalities of the top-ranked genes from list C.

19 Results: Ranking. Results show the ranked genes from the list most similar to either the 21 or the 31 dataset. Linking words from hypergraph mining were also found within the top 20 genes.

20 Summary. Biological literature is large and complex. Data mining tools are needed to summarize interesting patterns. We proposed hypergraph mining and similarity metrics. Initial results are promising.