Bertinoro, Nov 2005 Some Data Mining Challenges Learned From Bioinformatics & Actions Taken Limsoon Wong National University of Singapore.

Presentation transcript:

Bertinoro, Nov 2005 Some Data Mining Challenges Learned From Bioinformatics & Actions Taken Limsoon Wong National University of Singapore

Bertinoro, Nov 2005. Copyright 2005 © Limsoon Wong

Plan
- Bioinformatics examples
  - Treatment prognosis of DLBC lymphoma
  - Prediction of translation initiation site
  - Prediction of protein function from PPI data
- What have we learned from these projects?
- What have I been looking at recently?
  - Statistical measures beyond frequent items
  - Small changes that have large impact
  - Evolution of pattern spaces

Example #1: Treatment Prognosis for DLBC Lymphoma
Image credit: Rosenwald et al, 2002
Ref: H. Liu et al, “Selection of patient samples and genes for outcome prediction”, Proc. CSB2004, pages

Diffuse Large B-Cell Lymphoma
- DLBC lymphoma is the most common type of lymphoma in adults
- It can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients
  => DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy
- International Prognostic Index (IPI)
  - Age, Eastern Cooperative Oncology Group performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...
- The IPI is not very good for stratifying DLBC lymphoma patients for therapeutic trials
  => Use gene-expression profiles to predict outcome of chemotherapy?

Knowledge Discovery from Gene Expression of “Extreme” Samples
[Diagram; recoverable labels:]
- “Extreme” sample selection: 8 yrs
- Knowledge discovery from gene expression
- 240 samples; 80 samples
- 26 long-term survivors; 47 short-term survivors
- 7399 genes; 84 genes
- T is long-term if S(T) < 0.3
- T is short-term if S(T) > 0.7

Kaplan-Meier Plot for 80 Test Cases
[Plot; curves labelled “Low risk” and “High risk”]
- p-value of log-rank test: <
- Risk score thresholds: 0.7, 0.3
- Without training sample selection, there is no clear difference in the overall survival of the 80 samples in the validation group of the DLBCL study

Improvement Over IPI
- (A) IPI low, p-value =
- (B) IPI intermediate, p-value =

Example #2: Protein Translation Initiation Site Recognition
Ref: L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13: , 2002

A Sample cDNA

299 HSU CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT

- What makes the second ATG the TIS?
- Approach
  - Training data gathering
  - Signal generation: k-grams, distance, domain know-how, ...
  - Signal selection: entropy, χ², CFS, t-test, domain know-how, ...
  - Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...

Signal Generation
- K-grams (i.e., k consecutive letters)
  - K = 1, 2, 3, 4, 5, ...
  - Window size vs. fixed position
  - Upstream, downstream vs. anywhere in window
  - In-frame vs. any frame
- Examples:
  - Window = ±100 bases
  - In-frame, downstream: GCT = 1, TTT = 1, ATG = 1, ...
  - Any-frame, downstream: GCT = 3, TTT = 2, ATG = 2, ...
  - In-frame, upstream: GCT = 2, TTT = 0, ATG = 0, ...
[Sequence: the sample cDNA HSU CAT U27655 Homo sapiens, repeated from the previous slide]
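The k-gram signals described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `kgram_counts`, its parameters, and the toy sequence are hypothetical, the window is assumed to exclude the candidate ATG itself, and the counts it produces are for the toy sequence, not for the sample cDNA on the slide.

```python
from collections import Counter

def kgram_counts(seq, atg_pos, k=3, window=100, direction="downstream", in_frame=True):
    """Count k-grams near a candidate ATG starting at 0-based index atg_pos.

    direction: "upstream" (bases before the ATG) or "downstream" (bases after it).
    in_frame:  if True, count only k-grams whose start is a multiple of 3
               away from the ATG; otherwise count every start position.
    Assumption: the window excludes the ATG itself.
    """
    if direction == "downstream":
        start, end = atg_pos + 3, min(len(seq), atg_pos + 3 + window)
    elif direction == "upstream":
        start, end = max(0, atg_pos - window), atg_pos
    else:
        raise ValueError("direction must be 'upstream' or 'downstream'")
    counts = Counter()
    for i in range(start, end - k + 1):
        if in_frame and (i - atg_pos) % 3 != 0:
            continue  # skip k-grams that are out of frame with the ATG
        counts[seq[i:i + k]] += 1
    return counts

# Hypothetical toy sequence with a candidate ATG at index 6.
toy = "GCTGCTATGGCTTTTGCT"
in_frame_down = kgram_counts(toy, 6, direction="downstream", in_frame=True)
any_frame_down = kgram_counts(toy, 6, direction="downstream", in_frame=False)
in_frame_up = kgram_counts(toy, 6, direction="upstream", in_frame=True)
```

Each (k-gram, frame, direction) combination becomes one numeric feature of the candidate ATG, which is why the number of features grows so quickly with k.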

Too Many Signals => Feature Selection
- For each value of k, there are 4^k * 3 * 2 k-grams
- If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
- This is too many for most machine learning algorithms
- Choose a signal with low intra-class distance
- Choose a signal with high inter-class distance
- E.g., [figure]
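The selection criterion (low intra-class distance, high inter-class distance) can be illustrated with a simple score. The sketch below uses a Welch-style t statistic, which is one common choice and an assumption here; the transcript does not preserve the example the slide actually showed. The function name `t_score` and the feature values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def t_score(pos, neg):
    """Welch-style t statistic for one feature.

    Large when the class means are far apart (inter-class distance)
    relative to the spread inside each class (intra-class distance).
    pos, neg: values of the feature in the two classes (at least 2 each).
    """
    num = abs(mean(pos) - mean(neg))
    den = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    if den == 0:
        return float("inf") if num > 0 else 0.0
    return num / den

# Hypothetical feature values: feature_a separates the classes, feature_b does not.
feature_a = ([5, 6, 5, 6], [1, 2, 1, 2])
feature_b = ([3, 7, 1, 9], [2, 8, 4, 6])
scores = {"a": t_score(*feature_a), "b": t_score(*feature_b)}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Ranking all candidate k-gram features by such a score and keeping the top few is one way to cut the feature set down to a manageable size.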

Sample k-grams Selected by CFS
- Position -3 (Kozak consensus)
- In-frame upstream ATG (leaky scanning)
- In-frame downstream
  - TAA, TAG, TGA (stop codon)
  - CTG, GAC, GAG, and GCC (codon bias?)

Results (3-fold x-validation)

Validation Results (on Chr X and Chr 21)
- Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s
[Plot; series: ATGpr, Our method]

Example #3: Protein Function Prediction from Protein Interactions
[Figure; labels: Level-1 neighbour, Level-2 neighbour]

An Illustrative Case of Indirect Functional Association?
[Figure: SH3 proteins and SH3-binding proteins]
- Is indirect functional association plausible?
- Is it found often in real interaction data?
- Can it be used to improve protein function prediction from protein interaction data?

Freq of Indirect Functional Association
[Figure: interaction network of yeast proteins (YBR055C, YDR158W, YJR091C, ...) labelled with function codes such as 1.1.9 and 42.1]
- 59.2% of proteins in the dataset share some function with level-1 neighbours
- 27.9% share some function with level-2 neighbours but share no function with level-1 neighbours

Over-Rep of Functions in L1 & L2 Neighbours
[Plot: sensitivity vs precision; axes: Precision, Sensitivity; series: L1 - L2, L2 - L1, L1 ∩ L2]

Use L1 & L2 Neighbours for Prediction
- Weighted average
  - Over-representation of functions in L1 and L2 neighbours
  - Each observation of an L1 or L2 neighbour is summed
- S(u, v) is an “index” for function transfer between u and v
- δ(k, x) = 1 if k has function x, 0 otherwise
- N_k is the set of interacting partners of k
- π_x is the frequency of function x in the dataset

Reliability of Expt Sources
- Different experimental sources have different reliabilities
  - Assign reliability to an interaction based on its experimental sources (Nabieva et al, 2004)
- Reliability between u and v is computed by [formula], where
  - r_i is the reliability of experimental source i
  - E_u,v is the set of experimental sources in which the interaction between u and v is observed

Source                    Reliability
Affinity Chromatography
Affinity Precipitation
Biochemical Assay
Dosage Lethality          0.5
Purified Complex
Reconstituted Complex     0.5
Synthetic Lethality
Synthetic Rescue          1
Two Hybrid
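The formula that combines source reliabilities is not preserved in this transcript. As a hedged illustration only, the sketch below uses a noisy-OR combination, r(u,v) = 1 - product over i in E_u,v of (1 - r_i), which is one common way to merge independent evidence; whether it matches the slide exactly is an assumption. The function name is hypothetical, and only the three reliabilities that survive in the table above are used.

```python
from math import prod

# Reliabilities that survive in the transcript's table; the other sources'
# values are missing there and are deliberately not filled in.
SOURCE_RELIABILITY = {
    "Dosage Lethality": 0.5,
    "Reconstituted Complex": 0.5,
    "Synthetic Rescue": 1.0,
}

def interaction_reliability(sources, source_reliability=SOURCE_RELIABILITY):
    """Noisy-OR combination (an assumed form, see the note above).

    sources: the experimental sources in which the interaction between
             u and v was observed (the set E_u,v).
    Returns 1 - prod(1 - r_i); an interaction with no sources gets 0.
    """
    return 1.0 - prod(1.0 - source_reliability[s] for s in sources)

two_weak = interaction_reliability({"Dosage Lethality", "Reconstituted Complex"})
one_strong = interaction_reliability({"Synthetic Rescue"})
```

Under this form, an interaction seen in several moderately reliable experiments ends up more reliable than one seen in any single one of them.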

An “Index” for Function Transfer Based on Reliability of Interactions
- Take reliability into consideration when computing the Equiv Measure: [formula]
  - N_k is the set of interacting partners of k
  - r_u,w is the reliability weight of the interaction between u and w

Performance Evaluation
- Prediction performance improves after incorporation of L1, L2, & interaction reliability info
[Plot: Informative FCs; axes: Precision, Sensitivity; series: NC, Chi², PRODISTIN, Weighted Avg, Weighted Avg R]

What Have We Learned?

Some of those “techniques” frequently needed in the analysis of biomedical data are insufficiently studied by current data mining researchers:
- Recognizing which samples are relevant and which are not
- Recognizing which features are relevant and which are not, and handling missing or incorrect values
- Recognizing trends, changes, and their causes

Action #1: Going Beyond Frequent Patterns to Recognize What Features Are Relevant and What Are Not

Going Beyond Frequent Patterns
- Statisticians use a battery of “interestingness” measures to decide if a feature/factor is relevant
- Examples:
  - Odds ratio
  - Relative risk
  - Gini index
  - Yule’s Q & Y
  - etc.
[Figure: odds ratio]
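For reference, the first two measures can be computed from a 2x2 contingency table using their standard textbook definitions. The sketch below is a minimal illustration; the function names, the cell naming (a, b, c, d), and the example counts are assumptions, not taken from the slides.

```python
from math import inf, isnan, nan

def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table.

    a: pattern present, positive class    b: pattern present, negative class
    c: pattern absent,  positive class    d: pattern absent,  negative class
    """
    if b * c == 0:
        return inf if a * d > 0 else nan  # undefined when both products are 0
    return (a * d) / (b * c)

def relative_risk(a, b, c, d):
    """Risk of the positive class with the pattern, relative to without it.

    Assumes both groups are non-empty (a + b > 0 and c + d > 0).
    """
    risk_with = a / (a + b)
    risk_without = c / (c + d)
    if risk_without == 0:
        return inf if risk_with > 0 else nan
    return risk_with / risk_without

# Hypothetical counts: 20 of 100 records with the pattern are positive,
# 10 of 100 records without it are positive.
example_or = odds_ratio(20, 80, 10, 90)
example_rr = relative_risk(20, 80, 10, 90)
```

An odds ratio of infinity arises when the pattern never occurs in the negative class.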

Challenge: Frequent Pattern Mining Relies on Convexity for Efficiency, But ...
- Proposition: Let S_k^OR(ms, D) = { P ∈ F(ms, D) | OR(P, D) ≥ k }. Then S_k^OR(ms, D) is not convex, i.e., the space of odds ratio patterns is not convex. Ditto for many other types of patterns.
[Figure: OR search space; {A}: ∞, {A,B}: 1, {A,B,C}: 3]
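The counterexample on this slide can be checked mechanically. A set of patterns is convex if, whenever X and Z are in it and X ⊆ Y ⊆ Z, Y is in it too. The sketch below applies that definition to the three odds ratio values on the slide; the threshold k = 2 and the function name `is_convex` are assumptions chosen for illustration.

```python
from itertools import combinations
from math import inf

def is_convex(patterns):
    """True iff every Y with X ⊂ Y ⊂ Z (X, Z in patterns) is also in patterns."""
    pats = set(patterns)
    for x in pats:
        for z in pats:
            if x < z:  # x is a proper subset of z
                gap = sorted(z - x)
                # enumerate every pattern strictly between x and z
                for r in range(1, len(gap)):
                    for extra in combinations(gap, r):
                        if x | frozenset(extra) not in pats:
                            return False
    return True

# Odds ratios from the slide. Patterns the slide does not list (such as
# {A, C}) are treated as absent in this toy check.
slide_or = {frozenset("A"): inf, frozenset("AB"): 1.0, frozenset("ABC"): 3.0}
k = 2.0  # assumed threshold
s_k = {p for p, value in slide_or.items() if value >= k}
```

Here `s_k` contains {A} and {A,B,C} but not {A,B}, so `is_convex(s_k)` is false: the odds ratio can drop below the threshold and rise above it again along a single chain of supersets.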

Solution: Luckily They Become Convex When Decomposed Into Plateaus
- Theorem: Let S_n,k^OR(ms, D) = { P ∈ F(ms, D) | P_D,ed = n, OR(P, D) ≥ k }. Then S_n,k^OR(ms, D) is convex
  => The space of odds ratio patterns becomes convex when stratified into plateaus based on support levels on the positive (or negative) dataset
- Proposition: Let Q ∈ [P]_D, then OR(Q, D) = OR(P, D)
  => The plateau space can be further divided into convex equivalence classes on the whole dataset
  => The space of equivalence classes can be concisely represented by generators and closed patterns
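To make equivalence classes, generators, and closed patterns concrete, here is a brute-force sketch on hypothetical toy transactions. Two patterns are equivalent when they occur in exactly the same transactions; the closed pattern is the largest member of a class and the generators are its minimal members. This enumeration is exponential and is only an illustration of the definitions, not the mining algorithm used in the talk; the function name is hypothetical and the empty pattern is left out.

```python
from collections import defaultdict
from itertools import combinations

def equivalence_classes(transactions, min_support=1):
    """Map each supporting transaction-id set to (closed pattern, generators)."""
    items = sorted(set().union(*transactions))
    classes = defaultdict(list)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            pattern = frozenset(combo)
            tids = frozenset(i for i, t in enumerate(transactions) if pattern <= t)
            if len(tids) >= min_support:
                classes[tids].append(pattern)
    result = {}
    for tids, members in classes.items():
        closed = max(members, key=len)  # unique maximal member of the class
        generators = [p for p in members if not any(q < p for q in members)]
        result[tids] = (closed, generators)
    return result

# Hypothetical toy transactions.
toy = [frozenset("acd"), frozenset("bcd"), frozenset("abcd"), frozenset("ad")]
classes = equivalence_classes(toy)
closed_b, gens_b = classes[frozenset({1, 2})]  # patterns occurring in T2 and T3 only
```

Because every member of a class occurs in the same transactions, any measure computed from those occurrences, such as the odds ratio, needs to be evaluated only once per class, which is what the proposition above states.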

Efficient Mining of Odds Ratio Patterns
- How do you find these fast is key!
- GC-growth can find these fast :-)

From SE-Tree To Trie To FP-Tree
- Transactions T: T1 = {a,c,d}, T2 = {b,c,d}, T3 = {a,b,c,d}, T4 = {a,d}
[Figure: SE-tree of possible itemsets; nodes: {}, a, b, c, d, ab, ac, ad, bc, bd, cd, abc, abd, acd, bcd, abcd]
[Figure: Trie of transactions]
[Figure: FP-tree with head table]
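As a rough illustration of the last step, the sketch below builds an FP-tree with a head table from the four transactions on this slide, using the standard construction: items in each transaction are sorted by descending global frequency and inserted into a prefix tree with counts. The class and function names are hypothetical, and ties in frequency are broken alphabetically, which is an assumption; the slide's figure may order items differently.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, a count, and child nodes keyed by item."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support=1):
    freq = Counter(item for t in transactions for item in set(t))
    # Frequent items, most frequent first; ties broken alphabetically (assumption).
    order = sorted((i for i in freq if freq[i] >= min_support),
                   key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)
    head_table = {item: [] for item in order}  # item -> all nodes holding it
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                head_table[item].append(child)
            child.count += 1  # shared prefixes accumulate counts
            node = child
    return root, head_table

transactions = [{"a", "c", "d"}, {"b", "c", "d"}, {"a", "b", "c", "d"}, {"a", "d"}]
root, head_table = build_fp_tree(transactions)
```

With this ordering (d, a, c, b) all four transactions share the prefix d, so the tree has a single branch out of the root, and the head table gives direct access to every node of a given item.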

GC-growth, Fast Simultaneous Mining of Generators & Closed Patterns
- From FP Tree to Gr Tree
  - Frequent item in head_table => key item
  - Subset checking using hash table
- From Gr Tree to GC-Growth algorithm:
  - Output closed patterns once the corresponding generators are produced
  - As Gr Tree only saves prefixes, Gc Tree should save the full items in each branch, called tails

Performance
- Mining odds ratio and relative risk patterns depends on GC-growth
- GC-growth mines both generators and closed patterns
- It is comparable in speed to the fastest algorithms that mine only closed patterns

Action #2: Tipping Factors: The Small Changes With Large Impact

Tipping Events
- Given a data set, such as those related to human health, it is interesting to determine important cohorts and important factors causing transitions between cohorts
  => Tipping events
  => Tipping factors are “action items” for causing transitions
- A “tipping event” is two or more population cohorts that are significantly different from each other
- “Tipping factors” (TF) are small patterns whose presence or absence causes a significant difference in population cohorts
- A “tipping base” (TB) is the pattern shared by the cohorts in a tipping event
- A “tipping point” (TP) is the combination of a TB and a TF

Impact-To-Cost-Ratio of Tipping Points

Some Simple Results Useful For Constructing TPs

Action #3: Evolution of Pattern Spaces: How Do They Change When the Sample Space Changes?

Impact of Adding New Transactions on Key and Closed Patterns

Impact of Removing Items From All Transactions

Acknowledgements
- DLBC Lymphoma: Jinyan Li, Huiqing Liu
- Translation Initiation: Fanfan Zeng, Roland Yap, Huiqing Liu
- Protein Function Prediction: Kenny Chua, Ken Sung
- Odds Ratio & Relative Risk: Mengling Feng, Yap-Peng Tan, Haiquan Li, Jinyan Li
- Tipping Points: Guimei Liu, Jinyan Li, Guozhu Dong
- Pattern Space Evolution: Mengling Feng, Yap-Peng Tan, Guozhu Dong, Jinyan Li