Published by Annabella Ethel Bradford. Modified over 9 years ago.
1
Some Data Mining Challenges Learned From Bioinformatics & Actions Taken
Limsoon Wong, National University of Singapore
Bertinoro, Nov 2005
2
Bertinoro, Nov 2005. Copyright 2005 © Limsoon Wong
Plan
Bioinformatics examples
–Treatment prognosis of DLBC lymphoma
–Prediction of translation initiation site
–Prediction of protein function from PPI data
What have we learned from these projects?
What have I been looking at recently?
–Statistical measures beyond frequent items
–Small changes that have large impact
–Evolution of pattern spaces
3
Example #1: Treatment Prognosis for DLBC Lymphoma
Image credit: Rosenwald et al, 2002
Ref: H. Liu et al, “Selection of patient samples and genes for outcome prediction”, Proc. CSB2004, pages 382-392
4
Diffuse Large B-Cell Lymphoma
DLBC lymphoma is the most common type of lymphoma in adults
Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients
DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy
International Prognostic Index (IPI)
–Age, Eastern Cooperative Oncology Group (ECOG) performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...
Not very good for stratifying DLBC lymphoma patients for therapeutic trials
Use gene-expression profiles to predict outcome of chemotherapy?
5
Knowledge Discovery from Gene Expression of “Extreme” Samples
Diagram labels: 240 samples; 7399 genes; “extreme” sample selection: 8 yrs; 26 long-term survivors; 47 short-term survivors; knowledge discovery from gene expression; 84 genes; 80 samples
T is long-term if S(T) < 0.3
T is short-term if S(T) > 0.7
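The two thresholds on the risk score S(T) can be read as a three-way decision rule. The sketch below is only an illustration: the function name `classify_risk` and the label "undetermined" for scores between the thresholds are assumptions, not part of the slides.

```python
def classify_risk(score, low=0.3, high=0.7):
    """Map a risk score S(T) to a survival class using the slide's thresholds.

    The "undetermined" label for scores in [low, high] is a hypothetical
    choice made for this sketch; the slide only defines the two extremes.
    """
    if score < low:
        return "long-term"
    if score > high:
        return "short-term"
    return "undetermined"


# Example: three hypothetical risk scores.
labels = [classify_risk(s) for s in (0.1, 0.5, 0.9)]
```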
6
Kaplan-Meier Plot for 80 Test Cases
p-value of log-rank test: < 0.0001
Risk score thresholds: 0.7, 0.3
Plot labels: Low risk, High risk
Without training-sample selection, there is no clear difference in overall survival among the 80 samples in the validation group of the DLBCL study
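For readers unfamiliar with how a Kaplan-Meier curve is built, here is a minimal sketch of the standard product-limit estimator. The function name and the toy follow-up data are assumptions for illustration; this is not the code or data behind the plot.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival curve.

    times:  follow-up duration of each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each time with a death.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= leaving
    return curve


# Toy data (hypothetical): four patients, the second one censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Plotting one such curve per risk group and comparing the groups with a log-rank test gives the kind of figure described on this slide.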
7
Improvement Over IPI
(A) IPI low, p-value = 0.0063
(B) IPI intermediate, p-value = 0.0003
8
Example #2: Protein Translation Initiation Site Recognition
Ref: L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13:192-200, 2002
9
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
What makes the second ATG the TIS?
Approach
–Training data gathering
–Signal generation: k-grams, distance, domain know-how, ...
–Signal selection: entropy, χ², CFS, t-test, domain know-how, ...
–Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...
10
Signal Generation
k-grams (i.e., k consecutive letters)
–k = 1, 2, 3, 4, 5, …
–Window size vs. fixed position
–Upstream, downstream vs. anywhere in window
–In-frame vs. any frame
Examples:
–Window = 100 bases
–In-frame, downstream: GCT = 1, TTT = 1, ATG = 1, …
–Any-frame, downstream: GCT = 3, TTT = 2, ATG = 2, …
–In-frame, upstream: GCT = 2, TTT = 0, ATG = 0, ...
Sample cDNA: 299 HSU27655.1 CAT U27655 Homo sapiens (sequence as on the previous slide)
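The k-gram signals listed above can be sketched as a simple counting routine. Everything below is an assumption made for illustration: the function name `kgram_counts`, the convention that the window starts right after (or ends right before) the candidate ATG, and the toy sequence. It does not try to reproduce the counts quoted on the slide.

```python
from collections import Counter


def kgram_counts(seq, atg_pos, k=3, window=100, side="downstream", in_frame=True):
    """Count k-grams around a candidate ATG located at index atg_pos.

    side="downstream" scans the window after the ATG, side="upstream" the
    window before it. With in_frame=True only k-grams whose start is a
    multiple of 3 away from the ATG are counted.
    """
    if side == "downstream":
        start, stop = atg_pos + 3, min(len(seq), atg_pos + 3 + window)
    else:
        start, stop = max(0, atg_pos - window), atg_pos
    counts = Counter()
    for i in range(start, stop - k + 1):
        if in_frame and (i - atg_pos) % 3 != 0:
            continue
        counts[seq[i:i + k]] += 1
    return counts


# Toy sequence (hypothetical) with a candidate ATG at index 3.
toy = "GCTATGGCTTTTGCT"
in_frame_down = kgram_counts(toy, 3, side="downstream", in_frame=True)
any_frame_down = kgram_counts(toy, 3, side="downstream", in_frame=False)
in_frame_up = kgram_counts(toy, 3, side="upstream", in_frame=True)
```

Each (k-gram, region, frame) combination then becomes one numeric feature for the candidate ATG.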
11
Too Many Signals
Feature Selection
For each value of k, there are 4^k * 3 * 2 k-grams
If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
This is too many for most machine learning algorithms
Choose a signal with low intra-class distance
Choose a signal with high inter-class distance
E.g., [figure]
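The feature count can be checked with a few lines of arithmetic. Reading the factors 3 and 2 as "upstream / downstream / anywhere in window" and "in-frame / any frame" is an interpretation based on the Signal Generation slide; the function name is hypothetical.

```python
def n_kgram_features(k, regions=3, frames=2):
    """Number of k-gram features over the 4-letter DNA alphabet.

    regions=3 and frames=2 are read here as upstream/downstream/anywhere and
    in-frame/any-frame; that reading is an assumption, not stated on the slide.
    """
    return 4 ** k * regions * frames


per_k = [n_kgram_features(k) for k in range(1, 6)]
total = sum(per_k)
```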
12
Sample k-grams Selected by CFS
Position -3 (Kozak consensus)
In-frame upstream ATG (leaky scanning)
In-frame downstream
–TAA, TAG, TGA (stop codon)
–CTG, GAC, GAG, and GCC (codon bias?)
13
Results (3-fold cross-validation)
14
Validation Results (on Chr X and Chr 21)
Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s
Plot labels: ATGpr, Our method
15
Example #3: Protein Function Prediction from Protein Interactions
Figure labels: Level-1 neighbour, Level-2 neighbour
16
An Illustrative Case of Indirect Functional Association?
Is indirect functional association plausible?
Is it found often in real interaction data?
Can it be used to improve protein function prediction from protein interaction data?
Figure labels: SH3 Proteins, SH3-Binding Proteins
17
Frequency of Indirect Functional Association
59.2% of proteins in the dataset share some function with level-1 neighbours
27.9% share some function with level-2 neighbours but share no function with level-1 neighbours
Figure labels (protein, followed by its function codes):
YBR055C |11.4.3.1
YDR158W |1.1.6.5 |1.1.9
YJR091C |1.3.16.1 |16.3.3
YMR101C |42.1
YPL149W |14.4 |20.9.13 |42.25 |14.7.11
YPL088W |2.16 |1.1.9
YMR300C |1.3.1
YBL072C |12.1.1
YOR312C |12.1.1
YBL061C |1.5.4 |10.3.3 |18.2.1.1 |32.1.3 |42.1 |43.1.3.5 |1.5.1.3.2
YBR023C |10.3.3 |32.1.3 |34.11.3.7 |42.1 |43.1.3.5 |43.1.3.9 |1.5.1.3.2
YKL006W |12.1.1 |16.3.3
YPL193W |12.1.1
YAL012W |1.1.6.5 |1.1.9
YBR293W |16.19.3 |42.25 |1.1.3 |1.1.9
YLR330W |1.5.4 |34.11.3.7 |41.1.1 |43.1.3.5 |43.1.3.9
YLR140W
YDL081C |12.1.1
YDR091C |1.4.1 |12.1.1 |12.4.1 |16.19.3
YPL013C |12.1.1 |42.16
YMR047C |11.4.2 |14.4 |16.7 |20.1.10 |20.1.21 |20.9.1
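The two percentages can be computed from an interaction graph and a function annotation table along the following lines. This is a sketch on a made-up four-protein network: the function names, the toy data, and the choice to let level-2 neighbours overlap with level-1 neighbours are all assumptions, not the authors' code or dataset.

```python
def level1(graph, u):
    """Direct interaction partners of u."""
    return set(graph.get(u, ())) - {u}


def level2(graph, u):
    """Partners of partners of u (may overlap with level-1), excluding u itself."""
    result = set()
    for v in level1(graph, u):
        result |= level1(graph, v)
    return result - {u}


def association_stats(graph, functions):
    """Fraction of annotated proteins sharing a function with a level-1
    neighbour, and fraction sharing only with a level-2 neighbour."""
    annotated = [p for p in graph if functions.get(p)]
    share_l1 = share_l2_only = 0
    for p in annotated:
        fp = functions[p]
        if any(fp & functions.get(v, set()) for v in level1(graph, p)):
            share_l1 += 1
        elif any(fp & functions.get(w, set()) for w in level2(graph, p)):
            share_l2_only += 1
    n = len(annotated)
    return (share_l1 / n, share_l2_only / n) if n else (0.0, 0.0)


# Toy network (hypothetical): A and C both bind B; C also binds D.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
functions = {"A": {"f1"}, "B": {"f2"}, "C": {"f1"}, "D": {"f1"}}
frac_l1, frac_l2_only = association_stats(graph, functions)
```

In the toy network, A shares no function with its only partner B but does share one with C, its level-2 neighbour, which is the indirect association pattern the slide quantifies.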
18
Over-Representation of Functions in L1 & L2 Neighbours
[Plot: sensitivity vs precision; series: L1 - L2, L2 - L1, L1 ∩ L2]
19
Use L1 & L2 Neighbours for Prediction
Weighted Average
–Over-representation of functions in L1 and L2 neighbours
–Each observation of an L1 or L2 neighbour is summed
$S(u,v)$ is an “index” for function transfer between u and v
$\delta(k, x) = 1$ if k has function x, 0 otherwise
$N_k$ is the set of interacting partners of k
$\pi_x$ is the frequency of function x in the dataset
20
Reliability of Experimental Sources
Different experimental sources have different reliabilities
–Assign reliability to an interaction based on its experimental sources (Nabieva et al, 2004)
Reliability between u and v is computed from $r_i$, the reliability of experimental source i, over $E_{u,v}$, the set of experimental sources in which the interaction between u and v is observed

Source                    Reliability
Affinity Chromatography   0.823077
Affinity Precipitation    0.455904
Biochemical Assay         0.666667
Dosage Lethality          0.5
Purified Complex          0.891473
Reconstituted Complex     0.5
Synthetic Lethality       0.37386
Synthetic Rescue          1
Two Hybrid                0.265407
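The slide's own combining formula is not preserved in this transcript. One common way to merge per-source reliabilities, shown here purely as an assumption and not as the author's formula, is a noisy-OR: the interaction is considered unreliable only if every source that reports it is wrong, so r(u,v) = 1 - Π (1 - r_i) over the sources in E(u,v). The function name is hypothetical; the reliability values are the ones in the table above.

```python
# Per-source reliabilities copied from the table on this slide.
SOURCE_RELIABILITY = {
    "Affinity Chromatography": 0.823077,
    "Affinity Precipitation": 0.455904,
    "Biochemical Assay": 0.666667,
    "Dosage Lethality": 0.5,
    "Purified Complex": 0.891473,
    "Reconstituted Complex": 0.5,
    "Synthetic Lethality": 0.37386,
    "Synthetic Rescue": 1.0,
    "Two Hybrid": 0.265407,
}


def interaction_reliability(sources, table=SOURCE_RELIABILITY):
    """Noisy-OR combination (an assumed formula): 1 - prod(1 - r_i)."""
    p_all_wrong = 1.0
    for s in sources:
        p_all_wrong *= 1.0 - table[s]
    return 1.0 - p_all_wrong


# An interaction seen by both two-hybrid and affinity precipitation.
r_uv = interaction_reliability(["Two Hybrid", "Affinity Precipitation"])
```

Under this assumed rule, an interaction observed by several weak sources can end up more reliable than one observed by a single moderately strong source.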
21
An “Index” for Function Transfer Based on Reliability of Interactions
Take reliability into consideration when computing the equivalence measure
$N_k$ is the set of interacting partners of k
$r_{u,w}$ is the reliability weight of the interaction between u and w
22
Performance Evaluation
Prediction performance improves after incorporation of L1, L2, & interaction reliability info
[Plot: precision vs sensitivity on informative FCs; series: NC, Chi², PRODISTIN, Weighted Avg, Weighted Avg R]
23
What Have We Learned?
24
Some of those “techniques” frequently needed in analysis of biomedical data are insufficiently studied by current data mining researchers
Recognizing what samples are relevant and what are not
Recognizing what features are relevant and what are not & handling missing or incorrect values
Recognizing trends, changes, and their causes
25
Action #1: Going Beyond Frequent Patterns to Recognize What Features Are Relevant and What Are Not
26
Going Beyond Frequent Patterns
Statisticians use a battery of “interestingness” measures to decide if a feature/factor is relevant
Examples:
–Odds ratio
–Relative risk
–Gini index
–Yule’s Q & Y
–etc.
Odds ratio
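As a reminder of what two of these measures compute, here is a minimal sketch using the textbook definitions on a 2x2 contingency table. The function and argument names and the example counts are assumptions for illustration.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table.

    a: pattern present, positive class   b: pattern present, negative class
    c: pattern absent, positive class    d: pattern absent, negative class
    OR = (a / b) / (c / d) = (a * d) / (b * c)
    """
    if b * c == 0:
        return math.inf if a * d > 0 else math.nan
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Relative risk: P(positive | present) / P(positive | absent).

    Assumes both rows of the table are non-empty.
    """
    risk_present = a / (a + b)
    risk_absent = c / (c + d)
    if risk_absent == 0:
        return math.inf if risk_present > 0 else math.nan
    return risk_present / risk_absent


# Hypothetical counts: 20/10 with the pattern, 5/25 without it.
example_or = odds_ratio(20, 10, 5, 25)
example_rr = relative_risk(20, 10, 5, 25)
```

Unlike support, neither measure is monotone in the pattern: adding an item can raise or lower the value, which is what makes mining by these measures harder than frequent pattern mining.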
27
Challenge: Frequent Pattern Mining Relies on Convexity for Efficiency, But …
Proposition: Let $S_k^{OR}(ms, D) = \{ P \in F(ms, D) \mid OR(P, D) \ge k \}$. Then $S_k^{OR}(ms, D)$ is not convex, i.e., the space of odds ratio patterns is not convex. Ditto for many other types of patterns
OR search space (figure labels): {A}: ∞, {A,B}: 1, {A,B,C}: 3
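The three odds ratio values on this slide already give a counterexample to convexity. The sketch below checks it mechanically, taking "convex" to mean that whenever X ⊆ Y ⊆ Z and both X and Z are in the family, Y is too. The threshold k = 2 and the function name are assumptions chosen for illustration.

```python
from itertools import combinations


def is_convex(patterns):
    """True if every itemset lying between two members is also a member."""
    family = {frozenset(p) for p in patterns}
    for x in family:
        for z in family:
            if x < z:  # x is a proper subset of z
                extra = sorted(z - x)
                # Enumerate every y with x < y < z.
                for r in range(1, len(extra)):
                    for middle in combinations(extra, r):
                        if x | frozenset(middle) not in family:
                            return False
    return True


# Odds ratios from the slide's search-space figure.
odds = {
    frozenset("A"): float("inf"),
    frozenset("AB"): 1.0,
    frozenset("ABC"): 3.0,
}
k = 2  # hypothetical threshold
selected = [p for p, value in odds.items() if value >= k]
convex = is_convex(selected)
```

With k = 2, {A} and {A,B,C} pass the threshold while {A,B}, which sits between them, does not, so the selected family has a hole and cannot be described by a pair of borders alone.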
28
Solution: Luckily They Become Convex When Decomposed Into Plateaus
Theorem: Let $S_{n,k}^{OR}(ms, D) = \{ P \in F(ms, D) \mid P_{D,ed} = n,\ OR(P, D) \ge k \}$. Then $S_{n,k}^{OR}(ms, D)$ is convex
The space of odds ratio patterns becomes convex when stratified into plateaus based on support levels on the positive (or negative) dataset
Proposition: Let $Q \in [P]_D$; then $OR(Q, D) = OR(P, D)$
The plateau space can be further divided into convex equivalence classes on the whole dataset
The space of equivalence classes can be concisely represented by generators and closed patterns
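To make "equivalence classes represented by generators and closed patterns" concrete, here is a brute-force sketch. Patterns are grouped by the exact set of transactions that contain them; within a group, the closed pattern is the largest member and the generators are the minimal members. The function name and the four toy transactions are assumptions, and the exhaustive enumeration is only meant for tiny examples, not as a mining algorithm.

```python
from itertools import combinations


def equivalence_classes(transactions, min_sup=1):
    """Map each supporting transaction-id set to (closed pattern, generators)."""
    items = sorted(set().union(*transactions))
    members_by_tids = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            pattern = frozenset(combo)
            tids = frozenset(
                i for i, t in enumerate(transactions) if pattern <= t
            )
            if len(tids) >= min_sup:
                members_by_tids.setdefault(tids, []).append(pattern)
    classes = {}
    for tids, members in members_by_tids.items():
        # The union of all members occurs in the same transactions,
        # so it is the unique maximal (closed) pattern of the class.
        closed = frozenset().union(*members)
        # Generators: members with no proper subset in the same class.
        generators = [p for p in members if not any(q < p for q in members)]
        classes[tids] = (closed, generators)
    return classes


# Toy transactions (hypothetical).
T = [set("acd"), set("bcd"), set("abcd"), set("ad")]
classes = equivalence_classes(T)
```

Because every pattern in a class occurs in exactly the same transactions, any measure computed from those occurrences, such as the odds ratio, is constant across the class, so reporting one closed pattern and its generators per class loses nothing.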
29
Efficient Mining of Odds Ratio Patterns
Finding these fast is key!
GC-growth can find these fast :-)
30
From SE-Tree To Trie To FP-Tree
[Figure: SE-tree of possible itemsets over items a, b, c, d; trie of transactions; FP-tree with head table]
Transactions T:
T1 = {a,c,d}
T2 = {b,c,d}
T3 = {a,b,c,d}
T4 = {a,d}
...
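The step from a trie of transactions to an FP-tree can be illustrated with the four transactions on this slide. The sketch below builds a prefix tree with counts under a given item order and compares alphabetical order with descending-frequency order, the order usually used for FP-trees. The class and function names are assumptions, and the head table that links nodes of the same item is left out.

```python
from collections import Counter


class TrieNode:
    def __init__(self):
        self.count = 0
        self.children = {}


def build_trie(transactions, order):
    """Insert each transaction as a path, with items sorted by `order`."""
    rank = {item: i for i, item in enumerate(order)}
    root = TrieNode()
    for t in transactions:
        node = root
        for item in sorted(t, key=rank.__getitem__):
            node = node.children.setdefault(item, TrieNode())
            node.count += 1
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


def frequency_order(transactions):
    """Items by descending frequency, ties broken alphabetically."""
    freq = Counter(item for t in transactions for item in t)
    return sorted(freq, key=lambda item: (-freq[item], item))


T = [set("acd"), set("bcd"), set("abcd"), set("ad")]
alphabetical = build_trie(T, "abcd")
by_frequency = build_trie(T, frequency_order(T))
sizes = (count_nodes(alphabetical), count_nodes(by_frequency))
```

Putting frequent items near the root lets more transactions share prefixes, so the frequency-ordered tree is smaller than the alphabetical one for the same data.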
31
GC-growth, Fast Simultaneous Mining of Generators & Closed Patterns
From FP Tree to Gr Tree
–Frequent item in head_table → key item
–Subset checking using hash table
From Gr Tree to GC-growth algorithm:
–Output closed patterns once the corresponding generators are produced
–As Gr Tree only saves prefix, Gc Tree should save full items in each branch, called tails
32
Performance
Mining odds ratio and relative risk patterns depends on GC-growth
GC-growth mines both generators and closed patterns
It is comparable in speed to the fastest algorithms that mine only closed patterns
33
Action #2: Tipping Factors, the Small Changes With Large Impact
34
Tipping Events
Given a data set, such as those related to human health, it is interesting to determine important cohorts and important factors causing transitions between cohorts
Tipping factors are “action items” for causing transitions
“Tipping event” is two or more population cohorts that are significantly different from each other
“Tipping factors” (TF) are small patterns whose presence or absence causes significant difference in population cohorts
“Tipping base” (TB) is the pattern shared by the cohorts in a tipping event
“Tipping point” (TP) is the combination of TB and a TF
35
Impact-To-Cost-Ratio of Tipping Points
36
Some Simple Results Useful For Constructing TPs
37
Action #3: Evolution of Pattern Spaces. How Do They Change When the Sample Space Changes?
38
Impact of Adding New Transactions on Key and Closed Patterns
39
Impact of Removing Items From All Transactions
40
Acknowledgements
DLBC Lymphoma:
–Jinyan Li, Huiqing Liu
Translation Initiation:
–Fanfan Zeng, Roland Yap
–Huiqing Liu
Protein Function Prediction:
–Kenny Chua, Ken Sung
Odds Ratio & Relative Risk:
–Mengling Feng, Yap-Peng Tan
–Haiquan Li, Jinyan Li
Tipping Points:
–Guimei Liu, Jinyan Li
–Guozhu Dong
Pattern Space Evolution:
–Mengling Feng, Yap-Peng Tan
–Guozhu Dong
–Jinyan Li