Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics.

Presentation transcript:

Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics

What is Bioinformatics?

Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases

Benefits of Bioinformatics
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more $
To the scientist: Better science

From Informatics to Bioinformatics
• Integration Technology (Kleisli)
• Cleansing & Warehousing (FIMM)
• MHC-Peptide Binding (PREDICT)
• Protein Interactions Extraction (PIES)
• Gene Expression & Medical Record Datamining (PCL)
• Gene Feature Recognition (Dragon)
• Venom Informatics
Years of bioinformatics R&D in Singapore: ISS, KRDL, LIT

Quick Samplings

Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results
• Prediction by our ANN model for HLA-A11
  - 29 predictions
  - 22 epitopes
  - 76% specificity
• Prediction by BIMAS matrix for HLA-A*1101
  - Number of experimental binders, by BIMAS rank: 19 (52.8%), 5 (13.9%), 12 (33.3%)

Transcription Start Prediction

Transcription Start Prediction Results

Medical Record Analysis
• Looking for patterns that are
  - valid
  - novel
  - useful
  - understandable

Gene Expression Analysis
• Classifying gene expression profiles
  - find stable differentially expressed genes
  - find significant gene groups
  - derive coordinated gene expression

Medical Record & Gene Expression Analysis Results
• PCL, a novel “emerging pattern” method
• Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
• Works well for gene expressions
Cancer Cell, March 2002, 1(2)

Behind the Scene
• Vladimir Bajic
• Vladimir Brusic
• Jinyan Li
• See-Kiong Ng
• Limsoon Wong
• Louxin Zhang
• Allen Chong
• Judice Koh
• SPT Krishnan
• Huiqing Liu
• Seng Hong Seah
• Soon Heng Tan
• Guanglan Zhang
• Zhuo Zhang
and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators...

Questions?

A More Detailed Account

What is Datamining?
Whose block is this?
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
[Figure: Jonathan’s blocks and Jessica’s blocks]

What is Datamining? Question: Can you explain how?

The Steps of Data Mining
• Training data gathering
• Signal generation
  - k-grams, colour, texture, domain know-how, ...
• Signal selection
  - Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
  - SVM, ANN, PCL, CART, C4.5, kNN, ...

Translation Initiation Recognition

A Sample cDNA
299 HSU CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
[Annotation track: "i" marks the second ATG, and every position after it through the end of the sequence is marked "E".]
What makes the second ATG the translation initiation site?

Signal Generation
• K-grams (i.e., k consecutive letters)
  - K = 1, 2, 3, 4, 5, …
  - Window size vs. fixed position
  - Upstream, downstream vs. anywhere in window
  - In-frame vs. any frame
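The k-gram options above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `kgram_signals`, the `window` default, and the example sequence are all hypothetical.

```python
from collections import Counter

def kgram_signals(seq, atg_pos, k, window=30):
    """Count k-grams around a candidate ATG (illustrative sketch only).

    seq     -- cDNA string over A, C, G, T
    atg_pos -- 0-based index of the 'A' of the candidate ATG
    k       -- k-gram length
    window  -- bases examined on each side (an assumed default)
    Returns a Counter keyed by (region, frame, kgram).
    """
    signals = Counter()
    start = max(0, atg_pos - window)
    end = min(len(seq), atg_pos + 3 + window)
    for i in range(start, end - k + 1):
        # Skip k-grams that overlap the candidate ATG itself.
        if i < atg_pos + 3 and i + k > atg_pos:
            continue
        region = "up" if i < atg_pos else "down"
        kgram = seq[i:i + k]
        signals[(region, "any-frame", kgram)] += 1
        # In-frame: the k-gram starts a multiple of 3 away from the ATG.
        if (i - atg_pos) % 3 == 0:
            signals[(region, "in-frame", kgram)] += 1
    return signals

example = "CCGCCATGGCTTAAGG"          # made-up sequence, not from the dataset
signals = kgram_signals(example, example.index("ATG"), k=3)
```

Here `signals[("down", "in-frame", "TAA")]` counts in-frame stop codons downstream of the candidate ATG, which is the kind of signal a later selection step can score.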

Too Many Signals
• For each value of k, there are 4^k × 3 × 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 8188 features!
• This is too many for most machine learning algorithms
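A quick check of the arithmetic, as a sketch. It assumes the factor 3 stands for upstream / downstream / anywhere in the window and the factor 2 for in-frame / any frame; the slide does not spell this out, and the function name is hypothetical.

```python
def count_kgram_features(ks, variants=3 * 2, alphabet=4):
    """Number of k-gram features per k: alphabet^k k-grams times the variants."""
    return {k: alphabet ** k * variants for k in ks}

counts = count_kgram_features(range(1, 6))
total = sum(counts.values())
print(counts, total)
```

Under these assumptions the formula gives 24 + 96 + 384 + 1536 + 6144 = 8184 for k = 1 to 5. The slide's stated total of 8188 is slightly larger, so it may be counting a few additional signals not covered by the formula.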

Signal Selection (Basic Idea)
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
• Which of the following 3 signals is good?

Signal Selection (e.g., t-statistics)
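The slide's formula did not survive in this transcript. As a hedged illustration, the sketch below uses the standard unequal-variance (Welch) form of the t-statistic, which may differ in detail from the one on the slide. It scores a signal by the gap between class means relative to the within-class spread. The function name and sample values are made up.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(class1, class2):
    """Welch-style t-statistic for one signal measured in two classes.

    A large value means the class means are far apart (high inter-class
    distance) relative to the spread within each class (low intra-class
    distance). Assumes each class has at least two values and that the
    variances are not both zero.
    """
    m1, m2 = mean(class1), mean(class2)
    v1, v2 = variance(class1), variance(class2)
    return abs(m1 - m2) / sqrt(v1 / len(class1) + v2 / len(class2))

good = t_statistic([1.0, 1.1, 0.9, 1.0], [3.0, 3.1, 2.9, 3.0])
poor = t_statistic([1.0, 3.0, 2.0, 0.5], [1.5, 2.5, 0.8, 2.9])
```

Ranking signals by this score and keeping the top few is one simple way to carry out the selection step.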

Signal Selection (e.g., MIT-correlation)

Signal Selection (e.g., χ²)
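The slide body is missing here as well. The sketch below shows the textbook χ² statistic on a contingency table of signal value against class, which is presumably the idea intended. The table layout and function name are assumptions.

```python
def chi_square(table):
    """Chi-square statistic for a contingency table.

    table[i][j] = number of samples with signal value i that belong to class j.
    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if signal value and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

informative = chi_square([[18, 2], [2, 18]])      # signal tracks the class
uninformative = chi_square([[10, 10], [10, 10]])  # signal ignores the class
```

A signal whose presence is independent of the class scores 0; the more its distribution differs between classes, the higher the score.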

Signal Selection (e.g., CFS)
• Instead of scoring individual signals, how about scoring a group of signals as a whole?
• CFS
  - A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
• Homework: find a formula that captures the key idea of CFS above
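One formula commonly given for this idea is the merit heuristic from Hall's work on correlation-based feature selection: merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the group size, r_cf the mean signal-class correlation, and r_ff the mean signal-signal correlation. The sketch below is an assumption-laden illustration, not necessarily the answer the slide had in mind; it uses absolute Pearson correlation for both terms, and all names and data are made up.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cfs_merit(signals, labels):
    """Merit of a group: high class correlation, low mutual correlation."""
    k = len(signals)
    r_cf = sum(abs(pearson(s, labels)) for s in signals) / k
    pairs = list(combinations(signals, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

labels = [0, 0, 0, 1, 1, 1]
s1 = [0.1, 0.2, 0.1, 0.9, 0.8, 0.9]
s2 = [2 * v for v in s1]               # fully redundant with s1
s3 = [0.2, 0.1, 0.3, 0.8, 0.9, 0.7]    # informative, not a copy of s1

redundant_group = cfs_merit([s1, s2], labels)
complementary_group = cfs_merit([s1, s3], labels)
```

Adding a perfectly redundant signal leaves the merit where it was for `s1` alone, while adding a signal that is informative but less correlated with `s1` raises it.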

Sample k-grams Selected
• Position -3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream
  - TAA, TAG, TGA (stop codon)
  - CTG, GAC, GAG, and GCC (codon bias)

Signal Integration
• kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
• SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
• Naïve Bayes, ANN, C4.5, ...
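The kNN rule above is short enough to sketch directly. This is a minimal illustration assuming Euclidean distance as the similarity measure; the feature vectors and class names are made up.

```python
from collections import Counter
from math import dist

def knn_predict(train, test_point, k=3):
    """Classify test_point by majority vote among its k nearest neighbours.

    train -- list of (feature_vector, label) pairs
    Uses Euclidean distance, which is an assumption; any similarity
    measure could be substituted.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [
    ((0.0, 0.0), "non-TIS"), ((0.1, 0.2), "non-TIS"), ((0.2, 0.1), "non-TIS"),
    ((1.0, 1.0), "TIS"), ((0.9, 1.1), "TIS"), ((1.1, 0.9), "TIS"),
]
prediction = knn_predict(train, (0.95, 1.0))
```

With two classes, an odd k avoids tied votes.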

Results (on Pedersen & Nielsen’s mRNA)

Acknowledgements
• Roland Yap
• Zeng Fanfan
• A.G. Pedersen
• H. Nielsen

Questions?

Common Mistakes

Self-fulfilling Oracle
• Consider this scenario
  - Given classes C1 and C2 w/ explicit signals
  - Apply χ² to C1 and C2 to select signals s1, s2, s3
  - Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90%
• Is the accuracy really 90%?
• What can be wrong with this?

Phil Long’s Experiment
• Let there be classes C1 and C2 w/ features having randomly generated values
• Use χ² to select 20 features
• Run k-fold x-validation on C1 and C2 w/ these 20 features
• Expect: 50% accuracy
• Get: 90% accuracy!
• Lesson: choose features at each fold
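The experiment can be reproduced in spirit with a small simulation. This sketch is not Phil Long's original setup: it swaps χ² for a simpler mean-difference score, uses a nearest-centroid classifier, and the sample and feature counts are arbitrary choices. All function names are hypothetical.

```python
import random

def select_features(X, y, rows, n_keep):
    """Rank features by |mean(class 1) - mean(class 0)| using only `rows`.

    A stand-in for the chi-square selection named on the slide.
    """
    ones = [i for i in rows if y[i] == 1]
    zeros = [i for i in rows if y[i] == 0]
    def score(j):
        m1 = sum(X[i][j] for i in ones) / len(ones)
        m0 = sum(X[i][j] for i in zeros) / len(zeros)
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=score, reverse=True)[:n_keep]

def centroid(X, rows, feats):
    return [sum(X[i][j] for i in rows) / len(rows) for j in feats]

def cross_validate(X, y, n_folds, n_keep, select_inside_fold):
    """k-fold accuracy of a nearest-centroid classifier on selected features."""
    all_rows = list(range(len(X)))
    # The mistake: features chosen once, with the test samples visible.
    global_feats = select_features(X, y, all_rows, n_keep)
    correct = 0
    for fold in range(n_folds):
        test = [i for i in all_rows if i % n_folds == fold]
        train = [i for i in all_rows if i % n_folds != fold]
        feats = (select_features(X, y, train, n_keep)
                 if select_inside_fold else global_feats)
        c0 = centroid(X, [i for i in train if y[i] == 0], feats)
        c1 = centroid(X, [i for i in train if y[i] == 1], feats)
        for i in test:
            x = [X[i][j] for j in feats]
            d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
            d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
            correct += int((1 if d1 < d0 else 0) == y[i])
    return correct / len(X)

rng = random.Random(0)
n_samples, n_features = 40, 1000
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]   # labels unrelated to the features

biased = cross_validate(X, y, 5, 20, select_inside_fold=False)
honest = cross_validate(X, y, 5, 20, select_inside_fold=True)
print(f"select once on all data: {biased:.0%}; select inside each fold: {honest:.0%}")
```

Because the features are pure noise, only the run that selects features inside each fold should land near chance. Selecting once on all the data lets the held-out samples influence which features are kept, which inflates the estimate.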

Apples vs Oranges
• Consider this scenario:
  - Fanfan reported 89% accuracy on his TIS prediction method
  - Hatzigeorgiou reported 94% accuracy on her TIS prediction method
• So Hatzigeorgiou’s method is better
• What is wrong with this conclusion?

Apples vs Oranges
• Differences in datasets used:
  - Fanfan’s expt used Pedersen’s dataset
  - Hatzigeorgiou’s used her own dataset
• Differences in counting:
  - Fanfan’s expt was on a per ATG basis
  - Hatzigeorgiou’s expt used the scanning rule and thus was on a per cDNA basis
• When Fanfan ran on the same dataset and counted the same way as Hatzigeorgiou, he also got 94%!

Questions?