Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics
What is Bioinformatics?
Themes of Bioinformatics
- Bioinformatics = Data Mgmt + Knowledge Discovery
- Data Mgmt = Integration + Transformation + Cleansing
- Knowledge Discovery = Statistics + Algorithms + Databases
Benefits of Bioinformatics
- To the patient: better drug, better treatment
- To the pharma: save time, save cost, make more $
- To the scientist: better science
From Informatics to Bioinformatics
Years of bioinformatics R&D in Singapore (ISS, KRDL, LIT):
- Integration Technology (Kleisli)
- Cleansing & Warehousing (FIMM)
- MHC-Peptide Binding (PREDICT)
- Protein Interactions Extraction (PIES)
- Gene Expression & Medical Record Datamining (PCL)
- Gene Feature Recognition (Dragon)
- Venom Informatics
Quick Samplings
Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results
- Prediction by our ANN model for HLA-A11: 29 predictions, 22 epitopes, 76% specificity
- Prediction by BIMAS matrix for HLA-A*1101: number of experimental binders by BIMAS rank: 19 (52.8%), 5 (13.9%), 12 (33.3%)
Transcription Start Prediction
Transcription Start Prediction Results
Medical Record Analysis
Looking for patterns that are valid, novel, useful, and understandable
Gene Expression Analysis
- Classifying gene expression profiles
- Find stable differentially expressed genes
- Find significant gene groups
- Derive coordinated gene expression
Medical Record & Gene Expression Analysis Results
- PCL, a novel “emerging pattern” method
- Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
- Works well for gene expressions
- Cancer Cell, March 2002, 1(2)
Behind the Scene Vladimir Bajic Vladimir Brusic Jinyan Li See-Kiong Ng Limsoon Wong Louxin Zhang Allen Chong Judice Koh SPT Krishnan Huiqing Liu Seng Hong Seah Soon Heng Tan Guanglan Zhang Zhuo Zhang and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators….
Questions?
A More Detailed Account
What is Datamining?
- Jonathan’s rules: Blue or Circle
- Jessica’s rules: All the rest
Whose block is this?
(Figure: Jonathan’s blocks, Jessica’s blocks)
What is Datamining? Question: Can you explain how?
The Steps of Data Mining
- Training data gathering
- Signal generation: k-grams, colour, texture, domain know-how, ...
- Signal selection: entropy, χ², CFS, t-test, domain know-how, ...
- Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...
Translation Initiation Recognition
A Sample cDNA 299 HSU CAT U27655 Homo sapiens CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80 CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160 GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240 CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE What makes the second ATG the translation initiation site?
Signal Generation
K-grams (i.e., k consecutive letters)
- k = 1, 2, 3, 4, 5, …
- Window size vs. fixed position
- Upstream, downstream vs. anywhere in window
- In-frame vs. any frame
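The k-gram options above can be sketched in a few lines. This is only an illustration: the function name `kgram_counts`, the `window` parameter, and the choice to skip the ATG itself are assumptions, not details from the slides.

```python
from collections import Counter

def kgram_counts(seq, atg_pos, k=3, window=30, in_frame=True):
    """Count k-grams upstream and downstream of the ATG starting at atg_pos.

    window (an assumed parameter) is the number of bases examined on each side.
    With in_frame=True, only k-grams starting a multiple of 3 away from the
    ATG are counted; otherwise every start position in the window is used.
    """
    features = Counter()
    up_lo = max(0, atg_pos - window)
    down_hi = min(len(seq), atg_pos + 3 + window)
    regions = (("up", up_lo, atg_pos), ("down", atg_pos + 3, down_hi))
    for side, lo, hi in regions:
        for start in range(lo, hi - k + 1):
            if in_frame and (start - atg_pos) % 3 != 0:
                continue
            features[(side, seq[start:start + k])] += 1
    return features

# Toy sequence (hypothetical); the ATG starts at index 6.
toy = "GCCACCATGGCTGCCTAA"
in_frame_feats = kgram_counts(toy, 6, k=3, window=6, in_frame=True)
any_frame_feats = kgram_counts(toy, 6, k=3, window=6, in_frame=False)
```

Each key pairs a side ("up" or "down") with a k-gram, so the same 3-gram upstream and downstream counts as two different signals.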
Too Many Signals
- For each value of k, there are 4^k * 3 * 2 k-grams
- If we use k = 1, 2, 3, 4, 5, we have a total of 8188 features!
- This is too many for most machine learning algorithms
Signal Selection (Basic Idea) Choose a signal w/ low intra-class distance Choose a signal w/ high inter-class distance Which of the following 3 signals is good?
Signal Selection (e.g., t-statistics)
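A minimal sketch of a t-statistic score for one signal, assuming the unequal-variance (Welch) form; the slide's own formula is not shown in this text, so the exact variant is an assumption. A large score means the class means are far apart relative to the spread within each class.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(values_c1, values_c2):
    """Welch-style t-statistic of one signal over two classes (assumed variant).

    Each class needs at least two values, and the classes must not both have
    zero variance.
    """
    m1, m2 = mean(values_c1), mean(values_c2)
    v1, v2 = variance(values_c1), variance(values_c2)
    return abs(m1 - m2) / sqrt(v1 / len(values_c1) + v2 / len(values_c2))

well_separated = t_statistic([1, 2, 3], [7, 8, 9])
overlapping = t_statistic([1, 5, 9], [2, 6, 10])
```

Signals would then be ranked by this score and only the top few kept.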
Signal Selection (e.g., MIT-correlation)
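A sketch assuming "MIT-correlation" refers to the signal-to-noise score (μ1 − μ2) / (σ1 + σ2) used in the MIT leukemia gene expression work; the slide's formula is not shown in this text, so treat this definition as an assumption.

```python
from statistics import mean, stdev

def mit_correlation(values_c1, values_c2):
    """Signal-to-noise score (mu1 - mu2) / (sigma1 + sigma2), assumed definition.

    The sign shows which class has the higher mean; the magnitude shows how
    well the signal separates the classes.
    """
    return (mean(values_c1) - mean(values_c2)) / (stdev(values_c1) + stdev(values_c2))

snr = mit_correlation([1, 2, 3], [7, 8, 9])
```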
Signal Selection (e.g., χ²)
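A minimal sketch of the χ² score for a discretised signal, computed from a contingency table of signal intervals against classes. The table layout and function name are illustrative assumptions.

```python
def chi_square(table):
    """Pearson chi-square statistic of a contingency table.

    table[i][j] is the number of samples falling in signal interval i and
    class j. Assumes every row and column has at least one sample.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    score = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            score += (observed - expected) ** 2 / expected
    return score

perfect_signal = chi_square([[10, 0], [0, 10]])   # intervals line up with classes
useless_signal = chi_square([[5, 5], [5, 5]])     # intervals carry no class information
```

A higher score means the signal's distribution differs more between the classes.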
Signal Selection (e.g., CFS)
Instead of scoring individual signals, how about scoring a group of signals as a whole?
CFS
- A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
Homework: find a formula that captures the key idea of CFS above
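One published way to capture this idea is Hall's merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean signal-class correlation and r_ff the mean signal-signal correlation. The sketch below assumes the correlations are already computed; it is one possible answer, not necessarily the one intended on the slide.

```python
from math import sqrt

def cfs_merit(class_corrs, pair_corrs):
    """Hall-style merit of a group of k signals.

    class_corrs: correlation of each signal with the class (length k).
    pair_corrs: correlations between all pairs of signals in the group.
    """
    k = len(class_corrs)
    mean_cf = sum(class_corrs) / k
    mean_ff = sum(pair_corrs) / len(pair_corrs) if pair_corrs else 0.0
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)

independent_pair = cfs_merit([0.8, 0.8], [0.0])  # two uncorrelated signals
redundant_pair = cfs_merit([0.8, 0.8], [1.0])    # two copies of the same signal
```

The redundant pair scores no better than a single signal, while the uncorrelated pair scores higher.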
Sample k-grams Selected
- Position –3 (Kozak consensus)
- In-frame upstream ATG (leaky scanning)
- In-frame downstream TAA, TAG, TGA (stop codon)
- In-frame downstream CTG, GAC, GAG, and GCC (codon bias)
Signal Integration
- kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
- SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
- Naïve Bayes, ANN, C4.5, ...
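The kNN rule above is short enough to sketch directly. Euclidean distance is an assumed similarity measure, and the data below is a hypothetical toy set.

```python
from collections import Counter
from math import dist

def knn_predict(train, test_point, k=3):
    """Majority vote among the k training samples nearest to test_point.

    train is a list of (feature_vector, label) pairs; similarity is taken to
    be Euclidean distance (an assumption, other measures work too).
    """
    nearest = sorted(train, key=lambda item: dist(item[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-signal training set.
train = [((0, 0), "non-TIS"), ((0, 1), "non-TIS"), ((1, 0), "non-TIS"),
         ((5, 5), "TIS"), ((5, 6), "TIS"), ((6, 5), "TIS")]
label_a = knn_predict(train, (4.5, 5))
label_b = knn_predict(train, (0.2, 0.1))
```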
Results (on Pedersen & Nielsen’s mRNA)
Acknowledgements Roland Yap Zeng Fanfan A.G. Pedersen H. Nielsen
Questions?
Common Mistakes
Self-fulfilling Oracle
Consider this scenario:
- Given classes C1 and C2 w/ explicit signals
- Use χ² on C1 and C2 to select signals s1, s2, s3
- Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90%
Is the accuracy really 90%? What can be wrong with this?
Phil Long’s Experiment
- Let there be classes C1 and C2 w/ features having randomly generated values
- Use χ² to select 20 features
- Run k-fold x-validation on C1 and C2 w/ these 20 features
- Expect: 50% accuracy
- Get: 90% accuracy!
Lesson: choose features within each fold, using only that fold's training samples
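The experiment can be imitated with a small sketch. Everything here is an illustrative assumption: the sizes, the seed, a difference-of-class-means score standing in for χ², and a nearest-centroid classifier. It compares selecting features once on all samples with selecting them inside each fold.

```python
import random

def score(column, labels):
    """Absolute difference of class means (a simple stand-in for chi-square)."""
    c1 = [v for v, y in zip(column, labels) if y == 0]
    c2 = [v for v, y in zip(column, labels) if y == 1]
    return abs(sum(c1) / len(c1) - sum(c2) / len(c2))

def select(X, y, rows, n_keep):
    """Indices of the n_keep best-scoring features, judged on the given rows only."""
    sub_y = [y[r] for r in rows]
    scores = [(score([X[r][f] for r in rows], sub_y), f) for f in range(len(X[0]))]
    return [f for _, f in sorted(scores, reverse=True)[:n_keep]]

def centroid_predict(X, y, train, feats, x):
    """Nearest-centroid classifier restricted to the chosen features."""
    best_label, best_d = None, float("inf")
    for label in (0, 1):
        members = [r for r in train if y[r] == label]
        d = sum((x[f] - sum(X[r][f] for r in members) / len(members)) ** 2
                for f in feats)
        if d < best_d:
            best_label, best_d = label, d
    return best_label

def cross_validate(X, y, n_folds, n_keep, select_inside_fold):
    all_rows = list(range(len(X)))
    global_feats = select(X, y, all_rows, n_keep)  # sees the test samples too
    correct = 0
    for fold in range(n_folds):
        test = [r for r in all_rows if r % n_folds == fold]
        train = [r for r in all_rows if r % n_folds != fold]
        feats = select(X, y, train, n_keep) if select_inside_fold else global_feats
        correct += sum(centroid_predict(X, y, train, feats, X[r]) == y[r] for r in test)
    return correct / len(X)

rng = random.Random(0)
n_samples, n_features = 40, 1000
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [r % 2 for r in range(n_samples)]  # labels unrelated to the random features

biased = cross_validate(X, y, n_folds=5, n_keep=20, select_inside_fold=False)
honest = cross_validate(X, y, n_folds=5, n_keep=20, select_inside_fold=True)
print(f"selection on all data: {biased:.2f}; selection inside each fold: {honest:.2f}")
```

Because the features are pure noise, any accuracy well above chance in the first setting comes from the test samples having influenced the feature selection.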
Apples vs Oranges
Consider this scenario:
- Fanfan reported 89% accuracy on his TIS prediction method
- Hatzigeorgiou reported 94% accuracy on her TIS prediction method
So Hatzigeorgiou’s method is better. What is wrong with this conclusion?
Apples vs Oranges
Differences in datasets used:
- Fanfan’s expt used Pedersen’s dataset
- Hatzigeorgiou’s expt used her own dataset
Differences in counting:
- Fanfan’s expt was on a per-ATG basis
- Hatzigeorgiou’s expt used the scanning rule and thus was on a per-cDNA basis
When Fanfan used the same dataset and counted the same way as Hatzigeorgiou, he also got 94%!
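A toy sketch of how the counting basis alone changes the reported number. The data is hypothetical, and the scanning rule is assumed here to mean "take the first ATG from the 5' end that the classifier predicts as a TIS".

```python
def per_atg_accuracy(cdnas):
    """Fraction of individual ATGs whose prediction matches the truth."""
    pairs = [atg for cdna in cdnas for atg in cdna]
    return sum(truth == pred for truth, pred in pairs) / len(pairs)

def per_cdna_accuracy(cdnas):
    """Fraction of cDNAs where the first predicted TIS (scanning 5' to 3') is the true TIS."""
    correct = 0
    for cdna in cdnas:
        first_hit = next((i for i, (_, pred) in enumerate(cdna) if pred), None)
        if first_hit is not None and cdna[first_hit][0]:
            correct += 1
    return correct / len(cdnas)

# Each cDNA lists its ATGs in 5'-to-3' order as (is_true_TIS, predicted_TIS).
toy_cdnas = [
    [(False, False), (True, True), (False, True), (False, True)],
    [(True, True), (False, False), (False, True)],
]
atg_acc = per_atg_accuracy(toy_cdnas)
cdna_acc = per_cdna_accuracy(toy_cdnas)
```

The same predictions score 4/7 per ATG but 2/2 per cDNA, because false positives downstream of the first hit are never examined under the scanning rule.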
Questions?