Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics.

Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics

What is Bioinformatics?

Themes of Bioinformatics Bioinformatics = Data Mgmt + Knowledge Discovery Data Mgmt = Integration + Transformation + Cleansing Knowledge Discovery = Statistics + Algorithms + Databases

Benefits of Bioinformatics To the patient: Better drug, better treatment To the pharma: Save time, save cost, make more $ To the scientist: Better science

From Informatics to Bioinformatics Integration Technology (Kleisli) Cleansing & Warehousing (FIMM) MHC-Peptide Binding (PREDICT) Protein Interactions Extraction (PIES) Gene Expression & Medical Record Datamining (PCL) Gene Feature Recognition (Dragon) Venom Informatics 1994 19981996 2000 2002 8 years of bioinformatics R&D in Singapore ISS KRDL LIT

Quick Samplings

Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results  Prediction by our ANN model for HLA-A11  29 predictions  22 epitopes  76% specificity 1 66 100 Rank by BIMAS Number of experimental binders 19 (52.8%) 5 (13.9%) 12 (33.3%)  Prediction by BIMAS matrix for HLA-A*1101

Transcription Start Prediction

Transcription Start Prediction Results

Medical Record Analysis  Looking for patterns that are  valid  novel  useful  understandable

Gene Expression Analysis  Classifying gene expression profiles  find stable differentially expressed genes  find significant gene groups  derive coordinated gene expression

Medical Record & Gene Expression Analysis Results  PCL, a novel “emerging pattern’’ method  Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks  Works well for gene expressions Cancer Cell, March 2002, 1(2)

Behind the Scene  Vladimir Bajic  Vladimir Brusic  Jinyan Li  See-Kiong Ng  Limsoon Wong  Louxin Zhang  Allen Chong  Judice Koh  SPT Krishnan  Huiqing Liu  Seng Hong Seah  Soon Heng Tan  Guanglan Zhang  Zhuo Zhang and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators….

Questions?

A More Detailed Account

Jonathan’s rules: Blue or Circle Jessica’s rules: All the rest What is Datamining? Whose block is this? Jonathan’s blocks Jessica’s blocks

What is Datamining? Question: Can you explain how?

The Steps of Data Mining  Training data gathering  Signal generation  k-grams, colour, texture, domain know-how,...  Signal selection  Entropy,  2, CFS, t-test, domain know-how...  Signal integration  SVM, ANN, PCL, CART, C4.5, kNN,...

Translation Initiation Recognition

A Sample cDNA 299 HSU27655.1 CAT U27655 Homo sapiens CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80 CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160 GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240 CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT............................................................ 80................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE What makes the second ATG the translation initiation site?

Signal Generation  K-grams (ie., k consecutive letters) l K = 1, 2, 3, 4, 5, … l Window size vs. fixed position l Up-stream, downstream vs. any where in window l In-frame vs. any frame

Too Many Signals  For each value of k, there are 4 k * 3 * 2 k-grams  If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!  This is too many for most machine learning algorithms

Signal Selection (Basic Idea)  Choose a signal w/ low intra-class distance  Choose a signal w/ high inter-class distance  Which of the following 3 signals is good?

Signal Selection (eg., t-statistics)

Signal Selection (eg., MIT-correlation)

Signal Selection (eg.,  2)

Signal Selection (eg., CFS)  Instead of scoring individual signals, how about scoring a group of signals as a whole?  CFS l A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other  Homework: find a formula that captures the key idea of CFS above

Sample k-grams Selected  Position –3  in-frame upstream ATG  in-frame downstream l TAA, TAG, TGA, l CTG, GAC, GAG, and GCC Kozak consensus Leaky scanning Stop codon Codon bias

Signal Integration  kNN Given a test sample, find the k training samples that are most similar to it. Let the majority class win.  SVM Given a group of training samples from two classes, determine a separating plane that maximises the margin of error.  Naïve Bayes, ANN, C4.5,...

Results (on Pedersen & Nielsen’s mRNA)

Acknowledgements  Roland Yap  Zeng Fanfan  A.G. Pedersen  H. Nielsen

Questions?

Common Mistakes

Self-fulfilling Oracle  Consider this scenario l Given classes C1 and C2 w/ explicit signals l Use  2 to C1 and C2 to select signals s1, s2, s3 l Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90%  Is the accuracy really 90%?  What can be wrong with this?

Phil Long’s Experiment  Let there be classes C1 and C2 w/ 100000 features having randomly generated values  Use  2 to select 20 features  Run k-fold x-validation on C1 and C2 w/ these 20 features  Expect: 50% accuracy  Get: 90% accuracy!  Lesson: choose features at each fold

Apples vs Oranges  Consider this scenario: l Fanfan reported 89% accuracy on his TIS prediction method l Hatzigeorgiou reported 94% accuracy on her TIS prediction method  So Hatzigeorgiou’s method is better  What is wrong with this conclusion?

Apples vs Oranges  Differences in datasets used: l Fanfan’s expt used Pedersen’s dataset l Hatzigeorgiou’s used her own dataset  Differences in counting: l Fanfan’s expt was on a per ATG basis l Hatzigeorgiou’s expt used the scanning rule and thus was on a per cDNA basis  When Fanfan ran the same dataset and count the same way as Hatzigeorgiou, got 94% also!

Questions?

Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics.

Similar presentations

Presentation on theme: "Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics.

Similar presentations

Presentation on theme: "Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics."— Presentation transcript:

Similar presentations

About project

Feedback