Copyright © 2003 Limsoon Wong. From Informatics to Bioinformatics: The Knowledge Discovery Perspective. Limsoon Wong, Institute for Infocomm Research, Singapore.

Presentation transcript:

From Informatics to Bioinformatics: The Knowledge Discovery Perspective
Limsoon Wong, Institute for Infocomm Research, Singapore

Plan
Overview of recent knowledge discovery successes in bioinformatics
Risk assignment of childhood ALL patients to optimize risk-benefit ratio of therapy
Recognition of translation initiation sites from DNA sequences

Overview of recent knowledge discovery successes in bioinformatics

What is Datamining?
Jonathan's blocks, Jessica's blocks: whose block is this?
Jonathan's rules: Blue or Circle
Jessica's rules: All the rest

What is Datamining?
Question: Can you explain how?

What is Bioinformatics?

Bioinformatics brings benefits
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more $
To the scientist: Better science

To figure these out, we bet on...
"solution" = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases

History
years of bioinformatics R&D in Singapore
ISS, KRDL, LIT/I2R
Integration Technology (Kleisli); Cleansing & Warehousing (FIMM); MHC-Peptide Binding (PREDICT); Protein Interactions Extraction (PIES); Gene Expression & Medical Record Datamining (PCL); Gene Feature Recognition (Dragon); Venom Informatics
GeneticXchange, Molecular Connections, Biobase

Predict Epitopes, Find Vaccine Targets
Vaccines are often the only solution for viral diseases
Finding & developing effective vaccine targets (epitopes) is a slow and expensive process

Recognize Functional Sites, Help Scientists
Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments
Data mining of bio seqs to find rules for recognizing & understanding functional sites
Dragon's 10x reduction of TSS recognition false positives

Diagnose Leukaemia, Benefit Children
Childhood leukaemia is a heterogeneous disease
Treatment is based on subtype
3 different tests and 4 different experts are needed for diagnosis
Curable in USA, fatal in Indonesia

Understand Proteins, Fight Diseases
Understanding the function and role of a protein needs organised info on interaction pathways
Such info is often reported in scientific papers but is seldom found in structured databases
Knowledge extraction system to
–process free text
–extract protein names
–extract interactions

Risk assignment of childhood ALL patients to optimize risk-benefit ratio of therapy

Childhood ALL: Heterogeneous Disease
Major subtypes are
–T-ALL
–E2A-PBX1
–TEL-AML1
–MLL genome rearrangements
–Hyperdiploid>50
–BCR-ABL

Childhood ALL: Treatment Failure
Overly intensive treatment leads to
–Development of secondary cancers
–Reduction of IQ
Insufficiently intensive treatment leads to
–Relapse

Childhood ALL: Risk-Stratified Therapy
Different subtypes respond differently to the same treatment intensity
=> Match patient to optimum treatment intensity for his subtype & prognosis
[Figure: subtype groups (BCR-ABL, MLL; TEL-AML1, Hyperdiploid>50; T-ALL, E2A-PBX1) placed on a scale from "generally good-risk, lower intensity" to "generally high-risk, higher intensity"]

Childhood ALL: Risk Assignment
The major subtypes look similar
Conventional diagnosis requires
–Immunophenotyping
–Cytogenetics
–Molecular diagnostics

Mission
Conventional risk assignment procedure requires difficult, expensive tests and the collective judgement of multiple specialists
Generally available only in major advanced hospitals
=> Can we have a single-test, easy-to-use platform instead?

Single-Test Platform of Microarray & Machine Learning

Overall Strategy
Diagnosis of subtype => Subtype-dependent prognosis => Risk-stratified treatment intensity
For each subtype, select genes to develop classification model for diagnosing that subtype
For each subtype, select genes to develop prediction model for prognosis of that subtype

Childhood ALL Subtype Diagnosis by PCL
Gene expression data collection
Gene selection by χ²
Classifier training by emerging pattern
Classifier tuning (optional for some machine learning methods)
Apply classifier for diagnosis of future cases by PCL

Childhood ALL Subtype Diagnosis: Our Workflow
A tree-structured diagnostic workflow was recommended by our doctor collaborator

Childhood ALL Subtype Diagnosis: Training and Testing Sets

Childhood ALL Subtype Diagnosis: Signal Selection Basic Idea
Choose a signal w/ low intra-class distance
Choose a signal w/ high inter-class distance

Childhood ALL Subtype Diagnosis: Signal Selection by χ²
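The χ² scoring named on this slide can be sketched roughly as follows. The slide body did not survive, so this is only an illustration under stated assumptions: each gene is binarised at a single threshold (an assumption, the deck does not give its discretisation), and `contingency` and `select_genes` are hypothetical helper names.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip empty rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat


def contingency(values, labels, threshold):
    """Assumed discretisation: row 0 = expression above threshold, row 1 = at or below."""
    classes = sorted(set(labels))
    table = [[0] * len(classes) for _ in range(2)]
    for value, label in zip(values, labels):
        table[0 if value > threshold else 1][classes.index(label)] += 1
    return table


def select_genes(expr, labels, threshold, top_n):
    """Hypothetical helper: names of the top_n genes ranked by chi-square score."""
    scores = {g: chi_square(contingency(v, labels, threshold)) for g, v in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy data (made up): g1 separates the two classes, g2 does not.
expr = {"g1": [5, 6, 1, 2], "g2": [5, 1, 6, 2]}
labels = ["T-ALL", "T-ALL", "other", "other"]
print(select_genes(expr, labels, threshold=3, top_n=1))
```

A gene whose high and low values line up with the class labels gets a large score, which matches the "low intra-class, high inter-class distance" idea on the previous slide.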

Childhood ALL Subtype Diagnosis: Emerging Patterns
An emerging pattern is a set of conditions
–usually involving several features
–that most members of a class satisfy
–but none or few of the other class satisfy
A jumping emerging pattern is an emerging pattern that
–some members of a class satisfy
–but no members of the other class satisfy
We use only jumping emerging patterns
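The definition of a jumping emerging pattern above can be checked mechanically. The following is a minimal sketch, assuming each sample has already been discretised into a set of condition strings such as "g1=high" (that encoding and the function names are illustrative, not taken from the slides).

```python
def support(pattern, samples):
    """Fraction of samples that satisfy every condition in the pattern."""
    return sum(pattern <= sample for sample in samples) / len(samples)


def is_jumping_ep(pattern, home_class, other_class):
    """True if some home-class samples satisfy the pattern and no other-class sample does."""
    return support(pattern, home_class) > 0 and support(pattern, other_class) == 0


# Toy samples (made up), each a set of discretised conditions.
home = [{"g1=high", "g2=low"}, {"g1=high", "g2=high"}]
other = [{"g1=low", "g2=low"}, {"g1=high", "g2=high"}]

print(is_jumping_ep({"g1=high", "g2=low"}, home, other))  # satisfied only in the home class
print(is_jumping_ep({"g1=high"}, home, other))            # also occurs in the other class
```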

Childhood ALL Subtype Diagnosis: PCL (Prediction by Collective Likelihood)

Childhood ALL Subtype Diagnosis: Accuracy of PCL (vs. other classifiers)
The classifiers are all applied to the 20 genes selected by χ² at each level of the tree

Multidimensional Scaling Plot: Subtype Diagnosis

Multidimensional Scaling Plot: Subtype-Dependent Prognosis
Similar computational analysis was carried out to predict relapse and/or secondary AML in a subtype-specific manner
>97% accuracy achieved

Childhood ALL: Is there a new subtype?
Hierarchical clustering of gene expression profiles reveals a novel subtype of childhood ALL

Childhood ALL Cure Rates in ASEAN Countries
Conventional risk assignment procedure requires difficult, expensive tests and the collective judgement of multiple specialists
=> Not available in less advanced ASEAN countries

Childhood ALL Treatment Cost
Treatment for childhood ALL over 2 yrs
–Intermediate intensity: US$60k
–Low intensity: US$36k
–High intensity: US$72k
Treatment for relapse: US$150k
Cost for side-effects: Unquantified

Childhood ALL in ASEAN Countries: Current Situation (2000 new cases/yr)
Intermediate intensity conventionally applied in less advanced ASEAN countries
=> Over-intensive for 50% of patients, thus more side effects
=> Under-intensive for 10% of patients, thus more relapse
=> 5-20% cure rates
US$120m (US$60k * 2000) for intermediate intensity treatment
US$30m (US$150k * 2000 * 10%) for relapse treatment
Total US$150m/yr plus unquantified costs for dealing with side effects

Childhood ALL in ASEAN Countries: Using Our Platform (2000 new cases/yr)
Low intensity applied to 50% of patients
Intermediate intensity to 40% of patients
High intensity to 10% of patients
=> Reduced side effects
=> Reduced relapse
=> 75-80% cure rates
US$36m (US$36k * 2000 * 50%) for low intensity
US$48m (US$60k * 2000 * 40%) for intermediate intensity
US$14.4m (US$72k * 2000 * 10%) for high intensity
Total US$98.4m/yr
=> Save US$51.6m/yr
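The cost comparison on the last two slides can be reproduced with a few lines of arithmetic. This is only a sketch using the figures stated on the slides; amounts are kept in thousands of US dollars, and the function names are illustrative. Note that, as on the slide, the platform total does not include any relapse treatment cost.

```python
CASES_PER_YEAR = 2000
COST_K = {"low": 36, "intermediate": 60, "high": 72}  # US$ thousands per patient
RELAPSE_COST_K = 150                                   # US$ thousands per relapse


def current_cost_k():
    """Everyone gets intermediate intensity; 10% are under-treated and relapse."""
    treatment = COST_K["intermediate"] * CASES_PER_YEAR
    relapse = RELAPSE_COST_K * CASES_PER_YEAR * 10 // 100
    return treatment + relapse


def platform_cost_k():
    """Risk-stratified mix from the slide: 50% low, 40% intermediate, 10% high."""
    mix_percent = {"low": 50, "intermediate": 40, "high": 10}
    return sum(COST_K[level] * CASES_PER_YEAR * pct // 100
               for level, pct in mix_percent.items())


saving_k = current_cost_k() - platform_cost_k()
print(current_cost_k(), platform_cost_k(), saving_k)  # 150000 98400 51600
```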

Acknowledgements

Recognition of translation initiation sites from DNA sequences

Translation Initiation Site

A Sample mRNA
299 HSU CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?

Translation Initiation Site Recognition: Steps of a General Approach
Training data gathering
Signal generation
–k-grams, colour, texture, domain know-how, ...
Signal selection
–Entropy, χ², CFS, t-test, domain know-how, ...
Signal integration
–SVM, ANN, PCL, CART, C4.5, kNN, ...

Translation Initiation Site Recognition: Training & Testing Data
Vertebrate dataset of Pedersen & Nielsen [ISMB'97]
3312 sequences
Of the ATG sites, 3312 (24.5%) are TIS and the remaining 75.5% are non-TIS
Use for 3-fold x-validation expts

Translation Initiation Site Recognition: Signal Generation
K-grams (i.e., k consecutive letters)
–K = 1, 2, 3, 4, 5, …
–Window size vs. fixed position
–Upstream, downstream vs. anywhere in window
–In-frame vs. any frame

Signal Generation: An Example
299 HSU CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
Window = ±100 bases
In-frame, downstream
–GCT = 1, TTT = 1, ATG = 1, …
Any-frame, downstream
–GCT = 3, TTT = 2, ATG = 2, …
In-frame, upstream
–GCT = 2, TTT = 0, ATG = 0, ...
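Counting k-grams around a candidate ATG, as in the example above, might look like the sketch below. It uses a short made-up sequence rather than the sample mRNA, so it does not reproduce the counts on the slide. Two assumptions are labeled in the comments: the downstream window starts right after the candidate ATG, and "in-frame" means positions at a multiple of 3 bases from the candidate ATG. The function name and parameters are illustrative.

```python
from collections import Counter


def kgram_counts(seq, atg_pos, k=3, window=100, direction="downstream", in_frame=True):
    """Count k-grams in a window upstream or downstream of the ATG at atg_pos."""
    if direction == "downstream":
        # Assumption: the window begins immediately after the candidate ATG.
        region = seq[atg_pos + 3:atg_pos + 3 + window]
        offset = 0
    else:
        region = seq[max(0, atg_pos - window):atg_pos]
        # Assumption: in-frame upstream k-grams end a multiple of 3 bases before the ATG.
        offset = len(region) % 3
    step = 3 if in_frame else 1
    first = offset if in_frame else 0
    return Counter(region[i:i + k] for i in range(first, len(region) - k + 1, step))


toy = "GCTAAATGGCTTTTATGGCT"      # made-up sequence; candidate ATG at index 5
down_in = kgram_counts(toy, 5)     # in-frame, downstream
down_any = kgram_counts(toy, 5, in_frame=False)
up_in = kgram_counts(toy, 5, direction="upstream")
print(down_in["GCT"], down_in["TTT"], down_in["ATG"])     # 2 1 1
print(down_any["GCT"], down_any["TTT"], down_any["ATG"])  # 2 2 1
print(up_in["TAA"], up_in["GCT"])                         # 1 0
```

Each (k-gram, direction, frame) combination becomes one numeric feature for the candidate site.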

Signal Generation: Too Many Signals
For each value of k, there are 4^k * 3 * 2 k-grams
If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
This is too many for most machine learning algorithms

Translation Initiation Site Recognition: Signal Selection (e.g., χ²)

Translation Initiation Site Recognition: Signal Selection (e.g., CFS)
Instead of scoring individual signals, how about scoring a group of signals as a whole?
CFS
–Correlation-based Feature Selection
–A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
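The "good group" criterion can be made concrete with a merit score. The slide does not give a formula; the sketch below assumes the merit heuristic commonly associated with CFS, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean signal-class correlation and r_ff the mean signal-signal correlation. Correlations are passed in precomputed, and all names and numbers are illustrative.

```python
from itertools import combinations
from math import sqrt


def cfs_merit(subset, class_corr, pair_corr):
    """Merit of a signal subset: high class correlation, low mutual correlation."""
    k = len(subset)
    r_cf = sum(class_corr[f] for f in subset) / k
    pairs = list(combinations(sorted(subset), 2))
    r_ff = sum(pair_corr[p] for p in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Made-up correlations: a and b are redundant with each other, c is independent.
class_corr = {"a": 0.8, "b": 0.8, "c": 0.8}
pair_corr = {("a", "b"): 0.9, ("a", "c"): 0.0, ("b", "c"): 0.0}

redundant = cfs_merit({"a", "b"}, class_corr, pair_corr)
complementary = cfs_merit({"a", "c"}, class_corr, pair_corr)
print(redundant < complementary)  # the uncorrelated pair scores higher
```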

Signal Selection: Sample k-grams Selected
Position –3 (Kozak consensus)
In-frame upstream ATG (leaky scanning)
In-frame downstream
–TAA, TAG, TGA (stop codon)
–CTG, GAC, GAG, and GCC (codon bias)

Translation Initiation Site Recognition: Signal Integration
kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
SVM: Given a group of training samples from two classes, determine a separating hyperplane that maximises the margin between the two classes.
Naïve Bayes, ANN, C4.5, PCL, ...
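The kNN rule described above is short enough to sketch directly. This version assumes Euclidean distance as the similarity measure (the slide does not say which measure was used) and toy two-dimensional feature vectors; the function name is illustrative.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Label of the majority class among the k training samples nearest to query."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training samples: (feature vector, class label).
train = [
    ((0, 0), "non-TIS"),
    ((0, 1), "non-TIS"),
    ((5, 5), "TIS"),
    ((5, 6), "TIS"),
    ((6, 5), "TIS"),
]
print(knn_predict(train, (5, 4)))    # TIS
print(knn_predict(train, (0, 0.5)))  # non-TIS
```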

Translation Initiation Site Recognition: Results (on Pedersen & Nielsen's mRNA)

Translation Initiation Site Recognition: mRNA → protein
[Figure: genetic code table mapping codons to the amino acids F, L, I, M, V, S, P, T, A, Y, H, Q, N, K, D, E, C, W, R, G and stop]
How about using k-grams from the translation?
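Translating the sequence before generating k-grams could be sketched as follows, using the standard genetic code. The fragment in the example is the stretch starting at the first ATG of the sample mRNA shown earlier; the function name and the "*" symbol for stop are conventions chosen here, not taken from the slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, start=0):
    """Translate dna codon by codon from index start, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(start, len(dna) - 2, 3))


fragment = "ATGGCTGAACACTGA"  # begins at the first ATG of the sample mRNA
protein = translate(fragment)
print(protein)  # MAEH*
```

Amino-acid k-grams are then just k consecutive letters of the translated string, so an in-frame stop shortly after a candidate ATG shows up as a "*" early in the translation.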

Signal Generation: Amino-Acid Features

Signal Generation: Amino-Acid Features

Signal Selection: Amino Acid K-grams Discovered

Translation Initiation Site Recognition: Results (based on amino acid features)
Performance based on amino-acid features is better than performance based on DNA seq. features

Acknowledgements
Huiqing Liu, Jinyan Li, Roland Yap, Zeng Fanfan, A.G. Pedersen, H. Nielsen

To give this lecture to SMA students. Date: 28 Oct 2003 Time: am Venue: Video Conference Room, S