Published by Derick Lane. Modified over 8 years ago.
1
From Informatics to Bioinformatics: The Knowledge Discovery Perspective
Limsoon Wong
Institute for Infocomm Research, Singapore
Copyright 2003 Limsoon Wong
2
Plan
– Overview of recent knowledge discovery successes in bioinformatics
– Risk assignment of childhood ALL patients to optimize risk-benefit ratio of therapy
– Recognition of translation initiation sites from DNA sequences
3
Overview of recent knowledge discovery successes in bioinformatics
4
What is Datamining?
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
Whose block is this?
[Figure: Jonathan’s blocks and Jessica’s blocks]
5
What is Datamining?
Question: Can you explain how?
6
What is Bioinformatics?
7
Bioinformatics brings benefits
– To the patient: Better drug, better treatment
– To the pharma: Save time, save cost, make more $
– To the scientist: Better science
8
To figure these out, we bet on...
“solution” = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
9
History: 8 years of bioinformatics R&D in Singapore
Timeline: 1994, 1996, 1998, 2000, 2002
Organisations: ISS, KRDL, LIT/I²R; GeneticXchange, Molecular Connections, Biobase
Projects:
– Integration Technology (Kleisli)
– Cleansing & Warehousing (FIMM)
– MHC-Peptide Binding (PREDICT)
– Protein Interactions Extraction (PIES)
– Gene Expression & Medical Record Datamining (PCL)
– Gene Feature Recognition (Dragon)
– Venom Informatics
10
Predict Epitopes, Find Vaccine Targets
– Vaccines are often the only solution for viral diseases
– Finding & developing effective vaccine targets (epitopes) is a slow and expensive process
11
Recognize Functional Sites, Help Scientists
– Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments
– Data mining of bio seqs to find rules for recognizing & understanding functional sites
– Dragon’s 10x reduction of TSS recognition false positives
12
Diagnose Leukaemia, Benefit Children
– Childhood leukaemia is a heterogeneous disease
– Treatment is based on subtype
– 3 different tests and 4 different experts are needed for diagnosis
– Curable in USA, fatal in Indonesia
13
Understand Proteins, Fight Diseases
– Understanding the function and role of a protein needs organised info on interaction pathways
– Such info is often reported in scientific papers but is seldom found in structured databases
– Knowledge extraction system to process free text: extract protein names, extract interactions
14
Risk assignment of childhood ALL patients to optimize risk-benefit ratio of therapy
15
Childhood ALL: Heterogeneous Disease
Major subtypes are
– T-ALL
– E2A-PBX1
– TEL-AML1
– MLL genome rearrangements
– Hyperdiploid>50
– BCR-ABL
16
Childhood ALL: Treatment Failure
Overly intensive treatment leads to
– Development of secondary cancers
– Reduction of IQ
Insufficiently intensive treatment leads to
– Relapse
17
Childhood ALL: Risk-Stratified Therapy
– Different subtypes respond differently to the same treatment intensity
– Match patient to optimum treatment intensity for his subtype & prognosis
[Figure: subtype groups (BCR-ABL, MLL; TEL-AML1, Hyperdiploid>50; T-ALL, E2A-PBX1) placed on a scale from "Generally good-risk, lower intensity" to "Generally high-risk, higher intensity"]
18
Childhood ALL: Risk Assignment
The major subtypes look similar
Conventional diagnosis requires
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics
19
Mission
– Conventional risk assignment procedure requires difficult, expensive tests and the collective judgement of multiple specialists
– Generally available only in major advanced hospitals
– Can we have a single-test, easy-to-use platform instead?
20
Single-Test Platform of Microarray & Machine Learning
21
Overall Strategy
Stages: diagnosis of subtype; subtype-dependent prognosis; risk-stratified treatment intensity
– For each subtype, select genes to develop a classification model for diagnosing that subtype
– For each subtype, select genes to develop a prediction model for prognosis of that subtype
22
Childhood ALL Subtype Diagnosis by PCL
– Gene expression data collection
– Gene selection (by χ²)
– Classifier training (by emerging patterns)
– Classifier tuning (optional for some machine learning methods)
– Apply classifier for diagnosis of future cases (by PCL)
23
Childhood ALL Subtype Diagnosis: Our Workflow
A tree-structured diagnostic workflow was recommended by our doctor collaborator
24
Childhood ALL Subtype Diagnosis: Training and Testing Sets
25
Childhood ALL Subtype Diagnosis: Signal Selection Basic Idea
– Choose a signal w/ low intra-class distance
– Choose a signal w/ high inter-class distance
26
Childhood ALL Subtype Diagnosis: Signal Selection by χ²
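The slide's own formulation of the χ² selection step is not preserved in this transcript. As a rough illustration only, the sketch below scores a signal with the textbook χ² statistic on a 2x2 contingency table. It assumes each gene expression signal has already been binarized (for example, above or below a threshold), which is a simplification. The names `chi_square_2x2` and `top_signals` are hypothetical.

```python
def chi_square_2x2(present, labels):
    """Chi-square statistic for a binary signal versus a binary class label.

    Assumption: both inputs are sequences of 0/1 values of equal length.
    """
    n = len(labels)
    # Observed counts, indexed as observed[signal_value][class_label].
    observed = [[0, 0], [0, 0]]
    for s, y in zip(present, labels):
        observed[int(s)][int(y)] += 1
    row = [sum(observed[0]), sum(observed[1])]
    col = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:  # skip empty margins to avoid division by zero
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def top_signals(table, labels, k):
    """Rank signal names by descending chi-square score and keep the top k."""
    scores = {name: chi_square_2x2(values, labels) for name, values in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A signal that splits the two classes perfectly gets the highest possible score for the sample size, while a signal that is independent of the class scores zero, so ranking by this score favours discriminating signals.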
27
Childhood ALL Subtype Diagnosis: Emerging Patterns
An emerging pattern is a set of conditions
– usually involving several features
– that most members of a class satisfy
– but that none or few members of the other class satisfy
A jumping emerging pattern is an emerging pattern that
– some members of a class satisfy
– but no members of the other class satisfy
We use only jumping emerging patterns
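The definition of a jumping emerging pattern can be checked mechanically. The sketch below is illustrative only: it assumes each sample is represented as a set of already discretized conditions (such as "g1_high"), and the function names and toy data are hypothetical. It does not cover how patterns are mined.

```python
def support(pattern, samples):
    """Fraction of samples (sets of conditions) that satisfy every condition in pattern."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if pattern <= s) / len(samples)


def is_jumping_emerging_pattern(pattern, home, other):
    """True if some members of the home class satisfy the pattern and no member of the other class does."""
    return support(pattern, home) > 0 and support(pattern, other) == 0


# Toy data, invented for illustration.
home_class = [{"g1_high", "g2_low"}, {"g1_high", "g2_low", "g3_high"}, {"g2_low"}]
other_class = [{"g1_high"}, {"g2_low", "g3_high"}]

jep = is_jumping_emerging_pattern({"g1_high", "g2_low"}, home_class, other_class)
not_jep = is_jumping_emerging_pattern({"g2_low"}, home_class, other_class)
```

Here the two-condition pattern qualifies because neither sample of the other class satisfies both conditions at once, while the single condition "g2_low" fails because one sample of the other class satisfies it.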
28
Childhood ALL Subtype Diagnosis: PCL (Prediction by Collective Likelihood)
29
Childhood ALL Subtype Diagnosis: Accuracy of PCL (vs. other classifiers)
The classifiers are all applied to the 20 genes selected by χ² at each level of the tree
30
Multidimensional Scaling Plot: Subtype Diagnosis
31
Multidimensional Scaling Plot: Subtype-Dependent Prognosis
– Similar computational analysis was carried out to predict relapse and/or secondary AML in a subtype-specific manner
– >97% accuracy achieved
32
Childhood ALL: Is there a new subtype?
Hierarchical clustering of gene expression profiles reveals a novel subtype of childhood ALL
33
Childhood ALL: Cure Rates in ASEAN Countries
– Conventional risk assignment procedure requires difficult, expensive tests and the collective judgement of multiple specialists
– Not available in less advanced ASEAN countries
34
Childhood ALL: Treatment Cost
Treatment for childhood ALL over 2 yrs
– Intermediate intensity: US$60k
– Low intensity: US$36k
– High intensity: US$72k
Treatment for relapse: US$150k
Cost for side-effects: Unquantified
35
Childhood ALL in ASEAN Countries: Current Situation (2000 new cases/yr)
– Intermediate intensity conventionally applied in less advanced ASEAN countries
– Over-intensive for 50% of patients, thus more side effects
– Under-intensive for 10% of patients, thus more relapse
– 5-20% cure rates
– US$120m (US$60k * 2000) for intermediate intensity treatment
– US$30m (US$150k * 2000 * 10%) for relapse treatment
– Total US$150m/yr plus unquantified costs for dealing with side effects
36
Childhood ALL in ASEAN Countries: Using Our Platform (2000 new cases/yr)
– Low intensity applied to 50% of patients
– Intermediate intensity to 40% of patients
– High intensity to 10% of patients
– Reduced side effects
– Reduced relapse
– 75-80% cure rates
– US$36m (US$36k * 2000 * 50%) for low intensity
– US$48m (US$60k * 2000 * 40%) for intermediate intensity
– US$14.4m (US$72k * 2000 * 10%) for high intensity
– Total US$98.4m/yr
– Save US$51.6m/yr
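The cost comparison on the two slides above can be reproduced as a short calculation. This is only a restatement of the slides' arithmetic using the slides' own figures; the variable and function names are hypothetical. Note that the slide's platform total contains no relapse term, and the sketch follows the slide in that respect.

```python
NEW_CASES = 2000  # new cases per year, from the slides
COST = {"low": 36_000, "intermediate": 60_000, "high": 72_000}  # US$ per patient over 2 yrs
RELAPSE_COST = 150_000  # US$ per relapse treatment


def current_cost():
    """Everyone gets intermediate intensity; 10% of patients are treated for relapse."""
    treatment = COST["intermediate"] * NEW_CASES
    relapse = RELAPSE_COST * NEW_CASES * 10 // 100
    return treatment + relapse


def platform_cost():
    """Risk-stratified therapy with the patient shares stated on the slide (in percent)."""
    shares = {"low": 50, "intermediate": 40, "high": 10}
    return sum(COST[level] * NEW_CASES * pct // 100 for level, pct in shares.items())


saving = current_cost() - platform_cost()
print(f"current: US${current_cost():,}  platform: US${platform_cost():,}  saving: US${saving:,}")
```

Integer percentages are used instead of floating-point fractions so the totals come out exact.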
37
Acknowledgements
38
Recognition of translation initiation sites from DNA sequences
39
Translation Initiation Site
40
A Sample mRNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?
41
Translation Initiation Site Recognition: Steps of a General Approach
– Training data gathering
– Signal generation: k-grams, colour, texture, domain know-how, ...
– Signal selection: entropy, χ², CFS, t-test, domain know-how, ...
– Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...
42
Translation Initiation Site Recognition: Training & Testing Data
Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
– 3312 sequences
– 13503 ATG sites
– 3312 (24.5%) are TIS
– 10191 (75.5%) are non-TIS
– Used for 3-fold cross-validation experiments
43
Translation Initiation Site Recognition: Signal Generation
k-grams (i.e., k consecutive letters)
– k = 1, 2, 3, 4, 5, …
– Window size vs. fixed position
– Upstream, downstream vs. anywhere in window
– In-frame vs. any frame
44
Signal Generation: An Example
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
Window = 100 bases
– In-frame, downstream: GCT = 1, TTT = 1, ATG = 1, ...
– Any-frame, downstream: GCT = 3, TTT = 2, ATG = 2, ...
– In-frame, upstream: GCT = 2, TTT = 0, ATG = 0, ...
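The k-gram counting described above might be sketched as follows. This is an illustrative sketch, not the author's implementation. It assumes the window starts immediately after the candidate ATG for downstream counts and ends immediately before it for upstream counts, and that "in-frame" means the k-gram starts a multiple of 3 bases away from the ATG. The function name `kgram_counts` and the toy sequence are hypothetical, and the sketch does not try to reproduce the counts printed on the slide.

```python
from collections import Counter


def kgram_counts(seq, atg_pos, k=3, window=100, side="downstream", in_frame=True):
    """Count k-grams on one side of the candidate ATG at index atg_pos.

    Assumptions (not from the slides): the candidate ATG itself is excluded,
    and a k-gram is in-frame when its start offset from the ATG is divisible by 3.
    """
    if side == "downstream":
        start = atg_pos + 3
        end = min(len(seq), start + window)
    else:
        start = max(0, atg_pos - window)
        end = atg_pos
    counts = Counter()
    for p in range(start, end - k + 1):
        if in_frame and (p - atg_pos) % 3 != 0:
            continue
        counts[seq[p:p + k]] += 1
    return counts


toy = "GCTATGGCTTTTATGGCT"  # invented sequence; candidate ATG at index 3
down_in_frame = kgram_counts(toy, 3, side="downstream", in_frame=True)
down_any_frame = kgram_counts(toy, 3, side="downstream", in_frame=False)
up_in_frame = kgram_counts(toy, 3, side="upstream", in_frame=True)
```

Each (k-gram, side, frame) combination becomes one numeric feature of the candidate ATG, which is why the number of generated signals grows so quickly with k.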
45
Signal Generation: Too Many Signals
– For each value of k, there are 4^k * 3 * 2 k-grams
– If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
– This is too many for most machine learning algorithms
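The feature count follows directly from the formula 4^k * 3 * 2. The short sketch below evaluates it for k = 1 to 5. The parameter names reflect one reading of the factors (3 region options: upstream, downstream, anywhere in window; 2 frame options: in-frame, any frame) and are assumptions rather than something the slide states.

```python
def kgram_feature_count(k, regions=3, frame_modes=2, alphabet=4):
    """Number of k-gram features for one k: alphabet^k * regions * frame_modes."""
    return alphabet ** k * regions * frame_modes


counts = [kgram_feature_count(k) for k in range(1, 6)]
total = sum(counts)
print(counts, total)
```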
46
Translation Initiation Site Recognition: Signal Selection (e.g., χ²)
47
Translation Initiation Site Recognition: Signal Selection (e.g., CFS)
Instead of scoring individual signals, how about scoring a group of signals as a whole?
CFS
– Correlation-based Feature Selection
– A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
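The idea of scoring a group as a whole can be made concrete with the merit heuristic commonly associated with CFS: merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean signal-class correlation and r_ff the mean signal-signal correlation. The sketch below is a simplified illustration: it uses absolute Pearson correlation as the correlation measure, which is an assumption made for brevity, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def cfs_merit(signals, labels):
    """Merit of a group of signals: rewards class correlation, penalises redundancy."""
    k = len(signals)
    r_cf = sum(abs(pearson(s, labels)) for s in signals) / k
    pairs = list(combinations(signals, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


labels = [1, 1, 0, 0]
a = [1, 1, 0, 0]  # perfectly correlated with the class
b = [1, 0, 1, 0]  # uncorrelated with the class and with a
c = [1, 1, 0, 0]  # exact duplicate of a
```

With this toy data, adding the duplicate `c` to `a` leaves the merit unchanged, and adding the irrelevant `b` lowers it, so a search over groups has no incentive to keep redundant or irrelevant signals.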
48
Signal Selection: Sample k-grams Selected
– Position –3 (Kozak consensus)
– In-frame upstream ATG (leaky scanning)
– In-frame downstream TAA, TAG, TGA (stop codon)
– In-frame downstream CTG, GAC, GAG, and GCC (codon bias)
49
Translation Initiation Site Recognition: Signal Integration
– kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
– SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
– Naïve Bayes, ANN, C4.5, PCL, ...
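The kNN rule stated above is short enough to write out. The following is a minimal sketch that assumes numeric feature vectors and squared Euclidean distance as the similarity measure; the slide does not specify a distance, so that choice and the toy data are assumptions.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the majority label among the k training samples closest to query.

    train is a list of (feature_vector, label) pairs.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training data, invented for illustration.
train = [
    ([0, 0], "non-TIS"), ([0, 1], "non-TIS"), ([1, 0], "non-TIS"),
    ([5, 5], "TIS"), ([5, 6], "TIS"), ([6, 5], "TIS"),
]
near_tis = knn_predict(train, [5, 4])
near_non_tis = knn_predict(train, [1, 1])
```

An odd k avoids tied votes in the two-class case.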
50
Translation Initiation Site Recognition: Results (on Pedersen & Nielsen’s mRNA)
51
Translation Initiation Site Recognition: mRNA to protein
[Figure: genetic code table mapping codons to amino acids (F, L, I, M, V, S, P, T, A, Y, H, Q, N, K, D, E, C, W, R, G) and stop]
How about using k-grams from the translation?
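Generating k-grams from the translation first requires translating codons into amino acids. The sketch below builds the standard genetic code table from its usual compact encoding (bases ordered T, C, A, G, with `*` for stop) and then counts amino-acid k-grams. The function names are hypothetical, and how the original work windowed or framed the translated sequence is not shown in this transcript.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, start=0):
    """Translate seq codon by codon from index start; unknown codons become 'X'."""
    protein = []
    for p in range(start, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[p:p + 3], "X"))
    return "".join(protein)


def amino_kgram_counts(protein, k=2):
    """Count every amino-acid k-gram in a translated sequence."""
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


# The 15 bases starting at the first ATG of the sample mRNA shown earlier.
peptide = translate("ATGGCTGAACACTGA")
pairs = amino_kgram_counts(peptide, k=2)
```

In this example the reading frame of the first ATG reaches a stop codon after four amino acids, which matches the kind of in-frame downstream stop-codon signal listed among the selected k-grams.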
52
Signal Generation: Amino-Acid Features
53
Signal Generation: Amino-Acid Features (continued)
54
Signal Selection: Amino Acid K-grams Discovered
55
Translation Initiation Site Recognition: Results (based on amino acid features)
Performance based on amino-acid features is better than performance based on DNA sequence features.
56
Acknowledgements
– Huiqing Liu
– Jinyan Li
– Roland Yap
– Zeng Fanfan
– A.G. Pedersen
– H. Nielsen
57
To give this lecture to SMA students.
Date: 28 Oct 2003
Time: 10-11.30am
Venue: Video Conference Room, S15-04-30