
Copyright © 2003 Limsoon Wong. From Informatics to Bioinformatics: The Knowledge Discovery Perspective. Limsoon Wong, Institute for Infocomm Research, Singapore.


1 From Informatics to Bioinformatics: The Knowledge Discovery Perspective. Limsoon Wong, Institute for Infocomm Research, Singapore

2 Plan
– Overview of recent knowledge discovery successes in bioinformatics
– Risk assignment of childhood ALL patients to optimize risk-benefit ratio of therapy
– Recognition of translation initiation sites from DNA sequences

3 Overview of recent knowledge discovery successes in bioinformatics

4 What is Datamining? Whose block is this?
– Jonathan’s blocks. Jonathan’s rules: Blue or Circle
– Jessica’s blocks. Jessica’s rules: All the rest

5 What is Datamining? Question: Can you explain how?

6 What is Bioinformatics?

7 Bioinformatics brings benefits
– To the patient: Better drug, better treatment
– To the pharma: Save time, save cost, make more $
– To the scientist: Better science

8 To figure these out, we bet on...
– “solution” = Data Mgmt + Knowledge Discovery
– Data Mgmt = Integration + Transformation + Cleansing
– Knowledge Discovery = Statistics + Algorithms + Databases

9 History: 8 years of bioinformatics R&D in Singapore
– Timeline: 1994, 1996, 1998, 2000, 2002
– Projects: Integration Technology (Kleisli); Cleansing & Warehousing (FIMM); MHC-Peptide Binding (PREDICT); Protein Interactions Extraction (PIES); Gene Expression & Medical Record Datamining (PCL); Gene Feature Recognition (Dragon); Venom Informatics
– Organisations: ISS, KRDL, LIT/I²R, GeneticXchange, Molecular Connections, Biobase

10 Predict Epitopes, Find Vaccine Targets
– Vaccines are often the only solution for viral diseases
– Finding & developing effective vaccine targets (epitopes) is a slow and expensive process

11 Recognize Functional Sites, Help Scientists
– Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments
– Data mining of bio seqs to find rules for recognizing & understanding functional sites
– Dragon’s 10x reduction of TSS recognition false positives

12 Diagnose Leukaemia, Benefit Children
– Childhood leukaemia is a heterogeneous disease
– Treatment is based on subtype
– 3 different tests and 4 different experts are needed for diagnosis
– Curable in USA, fatal in Indonesia

13 Understand Proteins, Fight Diseases
– Understanding function and role of protein needs organised info on interaction pathways
– Such info is often reported in scientific papers but is seldom found in structured databases
– Knowledge extraction system to process free text: extract protein names, extract interactions

14 Risk assignment of childhood ALL patients to optimize risk-benefit ratio of therapy

15 Childhood ALL: Heterogeneous Disease
Major subtypes are
– T-ALL
– E2A-PBX1
– TEL-AML1
– MLL genome rearrangements
– Hyperdiploid>50
– BCR-ABL

16 Childhood ALL: Treatment Failure
Overly intensive treatment leads to
– Development of secondary cancers
– Reduction of IQ
Insufficiently intensive treatment leads to
– Relapse

17 Childhood ALL: Risk-Stratified Therapy
– Different subtypes respond differently to the same treatment intensity
– Match patient to optimum treatment intensity for his subtype & prognosis
– Subtype groups shown: BCR-ABL, MLL; TEL-AML1, Hyperdiploid>50; T-ALL, E2A-PBX1
– Scale shown: Generally good-risk, lower intensity; Generally high-risk, higher intensity

18 Childhood ALL: Risk Assignment
The major subtypes look similar. Conventional diagnosis requires
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics

19 Mission
– Conventional risk assignment procedure requires difficult expensive tests and collective judgement of multiple specialists
– Generally available only in major advanced hospitals
– Can we have a single-test easy-to-use platform instead?

20 Single-Test Platform of Microarray & Machine Learning

21 Overall Strategy
– Pipeline: diagnosis of subtype, then subtype-dependent prognosis, then risk-stratified treatment intensity
– For each subtype, select genes to develop classification model for diagnosing that subtype
– For each subtype, select genes to develop prediction model for prognosis of that subtype

22 Childhood ALL Subtype Diagnosis by PCL
– Gene expression data collection
– Gene selection by χ²
– Classifier training by emerging pattern
– Classifier tuning (optional for some machine learning methods)
– Apply classifier for diagnosis of future cases by PCL

23 Childhood ALL Subtype Diagnosis: Our Workflow
– A tree-structured diagnostic workflow was recommended by our doctor collaborator

24 Childhood ALL Subtype Diagnosis: Training and Testing Sets

25 Childhood ALL Subtype Diagnosis: Signal Selection Basic Idea
– Choose a signal w/ low intra-class distance
– Choose a signal w/ high inter-class distance

26 Childhood ALL Subtype Diagnosis: Signal Selection by χ²
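
The slide's formula did not survive in this transcript, so the following is only a minimal sketch of χ²-based signal scoring, assuming each gene's expression has already been discretised into categories such as "high" and "low". The function name `chi_square` and the toy data are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def chi_square(feature_values, labels):
    """Pearson chi-square statistic of a discretised feature against class labels."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    value_totals = Counter(feature_values)
    class_totals = Counter(labels)
    stat = 0.0
    for value, value_total in value_totals.items():
        for cls, class_total in class_totals.items():
            # Expected count if the feature were independent of the class.
            expected = value_total * class_total / n
            stat += (observed[(value, cls)] - expected) ** 2 / expected
    return stat


# Toy data (assumed): 4 samples of one subtype, 4 of the rest.
labels = ["T-ALL"] * 4 + ["other"] * 4
gene_a = ["high"] * 4 + ["low"] * 4      # separates the classes perfectly
gene_b = ["high", "low"] * 4             # carries no class information

scores = {"gene_a": chi_square(gene_a, labels), "gene_b": chi_square(gene_b, labels)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Genes with the largest statistic are the ones whose discretised values depend most strongly on the class, which is the sense in which they are "selected".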

27 Childhood ALL Subtype Diagnosis: Emerging Patterns
An emerging pattern is a set of conditions
– usually involving several features
– that most members of a class satisfy
– but none or few of the other class satisfy
A jumping emerging pattern is an emerging pattern that
– some members of a class satisfy
– but no members of the other class satisfy
We use only jumping emerging patterns
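
The definition of a jumping emerging pattern can be illustrated with a small check. This is a sketch under stated assumptions: samples are dictionaries of discretised gene values, a pattern is a dictionary of required values, and the names `support` and `is_jumping_ep` are hypothetical helpers rather than part of PCL.

```python
def support(pattern, samples):
    """Fraction of samples that satisfy every condition in the pattern."""
    if not samples:
        return 0.0
    hits = sum(all(s.get(f) == v for f, v in pattern.items()) for s in samples)
    return hits / len(samples)


def is_jumping_ep(pattern, home_class, other_class):
    """True if some home-class samples match and no other-class sample does."""
    return support(pattern, home_class) > 0 and support(pattern, other_class) == 0


# Toy data (assumed).
home = [
    {"g1": "high", "g2": "low"},
    {"g1": "high", "g2": "low"},
    {"g1": "low", "g2": "low"},
]
other = [
    {"g1": "high", "g2": "high"},
    {"g1": "low", "g2": "low"},
]

print(is_jumping_ep({"g1": "high", "g2": "low"}, home, other))  # two conditions
print(is_jumping_ep({"g1": "high"}, home, other))               # one condition
```

In this toy data the two-condition pattern is satisfied only within the home class, while the single condition also matches a sample of the other class and therefore does not qualify.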

28 Childhood ALL Subtype Diagnosis: PCL (Prediction by Collective Likelihood)

29 Childhood ALL Subtype Diagnosis: Accuracy of PCL (vs. other classifiers)
– The classifiers are all applied to the 20 genes selected by χ² at each level of the tree

30 Multidimensional Scaling Plot: Subtype Diagnosis

31 Multidimensional Scaling Plot: Subtype-Dependent Prognosis
– Similar computational analysis was carried out to predict relapse and/or secondary AML in a subtype-specific manner
– >97% accuracy achieved

32 Childhood ALL: Is there a new subtype?
– Hierarchical clustering of gene expression profiles reveals a novel subtype of childhood ALL

33 Childhood ALL Cure Rates in ASEAN Countries
– Conventional risk assignment procedure requires difficult expensive tests and collective judgement of multiple specialists
– Not available in less advanced ASEAN countries

34 Childhood ALL Treatment Cost
Treatment for childhood ALL over 2 yrs
– Intermediate intensity: US$60k
– Low intensity: US$36k
– High intensity: US$72k
Treatment for relapse: US$150k
Cost for side-effects: Unquantified

35 Childhood ALL in ASEAN Countries: Current Situation (2000 new cases/yr)
– Intermediate intensity conventionally applied in less advanced ASEAN countries
– Over intensive for 50% of patients, thus more side effects
– Under intensive for 10% of patients, thus more relapse
– 5-20% cure rates
– US$120m (US$60k * 2000) for intermediate intensity treatment
– US$30m (US$150k * 2000 * 10%) for relapse treatment
– Total US$150m/yr plus unquantified costs for dealing with side effects

36 Childhood ALL in ASEAN Countries: Using Our Platform (2000 new cases/yr)
– Low intensity applied to 50% of patients
– Intermediate intensity to 40% of patients
– High intensity to 10% of patients
– Reduced side effects
– Reduced relapse
– 75-80% cure rates
– US$36m (US$36k * 2000 * 50%) for low intensity
– US$48m (US$60k * 2000 * 40%) for intermediate intensity
– US$14.4m (US$72k * 2000 * 10%) for high intensity
– Total US$98.4m/yr
– Save US$51.6m/yr
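
The cost comparison on the last two slides can be reproduced with a few lines of arithmetic. This sketch uses only the figures quoted in the slides; the variable and function names are illustrative. As on the slide, the platform total counts treatment cost only.

```python
NEW_CASES_PER_YEAR = 2000
COST_USD = {"low": 36_000, "intermediate": 60_000, "high": 72_000}
RELAPSE_COST_USD = 150_000


def treatment_cost(percent_by_intensity):
    """Yearly treatment cost for a given percentage split across intensities."""
    return sum(
        COST_USD[level] * NEW_CASES_PER_YEAR * pct // 100
        for level, pct in percent_by_intensity.items()
    )


# Current situation: everyone gets intermediate intensity, 10% relapse.
current = treatment_cost({"intermediate": 100}) + RELAPSE_COST_USD * NEW_CASES_PER_YEAR * 10 // 100

# With risk-stratified assignment: 50% low, 40% intermediate, 10% high.
platform = treatment_cost({"low": 50, "intermediate": 40, "high": 10})

saving = current - platform
print(f"current US${current / 1e6:.1f}m, platform US${platform / 1e6:.1f}m, saving US${saving / 1e6:.1f}m")
```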

37 Acknowledgements

38 Recognition of translation initiation sites from DNA sequences

39 Translation Initiation Site

40 A Sample mRNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?

41 Translation Initiation Site Recognition: Steps of a General Approach
– Training data gathering
– Signal generation: k-grams, colour, texture, domain know-how, ...
– Signal selection: Entropy, χ², CFS, t-test, domain know-how, ...
– Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...

42 Translation Initiation Site Recognition: Training & Testing Data
– Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
– 3312 sequences
– 13503 ATG sites
– 3312 (24.5%) are TIS
– 10191 (75.5%) are non-TIS
– Use for 3-fold x-validation expts

43 Translation Initiation Site Recognition: Signal Generation
K-grams (i.e., k consecutive letters)
– K = 1, 2, 3, 4, 5, …
– Window size vs. fixed position
– Upstream, downstream vs. anywhere in window
– In-frame vs. any frame

44 Signal Generation: An Example
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
Window = ±100 bases
– In-frame, downstream: GCT = 1, TTT = 1, ATG = 1, …
– Any-frame, downstream: GCT = 3, TTT = 2, ATG = 2, …
– In-frame, upstream: GCT = 2, TTT = 0, ATG = 0, …
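
The k-gram signals described above can be sketched as a small counting routine. This is an illustration under assumptions, not the authors' code: the function name `kgram_counts`, the choice that "downstream" starts right after the candidate ATG, and the short toy sequence are all hypothetical. The toy sequence is used instead of the sample mRNA because the sequence lines in this transcript appear truncated, so the slide's counts cannot be checked against it.

```python
from collections import Counter


def kgram_counts(seq, atg_pos, k=3, window=100, region="downstream", in_frame=True):
    """Count k-grams near the candidate ATG that starts at 0-based index atg_pos."""
    if region == "downstream":
        start = atg_pos + 3                      # assumption: skip the ATG itself
        end = min(len(seq), start + window)
    else:                                        # upstream
        end = atg_pos
        start = max(0, atg_pos - window)
    counts = Counter()
    for i in range(start, end - k + 1):
        # In-frame k-grams start at a multiple of 3 bases from the ATG.
        if in_frame and (i - atg_pos) % 3 != 0:
            continue
        counts[seq[i:i + k]] += 1
    return counts


# Toy sequence (assumed): GCT AAG ATG GCT GCT TTT ATG TAA, candidate ATG at index 6.
seq = "GCTAAGATGGCTGCTTTTATGTAA"
down_in = kgram_counts(seq, 6, region="downstream", in_frame=True)
down_any = kgram_counts(seq, 6, region="downstream", in_frame=False)
up_in = kgram_counts(seq, 6, region="upstream", in_frame=True)
print(down_in["GCT"], down_any["TTT"], up_in["GCT"])
```

Each (k-gram, region, frame) combination becomes one numeric feature of the candidate ATG.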

45 Signal Generation: Too Many Signals
– For each value of k, there are 4^k * 3 * 2 k-grams
– If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!
– This is too many for most machine learning algorithms
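
The feature count can be checked with a short calculation. This sketch assumes the factor 3 stands for upstream, downstream, or anywhere in the window, and the factor 2 for in-frame or any frame; that reading is an interpretation of the earlier signal-generation slide, not something stated here. Applied to k = 1 to 5, the formula alone gives 8184; the slide's total of 8188 includes an additional leading term of 4.

```python
def n_kgram_features(k, regions=3, frames=2):
    """Number of k-gram features: 4**k possible k-grams per region/frame choice."""
    return 4 ** k * regions * frames


per_k = [n_kgram_features(k) for k in range(1, 6)]
total = sum(per_k)
print(per_k, total)
```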

46 Translation Initiation Site Recognition: Signal Selection (e.g., χ²)

47 Translation Initiation Site Recognition: Signal Selection (e.g., CFS)
Instead of scoring individual signals, how about scoring a group of signals as a whole?
CFS
– Correlation-based Feature Selection
– A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
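
The idea of scoring a group as a whole can be sketched with the merit heuristic usually associated with CFS, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The slide does not give a formula, so treat this as an assumption; using plain Pearson correlation and the names `pearson` and `cfs_merit` are also illustrative choices.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def cfs_merit(features, labels):
    """Merit of a feature group: high class correlation, low mutual correlation."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy data (assumed): f2 duplicates f1; f3 is equally informative but errs elsewhere.
y = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 0, 1, 1, 1, 1, 1]
f2 = list(f1)
f3 = [1, 0, 0, 0, 1, 1, 1, 1]

print(cfs_merit([f1], y), cfs_merit([f1, f2], y), cfs_merit([f1, f3], y))
```

Adding the redundant copy `f2` leaves the merit unchanged, while adding the less correlated `f3` raises it, which matches the slide's description of a good group.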

48 Signal Selection: Sample k-grams Selected
– Position –3: Kozak consensus
– In-frame upstream ATG: Leaky scanning
– In-frame downstream TAA, TAG, TGA: Stop codon
– In-frame downstream CTG, GAC, GAG, and GCC: Codon bias

49 Translation Initiation Site Recognition: Signal Integration
– kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
– SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
– Naïve Bayes, ANN, C4.5, PCL, ...
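
The kNN rule described above can be sketched in a few lines. This is a minimal illustration assuming numeric feature vectors and Euclidean distance as the similarity measure; the slide does not specify a distance, and the function name `knn_predict` and toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority class among the k training samples closest to the query.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (assumed): two well-separated groups of 2-D feature vectors.
train = [
    ((0, 0), "non-TIS"), ((0, 1), "non-TIS"), ((1, 0), "non-TIS"),
    ((5, 5), "TIS"), ((5, 6), "TIS"), ((6, 5), "TIS"),
]
print(knn_predict(train, (5, 5.5)), knn_predict(train, (0.2, 0.1)))
```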

50 Translation Initiation Site Recognition: Results (on Pedersen & Nielsen’s mRNA)

51 Translation Initiation Site Recognition: mRNA → protein
[Figure: genetic code table mapping codons to amino acids and stop]
How about using k-grams from the translation?
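
Generating k-grams from the translation can be sketched as follows. The codon table is the standard genetic code written for DNA letters; the function names `translate` and `aa_kgrams` are illustrative, and stopping at the first in-frame stop codon is an assumption about how the translation would be cut off. The input fragment is taken from the first line of the sample mRNA shown earlier, starting at its first ATG.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, start=0):
    """Translate codon by codon from start until a stop codon or the sequence end."""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def aa_kgrams(protein, k=2):
    """Count k consecutive amino-acid letters in the translated sequence."""
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


protein = translate("ATGGCTGAACACTGACTCC")
print(protein, aa_kgrams(protein, k=2))
```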

52 Signal Generation: Amino-Acid Features

53 Signal Generation: Amino-Acid Features

54 Signal Selection: Amino Acid K-grams Discovered

55 Translation Initiation Site Recognition: Results (based on amino acid features)
– Performance based on amino-acid features is better than performance based on DNA seq. features

56 Acknowledgements: Huiqing Liu, Jinyan Li, Roland Yap, Zeng Fanfan, A.G. Pedersen, H. Nielsen

57 To give this lecture to SMA students. Date: 28 Oct 2003. Time: 10-11.30am. Venue: Video Conference Room, S15-04-30

