From Informatics to Bioinformatics
Limsoon Wong, Laboratories for Information Technology, Singapore

What is Bioinformatics?

Themes of Bioinformatics
- Bioinformatics = Data Mgmt + Knowledge Discovery
- Data Mgmt = Integration + Transformation + Cleansing
- Knowledge Discovery = Statistics + Algorithms + Databases

Benefits of Bioinformatics
- To the patient: Better drug, better treatment
- To the pharma: Save time, save cost, make more $
- To the scientist: Better science

From Informatics to Bioinformatics
- Integration Technology (Kleisli)
- Cleansing & Warehousing (FIMM)
- MHC-Peptide Binding (PREDICT)
- Protein Interactions Extraction (PIES)
- Gene Expression & Medical Record Datamining (PCL)
- Gene Feature Recognition (Dragon)
- Venom Informatics
years of bioinformatics R&D in Singapore: ISS, KRDL, LIT

Quick Samplings

Data Integration
A DOE “impossible query”: For each gene on a given cytogenetic band, find its non-human homologs.

Data Integration Results
    sybase-add (#name: "GDB", ...);
    create view L from locus_cyto_location using GDB;
    create view E from object_genbank_eref using GDB;
    select #accn: g.#genbank_ref, #nonhuman-homologs: H
    from L as c, E as g,
         {select u
          from g.#genbank_ref.na-get-homolog-summary as u
          where not(u.#title string-islike "%Human%")
            andalso not(u.#title string-islike "%H.sapien%")} as H
    where c.#chrom_num = "22"
      andalso g.#object_id = c.#locus_id
      andalso not (H = { });
Using Kleisli:
- Clear
- Succinct
- Efficient
- Handles heterogeneity and complexity
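
The logic of the query above can be sketched in Python over in-memory data. This is only an illustration of the join and the nested filter, not how Kleisli executes it: the three small tables, the accession names (`ACC_A` etc.), and the helper `get_homolog_summary` are hypothetical stand-ins for the GDB views and the remote `na-get-homolog-summary` call.

```python
# Hypothetical stand-ins for the GDB tables used in the query.
locus_cyto_location = [
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "21"},
    {"locus_id": 3, "chrom_num": "22"},
]
object_genbank_eref = [
    {"object_id": 1, "genbank_ref": "ACC_A"},
    {"object_id": 2, "genbank_ref": "ACC_B"},
    {"object_id": 3, "genbank_ref": "ACC_C"},
]
# Hypothetical homolog summaries keyed by accession (made-up titles).
homolog_summaries = {
    "ACC_A": [{"title": "Mouse gene x"}, {"title": "Human gene x"}],
    "ACC_B": [{"title": "Rat gene y"}],
    "ACC_C": [{"title": "H.sapiens gene z"}],
}


def get_homolog_summary(accession):
    """Stand-in for the remote homolog lookup; returns a list of summaries."""
    return homolog_summaries.get(accession, [])


def nonhuman_homologs(chrom_num):
    """For each gene on the given chromosome, collect its non-human homologs."""
    results = []
    for c in locus_cyto_location:
        if c["chrom_num"] != chrom_num:
            continue
        for g in object_genbank_eref:
            if g["object_id"] != c["locus_id"]:
                continue
            # Nested subquery: drop summaries whose title mentions human.
            h = [
                u for u in get_homolog_summary(g["genbank_ref"])
                if "Human" not in u["title"] and "H.sapien" not in u["title"]
            ]
            if h:  # corresponds to: not (H = { })
                results.append({"accn": g["genbank_ref"], "nonhuman_homologs": h})
    return results


if __name__ == "__main__":
    print(nonhuman_homologs("22"))
```

The nested list `h` inside each result mirrors the nested set `H` in the query's output, which a flat relational result could not hold directly.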

Data Warehousing
Motivation:
- efficiency
- availability
- “denial of service”
- data cleansing
Requirements:
- efficient to query
- easy to update
- model data naturally
    {(#uid: ,
      #title: "Homo sapiens adrenergic...",
      #accession: "NM_001619",
      #organism: "Homo sapiens",
      #taxon: 9606,
      #lineage: ["Eukaryota", "Metazoa", …],
      #seq: "CTCGGCCTCGGGCGCGGC...",
      #feature: {
        (#name: "source",
         #continuous: true,
         #position: [(#accn: "NM_001619", #start: 0, #end: 3602, #negative: false)],
         #anno: [(#anno_name: "organism", #descr: "Homo sapiens"), …]),
        …)}

Data Warehousing Results
- A relational DBMS alone is insufficient because it forces us to fragment data into 3NF.
- Kleisli turns a flat relational DBMS into a nested relational DBMS.
- It can use a flat relational DBMS such as Sybase, Oracle, MySQL, etc. as its updatable complex object store.
    ! Log in
    oracle-cplobj-add (#name: "db", ...);
    ! Define table
    create table GP (#uid: "NUMBER", #detail: "LONG") using db;
    ! Populate table with GenPept reports
    select #uid: x.#uid, #detail: x into GP
    from aa-get-seqfeat-general "PTP" as x using db;
    ! Map GP to that table
    create view GP from GP using db;
    ! Run a query to get title of
    select x.#detail.#title from GP as x where x.#uid = ;
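
The idea of keeping a nested record whole inside a flat table can be sketched with Python's standard `sqlite3` and `json` modules. This is an assumption-laden analogy, not Kleisli's actual storage format: the serialisation to JSON, the helper names `make_store`, `put`, and `get_title`, and the uid value `1` are all hypothetical.

```python
import json
import sqlite3


def make_store():
    """Create an in-memory flat table with a key column and one opaque detail column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")
    return conn


def put(conn, uid, record):
    """Store the whole nested record in a single column instead of fragmenting it."""
    conn.execute(
        "INSERT OR REPLACE INTO GP (uid, detail) VALUES (?, ?)",
        (uid, json.dumps(record)),
    )


def get_title(conn, uid):
    """Fetch one record by uid and project out its title field."""
    row = conn.execute("SELECT detail FROM GP WHERE uid = ?", (uid,)).fetchone()
    return None if row is None else json.loads(row[0])["title"]


if __name__ == "__main__":
    conn = make_store()
    # Field values copied from the sample record; the uid 1 is made up.
    record = {
        "title": "Homo sapiens adrenergic...",
        "accession": "NM_001619",
        "organism": "Homo sapiens",
        "taxon": 9606,
        "feature": [{"name": "source", "continuous": True}],
    }
    put(conn, 1, record)
    print(get_title(conn, 1))
```

The trade-off in this sketch is that the database only indexes `uid`; anything inside `detail` has to be unpacked by the layer above, which is the role the slide assigns to Kleisli.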

Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results
- Prediction by our ANN model for HLA-A11: 29 predictions, 22 epitopes, 76% specificity
- Rank by BIMAS vs. number of experimental binders: 19 (52.8%), 5 (13.9%), 12 (33.3%)
- Prediction by BIMAS matrix for HLA-A*1101

Transcription Start Prediction

Transcription Start Prediction Results

Medical Record Analysis
Looking for patterns that are:
- valid
- novel
- useful
- understandable

Gene Expression Analysis
- Classifying gene expression profiles
- find stable differentially expressed genes
- find significant gene groups
- derive coordinated gene expression

Medical Record & Gene Expression Analysis Results
- PCL, a novel “emerging pattern” method
- Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
- Works well for gene expression data (Cancer Cell, March 2002, 1(2))

Protein Interaction Extraction
“What are the protein-protein interaction pathways from the latest reported discoveries?”

Protein Interaction Extraction Results
- Rule-based system for processing free texts in scientific abstracts
- Specialized in:
  - extracting protein names
  - extracting protein-protein interactions

Behind the Scene
Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang, Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang, and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators.

A More Detailed Account

What is Datamining?
- Jonathan’s rules: Blue or Circle
- Jessica’s rules: All the rest
- Whose block is this? (Jonathan’s blocks vs. Jessica’s blocks)
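
The two rules on this slide amount to a tiny classifier. A minimal sketch, assuming each block is described by a colour and a shape given as lowercase strings (the function name `block_owner` is made up for illustration):

```python
def block_owner(colour, shape):
    """Apply the slide's rules: blue or circle goes to Jonathan, all the rest to Jessica."""
    if colour == "blue" or shape == "circle":
        return "Jonathan"
    return "Jessica"


if __name__ == "__main__":
    for block in [("blue", "square"), ("red", "circle"), ("red", "square")]:
        print(block, "->", block_owner(*block))
```

Datamining runs in the opposite direction: given only the labelled blocks, it tries to recover rules like these.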

What is Datamining?
Question: Can you explain how?

The Steps of Data Mining
- Training data gathering
- Signal generation: k-grams, colour, texture, domain know-how, ...
- Signal selection: entropy, χ², CFS, t-test, domain know-how, ...
- Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...

Translation Initiation Recognition

A Sample mRNA
    299 HSU CAT U27655 Homo sapiens
    CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
    CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
    GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
    CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
    iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?

Signal Generation
- K-grams (i.e., k consecutive letters)
  - K = 1, 2, 3, 4, 5, …
  - Window size vs. fixed position
  - Upstream, downstream vs. anywhere in window
  - In-frame vs. any frame
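
One way to generate such signals is to count the in-frame k-grams on either side of a candidate ATG. The sketch below is an assumption about how this might be coded, not the authors' implementation: the function names, the default window of 99 bases, and the toy sequence are all hypothetical.

```python
from collections import Counter


def find_atgs(seq):
    """Return the 0-based start positions of every ATG in the sequence."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def kgram_features(seq, atg_pos, k=3, window=99):
    """Count in-frame k-grams upstream and downstream of a candidate ATG.

    In-frame means the k-gram starts a multiple of 3 bases away from atg_pos.
    Keys are (region, k-gram) pairs with region "up" or "down".
    """
    feats = Counter()
    lo = max(0, atg_pos - window)
    hi = min(len(seq), atg_pos + 3 + window)
    # Upstream: step back one codon at a time, staying clear of the ATG itself.
    for p in range(atg_pos - 3, lo - 1, -3):
        if p + k <= atg_pos:
            feats[("up", seq[p:p + k])] += 1
    # Downstream: start right after the ATG and step forward one codon at a time.
    for p in range(atg_pos + 3, hi - k + 1, 3):
        feats[("down", seq[p:p + k])] += 1
    return feats


if __name__ == "__main__":
    toy = "ATGCCCATGAAATAG"  # made-up sequence with two ATGs
    for pos in find_atgs(toy):
        print(pos, dict(kgram_features(toy, pos)))
```

Each candidate ATG becomes one sample, and each (region, k-gram) count becomes one feature for the later selection step.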

Too Many Signals
- For each value of k, there are 4^k * 3 * 2 k-grams
- If we use k = 1, 2, 3, 4, 5, we have 8188 features!
- This is too many for most machine learning algorithms
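
The count can be checked with a few lines of arithmetic. In this sketch the factors 3 and 2 are read as three regions (upstream, downstream, anywhere in window) and two framings (in-frame, any frame); that reading is an assumption based on the previous slide, and the parameter names are made up.

```python
def kgram_feature_count(k, alphabet_size=4, regions=3, framings=2):
    """Number of distinct k-gram features for one value of k: 4^k * 3 * 2."""
    return alphabet_size ** k * regions * framings


# Per-k counts and their total for k = 1..5.
counts = {k: kgram_feature_count(k) for k in range(1, 6)}
total = sum(counts.values())

if __name__ == "__main__":
    for k, n in counts.items():
        print(f"k={k}: {n}")
    print("total:", total)
```

Applied as written, the formula gives 24 + 96 + 384 + 1536 + 6144 = 8184, four fewer than the 8188 quoted on the slide. The terms of the slide's own sum are not shown in the transcript, so the source of the difference cannot be resolved here.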

Signal Selection (e.g., χ²)
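
As a rough illustration of χ²-based selection, the sketch below computes the Pearson χ² statistic for a contingency table of signal presence against class, then ranks signals by it. The signal names and counts are invented for the example, and the helper names are hypothetical; the slide does not say how the statistic was implemented.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def rank_signals(tables):
    """Sort signal names by descending chi-square statistic."""
    return sorted(tables, key=lambda name: chi_square(tables[name]), reverse=True)


if __name__ == "__main__":
    # Rows: signal present / absent. Columns: TIS / non-TIS. Counts are made up.
    tables = {
        "signal_a": [[30, 10], [10, 30]],
        "signal_b": [[20, 20], [20, 20]],
    }
    for name in rank_signals(tables):
        print(name, chi_square(tables[name]))
```

A signal whose presence is independent of the class scores near zero and would be discarded; only the top-ranked signals are passed on to the integration step.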

Sample k-grams Selected
- Position –3 (Kozak consensus)
- in-frame upstream ATG (leaky scanning)
- in-frame downstream
  - TAA, TAG, TGA (stop codon)
  - CTG, GAC, GAG, and GCC (codon bias)

Signal Integration
- kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
- SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin between the classes.
- Naïve Bayes, ANN, C4.5, ...
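
The kNN rule described above is short enough to sketch directly. This version assumes numeric feature vectors and squared Euclidean distance as the similarity measure; both choices, the function name, and the toy training data are assumptions for illustration.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training samples.

    train is a list of (feature_vector, label) pairs.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    neighbours = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up two-dimensional signal vectors with their classes.
    train = [
        ((0, 0), "non-TIS"),
        ((0, 1), "non-TIS"),
        ((5, 5), "TIS"),
        ((6, 5), "TIS"),
        ((5, 6), "TIS"),
    ]
    print(knn_predict(train, (5, 4)))
```

An odd k avoids ties in a two-class problem such as TIS versus non-TIS.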

Results (on Pedersen & Nielsen’s mRNA)

Acknowledgements
- Roland Yap
- Zeng Fanfan
- A.G. Pedersen
- H. Nielsen

Questions?