Limsoon Wong Laboratories for Information Technology Singapore From Informatics to Bioinformatics.


What is Bioinformatics?

Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases

Benefits of Bioinformatics
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more $
To the scientist: Better science

From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore
Timeline: 1994, 1996, 1998, 2000, 2002 (ISS, KRDL, LIT)
Integration Technology (Kleisli)
Cleansing & Warehousing (FIMM)
MHC-Peptide Binding (PREDICT)
Protein Interactions Extraction (PIES)
Gene Expression & Medical Record Datamining (PCL)
Gene Feature Recognition (Dragon)
Venom Informatics

Quick Samplings

Data Integration A DOE “impossible query”: For each gene on a given cytogenetic band, find its non-human homologs.

Data Integration Results

sybase-add (#name: "GDB", ...);
create view L from locus_cyto_location using GDB;
create view E from object_genbank_eref using GDB;
select
  #accn: g.#genbank_ref,
  #nonhuman-homologs: H
from
  L as c,
  E as g,
  {select u
   from g.#genbank_ref.na-get-homolog-summary as u
   where not(u.#title string-islike "%Human%")
     andalso not(u.#title string-islike "%H.sapien%")} as H
where
  c.#chrom_num = "22"
  andalso g.#object_id = c.#locus_id
  andalso not (H = { });

Using Kleisli: clear; succinct; efficient; handles heterogeneity and complexity.
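To make the shape of the query above easier to follow, here is a minimal Python sketch of the same join-and-filter logic. It is not Kleisli and not the GDB schema: the three toy tables, their contents, and the helper `get_homolog_summary` (a stand-in for `na-get-homolog-summary`) are all invented for illustration.

```python
# Toy stand-ins for the two GDB views and the remote homolog lookup.
# Every value below is invented; only the field names echo the query.
locus_cyto_location = [
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "7"},
    {"locus_id": 3, "chrom_num": "22"},
]
object_genbank_eref = [
    {"object_id": 1, "genbank_ref": "ACC_A"},
    {"object_id": 2, "genbank_ref": "ACC_B"},
    {"object_id": 3, "genbank_ref": "ACC_C"},
]
homolog_summaries = {
    "ACC_A": [{"title": "Human gene X"}, {"title": "Mouse gene X"}],
    "ACC_B": [{"title": "Rat gene Y"}],
    "ACC_C": [{"title": "H.sapiens gene Z"}],
}


def get_homolog_summary(accession):
    """Hypothetical replacement for the na-get-homolog-summary lookup."""
    return homolog_summaries.get(accession, [])


def nonhuman_homologs(chrom):
    """For each gene on chromosome `chrom`, collect its non-human homologs."""
    results = []
    for c in locus_cyto_location:
        if c["chrom_num"] != chrom:
            continue
        for g in object_genbank_eref:
            if g["object_id"] != c["locus_id"]:
                continue
            # Nested subquery: keep only summaries whose title is not human.
            hits = [
                u for u in get_homolog_summary(g["genbank_ref"])
                if "Human" not in u["title"] and "H.sapien" not in u["title"]
            ]
            if hits:  # mirrors "not (H = { })"
                results.append({"accn": g["genbank_ref"],
                                "nonhuman_homologs": hits})
    return results
```

Calling `nonhuman_homologs("22")` on the toy data keeps only the gene that has at least one non-human homolog.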

Data Warehousing
Motivation: efficiency; availability; “denial of service”; data cleansing
Requirements: efficient to query; easy to update; model data naturally

{(#uid: 6138971,
  #title: "Homo sapiens adrenergic...",
  #accession: "NM_001619",
  #organism: "Homo sapiens",
  #taxon: 9606,
  #lineage: ["Eukaryota", "Metazoa", …],
  #seq: "CTCGGCCTCGGGCGCGGC...",
  #feature: {
    (#name: "source",
     #continuous: true,
     #position: [ (#accn: "NM_001619", #start: 0, #end: 3602, #negative: false)],
     #anno: [ (#anno_name: "organism", #descr: "Homo sapiens"), …] ),
    …)}

Data Warehousing Results
A relational DBMS alone is insufficient because it forces us to fragment data into 3NF. Kleisli turns a flat relational DBMS into a nested relational DBMS. It can use a flat relational DBMS such as Sybase, Oracle, or MySQL as its updatable complex object store.

! Log in
oracle-cplobj-add (#name: "db", ...);
! Define table
create table GP (#uid: "NUMBER", #detail: "LONG") using db;
! Populate table with GenPept reports
select #uid: x.#uid, #detail: x into GP from aa-get-seqfeat-general "PTP" as x using db;
! Map GP to that table
create view GP from GP using db;
! Run a query to get title of 131470
select x.#detail.#title from GP as x where x.#uid = 131470;
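The idea of keeping a whole nested record in one column of a flat table can be sketched in Python. This is only an analogy, not how Kleisli is implemented: it assumes SQLite as the flat store and JSON as the serialisation, and the two reports and the helper `title_of` are invented.

```python
import json
import sqlite3

# A flat two-column table, analogous to GP (#uid, #detail) on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")

# Invented nested records standing in for GenPept reports.
reports = [
    {"uid": 131470, "title": "toy protein report",
     "feature": [{"name": "source", "start": 0, "end": 10}]},
    {"uid": 131471, "title": "another toy report", "feature": []},
]

# Store each complex object whole, serialised into the detail column,
# instead of fragmenting it across several normalised tables.
for r in reports:
    conn.execute("INSERT INTO GP (uid, detail) VALUES (?, ?)",
                 (r["uid"], json.dumps(r)))


def title_of(uid):
    """Fetch one record by uid and navigate into the nested object."""
    row = conn.execute("SELECT detail FROM GP WHERE uid = ?", (uid,)).fetchone()
    return json.loads(row[0])["title"] if row else None
```

Here `title_of(131470)` plays the role of the final query on the slide: select by the flat key, then project a field out of the nested detail.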

Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results
- Prediction by our ANN model for HLA-A11
  - 29 predictions
  - 22 epitopes
  - 76% specificity
- Prediction by BIMAS matrix for HLA-A*1101
  [Figure: number of experimental binders by BIMAS rank (rank axis: 1, 66, 100): 19 (52.8%), 5 (13.9%), 12 (33.3%)]

Transcription Start Prediction

Transcription Start Prediction Results

Medical Record Analysis
- Looking for patterns that are
  - valid
  - novel
  - useful
  - understandable

Gene Expression Analysis
- Classifying gene expression profiles
  - find stable differentially expressed genes
  - find significant gene groups
  - derive coordinated gene expression

Medical Record & Gene Expression Analysis Results
- PCL, a novel “emerging pattern” method
- Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
- Works well for gene expressions
Cancer Cell, March 2002, 1(2)

Protein Interaction Extraction “What are the protein-protein interaction pathways from the latest reported discoveries?”

Protein Interaction Extraction Results
- Rule-based system for processing free texts in scientific abstracts
- Specialized in
  - extracting protein names
  - extracting protein-protein interactions

Behind the Scene
Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang, Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang
and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators…

A More Detailed Account

What is Datamining?
Whose block is this?
[Figure: Jonathan’s blocks and Jessica’s blocks]
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
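Once the two rules are known, they can be written down as a tiny classifier. The sketch below assumes each block is described only by a colour and a shape; the function name `owner` and the lowercase string encoding are illustrative choices. Data mining is the reverse task: recovering such a rule from labelled blocks.

```python
def owner(colour, shape):
    """Apply the slide's rules: blue or circle is Jonathan's, the rest is Jessica's."""
    if colour == "blue" or shape == "circle":
        return "Jonathan"
    return "Jessica"
```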

What is Datamining? Question: Can you explain how?

The Steps of Data Mining
- Training data gathering
- Signal generation
  - k-grams, colour, texture, domain know-how, ...
- Signal selection
  - Entropy, χ², CFS, t-test, domain know-how, ...
- Signal integration
  - SVM, ANN, PCL, CART, C4.5, kNN, ...

Translation Initiation Recognition

A Sample mRNA

299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

What makes the second ATG the translation initiation site?

Signal Generation
- K-grams (i.e., k consecutive letters)
  - K = 1, 2, 3, 4, 5, …
  - Window size vs. fixed position
  - Up-stream, downstream vs. anywhere in window
  - In-frame vs. any frame
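One way to read the options above is as a recipe for turning each candidate ATG into a bag of counted k-grams. The sketch below covers a single combination (in-frame k-grams, split into upstream and downstream, within a fixed window); the function names, the default window of 30 bases, and the toy sequence in the usage note are assumptions for illustration, not the settings used in the talk.

```python
from collections import Counter


def find_atgs(seq):
    """Return the 0-based start position of every ATG in seq."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def inframe_kgrams(seq, atg_pos, k=3, window=30):
    """Count in-frame k-grams upstream and downstream of a candidate ATG.

    "In-frame" here means the k-gram starts a multiple of 3 bases away
    from the ATG. Upstream k-grams must end before the ATG begins.
    """
    feats = Counter()
    lo = max(0, atg_pos - window)
    hi = min(len(seq), atg_pos + 3 + window)
    # Walk backwards from the ATG in steps of one codon.
    for start in range(atg_pos - 3, lo - 1, -3):
        if start + k <= atg_pos:
            feats[("up", seq[start:start + k])] += 1
    # Walk forwards from the codon right after the ATG.
    for start in range(atg_pos + 3, hi - k + 1, 3):
        feats[("down", seq[start:start + k])] += 1
    return feats
```

For the toy sequence `"GCCATGGCTTAA"`, `find_atgs` reports one candidate at position 3, and `inframe_kgrams` counts one upstream `GCC` plus downstream `GCT` and `TAA`.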

Too Many Signals
- For each value of k, there are 4^k * 3 * 2 k-grams
- If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!
- This is too many for most machine learning algorithms
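The count can be checked with a few lines of Python. The factors 3 and 2 presumably correspond to the three placements (upstream, downstream, anywhere in window) and the two frame options (in-frame, any frame) from the Signal Generation slide; that reading is an assumption. Note that the formula alone gives 24 + 96 + 384 + 1536 + 6144 = 8184 for k = 1 to 5; the slide's total of 8188 includes an extra leading term of 4 whose origin the slide does not explain.

```python
def kgram_feature_count(k):
    """Number of k-gram features for one k: 4^k strings x 3 placements x 2 frames."""
    return 4 ** k * 3 * 2


# Per-k counts for k = 1..5 and their sum, using the slide's formula only.
counts = [kgram_feature_count(k) for k in range(1, 6)]
formula_total = sum(counts)

# The slide's sum adds one more term of 4 on top of the formula's values.
slide_total = 4 + formula_total
```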

Signal Selection (e.g., χ²)
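As a reminder of what the χ² score measures, here is a minimal sketch for one binary signal (k-gram present or absent) against a binary class (true site or not). It computes the standard Pearson statistic from a 2x2 table of observed counts; the function name and the example tables in the usage note are illustrative, and this is not necessarily the exact procedure used for the results in this talk.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table of observed counts.

    table[i][j] = number of samples with signal value i and class j.
    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if signal and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat
```

A signal spread evenly across both classes, such as `[[10, 10], [10, 10]]`, scores 0, while a signal that separates the classes perfectly, such as `[[20, 0], [0, 20]]`, scores 40. Signals are then ranked by score and only the top ones are kept.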

Sample k-grams Selected
- Position –3 (Kozak consensus)
- In-frame upstream ATG (leaky scanning)
- In-frame downstream
  - TAA, TAG, TGA (stop codon)
  - CTG, GAC, GAG, and GCC (codon bias)

Signal Integration
- kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
- SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
- Naïve Bayes, ANN, C4.5, ...
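The kNN description translates almost directly into code. The sketch below assumes numeric feature vectors and squared Euclidean distance as the similarity measure; both are illustrative choices, as are the function name and the labels used in the usage note.

```python
from collections import Counter


def knn_predict(train, test_point, k=3):
    """Classify test_point by majority vote among its k nearest training samples.

    train is a list of (feature_vector, label) pairs.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Sort training samples by distance to the test point and keep the k closest.
    nearest = sorted(train, key=lambda item: sq_dist(item[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With three training points near the origin labelled "non-TIS" and three near (5, 5) labelled "TIS", a test point at (5, 4) is assigned "TIS". An odd k avoids tied votes when there are two classes.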

Results (on Pedersen & Nielsen’s mRNA)

Acknowledgements
- Roland Yap
- Zeng Fanfan
- A.G. Pedersen
- H. Nielsen

Questions?
