From Informatics to Bioinformatics Limsoon Wong


1 From Informatics to Bioinformatics Limsoon Wong
LIT was formed in January 2002 from the merger of Kent Ridge Digital Labs (KRDL) and the Centre for Signal Processing (CSP) in Singapore. LIT performs research and manpower training in distributed systems, ubiquitous computing, knowledge discovery, media engineering, and signal processing. It seeded bioinformatics R&D in Singapore in 1994 by starting the Kleisli project. LIT has about 200 researchers and an annual budget of about US$25M. Limsoon Wong is Deputy Director of LIT. He is concurrently an adjunct associate professor at the National University of Singapore School of Computing. Prior to his current position, he directed bioinformatics research at KRDL for 8 years. He received his BSc(Eng) from Imperial College London and his PhD from the University of Pennsylvania in Philadelphia. He is a recipient of the Rubinoff Award (1995), the National Academy of Science Young Scientist Award (1997), the Tan Kah Kee Young Inventors Gold Award (1997), the ASEAN Certificate of Achievements (1998), and the Singapore Youth Award (1999). Limsoon has done break-through research in several areas. He is well known for his theorems on the "Bounded Degree Property" of Nested Relational Calculus and SQL. He contributed to the final solution of the "Kanellakis Conjecture" in Finite Model Theory. He invented the "Kleisli Query System", which was the first broad-scale data integration system that solved many of the so-called "impossible" bioinformatics integration problems identified by the US Department of Energy in 1993. His current research is directed at emerging pattern-based data mining techniques.
Limsoon Wong, Institute for Infocomm Research, Singapore

2 What is Bioinformatics?
Modern molecular biology and medical research involve an increasing amount of data, as well as an increasing variety of data. The use of informatics to organize, manage, and analyse these data has consequently become an important element of biology and medical research. Bioinformatics is the fusion of computing, mathematics, and biology to address this need. The effective deployment of Bioinformatics requires the user to have a reasonable idea of the questions that he wants answered. Then, for each such question, Bioinformatics can be used first to organize the relevant data and then to analyse these data to make predictions or to draw conclusions.

3 Themes of Bioinformatics
Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
In this talk, we consider two major themes in Bioinformatics, viz. data management and knowledge discovery. Data management involves tasks such as integration of relevant data from various sources, transformation of the integrated data into more suitable forms for analysis, cleansing of data to avoid errors in analysis, etc. Knowledge discovery involves the construction of databases and the application of statistics and data mining algorithms to extract various information from these databases, such as prediction models for disease diagnosis. Both themes of Bioinformatics rely on the effective adoption of techniques developed in Computer Science and Mathematics for biological data. We will describe a few of them in subsequent slides, using recent results obtained in our lab.

4 Benefits of Bioinformatics
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more $
To the scientist: Better science

5 From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore:
MHC-Peptide Binding (PREDICT)
Protein Interactions Extraction (PIES)
Gene Expression & Medical Record Datamining (PCL)
Cleansing & Warehousing (FIMM)
Gene Feature Recognition (Dragon)
Integration Technology (Kleisli)
Venom Informatics
Timeline: 1994 – 1996 – 1998 – 2000 – 2002 (ISS → KRDL → LIT/I2R)
In the beginning, when bioinformatics was first started in LIT (then called ISS), we worked on data integration technology. That required only extremely good computer science, but almost no biology. As we acquired slightly more biology background, we began constructing specialized high value-added databases for biologists. We focused then on immunology. We had thus entered the data cleansing and warehousing phase of our development. Once we had a sufficient amount of information in our immunology warehouse, which concentrated on the binding of peptides to MHC molecules, we constructed highly accurate models for predicting epitopes (immunogenic peptides that bind MHC molecules). This of course required significantly more biology. By the end of that, around 2000, I would say we completed our successful transition from informatics to bioinformatics. We launched ourselves into a diverse range of projects dealing with many different aspects of bioinformatics knowledge discovery. Today, we have projects on extracting protein interactions from texts, on recognizing gene features from genomic DNA sequences, on analysing medical records and gene expression, and on the study of toxins and ion channels. In the rest of this talk, I will show you some of our past and present results.

6 Data Integration A DOE “impossible query”:
For each gene on a given cytogenetic band, find its non-human homologs.
The first example is that of data integration. Many questions that a biologist is interested in cannot be answered using any single data source. However, quite a few of these queries can be satisfactorily solved by using information from several sources. Unfortunately, this has proved to be quite difficult in practice. In fact, the US Dept of Energy published a list of queries that it considered "impossible" to solve in 1993. The interesting thing about these queries was that there was a conceptually straightforward answer to each of them using the databases existing in 1993. What made them "impossible" was that the databases needed were geographically distributed, were running on different computer systems with different capabilities, and had very different formats. An example of the US Dept of Energy's "impossible queries" is given in this slide. It required two databases, viz. GDB for information on which gene was on which cytogenetic band, and Entrez for information on which gene was a homolog of which other genes. GDB was then located in Baltimore and was a Sybase relational database that supported SQL queries. Entrez was then located in Bethesda and was to be accessed through an ASN.1 interface that supported simple keyword indexing.

7 Data Integration Results
sybase-add (#name: "GDB", ...);
create view L from locus_cyto_location using GDB;
create view E from object_genbank_eref using GDB;
select #accn: g.#genbank_ref,
       #nonhuman-homologs: H
from L as c, E as g,
     {select u
      from g.#genbank_ref.na-get-homolog-summary as u
      where not(u.#title string-islike "%Human%")
      andalso not(u.#title string-islike "%H.sapien%")} as H
where c.#chrom_num = "22"
andalso g.#object_id = c.#locus_id
andalso not (H = { });
Using Kleisli: clear, succinct, efficient, handles heterogeneity and complexity.
Kleisli is a broad-scale data integration system. It allows many data sources to be viewed as if they reside within a federated nested relational database system. It automatically handles heterogeneity so that a user can formulate his queries in a way that is independent of the geographic location of the data sources, independent of whether the data source is a sophisticated relational database system or a dumb flat file, and independent of the access protocols to these data sources. It also has a good query optimizer, so that a user can formulate his queries in a clear and succinct way without having to worry about whether the queries will run fast. The first prototype of Kleisli was constructed in early 1994. That very primitive prototype became the first general query system to solve those "impossible queries" published in 1993 by the US Department of Energy. This slide shows a solution in Kleisli to the example "impossible" query given in the previous slide. Kleisli is licensed to GeneticXchange of Menlo Park and serves as the backbone of their system. Please visit their website for more information.

8 Data Warehousing
{(#uid: ...,
  #title: "Homo sapiens adrenergic ...",
  #accession: "NM_001619",
  #organism: "Homo sapiens",
  #taxon: 9606,
  #lineage: ["Eukaryota", "Metazoa", ...],
  #seq: "CTCGGCCTCGGGCGCGGC...",
  #feature: {
    (#name: "source",
     #continuous: true,
     #position: [(#accn: "NM_001619", #start: 0, #end: 3602, #negative: false)],
     #anno: [(#anno_name: "organism", #descr: "Homo sapiens"), ...]),
    ...})}
Motivation: efficiency, availability, avoiding "denial of service", data cleansing.
Requirements: efficient to query, easy to update, model data naturally.
Besides querying data sources on the fly, there is also a great need by biologists and biotechnology companies to create their own customized data warehouses. These warehouses are motivated by the following factors. Execution of queries can be more efficient, assuming data reside locally on a powerful database system. Execution of queries can be more reliable, assuming data reside locally on a high-availability database system and high-availability network. Execution of queries on a local warehouse avoids unintended "denial of service" attacks on the original sources. Most importantly, many public sources contain errors. Some of these errors cannot be corrected or detected on the fly. Hence, human effort must be used (perhaps assisted by computers) to perform cleansing. The cleansed data are warehoused to avoid repeating this task. The requirements of a warehouse of biological data are that it should be efficient to query, easy to update, and that it should model data naturally. This last requirement is very important because biological data, such as the GenBank report shown in this slide, have a very complex nesting structure. Warehousing such data in a radically different form is likely to cause problems later in the effective use of these data.

9 Data Warehousing Results
Relational DBMS is insufficient because it forces us to fragment data into 3NF. Kleisli turns a flat relational DBMS into a nested relational DBMS. It can use flat relational DBMS such as Sybase, Oracle, MySQL, etc. as its update-able complex object store.
! Log in
oracle-cplobj-add (#name: "db", ...);
! Define table
create table GP (#uid: "NUMBER", #detail: "LONG") using db;
! Populate table with GenPept reports
select #uid: x.#uid, #detail: x
into GP
from aa-get-seqfeat-general "PTP" as x;
! Map GP to that table
create view GP from GP using db;
! Run a query to get the title of a given entry
select x.#detail.#title from GP as x where x.#uid = ...;
Due to the complex structure of biological data, a relational DBMS such as Sybase is not suitable as a warehouse. The reason is that it forces us to fragment our data into many pieces in order to satisfy the third normal form requirement. This fragmentation or normalization process needs a skilled expert to get right. However, the final user is often not the same expert. So when the user wants to ask questions on the data, he may face some conceptual overhead to first figure out how the original data got fragmented into the many pieces in the warehouse. The fragmentation may also pose efficiency problems, as a query may cause many joins to be performed to reassemble the fragments into the original data. Kleisli has the capability to turn a relational DBMS into a nested relational DBMS. It can use flat DBMS such as Sybase, Oracle, MySQL, etc. as its update-able complex object store. It can even use all of these varieties of DBMS simultaneously. This capability makes Kleisli a good system for warehousing complex biological data. This slide provides a simple example where Kleisli is used to warehouse GenPept data (which is similar in structure and complexity to the GenBank report from the previous slide).

10 Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE
EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN
LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS
LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL
TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR
FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK
TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ
CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI
IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ
KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN
QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN
RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE
KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP
GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
We now turn to our first example in knowledge discovery. This example is on epitope prediction. Epitopes are immunogenic peptides in viral antigens that bind to MHC molecules. They are the starting point for the design of vaccines, as well as the starting point for the de-immunization of gene therapy vectors. Different epitopes bind to different combinations of MHC molecules. Epitopes can be detected by wet experiments. However, the cost of such experiments is quite high. Typically, a chemistry binding assay and a T-cell assay are needed per peptide, costing US$500 (for a large company like Amgen) to US$2000 (for a small university lab). A typical viral antigen has 500 amino acids or more. Thus, to exhaustively identify all its epitopes with respect to an MHC molecule can cost US$250,000 to US$1,000,000 or more. This is prohibitively expensive. An example antigen is shown in this slide. The task of an epitope prediction system is to reliably identify peptides, from a given antigen protein, that bind a given MHC molecule, using a computer. Such peptides can then be validated by wet experiments. Significant cost savings are achieved if the predictions are reliable.
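For illustration only, here is a minimal Python sketch of the prediction task: enumerate every 9-mer peptide of an antigen and rank the candidates with a scoring model. The scoring function below is a made-up placeholder, not the trained ANN used in PREDICT/PREDMODEL.

# Illustrative sketch only: enumerate candidate 9-mer peptides of an antigen
# and rank them by a scoring model. The scoring function here is a trivial
# placeholder; the real system uses a trained ANN instead.

def candidate_peptides(antigen, length=9):
    """Yield every peptide of the given length together with its start position."""
    for i in range(len(antigen) - length + 1):
        yield i, antigen[i:i + length]

def toy_binding_score(peptide):
    """Placeholder score (NOT a real MHC-binding model): counts hydrophobic
    residues at the common anchor positions 2 and 9 of a 9-mer."""
    anchors = peptide[1] + peptide[8]
    return sum(aa in "LIVMFYA" for aa in anchors)

def rank_epitope_candidates(antigen, top=10):
    scored = [(toy_binding_score(p), pos, p) for pos, p in candidate_peptides(antigen)]
    scored.sort(reverse=True)
    return scored[:top]

if __name__ == "__main__":
    antigen = "MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE"  # first 40 aa of TRAP-559AA
    for score, pos, pep in rank_epitope_candidates(antigen, top=5):
        print(pos, pep, score)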

11 Epitope Prediction Results
Prediction by our ANN model for HLA-A11: 29 predictions, 22 epitopes, 76% specificity.
Prediction by BIMAS matrix for HLA-A*1101 (ranked by BIMAS): number of experimental binders 19 (52.8%), ... (13.9%), ... (33.3%).
At LIT, we developed a very detailed warehouse on the binding and non-binding of peptides to different MHC molecules. From this warehouse of data, we constructed very accurate models for predicting peptide binding to specific MHC molecules. The system is called PREDICT/PREDMODEL. This slide compares the prediction performance of PREDICT/PREDMODEL on a particular antigen wrt HLA-A11 (an example MHC molecule) with that of the popular public epitope prediction system called BIMAS. This antigen is known to have just over 30 epitopes wrt HLA-A11. Just 19 epitopes are included among BIMAS' top 66 predictions. In contrast, 22 epitopes are included among PREDICT/PREDMODEL's top 29 predictions. We have so far made predictions for many collaboration partners from WEHI (IDDM), Case Western (malaria parasite), Pittsburgh Univ (melanoma), Kumamoto Univ (HIV), etc.

12 Transcription Start Prediction
A draft human genome sequence has been assembled. We even know the rough positions of many of the genes. However, the precise structure, such as translation initiation sites, transcription start sites, splice points, etc., of many of these genes is unknown. Fully wet-lab-based determination of these features is costly and slow. We have developed Dragon, a reliable transcription start site prediction system. The basic idea of this system is shown in this slide. It has a number of signal sensors based on pentamer frequencies and uses an artificial neural network to integrate these signals to decide if the current position under consideration is a transcription start site.
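A rough sketch of the pentamer-frequency idea follows (my own simplified rendering, not the actual Dragon code): build a pentamer frequency table from training windows and use the average frequency of a window's pentamers as one sensor signal for the neural network.

# Minimal sketch of a pentamer-frequency signal sensor (not the actual Dragon code).
# The training windows below are toy examples; real sensors would be estimated
# from known TSS / non-TSS regions.
from collections import Counter

def pentamers(seq):
    return [seq[i:i + 5] for i in range(len(seq) - 4)]

def pentamer_model(training_windows):
    """Relative frequency of each pentamer over a set of training windows."""
    counts = Counter(p for w in training_windows for p in pentamers(w))
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def sensor_score(window, model):
    """Average model frequency of the window's pentamers; one such score per
    sensor (e.g. promoter vs. coding vs. intergenic) would be fed to the ANN."""
    ps = pentamers(window)
    return sum(model.get(p, 0.0) for p in ps) / len(ps)

tss_model = pentamer_model(["TATAAAAGGC", "GCGCTATAAA"])   # toy "positive" windows
score = sensor_score("ATATAAAAGG", tss_model)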

13 Transcription Start Prediction Results
The preliminary results of Dragon are very promising. This slide shows its performance on 1.3MB of benchmark data. The vertical axis is the sensitivity level. The horizontal axis is the number of false positives. The solid black curve plots the number of false positives produced by Dragon at each sensitivity level. The coloured spots are the performance of several popular transcription start site prediction systems at the sensitivity levels recommended by their respective creators. As can be seen, at any level of sensitivity, Dragon produced significantly fewer false positives than the other prediction systems; in fact, at least an order of magnitude fewer. We are currently making further improvements to Dragon, as well as validating it on the very large DMD gene with our wet-lab collaborators at the National University Hospital.

14 Medical Record Analysis
Looking for patterns that are valid, novel, useful, and understandable.
In our previous slides on our knowledge discovery projects, the data involved were pretty homogeneous: amino acids in the case of epitope prediction, nucleic acids in the case of transcription start site prediction, and gene expression levels in the case of gene expression profile classification. We have also worked on medical records. These records are much more heterogeneous in terms of what you find in them, as shown in this slide. The analysis of medical records is aimed mainly at diagnosis, prognosis, and treatment planning. Here we are looking for patterns that are:
valid: they are also observed in new data with high certainty;
novel: they are not obvious to experts and provide new insights;
useful: they enable reliable predictions;
understandable: they pose no obstacle in their interpretation.
Traditional data mining methods that look for high-frequency patterns are not useful on these data. E.g., if you use these methods in the Singapore General Hospital, they will produce totally useless patterns such as "everyone here has black hair and black eyes." We want to develop something more meaningful.

15 Gene Expression Analysis
Classifying gene expression profiles:
find stable differentially expressed genes
find significant gene groups
derive coordinated gene expression
Microarrays are now being used to measure the expression levels of thousands of genes simultaneously. The gene expression profiles thus obtained may be useful in understanding the interactions of genes under various experimental conditions and the correlation of gene expression to disease states, provided gene expression analysis can be carried out successfully. We have mainly worked on classification analysis: it aims at finding stable differentially expressed genes from two groups of samples and using these genes as a means to distinguish (i.e., classify) new samples into one of the two groups. Currently, most work on gene expression profile classification considers the significance of each gene individually. We want to go beyond that and consider groupings of genes, because it is more reasonable to assume that the disease relevance of genes requires coordinated expression of groups of genes, and these groups may vary from patient to patient.
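As a point of reference for the per-gene analysis mentioned above, here is a minimal sketch that shortlists differentially expressed genes with a two-sample t-test; the expression matrix and group indices are placeholders, and the group-based analysis we are pursuing goes beyond this.

# Sketch: rank genes by a per-gene t-test between disease and normal samples
# (the individual-gene baseline, not the lab's group-based method).
import numpy as np
from scipy.stats import ttest_ind

def top_genes(expr, disease_idx, normal_idx, k=20):
    """expr: genes x samples matrix; returns indices of the k most significant genes."""
    t, p = ttest_ind(expr[:, disease_idx], expr[:, normal_idx], axis=1)
    return np.argsort(p)[:k]

expr = np.random.rand(1000, 40)                     # placeholder expression matrix
top = top_genes(expr, list(range(20)), list(range(20, 40)))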

16 Medical Record & Gene Expression Analysis Results
PCL, a novel "emerging pattern" method.
Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks.
Works well for gene expressions.
There are many methods for analysing medical records, such as decision tree induction (C4.5, CBA), Bayesian classifiers (LB, NB, TAN), neural networks, etc. Decision trees are easy to understand and are very fast to construct and use. However, they are usually not accurate if the decision boundary is non-linear. Bayesian networks and neural networks perform better in non-linear situations; however, their resultant models are "black boxes" that may not be easy to understand. We have been developing a novel data mining method called PCL, for Prediction by Collective Likelihood of emerging patterns. This method focuses on fast techniques for identifying patterns whose frequencies in two classes differ by a large ratio, and on combining these patterns to make a decision. Note that a pattern is still emerging if its frequencies are as low as 1% in one class and 0.1% in another class, because the ratio indicates a 10 times difference! Preliminary testing on 32 benchmark datasets is very promising. DeEPs performed best on 21 of the datasets and was not far behind in the remainder. Even more promising is that it also seems to work very well on gene expression data. See our cover-page paper in Cancer Cell, March 2002, 1(2).
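The emerging-pattern idea itself can be sketched as follows (illustrative only, not the PCL implementation); the pattern items and data below are invented for the example.

# Sketch of the emerging-pattern notion behind PCL. A pattern is a set of
# discretised attribute-value items; its growth rate is the ratio of its
# support in one class to its support in the other.

def support(pattern, records):
    """Fraction of records (each a set of items) that contain the pattern."""
    return sum(pattern <= r for r in records) / len(records)

def growth_rate(pattern, class_a, class_b):
    sa, sb = support(pattern, class_a), support(pattern, class_b)
    if sb == 0:
        return float("inf") if sa > 0 else 0.0
    return sa / sb

# A low-frequency pattern can still be strongly "emerging":
# here {gene1=high} has 10% support in class A vs 1% in class B, growth rate 10.
class_a = [{"gene1=high", "gene2=low"}] * 10 + [{"gene1=low"}] * 90
class_b = [{"gene1=high"}] * 1 + [{"gene1=low"}] * 99
print(growth_rate({"gene1=high"}, class_a, class_b))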

17 Protein Interaction Extraction
"What are the protein-protein interaction pathways from the latest reported discoveries?"
While scientific databases have been proliferating in recent years, much of the scientific data reported in the literature has not been captured in these databases. To speed up the capture of results reported in research journals into structured databases, sophisticated natural language-based information extraction tools are needed. This slide depicts an idealized situation where a user poses a high-level query requesting protein interaction information. An engine is envisioned that will download many scientific texts, extract precise facts on the interactions of individual proteins, and combine these facts into an interaction pathway for the user.

18 Protein Interaction Extraction Results
Rule-based system for processing free texts in scientific abstracts.
Specialized in extracting protein names and extracting protein-protein interactions.
Over the last couple of years, we have been developing the PIES, a protein interaction extraction system. The PIES is a rule-based system for analysing biology research papers written in English. It specializes in recognizing names of proteins and molecules and their interactions. It is one of the first systems capable of this kind of analysis and information extraction. This slide shows the output of the system given "Jak1" as the protein whose pathway we are interested in. The PIES downloaded and examined over 500 Medline abstracts. It recognized close to 500 interactions involving close to 350 proteins and molecules. The PIES is licensed to Molecular Connections of Bangalore.
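The PIES is a substantial rule-based NLP system; the toy sketch below only conveys the flavour of pattern-based extraction, using one hand-written regular expression over example sentences (the sentences and verb list are invented for illustration).

# Toy illustration of rule-based interaction extraction (far simpler than PIES).
# One hand-written pattern: "<ProteinA> <verb> <ProteinB>" for a few interaction verbs.
import re

INTERACTION = re.compile(
    r"\b([A-Z][A-Za-z0-9-]+)\s+(activates|inhibits|phosphorylates|binds(?:\s+to)?)\s+([A-Z][A-Za-z0-9-]+)"
)

def extract_interactions(sentence):
    return [(a, verb, b) for a, verb, b in INTERACTION.findall(sentence)]

print(extract_interactions("Jak1 phosphorylates Stat3 and SOCS1 inhibits Jak1."))
# [('Jak1', 'phosphorylates', 'Stat3'), ('SOCS1', 'inhibits', 'Jak1')]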

19 Behind the Scenes
Allen Chong, Vladimir Bajic, Judice Koh, Vladimir Brusic, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang, and many more: students, folks from geneticXchange, Molecular Connections, and other collaborators.
The results described in the preceding slides are of course not the work of one man. I am pleased to acknowledge the contributions of the members of my lab listed above. Limsoon Wong, 20 August 2001; updated 6 April 2002.

20 Using Feature Generation & Feature Selection for Accurate Prediction of Translation Initiation Sites
A more detailed example of post-genome knowledge discovery

21 Translation Initiation Recognition

22 A Sample cDNA
299 HSU CAT U27655 Homo sapiens CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT iEEEEEEEEEEEEEEEEEEEEEEEEEEE EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE What makes the second ATG the translation initiation site?
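A small sketch of the candidate-generation step implied by this slide: list every ATG in the cDNA together with its 0-based position and reading frame; the prediction problem is then to decide which candidate is the true TIS (here, the second one).

# Sketch: enumerate candidate ATG sites in a cDNA sequence (positions are 0-based).
def atg_candidates(cdna):
    return [(i, i % 3) for i in range(len(cdna) - 2) if cdna[i:i + 3] == "ATG"]

# first 120 bases of the sample cDNA above
cdna = ("CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
        "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA")
for pos, frame in atg_candidates(cdna):
    print(pos, frame, cdna[pos:pos + 3])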

23 Approach
Training data gathering
Signal generation: k-grams, distance, domain know-how, ...
Signal selection: entropy, χ², CFS, t-test, domain know-how, ...
Signal integration: SVM, ANN, PCL, CART, C4.5, kNN, ...

24 Training & Testing Data
Vertebrate dataset of Pedersen & Nielsen [ISMB'97]:
3312 sequences
13503 ATG sites
3312 (24.5%) are TIS
10191 (75.5%) are non-TIS
Used for 3-fold x-validation experiments.
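For concreteness, a 3-fold cross-validation loop might look like the following sketch (scikit-learn is used purely for illustration; the feature vectors and labels are placeholders, not the actual dataset).

# Sketch of 3-fold cross-validation over labelled ATG sites (1 = TIS, 0 = non-TIS).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X = np.random.rand(300, 20)              # placeholder feature vectors for ATG sites
y = np.random.randint(0, 2, size=300)    # placeholder TIS / non-TIS labels

accs = []
for train_idx, test_idx in StratifiedKFold(n_splits=3, shuffle=True).split(X, y):
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print(sum(accs) / 3)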

25 Signal Generation
K-grams (i.e., k consecutive letters)
Window size vs. fixed position
Up-stream, down-stream vs. anywhere in window
In-frame vs. any frame
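A sketch of k-gram signal generation under my own simplified conventions (the exact windows and encodings in the actual study may differ): count in-frame and any-frame k-grams separately for the up-stream and down-stream windows around a candidate ATG.

# Sketch of k-gram feature generation around a candidate ATG (illustrative conventions).
from collections import Counter

def kgram_counts(window, k, in_frame_only=False):
    step = 3 if in_frame_only else 1
    return Counter(window[i:i + k] for i in range(0, len(window) - k + 1, step))

def features_for_site(seq, atg_pos, k=3, flank=99):
    up = seq[max(0, atg_pos - flank):atg_pos]
    down = seq[atg_pos + 3:atg_pos + 3 + flank]
    feats = {}
    for region, window in (("up", up), ("down", down)):
        for in_frame in (False, True):
            tag = "inframe" if in_frame else "anyframe"
            for gram, c in kgram_counts(window, k, in_frame).items():
                feats[f"{region}-{tag}-{gram}"] = c
    return feats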

26 Too Many Signals
For each value of k, there are 4^k * 3 * 2 k-grams.
If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features! This is too many for most machine learning algorithms.

27 Signal Selection (Basic Idea)
Choose a signal w/ low intra-class distance.
Choose a signal w/ high inter-class distance.
Which of the following 3 signals is good?

28 Signal Selection (e.g., t-statistics)
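For one signal, the t-statistic between the TIS and non-TIS classes can be computed as in this minimal sketch (my own rendering of the standard unequal-variance form):

# t-statistic of one signal (feature) between the positive (TIS) and negative classes.
import math

def t_stat(pos_values, neg_values):
    n1, n2 = len(pos_values), len(neg_values)
    m1, m2 = sum(pos_values) / n1, sum(neg_values) / n2
    v1 = sum((x - m1) ** 2 for x in pos_values) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in neg_values) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Signals with large |t| have high inter-class and low intra-class spread,
# matching the "basic idea" on the previous slide.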

29 Signal Selection (e.g., CFS)
Instead of scoring individual signals, how about scoring a group of signals as a whole?
CFS (Correlation-based Feature Selection): a good group contains signals that are highly correlated with the class, and yet uncorrelated with each other.
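For concreteness, CFS scores a group of k signals with a merit function; a minimal sketch, assuming r_cf is the average signal-class correlation and r_ff the average signal-signal correlation within the candidate group:

# CFS merit of a group of k signals (standard formula from Hall's CFS):
#   merit = k * r_cf / sqrt(k + k*(k-1) * r_ff)
# High average correlation with the class (r_cf) raises the merit;
# high redundancy among the signals (r_ff) lowers it.
import math

def cfs_merit(k, avg_signal_class_corr, avg_signal_signal_corr):
    return (k * avg_signal_class_corr) / math.sqrt(k + k * (k - 1) * avg_signal_signal_corr)

print(cfs_merit(5, 0.4, 0.1))   # informative, non-redundant group
print(cfs_merit(5, 0.4, 0.9))   # informative but highly redundant group scores lower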

30 Sample k-grams Selected by CFS
Position –3: Kozak consensus
In-frame up-stream ATG: leaky scanning
In-frame down-stream TAA, TAG, TGA: stop codon
CTG, GAC, GAG, and GCC: codon bias?

31 Signal Integration
kNN: given a test sample, find the k training samples that are most similar to it; let the majority class win.
SVM: given a group of training samples from two classes, determine a separating plane that maximises the margin between them.
Naïve Bayes, ANN, C4.5, ...
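A sketch of the signal-integration step using off-the-shelf classifiers (scikit-learn here is purely illustrative; the feature matrices and labels are placeholders):

# Sketch: plug the selected signals into standard classifiers (kNN, SVM, Naive Bayes).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X_train = np.random.rand(200, 50)           # placeholder: selected signal values
y_train = np.random.randint(0, 2, 200)      # placeholder: TIS / non-TIS labels
X_test = np.random.rand(10, 50)

for clf in (KNeighborsClassifier(n_neighbors=5), SVC(kernel="linear"), GaussianNB()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.predict(X_test))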

32 Results (3-fold x-validation)

33 Improvement by Voting Apply any 3 of Naïve Bayes, SVM, Neural Network, & Decision Tree. Decide by majority.
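A minimal rendering of the voting idea with three of the listed classifiers (scikit-learn again for illustration, with placeholder data):

# Sketch: majority voting over three classifiers (Naive Bayes, SVM, decision tree).
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = np.random.rand(200, 50), np.random.randint(0, 2, 200)
vote = VotingClassifier([("nb", GaussianNB()), ("svm", SVC()), ("tree", DecisionTreeClassifier())],
                        voting="hard")
vote.fit(X, y)
print(vote.predict(X[:5]))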

34 Improvement by Scanning
Apply Naïve Bayes or SVM left-to-right until the first ATG predicted as positive; that's the TIS.
Naïve Bayes & SVM models were trained using TIS vs. up-stream ATG.
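The scanning rule itself can be sketched as follows; "model" and "featurize" stand for any trained classifier and feature extractor of the kind described above (the names are mine, not from the original system).

# Sketch of the scanning rule: scan ATGs left to right and return the first one
# that the trained model predicts as a TIS.
def scan_for_tis(cdna, model, featurize):
    """featurize(cdna, pos) -> feature vector; model.predict([...]) -> [0 or 1]."""
    for pos in range(len(cdna) - 2):
        if cdna[pos:pos + 3] == "ATG":
            if model.predict([featurize(cdna, pos)])[0] == 1:
                return pos          # first ATG predicted positive is taken as the TIS
    return None                     # no TIS predicted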

35 Performance Comparisons
* result not directly comparable

36 Technique Comparisons
Pedersen & Nielsen [ISMB'97]: neural network; no explicit features
Zien [Bioinformatics'00]: SVM + kernel engineering
Hatzigeorgiou [Bioinformatics'02]: multiple neural networks; scanning rule
Our approach: explicit feature generation; explicit feature selection; use any machine learning method w/o any form of complicated tuning; scanning rule is optional

37 Acknowledgements
A.G. Pedersen, H. Nielsen, Roland Yap, Fanfan Zeng

