1
From Informatics to Bioinformatics Limsoon Wong
KRDL spearheads networking and information technology R&D in Singapore. It seeded bioinformatics R&D in Singapore in 1994 by starting the Kleisli project. KRDL has 250 researchers and an annual budget of US$27M. It creates economic impact by collaborating with industry, by licensing its technology, and by spinning off companies.

Limsoon Wong is Director of Bioinformatics at KRDL. He is concurrently an adjunct associate professor at the National University of Singapore School of Computing and a senior scientist at the Institute of Molecular & Cell Biology. He received his BSc(Eng) from Imperial College in London and his PhD from the University of Pennsylvania in Philadelphia. Limsoon is a recipient of the Rubinoff Award (1995), the National Academy of Science Young Scientist Award (1997), the Tan Kah Kee Young Inventors Gold Award (1997), the ASEAN Certificate of Achievements (1998), and the Singapore Youth Award (1999). Limsoon has done breakthrough research in several areas. He is well known for his theorems on the ``Bounded Degree Property'' of Nested Relational Calculus and SQL. He contributed to the final solution of the ``Kanellakis Conjecture'' in Finite Model Theory. He invented the ``Kleisli Query System'', the first broad-scale data integration system to solve many of the so-called ``impossible'' bioinformatics integration problems identified by the US Department of Energy in 1993. Please visit sdmc.krdl.org.sg/~limsoon.

Limsoon Wong, Kent Ridge Digital Labs, Singapore
2
What is Bioinformatics?
Modern molecular biology and medical research involve an increasing amount, and an increasing variety, of data. The use of informatics to organize, manage, and analyse these data has consequently become an important element of biological and medical research. Bioinformatics is the fusion of computing, mathematics, and biology to address this need. The effective deployment of Bioinformatics requires the user to have a reasonable idea of the questions he wants answered. For each such question, Bioinformatics can be used first to organize the relevant data and then to analyse these data to make predictions or to draw conclusions.
3
What are the Themes of Bioinformatics?
Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases

In this talk, we consider two major themes in Bioinformatics, viz. data management and knowledge discovery. Data management involves tasks such as integration of relevant data from various sources, transformation of the integrated data into forms more suitable for analysis, cleansing of data to avoid errors in analysis, etc. Knowledge discovery involves the construction of databases and the application of statistics and data mining algorithms to extract information from these databases, such as prediction models for disease diagnosis. Both themes of Bioinformatics rely on the effective adoption of techniques developed in Computer Science and Mathematics to biological data. We will describe a few of them in subsequent slides, using recent results obtained in our lab. Please visit sdmc.krdl.org.sg/bic.
4
What are the Benefits of Bioinformatics?
To the patient: Better drugs, better treatment
To the pharma: Save time, save cost, make more $
To the scientist: Better science
5
Data Integration
A DOE ``impossible query'': For each gene on a given cytogenetic band, find its non-human homologs.

The first example is that of data integration. Many questions that a biologist is interested in cannot be answered using any single data source. However, quite a few of these queries can be satisfactorily answered using information from several sources. Unfortunately, this has proved quite difficult in practice. In fact, the US Dept of Energy published a list of queries that it considered ``impossible'' to solve in 1993. The interesting thing about these queries was that each had a conceptually straightforward answer using the databases available at the time. What made them ``impossible'' was that the databases needed were geographically distributed, were running on different computer systems with different capabilities, and had very different formats. An example of the US Dept of Energy's ``impossible queries'' is given in this slide. It required two databases, viz. GDB for information on which gene was on which cytogenetic band, and Entrez for information on which gene was a homolog of which other genes. GDB was then located in Baltimore and was a Sybase relational database that supported SQL queries. Entrez was then located in Bethesda and had to be accessed through an ASN.1 interface that supported simple keyword indexing.
6
Data Integration Results
Using Kleisli:
Clear
Succinct
Efficient
Handles heterogeneity and complexity

  sybase-add (#name: "GDB", ...);
  create view L from locus_cyto_location using GDB;
  create view E from object_genbank_eref using GDB;
  select #accn: g.#genbank_ref, #nonhuman-homologs: H
  from L as c, E as g,
       (select u
        from g.#genbank_ref.na-get-homolog-summary as u
        where not(u.#title string-islike "%Human%")
        andalso not(u.#title string-islike "%H.sapien%")) as H
  where c.#chrom_num = "22"
  andalso g.#object_id = c.#locus_id
  andalso not (H = { });

Kleisli is a broad-scale data integration system. It allows many data sources to be viewed as if they reside within a federated nested relational database system. It automatically handles heterogeneity, so that a user can formulate his queries in a way that is independent of the geographic location of the data sources, independent of whether a data source is a sophisticated relational database system or a dumb flat file, and independent of the access protocols to these data sources. It also has a good query optimizer, so that a user can formulate his queries in a clear and succinct way without having to worry about whether they will run fast. The first prototype of Kleisli was constructed in early 1994. That very primitive prototype became the first general query system to solve the ``impossible queries'' published in 1993 by the US Department of Energy. This slide shows a solution in Kleisli to the example ``impossible'' query given in the previous slide. Kleisli is licensed to GeneticXchange of Menlo Park and serves as the backbone of their gX-engine. Please visit sdmc.krdl.org.sg/kleisli.
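To make the shape of this federated join explicit, here is a minimal Python sketch of the same query, assuming two hypothetical wrapper functions, fetch_gdb_loci and fetch_entrez_homologs, standing in for the GDB and Entrez access layers that Kleisli handles automatically; the function names, record fields, and demo data are all invented for illustration.

  # Hypothetical sketch of the DOE "impossible query" as a federated join.
  # fetch_gdb_loci and fetch_entrez_homologs are invented stand-ins for
  # the GDB (Sybase/SQL) and Entrez (ASN.1) access layers.

  def fetch_gdb_loci(chrom_band):
      """Stand-in for GDB: yield (locus_id, genbank_ref) pairs on a band."""
      demo = {"22": [(101, "NM_001619")]}  # invented demo data
      yield from demo.get(chrom_band, [])

  def fetch_entrez_homologs(genbank_ref):
      """Stand-in for Entrez: yield homolog summaries for an accession."""
      demo = {"NM_001619": [{"title": "Mus musculus adrenergic receptor"},
                            {"title": "Human adrenergic receptor"}]}
      yield from demo.get(genbank_ref, [])

  def non_human_homologs(chrom_band):
      # Same shape as the Kleisli query: join loci to homolog summaries,
      # filter out human entries, and drop genes with no non-human hits.
      for locus_id, accn in fetch_gdb_loci(chrom_band):
          hits = [h for h in fetch_entrez_homologs(accn)
                  if "Human" not in h["title"] and "H.sapien" not in h["title"]]
          if hits:
              yield {"accn": accn, "nonhuman_homologs": hits}

  for row in non_human_homologs("22"):
      print(row)

What Kleisli adds over such hand-written glue is that the source drivers, the query optimizer, and the nested data model come for free.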
7
Data Warehousing
Motivation: efficiency, availability, ``denial of service''
  {(#uid: , #title: "Homo sapiens adrenergic ...",
    #accession: "NM_001619", #organism: "Homo sapiens",
    #taxon: 9606, #lineage: ["Eukaryota", "Metazoa", ...],
    #seq: "CTCGGCCTCGGGCGCGGC...",
    #feature: {(#name: "source", #continuous: true,
      #position: [(#accn: "NM_001619", #start: 0, #end: 3602, #negative: false)],
      #anno: [(#anno_name: "organism", #descr: "Homo sapiens"), ...]), ...})}

Motivation: efficiency, availability, ``denial of service'', data cleansing
Requirements: efficient to query, easy to update, model data naturally

Besides querying data sources on the fly, biologists and biotechnology companies also have a great need to create their own customized data warehouses. These warehouses are motivated by the following factors:
- Execution of queries can be more efficient, as the data reside locally on a powerful database system.
- Execution of queries can be more reliable, as the data reside locally on a high-availability database system and high-availability network.
- Execution of queries on a local warehouse avoids unintended ``denial of service'' attacks on the original sources.
- Most importantly, many public sources contain errors. Some of these errors cannot be detected or corrected on the fly, so human effort (perhaps assisted by computers) must be used to perform cleansing. The cleansed data are warehoused to avoid repeating this task.
The requirements of a warehouse of biological data are that it should be efficient to query, easy to update, and that it should model data naturally. This last requirement is very important because biological data, such as the GenBank report shown in this slide, have a very complex nesting structure. Warehousing such data in a radically different form is likely to cause problems later in the effective use of these data.
8
Data Warehousing Results
Relational DBMS is insufficient because it forces us to fragment data into 3NF. Kleisli turns a flat relational DBMS into a nested relational DBMS. It can use flat relational DBMS such as Sybase, Oracle, MySQL, etc. as its updatable complex object store. It can even use all of these systems simultaneously!

  ! Log in
  oracle-cplobj-add (#name: "db", ...);
  ! Define table
  create table GP (#uid: "NUMBER", #detail: "LONG") using db;
  ! Populate table with GenPept reports
  select #uid: x.#uid, #detail: x
  into GP
  from aa-get-seqfeat-general "PTP" as x;
  ! Map GP to that table
  create view GP from GP using db;
  ! Run a query to get the title of a report
  select x.#detail.#title from GP as x where x.#uid = ;

Due to the complex structure of biological data, a relational DBMS such as Sybase is not suitable as a warehouse. The reason is that it forces us to fragment our data into many pieces in order to satisfy the 3rd normal form requirement. This fragmentation, or normalization, process needs a skilled expert to get right. However, the final user is often not that expert, so when he wants to ask questions of the data, he may face some conceptual overhead in first figuring out how the original data got fragmented into the many pieces in the warehouse. The fragmentation may also pose efficiency problems, as a query may require many joins to reassemble the fragments into the original data. Kleisli has the capability to turn a relational DBMS into a nested relational DBMS. It can use flat DBMS such as Sybase, Oracle, MySQL, etc. as its updatable complex object store. It can even use all of these DBMS simultaneously. This capability makes Kleisli a good system for warehousing complex biological data. This slide provides a simple example where Kleisli is used to warehouse GenPept data (which is similar in structure and complexity to the GenBank report from the previous slide).
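As a rough illustration of the fragmentation problem (this is not KRDL code; the tables, columns, and data are invented), the following Python/sqlite3 sketch stores one nested report in 3NF and shows that even reassembling it requires joins:

  import sqlite3

  # Illustrative only: a nested report like the GenBank record above,
  # forced into third normal form, splits into parent/child tables.
  db = sqlite3.connect(":memory:")
  db.executescript("""
      CREATE TABLE report  (uid INTEGER PRIMARY KEY, title TEXT, organism TEXT);
      CREATE TABLE feature (fid INTEGER PRIMARY KEY, uid INTEGER, name TEXT);
      CREATE TABLE anno    (fid INTEGER, anno_name TEXT, descr TEXT);
  """)
  db.execute("INSERT INTO report VALUES (1, 'Homo sapiens adrenergic ...', 'Homo sapiens')")
  db.execute("INSERT INTO feature VALUES (10, 1, 'source')")
  db.execute("INSERT INTO anno VALUES (10, 'organism', 'Homo sapiens')")

  # Reassembling the original nested object already needs a two-way join;
  # a real GenBank report fragments into many more tables than this.
  rows = db.execute("""
      SELECT r.title, f.name, a.anno_name, a.descr
      FROM report r
      JOIN feature f ON f.uid = r.uid
      JOIN anno a ON a.fid = f.fid
  """).fetchall()
  print(rows)

A nested relational store keeps the report as one object, so neither the normalization expertise nor the reassembly joins are needed.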
9
Epitope Prediction
TRAP-559AA
MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE
EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN
LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS
LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL
TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR
FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK
TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ
CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI
IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ
KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN
QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN
RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE
KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP
GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

We now turn to our first example in knowledge discovery. This example is on epitope prediction. Epitopes are immunogenic peptides in viral antigens that bind to MHC molecules. They are the starting point for the design of vaccines, as well as for the de-immunization of gene therapy vectors. Different epitopes bind to different combinations of MHC molecules. Epitopes can be detected by wet experiments, but the cost of such experiments is quite high. Typically a chemical binding assay and a T-cell assay are needed per peptide, costing US$500 (for a large company like Amgen) to US$2000 (for a small university lab). A typical viral antigen has 500 amino acids or more, so exhaustively identifying all its epitopes with respect to one MHC molecule can cost US$250,000 to US$1,000,000 or more. This is prohibitively expensive. An example antigen is shown in this slide. The task of an epitope prediction system is to reliably identify, by computer, peptides from a given antigen protein that bind a given MHC molecule. Such peptides can then be validated by wet experiments. Significant cost savings are achieved if the predictions are reliable.
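To make the cost arithmetic concrete, here is a small Python sketch (illustrative only, and assuming the 9-residue peptides typical of MHC class I binding, which the slide does not state) that enumerates the candidate peptides of an antigen and prices an exhaustive wet-lab screen at the quoted US$500 to US$2000 per peptide:

  # Illustrative sketch: enumerate candidate peptides from an antigen and
  # estimate the cost of testing all of them in the wet lab.
  # Assumes 9-mer windows (typical for MHC class I); not KRDL's code.

  def candidate_peptides(antigen, k=9):
      """All overlapping k-mer peptides of the antigen sequence."""
      return [antigen[i:i + k] for i in range(len(antigen) - k + 1)]

  antigen = "MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE" * 13  # ~520 aa stand-in
  peptides = candidate_peptides(antigen)

  # One binding assay plus one T-cell assay per peptide,
  # US$500 (large company) to US$2000 (small university lab).
  low, high = 500 * len(peptides), 2000 * len(peptides)
  print(f"{len(peptides)} candidate peptides: US${low:,} to US${high:,}")

At roughly 500 candidate peptides for a 500-residue antigen, this reproduces the US$250,000 to US$1,000,000 estimate in the notes.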
10
Epitope Prediction Results
Prediction by our ANN model for HLA-A11: 29 predictions, 22 epitopes, 76% specificity
Prediction by BIMAS matrix for HLA-A*1101: number of experimental binders among top predictions, 19 (52.8%)
[figure: predictions ranked by BIMAS score against experimental binding]

At KRDL, we have developed a very detailed warehouse on the binding and non-binding of peptides to different MHC molecules. From this warehouse of data, we constructed very accurate models for predicting peptide binding to specific MHC molecules. The system is called PREDICT/PREDMODEL. This slide compares the prediction performance of PREDICT/PREDMODEL on a particular antigen wrt HLA-A11 (an example MHC molecule) with that of the popular public epitope prediction system BIMAS. This antigen is known to have just over 30 epitopes wrt HLA-A11. Just 19 epitopes are included among BIMAS' top 66 predictions. In contrast, 22 epitopes are included among PREDICT/PREDMODEL's top 29 predictions (22/29, or roughly 76%). We have so far made predictions for many collaboration partners from WEHI (IDDM), Case Western (malaria parasite), Pittsburgh Univ (melanoma), Kumamoto Univ (HIV), etc. Please visit sdmc.krdl.org.sg/predict.
11
Gene Expression Analysis
Clustering gene expression profiles
Classifying gene expression profiles: find stable differentially expressed genes

Microarrays are now being used to measure the expression levels of thousands of genes simultaneously. The gene expression profiles thus obtained may be useful in understanding the interactions of genes under various experimental conditions and the correlation of gene expression to disease states, provided gene expression analysis can be carried out successfully. There are two main kinds of analysis, viz. clustering analysis and classification analysis. Clustering analysis aims at grouping genes by the similarity of their profiles under varying experimental conditions, such as time. Classification analysis aims at finding stable differentially expressed genes from two groups of samples and using these genes as a means to distinguish (i.e. classify) new samples into one of the two groups.
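As a minimal sketch of the classification side (not the actual BioCluster or Discovery code; the toy data and the choice of statistic are invented), the snippet below ranks genes by a two-sample t-like score so that stably differentially expressed genes surface at the top:

  import math

  # Illustrative only: rank genes by a t-like score between two groups
  # of samples and keep the most differentially expressed ones.

  def t_score(a, b):
      """Two-sample t-like statistic for one gene's expression values."""
      ma, mb = sum(a) / len(a), sum(b) / len(b)
      va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
      vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
      return (ma - mb) / math.sqrt(va / len(a) + vb / len(b) + 1e-9)

  # Invented toy data: gene -> (expression in class 1, expression in class 2)
  genes = {
      "geneA": ([5.1, 4.9, 5.3, 5.0], [1.0, 1.2, 0.9, 1.1]),  # differential
      "geneB": ([2.0, 2.1, 1.9, 2.2], [2.1, 2.0, 2.2, 1.9]),  # stable
  }

  ranked = sorted(genes, key=lambda g: abs(t_score(*genes[g])), reverse=True)
  print(ranked)  # geneA should rank above geneB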
12
Gene Expression Analysis Results
The Discovery System: correlation test, voter selection, class prediction

At KRDL, we have developed BioCluster for clustering of gene expression profiles and Discovery for classification of gene expression profiles. This slide shows Discovery. Please visit sdmc.krdl.org.sg/~lxzhang/discovery. In both systems, we follow the ``McDonald's menu'' approach: we provide multiple methods at each step of the analysis process. Some of these methods are based on results published by other groups; some are novel ones developed by us. Having multiple methods available is very important because no single method is known to work uniformly well on all gene expression datasets today. Besides Discovery, we are also working on ArrayQuerier, a gene expression classification system and microarray data management system. The St. Jude Children's Research Hospital is the first testbed user of a prototype of ArrayQuerier.
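A minimal sketch of the class-prediction step by voting (the talk does not spell out Discovery's own correlation test and voter selection, so the voter genes, class means, and decision rule here are invented):

  # Illustrative voting classifier: each selected gene "votes" for the
  # class whose training mean its expression in the new sample is nearer.
  # The voter genes and class means are invented stand-ins.

  voters = {                  # gene -> (mean in class 1, mean in class 2)
      "geneA": (5.0, 1.0),
      "geneB": (0.5, 3.0),
      "geneC": (2.0, 2.5),
  }

  def predict(sample):
      """sample: gene -> expression value in the new sample."""
      votes = [1 if abs(sample[g] - m1) < abs(sample[g] - m2) else 2
               for g, (m1, m2) in voters.items() if g in sample]
      return 1 if votes.count(1) >= votes.count(2) else 2

  print(predict({"geneA": 4.6, "geneB": 0.8, "geneC": 2.6}))  # -> 1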
13
Protein Interaction Extraction
``What are the protein-protein interaction pathways from the latest reported discoveries?''

While scientific databases have been proliferating in recent years, much of the scientific data reported in the literature has not been captured in these databases. To speed up the capture of results reported in research journals into structured databases, sophisticated natural language-based information extraction tools are needed. This slide depicts an idealized situation where a user posts a high-level query requesting protein interaction information. An engine is envisioned that downloads many scientific texts from the Web, extracts precise facts on the interactions of individual proteins, and combines these facts into an interaction pathway for the user.
14
Protein Interaction Extraction Results
Rule-based system for processing free texts in scientific abstracts
Specialized in extracting protein names and extracting protein-protein interactions
[figure: extracted interaction network for Jak1]

Over the last couple of years, we have been developing the PIES, a protein interaction extraction system. The PIES is a rule-based system for analysing biology research papers written in English. It specializes in recognizing the names of proteins and molecules and their interactions. It is one of the first systems capable of this kind of analysis and information extraction. This slide shows the output of the system given ``Jak1'' as the protein whose pathway we are interested in. The PIES downloaded and examined over 500 Medline abstracts. It recognized close to 500 interactions involving close to 350 proteins and molecules. The PIES is licensed to Molecular Connections of Bangalore.
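A toy sketch of the rule-based idea (the PIES grammar and lexicon are far richer; the verb list, name pattern, and sentence below are invented):

  import re

  # Toy interaction rule: "<Protein> <verb> <Protein>", with a small
  # invented lexicon of interaction verbs. Real PIES rules also handle
  # name recognition, passives, nominalizations, etc.

  VERBS = r"(?:activates|binds|phosphorylates|inhibits|interacts with)"
  PROTEIN = r"([A-Z][A-Za-z0-9-]*\d[A-Za-z0-9-]*|[A-Z]{2,}[A-Za-z0-9-]*)"
  RULE = re.compile(rf"{PROTEIN}\s+{VERBS}\s+{PROTEIN}")

  abstract = ("We report that Jak1 phosphorylates STAT3 upon stimulation, "
              "while SOCS1 inhibits Jak1.")

  for a, b in RULE.findall(abstract):
      print(f"{a} -> {b}")
  # Jak1 -> STAT3
  # SOCS1 -> Jak1

Chaining such extracted pairs across many abstracts is what assembles a pathway like the Jak1 network on the slide.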
15
Transcription Start Prediction
A draft human genome sequence has been assembled. We even know the rough positions of many of the genes. However, the precise structure of many of these genes, such as translation initiation sites, transcription start sites, splice points, etc., is unknown. Fully wet lab-based determination of these features is costly and slow. We have developed Dragon, a reliable transcription start site prediction system. The basic idea of this system is shown in this slide. It has a number of signal sensors based on pentamer frequencies and uses an artificial neural network to integrate these signals to decide whether the current position under consideration is a transcription start site.
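A minimal sketch of the pentamer signal-sensor idea (Dragon's real sensors, weights, and neural network are not reproduced here; the pentamer scores, window size, and threshold are invented):

  from collections import Counter

  # Illustrative pentamer-sensor sketch: score a sliding window by how
  # promoter-like its pentamer frequencies are. A trained ANN would
  # combine several such sensor scores; here we use one toy sensor.

  # Invented log-odds-style weights: pentamer -> promoter-vs-background score.
  PENTAMER_SCORE = {"TATAA": 2.0, "GGGCG": 1.2, "CGCGC": 1.0, "AATTT": -0.5}

  def sensor_score(window):
      """Sum pentamer scores over all 5-mers in the window."""
      counts = Counter(window[i:i + 5] for i in range(len(window) - 4))
      return sum(PENTAMER_SCORE.get(p, 0.0) * n for p, n in counts.items())

  def scan(sequence, window=20, threshold=2.0):
      """Yield positions whose surrounding window scores above threshold."""
      for i in range(len(sequence) - window + 1):
          s = sensor_score(sequence[i:i + window])
          if s >= threshold:
              yield i, s

  seq = "ACGTGGGCGTATAAGGCCGCGCATTACGAATTTACGT"
  print(list(scan(seq)))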
16
Transcription Start Prediction Results
The preliminary results of Dragon are very promising. This slide shows its performance on 1.3MB of benchmark data. The vertical axis is the sensitivity level; the horizontal axis is the number of false positives. The solid black curve plots the number of false positives produced by Dragon at each sensitivity level. The coloured spots mark the performance of several popular transcription start site prediction systems at the sensitivity levels recommended by their respective creators. As can be seen, at any level of sensitivity, Dragon produced significantly fewer false positives than the other prediction systems, in fact at least an order of magnitude fewer. We are currently making further improvements to Dragon, as well as validating it on the very large DMD gene with our wet lab collaborators at the National University Hospital. Please visit Dragon at sdmc.krdl.org.sg/promoter.
17
Medical Record Analysis
Looking for patterns that are valid, novel, useful, understandable

In our previous slides on knowledge discovery projects, the data involved were fairly homogeneous: amino acids in the case of epitope prediction, nucleic acids in the case of transcription start site prediction, and gene expression levels in the case of gene expression profile classification. We have also worked on medical records. These records are much more heterogeneous in terms of what you find in them, as shown in this slide. The analysis of medical records is aimed mainly at diagnosis, prognosis, and treatment planning. Here we are looking for patterns that are:
valid: they are also observed in new data with high certainty
novel: they are not obvious to experts and provide new insights
useful: they enable reliable predictions
understandable: they pose no obstacle to interpretation
Traditional data mining methods that look for high-frequency patterns are not useful on these data. E.g., applied at the Singapore General Hospital, they would produce totally useless patterns such as ``everyone here has black hair and black eyes.''
18
Medical Record Analysis Results
DeEPs, a novel ``emerging pattern'' method
Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
Works for gene expressions

There are many methods for analysing medical records, such as decision tree induction (C4.5), association rule-based classification (CBA), and Bayesian classifiers (NB, LB, TAN). Decision trees are easy to understand and are very fast to construct and use; however, they are usually not accurate if the decision boundary is non-linear. Bayesian classifiers and similar statistical methods perform better in non-linear situations, but their resultant models are ``black boxes'' that may not be easy to understand. We have been developing a novel data mining method called DeEPs for making decisions through emerging patterns. This method focuses on fast techniques for identifying patterns whose frequencies in two classes differ by a large ratio. Note that a pattern is still emerging if its frequencies are as low as 1% in one class and 0.1% in the other, because the ratio indicates a 10-fold difference! Preliminary testing on 32 benchmark datasets is very promising: DeEPs performed best on 21 of the datasets and was not far behind on the remainder. Even more promising, it also seems to work very well on gene expression data, giving us a system that can potentially deal with two highly related kinds of data.
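A minimal sketch of the emerging-pattern idea itself (DeEPs' fast border-based techniques are not shown; the records and pattern are invented): a pattern's growth ratio is its frequency in one class divided by its frequency in the other, so even a low-support pattern can be strongly emerging.

  # Illustrative emerging-pattern scoring: frequency ratio of an itemset
  # between two classes. DeEPs itself uses much faster instance-based
  # techniques; this shows only the underlying idea.

  def support(pattern, records):
      """Fraction of records containing every item in the pattern."""
      hits = sum(1 for r in records if pattern <= r)
      return hits / len(records)

  def growth_ratio(pattern, class_a, class_b):
      sa, sb = support(pattern, class_a), support(pattern, class_b)
      return float("inf") if sb == 0 and sa > 0 else sa / max(sb, 1e-9)

  # Invented toy records: sets of discretized attribute-value items.
  class_a = [{"bp=high", "sugar=high"}, {"bp=high", "sugar=low"}] * 50
  class_b = [{"bp=low", "sugar=low"}] * 99 + [{"bp=high", "sugar=high"}]

  p = frozenset({"bp=high", "sugar=high"})
  print(growth_ratio(p, class_a, class_b))  # 0.5 / 0.01: ~50x, emerging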
19
Behind the Scene
Vladimir Bajic, Allen Chong, Vladimir Brusic, Judice Koh
Research: Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang
Business: Peter Saunders
Industry Assignees: Hao Han (gX), Rahul Despande (MC)
Engineering: Allen Chong, Judice Koh, SPT Krishnan, Seng Hong Seah, Guanglan Zhang, Zhuo Zhang
Students: Huiqing Liu, Song Zhu, Kun Yu

The results described in the preceding slides are of course not the work of one man. I am pleased to acknowledge the contributions of the members of my lab listed above. Please visit us at sdmc.krdl.org.sg/bic.

Limsoon Wong, 20 August 2001