Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory.



Talk Overview
Our research group
  – Aims, people, publications
Machine learning
  – A balancing act
Bioinformatics
  – Holy grails
Our bioinformatics research projects
  – From small to large
A future direction
  – Integration of reasoning techniques

Computational Bioinformatics Laboratory
Our aim is to:
  – Study the theory, implementation and application of computational techniques to problems in biology and medicine
Our emphasis is on:
  – Machine learning representations, algorithms and applications
Our favourite techniques are:
  – ILP, SLPs, ATF, ATP, CSP, GAs, SVMs
  – Kernel methods, Bayes nets, Action Languages
The (major) research tools we’ve produced are:
  – Progol, HR, MetaLog (in production)

The Research Group Members
Hiroaki Watanabe (RA, BBSRC)
Alireza Tamaddoni-Nezhad (RA, DTI)
Stephen Muggleton (Professor)
Ali Hafiz (PhD)
Huma Lodhi (RA, DTI)
Simon Colton (Lecturer)
Jung-Wook Bang (RA, DTI)
(Nicos Angelopoulos, now in York) (RA, BBSRC)
Room 407 –

Some External Collaborators
Mike Sternberg (Biochemistry, Imperial)
Jeremy Nicholson (Biomedical Sciences, Imperial)
Steve Oliver (Biology, Manchester)
Ross King (Computing, Aberystwyth)
Doug Kell (Chemistry, Manchester)
Chris Rawlings (Oxagen)
Charlie Hodgman (GSK)
Alan Bundy (Informatics, Edinburgh)
Toby Walsh (Cork Constraint Computation Centre)

Some Departmental Collaborators
Krysia Broda, Alessandra Russo, Oliver Ray
  – Aspects of ILP and ALP
Marek Sergot
  – Action Languages
Tony Kakas (Visiting professor, Cyprus)
  – Abductive Logic Programming

Machine Learning Overview
Ultimately about writing programs which improve with experience
  – Experience through data
  – Experience through knowledge
  – Experience through experimentation (active)
Some common tasks:
  – Concept learning for prediction
  – Clustering
  – Association rule mining

Maintaining a Balance
Predictive tasks
Descriptive tasks
Supervised learning
Unsupervised learning
Know what you’re looking for
Don’t know you’re even looking
Don’t know what you’re looking for

A Partial Characterisation of Learning Tasks
Concept learning
Outlier/anomaly detection
Clustering
Concept formation
Conjecture making
Puzzle generation
Theory formation

Maintaining a Balance in Predictive/Descriptive tasks
Predictive tasks
  – From accuracy to understanding
  – Need to show statistical significance
      But hypotheses generated often need to be understandable
  – Difference between the stock market and biology
Descriptive tasks
  – From pebbles to pearls
  – Lots of rubbish produced
      Cannot rely on statistical significance
  – Have to worry about notions of interestingness
      And provide tools to extract useful information from output

Maintaining a Balance in Scientific Discovery tasks
Machine learning researchers
  – Are generally not also domain scientists
Extremely important to collaborate
  – To provide interesting projects
      Remembering that we are scientists, not IT consultants
  – To gain materials
      Data, background knowledge, heuristics
  – To assess the value of the output

Inductive Logic Programming
Concept/rule learning technique (usually)
  – Hypotheses represented as Logic Programs
Search for LPs
  – From general to specific or vice-versa
      One method is inverse entailment
  – Use measures to guide the search
      Predictive accuracy and compression (info. theory)
  – Search performed within a language bias
Produces good accuracy and understanding
  – Logic programs are easier to decipher than ANNs
Our implementation: Progol (and others)
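The general-to-specific search guided by a compression-style measure can be illustrated with a small sketch. This is a propositional simplification written for this transcript, not Progol and not inverse entailment: hypotheses are conjunctions of attribute tests, and the score (positives covered, minus negatives covered, minus clause length) is an assumed stand-in for the accuracy and compression measures mentioned on the slide. All attribute names and examples are hypothetical.

```python
def covers(clause, example):
    """A clause is a tuple of (attribute, value) tests; all must hold."""
    return all(example.get(attr) == val for attr, val in clause)

def score(clause, pos, neg):
    """Compression-style score: reward covered positives, penalise
    covered negatives and longer clauses (assumed measure)."""
    p = sum(covers(clause, e) for e in pos)
    n = sum(covers(clause, e) for e in neg)
    return p - n - len(clause)

def specialise(pos, neg, literals, max_len=3):
    """Greedy general-to-specific search: start from the empty (most
    general) clause and add one literal at a time while the score improves."""
    clause = ()
    best = score(clause, pos, neg)
    while len(clause) < max_len:
        candidates = [clause + (lit,) for lit in literals if lit not in clause]
        if not candidates:
            break
        top = max(candidates, key=lambda c: score(c, pos, neg))
        if score(top, pos, neg) <= best:
            break
        clause, best = top, score(top, pos, neg)
    return clause, best

# Hypothetical toy examples, not real protein data.
pos = [{"helix_len": "hi", "pos": "early"},
       {"helix_len": "hi", "pos": "early"},
       {"helix_len": "hi", "pos": "late"}]
neg = [{"helix_len": "lo", "pos": "early"},
       {"helix_len": "lo", "pos": "late"},
       {"helix_len": "lo", "pos": "early"}]
literals = [("helix_len", "hi"), ("helix_len", "lo"),
            ("pos", "early"), ("pos", "late")]

learned_clause, learned_score = specialise(pos, neg, literals)
```

The list of allowed literals plays the role of a (very crude) language bias: the search can only build hypotheses from tests it has been given.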

Example learned LP
fold('Four-helical up-and-down bundle',P) :-
    helix(P,H1), length(H1,hi), position(P,H1,Pos),
    interval(1 =< Pos =< 3), adjacent(P,H1,H2), helix(P,H2).
Predicting protein folds from helices
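One way to read the learned clause is as a test over facts about a protein's helices: the protein has a long helix H1 at position 1 to 3 that is adjacent to another helix H2. The sketch below mirrors that reading in Python. The `Helix` record, the encoding of lengths as "hi"/"lo", and the two toy proteins are assumptions made for illustration, not data from the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Helix:
    name: str
    length: str    # discretised length, e.g. "hi" or "lo" (assumed encoding)
    position: int  # position of the helix within the protein

def is_four_helical_bundle(helices, adjacent):
    """Mirror the clause body for one protein.
    helices: list of Helix; adjacent: set of (name, name) pairs."""
    by_name = {h.name: h for h in helices}
    for h1 in helices:
        # length(H1,hi), position(P,H1,Pos), interval(1 =< Pos =< 3)
        if h1.length != "hi" or not (1 <= h1.position <= 3):
            continue
        # adjacent(P,H1,H2), helix(P,H2)
        for a, b in adjacent:
            if a == h1.name and b in by_name:
                return True
    return False

# Hypothetical toy proteins.
protein_a = [Helix("h1", "hi", 2), Helix("h2", "lo", 3)]
protein_b = [Helix("h1", "lo", 2), Helix("h2", "hi", 7)]
adj = {("h1", "h2")}
```

Here `protein_a` satisfies the rule through `h1`, while `protein_b` does not: its only long helix sits outside the 1 to 3 interval.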

Stochastic Logic Programs
Generalisation of HMMs
Probabilistic logic programs
  – More expressive language than LPs
  – Quantitative rather than qualitative
      Express arbitrary intervals over probability distributions
Issues in learning SLPs
  – Structure estimation
  – Parameter estimation
Applications
  – More appropriate for biochemical networks
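The quantitative side of an SLP can be sketched as follows, assuming the usual reading in which each clause carries a probability label, the labels for one predicate sum to one, and the probability of a derivation is the product of the labels of the clauses used. The toy program and the propositional encoding (no variables, no unification) are simplifying assumptions for illustration.

```python
from math import isclose

# Each predicate maps to its labelled clauses: (label, body).
# Hypothetical toy program in SLP style:
#   0.4 : s :- a.
#   0.6 : s :- b, s.
slp = {"s": [(0.4, ["a"]), (0.6, ["b", "s"])]}

def labels_normalised(program):
    """Check that the labels of each predicate's clauses sum to 1."""
    return all(isclose(sum(label for label, _ in clauses), 1.0)
               for clauses in program.values())

def derivation_probability(program, choices, start="s"):
    """Multiply the labels of the clauses picked along a leftmost derivation.
    choices: index of the clause chosen at each expansion, in order."""
    goals, prob = [start], 1.0
    picks = iter(choices)
    while goals:
        goal = goals.pop(0)
        if goal not in program:  # symbol with no clauses: nothing to expand
            continue
        label, body = program[goal][next(picks)]
        prob *= label
        goals = list(body) + goals
    return prob

# Use the recursive clause twice, then the base clause: 0.6 * 0.6 * 0.4.
p = derivation_probability(slp, [1, 1, 0])
```

Parameter estimation, in this picture, means fitting the labels from data; structure estimation means learning the clauses themselves.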

Automated Theory Formation
Descriptive learning technique
  – Which can also be used for prediction tasks
Cycle of activity
  – Form concepts, make hypotheses, explain hypotheses, evaluate concepts, start again,…
  – 15 production rules for concepts
  – 7 methods to discover and extract conjectures
  – Uses third party software to prove/disprove (maths)
  – 25 heuristic measures of interestingness
Project: see whether this works in bioinformatics
Our implementation: HR

Other Machine Learning Methods used in our Group
Genetic algorithms
  – To perform ILP search (Alireza)
Bayes nets
  – Introduction of hidden nodes (Philip)
Kernel methods
  – Relational kernels for SVMs and regression (Huma)
Action Languages
  – Stochastic (re)actions (Hiroaki)

Bioinformatics Overview
“Bioinformatics is the study of information content and information flow in biological systems and processes” (Michael Liebman)
  – Not just storage and analysis of huge DNA sequences
“Bioinformaticians have to be a Jack of all trades and a master of one” (Charlie Hodgman, GSK)
Highly collaborative
  – biology, mathematics, statistics, computer science, biochemistry, physics, chemistry, medicine, …

From Sequence to Structure
MRPQAPGSLVDPNEDELRMAPWYWGRISREEA
KSILHGKPDGSFLVRDALSMKGEYTLTLMKDG
CEKLIKICHMDRKYGFIETDLFNSVVEMINYY
KENSLSMYNKTLDITLSNPIVRAREDEESQPH
GDLCLLSNEFIRTCQLLQNLEQNLENKRNSFN
AIREELQEKKLHQSVFGNTEKIFRNQIKLNES
FMKAPADA……
attcgatcgatcgatcgatcaggcgcgcta
Cgagcggcgaggacctcatcatcgatcag…
There is a computer program…?

Holy Grail Number One
From protein sequence to protein function
HGP data needs to be interpreted
  – Genome split into genes, which code for a protein
  – Biological function of protein dictated by structure
Structure of many proteins already determined
  – By X-ray crystallography
Best idea so far: given a new gene sequence
  – Find sequence most similar to it with known structure
      And look at the structure/function of the protein
Other alternatives
  – Use ML techniques to predict where secondary structures will occur (e.g., hairpins, alpha-helices, beta-sheets)
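The "best idea so far" amounts to a nearest-neighbour lookup: compare the new sequence against sequences of known structure and borrow the annotation of the closest one. The sketch below uses Python's `difflib.SequenceMatcher` ratio as a crude stand-in for a proper alignment score, which is an assumption made for brevity; real pipelines use dedicated sequence-alignment tools. The two database sequences are fragments from the earlier slide, and the structure labels are hypothetical placeholders.

```python
from difflib import SequenceMatcher

# Sequence fragments taken from the "From Sequence to Structure" slide;
# the structure labels are hypothetical placeholders.
known = {
    "MRPQAPGSLVDPNEDELRMAPWYWGRISREEA": "structure_A",
    "KSILHGKPDGSFLVRDALSMKGEYTLTLMKDG": "structure_B",
}

def closest_known(query, database):
    """Return (sequence, structure label, similarity) for the database
    entry most similar to the query under the stand-in similarity."""
    def similarity(seq):
        return SequenceMatcher(None, query, seq, autojunk=False).ratio()
    best = max(database, key=similarity)
    return best, database[best], similarity(best)

# A query differing from the first entry by a single residue.
query = "MRPQAPGSLVDPNEDELRMAPWYWGRISKEEA"
match_seq, match_structure, match_score = closest_known(query, known)
```

The weakness the slide hints at is visible even here: the method can only return structures that are already in the database.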

Holy Grail Number Two
Drug companies lose millions
  – Developing drugs which turn out to be toxic
Predictive Toxicology
  – Determine in advance which will be toxic
Approach 1: Mapping molecules to toxicity
  – Using ML and statistical techniques
Approach 2:
  – Producing metabolic explanations of toxic effects
  – Using probabilistic logics to represent pathways
      And learning structures and parameters over this

Other aims of Bioinformatics
Organisation of Data
  – Cross referencing
  – Data integration is a massive problem
Analysing data from
  – High-throughput methods for gene expression
  – Ask Yike about this!
Produce Ontologies
  – And get everyone to use them?

Some Current Bioinformatics Projects
SGC
  – The Substructure Server
SGC and SHM
  – Discovery in medical ontologies
SHM
  – Studying biochemical networks (£400k, BBSRC)
  – Closed loop learning (£200k, EPSRC)
  – The Metalog project (£1.1 million, DTI)
  – APRIL 2 (£400k, EC)

A Substructure Server
Lesson from Automated Theorem Proving
  – Best (most complex) methods not most used
      Other considerations: ease of use, stability, simplicity, e.g., Otter
Aim: provide a simple predictive toxicology program
  – Via a server with a very simple interface
Sub-projects
  – Find substructures in many positives, few negatives: Colton
      Simple Prolog program, writing Java version, use ILP??
  – Put program on server: Anandathiyagar (MSc.)
  – Distribute process over our Linux cluster: Darby (MEng.)
  – Babel preprocessor (50+ representations), Rasmol back-end: ???
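The first sub-project, finding substructures that occur in many positives and few negatives, can be sketched as a counting exercise. The version below treats molecules as plain strings and substructures as substrings, which is a deliberate simplification: real substructure matching works on molecular graphs. The function names, thresholds and toy strings are all hypothetical and carry no real toxicity information.

```python
def fragments(molecule, size):
    """All contiguous fragments of a given size in a string-encoded molecule."""
    return {molecule[i:i + size] for i in range(len(molecule) - size + 1)}

def discriminating_fragments(positives, negatives, size, min_pos, max_neg):
    """Fragments found in at least min_pos positives and at most
    max_neg negatives, mapped to their (positive, negative) counts."""
    candidates = set().union(*(fragments(m, size) for m in positives))
    result = {}
    for frag in candidates:
        p = sum(frag in m for m in positives)
        n = sum(frag in m for m in negatives)
        if p >= min_pos and n <= max_neg:
            result[frag] = (p, n)
    return result

# Hypothetical toy strings standing in for molecules.
positives = ["CCNO", "CNOC", "NOCC"]
negatives = ["CCCC", "CCOC", "CNCC"]

found = discriminating_fragments(positives, negatives,
                                 size=2, min_pos=3, max_neg=0)
```

Because each candidate fragment is scored independently, the counting loop is the part that would be natural to spread over a cluster.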

The Substructure Server

Using Medical Ontologies
Use Ontology and ML for database integration
  – Muggleton and Tamaddoni-Nezhad
  – Bridge between two disparate databases
      LIGAND (biochemical reactions)
      Enzyme classification system (EC) = ontology
Automated ontology maintenance
  – Colton and Traganidas (MSc, last year)
  – Gene Ontology (big project)
  – Use data to find links between GO terms
      Equivalence and implication finding using HR
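Finding links between ontology terms from data can be illustrated with a simple empirical check: term A implies term B if every gene annotated with A is also annotated with B, and two terms are equivalent if the implication holds both ways. This sketch is not HR, which forms concepts and conjectures in a much richer way; the gene and term names and the `min_support` threshold are hypothetical.

```python
from itertools import permutations

def implications(annotations, min_support=2):
    """Pairs (a, b) such that every gene annotated with a also has b,
    requiring at least min_support genes annotated with a."""
    terms = set().union(*annotations.values())
    genes_with = {t: {g for g, ts in annotations.items() if t in ts}
                  for t in terms}
    found = set()
    for a, b in permutations(sorted(terms), 2):
        if len(genes_with[a]) >= min_support and genes_with[a] <= genes_with[b]:
            found.add((a, b))
    return found

def equivalences(imps):
    """Term pairs whose implication holds in both directions."""
    return {(a, b) for (a, b) in imps if (b, a) in imps and a < b}

# Hypothetical annotations: gene -> set of ontology terms.
annotations = {
    "g1": {"t1", "t2", "t3", "t4"},
    "g2": {"t1", "t2"},
    "g3": {"t3"},
    "g4": {"t1", "t2", "t3", "t4"},
}

imps = implications(annotations)
eqs = equivalences(imps)
```

Conjectures found this way are only empirically true of the data, so they still need to be assessed by a domain expert before being added to an ontology.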

Gene Ontology Discovery 55%

Studying Biochemical Networks
Use SLPs to find mappings between genomes
  – Map function of pairs of homologous proteins
      E.g., mouse and human
  – Homology is probabilistic
Developed SLP learning algorithms
Initial results applying them in biological networks
Work by
  – Muggleton, Angelopoulos and Watanabe

Closed Loop Machine Learning
Active learning
  – Information theoretic algorithm designs and chooses the most informative and lowest cost experiments to carry out
Implemented in the ASE-Progol system
  – Learning generates hypotheses
  – Being studied by Ali Hafiz (PhD)
Idea: use machine learning to guide experimentation
  – using a real robot geneticist in a cyclic process
Aims of current project: determine the function of genes
Cost savings of 2 to 4 times over alternatives
Upcoming Nature article
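The trade-off between informativeness and cost can be sketched with a small selection rule: given competing hypotheses with prior probabilities, and for each experiment the outcome each hypothesis predicts, choose the experiment with the highest expected reduction in entropy per unit cost. This is an assumed illustrative criterion, not the algorithm used in ASE-Progol; the hypotheses, experiments, outcomes and costs are hypothetical.

```python
from math import log2, isclose

def entropy(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def expected_information_gain(prior, predictions):
    """prior: {hypothesis: probability}; predictions: {hypothesis: outcome}.
    Expected drop in entropy over hypotheses after seeing the outcome."""
    gain = entropy(prior.values())
    for outcome in set(predictions.values()):
        consistent = [p for h, p in prior.items() if predictions[h] == outcome]
        p_outcome = sum(consistent)
        if p_outcome > 0:
            gain -= p_outcome * entropy(p / p_outcome for p in consistent)
    return gain

def choose_experiment(prior, experiments):
    """experiments: {name: (cost, predictions)}.
    Pick the experiment with the best expected gain per unit cost."""
    return max(experiments,
               key=lambda e: expected_information_gain(prior, experiments[e][1])
               / experiments[e][0])

# Hypothetical hypotheses about a gene's function, equally likely a priori.
prior = {"h1": 0.25, "h2": 0.25, "h3": 0.25, "h4": 0.25}
experiments = {
    "e1": (1.0, {"h1": "grow", "h2": "grow", "h3": "none", "h4": "none"}),
    "e2": (4.0, {"h1": "a", "h2": "b", "h3": "c", "h4": "d"}),
    "e3": (1.0, {"h1": "grow", "h2": "none", "h3": "none", "h4": "none"}),
}

chosen = choose_experiment(prior, experiments)
```

In this toy setting `e2` would identify the true hypothesis outright, but it costs four times as much as `e1`, so `e1` wins on information per unit cost.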

APRIL 2
Applications of Probabilistic Relational Induction in Logic
Aim: develop representations and learning algorithms for probabilistic logics
Applications: bioinformatics
  – Metabolic networks
  – Phylogenetics
2 RAs at Imperial (with Mike Sternberg)
  – Starting in January

The Metalog Project Overview
Aim:
  – Modelling disease pathways and predicting toxicity
  – Gap filling: existing representations correct but incomplete
  – Predict where the toxin is acting (focus)
Multi-layered problem representation
  – Meta-network level (Bayes nets) Philip
  – Network level (SLPs) Huma
  – Biochemical reaction level (LPs) Alireza
  – Problog lingua-franca developed to represent learned knowledge
NMR Data from metabonomics from Jeremy Nicholson
KEGG Background knowledge from Mike Sternberg

The Metalog Project Progress
Year 1 achievements (all objectives achieved)
Function predictions from LIGAND
Mapping between KEGG and metabolic networks
Initial Bayes-net model
  – Drawn much interest from experts
      Agrees with KEGG, and disagrees in interesting ways
      Interaction between metabolites which are not explained
Year 2
  – Working towards abductive model for gap filling

Future Directions for Machine Learning in Bioinformatics
In-silico modelling of complete organisms
Representation and reasoning at all levels
  – From patient to the molecule
Probabilistic models
  – For more complex biological processes
      Such as biochemical pathways

Biochemical Pathways
1/120th of a biochemical network

Future Directions for My Research
Descriptive Induction meets Biology data
Most ML bioinformatics projects are predictive
  – Very carefully compressed notions of interestingness
      Into a single measure: predictive accuracy
      Domain scientist not bombarded with a lot of information
      A correctly answered question can be highly revealing
Can we push this envelope slightly?
  – Use descriptive induction (WARMR, CLAUDIEN, HR)
      To tell biologists something they weren’t expecting about the data they have collated
  – Have to worry hard about dull output
      Need to determine heuristics from domain scientists

More Future Directions
Put “Automated Reasoning” back together again
  – Essential for scientific discovery
ML, ATP, CSP, etc., all work well individually
  – Surely work better in combination…
Improve ATP to prove a different theorem?
  – Make flexible using CSP and ATP
Improve ML by rationalising input concepts?
  – Use ATF and ATP to find concepts and hypotheses
Improve CSP by introducing additional constraints
  – Use ATF, ML to find constraints, ATP to prove them