Data mining in bioinformatics: problems and challenges Sorin Draghici WWW:

Why bioinformatics? = We are witnessing a "biotechnology revolution" = Biotechnology –has the potential to improve our lives dramatically (new drugs, treatments, etc.) –also has a huge destructive potential (careless genetic manipulations, etc.)

Why bioinformatics? = Human genome project –completed by Celera = How is that to be used? –map functions on genes –find/treat/correct/eliminate genetic diseases –gene treatment –patient-oriented treatment and drugs (pharmacogenomics): ACE inhibitors (blood pressure medication)

The HIV virus = HIV is a retrovirus that attacks the immune system = Replication mechanism: –RNA-based –makes lots of mistakes during replication = Compensates for the primitive replication through a high replication speed

Why is it so deadly? = 10 billion copies of HIV are produced every day = High replication speed + Many random mutations + Selection pressure from the drug = very good search ability in the version space of all viable HIV viruses

Current treatments = Protease inhibitors = Reverse transcriptase inhibitors

Current problems = Very few drugs available – 5 FDA approved protease inhibitors – 9 FDA approved RT inhibitors = Cross-resistance –patient treated with drug A may develop resistance to drug B as well

Current problems = Drug development is: = very slow (10 years) = very expensive ($10-$30 million/year) = Viral mutations are: = very probable in each generation = very rapid (10 billion copies a day) The result: throwing stones at fighter planes

Our approach = Find the structural features which: –cause drug resistance –are common to several mutants = Design drugs to counteract such common features as opposed to individual mutants –secondary therapy

[Figure: drug development and treatment flow. Labels: wild type HIV; first antiretroviral therapy (FAT) and FAT drug(s); resistance; mutant HIV 1, mutant HIV 2, mutant HIV 3; genotyping; second antiretroviral therapy (SAT) and SAT drug(s); options 1, 2 and 3, marked effective / less effective.]

Our data = Genotypic data (genetic sequences of mutants) = easy to obtain = there are lots of them = Structural data (X ray crystallography) = difficult to obtain = not very many = Phenotypic data (drug resistance) = very difficult to obtain = very few available

Our data = Genotypic data PQITLWQRPLVTIKIGGQLKEALLDTGADDT... (approx. 200 residues for protease) = Structure data = Phenotypic data –IC90 = 3.51 –fold resistance: IC90 mutant/IC90 wildtype
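
The fold-resistance ratio above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the wild-type IC90 of 0.5 is an invented value, and treating the slide's IC90 of 3.51 as a mutant measurement is an assumption.

```python
def fold_resistance(ic90_mutant: float, ic90_wildtype: float) -> float:
    """Return IC90(mutant) / IC90(wild type), the ratio given on the slide."""
    if ic90_wildtype <= 0:
        raise ValueError("wild-type IC90 must be positive")
    return ic90_mutant / ic90_wildtype


# 3.51 is the IC90 quoted on the slide (assumed here to belong to a mutant);
# the wild-type value 0.5 is hypothetical and chosen only for illustration.
ratio = fold_resistance(3.51, 0.5)
print(f"fold resistance = {ratio:.2f}")
```

Because the ratio is relative to the wild type, it is less sensitive to lab-specific scaling than a raw IC90 value.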

Our work = Develop a structure-function model of HIV drug resistance [Diagram labels: sequence, structure, resistance, Machine Learning]

Dataflow [Diagram labels: Sequence (IDVNFVSQV), Structures, Contacts/PDB, Machine learning]

Supervised learning = Inputs: –Atomic contacts between the inhibitor and the protease –Atomic distances = Output –Fold resistance

Ligplot Contacts File ligplot.nnb output:
Atom 1        Atom 2        Distance
BLK 199 C9    ILE 183 CD
BLK 199 C36   PRO 180 CG    3.87
BLK 199 C31   PRO 180 CG    3.69
BLK 199 C32   PRO 180 CB    3.72
BLK 199 C31   PRO 180 CB
BLK 199 C26   ILE 146 CG
BLK 199 C6    VAL 32 CG
BLK 199 C6    ALA 28 CB     3.79
BLK 199 C10   GLY 27 C      3.66
BLK 199 C16   LEU 23 CD2    3.81

Atomic contacts - resistance = Input Units: 200 = Hidden Units: 2 = Output Units: 1 = Number of Patterns: 21 Results: – Excellent training – Awful generalization Reason: – Not enough data points for an input space with 200 dimensions!!
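
A quick count makes the mismatch concrete. The sketch below assumes a fully connected network with one hidden layer and one bias per unit; the slides give only the layer sizes, so the exact architecture is an assumption and the function name is hypothetical.

```python
def mlp_parameter_count(n_inputs: int, n_hidden: int, n_outputs: int) -> int:
    """Weights plus biases of a fully connected net with one hidden layer."""
    hidden_layer = (n_inputs + 1) * n_hidden    # +1 for each hidden unit's bias
    output_layer = (n_hidden + 1) * n_outputs   # +1 for each output unit's bias
    return hidden_layer + output_layer


n_params = mlp_parameter_count(200, 2, 1)   # layer sizes from the slide
n_patterns = 21                             # training patterns from the slide
print(f"{n_params} free parameters for {n_patterns} patterns "
      f"({n_params / n_patterns:.1f} parameters per pattern)")
```

With far more adjustable parameters than training patterns, the network can memorize the training set, which is consistent with excellent training and poor generalization.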

Unsupervised learning = Inputs: –Contact residues (21 distinct contacts) = Output: –A self organized map embedding structural information
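
One possible way to turn contact residues into a fixed-length input is a binary vector, sketched below. The slides do not say how the contacts were encoded, so this encoding is an assumption. The rows are a few entries copied from the Ligplot contacts slide, read as (ligand atom, residue name, residue number, protein atom); the candidate residue list, including ASP 25, is hypothetical.

```python
# A few rows from the Ligplot contacts slide:
# (ligand atom, residue name, residue number, protein atom).
rows = [
    ("C36", "PRO", 180, "CG"),
    ("C31", "PRO", 180, "CG"),
    ("C32", "PRO", 180, "CB"),
    ("C6", "ALA", 28, "CB"),
    ("C10", "GLY", 27, "C"),
    ("C16", "LEU", 23, "CD2"),
]


def contact_residues(rows):
    """Collapse atom-level contacts to the set of contacted residues."""
    return {(name, number) for _, name, number, _ in rows}


def encode(contacts, vocabulary):
    """Binary vector: 1 where the residue at that position is contacted."""
    return [1 if residue in contacts else 0 for residue in vocabulary]


# Hypothetical fixed ordering of candidate residues shared by all mutants.
vocabulary = [("LEU", 23), ("ASP", 25), ("GLY", 27), ("ALA", 28), ("PRO", 180)]
vector = encode(contact_residues(rows), vocabulary)
print(vector)
```

Collapsing atoms to residues shrinks the input space considerably, which matters when only a few dozen mutants are available.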

Self-organizing feature maps

Residue contacts - resistance = Results: – Leave-one-out cross validation = between 60% and 70% correct = no prediction for 12 (out of 22) = Conclusions: = Not enough data for reliable prediction = But results are very encouraging...
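
Leave-one-out cross-validation with a "no prediction" outcome can be sketched as follows. The classifier here is a simple nearest-neighbour rule used as a stand-in, not the self-organizing map from the slides, and the five samples are invented toy data.

```python
def hamming(a, b):
    """Number of positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour_label(train, query):
    """Label of the closest training vector, or None if the closest ones disagree."""
    best = min(hamming(x, query) for x, _ in train)
    labels = {y for x, y in train if hamming(x, query) == best}
    return labels.pop() if len(labels) == 1 else None


def leave_one_out(samples):
    """Hold out each sample once; return (correct, predicted, total)."""
    correct = predicted = 0
    for i, (x, y) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]   # everything except sample i
        guess = nearest_neighbour_label(train, x)
        if guess is None:                       # counted as "no prediction"
            continue
        predicted += 1
        correct += guess == y
    return correct, predicted, len(samples)


# Hypothetical toy data: binary contact vectors with a resistance label.
samples = [
    ([1, 1, 0, 0], "resistant"),
    ([1, 1, 1, 0], "resistant"),
    ([0, 0, 1, 1], "susceptible"),
    ([0, 1, 1, 1], "susceptible"),
    ([1, 0, 0, 1], "resistant"),
]
correct, predicted, total = leave_one_out(samples)
print(f"{correct}/{predicted} correct, {total - predicted} without prediction")
```

Reporting the abstentions separately from the accuracy, as the slide does, keeps the two effects from being confused.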

Problems and challenges in bioinformatics = Insufficient data = Example: –Largest data set has 50 mutants = Why? –The field is very recent –Data collection can be very difficult (one structure may take 1-2 years if done from scratch; one IC90 value may take up to two weeks) –Data has commercial value = Solutions: –Get more data –Cross-validate very carefully

Problems and challenges in bioinformatics = Data consistency = Example: –Same sample sent to two different labs can come back with different IC90 values = Why? –The experimental tools are not mature yet = Solutions: –Select your data carefully –Use data from consistent sources –If not possible, pre-process the data to make it consistent (not very good since you actually change the data!)

Problems and challenges in bioinformatics = Data accuracy = Example: –Same sample sent to the same lab at different times can be reported with different IC90 values (4 fold error) = Why? –The experimental tools are not mature yet = Solutions: –Use relative values to reduce the requirement for high numerical precision –Map data into clusters and attach values to clusters (1-4 no resistance, 4-10 reduced resistance, >10 resistance)
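
The clustering idea above can be sketched as a simple binning function. The class names and thresholds are taken from the slide; how the boundary values 4 and 10 are assigned, and the treatment of values below 1, are assumptions.

```python
def resistance_class(fold: float) -> str:
    """Map a fold-resistance value to one of the three classes on the slide."""
    if fold <= 4:        # assumption: 4 itself and values below 1 count as "no resistance"
        return "no resistance"
    if fold <= 10:       # assumption: 10 itself counts as "reduced resistance"
        return "reduced resistance"
    return "resistance"


# Two hypothetical noisy measurements of the same sample fall in the same class.
for fold in (5.0, 8.0, 12.0):
    print(fold, "->", resistance_class(fold))
```

Coarse classes tolerate measurement noise inside a bin, although a sample near a threshold can still switch class between two measurements.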

Problems and challenges in bioinformatics = Data quality = Example: –Papers reporting IC90 values do not give the whole sequence = Why? –People are not aware of its importance –Data may have commercial value = Solutions: –Never trust your data...

Problems and challenges in bioinformatics = The choice of features = Example: –Atoms?, Residues?, Genes?, Larger structures? = Why? –The phenomena are very complex and span different scales in time and space = Solutions: –Try to merge different types of data in order to capture the complexity of the phenomenon –Use several qualitatively different analysis and machine learning techniques

Problems and challenges in bioinformatics = Lack of tools = Example: –There were no tools able to correlate sequence/structure/resistance data for the HIV virus –We wrote more than 15,000 lines of code for this problem = Why? –The field is new –The structure/function problem is just starting to be addressed = Solutions: –Develop your own software –Partnerships with bioinformatics companies?

Problems and challenges in bioinformatics = Difficult communication between the "bio" and the "informatics" sides = Example: –Definition of "successful prediction" = Why? –Different backgrounds, different traditions = Solution: –Cross-training –Exposure to "the other" field

Conclusions = Data mining in bioinformatics is: = Challenging = Interesting = Useful