Data mining in bioinformatics: problems and challenges, Sorin Draghici

1 Data mining in bioinformatics: problems and challenges Sorin Draghici Email: sod@cs.wayne.edu WWW: http://vortex.cs.wayne.edu http://www.cs.wayne.edu/~sod

2 Why bioinformatics? = We are witnessing a "biotechnology revolution" = Biotechnology –has the potential to improve our lives dramatically (new drugs, treatments, etc.) –also has a huge destructive potential (careless genetic manipulations, etc.)

3 Why bioinformatics? = Human genome project –completed by Celera = How is that to be used? –map functions on genes –find/treat/correct/eliminate genetic diseases –gene treatment –patient oriented treatment and drugs (pharmacogenomics): ACE inhibitors (blood pressure medication)

4 The HIV virus = HIV is a retrovirus that attacks the immune system = Replication mechanism: –RNA based –makes lots of mistakes during replication = Compensates for the primitive replication through a high replication speed

5 Why is it so deadly? = 10 billion copies of HIV are produced every day = High replication speed + many random mutations + selection pressure from the drug together give a very good search ability in the version space of all viable HIV variants

6 Current treatments = Protease inhibitors = Reverse transcriptase inhibitors

7 Current problems = Very few drugs available – 5 FDA approved protease inhibitors – 9 FDA approved RT inhibitors = Cross-resistance –patient treated with drug A may develop resistance to drug B as well

8 Current problems = Drug development is: = very slow (10 years) = very expensive ($10-$30 million/year) = Viral mutations are: = very probable in each generation = very rapid (10 billion copies a day) The result: throwing stones at fighter planes

9 Our approach = Find the structural features which: –cause drug resistance –are common to several mutants = Design drugs to counteract such common features as opposed to individual mutants –secondary therapy

10 [Diagram: drug development and therapy flow. Labels: wild type HIV; first antiretroviral therapy (FAT) drug(s); resistance; mutant HIV 1, mutant HIV 2, mutant HIV 3; genotyping; second antiretroviral therapy (SAT) drug(s); option 1, option 2, option 3; effective, less effective]

11 Our data = Genotypic data (genetic sequences of mutants) = easy to obtain = there are lots of them = Structural data (X-ray crystallography) = difficult to obtain = not very many = Phenotypic data (drug resistance) = very difficult to obtain = very few available

12 Our data = Genotypic data PQITLWQRPLVTIKIGGQLKEALLDTGADDT... (approx. 200 residues for protease) = Structure data = Phenotypic data –IC90 = 3.51 –fold resistance: IC90 mutant/IC90 wildtype
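The fold resistance defined on this slide is a simple ratio, and a minimal Python sketch of it follows. The function name and the wild-type IC90 used in the example are illustrative assumptions. Only the value 3.51 appears on the slide, and treating it as the mutant IC90 is also an assumption.

```python
def fold_resistance(ic90_mutant: float, ic90_wildtype: float) -> float:
    """Return IC90(mutant) / IC90(wild type), i.e. the fold resistance."""
    if ic90_wildtype <= 0:
        raise ValueError("wild-type IC90 must be positive")
    return ic90_mutant / ic90_wildtype


# 3.51 is the IC90 quoted on the slide (assumed here to be the mutant value);
# 1.17 is a made-up wild-type IC90 used only to exercise the function.
ratio = fold_resistance(3.51, 1.17)
print(f"fold resistance: {ratio:.2f}")
```

Working with the ratio rather than the raw IC90 makes values from different assays comparable, as long as the mutant and wild-type measurements come from the same assay.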

13 Our work = Develop a structure-function model of HIV drug resistance [Diagram labels: structure; sequence; resistance; Machine Learning]

14 Dataflow [Diagram labels: Sequence; Contacts/PDB; Structures; IDVNFVSQV; Machine learning]

15 Supervised learning = Inputs: –Atomic contacts between the inhibitor and the protease –Atomic distances = Output –Fold resistance

16 Ligplot Contacts File
ligplot.nnb output:
Atom 1          Atom 2          Distance
BLK 199 C9      ILE 183 CD1     3.70
BLK 199 C36     PRO 180 CG      3.87
BLK 199 C31     PRO 180 CG      3.69
BLK 199 C32     PRO 180 CB      3.72
BLK 199 C31     PRO 180 CB      3.73
...
BLK 199 C26     ILE 146 CG2     3.86
BLK 199 C6      VAL 32 CG2      3.81
BLK 199 C6      ALA 28 CB       3.79
BLK 199 C10     GLY 27 C        3.66
BLK 199 C16     LEU 23 CD2      3.81
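As a rough illustration, contact lines like the ones on this slide could be read into records as sketched below. The parser assumes the simplified seven-field layout shown on the slide (residue name, residue number, atom name for each partner, then the distance). It is not a reader for the full ligplot.nnb file format, and the `Contact` record and `parse_contacts` function are hypothetical names.

```python
from typing import List, NamedTuple


class Contact(NamedTuple):
    ligand_res: str     # e.g. "BLK" (the inhibitor)
    ligand_num: int
    ligand_atom: str
    protein_res: str    # e.g. "ILE"
    protein_num: int
    protein_atom: str
    distance: float     # value from the Distance column


def parse_contacts(text: str) -> List[Contact]:
    """Parse whitespace-separated contact lines; skip anything else."""
    contacts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # header, "..." or blank line
        try:
            contacts.append(Contact(
                fields[0], int(fields[1]), fields[2],
                fields[3], int(fields[4]), fields[5],
                float(fields[6]),
            ))
        except ValueError:
            continue  # non-numeric residue number or distance
    return contacts


sample = """Atom 1 Atom 2 Distance
BLK 199 C9 ILE 183 CD1 3.70
BLK 199 C36 PRO 180 CG 3.87
"""
for contact in parse_contacts(sample):
    print(contact.protein_res, contact.protein_num, contact.distance)
```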

17 Atomic contacts - resistance = Input Units: 200 = Hidden Units: 2 = Output Units: 1 = Number of Patterns: 21 Results: – Excellent training – Awful generalization Reason: – Not enough data points for an input space with 200 dimensions!!
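One way to see why this network memorizes rather than generalizes is to count its free parameters. The sketch below assumes a fully connected feed-forward network with one bias per hidden and output unit. The slide gives only the layer sizes, so the architecture details are an assumption.

```python
def mlp_parameter_count(n_in: int, n_hidden: int, n_out: int,
                        biases: bool = True) -> int:
    """Count weights (and optionally biases) of a one-hidden-layer network."""
    weights = n_in * n_hidden + n_hidden * n_out
    bias_terms = (n_hidden + n_out) if biases else 0
    return weights + bias_terms


n_params = mlp_parameter_count(200, 2, 1)   # layer sizes from the slide
n_patterns = 21                             # number of patterns from the slide
print(f"{n_params} parameters for {n_patterns} training patterns")
```

Under these assumptions the network has 405 adjustable parameters for 21 training patterns, roughly 19 parameters per pattern, which is consistent with the excellent training and poor generalization reported on the slide.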

18 Unsupervised learning = Inputs: –Contact residues (21 distinct contacts) = Output: –A self organized map embedding structural information

19 Ligplot Contacts File
ligplot.nnb output:
Atom 1          Atom 2          Distance
BLK 199 C9      ILE 183 CD1     3.70
BLK 199 C36     PRO 180 CG      3.87
BLK 199 C31     PRO 180 CG      3.69
BLK 199 C32     PRO 180 CB      3.72
BLK 199 C31     PRO 180 CB      3.73
...
BLK 199 C26     ILE 146 CG2     3.86
BLK 199 C6      VAL 32 CG2      3.81
BLK 199 C6      ALA 28 CB       3.79
BLK 199 C10     GLY 27 C        3.66
BLK 199 C16     LEU 23 CD2      3.81

20 Self-organizing feature maps
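For readers unfamiliar with the technique, a minimal self-organizing map can be sketched in plain Python as follows. This is a generic textbook-style SOM, not the authors' implementation. The grid size, learning-rate schedule, neighbourhood schedule and the binary encoding of contact residues in the example are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0          # initial neighbourhood radius
    # One weight vector per grid node, initialised uniformly in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x = data[i]
            decay = 1.0 - step / total_steps     # linear decay towards 0
            lr = lr0 * decay
            sigma = max(sigma0 * decay, 0.5)     # keep a small neighbourhood
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                # Move the node towards x, scaled by its grid distance to the BMU.
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


# Hypothetical binary contact vectors (1 = residue contacts the inhibitor).
patterns = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
som = train_som(patterns, rows=3, cols=3)
for p in patterns:
    print(p, "->", best_matching_unit(som, p))
```

After training, each input is represented by the grid position of its best matching unit, so inputs with similar contact patterns tend to land on the same or neighbouring nodes.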

21 Residue contacts - resistance = Results: – Leave-one-out cross-validation = between 60% and 70% correct = no prediction for 12 (out of 22) = Conclusions: = Not enough data for reliable prediction = But results are very encouraging...
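Leave-one-out cross-validation, as named on this slide, holds out each sample in turn, trains on the rest, and scores the held-out prediction. The sketch below uses a 1-nearest-neighbour classifier over Hamming distance purely as a stand-in. It is not the model used in the talk, and the toy data, labels and the `max_dist` abstention rule are assumptions. Returning `None` models the "no prediction" cases mentioned on the slide.

```python
def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour(train_x, train_y, query, max_dist=None):
    """Label of the closest training sample, or None if it is too far away."""
    best = min(range(len(train_x)), key=lambda j: hamming(train_x[j], query))
    if max_dist is not None and hamming(train_x[best], query) > max_dist:
        return None  # abstain: no sufficiently similar training sample
    return train_y[best]


def leave_one_out(samples, labels, classify):
    """Return (correct, predicted, total) over all held-out samples."""
    correct = predicted = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        guess = classify(train_x, train_y, samples[i])
        if guess is None:
            continue  # counted in total, but not as a prediction
        predicted += 1
        if guess == labels[i]:
            correct += 1
    return correct, predicted, len(samples)


# Toy data: hypothetical binary feature vectors with made-up class labels.
toy_x = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
toy_y = ["low", "low", "high", "high"]
correct, predicted, total = leave_one_out(toy_x, toy_y, nearest_neighbour)
print(f"{correct} correct out of {predicted} predictions ({total} samples)")
```

Reporting the number of abstentions alongside the accuracy matters with data sets this small, because accuracy computed only over the predicted cases can look better than the coverage justifies.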

22 Problems and challenges in bioinformatics = Insufficient data = Example: –Largest data set has 50 mutants = Why? –The field is very recent –Data collection can be very difficult (one structure may take 1-2 years if done from scratch; one IC90 value may take up to two weeks) –Data has commercial value = Solutions: –Get more data –Cross-validate very carefully

23 Problems and challenges in bioinformatics = Data consistency = Example: –Same sample sent to two different labs can come back with different IC90 values = Why? –The experimental tools are not mature yet = Solutions: –Select your data carefully –Use data from consistent sources –If not possible, pre-process the data to make it consistent (not very good since you actually change the data!)

24 Problems and challenges in bioinformatics = Data accuracy = Example: –Same sample sent to the same lab at different times can be reported with different IC90 values (4-fold error) = Why? –The experimental tools are not mature yet = Solutions: –Use relative values to reduce the requirement for high numerical precision –Map data into clusters and attach values to clusters (1-4 no resistance, 4-10 reduced resistance, >10 resistance)
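The clustering rule on this slide can be written as a small binning function. The slide leaves the boundaries ambiguous, so the choices below are assumptions: a fold resistance of exactly 4 or exactly 10 is placed in the middle class, and values below 1 are treated as no resistance. The class names are taken from the slide.

```python
def resistance_class(fold: float) -> str:
    """Map a fold-resistance value to one of the three classes on the slide."""
    if fold < 0:
        raise ValueError("fold resistance cannot be negative")
    if fold < 4:
        return "no resistance"        # slide: 1-4 (values below 1 included here)
    if fold <= 10:
        return "reduced resistance"   # slide: 4-10 (both boundaries included here)
    return "resistance"               # slide: >10


for value in (1.0, 3.9, 4.0, 10.0, 25.0):
    print(value, "->", resistance_class(value))
```

Coarse classes are less sensitive to measurement noise than raw IC90 ratios, although a sample close to a boundary can still change class under an error of the size quoted on the slide.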

25 Problems and challenges in bioinformatics = Data quality = Example: –Papers reporting IC90 values do not give the whole sequence = Why? –People are not aware of its importance –Data may have commercial value = Solutions: –Never trust your data...

26 Problems and challenges in bioinformatics = The choice of features = Example: –Atoms?, Residues?, Genes?, Larger structures? = Why? –The phenomena are very complex and span different scales in time and space = Solutions: –Try to merge different types of data in order to capture the complexity of the phenomenon –Use several qualitatively different analysis and machine learning techniques

27 Problems and challenges in bioinformatics = Lack of tools = Example: –There were no tools able to correlate sequence/structure/resistance data for the HIV virus –We wrote more than 15,000 lines of code for this problem = Why? –The field is new –The structure/function problem is just starting to be addressed = Solutions: –Develop your own software –Partnerships with bioinformatics companies?

28 Problems and challenges in bioinformatics = Difficult communication between the "bio" and the "informatics" sides = Example: –Definition of "successful prediction" = Why? –Different backgrounds, different traditions = Solution: –Cross-training –Exposure to "the other" field

29 Conclusions = Data mining in bioinformatics is: = Challenging = Interesting = Useful

