Data mining in bioinformatics: problems and challenges Sorin Draghici WWW:

Data mining in bioinformatics: problems and challenges Sorin Draghici Email: sod@cs.wayne.edu WWW: http://vortex.cs.wayne.edu http://www.cs.wayne.edu/~sod

Why bioinformatics? = We are witnessing a "biotechnology revolution" = Biotechnology: has the potential to improve our lives dramatically (new drugs, treatments, etc.); also has a huge destructive potential (careless genetic manipulations, etc.)

Why bioinformatics? = Human genome project completed by Celera = How is that to be used? map functions on genes; find/treat/correct/eliminate genetic diseases; gene treatment; patient-oriented treatment and drugs (pharmacogenomics): ACE inhibitors (blood pressure medication

The HIV virus = HIV is a retrovirus that attacks the immune system = Replication mechanism: RNA based; makes many mistakes during replication = Compensates for the primitive replication with a high replication speed

Why is it so deadly? = 10 billion copies of HIV are produced every day = High replication speed + many random mutations + selection pressure from the drug together give a very good search ability in the version space of all viable HIV viruses

Current treatments = Protease inhibitors = Reverse transcriptase (RT) inhibitors

Current problems = Very few drugs available: 5 FDA approved protease inhibitors; 9 FDA approved RT inhibitors = Cross-resistance: a patient treated with drug A may develop resistance to drug B as well

Current problems = Drug development is: very slow (10 years); very expensive ($10-$30 million/year) = Viral mutations are: very probable in each generation; very rapid (10 billion copies a day) = The result: throwing stones at fighter planes

Our approach = Find the structural features which: cause drug resistance; are common to several mutants = Design drugs to counteract such common features, as opposed to individual mutants = Secondary therapy

[Figure: drug development and therapy diagram. Labels: wild type HIV; mutant HIV 1, mutant HIV 2, mutant HIV 3; drug development; genotyping; first antiretroviral therapy (FAT) drug(s); resistance; second antiretroviral therapy (SAT) drug(s); option 1, option 2, option 3; effective, less effective]

Our data = Genotypic data (genetic sequences of mutants): easy to obtain; there are lots of them = Structural data (X-ray crystallography): difficult to obtain; not very many = Phenotypic data (drug resistance): very difficult to obtain; very few available

Our data = Genotypic data: PQITLWQRPLVTIKIGGQLKEALLDTGADDT... (approx. 200 residues for protease) = Structure data = Phenotypic data: e.g. IC90 = 3.51; fold resistance = IC90 mutant / IC90 wild type
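
The fold resistance ratio defined on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name and the two IC90 values in the usage line are hypothetical and chosen only to show the arithmetic.

```python
def fold_resistance(ic90_mutant: float, ic90_wild_type: float) -> float:
    """Ratio of the mutant IC90 to the wild-type IC90 (same units for both)."""
    if ic90_wild_type <= 0:
        raise ValueError("wild-type IC90 must be positive")
    return ic90_mutant / ic90_wild_type


# Hypothetical IC90 values, used only to illustrate the ratio.
print(fold_resistance(ic90_mutant=7.0, ic90_wild_type=2.0))  # 3.5
```

Because the value is a ratio, it is dimensionless, which makes measurements of different mutants comparable as long as each is divided by a wild-type value from the same assay.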

Our work = Develop a structure-function model of HIV drug resistance [Figure labels: sequence, structure, resistance, machine learning]

Dataflow [Figure labels: sequence (IDVNFVSQV), contacts/PDB structures, machine learning]

Supervised learning = Inputs: atomic contacts between the inhibitor and the protease; atomic distances = Output: fold resistance

Ligplot Contacts File = ligplot.nnb output:
Atom 1          Atom 2          Distance
BLK 199 C9      ILE 183 CD1     3.70
BLK 199 C36     PRO 180 CG      3.87
BLK 199 C31     PRO 180 CG      3.69
BLK 199 C32     PRO 180 CB      3.72
BLK 199 C31     PRO 180 CB      3.73
...
BLK 199 C26     ILE 146 CG2     3.86
BLK 199 C6      VAL 32  CG2     3.81
BLK 199 C6      ALA 28  CB      3.79
BLK 199 C10     GLY 27  C       3.66
BLK 199 C16     LEU 23  CD2     3.81
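
A contact listing like the one above can be turned into records with a small parser. The sketch below assumes each data row has exactly the seven whitespace-separated fields shown on the slide (residue name, residue number, atom name for each partner, then the distance); the real ligplot.nnb file may contain other lines, which this sketch simply skips. The `Contact` type and `parse_contacts` function are hypothetical names.

```python
from typing import List, NamedTuple


class Contact(NamedTuple):
    res1_name: str
    res1_num: int
    atom1: str
    res2_name: str
    res2_num: int
    atom2: str
    distance: float


def parse_contacts(text: str) -> List[Contact]:
    """Parse rows such as 'BLK 199 C9 ILE 183 CD1 3.70'; skip anything else."""
    contacts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # header, blank line, or elision marker
        try:
            contacts.append(Contact(fields[0], int(fields[1]), fields[2],
                                    fields[3], int(fields[4]), fields[5],
                                    float(fields[6])))
        except ValueError:
            continue  # seven fields, but not in the assumed layout
    return contacts


sample = """\
Atom 1 Atom 2 Distance
BLK 199 C9  ILE 183 CD1 3.70
BLK 199 C36 PRO 180 CG  3.87
BLK 199 C10 GLY 27  C   3.66
"""
contacts = parse_contacts(sample)
# Distinct protease residues in contact with the inhibitor.
contact_residues = sorted({(c.res2_name, c.res2_num) for c in contacts})
print(contact_residues)
```

Collapsing atom-level rows to distinct residues, as in the last lines, is one way to move from the atomic-contact inputs to residue-level inputs.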

Atomic contacts - resistance = Input units: 200 = Hidden units: 2 = Output units: 1 = Number of patterns: 21 = Results: excellent training; awful generalization = Reason: not enough data points for an input space with 200 dimensions!
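
The mismatch between model size and data size can be made concrete by counting free parameters. The sketch below assumes a fully connected network with one bias per hidden and output unit; the slide does not state the exact architecture details, so the bias term is an assumption.

```python
def mlp_parameter_count(n_inputs: int, n_hidden: int, n_outputs: int,
                        biases: bool = True) -> int:
    """Number of trainable weights in a one-hidden-layer, fully connected net."""
    count = n_inputs * n_hidden + n_hidden * n_outputs
    if biases:
        count += n_hidden + n_outputs  # assumed: one bias per non-input unit
    return count


n_params = mlp_parameter_count(200, 2, 1)
n_patterns = 21
print(n_params, round(n_params / n_patterns, 1))  # 405 19.3
```

With roughly 19 free parameters per training pattern, the network can fit the training set without learning anything that transfers, which is consistent with the reported result.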

Unsupervised learning = Inputs: contact residues (21 distinct contacts) = Output: a self-organizing map embedding structural information

Self-organizing feature maps
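
For readers unfamiliar with the technique, here is a minimal self-organizing map in plain Python. It is a generic textbook-style sketch, not the implementation used in this work: the grid size, learning-rate schedule, neighbourhood width, and the two toy 21-element binary vectors (one element per contact) are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights,
               key=lambda n: sum((w - v) ** 2 for w, v in zip(weights[n], x)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
        for x in rng.sample(data, len(data)):    # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])  # pull node toward x
    return weights


# Toy input: 21 binary contact indicators per mutant (hypothetical patterns).
pattern_a = [1] * 10 + [0] * 11
pattern_b = [0] * 11 + [1] * 10
som = train_som([pattern_a, pattern_b, pattern_a, pattern_b], rows=3, cols=3)
print(best_matching_unit(som, pattern_a), best_matching_unit(som, pattern_b))
```

After training, each mutant is represented by the grid position of its best matching unit, so structurally similar mutants tend to land on the same or neighbouring nodes.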

Residue contacts - resistance = Results (leave-one-out cross validation): between 60% and 70% correct; no prediction for 12 (out of 22) = Conclusions: not enough data for reliable prediction, but results are very encouraging...
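
Leave-one-out cross validation holds out each sample in turn, trains on the rest, and scores the held-out prediction. The sketch below illustrates the procedure only: it swaps in a 1-nearest-neighbour classifier with Hamming distance in place of the self-organizing map, and the four labelled vectors are made up.

```python
def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour_label(train, query):
    """Label of the training sample closest to the query (stand-in model)."""
    return min(train, key=lambda item: hamming(item[0], query))[1]


def leave_one_out_accuracy(samples):
    correct = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # everything except sample i
        if nearest_neighbour_label(train, features) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical contact patterns and labels, for illustration only.
samples = [
    ((1, 1, 0, 0), "resistant"),
    ((1, 1, 1, 0), "resistant"),
    ((0, 0, 1, 1), "susceptible"),
    ((0, 1, 1, 1), "susceptible"),
]
print(leave_one_out_accuracy(samples))  # 1.0
```

With only about twenty samples, every held-out case changes the estimate by several percentage points, so the resulting accuracy should be read as a rough indication.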

Problems and challenges in bioinformatics = Insufficient data = Example: the largest data set has 50 mutants = Why? The field is very recent; data collection can be very difficult (one structure may take 1-2 years if done from scratch; one IC90 value may take up to two weeks); data has commercial value = Solutions: get more data; cross-validate very carefully

Problems and challenges in bioinformatics = Data consistency = Example: the same sample sent to two different labs can come back with different IC90 values = Why? The experimental tools are not mature yet = Solutions: select your data carefully; use data from consistent sources; if that is not possible, pre-process the data to make it consistent (not very good, since you actually change the data!)

Problems and challenges in bioinformatics = Data accuracy = Example: the same sample sent to the same lab at different times can be reported with different IC90 values (4 fold error) = Why? The experimental tools are not mature yet = Solutions: use relative values to reduce the requirement for high numerical precision; map data into clusters and attach values to clusters (1-4 no resistance, 4-10 reduced resistance, >10 resistance)
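
Mapping fold resistance values into the three clusters listed above can be sketched as follows. The slide leaves the boundary cases open, so two assumptions are made here: a value of exactly 4 or exactly 10 is assigned to "reduced resistance", and values below 1 are treated as "no resistance". The function name is hypothetical.

```python
def resistance_class(fold: float) -> str:
    """Coarse resistance cluster for a fold resistance value."""
    if fold <= 0:
        raise ValueError("fold resistance must be positive")
    if fold < 4:
        return "no resistance"       # includes values below 1 (assumption)
    if fold <= 10:
        return "reduced resistance"  # boundaries 4 and 10 land here (assumption)
    return "resistance"


for value in (3.51, 4.0, 12.0):
    print(value, resistance_class(value))
```

Coarse classes tolerate measurement noise better than raw values, but with a 4 fold error between repeated measurements a sample near a boundary can still change class.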

Problems and challenges in bioinformatics = Data quality = Example: papers reporting IC90 values do not give the whole sequence = Why? People are not aware of its importance; data may have commercial value = Solutions: never trust your data...

Problems and challenges in bioinformatics = The choice of features = Example: atoms? residues? genes? larger structures? = Why? The phenomena are very complex and span different scales in time and space = Solutions: try to merge different types of data in order to capture the complexity of the phenomenon; use several qualitatively different analysis and machine learning techniques

Problems and challenges in bioinformatics = Lack of tools = Example: there were no tools able to correlate sequence/structure/resistance data for HIV; we wrote more than 15,000 lines of code for this problem = Why? The field is new; the structure/function problem is just starting to be addressed = Solutions: develop your own software; partnerships with bioinformatics companies?

Problems and challenges in bioinformatics = Difficult communication between the "bio" and the "informatics" sides = Example: the definition of "successful prediction" = Why? Different backgrounds, different traditions = Solutions: cross-training; exposure to "the other" field

Conclusions = Data mining in bioinformatics is: = Challenging = Interesting = Useful
