An algorithm to guide selection of specific biomolecules to be studied by wet-lab experiments
Jessica Wehner and Madhavi Ganapathiraju
Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
Presented by Thahir P. Mohamed
Advancing Practice, Instruction & Innovation through Informatics, October 19-23, 2008
Protein Structure
- Primary structure: chain of amino acids
- Secondary structure: substructures such as helices and strands
- Tertiary structure: atomic resolution of protein structure
- Protein structure is essential for successful design of drugs
Challenges in Protein Structure Prediction
- X-ray crystallography and NMR spectroscopy are wet-lab methods to determine structure
  - Very expensive
  - Very time consuming
- Computational techniques are applied to predict protein structure
Computational Protein Structure Prediction
- Machine learning techniques are applied to predict structure
- Experimentally determined structures are used to learn to predict new structures
- When there is not enough data to learn from, active learning is applied to select the next protein to be studied experimentally
Active Learning
[Figure, built up over five slides. Panel labels in order of appearance: Unlabeled Proteins, Cluster, Clustered Proteins, Selection Algorithm, Labeled Proteins, Prediction. Legend: Possible Labels]
- Active learning guides selection of data points for which you ask for labels
Membrane Protein Structure Prediction
Membrane protein (MP) importance and challenges
Membrane proteins:
- 30% of genes
- Cell regulation and signaling pathways
- 60% of drug targets
Yet:
- Difficult to study experimentally
- 1% of known protein structures
Active learning can be used to work around the limited number of known MP structures, given the large number of known MP sequences.
‘Features’ Representation
Properties:
- Charge
- Size
- Polarity
- Aromaticity
- Electronic properties
Example (column alignment of the Topology and Charge rows was lost in extraction):
Residue: A L H W R A A G A A T V L L V I V E R G A P G A Q L I
Topology: M M M M M M M M M M M M
Charge: - - p - p n p
E-Prop: D d . . A D D . D D a d d d d d d D A . D D . D a d d
Data reduction is performed by SVD, resulting in a final 4 features per window.
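The reduction step on this slide can be illustrated with a minimal sketch. It is not the authors' implementation: the property table holds random placeholder numbers rather than real charge, size, polarity, aromaticity, or electronic-property codes, and the window length of 7, the mean-centering, and the names `window_features` and `reduce_with_svd` are all assumptions made for illustration. Only the final count of 4 features per window comes from the slide.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_PROPERTIES = 5  # charge, size, polarity, aromaticity, electronic properties

# ASSUMPTION: random placeholder values stand in for real property encodings.
rng = np.random.default_rng(0)
PROPERTY_TABLE = {aa: rng.normal(size=N_PROPERTIES) for aa in AMINO_ACIDS}


def window_features(sequence, window=7):
    """Concatenate per-residue property vectors over each sliding window."""
    encoded = np.array([PROPERTY_TABLE[aa] for aa in sequence])
    rows = [encoded[i:i + window].ravel()
            for i in range(len(sequence) - window + 1)]
    return np.array(rows)


def reduce_with_svd(features, n_components=4):
    """Project mean-centered features onto the top right-singular vectors."""
    centered = features - features.mean(axis=0)  # ASSUMPTION: centering
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


sequence = "ALHWRAAGAATVLLVIVERGAPGAQLI"  # residue row from the slide
raw = window_features(sequence)
reduced = reduce_with_svd(raw)
print(raw.shape, reduced.shape)  # (21, 35) (21, 4)
```

With 27 residues and a window of 7 there are 21 windows, each described by 35 raw values and reduced to 4.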
Clustering the Data
[Figure: data plotted along axes Dim 1, Dim 2, Dim 3]
Neural network: Self-Organizing Map (SOM)
- Finds centroids of clusters in the data
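As a rough illustration of how a SOM finds cluster centroids, the sketch below trains a small map with plain NumPy on synthetic 3-D data. The grid size, learning-rate and neighborhood schedules, iteration count, and the names `train_som` and `assign_nodes` are assumptions chosen for the example; the slides do not state the settings actually used.

```python
import numpy as np


def train_som(data, grid_shape=(3, 3), n_iter=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Train a minimal SOM and return one weight vector (centroid) per node."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Position of every node on the 2-D map grid, shape (rows * cols, 2).
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)],
                      dtype=float)
    # Initialise node weights from randomly chosen data points.
    weights = data[rng.choice(len(data), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                 # linearly decaying step size
        sigma = sigma0 * (1.0 - frac) + 1e-2    # shrinking neighborhood width
        x = data[rng.integers(len(data))]
        # Best-matching unit: the node whose weights are closest to the sample.
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
        influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        # Pull the BMU and its grid neighbors toward the sample.
        weights += lr * influence[:, None] * (x - weights)
    return weights


def assign_nodes(data, weights):
    """Return the index of the nearest node for every data point."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)


# Synthetic stand-in for the reduced window features (three loose groups).
data_rng = np.random.default_rng(1)
data = np.vstack([data_rng.normal(loc=c, scale=0.3, size=(100, 3))
                  for c in (0.0, 3.0, 6.0)])
weights = train_som(data)
nodes = assign_nodes(data, weights)
print(weights.shape, np.bincount(nodes, minlength=len(weights)))
```

The node assignments returned by `assign_nodes` play the role of the clusters that the selection designs operate on.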
Design 1: Density-based Selection
- Find the densest cluster
  - Choose the N points closest to its centroid
  - Find labels for these points: TM (transmembrane) or NTM (non-transmembrane)
  - Find the majority label, say L
  - Assign L to all points in the cluster
- Repeat for the next densest cluster
- Clusters with no known structures are marked for study by experiments
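The steps above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: "densest" is taken to mean the cluster with the most members, and `density_based_selection`, `oracle`, and `n_queries` are hypothetical names. The `oracle` callable stands in for looking up an experimentally known label and returns `None` when no structure is known.

```python
from collections import Counter

import numpy as np


def density_based_selection(data, node_of, centroids, oracle, n_queries=10):
    """Label whole clusters from a few queried points near each centroid."""
    predicted = {}
    needs_experiment = []
    # ASSUMPTION: density is approximated by the number of members per node.
    sizes = Counter(node_of.tolist())
    for node, _ in sizes.most_common():          # densest cluster first
        members = np.flatnonzero(node_of == node)
        dist = np.linalg.norm(data[members] - centroids[node], axis=1)
        closest = members[np.argsort(dist)[:n_queries]]
        labels = [oracle(int(i)) for i in closest]
        labels = [lab for lab in labels if lab is not None]
        if not labels:
            # No known structure in this cluster: flag it for wet-lab study.
            needs_experiment.append(node)
            continue
        majority = Counter(labels).most_common(1)[0][0]
        for i in members:
            predicted[int(i)] = majority
    return predicted, needs_experiment


# Toy data: three clusters; point 6 has no known label.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.2, 0.2],
                 [5.0, 5.0], [5.1, 5.0], [9.0, 9.0]])
node_of = np.array([0, 0, 0, 0, 1, 1, 2])
centroids = np.array([[0.05, 0.05], [5.05, 5.0], [9.0, 9.0]])
known = {0: "TM", 1: "TM", 2: "NTM", 3: "TM", 4: "NTM", 5: "NTM"}

predicted, needs_experiment = density_based_selection(
    data, node_of, centroids, known.get, n_queries=3)
print(predicted)         # {0: 'TM', 1: 'TM', 2: 'TM', 3: 'TM', 4: 'NTM', 5: 'NTM'}
print(needs_experiment)  # [2]
```

In the toy run, point 2 is truly NTM but receives the cluster's majority label TM, which is the kind of error the accuracy comparison against random selection measures.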
Design 1 Results
- Increase the number of data points for which we request structure labels
- Compare how accuracy varies between guided selection (via active learning) and random selection
- A total of only 10 labels per node, about 1% of the data
Design 2: Protein-based Selection
- Pick a random protein
- Find labels for all windows in this protein
- For each node containing labels, find the mode L of all labels it contains
- Assign L to the remaining data in the node
- Repeat and update for a new protein, until half have been selected
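A minimal sketch of this procedure is given below, under stated assumptions: "half" is read as half of the proteins, windows in nodes that contain no labels are left unpredicted, and `protein_based_selection`, `window_protein`, `window_node`, and `oracle` are hypothetical names introduced for the example.

```python
import random
from collections import Counter, defaultdict


def protein_based_selection(window_protein, window_node, oracle, seed=0):
    """Label windows protein by protein and spread node modes to the rest.

    window_protein[i] is the protein that window i belongs to,
    window_node[i] is the cluster node of window i,
    oracle(i) returns the experimentally known label of window i.
    """
    rng = random.Random(seed)
    proteins = sorted(set(window_protein))
    rng.shuffle(proteins)                       # random selection order
    selected = proteins[: len(proteins) // 2]   # ASSUMPTION: half of the proteins

    known = {}
    node_labels = defaultdict(Counter)
    for protein in selected:
        # Ask for labels of every window in the chosen protein.
        for i, owner in enumerate(window_protein):
            if owner == protein:
                known[i] = oracle(i)
                node_labels[window_node[i]][known[i]] += 1

    # Known windows keep their label; others take the mode of their node.
    predicted = dict(known)
    for i, node in enumerate(window_node):
        if i not in known and node in node_labels:
            predicted[i] = node_labels[node].most_common(1)[0][0]
    return selected, predicted


window_protein = ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"]
window_node = [0, 1, 0, 1, 0, 2, 1, 2]
true_labels = ["TM", "NTM", "TM", "NTM", "TM", "NTM", "NTM", "TM"]

selected, predicted = protein_based_selection(
    window_protein, window_node, lambda i: true_labels[i])
print(selected)
print(predicted)
```

Calling the function with different `seed` values corresponds to trying different permutations of the protein selection order.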
Protein-based Results
- Repeated for different permutations of protein selection order, observing several metrics
[Figure: results plot with axis label "Percent"]
Conclusions
- We developed a framework that allows us to select a few proteins or fragments of proteins which, when annotated with experimental methods, may be used to label the remaining protein sequences.
- We have shown that guided selection of data can achieve higher accuracy than random selection of data.
Acknowledgements
- Madhavi Ganapathiraju
- Jessica Wehner
- JW funded through the NIH-NSF Bioengineering & Bioinformatics Summer Institute
Visit us at the Department of Biomedical Informatics, University of Pittsburgh
Thank you!
[Photo: Cathedral of Learning, University of Pittsburgh]