Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM) Dr. Bernard Chen Assistant Professor Department of Computer.

Presentation transcript:

Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM)
Dr. Bernard Chen, Assistant Professor
Department of Computer Science, University of Central Arkansas
Fall 2009

Goal of the Dissertation
The main purpose is to obtain and extract protein sequence motif information that is universally conserved across protein family boundaries, and then to use this information for protein local 3D structure prediction.

Research Flow
Part 1: Bioinformatics Knowledge and Dataset Collection
Part 2: Discovering Protein Sequence Motifs
Part 3: Motif Information Extraction
Part 4: Protein Local Tertiary Structure Prediction

Data set

HSSP matrix: 1b25

Representation of Segment
Sliding window size: 9. Each window corresponds to a sequence segment, which is represented by a 9 × 20 matrix plus the nine corresponding secondary structure labels obtained from DSSP. More than 560,000 segments (413 MB) are generated by this method.
DSSP: used to obtain secondary structure information.
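The windowing described on this slide can be sketched as follows. This is a minimal illustration, not the author's code: the function name `sliding_segments`, the toy 12-residue profile, and the secondary structure string are all hypothetical, and the 20 values per residue stand in for one row of a frequency profile.

```python
WINDOW = 9  # sliding window size stated on the slide


def sliding_segments(profile, secondary, window=WINDOW):
    """Cut a per-residue profile into overlapping windows.

    profile:   list of rows, one per residue, 20 values per row
    secondary: string of per-residue secondary structure labels
    Returns a list of (start, window x 20 rows, window labels).
    """
    if len(profile) != len(secondary):
        raise ValueError("profile and secondary structure differ in length")
    segments = []
    for start in range(len(profile) - window + 1):
        rows = [list(r) for r in profile[start:start + window]]
        segments.append((start, rows, secondary[start:start + window]))
    return segments


# Toy protein of 12 residues (made-up values, not real HSSP data).
profile = [[0.05] * 20 for _ in range(12)]
secondary = "HHHHEEEECCCC"
segments = sliding_segments(profile, secondary)
print(len(segments))  # 12 - 9 + 1 = 4 windows
```

A protein of length n yields n - 8 segments with this window size, which is how a few thousand proteins can produce hundreds of thousands of segments.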

Granular Computing Model
1. Original dataset
2. Fuzzy C-Means Clustering splits the data into Information Granule 1 ... Information Granule M
3. New Improved or Greedy K-means Clustering on each granule
4. Join Information
5. Final Sequence Motifs Information
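To make the granule-splitting step concrete, here is a small sketch of the standard fuzzy c-means membership formula followed by a thresholded assignment of points to granules. It assumes the cluster centers are already known, uses Euclidean distance, and introduces a hypothetical membership threshold of 0.3; none of these choices is taken from the slides.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]


def assign_to_granules(points, centers, threshold=0.3, m=2.0):
    """Put each point index into every granule where membership >= threshold."""
    granules = [[] for _ in centers]
    for idx, p in enumerate(points):
        for g, u in enumerate(fcm_memberships(p, centers, m)):
            if u >= threshold:
                granules[g].append(idx)
    return granules


# Toy 2-D data (hypothetical); real segments would be 9 x 20 profiles.
centers = [(0.0, 0.0), (4.0, 0.0)]
points = [(0.5, 0.0), (2.0, 0.0), (3.5, 0.0)]
granules = assign_to_granules(points, centers)
print(granules)  # [[0, 1], [1, 2]]
```

The middle point is equidistant from both centers, so it receives membership 0.5 in each and lands in both granules, which is the overlap that fuzzy clustering allows and hard K-means does not.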

Reduce Time Complexity
Wei's method: ... sec (15 days) * 6 = ... sec (90 days)
Granular Model: ... sec (FCM execution time) + ... sec (2.7 days) * 6 = ... sec (18 days)

Comparison of Quality Measures
Columns: Different Methods | >60% | S.D. | >70% | S.D. | H-B Measure
Traditional | 25.82% | ...
Zhong (three rows) | ...
FCM-K-means | 37.14% | ...
FIK Model (five rows) | ...
FGK Model (five rows) | ...
Best Selection | 44.18% | 0 | 15.02% | ...

Super GSVM-FE Motivation
First, the information we try to generate is about sequence motifs, but the original input data are derived from whole protein sequences by a sliding window technique. Second, fuzzy c-means clustering can assign one segment to more than one information granule.

Super GSVM-FE
1. Original dataset
2. Fuzzy C-Means Clustering into Information Granule 1 ... Information Granule M
3. Greedy K-means Clustering in each granule, with five iterations of traditional K-means
4. Super GSVM-FE additional portion, for each cluster: Ranking SVM Feature Elimination, then Collect Survived Segments
5. Join Information
6. Final Sequence Motifs Information
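The "rank, eliminate, collect survivors" step for a single cluster might look like the sketch below. It does not train a Ranking SVM: the `score` argument is a hypothetical stand-in for whatever ranking the trained model produces, and the fraction to drop is left as a parameter (a later slide mentions removing the 20% lower-rank members).

```python
def eliminate_low_ranked(members, score, drop_fraction):
    """Keep the best-ranked members of one cluster.

    members:       list of segment identifiers in the cluster
    score:         callable giving a rank score (higher = better);
                   a stand-in for a trained ranking model
    drop_fraction: share of lowest-ranked members to remove
    """
    ranked = sorted(members, key=score, reverse=True)
    keep = len(ranked) - int(len(ranked) * drop_fraction)
    return ranked[:keep]


# Ten toy segments with made-up scores (hypothetical values).
scores = {f"seg{i}": s for i, s in enumerate(
    [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6, 0.5, 0.4, 0.95])}
survivors = eliminate_low_ranked(list(scores), scores.get, 0.2)
print(len(survivors))  # 8 of 10 members survive
print(sorted(set(scores) - set(survivors)))  # ['seg1', 'seg5']
```

Applying this to every cluster and pooling the survivors gives the "Collect Survived Segments" output that feeds the next clustering pass.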

Extracted Motif Information

3D information
3D information is generated from PDB (Protein Data Bank) files; the example shown is the 1a3c PDB file.
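As an illustration of how coordinates can be read from a PDB file, the sketch below pulls alpha-carbon positions out of fixed-column ATOM records. The column positions follow the PDB file format; the helper `format_atom` and the coordinates are hypothetical toy data built only for this example, not values from 1a3c.

```python
def format_atom(serial, name, res, chain, resseq, x, y, z):
    """Build one toy ATOM record with PDB fixed-column layout."""
    return (f"ATOM  {serial:5d} {name:<4} {res:>3} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def ca_coordinates(lines):
    """Return (residue name, residue number, x, y, z) for each CA atom."""
    coords = []
    for line in lines:
        # Columns 1-6: record name; columns 13-16: atom name.
        if line[0:6] == "ATOM  " and line[12:16].strip() == "CA":
            coords.append((
                line[17:20],         # residue name, columns 18-20
                int(line[22:26]),    # residue sequence number, columns 23-26
                float(line[30:38]),  # x, columns 31-38
                float(line[38:46]),  # y, columns 39-46
                float(line[46:54]),  # z, columns 47-54
            ))
    return coords


# Made-up records for a single alanine residue.
lines = [
    format_atom(1, " N  ", "ALA", "A", 1, 10.0, 2.0, -3.5),
    format_atom(2, " CA ", "ALA", "A", 1, 11.5, 2.25, -3.0),
    format_atom(3, " C  ", "ALA", "A", 1, 12.0, 3.5, -2.0),
]
print(ca_coordinates(lines))  # [('ALA', 1, 11.5, 2.25, -3.0)]
```

Restricting the filter to ATOM records avoids picking up calcium ions, which also carry the name CA but appear in HETATM records.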

Testing Data
The latest release of PISCES includes 4345 PDB files. Compared with the dataset in our experiment, 2419 PDB files are excluded. Therefore, we regard our 2710 protein files as the training dataset and the 2419 protein files as the independent testing dataset.

Testing Data
We convert the testing dataset using the approach introduced earlier; more than 490,000 segments are generated as the testing dataset.

Super GSVM
Training:
1. Training dataset
2. Fuzzy C-Means Clustering into Information Granule 1 ... Information Granule M
3. Greedy K-means Clustering in each granule, with five iterations of traditional K-means
4. For each cluster: train a Ranking SVM and then eliminate the 20% lower-rank members
5. Collect all extracted clusters and Ranking SVMs (all sequence clusters, all Ranking SVMs)
Testing:
1. Independent testing dataset
2. Find the closest cluster within a given distance threshold
3. Feed the segment to the SVM belonging to that cluster
4. If the rank belongs to the cluster, predict the local 3D structure; if not, find the next closest cluster
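The testing-side routing (closest cluster within a threshold, check with that cluster's model, otherwise fall through to the next closest) can be sketched as below. Several things here are assumptions for illustration only: Euclidean distance, the dictionary layout of a cluster, the `accepts` callables standing in for each cluster's Ranking SVM decision, and the structure labels.

```python
import math


def predict_local_structure(segment, clusters, max_distance):
    """Route a segment through clusters ordered by distance.

    Each cluster is a dict with:
      "center":    representative point of the cluster
      "accepts":   callable standing in for the cluster's ranking model
      "structure": the local structure the cluster represents
    Returns the structure of the first accepting cluster within
    max_distance, or None when no cluster qualifies.
    """
    by_distance = sorted(
        clusters, key=lambda c: math.dist(segment, c["center"]))
    for cluster in by_distance:
        if math.dist(segment, cluster["center"]) > max_distance:
            break  # remaining clusters are even farther away
        if cluster["accepts"](segment):
            return cluster["structure"]
    return None


# Toy clusters in 2-D (hypothetical centers, rules, and labels).
clusters = [
    {"center": (0.0, 0.0), "accepts": lambda s: s[1] > 0,
     "structure": "structure A"},
    {"center": (1.0, 0.0), "accepts": lambda s: True,
     "structure": "structure B"},
]
print(predict_local_structure((0.1, 0.3), clusters, 1.0))   # structure A
print(predict_local_structure((0.2, -0.1), clusters, 1.0))  # structure B
print(predict_local_structure((5.0, 5.0), clusters, 1.0))   # None
```

Returning `None` for segments with no qualifying cluster reflects a trade-off: a tighter distance threshold leaves more segments unpredicted, while a looser one predicts more segments from less similar clusters.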

Prediction Accuracy

Prediction Coverage

Future Works
Incorporate Chou-Fasman parameters into SVM training.

Future Works
For each cluster, we build a Decision Tree (DT) instead of an SVM model.
Training:
1. Training dataset
2. Fuzzy C-Means Clustering into Information Granule 1 ... Information Granule M
3. Greedy K-means Clustering in each granule, with five iterations of traditional K-means
4. For each cluster: build a Decision Tree
5. Collect all extracted clusters and Ranking-SVMs (all sequence clusters)
Testing (test by DT):
1. Independent testing dataset
2. Find the closest cluster within a given distance threshold
3. Feed the segment to the DT belonging to that cluster
4. If the rank belongs to the cluster, predict the local 3D structure; if not, find the next closest cluster