Final Report (30% final score) Bin Liu, PhD, Associate Professor.


Contents
There are two parts: project and report.
Project: remote homology detection.
Report:
Review the methods for remote homology detection and point out their advantages and disadvantages.
How did you do the experiments? Give the information for each step.
What are your results?
What are the advantages, disadvantages, and novelty of your method?

Protein Remote Homology Detection
Background. Problem definition: a classification problem.
[Figure: schematic plot of the hierarchy of the SCOP database; sequence similarity ranges from high to low.]

Overview

Dataset
pairwise/
54 families and 4352 proteins. For more information about the dataset, refer to: Li Liao and William Stafford Noble. "Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships." Journal of Computational Biology, 10(6), 2003.

Data set

Tab-delimited table
Codes: 0 = not present; 1 = positive train; 2 = negative train; 3 = positive test; 4 = negative test.

Feature extraction
Extract the features from the protein sequences, which can be found in the "Sequence file" in the supplementary files. Use your imagination to extract features that capture the characteristics of the protein sequences.

Dataset construction
The training sets and test sets can be constructed from the supplementary files "Tab-delimited table" and "Sequence file". There are 54 datasets in total.
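A minimal sketch of this construction step for a single family, using the label codes from the tab-delimited table. The exact layout of the supplementary files is not given here, so the two dictionaries below (protein id to code, protein id to sequence) are an assumed in-memory form, and `split_family` is a hypothetical helper name.

```python
# Label codes as listed for the tab-delimited table.
NOT_PRESENT, POS_TRAIN, NEG_TRAIN, POS_TEST, NEG_TEST = range(5)


def split_family(codes, sequences):
    """Build (train, test) lists of (sequence, label) pairs for one family.

    codes: dict mapping protein id -> integer code 0..4 (assumed layout).
    sequences: dict mapping protein id -> amino acid string (assumed layout).
    """
    train, test = [], []
    for protein_id, code in codes.items():
        if code == NOT_PRESENT:
            continue  # this protein is not used for this family
        label = 1 if code in (POS_TRAIN, POS_TEST) else 0
        target = train if code in (POS_TRAIN, NEG_TRAIN) else test
        target.append((sequences[protein_id], label))
    return train, test
```

Repeating the call once per family column would give the 54 training/test pairs.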

Classifiers
You are free to choose any classifier, such as Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Random Forests (RFs), etc.

Performance measure
ROC score (AUC). The average ROC score over all 54 families should be given.
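A small sketch of how the ROC score could be computed without extra libraries, assuming each family yields a list of true labels (1 = positive, 0 = negative) and a list of classifier scores. It uses the pairwise-ranking definition of AUC (the fraction of positive/negative pairs ranked correctly, ties counted as half); `roc_auc` and `mean_auc` are illustrative names.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


def mean_auc(per_family):
    """Average AUC over an iterable of (labels, scores) pairs, one per family."""
    aucs = [roc_auc(labels, scores) for labels, scores in per_family]
    return sum(aucs) / len(aucs)
```

The quadratic pair loop is kept for clarity; a rank-based version would be faster on large test sets.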

Scoring function for the project and report
Novelty and completeness (40%): new features, new machine learning models, etc. Write down what makes your method different from others in this field. Does your method work?
Mid results and source code (20%)
Results, based on average ROC score (10%)
Report (30%)

Important information
This is individual work, not team work, so do it alone, although you are free to discuss it with others.
Due date: 30th April, 2015 (1 month later). All the data should be stored in one ZIP or RAR file and sent to the TA via email or QQ. Title the email and your data with your name + student ID. (If your data is too large, contact the TA directly.) The slides of your presentation should be attached too.

Other topics you can choose
DNA-binding protein identification. The dataset is available at Prot_dis/data.jsp
Fold recognition
Enhancer prediction

Problem description
DNA-binding proteins are very important components of both eukaryotic and prokaryotic proteomes. At least about 2% of prokaryotic and 3% of eukaryotic proteins are able to bind to DNA, and these proteins are important for various cellular processes.

Problem description
Therefore, developing an efficient model for distinguishing DNA-binding proteins from non-DNA-binding proteins is an urgent research problem. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power.

Dataset description
There are two datasets in this project, a benchmark dataset and an independent dataset, which are available at the course website Prot_dis/data.jsp
For more information, see the following paper:

Task and evaluation
Task: Distinguish DNA-binding proteins from non-DNA-binding proteins.
Evaluation scheme:
1. Use validation techniques to optimize the parameters of your method (if any), and obtain the results on the benchmark dataset.
2. Train your classifier on the benchmark dataset, and predict the proteins in the independent dataset.
3. Analyze the features, and find some interesting patterns.

Task and evaluation

Task and evaluation
TP denotes the number of positive samples that are classified correctly; FP denotes the number of negative samples that are classified as positive; TN denotes the number of negative samples that are classified correctly; FN denotes the number of positive samples that are classified as negative.
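The slide's own formulas did not survive in this transcript, so the following is only a sketch of measures commonly computed from these four counts (sensitivity, specificity, accuracy, and the Matthews correlation coefficient). Whether the course uses exactly this set is an assumption, and `binary_metrics` is an illustrative name.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard measures derived from the confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sensitivity, "Sp": specificity, "Acc": accuracy, "MCC": mcc}
```

For example, `binary_metrics(40, 10, 40, 10)` describes a balanced test set on which 80% of each class is recognized.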

Students from other majors
If you are not in the CS department, please select one computational task in the field of bioinformatics. Write a review of the state-of-the-art predictors for this task and discuss their advantages and disadvantages. Discuss the relationship between bioinformatics and your major: can you apply ideas from bioinformatics to your own project? At least 4000 words.

Data Driven Machine Learning Approaches for Bioinformatics
Case study: protein remote homology detection

Outline
Overview
Feature extraction: sequence-based features, profile-based features, other features
Classifiers
Feature analysis

Data Driven Machine Learning Approaches for Bioinformatics
Key idea: learn from known data and generalize to unseen data.
Input: sequence features. Output: function category. Classifier: maps input to output.
[Diagram: protein function data is split into training data and test data. Training: build a classifier. Test: test the model. Prediction: apply the model to new data.]

Several important components in this model
Feature extraction: given a protein, how can features be extracted based only on the primary sequence? Brainstorming.

A case study: remote homology detection and protein-protein interaction
Features derived from the primary sequence only:
N-grams, Leslie et al (possible subsequences of amino acids of a fixed length N)
SVM-n-peptide, Ogul et al (reduced amino acid alphabets)
Mismatch kernel and Pattern (TEIRESIAS algorithm), Leslie CS et al and Dong et al 2005
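As an illustration of the N-gram idea (all possible subsequences of a fixed length N over the 20 amino acids), the sketch below turns a sequence into a normalized frequency vector. The normalization and the decision to skip windows containing non-standard residues are assumptions made for this example, not details taken from the cited papers.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_features(sequence, n=2):
    """Frequency of every possible length-n amino acid word (20**n values)."""
    index = {"".join(word): i
             for i, word in enumerate(product(AMINO_ACIDS, repeat=n))}
    counts = [0] * len(index)
    total = 0
    for start in range(len(sequence) - n + 1):
        position = index.get(sequence[start:start + n])
        if position is not None:  # skip windows with non-standard residues
            counts[position] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]
```

With n = 2 each protein becomes a 400-dimensional vector that can be passed to any classifier; the dimension grows as 20 to the power n, so large n quickly becomes sparse.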

Feature extraction
Distance-based approach. Lingner et al 2006
Word correlation matrices. Lingner et al 2008

SVM-pairwise Feature vector is a list of pairwise sequence similarity scores. Liao et al. 2002
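A rough sketch of the SVM-pairwise representation: each protein is described by its similarity scores against a fixed list of training sequences. The similarity used here is a simplified local alignment score with a flat match/mismatch/gap scheme, chosen only to keep the example self-contained; it is an assumption and not the scoring used in the original method, and the function names are illustrative.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman with linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for residue_a in a:
        current = [0]
        for j, residue_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if residue_a == residue_b else mismatch)
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


def pairwise_features(sequence, training_sequences):
    """Feature vector: similarity of `sequence` to every training sequence."""
    return [local_alignment_score(sequence, t) for t in training_sequences]
```

The vector length equals the number of training sequences, so test proteins must be scored against the same list in the same order.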

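Profile-based features
Profiles
[Figure: a profile table with one column per amino acid (A C D E F G H I K L M N P Q R S T V W Y) and one row per sequence position (I V E G Q D A E V G L S P W ...).]
Brainstorming: how can the profile be used as a feature?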

Binary profile Dong et al. 2007

N-profile Liu et al. 2008

Order profile Liu et al. 2009

Top-n-grams Liu et al. 2008

ACC
Dong et al
[Figure panels: AC, ACC]
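The figure for this slide is missing, so here is a hedged sketch of an auto covariance (AC) and cross covariance (CC) transformation in its commonly stated form: for each lag, column values of a profile are mean-centered and multiplied with values `lag` positions further along, either in the same column (AC) or in a different column (CC). The input format (a list of rows, one per sequence position) and the exact normalization are assumptions and may differ from the cited work.

```python
def acc_features(profile, max_lag):
    """Return (ac, cc) feature lists for a position-by-column profile matrix."""
    length, width = len(profile), len(profile[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    means = [sum(row[i] for row in profile) / length for i in range(width)]
    centered = [[row[i] - means[i] for i in range(width)] for row in profile]

    def covariance(col_a, col_b, lag):
        # Average product of col_a at position j and col_b at position j + lag.
        pairs = range(length - lag)
        return sum(centered[j][col_a] * centered[j + lag][col_b] for j in pairs) / (length - lag)

    lags = range(1, max_lag + 1)
    ac = [covariance(i, i, lag) for lag in lags for i in range(width)]
    cc = [covariance(i, k, lag) for lag in lags
          for i in range(width) for k in range(width) if i != k]
    return ac, cc
```

Concatenating the two lists gives a fixed-length vector regardless of sequence length: width * max_lag values for AC and width * (width - 1) * max_lag for CC.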

Other features (AAindex-based features) Physicochemical Distance Transformation (PDT) Liu et al. 2012

LSA (latent semantic analysis) Dong et al. 2006

Classifiers

SVM

Kernel combination methodology: VBKC. Damoulas et al. 2008

Summary
To establish a really useful statistical predictor for a biological system:
(i) Benchmark dataset;
(ii) Feature extraction;
(iii) Machine learning algorithm;
(iv) Web server or stand-alone tools