Published by Meagan Skinner. Modified over 9 years ago.
1
Final Report (30% of final score). Bin Liu, PhD, Associate Professor
2
Contents. There are two parts: project and report. Project: remote homology detection. Report: Review the methods for remote homology detection and point out their advantages and disadvantages. How did you do the experiments? Give information for each step. What are your results? What are the advantages, disadvantages, and novelty of your method?
3
Protein Remote Homology Detection. Background. Problem definition: a classification problem. Figure: schematic plot of the hierarchy of the SCOP database; sequence similarity goes from high to low.
4
Overview
5
Dataset: http://noble.gs.washington.edu/proj/svm-pairwise/ 54 families and 4352 proteins. For more information about the dataset, refer to: Li Liao and William Stafford Noble. "Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships." Journal of Computational Biology. 10(6):857-868, 2003.
6
Data set
7
Tab-delimited table 0 = not present; 1 = positive train; 2 = negative train; 3 = positive test; 4 = negative test.
8
Feature extraction. Extract the features from the protein sequences, which can be found in the “Sequence file” in the supplementary data. Use your imagination to extract features that capture the characteristics of the protein sequences.
9
Dataset construction. Based on the supplementary files “Tab-delimited table” and “Sequence file”, the training sets and test sets can be constructed. There are 54 datasets in total.
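The construction step can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the table layout (a header row, protein ID in the first column, one column per family) and the names `build_family_split`, `famA`, and `famB` are hypothetical and should be adapted to the real supplementary file. Only the label codes 0 to 4 come from the slides.

```python
import csv
import io

# Label codes as listed in the slides (0 = not present).
POS_TRAIN, NEG_TRAIN, POS_TEST, NEG_TEST = 1, 2, 3, 4


def build_family_split(table_text, family):
    """Return (train, test) lists of (protein_id, label) for one family.

    Assumes a header row whose first cell names the ID column and whose
    remaining cells name the families (a hypothetical layout).
    Label 1 marks a positive example, label 0 a negative example.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    col = header.index(family)
    train, test = [], []
    for row in reader:
        if not row:
            continue
        code = int(row[col])
        if code == POS_TRAIN:
            train.append((row[0], 1))
        elif code == NEG_TRAIN:
            train.append((row[0], 0))
        elif code == POS_TEST:
            test.append((row[0], 1))
        elif code == NEG_TEST:
            test.append((row[0], 0))
        # Code 0: the protein is not used for this family.
    return train, test


# Toy table with two made-up families.
demo = "id\tfamA\tfamB\np1\t1\t0\np2\t2\t4\np3\t3\t2\np4\t4\t1\n"
train_a, test_a = build_family_split(demo, "famA")
train_b, test_b = build_family_split(demo, "famB")
```

Calling the function once per family column would give the 54 training and test sets.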
10
Classifiers. You are free to choose any classifier, such as Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Random Forests (RFs), etc.
11
Performance measure ROC score (AUC) The average ROC scores of all the 54 families should be given.
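For reference, the ROC score can be computed directly from its rank interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below uses that definition; the function names `roc_auc` and `mean_auc` are illustrative, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def mean_auc(per_family):
    """Average the ROC score over families; each item is (labels, scores)."""
    aucs = [roc_auc(labels, scores) for labels, scores in per_family]
    return sum(aucs) / len(aucs)


# Two toy "families" with made-up classifier scores.
family_1 = ([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
family_2 = ([1, 1, 0, 0], [0.9, 0.3, 0.5, 0.1])
average = mean_auc([family_1, family_2])
```

For the project, `per_family` would hold the test labels and classifier scores of each of the 54 families.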
12
Scoring function for the project and report Novelty and completeness: new features, new machine learning models, etc. Write down what makes your method different from others in this field. Does your method work? (40%) Mid results and source code (20%) Results (based on average ROC score) (10%) Report (30%)
13
Important information. This is individual work, not team work, so do it alone, but you are free to discuss with others. Due date: 30th April, 2015 (1 month later). All the data should be stored in one ZIP or RAR file and sent to the TA via email or QQ. The title of the email and of your data: your name + student ID. (If your data is too large, contact the TA directly.) The slides of your presentation should be attached too.
14
Other topics you can choose. DNA-binding protein identification: the dataset is available at http://bioinformatics.hitsz.edu.cn/iDNA-Prot_dis/data.jsp Fold recognition. Enhancer prediction.
15
Problem description. DNA-binding proteins are very important components of both eukaryotic and prokaryotic proteomes. At least approximately 2% of prokaryotic and 3% of eukaryotic proteins are able to bind to DNA, and these proteins are important for various cellular processes.
16
Problem description. Therefore, developing an efficient model for distinguishing DNA-binding proteins from non-DNA-binding proteins is an urgent research problem. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power.
17
Dataset description. There are two datasets in this project, a benchmark dataset and an independent dataset, which are available at the course website http://bioinformatics.hitsz.edu.cn/iDNA-Prot_dis/data.jsp For more information, see the following paper:
19
Task and evaluation. Task: Distinguish DNA-binding proteins from non-DNA-binding proteins. Evaluation scheme: 1. Use validation techniques to optimize the parameters of your methods (if any), and obtain the results on the benchmark dataset. 2. Train your classifiers on the benchmark dataset, and predict the proteins in the independent dataset. 3. Analyze the features, and find some interesting patterns.
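Step 1 of the evaluation scheme does not name a specific validation technique. One common choice is k-fold cross-validation, sketched here as an assumption. The names `k_fold_indices`, `select_parameter`, and the `evaluate` callback are hypothetical; `evaluate` stands for whatever routine trains a model on the training indices and returns a score on the validation indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs; fold sizes differ by at most one."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


def select_parameter(candidates, n_samples, k, evaluate):
    """Return (best_param, best_mean_score) over k validation folds.

    `evaluate(param, train_idx, valid_idx)` is supplied by the caller and
    must return a number where larger is better.
    """
    best_param, best_score = None, float("-inf")
    for param in candidates:
        scores = [evaluate(param, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_param, best_score = param, mean
    return best_param, best_score


# Toy run: the stand-in scorer simply prefers the parameter value 2.
folds = list(k_fold_indices(10, 3))
best, score = select_parameter([1, 2, 3], 10, 5, lambda p, tr, va: -abs(p - 2))
```

The folds here are contiguous blocks. In practice the samples would usually be shuffled first, and the independent dataset must stay out of this loop so that step 2 remains an unbiased test.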
20
Task and evaluation
21
Task and evaluation. TP refers to the number of positive samples that are classified correctly; FP denotes the number of negative samples that are classified as positive samples; TN denotes the number of negative samples that are classified correctly; FN denotes the number of positive samples that are classified as negative samples.
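The transcript does not preserve the formulas that use these four counts. As an assumption, the sketch below computes four measures that are commonly reported for binary predictors of this kind: accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC). The function name `binary_metrics` is illustrative.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification measures from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Sensitivity: fraction of true positives among all positive samples.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives among all negative samples.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Made-up counts for illustration only.
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
perfect = binary_metrics(tp=5, fp=0, tn=5, fn=0)
```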
22
Students from other majors. If you are not in the CS department, please select one computational task in the field of bioinformatics. Write a review of the state-of-the-art predictors for this task. Discuss their advantages and disadvantages. Discuss the relationship between bioinformatics and your major. Can you apply ideas from bioinformatics to your own project? At least 4000 words.
23
Data Driven Machine Learning Approaches for Bioinformatics Case study--protein remote homology detection
24
Outline: Overview. Feature extraction: sequence-based features, profile-based features, other features. Classifiers. Feature analysis.
25
Data Driven Machine Learning Approaches for Bioinformatics. Key idea: learn from known data and generalize to unseen data. Input: sequence features. Output: function category. Classifier: maps input to output. Diagram: protein function data is split into training data and test data. Training: build a classifier. Test: test the model. Prediction: apply the model to new data.
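The split, train, test, predict workflow can be illustrated with a deliberately simple classifier. The nearest-centroid rule below is an assumption chosen for brevity, not a method from the slides, and the feature vectors and labels are toy values. Any classifier can be substituted for `train_centroids` and `predict`.

```python
import random


def split(data, test_fraction, seed):
    """Shuffle and split a list of (features, label) pairs into train and test."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_centroids(train):
    """Training: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in train:
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [a + b for a, b in zip(sums[y], x)]
        counts[y] += 1
    return {y: [v / counts[y] for v in sums[y]] for y in sums}


def predict(centroids, x):
    """Prediction: assign the class whose centroid is closest to x."""
    def squared_distance(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: squared_distance(centroids[y]))


# Toy data: two well-separated classes in two dimensions.
toy = [([0, 0], "a"), ([2, 0], "a"), ([10, 10], "b")]
model = train_centroids(toy)
label = predict(model, [9, 9])
train_part, test_part = split(list(range(10)), 0.3, seed=0)
```

In the remote homology task the training and test sets are already fixed by the label table, so `split` would only be needed when carving a validation set out of the training data.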
26
Several important components in this model. Feature extraction: given a protein, how can features be extracted based only on the primary sequence? Brainstorming?
27
A case study: remote homology detection and protein-protein interaction. Features derived from the primary sequence only. N-grams: Leslie et al. 2002 (possible subsequences of amino acids of a fixed length N). SVM-npeptide: Ogul et al. 2007 (reduced amino acid alphabets). Mismatch kernel and Pattern (TEIRESIAS algorithm): Leslie CS et al. 2004 and Dong et al. 2005.
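The N-gram idea can be made concrete by counting how often each possible subsequence of length N occurs in a protein. The sketch below produces a fixed-length count vector. The function name `ngram_features` and the decision to skip windows containing non-standard residues are assumptions for illustration, not details taken from the cited papers.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_features(sequence, n, alphabet=AMINO_ACIDS):
    """Count every length-n subsequence; returns a vector of len(alphabet)**n."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=n))}
    vector = [0] * len(index)
    for i in range(len(sequence) - n + 1):
        gram = sequence[i:i + n]
        j = index.get(gram)
        if j is not None:  # windows with letters outside the alphabet are skipped
            vector[j] += 1
    return vector


features = ngram_features("ACAC", 2)
```

With the 20-letter alphabet the vector has 20^N entries, so N is usually kept small. Passing a shorter `alphabet` of group symbols, after mapping each residue to its group, gives a reduced-alphabet variant.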
28
Feature extraction. Distance-based approach: Lingner et al. 2006. Word correlation matrices: Lingner et al. 2008.
29
SVM-pairwise Feature vector is a list of pairwise sequence similarity scores. Liao et al. 2002
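A feature vector of pairwise similarity scores can be sketched as follows: each protein is scored against a fixed list of reference sequences, and the scores become its features. The Smith-Waterman scoring used here (match 2, mismatch -1, linear gap -2) is a toy assumption. A real experiment would take scores from an established alignment tool with a substitution matrix.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def pairwise_features(sequence, references):
    """One similarity score per reference sequence, in reference order."""
    return [smith_waterman(sequence, ref) for ref in references]


# Toy reference set; in practice this would be the training sequences.
vector = pairwise_features("ACD", ["ACD", "CD", "WWW"])
```

The vector length equals the number of reference sequences, so every protein is represented in the same space regardless of its own length.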
30
Profile-based features. Profiles. [Table: a frequency profile. Columns: the 20 amino acids A C D E F G H I K L M N P Q R S T V W Y. Rows: sequence positions 1 to 14 with residues I, V, E, G, Q, D, A, E, V, G, L, S, P, W, continuing beyond position 14. Each cell is the frequency of that amino acid at that position.] Brainstorming: how can the profile feature be used?
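To make the profile concrete, the sketch below builds a frequency profile from a small set of aligned sequences by counting amino acids column by column. This is a simplified assumption about how such a table arises. Profiles in practice are usually produced by a database search tool, and the function name `frequency_profile` and the handling of gaps are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(aligned, alphabet=AMINO_ACIDS, pseudocount=0.0):
    """Return one row per alignment column; each row holds len(alphabet) frequencies."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for pos in range(length):
        counts = {a: pseudocount for a in alphabet}
        for seq in aligned:
            if seq[pos] in counts:  # gaps and unknown letters are ignored
                counts[seq[pos]] += 1
        total = sum(counts.values())
        profile.append([counts[a] / total if total else 0.0 for a in alphabet])
    return profile


# Toy alignment of four sequences with two columns.
profile = frequency_profile(["AC", "AD", "AC", "-C"])
```

One simple use is to treat each row as a 20-dimensional description of a position in place of the single residue letter, and then derive sequence-level features from those rows.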
31
Binary profile Dong et al. 2007
32
N-profile Liu et al. 2008
33
Order profile Liu et al. 2009
34
Top-n-grams Liu et al. 2008
35
ACC Dong et al. 2009 AC ACC
36
Other features (AAindex-based features) Physicochemical Distance Transformation (PDT) Liu et al. 2012
37
LSA (latent semantic analysis) Dong et al. 2006
38
Classifiers
39
SVM
40
kernel combination methodology VBKC Damoulas et al. 2008
41
Summary. To establish a really useful statistical predictor for a biological system: (i) benchmark dataset; (ii) feature extraction; (iii) machine learning algorithm; (iv) web server or stand-alone tools.