杜嘉晨 (Du Jiachen), 2015-04-01. PlantMiRNAPred: efficient classification of real and pseudo plant pre-miRNAs.

Presentation transcript:

杜嘉晨 PlantMiRNAPred: efficient classification of real and pseudo plant pre-miRNAs

Outline
- Introduction
- Method
- Results and Discussion
- Conclusion

Introduction
microRNAs (miRNAs) are non-coding RNAs that play important roles in gene regulation through mRNA cleavage or translational repression. Our main task is to develop an efficient classifier that distinguishes real plant pre-miRNAs from pseudo ones.

Method
- Features of plant pre-miRNAs
- Feature selection
- Training sample selection
- Classification based on SVM

Method

Features of plant pre-miRNAs
- Dinucleotide frequency %XY, X, Y ∈ {A, C, G, U} (16 features) and %G+C (1 feature): 17 features in total
- Thermodynamic and stability features, e.g. minimum free energy (MFE): 31 features in total
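As an illustration of the primary-sequence part of this feature set, the sketch below computes the 16 dinucleotide frequencies and %G+C for one sequence in Python. It is not the PlantMiRNAPred implementation; the normalisation by the number of adjacent pairs is an assumption.

```python
from itertools import product

def primary_sequence_features(seq):
    """Compute the 16 dinucleotide frequencies (%XY) and %G+C for one pre-miRNA.

    Minimal sketch of the primary-sequence features listed above; details such
    as the normalisation may differ from the original PlantMiRNAPred code.
    """
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    # %XY: frequency of each of the 16 dinucleotides over all adjacent pairs.
    pairs = [seq[i:i + 2] for i in range(n - 1)]
    features = {}
    for x, y in product("ACGU", repeat=2):
        features[f"%{x}{y}"] = pairs.count(x + y) / max(len(pairs), 1)
    # %G+C: fraction of G and C nucleotides in the whole sequence.
    features["%G+C"] = (seq.count("G") + seq.count("C")) / n
    return features

print(primary_sequence_features("GGCUAGCUAGCUAGGCUUAGC"))
```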

Features of plant pre-miRNAs
- Structural characteristics: structured triplet composition (32) and structured triplet composition from the stem (32), 64 features in total
After feature extraction there are 112 (17 + 31 + 64) features in total, all used for classification.

Structural characteristics
Common structural-characteristic feature extraction:
- Use '(' to indicate a paired nucleotide and '.' to indicate an unpaired nucleotide.
- For 3 adjacent nucleotides, considering the middle nucleotide of the three, there are 32 (4 × 8) possible structure-sequence combinations.
- Count the frequency of each combination.
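A minimal sketch of this counting step, assuming the secondary structure is given in dot-bracket notation with '(' and ')' both treated as paired and '.' as unpaired; the handling of the first and last positions is an assumption, not taken from the paper.

```python
from itertools import product

def triplet_features(seq, structure):
    """Structure-sequence triplet composition: for every window of 3 adjacent
    nucleotides, combine the middle nucleotide with the paired/unpaired
    pattern of the 3 positions, giving 4 * 2**3 = 32 possible combinations.

    Sketch only: '(' and ')' in the dot-bracket string are both mapped to
    "paired" and '.' to "unpaired", as described on the slide.
    """
    seq = seq.upper().replace("T", "U")
    pairing = ["(" if c in "()" else "." for c in structure]
    keys = [m + "".join(p) for m in "ACGU" for p in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    windows = 0
    for i in range(1, len(seq) - 1):
        counts[seq[i] + "".join(pairing[i - 1:i + 2])] += 1
        windows += 1
    # Normalise the counts to frequencies.
    return {k: v / max(windows, 1) for k, v in counts.items()}

feats = triplet_features("GCAUCGAUGCGAUCG", "((((....))..)).")
print(sum(feats.values()))  # frequencies sum to 1.0
```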

Structural characteristics
(a) Use L's sequence to replace big loops
(b) Use L's sequence to replace big bulges
(c) Cut off unmatched sequences

Feature subset selection
- Information gain (IG): a measure of a feature's discrimination ability.
- Feature similarity (Sim): the similarity between two features, ranging from 0 to 1, where 0 indicates the two features are unrelated and 1 indicates they are identical.
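The slide does not spell out the exact formulas, so the sketch below uses common stand-ins: information gain computed on binned feature values, and the absolute Pearson correlation as a feature-similarity measure in [0, 1]. Both choices are assumptions for illustration, not necessarily those used in PlantMiRNAPred.

```python
import numpy as np

def information_gain(values, labels, n_bins=10):
    """Information gain of one feature with respect to the class labels.
    Continuous feature values are discretised into bins (an assumption;
    the paper's discretisation may differ).
    """
    labels = np.asarray(labels)
    bins = np.digitize(values, np.histogram_bin_edges(values, bins=n_bins))

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    cond = sum((bins == b).mean() * entropy(labels[bins == b])
               for b in np.unique(bins))
    return entropy(labels) - cond

def feature_similarity(x, y):
    """Similarity in [0, 1]: 0 = unrelated, 1 = identical.
    Stand-in definition (absolute Pearson correlation)."""
    return abs(np.corrcoef(x, y)[0, 1])

# Toy usage: a feature that separates the two classes perfectly has IG = 1.
print(information_gain([0.1, 0.2, 0.9, 0.8], [0, 0, 1, 1]))
```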

Feature subset selection: a novel graph-based method
Treat every feature as a node in a graph. If two features have a Sim value greater than a threshold ε (ε = 0.49), add an edge between the corresponding nodes; the weight of this edge is that Sim value.
[Slide figure: a toy example with features a, b, c, d showing the IG of each feature (a = 0.5, b = 0.8, c = 0.4, d = 0.5), the pairwise Sim values, and the resulting feature graph.]

Feature subset selection
Every node's feature selection weight (FWS) is calculated [formula on the slide]. Then:
- Select the node with the highest FWS and remove all nodes adjacent to it.
- Repeat this procedure until only isolated nodes remain in the graph.
The features corresponding to the remaining nodes form our selected feature subset.
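Below is a sketch of the graph construction and greedy selection just described. The FWS formula itself is not reproduced in the transcript, so the function takes precomputed weights as input; the toy example reuses the IG numbers from the slide illustration (a = 0.5, b = 0.8, c = 0.4, d = 0.5) and made-up Sim values purely as placeholders.

```python
def select_features(fws, sim, eps=0.49):
    """Greedy graph-based feature subset selection.

    fws: dict feature -> feature selection weight (FWS); the formula is on
         the slide and not reproduced here, so weights are passed in.
    sim: dict (feature, feature) -> similarity in [0, 1].
    Two nodes are connected if their Sim exceeds eps (0.49 on the slide).
    """
    features = set(fws)
    # Build the adjacency sets from the similarity threshold.
    adj = {f: set() for f in features}
    for f in features:
        for g in features:
            if f != g and sim.get((f, g), sim.get((g, f), 0.0)) > eps:
                adj[f].add(g)
                adj[g].add(f)
    selected, remaining = [], set(features)
    while remaining:
        # Keep the highest-FWS node, drop its neighbours, repeat.
        best = max(remaining, key=lambda f: fws[f])
        selected.append(best)
        remaining -= adj[best] | {best}
    return selected

# Toy example; Sim values are hypothetical placeholders.
fws = {"a": 0.5, "b": 0.8, "c": 0.4, "d": 0.5}
sim = {("b", "c"): 0.8, ("a", "d"): 0.2}
print(select_features(fws, sim))  # 'c' is dropped because it is similar to 'b'
```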

We select features with high IG within each category:
(a) primary sequence features
(b) energy and thermodynamic features
(c) secondary structure features
Finally, 68 features remain after feature selection.

Training sample selection
- Representation of samples
- Distance between two samples
- Degree of coverage: the degree of coverage of a sample s in a certain area of the feature space is defined as the number of samples whose nearest neighbour is s.

Training sample selection
Assume the i-th family contains N_i samples and let v_k be the feature vector of the k-th sample; the central point c_i of the family is calculated [formula on the slide]. Suppose the selection rate of the sample space is 1/n; that is, N_i/n samples are selected from the i-th family, and the number of selected samples is denoted P_i = N_i/n. The distance between the k-th sample (real pre-miRNA) v_k and the central point c_i is denoted d(v_k, c_i), where v_k^T denotes the transpose of v_k. The radius of the i-th family is r_i = max(d(v_k, c_i)) over 1 ≤ k ≤ N_i.

Training sample selection
Taking c_i as the centre, draw two circles with radii 0·r_i and (1/P_i)·r_i; the region between them is denoted A_0. For each sample s in A_0, the degree of coverage C(s) is calculated: C(s) is the number of samples in A_0 whose nearest-neighbour sample is s. The sample s with the greatest C(s) is selected as a training sample. Using (1/P_i)·r_i as the step length, we then compute the degree of coverage in each region A_k between the circles of radii (1/P_i)·k·r_i and (1/P_i)·(k+1)·r_i (1 ≤ k ≤ P_i − 1), and from each A_k the sample with the largest degree of coverage is selected.
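Putting these two slides together, here is a sketch of the per-family selection procedure. It assumes the family centre c_i is the centroid of the feature vectors (the actual centre formula is on the slide) and that distances are Euclidean; both are assumptions for illustration.

```python
import numpy as np

def select_family_samples(X, n=2):
    """Coverage-based training-sample selection for one pre-miRNA family.

    X: (N_i, d) feature matrix of the family; 1/n is the selection rate,
    so roughly P_i = N_i // n sample indices are returned.
    Assumptions: centroid as family centre, Euclidean distances.
    """
    X = np.asarray(X, dtype=float)
    N = len(X)
    P = max(N // n, 1)
    c = X.mean(axis=0)                     # assumed centre c_i of the family
    d = np.linalg.norm(X - c, axis=1)      # distance of each sample to c_i
    r = d.max()                            # family radius r_i
    step = r / P                           # annulus width (1/P_i) * r_i
    selected = []
    for k in range(P):
        lo, hi = k * step, (k + 1) * step
        if k == P - 1:
            mask = (d >= lo) & (d <= hi)   # include the boundary in the last annulus
        else:
            mask = (d >= lo) & (d < hi)
        idx = np.where(mask)[0]
        if len(idx) == 0:
            continue
        if len(idx) == 1:
            selected.append(int(idx[0]))
            continue
        # Coverage C(s): number of annulus samples whose nearest neighbour is s.
        A = X[idx]
        dist = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)
        coverage = np.bincount(dist.argmin(axis=1), minlength=len(idx))
        selected.append(int(idx[coverage.argmax()]))
    return selected

# Toy usage with random vectors standing in for pre-miRNA feature vectors.
rng = np.random.default_rng(0)
print(select_family_samples(rng.normal(size=(10, 5)), n=2))
```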

Results and Discussion
Positive data: 128 families of microRNA (1612) and 431 other microRNAs; total 1906.
Negative data: 17 groups of pseudo microRNA (2122); total 2122.
After sample selection: positive data total 980, negative data total 980.

Results and Discussion
Feature sets compared:
- 68 features: the feature subset selected by our algorithm
- 80 features: no structural features from the stem
- 51 features: no structural features
- 115 features: the whole feature set
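The sketch below shows how such a feature-subset comparison could be run with an SVM classifier. scikit-learn, the RBF kernel, 5-fold cross-validation and the random placeholder data are all illustrative assumptions, not the settings reported for PlantMiRNAPred.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def compare_feature_subsets(X, y, subsets):
    """Cross-validated SVM accuracy for several feature subsets.

    X: (n_samples, n_features) matrix of pre-miRNA features.
    y: labels (1 = real pre-miRNA, 0 = pseudo hairpin).
    subsets: dict mapping a subset name to the column indices to keep.
    """
    results = {}
    for name, cols in subsets.items():
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
        results[name] = cross_val_score(clf, X[:, cols], y, cv=5).mean()
    return results

# Hypothetical usage with random placeholder data (980 positive + 980 negative).
rng = np.random.default_rng(0)
X = rng.normal(size=(1960, 115))
y = np.array([1] * 980 + [0] * 980)
subsets = {"selected-68": list(range(68)), "all-115": list(range(115))}
print(compare_feature_subsets(X, y, subsets))
```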

Results and Discussion