Improved and Promising Identification of Human MicroRNAs by Incorporating a High-Quality Negative Set.

Arrangement of the Report
1 Introduction
2 Methods
3 Results and Discussion
4 Conclusion

Introduction

Brief Introduction to microRNA
MicroRNA (miRNA) is a class of single-stranded, non-coding endogenous RNAs about 22 nucleotides (nt) long. miRNAs play key roles in regulating biological processes, including affecting the stability and translation of mRNAs and negatively regulating gene expression at the post-transcriptional level.

Introduction
How microRNAs are formed
1) miRNA genes are first transcribed by RNA polymerase II, producing primary transcripts that are usually termed pri-miRNAs.
2) The pri-miRNAs are processed by the enzyme Drosha into miRNA precursors (pre-miRNAs) with a distinctive hairpin structure.
3) The pre-miRNAs are exported into the cytoplasm by Exportin-5 and cleaved by the enzyme Dicer to yield miRNA:miRNA* duplexes. One strand of the duplex, denoted with *, is normally degraded.

Introduction
Traditional methods to identify miRNAs
1) Computational methods are preferred over experimental methods because they save time and money.
2) The main task is to discriminate real pre-miRNAs from pseudo ones.
3) Triplet-SVM, proposed by Xue et al., is a popular tool that trains a support vector machine (SVM) classifier on 32 triplet sequence-structure features of human sequences. It identified pre-miRNAs with about 90 percent accuracy on both human data and data from other species.
4) Widely used classification algorithms currently include SVM, hidden Markov model (HMM), random forest (RF), linear genetic programming (LGP), and naive Bayes.

Introduction
Why improve the quality of the negative set?
1) It is acknowledged that negative samples are of high quality, or representative, when they are sufficiently similar to the positive samples.
2) Negative samples were usually collected by a parameter filtering method, which selects sequences that share the widely accepted characteristics of real pre-miRNAs.

Introduction
The parameter filtering method versus the proposed new technique
Parameter filtering method:
1) Pre-defined parameter types might not be available, because more types of characteristics are being discovered as the number of known miRNAs continues to grow.
2) Pre-defined parameter values confine the samples to a certain range, which reduces the specificity of the collected negative samples.
3) Pre-defined parameter assumptions are likely to miss other information about real pre-miRNAs.
Proposed technique:
1) It largely reduces the dependence on filtering parameters.
2) It is highly adaptable and can be adjusted as new miRNAs are discovered, which guarantees that the collected pseudo pre-miRNAs are sufficiently similar to real pre-miRNAs.

Methods

Proposed feature set
Classifier selection and optimization
Data Sets
Negative Sample Selection
A miRNA Mining Tool: mirnaDetect
Measurement

Feature set
The 98-feature set we constructed
1) Primary sequence-based features (4*4*4 = 64). For a given RNA sequence S, the triple-nucleotide frequencies %XYZ are computed, where XYZ represents three contiguous nucleotides in S, and X, Y, and Z belong to the set {A, U, C, G}.
2) Secondary structure-based features (2). These are the minimum free energy (MFE) and the base-pair content of the secondary structure, which differs significantly between real and pseudo pre-miRNAs.
3) Sequence-structure-based features (2*2*2*4 = 32). Features containing both local sequence and structure information were also considered, so the 32 sequence-structure-based features are used.
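The 64 sequence features and the 32 sequence-structure features can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, the structure is assumed to be given in dot-bracket notation (for example from RNAfold), and the counts are normalized over the whole sequence, which is a simplification.

```python
from itertools import product

NUCLEOTIDES = "AUCG"


def triplet_frequencies(seq):
    """Return the 64 contiguous trinucleotide frequencies (%XYZ) of an RNA sequence."""
    seq = seq.upper().replace("T", "U")
    keys = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skip windows containing ambiguous bases such as N
            counts[tri] += 1
            total += 1
    return {k: (counts[k] / total if total else 0.0) for k in keys}


def structure_triplet_frequencies(seq, dot_bracket):
    """Return 32 local features: the middle nucleotide (4 options) combined with
    the paired/unpaired state of three adjacent positions (2*2*2 options)."""
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    # Both "(" and ")" mean "paired", so they are collapsed into "(".
    states = ["." if c == "." else "(" for c in dot_bracket]
    keys = [n + "".join(s) for n in NUCLEOTIDES for s in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(1, len(seq) - 1):
        key = seq[i] + "".join(states[i - 1:i + 2])
        if key in counts:
            counts[key] += 1
            total += 1
    return {k: (counts[k] / total if total else 0.0) for k in keys}
```

Concatenating the two dictionaries with the two structure-based values gives a 64 + 32 + 2 = 98 dimensional vector per sequence.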

Classifier selection and optimization
1) SVM is used as the classification algorithm in this research.
2) The kernel function is the radial basis function (RBF).
3) A grid search with LibSVM on the training set was used to obtain the optimal parameters.
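As a rough illustration of what the grid search does, here is a minimal Python sketch. The slides do not give the searched ranges, so the exponent ranges below are an assumption (they follow what are, to my knowledge, the defaults of LibSVM's grid.py script), and `score` is a hypothetical callback that would return the cross-validation accuracy for one (C, gamma) pair.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def parameter_grid(log2c=range(-5, 16, 2), log2g=range(3, -16, -2)):
    """Enumerate (C, gamma) pairs on a log2 grid (ranges are an assumption)."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(log2c, log2g)]


def grid_search(score, grid):
    """Return (best_score, C, gamma), where score(C, gamma) is supplied by the
    caller, for example the mean accuracy of a 10-fold cross validation."""
    best = None
    for c, gamma in grid:
        s = score(c, gamma)
        if best is None or s > best[0]:
            best = (s, c, gamma)
    return best
```

In practice the callback would train an RBF-kernel SVM with the given C and gamma on the training folds and report accuracy on the held-out fold.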

Negative sample selection
How to select the negative samples?
Step 1. Search the CDSs (coding region sequences) for sub-sequences homologous to known mature miRNAs, using BLAST with its default settings. Collect the homologous sub-sequences into a "homology set" called S-homology.
Step 2. Flank every element in S-homology by 100 nt upstream and downstream, and compute the secondary structures of the flanked elements with RNAfold. Collect the extracted sub-sequences into a "pre-miRNA-like candidate" set called S-candidate.
Step 3. Filter S-candidate. At the first level, two widely accepted criteria of real pre-miRNAs (MFEI > 0.8 and 0.3 < GC% < 0.7) are used as the filter.
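Steps 2 and 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the MFE value is assumed to come from an external folding tool such as RNAfold, the MFEI formula is the one commonly used in the pre-miRNA literature (adjusted MFE per 100 nt divided by the G+C percentage), and all function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence (0.0 to 1.0)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def mfei(mfe, seq):
    """Assumed definition: MFEI = (|MFE| / length * 100) / (G+C)%."""
    gc_percent = gc_content(seq) * 100
    if gc_percent == 0:
        return 0.0
    amfe = abs(mfe) / len(seq) * 100  # adjusted MFE per 100 nt
    return amfe / gc_percent


def flank(genome, start, end, width=100):
    """Extend the hit genome[start:end] by `width` nt on both sides (Step 2),
    clamping at the sequence boundaries."""
    return genome[max(0, start - width):min(len(genome), end + width)]


def passes_first_level(seq, mfe, min_mfei=0.8, gc_range=(0.3, 0.7)):
    """First-level filter of Step 3: MFEI > 0.8 and 0.3 < GC% < 0.7."""
    gc = gc_content(seq)
    return mfei(mfe, seq) > min_mfei and gc_range[0] < gc < gc_range[1]
```

For example, an 80 nt candidate with 50% GC content and an MFE of -40 kcal/mol has MFEI = 1.0 and passes this first level, while the same candidate with an MFE of -20 kcal/mol (MFEI = 0.5) is rejected.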

Data set and Measurement
Data set
Positive: Redundant sequences were removed from the positive set, leaving 16,520 non-redundant pre-miRNAs, including 1,496 human and 13,588 non-human pre-miRNAs in the final positive set.
Negative: A total of 14,661 pseudo pre-miRNAs, including 1,446 pseudo human pre-miRNAs.
Training set: 1,155 real and 1,155 pseudo human pre-miRNAs.
Test set: Several kinds of test set are used; they are introduced on the following slides.

Data set and Measurement
Measurement
Sensitivity (SE, same as recall), specificity (SP), geometric mean (Gm), and accuracy (Acc).
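For reference, the four measures can be computed from a confusion matrix as in the sketch below. It uses the standard definitions; the function name and the example counts in the usage note are made up for illustration and are not results from the study.

```python
import math


def evaluate(tp, fn, tn, fp):
    """Compute SE, SP, Gm, and Acc from confusion-matrix counts.

    tp/fn: real pre-miRNAs predicted as real/pseudo
    tn/fp: pseudo pre-miRNAs predicted as pseudo/real
    """
    se = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    gm = math.sqrt(se * sp)                # geometric mean of SE and SP
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    return {"SE": se, "SP": sp, "Gm": gm, "Acc": acc}
```

With hypothetical counts tp=90, fn=10, tn=80, fp=20, this gives SE = 0.9, SP = 0.8, and Acc = 0.85, and Gm is the square root of 0.72.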

Results / Discussion

Importance of Negative Samples
Importance of Representative Negative Samples
These graphs reveal the importance of negative samples for machine learning algorithms and indicate that the higher the similarity between the positive and negative training sets, the better the classifier performs.

Importance of Negative Samples
Representativeness of Our Negative Set
1) Modeling the Triplet-SVM classifier with the new negative set.
2) Modeling the new miRNApre classifier with Xue's negative set.

Importance of Negative Samples
Representativeness of Our Negative Set
A state-of-the-art classifier (Mirident) was also retrained with the new negative set, which generates a new training model. The performance on the virus sets EBV, HCMV, MGHV68 and KSHV is not good. It is expected that as more viral pre-miRNAs are discovered, the new model will perform better than the original one.

Performance of MiRNApre
Performance of the LibSVM Classifier
All four classifiers based on the proposed feature set were evaluated with a 10-fold cross validation on the training set. This experimental result also confirmed the high efficiency of the SVM algorithm.

Performance of MiRNApre
Performance of LibSVM on the Proposed Feature Set
Ten-fold cross validation was used to evaluate the performance of LibSVM on the same training set but with different feature sets:
2 structure-based features
32 sequence-structure-based features
64 primary sequence-based features
The results imply that the features in C(%XYZ) may contain important attributes for the identification of human pre-miRNAs. The next slide supports this.

Performance of MiRNApre
Performance of LibSVM on the Proposed Feature Set
The importance of each feature in the proposed combined feature set was also investigated, which should help researchers select the features that are "important" in their specific situations. The top 10 "important" features are listed in the table, which indicates their major influence on the identification of real and pseudo human pre-miRNAs.
Feature types among the top 10:
primary-sequence-related features: 9
structure-based features: 1
sequence-structure-based features: 0

Performance of MiRNApre
Analyzing the Performance of miRNApre
We used a 10-fold cross validation test to evaluate the performance of miRNApre on the training set, which contains 1,155 real pre-miRNAs and 1,155 pseudo pre-miRNAs. In real applications a high SP is meaningful, because there are far more pseudo pre-miRNAs than real pre-miRNAs in genome data.
real pre-miRNAs | pseudo pre-miRNAs | SP | SE | Acc | Gm
1,155 | 1,155 | …% | 97.9% | 98.1% | 98%
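The 10-fold cross validation protocol can be sketched in plain Python as below. Whether the original evaluation stratified the folds by class is not stated on the slides, so stratification is an assumption here, and `fit_predict` is a hypothetical callback standing in for training and applying the classifier.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, keeping the class ratio in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin into the folds
    return folds


def cross_validate(labels, fit_predict, k=10, seed=0):
    """Return the overall accuracy over k folds.

    fit_predict(train_idx, test_idx) must train on train_idx and return one
    predicted label per index in test_idx.
    """
    folds = stratified_folds(labels, k, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        preds = fit_predict(train_idx, test_idx)
        correct += sum(p == labels[i] for p, i in zip(preds, test_idx))
    return correct / len(labels)
```

With 1,155 real and 1,155 pseudo pre-miRNAs, each fold holds out about 231 sequences and trains on the remaining ones.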

Performance of MiRNApre
Analyzing the Performance of miRNApre
The test set contains 69 newly found human pre-miRNAs. miRNApre performed better, and its performance on the virus set is also comparable.

Performance of MiRNApre
Performance of mirnaDetect
A method should generate as few false positives as possible, to save the time and money spent on validating them experimentally. From this viewpoint, MIReNA and CSHMM generated 10,626 and 18,258 pre-miRNA candidates, respectively, while mirnaDetect found only 2,645 candidates.
Methods | SE | SP
mirnaDetect | proper | high

Conclusions

1) In this study, we explored the importance of representative negative samples for machine learning based methods for pre-miRNA identification. We found that existing negative sets suffer from low quality, and that it is difficult to generate an effective and promising prediction model based on them.
2) To improve the quality of negative samples, we proposed a multi-level negative sample selection method and successfully constructed a high-quality negative set.
3) The high accuracy of our miRNApre method on different data sets suggests that our method is a promising tool for miRNA identification.

About the Authors
Leyi Wei received the BSc degree in computing mathematics and the MSc degree in computer science from Xiamen University, China.
Minghong Liao received the MSc and PhD degrees in computer science and engineering from Harbin Institute of Technology, China, in 1988 and 1993.
Yue Gao received the BS degree from the Harbin Institute of Technology, China, in 2005, and the ME and PhD degrees from Tsinghua University, Beijing, China.

Any Questions? We can discuss! Q&A

Thanks! 魏琪康