Improved and Promising Identification of Human MicroRNAs by Incorporating a High-Quality Negative Set.

Arrangement of the Report
1 Introduction
2 Methods
3 Results and Discussion
4 Conclusion

Introduction

Brief Introduction to microRNA
MicroRNAs (miRNAs) are a class of single-stranded, non-coding endogenous RNAs about 22 nucleotides (nt) long. miRNAs play key roles in regulating biological processes: they affect the stability and translation of mRNAs and negatively regulate gene expression at the post-transcriptional level.

How are microRNAs formed?
1) miRNA genes are first transcribed by RNA polymerase II, producing primary transcripts, usually termed pri-miRNAs.
2) The pri-miRNAs are processed by the enzyme Drosha into miRNA precursors (pre-miRNAs) with a distinctive hairpin structure.
3) The pre-miRNAs are exported into the cytoplasm by Exportin-5 and cleaved by the enzyme Dicer to yield miRNA:miRNA* duplexes. One strand of the duplex, denoted with *, is normally degraded.

Traditional methods to identify miRNAs
1) Computational methods are preferred over experimental methods because they cost less time and money.
2) The main task is to discriminate real pre-miRNAs from pseudo ones.
3) Triplet-SVM, proposed by Xue et al., is a popular tool that trains a support vector machine (SVM) classifier on 32 triplet sequence-structure features from human sequences. It identified pre-miRNAs with about 90 percent accuracy on both human data and data from other species.
4) Currently, the widely used classification algorithms include SVM, hidden Markov model (HMM), random forest (RF), linear genetic programming (LGP), and naive Bayes.

Why do we want to improve the quality of the negative set?
1) It is acknowledged that negative samples are of high quality, or representative, when they are sufficiently similar to the positive samples.
2) Negative samples were usually collected by a parameter filtering method, which selects those sequences that share the widely accepted characteristics of real pre-miRNAs.

The parameter filtering method versus the proposed new technique
Parameter filtering method:
1) Pre-defined parameter types might not be available, because more types of characteristics are being discovered as the number of known miRNAs continues to grow.
2) Confining the parameter values to within a certain scope reduces the specificity of the collected negative samples.
3) Pre-defined parameter assumptions likely miss other information about real pre-miRNAs.
Proposed new technique:
1) It largely reduces dependence on filtering parameters.
2) It has high adaptability and can be adjusted as new miRNAs are discovered, which guarantees that the collected pseudo pre-miRNAs are sufficiently similar to real pre-miRNAs.

Methods

1 Proposed feature set
2 Classifier selection and optimization
3 Data sets
4 Negative sample selection
5 A miRNA mining tool: mirnaDetect
6 Measurement

Feature set
The 98-feature set we constructed:
1) Primary sequence-based features (4*4*4 = 64). For a given RNA sequence S, the triple-nucleotide frequencies %XYZ are computed, where XYZ represents three contiguous nucleotides in S, and X, Y and Z belong to the set {A, U, C, G}.
2) Secondary structure-based features (2). The minimum free energy (MFE) and the base-pair content of the secondary structure; the base-pair content also differs significantly between real and pseudo pre-miRNAs.
3) Sequence-structure-based features (2*2*2*4 = 32). Features containing both local sequence and structure information were also considered. We used the 32 sequence-structure-based features.
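
The two counted feature groups above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes overlapping windows, reports fractions rather than percentages, and for the 32 sequence-structure features it assumes the triplet-element encoding commonly associated with Triplet-SVM (the middle nucleotide plus the paired/unpaired state of three adjacent positions, with ")" treated as "("). The function names are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "AUCG"


def triplet_frequencies(seq):
    """Return the 64 %XYZ frequencies (as fractions) of an RNA sequence."""
    seq = seq.upper().replace("T", "U")
    keys = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    # Assumption: overlapping windows; windows with other symbols are skipped.
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
            total += 1
    return {k: (c / total if total else 0.0) for k, c in counts.items()}


def sequence_structure_triplets(seq, structure):
    """Return 32 features: middle nucleotide + state of 3 adjacent positions.

    `structure` is dot-bracket notation, e.g. as printed by RNAfold.
    Assumption: ")" is mapped to "(" so each position is paired or unpaired.
    """
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    states = structure.replace(")", "(")
    keys = [n + "".join(s) for n in NUCLEOTIDES for s in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(1, len(seq) - 1):
        key = seq[i] + states[i - 1:i + 2]
        if key in counts:
            counts[key] += 1
            total += 1
    return {k: (c / total if total else 0.0) for k, c in counts.items()}


if __name__ == "__main__":
    freqs = triplet_frequencies("AUGAUG")
    print(len(freqs), freqs["AUG"])
    feats = sequence_structure_triplets("GCAUGC", "((..))")
    print(len(feats), feats["C((."])
```

Together with the two structure-based values (MFE and base-pair content, which would come from a folding tool), these two dictionaries give the 64 + 2 + 32 = 98 dimensions named on the slide.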

Classifier selection and optimization
1) The SVM algorithm was used as the classification algorithm in the present research.
2) The kernel function is the radial basis function (RBF).
3) A grid search for LibSVM was conducted on the training set to obtain the optimal parameters.
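
A grid search over the two RBF-SVM parameters (the penalty C and the kernel width gamma) can be sketched as below. This is a minimal, library-free illustration: the slide does not give the search ranges, so the exponent grids are assumptions, and `cv_score` is a hypothetical callback standing in for "train LibSVM with (C, gamma) and return the cross-validation accuracy". The toy scoring function exists only to make the example runnable.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def grid_search(cv_score, log2c=range(-5, 16, 2), log2g=range(-15, 4, 2)):
    """Try every (C, gamma) pair on a log2 grid; return (best_score, C, gamma).

    Assumption: the exponent ranges are illustrative, not taken from the slides.
    """
    best = None
    for exp_c, exp_g in product(log2c, log2g):
        c, gamma = 2.0 ** exp_c, 2.0 ** exp_g
        score = cv_score(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


def toy_score(c, gamma):
    """Hypothetical stand-in for cross-validation accuracy, peaking at C=8, gamma=2^-5."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


if __name__ == "__main__":
    score, c, gamma = grid_search(toy_score)
    print(c, gamma)
```

In practice the callback would run k-fold cross validation on the training set for each parameter pair, which is why coarse log-scale grids are normally used first.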

Negative sample selection: how to select the negative samples?
Step 1. Search the CDSs (coding region sequences) for sub-sequences homologous to known mature miRNAs, using BLAST with its default settings. Collect the homologous sub-sequences into a “homology set” called S-homology.
Step 2. Flank all the elements in S-homology by 100 nt upstream and downstream, and compute the secondary structures of the flanked elements with RNAfold. The extracted sub-sequences are collected into a “pre-miRNA-like candidate” set called S-candidate.
Step 3. Filter S-candidate. In the first level, two widely accepted criteria of real pre-miRNAs (MFEI > 0.8 and 0.3 < GC% < 0.7) are applied.
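
The flanking in Step 2 and the first-level filter in Step 3 can be sketched as follows. This is a hedged illustration, not the authors' code: the slides do not define MFEI, so the sketch assumes the commonly used definition MFEI = (|MFE| / length * 100) / (G+C)%, and it takes the MFE as an input that would come from a folding tool such as RNAfold. All function names are hypothetical.

```python
def flank(sequence, start, end, n=100):
    """Extend the hit sequence[start:end] by n nt on each side (clipped at the ends)."""
    return sequence[max(0, start - n):min(len(sequence), end + n)]


def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def mfei(seq, mfe):
    """Minimal free energy index.

    Assumption: MFEI = adjusted MFE / (G+C)%, where
    adjusted MFE = |MFE| / length * 100 and MFE is in kcal/mol.
    """
    gc_percent = gc_content(seq) * 100
    if gc_percent == 0:
        return 0.0
    adjusted_mfe = abs(mfe) / len(seq) * 100
    return adjusted_mfe / gc_percent


def passes_first_level(seq, mfe, mfei_min=0.8, gc_min=0.3, gc_max=0.7):
    """First-level filter from Step 3: MFEI > 0.8 and 0.3 < GC% < 0.7."""
    return mfei(seq, mfe) > mfei_min and gc_min < gc_content(seq) < gc_max


if __name__ == "__main__":
    candidate = "GCGCAUAU"
    print(gc_content(candidate), mfei(candidate, -4.0))
    print(passes_first_level(candidate, -4.0))
```

Candidates that pass this level share the basic energetic and compositional profile of real pre-miRNAs, which is what makes them useful as hard negative examples.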

Data Set and Measurement: data set
Positive: Redundant sequences were removed from the positive set, leaving 16,520 non-redundant pre-miRNAs, including 1,496 human and 13,588 non-human pre-miRNAs in the final positive set.
Negative: A total of 14,661 pseudo pre-miRNAs, including 1,446 pseudo human pre-miRNAs.
Training set: The training set consists of 1,155 real and 1,155 pseudo human pre-miRNAs.
Test set: There are several kinds of test sets, introduced on the following slides.

Data Set and Measurement: measurement
Sensitivity (SE, the same as recall), specificity (SP), geometric mean (Gm) and accuracy (Acc).
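
The four measures can be computed from the confusion-matrix counts as sketched below. The slide does not spell out the formulas, so the standard definitions are assumed here: SE = TP / (TP + FN), SP = TN / (TN + FP), Acc = (TP + TN) / total, and Gm = sqrt(SE * SP). The function name and the example counts are hypothetical.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return SE, SP, Acc and Gm from confusion-matrix counts.

    tp/fn: real pre-miRNAs predicted as real / pseudo.
    tn/fp: pseudo pre-miRNAs predicted as pseudo / real.
    """
    se = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)    # accuracy
    gm = math.sqrt(se * sp)                  # geometric mean of SE and SP
    return {"SE": se, "SP": sp, "Acc": acc, "Gm": gm}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(classification_metrics(tp=90, fn=10, tn=80, fp=20))
```

Gm is useful alongside accuracy because it drops sharply when either SE or SP is low, even if the other is high.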

Results / Discussion

Importance of Negative Samples: importance of the representative negative samples
These graphs reveal the importance of negative samples for machine learning algorithms, and indicate that the higher the similarity between the positive and negative training sets, the better the performance of the classifier.

Importance of Negative Samples: representativeness of our negative set
1) Modeling the Triplet-SVM classifier with the new negative set.
2) Modeling the new miRNApre classifier with Xue’s negative set.

Importance of Negative Samples: representativeness of our negative set
A state-of-the-art classifier (Mirident) was also remodeled using our negative set, generating a new training model. The new model does not perform well on the virus sets EBV, HCMV, MGHV68 and KSHV. It is expected that as more viral pre-miRNAs are discovered, the new model will perform better than the original one.

Performance of miRNApre: performance of the LibSVM classifier
All four classifiers based on the proposed feature set were evaluated with 10-fold cross validation on our training set. The experimental result also confirmed the high efficiency of the SVM algorithm.

Performance of miRNApre: performance of LibSVM on the proposed feature set
Ten-fold cross validation was used to evaluate the performance of LibSVM on the same training set with different feature sets. The results imply that the features in C(%XYZ) may contain important attributes for the identification of human pre-miRNAs; this is examined on the next slide.
Feature groups compared: 2 structure-based features, 32 sequence-structure-based features, 64 primary sequence-based features.

Performance of miRNApre: performance of LibSVM on the proposed feature set
We also investigated the importance of each feature in the proposed combined feature set, which we expect will help researchers select the features that are “important” in their specific situations. The top 10 “important” features are listed in the table, which indicates their major influence on the identification of real and pseudo human pre-miRNAs.
Among the top 10 features: primary sequence-related features: 9; structure-based features: 1; sequence-structure-based features: 0.

Performance of miRNApre: analyzing the performance of miRNApre
We used a 10-fold cross validation test to evaluate the performance of miRNApre on the training set, which contains 1,155 real pre-miRNAs and 1,155 pseudo pre-miRNAs. In real applications, the high SP is meaningful because there are far more pseudo pre-miRNAs than real pre-miRNAs in genome data.
Real pre-miRNAs: 1,155; pseudo pre-miRNAs: 1,155; SP: 98.2%; SE: 97.9%; Acc: 98.1%; Gm: 98%.

Performance of miRNApre: analyzing the performance of miRNApre
The test set contains 69 newly found human pre-miRNAs. miRNApre performed better than the compared methods, and even on the virus data its performance is comparable.

Performance of miRNApre: performance of mirnaDetect
A mining method should generate as few false positives as possible, to save the time and money spent on experiments to validate them. From this viewpoint, MIReNA and CSHMM generated 10,626 and 18,258 pre-miRNA candidates, respectively, while mirnaDetect found only 2,645 candidates.
Method: mirnaDetect; SE: proper; SP: high.

Conclusions

1) In this study, we explored the importance of representative negative samples for machine learning based methods for pre-miRNA identification. We found that existing negative sets suffer from low quality, and that it is difficult to generate an effective and promising prediction model based on them.
2) To improve the quality of negative samples, we proposed a multi-level negative sample selection method and successfully constructed a high-quality negative set.
3) The high accuracy of our miRNApre method on different data sets suggests that our method is a promising tool for miRNA identification.

About the Authors
Leyi Wei received the BSc degree in computing mathematics and the MSc degree in computer science from Xiamen University, China.
Minghong Liao received the MSc and PhD degrees in computer science and engineering from Harbin Institute of Technology, China, in 1988 and 1993.
Yue Gao received the BS degree from the Harbin Institute of Technology, China, in 2005, and the ME and PhD degrees from Tsinghua University, Beijing, China.

Any Questions? We can discuss! Q&A

Thanks! 魏琪康
