Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features 王荣 14S051032.

Slides:

Advertisements

Similar presentations

Lecture 9 Support Vector Machines

Advertisements

ECG Signal processing (2)

Perceptron Learning Rule

An Introduction of Support Vector Machine

SVM—Support Vector Machines

CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.

Predicting Enhancers in Co-Expressed Genes Harshit Maheshwari Prabhat Pandey.

A Novel Knowledge Based Method to Predicting Transcription Factor Targets

Semi-Supervised Classification by Low Density Separation Olivier Chapelle, Alexander Zien Student: Ran Chang.

Mismatch string kernels for discriminative protein classification By Leslie. et.al Presented by Yan Wang.

Heuristic alignment algorithms and cost matrices

Reduced Support Vector Machine

Protein Homology Detection Using String Alignment Kernels Jean-Phillippe Vert, Tatsuya Akutsu.

Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.

SVM: Non-coding Neutral Sequences Vs Regulatory Modules Ying Zhang, BMB, Penn State Ritendra Datta, CSE, Penn State Bioinformatics – I Fall 2005.

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,

Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.

Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS Presented by C. Geetha Jini (07MW03) D. Komagal Meenakshi.

Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments.

Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.

GATree: Genetically Evolved Decision Trees 전자전기컴퓨터공학과 데이터베이스 연구실 G 김태종.

Identification of Regulatory Binding Sites Using Minimum Spanning Trees Pacific Symposium on Biocomputing, pp , 2003 Reporter: Chu-Ting Tseng Advisor:

1 Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine Chenghai Xue, Fei Li, Tao He,

Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)

Phase Congruency Detects Corners and Edges Peter Kovesi School of Computer Science & Software Engineering The University of Western Australia.

Sequence Alignment Csc 487/687 Computing for bioinformatics.

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.

Identification of cell cycle-related regulatory motifs using a kernel canonical correlation analysis Presented by Rhee, Je-Keun Graduate Program in Bioinformatics.

Exploring Alternative Splicing Features using Support Vector Machines Feature for Alternative Splicing Alternative splicing is a mechanism for generating.

Localising regulatory elements using statistical analysis and shortest unique substrings of DNA Nora Pierstorff 1, Rodrigo Nunes de Fonseca 2, Thomas Wiehe.

Evaluation of Techniques for Classifying Biological Sequences Authors: Mukund Deshpande and George Karypis Speaker: Sarah Chan CSIS DB Seminar May 31,

Journal report: High Resolution Model of Transcription Factor- DNA Affinities Improve In Vitro and In Vivo Binding Predictions Paper by: Phadera Gius,

Gang WangDerek HoiemDavid Forsyth. INTRODUCTION APROACH (implement detail) EXPERIMENTS CONCLUSION.

PROTEIN STRUCTURE SIMILARITY CALCULATION AND VISUALIZATION CMPS 561-FALL 2014 SUMI SINGH SXS5729.

LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino.

Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.

University of Texas at Austin Machine Learning Group Department of Computer Sciences University of Texas at Austin Support Vector Machines.

Algorithms in Computational Biology11Department of Mathematics & Computer Science Algorithms in Computational Biology Markov Chains and Hidden Markov Model.

Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.

Support Vector Machines and Gene Function Prediction Brown et al PNAS. CS 466 Saurabh Sinha.

GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function Sara Mostafavi, Debajyoti Ray, David Warde-Farley,

Improved Video Categorization from Text Metadata and User Comments ACM SIGIR 2011:Research and development in Information Retrieval - Katja Filippova -

Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors by Dennis DeCoste and Dominic Mazzoni International.

Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine 朱林娇 14S

Combining Evolutionary Information Extracted From Frequency Profiles With Sequence-based Kernels For Protein Remote Homology Detection Name: ZhuFangzhi.

Feature Selction for SVMs J. Weston et al., NIPS 2000 오장민 (2000/01/04) Second reference : Mark A. Holl, Correlation-based Feature Selection for Machine.

Final Report (30% final score) Bin Liu, PhD, Associate Professor.

V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.

V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.

Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)

Machine Learning: A Brief Introduction Fu Chang Institute of Information Science Academia Sinica ext. 1819

Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.

Nawanol Theera-Ampornpunt, Seong Gon Kim, Asish Ghoshal, Saurabh Bagchi, Ananth Grama, and Somali Chaterji Fast Training on Large Genomics Data using Distributed.

A Robust and Accurate Binning Algorithm for Metagenomic Sequences with Arbitrary Species Abundance Ratio Zainab Haydari Dr. Zelikovsky Summer 2011.

Mismatch String Kernals for SVM Protein Classification Christina Leslie, Eleazar Eskin, Jason Weston, William Stafford Noble Presented by Pradeep Anand.

1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.

ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Bayes Rule Mutual Information Conditional.

Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.

Spectral Algorithms for Learning HMMs and Tree HMMs for Epigenetics Data Kevin C. Chen Rutgers University joint work with Jimin Song (Rutgers/Palentir),

BIOINFORMATION A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation - - 王红刚 14S

Hidden Markov Models BMI/CS 576

PREDICT 422: Practical Machine Learning

How to forecast solar flares?

IEEE BIBM 2016 Xu Min, Wanwen Zeng, Ning Chen, Ting Chen*, Rui Jiang*

Results for all features Results for the reduced set of features

Introduction to Bioinformatics II

Generally Discriminant Analysis

Nora Pierstorff Dept. of Genetics University of Cologne

Deep Learning in Bioinformatics

Presentation transcript:

Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features 王荣 14S051032

Content 1 、 Abstract 2 、 Introduction 3 、 Method 4 、 Results 5 、 Discussion

1 、 Abstract  Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences.  k-mers suffer from the inherent limitation that if the parameter k is increased to resovle long features.  Method: gapped k-mer  Datasets: CTCF dataset and EP300 dataset

2 、 Introduction  Predicting the function of regulatory elements from primary DNA sequence still remains a major problem in computational biology.  kmer-SVM uses combinations of short (6–8 bp) k-mer frequencies to predict the activity of larger functional genomic sequence elements.  The choice to use a single k, and which k, is somewhat arbitrary and based on performance on a limited selection of datasets.

2 、 Introduction  Limitations of k-mers as features: ● sparse problem. ● overfitting problem  Transcription Factor Binding Sites (TFBS) can vary from 6–20 bp, so some are much longer, and thus cannot be completely represented by the short k-mers.

3 、 Method  Support vector machine(SVM) ● The Support Vector Machine (SVM) is one of the most successful binary classifiers and has been widely used in many classification problems. ● SVM is based on the structural risk minimization principle from statistical learning theory.

3 、 Method kmer

3 、 Method  Gapped k-mer method ● gaps —— no information positions. ● gapped k-mer uses a full set of k-mers with gaps as features.

3 、 Method ● Kernel function ● Calculates the similarity between any two sequences.

3 、 Method ● Not involves the computation of all possible gapped k-mer counts. ● The key idea is that only the full l-mers present in the two sequences can contribute to the similarity score via all gapped k-mers derived from them.

3 、 Method  Gkm-kernel with k-mer tree data structure ● Construct the tree by adding a path for every l-mer observed in a training sequence.

3 、 Method  de novo PWMs from gkm-SVM ● determined a set of predictive k-mers by scoring all possible 10-mers and selecting the top 1% of the highscoring 10-mers. ● found a set of distinct PWM models from these predictive 10-mers using a heuristic iterated greedy algorithm. ● for each of the remaining predictive 10-mers, we calculated the log-odd ratios of all possible alignments of the 10-mer to the PWM model, and identified the best alignment.

3 、 Method  Implementation of mismatch and wildcard kernels using gkm-kernel framework ● The maximum number M replaces k in gkm-kernel. ● Consider the contribution of each pair of l-mers with m mismatches in the inner product of the corresponding feature vectors of the two sequences.

3 、 Method ● Coefficients of mismatch kernel

3 、 Method  Gkm-filters for computation of the robust l-mer count estimates ● mapping matrix

3 、 Method  Gkm-kernel with l-mer count estimates ● Coefficients of mismatch kernel

4 、 Results  gkm-SVM outperforms kmer-SVM over a wide range of k- mer length

4 、 Results

 gkm-SVM consistently outperforms kmer-SVM and the best known PWM on human ENCODE ChIP-seq data sets.

4 、 Results

 Comparison of gkm-SVM and exiting methods on the mouse forebrain EP300 data set.

4 、 Results  Gapped k-mer features also improve performance of Naive-Bayes classifiers.

5 、 Discussion In this paper, we presented a significantly improved method for sequence prediction using gapped k-mers as features, gkm-SVM. We introduced a new set of algorithms to efficiently calculate the kernel matrix, and demonstrated that by using these new methods we can significantly overcome the sparse k-mer count problem for long k-mers and hence significantly improve the classification accuracy especially for long TFBSs.