Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments.


Assignment 2: Papers read for this assignment
Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms
Paper 2: Optimal spliced alignments of short sequence reads
Badil Elhady, Michael Chan

Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms

Motivation
Question for the study?
– The correct alignment of mRNA sequences to genomic DNA is still a challenging task, due to the presence of sequencing errors, micro-exons, and alternative splicing.

Method
Splice Site Prediction
– SVM (large margin classifier), trained by convex optimization
Intron Length Model
Dynamic programming is used to maximize the scoring function, leading to the optimal alignment (Smith-Waterman alignments with intron model).
This leads to:
– Tuning the parameters of the scoring function so that the correct alignment gets a larger score and any other alignment scores lower.

Splice site prediction
– Accurately differentiates the exon-intron boundaries.
– Compartmentalizes the local alignment of the EST.
– Claim: robust to mutations, insertions and deletions, and to noise, in accurately identifying both the intron boundaries and the boundaries of the optimal local alignment.

Splice Site Predictions
– From a set of ESTs, sequences of confirmed donor and acceptor splice sites were extracted.
– To recognize acceptor and donor splice sites, 2 SVM classifiers were trained, using the "weighted degree" kernel.
– The kernel computes the similarity between sequences s and s'.
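The kernel formula itself is not reproduced on the slide. As a rough sketch, the weighted degree kernel is commonly written as a weighted count of substrings of length 1..D that match at the same position in both sequences. The weights beta_d = 2(D - d + 1) / (D(D + 1)) and the function name `wd_kernel` are assumptions taken from the general literature, not from this presentation.

```python
def wd_kernel(s, t, degree=3):
    """Weighted degree kernel between two equal-length sequences (sketch).

    Counts, for every substring length d = 1..degree, the positions where
    s and t share the same d-mer, and weights longer matches less via beta_d.
    The weighting scheme is an assumption, not taken from the slides.
    """
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    n = len(s)
    total = 0.0
    for d in range(1, degree + 1):
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        # number of positions where the d-mers starting at i are identical
        matches = sum(1 for i in range(n - d + 1) if s[i:i + d] == t[i:i + d])
        total += beta * matches
    return total
```

Because only substrings at identical positions are compared, the kernel suits fixed-size windows centred on a candidate splice site.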

Intron Length Model
The main idea of the algorithm is to compute a local alignment by determining the maximum over all alignments of all prefixes:
– $S_E(1:i) := (S_E(1), \ldots, S_E(i))$
– $S_D(1:j) := (S_D(1), \ldots, S_D(j))$
» $S_E$: EST sequence
» $S_D$: DNA sequence
– Running time is $O(m \cdot n \cdot L)$
» $m$: length of $S_E$
» $n$: length of $S_D$
» $L$: maximal intron length
– Smith-Waterman does not distinguish between exons and introns.
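For reference, the prefix-based idea can be sketched with the plain Smith-Waterman recurrence, which has no intron model and runs in O(m·n). The scoring values (match, mismatch, gap) and the function name are illustrative assumptions, not parameters from the paper.

```python
def smith_waterman_score(est, dna, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between est and dna (no intron model).

    cur[j] holds the best score of a local alignment ending at
    est[i-1] and dna[j-1]; a value of 0 means "start a new alignment here".
    """
    m, n = len(est), len(dna)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, m + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            s = match if est[i - 1] == dna[j - 1] else mismatch
            cur[j] = max(0,
                         prev[j - 1] + s,    # align est[i-1] with dna[j-1]
                         prev[j] + gap,      # gap in the DNA sequence
                         cur[j - 1] + gap)   # gap in the EST sequence
            best = max(best, cur[j])
        prev = cur
    return best
```

With a linear gap penalty like this, an intron is charged as one long gap, which is why a separate intron model is needed.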

Scoring Function

Smith-Waterman Alignments with Intron Model: splice site predictions assisting the parameters
PALMA generalizes the Smith-Waterman algorithm by including an intron model that takes splice site predictions as well as intron length into account.
This information is then used to optimize the parameters used for alignment.
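The large margin idea behind the parameter tuning can be sketched as follows, assuming the alignment score is linear in a parameter vector `w` applied to a feature vector of the alignment. The feature representation, the margin of 1 and the function names are hypothetical and only illustrate the principle that the correct alignment should outscore every other alignment by a margin.

```python
def score(w, phi):
    """Linear alignment score: dot product of parameters and alignment features."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def margin_violation(w, phi_true, phi_other, margin=1.0):
    """Hinge-style slack: zero when the true alignment wins by at least `margin`."""
    return max(0.0, margin - (score(w, phi_true) - score(w, phi_other)))
```

Training would then look for a `w` that keeps these violations small over all training pairs, which is a convex problem under the linearity assumption.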

Experimental setup
– Evaluating PALMA vs. exalin, sim4, and blat.
– Alignment of mRNA sequences, artificially shortening the middle exon (3-50 nt) as shown.

Experimental setup cont.
– The data are generated artificially, as a control, so that the correct alignment is known exactly.
– Add varying amounts of noise (p = 0, 1, 5 and 10% of random mutations, deletions or insertions) to the query sequence.
– Replace a part of the DNA or mRNA sequence at its terminal ends with random sequence, leading to a shortened correct alignment.
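A minimal sketch of how such noise could be added to a query sequence, assuming each position is independently perturbed with probability p and the three edit types are equally likely. Both assumptions, and the name `add_noise`, are illustrative; the slides do not say how the noise was drawn.

```python
import random


def add_noise(seq, p, rng, alphabet="ACGT"):
    """Perturb each base with probability p by a mutation, deletion or insertion."""
    out = []
    for base in seq:
        if rng.random() < p:
            op = rng.choice(("mutate", "delete", "insert"))
            if op == "mutate":
                # substitute with a different base
                out.append(rng.choice([b for b in alphabet if b != base]))
            elif op == "insert":
                # keep the base and insert a random base after it
                out.append(base)
                out.append(rng.choice(alphabet))
            # "delete": the base is simply dropped
        else:
            out.append(base)
    return "".join(out)
```

Passing a seeded generator, for example `add_noise(query, 0.05, random.Random(0))`, keeps the generated test set reproducible.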

PALMA vs. exalin, sim4, and blat. Add noise

PALMA vs. exalin, sim4, and blat. Varying lengths

Conclusion
– Motivation: high-sensitivity detection of short exons in the midst of noise.
– Principles: splice site prediction and an intron length model; the scoring function is maximized to obtain the optimal alignment.
– Results: PALMA detects short exons where exalin, blat, etc. are unsuccessful.

Further Topics
– vmatch
– SVM
– convex optimization

Paper 2: Optimal spliced alignments of short sequence reads

Situation
– NGS reads are short and have an inherently high error rate, even compared to Sanger reads. Sequencing is fast, but is it accurate?
– Many methods are efficient and accurate if the sequence blocks (exons) are sufficiently long and highly similar to the genomic sequence.
– Reads from NGS techniques have neither of these two properties.

Motivation
– Objective: to accurately align the sequence reads over intron boundaries.
– QPALMA uses the read's quality information as well as computational splice site predictions to compute accurate spliced alignments.

Principles
Learn, in a supervised manner, how to score quality information, splice site predictions and sequence identity, based on a representative set of sequence reads with known alignments.
Extended Smith-Waterman algorithm:
– Extension 1: Quality Scores
– Extension 2: Splice Sites
– Extension 3: Non-affine Intron Length Model

1) Splice site prediction
Need to know acceptor and donor splice sites as well as suitable decoy sequences.
– Extension 1: Quality Scores
– Extension 2: Splice Sites
– Extension 3: Non-affine Intron Length Model

Extension 1: Quality Scores
– Same computational complexity as the original Smith-Waterman algorithm, O(mn).
– However, it uses a more complex scoring that may depend on the sequencing technology used.
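One way to make the substitution score depend on base quality is to attach a piecewise linear function of the quality value to each (read base, genome base) pair. The sketch below assumes this parametrization; the table layout, the support points and the names `plif` and `quality_score` are hypothetical.

```python
from bisect import bisect_right


def plif(x, supports, values):
    """Piecewise linear interpolation through (supports[k], values[k]), clamped at the ends."""
    if x <= supports[0]:
        return values[0]
    if x >= supports[-1]:
        return values[-1]
    k = bisect_right(supports, x)  # supports[k-1] <= x < supports[k]
    x0, x1 = supports[k - 1], supports[k]
    y0, y1 = values[k - 1], values[k]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def quality_score(read_base, dna_base, quality, table):
    """Score for aligning read_base (with the given quality) to dna_base.

    `table` maps (read_base, dna_base) to a (supports, values) pair
    describing the learned function of the quality value.
    """
    supports, values = table[(read_base, dna_base)]
    return plif(quality, supports, values)
```

Each cell of the dynamic programming matrix still costs a constant amount of work, so the O(mn) complexity stated above is unchanged.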

Extension 2: Splice Sites
– Requires O(mnL) operations, where L is the maximal length of the intron.
– The idea is to maintain an additional recurrence matrix W that keeps track of the intron boundaries.

Extension 2: Splice Sites, cont.
– $g_o$ and $g_e$ are the intron opening and extension scores.
– $g_{don}(i)$ and $g_{acc}(i)$ are scoring functions for splice sites at position $i$ in the sequence.
– $\hat{f}_{acc}(i) := f_{acc}(g_{acc}(i))$ and $\hat{f}_{don}(i) := f_{don}(g_{don}(i))$.

Extension 3: Non-affine Intron Length Model
Here the intron length is scored with an arbitrary function.

Extension 3: Non-affine Intron Length Model
– The recurrence can be implemented as follows.
– For long introns this approach seems computationally infeasible.
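The recurrence from the slide is not reproduced in this transcript. As a rough sketch of why an arbitrary intron length function costs O(mnL), the code below extends a local alignment recurrence with an intron transition that tries every intron length up to `max_intron`. The per-position `don` and `acc` score lists, the linear gap penalty and the rule that an intron may only follow an already started alignment are simplifying assumptions, not the QPALMA formulation.

```python
def spliced_align_score(est, dna, don, acc, intron_len_score, max_intron,
                        match=1.0, mismatch=-1.0, gap=-2.0):
    """Best local spliced alignment score of est against dna (sketch).

    don[p]: score for an intron starting at dna[p] (donor site)
    acc[q]: score for an intron ending at dna[q] (acceptor site)
    intron_len_score(k): arbitrary score for an intron of length k
    """
    m, n = len(est), len(dna)
    V = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if est[i - 1] == dna[j - 1] else mismatch
            v = max(0.0, V[i - 1][j - 1] + s, V[i - 1][j] + gap, V[i][j - 1] + gap)
            # intron of length k covering dna[j-k .. j-1]; the inner loop is the factor L
            for k in range(1, min(max_intron, j) + 1):
                start = j - k
                if V[i][start] > 0.0:  # only splice out of an existing alignment
                    cand = (V[i][start] + don[start] + acc[j - 1]
                            + intron_len_score(k))
                    v = max(v, cand)
            V[i][j] = v
            best = max(best, v)
    return best
```

The inner loop over k is what becomes expensive for long introns, in line with the feasibility concern on the slide.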

An alignment pipeline against whole genomes
– Computing optimal alignments is time consuming.
– Therefore use vmatch (a multi-step approach on enhanced suffix arrays) plus high-quality splice site detection.

[Optional] An alignment pipeline against whole genomes
vmatch (1st round) finds global alignments of all short reads (max. 2 mismatches) against the genome to identify the large fraction of unspliced reads.
– Reads that cannot be aligned (leftover reads) are either spliced or low-quality reads.
Yet there is a possibility that the boundaries of the aligned reads are splice sites.
– The QPALMA scoring function is used as a filter to quickly decide whether a read is spliced or not, checking:
– all combinations of putative donor splice sites within the read and acceptor splice sites ≤2000 nt downstream of the read, and
– all combinations of putative acceptor splice sites within the read and donor splice sites ≤2000 nt upstream of the read.
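The two sets of combinations checked by the filter can be enumerated as in the sketch below. It assumes half-open genomic coordinates for the read, distances measured from the read's ends, and precomputed lists of putative site positions; these conventions and the function name are illustrative only.

```python
def candidate_splice_pairs(read_start, read_end, donors, acceptors, max_dist=2000):
    """List (donor, acceptor) position pairs to test for a read at [read_start, read_end)."""
    pairs = []
    # Case 1: donor inside the read, acceptor up to max_dist downstream of the read
    for d in donors:
        if read_start <= d < read_end:
            for a in acceptors:
                if read_end <= a <= read_end + max_dist:
                    pairs.append((d, a))
    # Case 2: acceptor inside the read, donor up to max_dist upstream of the read
    for a in acceptors:
        if read_start <= a < read_end:
            for d in donors:
                if read_start - max_dist <= d < read_start:
                    pairs.append((d, a))
    return pairs
```

Each returned pair would then be scored with the filter, and the read is treated as spliced if some pair scores better than the unspliced alignment.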

Leftover reads plus reads predicted by QPALMA to be spliced are used as seeds for vmatch (2nd round), to localize the splice sites with a 'window'.

Results

Conclusion
– Motivation: NGS is inaccurate.
– Principles: 3 extensions to PALMA; vmatch pipelining, for boundary precision.
– Results: lower error.
QPALMA + vmatch pipelining = PALMA + 3 extensions – {SVM, large margin}

References
– A Tutorial on Support Vector Machines for Pattern Recognition
– NCBI (National Center for Biotechnology Information), genome.html
– BioInfoBank Library, http://lib.bioinfo.pl/
– High Throughput Short Read Alignment via Bi-directional BWT

Michael Chan, Badil Elhady