Probe Design Using Exact Repeat Count August 8th, 2007 Aaron Arvey.

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Nearest Neighbor Search in High Dimensions Seminar in Algorithms and Geometry Mica Arie-Nachimson and Daniel Glasner April 2009.
BLAST Sequence alignment, E-value & Extreme value distribution.
Image (and Video) Coding and Processing Lecture 5: Point Operations Wade Trappe.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Next Generation Sequencing, Assembly, and Alignment Methods
1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China.
Modern Information Retrieval
Sequence Similarity Searching Class 4 March 2010.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
FALL 2004CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Finding approximate palindromes in genomic sequences.
Designing Multiple Simultaneous Seeds for DNA Similarity Search Yanni Sun, Jeremy Buhler Washington University in Saint Louis.
Probe design for microarrays using OligoWiz. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical.
Whole Genome Alignment using Multithreaded Parallel Implementation Hyma S Murthy CMSC 838 Presentation.
Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler Ben Langmead Based on work by Juha Kärkkäinen.
Probabilistic Similarity Search for Uncertain Time Series Presented by CAO Chen 21 st Feb, 2011.
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
Sequence Alignment vs. Database Task: Given a query sequence and millions of database records, find the optimal alignment between the query and a record.
Selection of Optimal DNA Oligos for Gene Expression Arrays Reporter : Wei-Ting Liu Date : Nov
Pairwise Sequence Alignment Part 2. Outline Global alignments-continuation Local versus Global BLAST algorithms Evaluating significance of alignments.
Noncoding RNA Genes Pt. 2 SCFGs CS374 Vincent Dorie.
Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays Nazif Cihan Tas CMSC 838 Presentation.
6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.
External Sorting 198:541. Why Sort?  A classic problem in computer science!  Data requested in sorted order e.g., find students in increasing gpa order.
Fa05CSE 182 CSE182-L5: Scoring matrices Dictionary Matching.
Introduction to computational genomics – hands on course Gene expression (Gasch et al) Unit 1: Mapper Unit 2: Aggregator and peak finder Solexa MNase Reads.
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
UNC Chapel Hill M. C. Lin Point Location Reading: Chapter 6 of the Textbook Driving Applications –Knowing Where You Are in GIS Related Applications –Triangulation.
Detection and Compensation of Cross- Hybridization in DNA Microarray Data Joint work with Quaid Morris (1), Tim Hughes (2) and Brendan Frey (1) (1)Probabilistic.
Filter Algorithms for Approximate String Matching Stefan Burkhardt.
Generating Peptide Candidates from Protein Sequence Databases for Protein Identification via Mass Spectrometry Nathan Edwards Informatics Research.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Optimizing multi-pattern searches for compressed suffix arrays Kalle Karhu Department of Computer Science and Engineering Aalto University, School of Science,
Indexing DNA sequences for local similarity search Joint work of Angela, Dr. Mamoulis and Dr. Yiu 17/5/2007.
Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.
Read Alignment Algorithms. The Problem 2 Given a very long reference sequence of length n and given several short strings.
Hash Algorithm and SSAHA Implementations Zemin Ning Production Software Group Informatics.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 1: Exact String Matching.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
1 CSE 326: Data Structures: Hash Tables Lecture 12: Monday, Feb 3, 2003.
PatternHunter II: Highly Sensitive and Fast Homology Search Bioinformatics and Computational Molecular Biology (Fall 2005): Representation R 林語君.
CS654: Digital Image Analysis Lecture 18: Image Enhancement in Spatial Domain (Histogram)
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
Module networks Sushmita Roy BMI/CS 576 Nov 18 th & 20th, 2014.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Cache-efficient string sorting for Burrows-Wheeler Transform Advait D. Karande Sriram Saroop.
Starting Monday M Oct 29 –Back to BLAST and Orthology (readings posted) will focus on the BLAST algorithm, different types and applications of BLAST; in.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Comp. Genomics Recitation 10 Clustering and analysis of microarrays.
Database Similarity Search. 2 Sequences that are similar probably have the same function Why do we care to align sequences?
From Smith-Waterman to BLAST
Output Sensitive Algorithm for Finding Similar Objects Jul/2/2007 Combinatorial Algorithms Day Takeaki Uno Takeaki Uno National Institute of Informatics,
Doug Raiford Phage class: introduction to sequence databases.
FALL 2005CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
1 MAVID: Constrained Ancestral Alignment of Multiple Sequence Author: Nicholas Bray and Lior Pachter.
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
From: Duggan et.al. Nature Genetics 21:10-14, 1999 Microarray-Based Assays (The Basics) Each feature or “spot” represents a specific expressed gene (mRNA).
1 Repeats!. 2 Introduction  A repeat family is a collection of repeats which appear multiple times in a genome.  Our objective is to identify all families.
External Sorting Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
CS4432: Database Systems II
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Selection of Oligonucleotide Probes for Protein Coding Sequences
Searching Similar Segments over Textual Event Sequences
Range Queries on Uncertain Data
CENG 351 Data Management and File Structures
TRC: Trace – Reference Compression
Presentation transcript:

Probe Design Using Exact Repeat Count August 8th, 2007 Aaron Arvey

11/06/072 Outline Qualities of good probes Past Work Our Algorithm

11/06/073 Oligo Probes Sequences are controlled In contrast to cDNA/other in vitro methods More genomes are getting sequenced, enabling oligo probe usage Large genomes are hard Lack of compactness leads to repeats Probes may bind to similar sequences Specificity vs sensitivity

11/06/074 Good Probes “Completely” unique No nonspecific binding to similar sequences (w.r.t. binding energy) Regions (e.g. endpoints) are more important No G-quartets Ends in C/G (less “breathing”) No hairpin structures Fairly balanced AT% vs GC%

11/06/075 Empirically Specific Probes (For Microarrays) “50-mer probe with 15-base, 20-base, or 35-base stretch with nontargets, has approximately 1%, 4%, or 50% of the target signal intensity (Kane et al, Hughes et al)” “...at probe-target identity of 75% or greater, minimal free energy is closely correlated to probe-target identity (data not shown).” (He et al, 2005)

11/06/076 Each graph shows intensity of microarray. Right column shows perfect match. Left columns show mismatches (in gray). Taken from Bozdech et al 2003 (De Risi Lab)

11/06/077 Do Good Probes Exist? A completely unique sequence is likely to be fairly uniformally random From this we can say that a unique sequence will likely lack internal repeats have low probability of forming hairpin structure have fairly balanced AC% vs. GC% Prob(G quartet) =.25^4 = 1/256 Good sequences are likely to exist

11/06/078 Finding Good Probes Computationally naïve approach BLAST every possible kmer probe, given a desired target sequence May take days to complete single probe Heuristics Use kmer substring uniqueness. D. mel: 12mer, 15mer H. sap: 16mer, 18mer, 20mer, 24mer

11/06/079 Past Algorithmic Approaches BLAST searches (DeRisi '03, others) Suffix array/tree (Li & Stormo 2000) Burrows-Wheeler transform (CSHL 2005) Sorting and compressing of suffix array Almost entirely for microarray studies The problem in the in situ domain is harder

11/06/0710 Selection of probe given a sequence Taken from Bozdech et al 2003 (De Risi Lab)

11/06/0711 Binding Energy of nearest BLAST search Self-Binding (Hairpin) Score “Sequence Complexity” (Compressibility) Taken from Bozdech et al 2003 (De Risi Lab)

11/06/0712 Times for past algorithms Suffix Array (Li & Stormo 2001) 4 days for 4-mismatch distance of 24mer probes on E. coli (12MB) Would be months for 50mers on E. coli Would be centuries for 50mers on humans BLAST searches (DeRisi Lab 2003) 12 hours for 70mer probes on P. falciparum 12MB Would be weeks/months for humans

11/06/0713 Times for past algorithms Burrows-Wheeler Transform (CSHL 2005) Requires 6-7 days to construct dictionary Able to store on disk and use disk seeks After preprocessing, queries are as fast as ours Suffix Trees Requires 40-50GB RAM Likely to be slightly slower than our algorithm

11/06/0714 Our Idea Use a heuristic to find good regions Find subsequence uniqueness for 15-20mers Use Bloom Filter (probabilistic) Use boolean “is repeated” Use exact repeat counts Use BLAST to find second most homologous sequence and check binding energy to probe

11/06/0715 What We Want to Know AGCTAGCTAGTCAAAGGT (UNIQUE) GCTAGCTAGTCAAAGGTA (UNIQUE) CTAGCTAGTCAAAGGTAA (UNIQUE) TAGCTAGTCAAAGGTAAT (REPEATED) AGCTAGTCAAAGGTAATA (UNIQUE) GCTAGTCAAAGGTAATAG (UNIQUE) CTAGTCAAAGGTAATAGG (UNIQUE) S=TAG...AAT is repeated AS, GS, or TS exist ST, SG, or SC exist

11/06/0716 Design Preprocessing Find Repeats Print all kmers to file Sample kmer cumulative distribution Count records in sorted files sequentially Create files for CDF sorting Read in repeat counts Read in sequence Find kmer counts via binary search Sequence Input Kmer Input

11/06/0717 Time for H. sapiens Preprocessing Find Repeats Print all kmers to file Sample kmer cumulative distribution Count records in sorted files sequentially Create files for CDF sorting Read in repeat counts Read in sequence Find kmer counts via binary search Sequence Input Kmer Input 45 mi n hrs 30 mi n 20 mi n 8 mi n <1 sec <1 sec

11/06/0718 Size for H. sapiens Preprocessing Find Repeats Print all kmers to file Sample kmer cumulative distribution Count records in sorted files sequentially Create files for CDF sorting Read in repeat counts Read in sequence Find kmer counts via binary search Sequence Input Kmer Input 3GB Disk 1 File 20GB Disk 500 Files 20GB Disk 1 File ½- 2GB Disk 1 File 10 KB RA M 400 KB RA M ½- 2GB RA M

11/06/0719 Algorithm Preprocessing Find Repeats Print all kmers to file Sample kmer cumulative distribution Count records in sorted files sequentially Create files for CDF sorting Read in repeat counts Read in sequence Find kmer counts via binary search Sequence Input Kmer Input

11/06/0720 Preprocessing – Printing Kmers A genome file is read in (3GB) Every kmer (including overlaps) is printed to an output file (10-20GB) Kmers are sampled to create an empirical cumulative distribution (CDF) and put into files Goal is to uniformally sample as to avoid abnormally large files

11/06/0721 Why Sampling of CDF? What we'd like What we getWhat we make

11/06/0722 Algorithm Preprocessing Find Repeats Print all kmers to file Sample kmer cumulative distribution Count records in sorted files sequentially Create files for CDF sorting Read in repeat counts Read in sequence Find kmer counts via binary search Sequence Input Kmer Input

11/06/0723 Preprocessing – Sorting Kmers Each chunk of the CDF is sorted All repeated kmers are adjacent Read the sorted files sequentially Output repeated kmers and number of times repeated to output file

11/06/0724 Algorithm Preprocessing Find Repeats Print all kmers to file Sample kmer cumulative distribution Count records in sorted files sequentially Create files for CDF sorting Read in repeat counts Read in sequence Find kmer counts via binary search Sequence Input Kmer Input

11/06/0725 Finding Repeats – Specifications Input a sequence (e.g. large intron) Specify plus/minus strand Specify granularity of kmer Read in repeat file from preprocessing Read the repeat Read the number of times it was repeated

11/06/0726 Algorithm Preprocessing Find Repeats Print all kmers to file Sample kmer cumulative distribution Count records in sorted files sequentially Create files for CDF sorting Read in repeat counts Read in sequence Find kmer counts via binary search Sequence Input Kmer Input

11/06/0727 Finding Repeats – Searching For each of the kmers in the sequence, see if it is in the repeat file Since the repeat file is sorted, we can do a binary search Each binary search takes < seconds Output a chart showing the repeat structure of the sequence

11/06/0728 Questions?