Whole Genome Repeat Analysis Package: A Preliminary Analysis of the Caenorhabditis elegans Genome
Paul Poole

Introduction
Relevance of detecting repeats:
• Evolutionary significance
  - Divergence of repeats between two similar or related organisms
• Potential binding sites
  - Contribute to DNA packaging (PREs) and structure
  - Role in gene regulation by interacting with transcription factors
• Disease (tandem repeats; trinucleotide pattern)
  - Huntington’s
  - Myotonic dystrophy
  - Spinal and bulbar muscular atrophy
  - Fragile-X mental retardation
  - Alu repeats – tumors

Background
Commonly used sequence analysis tools:
• Suffix tree data structure
  - REPuter
  - MUMmer
• Heuristic tools
  - FASTA
  - BLAST
• Probabilistic modeling tools (Hidden Markov Models)
  - HMMER
  - MetaMEME
  - Sequence Alignment and Modeling System (SAM)

Background - REPuter
• repfind – perfect and degenerate repeats
  1. Build suffix tree
  2. Find instances of repeats ≥ 2
• repselect
  - Filters repfind results based on options
  - Shared object file can be used to provide more specific filters
• Current version of REPuter is limited to 67 million bases

Conceptual Overview
Problem:
• Heuristic tools require a query sequence to get started
  - One approach might be to compare chunks of chromosomes for regions of similarity
• Probabilistic models require multiple sequences to build a model
  - Could use results from above to generate a model

Conceptual Overview
My solution:
1. Use REPuter to locate perfect repeats
2. Cluster similar repeats
3. Build a database of these clusters, which will serve as seeds for further analysis
• Clusters can be used to build probabilistic models (such as Hidden Markov Models)
• FASTA or BLAST can be used to compare the clusters against a genome to expand the cluster

Design – Program Flow Chart
1. Collect user information
2. Generate Perfect Repeats (REPuter)
3. Parse Repeats into data structure
4. Obtain positional information and calculate distances
5. Add Perfect Repeats to Database
6. Consolidate Repeats to produce Maximal Repeats
7. Add Maximal Repeats to Database
8. Obtain Sequences of all Maximal Repeats
9. Compare Sequences with FASTA
10. Cluster Sequences with BAG
11. Add clusters to Imperfect Sequence Database

Design – Program Flow Chart (current step: Generate Perfect Repeats (REPuter))

REPuter
repfind – Selected options:
• -f
  - Finds forward repeats
  - Ignores:
    • Palindromes: attcaat
    • Reverse repeats: attggtta
    • Complemented repeats: attgcaat
• -b
  - Output file is in binary format to reduce size
• -allmax
  - All maximal repeats
  - Reported in order found
• -l
  - User parameter
  - Default: 30

REPuter
repselect – Implemented filters:
• Minimum (20) and maximum (300) repeat size
• Maximum percent of same character in a row
  - aaaaaaaaga = 80% a in a row (Default: 50%)
• Maximum percent of same character in the entire sequence
  - aaaaaaaaga = 90% a (Default: 80%)
• Maximum percent of two characters in the entire sequence
  - aagggaagac = 90% a/g (Default: 90%)
• Sequence must contain at least 2 of the 4 nucleotides
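The filters above can be sketched in a few lines of Python. This is only an illustration, not the package's actual repselect shared-object filter: the function names, the keyword arguments, and the choice to treat each maximum as inclusive (a sequence sitting exactly at the limit passes) are all assumptions.

```python
from collections import Counter
from itertools import groupby


def longest_run_fraction(seq):
    """Fraction of the sequence covered by its longest single-character run."""
    return max(len(list(group)) for _, group in groupby(seq)) / len(seq)


def top_fraction(seq, k):
    """Fraction of the sequence made up of its k most common characters."""
    return sum(count for _, count in Counter(seq).most_common(k)) / len(seq)


def passes_filters(seq, min_len=20, max_len=300, max_run=0.50,
                   max_single=0.80, max_pair=0.90, min_distinct=2):
    """Hypothetical stand-in for the slide's filters, using its defaults."""
    seq = seq.lower()
    if not (min_len <= len(seq) <= max_len):
        return False
    if len(set(seq)) < min_distinct:               # at least 2 of the 4 nucleotides
        return False
    if longest_run_fraction(seq) > max_run:        # same character in a row
        return False
    if top_fraction(seq, 1) > max_single:          # same character overall
        return False
    if top_fraction(seq, 2) > max_pair:            # two characters overall
        return False
    return True


# The slide's 10-base examples, with the length limit relaxed so they are tested.
print(longest_run_fraction("aaaaaaaaga"))          # 0.8
print(top_fraction("aaaaaaaaga", 1))               # 0.9
print(top_fraction("aagggaagac", 2))               # 0.9
print(passes_filters("aaaaaaaaga", min_len=1))     # False
```

The three fractions match the percentages quoted on the slide, which is a quick check that the sketch reads the filter definitions the same way.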

Design – Program Flow Chart (current step: Parse Repeats into data structure)

REPuter Results
Filter:
• Require minimum number of repeats
  - User parameter
  - Default: 20

Design – Program Flow Chart (current step: Obtain positional information and calculate distances)

Perfect Repeat Properties
• Determined from REPuter output:
  - Sequence of repeat
  - Start position of repeat
  - Length of repeat
• Calculated from REPuter output:
  - End position of repeat
  - Average distance between repeats
    • Calculated across all chromosomes
  - Orientation – plus or minus
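The two calculated properties might be derived as in the sketch below. The slides do not give the formulas, so both are assumptions: the end position is taken as inclusive (start + length - 1), and the average distance is taken as the mean gap between consecutive start positions on the same chromosome, pooled over all chromosomes. The function names are hypothetical.

```python
from collections import defaultdict


def end_position(start, length):
    """End coordinate of a repeat, assuming inclusive coordinates."""
    return start + length - 1


def average_distance(occurrences):
    """Mean gap between consecutive starts of one repeat.

    `occurrences` is a list of (chromosome, start) pairs. Gaps are measured
    within each chromosome and then pooled across chromosomes (an assumed
    reading of "calculated across all chromosomes").
    Returns None when no chromosome holds two or more copies.
    """
    starts_by_chrom = defaultdict(list)
    for chrom, start in occurrences:
        starts_by_chrom[chrom].append(start)

    gaps = []
    for starts in starts_by_chrom.values():
        starts.sort()
        gaps.extend(later - earlier for earlier, later in zip(starts, starts[1:]))
    return sum(gaps) / len(gaps) if gaps else None


copies = [("I", 100), ("I", 300), ("II", 50), ("II", 450)]
print(end_position(100, 30))     # 129
print(average_distance(copies))  # 300.0
```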

Design – Program Flow Chart (current step: Add Perfect Repeats to Database)

Design - Database Layout

perf_reps (prepid, sequence, length, avgdist)
  prepid     Unique ID of perfect repeat
  sequence   Sequence of the perfect repeat
  length     Length of the perfect repeat
  avgdist    Average distance between repeats

perf_pos (pposid, prepid, spos, epos, direction, chrom)
  pposid     Unique ID of perfect repeat position
  prepid     ID of perfect repeat (not unique)
  spos       Start position of perfect repeat
  epos       End position of perfect repeat
  direction  Orientation, plus or minus strand
  chrom      Chromosome number

Design – Program Flow Chart (current step: Consolidate Repeats to produce Maximal Repeats)

Maximal Repeats
• Maximal repeats are determined by comparing smaller repeats to the larger repeats and removing all exact subsets from the list
• Given:
  1. atggatggc
  2. ccgatggatggcatt
  3. atggatggcattaaatt
• Maximal repeat is: atggatggcattaaatt
• Comprised of 1 and 3, but not 2
• Why not 2?
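A minimal sketch of this consolidation, assuming "exact subset" means plain substring containment. The function name and the returned mapping are hypothetical, not the package's own data structure. Under this reading, sequence 2 is not absorbed into sequence 3 because its leading "ccg" does not occur in sequence 3, so it is kept as a separate entry.

```python
def consolidate(repeats):
    """Map each maximal repeat to the shorter repeats it fully contains."""
    # Longest first, so every possible host is seen before its sub-repeats.
    ordered = sorted(set(repeats), key=lambda s: (-len(s), s))
    maximal = {}
    for seq in ordered:
        hosts = [host for host in maximal if seq in host]
        if hosts:
            for host in hosts:          # exact substring: record it and drop it
                maximal[host].append(seq)
        else:
            maximal[seq] = []           # not contained anywhere: keep as maximal
    return maximal


repeats = ["atggatggc", "ccgatggatggcatt", "atggatggcattaaatt"]
result = consolidate(repeats)
print(result["atggatggcattaaatt"])               # ['atggatggc']
print("ccgatggatggcatt" in "atggatggcattaaatt")  # False
```

Checking only the current maximal entries is enough, because anything contained in an absorbed repeat is also contained in that repeat's host.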

Design – Program Flow Chart (current step: Add Maximal Repeats to Database)

Design - Database Layout

max_reps (maxid, prepid)
  maxid      Unique ID of the maximal repeat
  prepid     ID of the perfect repeat that is the maximal repeat

max_comp (maxid, prepid)
  maxid      ID of maximal repeat (not unique)
  prepid     ID of perfect repeats that are subsequences of the maximal repeat

Design – Program Flow Chart (current steps: Obtain Sequences of all Maximal Repeats from db; Compare Sequences with FASTA)

FASTA
• Pairwise comparison of all maximal repeats (FASTA 3.4)
• Filter:
  - Require minimum percent identity
    • User parameter
    • Default: 75%

Design – Program Flow Chart (current step: Cluster Sequences with BAG)

BAG
• BAG: A Graph Theoretic Sequence Clustering Algorithm (Sun Kim)
• Clustering based on 2 graph properties
  1. Biconnected components
  2. Articulation points

BAG
[Figure: example graph with nodes A through G, labeled to show a connected graph, an articulation point, and a biconnected graph]

BAG (continued)
• Z-score used to determine similarity
  - User parameter
  - Default: 400
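The two graph properties BAG relies on can be computed with a standard depth-first search. The sketch below is a textbook low-link DFS, not BAG's own code, and it stops short of the clustering step. The example graph (two triangles sharing node C) is made up for illustration and is not the graph from the slide. In a setting like this one, an edge would presumably join two sequences whose similarity score passes the cutoff.

```python
from collections import defaultdict


def articulation_points_and_bicomponents(edges):
    """Return (articulation points, biconnected components) of an undirected graph."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    disc, low = {}, {}       # discovery order and low-link value per node
    cut_vertices = set()
    components = []          # each component is a set of nodes
    edge_stack = []
    counter = 0

    def dfs(node, parent):
        nonlocal counter
        disc[node] = low[node] = counter
        counter += 1
        children = 0
        for nxt in sorted(graph[node]):
            if nxt == parent:
                continue
            if nxt not in disc:                      # tree edge
                children += 1
                edge_stack.append((node, nxt))
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                if low[nxt] >= disc[node]:           # nxt's subtree cannot bypass node
                    if parent is not None or children > 1:
                        cut_vertices.add(node)
                    component = set()
                    while True:                      # pop the edges of this component
                        a, b = edge_stack.pop()
                        component.update((a, b))
                        if (a, b) == (node, nxt):
                            break
                    components.append(component)
            elif disc[nxt] < disc[node]:             # back edge to an ancestor
                edge_stack.append((node, nxt))
                low[node] = min(low[node], disc[nxt])

    for start in sorted(graph):
        if start not in disc:
            dfs(start, None)
    return cut_vertices, components


# Hypothetical graph: two triangles that share node C.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E"), ("E", "C")]
cuts, comps = articulation_points_and_bicomponents(edges)
print(sorted(cuts))                       # ['C']
print(sorted(sorted(c) for c in comps))   # [['A', 'B', 'C'], ['C', 'D', 'E']]
```

Removing C disconnects the graph, which makes it an articulation point; each triangle stays connected after losing any single node, which makes it a biconnected component.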

Design – Program Flow Chart (current step: Add clusters to Imperfect Sequence Database)

Design - Database Layout

cluster (clusterid, size)
  clusterid  Unique ID for repeat cluster
  size       Size of cluster: number of perfect repeats that comprise the cluster

cluster_comp (clusterid, prepid)
  clusterid  ID of cluster
  prepid     ID of perfect repeats (not unique) that comprise the cluster

Design - Database Layout
perf_reps     (prepid, sequence, length, avgdist)
perf_pos      (pposid, prepid, spos, epos, direction, chrom)
max_reps      (maxid, prepid)
max_comp      (maxid, prepid)
cluster       (clusterid, size)
cluster_comp  (clusterid, prepid)
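For readers who want to experiment with this layout, the six tables can be recreated as below. The table and column names come from the slides; everything else is an assumption, including the use of SQLite as a stand-in (the slides do not name the database engine), the column types, the key constraints, and the `create_repeat_db` helper.

```python
import sqlite3

# Column types and constraints are guesses; only the names come from the slides.
SCHEMA = """
CREATE TABLE perf_reps (
    prepid    INTEGER PRIMARY KEY,
    sequence  TEXT NOT NULL,
    length    INTEGER NOT NULL,
    avgdist   REAL
);
CREATE TABLE perf_pos (
    pposid    INTEGER PRIMARY KEY,
    prepid    INTEGER NOT NULL REFERENCES perf_reps(prepid),
    spos      INTEGER NOT NULL,
    epos      INTEGER NOT NULL,
    direction TEXT NOT NULL,   -- '+' or '-' strand
    chrom     TEXT NOT NULL
);
CREATE TABLE max_reps (
    maxid     INTEGER PRIMARY KEY,
    prepid    INTEGER NOT NULL REFERENCES perf_reps(prepid)
);
CREATE TABLE max_comp (
    maxid     INTEGER NOT NULL REFERENCES max_reps(maxid),
    prepid    INTEGER NOT NULL REFERENCES perf_reps(prepid)
);
CREATE TABLE cluster (
    clusterid INTEGER PRIMARY KEY,
    size      INTEGER NOT NULL
);
CREATE TABLE cluster_comp (
    clusterid INTEGER NOT NULL REFERENCES cluster(clusterid),
    prepid    INTEGER NOT NULL REFERENCES perf_reps(prepid)
);
"""


def create_repeat_db(path=":memory:"):
    """Open a database at `path` and create the six tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_repeat_db()
cur = conn.execute(
    "INSERT INTO perf_reps (sequence, length, avgdist) VALUES (?, ?, ?)",
    ("atggatggc", 9, 200.0),
)
prepid = cur.lastrowid
conn.executemany(
    "INSERT INTO perf_pos (prepid, spos, epos, direction, chrom) VALUES (?, ?, ?, ?, ?)",
    [(prepid, 100, 108, "+", "I"), (prepid, 300, 308, "+", "I")],
)
copies = conn.execute(
    "SELECT COUNT(*) FROM perf_pos WHERE prepid = ?", (prepid,)
).fetchone()[0]
print(copies)  # 2
```

Keeping positions in perf_pos rather than in perf_reps lets one repeat sequence carry any number of occurrences, which matches the "not unique" note on prepid in the layout.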

Results
Common options for test cases:
REPuter:
• Minimum Repeat Size: 30
• Maximum Repeat Size: 300
FASTA:
• Minimum percent identity: 75%
BAG (clustering):
• Z-score cutoff: 400

Results
Computer specifications:
• Dual CPU – Intel Xeon 1.7 GHz
• 4 GB RAM
• Linux RedHat 8.0

Results

Results
Minimum number of repeats                 10        20**
# Perfect repeats, pre-consolidation      30,492    12,437
# Perfect repeats, post-consolidation
# members
# clusters

Time (m:s)*
Minimum number of repeats                 10        20**
Part 1 (repfind/repselect)                7:20      7:20
Part 2 (Parse/consolidate)                36:43     27:48
Part 3 (FASTA/cluster)                    41:56     5:42
Total                                     85:47     40:37

* Average of 2 runs
** Default

Acknowledgments
• Sun Kim
• Don Gilbert
• Scott Martin