Whole Genome Repeat Analysis Package: A Preliminary Analysis of the Caenorhabditis elegans Genome
Paul Poole
Introduction
Relevance of detecting repeats:
- Evolutionary significance
  - Divergence of repeats between two similar or related organisms
- Potential binding sites
  - Contribute to DNA packaging (PREs) and structure
  - Role in gene regulation by interacting with transcription factors
- Disease (tandem repeats; trinucleotide pattern)
  - Huntington’s
  - Myotonic dystrophy
  - Spinal and bulbar muscular atrophy
  - Fragile-X mental retardation
  - Alu repeats – tumors
Background
Commonly used sequence analysis tools:
- Suffix tree data structure
  - REPuter
  - MUMmer
- Heuristic tools
  - FASTA
  - BLAST
- Probabilistic modeling tools (Hidden Markov Models)
  - HMMER
  - MetaMEME
  - Sequence Alignment and Modeling System (SAM)
Background - REPuter
- repfind – perfect and degenerate repeats
  1. Build suffix tree
  2. Find instances of repeats ≥ 2
- repselect
  - Filters repfind results based on options
  - Shared object file can be used to provide more specific filters
- Current version of REPuter is limited to 67 million bases
Conceptual Overview
Problem:
- Heuristic tools require a query sequence to get started
  - One approach might be to compare chunks of chromosomes for regions of similarity
- Probabilistic models require multiple sequences to build a model
  - Could use results from above to generate a model
Conceptual Overview
My solution:
1. Use REPuter to locate perfect repeats
2. Cluster similar repeats
3. Build a database of these clusters, which will serve as seeds for further analysis
   - Clusters can be used to build probabilistic models (such as Hidden Markov Models)
   - FASTA or BLAST can be used to compare the clusters against a genome to expand the cluster
Design – Program Flow Chart
1. Collect user information
2. Generate Perfect Repeats (REPuter)
3. Parse Repeats into data structure
4. Obtain positional information and calculate distances
5. Add Perfect Repeats to Database
6. Consolidate Repeats to produce Maximal Repeats
7. Add Maximal Repeats to Database
8. Obtain Sequences of all Maximal Repeats
9. Compare Sequences with FASTA
10. Cluster Sequences with BAG
11. Add clusters to Imperfect Sequence Database
Design – Program Flow Chart (steps shown through: Generate Perfect Repeats (REPuter))
REPuter
repfind – Selected options:
- -f
  - Finds forward repeats
  - Ignores:
    - Palindromes: attcaat
    - Reverse repeats: attggtta
    - Complemented repeats: attgcaat
- -b
  - Output file is in binary format to reduce size
- -allmax
  - All maximal repeats
  - Reported in order found
- -l
  - User parameter
  - Default: 30
REPuter
repselect – Implemented filters:
- Minimum (20) and maximum (300) repeat size
- Maximum percent of same character in a row
  - aaaaaaaaga = 80% a in a row (Default: 50%)
- Maximum percent of same character in the entire sequence
  - aaaaaaaaga = 90% a (Default: 80%)
- Maximum percent of two characters in the entire sequence
  - aagggaagac = 90% a/g (Default: 90%)
- Sequence must contain at least 2 of the 4 nucleotides
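The filters listed above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the shared-object filter used with repselect: the function names, the keyword parameters, and the choice to treat each threshold as an inclusive upper bound are all assumptions.

```python
from collections import Counter
from itertools import groupby


def longest_run_fraction(seq):
    """Fraction of the sequence covered by its longest single-character run."""
    longest = max(len(list(group)) for _, group in groupby(seq))
    return longest / len(seq)


def top_k_fraction(seq, k):
    """Fraction of the sequence made up of its k most common characters."""
    return sum(n for _, n in Counter(seq).most_common(k)) / len(seq)


def passes_filters(seq, min_len=20, max_len=300,
                   max_run=0.50, max_single=0.80, max_pair=0.90):
    """Apply the slide's filters; defaults mirror the values shown there.

    Assumption: a sequence sitting exactly on a threshold is kept.
    """
    seq = seq.lower()
    if not (min_len <= len(seq) <= max_len):
        return False
    if len(set(seq)) < 2:  # must contain at least 2 of the 4 nucleotides
        return False
    if longest_run_fraction(seq) > max_run:
        return False
    if top_k_fraction(seq, 1) > max_single:
        return False
    if top_k_fraction(seq, 2) > max_pair:
        return False
    return True
```

For the slide's own examples, `longest_run_fraction("aaaaaaaaga")` gives 0.8 and `top_k_fraction("aagggaagac", 2)` gives 0.9, matching the percentages quoted above.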
Design – Program Flow Chart (steps shown through: Parse Repeats into data structure)
REPuter Results
Filter: Require minimum number of repeats
- User parameter
- Default: 20
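One way to read this filter is "keep a repeat only if its sequence occurs at least N times". The sketch below follows that reading; the input layout (a list of `(sequence, start)` pairs) and the function name are hypothetical and are not taken from the package.

```python
from collections import defaultdict


def filter_by_copy_number(occurrences, min_copies=20):
    """Group occurrences by sequence and keep sequences seen often enough.

    occurrences: iterable of (sequence, start_position) pairs (assumed layout).
    Returns a dict mapping each kept sequence to its sorted start positions.
    """
    starts_by_seq = defaultdict(list)
    for seq, start in occurrences:
        starts_by_seq[seq].append(start)
    return {seq: sorted(starts)
            for seq, starts in starts_by_seq.items()
            if len(starts) >= min_copies}
```

The default of 20 mirrors the slide; the results slide also reports a run with the threshold lowered to 10.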
Design – Program Flow Chart (steps shown through: Obtain positional information and calculate distances)
Perfect Repeat Properties
Determined from REPuter output:
- Sequence of repeat
- Start position of repeat
- Length of repeat
Calculated from REPuter output:
- End position of repeat
- Average distance between repeats
  - Calculated across all chromosomes
- Orientation – plus or minus
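The two calculated properties can be illustrated as follows. The slide does not define them precisely, so this sketch assumes inclusive coordinates (end = start + length - 1) and takes "distance" to mean the gap between consecutive start positions on the same chromosome, pooled over all chromosomes before averaging. Both function names are hypothetical.

```python
def end_position(start, length):
    """End coordinate of a repeat, assuming inclusive coordinates."""
    return start + length - 1


def average_distance(starts_by_chrom):
    """Mean gap between consecutive copies, pooled across chromosomes.

    starts_by_chrom: dict mapping chromosome name -> list of start positions.
    Returns None when no chromosome holds two or more copies.
    """
    gaps = []
    for starts in starts_by_chrom.values():
        ordered = sorted(starts)
        # Gap between each copy and the next one on the same chromosome.
        gaps.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return sum(gaps) / len(gaps) if gaps else None
```

For example, starts of 100, 400 and 1000 on one chromosome and 50 and 250 on another give gaps of 300, 600 and 200, which are averaged together.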
Design – Program Flow Chart (steps shown through: Add Perfect Repeats to Database)
Design - Database Layout
perf_reps: prepid, sequence, length, avgdist
  prepid     Unique ID of perfect repeat
  sequence   Sequence of the perfect repeat
  length     Length of the perfect repeat
  avgdist    Average distance between repeats
perf_pos: pposid, prepid, spos, epos, direction, chrom
  pposid     Unique ID of perfect repeat position
  prepid     ID of perfect repeat (not unique)
  spos       Start position of perfect repeat
  epos       End position of perfect repeat
  direction  Orientation, plus or minus strand
  chrom      Chromosome number
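As an illustration, the two tables above could be declared like this. The slides do not say which database engine the package uses, so SQLite (through Python's standard `sqlite3` module) is an assumption made purely for the sketch, as are the column types, the foreign-key link, and the helper names `create_db` and `add_perfect_repeat`.

```python
import sqlite3

# Table and column names follow the slide; types and constraints are assumed.
SCHEMA = """
CREATE TABLE perf_reps (
    prepid   INTEGER PRIMARY KEY,
    sequence TEXT    NOT NULL,
    length   INTEGER NOT NULL,
    avgdist  REAL
);
CREATE TABLE perf_pos (
    pposid    INTEGER PRIMARY KEY,
    prepid    INTEGER NOT NULL REFERENCES perf_reps(prepid),
    spos      INTEGER NOT NULL,
    epos      INTEGER NOT NULL,
    direction TEXT    NOT NULL,
    chrom     TEXT    NOT NULL
);
"""


def create_db(path=":memory:"):
    """Open a database and create both perfect-repeat tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_perfect_repeat(conn, sequence, avgdist, positions):
    """Insert one repeat and its occurrences; returns the new prepid.

    positions: list of (spos, direction, chrom) tuples (assumed layout).
    epos is derived assuming inclusive coordinates.
    """
    cur = conn.execute(
        "INSERT INTO perf_reps (sequence, length, avgdist) VALUES (?, ?, ?)",
        (sequence, len(sequence), avgdist),
    )
    prepid = cur.lastrowid
    conn.executemany(
        "INSERT INTO perf_pos (prepid, spos, epos, direction, chrom) "
        "VALUES (?, ?, ?, ?, ?)",
        [(prepid, spos, spos + len(sequence) - 1, direction, chrom)
         for spos, direction, chrom in positions],
    )
    conn.commit()
    return prepid
```

Keeping one row per repeat in perf_reps and one row per occurrence in perf_pos is what makes prepid unique in the first table and non-unique in the second.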
Design – Program Flow Chart (steps shown through: Consolidate Repeats to produce Maximal Repeats)
Maximal Repeats
Maximal repeats are determined by comparing smaller repeats to the larger repeats and removing all exact subsets (substrings) from the list.
Given:
1. atggatggc
2. ccgatggatggcatt
3. atggatggcattaaatt
Maximal repeat is: atggatggcattaaatt
- Comprised of 1 and 3, but not 2
- Why not 2? Sequence 2 begins with ccg, which does not occur in 3, so 2 is not an exact substring of 3.
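The consolidation rule can be sketched as a substring test run from the longest repeat to the shortest. This is a minimal illustration, not the package's code: the function name `consolidate` and the returned layout (maximal repeat mapped to the repeats it contains) are assumptions. Under this reading, sequence 2 from the example is kept as a maximal repeat of its own, because it is not a substring of any longer repeat. The slides do not say how the original package treats that case.

```python
def consolidate(repeats):
    """Map each maximal repeat to the repeats that are substrings of it.

    A repeat is treated as maximal when it is not an exact substring of any
    longer repeat. Each maximal repeat lists itself first.
    """
    # Longest first, so every possible host is seen before its substrings.
    ordered = sorted(set(repeats), key=lambda s: (-len(s), s))
    maximal = {}
    for seq in ordered:
        hosts = [m for m in maximal if seq in m]
        if hosts:
            for m in hosts:
                maximal[m].append(seq)
        else:
            maximal[seq] = [seq]
    return maximal
```

Applied to the three sequences above, the entry for atggatggcattaaatt contains sequences 3 and 1 but not 2, which matches the slide.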
Design – Program Flow Chart (steps shown through: Add Maximal Repeats to Database)
Design - Database Layout
max_reps: maxid, prepid
  maxid   Unique ID of the maximal repeat
  prepid  ID of the perfect repeat that is the maximal repeat
max_comp: maxid, prepid
  maxid   ID of maximal repeat (not unique)
  prepid  ID of perfect repeats that are subsequences of the maximal repeat
Design – Program Flow Chart (steps shown through: Compare Sequences with FASTA)
FASTA
- Pairwise comparison of all maximal repeats (FASTA 3.4)
- Filter: Require minimum percent identity
  - User parameter
  - Default: 75%
Design – Program Flow Chart (steps shown through: Cluster Sequences with BAG)
BAG
BAG: A Graph Theoretic Sequence Clustering Algorithm (Sun Kim)
Clustering based on 2 graph properties:
1. Biconnected components
2. Articulation points
BAG
[Figure: example graph on nodes A–G illustrating a connected graph, an articulation point, and a biconnected graph]
BAG (continued)
- Z-score used to determine similarity
  - User parameter
  - Default: 400
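The two graph properties named on the BAG slides can be computed with a standard depth-first search (the Hopcroft-Tarjan approach). The sketch below shows only that computation and is not the BAG program itself. It assumes a simple undirected graph stored as a dict of neighbour sets, in which an edge would join two sequences whose similarity score passes the chosen cutoff. The function name `biconnected` and the example graph are hypothetical.

```python
def biconnected(graph):
    """Return (articulation_points, components) of a simple undirected graph.

    graph: dict mapping node -> set of neighbouring nodes.
    components: list of node sets, one per biconnected component.
    Isolated nodes have no edges and so appear in no component.
    Uses recursion, so it is meant for small illustrative graphs.
    """
    disc, low = {}, {}          # discovery order and low-link value per node
    articulation, components = set(), []
    edge_stack = []
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v not in disc:                    # tree edge
                children += 1
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:            # v's subtree cannot bypass u
                    if parent is not None or children > 1:
                        articulation.add(u)
                    comp = set()
                    while True:                  # pop one whole component
                        a, b = edge_stack.pop()
                        comp.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(comp)
            elif disc[v] < disc[u]:              # back edge to an ancestor
                edge_stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return articulation, components


# Hypothetical example: two triangles sharing node C, plus a pendant edge E-F.
EXAMPLE = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D", "E"},
    "D": {"C", "E"}, "E": {"C", "D", "F"}, "F": {"E"},
}
```

In the example graph, removing C or E disconnects the graph, so those two nodes are the articulation points, and the components are the two triangles and the single edge E-F.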
Design – Program Flow Chart (steps shown through: Add clusters to Imperfect Sequence Database)
Design - Database Layout
cluster: clusterid, size
  clusterid  Unique ID for repeat cluster
  size       Size of cluster - number of perfect repeats that comprise the cluster
cluster_comp: clusterid, prepid
  clusterid  ID of cluster
  prepid     ID of perfect repeats (not unique) that comprise the cluster
Design - Database Layout
perf_reps:    prepid, sequence, length, avgdist
perf_pos:     pposid, prepid, spos, epos, direction, chrom
max_reps:     maxid, prepid
max_comp:     maxid, prepid
cluster:      clusterid, size
cluster_comp: clusterid, prepid
Results
Common options for test cases:
- REPuter:
  - Minimum Repeat Size: 30
  - Maximum Repeat Size: 300
- FASTA:
  - Minimum percent identity: 75%
- BAG (clustering):
  - Z-score cutoff: 400
Results
Computer specifications:
- Dual CPU – Intel Xeon 1.7 GHz
- 4 GB RAM
- Linux RedHat 8.0
Results

Minimum number of repeats               10         20**
# Perfect repeats, pre-consolidation    30,492     12,437
# Perfect repeats, post-consolidation   [missing]  [missing]
# members                               [missing]  [missing]
# clusters                              [missing]  [missing]

Time (m:s)*
Minimum number of repeats               10         20**
Part 1 (repfind/repselect)              7:20       7:20
Part 2 (Parse/consolidate)              36:43      27:48
Part 3 (FASTA/cluster)                  41:56      5:42
Total                                   85:47      40:37

* Average of 2 runs
** Default
Acknowledgments
- Sun Kim
- Don Gilbert
- Scott Martin