Published by Asher Preston. Modified over 8 years ago.
1
Whole Genome Repeat Analysis Package
A Preliminary Analysis of the Caenorhabditis elegans Genome
Paul Poole
2
Introduction
Relevance of detecting repeats:
  Evolutionary significance
    Divergence of repeats between two similar or related organisms
  Potential binding sites
    Contribute to DNA packaging (PREs) and structure
    Role in gene regulation by interacting with transcription factors
  Disease (tandem repeats; trinucleotide pattern)
    Huntington’s
    Myotonic dystrophy
    Spinal and bulbar muscular atrophy
    Fragile-X mental retardation
    Alu repeats – tumors
3
Background
Commonly used sequence analysis tools:
  Suffix tree data structure
    REPuter
    MUMmer
  Heuristic tools
    FASTA
    BLAST
  Probabilistic modeling tools (Hidden Markov Models)
    HMMER
    MetaMEME
    Sequence Alignment and Modeling System (SAM)
4
Background - REPuter
repfind – perfect and degenerate repeats
  1. Build suffix tree
  2. Find instances of repeats ≥ 2
repselect
  Filters repfind results based on options
  Shared object file can be used to provide more specific filters
Current version of REPuter is limited to 67 million bases
5
Conceptual Overview
Problem:
  Heuristic tools require a query sequence to get started
    One approach might be to compare chunks of chromosomes for regions of similarity
  Probabilistic models require multiple sequences to build a model
    Could use results from above to generate a model
6
Conceptual Overview
My solution:
  1. Use REPuter to locate perfect repeats
  2. Cluster similar repeats
  3. Build a database of these clusters, which will serve as seeds for further analysis
Clusters can be used to build probabilistic models (such as Hidden Markov Models)
FASTA or BLAST can be used to compare the clusters against a genome to expand the cluster
7
Design – Program Flow Chart
  1. Collect user information
  2. Generate Perfect Repeats (REPuter)
  3. Parse Repeats into data structure
  4. Obtain positional information and calculate distances
  5. Add Perfect Repeats to Database
  6. Consolidate Repeats to produce Maximal repeats
  7. Add Maximal Repeats to Database
  8. Obtain Sequences of all Maximal repeats
  9. Compare Sequences with FASTA
  10. Cluster Sequences with BAG
  11. Add clusters to Imperfect Sequence Database
8
Design – Program Flow Chart
  1. Collect user information
  2. Generate Perfect Repeats (REPuter)
9
REPuter
repfind – Selected options:
  -f
    Finds forward repeats
    Ignores:
      Palindromes: attcaat
      Reverse repeats: attggtta
      Complemented repeats: attgcaat
  -b
    Output file is in binary format to reduce size
  -allmax
    All maximal repeats
    Reported in order found
  -l
    User parameter
    Default: 30
10
REPuter
repselect – Implemented filters:
  Minimum (20) and maximum (300) repeat size
  Maximum percent of same character in a row
    aaaaaaaaga = 80% a in a row (Default: 50%)
  Maximum percent of same character in the entire sequence
    aaaaaaaaga = 90% a (Default: 80%)
  Maximum percent of two characters in the entire sequence
    aagggaagac = 90% a/g (Default: 90%)
  Sequence must contain at least 2 of the 4 nucleotides
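The filters on this slide can be sketched in a few lines of Python. This is only an illustration, not the shared-object filter used with repselect: the function names are hypothetical, and treating each threshold as inclusive (a value exactly at the default still passes) is an assumption the slide does not settle.

```python
from collections import Counter
from itertools import groupby


def longest_run_fraction(seq: str) -> float:
    """Fraction of the sequence covered by its longest single-character run."""
    longest = max(len(list(run)) for _, run in groupby(seq))
    return longest / len(seq)


def top_k_fraction(seq: str, k: int) -> float:
    """Fraction of the sequence made up of its k most common characters."""
    return sum(n for _, n in Counter(seq).most_common(k)) / len(seq)


def passes_filters(seq, min_len=20, max_len=300,
                   max_run=0.50, max_single=0.80, max_pair=0.90):
    """Return True if seq survives all five filters.

    Assumption: thresholds are inclusive upper bounds.
    """
    if not (min_len <= len(seq) <= max_len):
        return False
    if len(set(seq)) < 2:          # must contain at least 2 distinct nucleotides
        return False
    return (longest_run_fraction(seq) <= max_run
            and top_k_fraction(seq, 1) <= max_single
            and top_k_fraction(seq, 2) <= max_pair)
```

With the slide's own examples, `longest_run_fraction("aaaaaaaaga")` gives 0.8 and `top_k_fraction("aagggaagac", 2)` gives 0.9, matching the percentages quoted above.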
11
Design – Program Flow Chart
  1. Collect user information
  2. Generate Perfect Repeats (REPuter)
  3. Parse Repeats into data structure
12
REPuter Results
Filter: Require minimum number of repeats
  User parameter
  Default: 20
13
Design – Program Flow Chart
  1. Collect user information
  2. Generate Perfect Repeats (REPuter)
  3. Parse Repeats into data structure
  4. Obtain positional information and calculate distances
14
Perfect Repeat Properties
Determined from REPuter output:
  Sequence of repeat
  Start position of repeat
  Length of repeat
Calculated from REPuter output:
  End position of repeat
  Average distance between repeats
    Calculated across all chromosomes
  Orientation – plus or minus
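The two calculated coordinates can be illustrated as follows. The slide does not say how positions are numbered or how distance is measured, so this sketch assumes 1-based inclusive coordinates and defines distance as the gap between consecutive start positions on the same chromosome, pooled over all chromosomes. Both function names are hypothetical.

```python
from statistics import mean


def end_position(start: int, length: int) -> int:
    """Inclusive end coordinate (assumes 1-based, inclusive positions)."""
    return start + length - 1


def average_distance(starts_by_chrom):
    """Mean gap between consecutive copies of one repeat.

    starts_by_chrom maps a chromosome name to the start positions of the
    repeat on that chromosome. Gaps are measured within each chromosome and
    then pooled across chromosomes (an assumed reading of the slide).
    Returns None when no chromosome holds two or more copies.
    """
    gaps = []
    for starts in starts_by_chrom.values():
        ordered = sorted(starts)
        gaps.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return mean(gaps) if gaps else None
```

For example, a 30-base repeat starting at position 100 ends at `end_position(100, 30)`, which is 129 under the inclusive convention assumed here.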
15
Design – Program Flow Chart
  1. Collect user information
  2. Generate Perfect Repeats (REPuter)
  3. Parse Repeats into data structure
  4. Obtain positional information and calculate distances
  5. Add Perfect Repeats to Database
16
Design - Database Layout
perf_reps
  prepid     Unique ID of perfect repeat
  sequence   Sequence of the perfect repeat
  length     Length of the perfect repeat
  avgdist    Average distance between repeats
perf_pos
  pposid     Unique ID of perfect repeat position
  prepid     ID of perfect repeat (not unique)
  spos       Start position of perfect repeat
  epos       End position of perfect repeat
  direction  Orientation, plus or minus strand
  chrom      Chromosome number
17
Design – Program Flow Chart
  1. Collect user information
  2. Generate Perfect Repeats (REPuter)
  3. Parse Repeats into data structure
  4. Obtain positional information and calculate distances
  5. Add Perfect Repeats to Database
  6. Consolidate Repeats to produce Maximal repeats
18
Maximal Repeats
Maximal repeats are determined by comparing smaller repeats to the larger repeats and removing all exact subsets from the list
Given:
  1. atggatggc
  2. ccgatggatggcatt
  3. atggatggcattaaatt
Maximal repeat is: atggatggcattaaatt
  Comprised of 1 and 3, but not 2
  Why not 2?
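One way to read the consolidation rule is as exact-substring elimination: a repeat is dropped if it occurs verbatim inside a longer repeat. The sketch below follows that reading; the function name and the returned mapping are hypothetical, and the package's actual consolidation step may differ. Under this reading, sequence 2 is not part of sequence 3 because it is not an exact substring of it (it begins with "ccg"), so it is not absorbed and stays in the list as its own entry.

```python
def consolidate(repeats):
    """Map each maximal repeat to the input repeats it contains.

    A repeat is maximal here if it is not an exact substring of any other
    repeat in the list. Each maximal repeat is listed among its own
    components, longest first.
    """
    unique = sorted(set(repeats), key=len, reverse=True)
    maximal = [r for r in unique
               if not any(r != other and r in other for other in unique)]
    return {m: [r for r in unique if r in m] for m in maximal}


seqs = ["atggatggc", "ccgatggatggcatt", "atggatggcattaaatt"]
groups = consolidate(seqs)
# groups["atggatggcattaaatt"] holds sequences 3 and 1, but not sequence 2.
```

The pairwise substring test is quadratic in the number of repeats, which is acceptable for a sketch but would likely need indexing for tens of thousands of repeats.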
19
Design – Program Flow Chart
  1. Collect user information
  2. Generate Perfect Repeats (REPuter)
  3. Parse Repeats into data structure
  4. Obtain positional information and calculate distances
  5. Add Perfect Repeats to Database
  6. Consolidate Repeats to produce Maximal repeats
  7. Add Maximal Repeats to Database
20
Design - Database Layout
max_reps
  maxid   Unique ID of the maximal repeat
  prepid  ID of the perfect repeat that is the maximal repeat
max_comp
  maxid   ID of maximal repeat (not unique)
  prepid  ID of perfect repeats that are subsequences of the maximal repeat
21
Design – Program Flow Chart
  1. Collect user information
  2. Generate Perfect Repeats (REPuter)
  3. Parse Repeats into data structure
  4. Obtain positional information and calculate distances
  5. Add Perfect Repeats to Database
  6. Consolidate Repeats to produce Maximal repeats
  7. Add Maximal Repeats to Database
  8. Obtain Sequences of all Maximal Repeats from db
  9. Compare Sequences with FASTA
22
FASTA
Pairwise comparison of all maximal repeats (FASTA 3.4)
Filter: Require minimum percent identity
  User parameter
  Default: 75%
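The identity filter can be pictured with a simplified stand-in. FASTA computes and reports its own percent identity; the helper below is not that calculation, just an assumed definition (matching columns divided by alignment length) used to show how a 75% cutoff would prune pairwise hits. The function names and the hit-tuple layout are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both sequences carry the same base.

    Both inputs must be the same length; '-' marks a gap and never matches.
    """
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equal length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def filter_hits(hits, min_identity=75.0):
    """Keep (id_a, id_b, identity) tuples at or above the cutoff."""
    return [hit for hit in hits if hit[2] >= min_identity]


hits = [("rep1", "rep2", percent_identity("acgtacgt", "acgtaggt")),
        ("rep1", "rep3", percent_identity("acgtacgt", "ttgtaggc"))]
kept = filter_hits(hits)   # only the rep1/rep2 pair survives the 75% cutoff
```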
23
Design – Program Flow Chart
  1. Collect user information
  2. Generate Perfect Repeats (REPuter)
  3. Parse Repeats into data structure
  4. Obtain positional information and calculate distances
  5. Add Perfect Repeats to Database
  6. Consolidate Repeats to produce Maximal repeats
  7. Add Maximal Repeats to Database
  8. Obtain Sequences of all Maximal Repeats from db
  9. Compare Sequences with FASTA
  10. Cluster Sequences with BAG
24
BAG
BAG: A Graph Theoretic Sequence Clustering Algorithm (Sun Kim)
Clustering based on 2 graph properties
  1. Biconnected components
  2. Articulation points
25
BAG
[Figure: example graph with nodes A, B, C, F, D, E, G, labeled to show a connected graph, an articulation point, and a biconnected graph]
26
BAG
BAG: A Graph Theoretic Sequence Clustering Algorithm (Sun Kim)
Clustering based on 2 graph properties
  1. Biconnected components
  2. Articulation points
Z-score used to determine similarity
  User parameter
  Default: 400
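The two graph properties named above can be computed with a standard depth-first search. The sketch below is the textbook algorithm, not BAG itself, and it says nothing about how BAG combines these properties into clusters. The example graph is hypothetical: it reuses the node names A through G, but its edges are invented for illustration.

```python
def biconnected(graph):
    """Return (articulation_points, biconnected_components).

    graph maps each node to its neighbours; the graph must be simple and
    undirected, with every edge listed in both directions. Components are
    returned as sets of nodes.
    """
    depth, low = {}, {}
    cut_points, components, edge_stack = set(), [], []

    def dfs(node, parent, d):
        depth[node] = low[node] = d
        children = 0
        for nbr in graph[node]:
            if nbr == parent:
                continue
            if nbr not in depth:                     # tree edge
                children += 1
                edge_stack.append((node, nbr))
                dfs(nbr, node, d + 1)
                low[node] = min(low[node], low[nbr])
                if low[nbr] >= depth[node]:          # subtree cannot bypass node
                    if parent is not None or children > 1:
                        cut_points.add(node)
                    comp = set()
                    while True:                      # unwind one component
                        a, b = edge_stack.pop()
                        comp.update((a, b))
                        if (a, b) == (node, nbr):
                            break
                    components.append(comp)
            elif depth[nbr] < depth[node]:           # back edge to an ancestor
                edge_stack.append((node, nbr))
                low[node] = min(low[node], depth[nbr])

    for start in graph:
        if start not in depth:
            dfs(start, None, 0)
    return cut_points, components


# Hypothetical graph: triangle A-B-C, bridge C-D, triangle D-E-F, bridge F-G.
example = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"],
    "D": ["C", "E", "F"], "E": ["D", "F"], "F": ["D", "E", "G"], "G": ["F"],
}
cuts, comps = biconnected(example)
```

In this example the articulation points are C, D and F: removing any one of them disconnects the graph, while each triangle remains connected if a single node is removed. The recursion depth grows with the size of the graph, so a large similarity graph would need an iterative version.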
27
Design – Program Flow Chart
  1. Collect user information
  2. Generate Perfect Repeats (REPuter)
  3. Parse Repeats into data structure
  4. Obtain positional information and calculate distances
  5. Add Perfect Repeats to Database
  6. Consolidate Repeats to produce Maximal repeats
  7. Add Maximal Repeats to Database
  8. Obtain Sequences of all Maximal Repeats from db
  9. Compare Sequences with FASTA
  10. Cluster Sequences with BAG
  11. Add clusters to Imperfect Sequence Database
28
Design - Database Layout
cluster
  clusterid  Unique ID for repeat cluster
  size       Size of cluster - number of perfect repeats that comprise the cluster
cluster_comp
  clusterid  ID of cluster
  prepid     ID of perfect repeats (not unique) that comprise the cluster
29
Design - Database Layout
  perf_reps:     prepid, sequence, length, avgdist
  perf_pos:      pposid, prepid, spos, epos, direction, chrom
  max_reps:      maxid, prepid
  max_comp:      maxid, prepid
  cluster:       clusterid, size
  cluster_comp:  clusterid, prepid
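The six tables could be declared as in the sketch below. The slides name the tables and columns but not the database system, column types, or keys, so the use of SQLite (through Python's standard `sqlite3` module), the types, and the primary and foreign keys are all assumptions. `create_database` is a hypothetical helper.

```python
import sqlite3

# Table and column names come from the slides; types and keys are assumed.
SCHEMA = """
CREATE TABLE perf_reps (
    prepid   INTEGER PRIMARY KEY,
    sequence TEXT    NOT NULL,
    length   INTEGER NOT NULL,
    avgdist  REAL
);
CREATE TABLE perf_pos (
    pposid    INTEGER PRIMARY KEY,
    prepid    INTEGER NOT NULL REFERENCES perf_reps(prepid),
    spos      INTEGER NOT NULL,
    epos      INTEGER NOT NULL,
    direction TEXT,
    chrom     TEXT
);
CREATE TABLE max_reps (
    maxid  INTEGER PRIMARY KEY,
    prepid INTEGER NOT NULL REFERENCES perf_reps(prepid)
);
CREATE TABLE max_comp (
    maxid  INTEGER NOT NULL REFERENCES max_reps(maxid),
    prepid INTEGER NOT NULL REFERENCES perf_reps(prepid)
);
CREATE TABLE cluster (
    clusterid INTEGER PRIMARY KEY,
    size      INTEGER NOT NULL
);
CREATE TABLE cluster_comp (
    clusterid INTEGER NOT NULL REFERENCES cluster(clusterid),
    prepid    INTEGER NOT NULL REFERENCES perf_reps(prepid)
);
"""


def create_database(path=":memory:"):
    """Create the six tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
conn.execute("INSERT INTO perf_reps VALUES (1, 'atggatggc', 9, NULL)")
conn.execute("INSERT INTO perf_pos VALUES (1, 1, 100, 108, '+', 'I')")
# All positions of one perfect repeat, joined through prepid:
rows = conn.execute(
    "SELECT r.sequence, p.chrom, p.spos, p.epos "
    "FROM perf_reps r JOIN perf_pos p ON p.prepid = r.prepid"
).fetchall()
```

The layout keeps one row per distinct sequence in perf_reps and one row per occurrence in perf_pos, which is why prepid is unique in the first table and repeated in the others.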
30
Results
Common options for test cases:
  REPuter:
    Minimum repeat size: 30
    Maximum repeat size: 300
  FASTA:
    Minimum percent identity: 75%
  BAG (clustering):
    Z-score cutoff: 400
31
Results
Computer specifications:
  Dual CPU – Intel Xeon 1.7 GHz
  4 GB RAM
  Linux RedHat 8.0
32
Results

  Results                                   Minimum number of repeats
                                            10         20**
  # Perfect repeats, pre-consolidation      30,492     12,437
  # Perfect repeats, post-consolidation     6955       2697
  # members                                 2282       782
  # clusters                                848        312

  Time (m:s)*                               Minimum number of repeats
                                            10         20**
  Part 1 (repfind/repselect)                7:20       7:20
  Part 2 (Parse/consolidate)                36:43      27:48
  Part 3 (FASTA/cluster)                    41:56      5:42
  Total                                     85:47      40:37

  * Average of 2 runs
  ** Default
33
Acknowledgments
  Sun Kim
  Don Gilbert
  Scott Martin