Whole Genome Repeat Analysis Package: A Preliminary Analysis of the Caenorhabditis elegans Genome
Paul Poole

Introduction
Relevance of detecting repeats:
- Evolutionary significance
  - Divergence of repeats between two similar or related organisms
- Potential binding sites
  - Contribute to DNA packaging (PREs) and structure
  - Role in gene regulation by interacting with transcription factors
- Disease (tandem repeats; trinucleotide pattern)
  - Huntington's
  - Myotonic dystrophy
  - Spinal and bulbar muscular atrophy
  - Fragile-X mental retardation
  - Alu repeats (tumors)

Background
Commonly used sequence analysis tools:
- Suffix tree data structure
  - REPuter
  - MUMmer
- Heuristic tools
  - FASTA
  - BLAST
- Probabilistic modeling tools (Hidden Markov Models)
  - HMMER
  - MetaMEME
  - Sequence Alignment and Modeling System (SAM)

Background - REPuter
- repfind: perfect and degenerate repeats
  1. Build suffix tree
  2. Find repeats, i.e. substrings with ≥ 2 instances
- repselect
  - Filters repfind results based on options
  - Shared object file can be used to provide more specific filters
- Current version of REPuter is limited to 67 million bases
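
The two repfind steps can be illustrated at toy scale. The sketch below is not REPuter's implementation: it swaps the suffix tree for a sorted list of suffixes, which is only practical for short strings, and it reports the longest common prefix of each pair of neighbouring suffixes rather than every repeat. The name `repeated_substrings` and the parameter `min_len` are illustrative.

```python
def repeated_substrings(seq, min_len):
    """Map each repeated substring to its 0-based start positions.

    Suffixes are sorted, then every pair of neighbours in sorted order
    is compared; a shared prefix of at least min_len characters is a
    substring that occurs at least twice.
    """
    # Step 1: stand-in for the suffix tree, a sorted list of suffix starts.
    suffixes = sorted(range(len(seq)), key=lambda i: seq[i:])

    # Step 2: neighbouring suffixes that share a long prefix are repeats.
    repeats = {}
    for a, b in zip(suffixes, suffixes[1:]):
        n = 0
        while a + n < len(seq) and b + n < len(seq) and seq[a + n] == seq[b + n]:
            n += 1
        if n >= min_len:
            repeats.setdefault(seq[a:a + n], set()).update((a, b))
    return {rep: sorted(pos) for rep, pos in repeats.items()}


result = repeated_substrings("atggcatatggcttatggc", min_len=5)
print(result)
```

For this toy sequence the function finds `atggc` at positions 0, 7 and 14, and `tatggc` at positions 6 and 13.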

Conceptual Overview
Problem:
- Heuristic tools require a query sequence to get started
  - One approach might be to compare chunks of chromosomes for regions of similarity
- Probabilistic models require multiple sequences to build a model
  - Could use results from above to generate a model

Conceptual Overview
My solution:
1. Use REPuter to locate perfect repeats
2. Cluster similar repeats
3. Build a database of these clusters, which will serve as seeds for further analysis
   - Clusters can be used to build probabilistic models (such as Hidden Markov Models)
   - FASTA or BLAST can be used to compare the clusters against a genome to expand the cluster

Design – Program Flow Chart
1. Collect user information
2. Generate Perfect Repeats (REPuter)
3. Parse Repeats into data structure
4. Obtain positional information and calculate distances
5. Add Perfect Repeats to Database
6. Consolidate Repeats to produce Maximal Repeats
7. Add Maximal Repeats to Database
8. Obtain Sequences of all Maximal Repeats
9. Compare Sequences with FASTA
10. Cluster Sequences with BAG
11. Add clusters to Imperfect Sequence Database

Design – Program Flow Chart, built up through step: Generate Perfect Repeats (REPuter)

REPuter repfind – Selected options:
- -f
  - Finds forward repeats
  - Ignores:
    - Palindromes: attcaat
    - Reverse repeats: attggtta
    - Complemented repeats: attgcaat
- -b
  - Output file is in binary format to reduce size
- -allmax
  - All maximal repeats
  - Reported in order found
- -l
  - User parameter
  - Default: 30
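
The repeat types that -f ignores differ in how the second copy relates to the first. A minimal sketch of the three transformations, assuming "palindrome" here means a reverse-complemented copy (which matches the att ... aat example on the slide); the helper names are illustrative and not part of REPuter.

```python
# Lower-case DNA alphabet, as used in the slide examples.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse(seq):
    """Second copy reads backwards: a reverse repeat."""
    return seq[::-1]


def complement(seq):
    """Second copy has each base swapped for its partner: a complemented repeat."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Second copy is the opposite strand read 5' to 3': a palindromic repeat."""
    return complement(seq)[::-1]


unit = "att"
print(unit, reverse(unit), complement(unit), reverse_complement(unit))
# prints: att tta taa aat
```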

REPuter repselect – Implemented filters:
- Minimum (20) and maximum (300) repeat size
- Maximum percent of same character in a row
  - aaaaaaaaga = 80% a in a row (Default: 50%)
- Maximum percent of same character in the entire sequence
  - aaaaaaaaga = 90% a (Default: 80%)
- Maximum percent of two characters in the entire sequence
  - aagggaagac = 90% a/g (Default: 90%)
- Sequence must contain at least 2 of the 4 nucleotides
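
The filters above can be sketched in a few lines. This is not the shared-object filter used with repselect; the function names are illustrative, and it is an assumption that a sequence is rejected only when it exceeds a maximum (so a value equal to the limit passes).

```python
from collections import Counter
from itertools import groupby


def longest_run_fraction(seq):
    """Fraction of the sequence covered by its longest single-character run."""
    return max(len(list(group)) for _, group in groupby(seq)) / len(seq)


def top_fraction(seq, n):
    """Fraction of the sequence made up of its n most common characters."""
    return sum(count for _, count in Counter(seq).most_common(n)) / len(seq)


def passes_filters(seq, min_len=20, max_len=300,
                   max_run=0.50, max_single=0.80, max_pair=0.90):
    """Apply the size and low-complexity filters with the slide defaults."""
    if not min_len <= len(seq) <= max_len:
        return False
    if len(set(seq)) < 2:  # must contain at least 2 of the 4 nucleotides
        return False
    return (longest_run_fraction(seq) <= max_run
            and top_fraction(seq, 1) <= max_single
            and top_fraction(seq, 2) <= max_pair)


print(longest_run_fraction("aaaaaaaaga"))  # 0.8
print(top_fraction("aaaaaaaaga", 1))       # 0.9
print(top_fraction("aagggaagac", 2))       # 0.9
```

The three printed values reproduce the 80%, 90% and 90% figures from the slide examples.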

Design – Program Flow Chart, built up through step: Parse Repeats into data structure

REPuter Results Filter:
- Require minimum number of repeats
  - User parameter
  - Default: 20

Design – Program Flow Chart, built up through step: Obtain positional information and calculate distances

Perfect Repeat Properties
- Determined from REPuter output:
  - Sequence of repeat
  - Start position of repeat
  - Length of repeat
- Calculated from REPuter output:
  - End position of repeat
  - Average distance between repeats
    - Calculated across all chromosomes
  - Orientation: plus or minus
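
The calculated properties can be sketched as follows. The slides do not give the formulas, so two assumptions are made here: coordinates are 1-based and inclusive, and the average distance is the mean gap between consecutive start positions on the same chromosome, pooled over all chromosomes. Function names and the input layout are illustrative.

```python
def end_position(start, length):
    """End coordinate of a repeat, assuming 1-based inclusive positions."""
    return start + length - 1


def average_distance(starts_by_chrom):
    """Mean gap between consecutive copies, pooled across chromosomes.

    starts_by_chrom maps a chromosome name to the start positions of one
    perfect repeat on that chromosome. Returns None when no chromosome
    carries two or more copies.
    """
    gaps = []
    for starts in starts_by_chrom.values():
        ordered = sorted(starts)
        gaps.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return sum(gaps) / len(gaps) if gaps else None


print(end_position(100, 30))                                       # 129
print(average_distance({"I": [100, 400, 1000], "II": [50, 350]}))  # 400.0
```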

Design – Program Flow Chart, built up through step: Add Perfect Repeats to Database

Design - Database Layout
perf_reps (prepid, sequence, length, avgdist)
  prepid     Unique ID of perfect repeat
  sequence   Sequence of the perfect repeat
  length     Length of the perfect repeat
  avgdist    Average distance between repeats
perf_pos (pposid, prepid, spos, epos, direction, chrom)
  pposid     Unique ID of perfect repeat position
  prepid     ID of perfect repeat (not unique)
  spos       Start position of perfect repeat
  epos       End position of perfect repeat
  direction  Orientation, plus or minus strand
  chrom      Chromosome number

Design – Program Flow Chart, built up through step: Consolidate Repeats to produce Maximal Repeats

Maximal Repeats
- Maximal repeats are determined by comparing smaller repeats to the larger repeats and removing all exact subsets from the list
- Given:
  1. atggatggc
  2. ccgatggatggcatt
  3. atggatggcattaaatt
- Maximal repeat is: atggatggcattaaatt
- Comprised of 1 and 3, but not 2
- Why not 2? It begins with ccg, which does not occur in 3, so it is not an exact subset of 3
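
A minimal sketch of the consolidation step, assuming "exact subset" means exact substring and that a repeat contained in several larger repeats is assigned to the first one found. The function name `consolidate` is illustrative.

```python
def consolidate(repeats):
    """Group perfect repeats under the maximal repeat that contains them.

    Returns a dict mapping each maximal repeat to the list of repeats it
    is comprised of (the maximal repeat itself comes first).
    """
    maximal = {}
    # Longest first, so every possible container is seen before its subsets.
    for seq in sorted(set(repeats), key=len, reverse=True):
        for big in maximal:
            if seq in big:          # exact substring of a larger repeat
                maximal[big].append(seq)
                break
        else:                       # not contained in anything larger
            maximal[seq] = [seq]
    return maximal


groups = consolidate(["atggatggc", "ccgatggatggcatt", "atggatggcattaaatt"])
for big, parts in groups.items():
    print(big, parts)
```

With the three example sequences, atggatggcattaaatt absorbs atggatggc, while ccgatggatggcatt stays a separate entry under this sketch because it is not a substring of the longer repeat.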

Design – Program Flow Chart, built up through step: Add Maximal Repeats to Database

Design - Database Layout
max_reps (maxid, prepid)
  maxid      Unique ID of the maximal repeat
  prepid     ID of the perfect repeat that is the maximal repeat
max_comp (maxid, prepid)
  maxid      ID of maximal repeat (not unique)
  prepid     ID of perfect repeats that are subsequences of the maximal repeat

Design – Program Flow Chart, built up through steps: Obtain Sequences of all Maximal Repeats from db; Compare Sequences with FASTA

FASTA
- Pairwise comparison of all maximal repeats (FASTA 3.4)
- Filter:
  - Require minimum percent identity
    - User parameter
    - Default: 75%

Design – Program Flow Chart, built up through step: Cluster Sequences with BAG

BAG
- BAG: A Graph Theoretic Sequence Clustering Algorithm (Sun Kim)
- Clustering based on 2 graph properties
  1. Biconnected components
  2. Articulation points

BAG
[Figure: example graph with nodes A through G, labeled to show a connected graph, an articulation point, and a biconnected graph]

BAG
- BAG: A Graph Theoretic Sequence Clustering Algorithm (Sun Kim)
- Clustering based on 2 graph properties
  1. Biconnected components
  2. Articulation points
- Z-score used to determine similarity
  - User parameter
  - Default: 400
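
One of the two graph properties can be sketched directly. The code below is not BAG: it only shows how a similarity graph might be built from scored pairs with a Z-score cutoff, and how articulation points are found with a standard depth-first search. The scores, node names and function names are made up for illustration.

```python
def build_graph(scored_pairs, cutoff=400):
    """Undirected graph with an edge for every pair scoring >= cutoff."""
    graph = {}
    for a, b, zscore in scored_pairs:
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if zscore >= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def articulation_points(graph):
    """Nodes whose removal disconnects their component (recursive DFS)."""
    disc, low, parent, points = {}, {}, {}, set()

    def dfs(u):
        disc[u] = low[u] = len(disc)       # discovery order
        children = 0
        for v in graph[u]:
            if v not in disc:
                parent[v] = u
                children += 1
                dfs(v)
                low[u] = min(low[u], low[v])
                if parent[u] is None and children > 1:
                    points.add(u)          # root with two DFS subtrees
                if parent[u] is not None and low[v] >= disc[u]:
                    points.add(u)          # subtree of v cannot bypass u
            elif v != parent[u]:
                low[u] = min(low[u], disc[v])

    for node in graph:
        if node not in disc:
            parent[node] = None
            dfs(node)
    return points


# Hypothetical scores: two triangles, A-B-C and C-D-E, that share node C.
pairs = [("A", "B", 900), ("B", "C", 650), ("A", "C", 500),
         ("C", "D", 700), ("D", "E", 820), ("C", "E", 450),
         ("A", "E", 120)]                  # below the cutoff, no edge
print(articulation_points(build_graph(pairs)))  # {'C'}
```

Each triangle is a biconnected component, and C, the only node they share, is the articulation point.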

Design – Program Flow Chart, built up through step: Add clusters to Imperfect Sequence Database

Design - Database Layout
cluster (clusterid, size)
  clusterid  Unique ID for repeat cluster
  size       Size of cluster: number of perfect repeats that comprise the cluster
cluster_comp (clusterid, prepid)
  clusterid  ID of cluster
  prepid     ID of perfect repeats (not unique) that comprise the cluster

Design - Database Layout
perf_reps (prepid, sequence, length, avgdist)
perf_pos (pposid, prepid, spos, epos, direction, chrom)
max_reps (maxid, prepid)
max_comp (maxid, prepid)
cluster (clusterid, size)
cluster_comp (clusterid, prepid)
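
The six tables could be created as sketched below. The slides do not name the database engine or the column types, so SQLite (via Python's standard `sqlite3` module), the types, and the foreign-key references are all assumptions; only the table and column names come from the layout above.

```python
import sqlite3

# Column types and REFERENCES clauses are assumptions, not from the slides.
SCHEMA = """
CREATE TABLE perf_reps (
    prepid   INTEGER PRIMARY KEY,
    sequence TEXT NOT NULL,
    length   INTEGER NOT NULL,
    avgdist  REAL
);
CREATE TABLE perf_pos (
    pposid    INTEGER PRIMARY KEY,
    prepid    INTEGER NOT NULL REFERENCES perf_reps(prepid),
    spos      INTEGER NOT NULL,
    epos      INTEGER NOT NULL,
    direction TEXT,
    chrom     TEXT
);
CREATE TABLE max_reps (
    maxid  INTEGER PRIMARY KEY,
    prepid INTEGER NOT NULL REFERENCES perf_reps(prepid)
);
CREATE TABLE max_comp (
    maxid  INTEGER NOT NULL REFERENCES max_reps(maxid),
    prepid INTEGER NOT NULL REFERENCES perf_reps(prepid)
);
CREATE TABLE cluster (
    clusterid INTEGER PRIMARY KEY,
    size      INTEGER NOT NULL
);
CREATE TABLE cluster_comp (
    clusterid INTEGER NOT NULL REFERENCES cluster(clusterid),
    prepid    INTEGER NOT NULL REFERENCES perf_reps(prepid)
);
"""


def create_database(path=":memory:"):
    """Open a database and create the six tables of the layout."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
conn.execute("INSERT INTO perf_reps VALUES (1, 'atggatggc', 9, NULL)")
conn.execute("INSERT INTO perf_pos VALUES (1, 1, 100, 108, '+', 'I')")
rows = conn.execute(
    "SELECT chrom, spos, epos FROM perf_pos "
    "JOIN perf_reps USING (prepid) WHERE sequence = 'atggatggc'"
).fetchall()
print(rows)  # [('I', 100, 108)]
```

Keeping positions in perf_pos, separate from perf_reps, lets one repeat sequence carry any number of occurrences without duplicating the sequence itself.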

Results
Common options for test cases:
- REPuter:
  - Minimum Repeat Size: 30
  - Maximum Repeat Size: 300
- FASTA:
  - Minimum percent identity: 75%
- BAG (clustering):
  - Z-score cutoff: 400

Results
Computer specifications:
- Dual CPU: Intel Xeon 1.7 GHz
- 4 GB RAM
- Linux RedHat 8.0

Results

Minimum number of repeats                  10        20**
# Perfect repeats, pre-consolidation       30,492    12,437
# Perfect repeats, post-consolidation      6955      2697
# members                                  2282      782
# clusters                                 848       312

Time (m:s)*

Minimum number of repeats                  10        20**
Part 1 (repfind/repselect)                 7:20      7:20
Part 2 (Parse/consolidate)                 36:43     27:48
Part 3 (FASTA/cluster)                     41:56     5:42
Total                                      85:47     40:37

* Average of 2 runs
** Default

Acknowledgments
- Sun Kim
- Don Gilbert
- Scott Martin
