Computational method to analyze tandem repeats in eukaryote genomes
PREETI MISRA
Advisor: Dr. HAIXU TANG
SCHOOL OF INFORMATICS - INDIANA UNIVERSITY
Overview: Background; Tandem repeats; Methodology; Results; Conclusions; References
Capstone Presentation 05/18/2007
Background
  Tandem repeat: an array of consecutive repeats
  Example: GATCCGATCCGATCCGATCCGATCC
  Length of the repeating pattern or consensus (GATCC) = 5
  Total repeat length = 25
  3 main types of tandem repeats:
    Microsatellites: bp repeating pattern
    Minisatellites: bp repeating pattern
    Large tandem repeats: greater than 50 bp repeating pattern
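The GATCC example can be checked with a few lines of Python. This is only a minimal sketch that assumes an exact, mutation-free array; the function name `smallest_period` is hypothetical, and this naive check is not the algorithm used by Tandem Repeat Finder.

```python
def smallest_period(seq):
    """Return (consensus, copies) for the shortest unit whose exact
    repetition spells out seq."""
    n = len(seq)
    for p in range(1, n + 1):
        # A candidate period must divide the length and rebuild seq exactly.
        if n % p == 0 and seq[:p] * (n // p) == seq:
            return seq[:p], n // p
    raise ValueError("empty sequence")  # only reachable for ""


consensus, copies = smallest_period("GATCCGATCCGATCCGATCCGATCC")
# consensus, period size, number of copies, total repeat length
print(consensus, len(consensus), copies, len(consensus) * copies)
# prints: GATCC 5 5 25
```

Real tandem repeats accumulate mismatches and indels, so an exact-match test like this would miss most of them; it only illustrates the terms "consensus" and "total repeat length".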
Significance
  Tandem repeats can be used to determine whether 2 DNA samples belong to the same person
  Uses: forensic use, paternity testing
Mechanism of tandem duplication
  Unequal recombination is the major known mechanism for the formation of large tandem repeats
  Image downloaded from okyo.ac.jp/JSBi/journal/GIW02/GIW02F010/GIW02F010.html
Tandem gene duplication
  Benefits: new functions arise; tandem gene duplication is responsible for the evolution of gene clusters
  Example: zinc finger genes in mammalian genomes
  Figure labels: homologous genes; second gene = duplicated
Purpose
  Large tandem repeats are commonly found in eukaryotes: humans have % and chimpanzees have 1.525%
  Goal: to date large tandem duplications and to find the relationship between various characteristics of long tandem repeats and the corresponding evolutionary time
  8 genomes were analyzed: 3 primates, 2 rodents, dog, chicken and puffer fish
Methodology
  Identification: Tandem Repeat Finder (TRF) is used to identify large tandem repeats
  Distance computation: the Jukes-Cantor distance model gives the distance between two repeats
  Transformation: the computed distance is transformed into evolutionary time
Tandem Repeat Finder
  STRING, Mreps and TRF
  TRAP: T. Jose, P. Sobreira, A. Durham and A. Gruber
  TRF can be downloaded at
  Starting and ending positions of the tandem repeat
  Number of repetitions
  A%, C%, G%, T%: percentage of each base in the tandem repeat
  Length of the consensus word (only the first 10 bases)
Tandem Repeat Finder
  Tandem Repeat Finder outline: the program has 2 main components, detection and analysis
  Detection: finds candidate tandem repeats
  Analysis: produces an alignment for each candidate and statistics about the alignment
Tandem Repeat Finder
  Large tandem repeats were extracted
  Results of TRF:
    GATCC - period or consensus
    GATCCGATCCGATCCGATCCGATCC - repeat
    1 - indices
    5 - consensus or period size
    percent matches
    0 - percent indels
    50 - score
    20 - % of A
    40 - % of C
    entropy
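The base percentages reported for the GATCC example can be reproduced directly from the repeat sequence. The sketch below is illustrative only: `base_composition` is a hypothetical helper, not part of TRF, and it simply counts characters.

```python
from collections import Counter


def base_composition(seq):
    """Percentage of A, C, G and T in seq.

    Any other symbol (for example N) still counts toward the length.
    """
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq.upper())
    return {base: 100.0 * counts[base] / len(seq) for base in "ACGT"}


print(base_composition("GATCCGATCCGATCCGATCCGATCC"))
# prints: {'A': 20.0, 'C': 40.0, 'G': 20.0, 'T': 20.0}
```

The values for A (20) and C (40) agree with the percentages listed for this repeat.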
DNA Sequence Evolution Model For Dating
  Figure: tree of DNA sequences diverging by substitution from an ancestral sequence (AAGACTT); time axis: 3 mil yrs, 2 mil yrs, 1 mil yrs, today
Dating tandem duplications
  Computing divergence of tandem repeating units:
  Repeat identity: each repeat is compared with the other repeats, and the maximum similarity/identity is considered
  GATCC
  GATCC|GATCC|GATCC|GATCC|GATCC
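The "maximum identity" rule can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it compares equal-length units position by position without gaps, whereas a real analysis would align the units first. The function names and the mutated example units are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length units (no gaps)."""
    if len(a) != len(b):
        raise ValueError("units must have equal length in this ungapped sketch")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def max_identity_per_unit(units):
    """For each unit, the highest identity to any other unit in the array.

    Requires at least two units.
    """
    return [
        max(identity(u, v) for j, v in enumerate(units) if j != i)
        for i, u in enumerate(units)
    ]


# Hypothetical array: two exact copies of GATCC and two mutated copies.
units = ["GATCC", "GATCC", "GATTC", "AATCC"]
print(max_identity_per_unit(units))
# prints: [1.0, 1.0, 0.8, 0.8]
```

Taking the maximum pairs each unit with its closest relative in the array, which is the copy it most plausibly shares its most recent duplication with.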
Jukes-Cantor model
  Computes the distance between 2 repeats
  All bases occur with equal probability, i.e. p = 0.25 for A, T, G and C
  All possible base substitutions are equally likely, e.g. A ↔ G, A ↔ C, A ↔ T, G ↔ T
Jukes-Cantor model
  m = number of mutations (mismatched sites)
  n = length of the sequence
  D = distance between two repeats
  D = -(3/4) ln(1 - (4/3)(m/n))
  Example: if mismatches are observed at 25% of the sites (m/n = 0.25), the Jukes-Cantor model predicts that the distance between the two repeats is D = -(3/4) ln(2/3) ≈ 0.304
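The distance formula translates directly into code. The sketch below is a minimal illustration; the function name is hypothetical, and the guard for m/n >= 0.75 is an addition, since the logarithm is undefined at or beyond that point.

```python
import math


def jukes_cantor_distance(m, n):
    """Jukes-Cantor distance D = -(3/4) ln(1 - (4/3)(m/n)).

    m: number of mismatched sites, n: number of compared sites.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = m / n
    if p >= 0.75:
        # The correction diverges as the mismatch fraction approaches 3/4.
        raise ValueError("mismatch fraction must be below 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Mismatches at 25% of the sites, as in the example.
print(round(jukes_cantor_distance(25, 100), 4))
# prints: 0.3041
```

The corrected distance (about 0.304) is larger than the observed mismatch fraction (0.25) because the model accounts for sites that mutated more than once.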
Estimating the evolutionary time
  The computed distance (D) between two repeats is transformed into evolutionary time
  The neutral mutation rate in mammals is nearly 1.25 × 10^-9 per site per year
  Time (T) = D / (1.25 × 10^-9) years ago
  Example: D = 0.1 gives T = 0.1 / (1.25 × 10^-9) = 80 million years ago
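The conversion from distance to time is a single division. The sketch below assumes a neutral rate of 1.25e-9 substitutions per site per year, which is the value implied by the worked example (D = 0.1 giving 80 million years); the constant and function names are hypothetical.

```python
# Assumed neutral mutation rate (per site per year), inferred from the
# example D = 0.1 -> 80 million years.
NEUTRAL_RATE = 1.25e-9


def distance_to_years(d, rate=NEUTRAL_RATE):
    """Convert a distance D into an age in years, T = D / rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return d / rate


age = distance_to_years(0.1)
print(f"{age / 1e6:.0f} million years ago")
# prints: 80 million years ago
```

Keeping the rate as a parameter makes it easy to substitute a lineage-specific rate, since a single mammalian rate is only an approximation for genomes such as chicken or puffer fish.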
Material and Method
  Material:
    The genome files were downloaded from the UCSC site
    The Tandem Repeat Finder and stretcher software were downloaded
  Procedure:
    Extraction of large tandem repeats with Tandem Repeat Finder
    Calculation of similarities between tandem repeats using stretcher
    Computation of the distance using the Jukes-Cantor model
    Transformation of the distance into evolutionary time
Tree of life: 500 million years ago
Recap: period & repeat
  Repeat: ATTCGATTCGATTCGGGATTCGACATTCG
  Period or consensus: ATTCG
Results
  Columns: Genome | Chr#, longest repeat length | Chr#, highest total # of repeats | Chr#, longest period length | Total repeat | Total genome size | Chr#, highest % of repeat | Total coverage (% of repeat in genome)
  HUMAN
  CHIMPANZEE 8, , , MB 2.97 GB 19,
  MACAQUE 13, , , MB 2.87 GB
  RAT 18, , , MB 2.75 GB 12,
  MOUSE 7, , , MB 2.61 GB X,
  DOG X, , , MB 2.40 GB
  CHICKEN 1, , , MB 1.1 GB 16,
  PUFFER FISH 1, 6586 Y, X, MB , , 139 2, MB , , MB 10, GB 19, 6.06
Results
Total number of repeats
Total number of period or consensus
Results of repeat length
% Repeat results (figure labels: Fish, Human)
Dating tandem repeats
Tree of life: 500 million years ago
Conclusions
  Primates (human, chimpanzee and macaque) have the highest number of long tandem repeat duplications
  The dating peak is prominent in human, chimpanzee and macaque, especially between million years ago
  The tandem repeat results follow a pattern similar to the divergence shown in the tree of life
  Dog, rat and mouse show a steady increase in the number of tandem duplications, but the burst is negligible between million years ago
  Human has the highest number of duplications among all studied genomes
Acknowledgements
  Advisor: Dr. Haixu Tang
  School of Informatics
  Members of Computational Omics Lab
  Parents, Rajen & Rajeev
  Prasanta
References
  Jaitly D, Kearney P, Lin G, Ma B. Methods for reconstructing the history of tandem repeats and their application to the human genome.
  Rivals E. A survey on algorithmic aspects of tandem repeats evolution.
  Bertrand D, Gascuel O. Topological rearrangements and local search method for tandem duplication trees.
  Zhang L, Ma B, Wang L, Xu Y. Greedy method for inferring tandem duplication history.
  Elemento O, Gascuel O. A fast and accurate distance algorithm to reconstruct tandem duplication trees.
  Benson G. Tandem repeats finder: a program to analyze DNA sequences.