Computational method to analyze tandem repeats in eukaryote genomes
Preeti Misra. Advisor: Dr. Haixu Tang. School of Informatics, Indiana University.


Overview: Background, Tandem repeats, Methodology, Results, Conclusions, References. Capstone Presentation 05/18/2007

Background
A tandem repeat is an array of consecutive repeats. Example: GATCCGATCCGATCCGATCCGATCC
- Repeating pattern or consensus length = 5
- Total repeat length = 25
3 main types of tandem repeats:
- Microsatellites -- bp repeating pattern
- Minisatellites -- bp repeating pattern
- Large tandem -- greater than 50 bp repeating pattern
Outline: Background, Tandem repeats, Tandem gene duplication, Methodology, Tandem Repeat Finder, Dating tandem repeats, Jukes-Cantor model, Results, Analysis, Conclusion
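The example above (a 5 bp pattern repeated to a total length of 25 bp) can be checked with a few lines of Python. This is only an illustrative sketch: the function name `describe_tandem_repeat` and the returned keys are hypothetical, and it handles only perfect repeats with no mismatches or indels.

```python
def describe_tandem_repeat(sequence: str, period: int) -> dict:
    """Summarize a candidate tandem repeat for a given period size.

    Assumption: only perfect repeats are recognized; real repeats
    usually contain mismatches and indels.
    """
    if period <= 0 or period > len(sequence):
        raise ValueError("period must be between 1 and len(sequence)")
    consensus = sequence[:period]
    whole_copies = len(sequence) // period
    # A perfect repeat is the consensus written out whole_copies times.
    is_perfect = (
        len(sequence) % period == 0
        and consensus * whole_copies == sequence
    )
    return {
        "consensus": consensus,
        "period": period,
        "total_length": len(sequence),
        "copies": len(sequence) / period,
        "is_perfect": is_perfect,
    }


info = describe_tandem_repeat("GATCCGATCCGATCCGATCCGATCC", 5)
print(info)
```

For the slide's sequence this reports the consensus GATCC, a period of 5, a total length of 25 and 5 copies.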

Significance
Tandem repeats can be used to determine whether 2 DNA samples belong to the same person. Uses:
- Forensic use
- Paternity testing

Mechanism of tandem duplication
Unequal recombination is the major known mechanism for the formation of large tandem repeats.
Image has been downloaded from okyo.ac.jp/JSBi/journal/GIW02/GIW02F010/GIW02F010.html

Tandem gene duplication
Benefits: new functions arise. Tandem gene duplication is responsible for the evolution of gene clusters.
Example: zinc finger genes in mammalian genomes.
Figure labels: homologous genes; second gene = duplicated.

Purpose
- Large tandem repeats are commonly found in eukaryotes: humans have % and chimpanzees have 1.525%.
- Goal: to date large tandem duplications and to find the relationship between various characteristics of long tandem repeats and the corresponding evolutionary time.
- 8 genomes were analyzed: 3 primates, 2 rodents, dog, chicken and puffer fish.

Methodology
- Identification: Tandem Repeat Finder (TRF) identifies large tandem repeats.
- Distance computation: the Jukes-Cantor distance model gives the distance between two repeats.
- Transformation: the computed distance is transformed into evolutionary time.

Tandem Repeat Finder
- STRING, Mreps and TRF
- TRAP: T. Jose, P. Sobreira, A. Durham and A. Gruber
- TRF is available for download.
Information reported for each tandem repeat:
- Starting and ending positions of the tandem repeat
- Number of repetitions
- A%, C%, G%, T%: percentage of each base in the tandem repeat
- Length of the consensus word (only the first 10 bases)

Tandem Repeat Finder outline
The Tandem Repeat Finder program has 2 main components, detection and analysis:
- Detection: finds candidate tandem repeats
- Analysis: produces an alignment for each candidate and statistics about the alignment

Tandem Repeat Finder
Large tandem repeats were extracted. Results of TRF for an example repeat:
- GATCC: period or consensus
- GATCCGATCCGATCCGATCCGATCC: repeat
- 1: indices
- 5: consensus or period size
- percent matches
- 0: percent indels
- 50: score
- 20: % of A
- 40: % of C
- entropy
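To make the result fields listed above concrete, here is a minimal parsing sketch. The 15-field, whitespace-separated layout is an assumption modeled on TRF's data-file output and should be checked against the TRF documentation for the version in use. The names `TrfRecord` and `parse_trf_line` are hypothetical, and the sample line is made up for illustration: only the score of 50, the period of 5 and the A and C percentages come from the slide.

```python
from typing import NamedTuple


class TrfRecord(NamedTuple):
    """One repeat record; the field order is an assumed layout."""
    start: int
    end: int
    period_size: int
    copy_number: float
    consensus_size: int
    percent_matches: int
    percent_indels: int
    score: int
    percent_a: int
    percent_c: int
    percent_g: int
    percent_t: int
    entropy: float
    consensus: str
    sequence: str


# One converter per field, in the same order as TrfRecord.
_CONVERTERS = [int, int, int, float, int, int, int, int,
               int, int, int, int, float, str, str]


def parse_trf_line(line: str) -> TrfRecord:
    """Parse one whitespace-separated record line."""
    fields = line.split()
    if len(fields) != len(_CONVERTERS):
        raise ValueError(f"expected {len(_CONVERTERS)} fields, got {len(fields)}")
    return TrfRecord(*(conv(f) for conv, f in zip(_CONVERTERS, fields)))


# Hypothetical record for the GATCC example.
sample = "1 25 5 5.0 5 100 0 50 20 40 20 20 1.92 GATCC GATCCGATCCGATCCGATCCGATCC"
record = parse_trf_line(sample)
print(record.consensus, record.period_size, record.end - record.start + 1)
```

Keeping the record as a named tuple makes later filtering, for example selecting repeats whose period exceeds 50 bp, a matter of simple attribute access.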

DNA Sequence Evolution Model For Dating
Figure: a tree of DNA sequences diverging over time (time axis: 3 mil yrs, 2 mil yrs, 1 mil yrs, today). Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, TAGCCCA, TAGACTT, AGCGCTT, AGCACAA.

Dating tandem duplications
Computing divergence of tandem repeating units:
- Repeat identity: each repeat is compared with the other repeats, and the maximum similarity/identity is considered.
- Example: consensus GATCC; repeat units GATCC|GATCC|GATCC|GATCC|GATCC
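The repeat-identity step described above can be sketched as follows. Note the simplification: this sketch compares units position by position without gaps, whereas the presentation computes similarities with an alignment program (stretcher). The function names are hypothetical, and the units are assumed to have equal length.

```python
def split_units(repeat: str, period: int) -> list[str]:
    """Cut a repeat into consecutive units of `period` bases,
    dropping any incomplete unit at the end."""
    return [repeat[i:i + period]
            for i in range(0, len(repeat) - period + 1, period)]


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length units match."""
    if len(a) != len(b) or not a:
        raise ValueError("units must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def max_identities(units: list[str]) -> list[float]:
    """For each unit, the highest identity to any other unit."""
    if len(units) < 2:
        raise ValueError("need at least two units to compare")
    return [
        max(identity(u, v) for j, v in enumerate(units) if j != i)
        for i, u in enumerate(units)
    ]


units = split_units("GATCCGATCCGATCCGATCCGATCC", 5)
print(units)
print(max_identities(units))
```

For the perfect GATCC repeat every unit has an identical partner, so each maximum identity is 1.0; diverged units give values below 1.0.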

Jukes-Cantor model
- Computes the distance between 2 repeats
- All bases occur with equal probability, i.e. p = 0.25 for A, T, G and C
- All possible base substitutions are equally likely: A ↔ G, A ↔ C, A ↔ T, G ↔ T, G ↔ C, C ↔ T

Jukes-Cantor model
D = -(3/4) ln(1 - (4/3)(m/n))
- m = no. of mutations
- n = length of sequence
- D = distance between two repeats
Example: if mismatches are observed at 25% of the sites, the Jukes-Cantor model predicts that the distance between the two repeats is D = -(3/4) ln(1 - (4/3)(0.25)) = -(3/4) ln(2/3) ≈ 0.304.
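A minimal sketch of the distance formula, using the slide's symbols m (number of mismatches) and n (sequence length). The function name and the error handling for saturated sequences are additions made for illustration.

```python
import math


def jukes_cantor_distance(m: int, n: int) -> float:
    """Jukes-Cantor distance D = -(3/4) * ln(1 - (4/3) * (m / n))."""
    if n <= 0 or not 0 <= m <= n:
        raise ValueError("need n > 0 and 0 <= m <= n")
    p = m / n  # observed fraction of mismatched sites
    if p >= 0.75:
        # The logarithm is undefined once p reaches 3/4 (saturation).
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - (4 / 3) * p)


# Mismatches at 25% of the sites, as in the slide's example.
print(round(jukes_cantor_distance(25, 100), 3))  # 0.304
```

The corrected distance (about 0.304) is larger than the observed mismatch fraction (0.25) because the model accounts for multiple substitutions at the same site.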

Estimating the evolutionary time
- The computed distance (D) between two repeats is transformed into evolutionary time.
- The neutral mutation rate in mammals is nearly 1.25 × 10^-9 per year per site.
- Time (T) = D / (1.25 × 10^-9) years ago
- Example: D = 0.1 gives T = 0.1 / (1.25 × 10^-9) = 80 million years ago
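The conversion from distance to time is a single division. In the sketch below the rate of 1.25e-9 per site per year is an assumption inferred from the slide's worked example (D = 0.1 corresponds to 80 million years); the constant and function names are hypothetical.

```python
# Assumed neutral mutation rate (per site per year), inferred from
# the worked example: 0.1 / rate = 80 million years.
NEUTRAL_RATE_PER_SITE_PER_YEAR = 1.25e-9


def evolutionary_time_years(
    distance: float,
    rate: float = NEUTRAL_RATE_PER_SITE_PER_YEAR,
) -> float:
    """Convert a distance D into years ago using T = D / rate."""
    if distance < 0 or rate <= 0:
        raise ValueError("need distance >= 0 and rate > 0")
    return distance / rate


t = evolutionary_time_years(0.1)
print(f"{t / 1e6:.0f} million years ago")
```

The rate is kept as a parameter because neutral mutation rates differ between lineages, so a single mammalian value may not suit the chicken or puffer fish genomes.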

Material and Method
Material:
- The genome files were downloaded from the UCSC site.
- The Tandem Repeat Finder and stretcher software were downloaded.
Procedure:
- Extraction of large tandem repeats with Tandem Repeat Finder
- Calculation of similarities between tandem repeats using stretcher
- Computation of the distance using the Jukes-Cantor model
- Transformation of the distance into evolutionary time

Tree of life: 500 million years ago

Recap: period and repeat
- Repeat: ATTCGATTCGATTCGGGATTCGACATTCG
- Period or consensus: ATTCG

Results
Table columns: Genome; Chr#, longest repeat length; Chr#, highest total # of repeats; Chr#, longest period length; Total repeat; Total genome size; Chr#, highest % of repeat; Total coverage (% of repeat in genome)
HUMAN
CHIMPANZEE 8, , , MB 2.97 GB 19,
MACAQUE 13, , , MB 2.87 GB
RAT 18, , , MB 2.75 GB 12,
MOUSE 7, , , MB 2.61 GB X,
DOG X, , , MB 2.40 GB
CHICKEN 1, , , MB 1.1 GB 16,
PUFFER FISH 1, 6586 Y, X, MB , , 139 2, MB , , MB 10, GB 19, 6.06

Results

Total number of repeats

Total number of period or consensus

Results of repeat length

% Repeat results (plot labels: Fish, Human)

Dating tandem repeats

Tree of life: 500 million years ago

Conclusions
- Primates (human, chimpanzee and macaque) have the highest number of long tandem repeat duplications.
- The dating peak is prominent in human, chimpanzee and macaque, especially between million years ago.
- Tandem repeat results follow a pattern similar to the divergence shown in the tree of life.
- Dog, rat and mouse show a steady increase in the number of tandem duplications, but the burst is negligible between million years ago.
- Human has the highest number of duplications among all studied genomes.

Acknowledgements
- Advisor: Dr. Haixu Tang
- School of Informatics
- Members of Computational Omics Lab
- Parents, Rajen & Rajeev
- Prasanta

References
- Jaitly D, Kearney P, Lin G, Ma B. Methods for reconstructing the history of tandem repeats and their application to the human genome.
- Rivals E. A Survey on Algorithmic Aspects of Tandem Repeats Evolution.
- Bertrand D, Gascuel O. Topological Rearrangements and Local Search Method for Tandem Duplication Trees.
- Zhang L, Ma B, Wang L, Xu Y. Greedy method for inferring tandem duplication history.
- Elemento O, Gascuel O. A fast and accurate distance algorithm to reconstruct tandem duplication trees.
- Benson G. Tandem repeats finder: a program to analyze DNA sequences.