Repeats in the Genome Lecture 11/2.


Repeats in the genome
Interspersed repeats
Tandem repeats: microsatellites, minisatellites, satellites

http://mcb1.ims.abdn.ac.uk/djs/web/lectures/repeats1.html#anchor10305

Large repeats: Transposons
"Transposable elements" (TEs)
Sequences that get moved or copied into different loci in the genome
P elements in Drosophila: in the lab, genes are piggybacked on transposons and inserted into the genome to make "transgenic fruit flies"

Transposons http://nitro.biosci.arizona.edu/courses/EEB600A-2003/lectures/lecture26/lecture26.html

Retrotransposons: 2 examples
SINEs (Short Interspersed repeats): 100-500 bp; up to 1M copies; non-autonomous. Example: "Alu" repeats. 13% of human genome.
LINEs (Long Interspersed repeats): up to 7 kbp long; 4,000-100,000 copies; autonomous. Examples: LINE1, LINE2, LINE3. 21% of human genome.

Functions of interspersed repeats
May cause disruptions and disease, e.g. colorectal cancer
Role in evolution of new genes
Function of SINEs and LINEs not fully known
Selfish DNA? Parasitic elements akin to viruses

RepeatMasker
Program to detect and mask interspersed repeats in a sequence
Also finds low-complexity sequences and masks them
Can work with a library of known repeats

Tandem Repeats
Types: satellites; mini- and micro-satellites
In centromeres and telomeres
Repeating pattern 1 bp to 1000s of bp long
Mini- and micro-satellites: simple, small sequence repeats

Microsatellite: 1-5 bp repeating pattern
541 gagccactag tgcttcattc tctcgctcct actagaatga acccaagatt gcccaggccc
601 aggtgtgtgt gtgtgtgtgt gtgtgtgtgt gtgtgtgtgt gtatagcaga gatggtttcc
661 taaagtaggc agtcagtcaa cagtaagaac ttggtgccgg aggtttgggg tcctggccct
721 gccactggtt ggagagctga tccgcaagct gcaagacctc tctatgcttt ggttctctaa
781 ccgatcaaat aagcataagg tcttccaacc actagcattt ctgtcataaa atgagcactg
841 tcctatttcc aagctgtggg gtcttgagga gatcatttca ctggccggac cccatttcac
A microsatellite in a dog (Canis familiaris) gene
http://www.bioinfo.rpi.edu/~bystrc/courses/biol4540/lecture24/lec24.pdf
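To make the "1-5 bp repeating pattern" concrete, here is a minimal Python sketch that scans a sequence for exact runs of a short unit using a regular expression. The function name `find_microsatellites` and the thresholds `max_unit` and `min_copies` are illustrative assumptions, not part of the lecture. It only finds exact repeats, so it is much cruder than the approximate-repeat programs discussed later.

```python
import re

def find_microsatellites(seq, max_unit=5, min_copies=6):
    """Return (start, unit, copies) for each exact run of a 1..max_unit bp unit.

    Hypothetical helper for illustration: exact matching only, lowercase
    a/c/g/t alphabet, spaces in the input are ignored.
    """
    seq = seq.lower().replace(" ", "")
    # The lazy quantifier prefers the shortest unit; the backreference then
    # requires at least (min_copies - 1) further exact copies of that unit.
    pattern = re.compile(r"([acgt]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

# Line 601 of the dog sequence shown on the slide.
line_601 = "aggtgtgtgt gtgtgtgtgt gtgtgtgtgt gtgtgtgtgt gtatagcaga gatggtttcc"
for start, unit, copies in find_microsatellites(line_601):
    print(start, unit, copies)
```

Run on line 601 of the slide's sequence, the sketch reports the (gt)n run that is visible by eye.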

Microsatellites
Copy numbers variable across individuals
Associated with human diseases: Fragile X syndrome, Huntington's disease, myotonic dystrophy
Can be used for genetic fingerprinting and paternity tests, due to high variability

Minisatellites: 6-20 bp repeating pattern
Consensus: AGGATTTT
1 tgattggtct ctctgccacc gggagatttc cttatttgga ggtgatggag gatttcagga
61 tttgggggat tttaggatta taggattacg ggattttagg gttctaggat tttaggatta
121 tggtatttta ggatttactt gattttggga ttttaggatt gagggatttt agggtttcag
181 gatttcggga tttcaggatt ttaagttttc ttgattttat gattttaaga ttttaggatt
241 tacttgattt tgggatttta ggattacggg attttagggt ttcaggattt cgggatttca
301 ggattttaag ttttcttgat tttatgattt taagatttta ggatttactt gattttggga
361 ttttaggatt acgggatttt agggtgctca ctatttatag aactttcatg gtttaacata
421 ctgaatataa atgctctgct gctctcgctg atgtcattgt tctcataata cgttcctttg

Minisatellites
Highly polymorphic across individuals
Used for DNA fingerprinting
Regulation of gene expression

Recognizing repeat sequences
"Dot plots"
Self-similarity
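A self dot plot can be sketched in a few lines of Python. The function name `dot_plot` and the `word` parameter are assumptions made for this illustration. A cell (i, j) is marked when the word starting at i equals the word starting at j, so a tandem repeat of period p shows up as diagonals parallel to the main diagonal at offsets p, 2p, and so on.

```python
def dot_plot(seq, word=2):
    """Return the self dot plot of seq as a list of text rows.

    Row i, column j holds '*' when seq[i:i+word] == seq[j:j+word], else '.'.
    Quadratic in the sequence length, so only suitable for short examples.
    """
    n = len(seq) - word + 1
    rows = []
    for i in range(n):
        row = "".join(
            "*" if seq[i:i + word] == seq[j:j + word] else "."
            for j in range(n)
        )
        rows.append(row)
    return rows

for row in dot_plot("aggtgtgtgtgtatag", word=2):
    print(row)
```

In the printed grid the (gt)n stretch appears as a block of parallel diagonals spaced 2 apart, which is the self-similarity signature of a period-2 tandem repeat.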

Tandem repeat detection
Have to account for approximate tandem repeats
Repeating unit may not be exactly the same (mutations)
Repeats may not be exactly in tandem (indels)

TRF (Tandem Repeats Finder, Benson)
Assume > 80% sequence identity on average
Assume < 10% rate of indels
Basic idea:
T A T A C G T C G A G A C T T A T C C A C G G A G A T A T T T A
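The later slides describe TRF's candidates in terms of exact-matching k-tuples separated by some distance d. The following Python sketch illustrates that idea only; it is not Benson's implementation. The function name `ktuple_distances` and the choice to record just the nearest previous occurrence of each k-tuple are assumptions made here.

```python
from collections import Counter

def ktuple_distances(seq, k=3):
    """Count distances between each k-tuple and its nearest previous occurrence.

    A distance d that is seen many times suggests a candidate tandem repeat
    with period d. Illustrative sketch, not the TRF algorithm itself.
    """
    last_seen = {}          # k-tuple -> most recent start position
    distances = Counter()   # distance -> number of matching k-tuple pairs
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in last_seen:
            distances[i - last_seen[word]] += 1
        last_seen[word] = i
    return distances

print(ktuple_distances("acgacgacgacg", k=3).most_common(1))
```

For a clean period-3 repeat every recorded distance is 3. With mutations and indels the counts spread over nearby distances, which is what the statistical criteria on the following slides address.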

Statistical criteria
The candidate tandem repeat is converted into a Bernoulli (head/tail) sequence: H for a matching column, T for a mismatch or indel
Assess significance of this sequence, assuming a probabilistic model
CCACAACC-CGTCAGGCAAGT
CTGCACCATCGTCTGGGAAGT
HTTHHTHTTHHHHTHHTHHHH
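The conversion on this slide can be reproduced with a short Python sketch. The function name `to_head_tail` is an assumption. It takes the two rows of an alignment between adjacent copies and emits H where the column matches and T where it is a mismatch or a gap.

```python
def to_head_tail(row1, row2):
    """Convert two aligned rows into a head/tail string (H = match, T = otherwise)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    return "".join(
        "H" if a == b and a != "-" else "T"
        for a, b in zip(row1, row2)
    )

# The two aligned copies shown on the slide.
print(to_head_tail("CCACAACC-CGTCAGGCAAGT", "CTGCACCATCGTCTGGGAAGT"))
```

Applied to the slide's two rows, this yields the slide's third row.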

Statistical criteria
Sequence of length 100, with pH = 0.75 (the probability of a head)
At least 95% of the time, the total number of heads is >= 68
At least 95% of the time, the total number of heads in runs of length 5 or more is >= 26
We count only heads in head-runs of length k or more
This tells us what would be a significant number of heads
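The first threshold on this slide follows from the binomial distribution and can be checked with a small Python sketch. The function names are assumptions. The run-restricted threshold (heads in runs of length 5 or more) does not have such a simple closed form, so the sketch only shows how that statistic is counted on a given head/tail string, not how its 95% bound is derived.

```python
from math import comb

def binom_tail(n, p, m):
    """P(X >= m) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(m, n + 1))

def heads_threshold(n, p, level=0.95):
    """Largest m such that at least `level` of sequences have >= m heads."""
    m = n
    while m > 0 and binom_tail(n, p, m) < level:
        m -= 1
    return m

def heads_in_runs(ht, k):
    """Number of heads that lie in head-runs of length k or more."""
    return sum(len(run) for run in ht.split("T") if len(run) >= k)

print(heads_threshold(100, 0.75))                  # threshold on total heads
print(heads_in_runs("HTTHHTHTTHHHHTHHTHHHH", 4))   # heads in runs of >= 4
```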

Statistical criteria
Due to indels, a repeating pattern of size d may induce exact-matching k-tuples separated by d, d±1, d±2, etc.
Consider all such pairs, up to d±dmax
dmax is calculated using an assumption about pI (the indel frequency) and a random-walk model

Statistical criteria
Other criteria:
To distinguish tandem repeats from non-tandem direct repeats: matching k-tuples biased on one side
To pick tuple sizes

Mreps (another program)
Different algorithm to detect repeats
Maximal run of k-mismatch tandem repeats, with period p: a maximal string such that any substring of length 2p is a tandem repeat with at most k mismatches
All such maximal runs can be computed in time O(nk log(k)), where n is the length of the sequence
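The definition of a run can be checked directly with a naive Python sketch. This is a brute-force test of the definition, not the O(nk log(k)) algorithm mentioned on the slide, and the function names are assumptions. It does not test maximality, only whether a given string satisfies the "every window of length 2p" condition.

```python
def is_k_mismatch_tandem(s, p, k):
    """True if s has length 2p and its two halves differ in at most k positions."""
    if len(s) != 2 * p:
        return False
    return sum(a != b for a, b in zip(s[:p], s[p:])) <= k

def is_run(s, p, k):
    """True if every substring of s of length 2p is a k-mismatch tandem repeat."""
    if len(s) < 2 * p:
        return False
    return all(
        is_k_mismatch_tandem(s[i:i + 2 * p], p, k)
        for i in range(len(s) - 2 * p + 1)
    )

print(is_run("acgacgacg", 3, 0))   # exact period-3 repeat
print(is_run("acgacgatg", 3, 0))   # one substitution breaks the exact run
print(is_run("acgacgatg", 3, 1))   # but it is still a 1-mismatch run
```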

Mreps: Statistical criteria
Two reasons for insignificance:
Short length: reject runs of length < p+9
Too many mismatches: create "random" DNA sequences, and infer a quality filter based on these

Gene Duplications
If a region containing a gene is duplicated, a new copy of the gene is created: paralogs
Eases the "selective pressure" on one of the copies: free exploration of sequence space
Cases of entire genomes being duplicated: yeast, wheat

Pseudogenes
Upon gene duplication, one of the two copies may acquire a deleterious mutation
Example: premature "stop codon"
Once the gene "dies" in this fashion, there is no more selective pressure on it
Such a "dead" copy of a gene is a "pseudogene"
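The premature stop codon example can be illustrated with a short Python sketch. It assumes the standard genetic code (stop codons TAA, TAG, TGA) and a coding sequence that starts in frame. The function names are hypothetical, and real pseudogene detection involves far more than this check.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)

def first_in_frame_stop(cds):
    """Return the nucleotide index of the first in-frame stop codon, or None."""
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i
    return None

def has_premature_stop(cds):
    """True if an in-frame stop codon occurs before the final complete codon."""
    stop = first_in_frame_stop(cds)
    complete = len(cds) - len(cds) % 3   # length covered by whole codons
    return stop is not None and stop + 3 < complete

print(has_premature_stop("ATGAAACCCTAA"))      # stop only at the end
print(has_premature_stop("ATGAAATAGCCCTAA"))   # TAG interrupts the reading frame
```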

Pseudogenes
Any sequence that appears to code for a gene product, but does not do so
Origins of pseudogenes:
Gene duplication
Change of environment, gene no longer needed
Portion of mRNA transcript reverse-transcribed and inserted into genome
Create problems for genome study: mis-annotated as genes

Pseudogenes
Pseudogenes mutate at "neutral" rate, free of any selective pressures
Can be used for evolutionary analysis
Example: in Drosophila, insertions:deletions occur in the ratio 1:8, based on a study of pseudogenes

Tandem Repeats and Binding Sites
Regulatory modules have 20-40% coverage by tandem repeats (based on a study of Drosophila)
Very significant statistically, assuming a low-order Markov background
Relation between tandem repeats and binding sites?

Tandem Repeats and Binding Sites
Possibility: tandem repeats help in creating duplicates of binding sites
Multiple copies of a binding site help in exploring new binding sites and in fine-tuning binding affinity
Faster evolution?

Implications for regulatory sequence analysis
Regulatory sequence modeled as a mixture of motif and non-motif "background"
Background typically a Markov chain of fixed order k
Given the last k bases, S[i..i+k-1], the next base S[i+k] is drawn from a fixed probability distribution
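A fixed-order Markov background of this kind can be estimated by counting, as in the following Python sketch. The function name `train_markov` is an assumption, and the estimate is a plain maximum-likelihood one with no pseudocounts, which a practical model would probably want to add.

```python
from collections import Counter, defaultdict

def train_markov(seq, k):
    """Estimate P(next base | previous k bases) from a training sequence.

    Returns {context: {base: probability}}. Maximum-likelihood counts only;
    contexts never seen in seq are absent from the result.
    """
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        context = seq[i:i + k]          # S[i..i+k-1]
        counts[context][seq[i + k]] += 1  # next base S[i+k]
    model = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values())
        model[context] = {base: n / total for base, n in next_counts.items()}
    return model

print(train_markov("acgacgacg", 2))
```

Trained on a pure tandem repeat, every context predicts its next base with probability 1, which hints at why tandem repeats sit awkwardly inside a background model meant to describe "random" sequence.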

Tandem Repeats in Model
Tandem repeats violate the Markov assumption: the previous k bases S[i..i+k-1] may provide a probability distribution on the next base, OR we may have a tandem repeat of the previous j <= k bases
Similarly, a binding site or a part of a binding site may also be tandem repeated

Tandem Repeats in Model
Need to modify the probabilistic model to include tandem repeats
Research topic