Published by Scott Ward. Modified over 9 years ago.
1
Genomic Repeat Visualisation Using Suffix Arrays. Nava Whiteford, Department of Chemistry, University of Southampton, new@soton.ac.uk
2
Repeat Visualisation Using Suffix Arrays: The Analysis; Artificial Sequences; Genomic Sequences; The Algorithm; Larger Sequences; Non-genomic Sequences
3
The repeatscore plot: A sliding window is run over the entire sequence to divide it into all substrings of a given length (in this case 2). Sequence: ATGCATATA. Substrings: AT, TG, GC, CA, AT, TA, AT, TA.
14
The repeatscore plot: Occurrences of each substring of length 2 in ATGCATATA: AT occurs 3 times, TG occurs 1 time, GC occurs 1 time, CA occurs 1 time, TA occurs 2 times.
15
The repeatscore plot: The substrings are then grouped by how many times they occur.
No. occurrences (r)    No. substrings that occur r times
1                      3
2                      1
3                      1
4                      0
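The counting on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation; the function names `substring_counts` and `occurrence_histogram` are made up for this example.

```python
from collections import Counter


def substring_counts(sequence, length):
    """Count every substring of the given length using a sliding window."""
    windows = (sequence[i:i + length] for i in range(len(sequence) - length + 1))
    return Counter(windows)


def occurrence_histogram(counts):
    """Map r -> number of distinct substrings that occur exactly r times."""
    return Counter(counts.values())


counts = substring_counts("ATGCATATA", 2)
histogram = occurrence_histogram(counts)

for substring, n in counts.items():
    print(f"{substring} occurs {n} time(s)")
for r in range(1, 5):
    # A Counter returns 0 for values of r that never occur.
    print(r, histogram[r])
```

The histogram for one substring length becomes one column of the repeat-score matrix.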
16
The repeat-score plot: The same count is made for each substring length.
Number of        Substring length
occurrences      1    2    3    4    5
1                2    3    5    6    5
2                0    1    1    0    0
3                1    1    0    0    0
4                1    0    0    0    0
5                0    0    0    0    0
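The matrix on this slide can be reproduced by repeating the sliding-window count for every substring length. The sketch below is an assumption about one straightforward way to do it; `repeat_score_matrix` and its parameters are hypothetical names, and counts above the last row are simply dropped.

```python
from collections import Counter


def repeat_score_matrix(sequence, max_length, max_occurrences):
    """Rows are occurrence counts r = 1..max_occurrences,
    columns are substring lengths k = 1..max_length."""
    matrix = [[0] * max_length for _ in range(max_occurrences)]
    for k in range(1, max_length + 1):
        counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
        for r in counts.values():
            if r <= max_occurrences:  # larger counts are ignored in this sketch
                matrix[r - 1][k - 1] += 1
    return matrix


matrix = repeat_score_matrix("ATGCATATA", max_length=5, max_occurrences=5)
for r, row in enumerate(matrix, start=1):
    print(r, *row)
```

This direct approach recounts the whole sequence for every length, which is the scaling problem the suffix-array slides address.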
17
The repeat-score plot The resulting matrix is then plotted as an image:
18
Repeatscore plots of Artificial Sequences: Small repeats. The reverse strand is also included.
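The slide does not say how the reverse strand is folded into the counts. One plausible reading, shown here purely as an assumption, is to count windows on the sequence and on its reverse complement separately and add the two tallies, so that no window spans the join between strands. The names `reverse_complement` and `both_strand_counts` are hypothetical.

```python
from collections import Counter

# Watson-Crick pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse strand read in its own 5'-to-3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def both_strand_counts(sequence, length):
    """Count substrings on the forward strand and on the reverse complement."""
    counts = Counter()
    for strand in (sequence, reverse_complement(sequence)):
        counts.update(strand[i:i + length] for i in range(len(strand) - length + 1))
    return counts


print(reverse_complement("ATGCATATA"))
print(both_strand_counts("ATGCATATA", 2))
```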
19
Random Sequences
20
DNA Sequences: “The language of life”. Composed of four different bases: A, T, G and C. Sequences range in size from 2000 bp to 670 billion bp.
21
Small Genomic Sequences: Lambda Phage
22
Small Genomic Sequences: Lambda Phage; Random Sequence
23
E. coli
25
Sequences coding for rRNA; known intergenic repeat elements
26
E. coli
27
Repeats in Genomic Sequences
28
A linear-time algorithm: The plots shown would take hours to construct using traditional methods, and the algorithms used would not scale linearly. It is not feasible to create these plots for large sequences unless more advanced algorithms are used.
29
The suffix array: Original string: banana$. All suffixes: banana$, anana$, nana$, ana$, na$, a$.
30
The suffix array: Original string: banana$. All suffixes: banana$, anana$, nana$, ana$, na$, a$. In sorted order: a$, ana$, anana$, banana$, na$, nana$.
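A suffix array stores the start positions of the suffixes in sorted order rather than the suffixes themselves. The sketch below builds one by plain sorting, which is easy to follow but is not a linear-time construction; it is an illustration only, and `suffix_array` is a name chosen for this example. Unlike the slide, it also lists the one-character suffix `$`.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction: sorting the suffixes directly costs far more than
    linear time on long inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana$"
sa = suffix_array(text)
for start in sa:
    print(start, text[start:])
```

The terminator `$` sorts before every letter in ASCII, so a shorter suffix always precedes any longer suffix that extends it.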
31
Generating the repeatscore plot: Sorted suffixes: a$, ana$, anana$, banana$, na$, nana$.
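The slides do not spell out how the sorted suffixes are turned into counts. One standard way, offered here as an assumption rather than the presenter's method, relies on the fact that all suffixes starting with the same k characters sit next to each other in sorted order, so each run of neighbours sharing a prefix of length k gives the occurrence count of one substring. The sketch pairs a naive suffix sort with the Kasai longest-common-prefix computation; all function names are hypothetical and no `$` terminator is needed in this form.

```python
from collections import Counter


def suffix_array(text):
    """Naive suffix array by sorting (not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[i] = common prefix length of suffixes sa[i-1] and sa[i]; lcp[0] = 0."""
    n = len(text)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for start in range(n):
        if rank[start] > 0:
            prev = sa[rank[start] - 1]
            while start + h < n and prev + h < n and text[start + h] == text[prev + h]:
                h += 1
            lcp[rank[start]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def occurrence_histogram(text, k):
    """Map r -> number of distinct length-k substrings occurring r times."""
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    n = len(text)
    histogram = Counter()
    run = 0  # size of the current run of suffixes sharing their first k characters
    for i in range(n):
        if i > 0 and lcp[i] >= k:
            run += 1
            continue
        if run:
            histogram[run] += 1
        # Suffixes shorter than k contain no length-k substring and are skipped.
        run = 1 if n - sa[i] >= k else 0
    if run:
        histogram[run] += 1
    return histogram


print(dict(occurrence_histogram("ATGCATATA", 2)))
print(dict(occurrence_histogram("banana", 2)))
```

Once the suffix array and LCP array exist, each substring length needs only a single pass over them.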
33
Whole human genome
36
Human Chromosome 18
37
Arabidopsis thaliana chromosome 1, coding region
38
Fibonacci derived sequences
39
Gallus gallus chromosome 20
40
Application to other sequences: Analysing writing styles. Finding plagiarised text. Any sequence that may contain motif-based, language-like structure.
41
Shakespeare
42
Text document containing the text “The quick brown fox jumped over the lazy dog” 16 times.
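A test document of this kind is easy to rebuild. The sketch below assumes the 16 copies are separated by a newline, which the slide does not state, and counts character windows of an arbitrarily chosen length 10. Under those assumptions every window that fits inside one copy of the sentence occurs 16 times, while windows that cross a line break occur 15 times.

```python
from collections import Counter

sentence = "The quick brown fox jumped over the lazy dog"
document = "\n".join([sentence] * 16)  # assumption: one copy per line

k = 10  # window length, chosen only for illustration
counts = Counter(document[i:i + k] for i in range(len(document) - k + 1))
histogram = Counter(counts.values())

print(counts["quick brow"])
print(sorted(histogram.items()))
```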
43
“On the Economy of Machinery and Manufactures” by Charles Babbage with an artificial repeat inserted 16 times.
45
Conclusion: This new visualisation technique can highlight repeat structure in sequences. In genomic sequences this may be useful in generating annotation. There are applications in other areas worth pursuing. Our next step is to allow the repeatscore plot to be easily interrogated by a user in order to better understand the repeat structure.