Genomic Repeat Visualisation Using Suffix Arrays. Nava Whiteford, Department of Chemistry, University of Southampton

Repeat Visualisation Using Suffix Arrays. Outline: The Analysis; Artificial Sequences; Genomic Sequences; The Algorithm; Larger Sequences; Non-genomic Sequences.

The repeatscore plot: A sliding window is run over the entire sequence to divide it into all substrings of a given length (in this case 2). Sequence: ATGCATATA. Windows: AT, TG, GC, CA, AT, TA, AT, TA.

The repeatscore plot: The occurrences of each distinct window of ATGCATATA are then counted. AT occurs 3 times; TG occurs 1 time; GC occurs 1 time; CA occurs 1 time; TA occurs 2 times.

The repeatscore plot: The counts are then summarised as the number of occurrences (r) against the number of substrings that occur r times. For ATGCATATA with a window of length 2: r = 1, 3 substrings (TG, GC, CA); r = 2, 1 substring (TA); r = 3, 1 substring (AT).
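As a rough illustration of the counting step, the Python sketch below slides a window of length k over the example sequence, counts each substring, and then tallies how many distinct substrings occur r times. The function names `window_counts` and `repeat_histogram` are invented for this example and do not come from the talk.

```python
from collections import Counter


def window_counts(sequence, k):
    """Count every length-k substring seen by a sliding window."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def repeat_histogram(sequence, k):
    """Map r -> number of distinct length-k substrings occurring r times."""
    return Counter(window_counts(sequence, k).values())


counts = window_counts("ATGCATATA", 2)
histogram = repeat_histogram("ATGCATATA", 2)
print(dict(counts))               # {'AT': 3, 'TG': 1, 'GC': 1, 'CA': 1, 'TA': 2}
print(sorted(histogram.items()))  # [(1, 3), (2, 1), (3, 1)]
```

This direct approach keeps every distinct substring in memory, which is fine for short sequences but is the kind of method that becomes impractical on whole genomes.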

The repeat-score plot: The counts are collected into a matrix indexed by number of occurrences and by sub-string length (1, 2, 3, 4, ...).

The repeat-score plot: The resulting matrix is then plotted as an image.
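One way such a matrix might be assembled is sketched below in Python. The layout (one row per sub-string length, one column per occurrence count) is an assumption, since the slides do not say which axis is which, and `repeatscore_matrix` is a hypothetical name.

```python
from collections import Counter


def repeatscore_matrix(sequence, max_length):
    """Row k-1, column r-1 holds the number of distinct length-k
    substrings that occur exactly r times in the sequence."""
    rows = []
    for k in range(1, max_length + 1):
        counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
        rows.append(Counter(counts.values()))
    # Largest occurrence count seen at any length sets the matrix width.
    width = max((r for row in rows for r in row), default=0)
    return [[row.get(r, 0) for r in range(1, width + 1)] for row in rows]


matrix = repeatscore_matrix("ATGCATATA", 3)
for k, row in enumerate(matrix, start=1):
    print(k, row)
```

The nested list can then be handed to any image-plotting routine (for instance matplotlib's `imshow`) to obtain a picture of the kind shown on the slides.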

Repeatscore plots of Artificial Sequences: Small repeats. The reverse strand is also included.
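The slide does not define how the reverse strand is included. If it is taken to mean the reverse complement joined to the forward sequence (an assumption), a minimal Python sketch could look like this; `reverse_complement`, `with_reverse_strand` and the `#` separator are illustrative choices only.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def with_reverse_strand(sequence, separator="#"):
    """Join the forward and reverse strands with a separator character,
    so a window spanning the join cannot equal a pure A/C/G/T substring."""
    return sequence + separator + reverse_complement(sequence)


print(reverse_complement("ATGCATATA"))  # TATATGCAT
```

Windows that contain the separator would still be counted as substrings, so a real analysis would probably want to skip them.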

Random Sequences

DNA Sequences: “The language of life”. Composed of four different bases: A, T, G and C. Sequences range in size from 2,000 bp to 670 billion bp.

Small Genomic Sequences Lambda Phage

Small Genomic Sequences: Lambda Phage; Random Sequence

E. coli

Sequences coding for rRNA; known inter-genic repeat elements

E. coli

Repeats in Genomic Sequences

A linear-time algorithm: The plots shown would take hours to construct using traditional methods, because the algorithms used would not scale linearly. It is not feasible to create these plots on large sequences unless more advanced algorithms are used.

The suffix array. Original string: banana$. All suffixes: banana$, anana$, nana$, ana$, na$, a$. In sorted order: a$, ana$, anana$, banana$, na$, nana$.
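For readers who want to reproduce the banana$ example, here is a minimal Python sketch that builds a suffix array by sorting suffix start positions. It uses an ordinary comparison sort, so it is not the linear-time construction the talk refers to, and `suffix_array` is a name chosen for this example. Unlike the slide, it also lists the one-character suffix `$`.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text, sorted by suffix.

    Plain comparison sort of the suffixes: easy to read, but NOT linear time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana$"
for start in suffix_array(text):
    print(start, text[start:])
```

Storing start positions rather than the suffixes themselves is what keeps the structure compact: one integer per character of the original string.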

Generating the repeatscore plot from the sorted suffixes: a$, ana$, anana$, banana$, na$, nana$.
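The slides do not spell out how the plot is read off the suffix array. One plausible reading, sketched below in Python as an assumption, is that suffixes sharing the same first k characters sit next to each other in sorted order, so each run of adjacent suffixes with a common length-k prefix is one distinct substring and the run length is its number of occurrences. The function name is hypothetical.

```python
from collections import Counter


def repeat_histogram_from_suffixes(text, k):
    """Map r -> number of distinct length-k substrings occurring r times,
    read off runs of equal length-k prefixes among the sorted suffixes."""
    # Naive construction for clarity; a faster builder could be swapped in.
    starts = sorted(range(len(text)), key=lambda i: text[i:])
    histogram = Counter()
    previous, run = None, 0
    for start in starts:
        prefix = text[start:start + k]
        if len(prefix) < k:
            continue  # suffix too short to contain a length-k substring
        if prefix == previous:
            run += 1  # same substring as the previous suffix: extend the run
        else:
            if run:
                histogram[run] += 1  # close the previous run
            previous, run = prefix, 1
    if run:
        histogram[run] += 1  # close the final run
    return histogram


print(sorted(repeat_histogram_from_suffixes("banana", 2).items()))  # [(1, 1), (2, 2)]
```

Once the suffixes are sorted, a single pass over them yields the counts for a given k without keeping a table of every distinct substring.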

Whole human genome

Human Chromosome 18

Arabidopsis thaliana chromosome 1, coding region

Fibonacci derived sequences

Gallus gallus chromosome 20

Application to other sequences: analysing writing styles; finding plagiarised text; any sequence that may contain motif-based, language-like structure.

Shakespeare

Text document containing the text “The quick brown fox jumped over the lazy dog” 16 times.

“On the Economy of Machinery and Manufactures” by Charles Babbage, with an artificial repeat inserted 16 times.

Conclusion: This new visualisation technique can highlight repeat structure in sequences. In genomic sequences this may be useful in generating annotation. There are applications in other areas worth pursuing. Our next step is to allow the repeatscore plot to be easily interrogated by a user in order to better understand the repeat structure.