
1 Dot Plots Dot plots provide a graphic view of the amount of similarity between two sequences. The two axes represent the two sequences. In its simplest form, a dot plot is the following: for each residue along one sequence, a dot is drawn wherever an identical residue is found in the other sequence. Consecutive dots along a diagonal then reveal similar stretches.
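The simplest form described above can be sketched in a few lines of Python. This is a minimal illustration, not the CLC workbench implementation; the function names `dot_plot` and `render` and the example sequence are assumptions made for this sketch.

```python
def dot_plot(seq1: str, seq2: str) -> list[list[bool]]:
    """Return a matrix that is True wherever seq1[i] equals seq2[j]."""
    return [[a == b for b in seq2] for a in seq1]


def render(seq1: str, seq2: str) -> str:
    """Draw the matrix as text: one row per residue of seq1, '*' for an identity."""
    matrix = dot_plot(seq1, seq2)
    lines = ["  " + seq2]  # header row: seq2 runs along the horizontal axis
    for residue, row in zip(seq1, matrix):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Hypothetical example sequence, compared against itself.
print(render("GATTACA", "GATTACA"))
```

In the printed grid, the main diagonal is fully marked because the sequence is compared with itself; the scattered off-diagonal marks are the chance identities that make a window size of 1 noisy.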

2 DNA dot plots
Identification of regions of similarity between two sequences; insertions-deletions (introns); repetitive regions (self-self analysis); inverted repeats.
The purpose of dot plots is to provide an intuitive view of sequence similarity. Uses for DNA are the identification of regions of similarity or identity (for example in sequence assembly), insertions-deletions, and repetitive regions. Question to the class: what will inversions look like?

3 Repeats: All DNA sequences contain repeats
All (except very short) DNA sequences contain repeats. This is due to the very small nucleotide alphabet, consisting of only four different bases. In a dot plot of a sequence against itself, such repeats are readily seen as off-diagonal lines. In the indicated repeat, positions are repeated in positions

4 Repeats: All DNA sequences contain repeats
The sequence has two repetitions, one at positions 0-15, and a second at positions

5 Window size: Window size 1
For DNA dot plots, the CLC workbench provides only one tunable parameter: the window size. The window size determines the length of sequence (‘window’) surrounding the centre nucleotide to be taken into consideration. Often, a window size of one generates too much noise. The CLC workbench default is therefore 9 for nucleotides.

6 Window size: Window size 9
The CLC workbench default is therefore 9 for nucleotides. A dark blue spot then means that the 9-nucleotide window centred on a given nucleotide of sequence one (the nucleotide itself plus four on each side) is identical, in composition and order, to the 9-nucleotide window centred on a given nucleotide of the other sequence. As the degree of identity decreases, the shading of blue gets lighter.
Questions to the class: If a window of 9 is employed, which position is the first one that can receive a dot in the dot plot? Answer: 5, which is the middle position of the first window. If a sequence is 100 nucleotides long, how many positions can receive dots if a window of 9 is employed? Answer: 100 (total length) - 4 (the first four positions cannot receive dots) - 4 (the last four cannot receive dots) = 92.
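The arithmetic behind the two answers can be checked with a short sketch. It assumes an odd window size and 1-based positions, as in the questions above; the helper name `dot_positions` is made up for this illustration.

```python
def dot_positions(seq_length: int, window: int) -> range:
    """1-based centre positions that have a complete window around them."""
    if window < 1 or window % 2 == 0:
        raise ValueError("this sketch assumes an odd, positive window size")
    half = window // 2  # nucleotides on each side of the centre
    return range(half + 1, seq_length - half + 1)


positions = dot_positions(100, 9)
# first centre, last centre, number of positions that can receive a dot
print(positions[0], positions[-1], len(positions))  # 5 96 92
```

With a window of 1 every position qualifies, which is why the count only shrinks as the window grows.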

7 Exercise
As an exercise, fill out this DNA dot plot on a piece of paper: a) with window size 1, b) with window size 3.
[Figure: empty dot plot grid; axes: Sequence 1 and Sequence 2]

8 Exercise: Window size 1
[Figure: completed dot plot for window size 1; filled cells mark identities; axes: Sequence 1 and Sequence 2]

9 Exercise: Window size 3
The red fields (marked "Not considered") are not compared at window size 3.
[Figure: dot plot grid with the border fields in red; axes: Sequence 1 and Sequence 2]

10 Exercise: Window size 3
GGA vs. GGA = 3 / 3 identities. The first field to be filled with window size 3 is the one corresponding to the second position, with one nucleotide to each side. The sequence alignment of that position's neighborhood reveals 3 identities in the window compared.
[Figure: dot plot grid with the first field set to 3; axes: Sequence 1 and Sequence 2]

11 Exercise: Window size 3
GGA vs. GAA = 2 / 3 identities. The sequence alignment of that position's neighborhood reveals 2 out of 3 identities in the window compared.
[Figure: dot plot grid with fields 3 and 2 filled in; axes: Sequence 1 and Sequence 2]

12 Exercise: Window size 3
GGA vs. AAA = 1 / 3 identities. The sequence alignment of that position's neighborhood reveals 1 out of 3 identities in the window compared.
[Figure: dot plot grid with fields 3, 2 and 1 filled in; axes: Sequence 1 and Sequence 2]

13 Exercise: Window size 3
GGA vs. AAT = 0 / 3 identities. The sequence alignment of that position's neighborhood reveals 0 out of 3 identities in the window compared.
[Figure: dot plot grid with fields 3, 2, 1 and an empty field; axes: Sequence 1 and Sequence 2]
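The four window comparisons from the exercise can be reproduced with a small helper. This is only a sketch of the counting step; the name `window_identities` is an assumption made for this illustration.

```python
def window_identities(window1: str, window2: str) -> int:
    """Count positions at which two equally long windows carry the same base."""
    if len(window1) != len(window2):
        raise ValueError("windows must have the same length")
    return sum(a == b for a, b in zip(window1, window2))


# The window GGA compared with the four windows used in the exercise.
for other in ("GGA", "GAA", "AAA", "AAT"):
    print("GGA", other, window_identities("GGA", other))
```

The counts come out as 3, 2, 1 and 0, matching the values entered in the grid.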

14 Exercise: Window size 3
The complete matrix looks like this. The CLC workbench will display the figures as shadings of blue. In this example, semitransparent shadings represent 2 out of 3 identities in the window compared.
[Figure: completed dot plot grid with values 1, 2 and 3; axes: Sequence 1 and Sequence 2]
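A complete windowed matrix can be computed by repeating the window comparison for every pair of centre positions. The sketch below assumes an odd window size and uses a hypothetical five-base sequence, since the sequences of the exercise figure are not part of this transcript; `windowed_dot_plot` is a name chosen for this illustration.

```python
def windowed_dot_plot(seq1: str, seq2: str, window: int = 3) -> list[list[int]]:
    """Identity counts for every pair of full windows (odd window size assumed).

    Row r and column c of the result belong to the windows centred on
    seq1[r + window // 2] and seq2[c + window // 2]; border positions
    without a full window are left out, like the red fields in the exercise.
    """
    half = window // 2
    scores = []
    for i in range(half, len(seq1) - half):
        w1 = seq1[i - half:i + half + 1]
        row = []
        for j in range(half, len(seq2) - half):
            w2 = seq2[j - half:j + half + 1]
            row.append(sum(a == b for a, b in zip(w1, w2)))
        scores.append(row)
    return scores


# Hypothetical sequence compared with itself; windows are GGA, GAA and AAT.
for row in windowed_dot_plot("GGAAT", "GGAAT"):
    print(row)
```

A viewer would map each count to a shade, with the full score (3 here) drawn darkest and lower scores drawn lighter.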

15 Introns
Introns are spliced out in the mRNA. The positions of introns are readily seen by making a dot plot of a gene sequence against its mRNA sequence.
[Figure: gene with its exons bracketed and joined in the mRNA]
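Why an intron shows up as a shifted diagonal can be illustrated with a toy example. The sequences below are invented for this sketch (they are not a real gene), the mRNA is written with T instead of U so it can be compared directly, and `diagonal_offsets` is a hypothetical helper that records on which diagonals exact window matches fall.

```python
def diagonal_offsets(gene: str, mrna: str, window: int = 8) -> list[int]:
    """Sorted diagonals (gene index minus mRNA index) holding exact window matches."""
    # Index every window of the mRNA by its sequence.
    index: dict[str, list[int]] = {}
    for j in range(len(mrna) - window + 1):
        index.setdefault(mrna[j:j + window], []).append(j)
    # Look up every window of the gene and note the diagonal of each match.
    offsets = set()
    for i in range(len(gene) - window + 1):
        for j in index.get(gene[i:i + window], []):
            offsets.add(i - j)
    return sorted(offsets)


# Invented toy sequences: two exons separated by one intron.
exon1 = "ATGGCCAAGTTC"
intron = "GTAAGTCTCTCTAG"
exon2 = "GGAGACCTGTAA"
gene = exon1 + intron + exon2
mrna = exon1 + exon2

print(diagonal_offsets(gene, mrna))  # [0, 14]
```

The first exon lies on diagonal 0 and the second on a diagonal displaced by the intron length (14 here), which is the step seen in the gene-versus-mRNA dot plot.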

16 Protein dot plots Dot plots for proteins follow the same principle as those for DNA sequences. Here two similar proteins are compared.

17 CLC Combined Workbench
For proteins, however, the CLC workbench allows the user to specify either the identity matrix or one of the commonly used, evolutionarily derived substitution matrices. This way, sequences that are similar but, due to long divergence times, no longer identical can still be compared in a meaningful way.
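The effect of swapping the identity matrix for a substitution matrix can be sketched as follows. The scores in `TOY_SCORES` are made-up illustrative values, not taken from BLOSUM, PAM or any other published matrix, and the function names are assumptions for this sketch.

```python
# Made-up scores for a few chemically similar pairs (NOT a published matrix).
TOY_SCORES = {frozenset("LI"): 2, frozenset("KR"): 2, frozenset("DE"): 2}


def identity_score(a: str, b: str) -> int:
    """Identity matrix: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


def toy_substitution_score(a: str, b: str) -> int:
    """Toy substitution matrix: 4 for identity, 2 for a similar pair, -1 otherwise."""
    if a == b:
        return 4
    return TOY_SCORES.get(frozenset((a, b)), -1)


def window_score(window1: str, window2: str, score) -> int:
    """Sum the pairwise scores over two equally long windows."""
    return sum(score(a, b) for a, b in zip(window1, window2))


# Two hypothetical windows with no identical residues but three similar pairs.
print(window_score("LKD", "IRE", identity_score))          # 0
print(window_score("LKD", "IRE", toy_substitution_score))  # 6
```

Under identity scoring the two windows look unrelated, while the substitution scoring still gives them a positive score, which is how diverged but related proteins remain visible in the plot.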

18 Ankyrin repeat protein
Here an Ankyrin repeat protein is compared to itself. The Ankyrin repeat is readily seen as runs parallel to the diagonal.

19 HIV Long Terminal Repeats
Retroviruses such as HIV require Long Terminal Repeats (LTRs) for full functionality. These are 250 nt in length and serve as transcriptional regulatory sequences, as well as mediators of transposition.

20 Di-nucleotide repeats
In this sequence of the human genome, we see a highly repetitive region. The dot plot was created using window size 25, and the repeat unit is the dinucleotide CA.
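The pattern produced by a dinucleotide repeat in a self-self plot can be reproduced on a synthetic sequence. The sketch below uses an artificial run of CA units rather than the genomic sequence shown on the slide, and `self_match_counts` is a hypothetical helper that counts exact window matches on each off-diagonal.

```python
def self_match_counts(seq: str, window: int, max_offset: int) -> dict[int, int]:
    """For each off-diagonal offset, count windows that match exactly."""
    counts = {}
    for offset in range(1, max_offset + 1):
        counts[offset] = sum(
            seq[i:i + window] == seq[i + offset:i + offset + window]
            for i in range(len(seq) - window - offset + 1)
        )
    return counts


# Synthetic repeat of 40 CA units, window size 25 as on the slide.
print(self_match_counts("CA" * 40, window=25, max_offset=6))
```

Every even offset is fully populated and every odd offset is empty, so the plot fills with closely spaced parallel diagonals two positions apart.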

21 Repetitive regions Many genomic regions of eukaryotes are highly repetitive, and mRNAs sometimes contain repeats as well. This is the sequence of the human androgen receptor (AR) mRNA plotted against itself. In the inset zoom, we can see two related repetitive sequences, which are repeated in an interleaved fashion. Both are trinucleotide repeats. The first, at position 1350, is of type gca (read in frame as cag, encoding the amino acid glutamine (Q)); the second, at position 2500, is of type ggn, where n is any of the four bases (encoding glycine (G)). When n is repeatedly c, the two repeats have high similarity in a window of 25 nucleotides, as used here. These two polymorphic trinucleotide repeat segments encode the polyglutamine and polyglycine tracts in the N-terminal transactivation domain of the AR protein. Expansion of the polyglutamine tract causes spinal and bulbar muscular atrophy (Kennedy disease).

22 Exercise: Inverted repeats

23 Exercise: Inverted repeats
Answer: make a dot plot of the sequence against the reverse complement of the sequence, here with window size 3. Diagonals now represent inverted repeats. Semitransparent shadings represent 2 out of 3 identities in the window compared.
[Figure: dot plot of Sequence 1 against its reverse complement]
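The reverse-complement comparison can be sketched as below. For brevity the sketch records only exact window matches (the 3 out of 3 case), not the partial 2 out of 3 matches shown as semitransparent shading; the function names and the example sequence are assumptions made for this illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def inverted_repeat_hits(seq: str, window: int = 3) -> list[tuple[int, int]]:
    """Pairs (i, j) where the window at seq[i] equals the window at rc[j]."""
    rc = reverse_complement(seq)
    hits = []
    for i in range(len(seq) - window + 1):
        for j in range(len(rc) - window + 1):
            if seq[i:i + window] == rc[j:j + window]:
                hits.append((i, j))
    return hits


# GGATCC reads the same on both strands, so it is a perfect inverted repeat.
print(reverse_complement("GGATCC"))
print(inverted_repeat_hits("GGATCC"))
```

For this perfect palindrome all hits fall on the main diagonal of the sequence-versus-reverse-complement plot, which is the signature the exercise asks for.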

24 Genome dot plots: inverted repeats
Analysis of a random sequence of Homo sapiens chromosome 7 reveals numerous short inverted repeats, mostly arranged like pearls on a string. These are predominantly remnants of transposable elements. The dot plot was made by comparing a genomic sequence with its reverse complement.

25 The human Alu sequence A self-self plot reveals some repetitive regions. The human Alu sequence is a repeat unit of about 300 bp. In this dot plot, we see that the repeat unit is itself somewhat repetitive.

26 The human Alu sequence A plot of the Alu sequence against its reverse complement reveals its inverted repeat (palindromic) nature, seen as the diagonal along the entire sequence length. The comparison shows that the Alu sequence is an (imperfect) inverted repeat.

27 WD-repeat proteins
[Figure: two dot plots of the same protein, one using the identity matrix and one using the BLOSUM45 matrix]
WD repeat containing proteins are highly repetitive. WD repeats are minimally conserved regions of approximately 40 amino acids, typically bracketed by gly-his and trp-asp (GH-WD), which may facilitate formation of heterotrimeric or multiprotein complexes. Members of this family are involved in a variety of cellular processes, including cell cycle progression, signal transduction, apoptosis, and gene regulation. This protein contains 7 repeats. Using more sensitive substitution matrices instead of the identity matrix allows more remote similarities to be detected.

28 Conclusion Dot plots provide an intuitive view of sequence comparisons. The sliding window size is important. For proteins, substitution matrices can be used. Dot plots can reveal repeats, insertions/deletions (such as introns), and inverted repeats.

