

Dot Plots
Dot plots provide a graphical view of the amount of similarity between two sequences. The two axes represent the two sequences. In its simplest form, a dot plot is built as follows: for each residue along one sequence, a dot is drawn wherever an identical residue is found in the other sequence. Diagonal runs of consecutive dots then reveal similar stretches.
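The simplest form described above can be sketched in a few lines of Python. This is only an illustration, not the CLC workbench implementation; the function names `dot_plot` and `render` and the text rendering with `*` and `.` are assumptions made for this example.

```python
def dot_plot(seq1, seq2):
    """Return a matrix that is True wherever seq1[i] is identical to seq2[j].

    Rows follow seq1 and columns follow seq2 (an arbitrary choice for this sketch).
    """
    return [[a == b for b in seq2] for a in seq1]


def render(matrix, seq1, seq2):
    """Draw the matrix as text: '*' marks an identity, '.' a mismatch."""
    lines = ["  " + " ".join(seq2)]
    for residue, row in zip(seq1, matrix):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "GATTACA", "GATCACA"
    print(render(dot_plot(a, b), a, b))
```

Printing the result for two similar sequences shows the main diagonal as a run of `*` characters, interrupted where the sequences differ.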

DNA dot plots
Uses: identification of regions of similarity between two sequences; insertions and deletions (for example, introns); repetitive regions (self-self analysis); inverted repeats.
The purpose of dot plots is to provide an intuitive view of sequence similarity. For DNA, they are used to identify regions of similarity or identity (for example in sequence assembly), insertions and deletions, and repetitive regions.
Question to the class: what will inversions look like?

Repeats
All (except very short) DNA sequences contain repeats. This is due to the very small nucleotide alphabet, which consists of only four different bases. In a dot plot of a sequence against itself, such repeats are readily seen as off-diagonal lines. In the indicated repeat, positions 70-80 are repeated at positions 130-140.

Repeats
All DNA sequences contain repeats. The sequence at positions 140-150 has two repetitions, one at positions 0-15 and a second at positions 110-120.
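Reading off-diagonal lines from a self-self plot can also be done programmatically. The following is a minimal sketch that scans each diagonal above the main one for runs of exact matches; the function name `find_repeats`, the `min_len` threshold, and the example sequence are hypothetical and not taken from the slides.

```python
def find_repeats(seq, min_len):
    """Return (start1, start2, length) for exact repeats within one sequence.

    Each diagonal of the self-self dot plot corresponds to a fixed offset
    between the two copies; a run of matches on it is an off-diagonal line.
    Positions are 0-based, and start1 < start2.
    """
    n = len(seq)
    hits = []
    for offset in range(1, n):            # offset 0 is the trivial main diagonal
        run = 0
        for i in range(n - offset):
            if seq[i] == seq[i + offset]:
                run += 1
                continue
            if run >= min_len:
                hits.append((i - run, i - run + offset, run))
            run = 0
        if run >= min_len:                # run that reaches the end of the diagonal
            end = n - offset
            hits.append((end - run, end - run + offset, run))
    return hits


if __name__ == "__main__":
    print(find_repeats("ACGTTTACGT", min_len=4))
```

Raising `min_len` plays a role similar to a larger window: short chance matches, which are frequent with a four-letter alphabet, are filtered out.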

Window size 1
For DNA dot plots, the CLC workbench provides only one tunable parameter: the window size. The window size determines the length of sequence (the 'window') surrounding the centre nucleotide that is taken into consideration. Often, a window size of one generates too much noise. The CLC workbench default is therefore 9 for nucleotides.

Window size 9
The CLC workbench default for nucleotides is 9. A dark blue spot then means that all 9 nucleotides in the window around a given nucleotide on sequence one are identical, and in the same order, to the 9 nucleotides in the window around a given nucleotide on the other sequence. As the degree of identity decreases, the shade of blue gets lighter.
Questions to the class:
If a window of 9 is employed, which position is the first one that can receive a dot in the dot plot? Answer: 5, which is the middle position of the first window.
If a sequence is 100 nucleotides long, how many positions can receive dots if a window of 9 is employed? Answer: 100 (total length) - 4 (the first four positions cannot receive dots) - 4 (the last four positions cannot receive dots) = 92.

Exercise
As an exercise, fill out this DNA dot plot on a piece of paper for a) window size 1 and b) window size 3.
[Figure: empty dot plot grid; axes: Sequence 1, Sequence 2; labels: C, T, A, G]

Exercise: window size 1
[Figure: dot plot at window size 1, with identities marked; axes: Sequence 1, Sequence 2; labels: C, T, A, G]

Exercise: window size 3
The red fields (marked "Not considered") are not compared at window size 3.
[Figure: dot plot grid; axes: Sequence 1, Sequence 2; labels: C, T, A, G]

Exercise: window size 3
The first field to be filled at window size 3 is the one corresponding to the second position, with one nucleotide on each side. The sequence alignment of that position's neighborhood (GGA) reveals 3 out of 3 identities in the window compared.
[Figure: dot plot grid with the first cell scored 3; axes: Sequence 1, Sequence 2]

Exercise: window size 3
The sequence alignment of that position's neighborhood (GGA against GAA) reveals 2 out of 3 identities in the window compared.
[Figure: dot plot grid with cells scored 3, 2; axes: Sequence 1, Sequence 2]

Exercise: window size 3
The sequence alignment of that position's neighborhood (GGA against AAA) reveals 1 out of 3 identities in the window compared.
[Figure: dot plot grid with cells scored 3, 2, 1; axes: Sequence 1, Sequence 2]

Exercise: window size 3
The sequence alignment of that position's neighborhood (GGA against AAT) reveals 0 out of 3 identities in the window compared.
[Figure: dot plot grid with cells scored 3, 2, 1 and an empty cell for 0; axes: Sequence 1, Sequence 2]

Exercise: window size 3
The complete matrix looks like this. The CLC workbench displays these figures as shadings of blue. In this example, semitransparent shadings represent 2 out of 3 identities in the window compared.
[Figure: completed dot plot with cell scores 1, 2, 3; axes: Sequence 1, Sequence 2]
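The window scoring walked through in this exercise can be sketched as follows. This is an illustration under stated assumptions, not the CLC workbench algorithm: the function name `windowed_dot_plot` is made up, the window is assumed to be odd, and border cells that are "not considered" are stored as `None`.

```python
def windowed_dot_plot(seq1, seq2, window=3):
    """Score each cell by the number of identities in the surrounding window.

    Cell (i, j) compares seq1[i-half : i+half+1] with seq2[j-half : j+half+1].
    Cells too close to the border to hold a full window stay None.
    """
    half = window // 2
    scores = [[None] * len(seq2) for _ in seq1]
    for i in range(half, len(seq1) - half):
        for j in range(half, len(seq2) - half):
            scores[i][j] = sum(
                seq1[i + k] == seq2[j + k] for k in range(-half, half + 1)
            )
    return scores


if __name__ == "__main__":
    # The four window pairs from the exercise slides, scored at the centre cell.
    for other in ("GGA", "GAA", "AAA", "AAT"):
        print("GGA", other, windowed_dot_plot("GGA", other)[1][1])
```

The four comparisons print 3, 2, 1 and 0 identities, matching the slides. A viewer would then map each score to a shade, with the full score drawn darkest.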

Introns
Introns are spliced out in the mRNA. The positions of introns are readily seen by making a dot plot of a gene sequence against its mRNA sequence.
[Figure: dot plot; axes: Gene, mRNA]

Protein dot plots
Dot plots for proteins follow the same principle as those for DNA sequences. Here two similar proteins are compared.

CLC Combined Workbench
For proteins, however, the CLC workbench allows the user to specify either the identity matrix or one of the commonly used evolutionarily derived substitution matrices. This way, sequences that are similar but, due to long divergence times, no longer identical can still be compared in a meaningful way.
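The difference between counting identities and using a substitution matrix can be illustrated with a toy scoring table. Everything in this sketch is hypothetical: the `TOY_SCORES` values are invented for illustration and are not BLOSUM or PAM entries, and the function names are not part of any CLC workbench interface.

```python
# Invented scores for two pairs of chemically similar amino acids.
# These are NOT real BLOSUM/PAM values; they only illustrate the idea.
TOY_SCORES = {
    ("I", "L"): 2, ("L", "I"): 2,   # both hydrophobic
    ("D", "E"): 2, ("E", "D"): 2,   # both acidic
}


def pair_score(a, b, matrix=TOY_SCORES, match=4, mismatch=-1):
    """Score one residue pair: identity, listed substitution, or mismatch."""
    if a == b:
        return match
    return matrix.get((a, b), mismatch)


def window_score(win1, win2, matrix=TOY_SCORES):
    """Sum the pair scores over two equally long windows."""
    return sum(pair_score(a, b, matrix) for a, b in zip(win1, win2))


def identity_count(win1, win2):
    """Plain identity counting, as used for the DNA dot plots."""
    return sum(a == b for a, b in zip(win1, win2))


if __name__ == "__main__":
    print(identity_count("LDK", "IEK"), window_score("LDK", "IEK"))
```

With identity counting, the windows LDK and IEK share only one residue. The toy matrix also credits the two conservative substitutions, so the window still scores well, which is the effect the slide describes for diverged proteins.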

Ankyrin repeat protein
Here an ankyrin repeat protein is compared to itself. The ankyrin repeats are readily seen as runs parallel to the diagonal.

HIV Long Terminal Repeats
Retroviruses such as HIV, like the related LTR retrotransposons, require Long Terminal Repeats for full functionality. These are 250 nt in length and serve as transcriptional regulatory sequences, as well as mediators of transposition.

Di-nucleotide repeats
In this sequence from the human genome, we see a highly repetitive region. The dot plot was created using window size 25, and the repeat unit is the dinucleotide CA.

Repetitive regions
Many genomic regions of eukaryotes are highly repetitive. Sometimes mRNAs also have repeats. This is the sequence of the human androgen receptor (AR) mRNA plotted against itself. In the inset zoom, we can see two (related) repetitive sequences, which are repeated in an interleaved fashion. Both repeats are trinucleotide repeats. The first repeat, at 1350, is of type gca (read in frame as cag, encoding the amino acid glutamine (Q)); the second, at 2500, is of type ggn, where n is one of the four bases (encoding glycine (G)). When n is repeatedly C, the two repeats have high similarity in a window of 25 nucleotides, as used here. These repeats are two polymorphic trinucleotide repeat segments that encode polyglutamine and polyglycine tracts in the N-terminal transactivation domain of the protein. Expansion of the polyglutamine tract causes spinal and bulbar muscular atrophy (Kennedy disease).

Exercise: Inverted repeats

Exercise: Inverted repeats
Window size 3.
Answer: make a dot plot with the sequence against the reverse-complement of the sequence. Now diagonals represent inverted repeats. Semitransparent shadings represent 2 out of 3 identities in the window compared.
[Figure: dot plot; axes: Sequence 1, Reverse complement; labels: C, T, A, G]
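The answer to this exercise can be sketched in Python: build the reverse complement, then plot the sequence against it. The helper names `reverse_complement` and `identity_matrix` are assumptions for this example, and only the unambiguous bases A, C, G and T are handled.

```python
# Translation table for complementing the four unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def identity_matrix(seq1, seq2):
    """Window size 1 dot plot: True where seq1[i] equals seq2[j]."""
    return [[a == b for b in seq2] for a in seq1]


if __name__ == "__main__":
    seq = "GAATTC"                      # a short palindromic site
    rc = reverse_complement(seq)
    plot = identity_matrix(seq, rc)
    # A perfect inverted repeat equals its own reverse complement,
    # so the main diagonal of this plot is fully set.
    print(rc, all(plot[i][i] for i in range(len(seq))))
```

For longer sequences, the same comparison with a sliding window would show each inverted repeat as a diagonal run, as described in the answer.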

Genome dot plots: inverted repeats
Analysis of a random sequence from Homo sapiens chromosome 7 reveals numerous short inverted repeats, arranged mainly like pearls on a string. These are predominantly remnants of transposable elements. The dot plot was made by comparing a genomic sequence with its reverse complement.

The human Alu sequence
A self-self plot reveals some repetitive regions. The human Alu sequence is a repeat unit of about 300 bp. In this dot plot, we see that the repeat unit is itself somewhat repetitive.

The human Alu sequence
Here we compare the Alu sequence to its reverse-complement. The plot reveals that the Alu sequence is an (imperfect) inverted repeat (palindromic), seen as the diagonal along the entire sequence length.

WD-repeat proteins
[Figure panels: Identity matrix; BLOSUM45 matrix]
WD-repeat-containing proteins are highly repetitive. WD repeats are minimally conserved regions of approximately 40 amino acids, typically bracketed by Gly-His and Trp-Asp (GH-WD), which may facilitate formation of heterotrimeric or multiprotein complexes. Members of this family are involved in a variety of cellular processes, including cell cycle progression, signal transduction, apoptosis, and gene regulation. This protein contains 7 repeats. Using more sensitive substitution matrices instead of the identity matrix allows more remote similarities to be detected.

Conclusion
Dot plots provide an intuitive view of sequence comparisons. The sliding window size is important. For proteins, substitution matrices can be used. Dot plots can reveal repeats, insertions/deletions (such as introns), and inverted repeats.