Bioinformatics GBIO0009-1: Biological Sequences


Bioinformatics GBIO: Biological Sequences (Kirill Bessonov)

Slide 1: Biological Sequences Comparison (GBIO). Presented by Kirill Bessonov, Oct 27, 2015

Slide 2: Lecture structure
1. Introduction
2. Global alignment: Needleman-Wunsch
3. Local alignment: Smith-Waterman
4. Illustrations
– Sequence identification via BLAST
– Retrieval of sequences via UniProt and R
– Alignment of sequences via R (seqinr)
– Detection of ORFs with R
– Comparing genomes of two species

Slide 3: Biological sequence
A single continuous molecule:
– DNA
– RNA
– protein

Slide 4: Biological problem
Given the DNA sequence
    AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTT
    AAGATCATGCTATTTTCGAGATTCGATTCTAGCTA
answer the following questions:
– Is it likely to be a gene?
– What is its possible expression level?
– What is the possible structure of the protein product?
– Can we get the protein?
– Can we figure out the key residues of the protein?
– …

Slide 5: Alphabet
– DNA: A, C, T, G
– RNA: A, C, U, G
– Protein: 20 amino acids

Slide 6: Words
Short strings of letters from an alphabet. A word of length k is called a k-word or k-tuple.
Examples:
– 1-tuple: individual nucleotide
– 2-tuple: dinucleotide
– 3-tuple: codon
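The idea of k-words can be illustrated with a few lines of base R. This is only a minimal sketch: `count_kwords` is a hypothetical helper name (not part of seqinr or Biostrings), and the test string is the first 22 bases of the DNA sequence from slide 4.

```r
# Count all overlapping k-words (k-tuples) in a sequence.
# count_kwords is a hypothetical helper, not a function from any package.
count_kwords <- function(sequence, k) {
  n <- nchar(sequence)
  if (k < 1 || k > n) stop("k must be between 1 and the sequence length")
  starts <- seq_len(n - k + 1)                       # start position of every k-word
  words <- substring(sequence, starts, starts + k - 1)
  table(words)                                       # occurrences of each distinct k-word
}

dna  <- "AATCGGATGCGCGTAGGATCGG"
mono <- count_kwords(dna, 1)   # 1-tuples: individual nucleotides
di   <- count_kwords(dna, 2)   # 2-tuples: dinucleotides
tri  <- count_kwords(dna, 3)   # 3-tuples: codon-sized words in all reading frames
print(di)
```

Note that the 3-words are counted in every reading frame, so they are codon-sized words rather than codons of one particular frame.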

Slide 7: Patterns
Recognizing motifs, sites, signals, domains
– functionally important regions
– a conserved motif is summarized by a consensus sequence
– the terms motif, site, signal and domain are often used interchangeably
A gene starts with an "ATG" codon
– Identify the number of potential gene start sites in:
    AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATT
    CGATTCTAGCTAGGTTTAGCTTAGCTTAGTGCCAGAAATCGGATGCGCGTAGGATCGG
    TAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATTCGATTCTAGCTAGGTTTTTAGT
    GCCAGAAATCGTTAGTGCCAGAAATCGATT
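Counting potential start sites amounts to a fixed-pattern search. The sketch below uses only base R and, to keep it short, the first 70 bases of the sequence above; `find_codon` is a hypothetical helper name introduced for illustration.

```r
# Locate every occurrence of a codon-sized pattern in a DNA string.
# find_codon is a hypothetical helper; it returns 1-based start positions.
find_codon <- function(sequence, codon = "ATG") {
  hits <- gregexpr(codon, toupper(sequence), fixed = TRUE)[[1]]
  if (hits[1] == -1) integer(0) else as.integer(hits)   # -1 means "no match"
}

dna <- paste0("AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTT",
              "AAGATCATGCTATTTTCGAGATTCGATTCTAGCTA")
atg_starts <- find_codon(dna)            # positions of potential start codons
atg_frames <- (atg_starts - 1) %% 3 + 1  # reading frame (1, 2 or 3) of each hit
print(data.frame(start = atg_starts, frame = atg_frames))
```

An "ATG" found this way is only a candidate: whether it begins a gene also depends on an in-frame stop codon downstream.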

Slide 8: Probability
What is the probability distribution of the number of times a given base N (A, C, T or G) occurs in a random DNA sequence of length $L_n$?
– Assume $X_1, \dots, X_n$ is the sequence of length $L_n$
– Count the number of times N appears
– Calculate the probabilities
    $P(X_i = N) = p_N = \#\{X_i = N\} / L_n$
    $P(X_i \neq N) = 1 - p_N$
Example: given ATCCTACTGT, $L_n = 10$
– $P(X_i = A) = 2/10 = 0.2$
– $P(X_i = T \text{ or } C \text{ or } G) = 1 - 0.2 = 0.8$
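The frequency estimates above can be reproduced in base R. This is a small sketch in which `base_frequencies` is a hypothetical helper name; the last line additionally assumes that positions are independent and identically distributed, in which case the number of occurrences of a base follows a binomial distribution.

```r
# Estimate p_N for each base as (number of occurrences) / (sequence length).
# base_frequencies is a hypothetical helper written for this example.
base_frequencies <- function(sequence) {
  bases <- strsplit(toupper(sequence), "")[[1]]
  counts <- table(factor(bases, levels = c("A", "C", "G", "T")))
  counts / length(bases)
}

p <- base_frequencies("ATCCTACTGT")
p_A     <- p[["A"]]        # P(X_i = A)
p_not_A <- 1 - p[["A"]]    # P(X_i = T or C or G)
print(p)

# Under the i.i.d. assumption: probability of seeing exactly 2 A's in 10 positions
prob_two_A <- dbinom(2, size = 10, prob = p_A)
```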

Slide 9: Alignments

Slide 10: Biological context
Proteins may be multifunctional
– Sequence determines protein function
– Assumption: pairs of proteins with similar sequences also share similar biological function(s)

Slide 11: Comparing sequences is important for a number of reasons
– It is used to establish evolutionary relationships among organisms
– Identification of functionally conserved sequences (e.g., DNA sequences controlling gene expression), such as the 'TATAAT' box → transcription initiation
– Developing models for human diseases: identify corresponding genes in model organisms (e.g. yeast, mouse), which can be genetically manipulated, e.g. by gene knockouts / silencing

Slide 12: Comparing two sequences
There are two ways of pairwise comparison
– Global, using the Needleman-Wunsch algorithm (NW)
– Local, using the Smith-Waterman algorithm (SW)
Global alignment (NW)
– Alignment of the "whole" sequence
Local alignment (SW)
– Tries to align portions (e.g. motifs)
– More flexible: considers "parts" of the sequences
– Works well on highly divergent sequences
[Figure labels: entire sequence, perfect match, unaligned sequence, aligned portion]

Slide 13: Global alignment

Slide 14: Global alignment (NW)
Sequences are aligned end-to-end along their entire length
Many possible alignments are produced
– The alignment with the highest score is chosen
The naïve algorithm is very inefficient (exponential time)
– To align sequences of length 15, one needs to consider three choices per position (match/mismatch, insertion, deletion): $3^{15} \approx 1.4 \times 10^{7}$ alignments
– Impractical for sequences of length > 20 nt
Used to analyze homology/similarity of
– genes and proteins
– between species

Slide 15: Methodology of global alignment (1 of 4)

Slide 16: Methodology of global alignment (2 of 4)
The matrix should have an extra column and row
– M + 1 columns, where M is the length of sequence M
– N + 1 rows, where N is the length of sequence N
1. Initialize the matrix
– Introduce the gap penalty at every initial position along rows and columns
– Scores at each cell are cumulative
            W   H   A   T
        0  -2  -4  -6  -8
    W  -2
    H  -4
    Y  -6

Slide 17: Methodology of global alignment (3 of 4)
2. Alignment possibilities
– Gap (horizontal or vertical move)
– Match (W-W, diagonal move)
3. Select the maximum score
– Best alignment

Slide 18: Methodology of global alignment (4 of 4)
4. Select the very bottom-right cell
5. Consider the different path(s) going back to the very top-left cell
– How was the next cell value generated? From where?
6. Select the best alignment(s)
    WHAT        WHAT
    WHY-        WH-Y
    Overall score = 1 for both alignments
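The four steps above can be sketched in base R as follows. The scoring values (match = +2, mismatch = -1, gap = -2) are an assumption: they are not stated in the transcript, but they reproduce the quoted overall score of 1 for WHAT versus WHY. `needleman_wunsch` is a hypothetical function name, and the traceback returns only one of the possibly several optimal alignments.

```r
# Needleman-Wunsch global alignment: score plus one optimal alignment.
# Scoring defaults are assumptions chosen to match the scores on the slides.
needleman_wunsch <- function(a, b, match = 2, mismatch = -1, gap = -2) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x); m <- length(y)
  sub_score <- function(p, q) if (p == q) match else mismatch
  # Step 1: (n + 1) x (m + 1) matrix with cumulative gap penalties on the borders
  S <- matrix(0, n + 1, m + 1)
  S[, 1] <- gap * (0:n)
  S[1, ] <- gap * (0:m)
  # Steps 2-3: each cell takes the best of diagonal, vertical and horizontal moves
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      S[i + 1, j + 1] <- max(S[i, j] + sub_score(x[i], y[j]),
                             S[i, j + 1] + gap,
                             S[i + 1, j] + gap)
    }
  }
  # Steps 4-6: trace back from the bottom-right cell to the top-left cell
  i <- n; j <- m
  top <- character(0); bottom <- character(0)
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && S[i + 1, j + 1] == S[i, j] + sub_score(x[i], y[j])) {
      top <- c(x[i], top); bottom <- c(y[j], bottom); i <- i - 1; j <- j - 1
    } else if (i > 0 && S[i + 1, j + 1] == S[i, j + 1] + gap) {
      top <- c(x[i], top); bottom <- c("-", bottom); i <- i - 1
    } else {
      top <- c("-", top); bottom <- c(y[j], bottom); j <- j - 1
    }
  }
  list(score = S[n + 1, m + 1],
       alignment = c(paste(top, collapse = ""), paste(bottom, collapse = "")))
}

res <- needleman_wunsch("WHAT", "WHY")
print(res$score)
cat(res$alignment, sep = "\n")
```

Because the matrix has (N + 1) x (M + 1) cells and each cell is filled in constant time, this dynamic-programming approach avoids the exponential cost of enumerating every alignment.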

Slide 19: Local alignment

Slide 20: Local alignment (SW)
Sequences are aligned to find regions where the best alignment occurs (i.e. highest score)
Assumes a local context (aligning parts of sequences)
Ideal for finding short motifs and DNA binding sites
– basic helix-loop-helix (bHLH) motif
– TATAAT box (a well-known promoter element)
– DNA binding site
Works well on highly divergent sequences

Slide 21: Methodology of local alignment (1 of 4)

Slide 22: Methodology of local alignment (2 of 4)
Construct the alignment matrix with M + 1 columns and N + 1 rows
Initialize the matrix by setting the 1st row and 1st column to zero
– s(a,b) ≥ 0 (the minimum value is zero)
           W  H  A  T
        0  0  0  0  0
    W   0
    H   0
    Y   0

Slide 23: Methodology of local alignment (3 of 4)
For each subsequent cell consider the alignments
– Vertical s(I, -)
– Horizontal s(-, J)
– Diagonal s(I, J)
For each cell select the highest score
– If the score is negative → assign zero

Slide 24: Methodology of local alignment (4 of 4)
Select the initial cell with the highest score(s)
Consider the different path(s) leading to a score of zero
– Trace back the cell values
– Look at how the values originated (i.e. the path)
Mathematically, the local alignment score of sequences A and B is $\max_{I \subseteq A,\; J \subseteq B} S(I, J)$
– where S(I, J) is the score for sub-sequences I and J
Result for WHAT vs WHY: WH aligned with WH, total score of 4
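A compact base R sketch of the local alignment steps is given below. As before, the scoring values (match = +2, mismatch = -1, gap = -2) are an assumption that happens to reproduce the scores quoted in the transcript, `smith_waterman` is a hypothetical function name, and only one optimal local alignment is traced back when several exist.

```r
# Smith-Waterman local alignment: best score plus one optimal local alignment.
# Scoring defaults are assumptions chosen to match the scores on the slides.
smith_waterman <- function(a, b, match = 2, mismatch = -1, gap = -2) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x); m <- length(y)
  sub_score <- function(p, q) if (p == q) match else mismatch
  H <- matrix(0, n + 1, m + 1)          # first row and first column stay at zero
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      # negative scores are replaced by zero
      H[i + 1, j + 1] <- max(0,
                             H[i, j] + sub_score(x[i], y[j]),
                             H[i, j + 1] + gap,
                             H[i + 1, j] + gap)
    }
  }
  # start the traceback at the highest-scoring cell (the first one if tied)
  best <- which(H == max(H), arr.ind = TRUE)[1, ]
  i <- best[["row"]] - 1; j <- best[["col"]] - 1
  top <- character(0); bottom <- character(0)
  while (i > 0 && j > 0 && H[i + 1, j + 1] > 0) {   # stop at a score of zero
    if (H[i + 1, j + 1] == H[i, j] + sub_score(x[i], y[j])) {
      top <- c(x[i], top); bottom <- c(y[j], bottom); i <- i - 1; j <- j - 1
    } else if (H[i + 1, j + 1] == H[i, j + 1] + gap) {
      top <- c(x[i], top); bottom <- c("-", bottom); i <- i - 1
    } else {
      top <- c("-", top); bottom <- c(y[j], bottom); j <- j - 1
    }
  }
  list(score = max(H),
       alignment = c(paste(top, collapse = ""), paste(bottom, collapse = "")))
}

local_res <- smith_waterman("WHAT", "WHY")
print(local_res$score)
cat(local_res$alignment, sep = "\n")

dna_res <- smith_waterman("GGCTCAATCA", "ACCTAAGG")
print(dna_res$score)
```

The only differences from the global procedure are the zero-initialized borders, the floor at zero in every cell, and the traceback that starts at the maximum cell and ends at the first zero.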

Slide 25: Local alignment illustration (1 of 3)

Slide 26: Local alignment illustration (2 of 3)
[Animation: the scoring matrix for GGCTCAATCA (columns) against ACCTAAGG (rows) is filled in row by row, with the first row and first column initialized to 0]

Slide 27: Local alignment illustration (3 of 3)
Best score: 6
Locally:
    CTCAA
    CT-AA
In the whole-sequence context (globally):
    GGCTCAATCA
    ACCT-AAGG

Slide 28: Aligning proteins globally and locally

Slide 29: Biological context
Find common functional units
– Structural motifs: helix-loop-helix, zinc finger, …
Phylogeny
– Distance between species

Slide 30: Protein alignment
Protein local and global alignment
– follows the same rules as we saw with DNA/RNA
Differences (∆)
– the protein alphabet has 20 standard residues (aa), or 22 when selenocysteine and pyrrolysine are counted
– scoring/substitution matrices are used (BLOSUM), so protein properties are taken into account
– residues with very different properties, such as the charged, polar lysine and the small, apolar glycine, are given a low score

Slide 31: Substitution matrices
Protein sequences are more complex
– matrices = collection of scoring rules
Matrices cover events such as
– mismatch and perfect match
The gap penalty needs to be defined separately
E.g. BLOcks SUbstitution Matrix (BLOSUM)

Slide 32: BLOSUM-x matrices
Constructed from aligned sequences with a specific x% similarity
– a matrix built using sequences with no more than 50% similarity is called BLOSUM-50
For highly mutating / dissimilar sequences use
– BLOSUM-45 and lower
For highly conserved / similar sequences use
– BLOSUM-62 and higher

Slide 33: BLOSUM 62
What does the diagonal represent? A perfect match between amino acids
What is the score for the substitution E → D (both acidic a.a.)? Score = 2
What about the more drastic substitution K → I (basic to non-polar)? Score = -3

Slide 34: Practical problem
Align the following sequences both globally and locally using the BLOSUM 62 matrix with a gap penalty of -8
Sequence A: AAEEKKLAAA
Sequence B: AARRIA

Slide 35: Aligning globally using BLOSUM 62
    AAEEKKLAAA
    AA--RRIA--
Score: -14
Other alignment options? Yes

Slide 36: Aligning locally using BLOSUM 62
    KKLA
    RRIA
Score: 10
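The two scores can be checked by summing substitution scores and gap penalties position by position. The sketch below hard-codes only the three BLOSUM62 entries that these alignments need (A/A = 4, K/R = 2, L/I = 2) and applies the gap penalty of -8 per gap position; `score_alignment` and `blosum62_subset` are hypothetical names introduced for this illustration.

```r
# Only the BLOSUM62 entries required by the example alignments; a real
# analysis would use the complete matrix rather than this subset.
blosum62_subset <- c("AA" = 4, "KR" = 2, "LI" = 2)

# Score an already-aligned pair of strings with a linear gap penalty.
# score_alignment is a hypothetical helper written for this example.
score_alignment <- function(top, bottom, scores, gap = -8) {
  p <- strsplit(top, "")[[1]]
  q <- strsplit(bottom, "")[[1]]
  stopifnot(length(p) == length(q))
  total <- 0
  for (k in seq_along(p)) {
    if (p[k] == "-" || q[k] == "-") {
      total <- total + gap                       # one penalty per gap position
    } else {
      key <- paste0(p[k], q[k])
      if (!key %in% names(scores)) key <- paste0(q[k], p[k])   # matrix is symmetric
      if (!key %in% names(scores)) stop("no score available for pair ", key)
      total <- total + scores[[key]]
    }
  }
  total
}

global_score <- score_alignment("AAEEKKLAAA", "AA--RRIA--", blosum62_subset)
local_score  <- score_alignment("KKLA", "RRIA", blosum62_subset)
print(c(global = global_score, local = local_score))
```

The global score is dominated by the four gap positions (4 x -8 = -32), which is why the local alignment of the conserved core scores so much higher.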

Slide 37: Practical 1 of 5: Using BLAST for sequence identification

Slide 38: BLAST
Basic Local Alignment Search Tool
Many different types

Slide 39: Types
– blastn: nucleotide query vs nucleotide database
– blastp: protein query vs protein database
– blastx: nucleotide query translated in 6 frames vs protein database

Slide 40: Sequence identity
We want to know which genes are encoded by the genomic sequence
    >human_genomic_seq
    TGGACTCTGCTTCCCAGACAGTACCCCTGACAGTGACAGAACTGCCACTCTCCCCACCTG
    ACCCTGTTAGGAAGGTACAACCTATGAAGAAAAAGCCAGAATACAGGGGACATGTGAGCC
    ACAGACAACACAAGTGTGCACAACACCTCTGAGCTGAGCTTTTCTTGATTCAAGGGCTAG
    TGAGAACGCCCCGCCAGAGATTTACCTCTGGTCTTCTGAGGTTGAGGGCTCGTTCTCTCT
    TCCTGAATGTAAAGGTCAAGATGCTGGGCCTCAGTTTCCTCTTACATACTCACCAAAAGG
    CTCTCCTGATCAGAGAAGCAGGATGCTGCACTTGTCCTCCTGTCGATGCTCTTGGCTATG
    ACAAAATCTGAGCTTACCTTCTCTTGCCCACCTCTAAACCCCATAAGGGCTTCGTTCTGT
    GTCTCTTGAGAATGTCCCTATCTCCAACTCTGTCATACGGGGGAGAGCGAGTGGGAAGGA
    TCCAGGGCAGGGCTCAGACCCCGGCGCATGGACCTAGTCGGGGGCGCTGGCTCAGCCCC
    GCCCCGCGCGCCCCCGTCGCAGCCGACGCGCGCTCCCGGGAGGCGGCGGCAGAGGCAG
    CATCCACAGCATCAGCAGCCTCAGCTTCATCCCCGGGCGGTCTCCGGCGGGGAAGGCCGG
    TGGGACAAACGGACAGAAGGCAAAGTGCCCGCAATGGAGGGAGCATCCTTTGGCGCGG
    GCCGTGCGGGAGCTGCCTTTGATCCCGTGAGCTTTGCGCGGCGGCCCCAGACCCTGTTGC
    GGGTCGTGTCCTGG

Slide 41: BLAST GUI

Slide 42: Results
Top hit: Mus musculus targeted KO-first, conditional ready, lacZ-tagged mutant allele Tbl3:tm1a(EUCOMM)Hmgu; transgenic

Slide 43: Practical 2 of 5: Sequence retrieval and analysis via R (seqinr)

Slide 44: Protein database
The UniProt database has high-quality protein data
– It is manually curated
– Each protein is assigned a UniProt ID

Slide 45: Manual retrieval from UniProt
In the search field one can enter either a UniProt ID or a common protein name
– example: myelin basic protein
We will retrieve data for UniProt ID P02686

Slide 46: FASTA format
The FASTA format is widely used and has the following structure
– The sequence name starts with a > sign
– The first line corresponds to the protein name
– The actual protein sequence starts from the 2nd line
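The FASTA layout described above is simple enough to parse by hand. The following base R sketch is illustrative only: `parse_fasta` is a hypothetical helper, and the two records are toy entries (their sequences are the short peptides used later in the alignment practical), not real database records.

```r
# Parse FASTA-formatted lines into a named list of sequences.
# parse_fasta is a hypothetical helper written for this example.
parse_fasta <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]                 # drop empty lines
  records <- list()
  current <- NULL
  for (line in lines) {
    if (startsWith(line, ">")) {
      current <- sub("^>", "", line)            # header line: name after the > sign
      records[[current]] <- ""
    } else {
      if (is.null(current)) stop("sequence data found before any '>' header line")
      records[[current]] <- paste0(records[[current]], line)   # join wrapped lines
    }
  }
  records
}

# Toy input: two records, the first wrapped over two sequence lines
fasta_text <- c(">seqA toy record", "PAWH", "EAE",
                ">seqB toy record", "HEAGAWGHEE")
records <- parse_fasta(fasta_text)
print(names(records))
print(nchar(unlist(records)))
```

For files on disk, the same function could be applied to the output of `readLines()`.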

Slide 47: Retrieving protein data with R and SeqinR
One can "talk" programmatically to the UniProt database using R and the seqinR library
– the seqinR library is suitable for "Biological Sequences Retrieval and Analysis"
– Install this library in your R environment
    install.packages("seqinr")
    library("seqinr")
– Choose the database to retrieve data from
    choosebank("swissprot")
– Download the data object for the target protein (P02686)
    MBP_HUMAN = query("MBP_HUMAN", "AC=P02686")
– See the sequence of the object MBP_HUMAN
    MBP_HUMAN_seq = getSequence(MBP_HUMAN); MBP_HUMAN_seq

Slide 48: Dot plot (comparison of 2 sequences) (1 of 2)
Each sequence is plotted on the vertical or horizontal dimension
– If two a.a. from the two sequences at given positions are identical, a dot is plotted
– Matching sequence segments appear as diagonal lines (which run parallel to the main diagonal, but shifted, if an insertion or gap is present)
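The rule behind a dot plot (a dot wherever the two residues are identical) can be sketched in base R without any plotting package. `dot_matrix` is a hypothetical helper, and the two short peptides are toy inputs taken from the alignment practical later in the lecture rather than the MBP sequences.

```r
# Build the logical "dot" matrix: TRUE where residue i of a equals residue j of b.
# dot_matrix is a hypothetical helper written for this example.
dot_matrix <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  outer(x, y, "==")      # rows follow sequence a, columns follow sequence b
}

dots <- dot_matrix("PAWHEAE", "HEAGAWGHEE")

# Text rendering: "*" marks identical residues, "." marks everything else
rendered <- apply(ifelse(dots, "*", "."), 1, paste, collapse = "")
cat(rendered, sep = "\n")
```

Runs of `*` along a diagonal correspond to matching segments; with real proteins the same matrix could be passed to a plotting function instead of being printed.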

Slide 49: Dot plot (comparison of 2 sequences) (2 of 2)
Let's compare two protein sequences
– Human MBP (UniProt ID: P02686)
– Mouse MBP (UniProt ID: P04370)
Download the 2nd (mouse) sequence
    MBP_MOUSE = query("MBP_MOUSE", "AC=P04370");
    MBP_MOUSE_seq = getSequence(MBP_MOUSE);
Visualize the dot plot
    dotPlot(MBP_HUMAN_seq[[1]], MBP_MOUSE_seq[[1]], xlab = "MBP - Human", ylab = "MBP - Mouse")
– Is there similarity between the human and mouse forms of the MBP protein?
– Where do the two sequences differ?
Reading the plot
– Shift in the diagonal line (identical regions): INSERTION in MBP-Human or GAP in MBP-Mouse
– Breaks in the diagonal line = regions of dissimilarity

Slide 50: Practical 3 of 5: Pairwise global and local alignments via R and Biostrings

Slide 51: Installing the Biostrings library
DNA_subst_matrix

Slide 52: Global alignment using R and Biostrings
Create two string vectors (i.e. sequences)
    seqA = "GATTA"
    seqB = "GTTA"
Use pairwiseAlignment() and the defined rules
    globalAlignAB = pairwiseAlignment(seqA, seqB, substitutionMatrix = DNA_subst_matrix, gapOpening = -2, gapExtension = 0, scoreOnly = FALSE, type = "global")
Visualize the best paths (i.e. alignments)
    globalAlignAB
    Global PairwiseAlignedFixedSubject (1 of 1)
    pattern: [1] GATTA
    subject: [1] G-TTA
    score: 6

Slide 53: Local alignment using R and Biostrings
Input two sequences
    seqA = "AGGATTTTAAAA"
    seqB = "TTTT"
The scoring rules will be the same as we used for the global alignment
    localAlignAB = pairwiseAlignment(seqA, seqB, substitutionMatrix = DNA_subst_matrix, gapOpening = -2, scoreOnly = FALSE, type = "local")
Visualize the alignment
    localAlignAB
    Local PairwiseAlignedFixedSubject (1 of 1)
    pattern: [5] TTTT
    subject: [1] TTTT
    score: 8

Slide 54: Aligning protein sequences
Protein sequence alignments are very similar, except that a protein substitution matrix is specified
    data(BLOSUM62)
    BLOSUM62
We will align the sequences
    seqA = "PAWHEAE"
    seqB = "HEAGAWGHEE"
Execute the global alignment
    globalAlignAB <- pairwiseAlignment(seqA, seqB, substitutionMatrix = "BLOSUM62", gapOpening = -2, gapExtension = -8, scoreOnly = FALSE)

Slide 55: Practical 4 of 5: Open Reading Frame (ORF) identification via R (i.e. gene finding)

Slide 56: Gene expression sequences
Central dogma
– DNA → RNA → protein
Each codon codes for one amino acid (a.a.)
– residue = amino acid
mRNA (transcribed by RNA polymerase II)
– Is read in the 5' to 3' direction
– 3 nucleotides code for 1 a.a.
In the DNA context
– Start codon: ATG
– Stop codons: TAA, TGA, TAG

Slide 57: mRNA → protein alphabet
Codon table: 3 nucleotides code for 1 a.a.

Slide 58: Find ORF via R
    # biocLite() has been retired; BiocManager is the current Bioconductor installer
    if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    BiocManager::install("Biostrings")
    library("Biostrings")
    s1 <- "aaaatgcagtaacccatgccc"
    # Find all ATGs in the sequence s1
    matchPattern("atg", s1)
        start end width
    [1]     4   6     3 [atg]
    [2]    16  18     3 [atg]
1) Number of potential ORF starts: there are two "ATG"s in the sequence
2) Positions: nucleotides 4-6 and nucleotides 16-18

Slide 59: Find stop codons
    s1 <- "aaaatgcagtaacccatgccc"
    stop.codons <- c("taa", "tga", "tag")
    sapply(stop.codons, function(x) { matchPattern(x, s1) })
    $taa
    Views on a 21-letter BString subject
    subject: aaaatgcagtaacccatgccc
    views:
        start end width
    [1]    10  12     3 [taa]
    $tga
    Views on a 21-letter BString subject
    subject: aaaatgcagtaacccatgccc
    views: NONE
    $tag
    Views on a 21-letter BString subject
    subject: aaaatgcagtaacccatgccc
    views: NONE
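Start and stop codons only define an ORF when they lie in the same reading frame. The base R sketch below combines the two searches for the example sequence s1; `find_orfs` is a hypothetical helper that scans the forward strand only and reports, for each "atg", the first in-frame stop codon if there is one.

```r
# Report ORFs as (start, end) positions: an "atg" followed, in the same
# reading frame, by the first stop codon. Forward strand only.
# find_orfs is a hypothetical helper written for this example.
find_orfs <- function(sequence, start = "atg", stops = c("taa", "tga", "tag")) {
  s <- tolower(sequence)
  n <- nchar(s)
  hits <- gregexpr(start, s, fixed = TRUE)[[1]]
  starts <- if (hits[1] == -1) integer(0) else as.integer(hits)
  orf_start <- numeric(0); orf_end <- numeric(0)
  for (p in starts) {
    codon_starts <- seq(p, n - 2, by = 3)            # walk codon by codon from the atg
    codons <- substring(s, codon_starts, codon_starts + 2)
    stop_idx <- which(codons %in% stops)
    if (length(stop_idx) > 0) {
      orf_start <- c(orf_start, p)
      orf_end <- c(orf_end, codon_starts[stop_idx[1]] + 2)   # last base of the stop codon
    }
  }
  data.frame(start = orf_start, end = orf_end)
}

s1 <- "aaaatgcagtaacccatgccc"
orfs <- find_orfs(s1)
print(orfs)
```

In s1 the "atg" at position 4 is followed in frame by "cag" and then the stop codon "taa", whereas the "atg" at position 16 has no in-frame stop codon before the sequence ends.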

Slide 60: Dengue virus example
We want to find ORFs in the Dengue virus
– find all potential start and stop codons
– in the first 500 nucleotides of the genome sequence of the DEN-1 Dengue virus (NCBI accession NC_001477)

Slide 61. Define getncbiseq()
    getncbiseq <- function(accession) {
      require("seqinr")  # this function requires the SeqinR R package
      # first find which ACNUC database the accession is stored in:
      dbs <- c("genbank", "refseq", "refseqViruses", "bacterial")
      numdbs <- length(dbs)
      for (i in 1:numdbs) {
        db <- dbs[i]
        choosebank(db)
        # check if the sequence is in ACNUC database 'db':
        resquery <- try(query(".tmpquery", paste("AC=", accession)), silent = TRUE)
        if (!(inherits(resquery, "try-error"))) {
          queryname <- "query2"
          thequery <- paste("AC=", accession, sep = "")
          query2 <- query(queryname, thequery);
          # see if a sequence was retrieved:
          seq <- getSequence(query2$req[[1]])
          closebank()
          return(seq);
        }
        closebank()
      }
      print(paste("ERROR: accession", accession, "was not found"))
    }
    dengueseq <- getncbiseq("NC_001477");

Slide 62. Select portion of the sequence
Take a portion of the Dengue virus sequence
    library("seqinr");
    # Take the first 500 nucleotides
    dengueseqstart <- dengueseq[1:500]
    # Convert the vector of characters into a single string
    dengueseqstartstring <- c2s(dengueseqstart);

Slide 63. Define findPotentialStartsAndStops()
    findPotentialStartsAndStops <- function(sequence) {
      # Define a vector with the sequences of potential start and stop codons
      codons <- c("atg", "taa", "tag", "tga")
      # Find the number of occurrences of each type of potential start or stop codon
      for (i in 1:4) {
        codon <- codons[i]
        # Find all occurrences of codon "codon" in sequence "sequence"
        occurrences <- matchPattern(codon, sequence)
        # Find the start positions of all occurrences of "codon" in sequence "sequence"
        codonpositions <- start(occurrences)
        # Find the total number of potential start and stop codons in sequence "sequence"
        numoccurrences <- length(codonpositions)
        if (i == 1) {
          # Make a copy of vector "codonpositions" called "positions"
          positions <- codonpositions
          # Make a vector "types" containing "numoccurrences" copies of "codon"
          types <- rep(codon, numoccurrences)
        } else {
          # Add the vector "codonpositions" to the end of vector "positions":
          positions <- append(positions, codonpositions, after=length(positions))
          # Add the vector "rep(codon, numoccurrences)" to the end of vector "types":
          types <- append(types, rep(codon, numoccurrences), after=length(types))
        }
      }
      # Sort the vectors "positions" and "types" in order of position along the input sequence:
      indices <- order(positions)
      positions <- positions[indices]
      types <- types[indices]
      # Return a list variable including vectors "positions" and "types":
      mylist <- list(positions, types)
      return(mylist)
    }

Slide 64. Find ORFs
The sequence has 11 potential start codons and 22 potential stop codons
    "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgc
    ttaacgtagttctaacagttttttattagagagcagatctctgatgaacaac
    caacggaaaaagacgggtcgaccgtct …"
    findPotentialStartsAndStops(dengueseqstartstring);
    [[1]]
    [1] … (the 33 codon positions are missing from the slide text)
    [[2]]
    [1] "tag" "taa" "tag" "taa" "tag" "tga" "atg" "tga" "atg" "tga" "atg"
        "tga" "tga" "atg" "tag" "taa" "tag" "tag" "atg" "atg" "atg" "tga"
        "taa" "atg" "tga" "tga" "atg" "atg" "tga" "atg" "tga" "tag" "tag"

Slide 65. Three reading frames
There are three different possible reading frames. The frame of a codon depends on how many nucleotides separate it from a reference point (such as the first nucleotide of the sequence):
- +1: an integer number of triplets (codons)
- +2: an integer number of triplets, plus 1 nucleotide
- +3: an integer number of triplets, plus 2 nucleotides
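To make the three frames concrete, the short sequence s1 from the earlier slides can be cut into codons in each frame. This is a base R sketch under the assumption that frames are counted from the first nucleotide of the sequence; codons_in_frame is a hypothetical helper name.

```r
# Hypothetical helper: split a DNA string into codons for reading frame 1, 2 or 3.
# Frame 1 starts at nucleotide 1, frame 2 at nucleotide 2, frame 3 at nucleotide 3.
# Trailing nucleotides that do not fill a whole codon are dropped.
codons_in_frame <- function(dna, frame) {
  stopifnot(frame %in% 1:3)
  if (nchar(dna) < frame + 2) return(character(0))
  starts <- seq(frame, nchar(dna) - 2, by = 3)
  substring(dna, starts, starts + 2)
}

s1 <- "aaaatgcagtaacccatgccc"
codons_in_frame(s1, 1)  # "aaa" "atg" "cag" "taa" "ccc" "atg" "ccc"
codons_in_frame(s1, 2)  # "aaa" "tgc" "agt" "aac" "cca" "tgc"
codons_in_frame(s1, 3)  # "aat" "gca" "gta" "acc" "cat" "gcc"
```

Only frame 1 shows an "atg" followed later by a stop codon ("taa"), which is why the ORF reported for s1 lies in that frame.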

Slide 66. Identify reading frame
Get a sub-sequence
    substring(dengueseqstartstring, 137, 143);
    [1] "atgctga"
The sub-sequence
- starts with a potential start codon (ATG)
- ends with a potential stop codon (TGA)
- spans 7 nucleotides: an integer number of triplets plus one nucleotide (frame +2), so this ATG and this TGA are not in the same reading frame
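The frame arithmetic for this example can be checked with the modulo operator. The sketch below assumes frames are numbered from the first nucleotide of the sequence; frame_of is a hypothetical helper name.

```r
# Reading frame (+1, +2 or +3) of a codon that starts at 1-based position `pos`,
# counted from the first nucleotide of the sequence (assumed convention).
frame_of <- function(pos) (pos - 1) %% 3 + 1

atg_start <- 137  # "atg" occupies nucleotides 137-139
tga_start <- 141  # "tga" occupies nucleotides 141-143

frame_of(atg_start)  # 2
frame_of(tga_start)  # 3

# A start and a stop codon can delimit an ORF only if the distance between
# their first nucleotides is a multiple of 3.
in_same_frame <- (tga_start - atg_start) %% 3 == 0
in_same_frame  # FALSE
```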

Slide 67. Find ORF in DNA
Search for
- a potential start codon,
- followed by an integer number of codons,
- followed by a potential stop codon
- test all 3 frames
Input: 500 bp of the DEN-1 virus

Slide 68. Defining findORFsinSeq() 1 of 3
    findORFsinSeq <- function(sequence) {
      require(Biostrings)
      # Make vectors "positions" and "types" containing information on the
      # positions of potential start and stop codons in the sequence:
      mylist <- findPotentialStartsAndStops(sequence)
      positions <- mylist[[1]]
      types <- mylist[[2]]
      # Make vectors "orfstarts" and "orfstops" to store the predicted start and stop codons of ORFs
      orfstarts <- numeric()
      orfstops <- numeric()
      # Make a vector "orflengths" to store the lengths of the ORFs
      orflengths <- numeric()
      # Print out the positions of ORFs in the sequence:
      # Find the length of vector "positions"
      numpositions <- length(positions)

Slide 69. Defining findORFsinSeq() 2 of 3
      # There must be at least one start codon and one stop codon to have an ORF.
      if (numpositions >= 2) {
        for (i in 1:(numpositions-1)) {
          posi <- positions[i]
          typei <- types[i]
          found <- 0
          while (found == 0) {
            for (j in (i+1):numpositions) {
              posj <- positions[j]
              typej <- types[j]
              posdiff <- posj - posi
              # Modulo in R is %%, not %
              posdiffmod3 <- posdiff %% 3
              # Add in the length of the stop codon
              orflength <- posj - posi + 3
              if (typei == "atg" && (typej == "taa" || typej == "tag" || typej == "tga") && posdiffmod3 == 0) {
                # Check if we have already used the stop codon at posj+2 in an ORF
                numorfs <- length(orfstops)
                usedstop <- -1
                if (numorfs > 0) {
                  for (k in 1:numorfs) {
                    orfstopk <- orfstops[k]
                    if (orfstopk == (posj + 2)) { usedstop <- 1 }
                  }
                }
                if (usedstop == -1) {
                  orfstarts <- append(orfstarts, posi, after=length(orfstarts))
                  orfstops <- append(orfstops, posj+2, after=length(orfstops)) # Including the stop codon.
                  orflengths <- append(orflengths, orflength, after=length(orflengths))
                }
                found <- 1
                break
              }
              if (j == numpositions) { found <- 1 }
            }
          }
        }
      }

Slide 70. Defining findORFsinSeq() 3 of 3
      # Sort the final ORFs by start position:
      indices <- order(orfstarts)
      orfstarts <- orfstarts[indices]
      orfstops <- orfstops[indices]
      # Find the lengths of the ORFs that we have
      orflengths <- numeric()
      numorfs <- length(orfstarts)
      # seq_len() avoids looping over 1:0 when no ORF was found
      for (i in seq_len(numorfs)) {
        orfstart <- orfstarts[i]
        orfstop <- orfstops[i]
        orflength <- orfstop - orfstart + 1
        orflengths <- append(orflengths, orflength, after=length(orflengths))
      }
      mylist <- list(orfstarts, orfstops, orflengths)
      return(mylist)
    }

Slide 71. Find ORF start / end sites
    s1 <- "aaaatgcagtaacccatgccc";
    findORFsinSeq(s1);
    [[1]]
    [1] 4
    [[2]]
    [1] 12
    [[3]]
    [1] 9
The returned list contains
1. a vector of the start positions of the ORFs
2. a vector of the end positions of those ORFs
3. a vector containing the lengths of the ORFs
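For comparison, the same search (a start codon, then whole codons, then the first in-frame stop codon) can be written compactly in base R. This is a sketch under simplifying assumptions (lowercase input, forward strand only), not the findORFsinSeq() implementation from the slides; find_orfs_base is a hypothetical name.

```r
# Hypothetical helper: report ORFs as start / end / length (stop codon included).
find_orfs_base <- function(dna) {
  stops <- c("taa", "tag", "tga")
  n <- nchar(dna)
  atg <- as.integer(gregexpr("atg", dna, fixed = TRUE)[[1]])
  atg <- atg[atg > 0]  # gregexpr() returns -1 when there is no match
  orf_start <- numeric(0)
  orf_end <- numeric(0)
  for (s in atg) {
    pos <- s + 3  # first codon after the ATG
    while (pos + 2 <= n) {
      if (substr(dna, pos, pos + 2) %in% stops) {
        # As in the slides, a stop codon is assigned to one ORF only
        if (!((pos + 2) %in% orf_end)) {
          orf_start <- c(orf_start, s)
          orf_end <- c(orf_end, pos + 2)
        }
        break
      }
      pos <- pos + 3  # stay in the frame set by the ATG
    }
  }
  data.frame(start = orf_start, end = orf_end,
             length = orf_end - orf_start + 1)
}

orfs <- find_orfs_base("aaaatgcagtaacccatgccc")
orfs  # one ORF: start 4, end 12, length 9
```

The second ATG (positions 16-18) yields no ORF because the sequence ends before an in-frame stop codon is reached.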

Slide 72. Practical 5 of 5: comparing the genomes of two different species

Slide 73. Comparison
Comparative genomics
- compares the genomes of two different species, or of two different strains
Do the two species have the same number of genes?
Which genes were gained or lost?

Slide 74. Tuning biomaRt
    # Load the biomaRt package in R
    library("biomaRt");
    # List all databases that can be queried
    listMarts();
1. The names and versions of the databases are listed
2. Many different databases can be queried, including WormBase, UniProt, Ensembl, etc.

Slide 75. Connect to database via biomaRt
Once the particular Ensembl data set is selected, the query can be performed with the getBM() function
    # Specify that we want to query the Ensembl Protists database
    ensemblprotists <- useMart("protists_mart_29");
    # Select the Leishmania major data set
    ensemblleishmania <- useDataset("lmajor_eg_gene", mart=ensemblprotists);
    # List the attributes (features) that can be retrieved
    leishmaniaattributes <- listAttributes(ensemblleishmania);
    attributenames <- leishmaniaattributes[[1]];
    attributedescriptions <- leishmaniaattributes[[2]];
The result is a very long list of 292 features in the Leishmania major Ensembl data set

Slide 76. Query genes of Leishmania major
Find all genes in Leishmania major
    leishmaniagenes <- getBM(attributes = c("ensembl_gene_id"), mart=ensemblleishmania);
    # Show the first 10 gene identifiers
    head(leishmaniagenes, 10);
Find only the protein-coding genes in Leishmania major; "gene_biotype" specifies the type of sequence (e.g. protein-coding, pseudogene, etc.):
    leishmaniagenes2 <- getBM(attributes = c("ensembl_gene_id", "gene_biotype"), mart=ensemblleishmania);
    # Get the vector of the names of all L. major genes
    leishmaniagenenames2 <- leishmaniagenes2[,1];
    # Get the vector of the biotypes of all genes
    leishmaniagenebiotypes2 <- leishmaniagenes2[,2];
    table(leishmaniagenebiotypes2);
Biotypes reported by table() (the counts are missing from the slide text):
    ncRNA nontranslating_CDS protein_coding pseudogene rRNA snoRNA snRNA tRNA

Slide 77. Query genes of Plasmodium falciparum
Find the protein-coding genes in Plasmodium falciparum
    ensemblpfalciparum <- useDataset("pfalciparum_eg_gene", mart=ensemblprotists);
    pfalciparumgenes <- getBM(attributes = c("ensembl_gene_id", "gene_biotype"), mart=ensemblpfalciparum);
    # Get the names of the P. falciparum genes
    pfalciparumgenenames <- pfalciparumgenes[[1]];
    length(pfalciparumgenenames);
    # Get the biotypes of the P. falciparum genes
    pfalciparumgenebiotypes <- pfalciparumgenes[[2]];
    table(pfalciparumgenebiotypes);
Biotypes reported by table() (the counts are missing from the slide text):
    ncRNA nontranslating_cds protein_coding rRNA snRNA tRNA

Slide 78. Comparing two species
Plasmodium falciparum seems to have fewer protein-coding genes (5349) than Leishmania major (8308). Why? Possible explanations:
- there have been gene duplications in the Leishmania major lineage
- completely new genes (not related to any other Leishmania major gene) have arisen
- genes have been lost from the Plasmodium falciparum genome
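The protein-coding counts compared here come from the "gene_biotype" column returned by getBM(). A small base R sketch of that counting step follows; the toy data frame and the helper name count_biotype are hypothetical stand-ins for a real query result, and only the two totals (8308 and 5349) are taken from the slide.

```r
# Hypothetical helper: count genes of one biotype in a data frame shaped like
# the getBM() results above (columns "ensembl_gene_id" and "gene_biotype").
count_biotype <- function(genes, biotype = "protein_coding") {
  sum(genes$gene_biotype == biotype)
}

# Toy stand-in for a query result (made-up identifiers, for illustration only)
toy_genes <- data.frame(
  ensembl_gene_id = c("geneA", "geneB", "geneC", "geneD"),
  gene_biotype    = c("protein_coding", "tRNA", "protein_coding", "rRNA"),
  stringsAsFactors = FALSE
)
count_biotype(toy_genes)  # 2

# Difference between the two totals quoted on the slide
lmajor_protein_coding      <- 8308
pfalciparum_protein_coding <- 5349
lmajor_protein_coding - pfalciparum_protein_coding  # 2959
```

The raw difference alone does not say which explanation applies; distinguishing duplication, gene gain, and gene loss requires comparing the genes themselves.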

Slide 80. Resources
Online Tutorial on Sequence Alignment
- bioinformatics.readthedocs.org/en/latest/src/chapter4.html
Graphical alignment of proteins
Pairwise alignment of DNA and proteins using your rules
Documentation on libraries
- Biostrings
- SeqinR