Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 1 Biological Sequences Analysis GBIO00002-1 Presented by Kirill Bessonov Oct 27, 2015

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 2 Talk Structure 1.Introduction 2.Global Alignment: Needleman-Wunsch 3.Local Alignment: Smith-Waterman 4.Practical – Retrieval of sequences using R – Alignment of sequences using R – Finding ORF with R – Comparing two species based on genes

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 3 Gene structure and expression Gene regions – 5’ untranslated region (5’UTR) Directly upstream of start codon (AUG) Regulates translation – 3’ untranslated region (3’UTR) Right after the stop codon Influences translation efficiency  [protein] – Open Reading Frame (ORF) Protein coding region (with introns / exons) mRNA

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 4 Gene expression sequences Central dogma – DNA  RNA  protein Each codon codes for one amino acid (a.a.) – residue = amino acid mRNA polymerase II – Reads from 5’ to 3’ direction – 3 nucleotides code for 1 a.a. In the DNA context – Start codon: ATG – Stop codon: TAA, TGA, TAG

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 5 mRNA  Protein alphabet Codon table: 3 nucleotides code for 1 a.a.

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 6 Biological sequence Single continuous molecule – DNA ACGGCT – RNA ACGGCU – Protein TA or ThrAla

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 7 Biological problem Given the DNA sequence AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTT AAGATCATGCTATTTTCGAGATTCGATTCTAGCTA Answer – Is it likely to be a gene? – What is its possible expression level? – What is the possible structure of the protein product? – Can we get the protein (i.e. express protein)? – Can we figure out the key residues of the protein? – Can we determine the organism from which this sequence came?

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 8 Biological alphabet – In the case of DNA A, C, T, G – In the case of RNA A,C, U, G – In the case of protein 20 amino acids Complete list is found herehere

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 9 Words Short strings of letters from an alphabet A word of length k is called a k-word or k-tuple Examples: – 1-tuple: individual nucleotide – 2-tuple: dinucleotide – 3-tuple: codon

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 10 2-words: dinucleotides Composed of 2 nucleotides – Given DNA alphabet {A,T,C,G} How many possible dinucleoties? Total of 16: AA, AC,AG,AT … TG,TT CpG islands are regions of DNA – Frequent repetition of CpG dinucleotides – Rich in ‘G’ and ‘C’ – CpG islands appear in some 70% of promoters of human genes

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 11 CpG islands CpG sites could be methylated Location and methylation state – Impact gene expression If in promoter region and methylated – May inhibit expression – Present at the gene ‘start’ region How many CpG sites in this DNA sequence?

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 12 3-words: codons Important in case of DNA sequences Linked to expression – DNA  RNA  protein

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 13 Patterns Recognizing motifs, sites, signals, domains – functionally important regions – a conserved motif - consensus sequence – Often words (in bold) are used interchangeably Gene starts with with an “ATG” codon – Identify # of potential gene start sites 4 sites AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATT CGATTCTAGCTAGGTTTAGCTTAGCTTAGTGCCAGAAATCGGATGCGCGTAGGATCGG TAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATTCGATTCTAGCTAGGTTTTTAGT GCCAGAAATCGTTAGTGCCAGAAATCGATT

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 14 Bases distribution The distribution of bases within a DNA – is not ordinarily uniform i.e. not P(A) =P(C) = P(G) = P(T) = 0.25 There may be an excess of G over C on the leading strands – This can be described by the “GC skew”,

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 15 GC skew in a sequence GC skew is defined as (#G - #C) / (#G + #C) Calculated in windows of length l Theoretical minimal is 0 – #G and #C are equal

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 16 Human genome characteristics Fact: the amount of adenine is always the same as the amount of thymine, and the amount of guanine equals the amount of cytosine (A:T=1 and G:C = 1) - in the whole genome

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 17 Alignments

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 18 Biological context Proteins may be multifunctional – Sequence determines protein function – Assumptions Pairs of proteins with similar sequence also share similar biological function(s)

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 19 Comparing sequences are important for a number of reasons. – used to establish evolutionary relationships among organisms – identification of functionally conserved sequences (e.g., DNA sequences controlling gene expression) ‘TATAAT’ box  transcription initiation – develop models for human diseases identify corresponding genes in model organisms (e.g. yeast, mouse), which can be genetically manipulated – E.g. gene knock outs / silencing

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 20 Deletions/insertions/substitution of nucleotides Mutations are of 3 main types – Deletions – Insertions – Substitution – Cause a shift in the mRNA reading frame

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 21 Comparing two sequences There are two ways of pairwise comparison – Global using Needleman-Wunsch algorithm (NW) – Local using Smith-Waterman algorithm (SW) Global alignment (NW) Alignment of the “whole” sequence Local alignment (SW) tries to align portions (e.g. motifs) more flexible – Considers sequences “parts” works well on – highly divergent sequences entire sequence perfect match unaligned sequence aligned portion

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 22 Global alignment

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 23 Global alignment (NW) Sequences are aligned end-to-end along their entire length Many possible alignments are produced – The alignment with the highest score is chosen Naïve algorithm is very inefficient (O exp ) – To align sequence of length 15, need to consider (insertion, deletion, gap) 15 = 3 15 = 1,4*10 7 – Impractical for sequences of length >20 nt Used to analyze homology/similarity of – genes and proteins – between species

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 24 Methodology of global alignment (1 of 4) s1:..AATA.. s2:..AACA.. s1:..AAT-A.. s2:..AACA.. s1:..AATA.. s2:..AATA..

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 25 Methodology of global alignment (2 of 4) The matrix should have extra column and row – M +1 columns, where M is the length sequence M – N +1 rows, where N is the length of sequence N 1.Initialize the matrix – introduce gap penalty at every initial position along rows and columns – Scores at each cell are cumulative WHAT 0 -2-4-6-8 W -2 H -4 Y -6 -2

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 26 Methodology of global alignment (3 of 4) 2.Alignment possibilities Gap (horiz/vert) Match (W-W diag.) 3.Select the maximum score – Best alignment WHAT 0 -2-4-6-8 W -220 -4 H 04 2 0 Y -6 -22 3 1 WH 0 -4 W -2-4 WH 0 -2-4 W -2+2 -2 +2 -OR-

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 27 Methodology of global alignment (4 of 4) 4.Select the most very bottom right cell 5.Consider different path(s) going to very top left cell – How the next cell value was generated? From where? WHAT WHY- WH-Y Overall score = 1 Overall score = 1 6.Select the best alignment(s) WHAT 0 -2-4-6-8 W -220 -4 H 04 2 0 Y -6 -22 3 1 WHAT 0 -4-6-8 W -220 -4 H 04 2 0 Y -6 -22 3 1

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 28 Local alignment

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 29 Local alignment (SW) Sequences are aligned to find regions where the best alignment occurs (i.e. highest score) Assumes a local context (aligning parts of seq.) Ideal for finding short motifs, DNA binding sites – helix-loop-helix (bHLH) - motif – TATAAT box (a famous promoter region) – DNA binding site Works well on highly divergent sequences

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 30 Methodology of local alignment (1 of 4)

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 31 Methodology of local alignment (2 of 4) Construct the MxN alignment matrix with M+1 columns and N+1 rows Initialize the matrix by introducing gap penalty at 1 st row and 1 st column WHAT 0 0000 W 0 H 0 Y 0 s(a,b) ≥ 0 (min value is zero)

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 32 Methodology of local alignment (3 of 4) For each subsequent cell consider alignments – Vertical s(I, - ) – Horizontal s(-,J) – Diagonal s(I,J) For each cell select the highest score – If score is negative  assign zero WHAT 0 0000 W 02000 H 0 0420 Y 0 0231

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 33 Methodology of local alignment (4 of 4) Select the initial cell with the highest score(s) Consider different path(s) leading to score of zero – Trace-back the cell values – Look how the values were originated (i.e. path) WH Mathematically – where S(I, J) is the score for sub-sequences I and J WHAT 0 0000 W 02000 H 0 0420 Y 0 0231 total score of 4 B A J I

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 34 Local alignment illustration (1 of 2)

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 35 Local alignment illustration (2 of 2) GGCTCAATCA A C C T A A G G GGCTCAATCA 0000 0 0 0 0 0 0 0 A 0 C 0 C 0 T 0 A 0 A 0 G 0 G 0 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 02 C 0 C 0 T 0 A 0 A 0 G 0 G 0 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 00 C 0 C 0 T 0 A 0 A 0 G 0 G 0 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 00 C 0 C 0 T 0 A 0 A 0 G 0 G 0 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 02 2 0 02 C 0 C 0 T 0 A 0 A 0 G 0 G 0 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 02 2 0 02 C 0 0 0 2 0 2 0 1 1 2 0 C 0 T 0 A 0 A 0 G 0 G 0 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 02 2 0 02 C 0 0 0 2 0 2 0 1 1 2 0 C 0 0 0 2 1 2 1 0 0 3 1 T 0 A 0 A 0 G 0 G 0 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 02 2 0 02 C 0 0 0 2 0 2 0 1 1 2 0 C 0 0 0 2 1 2 1 0 0 2 1 T 0 0 0 0 4 2 1 0 2 0 1 A 0 A 0 G 0 G 0 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 02 2 0 02 C 0 0 0 2 0 2 0 1 1 2 0 C 0 0 0 2 1 2 1 0 0 2 1 T 0 0 0 0 4 2 1 0 2 0 1 A 00002343112 A 0 G 0 G 0 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 02 2 0 02 C 0 0 0 2 0 2 0 1 1 2 0 C 0 0 0 2 1 2 1 0 0 2 1 T 0 0 0 0 4 2 1 0 2 0 1 A 00002343112 A 00 0 0 0 1 5 6 4 2 3 G 0 G 0 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 02 2 0 02 C 0 0 0 2 0 2 0 1 1 2 0 C 0 0 0 2 1 2 1 0 0 2 1 T 0 0 0 0 4 2 1 0 2 0 1 A 00002343112 A 00 0 0 0 1 5 6 4 2 3 G 02 2 0 0 0 3 4 5 3 1 G 0 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 02 2 0 02 C 0 0 0 2 0 2 0 1 1 2 0 C 0 0 0 2 1 2 1 0 0 3 1 T 0 0 0 0 4 2 1 0 2 1 2 A 00002343113 A 00 0 0 0 1 5 6 4 2 3 G 02 2 0 0 0 3 4 5 3 1 G 0 2 4 2 0 0 1 2 3 4 2

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 36 Local alignment illustration (3 of 3) CTCAAGGCTCAATCA CT-AA ACCT-AAGG Best score: 6 GGCTCAATCA 0000 0 0 0 0 0 0 0 A 00 0 0 0 02 2 0 02 C 0 0 0 2 0 2 0 1 1 2 0 C 0 0 0 2 1 2 1 0 0 3 1 T 0 0 0 0 4 2 1 0 2 1 1 A 00002343113 A 00 0 0 0 1 5 6 4 2 3 G 02 2 0 0 0 3 4 5 3 1 G 0 2 4 2 0 0 1 2 3 4 2 in the whole seq. context (globally) locally

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 37 Aligning proteins Globally and Locally

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 38 Biological context Find common functional units – Structural motifs Helix-loop-helix Zinc finger … Phylogeny – Distance between species

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 39 Protein Alignment Protein local and global alignment – follows the same rules as we saw with DNA/RNA Differences (∆) – alphabet of proteins is 22 residues (aa) long – scoring/substitution matrices used (BLOSUM) protein proprieties are taken into account – residues that are totally different due to charge such as polar Lysine and apolar Glycine are given a low score

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 40 Substitution matrices Protein sequences are more complex – matrices = collection of scoring rules Matrices over events such as – mismatch and perfect match Need to define gap penalty separately E.g. BLOcks SUbstitution Matrix (BLOSUM)

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 41 BLOSUM-x matrices Constructed from aligned sequences with specific x% similarity – matrix built using sequences with no more then 50% similarity is called BLOSUM-50 For highly mutating / dissimilar sequences use – BLOSUM-45 and lower For highly conserved / similar sequences use – BLOSUM -62 and higher

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 42 BLOSUM 62 What diagonal represents? What is the score for substitution E  D (acid a.a.)? More drastic substitution K  I (basic to non-polar)? perfect match between a.a. Score = 2 Score = -3

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 43 Practical problem: Align following sequences both globally and locally using BLOSUM 62 matrix with gap penalty of -8 Sequence A: AAEEKKLAAA Sequence B: AARRIA

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 44 Aligning globally using BLOSUM 62 AAEEKKLAAA AA--RRIA-- Score: -14 Other alignment options? Yes AAEEKKLAAA 0-8-16-24-32-40-48-56-64-72-80 A-84-4-12-20-28-36-44-52-60-68 A-16-480-8-16-24-32-40-48-56 R-24-12080-6-14-22-30-38-46 R-32-20-8082-4-12-20-28-36 I-40-28-16-805-2-10-18-26 A-48-36-24-16-84-22-6-14

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 45 Aligning locally using BLOSUM 62 KKLA RRIA Score: 10 AAEEKKLAAA 00000000000 A04400000444 A04830000488 R00383220037 R00038540002 I00000526000 A044000411044

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 46 Practical 1 of 4 : Sequence Retrieval and Analysis via R

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 47 Protein database UniProt database (http://www.uniprot.org/) has high quality protein data manually curatedhttp://www.uniprot.org/ It is manually curated Each protein is assigned UniProt ID

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 48 Retrieving data from In search field one can enter either use UniProt ID or common protein name – example: myelin basic protein We will use retrieve data for P02686 Uniprot ID

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 49 FASTA format FASTA format is widely used and has the following parameters – Sequence name start with > sign – The fist line corresponds to protein name Actual protein sequence starts from 2 nd line

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 50 Retrieving protein data with R Can “talk” programmatically to UniProt database using R and seqinR library – seqinR library is suitable for “Biological Sequences Retrieval and Analysis” Detailed manual could be found herehere – Install this library in your R environment install.packages("seqinr") library("seqinr") – Choose database to retrieve data from choosebank("swissprot") – Download data object for target protein (P02686) MBP_HUMAN = query("MBP_HUMAN", "AC=P02686") – See sequence of the object MBP_HUMAN MBP_HUMAN_seq = getSequence(MBP_HUMAN); MBP_HUMAN_seq

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 51 Dot Plot (comparison of 2 sequences) (1of2) Each sequence plotted on vertical or horizontal dimension – If two a.a. from two sequences at given positions are identical the dot is plotted – matching sequence segments appear as diagonal lines (that could be parallel to the absolute diagonal line if insertion or gap is present)

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 52 Dot Plot (comparison of 2 sequences) (2of2) Visualize dot plot dotPlot(MBP_HUMAN_seq[[1]], MBP_MOUSE_seq[[1]],xlab="MBP - Human", ylab = "MBP - Mouse") - Is there similarity between human and mouse form of MBP protein? - Where is the difference in the sequence between the two isoforms? Let’s compare two protein sequences – Human MBP (Uniprot ID: P02686) – Mouse MBP (Uniprot ID: P04370) Download 2 nd mouse sequence MBP_MOUSE = query("MBP_MOUSE", "AC=P04370"); MBP_MOUSE_seq = getSequence(MBP_MOUSE); INSERTION in MBP-Human or GAP in MBP-Mouse Shift in diagonal line (identical regions) Breaks in diagonal line = regions of dissimilarity

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 53 Practical 2 of 4: Pairwise global and local alignments via R and Biostrings

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 54 Installing Biostrings library DNA_subst_matrix

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 55 Global alignment using R and Biostrings Create two sting vectors (i.e. sequences) seqA = "GATTA" seqB = "GTTA" Use pairwiseAlignment() and the defined rules globalAlignAB = pairwiseAlignment(seqA, seqB, substitutionMatrix = DNA_subst_matrix, gapOpening = -2, scoreOnly = FALSE, type="global") Visualize best paths (i.e. alignments) globalAlignAB Global PairwiseAlignedFixedSubject (1 of 1) pattern: [1] GATTA subject: [1] G-TTA score: 2

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 56 Local alignment using R and Biostrings Input two sequences seqA = "AGGATTTTAAAA" seqB = "TTTT" The scoring rules will be the same as we used for global alignment localAlignAB = pairwiseAlignment(seqA, seqB, substitutionMatrix = DNA_subst_matrix, gapOpening = -2, scoreOnly = FALSE, type="local") Visualize alignment globalAlignAB Local PairwiseAlignedFixedSubject (1 of 1) pattern: [5] TTTT subject: [1] TTTT score: 8

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 57 Aligning protein sequences Protein sequences alignments are very similar except the substitution matrix is specified data(BLOSUM62) BLOSUM62 Will align sequences seqA = "PAWHEAE" seqB = "HEAGAWGHEE" Execute the global alignment globalAlignAB <- pairwiseAlignment(seqA, seqB, substitutionMatrix = "BLOSUM62", gapOpening = -2, gapExtension = -8, scoreOnly = FALSE)

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 58 Practical 3 of 4: DNA sequence statistics and Seqinr

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 59 Retrieving genome sequence data Can retrieve sequence data from NCBI 1.Manually via webGUI 2.Programmatically via R DEN-1 Dengue virus genome sequence, which has NCBI accession NC_001477 Gain in speed compared to manual retrieval More complex queries

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 60 Manually NCBI Sequence Database via its website www.ncbi.nlm.nih.govwww.ncbi.nlm.nih.gov Dengue DEN-1 DNA sequence is a viral DNA sequence NCBI accession is NC_001477

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 61 NCBI database

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 62 Retrieving FASTA sequence To retrieve the DNA sequence as a FASTA format sequence file – click on “Send” at the top right choose “File” in the pop-up menu – and then choose FASTA from the “Format” » click on “Create file”.

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 63 Retrieving genome sequence data using SeqinR Can retrieve sequences much faster programmatically getncbiseq <- function(accession) { require("seqinr"); # this function requires the SeqinR R package # first find which ACNUC database the accession is stored in: dbs <- c("genbank","refseq","refseqViruses","bacterial"); numdbs <- length(dbs); for (i in 1:numdbs) { db <- dbs[i]; choosebank(db); # check if the sequence is in ACNUC database 'db': resquery <- try(query(".tmpquery", paste("AC=", accession)), silent = TRUE); if (!(inherits(resquery, "try-error"))) { queryname <- "query2"; thequery <- paste("AC=",accession,sep=""); query2 <- query(`queryname`,`thequery`); # see if a sequence was retrieved: seq <- getSequence(query2$req[[1]]); closebank(); return(seq); } closebank(); } print(paste("ERROR: accession",accession,"was not found")); } dengueseq <- getncbiseq("NC_001477"); dengueseq[1:50]; length(dengueseq);

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 64 Base composition of a DNA sequence Count frequencies of the 4 nucleotides table(dengueseq); dengueseq; a c g t 3426 2240 2770 2299 This means that the DEN-1 Dengue virus genome sequence has – 3426 As, 2240 Cs, 2770 Gs and 2299 Ts

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 65 GC Content of DNA the fraction of the sequence that consists of Gs and Cs, ie. the %(G+C). – the percentage of the bases in the genome that are – Gs or Cs GC(dengueseq) [1] 0.4666977 (2240+2770)*100/(3426+2240+2770+2299) [1] 46.66977

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 66 Di-nucleotides it is also interesting to know – the frequency of longer DNA “words” – dinucleotides ie. “AA”, “AG”, “AC”, “AT”, “CA”, “CG”, “CC”, “CT”, “GA”, “GG”, “GC”, “GT”, “TA”, “TG”, “TC”, and “TT” count(dengueseq, 2) aa ac ag at 1108 720 890 708 ca cc cg ct 901 523 261 555 ga gc gg gt 976 500 787 507 ta tc tg tt 440 497 832 529

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 67 Practical 4 of 4: Using BLAST for sequence identification

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 68 BLAST Basic Local Alignment Search Tool Many different types http://blast.ncbi.nlm.nih.gov/Blast.cgi

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 69 Types blastn – nucleotide query vs nucleotide database blastp – protein query vs protein DB blastx – translated in 6 frames nucleotide query vs protein DB

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 70 Sequence identity Want to know which genes are coded by the genomic sequence >human_genomic_seq TGGACTCTGCTTCCCAGACAGTACCCCTGACAGTGACAGAACTGCCACTCTCCCCACCTG ACCCTGTTAGGAAGGTACAACCTATGAAGAAAAAGCCAGAATACAGGGGACATGTGAGCC ACAGACAACACAAGTGTGCACAACACCTCTGAGCTGAGCTTTTCTTGATTCAAGGGCTAG TGAGAACGCCCCGCCAGAGATTTACCTCTGGTCTTCTGAGGTTGAGGGCTCGTTCTCTCT TCCTGAATGTAAAGGTCAAGATGCTGGGCCTCAGTTTCCTCTTACATACTCACCAAAAGG CTCTCCTGATCAGAGAAGCAGGATGCTGCACTTGTCCTCCTGTCGATGCTCTTGGCTATG ACAAAATCTGAGCTTACCTTCTCTTGCCCACCTCTAAACCCCATAAGGGCTTCGTTCTGT GTCTCTTGAGAATGTCCCTATCTCCAACTCTGTCATACGGGGGAGAGCGAGTGGGAAGGA TCCAGGGCAGGGCTCAGACCCCGGCGCATGGACCTAGTCGGGGGCGCTGGCTCAGCCCC GCCCCGCGCGCCCCCGTCGCAGCCGACGCGCGCTCCCGGGAGGCGGCGGCAGAGGCAG CATCCACAGCATCAGCAGCCTCAGCTTCATCCCCGGGCGGTCTCCGGCGGGGAAGGCCGG TGGGACAAACGGACAGAAGGCAAAGTGCCCGCAATGGAGGGAGCATCCTTTGGCGCGG GCCGTGCGGGAGCTGCCTTTGATCCCGTGAGCTTTGCGCGGCGGCCCCAGACCCTGTTGC GGGTCGTGTCCTGG

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 71 BLAST GUI

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 72 Results Top hit Mus musculus targeted KO-first, conditional ready, lacZ-tagged mutant allele Tbl3:tm1a(EUCOMM)Hmgu; transgenic

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 73

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 74 Resources Online Tutorial on Sequence Alignment – http://a-little-book-of-r-for- bioinformatics.readthedocs.org/en/latest/src/chapter4.html http://a-little-book-of-r-for- bioinformatics.readthedocs.org/en/latest/src/chapter4.html Graphical alignment of proteins – http://www.itu.dk/~sestoft/bsa/graphalign.html http://www.itu.dk/~sestoft/bsa/graphalign.html Pairwise alignment of DNA and proteins using your rules: – http://www.bioinformatics.org/sms2/pairwise_align_dna.html http://www.bioinformatics.org/sms2/pairwise_align_dna.html Documentation on libraries – Biostings: http://www.bioconductor.org/packages/2.10/bioc/manuals/Biostrings/man/Biostrings.pdf http://www.bioconductor.org/packages/2.10/bioc/manuals/Biostrings/man/Biostrings.pdf – SeqinR: http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdfhttp://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Similar presentations

Presentation on theme: "Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________.

Similar presentations

Presentation on theme: "Bioinformatics GBIO0002 -1 Biological Sequences ____________________________________________________________________________________________________________________."— Presentation transcript:

Similar presentations

About project

Feedback