1 Lesson 2 Aligning sequences and searching databases.

Presentation transcript:

1 Lesson 2 Aligning sequences and searching databases

2 Homology and sequence alignment.

3 Homology
Homology = similarity between objects due to common ancestry.
Example: Hund = Dog, Schwein = Pig

4 Sequence homology
Similarity between sequences as a result of common ancestry.
    VLSPAVKWAKVGAHAAGHG
    ||| || |||| |  ||||
    VLSEAVLWAKVEADVAGHG
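The row of bars between the two sequences marks identical positions. As a minimal sketch (the function name `match_line` is made up for illustration and is not part of the lesson), such a row can be produced for any two aligned sequences of equal length:

```python
def match_line(seq_a: str, seq_b: str) -> str:
    """Return '|' under identical residues and ' ' elsewhere (gaps never match)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if a == b and a != "-" else " " for a, b in zip(seq_a, seq_b)
    )


top = "VLSPAVKWAKVGAHAAGHG"
bottom = "VLSEAVLWAKVEADVAGHG"
print(top)
print(match_line(top, bottom))
print(bottom)
```

Counting the bars and dividing by the alignment length gives a simple percent-identity figure for the pair.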

5 Sequence alignment Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

6 Why align?
    VLSPAVKWAKV
    ||| || ||||
    VLSEAVLWAKV
1. To detect whether two sequences are homologous. If so, homology may indicate similarity in function (and structure).
2. Required for evolutionary studies (e.g., tree reconstruction).
3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).
4. Given sequenced DNA from an unknown region, to locate it by aligning it to the genome.

7 Insertions, deletions, and substitutions

8 Sequence alignment
If two sequences share a common ancestor (for example, human and dog hemoglobin), we can represent their evolutionary relationship using a tree.
    VLSPAV-WAKV
    ||| || ||||
    VLSEAVLWAKV
[Figure: tree whose two leaves are VLSPAV-WAKV and VLSEAVLWAKV]

9 Perfect match
    VLSPAV-WAKV
    ||| || ||||
    VLSEAVLWAKV
A perfect match suggests that no change has occurred since the common ancestor (although this is not always the case).

10 A substitution
    VLSPAV-WAKV
    ||| || ||||
    VLSEAVLWAKV
A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it occurred).

11 Indel
    VLSPAV-WAKV
    ||| || ||||
    VLSEAVLWAKV
Option 1: The ancestor (VLSEAVLWAKV) had the L, and it was lost in the lineage leading to VLSPAV-WAKV. In such a case, the event was a deletion.

12 Indel
    VLSPAV-WAKV
    ||| || ||||
    VLSEAVLWAKV
Option 2: The ancestor was shorter, and the L was inserted in the lineage leading to VLSEAVLWAKV. In such a case, the event was an insertion.
[Figure: tree labelled with VLSPAV-WAKV, VLSEAVWAKV and VLSEAVLWAKV; the inserted L is marked]

13 Indel
    VLSPAV-WAKV
    VLSEAVLWAKV
Deletion? Insertion?
Normally, given two sequences, we cannot tell whether the event was an insertion or a deletion, so we call it an indel.

14 Indels in protein coding genes
Indels in protein-coding genes often have lengths of 3 bp, 6 bp, 9 bp, etc., because an indel whose length is a multiple of three does not shift the reading frame.
Gene search: in fact, searching for indels of length 3K (K = 1, 2, 3, …) can help algorithms that search a genome for coding regions.
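The "length 3K" rule can be expressed as a one-line check. This is only an illustrative sketch; the function name `preserves_reading_frame` is hypothetical and the slide does not prescribe any implementation:

```python
def preserves_reading_frame(indel_length_bp: int) -> bool:
    """True if an indel of this length (in base pairs) keeps codons in frame."""
    if indel_length_bp < 1:
        raise ValueError("indel length must be a positive number of base pairs")
    return indel_length_bp % 3 == 0


for length in (1, 2, 3, 6, 7, 9):
    label = "in frame" if preserves_reading_frame(length) else "frameshift"
    print(f"{length} bp indel: {label}")
```

A gene finder could use a test like this as one weak signal among many: a region where observed indels are mostly multiples of three is more likely to be coding.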

15 Global and Local pairwise alignments

16 Global vs. Local
Global alignment: finds the best alignment across the entire length of the two sequences.
Local alignment: finds regions of similarity in parts of the sequences.
Global alignment forces alignment in regions that differ:
    ADLGAVFALCDRYFQ
    ||||     |||| |
    ADLGRTQN-CDRYYQ
Local alignment returns only the regions of good alignment:
    ADLG    CDRYFQ
    ||||    |||| |
    ADLG    CDRYYQ

17 Global alignment PTK2 protein tyrosine kinase 2 of human and rhesus monkey

18 Proteins are composed of domains
[Figure: domain diagram of human PTK2, labelled Domain A, Protein tyrosine kinase domain, and Domain B]

19 In leukocytes, a different gene for tyrosine kinase is expressed.
[Figure: domain diagram with labels Domain X, Protein tyrosine kinase domain, and Domain A]

20 The sequence similarity is restricted to a single domain
[Figure: domain diagrams of Leukocyte TK and PTK2; labels shown: Domain X, Domain A, Domain B, and two Protein tyrosine kinase domain labels]

21 Global alignment of PTK and LTK

22 Local alignment of PTK and LTK

23 Conclusions Use global alignment when the two sequences share the same overall sequence arrangement. Use local alignment to detect regions of similarity.
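To make the global/local distinction concrete, here is a minimal sketch that computes only the best score of each kind by dynamic programming. The slides do not name an algorithm at this point; the sketch assumes the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences, and borrows the naive scores (+1 match, -2 mismatch, -1 gap) that this lesson introduces later. Function names are illustrative.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -2, gap: int = -1) -> int:
    """Best score of an alignment covering both sequences end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against nothing costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match: int = 1, mismatch: int = -2, gap: int = -1) -> int:
    """Best score over all pairs of substrings; never below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # 0 restarts the alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


seq1 = "ADLGAVFALCDRYFQ"
seq2 = "ADLGRTQNCDRYYQ"
print("global:", global_score(seq1, seq2))
print("local: ", local_score(seq1, seq2))
```

The only difference between the two functions is the boundary handling and the extra `0` in the local recurrence, which lets an alignment start fresh wherever extending it would make the score negative.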

24 How alignments are computed

25 Pairwise alignment
Two sequences:
    AAGCTGAATTCGAA
    AGGCTCATTTCTGA
One possible alignment:
    AAGCTGAATT-C-GAA
    AGGCT-CATTTCTGA-

26
    AAGCTGAATT-C-GAA
    AGGCT-CATTTCTGA-
This alignment includes: 2 mismatches, 4 indels (gaps), and 10 perfect matches.
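The tally on this slide can be checked mechanically. The following sketch (the function name `count_columns` is illustrative) walks the alignment column by column and classifies each column:

```python
def count_columns(top: str, bottom: str) -> tuple[int, int, int]:
    """Return (matches, mismatches, gaps) for two aligned rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1          # a column containing a gap is an indel
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


print(count_columns("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))
```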

27 Choosing an alignment for a pair of sequences
Many different alignments are possible for 2 sequences:
    AAGCTGAATTCGAA
    AGGCTCATTTCTGA
Alignment 1:
    AAGCTGAATT-C-GAA
    AGGCT-CATTTCTGA-
Alignment 2:
    A-AGCTGAATTC--GAA
    AG-GCTCA-TTTCTGA-
Which alignment is better?

28 Scoring system (naïve)
Perfect match: +1; Mismatch: -2; Indel (gap): -1
    AAGCTGAATT-C-GAA
    AGGCT-CATTTCTGA-
Score = (+1)x10 + (-2)x2 + (-1)x4 = 2
    A-AGCTGAATTC--GAA
    AG-GCTCA-TTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x6 = -1
Higher score => better alignment
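The naive scoring system is simple enough to apply in a few lines. This sketch sums a per-column score using the slide's values (+1, -2, -1); the function name `naive_score` and its keyword parameters are illustrative only:

```python
def naive_score(top: str, bottom: str, match: int = 1, mismatch: int = -2, gap: int = -1) -> int:
    """Sum of independent per-column scores for two aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


first = naive_score("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = naive_score("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)  # the first alignment scores higher, so it is preferred
```

Changing the keyword arguments shows how sensitive the ranking is to the scoring system: with a gap score of 0, for instance, both alignments are scored on matches and mismatches alone.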

29 Alignment scoring: scoring of sequence similarity
Assumes independence between positions: each position is considered separately.
Scores each position: positive if identical (match), negative if different (mismatch or gap).
Total score = sum of position scores. It can be positive or negative.

30 Scoring systems

31 Scoring system
In the example above, the choice of +1 for a match, -2 for a mismatch, and -1 for a gap is quite arbitrary.
Different scoring systems => different alignments.
We want a good scoring system…

32 Scoring matrix
Representing the scoring system as a symmetric n x n table or matrix (n is the number of letters in the alphabet: n = 4 for nucleotides, n = 20 for amino acids).
        A    G    C    T
    A   2
    G  -6    2
    C             2
    T                  2

33 DNA scoring matrices
Uniform substitutions between all nucleotides:
    From \ To   A    G    C    T
    A           2
    G          -6    2
    C                     2
    T                          2
Match: 2 (diagonal). Mismatch: -6 (off-diagonal, e.g., A-G).
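A uniform matrix like this one needs only two numbers. The sketch below builds it as a dictionary keyed by nucleotide pairs, using the slide's values of 2 for a match and -6 for a mismatch; the function name `uniform_matrix` and the dictionary representation are assumptions made for illustration:

```python
def uniform_matrix(alphabet: str = "ACGT", match: int = 2, mismatch: int = -6) -> dict:
    """Scoring matrix in which every substitution gets the same penalty."""
    return {
        (x, y): (match if x == y else mismatch)
        for x in alphabet
        for y in alphabet
    }


matrix = uniform_matrix()
for x in "ACGT":
    print(x, " ".join(f"{matrix[(x, y)]:3d}" for y in "ACGT"))
```

Because the score depends only on whether the two letters are equal, the matrix is automatically symmetric.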

34 DNA scoring matrices Can take into account biological phenomena such as: Transition-transversion
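A transition is a substitution within the same chemical class (purine to purine, A<->G, or pyrimidine to pyrimidine, C<->T); a transversion crosses classes. Transitions are typically observed more often, so a matrix may penalize them less. The sketch below shows one way to encode that; the specific penalties (-4 and -6) are placeholder values chosen for illustration and do not come from the slides:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x: str, y: str, match: int = 2,
                       transition: int = -4, transversion: int = -6) -> int:
    """Score a nucleotide pair, penalizing transitions less than transversions.

    The default penalty values are hypothetical, for illustration only.
    """
    if x == y:
        return match
    pair = {x, y}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return transition if same_class else transversion


for x in "ACGT":
    print(x, " ".join(f"{substitution_score(x, y):3d}" for y in "ACGT"))
```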

35 Amino-acid scoring matrices Take into account physico-chemical properties

36 Scoring gaps (I)
In advanced algorithms, two separate gaps of one amino acid each are scored differently from a single gap of two amino acids. This is done by charging a penalty for each gap that is opened and a smaller penalty for each position by which it is extended:
Gap extension penalty < Gap opening penalty
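One common way to realize this is an affine gap penalty. The sketch below assumes the convention that the opening penalty covers the first gap position and each further position costs the extension penalty; the values -4 and -1 are hypothetical, as are the function names:

```python
from itertools import groupby


def affine_gap_penalty(length: int, gap_open: int = -4, gap_extend: int = -1) -> int:
    """Penalty for one gap of `length` positions (values are illustrative)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + (length - 1) * gap_extend


def total_gap_penalty(aligned_row: str, gap_open: int = -4, gap_extend: int = -1) -> int:
    """Sum the affine penalties of every run of '-' in one aligned row."""
    total = 0
    for char, run in groupby(aligned_row):
        if char == "-":
            total += affine_gap_penalty(len(list(run)), gap_open, gap_extend)
    return total


print(total_gap_penalty("AC-GT-A"))  # two separate gaps of one position each
print(total_gap_penalty("AC--GTA"))  # one gap of two positions
```

With these values, two separate one-position gaps cost more than a single two-position gap, which matches the intuition that one longer indel event is more plausible than two independent ones.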

37 Scoring gaps (II)
The dependency between the penalty and the length of the gap need not be linear.
Linear penalty:
    Top sequence   Bottom sequence   Score
    AGGGTTC-GA     AGGGTTCTGA        -2
    AGGGTT--GA     AGGGTTCTGA        -4
    AGGGT---GA     AGGGTTCTGA        -6
    AGGG----GA     AGGGTTCTGA        -8

38 Scoring gaps (II)
The dependency between the penalty and the length of the gap need not be linear.
Non-linear penalty:
    Top sequence   Bottom sequence   Score
    AGGGTTC-GA     AGGGTTCTGA        -4
    AGGGTT--GA     AGGGTTCTGA        -6
    AGGGT---GA     AGGGTTCTGA        -7
    AGGG----GA     AGGGTTCTGA        -8
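The two penalty schemes on these slides can be placed side by side. In this sketch the linear rule is a function, while the non-linear scores are simply copied from the slide into a lookup table, since the slide gives values rather than a formula; the names are illustrative:

```python
def linear_gap_penalty(length: int, per_position: int = -2) -> int:
    """Linear scheme from the slide: every gap position costs the same."""
    return per_position * length


# Non-linear scores as listed on the slide for gap lengths 1 to 4.
NONLINEAR_GAP_PENALTY = {1: -4, 2: -6, 3: -7, 4: -8}

previous = 0
for length in range(1, 5):
    nonlinear = NONLINEAR_GAP_PENALTY[length]
    marginal = nonlinear - previous  # cost of the most recently added position
    print(f"length {length}: linear {linear_gap_penalty(length)}, "
          f"non-linear {nonlinear} (marginal {marginal})")
    previous = nonlinear
```

The marginal column makes the difference visible: under the non-linear scheme the first gap position is the most expensive and each further position costs less.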

39 PAM AND BLOSUM

40 Amino-acid substitution matrices
Actual substitutions:
– Based on empirical data
– Commonly used by many bioinformatics programs
– PAM & BLOSUM

41 Protein matrices – actual substitutions
The idea: given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.
    M G Y D E
    M G Y E E
    M G Y D E
    M G Y Q E
    M G Y D E
    M G Y E E
In the fourth column, E and D are found in 7 / 8
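The counting step behind this idea can be sketched as follows. The example uses only the six rows that survive in this transcript, so its tallies describe those rows and nothing more; the function names and the pair-counting convention (every unordered pair of rows in a column) are assumptions for illustration:

```python
from collections import Counter
from itertools import combinations

ROWS = ["MGYDE", "MGYEE", "MGYDE", "MGYQE", "MGYDE", "MGYEE"]


def column_counts(rows: list[str], column: int) -> Counter:
    """How often each amino acid appears in one column (0-based index)."""
    return Counter(row[column] for row in rows)


def pair_counts(rows: list[str], column: int) -> Counter:
    """Count every unordered pair of residues observed together in a column."""
    pairs = Counter()
    for a, b in combinations((row[column] for row in rows), 2):
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


print(column_counts(ROWS, 3))
print(pair_counts(ROWS, 3))
```

Pair counts like these, accumulated over many alignments and compared with what chance alone would produce, are the raw material for empirical matrices such as PAM and BLOSUM.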

42 PAM Matrix - Point Accepted Mutations
The Dayhoff PAM matrix is based on a database of 1,572 changes in 71 groups of closely related proteins (85% identity => alignment was easy and reliable).
The number of substitutions was counted for each amino-acid pair (20 x 20).
Common substitutions were found to occur between chemically similar amino acids.

43 PAM Matrices
Family of matrices: PAM 80, PAM 120, PAM 250.
The number on the PAM matrix represents evolutionary distance.
Larger numbers are for larger distances.
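As the Dayhoff model is usually described, the higher-numbered matrices come from multiplying a one-step mutation probability matrix (PAM1) by itself n times, and the published scores are then log-odds values derived from those probabilities. The sketch below illustrates only the first part, on a made-up three-letter alphabet; `TOY_PAM1` and its probabilities are invented for illustration and are not Dayhoff's data:

```python
def mat_mul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(matrix: list[list[float]], exponent: int) -> list[list[float]]:
    """Raise a square matrix to a non-negative integer power."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(exponent):
        result = mat_mul(result, matrix)
    return result


# Hypothetical one-step matrix: each residue stays the same with probability 0.99.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

for n in (80, 120, 250):
    unchanged = mat_pow(TOY_PAM1, n)[0][0]
    print(f"toy PAM{n}: P(residue unchanged) = {unchanged:.3f}")
```

The probability of a residue remaining unchanged falls as n grows, which is the sense in which larger PAM numbers correspond to larger evolutionary distances.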

44 Example: PAM 250
Similar amino acids have a greater score.

45 PAM - limitations
Based on only a single, limited dataset.
Examines proteins with few differences (85% identity).
Based mainly on small globular proteins, so the matrix is biased.

46 BLOSUM Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset BLOSUM observes significantly more replacements than PAM, even for infrequent pairs

47 BLOSUM: BLOcks SUbstitution Matrix
Based on the BLOCKS database
– ~2000 blocks from 500 families of related proteins
– Families of proteins with identical function
Blocks are short conserved patterns of 3-60 amino acids without gaps
    AABCDA----BBCDA
    DABCDA----BBCBB
    BBBCDA-AA-BCCAA
    AAACDA-A--CBCDB
    CCBADA---DBBDCC
    AAACAA----BBCCC

48 BLOSUM
Each block represents a sequence alignment with a different identity percentage.
For each block, the amino-acid substitution rates were calculated to create the BLOSUM matrix.

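49 BLOSUM Matrices
BLOSUMn is built by clustering the sequences in each block that share at least n percent identity and counting each cluster as a single sequence, so substitutions are tallied only between sequences that are less than n percent identical.
BLOSUM62 represents closer sequences than BLOSUM45.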

50 Example: BLOSUM62
Derived from blocks in which sequences sharing at least 62% identity were clustered and counted as one.

51 PAM vs. BLOSUM
Approximately equivalent matrices, ordered from closer to more distant sequences:
    PAM100 ≈ BLOSUM90
    PAM120 ≈ BLOSUM80
    PAM160 ≈ BLOSUM60
    PAM200 ≈ BLOSUM52
    PAM250 ≈ BLOSUM45

52 Intermediate summary
1. Scoring system = substitution matrix + gap penalty.
2. Used for both global and local alignment.
3. For amino acids, there are two types of substitution matrices: PAM and BLOSUM.