Fa 06CSE182 CSE182-L6 Protein sequence analysis Fa 06CSE182 Possible domain queries Case 1: –You have a collection of sequences that belong to a family.

Fa 06CSE182 CSE182-L6 Protein sequence analysis

Fa 06CSE182 Possible domain queries Case 1: –You have a collection of sequences that belong to a family (contain a functional domain). –Given an ‘orphan’ sequence, does it belong to the family? –There are different solutions depending upon the representation of the domain (patterns/alignments/HMM/profiles) Case 2: –You have an orphan sequence from an uncharacterized family. Can you identify other members of the family, and create a representation of them (Harder problem).

Fa 06CSE182 EX: Innexins The Macagno lab is studying Gap junction proteins, Innexins (invertebrate analogs of connexins) in Hirudo Innexins have been found in C. elegans, and Drosophila. In C. elegans, 25 members of this family have been found, and partially categorized.

Fa 06CSE182 Innexins in Hirudo When certain Innexins are knocked out, they cause serious defects in cells in the ganglia. The EST database (partial gene sequences) contains a number of putative Innexins, discovered via BLAST. Project: Q: Can you confirm that these are Innexins. Can you find more members? (this lecture) Q: Can you characterize them w.r.t known innexins in C. elegans, and Drosophila? Q: Use your method for other families of interest. Netrins, and their receptors.

Fa 06CSE182 Not all features(residues) are important Skin patterns Facial Features

Fa 06CSE182 Protein sequence motifs Premise: The sequence of a protein sequence gives clues about its structure and function. Not all residues are equally important in determining function. Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not. The key residues can be identified if we had structural information, or through conserved residues in an alignment of the family.

Fa 06CSE182 Representation of domains/families. We will consider a number of representations that describe key residues, characteristic of a family –Patterns (regular expressions) –Alignments –Profiles –HMMs Start with the following: –A collection of sequences with the same function. –Region/residues known to be significant for maintaining structure and function. Develop a pattern of conserved residues around the residues of interest Iterate for appropriate sensitivity and specificity

Fa 06CSE182 From alignment to patterns * ALRDFATHDDF SMTAEATHDSI ECDQAATHEAS ATH-[DE] Search a database with the resulting pattern Refine pattern to eliminate false positives Iterate

Fa 06CSE182 Regular Expression Patterns Zinc Finger motif –C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H –2 conserved C, and 2 conserved H How can we search a database using these motifs? –The motif is described using a regular expression. What is a regular expression?

Fa 06CSE182 Regular Expressions Concise representation of a set of strings over alphabet . Described by a string over R is a r.e. if and only if

Fa 06CSE182 Regular Expression Q: Let  ={A,C,E} –Is (A+C)*EEC* a regular expression? –Is *(A+C) regular? Q: When is a string s in a regular expression? –R =(A+C)*EEC* –Is CEEC in R? –AEC? –ACEE?

Fa 06CSE182 Regular Expression & Automata  Every R.E can be expressed by an automaton (a directed graph) with the following properties: –The automaton has a start and end node –Each edge is labeled with a symbol from , or   Suppose R is described by automaton A  S  R if and only if there is a path from start to end in A, labeled with s.

Fa 06CSE182 Examples: Regular Expression & Automata (A+C)*EEC* CA C startend EE –Is CEEC in R? –AEC? –ACEE? –ACE?

Fa 06CSE182 Constructing automata from R.E R = {  } R = {  },    R = R 1 + R 2 R = R 1 · R 2 R = R 1 *      

Fa 06CSE182 Regular Expression Matching Given a database D, and a regular expression R, is a substring of D in R? Is there a string D[l..c] that is accepted by the automaton of R? Simpler Q: Is D[1..c] accepted by the automaton of R?

Fa 06CSE182 Alg. For matching R.E. If D[1..c] is accepted by the automaton R A –There is a path labeled D[1]…D[c] that goes from START to END in R A D[1] D[2]  D[c]

Fa 06CSE182 Alg. For matching R.E. If D[1..c] is accepted by the automaton R A –There is a path labeled D[1]…D[c] that goes from START to END in R A –There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END D[1].. D[c-1] D[c] u

Fa 06CSE182 D.P. to match regular expression Define: –A[u,  ] = Automaton node reached from u after reading  –Eps(u): set of all nodes reachable from node u using epsilon transitions. –N[c] = subset of nodes reachable from START node after reading D[1..c] –Q: when is v  N[c]  u v  u Eps(u)

Fa 06CSE182 Q: when is v  N[c]? A: If for some u  N[c-1], w = A[u,D[c]], v  {w}+ Eps(w) D.P. to match regular expression

Fa 06CSE182 Algorithm

Fa 06CSE182 The final step We have answered the question: –Is D[1..c] accepted by R? –Yes, if END  N[c] We need to answer –Is D[l..c] (for some l, and some c) accepted by R

Fa 06CSE182 Representation 2: Profiles Profiles versus regular expressions –Regular expressions are intolerant to an occasional mis- match. –The Union operation (I+V+L) does not quantify the relative importance of I,V,L. It could be that V occurs in 80% of the family members. –Profiles capture some of these ideas.

Fa 06CSE182 Profiles Start with an alignment of strings of length m, over an alphabet A, Build an |A| X m matrix F=(f ki ) Each entry f ki represents the frequency of symbol k in position i 0.71 0.14 0.28

Fa 06CSE182 Scoring matrices Given a sequence s, does it belong to the family described by a profile? We align the sequence to the profile, and score it Let S(i,j) be the score of aligning position i of the profile to residue s j The score of an alignment is the sum of column scores. s sjsj i

Fa 06CSE182 Scoring Profiles k i s f ki Scoring Matrix

Fa 06CSE182 Domain analysis via profiles Given a database of profiles of known domains/families, we can query our sequence against each of them, and choose the high scoring ones to functionally characterize our sequences. What if the sequence matches some other sequences weakly (using BLAST), but does not match any Profile?

Fa 06CSE182 Psi-BLAST idea Iterate: –Find homologs using Blast on query –Discard very similar homologs –Align, make a profile, search with profile. –Why is this more sensitive? Seq Db

Fa 06CSE182 Psi-BLAST speed Two time consuming steps. 1.Multiple alignment of homologs 2.Searching with Profiles. 1.Does the keyword search idea work? Multiple alignment: –Use ungapped multiple alignments only Pigeonhole principle again: –If profile of length m must score >= T –Then, a sub-profile of length l must score >= lT|/m –Generate all l-mers that score at least lT|/M –Search using an automaton

Fa 06CSE182 Representation 3: HMMs Question: your ‘friend’ likes to gamble. He tosses a coin: HEADS, he gives you a dollar. TAILS, you give him a dollar. Usually, he uses a fair coin, but ‘once in a while’, he uses a loaded coin. Can you say what fraction of the times he loads the coin?

Fa 06CSE182 Representation 3: HMMs Building good profiles relies upon good alignments. –Difficult if there are gaps in the alignment. –Psi-BLAST/BLOCKS etc. work with gapless alignments. An HMM representation of Profiles helps put the alignment construction/membership query in a uniform framework. V

Fa 06CSE182 The generative model Think of each column in the alignment as generating a distribution. For each column, build a node that outputs a residue with the appropriate distribution 0.71 0.14 Pr[F]=0.71 Pr[Y]=0.14

Fa 06CSE182 A simple Profile HMM Connect nodes for each column into a chain. Thie chain generates random sequences. What is the probability of generating FKVVGQVILD? In this representation –Prob [New sequence S belongs to a family]= Prob[HMM generates sequence S] What is the difference with Profiles?

Fa 06CSE182 Profile HMMs can handle gaps The match states are the same as on the previous page. Insertion and deletion states help introduce gaps. A sequence may be generated using different paths.

Fa 06CSE182 Example Probability [ALIL] is part of the family? Note that multiple paths can generate this sequence. –M 1 I 1 M 2 M 3 –M 1 M 2 I 2 M 3 In order to compute the probabilities, we must assign probabilities of transition between states A L - L A I V L A I - L

Fa 06CSE182 Profile HMMs Directed Automaton M with nodes and edges. –Nodes emit symbols according to ‘emission probabilities’ –Transition from node to node is guided by ‘transition probabilities’ Joint probability of seeing a sequence S, and path P –Pr[S,P|M] = Pr[S|P,M] Pr[P|M] –Pr[ALIL AND M 1 I 1 M 2 M 3 ] = Pr[ALIL| M 1 I 1 M 2 M 3,M] Pr[M 1 I 1 M 2 M 3 | M] Pr[ALIL | M] = ?

Fa 06CSE182 Protein structure basics

Fa 06CSE182 Side chains determine amino-acid type The residues may have different properties. Aspartic acid (D), and Glutamic Acid (E) are acidic residues

Fa 06CSE182 Bond angles form structural constraints

Fa 06CSE182 Various constraints determine 3d structure Constraints –Structural constraints due to physiochemical properties –Constraints due to bond angles –H-bond formation Surprisingly, a few conformations are seen over and over again.

Fa 06CSE182 Alpha-helix 3.6 residues per turn H-bonds between 1st and 4th residue stabilize the structure. First discovered by Linus Pauling

Fa 06CSE182 Beta-sheet Each strand by itself has 2 residues per turn, and is not stable. Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel. Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions.

Fa 06CSE182 Domains The basic structures (helix, strand, loop) combine to form complex 3D structures. Certain combinations are popular. Many sequences, but only a few folds

Fa 06CSE182 3D structure Predicting tertiary structure is an important problem in Bioinformatics. Premise: Clues to structure can be found in the sequence. While de novo tertiary structure prediction is hard, there are many intermediate, and tractable goals. The PDB database is a compendium of structures PDB

Fa 06CSE182 Searching structure databases Threading, and other 3d Alignments can be used to align structures. Database filtering is possible through geometric hashing.

Fa 06CSE182 Trivia Quiz What research won the Nobel prize in Chemistry in 2004? In 2002?

Fa 06CSE182 How are Proteins Sequenced? Mass Spec 101:

Fa 06CSE182 Nobel Citation 2002

Fa 06CSE182 Nobel Citation, 2002

Fa 06CSE182 Mass Spectrometry

Fa 06CSE182 Sample Preparation Enzymatic Digestion (Trypsin) + Fractionation

Fa 06CSE182 Single Stage MS Mass Spectrometry LC-MS: 1 MS spectrum / second

Fa 06CSE182 Tandem MS Secondary Fragmentation Ionized parent peptide

Fa 06CSE182 CSE182-L6 Protein sequence analysis Fa 06CSE182 Possible domain queries Case 1: –You have a collection of sequences that belong to a family.

Similar presentations

Presentation on theme: "Fa 06CSE182 CSE182-L6 Protein sequence analysis Fa 06CSE182 Possible domain queries Case 1: –You have a collection of sequences that belong to a family."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Fa 06CSE182 CSE182-L6 Protein sequence analysis Fa 06CSE182 Possible domain queries Case 1: –You have a collection of sequences that belong to a family.

Similar presentations

Presentation on theme: "Fa 06CSE182 CSE182-L6 Protein sequence analysis Fa 06CSE182 Possible domain queries Case 1: –You have a collection of sequences that belong to a family."— Presentation transcript:

Similar presentations

About project

Feedback