Sequence Similarity.


Presentation on theme: "Sequence Similarity."— Presentation transcript:

1 Sequence Similarity

2 Why sequence similarity
Structural similarity: >25% sequence identity → similar structure
Evolutionary relationship: all proteins come from < 2000 (super)families
Related functional role: similar structure → similar function; functional modules are often preserved

3 Muscle cells and contraction
Lodish et al. p. 796: “Muscle cells have evolved to carry out [] contraction. Muscle contractions [] occur quickly and repetitively, and [] through long distances with enough force to move large loads. A typical skeletal muscle cell, called a myofiber, is cylindrical, large (1-40 mm in length and microns in width), and multinucleated (containing as many as 100 nuclei). The cytoplasm is packed with a regular repeating array of filament bundles organized into a specialized structure called a sarcomere. A chain of sarcomeres, each about 2 microns long in resting muscle, constitutes a myofibril. The sarcomere is both the structural and the functional unit of skeletal muscle. During contraction, the sarcomeres are shortened to about 70 percent of their uncontracted, resting length. Electron microscopy and biochemical analysis have shown that each sarcomere contains two types of filaments: thick filaments, composed of myosin II, and thin filaments, containing actin.”

4 Actin and myosin during muscle movement
The results of studies of muscle contraction show that myosin heads slide or walk along actin filaments. Myosin harnesses the energy released by ATP hydrolysis to move along an actin filament. Myosin undergoes a series of events during each step of movement. In the course of one cycle, myosin must exist in at least three conformational states: an ATP state unbound to actin, an ADP-Pi state bound to actin, and a state after the power-generating stroke has been completed. The binding and hydrolysis of a nucleotide cause a small conformational change in the head domain that is amplified into a large movement of the neck region. The small conformational change in the head domain is localized to a switch region consisting of the nucleotide- and actin-binding sites. A converter region at the base of the head acts like a fulcrum that causes the leverlike neck to bend and rotate. Allostery refers to any change in a protein’s tertiary or quaternary structure, or both, induced by the binding of a ligand, which may be an activator, inhibitor, substrate, or all three.
To understand how a muscle contracts, consider the interactions between one myosin head (among the hundreds in a thick filament) and a thin (actin) filament. During these cyclical interactions, also called the cross-bridge cycle, the hydrolysis of ATP is coupled to the movement of a myosin head toward the Z disk, which corresponds to the + end of the thin filament. Because the thick filament is bipolar, the action of the myosin heads at opposite ends of the thick filament draws the thin filaments toward the center of the thick filament, and therefore toward the center of the sarcomere. This movement shortens the sarcomere until the ends of the thick filaments abut the Z disk or the – ends of the thin filaments overlap at the center of the A band.
Contraction of an intact muscle results from the activity of hundreds of myosin heads on a single thick filament, amplified by the hundreds of thick and thin filaments on a sarcomere and thousands of sarcomeres in a muscle fiber.

5 Actin structure Lodish et al p. 174
The cytosol of a eukaryotic cell contains three types of filaments that can be distinguished on the basis of their diameter, type of subunit, and subunit arrangement. Actin filaments, also called microfilaments, are 8-9 nm in diameter and have a twisted two-stranded structure. Microtubules are hollow tubelike structures, 24 nm in diameter. Intermediate filaments (IFs) have the structure of a 10-nm-diameter rope. Monomeric actin subunits assemble into microfilaments.
The cytoskeleton has been highly conserved in evolution. A comparison of gene sequences shows only a small percentage of differences in sequence between yeast actin and tubulin and human actin and tubulin. This structural conservation is explained by the variety of critical functions that depend on the cytoskeleton. A mutation in a cytoskeleton protein subunit could disrupt the assembly of filaments and their binding to other proteins. Analyses of gene sequences and protein structures have identified bacterial homologs of actin and tubulin.
Actin is Ancient, Abundant, and highly Conserved. Actin is the most abundant intracellular protein in most eukaryotic cells. In muscle cells, for example, actin comprises 10 percent by weight of the total cell protein. Even in non-muscle cells, actin makes up 1-5 percent of the cellular protein. A typical liver cell has 2x10^4 insulin receptor molecules but 5x10^8 actin molecules. Actin exists as a globular monomer called G-actin and as a filamentous polymer called F-actin, which is a linear chain of G-actin subunits. The ability of G-actin to polymerize into F-actin and of F-actin to depolymerize into G-actin is an important property of actin.

6 Actin sequence
Actin is ancient and abundant
Most abundant protein in cells
1-2 actin genes in bacteria, yeasts, amoebas
Humans: 6 actin genes
α-actin in muscles; β-actin, γ-actin in non-muscle cells
~4 amino acids different between each version
MUSCLE ACTIN Amino Acid Sequence
  1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR
 61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI
121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG
181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL
241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG
301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY
361 DESGPSIVHR KCF
Actin is encoded by a large, highly conserved gene family. Actin arose from a bacterial ancestor and then evolved further as eukaryotic cells became specialized. Some single-celled organisms such as rod-shaped bacteria, yeasts, and amebas have one or two actin genes, whereas many multicellular organisms contain multiple actin genes. For instance, humans have six actin genes, which encode isoforms of the protein, and some plants have more than 60 actin genes, although most are pseudogenes. In vertebrates, the four alpha-actin isoforms present in various muscle cells and the beta-actin and gamma-actin isoforms present in nonmuscle cells differ at only four or five positions. Although these differences among isoforms seem minor, the isoforms have different functions: alpha-actin is associated with contractile structures; gamma-actin accounts for filaments in stress fibers; and beta-actin is at the front, or leading edge, of moving cells where actin filaments polymerize. Actins are among the most conserved proteins in a cell, comparable with histones, the structural proteins of chromatin. The sequences of actins from amebas and from animals are identical at 80 percent of the positions.

7 A related protein in bacteria
The simple bacterial cytoskeleton controls cell length, width, and the site of cell division. The FtsZ protein, a bacterial homolog of tubulin, is localized around the neck of dividing cells, and participates in cell division.

8 Relation between sequence and structure
FtsA consists of two domains with the nucleotide-binding site in the interdomain cleft. Both domains have a common core that is also found in the actin family of proteins. The structure of FtsA is most homologous to actin and heat-shock cognate protein (Hsc70). An important difference between FtsA and the actin family of proteins is the insertion of a subdomain in FtsA. Movement of this subdomain partially encloses a groove that could bind the C-terminus of FtsZ. FtsZ is the bacterial homologue of tubulin, and the FtsZ ring is functionally similar to the contractile ring in dividing eukaryotic cells. The elucidation of the crystal structure of FtsA shows that another bacterial protein involved in cytokinesis is structurally related to a eukaryotic cytoskeletal protein involved in cytokinesis.
FtsZ forms a ring-like structure at midcell, shortly after birth of the cell, that contracts as the cell divides. Before contraction, other components of the divisome are sequentially recruited to the FtsZ ring. Shortly after the FtsZ ring has formed, FtsA and ZipA are located at midcell. The assembly of the other components of the divisome is dependent on the presence of FtsA and FtsZ. FtsA is, after FtsZ, the most highly conserved protein of the divisome, though it is lacking in archaea. Cell division is impaired if the localization of FtsA to the FtsZ ring is prevented. Recently, 8-10 highly conserved residues at the C-terminus of FtsZ have been shown to be important for the interaction with FtsA. Once bound to the FtsZ ring, FtsA may form a bridge between FtsZ molecules and membrane-anchored proteins or integral membrane proteins of the septum. Genetic evidence suggests that FtsA interacts with FtsI (PBP3) and that the localization of FtsK, FtsL, FtsN, and FtsQ to the septum is dependent on FtsA.

9 A multiple alignment of actins

10 Gene expression
DNA → (transcription) → RNA → (translation) → Protein
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
But first some quick genetics to make sure that we all are on the same page. The genes are expressed by the DNA in our chromosomes being transcribed into RNA. Basically an enzyme attaches to the DNA molecule and copies it, creating an RNA molecule. Then the RNA is translated in triplets into amino acids, creating a peptide, which in its finished form becomes a protein. The protein is the end product of the gene, and thus the expression of the genes.
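The two steps on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `transcribe` and `translate` are made up for this example, the DNA string is treated as the coding strand (so transcription is just T to U), and the standard genetic code is used.

```python
# Standard genetic code, enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(dna: str) -> str:
    """Copy the coding strand into RNA by replacing T with U."""
    return dna.upper().replace("T", "U")

def translate(rna: str) -> str:
    """Read the RNA in triplets and map each codon to one amino acid letter."""
    dna = rna.upper().replace("U", "T")
    peptide = []
    for pos in range(0, len(dna) - len(dna) % 3, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        if residue == "*":  # a stop codon ends translation
            break
        peptide.append(residue)
    return "".join(peptide)

rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

The slide's example sequence was evidently chosen so that its seven codons spell PEPTIDE in one-letter amino acid codes.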

11 Biomolecules as Strings
Macromolecules are the chemical building blocks of cells
Proteins: 20 amino acids
Nucleic acids: 4 nucleotides {A, C, G, T}
Polysaccharides
Polymers are composed of multiple covalently linked identical or nearly identical small molecules, or monomers. The covalent bonds between monomers usually form by dehydration reactions in which a water molecule is lost. Proteins: peptide bonds; nucleotides: phosphodiester bonds; polysaccharides are linear or branched polymers of monosaccharides (sugars) such as glucose linked by glycosidic bonds. Monosaccharides are carbohydrates, which are literally combinations of carbon and water in a one-to-one ratio. Large polysaccharides, containing dozens to hundreds of monosaccharide units, function as reservoirs for glucose, as structural components, or as adhesives that help hold cells together in tissues. The most common storage carbohydrate in animal cells is glycogen, a very long, highly branched polymer of glucose. As much as 10 percent by weight of the liver can be glycogen. The primary storage carbohydrate in plant cells is starch.

12 The information is in the sequence
Sequence → Structure → Function
Sequence similarity → Structural and/or Functional similarity
Nucleic acids and proteins are related by molecular evolution
Orthologs: two proteins in animals X and Y that evolved from one protein in immediate ancestor animal Z
Paralogs: two proteins that evolved from one protein through duplication in some ancestor
Homologs: orthologs or paralogs that exhibit sequence similarity

13 Protein Phylogenies
Proteins evolve by both duplication and species divergence
[Figure: protein tree with a duplication event, marking orthologs and paralogs]

14 Evolution

15 Evolution at the DNA level
SEQUENCE EDITS: Deletion, Mutation
…ACGGTGCAGTTACCA…
…AC----CAGTCACCA…
REARRANGEMENTS: Inversion, Translocation, Duplication

16 Evolutionary Rates
Changes in non-functional sites are OK, so will be propagated
Most changes in functional sites are deleterious and will be rejected
[Figure: a sequence passed to the next generation, with individual changes marked "OK", "X", or "Still OK?"]

17 Sequence conservation implies function
Proteins between humans and rodents are on average 85% identical

18 Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings x = x1x2...xM, y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence
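The definition can be checked mechanically. The sketch below uses hypothetical helper names (`is_valid_alignment`, `percent_identity`) and tests that two gapped rows really are an alignment of x and y: equal length, the original letters in order once gaps are removed, and no column pairing a gap with a gap. Taking percent identity over the alignment length is an assumption; other denominators are also in use.

```python
def is_valid_alignment(x, y, row_x, row_y, gap="-"):
    """True if (row_x, row_y) lines up x against y using only inserted gaps."""
    same_length = len(row_x) == len(row_y)
    keeps_letters = row_x.replace(gap, "") == x and row_y.replace(gap, "") == y
    no_double_gap = all(not (a == gap and b == gap) for a, b in zip(row_x, row_y))
    return same_length and keeps_letters and no_double_gap

def percent_identity(row_x, row_y, gap="-"):
    """Share of alignment columns holding the same letter in both rows."""
    matches = sum(a == b and a != gap for a, b in zip(row_x, row_y))
    return 100.0 * matches / len(row_x)

# The two sequences and the alignment shown on the slide.
x = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
y = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
row_x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"

print(is_valid_alignment(x, y, row_x, row_y))
print(round(percent_identity(row_x, row_y), 1))
```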

19 What is a good alignment?
The “best” way to match the letters of one sequence with those of the other
How do we define “best”?
A hypothesis that the two sequences come from a common ancestor through sequence edits
Parsimonious explanation: Find the minimum number of edits that transform one sequence into the other

20 Scoring Function
Sequence edits: AGGCCTC
Mutations: AGGACTC
Insertions: AGGGCCTC
Deletions: AGG-CTC
Scoring Function:
Match: + m
Mismatch: – s
Gap: – d
Score F = (# matches) × m – (# mismatches) × s – (# gaps) × d
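As a quick illustration of the scoring function, the sketch below scores an alignment that is already given as two gapped rows. The function name `score_alignment` and the default values m = s = d = 1 are assumptions made for this example only.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1, gap="-"):
    """F = (# matches) * m - (# mismatches) * s - (# gaps) * d."""
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d

# The three kinds of edit from the slide, each aligned against AGGCCTC.
print(score_alignment("AGGCCTC", "AGGACTC"))    # one mutation:  6 - 1 = 5
print(score_alignment("AGG-CCTC", "AGGGCCTC"))  # one insertion: 7 - 1 = 6
print(score_alignment("AGGCCTC", "AGG-CTC"))    # one deletion:  6 - 1 = 5
```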

21 How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Too many possible alignments: O(2^(M+N))
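To get a feel for the growth, one common way of counting (an assumption here, since the slide does not say how alignments are counted) treats alignments that differ only in the order of adjacent gaps as the same. Under that convention, two sequences of lengths M and N have C(M+N, M) alignments, which stays below 2^(M+N) but still grows exponentially. The helper name `count_alignments` is made up for this sketch.

```python
from math import comb

def count_alignments(m: int, n: int) -> int:
    """Number of global alignments, ignoring the order of adjacent gaps."""
    return comb(m + n, m)

# Compare the count with the 2^(M+N) bound for equal-length sequences.
for n in (5, 10, 20, 40):
    print(n, count_alignments(n, n), 2 ** (2 * n))
```

Even at the length of the two sequences on this slide (about 40 letters each), enumerating every alignment is out of reach.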

22 Alignment is additive
Observation: The score of aligning x1……xM with y1……yN is additive
Say that x1…xi xi+1…xM aligns to y1…yj yj+1…yN
The two scores add up: F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])
Key property: optimal solution to the entire problem is composed of optimal solutions to subproblems – Dynamic Programming

23 Dynamic Programming
Construct a DP matrix F of size (M+1) × (N+1), with indices running from 0
Suppose we wish to align x1……xM with y1……yN
Let F(i, j) = optimal score of aligning x1……xi with y1……yj

24 Dynamic Programming (cont’d)
Notice three possible cases:
1. xi aligns to yj
   x1……xi-1 xi
   y1……yj-1 yj
   F(i, j) = F(i-1, j-1) + m, if xi = yj; F(i, j) = F(i-1, j-1) – s, if not
2. xi aligns to a gap
   x1……xi-1 xi
   y1……yj   -
   F(i, j) = F(i-1, j) – d
3. yj aligns to a gap
   x1……xi   -
   y1……yj-1 yj
   F(i, j) = F(i, j-1) – d

25 Dynamic Programming (cont’d)
How do we know which case is correct?
Inductive assumption: F(i, j – 1), F(i – 1, j), F(i – 1, j – 1) are optimal
Then, F(i, j) = max{ F(i – 1, j – 1) + s(xi, yj), F(i – 1, j) – d, F(i, j – 1) – d }
Where s(xi, yj) = m, if xi = yj; –s, if not

26 Example
x = AGTA, y = ATA
m = 1, s = 1, d = 1 (match +1, mismatch –1, gap –1)
F(i, j), with i running over x (across) and j running over y (down):
              0    1    2    3    4   (i)
                   A    G    T    A
  0           0   -1   -2   -3   -4
  1   A      -1    1    0   -1   -2
  2   T      -2    0    0    1    0
  3   A      -3   -1   -1    0    2
 (j)
Optimal Alignment: F(4, 3) = 2
AGTA
A-TA

27 The Needleman-Wunsch Algorithm
Initialization.
F(0, 0) = 0
F(0, j) = – j × d
F(i, 0) = – i × d
Main Iteration. Filling-in partial alignments
For each i = 1……M
For each j = 1……N
F(i, j) = max{ F(i-1, j-1) + s(xi, yj) [case 1], F(i-1, j) – d [case 2], F(i, j-1) – d [case 3] }
Ptr(i, j) = DIAG, if [case 1]; LEFT, if [case 2]; UP, if [case 3]
Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment
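The three phases above (initialization, main iteration, termination with traceback) can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `needleman_wunsch`, the default scores m = s = d = 1, and the tie-breaking order (DIAG first, then LEFT, then UP) are assumptions of this example.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment with match +m, mismatch -s, and linear gap penalty -d."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # Initialization: aligning a prefix against nothing costs one gap per letter.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"

    # Main iteration: take the best of the three cases for every cell.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sub, "DIAG"),  # case 1: xi aligns to yj
                (F[i - 1][j] - d, "LEFT"),        # case 2: xi aligns to a gap
                (F[i][j - 1] - d, "UP"),          # case 3: yj aligns to a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Termination: follow the pointers back from (M, N) to (0, 0).
    row_x, row_y = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))

print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```

On the small example x = AGTA, y = ATA it returns the score 2 together with the alignment AGTA over A-TA.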

28 Performance
Time: O(NM)
Space: O(NM) for the full matrix; possible to reduce space to O(N+M) using Hirschberg’s divide & conquer algorithm
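The space saving rests on the observation that each row of F depends only on the row before it. The sketch below is not Hirschberg's algorithm itself, only the score-only building block it relies on: it keeps two rows at a time and therefore cannot trace back an alignment. The function name `nw_score_linear_space` and the default scores are assumptions for this illustration.

```python
def nw_score_linear_space(x, y, m=1, s=1, d=1):
    """Optimal global alignment score using O(len(y)) memory."""
    prev = [-j * d for j in range(len(y) + 1)]    # row F(0, .)
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)            # F(i, 0) = -i * d
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sub,      # xi aligns to yj
                          prev[j] - d,            # xi aligns to a gap
                          curr[j - 1] - d)        # yj aligns to a gap
        prev = curr
    return prev[len(y)]

print(nw_score_linear_space("AGTA", "ATA"))  # 2
```

Hirschberg's method runs this kind of pass forward and backward to find where the optimal alignment crosses the middle row, then recurses on the two halves.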

29 Substitutions of Amino Acids
Mutation rates between amino acids have dramatic differences! How can we quantify the differences in rates by which one amino acid replaces another across related proteins?

30 Substitution Matrices
BLOSUM matrices:
Start from BLOCKS database (curated, gap-free alignments)
Cluster sequences according to > X% identity
Calculate Aab: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes
Estimate P(a) = (Σb Aab)/(Σc≤d Acd); P(a, b) = Aab/(Σc≤d Acd)

31 Gaps are not inserted uniformly

32 A state model for alignment
States: M (+1,+1), I (+1, 0), J (0, +1)
Alignments correspond 1-to-1 with sequences of states M, I, J
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

33 Let’s score the transitions
Every transition into M (+1,+1) scores s(xi, yj)
Opening a gap, from M into I (+1, 0) or J (0, +1), scores –d
Extending a gap, from I to I or from J to J, scores –e
Alignments correspond 1-to-1 with sequences of states M, I, J
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
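The correspondence between alignments and state sequences, and the scoring of transitions, can be sketched as below. The labelling is read off the slide's own example (I where the top row has a gap, J where the bottom row has a gap). The function names, the simple match/mismatch stand-in for s(xi, yj), and the values d = 2, e = 1 are assumptions for illustration.

```python
def state_path(row_x, row_y, gap="-"):
    """Label each alignment column M, I (gap in top row), or J (gap in bottom row)."""
    states = []
    for a, b in zip(row_x, row_y):
        if a == gap:
            states.append("I")
        elif b == gap:
            states.append("J")
        else:
            states.append("M")
    return "".join(states)

def affine_score(row_x, row_y, m=1, s=1, d=2, e=1, gap="-"):
    """Score an alignment with gap-open penalty d and gap-extend penalty e."""
    total, prev = 0, None
    for a, b, state in zip(row_x, row_y, state_path(row_x, row_y, gap)):
        if state == "M":
            total += m if a == b else -s   # substitution score s(xi, yj)
        elif state == prev:
            total -= e                     # staying in I or J extends the gap
        else:
            total -= d                     # entering I or J opens a new gap
        prev = state
    return total

top = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
bottom = "TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC"
print(state_path(top, bottom))
print(affine_score(top, bottom))
```

With e smaller than d, one long gap costs less than several short ones of the same total length, which is the point of scoring the transitions separately.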

34 A probabilistic model for alignment
Assign probabilities to every transition (arrow), and emission (pair of letters or gaps)
Probabilities of mutation reflect amino acid similarities
Different probabilities for opening and extending gap
States: M (+1,+1), I (+1, 0), J (0, +1)
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

35 A Pair HMM for alignments
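Emissions: M emits a pair with P(xi, yj), scored log Prob(xi, yj); I emits xi with P(xi); J emits yj with P(yj)
Transitions (log scale):
M to M: log(1 – 2δ)
M to I, M to J: log δ
I to I, J to J: log ε
I to M, J to M: log(1 – ε)
Highest scoring path corresponds to the most likely alignment!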

36 How do we find the highest scoring path?
Compute the following matrices (DP)
M(i, j): most likely alignment of x1…xi with y1…yj ending in state M
I(i, j): most likely alignment of x1…xi with y1…yj ending in state I
J(i, j): most likely alignment of x1…xi with y1…yj ending in state J
M(i, j) = log( Prob(xi, yj) ) + max{ M(i-1, j-1) + log(1-2δ), I(i-1, j-1) + log(1-ε), J(i-1, j-1) + log(1-ε) }
I(i, j) = max{ M(i-1, j) + log δ, I(i-1, j) + log ε }

37 The Viterbi algorithm for alignment
For each i = 1, …, M
For each j = 1, …, N
M(i, j) = log( Prob(xi, yj) ) + max{ M(i-1, j-1) + log(1-2δ), I(i-1, j-1) + log(1-ε), J(i-1, j-1) + log(1-ε) }
I(i, j) = max{ M(i-1, j) + log δ, I(i-1, j) + log ε }
J(i, j) = max{ M(i, j-1) + log δ, J(i, j-1) + log ε }
When matrices are filled, we can trace back from (M, N) the likeliest alignment
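A compact sketch of this Viterbi recursion follows. Several things are assumptions of the example rather than part of the slides: the function name `viterbi_align`, the toy parameters (δ = 0.1, ε = 0.3, and a pair-emission probability of 0.2 for identical letters versus 0.2/12 otherwise), the use of M(0, 0) = 0 as the start, and the standard pair-HMM convention that M(i, j) looks back to cell (i-1, j-1) in all three matrices. Here state I consumes xi against a gap and state J consumes yj against a gap, and, as in the recursion above, the gap states carry no emission term.

```python
import math

NEG_INF = float("-inf")

def viterbi_align(x, y, delta=0.1, epsilon=0.3, p_match=0.2, p_mismatch=0.2 / 12):
    """Most likely state path of a three-state pair HMM, in log space."""
    n, m = len(x), len(y)
    log_open, log_extend = math.log(delta), math.log(epsilon)
    log_stay, log_close = math.log(1 - 2 * delta), math.log(1 - epsilon)

    def grid(fill):
        return [[fill] * (m + 1) for _ in range(n + 1)]

    V = {s: grid(NEG_INF) for s in "MIJ"}   # best log score ending in each state
    back = {s: grid(None) for s in "MIJ"}   # state the best path came from
    V["M"][0][0] = 0.0                      # start is treated like state M

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                emit = math.log(p_match if x[i - 1] == y[j - 1] else p_mismatch)
                best, prev = max((V["M"][i - 1][j - 1] + log_stay, "M"),
                                 (V["I"][i - 1][j - 1] + log_close, "I"),
                                 (V["J"][i - 1][j - 1] + log_close, "J"))
                V["M"][i][j], back["M"][i][j] = emit + best, prev
            if i > 0:
                V["I"][i][j], back["I"][i][j] = max(
                    (V["M"][i - 1][j] + log_open, "M"),
                    (V["I"][i - 1][j] + log_extend, "I"))
            if j > 0:
                V["J"][i][j], back["J"][i][j] = max(
                    (V["M"][i][j - 1] + log_open, "M"),
                    (V["J"][i][j - 1] + log_extend, "J"))

    # Trace back from (n, m), starting in the best final state.
    state = max("MIJ", key=lambda s: V[s][n][m])
    score = V[state][n][m]
    i, j, row_x, row_y = n, m, [], []
    while i > 0 or j > 0:
        prev = back[state][i][j]
        if state == "M":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif state == "I":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
        state = prev
    return score, "".join(reversed(row_x)), "".join(reversed(row_y))

print(viterbi_align("AGTA", "ATA"))
```

With these toy parameters the likeliest alignment of AGTA and ATA is AGTA over A-TA, the same alignment the score-based example finds.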

38 One way to view the state paths – State M
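[Figure: DP grid for state M, with axes y1 …… yn and x1 …… xm]

39 State I
[Figure: DP grid for state I, with axes y1 …… yn and x1 …… xm]

40 State J
[Figure: DP grid for state J, with axes y1 …… yn and x1 …… xm]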

41 Putting it all together
States I(i, j) are connected with states I and M at (i-1, j)
States J(i, j) are connected with states J and M at (i, j-1)
States M(i, j) are connected with states M, I, and J at (i-1, j-1)
[Figure: the three DP grids overlaid, with axes y1 …… yn and x1 …… xm]

42 Putting it all together
States I(i, j) are connected with states I and M at (i-1, j)
States J(i, j) are connected with states J and M at (i, j-1)
States M(i, j) are connected with states M, I, and J at (i-1, j-1)
Optimal solution is the best scoring path from top-left to bottom-right corner
This gives the likeliest alignment according to our HMM
[Figure: the three DP grids overlaid, with axes y1 …… yn and x1 …… xm]

43 Yet another way to represent this model
[Figure: chain of match states Mx1 … Mxm along Sequence X, between BEGIN and END, with insert states Ix and Iy]
We are aligning, or threading, sequence Y through sequence X
Every time yj lands in state xi, we get substitution score s(xi, yj)
Every time yj is gapped, or some xi is skipped, we pay gap penalty

