Last lecture summary.


Last lecture summary

New generation sequencing (NGS)
The completion of the human genome was just the start of the modern DNA sequencing era: "high-throughput next generation sequencing" (NGS).
New approaches reduce time and cost.
Holy Grail of sequencing: a complete human genome for less than $1000.
1st generation: Sanger dideoxy method
2nd generation: sequencing by synthesis (pyrosequencing)
3rd generation: single-molecule sequencing

cDNA, EST libraries
cDNA: made by reverse transcriptase; contains only expressed genes (no introns).
cDNA library: a collection of different DNA sequences that have been incorporated into a vector.
EST (Expressed Sequence Tag): a short, unedited (single-pass read), randomly selected subsequence (200-800 bp) of a cDNA sequence, generated from either the 5' or the 3' end; quality is higher in the middle of the read.
cDNA/EST: direct evidence of the transcriptome.
Note: because the desired gene sequences usually still represent only a tiny proportion of the total cDNA population, the cDNA fragments are amplified by cloning/PCR.

What is sequence alignment?
Fragment overlaps

Exact overlap:
CTTTTCAAGGCTTA
        GGCTTATTATTGC

Approximate overlap:
CTTTTCAAGGCTTA
        GGCTATTATTGC

CTTTTCAAGGCTTA
        GGCT-ATTATTGC

Notes: sequence alignment is introduced here through examples where it already appeared in the earlier material. The top example is an exact overlap, the bottom one an approximate overlap. Two alignments of the approximate overlap are shown; the alignment with an inserted gap is the better one.
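The difference between the exact and the approximate overlap can be checked with a few lines of Python. This is only a minimal sketch: the function name `longest_overlap` is made up for this illustration, and it looks only for an exact suffix/prefix match, which is precisely why it fails on the second pair of fragments and why a real alignment (with mismatches and gaps) is needed.

```python
def longest_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is exactly a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


# Exact overlap from the slide: the fragments share GGCTTA.
print(longest_overlap("CTTTTCAAGGCTTA", "GGCTTATTATTGC"))  # 6

# Approximate overlap: one base is missing, so no exact overlap is found.
print(longest_overlap("CTTTTCAAGGCTTA", "GGCTATTATTGC"))   # 0
```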

What is sequence alignment?
EST sequences:
CCCCATGGTGGCGGCAGGTGACAG
CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
TGGCGGCTCGGGACAGTCGCGCATAAT
CCATGGTGGTGGCTGGGGATAGTA
TGAGGCAGTCGCGCATAATTCCG

"EST clustering":
   CCCCATGGTGGCGGCAGGTGACAG
      CATGGGGGAGGATGGGGACAGTCCGG
TTACCCCATGGTGGCGGCTTGGGAAACTT
           TGGCGGCTCGGGACAGTCGCGCATAAT
     CCATGGTGGTGGCTGGGGATAGTA
                   TGAGGCAGTCGCGCATAATTCCG
TTACCCCATGGTGGCGGCTGGGGACAGTCGCGCATAATTCCG   consensus

Notes: this is only a demonstration alignment. The EST sequences are at the top. Clustering leads to a consensus sequence, but the sequences first have to be arranged like this. Of course there are errors, frequently at the ends of the ESTs; two are highlighted on the slide (but there are more).

Sequence alphabet

Nucleotides:
Adenine   A
Thymine   T
Cytosine  C
Guanine   G

Amino acids (side chains grouped by charge at physiological pH 7.4):
Name             3 letters   1 letter
Positively charged side chains
Arginine         Arg         R
Histidine        His         H
Lysine           Lys         K
Negatively charged side chains
Aspartic acid    Asp         D
Glutamic acid    Glu         E
Polar uncharged side chains
Serine           Ser         S
Threonine        Thr         T
Asparagine       Asn         N
Glutamine        Gln         Q
Special
Cysteine         Cys         C
Selenocysteine   Sec         U
Glycine          Gly         G
Proline          Pro         P
Hydrophobic side chains
Alanine          Ala         A
Leucine          Leu         L
Isoleucine       Ile         I
Methionine       Met         M
Phenylalanine    Phe         F
Tryptophan       Trp         W
Tyrosine         Tyr         Y
Valine           Val         V

Sequence alignment
Procedure of comparing sequences.

Point mutations are easy (gapless alignment):
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT

A more difficult example:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

However, gaps can be inserted to get something like this (gapped alignment):
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT

Gaps correspond to an insertion in one sequence or a deletion in the other (insertion × deletion = indel).
When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless the ancestry is known:
ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT

Why align sequences (continued)
The draft human genome is available.
Automated gene finding is possible.
Gene: AGTACGTATCGTATAGCGTAA. What does it do?
One approach: is there a similar gene in another species?
Align the sequence with known genes.
Find the gene with the "best" match.

Flavors of sequence alignment
pair-wise alignment × multiple sequence alignment

Flavors of sequence alignment
global alignment × local alignment
Global: the entire sequences are aligned.
Local: stretches of sequence with the highest density of matches are aligned, generating islands of matches or subalignments in the aligned sequences.
Notes:
- Sequences that are quite similar and approximately the same length are suitable candidates for global alignment.
- Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.

New stuff

Evolution
Common ancestors (figure source: wikipedia.org)

Evolution of sequences
The sequences are the products of molecular evolution.
When sequences share a common ancestor, they tend to exhibit similarity in their sequences, structures and biological functions.
Diagram: DNA1, DNA2 → Protein1, Protein2; sequence similarity, similar 3D structure, similar function.
Similar sequences produce similar proteins. However, this statement is not a rule. See Gerlt JA, Babbitt PC. Can sequence determine function? Genome Biol. 2000;1(5). PMID: 11178260
Notes: "similar sequences produce similar proteins" is probably the most powerful idea of bioinformatics, because it enables us to make predictions. Often little is known about the function of a new sequence from a genome sequencing program, but if similar sequences with known functional or structural information can be found in a database, they can be used as the basis for predicting the function or structure of the new sequence.

Homology
Over evolutionary time, molecular sequences undergo random changes, some of which are selected during the process of evolution.
Selected sequences accumulate mutations and diverge over time.
Two sequences are homologous when they are descended from a common ancestor sequence.
Traces of evolution may still remain in certain portions of the sequences, which allows the common ancestry to be identified.
Residues performing key roles are preserved by natural selection; less crucial residues mutate more frequently.

Orthology, paralogy I
Two flavors of homology:
Orthologs: homologous proteins from different species that possess the same function (e.g. corresponding kinases in a signal transduction pathway in humans and mice).
Paralogs: homologous proteins that have different functions in the same species (e.g. two kinases in different signal transduction pathways of humans).
Note: a kinase transfers phosphate groups from high-energy donor molecules, such as ATP, to specific substrates.
However, these terms are controversially discussed: Jensen RA. Orthologs and paralogs - we need to get it right. Genome Biol. 2001;2(8). PMID: 11532207, and references therein.

Orthology, paralogy II
Orthologs: genes separated by a speciation event. The sequences are direct descendants of a common ancestor and most likely have similar domain structure, 3D structure and biological function.
Paralogs: genes separated by a gene duplication event.
Gene duplication: an extra copy of a gene. Gene duplication is a key mechanism in evolution. Once a gene is duplicated, the identical copies can undergo changes and diverge to create two different genes.
http://www.globalchange.umich.edu/globalchange1/current/lectures/speciation/speciation.html

Gene duplication
Three mechanisms:
- Unequal cross-over
- Entire chromosome is replicated twice
- Retrotransposition
Notes:
Gene duplication is believed to play a major role in evolution. Duplications typically arise from an event termed unequal crossing-over (recombination) that occurs between misaligned homologous chromosomes during meiosis (germ cell formation). The chance of this event happening is a function of the degree of sharing of repetitive elements between two chromosomes. The recombination products of such an event are a duplication at the site of the exchange and a reciprocal deletion.
Another way that gene duplication can occur is if the entire chromosome is replicated twice. This error will result in one of the daughter cells having an extra copy of the chromosome and all the extra genetic material. If this cell fuses with another cell during reproduction, it may or may not result in a viable zygote.
The last way that gene duplication can occur is through retrotransposition. During retrotransposition, sequences of DNA are copied to RNA and then back to DNA instead of being translated into proteins. This results in extra copies of that DNA being present within the cell, which can rejoin with the chromosomes that are already present. Any genes found along these sequences of DNA will have been duplicated in the process.

Unequal cross-over
Homologous chromosomes are misaligned during meiosis.
The probability of misalignment is a function of the degree of sharing of repetitive elements.
Notes: the underlying DNA sequence homology of the similar maternal and paternal chromosome pairs guides this search and eventual alignment along the entire length of each chromosome. The alignment is further mediated and cemented by a three-dimensional zipperlike structure surrounding each set of paired homologous chromosomes, the synaptonemal complex.
Source: Meiosis, Biology Encyclopedia, http://www.biologyreference.com/Ma-Mo/Meiosis.html#ixzz1cXdSgeg7

Comparing sequences through alignment
Patterns of conservation and variation can be identified.
The degree of sequence conservation in the alignment reveals the evolutionary relatedness of different sequences.
The variation between sequences reflects the changes that have occurred during evolution in the form of substitutions and/or indels.
Identifying the evolutionary relationships between sequences helps to characterize the function of unknown sequences.
Protein sequence comparison can identify homologous sequences whose common ancestor lived as long as 1 billion years ago (BYA); DNA sequence comparison typically reaches back only about 600 million years (MYA).

Scoring systems I
DNA and protein sequences can be aligned so that the number of identically matching pairs is maximized.

ATTG---T
A--GACAT

Counting the number of matches gives us a score (3 in this case). A higher score means a better alignment.
This procedure can be formalized using a substitution matrix. For counting matches, this is the identity matrix:

    A  T  C  G
A   1  0  0  0
T   0  1  0  0
C   0  0  1  0
G   0  0  0  1
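As a minimal sketch of the match-counting idea, the score of the small alignment on this slide can be computed in Python. The function name `identity_score` is chosen here for illustration only; a gap character never counts as a match.

```python
def identity_score(top: str, bottom: str) -> int:
    """Count columns where both aligned sequences carry the same residue."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for x, y in zip(top, bottom) if x == y and x != "-")


# The alignment from the slide: matches are A/A, G/G and T/T.
print(identity_score("ATTG---T", "A--GACAT"))  # 3
```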

Scoring systems II
For nucleotide sequences the identity matrix is usually good enough.
For protein sequences the identity matrix is not sufficient to describe biological and evolutionary processes, because in practice amino acids do not replace one another with equal probability, even though any exchange is conceivable in theory.
For example, substitution of aspartic acid (D) by glutamic acid (E) is frequently observed, whereas a change from aspartic acid to tryptophan (W) is very rare.
Why is that?
Triplet-based genetic code: GAT (D) → GAA (E) needs a single nucleotide change, whereas GAT (D) → TGG (W) needs changes at all three codon positions.
D and E have similar properties, but D and W differ considerably: D is hydrophilic, W is hydrophobic. A D → W mutation can greatly alter the 3D structure and consequently the function.
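The genetic-code argument can be made concrete by counting how many nucleotide positions differ between two codons. The sketch below uses only the two codon pairs quoted on the slide; the helper name `codon_distance` is an assumption of this example.

```python
def codon_distance(codon_a: str, codon_b: str) -> int:
    """Number of positions at which two codons differ (Hamming distance)."""
    if len(codon_a) != 3 or len(codon_b) != 3:
        raise ValueError("codons must be three nucleotides long")
    return sum(x != y for x, y in zip(codon_a, codon_b))


print(codon_distance("GAT", "GAA"))  # D -> E: 1 change
print(codon_distance("GAT", "TGG"))  # D -> W: 3 changes
```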

Genetic code http://www.doctortee.com/dsu/tiftickjian/bio100/gene-expression.html

Gaps or no gaps
[Figure: three examples of gapless alignments and three examples of gapped alignments between two short sequences]

Scoring DNA sequence alignment (1)
Match score: +1
Mismatch score: +0
Gap penalty: -1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (-1)
Score = +11
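The arithmetic on this slide can be reproduced with a short Python sketch. The function name `score_linear` and its keyword arguments are chosen for this illustration; the default values are the match, mismatch and gap scores given on the slide.

```python
def score_linear(top: str, bottom: str, match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Score an alignment column by column with a per-position gap penalty."""
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap          # every gap position costs the same
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_linear(top, bottom))  # 18 matches, 2 mismatches, 7 gap positions -> 11
```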

Length penalties
We want to find alignments that are evolutionarily likely.
Which of the following alignments seems more likely to you?

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT

We can favor the first one by penalizing the opening of a new gap more than the extension of an existing gap.
Note: a single deletion of a longer stretch of sequence is more likely than many independent short deletions.

Scoring DNA sequence alignment (2)
Match/mismatch score: +1/+0
Origination/length penalty: -2/-1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Origination: 2 × (-2)
Length: 7 × (-1)
Score = +7
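A sketch of the origination/length scheme in Python, under the reading that every run of consecutive gap characters in one sequence pays the origination penalty once and every gap position pays the length penalty. The function name `score_affine` and its parameter names are assumptions of this example. Applied to the two alignments from the "Length penalties" slide, it shows why the scheme prefers one long gap over many short ones.

```python
def score_affine(top: str, bottom: str, match: int = 1, mismatch: int = 0,
                 gap_open: int = -2, gap_extend: int = -1) -> int:
    """Score an alignment with an origination penalty per gap and a length penalty per gap position."""
    score = 0
    previous_gap_in = None  # which sequence held the gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap_in = "top" if x == "-" else "bottom"
            if gap_in != previous_gap_in:
                score += gap_open      # a new gap starts here
            score += gap_extend        # every gap position adds to the length
            previous_gap_in = gap_in
        else:
            score += match if x == y else mismatch
            previous_gap_in = None
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
print(score_affine(top, "----CTGATTCGC---ATCGTCTATCT"))  # 18 - 2*2 - 7 = 7
print(score_affine(top, "ACGTCTGAT-------ATAGTCTATCT"))  # one gap of length 7: 20 - 2 - 7 = 11
print(score_affine(top, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # six gaps, 7 positions: 20 - 12 - 7 = 1
```

With a purely per-position gap penalty the last two alignments would score the same (20 matches and 7 gap positions each), so the origination penalty is what separates them.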

Substitution matrices
Substitution (score) matrices give a score for each amino acid substitution. A higher score means a more probable substitution.
Conservative substitutions conserve the physical and chemical properties of the amino acids and limit structural/functional disruption.
Substitution matrices should reflect:
- Physicochemical properties of amino acids.
- Different frequencies of individual amino acids occurring in proteins.
- Interchangeability of the genetic code.

PAM matrices I
How to assign scores? Let's get nature (evolution) involved!
If you choose a set of proteins with very similar sequences, you can do the alignment manually.
Also, if the sequences in your set are similar, there is a high probability that each amino acid difference is due to a single mutation.
From the frequencies of mutations in the set of similar protein sequences, the probabilities of substitutions can be derived.
This is exactly the approach taken by Margaret Dayhoff in 1978 to construct the PAM (Point Accepted Mutation) matrices.
Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of Evolutionary Change in Proteins". Atlas of protein sequence and structure (volume 5, supplement 3 ed.). Nat. Biomed. Res. Found. pp. 345-358.

PAM matrices II
Alignments of 71 groups of very similar (at least 85% identity) protein sequences were used; 1572 substitutions were found.
These mutations do not significantly alter the protein function. Hence they are called accepted mutations (accepted by natural selection).
The probabilities that any one amino acid would mutate into any other were calculated.
If I know the probabilities for the individual amino acids, what is the probability for the given sequence? Their product.
Thus the probabilities are converted to logarithms, and an alignment score can be calculated by summation.
Excellent discussion of the derivation and use of PAM matrices: George DG, Barker WC, Hunt LT. Mutation data matrix and its uses. Methods Enzymol. 1990;183:333-51. PMID: 2314281.
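Why logarithms turn a product into a sum can be checked numerically. The probabilities below are made-up illustration values, not entries of any PAM matrix, and the `log_odds` helper with its scale factor of 10 and rounding to an integer is one common convention assumed here, not something the slide specifies.

```python
import math

# Hypothetical per-position probabilities for a short alignment.
position_probabilities = [0.9, 0.8, 0.95, 0.7]

product = math.prod(position_probabilities)
sum_of_logs = sum(math.log10(p) for p in position_probabilities)

# The logarithm of the product equals the sum of the logarithms.
print(math.isclose(math.log10(product), sum_of_logs))  # True


def log_odds(observed: float, expected: float, scale: float = 10.0) -> int:
    """Integer log-odds score: positive if observed more often than expected by chance."""
    return round(scale * math.log10(observed / expected))


print(log_odds(0.02, 0.01))   # observed twice as often as chance -> 3
print(log_odds(0.01, 0.01))   # as often as chance -> 0
print(log_odds(0.005, 0.01))  # half as often as chance -> -3
```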

PAM matrices III
Dayhoff's definition of accepted mutation was thus based on empirically observed amino acid substitutions.
The unit used is a PAM. Two sequences are 1 PAM apart if they have 99% identical residues.
The PAM1 matrix is the result of computing the probability of one substitution per 100 amino acids.
The PAM1 matrix represents probabilities of point mutations over a certain evolutionary time:
in Drosophila 1 PAM corresponds to ~2.62 million years
in human 1 PAM corresponds to ~4.58 million years
Notes:
- Henikoff S, Henikoff JG. Amino acid substitution matrices. Advances in Protein Chemistry. 54: 73-97, 2000. http://blocks.fhcrc.org/steveh/Henikoff_publications.html
- The estimate of the time span of 1 PAM is taken from Introduction to Computational Biology: An Evolutionary Approach, by B. Haubold, T. Wiehe.

PAM1 matrix numbers are multiplied by 10 000

Higher PAM matrices
What if I want probabilities over a much longer evolutionary time, i.e. I want to align sequences with far less than 85% identity?
Dayhoff proposed a model of evolution that is a Markov process.
We have already met (in the Lin Alg lecture) a linear dynamical system, which is a case of a Markov process.

Linear dynamical system I
A new species of frog has been introduced into an area where it has too few natural predators. In an attempt to restore the ecological balance, a team of scientists is considering introducing a species of bird which feeds on this frog. Experimental data suggest that the populations of frogs and birds from one year to the next can be modeled by linear relationships. Specifically, it has been found that if the quantities $F_k$ and $B_k$ represent the populations of the frogs and birds in the $k$th year, then

$$B_{k+1} = 0.6\,B_k + 0.4\,F_k$$
$$F_{k+1} = -0.35\,B_k + 1.4\,F_k$$

The question is this: in the long run, will the introduction of the birds reduce or eliminate the frog population growth?

Linear dynamical system II

$$\begin{pmatrix} B_{k+1} \\ F_{k+1} \end{pmatrix} = \begin{pmatrix} 0.6 & 0.4 \\ -0.35 & 1.4 \end{pmatrix} \begin{pmatrix} B_k \\ F_k \end{pmatrix}$$

So this system evolves in time according to $x(k+1) = A\,x(k)$. Such a system is called a discrete linear dynamical system, and the matrix $A$ is called the transition matrix.
If we need to know the state of the system at time $k = 50$, we have to compute $x(50) = A^{50}\,x(0)$.
The same is true for Dayhoff's model of evolution. If we need probability matrices for a higher percentage of accepted mutations (i.e. covering a longer evolutionary time), we take matrix powers.
Let's say we want PAM120: 120 mutations fixed on average per 100 residues. We compute $\mathrm{PAM1}^{120}$.
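The matrix-power step can be sketched in plain Python. The two-letter "alphabet" and the numbers in `toy_pam1` are invented for this illustration (the real PAM1 matrix is 20 × 20), and storing the original residue in the rows is a layout assumed here. The point is only that repeated multiplication of a one-step transition matrix gives the transition probabilities after many steps.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    size = len(matrix)
    result = [[float(i == j) for j in range(size)] for i in range(size)]  # identity
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Hypothetical one-step matrix for a two-letter alphabet:
# row = original residue, column = residue after one step; each row sums to 1.
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]

toy_pam120 = mat_pow(toy_pam1, 120)
for row in toy_pam120:
    print([round(value, 4) for value in row], "row sum:", round(sum(row), 6))
```

Each row still sums to 1 after 120 steps, while the probability of seeing the original residue unchanged has dropped well below its one-step value.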

Higher PAM matrices
Biologically, the PAM120 matrix means that in 100 amino acids there have been 120 substitutions, while in PAM250 there have been 2.5 amino acid mutations at each site.
This may sound unusual, but remember that over evolutionary time it is possible that an alanine was changed to glycine, then to valine, and then back to alanine.
These are called silent substitutions.

PAM 120
[Figure: PAM120 matrix with amino acids grouped as small, polar; small, nonpolar; polar or acidic; basic; large, hydrophobic; aromatic. Source: Zvelebil, Baum, Understanding bioinformatics.]
Positive score: the frequency of substitutions is greater than would have occurred by random chance.
Zero score: the frequency is equal to that expected by chance.
Negative score: the frequency is less than would have occurred by random chance.

PAM matrices assumptions
- Mutation of an amino acid is independent of previous mutations at the same position (Markov process requirement).
- Only PAM1 was "measured"; all others are extrapolations (i.e. predictions based on some model).
- Each amino acid position is equally mutable.
- Mutations are assumed to be independent of surrounding residues.
- Forces responsible for sequence evolution over short times are the same as those over longer times.
- PAM matrices are based on protein sequences available in 1978 (bias towards small, globular proteins).
- New generation of Dayhoff-type matrices: e.g. PET91

How to calculate score?
[Figure: substitution matrix, BLOSUM62 shown here. Source: Selzer, Applied bioinformatics.]
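Scoring with a substitution matrix amounts to looking up each aligned pair and summing. The sketch below hard-codes a handful of BLOSUM62 entries quoted from the commonly distributed table (check them against a published copy before relying on them); the peptide sequences and the names `BLOSUM62_SUBSET`, `pair_score` and `alignment_score` are invented for this example.

```python
# A few BLOSUM62 entries; the matrix is symmetric, so each pair is stored once.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("D", "D"): 6, ("E", "E"): 5, ("W", "W"): 11,
    ("D", "E"): 2,   # conservative substitution: positive score
    ("D", "W"): -4,  # rarely observed substitution: negative score
}


def pair_score(x: str, y: str) -> int:
    """Look up the score of one aligned pair, in either order."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]


def alignment_score(top: str, bottom: str) -> int:
    """Sum the substitution scores over a gapless alignment."""
    return sum(pair_score(x, y) for x, y in zip(top, bottom))


print(alignment_score("ADEW", "ADEW"))  # 4 + 6 + 5 + 11 = 26
print(alignment_score("ADEW", "AEEW"))  # D/E scores +2 -> 22
print(alignment_score("ADEW", "AWEW"))  # D/W scores -4 -> 16
```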

Protein vs. DNA sequences
Given the choice of aligning DNA or protein, it is often more informative to compare protein sequences. There are several reasons for this:
- Many changes in DNA do not change the amino acid that is specified. In particular, a change at the 3rd codon position often leaves the encoded amino acid unchanged.
- Many amino acids share related biophysical properties. Though these amino acids are not identical, they can be more easily substituted for each other. These relationships can be accounted for using scoring systems.
When is it appropriate to compare nucleic acid sequences?
- confirming the identity of a DNA sequence in a database search,
- searching for polymorphisms,
- confirming the identity of a cloned cDNA.
Even when a protein-coding nucleotide sequence is analyzed, it is usually preferable to study the encoded protein sequence.
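The first reason can be illustrated with two made-up coding sequences that differ only at third codon positions. The codon table below is a small excerpt of the standard genetic code, just large enough for this example, and the sequences `dna_1` and `dna_2` are invented.

```python
# Excerpt of the standard genetic code (DNA codons), enough for this example only.
CODON_TABLE = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "GAT": "D", "GAC": "D",                          # aspartic acid
    "GAA": "E", "GAG": "E",                          # glutamic acid
}


def translate(dna: str) -> str:
    """Translate a coding sequence codon by codon using the excerpt above."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equally long sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna_1 = "GCTGATGAA"
dna_2 = "GCCGACGAG"  # differs from dna_1 at every 3rd codon position

print(round(identity(dna_1, dna_2), 2))                        # 0.67 at the DNA level
print(translate(dna_1), translate(dna_2))                      # ADE ADE
print(identity(translate(dna_1), translate(dna_2)))            # 1.0 at the protein level
```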