Download presentation
1
DNA/RNA Protein Expression Interaction
Molecular Data DNA/RNA Protein Expression Interaction
2
A sequence A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids
3
Character representation of sequences
DNA or RNA use 1-letter codes (e.g., A,C,G,T) protein use 1-letter codes can convert to/from 3-letter codes
4
The I.U.B. Code proposed by International Union of Biochemistry
A, C, G, T, U R = A, G (puRine) Y = C, T (pYrimidine) S = G, C (Strong hydrogen bonds) W = A, T (Weak hydrogen bonds) M = A, C (aMino group) K = G, T (Keto group) B = C, G, T (not A) D = A, G, T (not C) H = A, C, T (not G) V = A, C, G (not T/U) N = A, C, G, T/U (iNdeterminate) X or - are sometimes used
5
DNA code Amino Acid Abbreviation DNA Codons Alanine Ala
GCA, GCC, GCG, GCT Cysteine Cys TGC, TGT Aspartic Acid Asp GAC, GAT Glutamic Acid Glu GAA, GAG Phenylalanine Phe TTC, TTT Glycine Gly GGA, GGC, GGG, GGT Histidine His CAC, CAT Isoleucine Ile ATA, ATC, ATT Lysine Lys AAA, AAG Leucine Leu TTA, TTG, CTA, CTC, CTG, CTT Methionine Met ATG Asparagine Asn AAC, AAT Proline Pro CCA, CCC, CCG, CCT Glutamine Gln CAA, CAG Arginine Arg CGA, CGC, CGG, CGT Serine Ser TCA, TCC, TCG, TCT, AGC, AGT Threonine Thr ACA, ACC, ACG, ACT Valine Val GTA, GTC, GTG, GTT Tryptophan Trp TGG Tyrosine Tyr TAC, TAT Stop . TAA, TAG, TGA
6
Fasta format >gi| |ref|NM_ | Homo sapiens cyclin-dependent kinase inhibitor AGCTGAGGTGTGAGCAGCTGCCGAAGTCAGTTCCTTGTGGAGCCGGAGCTGGGCGCGGATTCGCCGAGGC ACCGAGGCACTCAGAGGAGGTGAGAGAGCGGCGGCAGACAACAGGGGACCCCGGGCCGGCGGCCCAGAGC CGAGCCAAGCGTGCCCGCGTGTGTCCCTGCGTGTCCGCGAGGATGCGTGTTCGCGGGTGTGTGCTGCGTT CACAGGTGTTTCTGCGGCAGGCGCCATGTCAGAACCGGCTGGGGATGTCCGTCAGAACCCATGCGGCAGC AAGGCCTGCCGCCGCCTCTTCGGCCCAGTGGACAGCGAGCAGCTGAGCCGCGACTGTGATGCGCTAATGG CGGGCTGCATCCAGGAGGCCCGTGAGCGATGGAACTTCGACTTTGTCACCGAGACACCACTGGAGGGTGA CTTCGCCTGGGAGCGTGTGCGGGGCCTTGGCCTGCCCAAGCTCTACCTTCCCACGGGGCCCCGGCGAGGC CGGGATGAGTTGGGAGGAGGCAGGCGGCCTGGCACCTCACCTGCTCTGCTGCAGGGGACAGCAGAGGAAG ACCATGTGGACCTGTCACTGTCTTGTACCCTTGTGCCTCGCTCAGGGGAGCAGGCTGAAGGGTCCCCAGG TGGACCTGGAGACTCTCAGGGTCGAAAACGGCGGCAGACCAGCATGACAGATTTCTACCACTCCAAACGC CGGCTGATCTTCTCCAAGAGGAAGCCCTAATCCGCCCACAGGAAGCCTGCAGTCCTGGAAGCGCGAGGGC CTCAAAGGCCCGCTCTACATCTTCTGCCTTAGTCTCAGTTTGTGTGTCTTAATTATTATTTGTGTTTTAA TTTAAACACCTCCTCATGTACATACCCTGGCCGCCCCCTGCCCCCCAGCCTCTGGCATTAGAATTATTTA AACAAAAACTAGGCGGTTGAATGAGAGGTTCCTAAGAGTGCTGGGCATTTTTATTTTATGAAATACTATT TAAAGCCTCCTCATCCCGTGTTCTCCTTTTCCTCTCTCCCGGAGGTTGGGTGGGCCGGCTTCATGCCAGC TACTTCCTCCTCCCCACTTGTCCGCTGGGTGGTACCCTCTGGAGGGGTGTGGCTCCTTCCCATCGCTGTC ACAGGCGGTTATGAAATTCACCCCCTTTCCTGGACACTCAGACCTGAATTCTTTTTCATTTGAGAAGTAA ACAGATGGCACTTTGAAGGGGCCTCACCGAGTGGGGGCATCATCAAAAACTTTGGAGTCCCCTCACCTCC TCTAAGGTTGGGCAGGGTGACCCTGAAGTGAGCACAGCCTAGGGCTGAGCTGGGGACCTGGTACCCTCCT GGCTCTTGATACCCCCCTCTGTCTTGTGAAGGCAGGGGGAAGGTGGGGTCCTGGAGCAGACCACCCCGCC TGCCCTCATGGCCCCTCTGACCTGCACTGGGGAGCCCGTCTCAGTGTTGAGCCTTTTCCCTCTTTGGCTC CCCTGTACCTTTTGAGGAGCCCCAGCTACCCTTCTTCTCCAGCTGGGCTCTGCAATTCCCCTCTGCTGCT GTCCCTCCCCCTTGTCCTTTCCCTTCAGTACCCTCTCAGCTCCAGGTGGCTCTGAGGTGCCTGTCCCACC CCCACCCCCAGCTCAATGGACTGGAAGGGGAAGGGACACACAAGAAGAAGGGCACCCTAGTTCTACCTCA GGCAGCTCAAGCAGCGACCGCCCCCTCCTCTAGCTGTGGGGGTGAGGGTCCCATGTGGTGGCACAGGCCC CCTTGAGTGGGGTTATCTCTGTGTTAGGGGTATATGATGGGGGAGTAGATCTTTCTAGGAGGGAGACACT GGCCCCTCAAATCGTCCAGCGACCTTCCTCATCCACCCCATCCCTCCCCAGTTCATTGCACTTTGATTAG CAGCGGAACAAGGAGTCAGACATTTTAAGATGGTGGCAGTAGAGGCTATGGACAGGGCATGCCACGTGGG CTCATATGGGGCTGGGAGTAGTTGTCTTTCCTGGCACTAACGTTGAGCCCCTGGAGGCACTGAAGTGCTT AGTGTACTTGGAGTATTGGGGTCTGACCCCAAACACCTTCCAGCTCCTGTAACATACTGGCCTGGACTGT TTTCTCTCGGCTCCCCATGTGTCCTGGTTCCCGTTTCTCCACCTAGACTGTAAACCTCTCGAGGGCAGGG ACCACACCCTGTACTGTTCTGTGTCTTTCACAGCTCCTCCCACAATGCTGAATATACAGCAGGTGCTCAA TAAATGATTCTTAGTGACTTTAAAAAAAAAAAAAAAAAAAA
7
Sequence Content Mononucleotide frequencies Dinucleotide frequencies
GC content Dinucleotide frequencies CpG islands
8
GC content is non-random
Lander et al
9
GC content and expression
10
Determining mononucleotide frequencies
Alphabet: A T C G Count how many times each nucleotide appears in sequence Divide (normalize) by total number of nucleotides fA mononucleotide frequency of A (frequency that A is observed) pAmononucleotide probability that a nucleotide will be an A
11
Determining dinucleotide frequencies
Make 4 x 4 matrix, one element for each ordered pair of nucleotides Set all elements to zero Go through sequence linearly, adding one to matrix entry corresponding to the pair of sequence elements observed at that position Divide by total number of dinucleotides fAC dinucleotide frequency of AC (frequency that AC is observed out of all dinucleotides)
12
Dinucleotide counts A T C G ATTCGACCAGAG Create a 4 x 4 matrix
Set all cells to zeros Use a window of size 2 and add 1 to each cell of the matrix when encountering the specified dinucleotide A T C G ATTCGACCAGAG
13
Dinucleotide counts A T C G 1 2 ATTCGACCAGAG
14
Observed and expected frequencies
15
Observed and expected frequencies
16
Dinucleotide frequencies in genome
17
Sequence features A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function
18
Sequence features promoters transcription initiation sites
transcription termination sites polyadenylation sites ribosome binding sites protein features
19
Consensus sequences A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature Consensus sequences are regular expressions
20
Occurences Basic Algorithm
Example: recognition site for a restriction enzyme EcoRI recognizes GAATTC AccI recognizes GTMKAC Basic Algorithm Start with first character of sequence to be searched See if enzyme site matches starting at that position Advance to next character of sequence to be searched Repeat previous two steps until all positions have been tested
21
Statistics of pattern appearance
Goal: Determine the significance of observing a feature (pattern) Method: Estimate the probability that a pattern would occur randomly in a given sequence. Three different methods Assume all nucleotides are equally frequent Use measured frequencies of each nucleotide (mononucleotide frequencies) Use measured frequencies with which a given nucleotide follows another (dinucleotide frequencies)
22
Example 1 What is the probability of observing the sequence feature ART (A followed by a purine, either A or G, followed by a T)? Using observed mononucleotide frequencies: pART = pA (pA + pG) pT Using equal mononucleotide frequencies pA = pC = pG = pT = 1/4 pART = 1/4 * (1/4 + 1/4) * 1/4 = 1/32
23
Example 1: using mononucleotide frequencies
Using equal mononucleotide frequencies pA = pC = pG = pT = 1/4 pART = 1/4 * (1/4 + 1/4) * 1/4 = 1/32 Using observed mononucleotide frequencies: pART = pA (pA + pG) pT
24
Example 1: using dinucleotide frequencies
pART=pA(p*AAp*AT+p*AGp*GT)
25
Example 2: What is the probability of observing the sequence feature ARYT (A followed by a purine {either A or G}, followed by a pyrimidine {either C or T}, followed by a T)? Using equal mononucleotide frequencies pA = pC = pG = pT = 1/4 pARYT = 1/4 * (1/4 + 1/4) * (1/4 + 1/4) * 1/4 = 1/64
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.