1
CS262 Lecture 9, Win07, Batzoglou Gene Recognition
2
Using Comparative Information
3
Using Comparative Information
The Hox cluster is an example where everything is conserved
4
Patterns of Conservation

              Genes    Intergenic   Separation
Mutations     30%      58%          2-fold
Gaps          1.3%     14%          10-fold
Frameshifts   0.14%    10.2%        75-fold
5
Comparison-based Gene Finders
Rosetta, 2000; CEM, 2000
  - First methods to apply comparative genomics (human-mouse) to improve gene prediction
Twinscan, 2001
  - First HMM for comparative gene prediction in two genomes
SLAM, 2002
  - Generalized pair-HMM for simultaneous alignment and gene prediction in two genomes
N-SCAN, 2006
  - Best method to date, based on a phylo-HMM for gene prediction across multiple genomes
6
Twinscan
1. Align the two sequences (e.g., from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), or match ( | )
   New "alphabet": 4 x 3 = 12 letters = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions e_k(b), where b ∈ { A-, A:, A|, …, U| }
Emission distributions e_k(b) are estimated from real human/mouse genes:
  e_I(x|) < e_E(x|): matches are favored in exons
  e_I(x-) > e_E(x-): gaps (and mismatches) are favored in introns
Example
  Human     : ACGGCGACGUGCACGU
  Mouse     : ACUGUGACGUGCACUU
  Alignment : ||:|:|||||||||:|
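Step 2 of the Twinscan slide can be sketched in a few lines of Python. This is only an illustration of the labeling idea, not Twinscan's actual code: the function names `conservation_symbols` and `joint_alphabet_sequence` are made up here, and the input is assumed to be a pairwise alignment given as two equal-length strings with `-` for gaps.

```python
def conservation_symbols(target: str, informant: str) -> str:
    """Label every target base: '|' match, ':' mismatch, '-' aligned to a gap."""
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for t, q in zip(target, informant):
        if t == "-":
            continue  # gap in the target: there is no target base to label
        if q == "-":
            symbols.append("-")
        elif t == q:
            symbols.append("|")
        else:
            symbols.append(":")
    return "".join(symbols)


def joint_alphabet_sequence(target: str, informant: str) -> list:
    """Pair each target base with its symbol, giving letters of the 12-letter alphabet."""
    bases = [t for t in target if t != "-"]
    return [b + s for b, s in zip(bases, conservation_symbols(target, informant))]


# The example from the slide.
human = "ACGGCGACGUGCACGU"
mouse = "ACUGUGACGUGCACUU"
print(conservation_symbols(human, mouse))        # ||:|:|||||||||:|
print(joint_alphabet_sequence(human, mouse)[:3])  # ['A|', 'C|', 'G:']
```

The resulting letters (for example `G:`) are what the HMM would emit in step 3; the emission tables themselves are not shown on the slide and are not modeled here.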
7
SLAM: Generalized Pair HMM
Exon GPHMM:
1. Choose exon lengths (d, e).
2. Generate alignment of length d+e.
8
N-SCAN: Multiple Species Gene Prediction
GENSCAN, TWINSCAN, N-SCAN

Target sequence with conservation sequence:
  Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
  Conservation  |||:||:||:|||||:||||||||......

Target sequence with informant sequences (vector); joint prediction (use phylo-HMM):
  Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
  Informant1    GGTCAGC___CCAAGAACGTGTAG......
  Informant2    GATCAGC___CCAAGAACGTGTAG......
  Informant3    GGTGAGCTGACCAAGATCGTGTTGACACAA
9
N-SCAN: Multiple Species Gene Prediction
[Figure: two tree diagrams over nodes labeled X, C, Y, Z, H, M, R]
10
Performance Comparison
GENSCAN: generalized HMM; models human sequence
TWINSCAN: generalized HMM; models human/mouse alignments
N-SCAN: phylo-HMM; models multiple sequence evolution
N-SCAN human/mouse > human/multiple informants
11
CONTRAST
2-level architecture
No phylo-HMM that models alignments

Human      tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Macaque    tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Mouse      ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca
Rat        ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca
Rabbit     t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Dog        t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta
Cow        t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta
Armadillo  gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta
Elephant   gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Tenrec     tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Opossum    ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg
Chicken    ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg

[Diagram labels: SVM, X, Y, a, b]
12
CONTRAST
13
CONTRAST - Features
$\log P(y \mid x) \sim w^T F(x, y)$
$F(x, y) = \sum_i f(y_{i-1}, y_i, i, x)$
Example features $f(y_{i-1}, y_i, i, x)$:
  $1\{y_{i-1} = \text{INTRON},\ y_i = \text{EXON\_FRAME\_1}\}$
  $1\{y_{i-1} = \text{EXON\_FRAME\_1},\ x_{human,i-2}, \ldots, x_{human,i+3} = \text{ACCGGT}\}$
  $1\{y_{i-1} = \text{EXON\_FRAME\_1},\ x_{human,i-1}, \ldots, x_{dog,i+1} = \text{ACC, AGC}\}$
  $(1-c)\,1\{a < \text{SVM\_DONOR}(i) < b\}$
  (optional) $1\{\text{EXON\_FRAME\_1},\ \text{EST\_EVIDENCE}\}$
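The scoring rule on this slide, a weighted sum of indicator features over adjacent label pairs, can be sketched as follows. This is a toy illustration and not CONTRAST's implementation: the function names, the weights, the 0-based indexing, and the tiny labeled sequence are all assumptions made for the example, and only the first two features from the slide are included.

```python
from typing import Callable, Dict, Sequence

# A feature looks at (y_{i-1}, y_i, i, x) and returns a number (here 0.0 or 1.0).
Feature = Callable[[str, str, int, Dict[str, str]], float]


def intron_to_exon(y_prev: str, y_cur: str, i: int, x: Dict[str, str]) -> float:
    """1{y_{i-1} = INTRON, y_i = EXON_FRAME_1}"""
    return 1.0 if (y_prev == "INTRON" and y_cur == "EXON_FRAME_1") else 0.0


def exon_hexamer(y_prev: str, y_cur: str, i: int, x: Dict[str, str]) -> float:
    """1{y_{i-1} = EXON_FRAME_1, x_human[i-2 .. i+3] = ACCGGT} (0-based positions)"""
    window = x["human"][i - 2:i + 4] if i >= 2 else ""
    return 1.0 if (y_prev == "EXON_FRAME_1" and window == "ACCGGT") else 0.0


def crf_score(weights: Sequence[float], features: Sequence[Feature],
              labels: Sequence[str], x: Dict[str, str]) -> float:
    """Unnormalized log-score w^T F(x, y), with F(x, y) = sum_i f(y_{i-1}, y_i, i, x)."""
    total = 0.0
    for i in range(1, len(labels)):
        for w, f in zip(weights, features):
            total += w * f(labels[i - 1], labels[i], i, x)
    return total


# Toy input: 4 intron positions followed by 6 exon positions (hypothetical data).
x = {"human": "TTAGACCGGT"}
labels = ["INTRON"] * 4 + ["EXON_FRAME_1"] * 6
score = crf_score([2.0, 0.5], [intron_to_exon, exon_hexamer], labels, x)
print(score)  # 2.5: one intron-to-exon transition plus one hexamer match
```

Turning this score into P(y | x) requires normalizing over all labelings, which is done with dynamic programming in practice and is omitted here.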
14
CONTRAST - SVM accuracies
Accuracy increases as we add informants
Diminishing returns after ~5 informants
[Plot labels: SN, SP]
15
CONTRAST - Decoding
Viterbi decoding: maximize $P(y \mid x)$
Maximum Expected Boundary Accuracy decoding: maximize
  $\sum_{i,B} 1\{y_{i-1}, y_i \text{ is exon boundary } B\}\ \mathrm{Accuracy}(y_{i-1}, y_i, B \mid x)$
where
  $\mathrm{Accuracy}(y_{i-1}, y_i, B \mid x) = P(y_{i-1}, y_i \text{ is } B \mid x) - (1 - P(y_{i-1}, y_i \text{ is } B \mid x))$
16
CONTRAST - Training
Maximum Conditional Likelihood training: maximize $L(w) = P_w(y \mid x)$
Maximum Expected Boundary Accuracy training:
  $\mathrm{ExpectedBoundaryAccuracy}(w) = \sum_i \mathrm{Accuracy}_i$
  $\mathrm{Accuracy}_i = \sum_B 1\{(y_{i-1}, y_i) \text{ is exon boundary } B\}\ P_w(y_{i-1}, y_i \text{ is } B \mid x) - \sum_{B' \neq B} P(y_{i-1}, y_i \text{ is exon boundary } B' \mid x)$
17
Performance Comparison
Panels: De Novo, EST-assisted
Species shown: Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, Chicken
18
Performance Comparison
19
Gene Regulation and Microarrays
20
Overview
A. Gene Expression and Regulation
B. Measuring Gene Expression: Microarrays
C. Finding Regulatory Motifs
21
Cells respond to environment
A cell responds to its environment: various external messages
22
Genome is fixed – Cells are dynamic
A genome is static
  - Every cell in our body has a copy of the same genome
A cell is dynamic
  - Responds to external conditions
  - Most cells follow a cell cycle of division
  - Cells differentiate during development
Gene expression varies according to:
  - Cell type
  - Cell cycle
  - External conditions
  - Location
slide credits: M. Kellis
23
Where gene regulation takes place
  - Opening of chromatin
  - Transcription
  - Translation
  - Protein stability
  - Protein modifications
24
Transcriptional Regulation
Efficient place to regulate:
  - No energy wasted making intermediate products
  - However, slowest response time
After a receptor notices a change:
  1. Cascade message to nucleus
  2. Open chromatin & bind transcription factors
  3. Recruit RNA polymerase and transcribe
  4. Splice mRNA and send to cytoplasm
  5. Translate into protein
25
Transcription Factors Binding to DNA
Transcription regulation: transcription factors bind DNA
Binding recognizes DNA substrings: regulatory motifs
26
Promoter and Enhancers
A promoter is necessary to start transcription
Enhancers can affect transcription from afar
27
Gene Regulation with TFs
[Diagram labels: Transcription Factor (Protein), Regulatory Element, DNA, Gene, RNA polymerase]
28
Gene Regulation with TFs
[Diagram labels: Transcription Factor (Protein), Regulatory Element, DNA, Gene, RNA polymerase]
29
Gene Regulation with TFs
[Diagram labels: Transcription Factor (Protein), Regulatory Element, DNA, Gene, RNA polymerase, New protein]
30
[Slide: a full page of raw, unannotated genomic DNA sequence]
31
[Slide: the same page of genomic DNA sequence, annotated with a legend: Promoter motifs, 3' UTR motifs, Exons, Introns]
32
Example: A Human heat shock protein
TATA box: positioning transcription start
TATA, CCAAT: constitutive transcription
GRE: glucocorticoid response
MRE: metal response
HSE: heat shock element
[Diagram: promoter of heat shock hsp70, coordinates 0 to -158 relative to GENE; element labels as extracted: TATA, SP1, CCAAT, AP2, HSE, AP2, CCAAT, SP1]
33
DNA Microarrays
Measuring gene transcription in a high-throughput fashion
34
What is a microarray
35
What is a microarray
A 2D array of DNA sequences from thousands of genes
Each spot has many copies of the same gene
Measure the number of hybridizations per spot
Result: thousands of "experiments", one per gene, in one go
Perform many microarrays for different conditions:
  - Time during cell cycle
  - Temperature
  - Nutrient level
36
Goal of Microarray Experiments
Measure level of gene expression across many different conditions:
  - Expression matrix $M$: {genes} × {conditions}, with $M_{ij} = |\text{gene}_i|$ in condition $j$
Group genes into coregulated sets:
  - Observe cells under different conditions
  - Find genes with similar expression profiles
  - Potentially regulated by the same TF
slide credits: M. Kellis
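One common way to make "similar expression profiles" concrete is the Pearson correlation between two rows of the expression matrix. The slide does not name a similarity measure, so this choice is an assumption, and the small matrix below contains made-up values used only for illustration.

```python
import math


def pearson(u, v):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a flat profile carries no information about co-variation
    return cov / (norm_u * norm_v)


# Hypothetical expression matrix: rows are genes, columns are 4 conditions.
M = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # same shape as geneA at a higher level
    "geneC": [4.0, 3.0, 2.0, 1.0],  # opposite trend
}

print(pearson(M["geneA"], M["geneB"]))  # close to  1.0: candidates for co-regulation
print(pearson(M["geneA"], M["geneC"]))  # close to -1.0: anti-correlated
```

Correlation ignores absolute expression level, so geneA and geneB count as similar even though geneB is expressed at twice the level.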
37
Clustering vs. Classification
Clustering
  Idea: groups of genes that share similar function have similar expression patterns
  - Hierarchical clustering
  - k-means
  - Bayesian approaches
  - Projection techniques
      - Principal Component Analysis
      - Independent Component Analysis
Classification
  Idea: a cell can be in one of several states (Diseased vs. Healthy, Cancer X vs. Cancer Y vs. Normal)
  Can we train an algorithm to use the gene expression patterns to determine which state a cell is in?
  - Support Vector Machines
  - Decision Trees
  - Neural Networks
  - K-Nearest Neighbors
38
Clustering Algorithms
[Figures: the same points a-h clustered two ways, by K-means (centers c1, c2, c3) and by hierarchical clustering]
slide credits: M. Kellis
39
Hierarchical clustering
Bottom-up algorithm:
  - Initialization: each point in a separate cluster
  - At each step: choose the pair of closest clusters and merge them
The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y
Avoids the problem of specifying the number of clusters
[Figure: points a-h]
slide credits: M. Kellis
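The bottom-up procedure can be sketched directly from the description above. The slide leaves the cluster distance CD(X,Y) open; this sketch assumes single linkage (distance between the closest pair of points) as one possible choice, and the function names and sample points are invented for the example. It recomputes all distances at every step, so it is far less efficient than a real implementation.

```python
import math
from itertools import combinations


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(X, Y, points):
    """CD(X, Y) as the distance between the closest pair of points (an assumed choice)."""
    return min(euclidean(points[i], points[j]) for i in X for j in Y)


def agglomerative(points, cluster_distance=single_linkage):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [frozenset([i]) for i in range(len(points))]  # each point on its own
    merges = []
    while len(clusters) > 1:
        X, Y = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], points))
        clusters = [c for c in clusters if c not in (X, Y)] + [X | Y]
        merges.append((set(X), set(Y)))
    return merges


# Hypothetical 2-D points; clusters are reported as sets of point indices.
points = [(0, 0), (0, 1), (5, 5), (5, 7), (10, 0)]
for step, (X, Y) in enumerate(agglomerative(points), start=1):
    print(step, sorted(X), "+", sorted(Y))
```

The merge history is the information a dendrogram displays; cutting it at any level yields a clustering without fixing the number of clusters in advance.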
40
Results of Clustering Gene Expression
CLUSTER is simple and easy to use
De facto standard for microarray analysis
Time: O(N^2 M), where N = #genes and M = #conditions
41
K-Means Clustering Algorithm
Each cluster $X_i$ has a center $c_i$
Define the clustering cost criterion
  $\mathrm{COST}(X_1, \ldots, X_k) = \sum_{X_i} \sum_{x \in X_i} |x - c_i|^2$
The algorithm tries to find clusters $X_1 \ldots X_k$ and centers $c_1 \ldots c_k$ that minimize COST
K-means algorithm:
  Initialize centers
  Repeat:
    Compute best clusters for given centers → attach each point to the closest center
    Compute best centers for given clusters → choose the centroid of the points in each cluster
  Until the changes in COST are "small"
[Figure: points a-h with centers c1, c2, c3]
slide credits: M. Kellis
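The loop on this slide can be written out as a short Python sketch. Details the slide leaves open are filled in with assumptions: centers are initialized by sampling k data points with a fixed seed, an empty cluster keeps its previous center, and "small" means the cost improves by no more than `tol`. The sample points are made up.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Alternate assignment and centroid updates until COST stops improving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization: k distinct data points
    prev_cost = float("inf")
    assignment, cost = [], prev_cost
    for _ in range(max_iter):
        # Best clusters for the given centers: attach each point to the closest center.
        assignment = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Best centers for the given clusters: the centroid of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = centroid(members)
        cost = sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))
        if prev_cost - cost <= tol:
            break
        prev_cost = cost
    return centers, assignment, cost


# Hypothetical data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment, cost = kmeans(points, k=2)
print(assignment, round(cost, 4))
```

Each of the two steps can only lower COST, which is why the loop terminates, but the result is a local optimum that depends on the initial centers.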
42
K-Means Algorithm
Randomly initialize clusters
43
K-Means Algorithm
Assign data points to nearest clusters
44
K-Means Algorithm
Recalculate clusters
45
K-Means Algorithm
Recalculate clusters
46
K-Means Algorithm
Repeat
47
K-Means Algorithm
Repeat
48
K-Means Algorithm
Repeat … until convergence
Time: O(KNM) per iteration, where N = #genes and M = #conditions
49
Mixture of Gaussians - Probabilistic K-means
Data is modeled as a mixture of K Gaussians $N(\mu_1, \sigma^2 I), \ldots, N(\mu_K, \sigma^2 I)$
Prior probabilities $\alpha_1, \ldots, \alpha_K$
A different $\sigma_i$ for every Gaussian $i$, or even different covariance matrices, are possible, but learning becomes harder
$P(x) = \sum_i P(x \mid N(\mu_i, \sigma^2 I))\, \alpha_i$
Use EM to learn parameters
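A minimal EM sketch for this model, with one shared spherical variance, might look as follows. The slide only says "use EM", so the update formulas are the standard ones for this model rather than anything taken from the lecture; the function name, the fixed iteration count, the numerical floors, and the one-dimensional sample data are assumptions. The Greek symbols lost from the slide are written here as `means`, `var`, and `alphas`.

```python
import math


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def em_spherical_gmm(points, init_means, n_iter=50):
    """EM for a mixture of K Gaussians N(mean_i, var * I) with priors alphas."""
    k, n, dim = len(init_means), len(points), len(points[0])
    means = [list(m) for m in init_means]
    alphas = [1.0 / k] * k
    var = 1.0
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each point. The shared
        # normalizing constant (2*pi*var)^(-dim/2) cancels, so it is left out.
        resp = []
        for x in points:
            logs = [math.log(alphas[i]) - sq_dist(x, means[i]) / (2 * var) for i in range(k)]
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: re-estimate priors, means, and the shared variance.
        for i in range(k):
            n_i = max(sum(r[i] for r in resp), 1e-12)  # floor avoids division by zero
            alphas[i] = n_i / n
            means[i] = [sum(r[i] * x[d] for r, x in zip(resp, points)) / n_i
                        for d in range(dim)]
        var = sum(r[i] * sq_dist(x, means[i])
                  for r, x in zip(resp, points) for i in range(k)) / (n * dim)
        var = max(var, 1e-6)  # floor keeps the E-step well defined
    return alphas, means, var


# Hypothetical 1-D data drawn around 0 and around 5.
points = [(0.0,), (0.2,), (-0.2,), (5.0,), (5.2,), (4.8,)]
alphas, means, var = em_spherical_gmm(points, init_means=[(1.0,), (4.0,)])
print([round(a, 3) for a in alphas], [round(m[0], 3) for m in means], round(var, 4))
```

Compared with K-means, each point is assigned to every cluster with a weight instead of to a single closest center; as the variance shrinks, the two procedures behave alike.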
50
Analysis of Clustering Data
Statistical significance of clusters
  - Gene Ontology: http://www.geneontology.org/
  - KEGG: http://www.genome.jp/kegg/
Regulatory motifs responsible for common expression
Regulatory networks
Experimental verification
51
Evaluating clusters - Hypergeometric Distribution
N genes, p labeled +, (N-p) labeled -
Cluster: k genes, m labeled +
P-value of a single cluster containing k genes of which at least r are +:
  - Probability that a random set of k genes has m + and k-m - genes:
      $P(m) = \binom{p}{m} \binom{N-p}{k-m} \Big/ \binom{N}{k}$
  - P-value that at least r genes are + in the cluster:
      $\sum_{m \ge r} \binom{p}{m} \binom{N-p}{k-m} \Big/ \binom{N}{k}$
slide credits: M. Kellis
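The cluster p-value described on this slide is the upper tail of the hypergeometric distribution, which can be computed exactly with integer binomial coefficients. The function names below are made up for the sketch, the parameter names follow the slide (N, p, k, m, r), and the numbers in the usage example are hypothetical.

```python
from math import comb


def hypergeom_pmf(N, p, k, m):
    """Probability that a random set of k genes (out of N, p labeled +) has exactly m +."""
    if m < 0 or m > k or m > p or k - m > N - p:
        return 0.0  # impossible count
    return comb(p, m) * comb(N - p, k - m) / comb(N, k)


def enrichment_p_value(N, p, k, r):
    """Probability that a random set of k genes contains at least r genes labeled +."""
    return sum(hypergeom_pmf(N, p, k, m) for m in range(r, min(k, p) + 1))


# Hypothetical numbers: 10 genes, 4 labeled +, a cluster of 3 genes with 2 labeled +.
print(enrichment_p_value(N=10, p=4, k=3, r=2))  # 40/120, about 0.333
```

When many clusters and many labels are tested at once, the individual p-values usually need a multiple-testing correction, which this sketch does not attempt.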