CS262 Lecture 16, Win07, Batzoglou Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov.

1 CS262 Lecture 16, Win07, Batzoglou: Gene Recognition
Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov

2 Gene structure
Figure labels: exon1, intron1, exon2, intron2, exon3; transcription, splicing, translation
exon = protein-coding
intron = non-coding
Codon: A triplet of nucleotides that is converted to one amino acid

3 Hidden Markov Models for Gene Finding
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
Sequence labels: exon, intron, intergene
States: Intergene State, First Exon State, Intron State

4 Hidden Markov Models for Gene Finding
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
Sequence labels: exon, intron, intergene
States: Intergene State, First Exon State, Intron State

5 Duration HMM for Gene Finding
TAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTGGGGGGGGGGGGGGGCCCCCCC
Exon1, Exon2, Exon3 (each with duration d)
Duration Modeling
Introns: regular HMM states, geometric duration
Exons: special duration model
$$V_{E_{0,0}}(i) = \max_{d=1 \ldots D} \Big\{ \mathrm{Prob}[\mathrm{duration}(E_{0,0}) = d] \times a_{\mathrm{Intron}_0, E_{0,0}} \times \prod_{j=i-d+1}^{i} e_{E_{0,0}}(x_j) \Big\}$$
where i is an admissible exon-ending state, D is restricted by the longest ORF
GENSCAN: Chris Burge and Sam Karlin, 1997
Best performing de novo gene finder
HMM with duration modeling for Exon states
Model components (figure labels):
Intron: $\prod_i P_{\mathrm{INTRON}}(x_i \mid x_{i-1} \ldots x_{i-w})$
Exon: $P_{\mathrm{EXON\_DUR}}(d) \prod_i P_{\mathrm{EXON}((i-j+2)\%3)}(x_i \mid x_{i-1} \ldots x_{i-w})$
5’ splice site: $P_{5'SS}(x_{i-3} \ldots x_{i+4})$
Stop: $P_{\mathrm{STOP}}(x_{i-4} \ldots x_{i+3})$

6 HMM-based Gene Finders
GENMARK (Borodovsky & McIninch 1993)
GENIE (Kulp 1996)
GENSCAN (Burge 1997)
- Big jump in accuracy of de novo gene finding
- Currently, one of the best
- HMM with duration modeling for Exon states
FGENESH (Solovyev 1997)
- Currently one of the best
HMMgene (Krogh 1997)
VEIL (Henderson, Salzberg, & Fasman 1997)

7 Better way to do it: negative binomial
EasyGene: Prokaryotic gene-finder (Larsen TS, Krogh A)
Negative binomial with n = 3
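A minimal sketch of why a negative binomial is a "better way" to model durations than a single geometric state. It assumes one common parametrization that is not stated on the slide: n identical states in series, each with self-loop probability p, so the total duration d (with d >= n) follows a negative binomial. The function names and the value p = 0.9 are illustrative only, and EasyGene's actual parametrization may differ.

```python
from math import comb

def geometric_pmf(d, p):
    """P(duration = d) for one HMM state with self-loop probability p (d >= 1)."""
    return p ** (d - 1) * (1 - p)

def negative_binomial_pmf(d, p, n=3):
    """P(total duration = d) for n states in series, each with self-loop p (d >= n)."""
    if d < n:
        return 0.0
    # Choose which n - 1 of the first d - 1 steps are state exits; the last step is the final exit.
    return comb(d - 1, n - 1) * p ** (d - n) * (1 - p) ** n

if __name__ == "__main__":
    p = 0.9  # assumed self-loop probability, for illustration only
    for d in (1, 3, 10, 20, 40):
        print(d, geometric_pmf(d, p), negative_binomial_pmf(d, p))
```

The geometric distribution always puts its highest probability on the shortest duration, while the negative binomial with n = 3 peaks at an intermediate length, which is closer to how real coding regions behave.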

8 GENSCAN’s hidden weapon
C+G content is correlated with:
- Gene content (+)
- Mean exon length (+)
- Mean intron length (–)
These quantities affect parameters of model
Solution:
- Train parameters of model in four different C+G content ranges!

9 Evaluation of Accuracy (Slide by NF Samatova)
Each measure is stated at the exon level, with the nucleotide-level version in parentheses.
Sensitivity (Sn): Fraction of exons (coding nucleotides) whose boundaries are predicted exactly (that are predicted as coding)
Specificity (Sp): Fraction of the predicted exons (coding nucleotides) that are exactly correct (that are coding)
Correlation Coefficient (CC): Combined measure of Sensitivity & Specificity
Range: -1 (always wrong) to +1 (always right)
Confusion matrix (nucleotide level):
                      Actual Coding   Actual No Coding
Predicted Coding      TP              FP
Predicted No Coding   FN              TN
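The nucleotide-level measures can be sketched as follows. The slide does not give formulas, so this assumes the definitions commonly used in gene-finder evaluation: Sn = TP/(TP+FN), Sp = TP/(TP+FP), and CC as the Pearson correlation of the 2x2 table. The label strings ("C" for coding, "N" for non-coding) and function names are hypothetical.

```python
from math import sqrt

def confusion_counts(actual, predicted, coding="C"):
    """Count TP, FP, TN, FN over two equal-length per-nucleotide label strings."""
    if len(actual) != len(predicted):
        raise ValueError("label strings must have equal length")
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == coding and p == coding:
            tp += 1
        elif a != coding and p == coding:
            fp += 1
        elif a == coding and p != coding:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def accuracy_measures(tp, fp, tn, fn):
    """Return (Sn, Sp, CC); raises ZeroDivisionError if a margin of the table is empty."""
    sn = tp / (tp + fn)   # fraction of coding nucleotides predicted as coding
    sp = tp / (tp + fp)   # fraction of predicted-coding nucleotides that are coding
    cc = (tp * tn - fn * fp) / sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return sn, sp, cc

if __name__ == "__main__":
    counts = confusion_counts("CCCCNNNNNN", "CCCNNNCNNN")
    print(counts, accuracy_measures(*counts))
```

Note that Sp as defined on the slide is what other fields call precision, not TN/(TN+FP).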

10 Results of GENSCAN
On the initial test dataset (Burset & Guigo):
- 80% exact exon detection
- 10% partial exons
- 10% wrong exons
In general:
- HMMs have been best in de novo prediction
- In practice they overpredict human genes by ~2x

11 Comparison-based Methods

12 Cross-species gene finding
Gene layout (5’ to 3’): Exon1, Intron1, Exon2, Intron2, Exon3
[human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
| ||||| ||||| ||| ||||| ||||||||||||| | |
[mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-

13 Comparison of 1196 orthologous genes (Makalowski et al., 1996)
Sequence identity between genes in human/mouse:
- exons: 84.6%
- protein: 85.4%
- introns: 35%
- 5’ UTRs: 67%
- 3’ UTRs: 69%
27 proteins were 100% identical

14 [Figure only]

15 Not always: HoxA human-mouse

16 Patterns of Conservation
               Genes    Intergenic   Separation
Mutations      30%      58%          2-fold
Gaps           1.3%     14%          10-fold
Frameshifts    0.14%    10.2%        75-fold

17 Twinscan
Twinscan is an augmented version of the GENSCAN HMM.
Figure labels: states E and I; transitions, duration, emissions
ACUAUACAGACAUAUAUCAU

18 Twinscan Algorithm
1. Align the two sequences (e.g. from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )
   New “alphabet”: 4 x 3 = 12 letters
   $\Sigma$ = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions $e_k(b)$ where $b \in$ { A-, A:, A|, …, U| }
Emission distributions $e_k(b)$ estimated from real genes from human/mouse
$e_I(x|) < e_E(x|)$: matches favored in exons
$e_I(x-) > e_E(x-)$: gaps (and mismatches) favored in introns

19 Example
Human:     ACGGCGACGUGCACGU
Mouse:     ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
Input to Twinscan HMM: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
Recall: $e_E(A|) > e_I(A|)$ and $e_E(A-) < e_I(A-)$
Likely exon
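Step 2 of the algorithm (building the 12-letter input) can be sketched as below. This follows the gap / mismatch / match marking described on the slides; the function name is hypothetical, and skipping columns where the target itself has a gap is an assumption, since those columns carry no target base to label.

```python
def conservation_symbols(target, informant):
    """Label each target base with '|' (match), ':' (mismatch) or '-' (gap in informant).

    Both arguments are rows of one pairwise alignment, so they must have equal length.
    """
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for t, q in zip(target, informant):
        if t == "-":
            continue  # assumption: columns with no target base are skipped
        if q == "-":
            mark = "-"
        elif q == t:
            mark = "|"
        else:
            mark = ":"
        symbols.append(t + mark)
    return symbols

if __name__ == "__main__":
    human = "ACGGCGACGUGCACGU"
    mouse = "ACUGUGACGUGCACUU"
    print(" ".join(conservation_symbols(human, mouse)))
```

Run on the human and mouse strings from the example slide, this reproduces the "Input to Twinscan HMM" line.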

20 HMMs for simultaneous alignment and gene finding: Generalized Pair HMMs

21 The SLAM hidden Markov model

22 Exon GPHMM
1. Choose exon lengths (d, e).
2. Generate alignment of length d+e.

23 Approximate alignment

24 Measuring Performance

25 Example: HoxA2 and HoxA3
Tracks shown: SLAM, SGP-2, Twinscan, Genscan, TBLASTX, SLAM CNS, VISTA, RefSeq

26 Gene Regulation and Microarrays

27 Overview
A. Gene Expression and Regulation
B. Measuring Gene Expression: Microarrays
C. Finding Regulatory Motifs

28 Cells respond to environment
Cell responds to environment: various external messages

29 Genome is fixed – Cells are dynamic
A genome is static
- Every cell in our body has a copy of the same genome
A cell is dynamic
- Responds to external conditions
- Most cells follow a cell cycle of division
Cells differentiate during development
Gene expression varies according to:
- Cell type
- Cell cycle
- External conditions
- Location
slide credits: M. Kellis

30 Where gene regulation takes place
Opening of chromatin
Transcription
Translation
Protein stability
Protein modifications

31 Transcriptional Regulation
Efficient place to regulate: No energy wasted making intermediate products
However, slowest response time
After a receptor notices a change:
1. Cascade message to nucleus
2. Open chromatin & bind transcription factors
3. Recruit RNA polymerase and transcribe
4. Splice mRNA and send to cytoplasm
5. Translate into protein

32 Transcription Factors Binding to DNA
Transcription regulation: Certain transcription factors bind DNA
Binding recognizes DNA substrings: Regulatory motifs

33 Promoter and Enhancers
Promoter necessary to start transcription
Enhancers can affect transcription from afar

34 Regulation of Genes
Figure labels: Gene, Regulatory Element, RNA polymerase (Protein), Transcription Factor (Protein), DNA

35 Regulation of Genes
Figure labels: Gene, RNA polymerase, Transcription Factor (Protein), Regulatory Element, DNA

36 Regulation of Genes
Figure labels: Gene, RNA polymerase, Transcription Factor, Regulatory Element, DNA, New protein

37 TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA CATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC CGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT AGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA AAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG ATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT CTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG AACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA GCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA CTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA TAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT GGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAA GTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAA TGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGA TACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT 
TCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACAT TTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAA AGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAAT ACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTAC AACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATAT CAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCG TTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTC TTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATT AATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT TCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG TTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATA CCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG TTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTA AGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGA TTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATA GTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATG CTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACT TAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGAT TGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT

38 TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA CATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC CGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT AGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA AAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG ATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT CTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG AACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA GCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA CTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA TAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT GGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAA GTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAA TGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGA TACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT 
TCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACAT TTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAA AGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAAT ACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTAC AACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATAT CAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCG TTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTC TTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATT AATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT TCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG TTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATA CCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG TTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTA AGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGA TTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATA GTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATG CTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACT TAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGAT TGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATTT Promoter motifs 3’ UTR motifsExons Introns

39 Example: A Human heat shock protein
TATA box: positioning transcription start
TATA, CCAAT: constitutive transcription
GRE: glucocorticoid response
MRE: metal response
HSE: heat shock element
Promoter of heat shock hsp70 (figure axis labeled 0 and -158, next to GENE); elements as labeled in the figure: TATA, SP1, CCAAT, AP2, HSE, AP2, CCAAT, SP1

40 The Cell as a Regulatory Network
Genes = wires
Motifs = gates
Rules shown in the figure:
- If C then D
- If B then NOT D
- If A and B then D
- If D then B
Figure labels: A, B, C, D; Make D, Make B; gene D, gene B

41 The Cell as a Regulatory Network (2)

42 DNA Microarrays
Measuring gene transcription in a high-throughput fashion

43 What is a microarray

44 What is a microarray
Measure the level of mRNA messages in a cell
Figure labels: array spots DNA 1 to DNA 6 (Gene 1 to Gene 6); RNA 4 and RNA 6; RT; cDNA 4 and cDNA 6; Hybridize; Measure
slide credits: M. Kellis

45 What is a microarray
A 2D array of DNA sequences from thousands of genes
Each spot has many copies of the same gene
Measure number of hybridizations per spot
Result: Thousands of “experiments” (one per gene) in one go
Perform many microarrays for different conditions:
- Time during cell cycle
- Temperature
- Nutrient level

46 Goal of Microarray Experiments
Measure level of gene expression across many different conditions:
- Expression Matrix M: {genes} × {conditions}: $M_{ij} = |\mathrm{gene}_i|$ in condition $j$
Group genes into coregulated sets
- Observe cells under different conditions
- Find genes with similar expression profiles
Potentially regulated by same TF
slide credits: M. Kellis

47 Clustering vs. Classification
Clustering
- Idea: Groups of genes that share similar function have similar expression patterns
  Hierarchical clustering
  k-means
  Bayesian approaches
  Projection techniques: Principal Component Analysis, Independent Component Analysis
Classification
- Idea: A cell can be in one of several states (Diseased vs. Healthy, Cancer X vs. Cancer Y vs. Normal)
- Can we train an algorithm to use the gene expression patterns to determine which state a cell is in?
  Support Vector Machines
  Decision Trees
  Neural Networks
  K-Nearest Neighbors

48 Clustering Algorithms
Figure labels: points a, b, c, d, e, f, g, h; centers c1, c2, c3; panels K-means and Hierarchical
slide credits: M. Kellis

49 Hierarchical clustering
Bottom-up algorithm:
- Initialization: each point in a separate cluster
At each step:
- Choose the pair of closest clusters
- Merge
The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y
Avoids the problem of specifying the number of clusters
Figure: points a to h
slide credits: M. Kellis

50 Distance between clusters
Single-link method: $CD(X,Y) = \min_{x \in X,\, y \in Y} D(x,y)$
Complete-link method: $CD(X,Y) = \max_{x \in X,\, y \in Y} D(x,y)$
Average-link method: $CD(X,Y) = \mathrm{avg}_{x \in X,\, y \in Y} D(x,y)$
Centroid method: $CD(X,Y) = D(\mathrm{avg}(X), \mathrm{avg}(Y))$
slide credits: M. Kellis
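The bottom-up algorithm and the four cluster distances can be sketched together. This is a deliberately naive version that recomputes every cluster distance at every step, so it is slower than tuned implementations; it assumes Euclidean distance for D, and the function names and the `num_clusters` stopping parameter are illustrative additions.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def centroid(points):
    """Coordinate-wise average of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def cluster_distance(X, Y, method="single"):
    """CD(X, Y) under the single, complete, average, or centroid definition."""
    if method == "centroid":
        return dist(centroid(X), centroid(Y))
    pairwise = [dist(x, y) for x in X for y in Y]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown method: " + method)

def hierarchical_clustering(points, method="single", num_clusters=1):
    """Merge the closest pair of clusters until num_clusters remain.

    Returns the final clusters and the list of merges in the order performed.
    """
    clusters = [[p] for p in points]  # initialization: each point in its own cluster
    merges = []
    while len(clusters) > num_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
    print(hierarchical_clustering(pts, "average", num_clusters=3)[0])
```

Running to `num_clusters=1` and keeping `merges` gives the full tree, which is why the number of clusters does not have to be fixed in advance.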

51 Results of Clustering Gene Expression
CLUSTER is simple and easy to use
De facto standard for microarray analysis
Time: $O(N^2 M)$
N: #genes
M: #conditions

52 K-Means Clustering Algorithm
Each cluster $X_i$ has a center $c_i$
Define the clustering cost criterion
$$\mathrm{COST}(X_1, \ldots, X_k) = \sum_{X_i} \sum_{x \in X_i} |x - c_i|^2$$
Algorithm tries to find clusters $X_1 \ldots X_k$ and centers $c_1 \ldots c_k$ that minimize COST
K-means algorithm:
- Initialize centers
- Repeat:
  Compute best clusters for given centers → Attach each point to the closest center
  Compute best centers for given clusters → Choose the centroid of points in cluster
- Until the changes in COST are “small”
slide credits: M. Kellis

53 K-Means Algorithm
Randomly Initialize Clusters

54 K-Means Algorithm
Assign data points to nearest clusters

55 K-Means Algorithm
Recalculate Clusters

56 K-Means Algorithm
Recalculate Clusters

57 K-Means Algorithm
Repeat

58 K-Means Algorithm
Repeat

59 K-Means Algorithm
Repeat … until convergence
Time: $O(KNM)$ per iteration
N: #genes
M: #conditions
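The K-means loop walked through on the preceding slides can be sketched as follows. It assumes Euclidean distance and a stopping rule of "COST decreased by less than `tol`"; the names `kmeans`, `init_centers`, `tol`, and `max_iter` are illustrative. Initial centers are drawn at random from the data unless passed explicitly, and an empty cluster keeps its previous center, which is one of several possible conventions.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+

def mean(points):
    """Centroid of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, init_centers=None, seed=0, tol=1e-9, max_iter=100):
    """Return (centers, clusters, cost) after alternating assignment and update steps."""
    if init_centers is None:
        centers = random.Random(seed).sample(points, k)  # random initialization
    else:
        centers = list(init_centers)
    prev_cost = float("inf")
    for _ in range(max_iter):
        # Best clusters for given centers: attach each point to the closest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: dist(x, centers[i]))
            clusters[nearest].append(x)
        # Best centers for given clusters: centroid of the points in each cluster.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        cost = sum(dist(x, centers[i]) ** 2 for i, c in enumerate(clusters) for x in c)
        if prev_cost - cost < tol:  # change in COST is "small"
            break
        prev_cost = cost
    return centers, clusters, cost

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(pts, 2, init_centers=[(0, 0), (10, 10)]))
```

Each iteration touches every point once per center in M dimensions, which matches the O(KNM) per-iteration cost quoted above. K-means only finds a local minimum of COST, so results can depend on the initialization.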

60 Mixture of Gaussians – Probabilistic K-means
Data is modeled as mixture of K Gaussians
- $N(\mu_1, \sigma^2 I), \ldots, N(\mu_K, \sigma^2 I)$
- Prior probabilities $\pi_1, \ldots, \pi_K$
Different $\sigma_i$ for every Gaussian i, or even different covariance matrices are possible, but learning becomes harder
- $P(x) = \sum_i P(x \mid N(\mu_i, \sigma^2 I)) \, \pi_i$
- Use EM to learn parameters
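A minimal EM sketch for the simplest case on this slide: K spherical Gaussians sharing one variance. The slide does not give the update equations, so the standard E-step (responsibilities) and M-step (weighted means, priors, pooled variance) are assumed here. The function name, the explicit `init_means` argument, the fixed iteration count, and the small floors that avoid division by zero are all illustrative choices.

```python
from math import exp, log

def sq_dist(x, mu):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def em_spherical_mixture(points, init_means, sigma2=1.0, n_iter=50):
    """Fit priors, means and a shared variance; returns (priors, means, sigma2, resp)."""
    n, d, k = len(points), len(points[0]), len(init_means)
    means = [tuple(m) for m in init_means]
    priors = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: resp[t][i] is proportional to priors[i] * N(x_t | means[i], sigma2 * I).
        # With a shared sigma2 the Gaussian normalizing constant cancels across i.
        resp = []
        for x in points:
            logw = [log(max(priors[i], 1e-300)) - sq_dist(x, means[i]) / (2 * sigma2)
                    for i in range(k)]
            top = max(logw)  # subtract the max before exponentiating, for stability
            w = [exp(v - top) for v in logw]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: re-estimate parameters from the soft assignments.
        mass = [max(sum(r[i] for r in resp), 1e-12) for i in range(k)]
        priors = [m / n for m in mass]
        means = [tuple(sum(r[i] * x[j] for r, x in zip(resp, points)) / mass[i]
                       for j in range(d)) for i in range(k)]
        sigma2 = sum(r[i] * sq_dist(x, means[i])
                     for r, x in zip(resp, points) for i in range(k)) / (n * d)
        sigma2 = max(sigma2, 1e-6)  # floor so the variance cannot collapse to zero
    return priors, means, sigma2, resp

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(em_spherical_mixture(pts, [(0, 0), (10, 10)])[:3])
```

Compared with K-means, each point contributes to every center in proportion to its responsibility instead of being attached to a single closest center.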

61 Analysis of Clustering Data
Statistical Significance of Clusters
- Gene Ontology: http://www.geneontology.org/
- KEGG: http://www.genome.jp/kegg/
Regulatory motifs responsible for common expression
Regulatory Networks
Experimental Verification

62 Evaluating clusters – Hypergeometric Distribution
N experiments, p labeled +, (N-p) labeled -
Cluster: k elements, m labeled +
P-value of single cluster containing k elements of which at least r are +
Prob that a randomly chosen set of k experiments would result in m positive and k-m negative
P-value of uniformity in computed cluster
slide credits: M. Kellis
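The formulas on this slide did not survive in the transcript, so the sketch below assumes the standard hypergeometric form: the probability of exactly m positives in a random draw of k from N items with p positives is C(p, m) C(N-p, k-m) / C(N, k), and the P-value sums this over m >= r. Function names are illustrative.

```python
from math import comb  # Python 3.8+

def hypergeom_pmf(m, N, p, k):
    """P(exactly m positives) when k items are drawn from N, of which p are positive."""
    return comb(p, m) * comb(N - p, k - m) / comb(N, k)

def enrichment_pvalue(r, N, p, k):
    """P(at least r positives) in a random set of k items: the cluster's P-value."""
    return sum(hypergeom_pmf(m, N, p, k) for m in range(r, min(k, p) + 1))

if __name__ == "__main__":
    # Toy numbers, for illustration only: 10 items, 4 labeled +, cluster of size 3.
    print(enrichment_pvalue(2, N=10, p=4, k=3))
```

A small P-value says that a cluster this enriched for "+" labels is unlikely to arise from a random set of the same size.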

