Special Topics in Genomics Lecture 1: Introduction Instructor: Hongkai Ji Department of Biostatistics
Outline of today’s lecture
Introduction to genome and genomics
Topics and tools
Relevance of statistics
DNA
DNAs (deoxyribonucleic acids) are molecules that store the genetic information of a living organism.
DNA consists of two polymers made from four types of nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T).
Purines: A, G; Pyrimidines: C, T
The two polymers are complementary to each other and form a double-helix structure:
5’-ACCGTTCGACGGTAA-3’
   |||||||||||||||
3’-TGGCAAGCTGCCATT-5’
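The base-pairing rule on this slide can be checked with a few lines of code. The following is a minimal sketch; the function names `complement` and `reverse_complement` are illustrative choices, not part of the lecture.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the paired base for every position, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand written in its own 5'->3' direction."""
    return complement(strand)[::-1]


top = "ACCGTTCGACGGTAA"                # 5'->3' strand from the slide
bottom_3_to_5 = complement(top)         # as drawn under the top strand
bottom_5_to_3 = reverse_complement(top)
print(bottom_3_to_5)
print(bottom_5_to_3)
```

The first printed string is the lower strand as drawn on the slide (3’ to 5’); the second is the same strand read in the conventional 5’ to 3’ direction.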
Chromosome
Genome
TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCT
CTCACACCTGACATGAAAAGGCACATGAGGATCCTCAAATACCCCGTGATCAGTCTCAGGGTAGCTCTCATAGCCTGGACA
GGGCCCCCCTCGGGGGTTGCGCCCAGGTCCAGGCGGGGGATGCACAGCAACAGTCACCGAAGCAGAAGCCGTCACAGTGGT
GATGGGCTGGCAGTAGCTGGGCACAGAGCTGCCCATGGCGGTGGACGTTGGGTTCCGAGGGTTGTGAGAACGGGCCCCACG
GGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGG
GGGTGGCCGCAAGCCTTCTCTAGGGGGATCCCTTCGAGGCTGCTGGCCTTGCCGTCCAGGGGACAAGGAGCCAGAGTCCAG
GTGGGGCTGTTGCCGAGGGGTCAAGGGAGGCTGATGTCTGGAGTCCGGATGGACCACCTGCAGAGGAGAGACATAGGTCAA
CACAGGGAGGTAGGATGGTGGTGATGTTCCACCCACAAAAGAAAACCTATTCCTTTAGAAACCTCCAGGATGTGAATCCTG
CCTGCACCTGCACAGCTGGCTGGAGGCATATAGCCACTGCCCATAGATCTCAACTTACCCTCACAACCAACTGCCCCCAGG
CCTAAGTTCTCTGCCTCAAAACTGCCAAGGCCTGGATAGCCAAGAGCCTGGGTGTCTTGGAAATATGCAACCATAAATAGT
AGCTTTTAGAAGTATAAGGCTCCTGTTTCTGGGTCATATTAGTGTTGTTTTCACCTGTCCCCAGCCCTAAGCCAGGTGTGG
CCAGAAGCAAATGTACTGTAAGAGCAGAGCAAAAACTTCCACACAGATAGTTCTGTTAGGCAATACATCTCTGCCTGACTA
TTAGGAATCTGGTTTCTGGGTCCTCTGTACAAAGCTCGGAGCAACACAGTGGCCACATCAATCAAAAGGACCGTGACCAAC
TTCAAAGTCGGTGAGCTTGTACCTATTTTTAGGCTCCTGCTGAACAGAACCAGATTCACACTACAGCTCAGCAGGGCATCG
TCACGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGGGGGGGGGGGTGGACAGAGGACGGGGACACAATT
CACTGGCCAGCCCTTCTCTCCTTCAAGGAAGGCTGCTCTAGCCTGGGACTGGAATACACATTTCCTGTAAACATGGTGGGG
GCCTCAGGCAAGCCAGAGTTTTGGAGCCTTCCTTAACTCTTCAAGGTGAGCATCTTGACTTGGAGGGTGGGGGTGCGGGTA
AGGAAGGAACCTGTGGACTCCTCCCTACAAGACAGAAAAGGAATAAGCCACGAAGACAATAACGATTTTTGTATCAAGCGT
CCTCTCCCATTTCAGCTTACCTGACAATGAAATCAAATTCGGACCCTGCAAGCATCAGTACACCCAGCAGAGTGGACACAG
CACCGTCCAGAACGGGAGCAAACATGTGCTCCAGAGCGAGCATAGCCCTGTGGTTCTTGTCCCCAATGGCTGTCAGAAAGG
CCTGAACAAAGGAGAAAATTGACACGGTCACATTCTGGGTGTGGTAAAGTGCTCAGCTGTGTCTATACTTGGGTTTTGTAT
…
Total amount of DNA in the human genome: $3 \times 10^{9}$ base pairs (bp)
Gene
Central Dogma Gene expression
Topic 1: gene expression and microarray
[Figure: Expression (A, B, C) vs. No Expression (X, Y, Z); expression varies temporally and spatially]
Microarray probe cDNA sample
Microarray data
Topic 2: transcriptional regulation
Transcription factors (TF): TF1, TF2
Transcription factor binding sites (TFBS): CCACCCAC, TAATAAAAT
TTATGTAACCTGCACTTACTACCACCCACAACATAATAAAATCTAAACCACTGAATGAAATACAAAATCTATGTATGA...
[Figure: TF1 and TF2 bound to their sites in the sequence]
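To see where the two binding sites sit in the example sequence, a simple exact-match scan is enough. This is a sketch only; `find_sites` is a hypothetical helper, and real binding sites are usually matched with tolerance for mismatches rather than exactly.

```python
def find_sites(sequence: str, site: str) -> list:
    """Return 0-based start positions of every exact (possibly overlapping) match."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)  # step by 1 to allow overlaps
    return positions


# Sequence and sites copied from the slide.
seq = ("TTATGTAACCTGCACTTACTACCACCCACAACATAATAAAATCTAAACC"
       "ACTGAATGAAATACAAAATCTATGTATGA")

for site in ("CCACCCAC", "TAATAAAAT"):
    print(site, find_sites(seq, site))
```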
Transcription factor binding motif
Sequences bound by the TF:
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG
AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG
Transcription Factor Binding Sites (TFBS):
TGGGTGGTC
TGGGTGGTA
TGGGAGGTC
TGGGTGGTG
TGAGTGGTC
TGGGTGGTC
[Figure: motif built from the aligned sites, with one row per base A, C, G, T]
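The step from aligned sites to a motif can be sketched as a position-specific count matrix: for each column of the alignment, count how often each base appears. The function names and the pseudocount parameter below are illustrative assumptions, not the lecture's notation.

```python
# The six aligned binding sites listed on the slide.
sites = [
    "TGGGTGGTC",
    "TGGGTGGTA",
    "TGGGAGGTC",
    "TGGGTGGTG",
    "TGAGTGGTC",
    "TGGGTGGTC",
]


def count_matrix(aligned_sites, alphabet="ACGT"):
    """Count each base at each column; returns {base: [count per column]}."""
    width = len(aligned_sites[0])
    if any(len(s) != width for s in aligned_sites):
        raise ValueError("all sites must have the same width")
    counts = {base: [0] * width for base in alphabet}
    for site in aligned_sites:
        for column, base in enumerate(site):
            counts[base][column] += 1
    return counts


def frequency_matrix(counts, pseudocount=0.0):
    """Turn counts into per-column frequencies (optionally smoothed)."""
    width = len(next(iter(counts.values())))
    totals = [sum(counts[b][j] + pseudocount for b in counts) for j in range(width)]
    return {b: [(counts[b][j] + pseudocount) / totals[j] for j in range(width)]
            for b in counts}


def consensus(counts):
    """Most frequent base in each column (ties broken by alphabet order)."""
    width = len(next(iter(counts.values())))
    return "".join(max(counts, key=lambda b: counts[b][j]) for j in range(width))


counts = count_matrix(sites)
for base in "ACGT":
    print(base, counts[base])
print("consensus:", consensus(counts))
```

Each row of the printed matrix corresponds to one of the A, C, G, T rows of the motif figure; columns that are all one base are the fully conserved positions.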
Motifs are regulatory codes in the genome TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCT CTCACACCACCCATGTTTTGTTTATGAGGATCCTCAAATACCCCGTGATCAGTCTCAGGGTAGCTCTCATAGCCTGGACAG GGCCCCCCTCGGGGGTTGCGCCCAGGTCCAGGCGGGGGATGCACAGCAACAGTCACCGAAGCAGAAGCCGTCACAGTGGTG ATGGGCTGGCAGTAGCTGGGCACAGAGCTGCCCATGGCGGTGGACGTTGGGTTCCGAGGGTTGTGAGAACGGGCCCCACGG GGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGG GGTGGCCGCAAGCCTTCTCTAGGGGGATCCCTTCGTTGCTGCTGGCCTTGCCGTCCAGGGGACAAGGAGCCAGAGTCCAGG TGGGGCTGTTGCCGAGGGGTCAAGGGAGGCTGATGTCTGGAGTCCGGATGGACCACCTGCAGAGGAGAGACATAGGTCAAC ACAGGGAGGTAGGATGGTGGTGATGTTCCACCCACAAAAGAAAACCTATTCCTTTAGAAACCTCCAGGATGTGAATCCTGC CTGCACCTGCACAGCTGGCTGGAGGCATATAGCCACTGCCCATAGATCTCAACTTACCCTCACAACCAACTGCCCCCAGGC CTAAGTTCTCTGCCTCAAAACTGCCAAGGCCTGGATAGCCAAGAGCCTGGGTGTCTTGGAAATATGCAACCATAAATAGTA GCTTTTAGAAGTATAAGGCTCCTGTTTCTGGGTCATATTAGTTTTGTTTTCACCTGTCCCCACCCATAAGCCAGGTGTGGC CAGAAGCAAATGTACTGTAAGAGCAGAGCAAAAACTTCCACACAGATAGTTCTGTTAGGCAATACATCTCTGCCTGACTAT TAGGAATCTGGTTTCTGGGTCCTCTGTACAAAGCTCGGAGCAACACAGTGGCCACATCAATCAAAAGGACCGTGACCAACT TCAAAGTCGGTGAGCTTGTACCTATTTTTAGGCTCCTGCTGAACAGAACCAGATTCACACTACAGCTCAGCAGGGCATCGT CACGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGGGGGGGGGGGTGGACAGAGGACGGGGACACAATTC ACTGGCCAGCCCTTCTCTCCTTCAAGGAAGGCTGCTCTAGCCTGGGACTGGAATACACATTTCCTGTAAACATGGTGGGGG CCTCAGGCAAGCCAGAGTTTTGGAGCCTTCCTTAACTCTTCAAGGTGAGCATCTTGACTTGGAGGGTGGGGGTGCGGGTAA GGAAGGAACCTGTGGACTCCACCCAACAAGACAGAAAAGGAATAAGCCACGAAGACAATAACGATTTTTGTATCAAGCGTC CTCTCCCATTTCAGCTTACCTGACAATGAAATCAAATTCGGACCCTGCAAGCATCAGTACACCCAGCAGAGTGGACACAGC ACCGTCCAGAACGGGAGCAAACATGTGCTCCAGAGCGAGCATAGCCCTGTGGTTCTTGTCCCCAATGGCTGTCAGAAAGGC CTGAACAAAGGAGAAAATTGACACGGTCACATTCTGGGTGTGGTAAAGTGCTCAGCTGTGTCTATACTTGGGTTTTGTAT Transcription Factor Binding Sites (TFBS) Gene
Gene regulatory network
Transcription factors, other genes
Activation, repression, other interactions
Misregulation → Diseases
Gene1: TACTACCACCCACAACATAATAAAATCTAA (TF1, TF2)
Gene2: TTAATAAAATACCACCCACAACCTAAGGAT (TF1, TF2)
TF3, Gene3
Motif discovery and decoding regulatory programs in the genome
Human language
Text: guesswhatthestoryisaslongasyouknowthelanguageitshouldbeprettyeasy
Dictionary: Know, Guess, Be, …
Decoded: Guess what the story is. As long as you know the language, it should be pretty easy.
Genomic language
Dictionary
GGCCCTGAGCGGTCCCTATTGCTGGGTGGTCAATGCCCTTCATCTGAAATTTC
AAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGG
GGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTC
GGCCCTGAGCGGTCCCTATTGCTGGGTGGTCAATGCCCTTCATCTGGAATTTC
AAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGG
GGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGAAATTTC
[Figure: both the human-language and the genomic-language panels are labeled step1 and step2]
Finding motifs from co-regulated genes (Roth et al., 1998; Hughes et al., 2000; etc.)
[Figure: expression of Gene 1, Gene 2, Gene 3, …, Gene N under Condition1 and Condition2]
Sequences of the co-regulated genes (Gene1, Gene2, Gene3):
GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA
CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA
TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA
Motif discovery is difficult in mammalian genomes due to a low signal-to-noise ratio
Yeast: Gene1, Gene2, Gene3, each with 100~1000 bp
Human: Gene1, Gene2, Gene3, each with 10k~1000k bp
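A rough calculation shows why longer search regions lower the signal-to-noise ratio. The sketch below assumes a uniform, independent background (each base with probability 0.25), a hypothetical motif width of 8, exact matches only, and a single strand; none of these assumptions come from the slide.

```python
def expected_chance_matches(region_length: int, motif_width: int,
                            p_base: float = 0.25) -> float:
    """Expected number of exact matches to one fixed word in random sequence."""
    n_positions = max(region_length - motif_width + 1, 0)
    return n_positions * p_base ** motif_width


# Upper ends of the ranges on the slide: 1000 bp (yeast) vs. 1000k bp (human).
for label, length in [("yeast, 1000 bp", 1_000), ("human, 1000k bp", 1_000_000)]:
    print(label, round(expected_chance_matches(length, 8), 4))
```

Under these assumptions a single true site competes with far fewer than one chance match per gene in a yeast-sized region, but with many chance matches per gene in a human-sized region.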
Topic 3: ChIP-chip and tiling array
ChIP-chip: Chromatin ImmunoPrecipitation coupled with microarray
[Figure: IP vs. no IP; DNA fragments are 500~2000 bp long]
ChIP-chip on tiling arrays
[Figure: IP and control (CT) samples on a tiling array]
Fragments: 500~2000 bp long
Probes: 25~60 bp long, 35~300 bp spacing
A combined approach to study gene regulation ChIP-chip 500~2000bp 6~30bp Sequence Analysis GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC
Topic 4: alternative splicing and exon array splicing gene exon intron promoter transcription start site (TSS)
Alternative splicing
Isoform 1, Isoform 2, Isoform 3
exon 1, exon 2, exon 3, exon 4, exon 5
Exon array
Topic 5: single nucleotide polymorphism and SNP array
SNPs: occur every 100 to 1000 bp; make up 90% of genetic variations; minor allele frequency ≥ 1% (otherwise we call them mutations)
SNP array
Target:
ACCGTGGA[C/T]CTGAACCG
||||||||  |  ||||||||
TGGCACCT[G/A]GACTTGGC
Probes:
ACCGTGGA[G]CTGAACCG
ACCGTGGA[C]CTGAACCG
ACCGTGGA[T]CTGAACCG
ACCGTGGA[A]CTGAACCG
What will happen when the genotype is CC? CT? TT?
Applications:
1. Genotyping & genome-wide association study
2. Copy number variations and loss of heterozygosity
3. Allele specific expression
…
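One way to reason about the genotype question is an idealized model in which a probe gives signal only when it is perfectly complementary to a strand present in the sample. The sketch below makes that assumption explicit; it ignores cross-hybridization and signal intensity, and it assumes each probe captures the lower strand of the target as drawn. The function `probe_signals` is a hypothetical name.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
LEFT, RIGHT = "ACCGTGGA", "CTGAACCG"   # flanking sequence from the slide


def complement(strand: str) -> str:
    """Base-by-base Watson-Crick complement, same left-to-right order."""
    return "".join(PAIR[base] for base in strand)


def probe_signals(genotype: str) -> dict:
    """Map each probe's center base to True if it perfectly pairs with the sample.

    genotype: two letters giving the upper-strand allele on each chromosome,
    e.g. "CC", "CT" or "TT".
    """
    # Lower strands present in the sample, written to align under the probe.
    targets = {complement(LEFT + allele + RIGHT) for allele in genotype}
    signals = {}
    for center in "GCTA":                      # the four probes on the slide
        probe = LEFT + center + RIGHT
        signals[center] = complement(probe) in targets
    return signals


for genotype in ("CC", "CT", "TT"):
    print(genotype, probe_signals(genotype))
```

In this idealized model only the C probe lights up for CC, only the T probe for TT, and both for CT; on a real array the signals are graded intensities rather than on/off values.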
Topic 6: next-generation sequencing Traditional sequencing
Next-generation sequencing
1. Prepare genomic DNA
2. Attach DNA to surface
3. Bridge amplification
4. Fragments become double stranded
5. Denature the double-stranded molecules
6. Complete amplification
7. Determine first base
8. Image first base
9. Determine second base
10. Image second base
11. Sequence reads over multiple cycles
12. Align data
>50 million clusters/flow cell, each 1000 copies of the same template; 1 billion bases per run; 1% of the cost of the capillary-based method. (From:
Array vs. next-generation sequencing
Array                     Next-generation sequencing
Microarray, exon array    RNA-seq
ChIP-chip                 ChIP-seq
SNP array                 SNP/mutation detection by sequencing
…                         …
Other topics Epigenomics Transposon miRNA
Relevance of statistics
Genomics → Statistics: need new statistical theories and tools
Statistics → Genomics: guide development of efficient data analysis strategies
Example 1: differential gene expression
Example 1: multiple testing
Gene: i=1, i=2, i=3, …, i=I
t-statistic: …, -0.5
p-value: …, 0.56
Bonferroni adjustment: …
Rejections: …
Bonferroni adjustment is too stringent.
Multiplicity needs to be adjusted for in order to determine statistical significance: false discovery rate.
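The Bonferroni adjustment mentioned here compares each p-value with alpha divided by the number of tests. A minimal sketch follows; the p-values are made up for illustration (only 0.56 appears on the slide), and `bonferroni_reject` is a hypothetical helper name.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for test i when p_i <= alpha / (number of tests)."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values for five genes (illustration only).
p_values = [0.0001, 0.004, 0.019, 0.03, 0.56]
print(bonferroni_reject(p_values))
```

With thousands of genes the threshold alpha / I becomes very small, which is the sense in which the adjustment is too stringent.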
False discovery rate (FDR, Benjamini & Hochberg, 1995)
            Accept   Reject   Total
True H0     U        V        m0
True H1     T        S        m − m0
Total       m − R    R        m
$\mathrm{FDR} = E(V/R) = \Pr(R>0)\,E(V/R \mid R>0)$, where $V/R$ is taken to be 0 when $R = 0$
$\mathrm{FWER} = \Pr(V \ge 1)$
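The step-up procedure of Benjamini & Hochberg (1995) controls the FDR at level q for independent tests: sort the p-values, find the largest rank k with p_(k) <= k q / m, and reject the k smallest. The sketch below uses made-up p-values and a hypothetical function name.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where H0 is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k = rank                     # keep the largest rank that passes
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Hypothetical p-values for five genes (illustration only).
p_values = [0.0001, 0.004, 0.019, 0.03, 0.56]
print(benjamini_hochberg(p_values))
```

For these made-up values the procedure rejects four hypotheses, whereas comparing each p-value with 0.05 / 5 would reject only the two smallest, which illustrates the gain over a family-wise adjustment.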
Pooling information
[Table: tests 1, …, I, each with its own sample variance (df); the variances are treated as coming from a common distribution, which gives pooled variance estimates and modified t-statistics]
Multiplicity causes problems in controlling type I errors, but it can also be used to improve statistical power!
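One common way to pool information, sketched here under assumptions that are not on the slide, is to shrink each test's sample variance toward a shared prior value: $\tilde{s}^2 = (d_0 s_0^2 + d\,s^2)/(d_0 + d)$, and then use $\tilde{s}$ in the t-statistic. The prior variance `prior_s2` and prior degrees of freedom `prior_df` are hypothetical inputs; in practice they would be estimated from all I tests.

```python
import math


def shrunken_variance(s2, df, prior_s2, prior_df):
    """Weighted average of a test's own variance and the shared prior variance."""
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)


def two_sample_t(mean_diff, s2, n1, n2):
    """Ordinary pooled-variance two-sample t-statistic."""
    return mean_diff / math.sqrt(s2 * (1.0 / n1 + 1.0 / n2))


def modified_t(mean_diff, s2, df, n1, n2, prior_s2, prior_df):
    """t-statistic computed with the shrunken variance in place of s2."""
    s2_tilde = shrunken_variance(s2, df, prior_s2, prior_df)
    return two_sample_t(mean_diff, s2_tilde, n1, n2)


# Hypothetical gene: tiny sample variance from only 3 + 3 replicates.
ordinary = two_sample_t(0.2, 0.01, 3, 3)
moderated = modified_t(0.2, 0.01, df=4, n1=3, n2=3, prior_s2=1.0, prior_df=4)
print(round(ordinary, 3), round(moderated, 3))
```

A gene whose variance is underestimated by chance gets a large ordinary t-statistic; borrowing strength from the other tests pulls it back toward a more plausible value.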
Example 2: motif discovery
S: GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGACTGGGAGGTCCTCGGTTCAGAGTCACAGAGCA
A: motif site locations in S
Θ: parameters of the motif model and the background model, each over A, C, G, T
Inference by iterative estimation/sampling (Gibbs sampler) from $f(A, \Theta \mid S)$
Marginalization: $f(A \mid S) = \int f(A, \Theta \mid S)\, d\Theta$
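A small Gibbs sampler over site locations can illustrate the idea of sampling A with Θ integrated out. The sketch below makes several assumptions not stated on the slide: exactly one site per sequence, a fixed motif width, background frequencies held fixed, and a symmetric pseudocount prior on the motif columns. The function name and parameters are hypothetical.

```python
import random


def gibbs_motif_sampler(seqs, width, n_iter=200, pseudocount=1.0, seed=0):
    """Sample one site start per sequence; returns the final list of starts."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    total = sum(len(s) for s in seqs)
    # Background base frequencies estimated once from all sequences (kept fixed).
    background = {b: (sum(s.count(b) for s in seqs) + pseudocount)
                  / (total + 4 * pseudocount) for b in alphabet}
    # Random initial site in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    denom = (len(seqs) - 1) + 4 * pseudocount
    for _ in range(n_iter):
        for i, seq in enumerate(seqs):
            # Column counts from the current sites of all OTHER sequences.
            counts = [{b: pseudocount for b in alphabet} for _ in range(width)]
            for k, other in enumerate(seqs):
                if k != i:
                    for j, b in enumerate(other[starts[k]:starts[k] + width]):
                        counts[j][b] += 1
            # Weight of each candidate start: motif vs. background likelihood ratio.
            weights = []
            for a in range(len(seq) - width + 1):
                w = 1.0
                for j in range(width):
                    b = seq[a + j]
                    w *= (counts[j][b] / denom) / background[b]
                weights.append(w)
            # Draw the new site for sequence i in proportion to the weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


seqs = [
    "GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA",
    "TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA",
    "CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA",
    "TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG",
    "AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC",
    "ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG",
]
starts = gibbs_motif_sampler(seqs, width=9)
for s, a in zip(seqs, starts):
    print(a, s[a:a + 9])
```

Because the sampler is stochastic, a single short run may or may not land on the sites shown in the lecture; in practice one would run several chains and compare the resulting alignments.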