DNA, Gene, and Genome
Translating Machinery for Genetic Information
Transcription factors mRNA levels
Automated DNA Sequencing
Data Increase (from NCBI web site)
Partial Display of Human Draft Sequence (Nature, 2001)
Human Genome Map at NCBI
MGALRPTLLPPSLPLLLLLMLGMGCWAREVLVPEGPLYRVAGTAVSISCNVTGY
EGPAQQNFEWFLYRPEAPDTALGIVSTKDTQFSYAVFKSRVVAGEVQVQRLQGD
AVVLKIARLQAQDQGIYECTPSTDTRYLGSYSGKVELRVLPDVLQVSAAPPGPR
GRQAPTSPPRMTVHEGQELALGCLARTSTQKHTHLAVSFGRSVPEAPVGRSTLQ
EVVGIRSDLAVEAGAPYAERLAAGELRLGKEGTDRYRMVVGGAQAGDAGTYH
CTAAEWIQDPDGSWAQIAEKRAVLAHVDVQTLSSQLAVTVGPGERRIGPGEPLE
LLCNVSGALPPAGRHAAYSVGWEMAPAGAPGPGRLVAQLDTEGVGSLGPGYE
GRHIAMEKVASRTYRLRLEAARPGDAGTYRCLAKAYVRGSGTRLREAASARSR
PLPVHVREEGVVLEAVAWLAGGTVYRGETASLLCNISVRGGPPGLRLAASWWV
ERPEDGELSSVPAQLVGGVGQDGVAELGVRPGGGPVSVELVGPRSHRLRLHSL
GPEDEGVYHCAPSAWVQHADYSWYQAGSARSGPVTVYPYMHALDTLFVPLL
VGTGVALVTGATVLGTITCCFMKRLRKR
KDa Protein interacting with prostate cancer suppressor
Molecular biology databases
Sequence databases
– Annotated
– Low-annotation
– Specialized
Structural databases
Motif databases
Genome databases
Proteome databases
RNA expression
Literature
Populations
Mutations
Polymorphisms
Organisms
Pathways
Promoters, ESTs, Tissues and cells, Genome maps, DNA sequences, Molecular Phylogeny, Protein sequences, Protein structures, DNA motifs, Protein motifs, Substrates, Metabolic pathways, Transcription Factors, RNA expression, Mutations/polymorphisms, Gene Family
Database formats
Relational databases
– GDB, GSDB, MGD etc.
– Vendors: Sybase, Oracle etc.
Flat file databases
– GenBank, SWISS-PROT etc.
Object-oriented databases
– ACeDB, AtDB etc.
Molecular biology data types: Organisms, Genome maps. Figure: Mouse chromosome X from the Mouse Genome Informatics project
Molecular biology data types: Organisms, Genome maps, DNA sequences, RNA sequences. Example: ...AATGGTACCGATGACCTGGAGCTTGGTTCGA...
Molecular biology data types: Organisms, Genome maps, DNA sequences, RNA sequences, Protein sequences. Example: ...TRLRPLLALLALWPPPPARAFVNQHLCGSHLVEA...
Molecular biology data types: Organisms, Genome maps, DNA sequences, RNA sequences, Protein sequences, Protein structures, RNA structures. Figure: PDB entry 1CIS (P. Osmark, P. Sorensen, F.M. Poulsen)
Molecular biology data types: Organisms, Genome maps, DNA sequences, RNA sequences, Protein sequences, Protein structures, DNA motifs, Protein motifs, RNA expression, RNA structures
DNA microarrays measure variations in RNA levels
The full yeast genome on a chip (DeRisi et al., Science 278:680)
Red dots: genes whose RNA level increased
Green dots: genes whose RNA level decreased
Substrates for High Throughput Arrays
Nylon Membrane: single label (P 33)
Glass Slides: dual label (Cy3, Cy5)
GeneChip: single label (biotin, streptavidin)
GeneChip® Probe Arrays
GeneChip Probe Array: 1.28 cm, >200,000 different complementary probes
Hybridized Probe Cell: 24 µm, millions of copies of a specific oligonucleotide probe
Target: single-stranded, labeled RNA
Figure: Image of Hybridized Probe Array
GeneChip® Expression Array Design
Gene sequence (5´ to 3´), covered by multiple oligo probes
Probes designed to be Perfect Match
Probes designed to be Mismatch
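The perfect-match / mismatch pairing can be illustrated with a short sketch. It assumes 25-base probes and a mismatch probe made by complementing the central (13th) base; that is the commonly described GeneChip convention, but the slide itself does not state either detail. For simplicity the probes are written in the sense of the gene sequence rather than as its complement, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def perfect_match_probes(gene_sequence, probe_length=25, step=5):
    """Tile perfect-match probes along the gene sequence, 5' to 3'.

    probe_length and step are illustrative assumptions, not values from the slide.
    """
    last_start = len(gene_sequence) - probe_length
    return [gene_sequence[i:i + probe_length]
            for i in range(0, last_start + 1, step)]


def mismatch_probe(pm_probe):
    """Return the mismatch partner: same probe with the central base complemented."""
    middle = len(pm_probe) // 2  # index 12 for a 25-mer, i.e. the 13th base
    return pm_probe[:middle] + COMPLEMENT[pm_probe[middle]] + pm_probe[middle + 1:]


if __name__ == "__main__":
    sequence = "AATGGTACCGATGACCTGGAGCTTGGTTCGA"
    for pm in perfect_match_probes(sequence, probe_length=25, step=3):
        print(pm, mismatch_probe(pm))
```

The difference between the perfect-match and mismatch signals is what lets the array separate specific hybridization from background.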
Procedures for Target Preparation
Cells
→ Poly (A)+ / Total RNA
→ cDNA
→ IVT (Biotin-UTP, Biotin-CTP)
→ Labeled transcript
→ Fragment (heat, Mg2+)
→ Labeled fragments
→ Hybridize (16 hours)
→ Wash & Stain
→ Scan
Microarray Technology
Printing Arrays on 50 slides (NSF Soybean Functional Genomics, Steve Clough / Vodkin Lab)
Ratio of expression of genes from two sources (NSF / U of Illinois Microarray Workshop, Steve Clough / Vodkin Lab)
Cells from condition A; Cells from condition B
→ Total or mRNA
→ cDNA
→ Label Dye 1; Label Dye 2
→ Mix
→ Result per gene: equal, over, under
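The equal / over / under call for each spot can be sketched as a log ratio of the two dye intensities. This is a minimal illustration: the two-fold cutoff (a log2 ratio of 1.0) and the function names are assumptions, not values given on the slide.

```python
import math


def log2_ratio(dye1_intensity, dye2_intensity):
    """Log2 of the dye 2 signal over the dye 1 signal for one spot."""
    return math.log2(dye2_intensity / dye1_intensity)


def call_expression(log_ratio, cutoff=1.0):
    """Classify a spot as 'over', 'under', or 'equal'.

    cutoff=1.0 means a two-fold change; this threshold is an assumption.
    """
    if log_ratio >= cutoff:
        return "over"
    if log_ratio <= -cutoff:
        return "under"
    return "equal"


if __name__ == "__main__":
    for dye1, dye2 in [(100.0, 400.0), (400.0, 100.0), (100.0, 120.0)]:
        ratio = log2_ratio(dye1, dye2)
        print(dye1, dye2, round(ratio, 3), call_expression(ratio))
```

Working on the log scale makes a two-fold increase and a two-fold decrease symmetric around zero.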
GSI Lumonics (NSF Soybean Functional Genomics, Steve Clough / Vodkin Lab)
Cattle and Soy Controls
Spots: Beta Actin, PKG, HPRT, Beta 2 microglobulin, Rubisco, AB binding protein, Major latex protein homologue (MSG)
Array of cattle and soy spiking controls. 50 µg of cattle brain total RNA was labeled with Cy3 (green). 1 µl each of in vitro transcribed soy Rubisco (5 ng), AB binding protein (0.5 ng) and MSG (0.05 ng) were labeled with Cy5. The two labeled samples were cohybridized on superamine slides (Telechem, Inc.). To the right of each set of spots are five negative controls (water).
Fetal Spleen (Cy3) vs. Adult Spleen (Cy5)
Labeled spots: IgM, IgM heavy chain, MYLK, COL1A2
Placenta vs. Brain – 3800 Cattle Placenta Array (Cy3, Cy5)
GenePix Image Analysis Software
Microarray Data Process
1. Experimental Design
2. Image Analysis – raw data
3. Normalization – “clean” data
4. Data Filtering – informative data
5. Model building
6. Data Mining (clustering, pattern recognition, et al)
7. Validation
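Steps 3 and 4 can be sketched in a few lines. The example below uses global median centering of log2 ratios and a minimum-intensity filter; both are common choices but are assumptions here, since the slide does not say which normalization or filter was used. The spot tuples, the 200.0 threshold, and the function name are hypothetical.

```python
import math
import statistics


def normalize_and_filter(spots, min_intensity=200.0):
    """Median-center log2 ratios, then drop dim spots.

    spots: list of (gene, dye1_intensity, dye2_intensity) tuples.
    Returns {gene: normalized log2 ratio} for spots bright in both channels.
    """
    # Step 3 (normalization): center all log ratios on the array median,
    # assuming most genes do not change between the two samples.
    raw = [math.log2(dye2 / dye1) for _, dye1, dye2 in spots]
    center = statistics.median(raw)

    # Step 4 (filtering): keep only spots whose signal is above the
    # assumed intensity floor in both channels.
    informative = {}
    for (gene, dye1, dye2), value in zip(spots, raw):
        if min(dye1, dye2) >= min_intensity:
            informative[gene] = value - center
    return informative


if __name__ == "__main__":
    example = [("g1", 1000.0, 2000.0), ("g2", 1000.0, 4000.0),
               ("g3", 1000.0, 8000.0), ("g4", 50.0, 100.0)]
    print(normalize_and_filter(example))
```

The median is computed over all spots before filtering, so a few dim spots still contribute to the estimate of the array-wide center; filtering first would be an equally defensible design.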
Scatterplot of Normalized Data (axes: Adult, Fetal)
> 0.3, < -0.3
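One reading of these two cutoffs is a filter on normalized log ratios: genes above 0.3 are treated as increased and genes below -0.3 as decreased. The slide does not state the scale; if the values are log10 ratios, 0.3 corresponds to roughly a two-fold change. The sketch below works under that assumption, and the function name is hypothetical.

```python
def select_changed(normalized_log_ratios, cutoff=0.3):
    """Split genes into increased (> cutoff) and decreased (< -cutoff).

    normalized_log_ratios: {gene: log ratio}; the log scale is an assumption.
    """
    increased = {g: v for g, v in normalized_log_ratios.items() if v > cutoff}
    decreased = {g: v for g, v in normalized_log_ratios.items() if v < -cutoff}
    return increased, decreased


if __name__ == "__main__":
    ratios = {"a": 0.45, "b": -0.31, "c": 0.10, "d": 0.30}
    up, down = select_changed(ratios)
    print("increased:", up)
    print("decreased:", down)
```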
Complexity Levels of Microarray Experiments:
1. Compare genes in a control situation versus a treatment situation
   Example: Is the level of expression (up-regulated or down-regulated) significantly different in the two situations? (drug design application)
   Methods: t-test, Bayesian approach
2. Find multiple genes that share common functionalities
   Example: Which related genes depend on one another?
   Methods: Clustering (hierarchical, k-means, self-organizing maps, neural network, support vector machines)
3. Infer the underlying gene and protein networks that are responsible for the patterns and functional pathways observed
   Example: What is the gene regulation at system level?
   Directions: mining regulatory regions, modeling regulatory networks on a global scale
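For the first level, the t-test on a single gene can be sketched as follows. The example uses Welch's unequal-variance form, which is one common variant; the slide does not say which form is intended. The replicate values and the function name are hypothetical.

```python
import math
import statistics


def welch_t(control, treatment):
    """Welch's t statistic and approximate degrees of freedom for one gene.

    control, treatment: replicate expression values (at least two each,
    and not all identical, otherwise the variance term is zero).
    """
    mean_c = statistics.mean(control)
    mean_t = statistics.mean(treatment)
    # Squared standard errors of the two group means.
    se2_c = statistics.variance(control) / len(control)
    se2_t = statistics.variance(treatment) / len(treatment)

    t = (mean_t - mean_c) / math.sqrt(se2_c + se2_t)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_c + se2_t) ** 2 / (
        se2_c ** 2 / (len(control) - 1) + se2_t ** 2 / (len(treatment) - 1)
    )
    return t, df


if __name__ == "__main__":
    t_value, dof = welch_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
    print(round(t_value, 4), round(dof, 2))
```

Turning the statistic into a p-value requires the t distribution with the returned degrees of freedom. With thousands of genes tested at once, some correction for multiple testing is normally applied as well.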
Comparing data from two experiments.
Conditions: NO DRUG, 1nM Drug, 1 M Drug
Statistical filters used: the genes present (Presence Call in Affymetrix) in drug treated, ANOVA p<0.02 between groups. Red indicates increased expression, and green is decreased expression (Log(fold change)). Genesight 3 (Biodiscovery Software). Clustering to extract genes which tightly co-express.
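A simple way to look for genes that tightly co-express is to compare expression profiles across the conditions with Pearson correlation. The sketch below is a stand-in for illustration only: it is not the clustering method used by the software named on the slide, and the 0.9 cutoff, the example profiles, and the function names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def coexpressed_genes(profiles, seed_gene, min_r=0.9):
    """Genes whose profile correlates with the seed gene at or above min_r."""
    seed = profiles[seed_gene]
    return sorted(
        gene for gene, profile in profiles.items()
        if gene != seed_gene and pearson(seed, profile) >= min_r
    )


if __name__ == "__main__":
    # Hypothetical log fold changes across three conditions.
    profiles = {
        "geneA": [0.0, 1.0, 2.0],
        "geneB": [0.1, 1.1, 2.0],
        "geneC": [2.0, 1.0, 0.0],
        "geneD": [0.0, 2.0, 4.0],
    }
    print(coexpressed_genes(profiles, "geneA"))
```

Correlation ignores the magnitude of change, so geneD groups with geneA even though its fold changes are twice as large; a Euclidean distance would separate them.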
Conditions: NO DRUG, 1nM Drug, 1 M Drug
Statistical filters used: the genes present (Presence Call in Affymetrix) in absence of drug, ANOVA p<0.02 between groups.
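The ANOVA filter compares, for each gene, the variation between the treatment groups with the variation among replicates within each group. A minimal sketch of the one-way F statistic follows; the replicate values and the function name are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one gene.

    groups: list of lists, one list of replicate values per condition.
    Returns (F, df_between, df_within).
    """
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of replicates around their own group mean.
    ss_within = sum(
        (v - mean) ** 2
        for group, mean in zip(groups, group_means)
        for v in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    no_drug, low_dose, high_dose = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]
    print(anova_f([no_drug, low_dose, high_dose]))
```

Applying the p<0.02 cutoff requires converting F to a p-value with the F distribution on the two returned degrees of freedom; a statistics library such as SciPy (`scipy.stats.f_oneway`) returns the statistic and the p-value together.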
Self Organizing Maps
Molecular Classification of Cancer
Gene Expression Profile of Aging and Its Retardation by Caloric Restriction
Cheol-Koo Lee, Roger G. Klopp, Richard Weindruch, Tomas A. Prolla
Data Mining Methods
Classification, Regression (Predictive Modeling)
Clustering (Segmentation)
Association Discovery (Summarization)
Change and deviation detection
Dependency Modeling
Information Visualization