1
The Bioinformatics of Microarrays
Microarray Outreach Team Fall 2005
2
Outline
Biology, Statistics, Data mining common term definitions
Transcriptome caveats and limitations
Experimental Design
Scan to intensity measures
Low level analysis
Data mining – how to interpret > 6000 measures
Databases
Software
Techniques
Comparing to prior HT studies, across platforms?
Issues
3
Bioinformatics, Computational Biology, Data Mining
Bioinformatics is an interdisciplinary field about the information processing problems in computational biology and a unified treatment of the data mining methods for solving these problems.
Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g.
  Genomes (viruses, bacteria, fungi, plants, insects, …)
  Proteins and Proteomes
  Biological Sequences
  Molecular Function and Structure
Data Mining is searching for knowledge in data:
  Knowledge mining from databases
  Knowledge extraction
  Data/pattern analysis
  Data dredging
  Knowledge Discovery in Databases (KDD)
4
Basic Terms in Biology Example:
The human body contains ~100 trillion cells
Inside each cell is a nucleus (mature red blood cells are an exception: they have none)
Inside the nucleus are two complete sets of the human genome (egg and sperm cells carry only one set)
Each set includes an estimated 20,000-25,000 protein-coding genes (early estimates ranged from 30,000 to 80,000) on the same 23 chromosomes
Gene – A functional hereditary unit that occupies a fixed location on a chromosome, has a specific influence on phenotype, and is capable of mutation.
Chromosome – A DNA-containing linear body of the cell nucleus responsible for determination and transmission of hereditary characteristics
5
Basic Terms in Data Mining
Data Mining: A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produces a particular enumeration of patterns (models) over the data.
Knowledge Discovery Process: The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
A pattern is a conservative statement about a probability distribution.
Webster: A pattern is (a) a natural or chance configuration, (b) a reliable sample of traits, acts, tendencies, or other observable characteristics of a person, group, or institution
6
Problems in Bioinformatics Domain
Data production at the levels of molecules, cells, organs, organisms, populations
Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data, …
Prediction of Molecular Function and Structure
Computational biology: synthesis (simulations) and analysis (machine learning)
7
Subcellular Localization, Provides a simple goal for genome-scale functional prediction
Determine how many of the ~6000 yeast proteins go into each compartment
8
Subcellular Localization, a standardized aspect of function
[Figure: cell compartments: Cytoplasm, Nucleus, Membrane, ER, Extracellular (secreted), Golgi, Mitochondria]
9
"Traditionally" subcellular localization is "predicted" by sequence patterns
[Figure: cell compartments (Cytoplasm, Nucleus, Membrane, ER, Extracellular (secreted), Golgi, Mitochondria) labeled with sequence signals: NLS, TM-helix, HDEL, Import Sig., Sig. Seq.]
10
Subcellular localization is associated with the level of gene expression
[Figure: expression level in copies/cell for each compartment: Cytoplasm, Nucleus, Membrane, ER, Extracellular (secreted), Golgi, Mitochondria]
11
Combine Expression Information & Sequence Patterns to Predict Localization
[Figure: expression level in copies/cell for each compartment (Cytoplasm, Nucleus, Membrane, ER, Extracellular (secreted), Golgi, Mitochondria), together with sequence signals: NLS, TM-helix, HDEL, Import Sig., Sig. Seq.]
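One simple way to combine two kinds of evidence is a naive Bayes calculation, sketched below in Python. This is only an illustration and not necessarily the method behind the slide: the two compartments, the feature names `has_NLS` and `high_expression`, and every probability are invented for the example.

```python
# Illustrative only: all names and probabilities here are hypothetical.
PRIOR = {"nucleus": 0.5, "cytoplasm": 0.5}

# P(feature present | compartment), made-up values
LIKELIHOOD = {
    "has_NLS": {"nucleus": 0.6, "cytoplasm": 0.1},
    "high_expression": {"nucleus": 0.2, "cytoplasm": 0.5},
}


def posterior(features):
    """Combine features with Bayes' rule, treating them as independent."""
    scores = {}
    for compartment, prior in PRIOR.items():
        p = prior
        for name, present in features.items():
            like = LIKELIHOOD[name][compartment]
            p *= like if present else (1.0 - like)
        scores[compartment] = p
    total = sum(scores.values())
    # Normalize so the posterior probabilities sum to 1
    return {compartment: s / total for compartment, s in scores.items()}


# A protein with an NLS and low expression
result = posterior({"has_NLS": True, "high_expression": False})
```

With these invented numbers the nucleus comes out as the more probable compartment, because both the sequence signal and the low expression level point the same way.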
12
The central dogma of molecular biology???
Major Objective: Discover a comprehensive theory of life’s organization at the molecular level
The major actors of molecular biology: the nucleic acids, DeoxyriboNucleic Acid (DNA) and RiboNucleic Acid (RNA)
Proteins are very complicated molecules built from 20 different amino acids.
14
Dynamic Nature of Yeast Genome
eORF = essential
kORF = known
hORF = homology identified
shORF = short
tORF = transposon identified
qORF = questionable
dORF = disabled
The first published sequence claimed 6274 genes, a number that has been revised many times. Why?
15
The Affy detection oligonucleotide sequences are frozen at the time of synthesis. How does this impact downstream data analysis?
16
Microarray Data Process
Experimental Design
Image Analysis – raw data
Normalization – “clean” data
Data Filtering – informative data
Model building
Data Mining (clustering, pattern recognition, et al)
Validation
17
Experimental Design A good microarray design has 4 elements
A clearly defined biological question or hypothesis
Treatment, perturbation and observation of biological materials should minimize systematic bias
Simple and statistically sound arrangement that minimizes cost and gains maximal information
Compliance with MIAME (Minimum Information About a Microarray Experiment)
The goal of statistics is to find signals in a sea of noise
The goal of experimental design is to reduce the noise so signals can be found with as small a sample size as possible
18
Observational Study vs. Designed Experiment
Observational Study - Investigator is a passive observer who measures variables of interest, but does not attempt to influence the responses
Designed Experiment - Investigator intervenes in the natural course of events
What type is our DMSO experiment?
19
Experimental Replicates
Why? In any experimental system there is a certain amount of noise, so even 2 identical processes yield slightly different results
Sources?
To understand how much variation there is, it is necessary to repeat an experiment a number of independent times
Replicates allow us to use statistical tests to ascertain whether the differences we see are real
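As a minimal sketch of the kind of test replicates make possible, the Python below computes Welch's t statistic for one gene measured in two conditions. The choice of Welch's test and the replicate values are assumptions for illustration; they are not taken from the course data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two groups of replicate measurements."""
    var_a, var_b = variance(a), variance(b)  # sample variances (n - 1 denominator)
    return (mean(a) - mean(b)) / sqrt(var_a / len(a) + var_b / len(b))


# Hypothetical log2 expression values for one gene, 3 replicates per condition
control = [8.1, 8.3, 7.9]
treated = [9.0, 9.4, 9.2]
t = welch_t(treated, control)
```

A large absolute value of `t` means the difference between group means is large relative to the replicate-to-replicate noise. With a single measurement per condition the variances cannot be estimated at all, which is the practical argument for replication.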
21
Technical vs. Biological Replicates
As we progress from the starting material to the scanned image, we move from a system dominated by biological effects to one dominated by chemistry and physics noise
Within the Affy platform the dominant variation is usually biological, so the best strategy is to produce replicates as high up the experimental tree as possible
23
From probe level signals to gene abundance estimates
24
From probe level signals to gene abundance estimates
The job of the expression summary algorithm is to take a set of Perfect Match (PM) and Mis-Match (MM) probes, and use these to generate a single value representing the estimated amount of transcript in solution, as measured by that probeset. To do this, .DAT files containing array images are first processed to produce a .CEL file, which contains measured intensities for each probe on the array. It is the .CEL files that are analysed by the expression calling algorithm.
25
PM and MM Probes
The purpose of each MM probe is to provide a direct measure of background and stray signal (perhaps due to cross-hybridisation) for its perfect-match partner. In most situations the signal from each probepair is simply the difference PM - MM. For some probepairs, however, the MM signal is greater than the PM value; we have an apparently impossible measure of background.
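The PM - MM difference, and the problem case where MM exceeds PM, can be sketched in a few lines of Python. The function name and the intensities are hypothetical. MAS5 deals with the problem pairs by adjusting the mismatch value before subtracting; that step is not reproduced here.

```python
def probe_pair_signals(pm, mm):
    """Return PM - MM for each probe pair, plus the indices where MM > PM."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM must have the same length")
    diffs = [p - m for p, m in zip(pm, mm)]
    # A negative difference is the "impossible" background case
    problem_pairs = [i for i, d in enumerate(diffs) if d < 0]
    return diffs, problem_pairs


# Hypothetical intensities for a 4-pair probeset
pm = [1200.0, 950.0, 400.0, 780.0]
mm = [300.0, 220.0, 520.0, 150.0]
diffs, problem_pairs = probe_pair_signals(pm, mm)
# diffs is [900.0, 730.0, -120.0, 630.0]; pair 2 has MM > PM
```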
26
Signal Intensity
Following these calculations, the MAS5 algorithm now has a measure of the signal for each probe in a probeset.
Other algorithms, e.g. RMA, GCRMA and dChip, have been developed by academic teams to improve the precision and accuracy of this calculation.
In our experiment we will use RMA and GCRMA
27
Low level data analysis / pre-processing
Varying biological or cellular composition among sample types.
Differences in sample preparation, labeling or hybridization.
Non-specific cross-hybridization of target to probes.
These lead to systematic differences between individual arrays.
[Diagram: Raw Data, Quality Control, Scaling, Normalization and filtering. Tools: GeneSpring, R language, Bioconductor. People: GMC scientists (Scott, Anjie); GMC scientists + entire UVM outreach team]
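One common way to remove systematic differences between arrays is quantile normalization, which is the normalization step used by RMA. The Python below is a simplified sketch, not the Bioconductor implementation: it works on plain lists, uses invented intensities, and does not average tied values.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length intensity lists, one list per array."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    # For each rank k, average the k-th smallest value across arrays
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices, lowest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]  # replace each value by its rank mean
        normalized.append(out)
    return normalized


# Two hypothetical arrays with three probes each
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
# normalized is [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization every array has the same distribution of values; only the ranking of probes within each array is kept.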
30
Data processing is completed; now what?
41
Overview of Microarray Problem
[Diagram: Biology Application Domain; Validation; Data Analysis; Microarray Experiment; Image Analysis; Data Mining; Experiment Design and Hypothesis; Data Warehouse. Contributing fields: Artificial Intelligence (AI), Knowledge discovery in databases (KDD), Statistics]
42
Back to Biology
Do the changes you see in gene expression make sense BIOLOGICALLY?
How do we know?
If they don’t make sense, can you hypothesize as to why those genes might be changing?
Leads to many, many more experiments
43
The Gene Ontologies
A Common Language for Annotation of Genes from Yeast, Flies and Mice … and Plants and Worms … and Humans … and anything else!
44
Gene Ontology Objectives
GO represents concepts used to classify specific parts of our biological knowledge:
  Biological Process
  Molecular Function
  Cellular Component
GO develops a common language applicable to any organism
GO terms can be used to annotate gene products from any species, allowing comparison of information across species
Notes: GO is the designation of a project as well as the product of the project. Starting with the cellular level, we are not distinguishing cell types, organs, etc. Gene Ontology is a collaboration between the fly (FlyBase), mouse (MGD), and yeast (SGD) genome databases. All three groups had started independent projects to produce controlled vocabularies for the biology of their organisms. You will all be familiar with hierarchical systems to classify enzymes (EC) or functions (YPD, SwissPROT, MIPS, …). We have divided our project into the creation of three ontologies. These are not necessarily hierarchical; rather, they can be a network of associations, a directed acyclic graph (DAG).
Process: cell cycle, nutrient transport, behavior
Function: alcohol dehydrogenase
Cellular Location: organelle, protein complex, subcellular compartment
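The difference between a strict hierarchy and a directed acyclic graph is that a DAG term may have more than one parent. The Python sketch below uses a tiny invented ontology to show this. The term names and the `PARENTS` table are hypothetical and are not real GO entries.

```python
# Hypothetical mini-ontology: each term maps to its list of parent terms.
PARENTS = {
    "biological_process": [],
    "cell cycle": ["biological_process"],
    "transport": ["biological_process"],
    "nutrient transport": ["transport"],
    # A term with two parents: allowed in a DAG, impossible in a simple tree
    "cell cycle-regulated nutrient transport": ["cell cycle", "nutrient transport"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


found = ancestors("cell cycle-regulated nutrient transport")
```

A gene product annotated to the two-parent term is implicitly annotated to every term in `found`, which is what lets a query on a broad term pick up genes annotated to more specific ones.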
45
Srinija Srinivasan, Chief Ontologist, Yahoo!
The ontology. Dividing human knowledge into a clean set of categories is a lot like trying to figure out where to find that suspenseful black comedy at your corner video store. Questions inevitably come up, like are Movies part of Art or Entertainment? (Yahoo! lists them under the latter.) -Wired Magazine, May 1996
46
The 3 Gene Ontologies
Molecular Function = elemental activity/task
  the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
Biological Process = biological goal or objective
  broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
Cellular Component = location or complex
  subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
47
Example: Gene Product = hammer
Function (what)              Process (why)
Drive nail (into wood)       Carpentry
Drive stake (into soil)      Gardening
Smash roach                  Pest Control
Clown’s juggling object      Entertainment
Notes: Stakes into vampires. Nails into wood is reversible with this model of hammer: you can remove the nails with the claw. Cytoskeletal structural protein. Storage protein.
WHY -- Process
WHAT -- Function
48
Biological Examples
[Figure: two example annotations, each with panels for Biological Process, Molecular Function, and Cellular Component]
49
Terms, Definitions, IDs
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:
definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.
definition_reference: PMID:
comment: This term was made obsolete because it is a gene product specific term. To update annotations, use the biological process term 'signal transduction during conjugation with cellular fusion ; GO: '.
definition: MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces
50
SGD
53
SGD public microarray data sets available for public query
54
Homework
Go to and find 3 candidate genes of known function and one of undefined function that you might predict to be altered by DMSO treatment
What GO biological processes and molecular mechanisms are associated with your candidate genes?
Where in the cell does the protein reside?
What other proteins are known or inferred to interact with yours? How was this interaction determined? Is this a genetic or physical interaction?
Find the expression of at least one of your known genes in another publicly deposited microarray data set. What is the name of the data set and how did you find it? What is the largest fold change observed for this gene in the public study?
Now that you are microarray technology experts, can you give me 3 reasons why the observed transcript level difference may not be confirmed through a second technology like RT-qPCR?
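For the fold change question, a minimal Python sketch of the usual calculation follows. The function name and the intensities are hypothetical, and public data sets may already report values as log ratios, so check how each study defines its numbers before comparing.

```python
from math import log2


def fold_change(treated, control):
    """Return (ratio, log2 ratio) of treated over control signal."""
    if treated <= 0 or control <= 0:
        raise ValueError("signals must be positive to take a ratio and a log")
    ratio = treated / control
    return ratio, log2(ratio)


# Hypothetical normalized intensities for one gene
up = fold_change(400.0, 100.0)    # 4-fold up, log2 ratio of 2
down = fold_change(50.0, 100.0)   # 2-fold down, log2 ratio of -1
```

On the log2 scale a 2-fold increase and a 2-fold decrease are symmetric (+1 and -1), which is one reason microarray results are usually reported that way.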
55
Suggested Reading