1 The Biology, Technology and Statistical Modeling of High-throughput Genomics Data Naomi Altman Dept. of Statistics Penn State U. May 25, 2010.


2 DNA 100 A Statistician’s Simplification Every cell has the same genetic material, stored in the double helix of DNA. The "backbone" is the support of the ladder. The rungs are "base pairs". Each pair consists of 2 bound nucleotides (bases), which are designated C, G, A, T. C binds only to G. A binds only to T. In a diploid organism, most cells have 2 copies of each chromosome. RC/VL/GG/chromosome.html AMonksFlourishingGarden/

3 DNA replication When the cell divides, the DNA is replicated by breaking the hydrogen bonds between the paired bases, and rebuilding the double helix by creating a new backbone and new partner bases on each strand. The molecules making up the backbone are asymmetric, creating a direction along the backbone. One end is called the "3' end". The other end is the "5' end". Each new strand is always built from its 5' end to its 3' end. A complex suite of proteins is involved in duplication. /~ballardh/pbio475/Heredity /Heredity.htm
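
The pairing rules (C with G, A with T) and the strand direction can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the real process involves the protein machinery mentioned above.

```python
# Pairing rules from the slides: C binds only to G, A binds only to T.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Partner base for each position, written in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)

def reverse_complement(strand):
    """Partner strand written in its own 5'-to-3' direction."""
    return complement(strand)[::-1]

print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

The reversal reflects the fact that the two strands of the helix run in opposite directions.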

4 Transcription and Translation To make a protein: The DNA unzips. mRNA is assembled by pairing with the exposed bases on the template (anti-sense) strand, so its sequence matches the coding (sense) strand, with U in place of T. mRNA goes to the ribosome, where its codons are matched to amino acids brought to the ribosome by the tRNA. Each codon, a set of 3 nucleotides, encodes 1 amino acid. The complete linear set of amino acids defines the protein. The protein folds into its active shape. AMonksFlourishingGarden/
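
A minimal Python sketch of these two steps, assuming we start from the coding (sense) strand, whose sequence the mRNA copies with U in place of T. The codon table below is a small subset of the standard genetic code, chosen only to cover the made-up example sequence.

```python
# Subset of the standard genetic code (illustration only, not the full table).
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UGG": "Trp",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def transcribe(coding_strand):
    """mRNA carries the coding-strand sequence with U in place of T."""
    return coding_strand.replace("T", "U")

def translate(mrna):
    """Read the mRNA three nucleotides (one codon) at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("ATGTTTGGCTGGTAA")  # made-up sequence
print(mrna)             # AUGUUUGGCUGGUAA
print(translate(mrna))  # ['Met', 'Phe', 'Gly', 'Trp']
```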

5 Transcription Transcription factors bind to the promoter and recruit RNA polymerase. DNA strands separate and transcription is initiated. Transcription continues, reading the template strand in the 3'-5' direction (so the RNA grows 5'-3'), until a termination signal is reached. The completed RNA strand is released for post-processing.

6 Introns and Exons In "higher" organisms, the gene contains noncoding regions, called introns, and coding regions, called exons. The introns are spliced out of the transcribed RNA before translation into protein. "Splicing variants" can be formed by the cell selecting combinations of the exons. The resulting spliced strand is the mRNA. We can "predict" exons using statistical algorithms, but the gold standard is that only exons match mRNA sequences. ology_124/Summaries/T&T.html Chromosome promoter
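
Splicing can be pictured as keeping some intervals of the transcript and discarding the rest. The sketch below uses a made-up transcript and hypothetical exon coordinates (start inclusive, end exclusive); it only illustrates how one gene can yield more than one mRNA.

```python
def splice(transcript, exons):
    """Join the chosen exon intervals (start, end) in order; introns are dropped."""
    return "".join(transcript[start:end] for start, end in exons)

# Made-up transcript: exon, intron, exon, intron, exon (6 nucleotides each).
transcript = "AUGGCC" + "GUAAGU" + "UUUCCC" + "GUGAGA" + "GGAUAA"
exons = [(0, 6), (12, 18), (24, 30)]

full_mrna = splice(transcript, exons)
variant_mrna = splice(transcript, [exons[0], exons[2]])  # middle exon skipped

print(full_mrna)     # AUGGCCUUUCCCGGAUAA
print(variant_mrna)  # AUGGCCGGAUAA
```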

7 DNA 100 A Statistician’s Simplification DNA is complicated stuff. Protein-coding regions are called genes. There are also other functional parts to the DNA, some of which code for RNA and some of which are regulatory regions - i.e. they help control how the coding regions are used - e.g. promoters. The supercoiling of the DNA may also control how the coding regions are used. As well, there is a lot of DNA which appears to be "junk" - i.e. to date no function is known. But we keep making new discoveries - e.g. some of the "junk" codes for small RNA pieces that are functional.

8 DNA 100 A Statistician’s Simplification An allele is a variant of a gene - e.g. "blood type" A, B, O in humans. If a gene has 2 or more alleles, it is said to be polymorphic. A Single Nucleotide Polymorphism (SNP) means that 2 individuals from the same species have a difference in one nucleotide at some location in their DNA. (e.g. a C in one person, and a T in the other). SNPs are very useful for determining the genotype of an organism and for tracing evolution of proteins.
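
As a toy illustration of a SNP, the sketch below compares two already-aligned sequences of equal length and reports the positions where they differ. The sequences are made up, and real SNP calling has to deal with alignment and sequencing error, which this ignores.

```python
def snp_positions(seq_a, seq_b):
    """Return (index, base_a, base_b) wherever two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

person_1 = "ATGCCGTA"  # made-up sequence
person_2 = "ATGTCGTA"  # same region, with a T where person_1 has a C
print(snp_positions(person_1, person_2))  # [(3, 'C', 'T')]
```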

9 DNA 100 A Statistician’s Simplification A key step in microarray technology is reverse transcription: going from mRNA to DNA with the introns excised. This is called cDNA. At the 5' and 3' ends of the cDNA are the regulatory regions called the "UnTranslated Regions" or UTRs. The 5' UTR is functional and evolves very slowly. The 3' UTR is less functional and hence evolves more rapidly. It can be used to distinguish closely related genes.

10 DNA 100 A Statistician’s Simplification DNA persists in the cell, and is the cell's memory device. mRNA and proteins do not persist in the cell and are degraded with components recycled. Degradation is part of cell regulation. Cells degrade both imperfect compounds and those no longer needed. Understanding cellular processes is complicated by our inability to follow the synthesis and degradation processes in single cells - so we are actually seeing the average over many cells which may be at somewhat different stages.

11 DNA 100 A Statistician’s Simplification The function of each cell is determined by which proteins it produces. Our objective will be to measure which proteins are produced, either directly (by measuring and identifying proteins) or indirectly (by measuring mRNA). It is easier to measure mRNA than protein, but due to degradation, the correlation between mRNA levels and protein levels is imperfect. In fact, in some cases, the mRNA may not actually produce any protein. In some cases we will measure the genomic DNA directly - usually to look for differences among alleles.

12 PCR Polymerase Chain Reaction (Figure labels: DNA, primer, denature (by high temperature), new strand.) PCR allows us to greatly amplify any selected piece of DNA. Selection is done by choice of primer. This allows us to detect small quantities of DNA. Labels can be added to the new strands by attaching chemicals to the free nucleotides (C, G, A, T).
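
To see why PCR amplifies so strongly, here is a small sketch of an idealized growth model: each cycle multiplies the number of copies by (1 + efficiency), where an efficiency of 1 means perfect doubling. The model and the function name are assumptions for illustration; real reactions eventually level off.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a number of cycles: N0 * (1 + E) ** cycles."""
    return initial_copies * (1.0 + efficiency) ** cycles

# Ten starting molecules, 30 cycles of perfect doubling.
print(pcr_copies(10, 30))       # 10737418240.0
# Same start with 90% per-cycle efficiency gives noticeably fewer copies.
print(pcr_copies(10, 30, 0.9))
```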

13 Electrophoresis Maryam Ahmed Khan February 14, 2001 The PCR product is put through an electrophoresis gel to determine presence/absence of the DNAs targeted by the primer

14 PCR Polymerase Chain Reaction PCR is used directly to amplify genes. It is mainly used to detect alleles - i.e. variants of a gene that can be used to, e.g., identify individuals (e.g. DNA fingerprinting), identify subpopulations (e.g. tracking ivory poaching), or determine which variants are associated with a condition (e.g. drug efficacy).

15 RT-PCR Reverse Transcription PCR (Figure labels: primer, DNA, mRNA in the cell, mRNA, cDNA, primer, cDNA.) RT-PCR is used to identify genes which are expressed in the tissue or to create a cDNA library.

16 cDNA Library and ESTs A cDNA library is a means of storing specific genes or gene fragments. The library is actually a set of "wells" containing living cells with plasmids containing the cDNA. Often a cDNA library for a tissue is partially sequenced, to obtain Expressed Sequence Tags (ESTs), short pieces of sequenced DNA which can be used to identify which genes are expressed in the tissue. (There is a lot of computation involved in compiling ESTs into gene sequences, which is called assembly.) fig.cox.miami.edu/~cmallery/150/gene/sf16x5.jpg

17 PCR Methods for Measuring Gene Expression PCR is considered the gold standard for detecting and measuring gene expression. Detection is "simple". A label (radioactive or dye) can be added during the PCR reaction. After several cycles, if the label is bound, then the PCR target must be present.

18 Quantitative PCR Methods for Measuring Gene Expression Because each cycle of PCR requires the denaturation step, the number of PCR cycles is under experimental control. Hence, the quantity of PCR product at the end of some number of cycles can be used to estimate the initial quantity. The estimate is usually improved by also amplifying a "control" product with "known" initial quantity. Quantitative PCR uses only the measured quantity at the final step of a preset number of cycles. Real time PCR uses a label that binds only to double stranded DNA, and measures the quantity at the end of each cycle. This provides a curve giving the label intensity versus the number of cycles, which can be extrapolated back to the initial point. This method is more accurate but much more expensive.
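
One way to picture the extrapolation back to the initial point: during the exponential phase, log2(intensity) is roughly linear in the cycle number, so a fitted line can be read off at cycle 0. The sketch below does this with an ordinary least-squares fit on synthetic, noise-free data; the function name and the numbers are assumptions for illustration, not the method of any particular instrument.

```python
import math

def estimate_initial(cycles, intensities):
    """Fit log2(intensity) = intercept + slope * cycle by least squares.

    Returns (estimated initial quantity, estimated per-cycle efficiency).
    """
    ys = [math.log2(v) for v in intensities]
    n = len(cycles)
    mean_x = sum(cycles) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 2 ** intercept, 2 ** slope - 1

# Synthetic data: 5 starting units, perfect doubling, measured at cycles 10 to 14.
cycles = [10, 11, 12, 13, 14]
intensities = [5 * 2 ** c for c in cycles]
n0, efficiency = estimate_initial(cycles, intensities)
print(round(n0, 6), round(efficiency, 6))  # approximately 5 and 1
```

With real data only the cycles in the exponential phase should be used, since early cycles are below the detection limit and late cycles flatten out.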

19 Real Time RT-PCR (from the PSU Nucleic Acid Facility) A probe is designed to anneal to the target sequence between mRNA and cDNA primers. The probe is labeled at the 5' end with a reporter fluorochrome, and a quencher fluorochrome is added at any T position or at the 3' end. The amount of fluorescence released during the amplification cycle is proportional to the amount of product generated in each cycle. The software calculates the threshold cycle (CT) for each reaction, which has a linear relationship to the logarithm of the amount of starting DNA or RNA. Up to 96 samples are run simultaneously, so the relative fluorescence corresponds to the relative quantity of mRNA initially present.
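
The relationship between CT and starting quantity can be sketched as follows. If every cycle multiplies the product by (1 + E), a sample that reaches the threshold k cycles earlier started with (1 + E) ** k times as much material. The function below is a hypothetical helper under that idealized assumption, not the facility's software.

```python
def relative_quantity(ct_sample, ct_reference, efficiency=1.0):
    """Starting quantity of the sample relative to the reference.

    Assumes both reactions amplify with the same per-cycle efficiency.
    """
    return (1.0 + efficiency) ** (ct_reference - ct_sample)

# Sample crosses the threshold 3 cycles before the reference:
print(relative_quantity(22.0, 25.0))  # 8.0
```

Taking log2 of both sides shows why CT is linear in the logarithm of the starting amount rather than in the amount itself.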

20 Northern Blot This is another "1-at-a-time" RNA detection method amenable to quantification - the "old" gold standard.

21 Microarrays A microarray is a glass or plastic slide on which are printed 1000's of single strands of cDNA. RT is used to create single strand labeled cDNA from the mRNA of a tissue. The cDNA binds only to the complementary strand on the slide. Dye intensity for each "spot" is proportional to the concentration of matching cDNA. The intensity is summarized by a scanning microscope, which detects the "spots".

22 What is a microarray probe? A probe is a spot on an array representing a gene or part of a gene. On “cDNA” arrays, the probes are actual pieces of cDNA originally extracted from a cell. We may not know the genetic sequence of a cDNA.

23 What is a microarray probe? If we know the genetic sequence of the cDNA, we can artificially synthesize a strand of DNA with the same sequence. This is called an oligo(nucleotide). Oligos may be “spotted” on the array like cDNA or may be synthesized on the array by one of several technologies.

24 cDNA versus Oligos cDNAs have different hybridization properties due to their biochemistry. Oligos may be chosen to have similar hybridization properties, and either to represent maximally unique parts of genes or to represent common domains.

25 cDNA versus Oligos cDNAs are maintained in cDNA libraries which are expensive to maintain and may be mislabeled or contaminated. Oligos are synthesized from genomic sequence information which can be subject to error.

26 Spotted 2-Channel Array Spotted arrays are printed on coated microscope slides. 2 RNA samples are converted to cDNA. Each is labelled with a different dye.

27 "Spotted" arrays The spot material may be a cDNA, or an oligo - generally codons long. Some commercial arrays use only a single dye. "Spotted" refers to the print technology. Arrays with similar format may have oligos synthesized directly on the array surface.

28 Affymetrix Array Each gene is represented by a “probe set”. Each “probe set” is pairs of oligos. Each oligo is 25 nucleotides. A PM (perfect match) probe matches a strand of cDNA. The corresponding MM (mismatch) probe differs from the PM by a change in the central nucleotide. The probe pairs are spatially dispersed. Control probes are printed.
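
The PM/MM relationship can be sketched in Python. The sketch assumes that the "change in the central nucleotide" means replacing the 13th of the 25 bases with its complement; the slide itself only says the central nucleotide is changed, and the probe sequence here is made up.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm):
    """Build an MM probe from a 25-nucleotide PM probe.

    Assumption for illustration: the central base is swapped for its complement.
    """
    if len(pm) != 25:
        raise ValueError("expected a 25-nucleotide probe")
    mid = len(pm) // 2  # index 12, i.e. the 13th nucleotide
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

pm = "ACGTACGTACGTACGTACGTACGTA"  # made-up 25-mer
mm = mismatch_probe(pm)
print(pm)
print(mm)
```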

29 Format of an Affymetrix Array

30 Heuristics for “Probe Sets” MM probe is supposed to control for: variation in chemical composition; abundance of cross-hybridizing fragments from other genes. By combining PM and MM information from many probes, gene to gene differences should be minimized. These arrays are more quantitative than other types of microarrays.
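
As a sketch of what "combining PM and MM information from many probes" could look like, the code below averages the PM minus MM differences across a probe set. This is a deliberately simple illustrative summary with made-up intensities, not a claim about the algorithm the vendor software uses.

```python
def average_difference(pm_intensities, mm_intensities):
    """Mean of PM - MM over the probe pairs in one probe set."""
    diffs = [pm - mm for pm, mm in zip(pm_intensities, mm_intensities)]
    return sum(diffs) / len(diffs)

# Made-up intensities for a probe set with four probe pairs.
pm_values = [1200.0, 950.0, 1430.0, 1100.0]
mm_values = [300.0, 250.0, 430.0, 200.0]
print(average_difference(pm_values, mm_values))  # 875.0
```

Subtracting MM is meant to remove the part of the PM signal that comes from non-specific binding, and averaging over pairs smooths out probe-to-probe chemistry differences.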

31 Microarrays for Gene Expression Whichever technology is used, an intensity value is obtained for every probe from every sample. Generally values are comparative - i.e. does this probe express more highly in melanoma than in a normal skin cell? The data are very noisy. A lot of effort has gone into data-cleaning methods which are generally called "normalization".
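
To give a flavor of what normalization does, here is one of the simplest possible versions: take log2 of the intensities on each array and subtract that array's median, so that an overall brightness difference between arrays drops out. The choice of median centering and the data are assumptions for illustration; many more elaborate methods exist.

```python
import math
from statistics import median

def median_center(intensities):
    """Log2-transform one array's intensities and subtract the array median."""
    logs = [math.log2(v) for v in intensities]
    center = median(logs)
    return [x - center for x in logs]

# Made-up example: array_b is the same sample scanned twice as brightly.
array_a = [100.0, 200.0, 400.0, 800.0, 1600.0]
array_b = [2 * v for v in array_a]

norm_a = median_center(array_a)
norm_b = median_center(array_b)
print([round(x, 3) for x in norm_a])
print([round(x, 3) for x in norm_b])
```

After centering, the two arrays agree probe by probe, which is the behavior wanted when the only difference between them is technical.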

32 Microarrays for Gene Expression Microarrays are "genomic" ,000 genes may be on a single array. Microarrays have other uses - e.g. tiling arrays cover the entire genome, SNP arrays have 2 variants of many SNPs, and promoter arrays have upstream sequence. We will focus on gene expression arrays, but most of what we discuss will be useful for all "omic" level data.

33 Measuring Gene Expression Microarrays are very expensive (on a per array basis) and somewhat noisy, but have broad coverage (which makes them cheap on a per gene basis). Real time PCR and Northern blots are more accurate (maybe) but are "single gene" methods.

34 Measuring Gene Expression Microarrays are used to obtain broad coverage of the genome. Real time PCR or Northern Blots are often used to verify the results for a few genes, or for some low-expression genes.