Gene expression: Introduction to gene expression arrays; Microarray data pre-processing; Introduction to RNA-seq; Deep sequencing applications; RNA-seq.


Gene expression
Introduction to gene expression arrays
  Microarray data pre-processing
Introduction to RNA-seq
  Deep sequencing applications
  RNA-seq data pre-processing

Basics of microarrays
An “old” technology: some predict microarrays will be replaced by deep sequencing.
Currently: much cheaper and faster than sequencing; widely used.
2005: First next-generation sequencing machine

Timeline of DNA microarray developments
1991: Photolithographic printing (Affymetrix)
1994: First cDNA collections are developed at Stanford
1995: Quantitative monitoring of gene expression patterns with a complementary DNA microarray
1996: Commercialization of arrays (Affymetrix)
1997: Genome-wide expression monitoring in S. cerevisiae (yeast)
2000: Portraits/signatures of cancer
2003: Introduction into clinical practice
2004: Whole human genome on one microarray
2006: All exons measured on one microarray

Basics of microarrays
Microarrays exploit the complementary base pairing between the four nucleotides: A pairs with T, and C pairs with G. The double-stranded DNA structure is formed through this pairing.
[Figure: DNA_Overview.png]

Basics of microarrays
AATTCAGCATGGGCACATGCCCGCG
TTAAGTCGTACCCGTGTACGGGCGC
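The two strands on this slide can be checked against the pairing rules (A with T, C with G). The sketch below is only an illustration; the names `PAIRING` and `complement` are made up here, and it takes the base-by-base complement in the same left-to-right order as the slide, not the reverse complement.

```python
# Watson-Crick pairing rules: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement, keeping the left-to-right order."""
    return "".join(PAIRING[base] for base in strand)


# The two strands shown on the slide.
top = "AATTCAGCATGGGCACATGCCCGCG"
bottom = "TTAAGTCGTACCCGTGTACGGGCGC"

# True when every base of the top strand pairs with the base below it.
strands_pair = complement(top) == bottom
print(strands_pair)
```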

Basics of microarrays
Two strategies:
(1) One sample on each array. The amount is calculated from spot intensity.
(2) Two samples, differentially labeled, on each array. The relative amount is given by the ratio between the fluorescence intensities.
Workflow: amplified DNA segments → fluorescence labeling → hybridization on the array → reading by photo scanner → digitizing into fluorescence values → quantifying the amount of each target sequence
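For strategy (2), the ratio of the two channels is often reported on a log2 scale, so that a two-fold increase and a two-fold decrease are symmetric around zero. The log2 convention and the function name `log2_ratio` are assumptions made for this sketch; the slide only says that the relative amount is given by the ratio.

```python
import math


def log2_ratio(sample_intensity: float, reference_intensity: float) -> float:
    """Relative amount of a target as log2(sample / reference) fluorescence."""
    if sample_intensity <= 0 or reference_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(sample_intensity / reference_intensity)


# Hypothetical spot intensities for the two differently labeled samples.
doubled = log2_ratio(200.0, 100.0)
halved = log2_ratio(100.0, 200.0)
unchanged = log2_ratio(150.0, 150.0)
```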

Gene expression arrays
[Figure: DNA (2 copies) → mRNA (multiple copies) → protein (multiple copies); labels: gene, exon, intron, poly-A tail, start codon]
Protein: the amount of these guys matters! But they are hard to measure.
mRNA: the amount of these guys is easy to measure. And it is positively correlated with the protein amount!

Gene expression arrays --- Affymetrix
The Affymetrix platform is one of the most widely used.

Gene expression arrays --- Affy
Here we use the U133 system for illustration.
Some 20 probes per gene;
Selected from the 3’ end of the gene sequence;
Not necessarily evenly spaced --- sequence property matters;
The probes are located at random locations on the chip.

TTAAGTCGTACCCGTGTACGGGCGC  Target sequence
AATTCAGCATGGGCACATGCCCGCG  Perfect match (PM) probe
AATTCAGCATGGACACATGCCCGCG  Mis-match (MM) probe
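The PM and MM probes above differ at a single base. A small sketch that locates it (the helper name `mismatch_positions` is hypothetical, introduced only for this illustration):

```python
from typing import List


def mismatch_positions(pm: str, mm: str) -> List[int]:
    """Return the 0-based positions at which two equal-length probes differ."""
    if len(pm) != len(mm):
        raise ValueError("probes must have the same length")
    return [i for i, (a, b) in enumerate(zip(pm, mm)) if a != b]


# Probe sequences copied from the slide.
pm_probe = "AATTCAGCATGGGCACATGCCCGCG"
mm_probe = "AATTCAGCATGGACACATGCCCGCG"

differing = mismatch_positions(pm_probe, mm_probe)
print(differing)  # 0-based index 12, i.e. the 13th (middle) base of the 25-mer
```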

Gene expression arrays --- Affy
The hope was that mismatch probes would not bind the target sequence, so that their signal would reflect only non-specific binding.

Gene expression arrays --- Affy

Microarray data
We are going to focus on pre-processing for now.
Downstream analyses are more in the realm of traditional statistics: multiple testing, clustering, classification, ...
They are common across different high-throughput techniques.

Microarray data
Issues:
Background level variation, caused by variations in overall RNA concentration in the sample, the image reader, etc.
Within every probeset, each probe has a different sensitivity/specificity, caused by cross-hybridization, different chemical properties, etc.
Across chips, the fluorescence intensity-concentration response curve can be different, caused by variations in sample processing, the image reader, etc.

Affy data --- general strategy
Background correction (within chip)
Presence/absence call (within chip)
Normalization (across chips)
Probeset-level expression value (within chip)
Probeset-level statistical analysis (combining chips)

Affy data --- general strategy
There are many processing methods. The most popular include:
MAS 5.0 (Affymetrix): Flawed, but it comes with the Affymetrix software and is thus widely used by non-experts.
dChip (Cheng Li & Wing Wong): Good performance and versatile. Stand-alone Windows application. Can handle arrays other than expression arrays.
RMA (Rafael Irizarry et al.): Good performance. Easily used in R/Bioconductor.

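Affy data --- RMA
Background correction
For each array, the model assumes that the observed PM intensity is the sum of a signal and a background component: the signal is exponentially distributed (rate lambda) and the background is normally distributed (mean mu, standard deviation sigma).
[Figure: example panels for lambda=1, mu=1, sigma=1 and for lambda=5, mu=1, sigma=1]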

Affy data --- RMA
Background correction
For each array, estimate the parameters from the PM signal distribution:
Find the overall mode by kernel density estimation;
Find mu and sigma from the PM values lower than the overall mode (sample mean and sd);
Find lambda from the PM values higher than the overall mode (1/(sample mean minus the overall mode));
Then adjust the PM readings (s is the PM signal; lambda is replaced by alpha in this expression).
See the derivation here:
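The adjustment step can be sketched as follows. The expression on the original slide did not survive in this transcript, so the code uses a commonly quoted simplified closed form for the normal-plus-exponential model, E[signal | s] ≈ a + sigma * phi(a/sigma) / Phi(a/sigma) with a = s - mu - alpha * sigma^2, which drops a small correction term. That choice, and the function names, are assumptions of this sketch and not necessarily the slide's formula.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal CDF, written with erfc to stay accurate in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def rma_background_adjust(s: float, mu: float, sigma: float, alpha: float) -> float:
    """Background-adjusted PM value under the normal + exponential model.

    s: observed PM signal; mu, sigma: background mean and sd;
    alpha: rate of the exponential signal (called lambda elsewhere on the slides).
    """
    a = s - mu - alpha * sigma ** 2
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)


# Hypothetical parameter values, for illustration only.
bright = rma_background_adjust(100.0, mu=1.0, sigma=1.0, alpha=0.2)
dim = rma_background_adjust(0.5, mu=1.0, sigma=1.0, alpha=0.2)
```

For a bright probe the adjusted value is essentially s - mu - alpha * sigma^2, while for a probe at or below the background level it stays positive, which is what allows a log transform afterwards.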

Affy data --- normalization
*** This is also relevant to other array platforms!
Goal: to reduce chip effects, including non-linear effects.
Difficulty: the sample is different for each chip. We can’t match a gene on chip A to the same gene on chip B and expect them to have the same intensity.
[Figure labels: PM, MM]
Assumptions are made about the overall distribution of the signals on each chip. For example:
Some house-keeping genes don’t change;
The overall distribution of concentrations doesn’t change;
...

Affy data --- normalization
Quantile normalization --- match the quantiles between two chips.
Assumes that the distribution of gene abundances is the same between samples.
$x_{\mathrm{norm}} = F_2^{-1}(F_1(x))$, where
x: value in the chip to be normalized
F_1: distribution function in the chip to be normalized
F_2: distribution function in the reference chip
Nature Protocols 2, (2007)
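With empirical distribution functions and two chips of equal size, the mapping from F_1 to the inverse of F_2 reduces to "give each value the reference value of the same rank". A minimal sketch under that assumption (the function name is hypothetical, and ties are broken by position rather than averaged):

```python
from typing import List, Sequence


def quantile_normalize_to_reference(values: Sequence[float],
                                    reference: Sequence[float]) -> List[float]:
    """Replace each value by the reference value that has the same rank."""
    if len(values) != len(reference):
        raise ValueError("this sketch assumes chips with the same number of probes")
    ref_sorted = sorted(reference)
    # Indices of `values`, ordered from the smallest value to the largest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    normalized = [0.0] * len(values)
    for rank, idx in enumerate(order):
        normalized[idx] = ref_sorted[rank]
    return normalized


# Toy intensities: the chip to normalize and the reference chip.
chip = [5.0, 2.0, 3.0]
reference_chip = [10.0, 30.0, 20.0]
normalized_chip = quantile_normalize_to_reference(chip, reference_chip)
```

After normalization the chip has exactly the same set of values as the reference, while the ordering of its own probes is preserved.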

Affy data --- RMA summary
Model fitting: median polish (robust against outliers)
Alternately remove the row and column medians until convergence.
The remainder is the residual.
After subtracting the residual, the row and column medians are the estimates of the effects.
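The procedure can be sketched in plain Python as below. This follows the usual Tukey median-polish bookkeeping (an overall term plus row and column effects). The function name, the convergence tolerance and the toy table are assumptions of this sketch; in RMA the table would typically hold log-scale PM values for one probeset, with one dimension for probes and the other for chips.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-9):
    """Fit table[i][j] = overall + row_effect[i] + col_effect[j] + residual[i][j]."""
    n_rows, n_cols = len(table), len(table[0])
    residual = [[float(v) for v in row] for row in table]
    overall = 0.0
    row_effect = [0.0] * n_rows
    col_effect = [0.0] * n_cols

    for _ in range(max_iter):
        # Remove the row medians and add them to the row effects.
        row_med = [median(row) for row in residual]
        for i in range(n_rows):
            row_effect[i] += row_med[i]
            residual[i] = [v - row_med[i] for v in residual[i]]
        # Move the median of the column effects into the overall term.
        shift = median(col_effect)
        overall += shift
        col_effect = [c - shift for c in col_effect]

        # Remove the column medians and add them to the column effects.
        col_med = [median(residual[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_effect[j] += col_med[j]
            for i in range(n_rows):
                residual[i][j] -= col_med[j]
        # Move the median of the row effects into the overall term.
        shift = median(row_effect)
        overall += shift
        row_effect = [r - shift for r in row_effect]

        # Converged once a full sweep removes (almost) nothing.
        if max(abs(m) for m in row_med + col_med) < tol:
            break

    return overall, row_effect, col_effect, residual


# Toy table that is exactly additive: 10 + row effect + column effect.
toy = [[7, 9, 11],
       [8, 10, 12],
       [9, 11, 13]]
overall, row_effect, col_effect, residual = median_polish(toy)
```

If the columns are chips, `overall + col_effect[j]` would be the summarized expression value for chip j.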

Affy data --- RMA summary
[Figures: worked example over three slides. Remove row median, remove column median, repeat. Converged: what remains is the residual.]

Affy data --- RMA summary
* This reflects the assumption that probe effects have median zero.

Deep Sequencing
“Method of the year” 2007 by Nature Methods.
The name: “Next generation sequencing”, “Deep sequencing”, “High-throughput sequencing”, “Second-generation sequencing”
The key characteristics:
Massively parallel sequencing
Amount of data from a single run ~ amount of data from the Human Genome Project
The reads are short: ~ a few hundred bases per read

Background
Potential impact: the “$1000 genome”
Genome sequencing will become a regular medical procedure.
Personalized medicine
Predictive medicine
Ethical issues
For statisticians:
Data mining using hundreds of thousands of genomes
Finding rare SNPs/mutations associated with diseases
New methods to analyze epigenomics/transcriptomics data
Finding interventions to improve life quality

Background
The companies use different techniques. We use Illumina’s as an example.

Background

An incomplete list of some common platforms.
Bioinformatics and Biology Insights 2015:9(s1)

Background
Advantages:
Fast and cost effective.
No need to clone DNA fragments.
Drawbacks:
Short read length (platform dependent).
Some platforms have trouble with identical repeats.
Non-uniform confidence in base calling within reads: data are less reliable near the 3’ end of each read.

Background What deep sequencing can do:

Background Nat Methods Nov;6(11 Suppl):S2-5.

Alignment and Assembly
Sequence the genome of a person? --- Alignment
Can rely on the existing human genome as a blueprint.
Align the short reads onto the existing human genome.
Need a few-fold coverage to cover most regions.
Sequence a whole new genome? --- Assembly
Overlaps are required to construct the genome.
The reads are short → need ~30-fold coverage.
If 3G data per run, need 30 runs for a new genome similar to human size.
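The arithmetic behind the last line, written out. The sketch assumes a genome of about 3 billion bases for "similar to human size" and reads "3G data per run" as 3 billion bases per run; the function name is hypothetical.

```python
import math


def runs_needed(genome_size_bp: int, fold_coverage: int, bases_per_run: int) -> int:
    """Number of sequencing runs needed to reach the requested fold coverage."""
    total_bases = genome_size_bp * fold_coverage
    return math.ceil(total_bases / bases_per_run)


# ~3 Gb genome, 30-fold coverage, 3 Gb of sequence per run.
assembly_runs = runs_needed(3_000_000_000, 30, 3_000_000_000)
print(assembly_runs)
```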

Whole genome/exome/transcriptome sequencing

Alignment

RNA-Seq
Finding novel exons.
Alternative splicing.

RNA-Seq
Gene expression profiling: to replace arrays?
Exon-specific abundance.

Genome Biology 2010, 11:220

Alignment
Hash table-based alignment. Similar to BLAST in principle.
(1) Find potential locations:
(2) Local alignment.
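Step (1) can be illustrated with a k-mer hash table: index every k-mer of the reference, look up the k-mers of a read, and let each hit vote for the read start it implies. This is a toy sketch with hypothetical names and a made-up reference; real aligners use far more compact indexes, and step (2), the local alignment around each candidate, is not shown.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_locations(read: str, index: dict, k: int) -> list:
    """Candidate read start positions, ordered by the number of supporting k-mers."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            # A k-mer hit at `pos` implies the read starts at `pos - offset`.
            votes[pos - offset] += 1
    return sorted(votes, key=lambda start: (-votes[start], start))


# Made-up reference and a read that occurs in it at position 5.
reference = "ACGTTGCATGCAAGCT"
read = "GCATGC"
kmer_index = build_kmer_index(reference, k=3)
candidates = candidate_locations(read, kmer_index, k=3)
```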

Normalization Genome Biology 2010, 11:220

Normalization
RPKM: reads per kilobase of transcript per million mapped reads
Normalization by ERCC (External RNA Controls Consortium):
Nature Methods 12: 339–342 (2015)
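The RPKM definition translates directly into a formula: the read count is divided by the transcript length in kilobases and by the total number of mapped reads in millions. A minimal sketch (the function name and the example numbers are made up):

```python
def rpkm(read_count: float, transcript_length_bp: float, total_mapped_reads: float) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_of_reads)


# 1000 reads on a 2 kb transcript, in a library of 10 million mapped reads.
example_rpkm = rpkm(1_000, 2_000, 10_000_000)
print(example_rpkm)
```

Doubling either the transcript length or the library size halves the value, which is what makes counts comparable across genes and across samples.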

Sequence count models
Example: simple Poisson model, used for between-group testing, with
d_i: sequencing depth of sample i
β_g: the expression level of gene g
γ_g: the association of gene g with the covariate
Cancer Informatics 2015:14(s1)
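The model formula itself is missing from this transcript. One log-linear form that is consistent with the symbols defined above is given below as an assumption, not as the slide's exact expression. The symbols $N_{ig}$ (read count of gene g in sample i), $\mu_{ig}$ (its mean) and $x_i$ (covariate value of sample i, for example a group indicator) are introduced here for illustration, and $\beta_g$ is taken to act on the log scale.

```latex
N_{ig} \sim \mathrm{Poisson}(\mu_{ig}), \qquad
\log \mu_{ig} = \log d_i + \beta_g + \gamma_g x_i
```

Under this form, testing for a between-group difference in gene g amounts to testing $\gamma_g = 0$.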

Sequence count models
The Poisson model doesn’t allow overdispersion.
Negative binomial model:
Φ_g accounts for the sample-to-sample variability.
Methods like DESeq use the negative binomial distribution.
Cancer Informatics 2015:14(s1)
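The difference between the two models shows up in the variance. Under a Poisson model the variance equals the mean, whereas one common negative binomial parameterization has variance mean + dispersion * mean^2. Using that parameterization, with the dispersion playing the role of Φ_g, is an assumption of this sketch, since the slide's formula is not in the transcript.

```python
def poisson_variance(mean: float) -> float:
    """Poisson: the variance is tied to the mean, so there is no overdispersion."""
    return mean


def negative_binomial_variance(mean: float, dispersion: float) -> float:
    """Negative binomial: variance = mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2


# Hypothetical gene with mean count 4 and dispersion 0.5.
var_poisson = poisson_variance(4.0)
var_nb = negative_binomial_variance(4.0, 0.5)
var_nb_no_dispersion = negative_binomial_variance(4.0, 0.0)
```

With a dispersion of zero the negative binomial variance falls back to the Poisson value.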

RNA-Seq vs. array
Good agreement for genes expressed at medium levels.