QC and pre-processing of microarray data Lars Eijssen - BiGCaT Bioinformatics.

FHML - 2 Contents
Background on quality control (QC) and (further) data pre-processing
Application of an automated workflow for Affymetrix data
  – Settings
  – Illustration on data sets
  – Interpretation of outcome
Introduction to the afternoon session and the data set to be used

FHML - 3 BACKGROUND

FHML - 4 Proper quality control (QC)
Ensures validity of study results
Is pivotal in -omics research
  – Quality is hard to judge by eye
Several tables and images assist in judging quality
Here we focus on QC of gene expression arrays

FHML - 5 Data analysis overview
[Workflow diagram]
Samples: untreated (control); exposed to compound
Steps: microarray scans, image analysis, quality control, further pre-processing (background correction, normalisation), statistical analysis, pattern analysis, pathway analysis, literature data
Data: raw data, normalised data, list of regulated genes, results
Slide based on a slide from J. Pennings, RIVM, NL

FHML - 6 QC and pre-processing
Ensure signal comparability within each array
  – Stains on the array
  – Gradient over the array
Ensure comparable signals between all arrays
  – Degraded / low quality sample
  – Failed hybridisation
  – Too low or high overall intensity
Some effects can be corrected for, others require removal of data from the set

FHML - 7 QC for one and two channel microarrays
The principles are similar for both types of arrays
But the details are different
In two channel arrays QC is a bit more complex
  – Each spot consists of two measurements, not one
  – Dye effect
I will further discuss QC later in this talk, focusing on one channel arrays (Affymetrix chips)

FHML - 8 Dye bias
[Figure panels: Foreground intensity; Background intensity]

FHML - 9 Red and green foreground intensity
For two channel arrays, it is relevant to check whether effects cancel out between channels

FHML - 10 Pre-processing: background correction
Background signal needs to be corrected for
  – For example signal of remaining non-hybridised mRNA
Three types of background
  – Overall slide background
  – Local slide background
  – Specific background
    For example cross-hybridisation, which can be corrected for by mismatch probes (in case of Affymetrix chips)
    Mismatch probes are also used to make present/marginal/absent calls
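As a rough illustration of local background correction on a spotted array, the sketch below subtracts each spot's local background from its foreground intensity. The function name, the example intensities, and the choice to floor corrected values at a small positive number (so that they can still be log-transformed) are assumptions made for illustration, not something prescribed in the talk.

```python
def subtract_background(foreground, background, floor=1.0):
    """Subtract local background from foreground intensity, spot by spot.

    Values that would drop below `floor` are set to `floor`; this is a
    hypothetical safeguard so that a later log transformation stays defined.
    """
    return [max(fg - bg, floor) for fg, bg in zip(foreground, background)]


# Hypothetical foreground and local background intensities for three spots
foreground = [500, 120, 80]
background = [100, 100, 100]

corrected = subtract_background(foreground, background)
print(corrected)
```

The third spot has a foreground below its background, which is why some lower bound (or a model-based correction) is needed before taking logs.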

FHML - 11 Pre-processing: normalisation
After discarding bad arrays and spots, remaining within- and between-array differences that are not related to the biology need to be corrected for
The procedure is cyclic
  – Several QC plots are made before and after normalisation
  – Whether normalisation can correct an artifact may influence the decision to discard or not
  – After data selection, the complete QC should be run again
    Some aberrations may have been masked by larger ones

FHML - 12 Log transformation
Generally, the intensities are first log2-transformed
  – The distribution of the logged intensities is more ‘normal’ than on the original scale
After logging and normalisation one can compute the difference in means (‘logFC’) between several experimental groups
  – The difference is easier to handle statistically
2^logFC corresponds to the fold change (ratio) on the original scale
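The logFC computation can be sketched in a few lines. The intensities below are made-up values for a single probeset, and the function name is hypothetical; the only thing taken from the slide is the definition (difference in mean log2 intensity, with 2^logFC giving the fold change).

```python
import math
from statistics import mean


def log2_fold_change(control, treated):
    """Difference in mean log2 intensity (treated minus control)."""
    log_control = [math.log2(x) for x in control]
    log_treated = [math.log2(x) for x in treated]
    return mean(log_treated) - mean(log_control)


# Hypothetical raw intensities for one probeset in two groups
control = [100.0, 200.0, 400.0]
treated = [400.0, 800.0, 1600.0]

log_fc = log2_fold_change(control, treated)
fold_change = 2 ** log_fc          # back to the original (ratio) scale
reverse = log2_fold_change(treated, control)

print(round(log_fc, 6), round(fold_change, 6), round(reverse, 6))
```

Swapping the groups only flips the sign of the logFC, which is the symmetry that makes the log scale convenient. Note that 2^logFC computed this way is the ratio of the geometric means of the two groups.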

FHML - 13 The log Fold Change
The logFC ‘spreads out’ the data and offers symmetry
‘raw’ ratio (FC):                 ½    1    2
log ratio (logFC), log2 of FC:   −1    0    1

FHML - 14 Spotted and Affymetrix arrays
Spotted arrays
  – Either one or two channel
  – Spot-level QC often included
  – Also often parts of arrays are flagged
  – Each gene is measured by only one or two probes on the array
Affymetrix chips (main focus in remainder of talk)
  – Always one channel, so no dye effect
  – No spot-level QC is taken into account
  – No flagging of local aberrations
  – Each gene is measured by a probeset of probes spread randomly over the array

FHML - 15 Pre-processing for Affymetrix chips
A specific extra step is summarisation of probe values into one value for each probeset
Well-known methods for pre-processing Affymetrix chips
  – MAS5.0 (uses mismatch intensities)
  – RMA (Robust Multiarray Average, does not use mismatches)
    Includes both background correction and (quantile) normalisation
  – GC-RMA (like RMA, but also takes into account GC content)
  – dChip (model-based)
  – Of these, only RMA can be used for exon ST and gene ST arrays (another option is PLIER, an error-model method)
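To give a feel for the quantile normalisation step mentioned for RMA, here is a minimal sketch in plain Python. It is not the RMA implementation: ties are handled naively (by position), there is no background correction or summarisation, and the function name and example values are assumptions.

```python
def quantile_normalise(arrays):
    """Force all arrays to share the same intensity distribution.

    arrays: list of equal-length lists, one list of log2 intensities per array.
    The reference distribution is the mean of the k-th smallest values
    across arrays; each value is replaced by the reference value of its rank.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]
    normalised = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity on this array
        order = sorted(range(n_probes), key=lambda i: a[i])
        result = [0.0] * n_probes
        for rank, idx in enumerate(order):
            result[idx] = reference[rank]
        normalised.append(result)
    return normalised


# Two hypothetical arrays with three probes each
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalised = quantile_normalise(arrays)
print(normalised)
```

After normalisation every array contains exactly the same set of values; only the assignment of values to probes differs, following each probe's rank on its own array.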

FHML - 16 Custom CDF files
Affymetrix provides annotations for their probesets (CDF file)
When these get outdated, one can of course update probeset annotations
But it may be even better to:
  – disassemble these sets into the separate probes
  – reannotate the probes
  – reassemble these into new, different probesets
This is exactly what custom CDF files do
Note that reassembled probesets do not necessarily contain the same number of probes anymore

FHML - 17 BrainArray CDF files¹
Reannotation based on one of several genome databases
IDs are created as follows: the ID of the gene the probeset refers to, followed by ‘_at’ to resemble an Affymetrix ID
  – For example: ENSG00000139618_at
When using these annotations in other tools, you have to remove the ‘_at’ additions in order to get recognisable IDs
  – Note that when using Entrez Gene this means that the ID is composed of a number (the Entrez Gene ID) followed by ‘_at’, and as such looks exactly like a normal Affymetrix ID, but it is NOT one
¹ http://arrayanalysis.mbni.med.umich.edu/arrayanalysis.html
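Removing the ‘_at’ addition before passing IDs to other tools is a one-line string operation. The helper below is a hypothetical sketch; the second ID is a made-up Entrez-style example, and the function should only be applied to IDs known to come from a custom CDF, since original Affymetrix IDs also end in ‘_at’.

```python
def strip_at_suffix(probeset_id):
    """Return the gene ID underlying a custom-CDF probeset ID.

    Only a trailing '_at' is removed; IDs without it are returned unchanged.
    """
    suffix = "_at"
    if probeset_id.endswith(suffix):
        return probeset_id[: -len(suffix)]
    return probeset_id


ensembl_id = strip_at_suffix("ENSG00000139618_at")
entrez_id = strip_at_suffix("675_at")  # hypothetical Entrez Gene based ID

print(ensembl_id, entrez_id)
```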

FHML - 18 Low intensity filtering
Low intensity spots are more subject to noise
Filtering can be done at a later stage
[Figure panels: Before filtering; After filtering. Axes: average intensity; difference between groups]
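A simple form of low intensity filtering keeps only probesets whose average log2 intensity reaches some cut-off. The sketch below assumes a dictionary of log2 intensities per probeset; the probeset names, values, and the threshold of 5.0 are all hypothetical, and a suitable cut-off depends on the platform and data set.

```python
from statistics import mean


def filter_low_intensity(expression, threshold):
    """Keep probesets whose mean log2 intensity across arrays is >= threshold.

    expression: dict mapping probeset ID -> list of log2 intensities.
    """
    return {
        probeset: values
        for probeset, values in expression.items()
        if mean(values) >= threshold
    }


# Hypothetical log2 intensities for three probesets on three arrays
expression = {
    "probeset_a": [9.8, 10.1, 10.4],
    "probeset_b": [3.1, 2.9, 3.4],
    "probeset_c": [6.0, 6.5, 5.5],
}

kept = filter_low_intensity(expression, threshold=5.0)
print(sorted(kept))
```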

FHML - 19 AN AUTOMATED WORKFLOW

FHML - 20 ArrayAnalysis.org
[Diagram: web server, local machine, calculation server]

FHML - 21 http://www.arrayanalysis.org

FHML - 30 Outcome of the workflow: table and images of QC statistics
Affymetrix criteria:
  – Sample prep controls: Lys < Phe < Thr < Dap, Lys present
  – Beta actin 3’/5’ ≤ 3
  – GAPDH 3’/5’ ≤ 1.25
  – Hybridisation controls: BioB < BioC < BioD < CreX, BioB present
  – Percentage present within 10%
  – Background within 20 units
  – Scaling factors within 3-fold from the average
In the table, red and blue indicate whether criteria are fulfilled
The images are taken from other data sets than the one you will be using
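Several of these criteria are simple numeric checks, which can be sketched as below. This is not the ArrayAnalysis.org code: the dictionary keys, array names, and values are hypothetical, and reading "within 10%" and "within 20 units" as the spread (maximum minus minimum) over all arrays is an interpretation made for this example.

```python
from statistics import mean


def check_arrays(arrays):
    """Evaluate a subset of the Affymetrix QC criteria.

    arrays: list of dicts with keys 'name', 'actin_3to5', 'gapdh_3to5',
    'percent_present', 'background' and 'scaling_factor' (hypothetical layout).
    Returns (set_level, per_array) dictionaries of booleans.
    """
    percent_present = [a["percent_present"] for a in arrays]
    background = [a["background"] for a in arrays]
    sf_mean = mean(a["scaling_factor"] for a in arrays)

    # Criteria judged over the data set as a whole (assumed: max minus min)
    set_level = {
        "percent_present_within_10": max(percent_present) - min(percent_present) <= 10,
        "background_within_20": max(background) - min(background) <= 20,
    }
    # Criteria judged for each array separately
    per_array = {}
    for a in arrays:
        per_array[a["name"]] = {
            "actin_3to5_ok": a["actin_3to5"] <= 3,
            "gapdh_3to5_ok": a["gapdh_3to5"] <= 1.25,
            "scaling_factor_ok": sf_mean / 3 <= a["scaling_factor"] <= sf_mean * 3,
        }
    return set_level, per_array


# Hypothetical QC statistics for three arrays
arrays = [
    {"name": "A", "actin_3to5": 1.2, "gapdh_3to5": 1.0,
     "percent_present": 45, "background": 50, "scaling_factor": 1.0},
    {"name": "B", "actin_3to5": 3.5, "gapdh_3to5": 1.1,
     "percent_present": 48, "background": 60, "scaling_factor": 1.2},
    {"name": "C", "actin_3to5": 1.5, "gapdh_3to5": 1.0,
     "percent_present": 30, "background": 55, "scaling_factor": 0.9},
]

set_level, per_array = check_arrays(arrays)
print(set_level)
print(per_array)
```

In this made-up example array B fails the beta actin ratio, and the percentage present differs too much across the set, which is driven by array C.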

FHML - 31 RNA degradation plot; Density plot

FHML - 32 Boxplots

FHML - 33 Virtual (spatial) images; MA plots

FHML - 34 NUSE and RLE plot

FHML - 35 Array correlation plot

FHML - 36 Clustering and PCA plots

FHML - 37 Perspectives
Future relevance of Affymetrix chips?
Data repositories / comparative research
The workflow is also available for local install in R
We will soon include a model for statistical analysis (and processing of other data types)

FHML - 38 Quality Control (QC) of Microarrays Nature, 2005

FHML - 39 Project members
Lars Eijssen, Magali Jaillard, Michiel Adriaens, Philip de Groot, Chris Evelo
Thanks to:

FHML - 40 THE AFTERNOON SESSION AND THE DATA SET

FHML - 41 The afternoon session
In the afternoon session, you will be performing QC and pre-processing yourself
You will follow a stepwise guide available online at http://www.bigcat.unimaas.nl/wiki/index.php/PET_course_2011
You will use an Affymetrix data set and make use of arrayanalysis.org*
* For normalisation you will use a GenePattern module, as the tool you will use for statistical analysis (finding which genes are different) requires this input

FHML - 42 NuGOExpressionFileCreator

FHML - 43 Short description of the data set (1)
Microarray experiments have to be uploaded to online repositories such as Gene Expression Omnibus (GEO, NCBI) or ArrayExpress (AE, EBI) upon publication
We will use a published data set¹ available from AE
¹ Toxicogenomics of subchronic hexachlorobenzene exposure in Brown Norway rats. Ezendam J, Staedtler F, Pennings J, et al. Environ Health Perspect 112(7):782-91

FHML - 44 Short description of the data set (2)
Hexachlorobenzene (HCB) is a persistent pollutant that is toxic to the liver, neurons, and the reproductive and immune systems
In this study, Brown Norway rats were fed a diet supplemented with HCB doses of 0, 150, or 450 mg/kg
Spleen, mesenteric lymph nodes (MLN), thymus, blood, liver, and kidney were analyzed using the Affymetrix rat RGU-34A GeneChip microarray
  – 13-17 arrays per tissue, max 6 per concentration
We will be primarily considering the liver data (17 arrays)
