1
Introduction to Microarray Analysis
Uma Chandran PhD, MSIS Department of Biomedical Informatics 10/17/12
2
What is a microarray
Probes on a surface: glass beads, chips, slides
Arrays can detect: mRNA, microRNA, methylation, SNPs
High throughput: 10,000s of specific probes
Measure global gene expression, SNP calls, LOH, amplification, methylation, etc.
3
Questions that can be asked
Can measure global changes
Which mRNAs are high in disease versus normal, i.e., out of the 1000s of mRNAs expressed in the cell at any time?
Are there single nucleotide polymorphisms that are markers for a disease? There are many studies on, for example, autism and schizophrenia
Are there methylation changes in disease versus normal?
4
Array DESIGN
5
Affymetrix
[Insert oligo slide]
Probes are synthesized on a chip
Probes are oligonucleotides of a specified length, generally 25-mers
At each x, y location, a particular oligonucleotide is synthesized in 1000s of copies
6
Affymetrix
Feature: a location on the array with a particular oligonucleotide sequence
Oligonucleotides are synthesized using a photolithographic manufacturing process
The oligo on the chip is called the probe, and the RNA (or DNA) that it hybridizes to is called the target
7
Affy array design Probe set
8
Affymetrix
9
Probe design
Multiple probe sets per gene
Probe sets are selected based on GenBank, dbEST, RefSeq, and bioinformatics approaches
The design reflects what was known at the time of chip design; it may become incorrect as genome builds are updated
10
Affymetrix data
11
Annotation
The probe set ID and sequence are contained in reference files
This ID never changes; however, annotations change with genome builds
Many software tools are available to annotate; some involve a new BLAST of the sequences
Mask out probe sets
12
Affymetrix
Chips for human: HGU95, HGU133A, B, HGU133 set; 54K probe sets on the HGU133, 30+ to known genes and ESTs
Control probes like GAPDH; spike-in bacterial probes
Chips also for mouse, rat, chimpanzee, plants, and many other species
Dynamic range: from very low (~10 units) to 20K+
Cannot compare genes within chips: for example, a transcript that is expressed at 500 units may not be more abundant than one that is expressed at 200 units, because of probe binding affinities etc.
However, can compare the same probe across multiple chips
Difficulty in probe design makes it difficult to compare from one chip version to another
13
Affymetrix workflow from:
14
Illumina
15
Illumina
Each bead has one type of oligo, with thousands of copies of that oligo per bead
Beads are deposited in wells on glass slides
The beads are then decoded in a step that uses proprietary technology
16
Microarray analysis objectives
Data Preprocessing Data Analysis
17
Analysis questions
Class comparison (e.g. treatment v normal)
Expression: which genes/miRs are up or down in tumors v normal, untreated v treated?
SNP: which regions are amplified or deleted?
Class discovery
Within the tumor samples, are there subgroups that have a specific expression profile?
SNP: amplification or deletion common to subgroups?
Class prediction, pathway analysis, etc.
Integrative analysis: proteomic and genomic; SNP and expression; methylation and expression
[Insert a picture of two different conditions]
18
Challenges in microarray analysis
Different platforms: Illumina, Affymetrix, Agilent…
Many file types, many data formats
Need to learn platform-dependent methods and the software required
Analysis: how to get started? Which methods? Which software?
Many freely available tools; some commercial
Analysis software and methods will depend on platform
SNP analysis is different from expression analysis, and the software used may be very specific to SNP; for example, Excel cannot open large SNP files
How to interpret results
19
Public databases
Many sources for public data: labs, consortia, government
Publications require that data files, including raw files, be made public
GEO; ArrayExpress
20
Hands on #1
Look at GEO
Search DataSets with the terms: Exercise, Heart, Human
Identify the platform by clicking on the GSE record
Try restricting by platform, such as Affymetrix or Illumina
21
Affy data
[Figure labels: normalization method, signal value, probe set ID, total probe sets, raw files]
22
Data pre-processing
Affy produces many files: .dat, .cel, .chp, etc.
Process these to produce data that can be opened in Excel or as .txt
Illumina produces different file types
23
Data Preprocessing
Objective: convert an image of thousands of signals to a signal value for each gene or probe set
Multiple steps: image analysis; background and noise subtraction; normalization; summarized expression value for a probe set or gene
[Table: signal values per gene]
24
Data Pre-processing
Go from .DAT file to feature quantification
In the first step, the .DAT file is aligned to a grid and the features are quantified; this is usually performed by Affy’s proprietary algorithm
.DAT file > .CEL file
The .CEL file contains the feature quantifications
The .CEL file still has probes spread over the chip
Values still need to be summarized to probe set level; for example, 90525_at = 250 units
25
Data Pre-processing – Step 1
Image processing
Usually done using proprietary software
Affy: convert .dat file to .cel file; may perform noise and background subtraction
Illumina: Bead Studio software converts bead-level data to the next level of data
26
Data Preprocessing – Step 2
Normalization: bring all the experiments up to the same scale
Multi-step process depending on technology
Summarized expression value for a probe set or gene
Affy: .cel to .chp; needs the .cdf file, which describes the chip layout
Illumina: normalization option and background subtraction option using Bead Studio
[Table: signal values per gene]
27
.CEL + .CDF to .CHP
In going from the .CEL to the .CHP file to generate signal values, the multiple probes within a probe set are “averaged” to produce a single value for that gene/transcript
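The summarization step can be sketched as follows. This is a minimal illustration that uses a plain mean; the probe intensities are made up, and the function name is hypothetical. Real algorithms use more robust summaries (MAS5.0 uses a Tukey biweight, RMA uses median polish), so this is not what Affy software actually computes.

```python
from statistics import mean

# Hypothetical probe-level intensities, keyed by probe set ID.
# The ID 90525_at comes from the slides; the numbers are invented.
probe_intensities = {
    "90525_at": [240.0, 255.0, 250.0, 260.0, 245.0],
}

def summarize_probe_sets(probes_by_set):
    """Collapse the probes of each probe set into one signal value (plain mean)."""
    return {probe_set: mean(values) for probe_set, values in probes_by_set.items()}

signals = summarize_probe_sets(probe_intensities)
print(signals)
```

The output has one value per probe set, which is the shape of data that downstream tools expect.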
28
Normalization
Corrects for variation in hybridization etc.
Important for all high-throughput platforms
Assumes that there is no global change in gene expression
Without normalization, the intensity value for a gene will be lower on Chip B, and many genes will appear to be downregulated when in reality they are not
[Figure: example gene intensities, treated v control]
29
How to normalize?
Many methods
Affy MAS5.0: median scaling, i.e. the median intensity for all chips should be the same
Known genes: housekeeping, invariant genes
Quantile: RMA
Normalization method may differ depending on platform: cubic spline for Illumina; for Affymetrix, choose the method when going from .cel to .chp file
Which method to choose? Know the biology
After normalization: .cel > .chp file > .txt file
[Figure: chips A and B; before normalization (down), after normalization (no change)]
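Two of the methods named above, median scaling and quantile normalization, can be sketched in a few lines. This is an illustration under simplifying assumptions: the chip values are invented, ties are not handled, and the function names are hypothetical rather than taken from any Affy or Illumina tool.

```python
from statistics import mean, median

def median_scale(chips):
    """Scale every chip so its median equals the mean of all chip medians."""
    medians = {name: median(values) for name, values in chips.items()}
    target = mean(medians.values())
    return {name: [v * target / medians[name] for v in values]
            for name, values in chips.items()}

def quantile_normalize(chips):
    """Give every chip the same distribution: the mean of the sorted values at each rank."""
    names = list(chips)
    n = len(chips[names[0]])
    sorted_cols = [sorted(chips[name]) for name in names]
    rank_means = [mean(col[i] for col in sorted_cols) for i in range(n)]
    result = {}
    for name in names:
        # Indices of this chip's values, ordered from smallest to largest.
        order = sorted(range(n), key=lambda i: chips[name][i])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result[name] = normalized
    return result

# Chip B is uniformly half as bright as chip A (a hybridization effect, not biology).
chips = {"A": [100.0, 200.0, 400.0, 800.0], "B": [50.0, 100.0, 200.0, 400.0]}
scaled = median_scale(chips)
qnorm = quantile_normalize(chips)
print(scaled)
print(qnorm)
```

After either method the two chips agree, so no gene appears to be downregulated on chip B.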
30
Normalization
31
Affy data
[Figure labels: normalization method, signal value, probe set ID, total probe sets, raw files]
32
Workflows
Affy
.dat file > .cel file > .chp file > .txt file
Affy software is needed for .dat > .cel; the rest of the steps can be carried out by other tools
Illumina
Through Bead Studio: background subtraction > normalization with various options > background normalization > .txt file
Bead Studio is needed to carry out these steps, and raw files are not necessarily given
[Figure labels: normalization, cdf file]
33
Illumina
Does not have .DAT, .CEL, .CDF and .CHP files
There is no chip definition or chip layout as in Affy
However, the identity of each bead has to be decoded via proprietary software
34
Illumina data preprocessing
Raw files are .txt files
[Figure labels: signal, normalization, probe ID]
35
Affy v Illumina
Affy: 25-mer probes synthesized on chips; multiple probes/probe set; may have multiple probes/transcript; .dat, .cel, .cdf, .chp file types; normalization methods such as quantile; txt output can be used for downstream data analysis; annotations can be updated
Illumina: longer oligo; bead technology; single probe; may have multiple probes/transcript; image file processed by Bead Studio; several normalization methods; txt output can be used for downstream data analysis; annotations can be updated
37
Hands on #2 -Data analysis
Import data into BRB
Which files to import: .cel files if performing normalization through BRB, or import an already normalized file as a .txt file for further analysis
38
Steps in analysis - Import
Affy
Import all files into Affy tools such as Expression Console
Normalize and generate signal values using Affy MAS5.0
Assess QC using GAPDH, B-actin, and control probes for spike-in and hybridization
Then import into other tools such as BRB for analysis
Illumina
Depending on background subtraction/normalization, may have generated negative values
Check QC metrics, such as: did the chip pass?
Remove negative values
Import into tools such as BRB
39
Steps in Data analysis – Normalization
Import raw data into a tool
Has the data been normalized? If not, which method to use?
What is available for a particular platform? If not available in tools, is R code or a package available?
After normalization, check the distribution
Are there any batch effects?
Is the data log transformed? If not, should you log transform? When: after or before normalization?
Are there missing or negative values in the data? What should be done: impute, or remove rows?
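The log-transform and negative-value questions above interact, because a log cannot be taken of a zero or negative intensity. The sketch below shows two possible policies; the function name, the `floor` parameter, and the sample values are all assumptions for illustration, and neither policy is claimed to be what any particular tool does.

```python
import math

def log2_transform(values, floor=None):
    """Log2-transform intensities.

    Non-positive values cannot be logged. With floor=None they are marked as
    missing (None); with a floor, every value below it is raised to the floor.
    """
    out = []
    for v in values:
        if floor is not None:
            v = max(v, floor)
        out.append(math.log2(v) if v > 0 else None)
    return out

# Invented raw signals, including a negative and a zero as can arise after
# background subtraction.
raw = [1024.0, 256.0, -3.5, 0.0, 16.0]
as_missing = log2_transform(raw)            # flag for removal or imputation
floored = log2_transform(raw, floor=1.0)    # clamp to 1 so log2 gives 0
print(as_missing)
print(floored)
```

Marking values as missing defers the impute-or-remove decision; flooring avoids missing data but compresses all low signals to the same value.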
40
Steps in Data analysis – update Annotations
Very important step
Annotations are updated; the annotations provided may often be incorrect
Multiple probe sets for each gene
41
BRB – Array tools
Website
Excel plug-in; R and Fortran
Import, choose correct format
For Affy: .cel files, processed using GCRMA or MAS5.0; or import directly from processed files
Attaches annotation
Create experiment labels
42
Class Discovery
Objective?
Can the data tell us which classes are similar? Are there subgroups?
Do T-ALL, T-LL, B-ALL fall into distinct groups?
Methods
Hierarchical clustering, K-means, SOM, etc.
These are unsupervised methods: class IDs are not known to the algorithm; for example, it does not know which sample is cancer or non-cancer
Do the expression values differentiate the classes? Does the method discover new classes?
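A toy version of hierarchical clustering illustrates the unsupervised idea. This is a sketch only: it uses single linkage with Euclidean distance (one of several possible choices), and the expression profiles and sample names are invented. The names are used only to read the result; the algorithm sees nothing but the numbers.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(profiles, n_clusters):
    """Single-linkage agglomerative clustering; returns a list of sets of sample names."""
    clusters = [{name} for name in profiles]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters whose closest members are nearest to each other.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                euclidean(profiles[a], profiles[b])
                for a in clusters[pair[0]]
                for b in clusters[pair[1]]
            ),
        )
        clusters[i] |= clusters[j]
        del clusters[j]  # safe because i < j
    return clusters

# Invented log2 expression of three genes in four samples.
profiles = {
    "T-ALL_1": [8.0, 2.0, 5.0],
    "T-ALL_2": [8.2, 2.1, 5.1],
    "B-ALL_1": [3.0, 9.0, 5.0],
    "B-ALL_2": [3.1, 8.8, 4.9],
}
clusters = agglomerate(profiles, n_clusters=2)
print(clusters)
```

Here the two recovered groups match the sample labels, which is the kind of agreement one looks for when asking whether known classes fall into distinct groups.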
43
Multidimensional scaling - MDS
44
Class comparison – differential expression analysis
Which genes are upregulated between control and test, or across multiple test conditions?
Normal v tumor; treated v untreated
Fold change: not sufficient, need statistics
Statistics: t test, non-parametric tests, FDR
45
Class comparison
Many analysis methods, which may produce different results
Different underlying statistics and methods: t test, t test with permutations, SAM, empirical Bayes
Depends on underlying assumptions about the data
High-throughput data with many rows and few samples
What is the distribution? How does variance differ from gene to gene?
Save raw data files to try different methods and compare results
46
Fold change does not take variation into account
Low variability: differentially expressed gene
Medium variability: differentially expressed gene, but a less reliable estimate
High variability: differentially expressed gene; powerful and exact statistical tests must be used
Modified from madB
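The point of this slide can be shown numerically: two genes can have the same fold change while differing greatly in how much the replicates scatter. The values and gene names below are invented for illustration.

```python
import math
from statistics import mean, stdev

def log2_fold_change(control, treated):
    """Log2 ratio of the group means (1.0 means a two-fold increase)."""
    return math.log2(mean(treated) / mean(control))

# Two hypothetical genes with three replicates per group.
gene_low_var = {"control": [100, 102, 98], "treated": [200, 204, 196]}
gene_high_var = {"control": [20, 100, 180], "treated": [40, 200, 360]}

fc_low = log2_fold_change(gene_low_var["control"], gene_low_var["treated"])
fc_high = log2_fold_change(gene_high_var["control"], gene_high_var["treated"])
sd_low = stdev(gene_low_var["control"])
sd_high = stdev(gene_high_var["control"])
print(fc_low, fc_high, sd_low, sd_high)
```

Both genes show a two-fold change, but the second gene's replicates overlap between groups, so the fold change alone says little about whether the difference is real.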
47
Hypothesis Testing
Null hypothesis
Alternative hypotheses
[Figure: Normal and Tumor distributions with mean1, mean2, and their difference d]
48
Statistical power: t test
Tests the null hypothesis that the two means are not different
Adds “confidence” to the fold change value
Uses the mean, standard deviation, and sample size to calculate a statistic
You choose the cutoff or threshold: give me the gene list at a cutoff of p < 0.05
At p < 0.05, a difference this large would be expected less than 5% of the time if the control and treated means for that gene were truly the same
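How the mean, standard deviation, and sample size combine into a statistic can be sketched with Welch's unequal-variance t test, which is one common variant and an assumption here, since the slides do not say which form is used. The replicate values are invented. Converting the statistic to a p value needs the t distribution, for which a library routine such as SciPy's `scipy.stats.ttest_ind` would normally be used.

```python
import math
from statistics import mean, variance

def welch_t(control, treated):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    n1, n2 = len(control), len(treated)
    # Squared standard error of each group mean.
    v1 = variance(control) / n1
    v2 = variance(treated) / n2
    t = (mean(treated) - mean(control)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Invented replicate intensities for one gene.
control = [100, 102, 98]
treated = [200, 204, 196]
t_stat, dof = welch_t(control, treated)
print(t_stat, dof)
```

A larger difference in means, a smaller standard deviation, or more replicates all push the statistic further from zero.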
49
Experimental Design – Very important!!!
Sample size: how many samples in test and control? Will depend on many factors, such as whether tissue culture or tissue sample
Power analysis
Replicates: technical v biological; biological replicates are more important for more heterogeneous samples; need replicates for statistical analysis
To pool or not to pool: depends on objective
Sample acquisition or extraction: laser captured or gross dissected
All experimental steps from sample acquisition to hybridization
Microarray experiments are very expensive, so plan experiments carefully
Quality of data matters not just within your lab or institution but across many datasets
50
t tests
Results might look like:
At p < 0.05, there are 300 genes upregulated and 200 genes downregulated; for each of these genes, the difference in means between the two groups is significant at the 5% level
At p < 0.05, x genes up and y genes down with a fold change of at least 3.0
51
Multiple comparison
Microarrays have a multiple comparison problem
p <= 0.05 means that, for each gene tested, there is a 5% chance of calling a difference when there is none
5% of 10,000 genes is 500: 500 genes are picked up by chance
Suppose t tests select 1000 genes at a p of 0.05: 500/1000, so approximately 50% of the selected genes will be false
Very high false discovery rate; need more confidence
How to correct? Correction for multiple comparisons: report a p value and a corrected p value
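The arithmetic on this slide can be written out directly. The total of 10,000 genes is an assumption inferred from the slide's figure of 500 chance hits at 5%, and the calculation takes the worst case in which no gene is truly changed.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance hits when every null hypothesis is true."""
    return n_tests * alpha

n_genes = 10_000   # assumed number of genes tested on the array
alpha = 0.05       # per-gene p value cutoff
n_selected = 1000  # genes passing the cutoff, as in the slide's example

false_hits = expected_false_positives(n_genes, alpha)
false_fraction = false_hits / n_selected
print(f"expected false positives: {false_hits:.0f}")
print(f"fraction of selected genes expected to be false: {false_fraction:.0%}")
```

The same cutoff applied to more probes produces proportionally more chance hits, which is why the correction has to scale with the number of tests.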
52
Corrections for multiple comparisons
Involve corrections to the p value so that the corrected p value is higher than the raw p value
Bonferroni
Benjamini-Hochberg
Significance Analysis of Microarrays (SAM), Tusher et al. at Stanford
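The first two corrections listed above are short enough to sketch in full. The p values are invented, and the function names are hypothetical; SAM is permutation-based and is not shown.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw_p)
bh = benjamini_hochberg(raw_p)
print(bonf)
print(bh)
```

Both methods return values at least as large as the raw p values; Bonferroni is the more stringent of the two, so it keeps fewer genes at the same cutoff.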
53
Hands on BRB
Class comparison: choose comparison
Which tests are available?
P value cutoff
How is multiple testing correction being done? Stringent p value, FDR
How is the output reported?
Can you figure out how many genes are regulated at different p values and different cutoffs?
How to interpret results
Look at gene lists generated by our analysis v those generated in the paper
54
BRB – Class Comparison
Output folder: check the .html file
Look at results: p value, fold change, annotation
Click on annotation
Cut and paste to save into Excel
55
Issues
Annotation
Multiple probe sets for a gene
Annotation files will get updated
Which one is correct? Where does it map? How to report the genes?
How to compare between platforms
Different chips within the same platform
Biological annotation
56
Difficult to interpret experimental results
57
Which probe/probe set is correctly aligned to the gene?
58
Probe set errors
Types of probe error: mismatched probe, SNPs, cross hybridization, intron probe
59
ESR1 probes in UCSC genome browser
60
How to manipulate Gene lists
Create gene lists
Venn diagram: can be done even though the studies were done on different platforms
Compare MAS and RMA
Compare B-ALL v T-LL and T-LL v B-ALL
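The regions of a two-list Venn diagram are plain set operations. The gene lists below are invented placeholders for the output of a MAS5.0-based and an RMA-based analysis. Comparing by gene symbol rather than probe set ID is an assumption that makes lists from different platforms comparable.

```python
# Hypothetical significant-gene lists from two normalization methods.
mas5_genes = {"ESR1", "GAPDH", "TP53", "MYC"}
rma_genes = {"ESR1", "TP53", "BRCA1"}

shared = mas5_genes & rma_genes      # overlap of the two circles
only_mas5 = mas5_genes - rma_genes   # found with MAS5.0 only
only_rma = rma_genes - mas5_genes    # found with RMA only

print("shared:", sorted(shared))
print("MAS5.0 only:", sorted(only_mas5))
print("RMA only:", sorted(only_rma))
```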
61
Venn Diagram http://www.pangloss.com/seidel/Protocols/venn.cgi
62
Conclusion
GEO has some data analysis features
Other analysis: class prediction
Gene lists from class comparison can be used in pathway analysis
HSLS pathway workshops on Ingenuity, DAVID, Pathway Architect
Future: integrate expression data with other data such as SNP or microRNA
63
ESR1 probes in UCSC genome browser
64
Next Gen Sequencing
Directly sequence DNA to determine: SNP, CN, expression (mRNA, microRNA), protein binding sites, methylation
Initial steps depend not only on hybridization but also on base pairing or complementarity and DNA synthesis
Data analysis is extremely challenging
65
Next Gen Sequencing Applications
Sequence variation: WGS, Exome Seq
Structural rearrangements: WGS, Exome Seq
Copy number: WGS, Exome Seq
Epigenetic changes such as methylation: Methyl Seq
DNA-protein binding: ChIP-Seq
mRNA expression: RNA Seq
66
Next Gen Sequencing
67
Read mapping
Alignment: de novo assembly, or mapping to a reference genome
Mapping is based on complementarity of a given 35-nucleotide read to the entire genome
Computationally intensive: millions of 35 bp reads have to be searched against the reference and aligned specifically to a given region
Large file sizes: sequence files in the TB range; aligned BAM files of several hundred GB
[Figure label: reference genome]
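The core idea of mapping reads to a reference can be sketched with exact string matching. This is a deliberately naive illustration: the sequences are invented, only the forward strand is searched, and no mismatches are allowed. Real aligners use indexed data structures and tolerate mismatches and gaps, which is part of why the task is computationally intensive.

```python
def map_reads(reference, reads):
    """Return {read: [0-based start positions]} for exact forward-strand matches."""
    hits = {}
    for read in reads:
        positions = []
        start = reference.find(read)
        while start != -1:
            positions.append(start)
            start = reference.find(read, start + 1)
        hits[read] = positions
    return hits

# Toy reference and reads (real reads would be about 35 bp against billions of bases).
reference = "ACGTACGTTAGCCGATAGCTAGGCTA"
reads = ["TAGC", "GGCTA", "TTTT"]
hits = map_reads(reference, reads)
print(hits)
```

A read that matches in more than one place, like the first one here, cannot be assigned to a single region, which is why aligning "specifically" matters.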
68
Sequence variation
70
Analysis pipeline- CHIP-Seq