January MSCL Analyst’s Toolbox Instructors: Jennifer Barb, Zoila G. Rangel, Peter Munson Jan 2008 Mathematical and Statistical Computing Laboratory Division of Computational Bioscience
January Course Outline Day 1 MSCL Analyst’s Toolbox and JMP™ overview MSCL Toolbox Concepts JMP™fundamentals Lunch Affymetrix ExpressionConsole™, processing.cel files, exporting data MSCL Toolbox Demo –Data input –Basic Analysis (Master File, Final File, Data normalization, QC, PCA, ) –Gene selection, statistical tests (p-values, FDR) –Annotation Day 2 Statistical Topics (PCA, Data normalization, FDR) MSCL Analyst’sToolbox Demo (cont.) –Complex Analysis (2-way ANOVA, blocked ANOVA) –Data Visualization
January Topics not included Exon Array Analysis -- coming soon! SNP chip Resequencing analysis, ChIP-Chip, copy number 2-color or spotted cDNA array analysis complete JMP tutorial JMP on Mac, Linux JMP scripting language Data management commands in JMP: Stack, Split, Concatenate, Sort
January Why use JMP? Interactive graphics facilitates data exploration, discovery of features Powerful, > 2,00,000 rows by 100s of columns (currently, 2 GB limit) Scripting language -- object oriented, allows matrix manipulation Connects to database servers including NIHLIMS or local GCOS JMP is also general purpose statistics pack Good technical support for JMP from: (919) or No direct cost to individual NIH users* (centrally supported in most NIH ICs) MSCL Analyst's Toolbox is FREE, adds tools for microarray studies
January 20085
6 MSCL Analyst’s Toolbox Features Menu driven Automated gene annotations Web link-out** Highly interactive, intuitive user interface Analysis pipeline, based on years of experience Familiar parametric analysis, e.g. ANOVA Exploratory Data Analysis Adaptable to new designs, analyses (e.g. Exon chips, SNP chips) Powerful, handles largest Affy chips, probe-level analysis Up to hundreds of chips at once PC, Mac or Linux desktops Support available through MSCL
January MSCL Analyst’s Toolbox Capabilities Connects to the central NIHLIMS database or local GCOS databases Reads in Pivot Tables from Affymetrix EC™ or GCOS™ Visualizes Principal Components Analyzes simple experiments (paired, unpaired T-tests) Analyzes complex experiments (multiple treatments, time series, linear trends, slope changes between treatments) Compensates for “batch” effects Selects and annotates significant genes Manages multiple gene lists (intersection, union, Venn diagrams) Multivariate, Cluster, Discriminant, Neural net analysis Uses dynamic visualization tools
January How to obtain: JMP – –Find your desktop support person at –JMP technical support from (919) The MSCL Analyst's Toolbox –Download from –Help offered on collaborative basis by MSCL – questions to:
January NIH Bioinformatics Cooperative
January Input files or Fetch data Transform and normalize Principal Components Analysis Create Master file, add treatment groups Compute statistical test, get p-values Correct for multiple comparisons or use FalseDiscoveryRate Compute log fold-change Visualize results Select relevant genes files Xform PCA Master Final MSCL Toolbox Data Pipeline:
January Data sources: NIHLIMS database via ODBC connection Local GCOS database via ODBC connection GCOS pivot table EC pivot table (NEW support for this option) Excel spread sheet Text files
January Data Input or data fetch DCEG/NCI Publish DB MSCL Publish DB client files client workstation Analyze (MAS) Process DB.dat files.cel files.chp files.rpt files Import(LM) Export(LM) Import Publish(MAS) ODBC access DMT Partek GeneSpring archive(LM) delete(LM) assume ownership(LM) Fluidics PlatformScanner CCMD Publish DB A-SCAN NIHLIMS database EC™ or GCOS™ MAS5™.txt
January Gene Expression Data Matrix Expression Matrix 116 Samples 1 20,000 Genes Gene Annotations Sample information
January Annotations for each gene Probe Set ID Genbank ID Unigene ID, Title Entrez Gene ID Cytogenetic map location Physical map location HUGO gene symbol, synonyms Functional relevance Associated literature references... GO terms for molecular process, biological function or cellular component Gene Annotations 1 20,000 Genes
January Annotation Files: Affymetrix annotations for each probeset have been downloaded and formatted for MSCL Toolbox, available at affylims.cit.nih.gov Annotations are updated quarterly Annotation tables may be JOINed by ProbeSetID Probe Set ID Gene Title Gene Symbol UnigeneID Transcript ID Ensembl Entrez Gene Representative Public ID First SwissProt Genome Alignment Chromosome Genome Alignment Start Address Genome Alignment Stop Address Genome Alignment Strand Chromosomal Location FinalAnnot. Final-Annot
January Annotating Genes Netaffx, reformatted Your data file “JOIN” on ProbeSetID
January Information about the Sample (transposed into MasterFile) 1 16 S amples Information about each Sample Clinical information (human) Diagnosis Demographic information Treatment (in vivo, in vitro) in designed experiment Tissue of origin Cell culture, strain, passage Sampling date/time RNA preparation protocol Operator/batch/lot/laboratory information QC information (rawQ, scale factor, 3/5-actin, 3/5-GAPDH, etc)
January Table formats JMP usually deals with a single Table, but… TWO tables are needed for MSCL Analyst’s Toolbox: 1. "Master File" layout –Each ROW represents a chip –Columns define treatment, replicate number, etc. 2. "Final" layout –COLUMNs correspond to chips (rows in Master File) –Each ROW is a probe set, unique identifier is probe set ID Tables are LINKED by “Shortnames” field in Master
January Linked Table Formats Master File -- one row per chip Final File -- one row per probe set
January Naming Convention for Final File Columns (prefixes) Data type: AD-, SG-, PA- Data transform: L-, Lmed-, GL-, S10- Statistical results: p-, FDR-, mean-, SFC- Column Naming Tips: –Avoid punctuation, hyphen, period, slash, etc. –Avoid spaces, use underscore “_” instead –Shorter is better –Toolbox utility available for trimming column names Column Name ITEM_NAME SG-33NH SG-33TH S10-33NH S10-33TH PA-33NH PA-33TH SFC-7 SFC-11 p-slope¢2 FDR slope¢2
January Input files or Fetch data Transform and normalize Principal Components Analysis Create Master file, add treatment groups Compute statistical test, get p-values Correct for multiple comparisons or use FalseDiscoveryRate Compute log fold-change Visualize results Select relevant genes Data Pipeline: files Xform PCA Master Final
January Data Transformation and Normalization
January Log(x/median x) transform (“Lmed”)
January Input files or Fetch data Transform and normalize Principal Components Analysis Create Master file, add treatment groups Compute statistical test, get p-values Correct for multiple comparisons or use FalseDiscoveryRate Compute log fold-change Visualize results Select relevant genes Data Pipeline: files Xform PCA Master Final
January Principal Components Analysis PC 1(38%) PC 2(12%)
January Input files or Fetch data Transform and normalize Principal Components Analysis Create Master file, add treatment groups Compute statistical test, get p-values Correct for multiple comparisons or use FalseDiscoveryRate Compute log fold-change Visualize results Select relevant genes Data Pipeline: files Xform PCA Master Final
January Analysis Scripts ANOVA1 T-test, unequal variance Paired t-test Consistency test ANOVA1 with blocking ANOVA2 with interaction terms (unbalanced data allowed) ANOVA2 with blocking Linear regression ANCOVA with blocking (balanced data case) ANCOVA2 with blocking (balanced data case) Other tests are easily added (requires scripting)
January Input files or Fetch data Transform and normalize Principal Components Analysis Create Master file, add treatment groups Compute statistical test, get p-values Correct for multiple comparisons or use FalseDiscoveryRate Compute log fold-change Visualize results Select relevant genes Data Pipeline: files Xform PCA Master Final
January Log(FoldChange)=“LFC” FoldChange = treated / control Log(FoldChange) = Log(treated / control) = Log(treated) - Log(control) Rule of Thumb for Base10 Logarithms: Log10(2-fold change) = 0.3 Log10(10-fold change) = 1 Log10(0.1-fold change) = -1
January Input files or Fetch data Transform and normalize Principal Components Analysis Create Master file, add treatment groups Compute statistical test, get p-values Correct for multiple comparisons or use FalseDiscoveryRate Compute log fold-change Visualize results Select relevant genes Data Pipeline: files Xform PCA Master Final
January Volcano Plot Significance of change Magnitude of change, Log Scale Selection Regions
January Interpreting Gene Lists FinalAnnot. Filter (FDR<10%) GeneList Significant Terms Ingenuity™, GeneGo™
January GO-SCAN- Gene Ontology Annotations Gene Ontology for Significant Collection of Annotations: GO-SCAN is a bioinformatics tool that selects and presents relevant Gene Ontology (GO) annotations for a gene "hit" list from an Affymetrix microarray experiment.
January Ingenuity Pathway Analysis (Doug Joubert, NIH Library)