2.2 - Microarray Data Modeling Lab Sébastien Lemieux Elitra Canada Ltd.

Slides:



Advertisements
Similar presentations
Liang, Introduction to Java Programming, Ninth Edition, (c) 2013 Pearson Education, Inc. All rights reserved. 1 Chapter 9 Strings.
Advertisements

Formal Language, chapter 4, slide 1Copyright © 2007 by Adam Webber Chapter Four: DFA Applications.
Minimum Information About a Microarray Experiment - MIAME MGED 5 workshop.
Arrays.
Microarray technology and analysis of gene expression data Hillevi Lindroos.
Image Quantitation in Microarray Analysis More tomorrow...
Microarray Data Analysis Stuart M. Brown NYU School of Medicine.
TIGR Spotfinder: a tool for microarray image processing
Chapter 7 – Arrays.
DNA Microarray: A Recombinant DNA Method. Basic Steps to Microarray: Obtain cells with genes that are needed for analysis. Isolate the mRNA using extraction.
Microarray Data Preprocessing and Clustering Analysis
1 Chapter 4 Language Fundamentals. 2 Identifiers Program parts such as packages, classes, and class members have names, which are formally known as identifiers.
Aalborg Media Lab 21-Jun-15 Software Design Lecture 2 “ Data and Expressions”
1 CS 177 Week 15 Recitation Slides Review. Announcements Final Exam on Sat. May 8th  PHY 112 from 8-10 AM Complete your online review of your classes.
Microarray Technology Types Normalization Microarray Technology Microarray: –New Technology (first paper: 1995) Allows study of thousands of genes at.
Data Extraction cDNA arrays Affy arrays. Stanford microarray database.
Inferring the nature of the gene network connectivity Dynamic modeling of gene expression data Neal S. Holter, Amos Maritan, Marek Cieplak, Nina V. Fedoroff,
Introduce to Microarray
Scanning and image analysis Scanning -Dyes -Confocal scanner -CCD scanner Image File Formats Image analysis -Locating the spots -Segmentation -Evaluating.
COMP 14: Primitive Data and Objects May 24, 2000 Nick Vallidis.
Microarrays: Basic Principle AGCCTAGCCT ACCGAACCGA GCGGAGCGGA CCGGACCGGA TCGGATCGGA Probe Targets Highly parallel molecular search and sort process based.
Microarray Preprocessing
Options for User Input Options for getting information from the user –Write event-driven code Con: requires a significant amount of new code to set-up.
Image Quantitation in Microarray Analysis More tomorrow...
CDNA Microarrays Neil Lawrence. Schedule Today: Introduction and Background 18 th AprilIntroduction and Background 25 th AprilcDNA Mircoarrays 2 nd MayNo.
Affymetrix vs. glass slide based arrays
Abstract Data Types (ADTs) and data structures: terminology and definitions A type is a collection of values. For example, the boolean type consists of.
The following slides have been adapted from to be presented at the Follow-up course on Microarray Data Analysis.
Gene Expression Omnibus (GEO)
Recitation 1 CS0445 Data Structures Mehmud Abliz.
Introduction to Programming David Goldschmidt, Ph.D. Computer Science The College of Saint Rose Java Fundamentals (Comments, Variables, etc.)
Chapter 9 1 Chapter 9 – Part 1 l Overview of Streams and File I/O l Text File I/O l Binary File I/O l File Objects and File Names Streams and File I/O.
Hello.java Program Output 1 public class Hello { 2 public static void main( String [] args ) 3 { 4 System.out.println( “Hello!" ); 5 } // end method main.
Microarray - Leukemia vs. normal GeneChip System.
1 maxdLoad The maxd website: © 2002 Norman Morrison for Manchester Bioinformatics.
Scenario 6 Distinguishing different types of leukemia to target treatment.
D. M. Akbar Hussain: Department of Software & Media Technology 1 Compiler is tool: which translate notations from one system to another, usually from source.
Chapter 9-Text File I/O. Overview n Text File I/O and Streams n Writing to a file. n Reading from a file. n Parsing and tokenizing. n Random Access n.
 Pearson Education, Inc. All rights reserved Introduction to Java Applications.
CS 112 Introduction to Programming Arrays; Loop Patterns (break) Yang (Richard) Yang Computer Science Department Yale University 308A Watson, Phone:
Microarrays and Gene Expression Analysis. 2 Gene Expression Data Microarray experiments Applications Data analysis Gene Expression Databases.
What Is Microarray A new powerful technology for biological exploration Parallel High-throughput Large-scale Genomic scale.
Lab 2 Primer Assignment 3 Structure File I/O More parsing and HTTP Formatting.
1 Global expression analysis Monday 10/1: Intro* 1 page Project Overview Due Intro to R lab Wednesday 10/3: Stats & FDR - * read the paper! Monday 10/8:
Assignment An assignment statement changes the value of a variable The assignment operator is the = sign total = 55; Copyright © 2012 Pearson Education,
PROGNOCHIP-BASE, FORTH-ICS 1 PrognoChip-BASE: An Information System for the Management of Spotted DNA MicroArray Experiments Extension of BASE v
Statistics for Differential Expression Naomi Altman Oct. 06.
“Never doubt that a small group of thoughtful, committed people can change the world. Indeed, it is the only thing that ever has.” – Margaret Meade Thought.
1 Outline Standardization - necessary components –what information should be exchanged –how the information should be exchanged –common terms (ontologies)
Class 23, 2001 CBCl/AI MIT Bioinformatics Applications and Feature Selection for SVMs S. Mukherjee.
Microarray Technology. Introduction Introduction –Microarrays are extremely powerful ways to analyze gene expression. –Using a microarray, it is possible.
Microarray hybridization Usually comparative – Ratio between two samples Examples – Tumor vs. normal tissue – Drug treatment vs. no treatment – Embryo.
Chapter 9 1 Chapter 9 – Part 2 l Overview of Streams and File I/O l Text File I/O l Binary File I/O l File Objects and File Names Streams and File I/O.
Overview of Microarray. 2/71 Gene Expression Gene expression Production of mRNA is very much a reflection of the activity level of gene In the past, looking.
ANALYSIS OF GENE EXPRESSION DATA. Gene expression data is a high-throughput data type (like DNA and protein sequences) that requires bioinformatic pattern.
ICS3U_FileIO.ppt File Input/Output (I/O)‏ ICS3U_FileIO.ppt File I/O Declare a file object File myFile = new File("billy.txt"); a file object whose name.
Introduction and Applications of Microarray Databases Chen-hsiung Chan Department of Computer Science and Information Engineering National Taiwan University.
Exceptions. Exception  Abnormal event occurring during program execution  Examples Manipulate nonexistent files FileReader in = new FileReader("mumbers.txt“);
DNA Microarray Overview and Application. Table of Contents Section One : Introduction Section Two : Microarray Technique Section Three : Types of DNA.
From: Duggan et.al. Nature Genetics 21:10-14, 1999 Microarray-Based Assays (The Basics) Each feature or “spot” represents a specific expressed gene (mRNA).
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
Introduction to Oligonucleotide Microarray Technology
Chapter 9 Introduction to Arrays Fundamentals of Java.
MESA A Simple Microarray Data Management Server. General MESA is a prototype web-based database solution for the massive amounts of initial data generated.
Microarray - Leukemia vs. normal GeneChip System.
Lab 4.1 From Database to Data mining
The Basics of cDNA Microarray Technology
Microarray Technology and Applications
Introduction to cDNA Microarray Technology
The Basics of Microarray Image Processing
Presentation transcript:

2.2 - Microarray Data Modeling Lab Sébastien Lemieux Elitra Canada Ltd.

Lab 2.21 Objectives to learn the basic principles about microarray experiments. to understand the main features seen in microarray data. to prepare Java classes to interact with the Golub et al. dataset.

Lab 2.22 Overview Experimental setup. Common features of microarray data: –Data organization; –File formats and databases. Golub et al. dataset Parsing the raw data: Java classes.

Lab 2.23 Experimental setup Two types of printing: –robotically spotted - oligonucleotides or cDNA; –photolithography - oligonucleotides (Affymetrix). Spotted arrays: –two samples are labeled with different fluorescent dyes (Cy3, Cy5) and co-hybridized on the same array.

Lab 2.24 Hybridization (spotted array) Competitive hybridization of Cy5 or Cy3-labeled nucleic acids. Total amount of Cy5 and Cy3 should be matched. Normalization: physical and numerical.

Lab 2.25 Experimental setup (cont.) Scan –TIFF 16-bit ( ), one image file per dye; –Typical size > 10Mb per channel. Quantification –Gridding: identifies the location on the image of each element. Usually done in two step: grid locations followed by spot locations. –Segmentation: for each spot location, identifies which pixels are part of the spot and which are part of the background.

Lab 2.26 Microarray - the scanned image Example of a yeast spotted array. –16 x 16 pins. –15 x 24 subarrays. –duplicate spots side- by-side.

Lab 2.27 Microarray - the scanned image (cont.) Target a resolution that will result in spots of ~10 pixels of diameters. Too dark spots are unusable, but saturated spots are also unusable. The donut shape is a frequent artifact of spotted arrays. –Depending on the quantification software it may significantly reduce the quality of the data acquired.

Lab 2.28 Features of microarray data On the experimental side: –One dataset contains one or more conditions (cell types, media, time, temperature, etc.); –For each condition one or more hybridization can be obtained (replicates, fluor-reverse); Fluor-reverse hybridization is required to account for dye- specific preferential amplification of some probes. The number of replicate will determine the sensitivity of the experiment. –Each hybridization is one (affy) or two (spotted) channels; For spotted arrays, the two channels correspond to different dyes (Cy3, Cy5) and are typically obtained from two different conditions.

Lab 2.29 Features of microarray data (cont.) On the array (spotted): –Multiple grid corresponding to each spotting pin; –Each grid contains several probes (cDNA or oligo); –Each probe can be replicated (1, 2 or 3 times). Keeping track of this structure may help data analysis: –pin-specific normalization.

Lab Features of microarray data (cont.) Published information is typically a matrix presenting for each pair condition / probe the final quantification used: –log-ratio for spotted arrays; –background corrected intensity for affy. More complex to represent: –Information about the conditions; –Information about the array construction.

Lab Data structures A solution: –MIAME: Minimum information about a microarray experiment ( –MAGE-ML: microarray data exchange format adopted as a standard for gene expression by the OMG and supported by MGED ( –MGED: Microarray Gene Expression Database ( For the actual data: –Good old tab-delimited format is still widely used.

Lab Golub et al. dataset Two datasets: –initial (training, 38 samples); –independent (test, 34 samples). Intensity values have been re-scaled such that overall intensities for each chip are equivalent. –Linear regression using intensities of all genes in both the first sample and each other. –1 / slope of the linear regression becomes the re-scaling factor. –File: table_ALL_AML_rfactors.txt Two files: –table_ALL_AML_samples.txt: contains information about each sample (the condition); –data_set_ALL_AML_train.txt: contains information about each gene (probe) along with the matrix of quantified intensities.

Lab Golub et al. dataset (cont.) From the paper’s website: – Retrieve the following files: –Samples table (text); –Train dataset (text). And, take a look at the Paper (PDF).

Lab Golub et al. dataset (cont.) Three data structures: –Sample: String name; –Gene: String description; String accession; –Expression: double value; –Visualize it as a matrix where the sample information describe each row and the gene information describe each column. We will adopt a simple structure: –The sample will contain the expression data as a HashMap using the gene_id.

Lab Sample information 9 header lines. tab / space usage is erratic. We will first store the information as a single String...

Lab Expression data file 1 header line. One tab per column separation. One row per gene: –We want to have one object per sample, since we will want to operate mainly on the samples. Genes are identified by a description and an accession. We will merge the two as a String for now. And we will ignore the absent / present call.

Lab Overall architecture LoadData.java: main program to parse and display the data files. Uses a GolubParser to obtain an ArrayList of GolubSample. GolubParser.java: takes care of parsing the tab- delimited files published as part of the Golub et al. dataset. GolubSample.java: handles data for one condition. Will be extended during lab 4.1.

Lab Data members for the GolubSample java class: sampleName will contain information about the sample; expressionMap will hold a HashMap using a gene_id as the key. Final structure for data public class GolubSample { private String sampleName; private HashMap geneExpressionMap; public GolubSample(String sampleName, HashMap geneExpressionMap) { this.sampleName = sampleName; this.geneExpressionMap = new HashMap(geneExpressionMap); }... }

Lab LoadData Load the data from the published files and create the GolubSample structures. public class LoadData { public static void main (String[] args) throws Exception { GolubParser p = new GolubParser ("table_ALL_AML_samples.txt", "data_set_ALL_AML_train.txt"); ArrayList allSamples = p.getGolubSamples (); ((GolubSample)allSamples.get (0)).print (System.out); }

Lab Sample information (cont.) The data is parsed in two steps: –parseSampleFile: Load information about the samples. –parseDataFile: Load information about the genes and their expression values for the different samples. public class GolubParser { private ArrayList sampleInfo; private ArrayList geneInfo; private HashMap sampleData; public GolubParser(String sampleFileName, String dataFileName) { this.sampleInfo = new ArrayList (); this.geneInfo = new ArrayList (); this.sampleData = new HashMap(); try { this.parseSampleFile (sampleFileName); this.parseDataFile (dataFileName); } catch (FileNotFoundException e) { System.err.println ("Problem opening a file..."); }

Lab A simple tokenizer... Java 1.4 provides String.split () for that purpose... private ArrayList splitTokens (String line, String sep) { ArrayList l = new ArrayList (); int last = -1; for (int i = 0; i < line.length (); i++) { char c = line.charAt (i); boolean is_sep = false; for (int j = 0; j < sep.length (); j++) { if (c == sep.charAt (j)) { is_sep = true; break; } if (is_sep) { if (last >= 0) { l.add (line.substring (last, i)); last = -1; } } else if (last < 0) last = i; } return l; }

Lab Parsing sample info Expression data HashMap are inserted in sampleData as empty HashMap. private void parseSampleFile (String sampleFileName) throws FileNotFoundException { BufferedReader r = new BufferedReader(new FileReader (sampleFileName)); try { for (int i = 0; i < 9; i++) { String skip = r.readLine (); } while (r.ready ()) { String line = r.readLine (); ArrayList tokens = splitTokens (line, " \t"); if (tokens.size () == 0) break; String sampleName = tokens.toString (); Integer sample_id = new Integer (sampleInfo.size ()); sampleData.put (sample_id, new HashMap ()); sampleInfo.add (sampleName); } } catch (IOException e) { System.out.println ("Done."); } System.out.println ("Loaded " + sampleInfo.size () + " sample info."); }

Lab Parsing expression data file private void parseDataFile (String dataFileName) throws FileNotFoundException { int count = 0; BufferedReader r = new BufferedReader (new FileReader (dataFileName)); try { for (int i = 0; i < 1; i++) { String skip = r.readLine (); } while (r.ready ()) { String line = r.readLine (); ArrayList tokens = splitTokens (line, "\t"); if (tokens.size () == 0) continue; String geneName = tokens.subList (0, 2).toString (); Integer gene_id = new Integer (geneInfo.size ()); geneInfo.add (geneName); for (int i = 2; i < tokens.size (); i += 2) { Double exp = new Double ((String)tokens.get (i)); Integer sample_id = new Integer ((i - 2) / 2); ((HashMap)sampleData.get (sample_id)).put (gene_id, exp); count++; } } catch (IOException e) { System.out.println ("Done."); } System.out.println ("Loaded " + count + " expression values."); }

Lab Parsing expression data file (cont.) String geneName = tokens.subList (0, 2).toString (); Integer gene_id = new Integer (geneInfo.size ()); geneInfo.add (geneName); for (int i = 2; i < tokens.size (); i += 2) { Double exp = new Double ((String)tokens.get (i)); Integer sample_id = new Integer ((i - 2) / 2); ((HashMap)sampleData.get (sample_id)).put (gene_id, exp); count++; }

Lab Creating the GolubSample Extract the expression data from the parse and return an ArrayList of GolubSample. public ArrayList getGolubSamples () { ArrayList l = new ArrayList (); for (int i = 0; i < sampleInfo.size (); i++) { GolubSample sample = new GolubSample ( (String)sampleInfo.get (i), (HashMap)sampleData.get (new Integer (i))); l.add (sample); } return l; }

Lab Going further... Create GolubGene to hold information about the genes, keeping the name and accession separate. In GolubSample, keep in separate fields the type of cancer ALL or AML and the type of cell used. Download the file “Test dataset (text)” and modify the parser to load both the Test and Training dataset. Add a field in GolubSample to distinguish them. What structure would allow direct manipulation of values related to a sample or values related to a gene?

Lab References Y. F. Leung and D. Cavalieri, Fundamentals of cDNA microarray data analysis, TRENDS in Genetics, 19: T. R. Golub, et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, 286: