Tools Needed for Data Analysis Pipeline:

Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/)
R, Version 2.4.1 (http://www.R-project.org)
Affy scripts (http://www.jax.org/staff/churchill/labsite/)
SAM, Version 3.00 (http://genome-www5.stanford.edu/resources/restech.shtml)
JMP 6 (http://www.jmp.com)
Annotated Affymetrix files (http://www.affymetrix.com/support/technical/annotationfilesmain.affx)
MeV, Version 4.0.01 (http://www.tm4.org)
Gene/Marker Batch Query Prototype (http://proto.informatics.jax.org/batchwi/index.do)
Ingenuity Pathways Analysis (www.ingenuity.com)
VLAD (http://proto.informatics.jax.org/pr)
MGI (http://www.informatics.jax.org/)
PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed)

The first step is to locate the data files at GEO (http://www.ncbi.nlm.nih.gov/geo/). Enter the record number on that page.

Are CEL (or raw) data files available? If yes, continue to the next slide. If no, advance to slide # 70. GSE6077 will be used as an example.

The next screen will indicate whether CEL files are available as supplementary files.

Click on the link to view the record.

At the very bottom of the page you will see the link to download the raw data. Click on this.

You will then be prompted with a dialog box to download the files.

The files are downloaded as .gz files, which then need to be unzipped.

Select the files and then click on Extract Selected Files.

You’ll get a dialog box in which you can browse to the location you want the files saved. Ideally, you should have created a project folder ahead of time into which all data and analysis files will be kept. Then click OK.

You’ll find them where you saved them. Put them in a folder like this. Your project folder should have an informative name, if possible, so you’ll know what’s in it later on.

Open the folder and there they are, ready to be un-stuffed.

Double click on the little boxes and they open as your .CEL files. These are the files that you’ll now perform Quality Control (QC) analysis on. Move the CEL files out of the .gz folder (in this case, put them in CEL_QC_instructions).
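If you’d rather script the unzipping than click through it, here is a minimal sketch in Python (Python is not part of the original pipeline, so treat this as an optional aside). The function name gunzip_folder is made up for illustration. It assumes each compressed file is a plain .gz of a single file, and it leaves the .gz originals in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_folder(folder):
    """Decompress every .gz file in `folder`, keeping the compressed originals.

    Hypothetical helper: "sample.CEL.gz" is written out as "sample.CEL"
    next to the original file.
    """
    written = []
    for gz_path in sorted(Path(folder).glob("*.gz")):
        out_path = gz_path.with_suffix("")  # drop only the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

Calling gunzip_folder on your project folder returns the list of files it wrote, so you can check that every sample is there before starting QC.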

Move the .gz folder (which contains the compressed files) to another location, but don’t get rid of it. (I’m not sure why, but I found this was necessary in order to access the individual .CEL files.) Ignore the Affy scripts in this slide; they aren’t supposed to be there yet.

Next you’ll need three R scripts, to be run in this order: AffyQC (1st), Affyprocessing (2nd), and Affymaanova (3rd). Save them in your project folder along with R v2.4.1 (http://www.R-project.org) and the .CEL files. The scripts are listed at http://www.jax.org/staff/churchill/labsite/.

AffyQC: The AffyQC script will read in all CEL files in your working directory and perform Quality Control on them. Visualizations of the quality of the data will be created and output as JPEG files. Once the script is finished, you’ll find boxplot, histogram, and scatterplot JPEG files in your project folder as well as an RMA expression .DAT file.

Here is the affyQC script. The name of the workspace file (marked in the script) should be changed to match the title of your project folder and/or data.

Now fire up R and you’ll see a workspace console that looks like this. In the file menu, select Change Directory.

Browse to the same project folder where you have saved your .CEL files and the R scripts. Then click OK.

Under the file menu again, this time select Source R Code.

You’ll see a dialog box like this one. Only the affyQC and the affyprocessing scripts are saved as R files. The affymaanova script should be saved as a text file in Notepad. (The reason for this will become apparent later on.) Note that the item marked in this screenshot should not actually be in there.

Select affyQC and click Open.

When the QC script is done running, you’ll know because you’ll be back to the red > in the R Console.

Look in your project folder, and you’ll see these new files automatically created and saved by the QC script.

Open up the rma.expr file in Excel and you’ll see the normalized expression values for each microarray sample (columns B through E). Column A contains the Probe ID.
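If the file is too big for Excel, the same table can be read with a few lines of code. Below is a minimal Python sketch (not part of the original pipeline). It assumes the file is tab-delimited with a header row, which you should check against your own output; the function name read_expression_table and the sample names in the usage note are made up.

```python
import csv
import io


def read_expression_table(text):
    """Parse a tab-delimited expression table.

    Assumed layout: the first column is the probe ID and each remaining
    column holds the normalized values for one sample.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]
    values = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values[row[0]] = [float(x) for x in row[1:]]
    return samples, values
```

For a file on disk you would pass in open(path).read(). The result is the list of sample names plus a dictionary from probe ID to that probe’s values, in the same column order as the file.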

The Scatterplot.jpeg looks like this. In this case, the quality of the data looks pretty good. There is much less variation between biological replicates (quadrants II and IV) than between experimental conditions (quadrants I and III). For very large data sets, the scatterplot sometimes doesn’t work. In these cases the rma.dat file can be opened in JMP and a scatterplot can be made there. See slide 100.
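The same comparison can be put into numbers. Here is a minimal Python sketch (not something the affyQC script does, as far as this walkthrough shows) that computes the Pearson correlation for every pair of samples. If the data are good, you’d expect replicate pairs to correlate more tightly than pairs from different conditions. The function names are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pairwise_correlations(columns):
    """Correlation for every pair of samples.

    `columns` maps a sample name to its list of expression values,
    with all lists in the same probe order.
    """
    names = list(columns)
    result = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            result[(first, second)] = pearson(columns[first], columns[second])
    return result
```

With four samples this gives six numbers: two replicate pairs and four cross-condition pairs. It is a rough companion to the scatterplot, not a replacement for looking at it.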

Here is the histogram.jpeg

And the Boxplot.jpeg. The distribution in the four samples looks pretty consistent. No one sample looks way “out of line” with the others.
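A boxplot is just a picture of a few summary numbers, so you can also compare the samples directly. The Python sketch below (an optional aside, not part of the R scripts) computes the five numbers behind each box and reports how far apart the sample medians are. What counts as “out of line” is left to your judgment; no cutoff is implied. The function names are made up.

```python
import statistics


def five_number_summary(values):
    """Minimum, quartiles, and maximum: the numbers a boxplot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def median_spread(samples):
    """Gap between the largest and smallest sample median.

    `samples` maps a sample name to its list of expression values.
    """
    medians = [five_number_summary(v)["median"] for v in samples.values()]
    return max(medians) - min(medians)
```

A small spread relative to the width of the boxes says the same thing as the picture: the distributions line up.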

The next step is to run the affyprocessing script. What this script does is create a design file for the ANOVA analysis to follow. (It is possible to create the design file “by hand” as a text file and skip this step).

Some changes in the original script are necessary here as well (in the design matrix) so the names and numbers of samples fit the data. Note that for this example we have 4 samples – 2 Wild Type and 2 Mutant.

From the File menu, again select Source R Code.

In the dialog box that pops up, this time select the affyprocessing script and click Open

When the affyprocessing script is done, you’ll know because you are back to the red > in the console window again. Look in your project folder and you should see the design file saved there now.

Open the design file with WordPad, just to make sure it is correct. The first column (array) contains the CEL file names. The second column (strain) puts the samples into groups, in this case Wild Type vs. Mutant. The third column (sample) orders the samples. The fourth column (dye) labels all the samples with a “1”.
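Since the design file is just a small text table, it can also be built “by hand” in code. Here is a minimal Python sketch of the four-column layout described above. The tab delimiter, the lowercase column names, and the CEL file names in the usage note are assumptions, so compare the result against a design file written by the affyprocessing script before relying on it.

```python
def build_design(cel_files, strains):
    """Return the text of a four-column design table.

    Columns follow the slide: array (CEL file name), strain (group),
    sample (running number), dye (always 1).
    """
    if len(cel_files) != len(strains):
        raise ValueError("need exactly one strain label per CEL file")
    lines = ["array\tstrain\tsample\tdye"]
    for number, (cel, strain) in enumerate(zip(cel_files, strains), start=1):
        lines.append(f"{cel}\t{strain}\t{number}\t1")
    return "\n".join(lines) + "\n"
```

For the example data set you would call it with four file names (say WT_1.CEL, WT_2.CEL, MUT_1.CEL, MUT_2.CEL, which are placeholders) and the labels WT, WT, MUT, MUT, then write the returned text to a file in the project folder.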

Now you are ready to run the maanova script. This is where things get a little more complicated …

Underlined in this script are the changes necessary to fit this particular data set. Log transformation is set at False, because the affyQC script does it.

This is the design file. More underlining of what might need to be changed to fit the particular data sets. This has been changed from 500 permutations to 1,000

And this is the end of the script. The name of the workspace file is marked.

The final step is to run the affymaanova script. You’ll want to run this script line by line, just to make sure all the correct changes have been made. (It is more interactive this way) That’s why the script was saved as a text file rather than R source code.

Copy the first line of code …

And paste into the R Console.

Copy and paste the next line into the R console. (The lines preceded by a # are comment lines and they don’t get copied and pasted in)

Each time you see the red > you know that R is ready for the next command line.

Highlighted here is the really time-consuming step in the program. Because we set it to run 1,000 permutations instead of 500, it might take as long as an hour or so.
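To see why this step is slow, it helps to look at what a permutation test does: it recomputes the test statistic once per shuffle, so 1,000 permutations cost about twice as much as 500. The Python sketch below is a generic permutation test on a difference of group means. It is only an illustration of the idea and is not the F-test procedure the affymaanova script runs; the function name and the seed parameter are made up.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    size_a = len(group_a)
    observed = abs(sum(group_a) / size_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign the group labels at random
        perm_a, perm_b = pooled[:size_a], pooled[size_a:]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    # Add one to both counts so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Now picture that loop repeated for every probe set on the array and the hour-long wait makes sense.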

This is what your R Console should look like when you are all done with the affymaanova script.

Once again, you’ll see some new things automatically saved in the project folder you’ve been working in.

Here is what the Fspvalperm.jpeg looks like.

Here is the volcano plot. Looks very cool, but it isn’t necessary to spend lots of time staring at it.

Open the top.hits.results file in Excel and you’ll see something like this.

In column B are the fold change values. (In this case, MUT vs WT)
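For reference, here is one common way to get a fold change from log2-scale expression values such as RMA output. This Python sketch is an assumption about the arithmetic, not a description of how the affymaanova script fills column B, so check a few rows by hand before interpreting the sign or scale. The function names are made up.

```python
def log2_fold_change(mut_values, wt_values):
    """Mean of the MUT values minus mean of the WT values (log2 scale)."""
    mut_mean = sum(mut_values) / len(mut_values)
    wt_mean = sum(wt_values) / len(wt_values)
    return mut_mean - wt_mean


def linear_fold_change(log2_fc):
    """Convert a log2 fold change back to a plain ratio."""
    return 2.0 ** log2_fc
```

A log2 fold change of 1 means the mutant signal is twice the wild type; a value of -1 means it is half.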

Back at GEO, you can find the Affymetrix GeneChip array used on the same page where you downloaded the RAW.tar files.

Click on the link indicating the Platform.

You’ll find a link to the Affymetrix site where you can download an annotation file for the array. (Or you can go directly to http://www.affymetrix.com/support/technical/annotationfilesmain.affx.)

Choose the appropriate array set (in this case Mouse430A_2) in the CSV format.

You will first be prompted to register with Affymetrix (it’s free). Then click on the link and you’ll see a download dialog box.

As you did with the CEL files, select the annotation file and click on Extract Selected Files

In the dialog box, browse to your project folder and then click OK.

When you open the annotation file in Excel, you’ll see something like this. (it is a huge file)

JMP won’t be able to open a file this large, so in Excel, delete all columns after the Entrez Gene column.
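If Excel struggles with the file too, the same trimming can be scripted. This is a minimal Python sketch that keeps every column up to and including a named one. It assumes the first row of the CSV is the header row (strip any leading comment lines first if your file has them); the function name is made up.

```python
import csv
import io


def keep_columns_through(csv_text, last_column):
    """Drop every column to the right of `last_column`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cut = header.index(last_column) + 1  # ValueError if the column is absent
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header[:cut])
    for row in reader:
        writer.writerow(row[:cut])
    return out.getvalue()
```

Here you would call it with "Entrez Gene" as last_column and save the returned text as a new, smaller CSV for JMP.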

Next you’ll need the application JMP, Version 6. http://www.jmp.com

First, open JMP. Then under the File menu, select Open. When the dialog box appears, navigate to your project folder and the top hits gene list. (You’ll have to select All Files in the file type box)

Your Top Hits gene list should look like this in JMP.

Repeat the last step, this time opening the Affymetrix annotation file as a JMP table.

With your Top Hits list open in JMP, select Join under the Tables menu.

In the dialog box: (1) select the annotation table to be joined with the top hits table, (2) select matching columns under Matching Specification, and (3) select cloneid and Probe Set ID as the columns to be matched (be sure to click the Match button). Then (4), click OK.

Save this new JMP table with an informative name in your project folder.

Then save the annotated table as a text file (.TXT) because IPA does not accept JMP formatting.
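For anyone without JMP, the join and the text export boil down to a lookup by probe ID. The Python sketch below shows the idea. It keeps only the hits that have a matching annotation row, which is an assumption about the join you want, and the function names are made up. The column names cloneid and Probe Set ID are the ones matched in the JMP dialog.

```python
def join_on_probe(top_hits, annotation,
                  hits_key="cloneid", annot_key="Probe Set ID"):
    """Attach annotation columns to each top hit with a matching probe ID.

    Both inputs are lists of dictionaries, one dictionary per table row.
    """
    by_probe = {row[annot_key]: row for row in annotation}
    joined = []
    for hit in top_hits:
        match = by_probe.get(hit[hits_key])
        if match is None:
            continue  # no annotation for this probe: leave it out
        merged = dict(hit)
        merged.update({k: v for k, v in match.items() if k != annot_key})
        joined.append(merged)
    return joined


def to_tab_delimited(rows):
    """Render joined rows as tab-delimited text with a header line."""
    if not rows:
        return ""
    columns = list(rows[0])
    lines = ["\t".join(columns)]
    for row in rows:
        lines.append("\t".join(str(row[c]) for c in columns))
    return "\n".join(lines) + "\n"
```

Write the output of to_tab_delimited to a .txt file and you have the same kind of plain-text table the slides save from JMP.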

The annotated file is now ready to be loaded into IPA for analysis.

If CEL files are not available for download, you can perform some QC by hand, using JMP. The following applies only to data sets for which the CEL-based QC above is not possible.

We’ll use the same GSE6077 files from GEO: http://www.ncbi.nlm.nih.gov/geo/

Click on the GSE6077 record link.

This time at the bottom of the following page, select the SOFT formatted family file(s) rather than the RAW.tar files.

A dialog box prompts you to save the .gz files.

When you double click on the compressed file, you get an Excel file with the same name (in this case GSE6077_family).

It is a really huge file, too big to open in Excel, so open it first with Notepad. The file has the four samples one below the other, rather than side by side in columns. It is necessary to break the samples into separate files in Notepad: scroll down to find the first sample (GSM140827) and copy and paste it into its own Notepad file.

Do this again with the next sample (GSM140863), each time copying and pasting into its own Notepad file.

Repeat this with all four samples until each is its own text file: GSM140827, GSM140863, GSM140864, and GSM140865.
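Copying and pasting works for four samples but gets tedious for forty. In a SOFT family file each sample section starts with a line beginning ^SAMPLE, so the splitting can be scripted. Here is a minimal Python sketch (an optional aside; the function name is made up) that returns one block of text per sample.

```python
def split_soft_samples(text):
    """Split SOFT family-file text into one text block per sample.

    A line starting with "^SAMPLE" opens a sample section; any other
    line starting with "^" (a series or platform section) closes it.
    """
    samples = {}
    current = None
    for line in text.splitlines():
        if line.startswith("^SAMPLE"):
            current = line.split("=", 1)[1].strip()  # the GSM accession
            samples[current] = []
        elif line.startswith("^"):
            current = None
        if current is not None:
            samples[current].append(line)
    return {name: "\n".join(lines) + "\n" for name, lines in samples.items()}
```

Each value in the returned dictionary can be written to its own text file, named after the accession, which is the same result as the Notepad steps.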

Next you’ll need the application JMP, Version 6. http://www.jmp.com

First, open JMP. Then under the File menu, select Open. When the dialog box appears, navigate to your project folder, select the text file for the first sample (you’ll have to select All Files in the file type box), and click Open.

The first 37 rows are annotation notes. Delete these.

Now you’ve got a file with just the ID_Ref, expression value, ABS_call, and P-value. Repeat this same step with the other 3 samples.
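Counting 37 rows by eye is easy to get wrong, and the count may differ for other records. In SOFT files the data table sits between the marker lines !sample_table_begin and !sample_table_end, so a script can find it without counting. Below is a minimal Python sketch of that approach (the function name is made up; it assumes the table is tab-delimited with its header as the first row after the begin marker).

```python
def read_sample_table(sample_text):
    """Return (header, rows) for the data table inside one sample section."""
    header = None
    rows = []
    inside = False
    for line in sample_text.splitlines():
        if line.startswith("!sample_table_begin"):
            inside = True
        elif line.startswith("!sample_table_end"):
            break
        elif inside:
            fields = line.split("\t")
            if header is None:
                header = fields  # first line after the begin marker
            else:
                rows.append(dict(zip(header, fields)))
    return header, rows
```

Values come back as text; convert the expression column with float() before doing any arithmetic on it.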

When finished, you should have four separate JMP tables, one for each sample. You’ll now join these together into one big table, matching ID_Ref columns.

You’ll need all four JMP tables open at once for this next step.

With Sample_1 open, select Join from the Tables menu.

In the dialog box: (1) select Sample_2 to be joined with Sample_1, (2) select Matching Columns as the Matching Specification, (3) select ID_Ref as the columns to be matched and click on Match. Then (4), click OK.

You’ll be doing this step two more times, each time creating a new table. So give each one a name that is meaningful. Otherwise, they’ll all be called “Untitled”.

Notice in this dialog box that I’m joining Sample1_Sample2 to Sample 3. Matching columns will again be the ID_Ref columns.
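Joining the tables two at a time is really one operation: line up all the samples by ID_Ref. Here is a minimal Python sketch that does it in one pass and labels each value with its sample name, so nothing ends up “Untitled”. It keeps only the probes present in every sample, which is an assumption; the function name is made up.

```python
def join_samples(tables):
    """Combine per-sample tables into one table keyed by probe ID.

    `tables` maps a sample name to a dictionary of {ID_Ref: value}.
    Only probes found in every sample are kept.
    """
    names = list(tables)
    shared = set(tables[names[0]])
    for name in names[1:]:
        shared &= set(tables[name])
    return {
        probe: {name: tables[name][probe] for name in names}
        for probe in sorted(shared)
    }
```

Because each value is stored under its sample name, there is no later question of which column belongs to which sample.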

As you subsequently join the remaining tables together, it may be necessary to rename the columns to keep track of which sample is which. Notice that in the table below, Sample 4 isn’t clearly labelled.

To change the column heading, select the column in the table. Then under the Cols menu, select Column info…

You’ll get a dialog box where you can change the name of the column.

Now that all four samples have been joined together in one table, it’s time to do some log-transformation. In this case, the broad range of values in the “VALUE of” column tells me these values have not yet been log transformed.

JMP’s Log function does a natural log (base e) transformation. It is first necessary to add a new column. JMP will add the column to the end of the table, but it can be moved later. Under the Cols menu, select New Column…

In the dialog box that appears, give the new column a meaningful name (like Sample1_transformed). Then click on Column Properties and select “Formula”.

Under Functions (grouped), select Transcendental and then Log from the pop-up menus.

Click on VALUE of sample_1 and then the Apply button.

To the far right in your table, you see the new column of log-transformed values for Sample 1.

Repeat these last steps again for Sample 2, Sample 3 and Sample 4. The end of your table should now look like this: (It is okay to leave all the log-transformed columns lumped together at the end)
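The transformation itself is one line of arithmetic per value. The Python sketch below mirrors the natural log used in JMP and also shows the conversion to log2, the scale RMA values are on; the two differ only by a constant factor, so the shape of a scatterplot is the same either way. Returning None for zero or negative values is an assumption about how you want such cells handled, and the function names are made up.

```python
import math


def natural_log_column(values):
    """Natural log of each value; None where the log is undefined."""
    return [math.log(v) if v > 0 else None for v in values]


def to_log2(natural_logs):
    """Rescale natural-log values to log base 2."""
    return [None if v is None else v / math.log(2) for v in natural_logs]
```

Run natural_log_column once per sample column, just as the formula is applied once per column in JMP.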

Now that you have log-transformed values for the four samples, you can create a scatter plot and box plots to examine the quality of this data. Under the Analyze menu, select Multivariate Methods and then Multivariate from the pop-up menus.

In the dialog box that appears, (1) select the four columns with log-transformed data and (2) click the Y Columns button. Then (3) click OK.

This doesn’t look so good. Samples 1 and 2 are the Wild Type samples and Samples 3 and 4 are the Mutant. This scatterplot shows a fair amount of variation between biological replicates. Compare this to the scatterplot on slide # 28

JMP will also produce graphical representations of the distribution of values in each sample. Again under the Analyze menu, this time select Distribution.

In the dialog box: (1) select the 4 columns of log-transformed values, (2) click on the Y Columns button, and then (3) click OK.

This plot gives you visuals of the distribution of values in the samples as well as some statistical information. Compare these to the histogram and box plots on slides # 29 & 30.
