Using GC content to distinguish Phytophthora sequences from tomato sequences
Mission #1
Calculate the GC content of each sequence in the Phytophthora-tomato interactome.
We will use a Perl script to accomplish the mission.
Preparation
Download the Perl script (gc.pl) from the class web site and store it in the C:/BioDownload folder.
Running the script
Open cygwin, or command prompt (Vista users), or terminal (Mac users).
Change directory (cd) to the BioDownload folder.
perl gc.pl PhytophSeq1.txt phytoph_gc.out
Results
In cygwin (Windows users) or terminal (Mac users):
grep --perl-regexp "\t" -c phytoph_gc.out
grep ">" -c PhytophSeq1.txt
You should get the same number from the two commands.
The number should be
[Figure: the output file, showing the name column and the GC content column]
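To make the content of the output file concrete, here is a minimal sketch of the calculation in Python. It is not the class's gc.pl, whose source is not shown here. It assumes the input is in FASTA format (names on lines starting with ">"), that GC content is reported as a percentage of all bases in the sequence, and that values are printed with one decimal place. The function names gc_content and fasta_gc_lines and the example records are hypothetical.

```python
def gc_content(seq):
    """Return the GC content of a sequence as a percentage (0-100)."""
    seq = seq.upper()
    if not seq:
        return 0.0  # assumption: an empty sequence is reported as 0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def fasta_gc_lines(fasta_text):
    """Yield one 'name<TAB>GC content' line per FASTA record."""
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if name is not None:
                yield f"{name}\t{gc_content(''.join(chunks)):.1f}"
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        yield f"{name}\t{gc_content(''.join(chunks)):.1f}"


# Hypothetical two-record input, not taken from PhytophSeq1.txt
example = ">seq1\nGGCC\nAATT\n>seq2\nGCGC\n"
for out_line in fasta_gc_lines(example):
    print(out_line)
```

Each record produces exactly one tab-separated line, which is why the number of lines containing a tab in the output should match the number of ">" lines in the input.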
Mission #2
Build a histogram of the values of GC content.
We will use the R program to accomplish this mission.
[Screenshots for Mac users, all Windows users, XP users, and Vista users]
getwd() to know which folder you are in now
setwd("c:/BioDownload") to change the working directory to C:/BioDownload
setwd("/path/to/biodownload") for Mac users
data<-read.table("phytoph_gc.out",sep="\t",header=FALSE) to read in the data in the file phytoph_gc.out (your file name may be different)
data[1:10,] to see the first 10 rows of the data frame "data"
gc<-data[,2] to assign the values from the 2nd column of "data" to a new vector "gc"
summary(gc) to get the summary of the values in the vector “gc”
hist(gc,breaks=58) to draw a histogram of the values in the "gc" vector
Breaks indicates how many cells you want for the histogram. It was calculated as the maximum GC value (78.7) minus the minimum GC value, so each bin of the histogram is ~1 GC value wide.
hist(gc,breaks=58,xlab="GC content",ylim=range(c(0,400)),main="Histogram of GC content of sequences\nin Phytophthora-tomato interactome") to make the histogram look better
pdf("gc_histogram.pdf")
hist(gc,breaks=58,xlab="GC content",ylim=range(c(0,400)),main="Histogram of GC content of sequences\nin Phytophthora-tomato interactome")
dev.off()
to output the histogram to a PDF file
[Screenshot: the location of the output file]
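The breaks value used in the hist() calls above comes from simple arithmetic: the range of the GC values divided by the bin width you want. The Python sketch below illustrates that arithmetic only. The function name bin_count and the sample values are hypothetical, and rounding up to a whole number of bins is an assumption made here. Note also that R's hist() treats a single number given as breaks as a suggestion and may choose a nearby "pretty" number of cells.

```python
import math


def bin_count(values, bin_width=1.0):
    """Number of bins of the given width needed to span min(values)..max(values)."""
    span = max(values) - min(values)
    # Round up so the last partial bin is still counted; use at least one bin
    return max(1, math.ceil(span / bin_width))


# Hypothetical GC percentages, chosen only to illustrate the arithmetic
gc = [35.2, 41.0, 52.8, 60.5, 47.3]
print(bin_count(gc))       # span is 60.5 - 35.2 = 25.3, so 26 bins of width 1
print(bin_count(gc, 5.0))  # 25.3 / 5 = 5.06, so 6 bins of width 5
```

With your own data, the two inputs to this calculation are the Min. and Max. values reported by summary(gc).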