Using Unix Shell Scripts to Manage Large Data

2 What is Unix shell script?
A collection of Unix commands can be stored in a file, and a shell such as csh or bash can be invoked to execute the commands in that file. Like other programming languages, a shell has variables and flow-control statements, e.g., if-then-else, while, for, and (in csh) goto. You can start any shell simply by typing its name.
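As a rough illustration of variables and flow control, here is a minimal bash sketch. The variable names (`threshold`, `count`, `countdown`) are invented for this example; csh uses different syntax for the same ideas (`foreach ... end`, `@` for arithmetic).

```shell
#!/bin/bash
# Illustrative only: all names here are made up for this sketch.
threshold=3          # a shell variable (no spaces around "=")
count=0

# for loop over a fixed list of words
for n in 1 2 3 4 5; do
    # if-then with an integer comparison
    if [ "$n" -gt "$threshold" ]; then
        count=$((count + 1))
    fi
done
echo "values above $threshold: $count"

# while loop counting down from 3
i=3
countdown=""
while [ "$i" -gt 0 ]; do
    countdown="$countdown$i "
    i=$((i - 1))
done
echo "countdown: $countdown"
```

Saved to a file (say `demo.sh`, a hypothetical name), this can be run with `bash demo.sh`.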

3 Useful Unix commands
grep: searches files for a regular expression and prints every line that matches it
cut: selects fields or characters from each line of a file
head/tail: print the first/last n lines of a file
wc: counts the characters, words, and lines in a file
split: reads a file and writes it in n-line pieces to a set of output files
cat/paste: join files by rows (cat) or by columns (paste)
join: merges two files on a common field
awk: a POWERFUL pattern scanning and processing language
Use "man command_name" to see the help file for a command.
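To make these commands concrete, the following sketch runs several of them on two tiny tab-delimited files. The files and their contents are made up for the demonstration and are written to a scratch directory.

```shell
#!/bin/bash
# Build two tiny tab-delimited files (made-up data, no header line).
tmp=$(mktemp -d)
printf 'S1\t34\tF\nS2\t51\tM\nS3\t47\tF\n' > "$tmp/pheno.txt"   # id, age, sex
printf 'S1\t0.12\nS2\t0.80\nS3\t0.45\n'    > "$tmp/beta.txt"    # id, beta value

head -n 1 "$tmp/pheno.txt"                          # first line only
n_lines=$(( $(wc -l < "$tmp/pheno.txt") ))          # number of lines
females=$(grep -c 'F$' "$tmp/pheno.txt")            # count lines ending in F
ages=$(cut -f 2 "$tmp/pheno.txt" | paste -sd, -)    # column 2, glued into one line

# join needs both inputs sorted on the key field; these files already are.
# awk then keeps samples older than 40 and prints the id and the beta value.
merged=$(join -t $'\t' "$tmp/pheno.txt" "$tmp/beta.txt" |
         awk -F'\t' '$2 > 40 {print $1, $4}')

echo "$n_lines lines, $females female, ages $ages"
echo "$merged"
```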

4 Motivating example
Genome-wide DNA methylation data
~3000 samples (rows)
~485,000 sites (columns)
Data came in batches (~300 samples per file, ~1 GB each)
For our analysis, we would like to pool all samples together but split the data into ~50,000 sites per file.
Loading into R? That takes ~14 GB of memory, and R takes hours to read each file (the data.table package is recommended).
Using csh scripts, the whole job takes only ~10 minutes.

5 csh script: pool samples
    #!/bin/csh
    cd /dir
    rm -f cpg.txt
    cp -f All_Beta_Values1.txt cpg.txt
    foreach m (`seq 2 9`)
        # count number of samples
        @ l = `wc -l All_Beta_Values${m}.txt | cut -f 1 -d " "` - 1
        echo "file = ${m}, nrow = $l"
        rm -f test.txt
        # remove the header
        tail -n $l All_Beta_Values${m}.txt > test.txt
        cat test.txt >> cpg.txt
    end
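The same pooling idea can be sketched in bash. This version is an alternative to the csh script, not a transcription of it: it skips each header with `tail -n +2` instead of counting lines first. The batch file names and contents are made up so the example runs on its own.

```shell
#!/bin/bash
# Sketch: pool batch files that all share the same header line.
# File names and contents are invented for the demonstration.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'id\tcg1\tcg2\nS1\t0.1\t0.2\nS2\t0.3\t0.4\n' > batch1.txt
printf 'id\tcg1\tcg2\nS3\t0.5\t0.6\n'               > batch2.txt
printf 'id\tcg1\tcg2\nS4\t0.7\t0.8\nS5\t0.9\t1.0\n' > batch3.txt

cp -f batch1.txt pooled.txt            # the first batch keeps its header
for m in 2 3; do
    # "tail -n +2" prints from line 2 onward, i.e. everything but the header
    tail -n +2 "batch${m}.txt" >> pooled.txt
done

pooled_rows=$(( $(wc -l < pooled.txt) ))
echo "pooled.txt has $pooled_rows lines (1 header + 5 samples)"
```

Reading the count with `wc -l < file` avoids parsing the file name out of the `wc` output, which can matter on systems where `wc` pads its output with leading spaces.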

6 csh script: split by sites
    #!/bin/csh
    cd /dir
    foreach n (`seq 1 9`)
        rm -f beta2950_${n}of10.txt
        # l = ($n - 1) *
        # r = $n *
        zcat cpg.txt.gz | cut -f 1,$l-$r > beta2950_${n}of10.txt
    end
    zcat cpg.txt.gz | cut -f 1, > beta2950_10of10.txt
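A self-contained bash sketch of the column-splitting step is shown below. It assumes column 1 holds the sample ID and should be repeated in every output file. The chunk size (`chunk=2`), the file names, and the toy data are hypothetical; the values used on the slide are not recoverable from the text.

```shell
#!/bin/bash
# Sketch: split a wide tab-delimited file into column chunks, repeating column 1 (the ID).
# "chunk" and all file names are hypothetical choices for this demonstration.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'id\tc1\tc2\tc3\tc4\tc5\nS1\t1\t2\t3\t4\t5\nS2\t6\t7\t8\t9\t10\n' | gzip > wide.txt.gz

chunk=2                                        # data columns per output file
# total number of columns, ID included, read from the header line
ncol=$(gzip -dc wide.txt.gz | head -n 1 | awk -F'\t' '{print NF}')
nfiles=$(( (ncol - 1 + chunk - 1) / chunk ))   # ceiling of (ncol - 1) / chunk

for n in $(seq 1 "$nfiles"); do
    l=$(( (n - 1) * chunk + 2 ))   # first data column of this chunk (column 1 is the ID)
    r=$(( n * chunk + 1 ))         # last data column of this chunk
    # cut ignores column numbers past the end of the line, so the last chunk may be short
    gzip -dc wide.txt.gz | cut -f "1,${l}-${r}" > "part_${n}of${nfiles}.txt"
done
head -n 1 part_*of*.txt
```

`gzip -dc` is used here in place of `zcat` only because it behaves the same way on more systems; on Linux the two are interchangeable for `.gz` files.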

7 Some tips
To check whether a data file contains a header, and whether it is tab- or comma-delimited:
> head -n 1 filename
To check a selected variable/column (e.g., to see how missing values were coded):
> head -n 10 filename | cut -f #,#
To get a subset of samples by matching ID:
> grep -f ID.txt filename
To find a certain column:
> zcat filename.txt.gz | head -n 1 | awk '/variable_name/{for(i=1;i<=NF;++i)if($i~/variable_name/)print NR,i,$i}'
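The last two tips can be combined: find a column's position from the header, then pass that position to `cut`. The sketch below uses an exact string match instead of the regular-expression match in the tip, so that a name like `cg2` does not also match `cg21`. The file and column names are made up.

```shell
#!/bin/bash
# Sketch: locate a column by its exact header name, then extract it with cut.
# The file and the column name "cg2" are invented for this example.
tmp=$(mktemp -d)
printf 'id\tcg1\tcg2\tcg21\nS1\t0.1\tNA\t0.3\nS2\t0.4\t0.5\t-9\n' > "$tmp/data.txt"

# Exact match ($i == name) avoids also hitting "cg21", which a regex match would.
col=$(head -n 1 "$tmp/data.txt" |
      awk -F'\t' -v name="cg2" '{for (i = 1; i <= NF; ++i) if ($i == name) print i}')
echo "cg2 is column $col"

# Print the ID and that column to see how missing values are coded
values=$(cut -f "1,${col}" "$tmp/data.txt")
echo "$values"
```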

8 Using scripts to generate scripts
    #!/bin/bash -l
    #PBS -l walltime=16:00:00,pmem=2800mb,nodes=13:ppn=8
    #PBS -m abe
    proc=0
    for i in `seq 0 12`
    do
        for j in `seq 1 8`
        do
            job=$(($i*8+$j-1))
            scripts=/path
            echo "#!/bin/bash -l" >$scripts/sim$
            echo "cd $scripts">>$scripts/sim$
            echo "module load R" >>$scripts/sim$
            echo "R CMD BATCH --no-save --no-restore '--args job=$job' /path/assoc.R /path/log/sim$job.txt" >> $scripts/sim$
            chmod 770 $scripts/sim$
            pbsdsh -n $proc $scripts/sim$ &
            proc=$(($proc+1))
        done
    done
    wait
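The generate-then-launch pattern can be tried without a scheduler. In the sketch below, each generated script only writes its job index to a log file instead of calling R, and the scripts are started as ordinary background processes, not through `pbsdsh`. The directory `gen_dir`, the `sim_<job>.sh` naming, and the loop sizes are assumptions made for the example.

```shell
#!/bin/bash
# Sketch: write one small job script per job index, then run them all in the background.
# No scheduler is involved; "gen_dir" and the "sim_<job>.sh" names are hypothetical.
gen_dir=$(mktemp -d)

for i in $(seq 0 1); do
    for j in $(seq 1 3); do
        job=$(( i * 3 + j - 1 ))        # unique index 0..5 (3 jobs per outer step)
        script="$gen_dir/sim_${job}.sh"
        # Group the echo commands so the whole script is written with one redirection
        {
            echo '#!/bin/bash'
            echo "cd $gen_dir"
            echo "echo \"job=$job\" > $gen_dir/sim_${job}.log"
        } > "$script"
        chmod 770 "$script"
        "$script" &                     # launch in the background
    done
done
wait                                    # block until every background job has finished

n_logs=$(( $(ls "$gen_dir"/sim_*.log | wc -l) ))
echo "finished $n_logs jobs"
```

Giving every generated script a name that includes the job index keeps the jobs from overwriting each other's files, and the final `wait` keeps the parent script alive until all of them are done.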

