1
Using Unix Shell Scripts to Manage Large Data
2
What is a Unix shell script?
A collection of Unix commands may be stored in a file, and csh/bash can be invoked to execute the commands in that file. Like other programming languages, a shell has variables and flow control statements, e.g., if-then-else, while, for (foreach in csh), and goto (csh only). You can start any shell simply by typing its name.
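As a minimal illustration of the variables and flow control mentioned above, here is a short sketch written for bash (choosing bash over csh is an assumption made for this example, and all names and values are made up).

```shell
#!/bin/bash
# Hypothetical example: names and values are for illustration only.
n_files=3                           # a variable

if [ "$n_files" -gt 2 ]; then       # if-then-else
    size="many"
else
    size="few"
fi
echo "size = $size"

total=0
for m in 1 2 3; do                  # for loop over a fixed list
    total=$((total + m))
done
echo "total = $total"

i=0
while [ "$i" -lt "$n_files" ]; do   # while loop with a counter
    i=$((i + 1))
done
echo "i = $i"
```

Saved as a file (say, a hypothetical demo.sh), this can be run with `bash demo.sh`.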
3
Useful Unix commands
grep: searches files for a regular expression and prints all lines that contain a match
cut: selects fields or characters from each line of a file
head/tail: print the first/last n lines of a file
wc: counts the characters/words/lines of a file
split: reads a file and writes it in n-line pieces to a set of output files
cat/paste: join files by rows (cat) or by columns (paste)
join: merges two files on a common field
awk: a POWERFUL pattern scanning and processing language
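To make the list above concrete, the sketch below runs each command once on two tiny tab-delimited files. The file names and contents are invented for illustration. Note that join expects both inputs to be sorted on the join field, so the header line is stripped first.

```shell
#!/bin/bash
# Toy data: file names and contents are made up for illustration.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'id\tx\ty\nA\t1\t2\nB\t3\t4\nC\t5\t6\n' > a.txt
printf 'id\tz\nA\t7\nB\t8\nC\t9\n' > b.txt
tab=$(printf '\t')

grep_out=$(grep '^B' a.txt)                  # lines that start with B
cut_out=$(cut -f 1,3 a.txt | head -n 2)      # columns 1 and 3, first 2 lines
tail_out=$(tail -n 1 a.txt)                  # last line
wc_out=$(wc -l < a.txt)                      # number of lines
paste_out=$(paste a.txt b.txt | head -n 1)   # files side by side (by columns)
split -l 2 a.txt part_                       # 2-line pieces: part_aa, part_ab

# join needs input sorted on the join field, so drop the headers first.
tail -n +2 a.txt > a_body.txt
tail -n +2 b.txt > b_body.txt
join_out=$(join -t "$tab" a_body.txt b_body.txt | tail -n 1)

# awk: sum the second column, skipping the header line.
awk_out=$(awk -F '\t' 'NR > 1 { s += $2 } END { print s }' a.txt)

echo "grep:  $grep_out"
echo "cut:   $cut_out"
echo "tail:  $tail_out"
echo "wc:    $wc_out"
echo "paste: $paste_out"
echo "join:  $join_out"
echo "awk:   $awk_out"
```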
4
Motivating example
Genome-wide DNA methylation data
  ~3000 samples (rows)
  ~485,000 sites (columns)
  Data came in batches (~300 samples per file, ~1 GB each)
For our analysis, we would like to:
  Pool all samples together, but split into ~50,000 sites per file
Load into R? That would take ~14 GB of memory, and R takes hours to read each file
Using csh scripts, it takes only ~10 minutes
5
csh script: pool samples
#!/bin/csh
cd /dir
rm -f cpg.txt
cp -f All_Beta_Values1.txt cpg.txt
foreach m (`seq 2 9`)
    # count number of samples
    @ l = `wc -l All_Beta_Values${m}.txt | cut -f 1 -d " "` - 1
    echo "file = ${m}, nrow = $l"
    rm -f test.txt
    # remove the header
    tail -n $l All_Beta_Values${m}.txt > test.txt
    cat test.txt >> cpg.txt
end
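The same pooling step can also be sketched in bash. This is an alternative under stated assumptions, not the script from the slide: it uses `tail -n +2`, which prints from the second line onward, so the header is dropped without counting rows first. The toy files stand in for the real All_Beta_Values batch files, and the sketch assumes every batch has the same header and column order.

```shell
#!/bin/bash
tmp=$(mktemp -d)
cd "$tmp" || exit 1
# Toy batch files (hypothetical content): one header line plus sample rows.
printf 'id\tcg1\tcg2\nS1\t0.1\t0.2\nS2\t0.3\t0.4\n' > All_Beta_Values1.txt
printf 'id\tcg1\tcg2\nS3\t0.5\t0.6\n'               > All_Beta_Values2.txt
printf 'id\tcg1\tcg2\nS4\t0.7\t0.8\nS5\t0.9\t1.0\n' > All_Beta_Values3.txt

# Keep the first file whole so the pooled file has exactly one header.
cp -f All_Beta_Values1.txt cpg.txt

for m in 2 3; do
    # tail -n +2 starts output at line 2, i.e. it skips the header.
    tail -n +2 "All_Beta_Values${m}.txt" >> cpg.txt
done

n_rows=$(wc -l < cpg.txt)             # total lines, header included
n_headers=$(grep -c '^id' cpg.txt)    # should be exactly 1
echo "pooled lines = $n_rows, header lines = $n_headers"
```

Checking that the pooled file contains a single header line is a cheap sanity test before moving on to the split step.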
6
csh script: split by sites
#!/bin/csh
cd /dir
foreach n (`seq 1 9`)
    rm -f beta2950_${n}of10.txt
    # l = ($n - 1) *
    # r = $n *
    zcat cpg.txt.gz | cut -f 1,$l-$r > beta2950_${n}of10.txt
end
zcat cpg.txt.gz | cut -f 1, > beta2950_10of10.txt
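The column boundaries in the script above did not survive in this transcript, so the following bash sketch only illustrates the general idea of cutting a wide file into blocks of columns while keeping the ID column in every piece. The chunk size, the output file names, and the use of an uncompressed toy file (instead of `zcat cpg.txt.gz`) are all assumptions made for the example.

```shell
#!/bin/bash
tmp=$(mktemp -d)
cd "$tmp" || exit 1
# Toy pooled file (hypothetical): column 1 is the sample ID, columns 2-8 are sites.
printf 'id\tc1\tc2\tc3\tc4\tc5\tc6\tc7\n'  > cpg.txt
printf 'S1\t1\t2\t3\t4\t5\t6\t7\n'        >> cpg.txt
printf 'S2\t8\t9\t10\t11\t12\t13\t14\n'   >> cpg.txt

chunk=3    # assumed number of sites per output file (illustrative value)
ncol=$(head -n 1 cpg.txt | awk -F '\t' '{ print NF }')   # columns incl. the ID
nsite=$((ncol - 1))
nfile=$(( (nsite + chunk - 1) / chunk ))                 # ceiling division

n=1
while [ "$n" -le "$nfile" ]; do
    l=$(( (n - 1) * chunk + 2 ))        # first site column of chunk n
    r=$(( n * chunk + 1 ))              # last site column of chunk n
    [ "$r" -gt "$ncol" ] && r=$ncol     # the last chunk may be shorter
    # Always keep column 1 (the ID) next to the selected block of sites.
    cut -f "1,${l}-${r}" cpg.txt > "part_${n}of${nfile}.txt"
    n=$((n + 1))
done
ls part_*
```

Computing the number of pieces from the header avoids hard-coding a separate command for the last, shorter block.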
7
Some tips
To check whether a data file contains a header, and whether it is tab- or comma-delimited:
> head -n 1 filename
To check a selected variable/column (e.g., to see how missing values were coded):
> head -n 10 filename | cut -f #,#
To get a subset of samples by matching ID:
> grep -f ID.txt filename
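The three tips can be tried on a toy file, as sketched below. The data, the NA coding, and the ID list are invented for illustration. One addition that is not in the slide: `grep -f` matches each pattern anywhere in a line, so an ID such as S1 would also match S10. Adding `-w` (match whole words only) is one way to reduce that risk.

```shell
#!/bin/bash
tmp=$(mktemp -d)
cd "$tmp" || exit 1
# Toy data (hypothetical): tab-delimited, with a header, missing values coded as NA.
printf 'id\tcg1\tcg2\nS1\t0.1\tNA\nS2\t0.3\t0.4\nS3\tNA\t0.6\nS10\t0.2\t0.5\n' > filename
printf 'S1\nS3\n' > ID.txt

header=$(head -n 1 filename)              # is there a header? which delimiter?
col3=$(head -n 10 filename | cut -f 3)    # peek at one column to spot missing codes
subset=$(grep -w -f ID.txt filename)      # rows whose ID matches as a whole word
n_subset=$(printf '%s\n' "$subset" | wc -l)

echo "header: $header"
echo "column 3:"
echo "$col3"
echo "matched rows: $n_subset"
echo "$subset"
```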