Using Unix Shell Scripts to Manage Large Data

What is a Unix shell script?

A collection of Unix commands may be stored in a file, and csh or bash can be invoked to execute the commands in that file.
Like other programming languages, a shell has variables and flow control statements, e.g., if-then-else, while, for, and (in csh) goto.
You can run any shell simply by typing its name.
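As a minimal sketch of the variables and flow control mentioned above, the following script counts even numbers with a for loop and an if-then-else, then counts down with a while loop. The variable names and the task are hypothetical, and the example is written for bash rather than csh.

```shell
#!/bin/bash
# Hypothetical example: variables plus for / if-then-else / while.
count=0                              # a variable
for n in 1 2 3 4 5 6; do             # for loop over a fixed list
    if [ $((n % 2)) -eq 0 ]; then    # if-then-else
        count=$((count + 1))
    else
        :                            # odd number: do nothing
    fi
done

i=3
while [ "$i" -gt 0 ]; do             # while loop
    i=$((i - 1))
done

echo "even numbers: $count, i after loop: $i"
```

Saving this as a file and running `bash filename` executes the commands in order, just as if they had been typed at the prompt.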

Useful Unix commands

grep: searches files for a regular expression and prints every line that matches
cut: selects fields or characters from each line of a file
head/tail: print the first/last n lines of a file
wc: counts the characters, words, and lines in a file
split: reads a file and writes it in n-line pieces to a set of output files
cat/paste: join files by rows (cat) or by columns (paste)
join: merges two files on a common field
awk: a POWERFUL pattern scanning and processing language

Use "man command_name" to see the help file.
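To make the command list concrete, here is a small sketch that runs several of these tools on toy data. The files `pheno.txt` and `beta.txt` and their contents are invented for illustration; note that `join` expects both inputs to be sorted on the join field.

```shell
#!/bin/bash
# Toy data (hypothetical): a tab-delimited table with a header, plus a second table.
tmp=$(mktemp -d)
printf 'id\tsex\tage\ns1\tF\t34\ns2\tM\t51\ns3\tF\t47\n' > "$tmp/pheno.txt"
printf 's1\t0.12\ns2\t0.80\ns3\t0.45\n' > "$tmp/beta.txt"

n_lines=$(wc -l < "$tmp/pheno.txt")                     # wc: count lines
header=$(head -n 1 "$tmp/pheno.txt")                    # head: first line only
n_female=$(cut -f 2 "$tmp/pheno.txt" | grep -c '^F$')   # cut + grep: count F in column 2

# tail: drop the header so the rows line up with beta.txt
tail -n +2 "$tmp/pheno.txt" > "$tmp/pheno_body.txt"

tab=$(printf '\t')
# join: merge on the first field (both files are already sorted by id)
join -t "$tab" "$tmp/pheno_body.txt" "$tmp/beta.txt" > "$tmp/joined.txt"
# paste: glue the files side by side, line by line, with no key matching
paste "$tmp/pheno_body.txt" "$tmp/beta.txt" > "$tmp/pasted.txt"

echo "lines: $n_lines, females: $n_female"
cat "$tmp/joined.txt"
```

The difference between the last two commands matters in practice: `paste` trusts that the rows are in the same order, while `join` matches them by ID.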

Motivating example

Genome-wide DNA methylation data:
    ~3000 samples (rows)
    ~485,000 sites (columns)
    Data came in batches (~300 samples per file, ~1 GB each)
For our analysis, we would like to pool all samples together but split them into ~50,000 sites per file.
Load into R? That takes ~14 GB of memory, and R takes hours to read each file (the data.table package is recommended).
Using csh scripts, it takes only ~10 minutes.

csh script: pool samples

    #!/bin/csh
    cd /dir
    rm -f cpg.txt
    # start the pooled file with the first batch, header included
    cp -f All_Beta_Values1.txt cpg.txt
    foreach m (`seq 2 9`)
        # count number of samples (total lines minus the header line)
        @ l = `wc -l < All_Beta_Values${m}.txt` - 1
        echo "file = ${m}, nrow = $l"
        rm -f test.txt
        # remove the header
        tail -n $l All_Beta_Values${m}.txt > test.txt
        cat test.txt >> cpg.txt
    end
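The same pooling idea can be sketched in bash on toy files. This is not the script from the slides: the file names `batch1.txt` to `batch3.txt` and `pooled.txt` are hypothetical, and it skips each header with `tail -n +2` (start at line 2) instead of counting lines first.

```shell
#!/bin/bash
# Toy stand-ins (hypothetical) for the batch files; each has a header row.
work=$(mktemp -d)
printf 'id\tcg1\tcg2\ns1\t0.1\t0.2\ns2\t0.3\t0.4\n' > "$work/batch1.txt"
printf 'id\tcg1\tcg2\ns3\t0.5\t0.6\n'               > "$work/batch2.txt"
printf 'id\tcg1\tcg2\ns4\t0.7\t0.8\ns5\t0.9\t1.0\n' > "$work/batch3.txt"

# Keep the first file whole, header included.
cp -f "$work/batch1.txt" "$work/pooled.txt"

# Append the remaining files, starting at line 2 to skip each header.
for m in 2 3; do
    tail -n +2 "$work/batch$m.txt" >> "$work/pooled.txt"
done

pooled_rows=$(wc -l < "$work/pooled.txt")
header_count=$(grep -c '^id' "$work/pooled.txt")
echo "pooled rows (incl. header): $pooled_rows, header lines: $header_count"
```

Checking that exactly one header line survives is a cheap sanity test before moving on to the multi-gigabyte files.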

csh script: split by sites

    #!/bin/csh
    cd /dir
    foreach n (`seq 1 9`)
        rm -f beta2950_${n}of10.txt
        # start (column 1 is kept in every file, so site columns begin at 2)
        @ l = ($n - 1) * 50000 + 2
        # end
        @ r = $n * 50000 + 1
        zcat cpg.txt.gz | cut -f 1,${l}-${r} > beta2950_${n}of10.txt
    end
    # the last file takes all remaining columns
    zcat cpg.txt.gz | cut -f 1,450002- > beta2950_10of10.txt
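The column arithmetic is easier to check on a tiny table. The sketch below uses a hypothetical 5-site file and a block size of 2 in place of 50,000; the names `pooled.txt`, `block`, and `part_*` are assumptions for illustration. Unlike the csh script, it does not treat the last file specially, relying on `cut` ignoring field numbers past the end of the line.

```shell
#!/bin/bash
# Toy pooled table (hypothetical): column 1 is the sample ID, then 5 site columns.
work=$(mktemp -d)
printf 'id\tcg1\tcg2\tcg3\tcg4\tcg5\n'  > "$work/pooled.txt"
printf 's1\t0.1\t0.2\t0.3\t0.4\t0.5\n' >> "$work/pooled.txt"

block=2                                        # sites per output file (50000 in the slides)
n_sites=5
n_files=$(( (n_sites + block - 1) / block ))   # ceiling division

n=1
while [ "$n" -le "$n_files" ]; do
    # +2 / +1 offsets: fields are 1-based and field 1 is the ID column.
    first=$(( (n - 1) * block + 2 ))
    last=$(( n * block + 1 ))
    cut -f "1,${first}-${last}" "$work/pooled.txt" > "$work/part_${n}of${n_files}.txt"
    n=$((n + 1))
done

head -n 1 "$work"/part_*.txt
```

Each output file repeats the ID column, so any single piece can be analyzed on its own without referring back to the pooled table.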

Some tips

To check whether a data file contains a header, and whether it is tab- or comma-delimited:
    head -n 1 filename
To check a selected variable/column (e.g., to see how missing values were coded), where each # is a column number:
    head -n 10 filename | cut -f #,#
To get a subset of samples by matching ID:
    grep -f ID.txt filename
To find a certain column:
    zcat filename.txt.gz | head -n 1 | awk '/variable_name/{for(i=1;i<=NF;++i)if($i~/variable_name/)print NR,i,$i}'
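One caveat with the ID-matching tip: `grep -f` treats each ID as a pattern that may match anywhere in a line, so an ID such as s1 also selects s10. The sketch below, on invented data, shows a stricter alternative that compares the first field exactly with awk, together with an exact-match version of the column lookup. The file names and the column name `sex` are hypothetical.

```shell
#!/bin/bash
# Toy data (hypothetical): note that s1 is a prefix of s10.
work=$(mktemp -d)
printf 'id\tage\tsex\ns1\t34\tF\ns10\t51\tM\ns2\t47\tF\n' > "$work/data.txt"
printf 's1\ns2\n' > "$work/ID.txt"

# Column number whose header is exactly "sex".
col=$(head -n 1 "$work/data.txt" |
      awk -F '\t' -v name=sex '{ for (i = 1; i <= NF; ++i) if ($i == name) print i }')

# Pattern matching anywhere in the line: s1 also matches the s10 row.
n_grep=$(grep -c -f "$work/ID.txt" "$work/data.txt")

# Exact match on field 1: read the IDs first, then filter the data file.
n_exact=$(awk -F '\t' 'NR == FNR { want[$1] = 1; next } ($1 in want)' \
          "$work/ID.txt" "$work/data.txt" | wc -l)

echo "column: $col, grep rows: $n_grep, exact rows: $n_exact"
```

Whether the looser `grep -f` is acceptable depends on the ID scheme; when no ID is a substring of another, both approaches select the same rows.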

Using scripts to generate scripts

    #!/bin/bash -l
    #PBS -l walltime=16:00:00,pmem=2800mb,nodes=13:ppn=8
    #PBS -m abe
    proc=0
    for i in `seq 0 12`
    do
        for j in `seq 1 8`
        do
            job=$(($i*8+$j-1))
            scripts=/path
            # write one small script per job
            echo '#!/bin/bash -l' > $scripts/sim$job.sh
            echo "cd $scripts" >> $scripts/sim$job.sh
            echo "module load R" >> $scripts/sim$job.sh
            echo "R CMD BATCH --no-save --no-restore '--args job=$job' /path/assoc.R /path/log/sim$job.txt" >> $scripts/sim$job.sh
            chmod 770 $scripts/sim$job.sh
            # launch it in the background on processor $proc
            pbsdsh -n $proc $scripts/sim$job.sh &
            proc=$(($proc+1))
        done
    done
    # wait for all background jobs to finish
    wait
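The generate-then-launch pattern can be tried without a PBS cluster. In the sketch below, which is an illustration rather than the author's setup, each generated script only writes a log line, a here-document replaces the repeated `echo` calls, and a plain background `sh` call stands in for `pbsdsh`. The directory, the job count, and the log file names are hypothetical.

```shell
#!/bin/bash
scripts=$(mktemp -d)     # hypothetical output directory
n_jobs=4                 # hypothetical job count (104 in the slides)

job=0
while [ "$job" -lt "$n_jobs" ]; do
    # Unquoted EOF: $job and $scripts are expanded now, while the script is written;
    # \$PWD is escaped so it is expanded later, when the generated script runs.
    cat > "$scripts/sim$job.sh" <<EOF
#!/bin/sh
echo "job $job running in \$PWD" > "$scripts/sim$job.log"
EOF
    chmod 770 "$scripts/sim$job.sh"
    sh "$scripts/sim$job.sh" &       # stand-in for: pbsdsh -n \$proc script &
    job=$((job + 1))
done
wait                                 # block until every background job has finished

n_logs=$(ls "$scripts"/*.log | wc -l)
echo "finished jobs: $n_logs"
```

The key detail is which variables are expanded when the script is generated and which are left for run time; escaping the dollar sign is what separates the two.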