Computational Skills Course week 1 Mike Gilchrist NIMR May-July 2011.

WEEK ONE Introduction and scope of course. The command line, the nature of computer data, and some useful command line tools for modifying text files.

Main topics
The command line
Data and text files
Simple command line tools
Third party bioinformatics tools
Databases and SQL
3GL’s – industrial strength computing
Scripts and pipelines
What will not appear and why...

Axioms
Google it
You cannot break your computer
Familiarity breeds content
Bugs are always your fault (and can always be found - eventually)
There are many ways of doing any given task (but one of them may be much better)
Computational biologists tend to obfuscate
GNU = GNU's Not Unix

WEEK ONE Basic concepts of computer based data, the command line, and some essential command line utilities.

The terminal window

At the command line
D:\projects\current> program -U jackyO -n 25 -i E:\data\experiment-1.txt > output.txt
program = program.exe
or
-rwxr-xr-x 1 migil sysbio 18K Mar 12 16:16 program
gcc]$ hts_utils

PATH
F:\data\projects\khokha\exon-capture\blast\data-12may11.txt
A path gives the location of a file, as above. There is also the PATH environment variable, which tells the computer where to look for programs.
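To see what the PATH variable actually holds, a minimal sketch along these lines may help. It assumes a Unix-like shell (Linux, Mac, or Cygwin on Windows); at a plain Windows prompt the rough equivalent of the first command is `echo %PATH%`. The choice of grep as the program to look up is only an example.

```shell
# Print each directory on the search path, one per line
# (Unix-like shells separate PATH entries with ':').
echo "$PATH" | tr ':' '\n'

# Ask the shell where it would find a given program.
# grep is only an example; any installed program name works.
grep_location=$(command -v grep)
echo "grep is at: $grep_location"
```

If a program is installed but "not found", its folder is usually missing from this list.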

Computer data
a ‘byte’ = 8 bits, e.g. 00100011 = 35 (base 10) = ‘#’
2^8 = 256 possible characters in the simplest form of computer data

The ascii ‘alphabet’
  0   \0    (the null terminator)
  9   \t    TAB
 10   \n    (line feed)
 13   \r    (return)
 32   ‘ ‘   (space)
 48   ‘0’
  :
 57   ‘9’
 65   ‘A’
  :
 90   ‘Z’
  :
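The codes in the table can be checked from the command line. The sketch below assumes a Unix-like shell with the standard `od` utility; the variable names are made up for illustration.

```shell
# od -An -tu1 prints each input byte as an unsigned decimal number;
# tr then strips the padding spaces and the trailing newline.
code_A=$(printf 'A' | od -An -tu1 | tr -d ' \n')
code_tab=$(printf '\t' | od -An -tu1 | tr -d ' \n')
code_lf=$(printf '\n' | od -An -tu1 | tr -d ' \n')

echo "A=$code_A TAB=$code_tab LF=$code_lf"
```

The same trick works on any file: piping it through `od -c` shows invisible characters such as \t and \r explicitly.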

A typical text data file
@HWUSI-EAS582_0299:3:1:87:82/1
GGCGACGATACATTCGGATGTCTGCCCTATCAACTTTCGAT
+HWUSI-EAS582_0299:3:1:87:82/1
CGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCT
+HWUSI-EAS582_0299:3:1:87:1949/1
CGGGCATCTAAGGGCATCACAGACCTGTTATTGCTCGATCT
+HWUSI-EAS582_0299:3:1:87:1688/1
CGCTTGTTTCCTGATGTCACATGACAACACAAGATCGATAA
+HWUSI-EAS582_0299:3:1:87:1773/1
ATTGTGCAAGTCTCCCAATGTCGATTTAATGAAATCCCTAC
+HWUSI-EAS582_0299:3:1:87:1000/1
GGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGA
+HWUSI-EAS582_0299:3:1:87:842/1
TTCGGACACACTGGGCCCAGATGGCTTCTTGGATTTAGGGG
+HWUSI-EAS582_0299:3:1:88:1826/1
cWbgggcfgfc`g^gc^ddgg\ggbcd\`OfdZW`d`dJdg

The line ending/platform issue
DOS (Windows): each line ends with return + line feed
582_0299:3:1:87:82/1\r\n
GATGTCTGCCCTATCAACTTTCGAT\r\n
582_0299:3:1:87:82/1\r\n
fbee_M_f]UV^JYRXbSdf`bfff\r\n
Unix/Linux (newer Macs): each line ends with line feed only
582_0299:3:1:87:82/1\n
GATGTCTGCCCTATCAACTTTCGAT\n
582_0299:3:1:87:82/1\n
fbee_M_f]UV^JYRXbSdf`bfff\n
Programs which read text in a line at a time may give the wrong result if the text file is from an incompatible platform. If you are moving a file from Linux to Windows, run unix2dos first (and dos2unix when going the other way).
>unix2dos
On disk the file is really one continuous stream of characters:
582_0299:3:1:87:82/1\r\nGATGTCTGCCCTATCAACTTTCGAT\r\n582_0299:3:1:87.. (really)
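The difference between the two conventions is easy to demonstrate on a toy file. This is a sketch assuming a Unix-like shell; the file names and contents are invented, and `tr -d '\r'` is used as a rough stand-in for dos2unix in case that tool is not installed.

```shell
workdir=$(mktemp -d)

# Build a two-line file with DOS line endings (\r\n) and one with Unix endings (\n).
printf 'read-1\r\nGATGTC\r\n' > "$workdir/dos.txt"
printf 'read-1\nGATGTC\n'     > "$workdir/unix.txt"

# The DOS file is two bytes longer: one extra \r per line.
dos_bytes=$(wc -c < "$workdir/dos.txt" | tr -d ' ')
unix_bytes=$(wc -c < "$workdir/unix.txt" | tr -d ' ')
echo "dos=$dos_bytes unix=$unix_bytes"

# Strip every carriage return, which is roughly what dos2unix does.
tr -d '\r' < "$workdir/dos.txt" > "$workdir/converted.txt"
cmp -s "$workdir/converted.txt" "$workdir/unix.txt" && same=yes || same=no
echo "converted matches unix file: $same"
```

A stray \r at the end of a line is invisible on screen, which is why a pattern that "obviously" matches can fail on a file that came from another platform.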

Command line utilities for manipulating data files
>grep   finds lines matching ‘patterns’
>sed    swaps one pattern for another
>awk    operates on ‘fields’ in a record
>cat
>tail
>more/less
>unix2dos/dos2unix
AND THEN THERE ARE PIPES... ‘|’
AND REDIRECTS... ‘>’

grep
A typical fasta sequence file (Xenopus tropicalis mRNAs)
:
ACAAGCGATCTTGTAGAGCAATTCCAGCAACACTTAAAGGGACTTCTGTCTGTACTTACC
AAACTGACAAAAAAGGCCAATCTACTGACAAACTCTTACAAAAAGCAGATTGGCATTGGT
GCTCCGAGCAGT
>ENSXETG |ENSXETT |nodal2
ATGGCAGCCCTAGGAGCCCTCTTTTTATTTGCCATGGCCTCCCTTGTGCACGGGAAGCCC
ATTCATTCAGACAGAAAAGGAGCTAAAATCCCTCTGGCAGGATCTAACCTGGGATACAAG
AAATCCAGCAATTCATATGGTTCCAGACTGTCGCAGGGTATGAGATACCCCCCTTCCATG
ATGCAGTTATACCAGACTCTGATTTTGGGGAATGATACGGATCTGTCAATCCTGGAATAT
CCCATCCTGCAGGAATCTGATGCCGTTCTAAGCCTCATTGCAAAAAGTTGCGATGTAGTG
GGCAATCGATGGACATTGTCCTTTGACATGTCTTCTATATCGAGCAGCAATGAGCTGAAA
TTGGCCGAGTTGAGGATCCGCCTCCCTTCCTTTGAAAGGTCCCAGGAT
>ENSXETG |ENSXETT |neurod4
AGCCCCGGACTCATTGATATTGGGCACAGCGAGTCTGCCTGGGAGCTGTCCAGCACTCCA
TGCTCCTGAAATAACTTGGGCAACAAGTCCGATCTGCCCGCTACTCTGTGCCTCCAGCTC
AGGCCCGGGGAGAGGGACCCTGCTGAGCAGGACTCAGGACACTGTTTGAAGATCACATCA
AATTCTGCTAATATGTCGGAGATAGTCAGTGTGCATGGGTGGATGGAGGAAGCCCTTAGT
TCCCAGGATGAGATGGAGAGGAATCAGCGGCAATCTGCCTATGATATCATTTCAGGTCTG
AGTCACGAGGAAAGGTGTAGCATAGATGGAGAAGATGATGATGAAGAAGAAGAGGATGGA
GAGAAACCAAAAAAGAGGGGACCCAAAAAAAAGAAGATGACCAAGGCTAGACTGGAGAGG
TTTCGTGTGCGCAGAGTAAAAGCCAATGCCAGGGAGCGCACCAGAATGCATGGACTTAAT
GATGCCCTAGAAAATTTAAGGAGGGTCATGCCTTGCTATTCCAAAACACAAAA
>ENSXETG |ENSXETT |camk2g
AGACAAGAAACAGTGGAATGTTTGAGAAAGTTCAATGCACGGAGAAAGCTTAAAGGTGCA
ATTCTCACAACAATGTTGGTTTCTCGGAATTTTTCTGGAATTGCATTTGGATGCCGAAAA
GCTGCATCCACTGTCCCGTGTACCTCTTCAACGGGGGACACTATAACTGGTGTTGGCAGG
CAGACCTCCGCCCCTGTTGTGGCCGCCACCAGTGCTGCCAACTTAGTCGAGCAAGCTGCC
AAGAGTTTGTTGAACAAGAAGACAGATGGTGTCAAGCCACAGACCAACAACAAAAACAGC
ATAATAAGCCCTGCAAAAGAAAACCCCCCATTGCAGACATCAATGGAACCTCAAACAACT
GTTGTCCACAATGCAACTGATGGGATAAAAGGATCAACAGAGAGTTGTAACACCACCACT
:

grep
grep reads the input file line by line and reports each line containing the pattern.
F:\dev\sql>grep ACGTACGT my-sequence-file.fasta
GAGAAGAACGTACGTGAGTGCAACCCATTCCTGGACCCGGAGATGGTGCGATTCCTCTGG
TATTAAGAAAGAAAAGTTACGTACGTTGATAGACCTTGTAAGTGAAGAGAAGATGTTAGA
TTCAACCCAACTTACTATGTTACTATTGCTTCATTCCTTTTCACGTACGTCTGGTCTCAA
GGAATTAATAACCAGGATTTTGAAGGGGATTGCTACGTACGTCGCAGGTTATCCGGTGGA
:
F:\dev\sql>grep -c ACGTACGT my-sequence-file.fasta
76
F:\dev\sql>grep nodal my-sequence-file.fasta
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal3
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal1
>ENSXETG |ENSXETT |nodal5.2
>ENSXETG |ENSXETT |nodal5
>ENSXETG |ENSXETT |nodal6
>ENSXETG |ENSXETT |nodal
F:\dev\sql>grep -c ">" my-sequence-file.fasta
F:\dev\sql>grep ">" my-sequence-file.fasta > my-sequence-file-def-lines.txt
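The grep commands above can be tried on a toy file first. The sketch below assumes a Unix-like shell; the file `demo.fasta` and its three records are invented for illustration and are not part of the course data.

```shell
workdir=$(mktemp -d)
fasta="$workdir/demo.fasta"   # hypothetical file, made up for this sketch

cat > "$fasta" <<'EOF'
>gene1|nodal1
ACGTACGTTTGA
GGCATT
>gene2|sox2
TTGACCACGT
ACGTGG
>gene3|nodal2
CCACGTACGTAA
EOF

n_records=$(grep -c '>' "$fasta")        # number of definition lines
n_nodal=$(grep -c 'nodal' "$fasta")      # definition lines mentioning nodal
n_motif=$(grep -c 'ACGTACGT' "$fasta")   # LINES containing the motif
echo "records=$n_records nodal=$n_nodal motif_lines=$n_motif"
```

Note that `grep -c` counts matching lines, not matches. In this toy file the gene2 sequence contains ACGTACGT once its two lines are joined, but grep does not see it because the motif spans a line break.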

sed
stream editor: reads input a line at a time, and operates on each line as requested, typically to make a substitution. E.g. replace groups of spaces with a TAB, etc.
F:\dev\sql>grep nodal my-sequence-file.fasta > tmp.txt
F:\dev\sql>more tmp.txt
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal3
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal1
>ENSXETG |ENSXETT |nodal5.2
>ENSXETG |ENSXETT |nodal5
>ENSXETG |ENSXETT |nodal6
>ENSXETG |ENSXETT |nodal
F:\dev\sql>sed "s/nodal/xnr/" tmp.txt
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr3
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr1
>ENSXETG |ENSXETT |xnr5.2
>ENSXETG |ENSXETT |xnr5
>ENSXETG |ENSXETT |xnr6
>ENSXETG |ENSXETT |xnr
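The "replace groups of spaces with a TAB" case mentioned on this slide might look like the sketch below. It assumes a Unix-like shell; the input line is invented, and the TAB is passed in through a shell variable because not every sed understands `\t` in the replacement.

```shell
TAB=$(printf '\t')   # a literal TAB character, portable across sed versions

# Replace each run of one or more spaces with a single TAB.
# The pattern is two spaces followed by *, i.e. "a space, then zero or more spaces".
tabbed=$(printf 'nodal1   chr3    1200\n' | sed "s/  */${TAB}/g")
printf '%s\n' "$tabbed"

# Count the fields awk sees when splitting the result on TAB.
n_fields=$(printf '%s\n' "$tabbed" | awk -F'\t' '{ print NF }')
echo "fields=$n_fields"
```

Tab-delimited output of this kind is convenient input for awk and for spreadsheet or database imports.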

sed
Without the g flag sed replaces only the first match on each line; with g it replaces every match.
F:\dev\sql>more tmp.txt
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal3
>ENSXETG |ENSXETT |nodal2
>ENSXETG |ENSXETT |nodal1
>ENSXETG |ENSXETT |nodal5.2
>ENSXETG |ENSXETT |nodal5
>ENSXETG |ENSXETT |nodal6
>ENSXETG |ENSXETT |nodal
F:\dev\sql>sed "s/0/_/" tmp.txt
>ENSXETG_ |ENSXETT |nodal2
>ENSXETG_ |ENSXETT |nodal3
>ENSXETG_ |ENSXETT |nodal2
>ENSXETG_ |ENSXETT |nodal1
>ENSXETG_ |ENSXETT |nodal5.2
>ENSXETG_ |ENSXETT |nodal5
>ENSXETG_ |ENSXETT |nodal6
>ENSXETG_ |ENSXETT |nodal
F:\dev\sql>sed "s/0/_/g" tmp.txt
>ENSXETG______25789|ENSXETT______19729|nodal2
>ENSXETG_______9__9|ENSXETT______1973_|nodal3
>ENSXETG______25789|ENSXETT______19728|nodal2
>ENSXETG_______9__8|ENSXETT______19726|nodal1
>ENSXETG______17442|ENSXETT______37932|nodal5.2
>ENSXETG______16779|ENSXETT______36596|nodal5
>ENSXETG______16778|ENSXETT______36593|nodal6
>ENSXETG______23748|ENSXETT______51228|nodal
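The effect of the trailing g is easiest to see on a single line. This is a small sketch assuming a Unix-like shell; the identifier `GENE0001000` is made up and does not come from the course file.

```shell
id='>GENE0001000|demo'   # hypothetical definition line

# Without g: only the first 0 on the line is replaced.
first_only=$(printf '%s\n' "$id" | sed 's/0/_/')

# With g ("global"): every 0 on the line is replaced.
every_match=$(printf '%s\n' "$id" | sed 's/0/_/g')

echo "first only:  $first_only"
echo "every match: $every_match"
```

Forgetting the g is a common reason for a substitution that appears to work on a quick look but leaves later matches on the same line untouched.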

The pipe
Two steps, with a temporary file:
F:\dev\sql>grep ">" my-sequence-file.fasta > tmp.txt
F:\dev\sql>sed "s/nodal/xnr/" tmp.txt
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr3
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr1
>ENSXETG |ENSXETT |xnr5.2
>ENSXETG |ENSXETT |xnr5
>ENSXETG |ENSXETT |xnr6
>ENSXETG |ENSXETT |xnr
One step, with the output of grep piped straight into sed:
F:\dev\sql>grep ">" my-sequence-file.fasta | sed "s/nodal/xnr/"
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr3
>ENSXETG |ENSXETT |xnr2
>ENSXETG |ENSXETT |xnr1
>ENSXETG |ENSXETT |xnr5.2
>ENSXETG |ENSXETT |xnr5
>ENSXETG |ENSXETT |xnr6
>ENSXETG |ENSXETT |xnr
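That the two routes give the same answer can be checked directly. The sketch assumes a Unix-like shell, and the two-record `demo.fasta` is invented for the purpose.

```shell
workdir=$(mktemp -d)
printf '>g1|nodal1\nACGT\n>g2|nodal2\nTTGA\n' > "$workdir/demo.fasta"

# Two steps with a temporary file on disk...
grep '>' "$workdir/demo.fasta" > "$workdir/tmp.txt"
via_file=$(sed 's/nodal/xnr/' "$workdir/tmp.txt")

# ...or one step, with grep's output piped straight into sed.
via_pipe=$(grep '>' "$workdir/demo.fasta" | sed 's/nodal/xnr/')

printf '%s\n' "$via_pipe"
[ "$via_file" = "$via_pipe" ] && echo "same result"
```

The pipe avoids leaving temporary files behind, and further commands can be chained onto the end in the same way.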

(n)awk
Awk manipulates data in a structured, tabular format: it reads input one line at a time, but operates on fields.
Neuro       Pachnis
Biol        Smith
Biol        Goldstein
Biol        Mohun
Biol        Logan
Biol        Smith
Neurobiol   Briscoe
Biol        Smith
F:\dev\sql> nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt
Ashleigh
Leona
Eric
Guilherme
Mary
:
F:\dev\sql> nawk -F\t "{ print $4, $7; }" I:\transfer\workshop.txt | sed "s/@/ AT /"
Ashleigh ahowes AT nimr.mrc.ac.uk
Leona lgabrys AT nimr.mrc.ac.uk
Eric edang AT nimr.mrc.ac.uk
Guilherme gneves AT nimr.mrc.ac.uk
Mary mwu AT nimr.mrc.ac.uk
:
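A self-contained version of the same idea, picking fields out of a tab-delimited file and passing them on to sed, could look like this. It assumes a Unix-like shell with a POSIX awk; the file `people.txt`, its two columns and the example.org addresses are all invented.

```shell
workdir=$(mktemp -d)
# Hypothetical two-column table: first name, TAB, e-mail address.
printf 'Ada\tada@example.org\nBob\tbob@example.org\n' > "$workdir/people.txt"

# -F'\t' sets the field separator to TAB; $1 and $2 are the first two fields.
# The comma in print puts a single space between them in the output.
contacts=$(awk -F'\t' '{ print $1, $2 }' "$workdir/people.txt" | sed 's/@/ AT /')
printf '%s\n' "$contacts"
```

Single quotes are used around the awk program here so that the shell does not try to expand `$1` and `$2` itself; on a Windows prompt double quotes are needed instead, as on the slide.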

Now let’s go and get some data... Download the set of gene locus coordinates for your model organism from BioMart (Ensembl). Practise using grep, sed and (n)awk on this data. E.g. How many nodal genes are there? Etc.

Simple exercise
Here is a small test, driven by a real need. Take the following tiny section of a fasta file of human proteins from NCBI, and put it in your own fasta/text file.
>gi| |ref|NP_ | mitogen-inducible gene 6 protein [Homo sapiens]
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
NGVCASTPPLTPIKNSPSLFPCAPLCERGSRPLPPLPISEALSLDDTDCE
>gi| |ref|NP_ | small muscle protein, X-linked [Homo sapiens]
MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGA
KKLPGPAVNLSEIQNIKSELKYVPKAEQ
>gi| |ref|NP_ | WW domain binding protein 5 [Homo sapiens]
MKSCQKMEGKPENESEPKHEEEPKPEEKPEEEEKLEEEAKAKGTFRERLIQSLQEFKEDI
HNRHLSNEDMFREVDEIDEIRRVRNKLIVMRWKVNRNHPYPYLM
>gi| |ref|NP_ | ribosomal protein L24-like [Homo sapiens]
MRIEKCYFCSGPIYPGHGMMFVRNDCKVFRFCKSKCHKNFKKKRNPRKVRWTKAFRKAAG
KELTVDNSFEFEKRRNEPIKYQRELWNKTIDAMKRVEEIKQKRQAKFIMNRLKKNKELQK
VQDIKEVKQNIHLIRAPLAGKGKQLEEKMVQQLQEDVDMEDAP
Then see if you can edit it to produce new versions of the fasta file by:
(a) removing the NCBI IDs string, i.e. ->
>mitogen-inducible gene 6 protein [Homo sapiens]
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
etc.
(b) removing everything except the accession number
>NP_
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
etc.
(c) identifying and substituting the species name in the square brackets
>gi| |ref|NP_ | mitogen-inducible gene 6 protein (species)
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
etc.
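One possible line of attack on the three tasks, for comparison after you have tried them yourself, is sketched below. It is not the only way and not necessarily the intended answer. It assumes a Unix-like shell, and the header line uses made-up ID numbers (`12345`, `NP_000001.1`) purely for illustration.

```shell
# Hypothetical definition line with invented IDs.
header='>gi|12345|ref|NP_000001.1| example protein [Homo sapiens]'

# (a) drop the ID block and keep the description.
#     [^|]* means "any run of characters that are not |".
a=$(printf '%s\n' "$header" | sed 's/^>gi|[^|]*|ref|[^|]*| */>/')

# (b) keep only the accession: \( \) captures it, \1 puts it back.
b=$(printf '%s\n' "$header" | sed 's/^>gi|[^|]*|ref|\([^|]*\)|.*/>\1/')

# (c) replace the bracketed species name with other text.
c=$(printf '%s\n' "$header" | sed 's/\[[^]]*]/(species)/')

printf '%s\n' "$a" "$b" "$c"
```

Applied to a whole file, the same sed commands leave the sequence lines alone, because those lines contain neither `>gi|` nor square brackets.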

Catches and other platform specific stuff
Paths: folder dividers are ‘/’ in mac/linux, ‘\’ in windows.
Quote marks: you may have to experiment with double and single quotes, and even sometimes the backquote. With sed you may or may not have to quote the command string, i.e.
>sed 's/nodal/xnr/'
OR
>sed "s/nodal/xnr/"
or no quotes at all.
Unix/Linux is case sensitive: Hello != hello
^C (control-C) is your get out of jail free card!
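In a Unix-like shell the two kinds of quote behave differently, which is often the cause of the quoting trouble described above. The sketch below illustrates this together with case sensitivity; the variable name `gene` is arbitrary, and the behaviour at a Windows prompt differs.

```shell
gene=nodal

single=$(echo '$gene')    # single quotes: the text is passed through untouched
double=$(echo "$gene")    # double quotes: $gene is replaced by its value
echo "single: $single"
echo "double: $double"

# Case sensitivity: grep matches the exact case unless -i is given.
exact=$(printf 'Hello\nhello\n' | grep -c 'hello')
anycase=$(printf 'Hello\nhello\n' | grep -ci 'hello')
echo "exact=$exact anycase=$anycase"
```

This is why awk programs containing `$1`, `$2` and so on are normally wrapped in single quotes on Linux and Mac.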

For next week
Download and install MySQL. This seems to work easily on Windows, but may be rather complicated on the Mac, so make sure you follow the instructions fairly closely. The server needs to be ‘running in the background’ for you to be able to log on and create databases and tables, etc.