1 96-Summer Bioinformatics Programming Practicum (II): Bioinformatics with Perl
8/13~8/22 蘇中才; 8/24~8/29 張天豪; 8/31 曾宇鳳
2 Schedule
Date | Time | Subject | Speaker
8/13 (Mon) | 13:30~17:30 | Perl Basics | 蘇中才
8/15 (Wed) | 13:30~17:30 | Programming Basics | 蘇中才
8/17 (Fri) | 13:30~17:30 | Regular expression | 蘇中才
8/20 (Mon) | 13:30~17:30 | Retrieving Data from Protein Sequence Database | 蘇中才
8/22 (Wed) | 13:30~17:30 | Perl combines with Genbank, BLAST | 蘇中才
8/24 (Fri) | 13:30~17:30 | PDB database and structure files | 張天豪
8/27 (Mon) | 8:30~12:30 | Extracting ATOM information | 張天豪
8/27 (Mon) | 13:30~17:30 | Mapping of Protein Sequence IDs and Structure IDs | 張天豪
8/31 (Fri) | 13:30~17:30 | Final and Examination | 曾宇鳳
3 Reference Books
- Learning Perl (Perl 學習手冊)
- Beginning Perl for Bioinformatics
- Bioinformatics, Biocomputing and Perl: An Introduction to Bioinformatics Computing Skills and Practice
5 Learning Perl
6 Perl
Practical Extraction and Report Language
Created by Larry Wall in the mid-1980s (Perl 1.0 was released in 1987).
- Suitable for quick-and-dirty scripts
- Suitable for string handling
- Powerful regular expressions
7 Preparation
- Download putty.exe / pietty.exe
- Getting materials for this course:
- Server (ssh):
- Id: course1 ~ course20
- Password:
8 Installing Perl on Windows
- Download the package from 5.8/ActivePerl MSWin32-x msi
- Versions of Perl: Unix, Linux, Windows (ActivePerl), Mac (MacPerl)
9 Text Editors
A convenient (text) editor for programming:
- UltraEdit: good for me
- Notepad: just an editor
- Vim: for UNIX/Linux lovers _menu.html
- Joe: easy to use for Unix beginners
10 Finding Help
- Best resource-finding tool: online resources; use the HTML Help in ActivePerl
- Command line (highly recommended):
    perldoc -f FUNCTION    # search function
    perldoc -q KEYWORD     # search FAQ
    perldoc MODULE         # search module
    perldoc perldoc
11 Perl Basic Starting
12 Program: run thyself!
    $ vi welcome
    #! /usr/bin/perl -w
    print "Hello, world\n";
    $ perl welcome
    Hello, world
    perl]$ ls -al
    -rw-rw-r-- 1 sbb sbb 20 Jul 2 15:27 welcome
    perl]$ chmod +x welcome
    perl]$ ls -al
    -rwxrwxr-x 1 sbb sbb 20 Jul 2 15:27 welcome
    $ ./welcome
    Hello, world
13 Using the Perl while construct
    #! /usr/bin/perl -w
    # The 'forever' program - a (Perl) program,
    # which does not stop until someone presses Ctrl-C.
    use constant TRUE => 1;
    use constant FALSE => 0;
    while ( TRUE )
    {
        print "Welcome to the Wonderful World of Bioinformatics!\n";
        sleep 1;
    }
14 Running forever...
    $ chmod +x forever
    $ ./forever
    Welcome to the Wonderful World of Bioinformatics!
    (the line repeats once per second until Ctrl-C is pressed)
15 Perl Basic Variables
16 Variables
- Scalar ($)
  - Number: 1; 1.23; 12e34
  - String: "abc"; 'ABC'; "Hello, world!"
- Array / List (@)
- Hash (%)
17 Introducing variable containers
The simplest type of variable container is the scalar (純量). In Perl, a scalar can hold, for example, a number, a word, a sentence or the contents of a disk file.
    $name
    $_address
    $programming_101
    $z
    $abc
    $swissprot_to_interpro_mapping
    $SwissProt2InterProMapping
Variable naming is an art!
18 scalar
    #!/usr/bin/perl -w
    # lower case for user-defined names; upper case for system defaults
    my $ARGV = "example.pl";
    my $number = 1.2;
    my $string = "Hello, world!";
    # my $123 = 123;            # error: a name cannot start with a digit
    my $abc = "123";
    my $_123 = '123';
    my $O000OoO00 = 1;
    my $OO00Oo000 = 2;
    my $OO00OoOOO = 3;
    $abc = $O000OoO00 * $OO00Oo000 - $OO00OoOOO;   # 1 * 2 - 3 = -1
    print $abc x 4 . "\n";      # -1-1-1-1
    print 5 x 4 . "\n";         # 5555
    print 5 * 4 . "\n";         # 20
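The last three print statements of the scalar slide rely on operator precedence: x (repetition) and * (multiplication) both bind tighter than . (concatenation). A minimal sketch, with variable names chosen only for this illustration:

```perl
use strict;
use warnings;

my $value = 1 * 2 - 3;            # -1

# 'x' repeats the string form of its left operand.
my $repeated   = $value x 4;      # the string "-1" four times
my $five_times = 5 x 4;           # the string "5" four times
my $product    = 5 * 4;           # ordinary multiplication

# 'x' and '*' are evaluated before '.', so no parentheses are needed.
print $repeated . "\n";
print $five_times . "\n";
print $product . "\n";
```

Note the space before the dot: writing 4. "\n" would make Perl read "4." as a number.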
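19 Number
Format (range: at least 1e-100 ~ 1e100; with the usual double-precision floats, about 1e-308 ~ 1e308)
    2000
    1.25
    -6.5e45        (-6.5*10^45)
    123_456_789
Other formats
    0377           # octal (decimal 255)
    0xFF           # hexadecimal
    0b             # binary (0b followed by binary digits)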
20 number
    $integer = 12;
    $real = 12.34;
    $oct = 0377;
    $bin = 0b ;
    $hex = 0xff;
    $long = ;
    $long_ = 123_456_789;
    $large = 1E100;     #1E200
    $small = 1E-100;    #1E-200
    print "integer : $integer\n";
    print "real : $real\n";
    print "oct=$oct bin=$bin hex=$hex\n";
    #printf("oct=0%o bin=0b%b hex=0x%x\n",$oct,$bin,$hex);
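A runnable sketch of the same idea. The binary and long values below are assumptions (the slide's own digits are not available); 0b11111111 was picked only so that all three literals equal 255. Interpolation prints plain decimal, while sprintf/printf with %o, %b and %x show the other bases.

```perl
use strict;
use warnings;

my $oct  = 0377;           # octal literal
my $bin  = 0b11111111;     # binary literal (assumed value)
my $hex  = 0xff;           # hexadecimal literal
my $long = 123_456_789;    # underscores are ignored inside numbers

# Interpolation always shows the decimal value.
my $plain = "oct=$oct bin=$bin hex=$hex";

# sprintf converts back to the requested base.
my $based = sprintf( "oct=0%o bin=0b%b hex=0x%x", $oct, $bin, $hex );

print "$plain\n";
print "$based\n";
print "long=$long\n";
```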
21 Parameters of printf (ref: number)
Specifier | Output | Example
c | Character | a
d or i | Signed decimal integer | 392
e | Scientific notation (mantissa/exponent) using e character | e+2
E | Scientific notation (mantissa/exponent) using E character | E+2
f | Decimal floating point |
g | Use the shorter of %e or %f |
G | Use the shorter of %E or %f |
o | Signed octal | 610
s | String of characters | sample
u | Unsigned decimal integer | 7235
x | Unsigned hexadecimal integer | 7fa
X | Unsigned hexadecimal integer (capital letters) | 7FA
p | Pointer address | B800:0000
n | Nothing printed. The argument must be a pointer to a signed int, where the number of characters written so far is stored. |
% | A % followed by another % character will write % to stdout. |
22 operator
    2 + 3         # 5
    5.1 - 2.4     # 2.7
    3 * 12        # 36
    14 / 2        # 7
    10.2 / 0.3    # 34
    10 / 3        # 3.333…
    10 % 3        # 1
23 Operators
Operator | Function
+ | Addition
- | Subtraction, Negative Numbers, Unary Negation
* | Multiplication
/ | Division
% | Modulus
** | Exponent
Operator | Function
= | Normal Assignment
+= | Add and Assign
-= | Subtract and Assign
*= | Multiply and Assign
/= | Divide and Assign
%= | Modulus and Assign
**= | Exponent and Assign
$number = $number + 100; is equivalent to $number += 100;
24 Take a break … modulus 10.5 % 3.2 = ? exponentiation 2^3 = ?
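One way to check the two questions is to run them. In Perl, % works on integers (the operands are truncated first) and ^ is bitwise XOR, not exponentiation; ** is the exponent operator. The sketch below also uses fmod from the core POSIX module for a floating-point remainder; the variable names are illustrative only.

```perl
use strict;
use warnings;
use POSIX qw(fmod);

my $int_mod   = 10.5 % 3.2;          # operands truncated to 10 and 3
my $float_mod = fmod( 10.5, 3.2 );   # floating-point remainder, approximately 0.9
my $xor       = 2 ^ 3;               # bitwise XOR of 2 (binary 10) and 3 (binary 11)
my $power     = 2 ** 3;              # exponentiation

print "10.5 % 3.2      = $int_mod\n";
print "fmod(10.5, 3.2) = $float_mod\n";
print "2 ^ 3           = $xor\n";
print "2 ** 3          = $power\n";
```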
25 string
Format
- Single quotes: 'hello'   'hello\nhello'   'hello,$name'
- Double quotes: "hello"   "hello\nhello"   "hello,$name"
- Exceptions: '\'\\'   "\"\\"
    #!/usr/bin/perl -w
    print 'hello';
    print "hello";
26 Backslash escapes
Escape sequence | Description or character
\b | Backspace
\e | Escape
\f | Form feed
\n | New line
\r | Carriage return
\t | Tab
\v | Vertical tab (Perl strings have no \v escape; use \x0B)
\$ | Dollar sign
\& | Ampersand
\0nnn | Any octal byte
\xnn | Any hexadecimal byte
\cn | Any control character
\l | Change the next character to lowercase
\u | Change the next character to uppercase
\\ | Backslash
27 Conversion between string and number
    $answer = "Hello " . " " . " world\n";
    $answer = "12" . "3";            # "123"
    $answer = "12" * "3";            # 36
    $answer = "12Hello34" * "3";     # 36, with a warning: only the leading digits are used
    $answer = "A" . 3*5;             # "A15"
    $answer = "A" x (3*5);           # "A" repeated 15 times
    $answer = "12" x "3";            # "121212"
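The conversions can be verified with a short script. This is only a sketch; the variable names are made up for the illustration, and the 'numeric' warning is switched off locally for the one line that would otherwise warn.

```perl
use strict;
use warnings;

my $joined  = "12" . "3";       # concatenation keeps both as strings
my $product = "12" * "3";       # '*' converts both strings to numbers
my $label   = "A" . 3 * 5;      # 3 * 5 is evaluated first, then appended
my $repeat  = "A" x ( 3 * 5 );  # repetition count is 15
my $digits  = "12" x "3";       # the right operand is used as the number 3

my $partial;
{
    # "12Hello34" is numified as 12; silence the warning for this line only.
    no warnings 'numeric';
    $partial = "12Hello34" * "3";
}

print "$joined $product $label $repeat $digits $partial\n";
```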
28 Variable containers and loops
    #! /usr/bin/perl -w
    # The 'tentimes' program - a (Perl) program,
    # which stops after ten iterations.
    use constant HOWMANY => 10;
    $count = 0;
    while ( $count < HOWMANY )
    {
        print "Welcome to the Wonderful World of Bioinformatics!\n";
        $count++;
    }
29 Running tentimes...
    $ chmod +x tentimes
    $ ./tentimes
    Welcome to the Wonderful World of Bioinformatics!
    (the line is printed ten times)
30 Using the Perl if construct
    #! /usr/bin/perl -w
    # The 'fivetimes' program - a (Perl) program,
    # which stops after five iterations.
    use constant TRUE => 1;
    use constant FALSE => 0;
    use constant HOWMANY => 5;
    $count = 0;
    while ( TRUE )
    {
        $count++;
        print "Welcome to the Wonderful World of Bioinformatics!\n";
        if ( $count == HOWMANY )
        {
            last;    # leave the loop after the fifth iteration
        }
    }
31 The oddeven program
    #! /usr/bin/perl -w
    # The 'oddeven' program.
    use constant HOWMANY => 4;
    $count = 0;
    while ( $count < HOWMANY )
    {
        $count++;
        if ( $count % 2 == 0 )
        {
            print "$count : even\n";
        }
        else    # $count % 2 is not zero.
        {
            print "$count : odd\n";
        }
    }
32 Comparison operator
Comparison | Number | String
Equal | == | eq
Not equal | != | ne
Less than | < | lt
Greater than | > | gt
Less than or equal | <= | le
Greater than or equal | >= | ge
Comparison | <=> | cmp
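Choosing the numeric or the string operator changes the answer, because string operators compare character by character. A small sketch (variable names are illustrative only):

```perl
use strict;
use warnings;

# Numeric comparison: 10 is not less than 9.
my $numeric_less = ( 10 < 9 ) ? 1 : 0;

# String comparison: "10" sorts before "9" because "1" comes before "9".
my $string_less = ( "10" lt "9" ) ? 1 : 0;

# The three-way operators return -1, 0 or 1.
my $numeric_order = 10 <=> 9;
my $string_order  = "10" cmp "9";

print "10 <  9      : $numeric_less\n";
print "'10' lt '9'  : $string_less\n";
print "10 <=> 9     : $numeric_order\n";
print "'10' cmp '9' : $string_order\n";
```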
33 Variable Interpolation
    #! /usr/bin/perl -w
    # The 'interpolation' program - interpolating variables into strings.
    $language = "Perl";
    $string = "I love $language";      print $string . "\n";    # I love Perl
    $string = 'I love $language';      print $string . "\n";    # I love $language
    $string = 'I love ' . $language;   print $string . "\n";    # I love Perl
    $string = "I love \$language";     print $string . "\n";    # I love $language
    $string = "I love $languages";     print $string . "\n";    # warning: $languages is undefined
    # write ${language}s to interpolate $language followed by a literal "s"
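A strict-safe sketch of the same cases, leaving out the undefined $languages and using ${language}s instead (variable names other than $language are illustrative):

```perl
use strict;
use warnings;

my $language = "Perl";

my $double  = "I love $language";      # double quotes interpolate
my $single  = 'I love $language';      # single quotes do not
my $escaped = "I love \$language";     # an escaped dollar sign stays literal
my $plural  = "I love ${language}s";   # braces mark where the name ends

print "$double\n$single\n$escaped\n$plural\n";
```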
34 Arrays: Associating Data With Numbers
    @list_of_sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );
35 Array
36 Working with array elements
    print "$list_of_sequences[1]\n";
GCTCAGTTCT
    $list_of_sequences[1] = 'CTATGCGGTA';
    $list_of_sequences[3] = 'GGTCCATGAA';
37 The Array
38 How big is the array?
    print "The array size is: ", $#list_of_sequences+1, ".\n";
    print "The array size is: ", scalar @list_of_sequences, ".\n";
Both statements print:
The array size is: 4.
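The two ways of measuring an array can be compared side by side. A minimal sketch using the array from the previous slides; $#array is the last index, and an array in scalar context gives the number of elements.

```perl
use strict;
use warnings;

my @list_of_sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );

# Assigning to index 3 extends the array by one element.
$list_of_sequences[3] = 'GGTCCATGAA';

my $last_index      = $#list_of_sequences;        # index of the last element
my $size_from_index = $#list_of_sequences + 1;    # last index plus one
my $size            = scalar @list_of_sequences;  # element count

print "Last index: $last_index\n";
print "The array size is: $size.\n";
```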
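39 Adding elements to an array
    @sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );
    @sequences = ( @sequences, 'CTATGCGGTA' );    # keep the old elements and append one
    print "@sequences\n";
TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA
    @sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );
    @sequences = ( 'CTATGCGGTA' );                # this replaces the whole array
    print "@sequences\n";
CTATGCGGTA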
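40 Adding more elements to an array
    @sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );
    @sequences = ( @sequences, ( 'CTATGCGGTA', 'CTATTATGTC' ) );
    print "@sequences\n";
TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA CTATTATGTC
    @sequence_1 = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );
    @sequence_2 = ( 'GCTCAGTTCT', 'GACCTCTTAA' );
    @combined_sequences = ( @sequence_1, @sequence_2 );
    print "@combined_sequences\n";
TTATTATGTT GCTCAGTTCT GACCTCTTAA GCTCAGTTCT GACCTCTTAA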
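41 Removing elements from an array
    @sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'TTATTATGTT' );
    @removed_elements = splice @sequences, 1, 2;    # remove 2 elements starting at index 1
    print "@removed_elements\n";
    print "@sequences\n";
GCTCAGTTCT GACCTCTTAA
TTATTATGTT TTATTATGTT
    # clean all elements of an array
    @sequences = ();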
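42 The slices program
    #! /usr/bin/perl -w
    # The 'slices' program - slicing arrays.
    @sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'CTATGCGGTA', 'ATCTGACCTC' );
    print "@sequences\n";
    @seq_slice = @sequences[ 1 .. 3 ];       # copy elements 1 to 3; the array is unchanged
    print "@seq_slice\n";
    print "@sequences\n";
    @removed = splice @sequences, 1, 3;      # remove 3 elements starting at index 1
    print "@sequences\n";
    print "@removed\n";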
43 Results from slices...
TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA ATCTGACCTC
GCTCAGTTCT GACCTCTTAA CTATGCGGTA
TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA ATCTGACCTC
TTATTATGTT ATCTGACCTC
GCTCAGTTCT GACCTCTTAA CTATGCGGTA
44 Processing every element in an array
    #! /usr/bin/perl -w
    # The 'iterateW' program - iterate over an entire array
    # with while.
    @sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'CTATGCGGTA', 'ATCTGACCTC' );
    $index = 0;
    $last_index = $#sequences;
    while ( $index <= $last_index )
    {
        print "$sequences[ $index ]\n";
        ++$index;
    }
45 Results from iterateW...
TTATTATGTT
GCTCAGTTCT
GACCTCTTAA
CTATGCGGTA
ATCTGACCTC
46 The iterateF program
    #! /usr/bin/perl -w
    # The 'iterateF' program - iterate over an entire array
    # with foreach.
    @sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'CTATGCGGTA', 'ATCTGACCTC' );
    foreach $value ( @sequences )
    {
        print "$value\n";
    }
47 Making lists easier to work with
    @sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'CTATGCGGTA', 'ATCTGACCTC' );
    @sequences = ( TTATTATGTT, GCTCAGTTCT, GACCTCTTAA, CTATGCGGTA, ATCTGACCTC );
    @sequences = qw( TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA ATCTGACCTC );
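48 Quoted words
    #!/usr/bin/perl -w
    # The 'quoted_words' program - qw accepts many delimiters
    @list_of_sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );
    @list_of_sequences = qw/TTATTATGTT GCTCAGTTCT GACCTCTTAA/;
    @list_of_sequences = qw{TTATTATGTT GCTCAGTTCT GACCTCTTAA};
    @list_of_sequences = qw!TTATTATGTT GCTCAGTTCT GACCTCTTAA!;
    @list_of_sequences = qw[TTATTATGTT GCTCAGTTCT GACCTCTTAA];
    @list_of_sequences = qw<TTATTATGTT GCTCAGTTCT GACCTCTTAA>;
    @list_of_sequences = qw#TTATTATGTT GCTCAGTTCT GACCTCTTAA#;
    print "@list_of_sequences\n";
    print "The array size is: ", $#list_of_sequences+1, ".\n";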
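49 pop/push/shift/unshift
    #!/usr/bin/perl -w
    # The 'array_operator' program - pop / push / shift / unshift
    @array = 5..9;
    print "array = [@array]\n";
    print "==========pop==========\n";
    $item = pop @array;            # remove the last element
    print "item = [$item]\n";
    print "array = [@array]\n";
    print "==========push 9==========\n";
    push @array, 9;                # append to the end
    print "array = [@array]\n";
    print "==========shift==========\n";
    $item = shift @array;          # remove the first element
    print "item = [$item]\n";
    print "array = [@array]\n";
    print "==========unshift 1..5==========\n";
    unshift @array, 1..5;          # insert at the front
    print "array = [@array]\n";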
50 pop/push/shift/unshift
array = [5 6 7 8 9]
==========pop==========
item = [9]
array = [5 6 7 8]
==========push 9==========
array = [5 6 7 8 9]
==========shift==========
item = [5]
array = [6 7 8 9]
==========unshift 1..5==========
array = [1 2 3 4 5 6 7 8 9]
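51 reverse / sort
    #!/usr/bin/perl -w
    # The 'array_operator1' program - reverse / sort
    @array = qw/ /;
    print "array = [@array]\n";
    print "========================================\n";
    @result = reverse @array;
    print "reverse array = [@result]\n";
    print "========================================\n";
    @result = sort @array;
    print "sort array = [@result]\n";
    print "========================================\n";
    @result = reverse sort @array;
    print "reverse sort array = [@result]\n";
    print "========================================\n";
    @result = sort reverse @array;
    print "sort reverse array = [@result]\n";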
52 reverse / sort
array = [ ]
========================================
reverse array = [ ]
========================================
sort array = [ ]
========================================
reverse sort array = [ ]
========================================
sort reverse array = [ ]
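A runnable sketch of reverse and sort. The ten numbers below are an assumption made for this illustration. Note that a plain sort compares strings, so "10" lands right after "1"; a numeric sort needs a comparison block with <=>.

```perl
use strict;
use warnings;

my @array = qw/5 4 9 8 1 3 6 2 7 10/;    # assumed sample values

my @reversed       = reverse @array;                       # last element first
my @sorted_string  = sort @array;                          # default: string order
my @sorted_numeric = sort { $a <=> $b } @array;            # ascending numeric order
my @descending     = reverse sort { $a <=> $b } @array;    # descending numeric order

print "array         = [@array]\n";
print "reverse       = [@reversed]\n";
print "sort (string) = [@sorted_string]\n";
print "sort (number) = [@sorted_numeric]\n";
print "descending    = [@descending]\n";
```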
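53 split/join
    #!/usr/bin/perl -w
    # The 'array_operator2' program - join / split
    $string = "5 4 9 8 1 3 6 2 7 10";
    @array = split / /, $string;       # cut the string at every space
    print "array = [@array]\n";
    $string = join ",", @array;        # glue the elements back together with commas
    print "array = [$string]\n";
array = [5 4 9 8 1 3 6 2 7 10]
array = [5,4,9,8,1,3,6,2,7,10]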
54 How to map between IP and domain name?
IP | Domain name
[ ] | gene.csie.ntu.edu.tw
[ ] | biominer.csie.ntu.edu.tw
[ ] | knn.csie.ntu.edu.tw
55 Use two arrays to map between IP and domain name
Domain name array: [0] gene.csie.ntu.edu.tw, [1] biominer.csie.ntu.edu.tw, [2] knn.csie.ntu.edu.tw
IP array: [0], [1], [2]
56 How to search for a certain IP or domain name?
Domain name array: [0] gene.csie.ntu.edu.tw, [1] biominer.csie.ntu.edu.tw, [2] knn.csie.ntu.edu.tw
IP array: [0], [1], [2]
57 Why Hash?
%Domain_name (Key => Value)
[ ] => gene.csie.ntu.edu.tw
[ ] => biominer.csie.ntu.edu.tw
[ ] => knn.csie.ntu.edu.tw
58 How to get a certain domain name?
%Domain_name (Key => Value)
[ ] => gene.csie.ntu.edu.tw
[ ] => biominer.csie.ntu.edu.tw
[ ] => knn.csie.ntu.edu.tw
$Domain_name{" "}
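The difference between the two approaches can be sketched as follows. The IP addresses here are placeholders from the documentation range 192.0.2.0/24, not the real addresses of these hosts; the array names are also made up for the illustration. With two arrays, a lookup has to scan for the matching index; with a hash, the key leads straight to the value.

```perl
use strict;
use warnings;

# Placeholder addresses (documentation range), NOT the hosts' real IPs.
my @ips   = ( '192.0.2.1', '192.0.2.2', '192.0.2.3' );
my @names = ( 'gene.csie.ntu.edu.tw',
              'biominer.csie.ntu.edu.tw',
              'knn.csie.ntu.edu.tw' );

# Two arrays: walk the indices until the IP matches.
my $found_by_scan;
for my $i ( 0 .. $#ips ) {
    if ( $ips[$i] eq '192.0.2.2' ) {
        $found_by_scan = $names[$i];
        last;
    }
}

# Hash: build the IP => name mapping once, then look up by key.
my %Domain_name;
@Domain_name{@ips} = @names;    # hash slice assignment
my $found_by_hash = $Domain_name{'192.0.2.2'};

print "scan: $found_by_scan\n";
print "hash: $found_by_hash\n";
```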
59 Examples of Hash
60 Hashes: Associating Data With Words
%nucleotide_bases (key => value)
    %nucleotide_bases = ( A, Adenine, T, Thymine );
    %nucleotide_bases = ( A => Adenine, T => Thymine );
61 Working with hash entries
    print "The expanded name for 'A' is $nucleotide_bases{ 'A' }\n";
The expanded name for 'A' is Adenine
62 How big is the hash?
    %nucleotide_bases = ( A, Adenine, T, Thymine );
    @hash_names = keys %nucleotide_bases;    # list context: all the keys (in no guaranteed order)
    print "The names in the %nucleotide_bases hash are: @hash_names\n";
The names in the %nucleotide_bases hash are: A T
    %nucleotide_bases = ( A, Adenine, T, Thymine );
    $hash_size = keys %nucleotide_bases;     # scalar context: the number of keys
    print "The size of the %nucleotide_bases hash is: $hash_size\n";
The size of the %nucleotide_bases hash is: 2
63 Adding entries to a hash
    $nucleotide_bases{ 'G' } = 'Guanine';
    $nucleotide_bases{ 'C' } = 'Cytosine';
    %nucleotide_bases = ( A => Adenine, T => Thymine, G => Guanine, C => Cytosine );
64 The Grown %nucleotide_bases Hash
65 Removing entries from a hash
    delete $nucleotide_bases{ 'C' };         # removes the key and its value
    $nucleotide_bases{ 'C' } = undef;        # keeps the key; only the value becomes undefined
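The difference between the two statements shows up with exists and with the hash size. A minimal sketch (the flag variable names are illustrative only):

```perl
use strict;
use warnings;

my %nucleotide_bases = (
    A => 'Adenine',
    T => 'Thymine',
    G => 'Guanine',
    C => 'Cytosine',
);

$nucleotide_bases{'G'} = undef;     # the key 'G' stays, its value is now undefined
delete $nucleotide_bases{'C'};      # the key 'C' is gone

my $g_exists  = exists $nucleotide_bases{'G'}  ? 1 : 0;
my $g_defined = defined $nucleotide_bases{'G'} ? 1 : 0;
my $c_exists  = exists $nucleotide_bases{'C'}  ? 1 : 0;
my $hash_size = keys %nucleotide_bases;    # number of keys left

print "G exists: $g_exists, G defined: $g_defined\n";
print "C exists: $c_exists\n";
print "hash size: $hash_size\n";
```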
66 Slicing hashes
    #! /usr/bin/perl -w
    # The 'slicing_hashes' program - extract a certain subset of a hash
    %gene_counts = ( Human => 31000,
                     'Thale cress' => 26000,
                     'Nematode worm' => 18000,
                     'Fruit fly' => 13000,
                     Yeast => 6000,
                     'Tuberculosis microbe' => 4000 );
    @counts = @gene_counts{ Human, "Fruit fly", 'Tuberculosis microbe' };
    print "@counts\n";
67 Working with hash entries: a complete example
    #! /usr/bin/perl -w
    # The 'bases' program - a hash of the nucleotide bases.
    %nucleotide_bases = ( A => Adenine, T => Thymine, G => Guanine, C => Cytosine );
    $sequence = 'CTATGCGGTA';
    print "\nThe sequence is $sequence, which expands to:\n\n";
    while ( $sequence =~ /(.)/g )
    {
        print "\t$nucleotide_bases{ $1 }\n";
    }
68 Results from bases...
The sequence is CTATGCGGTA, which expands to:
    Cytosine
    Thymine
    Adenine
    Thymine
    Guanine
    Cytosine
    Guanine
    Guanine
    Thymine
    Adenine
69 Processing every entry in a hash
    #! /usr/bin/perl -w
    # The 'genes' program - a hash of gene counts.
    use constant LINE_LENGTH => 60;
    %gene_counts = ( Human => 31000,
                     'Thale cress' => 26000,
                     'Nematode worm' => 18000,
                     'Fruit fly' => 13000,
                     Yeast => 6000,
                     'Tuberculosis microbe' => 4000 );
70 The genes program, cont.
    print '-' x LINE_LENGTH, "\n";
    while ( ( $genome, $count ) = each %gene_counts )
    {
        print "`$genome' has a gene count of $count\n";
    }
    print '-' x LINE_LENGTH, "\n";
    foreach $genome ( sort keys %gene_counts )
    {
        print "`$genome' has a gene count of $gene_counts{ $genome }\n";
    }
    print '-' x LINE_LENGTH, "\n";
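71 Results from genes...
------------------------------------------------------------
`Human' has a gene count of 31000
`Tuberculosis microbe' has a gene count of 4000
`Fruit fly' has a gene count of 13000
`Nematode worm' has a gene count of 18000
`Yeast' has a gene count of 6000
`Thale cress' has a gene count of 26000
------------------------------------------------------------
`Fruit fly' has a gene count of 13000
`Human' has a gene count of 31000
`Nematode worm' has a gene count of 18000
`Thale cress' has a gene count of 26000
`Tuberculosis microbe' has a gene count of 4000
`Yeast' has a gene count of 6000
------------------------------------------------------------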
72 How to sort by the values ?
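One possible answer, as a sketch: sort the keys, but let the comparison block look at the values. Here $b comes before $a so that the largest count is listed first; @by_count is a name made up for this illustration.

```perl
use strict;
use warnings;

my %gene_counts = (
    Human                  => 31000,
    'Thale cress'          => 26000,
    'Nematode worm'        => 18000,
    'Fruit fly'            => 13000,
    Yeast                  => 6000,
    'Tuberculosis microbe' => 4000,
);

# Compare the values numerically; the keys are what gets sorted.
my @by_count = sort { $gene_counts{$b} <=> $gene_counts{$a} } keys %gene_counts;

foreach my $genome (@by_count) {
    print "`$genome' has a gene count of $gene_counts{ $genome }\n";
}
```

Swapping $a and $b in the block gives ascending order instead.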
73 Exercise Protein sequences
74 FASTA format
>P53_HUMAN (P04637) Cellular tumor antigen p53 (Tumor suppressor p53) (Phosphoprotein p53) (Antigen NY-CO-13) - Homo sapiens (Human).
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG
GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
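75 Read a FASTA file
    #!/usr/bin/perl -w
    my ( $line, $queryname, $queryseq );
    $queryseq = '';                         # start empty to avoid an 'uninitialized' warning
    while ( $line = <> )
    {
        if ( $line =~ /^>(\S+)/ )           # header line: keep the name up to the first space
        {
            $queryname = $1;
        }
        else
        {
            chomp $line;
            $queryseq = $queryseq . $line;  # append this line of residues
        }
    }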
76 Exercise: read more than one sequence
- Store the protein names and sequences from disorder.fa in two arrays.
- Show all of the protein names and sequences.
- Show the number of proteins and residues. ($len = length $seq;)
77 Exercise: read more than one sequence
- Store the protein names and sequences from disorder.fa in a hash.
- Show the protein names and sequences sorted by protein name.
- Find the longest sequence.
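A possible starting point for the hash version, offered as a sketch rather than a model answer. To stay self-contained it reads two made-up toy records from an in-memory filehandle; a real run would open disorder.fa instead. The names %sequence_of, $longest_name and $total_residues are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical two-record FASTA text standing in for disorder.fa.
my $fasta_text = <<'END_FASTA';
>seqA first toy record
MEEPQ
SDPSV
>seqB second toy record
MKT
END_FASTA

# Open the string as if it were a file.
open my $fh, '<', \$fasta_text or die "Cannot open in-memory file: $!";

my ( %sequence_of, $name );
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^>(\S+)/ ) {         # header: start a new record
        $name = $1;
        $sequence_of{$name} = '';
    }
    elsif ( defined $name ) {           # sequence line: append to the current record
        $sequence_of{$name} .= $line;
    }
}
close $fh;

my $total_residues = 0;
my $longest_name;
foreach my $protein ( sort keys %sequence_of ) {
    my $len = length $sequence_of{$protein};
    $total_residues += $len;
    $longest_name = $protein
        if !defined $longest_name
        || $len > length $sequence_of{$longest_name};
    print "$protein ($len residues): $sequence_of{$protein}\n";
}

print "Proteins: ", scalar( keys %sequence_of ), "\n";
print "Residues: $total_residues\n";
print "Longest: $longest_name\n";
```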