96-Summer Bioinformatics Programming Practicum (II): Bioinformatics with Perl. 8/13~8/22 蘇中才; 8/24~8/29 張天豪; 8/31 曾宇鳯
Before class
Course web page
Setup steps:
1. Download PuTTY / PieTTY.
2. Connect to the server.
3. wget http://gene.csie.ntu.edu.tw/~sbb/summer-course/doc/course1.tgz
4. tar zxvf course1.tgz
No.  Name    Account
1    許郁彬  course1
2    杜羿樞  course2
3    黃裕雄  course3
4    王建智  course4
5    陳士杰  course5
6    莊智傑  course6
7    朱柏威  course7
8    洪文峯  course8
9    吳耿豪  course9
10   張雯琪  course10
11   王悅    course11
12   張嘉芸  course12
13   林義峰  course13
14   游棨元  course14
15   許育堂  course15
16   陳建瑋  course16
17   黃國鑫  course17
18   翁小涵  course18
19   郭建鴻  course19
20   曾意儒  course20
Appendix Scalar, Array, Hash
Variable reset (1/2)
$scalar = undef;
$scalar = "";
@array = ();
%hash = ();
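Variable reset (2/2)
@array = undef;                # not empty: the array now holds one element, undef
print scalar(@array), "\n";    # prints 1; use @array = () to empty an array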
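The difference between the two ways of resetting an array can be checked directly. The following is a minimal sketch; the subroutine names `count_after_empty_list` and `count_after_undef` are made up for this illustration and are not part of the course material.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the element count after emptying an array with the empty list.
sub count_after_empty_list {
    my @array = ("one", "two", "three");
    @array = ();
    return scalar(@array);
}

# Return the element count after assigning undef to the array.
sub count_after_undef {
    my @array = ("one", "two", "three");
    @array = undef;    # list assignment of a single undef value
    return scalar(@array);
}

print "after = ():    ", count_after_empty_list(), "\n";   # after = ():    0
print "after = undef: ", count_after_undef(), "\n";        # after = undef: 1
```

Assigning `()` is therefore the reliable way to empty an array or a hash, while `undef` is meant for scalars.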
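Array
my @number = ("one", "two", "three");
my $number = ("one", "two", "three");   # comma operator in scalar context: $number gets "three"
print "@number\n";             # one two three
print scalar(@number)."\n";    # 3, the number of elements
print $#number."\n";           # 2, the index of the last element
print $number."\n";            # three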
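How an array behaves depends on the context it is used in. The sketch below collects the common cases in one place; the helper name `describe` is an assumption made for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the element count, the last index, and the interpolated form of a list.
sub describe {
    my @list = @_;
    return (scalar(@list), $#list, "@list");
}

my @number = ("one", "two", "three");
my ($count, $last_index, $joined) = describe(@number);
print "count      = $count\n";        # count      = 3
print "last index = $last_index\n";   # last index = 2
print "joined     = $joined\n";       # joined     = one two three

my $size    = @number;    # scalar context: the number of elements
my ($first) = @number;    # list context: the first element
print "$size $first\n";   # 3 one
```

Note the parentheses in `my ($first)`: they are what switches the assignment from scalar to list context.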
Array – qw
@array = qw" … ";    # qw splits the quoted text on whitespace into a list of words
print "@array\n";
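Array – sort by number
#! /usr/bin/perl -w
@array = (5, 4, 22, 9);
@sorted = sort { $a <=> $b } @array;    # <=> compares numerically
print join("\n", @sorted), "\n";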
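The comparison block matters because Perl's default sort compares strings. A small sketch of the difference follows; `sort_numeric` and `sort_string` are hypothetical helper names used only here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Numeric sort: <=> compares the values as numbers.
sub sort_numeric {
    my @sorted = sort { $a <=> $b } @_;
    return @sorted;
}

# Default sort: compares the values as strings.
sub sort_string {
    my @sorted = sort @_;
    return @sorted;
}

my @numbers = (5, 4, 22, 9);
print join(" ", sort_numeric(@numbers)), "\n";   # 4 5 9 22
print join(" ", sort_string(@numbers)), "\n";    # 22 4 5 9
```

For descending numeric order, swap the operands: `sort { $b <=> $a } @numbers`.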
Hash – show all elements
#! /usr/bin/perl -w
%nucleotide_bases = ( A => "Adenine", T => "Thymine", G => "Guanine", C => "Cytosine" );
# Method 1: each returns one (key, value) pair per call
while (($key, $value) = each %nucleotide_bases) {
  print "$key ====> $value\n";
}
# Method 2: loop over the keys and look up each value
foreach $key (keys %nucleotide_bases) {
  print "$key ====> $nucleotide_bases{$key}\n";
}
Hash – reverse with identical values
%nucleotide_bases = ( A => "Adenine", T => "Thymine", G => "Adenine", C => "Cytosine" );
while (($key, $value) = each %nucleotide_bases) {
  print "$key ====> $value\n";
}
# A and G share the value "Adenine", so the reversed hash keeps only one of them
%reverse = reverse %nucleotide_bases;
while (($key, $value) = each %reverse) {
  print "$key ====> $value\n";
}
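Because `reverse` silently drops keys that share a value, one possible workaround is to invert the hash into lists of keys. This is only a sketch; `invert_grouped` is a name invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invert a hash, collecting every key that shares a value into an array.
sub invert_grouped {
    my (%hash) = @_;
    my %inverted;
    for my $key (sort keys %hash) {
        push @{ $inverted{ $hash{$key} } }, $key;
    }
    return %inverted;
}

my %nucleotide_bases = (A => "Adenine", T => "Thymine", G => "Adenine", C => "Cytosine");
my %by_name = invert_grouped(%nucleotide_bases);
for my $name (sort keys %by_name) {
    print "$name ====> @{ $by_name{$name} }\n";
}
# Adenine ====> A G
# Cytosine ====> C
# Thymine ====> T
```

The keys are sorted before grouping so that the output order does not depend on the hash's internal ordering.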
Hash – the number of elements
How do you find the number of elements in a hash?
Ex:
my %hash = ( 'a' => 1, 'b' => 2 );
print scalar(keys(%hash))."\n";    # prints 2
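Comment
# This is a comment
=This is a comment, too
=This is a comment, three
=cut
print "Really ?\n";
(A line that begins with = in the first column starts an embedded documentation block, which runs until =cut.)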
Appendix STDIN, <>, our/my
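$_ - extract data from <STDIN>
while (<STDIN>) { print; }    # while assigns each input line to $_
if (<STDIN>) { print; }       # if reads a line but does not assign it to $_
Reading with <>
$line = <>;
#! /usr/bin/perl -w
while ( $line = <> ) {
  print $line;
}
Processing Data Files (like the UNIX command cat)
#! /usr/bin/perl -w
while (<>) {
  print;
}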
Others … (equivalent ways to write the same loop)
while (defined($_ = <>)) { print; }
while ($_ = <>) { print; }
while (<>) { print; }
for (;<>;) { print; }
print while defined($_ = <>);
print while ($_ = <>);
print while <>;
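The same read loop works on any filehandle, not only on `<>`. The sketch below reads from an in-memory filehandle so that it runs without an input file; `read_lines` is a hypothetical helper, and the explicit `defined` test is spelled out because the loop variable is a named lexical rather than `$_`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read every line from a string through an in-memory filehandle.
sub read_lines {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
    my @lines;
    while (defined(my $line = <$fh>)) {
        chomp $line;          # remove the trailing newline
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

my @lines = read_lines("ATGC\n0\nGGCC\n");
print scalar(@lines), " lines\n";   # 3 lines
```

The line containing only `0` is still read, because the loop tests whether the value is defined rather than whether it is true.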
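our/my
With my (the inner block gets its own variable):
my $var; $var = 1;
{
  my $var; $var = 2;
  print $var,"\n";    # prints 2
}
print $var, "\n";     # prints 1
With our (both names refer to the same package variable):
our $var; $var = 1;
{
  our $var; $var = 2;
  print $var,"\n";    # prints 2
}
print $var, "\n";     # prints 2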
Appendix Regular expression
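Reserved word
open log, ">test.txt" or die "…";    # log is a built-in function name, so avoid it as a filehandle
print log "test\n";
close log;
Use a name that is not reserved instead, for example:
open LOG, ">test.txt" or die "…";
print LOG "test\n";
close LOG;
Magic diamond - <>
print "$_" while (<>);
print "$_" while (<STDIN>);
Get the list of files in the current directory
@files = <*.pl>;
@files = glob("*.pl");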
Greedy matching
my $string = "course1:x:509:510::/home/course1:/bin/bash";
if ($string =~ /(.*):/) {
  print "matched string = [$1]\n";    # [course1:x:509:510::/home/course1]: .* is greedy and runs to the last colon
}
# How to match the first column ?
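Greedy matching
my $string = "course1:x:509:510::/home/course1:/bin/bash";
if ($string =~ /^([\S]*):/) {
  print "matched string = [$1]\n";    # still greedy: [course1:x:509:510::/home/course1]
}
if ($string =~ /^([\S]*?):/) {
  print "matched string = [$1]\n";    # non-greedy *? stops at the first colon: [course1]
}
if ($string =~ /([^:]*):/) {
  print "matched string = [$1]\n";    # [^:] cannot cross a colon: [course1]
}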
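The approaches to extracting the first column can be compared side by side. The sketch below wraps each pattern in a small subroutine; all four subroutine names are invented for this illustration, and the `split` variant is an additional alternative that does not appear on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Greedy: .* consumes as much as possible, so it stops at the LAST colon.
sub first_greedy  { my ($s) = @_; return $s =~ /^(.*):/    ? $1 : undef; }

# Non-greedy: .*? consumes as little as possible, so it stops at the FIRST colon.
sub first_lazy    { my ($s) = @_; return $s =~ /^(.*?):/   ? $1 : undef; }

# Negated character class: [^:] can never match a colon.
sub first_negated { my ($s) = @_; return $s =~ /^([^:]*):/ ? $1 : undef; }

# No regular expression capture at all: split on the separator.
sub first_split {
    my ($s) = @_;
    my ($first) = split /:/, $s;
    return $first;
}

my $string = "course1:x:509:510::/home/course1:/bin/bash";
print first_greedy($string), "\n";    # course1:x:509:510::/home/course1
print first_lazy($string), "\n";      # course1
print first_negated($string), "\n";   # course1
print first_split($string), "\n";     # course1
```

For colon-separated records such as this one, `split` is often the most readable choice when several columns are needed.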
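Substitution – remove all x
$_ = "China xxxxxx Taiwan";
s/x*//;    # How to rewrite ?
print;
Output: China xxxxxx Taiwan
(Nothing is removed: x* also matches the empty string, so the substitution succeeds at the very start of the string, before "China".)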
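One possible answer to the rewrite question is to require at least one x and to repeat the substitution globally. The sketch below contrasts the two versions; the subroutine names are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The slide's version: x* matches the empty string at position 0, so nothing changes.
sub remove_with_star {
    my ($text) = @_;
    $text =~ s/x*//;
    return $text;
}

# A rewritten version: x+ needs at least one x, and /g repeats the substitution.
sub remove_all_x {
    my ($text) = @_;
    $text =~ s/x+//g;
    return $text;
}

print "[", remove_with_star("China xxxxxx Taiwan"), "]\n";   # [China xxxxxx Taiwan]
print "[", remove_all_x("China xxxxxx Taiwan"), "]\n";       # [China  Taiwan]
```

The two spaces that surrounded the run of x characters both remain. `tr/x//d` would also delete every x and avoids the regular expression engine entirely.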
Quoted syntax
Symbol    General    Description          Interpolated
' '       q/ /       String               No
" "       qq/ /      String               Yes
` `       qx/ /      Execution            Yes
( )       qw/ /      List of words        No
/ /       m/ /       Pattern matching     Yes
(none)    s/ / /     Substitution         Yes
y/ / /    tr/ / /    Transliteration      No
(none)    qr/ /      Regular expression   Yes
Appendix Useful techniques
Shell command – file/directory
mkdir("doc", 0744);    # permissions are octal: 0744, not 0x744
chdir("doc");
rmdir("doc");
unlink("log.txt");
chmod(0700, "log1.txt", "log2.txt", "log3.txt");
rename("old_name", "new_name");
chown($uid, $gid, "log1.txt", "log2.txt", "log3.txt");    # $uid and $gid are the numeric user and group IDs
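Permission modes are octal numbers, so the leading `0` matters and a leading `0x` gives a different value. The following sketch prints both to show the difference; `mode_string` is a hypothetical helper that touches no files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Format a permission mode as four octal digits.
sub mode_string {
    my ($mode) = @_;
    return sprintf("%04o", $mode);
}

print mode_string(0744), "\n";          # 0744 (octal literal: rwxr--r--)
print mode_string(0x744), "\n";         # 3504 (hexadecimal literal: a different number)
print mode_string(oct("744")), "\n";    # 0744 (oct() converts an octal string)
```

When a mode arrives as a string, for example from the command line, convert it with `oct()` before passing it to `chmod` or `mkdir`.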
Perl
Usage: perl [switches] [--] [programfile] [arguments]
-c               check syntax only (runs BEGIN and CHECK blocks)
-d[:debugger]    run program under debugger
-e program       one line of program (several -e's allowed, omit programfile)
-i[extension]    edit <> files in place (makes backup if extension supplied)
-n               assume "while (<>) {... }" loop around program
-p               assume loop like -n but print line also, like sed
-u               dump core after parsing program
-v               print version, subversion
-w               enable many useful warnings (RECOMMENDED)
-W               enable all warnings
-X               disable all warnings
Removal of ^M perl -pi.bak -e 's/\r//g;' index.html
File Copy
#! /usr/bin/perl
use File::Copy;
copy("file1", "file2") or die "Copy failed: $!";
Reserved word for debug __FILE__ __LINE__ Ex: print "FILE:".__FILE__." LINE:".__LINE__."\n";
Debug
perl -d "program name"
Debug
$ perl -d test.pl
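Special variable
$_            the default variable (the last implicit assignment)
$!            error message
$$            current process ID
$?            the status returned when the previous child process ended
$"            the separator used when a list is interpolated into a string
$/            the input record separator (newline by default)
$`, $&, $'    string matching: the text before the match, the match itself, and the text after it
$+            the last backreference
@_            arguments of a subroutine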
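A few of these variables are easy to try out. The sketch below uses `$"`, the three match variables, and `$$`; the subroutine names `join_with` and `match_parts` are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $" is the separator placed between elements when an array is interpolated.
sub join_with {
    my ($separator, @items) = @_;
    local $" = $separator;    # restored automatically when the subroutine returns
    return "@items";
}

# $`, $& and $' hold the text before the match, the match, and the text after it.
sub match_parts {
    my ($text, $pattern) = @_;
    return unless $text =~ $pattern;
    return ($`, $&, $');
}

print join_with("-", "A", "T", "G", "C"), "\n";    # A-T-G-C
my ($before, $matched, $after) = match_parts("China xxxxxx Taiwan", qr/x+/);
print "[$before][$matched][$after]\n";             # [China ][xxxxxx][ Taiwan]
print "process ID: $$\n";
```

Using `local` keeps the change to `$"` from leaking into the rest of the program.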
Bytecode generator
$ perlcc -B -o test test3.pl
CPAN perl -MCPAN -e "install GD"
BioPerl
PSI-BLAST
Position-Specific Iterative BLAST constructs a multiple sequence alignment and then creates a position-specific scoring matrix (PSSM) from it.
Workflow: query sequence -> BLAST against the sequence database -> homologous proteins -> multiple sequence alignment -> PSSM -> BLAST again with the PSSM -> new homologous proteins
PSSM (1/4)
Query sequence and homologous proteins (aligned):
GHEGVGKVVKLGAGA
GHEKKGYFEDRGPSA
GHEGYGGRSRGGGYS
GHEFEGPKGCGALYI
GHELRGTTFMPALEC
Residues: A C D E F G H I K L M N P Q R S T V W Y
Frequency of each residue in each column:
Column 1: $f_{A,1} = 0/5$, $f_{C,1} = 0/5$, …, $f_{G,1} = 5/5$, …
Column 2: $f_{A,2} = 0/5$, $f_{C,2} = 0/5$, …, $f_{H,2} = 5/5$, …
…
Column 15: $f_{A,15} = 2/5$, $f_{C,15} = 1/5$, …, $f_{S,15} = 1/5$, …
PSSM (2/4)
The original data:
Column 1: $f_{A,1} = 0/5$, $f_{C,1} = 0/5$, …, $f_{G,1} = 5/5$, …
Column 2: $f_{A,2} = 0/5$, $f_{C,2} = 0/5$, …, $f_{H,2} = 5/5$, …
…
Column 15: $f_{A,15} = 2/5$, $f_{C,15} = 1/5$, …, $f_{S,15} = 1/5$, …
Set a pseudo-count of 1 (add 1 to each of the 20 residue counts, so the denominator becomes 5 + 20):
Column 1: $f'_{A,1} = (0+1)/(5+20)$, $f'_{C,1} = (0+1)/(5+20)$, …, $f'_{G,1} = (5+1)/(5+20)$, …
Column 2: $f'_{A,2} = (0+1)/(5+20)$, $f'_{C,2} = (0+1)/(5+20)$, …, $f'_{H,2} = (5+1)/(5+20)$, …
…
Column 15: $f'_{A,15} = (2+1)/(5+20)$, $f'_{C,15} = (1+1)/(5+20)$, …, $f'_{S,15} = (1+1)/(5+20)$, …
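The pseudo-count calculation can be reproduced with a few lines of Perl. This is a minimal sketch using the five aligned sequences from the slides; the subroutine name `pseudo_frequency` and its argument order are assumptions made for this illustration, and columns are numbered from 1 as on the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @alignment = qw(
    GHEGVGKVVKLGAGA
    GHEKKGYFEDRGPSA
    GHEGYGGRSRGGGYS
    GHEFEGPKGCGALYI
    GHELRGTTFMPALEC
);

# Frequency of $residue in 1-based column $column, with $pseudo added
# to the count of each of the 20 amino acids.
sub pseudo_frequency {
    my ($residue, $column, $pseudo, @sequences) = @_;
    my $count = grep { substr($_, $column - 1, 1) eq $residue } @sequences;
    return ($count + $pseudo) / (@sequences + 20 * $pseudo);
}

printf "%.2f\n", pseudo_frequency("G", 1,  1, @alignment);   # 0.24 = (5+1)/(5+20)
printf "%.2f\n", pseudo_frequency("A", 1,  1, @alignment);   # 0.04 = (0+1)/(5+20)
printf "%.2f\n", pseudo_frequency("A", 15, 1, @alignment);   # 0.12 = (2+1)/(5+20)
```

Passing a pseudo-count of 0 gives the original, unadjusted frequencies.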
PSSM (3/4)
The score is derived from the ratio of the observed to the expected frequencies. More precisely, the logarithm of this ratio is taken and referred to as the log-likelihood ratio:
$$\mathrm{Score}_{i,j} = \log \frac{f'_{i,j}}{q_i}$$
where $\mathrm{Score}_{i,j}$ is the score for residue $i$ at position $j$, $f'_{i,j}$ is the relative frequency of residue $i$ at position $j$, and $q_i$ is the expected relative frequency of residue $i$ in a random sequence.
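The log-likelihood ratio described above can be sketched in Perl as follows. Two assumptions are made here that the slides do not state: the natural logarithm is used (Perl's built-in `log`), and the background frequency is taken to be uniform, 1/20 for every residue. The subroutine name `log_odds_score` is also hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Log-likelihood ratio of an observed frequency against a background frequency.
# Assumption: natural logarithm; other bases only rescale the scores.
sub log_odds_score {
    my ($frequency, $background) = @_;
    return log($frequency / $background);
}

my $background = 1 / 20;    # assumption: uniform background over 20 amino acids

printf "%.3f\n", log_odds_score(6 / 25, $background);   # 1.569  (conserved residue)
printf "%.3f\n", log_odds_score(1 / 25, $background);   # -0.223 (residue never observed)
```

A positive score means the residue occurs more often at that position than expected by chance, and a negative score means it occurs less often. The pseudo-count is what keeps the frequency above zero, so the logarithm is always defined.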
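PSSM (4/4)
[Figure: the resulting PSSM, with one entry per residue (A C D E F G H I K L M N P Q R S T V W Y) and alignment position; the only surviving entry is 0.7.]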