
Optimizing in Perl By Peter Wad Sackett

Optimizing the code – minor gains 1
- ++$i and $st .= $data instead of $i = $i + 1 and $st = $st . $data
- Use index instead of a regular expression when possible
- $st =~ tr/X/X/; to count occurrences of X'es
- Substitute en passant: ($st2 = $st1) =~ s/eat/drink/g;
- Start for loops from the end and count down to 0
- Use foreach instead of for
- Use each %hash instead of keys %hash
- Use single quotes instead of double quotes on literal strings
- Don't use $`, $& and $' in regexes
- Don't use $st =~ m/$var/, use $st =~ m/$var/o
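
None of these tricks has to be taken on faith: the core Benchmark module compares candidate snippets directly. A minimal sketch contrasting two of the counting idioms above (the test string and the timing budget are arbitrary choices):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $st = 'AXCGXT' x 1000;
    cmpthese(-2, {                                    # run each sub for about 2 CPU seconds
        regex => sub { my $n = () = $st =~ m/X/g },   # count X'es with a global match
        tr    => sub { my $n = $st =~ tr/X/X/ },      # count X'es with tr///
    });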

Optimizing the code – minor gains 2
- Subroutines have an overhead – avoid them in inner loops
- Use references in subroutine calls to avoid copying large data
- Blocks {} have an overhead
- Some loops can be moved into map and grep functions
- Large arrays/strings should be initialized at the far end to allocate all the memory at once
- Avoid creating local hashes in inner loops
- Reject common cases early in loops
- Remove invariant code from loops
- Avoid system calls if possible
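
Two of these points in miniature – passing a reference instead of copying a large array, and growing an array to full size in one step (the array sizes are arbitrary):

    use strict;
    use warnings;

    sub sum_copy  { my @data = @_;   my $s = 0; $s += $_ for @data;  return $s }  # copies every element into @data
    sub sum_byref { my ($data) = @_; my $s = 0; $s += $_ for @$data; return $s }  # copies a single reference

    my @big = (1) x 1_000_000;
    my $slow = sum_copy(@big);
    my $fast = sum_byref(\@big);

    my @huge;
    $huge[9_999_999] = 0;   # touching the far end grows the array to full size at once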

Optimizing code – bigger gains
- Make (array) searches into hash lookups – the drawback is memory use
- Consider sorting the data
- Cache results if appropriate – Memoize
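
Both ideas in miniature; the tag list echoes the barcodes used later in the talk, and slow_lookup is a hypothetical stand-in for any expensive pure function:

    use strict;
    use warnings;
    use Memoize;

    my @tags = qw(GACT TCCT CTGT GTCT);

    # Array search: scans the whole list on every query.
    my ($found) = grep { $_ eq 'CTGT' } @tags;

    # Hash lookup: build once, then every query is a single probe.
    my %is_tag = map { $_ => 1 } @tags;
    print "hit\n" if exists $is_tag{CTGT};

    # Memoize caches a function's return values by its arguments.
    sub slow_lookup { my ($key) = @_; sleep 1; return length $key }   # sleep stands in for real work
    memoize('slow_lookup');
    slow_lookup('CTGT');   # takes a second
    slow_lookup('CTGT');   # answered from the cache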

Changing the algorithm – huge gains
A program usually spends the bulk of its time in a few places, the bottlenecks. Use a profiler to identify those places if they are not obvious – use Devel::NYTProf;
Make the bottleneck part run faster by:
- Simple optimization
- Avoiding it
- Reusing results from previous invocations (caching)
- Changing the algorithm, either by incremental insight or by fundamental change
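
Devel::NYTProf is driven from the command line; the script name and argument below are placeholders:

    perl -d:NYTProf sortreads.pl input.fq   # profile a normal run; writes nytprof.out
    nytprofhtml                             # turn nytprof.out into a browsable HTML report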

Changing the environment
- If the program runs as part of a web server, consider FastCGI or mod_perl.
- If the program needs to load a lot of data, consider making a client/server solution.
- If using 'outside' resources like databases, learn the methods to access them fast. The DBI module has both fast and slow methods.
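
As an example of the last point, a DBI sketch (the DSN, credentials, table and column names are all placeholders): the DBI documentation notes that a bind_columns/fetch loop is the fastest way to pull rows, while fetchrow_hashref is the slowest.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=reads', 'user', 'password',
                           { RaiseError => 1 });
    my $sth = $dbh->prepare('SELECT id, seq FROM read_tbl WHERE tag = ?');
    $sth->execute('CTGT');

    $sth->bind_columns(\my ($id, $seq));   # bind once ...
    while ($sth->fetch) {                  # ... then run the cheapest possible row loop
        # work with $id and $seq here
    }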

Reading a 60 GB NGS data file
The input: 1.6 billion FASTQ records. Each record is four lines: an id line carrying the read description 1:N:0:, a 51-base sequence line whose first 4 bases are the barcode, a + line, and a quality line. Sequence lines from the file:
CTGTAACGTACCATAGGTTGACCATACTTCAAAAGCTGTACTCTCATGGCC
CTGTACAGCTGGAGTCANGGGGCCTAGAGCTGTGGGGAGGGAGGTGCAGGG
CAGTATGCCCATCGCAGNTCGCTACACGCAGGACGCTTTTTCACGTTCTGG
CCGTTAGCCACTGTAAGNACTGCTGGGGACACACTGCAGTCAAGCGAAGCG
GTCTAGCTGGAGAAGATNTTGAGGAACCTCCAGGAGGAAGAAGCCTCTGGG

Stupid attempt

    my $file = $ARGV[0];
    my @barcodes = ('GACT','TCCT','CTGT','GTCT','ACCT','ACAT','CAGT','GCGT');
    my $prefix = $ARGV[1];
    foreach my $bar (@barcodes) {                        # one full pass over the 60 GB file per barcode
        open(IN, '<', $file) or die;
        my $id_line;
        while (my $line = <IN>) {
            if ($line =~ m/^\@/) { $id_line = $line; }   # fragile: quality lines may also start with @
            my $sequence = <IN>;
            if ($bar eq substr($sequence, 0, 4)) {
                print $id_line, substr($sequence, 4);
                my $line3 = <IN>;
                print $line3;
                my $line4 = <IN>;
                print substr($line4, 4);
            }
        }
        close IN;
    }

How fast can this go?
- Reading the file line by line with Perl: 431 sec
- Reading in 64 KB blocks: 122 sec
- Simple copy with Unix cp: 597 sec
- Perl copy in standard read line – write line fashion: 1197 sec
- Perl copy in read/write of 64 KB blocks: 632 sec
- Perl copy, read in 64 KB blocks, cache output in 100 MB: 404 sec
- Realistic Perl copy, read in small blocks, cache output: 562 sec
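
The "read in 64 KB blocks, cache output in 100 MB" copy looks roughly like this (a sketch; the file names come from the command line):

    use strict;
    use warnings;

    my ($src, $dst) = @ARGV;
    open(my $in,  '<', $src) or die "$!";
    open(my $out, '>', $dst) or die "$!";
    my ($buf, $cache) = ('', '');
    while (read($in, $buf, 65536)) {             # read 64 KB at a time
        $cache .= $buf;
        if (length($cache) >= 100_000_000) {     # flush the output cache at ~100 MB
            print {$out} $cache;
            $cache = '';
        }
    }
    print {$out} $cache;                         # flush the remainder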

Version 1, 2193 sec

    my $file = $ARGV[0];
    my @tags = qw(GCGT CTGT CAGT GTCT ACCT);
    open(IN, '<', $file) or die "Can't open $file, reason: $!\n";
    my ($id_line, $seq, $quality, %filehash);
    foreach my $tag (@tags) {
        my $fh;
        open($fh, '>', "$tag.$file") or die "$!";
        $filehash{$tag} = $fh;
    }
    open(my $unassigned, '>', "untagged.$file") or die "$!";
    while (1) {
        $id_line = <IN>;
        last unless defined $id_line;
        $seq     = <IN>;
        $quality = <IN>;                   # the + line, overwritten below
        $quality = <IN>;                   # the real quality line
        my $tag = substr($seq, 0, 4);
        substr($seq, 0, 4, ''), substr($quality, 0, 4, '') if exists $filehash{$tag};
        my $fh = exists $filehash{$tag} ? $filehash{$tag} : $unassigned;
        print $fh $id_line . $seq . "+\n" . $quality;
    }
    close IN;
    close $unassigned;
    foreach my $fh (values %filehash) { close $fh; }

Version 2, cached output, 1305 sec

    my $file = $ARGV[0];
    my @codes = qw(GCGT CTGT CAGT GTCT ACCT);
    open(IN, '<', $file) or die "Can't open $file, reason: $!\n";
    push(@codes, 'untagged');
    my ($id_line, $seq, $quality, $count, %filehandle, %cache);
    foreach my $code (@codes) {
        open(my $fh, '>', "$code.$file") or die "$!";
        $filehandle{$code} = $fh;
        $cache{$code} = '0' x 100_000_000;     # allocate memory up front (~100 MB; exact size assumed)
    }
    foreach my $code (@codes) { $cache{$code} = ''; }
    for (;;) {
        $id_line = <IN>;
        last unless defined $id_line;
        $seq     = <IN>;
        $quality = <IN>;
        $quality = <IN>;
        my $tag = substr($seq, 0, 4);
        if (exists $filehandle{$tag}) {
            $cache{$tag} .= $id_line . substr($seq, 4) . "+\n" . substr($quality, 4);
        } else {
            $cache{untagged} .= $id_line . $seq . "+\n" . $quality;
        }
        next unless ++$count >= 1_000_000;     # a million records read
        foreach my $code (@codes) {
            my $fh = $filehandle{$code};
            print $fh $cache{$code};
            $cache{$code} = '';
        }
        $count = 0;
    }
    close IN;
    foreach my $code (@codes) {
        my $fh = $filehandle{$code};
        print $fh $cache{$code};
        close $fh;
    }

Version 3, pipe, 1482 sec Same as version 2, just adding a fast reading pipe in front
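
The slide does not show the pipe itself; one way to set it up is to let a separate process stream the file, so that disk reading overlaps with the Perl parsing (a sketch):

    # replaces the plain open(IN, '<', $file) of version 2
    open(IN, '-|', 'cat', $file) or die "Can't start reader: $!\n";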

Version 4, block reading, 2020 sec

    my ($offset, $buffer, $count) = (0, '', 0);
    while (read(IN, $buffer, 8192, $offset)) {
        my $newline = chomp $buffer;                 # 1 if the block ended on a newline
        my @lines = split(m/\n/, $buffer);
        my $lastgoodentry = @lines - @lines % 4;     # index just past the last complete record
        $lastgoodentry -= 4 if $lastgoodentry == @lines and not $newline;
        if ($lastgoodentry == @lines) {
            $buffer = '';
            $offset = 0;
        } else {                                     # carry the incomplete record over to the next read
            $buffer  = join("\n", splice(@lines, $lastgoodentry));
            $buffer .= "\n" if $newline;
            $offset  = length($buffer);
        }
        for (my $i = 0; $i < $lastgoodentry; $i += 4) {
            my ($id, $seq, $plus, $quality) = @lines[$i .. $i + 3];
            my $tag = substr($seq, 0, 4);
            if (exists $filehandle{$tag}) {
                $cache{$tag} .= "$id\n" . substr($seq, 4) . "\n+\n" . substr($quality, 4) . "\n";
            } else {
                $cache{untagged} .= "$id\n$seq\n+\n$quality\n";
            }
        }
        next unless ++$count >= 1000;                # blocks
        foreach my $code (@codes) {
            my $fh = $filehandle{$code};
            print $fh $cache{$code};
            $cache{$code} = '';
        }
        $count = 0;
    }

Version 5, block reading 2, 2000 sec

    my ($offset, $buffer, $count) = (0, '', 0);
    while (read(IN, $buffer, 65536, $offset)) {
        my $recpos = 0;
        for (;;) {
            my $seqpos = 1 + index($buffer, "\n", $recpos);    # start of the sequence line
            last unless $seqpos;
            my $qualpos = 3 + index($buffer, "\n", $seqpos);   # start of the quality line, past "+\n"
            last unless $qualpos > 2;
            my $nextrec = 1 + index($buffer, "\n", $qualpos);  # start of the next record
            last unless $nextrec;
            my $tag = substr($buffer, $seqpos, 4);
            my $record = substr($buffer, $recpos, $nextrec - $recpos);
            if (exists $filehandle{$tag}) {
                substr($record, $qualpos - $recpos, 4, '');    # cut the barcode from the quality ...
                substr($record, $seqpos  - $recpos, 4, '');    # ... and from the sequence
                $cache{$tag} .= $record;
            } else {
                $cache{untagged} .= $record;
            }
            $recpos = $nextrec;
        }
        $buffer = substr($buffer, $recpos);    # keep the incomplete record
        $offset = length($buffer);
        next unless ++$count >= 1000;          # blocks
        foreach my $code (@codes) {
            my $fh = $filehandle{$code};
            print $fh $cache{$code};
            $cache{$code} = '';
        }
        $count = 0;
    }

Version 6, record reading, 1069 sec

    my ($count, $record, $size, $extra, $tag, $missing) = (0, '', 0x9c, '', '', '');
    while ($extra or read(IN, $record, $size)) {
        $record = $extra, $extra = '' if $extra;
        if (substr($record, -1, 1) ne "\n") {            # the read did not end on a record boundary
            my $where = rindex($record, "\n") + 1;
            if (length($record) - $where > 5) {          # record ran long: grow byte by byte
                do {
                    read(IN, $missing, 1);
                    $size++;
                    $record .= $missing;
                } until $missing eq "\n";
            } else {                                     # record ran short: the tail belongs to the next record
                $extra = substr($record, $where, length($record) - $where, '');
                $size -= length($extra);
                read(IN, $extra, $size - length($extra), length($extra));
            }
        }
        $tag = substr($record, -106, 4);                 # barcode: first 4 bases of the sequence line
        if (exists $filehandle{$tag}) {
            substr($record, -106, 4, '');
            substr($record, -52, 4, '');
            $cache{$tag} .= $record;
        } else {
            $cache{untagged} .= $record;
        }
        next unless ++$count >= 1_000_000;               # records; exact flush threshold assumed
        foreach my $code (@codes) {
            my $fh = $filehandle{$code};
            print $fh $cache{$code};
            $cache{$code} = '';
        }
        $count = 0;
    }

Version 7, mixed reading, 1060 sec

    my ($count, $record, $id, $tag) = (0);
    while (defined ($id = <IN>)) {                   # read the id line with the normal line reader
        read(IN, $record, 0x6a);                     # then the remaining 106 bytes of the record
        $tag = substr($record, 0, 4);
        if (exists $filehandle{$tag}) {
            substr($record, 0x36, 4, '');            # cut the barcode from the quality line (offset 54) ...
            substr($record, 0, 4, '');               # ... and from the sequence line
            $cache{$tag} .= "$id$record";
        } else {
            $cache{untagged} .= "$id$record";
        }
        next unless ++$count >= 1_000_000;           # records; exact flush threshold assumed
        foreach my $code (@codes) {
            my $fh = $filehandle{$code};
            print $fh $cache{$code};
            $cache{$code} = '';
        }
        $count = 0;
    }
    close IN;
    foreach my $code (@codes) {
        my $fh = $filehandle{$code};
        print $fh $cache{$code};
        close $fh;
    }

The next step
It is hard to know when you have reached the best possible speed. When you have no more ideas to try out, sleep on it – or stop.
Consider parallelizing your algorithm – all computers these days have multiple cores to work with. In this NGS example, we could have one core reading the file into memory, one core separating the records into the right output caches, and one core writing the output files.
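
A minimal sketch of moving the reading onto its own core with Perl's forking open; the record handling itself is elided, and this is one possible arrangement rather than the one benchmarked above:

    my $pid = open(my $reader, '-|');            # fork; the child's STDOUT becomes $reader
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                             # child: do nothing but read and forward blocks
        open(my $in, '<', $ARGV[0]) or die "$!";
        my $buf;
        print $buf while read($in, $buf, 65536);
        exit 0;
    }
    while (my $line = <$reader>) {               # parent: sort records into output caches as before
        # ... record handling goes here ...
    }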

Conclusions
- Declaration of my $variables takes time.
- Data that are basically line oriented are hard to work with in block reading.
- Even the simplest Perl statement takes too long if executed many times.
- Usually, the less code you need to express your algorithm, the faster it will go.
- Optimizing takes a lot of time; you have to choose when to do it.
- Optimizing is also a lot about trial and error.