The vec Perl Function
Bridget Thomson McInnes
10 October 2003
What is vec?
A Perl function that provides compact storage of lists of unsigned integers.
The integers are packed as tightly as possible within an ordinary Perl string.
Why we are interested in vec
When using suffix arrays as a data storage mechanism, we need to store the entire corpus in an array. This will not work with a corpus of 50 million tokens because of memory constraints. The idea, then, is to convert the tokens to unique integers and store the integers in an array. An ordinary Perl array of that size also runs out of memory, as the experiments later show. We are therefore looking for a way to efficiently store a set of approximately 50 million integers.
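The idea of mapping tokens to unique integers and packing them can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the toy token list, the variable names, and the choice of 32-bit elements are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical toy corpus; a real corpus would be read from a file.
my @tokens = qw(the cat sat on the mat the end);

my %id_of;          # token -> unique integer (assumed numbering from 0)
my $next_id = 0;
my $corpus  = '';   # packed integers, one 32-bit element per token
my $pos     = 0;

for my $token (@tokens) {
    # Assign a new id the first time a token is seen.
    $id_of{$token} = $next_id++ unless exists $id_of{$token};
    vec($corpus, $pos++, 32) = $id_of{$token};
}

# Read the integer ids back, position by position.
my @ids = map { vec($corpus, $_, 32) } 0 .. $pos - 1;
print "@ids\n";
print length($corpus), " bytes for $pos tokens\n";
```

Each token costs four bytes in the packed string regardless of its length, while the hash holds each distinct token only once.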
The vec Function
vec EXPR, OFFSET, BITS
EXPR is the string that the integers are packed into.
OFFSET specifies the index of the particular element that is to be stored or retrieved.
BITS specifies how wide each element is, in bits.
BITS
Must be a power of two: 1, 2, 4, 8, 16, or 32 (64 is also accepted where the platform supports it).
Example
When BITS = 1, there are 8 elements per byte.
When BITS = 2, there are 4 elements per byte.
When BITS = 4, there are 2 elements per byte.
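The elements-per-byte figures can be checked directly by storing a fixed number of elements at each width and looking at the length of the resulting string. The element count of 8 and the variable names below are arbitrary choices for this sketch.

```perl
use strict;
use warnings;

# Store the same number of elements at several widths and record the
# resulting string length in bytes.
my $count = 8;
my %bytes_for;
for my $bits (1, 2, 4, 8, 16, 32) {
    my $packed = '';
    vec($packed, $_, $bits) = 1 for 0 .. $count - 1;
    $bytes_for{$bits} = length $packed;
    printf "BITS = %2d: %d elements occupy %2d byte(s)\n",
        $bits, $count, $bytes_for{$bits};
}
```

Eight 1-bit elements fit in a single byte, while eight 4-bit elements need four bytes, which matches 8 and 2 elements per byte respectively.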
Quick Example Program
    $bitstring = "";
    $offset = 0;
    # Store 0..15 in consecutive 4-bit elements (4 bits hold the values 0 through 15).
    for $i (0..15) {
        vec($bitstring, $offset++, 4) = $i;
    }
Retrieving the Data
    $bitstring = "";
    $offset = 0;
    for $i (0..15) {
        vec($bitstring, $offset++, 4) = $i;
    }
    # Read back the elements at offsets 3 and 6.
    $a = vec($bitstring, 3, 4);
    $b = vec($bitstring, 6, 4);
    print "a = $a\n";
    print "b = $b\n";
Output
    csirh012% perl vec.pl
    a = 3
    b = 6
    csirh013%
So each offset acts as an array index.
Experiments
I ran three different experiments:
Loaded an array with 50 million integers
Loaded a vec with 50 million integers
Loaded a vec with 100 million integers
Experiment 1
Loaded an array with 50 million integers
Results (performed on marengo)
Ran out of memory at the 19,857,659th integer
Used approximately 700 MB of memory
Memory use was measured with the Perl module Devel::Size (total_size function)
Experiment 2
Loaded a vec array with 50 million integers, using a bit parameter of 32
Results (performed on csirh0)
Memory: 192 M
Time: 2.78 s
Experiment 3
Loaded a vec array with 60 million integers, using a bit parameter of 32
Results (performed on csirh0)
Memory: 230 M
Time: 2.98 s
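The memory figures for the vec experiments are close to what the element width alone predicts: at 32 bits, each integer takes 4 bytes. The sketch below checks that arithmetic against a real vec string at a much smaller, arbitrarily chosen size and then extrapolates. The helper name expected_bytes is hypothetical, and the estimate covers only the packed data, not Perl's own overhead.

```perl
use strict;
use warnings;

# Bytes needed for $count elements of $bits bits each, rounded up.
sub expected_bytes {
    my ($count, $bits) = @_;
    return int(($count * $bits + 7) / 8);
}

# Check the formula against a real vec string at a small size.
my $small  = 100_000;
my $packed = '';
vec($packed, $_, 32) = $_ for 0 .. $small - 1;
print "measured: ", length($packed), " bytes, expected: ",
    expected_bytes($small, 32), " bytes\n";

# Extrapolate to the experiment sizes.
for my $count (50_000_000, 60_000_000) {
    my $bytes = expected_bytes($count, 32);
    printf "%d integers: %d bytes (about %.0f MiB)\n",
        $count, $bytes, $bytes / (1024 * 1024);
}
```

This gives roughly 191 MiB for 50 million integers and 229 MiB for 60 million, in line with the 192 M and 230 M reported above.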
Notes on Experiments
A bit parameter smaller than 32 does not return all of the integers inserted in our experiments: 50 million distinct integers need at least 26 bits each, and the next power of two is 32.
Loading a vec array with 100 million integers runs out of memory.
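The first note can be illustrated with a small sketch. It assumes the 50 million ids are numbered from 0, so the largest is 49,999,999; the helper name min_vec_bits is hypothetical and is capped at 32 for simplicity.

```perl
use strict;
use warnings;

# Smallest power-of-two BITS value (capped at 32) that can hold $max.
sub min_vec_bits {
    my ($max) = @_;
    my $bits = 1;
    $bits *= 2 while $bits < 32 && $max >= 2 ** $bits;
    return $bits;
}

# Assumption: 50 million ids numbered from 0, so this is the largest one.
my $largest = 49_999_999;
print "BITS needed: ", min_vec_bits($largest), "\n";

# Store the same value at two widths and read it back.
my ($narrow, $wide) = ('', '');
vec($narrow, 0, 16) = $largest;
vec($wide,   0, 32) = $largest;
print "16 bits: ", vec($narrow, 0, 16), "\n";
print "32 bits: ", vec($wide,   0, 32), "\n";
```

The 32-bit element returns the value that was stored, while the 16-bit element cannot, since 16 bits only cover 0 through 65,535.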
Modules that use vec
Tie::VecArray: an array interface to a bit vector
Class::Bits: a class wrapper around bit vectors
Bit::Vector: an efficient bit vector, set of integers, and math library; widely used