A small excursion: Empirical Computer Science. Binary Search versus Linear Search
Question: From a practical point of view, is there a significant advantage in using binary search rather than linear search?
What we will do: describe linear search (in Java); describe binary search (in Java); propose an experiment; present our results.
How we realise our experiments: given a string s and an array of strings sorted in lexicographic order, is the string s in the array?
Linear Search
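The code from this slide is not reproduced here. As a stand-in, here is a minimal sketch of what a linear search over a sorted array of strings might look like. The name linSearch comes from the slides; the signature, the class name LinSearchSketch and the early exit on passing the probe's position are assumptions, not the original code.

```java
// Sketch only: signature and early-exit test are assumptions.
class LinSearchSketch {

    // Returns true if s occurs in data, which is assumed to be sorted
    // in lexicographic (String.compareTo) order.
    static boolean linSearch(String s, String[] data) {
        for (String entry : data) {
            int c = entry.compareTo(s);
            if (c == 0) return true;   // found s
            if (c > 0) return false;   // passed the place where s would be
        }
        return false;                  // ran off the end of the array
    }

    public static void main(String[] args) {
        String[] data = {"ant", "bee", "cat", "dog"};
        System.out.println(linSearch("cat", data));
        System.out.println(linSearch("cow", data));
    }
}
```

The early exit only works because the array is sorted; without it the loop would simply scan every entry.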
What is this?
Binary Search
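Again the slide's code is not reproduced here, so the following is only a sketch of a conventional binary search over a sorted array of strings. The name binSearch is from the slides; the signature and the class name BinSearchSketch are assumptions.

```java
// Sketch only: a conventional iterative binary search, not the original code.
class BinSearchSketch {

    // Returns true if s occurs in data, which must be sorted
    // in lexicographic (String.compareTo) order.
    static boolean binSearch(String s, String[] data) {
        int lo = 0;
        int hi = data.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;      // avoids overflow of lo + hi
            int c = data[mid].compareTo(s);
            if (c == 0) return true;           // found s
            if (c < 0) lo = mid + 1;           // s can only be to the right
            else hi = mid - 1;                 // s can only be to the left
        }
        return false;                          // interval is empty
    }

    public static void main(String[] args) {
        String[] data = {"ant", "bee", "cat", "dog"};
        System.out.println(binSearch("cat", data));
        System.out.println(binSearch("cow", data));
    }
}
```

Each iteration halves the interval [lo, hi], which is where the expected logarithmic behaviour comes from.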
Code for Experiments
Paranoia: prior to executing the experiments, hundreds of thousands of calls were made to linSearch and binSearch to make sure that they were in agreement (and initially they were not!).
Data sets used: data sets were produced using the following program.
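A check of this kind might be sketched as follows. Everything apart from the names linSearch and binSearch is an assumption: the random word generator, the array sizes, the seed and the class name AgreementCheck are made up for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch only: compares two searches on many small random sorted arrays.
class AgreementCheck {

    static boolean linSearch(String s, String[] data) {
        for (String entry : data) {
            if (entry.equals(s)) return true;
        }
        return false;
    }

    static boolean binSearch(String s, String[] data) {
        int lo = 0;
        int hi = data.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int c = data[mid].compareTo(s);
            if (c == 0) return true;
            if (c < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }

    // Hypothetical generator: words of length 1..3 over the letters a..d,
    // a small space so that both hits and misses occur often.
    static String randomWord(Random rnd) {
        int len = 1 + rnd.nextInt(3);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) sb.append((char) ('a' + rnd.nextInt(4)));
        return sb.toString();
    }

    // Runs the given number of trials and returns how often the two searches disagree.
    static int countDisagreements(int trials, long seed) {
        Random rnd = new Random(seed);
        int disagreements = 0;
        for (int t = 0; t < trials; t++) {
            String[] data = new String[rnd.nextInt(20)]; // sizes 0..19, empty included
            for (int i = 0; i < data.length; i++) data[i] = randomWord(rnd);
            Arrays.sort(data);
            String probe = randomWord(rnd);
            if (linSearch(probe, data) != binSearch(probe, data)) disagreements++;
        }
        return disagreements;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + countDisagreements(200_000, 42L));
    }
}
```

Small arrays over a tiny alphabet are deliberate: they hit the edge cases (empty array, first and last element, duplicates) far more often than large realistic data would.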
Confused?
Details: given a random sorted subset of the dictionary, called data: for each entry in the dictionary, determine if it is present in the sorted set data; measure the total time for these probes. Repeat with varying sizes of data set.
Experiments were run on a low-spec Unix machine.
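The procedure above could be sketched roughly as below. This is not the original experiment code: the synthetic stand-in dictionary, the data set sizes, the seed, the helper names randomSortedSubset, cpuNanos and probeAll, and the use of ThreadMXBean for CPU time are all assumptions. Only binary search is timed here; a linear search would be swapped in the same way.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch only: probe every dictionary entry into a random sorted subset.
class ProbeExperiment {

    static boolean binSearch(String s, String[] data) {
        int lo = 0;
        int hi = data.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int c = data[mid].compareTo(s);
            if (c == 0) return true;
            if (c < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }

    // Picks n distinct entries of the dictionary at random and sorts them.
    static String[] randomSortedSubset(String[] dictionary, int n, Random rnd) {
        List<String> copy = new ArrayList<>(Arrays.asList(dictionary));
        Collections.shuffle(copy, rnd);
        String[] data = copy.subList(0, n).toArray(new String[0]);
        Arrays.sort(data);
        return data;
    }

    // CPU time of the current thread in nanoseconds, falling back to
    // wall-clock time where the JVM does not provide it.
    static long cpuNanos() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (bean.isCurrentThreadCpuTimeSupported() && bean.isThreadCpuTimeEnabled()) {
            return bean.getCurrentThreadCpuTime();
        }
        return System.nanoTime();
    }

    // Returns {number of hits, elapsed milliseconds} for probing all entries.
    static long[] probeAll(String[] dictionary, String[] data) {
        long start = cpuNanos();
        long hits = 0;
        for (String word : dictionary) {
            if (binSearch(word, data)) hits++;
        }
        long elapsed = cpuNanos() - start;
        return new long[] {hits, elapsed / 1_000_000};
    }

    public static void main(String[] args) {
        // Stand-in dictionary of synthetic words; a real word list would go here.
        String[] dictionary = new String[50_000];
        for (int i = 0; i < dictionary.length; i++) dictionary[i] = String.format("w%06d", i);
        Random rnd = new Random(1);
        for (int n = 5_000; n <= 50_000; n += 5_000) {
            String[] data = randomSortedSubset(dictionary, n, rnd);
            long[] result = probeAll(dictionary, data);
            System.out.println(n + " " + result[0] + " " + result[1] + "ms");
        }
    }
}
```

Counting the hits as well as the time gives a cheap sanity check: with a dictionary of distinct words, the number of hits should equal the size of the data set.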
Results
Data set size probed into times
Results: CPU milliseconds to perform the probes using binary search
Results: CPU milliseconds to perform the probes using linear search
Conclusion: binary search is significantly faster than linear search. However, the data set must be sorted. Note also that linear search appears to scale linearly with problem size. ARE YOU SURE?
What would be the effect of using different computers? Would we get the same results?
Different platforms: what’s happening with simeulue?
What’s happening with vahanga?
Different platforms: what’s happening with Jeremy’s machine?
Any suggestions?
Different platforms: Jeremy now calls System.gc() prior to the experiment
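Requesting a collection before the timed region might look like the sketch below. The helper name timeMillis and the class name GcBeforeTiming are hypothetical, and System.gc() is only a request to the JVM, so a collection can still occur inside the timed region.

```java
// Sketch only: ask for a garbage collection before starting the clock.
class GcBeforeTiming {

    // Runs the experiment once and returns the elapsed wall-clock milliseconds.
    static long timeMillis(Runnable experiment) {
        System.gc();                       // a hint to the JVM, not a guarantee
        long start = System.nanoTime();
        experiment.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long ms = timeMillis(() -> {
            long sum = 0;
            for (int i = 0; i < 10_000_000; i++) sum += i;
            if (sum < 0) System.out.println("unreachable"); // keeps the loop from being discarded
        });
        System.out.println(ms + "ms");
    }
}
```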
Different platforms: these regions look different. Why?
Different platforms: could it be the cache?
Inline garbage collection
On simeulue
Question: how would we convince ourselves that binSearch scales as O(log2(n))? How about if we plot y (time) against log(x) (data set size)? That is, if y is proportional to log(x), we would hope to see a straight line.
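One way to try this idea without any timing noise is to fit a straight line to y against log2(x), using the number of loop iterations of a binary search as y rather than measured time. That substitution is an assumption made here for illustration, as are the names iterationsForMiss and fitAgainstLog2, the integer data and the choice of sizes.

```java
// Sketch only: least-squares fit of y = intercept + slope * log2(n),
// with y taken to be the iterations of a binary search that misses.
class LogScalingCheck {

    // Counts loop iterations when searching a sorted array of size n
    // for a key larger than every element.
    static int iterationsForMiss(int n) {
        int[] data = new int[n];
        for (int i = 0; i < n; i++) data[i] = i;
        int key = n;                       // not present, larger than all entries
        int lo = 0, hi = n - 1, steps = 0;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            steps++;
            if (data[mid] == key) break;
            if (data[mid] < key) lo = mid + 1; else hi = mid - 1;
        }
        return steps;
    }

    // Ordinary least squares on the points (log2(sizes[i]), ys[i]).
    // Returns {slope, intercept}.
    static double[] fitAgainstLog2(int[] sizes, double[] ys) {
        int m = sizes.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < m; i++) {
            double x = Math.log(sizes[i]) / Math.log(2);
            sx += x;
            sy += ys[i];
            sxx += x * x;
            sxy += x * ys[i];
        }
        double slope = (m * sxy - sx * sy) / (m * sxx - sx * sx);
        double intercept = (sy - slope * sx) / m;
        return new double[] {slope, intercept};
    }

    public static void main(String[] args) {
        int[] sizes = new int[17];
        double[] ys = new double[17];
        for (int k = 4; k <= 20; k++) {
            sizes[k - 4] = 1 << k;
            ys[k - 4] = iterationsForMiss(1 << k);
        }
        double[] fit = fitAgainstLog2(sizes, ys);
        System.out.println("slope = " + fit[0] + ", intercept = " + fit[1]);
    }
}
```

With measured times in place of iteration counts the points will not lie exactly on a line, so the residuals of the fit are what to look at.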
Empirical CS, conclusion: … it ain’t easy.
We need to be very sure about what we are actually measuring: we were measuring garbage collection, cache, CPU time (what’s that?), …
Beware of small-scale statistics: our sample size was 1! (one data set at each size)
Was it right to be measuring CPU time only? We could have measured comparisons, or mems (memory accesses), …
Be paranoid.