Linear Search Efficiency Assessment
P. Pete Chong
Gonzaga University, Spokane, WA
Why Search Efficiency?
- Information systems help users obtain the “right” information for better decision making, which requires search.
- Better organization reduces the time needed to find this “right” information, which requires sorting.
- Is the savings in search worth the cost of the sort?
Search Cost
If we assume random access, the average cost is (n + 1)/2 comparisons. That is, for 1000 records, the average search cost is approximately 1000/2 = 500. For large n we may use n/2 to simplify the calculation.
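As a quick sanity check on this figure, here is a minimal sketch (the function name and sample call are illustrative, not from the slides) that averages the search position over all n records under uniform random access:

```python
def average_search_cost_uniform(n):
    """Average comparisons for a linear search when every record is
    equally likely to be requested: positions 1..n, each with
    probability 1/n, which works out to (n + 1) / 2."""
    return sum(range(1, n + 1)) / n

print(average_search_cost_uniform(1000))  # 500.5, roughly n/2 for large n
```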
Search Cost
In reality, the usage pattern is not random. For payroll, for example, every record is accessed once and only once, so sorting has no effect on search efficiency. Most of the time, however, the distribution follows the 80/20 Rule (the Pareto Principle).
The Pareto Principle

| Applications | 80% of the... | ...in 20% of the... |
|---|---|---|
| Income distribution | Income/Wealth | People |
| Marketing decisions | Business | Customers |
| Population distribution | Population | Cities |
| Firm size distribution | Assets | Firms |
| Library resource management | Transactions | Holdings |
| Academic productivity | Papers published | Authors |
| Software menu design | Usage | Features used |
| Database management | Accesses | Data accessed |
| Inventory control | Value | Inventory |
[Table: grouped publication counts. Columns: group index i, papers per author n_i, number of authors f(n_i), paper subtotal n_i·f(n_i), cumulative authors, cumulative papers, author proportion x_i, paper proportion ρ_i. Summary statistics shown: total number of groups (m) = 26 and the average number of publications per author (μ).]
A Typical Pareto Curve
Formulate the Pareto Curve
Chen et al. (1994) define:
f(n_i) = the number of authors with n_i papers,
T = Σ_i f(n_i) = total number of authors,
R = Σ_i n_i·f(n_i) = total number of papers,
μ = R/T = the average number of published papers per author.
Formulate the Pareto Curve
For each index level i, let x_i be the cumulative fraction of the total number of authors and ρ_i the cumulative fraction of the total papers published:
x_i = Σ_j f(n_j)/T and ρ_i = Σ_j n_j·f(n_j)/R,
where each sum runs over the groups j with at least n_i papers per author.
Formulate the Pareto Curve
Plugging these values into (ρ_i - ρ_{i+1}) / (x_i - x_{i+1}), Chen et al. derive the slope formula s_i = n_i/μ.
When n_i = 1, s_i = 1/μ = T/R; let us call this particular slope θ.
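A small sketch of these quantities under an assumed, made-up frequency table (the data and variable names are illustrative, not the slides' data); it computes T, R, μ, θ, the cumulative proportions x_i and ρ_i, and the segment slopes n_i/μ:

```python
# freq[n] = f(n) = number of authors who published exactly n papers (toy data).
freq = {1: 120, 2: 60, 3: 30, 5: 20, 10: 8, 25: 2}

T = sum(freq.values())                   # total number of authors
R = sum(n * f for n, f in freq.items())  # total number of papers
mu = R / T                               # average papers per author
theta = T / R                            # slope where n_i = 1, i.e. 1/mu

# Cumulative proportions over groups with at least n papers,
# accumulated from the most prolific group downward.
x, rho = [], []
authors = papers = 0
for n in sorted(freq, reverse=True):
    authors += freq[n]
    papers += n * freq[n]
    x.append(authors / T)    # x_i: fraction of authors with >= n papers
    rho.append(papers / R)   # rho_i: fraction of papers they account for

# The finite-difference slope between consecutive points reduces to n_i / mu.
slopes = [n * theta for n in sorted(freq, reverse=True)]
print(f"theta = {theta:.3f}")
print(list(zip(x, rho, slopes)))
```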
Revisit the Pareto Curve
θ = T/R = 370/1763 ≈ 0.21
The Significance
- We now have a quick way to quantify different usage concentrations.
- Simulation shows that in most situations a moderate sample size is sufficient to assess the usage concentration.
- The inverse of the average usage (θ) is easy to calculate.
Search Cost Calculation
The search cost for a randomly distributed list is n/2; thus, for 1000 records, the search cost is 500. For a list with an 80/20 distribution, the search cost is (200/2)(80%) + [(200 + 1000)/2](20%) = 200, a saving of 60%.
Search Cost Calculation
Let the first number in the 80/20 be a and the second number be b. Since these two numbers are actually percentages, we have a + b = 1. A fraction a of the accesses falls in the first bn records (average cost bn/2), and the remaining fraction b falls in the rest of the list (average cost (bn + n)/2). The expected search cost for a list of n records is therefore the weighted average:
(bn/2)(a) + [(bn + n)/2](b) = (bn/2)(a + b + 1) = (bn/2)(2) = bn
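The same weighted average can be written as a small function for an arbitrary a/b split (the function name and the example calls are illustrative, not from the slides):

```python
def expected_search_cost(n, a=0.8, b=0.2):
    """Expected linear-search cost when a fraction a of accesses hits the
    first b*n records and the remaining fraction b hits the rest; each
    sublist is unsorted internally, so their average costs are b*n/2
    and (b*n + n)/2 respectively.  The result simplifies to b*n."""
    assert abs(a + b - 1.0) < 1e-9
    return (b * n / 2) * a + ((b * n + n) / 2) * b

print(expected_search_cost(1000))            # 200.0, the 80/20 example above
print(expected_search_cost(1000, 0.9, 0.1))  # 100.0, i.e. b*n
```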
Search Cost Calculation
Thus, b indicates the cost of search as a percentage of the records in the list, and bn represents an upper bound on the number of comparisons. For a list fully sorted by usage with an 80/20 distribution, Knuth (1973) has shown that the average search cost C(n) is only 0.122n.
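For a list fully sorted by usage, the expected cost is simply the sum of p_i times position i, with the access probabilities p_i in descending order. The sketch below assumes one common formalization of the 80/20 rule (the most-used i records receive a cumulative fraction (i/n)^t of all accesses, with exponent t = log 0.80 / log 0.20 ≈ 0.14); under that assumption the computed cost lands near the 0.122n figure, though this may not be exactly the model Knuth analyzed:

```python
import math

def sorted_list_search_cost(probabilities):
    """Expected comparisons for sequential search when the list is
    already sorted by descending access probability."""
    probs = sorted(probabilities, reverse=True)
    return sum(p * position for position, p in enumerate(probs, start=1))

# Assumed 80/20 model: cumulative access fraction of the top i records is (i/n)**t.
n = 1000
t = math.log(0.80) / math.log(0.20)
cdf = [(i / n) ** t for i in range(n + 1)]
probs = [cdf[i] - cdf[i - 1] for i in range(1, n + 1)]

print(sorted_list_search_cost(probs) / n)  # about 0.12, close to the 0.122n figure
```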
Search Cost Simulation
[charts: simulated values of b and C(n)]
Search Cost Estimate
Regression analyses yield:
b = … , for 0.2 < θ < 1.0,
b = … , for 0 < θ < 0.2, and
C(n) = …
Conclusion
- The true search cost lies between the estimates given by b and by C(n).
- We may use C(n) ≈ 0.5θn as a quick estimate of the search cost of a fully sorted list.
- That is, take a moderate sample of usage; the search cost will be roughly half of the inverse of the average usage (θ) times the total number of records (n), as sketched below.
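A sketch of this rule of thumb with illustrative names and sample data: estimate θ as the inverse of the average usage observed in a sample, then take half of θ·n:

```python
def quick_search_cost_estimate(sample_usage_counts, n):
    """Rough cost estimate for searching a list of n records fully sorted
    by usage: C(n) ~ 0.5 * theta * n, where theta is the inverse of the
    average usage observed in a moderate sample."""
    theta = len(sample_usage_counts) / sum(sample_usage_counts)
    return 0.5 * theta * n

# Assumed sample of per-record access counts.
sample = [12, 1, 1, 3, 1, 7, 1, 2, 1, 1]
print(quick_search_cost_estimate(sample, 1000))  # ~166.7 comparisons
```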
“Far-fetched” (?) Applications
- Define and assess the degree of monopoly? What is the effect of monopoly?
- Note the gap between b and C(n) (the ideal).
- Gini Index?