Linear Search Efficiency Assessment P. Pete Chong Gonzaga University Spokane, WA 99258-0009



Why Search Efficiency? Information systems help users obtain the “right” information for better decision making, which requires searching. Better organization reduces the time needed to find this “right” information, which requires sorting. Are the savings in search worth the cost of the sort?

Search Cost If we assume random access, so that every record is equally likely to be requested, then the average cost of a linear search is (n+1)/2 comparisons. That is, for 1000 records, the average search cost is approximately 1000/2 = 500. For large n, we may use n/2 to simplify the calculation.
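The (n+1)/2 figure can be checked directly. The snippet below is a minimal sketch, assuming a successful linear search in which the record at position k costs k comparisons; the function name is illustrative and not from the slides.

```python
def average_search_cost(n: int) -> float:
    """Mean comparisons for a successful linear search over n records,
    assuming each record is equally likely to be the target."""
    # The record at position k costs k comparisons; average over k = 1..n.
    return sum(range(1, n + 1)) / n


print(average_search_cost(1000))  # 500.5, i.e. (n + 1) / 2
```

For 1000 records the exact value is 500.5, which the slide rounds to n/2 = 500.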

Search Cost In reality, the usage pattern is not uniformly random. For payroll, for example, every record is accessed once and only once; in this case sorting has no effect on search efficiency. Most of the time the distribution follows the 80/20 Rule (Pareto Principle).

The Pareto Principle
Applications               80% of the...       in 20% of the...
Income distribution        Income/Wealth       People
Marketing Decisions        Business            Customers
Population Distribution    Population          Cities
Firm Size Distribution     Assets              Firms
Library Resource Mgmt      Transactions        Holdings
Academic Productivity      Papers Published    Authors
Software Menu Design       Usage               Features used
Database Management        Accesses            Data Accessed
Inventory Control          Value               Inventory

Table columns: Group Index (i); Number of Papers (n_i); Number of Authors (f(n_i)); Paper Subtotal (n_i f(n_i)); Cumulative Authors (Σ f(n_i)); Cumulative Papers (Σ n_i f(n_i)); Author Proportion (x_i); Paper Proportion (θ_i)
Total number of Groups (m): 26
Average number of publications (μ):

A Typical Pareto Curve

Formulate the Pareto Curve Chen et al. (1994) define f(n_i) = the number of authors with n_i papers, T = $\sum_{i=1}^{m} f(n_i)$ = total number of authors, R = $\sum_{i=1}^{m} n_i f(n_i)$ = total number of papers, μ = R/T = the number of published papers per author.

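Formulate the Pareto Curve For each index level i, let x_i be the cumulative fraction of the total number of authors and θ_i be the cumulative fraction of the total papers published, then x_i = $\frac{1}{T}\sum_{j=i}^{m} f(n_j)$ and θ_i = $\frac{1}{R}\sum_{j=i}^{m} n_j f(n_j)$.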

Formulate the Pareto Curve Plugging the values above into (θ_i - θ_{i+1})/(x_i - x_{i+1}), Chen et al. derive the slope formula: s_i = n_i T/R = n_i/μ. When n_i = 1, s_i = 1/μ = T/R; this particular slope, the inverse of the average number of papers per author, is the quantity used in what follows.
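The construction of the curve can be sketched in a few lines. This is an illustration only: the data are hypothetical, the function name is made up, and the code assumes that x_i and θ_i are cumulative fractions taken from group i up to the last group m, which is the convention under which the segment slope comes out as n_i/μ.

```python
def pareto_curve(groups):
    """groups: list of (n_i, f_i) pairs, i.e. papers per author and author count.

    Returns T, R, mu, the curve points (x_i, theta_i) and the segment slopes.
    Assumes cumulative fractions run from group i to the last group m.
    """
    groups = sorted(groups)                 # ascending n_i
    T = sum(f for _, f in groups)           # total number of authors
    R = sum(n * f for n, f in groups)       # total number of papers
    mu = R / T                              # papers per author
    points = []
    authors_left, papers_left = T, R
    for n, f in groups:
        points.append((authors_left / T, papers_left / R))
        authors_left -= f
        papers_left -= n * f
    points.append((0.0, 0.0))               # curve ends at the origin
    slopes = [(t0 - t1) / (x0 - x1)
              for (x0, t0), (x1, t1) in zip(points, points[1:])]
    return T, R, mu, points, slopes


# Hypothetical data: 6 authors with 1 paper, 3 with 2 papers, 1 with 4 papers.
T, R, mu, points, slopes = pareto_curve([(1, 6), (2, 3), (4, 1)])
print(T, R, mu)   # 10 16 1.6
print(slopes)     # close to [1/mu, 2/mu, 4/mu]
```

The first slope, for the group with n_i = 1, equals T/R, so it can be read off from two totals without building the whole curve.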

Revisit the Pareto Curve 1/μ = T/R = 370/1763 = 0.21

The Significance We now have a quick way to quantify different usage concentrations. Simulation shows that in most situations a moderate sample size is sufficient to assess the usage concentration. The inverse of average usage (1/μ) is easy to calculate.

Search Cost Calculation The search cost for a randomly distributed list is n/2. Thus, for 1000 records, the search cost is 500. For a list that has an 80/20 distribution, the search cost is (200/2)(80%) + [(200+1000)/2](20%) = 200, a saving of 60%.

Search Cost Calculation Let the first number in the 80/20 be a and the second number be b. Since these two numbers are actually fractions of the whole, we have a + b = 1. With the bn most-used records placed first, the expected search cost for a list of n records is the weighted average: (bn/2)(a) + [(bn+n)/2](b) = (bn/2)(a+b+1) = (bn/2)(2) = bn.
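The weighted average above is easy to verify numerically. The sketch below assumes the two-group layout described on the slide (the bn heavily used records first, the rest after them, no ordering within a group); the function name is illustrative.

```python
def two_tier_cost(n: int, a: float, b: float) -> float:
    """Expected linear-search cost when a fraction `a` of accesses goes to
    the first b*n records and the remaining accesses go to the rest."""
    hot = b * n                    # size of the heavily used group
    cost_hot = hot / 2             # average position inside the hot group
    cost_cold = (hot + n) / 2      # midpoint of the remaining records
    return a * cost_hot + b * cost_cold


cost = two_tier_cost(1000, a=0.8, b=0.2)
print(cost)             # approximately 200, i.e. b * n
print(1 - cost / 500)   # approximately 0.6, the saving relative to n/2
```

Because a + b = 1, the result collapses to bn for any split, for example 0.1n for a 90/10 distribution.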

Search Cost Calculation Thus, b indicates the cost of search as a fraction of the records in the list, and bn represents an upper bound on the average number of comparisons per search. For a list fully sorted by usage with an 80/20 distribution, Knuth (1973) has shown that the average search cost C(n) is only 0.122n.
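The 0.122n figure can be reproduced numerically. The sketch below is not taken from the slides: it assumes a self-similar 80/20 access distribution in which the first m of n records receive a share (m/n)^θ of all accesses, with θ = log 0.80 / log 0.20, which is how I understand Knuth's model. Function and parameter names are illustrative.

```python
import math


def sorted_list_cost(n: int, share: float = 0.80, fraction: float = 0.20) -> float:
    """Expected comparisons per search for a list sorted by access frequency.

    Assumed model: record k is accessed with probability
    p_k = (k**theta - (k - 1)**theta) / n**theta, so the first
    `fraction` of the records receives `share` of the accesses.
    """
    theta = math.log(share) / math.log(fraction)
    weighted = sum(k * (k ** theta - (k - 1) ** theta) for k in range(1, n + 1))
    return weighted / n ** theta


n = 100_000
print(sorted_list_cost(n) / n)   # close to 0.122
```

Under this assumption the cost per record approaches θ/(1+θ), which is about 0.122 and sits well below the two-group bound b = 0.2.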

Search Cost Simulation [Table: simulated values of b and C(n)]

Search Cost Simulation

Search Cost Estimate Regression Analyses yield: b = , for 0.2<  <1.0 b = , for 0<  <0.2, and C(n) = 

Conclusion The true search cost lies between the estimates given by b and C(n). We may use C(n) ≈ 0.5(1/μ)n as a quick estimate of the search cost of a fully sorted list. That is, take a moderate sample of usage; the search cost will be about half of the inverse of the average usage times the total number of records.
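As a rough illustration of this rule of thumb, the estimate can be computed from a sample of usage counts. The sample below is hypothetical, the function name is made up, and the code assumes the sample lists access counts only for records that were accessed at least once.

```python
def quick_search_cost(usage_sample, n_records: int) -> float:
    """Estimate the search cost of a usage-sorted list as
    0.5 * (1 / average usage) * number of records."""
    mu = sum(usage_sample) / len(usage_sample)   # average usage per record
    return 0.5 * (1 / mu) * n_records


# Hypothetical sample: five records accessed 1, 1, 1, 2 and 5 times.
print(quick_search_cost([1, 1, 1, 2, 5], n_records=1000))   # 250.0
```

With the totals from the Pareto curve slide, T/R = 370/1763 is about 0.21, so the same rule gives roughly 0.105n.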

“Far-fetched” (?) Applications Define and assess the degree of monopoly? What is the effect of monopoly? Note the gap between b and C(n) (ideal). Gini Index?