Modeling Web Content Dynamics — Brian Brewington, George Cybenko, IMA, February 2001


1

2 Modeling Web Content Dynamics — Brian Brewington (brew@dartmouth.edu), George Cybenko (gvc@dartmouth.edu), IMA, February 2001

3 Observing changing information sources
– An index of changing information sources must re-index items periodically to keep the index from becoming out of date.
– What does it mean for an observer or index to be “up-to-date” or “current”?
– Our work on the web has two parts:
  – Estimation of change rates for a large sample of web pages
  – Re-indexing speed requirements with respect to a formal definition of “up-to-date”

4 Your brain is good at this. Where is your visual attention directed when driving a car? Why? You form state estimates and re-observe when the uncertainty becomes too large.

5 Ingredients
1. A formal definition of “up-to-dateness”
2. Data
3. Scheduling to optimize “up-to-dateness”

6 A meaning for “up to date”
An index entry is β-current if it is correct to within a grace period of time β, with probability at least α; such an entry is “(α, β)-current”.
[Timeline figure: entry last observed at t₀, next observed at t₀ + T, current time tₙ; no alteration is allowed in the gray region (tₙ − β, tₙ) for the index entry to be β-current.]

7 (α, β)-currency has meaning in many contexts
Any source has a spectrum of possibilities; here are some possible (α, β) values (guesses):
– Newspaper: (0.9, 1 day)
– Television news: (0.95, 1 hour)
– Broker watching stocks: (0.95, 30 min)
– Air traffic controller: (0.95, 20 sec)
– Web search engine: (0.6, 1 day)
– An old web page’s links: (0.4, 70 days)

8

9 Collecting web page data
– Our web page data comes from a web monitoring service, The Informant, which runs periodic standing user queries against four search engines and monitors user-selected URLs. When new or updated results appear, users are notified via email.
– We download ~100,000 pages per day for ~30,000 users. See http://informant.dartmouth.edu

10 Sampling issues
– Biased towards search engine results in the top 10 for users’ queries
– No more than one observation of a page per day; pages are usually observed once every three days
– Queries and page checks are run only at night, so sample times are correlated
– Filesystem timestamps are available for about 65% of our observations

11 Data in our collection
– As of March 2000, we had observations of about 3 million web pages. The data in the paper spans 7 months.
– Each page is observed an average of 12 times, and the average time span of observation is 38 days.
– Each observation includes:
  – “Last-Modified” timestamps, when available
  – Observation time (using the remote server’s, if possible)
  – Document summary information:
    – Number of bytes (“Content-Length”)
    – Number of images, tables, forms, lists, banner ads
    – 16-bit hash of text, hyperlinks, and image references

12 “Lifetimes” vs. “ages”
– We model objects as having independent, identically distributed time periods between modifications. We call these “lifetimes.”
– The “age” is the time since the present lifetime began. By analogy, think of replacement parts, each with an independent lifetime length.
[Figure: timeline from 0 to 4 with a change marked at the end of each successive lifetime (1.53, 1.14, 0.62, 0.84); the age runs from the most recent change to the present.]

13 Determining dynamics from the time data
Two ways to find the distribution of change rates:
1. Observe the time between successive modifications (lifetimes).
   Good: direct measurement of time between changes
   Bad: aliasing possible; needs repeat observations
2. Observe the time since the most recent modification (ages).
   Good: doesn’t have aliasing problems; works without repeat observations
   Bad: requires that we accurately account for growth

14 Sampling the lifetime distribution
There are two problems with trying to sample the difference of successive change times (x = modification, o = observation):
1. An observation can miss changes: several changes (x) between two observations (o) are aliased into one, so the observed lifetime overestimates the actual lifetime.
2. The observation window may not be big enough to see any changes at all.
[Figure: timelines of x’s and o’s illustrating both cases, with the observation timespan, actual lifetime, and observed lifetime marked.]
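Both sampling pitfalls can be illustrated with a quick simulation; a minimal sketch, assuming a page whose changes are Poisson (exponential lifetimes) observed at a fixed period. The function name and parameter values are illustrative, not from the talk:

```python
import random

def simulate_observed_lifetimes(rate, obs_period, timespan, seed=0):
    """Simulate a page with Poisson changes at the given rate, observed
    every obs_period over [0, timespan]. An 'observed lifetime' is the gap
    between successive observations that detected a change, so multiple
    changes between two looks are aliased into one."""
    rng = random.Random(seed)
    # Generate change times (exponential inter-arrival times).
    changes, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t > timespan:
            break
        changes.append(t)
    # Observe periodically; record gaps between change-detecting observations.
    observed, last_detect, seen = [], None, 0
    obs_t = obs_period
    while obs_t <= timespan:
        n_changes = sum(1 for c in changes if c <= obs_t)
        if n_changes > seen:            # at least one change since last look
            if last_detect is not None:
                observed.append(obs_t - last_detect)
            last_detect = obs_t
            seen = n_changes
        obs_t += obs_period
    return changes, observed

changes, observed = simulate_observed_lifetimes(rate=1.0, obs_period=3.0, timespan=1000.0)
true_mean = 1.0                         # mean lifetime of an exp(1) process
obs_mean = sum(observed) / len(observed)
# Aliasing: observed gaps can never be shorter than obs_period,
# so the observed mean lifetime is biased upward.
print(f"true mean lifetime ~ {true_mean}, observed mean ~ {obs_mean:.2f}")
```

With the mean lifetime well below the observation period, nearly every gap collapses to the observation period itself, which is exactly the aliasing the slide warns about.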

15 Web page age CDF
Median age is about 120 days; the upper 25% are older than 1 year, and the lowest 25% are younger than 1 month.
[Plot: cumulative probability vs. age in days, log scale spanning 1 day to beyond 100 days.]

16 Empirical lifetime distribution
[Plots: lifetime PDF and lifetime CDF]

17 When do changes happen?
Change times, taken mod 24 × 7 hours, show that more changes happen during the span of US working hours (8 AM to 8 PM EST).
[Histogram: relative frequency vs. time since Thursday 12:00 GMT in hours (0–150), with days labeled from Wednesday afternoon through Wednesday morning.]

18 Distribution of mean change times
– The Weibull distribution, a generalized exponential, models mean lifetimes fairly well:
  F(t) = 1 − exp(−(t/δ)^σ)
– This can be used to find an age or lifetime CDF for any shape parameter σ and scale parameter δ. But for the age CDF, a growth model is needed, so age-based estimates can be inaccurate.
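A minimal sketch of the Weibull forms, assuming the standard shape/scale parameterization (the slide’s symbols did not survive extraction); the parameter values below are illustrative, not the fitted values from the talk:

```python
import math

def weibull_pdf(t, shape, scale):
    """Weibull density f(t) = (shape/scale) (t/scale)^(shape-1) exp(-(t/scale)^shape)."""
    if t <= 0:
        return 0.0
    x = t / scale
    return (shape / scale) * x ** (shape - 1) * math.exp(-x ** shape)

def weibull_cdf(t, shape, scale):
    """Weibull CDF F(t) = 1 - exp(-(t/scale)^shape); shape = 1 recovers the exponential."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))

# Illustrative parameters only: mean lifetimes measured in days.
shape, scale = 0.8, 100.0
print(f"fraction of pages with mean lifetime under 120 days: {weibull_cdf(120.0, shape, scale):.3f}")
```

A shape parameter below 1 puts more mass on both very short and very long lifetimes than a plain exponential, which is why the generalization is useful here.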

19

20 (α, β)-currency for a Poisson source
A single source has Poisson changes at rate λ. If re-indexed every T time units, the expected probability α of the index entry being β-current is:
  α = β/T + (1 − e^{−λ(T−β)}) / (λT)  for β < T (and α = 1 for β ≥ T)
[Plot: probability α vs. expected changes per check period, λT (log scale), for β/T = 0, 0.25, 0.6, 0.9.]
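The expected probability can be computed directly; this sketch averages the no-change probability exp(−λ·max(0, t − β)) over a uniform phase t in [0, T], i.e., the entry is β-current iff no change falls between the last observation and β time units ago:

```python
import math

def alpha_poisson(lam, T, beta):
    """Expected probability that an entry re-indexed every T time units is
    beta-current, for a source with Poisson changes at rate lam.
    Closed form of (1/T) * integral over [0, T] of exp(-lam*max(0, t-beta)) dt."""
    if beta >= T or lam == 0.0:
        return 1.0
    return beta / T + (1.0 - math.exp(-lam * (T - beta))) / (lam * T)

# As lam*T grows, alpha falls toward beta/T; beta >= T gives alpha = 1.
for lam in (0.01, 0.1, 1.0, 10.0):
    print(f"lam*T = {lam * 10.0:6.2f}  alpha = {alpha_poisson(lam, 10.0, 2.5):.3f}")
```

This reproduces the shape of the plotted curves: each curve starts at 1 for rare changes and decays to its β/T floor as changes per check period grow.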

21 Probability α of β-currency over a collection
Expected probability ᾱ of a random index entry being β-current, given the distribution f(t) of mean change times t:
  ᾱ = ∫₀^∞ f(t) · α(λ = 1/t, T, β) dt
where f(t) is the distribution of average lifetimes and α(λ, T, β) is the probability of being β-current given the average lifetime.
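The collection-wide average can be approximated numerically; a sketch using a simple Riemann sum against an assumed Weibull density of mean change times (illustrative parameters, not the fitted ones from the talk):

```python
import math

def alpha_poisson(lam, T, beta):
    """Probability of beta-currency for a Poisson source of rate lam checked every T."""
    if beta >= T or lam == 0.0:
        return 1.0
    return beta / T + (1.0 - math.exp(-lam * (T - beta))) / (lam * T)

def weibull_pdf(t, shape, scale):
    """Assumed density of mean change times t."""
    x = t / scale
    return (shape / scale) * x ** (shape - 1) * math.exp(-x ** shape)

def alpha_collection(T, beta, shape, scale, n=20000, t_max=5000.0):
    """Approximate the integral of alpha(1/t, T, beta) * f(t) dt
    with a right-endpoint Riemann sum (skipping t=0, where the rate diverges)."""
    h = t_max / n
    total = 0.0
    for i in range(1, n + 1):
        t = i * h
        total += alpha_poisson(1.0 / t, T, beta) * weibull_pdf(t, shape, scale) * h
    return total

# Illustrative: 18-day re-indexing period, one-week grace period, times in days.
print(f"collection-wide alpha ~ {alpha_collection(T=18.0, beta=7.0, shape=0.8, scale=100.0):.3f}")
```

Because α(λ, T, β) increases with the grace period β, loosening β raises the collection-wide average for the same re-indexing period T.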

22 Index performance surface: α as a function of T and β
– Surface formed by integrating out the rate dependence
– A large re-indexing period T drives α down; β ≥ T gives α = 1
– The plane shown for α = 0.95 intersects the surface in a level set of (β, T) pairs

23 [Plot: re-indexing period T (days) vs. grace period β (days); the 95% level set is the curve of (T, β) pairs achieving α = 0.95, with T = 11.5 days marked.]

24 Bandwidth needed for (0.95, 1 week)-currency
For (0.95, 1 week)-currency of this collection:
– Must re-index with a period of around 18 days.
– A (0.95, 1 week) index of the whole web (~800 million pages) processes about 50 megabits/sec.
– A more “modest” (0.95, 1 week) index of 150 million pages processes about 9 megabits/sec.
For fixed-period checks, we can estimate processing speed requirements this way.
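The bandwidth figures can be sanity-checked with back-of-envelope arithmetic; the ~12 KB average page size below is an assumption chosen to reproduce the talk’s numbers, not a figure stated in the talk:

```python
# Back-of-envelope check of the (0.95, 1 week)-currency bandwidth figures.
AVG_PAGE_BYTES = 12_000             # assumed average page size, circa 2000
PERIOD_SECONDS = 18 * 24 * 3600     # one 18-day re-indexing period

def megabits_per_sec(n_pages, period_s=PERIOD_SECONDS, page_bytes=AVG_PAGE_BYTES):
    """Sustained download rate needed to fetch n_pages once per period."""
    return n_pages * page_bytes * 8 / period_s / 1e6

print(f"whole web (800M pages): {megabits_per_sec(800e6):.0f} Mbit/s")
print(f"modest index (150M pages): {megabits_per_sec(150e6):.1f} Mbit/s")
```

Under that page-size assumption the arithmetic lands on roughly 50 and 9 megabits/sec, matching the slide.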

25 Empirical search engine (α, β)-currency

26 A calculus for (α, β)-currency
If x is (α₁, β₁)-current and y is (α₂, β₂)-current, then (assuming independence) the pair (x, y) is (α₁α₂, max(β₁, β₂))-current. Extend this to other atomic operations on information, e.g., composition.
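The combination rule can be sketched as a small helper; note the α₁·α₂ product encodes an independence assumption between the two sources (the slide’s garbled formula preserves only the max over grace periods):

```python
def combine(x, y):
    """Combine two (alpha, beta) currency guarantees: assuming the sources
    change independently, the pair (x, y) is correct with probability at
    least alpha_x * alpha_y, up to the looser grace period max(beta_x, beta_y)."""
    (ax, bx), (ay, by) = x, y
    return (ax * ay, max(bx, by))

# (alpha, beta in days) -- the guess values from the earlier slide.
search_engine = (0.6, 1.0)
newspaper = (0.9, 1.0)
print(combine(search_engine, newspaper))
```

Combining guarantees only ever weakens them: the probability shrinks and the grace period loosens, which is why long decision chains built from many sources degrade quickly.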

27 Summary
– About one in five pages has been modified within the last 12 days.
– (0.95, 1 week)-currency on our collection: must observe every 18 days.
– Ideas: more specialty search engines? Distributed monitoring / remote update?
– Other work: algorithms for scheduling observation based on source change rate and importance

28 Mathematics of “Semantic Hacking”

29 Problem
– Denial-of-service attacks target the infrastructure (easy to detect)
– System attacks target systems
– Semantic attacks target information (hard to detect)

30 Distribution of information
A “Gaussian” shape is expected. Do outliers indicate collusion?

31 What makes a good mystery/thriller?
A wrong conclusion can be reached by one large, detectable bad decision or by a sequence of small, undetectably perturbed decisions. We must understand the whole sequence of decisions, not just one in isolation.
[Diagram: paths to the “correct” vs. “wrong” conclusion.]

32 Ongoing research
– Develop a model of such “semantic attacks”.
– Develop ways to quantify them.
– Develop tools for detecting/managing complex decision sequences.
– Make information/decision systems more robust.

33 Acknowledgements
– DARPA contract F30602-98-2-0107
– DoD MURI (AFOSR contract F49620-97-1-03821)
– NSF KDI Grant 9873138

