Automating Hoarding Prasun Dewan Department of Computer Science University of North Carolina

dewan@unc.edu

2 Manual Hoarding
- Per-user, per-workstation hoard profiles, specifying:
  - Files to be added or deleted
  - Current and future (+)
    - children (c)
    - or descendants (d)
  - Priority
Personal files:
  a /coda/usr/jjk d+
  a /coda/usr/jjk/papers 100:d+
Executables:
  a /usr/X11/bin/xterm
  a /usr/X11/bin/xinit
Source files:
  a /coda/src/venus 100:c+
  a /coda/include 100:c+

3 LRU
- Works well when:
  - Activity remains the same
    - Not if a context switch occurs after cache fill
    - A context switch may occur after disconnection
    - Applications used in the disconnected state may not be the same as the ones running when hoarding occurs
  - Misses can be afforded
    - In the disconnected state, the entire working set must be kept
    - Some dynamically accessed files may not have been referenced recently
- These problems are addressed by the program trace approach

4 Per-program Hoarding
- Fixing the activity switch problem
  - Per-program traces
  - User specifies which programs will be used
[Figure: example trace of exec() and open() calls involving /a/b/c, /a/b/c1, /a/b/c2, /a/b/c3, /a/b/x/f, /a/b/x/g, /me/file1, /me/file2]

5 Uniting Traces
- Fixing the cache miss problem
  - Look at multiple executions of the program (< n)
  - Unite the accesses of all traces
    - Lowers the chance of a dynamically accessed file being missed
    - May not want to do so for (execution-specific) data
      - Distinguish data from program: data is in a directory with a different root directory and has a different extension
[Figure: united trace covering /a/b/c, /a/b/c1, /a/b/c2, /a/b/c3, /a/b/x/f, /a/b/x/g, /me/file1, /me/file2]
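Trace unification with data file filtering can be sketched as follows. This is a minimal illustration, assuming each trace is a set of path names and that program files are told apart from data files by root directory only (the extension check is left out). The names `is_under`, `unite_traces` and `program_root`, and the way the example paths are split across two traces, are hypothetical.

```python
def is_under(path, root):
    """True if path is root itself or lies below it."""
    root = root.rstrip("/")
    return path == root or path.startswith(root + "/")


def unite_traces(traces, program_root):
    """Return (program_files, data_files) to hoard.

    Program files are the union over all traces. Data files are taken
    from the most recent trace only, which is one possible policy.
    """
    program_files = set()
    for trace in traces:
        program_files.update(p for p in trace if is_under(p, program_root))
    data_files = {p for p in traces[-1] if not is_under(p, program_root)}
    return program_files, data_files


# Hypothetical split of the example paths across two executions.
trace1 = {"/a/b/c", "/a/b/c1", "/a/b/c2", "/a/b/x/f", "/me/file1"}
trace2 = {"/a/b/c", "/a/b/c1", "/a/b/c3", "/a/b/x/g", "/me/file2"}

program_files, data_files = unite_traces([trace1, trace2], "/a/b")
print(sorted(program_files))
print(sorted(data_files))
```

A file opened in only one execution (such as /a/b/c2) still ends up in the hoard set, while the data file of the older execution (/me/file1) does not.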

6 Aggregation Choice
- Possible to choose:
  - Most recent trace
  - Trace unification

7 Data File Choice
- Possible to choose:
  - Data files of all executions
  - Data files of all executions by a specific user
  - Data files of the most recent execution by a specific user

8 Multi-Program Activities
- Bookends
  - Snapshot spying
  - User specifies start and end of spying period
  - Associates it with a bookend name
- For each bookend, can ask for hoarding of:
  - All accesses recorded
  - Accesses in traces of each program executed
    - Data file filtering

9 Program Trace Limitations
- User involvement
  - Bookend definition
  - Hoarding decisions
    - Data file filtering
    - Most recent vs. aggregated
- Fixed by the semantic distance approach

10 Example of Program Trace Limitations
- Wish to hoard all chapters of a book written using tex
- Define a bookend for this project
  - Get all files accessed by programs (tex) executed during bookend definition
  - Scheme will get tex and all dynamic files accessed by it
- Data file choices:
  - Get all my data files accessed during bookend spying
    - Must access all chapters in snapshot
  - Get all of my data files accessed by tex
    - Will get more than I want
  - Get all of my data files accessed during last trace of tex
    - May not have accessed book recently

11 Semantic Distance Concept
- Between files
- Low if they belong to the same project
- High if they do not
- Use it to determine the files in a project
- Hoard all or no files of a project (working set)

12 Temporal Semantic Distance
- Clock time elapsed between the most recent opens/execs of the files
  - Clock time is not a good indicator
    - Coffee break between references to related files

13 Sequence-based Semantic Distance
- Number of intervening references (including the open of the first file) between the most recent opens/execs of the files
- Non-commutative
  - Looks only at first reference time (open)
  - Files accessed during reference lifetime (open to close) should have equal semantic distance
[Figure: open/close timeline for A (source file), B (include) and C (include), with distance labels 1, 2, 3]
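The sequence-based definition can be sketched as follows. This is an illustration only, assuming the trace is a list of file names in open/exec order and that the distance is undefined (`None`) when the most recent open of F2 does not come after that of F1. The function name `sequence_distance` is hypothetical.

```python
def sequence_distance(opens, f1, f2):
    """Distance from the most recent open of f1 to the most recent open of f2.

    Counts the references in between, including the open of f1 itself,
    which equals the difference of the two positions in the trace.
    """
    last = {name: i for i, name in enumerate(opens)}  # most recent position
    if f1 not in last or f2 not in last:
        return None
    d = last[f2] - last[f1]
    return d if d > 0 else None  # assumption: only defined looking forward


opens = ["A", "B", "C"]  # A: source file, B and C: includes
print(sequence_distance(opens, "A", "B"))
print(sequence_distance(opens, "A", "C"))
print(sequence_distance(opens, "C", "A"))
```

The last call shows the non-commutativity: the distance from A to C is defined, while under this sketch the distance from C to A is not.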

14 Lifetime-based Semantic Distance
- SD(F1, F2)
  - 0 if F2 opened before F1 closed
  - # intervening opens otherwise
- Consider an exec as an open immediately followed by a close
- Considers only last reference
  - Dynamic linking conditional
[Figure: the A (source file), B (include), C (include) example, with distance labels 0, 0, 3]
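A minimal sketch of the lifetime-based definition, under stated assumptions: the trace is a list of (kind, file) events, only the most recent open of each file is considered, the distance is `None` unless F2 is opened after F1, and "intervening opens" is counted as every open after F1's open up to and including F2's. The endpoint counting, the function name `lifetime_distance` and the file D in the example are assumptions.

```python
def lifetime_distance(events, f1, f2):
    """events: list of ("open" | "close" | "exec", file) in time order."""
    # An exec is treated as an open immediately followed by a close.
    expanded = []
    for kind, name in events:
        if kind == "exec":
            expanded += [("open", name), ("close", name)]
        else:
            expanded.append((kind, name))

    def last_open(name):
        idx = [i for i, (k, n) in enumerate(expanded) if k == "open" and n == name]
        return idx[-1] if idx else None

    o1, o2 = last_open(f1), last_open(f2)
    if o1 is None or o2 is None or o2 <= o1:
        return None  # assumption: undefined unless f2 is opened after f1
    # Was f1 closed before f2 was opened?
    closed = any(k == "close" and n == f1 for k, n in expanded[o1 + 1:o2])
    if not closed:
        return 0
    # Assumption: count opens after f1's open, up to and including f2's.
    return sum(1 for k, _ in expanded[o1 + 1:o2 + 1] if k == "open")


events = [
    ("open", "A"), ("open", "B"), ("close", "B"),
    ("open", "C"), ("close", "C"), ("close", "A"),
    ("open", "D"), ("close", "D"),
]
for target in "BCD":
    print("SD(A, %s) =" % target, lifetime_distance(events, "A", target))
```

B and C are opened while A is still open, so both are at distance 0 from A; D is opened after A is closed and gets a positive distance.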

15 Aggregation-based Semantic Distance
- Take arithmetic mean of SD(F1_i, F2_i), 1 <= i <= number of references to F1
  - 1, 1, 1498 vs. 500, 500, 500 (same arithmetic mean, 500)
- Take geometric mean
- Efficiency:
  - O(N^2) storage
    - Track n (20) closest neighbours
  - O(N) cost per reference
    - Update SDs of files accessed in the last m (100) references
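The difference between the two means on the slide's example can be checked with a short sketch. It uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import fmean, geometric_mean

mostly_close = [1, 1, 1498]   # usually adjacent, one distant reference
always_far = [500, 500, 500]  # never close

am_close, am_far = fmean(mostly_close), fmean(always_far)
gm_close, gm_far = geometric_mean(mostly_close), geometric_mean(always_far)

print("arithmetic:", am_close, am_far)
print("geometric: ", round(gm_close, 2), round(gm_far, 2))
```

The arithmetic means are identical, so they cannot separate the two pairs of files. The geometric mean of the first sequence is roughly 11.4, far below 500, so the pair that is usually referenced together comes out as much closer.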

16 Clustering
- Goal
  - Cluster files into projects based on SDs
- Difficulties
  - No objective measure of goodness of clustering
  - Need overlapping clusters
    - Common header files
  - SD not commutative

17 Distance-based Threshold
- F1, F2 in same cluster if SD(F1, F2) <= p or SD(F2, F1) <= p
  - Size of project not considered
  - For any p, one can imagine a project with > p files
- Combine clusters if they have overlapping files
  - f_1, f_2, ..., f_p combined with f_p, f_p+1, ..., f_l
  - All files will become one cluster
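The collapse into a single cluster can be illustrated with a small sketch. It assumes SDs are given as a dictionary keyed by ordered file pairs and merges clusters transitively with a union-find structure; `threshold_clusters` and the example distances are hypothetical.

```python
def threshold_clusters(files, sd, p):
    """Put f1 and f2 in one cluster if SD(f1, f2) <= p or SD(f2, f1) <= p,
    then merge clusters that share a file."""
    parent = {f: f for f in files}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path halving
            f = parent[f]
        return f

    # sd holds both directions, so checking every entry covers the "or".
    for (a, b), d in sd.items():
        if d <= p:
            parent[find(a)] = find(b)

    groups = {}
    for f in files:
        groups.setdefault(find(f), []).append(f)
    return sorted(groups.values())


files = ["f1", "f2", "f3", "f4", "f5", "f6"]
# Each file is close only to its successor; f1 and f6 are far apart.
sd = {(files[i], files[i + 1]): 1 for i in range(len(files) - 1)}
sd[("f1", "f6")] = 50

print(threshold_clusters(files, sd, p=1))
```

Although f1 and f6 are at distance 50, the chain of close pairs pulls every file into the same cluster, which is the weakness the slide points out.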

18 Common Neighbours-based Threshold
- Based on the n (nearest) neighbours
- Look at # common neighbours, c
- Two thresholds:
  - k_f (far) < k_n (near)
    k_n <= c          Clusters combined into one
    k_f <= c < k_n    Files inserted in each other's clusters
    c < k_f           No action
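The three-way decision can be sketched as a small function. This is an illustration, assuming each file's nearest neighbours are available as a collection of names; the function name, the returned labels and the threshold values in the example are hypothetical.

```python
def shared_neighbour_action(neighbours_a, neighbours_b, k_near, k_far):
    """Decide what to do with two files from their common-neighbour count."""
    assert k_far < k_near, "the far threshold must be below the near threshold"
    c = len(set(neighbours_a) & set(neighbours_b))
    if c >= k_near:
        return "combine"  # clusters combined into one
    if c >= k_far:
        return "insert"   # files inserted in each other's clusters
    return "none"         # no action


a = {"x.h", "y.h", "z.h", "util.c"}
print(shared_neighbour_action(a, {"x.h", "y.h", "z.h", "main.c"}, k_near=3, k_far=1))
print(shared_neighbour_action(a, {"x.h", "other.c"}, k_near=3, k_far=1))
print(shared_neighbour_action(a, {"unrelated.c"}, k_near=3, k_far=1))
```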

19 Combining Phase
[Figure: files A to G, with pairs of files linked by k_n or k_f relationships. Successive cluster states:]
{A, B}
{A, B, C}
{A, B, C} {D, E}
{A, B, C} {D, E} {F, G}
{A, B, C} {D, E, F, G}

20 Insertion Phase
[Figure: the same files A to G and their k_n / k_f relationships. Cluster states before and after insertion:]
{A, B, C} {D, E, F, G}
{A, B, C, D} {C, D, E, F, G}
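The two phases can be sketched together. This is an illustration only, assuming the file pairs have already been classified as near (at least k_n common neighbours) or far (between k_f and k_n). The function name `cluster` and the specific pair lists are hypothetical, chosen so that the result matches the clusters shown on the slides.

```python
def cluster(files, near_pairs, far_pairs):
    # Combining phase: near pairs merge whole clusters.
    cluster_of = {f: {f} for f in files}
    for a, b in near_pairs:
        if cluster_of[a] is not cluster_of[b]:
            merged = cluster_of[a] | cluster_of[b]
            for f in merged:
                cluster_of[f] = merged

    # Freeze each file's home cluster before inserting anything.
    home = {f: frozenset(c) for f, c in cluster_of.items()}
    result = {h: set(h) for h in set(home.values())}

    # Insertion phase: far pairs add each file to the other's cluster,
    # creating overlap without merging the clusters.
    for a, b in far_pairs:
        if home[a] != home[b]:
            result[home[b]].add(a)
            result[home[a]].add(b)
    return sorted(sorted(c) for c in result.values())


files = list("ABCDEFG")
near_pairs = [("A", "B"), ("B", "C"), ("D", "E"), ("F", "G"), ("E", "F")]
far_pairs = [("C", "D")]

print(cluster(files, near_pairs, []))         # after combining only
print(cluster(files, near_pairs, far_pairs))  # after insertion
```

Running insertion after all combining is done means C and D end up in both clusters, which is how overlapping clusters arise.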

21 Other Correlating Factors
- Directory membership
  - Files with common ancestor directories are related
  - Ancestor level is automatically recorded and subtracted from shared neighbours
- File naming conventions
  - Source and header files have the same prefix
- Other relations
  - #include files, import statements, common words
  - External investigator generates a relationship weight, which is added to shared neighbours

22 Another Option
- Add/subtract from SD
- SD is asymmetric
- Directly modifying the shared neighbour count has more impact

23 Searching Programs
- Example: find
- Opens all files in a subtree
- Destroys LRU and SD information
- Accesses of meaningless programs ignored
  - Program accessing > d% of possible directory members
- Important to detect meaningless phase rather than program
  - Get working directory
    - Does exhaustive search
    - Accesses during the search are ignored, rather than those of the entire program calling getcwd
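The percentage test can be sketched as follows. This is an illustration, assuming "possible directory members" means the entries of the directories the program touched, supplied here as an in-memory listing rather than read from a file system. The function name `is_meaningless`, the listing and the threshold of 50% are hypothetical.

```python
import posixpath


def is_meaningless(accessed, listing, d_percent):
    """True if the program touched more than d_percent of the members
    of the directories it accessed files in."""
    touched_dirs = {posixpath.dirname(p) for p in accessed}
    possible = set()
    for d in touched_dirs:
        possible.update(posixpath.join(d, name) for name in listing.get(d, ()))
    if not possible:
        return False
    fraction = len(set(accessed) & possible) / len(possible)
    return fraction * 100 > d_percent


listing = {"/src": ["a.c", "b.c", "c.c", "d.c"]}
find_accesses = ["/src/a.c", "/src/b.c", "/src/c.c", "/src/d.c"]
editor_accesses = ["/src/a.c"]

print(is_meaningless(find_accesses, listing, d_percent=50))
print(is_meaningless(editor_accesses, listing, d_percent=50))
```

Applying the same test to a window of recent accesses rather than to a whole program would be one way to catch a meaningless phase; that variation is not shown here.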

24 Shared Libraries
- Accessed by all programs
  - All clusters will be combined via them
- Files involved in more than a certain percentage (1%) of accesses are ignored and always put in the hoard set
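Finding such frequently accessed files can be sketched with a counter over the access log. This is an illustration, assuming the log is a flat list of path names; `frequent_files` and the example log are hypothetical, and the 1% default comes from the slide.

```python
from collections import Counter


def frequent_files(accesses, max_fraction=0.01):
    """Files involved in more than max_fraction of all accesses.

    These would be left out of clustering and always hoarded.
    """
    counts = Counter(accesses)
    total = len(accesses)
    return {f for f, n in counts.items() if n / total > max_fraction}


# 200 accesses: one shared library used 50 times, 150 project files once each.
log = ["/lib/libc.so"] * 50 + ["/proj/f%d" % i for i in range(150)]
always_hoard = frequent_files(log)
print(always_hoard)
```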

25 Temporary Files
- Not important by definition
- But may have small semantic distance to other files
- System disregards files in certain directories

26 Rarely Accessed Critical Files
- Hardly accessed but important
  - Bootstrapping
  - Suspend/resume files
- User-specified lists
- System-specific heuristics
  - "." files (dot files) in Unix

27 Non Files
- Can be critical
  - Device file
- Access to them may not be recorded
  - Symbolic link points to actual file
- Non-directories take no space
  - Always hoarded
- Directories may be needed to do offline file-name translation
  - Replication system makes decisions regarding them

28 Handling Hoard Miss
- If hoard miss
  - Add file and its project to hoard set
- Record it for goodness measure

29 Goodness Measure
- Caching
  - Cache miss rate
- Hoarding
  - Time to first cache miss
    - Does not take into account working set size vs. hoard size
      - Working set << hoard size -> no miss
      - Working set ~ hoard size -> high miss rate
  - Miss-free hoard size

30 Miss-free Hoard Size Under LRU
- Look at references before most recent disconnection
  - F4 F3 F1 F2 F1 F5
- Keep only most recent reference to each file
  - F4 F3 F2 F1 F5
- Mark files accessed since disconnection
  - F3 F5
- Locate the first marked file in sorted list
  - F3
- Sum the size of all files between this file and end of sorted list
  - F3 + F2 + F1 + F5
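The steps above can be sketched as follows. This is a minimal illustration using the slide's reference string; the function name `miss_free_hoard_size` and the file sizes are hypothetical, and files accessed after disconnection that were never referenced before it are simply ignored.

```python
def miss_free_hoard_size(refs_before, accessed_after, sizes):
    """Smallest LRU hoard size that would have avoided every miss."""
    # Keep only the most recent reference to each file, oldest first.
    lru, seen = [], set()
    for f in reversed(refs_before):
        if f not in seen:
            seen.add(f)
            lru.append(f)
    lru.reverse()

    # The first (least recently used) marked file fixes the hoard boundary.
    for i, f in enumerate(lru):
        if f in accessed_after:
            return sum(sizes[g] for g in lru[i:])
    return 0  # nothing from before the disconnection was needed


refs_before = ["F4", "F3", "F1", "F2", "F1", "F5"]
accessed_after = {"F3", "F5"}
sizes = {"F1": 10, "F2": 20, "F3": 30, "F4": 40, "F5": 50}  # hypothetical sizes

size = miss_free_hoard_size(refs_before, accessed_after, sizes)
print(size)  # F3 + F2 + F1 + F5 with the sizes above
```

F4 is older than the first marked file, so an LRU hoard could have dropped it without causing a miss; everything from F3 onwards has to fit.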

31 Live Usage
- Gathered user traces of activities
- Few hoard misses in actual usage

32 Comparison Experiments
- Gathered user traces of activities
- Replayed each trace, simulating disconnection durations of
  - 24 hours
  - 7 days
- Assumed infinitesimal reconnection only for re-hoarding
- Mode of traced activities
  - Connected
    - Can do activities normally not done in disconnected mode
      - Web access
    - Access patterns remain the same
  - Disconnected mode
    - Actual hoard misses could influence activities
    - But misses were few anyway
- Semantic distance leads to a hoard size slightly bigger than the working set (WS)
- Much better than LRU

33 Unresolved Issues
- Hoarding of fine-grained data
