Presentation is loading. Please wait.

Presentation is loading. Please wait.

An Effective Disk Caching Algorithm in Data Grid Why Disk Caching in Data Grids?  It takes a long latency (up to several minutes) to load data from a.

Similar presentations


Presentation on theme: "An Effective Disk Caching Algorithm in Data Grid Why Disk Caching in Data Grids?  It takes a long latency (up to several minutes) to load data from a."— Presentation transcript:

1 An Effective Disk Caching Algorithm in Data Grid Why Disk Caching in Data Grids?  It takes a long latency (up to several minutes) to load data from a Mass Storage System (MSS) via LAN;  It takes a very long time (up to a few hours) to complete file transfers for a request over wide-area networks (WAN);  A researcher's workstation or even her local computer center may not be able to keep all the required dataset for a long time for his needs. Song Jiang and Xiaodong Zhang College of William and Mary, USA

2 Wide Area Network Clients DRM HRM HPSS Disk Caching in Storage Resource Managers (SRM) HPSS: High Performance Storage System HRM: Hierarchical Resource Manager DRM: Disk Resource Manager

3 A General Replacement Utility Function based on Locality, Sizes, and Transfer Cost Locality estimation of files is the most critical factor determining hit ratio of disk caching. In data grid, the replacement is managed by middleware, the maintenance cost can be more relaxed than that in OS. For a file i at time T, Φ i (t) = L i (t) * C i / S i L i (t) denotes its locality strength, S i denotes the file size, C i denotes its retrieving cost

4 Existing Replacement and Their Limits (1) LRU Based Only -: Carries all the weaknesses of LRU; No size and transfer cost consideration +: Easy to implement and low maintenance cost. (2) LRU Based Locality Estimation -: Carries all the weaknesses of LRU. +: Easy to implement and low maintenance cost. (3) Frequency based Estimation Based on the number of times an object in the file has been accessed since it is fetched. -: The frequency sometimes can be misleading due to frequencies can be interval dependent. Cache pollution: a file with accumulated high frequency can stay for long, but no more accesses. +: Easy to implement, and monitor multiple objects.

5 Existing Replacement and Their Limits (Continued) (4) Greedy-Dual-Size (USENIX Internet Tech’97) For a time interval, locality estimator is access frequency / number of missed objects -: Share the same weaknesses frequency based. +: Easy to implement. (5) Least Cost Beneficial based on K backward References (SC’02) For a time interval, locality estimator is number of most recent references * total references -: the estimator is the time interval dependent. +: Easy to implement and low maintenance cost.

6 Drawbacks in the existing Locality Estimation Methods: Recency Based (1) Locality of large file access in Data Grids is weaker than that of I/O block and Web file caching Hard to deal with weak locality file requests; Example: Greedy Dual-Size (GDS) (2) Frequency Based Pollution problem Examples: Greedy Dual-Size with Frequency (GDSF), Hybrid, Lowest relative Value (LRV) (3) Re-use Density Based Overcome the drawbacks of previous methods; Could be irrelevant to locality because density is computed over wall clock time; Examples: Least Cost Beneficial Based on the K backward References (LCB-K)

7 Our Principle on Disk Replacement  Only relevant history information is used in locality estimation: the order of file requests, and cache size demands  Total cache size is used with the locality estimation to answer the question: Does a file have enough locality so that it if it is cached it could be hit for its next reference before it is replaced from the cache with its specific size ? Our Utility Function Caching Time of a file measures how much cache is consumed since the last reference to the file. A caching time is maintained for a file for a certain period of time even it is replaced, so that the utility of its next access can be more accurately assessed. 050 70 12065 185100 18535...... Resident Non- Resident FileSize CachingTime Caching Time Stack Least Value Based on Caching Time (LVCT) (1) If f is fetched to cache, push its entry to stack top, set caching time ``0”. (2) If file f is hit in the cache, advance the entry up to Size (f) positions. (3) If the access to file f is a miss:  Select a set of resident files whose utility values are smallest among resident files;  If the utility value of file f is smaller than any of the files in the set, f is not cached. (mainly determined by its file size and transfer cost).  Otherwise, f is cached  The caching time stack is updated accordingly.

8 Trace Description  Collected in a MSS system, JASMine, at Jefferson's National Accelerator Facility (JLab),  Represent the file access activities for a period of about 6 months.  207,331 files accessed in the trace, and their total size is 144.9 TBytes. Simulation Results LVCT Advantage  Replace files with large caching times timely because they are less likely to generate hits even if cached;  Cache space is saved to serve small caching time files;  Improvement is more apparent with small cache sizes.

9 Main Results  Disk caching in data grids exhibit properties different from transitional I/O buffering and Web file caching;  We identify a critical drawback in abstracting locality information from Grid request streams, i.e. counting on misleading time measurements.  We propose a new locality estimator using more relevant access events;  The real-life workload traces demonstrate the effectiveness of our replacement.


Download ppt "An Effective Disk Caching Algorithm in Data Grid Why Disk Caching in Data Grids?  It takes a long latency (up to several minutes) to load data from a."

Similar presentations


Ads by Google