Semantic Data Caching and Replacement


1 Semantic Data Caching and Replacement
Based on the talk by Kunhao Zhou about the paper by Shaul Dar, Michael J. Franklin, Bjorn T. Jonsson, Divesh Srivastava, and Michael Tan. Proceedings of the 22nd VLDB Conference, Mumbai (Bombay), India, 1996.

2 Outline
Motivation. Client Caching Architecture. Model of Semantic Caching. Simulations and Results. Conclusion and Future Work.

3 Motivation
Distributed database. Clients are high-end workstations (fat clients): high computational power, large local storage. Client and server are connected by a network.
Clients used to be thin clients, such as dumb terminals with no local disk and no computational power of their own, so all they could do was provide a user interface. With the rapid development of the PC, a client can now have very high computational power and a very large local disk.

4 Motivation (Contd.)
Effective use of the client is the key to achieving high performance. Less network traffic. Faster response time. Higher server throughput. Better scalability.
With the “fat client” in mind, if we use the client’s resources effectively, we can achieve high performance in a client/server database system. The first gain is less network traffic: in a distributed database system the time spent shipping data dominates, so once traffic is reduced the other benefits follow naturally.

5 Client Caching Architecture
Data-Shipping. The client processes the query. Data is brought on demand from servers. Navigational access by object ID (tuple ID or page ID). Can be categorized as tuple-based or page-based. Cache replacement policies: LRU, MRU.
Now we turn to the client caching architecture. There are two major architectures. The first is data-shipping. Here is how the client manages its cache: the client processes the query, and any data missing from the cache is shipped from the servers. The client accesses data navigationally, that is, it uses either a tuple ID or a page ID to locate the data. Based on the access unit, the authors categorize data-shipping architectures as tuple-based or page-based. The relationship between client and server here resembles buffer management in a traditional database, so LRU or MRU is used as the cache replacement policy.

6 Client Caching Architecture (Contd.)
Data-Shipping. Problem: applications require associative access to data, as provided by relational query languages.
The problem with data-shipping is that it does not support associative access, which is typical and very important in relational databases. For example, when we use SQL, we are using associative access.

7 Client Caching Architecture (Contd.)
Query-Shipping. Associative access to data. Problem: implementations do not support client caching (no caching).
The other client caching architecture is query-shipping. It provides associative access, but its implementations do not support client caching, so it cannot make any use of the client’s resources.

8 Client Caching Architecture (Contd.)
Semantic Caching. A model that integrates support for associative access into an architecture based on data-shipping. Advantage: exploits semantic information to manage the client cache effectively.
Because of the problems with those two architectures, the authors propose a new architecture called “semantic caching”. Here is the definition: a model that integrates support for associative access into an architecture based on data-shipping. The advantage is that semantic information can be used to manage the client cache effectively.

9 Client Caching Architecture (Contd.)
Semantic Caching. Semantic description of the data rather than record ID or page ID: can be used to generate a remainder query to send to the server if the requested tuples are not available locally. Information for replacement is maintained as semantic regions: low overhead, insensitive to bad clustering. Cache replacement uses a value function based on the semantic description, not just LRU or MRU.
Here are the three key ideas of semantic caching. First, data is described by its semantic value, not by page ID or record ID. The advantage is that the semantic description can be used to generate a remainder query, which describes the data that is not locally available when answering a query. We will talk about this more later. Second, cache replacement information is maintained as semantic regions. The advantages are low overhead (compared to tuple-based caching) and insensitivity to clustering, as we will see in the experiments and results. Third, semantic caching can use a more sophisticated value function as its replacement policy.

10 Client Caching Architecture (Contd.)
Comparison by data granularity, missing data, and cache replacement:
Page Caching: group granularity; faulting; temporal locality (LRU, MRU) and spatial locality (clustering).
Tuple Caching: single-tuple granularity.
Semantic Caching: dynamic groups; remainder queries; semantic locality.

11 Model of Semantic Caching
Remainder Query. Semantic Regions. Replacement Issues.

12 Remainder Query
Relation Re, query Q, client cache V. Probe query P(Q,V) = Q ∧ V can be answered locally. Remainder query R(Q,V) = Q ∧ (¬V) should be sent to the server.
Example: Select * from E where salary < 60,000 and salary > 30,000. The client caches all tuples with salary < 50,000.
Q = (salary < 60,000) ∧ (salary > 30,000).
V = (salary < 50,000).
P = (salary < 50,000) ∧ (salary > 30,000).
R = (salary >= 50,000) ∧ (salary < 60,000).
(Figure: relation Re containing the cached region V and the query Q; their overlap is the probe P, and the rest of Q is the remainder R.)
This is what a remainder query looks like. The advantage of having a remainder query is that once it is created, the client and the server can process their parts of the query in parallel. Also, if the whole query can be answered locally, there is no need to contact the server.
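The salary example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: salaries are taken to be integers, predicates are modeled as half-open intervals [lo, hi), and the names `Range`, `probe`, and `remainder` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Range:
    """Half-open interval [lo, hi) over an integer attribute (assumed model)."""
    lo: float
    hi: float


def probe(q: Range, v: Range) -> Optional[Range]:
    """Q AND V: the part of the query that the cache can answer locally."""
    lo, hi = max(q.lo, v.lo), min(q.hi, v.hi)
    return Range(lo, hi) if lo < hi else None


def remainder(q: Range, v: Range) -> List[Range]:
    """Q AND (NOT V): the parts of the query that must go to the server."""
    pieces = []
    if q.lo < v.lo:  # part of Q lying below the cached range
        pieces.append(Range(q.lo, min(q.hi, v.lo)))
    if q.hi > v.hi:  # part of Q lying above the cached range
        pieces.append(Range(max(q.lo, v.hi), q.hi))
    return pieces


# salary > 30,000 and salary < 60,000, with integer salaries
Q = Range(30_001, 60_000)
# salary < 50,000
V = Range(-math.inf, 50_000)

P = probe(Q, V)      # Range(lo=30001, hi=50000)
R = remainder(Q, V)  # [Range(lo=50000, hi=60000)]
```

An empty remainder list corresponds to the case where the whole query is answered locally and the server is not contacted.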

13 Semantic Regions
Cache management and replacement unit. Grouped by semantic value. Each semantic region has a single replacement value. Described by a constraint formula. Consideration: semantic region merge.
Just as the page is the replacement unit in a buffer management system, here the replacement unit is the semantic region. How do we describe a semantic region? By a constraint formula, just as we defined the client cache a moment ago. When a new query intersects existing semantic regions, we can split each intersected region into two disjoint parts, the part inside the query and the part outside it, and treat each part as a new semantic region. So if a query intersects n regions, we will have 2n+1 semantic regions after the query: two parts for each intersected region, plus one region for the part of the query that was not cached. This may cause large overhead.
(Figure: (a) original regions; (b) regions after Q.)
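The 2n+1 count can be checked with a small sketch. As a simplifying assumption, each region is modeled here as a set of integer tuple keys rather than as a constraint formula, and `split_regions` is a hypothetical helper name.

```python
from typing import FrozenSet, List, Set


def split_regions(regions: List[FrozenSet[int]], query: Set[int]) -> List[FrozenSet[int]]:
    """Split every region into its part inside and outside the query.

    Regions are assumed to be pairwise disjoint. The part of the query
    not covered by any region becomes one additional new region.
    """
    result = []
    uncovered = set(query)
    for region in regions:
        inside = region & query
        outside = region - query
        uncovered -= region
        for part in (inside, outside):
            if part:  # skip empty parts (region untouched or fully covered)
                result.append(frozenset(part))
    if uncovered:
        result.append(frozenset(uncovered))
    return result


# Two cached regions, both partially overlapped by the query (n = 2).
regions = [frozenset(range(0, 10)), frozenset(range(20, 30))]
query = set(range(5, 25))

after = split_regions(regions, query)
region_count = len(after)  # 5, i.e. 2n + 1 for n = 2
```

The count is lower when a region lies entirely inside the query or when the query is fully covered by the cache, so 2n+1 is the count for the case where every intersected region is only partly overlapped and part of the query is uncached.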

14 Semantic Regions
Cache management and replacement unit. Grouped by semantic value. Each semantic region has a single replacement value. Described by a constraint formula. Consideration: semantic region merge (always merge).
Alternatively, we can choose to always merge the semantic regions, as in this picture, which leads to one large semantic region. But this may cause another problem. Because the semantic region is the replacement unit, replacing a large semantic region can leave a big hole in the client cache, resulting in poor cache utilization. So in the authors’ model, a threshold controls when semantic regions are merged.
(Figure: (a) original regions; (b) regions after Q.)

15 Replacement Issues
Temporal locality: LRU, MRU.
The figure shows how the replacement value is calculated from recency of usage. Whether the policy is LRU or MRU depends on whether we evict the region with the smallest or the largest value when the cache needs space: evicting the smallest value gives LRU, and evicting the largest gives MRU. The dotted line in the figure shows the case where we choose not to merge the semantic regions.
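The recency-based value can be sketched as follows. This is a minimal illustration, assuming each region's replacement value is simply a logical timestamp of its last use; the class name `RecencyValues` and its methods are hypothetical.

```python
from itertools import count


class RecencyValues:
    """Tracks one replacement value per region: the logical time of last use."""

    def __init__(self):
        self._clock = count(1)  # monotonically increasing logical clock
        self.value = {}         # region name -> replacement value

    def touch(self, region: str) -> None:
        """Record that a query just used this region."""
        self.value[region] = next(self._clock)

    def victim(self, policy: str = "LRU") -> str:
        """LRU evicts the smallest value, MRU evicts the largest."""
        if policy == "LRU":
            pick = min
        elif policy == "MRU":
            pick = max
        else:
            raise ValueError(f"unknown policy: {policy}")
        return pick(self.value, key=self.value.get)


cache = RecencyValues()
for region in ["A", "B", "C", "A"]:  # A is used again after C
    cache.touch(region)

lru_victim = cache.victim("LRU")  # "B": oldest last use
mru_victim = cache.victim("MRU")  # "A": most recent use
```

The same value table serves both policies; only the direction of the comparison at eviction time changes.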

16 Replacement Issues (Contd.)
Semantic locality: Manhattan distance.
(Note) Manhattan distance. Definition: the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|.
(Figure: p1 and p2 joined through a corner point O, so that |p1 p2| = |p1 O| + |O p2|.)
The other picture shows how the Manhattan distance can be applied in the semantic caching architecture.
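A short sketch of a distance-based replacement value follows. The slides do not spell out how the distance is turned into a value, so this assumes each region is summarized by a center point and valued by the negative Manhattan distance to the most recent query; the function names and the sample points are hypothetical.

```python
from typing import Dict, Sequence


def manhattan(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(p, q))


def replacement_values(centers: Dict[str, Sequence[float]],
                       query_center: Sequence[float]) -> Dict[str, float]:
    """Assumed value function: regions far from the latest query get low values."""
    return {name: -manhattan(c, query_center) for name, c in centers.items()}


def victim(values: Dict[str, float]) -> str:
    """Evict the region with the smallest replacement value."""
    return min(values, key=values.get)


d = manhattan((1, 2), (4, 6))  # |1 - 4| + |2 - 6| = 7

centers = {"hot": (10, 10), "cold": (80, 90)}
values = replacement_values(centers, query_center=(12, 9))
evicted = victim(values)  # "cold": it is farthest from the latest query
```

Under this assumption a region far from where queries currently fall is discarded first, regardless of how recently it was touched.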

17 Simulation and Result
Parameter      Value    Description
RelSize        10000    Relation size (tuples)
TupleSize      200      Size of tuple (bytes)
TuplePerPage   20       How many tuples per page
QuerySize      1-10%    % of relation selected by each query
Skew           90%      % of queries within a hot region
HotSpot        10%      Size of the hot region (% of relation)
CacheSize      250      Client cache size (KB)
The relation has three candidate keys: Unique2 is indexed and clustered, Unique1 is indexed and unclustered, and Unique3 is unindexed and unclustered.
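To get a feel for these parameters, the sizes they imply can be worked out with simple arithmetic. The sketch below assumes 1 KB = 1024 bytes and ignores any per-page or per-region bookkeeping overhead, neither of which is stated in the slides.

```python
# Parameters taken from the table above.
REL_SIZE = 10_000         # tuples
TUPLE_SIZE = 200          # bytes
TUPLES_PER_PAGE = 20
CACHE_SIZE_KB = 250
HOT_SPOT_PERCENT = 10     # hot region as % of relation
MAX_QUERY_PERCENT = 10    # largest query as % of relation

BYTES_PER_KB = 1024       # assumption: binary kilobytes

relation_bytes = REL_SIZE * TUPLE_SIZE                          # 2_000_000
pages = REL_SIZE // TUPLES_PER_PAGE                             # 500
cache_tuples = CACHE_SIZE_KB * BYTES_PER_KB // TUPLE_SIZE       # 1280
hot_tuples = REL_SIZE * HOT_SPOT_PERCENT // 100                 # 1000
largest_query_tuples = REL_SIZE * MAX_QUERY_PERCENT // 100      # 1000
```

Under these assumptions the cache holds roughly an eighth of the relation, slightly more than the hot region, so which regions are kept or discarded matters for the hit rate.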

18 Simulation and Result (Contd.)
Unique2 (Clustered Index). Performance: almost the same; page-based is slightly better. Reason: page-based overhead is smaller.
In this experiment the queries are on Unique2, which has a clustered index, and the replacement algorithm is LRU. The approaches perform almost the same, with page-based slightly ahead because its overhead is smaller. Semantic caching also has very small overhead if we merge the semantic regions aggressively, but as stated before, that leads to poor cache utilization. So page-based is better here.

19 Simulation and Result (Contd.)
Unique1 (Unclustered Index). Performance: tuple-based and semantic-based are much better. Reason: page-based is sensitive to clustering.
These queries are on Unique1, which has an unclustered index. As the figure shows, page-based is far worse than the other two because it is sensitive to clustering. Page-based caching manages the cache in units of pages, so for each missing tuple it needs to fetch a whole page.

20 Simulation and Result (Contd.)
Unique3 (Unindexed and Unclustered). Performance: semantic-based is better. Reason: the remainder query enables client and server to process the query in parallel.
These queries are on Unique3, which is unindexed and unclustered. Because page-based caching is very sensitive to clustering, it performs very badly for this query and is omitted from the figure. Because Unique3 is unindexed, the client or the server must scan the whole disk to answer the query. Semantic-based is better here because it creates a remainder query, which allows the server and the client to process the query in parallel. With tuple-based caching, on the other hand, the client first processes the query (that is, scans its disk) and then asks the server for the missing data, after which the server scans its disk as well, so the two scans run one after the other and take a long time. In fact, tuple-based caching performs even better here if it ignores the cache entirely and just sends the query to the server.

21 Simulation and Result (Contd.)
Semantic locality / Manhattan distance on Unique1. Performance: Manhattan distance is better than LRU. Reason: “cold regions” are replaced faster.
The final experiment compares LRU with Manhattan distance. The queries are on Unique1, which is indexed and unclustered. As the figure shows, the Manhattan distance algorithm performs better. The reason is that “cold regions” are replaced faster. Remember that in this workload 90% of the queries fall within the 10% “hot region”, which also means that the remaining 90% of the relation serves only 10% of the queries. Those parts are the “cold regions”. If the cold regions are replaced faster, the cache hit rate improves, and with Manhattan distance the cold regions are indeed replaced faster.

22 Conclusion and Future Work
For a simple model with selection queries, semantic caching provides better performance. Future work: implementation issues for complex queries, updates, deletions, and insertions. Concurrency control. Consistency. Completeness. See also: A Predicate-based Caching Scheme for Client-Server Database Architectures (Arthur M. Keller and Julie Basu).
So, this is the conclusion. The authors investigate their model only with selection queries, and there semantic caching performs better than the traditional architectures. Future work should consider the implementation issues for complex queries and for updates, deletions, and insertions, which are very typical database operations. Concurrency control: if multiple clients access data simultaneously, how do we coordinate them? Consistency: if cached data is updated by another client, what should be done? Completeness: the central server needs to know the cache contents of all the clients, so how do we keep that information complete? These issues are discussed in other papers, such as the one by Keller and Basu.

