1
Evaluating and Optimizing Indexing Schemes for an Elastic Cache in a Cloud Environment.
By Apeksha Shetty
Thesis Committee: Dr. Gagan Agrawal, Dr. Hakan Ferhatosmanoglu
2
Outline
What is Cloud Computing?
What is an Elastic Cache?
Implementation of the Elastic Cache on Amazon EC2
Experimental Results
Performance Evaluation of Indexing Schemes
Evaluation of our Optimization Method
3
What is Cloud Computing?
What is cloud computing? Is it computing resources up in the cloud? Obviously not. The "cloud" here stands for the internet, and cloud computing means providing services over the internet. There are three basic service models: IaaS, PaaS and SaaS. IaaS – the organization outsources its infrastructure, i.e., storage resources, hardware, networking components, etc. PaaS – provides a computing platform as a service and facilitates application design, development, testing, deployment and hosting. SaaS – has been around the longest; it is web services and software provided over the internet. Cloud Computing = IaaS + PaaS + SaaS
4
What is Cloud Computing?
Utility computing: provides computational and storage resources on an on-demand, "pay-as-you-go" basis. It is elastic: resources can be acquired and released dynamically. The entire environment is managed by the provider.
The main characteristics of cloud computing are: it is an on-demand service sold on a "pay-as-you-go" basis, with services typically charged by the hour; it is elastic, so resources can be acquired and released dynamically; and it is fully managed by the provider, so the user needs only a computer and an internet connection. Hence, it is ideal for small businesses, which save on acquiring, setting up and maintaining infrastructure. Moreover, a business can vary its resource utilization based on demand and avoid over- or under-provisioning of resources. The popularity of cloud computing has risen over the years, with big companies like Amazon, Google, Microsoft and IBM providing public as well as private clouds to consumers.
5
Need for Cache
Demand for applications varies a lot:
May be expected – an upcoming sporting event
May be unexpected – a natural calamity
Queries are correlated
Demand for applications varies a lot. There could be an expected rise in interest (a football World Cup coming up) or an abrupt increase due to a natural calamity (an earthquake or tsunami). Generally, the results of queries are correlated: the results of some queries can be used to derive the results of others. Hence, we cache the intermediate results to accelerate query execution. We can further speed up computation by storing the entire cache in main memory. But the data would be too large for one machine, so we split it among multiple nodes. This way each node contains a portion of the data, but no node contains the entire cache.
6
What is an Elastic Cache?
Developed and implemented on a cluster to speed up computations in Auspice (the scientific workflow system of our research group)
Elastic cache – cache intermediate results in main memory
Split the data among a set of nodes
The set of nodes can expand or contract depending on the data size
This system was first developed by our research group to accelerate computations in our scientific workflow system, Auspice. We follow the idea from the previous slide. The elastic part lies in the fact that we expand or contract the number of nodes based on the data size; thus, it forms an elastic cache.
7
Contributions
Implemented the Elastic Cache on Amazon EC2
Evaluated various indexing schemes and determined which is best suited for our Elastic Cache
Optimized the time efficiency of the system by minimizing idle time
Driven by the popularity and inherent elasticity of clouds, we implemented the elastic cache on the cloud. The performance of the elastic cache depends on the indexing scheme used, hence we evaluated several indexing schemes. Finally, we optimized our system and tried to improve its time efficiency.
8
Elastic Cache – Architecture
The system is transparent to the user, i.e., as far as the user is concerned it is a single cache. We achieve this with the help of a coordinator. The coordinator determines which node may contain the data based on the hash table or directory service. Then there is a set of cache servers. Each cache server stores only a portion of the cache, hence they cooperate with each other and form a cooperative cache.
9
Elastic Cache – Working
10
Elastic Cache – Coordinator
Manages everything
Consistent hashing: determines the node which may contain the data
In case of overflow: migrates keys between the minimum and the median
Greedy Bucket Allocation Algorithm: chooses preexisting nodes over launching new instances
The coordinator manages everything, and all queries are submitted to it. When a query is submitted, the coordinator determines the node that may contain the data using consistent hashing; our research group borrowed this idea from web caching. It then tries locating the data on that node; if it is a miss, the coordinator calls the service and then inserts the newly computed result into the node, using the Greedy Bucket Allocation insert algorithm. An insertion may cause the node to overflow. In that case, we migrate half the records, typically from the minimum to the median, into a destination node. Our algorithm chooses a pre-existing node as the destination over launching a new node. This makes the algorithm greedy as well as cost efficient, since it delays the launching of new nodes and reduces resource utilization cost. The system also tries to stabilize within the allocated resources and hence avoids over-provisioning.
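As an illustration of the consistent-hashing lookup the coordinator performs, here is a minimal sketch in Python (the thesis system's actual code is not shown in the slides; class and key names are made up for the example):

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    def __init__(self, nodes):
        # Hash every cache server onto a ring of 2^32 positions.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(str(key).encode()).hexdigest(), 16) % (2 ** 32)

    def add_node(self, node):
        # Only the keys between the new node and its predecessor are remapped.
        self.ring.append((self._hash(node), node))
        self.ring.sort()

    def node_for(self, key):
        # Walk clockwise from the key's position to the first server position.
        positions = [p for p, _ in self.ring]
        i = bisect_right(positions, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["cache-server-0"])
ring.add_node("cache-server-1")
print(ring.node_for("shoreline(L=41.5, T=2004-12-26)"))  # server that may hold the result
```

Because only the keys between a newly added node and its predecessor on the ring move, expanding or contracting the set of cache servers disturbs only a small fraction of the cached data.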
11
Elastic Cache – Cache Servers
Each server implements an indexing scheme and a replacement policy
All servers contain only a portion of the cache and cooperate with each other to form a Cooperative Cache
All the data is kept in main memory. Each server holds only a portion of the cache, and no node contains the entire data; hence, they cooperate with each other to form a cooperative cache.
12
Implementation on EC2
EC2 – Amazon's Elastic Compute Cloud
Secure Access
Create accounts – Amazon Web Services (AWS), Amazon EC2 (EC2), Amazon S3 (Simple Storage Service)
In order to work in this cloud environment, secure access is a must. Amazon provides multiple levels of security which help protect the user and authenticate them to AWS. Each account is associated with multiple security credentials. Sign-in credentials authenticate the user to AWS. Account identifiers ease the sharing of resources among accounts; for example, the canonical ID is used exclusively for S3 and the AWS account ID for all other AWS services. Key pairs: public-private key pairs can be created to authenticate requests to AWS services, such as launching or terminating instances; the private key is kept by the user and the public key is registered with AWS.
13
Implementation on EC2 contd…
Instances
Standard / High-Memory / High-Compute instances
Reserved / On-Demand / Spot pricing schemes
On-Demand Small Instance – 1.7 GB memory, 1 virtual core, 32-bit platform
Amazon Machine Image (AMI) – a snapshot of the operating system
Ubuntu Linux image; the cache server is downloaded from a remote location
What is this "instance" we talk about? In order to provide elastic computing, clouds stretch their physical resources by multiplexing CPU cores among multiple users through virtualization; an instance is the portion of an actual machine provided to a user. There are three main categories of instances – Standard, High-Memory and High-Compute – and three pricing schemes: Reserved, On-Demand and Spot. Reserved: pay upfront to reserve instances for a long period (in years) at lower hourly rates. On-Demand: as the name suggests, on demand, with no upfront charges but a higher hourly rate. Spot: bid for unused capacity. On-Demand instances are best suited for our experiments, but Amazon limits the maximum number of concurrent instances to 20. The Amazon Machine Image is what makes working on EC2 possible: an AMI is a snapshot of the operating system and has all the information required to set up and boot an instance. AMIs can be created in two ways: the loopback method, where you start with a bare-minimum system and load your own operating system, or, more commonly, by modifying an existing AMI and saving the changes as a new AMI. AWS maintains a huge repository of useful images; these AMIs are stored in Amazon S3 and the canonical ID is used for authentication. We used an Ubuntu Linux image and downloaded our cache server from a remote location.
14
Implementation on EC2 contd…
EC2 API
ec2-add-keypair – Generates a new public-private key pair
ec2-describe-images – Returns all AMIs stored in the EC2 library
ec2-bundle-image – Bundles a new AMI
ec2-upload-bundle – Uploads the newly bundled AMI
ec2-register – Registers the newly uploaded AMI
ec2-run-instances – Launches a new instance
ec2-terminate-instances – Terminates instances
ec2-describe-instances – Returns the status of an instance
A developer can work on EC2 either with the Amazon management console or with command-line tools. The console is a simple GUI that requires human interaction, but we needed a simple API; hence, we used the command-line tools. These are some of the commands we used.
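A hypothetical sketch of how these commands might be driven from a coordinator script. It is written in Python only for consistency with the other sketches; the AMI ID and key-pair name are placeholders, and the exact flags and output format of the EC2 API tools are recalled from memory, so details may differ:

```python
import subprocess
import time

AMI_ID = "ami-xxxxxxxx"       # placeholder for the bundled Ubuntu cache-server image
KEY_PAIR = "cache-keypair"    # placeholder; created earlier with ec2-add-keypair

def ec2(*args):
    # Run one EC2 API command-line tool and return its text output.
    return subprocess.run(list(args), capture_output=True, text=True,
                          check=True).stdout

def launch_cache_server():
    # ec2-run-instances boots a new instance from the registered AMI.
    out = ec2("ec2-run-instances", AMI_ID, "-k", KEY_PAIR)
    fields = out.split()
    instance_id = fields[fields.index("INSTANCE") + 1]   # assumes the classic tool output
    # Poll ec2-describe-instances until the instance reports "running".
    while "running" not in ec2("ec2-describe-instances", instance_id):
        time.sleep(10)
    return instance_id
```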
15
Experiments – Ran a real service: the Shoreline Extraction Service
Inputs: location (L) and time of interest (T)
Output: the shoreline
Cache starts out cold, with 1 coordinator and 1 cache server (Bx-tree)
Mimics the worst-case scenario – inputs randomized over 64K values
Query execution time without the cache = 23 seconds
The service was provided by colleagues at the Department of Civil and Environmental and Geodetic Sciences. The system tries retrieving cached results – precomputed results stored because of an earlier miss. It takes 23 seconds to run one query when it is a miss. Randomizing the inputs mimics the worst-case scenario, as there is no correlation between consecutive queries.
16
The Results – 50 Queries/Timestep
(Plot: mean query time vs. time steps at 50 queries per time step, comparing static configurations with GBA.) Static-2 and static-4 converge very early. GBA has the same performance as static-8, at a cost equivalent to running approximately 5.5 ≈ 6 nodes from the start; hence we obtain static-8 performance at roughly the cost of a static-6 system. Because greedy bucket allocation is used, the system stabilizes after launching 4 nodes and then launches 4 more in close proximity, thus avoiding over-provisioning and trying to minimize cost.
17
The Results – 225 Queries/Timestep
(Plot: mean query time vs. time steps at 225 queries per time step.) Static-8 also converges. For GBA, the mean query time goes down significantly and the system achieves a near-zero miss rate, matching the performance of static-15 while using approximately 12.6 ≈ 13 nodes.
18
Summary and Observations
Achieves the same performance as a static system
Cost efficient – node allocation is delayed
System performance => speed of retrieving cached results => determining which node may contain the result (consistent hashing) + determining whether it is a hit or a miss (indexing scheme used)
The system is well suited for a cloud environment
Determining whether it is a hit or a miss depends on the indexing scheme used. Hence, we compared the suitability of various indexing schemes for our elastic cache.
19
Indexing Schemes Evaluated – Bx-tree, Extensible Hashing and Counting Bloom Filters
Extensively used in real applications
Have different structures
The performance of our system depends on the speed of access, and that depends on how fast we determine which node contains the data and how fast we determine whether it is a hit or a miss. We evaluate three indexing schemes because they have different structures and are extensively used in real applications. The Bx-tree uses a linearization technique involving a space-filling curve to map spatio-temporal data into a one-dimensional B+-tree. Hence, the structure of the Bx-tree is the same as that of a B+-tree, and we will consider the B+-tree for the rest of the presentation.
20
B+-tree (branch factor n = 3; the diagram shows the root, intermediate and leaf levels)
This is the basic structure of the Bx-tree. It is a balanced tree, as all paths from the root to a leaf node have equal length. It is a multi-level index, which adjusts itself according to the file size. There are three layers – the root, the intermediate layer and the leaf nodes. Each node except the root contains between (n-1)/2 and n-1 entries. Keys in leaf nodes are sorted left to right, and each leaf is linked to the next and previous leaf; hence, the structure is suitable for range queries. Keys less than or equal to a node entry lie on the left branch and keys greater than it lie on the right branch, so searching is straightforward. An insertion may lead to an overflow, which causes a split and may introduce a new level. Conversely, a deletion may lead to an underflow, which may cause entries to shift or nodes to merge, possibly reducing the height of the tree.
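A minimal sketch of the search descent just described, assuming a simplified node layout (illustrative Python, not the thesis code; keys equal to a separator follow the left branch, as stated above):

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys (separators in inner nodes)
        self.children = children    # child pointers; None in a leaf
        self.values = values        # cached results, present only in leaves

def search(root, key):
    node = root
    while node.children is not None:
        # Keys <= the separator follow the left branch, larger keys go right.
        node = node.children[bisect_left(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]       # cache hit
    return None                     # cache miss: the service must be called

# Tiny tree with branch factor 3 (keys and values are illustrative).
leaf_a = Node([1, 2], values=["r1", "r2"])
leaf_b = Node([3, 4], values=["r3", "r4"])
root = Node([2], children=[leaf_a, leaf_b])
print(search(root, 3))   # "r3"
print(search(root, 5))   # None (miss)
```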
21
Migration Recap
22
B+-tree – Migration: Migrate (1, 4)
To migrate the range (1, 4): compare 1 with the root, descend to the leaf that may contain 1, and then traverse along the linked leaf nodes, collecting keys in the range.
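A sketch of the leaf-level part of migrate(1, 4) under the same simplified layout: starting from the leaf that may contain the lower bound, follow the sibling pointers and collect every key in the range (illustrative Python, not the thesis code):

```python
class Leaf:
    def __init__(self, keys, values, next_leaf=None):
        self.keys, self.values, self.next_leaf = keys, values, next_leaf

def migrate_range(start_leaf, lo, hi):
    # start_leaf is the leaf that may contain `lo`, reached by the usual
    # root-to-leaf descent; from there we follow the sibling pointers.
    moved = []
    leaf = start_leaf
    while leaf is not None and leaf.keys and leaf.keys[0] <= hi:
        for k, v in zip(leaf.keys, leaf.values):
            if lo <= k <= hi:
                moved.append((k, v))
        leaf = leaf.next_leaf
    # The collected records are inserted on the destination node and deleted here.
    return moved

# Leaves holding keys 1..6, mirroring the slide's migrate(1, 4):
l3 = Leaf([5, 6], ["r5", "r6"])
l2 = Leaf([3, 4], ["r3", "r4"], next_leaf=l3)
l1 = Leaf([1, 2], ["r1", "r2"], next_leaf=l2)
print(migrate_range(l1, 1, 4))   # [(1, 'r1'), (2, 'r2'), (3, 'r3'), (4, 'r4')]
```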
23
B+-tree – After Migration
24
Extensible Hashing (the diagram shows the directory and the buckets)
In static hashing, the hash function is applied to a key and the hash value points to the bucket containing the record. If multiple keys hash to the same value, it is called a collision. Collisions can be handled using overflow chains, but this may degrade performance as the chains grow too long. To avoid this, we used a type of dynamic hashing called extensible hashing. It has two levels of indirection: a directory (containing pointers to buckets) and buckets (containing pointers to records). The hash function converts a search key into a 32-bit binary sequence. A variable i determines the bucket number from the hash value and also keeps track of the number of times the directory has grown; another variable j keeps track of the number of times a bucket has split. If a bucket overflows, a new bucket is created and the records are split among the two buckets. If j = i, the directory doubles; otherwise it does not.
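A compact extensible-hashing sketch following the description above: a directory of 2^i pointers, buckets with local depth j, and directory doubling when j = i (illustrative Python; the bucket capacity and hash function are chosen only to keep the example small):

```python
import hashlib

BUCKET_CAPACITY = 2   # deliberately tiny so splits are easy to trace

def h(key):
    # 32-bit hash; the low-order i bits select the directory entry.
    return int(hashlib.sha1(str(key).encode()).hexdigest(), 16) & 0xFFFFFFFF

class Bucket:
    def __init__(self, depth):
        self.depth = depth     # local depth j
        self.items = {}

class ExtensibleHash:
    def __init__(self):
        self.global_depth = 1                     # i
        self.directory = [Bucket(1), Bucket(1)]   # 2^i pointers

    def _bucket(self, key):
        return self.directory[h(key) & ((1 << self.global_depth) - 1)]

    def search(self, key):
        return self._bucket(key).items.get(key)   # None means a miss

    def insert(self, key, value):
        b = self._bucket(key)
        b.items[key] = value
        if len(b.items) > BUCKET_CAPACITY:
            self._split(b)   # a real index would re-check and split again if needed

    def _split(self, b):
        if b.depth == self.global_depth:          # j == i: double the directory
            self.directory *= 2
            self.global_depth += 1
        b.depth += 1
        sibling = Bucket(b.depth)
        bit = 1 << (b.depth - 1)
        # Redistribute records between the old bucket and its new sibling.
        for k in list(b.items):
            if h(k) & bit:
                sibling.items[k] = b.items.pop(k)
        # Re-point the directory entries that now belong to the sibling.
        for idx in range(len(self.directory)):
            if self.directory[idx] is b and idx & bit:
                self.directory[idx] = sibling
```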
25
Extensible Hashing – Hash Function
Until it grows to 33-bit length.
26
Extensible Hashing – Migration
Migrate (1, 6): There is no specific order among the stored records. Hence, we sequentially scan the buckets from start to end and migrate the keys that lie within the range. Ideally the buckets should merge and the directory should shrink after deletion, but we do not implement this.
27
EH – After Migration
28
Counting Bloom Filter
This is a probabilistic data structure for determining the membership of a record in a set. It has an m-bit array and a 4-bit counter associated with each bit of the array. To insert a record with key x, k hash functions are applied to x, the corresponding bits are set in the bit array, and the respective counters are incremented. Searching is simple: if all k bits are set (i.e., the AND of the bits is 1), the record is reported as a member of the node; otherwise it is not. Deletion is similar to insertion, except that each counter is decremented, and when a counter reaches zero the corresponding bit in the bit array is reset. In other words, to insert, delete or search for a key we first apply the k hash functions to obtain the k bit positions and then update or test those positions.
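A minimal counting-Bloom-filter sketch (illustrative Python; m and k here are arbitrary, and the k hash values are taken as slices of a single SHA-1 digest, as described for the actual implementation later in the talk):

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m=2 ** 16, k=8):
        self.m, self.k = m, k
        self.counters = [0] * m        # one small counter per bit position

    def _positions(self, key):
        # Slice one 160-bit SHA-1 digest into k pseudo-independent hash values.
        digest = hashlib.sha1(str(key).encode()).digest()
        step = len(digest) // self.k
        return [int.from_bytes(digest[i * step:(i + 1) * step], "big") % self.m
                for i in range(self.k)]

    def insert(self, key):
        for p in self._positions(key):
            self.counters[p] += 1      # set the bit and bump its counter

    def delete(self, key):
        for p in self._positions(key):
            if self.counters[p] > 0:
                self.counters[p] -= 1  # the bit is cleared once its counter reaches zero

    def is_member(self, key):
        # All k positions non-zero => probably present; any zero => definitely absent.
        return all(self.counters[p] > 0 for p in self._positions(key))

cbf = CountingBloomFilter()
cbf.insert("shoreline(L=41.5, T=2004-12-26)")
print(cbf.is_member("shoreline(L=41.5, T=2004-12-26)"))   # True
print(cbf.is_member("some other query"))                   # almost certainly False
```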
29
Counting Bloom Filter – False Positive Rate
where k is the number of hash functions, m is the length of the bit array and N is the number of set bits. A CBF can have false positives, i.e., it may say that a record is present when it actually is not, but it cannot have false negatives: if the answer is negative, the record is definitely not present. If we choose a large value of m and k independent hash functions, the false positive rate becomes negligible.
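The formula itself did not survive the slide extraction; with the definitions above (and assuming the k probed positions are independent), the standard way to write the false-positive probability is:

```latex
% k : number of hash functions
% m : length of the bit array
% N : number of bits currently set
P_{\text{false positive}} \;\approx\; \left(\frac{N}{m}\right)^{k}
```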
30
Counting Bloom Filter – Migration
Migrate (1, 4): We used the CBF developed by the Xlattice project, with M = 16 and k = 8. It is difficult to come up with 8 independent hash functions, so we used SHA-1, which returns a 160-bit pseudo-random number; we split it into 8 slices and treat each slice as the output of one of the 8 hash functions. SHA-1 was chosen because there is very little correlation among its bits. A CBF answers in constant time whether a record is present in a set. We used this property in our migration method: we iterate from the lower bound to the upper bound of the migration range and check whether each key is a member of the node, i.e., isMember(i) for i in (1, 4). If yes, we search through the file to find the physical location of the record.
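A sketch of that migration loop (illustrative Python; is_member stands for the constant-time CBF check, and lookup for the file scan that locates the physical record):

```python
def migrate_range(lo, hi, is_member, lookup):
    # is_member: constant-time CBF membership test (may report false positives)
    # lookup:    scan of the stored file for the physical record, or None if absent
    moved = []
    for key in range(lo, hi + 1):
        if is_member(key):
            record = lookup(key)       # None here means the CBF gave a false positive
            if record is not None:
                moved.append((key, record))
    return moved

# e.g. with the CountingBloomFilter sketched earlier and a dict as the record store:
# store = {2: "r2", 3: "r3"}; cbf.insert(2); cbf.insert(3)
# migrate_range(1, 4, cbf.is_member, store.get)  ->  [(2, "r2"), (3, "r3")]
```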
31
CBF – After Migration
32
Experiments – Ran the same experiment as before
Initially the workload is insert-heavy (a lot of cache misses); later it becomes retrieval-heavy (near-zero miss rate).
33
(Plot: time by indexing scheme and query intensity; "Bloom filter" here refers to the counting Bloom filter, not the classical one.) EH(300) had the lowest time, and the Bloom filter the worst, approximately 45. The times for EH(100) and EH(1000) show that changing the parameter of EH affects its performance.
34
The performance of EH(500) degraded; EH(300) is the best in terms of suitability, followed by the Bx-tree, and the Bloom filter is the worst.
35
EH(500) has the best times, closely followed by EH(300) and the Bx-tree.
36
Observations from Results
Extensible hashing with well-chosen parameters is best suited for the Elastic Cache
The Bx-tree performs consistently well, irrespective of the parameters
The Counting Bloom Filter is the least suited for the system
37
Need for Optimization
38
After Optimization
Since instance startup time is variable, we cannot guarantee complete elimination of idle time.
39
Threshold Selection
Too low -> zero idle time, but extra instances launched
Too high -> idle time not minimized
Threshold = (node capacity * 0.5) * (1 + (number of nodes allocated – (query intensity / 50)) * 0.1)
Choosing the right threshold is tricky. If it is too low, there is enough time for an instance to start up, so the idle time may drop to zero, but extra instances may be launched. If it is too high, the instance probably will not have enough time to boot up. The threshold should vary with query intensity: the higher the intensity, the lower the threshold should be. Also, as the experiment proceeds, the threshold needs to increase, since with more nodes each node takes longer to overflow.
40
Threshold Selection – For instance, with node capacity = 5000 and number of nodes allocated = 1:
Query intensity = 50 -> Threshold = 2500
Query intensity = 225 -> Threshold = 1625
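A small sketch reproducing these numbers from the formula on the previous slide (illustrative Python):

```python
def migration_threshold(node_capacity, nodes_allocated, query_intensity):
    # Threshold = (node capacity * 0.5) * (1 + (nodes allocated - intensity/50) * 0.1)
    return (node_capacity * 0.5) * (1 + (nodes_allocated - query_intensity / 50) * 0.1)

print(migration_threshold(5000, 1, 50))    # 2500.0
print(migration_threshold(5000, 1, 225))   # 1625.0
```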
41
Multi-Threaded Implementation
Maximum threads = 4: the main thread plus three worker threads
Threads t1 and t2 migrate records from nodes that have reached the threshold
Thread t3 migrates records from a node that overflows while t1 and t2 are handling other nodes
Insert(), Delete() and Migrate() are synchronized
The idle time comprised two tasks: starting up an instance and migrating records to it. We decided to multitask the execution, overlapping the migration of records with normal program execution, and introduced three threads for this purpose. It was observed that two nodes would often reach the threshold almost simultaneously; hence threads t1 and t2 are launched to migrate records whenever the threshold is met, and t3 handles the rare case in which an instance overflows while t1 and t2 are busy with other instances. In order to maintain the consistency of the cache, we synchronized the methods that modify the structure of the index: methods like insert, delete and migrate acquire locks before performing their operations.
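A minimal sketch of the synchronization idea: insert, delete and migrate all guard the index with the same lock, so a worker thread can migrate a key range while the main thread keeps serving queries (illustrative Python with a dict standing in for the real index structure):

```python
import threading

class CacheServerProxy:
    def __init__(self):
        self.lock = threading.Lock()
        self.index = {}                      # stand-in for the real index structure

    def insert(self, key, value):
        with self.lock:                      # synchronized, as on the slide
            self.index[key] = value

    def delete(self, key):
        with self.lock:
            self.index.pop(key, None)

    def migrate(self, lo, hi, destination):
        # Runs on a worker thread (t1/t2/t3) once a node reaches its threshold.
        with self.lock:
            moving = {k: v for k, v in self.index.items() if lo <= k <= hi}
            for k in moving:
                del self.index[k]
        for k, v in moving.items():
            destination.insert(k, v)

node, new_node = CacheServerProxy(), CacheServerProxy()
for k in range(8):
    node.insert(k, "result-%d" % k)
t1 = threading.Thread(target=node.migrate, args=(1, 4, new_node))
t1.start(); t1.join()
print(sorted(new_node.index))                # [1, 2, 3, 4]
```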
42
43
44
Conclusions
Implemented the Elastic Cache on Amazon EC2
Achieved near-zero miss rates
The system was cost efficient
Compared various indexing schemes and found that extensible hashing with well-chosen parameters is best suited for our system
Optimized the system and reduced the idle time significantly
45
Questions???
46
Thank You!!!