Background
Goal: To build a ubiquitous and robust storage infrastructure
Requirements: Scalability, availability, performance, robustness
Solution: Dynamic object replication and migration in a hybrid architecture
Three-layer replica creation
Object Layer + Intelligent Disk Layer + Regional Manager Layer
Object Layer - Metadata entries
- object GUID
- replication_threshold: requests/time
- delete_threshold: requests/time
- replication_where: region_list
- itinerary: (region1, DiskIP1, period1), (region2, DiskIP2, period2), …
- subobject_list: (GUID1, pointer1), (GUID2, pointer2), …
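The object-layer metadata entries can be pictured as a single record. The following is a minimal sketch, not the system's actual layout: the class name, field types, and the unit of `period` (seconds) are assumptions made for illustration, and the GUID and IP values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectMetadata:
    """Per-object metadata entries, mirroring the fields listed above."""
    guid: str
    replication_threshold: float   # requests per unit time
    delete_threshold: float        # requests per unit time
    # region_list: regions where the object should be replicated
    replication_where: List[str] = field(default_factory=list)
    # (region, disk IP, period) stops; period assumed to be in seconds
    itinerary: List[Tuple[str, str, float]] = field(default_factory=list)
    # (sub-object GUID, pointer) pairs; the pointer is a disk IP
    subobject_list: List[Tuple[str, str]] = field(default_factory=list)


# Hypothetical example values.
meta = ObjectMetadata(
    guid="obj-001",
    replication_threshold=100.0,
    delete_threshold=5.0,
    replication_where=["region-east"],
    itinerary=[("region-east", "10.0.0.7", 3600.0)],
)
```

Using `default_factory` gives every object its own empty lists, so a simple object without sub-objects or an itinerary needs only the GUID and the two thresholds.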
Intelligent Disk Layer - Metadata
- device description: bandwidth, CPU utilization, available space, region ID, IP address
- object layout and other metadata (defined above)
- request information for each object
  - the amount of requests over a period of time for each region
  - the popularity: request amount * region weight
- soft-state information about its neighbor devices: IP address, bandwidth, CPU utilization, available space
  - neighbor devices are those within two hops that have the same region ID
  - the disk periodically broadcasts its load and free-space information to all its neighbors
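The popularity entry (request amount * region weight) could be computed per object as below. This is a sketch under two assumptions that the slide does not state: the per-region products are summed into one score, and a region without an explicit weight counts with weight 1.0. The region names and numbers are hypothetical.

```python
from typing import Dict


def popularity(requests_by_region: Dict[str, int],
               region_weight: Dict[str, float]) -> float:
    """Sum of (request amount * region weight) over all regions.

    Assumption: regions without an explicit weight default to 1.0.
    """
    return sum(count * region_weight.get(region, 1.0)
               for region, count in requests_by_region.items())


requests = {"region-a": 40, "region-b": 10}   # hypothetical counts for one period
weights = {"region-a": 1.0, "region-b": 2.5}  # hypothetical region weights
score = popularity(requests, weights)         # 40 * 1.0 + 10 * 2.5
```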
Intelligent Disk Layer - Policy
Parameters: disk_replication_threshold, disk_load_threshold
1. If the requests to an object exceed the replication_threshold associated with the object, the disk creates a new replica on one of its neighbors, chosen according to the neighbors' status.
2. If the requests to an object from a single region exceed the disk_replication_threshold, the disk replicates the object to that region. If that region differs from the disk's own region, the disk asks its own regional manager to replicate the object. If it is the disk's own region, the disk replicates the object on one of its neighbors, chosen according to the neighbors' status. The disk is responsible for redirecting requests to the new replica.
3. If the disk load exceeds the disk_load_threshold, the disk replicates its 5 most popular objects, either to its neighbors or, through its regional manager, to other regions.
4. If the disk receives a replication request from a neighbor, it checks whether it already holds a replica and checks its own load, then decides whether to accept.
5. When the disk needs more space, it replaces objects using LRU.
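Rules 1 to 3 above can be sketched as one decision function that a disk runs at the end of each measurement period. This is an illustrative sketch only: the function name, the action labels, and the tuple layout are assumptions, and rule 3 ranks objects by raw request count rather than by weighted popularity to keep the example short. The target of an offload is left as `None` because the slide allows either a neighbor or another region.

```python
from typing import Dict, List, Optional, Tuple

TOP_K = 5  # rule 3 replicates the five most popular objects

Action = Tuple[str, str, Optional[str]]  # (action label, object GUID, target region)


def disk_actions(own_region: str,
                 requests: Dict[str, Dict[str, int]],  # GUID -> region -> count
                 object_threshold: Dict[str, int],     # per-object replication_threshold
                 disk_replication_threshold: int,
                 disk_load: float,
                 disk_load_threshold: float) -> List[Action]:
    """Return the replication actions for one measurement period."""
    actions: List[Action] = []

    def add(action: Action) -> None:
        if action not in actions:  # rules 1 and 2 may ask for the same thing
            actions.append(action)

    for guid, by_region in requests.items():
        # Rule 1: total requests exceed the object's own threshold.
        if sum(by_region.values()) > object_threshold[guid]:
            add(("replicate_on_neighbor", guid, own_region))
        # Rule 2: requests from one region exceed the disk-wide threshold.
        for region, count in by_region.items():
            if count > disk_replication_threshold:
                if region == own_region:
                    add(("replicate_on_neighbor", guid, region))
                else:
                    add(("ask_regional_manager", guid, region))
    # Rule 3: an overloaded disk sheds its most requested objects.
    if disk_load > disk_load_threshold:
        ranked = sorted(requests, key=lambda g: sum(requests[g].values()),
                        reverse=True)
        for guid in ranked[:TOP_K]:
            add(("offload", guid, None))
    return actions


# Hypothetical request counts for a disk in region "r1".
reqs = {"obj-1": {"r1": 12, "r2": 3}, "obj-2": {"r1": 1, "r2": 30}}
limits = {"obj-1": 10, "obj-2": 100}
plan = disk_actions("r1", reqs, limits, 20, disk_load=0.5, disk_load_threshold=0.9)
```

In this example `obj-1` triggers rule 1 (15 requests against a threshold of 10) and `obj-2` triggers rule 2 for the remote region `r2`, so the disk would have to go through its regional manager for the second object.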
Regional Manager Layer -- Metadata
- Object layout, location information
  - Because the creation of a new object must go through the regional manager first, the regional manager can record all the initial object location information.
  - In addition, both replica creation and deletion must be registered with the regional manager.
- Device status information: IP address, bandwidth, CPU utilization, available space
  - Devices periodically send their status to the regional manager.
- Request information for each object
  - the number of requests over a period of time for each region
- Information about other regions: location of other regional managers, distance to other regional managers
Regional Manager Layer -- Policy
Parameters: region_replication_threshold
1. The regional manager replicates an object as specified by the replication_where and itinerary entries associated with the object.
2. If a regional manager observes that the requests (open) to an object from one region exceed the region_replication_threshold, it acts according to where the requests come from. If that region is not its own, it asks that region's regional manager to create a replica there. Otherwise it finds a disk in its own region and creates a new replica on it.
3. If a regional manager receives a request from a device in its region to create a replica in a specific region, it contacts the regional manager of that region and returns to the requesting disk the IP of the disk to replicate to.
4. If a regional manager receives a request from another regional manager to create a new replica in its region, it chooses a disk to host the replica and tells the requesting regional manager that disk's IP.
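Rule 2 of the regional manager policy can be sketched as follows. The slide does not say how a manager picks a disk in its own region, so choosing the disk with the most available space is an assumption made here, as are the names `DiskStatus` and `handle_hot_object`, the action labels, and the example addresses.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DiskStatus:
    ip: str
    available_space: int  # bytes, as periodically reported by the device


def handle_hot_object(own_region: str,
                      requesting_region: str,
                      request_count: int,
                      region_replication_threshold: int,
                      local_disks: List[DiskStatus],
                      remote_managers: Dict[str, str]) -> Tuple[str, str]:
    """Return (action, address) for one object and one requesting region."""
    if request_count <= region_replication_threshold:
        return ("none", "")
    if requesting_region != own_region:
        # Ask the regional manager of the requesting region to create the replica.
        return ("forward", remote_managers[requesting_region])
    if not local_disks:
        return ("none", "")
    # Assumption: prefer the local disk with the most free space.
    target = max(local_disks, key=lambda d: d.available_space)
    return ("create_local", target.ip)


# Hypothetical device table and manager directory for region "r1".
disks = [DiskStatus("10.0.0.1", 100), DiskStatus("10.0.0.2", 500)]
managers = {"r2": "rm.r2.example"}
local_case = handle_hot_object("r1", "r1", 50, 20, disks, managers)
remote_case = handle_hot_object("r1", "r2", 50, 20, disks, managers)
```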
Replica selection
The client queries the regional manager with the GUID of an object.
- If the object is in the region, the regional manager randomly chooses one replica for the client.
- If the object is not in the current region, the regional manager applies the same mechanism as OceanStore to locate a nearby regional manager that has the object. That regional manager then randomly chooses one replica for the client.
A client that pre-schedules object migration already knows the IP address of the disk, or the region, where the object resides.
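The two lookup cases can be sketched as below. The OceanStore-style location step is not modeled: it is represented by a caller-supplied `locate_remote` function, which is a hypothetical stand-in that returns the replica list held by a nearby regional manager. The random choice at the end then stands for whichever manager ends up serving the request.

```python
import random
from typing import Callable, Dict, List, Optional


def select_replica(guid: str,
                   local_replicas: Dict[str, List[str]],
                   locate_remote: Callable[[str], Optional[List[str]]],
                   rng: random.Random) -> Optional[str]:
    """Return the IP of a randomly chosen replica, or None if none is found."""
    replicas = local_replicas.get(guid)
    if not replicas:
        # Not in this region: fall back to the (unmodeled) remote lookup.
        replicas = locate_remote(guid)
    if not replicas:
        return None
    return rng.choice(replicas)


rng = random.Random(0)  # seeded only to make the sketch repeatable
local = {"obj-1": ["10.0.0.1", "10.0.0.2"]}  # hypothetical replica table
picked = select_replica("obj-1", local, lambda g: None, rng)
```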
Compound Object
A compound object has a metadata entry listing the (GUID, pointer) pairs of all its sub-objects. The pointer is the IP address of the disk that hosts the sub-object for the current copy of the compound object, so it differs between copies of the compound object on different disks.
Every time a new replica of a compound object is created, the entity that performs the replication checks whether each sub-object has a nearby copy. If it does, the pointer is set to the IP address of the disk holding the nearby copy. If not, the sub-object is replicated along with the compound object.
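The pointer update during compound-object replication can be sketched as follows. How "nearby" is decided is not specified, so it is passed in as a hypothetical `find_nearby` lookup. A further assumption is that a sub-object replicated along with the compound object lands on the same disk as the new compound replica.

```python
from typing import Callable, List, Optional, Tuple

SubobjectList = List[Tuple[str, str]]  # (sub-object GUID, pointer = disk IP)


def replicate_compound(subobject_list: SubobjectList,
                       find_nearby: Callable[[str], Optional[str]],
                       new_disk_ip: str) -> Tuple[SubobjectList, List[str]]:
    """Build the subobject_list for a new replica of a compound object.

    Returns the new (GUID, pointer) pairs and the GUIDs that must be copied.
    """
    new_pointers: SubobjectList = []
    to_copy: List[str] = []
    for guid, _old_pointer in subobject_list:
        nearby_ip = find_nearby(guid)
        if nearby_ip is not None:
            # Reuse the nearby copy; only the pointer changes.
            new_pointers.append((guid, nearby_ip))
        else:
            # No nearby copy: replicate the sub-object with the compound object.
            to_copy.append(guid)
            new_pointers.append((guid, new_disk_ip))
    return new_pointers, to_copy


# Hypothetical data: only sub-1 already has a copy near the new replica.
subs = [("sub-1", "10.0.0.1"), ("sub-2", "10.0.0.1")]
nearby = {"sub-1": "10.0.1.9"}
pointers, to_copy = replicate_compound(subs, nearby.get, "10.0.1.5")
```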
Experimental environment
Parameters:
- Object number, Disk number, Region number, Client number
- Object size, Disk bandwidth, Regional manager bandwidth, Client bandwidth, Network delay (per hop)
- Request (open & read) generator
- Thresholds, time period to calculate request amount and frequency
How to organize the region, disk, and object?
- Assume that in the initial state, disks are geographically grouped into regions.
- Objects are randomly scattered across the disks, and each object belongs to the region of the disk that holds it.
- Start with region number = 1.
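The initial state described above could be set up along these lines. This is a sketch of one possible setup, not the experiment's actual code: the function name, the uniform random choice of disk, and the example identifiers are assumptions.

```python
import random
from typing import Dict, List, Tuple


def place_objects(object_ids: List[str],
                  disk_region: Dict[str, str],
                  rng: random.Random) -> Dict[str, Tuple[str, str]]:
    """Scatter objects over disks; each object inherits its disk's region.

    Returns a map from object ID to (disk IP, region).
    """
    disks = sorted(disk_region)  # fixed order so a seeded run is repeatable
    placement = {}
    for obj in object_ids:
        disk = rng.choice(disks)
        placement[obj] = (disk, disk_region[disk])
    return placement


# Start with region number = 1: every disk belongs to the same region.
disk_region = {f"10.0.0.{i}": "region-1" for i in range(1, 5)}
objects = [f"obj-{i}" for i in range(10)]
placement = place_objects(objects, disk_region, random.Random(42))
```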
What to expect
- The average access time is greatly reduced.
- The network overhead (control messages and object replication traffic) does not exceed the benefit (the traffic saved by accessing a nearby copy).
- The ratio of storage overhead (metadata and replicated objects) to total storage space is insignificant.
- The total number of replicas of an object should be proportional to the number of requests for it.
- Under failure (simulated by randomly removing disks from the network), the availability of the system (the fraction of successful accesses over all accesses, measured at different percentages of live disks) is acceptable.
- Scalability. The above factors (average access time, the ratio of network overhead to benefit, the storage overhead) increase linearly with the size of the system. The availability of the system should not change much across different system sizes.
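The availability measure in the failure experiment can be sketched as below. An access is counted as successful when at least one replica of the object sits on a live disk, which is an assumed success criterion. The helper names and the example replica map are hypothetical.

```python
import random
from typing import Dict, List, Set


def availability(replicas: Dict[str, Set[str]],
                 live_disks: Set[str],
                 accesses: List[str]) -> float:
    """Fraction of accesses that find at least one replica on a live disk."""
    if not accesses:
        return 0.0
    ok = sum(1 for obj in accesses if replicas.get(obj, set()) & live_disks)
    return ok / len(accesses)


def fail_disks(disks: List[str], live_fraction: float,
               rng: random.Random) -> Set[str]:
    """Randomly keep the given fraction of disks alive and remove the rest."""
    keep = round(len(disks) * live_fraction)
    return set(rng.sample(disks, keep))


# Hypothetical replica map: "a" has two replicas, "b" has one.
replica_map = {"a": {"d1", "d2"}, "b": {"d3"}}
trace = ["a", "b", "a", "b"]
half = availability(replica_map, {"d1"}, trace)  # only "a" is reachable
```

Sweeping `live_fraction` from 1.0 downward and averaging `availability` over several random draws from `fail_disks` would give the availability curve for different percentages of live disks.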
Future Work
- Refine the parameters
- Refine the regional manager layer replica selection and add disk layer replica selection
- Experiment with a real workload for an application
- Combine with the concurrency control work
- Modify policies based on security limitations