PARALLEL DATA LABORATORY
Carnegie Mellon University

An Architecture for Self-* Storage Systems
Andrew Klosterman, John Strunk, Greg Ganger
Self-* Overview
- Object store
  - Provide clients with an object-based interface
  - Aggregate object-based workers
  - Support snapshot, clone, and versioning
- Easy to manage
  - Goal-based & complaint-based tuning
  - Problem diagnosis via history
  - Automatic integration of new resources
  - No “on-call” administrator
Maintenance & Fault-Tolerance
- Repairs are necessary, but accident-prone
  - Tolerate mistakes during repair
  - Support simulated failures: fire drills
- Keep maintenance procedures short
  - Reduce the number of “destructive” activities
  - Time pressure causes mistakes
- No repairs required in less than 1 week!
System Deployment
- Single-datacenter environment: high bandwidth, tightly coupled
- How big can it be?
  - Integrated is easier to manage
    - Vertical: from the disk up to the file system
    - Horizontal: one large device vs. many small ones
  - Capacity: 16 PB with current technology
  - Objects: 128 billion (at a 128 KB object size)
System Architecture
[Diagram: clients, the I/O request routing layer, workers, and the administrator's management hierarchy.]
[Architecture diagram: the administrator runs the management hierarchy (admin console over supervisors); clients reach the workers through the I/O request routing services: the discovery, group membership and directory service, event notification service, security service, and metadata service. Each head-end interface stacks an encode/decode layer, read/write protocol, router, and messaging layer.]
Management Hierarchy
- Admin Console
  - System-wide monitoring
  - Goal determination
  - Goal distribution (sketched below)
  - Complaint-based tuning
- Supervisor
  - Monitoring
  - Sub-system goals
  - Performance tuning
  - Fault detection
  - Fault recovery
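Below is a minimal Go sketch of how goal distribution might flow from the admin console down to supervisors; the Goal fields and method names are assumptions for illustration, not the system's actual API.

```go
// A sketch of goal distribution down the management hierarchy; the Goal
// fields and the console/supervisor split are assumed for illustration.
package main

import "fmt"

// A goal the admin console determines for part of the system.
type Goal struct {
	Scope  string // e.g. a dataset or service group
	Metric string // e.g. "availability", "latency"
	Target string // e.g. "99.999%", "<10ms"
}

// Supervisor tunes its sub-system toward the goals it is handed.
type Supervisor struct {
	Name  string
	Goals []Goal
}

func (s *Supervisor) Assign(g Goal) {
	s.Goals = append(s.Goals, g)
	fmt.Printf("%s: tuning %s toward %s %s\n", s.Name, g.Scope, g.Metric, g.Target)
}

// The admin console distributes a determined goal to every supervisor
// responsible for its scope.
func distribute(g Goal, sups []*Supervisor) {
	for _, s := range sups {
		s.Assign(g)
	}
}

func main() {
	sups := []*Supervisor{{Name: "supervisor-A"}, {Name: "supervisor-B"}}
	distribute(Goal{Scope: "db-objects", Metric: "availability", Target: "99.999%"}, sups)
}
```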
Worker
- Object-based storage device
  - Objects: read / write
  - Attributes: read / write
  - Comprehensive versioning
  - Fast-copy clone(): a copy-on-write object (sketched below)
- “Intelligent bricks”
  - 1U: P4, 2 GB RAM, 2 Gb NICs
  - 4 × 250 GB SATA or 4 × 73 GB SCSI disks
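Below is a minimal Go sketch of a worker's copy-on-write clone(): the clone aliases the source object's bytes until a write forces a private copy. All type and method names are assumptions for illustration, not the worker's real interface.

```go
// A sketch of a worker with copy-on-write clone(); names are assumed.
package main

import "fmt"

// Object is an object's payload plus its attributes.
type Object struct {
	data  []byte
	attrs map[string]string
}

// Worker stores objects by ID. Clones share the backing bytes until a
// write occurs (copy-on-write).
type Worker struct {
	objects map[uint64]*Object
	nextID  uint64
}

func NewWorker() *Worker { return &Worker{objects: make(map[uint64]*Object)} }

func (w *Worker) Create(data []byte) uint64 {
	w.nextID++
	w.objects[w.nextID] = &Object{data: data, attrs: map[string]string{}}
	return w.nextID
}

// Clone is a fast copy: the new object aliases the source's bytes.
func (w *Worker) Clone(id uint64) uint64 {
	src := w.objects[id]
	w.nextID++
	w.objects[w.nextID] = &Object{data: src.data, attrs: map[string]string{}}
	return w.nextID
}

// Write copies the shared bytes before mutating them, so the clone's
// source is unaffected.
func (w *Worker) Write(id uint64, off int, p []byte) {
	obj := w.objects[id]
	fresh := make([]byte, len(obj.data))
	copy(fresh, obj.data)
	copy(fresh[off:], p)
	obj.data = fresh
}

func main() {
	w := NewWorker()
	a := w.Create([]byte("hello"))
	b := w.Clone(a)            // instant: no data copied yet
	w.Write(b, 0, []byte("H")) // the copy happens here
	fmt.Printf("%s %s\n", w.objects[a].data, w.objects[b].data) // hello Hello
}
```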
Request Routing #1
- Discovery
  - Detect new components
  - Assign a system ID
- Group Membership
  - Aggregate components into service groups
- Directory
  - Lookup service (“DNS”)
  - Query for contact info
- Event Notification
  - post() / subscribe() (sketched below)
  - Situations trigger event posting
  - Receipt of a subscribed message triggers a reaction
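Below is a minimal Go sketch of the post()/subscribe() pattern behind the event notification service; the Bus type and its topic-keyed handlers are assumptions, not the service's real interface.

```go
// A sketch of the event notification service's post()/subscribe() pattern.
package main

import "fmt"

type Event struct {
	Topic   string
	Payload string
}

// Bus fans each posted event out to every handler subscribed to its topic.
type Bus struct {
	subs map[string][]func(Event)
}

func NewBus() *Bus { return &Bus{subs: make(map[string][]func(Event))} }

// Subscribe registers a reaction to run when a matching event is posted.
func (b *Bus) Subscribe(topic string, handler func(Event)) {
	b.subs[topic] = append(b.subs[topic], handler)
}

// Post delivers the event to all current subscribers of its topic.
func (b *Bus) Post(e Event) {
	for _, h := range b.subs[e.Topic] {
		h(e)
	}
}

func main() {
	bus := NewBus()
	// A supervisor might subscribe to worker-failure events...
	bus.Subscribe("worker.failed", func(e Event) {
		fmt.Println("supervisor: start recovery for", e.Payload)
	})
	// ...and a monitored situation triggers the posting.
	bus.Post(Event{Topic: "worker.failed", Payload: "worker-42"})
}
```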
Request Routing #2
- Security Service
  - Authenticity: token-based, checked at workers against ACLs (sketched below)
  - Confidentiality: PASIS encode / decode
  - Integrity: messaging layer
  - Key management
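Below is a minimal Go sketch of a token-based authenticity check at a worker, assuming an HMAC-signed token and a per-worker ACL; the token format and key handling are illustrative, not the system's actual scheme.

```go
// A sketch of token-based access checks at a worker; the token format,
// shared key, and ACL shape are assumptions for illustration.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// A token names a client, an object, and an allowed operation, and is
// signed by the security service with a key the workers share.
type Token struct {
	Client, Object, Op string
	MAC                []byte
}

func sign(key []byte, t Token) []byte {
	m := hmac.New(sha256.New, key)
	m.Write([]byte(t.Client + "|" + t.Object + "|" + t.Op))
	return m.Sum(nil)
}

// checkAtWorker verifies the token's MAC, then consults the worker's ACL.
func checkAtWorker(key []byte, acl map[string][]string, t Token) bool {
	if !hmac.Equal(t.MAC, sign(key, t)) {
		return false // forged or altered token
	}
	for _, op := range acl[t.Client] {
		if op == t.Op {
			return true
		}
	}
	return false
}

func main() {
	key := []byte("shared-service-key") // placeholder key
	acl := map[string][]string{"alice": {"read", "write"}}

	t := Token{Client: "alice", Object: "obj-7", Op: "read"}
	t.MAC = sign(key, t)                    // issued by the security service
	fmt.Println(checkAtWorker(key, acl, t)) // true
}
```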
Request Routing #3
- Metadata Service
  - Gigantic B-tree of object metadata: goals, encoding, share locations (sketched below)
  - Can be rebuilt from data on the workers
  - Supports enumeration for fsck
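Below is a minimal Go sketch of the per-object metadata record and an ordered index supporting lookup and fsck-style enumeration; a sorted slice stands in for the real B-tree, and the field names are assumptions.

```go
// A sketch of the metadata service's per-object record and ordered index;
// a sorted slice stands in for the system's B-tree.
package main

import (
	"fmt"
	"sort"
)

// Metadata records what the routing layers need to reach an object.
type Metadata struct {
	ObjectID uint64
	Goals    string   // e.g. availability / performance goals
	Encoding string   // e.g. "3-of-5 erasure code"
	Shares   []string // workers holding the shares
}

type Index struct{ recs []Metadata } // kept sorted by ObjectID

func (ix *Index) Insert(m Metadata) {
	i := sort.Search(len(ix.recs), func(i int) bool { return ix.recs[i].ObjectID >= m.ObjectID })
	ix.recs = append(ix.recs[:i], append([]Metadata{m}, ix.recs[i:]...)...)
}

func (ix *Index) Lookup(id uint64) (Metadata, bool) {
	i := sort.Search(len(ix.recs), func(i int) bool { return ix.recs[i].ObjectID >= id })
	if i < len(ix.recs) && ix.recs[i].ObjectID == id {
		return ix.recs[i], true
	}
	return Metadata{}, false
}

// Enumerate visits every record in ID order, as an fsck-style scan would.
func (ix *Index) Enumerate(visit func(Metadata)) {
	for _, m := range ix.recs {
		visit(m)
	}
}

func main() {
	ix := &Index{}
	ix.Insert(Metadata{ObjectID: 7, Encoding: "3-of-5", Shares: []string{"w1", "w2", "w3", "w4", "w5"}})
	if m, ok := ix.Lookup(7); ok {
		fmt.Println(m.Shares)
	}
	ix.Enumerate(func(m Metadata) { fmt.Println("fsck saw object", m.ObjectID) })
}
```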
Head-end Interface
- Object-storage interface (sketched below)
- Supports additional calls
  - Goal assignment
  - Side-band performance tuning
- Two types of head-ends
  - Translation: exports NFS, AFS, CIFS, etc.
  - Direct: raw access to self-* objects
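Below is a minimal Go sketch of the head-end's view of the object store: ordinary read/write plus the extra goal-assignment call. The interface and its method names are assumptions for illustration.

```go
// A sketch of the head-end's object-store interface, including the
// side-band goal-assignment call; names are assumed for illustration.
package main

import "fmt"

// ObjectStore is what both head-end types sit on top of.
type ObjectStore interface {
	Read(objectID uint64, off, n int) ([]byte, error)
	Write(objectID uint64, off int, p []byte) error
	// AssignGoal is the extra call: it tunes how an object is stored
	// (e.g. its encoding) without moving data through the client.
	AssignGoal(objectID uint64, goal string) error
}

// A translation head-end would map NFS/AFS/CIFS operations onto this
// interface; a direct head-end hands it to the client unchanged.
func demo(s ObjectStore) {
	s.Write(7, 0, []byte("report"))
	s.AssignGoal(7, "availability=5-nines")
	data, _ := s.Read(7, 0, 6)
	fmt.Printf("%s\n", data)
}

// toyStore is an in-memory stand-in so the sketch runs.
type toyStore struct{ objs map[uint64][]byte }

func (t *toyStore) Read(id uint64, off, n int) ([]byte, error) { return t.objs[id][off : off+n], nil }
func (t *toyStore) Write(id uint64, off int, p []byte) error {
	buf := append([]byte{}, t.objs[id]...)
	for len(buf) < off+len(p) {
		buf = append(buf, 0)
	}
	copy(buf[off:], p)
	t.objs[id] = buf
	return nil
}
func (t *toyStore) AssignGoal(id uint64, goal string) error {
	fmt.Println("object", id, "goal:", goal)
	return nil
}

func main() { demo(&toyStore{objs: map[uint64][]byte{}}) }
```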
Communication Infrastructure #1
- Encode / Decode (sketched below)
  - Encoding chosen to meet the goals
  - Breaks objects into shares on write()
  - Reconstructs objects from shares on read()
- Read / Write Protocol
  - Atomic changes to ~64 kB chunks of objects
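Below is a minimal Go sketch of encode/decode using a toy scheme of two data shares plus one XOR parity share; the real system chooses among richer encodings to meet each object's goals, so this stands in only for the split-on-write / reconstruct-on-read pattern.

```go
// A sketch of encode/decode with a toy 2-data + 1-parity scheme; the
// real system picks richer encodings to meet each object's goals.
package main

import "fmt"

// encode splits a (here: even-length) object into two data shares and
// one XOR parity share; any single lost share can be rebuilt.
func encode(obj []byte) [3][]byte {
	h := len(obj) / 2
	a, b := obj[:h], obj[h:]
	p := make([]byte, h)
	for i := range p {
		p[i] = a[i] ^ b[i]
	}
	return [3][]byte{a, b, p}
}

// decode rebuilds the object from any two shares (nil marks a lost one).
func decode(s [3][]byte) []byte {
	xor := func(x, y []byte) []byte {
		z := make([]byte, len(x))
		for i := range z {
			z[i] = x[i] ^ y[i]
		}
		return z
	}
	switch {
	case s[0] == nil:
		s[0] = xor(s[1], s[2])
	case s[1] == nil:
		s[1] = xor(s[0], s[2])
	}
	return append(append([]byte{}, s[0]...), s[1]...)
}

func main() {
	shares := encode([]byte("selfstar")) // write(): shares go to 3 workers
	shares[1] = nil                      // one worker is unreachable
	fmt.Println(string(decode(shares)))  // read(): "selfstar"
}
```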
Communication Infrastructure #2
- Router
  - Decision maker: picks destinations for distributed services
  - Picks which shares to fetch on read() (sketched below)
- Messaging Layer
  - Forwards messages
  - Picks the network
  - Interacts with the Directory service
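Below is a minimal Go sketch of the router's read-time decision: ask the directory for an object's share locations, then pick enough reachable workers. The names and the first-k-reachable policy are assumptions for illustration.

```go
// A sketch of the router's read-time share selection; the "first k
// reachable workers" policy is assumed for illustration.
package main

import "fmt"

type Directory map[uint64][]string // objectID -> workers holding shares

// pickShares chooses k workers to read from, skipping unreachable ones.
func pickShares(dir Directory, id uint64, k int, up func(string) bool) []string {
	var chosen []string
	for _, w := range dir[id] {
		if up(w) {
			chosen = append(chosen, w)
			if len(chosen) == k {
				break
			}
		}
	}
	return chosen
}

func main() {
	dir := Directory{7: {"w1", "w2", "w3", "w4", "w5"}}
	up := func(w string) bool { return w != "w2" } // w2 is down
	// A 3-of-5 encoding needs any 3 shares to reconstruct the object.
	fmt.Println(pickShares(dir, 7, 3, up)) // [w1 w3 w4]
}
```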