Distributed Data Storage and Processing over Commodity Clusters: Sector & Sphere. Yunhong Gu, Univ. of Illinois at Chicago, Feb. 17, 2009
What is Sector/Sphere? Sector: distributed storage system. Sphere: run-time middleware that supports simplified distributed data processing. Open source software, GPL, written in C++. Started in 2006, current version
Overview Motivation Sector Sphere Experimental studies Future work
Motivation Super-computer model: Expensive, data IO bottleneck Sector/Sphere model: Inexpensive, parallel data IO
Motivation Parallel/distributed programming with MPI, etc.: flexible and powerful, but complicated, with no data locality support. Sector/Sphere model: the cluster appears as a single entity to the developer; simplified programming interface; data locality support from the storage layer. Limited to certain data-parallel applications.
Motivation Systems designed for a single data center require additional effort to locate and move data. Sector/Sphere model: supports wide-area data collection and distribution.
Sector: Distributed Storage System [Architecture diagram: the client connects to the master and the security server over SSL; the security server handles user accounts, data protection, and system security; the master provides storage system management, processing scheduling, and acts as service provider; slaves perform storage and processing of data; data transfer uses UDT, with optional encryption; access is through system tools and application programming interfaces.]
Sector: Distributed Storage System Sector stores files on the native/local file system of each slave node. Sector does not split files into blocks. Pro: simple and robust, suitable for the wide area. Con: file size is limited by the capacity of a single node. Sector uses replication for better reliability and availability. The master node maintains the file system metadata; no permanent metadata is needed. Topology aware.
Sector: Write/Read Write is exclusive. Replicas are updated in a chained manner: the client updates one replica, this replica updates the next, and so on. All replicas are updated upon the completion of a write operation. Read: different replicas can serve different clients at the same time; the replica nearest to the client is chosen whenever possible.
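The chained update can be pictured as a simple forwarding rule; the sketch below flattens the chain into one loop and uses invented names (Replica, chainedWrite) that are not part of the Sector code base.

// Illustrative sketch of chained replica updates (hypothetical types; not the Sector API).
#include <iostream>
#include <string>
#include <vector>

struct Replica {
    std::string host;   // slave node holding this replica

    // Stand-in for the slave writing the update to its local file system.
    bool apply(const std::string& data) {
        std::cout << "replica on " << host << " applied " << data.size() << " bytes\n";
        return true;
    }
};

// In Sector the client sends the write to one replica and each replica forwards
// it to the next in the chain; this sketch collapses that into a single loop.
// The write completes only when every replica has applied the update.
bool chainedWrite(std::vector<Replica>& chain, const std::string& data) {
    for (Replica& r : chain) {
        if (!r.apply(data))
            return false;   // a failed replica aborts the write
    }
    return true;            // all replicas updated: write is complete
}

int main() {
    std::vector<Replica> chain = { {"slave-1"}, {"slave-2"}, {"slave-3"} };
    chainedWrite(chain, "record payload");
    return 0;
}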
Sector: Tools and API Supported file system operations: ls, stat, mv, cp, mkdir, rm, upload, download. Wildcard characters are supported. System monitoring: sysinfo. C++ API: list, stat, move, copy, mkdir, remove, open, close, read, write, sysinfo.
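To show how the listed C++ calls might fit together, here is a declaration-only sketch of a client interface built from those names. The class name, parameter lists, and return types are assumptions for illustration and do not reproduce the actual Sector headers.

// Hypothetical organization of the Sector C++ API calls listed above
// (names and signatures assumed; consult the real Sector headers).
#include <string>
#include <vector>

struct FileStat { std::string name; long long size; };   // assumed metadata record

class SectorClientSketch {
public:
    // Metadata operations, mirroring the command-line tools.
    std::vector<FileStat> list(const std::string& path);
    FileStat stat(const std::string& path);
    int move(const std::string& oldpath, const std::string& newpath);
    int copy(const std::string& src, const std::string& dst);
    int mkdir(const std::string& path);
    int remove(const std::string& path);

    // File I/O.
    int open(const std::string& path, int mode);
    int read(int fd, char* buf, int len);
    int write(int fd, const char* buf, int len);
    int close(int fd);

    // Cluster status, equivalent to the "sysinfo" tool.
    std::string sysinfo();
};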
Sphere: Simplified Data Processing Data-parallel applications. Data is processed where it resides, or on the nearest possible node (locality). The same user-defined function (UDF) is applied to all elements (records, blocks, or files). Processing output can be written to Sector files, on the same node or on other nodes. Generalized Map/Reduce.
Sphere: Simplified Data Processing [Dataflow diagrams: Input -> UDF -> Output; Input -> UDF -> Intermediate -> UDF -> Output; multiple inputs (Input 1, Input 2) -> UDF -> Output.]
Sphere: Simplified Data Processing
Pseudo code:
  for each file F in (SDSS datasets)
    for each image I in F
      findBrownDwarf(I, …);
Sphere client code:
  SphereStream sdss;
  sdss.init("sdss files");
  SphereProcess myproc;
  myproc.run(sdss, "findBrownDwarf", …);
  myproc.read(result);
UDF signature:
  findBrownDwarf(char* image, int isize, char* result, int rsize);
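To make the UDF contract concrete, here is a minimal sketch of a function body with the signature from the slide. The detection logic is a placeholder, and the real Sphere UDF interface passes additional context (input/output descriptors) that is omitted here; only the signature itself comes from the slide.

// Sketch of a Sphere-style UDF using the signature shown above.
// The "detection" is a stand-in; a real UDF would analyze the image data.
#include <cstdio>

// image/isize: one input element (an image) handed to the UDF.
// result/rsize: caller-provided buffer for the UDF's output record.
int findBrownDwarf(char* image, int isize, char* result, int rsize) {
    // Placeholder detection: count pixels above a brightness threshold.
    int bright = 0;
    for (int i = 0; i < isize; ++i)
        if (static_cast<unsigned char>(image[i]) > 200)
            ++bright;

    // Emit a small text record; Sphere routes UDF output to Sector files.
    std::snprintf(result, rsize, "bright_pixels=%d", bright);
    return 0;  // 0 indicates success
}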
Sphere: Data Movement Slave -> Slave (local); Slave -> Slaves (shuffle/hash); Slave -> Client.
Load Balance & Fault Tolerance The number of data segments is much larger than the number of SPEs. When an SPE completes a data segment, a new segment is assigned to it. If an SPE fails, the data segment assigned to it is re-assigned to another SPE and processed again. Faulty nodes are detected and removed.
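The assignment rule can be sketched as a simple work queue: segments are handed out as SPEs become free, and a segment whose SPE fails goes back on the queue for another SPE. All names and the simulated failure below are illustrative; this is not the actual Sphere scheduler, which additionally detects and removes persistently faulty nodes.

// Illustrative sketch of segment assignment and re-assignment across SPEs.
#include <deque>
#include <iostream>
#include <string>
#include <vector>

struct Segment { int id; };

// Stand-in for running a UDF on one segment; returns false to simulate a failure.
bool runOnSPE(const std::string& spe, const Segment& seg) {
    std::cout << spe << " processing segment " << seg.id << "\n";
    return !(spe == "spe-2" && seg.id == 4);   // pretend spe-2 fails on segment 4
}

int main() {
    std::vector<std::string> spes = {"spe-1", "spe-2", "spe-3"};
    std::deque<Segment> pending;
    for (int i = 0; i < 10; ++i) pending.push_back({i});   // far more segments than SPEs

    // Hand out segments as SPEs become free; a failed segment is re-queued
    // and eventually processed again by another SPE.
    while (!pending.empty()) {
        for (const std::string& spe : spes) {
            if (pending.empty()) break;
            Segment seg = pending.front();
            pending.pop_front();
            if (!runOnSPE(spe, seg)) {
                std::cout << "re-assigning segment " << seg.id << "\n";
                pending.push_back(seg);
            }
        }
    }
    return 0;
}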
Open Cloud Testbed 4 racks in Baltimore (JHU), Chicago (StarLight and UIC), and San Diego (Calit2). 10Gb/s inter-site connections over CiscoWave; 1Gb/s inter-rack connections. Each node: two dual-core AMD CPUs, 12GB RAM, single 1TB disk.
Open Cloud Testbed
Example: Sorting a Terabyte Data is split into small files scattered across all slaves. Stage 1: on each slave, an SPE scans its local files and sends each record to a bucket file on a remote node according to the record's key, so that the buckets themselves are in sorted order. Stage 2: on each destination node, an SPE sorts the data inside each bucket.
TeraSort [Record layout: 100-byte binary records, each a 10-byte key followed by a 90-byte value. Stage 1: hash each record on the first 10 bits of its key into one of 1024 buckets (Bucket-0 … Bucket-1023). Stage 2: sort each bucket on its local node.]
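As a concrete illustration of the stage-1 rule, the sketch below computes a bucket index from the first 10 bits of the key; the record layout (10-byte key, 90-byte value) comes from the slide, while the type and function names are made up for this example. Because those 10 bits are the most significant bits of the key, every key in Bucket-i sorts before every key in Bucket-(i+1), which is why sorting each bucket locally in stage 2 yields a globally sorted result.

// Sketch of TeraSort stage-1 bucketing: 100-byte records (10-byte key +
// 90-byte value); the first 10 bits of the key select one of 1024 buckets.
#include <cstdint>
#include <cstdio>

const int KEY_LEN = 10;
const int VALUE_LEN = 90;

struct Record {
    char key[KEY_LEN];
    char value[VALUE_LEN];
};

// Bucket index = first 10 bits of the key: all 8 bits of key[0] plus the
// top 2 bits of key[1], giving a value in [0, 1023].
int bucketOf(const Record& r) {
    uint32_t b0 = static_cast<unsigned char>(r.key[0]);
    uint32_t b1 = static_cast<unsigned char>(r.key[1]);
    return static_cast<int>((b0 << 2) | (b1 >> 6));
}

int main() {
    Record r = {};
    r.key[0] = static_cast<char>(0xAB);
    r.key[1] = static_cast<char>(0xCD);
    std::printf("record goes to bucket %d of 1024\n", bucketOf(r));
    return 0;
}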
Performance Results: TeraSort
Run time in seconds; Sector v1.16 vs. Hadoop 0.17.
Racks                          | Data Size | Sphere | Hadoop (3 replicas) | Hadoop (1 replica)
UIC                            | 300GB     |        |                     |
UIC + StarLight                | 600GB     |        |                     |
UIC + StarLight + Calit2       | 900GB     |        |                     |
UIC + StarLight + Calit2 + JHU | 1.2TB     |        |                     |
Performance Results: TeraSort
Sorting 1.2TB on 120 nodes.
Hash stage vs. local sort stage: 981 sec vs. 545 sec.
Hash stage: per rack 220GB in/out; per node 10GB in/out; CPU 130%; memory 900MB.
Local sort stage: no network IO; CPU 80%; memory 1.4GB.
Hadoop: CPU 150%; memory 2GB.
CreditStone [Record layout: pipe-delimited text records with fields Trans ID|Time|Merchant ID|Fraud|Amount (e.g. a record ending in |0|66.49); key: merchant ID and time; value: the transformed record. Stage 1: process each text record and hash it into a bucket according to merchant ID (merch-000 … merch-999). Stage 2: compute the fraud rate for each merchant.]
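As a concrete sketch of stage 2, the snippet below parses pipe-delimited records in the Trans ID|Time|Merchant ID|Fraud|Amount layout and computes a per-merchant fraud rate. The sample records and all names are invented for illustration; this is not the benchmark's actual code.

// Illustrative stage-2 computation for CreditStone: aggregate the fraud
// flag per merchant over the records hashed into a bucket.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Records that stage 1 would have placed in this node's bucket(s).
    std::vector<std::string> bucket = {
        "1001|2009-01-01|merch-042|0|66.49",
        "1002|2009-01-02|merch-042|1|12.00",
        "1003|2009-01-02|merch-007|0|5.25",
    };

    // merchant -> (fraudulent transactions, total transactions)
    std::map<std::string, std::pair<long, long> > counts;
    for (const std::string& line : bucket) {
        std::stringstream ss(line);
        std::string transId, time, merchant, fraud, amount;
        std::getline(ss, transId, '|');
        std::getline(ss, time, '|');
        std::getline(ss, merchant, '|');
        std::getline(ss, fraud, '|');
        std::getline(ss, amount, '|');
        counts[merchant].second += 1;
        if (fraud == "1") counts[merchant].first += 1;
    }

    for (const auto& kv : counts)
        std::cout << kv.first << " fraud rate: "
                  << double(kv.second.first) / kv.second.second << "\n";
    return 0;
}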
Performance Results: CreditStone
Racks                   | JHU | JHU, SL | JHU, SL, Calit2 | JHU, SL, Calit2, UIC
Number of Nodes         |     |         |                 |
Size of Dataset (GB)    |     |         |                 |
Size of Dataset (rows)  | 15B | 29.5B   | 44.5B           | 58.5B
Hadoop (min)            |     |         |                 |
Sector with Index (min) |     |         |                 |
Sector w/o Index (min)  |     |         |                 |
* Courtesy of Jonathan Seidman of Open Data Group.
System Monitoring (Testbed)
System Monitoring (Sector/Sphere)
Future Work High availability (multiple master servers); scheduling; optimize the data channel; enhance the compute model and fault tolerance.
For More Information Sector/Sphere code & docs: Open Cloud Consortium: NCDM:
Inverted Index [Stage 1: process each HTML page and hash each (word, page_id) pair into a bucket according to the word's first letter (Bucket-A … Bucket-Z). Stage 2: sort each bucket on its local node and merge the page lists for the same word (e.g. word_z -> 1, 5, 10).]
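The two stages above can be sketched in a few lines: hash each (word, page_id) pair to a bucket by the word's first letter, then sort each bucket and merge the page lists per word. The sample data and names below are illustrative only.

// Sketch of the inverted-index example: stage 1 buckets (word, page_id)
// pairs by first letter; stage 2 sorts a bucket and merges page ids per word.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // (word, page_id) pairs extracted from the HTML pages in stage 1.
    std::vector<std::pair<std::string, int> > pairs = {
        {"word_z", 1}, {"word_x", 1}, {"word_y", 1},
        {"word_z", 5}, {"word_z", 10},
    };

    // Stage 1: hash by the word's first letter into buckets A..Z
    // (each bucket would live on a different destination node).
    std::map<char, std::vector<std::pair<std::string, int> > > buckets;
    for (const auto& p : pairs) {
        char b = static_cast<char>(std::toupper(static_cast<unsigned char>(p.first[0])));
        buckets[b].push_back(p);
    }

    // Stage 2: on each node, sort its bucket and merge page ids per word.
    for (auto& b : buckets) {
        std::sort(b.second.begin(), b.second.end());
        std::map<std::string, std::vector<int> > index;
        for (const auto& p : b.second)
            index[p.first].push_back(p.second);
        for (const auto& e : index) {
            std::cout << e.first << " -> ";
            for (int id : e.second) std::cout << id << " ";
            std::cout << "\n";
        }
    }
    return 0;
}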