Nectar: Efficient Management of Computation and Data in Data Centers
Lenin Ravindranath
Pradeep Kumar Gunda, Chandu Thekkath, Yuan Yu, Li Zhuang
Motivation
Resources are poorly managed in a data center
Computation: Redundant computations
– Wasting resources
Storage: Manually managed
– Unused files occupying space
– Redundant output files
Goal Efficiently manage resources in a cluster Computation Storage Nectar
Key Insight
Single query interface for computation and data access
[Diagram: User, DryadLINQ Query Interface, Data Center (Computation, Storage)]
Computation
PROBLEM: Redundant Computation
– Programs share sub queries: X.Select(…) and X.Select(…).Where(…)
– Programs share partial data sets: X.Select(…) and (X+X’).Select(…)
SOLUTION: Caching
– Cache results of popular sub queries
– Automatically rewrite the user query to use the cache
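The caching idea on this slide can be sketched in a few lines. This is a minimal illustration, not Nectar's implementation: the `fingerprint` function, the `ResultCache` class, and the query strings are all hypothetical, and plain Python lists stand in for DryadLINQ data sets.

```python
import hashlib

def fingerprint(query: str, data_id: str) -> str:
    """Hypothetical fingerprint: hash of the query text plus an input-data id."""
    return hashlib.sha256(f"{query}|{data_id}".encode()).hexdigest()

class ResultCache:
    """Toy in-memory cache keyed by fingerprint (assumption, not Nectar's design)."""
    def __init__(self):
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, query, data_id, compute):
        key = fingerprint(query, data_id)
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        result = compute()
        self._entries[key] = result
        return result

X = [1, 2, 3, 4]

def select_double():
    # Stands in for X.Select(...)
    return [x * 2 for x in X]

cache = ResultCache()

# Program 1: X.Select(...)
r1 = cache.get_or_compute("Select(double)", "X", select_double)

# Program 2: X.Select(...).Where(...) reuses the cached Select result
selected = cache.get_or_compute("Select(double)", "X", select_double)
r2 = [x for x in selected if x > 4]
```

The second program pays only for the Where step, because the shared Select sub query is served from the cache.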
Does caching help?
– Analyzed logs from production clusters
– Logs of 3 months (Oct – Dec 2008)
– 33 virtual clusters, jobs
– Parsed SCOPE programs, extracted sub queries
– Simulated caching
Caching helps
– About 50% cache hit on 10 clusters
– More than 30% cache hit on 20 clusters
– 35% on average
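A cache simulation over a job log could look roughly like the sketch below. It assumes an unbounded cache and a log that has already been reduced to one fingerprint per sub query; the function name and the sample log are made up for illustration and are not the data behind the numbers above.

```python
def simulate_cache_hits(subquery_log):
    """Replay sub-query fingerprints in order; a repeat counts as a cache hit."""
    seen = set()
    hits = 0
    for fp in subquery_log:
        if fp in seen:
            hits += 1
        else:
            seen.add(fp)
    return hits / len(subquery_log) if subquery_log else 0.0

# Made-up log of sub-query fingerprints (illustrative only)
log = ["a", "b", "a", "c", "a", "b"]
hit_rate = simulate_cache_hits(log)
```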
Storage
PROBLEM: Manually managed
– Unused files occupying space
– 50% of the data was not accessed in the last 275 days
Storage
SOLUTION: Automatically manage data
– Track usage and delete infrequently used files
– Store the programs that recompute the data
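Usage tracking can be illustrated with a small sketch. The `FileUsage` record, the day counter, and the idle threshold are assumptions chosen for the example; the slides do not specify how Nectar records usage.

```python
from dataclasses import dataclass

@dataclass
class FileUsage:
    path: str
    last_access_day: int  # hypothetical day counter

def select_for_deletion(files, today, max_idle_days):
    """Return paths of files not accessed within the last max_idle_days."""
    return [f.path for f in files if today - f.last_access_day > max_idle_days]

files = [
    FileUsage("a.pt", last_access_day=10),
    FileUsage("b.pt", last_access_day=290),
    FileUsage("c.pt", last_access_day=300),
]
stale = select_for_deletion(files, today=300, max_idle_days=275)
```

Because the generating program is kept, deleting a stale file is safe: it can be recomputed if someone asks for it later.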
Query Interface
[Diagram: User, DryadLINQ Query Interface, Data Center (Computation, Storage)]
[Diagram: User, DryadLINQ Query Interface, Nectar, Data Center (Computation, Storage)]
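Nectar Architecture
[Diagram: a DryadLINQ program P enters the Nectar Client, where the Query Rewriter queries the Cache Server and receives cache entries, producing P’. P’ runs through DryadLINQ and Dryad on the Cluster. Results R and T are added to the cache ("Add R to cache", "Add T to cache").]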
Nectar Architecture Query Rewriter Nectar Client Cache Server
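Query Rewriter
[Diagram: Select over X produces R, which is stored in the Cache. For the input X+X’, Select runs over X’ only, and its result is combined with the cached R: Concat (R+R’).]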
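Query Rewriter
[Diagram: reuse under Order by. Select over X produces R, which is stored in the Cache. For the input X+X’, Select runs over X’ only, and its result is combined with the cached R: Merge Sort (R+R’).]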
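The two rewrites in these diagrams can be checked on toy data. In this sketch, Python lists stand in for data sets, `select` is a hypothetical stand-in for a Select operator, and `heapq.merge` plays the role of the merge step for sorted results.

```python
import heapq

def select(rows):
    # Stands in for Select(...): any per-row transformation
    return [r * r for r in rows]

X = [3, 1, 2]        # data seen by an earlier run
X_new = [5, 4]       # newly appended data, X' on the slide

# Plain Select: cached R plus the result over X' only
R = select(X)                    # assume this is already in the cache
R_new = select(X_new)            # only the new data is processed
concat_result = R + R_new        # Concat (R+R')

# Select + Order by: merge two sorted results instead of re-sorting everything
R_sorted = sorted(select(X))     # assume this is already in the cache
R_new_sorted = sorted(select(X_new))
merged_result = list(heapq.merge(R_sorted, R_new_sorted))  # Merge Sort (R+R')
```

In both cases the result equals what running the full query over X+X’ would produce, while only X’ is actually processed.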
Query Rewriter
Generates multiple plans
– Using multiple cache entries
Selects the best plan
– Based on benefit: execution time, output size, whether the pipeline is broken
Operators supported
– Select, Where, Order by, Group by, Join
Example: X.Select(…), X.Select(…).Where(…)
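Plan selection by benefit might be sketched as follows. The slide names the inputs (execution time, output size, whether the pipeline is broken) but not how they are combined, so the `benefit` formula, its constants, and the `Plan` fields here are purely illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    saved_exec_seconds: float   # execution time avoided by using the cache
    cached_output_bytes: int    # size of the cached output that must be read
    breaks_pipeline: bool       # whether using the cache breaks a pipeline

def benefit(plan, read_bytes_per_second=100e6, pipeline_penalty_seconds=5.0):
    """Hypothetical score: time saved minus time to read the cache and a penalty."""
    read_cost = plan.cached_output_bytes / read_bytes_per_second
    penalty = pipeline_penalty_seconds if plan.breaks_pipeline else 0.0
    return plan.saved_exec_seconds - read_cost - penalty

plans = [
    Plan("no cache", 0.0, 0, False),
    Plan("reuse Select", 60.0, 1_000_000_000, False),
    Plan("reuse Select+Where", 70.0, 200_000_000, True),
]
best = max(plans, key=benefit)
```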
Cache Server
[Diagram: the Cache Server contains a SQL Server, a Garbage Collector, and a Cache Policy. Labels: URI, Query Fingerprint, Query + Data Fingerprint, Execution Time, Output Size, Inquire Stats, Usage Stats, Fingerprints.]
Cache policy
Insertion Policy
– Always add program output to cache
– Sub query outputs are added to cache based on two criteria: popularity exceeds a threshold; savings exceeds a threshold
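The insertion policy can be expressed as a small predicate. The slide does not say whether the two sub-query criteria must both hold, so this sketch assumes they are combined with "and"; the function name and default threshold values are hypothetical.

```python
def should_cache(is_program_output, popularity, savings,
                 popularity_threshold=3, savings_threshold=10.0):
    """Decide whether an output is inserted into the cache.

    Assumption: sub-query outputs need BOTH popularity and savings above
    their thresholds. The thresholds are illustrative values only.
    """
    if is_program_output:
        return True  # final program output is always added
    return popularity > popularity_threshold and savings > savings_threshold

decisions = [
    should_cache(True, popularity=0, savings=0.0),
    should_cache(False, popularity=5, savings=20.0),
    should_cache(False, popularity=5, savings=1.0),
    should_cache(False, popularity=1, savings=20.0),
]
```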
Garbage Collector
Storage pressure
– Delete infrequently used files
Deletion policy
– Based on savings
– Cache type
Mark and sweep algorithm
– Delete cache entry
– Reachability analysis
– Delete files
[Diagram: steps 1 and 2 across the Cache Server and the Distributed FS]
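The mark and sweep steps can be sketched on in-memory structures. Here a dict from fingerprint to file path stands in for the cache server, and a set of paths stands in for the distributed file system; both, along with the function name, are assumptions made for the example.

```python
def collect_garbage(cache_entries, files, entries_to_evict):
    """Toy mark and sweep over a fingerprint -> file path mapping.

    cache_entries: dict of fingerprint -> path (stand-in for the cache server)
    files: set of paths (stand-in for the distributed FS)
    """
    # Step 1: delete the chosen cache entries
    for fp in entries_to_evict:
        cache_entries.pop(fp, None)

    # Step 2 (mark): files still referenced by a cache entry are reachable
    reachable = set(cache_entries.values())

    # Step 2 (sweep): delete every file that is no longer reachable
    deleted = sorted(f for f in files if f not in reachable)
    files.difference_update(deleted)
    return deleted

# File names are illustrative only
cache_entries = {"fp1": "A31E4.pt", "fp2": "KJ1LM.pt"}
files = {"A31E4.pt", "KJ1LM.pt", "orphan.pt"}
deleted = collect_garbage(cache_entries, files, entries_to_evict=["fp1"])
```

Deleting the cache entry first means no new query can be rewritten to use a file that is about to be swept.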
What if I try to access a garbage collected file?
Nectar Architecture Query Rewriter Nectar Client Cache Server Program store
Program Store
– Stores the programs executed in the cluster
– Each output file is tied to the program that generates it
– If a file is deleted, the program is re-executed to regenerate the output
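Regeneration on demand can be sketched as below. The `ProgramStore` class, its dict-based "file system", and the use of Python callables as stored programs are all hypothetical simplifications of what the slide describes.

```python
class ProgramStore:
    """Toy program store: remembers which program produced each output file."""
    def __init__(self):
        self.programs = {}  # output path -> zero-argument callable
        self.files = {}     # stand-in for the distributed FS

    def run(self, path, program):
        """Execute a program, store its output, and remember the program."""
        self.programs[path] = program
        self.files[path] = program()
        return self.files[path]

    def read(self, path):
        """Return the file, re-running its program if it was garbage collected."""
        if path not in self.files:
            self.files[path] = self.programs[path]()
        return self.files[path]

runs = []

def program():
    runs.append(1)          # count executions for the example
    return [1, 2, 3]

store = ProgramStore()
store.run("foo.pt", program)
del store.files["foo.pt"]   # simulate garbage collection of the output
data = store.read("foo.pt") # transparently regenerated
```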
Managing Data
[Diagram: program P calls ToPartitionedTable (lenin\foo.pt). Labels: Nectar Client, P’, DryadLINQ, Dryad, usrNectar, Distributed FS, foo.pt, A31E4.pt, Cache Server, FP, Program Store, FP, Program.]
Managing Data
[Diagram: program P calls FromPartitionedTable (lenin\foo.pt). Labels: Nectar Client, DryadLINQ, Dryad, usrNectar, Distributed FS, foo.pt, A31E4.pt, Cache Server, FP, Program Store, FP, Program.]
Managing Data
[Diagram: program P calls FromPartitionedTable (lenin\foo.pt). Labels: Nectar Client, DryadLINQ, Dryad, usrNectar, Distributed FS, foo.pt, A31E4.pt, KJ1LM.pt, Cache Server, FP, Program Store, FP, Program, Program.]
Goal Efficiently manage resources in a cluster Computation Storage Nectar Computation Storage Unified computation and data
Distributed cache servers
– Cache servers, each with a SQL Server, partitioned by query fingerprint
– Nectar Client selects a cache server by a hash of the query fingerprint
– Centralized garbage collector
– Program store
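Partitioning by fingerprint can be sketched with a simple hash-and-modulo scheme. The slide only says that the hash is based on the query fingerprint; the choice of SHA-256, the modulo mapping, and the server names are assumptions for the example.

```python
import hashlib

def pick_server(query_fingerprint: str, servers):
    """Map a query fingerprint to one cache server (hypothetical scheme)."""
    digest = hashlib.sha256(query_fingerprint.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

servers = ["cache-0", "cache-1", "cache-2"]   # made-up server names
first = pick_server("fp-select-X", servers)
second = pick_server("fp-select-X", servers)  # same fingerprint, same server
```

Since the mapping depends only on the fingerprint, every client that issues the same sub query consults the same cache server.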
Summary
We built Nectar
– Automatically manages data
– Efficiently manages computation
Components
– Query Rewriter: automatically rewrites queries to use the cache
– Cache Server: popular sub queries are cached; entries are garbage collected based on usage
– Program Store: stores the programs that regenerate the output
Status
Almost done with development
– Query Rewriter (including other operators)
– Fingerprinter (program static analysis)
– Cache Server
– Program Store
In the process of deploying
Can we do better?
Cluster Utilization
– Most clusters have more than 40% idle time
– Even the busiest clusters have 10-20% idle time
Exploiting idle time
Do speculative caching
– Cache popular data before the query is issued
– Run programs on new streams when available
No side effects
– Executed only when the cluster is idle
– Low priority jobs
– Output garbage collected with high priority
– More electric bill? Not really!
Questions
Backup
Caching Results