GreenHadoop: Leveraging Green Energy in Data-Processing Frameworks Íñigo Goiri, Kien Le, Thu D. Nguyen, Jordi Guitart, Jordi Torres, and Ricardo Bianchini.

Motivation
Datacenters consume large amounts of energy
Energy cost is not the only problem
– Brown sources: coal, natural gas…
Connect datacenters to green sources
– Solar panels, wind turbines…
– Green datacenter
– Early examples in the field

Green datacenter
Energy sources
– Solar/wind: variable over time
– Electrical grid: backup
Mitigation approaches are not ideal
– Batteries and net metering
We need to match the energy demand to the supply
[Figure: power over time, showing load, solar power, and workload]

Delaying load within time bounds
Delaying some jobs is OK (respecting time bounds)
[Figure: two panels of nodes and power over time, showing jobs J1, J2, and J3]

Scheduling data-processing workloads in green datacenters
Data-processing jobs
– Each task operates on a chunk of data
– Data distributed among servers
Simple workflow: MapReduce
– Map tasks: process input data
– Reduce tasks: merge maps’ outputs
Challenges
Match MapReduce workload with green energy availability
– No information on #nodes, length, power…
Conserve energy while ensuring data availability
[Figure: MapReduce workflow with map tasks 1 to 5, a shuffle phase, and reduce tasks 6 and 7]

Overview of GreenHadoop
Predict solar energy availability
May delay jobs but must meet time bounds
– Maximize green energy use
– If not enough green energy, minimize brown electricity cost
– Brown energy cost + peak brown power cost
Deactivate idle servers while keeping data available
Divided into two parts
1. Computation scheduling
2. Data management

1. Computation scheduling
Estimate the energy required by jobs (EWMA)
[Figure: queue of jobs Job1 to Job6]
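The slide names EWMA (exponentially weighted moving average) as the estimator for the energy jobs require. A minimal sketch of such an estimator follows. The function name, the smoothing factor of 0.5, and the sample observations in Wh are all illustrative assumptions; the talk does not give them.

```python
def ewma_update(previous_estimate, observed, alpha=0.5):
    """Blend a new observation into the running estimate.

    alpha is a hypothetical smoothing factor: higher values weight
    recent observations more heavily.
    """
    if previous_estimate is None:
        # First observation: nothing to smooth against yet.
        return observed
    return alpha * observed + (1 - alpha) * previous_estimate


# Hypothetical per-job energy observations, in Wh.
estimate = None
for observed_wh in [100.0, 120.0, 80.0]:
    estimate = ewma_update(estimate, observed_wh)

print(estimate)
```

Each completed job refines the estimate that is used when planning the jobs still in the queue.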

1. Computation scheduling
Predict energy availability (weather forecast)
Assign green energy first
[Figure: power over time starting from now, with on-peak and off-peak periods, for the queue of jobs Job1 to Job6]

1. Computation scheduling
Assign cheap brown energy
[Figure: power over time starting from now, with on-peak and off-peak periods and the previous peak marked]

1. Computation scheduling
Assign expensive energy
Current power → Active servers
[Figure: power over time starting from now, with on-peak and off-peak periods and the active servers marked]
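The three assignment passes (green first, then cheap brown, then expensive brown) can be sketched as a greedy allocation over time slots. This is a simplified illustration, not the scheduler from the talk: the `Slot` type, the per-slot capacity, and the sample numbers are assumptions, and the peak-power charge is ignored here.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    green_wh: float     # predicted green energy available in this slot
    brown_price: float  # brown energy price in this slot, $/kWh
    capacity_wh: float  # most energy the cluster can consume in this slot


def assign_energy(required_wh, slots):
    """Return ([green_wh, brown_wh] per slot, unmet energy)."""
    plan = [[0.0, 0.0] for _ in slots]
    remaining = required_wh
    # Pass 1: use green energy, earliest slot first.
    for i, slot in enumerate(slots):
        use = min(remaining, slot.green_wh, slot.capacity_wh)
        plan[i][0] = use
        remaining -= use
    # Pass 2: use brown energy, cheapest slots first (ties: earliest).
    order = sorted(range(len(slots)), key=lambda j: (slots[j].brown_price, j))
    for i in order:
        room = slots[i].capacity_wh - plan[i][0]
        use = min(remaining, room)
        plan[i][1] = use
        remaining -= use
    return plan, remaining


# Hypothetical horizon: one off-peak slot, then two on-peak slots.
slots = [
    Slot(green_wh=0.0, brown_price=0.08, capacity_wh=1000.0),
    Slot(green_wh=600.0, brown_price=0.13, capacity_wh=1000.0),
    Slot(green_wh=200.0, brown_price=0.13, capacity_wh=1000.0),
]
plan, unmet = assign_energy(1500.0, slots)
print(plan, unmet)
```

Sorting brown slots by price means cheap off-peak energy is exhausted before any expensive on-peak energy is scheduled. Any unmet energy signals that the time bound cannot be satisfied within the horizon.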

1. Computation scheduling
As time goes by… the number of active servers changes
[Figure: power and active servers over time, starting from now]
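Turning the power assigned to a time slot into a number of active servers can be sketched as a ceiling division capped at the cluster size. The per-server power figure of 150 W and the sample power values are hypothetical; the cluster size of 16 comes from the evaluation slide.

```python
import math


def active_servers(power_w, per_server_w, total_servers):
    """Servers needed to consume power_w, capped at the cluster size."""
    return min(total_servers, math.ceil(power_w / per_server_w))


# Hypothetical power assigned to three consecutive slots, in watts.
assigned_power_w = [450.0, 1200.0, 5000.0]
counts = [active_servers(p, 150.0, 16) for p in assigned_power_w]
print(counts)
```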

2. Data management
Deactivate servers to save energy
– Some data might become unavailable
Prior solution: covering subset [Leverich’09]
– Set of servers always running has ALL data
Our approach
– Only required data has to be available
– We usually require fewer active servers
[Figure: file placement across servers, with the covering subset marked]

2. Data management
[Figure: five servers, each marked Active, Decommission, or Down, holding required and non-required files. Server 1: files 1, 7, 2. Server 2: files 4, 3, 5, 6. Server 3: files 4, 6. Server 4: files 2, 3, 8, 4. Server 5: files 3, 6, 7. Running queue: JobA (file 4), JobB (file 5), JobC (files 1, 6)]

2. Data management
GreenHadoop (computation) requires only 2 servers
[Figure: same file placement and running queue]

2. Data management
Move required files to Active servers
[Figure: file 1 is replicated to Server 3]

2. Data management
Decommissioned server can be sent to Down
[Figure: same file placement, with file 1 now also on Server 3]

2. Data management
Jobs to be executed change → Required files change
[Figure: JobD (file 8) joins the running queue]

2. Data management
Make missing data available
[Figure: running queue is now JobB (file 5), JobC (file 1), JobD (file 8)]

2. Data management
GreenHadoop (computation) requires 3 servers
[Figure: running queue is JobB (file 5), JobC (file 1), JobD (file 8)]
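The rule that only required data has to be available can be sketched as a set difference between the files the queued jobs need and the files held by active servers. The placement and queue below are read from the flattened figure, and the choice of Servers 2 and 3 as the active pair is an interpretation of it. The function name is hypothetical.

```python
# File placement as read from the slides (an interpretation of the figure).
placement = {
    "server1": {1, 7, 2},
    "server2": {4, 3, 5, 6},
    "server3": {4, 6},
    "server4": {2, 3, 8, 4},
    "server5": {3, 6, 7},
}
queue = {"JobA": {4}, "JobB": {5}, "JobC": {1, 6}}
active = {"server2", "server3"}  # assumed active pair


def missing_required_files(placement, active, queue):
    """Files that queued jobs need but no active server holds."""
    required = set().union(*queue.values())
    available = set().union(*(placement[s] for s in active))
    return required - available


before = missing_required_files(placement, active, queue)

# Replicate the missing file to an active server, as in the slides.
placement["server3"].add(1)
after = missing_required_files(placement, active, queue)

# A new job arrives, so the required set changes.
queue["JobD"] = {8}
after_new_job = missing_required_files(placement, active, queue)
print(before, after, after_new_job)
```

Once the missing set is empty, the remaining servers hold no data that a queued job needs and can be powered down. A new job can make the set non-empty again, which is when more data has to be made available.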

Evaluation methodology
Cluster with 16 Xeon servers
– Hadoop and Hadoop turning off idle servers (EAHadoop)
– GreenHadoop: green energy, brown electricity cost
Energy profile
– NJ electricity pricing (on/off peak and peak cost)
– Solar farm energy availability (14 PV panels)
– Five pairs of days (combinations of high and low days)
Workload
– Derived from Facebook [Zaharia’09]
– Jobs with up to 37GB, 600 tasks, and 6 hours in length
– Internal time bound of one day

Energy prediction vs actual
[Figure: predicted versus actual energy, annotated with rain, thunderstorm, and cloud cover]

GreenHadoop for Facebook & high-high days
31% more green, 39% cost savings
[Figure: green consumed, brown consumed, brown price, green predicted, and green produced; annotations in order: 30 kWh, 59 kWh, $8.00, 39 kWh, 25 kWh, $6.06, -24%]

GreenHadoop for Facebook
Different pairs of days
Effect of parameters in GreenHadoop

Other results
Workload intensity (datacenter utilization)
High-priority jobs
Shorter time bounds
Data availability
Workload variations
Consistent green energy increases and cost savings

Conclusions
Data-processing scheduler for green datacenters
Predicts green energy availability
Increases the use of green energy
Reduces brown electricity costs
Manages data availability
We are building Parasol
– Solar-powered μdatacenter
– Poster session

Dealing with electricity costs
Schedule jobs: evaluate electricity cost
– Green energy is “free” (amortization): $0.00/kWh
– Cheap energy (11pm to 9am): $0.08/kWh
– Expensive energy (9am to 11pm): $0.13/kWh
– Off-peak power cost: $5.59/kW per month
– On-peak power cost: $13.61/kW per month
Optimization goal
– Minimize electricity-related costs while meeting deadlines
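The prices above combine into a brown electricity bill with two parts: an energy charge per kWh and a peak-power charge per kW. A minimal sketch follows, assuming one-hour power readings and a peak charge based on the highest reading in each period over the month. The function names, the sampling interval, and the sample readings are assumptions; the prices and the on-peak window come from the slide.

```python
CHEAP, EXPENSIVE = 0.08, 0.13          # $/kWh, from the slide
OFF_PEAK_KW, ON_PEAK_KW = 5.59, 13.61  # $/kW per month, from the slide


def is_on_peak(hour):
    """On-peak runs from 9am to 11pm per the slide."""
    return 9 <= hour < 23


def monthly_brown_cost(samples):
    """samples: (hour_of_day, brown_kw) readings, each covering one hour.

    Green energy is treated as free, so only brown draw is passed in.
    """
    energy_cost = 0.0
    peak_on = peak_off = 0.0
    for hour, kw in samples:
        if is_on_peak(hour):
            energy_cost += kw * EXPENSIVE  # kW over one hour = kWh
            peak_on = max(peak_on, kw)
        else:
            energy_cost += kw * CHEAP
            peak_off = max(peak_off, kw)
    return energy_cost + peak_on * ON_PEAK_KW + peak_off * OFF_PEAK_KW


# Hypothetical readings: 1 kW at 2am and 2 kW at 10am.
cost = monthly_brown_cost([(2, 1.0), (10, 2.0)])
print(round(cost, 2))
```

In this example the peak-power charge dominates the energy charge, which suggests why a scheduler would try to stay below the previous peak when it has to use brown energy.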

Our proposal: GreenHadoop
Predict green energy availability
– Weather forecast
Schedule jobs
– Maximize green energy use ($0/Wh)
– If green not available, consume cheap brown ($/Wh on/off-peak)
– When using brown, reduce peak power cost ($/W)
Turn off idle servers to save energy
Optimization goal
– Minimize electricity-related costs
– May delay jobs but must meet deadlines
– Guarantee data availability

Evaluation methodology
Workloads
– FaceD: GridMix derived from Facebook [Zaharia’09]
– NutchI: crawling and indexing for Rutgers webpages
Length
– Tasks from 2 to 60 seconds
– Jobs from 4 to 600 tasks
– Some jobs take up to 6 hours using our whole cluster
Data
– Files distributed in blocks of 64MB
– Minimum of 2 replicas per block
– Jobs use from 64MB to 37.50GB
Default deadline of one day

Green datacenter
Energy sources
– Solar/wind: variable availability over time
– Electrical grid: backup
Other (problematic) approaches
– Batteries: losses, cost, environmental
– Bank energy on the grid: losses, cost, unavailability
[Figure: wind power and solar power over time]

1. Computation scheduling
1. Estimate energy required by jobs
2. Predict energy availability (weather forecast)
3. Schedule energy to minimize electricity costs
   1. Assign green energy ($0/Wh)
   2. Assign brown energy
      – Cheap energy cost ($/Wh)
      – Expensive energy cost ($/Wh)
      – Peak-power cost ($/W)
4. Calculate current number of Active servers
5. Perform “2. Data management”
6. Submit jobs for execution
7. Send non-required servers to the S3 sleep state to save energy

2. Data management
We want to deactivate servers to save energy
– Data is distributed among servers
– Some data might not be available
Common solution: covering subset [Leverich’09]
– ALL data must always be available
– Minimum set of servers always running
Our approach
– Jobs running change → Required data change
– Only required data has to be available
– Move required data to Active servers
– Decommission servers: provide data

Other results
Workload intensity (datacenter utilization)
– Works well with low/medium utilization
– Similar to conventional under high utilization
High-priority jobs
– No performance degradation for high-priority jobs
– A large number of high-priority jobs reduces our benefits
Shorter time bounds
– 19% violations under really tight time bounds
Data availability
– Savings equal to or higher than the covering subset
Workload variations
– Nutch web-crawling and indexing
– Consistent green energy increases and cost savings

Motivation
Datacenters consume large amounts of energy
Energy cost is not the only problem
– Brown sources: coal, natural gas…
Lots of small and medium datacenters
– Consume the majority of electricity in DCs
Connect datacenters to green sources
– Solar panels, wind turbines…
– Green datacenter

Delaying load within time bounds
Delaying some jobs is OK (respecting time bounds)
[Figure: two panels of nodes and power over time, showing jobs J1, J2, and J3, with the current time marked]
