A case for resource discovery in shared distributed platforms
David Oppenheimer
UCB ROC Retreat, 12 January 2005

Introduction
Application performance is a function of
1. resources available to the application
2. resources needed by the application, or “application sensitivity to resource constraints”
At the summer retreat, described SWORD
- at app deployment time, find the best set of nodes given
  1. resources available on a set of distributed nodes
  2. application sensitivity to resource constraints
- assumptions
  1. available resources vary among nodes enough to matter (spare CPU, memory, disk space; inter-node latency, available bandwidth; ...)
  2. applications are sensitive to resource constraints enough to matter
Focus of this talk: verify assumption (1)

Introduction (cont.)
Questions we will address
- Is there enough variation among nodes at any given (deployment) time to justify service placement?
- Is there enough variation over time on a single node to justify periodic task migration?
- Are there correlations between attributes on a single node, or among nodes at the same site?
All of these questions are important in designing a system for resource discovery and service placement (like SWORD).

Outline
1. How much does the available amount of per-node resources vary among nodes at a fixed time?
2. How much does the available amount of per-node resources vary over time? How much do inter-node latency and available bandwidth vary over time?
3. On a given node, are any per-node attributes strongly correlated? Are inter-node latency and available bandwidth correlated?

Experimental environment
Per-node attributes: Ganglia, CoMon
- two-week period (Oct 10-Oct 24, 2004)
- each node polled every 5 minutes
- free memory, free swap, free disk, load average, network bytes sent and received per second, number of active slices
Inter-node latency: all-pairs pings
- one-month period ending Oct 24, 2004
- each pair of nodes measured every 15 minutes
Inter-node bandwidth: Iperf
- one-month period ending Oct 24, 2004
- each pair of nodes measured 1-2 times per week
About 250 nodes in the trace each day

Resource heterogeneity: averages
How much do available resources vary over the trace?
Attribute            Mean    Std. dev.  10th %ile  90th %ile
# of CPUs            1.0     0.0        1.0
CPU speed (MHz)      1942    572        1263       2652
Total disk (GB)      127     88.5       35.1       232
Total memory (MB)    1153    467        628        2017
Total swap (GB)      1.0     0.0        1.0

Resource heterogeneity: averages
How much do available resources vary over the trace?
Attribute            Mean      Std. dev.  10th %ile  90th %ile
1 min load average   6.81      20.06      1.05       11.86
Free memory (MB)     62.359    125.234    13.668     105.432
Free swap (MB)       755.596   178.795    524.336    1000.268
Free disk (GB)       102.8     86.04      8.088      208.3
Active slices        13.3      5.96       0.0        20.0
Bytes/s in           50477     117023     5568       92877
Bytes/s out          52543     130112     5476       96214

Resource heterogeneity: CV vs. time
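Assuming "CV" on this slide means the coefficient of variation (standard deviation divided by mean) taken across nodes at each poll time, a minimal sketch of that computation could look like the following. The data layout, node names, and readings are hypothetical and are not taken from the trace.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Return std/mean for a list of readings, or None if the mean is zero."""
    m = mean(values)
    if m == 0:
        return None
    return pstdev(values) / m

def cv_over_time(samples):
    """Map {timestamp: {node: value}} to {timestamp: CV across nodes}."""
    return {
        t: coefficient_of_variation(list(by_node.values()))
        for t, by_node in sorted(samples.items())
    }

# Hypothetical free-memory readings (MB) for three nodes at two poll times.
free_mem = {
    0: {"node-a": 50.0, "node-b": 100.0, "node-c": 150.0},
    300: {"node-a": 100.0, "node-b": 100.0, "node-c": 100.0},
}
cv = cv_over_time(free_mem)
```

A CV near zero at a given time means the nodes look alike for that attribute, so placement matters little; a larger CV suggests that choosing nodes carefully could pay off.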

Variability of per-node attributes over time

Can rank the degree of variability of each attribute
- disk, swap < memory, load < network bytes; number of slices is moderate to significant
CDF curve shifts to the right as interval length increases
- attributes vary less over short time periods than over long ones
- migration interval: find the “sweet spot” in the curve of variability vs. interval length
CDF slope decreases as the median variability of an attribute increases
- may be able to classify nodes as high or low variability over time for memory, load, and network bytes (they have high median variability)
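One way to explore variability against interval length is sketched below. It assumes variability is measured as the coefficient of variation within non-overlapping windows of a node's time series; the talk does not state the exact metric, so this is an illustrative choice, and the load readings are made up.

```python
from statistics import mean, median, pstdev

def window_cvs(series, window):
    """CV of each non-overlapping window of `window` consecutive samples."""
    cvs = []
    for start in range(0, len(series) - window + 1, window):
        chunk = series[start:start + window]
        m = mean(chunk)
        if m != 0:  # skip windows where CV is undefined
            cvs.append(pstdev(chunk) / m)
    return cvs

def median_variability(series, window):
    """Median per-window CV for one node, or None if no window qualifies."""
    cvs = window_cvs(series, window)
    return median(cvs) if cvs else None

def empirical_cdf(values):
    """Return (value, rank / n) pairs in ascending order of value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

# Hypothetical load readings for one node, one sample per poll.
load = [1, 1, 2, 2, 4, 4, 8, 8]
by_interval = {w: median_variability(load, w) for w in (2, 4, 8)}
```

Collecting `median_variability` for every node at a fixed window length and passing the results to `empirical_cdf` gives one CDF curve per interval length, which can then be compared to look for a migration "sweet spot".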

Inter-node latency and BW variation over time
Most nodes have low latency (and bandwidth) variability, even over a month-long trace
- migration may not be worthwhile

Correlation among per-node attributes
No strong correlations between different attributes
- though some one-hour trace segments had some
Some correlation between nodes at the same site
r           load_one  mem_free  swap_free  disk_free  actv_slice  byte_in  byte_out
load_one     .080
mem_free    -.050      .627
swap_free   -.231      .274      .473
disk_free   -.035      .192      .212       .929
actv_slice   .079     -.050     -.219       .049       .773
byte_in      .059     -.033     -.074       .059       .140        .209
byte_out     .058     -.033     -.059       .078       .137        .443     .188
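Assuming the r values are Pearson correlation coefficients (the slide does not name the statistic), a lower-triangular matrix of pairwise correlations can be computed along these lines. The column names and sample values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length series; None if either is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)

def lower_triangle(columns):
    """Return {(row, col): r} for each pair where col precedes row."""
    names = list(columns)
    return {
        (a, b): pearson_r(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[:i]
    }

# Hypothetical per-node samples, one value per poll.
samples = {
    "disk_free": [10.0, 20.0, 30.0, 40.0],
    "swap_free": [1.0, 2.0, 3.0, 4.0],
    "load_one": [4.0, 3.0, 2.0, 1.0],
}
r = lower_triangle(samples)
```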

Correlation between latency and avail BW
Moderate inverse power-law correlation
- using latency to estimate bandwidth gives 233% error
- some nodes are bandwidth-capped, some in weird ways
Some node pairs showed strong latency-bandwidth correlation
- 17% within 25%, 56% within 50%
r = -.59
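An inverse power-law relationship between latency and bandwidth can be fitted by linear least squares in log-log space. The sketch below assumes that approach, and assumes "within 25%" refers to the fraction of estimates whose relative error is at most 25%; neither detail is stated in the talk, and the measurements are invented.

```python
from math import exp, log

def fit_power_law(latencies, bandwidths):
    """Fit bw = c * lat**k by least squares on log(bw) vs. log(lat); return (c, k)."""
    lx = [log(v) for v in latencies]
    ly = [log(v) for v in bandwidths]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    k = sxy / sxx
    c = exp(my - k * mx)
    return c, k

def fraction_within(latencies, bandwidths, c, k, tolerance):
    """Fraction of pairs whose estimated bandwidth has relative error <= tolerance."""
    hits = sum(
        1
        for lat, bw in zip(latencies, bandwidths)
        if abs(c * lat ** k - bw) / bw <= tolerance
    )
    return hits / len(latencies)

# Hypothetical measurements: latency in ms, bandwidth in arbitrary units.
lat = [10.0, 20.0, 40.0, 80.0]
bw = [100.0, 50.0, 25.0, 12.5]
c, k = fit_power_law(lat, bw)
within_25 = fraction_within(lat, bw, c, k, 0.25)
```

A negative exponent `k` corresponds to the inverse relationship: higher latency predicts lower available bandwidth. Bandwidth-capped nodes would show up as points far below the fitted curve.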

Conclusion
1. How much does the available amount of per-node resources vary among nodes at a fixed time?
   - significantly; enough to warrant service placement
2. How much does the available amount of per-node resources vary over time? How much do inter-node latency and available bandwidth vary over time?
   - moderate variability; may warrant migration
3. On a given node, are any per-node attributes strongly correlated? Are inter-node latency and available bandwidth correlated?
   - no strong correlation between different attributes
   - some correlation between the same attribute at the same site
   - latency can predict available bandwidth

Future work
Ask the same questions, but use an application model to answer them rather than analysis of raw data
- different apps have different resource sensitivities
- different apps have different migration costs
Can we predict attribute values?
- give warning before migration
- or just don’t bother to deploy on “bad” nodes
How much “better” could we do if SWORD could schedule jobs?
