Data Center Networks
Jennifer Rexford
COS 461: Computer Networks
Lectures: MW 10-10:50am in Architecture N101
http://www.cs.princeton.edu/courses/archive/spr12/cos461/
Networking Case Studies
– Data Center
– Backbone
– Enterprise
– Cellular
– Wireless
Cloud Computing
Elastic resources
– Expand and contract resources
– Pay-per-use
– Infrastructure on demand
Multi-tenancy
– Multiple independent users
– Security and resource isolation
– Amortize the cost of the (shared) infrastructure
Flexible service management
Cloud Service Models
Software as a Service (SaaS)
– Provider licenses applications to users as a service
– E.g., customer relationship management, e-mail, ...
– Avoid costs of installation, maintenance, patches, ...
Platform as a Service (PaaS)
– Provider offers a platform for building applications
– E.g., Google's App Engine
– Avoid worrying about scalability of the platform
Cloud Service Models (cont.)
Infrastructure as a Service (IaaS)
– Provider offers raw computing, storage, and network
– E.g., Amazon's Elastic Compute Cloud (EC2)
– Avoid buying servers and estimating resource needs
Enabling Technology: Virtualization
– Multiple virtual machines on one physical machine
– Applications run unmodified, as on a real machine
– A VM can migrate from one computer to another
Multi-Tier Applications
Applications consist of tasks
– Many separate components
– Running on different machines
Commodity computers
– Many general-purpose computers
– Not one big mainframe
– Easier scaling
Multi-Tier Applications
[Diagram: a front-end server fans each request out to a tree of aggregators, which in turn fan out to workers]
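The front-end/aggregator/worker structure is a partition-aggregate pattern: the front end splits a request across workers and combines their partial answers. A minimal sketch; all function names and the counting task are illustrative, not from the lecture:

```python
# Partition-aggregate sketch: a front end splits a request across
# workers and an aggregator combines the partial results.

def worker(shard, query):
    # Each worker handles one shard of the data.
    return sum(1 for item in shard if query in item)

def aggregator(shards, query):
    # Fan out to the workers, then combine (here: add up) partial counts.
    return sum(worker(shard, query) for shard in shards)

def front_end(data, num_workers, query):
    # Partition the data across workers, then aggregate.
    shards = [data[i::num_workers] for i in range(num_workers)]
    return aggregator(shards, query)

docs = ["cloud computing", "data center", "cloud storage", "backbone"]
print(front_end(docs, 2, "cloud"))  # counts documents containing "cloud": 2
```

In a real deployment each worker runs on its own machine and the fan-out happens over the network, which is what makes the tail latency of individual workers matter.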
Data Center Network
Virtual Switch in Server
Top-of-Rack Architecture
Rack of servers
– Commodity servers
– And a top-of-rack switch
Modular design
– Preconfigured racks
– Power, network, and storage cabling
Aggregate to the Next Level
Modularity, Modularity, Modularity
– Containers
– Many containers
Data Center Network Topology
[Diagram: a tree topology connecting the Internet through core routers (CR) and access routers (AR) down to Ethernet switches (S) and racks of application servers (A); roughly 1,000 servers per pod]
Key: CR = Core Router, AR = Access Router, S = Ethernet Switch, A = Rack of app. servers
Capacity Mismatch
[Diagram: the same tree topology, annotated with oversubscription ratios that grow toward the top of the tree: roughly 5:1 near the racks, rising to roughly 40:1 and roughly 200:1 at higher levels]
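An oversubscription ratio is just the aggregate capacity entering a layer divided by the capacity leaving it upward. A small sketch; the link speeds and port counts below are illustrative assumptions, not figures from the slide:

```python
def oversubscription(num_ports_down, down_gbps, num_uplinks, up_gbps):
    """Ratio of aggregate downstream capacity to uplink capacity."""
    return (num_ports_down * down_gbps) / (num_uplinks * up_gbps)

# Example (illustrative numbers): a top-of-rack switch with
# 40 x 1 Gbps server-facing ports and 2 x 10 Gbps uplinks is
# 2:1 oversubscribed -- servers can offer 40 Gbps toward a
# fabric that can only carry 20 Gbps out of the rack.
print(oversubscription(40, 1, 2, 10))  # 2.0
```

Ratios compound as traffic climbs the tree, which is why the core can end up hundreds-to-one oversubscribed even when each individual layer looks modest.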
Data-Center Routing
[Diagram: the same tree topology, split into a layer-3 region (Internet, core routers, access routers) and a layer-2 region (Ethernet switches and racks); each pod of ~1,000 servers forms one IP subnet]
Key: CR = Core Router (L3), AR = Access Router (L3), S = Ethernet Switch (L2), A = Rack of app. servers
Reminder: Layer 2 vs. Layer 3
Ethernet switching (layer 2)
– Cheaper switch equipment
– Fixed addresses and auto-configuration
– Seamless mobility, migration, and failover
IP routing (layer 3)
– Scalability through hierarchical addressing
– Efficiency through shortest-path routing
– Multipath routing through equal-cost multipath (ECMP)
So, as in enterprises...
– Connect layer-2 islands by IP routers
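Equal-cost multipath routers typically choose among next hops by hashing a flow's 5-tuple, so every packet of one flow follows the same path (avoiding reordering) while different flows spread across the paths. A sketch of that idea; the hash choice and names are assumptions:

```python
import hashlib

def ecmp_next_hop(next_hops, src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    # Hash the flow's 5-tuple so all packets of a flow take the same
    # path, while different flows spread across equal-cost next hops.
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

hops = ["ar1", "ar2"]
path = ecmp_next_hop(hops, "10.0.0.1", "10.0.1.2", 4321, 80)
# The same flow always maps to the same next hop:
assert path == ecmp_next_hop(hops, "10.0.0.1", "10.0.1.2", 4321, 80)
```

Real switches use hardware hash functions rather than SHA-256, but the load-spreading property is the same.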
Case Study: Performance Diagnosis in Data Centers
http://www.eecs.berkeley.edu/~minlanyu/writeup/nsdi11.pdf
Applications Inside Data Centers
[Diagram: requests flow from a front-end server through aggregators to workers]
Challenges of Datacenter Diagnosis
Multi-tier applications
– Hundreds of application components
– Tens of thousands of servers
Evolving applications
– Add new features, fix bugs
– Change components while the app is still in operation
Human factors
– Developers may not understand the network well
– Nagle's algorithm, delayed ACK, etc.
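Nagle's algorithm, mentioned above, holds back small writes until earlier data is acknowledged; latency-sensitive applications commonly disable it with the standard TCP_NODELAY socket option. A best-effort sketch:

```python
import socket

def disable_nagle(sock):
    """Best-effort: disable Nagle's algorithm on a TCP socket so small
    writes go out immediately instead of waiting for outstanding ACKs."""
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        return True
    except OSError:
        return False

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(disable_nagle(s))  # True on platforms that support TCP_NODELAY
s.close()
```

Disabling Nagle trades a little extra bandwidth (more small packets) for lower latency, which is usually the right trade inside a data center.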
Diagnosing in Today's Data Center
[Diagram: a host running an app on the OS, with a packet sniffer attached]
App logs: #Reqs/sec, response time (e.g., 1% of requests see >200 ms delay)
Switch logs: #bytes/#pkts per minute
Packet trace: filter out the trace for long-delay requests
SNAP: diagnose network-application interactions
Problems of Different Logs
App logs: application-specific
Switch logs: too coarse-grained
Packet traces: too expensive
SNAP: generic, fine-grained, and lightweight; runs everywhere, all the time
TCP Statistics
Instantaneous snapshots
– #Bytes in the send buffer
– Congestion window size, receiver window size
– Snapshots based on random sampling
Cumulative counters
– #FastRetrans, #Timeout
– RTT estimation: #SampleRTT, #SumRTT
– RwinLimitTime
– Calculate the difference between two polls
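Cumulative counters only become meaningful as differences between successive polls; for example, the average RTT over an interval is diff(SumRTT)/diff(SampleRTT). A sketch of that bookkeeping, with field names following the slide (real SNAP reads the counters from the OS TCP stack, which is stubbed out here):

```python
def interval_stats(prev, curr):
    """Derive per-interval values from two polls of cumulative counters.

    `prev` and `curr` are dicts of cumulative counters named as on the
    slide (SumRTT in ms). In SNAP these come from the OS; here they are
    supplied directly.
    """
    d = {k: curr[k] - prev[k] for k in prev}  # per-interval deltas
    samples = d["SampleRTT"]
    return {
        "new_fast_retrans": d["FastRetrans"],
        "new_timeouts": d["Timeout"],
        # Average RTT over this polling interval, if any samples landed in it.
        "avg_rtt_ms": d["SumRTT"] / samples if samples else None,
    }

poll1 = {"FastRetrans": 10, "Timeout": 2, "SumRTT": 5000, "SampleRTT": 50}
poll2 = {"FastRetrans": 13, "Timeout": 2, "SumRTT": 5900, "SampleRTT": 59}
print(interval_stats(poll1, poll2))  # 3 fast retransmits, avg RTT 100.0 ms
```

Working from deltas rather than absolute counts is what lets SNAP localize a problem to the interval in which it actually occurred.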
Identifying Performance Problems
– Sender app: no other problem detected (inferred by elimination)
– Send buffer: send buffer is almost full (sampled snapshots)
– Network: #fast retransmissions, #timeouts (direct measurement)
– Receiver: RwinLimitTime (direct measurement); delayed ACK, inferred when diff(SumRTT)/diff(SampleRTT) > MaxDelay
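The per-location rules above can be strung into a simple classifier that checks each stage of a transfer in turn. A sketch; the thresholds, field names, and rule ordering here are illustrative stand-ins, not SNAP's exact published rules:

```python
def classify(conn, max_delay_ms=10.0):
    """Classify one connection's bottleneck, in the spirit of the slide.

    `conn` holds per-interval values (deltas between polls); the 0.9
    buffer threshold and `max_delay_ms` default are assumptions.
    """
    if conn["send_buffer_bytes"] >= 0.9 * conn["send_buffer_limit"]:
        return "send buffer"     # buffer almost full (sampled snapshot)
    if conn["fast_retrans"] > 0 or conn["timeouts"] > 0:
        return "network"         # losses, measured directly
    if conn["rwin_limit_ms"] > 0:
        return "receiver window" # receiver not reading fast enough
    if conn["sample_rtt"] and conn["sum_rtt_ms"] / conn["sample_rtt"] > max_delay_ms:
        return "delayed ACK"     # inferred: diff(SumRTT)/diff(SampleRTT) > MaxDelay
    return "sender app"          # by elimination: no other problem found

conn = {"send_buffer_bytes": 100, "send_buffer_limit": 65536,
        "fast_retrans": 0, "timeouts": 0, "rwin_limit_ms": 0,
        "sum_rtt_ms": 4000, "sample_rtt": 20}
print(classify(conn))  # avg RTT 200 ms >> 10 ms threshold -> "delayed ACK"
```

The "by elimination" final case mirrors the slide: if nothing in the stack or network is implicated, the sender application itself is the bottleneck.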
SNAP Architecture
At each host, for every connection: collect data
– Direct access to the OS
– Poll per-connection statistics: snapshots (#bytes in send buffer) and cumulative counters (#FastRetrans)
– Adaptive tuning of the polling rate
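The "adaptive tuning of the polling rate" can be approximated by polling busy connections more often and backing off on idle ones. This exponential scheme is my own illustrative assumption, not SNAP's published algorithm:

```python
def next_poll_interval(current_s, activity_seen,
                       min_s=0.1, max_s=30.0, factor=2.0):
    # Poll more often when the last poll saw activity on the
    # connection; back off exponentially on idle connections to keep
    # per-host overhead low. All parameters are illustrative.
    if activity_seen:
        return max(min_s, current_s / factor)
    return min(max_s, current_s * factor)

interval = 1.0
interval = next_poll_interval(interval, activity_seen=True)   # 0.5
interval = next_poll_interval(interval, activity_seen=False)  # 1.0
```

The clamps matter: a floor keeps a hot connection from consuming unbounded CPU, and a ceiling guarantees even idle connections are checked occasionally.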
SNAP Architecture (cont.)
Performance classifier: classify based on the life of the data transfer
– Algorithms for detecting performance problems
– Based on direct measurement in the OS
SNAP Architecture (cont.)
Cross-connection correlation: direct access to data-center configuration
– Input: topology, routing information, and the mapping from connections to processes/apps
– Correlate problems across connections sharing the same switch/link or app code
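Cross-connection correlation amounts to grouping flagged connections by the resources they share (a switch, a link, or application code) and looking for resources where problems cluster. A hedged sketch; the data shapes and threshold are assumptions:

```python
from collections import Counter

def correlate(problem_conns, conn_to_resources, min_count=2):
    """Group problematic connections by shared resource.

    `conn_to_resources` maps a connection id to the switches/links/apps
    it touches (derived from topology, routing, and the
    connection-to-process mapping, per the slide). Returns resources
    implicated by at least `min_count` problem connections.
    """
    counts = Counter()
    for conn in problem_conns:
        counts.update(conn_to_resources.get(conn, ()))
    return {r: n for r, n in counts.items() if n >= min_count}

mapping = {"c1": ["tor1", "app_web"], "c2": ["tor1", "app_db"],
           "c3": ["tor2", "app_web"]}
print(correlate(["c1", "c2", "c3"], mapping))
# tor1 and app_web each appear in two problem connections
```

A resource shared by many flagged connections is a stronger diagnosis lead than any single connection's symptoms.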
SNAP Deployment
Production data center
– 8K machines, 700 applications
– Ran SNAP for a week, collecting petabytes of data
Identified 15 major performance problems
– Operators: characterize key problems in the data center
– Developers: quickly pinpoint problems in application software, the network stack, and their interactions
Characterizing Performance Limitations
(#Apps that are limited for > 50% of the time)
– Sender app (551 apps): bottlenecked by CPU, disk, etc.; slow due to app design (small writes)
– Send buffer (1 app): send buffer not large enough
– Network (6 apps): fast retransmission, timeout
– Receiver (8 apps): not reading fast enough (CPU, disk, etc.)
– Receiver (144 apps): not ACKing fast enough (delayed ACK)
Delayed ACK
Delayed ACK caused significant problems
– Delayed ACK is used to reduce bandwidth usage and server interruptions
[Diagram: A sends data to B; when B has data to send, its ACK piggybacks on that data; when B has nothing to send, it delays the standalone ACK by up to 200 ms]
Delayed ACK should be disabled in data centers
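On Linux, an application can request immediate ACKs on a socket with the TCP_QUICKACK option. A best-effort sketch; note the option is platform-specific and not sticky, which is an important caveat in practice:

```python
import socket

def disable_delayed_ack(sock):
    """Best-effort: request immediate ACKs on a TCP socket.

    TCP_QUICKACK is Linux-specific, and it is not sticky: the kernel
    may fall back to delayed ACK after certain events, so long-lived
    connections should reapply it (e.g., after each recv).
    """
    if not hasattr(socket, "TCP_QUICKACK"):
        return False  # option not available on this platform
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
        return True
    except OSError:
        return False
```

In the deployment described on the slides the fix was made fleet-wide by operators rather than per-socket by each application, but the mechanism is the same kind of knob.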
Diagnosing Delayed ACK with SNAP
Monitor at the right place
– Scalable, low-overhead data collection at all hosts
Algorithms to identify performance problems
– Identify delayed ACK with OS information
Correlate problems across connections
– Identify the apps with significant delayed-ACK issues
Fix the problem with operators and developers
– Disable delayed ACK in data centers
Conclusion
Cloud computing
– A major trend in the IT industry
– Today's equivalent of factories
Data center networking
– Regular topologies interconnecting VMs
– A mix of Ethernet and IP networking
Modular, multi-tier applications
– New ways of building applications
– New performance challenges
Load Balancing
Load Balancers
Spread load over server replicas
– Present a single public address (the VIP) for a service
– Direct each request to a server replica
[Diagram: clients send requests to the virtual IP (VIP) 192.121.10.1; the load balancer spreads them across replicas 10.10.10.1, 10.10.10.2, and 10.10.10.3]
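At its core, the mapping behind a VIP can be as simple as hashing each client connection to one of the replicas, so a given connection sticks to one server. A minimal sketch using the addresses from the slide; the hashing scheme itself is an assumption:

```python
import hashlib

VIP = "192.121.10.1"  # the single public address clients see
REPLICAS = ["10.10.10.1", "10.10.10.2", "10.10.10.3"]

def pick_replica(client_ip, client_port):
    # Hash the client endpoint so every packet of a connection goes to
    # the same replica, while different clients spread across replicas.
    key = f"{client_ip}:{client_port}".encode()
    index = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(REPLICAS)
    return REPLICAS[index]

replica = pick_replica("203.0.113.9", 51000)
assert replica in REPLICAS
assert replica == pick_replica("203.0.113.9", 51000)  # deterministic
```

Production load balancers add health checks and connection tracking on top, so that a replica failure does not strand the connections hashed to it.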
Wide-Area Network
[Diagram: clients on the Internet reach one of several data centers; a DNS server performs DNS-based site selection, directing each client toward a data center's servers via its router]
Wide-Area Network: Ingress Proxies
[Diagram: clients connect to a nearby ingress proxy, which relays their traffic across the wide area to servers in the data centers]