1
Warehouse-Scale Computing
Mu Li, Kiryong Ha
10/17/2012
15-740 Computer Architecture
2
Overview
Motivation: explore architectural issues as computing moves toward the cloud
1. Impact of sharing memory subsystem resources (LLC, memory bandwidth, ...)
2. Maximizing resource utilization by co-locating applications without hurting QoS
3. Inefficiencies of traditional processors when running scale-out workloads
3
Overview
Paper | Problem | Approach
The Impact of Memory Subsystem Resource Sharing on Datacenter Applications | Sharing in the memory subsystem | Software
Bubble-Up | Resource utilization | Software
Clearing the Clouds | Inefficiencies for scale-out workloads | Software
Scale-Out Processors | Improve scale-out workload performance | Hardware
4
Impact of memory subsystem sharing
5
Motivation & Problem definition
- Machines are multi-core and multi-socket
- For better utilization, applications must share the Last Level Cache (LLC) and Front Side Bus (FSB)
It is therefore important to understand the memory sharing interactions between (datacenter) applications.
6
Impact of thread-to-core mapping
- Sharing cache, separate FSBs (XX..XX..)
- Sharing cache, sharing FSBs (XXXX....)
- Separate caches, separate FSBs (X.X.X.X.)
A pinning sketch for these mappings follows the list.
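As an illustration of how such mappings can be realized in practice (not part of the paper), here is a minimal Linux sketch using os.sched_setaffinity. The topology is an assumption: cores 0-3 sharing one LLC/FSB and cores 4-7 the other; verify against the actual machine before trusting the core sets.

```python
# Pin a process to core sets realizing the three TTC mappings above.
# Linux-only (os.sched_setaffinity). Assumed topology: cores 0-3 share
# one LLC/FSB, cores 4-7 the other; check with `lstopo` or /proc/cpuinfo.
import os

TTC_MAPPINGS = {
    "share_cache_separate_fsb":    {0, 1, 4, 5},  # XX..XX..
    "share_cache_share_fsb":       {0, 1, 2, 3},  # XXXX....
    "separate_cache_separate_fsb": {0, 2, 4, 6},  # X.X.X.X.
}

def pin_to(mapping: str) -> None:
    """Restrict the calling process (pid 0) to the chosen core set."""
    cores = TTC_MAPPINGS[mapping]
    os.sched_setaffinity(0, cores)
    print(f"{mapping}: running on cores {sorted(cores)}")

if __name__ == "__main__":
    pin_to("separate_cache_separate_fsb")
```

Sweeping a multi-threaded benchmark across these core sets and comparing throughput reproduces the kind of TTC experiment the paper performs.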
7
Impact of thread-to-core mapping
- Performance varies by up to 20%
- Each application shows a different trend
- TTC behavior changes depending on the co-located application
8
Observation
1. Performance can swing significantly based purely on how application threads are mapped to cores.
2. The best TTC mapping changes depending on the co-located program.
3. Application characteristics that impact performance: memory bus usage, cache line sharing, cache footprint.
   - Ex) CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint, so it works better when it does not share the LLC and FSB. STITCH uses even more bus bandwidth, so a co-located CONTENT ANALYZER will contend with it on the FSB.
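These characteristics can be estimated from hardware counters. A rough sketch using Linux perf (not the paper's methodology): ./my_workload is a hypothetical stand-in for the application under test, and the generic event names may be unsupported or spelled differently on a given CPU.

```python
# Rough characterization of a workload's memory-bus pressure via Linux
# `perf stat` hardware counters. "./my_workload" is a hypothetical
# placeholder; generic event availability varies across CPUs.
import subprocess

EVENTS = "LLC-loads,LLC-load-misses,instructions"

def characterize(cmd: list[str]) -> dict[str, int]:
    """Run cmd under `perf stat -x ,` (CSV output) and return counters."""
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", EVENTS, "--", *cmd],
        capture_output=True, text=True, check=True,
    )
    counts = {}
    for line in proc.stderr.splitlines():  # perf writes counters to stderr
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].isdigit():
            counts[fields[2]] = int(fields[0])  # CSV: value,unit,event,...
    return counts

if __name__ == "__main__":
    c = characterize(["./my_workload"])
    # High LLC misses per kilo-instruction indicate heavy bus usage and a
    # footprint exceeding the LLC, the profile that made CONTENT ANALYZER
    # a poor cache/FSB-sharing partner.
    print("LLC MPKI:", 1000 * c["LLC-load-misses"] / c["instructions"])
```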
9
Increasing Utilization in Warehouse-Scale Computers via Co-location
10
Increasing Utilization via Co-location
Motivation
- Cloud providers want higher resource utilization.
- However, overprovisioning is used to ensure performance isolation for latency-sensitive tasks, which lowers utilization.
Precise prediction of shared-resource interference is needed to raise utilization without violating QoS.
11
Bubble-Up Methodology
1. QoS sensitivity curve: measure the application's sensitivity by iteratively increasing the amount of pressure on the memory subsystem
2. Bubble score: measure the amount of pressure the application causes on a reporter
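A toy version of the sensitivity measurement is sketched below. The pressure generator and the timed memory-bound probe are crude stand-ins for the paper's carefully calibrated bubble and real latency metrics, and Python interpreter overhead makes the numbers illustrative only.

```python
# Toy Bubble-Up sensitivity loop: grow a synthetic memory "bubble" in a
# child process and record a QoS proxy at each size. Both the bubble and
# the probe are assumptions standing in for the paper's instrumentation.
import multiprocessing
import time

def bubble(mb: int) -> None:
    """Stream over a working set of `mb` MiB, touching one byte per
    64-byte cache line, to pressure the LLC and memory bandwidth."""
    buf = bytearray(mb * 1024 * 1024)
    while True:
        for i in range(0, len(buf), 64):
            buf[i] = (buf[i] + 1) & 0xFF

def measure_qos(work_mb: int = 8) -> float:
    """QoS proxy: time a fixed memory-bound task; slower means lower QoS."""
    buf = bytearray(work_mb * 1024 * 1024)
    t0 = time.perf_counter()
    for i in range(0, len(buf), 64):
        buf[i] ^= 1
    return time.perf_counter() - t0

def sensitivity_curve(sizes_mb: list[int]) -> dict[int, float]:
    curve = {}
    for mb in sizes_mb:
        p = multiprocessing.Process(target=bubble, args=(mb,), daemon=True)
        p.start()
        time.sleep(0.5)            # let the bubble's pressure ramp up
        curve[mb] = measure_qos()  # probe latency under this much pressure
        p.terminate(); p.join()
    return curve

if __name__ == "__main__":
    print(sensitivity_curve([1, 4, 16, 64]))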
12
Better Utilization
Now we know:
1) how QoS changes depending on bubble size (QoS sensitivity curve)
2) how much the application affects others (bubble score)
So we can co-locate applications while estimating the change in QoS.
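Putting the two together, a minimal sketch of the prediction step: read the latency-sensitive application's sensitivity curve at the co-runner's bubble score. The curve points and score below are made-up numbers for illustration, not measurements from the paper.

```python
# Predict the QoS an application retains when co-located, by evaluating
# its sensitivity curve at the co-runner's bubble score (linear
# interpolation between measured points). All numbers are illustrative.
def predict_qos(curve: list[tuple[float, float]], bubble_score: float) -> float:
    """curve: (bubble size in MB, normalized QoS) points."""
    pts = sorted(curve)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= bubble_score <= x1:
            t = (bubble_score - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    # Clamp scores outside the measured range to the nearest endpoint.
    return pts[-1][1] if bubble_score > pts[-1][0] else pts[0][1]

websearch_curve = [(0, 1.00), (5, 0.98), (10, 0.93), (20, 0.80)]  # made up
batch_bubble_score = 8.0  # pressure the candidate batch job exerts

qos = predict_qos(websearch_curve, batch_bubble_score)
print(f"predicted QoS: {qos:.2f}")  # co-locate only if above the QoS target
```

The scheduler then admits the co-location only when the predicted QoS stays above the application's target, which is how Bubble-Up raises utilization without violating QoS.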
13
Scale-out workloads
14
Examples:
- Data Serving
- MapReduce
- Media Streaming
- SAT Solver
- Web Frontend
- Web Search
15
Execution-time breakdown
- A major part of execution time is spent waiting on cache misses
- A clear micro-architectural mismatch
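One way to approximate such a breakdown on Linux is perf's generic stall counters. This is a sketch, not the paper's methodology; ./scale_out_app is a hypothetical placeholder, and these events are unsupported on some microarchitectures.

```python
# Approximate an execution-time breakdown with perf's generic stall
# counters (support varies by CPU). "./scale_out_app" is a hypothetical
# placeholder for the workload under study.
import subprocess

result = subprocess.run(
    ["perf", "stat", "-e",
     "cycles,stalled-cycles-frontend,stalled-cycles-backend",
     "--", "./scale_out_app"],
    capture_output=True, text=True,
)
# perf stat reports to stderr; a large stalled-cycles fraction relative
# to total cycles is the memory-bound signature described above.
print(result.stderr)
```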
16
Frontend inefficiencies
- Cores idle due to high instruction-cache miss rates
- L2 caches increase average instruction-fetch latency
- Excessive LLC capacity leads to long instruction-fetch latency
How to improve?
- Bring instructions closer to the cores
17
Core inefficiencies
- Low instruction-level parallelism precludes effective use of the full core width
- Low memory-level parallelism underutilizes reorder buffers and load-store queues
How to improve?
- Run many things together: a multi-threaded, multi-core architecture
18
Data-access inefficiencies
- A large LLC consumes area but does not improve performance
- Simple data prefetchers are ineffective
How to improve?
- Shrink the LLC and leave room for more processors
19
Bandwidth inefficiencies
- Lack of data sharing deprecates coherence and connectivity
- Off-chip bandwidth exceeds needs by an order of magnitude
How to improve?
- Scale back the on-chip interconnect and off-chip memory bus to make room for processors
20
Scale-out processors
So: the LLC, interconnect, and memory bus are all too large, while there are not enough processors. Scale-out processors rebalance the chip accordingly and improve throughput by 5x-6.5x!
21
Q&A or Discussion
22
Supplementary slides
23
Datacenter Applications (Google's production applications)
Application | Description | Metric | Type
content analyzer | | throughput | latency-sensitive
bigtable | | average latency | latency-sensitive
websearch | | queries per second | latency-sensitive
stitcher | | | batch
protobuf | | | batch
24
Key takeaways
TTC behavior is mostly determined by:
- Memory bus usage (for FSB sharing)
- Data sharing: cache line sharing
- Cache footprint: use last-level cache misses to estimate the footprint size
Example
- CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint, so it works better when it does not share the LLC and FSB.
- STITCH actually uses more bus bandwidth, so it is better for CONTENT ANALYZER not to share an FSB with STITCH.
25
Prediction accuracy for pairwise co-locations of Google applications: 1% prediction error on average