Slide 1: SCAN-Lite: Enterprise-wide analysis on the cheap
Craig Soules, Kimberly Keeton, Brad Morrey
© 2009 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Slide 2: Enterprise information management
Analyses run across the enterprise: search, clustering, provenance, classification, IT trending, virus scanning. (Diagram: clients reporting results to a metadata server.)
Slide 3: Enterprise information management
- Data is duplicated across machines!
- Duplicate analysis is wasted work
(Diagram: duplicated files across clients feeding a metadata server.)
Slide 4: Issues
- Analysis programs conflict on clients
  - Contend for system resources (memory, disk)
- Clients repeat work
  - Duplicate files on multiple clients
- Client foreground workloads are impacted
  - Work exceeds available idle time on busy clients
Slide 5: Approaches
- Reduce resource contention (diagram: a single client)
Slide 6: Approaches
- Avoid duplicate work (diagram: multiple clients)
Slide 7: Approaches
- Leverage duplication to balance client load
  - Delay analysis to identify all duplicates
(Diagram: clients coordinated by a global scheduler.)
Slide 8: Solutions
- Local scheduler
  - Coordinates analyses to reduce resource contention
  - Up to 60% improvement
- Global scheduler
  - Identifies duplicates to remove work
  - Balances load
  - 40% reduction in impact to foreground tasks
Slide 9: Local scheduling
- Traditionally, analyses are separate programs
  - Scheduling is left to the operating system, potentially at different times
  - Each program identifies files to scan
  - Each program opens and reads file data
(Diagram: independent analysis programs each reading from disk.)
Slide 10: Unified local scheduling
- Each analysis routine is a separate thread
- A control thread manages shared tasks
  - Identifies files to scan, and opens/reads file data
- A shared memory buffer distributes file data
(Diagram: disk, control thread, analysis plugins, shared memory.)
Slide 11: Local scheduling performance
- Ran a fitness test using 7 analysis routines
  - 42 data sets, each containing files of a fixed size
  - Ran both approaches over each data set
  - Calculated per-file elapsed scan time
  - Hardware: dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1
- Seven-at-once: run each analysis routine separately at the same time
- Unified: SCAN-Lite's unified local scheduling approach
Slide 12: Elapsed time vs. CPU time
- The original fitness test used CPU time
  - Gave less variable performance curves for modeling
- Disk contention shows up in elapsed time
  - CPU time is multiplexed; elapsed time is not
(Diagram: two contending apps; the sum of elapsed times far exceeds both the sum of CPU times and the max of elapsed times, annotated "This is very bad".)
Slide 13: Local scheduling results
- 17% to 60% improvement
- Small random I/Os have worse interaction than larger ones
- Seven-at-once benefits from deep disk queues, but this hurts foreground apps
Slide 14: Global scheduler
- Two goals:
  - Reduce additional work from duplicate files
  - Utilize duplication to schedule work to the "best" client
- Two-phase scanning
  - Phase one: identify duplicate files using content hashing
  - Phase two: analyze one copy at the appropriate client
  - Delaying between phases one and two provides opportunity for additional duplication and deletion
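Phase one can be sketched as follows. For compactness this sketch hashes file contents in one place, whereas in the real system each client hashes locally and uploads only the digests; all names are illustrative.

```python
import hashlib
from collections import defaultdict

def find_duplicates(reports):
    """Group files by content hash.

    reports: iterable of (client, path, content_bytes).
    Returns: hash -> list of (client, path) replicas; a list longer
    than one means the content is duplicated across the enterprise.
    """
    by_hash = defaultdict(list)
    for client, path, data in reports:
        by_hash[hashlib.sha1(data).hexdigest()].append((client, path))
    return by_hash

def unique_work(by_hash):
    """Phase two only needs to analyze one replica per unique hash."""
    return {h: replicas[0] for h, replicas in by_hash.items()}
```

The savings from two-phase scanning are exactly the replicas dropped here: with three copies of the same bytes, two analyses are avoided at the cost of three hashes.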
Slide 15: Traditional scanning
(Diagram: clients and server.)
Slide 16: Phase one: duplicate detection
(Diagram: clients upload content hashes to the server.)
Slide 17: Phase two: scheduling
(Diagram: the server assigns scans back to clients.)
Slide 18: When to schedule
- Clients upload hashes each scheduling period
- The freshness specifies a deadline by which new data must be analyzed
(Diagram: a timeline of scheduling periods; scheduling just before the freshness deadline gives one option, scheduling earlier gives three options.)
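The deadline arithmetic on this slide can be written down directly, assuming scheduling periods start at multiples of `period` and data arriving at time `t` must be analyzed by `t + freshness`:

```python
def scheduling_options(arrival, freshness, period):
    """Count the scheduling periods in which this data may still be placed.

    A period start at time p is usable if arrival < p <= arrival + freshness.
    """
    deadline = arrival + freshness
    first = (arrival // period + 1) * period  # first period start after arrival
    count = 0
    t = first
    while t <= deadline:
        count += 1
        t += period
    return count
```

With the evaluation's settings (freshness of 3 days, period of 1 day), data arriving at day 0 can be scheduled in any of three periods, matching the "three options" annotation; waiting until the last period leaves only one.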
Slide 19: How to schedule
- Scheduling is a bin packing problem
  - Files are balls, clients are bins
  - Size of bins is available idle time
  - Color of balls/bins equates to location of duplicates
  - Size of balls is time required for analysis
(Diagram: clients A-D as bins of idle time, files as balls.)
Slide 20: How to schedule
- We use a greedy heuristic for scheduling
  - Considers idle time and machine priorities
  - See paper for details
(Diagram: clients A-D as bins, files as balls.)
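The slide defers the heuristic's details to the paper, so the sketch below is one plausible greedy packing under the stated constraints, not the paper's exact algorithm: place the largest scans first, each on the replica-holding client with the most remaining idle time (a worst-fit choice). All names are illustrative.

```python
def greedy_schedule(files, idle_time):
    """Greedily pack scan jobs onto clients.

    files: list of (file_id, scan_cost, replica_clients) tuples;
           a file may only be scanned where a replica lives ("ball color").
    idle_time: client -> available idle seconds ("bin size").
    """
    remaining = dict(idle_time)
    assignment = {}
    # Largest jobs first, a common bin-packing heuristic.
    for fid, cost, replicas in sorted(files, key=lambda f: -f[1]):
        # Worst fit: among eligible clients, pick the most idle one.
        client = max(replicas, key=lambda c: remaining[c])
        assignment[fid] = client
        remaining[client] -= cost
    return assignment, remaining
```

Worst fit spreads load rather than filling one client, which matters here because overshooting a bin does not fail the packing but directly becomes foreground impact on that client.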
Slide 21: Work ahead
- Start by scheduling all work that meets its freshness deadline
- Schedule additional work on still-idle machines
  - Any remaining idle time can be used for additional work
  - We refer to this as work ahead
(Diagram: clients A-D as bins, files as balls.)
Slide 22: Two-phase scanning: trade-offs
(Diagram: clients, comparing two-phase cost against one-phase cost.)
Slide 24: Two-phase scanning: trade-offs
- If the cost of hashing exceeds the additional work from duplicates, then one-phase scanning is better
- Analysis of hashing costs using SHA-1 indicates that 3% data duplication is the minimum needed
  - Do we see that in practice?
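The 3% figure follows from a simple cost model. With n file copies, u of them unique, per-file hash cost h and analysis cost A: one-phase scanning costs n*A, two-phase costs n*h + u*A, so two-phase wins exactly when the duplication fraction (n-u)/n exceeds h/A. If SHA-1 hashing costs roughly 3% of an average analysis (the assumed ratio), the break-even is 3% duplication. A sketch:

```python
def two_phase_wins(n_copies, n_unique, hash_cost, analysis_cost):
    """Does hashing everything and analyzing only unique copies pay off?"""
    one_phase = n_copies * analysis_cost           # analyze every copy
    two_phase = n_copies * hash_cost + n_unique * analysis_cost
    return two_phase < one_phase
    # Equivalently: (n_copies - n_unique) / n_copies > hash_cost / analysis_cost
```

With a hash-to-analysis cost ratio of 3%, 10% duplication (as measured on the next slide) comfortably clears the bar, while 2% duplication would not.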
Slide 25: Duplication in enterprise data
- Examined two data sources:
  - 100 user home directories from a central server
  - 12 user productivity machines
- In both datasets, saw ~10% duplication
  - Even more with system files, email servers, sharepoints, etc.
- This is sufficient duplication for work reduction
(Diagram: a small example data set where hashing reveals 4/7 duplication.)
Slide 26: Global scheduling policies
- Traditional: one-phase scanning, scan all copies
- Rand: two-phase scanning, random scheduling
- BestPlace: two-phase scanning, greedy scheduling
- BestPlaceTime: two-phase scanning, greedy scheduling + work ahead
- Opt: unreplicated data only, delayed + work ahead
Slide 27: Metrics
- Total work: total elapsed time spent on analysis and hashing
- Client impact: time spent that exceeded client idle time
(Diagram: a client's total work compared against its idle time; the excess is client impact.)
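These two metrics translate directly to code: per client and per day, impact is the portion of scan work that spills past the available idle time. Units (seconds) are assumed for illustration.

```python
def client_impact(total_work, idle_time):
    """Time spent on analysis beyond the client's idle time."""
    return max(0.0, total_work - idle_time)

def summarize(days):
    """Sum both metrics over a simulation.

    days: list of (total_work, idle_time) tuples, one per simulated day.
    Returns (total_work, total_client_impact).
    """
    total = sum(w for w, _ in days)
    impact = sum(client_impact(w, i) for w, i in days)
    return total, impact
```

Note the asymmetry: unused idle time on a quiet day cannot offset an overrun on a busy day, which is why delaying and load-balancing work matters even when average idle time is plentiful.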
Slide 28: Metrics
- Metrics are calculated for each day
- Summed over the entire simulation period
(Diagram: per-client total work, idle time, and client impact.)
Slide 29: Experimental setup
- Implemented a simulator to test a variety of machine configurations and scheduling policies
  - Config: 50 high-priority blades, 50 low-priority laptops
  - Blades modeled after: dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1
  - Laptops modeled after: 2GHz Pentium M, 1.5GB RAM, 60GB SATA
- Simulated 30 days
  - Daily creation rates and layouts from traced workloads
  - Freshness of 3 days, scheduling period of 1 day
Slide 30: Total work
(Graph annotations:)
- Doing work ahead of the freshness delay means analyzing files that would have been deleted
- Preferring faster blade machines over laptops increases their total work in order to reduce client impact
- Removing duplicate work reduces the total work done
Slide 31: Client impact
(Graph annotations:)
- Less work means less impact
- Choosing the best place helps hit the idle-time targets, reducing average client impact
- By doing work ahead of the freshness deadline, SCAN-Lite takes better advantage of idle time
- 40% improvement; the theoretical Opt is only 8% better than BestPlaceTime
Slide 32: Summary
- Reducing local scanning interference is critical
  - 17% to 60% improvement from reduced contention
- Two-phase scanning reduces analysis overheads
  - Reduces total work to near single-copy costs
  - Reduced client impact by up to 40% on our workload
Slide 33: Future work
- This is an initial system for reducing analysis costs; many improvements remain!
- Vary freshness delays
  - Different applications may have different requirements
- Provide freshness and scan priorities to clients
  - Could prioritize scan order to not exceed client idle times
- Try more workloads
  - May need better bin packing algorithms
Slide 34: Summary
- Ever-increasing number of analyses in the enterprise
  - Search, provenance, trending, clustering, classification, etc.
- Local scheduling to reduce resource contention on clients
  - Up to 60% performance improvement
- Two-phase scanning to reduce work and balance load
  - Delay analysis to identify duplicate work
  - Global scheduling to balance load
  - Reduced client impact by up to 40% on our workload
Slide 35: Getting a handle on enterprise data
- Unstructured information growing at XX% per year
- Increasing number of needs for metadata
  - eDiscovery
  - Worker productivity and search
  - IT trending and historical analysis
- Lots of different analyses to perform
  - Term vectors, fingerprints, feature vectors, usage statistics, etc.
- Data is spread across file servers, web servers, email servers, laptops, desktops, backups, etc.
Slide 36: Where to perform analysis?
- On backups?
  - Not all data is backed up, encrypted, or utilized
- On idle servers?
  - Requires data migration strategies, may break privacy
- On end nodes?
  - May interrupt foreground workloads, frustrate users
- All solutions want to minimize work and balance load to reduce required resources
Slide 37: The problems
- Most analysis tools run in isolation
  - Tools compete for resources locally, creating interference
- Replicated data creates replicated work
  - Tools produce the same results in multiple locations
- Machines have different characteristics
  - Creation rates, performance, idle time, etc.
- Goal: perform analysis at the best time and place
Slide 38: Best place and time?
(Diagram: clients A-D and their available time across a timeline.)
Slide 39: Solution: improve scheduling
- Local scheduler to coordinate analysis tasks
  - Single resource controller to prevent competition
- Global scheduler to single-instance analysis
  - Centralizes the decision of when and where to analyze
Slide 40: Local scheduling
- A prefetch thread reads data from disk once
- Analysis routines run in separate parallel threads
- A shared memory buffer distributes data to the routines
(Diagram: files, prefetch thread, producer/consumer buffer, analysis threads.)
Slide 41: Traditional: one-phase scanning
(Diagram: on the client, apps and files feed analysis programs, which send metadata to the server's metadata store.)
Slide 42: SCAN-Lite: two-phase scanning
(Diagram: on the client, apps and files feed analysis plugins under a local scheduler; hashes go to the server's global scheduler and metadata to the metadata store. The local scheduler draws on a fitness test, performance models, idle time estimation, and utilization statistics.)
Slide 43: Global scheduling
- Time is broken into scheduling periods based on some freshness delay (the maximum time until data is scanned)
- At the start of each scheduling period, the global scheduler picks which client will scan which data
- First, schedule data that has met its freshness delay
  - Using idle time, priorities, worst-fit, and ordering
- Second, schedule any possible additional data
  - Work-ahead
Slide 44: Idle time, priorities, and worst-fit
- For a given piece of data:
  - Choose the set of machines that have available idle time
    - If none, then choose all machines
  - From that, choose the machines with the highest priority
  - From that, choose the machine with the most idle time
    - If none, choose the machine with the least client impact
(Diagram: machines at priorities P1 and P2 with idle time and assigned work.)
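The selection rules on this slide translate almost directly to code. The field names (`priority`, `idle`, `impact`) are hypothetical stand-ins for the scheduler's per-machine state:

```python
def pick_machine(machines):
    """Pick a machine for one piece of data, per the slide's rules.

    machines: list of dicts with keys "priority" (higher is better),
    "idle" (remaining idle seconds), and "impact" (client impact so far).
    """
    # Prefer machines with available idle time; if none, consider all.
    candidates = [m for m in machines if m["idle"] > 0] or list(machines)
    # Keep only the highest-priority machines among the candidates.
    top = max(m["priority"] for m in candidates)
    candidates = [m for m in candidates if m["priority"] == top]
    idle = [m for m in candidates if m["idle"] > 0]
    if idle:
        return max(idle, key=lambda m: m["idle"])      # worst fit: most idle
    return min(candidates, key=lambda m: m["impact"])  # least added impact
```

The worst-fit step keeps idle-time headroom spread across machines, while the final fallback degrades gracefully: when every candidate is busy, the overflow lands on whoever has absorbed the least impact so far.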
Slide 45: Ordering
- There is still a problem:
(Diagram: P1 and P2 machines with idle time; placing data in the order P2, P1, P2, P1 overflows the available idle time.)
Slide 46: Ordering
- Assign each piece of data a number based on the number of machines at each priority class
- Order all data by its ordering number
(Diagram: P1 and P2 machines; per-priority counts such as 0/1 form the ordering numbers, compared with P3 > P2 > P1.)
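The slide leaves the exact ordering rule to the paper. One plausible reading is that each file's ordering number is a vector counting its replicas in each priority class, compared with the highest class most significant, and that files with more high-priority placement options are placed first. All names below are illustrative:

```python
def ordering_key(replicas, priorities, num_classes):
    """Per-priority replica counts, highest class first.

    replicas: client names holding a copy of the data.
    priorities: client -> priority class (1 = lowest).
    """
    counts = [0] * num_classes
    for c in replicas:
        counts[num_classes - priorities[c]] += 1  # index 0 = highest class
    return tuple(counts)

def order_files(files, priorities, num_classes):
    # One plausible reading of the slide: data with more replicas on
    # high-priority machines is scheduled first.
    return sorted(files,
                  key=lambda f: ordering_key(f["replicas"], priorities, num_classes),
                  reverse=True)
```

Tuple comparison gives the P3 > P2 > P1 behavior for free: a single high-priority replica outranks any number of low-priority ones.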
Slide 47: Work ahead
- Once all data that has met its freshness delay has been scheduled, assign additional data to any machines with available idle time
(Diagram: P1 and P2 machines with idle time and assigned work.)
Slide 48: How to schedule
- First, schedule any work that will meet its freshness deadline during this scheduling period
- Second, schedule any additional work that will fit within the remaining idle time of clients
Slide 49: Local scheduling results
(Graph slide.)
Slide 50: Local "performance improvements"
- What happens when one or more analysis routines try to "improve performance"?
- For example, using direct I/O to reduce memory footprint, and thus impact on client workloads
- Seven Direct: analysis programs implement direct I/O
- Unified Direct: SCAN-Lite implements direct I/O
Slide 51: Local scheduling with direct I/O
(Graph slide.)