1
www.ci.anl.gov www.ci.uchicago.edu
AME: An Any-scale many-task computing Engine
Zhao Zhang, University of Chicago
Daniel S. Katz, CI University of Chicago & ANL
Matei Ripeanu, ECE University of British Columbia
Michael Wilde, CI University of Chicago & ANL
Ian Foster, CI University of Chicago & ANL
2
MTC application review
– Sequenced execution of other programs
– Involves several different programs
– High degree of inter-task parallelism
– Large number of invocations, up to millions
– Parallelism is enabled by file dependency
– Programs exchange data via POSIX files
[Figure: workflow diagram with stages mProject, mDiff, mFit, mConFit and numbered tasks linked by the files they share]
3
Supercomputer review
– Compute nodes with multiple cores
– No local disk, limited RAM disk
– Full Linux kernel
– Large number of compute nodes
[Figure: supercomputer architecture; labels: Interconnect, IO, Exclusive Data Collection Networks, Optional Data Collection Network, LN, Storage Network, Control Network]
4
Gaps
Resource Provisioning
Task Management
– Task Dispatching
– Dependency Resolution
– Load Balancing
Data Management
Resiliency
5
AME Overview
6
Task Management
Task Dispatching
– All tasks are sent to the workers and queued there
– Each worker screens all of its tasks
– Each worker looks up the state and location of the input data for all of its tasks
– Workers subscribe to the FLS (File Location Lookup Service) for the files their tasks need
– Tasks that can run immediately are pushed into a ready queue; the others are kept in a hash table
– Tasks in the hash table are moved to the ready queue once their input files are ready
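The worker-side steps above can be sketched as follows. This is a minimal illustration, not AME's code: the `Worker` class, its method names, and the file names are hypothetical, and the FLS subscription is only marked with a comment.

```python
from collections import deque


class Worker:
    """Hypothetical worker: a ready queue plus a hash table of blocked tasks."""

    def __init__(self, available_files):
        self.available = set(available_files)  # files already known to be ready
        self.ready = deque()                   # tasks whose inputs are all ready
        self.pending = {}                      # task name -> set of missing inputs

    def screen(self, tasks):
        """Screen every queued task once; `tasks` maps name -> list of input files."""
        for name, inputs in tasks.items():
            missing = set(inputs) - self.available
            if missing:
                # A real worker would also subscribe to the FLS for these files.
                self.pending[name] = missing
            else:
                self.ready.append(name)

    def on_file_ready(self, filename):
        """Handle a notification that `filename` has been produced."""
        self.available.add(filename)
        for name in list(self.pending):
            self.pending[name].discard(filename)
            if not self.pending[name]:
                del self.pending[name]
                self.ready.append(name)


worker = Worker(available_files=["a.fits"])
worker.screen({"t1": ["a.fits"], "t2": ["a.fits", "b.fits"]})
ready_before = list(worker.ready)   # only t1 can run
worker.on_file_ready("b.fits")
ready_after = list(worker.ready)    # t2 has moved out of the hash table
```

Keeping blocked tasks in a dictionary keyed by task name makes the move to the ready queue a simple lookup when a file notification arrives.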
7
Task Management
Task Dispatching
– Test setup
o Parameter sweep over scale and task length
o Scale = {256, 512, 1024, 2048, 4096, 8192, 16384} cores
o Task length = {0, 1, 4, 16, 64, 256} seconds
o 16 tasks per core
[Figure: Dispatch Rate; series: Decentralized]
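To give a sense of the size of this sweep, the configurations can be enumerated as below. The scales, task lengths, and tasks-per-core value come from the slide; the function and field names are illustrative only.

```python
from itertools import product

SCALES = [256, 512, 1024, 2048, 4096, 8192, 16384]  # cores, from the slide
TASK_LENGTHS = [0, 1, 4, 16, 64, 256]               # seconds, from the slide
TASKS_PER_CORE = 16


def sweep():
    """Yield one configuration per (scale, task length) combination."""
    for cores, length in product(SCALES, TASK_LENGTHS):
        yield {
            "cores": cores,
            "task_length_s": length,
            "total_tasks": cores * TASKS_PER_CORE,
        }


configs = list(sweep())
largest = max(c["total_tasks"] for c in configs)  # task count at 16384 cores
```

At the largest scale the dispatcher has to handle 16384 x 16 = 262144 tasks in a single run.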
8
Task Management
Task Dispatching
– Test setup
o Parameter sweep over scale and task length
o Scale = {256, 512, 1024, 2048, 4096, 8192, 16384} cores
o Task length = {1, 4, 16, 64, 256} seconds
o 16 tasks per core
[Figure: Efficiency; series: Decentralized]
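The slide's own efficiency formula is not preserved in this transcript. A common definition, used here purely as an assumption, is the ideal run time (tasks per core times task length) divided by the measured makespan. The measured value in the example is hypothetical.

```python
TASKS_PER_CORE = 16


def efficiency(task_length_s, measured_makespan_s, tasks_per_core=TASKS_PER_CORE):
    """Assumed definition: ideal per-core run time over measured makespan."""
    if measured_makespan_s <= 0:
        raise ValueError("measured makespan must be positive")
    ideal_s = tasks_per_core * task_length_s
    return ideal_s / measured_makespan_s


# Hypothetical measurement: 16 four-second tasks per core finishing in 80 s.
example = efficiency(task_length_s=4, measured_makespan_s=80.0)
```

Under this definition zero-length tasks always give an efficiency of zero, which would be consistent with this sweep starting at 1 second rather than 0.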
9
Task Management
Dependency Resolution
States of Intermediate Files
– Invalid: The file has not been produced yet.
– Remote: The file has been produced and is stored at some peer node.
– Local: The file has been moved to local storage.
– Shared: The file has been moved to the global shared file system.
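The four states map naturally onto an enumeration. In the sketch below the state names follow the slide, while the consumer-side action attached to each state is an assumption added for illustration.

```python
from enum import Enum


class FileState(Enum):
    INVALID = "invalid"  # not produced yet
    REMOTE = "remote"    # produced, stored at some peer node
    LOCAL = "local"      # moved to local storage
    SHARED = "shared"    # moved to the global shared file system


def next_step(state):
    """Hypothetical action a consumer task takes for an input in `state`."""
    if state is FileState.INVALID:
        return "wait for the producer"
    if state is FileState.REMOTE:
        return "fetch from the peer node"
    if state is FileState.LOCAL:
        return "read from local storage"
    return "read from the shared file system"


steps = {state.name: next_step(state) for state in FileState}
```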
10
Task Management
Dependency Resolution
[Figure: two panels, "Query a produced file" and "Query an invalid file"]
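The two cases can be illustrated with a small in-memory lookup service. This is a sketch under assumptions: the class, its methods, and the callback-based notification are hypothetical and stand in for whatever protocol the FLS actually uses.

```python
class FileLocationService:
    """Hypothetical in-memory stand-in for the FLS."""

    def __init__(self):
        self.locations = {}    # file name -> node that holds the file
        self.subscribers = {}  # file name -> callbacks waiting for the file

    def query(self, filename, callback):
        """Answer at once for a produced file, otherwise remember the caller."""
        if filename in self.locations:
            callback(filename, self.locations[filename])
            return True
        self.subscribers.setdefault(filename, []).append(callback)
        return False

    def update(self, filename, node):
        """Record that `node` produced `filename` and notify any waiters."""
        self.locations[filename] = node
        for callback in self.subscribers.pop(filename, []):
            callback(filename, node)


events = []
fls = FileLocationService()
fls.update("a.out", "node-3")
hit = fls.query("a.out", lambda f, n: events.append((f, n)))   # produced file
miss = fls.query("b.out", lambda f, n: events.append((f, n)))  # invalid file
fls.update("b.out", "node-7")                                  # wakes the waiter
```

A query for a file that is not yet produced costs the server one stored subscription and one later notification, rather than repeated polling by the worker.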
11
Task Management
Dependency Resolution
– Test Setup
o Parameter sweep over scale and running time, with the file size fixed at 10 bytes
o Scale = {256, 512, 1024, 2048, 4096, 8192, 16384} cores
o Running time = {0, 1, 4, 16} seconds
o Each core runs 16 tasks
o The 16 tasks are divided into 8 pairs, with a producer/consumer relation in each pair
o Tests are run under the worst case
[Figure: Overhead Summary]
12
Task Management
Dependency Resolution
– File size impact
– Test Setup
o Parameter sweep over scale and data size, with the running time fixed at 16 seconds
o Scale = {256, 512, 1024, 2048, 4096, 8192} cores
o File size = {1KB, 1MB, 10MB}
o Each core runs 8 tasks
o The 8 tasks are divided into 4 pairs, with a producer/consumer relation in each pair
o Tests are run under the worst case
[Figure: Performance]
13
Task Management
Overhead Analysis
– Query/Update/Transfer traffic congested in network transmission
– Saturated CPU
– Query/Update traffic congested at the server side
o Congested in the queue
o Congested by the synchronization of the server
Test Setup
– Scale: 256 cores
– Running time: 16 seconds
– File size: 10 bytes
– Number of jobs: 16 tasks per core
– The 16 tasks are divided into 8 pairs, with a producer/consumer relation in each pair
Performance

                          Query-Queuing   Query     Update-Queuing   Update
Average Processing Time   144.31 ms       0.30 ms   2.45 ms          0.36 ms
Standard Deviation        14.24 ms        7.15 ms   0.085 ms         0.14 ms
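To see where the time goes, the averages in the table can be combined as below. The sketch assumes that queuing time and processing time add up to the total server-side latency of a request, which the slide does not state explicitly.

```python
# Average times in milliseconds, copied from the performance table.
AVG_MS = {
    "query_queuing": 144.31,
    "query": 0.30,
    "update_queuing": 2.45,
    "update": 0.36,
}


def queuing_share(queuing_ms, processing_ms):
    """Fraction of (queuing + processing) time spent waiting in the queue."""
    return queuing_ms / (queuing_ms + processing_ms)


query_share = queuing_share(AVG_MS["query_queuing"], AVG_MS["query"])
update_share = queuing_share(AVG_MS["update_queuing"], AVG_MS["update"])
```

Under that assumption a query spends about 99.8% of its time in the queue and an update about 87%, which matches the slide's point that traffic is congested in the server queue rather than in request processing.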
14
Data Management
Intermediate File Storage
– Isolated file storage & processing vs. Collocated

                   File-based          Chunk-based
Single File Size   Limited to CN RAM   Limited to Aggregated Space

                           Collocated          Isolated
Scalability                High                Up to Implementation
Storage Space              Spread among CN     Configurable
Data Movements             1                   2
Transfer Traffic Pattern   Fully-distributed   Partially-distributed
Saturated CN               yes                 no
15
Data Management
Intermediate File Storage
– Isolated file storage & processing vs. Collocated
Test Setup
– Parameter sweep over scale, with the running time fixed at 16 seconds
– Scale = {256, 1024, 4096, 16384} cores
– Each core runs 16 tasks
– The 16 tasks are divided into 8 pairs, with a producer/consumer relation in each pair
– Tests are run under the worst case
[Figure: Performance]
16
Application
Montage is an astronomy application that composes small telescope images into one large image. It has run successfully on supercomputers with MPI and on grids with Pegasus.
17
Application
Test Setup
– 6 degree x 6 degree mosaic centered at galaxy M101
– Input: 1319 files, each around 2MB
– Output: 1 file, 3.7GB
– Parallel Stages: mProjectPP, mDiffFit, mBackground
– 512 cores, data management, no load-balancing

              Number of Tasks   TTS 1 core (s)   TTS 512 cores (s)   Speedup   TTS 256 cores on GPFS (s)
mProject      1319              21220.32         56.53               375.38    1675.11
mDiffFit      3883              35960.12         95.32               377.27    732.25
mBackground   1297              9815.92          64.44               152.33    287.84
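The speedup column is the single-core time to solution divided by the 512-core time. The short check below recomputes it from the table values and also derives parallel efficiency (speedup divided by core count), which is an added quantity not shown on the slide.

```python
# (TTS on 1 core, TTS on 512 cores) in seconds, copied from the table.
TTS = {
    "mProject": (21220.32, 56.53),
    "mDiffFit": (35960.12, 95.32),
    "mBackground": (9815.92, 64.44),
}
CORES = 512


def speedup(tts_one_core_s, tts_parallel_s):
    """Speedup relative to the single-core run."""
    return tts_one_core_s / tts_parallel_s


results = {}
for stage, (t1, t512) in TTS.items():
    s = speedup(t1, t512)
    results[stage] = {"speedup": s, "parallel_efficiency": s / CORES}
```

The recomputed speedups agree with the table to within about 0.01. mProject and mDiffFit reach roughly 73% to 74% parallel efficiency on 512 cores, while mBackground reaches about 30%.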
18
Application
Test Setup
– 6 degree x 6 degree mosaic centered at galaxy M101
– Input: 1319 files, each around 2MB
– Output: 1 file, 3.7GB
– Parallel Stages: mProjectPP, mDiffFit, mBackground
– 512 cores, data management, no load-balancing

                     GPFS (MB)   AME (MB)   Saving (%)
mProject-input       2800        2800       0%
mProject-output      5500        0.36       100%
mDiffFit-input       31000       0          100%
mDiffFit-output      3900        0.81       100%
mBackground-input    5200        0          100%
mBackground-output   5200        5200       0%
mAdd-input           5200        5200       0%
mAdd-output          3700        3700       0%
Total                62500       16901      73%
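The totals in the table can be recomputed from the per-stage rows. In the sketch below, the AME volume for each row with 0% saving is taken to equal the GPFS volume. That is an inference from the 0% figure, and it is consistent with the stated total of 16901 MB.

```python
# (GPFS MB, AME MB) per row; AME values on 0%-saving rows are inferred.
VOLUMES_MB = {
    "mProject-input": (2800, 2800),
    "mProject-output": (5500, 0.36),
    "mDiffFit-input": (31000, 0),
    "mDiffFit-output": (3900, 0.81),
    "mBackground-input": (5200, 0),
    "mBackground-output": (5200, 5200),
    "mAdd-input": (5200, 5200),
    "mAdd-output": (3700, 3700),
}

total_gpfs = sum(gpfs for gpfs, _ in VOLUMES_MB.values())
total_ame = sum(ame for _, ame in VOLUMES_MB.values())
saving_pct = 100 * (1 - total_ame / total_gpfs)  # share of GPFS traffic avoided
```

Most of the saving comes from mDiffFit-input: 31000 MB of the roughly 45600 MB avoided belongs to that single row.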
19
Application
Test Setup
– 6 degree x 6 degree mosaic centered at galaxy M101
– Input: 1319 files, each around 2MB
– Output: 1 file, 3.7GB
– Parallel Stages: mProjectPP, mDiffFit, mBackground
– 512 cores, data management, no load-balancing
20
Summary
– We identify and classify the gaps between MTC applications and supercomputers into six categories: resource provisioning, task dispatching, task dependency resolution, load balancing, data management, and resiliency.
– We design and implement AME, which bridges these gaps. (in future)
– The results show that AME scales well up to 16,384 cores.
– AME accelerates MTC applications, such as Montage, on supercomputers.