Data Management Middleware
NOAO Brown Bag, Tucson, AZ, May 13, 2008
Jeff Kantor, LSST Corporation
The DM reference design uses layers for scalability, reliability, and evolution

Application Layer (Scientific Layer): Data Products, Pipelines, Application Framework (custom software)
– Pipelines constructed from reusable, standard "parts", i.e. the Application Framework
– Data Product representations standardized; metadata extendable without schema change
– Object-oriented, Python, C++

Middleware Layer: Data Access, Distributed Processing, User Interface, System Administration, Operations, Security (open source and off-the-shelf software with custom integration)
– Portability to clusters, grid, and other platforms
– Provides standard services so applications behave consistently (e.g. recording provenance)
– Kept "thin" for performance and scalability

Infrastructure Layer (Distributed Platform): Computing, Communications, Physical Plant, Storage (off-the-shelf, commercial hardware and software with custom integration)
– Different parts specialized for real-time alerting vs. peta-scale data access
Middleware for Pipeline Processing

[Diagram: Pipeline Construction (producing a configured pipeline), Pipeline Control and Management, Pipeline Execution (pipeline harness), and Data Access Services (persistence framework), connected by the event system, data, and control paths.]
Where we've been; where we're going

Middleware
– In DC2, we proved out a framework for automated processing
  Pipeline Harness: framework for running application code in a parallel environment
  Data Persistence framework: data I/O (Lim)
  Event System: loosely coupled communications
– For DC3/PDR, extend the Harness, Persistence, and Event System to apply to more types of pipelines
  Pipeline Control and Management
Pipeline Harness

A processing framework for hosting scientific algorithms in a parallel environment (see the sketch below)
– The developer creates pluggable pipeline modules in Python
  A module (i.e., a Stage) operates on a data-parallel chunk of data
  The developer does not worry about managing parallel processes or data I/O
  The framework provides facilities for configuration, accessing data, logging, etc.
  Stages can be chained together into complete pipelines
– The Harness handles execution on the parallel platform
  Operates on a stream of data, routing data in and out of modules
  Uses the Persistence Framework to do data I/O
  Uses MPI to exchange signals and data between nodes
  Multiple pipelines running simultaneously can coordinate via the Event System
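As a rough illustration of the Stage/chunk model just described, here is a minimal sketch; the names (Stage, Clipboard, run_pipeline) and the serial driver are hypothetical stand-ins, not the actual harness API.

```python
# Illustrative sketch only: Stage, Clipboard, and run_pipeline are hypothetical
# stand-ins for the harness concepts described above, not the LSST DM API.

class Clipboard:
    """Holds the data items a stage reads and writes for one chunk."""
    def __init__(self):
        self._items = {}
    def get(self, key):
        return self._items[key]
    def put(self, key, value):
        self._items[key] = value

class Stage:
    """Base class for a pluggable pipeline module."""
    def __init__(self, policy=None):
        self.policy = policy or {}   # configuration supplied by the harness
    def process(self, clipboard):
        raise NotImplementedError

class SubtractTemplateStage(Stage):
    """Example stage: difference a science chunk against a template chunk."""
    def process(self, clipboard):
        science = clipboard.get("scienceExposure")
        template = clipboard.get("templateExposure")
        clipboard.put("differenceExposure", science - template)

def run_pipeline(stages, chunks):
    """Serial stand-in for the harness: route each data-parallel chunk
    (e.g. one CCD's worth of data) through the chained stages in order."""
    for chunk in chunks:
        clipboard = Clipboard()
        clipboard.put("scienceExposure", chunk["science"])
        clipboard.put("templateExposure", chunk["template"])
        for stage in stages:
            stage.process(clipboard)
        yield clipboard.get("differenceExposure")

# Example use with toy scalar "pixels":
diffs = list(run_pipeline([SubtractTemplateStage()],
                          [{"science": 10.0, "template": 9.5}]))
```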
DC2 Pipeline Harness

[Diagram: a Pipeline Manager process runs the serial processing stages and controls the parallel Slice workers; each Slice Worker process runs its own chain of Stages and Queues on one data-parallel portion of the data (e.g. a CCD).]
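The manager/slice split can be pictured with a small mpi4py scatter/gather sketch; this is only an assumption-laden illustration of the idea (the harness itself uses MPI from its own C++/Python code, not this script).

```python
# Illustration of the manager/slice split using mpi4py; not the harness code.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Pipeline Manager: prepare one data-parallel chunk (e.g. a CCD) per slice.
    ccds = [{"id": i, "pixels": [float(i)] * 4} for i in range(size)]
else:
    ccds = None

# Hand one chunk to each slice worker (rank 0 keeps one as well here).
my_ccd = comm.scatter(ccds, root=0)

# Each slice runs the parallel stages on its own chunk.
result = {"id": my_ccd["id"],
          "mean": sum(my_ccd["pixels"]) / len(my_ccd["pixels"])}

# The manager collects slice results before the next serial stage.
results = comm.gather(result, root=0)
if rank == 0:
    print("processed CCDs:", sorted(r["id"] for r in results))
```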
Event System

A system that allows loosely coupled components to talk to each other
– e.g., control systems and pipelines
– Systems running at different sites

DC2 use
– Instructing the nightly pipelines that new data are ready for processing
– Passing data between separately running pipelines
– Capturing log messages from many nodes

Implementation (see the sketch below)
– Based on a widely used messaging framework
– Leverages existing middleware that supports that framework
– All events are captured into a database
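To make the loose coupling concrete, here is a minimal in-process sketch; the real system rides on a messaging broker across machines, so the EventBus class and topic names below are purely illustrative assumptions.

```python
# In-process stand-in for a topic-based event system; the production system
# uses a messaging broker, so EventBus and its methods are hypothetical names.
from collections import defaultdict

class EventBus:
    """Routes events by topic so publishers and subscribers never reference
    each other directly (loose coupling)."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()

# The association pipeline listens for detections without knowing who sends them.
bus.subscribe("newDetections",
              lambda ev: print("associating visit", ev["visitId"]))

# The image/detection pipeline announces that a visit has been processed.
bus.publish("newDetections", {"visitId": 42, "nDiaSources": 36})
```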
Coupling Pipelines via the Event Framework

[Diagram: the Image/Detection, Object Association, and Moving Objects pipelines, each with its own Pipeline Manager and chain of Stages and Queues, communicate through the Event System with messages such as "New Detections available" and "New Moving Object Candidates available".]
Pipeline Harness: what we learned

A useful framework for developing pipelines
– Successfully implemented and executed the nightly pipelines
– Need better handling of exceptions when things go wrong
– Improvements needed in configuring and testing pipeline instances
– The Association pipeline required more data sharing across nodes
– Expect the harness to work well for the data release pipelines

Good overall performance
– The architecture makes it straightforward to measure the time spent on different operations: the logging system provides timestamps, and input and output are separate stages from processing (see the timing sketch below)
– The harness incurs low overhead

Gotchas: how the framework can affect performance
– Care is needed when deploying a pipeline to a set of processors to balance the load
– Slower nodes, or data chunks that need more processing, can hold back the entire pipeline
– When one pipeline requires information from another, their performance is coupled
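A rough sketch of how per-stage timing falls out of timestamped logging when I/O and processing are separate stages; the logger name and wrapper function are hypothetical, not the DM logging API.

```python
# Sketch: wrap each stage with start/stop log records so per-stage, per-slice
# times can be reduced from the collected logs; names here are illustrative.
import logging
import time

logging.basicConfig(format="%(asctime)s %(name)s %(message)s", level=logging.INFO)
log = logging.getLogger("harness.timing")

def run_timed_stage(stage_name, stage_fn, clipboard):
    """Run one stage and emit timestamped start/stop records."""
    start = time.time()
    log.info("%s start", stage_name)
    stage_fn(clipboard)
    log.info("%s stop elapsed=%.3fs", stage_name, time.time() - start)

# Example: time an input (I/O) stage separately from a processing stage.
clipboard = {}
run_timed_stage("InputStage", lambda cb: cb.update(image=[1, 2, 3]), clipboard)
run_timed_stage("DetectStage", lambda cb: cb.update(nSources=len(cb["image"])), clipboard)
```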
Plans for DC3: Pipeline Harness and Control

Apply the Pipeline Harness to Data Release pipelines
– Non-nightly pipelines for producing data release products, e.g. astrometry, deep photometry, data quality analysis, etc.

Apply the Harness to less data-parallel applications
– Association, WCS determination, cross-talk correction
– Requires a greater amount of data sharing across nodes

Develop the Pipeline Control and Management System (see the sketch below)
– Configures a pipeline, deploys it onto a system, executes it, monitors it, and cleans up after completion
– Automated
– Important for leveraging computing systems beyond the Base Facility and Archive Center
– Can leverage existing grid middleware
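The configure/deploy/execute/monitor/clean-up cycle could look roughly like the sketch below; the harness entry point (run_harness.py) and the plain mpiexec launch are assumptions standing in for whatever launcher or grid middleware is actually used.

```python
# Hypothetical sketch of one pipeline's control cycle; run_harness.py and the
# mpiexec launch are placeholders, not the actual control-and-management tool.
import json
import os
import shutil
import subprocess
import tempfile

def run_managed_pipeline(policy, nodes):
    """Drive one pipeline instance through configure, deploy/execute,
    monitor, and clean-up."""
    workdir = tempfile.mkdtemp(prefix="pipeline_")
    try:
        # 1. Configure: write the pipeline policy where the harness can read it.
        policy_path = os.path.join(workdir, "policy.json")
        with open(policy_path, "w") as f:
            json.dump(policy, f)

        # 2. Deploy and execute: launch the harness on the allocated nodes
        #    (a grid middleware submission could replace this local launch).
        proc = subprocess.Popen(
            ["mpiexec", "-n", str(nodes), "python", "run_harness.py", policy_path],
            cwd=workdir)

        # 3. Monitor: here we simply wait for completion; a real manager would
        #    watch status events and time out or reschedule on failure.
        return proc.wait()
    finally:
        # 4. Clean up after completion.
        shutil.rmtree(workdir)
```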
Plans for Data Challenge 3: Event System

Event Monitoring System
– Carries out actions in reaction to complex conditions
– Fault System: looks for signs of trouble (see the sketch below)
  Issues Fault Events when…
  – A system goes down (the monitor fails to hear from running systems)
  – A pipeline stalls or runs too slowly (the monitor fails to receive the expected sequence of status messages in the required time)
  Can result in corrective measures…
  – Alert an operator
  – Reschedule a pipeline
– Logging System: calculates performance statistics in real time, eliminating post-processing analysis
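One way the stall/slow-pipeline check could work, sketched under the assumption that pipelines emit periodic status events; FaultMonitor and its methods are hypothetical names, not the planned implementation.

```python
# Hypothetical fault-detection sketch: flag a pipeline whose expected status
# messages stop arriving within the required time.
import time

class FaultMonitor:
    def __init__(self, timeout_s, on_fault):
        self.timeout_s = timeout_s
        self.on_fault = on_fault          # corrective action: alert, reschedule, ...
        self.last_seen = {}

    def record_status(self, pipeline_id):
        """Call whenever a status event arrives for a pipeline."""
        self.last_seen[pipeline_id] = time.time()

    def check(self):
        """Periodically scan for pipelines that have gone silent too long."""
        now = time.time()
        for pipeline_id, seen in self.last_seen.items():
            if now - seen > self.timeout_s:
                self.on_fault(pipeline_id, now - seen)

monitor = FaultMonitor(
    timeout_s=60.0,
    on_fault=lambda pid, age: print(f"FAULT: {pid} silent for {age:.0f} s"))

monitor.record_status("nightly-ipd")
monitor.check()   # fires on_fault only once 60 s pass with no new status event
```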
LSST Data Base Classes
Data Access Base Classes
Data Access Middleware Design

Goal: flexible, reconfigurable, high-performance persistence for application objects; also used for inter-pipeline communication.
– InputStage and OutputStage handle persistence for pipelines
– A Persistence object coordinates transactions
– Formatter objects for each persistable class translate C++ objects into calls to Storage objects
– Storage objects manage interactions with persistent stores: database, FITS files, Boost archives
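Here is a small sketch of how the Persistence, Formatter, and Storage roles fit together; the class names mirror the roles above, but the method signatures and the toy Exposure data are illustrative assumptions, not the LSST C++ API.

```python
# Sketch of the Persistence / Formatter / Storage split; method names and the
# toy Exposure dictionary are illustrative, not the LSST persistence API.

class FitsStorage:
    """Storage object: manages interaction with one kind of persistent store."""
    def write(self, path, pixels):
        print(f"writing {len(pixels)} pixels to FITS file {path}")

class DatabaseStorage:
    def insert(self, table, row):
        print(f"inserting into {table}: {row}")

class ExposureFormatter:
    """Formatter: translates one persistable class into calls on Storage objects."""
    def persist(self, exposure, fits_storage, db_storage):
        fits_storage.write(exposure["path"], exposure["pixels"])
        db_storage.insert("Exposure_Metadata",
                          {"visitId": exposure["visitId"], "path": exposure["path"]})

class Persistence:
    """Coordinates the transaction: selects the Formatter for the object's
    type and hands it the configured Storage objects."""
    def __init__(self):
        self.formatters = {"Exposure": ExposureFormatter()}

    def persist(self, kind, obj, fits_storage, db_storage):
        self.formatters[kind].persist(obj, fits_storage, db_storage)

# An OutputStage in a pipeline would invoke something like this:
persistence = Persistence()
persistence.persist("Exposure",
                    {"visitId": 7, "path": "v7-r2-s11.fits", "pixels": [0] * 16},
                    FitsStorage(), DatabaseStorage())
```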
Plans for DC3: Data Access

Add/modify Formatters
– Objects for data release processing
– ISR inputs
– SDQA outputs

Build an image archive
– Raw and science images
– Template images
– Cached image products

Improve retrieval of split pixels/metadata

Keep the framework as lightweight as possible
– Remove CORAL
– Investigate runtime loading of Formatters
Additional Environment Elements
– Version Control
– Build System
– Coding and Review Standards
– Defect Tracking
– Automated unit and integration testing
– Requirements/Design Traceability

http://dev.lsstcorp.org/trac
Validating the design: Data Challenges

Data Challenge Goals
#1 (Jan 2006 - Oct 2006)
– Validate infrastructure and middleware scalability to 5% of LSST required rates
#2 (Jan 2007 - Jan 2008)
– Validate nightly pipeline algorithms
– Create the Application Framework and Middleware; validate by creating functioning pipelines with them
– Validate infrastructure and middleware scalability to 10% of LSST required rates
#3 (Mar 2008 - Jun 2009)
– Validate deep detection, calibration, and SDQA pipelines
– Expand Middleware for Control & Management and inter-slice communications
– Validate infrastructure and middleware reliability
– Validate infrastructure and middleware scalability to 15% of LSST required rates
#4 (Jul 2009 - Jun 2010)
– Validate open interfaces and data access
– Validate infrastructure and middleware scalability to 20% of LSST required rates
Validating the design: Data Challenge work products to date (full reports in the LSST document archive)

Data Challenge Work Products
#1 (Jan 2006 - Oct 2006)
– 3 TeraGrid nodes used to simulate data transfer: Mountain (Purdue), Base (SDSC), Archive Center (NCSA) using Storage Resource Broker (SRB)
– IA64 Itanium 2 clusters at SDSC and NCSA, 32-bit Xeon cluster at Purdue
– MPI-based Pipeline Harness developed in C and Python
– Simulated nightly processing application pipelines developed (CPU, I/O, RAM loads)
– Initial database schema designed and MySQL database configured
– Data ingest service developed
– Initial development environment configured, used throughout
#2 (Jan 2007 - Jan 2008)
– 10-node, 60-core dedicated cluster acquired and configured at NCSA
– Application Framework and Middleware developed and tested
– Image Detection and Association pipelines developed
– Moving object pipeline (jointly developed with Pan-STARRS) ported to the DM environment, modularized, and re-architected for nightly mode
– Major schema upgrade and implementation in MySQL with CORAL
– Acquired 2.5 TB of precursor data (CFHTLS-Deep, TALCS) for testing
– Complete development environment configured, standardized, used throughout
Data Challenge 1 Results

Data transfer
– 70 megabytes/second data transfers
– >15% of the LSST transfer rate

Pipeline execution
– 192 CCDs (0.1 - 1.0 gigabytes each) per run, processed across 16 nodes / 32 Itanium CPUs with a latency and throughput of approximately 141.5 seconds
– >42% of the LSST per-node image processing rate

Data ingest
– 6.1 megabytes/second source data ingest
– >100% of the LSST required ingest rate at the Base Facility
Data Challenge 2 Pipeline Results

Performance pipeline runs:
  runId     nVisits  nAmps  inputImages  outputImages  outputFrac
  rlp0127        53     36         1908          1875       0.983
  rlp0128        62     36         2232          2228       0.998
  rlp0130        62     36         2232          2213       0.991
  Total           -    108         6372          6316       0.991

Average time to process one visit (# visits = 177)*:
                                       avg. time (s)   3 sigma (s)
  Image Processing/Detection (IPD)             207.2         108.2
  Moving Objects (MOPS)                         52.9           4.1
  Association                                  207.4         108.5

Average overhead time per visit for each pipeline (# visits = 177):
                                       avg. time (s)   3 sigma (s)
  Image Processing/Detection (IPD)              1.17          1.03
  Moving Objects (MOPS)                         0.27          0.20
  Association                                   0.34          0.26

* Post-DC2 algorithm tweaks improved the IPD time by a factor of 6.
Data Challenge 2 Data Access Results
– Read one slice's 10.3 MB Exposure and 10.3 MB template Exposure from FITS files on an NFS file system: 9.0 s
– Write one slice's 10.3 MB Exposure to a FITS file on the NFS file system and its metadata to the database: 6.5 s
– Write one slice's 10.3 MB difference Exposure to a FITS file on the NFS file system: 6.4 s
– Write one slice's DIASources to the database (average number of DIASources = 36): 0.05 s
– Read all DIASources from the database (average number of DIASources = 1321): 0.16 s
– Write DIASource matches to the database (average number of match pairs = 1082): 0.03 s
– Write new Object identifier pairs to the database (average number of pairs = 537): 0.02 s