
DC2 Postmortem Association Pipeline

AP Architecture
3 main phases for each visit:
–Load: current knowledge of the FOV into memory
–Compute:
  match difference sources to objects
  match moving object predictions to difference sources
  create and update objects based on match results
–Store: updated/new objects, difference sources
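
A minimal C++ sketch of the per-visit control flow above; every type and helper here is a hypothetical stand-in for the real pipeline code, with I/O and matching stubbed out:

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Fov {};        // ra/dec bounds of a visit's field of view
    struct Object {};     // entry in the Object catalog
    struct DiaSource {};  // difference source for the visit

    // Stubs standing in for the real chunk-file, database, and matching code.
    std::vector<Object> loadFov(const Fov&) { return {}; }
    std::vector<DiaSource> readDiaSources(const Fov&) { return {}; }
    std::vector<std::pair<std::size_t, std::size_t>> matchSourcesToObjects(
        const std::vector<DiaSource>&, const std::vector<Object>&) { return {}; }
    void store(const std::vector<Object>&, const std::vector<DiaSource>&) {}

    void processVisit(const Fov& fov) {
        // Load: current knowledge of the FOV into memory.
        std::vector<Object> objects = loadFov(fov);
        // Compute: spatial matching; object creation/update is omitted here.
        std::vector<DiaSource> sources = readDiaSources(fov);
        auto matches = matchSourcesToObjects(sources, objects);
        (void)matches;
        // Store: updated/new objects and the visit's difference sources.
        store(objects, sources);
    }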

AP Architecture: Database + files
Historical Object catalog:
–master copy in database: updated/appended to, never read in DC2
–thin slice of the Object catalog in files
  primary purpose: efficient retrieval of positions for spatial matching
  kept in sync with the db copy, updated by the AP; updates are easy since positions never change
–difference sources and mops predictions in database, read into memory by custom C++ code when needed
–spatial cross-match in custom C++ code; all other processing (Object catalog updates) implemented using SQL scripts, in-memory tables, etc.
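
For illustration, a plausible record layout for those thin-slice chunk files; the field names are guesses, not the DC2 schema:

    #include <cstdint>

    // Only what spatial matching needs is kept on disk; since positions
    // never change, the file copy is easy to keep in sync with the db.
    struct ChunkRecord {
        std::int64_t objectId;  // key back into the master Object catalog
        double ra, dec;         // position in degrees
    };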

AP Performance: Data Volume
DC2:
–417,327 objects in the Object catalog (all in a single FOV)
–FOV is ~1 deg²
–~5k difference sources per FOV worst case
–22 moving object predictions worst case
Production:
–up to 10 million objects per FOV; 14 to 49 billion objects total (DR1 - DR11)
–FOV is 10 deg²
–100k difference sources per FOV worst case
–2.5k moving object predictions per FOV worst case

AP Architecture: Load
–Sky is partitioned into chunks (ra/dec boxes), illustrated after this list. For each chunk:
  1 file stores objects from the input Object catalog (i.e. the product of Deep Detect) within the chunk
  1 delta file stores new objects accumulated during visit processing
–Multiple slices read these chunk files in parallel
  DC2: one slice only; files read/written over NFS; all visits have essentially the same FOV (in terms of chunks)
–Objects are loaded into a shared memory region
–Master creates a spatial index for objects on the fly
  all slices must run on the same machine as the master
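
An illustrative mapping from a position to its chunk (ra/dec box); the 5-degree box size and the id scheme are assumptions for the sketch, not the DC2 parameters:

    #include <cstdio>

    constexpr double kChunkDeg = 5.0;  // assumed box size in degrees
    constexpr int kRaBoxes = static_cast<int>(360.0 / kChunkDeg);

    // Dec bands stacked from the south pole, ra boxes within each band;
    // each id owns one base chunk file plus one delta file.
    int chunkId(double raDeg, double decDeg) {
        int decBand = static_cast<int>((decDeg + 90.0) / kChunkDeg);
        int raBox = static_cast<int>(raDeg / kChunkDeg) % kRaBoxes;
        return decBand * kRaBoxes + raBox;
    }

    int main() {
        std::printf("chunk for (ra=120.3, dec=-35.7): %d\n",
                    chunkId(120.3, -35.7));
    }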

AP Performance: Load
Reading chunk files (object positions):
–timing is included for completeness but is not very meaningful
–data lives on an NFS volume, so there is contention with other pipelines
–but the same chunks are read every visit, so reads are served from cache

AP Performance: Load
Building the zone index for objects:
–on the order of seconds per visit on average
–time increases over consecutive visits since new objects are being created
–visit 1: 417,327 objects; visit 62: ~450k objects (depends on run)
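
A minimal sketch of a zone index of the kind being timed here: objects are bucketed into fixed-height declination zones and ra-sorted within each zone, so a match probe only touches a few zones. The zone height is an assumed parameter:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Pos { double ra, dec; };  // degrees

    struct ZoneIndex {
        double zoneHeight;                    // degrees per zone
        std::vector<std::vector<Pos>> zones;  // one ra-sorted bucket per zone

        ZoneIndex(double heightDeg, const std::vector<Pos>& objects)
            : zoneHeight(heightDeg),
              zones(static_cast<std::size_t>(180.0 / heightDeg) + 1) {
            for (const Pos& p : objects) zones[zoneOf(p.dec)].push_back(p);
            for (auto& z : zones)
                std::sort(z.begin(), z.end(),
                          [](const Pos& a, const Pos& b) { return a.ra < b.ra; });
        }

        std::size_t zoneOf(double dec) const {
            return static_cast<std::size_t>((dec + 90.0) / zoneHeight);
        }
    };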

AP Architecture: Compute
–Read difference sources coming from IPDP (via database)
–Build spatial index for difference sources
–Match difference sources against objects (spatial match only)
–Write difference-source-to-object matches to database
–Read moving object predictions from database
–Match them to difference sources that didn't match a known variable object (spatial match only)
–Create objects from difference sources that didn't match any object
  DC2: moving object predictions not taken into account
–Write mops-prediction-to-difference-source matches and new objects to database
Everything runs on the master process. Index building and matching can be multi-threaded (OpenMP) if necessary, but aren't for DC2.
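
A sketch of the zone-based spatial match, reusing the ZoneIndex above: each difference source only scans the zones its match radius can reach. The small-angle distance test, the missing ra wrap-around handling, and the radius parameter are simplifications; the real matcher can also narrow each zone to an ra window via the ra-sorted order:

    #include <cmath>

    constexpr double kDegPerRad = 57.29577951308232;

    std::vector<const Pos*> matchOne(const ZoneIndex& idx, const Pos& src,
                                     double radiusDeg) {
        std::vector<const Pos*> hits;
        std::size_t zMin = idx.zoneOf(std::max(-90.0, src.dec - radiusDeg));
        std::size_t zMax = idx.zoneOf(std::min(90.0, src.dec + radiusDeg));
        for (std::size_t z = zMin; z <= zMax; ++z) {
            for (const Pos& obj : idx.zones[z]) {
                // Flat-sky angular distance test against the match radius.
                double dRa = (obj.ra - src.ra) * std::cos(src.dec / kDegPerRad);
                double dDec = obj.dec - src.dec;
                if (dRa * dRa + dDec * dDec <= radiusDeg * radiusDeg)
                    hits.push_back(&obj);
            }
        }
        return hits;
    }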

AP Performance: Compute
Reading difference sources from database

AP Performance: Compute
Building the zone index for difference sources

AP Performance: Compute
Matching difference sources to objects

AP Performance: Compute
Runs rlp0128/rlp0130: matched objects were matched an average of 1.69/2.01 times in r and 2.12/2.45 times in u over 62 visits

AP Performance: Compute
Writing difference-source matches to database

AP Performance: Compute
Reading mops predictions from database

AP Performance: Compute
Matching mops predictions to difference sources: ~0.1 ms (only ~20 predictions, and error ellipses are clamped to 10 arcseconds)
Creating new objects

AP Performance: Compute
Writing mops predictions and new objects to database

AP Architecture: Store
–Multiple slices write chunk delta files in parallel
  these files contain positions for objects accumulated during visit processing
  DC2: 1 slice, data lives on NFS
–Master launches MySQL scripts that use the database outputs of the compute phase (sketched after this list) to:
  update the Object catalog; for DC2 this means
  –# of times an object was observed in a given filter
  –latest observation time of an object
  insert new objects into the Object catalog
  append difference sources for the visit to the DIASource table
  append various per-visit result tables to historical tables (for debugging)
  drop per-visit tables
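
A sketch of that script handoff: the master writes a per-visit SQL script that folds the compute outputs into the permanent tables, then runs it through the mysql client. Table and column names here are guesses for illustration, not the DC2 schema:

    #include <fstream>
    #include <string>

    void writeStoreScript(const std::string& path) {
        std::ofstream sql(path);
        // Bump per-filter observation counts and latest observation times.
        sql << "UPDATE Object o JOIN _MatchPair m ON o.objectId = m.objectId\n"
               "   SET o.rNumObs = o.rNumObs + 1,\n"
               "       o.latestObsTime = m.obsTime;\n"
            // Fold this visit's new objects and difference sources in.
            << "INSERT INTO Object SELECT * FROM _NewObject;\n"
            << "INSERT INTO DIASource SELECT * FROM _DIASource;\n"
            // Keep per-visit results for debugging, then drop the tables.
            << "INSERT INTO MatchPairHist SELECT * FROM _MatchPair;\n"
            << "DROP TABLE _MatchPair, _NewObject, _DIASource;\n";
    }
    // The master would then run something like: mysql dc2 < store.sql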

AP Performance: Store
Writing chunk delta files (positions for objects created during visit processing):
–wild swings in timing due to NFS contention
–IPDP is often loading science/template exposures just as the AP ends (the AP was configured to write chunk deltas to an NFS volume)

AP Performance: Store
Updating the historical Object catalog:
–long times, again due to NFS contention
–the AP writes SQL script files to NFS and the mysql client reads them back from NFS (while IPDP is loading exposures)
–the last visit in a run is timed without interference from other pipelines (~0.4 s)

AP Performance: Store
Database cleanup:
–append per-visit tables to per-run accumulator tables
–drop per-visit tables
–suspected culprit for the slow timings: NFS again

Conclusions
–For the small amounts of data in DC2, the AP performs to spec (1.6 s - 10 s) despite some configuration problems
–Don't use NFS
–Matching is fast, but we need runs with more input data to make strong claims about performance
–Algorithms that make use of non-spatial data need to be plugged in to really test the AP design