1 PS1 PSPS Object Data Manager Design PSPS Critical Design Review November 5-6, 2007 IfA

2 slide 2 Outline  ODM Overview  Critical Requirements Driving Design  Work Completed  Detailed Design  Spatial Querying [AS]  ODM Prototype [MN]  Hardware/Scalability [JV]  How Design Meets Requirements  WBS and Schedule  Issues/Risks [AS] = Alex, [MN] = Maria, [JV] = Jan

3 slide 3 ODM Overview The Object Data Manager will:  Provide a scalable data archive for the Pan- STARRS data products  Provide query access to the data for Pan-STARRS users  Provide detailed usage tracking and logging

4 slide 4 ODM Driving Requirements  Total size 100 TB, 1.5 × 10^11 P2 detections 8.3 × 10^10 P2 cumulative-sky (stack) detections 5.5 × 10^9 celestial objects  Nominal daily rate (divide by 3.5 x 365) P2 detections: 120 Million/day Stack detections: 65 Million/day Objects: 4.3 Million/day  Cross-Match requirement: 120 Million / 12 hrs ~ 2800 / s  DB size requirement: 25 TB / yr ~ 100 TB by end of PS1 (3.5 yrs)

5 slide 5 Work completed so far  Built a prototype  Scoped and built prototype hardware  Generated simulated data 300M SDSS DR5 objects, 1.5B Galactic plane objects  Initial Load done – Created 15 TB DB of simulated data Largest astronomical DB in existence today  Partitioned the data correctly using Zones algorithm  Able to run simple queries on distributed DB  Demonstrated critical steps of incremental loading  It is fast enough Cross-match > 60k detections/sec Required rate is ~3k/sec

6 slide 6 Detailed Design  Reuse SDSS software as much as possible  Data Transformation Layer (DX) – Interface to IPP  Data Loading Pipeline (DLP)  Data Storage (DS) Schema and Test Queries Database Management System Scalable Data Architecture Hardware  Query Manager (QM: CasJobs for prototype)

7 slide 7 High-Level Organization

8 slide 8 Detailed Design  Reuse SDSS software as much as possible  Data Transformation Layer (DX) – Interface to IPP  Data Loading Pipeline (DLP)  Data Storage (DS) Schema and Test Queries Database Management System Scalable Data Architecture Hardware  Query Manager (QM: CasJobs for prototype)

9 slide 9 Data Transformation Layer (DX)  Based on SDSS sqlFits2CSV package LINUX/C++ application FITS reader driven off header files  Convert IPP FITS files to ASCII CSV format for ingest (initially) SQL Server native binary later (3x faster)  Follow the batch and ingest verification procedure described in ICD 4-step batch verification Notification and handling of broken publication cycle  Deposit CSV or Binary input files in directory structure Create “ready” file in each batch directory  Stage input data on LINUX side as it comes in from IPP
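For illustration only, here is a minimal sketch of how a converted batch file could be bulk-loaded into SQL Server. The database, table, and file names are hypothetical placeholders, not the actual DX/DLP names, and the real pipeline drives this through its own stored procedures:

-- Hypothetical CSV ingest of one batch file into a staging table
BULK INSERT LoadDB.dbo.Detections_Staging
FROM '\\loadserver\staging\Job123_Batch456\detections.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

-- The native binary path (roughly 3x faster, per the slide) would use a
-- file produced by bcp in native format:
BULK INSERT LoadDB.dbo.Detections_Staging
FROM '\\loadserver\staging\Job123_Batch456\detections.bin'
WITH (DATAFILETYPE = 'native', TABLOCK);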

10 slide 10 DX Subtasks  Initialization Job: FITS schema, FITS reader, CSV Converter, CSV Writer  Batch Ingest: Interface with IPP, Naming convention, Uncompress batch, Read batch, Verify Batch  Batch Verification: Verify Manifest, Verify FITS Integrity, Verify FITS Content, Verify FITS Data, Handle Broken Cycle  Batch Conversion: CSV Converter, Binary Converter, "batch_ready", Interface with DLP

11 slide 11 DX-DLP Interface  Directory structure on staging FS (LINUX): Separate directory for each JobID_BatchID Contains a “batch_ready” manifest file –Name, #rows and destination table of each file Contains one file per destination table in ODM –Objects, Detections, other tables  Creation of “batch_ready” file is signal to loader to ingest the batch  Batch size and frequency of ingest cycle TBD

12 slide 12 Detailed Design  Reuse SDSS software as much as possible  Data Transformation Layer (DX) – Interface to IPP  Data Loading Pipeline (DLP)  Data Storage (DS) Schema and Test Queries Database Management System Scalable Data Architecture Hardware  Query Manager (QM: CasJobs for prototype)

13 slide 13 Data Loading Pipeline (DLP)  sqlLoader – SDSS data loading pipeline Pseudo-automated workflow system Loads, validates and publishes data –From CSV to SQL tables Maintains a log of every step of loading Managed from Load Monitor Web interface  Has been used to load every SDSS data release EDR, DR1-6, ~ 15 TB of data altogether Most of it (since DR2) loaded incrementally Kept many data errors from getting into database –Duplicate ObjIDs (symptom of other problems) –Data corruption (CSV format invaluable in catching this)

14 slide 14 sqlLoader Design  Existing functionality Shown for SDSS version Workflow, distributed loading, Load Monitor  New functionality Schema changes Workflow changes Incremental loading –Cross-match and partitioning

15 slide 15 sqlLoader Workflow  Distributed design achieved with linked servers and SQL Server Agent  LOAD stage can be done in parallel by loading into temporary task databases  PUBLISH stage writes from task DBs to final DB  FINISH stage creates indices and auxiliary (derived) tables  Loading pipeline is a system of VB and SQL scripts, stored procedures and functions

16 slide 16 Load Monitor Tasks Page

17 slide 17 Load Monitor Active Tasks

18 slide 18 Load Monitor Statistics Page

19 slide 19 Load Monitor – New Task(s)

20 slide 20 Data Validation  Tests for data integrity and consistency  Scrubs data and finds problems in upstream pipelines  Most of the validation can be performed within the individual task DB (in parallel)  Validation tests: Test Uniqueness of Primary Keys (the unique key in each table), Test Foreign Keys (consistency of keys that link tables), Test Cardinalities (consistency of numbers of various quantities), Test HTM IDs (the Hierarchical Triangular Mesh IDs used for spatial indexing), Test Link Table Consistency (ensure that links are consistent)
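As a sketch of what two of these checks could look like in T-SQL (the table and column names are hypothetical stand-ins for the PS1 task-DB schema):

-- Hypothetical primary-key uniqueness test: list duplicate objIDs, if any
SELECT objID, COUNT(*) AS n
FROM taskDB.dbo.Objects
GROUP BY objID
HAVING COUNT(*) > 1;

-- Hypothetical foreign-key test: detection links that reference a missing object
SELECT COUNT(*) AS danglingLinks
FROM taskDB.dbo.P2ToObj l
LEFT JOIN taskDB.dbo.Objects o ON o.objID = l.objID
WHERE o.objID IS NULL;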

21 slide 21 Distributed Loading (diagram: a Master server running LoadAdmin with the master and publish schemas, and Slave servers running LoadSupport with task DBs that hold task data and a view of the master schema; input comes from Samba-mounted CSV/Binary files, the Load Monitor drives the workflow, and the Publish Data and Finish steps write into the publish database)

22 slide 22 Schema Changes  Schema in task and publish DBs is driven off a list of schema DDL files to execute ( xschema.txt )  Requires replacing DDL files in schema/sql directory and updating xschema.txt with their names  PS1 schema DDL files have already been built  Index definitions have also been created  Metadata tables will be automatically generated using metadata scripts already in the loader

23 slide 23 Workflow Changes (diagram: LOAD stage = Export → Check CSVs → Create Task DBs → Build SQL Schema → Validate → XMatch; PUBLISH stage = Partition)  Cross-Match and Partition steps will be added to the workflow  Cross-match will match detections to objects  Partition will horizontally partition data, move it to slice servers, and build DPVs on main

24 slide 24 Matching Detections with Objects  Algorithm described fully in prototype section  Stored procedures to cross-match detections will be part of the LOAD stage in loader pipeline  Vertical partition of Objects table kept on load server for matching with detections  Zones cross-match algorithm used to do 1” and 2” matches  Detections with no matches saved in Orphans table

25 slide 25 XMatch and Partition Data Flow (diagram: on the loadsupport server, Load Detections and XMatch against ObjZoneIndx produce Detections_In, LinkToObj_In and Orphans; Pull Chunk yields Detections_chunk and LinkToObj_chunk; on a slice (Pm), Merge Partitions builds Detections_m and LinkToObj_m, Update Objects builds Objects_m; Pull Partition and Switch Partition move Objects_m and LinkToObj_m into the live Objects, LinkToObj and Detections tables on PS1)

26 slide 26 Detailed Design  Reuse SDSS software as much as possible  Data Transformation Layer (DX) – Interface to IPP  Data Loading Pipeline (DLP)  Data Storage (DS) Schema and Test Queries Database Management System Scalable Data Architecture Hardware  Query Manager (QM: CasJobs for prototype)

27 slide 27 Data Storage – Schema

28 slide 28 PS1 Table Sizes Spreadsheet

29 slide 29 PS1 Table Sizes - All Servers
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            4.63     4.63     4.61     4.59
StackPsfFits       5.08    10.16    15.20    17.76
StackToObj         1.84     3.68     5.56     6.46
StackModelFits     1.16     2.32     3.40     3.96
P2PsfFits          7.88    15.76    23.60    27.60
P2ToObj            2.65     5.31     8.00     9.35
Other Tables       3.41     6.94    10.52    12.67
Indexes +20%       5.33     9.76    14.18    16.48
Total             31.98    58.56    85.07    98.87
Sizes are in TB

30 slide 30 Data Storage – Test Queries  Drawn from several sources Initial set of SDSS 20 queries SDSS SkyServer Sample Queries Queries from PS scientists (Monet, Howell, Kaiser, Heasley)  Two objectives Find potential holes/issues in schema Serve as test queries –Test DBMS integrity –Test DBMS performance  Loaded into CasJobs (Query Manager) as sample queries for prototype

31 slide 31 Data Storage – DBMS  Microsoft SQL Server 2005 Relational DBMS with excellent query optimizer  Plus Spherical/HTM (C# library + SQL glue) –Spatial index (Hierarchical Triangular Mesh) Zones (SQL library) –Alternate spatial decomposition with dec zones Many stored procedures and functions –From coordinate conversions to neighbor search functions Self-extracting documentation (metadata) and diagnostics

32 slide 32 Documentation and Diagnostics

33 slide 33 Data Storage – Scalable Architecture  Monolithic database design (a la SDSS) will not do it  SQL Server does not have a cluster implementation, so we do it by hand  Partitions vs Slices Partitions are file-groups on the same server –Parallelize disk accesses on the same machine Slices are data partitions on separate servers We use both!  Additional slices can be added for scale-out  For PS1, use SQL Server Distributed Partition Views (DPVs)

34 slide 34 Distributed Partitioned Views  Difference between DPVs and file-group partitioning FG on same database DPVs on separate DBs FGs are for scale-up DPVs are for scale-out  Main server has a view of a partitioned table that includes remote partitions (we call them slices to distinguish them from FG partitions)  Accomplished with SQL Server’s linked server technology  NOT truly parallel, though
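A minimal sketch of the DPV pattern, assuming linked servers named Slice1–Slice3 and per-slice detection tables that carry CHECK constraints on the partitioning column; the actual PS1 view, database, and server names may differ:

-- Hypothetical distributed partitioned view on the head node.
-- Each remote table has a CHECK constraint on its objID range, so the
-- optimizer can prune slices that cannot contain the requested rows.
CREATE VIEW dbo.Detections
AS
SELECT * FROM Slice1.PS1.dbo.Detections_S1
UNION ALL
SELECT * FROM Slice2.PS1.dbo.Detections_S2
UNION ALL
SELECT * FROM Slice3.PS1.dbo.Detections_S3;
GO
-- A query constrained on objID is routed only to the matching slice:
SELECT COUNT(*)
FROM dbo.Detections
WHERE objID BETWEEN 100000000000 AND 199999999999;  -- hypothetical range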

35 slide 35 Scalable Data Architecture  Shared-nothing architecture  Detections split across cluster  Objects replicated on Head and Slice DBs  DPVs of Detections tables on the Headnode DB  Queries on Objects stay on head node  Queries on detections use only local data on slices (diagram: the Head node holds Objects and a Detections DPV over Detections_S1, Detections_S2 and Detections_S3; slices S1–S3 each hold the replicated Objects_S1–S3 plus their local Detections_Sx table)

36 slide 36 Hardware - Prototype (configuration diagram)  Server naming convention: PS0x = 4-core, PS1x = 8-core  Functions: LX = Linux (staging, PS01); L = Load server (L1 PS13, L2/M PS05); S/Head = DB server (Head PS11, S1 PS12, S2 PS03, S3 PS04); M = MyDB server; W = Web server (PS02)  Capacities: Staging 10 TB, Loading 9 TB, DB/MyDB 39 TB, Web 0 TB  Storage: 10A = 10 x [13 x 750 GB], 3B = 3 x [12 x 500 GB]; RAID5 and RAID10 configurations, 14D/3.5W and 12D/4W disk/rack configs

37 slide 37 Hardware – PS1 (diagram: ping-pong between Live, Offline, and Spare copies; queries run against the live copy while ingest and replication run against the offline copy)  Ping-pong configuration to maintain high availability and query performance  2 copies of each slice and of main (head) node database on fast hardware (hot spares)  3rd spare copy on slow hardware (can be just disk)  Updates/ingest on offline copy, then switch copies when ingest and replication finished  Synchronize second copy while first copy is online  Both copies live when no ingest  3x basic config. for PS1

38 slide 38 Detailed Design  Reuse SDSS software as much as possible  Data Transformation Layer (DX) – Interface to IPP  Data Loading Pipeline (DLP)  Data Storage (DS) Schema and Test Queries Database Management System Scalable Data Architecture Hardware  Query Manager (QM: CasJobs for prototype)

39 slide 39 Query Manager  Based on SDSS CasJobs  Configure to work with distributed database, DPVs  Direct links (contexts) to slices can be added later if necessary  Segregates quick queries from long ones  Saves query results server-side in MyDB  Gives users a powerful query workbench  Can be scaled out to meet any query load  PS1 Sample Queries available to users  PS1 Prototype QM demo

40 slide 40 ODM Prototype Components  Data Loading Pipeline  Data Storage  CasJobs Query Manager (QM) Web Based Interface (WBI)  Testing

41 slide 41 Spatial Queries (Alex)

42 slide 42 Spatial Searches in the ODM

43 slide 43 Common Spatial Questions Points in region queries 1. Find all objects in this region 2. Find all "good" objects (not in masked areas) 3. Is this point in any of the regions? Region in region 4. Find regions near this region and their area 5. Find all objects with error boxes intersecting region 6. What is the common part of these regions? Various statistical operations 7. Find the object counts over a given region list 8. Cross-match these two catalogs in the region

44 slide 44 Sky Coordinates of Points  Many different coordinate systems Equatorial, Galactic, Ecliptic, Supergalactic  Longitude-latitude constraints  Searches often in mix of different coordinate systems gb>40 and dec between 10 and 20 Problem: coordinate singularities, transformations  How can one describe constraints in an easy, uniform fashion?  How can one perform fast database queries in an easy fashion? Fast: indexes Easy: simple query expressions

45 slide 45 Describing Regions Spacetime metadata for the VO (Arnold Rots)  Includes definitions of Constraint: single small or great circle Convex: intersection of constraints Region: union of convexes  Support both angles and Cartesian descriptions  Constructors for CIRCLE, RECTANGLE, POLYGON, CONVEX HULL  Boolean algebra (INTERSECTION, UNION, DIFF)  Proper language to describe the abstract regions  Similar to GIS, but much better suited for astronomy

46 slide 46 Things Can Get Complex

47 slide 47 We Do Spatial 3 Ways  Hierarchical Triangular Mesh (extension to SQL) Uses table valued functions Acts as a new “spatial access method”  Zones: fits SQL well Surprisingly simple & good  3D Constraints: a novel idea Algebra on regions, can be implemented in pure SQL

48 slide 48 PS1 Footprint  Using the projection cell definitions as centers for tessellation (T. Budavari)

49 slide 49 CrossMatch: Zone Approach  Divide space into declination zones  Objects ordered by zoneid, ra (on the sphere need wrap-around margin)  Point search looks in neighboring zones within a ~ (ra ± Δ) bounding box  All inside the relational engine  Avoids "impedance mismatch"  Can "batch" comparisons  Automatically parallel  Details in Maria's thesis (diagram: search radius r, zone bounding box of half-width ra ± Δ)
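As a sketch of the zone-based point search described above, assuming an Objects table with zoneID, ra and unit-vector (cx, cy, cz) columns; the RA margin below is a simple over-estimate and the RA wrap-around at 0/360 degrees is ignored:

-- Hypothetical neighbors-of-a-point search within @r degrees
DECLARE @ra float, @dec float, @r float, @zoneHeight float, @alpha float;
DECLARE @cx float, @cy float, @cz float;
SELECT @ra = 101.287, @dec = -16.716, @r = 0.001, @zoneHeight = 8.0/3600.0;
SET @alpha = @r / COS(RADIANS(ABS(@dec) + @r));  -- widened RA window (breaks near the poles)
SELECT @cx = COS(RADIANS(@dec)) * COS(RADIANS(@ra)),
       @cy = COS(RADIANS(@dec)) * SIN(RADIANS(@ra)),
       @cz = SIN(RADIANS(@dec));
SELECT o.objID
FROM dbo.Objects o
WHERE o.zoneID BETWEEN FLOOR((@dec - @r + 90.0) / @zoneHeight)
                   AND FLOOR((@dec + @r + 90.0) / @zoneHeight)
  AND o.ra BETWEEN @ra - @alpha AND @ra + @alpha
  AND o.cx*@cx + o.cy*@cy + o.cz*@cz > COS(RADIANS(@r));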

50 slide 50 Indexing Using Quadtrees  Cover the sky with hierarchical pixels  COBE – start with a cube  Hierarchical Triangular Mesh (HTM) uses trixels (Samet, Fekete)  Start with an octahedron, and split each triangle into 4 children, down to 20 levels deep  Smallest triangles are 0.3”  Each trixel has a unique htmID (diagram: trixel 2 splits into 2,0–2,3 and trixel 2,3 into 2,3,0–2,3,3)

51 slide 51 Space-Filling Curve (diagram: nested trixels 100–133 and their htmID ranges, e.g. [0.120,0.121), [0.121,0.122), [0.122,0.123), [0.12,0.13))  Triangles correspond to ranges  All points inside the triangle are inside the range

52 slide 52 SQL HTM Extension  Every object has a 20-deep htmID (44 bits)  Clustered index on htmID  Table-valued functions for spatial joins Given a region definition, routine returns up to 10 ranges of covering triangles Spatial query is mapped to ~10 range queries  Current implementation rewritten in C#  Excellent performance, little calling overhead  Three layers General geometry library HTM kernel IO (parsing + SQL interface)

53 slide 53 Writing Spatial SQL
-- region description is contained by @area
DECLARE @cover TABLE (htmStart bigint, htmEnd bigint)
INSERT @cover SELECT * FROM dbo.fHtmCover(@area)
--
DECLARE @region TABLE (convexId bigint, x float, y float, z float, c float)
INSERT @region SELECT * FROM dbo.fGetHalfSpaces(@area)
--
SELECT o.ra, o.dec, 1 AS flag, o.objid
FROM (SELECT objID AS objid, cx, cy, cz, ra, [dec]
      FROM Objects q
      JOIN @cover AS c ON q.htmID BETWEEN c.htmStart AND c.htmEnd
     ) AS o
WHERE NOT EXISTS (
      SELECT p.convexId FROM @region AS p
      WHERE (o.cx*p.x + o.cy*p.y + o.cz*p.z < p.c)
      GROUP BY p.convexId
)

54 slide 54 Status  All three libraries extensively tested  Zones used for Maria’s thesis, plus various papers  New HTM code in production use since July on SDSS  Same code also used by STScI HLA, Galex  Systematic regression tests developed  Footprints computed for all major surveys  Complex mask computations done on SDSS  Loading: zones used for bulk crossmatch  Ad hoc queries: use HTM-based search functions  Excellent performance

55 slide 55 Prototype (Maria)

56 slide 56 PS1 PSPS Object Data Manager Design PSPS Critical Design Review November 5-6, 2007 IfA

57 slide 57 Detail Design  General Concepts  Distributed Database architecture  Ingest Workflow  Prototype

58 slide 58 Zones (spatial partitioning and indexing algorithm)  Partition and bin the data into declination zones ZoneID = floor ((dec + 90.0) / zoneHeight)  Few tricks required to handle spherical geometry  Place the data close on disk Cluster Index on ZoneID and RA  Fully implemented in SQL  Efficient Nearby searches Cross-Match (especially)  Fundamental role in addressing the critical requirements Data volume management Association Speed Spatial capabilities (diagram: the sky divided into declination zones, axes Right Ascension (RA) and Declination (Dec))
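A minimal sketch of such a zoned table (the column list is heavily abbreviated relative to the real PS1 schema):

-- Hypothetical zoned table: zoneID derived from dec, clustered on (zoneID, ra)
-- so that spatially nearby objects are stored close together on disk.
CREATE TABLE dbo.ObjectsZone (
    objID  bigint NOT NULL,
    zoneID int    NOT NULL,   -- floor((dec + 90.0) / zoneHeight)
    ra     float  NOT NULL,
    [dec]  float  NOT NULL,
    cx     float  NOT NULL,   -- unit-vector coordinates for distance tests
    cy     float  NOT NULL,
    cz     float  NOT NULL
);
CREATE CLUSTERED INDEX IX_ObjectsZone_zone_ra ON dbo.ObjectsZone (zoneID, ra);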

59 slide 59 Zoned Table
ObjID  ZoneID*  RA     Dec    CX  CY  CZ  …
1      0        0.0    -90.0
2      20250    180.0  0.0
3      20250    181.0  0.0
4      40500    360.0  +90.0
* ZoneHeight = 8 arcsec in this example
ZoneID = floor ((dec + 90.0) / zoneHeight)

60 slide 60 SQL CrossNeighbors
SELECT *
FROM prObj1 z1
JOIN zoneZone ZZ ON ZZ.zoneID1 = z1.zoneID
JOIN prObj2 z2 ON ZZ.zoneID2 = z2.zoneID
WHERE z2.ra BETWEEN z1.ra - ZZ.alpha AND z1.ra + ZZ.alpha
  AND z2.dec BETWEEN z1.dec - @r AND z1.dec + @r
  AND (z1.cx*z2.cx + z1.cy*z2.cy + z1.cz*z2.cz) > cos(radians(@r))

61 slide 61 Good CPU Usage

62 slide 62 Partitions  SQL Server 2005 introduces technology to handle tables which are partitioned across different disk volumes and managed by a single server.  Partitioning makes management and access of large tables and indexes more efficient Enables parallel I/O Reduces the amount of data that needs to be accessed Related tables can be aligned and collocated in the same place speeding up JOINS

63 slide 63 Partitions  2 key elements Partitioning function –Specifies how the table or index is partitioned Partitioning scheme –Using a partitioning function, the scheme specifies the placement of the partitions on filegroups  Data can be managed very efficiently using Partition Switching Add a table as a partition to an existing table Switch a partition from one partitioned table to another Reassign a partition to form a single table  Main requirement The table must be constrained on the partitioning column
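A short sketch of these two elements plus a partition switch; the boundary values, filegroup names, and table names are invented for illustration:

-- Hypothetical partition function and scheme over objID ranges
CREATE PARTITION FUNCTION pfObjID (bigint)
    AS RANGE RIGHT FOR VALUES (100000000000, 200000000000, 300000000000);
CREATE PARTITION SCHEME psObjID
    AS PARTITION pfObjID TO (fg1, fg2, fg3, fg4);  -- one filegroup per range

CREATE TABLE dbo.Detections_p (
    objID    bigint NOT NULL,
    detectID bigint NOT NULL,
    CONSTRAINT PK_Detections_p PRIMARY KEY (objID, detectID)
) ON psObjID (objID);

-- Partition switching: a fully prepared staging table (same structure and
-- indexes, on fg2, CHECK-constrained to partition 2's objID range) is
-- swapped in as a metadata-only operation.
ALTER TABLE dbo.Detections_Stage SWITCH TO dbo.Detections_p PARTITION 2;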

64 slide 64 Partitions  For the PS1 design, Partitions mean File Group Partitions Tables are partitioned into ranges of ObjectID, which correspond to declination ranges. ObjectID boundaries are selected so that each partition has a similar number of objects.

65 slide 65 Distributed Partitioned Views  Tables participating in the Distributed Partitioned View (DPV) reside in different databases, which reside on different instances or different (linked) servers

66 slide 66 Concept: Slices  In the PS1 design, the bigger tables will be partitioned across servers  To avoid confusion with the File Group Partitioning, we call them “Slices”  Data is glued together using Distributed Partitioned Views  The ODM will manage slices. Using slices improves system scalability.  For PS1 design, tables are sliced into ranges of ObjectID, which correspond to broad declination ranges. Each slice is subdivided into partitions that correspond to narrower declination ranges.  ObjectID boundaries are selected so that each slice has a similar number of objects.

67 slide 67 Detail Design Outline  General Concepts  Distributed Database architecture  Ingest Workflow  Prototype

68 slide 68 PS1 Distributed DB system (diagram)  Main PS1 database: full PartitionsMap, Objects, LnkToObj and Meta tables, plus a Detections partitioned view  Slices P1…Pm (linked servers): [Objects_p], [LnkToObj_p] and [Detections_p] partitioned tables plus Meta  Loaders (LoadAdmin, Load Support 1…n): objZoneIndx, Orphans_l1…ln, Detections_l1…ln, LnkToObj_l1…ln, detections and PartitionsMap  Access through the Query Manager (QM) and Web Based Interface (WBI)  Legend: database, full table, [partitioned table], output table, partitioned view

69 slide 69 Design Decisions: ObjID  Objects have their positional information encoded in their objID fGetPanObjID (ra, dec, zoneH) ZoneID is the most significant part of the ID  It gives scalability, performance, and spatial functionality  Object tables are range partitioned according to their object ID

70 slide 70 ObjectID Clusters Data Spatially  ObjectID = 087941012871550661  Dec = –16.71611583°  ZH = 0.008333°  ZID = (Dec+90) / ZH = 08794.0661  RA = 101.287155°  ObjectID is unique when objects are separated by > 0.0043 arcsec
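A hedged sketch of how such an encoding might be computed in T-SQL. The digit layout (zone number, then RA, then the fractional zone offset) is only inferred from the worked example above; the real fGetPanObjID may use a different layout:

-- Hypothetical objID builder: zoneID in the most significant digits,
-- then RA scaled to microdegrees, then the sub-zone dec offset.
CREATE FUNCTION dbo.fGetObjID_sketch (@ra float, @dec float, @zoneH float)
RETURNS bigint
AS
BEGIN
    DECLARE @zid float;
    SET @zid = (@dec + 90.0) / @zoneH;                 -- e.g. 8794.0661
    RETURN CAST(FLOOR(@zid) AS bigint) * CAST(10000000000000 AS bigint)
         + CAST(ROUND(@ra * 1000000, 0) AS bigint) * 10000
         + CAST(ROUND((@zid - FLOOR(@zid)) * 10000, 0) AS bigint);
END
-- SELECT dbo.fGetObjID_sketch(101.287155, -16.71611583, 1.0/120)
-- reproduces the example ID above up to rounding in the last digits.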

71 slide 71 Design Decisions: DetectID  Detections have their positional information encoded in the detection identifier fGetDetectID (dec, observationID, runningID, zoneH) Primary key (objID, detectionID), to align detections with objects within partitions Provides efficient access to all detections associated to one object Provides efficient access to all detections of nearby objects

72 slide 72 DetectionID Clusters Data in Zones  DetectID = 0879410500001234567  Dec = –16.71611583°  ZH = 0.008333°  ZID = (Dec+90) / ZH = 08794.0661  ObservationID = 1050000  Running ID = 1234567

73 slide 73 ODM Capacity 5.3.1.3 The PS1 ODM shall be able to ingest into the ODM a total of 1.5 × 10^11 P2 detections, 8.3 × 10^10 cumulative sky (stack) detections, and 5.5 × 10^9 celestial objects, together with their linkages.

74 slide 74 PS1 Table Sizes - Monolithic
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            2.31     2.31     2.31     2.31
StackPsfFits       5.07    10.16    15.20    17.74
StackToObj         0.92     1.84     2.76     3.22
StackModelFits     1.15     2.29     3.44     4.01
P2PsfFits          7.87    15.74    23.61    27.54
P2ToObj            1.33     2.67     4.00     4.67
Other Tables       3.19     6.03     8.87    10.29
Indexes +20%       4.37     8.21    12.04    13.96
Total             26.21    49.24    72.23    83.74
Sizes are in TB

75 slide 75 What goes into the main Server (diagram: the main PS1 database holds the full Objects, LnkToObj, Meta and PartitionsMap tables; slices P1…Pm are attached as linked servers. Legend: database, full table, [partitioned table], output table, distributed partitioned view)

76 slide 76 What goes into slices (diagram: each slice P1…Pm holds the partitioned tables [Objects_p], [LnkToObj_p] and [Detections_p] plus PartitionsMap and Meta; the main PS1 database keeps the full Objects, LnkToObj, Meta and PartitionsMap tables. Legend: database, full table, [partitioned table], output table, distributed partitioned view)

77 slide 77 What goes into slices (diagram, identical to slide 76)

78 slide 78 Duplication of Objects & LnkToObj  Objects are distributed across slices  Objects, P2ToObj, and StackToObj are duplicated in the slices to parallelize "inserts" & "updates"  Detections belong in their object's slice  Orphans belong to the slice where their position would allocate them Orphans near slices' boundaries will need special treatment  Objects keep their original object identifier Even though positional refinement might change their zoneID and therefore the most significant part of their identifier

79 slide 79 Glue = Distributed Views (diagram: the main PS1 database exposes a Detections distributed partitioned view over the [Detections_p1]…[Detections_pm] tables on the slices; the main database holds the full Objects, LnkToObj, Meta and PartitionsMap tables. Legend: database, full table, [partitioned table], output table, distributed partitioned view)

80 slide 80 Partitioning in Main Server (diagram: PS1 main database and slices P1…Pm as linked servers, accessed through the Query Manager (QM) and Web Based Interface (WBI))  Main server is partitioned (objects) and collocated (lnkToObj) by objid  Slices are partitioned (objects) and collocated (lnkToObj) by objid

81 slide 81 PS1 Table Sizes - Main Server
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            2.31     2.31     2.31     2.31
StackPsfFits          -        -        -        -
StackToObj         0.92     1.84     2.76     3.22
StackModelFits        -        -        -        -
P2PsfFits             -        -        -        -
P2ToObj            1.33     2.67     4.00     4.67
Other Tables       0.41     0.46     0.52     0.55
Indexes +20%       0.99     1.46     1.92     2.15
Total              5.96     8.74    11.51    12.90
Sizes are in TB

82 slide 82 PS1 Table Sizes - Each Slice
                  m=4      m=8      m=10     m=12
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            0.58     0.29     0.23     0.19
StackPsfFits       1.27     1.27     1.52     1.48
StackToObj         0.23     0.23     0.28     0.27
StackModelFits     0.29     0.29     0.34     0.33
P2PsfFits          1.97     1.97     2.36     2.30
P2ToObj            0.33     0.33     0.40     0.39
Other Tables       0.75     0.81     1.00     1.01
Indexes +20%       1.08     1.04     1.23     1.19
Total              6.50     6.23     7.36     7.16
Sizes are in TB

83 slide 83 PS1 Table Sizes - All Servers
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            4.63     4.63     4.61     4.59
StackPsfFits       5.08    10.16    15.20    17.76
StackToObj         1.84     3.68     5.56     6.46
StackModelFits     1.16     2.32     3.40     3.96
P2PsfFits          7.88    15.76    23.60    27.60
P2ToObj            2.65     5.31     8.00     9.35
Other Tables       3.41     6.94    10.52    12.67
Indexes +20%       5.33     9.76    14.18    16.48
Total             31.98    58.56    85.07    98.87
Sizes are in TB

84 slide 84 Detail Design Outline  General Concepts  Distributed Database architecture  Ingest Workflow  Prototype

85 slide 85 PS1 Distributed DB system (diagram, repeated from slide 68: main PS1 database, slices P1…Pm, loaders LoadAdmin and Load Support 1…n, accessed through the Query Manager and Web Based Interface)

86 slide 86 “Insert” & “Update”  SQL Insert and Update are expensive operations due to logging and re-indexing  In the PS1 design, Insert and Update have been re-factored into sequences of: Merge + Constrain + Switch Partition  Frequency f1: daily f2: at least monthly f3: TBD (likely to be every 6 months)
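A hedged sketch of that Merge + Constrain + Switch sequence for a single partition; the objID range, table names, and filegroup placement are placeholders, and in practice the staging table must match the target partition's structure, indexes, and filegroup exactly:

-- 1. MERGE: rebuild partition 2's content outside the live table
SELECT *
INTO dbo.Detections_B2
FROM (SELECT * FROM dbo.Detections_p
      WHERE $PARTITION.pfObjID(objID) = 2
      UNION ALL
      SELECT * FROM loaddb.dbo.Detections_NewBatch
      WHERE objID BETWEEN 100000000000 AND 199999999999) AS merged;

-- 2. CONSTRAIN: add the primary key and the range CHECK that SWITCH requires
ALTER TABLE dbo.Detections_B2
    ADD CONSTRAINT PK_Detections_B2 PRIMARY KEY (objID, detectID);
ALTER TABLE dbo.Detections_B2 WITH CHECK
    ADD CONSTRAINT CK_Detections_B2
    CHECK (objID BETWEEN 100000000000 AND 199999999999);

-- 3. SWITCH: swap the old partition out and the rebuilt one in
--    (Detections_Old2 is an empty table of identical structure)
ALTER TABLE dbo.Detections_p SWITCH PARTITION 2 TO dbo.Detections_Old2;
ALTER TABLE dbo.Detections_B2 SWITCH TO dbo.Detections_p PARTITION 2;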

87 slide 87 Ingest Workflow (diagram: CSV detections → Detect → DZone; X(1") against ObjectsZ → DXO_1a; NoMatch → X(2") → DXO_2a; Resolve → P2PsfFits, P2ToObj, Orphans)

88 slide 88 Ingest @ frequency = f1 (data-flow diagram: the LOADER's ObjectsZ, P2PsfFits, P2ToObj and Orphans tables feed P2ToPsfFits_1, P2ToObj_1 and Orphans_1 in SLICE_1 (partitions 11, 12, 13, plus Objects_1 and Stack*_1), while MAIN holds Metadata+, Objects, P2ToObj and StackToObj in partitions 1, 2, 3)

89 slide 89 Objects Updates @ frequency = f2 (diagram: same layout as slide 88, showing the LOADER's Objects table alongside SLICE_1 and MAIN)

90 slide 90 Updates @ frequency = f2 (diagram: updated Objects flow from the LOADER to Objects_1 on SLICE_1 and to Objects on MAIN)

91 slide 91 Snapshots @ frequency = f3 (diagram: a Snapshot copy of Objects is taken from MAIN, which holds Metadata+, Objects, P2ToObj and StackToObj in partitions 1, 2, 3)

92 slide 92 Batch Update of a Partition (diagram: existing partitions A1, A2, A3 are merged with the new batch; SELECT INTO … WHERE builds B1, B2, B3 with primary-key indexes; each Bi is then switched in to replace the corresponding partition)

93 slide 93 Scaling-out (diagram: two copies of the PS1 main database and of the slices P1…Pm, each slice holding its [Objects_p], [LnkToObj_p] and [Detections_p] tables, glued together by duplicate partitioned views and accessed through the Query Manager)  Apply Ping-Pong strategy to satisfy query performance during ingest: 2 x (1 main + m slices)

94 slide 94 Scaling-out (diagram: as slide 93, with a third copy of the main database and slices)  More robustness, fault-tolerance, and reliability calls for 3 x (1 main + m slices)

95 slide 95 Adding New Slices SQL Server range partitioning capabilities make it easy  Recalculate partitioning limits  Transfer data to new slices  Remove data from old slices  Define and apply new partitioning scheme  Add new partitions to main server  Apply new partitioning scheme to main server
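For the re-partitioning step, a sketch of SQL Server's partition DDL, continuing the hypothetical pfObjID/psObjID names used in the earlier sketch; the boundary value and filegroup are invented:

-- Tell the scheme which filegroup the next partition should use,
-- then split an existing objID range to create a new partition.
ALTER PARTITION SCHEME psObjID NEXT USED fg_new;
ALTER PARTITION FUNCTION pfObjID() SPLIT RANGE (150000000000);

-- After the data for a range has been moved to a new slice server,
-- the boundary can be merged away on the old one:
-- ALTER PARTITION FUNCTION pfObjID() MERGE RANGE (150000000000);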

96 slide 96 Adding New Slices

97 slide 97 Detail Design Outline  General Concepts  Distributed Database architecture  Ingest Workflow  Prototype

98 slide 98 ODM Assoc/Update Requirement  5.3.6.6 The PS1 ODM shall update the derived attributes for objects when new P2, P4 (stack), and cumulative sky detections are being correlated with existing objects.

99 slide 99 ODM Ingest Performance 5.3.1.6 The PS1 ODM shall be able to ingest the data from the IPP at two times the nominal daily arrival rate* * The nominal daily data rate from the IPP is defined as the total data volume to be ingested annually by the ODM divided by 365.  Nominal daily data rate: 1.5 × 10^11 / 3.5 / 365 = 1.2 × 10^8 P2 detections / day 8.3 × 10^10 / 3.5 / 365 = 6.5 × 10^7 stack detections / day

100 slide 100 Number of Objects
                 miniProto    myProto      Prototype    PS1
SDSS* Stars      5.7 x 10^4   1.3 x 10^7   1.1 x 10^8
SDSS* Galaxies   9.1 x 10^4   1.1 x 10^7   1.7 x 10^8
Galactic Plane   1.5 x 10^6   3 x 10^6     1.0 x 10^9
TOTAL            1.6 x 10^6   2.6 x 10^7   1.3 x 10^9   5.5 x 10^9
* "SDSS" includes a mirror of 11.3 < dec < 30 objects to dec < 0
Total GB of CSV loaded data: 300 GB
CSV Bulk insert load: 8 MB/s
Binary Bulk insert: 18-20 MB/s
Creation Started: October 15th 2007, Finished: October 29th 2007 (??)
Includes 10 epochs of P2PsfFits detections, 1 epoch of Stack detections

101 slide 101 Time to Bulk Insert from CSV
File                 Rows        RowSize  GB      Minutes  Minutes/GB
stars_plus_xai.csv   5383971     56       0.30    1.23     4.09
galaxy0_xal.csv      10000000    436      4.36    15.68    3.60
galaxy0_xam.csv      10000000    436      4.36    15.75    3.61
galaxy0_xan.csv      1721366     436      0.75    2.75     3.66
gp_6.csv             106446858   264      28.10   41.45    1.47
gp_10.csv            92019350    264      24.29   31.40    1.29
gp_11.csv            73728448    264      19.46   26.05    1.34
P2PsfFits / Day      120000000   183      21.96   59       2.7
CSV Bulk insert speed ~ 8 MB/s
BIN Bulk insert speed ~ 18 – 20 MB/s

102 slide 102 Prototype in Context
Survey                Objects      Detections
SDSS DR6              3.8 x 10^8
2MASS                 4.7 x 10^8
USNO-B                1.0 x 10^9
Prototype             1.3 x 10^9   1.4 x 10^10
PS1 (end of survey)   5.5 x 10^9   2.3 x 10^11

103 slide 103 Size of Prototype Database
Table            Main    Slice1  Slice2  Slice3  Loader  Total
Objects          1.30    0.43    0.43    0.43    1.30    3.89
StackPsfFits     6.49    -       -       -       -       6.49
StackToObj       6.49    -       -       -       -       6.49
StackModelFits   0.87    -       -       -       -       0.87
P2PsfFits        -       4.02    3.90    3.35    0.37    11.64
P2ToObj          -       4.02    3.90    3.35    0.12    11.39
Total            15.15   8.47    8.23    7.13    1.79    40.77
Extra Tables     0.87    4.89    4.77    4.22    6.86    21.61
Grand Total      16.02   13.36   13.00   11.35   8.65    62.38
Table sizes are in billions of rows

104 slide 104 Size of Prototype Database
Table             Main     Slice1   Slice2   Slice3   Loader   Total
Objects           547.6    165.4    165.3             137.1    1180.6
StackPsfFits      841.5    -        -        -        -        841.6
StackToObj        300.9    -        -        -        -        300.9
StackModelFits    476.7    -        -        -        -        476.7
P2PsfFits         -        879.9    853.0    733.5    74.7     2541.1
P2ToObj           -        125.7    121.9    104.8    3.8      356.2
Total             2166.7   1171.0   1140.2   1003.6   215.6    5697.1
Extra Tables      207.9    987.1    960.2    840.7    957.3    3953.2
Allocated / Free  1878.0   1223.0   1300.0   1121.0   666.0    6188.0
Grand Total       4252.6   3381.1   3400.4   2965.3   1838.9   15838.3
9.6 TB of data in a distributed database
Table sizes are in GB

105 slide 105 Well-Balanced Partitions
Server   Partition  Rows         Fraction  Dec Range
Main     1          432,590,598  33.34%    32.59°
Slice 1  1          144,199,105  11.11%    14.29°
Slice 1  2          144,229,343  11.11%    9.39°
Slice 1  3          144,162,150  11.12%    8.91°
Main     2          432,456,511  33.33%    23.44°
Slice 2  1          144,261,098  11.12%    8.46°
Slice 2  2          144,073,972  11.10%    7.21°
Slice 2  3          144,121,441  11.11%    7.77°
Main     3          432,496,648  33.33%    81.98°
Slice 3  1          144,270,093  11.12%    11.15°
Slice 3  2          144,090,071  11.10%    14.72°
Slice 3  3          144,136,484  11.11%    56.10°

106 slide 106 Ingest and Association Times
Task                              Measured Minutes
Create Detections Zone Table      39.62
X(0.2") 121M X 1.3B               65.25
Build #noMatches Table            1.50
X(1") 12k X 1.3B                  0.65
Build #allMatches Table (121M)    6.58
Build Orphans Table               0.17
Create P2PsfFits Table            11.63
Create P2ToObj Table              14.00
Total of Measured Times           140.40

107 slide 107 Ingest and Association Times
Task                              Estimated Minutes
Compute DetectionID, HTMID        30
Remove NULLS                      15
Index P2PsfFits on ObjID          15
Slices Pulling Data from Loader   5
Resolve 1 Detection - N Objects   10
Total of Estimated Times          75
Estimates range from educated guesses to wild guesses

108 slide 108 Total Time to I/A Daily Data
Task                              Time (hours, binary)  Time (hours, CSV)
Ingest 121M Detections (binary)   0.32                  -
Ingest 121M Detections (CSV)      -                     0.98
Total of Measured Times           2.34                  2.34
Total of Estimated Times          1.25                  1.25
Total Time to I/A Daily Data      3.91                  4.57
Requirement: Less than 12 hours (more than 2800 detections / s)
Detection Processing Rate: 8600 to 7400 detections / s
Margin on Requirement: 3.1 to 2.6
Using multiple loaders would improve performance

109 slide 109 Insert Time @ Slices
Task                               Estimated Minutes
Import P2PsfFits (binary out/in)   20.45
Import P2PsfFits (binary out/in)   2.68
Import Orphans                     0.00
Merge P2PsfFits                    58
Add constraint P2PsfFits           193
Merge P2ToObj                      13
Add constraint P2ToObj             54
Total of Measured Times            362
6 h with 8 partitions/slice (~1.3 x 10^9 detections/partition)
Educated guess

110 slide 110 Detections Per Partition
Years  Total Detections  Slices  Partitions per Slice  Total Partitions  Detections per Partition
0.0    0                 4       8                     32                0
1.0    4.29 x 10^10      4       8                     32                1.34 x 10^9
1.0    4.29 x 10^10      8       8                     64                6.7 x 10^8
2.0    8.57 x 10^10      8       8                     64                1.34 x 10^9
2.0    8.57 x 10^10      10      8                     80                1.07 x 10^9
3.0    1.29 x 10^11      10      8                     80                1.61 x 10^9
3.0    1.29 x 10^11      12      8                     96                1.34 x 10^9
3.5    1.50 x 10^11      12      8                     96                1.56 x 10^9

111 slide 111 Total Time for Insert @ Slice
Task                          Time (hours)
Total of Measured Times       0.25
Total of Estimated Times      5.3
Total Time for daily insert   6
Daily insert may operate in parallel with daily ingest and association.
Requirement: Less than 12 hours
Margin on Requirement: 2.0
Using more slices will improve insert performance.

112 slide 112 Summary  Ingest + Association < 4 h using 1 loader (@f1= daily) Scales with the number of servers Current margin on requirement 3.1 Room for improvement  Detection Insert @ slices (@f1= daily) 6 h with 8 partitions/slice It may happen in parallel with loading  Detections Lnks Insert @ main (@f2 < monthly) Unknown 6 h available  Objects insert & update @ slices (@f2 < monthly) Unknown 6 hours available  Objects update @ main server (@f2 < monthly) Unknown 12 h available. Transfer can be pipelined as soon as objects have been processed

113 slide 113 Risks  Estimates of Insert & Update at slices could be underestimated Need more empirical evaluation of exercising parallel I/O  Estimates and lay out of disk storage could be underestimated Merges and Indexes require 2x the data size

114 slide 114 Hardware/Scalability (Jan)

115 slide 115 PS1 Prototype Systems Design Jan Vandenberg, JHU Early PS1 Prototype

116 slide 116 Engineering Systems to Support the Database Design  Sequential read performance is our life-blood. Virtually all science queries will be I/O-bound.  ~70 TB raw data: 5.9 hours for full scan on IBM’s fastest 3.3 GB/s Champagne-budget SAN Need 20 GB/s IO engine just to scan the full data in less than an hour. Can’t touch this on a monolith.  Data mining a challenge even with good index coverage ~14 TB worth of indexes: 4-odd times bigger than SDSS DR6.  Hopeless if we rely on any bulk network transfers: must do work where the data is  Loading/Ingest more cpu-bound, though we still need solid write performance

117 slide 117 Choosing I/O Systems  So killer sequential I/O performance is a key systems design goal. Which gear to use? FC/SAN? Vanilla SATA? SAS?

118 slide 118 Fibre Channel, SAN  Expensive but not-so-fast physical links (4 Gbit, 10 Gbit)  Expensive switch  Potentially very flexible  Industrial strength manageability  Little control over RAID controller bottlenecks

119 slide 119 Straight SATA  Fast  Pretty cheap  Not so industrial- strength

120 slide 120 SAS  Fast: 12 Gbit/s FD building blocks  Nice and mature, stable  SCSI’s not just for swanky drives anymore: takes SATA drives!  So we have a way to use SATA without all the “beige”.  Pricey? $4400 for full 15x750GB system ($296/drive == close to Newegg media cost)

121 slide 121 SAS Performance, Gory Details  SAS v. SATA differences

122 slide 122 Per-Controller Performance  One controller can’t quite accommodate the throughput of an entire storage enclosure.

123 slide 123 Resulting PS1 Prototype I/O Topology  1100 MB/s single-threaded sequential reads per server

124 slide 124 RAID-5 v. RAID-10?  Primer, anyone?  RAID-5 perhaps feasible with contemporary controllers…  …but not a ton of redundancy  But after we add enough disks to meet performance goals, we have enough storage to run RAID-10 anyway!

125 slide 125 RAID-10 Performance  0.5*RAID-0 for single-threaded reads  RAID-0 perf for 2-user/2-thread workloads  0.5*RAID-0 writes

126 slide 126 PS1 Prototype Servers

127 slide 127 PS1 Prototype Servers PS1 Prototype

128 slide 128 PS1 Prototype Servers

129 slide 129 Projected PS1 Systems Design

130 slide 130 Backup/Recovery/Replication Strategies  No formal backup …except maybe for mydb's, f(cost*policy)  3-way replication Replication != backup –Little or no history (though we might have some point-in-time capabilities via metadata) –Replicas can be a bit too cozy: must notice badness before replication propagates it Replicas provide redundancy and load balancing… Fully online: zero time to recover Replicas needed for happy production performance plus ingest, anyway  Off-site geoplex Provides continuity if we lose HI (local or trans-Pacific network outage, facilities outage) Could help balance trans-Pacific bandwidth needs (service continental traffic locally)

131 slide 131 Why No Traditional Backups?  Money no object… do traditional backups too!!!  Synergy, economy of scale with other collaboration needs (IPP?)… do traditional backups too!!!  Not super pricey…  …but not very useful relative to a replica for our purposes Time to recover

132 slide 132 Failure Scenarios (Easy Ones)  Zero downtime, little effort: Disks (common) –Simple* hotswap –Automatic rebuild from hotspare or replacement drive Power supplies (not uncommon) –Simple* hotswap Fans (pretty common) –Simple* hotswap * Assuming sufficiently non-beige gear

133 slide 133 Failure Scenarios (Mostly Harmless Ones)  Some downtime and replica cutover: System board (rare) Memory (rare and usually proactively detected and handled via scheduled maintenance) Disk controller (rare, potentially minimal downtime via cold-spare controller) CPU (not utterly uncommon, can be tough and time consuming to diagnose correctly)

134 slide 134 Failure Scenarios (Slightly Spooky Ones)  Database mangling by human or pipeline error Gotta catch this before replication propagates it everywhere Need lots of sanity checks before replicating (and so off-the-shelf near-realtime replication tools don’t help us) Need to run replication backwards from older, healthy replicas. Probably less automated than healthy replication.  Catastrophic loss of datacenter Okay, we have the geoplex –…but we’re dangling by a single copy ‘till recovery is complete –…and this may be a while. –…but are we still in trouble? Depending on colo scenarios, did we also lose the IPP and flatfile archive?

135 slide 135 Failure Scenarios (Nasty Ones)  Unrecoverable badness fully replicated before detection  Catastrophic loss of datacenter without geoplex  Can we ever catch back up with the data rate if we need to start over and rebuild with an ingest campaign? Don’t bet on it!

136 slide 136 Operating Systems, DBMS?  SQL Server 2005 EE x64 Why? Why not DB2, Oracle RAC, PostgreSQL, MySQL, …?  (Win2003 EE x64)  Why EE? Because it's there.  Scientific Linux 4.x/5.x, or local favorite  Platform rant from JVV available over beers

137 slide 137 Systems/Database Management  Active Directory infrastructure  Windows patching tools, practices  Linux patching tools, practices  Monitoring  Staffing requirements

138 slide 138 Facilities/Infrastructure Projections for PS1  Power/cooling Prototype is 9.2 kW (2.6 Tons AC) PS1: something like 43 kW, 12.1 Tons  Rack space Prototype is 69 RU, <2 42U racks (includes 14U of rackmount UPS at JHU) PS1: about 310 RU (9-ish racks)  Networking: ~40 Gbit Ethernet ports  …plus sundry infrastructure, ideally already in place (domain controllers, monitoring systems, etc.)

139 slide 139 Operational Handoff to UofH  Gulp.

140 slide 140 How Design Meets Requirements  Cross-matching detections with objects Zone cross-match part of loading pipeline Already exceeded requirement with prototype  Query performance Ping-pong configuration for query during ingest Spatial indexing and distributed queries Query manager can be scaled out as necessary  Scalability Shared-nothing architecture Scale out as needed Beyond PS1 we will need truly parallel query plans

141 slide 141 WBS/Development Tasks  Tasks: Refine Prototype/Schema, Staging/Transformation, Initial Load, Load/Resolve Detections, Resolve/Synchronize Objects, Create Snapshot, Replication Module, Query Processing, Workflow Systems, Logging, Data Scrubbing, SSIS (?) + C#, QM/Logging, Hardware, Documentation, Testing, Redistribute Data  Per-task estimates: 2 PM, 3 PM, 1 PM, 3 PM, 1 PM, 2 PM, 4 PM, 2 PM  Total Effort: 35 PM  Delivery: 9/2008

142 slide 142 Personnel Available  2 new hires (SW Engineers) 100%  Maria 80%  Ani 20%  Jan 10%  Alainna 15%  Nolan Li 25%  Sam Carliles 25%  George Fekete 5%  Laszlo Dobos 50% (for 6 months)

143 slide 143 Issues/Risks  Versioning Do we need to preserve snapshots of monthly versions? How will users reproduce queries on subsequent versions? Is it ok that a new version of the sky replaces the previous one every month?  Backup/recovery Will we need 3 local copies rather than 2 for safety? Is restoring from offsite copy feasible?  Handoff to IfA beyond scope of WBS shown This will involve several PMs

144 Mahalo!

145 slide 145 Query Manager (annotated screenshot)  Context that query is executed in  MyDB table that query results go into  Name that this query job is given  Check query syntax  Get graphical query plan  Run query in quick (1 minute) mode  Submit query to long (8-hour) queue  Query buffer  Load one of the sample queries into query buffer

146 slide 146 Stored procedure arguments SQL code for stored procedure Query Manager

147 slide 147 Query Manager (annotated screenshot)  MyDB context is the default, but other contexts can be selected  The space used and total space available  Multiple tables can be selected and dropped at once  Table list can be sorted by name, size, type  User can browse DB Views, Tables, Functions and Procedures

148 slide 148 The query that created this table Query Manager

149 slide 149 Search radius Table to hold results Context to run search on Query Manager

