The Pan-STARRS Data Challenge
Jim Heasley
Institute for Astronomy, University of Hawaii
IDIES09
What is Pan-STARRS?
Pan-STARRS: a new telescope facility
4 smallish (1.8m) telescopes, but with an extremely wide field of view
Can scan the sky rapidly and repeatedly, and can detect very faint objects
–Unique time-resolution capability
Project led by IfA with help from the Air Force, the Maui High Performance Computing Center, and MIT’s Lincoln Lab
The prototype, PS1, will be operated by an international consortium
Pan-STARRS Overview
Time domain astronomy
–Transient objects
–Moving objects
–Variable objects
Static sky science
–Enabled by stacking repeated scans to form a collection of ultra-deep static sky images
Pan-STARRS observatory specifications
–Four 1.8m R-C + corrector
–7 square degree FOV - 1.4Gpixel cameras
–Sited in Hawaii
–A = 50
–R ~ 24 in 30 s integration
–-> 7000 square deg/night
–All sky + deep field surveys in g,r,i,z,y
The Published Science Products Subsystem
Front of the Wave
Pan-STARRS is only the first of a new generation of astronomical data programs that will generate such large volumes of data:
–SkyMapper, southern hemisphere optical
–VISTA, southern hemisphere IR survey
–LSST, an all sky survey like Pan-STARRS
Eventually, these data sets will be useful for data mining.
PS1 Data Products
Detections: measurements obtained directly from processed image frames
–Detection catalogs
–Source catalogs from “stacks” of the sky images
–Difference catalogs
   High significance (> 5σ transient events)
   Low significance (transients between 3σ and 5σ)
–Other Image Stacks (Medium Deep Survey)
Objects: aggregates derived from detections
What’s the Challenge?
At first blush, this looks pretty much like the Sloan Digital Sky Survey… BUT
–Size – Over its 3 year mission, PS1 will record over 150 billion detections for approximately 5.5 billion sources
–Dynamic Nature – new data will always be coming into the database system, both for things we’ve seen before and for new discoveries
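To give a rough sense of scale, the two figures quoted above can be turned into averages. This is a back-of-the-envelope sketch in Python that uses only the numbers on this slide; the assumption that data arrive evenly on all 365 days of each year is an idealization (real nights are lost to weather and downtime), and the function names are illustrative only.

```python
# Back-of-the-envelope scale check using only the numbers quoted on the slide.
DETECTIONS = 150e9      # detections recorded over the PS1 mission ("over 150 billion")
SOURCES = 5.5e9         # approximate number of distinct sources
MISSION_YEARS = 3       # length of the PS1 mission


def detections_per_source(detections=DETECTIONS, sources=SOURCES):
    """Average number of detections accumulated per source."""
    return detections / sources


def detections_per_day(detections=DETECTIONS, years=MISSION_YEARS, days_per_year=365):
    """Average daily ingest rate.

    Assumption (not from the slides): data arrive evenly on every calendar day.
    """
    return detections / (years * days_per_year)


if __name__ == "__main__":
    print(f"{detections_per_source():.1f} detections per source on average")
    print(f"{detections_per_day() / 1e6:.0f} million detections per day on average")
```

Under these assumptions each source is detected a few dozen times, and the database must absorb on the order of a hundred million new rows per day, which is what makes the "dynamic nature" bullet a real constraint.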
How to Approach This Challenge
There are many possible approaches to deal with this data challenge.
Shared what?
–Memory
–Disk
–Nothing
Not all of these approaches are created equal, in cost or in performance (DeWitt & Gray, 1992, “Parallel Database Systems: The Future of High Performance Database Systems”).
Conversation with the Pan-STARRS Project Manager
Jim: Tom, what are we going to do if the solution proposed by TBJD is more than you can afford?
Tom: Jim, I’m sure you’ll think of something!
Not long after that, TBJD did give us a hardware/software plan we couldn’t afford.
Not long after, Tom resigned from the project to pursue other activities…
The Pan-STARRS project teamed up with Alex and his database team at JHU
Building upon the SDSS Heritage
In teaming with the group at JHU we hoped to build upon the experience and software developed for the SDSS.
A key question was how we could scale the system to deal with the volume of data expected from PS1 (> 10X SDSS in the first year alone).
The second key question was whether the system could keep up with the data flow.
The heritage is more one of philosophy than recycled software: to deal with the challenges posed by PS1 we’ve had to generate a great deal of new code.
The Object Data Manager
The Object Data Manager (ODM) was considered to be the “long pole” in the development of the PS1 PSPS.
Parallel database systems can both provide data redundancy and spread very large tables that can’t fit on a single machine over multiple storage volumes.
For PS1 (and beyond) we need both.
Distributed Architecture
The bigger tables will be spatially partitioned across servers called Slices
Using slices improves system scalability
Tables are sliced into ranges of ObjectID, which correspond to broad declination ranges
ObjectID boundaries are selected so that each slice has a similar number of objects
Distributed Partitioned Views “glue” the data together
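The idea of choosing ObjectID boundaries so that each slice holds a similar number of objects can be sketched as a simple equal-count split. This is a minimal illustration in Python, not the ODM's actual partitioning code; the function names `choose_boundaries` and `slice_for` are hypothetical, and the real system works on database tables rather than in-memory lists.

```python
from bisect import bisect_right


def choose_boundaries(object_ids, n_slices):
    """Return n_slices - 1 ObjectID cut points giving near-equal object counts.

    Hypothetical helper: boundary k is the first ID that belongs to slice k.
    """
    ids = sorted(object_ids)
    if n_slices < 1 or len(ids) < n_slices:
        raise ValueError("need at least one object per slice")
    return [ids[(k * len(ids)) // n_slices] for k in range(1, n_slices)]


def slice_for(object_id, boundaries):
    """Index of the slice whose ObjectID range contains object_id."""
    return bisect_right(boundaries, object_id)


if __name__ == "__main__":
    # A deliberately uneven ID distribution: sparse in one range, dense in another.
    ids = list(range(0, 1000, 10)) + list(range(5000, 5100))
    cuts = choose_boundaries(ids, 4)
    counts = [0] * 4
    for i in ids:
        counts[slice_for(i, cuts)] += 1
    print(cuts, counts)
```

Because the cut points follow the data rather than fixed ID intervals, dense regions of the sky get narrower ID ranges and sparse regions get wider ones, so the slices stay balanced in row count.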
Design Decisions: ObjID
Objects have their positional information encoded in their objID
–fGetPanObjID (ra, dec, zoneH)
–ZoneID is the most significant part of the ID
–objID is the Primary Key
Objects are organized (clustered indexed) so that objects near each other on the sky are also stored near each other on disk
This gives good search performance, spatial functionality, and scalability
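To illustrate why putting the zone in the most significant part of the ID keeps sky neighbors close together in a clustered index, here is a minimal Python sketch of a zone-first ID. It is not the actual fGetPanObjID formula, which the slides do not give: the digit layout, the 1e-6 degree RA resolution, and the names `make_obj_id` and `zone_of` are all assumptions made for illustration, and `zone_height` is assumed to be the zone height in degrees (the role the `zoneH` argument appears to play).

```python
import math

ZONE_FACTOR = 10**13   # assumed: zone number occupies the digits above 10^13
DEC_STEPS = 10**4      # assumed: position within the zone kept to 1 part in 10^4


def make_obj_id(ra, dec, zone_height):
    """Hypothetical zone-first ID: zone, then RA, then offset within the zone."""
    if not (0.0 <= ra < 360.0 and -90.0 <= dec <= 90.0):
        raise ValueError("ra must be in [0, 360) and dec in [-90, 90]")
    scaled = (dec + 90.0) / zone_height
    zone = math.floor(scaled)                      # declination band index
    ra_part = math.floor(ra * 1e6)                 # RA in units of 1e-6 degree
    dec_part = min(math.floor((scaled - zone) * DEC_STEPS), DEC_STEPS - 1)
    return zone * ZONE_FACTOR + ra_part * DEC_STEPS + dec_part


def zone_of(obj_id):
    """Recover the zone number from the most significant digits."""
    return obj_id // ZONE_FACTOR


if __name__ == "__main__":
    a = make_obj_id(359.9, 10.2, 0.5)   # lower declination band, large RA
    b = make_obj_id(0.1, 10.6, 0.5)     # next band up, small RA
    print(a < b, zone_of(a), zone_of(b))
```

Sorting on such an ID orders objects first by declination band and then by right ascension within the band, so a range scan over IDs touches a contiguous strip of sky, and ranges of IDs map naturally onto broad declination ranges.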
Pan-STARRS Data Flow
[Figure: the Pan-STARRS Science Cloud, divided into “behind the cloud” and user-facing services. Components shown: Telescope, Image Processing Pipeline (IPP), CSV files, Load Workflow, Load DB, Cold Slice DB 1 and 2, Merge Workflow, Warm Slice DB 1 and 2, Flip Workflow, Hot Slice DB 1 and 2, MainDB Distributed Views, CASJobs Query Service, MyDB, Slice Fault Recover Workflow, Validation, Exception Notification. Labels: Data Valet Workflows; Data Consumer Queries & Workflows; Data Creators; Astronomers (Data Consumers); Admin & Load-Merge Machines; Production Machines. Data flows in one direction, except for error recovery.]
Pan-STARRS Data Layout
[Figure: layout of data across the cluster. Labels: Image Pipeline, csv, Load-Merge Nodes (Load Merge 1 to 6), Slice Nodes (Slice 1 to 8), Head Nodes (Head 1, Head 2), Main Distributed View, partitions S1 to S16, LOAD, COLD, WARM, HOT, L1 Data, L2 Data.]
The ODM Infrastructure
Much of our software development has gone into extending the ingest pipeline developed for SDSS.
Unlike SDSS, we don’t have “campaign” loads but a steady flow of data from the telescope through the Image Processing Pipeline to the ODM.
We have constructed data workflows to deal with both the regular data flow into the ODM and anticipated failure modes (lost disks, RAID arrays, and various server nodes).
Pan-STARRS Object Data Manager Subsystem
[Figure: block diagram showing data flow and control flow among the following components]
Pan-STARRS Telescope
Image Processing Pipeline: extracts objects like stars and galaxies from telescope images (~1TB input/week)
Loaded Astronomy Databases (~70TB transfer/week)
Deployed Astronomy Databases (~70TB storage/year)
Query Manager: science queries and MyDB for results
Pan-STARRS Cloud Services for Astronomers
System & Administration Workflows: orchestrates all cluster changes, such as data loading or fault tolerance
Configuration, Health & Performance Monitoring: cluster deployment and operations
Internal Data Flow and State Logging: tools for supporting workflow authoring and execution
System Operation UI, System Health Monitor UI, Query Performance UI
What Next?
Will this approach scale to our needs?
–PS1 – yes. But we already see the need for better parallel processing query plans.
–PS4 – unclear! Even though I’m not from Missouri, “show me!” One year of PS4 produces a greater data volume than the entire PS1 3 year mission!
Cloud computing?
–How can we test issues like scalability without actually building the system?
–Does each project really need its own data center?
–Having these databases “in the cloud” may greatly facilitate data sharing/mining.
Finally, Thanks To
Alex for stepping in, hosting the development system at JHU, and building up his core team to construct the ODM, especially
–Maria Nieto-Santisteban
–Richard Wilton
–Susan Werner
And at Microsoft to
–Michael Thomassy
–Yogesh Simmhan
–Catharine van Ingen