Panel Summary
Andrew Hanushevsky
Stanford Linear Accelerator Center, Stanford University
XLDB, 23-October-07

State in High Energy Physics
- A lot of data
  - 15 PB/year for LHC
  - Typically write-once data
- Applications are CPU bound
- A lot of institutes must be involved
  - Increase total resources
- Necessity forces a hybrid model (RDBMS + files)
  - Performance impact of consistency is high
    - Not required for LHC
- Wide range of applications, DB expertise, environments

LHC Issues
- Power and cooling
- Cheap hardware for scaling
  - Reliability problems
  - Patching issues
- Distributed deployment issues
  - Needed to develop in-house tools
- Multi-dimensional search requirements
  - Usually the reason for using “files” for data

LHC Questions
- Database as a transactional system, efficient query engine, highly available storage?
  - Can one product do all of this?
  - Multi-mode storage
- How do you measure scaling?
  - Size? Transactions/second? Etc.
- Shared-everything or shared-nothing architectures?

State in Astronomy (LSST)
- A lot of data
  - Trillions or more of rows
  - 14 PB by 2024
    - Only data about the images
    - Actual images (write once) much larger!
- Data is distributed
  - Telescope and archive physically separate
- Time for database technology to catch up (12 years)
  - Some proprietary systems handle even more data today
- Reliability and security requirements are loose
  - Can absorb some data loss, uptime of 98%, public data
  - However, must be able to ingest the data
    - Telescope keeps going

Issues in LSST
- Easy scaling
  - Add resources on the fly
- Dependable software sources
  - This is a long-term project
- Data has some unique needs
  - Distributed mining capabilities
  - Varied database data types
    - Not available today except in OO databases
- Relaxed consistency requirements
- Fault-tolerant software, not hardware
- Human scaling must be low

Scientific Panel I
- 40% pure database
  - Otherwise 20-30% in DB, rest in files
- Majority in the petabyte range
  - Everyone in the TB range
- Majority use commercial products
  - Though open source DBs are rampant
- Few (at XL scale today) use homegrown systems
  - Sometimes driven by need, sometimes by legacy

Scientific Panel II
- Wide range of user analytic needs
  - DBs have limited “express-ability”
  - Unlikely there is a common set of operators
- Common data processing model
  - Write once, read many
    - But a lot of metadata updates
  - Amenable to data parallelism
  - Approximate results are acceptable to 1st order
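
The processing model on this slide (write-once data, data parallelism, approximate answers acceptable to first order) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the partition size, the synthetic data, and the use of a thread pool as a stand-in for independent worker nodes. It is not a description of any system discussed by the panel.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def partition(rows, size):
    """Split write-once rows into fixed-size partitions (hypothetical layout)."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_sum(chunk):
    """Work done independently on one partition."""
    return sum(chunk)


def exact_total(partitions, workers=4):
    """Scan every partition in parallel and combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, partitions))


def approximate_total(partitions, fraction, seed=0, workers=4):
    """Scan only a random sample of partitions and scale the result up.

    The estimate is good "to first order" only if partitions are
    statistically similar; that is an assumption of this sketch.
    """
    rng = random.Random(seed)
    k = max(1, round(len(partitions) * fraction))
    sample = rng.sample(partitions, k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sampled = sum(pool.map(partial_sum, sample))
    return sampled * len(partitions) / k


# Synthetic stand-in for measurement data; values are illustrative only.
_rng = random.Random(42)
rows = [_rng.randint(0, 999) for _ in range(100_000)]
parts = partition(rows, 1_000)

exact = exact_total(parts)
approx = approximate_total(parts, fraction=0.1)
print(f"exact={exact} approx={approx:.0f}")
```

Because the data is never rewritten, partitions can be scanned in any order and on any worker, and the sampled query touches only a tenth of the partitions. In CPython a thread pool does not speed up CPU-bound work; it is used here only to show the shape of the computation.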

Scientific Panel III
- Wish list
  - Approximate queries
  - Full spatial queries
  - Multiple availability levels
  - Mixture of real-time, interactive, background uses
- The rest is yes
  - Scaling, performance, maintainability, etc.

Industry Panel I
- Primarily traditional DB use
  - Standard scaling techniques
    - Disallow certain types of queries
- Availability is a must
  - Money and survivability is the issue
- 90% non-transactional queries
- Wide range of sizes: several TB to several PB
  - 1 billion rows/hour ingest peak
  - Trillions of rows
  - 25 TB/day is not unusual
  - Millions of queries a day
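
To put the figures on this slide on a per-second footing, the hourly and daily rates can be converted with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes), which the slide does not specify; the function names are illustrative.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400
BYTES_PER_TB = 1e12   # assumption: decimal terabytes
BYTES_PER_MB = 1e6    # assumption: decimal megabytes


def rows_per_second(rows_per_hour):
    """Convert an hourly row rate to rows per second."""
    return rows_per_hour / SECONDS_PER_HOUR


def megabytes_per_second(terabytes_per_day):
    """Convert a daily data volume to a sustained rate in MB/s."""
    return terabytes_per_day * BYTES_PER_TB / SECONDS_PER_DAY / BYTES_PER_MB


peak_rows = rows_per_second(1e9)     # 1 billion rows/hour ingest peak
daily_rate = megabytes_per_second(25)  # 25 TB/day

print(f"peak ingest: {peak_rows:,.0f} rows/s")
print(f"25 TB/day sustained: {daily_rate:.1f} MB/s")
```

Under these assumptions, the peak ingest is roughly 278,000 rows per second, and 25 TB/day corresponds to a sustained rate of about 289 MB/s.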

Industry Panel II
- Some homegrown solutions
  - Depending on how it is used
- Problem is I/O throughput
  - Minimize use of indexes
  - Some specialized systems used to increase performance
- Dirty reads common
  - Transactional latency is a problem

Industry Panel III
- Varied use patterns (business-model driven)
  - Non-indexed data for mining purposes
  - Parallel load and query
  - Real-time queries (currency is a must)
  - Designing for the unknown query
- Customization motivation varies
  - Join inefficiency
  - Limited SQL expressiveness
  - Lack of sufficient parallelism

Common Industry/Science Issues
- Performance issues
  - I/O throughput, transactional latency, etc.
  - Lack of effective parallelism
- Usability
  - SQL expressiveness
- Licensing
  - Industry more constrained, but cost is an issue
- Human power
  - Labor is the dominant cost
  - DBA costs are high and must be reduced

Final Perceptions
- Science and industry operate on roughly the same scale
  - Size and throughput
- Science and industry “business models” differ
  - Drives each community in a different direction
    - Science is a long-term affair
    - Industry must be reactive

Discussion Points
- What drives feature sets?
  - General feeling that scaling features are missing
    - Is it the architecture (e.g., relational vs. other)?
    - Is it the business model?
    - Something else?
- What feature sets do you think are important?
  - Performance, scalability, usability, reliability?
    - Do you see it as a tradeoff?
- Open software presence
  - A question of customization possibilities or simply cost?
  - Is it considered a threat to your business model?
- Is it time to rethink the nature and placement of databases?