Panel Summary Andrew Hanushevsky Stanford Linear Accelerator Center Stanford University XLDB 23-October-07.

Presentation transcript:


State in High Energy Physics
- A lot of data
  - 15 PB/year for LHC
  - Typically write-once data
- Applications are CPU bound
- A lot of institutes must be involved
  - Increase total resources
- Necessity forces a hybrid model (RDBMS + files)
- Performance impact of consistency is high
  - Not required for LHC
- Wide range of applications, DB expertise, environments

LHC Issues
- Power and cooling
- Cheap hardware for scaling
  - Reliability problems
  - Patching issues
- Distributed deployment issues
  - Needed to develop in-house tools
- Multi-dimensional search requirements
  - Usually the reason for using "files" for data

LHC Questions
- Database as a transactional system, efficient query engine, highly available storage?
  - Can one product do all of this?
  - Multi-mode storage
- How do you measure scaling?
  - Size? Transactions/second? Etc.
- Shared-everything or shared-nothing architectures?

State in Astronomy (LSST)
- A lot of data
  - Trillions or more of rows
  - 14 PB by 2024
    - Only data about the image
    - Actual images (write once) much larger!
- Data is distributed
  - Telescope and archive physically separate
- Time for database technology to catch up (12 years)
  - Some proprietary systems handle even more data today
- Reliability and security requirements are loose
  - Some data loss can be absorbed, 98% uptime, public data
  - However, must be able to ingest the data
    - Telescope keeps going

Issues in LSST
- Easy scaling
  - Add resources on the fly
- Dependable software sources
  - This is a long-term project
- Data has some unique needs
  - Distributed mining capabilities
  - Varied database data types
    - Not available today except in OO databases
- Relaxed consistency requirements
- Fault-tolerant software, not hardware
- Human scaling must be low

Scientific Panel I
- 40% pure database
  - Otherwise 20-30% in DB, rest in files
- Majority in the petabyte range
  - Everyone in the TB range
- Majority use commercial products
  - Though open source DBs rampant
- Few (in XL scale today) use homegrown systems
  - Sometimes driven by need, sometimes by legacy

Scientific Panel II
- Wide range of user analytic needs
  - DBs have limited "express-ability"
  - Unlikely there is a common set of operators
- Common data processing model
  - Write once, read many
    - But a lot of meta-data updates
  - Amenable to data parallelism
  - Approximate results are acceptable to 1st order
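The write-once, read-many model is what makes data parallelism straightforward: partitions never change after they are written, so each can be scanned independently and the partial results merged. A minimal Python sketch, assuming hypothetical in-memory partitions standing in for files and a simple mean aggregate (the function names and data are illustrative, not taken from the panel):

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(partition):
    """Compute a partial (count, sum) aggregate over one immutable partition."""
    return len(partition), sum(partition)


def parallel_mean(partitions, workers=4):
    """Scan every partition independently, then merge the partial aggregates."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(scan_partition, partitions))
    total_count = sum(count for count, _ in partials)
    total_sum = sum(subtotal for _, subtotal in partials)
    return total_sum / total_count


# Hypothetical write-once partitions (tuples, so they cannot be modified).
partitions = [
    tuple(range(0, 100)),
    tuple(range(100, 200)),
    tuple(range(200, 300)),
]
mean_value = parallel_mean(partitions)
```

Threads are used here only to keep the sketch short. A real deployment would presumably spread the scans across processes or nodes, but the scan-then-merge shape would stay the same.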

Scientific Panel III
- Wish list
  - Approximate queries
  - Full spatial queries
  - Multiple availability levels
  - Mixture of real-time, interactive, background uses
- The rest is yes
  - Scaling, performance, maintainability, etc.
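One common way to read "approximate queries" is answering an aggregate from a random sample instead of a full scan. The panel does not say which technique it had in mind, so the following is only a sketch under that assumption, with hypothetical names and synthetic data:

```python
import random


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a uniform random sample drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(values, sample_size)
    return sum(sample) / sample_size


# Synthetic column of 100,000 values; the estimate reads only 1% of them.
values = list(range(100_000))
exact = sum(values) / len(values)
estimate = approximate_mean(values, sample_size=1_000)
relative_error = abs(estimate - exact) / exact
```

The trade is accuracy for I/O: the estimate touches a small fraction of the rows, and the error shrinks roughly with the square root of the sample size.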

Industry Panel I
- Primarily traditional DB use
  - Standard scaling techniques
  - Disallow certain types of queries
- Availability is a must
  - Money and survivability is the issue
- 90% non-transactional query
- Wide range of size, several TB to several PB
  - 1 billion rows/hour ingest peak
  - Trillions of rows
  - 25 TB/day is not unusual
  - Millions of queries a day
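The ingest figures quoted on this slide are easier to compare once converted to per-second rates. A small Python sketch of the arithmetic, assuming decimal units (1 TB = 10^6 MB) and a load spread evenly over the period, neither of which the slide states:

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def rows_per_second(rows_per_hour):
    """Convert an hourly row-ingest rate to rows per second."""
    return rows_per_hour / SECONDS_PER_HOUR


def megabytes_per_second(terabytes_per_day):
    """Convert a daily volume in TB to MB/s, using 1 TB = 10**6 MB (assumed)."""
    return terabytes_per_day * 1_000_000 / SECONDS_PER_DAY


peak_rows = rows_per_second(1e9)          # about 278 thousand rows per second
daily_stream = megabytes_per_second(25)   # about 289 MB per second sustained
```

So a peak of one billion rows per hour is roughly 278,000 rows per second, and 25 TB/day is roughly 289 MB/s sustained around the clock.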

Industry Panel II
- Some homegrown solutions
  - Depending on how it is used
- Problem is I/O throughput
  - Minimize use of indexes
  - Some specialized systems used to increase performance
- Dirty reads common
  - Transactional latency is a problem

Industry Panel III
- Varied use patterns (business model driven)
  - Non-indexed data for mining purposes
  - Parallel load and query
  - Real-time queries (currency is a must)
  - Designing for the unknown query
- Customization motivation varies
  - Join inefficiency
  - Limited SQL expressiveness
  - Lack of sufficient parallelism

Common Industry/Science Issues
- Performance issues
  - I/O throughput, transactional latency, etc.
  - Lack of effective parallelism
- Usability
  - SQL expressiveness
- Licensing
  - Industry more constrained but cost is an issue
- Human power
  - Labor is the dominant cost
  - DBA costs are high and must be reduced

Final Perceptions
- Science/Industry operate roughly on the same scale
  - Size and throughput
- Science & Industry "business models" differ
  - Drive each community in a different direction
  - Science is a long-term affair
  - Industry must be reactive

Discussion Points
- What drives feature sets?
  - General feeling that scaling features are missing
  - Is it the architecture (e.g., relational vs other)?
  - Is it the business model?
  - Something else?
- What feature sets do you think are important?
  - Performance, scalability, usability, reliability?
  - Do you see it as a tradeoff?
- Open software presence
  - A question of customization possibilities or simply cost?
  - Is it considered a threat to your business model?
- Is it time to rethink the nature and placement of databases?