Espresso - a Feasibility Study of a Scalable, Performant ODBMS Dirk Duellmann CERN IT/DB and RD45 n Aim of this Study n Architectural Overview n Espresso.

Presentation transcript:

Espresso - a Feasibility Study of a Scalable, Performant ODBMS
Dirk Duellmann, CERN IT/DB and RD45
- Aim of this Study
- Architectural Overview
- Espresso Components
- Prototype Status & Plans

Why Espresso?
- RD45 Risk Analysis Milestone
  - Understand the effort needed to develop an ODBMS suitable as a fallback solution for LHC data stores
- Testbed that allows us to test novel solutions for remaining problems
  - e.g. VLDB issues, asynchronous I/O, user schema & data, modern C++ binding, ...
- NO plans to stop the Objectivity production service!

Could a home-grown ODBMS be feasible?
- Most database kernels were developed in C in the late 80s and before
  - Today all main design choices are extensively studied in the computer science literature
  - The C++ language and library provide a much better development platform than C
- Our specific requirements are better understood
  - We know much better what we need (and what we do not need)
  - We could reuse HEP developments in many areas, such as the mass storage interface and security
- Building an ODBMS for HEP is an engineering task, not a research task
  - We don't need to spend the O(150) person years which went into the first ODBMS!

System Requirements
- Scalability
  - in data volume and number of client connections
- Navigational Access
  - with performance close to network and disk limits
- Heterogeneous Access
  - from multiple platforms and languages
- Transactional Safety & Crash Recovery
  - automatic consistency after software/hardware failures

A Clean Sheet Approach - What should/could be done differently?
- No need for big architectural changes
  - Objectivity/DB largely fulfils our functional requirements
  - Migration would be easier if the access model is similar (e.g. ODMG-like)
- Focus on remaining problems
  - Improved scalability & concurrency of the storage hierarchy
    - Larger address space (VLDB)
    - Segmented and more scalable schema & catalogue
  - Improved support for the HEP environment
    - Parallel development: a user/developer sandbox within the store is needed
  - Simplified partial distribution of the data store
    - Import/export of consistent subsets of the store

Flexible Storage Hierarchy
- File - group of physically clustered objects
  - Smallest possible Espresso store
  - Contains data and optionally schema
  - Fast navigation within the file using physical OIDs
- Domain - group of files with tightly coupled objects
  - Contains domain catalogue, data and additional schema
  - Navigation between all objects within the domain using physical OIDs
- Federation - group of weakly coupled domains
  - Domain catalogue (very few updates!)
  - Shared schema (very few updates!)

[Figure: example federation layout. Labels: federation (FD) with FD catalogue and production schema; User 1 "sandbox" domain with domain catalogue, schema (myTrack) and files Tags, Histos, MyTracks; production server domains Period 1 to Period N, each with a domain catalogue and RAW, REC and AOD files; read-only domain (no locking required); Calib domain on a calib server with domain catalogue and CalibTPC, CalibHCAL, CalibECAL.]

Espresso OID Layout
- Federation
  - set of weakly coupled domains
- Domain# (32 bit)
  - set of tightly coupled objects
  - e.g. a run or run period, an end-user workspace
- File# (16 bit)
  - a single file within a domain
- Page# (32 bit)
  - a single logical page in the file
- Object# (16 bit)
  - a single data record on a page, e.g. an object or varray
[Figure: nesting of Federation, Domain, File, Page and Object]
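
The field widths above add up to 32 + 16 + 32 + 16 = 96 bits. As a minimal sketch of how such an OID might be represented and serialised, assuming a big-endian packed form (the slides do not specify byte order), the following uses hypothetical names `Oid`, `PackedOid`, `pack` and `unpack`, none of which are taken from the Espresso code:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical OID type. Field widths follow the slide; the names and
// the packed byte order are assumptions made for illustration.
struct Oid {
    std::uint32_t domain;  // Domain# (32 bit)
    std::uint16_t file;    // File#   (16 bit)
    std::uint32_t page;    // Page#   (32 bit)
    std::uint16_t object;  // Object# (16 bit)
};

inline bool operator==(const Oid& a, const Oid& b) {
    return a.domain == b.domain && a.file == b.file &&
           a.page == b.page && a.object == b.object;
}

// 96 bits = 12 bytes
using PackedOid = std::array<std::uint8_t, 12>;

// Write the fields most significant byte first, so the packed form does
// not depend on the byte order of the host platform.
inline PackedOid pack(const Oid& oid) {
    PackedOid out{};
    std::size_t i = 0;
    auto put = [&](std::uint32_t value, int bytes) {
        for (int shift = 8 * (bytes - 1); shift >= 0; shift -= 8)
            out[i++] = static_cast<std::uint8_t>(value >> shift);
    };
    put(oid.domain, 4);
    put(oid.file, 2);
    put(oid.page, 4);
    put(oid.object, 2);
    return out;
}

// Inverse of pack(): rebuild the four fields from the 12 packed bytes.
inline Oid unpack(const PackedOid& in) {
    std::size_t i = 0;
    auto get = [&](int bytes) {
        std::uint32_t value = 0;
        for (int n = 0; n < bytes; ++n) value = (value << 8) | in[i++];
        return value;
    };
    Oid oid{};
    oid.domain = get(4);
    oid.file = static_cast<std::uint16_t>(get(2));
    oid.page = get(4);
    oid.object = static_cast<std::uint16_t>(get(2));
    return oid;
}
```

A plain struct with explicit packing is used here instead of bit-fields because compiler padding would otherwise make the in-memory size platform dependent.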

Prototype Implementation
- Espresso is implemented in standard C++
  - no other dependencies
  - (for now we use portable network I/O from ObjectSpace)
- Expects a full C++ compiler
  - STL containers
    - in fact all containers in the current implementation are STL containers
  - Exceptions
    - C++ binding uses exceptions to signal error conditions (conforming to the ODMG standard)
  - Namespaces
    - All of the implementation is contained in namespace "espresso"
    - C++ binding is in namespace "odmg"
- Development platform: RedHat Linux & g++

Component Approach
- Espresso is split into a small set of replaceable components with well defined
  - task
  - interface
  - dependency on other components
- Common Services
- Storage Manager
- Schema Manager
- Catalogue Manager
- Data Server
- Lock Server
- C++ & Python Binding, (JAVA)

Toplevel Components
[Figure: component dependency diagram ("depends on" arrows). Labels: User API, Tool Interface, Storage Level Interface, OS & Network Abstraction, Distribution, Net I/O, File I/O, StorageMgr, Page I/O, TransMgr, CatalogMgr, SchemaMgr, C++ Binding, JAVA Binding, Python Binding, PageServer, Locktable, LockServer.]

Components: Physical Model
- Each top-level component corresponds to one shared library and namespace
  - shared lib dependencies follow the category diagram
  - components are isolated in their namespace
    - from other components
    - from user classes
- Each shared lib provides an IComponent interface
  - Factory for main provided interfaces
  - Version and configuration control on component level
    - implementation version, date and compiler version
    - boolean flags for optimised, debug, profiling
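
One way to picture the per-library interface described above is the sketch below. Only the name `IComponent` appears in the slides. `ComponentInfo`, `IStorageMgr::describe`, `ToyStorageMgr`, `StorageComponent` and all string values are hypothetical, chosen to show version and configuration data next to a factory method.

```cpp
#include <memory>
#include <string>

// Version and configuration data per component. The fields follow the
// slide; the struct and member names are assumptions.
struct ComponentInfo {
    std::string name;
    std::string version;          // implementation version
    std::string buildDate;        // date of the build
    std::string compilerVersion;  // compiler used for the build
    bool optimised;
    bool debug;
    bool profiling;
};

// Placeholder for one of the "main provided interfaces".
class IStorageMgr {
public:
    virtual ~IStorageMgr() = default;
    virtual std::string describe() const = 0;
};

// Interface that every shared library would export.
class IComponent {
public:
    virtual ~IComponent() = default;
    virtual ComponentInfo info() const = 0;
};

// Toy implementation hidden behind the interface.
class ToyStorageMgr : public IStorageMgr {
public:
    std::string describe() const override { return "toy storage manager"; }
};

// A component that reports its build configuration and acts as the
// factory for the interface it provides.
class StorageComponent : public IComponent {
public:
    ComponentInfo info() const override {
        return ComponentInfo{"storage", "0.1", __DATE__,
                             "unknown (placeholder)", false, true, false};
    }
    std::unique_ptr<IStorageMgr> createStorageMgr() const {
        return std::unique_ptr<IStorageMgr>(new ToyStorageMgr());
    }
};
```

Clients would then depend only on `IComponent` and `IStorageMgr`, which is what makes a component replaceable without recompiling its users.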

Client Side Components
- Storage Manager
  - store and retrieve variable-length opaque data objects
    - maintains OIDs for data objects
    - implements transactional safety
    - language and platform independent
  - current implementation uses "shadow paging" to implement transactions
- Schema Manager
  - describes the layout of data types
    - data member position, size and type, byte ordering for primitive types
  - used for:
    - platform conversion, generic browsing, schema consistency
  - current implementation extracts the schema from the debug information provided directly by the compiler
  - no schema pre-processor required

Server Side Components
- Data Server
  - transfers data pages from persistent storage (disk/tape) to memory
    - file-system-like interface
  - trivial implementation for local I/O
  - multi-threaded server daemon for remote I/O
- Lock Server
  - keeps a central table of resource locks
    - getLock(oid)
  - implements lock waiting and upgrading
  - very similar approach to most DBMS
    - hash table of resource locks (resource specified as OID)
    - queue of waiters per locked resource
  - moderate complexity: the storage manager implements the "real" transaction logic
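
The lock table described above (a hash table keyed by resource, with a queue of waiters per locked resource) can be sketched as follows. This is a single-threaded toy under stated assumptions. `LockTable`, `LockMode`, `release` and the integer `ResourceId` standing in for an OID are hypothetical, as is the signature of `getLock`, and deadlock detection (for example two readers both waiting to upgrade) is left out.

```cpp
#include <cstdint>
#include <deque>
#include <set>
#include <unordered_map>
#include <vector>

using ResourceId = std::uint64_t;  // stand-in for a full OID (assumption)
using ClientId = int;

enum class LockMode { Read, Write };

struct Waiter { ClientId client; LockMode mode; };

struct LockEntry {
    LockMode mode = LockMode::Read;
    std::set<ClientId> holders;   // clients currently holding the lock
    std::deque<Waiter> waiters;   // FIFO queue of blocked requests
};

class LockTable {
public:
    // Returns true if the lock is granted at once, false if queued.
    bool getLock(ResourceId res, ClientId client, LockMode mode) {
        LockEntry& e = table_[res];
        if (e.holders.empty()) {  // free resource
            e.mode = mode;
            e.holders.insert(client);
            return true;
        }
        const bool holds = e.holders.count(client) != 0;
        if (mode == LockMode::Read) {
            if (holds) return true;  // already covered by own lock
            if (e.mode == LockMode::Read && e.waiters.empty()) {
                e.holders.insert(client);  // share with other readers
                return true;
            }
        } else {
            if (holds && e.mode == LockMode::Write) return true;
            if (holds && e.holders.size() == 1) {  // upgrade sole reader
                e.mode = LockMode::Write;
                return true;
            }
        }
        e.waiters.push_back(Waiter{client, mode});
        return false;
    }

    // Drops the client's lock and returns the waiters granted as a result.
    std::vector<ClientId> release(ResourceId res, ClientId client) {
        std::vector<ClientId> granted;
        auto it = table_.find(res);
        if (it == table_.end()) return granted;
        LockEntry& e = it->second;
        e.holders.erase(client);
        while (!e.waiters.empty()) {
            const Waiter w = e.waiters.front();
            const bool soleHolder =
                e.holders.size() == 1 && e.holders.count(w.client) != 0;
            const bool compatible =
                e.holders.empty() || soleHolder ||
                (w.mode == LockMode::Read && e.mode == LockMode::Read);
            if (!compatible) break;  // keep FIFO order, stop at first conflict
            e.waiters.pop_front();
            e.mode = w.mode;
            e.holders.insert(w.client);
            granted.push_back(w.client);
        }
        if (e.holders.empty()) table_.erase(it);
        return granted;
    }

private:
    std::unordered_map<ResourceId, LockEntry> table_;
};
```

Readers that arrive while a writer is queued are placed behind it, which keeps a steady stream of readers from starving the writer.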

C++ Language Binding
- Supports all main language features
  - including polymorphic access and templates
  - no language extensions, no generated code
- ODMG 2.0 compliant C++ binding
  - Ref templates can be sub-classed to extend their behavior
    - e.g. d_Ref could be extended to monitor object access counts
  - a large fraction of the binding has already been implemented
    - smart pointers can point to transient objects
    - persistent-capable classes may be embedded into other persistent classes
    - d_activate and d_deactivate are implemented
  - design supports multiple DB contexts per process
    - e.g. for multi-threaded applications and multiple federations
- Work in progress:
  - B-Tree indices, bi-directional links, installable adapters for persistent objects
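
The idea of sub-classing a reference template to monitor access counts can be illustrated with a small stand-in. `Ref<T>` below is not the ODMG `d_Ref<T>` nor the Espresso binding; it is a hypothetical minimal smart pointer, and `CountingRef`, `accessCount` and `Track` are likewise invented for the example.

```cpp
#include <cstddef>

// Minimal stand-in for an ODMG-style reference template (assumption:
// the real d_Ref<T> offers comparable dereference operators).
template <typename T>
class Ref {
public:
    explicit Ref(T* target = nullptr) : target_(target) {}
    T* operator->() const { return ptr(); }
    T& operator*() const { return *ptr(); }
    T* ptr() const { return target_; }

private:
    T* target_;
};

// Sub-class that counts every dereference made through it. The operators
// hide the base versions rather than override them, so counting applies
// only when the object is used via the CountingRef type.
template <typename T>
class CountingRef : public Ref<T> {
public:
    explicit CountingRef(T* target = nullptr) : Ref<T>(target) {}
    T* operator->() const { ++count_; return Ref<T>::ptr(); }
    T& operator*() const { ++count_; return *Ref<T>::ptr(); }
    std::size_t accessCount() const { return count_; }

private:
    mutable std::size_t count_ = 0;  // mutable: counting is not a logical change
};

// Toy persistent-capable class used in the usage example.
struct Track { double pt; };
```

A reference `CountingRef<Track> r(&track)` used as `r->pt` and `(*r).pt` would then report `r.accessCount() == 2`.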

First Scalability & Performance Tests
- Page Server
  - up to 70 concurrent clients
- Lock Server
  - up to 150 concurrent clients, up to 3000 locks
- Storage Manager
  - files up to 2 GB (ext2 file system limit under Linux)
  - 100 million objects per file
    - stress tested with "random" bit patterns
  - objects up to 10 MB in size
  - write performance: > 40 MB/s at 30% CPU
    - 450 MHz dual PIII with 4-stripe RAID 0 on RedHat 6.1
- C++ Binding and Schema Handling
  - successfully ported several non-trivial applications
  - HTL histogram examples, simple object browser using Python
  - tagDb and naming examples from HepODBMS

Next Steps
- Start detailed requirement discussion with experiments and other interested institutes
- Continue scalability & performance tests
  - Storage Manager: larger files (> 100 GB)
  - Page Server: connections > 500
  - Lock Server: number of locks > 20k
  - C++ Binding & Schema Manager: port Geant4 persistency examples and Conditions-DB
- By summer this year
  - Written architectural overview of the prototype
  - Development plan with detailed estimate of required manpower
  - Single-user toy system

Summary & Conclusions
- We identified solutions for the most critical components of a scalable and performant ODBMS
  - The prototype implementation shows promising performance and scalability
  - A strict component approach allows the effort to be split into independently developed, replaceable modules
- The development of an Open Source ODBMS seems possible within the HEP or general science community
- A collaborative effort of the order of 15 person years seems sufficient to produce such a system at production quality

The End

Exploit Read-Only Data
- Most of our data volume follows the pattern
  - (private) write-once,
  - share read-only
  - e.g. raw data is never updated, reconstructed data is not updated but replaced
- Current ODBMS implementations do not really take advantage of this fact
  - read-only files
    - no need to obtain any locks for this data
    - no need to ever update cache content
    - simple backup strategy
- Using the concept of read-only files (e.g. in the catalogue) should significantly reduce the locking overhead and improve the scalability of the system with many concurrent clients

Transactions and Recovery
- Shadow Paging
  - Physical pages on disk are accessed indirectly through a translation table (page map).
  - Copy-on-write: page modifications are always written to a new, free physical page.
  - Changed physical pages are made visible to other transactions by updating the page map at commit time.
[Figure: Master record, PageMap and numbered Data pages]
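
The three rules above can be sketched in a few lines. This is an in-memory toy with one open transaction at a time, not the Espresso storage manager. `ShadowPagedFile` and its member names are hypothetical, pages are modelled as strings, and a real implementation would persist the page map and switch a master record atomically at commit.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

class ShadowPagedFile {
public:
    // What other transactions see: the committed page map.
    std::string readCommitted(std::size_t logicalPage) const {
        return lookup(committedMap_, logicalPage);
    }

    // What the open transaction sees: its own shadow page map.
    std::string readShadow(std::size_t logicalPage) const {
        return lookup(shadowMap_, logicalPage);
    }

    // Copy-on-write: never overwrite in place, always take a new
    // physical page and redirect the shadow map to it.
    void write(std::size_t logicalPage, const std::string& data) {
        physicalPages_.push_back(data);
        shadowMap_[logicalPage] = physicalPages_.size() - 1;
    }

    // Commit publishes the shadow map (stands in for the atomic update
    // of the master record on disk).
    void commit() { committedMap_ = shadowMap_; }

    // Abort discards the shadow map. The orphaned physical pages would
    // go back to a free list, which this sketch omits.
    void abort() { shadowMap_ = committedMap_; }

private:
    using PageMap = std::map<std::size_t, std::size_t>;  // logical -> physical

    std::string lookup(const PageMap& map, std::size_t logicalPage) const {
        PageMap::const_iterator it = map.find(logicalPage);
        return it == map.end() ? std::string() : physicalPages_[it->second];
    }

    std::vector<std::string> physicalPages_;
    PageMap committedMap_;
    PageMap shadowMap_;
};
```

Because committed pages are never modified, crash recovery in such a scheme reduces to continuing with the last page map that was fully written.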

Advantages of this Approach
- Single files or complete domains can be used stand-alone without modification
  - e.g. a set of user files containing tags and histograms
- Local OIDs could be stored in a more compact form
  - transparent expansion into a full OID as they are read into memory
- "Attaching" or direct sharing of files or complete domains does not need any special treatment
  - no OID translation needed
  - read-only files/domains can be shared directly by multiple federations
- Domains allow the store to be segmented into "coherent regions" of associated objects
  - Efficient distribution, backup and replication of subsets of the data (e.g. a run period, a set of user tracks)
  - Consistency checks can be constrained to a single domain
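
The compact form of local OIDs mentioned above could look like the sketch below. It assumes that "local" means a reference to an object in the same file as the referring object, so that only Page# and Object# (48 bits) need to be stored. The slides do not define this, and `FullOid`, `LocalOid`, `isLocalTo`, `compact` and `expand` are hypothetical names.

```cpp
#include <cstdint>

// Full OID with the field widths from the OID layout slide.
struct FullOid {
    std::uint32_t domain;
    std::uint16_t file;
    std::uint32_t page;
    std::uint16_t object;
};

// Compact on-disk form for a reference that stays inside one file
// (assumption: domain and file are implied by the referring object).
struct LocalOid {
    std::uint32_t page;
    std::uint16_t object;
};

// A target can be stored in compact form only if it lives in the same
// domain and file as the object holding the reference.
inline bool isLocalTo(const FullOid& target, const FullOid& holder) {
    return target.domain == holder.domain && target.file == holder.file;
}

// Drop the implied fields before writing the reference to disk.
inline LocalOid compact(const FullOid& target) {
    return LocalOid{target.page, target.object};
}

// Restore the implied fields when the reference is read into memory.
inline FullOid expand(const LocalOid& local, const FullOid& holder) {
    return FullOid{holder.domain, holder.file, local.page, local.object};
}
```

Since the compact form carries no domain or file number, a file written this way can be attached to another domain or federation without rewriting its internal references.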

Common Services
- Services and interfaces of global visibility
  - OID, IStorageMgr, IPageServer, ILockServer, ISchemaMgr
  - Platform & OS abstraction
    - fixed-range types, I/O primitives, process control
  - component interface
    - version & configuration control
    - component factory
  - extensible diagnostics
    - named counters and timers to instrument the code
    - each component may have a sub-tree of diagnostic items
  - error & debug message handler
    - syslog-like: component, level, message
  - exception base class

Espresso Schema Extraction
- Currently implemented
  - extraction based on the "stabs" standard format for debugging information (used by egcs and Sun CC)
  - based on the GNU "BFD" library and "objdump" source code
- Prototype provides full runtime reflection for C++ data
  - describes classes and structs with their fields and inheritance
  - supports namespaces, typedefs, enums and templates
  - location and value of virtual function and virtual base class pointers
  - sufficient to allow a runtime field-by-field consistency check against the persistent schema
- Starting from a modified egcs front-end as schema extractor would be an alternative
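
A runtime field-by-field consistency check against a persistent schema might be sketched as below. The descriptor types `FieldInfo` and `ClassInfo` and the function `checkConsistency` are hypothetical. In the prototype the descriptions would come from the extracted debug information, whereas here they are filled in by hand.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Description of one data member (name, type, position and size).
struct FieldInfo {
    std::string name;
    std::string type;
    std::size_t offset;
    std::size_t size;
};

// Description of one class or struct.
struct ClassInfo {
    std::string name;
    std::vector<FieldInfo> fields;
};

// Compare the layout compiled into the running program with the layout
// recorded in the store. Returns one message per mismatch; an empty
// result means the two descriptions agree.
inline std::vector<std::string> checkConsistency(const ClassInfo& inMemory,
                                                 const ClassInfo& stored) {
    std::vector<std::string> problems;
    if (inMemory.name != stored.name)
        problems.push_back("class name: " + inMemory.name + " vs " + stored.name);
    if (inMemory.fields.size() != stored.fields.size()) {
        problems.push_back("field count differs in " + stored.name);
        return problems;  // positions no longer line up, stop here
    }
    for (std::size_t i = 0; i < stored.fields.size(); ++i) {
        const FieldInfo& a = inMemory.fields[i];
        const FieldInfo& b = stored.fields[i];
        if (a.name != b.name || a.type != b.type ||
            a.offset != b.offset || a.size != b.size)
            problems.push_back("field " + b.name + " differs");
    }
    return problems;
}
```

For example, a `Track` class whose `charge` member moved from offset 8 to offset 16 between two builds would yield a single message naming that field.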