GORDA Kickoff meeting INRIA – Sardes project Emmanuel Cecchet Sara Bouchenak.

Slides:



Advertisements
Similar presentations
Distributed Processing, Client/Server and Clusters
Advertisements

ObjectWeb Architecture Meeting, 25 sept 03 Our experience of open source.
Database Architectures and the Web
The google file system Cs 595 Lecture 9.
Introduction to DBA.
High Availability Group 08: Võ Đức Vĩnh Nguyễn Quang Vũ
1 Cheriton School of Computer Science 2 Department of Computer Science RemusDB: Transparent High Availability for Database Systems Umar Farooq Minhas 1,
DataGrid is a project funded by the European Union 22 September 2003 – n° 1 EDG WP4 Fabric Management: Fabric Monitoring and Fault Tolerance
GGF Toronto Spitfire A Relational DB Service for the Grid Peter Z. Kunszt European DataGrid Data Management CERN Database Group.
Distributed Database Management Systems
Overview Distributed vs. decentralized Why distributed databases
Figure 1.1 Interaction between applications and the operating system.
McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 17 Client-Server Processing, Parallel Database Processing,
Definition of terms Definition of terms Explain business conditions driving distributed databases Explain business conditions driving distributed databases.
Module 14: Scalability and High Availability. Overview Key high availability features available in Oracle and SQL Server Key scalability features available.
Understanding and Managing WebSphere V5
Word Wide Cache Distributed Caching for the Distributed Enterprise.
Cloud Computing for the Enterprise November 18th, This work is licensed under a Creative Commons.
1 The Google File System Reporter: You-Wei Zhang.
Highly available web sites with Tomcat and Clustered JDBC Emmanuel Cecchet.
JOnAS developer workshop – /02/2004 status Emmanuel Cecchet
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved DISTRIBUTED SYSTEMS.
Chapter Oracle Server An Oracle Server consists of an Oracle database (stored data, control and log files.) The Server will support SQL to define.
Oracle10g RAC Service Architecture Overview of Real Application Cluster Ready Services, Nodeapps, and User Defined Services.
Sofia, Bulgaria | 9-10 October SQL Server 2005 High Availability for developers Vladimir Tchalkov Crossroad Ltd. Vladimir Tchalkov Crossroad Ltd.
/11/2003 C-JDBC: a High Performance Database Clustering Middleware Nicolas Modrzyk
DISTRIBUTED COMPUTING
© Emic 2005 Emmanuel Cecchet | ApacheCon EU Building Highly Available Database Applications for Apache Derby Emmanuel Cecchet.
Cloud Computing & Amazon Web Services – EC2 Arpita Patel Software Engineer.
1 Moshe Shadmon ScaleDB Scaling MySQL in the Cloud.
Chokchai Junchey Microsoft Product Specialist Certified Technical Training Center.
Copyright 2006 MySQL AB The World’s Most Popular Open Source Database MySQL Cluster: An introduction Geert Vanderkelen MySQL AB.
An open source middleware to build Redundant Array of Inexpensive Databases Emmanuel Cecchet.
Week 5 Lecture Distributed Database Management Systems Samuel ConnSamuel Conn, Asst Professor Suggestions for using the Lecture Slides.
Oracle 10g Database Administrator: Implementation and Administration Chapter 2 Tools and Architecture.
Module 13 Implementing Business Continuity. Module Overview Protecting and Recovering Content Working with Backup and Restore for Disaster Recovery Implementing.
Heavy and lightweight dynamic network services: challenges and experiments for designing intelligent solutions in evolvable next generation networks Laurent.
Applications Web et bases de données en grappe Séminaire InTech 3 Février 2005 – Grenoble.
Middleware for FIs Apeego House 4B, Tardeo Rd. Mumbai Tel: Fax:
Advanced Computer Networks Topic 2: Characterization of Distributed Systems.
Usenix Annual Conference, Freenix track – June 2004 – 1 : Flexible Database Clustering Middleware Emmanuel Cecchet – INRIA Julie Marguerite.
Achieving Scalability, Performance and Availability on Linux with Oracle 9iR2-RAC Grant McAlister Senior Database Engineer Amazon.com Paper
Preventive Replication in Database Cluster Esther Pacitti, Cedric Coulon, Patrick Valduriez, M. Tamer Özsu* LINA / INRIA – Atlas Group University of Nantes.
Kjell Orsborn UU - DIS - UDBL DATABASE SYSTEMS - 10p Course No. 2AD235 Spring 2002 A second course on development of database systems Kjell.
Databases Illuminated
1 Admission Control and Request Scheduling in E-Commerce Web Sites Sameh Elnikety, EPFL Erich Nahum, IBM Watson John Tracey, IBM Watson Willy Zwaenepoel,
11 CLUSTERING AND AVAILABILITY Chapter 11. Chapter 11: CLUSTERING AND AVAILABILITY2 OVERVIEW  Describe the clustering capabilities of Microsoft Windows.
Highly available database clusters with JDBC
VMware vSphere Configuration and Management v6
© Chinese University, CSE Dept. Distributed Systems / Distributed Systems Topic 1: Characterization of Distributed & Mobile Systems Dr. Michael R.
Nicolas Modrzyk /02/2004 C-JDBC: a High Performance Database Clustering.
CERN - IT Department CH-1211 Genève 23 Switzerland t High Availability Databases based on Oracle 10g RAC on Linux WLCG Tier2 Tutorials, CERN,
Chapter 1 Database Access from Client Applications.
Module 6: Administering Reporting Services. Overview Server Administration Performance and Reliability Monitoring Database Administration Security Administration.
By Nitin Bahadur Gokul Nadathur Department of Computer Sciences University of Wisconsin-Madison Spring 2000.
Replicazione e QoS nella gestione di database grid-oriented Barbara Martelli INFN - CNAF.
Introduction to Core Database Concepts Getting started with Databases and Structure Query Language (SQL)
INTRODUCTION TO GRID & CLOUD COMPUTING U. Jhashuva 1 Asst. Professor Dept. of CSE.
Distributed Databases
Amazon Web Services. Amazon Web Services (AWS) - robust, scalable and affordable infrastructure for cloud computing. This session is about:
The Holmes Platform and Applications
Introduction to Distributed Platforms
Open Source distributed document DB for an enterprise
Maximum Availability Architecture Enterprise Technology Centre.
A Technical Overview of Microsoft® SQL Server™ 2005 High Availability Beta 2 Matthew Stephen IT Pro Evangelist (SQL Server)
Storage Virtualization
Introduction of Week 6 Assignment Discussion
Database System Architectures
Presentation transcript:

GORDA Kickoff meeting INRIA – Sardes project Emmanuel Cecchet Sara Bouchenak

/10/2004 Outline è INRIA, ObjectWeb & Sardes è GORDA

/10/2004 Budget: 120 M€ (tax not incl.) 900 permanent staff 400 researchers 500 engineers, technical and administrative 450 researchers from other organizations 700 Ph.D students 200 external collaborators 750 trainees, post-doctoral students, visiting researchers from abroad (universities or industry)) 6 Research Units A scientific force of 3,000 INRIA key figures Jan A public scientific and technological research institute in computer science and control under the dual authority of the Ministry of Research and the Ministry of Industry INRIA Rhône-Alpes

/10/2004 Itanium-2 processors 104 nodes (Dual 64 bits 900 MHz processors, 3 GB memory, 72 GB local disk) connected through a Myrinet network 208 processors, 312 GB memory, 7.5 TB disk Connected to the GRID Linux OS (RedHat Advanced Server) First Linpack experiments at INRIA (Aug. 03) have reached 560 GFlop/s Applications : Grid computing, classical scientific computing, high performance Internet servers, … iCluster 2

/10/2004 ObjecWeb key figures è Open source middleware development è Based on open standard  J2EE, CORBA, OSGi… è International consortium  Founded by INRIA, Bull and FT R&D in 2001 è Academic partners  European universities and research centers è Industrial partners  RedHat, Suse, MySQL, …  NEC, Bull, France Telecom, Dassault, Cap Gemini, …

/10/2004 Common Software & Architecture for Component Based Development OSGi Hardware & OS Applications J2EECCM... Dedicated Middleware Platform Networked Resources JVM TransactionsPersistence Deployment... Composition... Application Servers Database Connectivity Interoperability Tools FrameworksComponents... JOnAS OpenCCM OSCAR ProActive JORM/MEDOR JOTMOSCAR Fractal Think JORAM DotNetJ C-JDBC RmiJdbc Octopus Enhydra Zeus Kilim CAROL Jonathan Speedo RUBiS Bonita JAWE XMLC JMOB

/10/2004 Sardes project è Distributed Systems group è Main research themes  Reflective component technology  Autonomous systems management è Applications areas  high-availability J2EE servers  dynamic monitoring, configuration and resource management in large scale distributed systems  (embedded system networks, ubiquitous computing) è Result dissemination by ObjectWeb

/10/2004 Outline è INRIA, ObjectWeb & Sardes è GORDA

/10/2004 Sardes experiences è Component-based open source middleware  ObjectWeb ( è J2EE application servers  JOnAS clustering ( è Database replication middleware  C-JDBC ( è Benchmarking  RUBiS (  TPC-W (  CLIF ( è Monitoring  LeWYS (

/10/2004 Common scalability practice Internet Web frontend App. server Well-known database vendor here Database Well-known hardware + database vendors here è Cons  Cost  Scalability limit

/10/2004 Replication with shared disks Internet Web frontend App. server Database Disks Another well-known database vendor è Cons  still expensive hardware  availability

/10/2004 Master/Slave replication Internet Web frontend App. server è Cons  consistency  failover time on master failure  scalability Master

/10/2004 Internet è Database tier should be  scalable  highly available  without modifying the client application  database vendor independent  on commodity hardware Atomic broadcast-based replication Atomic broadcast

/10/2004 Internet è JDBC compliant (no client application modification) è database vendor independent  JDBC driver required  heterogeneity support è no 2PC, no group communication between databases è group communication for controller replication only C-JDBC JDBC

/10/2004 RAIDb - Definition è Redundant Array of Inexpensive Databases è better performance and fault tolerance than a single database, at a low cost, by combining multiple database instances into an array of databases è RAIDb levels offers various tradeoff of performance and fault tolerance

/10/2004 RAIDb è Redundant Array of Inexpensive Databases  better performance and fault tolerance than a single database, at a low cost, by combining multiple database instances into an array of databases è RAIDb controller  gives the view of a single database to the client  balance the load on the database backends è RAIDb levels  RAIDb-0: full partitioning  RAIDb-1: full mirroring  RAIDb-2: partial replication  composition possible

/10/2004 C-JDBC – Key ideas è Middleware implementing RAIDb è Two components  generic JDBC 2.0 driver (C-JDBC driver)  C-JDBC Controller è C-JDBC Controller provides  performance scalability  high availability  failover  caching, logging, monitoring, … è Supports heterogeneous databases

/10/2004 C-JDBC – Overview

/10/2004 Heterogeneity support è unload a single Oracle DB with several MySQL è RAIDb-2 for partial replication

/10/2004 Inside the C-JDBC Controller Sockets JMX

/10/2004 C-JDBC features è unified authentication management è tunable concurrency control è automatic schema detection è tunable replication: full partitioning, partial replication, full replication è caching: metadata, parsing, result with various invalidation granularities è various load balancing strategies è on-the-fly query rewriting for macros and heterogeneity support è recovery log for dynamic backend adding and failure recovery è database backup/restore using Octopus è JMX based monitoring and administration è graphical administration console

/10/2004 Functional overview connect myDB connect login, password execute SELECT * FROM t

/10/2004 Functional overview execute INSERT INTO t …

/10/2004 Failures è No 2 phase- commit  parallel transactions  failed nodes are automatically disabled execute INSERT INTO t …

/10/2004 Controller replication è Prevent the controller from being a single point of failure è Group communication for controller synchronization è C-JDBC driver supports multiple controllers with automatic failover jdbc:c-jdbc://node1:25322,node2:12345/myDB

/10/2004 Controller replication jdbc:cjdbc://node1,node2/myDB Uniform total order reliable multicast

/10/2004 Mixing horizontal & vertical scalability

/10/2004 Lessons learned è SQL parsing cannot be generic è many discrepancies in JDBC implementations è minimize the use of group communications  IP multicast does not scale è notification infrastructure needed è users want  no single point of failure  control (monitoring, plug-able recovery policies, …)  no database vendor locking  no database modification è need for an exhaustive test suite è benchmarking accurately is very difficult  load injection requires resources  monitoring and exploiting results is tricky

/10/2004 Sardes role in GORDA è provide input  GORDA APIs  group communication requirements  monitoring and management requirements è middleware implementation based on C-JDBC è dissemination effort  ObjectWeb  possible participation to JCP for JDBC extensions è hardware resources for experiments è eCommerce benchmarks

/10/2004 Other interests è LeWYS (  monitoring infrastructure  generic hardware/kernel probes for Linux/Windows  software probes: JMX, SNMP, …  monitoring repository è autonomic behavior  building supervision loops  self-healing clusters  self-sizing (expand or shrink)  SLAs

/10/2004 Q without A è do we consider distributed query execution? è XA support? è cluster size targeted? è do we target grids or cluster of clusters?  reconciliation  consistency/caching  network architecture considered? è are relaxed or loose consistency options? è what will really cover the GRI?  do we impose a specific way of doing replication?  access to read-set/write-set difficult to implement with legacy databases è which workloads are considered? è which WP deals with backup/recovery? è licensing issues?

Q&A _________ Thanks to all users and contributors...

Bonus slides

INTERNALS

/10/2004 Virtual Database è gives the view of a single database è establishes the mapping between the database name used by the application and the backend specific settings è backends can be added and removed dynamically è configured using an XML configuration file

/10/2004 Authentication Manager è Matches real login/password used by the application with backend specific login/ password è Administrator login to manage the virtual database

/10/2004 Scheduler è Manages concurrency control è Specific implementations for Single DB, RAIDb 0, 1 and 2 è Query-level è Optimistic and pessimistic transaction level  uses the database schema that is automatically fetched from backends

/10/2004 Request cache è caches results from SQL requests è improved SQL statement analysis to limit cache invalidations  table based invalidations  column based invalidations  single-row SELECT optimization è request parsing possible in the C-JDBC driver  offload the controller  parsing caching in the driver

/10/2004 Load balancer 1/2 è RAIDb-0  query directed to the backend having the needed tables è RAIDb-1  read executed by current thread  write executed in parallel by a dedicated thread per backend  result returned if one, majority or all commit  if one node fails but others succeed, failing node is disabled è RAIDb-2  same as RAIDb-1 except that writes are sent only to nodes owning the written table

/10/2004 Load balancer 2/2 è Static load balancing policies  Round-Robin (RR)  Weighted Round-Robin (WRR) è Least Pending Requests First (LPRF)  request sent to the node that has the shortest pending request queue  efficient if backends are homogeneous in terms of performance

/10/2004 Connection Manager è Connection pooling for a backend  Simple : no pooling  RandomWait : blocking pool  FailFast : non-blocking pool  VariablePool : dynamic pool è Connection pools defined on a per login basis  resource management per login  dedicated connections for admin

/10/2004 Recovery Log è Checkpoints are associated with database dumps è Record all updates and transaction markers since a checkpoint è Used to resynchronize a database from a checkpoint è JDBCRecoveryLog  store information in a database  can be re-injected in a C-JDBC cluster for fault tolerance

SCALABILITY

/10/2004 C-JDBC scalability è Horizontal scalability  prevents the controller to be a Single Point Of Failure (SPOF)  distributes the load among several controllers  uses group communications for synchronization è C-JDBC Driver  multiple controllers automatic failover jdbc:c-jdbc://node1:25322,node2:12345/myDB  connection caching  URL parsing/controller lookup caching

/10/2004 C-JDBC scalability è Vertical scalability  allows nested RAIDb levels  allows tree architecture for scalable write broadcast  necessary with large number of backends  C-JDBC driver re-injected in C-JDBC controller

/10/2004 C-JDBC vertical scalability è RAIDb-1-1 with C-JDBC è no limit to composition deepness

/10/2004 C-JDBC vertical scalability è RAIDb-0-1 with C-JDBC

CHECKPOINTING

/10/2004 Fault tolerant recovery log UPDATE statement

/10/2004 Checkpointing è Octopus is an ETL tool è Use Octopus to store a dump of the initial database state è Currently done by the user using the database specific dump tool

/10/2004 Checkpointing è Backend is enabled è All database updates are logged (SQL statement, user, transaction, …)

/10/2004 Checkpointing è Add new backends while system online è Restore dump corresponding to initial checkpoint with Octopus

/10/2004 Checkpointing è Replay updates from the log

/10/2004 Checkpointing è Enable backends when done

/10/2004 Making new checkpoints è Disable one backend to have a coherent snapshot è Mark the new checkpoint entry in the log è Use Octopus to store the dump

/10/2004 Making new checkpoints è Replay missing updates from log

/10/2004 Making new checkpoints è Re-enable backend when done

/10/2004 Recovery è A node fails! è Automatically disabled but should be fixed or changed by administrator

/10/2004 Recovery è Restore latest dump with Octopus

/10/2004 Recovery è Replay missing updates from log

/10/2004 Recovery è Re-enable backend when done

HORIZONTAL SCALABILITY

/10/2004 Horizontal scalability è JGroups for controller synchronization è Groups messages for writes only

/10/2004 Horizontal scalability è Centralized write approach issues è Issues with transactions assigned to connections

/10/2004 Horizontal scalability è General case for a write query  3 multicast + 2n unicast

/10/2004 Horizontal scalability è Solution: No backend sharing  1 multicast + n unicast [+ 1 multicast]

/10/2004 Horizontal scalability è Issues with JGroups  resources needed by a channel  instability of throughput with UDP  performance scalability è TCP better than UDP but  unable to disable reliability on top of TCP  unable to disable garbage collection  ordering implementation is sub-optimal è Need for a new group communication layer optimized for cluster

/10/2004 Horizontal scalability è JGroups performance on UDP/FastEthernet

USE CASES

/10/2004 Budget High Availability è High availability infrastructure “on a budget” è Typical eCommerce setup è

/10/2004 OpenUSS: University Support System è eLearning è High availability è Portability  Linux, HP-UX, Windows  InterBase, Firebird, PostgreSQL, HypersonicSQL è

/10/2004 Flood alert system è Disaster recovery è Independent nodes synchronized with C-JDBC è VPN for security issues è

/10/2004 J2EE benchmarking è Large scale J2EE clusters è Interne t emulated users

PERFORMANCE

/10/2004 TPC-W è Browsing mix performance

/10/2004 TPC-W è Shopping mix performance

/10/2004 TPC-W è Ordering mix performance

/10/2004 Result cache è Cache contains a list of SQL->ResultSet è Policy defined by queryPattern->Policy è 3 policies  EagerCaching: variable granularities for invalidations  RelaxedCaching: invalidations based on timeout  NoCaching: never cached RUBiS bidding mix with 450 clients No cacheCoherent cache Relaxed cache Throughput (rq/min) Avg response time801 ms284 ms134 ms Database CPU load100%85%20% C-JDBC CPU load-15%7%

/10/2004 Outline è Motivations è RAIDb è C-JDBC è Performance è Lessons learned è Conclusion

/10/2004 Open problems è Partition of clusters è Users want control on failure policy è Reconciliation must also be user controlled

/10/2004 LeWYS overview Monitoring Pump CPU probe disk probe pump thread DREAM Monitoring Pump CPU probe memory probe pump thread DREAM Monitoring Pump pump thread CPU probe network probe Observer Monitoring repository

LeWYS

/10/2004 LeWYS components è Library of probes  hardware resources: cpu, memory, disk, network  generic sensors: SNMP, JMX, JVMPI, … è Monitoring pump  dynamic deployment of sensors  manages monitoring leases è Event channels  propagate monitored events to interested observers  allows for filtering, aggregation, content-based processing, … è Optional monitoring repository

/10/2004 LeWYS design choices è Component-based framework  probes, monitoring pump, event channels  provides (re)configurability capabilities è Minimize intrusiveness on monitored nodes è No global clock  timestamp generated locally by pump è Information processing in DREAM channels

/10/2004 Centralized monitoring using a monitoring repository (1) Monitoring Pump CPU probe memory probe pump thread DREAM Monitoring Pump CPU probe memory probe pump thread DREAM Monitoring Repository Monitoring DB storage thread query threads Monitoring Console event subscribe service

/10/2004 Centralized monitoring using a monitoring repository (2) è Monitoring repository  stores monitoring information  service to retrieve monitoring information è Pros  DB allows for storing large amount of data  powerful queries correlate data from various probes at different locations resynchronize clocks  browsing history to diagnose failures  use history for system provisioning è Cons  requires a DB (heavy weight solution)

/10/2004 Outline è J2EE Cluster è Group communications è Monitoring  motivations  LeWYS  implementation è Status & Perspectives

/10/2004 Monitoring pump implemention Component ProbeFactory Monitoring Pump Thread Probe MonitoringPump Manager ChannelOut ProbeManager Binding Controller Probe Repository ProbeManager OutputManager Component PullPush Multiplexer MonitoringMumpManager OutputManager TimeStamp Probe Cache CachedProbe RMI

/10/2004 Hardware Probes è Pure Java probes  using /proc  cost: 0.01ms/call (Linux) LinuxSolaris /proc Hardware resources Windows.DLLCC Linux C JNI cpu, mem, disk, net, kernel, …probes cpu, … probes ……

/10/2004 Software Probes è Application level monitoring  JMX  ad-hoc è JVM LinuxSolaris JVM Hardware resources WindowsLinux JMXJVMPI SNMP, ad-hoc, … probes JVM probes JMX based probes