Distributed Data Management for Compute Grid Presented by Michael Di Stefano Founder of Author of Meeting: Tuesday, September 13th, 2005
Slide Agenda Data Management - The Next Grid Problem Evolution in Compute Topology Objectives of Data Management New Topology – New Data Management Techniques New Techniques, New Research, Emergence of Standards
Slide Two Components of The Grid
Compute Grid: The Grid Operating System - provides the core services for grid computing
–Physical Resource Accounting
–Process Task Queues
–Management of Task/Resource Execution
Data Grid: Data Management System of Grid - manages all aspects
–Enterprise Data
–Data Scheduling
–Replication
–Availability
–Legacy Access
Slide Compute Grids Roll your own Compute Grid Free Versions of Compute Grids Product and Supported Compute Grids
Slide Data Grids Data Grid Engine - Movement of Bits and Bytes FTP Sockets Middleware (messaging) Caches Applications Perspective Multiple Data Characteristics Quality of Service Data Management not Bit/Byte Movement
Slide Evolution in Computing Mainframe, Mini, Client/Server
Slide Years of Distributed Computing Evolution Sockets CORBA Messaging Internet Application Servers Tight Bindings Loose Coupling Publish / Subscribe Grid Topology Emerging from the “Evolutionary Mist” Client/Server © Integrasoft, L.L.C. 2005
Slide Evolution Distributed Data Management for Grid Computing Copyright John Wiley and Sons 2005
Slide The Grid Topology Client / Server Compute Grid Physical Operational Operating System Physical CPU Peripherals Execution Threads Operating System Physical Nodes Resource/Node Management Inventory of Work/Tasks Resource Inventory Matching of Task to Resource Close Proximity (Motherboard) Diverse CPU Families Diverse Geography Diverse Network Bandwidth
Slide Application on the Grid Multiple Data Sources and Destinations Client Information Portfolio Information Market Data Quality of Service Levels Application in its entirety Application components Speed of Access Query Updates (Transactional, Optimistic)
Slide How QoS is Delivered Today Relational Databases SQL Query Transactional Updates Stored Procedures Middleware Queuing Various delivery modes Publish and Subscribe Easy Programmatic API Other Object Databases Object Relational Data flow and movement is optimized. Designed to meet Application QoS For Client/Server Topology
Slide Application Today in Client/Server Threads RAM Connection Pools Tailored Middleware Business Application Server Machine
Slide What Happens in a Grid Business Application Server Machine Compute Grid
Slide The Data Access Funnel
Slide Data Grid Eliminates the Funnel
Slide Goals of Data Management in Grid
The Big 3 Goals of Data Management in Grid
Optimize Data Affinity
–Minimize Data Movement
–Optimize the resources of the Network
Maintain Business Application QoS for Data Management
Integrate Legacy Systems into the Grid
Slide How to Achieve the Goals of the Data Grid
What the Architect/Developer must Address
How many copies or “Replicas” of data are needed in the Data Grid?
How fine is the granularity of my “Data Atoms” to be replicated?
How best to “Distribute” Data Atoms across the Data Grid?
What level of “Synchronization” is required?
How to “logically group” data along business lines?
How to “Integrate” and “Operate” legacy data sources?
How to manage “Events” in the Data Grid?
Synchronization of data sources external to the Data Grid?
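The replica-count and distribution questions above can be made concrete with a small sketch. It is illustrative only: `RegionPolicy`, `place_replicas`, the field names, and the round-robin placement rule are assumptions made for this example and do not come from the presentation or from any Data Grid product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RegionPolicy:
    """Hypothetical per-region answers to the architect's questions."""
    region: str           # logical grouping of data along a business line
    replicas: int         # number of copies kept of each data atom
    synchronization: str  # assumed labels, e.g. "sync" or "async"


def place_replicas(atoms: List[str], nodes: List[str],
                   policy: RegionPolicy) -> Dict[str, List[str]]:
    """Assign each atom's replicas to distinct nodes, round-robin."""
    if not 1 <= policy.replicas <= len(nodes):
        raise ValueError("replicas must be between 1 and the node count")
    return {
        atom: [nodes[(i + r) % len(nodes)] for r in range(policy.replicas)]
        for i, atom in enumerate(atoms)
    }


policy = RegionPolicy(region="portfolio", replicas=2, synchronization="async")
placement = place_replicas(["atom-0", "atom-1", "atom-2"],
                           ["n0", "n1", "n2"], policy)
print(placement)
```

Round-robin is only one possible distribution rule; it is used here because it keeps the replicas of any one atom on different nodes with very little code.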
Slide Data Management in Grid Granularity of Data Atoms Replication Distribution Logical Data Groupings (Data Regions) Synchronization InterRegion IntraRegion External Data Sources Events Integration with Legacy Systems Nothing to do with mechanics of the bits and bytes These are Data Management Issues
Slide Data Management is NOT Caching
Moves the bits and bytes
-Cache
-Grid FTP
-Others
Data Management to deliver Business Application’s QoS given the “compute topology”
Slide Engines of a Data Grid Cache Java-based engines such as JCache, JavaSpaces, … Various C++ Caches Recycled Object Database Technology FTP Grid FTP Meta Data Services File Systems NFS Distributed File Systems
Slide Right Tool for the Job Business Applications have specific QoS levels from the Data Grid Complex Analysis of Large Data Sets Dependency of small fast moving data sets Large Static Data Sets …….
Slide Business Drivers Fueling Grid
Slide Limited Patience of Business
Slide No Data Management Tools Difficult Custom Code Long Time to Delivery No Reuse Business Perspective Increased Complexity Improved Performance Financial ROI Grid fails Widespread Acceptance
Slide Business Perspective Financial ROI With Data Management for Grid Easy to use/understand Reuse Effort on business Increased Complexity Improved Performance Fast Time to Market Ease of Migration to Grid Changes Data Centers
Slide Data Management in Grid Granularity of Data Atoms Replication Distribution Data Regions Synchronization Integration with Legacy Systems If Distributed Data Management is not addressed, wide acceptance of Grid will fail.
Slide Measuring QoS to Determine Data Grid
Application QoS( Work(), Data(), Time(), Geography(), Query() )
Where:
Work( batch/atomic, sync/async )
Data( overall size, atomic size, transient, query )
Time( RealTime, Non-RealTime, Near-RealTime )
Geography( Topology, Bandwidth )
Query( Basic, Complex )
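One way to read the Application QoS expression is as a record with five components. The sketch below encodes it as Python dataclasses. The class names, field names, units (bytes, Mbit/s), and the sample values are assumptions chosen for illustration; the slide only names the dimensions and their possible values.

```python
from dataclasses import dataclass
from enum import Enum


class Timeliness(Enum):
    REAL_TIME = "RealTime"
    NEAR_REAL_TIME = "Near-RealTime"
    NON_REAL_TIME = "Non-RealTime"


class QueryKind(Enum):
    BASIC = "Basic"
    COMPLEX = "Complex"


@dataclass(frozen=True)
class Work:
    batch: bool        # True = batch, False = atomic
    synchronous: bool  # True = sync, False = async


@dataclass(frozen=True)
class Data:
    overall_size_bytes: int  # unit is an assumption
    atomic_size_bytes: int   # size of one data atom
    transient: bool
    queried: bool


@dataclass(frozen=True)
class Geography:
    topology: str          # free-form label, e.g. "single-site"
    bandwidth_mbps: float  # unit is an assumption


@dataclass(frozen=True)
class ApplicationQoS:
    work: Work
    data: Data
    time: Timeliness
    geography: Geography
    query: QueryKind


# Hypothetical profile for a risk calculation over market data.
qos = ApplicationQoS(
    work=Work(batch=True, synchronous=False),
    data=Data(overall_size_bytes=10**9, atomic_size_bytes=10**3,
              transient=True, queried=True),
    time=Timeliness.NEAR_REAL_TIME,
    geography=Geography(topology="single-site", bandwidth_mbps=100.0),
    query=QueryKind.BASIC,
)
print(qos.time.value, qos.query.value)
```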
Slide Objective of Data Grid - Data Affinity
Low cost of CPU
Data size is determined by application
Network bandwidth is limited
Data and Work need to be co-located
Virtual Centrally Managed Database, Physically Distributed
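A short calculation shows why limited network bandwidth argues for co-locating data and work. The figures (a 1 GB data set, 100 Mbit/s and 1 Gbit/s links) are illustrative assumptions, not numbers from the presentation, and the function ignores latency and protocol overhead.

```python
def transfer_seconds(size_bytes: int, bandwidth_bits_per_s: float) -> float:
    """Ideal time to move size_bytes over a link of the given bandwidth."""
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes * 8 / bandwidth_bits_per_s


one_gb = 10**9                            # assumed data set size
slow = transfer_seconds(one_gb, 100e6)    # 100 Mbit/s link
fast = transfer_seconds(one_gb, 1e9)      # 1 Gbit/s link
print(slow, fast)
```

Under these assumptions, every task that has to pull the full data set across the network pays this cost before it can start computing, which is the movement that data affinity tries to avoid.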
Slide How to Achieve Data Affinity
Locate data and work close together to minimize data movement across the network
Reactive: Data Grid distributes data in anticipation of where work will be assigned. Distributed Data Management policies of
–Regionalization
–Replication
–Distribution
–Synchronization
Proactive: Routing of Task to Data. Compute Grid Task Scheduler queries Data Locality Information from Data Grid
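The routing of a task to its data can be sketched as a scheduler that asks where the data atoms live and picks the node that would need the fewest bytes moved. This is a minimal sketch under stated assumptions: `pick_node`, the in-memory `locality` map standing in for the Data Grid's locality information, and all sample names and sizes are hypothetical.

```python
from typing import Dict, Set, Tuple


def pick_node(task_atoms: Set[str],
              locality: Dict[str, Set[str]],
              atom_size: Dict[str, int]) -> Tuple[str, int]:
    """Return (node, bytes_to_move) for the node needing the least movement.

    locality maps a node name to the data atoms it already holds.
    Ties are broken by node name so the result is deterministic.
    """
    if not locality:
        raise ValueError("no nodes available")
    best_node, best_cost = "", -1
    for node in sorted(locality):
        missing = task_atoms - locality[node]
        cost = sum(atom_size[a] for a in missing)
        if best_cost < 0 or cost < best_cost:
            best_node, best_cost = node, cost
    return best_node, best_cost


# Hypothetical locality information reported by the Data Grid.
locality = {
    "node-a": {"portfolio", "market"},
    "node-b": {"client"},
}
atom_size = {"portfolio": 500, "market": 200, "client": 50}

node, moved = pick_node({"portfolio", "client"}, locality, atom_size)
print(node, moved)
```

Here node-a already holds the larger portfolio atom, so sending the task there moves only the 50-byte client atom, compared with 500 bytes if the task ran on node-b.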
Slide Distributed Data Management Data Regions Replication Distribution Synchronization Load and Store Event
Slide Distributed Data Management Policies
Slide Advanced Topics in Distributed Data Management Natural Attraction Forces of Data Bodies Within a Data Grid To Describe Efficient Data Distribution Patterns White Paper Michael Di Stefano September 2004 Distributed Data Management for Grid Computing Copyright John Wiley and Sons 2005
Slide Purchasing Information Please Visit To Purchase your copy of “Distributed Data Management for Grid Computing” To receive a 15% discount.