Copyright © 2006, GemStone Systems Inc. All Rights Reserved. Increasing computation throughput with Grid Data Caching Jags Ramnarayan Chief Architect GemStone.

Presentation transcript:

Copyright © 2006, GemStone Systems Inc. All Rights Reserved. Increasing computation throughput with Grid Data Caching Jags Ramnarayan Chief Architect GemStone Systems

Copyright © 2008, GemStone Systems Inc. All Rights Reserved.

Background on GemStone Systems
- Known for its object database technology since 1982
- Now specializes in memory-oriented distributed data management
  - 12 pending patents
- Over 200 installed customers in the Global 2000
- Grid focus driven by:
  - Very high performance with predictable throughput, latency and availability
  - Capital markets: risk analytics, pricing, etc.
  - Large e-commerce portals: real-time fraud
  - Federal intelligence

Batch to real-time - long jobs to short tasks

Increasing focus on data management
Workloads where
- task duration is getting shorter
- latency of data access is important
- consistency in data is crucial
- high availability is not enough; it has to be continuously available
- common data is shared across thousands of parallel activities

Accessing data in Grid today
- Direct access to enterprise database or federated data access layer
  - Exposed to the weakest link problem
    - only as fast as the slowest data source
    - only as available as the weakest link
    - can only scale as well as the weakest link
- Distributed/parallel file systems
  - What if too many tasks go after the same data?
  - Disk access speed is still 1000X slower than memory
  - Data consistency challenges
  - Might be controversial here

Impact to Grid SLA

Introducing memory oriented data fabric
- Pool memory (and disk) across cluster/Grid
- Managed as a single unit
- Replicate data for high concurrent load, HA
- Distribute (partition) data for high data volume, scale
- Gracefully expand capacity to meet scalability/performance goals
Diagram labels: Distributed Data Space; Data warehouses; Relational databases; Distributed Applications

How does it work?
- When data is stored, it is transparently replicated and/or partitioned
- Redundant storage can be in memory and/or on disk, which ensures continuous availability
- Machine nodes can be added dynamically to expand storage capacity or to handle increased client load
- Shared-nothing disk persistence: each cache instance can optionally persist to disk
- Synchronous read-through and write-through, or asynchronous write-behind, to other data sources and sinks
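
The read-through and write-behind behaviour named on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `WriteBehindCache`, its methods, and the plain dictionary standing in for a database are hypothetical names chosen here, not the GemFire API.

```python
from collections import deque


class WriteBehindCache:
    """Toy cache (hypothetical): read-through on a miss, write-behind via a queue."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # a dict standing in for a database
        self.memory = {}                    # in-memory copy of the data
        self.pending = deque()              # writes not yet sent to the store

    def get(self, key):
        # Read-through: on a miss, load from the backing store and keep it in memory.
        if key not in self.memory:
            self.memory[key] = self.backing_store[key]
        return self.memory[key]

    def put(self, key, value):
        # Write-behind: update memory at once, queue the store update for later.
        self.memory[key] = value
        self.pending.append((key, value))

    def flush(self):
        # Drain queued writes to the backing store in arrival order.
        while self.pending:
            key, value = self.pending.popleft()
            self.backing_store[key] = value


store = {"IBM": 100}
cache = WriteBehindCache(store)
value_read = cache.get("IBM")        # loaded from the store on first access
cache.put("IBM", 105)                # visible in the cache immediately
store_before_flush = store["IBM"]    # the store has not seen the write yet
cache.flush()
store_after_flush = store["IBM"]     # the store catches up after the flush
```

A synchronous write-through variant would update `backing_store` inside `put` instead of queueing, trading write latency for a store that is never behind the cache.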

Predictably scale with partitioning
By keeping data spread across many nodes in memory, we can exploit the CPU and network capacity on each node simultaneously to provide linear scalability.
- Parallel loading by many Grid nodes, limited only by CPU and network backbone
- With partitioning metadata on each compute node, access to any single piece of data is a single hop
- As changes are redundantly and synchronously managed, availability and consistency are preserved
- Dynamically detect load changes and add or remove nodes for data
- Automatic data re-partitioning will condition the load
Diagram labels: Distributed Apps; Local Cache; Partitioning Meta Data; Single Hop; A1 B1 C1 D1 E1 F1 G1 H1 I1
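
The single-hop claim depends on every compute node holding a copy of the partitioning metadata. A minimal sketch of that idea follows, assuming a fixed number of buckets and a bucket-to-node table; the node names, `NUM_BUCKETS`, and the helper functions are made up for illustration and are not taken from the product.

```python
import zlib

NUM_BUCKETS = 8                              # assumed bucket count
NODES = ["node-a", "node-b", "node-c"]       # hypothetical data nodes

# Partitioning metadata: bucket id -> owning node. Every client holds a copy.
bucket_owner = {b: NODES[b % len(NODES)] for b in range(NUM_BUCKETS)}

# Each node's in-memory data, modelled as one dict per node.
node_data = {node: {} for node in NODES}


def bucket_of(key):
    # crc32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def put(key, value):
    node_data[bucket_owner[bucket_of(key)]][key] = value


def get(key):
    # Single hop: the owner is computed locally, then only that node is asked.
    owner = bucket_owner[bucket_of(key)]
    return owner, node_data[owner][key]


keys = ["A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1", "I1"]
for k in keys:
    put(k, k.lower())

located = {k: get(k) for k in keys}          # key -> (owning node, value)
```

Re-partitioning in this model amounts to changing entries in `bucket_owner` and moving the affected buckets, which is why bucket count is kept separate from node count.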

Collocate data for near infinite scale
Different partitioning policies:
- Hash partitioning
  - Suitable for key-based access
  - Uniform random hashing
- Dramatically scale by keeping all related data together
- Application managed (associations)
  - Orders hash partitioned but associated line items are collocated
- Application managed
  - Grouped on data object field(s)
  - Customize what is collocated
  - Example: 'Manage all Sept trades in one data partition'
Diagram labels: Distributed Apps; Local Cache; Partitioning Meta Data; Single Hop; A1 B1 C1 D1 E1 F1 G1 H1 I1
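
Collocation comes down to the choice of routing key. The sketch below assumes records are plain dictionaries and that a partition is chosen by hashing a routing key; the field names and functions are hypothetical. Line items are routed by their parent order id so they land with the order, and trades are routed by month so all Sept trades share one partition.

```python
import zlib

NUM_PARTITIONS = 4   # assumed partition count


def partition_for(routing_key):
    # Stable hash of the routing key, reduced to a partition number.
    return zlib.crc32(str(routing_key).encode("utf-8")) % NUM_PARTITIONS


def order_partition(order):
    # Plain hash partitioning on the order's own key.
    return partition_for(order["order_id"])


def line_item_partition(item):
    # Association: route by the parent order id, not the item id,
    # so an order and its line items always share a partition.
    return partition_for(item["order_id"])


def trade_partition(trade):
    # Application-managed grouping on a data field (the trade month).
    return partition_for(trade["month"])


order = {"order_id": 1001}
items = [{"item_id": i, "order_id": 1001} for i in range(3)]
order_home = order_partition(order)
item_homes = [line_item_partition(i) for i in items]

sept_homes = {
    trade_partition({"trade_id": t, "month": "Sept"}) for t in range(5)
}
```

Grouping on a low-cardinality field such as month keeps related data together but can concentrate load on a few partitions, so the field is a design choice for the application.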

Move business logic to data
- Principle: move the task to the computational resource with most of the relevant data before considering other nodes where data transfer becomes necessary
- Fabric function execution service
- Data dependency hints
  - Routing key, collection of keys, "where clause(s)"
- Serial or parallel execution
  - "Map Reduce"
Diagram labels: f1, f2, … fn; FIFO Queue; Data fabric Resources; Exec functions; Sept Trades; Function (f1); Function (f2)
Example: Submit (f1) -> AggregateHighValueTrades(, "where trades.month='Sept'")
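
One way to picture the function execution service is as a map step that runs only where the relevant data lives, followed by a reduce step on the caller. The sketch below is an assumption-laden toy: the node layout, `high_value_total`, and `execute_on_data` are invented for illustration, and threads stand in for remote nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: each node holds trade amounts keyed by month.
nodes = {
    "node-a": {"Sept": [1200, 50, 9000]},
    "node-b": {"Aug": [700, 15000]},
    "node-c": {"Sept": [30000], "Oct": [10]},
}


def high_value_total(trades, threshold=1000):
    # The "business logic": sum of trades above a threshold, computed locally.
    return sum(amount for amount in trades if amount > threshold)


def execute_on_data(function, month):
    # Data dependency hint: only nodes holding the requested month are targeted.
    targets = [data[month] for data in nodes.values() if month in data]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(function, targets))   # map: run beside the data
    return sum(partials)                               # reduce: combine partials


sept_total = execute_on_data(high_value_total, "Sept")
aug_total = execute_on_data(high_value_total, "Aug")
oct_total = execute_on_data(high_value_total, "Oct")
```

Only the small partial sums cross the network in this model; the trade lists themselves never leave the node that holds them.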

Parallel queries
- Query execution for Hash policy
  - Parallelize query to each relevant node
  - Each node executes query in parallel using local indexes on data subset
  - Query result is streamed to coordinating node
  - Individual results are unioned for final result set
- This "scatter-gather" algorithm can waste CPU cycles
- Partition the data on the common filter
  - For instance, most queries are filtered on a Trade symbol
  - Query predicate can be analyzed to prune partitions
Diagram steps:
1. select * from Trades where trade.month = August
2. Parallel query execution
3. Parallel streaming of results
4. Results returned
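
The difference between scatter-gather and partition pruning can be shown with a toy query routine. It assumes trades are partitioned on symbol, as the slide suggests; the sample trades, `query`, and the partition count are hypothetical. A query filtered on symbol visits one partition, while a query on any other field has to visit all of them.

```python
import zlib

NUM_PARTITIONS = 4   # assumed partition count


def partition_for(symbol):
    return zlib.crc32(symbol.encode("utf-8")) % NUM_PARTITIONS


# Trades are partitioned on the common filter field, the symbol.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for trade in [
    {"symbol": "IBM", "month": "Aug", "qty": 100},
    {"symbol": "IBM", "month": "Sept", "qty": 200},
    {"symbol": "MSFT", "month": "Aug", "qty": 300},
    {"symbol": "ORCL", "month": "Aug", "qty": 50},
    {"symbol": "ORCL", "month": "Oct", "qty": 75},
]:
    partitions[partition_for(trade["symbol"])].append(trade)


def query(predicate, symbol=None):
    # With a symbol in the predicate, prune to the one partition that can match;
    # otherwise scatter to every partition and gather (union) the results.
    if symbol is not None:
        targets = [partition_for(symbol)]
    else:
        targets = list(range(NUM_PARTITIONS))
    rows = []
    for p in targets:
        rows.extend(t for t in partitions[p] if predicate(t))
    return rows, len(targets)


aug_rows, aug_visited = query(lambda t: t["month"] == "Aug")
ibm_rows, ibm_visited = query(lambda t: t["symbol"] == "IBM", symbol="IBM")
```

The predicate is still applied after pruning because several symbols may hash to the same partition.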

Key lessons
- Apps should think about capitalizing on memory across the Grid (it is abundant)
- Keep IO cycles to a minimum through main memory caching of operational data sets
  - Scavenge Grid memory and avoid data source access
- Achieve near infinite scale for your Grid apps by horizontally partitioning your data and behavior
  - Read Pat Helland's "Life beyond Distributed Transactions" (
- Get more info on the GemFire data fabric –