Slide 1: Increasing Computation Throughput with Grid Data Caching
Jags Ramnarayan, Chief Architect, GemStone Systems
Copyright © 2006, GemStone Systems Inc. All Rights Reserved.
Slide 2: Background on GemStone Systems
- Known for its object database technology since 1982
- Now specializes in memory-oriented distributed data management
  - 12 pending patents
- Over 200 installed customers in the Global 2000
- Grid focus driven by the need for very high performance with predictable throughput, latency, and availability:
  - capital markets: risk analytics, pricing, etc.
  - large e-commerce portals: real-time fraud detection
  - federal intelligence
Slide 3: Batch to real time: long jobs to short tasks
Slide 4: Increasing focus on DATA management
- Workloads where:
  - task duration is getting shorter
  - latency of data access is important
  - consistency of the data is crucial
  - high availability is not enough; the data has to be continuously available
  - common data is shared across thousands of parallel activities
Slide 5: Accessing data in the Grid today
- Direct access to the enterprise database, or a federated data access layer
  - exposed to the weakest-link problem:
    - only as fast as the slowest data source
    - only as available as the weakest link
    - can only scale as well as the weakest link
- Distributed/parallel file systems
  - what if too many tasks go after the same data?
  - disk access is still about 1000x slower than memory
  - data consistency challenges
  - (might be controversial here)
Slide 6: Impact on Grid SLAs
Slide 7: Introducing the memory-oriented data fabric
- Pool memory (and disk) across the cluster/Grid, managed as a single unit
- Replicate data for high concurrent load and high availability
- Distribute (partition) data for high data volume and scale
- Gracefully expand capacity to meet scalability and performance goals
[Diagram: distributed applications connected to a distributed data space, backed by data warehouses and relational databases]
Slide 8: How does it work?
- When data is stored, it is transparently replicated and/or partitioned
- Redundant storage can be in memory and/or on disk, ensuring continuous availability
- Machine nodes can be added dynamically to expand storage capacity or to handle increased client load
- Shared-nothing disk persistence: each cache instance can optionally persist to disk
- Synchronous read-through and write-through, or asynchronous write-behind, to other data sources and sinks
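A minimal sketch of the read-through behavior described above, written against the Apache Geode API (the open-source descendant of GemFire). The 2008-era GemFire API differed in detail, and the region name, key/value types, and the backing map standing in for a database are illustrative assumptions, not the talk's actual code.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.cache.CacheLoader;
    import org.apache.geode.cache.LoaderHelper;
    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.RegionShortcut;

    public class ReadThroughExample {
        // Stand-in for the external system of record (an assumption for the demo).
        static final Map<String, String> BACKING_DB = new ConcurrentHashMap<>();

        public static void main(String[] args) {
            BACKING_DB.put("T-1001", "IBM,100,Sept");

            Cache cache = new CacheFactory().create();

            // Partitioned region with one redundant copy per bucket; a miss
            // falls through to the loader (synchronous read-through).
            Region<String, String> trades = cache
                .<String, String>createRegionFactory(RegionShortcut.PARTITION_REDUNDANT)
                .setCacheLoader(new CacheLoader<String, String>() {
                    @Override
                    public String load(LoaderHelper<String, String> helper) {
                        return BACKING_DB.get(helper.getKey()); // fetch on miss
                    }

                    @Override
                    public void close() {}
                })
                .create("Trades");

            // The first get() misses and invokes the loader; subsequent gets
            // are served from fabric memory.
            System.out.println(trades.get("T-1001"));
            cache.close();
        }
    }

Write-through and asynchronous write-behind are configured analogously in Geode, with a CacheWriter or an async event queue attached to the region.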
Slide 9: Predictably scale with partitioning
- By keeping data spread across many nodes in memory, the fabric can exploit the CPU and network capacity of each node simultaneously to provide linear scalability
- Parallel loading by many Grid nodes, limited only by CPU and the network backbone
- With partitioning metadata on each compute node, access to any single piece of data is a single hop
- Because changes are managed redundantly and synchronously, availability and consistency are preserved
- Load changes are detected dynamically, and nodes can be added or removed for data
- Automatic data re-partitioning rebalances the load
[Diagram: distributed apps with a local cache and partitioning metadata making single-hop access to data buckets A1 through I1]
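The single-hop property rests on every client holding the bucket-to-node map. A hypothetical plain-Java sketch of the idea (not GemFire's actual implementation): hash the key to a bucket, look the owner up locally, and go straight to that node.

    import java.util.List;

    public class SingleHopRouter {
        private final int numBuckets;
        private final List<String> bucketOwners; // index = bucket, value = node address

        public SingleHopRouter(List<String> bucketOwners) {
            this.numBuckets = bucketOwners.size();
            this.bucketOwners = bucketOwners;
        }

        // Uniform hash partitioning: key -> bucket -> owning node, in O(1),
        // with no intermediate lookup on another node.
        public String ownerOf(Object key) {
            int bucket = Math.floorMod(key.hashCode(), numBuckets);
            return bucketOwners.get(bucket);
        }

        public static void main(String[] args) {
            SingleHopRouter router = new SingleHopRouter(
                List.of("node-A", "node-B", "node-C"));
            System.out.println(router.ownerOf("trade:1001")); // one network hop
        }
    }

When nodes are added or removed, only the shared metadata and the affected buckets move; the routing logic itself is unchanged.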
Slide 10: Collocate data for near-infinite scale
- Different partitioning policies:
  - hash partitioning
    - suitable for key-based access
    - uniform random hashing
  - application-managed associations
    - e.g., orders are hash-partitioned, but their associated line items are collocated with them
  - application-managed grouping
    - grouped on data object field(s); customize what is collocated
    - example: "manage all Sept trades in one data partition"
- Dramatically scale by keeping all related data together
[Diagram: same local cache, partitioning metadata, and single-hop picture as the previous slide]
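Collocation of the "orders and line items" flavor can be expressed with a custom partition resolver. A sketch against the Apache Geode API; the LineItemKey class, the resolver name, and the field names are assumptions for illustration.

    import java.io.Serializable;

    import org.apache.geode.cache.EntryOperation;
    import org.apache.geode.cache.PartitionResolver;

    // Hypothetical composite key: identifies a line item within an order.
    class LineItemKey implements Serializable {
        final String orderId;
        final int lineNumber;

        LineItemKey(String orderId, int lineNumber) {
            this.orderId = orderId;
            this.lineNumber = lineNumber;
        }
    }

    // Routes every line item by its parent order id, so an order and all of
    // its line items hash to the same partition and live on the same node.
    public class OrderCollocationResolver
            implements PartitionResolver<LineItemKey, Object> {

        @Override
        public Object getRoutingObject(EntryOperation<LineItemKey, Object> op) {
            return op.getKey().orderId; // hash on the order, not the line item
        }

        @Override
        public String getName() { return "orderCollocation"; }

        @Override
        public void close() {}
    }

The "all Sept trades in one partition" example is the same mechanism: the routing object simply returns the trade's month field instead of an order id.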
Slide 11: Move business logic to the data
- Principle: move the task to the computational resource holding most of the relevant data before considering other nodes, where data transfer becomes necessary
- Fabric function execution service
  - data dependency hints: a routing key, a collection of keys, or "where clause(s)"
  - serial or parallel execution ("map-reduce")
- Example: Submit(f1) -> AggregateHighValueTrades(…, "where trades.month='Sept'")
[Diagram: functions f1, f2, … fn flow through a FIFO queue to fabric execution resources holding the Sept trades]
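A sketch of the AggregateHighValueTrades idea using the Apache Geode function execution service, the open-source descendant of the fabric service described here. The value layout (notional amounts as doubles), the one-million threshold, and the "Sept" routing key are assumptions.

    import java.util.List;
    import java.util.Set;

    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.execute.Function;
    import org.apache.geode.cache.execute.FunctionContext;
    import org.apache.geode.cache.execute.FunctionService;
    import org.apache.geode.cache.execute.RegionFunctionContext;
    import org.apache.geode.cache.execute.ResultCollector;
    import org.apache.geode.cache.partition.PartitionRegionHelper;

    // Executes on each node that owns filtered data, against only the local
    // partition: the task moves to the data rather than the data to the task.
    public class AggregateHighValueTrades implements Function<Object> {
        @Override
        public void execute(FunctionContext<Object> context) {
            RegionFunctionContext rfc = (RegionFunctionContext) context;
            Region<Object, Double> localTrades =
                PartitionRegionHelper.getLocalDataForContext(rfc);

            double sum = 0;
            for (double notional : localTrades.values()) {
                if (notional > 1_000_000) sum += notional; // "high value"
            }
            context.getResultSender().lastResult(sum); // one partial per node
        }

        @Override
        public String getId() { return "AggregateHighValueTrades"; }

        public static void main(String[] args) {
            // Caller side, given a partitioned Region<Object, Double> trades:
            // the filter keys are the data-dependency hint that routes the
            // function to the partitions holding September trades.
            //
            // ResultCollector<?, ?> rc = FunctionService.onRegion(trades)
            //     .withFilter(Set.of("Sept"))
            //     .execute(new AggregateHighValueTrades());
            // List<?> partials = (List<?>) rc.getResult(); // union of node sums
        }
    }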
Slide 12: Parallel queries
- Query execution under the hash partitioning policy:
  - the query is parallelized to each relevant node
  - each node executes the query in parallel, using local indexes on its subset of the data
  - each node streams its results to the coordinating node
  - the individual results are unioned into the final result set
- This "scatter-gather" algorithm can waste CPU cycles
  - partition the data on the common filter; for instance, if most queries filter on a trade symbol, partition on symbol
  - the query predicate can then be analyzed to prune partitions
[Diagram steps: 1. select * from Trades where trade.month = 'August'; 2. parallel query execution; 3. parallel streaming of results; 4. results returned]
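A hypothetical plain-Java sketch of the scatter-gather flow, not GemFire code: each "node" is modeled as a list, the predicate fans out to all partitions in parallel, and the partial results are unioned.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.stream.Collectors;

    public class ScatterGatherQuery {
        record Trade(String symbol, String month, double notional) {}

        public static List<Trade> query(List<List<Trade>> partitions,
                                        String month) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
            try {
                // Scatter: one local filter task per partition ("node").
                List<Future<List<Trade>>> futures = new ArrayList<>();
                for (List<Trade> partition : partitions) {
                    futures.add(pool.submit(() -> partition.stream()
                            .filter(t -> t.month().equals(month))
                            .collect(Collectors.toList())));
                }
                // Gather: union the streamed partial results.
                List<Trade> result = new ArrayList<>();
                for (Future<List<Trade>> f : futures) result.addAll(f.get());
                return result;
            } finally {
                pool.shutdown();
            }
        }

        public static void main(String[] args) throws Exception {
            List<List<Trade>> partitions = List.of(
                List.of(new Trade("IBM", "August", 5e5)),
                List.of(new Trade("MSFT", "Sept", 2e6),
                        new Trade("ORCL", "August", 1e6)));
            System.out.println(query(partitions, "August")); // unioned result
        }
    }

The pruning optimization from the slide amounts to skipping the submit for any partition whose routing key cannot match the predicate, so that only relevant nodes burn CPU.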
Slide 13: Key lessons
- Apps should capitalize on memory across the Grid (it is abundant)
- Keep I/O cycles to a minimum through main-memory caching of operational data sets
  - scavenge Grid memory and avoid data-source access
- Achieve near-infinite scale for your Grid apps by horizontally partitioning your data and behavior
  - read Pat Helland's "Life beyond Distributed Transactions": http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf
- Get more info on the GemFire data fabric: http://www.gemstone.com/gemfire