Large dataset processing in the Cloud
Kevin Glenny and the GridwiseTech team
Simplified data-oriented system
– Internal or external data sources
– Applications working on data
IT systems are constantly growing
– Increased number of users
– Increased number of applications
– Increased amount of data
IT systems are constantly growing
– Infrastructure bottleneck
Example: electronics manufacturer
– 24/7 production
– Report computation takes too long for decision making
– 2.5 million transactions daily
– 4TB of data to manage
What is Cloud computing?
„Transparent access to capabilities using a pay-per-use business model”
Benefits:
– Dynamic scaling
– Pay-for-use
– Off-shored administration
What are the delivery models?
– SaaS (Software as a Service)
  – SalesForce.com, 63,00 clients
– PaaS (Platform as a Service)
  – Google App Engine (2008), Microsoft Azure (2008)
– IaaS (Infrastructure as a Service)
  – Amazon Elastic Compute Cloud, 8.2 million instances launched since 2006
Application data processing
– Database sharding (MySQL, PostgreSQL etc.)
– NoSQL (Google's BigTable, Amazon's Dynamo etc.)
– Data-grid (GigaSpaces XAP, Oracle Coherence, Infinispan etc.)
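To make the sharding idea concrete, here is a minimal sketch of hash-based shard routing. It is an illustration only, not the presenters' implementation: the function names `shard_for` and `partition` are hypothetical, and hashing the record key is just one common way to pick a shard (range-based and directory-based schemes are alternatives).

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index in [0, num_shards).

    Uses a stable hash (SHA-256) so the same key always lands on the
    same shard, independent of process or machine.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce.
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(records, num_shards: int):
    """Split (key, value) pairs into num_shards lists.

    In a real deployment each list would correspond to a separate
    database instance (assumption for illustration).
    """
    shards = [[] for _ in range(num_shards)]
    for key, value in records:
        shards[shard_for(key, num_shards)].append((key, value))
    return shards
```

Note that with plain modulo routing, changing the number of shards remaps most keys; schemes such as consistent hashing exist to reduce that reshuffling when scaling dynamically.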
Data-grid and sharding in the Cloud
All data processing and persistence in the Cloud
Achievements:
– Near real-time processing
– Dynamic scaling (application and resources)
– Pay-per-use
– Reduced administration
– High availability (HA)
Remaining issues
– Getting large datasets in and out of the Cloud
  – Bandwidth is limited on the client side
  – Resort to mailing hard drives!
– Performance
  – 2 to 50% slowdown
– Data security/privacy
  – Trust
– SLAs
  – Plan for the worst
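The bandwidth issue is easy to quantify with back-of-the-envelope arithmetic. The sketch below estimates how long the 4TB dataset from the electronics manufacturer example would take to upload. The link speeds and the helper name `transfer_time_hours` are assumptions chosen for illustration, and 4TB is taken as 4 × 10^12 bytes; real throughput is usually lower than the nominal link rate.

```python
def transfer_time_hours(dataset_bytes: float,
                        link_bits_per_second: float,
                        efficiency: float = 1.0) -> float:
    """Estimate upload time in hours.

    efficiency is the fraction of the nominal link rate actually
    achieved (1.0 = ideal; an assumed parameter, not a measured one).
    """
    if link_bits_per_second <= 0:
        raise ValueError("link_bits_per_second must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = dataset_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


# Dataset size from the example slide, read as decimal terabytes.
DATASET_BYTES = 4e12

# Hypothetical client-side uplink speeds, in bits per second.
ASSUMED_LINKS = {
    "10 Mbit/s": 10e6,
    "100 Mbit/s": 100e6,
    "1 Gbit/s": 1e9,
}

if __name__ == "__main__":
    for name, rate in ASSUMED_LINKS.items():
        hours = transfer_time_hours(DATASET_BYTES, rate)
        print(f"{name}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Under these assumptions, even an ideal 100 Mbit/s uplink needs several days for a single full copy, which is why shipping physical drives can be competitive.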
Conclusions
– Datasets in data-oriented systems grow, causing bottlenecks
– Datasets in the Cloud can be processed using scalable technologies
– Challenges remain
  – The main one: how to get the data to the Cloud?