An Introduction to Super-Scalability

But first…

1 ENIAC, 1 Teletype

1 Mainframe, N Terminals

N Servers, N Terminals

N Servers, N PCs

N Web Servers, N Browsers

N Web Servers, N AJAX Apps

N Clusters, N AJAX Apps

N Clusters, N*M Phones

N Cloudlets, N*M Phones

CPU, Disk, Memory, Network

Time / Throughput, Space / Capacity

Time / Throughput, Space / Capacity, Complexity, Locking

(but how to scale?)

Just make it bigger (vertical scaling)

(super-scalability)

Not Super: one big data store, one big memory store, make it bigger, make it redundant
E.g. full activity logging
Partitioning: sharding / hashing
Growth = add partition
Tradeoff: splitting partitions
Tradeoff: redundancy becomes a distribution problem
(Diagram: replicated partitions A, A, B, B, C, C, …)
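A minimal sketch of the sharding / hashing idea on this slide, and of why splitting partitions is a tradeoff. The function names `partition_for` and `moved_keys`, the `user:N` key format, and the choice of MD5 with simple modulo placement are all assumptions for illustration, not something the deck specifies.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def moved_keys(keys, old_count, new_count):
    """Return the keys whose partition changes when the partition count changes."""
    return [
        k for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    ]


keys = [f"user:{i}" for i in range(1000)]
moved = moved_keys(keys, 4, 5)
print(f"{len(moved)} of {len(keys)} keys move when growing from 4 to 5 partitions")
```

With plain modulo placement, adding a partition relocates a large share of the keys, which is one reason real systems tend to use schemes such as consistent hashing instead.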

Not Super: number of objects increases; as relations increase, time or space requirements grow
Common with graph problems, e.g. PageRank
Distribution: chop up the problem / workload
Map/Reduce
Tradeoff: coordination
Tradeoff: network
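To make "chop up the problem" concrete, here is a single-process sketch of the map, shuffle, and reduce steps using a word count. The sample chunks and the function names are assumptions for illustration; a real framework would run the map and reduce phases on different machines, which is where the coordination and network tradeoffs come from.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Each chunk could be handled by a different worker.
chunks = ["scale out not up", "scale the data", "not the code"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```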

Not Super: tune your code, tune your database, tune your network, better hardware
Optimization: as fast as possible; can't scale as fast as growth
Specialization – ONE thing
Caching – reduces work in trade for space
Tradeoff: space
Tradeoff: coordination
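A small sketch of "caching reduces work in trade for space", using the standard library's `functools.lru_cache`. The `expensive_lookup` function and the call counter are hypothetical stand-ins for a slow database or network call.

```python
from functools import lru_cache

calls = 0  # counts how often the underlying work actually runs


@lru_cache(maxsize=128)  # the space side of the trade: up to 128 cached results
def expensive_lookup(item_id: int) -> str:
    """Stand-in for a slow query (hypothetical)."""
    global calls
    calls += 1
    return f"item-{item_id}"


for _ in range(3):
    expensive_lookup(42)

print(calls, expensive_lookup.cache_info())
```

The coordination tradeoff shows up once there is more than one process: each one holds its own cache, so stale entries have to be invalidated or allowed to expire.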

Not Super: one at a time, serialized access
Parallelizing / Estimating: separate reads & writes, non-locking estimation, reduce contention
Tradeoff: space
Tradeoff: coordination
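One way to read "reduce contention" and "non-locking estimation" is a counter split into shards: writers lock only their own shard, and readers sum the shards without taking any lock. The `ShardedCounter` class below is an illustrative assumption, not something from the deck.

```python
import threading


class ShardedCounter:
    """Counter split across shards so writers rarely contend (hypothetical)."""

    def __init__(self, shards: int = 8):
        self._counts = [0] * shards
        self._locks = [threading.Lock() for _ in range(shards)]

    def increment(self) -> None:
        # Pick a shard from the thread id; only that shard's lock is taken.
        i = threading.get_ident() % len(self._counts)
        with self._locks[i]:
            self._counts[i] += 1

    def estimate(self) -> int:
        # Lock-free read: may be slightly stale while writers are active.
        return sum(self._counts)


counter = ShardedCounter()


def worker():
    for _ in range(1000):
        counter.increment()


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.estimate())
```

The space tradeoff is the extra shards and locks; the coordination tradeoff is that a read taken during writes is only an estimate.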

Partitions: Data & Processing
Sharding
Worker Processes
Coordination: Distribution & Ordering
Queues & Managers
Separate Read/Write Access
What does this make the system look like?

And now…

Atomicity – all or nothing
Consistency – always correct
Isolation – changesets executed independently
Durability – once committed, stays so
Really hard to scale in one big block (although SSDs + RAM help!)

(it depends)

Basically Available
Soft State
Eventual Consistency
A node will either eventually get a change or retire
Well… still need conflict resolution
BASE is NOT ACID (get it?)

Choose TWO: Consistency, Availability, Partition tolerance
(Diagram: Client 1 and Client 2, two Managers, Replica 1 and Replica 2. Double Outage!)

Log, Profile, Tune, Test, Divide, Compare, Partition
No, really, log a lot

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

Separate operations for:
Command – performs an action
Query – returns data about state
Promotes simpler programs
Allows Command Queues
Reduces locking
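A minimal sketch of command/query separation with a command queue. The `AddLike` command and the `LikeService` class are hypothetical names chosen to match the likes example later in the deck; commands change state and return no data, while the query only reads.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class AddLike:
    """A command: describes an action to perform (hypothetical)."""
    user_id: str
    item_id: str


class LikeService:
    def __init__(self):
        self._queue = deque()   # pending commands
        self._likes = {}        # user_id -> set of item_ids

    def submit(self, command: AddLike) -> None:
        """Command side: enqueue the action, return nothing about state."""
        self._queue.append(command)

    def process_pending(self) -> int:
        """A single worker drains the queue, so writes need no shared lock."""
        processed = 0
        while self._queue:
            cmd = self._queue.popleft()
            self._likes.setdefault(cmd.user_id, set()).add(cmd.item_id)
            processed += 1
        return processed

    def items_liked_by(self, user_id: str) -> list:
        """Query side: returns data and never changes state."""
        return sorted(self._likes.get(user_id, set()))


service = LikeService()
service.submit(AddLike("u1", "item-b"))
service.submit(AddLike("u1", "item-a"))
before = service.items_liked_by("u1")   # queued commands are not yet applied
service.process_pending()
after = service.items_liked_by("u1")
print(before, after)
```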

SaaS: Applications
PaaS: Storage, Identity, Runtime, Queue / Bus
IaaS: Compute, Block Data, Network

Component    Example
Compute      Amazon EC2, Azure Web/Worker Roles
Storage      Amazon S3, Azure TableStore
Network      Any CDN

Component    Example
Database     SQL Azure, Postgres, MySQL
NoSQL        Cassandra, Redis, BigTable, MongoDB
Cache        Memcache
Queue        Azure Service Bus
Processing   Hadoop, Storm

Salesforce? (Also sort of a platform) Whateva!

Cassandra

A “scalable” key-value store
Automatic partitioning
Automatic replicas

Worse than SQL Tuning?

Get user by user id
Get item by item id
Get all the items that a particular user likes
Get all the users who like a particular item

Can’t get all the items that a particular user likes (without a giant scan)

An N-M relationship is modeled with two tables, but properties require secondary lookups.
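The two-table model can be sketched with plain dictionaries standing in for column families. The names `items_by_user`, `users_by_item`, and the sample item data are assumptions for illustration, and this is not Cassandra client code.

```python
# Two "tables", one per query direction, each keyed for a direct lookup.
items_by_user = {}   # user_id -> set of item_ids
users_by_item = {}   # item_id -> set of user_ids

# Item properties live in their own table, keyed by item id.
items = {"i1": {"title": "Widget"}, "i2": {"title": "Gadget"}}


def like(user_id, item_id):
    """Write the relationship to both tables so neither query needs a scan."""
    items_by_user.setdefault(user_id, set()).add(item_id)
    users_by_item.setdefault(item_id, set()).add(user_id)


def titles_liked_by(user_id):
    """One read for the ids, then a secondary lookup per item for its properties."""
    return sorted(items[i]["title"] for i in items_by_user.get(user_id, set()))


like("u1", "i1")
like("u1", "i2")
like("u2", "i1")
print(titles_liked_by("u1"), sorted(users_by_item["i1"]))
```

Every write goes to two places, and showing a title still costs an extra lookup per item, which is what motivates copying some data into the index rows.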

Can put some data in the indexes if your queries need it. (Or serialize data.)

SuperColumns let you store other dimensions of data. (eek?)

Composite (sorted) column keys let you do neat things like time-order the mapping.
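A rough sketch of time-ordering with composite keys: the columns of one row are kept sorted by a `(timestamp, item_id)` pair, so a time range becomes a contiguous slice. The list-based row and the helpers `add_like` and `likes_since` are assumptions for illustration, not Cassandra's API.

```python
import bisect

row = []  # columns of one user's row, sorted by composite key (timestamp, item_id)


def add_like(row, timestamp, item_id):
    """Insert a column while keeping the row sorted by its composite key."""
    bisect.insort(row, (timestamp, item_id))


def likes_since(row, timestamp):
    """Return all columns with a timestamp at or after the given one."""
    # (timestamp,) sorts before every (timestamp, item_id) with the same timestamp.
    start = bisect.bisect_left(row, (timestamp,))
    return row[start:]


add_like(row, 30, "item-c")
add_like(row, 10, "item-a")
add_like(row, 20, "item-b")
print(likes_since(row, 20))
```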

Roll your own model – see www.datastax.com for great data model articles

Each Tuple has a Timestamp
Last change wins
Requires clock synchronization
(Working on other strategies)
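A minimal sketch of "last change wins" when two replicas are merged. Each value carries a timestamp and the newer one is kept. The `merge_lww` helper, the tie-break on the value, and the sample data are assumptions for illustration rather than Cassandra's actual reconciliation code.

```python
def merge_lww(a, b):
    """Merge two replicas of {key: (value, timestamp)}, keeping the newest write."""
    merged = dict(a)
    for key, (value, ts) in b.items():
        if key not in merged:
            merged[key] = (value, ts)
            continue
        cur_value, cur_ts = merged[key]
        # Compare (timestamp, value) so equal timestamps resolve the same way
        # on every replica, whichever order the merge runs in.
        if (ts, value) > (cur_ts, cur_value):
            merged[key] = (value, ts)
    return merged


replica_a = {"email": ("old@example.com", 100), "name": ("Ann", 90)}
replica_b = {"email": ("new@example.com", 105)}

merged_ab = merge_lww(replica_a, replica_b)
merged_ba = merge_lww(replica_b, replica_a)
print(merged_ab)
```

If the clock on the machine that wrote `old@example.com` had run fast and stamped it 110, the older value would win, which is why the slide says this approach requires clock synchronization.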

But wait, there’s more….

N*M*Q Cloudlets, N*M*Q Devices

It’s coming. Can your servers handle it?

Arduino, Netduino, Raspberry Pi ($25)

Cross-thing sharing, Data storage, Analysis

Communication, Network Effect, Analytics

Self-sufficient unit of scale
All components required to operate a portion of workload
Known performance characteristics
Known cost to interact with other cells

How big is your project?

50,000 doctors
100 editors
500GB of data
Does it matter?