© 2013 A. Haeberlen, Z. Ives Cloud Storage & Case Studies NETS 212: Scalable & Cloud Computing Fall 2014 Z. Ives University of Pennsylvania 1.

Presentation transcript:


Specialized KVS. Cloud KVS are often specialized for a particular tradeoff or usage scenario. Example: Amazon's solutions. Simple Storage Service (S3): large objects (files, virtual machines, etc.); assumes objects change infrequently; objects are opaque to the storage system. SimpleDB (old) and DynamoDB (newer replacement): small objects (Java objects, records, etc.); generally updated more frequently, with a greater need for consistency; generally multiple attributes or properties, which are exposed to the storage system.

Big Objects: Amazon S3. S3 = Simple Storage Service. Stores large objects (= values) that may have access permissions. Objects: named items stored in S3. Buckets of objects: think of these as volumes in a filesystem; the console includes a notion of folders, but these are not intrinsic to S3. Accessed via REST/SOAP, but we'll be using Java(script) libraries to interact with S3: you'll just call them as normal functions, and they will open and close sockets as necessary.

S3: Access permissions. Permissions are assigned through Access Control Lists (ACLs): essentially, a list of users/groups → permissions. Bucket permissions are inherited by objects unless overridden at the object level. What can you control? Permissions can be set at the level of buckets or individual objects. Available rights: read, write, read ACL, write ACL. Possible grantees: everyone, authenticated users, specific users (by AWS account address).
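The inheritance rule on this slide can be sketched with a toy model. This is only an illustration under stated assumptions: the names `effective_acl` and `is_allowed`, the grantee labels "everyone" and "authenticated", and the example accounts are all hypothetical, and nothing here calls the real AWS API.

```python
from typing import Dict, Optional, Set

# Toy model of the ACL scheme described on the slide (not the AWS API).
# An ACL maps a grantee to the set of rights it holds.
ACL = Dict[str, Set[str]]

RIGHTS = {"read", "write", "read_acl", "write_acl"}


def effective_acl(bucket_acl: ACL, object_acl: Optional[ACL]) -> ACL:
    """Objects inherit the bucket ACL unless they carry their own."""
    return object_acl if object_acl is not None else bucket_acl


def is_allowed(acl: ACL, user: Optional[str], right: str) -> bool:
    """Check a right for a user; user=None models an anonymous request."""
    if right not in RIGHTS:
        raise ValueError(f"unknown right: {right}")
    grantees = ["everyone"]
    if user is not None:
        grantees += ["authenticated", user]
    return any(right in acl.get(g, set()) for g in grantees)


# Hypothetical example data.
bucket_acl = {"alice@example.com": {"read", "write"}, "authenticated": {"read"}}
public_object_acl = {"everyone": {"read"}}

# Inherited ACL: anonymous requests are refused, authenticated users may read.
print(is_allowed(effective_acl(bucket_acl, None), None, "read"))
print(is_allowed(effective_acl(bucket_acl, None), "bob@example.com", "read"))
# Overridden at the object level: the object is now world-readable.
print(is_allowed(effective_acl(bucket_acl, public_object_acl), None, "read"))
```

Note that in this sketch an object-level ACL replaces the bucket ACL entirely rather than merging with it; that is one possible reading of "unless overridden".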

S3: Uploading an object. Step 1: Hit 'upload' in the management console.

S3: Uploading an object. Step 2: Select files. Step 3: Set metadata (or accept default). Step 4: Set permissions (or make public).

S3: Pricing and usage, over a year… (9/19/2013) (9/18/2014)

S3: Bucket operations. Create bucket (optionally versioned; see later). Delete bucket. List all keys in bucket (may not be 100% up to date). Modify bucket permissions. Source: Amazon S3 User's Guide.

S3: Object operations. PUT object in bucket. GET object from bucket. DELETE object from bucket. Modify object permissions. The key issue: How do we manage concurrent updates? Will I see objects you delete? The latest version? Etc.

S3: Consistency models. The consistency model depends on the region. US West, EU, Asia Pacific, S. America: read-after-write consistency for PUTs of new objects, and eventual consistency for overwrite PUTs and DELETEs. S3 buckets in the US Standard Region: eventual consistency. Read-after-write consistency: each read or write operation becomes effective at some point between its start time and its completion time; reads return the value of the last effective write. [Figure: timeline for Client 1 and Client 2 showing writes W1: Cat and W2: Dog and reads R1 and R2.]

S3: Versioning. S3 handles consistency through versioning rather than locking. The idea: every bucket + key maps to a list of versions: [bucket+key] → [object v1] [object v2] [object v3] … Each time we PUT an object, it gets a new version. The last-received PUT overwrites any previous ones! When we GET: an unversioned request likely receives the last version, but this is not guaranteed, depending on propagation delays. A request for bucket + key + version uniquely maps to a single object! Versioning can be enabled for each bucket. Why would you (not) want versioning?
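The mapping from bucket + key to a list of versions can be sketched in memory. This is a minimal single-node model, not the S3 API: the class `VersionedStore` and its 1-based version numbers are assumptions made for illustration. Because there is only one copy of the data, an unversioned GET here always returns the latest version, whereas the slide notes that real S3 does not guarantee this.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class VersionedStore:
    """Toy in-memory model of a versioned bucket (not the S3 API)."""

    def __init__(self) -> None:
        # (bucket, key) -> list of object versions, oldest first
        self._versions: Dict[Tuple[str, str], List[bytes]] = defaultdict(list)

    def put(self, bucket: str, key: str, value: bytes) -> int:
        """Store a new version and return its 1-based version number."""
        versions = self._versions[(bucket, key)]
        versions.append(value)
        return len(versions)

    def get(self, bucket: str, key: str, version: Optional[int] = None) -> bytes:
        """Return a specific version, or the latest one if none is given."""
        versions = self._versions.get((bucket, key))
        if not versions:
            raise KeyError((bucket, key))
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError((bucket, key, version))
        return versions[version - 1]


store = VersionedStore()
v1 = store.put("pets", "name", b"Cat")
v2 = store.put("pets", "name", b"Dog")
print(store.get("pets", "name"))              # latest version
print(store.get("pets", "name", version=v1))  # an older version stays addressable
```

Keeping every version is what makes the versioned GET unambiguous; the cost is that storage grows with every PUT, which is one reason not to enable versioning.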

Recap: Amazon S3. A key-value store for large objects: buckets, keys, objects, folders. Various ways to access objects, e.g., HTTP and BitTorrent. Provides eventual consistency, +/- a few details that depend on the region. Supports versioning and access control; access control is based on ACLs.

DynamoDB: Record-Like Key-Value Storage

What is Amazon DynamoDB? A highly scalable, non-relational data store. Despite its name, not really a database. Stronger consistency guarantees than S3. Highly scalable; built-in replication; automatic indexing. No 'real' transactions, just a conditional put/delete. No 'real' relations and joins, just a fairly basic select. [Figure labels: S3, DynamoDB, RDS, SimpleDB.]

DynamoDB: Data model. Somewhat analogous to a spreadsheet. Domains: entire 'tables'; like buckets. Items: names with attribute-multivalue sets; for example, an item could have more than one street address. It is possible to add attributes later. No pre-defined schema. Figure annotations: Items; Name (hash key); Range Key; Attributes (key-multivalue). Example table:
CustomerID | Date | First name | Last name | Street address | City | State | Zip
123 | 1/2/3 | Bob | Smith | 123 Main St | Springfield | MO |
 | /3/4 | Bob | Smith | 123 Main St | Kansas City | MO |
 | | James | Johnson | 456 Front | | |

DynamoDB: Basic operations. List Tables, Get Table Description. Create, Delete Table. GetItem, PutItem, UpdateItem, DeleteItem. Can do Conditional Writes based on a value. Can assign an Atomic Counter with each write, to test versions. Select (like an SQL query).

DynamoDB: PutItem, UpdateItem, and GetItem. PutItem/UpdateItem has a very simple model: specify the Table, a set of key attributes, and a set of other attributes. UpdateItem can specify a condition based on the Atomic Counter. GetItem: specify the Table and the set of key attributes. Can choose whether the read should be strongly consistent or not. What are the advantages of each choice? Can also assign a Condition, e.g., that a value matches some equality condition.
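A conditional update guarded by a counter works like optimistic concurrency control, which the following sketch illustrates. It is a toy in-memory model, not the DynamoDB API: the class `ToyTable`, the `version` attribute, and the `expected_version` parameter are hypothetical names chosen for this example.

```python
from typing import Any, Dict, Tuple

Key = Tuple[str, str]  # (hash key, range key)


class ConditionFailed(Exception):
    """Raised when the expected version does not match the stored one."""


class ToyTable:
    """Toy model of conditional writes guarded by a version counter."""

    def __init__(self) -> None:
        self._items: Dict[Key, Dict[str, Any]] = {}

    def put_item(self, key: Key, attrs: Dict[str, Any]) -> None:
        """Unconditional write: replaces the item and resets its counter."""
        self._items[key] = {**attrs, "version": 0}

    def get_item(self, key: Key) -> Dict[str, Any]:
        """Return a copy, so callers cannot mutate the stored item."""
        return dict(self._items[key])

    def update_item(self, key: Key, changes: Dict[str, Any], expected_version: int) -> None:
        """Apply changes only if nobody else updated the item in between."""
        item = self._items[key]
        if item["version"] != expected_version:
            raise ConditionFailed(f"expected {expected_version}, found {item['version']}")
        item.update(changes)
        item["version"] = expected_version + 1


table = ToyTable()
key = ("123", "1/2/3")
table.put_item(key, {"city": "Springfield"})

# Two clients read the same version, then both try to write.
seen_by_a = table.get_item(key)["version"]
seen_by_b = table.get_item(key)["version"]
table.update_item(key, {"city": "Kansas City"}, seen_by_a)  # succeeds
try:
    table.update_item(key, {"city": "St. Louis"}, seen_by_b)  # stale version
except ConditionFailed as err:
    print("second writer must re-read and retry:", err)
```

The losing writer is not blocked; it simply learns that its read is stale. This is why the slides describe the mechanism as a conditional put rather than a 'real' transaction.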

DynamoDB: Select. A very simple "query" interface based on SQL syntax: SELECT output_list FROM domain_name WHERE expression [sort expression] [limit spec]. Example: "select * from books where author like 'Tan%' and price <= and year is not null order by title desc limit 50". Can choose whether or not the read should be consistent. Supports a cursor.
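Since the select syntax is modeled on SQL, the shape of the example query can be tried out locally with Python's built-in `sqlite3` module. This is only an analogy, not the Amazon service: the sample rows are made up, and because the price bound in the slide's query did not survive, `MAX_PRICE` below is a purely hypothetical stand-in.

```python
import sqlite3

# Made-up sample rows; the column names mirror the slide's query.
BOOKS = [
    ("Computer Networks", "Tanenbaum", 15.0, 2010),
    ("Modern Operating Systems", "Tanenbaum", 45.0, 2014),
    ("Distributed Systems", "Tanenbaum", 12.0, None),
    ("The C Programming Language", "Kernighan", 10.0, 1988),
]

MAX_PRICE = 20.0  # assumption: the original bound is missing from the slide text

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, price REAL, year INTEGER)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?)", BOOKS)

# Same structure as the slide's query: prefix match, price bound,
# non-null check, descending sort, and a result limit.
rows = conn.execute(
    "SELECT * FROM books "
    "WHERE author LIKE 'Tan%' AND price <= ? AND year IS NOT NULL "
    "ORDER BY title DESC LIMIT 50",
    (MAX_PRICE,),
).fetchall()
print(rows)
conn.close()
```

With these rows only one book passes all three filters: the others are too expensive, have no year, or have a different author.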

Alternatives to SimpleDB. There is a similar service to SimpleDB underneath most major "cloud" companies' infrastructure. Google calls theirs BigTable; Yahoo's is called PNUTS. See reading list at the end. All consist of items with a variable set of attribute-value pairs. More flexible than a relational DBMS table, but they don't support full-fledged transactions.

Alternatives to DynamoDB. There is a similar service to DynamoDB underneath most major "cloud" companies' infrastructure. In open source there are platforms like HBase, Cassandra, MongoDB, and Accumulo that do similar things. Google calls theirs BigTable; Yahoo's is called PNUTS. See reading list at the end. All consist of items with a variable set of attribute-value pairs. More flexible than a relational DBMS table, but they don't support full-fledged transactions.

Recap: Amazon DynamoDB. A scalable, non-relational data store. Domains, items, keys, values. Stronger consistency than S3. No pre-defined schema.

Where could we go beyond this? KVSs present one of the simplest data representations: key + one or more objects/properties. Some alternatives: Relational databases represent data as interlinked tables (in essence, a limited form of a graph). Hierarchical storage systems represent data as nested entities. JSON / Document stores (e.g., MongoDB) support JSON or HTML. More general graph storage might represent entire graph structures with links. All are implementable over a KVS, but all allow higher-level requests (e.g., paths), and might optimize for this. Example: I know that the customer always asks for images related to patients' records, so maybe we should put the two in the same place.

Summary: Cloud Key/Value Stores. Attempt to provide very high durability and availability in a persistent, geographically distributed storage system. Need to choose compromises due to limitations of communications, hardware, software. Large, seldom-changing objects: eventual consistency and versioned model in S3. Small, more frequently changing objects: lower-latency response, conditional updates in DynamoDB. Both are useful in different situations. We'll be using DynamoDB in our assignments, incl. HW1M2.

Beyond Storage: Other Cloud Services

Beyond Storage, What if… I want to host a Web site? Or a Web service? Or an instance of a DBMS that I closely manage? Amazon (and Azure and Google) give several options, including services they manage (e.g., Amazon RDS) and a bare-bones service you manage: "Infrastructure as a Service" (IaaS). Examples: Amazon Elastic Compute Cloud (EC2), Azure Virtual Machines, Google Compute Engine.

Amazon EC2. Logging into AWS Management Console. Launching an instance. Contacting the instance via ssh. Terminating an instance. Have a look at the AWS Getting Started guide:

Oh no - where has my data gone? EC2 instances do not have persistent storage: data survives stops & reboots, but not termination. If you store data on the virtual hard disk of your instance and the instance fails or you terminate it, your data WILL be lost! So where should I put persistent data? Elastic Block Store (EBS), covered in a few slides. Ideally, use an AMI with an EBS root (Amazon's default AMI has this property).

Amazon Machine Images. When I launch an instance, what software will be installed on it? Software is taken from an Amazon Machine Image (AMI), selected when you launch an instance. An AMI is essentially a file system that contains the operating system, applications, and potentially other data. It lives in S3. How do I get an AMI? Amazon provides several generic ones, e.g., Amazon Linux, Fedora Core, Windows Server, ... You can make your own. You can even run your own custom kernel (with some restrictions).

Security Groups. Basically, a set of firewall rules. Can be applied to groups of EC2 instances. Each rule specifies a protocol, port numbers, etc. Only traffic matching one of the rules is allowed through. Sometimes need to explicitly open ports. [Figure: an instance, an evil attacker, and a legitimate user (you or your customers).]
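The "only traffic matching one of the rules is allowed through" behavior amounts to a default-deny allow list. Here is a minimal sketch, assuming a rule consists of just a protocol and an inclusive port range; the names `Rule` and `permits` are hypothetical and real security groups carry more fields (for example, source address ranges).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    """One allow rule: a protocol plus an inclusive port range."""
    protocol: str
    from_port: int
    to_port: int


def permits(rules: List[Rule], protocol: str, port: int) -> bool:
    """Default deny: traffic passes only if at least one rule matches."""
    return any(
        r.protocol == protocol and r.from_port <= port <= r.to_port
        for r in rules
    )


# Hypothetical group for a web server: SSH and HTTP only.
web_server_group = [Rule("tcp", 22, 22), Rule("tcp", 80, 80)]
print(permits(web_server_group, "tcp", 80))    # HTTP was opened explicitly
print(permits(web_server_group, "tcp", 3306))  # never opened, so it is blocked
```

An empty rule list blocks everything, which is why ports sometimes have to be opened explicitly before a freshly launched service becomes reachable.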

Regions and Availability Zones. Where exactly does my instance run? No easy way to find out; Amazon does not say. Instances can be assigned to regions. Currently 9 available: US East (Northern Virginia), US West (Northern California), US West (Oregon), EU (Ireland), Asia/Pacific (Singapore), Asia/Pacific (Sydney), Asia/Pacific (Tokyo), South America (Sao Paulo), AWS GovCloud. Important, e.g., for reducing latency to customers. Instances can be assigned to availability zones. Purpose: avoid correlated faults. There are several availability zones within each region.

Network pricing. AWS does charge for network traffic. Price depends on source and destination of traffic. Free within EC2 and other AWS services in the same region (e.g., S3). Remember: ISPs are typically charged for upstream traffic. (9/18/2014)

Instance types. So far: on-demand instances. Also available: reserved instances. One-time reservation fee to purchase for 1 or 3 years; usage is still billed by the hour, but at a considerable discount. Also available: spot instances. Spot market: can bid for available capacity; the instance continues until terminated or until the price rises above the bid. Source: ec2/reserved-instances/
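The spot-instance rule (run until the price rises above the bid) can be illustrated with a few lines of simulation. This is a simplified sketch: the function `hours_run` is a hypothetical helper, the prices are made up, and the real spot market has details (such as billing of partial hours) that are ignored here.

```python
from typing import Iterable


def hours_run(bid: float, hourly_prices: Iterable[float]) -> int:
    """Count the hours a spot instance runs before the price exceeds the bid."""
    hours = 0
    for price in hourly_prices:
        if price > bid:
            break  # price rose above the bid: the instance is terminated
        hours += 1
    return hours


prices = [0.03, 0.04, 0.05, 0.09, 0.04]  # made-up spot prices in $/hour
print(hours_run(0.06, prices))  # stopped by the price spike in the fourth hour
```

A higher bid buys more continuity but not a guarantee, so spot instances suit work that can tolerate being interrupted.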

Service Level Agreement. (9/11/2013; excerpt) 4.38h downtime per year allowed.
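The 4.38 h figure is what a 99.95% availability target works out to over a 365-day year. The percentage is inferred from the downtime number and is not quoted in the slide text; the short calculation below shows the arithmetic, with `allowed_downtime_hours` being a hypothetical helper name.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# 0.05% of 8760 hours
print(round(allowed_downtime_hours(0.9995), 2))
```

The same function shows how quickly each extra "nine" shrinks the budget: 99.99% leaves under an hour per year.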

Recap: EC2. What EC2 is: an IaaS service; you can rent virtual machines. Various types: very small to very powerful. How to use EC2: Ephemeral state: local data is lost when the instance terminates. AMIs: used to initialize an instance (OS, applications, ...). Security groups: "firewalls" for your instances. Regions and availability zones. On-demand/reserved/spot instances. Service level agreement (SLA).

Virtual Disks for EC2

Elastic Block Store (EBS). Persistent storage: unlike the local instance store, data stored in EBS is not lost when an instance fails or is terminated. Should I use the instance store or EBS? Typically, the instance store is used for temporary data. [Figure: an instance attached to EBS storage.]

Volumes. EBS storage is allocated in volumes. A volume is a 'virtual disk' (size: 1GB - 1TB); basically, a raw block device. Can be attached to an instance (but only one at a time). A single instance can access multiple volumes. Placed in specific availability zones. Why is this useful? Be sure to place it near instances (otherwise can't attach). Replicated across multiple servers: data is not lost if a single server fails. Amazon: Annual failure rate is % for a 20GB volume.

EC2 instances with EBS roots. EC2 instances can have an EBS volume as their root device ("EBS boot"). Result: instance data persists independently from the lifetime of the instance. You can stop and restart the instance, similar to suspending and resuming a laptop. You won't be charged for the instance while it is stopped (only for EBS). You can enable termination protection for the instance, which blocks attempts to terminate the instance (e.g., by accident) until termination protection is disabled again. Alternative: use the instance store as the root. You can still store temporary data on it, but it will disappear when you terminate the instance. You can still create and mount EBS volumes explicitly.

Snapshots. You can create a snapshot of a volume: a copy of the data in the volume at the time the snapshot was made. Only the first snapshot makes a full copy; subsequent snapshots are incremental. What are snapshots good for? Sharing data with others: the DBpedia snapshot ID is "snap-882a8ae3"; access control list (specific account numbers) or public access. Instantiating new volumes. Point-in-time backups. [Figure: snapshots along a time axis.]
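The "first snapshot is full, later ones are incremental" idea can be sketched with a toy block store. This is an illustration under assumptions, not how EBS is actually implemented: the class `SnapshotChain` is hypothetical, a volume is modeled as a dict from block number to contents, and blocks are never removed from the volume.

```python
from typing import Dict, List

Blocks = Dict[int, bytes]  # block number -> block contents


class SnapshotChain:
    """Toy model: first snapshot is a full copy, later ones store only changes."""

    def __init__(self) -> None:
        self._deltas: List[Blocks] = []

    def snapshot(self, volume: Blocks) -> int:
        """Record the volume state; return how many blocks had to be stored."""
        previous = self.restore(len(self._deltas))
        delta = {n: data for n, data in volume.items() if previous.get(n) != data}
        self._deltas.append(delta)
        return len(delta)

    def restore(self, count: int) -> Blocks:
        """Rebuild the volume as of the first `count` snapshots."""
        state: Blocks = {}
        for delta in self._deltas[:count]:
            state.update(delta)
        return state


volume = {0: b"boot", 1: b"data-v1", 2: b"logs"}
chain = SnapshotChain()
print(chain.snapshot(volume))  # first snapshot stores every block
volume[1] = b"data-v2"
print(chain.snapshot(volume))  # second snapshot stores only the changed block
print(chain.restore(1)[1])     # the point-in-time backup still has the old data
```

Restoring replays the deltas in order, so each snapshot remains a complete point-in-time view even though only the changes were stored.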

Pricing. You pay for... Storage space: $0.10 per allocated GB per month. I/O requests: $0.10 per million I/O requests. S3 operations (GET/PUT): charge is only for actual storage used; empty space does not count.
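As a worked example of the two volume charges listed on this slide, the sketch below computes a monthly bill. The rates are the ones quoted on the slide; the function name `monthly_ebs_cost` and the example workload are hypothetical, and the S3 charges mentioned on the slide are left out.

```python
PRICE_PER_GB_MONTH = 0.10    # $ per allocated GB per month (from the slide)
PRICE_PER_MILLION_IO = 0.10  # $ per million I/O requests (from the slide)


def monthly_ebs_cost(allocated_gb: float, io_requests: int) -> float:
    """Monthly volume cost: allocated space plus I/O requests."""
    storage = allocated_gb * PRICE_PER_GB_MONTH
    io = io_requests / 1_000_000 * PRICE_PER_MILLION_IO
    return storage + io


# Hypothetical workload: a 100 GB volume serving 5 million I/O requests.
print(f"${monthly_ebs_cost(100, 5_000_000):.2f}")
```

Note that the storage charge is based on allocated size, so an almost empty 100 GB volume costs the same $10 per month as a full one.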

Creating an EBS volume. [Screenshot annotations: DBpedia snapshot ID; Create volume.] The volume needs to be in the same availability zone as your instance!

Mounting an EBS volume.
Step 1: Attach the volume.
    ec2-attach-volume -d /dev/sda2 -i i-9bd6eef1 vol-cca68ea5
    ATTACHMENT vol-cca68ea5 i-9bd6eef1 /dev/sda2 attaching
Step 2: Mount the volume in the instance.
    ssh
           __|  __|_  )  Amazon Linux AMI
           _|  (     /     Beta
          ___|\___|___|
    See /usr/share/doc/system-release for latest release notes. :-)
    ~]$ sudo mount /dev/sda2 /mnt/
    ~]$ ls /mnt/
    dbpedia_3.5.1.owl dbpedia_3.5.1.owl.bz2 en other_languages
    ~]$

Detaching an EBS volume.
Step 1: Unmount the volume in the instance.
    ~]$ sudo umount /mnt/
    ~]$ exit
Step 2: Detach the volume.
    ec2-detach-volume vol-cca68ea5
    ATTACHMENT vol-cca68ea5 i-9bd6eef1 /dev/sda2 detaching

Recap: Elastic Block Store (EBS). What EBS is: basically a virtual hard disk; can be attached to EC2 instances. Persistent: state survives termination of the EC2 instance. How to use EBS: allocate a volume, empty or initialized with a snapshot; attach it to an EC2 instance and mount it there. Can create snapshots for data sharing and backup.

Cloud Case Studies

Recall Some Cloud Definitions. As discussed previously, "cloud" is a broad term but comprises: very large data centers with thousands of commodity machines; multiple, geographically distributed sites; common management infrastructure; common programming infrastructure that automatically allocates requests and/or jobs to available machines. Difference between public and private clouds? Public clouds sub-contract out to multiple clients; private clouds are controlled by one organization.

Recap: Types of clouds. Software as a Service (SaaS): cloud-hosted apps. Think Hotmail, GMail, Google Docs, Office Web, …; where Microsoft, etc. want to go (subscriptions & ads). Platform as a Service (PaaS): programming layer and services over the cloud. Think Hadoop, MS Azure, extensible apps, Google Maps. Infrastructure as a Service (IaaS): virtual machines, virtualized networks and disks. Think Amazon EC2; typically includes Storage as a Service (EBS, etc.); also some variants like content delivery networks.

The major public Cloud providers. Amazon is the big player, with multiple services: infrastructure as a service, platform as a service (incl. Hadoop), storage as a service. But there are many others. Microsoft Azure has a similar stack to Amazon. Google App Engine + Compute Engine: again, similar; also software as a service: GMail, Docs, etc. IBM, HP, Yahoo seem to focus mostly on enterprise (often private) cloud apps (not small-business-level). Rackspace, Terremark/Verizon: mostly infrastructure as a service.

Case Studies. We'll look at successful examples of: SaaS: Salesforce.com. PaaS: Facebook. IaaS: Netflix.

A SaaS Example: Salesforce.com

Salesforce.com. Perhaps the first truly successful "software as a service" platform. Predated the term "cloud" (founded in 1999), and was initially met with skepticism. Now the IBMs and MSs of the world want to be like them: a constant revenue stream, unlike shrink-wrapped software. What is the software being provided? "Customer Relationship Management": tools for sales people to find customers and keep in contact with them. Gives a bird's-eye view of customers' status, in-flight orders, order history, leads, approvals, etc.

Salesforce.com: A Timeline. Founded in 1999: first proponents of the term 'cloud', with support from Larry Ellison (Oracle). First CRM offered as SaaS (Software as a Service). 2005: offered Force.com as a platform for apps. 2010: Chatter launched, Heroku acquired. 2011: Radian 6 acquired, more than 90,000 customers. (© 2012 A. Subramanian)

What does it look like? (© 2012 A. Subramanian)

Example Salesforce "Dashboard"

How Salesforce.com works. Basic architecture as of Mar 2009: 'Only' about 1000 mirrored machines for 55K enterprise customers, 1.5M subscribers. 10 Oracle databases across 50 servers. About 20 predefined tables / schemas, shared across all customers, 100s of TB. Sophisticated, proprietary query optimization and indexing. AJAX Web interface with various communication services: tracking for Twitter, collaborative tools, etc. Easy "tunnels" for sharing across customers. Plug-ins for extensions via Platform-as-a-Service "force.com": 30M lines of 3rd party code.

Salesforce.com Architecture. Multi-tenant: each datacenter contains servers shared across customers. Performance maintained by limits. App logic separation. Scales vertically (adding more cores, improving index strategies). (© 2012 A. Subramanian)

Salesforce.com Technology Stack. Consists of Oracle RAC (Real Application Clusters) nodes, which allow transparent access to a single database instance by multiple clients. Largest standing Oracle installation in the world. (© 2012 A. Subramanian)

Why Salesforce is so effective. Their value proposition: outsource your main corporate IT to them. They bill per month: force.com $15/user/month. They can offer it cheaper than corporate IT: leverage the same infrastructure, design, and support across many companies at the same time ("multi-tenancy"). Some customers: Dell, AMD, SunTrust, Spring, Computer Associates, Kaiser Permanente.

PaaS Case Study: Facebook

Users of Platform as a Service. Facebook provides some PaaS capabilities to application developers: Web services (remote APIs) that allow access to social network properties, data, "Like" button, etc. Many third parties run their apps off Amazon EC2, and interface to Facebook via its APIs: PaaS + IaaS. Facebook itself makes heavy use of PaaS services for their own private cloud. Key problems: how to analyze logs, make suggestions, determine which ads to place. See also Chapter 16 of the Tom White book.

Facebook API: Overview. What you can do: Read data from profiles and pages. Navigate the graph (e.g., via friends lists). Issue queries (for posts, people, pages, ...). Add or modify data (e.g., create new posts). Get real-time updates, issue batch requests, ... How you can access it: Graph API, FQL, Legacy REST API.

Facebook API: The Graph API (1/2). Requests are mapped directly to HTTP: Response is in JSON: { "id": " ", "age_range": { "min": 21 }, "locale": "en_US", "location": { "id": " ", "name": "Philadelphia, Pennsylvania" } }

Facebook API: The Graph API (2/2). Uses several HTTP methods: GET for reading, POST for adding or modifying, DELETE for removing. IDs can be numeric or names: / or /andreas.haeberlen. Pages also have IDs. Authorization is via 'access tokens': an opaque string that encodes specific permissions (access user location, but not interests, etc.). A token has an expiration date, so it may need to be refreshed.

Facebook Data Management / Warehousing Tasks. Main tasks for "cloud" infrastructure: Summarization (daily, hourly): to help guide development on different components, to report on ad performance, recommendations. Ad hoc analysis: answer questions on historical data, to help with managerial decisions. Archival of logs. Spam detection. Ad optimization. ... Initially used Oracle DBMS for this, but eventually hit scalability, cost, performance bottlenecks ... just like Salesforce does now.

Data Warehousing at Facebook. >2PB of data. 10TB added every day. Mostly HDFS (+ some MySQL). 2,400 cores. 9TB of memory.

PaaS at Facebook (partial list of components; these have evolved). Scribe: open source logging; actually records the data that will be analyzed by Hadoop. Hadoop (MapReduce, discussed next time) as batch processing engine for data analysis. As of 2009: 2nd largest Hadoop cluster in the world, 2400 cores, > 2PB data with > 10TB added every day. Hive: SQL over Hadoop, used to write the data analysis queries. Federated MySQL, Oracle: multi-machine DBMSs to store query results.

Example Use Case 1: Ad Details. Advertisers need to see how their ads are performing: cost-per-click (CPC), cost-per-1000-impressions (CPM). Social ads include info from friends. Engagement ads are interactive with video. Performance numbers given: number of unique users, clicks, video views, … Main axes: account, campaign, ad; time period; type of interaction; users. Summaries are computed using Hadoop via Hive.

Use Case 2: Ad Hoc analysis, feedback. Engineers and product managers may need to understand what is going on, e.g., the impact of a new change on some sub-population. Again, Hive-based, i.e., queries are in SQL with database joins. Combine data from several tables, e.g., click-through rate = views combined with clicks. Sometimes requires custom analysis code with sampling.
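The "views combined with clicks" example is a join followed by a ratio. The sketch below does the same thing in plain Python on a handful of made-up log records; at Facebook's scale this would be a Hive query over Hadoop, and the `(ad_id, user_id)` record layout and the function name `click_through_rates` are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str]  # (ad_id, user_id); a hypothetical log schema


def click_through_rates(views: Iterable[Event], clicks: Iterable[Event]) -> Dict[str, float]:
    """Join view and click counts per ad and return clicks / views."""
    view_counts = Counter(ad for ad, _ in views)
    click_counts = Counter(ad for ad, _ in clicks)
    # Ads that were viewed but never clicked get a rate of 0.0.
    return {ad: click_counts[ad] / n for ad, n in view_counts.items()}


# Made-up sample logs.
views = [("ad1", "u1"), ("ad1", "u2"), ("ad1", "u3"), ("ad1", "u4"), ("ad2", "u1")]
clicks = [("ad1", "u2")]
print(click_through_rates(views, clicks))
```

Grouping by ad before dividing is the part that maps naturally onto MapReduce: counting per key is a reduce step, and the join pairs up the two counts for each ad.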

IaaS Case Study: Netflix

IaaS example: Netflix. Perhaps Amazon's highest-profile customer. In 12/2010, most of their traffic was served from AWS; a year earlier, none of it was. Why did Netflix take this step? Needed to re-architect after a phase of growth → ability to question everything. Focus on their core competence (content); leave the 'heavy lifting' (datacenter operation) to Amazon. Customer growth & device engagement hard to predict → with the cloud, they don't have to. Belief that cloud computing is the future → gain experience with an increasingly important technology.

How Netflix uses AWS. Streaming movie retrieval and playback: media files stored in S3; "transcoding" to target devices (Wii, iPad, etc.) using EC2. Web site modules: movie lists and search, an app hosted by Amazon Web Services. Recommendations: analysis of streaming sessions and business metrics, using Elastic MapReduce.

Netflix: 5 Lessons learned using AWS. (1) Dorothy, you're not in Kansas anymore: be prepared to unlearn a lot of what you know. Example: assumptions about network capacity, hardware reliability. (2) Co-tenancy is hard: throughput variance can occur at any level in the stack. (3) Best way to avoid failure: fail constantly. Design for failure independence; use the 'Chaos Monkey'. (4) Learn with real scale, not toy models: only full-scale traffic shows where the real bottlenecks are. (5) Commit yourself.

Discussion

Other users, and the future. Startups, especially, are making great use of EC2, Rackspace, etc. for their hosting needs; compare to 10 years ago (the dot-com boom), where you started by buying a cluster of SPARC machines. Government, health care, science, and many enterprises have great interest in the cost savings of the cloud. But concerns remain, esp. with respect to security, privacy, availability … And moreover: the last word has not been written on how to program the cloud.

Given this discussion… Our goal for the remainder of the semester: learn how to build applications much like the ones we discussed. We'll use many of the same programming platforms, tools, etc. And there will be an AJAX, Web-based emphasis on the projects.

Next time. The first "programming model for the cloud": MapReduce. Not really a language, but a set of interfaces and a runtime system. Please read the Dean & Ghemawat paper, the Google work that spawned it all. Later in the semester we'll see more sophisticated models, including some research ones.

Stay tuned. Next time you will learn about: a programming model for the Cloud.