Tamir Dresher Senior Software Architect July 2, 2014 Where is my Data? (In the Cloud)
About Me
Software architect, consultant and instructor
Software Engineering Ruppin Academic Center
Technology addict
10 years of experience
.NET and Native Windows
Agenda
Storage
Blob
Relational DB
NoSql DB
MapReduce
Storage
Numbers – 1 Second is
1,132 Instagram photos uploaded
1,365 Tumblr posts
7,241 Tweets sent
44,512 Google searches
84,921 YouTube videos viewed
Storage Prices
Types of information
Product catalogs
Employee data
User profiles
Images
Session state
Shopping cart
Game scores and state
Social feeds
Query output results
Airline seating charts
Inventory management system
Game leaderboards
Performance counters
Weather
Stock quotes
Gartner Magic Quadrant
IaaS
PaaS
Windows Azure Growing Global Presence
Data centers: North America, Europe, Asia Pacific
Storage SLA – 99.99% minutes per year
AZURE BLOBS
What is a BLOB
BLOB – Binary Large OBject
Storage for any type of entity such as binary files and text documents
Distributed File Service (DFS)
  – Scalability and high availability
  – A BLOB file is distributed between multiple servers and replicated at least 3 times
Azure Blob Storage Concepts
Account -> Container -> Blob -> Blocks / Pages
contoso -> images -> PIC01.JPG -> Block/Page
contoso -> images -> PIC02.JPG
contoso -> videos -> VID1.AVI
Amazon Simple Storage Service (S3) Concepts
Account -> Bucket -> Object
s3.amazonaws.com/
contoso -> images -> PIC01.JPG
contoso -> images -> PIC02.JPG
contoso -> videos -> VID1.AVI
Blob Operations
REST
DEMO
Creating a Blob
BLOBS - Azure
Block blobs – up to 200 GB in size
Page blobs – up to 1 TB in size
Total Account Capacity TB
BLOBS - AWS
Object size – up to 5 TB
An AWS account can own up to 100 buckets at a time, with unlimited objects
% durability, 99.99% availability
Reduced Redundancy Storage (RRS) – % durability and 99.99% availability
Amazon Glacier – low-cost storage option for data archival
Pricing - AWS
Pay for what you use
Components:
  – Storage capacity used (per GB per month)
  – Data transfer out (per GB per month)
  – Requests (per n thousand requests per month)
Pricing - Azure
Pay for what you use, or commit to a 6- or 12-month plan
Components:
  – Storage capacity used (per GB per month)
  – Replication option (LRS, GRS, RA-GRS)
  – Number of requests (per n thousand requests per month)
  – Data egress (per GB per month)
RELATIONAL DB
Relational Database Service (RDS)
MySQL, Oracle, or Microsoft SQL Server in the cloud
No administrative overheads
Dedicated Hardware
High Availability
Pay-as-you-grow pricing
Familiar Development Model*
* Despite missing features and some limitations
SQL Azure
SQL Server in the cloud
No administrative overheads
Shared or Reserved (Dedicated) Hardware
High Availability
Pay-as-you-grow pricing
Familiar Development Model*
* Despite missing features and some limitations
DEMO
Creating and Using SQL Azure
Pricing - SQL Azure
Pricing - RDS
Pay for what you use
Components:
  – Storage capacity used (per GB-month and per million I/O requests)
  – Deployment type – Single-AZ/Multi-AZ (AZ = Availability Zone)
  – DB instance hours (per hour)
  – Additional backup storage (per GB-month)
  – Data transfer in / out (per GB per month)
Case Study
Case Study
records-on.html
How do I make querying 154 million addresses as fast as possible?
If I want 100 GB of SQL Server and I want to hit it 10 million times, it’ll cost me $176 a month (now it’s ~$20)
NoSql - Azure Tables, DynamoDB
NoSql
Relational technology has long been the dominant approach for data.
Large amounts of data
  – Scaling across many servers is challenging.
Different kinds of data on a Relational DB
  – JSON documents
  – Graphs
ACID – Atomicity, Consistency, Isolation, Durability.
CAP – Consistency, Availability, Partition tolerance.
BASE – Basically Available, Soft state, Eventual consistency.
Table Storage Concepts
Account -> Table -> Entity
contoso -> customers -> Name =… = …
contoso -> customers -> Name =… Add=
contoso -> photos -> Photo ID =… Date =…
contoso -> photos -> Photo ID =… Date =…
Table Storage
Not RDBMS
  – No relationships between entities
  – NoSql
An entity can have up to 255 properties, up to 1 MB per entity
Mandatory properties for every entity
  – PartitionKey & RowKey (the only indexed properties)
      Together they uniquely identify an entity
      The same RowKey can be used in different PartitionKeys
      They define the sort order
  – Timestamp – Optimistic Concurrency
Strongly consistent
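The key rules above can be illustrated with a small in-memory stand-in. This is only a sketch, not the Azure Tables API: the `TryInsert` helper, the dictionary-based "table" and the ordinal string ordering are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-in for a table: entities keyed and sorted by (PartitionKey, RowKey).
// Ordinal string comparison is an assumption made for this sketch.
var byKey = Comparer<(string PartitionKey, string RowKey)>.Create((a, b) =>
{
    int c = string.CompareOrdinal(a.PartitionKey, b.PartitionKey);
    return c != 0 ? c : string.CompareOrdinal(a.RowKey, b.RowKey);
});
var table = new SortedDictionary<(string PartitionKey, string RowKey), string>(byKey);

// Hypothetical helper: inserts only if the (PartitionKey, RowKey) pair is not taken.
bool TryInsert(string partitionKey, string rowKey, string payload)
{
    var key = (partitionKey, rowKey);
    if (table.ContainsKey(key)) return false;
    table.Add(key, payload);
    return true;
}

bool first = TryInsert("Smith", "Jeff", "entity 1");
bool samePairAgain = TryInsert("Smith", "Jeff", "entity 2");        // rejected: pair already exists
bool sameRowOtherPartition = TryInsert("Harp", "Jeff", "entity 3"); // accepted: different partition

// Enumeration follows the key order: partition first, then row.
string order = string.Join(", ", table.Keys.Select(k => $"{k.PartitionKey}/{k.RowKey}"));
Console.WriteLine(order);
```

The same RowKey ("Jeff") lives happily in two partitions, while repeating the full pair is refused, which is the uniqueness rule the slide describes.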
No Fixed Schema
FIRST    LAST     BIRTHDATE
Wade     Wegner   2/2/1981
Nathan   Totten   3/15/1965
Nick     Harris   May 1, 1976
FAV SPORT
Canoeing
Table Object Model
ITableEntity interface
  – PartitionKey, RowKey, Timestamp, and ETag properties
  – Implemented by TableEntity and DynamicTableEntity

    // This class defines one additional property of integer type,
    // since it derives from TableEntity it will be automatically
    // serialized and deserialized.
    public class SampleEntity : TableEntity
    {
        public int SampleProperty { get; set; }
    }
Sample – Inserting an Entity into a Table

    // You will need the following using statements
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Create the table client.
    CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
    CloudTable peopleTable = tableClient.GetTableReference("people");
    peopleTable.CreateIfNotExists();

    // Create a new customer entity.
    CustomerEntity customer1 = new CustomerEntity("Harp", "Walter");
    // The Email property name is assumed; both values are missing in the source.
    customer1.Email = "…";
    customer1.PhoneNumber = "…";

    // Create an operation to add the new customer to the people table.
    TableOperation insertCustomer1 = TableOperation.Insert(customer1);

    // Submit the operation to the table service.
    peopleTable.Execute(insertCustomer1);
Retrieve

    // Create the table client.
    CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
    CloudTable peopleTable = tableClient.GetTableReference("people");

    // Retrieve the entity with partition key of "Smith" and row key of "Jeff"
    TableOperation retrieveJeffSmith = TableOperation.Retrieve<CustomerEntity>("Smith", "Jeff");

    // Retrieve entity
    CustomerEntity specificEntity = (CustomerEntity)peopleTable.Execute(retrieveJeffSmith).Result;
Table Storage – Important Points
Azure Tables can store TBs of data
Table operations are fast
Tables are distributed
  – PartitionKey defines the partition
  – A table might be stored in different partitions on different storage devices.
Pricing
Case Study
Case Study
How do I make querying 154 million addresses as fast as possible?
  – The domain is the partition key and the alias is the row key
If I want 100 GB of storage and I want to hit it 10 million times, it’ll cost me $8 a month
SQL Server will cost $176 a month – 22 times more expensive
DynamoDB
An item can be up to 64 KB in size
Items are stored on SSDs and replicated across multiple Availability Zones in a Region
An item has a primary key, which can be either a single-attribute hash key or a composite hash-range key
Supports secondary indexes
DynamoDB
Eventually consistent reads (by default), and strongly consistent reads (optional)
Provisioned Throughput – the request throughput you want your table to be able to achieve
  – 10 units of Write Capacity (enough capacity to do up to 36,000 writes per hour)
  – 50 units of Read Capacity (enough capacity to do up to 180,000 strongly consistent reads, or 360,000 eventually consistent reads, per hour)
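The hourly figures follow from treating one capacity unit as one operation per second (for items within the unit's size limit), with an eventually consistent read costing half a read unit. A minimal sketch of that arithmetic; the helper names are hypothetical and not part of any AWS SDK:

```csharp
using System;

// One capacity unit is assumed to allow one operation per second.
const int SecondsPerHour = 3600;

// Hypothetical helpers for illustration only; not part of the AWS SDK.
long WritesPerHour(int writeUnits) => (long)writeUnits * SecondsPerHour;
long StronglyConsistentReadsPerHour(int readUnits) => (long)readUnits * SecondsPerHour;

// An eventually consistent read is assumed to cost half a read unit,
// so the same capacity serves twice as many of them.
long EventuallyConsistentReadsPerHour(int readUnits) =>
    2 * StronglyConsistentReadsPerHour(readUnits);

Console.WriteLine(WritesPerHour(10));                    // 36000
Console.WriteLine(StronglyConsistentReadsPerHour(50));   // 180000
Console.WriteLine(EventuallyConsistentReadsPerHour(50)); // 360000
```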
Pricing
Pay for what you use
Components:
  – Provisioned throughput capacity (per hour)
  – Indexed data storage (per GB per month)
  – Data transfer out (per GB per month)
MapReduce on the Cloud
Hadoop in the cloud
Hadoop on Azure Cloud
Some facts:
  – In 2013, global mobile data traffic reached 1.5 exabytes per month
  – Cisco predicts 1.1 zettabytes (1 zettabyte = 1,000 exabytes) of internet traffic in 2016
MapReduce – The BigData Power
Map – takes input and outputs key/value pairs
  (Key1, Value1)
  (Key2, Value2)
  :
  (Key n, Value n)
MapReduce – The BigData Power
Reduce – takes a group of values per key and produces a new group of values
  Key1: [value1-1, Value1-2…] -> [new_value1-1, new_value1-2…]
  Key2: [value2-1, Value2-2…] -> [new_value2-1, new_value2-2…]
  :
  Key n: [valueN-1, ValueN-2…] -> [new_valueN-1, new_valueN-2…]
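The two phases can be sketched in a single process with LINQ. Word counting is used here only as a stand-in job, and `RunWordCount` is a hypothetical name; a real Hadoop cluster distributes the map, shuffle and reduce steps across machines, which this sketch does not attempt.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical single-process driver: map every input line, group the
// emitted pairs by key, then reduce each group to one value.
Dictionary<string, int> RunWordCount(IEnumerable<string> lines)
{
    // Map: each line -> (word, 1) pairs.
    var mapped = lines.SelectMany(line =>
        line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => (Key: word, Value: 1)));

    // Shuffle: collect all values that share a key.
    var grouped = mapped.GroupBy(pair => pair.Key, pair => pair.Value);

    // Reduce: collapse each key's values into a single result.
    return grouped.ToDictionary(g => g.Key, g => g.Sum());
}

var counts = RunWordCount(new[] { "to be or not", "to be" });
foreach (var pair in counts.OrderBy(p => p.Key, StringComparer.Ordinal))
    Console.WriteLine($"{pair.Key}: {pair.Value}");
```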
MapReduce - How Does It Work?
[Diagram: Files, Server]
So How Does It Work?
[Diagram: Server, RUNTIME, Code]
Elastic Map Reduce (EMR)
Amazon Hadoop on the Cloud
Hortonworks and Microsoft Hadoop to Windows
Cluster of EC2 instances
Pricing:
  – Hourly rate for every instance hour (by instance type)
  – Additional EMR price per EC2 instance
HDInsight
MS Hadoop on (not only) Azure Cloud
Hortonworks and Microsoft Hadoop to Windows
Native integration with .NET
Finding common friends
Facebook shows you how many common friends you have with someone
There were 1,310,000,000 active users on Facebook, with 130 friends on average
Calculating the mutual friends
Finding common friends
We can represent a Friend relationship as:
  Someone -> [List of his/her friends]
Note that a Friend relationship is symmetrical – if A is a friend of B then B is a friend of A
Example of Friends file
  U1 -> U2 U3 U4
  U2 -> U1 U3 U4 U5
  U3 -> U1 U2 U4 U5
  U4 -> U1 U2 U3 U5
  U5 -> U2 U3 U4
Designing our MapReduce job - Mapper
Each line from the file is an input line to the Mapper
The Mapper will output key-value pairs
  – Key: (user, friend) – sorted, so friend might come before user
  – Value: list of friends
Having the key sorted will help us with the reducer: the same pairs will be provided together
Mapper Example – final result
Given the line: U1 -> U2 U3 U4
Mapper output:
  (U1 U2) -> U2 U3 U4
  (U1 U3) -> U2 U3 U4
  (U1 U4) -> U2 U3 U4
Given the line: U2 -> U1 U3 U4 U5
Mapper output:
  (U1 U2) -> U1 U3 U4 U5
  (U2 U3) -> U1 U3 U4 U5
  (U2 U4) -> U1 U3 U4 U5
  (U2 U5) -> U1 U3 U4 U5
Given the line: U3 -> U1 U2 U4 U5
Mapper output:
  (U1 U3) -> U1 U2 U4 U5
  (U2 U3) -> U1 U2 U4 U5
  (U3 U4) -> U1 U2 U4 U5
  (U3 U5) -> U1 U2 U4 U5
Given the line: U4 -> U1 U2 U3 U5
Mapper output:
  (U1 U4) -> U1 U2 U3 U5
  (U2 U4) -> U1 U2 U3 U5
  (U3 U4) -> U1 U2 U3 U5
  (U4 U5) -> U1 U2 U3 U5
Given the line: U5 -> U2 U3 U4
Mapper output:
  (U2 U5) -> U2 U3 U4
  (U3 U5) -> U2 U3 U4
  (U4 U5) -> U2 U3 U4
Designing our MapReduce job - Reducer
The input for the reducer will be structured as:
  (friend1, friend2) (friend1 friends) (friend2 friends)
The reducer will find the intersection between the lists
Output:
  (friend1, friend2) (intersection of friend1 and friend2 friends)
Reducer Example
Given the line:                                 Reducer output:
(U1 U2) -> (U1 U3 U4 U5) (U2 U3 U4)             (U1 U2) -> (U3 U4)
(U1 U3) -> (U1 U2 U4 U5) (U2 U3 U4)             (U1 U3) -> (U2 U4)
(U1 U4) -> (U1 U2 U3 U5) (U2 U3 U4)             (U1 U4) -> (U2 U3)
(U2 U3) -> (U1 U2 U4 U5) (U1 U3 U4 U5)          (U2 U3) -> (U1 U4 U5)
(U2 U4) -> (U1 U2 U3 U5) (U1 U3 U4 U5)          (U2 U4) -> (U1 U3 U5)
(U2 U5) -> (U1 U3 U4 U5) (U2 U3 U4)             (U2 U5) -> (U3 U4)
(U3 U4) -> (U1 U2 U3 U5) (U1 U2 U4 U5)          (U3 U4) -> (U1 U2 U5)
(U3 U5) -> (U1 U2 U4 U5) (U2 U3 U4)             (U3 U5) -> (U2 U4)
(U4 U5) -> (U1 U2 U3 U5) (U2 U3 U4)             (U4 U5) -> (U2 U3)
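The whole job can be checked in memory before running it on a cluster. The sketch below is not the HDInsight code: `Map` and `Reduce` are plain local functions, LINQ's `GroupBy` stands in for the shuffle, and skipping the "->" token when parsing is an assumption about the file format.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The friends file from the example, one "user -> friends" line per user.
var lines = new[]
{
    "U1 -> U2 U3 U4",
    "U2 -> U1 U3 U4 U5",
    "U3 -> U1 U2 U4 U5",
    "U4 -> U1 U2 U3 U5",
    "U5 -> U2 U3 U4",
};

// Mapper: for each friend, emit (sorted pair, the user's whole friend list).
IEnumerable<(string Key, string[] Friends)> Map(string line)
{
    var tokens = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                     .Where(t => t != "->").ToArray();
    var user = tokens[0];
    var friends = tokens.Skip(1).ToArray();
    foreach (var friend in friends)
    {
        var pair = new[] { user, friend };
        Array.Sort(pair, StringComparer.Ordinal);
        yield return (string.Join(" ", pair), friends);
    }
}

// Reducer: friendship is symmetrical, so each pair receives exactly two lists.
string Reduce(IEnumerable<string[]> lists)
{
    var both = lists.ToList();
    return string.Join(" ", both[0].Intersect(both[1]).OrderBy(f => f, StringComparer.Ordinal));
}

// Shuffle: group mapper output by key, then reduce each group.
var commonFriends = lines.SelectMany(line => Map(line))
    .GroupBy(kv => kv.Key, kv => kv.Friends)
    .ToDictionary(g => g.Key, g => Reduce(g));

foreach (var entry in commonFriends.OrderBy(e => e.Key, StringComparer.Ordinal))
    Console.WriteLine($"({entry.Key}) -> ({entry.Value})");
```

For the five-user file this yields nine pairs; U1 and U5 are not friends, so no (U1 U5) key is ever emitted.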
Creating C# MapReduce
Creating C# MapReduce - Mapper

    using System;
    using System.Linq;
    using Microsoft.Hadoop.MapReduce;

    public class CommonFriendsMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            // Split the line into the user followed by his/her friends.
            var strings = inputLine.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (strings.Any())
            {
                var currentUser = strings[0];
                var friends = strings.Skip(1);
                foreach (var friend in friends)
                {
                    // Sort the pair so (A B) and (B A) produce the same key.
                    var keyArr = new[] { currentUser, friend };
                    Array.Sort(keyArr);
                    var key = String.Join(" ", keyArr);
                    context.EmitKeyValue(key, string.Join(" ", friends));
                }
            }
        }
    }
Creating C# MapReduce - Reduce

    public class CommonFriendsReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> strings, ReducerCombinerContext context)
        {
            // Each key (a pair of friends) arrives with two friend lists, one per user.
            var friendsLists = strings.Select(friendList => friendList.Split(' ')).ToList();
            var intersection = friendsLists[0].Intersect(friendsLists[1]);
            context.EmitKeyValue(key, string.Join(" ", intersection));
        }
    }
Creating C# MapReduce – Hadoop Job

    HadoopJobConfiguration myConfig = new HadoopJobConfiguration();
    myConfig.InputPath = "wasb:///example/data/friends/friends";
    myConfig.OutputFolder = "wasb:///example/data/friends/output";

    var hadoop = Hadoop.Connect(clusterUri, clusterUserName, hadoopUserName, clusterPassword,
                                azureStorageAccount, azureStorageKey, azureStorageContainer,
                                createContinerIfNotExist);

    var jobResult = hadoop.MapReduceJob.Execute<CommonFriendsMapper, CommonFriendsReducer>(myConfig);
    int exitCode = jobResult.Info.ExitCode; // 0 – success, otherwise – failure
Pricing
10 node cluster that will exist for 24 hours:
  – Secure Gateway Node – free
  – Head node – USD per 24-hour day
  – 1 data node – USD per 24-hour day
  – 10 data nodes – USD per 24-hour day
Total: $92.16 USD
WRAP UP
Comparing the alternatives
BLOB
  – When should you use: unstructured data, files
  – Implications: application logic responsibility; consider using HDInsight (Hadoop)
Relational DB
  – When should you use: structured relational data, ACID transactions
  – Implications: SQL DML+DDL; could affect scalability; BI abilities; reporting
Azure Tables, DynamoDB
  – When should you use: structured data, loose schema, geo replication (high DR), auto sharding
  – Implications: OData, REST; application logic responsibility (multiple schemas)
What have we seen
Blobs
Relational DB
NoSql
MapReduce in the Cloud
What’s Next
NoSql – MongoDB, Cassandra, CouchDB, RavenDB
Hadoop ecosystem – Hive, Pig, SQOOP, Mahout
Cache Options – Amazon ElastiCache, Azure Cache, InRole Cache, Redis
Presenter contact details
b: TamirDresher.com