HDInsight on Azure and Map-Reduce
Richard Conway, Windows Azure MVP, Elastacloud Limited
Introduction
Big Data vs Big Compute
Compute Bound vs IO Bound
All distributed compute works on the basis of taking a large JOB and breaking it into many smaller TASKS, which are then parallelised
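The job/task split can be sketched on a single machine. This is only an illustration, assuming a trivial "job" (summing an array) and using .NET tasks in place of cluster nodes; the names JobSplitDemo and ParallelSum are hypothetical and do not come from Hadoop or HPC tooling.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class JobSplitDemo
{
    // Breaks one large job (summing the whole array) into taskCount smaller
    // tasks, runs them in parallel, then combines the partial results.
    public static long ParallelSum(int[] data, int taskCount)
    {
        // Ceiling division so that every element falls into exactly one chunk.
        int chunk = (data.Length + taskCount - 1) / taskCount;

        Task<long>[] tasks = Enumerable.Range(0, taskCount)
            .Select(i => Task.Run(() =>
            {
                int start = i * chunk;
                int end = Math.Min(start + chunk, data.Length);
                long partial = 0;
                for (int j = start; j < end; j++)
                {
                    partial += data[j];
                }
                return partial;
            }))
            .ToArray();

        Task.WaitAll(tasks);
        return tasks.Sum(t => t.Result);
    }

    public static void Main()
    {
        int[] data = Enumerable.Range(1, 1000).ToArray();
        Console.WriteLine(ParallelSum(data, 4)); // prints 500500
    }
}
```

On a real cluster the tasks run on separate machines, and the scheduler also has to move data and handle failures, which this sketch leaves out.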
Hadoop vs HPC
Understanding Big Data
$100 gets you 3 million times more storage (in 30 years)
MIPS/$
>5.5 billion (70+% of global population)
>2 billion users
Web traffic: Exabyte (10^18), Zettabyte (10^21)
>10 billion
ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
WEB 2.0: Mobile, Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
Internet of things: Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Axes: Velocity - Variety - Variability, Volume, Storage/GB
Big Data, BIG OPPORTUNITY
49% of CEOs and CIOs are planning big data projects
Software Growth, Services Growth
Sources: McKinsey & Company, McKinsey Global Survey Results, Minding Your Digital Business; IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast, 2012
Invisible devices: trillions of networked nodes, low-bandwidth last-mile connection, mostly addressed by local schemes, machine-centric, sensing-focus
Laptops / tablets / smartphones: billions of networked devices, high-bandwidth access, global addressing, user-centric, communication-focus
Big Data Scenarios
Hadoop Distributed Architecture
Server Files Server
RUNTIME Code
TRADITIONAL RDBMS vs HADOOP: Data Size, Access, Updates, Structure, Integrity, Scaling, DBA Ratio
Windows Azure HDInsight Service
Demo
HDINSIGHT / HADOOP Eco-System
Distributed Storage (HDFS)
Distributed Processing (MapReduce)
Query (Hive)
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
Storing Data with HDInsight
Azure Storage (ASV): Front end, Partition Layer, Stream Layer
HDFS API
DFS (1 Data Node per Worker Role) and Compute Cluster: Name Node, Data Node
… Azure Blob Storage
Map Reduce Examples in C#
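using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Job definition. The generic arguments appear to have been lost from the slide
// text; HadoopJob<FrenchSessionsMapper, SessionsReducer> is assumed here.
public class FrenchSessionsJob : HadoopJob<FrenchSessionsMapper, SessionsReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var config = new HadoopJobConfiguration()
        {
            InputPath = "\"/AllSessions/*.gz\"",
            OutputFolder = "/FrenchSessions/"
        };
        return config;
    }
}

// Mapper: emits ("FR", "1") for every input line that belongs to a French session.
public class FrenchSessionsMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        if (inputLine.Contains("Country=France"))
        {
            context.IncrementCounter("FrenchSession");
            context.EmitKeyValue("FR", "1");
        }
    }
}

// Reducer: emits the number of values received for each key.
public class SessionsReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerContext context)
    {
        // The count is converted to a string, matching the string pair emitted by the mapper.
        context.EmitKeyValue(key, values.Count().ToString());
    }
}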
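What happens between the mapper and the reducer above can be sketched locally without a cluster. The following is a minimal in-memory simulation in plain C# with LINQ; it does not use the Hadoop SDK. The class LocalMapReduceSketch and the sample input lines are hypothetical and only mirror the logic of FrenchSessionsMapper and SessionsReducer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalMapReduceSketch
{
    // Map step: emits ("FR", "1") for each line containing the marker,
    // and nothing for any other line.
    public static IEnumerable<KeyValuePair<string, string>> Map(string inputLine)
    {
        if (inputLine.Contains("Country=France"))
        {
            yield return new KeyValuePair<string, string>("FR", "1");
        }
    }

    // Reduce step: turns all values seen for one key into a single count.
    public static KeyValuePair<string, string> Reduce(string key, IEnumerable<string> values)
    {
        return new KeyValuePair<string, string>(key, values.Count().ToString());
    }

    // Runs map, then groups the emitted pairs by key (the "shuffle"),
    // then calls reduce once per key.
    public static Dictionary<string, string> Run(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => Map(line))
            .GroupBy(kv => kv.Key, kv => kv.Value)
            .Select(g => Reduce(g.Key, g))
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    public static void Main()
    {
        // Hypothetical session log lines, made up for this illustration.
        var lines = new[]
        {
            "Id=1;Country=France",
            "Id=2;Country=Spain",
            "Id=3;Country=France"
        };

        foreach (var kv in Run(lines))
        {
            Console.WriteLine(kv.Key + "\t" + kv.Value);
        }
    }
}
```

The GroupBy call stands in for the shuffle phase: in Hadoop the framework performs that grouping across machines, so the reducer sees every value for a key together.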
Demo
t/Map-Reduce HDInsight Lab.pdf
Questions?