New England SQL Server Big Data 101 Paresh Motiwala SPONSORED BY
Upcoming Events: Jan 11 - Don't Let History Be a Mystery! Temporal Data in SQL Server 2016 (Adam Machanic); Feb 8 - Architecting Availability Groups (Derik Hammer)
Upcoming Events: Tuning Your Biggest Queries (Adam Machanic), January 27th @ Microsoft Burlington. 20% OFF, use code: NESQL http://tinyurl.com/NESQLJan27
Announcements
Big data 101 Seriously, this is just 101 By Paresh Motiwala Prepared for NESQL
Paresh Motiwala, PMP ® pareshmotiwala@gmail.com http://www.linkedin.com/in/pareshmotiwala @pareshmotiwala www.circlesofgrowth.com
BIG Data 101 Who should attend: DBAs, CIOs, marketing peeps, developers, Big Data enthusiasts. Who should not attend
Big data 101 Agenda for the day: Sources, Privacy concerns, Storing - Hadoop, Processing - MapReduce, Presentation, Summary
So why should I care about this? Data is the new electricity (Satya Nadella, Spring 2016), https://www.microsoft.com/en-us/sql-server/data-driven Companies generate data, distribute, meter, and use it. Where is data stored? Current: SQL Server, Oracle, Teradata, DB2, Netezza; open source databases: Cassandra, MySQL, MongoDB. Unstructured: Hadoop, Spark, data lakes. What type of data is stored? Traditional: rows and columns. Big Data explosion: images, streaming data, internet-connected devices (IoT), machine data.
Big Data is driving transformative changes (Traditional vs. Big Data)
Data characteristics: Traditional = relational data with highly modeled schema; Big Data = all data with schema agility
Costs: Traditional = specialized HW; Big Data = commodity HW
Culture: Traditional = operational reporting, focus on rear-view analysis; Big Data = experimentation leading to intelligent action, with machine learning, graph, A/B testing
Big data 101 Sources: Cell phones, social media, credit cards, GPSs, bread crumbs
Big data 101 5 Vs of Big Data: Volume, Variety, Velocity, Veracity, Value
Big data 101 Desired Properties: Robustness - Fault Tolerance, Low Latency, Scalability, Generalization, Extensibility, Ad hoc Queries, Minimal Maintenance, Debuggability
Big data 101 Flow: Collection, Pre-processing, Hygiene, Intervention, Visualization, Analysis. Over 90% of today’s data was created in the past 2 years.
Big data 101 5 Rs of Data Quality: Relevancy, Recency, Range, Robustness, Reliability. Ephemeral vs. Durability. Refresh of Data.
Big data 101 Privacy of Data. If I collect the data, is it mine? Ownership vs. rights. Share answers, not data. OpAl (http://www.trust.mit.edu/projects/), Enigma. Let them know why you are collecting and what you are collecting. FIPP - Fair Information Practice Principles: Individual Control, Transparency, Respect for Context, Security, Access and Accuracy, Focused Collection. FERPA - Family Educational Rights and Privacy Act.
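One way to picture "share answers, not data" is a service that keeps the raw records to itself and only returns aggregates. The sketch below is a toy illustration under that assumption; it is not how OpAl or Enigma are implemented, and the class name, the field names, and the minimum group size of 3 are all hypothetical.

```python
from statistics import mean
from typing import Any, Dict, List, Optional


class AnswerService:
    """Holds raw records privately and exposes only aggregate answers."""

    def __init__(self, records: List[Dict[str, Any]], min_group_size: int = 3) -> None:
        # The raw rows never leave this object (hypothetical design choice).
        self._records = records
        self._min_group_size = min_group_size

    def average(self, field: str, where_field: str, where_value: Any) -> Optional[float]:
        """Return the average of `field` for matching rows, or None if the
        group is too small to answer without exposing an individual."""
        values = [
            float(r[field])
            for r in self._records
            if r.get(where_field) == where_value and field in r
        ]
        if len(values) < self._min_group_size:
            return None
        return mean(values)


if __name__ == "__main__":
    service = AnswerService(
        [
            {"city": "Boston", "spend": 10},
            {"city": "Boston", "spend": 20},
            {"city": "Boston", "spend": 30},
            {"city": "Salem", "spend": 99},
        ]
    )
    print(service.average("spend", "city", "Boston"))  # enough rows: an answer
    print(service.average("spend", "city", "Salem"))   # one row: refused
```

The caller gets an average or a refusal, never the rows themselves. A real system would need much more than a group-size check.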
Big data 101 What is a data lake? (Courtesy: James Serra)
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.
A place to store unlimited amounts of data in any format inexpensively, especially for archive purposes.
Allows collection of data that you may or may not use later: “just in case”.
A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: “just in time” or “schema on read”.
Complements the EDW and can be seen as a data source for the EDW, capturing all data but only passing relevant data to the EDW.
Frees up expensive EDW resources (storage and processing), especially for data refinement.
Allows data exploration to be performed without waiting for the EDW team to model and load the data (quick user access).
Some processing is better done with Hadoop tools than with ETL tools like SSIS.
Easily scalable.
Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera).
References:
http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/
http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/
http://adtmag.com/articles/2014/07/28/gartner-warns-on-data-lakes.aspx
http://intellyx.com/2015/01/30/make-sure-your-data-lake-is-both-just-in-case-and-just-in-time/
http://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses
http://www.martinsights.com/?p=1088
http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/
http://www.martinsights.com/?p=1082
http://www.martinsights.com/?p=1094
http://www.martinsights.com/?p=1102
Big data 101 The “data lake” uses a bottom-up approach (Courtesy: James Serra): Ingest all data regardless of requirements. Store all data in native format without schema definition. Do analysis using analytic engines like Hadoop. [Diagram] Sources: devices, social, LOB applications, video, sensors, web, relational, clickstream. Analysis: batch queries, interactive queries, real-time analytics, machine learning, data warehouse.
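Storing data "in native format without schema definition" means the shape is decided only when someone reads it, often called schema on read. The sketch below shows that idea in plain Python rather than Hadoop. The sample events, the field names, and the `read_with_schema` helper are all made up for illustration.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Raw events kept exactly as they arrived; records need not share a shape.
RAW_EVENTS: List[str] = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2016-12-14T10:00:00"}',
    '{"device": "sensor-2", "temp_c": "19.0"}',
    '{"user": "alice", "clicked": "home"}',
]

# A "schema" here is just field name -> conversion function (an assumption).
Schema = Dict[str, Callable[[Any], Any]]

TEMPERATURE_SCHEMA: Schema = {"device": str, "temp_c": float}
CLICK_SCHEMA: Schema = {"user": str, "clicked": str}


def read_with_schema(raw_lines: Iterable[str], schema: Schema) -> Iterator[Dict[str, Any]]:
    """Apply a schema at read time: parse each raw line, keep only records
    that have every requested field, and convert those fields."""
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


if __name__ == "__main__":
    # The same raw store answers two different questions with two schemas.
    print(list(read_with_schema(RAW_EVENTS, TEMPERATURE_SCHEMA)))
    print(list(read_with_schema(RAW_EVENTS, CLICK_SCHEMA)))
```

Nothing was modeled before the data landed. Each query brings its own view of the same raw lines.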
Big data 101 MapReduce: Map - sends queries; Reduce - collects results. Job Tracker, Task Tracker, YARN
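In the usual formulation, the map step turns each input record into key/value pairs and the reduce step combines the values collected for each key. The sketch below runs that flow in a single Python process as a word count. It is an illustration of the idea, not Hadoop's API, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one input record into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values by key, as a framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is just data", "data is the new electricity"]
    print(word_count(sample))
```

On a cluster, the map calls would run on the nodes that hold the data and the reduce calls on the nodes that receive each key. Here, everything runs locally so the flow is easy to follow.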
Base Architecture: Big Data Advanced Analytics Pipeline
Stages: Data Sources; Ingest; Prepare (normalize, clean, etc.); Analyze (stat analysis, ML, etc.); Publish (for programmatic consumption, BI/visualization); Consume (alerts, operational stats, insights)
Scope: on-prem data and Azure services; data in motion and data at rest
Near real-time data analytics pipeline using Azure Stream Analytics: data stream, telemetry, Event Hub, Stream Analytics (real-time analytics), Machine Learning (anomaly detection), Power BI dashboard of live / real-time data stats, anomalies and aggregates
Interactive analytics and predictive pipeline using Azure Data Factory: real-time readings and operational data, local DB sensor readings, local DB logs, Customer MIS (legacy, replaced by Azure SQL), historic laser data (1 time drop), fault and maintenance data (1 time drop), scheduled hourly transfer using Azure Data Factory, Azure Storage Blob, HDI custom ETL (aggregate / partition), Machine Learning, Azure SQL (predictions), dashboard of predictions / alerts
Big data analytics pipeline using Azure Data Lake: sensor readings, operational logs, Azure Data Lake Storage, Azure Data Lake Analytics (big data processing), Azure SQL, device health dashboard of operational stats
Vision for Big Data and Data Warehousing: Comprehensive, Connected, Choice. Your data, your workload, your business, your way. [Diagram] Sources: devices, relational, sensors, video, LOB applications, web, social, clickstream. Microsoft Azure (cloud): data warehouse on SQL DW and VMs; “Big Data” on HDInsight, Data Lake and VMs. On-premises: data warehouse on SQL Server and APS; “Big Data” on HDP and APS. Connected by Azure Data Factory + federated query.
Big Data 101 Presentation: R, Python, Power BI, Power BI Desktop
Someday Big Data will just become data
Big data 101 Summary: Sources, Privacy concerns, Storing - Hadoop, Processing - MapReduce, Presentation
Big data 101 - Conclusion: SQL Server is the best relational database. The world is much bigger than any one relational database. What is your company’s data strategy? What is your company’s cloud strategy? Learn adjacent technologies that will make you valuable. Power BI? Hadoop? NoSQL?
Big data 101 Bibliography
http://www.datasciencecentral.com/
https://www.youtube.com/playlist?list=PLt-0mOCwxJ6B_OxTlpevxJNAa7GfCLd3l
https://www.dezyre.com/article/hadoop-components-and-architecture-big-data-and-hadoop-training/114
MIT Big Data Analytics Course
Bibliography - Big Data 101
Ignite (IT Pros) - https://myignite.microsoft.com/videos
Channel9 (Developers) - https://channel9.msdn.com/
Microsoft Virtual Academy (Both) - http://mva.microsoft.com
Technet Virtual Labs (Hands-on!) - https://technet.microsoft.com/en-us/virtuallabs/default
Free Azure for 1 month - https://azure.microsoft.com/en-us/free/
Free HDInsight (Hadoop as a service) for a week - https://azure.microsoft.com/en-us/services/hdinsight/information-request/
MSDN? Link that to Azure for monthly Azure money.
GitHub - https://github.com/