Published by Henry Bradford. Modified over 6 years ago.
1
Big Data-BI Fusion: Microsoft HDInsight & MS BI
Visual Studio Live! Chicago 2013
Andrew Brust, CEO and Founder, Blue Badge Insights
Level: Intermediate
© Visual Studio Live! All rights reserved.
2
Meet Andrew
CEO and Founder, Blue Badge Insights
Big Data blogger for ZDNet
Microsoft Regional Director, MVP
Co-chair VSLive! and 17 years as a speaker
Founder, Microsoft BI User Group of NYC
Co-moderator, NYC .NET Developers Group
“Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
brustblog.com, Twitter: @andrewbrust
3
Andrew’s New Blog (bit.ly/bigondata)
4
Read all about it!
5
What is Big Data?
100s of TB into PB and higher
Involving data from: financial data, sensors, web logs, social media, etc.
Parallel processing often involved
Hadoop is emblematic, but other technologies are Big Data too
Processing of data sets too large for transactional databases
Analyzing interactions, rather than transactions
The three V’s: Volume, Velocity, Variety
Big Data tech sometimes imposed on small data problems
SQL Server Live! Orlando 2012: SQTH8 - Microsoft's Big Play for Big Data - Andrew Brust. © SQL Server Live! All rights reserved.
6
The Hadoop Stack
Log file integration
Machine Learning/Data Mining
RDBMS Import/Export
Query: HiveQL and Pig Latin
Database
MapReduce, HDFS
7
What’s MapReduce?
“Big” data input accepted in file form
Data is partitioned and sent to mappers (nodes in cluster)
Mappers pre-process data into key-value (KV) pairs, then all output for (a) given key(s) goes to a reducer
Reducers aggregate; one line of output per unique key, with one value
Map and Reduce code natively written as Java functions
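The flow on this slide can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop code: the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the word-count task are assumptions chosen for the example, and the `shuffle` step stands in for the grouping-by-key that the framework performs between map and reduce.

```python
from collections import defaultdict


def mapper(line):
    """Map step: turn one input line into (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group every value under its key, so each key reaches one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: aggregate all values for one key into a single value."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    groups = shuffle(pairs)
    # One output entry per unique key, with one value.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    for word, count in run_job(["big data big", "data tools"]).items():
        print(f"{word}\t{count}")
```

On a real cluster the mappers and reducers run on different nodes and the input is partitioned across them; here everything runs in one process so the three stages are easy to follow.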
8
MapReduce, in a Diagram
[Diagram: input is split across several mappers; mapper output is grouped by key (K1, K2, K3); each key goes to its own reducer, which writes the output.]
9
HDFS
File system whose data gets distributed over commodity drives on commodity servers
Data is replicated
  If one box goes down, no data lost
“Shared Nothing”
  Except the name node
BUT: Immutable
  Files can only be written to once
  So updates require drop + re-write (slow)
  You can append though
  Like a DVD/CD-ROM
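The "replicated, so one box can go down" point can be illustrated with a toy model. This is a sketch under stated assumptions, not how HDFS actually places data: blocks are assigned round-robin, whereas real HDFS uses its own placement policy, and the names `place_blocks` and `readable_blocks` are hypothetical. The default of three replicas matches HDFS's default replication factor.

```python
def place_blocks(num_blocks, nodes, replicas=3):
    """Assign each block to `replicas` distinct nodes (simple round-robin)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replicas)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed):
    """Return the blocks that still have a replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_blocks(6, nodes)
    # Losing one node leaves every block readable from another replica.
    print(sorted(readable_blocks(placement, failed={"n2"})))
```

Data is only lost in this model when every node holding a given block fails at once, which is the property the slide summarizes as "no data lost".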
10
HBase
A Wide-Column Store, NoSQL database
Modeled after Google BigTable
HBase tables are HDFS files
  Therefore, Hadoop-compatible
Hadoop often used with HBase
  But you can use either without the other
Microsoft’s Hadoop distribution does not (yet) include HBase
11
Microsoft HDInsight
Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows
Windows Azure HDInsight and Microsoft HDInsight Server
  Single node preview runs on Windows client
Also Hortonworks HDP for Windows
Includes
  ODBC Drivers for Hive
  JavaScript MapReduce framework
Contribute it all back to open source Apache Project
12
Azure HDInsight Provisioning
New! HDInsight preview now public, so…
Go to Windows Azure portal
Sign up for the public preview
Select HDInsight from left navbar
Click “+ NEW” lower-left
Specify cluster name, number of nodes, admin password, storage account
  Credentials used for browser login, RDP and ODBC
  During preview, you will be billed 50% of Azure compute rates for nodes in cluster. Will be 100% at GA.
Click “CREATE HDINSIGHT CLUSTER”
Wait for provisioning to complete
Navigate to
13
Azure HDInsight Provisioning
14
Submitting, Running and Monitoring Jobs
Upload a JAR
Use Streaming
  Use other languages (i.e. other than Java) to write MapReduce code
  Python is a popular option
  Any executable works, even C# console apps
  On HDInsight, JavaScript works too
  Still uses a JAR file: streaming.jar
Run at command line (passing JAR name and params) or use GUI
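As a rough illustration of the Streaming option with Python, the sketch below shows a word-count mapper and reducer that exchange tab-separated `key<TAB>value` lines over standard input and output, which is the convention Hadoop Streaming uses. The script name `wordcount.py` and the `map`/`reduce` command-line argument are assumptions made for this example; the slides do not show the speaker's actual code.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key; sum the counts for each key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Hypothetical convention: "wordcount.py map" or "wordcount.py reduce".
    if sys.argv[1:] == ["map"]:
        for out in map_lines(sys.stdin):
            print(out)
    elif sys.argv[1:] == ["reduce"]:
        for out in reduce_lines(sys.stdin):
            print(out)
```

The same script can be tried locally by piping a text file through the map step, a sort, and the reduce step, which mimics what the cluster does between the two phases. On the cluster, the script would be passed to streaming.jar through its `-mapper` and `-reducer` options along with `-input` and `-output` paths.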
15
Amenities for Visual Studio/.NET
Hortonworks Data Platform for Windows
MRLib (NuGet Package)
LINQ to Hive
OdbcClient + Hive ODBC Driver
Deployment
Debugging
MR code in C#, HadoopJob, MapperBase, ReducerBase
16
Running MapReduce Jobs
17
The “Data-Refinery” Idea
Use Hadoop to “on-board” unstructured data, then extract manageable subsets
Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them
This is the current rationalization of Hadoop + BI tools’ coexistence
Will it stay this way?
18
Hive
Used by most BI products which connect to Hadoop
Provides a SQL-like abstraction over Hadoop
  Officially HiveQL, or HQL
Works on own tables, but also on HBase
Query generates MapReduce job, output of which becomes result set
Microsoft has Hive ODBC driver
  Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)
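From client code, the "SQL-like abstraction" looks like an ordinary query that returns a result set. The sketch below is an assumption-heavy illustration: the `words` table and the query are made up, and an in-memory SQLite database stands in for Hive so the example runs without a cluster. Against a real cluster the same function could be handed a connection opened through the Hive ODBC driver (for example with the third-party `pyodbc` package and a DSN you have configured), and Hive would run the query as a MapReduce job.

```python
import sqlite3

# A simple aggregate query; this form is valid in HiveQL and in SQLite.
QUERY = """
SELECT word, COUNT(*) AS n
FROM words
GROUP BY word
ORDER BY n DESC, word
LIMIT 2
"""


def top_words(conn, query=QUERY):
    """Run the query on any DB-API connection and return rows as tuples."""
    cur = conn.cursor()
    cur.execute(query)
    return [tuple(row) for row in cur.fetchall()]


def demo_connection():
    """Stand-in for Hive: an in-memory SQLite table (NOT a Hive connection)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    rows = [("big",), ("data",), ("big",), ("hive",), ("data",), ("big",)]
    conn.executemany("INSERT INTO words VALUES (?)", rows)
    return conn


if __name__ == "__main__":
    for word, n in top_words(demo_connection()):
        print(word, n)
```

The design point is that the calling code only sees a query and a result set; whether a MapReduce job ran behind it is hidden by the driver.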
19
Hive
20
HDInsight Data Sources
Files in HDFS
Azure Blob Storage (Azure HDInsight only)
Hive tables
HBase?
21
Just-in-time Schema
When looking at unstructured data, schema is imposed at query time
Schema is context specific
  If scanning a book, are the values words, lines, or pages?
  Are notes a single field, or is each word a value?
  Are date and time two fields or one?
  Are street, city, state, zip separate or one value?
Pig and Hive let you determine this at query time
So does the Map function in MapReduce code
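A small Python sketch can make the "schema at query time" idea concrete. The raw text, the field layout, and the function names below are all invented for illustration; the point is only that the same stored bytes yield different "tables" depending on how the reader chooses to split them.

```python
# The same raw text, stored once with no schema attached (sample data).
RAW = "2013-05-14 09:30 Chicago IL\n2013-05-15 10:00 Redmond WA\n"


def as_lines(raw):
    """Schema 1: each line is one value."""
    return raw.splitlines()


def as_words(raw):
    """Schema 2: each whitespace-separated token is one value."""
    return raw.split()


def as_records(raw, split_datetime=True):
    """Schema 3: structured records; date and time as two fields or one."""
    records = []
    for line in raw.splitlines():
        date, time, city, state = line.split()
        when = (date, time) if split_datetime else (f"{date} {time}",)
        records.append(when + (city, state))
    return records


if __name__ == "__main__":
    print(len(as_lines(RAW)), len(as_words(RAW)))
    print(as_records(RAW)[0])
    print(as_records(RAW, split_datetime=False)[0])
```

In a MapReduce job this decision lives in the Map function; in Pig or Hive it lives in the load or table definition used by the query.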
22
How Does MS BI Fit In?
Excel, PowerPivot: can query via Hive ODBC driver
Analysis Services (SSAS) Tabular Mode
  Also compatible with Hive ODBC Driver
  Multidimensional mode is not
Power View
  Works against PowerPivot and SSAS Tabular
RDBMS + Parallel Data Warehouse (PDW)
  Sqoop connectors
  Columnstore Indexes
    Enterprise Edition and PDW only
  PDW: PolyBase
23
Excel, PowerPivot
Excel and PowerPivot use the BI Semantic Model (BISM), which can query Hadoop via Hive and its ODBC driver
Excel also features Power Query (fka “Data Explorer”), currently in Beta, which can query HDFS directly and insert the results into a BISM repository
Excel BISM accommodates millions of rows through compression. Not petabyte scale, but sufficient to store and analyze output of Hadoop queries.
24
PowerPivot, SSAS Tabular
SQL Server Analysis Services Tabular mode is the enterprise server implementation of BISM
Features partitioning and role-based security
Can store billions of rows. So even better for Hadoop output analysis.
Excel-based BISM repositories can be upsized to SSAS Tabular
25
Querying Hadoop from Microsoft BI
26
Sqoop
Short for “SQL to Hadoop”
Essentially a technology for moving data between data warehouses and Hadoop
Command line utility; allows specification of source/target HDFS file and relational server, database and table
Sqoop connectors available for SQL Server and PDW
Sqoop generates MapReduce job to extract data from, or insert data into, HDFS
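To show what "specification of source/target HDFS file and relational server, database and table" means in practice, the sketch below builds the argument list for a Sqoop import without running it. The helper name `sqoop_import_args` and all of the values passed to it are hypothetical; the flags used (`--connect`, `--username`, `-P`, `--table`, `--target-dir`) are standard Sqoop import options, and the SQL Server connector may need additional setup that is not shown here.

```python
def sqoop_import_args(server, database, table, username, target_dir):
    """Build (but do not run) a 'sqoop import' command as an argument list."""
    return [
        "sqoop", "import",
        # JDBC URL naming the relational server and database (source).
        "--connect", f"jdbc:sqlserver://{server};databaseName={database}",
        "--username", username,
        "-P",                        # prompt for the password at run time
        "--table", table,            # source table
        "--target-dir", target_dir,  # destination directory in HDFS
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    args = sqoop_import_args("dwhost", "Sales", "Orders", "loader", "/data/orders")
    print(" ".join(args))
```

Passing the resulting list to `subprocess.run` on a machine with Sqoop installed would launch the import, which Sqoop then executes as a MapReduce job.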
27
PDW, PolyBase
SQL Server Parallel Data Warehouse (PDW) is a Massively Parallel Processing (MPP) data warehouse appliance version of SQL Server
MPP manages a grid of relational database servers for divide-and-conquer processing of large data sets
PDW v2 includes “PolyBase,” a component which allows PDW to query data in Hadoop directly
  Bypasses MapReduce; addresses data nodes directly and orchestrates parallelism itself
28
PolyBase Versus Hive, Sqoop
Hive and Sqoop generate MapReduce jobs, and work in batch mode
PolyBase addresses HDFS data itself
This is true SQL over Hadoop
Competitors:
  Cloudera Impala
  Teradata Aster SQL-H
  Pivotal HD/HAWQ
  Hadapt
29
Usability Impact
PowerPivot makes analysis much easier, self-service
Power View is great for discovery and visualization; also self-service
Combine with the Hive ODBC driver and suddenly Hadoop is accessible to business users
Caveat: can query Big Data, but must have smaller result
30
Resources
Big On Data blog
Apache Hadoop home page
Hive & Pig home pages
Hadoop on Azure home page
SQL Server 2012 Big Data
31
Thank You!
Email: andrew.brust@bluebadgeinsights.com
Blog:
Twitter: @andrewbrust