Team 2 – Mike, Rich, Sam and Steven DPS – PACE University

Presentation transcript:

Big Data
Team 2 – Mike, Rich, Sam and Steven
DPS – PACE University
9/20/2018

More Data Than Expected

What to Do with the Data?
Analytics at the Edge Devices: In traditional analysis, data is stored and then analyzed. But with streaming data, analytics must occur in real time, as the data passes through. This allows you to identify and examine patterns of interest as the data is being created. The result is instant insight and immediate action.
Real-Time Analytics: Instant access to information derived from data processing that leads to action.
BIG DATA = BIG ASSUMPTIONS?
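
As a rough illustration of analyzing data "as it passes through" rather than after it is stored, the sketch below flags unusual readings the moment they arrive. The function name `detect_spikes`, the window size, the threshold factor, and the sample readings are all assumptions made for this example; they are not from the slides.

```python
from collections import deque
from statistics import mean


def detect_spikes(readings, window=3, factor=2.0):
    """Yield (index, value) for each reading that exceeds `factor` times
    the mean of the previous `window` readings, as the reading arrives."""
    recent = deque(maxlen=window)  # only the last `window` values are kept
    for i, value in enumerate(readings):
        if len(recent) == window and value > factor * mean(recent):
            yield i, value
        recent.append(value)


# Hypothetical stream of sensor readings, consumed one value at a time.
stream = iter([10, 11, 9, 10, 35, 12, 10, 11, 40])
spikes = list(detect_spikes(stream))
print(spikes)
```

Nothing is stored beyond the small sliding window, which is the main contrast with the store-then-analyze approach.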

Why Big Data Now?
The Hype: Understanding the cultural phenomenon of data science and how others were experiencing it. Study how companies and universities are “doing data science”.
Why Now: Technology makes this possible: infrastructure for large-scale data processing, increased memory and bandwidth, and a cultural acceptance of technology in the fabric of our lives. This wasn't true a decade ago. Consideration should be given to the ethical and technical responsibilities of the people responsible for the process.

How can data science help?

What is data science?
In Academia: An academic data scientist is a scientist, trained in anything from social science to biology, who works with large amounts of data and must grapple with the computational problems posed by the structure, size, messiness, complexity, and nature of the data, while simultaneously solving a real-world problem.
In Industry: Someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time collecting, cleaning, and “munging” data, because data is never clean. This process requires persistence, statistics, and software engineering skills, which are also necessary for understanding biases in the data and for debugging logging output from code.

Who is a data scientist?
Job descriptions: They ask for experts in computer science, statistics, communication, and data visualization who also have extensive domain expertise.
Observation: Nobody is an expert in everything, which is why it makes more sense to create teams of people who have different profiles and different expertise. Together, as a team, they can specialize in all those things.

Statistical Methodologies
Algorithms: linear regression, k-nearest neighbors, k-means, logistic regression, naive Bayes.
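
To make one of the listed algorithms concrete, here is a minimal sketch of k-nearest neighbors classification in plain Python. The training points, the labels "low" and "high", and the helper name `knn_predict` are invented for illustration; a real project would more likely use a library implementation.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points forming two well-separated groups.
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((6.0, 6.0), "high"), ((7.0, 6.5), "high"), ((6.5, 7.0), "high"),
]
prediction_a = knn_predict(train, (1.2, 1.4))
prediction_b = knn_predict(train, (6.4, 6.2))
print(prediction_a, prediction_b)
```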

The stats you were told to forget…
In data science, analysis is very exploratory in nature. The “exploratory” aspect means that your understanding of the problem you are solving, or might solve, is changing as you go.

How does a data scientist fit into this?

Characteristics of Big Data and the Big Data Challenge
There are three Big Data challenges:
Volume: The amount of data being generated.
Velocity: The rate at which data is being generated.
Variety: Different types of data, including unstructured data.

Characteristics of Big Data and the Big Data Challenge
You could also include other characteristics:
Variability: Inconsistency of the data at times, which is a problem for data analysts.
Complexity: Data comes from different sources and needs to be linked, connected, and correlated.
The big problem is that, with these challenges, batch big data processing is not efficient or effective enough.

Let’s put it in Perspective
“Real-time is for robots,” says Joe Hellerstein, Chancellor’s Professor of Computer Science at UC Berkeley. “If you have people in the loop, it’s not real time. Most people take a second or two to react, and that’s plenty of time for a traditional transactional system to handle input and output.”
For most data analysts, real time means “pretty fast” at the data layer and “very fast” at the decision layer.

So the Elephant was Created
What is Apache Hadoop?
A Java framework (a lot of Java packages).
Promises a dramatic reduction in cost of operation.
Separation of concerns between data collection for BI and analytics.
A cost-effective alternative to a conventional ETL (extract, transform, load) process.
Suitable for unstructured data (the variety issue), distributed computing, and scale.
Hadoop HDFS is the data storage layer for Hadoop, and MapReduce was the data-processing layer. In a nutshell, Hadoop YARN (Yet Another Resource Negotiator) is an attempt to take Apache Hadoop beyond MapReduce for data processing.
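
The MapReduce processing model mentioned above can be sketched in a few lines of plain Python. This is an in-memory illustration of the map, shuffle, and reduce steps only; it does not use the Hadoop API, and the function names and the sample lines are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


# Hypothetical input; on a cluster each line could sit on a different node.
lines = ["big data is big", "data is never clean"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a real cluster the map and reduce calls run in parallel across machines and the shuffle moves data over the network, which is part of why raw MapReduce jobs tend to have batch-style latency.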

Hadoop Architecture Problems
Batch analytics!
Still high data volumes per day, in terabytes (see Moore’s Law).
The Hadoop stack has to deal with “data wrangling”.
Report latency is very high: it takes hours to aggregate data.
Raw data processing through MapReduce is still slow.

Generalized Real-Time Analytics Stack
Decisions: The topmost layer is the decision layer. This is where the rubber meets the road, and it can include end-user applications such as desktop, mobile, and interactive web apps, as well as business intelligence software. This is the layer that most people “see.” It’s the layer at which business analysts, C-suite executives, and customers interact with the real-time big data analytics system.
Integration: On top of the analytics layer is the integration layer. It is the “glue” that holds the end-user applications and analytics engines together, and it usually includes a rules engine or CEP engine, and an API for dynamic analytics that “brokers” communication between app developers and data scientists.
Analytics: The analytics layer sits above the data layer. The analytics layer includes a production environment for deploying real-time scoring and dynamic analytics; a development environment for building models; and a local data mart that is updated periodically from the data layer, situated near the analytics engine to improve performance.
Data: At the foundation is the data layer. At this level you have structured data in an RDBMS, NoSQL, HBase, or Impala; unstructured data in Hadoop MapReduce; streaming data from the web, social media, sensors and operational systems; and limited capabilities for performing descriptive analytics. Tools such as Hive, HBase, Storm and Spark also sit at this layer. (Matei Zaharia suggests dividing the data layer into two layers, one for storage and the other for query.)
Source: Barlow, Mike (2013-06-24). Real-Time Big Data Analytics: Emerging Architecture (Kindle Locations 164-176). Kindle Edition.
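
One way to picture how the four layers hand off to each other is the toy pipeline below. It is only a sketch: the sensor events, the scoring formula, the 0.8 threshold, and every function name are hypothetical stand-ins for the much larger systems the stack describes.

```python
def data_layer():
    """Data layer: yields raw events (hypothetical sensor readings)."""
    yield from [
        {"sensor": "a", "temp": 21.0},
        {"sensor": "b", "temp": 88.5},
        {"sensor": "a", "temp": 22.4},
    ]


def analytics_layer(event):
    """Analytics layer: attaches a score from a toy stand-in for a model."""
    score = min(event["temp"] / 100.0, 1.0)
    return {**event, "score": score}


def integration_layer(scored, threshold=0.8):
    """Integration layer: a one-rule 'rules engine' brokering the result."""
    return "alert" if scored["score"] >= threshold else "ok"


def decision_layer(events):
    """Decision layer: what an end-user application would display."""
    return [(e["sensor"], integration_layer(analytics_layer(e))) for e in events]


decisions = decision_layer(data_layer())
print(decisions)
```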

Moving From Batch to “Near” Real-Time Analytics
Apache Spark: Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications, combining batch, streaming and interactive analytics on all your data. Spark was initially developed as a UC Berkeley research project, and much of the design is documented in papers.
http://spark.apache.org/documentation.html
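
The idea of one application covering both batch and streaming can be sketched without Spark itself. Spark Streaming treats a stream as a sequence of small batches, and the plain-Python example below imitates that by running the same aggregation over a full data set and over micro-batches of it. It does not use the Spark API; the names `total_by_key` and `micro_batches` and the sample records are assumptions for illustration.

```python
from itertools import islice


def total_by_key(records):
    """The shared analytics logic: sum the amounts for each key."""
    totals = {}
    for key, amount in records:
        totals[key] = totals.get(key, 0) + amount
    return totals


def micro_batches(stream, size):
    """Cut an incoming stream into small batches of at most `size` records."""
    it = iter(stream)
    while batch := list(islice(it, size)):
        yield batch


# Hypothetical (channel, event count) records.
records = [("web", 3), ("mobile", 1), ("web", 2), ("mobile", 4), ("web", 1)]

# Batch mode: one pass over all the data at rest.
batch_result = total_by_key(records)

# "Near" real-time mode: the same function applied to each micro-batch,
# with a running total updated as each batch arrives.
running = {}
for batch in micro_batches(records, size=2):
    for key, amount in total_by_key(batch).items():
        running[key] = running.get(key, 0) + amount

print(batch_result, running)
```

Both modes arrive at the same totals; the difference is that the running totals are available after every micro-batch instead of only at the end.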

Next Generation Analytics Platforms will include Spark

UC Berkeley - Shark

Data Stack Evolution

Cisco SIO Hadoop Stack
[Diagram: labels recovered from the slide]
Data sources: sensor data, intrusion protection system logs, firewall logs, security appliance logs.
Scale: 20 TB per day; 1 million events/sec; over 100 channels.
Components: Hadoop (batch processing); Mahout, MLlib; Spark Streaming for known threats & aggregation; Shark, Impala (SQL queries and reporting); GraphX & TitanDB (graph processing).
Outcome: New threat footprint within 2-5 min; closed-loop operations.
Benefits: Unified platform for analytics, low operational costs, faster response times, better algorithms, globally dispersed datacenters.

Cloudera Impala Stack

Splunk > Operational Intelligence with Real-Time Analytics
What is Splunk?
Turns machine data into operational intelligence for IT.
Provides monitoring, search, analysis, and visualization.

Splunk > Architecture
What is Splunk?
Handles large streams of machine data generated by websites, applications, servers, networks, and mobile apps.

References
http://www.ebizq.net/blogs/enterprise/images/mapreduce_hadoop.png
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Real-Time Big Data Analytics: Emerging Architecture, Mike Barlow – O’Reilly
http://conferences.infotoday.com/documents/214/W1_Casaletto.pdf, jcasaletto@mapr.com
http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-enterprise.html
http://cordis.europa.eu/fp7/ict/future-networks/documents/smart-cities-projects/citypulse.pdf
http://www.informationweek.in/informationweek/news-analysis/176208/evolution-analytics