Published by Melissa Floyd. Modified over 6 years ago.
1
Big Data
Team 2 – Mike, Rich, Sam and Steven
DPS – PACE University
9/20/2018
2
3
More Data Than Expected
4
What to Do with the Data?
Analytics at the Edge Devices: In traditional analysis, data is stored first and analyzed later. With streaming data, analytics must occur in real time, as the data passes through. This lets you identify and examine patterns of interest while the data is being created, which enables immediate insight and action.
Real-Time Analytics: Instant access to information derived from data processing that leads to action.
BIG DATA = BIG ASSUMPTIONS?
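As a rough illustration of analyzing data "as it passes through", the sketch below scores each reading the moment it arrives and keeps only running statistics, never the stored history. The function name, the threshold, the warm-up length, and the sample readings are all assumptions made for this example; they do not come from the presentation.

```python
import math
from typing import Iterable, Iterator, Tuple


def flag_outliers(stream: Iterable[float], threshold: float = 3.0,
                  warmup: int = 5) -> Iterator[Tuple[float, bool]]:
    """Yield (value, is_outlier) for each reading as it arrives.

    Running mean and variance are maintained with Welford's online
    algorithm, so no past reading has to be stored.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for value in stream:
        is_outlier = False
        if count >= warmup:
            std = math.sqrt(m2 / (count - 1))
            is_outlier = std > 0 and abs(value - mean) > threshold * std
        yield value, is_outlier
        # Update the running statistics only after the reading is scored.
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)


# Hypothetical sensor readings; the seventh one is an obvious spike.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 25.0, 10.1]
flags = [is_outlier for _, is_outlier in flag_outliers(readings)]
```

Because the function is a generator, a caller can act on each flag immediately instead of waiting for a stored batch to be analyzed.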
5
Why Big Data Now?
The Hype: Understand the cultural phenomenon of data science and how others are experiencing it. Study how companies and universities are “doing data science”.
Why Now: Technology makes this possible: infrastructure for large-scale data processing, increased memory and bandwidth, and a cultural acceptance of technology in the fabric of our lives. This wasn't true a decade ago. Consideration should be given to the ethical and technical responsibilities of the people in charge of the process.
6
How can data science help?
7
What is data science?
In Academia: An academic data scientist is a scientist, trained in anything from social science to biology, who works with large amounts of data and must grapple with computational problems posed by the structure, size, messiness, complexity, and nature of the data, while simultaneously solving a real-world problem.
In Industry: Someone who knows how to extract meaning from and interpret data, which requires tools and methods from statistics and machine learning, as well as being human. She spends a lot of time collecting, cleaning, and “munging” data, because data is never clean. This process requires persistence, statistics, and software engineering skills, which are also necessary for understanding biases in the data and for debugging logging output from code.
8
Who is a data scientist?
Job descriptions: Data scientists are expected to be experts in computer science, statistics, communication, and data visualization, and to have extensive domain expertise.
Observation: Nobody is an expert in everything, which is why it makes more sense to create teams of people with different profiles and different expertise. Together, as a team, they can cover all of those things.
9
Statistical Methodologies
Algorithms:
Linear regression
k-Nearest Neighbors
k-means
Logistic regression
Naive Bayes
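To make one of the listed algorithms concrete, here is a minimal sketch of k-Nearest Neighbors classification in plain Python. The training points, labels, and function name are invented for illustration, and a real project would more likely use a library implementation.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[Sequence[float], str]  # (features, label)


def knn_predict(train: List[Point], query: Sequence[float], k: int = 3) -> str:
    """Label a query point by majority vote among its k nearest neighbors."""
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set with two well-separated groups.
train = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.9, 1.1), "low"),
    ((5.0, 5.2), "high"), ((5.1, 4.9), "high"), ((4.8, 5.0), "high"),
]
prediction_near_low = knn_predict(train, (1.1, 1.0))
prediction_near_high = knn_predict(train, (5.0, 5.0))
```

The choice of k and of the distance measure are the main tuning decisions; both are fixed here only to keep the sketch short.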
10
The stats you were told to forget…
In data science, analysis is very exploratory in nature. The “exploratory” aspect means that your understanding of the problem you are solving, or might solve, changes as you go.
11
How does a data scientist fit into this?
12
Characteristics of Big Data and the Big Data Challenge
There are three (3) Big Data Challenges:
Volume: The amount of data being generated
Velocity: The rate at which data is being generated
Variety: Different types of data, including unstructured data
13
Characteristics of Big Data and the Big Data Challenge
You could also include other characteristics:
Variability: Data is inconsistent at times, which is a problem for data analysts
Complexity: Data comes from different sources and needs to be linked, connected, and correlated
The big problem is that, with these challenges, batch big data processing is not efficient or effective enough.
14
Let’s put it in Perspective
“Real-time is for robots,” says Joe Hellerstein, chancellor’s professor of computer science at UC Berkeley. “If you have people in the loop, it’s not real time. Most people take a second or two to react, and that’s plenty of time for a traditional transactional system to handle input and output.”
For most data analysts, real time means “pretty fast” at the data layer and “very fast” at the decision layer.
15
So the Elephant was Created
What is Apache Hadoop?
A Java framework (a lot of Java packages)
Promises a dramatic reduction in cost of operation
Separation of concerns between data collection for BI and analytics
A cost-effective alternative to a conventional ETL process
Suitable for unstructured data (the variety issue), distributed computing, and scale
In a nutshell, Hadoop YARN (Yet Another Resource Negotiator) is an attempt to take Apache Hadoop beyond MapReduce for data processing. HDFS is the data storage layer for Hadoop, and MapReduce was the data-processing layer.
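The MapReduce processing model mentioned above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps using word counting; it is not the Hadoop API, and all function names are chosen for this example only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big elephant"])
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data across the network, which is one reason the batch latency discussed on the next slide can be high.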
16
Hadoop Architecture Problems
Batch analytics!
Still high data volumes per day, in TB(s) (see Moore’s Law)
Hadoop stack has to deal with “data wrangling”
Report latency is very high: it takes hours to aggregate data
Raw data processing through MapReduce is still slow
17
Generalized Real-Time Analytics Stack
Decisions: The topmost layer is the decision layer. This is where the rubber meets the road, and it can include end-user applications such as desktop, mobile, and interactive web apps, as well as business intelligence software. This is the layer that most people “see.” It’s the layer at which business analysts, c-suite executives, and customers interact with the real-time big data analytics system.
Integration: On top of the analytics layer is the integration layer. It is the “glue” that holds the end-user applications and analytics engines together, and it usually includes a rules engine or CEP engine, and an API for dynamic analytics that “brokers” communication between app developers and data scientists.
Analytics: The analytics layer sits above the data layer. It includes a production environment for deploying real-time scoring and dynamic analytics; a development environment for building models; and a local data mart that is updated periodically from the data layer, situated near the analytics engine to improve performance.
Data: At the foundation is the data layer. At this level you have structured data in an RDBMS, NoSQL, HBase, or Impala; unstructured data in Hadoop MapReduce; streaming data from the web, social media, sensors and operational systems; and limited capabilities for performing descriptive analytics. Tools such as Hive, HBase, Storm and Spark also sit at this layer. (Matei Zaharia suggests dividing the data layer into two layers, one for storage and the other for query.)
Source: Barlow, Mike. Real-Time Big Data Analytics: Emerging Architecture. Kindle Edition.
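One way to picture how the four layers hand data to each other is the toy pipeline below. Every name, the sample events, the scoring rule, and the alert threshold are hypothetical and exist only to show the direction of flow from data to decision; a real stack would place databases, model servers, and a rules or CEP engine behind these functions.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Event:
    source: str
    value: float


def data_layer() -> Iterator[Event]:
    """Data layer stand-in: emits a few hard-coded streaming events."""
    yield Event("sensor-a", 0.20)
    yield Event("sensor-b", 0.95)
    yield Event("sensor-a", 0.40)


def analytics_layer(events: Iterable[Event]) -> Iterator[Tuple[Event, float]]:
    """Analytics layer stand-in: attach a score to each event.

    The 'model' here is a placeholder that caps the raw value at 1.0.
    """
    for event in events:
        yield event, min(1.0, event.value)


def integration_layer(scored: Iterable[Tuple[Event, float]],
                      threshold: float = 0.9) -> Iterator[str]:
    """Integration layer stand-in: a one-rule 'rules engine'."""
    for event, score in scored:
        if score >= threshold:
            yield f"ALERT {event.source}: score {score:.2f}"


def decision_layer(messages: Iterable[str]) -> List[str]:
    """Decision layer stand-in: collect what an end user would see."""
    return list(messages)


alerts = decision_layer(integration_layer(analytics_layer(data_layer())))
```

Each layer only consumes the output of the layer beneath it, which mirrors the separation the slide describes.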
18
Moving From Batch to “Near” Real-Time Analytics
Apache Spark
Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data. Spark was initially developed as a UC Berkeley research project, and much of the design is documented in papers.
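The idea of one code path serving both batch and streaming work can be illustrated without Spark itself. The sketch below is plain Python, not the Spark API: it reuses a single aggregation function on a complete data set and on the same data arriving in small chunks. The chunking approach, the function names, and the sample events are assumptions made for this example.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def count_by_key(records: Iterable[str]) -> Counter:
    """The shared analytics logic: count occurrences of each record."""
    return Counter(records)


def micro_batches(stream: Iterable[str], size: int) -> Iterator[List[str]]:
    """Cut an incoming stream into small lists of at most `size` items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def streaming_counts(stream: Iterable[str], size: int = 2) -> Iterator[Counter]:
    """Apply the batch logic to each chunk and yield running totals."""
    total: Counter = Counter()
    for batch in micro_batches(stream, size):
        total += count_by_key(batch)
        yield Counter(total)  # snapshot of the totals so far


# Hypothetical event log used for both modes.
events = ["login", "error", "login", "login", "error"]
batch_result = count_by_key(events)
stream_results = list(streaming_counts(events, size=2))
```

The last streaming snapshot matches the batch result, while the earlier snapshots are available long before the full data set has arrived.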
19
Next Generation Analytics Platforms will include Spark
20
UC Berkeley - Shark
21
Data Stack Evolution
22
Data Stack Evolution
23
Data Stack Evolution
24
Cisco SIO
[Diagram: Cisco SIO analytics stack]
Inputs: sensor data, intrusion protection system logs, firewall logs, security appliance logs
Scale: 20 TB per day; 1 million events/sec; over 100 channels
Processing: batch processing; Spark Streaming for known threats and aggregation; SQL queries and reporting; graph processing; closed-loop operations
Tools: Hadoop; Mahout, MLlib; Shark, Impala; GraphX and TitanDB
Result: new threat footprint within 2-5 min
Benefits: unified platform for analytics, low operational costs, faster response times, better algorithms, globally dispersed datacenters
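To put the scale figures on this slide in perspective, the short calculation below converts 20 TB per day into a sustained byte rate. It assumes decimal terabytes (10^12 bytes). The per-event size additionally assumes that the 20 TB and the 1 million events/sec describe the same stream, which the slide does not confirm.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: 1 TB = 10**12 bytes (decimal terabytes).
bytes_per_day = 20 * 10**12
bytes_per_second = bytes_per_day / SECONDS_PER_DAY
mb_per_second = bytes_per_second / 10**6

# Assumption: the event rate and the daily volume refer to the same data.
events_per_second = 1_000_000
bytes_per_event = bytes_per_second / events_per_second

print(f"Sustained ingest: {mb_per_second:.1f} MB/s")
print(f"Average event size: {bytes_per_event:.0f} bytes")
```

Under those assumptions the platform has to absorb a few hundred megabytes every second, around the clock, which helps explain why hour-long batch aggregation is a poor fit for threat detection.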
25
Cloudera Impala Stack
26
Splunk > Operational Intelligence with Real-Time Analytics
What is Splunk?
Turns machine data into operational intelligence for IT
Provides monitoring, search, analysis, and visualization
27
Splunk > Architecture
What is Splunk?
Handles large streams of machine data generated by websites, applications, servers, networks, and mobile apps.
28
References
hdfs/HdfsDesign.html
Real-Time Big Data Analytics: Emerging Architecture, Mike Barlow – O’Reilly
services/cloudera-enterprise.html
projects/citypulse.pdf
analysis/176208/evolution-analytics