Big Data Analytics Platforms. Our Team (Name, Application): Viborov Michael, Apache Spark; Bordeynik Yaniv, Apache Storm; Abu Jabal Feras, HPCC; Oun Joseph, Google BigQuery.

1 Big Data Analytics Platforms

2 Our Team
Name - Application
Viborov Michael - Apache Spark
Bordeynik Yaniv - Apache Storm
Abu Jabal Feras - HPCC
Oun Joseph - Google BigQuery

3 Background: What is Big Data?
Big Data is a very successful buzzword. Behind this market-shaped name hides the idea of massive, heterogeneous, autonomous data sources.
The three V's of Big Data, extended here with two further dimensions:
◎ Volume: The quantity of data to be analyzed is constantly growing.
◎ Variety: The range of data categories, e.g. data about the speed of airplanes, sonar data for rendering maps.
◎ Velocity: The speed at which new data is generated, e.g. Facebook, Twitter, sensors on a jet engine.
◎ Veracity: The quality of the data, e.g. measurement errors, incomplete data.
◎ Complexity: Data comes from multiple sources.

4 Why Did It Emerge?
The need for Big Data has its roots in BI applications: the requirements on the data to be analyzed have grown.
◎ Volume: A lot of data has been gathered up to now.
◎ Velocity: The speed of data generation is constantly growing.
◎ Complexity: The more we learn about a process, the more data we want about it.

5 Popular Terms - Big Data
●Analytics - The discovery and communication of meaningful patterns in data.
●Business Intelligence (BI) - The set of techniques and tools for transforming raw data into meaningful and useful information for business analysis purposes.
●Data mining - The computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning and statistics.
●MapReduce - A technique for working with large volumes of data that distributes the selection and processing of data across several processing nodes.
●Computer Cluster - A set of connected computers that work together, with each node set to perform the same task under central control and scheduling.
●Concurrency - The ability to execute multiple processes at the same time.
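The MapReduce term above can be illustrated with a minimal word-count sketch. It uses only the Python standard library, and the worker threads merely stand in for the processing nodes of a cluster. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are chosen for this illustration and do not come from any of the platforms discussed in the slides.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=2):
    # Each worker thread stands in for one processing node of a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data is big", "data about data"]))
```

The map step works on each chunk independently, which is what allows it to be spread over several nodes; only the shuffle step needs to see the output of all chunks.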

6 Big Data Goals
●Big Data analytics enables organizations to analyze many kinds of data in search of valuable business information and insights.
●The analytical findings can lead to more effective marketing, new revenue opportunities, advantages over rival organizations and other business benefits.
The primary goal of Big Data analytics is to help companies make more informed business decisions by enabling data scientists, predictive modelers and other analytics professionals to analyze large volumes of transaction data.

7 Domain Creation Workflow
●Explore the Web.
●Study Big Data and Big Data analytics.
●Choose related applications.
●Build a feature diagram.
●Compare the shared use cases of the applications.
●Remove unrelated features.
●Build collective diagrams.
●Discuss.
●Create the final domain and its boundaries.

8 Domain Boundaries
In industry, the term Big Data Analytics tools mainly refers to:
●Tools for managing data storage, manipulation and retrieval tasks.
●Frameworks and tools for developing and executing analytics processes with business value.
●Various techniques, libraries and software products that facilitate data mining processes.
●Reporting and visualization tools and applications.
Our focus - Tools for developing analytics applications in the Big Data realm: an abstraction layer between developers and a robust underlying infrastructure that handles issues related to data volume, speed and complexity.

9 Domain Boundaries
+IN
◎ Development Environments
◎ Distributed Software Systems
◎ Closed solutions with APIs
-OUT
◎ Visualization Tools *
◎ Data Organizers/Analytics not designed for Big Data
◎ Machine Learning libraries
* Tools whose main component is visualization are excluded. However, tools with a built-in visualization component that is not the main one are included.

10 Applications
◎ Apache Spark
◎ Apache Storm
◎ HPCC
◎ Google BigQuery

11 Apache Spark
Apache Spark is a fast and general engine for large-scale data processing.
●Supports writing applications in Java, Scala, Python and other languages.
●Combines SQL, streaming, and complex analytics.
●Runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
●Is used at a wide range of organizations to process large datasets.

12 Apache Spark Within The Domain
Spark is in the Big Data Analytics domain:
●Designed to support developers in the Big Data realm
●Provides an abstraction layer for running tasks in parallel and fast access to huge data sources
●Supports integration of existing tools and methods for Big Data challenges

13 Apache Spark Relation to the Others
Similarities
●Integration with existing tools
●Job parallelization
●System management
Differences
●No visualization/reporting
●No user management
●Open source approach

14 Apache Storm
●A distributed real-time computation system for processing large volumes of high-velocity data.
●Powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.
●Companies and projects powered by Storm: Groupon, Twitter, Yahoo!, Alibaba and many more.
●Integrates with any queueing system, offers a simple API, is scalable and fault tolerant, and can be used with any language.

15 Apache Storm Within The Domain
●Storm was developed mainly for online stream processing
●Storm supports job parallelization
●Storm integrates easily with many other systems

16 Apache Storm Relation to the Others
Similarities
●System management
●Stream management
●Easy integration with external tools
●Job parallelization
Differences
●Topology creation
●Real-time analytics
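In Storm, a topology is a graph of spouts (stream sources) and bolts (processing steps). The sketch below imitates that idea with plain Python generators so that it runs without a cluster. It is not the Storm API: the names `sentence_spout`, `split_bolt`, `count_bolt` and `run_topology` are hypothetical, and everything runs in a single process.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream; emits one item at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: splits each incoming sentence into lower-cased words."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def count_bolt(stream):
    """Bolt: keeps a running count and emits it after every word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences):
    # Chaining the generators plays the role of the topology graph:
    # spout -> split bolt -> count bolt.
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


if __name__ == "__main__":
    for word, count in run_topology(["storm processes streams", "streams never end"]):
        print(word, count)
```

Because each stage consumes items one by one, results are emitted while the input is still arriving, which is the property that makes this style suitable for real-time analytics.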

17 HPCC
●An open source, data-intensive computing system platform developed by LexisNexis Risk Solutions.
●Stores and processes large quantities of data using massively parallel processing technology.
●Data across disparate data sources can be accessed, analyzed and manipulated in fractions of a second.
●Functions as both a processing and a distributed data storage environment.
●Supports both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie).

18 HPCC Within The Domain
●HPCC is designed to handle Big Data using high-performance query applications
●HPCC supports job and data parallelization
●HPCC uses its own query language and also supports well-known external query languages

19 HPCC Relation to the Others
Similarities
●System management
●Easy integration with external tools
●Job parallelization
Differences
●No stream management
●User management

20 BigQuery
BigQuery is a RESTful web service that enables interactive analysis of massively large datasets, working in conjunction with Google Storage. It is an Infrastructure as a Service that may be used complementarily with MapReduce.
●Managing data - Create and delete tables based on a JSON-encoded schema; import data encoded as CSV or JSON from Google Storage.
●Query - Queries are expressed in a SQL dialect and the results are returned in JSON, with a maximum reply length of approximately 64 MB. [2] There are some limitations compared to usual SQL queries. For example, BigQuery supports joins, but one of the two joined tables must be small enough, or the query must use the JOIN EACH keyword instead. [2]
●Integration - BigQuery can be used from Google Apps Script, Google Spreadsheets, or any language that can work with its REST API.
●Access Control - It is possible to share datasets with arbitrary individuals, groups, or the world.
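A rough sketch of what the slide describes: a JSON-encoded table schema and a query that joins two large tables with JOIN EACH. Nothing is sent over the network; the code only builds the JSON text. The dataset and table names (`shop.orders`, `shop.users`), the column names, the helper `build_query_request` and the exact shape of the request body are assumptions made for illustration, so check the BigQuery REST documentation before relying on them.

```python
import json

# Hypothetical table schema in JSON form: a list of fields,
# each with a name, a type and a mode (shape assumed for illustration).
schema = {
    "fields": [
        {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
        {"name": "amount", "type": "FLOAT", "mode": "NULLABLE"},
    ]
}

# Join of two large tables using the JOIN EACH keyword from the slide.
# Dataset, table and column names are made up.
sql = (
    "SELECT o.user_id, u.country, o.amount "
    "FROM [shop.orders] AS o "
    "JOIN EACH [shop.users] AS u ON o.user_id = u.user_id"
)


def build_query_request(query):
    """Return a JSON request body carrying the query (nothing is sent)."""
    return json.dumps({"query": query})


if __name__ == "__main__":
    print(json.dumps(schema, indent=2))
    print(build_query_request(sql))
```

Any language that can produce such JSON and issue HTTP requests can talk to a REST service, which is the integration point the slide refers to.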

21 BigQuery Within The Domain
●BigQuery handles massive amounts of data and is designed to handle Big Data using a high-performance query engine
●BigQuery supports data parallelization
●BigQuery supports integration of existing tools and methods for Big Data

22 BigQuery Relation to the Others
Similarities
●System management
●Easy integration with external tools
●Job parallelization
Differences
●No stream management
●User management

23 Big Data Analytics Domain
●Analytics Design - Designing the transformation process.
●System Management - Managing the topology of the process.
●Process Management - The ensemble of activities that govern the performance of the process.
●Reporting - Visualization of the data.
●Data Access - The technique used to access the data.

24 Domain Use Case (Data Access)

25 Domain Use Case (Task Design)

26 Domain - OVM & Class Diagram

27 Domain - OVM & Sequence

28 Big Data Analytics Domain

29 Applications Comparison
Criteria
●Analytics development
●User management
●Integration capabilities
●Job parallelization
●Stream processing

30 Conclusions
●New techniques and methods were learned during the course.
●Analysis of software fields/domains starts from individual products.
●SPLE gives a new approach to software engineering.


