Make Sense of Big Data. Researched by JIANG Wen-rui, led by Prof. ZOU

Three levels of Big Data:
- Data Analysis (SaaS)
- Software Infrastructure (PaaS)
- Hardware Infrastructure (IaaS)

Contradiction between the first and second levels:
- Data Analysis: Machine Learning, Data Warehouse, Statistics
- Software Infrastructure: MapReduce, Pregel, GraphLab, GraphBuilder, Spark

Evolution of Big Data Tech:
- Intelligence level: MLBase, Mahout, BC-PDM, graph applications, BDAS, Cloudera, GraphLab, Shark, Spark, Hive, Pig, Pregel, GraphBuilder
- Software architecture level: MapReduce, MapR, HBase, HDFS

The 4 Vs of Big Data:

Volume. Big Data is just that: data sets so massive that typical software systems are incapable of economically storing, let alone managing and computing, the information. A Big Data platform must capture and readily provide such quantities in a comprehensive and uniform storage framework to enable straightforward management and development.

Variety. One of the tenets of Big Data is the exponential growth of unstructured data. The vast majority of data now originates from sources with either limited or variable structure, such as social media and telemetry. A Big Data platform must accommodate the full spectrum of data types and forms.

Velocity. As organizations continue to seek new questions, patterns, and metrics within their data sets, they demand rapid and agile modeling and query capabilities. A Big Data platform should maintain the original format and precision of all ingested data to ensure full latitude for future analysis and processing cycles.

Value. Driving relevant value, whether as revenue or cost savings, from data is the primary motivator for many organizations. The popularity of long-tail business models has forced companies to examine their data in detail to find the patterns, affiliations, and connections that drive these new opportunities.

Model comparison: MapReduce vs. Pregel vs. GraphLab vs. Spark.

Google MapReduce: good at data-independent tasks, not at machine learning and graph processing (data-dependent and iterative tasks). Based on acyclic data flow. Think like a key.

Pregel: good at iterative and data-dependent computations, including graph processing. Uses the BSP (Bulk Synchronous Parallel) model. A message-passing abstraction.

CMU GraphLab: good at iterative and data-dependent computations, especially natural-graph problems. Uses an asynchronous distributed shared-memory model. A shared-state abstraction. Think like a vertex.

UC Berkeley BDAS Spark: good at iterative algorithms, interactive data mining, and OLAP reports. Uses the RDD (resilient distributed datasets) abstraction, built on in-memory cluster computing and a distributed-memory model.

MapReduce

Map@MapReduce

Reduce@MapReduce
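The two slides above illustrate the map and reduce phases. A minimal sketch of the pattern in plain Python (illustrative only, not the Hadoop API), using the classic word-count example:

from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word, 1)

def reduce_phase(pairs):
    # Shuffle and reduce: group intermediate pairs by key, sum counts per key.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

pairs = [p for doc in ["big data", "big compute"] for p in map_phase(doc)]
print(dict(reduce_phase(pairs)))  # {'big': 2, 'compute': 1, 'data': 1}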

RPC@MapReduce

RPC@MapReduce

MapReduce + BSP

BSP Model: processors perform local computation, then communicate, then meet at a barrier synchronization.
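A minimal single-process sketch of one BSP superstep in plain Python (illustrative; the names and the ring topology are assumptions, and a real BSP runtime would distribute the workers across machines):

def bsp_superstep(states, inboxes, compute):
    # Local computation: each worker updates its state and emits messages.
    outboxes = {w: [] for w in states}
    for w in states:
        states[w], messages = compute(w, states[w], inboxes[w])
        for dest, msg in messages:
            outboxes[dest].append(msg)
    # Barrier synchronization: no worker sees new messages until all finish.
    return states, outboxes

def compute(w, state, inbox):
    # Add the received values, then message the next worker in the ring.
    new_state = state + sum(inbox)
    return new_state, [((w + 1) % 3, new_state)]

states, inboxes = {0: 0, 1: 10, 2: 20}, {0: [], 1: [], 2: []}
for _ in range(2):  # two supersteps
    states, inboxes = bsp_superstep(states, inboxes, compute)
print(states)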

MapReduce + BSP

GraphLab

GraphLab: Think like a vertex

GraphLab working patterns and functions:
- MR (Map-Reduce): map_reduce_vertices, map_reduce_edges, transform_vertices, transform_edges
- GAS (Gather-Apply-Scatter): gather_edges, gather, apply, scatter_edges, scatter

Distributed Execution of a PowerGraph Vertex-Program (figure): the edges of a vertex Y are partitioned across machines 1-4; mirrors gather partial sums Σ1..Σ4 in parallel, the master combines them and applies the update to Y, and the new value is scattered back to the mirrors.
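A minimal single-machine sketch of the Gather-Apply-Scatter pattern for PageRank in plain Python (illustrative; this is not the GraphLab/PowerGraph API, and the data layout is an assumption):

def gas_pagerank(vertices, in_edges, weights, ranks, tol=1e-3):
    # in_edges[i] lists the sources j of edges j -> i; weights[(j, i)] is w_ji.
    active = set(vertices)
    while active:
        i = active.pop()
        # Gather: accumulate partial sums over in-edges (the sigmas in the figure).
        total = sum(ranks[j] * weights[(j, i)] for j in in_edges[i])
        # Apply: update the vertex value from the gathered sum.
        new_rank = 0.15 + total
        # Scatter: if the value changed, reactivate the out-neighbors.
        if abs(new_rank - ranks[i]) > tol:
            active |= {k for (j, k) in weights if j == i}
        ranks[i] = new_rank
    return ranks

In PowerGraph the gather runs in parallel on every machine that holds a mirror, and only the partial sums cross the network.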

GraphLab vs. Pregel, example: what is the popularity of this user? It depends on the popularity of her followers, which in turn depends on the popularity of their followers.

GraphLab vs. Pregel: PageRank. The rank of user i is a weighted sum of her neighbors' ranks, R[i] = 0.15 + Σ_j w_ji · R[j]; ranks are updated in parallel and iterated until convergence.

The Pregel Abstraction: vertex-programs interact by sending messages (Malewicz et al. [PODC'09, SIGMOD'10]).

Pregel_PageRank(i, messages):
    // Receive all the messages
    total = 0
    foreach (msg in messages):
        total = total + msg
    // Update the rank of this vertex
    R[i] = 0.15 + total
    // Send new messages to neighbors
    foreach (j in out_neighbors[i]):
        send msg(R[i] * w_ij) to vertex j

The Pregel Abstraction: compute, communicate, barrier.

The GraphLab Abstraction: vertex-programs directly read their neighbors' state (Low et al. [UAI'10, VLDB'12]).

GraphLab_PageRank(i):
    // Compute sum over neighbors
    total = 0
    foreach (j in in_neighbors(i)):
        total = total + R[j] * w_ji
    // Update the PageRank
    R[i] = 0.15 + total
    // Trigger neighbors to run again
    if R[i] not converged then
        foreach (j in out_neighbors(i)):
            signal vertex-program on j

GraphLab Execution: a scheduler determines the order in which vertices are executed; CPUs repeatedly pull vertices from the scheduler and run their programs, which may re-schedule neighbors, and the process repeats until the scheduler is empty.

GraphLab vs. Pregel (BSP): on multicore PageRank (25M vertices, 355M edges), 51% of the vertices are updated only once.

Graph-parallel abstractions: Pregel is a synchronous messaging abstraction, GraphLab an asynchronous shared-state abstraction; the asynchronous model is better for ML.

Challenges of high-degree vertices:
- They send many messages (Pregel)
- They touch a large fraction of the graph (GraphLab)
- Edge metadata is too large for a single machine
- Edges are processed sequentially
- Asynchronous execution requires heavy locking (GraphLab)
- Synchronous execution is prone to stragglers (Pregel)

Berkeley Data Analytics Stack

Berkeley Data Analytics Stack (top to bottom):
- MLBase (Value)
- BlinkDB, approximate queries (Velocity)
- Shark (Spark + Hive), SQL
- Spark, shared RDDs / distributed memory (Variety)
- Mesos, cluster resource manager
- HDFS (Volume)
running alongside MapReduce, MPI, GraphLab, etc.

Spark motivation: most current cluster programming models are based on acyclic data flow from stable storage to stable storage (input → map → reduce → output).

Spark targets:
- Iterative algorithms, including many machine learning algorithms and graph algorithms like PageRank.
- Interactive data mining, where a user would like to load data into RAM across a cluster and query it repeatedly.
- OLAP reports that run multiple aggregation queries on the same data.

Spark allows iterative computation on the same data, which would form a cycle if jobs were visualized. Spark offers an abstraction called resilient distributed datasets (RDDs) to support these applications efficiently.

RDDs:
- A Resilient Distributed Dataset (RDD) is an abstraction over raw data; some of the data is kept in memory and cached for later use.
- Spark allows data to be kept in RAM, an approximately 20x speedup over disk-based MapReduce.
- RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics.
- RDDs are immutable and created through parallel transformations such as map, filter, groupBy and reduce.
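A hedged PySpark sketch of these ideas (the file path and application name are illustrative, assuming a local Spark installation): the dataset is built by parallel transformations and pinned in RAM with cache(), so later passes skip the disk.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs://namenode/events.log")      # illustrative path
errors = lines.filter(lambda l: "ERROR" in l).cache()  # keep in RAM

# Both passes reuse the cached RDD instead of re-reading from disk.
total = errors.count()
by_code = (errors.map(lambda l: (l.split()[1], 1))
                 .reduceByKey(lambda a, b: a + b)
                 .collect())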

Functions: MapReduce vs. Spark

Logistic Regression Performance: Hadoop takes 127 s per iteration; Spark takes 174 s for the first iteration and 6 s for further iterations. This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).
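The speed-up comes from caching: only the first iteration pays the cost of loading the data, and each further pass over the in-memory points is cheap. A hedged PySpark sketch of that iterative loop (gradient-descent logistic regression; the path, feature dimension, and iteration count are illustrative assumptions):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="lr-demo")  # illustrative app name

def parse(line):
    # Each line: a +1/-1 label followed by the feature values.
    v = np.array(line.split(), dtype=float)
    return v[0], v[1:]

points = sc.textFile("hdfs://namenode/lr.txt").map(parse).cache()
w = np.zeros(10)  # illustrative feature dimension

for _ in range(20):  # only the first pass touches the disk
    grad = points.map(lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0)
                      * p[0] * p[1]).reduce(lambda a, b: a + b)
    w -= grad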

MLBase motivation, two gaps:
- In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming: many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques, yet they need to tune and compare several suitable algorithms.
- Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives.
So MLBase is designed as a system that is extensible to novel ML algorithms.

MLBase: four pieces.
- MQL: a simple declarative way to specify ML tasks.
- ML-Library: a library of distributed algorithms, plus a set of high-level operators that enable ML researchers to scalably implement a wide range of ML methods without deep systems knowledge.
- ML-Optimizer: a novel optimizer to select and dynamically adapt the choice of learning algorithm.
- ML-Runtime: a new runtime optimized for the data-access patterns of these high-level operators.

MLBase Architecture

MLBase

Error guide: just the Hadoop framework? In a sense, distributed platforms are just a language; we cannot do without them, but we should not rely on them alone. What matters more is machine learning. Reading: Machine Learning: A Probabilistic Perspective; Deep Learning.

FUJITSU: Parallel Time Series Regression. Led by Dr. Yang. Group: LI Zhong-hua, WANG Yun-zhi, JIANG Wen-rui.

Parallel time series regression, properties and performance:
- Platform: Hadoop from Apache (an open-source implementation of Google's MapReduce) and GraphLab from Carnegie Mellon University (open source). Both are good at distributed parallel processing: MapReduce at acyclic data flow, GraphLab at iterative and data-dependent computations.
- Volume: supports big data. The algorithm scales well; when a large amount of data arrives, it can be handled without any modification, simply by increasing the number of cluster nodes.
- Velocity: rapid and agile modeling and handling capabilities for big data.
- Interface: an XML file for input parameter settings, allowing customers to set parameters intuitively.

Parallel time series regression pipeline: Decompose (MapReduce) → CycLenCalcu (MapReduce) → Indicative Frag (MapReduce) → TBSCPro (MapReduce) → Clustering (GraphLab) → Choose Cluster (MapReduce).

Design for the parallel indicative fragment step. The indicative fragment task identifies the best length of the indicative fragment. Assume 90 days of data and a maximum indicative fragment length of 96. Comparing serial and parallel time complexity: all 96 × C(90, 2) = 96 × (90 × 89 / 2) operation pairs are generated before the parallel computation, so the serial time complexity is 96 × C(90, 2) while the parallel time complexity is 1, since all pairs are evaluated simultaneously. A sketch of the pre-generation step follows.
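A minimal sketch of that pre-generation step in plain Python (illustrative; the work-item layout is an assumption):

from itertools import combinations
from math import comb

DAYS, MAX_LEN = 90, 96
work_items = [(length, d1, d2)
              for length in range(1, MAX_LEN + 1)
              for d1, d2 in combinations(range(DAYS), 2)]
# 96 * C(90, 2) = 96 * 4005 = 384,480 independent pairs for the parallel phase.
assert len(work_items) == MAX_LEN * comb(DAYS, 2)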

TBSCPro (figure): a heap with capacity 3 is maintained while scanning the series for all days (a, b, c, d, e, ...); candidates are ranked 1-5 and only the best are kept. One way to realize such a bounded heap is sketched below.
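A minimal sketch of a bounded top-k heap in plain Python (illustrative; the scores and item names are placeholders):

import heapq

def top_k(scored_items, k=3):
    # Min-heap of at most k (score, item) pairs; the root is the weakest
    # survivor and is evicted whenever a better item arrives.
    heap = []
    for score, item in scored_items:
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item))
    return sorted(heap, reverse=True)

print(top_k([(2, "a1"), (5, "b3"), (3, "c2"), (4, "d4"), (1, "e5")]))
# [(5, 'b3'), (4, 'd4'), (3, 'c2')]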

Parallel time series regression model

Thank you!