1
CSCI5570 Large Scale Data Processing Systems
Distributed Data Analytics Systems James Cheng CSE, CUHK
2
Husky: Towards a More Efficient and Expressive Distributed Computing Framework
Fan Yang, Jinfeng Li, James Cheng, The Chinese University of Hong Kong
3
The Era of Big Data
Whether you like the term “big data” or not, BIG data is real
The challenges of handling the three ‘V’s of big data (volume, velocity, and variety) are real
The problem of veracity (the so-called 4th ‘V’) of big data is also real
So is the great value (the so-called 5th ‘V’) hidden in big data
4
Big Data Applications
Big data applications in industry
Big data applications in science
Big data applications for social good
5
Big Data Applications (in industry)
Source: Big data use cases by Dell
Sales conversion optimization; consumer behavior analysis; customer segmentation; security threat prediction; predictive support; market basket analysis; pricing optimization; other industry-specific applications
6
Big Data Applications (in industry)
Source: Big data use cases by Dell
8 categories of big data use cases, across a wide range of industries: communication, finance, food & beverage, retail, hotel, travel, banking, e-commerce, transportation, cloud storage, car manufacturing, insurance, healthcare, HR/recruiting, farming
Proven to work and to generate great profits in numerous companies, from thousands to millions of employees
7
Big Data Applications (in science)
Genomic studies; astronomical data analysis; complex physics simulations; biology and environmental research; …
8
Big Data Applications (for social good)
Physical education; health care monitoring; healthy ageing; air pollution control; …
9
Big Data Solutions
The volume, velocity, and variety of big data (and the need to extract value from it) require new techniques and systems
Many concepts from large-scale data processing and distributed computing are still valid, but making them work in industry is a totally different story: it requires both research and non-trivial engineering effort
10
Big Data Solutions: deep learning
The universal big data solution? The best big data solution?
11
Big Data Solutions
Building a good big data solution often requires combining multiple types of systems
What types of systems are generally available today for big data solutions?
12
Systems for Big Data Solutions
General-purpose big data platforms: Hadoop, Spark, Flink, Dato, Naiad, Husky, …
NoSQL: MongoDB, Cassandra, CouchDB, …
Key-value stores: Redis, Memcached, …
Search engines: ElasticSearch, Solr, …
Machine learning systems: Petuum, GraphLab, TensorFlow, MXNet, Angel, DMTK, …
Graph computing systems: Pregel, Giraph, GraphLab, …
13
Systems for Big Data Solutions
Great! So many big data tools available: general-purpose platforms, NoSQL, key-value stores, search engines, machine learning systems, graph systems
14
Systems for Big Data Solutions
But it is difficult to integrate them (general-purpose platforms, NoSQL, key-value stores, search engines, machine learning systems, graph systems) into big data solutions
15
Problems of Integrated Solutions
General problems of integrated solutions: poor performance, steep learning curve, low reusability, high maintenance cost, incompatibility
16
Problems of Integrated Solutions
General problems of integrated solutions:
Poor performance: high context-switch cost when moving data from one system to another (vs. general-purpose platforms: no context switch, but inefficient for some types of workloads)
Steep learning curve; low reusability; high maintenance cost; incompatibility
17
Poor Performance
A big data solution built from domain-specific systems incurs a high context-switch cost; a solution built on a general-purpose platform has no context-switch cost, and thus a lower overall cost
18
Problems of Integrated Solutions
General problems of integrated solutions:
Steep learning curve: users need to learn how to use different types of systems, which is not easy even for skillful programmers
Poor performance; low reusability; high maintenance cost; incompatibility
19
Problems of Integrated Solutions
General problems of integrated solutions:
Low reusability: solutions, or components of a solution, are hard to reuse to build other solutions, leading to more and more ad-hoc solutions that gradually become very hard to understand and maintain
Poor performance; steep learning curve; high maintenance cost; incompatibility
20
Problems of Integrated Solutions
General problems of integrated solutions:
High maintenance cost: the different systems used to build a solution may be updated regularly by their developers, and a single update may trigger a cascade of updates to the whole solution package
Poor performance; steep learning curve; low reusability; incompatibility
21
Problems of Integrated Solutions
General problems of integrated solutions:
Incompatibility: the different types of systems used to build a solution may not be fully compatible with each other, and some may perform poorly on a particular platform/environment
Poor performance; steep learning curve; low reusability; high maintenance cost
22
Husky Design Goals
Can we build a big data platform that provides a unified framework with the following characteristics?
High performance; flat learning curve; good reusability; low maintenance cost; high compatibility
23
Husky: Overview
One unified platform, multiple purposes: the Husky kernel and APIs support graph analytics, machine learning, MapReduce, stream processing, SQL, and OLAP, and connect to the Hadoop ecosystem (search engines, messaging systems, key-value stores, NoSQL)
24
Husky: bred for your big data
A husky bred for big data: fast (velocity), strong (volume), versatile (variety)
25
Husky: bred for your big data
A general-purpose big data platform: general and expressive, high-performance, user-friendly
26
Husky: Generality and Expressiveness
A new programming model that captures:
coarse-grained transformations (e.g., map, reduce, join)
fine-grained operations over mutable data structures (e.g., machine learning, graph analytics)
Supports both synchronous and asynchronous execution
Supports the real-time streaming model as well as Spark's mini-batch streaming model
Bridges different programming paradigms: they can co-exist in Husky and cooperate
27
Husky Computational Model
The Basics:
Husky represents data as Husky objects
Objects are stored in object lists
Both objects and object lists are mutable
Objects can have structure, e.g., a “vector” object, a “product” object, and so on
Computation happens by object interaction, e.g., a “graph vertex” object pushes messages to its neighboring vertices and pulls information from other vertices
28
Husky Computational Model
Visibility:
Local objects: visible to the local thread only; allow more efficient local computation when it can proceed independently or asynchronously without global synchronization
Global objects: visible across the cluster; facilitate communication among workers
Scoped communication: an object can only talk to other objects that are visible to it
Global/local objects can push messages to, or pull messages from, global objects
Global/local objects can push messages to or pull messages from local objects in the same worker
Global/local objects can broadcast messages to one or more workers (regarded as global objects); the messages are then accessible by all objects in the respective worker
29
Husky Computational Model
Object Interaction
Push/Pull: an object can push information to, or pull information from, another visible object
Migrate: an object can migrate to another thread (worker)
Dynamic Object Creation: an existing object can push a message to a not-yet-existing object; the receiving thread dynamically creates the object
30
Husky Computational Model
Example: Parameter Server
Two types of objects: Client and Server
Clients are local objects; servers are global objects
Clients pull parameters from servers and push gradients to servers
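The pattern above can be sketched as a tiny simulation (illustrative only; the class and method names here are not Husky's actual API): a server object holds a shard of the model, and clients pull parameters and push gradients.

```python
# Illustrative parameter-server sketch (not the real Husky API):
# servers hold parameter shards; clients pull parameters and push gradients.
class Server:
    def __init__(self, params):
        self.params = dict(params)   # this server's shard of the model

    def pull(self, keys):
        return {k: self.params[k] for k in keys}

    def push(self, grads, lr=0.1):
        for k, g in grads.items():
            self.params[k] -= lr * g  # apply the gradient update

class Client:
    def __init__(self, server):
        self.server = server

    def step(self, grads):
        # pull the current parameters, then push gradients back
        local = self.server.pull(grads.keys())
        self.server.push(grads)
        return local

server = Server({"w": 1.0})
client = Client(server)
client.step({"w": 0.5})        # pulls w = 1.0, pushes gradient 0.5
print(server.params["w"])      # → 0.95
```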
31
Husky Computational Model
Object interaction patterns can easily express different existing computational models: Pregel (graph analytics), Parameter Server (machine learning), MapReduce pipelines (MapReduce jobs)
32
Husky Computational Model
Consistency Level
Synchronous mode: Bulk Synchronous Parallel (BSP): compute – shuffle – compute – shuffle – …
Asynchronous mode: objects keep talking, with no order guaranteed; the synchronization barrier is removed, increasing CPU and network utilization
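The synchronous (BSP) mode can be sketched as a tiny simulation (illustrative only, not Husky code): each superstep runs a compute phase on every object, then a barrier exchanges the messages before the next superstep begins.

```python
# Minimal BSP sketch: compute phase on every object, then a barrier
# that "shuffles" the produced messages into the next superstep's inboxes.
def bsp_run(objects, compute, supersteps):
    inbox = {i: [] for i in objects}
    for _ in range(supersteps):
        outbox = {i: [] for i in objects}
        for i in objects:                          # compute phase
            for dst, msg in compute(i, inbox[i]):
                outbox[dst].append(msg)
        inbox = outbox                             # barrier + shuffle phase
    return inbox

# Example: each object forwards a hop counter around a ring of three objects.
def forward(i, msgs):
    return [((i + 1) % 3, sum(msgs) + 1)]

result = bsp_run([0, 1, 2], forward, supersteps=3)
print(result)  # → {0: [3], 1: [3], 2: [3]}
```

In the asynchronous mode there is no `inbox = outbox` barrier: messages are delivered and consumed as they arrive, which is what removes the idle time at the end of each superstep.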
33
Husky Computational Model
An example of asynchronous machine learning: NOMAD [*] (a matrix factorization algorithm)
As a native MPI program: over 2,000 lines of code; parameters need to migrate asynchronously, which is not possible in other popular ML frameworks
Implementing NOMAD on Husky: Husky supports asynchronous object migration, so NOMAD takes about 100 lines of code in the Husky API, using a customized pattern built from just the pull and migrate primitives, with performance comparable to the native MPI program!
[*] H. Yun, H. Yu, C. Hsieh, S. V. N. Vishwanathan, and I. S. Dhillon. NOMAD: non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. PVLDB, 7(11):975–986, 2014.
34
Husky: Generality and Expressiveness
Summary: a new computational model that makes Husky general and expressive Do the generality and expressiveness result in poor performance?
35
Husky: High Performance
10–100 times faster than Spark for iterative fine-grained workloads (e.g., machine learning, graph analytics)
2–10 times faster than Spark for non-iterative coarse-grained workloads (e.g., MapReduce, ETL), while using many times less memory
Better scalability than Spark, as Husky uses much less resource/time for the same jobs and has more efficient fault tolerance
36
Husky: High Performance
Much greater performance with the same hardware! The same performance with a much smaller budget!
37
Husky: High Performance
“No big deal: many other systems are also several times faster than Spark!” But consider:
Is it the same type of platform, i.e., a general-purpose platform? Does it support a user-friendly API, e.g., Python? For example, it is not fair to compare Spark with Petuum, TensorFlow, Angel, or DMTK on machine learning
Does it use hardware-specific optimization? Husky has not yet used hardware-specific optimization to boost its performance
Husky is also many times faster than Petuum, GraphLab, and other popular systems for their domain-specific workloads
38
System Implementation
Master-Workers Architecture
Master: keeps worker information and the data partitioning scheme; does not sit on the data path and does not compute; coordinates work among workers and monitors their progress
Workers: read/write data, communicate with other workers, and compute in parallel; send heartbeats to the master periodically
39
Main components: in-memory working objects; channel-based messaging subsystem; event-driven communication router; external data source connector; columnar storage subsystem; execution engine (associated with I/O, network, coordination, and compute)
40
System Implementation
Object Lists store objects
Channels define how Object Lists interact with each other
The Executor (of each worker) applies operators (in parallel) such as load, globalize, and list_execute on Object Lists
list_execute: performs a user-defined execute function on each object of an Object List in a worker
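The list_execute operator described above can be sketched as follows (a hypothetical Python sketch; the names are illustrative, not the real Husky C++ signatures): the executor applies a user-defined execute function to every object in its partition of an object list.

```python
# Hypothetical sketch of list_execute: apply a user-defined function
# to every object in this worker's partition of an object list.
def list_execute(obj_list, execute):
    for obj in obj_list:
        execute(obj)

class Vertex:
    def __init__(self, vid):
        self.vid = vid
        self.value = 0

vertices = [Vertex(i) for i in range(3)]           # this worker's partition
list_execute(vertices, lambda v: setattr(v, "value", v.vid * 2))
print([v.value for v in vertices])  # → [0, 2, 4]
```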
41
System Implementation
Global Object Layout
Consistent hashing (on object id)
Features: elastically scalable (supports dynamic machine addition/removal); fault tolerant (supports dynamic machine failover); load balancing (supports static/dynamic balancing)
Efficiency: hashing happens in all messaging operations, so Google's Jump Consistent Hash is used to ensure performance
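The Jump Consistent Hash the slide refers to is the published algorithm by Lamping and Veach (Google, 2014), transcribed here to Python: it maps a 64-bit key to one of `num_buckets` buckets, and when the bucket count grows, each key either stays put or moves to the newly added bucket.

```python
# Jump Consistent Hash (Lamping & Veach): maps a 64-bit key to a bucket
# in [0, num_buckets), with minimal key movement when buckets are added.
def jump_consistent_hash(key, num_buckets):
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) % (1 << 64)  # 64-bit LCG step
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

# When the cluster grows from 9 to 10 machines, each object either stays
# where it was or moves to the newly added machine -- never to another old one.
for key in range(100):
    before = jump_consistent_hash(key, 9)
    after = jump_consistent_hash(key, 10)
    assert after == before or after == 9
```

This property is exactly what makes dynamic machine addition/removal cheap: only the keys destined for the new bucket move.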
42
System Implementation
Object List Implementation
The system sorts the list on object id and stores the sorted list in a dense array (to exploit data locality for fast search)
Dynamically added objects are buffered in a hash map (for fast updates)
The dense array is rebuilt when the number of objects in the hash map reaches a certain threshold
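The layout above can be sketched as follows (a simplified Python sketch, not the real C++ implementation): a dense array sorted by object id for cache-friendly binary search, plus a hash map buffering newly added objects, merged into the array once the buffer reaches a threshold.

```python
import bisect

# Sketch of the object-list layout: sorted dense array + hash-map buffer.
class ObjList:
    def __init__(self, threshold=4):
        self.ids = []        # sorted object ids (dense array)
        self.objs = []       # objects aligned with self.ids
        self.buffer = {}     # recently added objects: id -> object
        self.threshold = threshold

    def add(self, oid, obj):
        self.buffer[oid] = obj
        if len(self.buffer) >= self.threshold:
            self._rebuild()

    def _rebuild(self):
        # merge the buffered objects into the sorted dense array
        merged = dict(zip(self.ids, self.objs))
        merged.update(self.buffer)
        self.ids = sorted(merged)
        self.objs = [merged[i] for i in self.ids]
        self.buffer = {}

    def find(self, oid):
        if oid in self.buffer:                       # fast path: hash lookup
            return self.buffer[oid]
        i = bisect.bisect_left(self.ids, oid)        # binary search
        if i < len(self.ids) and self.ids[i] == oid:
            return self.objs[i]
        return None

ol = ObjList(threshold=3)
for oid in [7, 3, 9, 1]:            # the third add triggers a rebuild
    ol.add(oid, f"obj{oid}")
print(ol.ids, ol.find(3), ol.find(1))  # → [3, 7, 9] obj3 obj1
```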
43
System Implementation
Attribute List Implementation
A Husky object is the smallest unit of data abstraction in Husky, and each object may have many attributes
Attributes of a “person” object: (id, gender, age, phone_number, address, occupation)
Attributes of a “graph vertex” object: (vertex_id, neighbors, degree, pr_value, cc_value, …)
44
System Implementation
Row-oriented: storing attribute lists as in a row store gives poor locality when scanning specific attributes
Column-oriented: storing attribute lists as in a column store gives better locality (and thus faster access) and more opportunities to optimize, e.g., vectorization; it also allows adding attributes without recompiling, which is useful for interactive data analysis
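The two layouts can be contrasted in a few lines (the data values are made up for illustration):

```python
# Row vs. column layout for "person"-style attributes.
rows = [("alice", 34), ("bob", 28), ("carol", 41)]      # row-oriented

columns = {                                              # column-oriented
    "name": ["alice", "bob", "carol"],
    "age":  [34, 28, 41],
}

# Scanning one attribute touches a single contiguous list in the column
# layout, instead of striding across whole rows.
avg_age = sum(columns["age"]) / len(columns["age"])

# Adding a new attribute needs no change to existing data, matching the
# "add attributes without recompiling" point above.
columns["occupation"] = ["engineer", "doctor", "farmer"]
```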
45
System Implementation
PushChannel: push messages from one ObjList to another
PushCombinedChannel: combines messages using a given combine function
PullChannel: pull messages from another ObjList
MigrateChannel: objects can migrate from one ObjList to another
BroadcastChannel: broadcast messages to all workers (workers are regarded as global objects in Husky)
46
System Implementation
More channels: a channel that supports asynchronous list_execute; a channel that supports hash combine; …
The channel concept makes streaming computation possible: event-driven computation, asynchronous stream processing
Can channels support more advanced programming paradigms?
47
An Example of a Workflow Graph
(Diagram: a workflow graph connecting InputFormat1, InputFormat2, ObjList1, ObjList2, and ObjList3 through channels ch1–ch7.)
48
System Implementation
Extensible and easy to customize
Want a new type of Channel (e.g., a channel that supports RDMA)? Inherit from BaseChannel and implement your own; built-in channels such as PushChannel and BroadcastChannel take only a small number of lines of code
Want a new type of Executor (e.g., to sample your Object List with replacement)? Write your own executor
49
System Implementation
Channel Hierarchy
BaseChannel, with the subtypes Source2ObjListChannel, ObjList2ObjListChannel, and Source2AllChannel; concrete channels include PushChannel, MigrateChannel, BroadcastChannel, PushCombinedChannel, and PullChannel
50
System Implementation
Collection Hierarchy
ChannelSource and ChannelDestination; InputFormatBase and ObjListBase; InputFormat and ObjList&lt;ObjT&gt;
51
System Implementation
Shuffle Combiner
A combiner aggregates multiple messages into a single message based on a user-specified condition, so as to reduce network traffic
Unlike the per-mapper combiner in MapReduce, the shuffle combiner works at the process level, exploiting the multi-core, shared-memory architecture: messages are prepared in each worker (thread), hashed among the local workers, then combined and sent
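The two phases can be sketched as follows (a simplified sketch: the combine function here is a sum, and the hash function is a parameter; Husky's are user-specified). Messages produced by the threads of one process are hashed among those same local threads and combined there, so only one message per key leaves the process.

```python
from collections import defaultdict

# Process-level shuffle combiner sketch: hash messages among local
# threads, combining values per key on the way.
def shuffle_combine(per_thread_msgs, num_local_threads, hash_fn):
    partitions = [defaultdict(int) for _ in range(num_local_threads)]
    for msgs in per_thread_msgs:                 # messages prepared per thread
        for key, value in msgs:
            partitions[hash_fn(key) % num_local_threads][key] += value
    return [dict(p) for p in partitions]         # combined, ready to send

per_thread = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4)]]
out = shuffle_combine(per_thread, 2, hash_fn=lambda k: ord(k[0]))
# "a" (ord 97) lands in partition 1, "b" (ord 98) in partition 0,
# each already combined into a single message.
print(out)  # → [{'b': 6}, {'a': 4}]
```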
52
System Implementation
Messaging Optimization
Combiner specialization, e.g., use radix sort for fixed-size keys (e.g., int), and use a hash-based combiner when there are few receivers
CPU cache-aware look-up: look up the receiver object with binary search, which enjoys temporal and spatial locality and is faster than hashing; performance is further improved with prefetching
53
System Implementation
Messaging Optimization
Pull communication reduction: pull “requests” are compressed with a Bloom filter, achieving almost the same performance as push
Identical pull “responses” to threads in the same process are combined and sent as a single “response”, shared by all requesting threads in that process
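The Bloom-filter compression works because a pull "request" for many object ids can be shipped as one small bit array. A minimal sketch (simplified; Husky's actual wire format and parameters may differ): there are no false negatives, and an occasional false positive only costs one unnecessary response.

```python
import hashlib

# Minimal Bloom filter: k hash positions per item, set into one bit array.
class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one integer

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

request = BloomFilter()
for oid in ("obj1", "obj7", "obj42"):    # the ids this thread wants to pull
    request.add(oid)
assert "obj7" in request                 # added ids are always found
```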
54
Performance
Improved TF-IDF (a typical coarse-grained workload) by creating a customized pattern (this is exact TF-IDF, not vectorized TF-IDF): using local objects and the pull primitive, most objects stay in place and do not need to be shuffled around
55
Performance Single-source-shortest-path (fine-grained iterative task)
56
Performance Matrix Factorization (ALS, iterative ML task)
(Netflix dataset) (YahooMusic dataset)
57
Performance Matrix Factorization (NOMAD, asynchronous SGD)
(Netflix dataset) (YahooMusic dataset)
58
Husky: User-friendliness
C/C++ API (high performance): for programmers/software developers; allows fine control for better utilization of system resources to boost performance; simple, and even more intuitive to understand and reason about than Python/Scala
Python API (low development cost): for data scientists and users without strong programming skills; backed by highly efficient C++ libraries for most core operations
Scala API (good performance and easy development): offers something in between the C/C++ API (high performance) and the Python API (low development cost)
59
Husky: User-friendliness
Supports interactive data analytics
Runs well on a laptop or in a large distributed cluster
Offers smooth connections with many existing systems, e.g.:
Hadoop ecosystem: Hive, HBase, Impala
NoSQL: MongoDB
Key-value stores: Redis
Column stores: Parquet
Messaging systems: Kafka
Log processing systems: Flume
60
A Typical Big Data Solution
Husky: enabling end-to-end big data business solutions!
Applications (smart city, finance, marketing, scientific research) sit on a user-friendly application interface; data processing (MapReduce, machine learning, graph analytics, stream processing, SQL, OLAP) runs on the Husky kernel and its APIs; data collection and data storage connect to the Hadoop ecosystem (search engines, messaging systems, key-value stores, NoSQL): anything about big data
61
Q&A
Find us at support@husky-project.com :-)
www.husky-project.com
github.com/husky-team/husky