Machine Learning Library for Apache Ignite

Presentation transcript:

Ignite-ML: Machine Learning Library for Apache Ignite
Corey Pentasuglia
Master's Project, 5/11/2016
Examiners: Dr. Scott Spetka, Dr. Bruno Andriamanalimanana, Dr. Roger Cavallo

Master's Project Objectives
- Research DML (Distributed Machine Learning)
  - Preparation for doctoral studies
  - Compare current DML frameworks
- Develop a library built over Apache Ignite for Machine Learning
  - Currently, nothing is available to perform Machine Learning in the Ignite framework
  - Spark (a sister project of Ignite) provides MLlib
- Compare Ignite & Spark
  - Apache Ignite might be a better framework for DML (especially in a practical sense)
- Develop Ignite-ML as a practical library for attempting ML on transactional data as well as analytical processing (many others focus only on analytics)
  - Develop the idea of TDML (Transactional DML)
- Comparison Paper
  - Summarize findings in a short paper

What is Apache Ignite?
- In-Memory Data Fabric
- An open source Apache project that began in the Apache Incubator
- Started and still mostly maintained by a company named GridGain
- Ignite contains several key components for high performance computing within a distributed architecture

Compute Grid
- Designed for high performance, low latency, and scalability
- Availability: jobs will execute as long as there is at least one node
- Failover: a load balancer is included to re-orchestrate jobs that have failed

Compute Grid (Key Benefits)
- Fault Tolerance: if a node fails, jobs will automatically be transferred over to another node (if available)
- Load Balancing: automatic load balancing will occur to allow an efficient distribution of work among the available nodes
- Job Scheduling: priority can be set for tasks that run on the grid; by default, however, tasks will be worked off in random order
- Direct MapReduce API
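The fault tolerance and load balancing behavior described on this slide can be illustrated with a small simulation. This is only a sketch in plain Python, not Ignite's API: `make_node`, `split`, and `run_with_failover` are hypothetical names, nodes are simulated as local functions, and the round-robin assignment is an assumed policy chosen for simplicity.

```python
from typing import Callable, List, Sequence


class NodeDown(Exception):
    """Raised by a simulated node that has failed."""


def make_node(name: str, healthy: bool = True) -> Callable[[Sequence[int]], int]:
    """Return a hypothetical worker that sums a chunk, or fails if unhealthy."""
    def run(chunk: Sequence[int]) -> int:
        if not healthy:
            raise NodeDown(name)
        return sum(chunk)
    return run


def split(data: Sequence[int], parts: int) -> List[Sequence[int]]:
    """Split data into `parts` nearly equal chunks (the inputs to the map step)."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def run_with_failover(chunks, nodes) -> List[int]:
    """Assign chunks round-robin; if a node fails, retry the chunk on the next node."""
    results = []
    for i, chunk in enumerate(chunks):
        for attempt in range(len(nodes)):
            node = nodes[(i + attempt) % len(nodes)]
            try:
                results.append(node(chunk))
                break
            except NodeDown:
                continue  # fail over to the next node
        else:
            raise RuntimeError("no healthy node available")
    return results


# Node "b" is down, so its chunk is transferred to node "c".
nodes = [make_node("a"), make_node("b", healthy=False), make_node("c")]
chunks = split(list(range(1, 101)), 4)
total = sum(run_with_failover(chunks, nodes))  # the reduce step
```

As long as at least one simulated node is healthy, every chunk is eventually processed, which mirrors the availability claim made for the compute grid.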

Ignite vs. Spark

OLTP vs. OLAP

Ignite vs. Spark (Cont.)

Apache Ignite (Hybrid OLTP & OLAP)
- In-Memory
  - Treats memory as primary storage
  - Better indexing
  - Avoids (de)serialization
  - Reduced latency
- RDD (Resilient Distributed Datasets)
- Real streaming (no delays)
- Utilizes off-heap memory
  - Avoids garbage collection pauses
- In-Memory SQL indexes
  - Avoids full scans of datasets
- Map Reduce
  - Fully compatible with Hadoop MR APIs
  - Support for legacy MR code

Spark (OLAP)
- In-Memory
  - Used only for processing
- RDD (Resilient Distributed Datasets)
  - Created beforehand
  - Is immutable
- Language Support
  - Scala, Java, Python, and R (Ignite: Scala and Java)

Grid Configuration
- The grid runs on four selected lab machines
- More machines could easily be added; however, I have been utilizing these four lab machines

Project Background
- Initially just developed some code to perform KNN in Apache Ignite
- JavaML & Apache Ignite
  - Utilized JavaML to perform KNN on the compute grid
  - Determined that an extensible library would be more useful to others
- Switched to Weka and Apache Ignite
  - Weka is a more widely adopted ML library
  - Developed the start of an extensible architecture
  - Allows others to plug in additional ML algorithms
  - Attempts to auto-scale based upon cluster size

Ignite-ML
- Ignite-ML is my own project built on top of the Apache Ignite In-Memory data fabric (https://github.com/pentasc/ignite-compute-grid/tree/master/ignite-ml)
- The library consists of API and Executor submodules
- The idea is that the library provides an extensible entry point for plugging Machine Learning algorithms into Ignite
- The API contains custom-defined exceptions, request objects, response objects, and handlers
- Many other distributed ML frameworks focus on data analysis after data has been stored
- With Ignite being a hybrid OLTP/OLAP platform, I'd like to focus on performing ML algorithms with transactional data

Ignite-ML Use Case
[Diagram: incoming data flows through Ignite-ML at the OLTP stage, then on to Ignite-ML at the OLAP stages, with feedback between them; HDFS and other storage back the pipeline]
- Normalize and classify incoming data (signals) using supervised learning and give feedback
- Possible feedback from unsupervised learning
- Suppose we currently know of two classes, say class 1 and class 2, but we receive data we cannot classify with confidence. We can start to perform unsupervised learning, which may eventually lead to a new class.
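The use case of classifying with confidence, and setting aside data that cannot be classified, can be sketched as follows. This is an illustration under stated assumptions, not the Ignite-ML implementation: the nearest-centroid classifier, the distance-ratio confidence score, the 0.75 threshold, and the names `classify_with_confidence` and `route` are all hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, ...]


def classify_with_confidence(point: Point, centroids: Dict[int, Point]) -> Tuple[int, float]:
    """Return (label, confidence) based on distances to the known class centroids.

    Confidence is 1 - d1 / (d1 + d2), where d1 and d2 are the distances to the
    nearest and second-nearest centroids (an assumed, simple scoring rule).
    """
    dists = sorted((math.dist(point, c), label) for label, c in centroids.items())
    (d1, best), (d2, _) = dists[0], dists[1]
    confidence = 1.0 - d1 / (d1 + d2) if (d1 + d2) > 0 else 1.0
    return best, confidence


def route(point: Point, centroids: Dict[int, Point],
          threshold: float, unknown_pool: List[Point]) -> Optional[int]:
    """Return the class label, or queue the point for unsupervised learning."""
    label, confidence = classify_with_confidence(point, centroids)
    if confidence >= threshold:
        return label
    unknown_pool.append(point)  # candidates for a possible new class
    return None


# Two known classes, as in the slide's example.
centroids = {1: (0.0, 0.0), 2: (10.0, 0.0)}
pool: List[Point] = []
confident = route((1.0, 0.0), centroids, 0.75, pool)  # close to class 1
uncertain = route((5.0, 8.0), centroids, 0.75, pool)  # equally far from both
```

Points that accumulate in `pool` would be the input to a clustering step, which is where a new class might eventually emerge.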

MLlib K-means Clustering

Ignite-ML KNN Classification
- NOTE: The K-means and KNN algorithms are very different (K-means is unsupervised clustering, KNN is supervised classification). These slides are only meant to portray the difference in syntax and semantics between the two libraries.
- This is an example of my KNN app that utilizes my Ignite-ML library (available on GitHub)
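For readers without access to the slide's code image, the general shape of a partitioned KNN classification can be sketched in plain Python. This is not the Ignite-ML or Weka API: `local_knn` and `knn_classify` are hypothetical names, and the partitions are simulated as in-memory lists standing in for data held on separate grid nodes.

```python
import heapq
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Tuple[float, ...], str]  # (feature vector, class label)


def local_knn(partition: Sequence[Sample], query: Tuple[float, ...], k: int):
    """Return the k nearest (distance, label) pairs within one partition."""
    return heapq.nsmallest(k, ((math.dist(x, query), y) for x, y in partition))


def knn_classify(partitions: List[Sequence[Sample]], query: Tuple[float, ...], k: int) -> str:
    """Merge each partition's local candidates, then take a majority vote."""
    candidates = [pair for part in partitions for pair in local_knn(part, query, k)]
    nearest = heapq.nsmallest(k, candidates)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Each inner list stands in for the data held by one node.
partitions = [
    [((0.0, 0.0), "a"), ((1.0, 0.0), "a")],
    [((5.0, 5.0), "b"), ((6.0, 5.0), "b"), ((0.0, 1.0), "a")],
]
near_a = knn_classify(partitions, (0.5, 0.5), k=3)
near_b = knn_classify(partitions, (5.5, 5.0), k=3)
```

Because every partition returns its own k best candidates, the global k nearest neighbors are always among the merged candidates, so the result matches a single-machine KNN on the combined data.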

Ignite-ML Extensions
[Diagram: an App sits on top of Ignite-ML (default and custom extensions), which sits on top of the Ignite Framework]
- Register custom or new machine learning algorithms by adding request, response, and handler classes
- Ignite-ML-API (default): Requests, Responses, Exceptions, Handlers
- Ignite-ML-Executor (default): Executors, Executor Interfaces
- Custom: Requests, Responses, Handlers
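The extension mechanism, in which a new algorithm is added by supplying a request, a response, and a handler, follows a common registry pattern. A minimal sketch of that pattern is shown below; it is not the Ignite-ML code, and `HandlerRegistry`, `MeanRequest`, `MeanResponse`, and `UnknownRequestError` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class MeanRequest:
    """Hypothetical custom request: compute the mean of some values."""
    values: List[float]


@dataclass
class MeanResponse:
    """Hypothetical custom response carrying the result."""
    mean: float


class UnknownRequestError(Exception):
    """Raised when no handler is registered for a request type."""


class HandlerRegistry:
    """Maps request types to the handlers that execute them."""

    def __init__(self) -> None:
        self._handlers: Dict[type, Callable[[Any], Any]] = {}

    def register(self, request_type: type, handler: Callable[[Any], Any]) -> None:
        self._handlers[request_type] = handler

    def execute(self, request: Any) -> Any:
        handler = self._handlers.get(type(request))
        if handler is None:
            raise UnknownRequestError(type(request).__name__)
        return handler(request)


def handle_mean(request: MeanRequest) -> MeanResponse:
    """Handler for MeanRequest; a real handler would dispatch work to the grid."""
    return MeanResponse(sum(request.values) / len(request.values))


registry = HandlerRegistry()
registry.register(MeanRequest, handle_mean)
response = registry.execute(MeanRequest([1.0, 2.0, 3.0]))
```

Keeping requests, responses, and handlers as separate classes means a new algorithm can be plugged in without modifying the registry itself.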

Further Work
- Continue to develop Ignite-ML
  - Hopefully get more integrated with the Apache Ignite project (become a contributor)
  - Lay foundation for doctoral studies
- Explore the idea of TDML (Transactional Distributed Machine Learning)
- Needs some way of providing additional configuration
- Integrate more of the caching features of Ignite
- Better plan for the integration of supervised and unsupervised learning

Community

Citation
- https://ignite.apache.org/
- http://gotocon.com/dl/goto-chicago-2015/slides/DmitriySetrakyan_BeyondTheDataGridFastDataProcessingWithApacheIgniteincubating.pdf
- http://datawarehouse4u.info/OLTP-vs-OLAP.html
- http://www.infoq.com/articles/gridgain-apache-ignite
- http://www.odbms.org/blog/2015/02/interview-nikita-ivanov/
- https://spark.apache.org/docs/1.1.1/mllib-clustering.html