Big Data Machine Learning using Apache Spark MLlib

Presentation transcript:

Big Data Machine Learning using Apache Spark MLlib Mehdi Assefi, Ehsun Behravesh, Guangchi Liu, and Ahmad P. Tafti

Motivation: Big Data World! Applications: healthcare informatics, genomic data analysis, text mining, stochastic modeling. Challenges: cost, time.

Major Libraries

Major Libraries: Apache Spark Streaming (enhanced situational awareness), Apache Spark SQL, Spark GraphX, Apache Spark MLlib.

Apache Spark MLlib: platform-independent, open-source libraries; distributed architecture and automatic data parallelization.
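The data-parallel pattern named on this slide can be sketched without a cluster. The following plain-Python sketch is an illustration only and does not use MLlib's API: it splits a dataset into partitions, computes a partial (sum, count) per partition, and merges the partials into a global mean. This is roughly the shape of aggregation that a distributed library spreads across workers. The names `partition`, `partial_stats`, `merge`, and `distributed_mean` are hypothetical and chosen for this example.

```python
from functools import reduce
from typing import List, Sequence, Tuple


def partition(data: Sequence[float], num_partitions: int) -> List[List[float]]:
    """Split data into num_partitions chunks by round-robin assignment."""
    return [list(data[i::num_partitions]) for i in range(num_partitions)]


def partial_stats(chunk: Sequence[float]) -> Tuple[float, int]:
    """Per-partition work: return (sum, count) for one chunk."""
    return (sum(chunk), len(chunk))


def merge(a: Tuple[float, int], b: Tuple[float, int]) -> Tuple[float, int]:
    """Combine two partial results; associative, so merge order does not matter."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(data: Sequence[float], num_partitions: int = 4) -> float:
    """Mean computed from per-partition partials, mimicking a map step and a reduce step."""
    chunks = partition(data, num_partitions)
    # On a real cluster, each chunk would be processed by a separate worker.
    partials = map(partial_stats, chunks)
    total, count = reduce(merge, partials, (0.0, 0))
    if count == 0:
        raise ValueError("cannot take the mean of an empty dataset")
    return total / count


if __name__ == "__main__":
    print(distributed_mean([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], num_partitions=3))
```

Because `merge` is associative, the partial results can be combined in any order, which is what allows this kind of aggregation to be parallelized without the caller managing the data layout.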

Functions: regression, dimension reduction, classification, clustering, rule extraction.

Pathway

Experimental Evaluation: datasets, VMware cluster environment, machine learning algorithms.

Results

Conclusion

Questions?