Big Data Machine Learning using Apache Spark MLlib

Presentation transcript:

Big Data Machine Learning using Apache Spark MLlib. Mehdi Assefi, Ehsun Behravesh, Guangchi Liu, and Ahmad P. Tafti

Motivation: Big Data World! Applications: healthcare informatics, genomic data analysis, text mining, stochastic modeling. Challenges: cost, time.

Major Libraries

Major Libraries: Apache Spark Streaming (enhanced situational awareness), Apache Spark SQL, Spark GraphX, Apache Spark MLlib

Apache Spark MLlib: platform-independent, open-source libraries; distributed architecture and automatic data parallelization
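
The data parallelization named on this slide can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark or MLlib code: it assumes the common pattern in which the data is split into partitions, each partition computes a small summary, and the summaries are merged into the final result. All function names here (partition, partial_stats, merge, parallel_mean_variance) are hypothetical and chosen only for illustration.

```python
from functools import reduce


def partition(data, num_partitions):
    """Split a list into chunks that stand in for partitions on a cluster."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def partial_stats(chunk):
    """Per-partition work: count, sum, and sum of squares of one chunk."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def merge(a, b):
    """Combine two partial summaries; the operation is associative."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def parallel_mean_variance(data, num_partitions=4):
    """Mean and population variance computed from per-partition summaries."""
    if not data:
        raise ValueError("data must not be empty")
    chunks = partition(data, num_partitions)
    # map() runs sequentially here; on a cluster, each chunk would be
    # summarized by a different worker before the summaries are merged.
    n, s, ss = reduce(merge, map(partial_stats, chunks), (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance


if __name__ == "__main__":
    print(parallel_mean_variance([1.0, 2.0, 3.0, 4.0, 5.0], num_partitions=3))
```

Because merge is associative, the partitions can be summarized in any order and on any number of workers. That property is what lets a framework distribute this kind of computation automatically.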

Functions: regression, dimension reduction, classification, clustering, rule extraction

Pathway

Experimental Evaluation: datasets, VMware cluster environment, machine learning algorithms

Results

Conclusion

Questions?