An overview and comparison of free Python libraries for data mining and big data analysis Igor Stančin, Alan Jović E-mail to: {igor.stancin, alan.jovic}@fer.hr University of Zagreb Faculty of Electrical Engineering and Computing, Zagreb, Croatia

Content
- Motivation & goal
- Core libraries
- Data preparation
- Data visualization
- Machine learning
- Deep learning
- Big data
- Conclusion

Motivation & goal
- Python’s usage has grown massively: why?
- Many open-source libraries and tools exist: 20+ are examined
- Many options and algorithms for machine learning and deep learning: compare them and use the most appropriate

Motivation & goal
[Figures: KDnuggets 2013 poll; KDnuggets 2018 poll]

Libraries popularity

Library             Stars    Forked   Contributors  Activity
NumPy               9621     3318     726           28 (103)
SciPy               5418     2690     685           21 (101)
Cython              3833     799      275           10 (85)
pandas              18134    7233     1407          65 (217)
PyTables            801      164      60            0 (0)
h5py                1042     288      98            3 (6)
Tabel               11       1                      1 (1)
Matplotlib          8688     3966     787           20 (218)
seaborn             5722     905      87
Plotly              4569     1068     68            5 (38)
Bokeh               8969     2398     346           11 (52)
ggplot              3429     539      13
scikit-learn        33337    16358    1253          38 (94)
mlpy                5        2
Shogun              2312     891      153           8 (57)
mlxtend             2033     475      46            3 (17)
TensorFlow          120547   72008    1834          194 (1888)
Keras               38196    14584    773           20 (53)
PyTorch             24781    5878     934           152 (913)
Caffe               27016    16335    267
Caffe2              8407     2130     196
mrjob               2367     570      82            3 (143)
Dumbo               1037     161      6
Hadoopy             245      62       3
Pydoop              168      53                     1 (18)
Spark (PySpark)     20576    18057    1330          78 (246)
Hadoop (Streaming)  8567     5360     155           58 (456)

Core libraries
- NumPy: highly efficient vectorized computing
- SciPy: implementations of algorithms for scientific computing, relying on the Netlib repository
- Cython: calls C functions from Python and allows C types for variables, which accelerates calculations
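As a rough illustration of what "vectorized computing" means for NumPy, the sketch below computes a sum of squares twice: once with an interpreted Python loop and once with a single NumPy call. The function names and the test data are hypothetical and chosen only for this example.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Pure-Python baseline: one interpreted iteration per element."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorized(values):
    """Vectorized version: the element loop runs inside NumPy's compiled code."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))


# Hypothetical input: the integers 1..1000 as floats.
data = np.arange(1, 1001, dtype=np.float64)
loop_result = sum_of_squares_loop(data)
vector_result = sum_of_squares_vectorized(data)
```

Both functions return the same value; on large arrays the vectorized form is typically much faster because no Python-level iteration takes place.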

Data preparation
- Data preprocessing and data manipulation (wrangling)
- pandas dominates the field:
  - Wide range of data I/O handling
  - Data transformations and cleaning (DataFrame)
  - Statistical calculations (EDA)
  - Basic visualizations (EDA)
- Competition: PyTables and h5py, which support only the HDF5 data format and are suitable for large and heterogeneous datasets
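The pandas tasks listed above (I/O, cleaning, statistics) can be sketched in a few lines. The inline CSV, its column names and the imputation choice (median) are assumptions made for this example, not something taken from the presentation.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real data file.
raw_csv = io.StringIO(
    "patient,age,heart_rate\n"
    "a,34,72\n"
    "b,,88\n"
    "c,51,\n"
    "a,34,72\n"
)

df = pd.read_csv(raw_csv)                          # data I/O
df = df.drop_duplicates()                          # cleaning: drop the repeated row
df["age"] = df["age"].fillna(df["age"].median())   # impute the missing age
df = df.dropna(subset=["heart_rate"])              # drop rows without a measurement

# Exploratory statistics (count, mean, std, quartiles) for the numeric columns.
summary = df[["age", "heart_rate"]].describe()
```

Calling `df.plot()` on the cleaned frame would cover the "basic visualizations" item, provided Matplotlib is installed.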

Data visualization
- High competition in this field
- Ranked by the number of easily accessible functionalities:
  - Plotly: the most powerful library in the data visualization field; its main flaw is a relatively unintuitive syntax; can be integrated into web pages via Dash
  - seaborn: built on top of Matplotlib, many graph types, easy to learn for beginners
  - Matplotlib: Python implementation of MATLAB-like plots, low level, many options for customization
- Other: Bokeh (for interactive plots in web pages), ggplot

Machine learning
- scikit-learn dominates the field
- Pros:
  - Implementation of many machine learning algorithms (classifiers, regressors, clustering methods)
  - Supports feature selection and dimensionality reduction
  - Variety of evaluation metrics for all types of analyses
- Cons:
  - Lacks many standard decision tree and inductive rule implementations
  - Lacks association rule mining implementations
  - Lacks some other interesting algorithms (e.g. rotation forest, full Bayesian network, stacking classifiers, fuzzy c-means clustering)
- Competition: Shogun (not as many algorithms as scikit-learn, but has different tree learners) and mlxtend (the fewest algorithms, but has association rules)
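A minimal sketch of the three scikit-learn strengths named above (a classifier, feature selection, an evaluation metric) combined in one pipeline. The dataset (Iris), the choice of `k=2` features and the split parameters are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset, so nothing has to be downloaded.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Feature selection followed by a classifier, fitted as a single estimator.
model = make_pipeline(
    SelectKBest(score_func=f_classif, k=2),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X_train, y_train)

# Evaluation on the held-out split.
accuracy = accuracy_score(y_test, model.predict(X_test))
```

Wrapping the selector and the classifier in one pipeline keeps the feature selection fitted on the training split only, which avoids leaking test data into the model.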

Deep learning
- Very popular in Python, with high competition
- TensorFlow, Keras and PyTorch are currently the most popular libraries (Caffe/Caffe2, Theano and others less so)
- TensorFlow (Google): low level, detailed, supports most options
- Keras: high-level ANN API built on top of TensorFlow and other libraries; easy to learn, runs seamlessly on CPU and GPU, somewhat fewer features than TensorFlow
- PyTorch (Facebook): runs code in a more procedural fashion, unlike TensorFlow, where one first needs to design the whole model and then run it within a Session; easy to learn and debug, with a number of features comparable to TensorFlow
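The difference between designing the whole model first and running code procedurally can be shown with a toy in plain Python. This sketch uses neither TensorFlow nor PyTorch; `Node`, `placeholder`, `run` and `eager_model` are hypothetical names that only mimic the two execution styles.

```python
# Toy illustration only: no deep learning library is involved.


class Node:
    """A deferred operation: stores what to compute, not the result."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = inputs
        self.name = name


def placeholder(name):
    """An input whose value is supplied only when the graph is run."""
    return Node(op=None, name=name)


def add(a, b):
    return Node(lambda x, y: x + y, (a, b))


def mul(a, b):
    return Node(lambda x, y: x * y, (a, b))


def run(node, feed):
    """Evaluate the graph recursively, similar in spirit to a session run."""
    if node.op is None:
        return feed[node.name]
    return node.op(*(run(i, feed) for i in node.inputs))


# Define-then-run: build the whole expression first, evaluate afterwards.
x = placeholder("x")
w = placeholder("w")
graph = add(mul(x, w), x)  # nothing is computed on this line
deferred_result = run(graph, {"x": 3.0, "w": 2.0})


# Define-by-run: every statement computes a value immediately.
def eager_model(x_value, w_value):
    hidden = x_value * w_value  # can be inspected right away while debugging
    return hidden + x_value


eager_result = eager_model(3.0, 2.0)
```

Both styles produce the same number; the practical difference is that intermediate values such as `hidden` exist as ordinary Python objects in the second style, which is one reason the procedural style is often described as easier to debug.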

Big data
- Not specifically designed for Python, but most big data tools support Python (R, Java and Scala are equally popular here)
- Two most popular:
  - PySpark (Python specific) for Spark; may use Spark's internal MLlib for machine learning
  - Hadoop Streaming (any language) for Hadoop MapReduce
- Several Python libraries for running Hadoop:
  - mrjob: multi-step MapReduce jobs in pure Python, good documentation, does not support complex tasks, a bit slow
  - Dumbo: has advanced functionalities, sparse documentation, wrapper around Hadoop Streaming, not maintained
  - Hadoopy: similar to Dumbo, better documentation, not maintained
  - Pydoop: wrapper around Hadoop Pipes (C++ API for Hadoop)
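Hadoop Streaming exchanges data with the mapper and reducer as lines of text, conventionally tab-separated key/value pairs, and sorts the mapper output by key before the reducer sees it. A minimal word-count sketch under those assumptions follows; the function names and the in-process `run_locally` helper are hypothetical and stand in for a real cluster run.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive records that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort (shuffle) -> reduce inside one process."""
    return list(reducer(sorted(mapper(lines))))


word_counts = run_locally(["Spark and Hadoop", "hadoop streaming"])
```

In an actual streaming job, each function would live in its own script that reads `sys.stdin` and prints its records; libraries such as mrjob wrap this pattern so that multi-step jobs can be written as Python classes.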

Conclusion
- Recommended Python stack for data mining / data science:
  - Core: NumPy, SciPy, Cython
  - Data preparation: pandas
  - Visualization: Plotly, seaborn or Matplotlib
  - Machine learning: scikit-learn
  - Deep learning: TensorFlow, Keras, PyTorch
  - Big data: Spark, Hadoop Streaming
- Community support is vital for the survival of Python open-source libraries, especially in a fast-evolving area such as data science

Thank you! Questions?