Data Mining Workbenches: an overview & comparison focusing on open-source packages. CS240B notes by C. Zaniolo.



Comparing KDD/DM Toolsets
• Many packages, but very few in-depth comparisons
  - An evaluation by the USDA Forest Service comparing R, WEKA, Orange, and SAS®
• Several user-satisfaction/popularity surveys
  - KDnuggets
  - Rexer Analytics Survey (annual)

An Evaluation of CART Programs by the USDA Forest Service (USFS)
• USFS uses classification and regression tree (CART) technology to map USFS Forest Inventory and Analysis (FIA) biomass, forest type, forest type groups, and National Forest vegetation.
• The results of the study were reported by B. Ruefenacht, G. Liknes, A. J. Lister, H. Fisk, and Dan Wendt, “Evaluation of Open Source Data Mining Software Packages”, Proc. Symposium on Forest Inventory and Analysis (FIA), October 2008, Park City, UT.
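To make the CART idea concrete, here is a minimal sketch of how a classification tree might pick a single split by minimizing weighted Gini impurity. It is not the USFS workflow or any package's implementation; the function names and the toy forest data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split.

    rows:   list of equal-length numeric feature lists
    labels: list of class labels, one per row
    If no split improves on the unsplit node, feature_index and threshold are None.
    """
    best = (None, None, gini(labels))  # baseline: impurity without splitting
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, t, score)
    return best


# Hypothetical toy data: [canopy_height_m, elevation_m] -> forest type.
rows = [[5.0, 300], [6.0, 1200], [20.0, 350], [22.0, 1100]]
labels = ["shrub", "shrub", "conifer", "conifer"]
feature, threshold, impurity = best_split(rows, labels)
```

A full CART program would apply this split search recursively to each child node and then prune the tree; the sketch stops at one split to keep the criterion visible.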

R
• Created at the University of Auckland, NZ, in 1993.
• Released under the GNU General Public License (GPL).
• An extension of the S language (Bell Labs).
• Twelve packages are supplied with the basic R distribution, each including many functions.
• 1,364 additional packages are available, extending the basic R functionality.

WEKA: Waikato Environment for Knowledge Analysis
• Developed by the University of Waikato, New Zealand, which supports the software with funds from the NZ government.
• Started in 1993; distributed as a GPL package.
• WEKA is a collection of machine-learning algorithms implemented in Java, plus data preprocessing tools, visualization tools, and interface tools (R, SQL).

Orange
• Developed by the University of Ljubljana, Slovenia, in 2004, under the GPL.
• Still evolving: frequent new releases.
• Main routines and libraries are written in C++, but Python is used to call the routines and access the libraries.
• Users can add their own machine-learning algorithms using both the scripting and GUI environments.
• Orange also has a GUI version called Orange Canvas, which allows for interactive machine-learning “visual programming”.

SAS® (Statistical Analysis System)
• Developed by Jim Goodnight and associates at North Carolina State University in the early 1970s.
• In 1976 the SAS Institute was founded to distribute and further develop the increasingly popular software.
• SAS® currently has 10,658 employees and is the largest privately held software company, with annual revenue of $2.15 billion (in 2007).
• SAS® is used in 109 countries, across different industries, at 44,000 customer sites worldwide.
• SAS® is purchased by contacting a distributor directly; it can cost several thousand dollars depending on the options.
• The purchase includes the software, technical support, and licenses, which are renewed regularly, incurring more costs.

Evaluation Criteria
• Cost
• Usability:
  - How easy is the interface to use and understand?
  - Is a variety of models and options available?
  - How easy to use is the software’s programming language?
  - Does the software integrate easily with other programs?
• Performance w.r.t.:
  - speed,
  - stability, and
  - accuracy.
• Critical mass: how widespread is the software?
• Uniqueness of useful features and algorithms
• Defensibility w.r.t. citations and academic repute

Usability
• SAS®: the Enterprise Guide for SAS® has a user-friendly GUI that allows graphical models to be built.
  - GUIs also exist for other SAS® modules, but unlike WEKA and Orange there is no universal GUI for SAS®.
  - SAS® is primarily driven by its own programming language, so a new user will require some training.
• R, like SAS®, is used by numerous industries and thus has a wide variety of models and options.
  - R is driven by its own scripting language, which requires some training and/or experience.
  - GUIs exist for specific functions only.

Usability (Cont.)
• WEKA has a comprehensive GUI with many models and options available.
  - WEKA’s GUI is easy to use, but users need a good understanding of modeling techniques.
  - Familiarity with Java is needed to extend WEKA and link it with other software programs.
  - WEKA can be extended and used within R.
• Orange: open-source data visualization and analysis for novices and experts; data mining through visual programming or Python scripting (Orange website).
  - The Orange website explains well how to integrate Orange with Python.
  - The number of models and options available in Orange lags behind not only SAS® and R but WEKA as well.

Performance notes
• R is significantly faster than WEKA and Orange on classification trees.
• Orange is the least stable, although new versions are released monthly.
• WEKA is a stable program, but it does not work well with large datasets.
  - The WEKA team recently introduced MOA to process massive data sets in a stream-like mode.

Evaluation Results

Most Popular Data Mining Software
• The Rexer Analytics Survey (early 2007) asked about the tools used often and occasionally.
• Clearly more popular than the rest were:
  - SPSS or SPSS Clementine
  - "Own Code"
  - SAS or SAS Enterprise Miner
• Followed by:
  - R
  - Weka
  - C4.5 / C5.0

Critical Mass and Popularity
• Top ten most used packages according to the KDnuggets survey (May 2007):
  - SPSS / SPSS Clementine
  - Salford Systems CART/MARS/TreeNet/RF
  - Yale (now RapidMiner)
  - SAS / SAS Enterprise Miner
  - Angoss Knowledge Studio / Knowledge Seeker
  - KXEN
  - Weka
  - R
  - Microsoft SQL Server ??
  - MATLAB ??
• Note: Microsoft Excel is omitted as it's not really "data mining" software, and I've merged the tools offered by a single vendor (SPSS and SAS).
• You can see the full survey results.

Comments
• Gregory Piatetsky-Shapiro, KDnuggets Editor: votes from tool vendors were removed.
• Compared with the 2008 KDnuggets poll on data mining tools/software used, the big changes are growth in SPSS, RapidMiner, and R.

Popular Data Mining Software (cont.)
• The Rexer Analytics Survey is taken every year, and the summary report can be obtained free of charge.
• SURVEY HIGHLIGHTS:
  - Open-source tools Weka and R made substantial movement up data miners’ tool rankings this year, and are now used by large numbers of both academic and for-profit data miners.
  - SAS Enterprise Miner dropped in data miners’ tool rankings.
• 2010 SURVEY HIGHLIGHTS:
  - R: after a steady rise across the past few years, R overtook other tools to become the tool used by more data miners (43%) than any other.
  - STATISTICA has also been climbing in the rankings.
  - STATISTICA, IBM SPSS Modeler, and R received the strongest satisfaction ratings in both 2010 and ...


Selected References
• Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann.
• Bouckaert, R. R. et al. WEKA Manual for Version 3.6.0.
• Demsar, J.; Zupan, B.; Leban, G. “Orange: From experimental machine learning to interactive data mining”.
• R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, 2008.

About Weka
• Compared to R, WEKA is weaker in classical statistics but stronger in machine-learning (data mining) algorithms.
• WEKA has developed a set of extensions covering diverse areas, such as text mining, visualization, and bioinformatics.
• WEKA 3.6 includes support for importing PMML (Predictive Model Markup Language) models. PMML is an XML-based standard for expressing statistical and data mining models.
• WEKA can interface with many systems and formats: SQL, LibSVM, SVM-Light, ...
• WEKA has two limitations:
  - The Java implementation is somewhat slower than an equivalent in C/C++.
  - Most of the algorithms require all the data to be stored in main memory, which restricts applications to small or medium-sized datasets.
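The main-memory limitation is easiest to see by contrast with a one-pass computation. The sketch below uses Welford's method to compute a mean and variance while holding only three numbers in memory, however long the input is. It does not show how WEKA or MOA is implemented; the class name and the synthetic `readings` generator are hypothetical stand-ins for a dataset too large to load at once.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's method), O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def readings(count=100_000):
    """Hypothetical data source that yields values one at a time."""
    for i in range(1, count + 1):
        yield float(i % 10)


stats = RunningStats()
for value in readings():
    stats.update(value)  # each value is seen once and then discarded
```

A batch algorithm would first collect all values in a list; here memory use stays constant, which is the property stream-oriented tools rely on.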

MOA: Massive Online Analysis
• MOA is a software environment for testing algorithms and running experiments for online learning from evolving data streams.
• MOA supports bi-directional interaction with WEKA.
  - The aim is to scale up implementations of state-of-the-art algorithms to real-world dataset sizes in a streaming setting.
• A data stream management system (DSMS) will then be required to deploy these algorithms on actual data streams; MOA is not a DSMS.
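Experiments on data streams are commonly scored with prequential (interleaved test-then-train) evaluation: each example is first used to test the current model and only then to update it. The following is a minimal sketch of that loop, not MOA's API; the toy learner, the synthetic stream, and all names are hypothetical.

```python
import random
from collections import defaultdict


class NearestMeanLearner:
    """Toy incremental classifier: predicts the class whose running mean is closest."""

    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def predict(self, x):
        if not self.count:
            return None  # nothing learned yet
        return min(self.count, key=lambda c: abs(x - self.mean[c]))

    def learn(self, x, y):
        """Update the running mean of class y with one new value."""
        self.count[y] += 1
        self.mean[y] += (x - self.mean[y]) / self.count[y]


def stream(n, seed=0):
    """Hypothetical stream: class 'a' is centred at 0.0, class 'b' at 3.0."""
    rng = random.Random(seed)
    for _ in range(n):
        y = rng.choice("ab")
        yield rng.gauss(0.0 if y == "a" else 3.0, 1.0), y


def prequential_accuracy(learner, examples):
    """Test on each example first, then train on it; return overall accuracy."""
    correct = total = 0
    for x, y in examples:
        correct += learner.predict(x) == y  # test before the label is used
        learner.learn(x, y)                 # then train on the same example
        total += 1
    return correct / total if total else 0.0


accuracy = prequential_accuracy(NearestMeanLearner(), stream(5000))
```

Because every example is tested before it is learned, no separate hold-out set is needed, and the model never has to store past examples.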

• Downloads are available under the GNU GPL license.
• Several data sets are used:
  - SEA Concepts Generator: artificial dataset with abrupt concept drift
  - STAGGER Concepts Generator by Schlimmer and Granger
  - Rotating Hyperplane: used as a testbed for CVFDT versus VFDT
  - Random RBF Generator
  - Waveform Generator
  - Function Generator: introduced by Agrawal et al.
• MOA currently supports classification and clustering methods.
• The system is easily extensible and has a nice GUI.
• Good documentation:
  - Albert Bifet, G. Holmes, R. Kirkby & B. Pfahringer: Data Stream Mining: A Practical Approach. May.
  - Albert Bifet et al.: MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering (2010).
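As an illustration of what "abrupt concept drift" means for the SEA Concepts Generator listed above, here is a minimal sketch of such a generator. It assumes the commonly described setup: three attributes drawn uniformly from [0, 10], a label that depends only on whether the first two sum to at most a threshold, and a threshold that changes from block to block. The threshold values, the function name, and the parameters are assumptions made for this example; this is not MOA's generator code.

```python
import random

# Assumed thresholds for the four SEA concepts; treat these as illustrative.
THRESHOLDS = (8.0, 9.0, 7.0, 9.5)


def sea_stream(block_size, seed=0, noise=0.0):
    """Yield (features, label, concept_index), drifting abruptly every block_size examples."""
    rng = random.Random(seed)
    for concept, theta in enumerate(THRESHOLDS):
        for _ in range(block_size):
            # Three attributes in [0, 10]; the third one is irrelevant to the label.
            x = [rng.uniform(0.0, 10.0) for _ in range(3)]
            label = int(x[0] + x[1] <= theta)
            if rng.random() < noise:
                label = 1 - label  # optional class noise
            yield x, label, concept


examples = list(sea_stream(block_size=1000))
```

A learner trained on the first block sees its decision boundary become wrong the moment the second block starts, which is what makes this kind of stream a useful test of drift handling.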