1 Data Mining Workbenches: an overview & comparison focusing on open-source packages. CS240B notes by C. Zaniolo.



2 Comparing KDD/DM Toolsets
There are many packages and very few in-depth comparisons:
- An evaluation by the USDA Forest Service comparing R, WEKA, Orange, and SAS®
Several user-satisfaction/popularity surveys:
- KDnuggets
- Rexer Analytics Survey (annual)

3 An Evaluation of CART Programs by the USDA Forest Service (USFS)
USFS uses classification and regression tree (CART) technology to map USFS Forest Inventory and Analysis (FIA) biomass, forest type, forest type groups, and National Forest vegetation.
The results of the study were reported in: B. Ruefenacht, G. Liknes, A. J. Lister, H. Fisk and Dan Wendt, “Evaluation of Open Source Data Mining Software Packages”, Proc. Symposium on Forest Inventory and Analysis (FIA), October 2008, Park City, UT.

4 R (http://www.r-project.org)
- Created at the University of Auckland, NZ, in 1993; released under the GNU General Public License (GPL) in 1995.
- An extension of the S language (Bell Labs).
- Twelve packages are supplied with the basic R distribution, each including many functions.
- http://cran.r-project.org offers 1,364 additional packages extending the basic R functionality.

5 WEKA (www.cs.waikato.ac.nz/ml/weka/)
- Waikato Environment for Knowledge Analysis, by the University of Waikato, New Zealand, which supports the software with funds from the NZ government.
- Started in 1993 and released in 1996.
- A GPL package.
- WEKA is a collection of machine-learning algorithms implemented in Java, plus data preprocessing tools, visualization tools, and interface tools (R, SQL).

6 Orange (www.ailab.si/orange/)
- By the University of Ljubljana, Slovenia, in 2004, under GPL.
- Still evolving: frequent new releases.
- Main routines and libraries are written in C++, but Python is used to call the routines and access the libraries (www.ailab.si/orange/doc/ofb/).
- Users can add their own machine-learning algorithms using both the scripting and GUI environments.
- Orange also has a GUI version called Orange Canvas, which allows for interactive machine-learning “visual programming”.

7 SAS® (Statistical Analysis System)
- Developed by Jim Goodnight and North Carolina State University associates in the early 1970s.
- In 1976 the SAS Institute was founded to distribute and further develop the increasingly popular software.
- SAS® currently has 10,658 employees and is the largest privately held software company, with annual revenue of $2.15 billion (in 2007).
- SAS® is used in 109 countries and in different industries, with 44,000 customer sites worldwide.
- SAS® is purchased by contacting a distributor directly: it can cost several thousand dollars depending on the options.
- The purchase includes the software, technical support, and licenses, which are renewed regularly, incurring more costs.

8 Evaluation Criteria
Cost
Usability:
- How easy is the interface to use and understand?
- Are there a variety of models and options available?
- How easy to use is the software’s programming language?
- Does the software integrate easily with other programs?
Performance w.r.t.:
- speed,
- stability, and
- accuracy.
Critical mass: how widespread is the software?
Uniqueness of useful features & algorithms
Defensibility w.r.t. citations and academic repute

9 Usability
SAS®: The Enterprise Guide for SAS® has a user-friendly GUI system that allows for the building of graphical models.
- GUIs also exist for other SAS® modules, but unlike WEKA and Orange there is no universal GUI for SAS®.
- SAS® is primarily driven by its own programming language, so a new user will require some training.
R, like SAS®, is used by numerous industries and thus has a wide variety of models and options.
- R is driven by its own scripting language, which does require some training and/or experience.
- GUIs exist for specific functions only.

10 Usability (Cont.)
WEKA has a comprehensive GUI with many models and options available.
- WEKA’s GUI is easy to use, but users need a good understanding of modeling techniques.
- Familiarity with Java is needed to extend WEKA and to integrate it with other software programs.
- WEKA can be expanded and used within R.
Orange: “Open source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting.” (Orange website, http://www.ailab.si/orange/)
- Orange has a good website on how to integrate Orange with Python.
- The number of models and options available in Orange lags behind not only SAS® and R but WEKA as well.

11 Performance notes
- R is significantly faster than WEKA and Orange on classification trees.
- Orange is the least stable, although new versions are released monthly.
- WEKA is a stable program, but it does not work well with large datasets.
- The WEKA team recently introduced MOA to process massive data sets in a stream-like mode.

12 Evaluation Results

13 Most Popular Data Mining Software
The Rexer Analytics Survey (early 2007) asked about the tools used often and occasionally.
Clearly more popular than the rest were:
- SPSS or SPSS Clementine
- "Own Code"
- SAS or SAS Enterprise Miner
Followed by:
- R
- Weka
- C4.5 / C5.0

14 Critical Mass and Popularity
Top ten most used packages by the KDnuggets Survey (May 2007):
- SPSS / SPSS Clementine
- Salford Systems CART/MARS/TreeNet/RF
- Yale (now RapidMiner)
- SAS / SAS Enterprise Miner
- Angoss Knowledge Studio / Knowledge Seeker
- KXEN
- Weka
- R
- Microsoft SQL Server ??
- MATLAB ??
Note: Microsoft Excel is omitted as it's not really "data mining" software, and I've merged the tools offered by a single vendor (SPSS and SAS). You can see the full survey results.

15 Comments
Gregory Piatetsky-Shapiro, KDnuggets Editor: Votes from tool vendors were removed. Compared with the 2008 KDnuggets Poll on data mining tools/software used, the big changes are growth in SPSS, RapidMiner, and R.

16 Popular Data Mining Software (cont.)
The Rexer Analytics Survey is taken every year and the summary report can be obtained free.
2009 SURVEY HIGHLIGHTS:
- Open-source tools Weka and R made substantial movement up data miners’ tool rankings this year, and are now used by large numbers of both academic and for-profit data miners.
- SAS Enterprise Miner dropped in data miners’ tool rankings.
2010 SURVEY HIGHLIGHTS:
- R: After a steady rise across the past few years, R overtook other tools to become the tool used by the most data miners (43%).
- STATISTICA has also been climbing in the rankings.
- STATISTICA, IBM SPSS Modeler, and R received the strongest satisfaction ratings in both 2010 and 2009.


18 Selected References
- Witten, I.H.; Frank, E. Data Mining: Practical machine learning tools and techniques. 2nd Edition, Morgan Kaufmann, 2005.
- R. R. Bouckaert et al., WEKA Manual for Version 3.6.0, 2008.
- Demsar, J.; Zupan, B.; Leban, G. “Orange: From experimental machine learning to interactive data mining”, 2004. (http://www.ailab.si/orange)
- R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, 2008.

19 About WEKA
- Compared to R, WEKA is weaker in classical statistics but stronger in machine learning (data mining) algorithms.
- WEKA has developed a set of extensions covering diverse areas, such as text mining, visualization and bioinformatics.
- WEKA 3.6 includes support for importing PMML (Predictive Model Markup Language) models. PMML is an XML-based standard for expressing statistical and data mining models.
- WEKA can interface with many systems and formats: SQL, LibSVM and SVM-Light, ….
WEKA has 2 limitations:
- The Java implementation is somewhat slower than an equivalent in C/C++.
- Most of the algorithms require all the data to be stored in main memory, which restricts application to small or medium-sized datasets.
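The second limitation can be made concrete with a small sketch. The Python code below is not WEKA code; it is an illustrative, hypothetical example of the alternative to loading everything into main memory: a one-pass computation (Welford's update for mean and variance) that keeps only three numbers of state, however long the input is. The function name `running_mean_var` is an assumption introduced here.

```python
from typing import Iterable, Tuple


def running_mean_var(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) in one pass with O(1) memory.

    Uses Welford's update, so the values never need to be stored together.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance


if __name__ == "__main__":
    # A generator stands in for a file or stream too large to hold in memory.
    values = (float(i % 10) for i in range(1_000_000))
    print(running_mean_var(values))
```

A batch learner would instead need the whole list of values before it could start, which is the restriction the slide describes.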

20 MOA: Massive Online Analysis
- MOA is a software environment for testing algorithms and running experiments for online learning from evolving data streams.
- MOA supports bi-directional interaction with WEKA.
- It addresses the problem of scaling up implementations of state-of-the-art algorithms to real-world dataset sizes by using a streaming setting.
- A DSMS (data stream management system) will then be required to deploy these algorithms on actual data streams: MOA is not a DSMS.
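As an illustration of what "running experiments for online learning" can look like, here is a minimal Python sketch of interleaved test-then-train (prequential) evaluation, a scheme commonly used for stream learners. This is not MOA's API (MOA is a Java framework); the class `MajorityClassLearner` and the function `prequential_accuracy` are hypothetical names chosen for this example, and the learner is deliberately trivial.

```python
from collections import Counter
from typing import Any, Hashable, Iterable, Optional, Tuple


class MajorityClassLearner:
    """Trivial online learner: always predicts the most frequent label seen."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def predict(self, x: Any) -> Optional[Hashable]:
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn(self, x: Any, y: Hashable) -> None:
        self.counts[y] += 1


def prequential_accuracy(
    stream: Iterable[Tuple[Any, Hashable]], learner: MajorityClassLearner
) -> float:
    """Test on each example first, then train on it; one pass, no stored data."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:
            correct += 1
        learner.learn(x, y)
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    toy_stream = [((0,), "a"), ((1,), "a"), ((2,), "a"), ((3,), "b")]
    print(prequential_accuracy(toy_stream, MajorityClassLearner()))
```

Each example is used for testing before the learner sees its label, so the accuracy reflects performance on unseen data without holding out a separate test set.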

21 MOA (cont.)
Downloads are available under the GNU GPL license.
Several data sets used:
- SEA Concepts Generator: artificial dataset with abrupt concept drift
- STAGGER Concepts Generator, by Schlimmer and Granger
- Rotating Hyperplane: used as a testbed for CVFDT versus VFDT
- Random RBF Generator
- Waveform Generator
- Function Generator: introduced by Agrawal et al.
MOA currently supports classification and clustering methods.
The system is easily extensible and has a nice GUI.
Good documentation:
- Albert Bifet, G. Holmes, R. Kirkby & B. Pfahringer: Data Stream Mining: A Practical Approach. May 2011.
- Albert Bifet et al.: MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering (2010).
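To show what "abrupt concept drift" means in a generator such as SEA, the following Python sketch produces a stream in the spirit of the SEA concepts: three attributes drawn uniformly from [0, 10], a label that depends only on whether the first two sum to at most a threshold, and a threshold that changes abruptly from one block to the next. This is not MOA's generator; the function name `sea_stream`, the threshold sequence (8, 9, 7, 9.5) and the 10% label-noise default are assumptions based on how the SEA benchmark is commonly described.

```python
import random
from typing import Iterator, List, Sequence, Tuple

# Assumed threshold per concept block; the switch between blocks is the drift.
SEA_THRESHOLDS: Tuple[float, ...] = (8.0, 9.0, 7.0, 9.5)


def sea_stream(
    n_per_concept: int,
    thresholds: Sequence[float] = SEA_THRESHOLDS,
    noise: float = 0.1,
    seed: int = 0,
) -> Iterator[Tuple[List[float], int]]:
    """Yield (attributes, label) pairs with an abrupt change at each block."""
    rng = random.Random(seed)
    for theta in thresholds:
        for _ in range(n_per_concept):
            x = [rng.uniform(0.0, 10.0) for _ in range(3)]
            # Only the first two attributes matter; the third is irrelevant.
            label = int(x[0] + x[1] <= theta)
            if rng.random() < noise:
                label = 1 - label  # flip the label to simulate class noise
            yield x, label


if __name__ == "__main__":
    for x, y in sea_stream(n_per_concept=2, noise=0.0):
        print([round(v, 2) for v in x], y)
```

A learner trained on the first block will start making more errors as soon as the threshold changes, which is the behavior drift-aware stream algorithms are tested against.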

