WEKA: Waikato Environment for Knowledge Analysis
Goals of the workshop
– Acquisition of functional knowledge about the WEKA platform
– Ability to process (own) data in WEKA
– Write seminar work:
    – identify a problem
    – transform it into data
    – choose an appropriate DM technique
    – apply it to the data
    – evaluate & interpret the results
What is WEKA?
Some basic facts about WEKA:
– WEKA(1) = a flightless bird with an inquisitive nature (found only on the islands of New Zealand)
– WEKA(2) = a software ‘workbench’ incorporating several standard ML/DM techniques
– Authors = Ian H. Witten, Eibe Frank (et al.)
– Programming language = JAVA
– Origin = The University of Waikato, New Zealand
– Literature = Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with JAVA Implementations, Morgan Kaufmann, 1999
– Homepage =
Objectives of WEKA
– make ML/DM techniques generally available
– apply them to practical problems (in agriculture)
– develop new ML/DM algorithms
– contribute to the theoretical framework of the field (ML/DM)
Versions of WEKA
There are several versions of WEKA:
– WEKA 3.0: “book version”, compatible with the description in the data mining book
– WEKA 3.2: “GUI version”, adds graphical user interfaces (the book version is command-line only)
– WEKA 3.4: “development version”, with lots of improvements
This workshop is based on WEKA 3.4(.3)
The input to WEKA
ARFF format (“flat” files); example: Play-tennis domain

    % this is an example of a knowledge
    % domain in ARFF
    @relation weather
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}
    @data
    sunny,85,85,FALSE,no
    sunny,80,90,TRUE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes
    rainy,68,80,FALSE,yes
    rainy,65,70,TRUE,no
    overcast,64,65,TRUE,yes
    sunny,72,95,FALSE,no
    sunny,69,70,FALSE,yes
    rainy,75,80,FALSE,yes
    sunny,75,70,TRUE,yes
    overcast,72,90,TRUE,yes
    overcast,81,75,FALSE,yes
    ...

Conversion to the ARFF format?
Example: converting from MS-EXCEL to ARFF
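One possible way to do the conversion is sketched below. It assumes the sheet was first saved from MS-EXCEL as a comma-separated (CSV) file with the attribute names in the first row. The function name `csv_to_arff`, the relation name, and the rule "all values parse as numbers means numeric, otherwise nominal" are assumptions made for this illustration, not part of WEKA. Missing values, dates, and attribute names containing spaces are not handled.

```python
import csv
import io


def is_number(text):
    """Return True if the text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row first) into an ARFF string.

    Hypothetical helper for illustration: a column is declared numeric
    when every value parses as a number, otherwise nominal.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: distinct values in order of first appearance.
            distinct = list(dict.fromkeys(values))
            lines.append("@attribute %s {%s}" % (name, ", ".join(distinct)))

    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# A few rows of the play-tennis data, as they might look after CSV export.
sample = (
    "outlook,temperature,humidity,windy,play\n"
    "sunny,85,85,FALSE,no\n"
    "overcast,83,86,FALSE,yes\n"
    "rainy,70,96,FALSE,yes\n"
    "rainy,65,70,TRUE,no\n"
)
arff = csv_to_arff(sample, "weather")
print(arff)
```

Because nominal values are collected from the data, a value that never occurs in the sheet will be missing from the attribute declaration; for real data it is safer to check the generated header by hand.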
Starting WEKA – the GUI
A quick tour of the “explorer”: Preprocess panel
Labelled screen areas: Filters panel, Domain info. panel, Attributes panel, Attribute info. panel, Attribute visualization panel, Status bar, Log file
A quick tour of the “explorer”: Classify panel
Labelled screen areas: Classifier panel, Test options panel, Class attribute, Result panel, Output panel
A quick tour of the “explorer”: Visualize panel
The command line
example:

    C:\Temp>java weka.classifiers.trees.J48

    Weka exception: No training file and no object input file given.

    General options:

    -t  Sets training file.
    -T  Sets test file. If missing, a cross-validation will be performed on the training data.
    -c  Sets index of class attribute (default: last).
    -x  Sets number of folds for cross-validation (default: 10).
    -s  Sets random number seed for cross-validation (default: 1).
    -m  Sets file with cost matrix.
    -l  Sets model input file.
    -d  Sets model output file.
    -v  Outputs no statistics for training data.
    -o  Outputs statistics only, not the classifier.
    -i  Outputs detailed information-retrieval statistics for each class.
    -k  Outputs information-theoretic statistics.
    -p  Only outputs predictions for test instances.
    -r  Only outputs cumulative margin distribution.
    -z  Only outputs the source representation of the classifier, giving it the supplied name.
    -g  Only outputs the graph representation of the classifier.

    Options specific to weka.classifiers.j48.J48:

    -U  Use unpruned tree.
    -C  Set confidence threshold for pruning. (default 0.25)
    -M  Set minimum number of instances per leaf. (default 2)
    -R  Use reduced error pruning.
    -N  Set number of folds for reduced error pruning. One fold is used as pruning set. (default 3)
    -B  Use binary splits only.
    -S  Don't perform subtree raising.
    -L  Do not clean up after the tree has been built.
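To show how the options listed above combine, here is a minimal sketch that only assembles a J48 command line and prints it. The helper `j48_command` and the file names are hypothetical; only flags that appear in the listing (-t, -T, -x, -C, -M, -d) are used. Actually running the command would additionally assume that Java is installed and that WEKA is on the classpath.

```python
import shlex


def j48_command(train_file, test_file=None, folds=None,
                confidence=None, min_leaf=None, model_out=None):
    """Build the argument list for a J48 run (hypothetical helper).

    A value of None means the flag is left out, so WEKA's default applies.
    """
    args = ["java", "weka.classifiers.trees.J48", "-t", train_file]
    if test_file is not None:
        # With a test file, no cross-validation is performed.
        args += ["-T", test_file]
    elif folds is not None:
        # Number of cross-validation folds on the training data.
        args += ["-x", str(folds)]
    if confidence is not None:
        args += ["-C", str(confidence)]   # confidence threshold for pruning
    if min_leaf is not None:
        args += ["-M", str(min_leaf)]     # minimum number of instances per leaf
    if model_out is not None:
        args += ["-d", model_out]         # model output file
    return args


# Example: 5-fold cross-validation with a lower pruning confidence.
cmd = j48_command("weather.arff", folds=5, confidence=0.1,
                  model_out="weather.model")
print(" ".join(shlex.quote(a) for a in cmd))
```

The list returned by `j48_command` could be passed to `subprocess.run` to start the run from a script; it is only printed here so that the sketch works without a WEKA installation.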
GUI vs. command line
GUI (+):
– visualisation of data and (some) models
GUI (-):
– not all the parameters can be set (reduced functionality)
Command line (+):
– full functionality (‘saving the model’)
– batch processing
Command line (-):
– only textual visualisation of models
– awkward to use
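Batch processing from the command line might look like the following sketch, which generates one J48 run per pruning confidence and saves each model to its own file with -d. The function `batch_commands`, the data file name, and the model file naming scheme are assumptions for illustration; the commands are printed rather than executed.

```python
def batch_commands(train_file, confidences):
    """Return one J48 command (as an argument list) per confidence value."""
    commands = []
    for c in confidences:
        # Hypothetical naming scheme: one model file per parameter setting.
        model_file = "j48_C%s.model" % c
        commands.append(["java", "weka.classifiers.trees.J48",
                         "-t", train_file,
                         "-C", str(c),
                         "-d", model_file])
    return commands


for command in batch_commands("weather.arff", [0.1, 0.25, 0.5]):
    print(" ".join(command))
```

The printed lines could be written to a batch file and run unattended, which is the kind of repeated experiment that is tedious to carry out by hand in the GUI.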
PROs & CONs of WEKA
PROs:
– open source (GNU licence)
– platform-independent (JAVA)
– easy to use
– (relatively) easy to modify
CONs:
– relatively slow (JAVA)
– ‘incomplete’ documentation (some GUI features could be explained better)
– some features available only from command line
Let’s go to work