Machine Learning for Language Technology
Introduction to Weka: ARFF Format and Preprocessing
Practical Machine Learning for Language Technology
Marina Santini
Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
Autumn 2015
ML4LT Lecture 2: LAB SESSION
Acknowledgements
Many thanks to Weka slides… Martin D. Sykora,
Outline
– Aim of the lab sessions
– Requirements of the lab sessions
– Structure of the lab assignments
– The Weka package
– ARFF format
– Preprocessing – Feature Selection
Aim of the lab sessions
The aim of the lab sessions is manifold:
– to practise with a number of machine learning methods
– to apply machine-learning methods to real-world problems in LT
– to learn how to use a state-of-the-art machine-learning workbench.
Requirements of the lab sessions
Each lab session includes a number of lab assignments to be completed. Completing the lab assignments is required to pass the course. Physical attendance at the lab sessions is also required to pass the course. Out of 12 lectures and corresponding lab sessions, 9? lab assignments must be correctly completed to pass the course.
Structure of the lab assignments
Lab assignments should be completed in class. A lab assignment includes a number of tasks, divided into G tasks and VG tasks. To pass a lab assignment, the G tasks must be completed correctly and a short report must be sent to the teacher by the due date.
Weka 1
Weka stands for Waikato Environment for Knowledge Analysis. It is a state-of-the-art machine learning workbench, normally used to derive useful knowledge from datasets that are far too large to be analysed by hand.
Weka 2
Weka is a general-purpose workbench that is used for data and text mining in many different domains (bioinformatics, medicine, text analytics, etc.). It contains many machine learning methods (both supervised and unsupervised), preprocessing tools, and statistical tests to evaluate the performance of the different models.
???
When you want to apply ML to a classification problem:
– Either you write your own implementation of a model using a programming language
– Or you use an off-the-shelf software package that frees you from the programming task.
???
Some learning models are easy to program: students in the previous year provided their own implementation of the Perceptron in Java. You could do this in Python this year… You can also take Weka's open-source code and modify it (if you are not happy with it) to achieve your specific purposes.
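To give an idea of how small such an implementation can be, here is a minimal Python sketch of a binary perceptron. It is not the students' Java implementation and not Weka code; the function names (`train_perceptron`, `predict`), the learning rate, and the toy data (logical AND) are assumptions made for illustration only.

```python
# Minimal perceptron sketch (illustrative; names and data are hypothetical).

def train_perceptron(examples, epochs=10, learning_rate=1.0):
    """Train a binary perceptron.

    examples: list of (features, label) pairs, where label is +1 or -1.
    Returns (weights, bias).
    """
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for features, label in examples:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            predicted = 1 if activation > 0 else -1
            if predicted != label:
                # Mistake-driven update: shift the boundary towards the example.
                weights = [w + learning_rate * label * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * label
                errors += 1
        if errors == 0:
            # A full pass without mistakes: the training data is separated.
            break
    return weights, bias


def predict(weights, bias, features):
    """Return +1 or -1 for a single feature vector."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else -1


# Toy, linearly separable data (logical AND), invented for this sketch.
AND_DATA = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
weights, bias = train_perceptron(AND_DATA, epochs=20)
```

The update rule only changes the weights when an example is misclassified, so on linearly separable data such as this toy set, training stops as soon as one full pass produces no errors.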
Weka includes
– Regression
– Classification
– Clustering
– Association Rules
– Attribute Selection
– Visualization
The ARFF format
The standard format for datasets processed by Weka is ARFF. See section 2.4. Example: <>
The Weather Table
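As a rough illustration of what an ARFF file looks like, the sketch below holds a small weather-style table as ARFF text and reads it back with a few lines of Python. The rows, the relation name `weather_sketch`, and the helper `read_simple_arff` are invented for this example and are not the table from the slide; the reader handles only unquoted nominal and numeric attributes and is not a full ARFF parser.

```python
# Illustrative ARFF text; the rows are invented for this sketch.
ARFF_TEXT = """\
% Lines starting with % are comments.
@relation weather_sketch

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,FALSE,no
overcast,83,FALSE,yes
rainy,65,TRUE,no
"""


def read_simple_arff(text):
    """Parse a small ARFF string into (relation, attributes, rows).

    attributes is a list of (name, type) pairs, where type is either the
    string "numeric" or a list of nominal values.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lowered = line.lower()
        if in_data:
            # After @data, every line is one comma-separated instance.
            rows.append([value.strip() for value in line.split(",")])
        elif lowered.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lowered.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                values = [v.strip() for v in spec.strip("{}").split(",")]
                attributes.append((name, values))
            else:
                attributes.append((name, spec.lower()))
        elif lowered.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = read_simple_arff(ARFF_TEXT)
```

The header (`@relation` and `@attribute` lines) declares the name and type of each feature, and the `@data` section lists one instance per line, with values in the same order as the attribute declarations.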
Feature representation
You must decide on the best way of representing the problem you want to address! Different features give different results. There is no a priori correct or incorrect answer to "which are the best features?". Feature selection is based on your theoretical knowledge of the problem, your theoretical assumptions, and empirical trials with different models/algorithms.
How to get the ARFF format? (p. 407)
Either you use an already prepared ARFF file that somebody else has made available, or you create one yourself (feature manipulation and extraction):
– Decide the best way to represent your problem through the features
– Extract the features from a corpus
– Organize the features in a spreadsheet (e.g. CSV, Excel)
– Convert it into ARFF
– Or…
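The conversion from a spreadsheet export to ARFF can be sketched in a few lines of Python. This is only one possible approach, under stated assumptions: the CSV has a header row, values contain no spaces, commas or quotes, and a column is treated as numeric only if every value parses as a number. The function name `csv_to_arff`, the feature names, and the sample rows are hypothetical.

```python
import csv
import io


def _is_number(value):
    """Return True if the string can be read as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation_name):
    """Convert CSV text (header row + data rows) into an ARFF string.

    A column is declared numeric if every value parses as a float;
    otherwise it is declared nominal with its observed values.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]  # drop empty lines

    lines = [f"@relation {relation_name}", ""]
    for index, name in enumerate(header):
        column = [row[index] for row in rows]
        if all(_is_number(value) for value in column):
            lines.append(f"@attribute {name} numeric")
        else:
            # Keep first-seen order so the output is deterministic.
            values = list(dict.fromkeys(column))
            lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical feature table for a text-classification task.
CSV_TEXT = (
    "avg_word_length,has_question_mark,genre\n"
    "4.2,no,news\n"
    "3.7,yes,blog\n"
    "5.1,no,news\n"
)
arff_text = csv_to_arff(CSV_TEXT, "genre_sketch")
print(arff_text)
```

Note that nominal values are collected only from the rows actually present, so a value that never occurs in the CSV will be missing from the attribute declaration; for real data you may prefer to list the allowed values explicitly.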
Get the Lab Assignment
Summary and Conclusions