Published by Teresa Gilbert. Modified over 9 years ago.
1
Machine Learning for Language Technology 2015 http://stp.lingfil.uu.se/~santinim/ml/2015/ml4lt_2015.htm Introduction to Weka: Arff format and Preprocessing Practical Machine Learning for Language Technology Marina Santini santinim@stp.lingfil.uu.se Department of Linguistics and Philology Uppsala University, Uppsala, Sweden Autumn 2015 ML4LT 2015 - Lecture 2: LAB SESSION1
2
Acknowledgements
Many thanks to Weka slides…..Martin D. Sykora,
3
Outline
– Aim of the lab sessions
– Requirements of the lab sessions
– Structure of the lab assignments
– The Weka Package
– ARFF format
– Preprocessing – Feature Selection
4
Aim of the lab sessions
The aim of the lab sessions is manifold:
– to practise with a number of machine learning methods
– to apply machine-learning methods to real-world problems in LT
– to learn how to use a state-of-the-art machine-learning workbench.
5
Requirements of the lab sessions
Each lab session includes a number of lab assignments to be completed. Completing the lab assignments is required to pass the course. Physical attendance at the lab sessions is required to pass the course. Out of 12 lectures and corresponding lab sessions, 9? lab assignments must be correctly completed to pass the course.
6
Structure of the lab assignments
Lab assignments should be completed in class. A lab assignment includes a number of tasks. Tasks are divided into G (pass) tasks and VG (pass with distinction) tasks. In order to pass a lab assignment, the G tasks must be completed correctly and a short report must be sent to the teacher by the due date.
7
Weka 1
Weka stands for Waikato Environment for Knowledge Analysis. It is a state-of-the-art machine learning workbench normally used to derive useful knowledge from datasets that are far too large to be analysed by hand.
8
Weka 2
Weka is a general-purpose workbench that is used in many different domains (bioinformatics, medicine, text analytics, etc.) for data and text mining. It contains many machine learning methods (both supervised and unsupervised), preprocessing tools and statistical tests to evaluate the performance of the different models.
9
???
When you want to apply ML to a classification problem:
– Either you write your own implementation of a model using a programming language
– Or you use an off-the-shelf software package that frees you from the programming task.
10
???
Some learning models are easy to program: students in the previous year provided their own implementation of the Perceptron using Java. You could do this using Python this year… You can also take Weka's open source code and modify it (if you are not happy with it) to achieve your specific purposes.
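To give an idea of how little code a simple model needs, here is a minimal Python sketch of a binary perceptron. It is not the students' implementation and not Weka's code: the function names, the toy AND dataset and the number of epochs are assumptions made for illustration only.

```python
def train_perceptron(examples, epochs=10):
    """Train a binary perceptron.

    examples: list of (features, label) pairs, with label in {-1, +1}.
    Returns the learned weights and bias.
    """
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else -1
            if predicted != y:
                # Mistake-driven update: move the boundary towards the example.
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 for the feature vector x."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else -1


# Toy, linearly separable data (logical AND), purely illustrative.
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, bias = train_perceptron(data, epochs=20)
predictions = [predict(weights, bias, x) for x, _ in data]
print(predictions)
```

On linearly separable data such as this toy set the updates stop once every example is classified correctly; on real language data you would normally hold out a test set rather than check the training examples.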
11
Weka includes
– Regression
– Classification
– Clustering
– Association Rules
– Attribute Selection
– Visualization
12
The ARFF format
The standard format of the datasets to be processed by Weka is the ARFF format. See section 2.4. Example: <>
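As a rough illustration of what an ARFF file contains, the Python sketch below builds one as a string: a @relation line, one @attribute line per feature, then @data followed by one comma-separated instance per line. The helper name to_arff is hypothetical, and the few weather-style rows are illustrative values, not necessarily the table used in the lecture.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, spec) pairs, where spec is either a list of
    nominal values or a type keyword such as "numeric".
    rows: list of instances, one value per attribute.
    """
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        if isinstance(spec, (list, tuple)):
            # Nominal attribute: enumerate the allowed values in braces.
            spec = "{" + ", ".join(spec) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Illustrative attributes and instances (assumed for this example).
attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]
rows = [
    ("sunny", 85, "FALSE", "no"),
    ("overcast", 83, "FALSE", "yes"),
    ("rainy", 70, "TRUE", "no"),
]
arff_text = to_arff("weather", attributes, rows)
print(arff_text)
```

The last attribute (play) is the class to be predicted here; that is a common convention rather than something the format itself enforces.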
13
The Weather Table
14
Feature representation
You must decide on the best way of representing the problem you want to address! Different features give different results. There is no a priori correct or incorrect answer to "which are the best features?". Feature selection is based on your theoretical knowledge about the problem, your theoretical assumptions and empirical trials with different models/algorithms.
15
How to get the ARFF format? (p. 407)
Either you use an already prepared ARFF file that somebody else has made available
Or you create one yourself (feature manipulation and extraction):
– Decide the best way to represent your problem through the features
– Extract the features from a corpus
– Organize the features in a spreadsheet (e.g. CSV, Excel)
– Convert it into ARFF
– Or…
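The "spreadsheet to ARFF" step above can be sketched in a few lines of Python. This is only one possible way to do it, under stated assumptions: the function csv_to_arff and the feature names (tokens, has_qmark, label) are hypothetical, a column is treated as numeric when every value parses as a number and as nominal otherwise, and values containing spaces or commas (which would need quoting) are not handled.

```python
import csv
import io


def is_number(text):
    """Return True if text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into ARFF text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines

    lines = [f"@relation {relation}", ""]
    for index, name in enumerate(header):
        column = [row[index] for row in rows]
        if all(is_number(value) for value in column):
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: list the distinct values seen in the column.
            values = ",".join(sorted(set(column)))
            lines.append(f"@attribute {name} {{{values}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical features extracted from a small corpus of sentences.
csv_text = """tokens,has_qmark,label
12,yes,question
7,no,statement
9,no,statement
"""
arff = csv_to_arff(csv_text, "sentence_types")
print(arff)
```

Inferring the nominal values from the data is convenient for a quick experiment, but the training and test files must then declare the same value sets, so for real assignments it is safer to list them explicitly.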
16
Get the Lab Assignment
17
Summary and Conclusions