Introduction to Mallet
LING 572
Fei Xia
Week 1: 1/08/2009

Mallet basics
- A package developed by McCallum's group at UMass, written in Java.
- It includes most ML algorithms that we will cover in LING572.
- The package has been used by researchers from all over the world.
- It is still under development:
  - Some functions are missing.
  - Some code has bugs.

On Patas
- Mallet package: /NLP_TOOLS/tool_sets/mallet/latest
- Fei's classes: ~/dropbox/07-08/572/fei_mallet
- In each directory, there are several subdirectories:
  - bin/: shell scripts
  - class/: the Java classes
  - src/: the Java source code
  - lib/: (only for the Mallet dir) the jar files

Set the env!!
You need to add the following to ~/.bash_profile, then start a new terminal:

  PATH=$PATH:$HOME/dropbox/07-08/572/fei_mallet/bin
  export PATH

  CLASSPATH=$CLASSPATH:$HOME/dropbox/07-08/572/fei_mallet/class:/NLP_TOOLS/tool_sets/mallet/latest/lib/mallet.jar:/NLP_TOOLS/tool_sets/mallet/latest/lib/mallet-deps.jar
  export CLASSPATH

To run Mallet
- Type "which classify":
    /opt/dropbox/07-08/572/fei_mallet/bin/classify
- Type "which vectors2info":
    /NLP_TOOLS/tool_sets/mallet/latest/bin/vectors2info
- If they do not work:
  - echo $PATH: /opt/dropbox/07-08/572/fei_mallet/bin should be there.
  - echo $CLASSPATH: /opt/dropbox/07-08/572/fei_mallet/class, /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet.jar, and /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet-deps.jar should be there.

Mallet commands
Types:
- Data preparation
- Format conversion: text <-> binary
- Training
- Decoding
All the commands are actually shell scripts that call java, as in the example on the next slide.

An example: classifier2info

  #!/bin/sh
  malletdir=`dirname $0`
  malletdir=`dirname $malletdir`
  cp=$malletdir/lib/mallet.jar:$malletdir/class:$malletdir/lib/mallet-deps.jar:$CLASSPATH
  mem=200m
  arg=`echo "$1" | sed -e 's/-Xmx//'`
  if test "$1" != "$arg" ; then
      mem=$arg
      shift
  fi
  java -Xmx$mem -classpath $cp edu.umass.cs.mallet.base.classify.tui.Classifier2Info "$@"
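
For example, a call such as the following (a usage sketch; the -Xmx500m prefix is illustrative and is consumed by the wrapper rather than passed on as a Mallet option, and --classifer matches the misspelled option noted later) would dump a model with a 500 MB heap:

  classifier2info -Xmx500m --classifer me_model > me_model.txt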

Data preparation

The format of the feature vectors
- Text format: instanceName targetLabel f1 v1 f2 v2 ...
- Binary format: it stores the mapping from featName to featIdx, from targetLabel to targetIdx, etc.
- The learner/decoder uses only the binary format.
  => We need to convert the text format to the binary format with the info2vectors command before training/decoding.
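
For illustration, a two-instance text-format file (instance names and features here are made up) and its conversion with info2vectors might look like this:

  doc1 guns gun 3 nra 1 the 12
  doc2 mideast peace 2 treaty 1 the 8

  info2vectors --input example.vectors.txt --output example.vectors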

Data preparation
- info2vectors: convert vectors from the text format to the binary format
- vectors2info: convert vectors from the binary format to the text format
- vectors2vectors: split the vectors into training vectors and test vectors (all in the binary format)

  info2vectors --input news.vectors.txt --output news.vectors

  vectors2info --input news.vectors --print-matrix siw | remove_blank_line.exec > news.vectors.new_txt
  diff news.vectors.txt news.vectors.new_txt
  => They are the same except that the (feat, val) pairs might be in a different order.

  vectors2vectors --input news.vectors --training-portion 0.9 --training-file train.vectors --testing-file test.vectors
  => The split uses a random function internally.

When training data and test data are prepared separately

  info2vectors --input train.vectors.txt --output train.vectors
  => creates train.vectors

  info2vectors --input test.vectors.txt --output test.vectors --use-pipe-from train.vectors
  => creates test.vectors, which contains the same mapping as train.vectors

Training

Training

  vectors2train --training-file train.vectors --trainer MaxEnt --output-classifier foo_model --report train:accuracy train:confusion > foo.stdout 2> foo.stderr

It will create:
- foo_model (the model): features and their weights
- foo.stdout (the report): training accuracy, confusion matrix
- foo.stderr (the training info): iteration values, etc.

The name of the trainer: MaxEnt, C45, DecisionTree, NaiveBayes, ...

Viewing the model

  classifier2info --classifer me_model > me_model.txt

(There is a typo in the Java code, so the option name is misspelled.)

An example model:

  FEATURES FOR CLASS guns
  <default> 0.1298
  fire 0.3934
  firearms 0.4221
  government 0.3721
  arabic -0.0204

Accuracy and confusion matrix

  Confusion Matrix, row=true, column=predicted  accuracy=0.9711111111111111
   label       0     1     2  | total
   0 misc    846    27    23  |   896
   1 mideast  12   899     2  |   913
   2 guns     12     2   877  |   891

  Train accuracy mean = 0.9711

Testing

Testing and evaluation

  classify --testing-file test.vectors --classifier foo_model --report test:accuracy test:confusion test:raw > foo_res.stdout 2> foo_res.stderr

In foo_res.stdout:
- Raw results: instName tgtLabel c1:score1 c2:score2 ...
    talk.politics.guns/54600 guns guns:0.999 misc:9.24E-4 mideast:1.42E-5
- Accuracy: Test data accuracy = 0.87
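
Since the (class, score) pairs are ordered by score, the first class on each raw line is the predicted label. A minimal accuracy check, assuming the per-instance raw lines have been saved by themselves to a file raw.txt:

  # field 2 = true label; field 3 = top "class:score" pair
  awk '{split($3, a, ":"); correct += ($2 == a[1])} END {print correct / NR}' raw.txt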

Training, testing and eval

  vectors2classify --training-file train.vectors --testing-file test.vectors --trainer MaxEnt > foo.stdout 2> foo.stderr

It is the same as vectors2train followed by classify. The training and test accuracies are at the end of foo.stdout.

The error message in stderr

  Logging configuration class "edu.umass.cs.mallet.base.util.Logger.DefaultConfigurator" failed
  java.lang.ClassNotFoundException: edu.umass.cs.mallet.base.util.Logger.DefaultConfigurator

=> Please ignore this message.

Summary

Main commands
- Data preparation: info2vectors
  => creates vectors in the binary format
  => use the --use-pipe-from option when the training and test data are created separately
- Training: vectors2train
  => creates a model from the training data
- Testing and evaluation: classify
  => creates classification results
All three are Fei's classes. Both vectors2train and classify have the --report option.
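
Putting the three steps together, a minimal end-to-end sketch using the commands above (file names are placeholders):

  # text -> binary vectors; test vectors reuse the training mapping
  info2vectors --input train.vectors.txt --output train.vectors
  info2vectors --input test.vectors.txt --output test.vectors --use-pipe-from train.vectors

  # train a MaxEnt model
  vectors2train --training-file train.vectors --trainer MaxEnt --output-classifier foo_model --report train:accuracy train:confusion > train.stdout 2> train.stderr

  # classify the test vectors and report accuracy, confusion matrix, and raw scores
  classify --testing-file test.vectors --classifier foo_model --report test:accuracy test:confusion test:raw > test.stdout 2> test.stderr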

Other commands
- Split vectors into training and testing: vectors2vectors
  => It uses a random function.
- Viewing the vectors: vectors2info
  => use remove_blank_line.exec to remove the final blank line
- Viewing the model: classifier2info
  => the --classifer option is misspelled
- Training, testing and eval: vectors2classify
  => It is the same as vectors2train + classify.
All of these are Mallet's classes.

Other commands (cont)
csv2vectors: converts a text file into vectors in the binary format. The text file has the format:

  InstName classLabel f1 f2 f3 ...

It is similar to info2vectors, but it does not allow feature values. An example is sketched below.
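
A usage sketch (the file name is a placeholder, and the --input/--output option names are assumed to mirror info2vectors):

  # example.csv.txt, one instance per line, features without values:
  #   doc1 guns gun nra fire
  #   doc2 mideast peace treaty
  csv2vectors --input example.csv.txt --output example.vectors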

Naming convention
- *.vectors: feature vectors in the binary format
- *.vectors.txt: feature vectors in the text format
- *_model: models in the binary format
- *_model.txt: models in the text format

File format
- Vectors in the text format: InstName classLabel fn1 fv1 fn2 fv2 ...
  The order of the (featName, val) pairs does not matter.
- Classification results: InstName classLabel c1 score1 c2 score2 ...
  The (class, score) pairs are ordered by the score.

More information
- Mallet URL (optional), for version 2.0: http://mallet.cs.umass.edu/index.php
- A tutorial that I wrote for Mallet two years ago (optional): http://courses.washington.edu/ling572/winter07/homework/mallet_guide.pdf
  It discusses the main classes in Mallet.

Mallet version
- Latest version and tutorials from UMass' site: version 2.0
- Version on Patas: version 0.4
=> There could be a mismatch between the two versions.

Hw1

Hw1
Purpose:
- Learn to use the Mallet package
- Learn to create feature vectors
Text classification task:
- Three categories: guns, mideast, and misc
- Each category has 1000 files under ~/dropbox/08-09/572/20_newsgroups/talk.politics.*/

Part I: use text2vectors to create feature vectors, and make sure that Mallet works for you.

  text2vectors --input 20_newsgroups/talk.politics.* --skip-header --output news.vectors
  => creates news.vectors

Part II: the same task, but you need to prepare the vectors yourself.

Features in Hw1
Given a document:
- Skip the header: use the text after the first blank line.
- Tokenize the text by finding all the [a-zA-Z]+ sequences and lowercasing them.
  => That is, replace non-alphabetic characters with whitespace. This is different from typical tokenization.
- The tokens (the sequences) are the features.
- The feature values are the frequencies of the sequences in the document.
A sketch of this procedure follows.
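
A minimal shell sketch of these steps for one document ($doc and $label are placeholders; the output line follows the text vector format instanceName targetLabel f1 v1 f2 v2 ...):

  doc=talk.politics.guns/53293    # placeholder path
  label=guns                      # placeholder label

  body=$(sed '1,/^$/d' "$doc")               # keep only the text after the first blank line
  printf '%s %s' "$doc" "$label"
  echo "$body" |
      grep -oE '[A-Za-z]+' |                 # find the [a-zA-Z]+ sequences, one per line
      tr 'A-Z' 'a-z' |                       # lowercase them
      sort | uniq -c |                       # count frequencies (alphabetically sorted)
      awk '{printf " %s %s", $2, $1}'        # emit "feature value" pairs
  echo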

An example: talk.politics.guns/53293

  Xref: cantaloupe.srv.cs.cmu.edu misc.headlines:41568 talk.politics.guns:53293
  ...
  Lines: 38

  hambidge@bms.com wrote:
  : In article <C4psoG.C6@magpie.linknet.com>, manes@magpie.linknet.com (Steve Manes) writes:

After "tokenization"

  Original:  hambidge@bms.com wrote: : In article <C4psoG.C6@magpie.linknet.com>, manes@magpie.linknet.com (Steve Manes) writes:

  Tokenized: Hambidge bms com wrote In article C psoG C magpie linknet com manes magpie linknet com Steve Manes writes

After lowercasing, counting and sorting

  talk.politics.guns/53293 guns a 11 about 2 absurd 1 again 1 an 1 and 5 any 2 approaching 1 are 5 argument 1 article 1 as 5 associates 1 at 1 average 2 bait 1 be 6 being 1 betraying 1 better 1 bms 1 by 5 c 2 calculator 1 capita 1 choice 1 chrissakes 1 citizen 1 com 4 crow 1 dangerous 1 deaths 2 die 1 easier 1 eigth 1 enuff 1 ...

Coming up
- Please try the Mallet commands ASAP to make sure they run for you. Do not wait until Wed.
- Bill will cover kNN, DT, etc. next week.
- Post your questions to GoPost.
- For any emails to me, please cc Bill at billmcn@u. My response could be very slow between 1/9 and 1/15.