MALLET: MAchine Learning for LanguagE Toolkit
Outline
- About MALLET
- Representing Data
- Command Line Processing
- Simple Evaluation
- Conclusion
About MALLET
- "MALLET: A Machine Learning for Language Toolkit," written by Andrew McCallum, 2002. http://mallet.cs.umass.edu
- Implemented in Java; current version is 2.0.6
- Motivation:
  - Text classification and information extraction
  - Commercial machine learning
  - Analysis and indexing of academic publications
About MALLET
- Main idea
  - Text focus: data is treated as discrete rather than continuous, even when values could be continuous
- How to use it
  - Command line scripts: bin/mallet [command] --[option] [value] …
  - Text User Interface ("tui") classes
  - Direct Java API: http://mallet.cs.umass.edu/api
Representations
- Transform text documents into vectors x1, x2, …
- Elements of a vector are called feature values
- Example: "The feature at row 345 is the number of times 'dog' appears in the document"
- Retain the meaning of vector indices
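The idea of a count vector whose indices keep their meaning can be sketched in a few lines of Python. This is only an illustration of the concept, not MALLET's implementation: the function names `build_alphabet` and `to_vector` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def build_alphabet(documents):
    """Assign each distinct token a fixed integer index, in first-seen order."""
    alphabet = {}
    for text in documents:
        for token in text.lower().split():
            if token not in alphabet:
                alphabet[token] = len(alphabet)
    return alphabet

def to_vector(text, alphabet):
    """Return a dense count vector; position i counts the token mapped to i."""
    counts = Counter(text.lower().split())
    vector = [0] * len(alphabet)
    for token, count in counts.items():
        if token in alphabet:  # tokens unseen when the alphabet was built are ignored
            vector[alphabet[token]] = count
    return vector

docs = ["the dog chased the cat", "the cat slept"]
alphabet = build_alphabet(docs)
vectors = [to_vector(d, alphabet) for d in docs]

# Inverting the mapping recovers what each vector index means.
index_to_token = {index: token for token, index in alphabet.items()}
```

Keeping `alphabet` alongside the vectors is what lets a statement like "the feature at row 1 counts 'dog'" remain true for every document.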
Documents to Vectors
Instances
Command Line
- Importing Data
- Classification
- Sequence Tagging
- Topic Modeling
Importing Data
- One instance per file
  - Files in the folders: sample-data/web/en or sample-data/web/de
  - Command line: bin/mallet import-dir --input sample-data/web/* --output web.mallet
- One file, one instance per line
  - File format: [URL] [language] [text of the page...]
  - Command line: bin/mallet import-file --input /data/web/data.txt --output web.mallet
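To make the one-instance-per-line layout concrete, here is a minimal Python sketch that splits such a line into its three parts. It assumes the fields are separated by whitespace and that everything after the second field is the page text; the pattern, the function name `parse_instance`, and the sample URL are illustrative and are not taken from MALLET.

```python
import re

# Assumption: three whitespace-separated fields (name, label, data), following
# the "[URL] [language] [text of the page...]" layout described on the slide.
LINE_PATTERN = re.compile(r"^(\S+)\s+(\S+)\s+(.*)$")

def parse_instance(line):
    """Split one line into its name, label, and data fields."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a name/label/data line: {line!r}")
    name, label, data = match.groups()
    return {"name": name, "label": label, "data": data}

# Hypothetical sample line, for illustration only.
sample = "http://example.org/page1 en The quick brown fox"
instance = parse_instance(sample)
```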
Classification
- Training a classifier
  - bin/mallet train-classifier --input training.mallet --output-classifier my.classifier
- Choosing an algorithm
  - MaxEnt, NaiveBayes, C45, DecisionTree, and many others
  - bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt
- Evaluation
  - Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances
  - bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
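What a random 90/10 split amounts to can be sketched as follows. This is an illustration of the idea only and does not claim to reproduce how MALLET partitions instances; the function name `split_instances` and the fixed seed are assumptions made for the example.

```python
import random

def split_instances(instances, training_portion=0.9, seed=0):
    """Shuffle a copy of the instances and cut it into training and testing parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split repeatable
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]

# 100 stand-in instances: 90 go to training, 10 to testing.
train, test = split_instances(range(100), training_portion=0.9)
```

Every instance lands in exactly one of the two parts, so accuracy measured on `test` reflects data the classifier never saw during training.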
Sequence Tagging
- Sequence algorithms
  - Hidden Markov models (HMMs)
  - Linear-chain conditional random fields (CRFs)
- SimpleTagger
  - A command line interface to the MALLET Conditional Random Field (CRF) class
SimpleTagger
- Input file: one token per line, in the form [feature1 feature2 ... featuren label]
    Bill CAPITALIZED noun
    slept non-noun
    here LOWERCASE STOPWORD non-noun
- Train a CRF
  - Input: a file named "sample"
  - Output: a trained CRF in the file "nouncrf"
  - java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
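The training format can be read back with a short Python sketch: on each line, every field except the last is a feature and the last field is the label. The function name `parse_sequences` is hypothetical, and treating a blank line as the boundary between sequences is an assumption of this sketch rather than something stated on the slide.

```python
def parse_sequences(lines):
    """Group lines into sequences of (features, label) pairs."""
    sequences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            # Assumption: a blank line ends the current sequence.
            if current:
                sequences.append(current)
                current = []
            continue
        *features, label = fields  # the last field is the label
        current.append((features, label))
    if current:
        sequences.append(current)
    return sequences

sample = [
    "Bill CAPITALIZED noun",
    "slept non-noun",
    "here LOWERCASE STOPWORD non-noun",
]
parsed = parse_sequences(sample)
```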
SimpleTagger
- Label the input
  - A file "stest" that needs to be labeled:
      CAPITAL Al
      slept
      here
  - java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest
- Output
    Number of predicates: 5
    noun CAPITAL Al
    non-noun slept
    non-noun here
Topic Modeling
- Building topic models
  - bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz
  - --input [FILE]
  - --num-topics [NUMBER]: the number of topics to use. The best number depends on what you are looking for in the model.
  - --num-iterations [NUMBER]: the number of sampling iterations; a trade-off between the time taken to complete sampling and the quality of the topic model.
  - --output-state [FILENAME]: outputs a compressed text file containing the words in the corpus with their topic assignments.
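Since the output state is a compressed text file of words with topic assignments, it can be summarized with a short script. The Python sketch below assumes each non-comment line ends with the word followed by its topic number, and that header lines start with "#"; that column layout is an assumption and should be checked against an actual file. The names `top_words_per_topic` and `read_state` and the sample lines are hypothetical.

```python
import gzip
from collections import Counter, defaultdict

def top_words_per_topic(lines, n=3):
    """Count word/topic assignments and return the n most frequent words per topic."""
    counts = defaultdict(Counter)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blanks
        fields = line.split()
        # Assumption: the last two fields are the word and its topic number.
        word, topic = fields[-2], int(fields[-1])
        counts[topic][word] += 1
    return {topic: [w for w, _ in c.most_common(n)] for topic, c in counts.items()}

def read_state(path):
    """Yield the lines of a gzip-compressed text file such as topic-state.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle

# Made-up lines in the assumed layout, for illustration only.
sample_lines = [
    "#doc source pos typeindex type topic",
    "0 doc0 0 0 dog 1",
    "0 doc0 1 1 cat 1",
    "0 doc0 2 0 dog 1",
    "1 doc1 0 2 java 0",
]
summary = top_words_per_topic(sample_lines)
```

On a real run, `top_words_per_topic(read_state("topic-state.gz"))` would apply the same counting to the file named in the command above.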
Demo
Methodology
- Focus on the sequence tagging module in MALLET
  - CRF-based implementation
  - Some scripts written for importing data and evaluating results
- Small corpora collected from the web
  - Divided into two parts: 80% for training, 20% for testing
- Evaluate both POS tagging and Named Entity Recognition (NER)
  - Training performance
  - Accuracy (POS tagging); precision, recall, and FB1 (NER)
- All scripts, corpora, and results can be found at http://mallet-eval.googlecode.com
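For reference, the metrics named here have standard definitions, sketched below in Python. This is not the evaluation script used for these slides, which may differ in detail. It assumes accuracy is computed per token and that precision and recall are computed over whole entities, each represented as a hypothetical (start, end, type) tuple. FB1 is the harmonic mean of precision and recall.

```python
def accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

def precision_recall(gold_entities, predicted_entities):
    """Entity-level precision and recall; an entity counts only on an exact match."""
    gold, predicted = set(gold_entities), set(predicted_entities)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy data: two of three POS tags are right; one of two entities is found.
pos_accuracy = accuracy(["NN", "VBD", "RB"], ["NN", "VBD", "NN"])
ner_scores = precision_recall(
    gold_entities=[(0, 2, "Ni"), (5, 5, "Nh")],
    predicted_entities=[(0, 2, "Ni"), (7, 7, "Ns")],
)
```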
A Survey of Named Entity Corpora
- Well-known named entity corpora
  - Language-Independent Named Entity Recognition at CoNLL-2003
    - A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
    - Free and public, but requires the RCV1 raw texts as input
  - Message Understanding Conference (MUC) 6 / 7
    - Not free
  - Automatic Content Extraction (ACE) Training Corpus
- Other special-purpose corpora
  - Enron Email Dataset
    - Email messages in this corpus are tagged with person names, dates, and times
  - A variety of biomedical corpora
    - Some corpora in this collection are tagged with entities in the biomedical domain, such as gene names
Small Corpora
- Two small corpora collected from the web
- Penn Treebank Sample
  - English POS tagging corpus, ~5% fragment of the Penn Treebank, (C) LDC 1995
  - Raw, tagged, parsed, and combined data from the Wall Street Journal
  - 148120 tokens, 36 standard Treebank POS tags
  - http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
- HIT CIR LTP Corpora Sample
  - Integrated Chinese NER corpora; 10% of the whole corpora (open to the public)
  - 23751 tokens, 7 kinds of named entities
  - http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm
Environment
- Hardware
  - CPU: Q8300 Quad Core 2.50 GHz
  - Memory: 3 GB
- Software
  - Fedora 13 x86_64
  - Java 1.6.0_18
  - MALLET 2.0.6
Data Format and Labels
- Data format
  - One token per row, one feature per column
      Bill noun
      slept non-noun
      here non-noun
- Labels
  - Standard Treebank POS tags (36 tags in all)
      CC Coordinating conjunction | CD Cardinal number | DT Determiner | EX Existential there | FW Foreign word | IN Preposition or subordinating conjunction | JJ Adjective | JJR Adjective, comparative | JJS Adjective, superlative | LS List item marker | MD Modal | NN Noun, singular or mass | NNS Noun, plural | …
  - HIT named entity tags
      O not an NE | S- single-token NE | B- beginning of an NE | I- middle of an NE | E- end of an NE
      Nm numeral | Ni organization name | Ns place name | Nh person name | Nt time | Nr date | Nz proper noun
      Example: 美国 B-Ni 洛杉矶 I-Ni 警察局 E-Ni (United States, Los Angeles, Police Department: one organization name)
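The B-/I-/E-/S-/O scheme can be turned back into entity spans with a small decoder. The Python sketch below is an illustration under stated assumptions, not part of MALLET or the HIT tools: the function name `decode_entities` is hypothetical, spans are returned as (start, end, type) with an inclusive end index, and malformed sequences (for example an E- tag with no preceding B-) are silently dropped.

```python
def decode_entities(tags):
    """Convert a tag sequence such as B-Ni I-Ni E-Ni into (start, end, type) spans."""
    entities = []
    open_span = None  # (start index, type) of an entity begun by B- and not yet closed
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "S":
            entities.append((i, i, kind))  # single-token entity
            open_span = None
        elif prefix == "B":
            open_span = (i, kind)
        elif prefix == "I":
            if open_span is None or open_span[1] != kind:
                open_span = None  # I- without a matching B- is ignored
        elif prefix == "E":
            if open_span is not None and open_span[1] == kind:
                entities.append((open_span[0], i, kind))
            open_span = None
        else:
            open_span = None  # "O" or anything unrecognized closes any open span
    return entities

# The slide's example: three tokens forming one organization name.
example = decode_entities(["B-Ni", "I-Ni", "E-Ni"])
```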
Evaluation
Stage     Measure      pos        chunking   ner
Training  Instance #   3982       8936       1286
          Tokens #     95767      211727     20913
          Time         308m 23s   190m 50s   17m 13s
Test                   46452      47377      2829
          Accuracy     85.67%     93.97%     98.55%
          Precision    -          90.54%     86.89%
          Recall                  89.89%
          FB1                     90.21      86.89
          Time         15.80s     4.43s      0.8s
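As a consistency check on the results above, FB1 can be recomputed from precision and recall, assuming it is the usual harmonic mean of the two. With the reported precision of 90.54% and recall of 89.89%, the short Python sketch below reproduces the reported FB1 of 90.21. The function name `fb1` is chosen for the example.

```python
def fb1(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the results, in percent.
chunking_fb1 = round(fb1(90.54, 89.89), 2)  # 90.21
```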
DEMO
Q&A