MAchine Learning for LanguagE Toolkit


1 MAchine Learning for LanguagE Toolkit
Mallet MAchine Learning for LanguagE Toolkit

2 Outline About MALLET Representing Data Command Line Processing
Simple Evaluation Conclusion

3 Outline About MALLET Representing Data Command Line Processing
Simple Evaluation Conclusion

4 About MALLET
"MALLET: A Machine Learning for Language Toolkit," written by Andrew McCallum
Implemented in Java, currently version 2.0.6
Motivation:
  Text classification and information extraction
  Commercial machine learning
  Analysis and indexing of academic publications

5 About MALLET
Main idea
  Text focus: data is discrete rather than continuous, even when values could be continuous
How to
  Command line scripts: bin/mallet [command] --[option] [value] …
  Text User Interface ("tui") classes
  Direct Java API

6 Outline About MALLET Representing Data Command Line Processing
Simple Evaluation Conclusion

7 Representations
Transform text documents to vectors x1, x2, …
Elements of a vector are called feature values
  Example: "Feature at row 345 is the number of times 'dog' appears in the document"
Retain the meaning of vector indices
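
The idea on this slide, turning documents into count vectors while remembering which word each index stands for, can be sketched in a few lines of Python. This is an illustration only, not MALLET's implementation; the function names and the two toy documents are made up for the example, and tokenization is plain whitespace splitting.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign each distinct token a fixed integer index, in first-seen order."""
    vocabulary = {}
    for text in documents:
        for token in text.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def to_count_vector(text, vocabulary):
    """Return a dense vector whose i-th entry counts the token with index i."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        index = vocabulary.get(token)
        if index is not None:  # tokens outside the vocabulary are ignored
            vector[index] = count
    return vector

# Hypothetical toy corpus, not taken from the slides.
docs = ["the dog chased the cat", "the cat slept"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]

print(vocab)    # the mapping is what keeps vector indices meaningful
print(vectors)
```

Because the vocabulary is built once and reused, the entry at a given index means the same word in every vector, which is the point of "retain the meaning of vector indices".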

8 Documents to Vectors

9 Documents to Vectors

10 Documents to Vectors

11 Documents to Vectors

12 Documents to Vectors

13 Instances

14 Instances

15 Instances

16 Outline About MALLET Representing Data Command Line Processing
Developing with MALLET Conclusion

17 Command Line
Importing Data
Classification
Sequence Tagging
Topic Modeling

18 Importing Data
One instance per file
  Files in the folder: sample-data/web/en or sample-data/web/de
  Command line: bin/mallet import-dir --input sample-data/web/* --output web.mallet
One file, one instance per line
  File format: [URL] [language] [text of the page...]
  Command line: bin/mallet import-file --input /data/web/data.txt --output web.mallet
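
To make the one-instance-per-line format concrete, here is a small Python sketch that splits such a line into its three fields. It only illustrates the layout [URL] [language] [text of the page...] described above; it is not how MALLET itself parses the file, and the function name and the example line are hypothetical.

```python
def parse_instance_line(line):
    """Split one line into (name, label, data).

    The first two whitespace-separated fields are taken as the instance
    name (here a URL) and the label (here a language); everything after
    them is the instance text.
    """
    parts = line.strip().split(None, 2)
    if len(parts) < 3:
        raise ValueError("expected '[name] [label] [text...]'")
    name, label, data = parts
    return name, label, data

# Hypothetical example line, not taken from sample-data.
example = "http://example.org/page1 en This is the text of the page"
name, label, data = parse_instance_line(example)
print(name)   # http://example.org/page1
print(label)  # en
print(data)   # This is the text of the page
```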

19 Classification
Training a classifier
  bin/mallet train-classifier --input training.mallet --output-classifier my.classifier
Choosing an algorithm
  MaxEnt, NaiveBayes, C45, DecisionTree and many others.
  bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt
Evaluation
  Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances.
  bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
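
What a random 90%/10% split and an accuracy score amount to can be sketched as follows. This is a generic illustration under simple assumptions (shuffle, then cut), not a description of what --training-portion does internally; the function names and the seed parameter are invented for the example.

```python
import random

def split_instances(instances, training_portion=0.9, seed=0):
    """Shuffle the instances and cut them into (training, testing) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(training_portion * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Hypothetical data: 20 instance ids standing in for labeled documents.
training, testing = split_instances(range(20), training_portion=0.9)
print(len(training), len(testing))
```

The classifier is trained only on the training list and scored only on the testing list, so the reported accuracy reflects instances it has not seen.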

20 Sequence Tagging
Sequence algorithms
  Hidden Markov models (HMMs)
  Linear-chain conditional random fields (CRFs)
SimpleTagger
  A command line interface to the MALLET Conditional Random Field (CRF) class

21 SimpleTagger
Input file: one token per line, in the format [feature1 feature2 ... featuren label]
  Bill CAPITALIZED noun
  slept non-noun
  here LOWERCASE STOPWORD non-noun
Train a CRF
  Input: the file "sample"
  Output: a trained CRF in the file "nouncrf"
  java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
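
The input layout, features followed by a label on each token line, can be read with a short Python sketch like the one below. It is an illustration of the file format only, not SimpleTagger's own reader. The function name is hypothetical, and treating blank lines as sequence separators is an assumption made for this example.

```python
def parse_tagger_input(text):
    """Parse SimpleTagger-style training text into sequences.

    Each non-blank line is one token: zero or more features followed by
    the label as the last field. Blank lines are assumed (for this
    sketch) to separate sequences.
    """
    sequences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:  # blank line closes the current sequence
            if current:
                sequences.append(current)
                current = []
            continue
        *features, label = fields
        current.append((features, label))
    if current:
        sequences.append(current)
    return sequences

sample = """Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
"""
for features, label in parse_tagger_input(sample)[0]:
    print(label, features)
```

Note that the word itself ("Bill", "slept", "here") is just another feature; only the last field on a line is treated as the label.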

22 SimpleTagger
Label the input
  A file "stest" that needs to be labeled:
    CAPITAL Al
    slept
    here
  java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest
Output
  Number of predicates: 5
  noun CAPITAL Al
  non-noun slept
  non-noun here

23 Topic Modeling
Building topic models
  bin/mallet train-topics --input topic-input.mallet --num-topics [NUMBER] --output-state topic-state.gz
--input [FILE]
--num-topics [NUMBER]
  The number of topics to use. The best number depends on what you are looking for in the model.
--num-iterations [NUMBER]
  The number of sampling iterations is a trade-off between the time taken to complete sampling and the quality of the topic model.
--output-state [FILENAME]
  Outputs a compressed text file containing the words in the corpus with their topic assignments.

24 Demo

25 Outline About MALLET Representing Data Command Line Processing
Simple Evaluation Conclusion

26 Methodology
Focus on the sequence tagging module in MALLET
  CRF-based implementation
  Some scripts written for importing data and evaluating results
Small corpora collected from the web
  Divided into two parts: 80% for training, 20% for testing
Evaluate both POS tagging and Named Entity Recognition (NER)
  The performance of training
  Accuracy (POS tagging) and Precision, Recall and FB1 (NER)
All scripts, corpora and results can be found here

27 A Survey of Named Entity Corpora
Well-known named entity corpora
  Language-Independent Named Entity Recognition at CoNLL-2003
    A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
    Free and public, but needs the RCV1 raw texts as input
  Message Understanding Conference (MUC) 6 / 7
    Not free
  Automatic Content Extraction (ACE) Training Corpus
Other special-purpose corpora
  Enron Dataset
    Messages in this corpus are tagged with person names, dates and times.
  A variety of biomedical corpora
    Some corpora in this collection are tagged with entities in the biomedical domain, such as gene names.

28 Small Corpora
Two small corpora collected from the web
Penn Treebank Sample
  English POS tagging corpus, ~5% fragment of the Penn Treebank, (C) LDC 1995
  Raw, tagged, parsed and combined data from the Wall Street Journal
  tokens, 36 standard Treebank POS tags
HIT CIR LTP Corpora Sample
  Chinese NER corpus, an integrated 10% of the whole corpora (open to the public)
  23751 tokens, 7 kinds of named entities

29 Environment
Hardware
  CPU: Q8300 Quad Core 2.50 GHz
  Memory: 3GB
Software
  Fedora 13 x86_64
  Java 1.6.0_18
  MALLET 2.0.6

30 Data Format and Labels
Data format
  Each token is one row, each feature is one column
    Bill noun
    slept non-noun
    Here non-noun
Labels
  Standard Treebank POS tags
    CC Coordinating conjunction | CD Cardinal number | DT Determiner | EX Existential there | FW Foreign word | IN Preposition or subordinating conjunction | JJ Adjective | JJR Adjective, comparative | JJS Adjective, superlative | LS List item marker | MD Modal | NN Noun, singular or mass | NNS Noun, plural … … (36 tags in all)
  HIT named entity tags
    Position: O not an NE | S- an NE on its own | B- beginning of an NE | I- middle of an NE | E- end of an NE
    Type: Nm numeral | Ni organization name | Ns place name | Nh person name | Nt time | Nr date | Nz proper noun
    Example: 美国 B-Ni 洛杉矶 I-Ni 警察局 E-Ni (United States / Los Angeles / Police Department, tagged as one organization name)
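
The B-/I-/E-/S-/O tagging scheme above can be turned back into whole entities with a short decoder. The Python sketch below is one straightforward way to do it, written for this transcript and not taken from the presenter's scripts; the function name is hypothetical, and tokens are joined without spaces because the example is Chinese.

```python
def decode_entities(tagged_tokens):
    """Collect (entity_text, entity_type) pairs from B-/I-/E-/S-/O tags."""
    entities, buffer, current_type = [], [], None
    for token, tag in tagged_tokens:
        position, _, entity_type = tag.partition("-")
        if position == "S":
            # A single token forms a complete entity.
            entities.append((token, entity_type))
            buffer, current_type = [], None
        elif position == "B":
            # Start collecting a multi-token entity.
            buffer, current_type = [token], entity_type
        elif position in ("I", "E") and buffer and entity_type == current_type:
            buffer.append(token)
            if position == "E":
                entities.append(("".join(buffer), entity_type))
                buffer, current_type = [], None
        else:
            # "O" or a malformed tag sequence: drop any partial entity.
            buffer, current_type = [], None
    return entities

example = [("美国", "B-Ni"), ("洛杉矶", "I-Ni"), ("警察局", "E-Ni")]
print(decode_entities(example))  # [('美国洛杉矶警察局', 'Ni')]
```

Scoring NER at the entity level (precision, recall, FB1) typically compares such decoded spans rather than individual token tags.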

31 Evaluation
Stage     Measure      pos        chunking   ner
Training  Instance #   3982       8936       1286
Training  Tokens #     95767      211727     20913
Training  Time         308m 23s   190m 50s   17m 13s
Test                   46452      47377      2829
Test      Accuracy     85.67%     93.97%     98.55%
Test      Precision    -          90.54%     86.89%
Test      Recall                  89.89%
Test      FB1                     90.21      86.89
Test      Time         15.80s     4.43s      0.8s
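
FB1 is the F-measure with beta = 1, the harmonic mean of precision and recall. As a check, the Python sketch below recomputes it from the precision of 90.54% and recall of 89.89% reported above, which reproduces the reported FB1 of 90.21. The function name is chosen for this example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as percentages, taken from the evaluation slide.
fb1 = f_beta(90.54, 89.89)
print(round(fb1, 2))  # 90.21
```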

32 DEMO

33 Q&A

