MAchine Learning for LanguagE Toolkit

Name: MAchine Learning for LanguagE Toolkit
Uploaded: 2017-08-22T23:38:07+00:00
Duration: PTM12S37
Channel: Kerry Fletcher
Description: MAchine Learning for LanguagE Toolkit

MALLET: MAchine Learning for LanguagE Toolkit

Outline
  About MALLET
  Representing Data
  Command Line Processing
  Simple Evaluation
  Conclusion

About MALLET
  "MALLET: A Machine Learning for Language Toolkit," written by Andrew McCallum
  Implemented in Java, currently version 2.0.6
  Motivation:
    Text classification and information extraction
    Commercial machine learning
    Analysis and indexing of academic publications

About MALLET
  Main idea
    Text focus: data is discrete rather than continuous, even when values could be continuous
  How to
    Command line scripts: bin/mallet [command] --[option] [value] …
    Text User Interface (“tui”) classes
    Direct Java API

Representations
  Transform text documents to vectors x1, x2, …
  Elements of a vector are called feature values
    Example: “Feature at row 345 is the number of times ‘dog’ appears in the document”
  Retain the meaning of vector indices
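
The idea on this slide can be sketched in a few lines of Python. This is only an illustration and not MALLET's own code: the function names `build_alphabet` and `to_feature_vector`, the whitespace tokenization, and the two example sentences are all assumptions made for the sketch. The point it shows is that a fixed token-to-index mapping is what lets each vector position keep its meaning.

```python
from collections import Counter


def build_alphabet(documents):
    """Assign each distinct token a fixed integer index, in first-seen order."""
    alphabet = {}
    for doc in documents:
        for token in doc.lower().split():
            alphabet.setdefault(token, len(alphabet))
    return alphabet


def to_feature_vector(document, alphabet):
    """Return a count vector whose i-th entry counts the token with index i."""
    counts = Counter(document.lower().split())
    vector = [0] * len(alphabet)
    for token, count in counts.items():
        if token in alphabet:  # tokens not in the alphabet are ignored
            vector[alphabet[token]] = count
    return vector


# Hypothetical two-document corpus.
docs = ["the dog chased the cat", "the cat slept"]
alphabet = build_alphabet(docs)
vectors = [to_feature_vector(d, alphabet) for d in docs]
```

Because `alphabet` is kept alongside the vectors, position `alphabet["dog"]` can always be read back as "number of times 'dog' appears".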

Documents to Vectors

Instances

Developing with MALLET Conclusion

Command Line
  Importing Data
  Classification
  Sequence Tagging
  Topic Modeling

Importing Data
  One instance per file
    files in the folder: sample-data/web/en or sample-data/web/de
    command line: bin/mallet import-dir --input sample-data/web/* --output web.mallet
  One file, one instance per line
    file format: [URL] [language] [text of the page...]
    command line: bin/mallet import-file --input /data/web/data.txt --output web.mallet
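
To make the one-instance-per-line format concrete, here is a rough Python sketch that splits a line into the three fields named on the slide. The regular expression is an assumption: it is meant to resemble the name/label/data pattern that import-file applies by default, but it should be checked against the MALLET documentation before being relied on. The function name and the example URL are made up for illustration.

```python
import re

# Assumed pattern: first token is the instance name (here a URL), second
# token is the label (here a language), the rest of the line is the data.
LINE_PATTERN = re.compile(r"^(\S*)[\s,]*(\S*)[\s,]*(.*)$")


def parse_instance_line(line):
    """Split one line of the import file into name, label and data."""
    match = LINE_PATTERN.match(line.rstrip("\n"))
    name, label, data = match.groups()
    return {"name": name, "label": label, "data": data}


# Hypothetical input line in the [URL] [language] [text...] layout.
example = parse_instance_line("http://example.org/page1 en Some text of the page\n")
```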

Classification
  Training a classifier
    bin/mallet train-classifier --input training.mallet --output-classifier my.classifier
  Choosing an algorithm
    MaxEnt, NaiveBayes, C45, DecisionTree and many others.
    bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt
  Evaluation
    Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances.
    bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
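
What a 0.9 training portion means can be shown with a small Python sketch. It is not how MALLET implements the option; `split_instances`, the fixed seed, and the use of integers as stand-in instances are assumptions made for the example.

```python
import random


def split_instances(instances, training_portion, seed=0):
    """Shuffle the instances and cut them into a training and a testing part."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(training_portion * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# 100 stand-in instances, 90% for training and the remaining 10% for testing.
train, test = split_instances(range(100), 0.9)
```

Every instance ends up in exactly one of the two parts, so the classifier is evaluated only on data it did not see during training.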

Sequence Tagging
  Sequence algorithms
    hidden Markov models (HMMs)
    linear-chain conditional random fields (CRFs)
  SimpleTagger
    a command line interface to the MALLET Conditional Random Field (CRF) class

SimpleTagger
  Input file: [feature1 feature2 ... featuren label]
    Bill CAPITALIZED noun
    slept non-noun
    here LOWERCASE STOPWORD non-noun
  Train a CRF
    An input file “sample”
    A trained CRF in the file “nouncrf”
    java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
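
The input layout above (features first, label last) can be read with a short Python sketch. This is an illustration and not part of MALLET; the function name is made up, and treating a blank line as the boundary between sequences is an assumption about the file format.

```python
def read_tagged_sequences(lines):
    """Parse lines of 'feature1 ... featuren label' into sequences of tokens."""
    sequences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:  # assumed: a blank line ends the current sequence
            if current:
                sequences.append(current)
                current = []
            continue
        *features, label = fields  # the last field is the label
        current.append((features, label))
    if current:
        sequences.append(current)
    return sequences


# The three-token example from the slide.
sample = """Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
"""
sequences = read_tagged_sequences(sample.splitlines())
```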

SimpleTagger
  A file “stest” that needs to be labeled
    CAPITAL Al
    slept
    here
  Label the input
    java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest
  Output
    Number of predicates: 5
    noun CAPITAL Al
    non-noun slept
    non-noun here

Topic Modeling
  Building Topic Models
    bin/mallet train-topics --input topic-input.mallet --num-topics [NUMBER] --output-state topic-state.gz
  Options
    --input [FILE]
    --num-topics [NUMBER]
      The number of topics to use. The best number depends on what you are looking for in the model.
    --num-iterations [NUMBER]
      The number of sampling iterations is a trade-off between the time taken to complete sampling and the quality of the topic model.
    --output-state [FILENAME]
      Outputs a compressed text file containing the words in the corpus with their topic assignments.
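
As a rough illustration of what can be done with the file written by --output-state, the Python sketch below counts how often each word is assigned to each topic. The column order (doc, source, pos, typeindex, type, topic) and the `#` header lines are assumptions about the state-file layout, and the tiny in-memory file is made up; check a real file before relying on this.

```python
import gzip
import io
from collections import Counter, defaultdict


def topic_word_counts(lines):
    """Count how often each word is assigned to each topic."""
    counts = defaultdict(Counter)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        # Assumed column order: doc source pos typeindex type topic
        doc, source, pos, typeindex, word, topic = line.split()
        counts[int(topic)][word] += 1
    return counts


# Made-up state file, gzip-compressed in memory for illustration only.
fake_state = (
    "#doc source pos typeindex type topic\n"
    "0 doc0.txt 0 0 dog 1\n"
    "0 doc0.txt 1 1 cat 1\n"
    "1 doc1.txt 0 0 dog 0\n"
)
compressed = gzip.compress(fake_state.encode("utf-8"))
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as handle:
    counts = topic_word_counts(handle)
```

For a real run, the same function could be fed `gzip.open("topic-state.gz", "rt", encoding="utf-8")` instead of the in-memory buffer.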

Methodology
  Focus on the sequence tagging module in MALLET
    CRF-based implementation
    Some scripts written for importing data and evaluating results
  Small corpora collected from the web
    Divided into two parts: 80% for training, 20% for testing
  Evaluate both POS tagging and Named Entity Recognition
    The performance of training
    Accuracy (POS tagging) and Precision, Recall and FB1 (NER)
  All scripts, corpora and results can be found here
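
For reference, the measures named above can be computed as in the following Python sketch. It uses the standard definitions (FB1 is the harmonic mean of precision and recall); the function names and the example counts are hypothetical, and the sketch is not the evaluation script used for the presentation. The last line applies the formula to the chunking precision and recall reported on the evaluation slide (90.54% and 89.89%).

```python
def fb1_from_rates(precision, recall):
    """FB1: the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_fb1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and FB1 from entity-level counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall, fb1_from_rates(precision, recall)


# Hypothetical counts: 80 correct entities, 20 spurious, 20 missed.
p, r, f = precision_recall_fb1(80, 20, 20)

# FB1 implied by the chunking precision and recall from the evaluation slide.
chunking_fb1 = 100 * fb1_from_rates(0.9054, 0.8989)
```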

A Survey of Named Entity Corpora
  Well-known named entity corpora
    Language-Independent Named Entity Recognition at CoNLL-2003
      A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
      free and public, but needs the RCV1 raw texts as input
    Message Understanding Conference (MUC) 6 / 7
      not free
    Automatic Content Extraction (ACE) Training Corpus
  Other special-purpose corpora
    Enron Dataset
      messages in this corpus are tagged with person names, dates and times
    A variety of biomedical corpora
      some corpora in this collection are tagged with entities in the biomedical domain, such as gene names

Small Corpora
  Two small corpora collected from the web
  Penn Treebank Sample
    English POS tagging corpus, ~5% fragment of the Penn Treebank, (C) LDC 1995
    raw, tagged, parsed and combined data from the Wall Street Journal
    tokens, 36 standard treebank POS tags
  HIT CIR LTP Corpora Sample
    Chinese NER corpus
    integrated 10% of the whole corpus (open to the public)
    23751 tokens, 7 kinds of named entities

Environment
  Hardware
    CPU: Q8300 Quad Core 2.50 GHz
    Memory: 3GB
  Software
    Fedora 13 x86_64
    Java 1.6.0_18
    MALLET 2.0.6

Evaluation
  Stage      Measure       pos        chunking   ner
  Training   Instance #    3982       8936       1286
             Tokens #      95767      211727     20913
             Time          308m 23s   190m 50s   17m 13s
  Test       Tokens #      46452      47377      2829
             Accuracy      85.67%     93.97%     98.55%
             Precision     -          90.54%     86.89%
             Recall        -          89.89%
             FB1           -          90.21      86.89
             Time          15.80s     4.43s      0.8s
