MAchine Learning for LanguagE Toolkit

MALLET: MAchine Learning for LanguagE Toolkit

Outline
- About MALLET
- Representing Data
- Command Line Processing
- Simple Evaluation
- Conclusion

About MALLET
- Andrew McCallum. "MALLET: A Machine Learning for Language Toolkit." http://mallet.cs.umass.edu. 2002.
- Implemented in Java; currently version 2.0.6
- Motivation:
  - Text classification and information extraction
  - Commercial machine learning
  - Analysis and indexing of academic publications

About MALLET
Main idea
- Text focus: data is discrete rather than continuous, even when values could be continuous
How to use
- Command line scripts: bin/mallet [command] --[option] [value] …
- Text User Interface ("tui") classes
- Direct Java API: http://mallet.cs.umass.edu/api

Representations
- Transform text documents to vectors x1, x2, …
- Elements of a vector are called feature values
- Example: "Feature at row 345 is the number of times 'dog' appears in the document"
- Retain the meaning of vector indices
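The idea of turning documents into count vectors while keeping the meaning of each index can be sketched in a few lines of Python. This is only an illustration: the `Alphabet` class and `to_vector` function below are hypothetical toy names, not MALLET's Java API, and whitespace tokenization is an assumption.

```python
from collections import Counter


class Alphabet:
    """Toy bidirectional map between feature strings and integer indices."""

    def __init__(self):
        self.index_of = {}
        self.entries = []

    def lookup(self, entry):
        # Assign the next free index the first time a feature is seen.
        if entry not in self.index_of:
            self.index_of[entry] = len(self.entries)
            self.entries.append(entry)
        return self.index_of[entry]


def to_vector(text, alphabet):
    """Return a sparse count vector {feature index: count} for one document."""
    counts = Counter(text.lower().split())
    return {alphabet.lookup(word): n for word, n in counts.items()}


alphabet = Alphabet()
x1 = to_vector("the dog chased the other dog", alphabet)
x2 = to_vector("the cat slept", alphabet)
dog = alphabet.index_of["dog"]
print(x1[dog], x2.get(dog, 0))  # count of "dog" in each document
```

Because both documents share one alphabet, the same index always refers to the same word, which is what "retain the meaning of vector indices" asks for.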

Documents to Vectors

Instances

Command Line
- Importing Data
- Classification
- Sequence Tagging
- Topic Modeling

Importing Data
One instance per file
- Files are in the folders sample-data/web/en and sample-data/web/de
- Command line:
    bin/mallet import-dir --input sample-data/web/* --output web.mallet
One file, one instance per line
- File format: [URL] [language] [text of the page...]
- Command line:
    bin/mallet import-file --input /data/web/data.txt --output web.mallet
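To make the one-instance-per-line layout concrete, here is a rough Python sketch that splits each line into a name (the URL), a label (the language) and the page text. The regular expression and the sample lines are assumptions made for illustration; this is not the code MALLET runs internally.

```python
import re

# Assumed layout: name, label, then the rest of the line as data.
LINE_PATTERN = re.compile(r"^(\S+)\s+(\S+)\s+(.*)$")


def parse_instances(lines):
    """Yield (name, label, text) tuples, skipping blank or malformed lines."""
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            yield match.groups()


# Made-up sample lines in the [URL] [language] [text] format.
sample = [
    "http://example.org/a en Hello world, this is a page.",
    "",
    "http://example.org/b de Hallo Welt, das ist eine Seite.",
]
instances = list(parse_instances(sample))
for name, label, text in instances:
    print(label, name, len(text.split()))
```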

Classification
Training a classifier
    bin/mallet train-classifier --input training.mallet --output-classifier my.classifier
Choosing an algorithm
- MaxEnt, NaiveBayes, C45, DecisionTree and many others
    bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt
Evaluation
- Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances
    bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
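What a random 90%/10% split does can be illustrated as follows. This is a minimal sketch under stated assumptions: `split_instances` is a hypothetical helper, the instances are made up, and the fixed seed is there only to make the example repeatable. It does not reproduce MALLET's own shuffling.

```python
import random


def split_instances(instances, training_portion, seed=0):
    """Shuffle a copy of the instances and split it into (train, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]


# Twenty made-up labeled instances: (name, label).
labeled = [("doc%d" % i, "en" if i % 2 else "de") for i in range(20)]
train, test = split_instances(labeled, training_portion=0.9)
print(len(train), len(test))
```

Every instance ends up in exactly one of the two parts, so the test instances are never seen during training.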

Sequence Tagging
Sequence algorithms
- Hidden Markov models (HMMs)
- Linear-chain conditional random fields (CRFs)
SimpleTagger
- A command line interface to the MALLET Conditional Random Field (CRF) class

SimpleTagger
Input file: one token per line, in the form [feature1 feature2 ... featuren label]
    Bill CAPITALIZED noun
    slept non-noun
    here LOWERCASE STOPWORD non-noun
Train a CRF
- The input file is "sample"
- The trained CRF is stored in the file "nouncrf"
    java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
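The input layout above (features first, label last) can be read with a short Python sketch like the one below. `read_sequences` is a hypothetical helper, and treating blank lines as sequence separators is an assumption of this sketch rather than something stated on the slide.

```python
def read_sequences(lines):
    """Parse 'feature ... label' lines; blank lines separate sequences."""
    sequences, current = [], []
    for line in lines:
        tokens = line.split()
        if not tokens:
            if current:
                sequences.append(current)
                current = []
            continue
        # All tokens but the last are features; the last one is the label.
        current.append((tokens[:-1], tokens[-1]))
    if current:
        sequences.append(current)
    return sequences


sample = """\
Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
""".splitlines()

sequences = read_sequences(sample)
for features, label in sequences[0]:
    print(label, features)
```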

SimpleTagger
A file "stest" that needs to be labeled
    CAPITAL Al
    slept
    here
Label the input
    java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest
Output
    Number of predicates: 5
    noun CAPITAL Al
    non-noun slept
    non-noun here

Topic Modeling
Building topic models
    bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz
Options
- --input [FILE]
- --num-topics [NUMBER]: The number of topics to use. The best number depends on what you are looking for in the model.
- --num-iterations [NUMBER]: The number of sampling iterations; a trade-off between the time taken to complete sampling and the quality of the topic model.
- --output-state [FILENAME]: Outputs a compressed text file containing the words in the corpus with their topic assignments.
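Once a state file with per-word topic assignments exists, it can be summarized with a few lines of Python. The sketch below is hedged: it assumes that comment lines start with "#" and that, on every other line, the last column is the topic id and the second-to-last is the word. Check the actual file before relying on that layout. The rows used here are made up and held in memory instead of being read from topic-state.gz.

```python
import gzip
import io
from collections import Counter, defaultdict


def top_words_per_topic(stream, n=3):
    """Count word/topic assignments from whitespace-separated state lines.

    Assumption: comment lines start with '#', and on every other line the
    last column is the topic id and the second-to-last is the word.
    """
    counts = defaultdict(Counter)
    for line in stream:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        word, topic = fields[-2], int(fields[-1])
        counts[topic][word] += 1
    return {t: [w for w, _ in c.most_common(n)] for t, c in counts.items()}


# Tiny in-memory stand-in for a compressed state file (made-up rows).
rows = (
    "#doc source pos typeindex type topic\n"
    "0 a.txt 0 0 dog 1\n"
    "0 a.txt 1 1 cat 1\n"
    "0 a.txt 2 0 dog 1\n"
    "1 b.txt 0 2 bank 0\n"
)
blob = gzip.compress(rows.encode("utf-8"))
with gzip.open(io.BytesIO(blob), "rt", encoding="utf-8") as state:
    top = top_words_per_topic(state)
print(top)
```

To run it on a real file, the same function can be called on `gzip.open("topic-state.gz", "rt", encoding="utf-8")`.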

Demo

Methodology
- Focus on the sequence tagging module in MALLET (CRF-based implementation)
- Some scripts were written for importing data and evaluating results
- Small corpora collected from the web, divided into two parts: 80% for training, 20% for testing
- Evaluate both POS tagging and Named Entity Recognition (NER)
  - Training performance
  - Accuracy (POS tagging); Precision, Recall and FB1 (NER)
- All scripts, corpora and results can be found at http://mallet-eval.googlecode.com
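The metrics named above can be computed as in the following Python sketch. It assumes that FB1 is the usual balanced F-measure, that an entity counts as correct only on an exact span and type match, and that entities are represented as (start, end, type) tuples. The function names and the sample data are illustrative and are not taken from the evaluation scripts mentioned on the slide.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold ones; both are sets of
    (start, end, type) tuples. An entity counts only on an exact match."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    matches = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return matches / len(gold_tags)


# Made-up gold and predicted entities.
gold = {(0, 3, "Ni"), (5, 6, "Nh"), (9, 10, "Ns")}
pred = {(0, 3, "Ni"), (5, 6, "Ns"), (9, 10, "Ns"), (12, 13, "Nt")}
p, r, f1 = precision_recall_f1(gold, pred)
print(round(p, 4), round(r, 4), round(f1, 4))
print(accuracy(["DT", "NN", "VBD"], ["DT", "NN", "NN"]))
```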

A Survey of Named Entity Corpora
Well-known named entity corpora
- Language-Independent Named Entity Recognition at CoNLL-2003
  - A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
  - Free and public, but needs the RCV1 raw texts as input
- Message Understanding Conference (MUC) 6 / 7
  - Not free
- Automatic Content Extraction (ACE) Training Corpus
Other special-purpose corpora
- Enron Email Dataset
  - Email messages in this corpus are tagged with person names, dates and times
- A variety of biomedical corpora
  - Some corpora in this collection are tagged with entities in the biomedical domain, such as gene names

Small Corpora
Two small corpora collected from the web
Penn Treebank Sample
- English POS tagging corpus, a ~5% fragment of the Penn Treebank, (C) LDC 1995
- Raw, tagged, parsed and combined data from the Wall Street Journal
- 148120 tokens, 36 standard Treebank POS tags
- http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
HIT CIR LTP Corpora Sample
- Chinese NER corpus, an integrated 10% sample of the whole corpora (open to the public)
- 23751 tokens, 7 kinds of named entities
- http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm

Environment
Hardware
- CPU: Q8300 Quad Core 2.50 GHz
- Memory: 3GB
Software
- Fedora 13 x86_64
- Java 1.6.0_18
- MALLET 2.0.6

Data Format and Labels
Data format
- One token per row, one feature per column
    Bill noun
    slept non-noun
    here non-noun
Labels
Standard Treebank POS tags (36 tags in all)
    CC   Coordinating conjunction
    CD   Cardinal number
    DT   Determiner
    EX   Existential there
    FW   Foreign word
    IN   Preposition or subordinating conjunction
    JJ   Adjective
    JJR  Adjective, comparative
    JJS  Adjective, superlative
    LS   List item marker
    MD   Modal
    NN   Noun, singular or mass
    NNS  Noun, plural
    …
HIT named entity tags
- Position prefixes
    O    Not a named entity (NE)
    S-   A single token that forms an NE by itself
    B-   Beginning of an NE
    I-   Middle of an NE
    E-   End of an NE
- Entity types
    Nm   Numeral
    Ni   Organization name
    Ns   Place name
    Nh   Person name
    Nt   Time
    Nr   Date
    Nz   Proper noun
- Example ("United States", "Los Angeles", "Police Department" tagged as one organization name):
    美国 B-Ni
    洛杉矶 I-Ni
    警察局 E-Ni
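The O / S- / B- / I- / E- scheme can be decoded back into whole entities with a sketch like the one below. `decode_entities` is a hypothetical helper; the first three tokens come from the example on the slide, while the last two tokens and their tags are made up to show the O and S- cases. Silently dropping malformed tag sequences is a design choice of this sketch.

```python
def decode_entities(tokens, tags):
    """Turn O / S- / B- / I- / E- tags into (start, end, type, text) spans.

    `end` is exclusive. Malformed sequences (e.g. I- without B-) are dropped.
    """
    entities, start, kind = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            entities.append((i, i + 1, label, tokens[i]))
            start, kind = None, None
        elif prefix == "B":
            start, kind = i, label
        elif prefix == "E" and start is not None and label == kind:
            entities.append((start, i + 1, label, "".join(tokens[start:i + 1])))
            start, kind = None, None
        elif prefix == "I" and start is not None and label == kind:
            continue  # still inside the current entity
        else:
            start, kind = None, None  # "O" or an inconsistent tag
    return entities


tokens = ["美国", "洛杉矶", "警察局", "在", "北京"]
tags = ["B-Ni", "I-Ni", "E-Ni", "O", "S-Ns"]
print(decode_entities(tokens, tags))
```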

Evaluation

Stage     Measure      pos        chunking   ner
Training  Instance #   3982       8936       1286
          Tokens #     95767      211727     20913
          Time         308m 23s   190m 50s   17m 13s
Test      Tokens #     46452      47377      2829
          Accuracy     85.67%     93.97%     98.55%
          Precision    -          90.54%     86.89%
          Recall                  89.89%
          FB1                     90.21      86.89
          Time         15.80s     4.43s      0.8s
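As a consistency check on the chunking figures, and assuming FB1 denotes the usual balanced F-measure (the harmonic mean of precision P and recall R), the reported value follows from the reported precision and recall:

```latex
F_{\beta=1} = \frac{2PR}{P+R}
            = \frac{2 \times 90.54 \times 89.89}{90.54 + 89.89}
            = \frac{16277.28}{180.43}
            \approx 90.21
```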

DEMO

Q&A