A Parallel Implementation of Conditional Random Fields

A Parallel Implementation of Conditional Random Fields
- This was an AUSS/NIP project for the grant "Developing an Entity Extractor for the Scalable Construction of Semantically Rich Socio-Technical Network Data" by Jana Diesner of UIUC.
- Mostly I worked with Brent Fegley, her research assistant.
- This is machine learning, and thus NIP.

Motivations
- For them: some of the problems they want to run take weeks with the serial code.
- For us: machine learning is an obvious candidate for HPC.

What is CRF?
- The original paper: "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data" by Lafferty, McCallum, and Pereira (2001).
- Goal (for example): add part-of-speech labels to text.
- Idea: use a hidden Markov-style process in which each label depends on the previous labels.
- Optimize the transition weights in the model to get the predicted parts of speech as close to the correct values as possible, for example by maximizing the entropy of the solution.
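For reference, the linear-chain CRF from that paper models the conditional probability of a label sequence y given an input sequence x as (standard notation, not copied from these slides):

P(\mathbf{y}\mid\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\exp\Big(\sum_{t}\sum_{k}\lambda_k\, f_k(y_{t-1},y_t,\mathbf{x},t)\Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'}\exp\Big(\sum_{t}\sum_{k}\lambda_k\, f_k(y'_{t-1},y'_t,\mathbf{x},t)\Big)

Training chooses the weights \lambda_k to maximize the log-likelihood of the labeled data; evaluating that objective and its gradient over every training sequence is the expensive sum that the rest of these slides parallelize.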

Diesner specifics
- The goal for Diesner's group is to identify entities within unlabeled text.
- 95 training examples; roughly 2500 labeled paragraphs from WSJ.
- Two schemes: 'BOUNDARY' (5 labels) and 'CATEGORY' (95 labels, much longer running).
- Example paragraph: "Intel Corp. reported a 50 % drop in third-quarter net income, partly because of a one-time charge for discontinued operations. The big semiconductor and computer maker said it had net of $ 72 million, or 38 cents, down 50 % from $ million, or 78 cents a share. The lower net included a charge of $ 35 million, equal to 12 cents a share on an after-tax basis, for the cost of abandoning a computer-systems joint venture with Siemens AG of West Germany. Earnings also fell from the year-ago period because of slowing microchip demand."

Sarawagi Implementation of CRF
- Diesner et al. use a Java implementation developed by Sunita Sarawagi of IITB around 2006.
- DataIter iterates over DataSequence instances, so a DataSequence is some labeled text.
- FeatureGenerator takes a DataSequence and provides a set of features, e.g. text tags or parts of speech.
- CRF.Trainer optimizes the weights.
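To make the division of labor concrete, here is a rough sketch of how those pieces relate. These interfaces only mimic the roles described on the slide; they are not the exact iitb.CRF signatures, and the method names are illustrative.

import java.util.List;

// Illustrative mock of the roles in Sarawagi's package, not its real API.
interface DataSequence {
    int length();          // number of tokens in this labeled example
    String token(int i);   // the i-th token of the text
    int label(int i);      // the gold label for token i
}

interface DataIter {
    void startScan();      // rewind to the first training sequence
    boolean hasNext();
    DataSequence next();
}

interface FeatureGenerator {
    // Features that fire at position pos of a sequence,
    // e.g. word identity, capitalization, part-of-speech tag.
    List<String> featuresAt(DataSequence seq, int pos);
}

The trainer then only ever sees sequences through DataIter and features through FeatureGenerator, which is what makes the package flexible (and, as discussed later, what complicates early thread creation).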

Illinois application
- RoundRobinEstimator takes each set of 4 of the 5 training sets, trains the CRF, and tests the prediction against the remaining set.
- Two cases: BOUNDARY (5 labels) and CATEGORY (95 labels).
- Typical test case: 10 optimization steps for BOUNDARY, 3 for CATEGORY.
- I worked directly from their SVN repo (source version control) using Eclipse and Maven.
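The round-robin scheme is ordinary hold-one-out cross-validation over the five training sets. A minimal sketch of the loop, assuming placeholder trainCRF and evaluate methods (this is not the actual RoundRobinEstimator code):

import java.util.ArrayList;
import java.util.List;

public class RoundRobinSketch {
    // Hold out each of the 5 sets in turn, train on the other 4, test on the held-out one.
    static double roundRobin(List<List<String>> trainingSets) {
        double totalScore = 0.0;
        for (int held = 0; held < trainingSets.size(); held++) {
            List<String> trainData = new ArrayList<>();
            for (int i = 0; i < trainingSets.size(); i++) {
                if (i != held) trainData.addAll(trainingSets.get(i));
            }
            Object model = trainCRF(trainData);                      // placeholder for CRF training
            totalScore += evaluate(model, trainingSets.get(held));   // placeholder for scoring
        }
        return totalScore / trainingSets.size();                     // average score over the 5 folds
    }

    static Object trainCRF(List<String> data) { return new Object(); }     // hypothetical stub
    static double evaluate(Object model, List<String> test) { return 0; }  // hypothetical stub
}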

Serial Code Architecture
- Basically the problem is gradient descent in many dimensions.
- Uses LBFGS, a Java port of the good old Fortran code.
- LBFGS has internal state, which frustrated Fegley's threading efforts.
- Calls alternate between LBFGS (setting the next test point) and computeFunctionGradient (evaluating the sum).
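The training loop therefore looks roughly like the following. This is a schematic of the call pattern only; computeFunctionGradient and lbfgsStep are stubs standing in for the real routines, not the project's Trainer class.

public class SerialLoopSketch {
    // Real code: run forward-backward over every training sequence and
    // accumulate log-likelihood and gradient terms. Stubbed here.
    static double computeFunctionGradient(double[] lambda, double[] grad) {
        return 0.0;
    }

    // Real code: one L-BFGS update of lambda, using the optimizer's internal
    // history state (the state that makes threading the optimizer itself hard).
    static void lbfgsStep(double[] lambda, double f, double[] grad) {
    }

    public static void main(String[] args) {
        int numFeatures = 1000, maxIters = 10;
        double[] lambda = new double[numFeatures];  // model weights
        double[] grad = new double[numFeatures];
        for (int iter = 0; iter < maxIters; iter++) {
            double logLik = computeFunctionGradient(lambda, grad); // expensive: sum over all training data
            lbfgsStep(lambda, logLik, grad);                       // cheap: pick the next test point
        }
    }
}

The expensive half of each iteration is the sum over training examples, which is why the parallelization effort targets computeFunctionGradient rather than the optimizer.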

(Some of) Java's Support for Parallel Threads
- A java.util.concurrent.ExecutorService maintains a pool of threads that take tasks from a parallel queue and return values via futures.
- We can create one using newFixedThreadPool().
- The threads and tasks have to be customized.
[Diagram: Task -> parallel queue -> worker threads -> promise (future) queue -> Result]
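A minimal, standalone example of that pattern (not code from the project): submit Callable tasks to a fixed pool and collect the results through Futures.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);  // 4 worker threads

        // Submit one task per chunk of work; each returns a partial sum via a Future.
        List<Future<Double>> futures = new ArrayList<>();
        for (int chunk = 0; chunk < 8; chunk++) {
            final int start = chunk * 1000;
            Callable<Double> task = () -> {
                double sum = 0.0;
                for (int i = start; i < start + 1000; i++) sum += Math.sqrt(i);
                return sum;
            };
            futures.add(pool.submit(task));
        }

        // Collect the results; Future.get() blocks until each task finishes.
        double total = 0.0;
        for (Future<Double> f : futures) total += f.get();
        System.out.println("total = " + total);

        pool.shutdown();
    }
}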

Parallel Code Architecture
- Introduce a new derived class: ParallelTrainer extends Trainer.
- In ParallelTrainer.computeFunctionGradient(), training example terms are evaluated across threads.
- Scalar values get returned as Futures. The gradient vector must be returned via the calling parameter!
1. Make a new thread class that accumulates gradient terms over its lifetime.
2. Merge those values at the end of the iteration (see the sketch below).
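A much-simplified sketch of that idea. The names and the per-example computation are placeholders; the real ParallelTrainer works against Sarawagi's Trainer internals rather than a standalone class like this.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each worker accumulates its own partial gradient, the scalar log-likelihood
// terms come back as Futures, and the partial gradients are merged into the
// caller-supplied array at the end of the iteration.
public class ParallelGradientSketch {
    static int numFeatures = 1000;
    static int numExamples = 5000;

    // Placeholder for the per-example forward-backward computation: adds this
    // example's gradient contribution into partialGrad and returns its
    // log-likelihood term.
    static double exampleTerm(int example, double[] lambda, double[] partialGrad) {
        return 0.0;
    }

    static double computeFunctionGradient(double[] lambda, double[] grad, int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        double[][] partialGrads = new double[nThreads][numFeatures];  // one accumulator per worker
        List<Future<Double>> futures = new ArrayList<>();

        for (int t = 0; t < nThreads; t++) {
            final int tid = t;
            futures.add(pool.submit((Callable<Double>) () -> {
                double logLik = 0.0;
                // Static round-robin split of the training examples over workers.
                for (int ex = tid; ex < numExamples; ex += nThreads) {
                    logLik += exampleTerm(ex, lambda, partialGrads[tid]);
                }
                return logLik;  // the scalar comes back through the Future
            }));
        }

        double logLik = 0.0;
        for (Future<Double> f : futures) logLik += f.get();  // wait for all workers

        // Merge step: fold every worker's partial gradient into the caller's array.
        for (int t = 0; t < nThreads; t++)
            for (int k = 0; k < numFeatures; k++) grad[k] += partialGrads[t][k];

        pool.shutdown();
        return logLik;
    }
}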

Results
- Breakdown of time vs. task for the BOUNDARY and CATEGORY problems.

Results (2)
- Parallel speed-up vs. thread count for the two tasks. The BOUNDARIES task rapidly exhausts scalability, but the CATEGORIES task is still improving at 64 threads.
- Training time vs. thread count for CATEGORIES. The blue line is for 16 cores / 32 hyperthreads.

Drawbacks
- The big issue: Sarawagi's API does not provide the FeatureGenerator until it is time to start training. (Very flexible.)
- Internal state from Trainer is shared across the package.
- This makes it difficult to create the threads early; they would have to have their internals replaced every iteration anyway.
- Thus we create threads late, starting fresh every iteration. Very inefficient, but the overhead is tiny for realistic cases.

Where are we now?
- I'm working with a new set of grad students to understand some variability.
- Rounding error causes drift in the optimization trajectory.
- The need to return the gradient vector makes a fully deterministic version very expensive.
- We've also shared ParallelTrainer with a group at CMU which uses Sarawagi's CRF implementation.
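The drift arises because floating-point addition is not associative: merging per-thread partial sums in a different order gives slightly different gradients, which the optimizer then amplifies over many iterations. A tiny standalone illustration (not from the project):

public class FloatOrderDemo {
    public static void main(String[] args) {
        double big = 1e16, small = 1.0;
        // The same three terms, grouped two different ways.
        double left = (big + small) + small;   // each small is rounded away against big
        double right = big + (small + small);  // the two smalls survive as 2.0
        System.out.println(left == right);     // prints false
        System.out.println(left + " vs " + right);
    }
}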