Text Classification Eric Doi Harvey Mudd College November 20th, 2008

Kinds of Classification
- By language:
  Hello. My name is Eric. (English)
  Hola. Mi nombre es Eric. (Spanish)
  こんにちは。私の名前はエリックである。 (Japanese)
  你好。我叫 Eric。 (Chinese)

Kinds of Classification
- By document type:
  "approaches based on n-grams obtain generalization by concatenating"* (academic text)
  To: [...] Subject: McCain and Obama use it too. You have received this message because you opted in to receive Sund Design special offers via [...]. Login to your member account to edit your subscription. Click here to unsubscribe. (spam email)
  ACAAGATGCCATTGTCCCCCGGCCTCCTG (DNA sequence)
  *(Bengio)

Difficulties
- Dictionary? Generalization?
- Over 500,000 words in the English language (over one million if counting scientific words)
- Typos/OCR errors
- Loan words:
  We practice ballet at the café.
  Nous pratiquons le ballet au café.

Approaches
- Unique letter combinations:

  Language   String
  English    "ery"
  French     "eux"
  Gaelic     "mh"
  Italian    "cchi"

  (Dunning, Statistical Identification of Language)

Approaches
- "Unique" letter combinations? The same strings turn up in ordinary English text:

  Language   String
  English    "ery"
  French     "eux"   (but: "milieux")
  Gaelic     "mh"    (but: "farmhand")
  Italian    "cchi"  (but: "zucchini")

- Requires hand-coding; what about other languages (6,000+)?
  (Dunning, Statistical Identification of Language)
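
To make the limitation concrete, here is a minimal sketch of the hand-coded approach in Python; the rule table and function are hypothetical illustrations, not code from Dunning's paper.

# Hypothetical rule table: classify by looking for "unique" letter combinations.
RULES = {
    "eux": "French",
    "mh": "Gaelic",
    "cchi": "Italian",
    "ery": "English",
}

def classify_by_rules(text: str) -> str:
    # Return the first language whose tell-tale string occurs in the text.
    for pattern, language in RULES.items():
        if pattern in text.lower():
            return language
    return "unknown"

# The rules misfire on ordinary English sentences:
print(classify_by_rules("We grow zucchini in the garden."))   # "Italian" (wrong)
print(classify_by_rules("The farmhand fed the horses."))      # "Gaelic"  (wrong)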

Approaches
- Try to minimize:
  - Hand-coded knowledge
  - Training data
  - Input data (isolating phrases?)
- Dunning, "Statistical Identification of Language." 1994.
- Bengio, "A Neural Probabilistic Language Model." 2003.

Statistical Approach: N-Grams
- N-grams are sequences of n elements.
  Example: Professor Keller is not a goth.
  Word-level bigrams: (Professor, Keller), (Keller, is), (is, not), (not, a), (a, goth)
  Character-level trigrams: (P, r, o), (r, o, f), (o, f, e), (f, e, s), ...
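
A short sketch of how the n-grams above can be extracted in Python; the helper below is illustrative, not code from the talk.

from typing import List, Tuple

def ngrams(items: List[str], n: int) -> List[Tuple[str, ...]]:
    # All contiguous n-element windows over a sequence.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sentence = "Professor Keller is not a goth."

word_bigrams = ngrams(sentence.rstrip(".").split(), 2)
# [('Professor', 'Keller'), ('Keller', 'is'), ('is', 'not'), ('not', 'a'), ('a', 'goth')]

char_trigrams = ngrams(list("Professor"), 3)
# [('P', 'r', 'o'), ('r', 'o', 'f'), ('o', 'f', 'e'), ('f', 'e', 's'), ...]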

Statistical Approach: N-Grams
- Mined from 1,024,908,267,229 words of web text (the Google web n-gram corpus)
- Sample 4-grams and their counts:

  serve as the infrastructure    500
  serve as the initial          5331
  serve as the injector           56

Statistical Approach: N-Grams
- Counts inform some notion of probability:
  - Normalize frequencies: P(serve as the initial) > P(serve as the injector)
  - Classification:
    P(English | serve as the initial) > P(Spanish | serve as the initial)
    P(Spam | serve as the injector) < P(!Spam | serve as the injector)
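
As a rough sketch, the slide's sample counts can be normalized into relative frequencies, and classification then compares class-conditional scores; the Bayes' rule line in the comments is the standard formulation, not taken from the slides.

# Turn the slide's sample 4-gram counts into relative frequencies.
counts = {
    "serve as the infrastructure": 500,
    "serve as the initial": 5331,
    "serve as the injector": 56,
}
total = sum(counts.values())
probs = {gram: c / total for gram, c in counts.items()}

assert probs["serve as the initial"] > probs["serve as the injector"]

# For classification, each class (language, spam vs. not-spam) gets its own
# n-gram model, and we compare class-conditional scores via Bayes' rule:
#   P(English | text)  is proportional to  P(text | English) * P(English)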

Statistical Approach: N-Grams
- But what about P(serve as the ink)?
  Is P(serve as the ink) = P(vxvw aooa *%^$) = 0?
- How about P(sevre as the initial)? (a typo for "serve")

Statistical Approach: N-Grams
- How do we smooth out sparse data?
  - Additive smoothing
  - Interpolation
  - Good-Turing estimate
  - Backoff
  - Witten-Bell smoothing
  - Absolute discounting
  - Kneser-Ney smoothing
  (MacCartney)

Statistical Approach: N-Grams
- Additive smoothing: add a small constant to every count so that no n-gram has zero probability
- Interpolation: also consider smaller n-grams, e.g. (serve as the), (serve), and mix their estimates
- Backoff: fall back to the lower-order estimate only when the higher-order count is zero
  (MacCartney; see the sketch below)
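
Below is a minimal sketch of add-k (additive) smoothing and linear interpolation over a toy corpus; the corpus, the k value, and the lambda weight are illustrative assumptions, not values from the slides.

from collections import Counter

# Toy corpus built from the slide's 4-gram examples.
tokens = ("serve as the initial serve as the injector "
          "serve as the infrastructure").split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)   # vocabulary size
N = len(tokens)     # corpus size

def p_additive(w_prev: str, w: str, k: float = 1.0) -> float:
    # Add-k smoothing of the bigram estimate P(w | w_prev): every possible
    # continuation gets a pseudo-count of k, so nothing has zero probability.
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

def p_interpolated(w_prev: str, w: str, lam: float = 0.7) -> float:
    # Linear interpolation: mix the bigram estimate with the unigram estimate.
    p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    p_uni = unigrams[w] / N
    return lam * p_bi + (1 - lam) * p_uni

print(p_additive("the", "ink"))          # unseen bigram, but > 0 after smoothing
print(p_interpolated("the", "initial"))  # seen bigram, lightly discounted

Backoff differs from interpolation in that the lower-order estimate is consulted only when the higher-order count is zero.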

Statistical Approach: Results
- Dunning: compared parallel translated texts in English and Spanish
  - 20-character input, 50K training: 92% accurate
  - 500-character input, 50K training: 99.9% accurate
- Also modified for comparing DNA sequences of humans, E. coli, and yeast

Neural Network Approach
Bengio et al., "A Neural Probabilistic Language Model," 2003:
- N-gram models do handle sparse data well (with smoothing)
- However, there are problems:
  - Narrow consideration of context (~1–2 preceding words)
  - No notion of semantic/grammatical similarity:
    "A cat is walking in the bedroom"
    "A dog was running in a room"

Neural Network Approach
- The general idea:
  1. Associate with each word in the vocabulary (e.g. size 17,000) a feature vector (30–100 features)
  2. Express the joint probability function of word sequences in terms of these feature vectors
  3. Learn simultaneously the word feature vectors and the parameters of the probability function
  (A forward-pass sketch follows below.)
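
A forward-pass sketch of the model described above, with illustrative sizes (the paper uses a vocabulary of roughly 17,000 words and 30–100 features); the paper's optional direct input-to-output connections are omitted, and no training loop is shown.

import numpy as np

rng = np.random.default_rng(0)
V, m, n, h = 1000, 50, 3, 64        # vocab size, feature dim, context length, hidden units

C = rng.normal(scale=0.1, size=(V, m))       # word feature vectors (step 1; learned)
H = rng.normal(scale=0.1, size=(h, n * m))   # hidden-layer weights
U = rng.normal(scale=0.1, size=(V, h))       # output weights
d, b = np.zeros(h), np.zeros(V)              # biases

def next_word_probs(context_ids):
    # P(w_t | previous n words), expressed through the feature vectors (step 2).
    x = C[context_ids].reshape(-1)           # concatenate the context embeddings
    a = np.tanh(d + H @ x)                   # hidden layer
    logits = b + U @ a                       # one score per vocabulary word
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()

probs = next_word_probs([12, 7, 301])        # arbitrary example context
assert abs(probs.sum() - 1.0) < 1e-9

Training (step 3) would backpropagate the log-likelihood gradient into C, H, U, and the biases simultaneously.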

References
- Dunning, Ted. "Statistical Identification of Language." Technical report, Computing Research Laboratory, New Mexico State University, 1994.
- Bengio, Yoshua, et al. "A Neural Probabilistic Language Model." Journal of Machine Learning Research 3 (2003): 1137–1155.
- MacCartney, Bill. "NLP Lunch Tutorial: Smoothing." Stanford University, 2005.