Published by Marilynn Molly Bruce. Modified over 9 years ago.
1
Toward Automatic Speech Act Discovery
2
Email, newsgroups, forums, blogs
3
Data Set
20 Usenet newsgroups
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his "Newsweeder: Learning to filter netnews" paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
4
Preprocessing
>> I just wonder if this will also cause a divergence between commercial
>> and non-commercial software (ie. you will only get free software using
>> Athena or OpenLook widget sets, and only get commercial software using
>> the Motif widget sets).
>
> I can't see why. If just about every workstation will come with Motif
> by default and you can buy it for under $100 for the "free" UNIX
> platforms, I can't see this causing major problems.

Let me add another of my concerns: Yes, I can buy a port of Motif for "cheap", but I cannot get the source for "cheap", hence I am limited to using whatever X libraries the Motif port was compiled against (at least with older versions of Motif. I have been told that Motif 1.2 can be used with any X, but I have not seen it myself).
5
Preprocessing
Section into “levels” by quote depth (the number of leading > markers)
Level < previous level = reply to previous message
Level > previous level = new message
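The level rule above can be sketched roughly as follows. This is only an illustration, assuming that a "level" is the number of leading > markers on a line; the function names quote_level, split_into_levels and message_response_pairs are hypothetical and not from the slides.

```python
import re


def quote_level(line):
    """Count the leading '>' markers on a line, ignoring spaces between them."""
    prefix = re.match(r"[>\s]*", line).group(0)
    return prefix.count(">")


def split_into_levels(body):
    """Group consecutive non-blank lines with the same quote depth into segments."""
    segments = []
    for line in body.splitlines():
        level = quote_level(line)
        text = re.sub(r"^[>\s]*", "", line).strip()
        if not text:
            continue  # skip blank lines and bare '>' lines
        if segments and segments[-1][0] == level:
            segments[-1][1].append(text)
        else:
            segments.append((level, [text]))
    return [(level, " ".join(lines)) for level, lines in segments]


def message_response_pairs(segments):
    """Pair each segment with the one before it when the quote level drops.

    Assumption: a lower level than the previous segment means a reply to it.
    A higher level starts a new message and therefore produces no pair.
    """
    return [
        (prev_text, text)
        for (prev_level, prev_text), (level, text) in zip(segments, segments[1:])
        if level < prev_level
    ]


body = (
    ">> I just wonder about widget sets.\n"
    ">\n"
    "> I can't see why.\n"
    "\n"
    "Let me add another of my concerns."
)
segments = split_into_levels(body)
pairs = message_response_pairs(segments)
```

On this shortened version of the example post, the sketch yields three segments (levels 2, 1, 0) and two message/response pairs.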
6
Also: Remove headers
Xref: cantaloupe.srv.cs.cmu.edu comp.windows.x:66928 comp.windows.x.apps:2487
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!howland.reston.ans.net!gatech!asuvax!chnews!tmcconne
From: tmcconne@sedona.intel.com (Tom McConnell~)
Newsgroups: comp.windows.x,comp.windows.x.apps
Subject: Re: Motif vs. [Athena, etc.]
Date: 16 Apr 1993 20:14:04 GMT
Organization: Intel Corporation
Lines: 44
Sender: tmcconne@sedona (Tom McConnell~)
Distribution: world
Message-ID:
References:
NNTP-Posting-Host: thunder.intel.com
Originator: tmcconne@sedona
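The slides do not say how the headers were removed. One simple possibility, shown here as a sketch, relies on the usual news article layout in which the header block ends at the first blank line. The function name strip_headers is hypothetical.

```python
def strip_headers(raw_article):
    """Drop the header block, assumed to end at the first blank line."""
    normalized = raw_article.replace("\r\n", "\n")
    head, separator, body = normalized.partition("\n\n")
    # Without a blank line there is no recognizable header block, so keep everything.
    return body if separator else normalized


raw = (
    "From: tmcconne@sedona.intel.com (Tom McConnell~)\n"
    "Subject: Re: Motif vs. [Athena, etc.]\n"
    "Lines: 44\n"
    "\n"
    "Let me add another of my concerns:\n"
)
body = strip_headers(raw)
```

A body that itself contains blank lines is unaffected, because only the first blank line is treated as the separator.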
7
Also: Remove signatures
Cheers,
Tom McConnell
--
Tom McConnell           | Internet: tmcconne@sedona.intel.com
Intel, Corp. C3-91      | Phone: (602)-554-8229
5000 W. Chandler Blvd.  | The opinions expressed are my own. No one in
Chandler, AZ 85226      | their right mind would claim them.
8
Also: Remove signatures
Look for ---*
– Doesn't always find it
9
Also: Remove signatures
Look for ---*
– Doesn't always match
First paragraph only
– Might miss important content
– Sometimes grabs greetings (e.g. “Hi, \n”)
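Both signature heuristics can be sketched in a few lines. This is an illustration only: it assumes the pattern ---* is applied as a regular expression to a line of its own (two or more hyphens), and the names strip_signature and first_paragraph are hypothetical.

```python
import re

# "--" followed by zero or more "-", alone on a line (surrounding spaces allowed).
SIGNATURE_DELIMITER = re.compile(r"^\s*---*\s*$")


def strip_signature(body):
    """Cut the body at the first delimiter line; return it unchanged if none is found."""
    lines = body.splitlines()
    for index, line in enumerate(lines):
        if SIGNATURE_DELIMITER.match(line):
            return "\n".join(lines[:index]).rstrip()
    return body.rstrip()


def first_paragraph(body):
    """Fallback heuristic: keep only the text before the first blank line."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    return paragraphs[0] if paragraphs else ""


post = "I have not seen it myself.\n\nCheers,\nTom McConnell\n--\nTom McConnell | Intel, Corp."
without_signature = strip_signature(post)
greeting_only = first_paragraph("Hi,\n\nI have not seen it myself.")
```

The second call shows the failure mode named on the slide: when a post opens with a greeting followed by a blank line, the first-paragraph heuristic returns only the greeting.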
10
Preprocessing
Bi- and tri-grams
Tag start of sentence with ^
Force “not” to join with adjacent n-grams
e.g. ^there_is_not not_a_way a_way way_to to_do do_that
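The slide gives only one example of the “not” rule, so the exact procedure is unknown. The sketch below is one possible reading that happens to reproduce that example: emit bigrams, but replace the bigrams that touch “not” with a trigram ending in “not” and a trigram starting with “not”. It does not emit general trigrams, and the function name negation_ngrams is hypothetical.

```python
def negation_ngrams(sentence):
    """Emit bigrams, widening to trigrams around the word "not".

    This rule is an assumption chosen to match the single example on the
    slide; it is not taken from the author's implementation.
    """
    tokens = sentence.lower().split()
    if not tokens:
        return []
    tokens[0] = "^" + tokens[0]  # tag the start of the sentence
    grams = []
    count = len(tokens)
    for i in range(count - 1):
        if tokens[i] == "not":
            # "not" joins with the following tokens (up to two of them).
            grams.append("_".join(tokens[i:i + 3]))
        elif tokens[i + 1] == "not":
            # Normally covered by the trigram that ends in "not"; at the very
            # start of the sentence no such trigram exists, so keep the bigram.
            if i == 0:
                grams.append("_".join(tokens[i:i + 2]))
        elif i + 2 < count and tokens[i + 2] == "not":
            # Trigram ending in "not" replaces the plain bigram.
            grams.append("_".join(tokens[i:i + 3]))
        else:
            grams.append("_".join(tokens[i:i + 2]))
    return grams


features = negation_ngrams("there is not a way to do that")
# features == ['^there_is_not', 'not_a_way', 'a_way', 'way_to', 'to_do', 'do_that']
```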
11
Text Modeling and Topic Discovery
Assume words and/or documents belong to some class/topic
Assume words are conditionally independent given the class/topic
P(w|z)
12
Naïve Bayes
Each document belongs to one class
P(d) = \prod_{w \in d} P(w|z)
13
Naïve Bayes - Inference
Expectation-Maximization
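The slide names Expectation-Maximization without further detail. As a rough illustration, the sketch below fits an unsupervised Naïve Bayes model (a mixture of multinomials) with EM in plain Python. The random soft initialization, the additive smoothing constant, and the function name em_naive_bayes are all assumptions, not details from the slides.

```python
import math
import random


def em_naive_bayes(docs, num_classes, iterations=20, smoothing=0.1, seed=0):
    """Fit P(z) and P(w|z) with EM; docs is a list of token lists.

    Returns (prior, word_probs, resp), where resp[d][z] approximates P(z | d).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    # Random soft assignments break the symmetry between classes.
    resp = []
    for _ in docs:
        row = [rng.random() + 1e-3 for _ in range(num_classes)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(iterations):
        # M-step: re-estimate P(z) and P(w|z) from the soft counts.
        denom = len(docs) + num_classes * smoothing
        prior = [
            (sum(r[z] for r in resp) + smoothing) / denom
            for z in range(num_classes)
        ]
        word_probs = []
        for z in range(num_classes):
            counts = {w: smoothing for w in vocab}
            for doc, r in zip(docs, resp):
                for w in doc:
                    counts[w] += r[z]
            total = sum(counts.values())
            word_probs.append({w: c / total for w, c in counts.items()})

        # E-step: P(z|d) is proportional to P(z) * prod_w P(w|z), in log space.
        resp = []
        for doc in docs:
            logs = [
                math.log(prior[z]) + sum(math.log(word_probs[z][w]) for w in doc)
                for z in range(num_classes)
            ]
            top = max(logs)
            weights = [math.exp(value - top) for value in logs]
            total = sum(weights)
            resp.append([wt / total for wt in weights])
    return prior, word_probs, resp


toy_docs = [
    ["buy", "motif", "source"],
    ["buy", "motif", "source"],
    ["thanks", "for", "help"],
    ["thanks", "for", "help"],
]
prior, word_probs, resp = em_naive_bayes(toy_docs, num_classes=2)
```

Working in log space with the maximum subtracted before exponentiating keeps the E-step numerically stable for longer documents.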
14
Latent Semantic Indexing / Latent Dirichlet Allocation
Each document contains multiple topics
P(d) = \prod_{w \in d} P(w|z) P(z|d)
15
Model for Conversational Text
Message m
Response r
P(m,r|z) = P(m|z) P(r|z)
P(r|m) \propto P(z) P(m|z) P(r|z)
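To make the factorization concrete, the sketch below scores each class z for a message/response pair by P(z) P(m|z) P(r|z) and normalizes over z. It assumes separate word tables for messages and responses and a small floor probability for unseen words. The class labels, the toy probabilities, and the function name speech_act_posterior are hypothetical and are not results from the slides.

```python
import math


def speech_act_posterior(message, response, prior, msg_probs, resp_probs, floor=1e-6):
    """Return P(z | m, r), proportional to P(z) * P(m|z) * P(r|z).

    Words are treated as conditionally independent given z. The floor
    probability for unseen words is an assumption of this sketch.
    """
    log_scores = {}
    for z in prior:
        score = math.log(prior[z])
        score += sum(math.log(msg_probs[z].get(w, floor)) for w in message)
        score += sum(math.log(resp_probs[z].get(w, floor)) for w in response)
        log_scores[z] = score
    top = max(log_scores.values())
    weights = {z: math.exp(s - top) for z, s in log_scores.items()}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}


# Toy parameters, invented purely for illustration.
prior = {"question": 0.5, "statement": 0.5}
msg_probs = {
    "question": {"how": 0.4, "do": 0.3, "i": 0.3},
    "statement": {"how": 0.1, "do": 0.2, "i": 0.7},
}
resp_probs = {
    "question": {"try": 0.6, "this": 0.4},
    "statement": {"try": 0.2, "this": 0.8},
}
posterior = speech_act_posterior(
    ["how", "do", "i"], ["try", "this"], prior, msg_probs, resp_probs
)
best = max(posterior, key=posterior.get)
```

With these toy tables the unnormalized scores are 0.00432 for "question" and 0.00112 for "statement", so the pair is assigned to "question".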
16
Example
21
Classification Performance
Labeled ~100 messages with speech acts
– M/R model: 40-60%
– Single-message NB: 20-30%
Need more labels