Published by Emory Goodwin. Modified over 8 years ago.
1
Machine Learning in Practice, Lecture 14
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
2
Plan for the Day
- Announcements
- Questions?
- Assignment 6
- More about Text
- Using TagHelper Tools
- Discussion about Assignment 6
- Museli Paper
3
Using TagHelper Tools
4
Setting Up Your Data
5
Extra Features
- You can also add features to the right of the text column
6
TagHelper Tools Process
[Diagram labels: TagHelper; Labeled Texts; Unlabeled Texts; Labeled Texts; A Model that can Label More Texts]
7
Running TagHelper Tools Click on the portal.bat executable
8
Training and Testing
- Start TagHelper Tools by double-clicking the portal.bat icon in your TagHelperTools2 folder
- You will then see the following tool palette
- The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data
- Click on Train New Models
9
Loading a File
- First click on Add a File
- Then select a file
10
Simplest Usage
- Click “GO!”
- TagHelper will use its default settings to train a model on your coded examples
- It will use that model to assign codes to the uncoded examples
11
More Advanced Usage
- Another option is to modify the default settings
- You get to the options you can set by clicking on >> Options
- After you finish that, click “GO!”
12
TagHelper Customizations: Feature Space Design
- Punctuation can be a “stand in” for mood
  - “you think the answer is 9?”
  - “you think the answer is 9.”
- Bigrams capture simple lexical patterns
  - “common denominator” versus “common multiple”
- POS bigrams capture stylistic information
  - “the answer which is …” vs “which is the answer”
- Line length can be a proxy for explanation depth
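As a rough illustration of the feature types on this slide, the sketch below extracts unigrams, bigrams, punctuation marks, and line length from one contribution. It is not TagHelper's implementation: the function name `extract_features`, the tokenizing regular expression, and the feature-name prefixes are all assumptions made for this example. POS bigrams are left out because they need a part-of-speech tagger.

```python
import re
from collections import Counter

PUNCTUATION = {"?", ".", "!"}

def extract_features(text):
    """Return a bag of simple surface features for one contribution (illustrative only)."""
    # Keep ?, . and ! as separate tokens so punctuation can act as a feature.
    tokens = re.findall(r"[a-z0-9']+|[?.!]", text.lower())
    words = [t for t in tokens if t not in PUNCTUATION]
    features = Counter()
    # Unigrams: one count per word.
    for word in words:
        features["uni=" + word] += 1
    # Bigrams: adjacent word pairs (punctuation is ignored when pairing).
    for first, second in zip(words, words[1:]):
        features["bi=" + first + "_" + second] += 1
    # Punctuation as a possible stand-in for mood.
    for token in tokens:
        if token in PUNCTUATION:
            features["punct=" + token] += 1
    # Line length in words, a possible proxy for explanation depth.
    features["line_length"] = len(words)
    return features
```

With this sketch, “you think the answer is 9?” and “you think the answer is 9.” share every word feature and differ only in their punctuation feature, which is the contrast the slide points to.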
13
TagHelper Customizations: Feature Space Design
- Contains non-stop word can be a predictor of whether a conversational contribution is contentful
  - “ok sure” versus “the common denominator”
- Remove stop words removes some distracting features
- Stemming allows some generalization
  - Multiple, multiply, multiplication
- Removing rare features is a cheap form of feature selection
  - Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
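The options on this slide can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the toy stop-word list, the three-suffix "stemmer" (a real system would more likely use something like the Porter stemmer), and the `min_count` threshold are not TagHelper's actual settings.

```python
from collections import Counter

# Toy stop-word list (hypothetical; TagHelper's own list is not shown in the slides).
STOP_WORDS = {"ok", "sure", "the", "a", "is", "of"}

# Toy suffix list, longest first; just enough to cover the slide's example.
SUFFIXES = ("ication", "e", "y")

def contains_non_stop_word(tokens):
    """True if at least one token is not a stop word."""
    return any(t not in STOP_WORDS for t in tokens)

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def remove_rare_features(docs, min_count=3):
    """Drop tokens that occur fewer than min_count times across the whole corpus."""
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]
```

Under these assumptions, “ok sure” contains no non-stop word while “the common denominator” does, and multiple, multiply, and multiplication all reduce to the same stem.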
14
Rule Language
- ANY() is used to create lists
  - COLOR = ANY(red,yellow,green,blue,purple)
  - FOOD = ANY(cake,pizza,hamburger,steak,bread)
- ALL() is used to capture contingencies
  - ALL(cake,presents)
- More complex rules
  - ALL(COLOR,FOOD)
* Note that you may wish to use part-of-speech tags in your rules!
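One way to read the rule language is as nested boolean tests over the tokens in a text segment. The sketch below mimics that reading in Python. It assumes that ANY matches when at least one of its items is present and ALL matches when every item is present; the helper `_match` and the name `COLOR_AND_FOOD` are hypothetical, and this is not how TagHelper itself evaluates rules.

```python
def _match(item, tokens):
    """An item is either a literal word or a previously defined rule."""
    return item(tokens) if callable(item) else item in tokens

def ANY(*items):
    """Rule that fires if at least one item matches the token set."""
    return lambda tokens: any(_match(item, tokens) for item in items)

def ALL(*items):
    """Rule that fires only if every item matches the token set."""
    return lambda tokens: all(_match(item, tokens) for item in items)

# The lists and rules from the slide.
COLOR = ANY("red", "yellow", "green", "blue", "purple")
FOOD = ANY("cake", "pizza", "hamburger", "steak", "bread")
COLOR_AND_FOOD = ALL(COLOR, FOOD)
```

For example, `COLOR_AND_FOOD(set("i want a red cake".split()))` fires because the segment contains both a color word and a food word.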
15
Advanced Feature Editing
16
* For small datasets, first deselect Remove rare features.
17
Advanced Feature Editing * Next, Click on Adv Feature Editing
18
Advanced Feature Editing * Now you may begin creating your own features.
19
Types of Basic Features
- Primitive features include unigrams, bigrams, and POS bigrams
20
Types of Basic Features
- The Options change which primitive features show up in the Unigram, Bigram, and POS bigram lists
  - You can choose to remove stopwords or not
  - You can choose whether or not to strip endings off words with stemming
  - You can choose how frequently a feature must appear in your data in order for it to show up in your lists
21
Types of Basic Features * Now let’s look at how to create new features.
22
Creating New Features * You can use the feature editor to create new features.
23
Creating New Features * First click on ANY
24
Creating New Features * Then click ALL
25
Creating New Features * Now fill in ‘tell’ and ‘me’
26
Creating New Features * Now fill in the rest of the pattern from the POS Bigram list
27
Creating New Features * Now change the name
28
Creating New Features * Click to add to feature list
29
Using the Display Option
33
Viewing Created Features
36
Output
- You can find the output in the OUTPUT folder
- User Defined Features: UserDefinedFeatures_[name of input file].txt
  - E.g., UserDefinedFeatures_SimpleExample.xls.txt
- Performance Report: Eval_[name of coding dimension]_[name of input file].txt
  - E.g., Eval_Code_SimpleExample.xls.txt
- Output File: [name of input file]_OUTPUT.xls
  - E.g., SimpleExample_OUTPUT.xls
37
User Defined Feature File
- You can reuse these
- If you load these as the default user defined features, you don’t have to create them again by hand
- You do have to insert them manually
38
Loading Your User Defined Features Put your user defined feature file here
39
Loading Your User Defined Features
40
Double click
41
Loading Your User Defined Features Then click here
42
Loading Your User Defined Features
43
Or export to csv
44
Loading Your User Defined Features
- Now you can just copy columns for new features into your input file
- They will be treated like the extra features to the right of the text column
- You need to reload the long way when you create the final model
45
Using the Output File Prefix
- If you use the Output file prefix, the text you enter will be prepended to the output files
  - Prefix1_Eval_Code_SimpleExample.xls.txt
  - Prefix1_SimpleExample.xls
46
Performance Report
- The performance report tells you:
  - What dataset was used
  - What the customization settings were
- At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made
49
Output File
- The output file contains the codes for each segment
  - Note that the segments that were already coded will retain their original code
  - The other segments will have their automatic predictions
- The prediction column indicates the confidence of the prediction
50
Applying a Trained Model
- Select a model file
- Then select a testing file
51
Applying a Trained Model
- Testing data should be set up with ? on uncoded examples
- Click Go! to process the file
52
Results
53
Assignment 6
54
Example Negative Review in this re-make of the 1954 japanese monster film, godzilla is transformed into a " jurassic park " copy who swims from the south pacific to new york for no real reason and trashes the town. although some of the destruction is entertaining for a while, it gets old fast. the film often makes no sense ( a several-hundred foot tall beast hides in subway tunnels ), sports second-rate effects ( the baby godzillas seem to be one computer effect multiplied on the screen ), lame jokes ( mayor ebert and his assistant gene are never funny ), horrendous acting ( even matthew broderick is dull ) and an unbelievable love story ( why would anyone want to get back together with maria pitillo's character ? ). there are other elements of the film that fall flat, but going on would just be a waste of good words. only for die-hard creature feature fans, this might be fun if you could check your brain at the door. i couldn't. ( michael redman has written this column for 23 years and has seldom had a more disorienting cinematic experience than seeing both " fear and loathing " and " godzilla " in the same evening. )
55
Example Positive Review sometimes a movie comes along that falls somewhat askew of the rest. some people call it " original " or " artsy " or " abstract ". some people simply call it " trash ". a life less ordinary is sure to bring about mixed feelings. definitely a generation-x aimed movie, a life less ordinary has everything from claymation to profane angels to a karaoke-based musical dream sequence. whew ! anyone in their 30's or above is probably not going to grasp what can be enjoyed about this film. it's somewhat silly, it's somewhat outrageous, and it's definitely not your typical romance story, but for the right audience, it works. a lot of hype has been surrounding this film due to the fact that it comes to us from the same team that brought us trainspotting. well sorry folks, but i haven't seen trainspotting so i can't really compare. whether that works in this film's favor or not is beyond me. but i do know this : ewan mcgregor, whom i had never had the pleasure of watching, definitely charmed me. he was great ! cameron diaz's character was uneven and a bit hard to grasp. the audience may find it difficult to care about her, thus discouraging the hopes of seeing her unite with mcgregor
56
Positive Review Continued after we are immediately sucked into caring about and identifying with him. misguided? you bet. loveable ? you bet. a life less ordinary was a delight and even had a bonus for me when i realized it was filmed in my hometown of salt lake city, utah. this was just one more thing i didn't know about this movie when i sat down with a five dollar order of nachos and a three dollar coke. maybe not knowing the premise behind this film made for a pleasant surprise, but i think even if i had known, i would have been just as happy. a life less ordinary is quirky, eccentric, and downright charming ! not for everyone, but a definite change of pace for your typical night at the movies.
57
Note that the texts are LONG!!!
58
Takes about 15 minutes on my machine!
59
Using the Display Option
61
Helpful Hints
- Use Feature Selection!
- Limit the number of times you use the Advanced Feature Editing interface
  - Export the features you create to CSV so you can reuse the already created versions
  - You can use Weka once you dump out an .arff file from TagHelper Tools
- Do your experimentation strategically
  - Note that POS tagging is slow
62
Museli
63
Definition of “Topic” in Dialogue
- Discourse Segment Purpose (Passonneau and Litman, 1994), based on (Grosz and Sidner, 1984)
- TOPIC SHIFT = SHIFT IN PURPOSE that is acknowledged and acted upon by both dialogue participants
- Example:
  T: Let me know once you are done reading.
  T: I’ll be back in a min.
  T: Are you done reading?
  S: not yet.
  T: ok
  T: Do you know where to enter all the values?
  S: I think so.
  S: I’ll ask if I get stuck though.
  ...
- Tutor wants to know when student is ready to start the session.
- Tutor checks if student knows how to set up the analysis.
65
Overview of Single Evidence Source Approaches
- Models based on lexical cohesion
  - TextTiling (Hearst, 1997)
  - Foltz (Foltz, 1998)
  - Olney & Cai (Olney & Cai, 2005)
- Models relying on regularities in topic sequencing
  - Barzilay & Lee (Barzilay & Lee, 2004)
66
MUSELI
- Integrates multiple sources of evidence of topic shift
- Features:
  - Lexical Cohesion (via cosine correlation)
  - Time lag between contributions
  - Unigrams (previous and current contribution)
  - Bigrams (previous and current cont.)
  - POS Bigrams (previous and current cont.)
  - Contribution Length
  - Previous/Current Speaker
  - Contribution of Content Words
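The lexical cohesion feature can be illustrated with a plain cosine similarity between the word-count vectors of two adjacent contributions. This is a minimal sketch under that assumption. The slides do not say how Museli builds its vectors (window size, weighting, and so on), so the function name `cosine` and the raw-count representation are illustrative choices only.

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between the raw word-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the words in the first vector; missing words count as 0.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty contribution has no lexical overlap with anything
    return dot / (norm_a * norm_b)
```

A low value between the previous and the current contribution means little shared vocabulary, which is one possible sign of a topic shift.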
67
Experimental Corpora
                           Olney and Cai Corpus    Thermo Corpus
# Dialogues                42                      22
Conts./Dialogue            195.40                  217.90
Words/Cont.                28.63*                  5.12*
Conts./Topic               24*                     13.31*
Topics/Dialogue            8.14*                   16.36*
Tutor Conts./Dialogue      97.48                   152.86*
Student Conts./Dialogue    97.93                   65.05*
* P < .005
- Olney and Cai corpus: (Olney and Cai, 2005)
- Thermo corpus: student/tutor optimization problem, unrestricted interaction, virtually co-present
- Our thermo corpus:
  - Is more terse!
  - Has fewer Contributions!
  - Has more Topics/Dialogue!
  - Strict turn-taking not enforced!
68
Baseline Degenerate Approaches
- ALL: every contribution = NEW_TOPIC
- EVEN: every nth contribution = NEW_TOPIC
- NONE: no NEW_TOPIC
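The three degenerate baselines are simple enough to write down directly. In this sketch a segmentation is a list of booleans, one per contribution, where `True` means NEW_TOPIC. That representation, the function names, and the choice to mark positions n, 2n, 3n, ... for EVEN are assumptions, since the slides do not specify where the EVEN baseline starts counting.

```python
def baseline_all(n_contributions):
    """ALL: every contribution starts a new topic."""
    return [True] * n_contributions

def baseline_none(n_contributions):
    """NONE: no contribution starts a new topic."""
    return [False] * n_contributions

def baseline_even(n_contributions, n):
    """EVEN: every n-th contribution (1-based positions n, 2n, ...) starts a new topic."""
    return [(i + 1) % n == 0 for i in range(n_contributions)]
```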
69
Two Evaluation Metrics
- A metric commonly used to evaluate topic segmentation algorithms (Olney & Cai, 2005)
  - F-measure:
    - Precision (P): # correct predictions / # predictions
    - Recall (R): # correct predictions / # boundaries
- An additional metric designed specifically for segmentation problems (Beeferman et al., 1999)
  - Pk: Pr(error|k)
  - The probability that two contributions, separated by k contributions, are misclassified
  - Effective if k = ½ average topic length
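Both metrics can be sketched in a few lines. The code below assumes a segmentation is a list of booleans (one per contribution, `True` meaning the contribution starts a new topic), counts a prediction as correct only when it lands exactly on a reference boundary, and uses the balanced F-measure 2PR / (P + R). These are assumptions for illustration; the function names are hypothetical, and the Museli evaluation may differ in details such as how near-misses are handled.

```python
def f_measure(reference, predicted):
    """Return (precision, recall, F) for predicted topic boundaries."""
    correct = sum(1 for r, p in zip(reference, predicted) if r and p)
    n_predicted = sum(predicted)
    n_boundaries = sum(reference)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_boundaries if n_boundaries else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def segment_ids(boundaries):
    """Map each contribution to the index of the topic segment it belongs to."""
    ids, current = [], 0
    for starts_new_topic in boundaries:
        if starts_new_topic:
            current += 1
        ids.append(current)
    return ids

def p_k(reference, predicted, k):
    """Fraction of contribution pairs k apart on which the two segmentations disagree
    about whether the pair lies in the same topic. Requires len(reference) > k."""
    ref_ids, pred_ids = segment_ids(reference), segment_ids(predicted)
    n = len(ref_ids)
    errors = sum(
        (ref_ids[i] == ref_ids[i + k]) != (pred_ids[i] == pred_ids[i + k])
        for i in range(n - k)
    )
    return errors / (n - k)
```

Lower Pk is better, higher F is better. A baseline that predicts no boundaries has no predictions to score, which is in line with the dashes in the F columns for the degenerate baselines.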
70
Experimental Results
           Olney and Cai Corpus    Thermodynamics Corpus
           Pk        F             Pk        F
NONE       .4897     --            .4900     --
ALL        .5180     --            .5100     --
EVEN       .5117     --            .5131     --
TT         .5069     .1475         .5353     .1614
B&L        .5092     .1747         .5086     .1512
Foltz      .3270     .3492         .5058     .1180
Ortho      .2754     .6012         .4898     .2111
Museli     .1051     .8013         .4043     .3693
Legend (compared to degenerates; cell highlighting not preserved): > NO DEG., > 1 DEG., > ALL 3 DEG. (P < .05)
71
Experimental Results (same table as the previous slide)
- Museli > all approaches in BOTH corpora (P < .05)
72
Take Home Message
- We explored some of TagHelper Tools’ functionality
- TagHelper provides simple linguistic features like bigrams and POS bigrams that can be useful for classification
- Assignment 6 will give you realistic experience working with text on a non-trivial classification task
- The most important thing for Assignment 6 is to be strategic!