TagHelper: User’s Manual
Carolyn Penstein Rosé (cprose@cs.cmu.edu), Carnegie Mellon University
Funded through the Pittsburgh Science of Learning Center and the Office of Naval Research, Cognitive and Neural Sciences Division
Copyright 2007, Carolyn Penstein Rosé, Carnegie Mellon University. Licensed under the GNU General Public License.
Setting Up Your Data
Creating a Trained Model
Training and Testing
- Start TagHelper tools by double-clicking the portal.bat icon in your TagHelperTools2 folder.
- You will then see the tool palette.
- The idea is that you train a prediction model on your coded data and then apply that model to uncoded data.
- Click on Train New Models.
Loading a File
- First click on Add a File.
- Then select a file.
Simplest Usage
- Click “GO!”
- TagHelper will use its default settings to train a model on your coded examples.
- It will then use that model to assign codes to the uncoded examples.
More Advanced Usage
- The second option is to modify the default settings.
- You reach the configurable options by clicking on >> Options.
- After you finish, click “GO!”
Options
- This screen is where you set the options.
- They are discussed in more detail below.
Output
- You can find the output in the OUTPUT folder.
- There will be a text file named Eval_[name of coding dimension]_[name of input file].txt. This is the performance report, e.g., Eval_Code_SimpleExample.xls.txt.
- There will also be a file named [name of input file]_OUTPUT.xls. This is the coded output, e.g., SimpleExample_OUTPUT.xls.
Using the Output File Prefix
- If you use the Output file prefix, the text you enter will be prepended to the output files.
- The performance report will be named [prefix]_Eval_[name of coding dimension]_[name of input file].txt, e.g., Prefix1_Eval_Code_SimpleExample.xls.txt.
- The coded output will be named [prefix]_[name of input file]_OUTPUT.xls, e.g., Prefix1_SimpleExample_OUTPUT.xls.
Evaluating Performance
Performance Report
The performance report tells you:
- What dataset was used
- What the customization settings were
- At the bottom of the file, reliability statistics and a confusion matrix that shows which types of errors are being made
Output File
The output file contains the codes for each segment:
- Segments that were already coded retain their original codes.
- The other segments receive the automatic predictions.
- The prediction column indicates the confidence of each prediction.
Using a Trained Model
Applying a Trained Model
- Select a model file.
- Then select a testing file.
Applying a Trained Model
- The testing data should be set up with ? as the code for uncoded examples.
- Click Go! to process the file.
Results
Overview of Basic Feature Extraction from Text
Customizations
To customize the settings:
- Select the file.
- Click on Options.
Setting the Language
- You can change the default language from English to German.
- Chinese requires an additional license from Academia Sinica in Taiwan.
Preparing to Get a Performance Report
You can decide whether you want TagHelper to prepare a performance report for you. (It runs faster when this is disabled.)
Classifier Options
Rules of thumb:
- SMO is state-of-the-art for text classification.
- J48 is best with small feature sets; it also handles contingencies between features well.
- Naïve Bayes works well for models where decisions are made by accumulating evidence rather than by hard-and-fast rules.
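TagHelper's classifiers come from the Weka toolkit (Java). For readers who want to try the same rules of thumb outside TagHelper, the sketch below uses rough scikit-learn analogues; the data and class choices are illustrative assumptions, not TagHelper's actual code.

```python
# Rough scikit-learn analogues of Weka's SMO, J48, and NaiveBayes
# (illustrative only; TagHelper itself wraps the Weka implementations).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC                 # ~ SMO: a linear SVM, strong for text
from sklearn.tree import DecisionTreeClassifier   # ~ J48: a C4.5-style decision tree
from sklearn.naive_bayes import MultinomialNB     # ~ Naive Bayes: accumulates evidence
from sklearn.model_selection import cross_val_score

# Hypothetical coded examples.
texts = ["you think the answer is 9?", "which is the answer?",
         "the common denominator is 6", "ok sure, that works"]
codes = ["question", "question", "statement", "statement"]

X = CountVectorizer().fit_transform(texts)        # bag-of-words features
for clf in (LinearSVC(), DecisionTreeClassifier(), MultinomialNB()):
    print(type(clf).__name__, cross_val_score(clf, X, codes, cv=2).mean())
```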
Basic Idea
Represent text as a vector where each position corresponds to a term. This is called the “bag of words” approach:

                    Cheese  Cows  Eggs  Hens  Lay  Make
Cows make cheese      1      1     0     0    0    1
Hens lay eggs         0      0     1     1    1    0
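A minimal sketch of this vectorization in Python (an illustration of the idea; TagHelper builds its vectors internally):

```python
# Reproduce the slide's bag-of-words table with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Cows make cheese", "Hens lay eggs"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # ['cheese' 'cows' 'eggs' 'hens' 'lay' 'make']
print(X.toarray())                  # [[1 1 0 0 0 1]
                                    #  [0 0 1 1 1 0]]
```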
What can’t you conclude from “bag of words” representations?
- Causality: “X caused Y” versus “Y caused X”
- Roles and mood: “Which person ate the food that I prepared this morning and drives the big car in front of my cat” versus “The person, which prepared food that my cat and I ate this morning, drives in front of the big car.” Both contain the same words, but who’s driving, who’s eating, and who’s preparing food?
X’ Structure
[Tree diagram: a maximal projection X’’ dominates a specifier and an X’; the X’ contains a pre-head modifier, the head X, and a post-head modifier. Example phrase: “The black cat in the hat”.]
A complete phrase is sometimes called “a maximal projection”.
Basic Anatomy: Layers of Linguistic Analysis
- Phonology: the sound structure of language (basic sounds, syllables, rhythm, intonation)
- Morphology: the building blocks of words. Inflection: tense, number, gender. Derivation: building words from other words, transforming part of speech.
- Syntax: structural and functional relationships between spans of text within a sentence (phrase and clause structure)
- Semantics: literal meaning, propositional content
- Pragmatics: non-literal meaning, language use, language as action, social aspects of language (tone, politeness)
- Discourse analysis: language in practice; relationships between sentences, interaction structures, discourse markers, anaphora and ellipsis
Part of Speech Tagging (Penn Treebank tagset)

CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition/subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
NNPS  Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
RP    Particle
SYM   Symbol
TO    to
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund/present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   wh-determiner
WP    wh-pronoun
WP$   Possessive wh-pronoun
WRB   wh-adverb

http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html
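To see these tags in practice, NLTK's default tagger uses the same tagset (an illustration only; TagHelper runs its own tagger internally, and NLTK resource names vary by version):

```python
# Tag a sentence with NLTK's Penn Treebank POS tagger.
import nltk
nltk.download("punkt", quiet=True)                       # tokenizer models
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger models

tokens = nltk.word_tokenize("Which person ate the food?")
print(nltk.pos_tag(tokens))
# Roughly: [('Which', 'WDT'), ('person', 'NN'), ('ate', 'VBD'),
#           ('the', 'DT'), ('food', 'NN'), ('?', '.')]
```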
TagHelper Customizations: Feature Space Design
- Think like a computer! Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful.
- Look for approximations. If you want to find questions, you don’t need to do a complete syntactic analysis:
  - Look for question marks
  - Look for wh-terms that occur immediately before an auxiliary verb
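A toy version of that question heuristic (a sketch only, not TagHelper's implementation; the word lists are illustrative and incomplete):

```python
# Flag a question mark, or a wh-term immediately before an auxiliary verb.
WH = {"who", "what", "when", "where", "which", "why", "how"}
AUX = {"is", "are", "was", "were", "do", "does", "did",
       "can", "could", "will", "would", "should"}

def looks_like_question(text: str) -> bool:
    if "?" in text:
        return True
    words = text.lower().split()
    return any(w in WH and nxt in AUX for w, nxt in zip(words, words[1:]))

print(looks_like_question("Where do you live"))              # True (where + do)
print(looks_like_question("I live where my family lives."))  # False
```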
TagHelper Customizations: Feature Space Design
- Punctuation can be a “stand-in” for mood: “you think the answer is 9?” versus “you think the answer is 9.”
- Bigrams capture simple lexical patterns: “common denominator” versus “common multiple”
- POS bigrams capture syntactic or stylistic information: “the answer which is …” versus “which is the answer”
- Line length can be a proxy for explanation depth
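For instance, word bigrams can be extracted like this (a scikit-learn analogue, not TagHelper's own extractor):

```python
# Extract word-bigram features.
from sklearn.feature_extraction.text import CountVectorizer

bigrams = CountVectorizer(ngram_range=(2, 2))
bigrams.fit(["the common denominator", "the common multiple"])
print(bigrams.get_feature_names_out())
# ['common denominator' 'common multiple' 'the common']
```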
TagHelper Customizations: Feature Space Design
- Contains non-stopword can be a predictor of whether a conversational contribution is contentful: “ok sure” versus “the common denominator”
- Remove stop words removes some distracting features.
- Stemming allows some generalization: multiple, multiply, multiplication.
- Removing rare features is a cheap form of feature selection: features that occur only once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space.
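Rough counterparts of the last three options, sketched with scikit-learn and NLTK's Porter stemmer (TagHelper exposes these as checkboxes; its stemmer and thresholds may differ):

```python
# Stop-word removal, stemming, and rare-feature removal in one vectorizer.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer

stem = PorterStemmer().stem
vec = CountVectorizer(
    stop_words="english",  # "Remove stop words"
    min_df=2,              # "Remove rare features": keep terms in >= 2 documents
    tokenizer=lambda text: [stem(w) for w in text.split()],  # "Stemming"
    token_pattern=None,    # silence the warning about the unused default pattern
)
docs = ["we multiply fractions", "you multiply decimals",
        "multiplication of decimals", "ok sure"]
vec.fit(docs)
print(vec.get_feature_names_out())  # surviving stems, e.g. ['decim' 'multipli']
```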
Group Activity
Use TagHelper features to make up rules to identify thematic roles in these sentences:
- Agent: who is doing the action
- Theme: what the action is done to
- Recipient: who benefits from the action
- Source: where the theme started
- Destination: where the theme ended up
- Tool: what the agent used to do the action to the theme
- Manner: how the agent behaved while doing the action

1. The man chased the intruder.
2. The intruder was chased by the man.
3. Aaron carefully wrote a letter to Marilyn.
4. Marilyn received the letter.
5. John moved the package from the table to the sofa.
6. The governor entertained the guests in the parlor.
New Feature Creation
Why create new features?
- You may want to generalize across sets of related words: Color = {red, yellow, orange, green, blue}; Food = {cake, pizza, hamburger, steak, bread}
- You may want to detect contingencies: the text must mention both cake and presents in order to count as a birthday party.
- You may want to combine these: the text must include a Color and a Food.
Why create new features by hand?
- Rules: for simple rules, it might be easier and faster to write the rules by hand than to learn them from examples.
- Features: hand-crafted features are more likely to capture meaningful generalizations, and they build in knowledge so you can get by with less training data.
Rule Language
- ANY() is used to create lists: COLOR = ANY(red,yellow,green,blue,purple); FOOD = ANY(cake,pizza,hamburger,steak,bread)
- ALL() is used to capture contingencies: ALL(cake,presents)
- More complex rules combine the two: ALL(COLOR,FOOD)
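A toy evaluator that mimics the semantics of this rule language (a sketch to make the behavior concrete; TagHelper's actual rule engine is not this code):

```python
# ANY() matches if any pattern matches; ALL() requires every pattern to match.
def ANY(*patterns):
    return lambda words: any(p(words) if callable(p) else p in words
                             for p in patterns)

def ALL(*patterns):
    return lambda words: all(p(words) if callable(p) else p in words
                             for p in patterns)

COLOR = ANY("red", "yellow", "green", "blue", "purple")
FOOD = ANY("cake", "pizza", "hamburger", "steak", "bread")
birthday = ALL("cake", "presents")     # contingency: both words required
colored_food = ALL(COLOR, FOOD)        # combining list features

words = "we ate blue cake and opened presents".split()
print(birthday(words), colored_food(words))  # True True
```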
Group Project
Make a rule that will match against questions but not statements:

Question:  Tell me what your favorite color is.
Statement: I tell you my favorite color is blue.
Question:  Where do you live?
Statement: I live where my family lives.
Question:  Which kinds of baked goods do you prefer
Statement: I prefer to eat wheat bread.
Question:  Which courses should I take?
Statement: You should take my applied machine learning course.
Question:  Tell me when you get up in the morning.
Statement: I get up early.
Possible Rule
ANY(ALL(tell,me),BOL_WDT,BOL_WRB)
(Here BOL_WDT and BOL_WRB presumably match a wh-determiner or wh-adverb at the beginning of the line, using the POS tags listed earlier.)
Advanced Feature Editing
- For small datasets, first deselect Remove rare features.
- Then click on Adv Feature Editing.
Types of Basic Features
Primitive features include unigrams, bigrams, and POS bigrams.
Types of Basic Features
The Options change which primitive features show up in the Unigram, Bigram, and POS Bigram lists:
- You can choose whether or not to remove stop words.
- You can choose whether or not to strip endings off words with stemming.
- You can choose how frequently a feature must appear in your data in order for it to show up in your lists.
Types of Basic Features
Now let’s look at how to create new features.
Creating New Features
- The feature editor allows you to create new feature definitions.
- Click on + to add your new feature.
Examining a New Feature
Right-click on a feature to examine where it matches in your data.
Adding New Features by Script
- Modify the ex_features.txt file.
- This lets you save your definitions, and it is easier to cut and paste.
Error Analysis
Create an Error Analysis File
Use TagHelper to Code an Uncoded File
The output file contains the codes TagHelper assigned. What you want to do now is remove the prediction column and insert the correct answers next to the TagHelper-assigned answers.
Load Error Analysis File
Error Analysis Strategies
- Look for large error cells in the confusion matrix.
- Locate the examples that correspond to that cell.
- Ask: what features do those examples share? How are they different from the examples that were classified correctly?
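A minimal sketch of reading a confusion matrix (illustrative; TagHelper prints its confusion matrix and reliability statistics in the performance report):

```python
# Build a confusion matrix and a reliability statistic from paired codes.
from sklearn.metrics import confusion_matrix, cohen_kappa_score

truth     = ["question", "question", "statement", "statement", "statement"]
predicted = ["question", "statement", "statement", "statement", "question"]

print(confusion_matrix(truth, predicted, labels=["question", "statement"]))
# [[1 1]   rows = true code, columns = predicted code;
#  [1 2]]  large off-diagonal cells are the errors worth inspecting
print(cohen_kappa_score(truth, predicted))  # agreement beyond chance
```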
Group Project
- From NewsGroupTopic.xls, create NewsGroupTrain.xls, NewsGroupTest.xls, and NewsGroupAnswers.xls.
- Load in the NewsGroupTrain.xls data set. What is the best performance you can get by playing with the standard TagHelper feature options?
- Train a model using the best settings and then use it to assign codes to NewsGroupTest.xls.
- Copy in the Answer column from NewsGroupAnswers.xls.
- Now do an error analysis to determine why frequent mistakes are being made. How could you do better?
Feature Selection
Why do irrelevant features hurt performance?
- They might confuse a classifier.
- They waste time.
Two Solutions
- Use a feature selection algorithm.
- Only extract a subset of the possible features.
Feature Selection
Click on the AttributeSelectedClassifier.
Feature Selection
- Feature selection algorithms pick out a subset of the features that work best.
- Usually they evaluate each feature in isolation.
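The same idea sketched with scikit-learn, scoring each feature in isolation with a chi-squared test (an analogue for illustration; TagHelper wraps Weka's AttributeSelectedClassifier):

```python
# Score features independently, keep the k best, then train a classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["where do you live?", "which courses should I take?",
         "I live in Pittsburgh.", "I prefer wheat bread."]
codes = ["question", "question", "statement", "statement"]

model = make_pipeline(CountVectorizer(), SelectKBest(chi2, k=5), MultinomialNB())
model.fit(texts, codes)
print(model.predict(["where should I live?"]))
```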
Feature Selection
- First click here.
- Then pick your base classifier just like before.
- Finally, you will configure the feature selection.
Setting Up Feature Selection
- The number of features you pick should not be larger than the number of features available.
- It should also not be larger than the number of coded examples you have.
Examining Which Features Are Most Predictive
- You can find a ranked list of features in the performance report if you use feature selection.
- The report lists a predictiveness score and a frequency for each feature.
Optimization
Key idea: combine multiple views on the same data in order to increase reliability.
Boosting
- In boosting, a series of models is trained, and each trained model is influenced by the strengths and weaknesses of the previous model.
- New models should be experts in classifying examples that the previous model got wrong.
- Boosting specifically seeks to train multiple models that complement each other.
- In the final vote, each model’s predictions are weighted based on that model’s performance.
More about Boosting
- The more iterations, the more confident the trained classifier will be in its predictions. But higher confidence doesn’t necessarily mean higher accuracy! When a classifier becomes overly confident, it is said to “overfit.”
- Boosting can turn a weak classifier into a strong classifier: a simple classifier can learn a complex rule.
Boosting
- Boosting is an option listed in the Meta folder, near the AttributeSelectedClassifier.
- It is listed as AdaBoostM1.
- Go ahead and click on it now.
Setting Up Boosting
- Select a classifier.
- Set the number of cycles of boosting.
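A rough analogue in scikit-learn, assuming a recent version (illustrative; TagHelper configures Weka's AdaBoostM1 through the GUI, not code):

```python
# AdaBoost over a weak base classifier, with a set number of boosting cycles.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

texts = ["where do you live?", "which courses should I take?",
         "I live in Pittsburgh.", "I prefer wheat bread."]
codes = ["question", "question", "statement", "statement"]

booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a deliberately weak learner
    n_estimators=10,                                # "number of cycles of boosting"
)
model = make_pipeline(CountVectorizer(), booster)
model.fit(texts, codes)
print(model.predict(["where is the course held?"]))
```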
Semi-Supervised Learning
Using Unlabeled Data
If you have a small amount of labeled data and a large amount of unlabeled data, you can use a type of bootstrapping to learn a model that exploits regularities in the larger set of data:
- The stable regularities might be easier to spot in the larger set than in the smaller set.
- You are less likely to overfit your labeled data.
Semi-supervised Learning
Remember the basic idea:
- Train on a small amount of data.
- Add the positive and negative examples you are most confident about to the training data.
- Retrain.
- Keep looping until you have labeled all the data.
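That loop, sketched minimally in Python (a simplified self-training demo on made-up data; it adds the single most confident example per pass, and it is not TagHelper's implementation):

```python
# Self-training: retrain, label the most confident unlabeled example, repeat.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled = ["where do you live?", "I live in Pittsburgh."]
codes = ["question", "statement"]
unlabeled = ["which courses should I take?", "I prefer wheat bread.",
             "when does class start?", "class starts at nine."]

vec = CountVectorizer().fit(labeled + unlabeled)
while unlabeled:
    clf = MultinomialNB().fit(vec.transform(labeled), codes)  # (re)train
    probs = clf.predict_proba(vec.transform(unlabeled))
    best = int(np.argmax(probs.max(axis=1)))                  # most confident
    labeled.append(unlabeled.pop(best))
    codes.append(clf.classes_[int(np.argmax(probs[best]))])
print(list(zip(labeled, codes)))
```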
Semi-supervised Learning in TagHelper Tools