1
Classification & Mallet
Shallow Processing Techniques for NLP
Ling570
November 14, 2011
2
Roadmap
- Classification: feature templates
- Case study examples: text categorization, coreference resolution
- Classification systems overview
- Mallet
6
Classification Problem Steps
- Input processing:
  - Split data into training/dev/test
  - Convert data into an Attribute-Value Matrix (AVM)
    - Identify candidate features
    - Perform feature selection
    - Create AVM representation
- Training
- Testing
- Evaluation
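The first input-processing step, splitting the data, can be sketched in a few lines of Python. This is only an illustration: the 80/10/10 proportions, the fixed seed, and the function name `split_data` are assumptions, not something the slides prescribe.

```python
import random

def split_data(instances, train=0.8, dev=0.1, seed=0):
    """Shuffle and split instances into train/dev/test lists.

    The 80/10/10 proportions and the fixed seed are illustrative
    assumptions; the slides do not specify a ratio.
    """
    items = list(instances)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_dev = int(len(items) * dev)
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])

train_set, dev_set, test_set = split_data(range(100))
print(len(train_set), len(dev_set), len(test_set))  # 80 10 10
```

Shuffling before slicing keeps the three sets disjoint while avoiding any ordering effects in the original data.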
10
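Feature Template
- Example: Prevword (or w-1)
- A template corresponds to many features
  - e.g. time flies like an arrow
    - w-1=<s>
    - w-1=time
    - w-1=flies
    - w-1=like
    - w-1=an
    - …
- Shorthand for binary features: each w-1=w feature has value 0 or 1 (e.g. w-1=time is 1 when the previous word is "time", 0 otherwise)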
11
AVM Example
Time flies like an arrow

      w-1    w0     w-1 w0       w+1    label
x1    <s>    Time   <s> Time     flies  N
x2    Time   flies  Time flies   like   V
x3    flies  like   flies like   an     P

Note: this is a compact form of the true sparse vector: each feature w-1=w is 0 or 1, for every word w in the vocabulary V.
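A minimal Python sketch of how the templates w-1, w0, w-1 w0 and w+1 could be instantiated for this sentence, and how one compact row expands into binary features. The boundary markers `<s>` and `</s>` and both function names are assumptions made for the example. Only the first three tokens are labeled, matching rows x1 to x3.

```python
def context_features(tokens, labels, start="<s>", end="</s>"):
    """Build one attribute-value row per labeled token.

    The boundary markers are assumptions used to fill the missing
    context at the sentence edges.
    """
    padded = [start] + list(tokens) + [end]
    rows = []
    for i, label in enumerate(labels, start=1):
        prev_w, cur_w, next_w = padded[i - 1], padded[i], padded[i + 1]
        rows.append({
            "w-1": prev_w,
            "w0": cur_w,
            "w-1 w0": prev_w + " " + cur_w,
            "w+1": next_w,
            "label": label,
        })
    return rows

def to_binary(row):
    """Expand a compact row into sparse binary features (template=value -> 1)."""
    return {f"{name}={value}": 1 for name, value in row.items() if name != "label"}

tokens = ["Time", "flies", "like", "an", "arrow"]
labels = ["N", "V", "P"]  # labels for the first three tokens only
rows = context_features(tokens, labels)
for row in rows:
    print(row)
print(to_binary(rows[1]))
```

Every feature not listed in the binary dictionary is implicitly 0, which is what makes the compact form equivalent to the full sparse vector.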
12
Text Categorization
- Task: given a document, assign it to one of a finite set of classes
- What are the classes?
- What are the features?
13
Text 1
Several hundred protesters, some wearing goggles and gas masks, marched past authorities in a downtown street Sunday, hours after riot police forced Occupy Portland demonstrators out of a pair of weeks-old encampments in nearby parks. Police moved in shortly before noon and drove protesters into the street after dozens remained in the camp in defiance of city officials. Mayor Sam Adams had ordered that the camp shut down Saturday at midnight, citing unhealthy conditions and the encampment’s attraction of drug users and thieves. Anti-Wall Street protesters and their supporters flooded a city park area in Portland early Sunday in defiance of an eviction order, and authorities elsewhere stepped up pressure against the demonstrators, arresting nearly two dozen. (Nov. 13) More than 50 protesters were arrested in the police action, but officers did not use tear gas, rubber bullets or other so-called non-lethal weapons, police said.
Washington Post, online 11/13/2011
14
Text 2
George Washington coach Mike Lonergan looked at the stat sheet, tried to muster a smile then clicked off the reasons why the Colonials lost to No. 24 California on Sunday night. A piercing 21-0 run by the Golden Bears at the end of the first half was at the top of the list. Not even a second straight 20-point effort from Tony Taylor was enough to dig George Washington out of the early hole, and the Colonials spent the rest of the night in a futile game of catch-up. “I’ve never really been involved with a run quite like that,” Lonergan said after Cal’s 81-54 win over George Washington. “I tried calling a couple timeouts. It was very disappointing that we just never really got our composure back the rest of that half. To end it that way and not even score any points, that was basically the game right there.”
Washington Post, online 11/13/2011
15
Text 3
‘Jersey Boys’ at the National Theatre
By Jane Horwitz, Sunday, November 13, 5:29 PM
“Jersey Boys” is irresistible, and the touring company now at the National Theatre gets it almost entirely right. This Broadway hit (it has been running since fall 2005 and has played Washington before as well) rises well above the so-called jukebox show genre. Subtitled “The Story of Frankie Valli & the Four Seasons,” the musical tells a tale that transcends show business gossip to become a close character study of four talented but very different blue-collar guys from New Jersey — who just happen to have sung some of the best close-harmony rock/pop tunes of the late 1950s, the 1960s and into the 1970s.
Washington Post, online 11/13/2011
16
What categories? What features?
22
Example: Coreference
Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. Logue, a renowned speech therapist, was summoned to help the King overcome his speech impediment...
- Can be viewed as a classification problem
  - What are the inputs?
  - What are the categories?
  - What features would be useful?
23
Example: NER
- Named Entity tagging:
    John visited New York last Friday
    [person John] visited [location New York] [time last Friday]
- As a classification problem:
    John/PER-B visited/O New/LOC-B York/LOC-I last/TIME-B Friday/TIME-I
- Input? Features? Classes?
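The step from bracketed entities to per-token classes can be sketched as below. The `(start, end, type)` span representation and the function name `spans_to_tags` are assumptions made for the example; the tag style (TYPE-B, TYPE-I, O) follows the slide.

```python
def spans_to_tags(tokens, spans):
    """Convert labeled spans to per-token tags in the TYPE-B / TYPE-I style.

    spans: list of (start, end, type) with end exclusive; this span
    representation is an assumption made for the example.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = etype + "-B"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = etype + "-I"          # remaining tokens of the entity
    return tags

tokens = "John visited New York last Friday".split()
spans = [(0, 1, "PER"), (2, 4, "LOC"), (4, 6, "TIME")]
tags = spans_to_tags(tokens, spans)
tagged = " ".join(f"{w}/{t}" for w, t in zip(tokens, tags))
print(tagged)
# John/PER-B visited/O New/LOC-B York/LOC-I last/TIME-B Friday/TIME-I
```

Once each token carries its own tag, each token becomes one classification instance, with the tag as its class.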
24
Classifiers & Systems
25
Classifiers
- Wide variety
- Differ on several dimensions:
  - Supervision
  - Learning function
  - Input features
28
Supervision in Classifiers
- Supervised: the true label/class of each training instance is provided to the learner at training time
  - Naïve Bayes, MaxEnt, Decision Trees, Neural nets, etc.
- Unsupervised: no true labels are provided for examples during training
  - Clustering: k-means; min-cut algorithms
- Semi-supervised (bootstrapping): true labels are provided for only a subset of examples
  - Co-training, semi-supervised SVM/CRF, etc.
30
Inductive Bias
- What form of function is learned?
  - Function that separates members of different classes
    - Linear separator
    - Higher-order functions
    - Voronoi diagrams, etc.
- Graphically: the decision boundary
[Figure: + and - points separated by a decision boundary]
35
Machine Learning Functions
- Problem: can the representation effectively model the class to be learned?
  - Motivates selection of learning algorithm
[Figure: + and - points]
- For this function, a linear discriminant is GREAT!
- Rectangular boundaries (e.g. ID trees) are TERRIBLE!
- Pick the right representation!
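The slide's figure did not survive, so the following toy comparison assumes a diagonal class boundary (points are + when x + y >= 5). Under that assumption, one linear discriminant separates the data perfectly, while the best single axis-aligned split cannot. The data, the weights, and the restriction to a single split are all illustrative choices.

```python
# Toy data: grid points labeled + when x + y >= 5 (an assumed diagonal boundary).
points = [(x, y) for x in range(5) for y in range(5)]
labels = [x + y >= 5 for x, y in points]

def accuracy(predict):
    """Fraction of points whose predicted label matches the gold label."""
    correct = sum(predict(p) == gold for p, gold in zip(points, labels))
    return correct / len(points)

# Linear discriminant: a weighted sum of the features compared with a threshold.
linear_acc = accuracy(lambda p: p[0] + p[1] - 4.5 > 0)

# Best single axis-aligned split (a depth-1 decision tree): predict + when
# the chosen coordinate is at least t, trying both axes and every threshold.
stump_acc = max(
    accuracy(lambda p, axis=axis, t=t: p[axis] >= t)
    for axis in (0, 1)
    for t in range(6)
)

print(linear_acc, stump_acc)
```

A deeper tree could approximate the diagonal with a staircase of rectangular cuts, but it needs many splits to express what the linear model captures with two weights and a threshold.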
36
Machine Learning Features
- Inputs:
  - E.g. words, acoustic measurements, parts-of-speech, syntactic structures, semantic classes, ...
- Vectors of features:
  - E.g. word: letters
    - ‘cat’: L1 = c; L2 = a; L3 = t
  - Parts of syntax trees?
40
Machine Learning Features
- Questions:
  - Which features should be used?
  - How should they relate to each other?
- Issue 1: How do we define relations in feature space if features have different scales?
  - Solution: scaling/normalization
- Issue 2: Which features are important?
  - If two instances differ only in an irrelevant feature, the difference should be ignored
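One common form of scaling is min-max normalization, sketched below. It is one option among several (z-scoring is another) and is not named on the slide; the two example features are hypothetical.

```python
def min_max_scale(rows):
    """Rescale each feature column to the range [0, 1].

    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled

# Hypothetical instances: (document length in words, number of capitalized words)
rows = [[100, 1], [500, 3], [900, 2]]
print(min_max_scale(rows))  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

Without scaling, the document-length feature would dominate any distance computed over these vectors simply because its raw values are larger.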
43
Machine Learning Toolkits
- Many learners, many tools/implementations
- Some broad tool sets:
  - weka: Java, lots of classifiers, pedagogically oriented
  - mallet: Java, classifiers, sequence learners; more heavy duty
47
Mallet
- Machine learning toolkit
- Developed at UMass Amherst by Andrew McCallum
- Java implementation, open source
- Large collection of machine learning algorithms
  - Targeted to language processing
  - Naïve Bayes, MaxEnt, Decision Trees, Winnow, Boosting
  - Also clustering, topic models, sequence learners
- Widely used, but
  - Research software: some bugs/gaps; odd documentation
48
Installation
- Installed on patas: /NLP_TOOLS/tool_sets/mallet/latest/
  - Will be updated to 2.0.7
- Directories:
  - bin/: script files
  - src/: java source code
  - class/: java classes
  - lib/: jar files
  - sample-data/: wikipedia docs for language id, etc.
50
Environment
- Should be set up on patas
- $PATH should include:
    /NLP_TOOLS/tool_sets/mallet/latest/bin
- $CLASSPATH should include:
    /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet-deps.jar
    /NLP_TOOLS/tool_sets/mallet/latest/lib/mallet.jar
- Check:
    which text2vectors
    /NLP_TOOLS/tool_sets/mallet/latest/bin
52
Mallet Commands
- Mallet command types:
  - Data preparation
  - Data/model inspection
  - Training
  - Classification
- Command line scripts
  - Shell scripts
    - Set up java environment
    - Invoke java programs
  - --help lists command line parameters for scripts
53
Mallet Data
- Mallet data instances:
    Instance_id label f1 v1 f2 v2 ...
- Stored in internal binary format: "vectors"
  - Binary format used by learners, decoders
- Need to convert text files to binary format
54
Data Preparation
- Built-in data importers
  - One class per directory, one instance per file:
      bin/mallet import-dir --input IF --output OF
    - Label is directory name
    - (Also text2vectors)
  - One instance per line:
      bin/mallet import-file --input IF --output OF
    - Line: instance label text ...
    - (Also csv2vectors)
- Create binary representation of text feature counts
58
Data Preparation
- bin/mallet import-svmlight --input IF --output OF
  - Allows import of user-constructed feature-value pairs
  - Format: label f1:v1 f2:v2 ... fn:vn
    - Features can be strings or indexes
  - (Also bin/svmlight2vectors)
- If building test data separately from the original:
    bin/mallet import-svmlight --input IF --output OF --use-pipe-from previously_built.vectors
  - Ensures consistent feature representation
- Note: can't mix svmlight models with others
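A small Python sketch of producing lines in the `label f1:v1 f2:v2 ...` format from a feature dictionary. The function name `to_svmlight_line` and the rule that replaces spaces and colons in feature names are assumptions for the example; the importer's exact tolerance for such characters is not stated on the slides.

```python
def to_svmlight_line(label, features):
    """Format one instance as 'label f1:v1 f2:v2 ...'.

    Feature names are used as strings. Spaces and colons inside a name
    are replaced with '_' here, since they would break the
    whitespace/colon-delimited format (this cleanup rule is an assumption).
    """
    parts = [str(label)]
    for name, value in sorted(features.items()):
        safe = name.replace(" ", "_").replace(":", "_")
        parts.append(f"{safe}:{value}")
    return " ".join(parts)

line = to_svmlight_line("V", {"w-1=Time": 1, "w0=flies": 1, "w+1=like": 1})
print(line)  # V w+1=like:1 w-1=Time:1 w0=flies:1
```

Writing one such line per instance to a text file gives an input file for the import command; sorting the features only makes the output deterministic.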
63
Accessing Binary Formats
- vectors2info --input IF
  - --print-labels TRUE
    - Prints list of category labels in data set
  - --print-matrix sic
    - Prints all features and values by string and number
    - Returns original text feature-value list, possibly out of order
- vectors2vectors --input IF --training-file TNF --testing-file TTF --training-portion pct
  - Creates random training/test splits in some ratio
68
Building & Accessing Models
- bin/mallet train-classifier --input vector_data_file --trainer classifiertype --training-portion 0.9 --output-classifier OF
  - Builds classifier model
  - Can also store model, produce scores, confusion matrix, etc.
  - --trainer: MaxEnt, DecisionTree, NaiveBayes, etc.
  - --report: train:accuracy, test:f1:en
- Can also use pre-split training & testing files (e.g. output of vectors2vectors)
  - --training-file, --testing-file
- Sample output:
    Confusion Matrix, row=true, column=predicted  accuracy=1.0
    label   0   1  |total
     0 de   1   .  |1
     1 en   .   1  |1
    Summary. train accuracy mean = 1.0 stddev = 0 stderr = 0
    Summary. test accuracy mean = 1.0 stddev = 0 stderr = 0
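For reference, the accuracy and per-class F1 values that such a report summarizes can be computed from a confusion matrix as sketched below (rows are true classes, columns are predicted classes). This is a plain Python illustration with assumed function names, not Mallet's own code.

```python
def accuracy_from_confusion(matrix):
    """matrix[i][j] = number of instances with true class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def f1_for_class(matrix, k):
    """F1 for class k: harmonic mean of precision and recall."""
    tp = matrix[k][k]
    if tp == 0:
        return 0.0
    predicted = sum(row[k] for row in matrix)  # column sum: predicted as k
    actual = sum(matrix[k])                    # row sum: truly k
    precision, recall = tp / predicted, tp / actual
    return 2 * precision * recall / (precision + recall)

matrix = [[1, 0], [0, 1]]  # rows/columns: de, en, as in the sample output
print(accuracy_from_confusion(matrix), f1_for_class(matrix, 1))  # 1.0 1.0
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total count.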
70
Accessing Classifiers
- classifier2info --classifier maxent.model
  - Prints out contents of model file

    FEATURES FOR CLASS en
     <default> -0.036953801963395115
     book 0.004605219133228236
     the 0.24270652500835088
     i 0.004605219133228236
74
Testing
- Use new data to test a previously built classifier
    bin/mallet classify-svmlight --input testfile --output outputfile --classifier maxent.model
  - Also for instance files and directories: classify-file, classify-dir
- Prints class,score matrix:

    Inst_id  class1 score1  class2 score2
    array:0  en 0.995  de 0.0046
    array:1  en 0.970  de 0.0294
    array:2  en 0.064  de 0.935
    array:3  en 0.094  de 0.905
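The class,score matrix can be turned into a single predicted label per instance by taking the highest-scoring class. The sketch below assumes whitespace-separated fields, as in the sample output; the function name `best_labels` is hypothetical.

```python
def best_labels(lines):
    """Parse 'inst_id class1 score1 class2 score2 ...' lines.

    Assumes whitespace-separated fields, as in the sample output.
    Returns {inst_id: (best_class, best_score)}.
    """
    results = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        inst_id, rest = fields[0], fields[1:]
        pairs = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
        results[inst_id] = max(pairs, key=lambda pair: pair[1])
    return results

sample = [
    "array:0 en 0.995 de 0.0046",
    "array:1 en 0.970 de 0.0294",
    "array:2 en 0.064 de 0.935",
    "array:3 en 0.094 de 0.905",
]
for inst_id, (label, score) in best_labels(sample).items():
    print(inst_id, label, score)
```

Comparing these predicted labels with the gold labels of the test file gives the counts needed for accuracy or a confusion matrix.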
77
General Use
- bin/mallet import-svmlight --input svmltrain.vectors.txt --output svmltrain.vectors
  - Builds binary representation from feature:value pairs
- bin/mallet train-classifier --input svmltrain.vectors --trainer MaxEnt --output-classifier svml.model
  - Trains MaxEnt classifier and stores model
- bin/mallet classify-svmlight --input svmltest.vectors.txt --output - --classifier svml.model
  - Tests on the new data
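The three commands can also be driven from a script. The sketch below only assembles the argument lists using the flags shown on this slide; `mallet_pipeline` and `run_pipeline` are hypothetical helper names, and actually running the commands assumes `bin/mallet` is reachable from the working directory.

```python
import subprocess

def mallet_pipeline(train_txt, train_vectors, test_txt, model, mallet="bin/mallet"):
    """Return the import, train and classify commands as argument lists."""
    return [
        [mallet, "import-svmlight", "--input", train_txt, "--output", train_vectors],
        [mallet, "train-classifier", "--input", train_vectors,
         "--trainer", "MaxEnt", "--output-classifier", model],
        [mallet, "classify-svmlight", "--input", test_txt,
         "--output", "-", "--classifier", model],
    ]

def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)

commands = mallet_pipeline("svmltrain.vectors.txt", "svmltrain.vectors",
                           "svmltest.vectors.txt", "svml.model")
for argv in commands:
    print(" ".join(argv))
```

Calling `run_pipeline(commands)` would execute the steps; it is left uncalled here so the example runs without Mallet installed.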
80
Other Information
- Website: download and documentation (such as it is)
    http://mallet.cs.umass.edu
- API tutorial:
    http://mallet.cs.umass.edu/mallet-tutorial.pdf
- Local guide (refers to older version 0.4):
    http://courses.washington.edu/ling572/winter07/homework/mallet_guide.pdf