Fex Feature Extractor - v2
Topics
– Vocabulary
– Syntax of the scripting language
  – Feature functions
  – Operators
– Examples
  – POS tagging
– Input Formats
Vocabulary
– example: a list of active records for which Fex produces a single SNoW example. Usually a sentence.
– record: a single position in an example (sentence). Contains a list of fields, each of which holds a different kind of information, e.g. NLP: word, tag; vision: color, etc.
– Raw input to Fex: a list of valid examples (raw sentences, tagged corpora, etc.).
– Fex's output: lexical features are written to the lexicon file; their corresponding numeric IDs are written to the example file.
– feature function: a relation among one or more records.
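The vocabulary above can be made concrete with a small sketch. All names here are illustrative Python, not Fex's own data structures:

```python
# A record: one position in the sentence, with named fields (e.g. word, tag).
record = {"word": "dog", "tag": "NN"}

# An example: the list of active records Fex consumes at once -- usually a sentence.
example = [
    {"word": "The", "tag": "DET"},
    {"word": "dog", "tag": "NN"},
    {"word": "is",  "tag": "V"},
    {"word": "mad", "tag": "JJ"},
]

# Fex's two outputs: the lexicon maps each lexical feature to a numeric ID,
# and the example file lists the IDs that are active in each example.
lexicon = {"w[The]": 10001, "w[is]": 10002}
example_line = ", ".join(str(i) for i in sorted(lexicon.values())) + ":"
# example_line == "10001, 10002:"
```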
Example: Feature Functions
Script Syntax

A Fex script file contains a list of definitions, each of which rewrites the given observation into a set of active features.

Definition format (terms in parentheses are optional):

  target (inc) (loc): FeatureFunc ([left, right])

– target: target index or word. To treat each record in the observation as a target, use -1. This is a macro for "all words".
– inc: include the target word instead of the placeholder (*) in some features.
– loc: generate features with their location relative to the target.
– FeatureFunc: a feature function defined in terms of certain unary and n-ary relations, and operators.
– left: left offset of the scope for generating features. Negative values are left of the target, positive to the right.
– right: right offset of the scope.
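The definition format above is regular enough to parse with one expression. This is a hypothetical parser sketching the syntax as described, not Fex's own code:

```python
import re

# One Fex definition: target (inc) (loc): FeatureFunc [left, right]
DEF_RE = re.compile(
    r"^\s*(?P<target>\S+?)"            # target index or word (-1 = all records)
    r"(?P<inc>\s+inc)?"                # optional: keep target word, not '*'
    r"(?P<loc>\s+loc)?"                # optional: attach relative location
    r"\s*:\s*"
    r"(?P<func>[\w|&(),=\s]+?)"        # feature function, possibly with operators
    r"(?:\s*\[\s*(?P<left>-?\d+)\s*,\s*(?P<right>-?\d+)\s*\])?"  # optional scope
    r"\s*$"
)

def parse_def(line):
    m = DEF_RE.match(line)
    if not m:
        raise ValueError(f"bad definition: {line!r}")
    d = m.groupdict()
    return {
        "target": d["target"],
        "inc": d["inc"] is not None,
        "loc": d["loc"] is not None,
        "func": d["func"],
        "scope": (int(d["left"]), int(d["right"])) if d["left"] else None,
    }
```

For instance, `parse_def("-1 loc: t [-2,2]")` yields target `-1` with the `loc` flag, function `t`, and scope `(-2, 2)`.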
Basic Feature Functions

Type       Fex Notation  Interpretation                                 Output to Lexicon
Label      lab           Produces a label feature                       lab[target word]
           lab(t)                                                       lab[target tag]
Word       w             Active if word(s) in the current record        w[current word]
                         are within scope
Tag (POS)  t             Active if tag(s) in the current record         t[current tag]
                         are within scope
Vowel      v             Active if the word(s) in the current           v[initial vowel]
                         record begin with a vowel
Prefix     pre           Active if the word(s) in the current           pre[active prefix]
                         record begin with a prefix from a given list
Suffix     suf           Active if the word(s) in the current           suf[the active suffix]
                         record end with a suffix from a given list
Baseline   base          Active if a baseline tag from a prepared       base[baseline tag]
                         list exists for the word(s) in the
                         current record
Lemma      lem           Active if a lemma from the WordNet             lem[active lemma]
                         database exists for the word(s) in the
                         current record
Example

Sentence = "(DET The) (NN dog) (V is) (JJ mad)"

Method 1:
  Script Def       Output to lexicon   Output to example file
  dog: w [-1,1]    10001 w[The]        10001, 10002, 10003, 10004:
  dog: t [1,2]     10002 w[is]
                   10003 t[V]
                   10004 t[JJ]

Method 2:
  Script Def       Output to lexicon   Output to example file
  -1: lab          1     lab[dog]      1, 10001, 10002, 10003, 10004:
  -1: w [-1,1]     10001 w[The]
  -1: t [1,2]      10002 w[is]
                   10003 t[V]
                   10004 t[JJ]
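The window defs in Method 1 can be sketched as follows. This is illustrative Python, not Fex itself; it assumes, as the output above suggests, that the target position itself is skipped inside the scope:

```python
sentence = [("DET", "The"), ("NN", "dog"), ("V", "is"), ("JJ", "mad")]

def window_features(sentence, target, field, left, right):
    """Emit 'field' features for records in [target+left, target+right],
    skipping the target record itself (its slot stays implicit)."""
    feats = []
    for i in range(target + left, target + right + 1):
        if i == target or not (0 <= i < len(sentence)):
            continue
        tag, word = sentence[i]
        feats.append(f"{field}[{word if field == 'w' else tag}]")
    return feats

target = 1  # "dog"
feats = (window_features(sentence, target, "w", -1, 1) +
         window_features(sentence, target, "t", 1, 2))
# feats == ['w[The]', 'w[is]', 't[V]', 't[JJ]']

# Assign lexicon IDs in order of first appearance, as on the slide:
lexicon = {f: 10001 + i for i, f in enumerate(feats)}
example_line = ", ".join(str(lexicon[f]) for f in feats) + ":"
# example_line == "10001, 10002, 10003, 10004:"
```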
Operators & Complex Functions

Sentence = "(DET The) (NN dog) (V is) (JJ mad)"

(X) operator: indicates that a feature is active without any specific instantiation.
  Script Def          Output to Lexicon
  dog: v(X) [-1,1]    v[]

(x=y) operator: creates an active feature iff the active instantiation matches the given argument.
  Script Def          Output to Lexicon
  dog: w(x=is)        w[is]
& operator: conjoins two features, producing a new feature that is active iff the record fulfills both constituent features.
  Script Def          Output to Lexicon
  dog: w&t [-1,-1]    w[The]&t[DET]

| operator: disjunction of two features, outputting a feature for each term of the disjunction that is active in the current record.
  Script Def          Output to Lexicon
  dog: w|t [-1,-1]    w[The]
                      t[DET]

Sentence = "(DET The) (NN dog) (V is) (JJ mad)"
coloc function: consecutive feature function. Takes two or more features as arguments to produce a consecutive collocation over two or more records. The order of the arguments is preserved in the active feature.
  Script Def                 Output to Lexicon
  mad: coloc(w, t) [-3,-1]   10001 w[The]-t[NN]
                             10002 w[dog]-t[V]

scoloc function: sparse consecutive feature function. Operates similarly to coloc, except that active collocations need not be consecutive. However, the order of the arguments is still preserved in determining whether a feature is active.
  Script Def                 Output to Lexicon
  mad: scoloc(w,t) [-3,-1]   10001 w[The]-t[NN]
                             10002 w[dog]-t[V]
                             10003 w[The]-t[V]
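The two functions differ only in which index tuples they consider: coloc takes runs of adjacent records, scoloc takes all order-preserving subsets. A sketch with illustrative names, over the window [-3,-1] relative to "mad":

```python
from itertools import combinations

records = [("The", "DET"), ("dog", "NN"), ("is", "V")]  # window contents

def feat(record, field):
    word, tag = record
    return f"w[{word}]" if field == "w" else f"t[{tag}]"

def coloc(records, fields):
    """Consecutive collocations: one feature per run of adjacent records."""
    n = len(fields)
    return ["-".join(feat(r, f) for r, f in zip(records[i:i + n], fields))
            for i in range(len(records) - n + 1)]

def scoloc(records, fields):
    """Sparse collocations: records need not be adjacent; order is kept."""
    return ["-".join(feat(records[i], f) for i, f in zip(idx, fields))
            for idx in combinations(range(len(records)), len(fields))]

coloc(records, ["w", "t"])   # ['w[The]-t[NN]', 'w[dog]-t[V]']
scoloc(records, ["w", "t"])  # adds the non-adjacent 'w[The]-t[V]'
```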
Example: POS tagging

Useful features for POS tagging:
– The preceding word is tagged c.
– The following word is tagged c.
– The word two before is tagged c.
– The word two after is tagged c.
– The preceding word is tagged c and the following word is tagged t.
– The preceding word is tagged c and the word two before is tagged t.
– The following word is tagged c and the word two after is tagged t.
– The current word is w.
– The most probable part of speech for the current word is c.
Given the sentence:
  (t1 The) (t2 dog) (t3 ran) (t4 very) (t5 quickly)

the following Fex script will produce the features from the last slide:

  -1: lab(t)
  -1 loc: t [-2,2]
  -1: coloc(t,t,t) [-2,2]
  -1 inc: w [0,0]
  -1: base [0,0]

To do POS tagging, an example needs to be generated for each word in the observation.
For the third word, "ran", the script produces the following output:

  Script                    Lexicon Output
  -1: lab(t)                1     lab[t3]
  -1 loc: t [-2,2]          10001 t[t1_*]
                            10002 t[t2*]
                            10003 t[*t4]
                            10004 t[*_t5]
  -1: coloc(t,t,t) [-2,2]   10005 t[t1]-t[t2]-*
                            10006 t[t2]-*-t[t4]
                            10007 *-t[t4]-t[t5]
  -1 inc: w [0,0]           10008 w[ran]
  -1: base [0,0]            10009 base[V]

And an example in the example file:
  1, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009:
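The location-marked tag features above appear to encode distance from the target by underscore padding (one '_' per extra position away from the '*'). That reading is an inference from the sample output, not a documented rule; a sketch:

```python
tags = ["t1", "t2", "t3", "t4", "t5"]
target = 2  # "ran"

def loc_tag_features(tags, target, left, right):
    """Tag features over [target+left, target+right], each marked with its
    position relative to the target '*' via underscore padding."""
    feats = []
    for i in range(target + left, target + right + 1):
        if i == target or not (0 <= i < len(tags)):
            continue
        pad = "_" * (abs(i - target) - 1)
        body = tags[i] + pad + "*" if i < target else "*" + pad + tags[i]
        feats.append(f"t[{body}]")
    return feats

loc_tag_features(tags, target, -2, 2)
# ['t[t1_*]', 't[t2*]', 't[*t4]', 't[*_t5]']
```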
Input Formats

Fex can presently accept data in two formats. The first covers plain, tagged, and mixed input:
– w1 w2 w3 w4 …
– (t1 w1) (t2 w2) (t3 w3) (t4 w4) …
– w1 (t2 w2) (t3 t3a; w3) (t4; w4 w4a) …
Input Formats

– Old format:
  w1 (t2 w2) (t3 t3a; w3) (t4; w4 w4a)

– New format (ILK), one token per line:
  I-NP    NNP    Pierre  NOFUNC  Vinken
  I-NP    NNP    Vinken  NP-SBJ  join
  O       COMMA  COMMA   NOFUNC  Vinken
  I-NP    CD     61      NOFUNC  years
  I-NP    NNS    years   NP      old
  I-ADJP  JJ     old     ADJP    Vinken
  O       COMMA  COMMA   NOFUNC  Vinken
  I-VP    MD     will    NOFUNC  join
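Reading the column-per-token new format is straightforward. The column names assumed here (chunk tag, POS, word, function, head word) are an interpretation of the sample above, not documented fact:

```python
def read_ilk(lines):
    """Parse ILK-style lines: one whitespace-separated token per line."""
    tokens = []
    for line in lines:
        chunk, pos, word, func, head = line.split()
        tokens.append({"chunk": chunk, "pos": pos, "word": word,
                       "func": func, "head": head})
    return tokens

sample = [
    "I-NP NNP Pierre NOFUNC Vinken",
    "I-NP NNP Vinken NP-SBJ join",
]
tokens = read_ilk(sample)
# tokens[0]["word"] == "Pierre"
```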
Using Fex (command line)

  fex [options] script-file lexicon-file corpus-file example-file

Options:
– -t <target file>: one target per line; do not leave any empty lines in the file!
– -r: test mode; does not create new features.
– -h, -I: create a histogram of active features.
Using Fex (command line)

Target file = targ:
  dog
  cat

Script file = script:
  -1 : lab
  -1 : w [-1,-1]
  -1 : t [-1,-1]

Corpus file = corpus:
  (DET The) (NN dog) (V is) (JJ mad)

Lexicon file = lexicon
Example file = example

  fex -t targ script lexicon corpus example
SNoW
Sparse Networks Of Winnows

[Architecture diagram: basic features and knowledge-enriched features are mapped, via a constant feature mapping, to complex features that feed learned target nodes such as "say" and "join".]
Word representation
Restrictions on the learning approach
– Multi-class
– Variable number of features (per class, per example)
– Efficient learning
– Efficient evaluation
SNoW
– A network of threshold gates.
– Target nodes represent class labels.
– Input nodes (features) and links are allocated in a data-driven way (on the order of 10^5 input features for many target nodes).
– Each sub-network (target node) is learned autonomously as a function of the features.
– A presented example is positive for one network and negative for the others (depending on the algorithm).
– Allocation of nodes (features) and links is data-driven: a link between feature f_i and target t_j is created only when f_i was active with target t_j.
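The data-driven allocation above can be sketched as follows. Names and the default weight are illustrative, not SNoW's internals:

```python
from collections import defaultdict

class SparseNetwork:
    """Sketch of SNoW-style lazy link allocation: a weight for feature f_i
    in the sub-network of target t_j exists only once f_i has been active
    in an example presented with t_j."""

    def __init__(self, init_weight=1.0):
        self.init_weight = init_weight
        # weights[target][feature] -- links are created lazily, never dense
        self.weights = defaultdict(dict)

    def see(self, target, active_features):
        """Allocate links for features co-occurring with this target."""
        w = self.weights[target]
        for f in active_features:
            if f not in w:
                w[f] = self.init_weight

    def score(self, target, active_features):
        """Activation is summed only over the example's active features."""
        w = self.weights[target]
        return sum(w.get(f, 0.0) for f in active_features)
```

At decision time, the predicted class is simply the target whose sub-network scores highest on the example's active features.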
Word prediction using SNoW
– Target nodes: each word in the set of candidate words is a target node.
– Input nodes: an input node for feature f_i is allocated only if f_i was active with any target.
– Decision task: we need to choose one target among all possible candidates.
SNoW (command line)

  snow -train -I inputfile -F networkfile [-ABcdePrsTvW]
  snow -test  -I inputfile -F networkfile [-bEloRvw]

Architecture:
– Winnow:     -W <alpha, beta, threshold, init weight> :targets
– Perceptron: -P <learning rate, threshold, init weight> :targets
– NB:         -B :targets
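The -W architecture uses the Winnow update: mistake-driven multiplicative promotion/demotion of the weights of active features only. A sketch with illustrative parameter values (alpha, beta, theta are not SNoW's defaults):

```python
def winnow_update(weights, active, label, alpha=1.5, beta=0.8, theta=1.0):
    """One Winnow step: update only on a mistake, and only the weights of
    the example's active features (unseen features start at weight 1.0)."""
    activation = sum(weights.get(f, 1.0) for f in active)
    predicted = activation >= theta
    if predicted == label:
        return weights                  # correct -> no change
    factor = alpha if label else beta   # promote missed positive, demote false positive
    for f in active:
        weights[f] = weights.get(f, 1.0) * factor
    return weights
```

Because only active features are touched, each update costs time proportional to the example size, not the feature-space size, which is what makes the 10^5-feature networks practical.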
SNoW parameters (training)
– -d <none | abs | rel>: discarding method
– -e <threshold>: eligibility threshold
– -r <cycles>: number of cycles

Output modes:
– -c <interval>: interval for network snapshots
– -v <level>: level of detail for the output to the screen
SNoW parameters (testing)
– -b <value>: smoothing for NB
– -w <value>: smoothing for Winnow, Perceptron

Output modes:
– -E <file>: error file
– -o <mode>: level of detail for the output
– -R <file>: results file (default stdout)
File Format (Example file)

  6, 10034, 10141, 10151, 10158, 10179:
  177, 10034, 10035, 10047:

With weights:
  6, 10034(1), 10141(1.5), 10151(0.4), 10158(2), 10179(0.1):
  177, 10034(2), 10035(4), 10047(0.6):

Only active features appear in an example!
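A line in this format is a label followed by active feature IDs, each with an optional weight in parentheses, terminated by ':'. A hypothetical parser for it:

```python
import re

def parse_example(line):
    """Parse one SNoW example line into (label, [(feature_id, weight), ...]).
    A missing weight defaults to 1.0."""
    line = line.rstrip().rstrip(":")
    ids = []
    for tok in line.split(","):
        m = re.match(r"\s*(\d+)(?:\(([\d.]+)\))?\s*$", tok)
        ids.append((int(m.group(1)), float(m.group(2) or 1.0)))
    label, feats = ids[0][0], ids[1:]
    return label, feats

parse_example("6, 10034(1), 10141(1.5):")
# (6, [(10034, 1.0), (10141, 1.5)])
```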
File Format (Network file)

  NB:          target naivebayes : 0 : : : 0 : :
  Winnow:      target winnow : 0 : : : 0 : :
  Perceptron:  target perceptron : 0 : : : 0 : :
File Format (Error file)

  Algorithms: Perceptron: (1, 30, 0.05)   Targets: 3, 53, 73

  Ex: 8   Prediction: 3   Label: 53
    3:  : *
    73:

  Ex: 15  Prediction: 3   Label: 73
    3:  : *
    53: