Machine Learning in GATE
Valentin Tablan
2 Machine Learning in GATE
– Uses classification: [Attr 1, Attr 2, Attr 3, … Attr n] → Class
– Classifies annotations. (Documents can be classified as well using a simple trick.)
– Annotations of a particular type are selected as instances.
– Attributes refer to instance annotations.
– Attributes have a position relative to the instance annotation they refer to.
3 Attributes
Attributes can be:
– Boolean: the presence [or absence] of an annotation of a particular type [partially] overlapping the referred instance annotation.
– Nominal: the value of a particular feature of the referred instance annotation. The complete set of acceptable values must be specified a priori.
– Numeric: the numeric value (converted from String) of a particular feature of the referred instance annotation.
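The three attribute kinds can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `Ann` is a hypothetical, simplified stand-in for an annotation (it is not a GATE class), and treating an out-of-set nominal value or an unparsable number as "missing" is an assumption, not documented behaviour of the ML PR.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.OptionalDouble;
import java.util.Set;

final class AttributeSketch {
    /** Hypothetical stand-in for an annotation: a type, a character span, and string features. */
    record Ann(String type, int start, int end, Map<String, String> features) {
        /** True when the two spans share at least one character offset. */
        boolean overlaps(Ann other) {
            return start < other.end && other.start < end;
        }
    }

    /** Boolean attribute: does any annotation of the given type overlap the instance? */
    static boolean booleanAttr(Ann instance, List<Ann> all, String type) {
        return all.stream().anyMatch(a -> a.type().equals(type) && a.overlaps(instance));
    }

    /** Nominal attribute: the feature value, kept only if it is in the predeclared value set. */
    static Optional<String> nominalAttr(Ann referred, String feature, Set<String> allowed) {
        String v = referred.features().get(feature);
        // Assumption: values outside the declared set are treated as missing.
        return (v != null && allowed.contains(v)) ? Optional.of(v) : Optional.empty();
    }

    /** Numeric attribute: the feature value converted from String. */
    static OptionalDouble numericAttr(Ann referred, String feature) {
        String v = referred.features().get(feature);
        if (v == null) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(v));
        } catch (NumberFormatException e) {
            // Assumption: unparsable values are treated as missing.
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        Ann token = new Ann("Token", 0, 6, Map.of("category", "NNP", "length", "6"));
        Ann lookup = new Ann("Lookup", 2, 9, Map.of());
        List<Ann> doc = List.of(token, lookup);

        System.out.println(booleanAttr(token, doc, "Lookup"));
        System.out.println(nominalAttr(token, "category", Set.of("NN", "NNP", "NNPS")));
        System.out.println(numericAttr(token, "length"));
    }
}
```

The feature names `category` and `length` and the `Lookup` type are used here purely as example inputs.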
4 Implementation
Machine Learning PR in GATE.
Has two functioning modes:
– training
– application
Uses an XML file for configuration: …
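5 Configuration file: dataset definition
Instance type: Token
Attribute: name POS_category(0), type Token, feature category, position 0, values NN, NNP, NNPS, … [ ] …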
6 Configuration file: engine definition
Wrapper: gate.creole.ml.weka.Wrapper
Classifier: weka.classifiers.j48.J48, options: -K
7 Attributes Position
[Figure: attribute positions relative to the instances; instances type: Token]
8 Machine Learning PR
– Can save a learnt model to an external file for later use.
– Saves the actual model and the collected dataset.
– Can export the collected dataset in .arff format.
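For readers unfamiliar with the .arff format used by WEKA, the sketch below writes a small dataset in that layout: a `@relation` line, one `@attribute` line per attribute (nominal values in braces, or the keyword `numeric`), then `@data` with one comma-separated row per instance and `?` for a missing value. It is not the exporter used by the ML PR; the class, the relation name and the attribute names are hypothetical and chosen only for illustration.

```java
import java.util.List;

final class ArffSketch {
    /** An attribute is nominal when it carries a value list and numeric when the list is null. */
    record Attr(String name, List<String> nominalValues) {}

    /** Renders the header and data rows as ARFF text. Rows must already use "?" for missing values. */
    static String toArff(String relation, List<Attr> attrs, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (Attr a : attrs) {
            sb.append("@attribute ").append(a.name()).append(' ');
            if (a.nominalValues() == null) {
                sb.append("numeric");
            } else {
                sb.append('{').append(String.join(",", a.nominalValues())).append('}');
            }
            sb.append('\n');
        }
        sb.append("\n@data\n");
        for (List<String> row : rows) {
            sb.append(String.join(",", row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tags = List.of("NN", "NNP", "NNPS");
        List<Attr> attrs = List.of(
                new Attr("pos_prev", tags),      // hypothetical attribute names
                new Attr("token_length", null),
                new Attr("pos_class", tags));
        List<List<String>> rows = List.of(
                List.of("?", "6", "NNP"),
                List.of("NNP", "4", "NN"));
        System.out.print(toArff("pos_context", attrs, rows));
    }
}
```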
9 Standard Use Scenario
Training
– Prepare training data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).
– Run the ML PR in training mode.
– Export the dataset as .arff and perform experiments using the WEKA interface in order to find the best attribute set / algorithm / algorithm options.
– Update the configuration file accordingly.
– Run the ML PR again to collect the actual data.
– [ Save the learnt model. ]
Application
– Prepare data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).
– [ Load the previously saved model. ]
– Run the ML PR in application mode.
– [ Save the learnt model. ]
10 An Example Learn POS category from POS context.
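A minimal sketch of how instances for this example could be assembled: each token is one instance, the attributes are the POS categories found at chosen positions relative to it, and the class is the POS category at position 0. The class and method names are hypothetical, and marking out-of-range positions with `?` is an assumption made for the illustration, not a statement about what the ML PR does at document boundaries.

```java
import java.util.ArrayList;
import java.util.List;

final class PosContextSketch {
    /**
     * Builds one row per token: the POS tags at the given relative positions,
     * followed by the tag of the token itself (the class, position 0).
     */
    static List<List<String>> buildInstances(List<String> tags, int[] positions) {
        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i < tags.size(); i++) {
            List<String> row = new ArrayList<>();
            for (int p : positions) {
                int j = i + p;
                // Assumption: positions that fall outside the document are marked as missing.
                row.add(j >= 0 && j < tags.size() ? tags.get(j) : "?");
            }
            row.add(tags.get(i)); // class value
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> tags = List.of("DT", "NN", "VBZ");
        // Context: one token to the left and one to the right of the instance.
        for (List<String> row : buildInstances(tags, new int[] {-1, 1})) {
            System.out.println(row);
        }
    }
}
```

With the three tags above and positions -1 and +1, the middle token yields the row `[DT, VBZ, NN]`: left context, right context, then the class.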
11 Using Other ML Libraries
The MLEngine Interface: Method Summary
– void addTrainingInstance(List attributes): Adds a new training instance to the dataset.
– Object classifyInstance(List attributes): Classifies a new instance.
– void init(): This method will be called after an engine is created and has its dataset and options set.
– void setDatasetDefinition(DatasetDefintion definition): Sets the definition for the dataset used.
– void setOptions(org.jdom.Element options): Sets the options from an XML JDOM element.
– void setOwnerPR(ProcessingResource pr): Registers the PR using the engine with the engine.
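To give a feel for what plugging in another learner involves, here is a sketch of a trivial engine that always predicts the most frequent class seen in training. It does not implement the real MLEngine interface: `SimpleMLEngine` is a hypothetical, reduced mirror of three of the methods listed above, declared locally so the example compiles without GATE, and the `classIndex` constructor argument stands in for information a real engine would presumably take from the dataset definition.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MajorityEngineSketch {
    /** Hypothetical, reduced mirror of the MLEngine methods (no GATE types involved). */
    interface SimpleMLEngine {
        void init();
        void addTrainingInstance(List<?> attributes);
        Object classifyInstance(List<?> attributes);
    }

    /** Baseline engine: ignores the attributes and predicts the most frequent training class. */
    static final class MajorityEngine implements SimpleMLEngine {
        private final int classIndex; // assumption: index of the class value in each training instance
        private final Map<Object, Integer> counts = new HashMap<>();

        MajorityEngine(int classIndex) {
            this.classIndex = classIndex;
        }

        @Override
        public void init() {
            counts.clear();
        }

        @Override
        public void addTrainingInstance(List<?> attributes) {
            counts.merge(attributes.get(classIndex), 1, Integer::sum);
        }

        @Override
        public Object classifyInstance(List<?> attributes) {
            // Returns null when nothing has been learnt yet.
            return counts.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse(null);
        }
    }

    public static void main(String[] args) {
        MajorityEngine engine = new MajorityEngine(2);
        engine.init();
        // Training mode: each instance is [left context, right context, class].
        engine.addTrainingInstance(List.of("DT", "VBZ", "NN"));
        engine.addTrainingInstance(List.of("JJ", "IN", "NN"));
        engine.addTrainingInstance(List.of("NN", "DT", "VBZ"));
        // Application mode: the class is unknown, only the context is supplied.
        System.out.println(engine.classifyInstance(List.of("DT", "IN")));
    }
}
```

The `main` method mirrors the two modes of the ML PR: instances are first added one by one, then a new instance is classified.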