Web-Mining Agents Part: Data Mining



Presentation on theme: "Web-Mining Agents Part: Data Mining"— Presentation transcript:

1 Web-Mining Agents Part: Data Mining
Prof. Dr. Ralf Möller, Dr. Özgür Özcep. Universität zu Lübeck, Institut für Informationssysteme. Tanya Braun (Exercises)

2 Discriminative Training: Markov Networks and Conditional Random Fields (CRFs)
Raymond J. Mooney, University of Texas at Austin

3 Probabilistic Classification
Let Y be the random variable for the class, which takes values {y1, y2, …, ym}. Let X be the random variable describing an instance, consisting of a vector of values for n features <X1, X2, …, Xn>; let xk be a possible vector value for X and xij a possible value for Xi. For classification, we need to compute P(Y=yi | X=xk) for i = 1…m. This could be done using the joint distribution, but that requires estimating an exponential number of parameters.

4 Bayesian Categorization
Determine the category of xk by computing, for each yi, the posterior P(Y=yi | X=xk) = P(X=xk | Y=yi) P(Y=yi) / P(X=xk) (Bayes' rule). P(X=xk) can be determined since the categories are complete and disjoint: P(X=xk) = Σi P(X=xk | Y=yi) P(Y=yi).

5 Bayesian Categorization (cont.)
Need to know: priors P(Y=yi) and conditionals P(X=xk | Y=yi). The priors P(Y=yi) are easily estimated from data: if ni of the examples in D are in class yi, then P(Y=yi) = ni / |D|. There are too many possible instances (e.g. 2^n for n binary features) to estimate all P(X=xk | Y=yi) directly. We still need to make some sort of independence assumption about the features to make learning tractable.

6 Naïve Bayes Generative Model
[Diagram: the generative view of Naïve Bayes. A category (pos/neg) is drawn first, and then each feature of the example (Size: lg/med/sm, Color: red/blue/grn, Shape: circ/tri/sqr) is drawn from that category's feature distributions; the slide shows two bags of labeled examples, Positive and Negative.]

7 Naïve Bayes Inference Problem
[Diagram: the inference problem. Given a test instance (e.g. Size=lg, Color=red, Shape=circ) with unknown category, decide which category's generative model (Positive or Negative) is more likely to have produced it.]

8 Naïve Bayesian Categorization
If we assume the features of an instance are independent given the category (conditionally independent), then we only need to know P(Xi | Y) for each possible pair of a feature value and a category. If Y and all Xi are binary, this requires specifying only 2n parameters: P(Xi=true | Y=true) and P(Xi=true | Y=false) for each Xi, with P(Xi=false | Y) = 1 − P(Xi=true | Y). Compare this to specifying on the order of 2^n parameters without any independence assumptions.
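As a hedged illustration of this parameterization, a minimal Naïve Bayes sketch for categorical features; the toy data, the add-one smoothing, and all names are illustrative assumptions, not part of the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Estimate the Naive Bayes counts: class priors and per-feature value counts per class."""
    priors = Counter(labels)
    cond = defaultdict(Counter)          # (feature_index, label) -> Counter over feature values
    for x, y in zip(examples, labels):
        for i, v in enumerate(x):
            cond[(i, y)][v] += 1
    return priors, cond, len(labels)

def predict(x, priors, cond, n):
    """Return the label maximizing log P(Y=y) + sum_i log P(X_i = x_i | Y=y)."""
    best, best_score = None, float("-inf")
    for y, cy in priors.items():
        score = math.log(cy / n)
        for i, v in enumerate(x):
            counts = cond[(i, y)]
            vocab = len(counts) + 1      # rough add-one (Laplace) smoothing denominator
            score += math.log((counts[v] + 1) / (cy + vocab))
        if score > best_score:
            best, best_score = y, score
    return best

# illustrative toy data in the spirit of the slides (Size, Color, Shape)
X = [("lg", "red", "circ"), ("sm", "blue", "tri"), ("med", "red", "circ"), ("lg", "grn", "sqr")]
Y = ["pos", "neg", "pos", "neg"]
priors, cond, n = train_naive_bayes(X, Y)
print(predict(("lg", "red", "circ"), priors, cond, n))   # expected: pos
```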

9 Generative vs. Discriminative Models
Generative models are not directly designed to maximize the performance of classification. They model the complete joint distribution P(X,Y). Classification is then done using Bayesian inference given the generative model of the joint distribution. But a generative model can also be used to perform any other inference task, e.g. P(X1 | X2, …Xn, Y) “Jack of all trades, master of none.” Discriminative models are specifically designed and trained to maximize performance of classification. They only model the conditional distribution P(Y | X). By focusing on modeling the conditional distribution, they generally perform better on classification than generative models when given a reasonable amount of training data.

10 Logistic Regression (for discrete vars)
Assumes a parametric form for directly estimating P(Y | X). For binary concepts, this is: Equivalent to a one-layer backpropagation neural net. Logistic regression is the source of the sigmoid function used in backpropagation. Objective function for training is somewhat different.
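The slide's own formula is not in the transcript; the standard binary logistic regression form it refers to is

$$
P(Y=1 \mid X) \;=\; \frac{1}{1 + \exp\!\big(-(w_0 + \sum_{i=1}^{n} w_i X_i)\big)},
\qquad
P(Y=0 \mid X) \;=\; 1 - P(Y=1 \mid X).
$$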

11 Logistic Regression as a Log-Linear Model
Logistic regression is basically a linear model, which is demonstrated by taking logs. Also called a maximum entropy model (MaxEnt) because it can be shown that standard training for logistic regression gives the distribution with maximum entropy that is consistent with the training data.
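Taking the log of the odds makes the linear form explicit (standard derivation, stated here for completeness):

$$
\log \frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} \;=\; w_0 + \sum_{i=1}^{n} w_i X_i .
$$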

12 Logistic Regression Training
Weights are set during training to maximize the conditional data likelihood (see below), where D is the set of training examples and Yd and Xd denote, respectively, the values of Y and X for example d. Equivalently, this can be viewed as maximizing the conditional log likelihood (CLL).
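In standard notation, the objective referred to here is

$$
W^{*} \;=\; \arg\max_{W} \prod_{d \in D} P(Y_d \mid X_d, W)
\;=\; \arg\max_{W} \sum_{d \in D} \ln P(Y_d \mid X_d, W).
$$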

13 Logistic Regression Training
As with neural nets, standard gradient ascent (or gradient descent on the negative CLL) can be used to find the parameters (weights) that optimize the CLL objective function. Many more advanced training methods exist to speed up convergence.

14 Preventing Overfitting in Logistic Regression
To prevent overfitting, one can use regularization (a.k.a. smoothing), penalizing large weights by changing the training objective (see the sketch below), where λ is a constant that determines the amount of smoothing. This can be shown to be equivalent to MAP parameter estimation assuming a Gaussian prior on W with zero mean and a variance related to 1/λ.
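As a hedged illustration (not code from the slides), a minimal sketch of binary logistic regression trained by gradient ascent on the L2-penalized conditional log likelihood; the learning rate, iteration count, and toy data are arbitrary assumptions.

```python
import numpy as np

def train_logistic(X, y, lam=0.1, lr=0.1, iters=1000):
    """Gradient ascent on  sum_d log P(y_d | x_d, w)  -  (lam/2) * ||w||^2.

    X: (m, n) feature matrix; y: (m,) vector of 0/1 labels.
    A bias column of ones is prepended; for simplicity its weight is regularized too.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))     # P(Y=1 | x, w) for every example
        grad = Xb.T @ (y - p) - lam * w       # gradient of the penalized CLL
        w += lr * grad
    return w

def predict_proba(X, w):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# illustrative usage on arbitrary toy data
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
w = train_logistic(X, y)
print(predict_proba(np.array([[2.5, 0.2]]), w))           # probability of class 1
```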

15 Multinomial Logistic Regression
Logistic regression can be generalized to multi-class problems (where Y has a multinomial distribution). Create a feature function for each combination of a class value y′ and a feature Xj, and another for the "bias weight" of each class:
f y′,j (Y, X) = Xj if Y = y′, and 0 otherwise
f y′ (Y, X) = 1 if Y = y′, and 0 otherwise
The final conditional distribution (see below) is a log-linear model in which k enumerates all of these feature functions (the class-feature combinations plus the per-class bias features), the λk are their weights, and Z(X) is a normalizing constant.
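In the standard maximum-entropy (log-linear) form, the resulting conditional distribution is

$$
P(Y = y \mid X) \;=\; \frac{1}{Z(X)} \exp\Big( \sum_{k} \lambda_k\, f_k(y, X) \Big),
\qquad
Z(X) \;=\; \sum_{y'} \exp\Big( \sum_{k} \lambda_k\, f_k(y', X) \Big).
$$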

16 Multinomial Logistic Regression
The formula boils down to the following. Assume Y has n class values and let βk (1 ≤ k ≤ n) be feature-weight vectors; we can set βn = 0 (the zero vector). Then
P(Y = k | X) = exp(βk · X) / (1 + Σ_{k′=1}^{n−1} exp(βk′ · X))  for k < n,
P(Y = n | X) = 1 / (1 + Σ_{k′=1}^{n−1} exp(βk′ · X)).
The special case of a Boolean class follows by taking one class value as the pivot and setting its weight vector to zero.
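A minimal sketch of this prediction rule (an illustration, not code from the slides); the weight matrix and feature vector are arbitrary.

```python
import numpy as np

def multinomial_probs(betas, x):
    """P(Y=k | x) for k = 1..n, where betas has n-1 rows (the n-th weight vector is pinned to zero)."""
    scores = betas @ x                       # beta_k . x for k = 1..n-1
    denom = 1.0 + np.sum(np.exp(scores))     # the 1 accounts for exp(0 . x) of the pivot class
    return np.append(np.exp(scores), 1.0) / denom   # probabilities sum to 1

# illustrative usage: 3 classes, 2 features
betas = np.array([[0.5, -1.0],
                  [1.2,  0.3]])
x = np.array([1.0, 2.0])
print(multinomial_probs(betas, x))
```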

17 Graphical Models If no assumption of independence is made, then an exponential number of parameters must be estimated for sound probabilistic inference, and no realistic amount of training data is sufficient to estimate so many parameters. If a blanket assumption of conditional independence is made, efficient training and inference are possible, but such a strong assumption is rarely warranted. Graphical models use directed or undirected graphs over a set of random variables to explicitly specify variable dependencies, allowing less restrictive independence assumptions while limiting the number of parameters that must be estimated.
Bayesian Networks: directed acyclic graphs that indicate causal structure.
Markov Networks: undirected graphs that capture general dependencies.

18 Bayesian Networks Directed Acyclic Graph (DAG)
Nodes are random variables; edges indicate causal influences.
[Example network: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls.]

19 Conditional Probability Tables
Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (the conditioning case). Roots (sources) of the DAG that have no parents are given prior probabilities.

Example (burglary-alarm network):
P(B) = .001    P(E) = .002

P(A | B, E):
B=T, E=T: .95    B=T, E=F: .94    B=F, E=T: .29    B=F, E=F: .001

P(J | A):  A=T: .90    A=F: .05
P(M | A):  A=T: .70    A=F: .01

20 Joint Distributions for Bayes Nets
A Bayesian network implicitly defines a joint distribution as the product of each variable's conditional probability given its parents. Example (using the CPTs above):
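In standard notation, the factorization and a worked example with the CPT values above (the slide's own example is not in the transcript) are

$$
P(x_1, \ldots, x_n) \;=\; \prod_{i=1}^{n} P\big(x_i \mid \mathrm{Parents}(X_i)\big),
$$

$$
P(J, M, A, \neg B, \neg E)
= P(J \mid A)\, P(M \mid A)\, P(A \mid \neg B, \neg E)\, P(\neg B)\, P(\neg E)
= 0.90 \times 0.70 \times 0.001 \times 0.999 \times 0.998 \;\approx\; 0.00063 .
$$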

21 Naïve Bayes as a Bayes Net
Naïve Bayes is a simple Bayes net.
[Diagram: Y is the single parent of X1, X2, …, Xn.]
The priors P(Y) and conditionals P(Xi | Y) of Naïve Bayes provide the CPTs for the network.

22 Markov Networks Undirected graph over a set of random variables, where an edge represents a dependency. The Markov blanket of a node, X, in a Markov Net is the set of its neighbors in the graph (nodes that have an edge connecting to X). Every node in a Markov Net is conditionally independent of every other node given its Markov blanket.

23 Distribution for a Markov Network
The distribution of a Markov net is most compactly described in terms of a set of potential functions (a.k.a. factors, compatibility functions), φk, one for each clique k in the graph. For each joint assignment of values to the variables in clique k, φk assigns a non-negative real value that represents the compatibility of these values. The joint distribution of a Markov network is then defined by the normalized product of the clique potentials (see below), where x{k} represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes the joint distribution sum to 1.
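In standard notation, the defining equation is

$$
P(x) \;=\; \frac{1}{Z} \prod_{k} \phi_k\big(x_{\{k\}}\big),
\qquad
Z \;=\; \sum_{x} \prod_{k} \phi_k\big(x_{\{k\}}\big).
$$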

24 Sample Markov Network
[Diagram: the burglary network as a Markov net, with nodes Burglary, Earthquake, Alarm, JohnCalls, MaryCalls and a potential table for each edge clique.]

φ1(B, A):   A=T: 100 (B=T), 1 (B=F)     A=F: 1 (B=T), 200 (B=F)
φ2(E, A):   A=T: 50 (E=T), 1 (E=F)      A=F: 10 (E=T), 200 (E=F)
φ3(A, J):   J=T: 75 (A=T), 1 (A=F)      J=F: 10 (A=T), 200 (A=F)
φ4(A, M):   M=T: 50 (A=T), 10 (A=F)     M=F: 1 (A=T), 200 (A=F)

25 Logistic Regression as a Markov Net
Logistic regression is a simple Markov net.
[Diagram: Y connected to X1, X2, …, Xn.]
But it only models the conditional distribution P(Y | X), not the full joint P(X, Y).

26 Markov Nets vs. Bayes Nets

27 Markov Nets vs. Bayes Nets
Property          Markov Nets                     Bayes Nets
Form              Product of potentials           Product of conditional probabilities
Potentials        Arbitrary                       Conditional probabilities
Cycles            Allowed                         Forbidden
Partition func.   Z = ? (global normalization)    Z = 1 (local normalization)
Indep. check      Graph separation                D-separation
Indep. props.     Some                            Some
Inference         MCMC, BP, etc.                  MCMC, BP, etc.

28 Generative vs. Discriminative Sequence Labeling Models
HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O,Q). HMMs are trained to have an accurate probabilistic model of the underlying language, and not all aspects of this model benefit the sequence labeling task. Conditional Random Fields (CRFs) are specifically designed and trained to maximize performance of sequence labeling. They model the conditional distribution P(Q | O) HMM = Hidden Markov Model

29 Classification
[Diagram: Naïve Bayes (generative) and logistic regression (discriminative/conditional) over the same structure, a class variable Y linked to features X1, X2, …, Xn; Naïve Bayes models the joint P(Y, X), while logistic regression models only the conditional P(Y | X).]

30 Sequence Labeling
[Diagram: an HMM (generative) and a linear-chain CRF (discriminative/conditional) over the same chain structure, labels Y1, Y2, …, YT linked to observations X1, X2, …, XT; the HMM models the joint P(X, Y), while the CRF models only the conditional P(Y | X).]

31 Simple Linear Chain CRF Features
The conditional distribution is modeled in the same way as in multinomial logistic regression. Create feature functions fk(Yt, Yt−1, Xt) (see the sketch below):
A feature for each state-transition pair i, j:  fi,j(Yt, Yt−1, Xt) = 1 if Yt = i and Yt−1 = j, and 0 otherwise
A feature for each state-observation pair i, o:  fi,o(Yt, Yt−1, Xt) = 1 if Yt = i and Xt = o, and 0 otherwise
Note: the number of features grows quadratically in the number of states (i.e. tags).
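A hedged sketch of these two indicator-feature families; the tag set and vocabulary are illustrative assumptions, not from the slides.

```python
def make_crf_features(states, observations):
    """Build the transition and observation indicator features described above."""
    features = []

    # one feature per state-transition pair (i, j)
    for i in states:
        for j in states:
            def f_trans(y_t, y_prev, x_t, i=i, j=j):
                return 1 if (y_t == i and y_prev == j) else 0
            features.append(f_trans)

    # one feature per state-observation pair (i, o)
    for i in states:
        for o in observations:
            def f_obs(y_t, y_prev, x_t, i=i, o=o):
                return 1 if (y_t == i and x_t == o) else 0
            features.append(f_obs)

    return features

# illustrative usage with a tiny tag set and vocabulary
feats = make_crf_features(states=["DET", "NOUN"], observations=["the", "dog"])
print(len(feats))                       # 2*2 transition + 2*2 observation = 8 features
print(feats[0]("DET", "DET", "the"))    # 1: the DET -> DET transition feature fires
```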

32 Conditional Distribution for Linear Chain CRF
Using these feature functions for a simple linear chain CRF, we can define:
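In standard notation (with Y0 a designated start state), the linear-chain CRF conditional distribution is

$$
P(Y \mid X) \;=\; \frac{1}{Z(X)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(Y_t, Y_{t-1}, X_t) \Big),
\qquad
Z(X) \;=\; \sum_{Y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(Y'_t, Y'_{t-1}, X_t) \Big).
$$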

33 Adding Token Features to a CRF
Token features Xi,j can be added.
[Diagram: each label Yt is connected to a vector of token features Xt,1, …, Xt,m.]
Additional feature functions can be added for each token feature to model the conditional distribution.

34 Features in POS Tagging
For POS tagging, use lexicographic features of tokens:
Capitalized?
Starts with a numeral?
Ends in a given suffix (e.g. "s", "ed", "ly")?

35 Enhanced Linear Chain CRF (standard approach)
Transitions can also be conditioned on the current token features.
[Diagram: each label Yt is connected to the token-feature vector Xt,1, …, Xt,m, and the transitions are conditioned on these features as well.]
Add feature functions: fi,j,k(Yt, Yt−1, X) = 1 if Yt = i, Yt−1 = j, and the token feature Xt−1,k is active, and 0 otherwise.

36 Supervised Learning (Parameter Estimation)
As in logistic regression, an optimization procedure such as L-BFGS is used to set the λ weights to maximize the CLL of the supervised training data. See the paper for details.

37 Sequence Tagging (Inference)
A variant of the Viterbi algorithm can be used to efficiently, in O(TN²) time, determine the globally most probable label sequence for a given token sequence, using a given log-linear model of the conditional probability P(Y | X). See the paper for details.
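A hedged sketch of Viterbi decoding for such a log-linear model; packaging the trained weights as per-position transition score tables is an assumption made for the illustration, not the paper's exact interface.

```python
import numpy as np

def viterbi(scores):
    """Find the most probable label sequence under a log-linear model.

    scores[t, i, j] is the log-space score of label j at position t-1 followed by
    label i at position t, i.e. sum_k lambda_k * f_k(Y_t=i, Y_{t-1}=j, X_t).
    scores has shape (T, N, N); runs in O(T * N^2).
    """
    T, N, _ = scores.shape
    delta = np.full((T, N), -np.inf)    # best score of any prefix ending in label i at t
    back = np.zeros((T, N), dtype=int)  # backpointers

    delta[0] = scores[0, :, 0]          # assume column 0 at t=0 encodes the start transition
    for t in range(1, T):
        for i in range(N):
            cand = delta[t - 1] + scores[t, i, :]   # extend every previous label j
            back[t, i] = int(np.argmax(cand))
            delta[t, i] = cand[back[t, i]]

    # follow backpointers from the best final label
    path = [int(np.argmax(delta[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))

# illustrative usage: T=3 positions, N=2 labels, random log-scores
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(3, 2, 2))))
```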

38 Skip-Chain CRFs Some long-distance dependencies (e.g. the same word appearing in different parts of the text) can be modeled by including long-distance edges in the Markov model.
[Diagram: a linear chain of labels Y1, Y2, Y3, …, Y100, Y101 over tokens X1 ("Michael"), X2 ("Dell"), X3 ("said"), …, X100 ("Dell"), X101 ("bought"), with an extra skip edge connecting the labels of the two occurrences of "Dell".]
The additional links make exact inference intractable, so one must resort to approximate inference to try to find the most probable labeling.

39 CRF Results Experimental results verify that CRFs have superior accuracy on various sequence labeling tasks:
Part-of-speech tagging
Noun-phrase chunking
Named entity recognition
Semantic role labeling
However, CRFs are much slower to train and do not scale as well to large amounts of training data. Training for POS on the full Penn Treebank (~1M words) currently takes "over a week." Skip-chain CRFs improve results on IE.

40 CRF Summary CRFs are a discriminative approach to sequence labeling whereas HMMs are generative. Discriminative methods are usually more accurate since they are trained for a specific performance task. CRFs also easily allow adding additional token features without making additional independence assumptions. Training time is increased since a complex optimization procedure is needed to fit supervised training data. CRFs are a state-of-the-art method for sequence labeling.

