
1 Graphical Models for Machine Learning. Ganesh Ramakrishnan. 20th August 2008

2 Probabilistic Graphical Models • Graphical representations of probability distributions – new insights into existing models – motivation for new models – graph-based algorithms for calculation and computation

3 Components of a Graphical Model • Each node in a graphical model represents a random variable – or, in general, a set or vector of random variables • There are edges between nodes • It is the absence of certain edges in a graph that encodes independencies – the information provided by the presence of edges is in some sense vacuous – e.g., the degenerate case of a completely connected graph, which describes all possible distributions

4 Types of Graphical Models • Directed: all edges are directed – hidden Markov models, Kalman filters, factor analysis, probabilistic principal component analysis, independent component analysis, mixtures of Gaussians, transformed component analysis, probabilistic expert systems, sigmoid belief networks, hierarchical mixtures of experts, Bayesian networks, etc. • Undirected: all edges are undirected – Markov random fields, conditional random fields, etc. • Chain graphs – have directed as well as undirected edges

5 Factorization and Conditional Independence Properties of Graphical Models • Two equivalent ways of specifying a graphical model (Hammersley-Clifford theorem) – Factorization property: how to factorize the joint distribution, given the graph – Markov (conditional independence) property: can we determine the conditional independence properties of a distribution directly from its graph? – undirected graphs: easy – directed graphs: one subtlety

6 Factorization Properties • Directed graphs – conditional independence from the d-separation test • Undirected graphs – conditional independence from graph separation

7 Markov Properties: Undirected Graphs • Conditional independence is given by graph separation • Two sets of nodes "a" and "b" are conditionally independent given a set of nodes "c" iff every path from a node in "a" to a node in "b" is blocked by a node in "c" • Here, "blocking" means that a node from "c" occurs on that path
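
The graph-separation test above is just reachability after the conditioning set is removed. Below is a minimal sketch under that reading; the toy graph and the helper name `separated` are illustrative, not taken from the slides.

```python
# Minimal sketch of the undirected graph-separation test described above.
from collections import deque

def separated(adj, a, b, c):
    """Return True if every path from a node in `a` to a node in `b`
    passes through a node in `c` (i.e. a is separated from b by c)."""
    blocked = set(c)
    seen = set(a) - blocked
    frontier = deque(seen)
    while frontier:                       # BFS that never enters blocked nodes
        u = frontier.popleft()
        for v in adj[u]:
            if v in blocked or v in seen:
                continue
            if v in b:
                return False              # found an unblocked path into b
            seen.add(v)
            frontier.append(v)
    return True

# Toy undirected graph:  x1 - x2 - x3 - x4, plus x2 - x5
adj = {1: {2}, 2: {1, 3, 5}, 3: {2, 4}, 4: {3}, 5: {2}}
print(separated(adj, a={1}, b={4}, c={3}))   # True: x3 blocks the only path
print(separated(adj, a={1}, b={4}, c={5}))   # False: path 1-2-3-4 is unblocked
```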

8 Markov Properties: Directed Graphs • We say that a node c blocks the path from node a to node b iff conditioning on c renders a and b independent, i.e. p(a, b | c) = p(a | c) p(b | c) • Identify three types of nodes that block/unblock paths when observed • Head-to-tail node: blocks the path

9 Markov Properties: Directed Graphs (contd) • Tail-to-tail node: blocks the path • Head-to-head node: unblocks the path (when it, or one of its descendants, is observed)

10 More formally…

11 Markov Properties: Directed Graphs (contd) • Conditional independence is given by the d-separation test • Two sets of nodes "a" and "b" are conditionally independent given a set of nodes "c" iff every path from a node in "a" to a node in "b" is blocked by a node in "c"
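
A numeric sanity check of the head-to-tail versus head-to-head behaviour, by brute-force enumeration over three binary variables; the CPT numbers are made up purely for illustration.

```python
import itertools

def joint_chain(a, c, b):            # a -> c -> b  (head-to-tail at c)
    pa = [0.6, 0.4][a]
    pc = [[0.9, 0.1], [0.2, 0.8]][a][c]
    pb = [[0.7, 0.3], [0.4, 0.6]][c][b]
    return pa * pc * pb

def joint_collider(a, c, b):         # a -> c <- b  (head-to-head at c)
    pa = [0.6, 0.4][a]
    pb = [0.5, 0.5][b]
    pc = [[[0.9, 0.1], [0.3, 0.7]], [[0.4, 0.6], [0.05, 0.95]]][a][b][c]
    return pa * pb * pc

def independent_given_c(joint, tol=1e-9):
    """Check p(a, b | c) == p(a | c) * p(b | c) for every a, b, c."""
    for c in (0, 1):
        pc = sum(joint(a, c, b) for a, b in itertools.product((0, 1), repeat=2))
        for a, b in itertools.product((0, 1), repeat=2):
            pab = joint(a, c, b) / pc
            pa = sum(joint(a, c, bb) for bb in (0, 1)) / pc
            pb = sum(joint(aa, c, b) for aa in (0, 1)) / pc
            if abs(pab - pa * pb) > tol:
                return False
    return True

print(independent_given_c(joint_chain))     # True: observing c blocks the path
print(independent_given_c(joint_collider))  # False: observing c unblocks it
```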

12 Markov Properties: Directed Graphs (contd) • Worked examples of the d-separation test (example graphs on the slide; answers: NO, YES)

13 Graphs as Filters • Factorization and conditional independence describe identical families of distributions • Degenerate cases: – all random variables in the set "x" are independent of each other => the corresponding family of p(x) will pass through any filter – the graph is completely connected => all families of distributions p(x) will pass through the filter

14 Directed versus Undirected • If the graph is a DAG, you are not guaranteed to find an equivalent undirected graph (for a directed tree, you are guaranteed)

15 Markov Blankets • The Markov blanket of a node in a graph is the set of nodes that blocks the path from all remaining nodes to that node • A node is independent of all other nodes in the graph, given its Markov blanket • Undirected graph – the set of neighbors of a node is its Markov blanket • Directed graph – the set of nodes that d-separates the node from the rest of the nodes: its parents, its children, and its children's other parents (co-parents)
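
A small sketch of reading off Markov blankets for the two graph types above; the example DAG and helper names are illustrative.

```python
def markov_blanket_undirected(neighbors, node):
    """In an undirected graph the blanket is simply the set of neighbours."""
    return set(neighbors[node])

def markov_blanket_directed(parents, node):
    """Parents, children, and the children's other parents (co-parents)."""
    children = {v for v, ps in parents.items() if node in ps}
    coparents = {p for v in children for p in parents[v]} - {node}
    return set(parents[node]) | children | coparents

# Toy DAG:  a -> c, b -> c, c -> d
parents = {'a': set(), 'b': set(), 'c': {'a', 'b'}, 'd': {'c'}}
print(markov_blanket_directed(parents, 'a'))   # {'b', 'c'}: child c plus co-parent b
```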

16 Examples • HMM and Kalman filter • Bayesian network

17 Graphical Model as Probabilistic Model for Zero-Order Logic

18 Examples (contd) • Multiple hidden sequences

19 Examples (contd) • Markov random fields • Conditional random fields

20 Inferencing in Graphical Models • Exact inferencing – message passing and junction tree algorithms • Integer linear programming based inferencing • Approximate inferencing – sampling-based methods – variational methods: find upper and lower bounds on the marginals – approximation algorithms when the potentials are metrics (Kleinberg and Tardos, 1999): O(log k log log k) approximation ratio for k labels; 2-approximation for uniform potentials

21 Exact Inferencing in Graphical Models • Why is it hard? • Message passing algorithm – applicable to tree-structured directed and undirected graphical models • Junction tree algorithm – applicable to arbitrary undirected graphs with cycles

22 Message Passing Algorithm • Example: find the marginal for a particular node – for M-state nodes on a chain of length N, the naive cost is O(M^N), exponential in the length of the chain – but we can exploit 1. the graphical structure (conditional independences) 2. avoidance of redundant computations through dynamic programming

23 Message Passing • Exchange sums and products • Express the marginal as a product of messages: p(x_n) = (1/Z) μ_α(x_n) μ_β(x_n), where Z is obtained by normalization

24 Message Passing • Recursively evaluate the messages: μ_α(x_n) = Σ_{x_{n-1}} ψ(x_{n-1}, x_n) μ_α(x_{n-1}) and μ_β(x_n) = Σ_{x_{n+1}} ψ(x_n, x_{n+1}) μ_β(x_{n+1})
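
A minimal sketch of the chain message-passing recursion above: marginals on a pairwise chain computed in O(N M^2) and checked against brute-force enumeration. The potentials are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3                                   # chain length, states per node
psi = rng.uniform(0.5, 2.0, size=(N - 1, M, M))   # psi[n] couples x_n and x_{n+1}

# Forward messages mu_alpha and backward messages mu_beta
alpha = [np.ones(M)]
for n in range(N - 1):
    alpha.append(psi[n].T @ alpha[-1])        # sum over x_n
beta = [np.ones(M)]
for n in reversed(range(N - 1)):
    beta.insert(0, psi[n] @ beta[0])          # sum over x_{n+1}

marginals = [alpha[n] * beta[n] for n in range(N)]
marginals = [m / m.sum() for m in marginals]  # Z by normalization

# Brute-force check against the full joint distribution
joint = np.ones([M] * N)
for n in range(N - 1):
    shape = [1] * N; shape[n] = M; shape[n + 1] = M
    joint = joint * psi[n].reshape(shape)
joint /= joint.sum()
for n in range(N):
    axes = tuple(i for i in range(N) if i != n)
    assert np.allclose(marginals[n], joint.sum(axis=axes))
print(marginals[2])                           # marginal of the middle node
```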

25 Belief Propagation • Extension to general tree-structured graphs • At each node: – form the product of incoming messages and local evidence – marginalize to give the outgoing message – one message in each direction across every link – also called the sum-product algorithm • Fails if there are loops

26 General Message Passing Algorithm (flooding)

27 Max-Product Algorithm • Goal: find the most probable configuration x* = argmax_x p(x) • Define max-messages analogously to the sum-product messages; the maximum of p(x) then follows from the same message passing algorithm with "sum" replaced by "max" • Generalization of the Viterbi algorithm for HMMs
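
A minimal max-product (Viterbi) sketch on the same kind of pairwise chain, with back-pointers to recover the arg-max configuration; potentials are illustrative, and the computation is done in log space to avoid underflow.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 6, 4
log_psi = np.log(rng.uniform(0.5, 2.0, size=(N - 1, M, M)))

delta = np.zeros(M)                       # best log-score ending in each state
back = []
for n in range(N - 1):
    scores = delta[:, None] + log_psi[n]  # indexed [x_n, x_{n+1}]
    back.append(scores.argmax(axis=0))
    delta = scores.max(axis=0)

# Back-track from the best final state
path = [int(delta.argmax())]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
path.reverse()
print("max log-score:", delta.max(), "arg-max path:", path)
```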

28 Max-Product Algorithm (flooding schedule)

29 Additional Machinery for Finding the MAP Value

30 Additional Machinery for Finding the MAP Value (contd)

31 Junction Tree Algorithm for DAGs

32 Examples

33 Triangulated Graphs: a sufficient condition for applicability of the junction tree algorithm

34 Construction of Junction Trees: use Kruskal's and Prim's algorithms

35 The Overall Junction Tree Algorithm

36 Training of Graphical Models • Criteria – penalized log-likelihood – pseudo log-likelihood – margin maximization – number of errors in predictions • Techniques – voted perceptron – gradient and Newton methods, L-BFGS – gradient tree boosting – logarithmic pooling

37 EM Algorithm • Under completely observed z, the complete-data likelihood is easy to maximize directly • With z hidden, perform coordinate ascent on the auxiliary function Q(θ, θ_old) = E_{z ~ p(z | x, θ_old)} [ log p(x, z | θ) ]

38 EM Steps • E-step: compute the posterior over the hidden variables, q(z) = p(z | x, θ_old) • M-step: re-estimate the parameters, θ_new = argmax_θ E_q[ log p(x, z | θ) ]
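
A compact sketch of the E-step / M-step alternation above, for a mixture of two biased coins (the latent z is which coin produced each run of flips); the data and initialization are made up for illustration.

```python
import numpy as np

flips = np.array([9, 8, 4, 7, 3])       # heads out of 10 tosses per trial
n = 10
theta = np.array([0.6, 0.5])            # initial head probabilities of the two coins
pi = np.array([0.5, 0.5])               # initial mixture weights

for _ in range(20):
    # E-step: responsibilities q(z) = p(z | x, current parameters)
    lik = (theta[None, :] ** flips[:, None]) * ((1 - theta[None, :]) ** (n - flips[:, None]))
    resp = pi * lik
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood
    pi = resp.mean(axis=0)
    theta = (resp * flips[:, None]).sum(axis=0) / (resp.sum(axis=0) * n)

print("mixture weights:", pi.round(3), "coin biases:", theta.round(3))
```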

39 Integer Linear Programming based Inferencing • Introduce an indicator variable λ_c(x_c) that is 1 iff clique c takes the value x_c • For a given clique, the λ_c(x_c) are mutually exclusive • λ_c(x_c) and λ_{c'}(x_{c'}) must be consistent for c' ⊂ c • LP relaxation – the λ_c(x_c) behave like marginal probabilities – may admit invalid (fractional) solutions if the graph is untriangulated! • Triangulation adds variables and constraints that keep the solution valid

40 Some graphical models for part-of-speech tagging

41 Different Models for POS Tagging • HMM • Maximum entropy Markov models • Conditional random fields

42 POS Tagging: A Sequence Labeling Problem • Input and output – input sequence x = x_1 x_2 … x_n – output sequence y = y_1 y_2 … y_m • Labels of the input sequence; a semantic representation of the input • Other applications – automatic speech recognition – text processing, e.g., tagging, named entity recognition, summarization by exploiting the layout structure of text, etc.

43 Hidden Markov Models • Doubly stochastic models • Efficient dynamic programming algorithms exist for – finding Pr(S) – the highest-probability path P that maximizes Pr(S, P) (Viterbi) • Training the model (Baum-Welch algorithm) • (state-transition diagram over states S1-S4 with emission distributions over {A, C} shown on the slide)
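
A short sketch of the forward recursion used to compute Pr(S) for an HMM; the transition and emission matrices below are illustrative stand-ins, not the numbers from the slide figure.

```python
import numpy as np

A = np.array([[0.7, 0.3],               # state-transition probabilities
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],               # emission probabilities over {A, C}
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])               # initial state distribution

def forward(obs):
    """Return Pr(observation sequence) by the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward([0, 0, 1, 0]))            # Pr of observing A, A, C, A
```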

44 Hidden Markov Model (HMM): Generative Modeling • Source model P(Y), e.g., a 1st-order Markov chain, and noisy channel P(X | Y) • Parameter estimation: maximize the joint likelihood of the training examples

45 Dependency (1st order)

46 Different Models for POS Tagging • HMM • Maximum entropy Markov models • Conditional random fields

47 Disadvantage of HMMs (1) • No rich feature information – rich features are required when x_k is complex or when the data for x_k is sparse • Example: POS tagging – how do we evaluate P(w_k | t_k) for unknown words w_k? – useful features: suffix (e.g., -ed, -tion, -ing) and capitalization

48 Disadvantage of HMMs (2) • Generative model – parameter estimation maximizes the joint likelihood of the training examples • Better approach: a discriminative model that models P(y | x) directly and maximizes the conditional likelihood of the training examples

49 Maximum Entropy Markov Model • Discriminative sub-models – unify the two components of the generative model (the source-model parameters and the noisy-channel parameters) into a single conditional model – employ the maximum entropy principle

50 General Maximum Entropy Model • Model the distribution P(Y | X) with a set of features {f_1, f_2, …, f_l} defined on X and Y • Idea – collect feature statistics from the training data – assume nothing about the distribution P(Y | X) other than the collected information – maximize the entropy as the criterion

51 Features • Features are 0-1 indicator functions – 1 if (x, y) satisfies a predefined condition, 0 if not • Example: POS tagging

52 Constraints • Empirical information – feature statistics computed from the training data T • Constraints – the expected value of each feature under the distribution P(Y | X) we want to model must equal its empirical value

53 Maximum Entropy: Objective • Entropy maximization problem: maximize the conditional entropy of P(Y | X) subject to the feature-expectation constraints above

54 Dual Problem • The dual problem yields a conditional model of exponential form whose parameters maximize the likelihood of the conditional training data • Solution methods: improved iterative scaling (IIS) (Berger et al. 1996), generalized iterative scaling (GIS) (McCallum et al. 2000)
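
A small conditional maximum-entropy (multinomial logistic) model trained by plain gradient ascent on the conditional log-likelihood. IIS and GIS, cited on the slide, are the classical solvers; gradient ascent is a simpler stand-in used here only for illustration, and the toy data and features are invented.

```python
import numpy as np

X = np.array([[1, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 1]], dtype=float)  # last column is a bias feature
y = np.array([0, 0, 1, 1])
K, D = 2, X.shape[1]
W = np.zeros((K, D))

for _ in range(500):
    scores = X @ W.T                                # unnormalized log-probabilities
    P = np.exp(scores - scores.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)               # model p(y | x)
    # gradient = empirical feature counts - expected feature counts under the model
    emp = np.zeros((K, D))
    np.add.at(emp, y, X)
    grad = emp - P.T @ X
    W += 0.1 * grad / len(X)

print(np.argmax(X @ W.T, axis=1))   # predicted labels; should match y on this separable toy set
```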

55 Maximum Entropy Markov Model • Use the maximum entropy approach to model the 1st-order conditional distribution • Features – basic features (analogous to the parameters of an HMM): bigram (1st-order) or trigram (2nd-order) features on the label sequence, and state-output pair features (X_k = x_k, Y_k = y_k) – advantage: can incorporate other, richer features of (x_k, y_k)

56 HMM vs. MEMM (1st order) • (graphical structures of the HMM and the MEMM shown side by side on the slide)

57 Performance in POS Tagging • Data set: WSJ • Features: HMM features, spelling features (-ed, -tion, -s, -ing, etc.) • Results (Lafferty et al. 2001) – 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy – 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy

58 Different Models for POS Tagging • HMM • Maximum entropy Markov models • Conditional random fields

59 Disadvantage of MEMMs (1) • Complex maximum-entropy solution algorithms – both IIS and GIS are difficult to implement and require many tricks • Slow training – time-consuming when the data set is large, especially for MEMMs

60 Disadvantage of MEMMs (2) • Maximum entropy model as a sub-model – entropy is optimized on the sub-models, not on the global model • Label bias problem – conditional models with per-state normalization – the effect of the observations is weakened for states with fewer outgoing transitions

61 Label Bias Problem • Training data (X : Y) – rib : 1 2 3, rob : 4 5 6 • (finite-state model with states 1-6 and its per-state transition parameters shown on the slide) • New input: rob

62 Solution • Global optimization – optimize the parameters in one global model simultaneously, not in sub-models separately • Alternatives – conditional random fields – application of the perceptron algorithm

63 Conditional Random Field (CRF) (1) • Let G = (V, E) be a graph such that Y = (Y_v), v ∈ V, is indexed by the vertices of G • Then (X, Y) is a conditional random field if, conditioned globally on X, the variables Y_v obey the Markov property with respect to the graph: P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ~ v)

64 Conditional Random Field (CRF) (2) • Exponential model – when G is a tree (more specifically, a chain), the cliques are its edges and vertices, and p(y | x) is proportional to the exponential of a weighted sum of transition features (determined by state transitions, on the edges) and state features (determined by the state, on the vertices) • Parameter estimation – maximize the conditional likelihood of the training examples (IIS or GIS)
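
A sketch of the linear-chain case above: exponentiated transition and state scores, with the conditional probability normalized globally by a forward pass over the whole sequence. All numbers are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_prob(unary, trans, y):
    """log p(y | x) for a linear chain.
    unary[t, k]: score of label k at position t (depends on x);
    trans[j, k]: score of the transition j -> k."""
    T, K = unary.shape
    score = unary[0, y[0]] + sum(trans[y[t - 1], y[t]] + unary[t, y[t]] for t in range(1, T))
    # log Z by the forward recursion, done in log space for numerical stability
    log_alpha = unary[0]
    for t in range(1, T):
        log_alpha = logsumexp(log_alpha[:, None] + trans, axis=0) + unary[t]
    return score - logsumexp(log_alpha)

rng = np.random.default_rng(2)
unary, trans = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))
print(crf_log_prob(unary, trans, y=[0, 2, 1, 1]))
```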

65 MEMM vs. CRF • Similarities – both employ the maximum entropy principle – both incorporate rich feature information • Differences – conditional random fields are always globally conditioned on X, resulting in a globally optimized model

66 Training the CRF • Maximize the conditional log-likelihood • [Collins 2002] Discriminatively learn the global weight vector by directly reducing the number of prediction errors, using the (voted) perceptron • Iterate over each sequence and update the weight vector – the predicted labeling is derived from the input by the Viterbi inference algorithm, using the current weight vector
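
A sketch of the Collins-style structured perceptron loop just described: run Viterbi with the current weights, then update toward the gold feature counts and away from the predicted ones. The feature extraction (label-label and label-observation indicator counts) and the toy data are stand-ins for illustration.

```python
import numpy as np

def viterbi(unary, trans):
    T, K = unary.shape
    delta, back = unary[0], []
    for t in range(1, T):
        s = delta[:, None] + trans + unary[t]
        back.append(s.argmax(axis=0)); delta = s.max(axis=0)
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

def feats(x, y, K, V):
    f_e, f_t = np.zeros((K, V)), np.zeros((K, K))
    for t, (xt, yt) in enumerate(zip(x, y)):
        f_e[yt, xt] += 1
        if t: f_t[y[t - 1], yt] += 1
    return f_e, f_t

K, V = 2, 3                                    # labels, observation symbols
data = [([0, 1, 1, 2], [0, 1, 1, 1]), ([2, 0, 0], [1, 0, 0])]
W_e, W_t = np.zeros((K, V)), np.zeros((K, K))
for _ in range(10):                            # perceptron epochs
    for x, y in data:
        unary = W_e[:, x].T                    # unary[t, k] = W_e[k, x_t]
        y_hat = viterbi(unary, W_t)
        if y_hat != list(y):                   # mistake-driven update
            ge, gt = feats(x, y, K, V); pe, pt = feats(x, y_hat, K, V)
            W_e += ge - pe; W_t += gt - pt
print([viterbi(W_e[:, x].T, W_t) for x, _ in data])   # predictions after training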

67 Training

68 Viterbi for Inferencing • Definition: for each position i, a matrix M_i(y', y | x) of unnormalized transition scores exp(Σ_k λ_k f_k(y', y, x, i)) • The joint distribution of the output labels is expressed as a normalized product of these matrices • Viterbi grows the optimal label sequence gradually by scanning the matrices from position 0 to n

69 Viterbi for Inferencing (continued) • The dynamic programming recursion: δ_i(y) = max_{y'} [ δ_{i-1}(y') · M_i(y', y | x) ], with back-pointers to recover the arg-max label sequence

70 Performance in POS Tagging • Data set: WSJ • Features: HMM features, spelling features (-ed, -tion, -s, -ing, etc.) • Results (Lafferty et al. 2001) – 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy – conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

71 Comparison of the Three Approaches to POS Tagging • Results (Lafferty et al. 2001) – 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy – 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy – conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

72 Inferencing under constraints and other issues

73 Incorporating Constraints in Viterbi • In NLP problems such as chunking and information extraction – the task is to identify segments of consecutive words in the sentence and classify them into one of several classes – a word-based representation called the BIO (Begin, Inside, Outside) representation is used • When no two consecutive segments share the same type, the BIO representation can be simplified to the IO representation • In an interactive information-extraction scenario, the system should assume that the labels of some tokens are given by the user at evaluation time

74 Incorporating Constraints in Viterbi (continued) • BIO constraint: disallow any label sequence in which an O label is followed immediately by an I label • Interactive labeling constraint: if the token at position i must be labeled a, then no path is allowed to pass through a state at position i whose label is not a • Problem: this matrix-modification mechanism cannot be applied when a constraint relates two distant tokens, so Viterbi cannot be used in that case! Example applications: 1. the "no duplicate segments" constraint in tasks such as semantic role labeling, where two different segments in a sentence cannot have the same label; 2. a potential constraint in information extraction: "if FirstName appears in the sentence, then LastName must also appear."
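
A small sketch of the matrix-modification idea above: constraints that are local to one position or one transition can be imposed by setting the corresponding lattice scores to -inf before running Viterbi (constraints relating distant tokens cannot be encoded this way, which is the point of the following slides). The label set {B, I, O} and the random scores are illustrative.

```python
import numpy as np

NEG = -1e9
B, I, O = 0, 1, 2
T, K = 5, 3
rng = np.random.default_rng(3)
unary = rng.normal(size=(T, K))
trans = rng.normal(size=(K, K))

trans[O, I] = NEG            # BIO constraint: an O label may not be followed by I
unary[2, :] = NEG            # interactive constraint: token 2 is known to be O
unary[2, O] = 0.0

delta, back = unary[0].copy(), []
for t in range(1, T):
    s = delta[:, None] + trans + unary[t]
    back.append(s.argmax(axis=0)); delta = s.max(axis=0)
path = [int(delta.argmax())]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
print(path[::-1])            # best path that respects both constraints
```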

75 Shortest Path Formulation • The Viterbi solution is the shortest path in a graph (the trellis, for HMMs) constructed as follows – the cost of an edge is the negative log-score of the corresponding label transition – the start and end nodes are placed at positions -1 and n

76 ILP Solution • Solution to the general shortest-path problem – x_uv is the 0/1 indicator variable for the edge between nodes u and v – constraint for all intermediate nodes: number of incoming edges = number of outgoing edges – constraint for the source: exactly one outgoing unit – constraint for the sink: exactly one incoming unit

77 ILP Solution (continued) • ILP formulation for this specific shortest-path problem • The constraint matrix M_i is totally unimodular for all i => the LP relaxation is guaranteed to yield an integer solution
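
A sketch of the shortest-path LP described above on a tiny hand-built graph: minimize total edge cost subject to flow conservation, one unit leaving the source and one unit entering the sink. Because the constraint matrix is totally unimodular, the LP relaxation already returns a 0/1 solution. The graph and costs are invented for illustration.

```python
import numpy as np
from scipy.optimize import linprog

nodes = ['s', 'a', 'b', 't']
edges = [('s', 'a', 1.0), ('s', 'b', 4.0), ('a', 'b', 1.0), ('a', 't', 5.0), ('b', 't', 1.0)]
cost = np.array([c for _, _, c in edges])

A_eq = np.zeros((len(nodes), len(edges)))
for j, (u, v, _) in enumerate(edges):
    A_eq[nodes.index(u), j] += 1      # edge leaves u
    A_eq[nodes.index(v), j] -= 1      # edge enters v
b_eq = np.zeros(len(nodes))
b_eq[nodes.index('s')] = 1            # one unit leaves the source
b_eq[nodes.index('t')] = -1           # one unit enters the sink

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(edges))
print([e[:2] for e, x in zip(edges, res.x) if x > 0.5])   # shortest path: s -> a -> b -> t
```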

78 Replacing Viterbi with ILP • Advantages – all possible Boolean functions over the variables of interest can be represented as sets of linear (in)equalities, which improves the expressivity of constraints on the output space (of y's) – allows every constraint that Viterbi allows – also allows constraints over the labels of distant tokens, which Viterbi does not

79 Illustration of the Expressivity of Inferencing with ILP • Interactive labeling constraint – to force the label of token i to be 0, add a linear constraint fixing the corresponding indicator variables • "No duplicate segment" constraint – enforced by requiring that once a segment of type a ends, it never starts again

80 Illustration of the Expressivity of Inferencing with ILP (continued) • "If label a appears, then label b must also appear" constraint – represented with a single linear inequality over the indicator variables • Group labeling constraint

81 Other Issues of Interest • Feature induction in CRFs and relational Markov networks [McCallum, UAI 2003] – principle: iteratively construct feature conjunctions that would significantly increase the conditional log-likelihood if added to the model • Reducing labeling effort [Culotta et al., 2005] – employs a constrained forward-backward algorithm to estimate the confidence of CRF predictions in an active-learning framework • Joint labeling of multiple sequences [McCallum et al., 2003] – dynamic conditional random fields use a distributed state representation: each time slice has a set of state variables and edges – they enable labeling of sequence data in multiple interacting ways, e.g., performing part-of-speech tagging and noun-phrase segmentation simultaneously, increasing joint accuracy by information sharing

82 Graphical Models for some NLP tasks

83 Determining NP Chunks [Sha and Pereira, 2003] • Encode a second-order Markov dependency between chunk tags – CRF labels are pairs of consecutive chunk tags: the label at position i is y_i = (c_{i-1}, c_i), where c_i is the chunk tag of word i • Factored representation for features – p(x, i) is a predicate on the input sequence x and the current position i, and q(y_{i-1}, y_i) is a predicate on pairs of labels – e.g., p(x, i) could be "the word at position i is the" or "the POS tags at positions i-1 and i are DT, NN"
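
A sketch of that factored feature representation: each CRF feature is the product of an observation predicate p(x, i) and a label-pair predicate q(y_{i-1}, y_i). The predicate names below are toy stand-ins for the word/POS tests used by Sha and Pereira, not their actual feature set.

```python
def p_word_is_the(x, i):
    return x[i]['word'].lower() == 'the'

def p_pos_bigram_dt_nn(x, i):
    return i > 0 and x[i - 1]['pos'] == 'DT' and x[i]['pos'] == 'NN'

def q_label_pair(prev_label, label):
    def q(y_prev, y_cur):
        return y_prev == prev_label and y_cur == label
    return q

def feature(p, q):
    """Combine predicates into a single 0/1 CRF feature f(x, i, y_{i-1}, y_i)."""
    return lambda x, i, y_prev, y_cur: float(p(x, i) and q(y_prev, y_cur))

f = feature(p_pos_bigram_dt_nn, q_label_pair('B-NP', 'I-NP'))
x = [{'word': 'the', 'pos': 'DT'}, {'word': 'cat', 'pos': 'NN'}]
print(f(x, 1, 'B-NP', 'I-NP'))   # 1.0: both predicates fire at position 1
```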

84 Features and Performance • (table of feature templates and F-scores shown on the slide)

85 Co-reference Resolution [McCallum and Wellner, 2004] • Collective co-reference resolution of noun phrases using three conditional-probability undirected models 1. Groups of nodes for entities – nodes for mentions x, a node for the entity assignment of each mention y, and a node for each entity attribute a 2. Nodes for mention pairs, with attributes on mentions – change y to a binary-valued random variable y_ij for each pair of mentions (x_i, x_j), and associate attributes with each mention 3. Nodes for mention pairs: graph partitioning with a learned distance

86 Co-reference Resolution: Results • Features – non-independent features operating at multiple levels of granularity – tests for string and sub-string matches, acronym matches, parse-derived head-word matches, gender, WordNet subsumption, sentence distance, distance in the parse tree, etc. • F1 results on three data sets (table on the slide)

87 Other Applications • Named entity recognition [McCallum and Li, 2003] – uses feature induction; lexicons automatically augmented from the Web using seeds from the training data and exploiting HTML formatting regularities – demonstrates speedy building of named-entity models in three languages: English, German and Hindi • Collective multi-label classification [Ghamrawi and McCallum, 2005] – exploits dependencies between labels by directly parametrizing label co-occurrences • Information extraction [Peng et al., 2004] – task: extracting fields such as title, author, institution, conference, etc., from headers and citations of research papers – explores several feature classes, Markov orders and different priors (Gaussian, exponential and hyperbolic L1) for improved regularization

88 Other Applications [contd.] • Semantic role labeling [Cohn and Blunsom, 2005] – uses a tree CRF: the tree structure requires that features incorporate either a node labeling or the labeling of a parent and its child – basic features: head word, head POS, phrase syntactic category, phrase path, position relative to the predicate, surface distance to the predicate, predicate lemma, predicate token, predicate voice, predicate sub-categorization, syntactic frame – context features: head word of the first NP in a prepositional phrase, left and right sibling head words and syntactic categories, first and last word in the phrase yield and their PoS, parent syntactic category and head word – common ancestor of the verb: the syntactic category of the deepest shared ancestor of both the verb and the node – feature conjunctions: predicate lemma + syntactic category, predicate lemma + relative position, syntactic category + first word of the phrase – default feature: an always-'on' feature that models prior probabilities – joint features over pairwise cliques: whether the parent and child head words do not match, parent syntactic category + child syntactic category, parent relative position + child relative position, parent relative position + child relative position + predicate PoS + predicate lemma

89 Summary • Introduced NLP as a machine learning problem, in particular classification leading to sequence labeling • Graphical models were defined: representation, training, inferencing • POS tagging was chosen as the case-study task for three sample graphical models: HMM, MEMM, CRF • Other sequence labeling problems, such as chunking, co-reference resolution and named entity recognition, were discussed using the above methods

90 Summary • The results and the complexity of the task bring to the fore the need for theoretically sound, learning-oriented and data-driven paradigms • These are expected to complement, not compete with, rule-driven classical approaches to NLP

91 References (1/3) • Allen, J. F. Natural Language Understanding. Benjamin Cummings, 1987; second edition, 1994. • Tom Mitchell. Machine Learning. McGraw Hill, 1997. • E. Charniak. Statistical Language Learning. Cambridge: MIT Press, 1993. • A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71. • T. Dietterich, A. Ashenfelter, and Y. Bulatov. Training conditional random fields via gradient tree boosting. In ICML, 2004. http://citeseer.ist.psu.edu/dietterich04training.html • M. Collins (2002a). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proc. EMNLP-2002, 1-8. • P. Clifford. Markov random fields in statistics. In G. R. Grimmett and D. J. A. Welsh (Eds.), Disorder in Physical Systems, J. M. Hammersley Festschrift, pages 19-32. Oxford University Press, 1990. http://www.statslab.cam.ac.uk/~grg/books/hammfest/3-pdc.ps

92 References (2/3) • D. Jurafsky and J. Martin (2000). Speech and Language Processing. Prentice Hall. • J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001, 282-289. • A. McCallum, D. Freitag, and F. Pereira (2000). Maximum Entropy Markov Models for Information Extraction and Segmentation. In Proc. ICML-2000, 591-598. • C. D. Manning and H. Schutze (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA. • D. Roth and W. Yih. Integer linear programming inference for conditional random fields. In ICML, pages 737-744, 2005. http://l2r.cs.uiuc.edu/~danr/Papers/RothYi05.pdf • A. McCallum. Efficiently inducing features of conditional random fields. In UAI, 2003. http://citeseer.ist.psu.edu/mccallum03efficiently.html

93 References (3/3) • Wei Li and Andrew McCallum. Rapid Development of Hindi Named Entity Recognition Using Conditional Random Fields and Feature Induction. ACM Transactions on Asian Language Information Processing, 2003. • Aron Culotta and Andrew McCallum. Confidence Estimation for Information Extraction. In Proc. HLT-NAACL 2004 (short paper). • Andrew McCallum and Wei Li. Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In Proc. CoNLL, 2003. • Aron Culotta and Andrew McCallum. Reducing Labeling Effort for Structured Prediction Tasks. In AAAI, 2005. • Nadia Ghamrawi and Andrew McCallum. Collective Multi-Label Classification. In Proc. CIKM, 2005. • Trausti Kristjannson, Aron Culotta, Paul Viola and Andrew McCallum. Interactive Information Extraction with Constrained Conditional Random Fields. In Proc. AAAI, 2004.

94 Further Reading (1/3) • Y. Altun, I. Tsochantaridis, and T. Hofmann (2003a). Hidden Markov Support Vector Machines. In Proc. ICML-2003, 3-10. • Y. Altun, M. Johnson, and T. Hofmann (2003b). Investigating Loss Functions and Optimization Methods for Discriminative Learning of Label Sequences. In Proc. EMNLP-2003, 145-152. • M. Collins (2002b). Ranking Algorithms for Named Entity Extraction: Boosting and the Voted Perceptron. In Proc. ACL-2002, 489-496. • M. Collins and N. Duffy (2002). New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron. In Proc. ACL-2002, 263-270. • Y. Freund and R. Schapire (1999). Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37(3), 277-296. • B. Roark, M. Saraclar, and M. Collins (2004). Corrective Language Modeling for Large Vocabulary ASR with the Perceptron Algorithm. In Proc. ICASSP-2004. • B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In ICML, 2004. http://www.cs.berkeley.edu/~taskar/pubs/mmamn.ps

95 Further Reading (2/3) • T. Cohn, A. Smith, and M. Osborne. Scaling conditional random fields using error correcting codes. In ACL, 2005. http://www.cs.mu.oz.au/~tacohn/acl05_scaling.pdf • T. Dietterich, A. Ashenfelter, and Y. Bulatov. Training conditional random fields via gradient tree boosting. In ICML, 2004. http://citeseer.ist.psu.edu/dietterich04training.html • J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, volume 29, 2001. http://www-stat.stanford.edu/~jhf/ftp/trebst.ps • J. Kleinberg and E. Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In FOCS, 1999. http://www.cs.cornell.edu/home/kleinber/focs99-mrf.ps • S. Kakade, Y. Teh, and S. Roweis. An alternate objective function for Markovian fields. In ICML, pages 275-282, 2002. http://www.cs.berkeley.edu/~ywteh/research/newcost/icml2002.pdf • M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002. http://citeseer.ist.psu.edu/collins02discriminative.html

96 Further Reading (3/3) • A. Smith, T. Cohn, and M. Osborne. Logarithmic opinion pools for conditional random fields. In ACL, 2005. http://www.iccs.informatics.ed.ac.uk/~osborne/papers/acl05a.pdf • C. Sutton and A. McCallum. Piecewise training for undirected models. In UAI, 2005. http://www.cs.umass.edu/~mccallum/papers/piecewise-uai05.pdf • F. Sha and F. Pereira. Shallow parsing with conditional random fields. In NAACL, 2003. http://www.cis.upenn.edu/~feisha/pubs/shallow03.pdf • B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In UAI, 2002. http://www.cs.berkeley.edu/~taskar/pubs/rmn.ps • B. Taskar. Learning Structured Prediction Models: A Large Margin Approach. PhD thesis, Stanford University, 2004. http://www.cs.berkeley.edu/~taskar/pubs/thesis.pdf • B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In ICML, 2004. http://www.cs.berkeley.edu/~taskar/pubs/mmamn.ps • H. Wallach. Efficient training of conditional random fields. Master's thesis, University of Edinburgh, 2002. http://citeseer.ist.psu.edu/wallach02efficient.html • M. Narasimhan and J. Bilmes. Optimal sub-graphical models. In NIPS, 2004. http://ssli.ee.washington.edu/~mukundn/pubs/nips2004.pdf • J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, 1988. • Y. Qi, M. Szummer, and T. P. Minka. Bayesian conditional random fields. In AISTATS, 2005. http://people.csail.mit.edu/alanqi/papers/Qi-Bayesian-CRF-AIstat05.pdf

