1
Probabilistic Equational Reasoning
Arthur Kantor
akantor@uiuc.edu
2
Outline
The Problem
Two solutions
– Generative Model
– Undirected Graphical Model
3
The problem
You have m objects, all with some common attributes
– e.g. publications
You also have n references to those objects
– e.g. citations of those publications
The references are ambiguous
– Citations have different formats and may have spelling mistakes
– m may not be known
How do you know if two references refer to the same object?
– A common problem at citeseer.nj.nec.com
– Also arises in natural language processing, database merging, …
4
The problem
What object do these references refer to?
– “Powell”
– “she”
– “Mr. Powell”
References can disambiguate each other.
5
Two Solutions
Both are based on probabilistic models.
Objects are unobserved
– the number of objects m is not known
Try to resolve all the references simultaneously
– “she” would not co-refer with “Powell” in the presence of “Mr. Powell”
Solution one: based on relational probabilistic models (RPMs)
Solution two: based on undirected graphical models
6
RPM solution [Pasula et al.]
A system built to identify the papers referred to by various citations.
Straightforward Bayes rule, intelligently applied.
Four classes of information:
– Author (unobserved)
– Paper (unobserved)
– AuthorAsCited
– Citation
Probability distributions for each class are given.
7
Thick lines specify the foreign keys (they can also be thought of as random vectors of pointers to objects).
8
Assume for now that the number of papers and authors is known.
Thin lines represent dependencies for every instance of the class.
Think generatively:
1) Authors are born; names are picked from a prior distribution.
2) Papers are written; names and publication types are picked from a prior.
9
1) Authors are born; names are picked from a prior distribution.
2) Papers are written; names and publication types are picked from a prior.
3) Based on papers, citations are composed (perhaps misspelled).
4) Based on mood, and perhaps pubType, a format is chosen for the citation.
5) Finally, the text is written down.
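As a toy illustration of steps 3–5 (the data layout and the `corrupt` and `choose_format` helpers are hypothetical stand-ins for the model's conditional distributions, not the authors' implementation):

```python
import random

def generate_citation(papers, corrupt, choose_format):
    """Toy rendering of steps 3-5 of the generative story.
    `papers`, `corrupt`, and `choose_format` are illustrative
    stand-ins for the model's distributions."""
    paper = random.choice(papers)                   # which paper is cited
    title = corrupt(paper["title"])                 # 3) perhaps misspelled
    names = [corrupt(a) for a in paper["authors"]]  # AuthorAsCited names
    fmt = choose_format(paper["pub_type"])          # 4) a format is chosen
    return fmt(names, title)                        # 5) text written down
```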
10
We now have P(text | everything that happened).
So what could have happened? We want P(what happened | text).
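Bayes' rule performs the inversion; writing "world" as a shorthand for the unobserved authors, papers, and citation events (a shorthand for this summary, not the paper's notation):

  P(world | text) ∝ P(text | world) · P(world)

The generative story supplies the likelihood P(text | world), the class priors supply P(world), and inference searches over worlds.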
12
Consider picking a paper from a library of n papers, writing down a citation, putting it back on the shelf, and repeating the process once more.
You now have two citations, c1 and c2.
Consider two hypotheses:
H1: c1.paper = c2.paper
H2: c1.paper ≠ c2.paper
What's more likely?
13
What's more likely? It depends on y1 and y2.
– If it is probable that both y1 and y2 were copied down correctly, yet y1 and y2 differ significantly, then the cause of the difference was probably that the papers were in fact different:
P(H2 | text) > P(H1 | text), so H2 is what happened.
(H1: c1.paper = c2.paper; H2: c1.paper ≠ c2.paper)
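Concretely, the hypotheses can be compared by their posterior odds, marginalizing over which paper(s) were picked; a sketch, with p and p′ ranging over the n papers:

  P(H1 | c1, c2) / P(H2 | c1, c2) = [ P(H1) · Σ_p P(p) P(c1 | p) P(c2 | p) ] / [ P(H2) · Σ_{p ≠ p′} P(p) P(p′) P(c1 | p) P(c2 | p′) ]

Under H1 a single paper must explain both citations, so a large difference between them has to be blamed on copying noise; under H2 each citation gets its own paper and the difference costs nothing.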
14
Concerns
The number of papers and authors is unknown.
– Condition everything on the number of papers and authors.
The probability space is ridiculously huge; we cannot possibly sum over it.
– Use MCMC sampling over the number of objects.
– Name/title corruption is symmetric: instead of corrupting the title, corrupt the cited title.
– Sum directly over small-range attributes, like doctype.
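A minimal sketch of the sampling idea (a Metropolis sampler over citation-to-paper labelings; the proposal and the `log_post` scoring function are illustrative assumptions, not the authors' implementation):

```python
import math
import random

def mcmc_coref(citations, log_post, n_iters=10000):
    """Metropolis sampler over citation->paper label vectors.
    Labels 0..n-1 suffice, since n citations can cite at most n
    distinct papers; the number of papers is implied by the current
    labeling, so it is effectively sampled too. `log_post` (prior
    over #papers + citation likelihood) is a stand-in."""
    n = len(citations)
    assign = list(range(n))                 # start: one paper per citation
    cur = log_post(citations, assign)
    for _ in range(n_iters):
        i = random.randrange(n)             # pick a citation...
        proposal = assign[:]
        proposal[i] = random.randrange(n)   # ...and relabel it
        new = log_post(citations, proposal)
        # The proposal is symmetric, so the plain Metropolis rule applies.
        if random.random() < math.exp(min(0.0, new - cur)):
            assign, cur = proposal, new
    return assign
```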
15
Performance
Works pretty well.
– Depends greatly on the quality of the generative model.
16
Outline
The Problem
Two solutions
– Generative Model
– Undirected Graphical Model
This model gives more flexibility in specifying features than the previous one.
– There is no need to specify per-class dependencies.
17
Undirected Graphical Model
Objects are implicit; we deal only with references.
Given:
– References x_1 … x_i … x_n
– Binary random variables Y_ij, with Y_ij = 1 iff x_i co-references x_j
– Feature (potential) functions f_l(x_i, x_j, y_ij), which
  measure a particular facet of similarity between x_i and x_j,
  have the property f_l(x_i, x_j, 1) = −f_l(x_i, x_j, 0),
  and are non-zero if x_i and x_j are related.
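A feature function here can be any pairwise similarity measure folded into that antisymmetric form; a minimal sketch (the token-overlap measure and the 0.5 shift are illustrative choices, not features from the paper):

```python
def token_overlap_feature(x_i, x_j, y_ij):
    """One pairwise feature: Jaccard overlap of tokens, shifted so
    dissimilar pairs score negatively. Satisfies the property
    f(x_i, x_j, 1) == -f(x_i, x_j, 0)."""
    a, b = set(x_i.lower().split()), set(x_j.lower().split())
    sim = len(a & b) / max(len(a | b), 1) - 0.5   # in [-0.5, 0.5]
    return sim if y_ij == 1 else -sim
```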
18
Objective function
Maximize the conditional log-likelihood of the pairwise labels:

  log P(y | x) = Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) − log Z_x

It becomes −∞ if the y_ij within a co-reference set are not all linked, to prevent inconsistencies.
– A notational trick: the implementation simply doesn't allow non-clique configurations.
The objective is biggest when all the similar pairs (x_i, x_j) are connected and all the opposite pairs (x_i, x_j) are separated.
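Operationally, the trick is just a transitivity check before scoring; a minimal sketch with hypothetical helper names (`feats` and `lams` hold the feature functions and their weights):

```python
import itertools
import math

def log_score(xs, y, feats, lams):
    """Unnormalized log-score of a pairwise labeling y[(i, j)] = 0/1.
    A triangle with exactly two positive edges violates transitivity
    (a non-clique configuration) and gets score -inf."""
    n = len(xs)
    for i, j, k in itertools.combinations(range(n), 3):
        if y[(i, j)] + y[(j, k)] + y[(i, k)] == 2:
            return -math.inf
    return sum(lam * f(xs[i], xs[j], y[(i, j)])
               for i, j in itertools.combinations(range(n), 2)
               for f, lam in zip(feats, lams))
```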
19
Objective function
The weights λ_l are learned by maximum likelihood over the training data.
The function is concave, so we can use our favorite learning algorithm (e.g. stochastic gradient ascent).
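For a log-linear model the gradient of the training log-likelihood ℓ(λ) with respect to each weight has the standard empirical-minus-expected form:

  ∂ℓ/∂λ_l = Σ_{i,j} f_l(x_i, x_j, y_ij) − E_{y′ ∼ P(y′ | x)} [ Σ_{i,j} f_l(x_i, x_j, y′_ij) ]

Setting the gradient to zero matches expected feature counts to empirical ones; the expectation over labelings y′ is what makes exact gradients expensive.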
20
Graph partitioning
Maximizing the objective is equivalent to finding an optimal partitioning of a complete graph.
– The nodes are the x_i's.
– The edges (x_i, x_j) carry the log-potential functions applied to that pair of references.
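To see the equivalence, collapse each pair's potentials into a single signed weight (a sketch using the antisymmetry property f_l(x_i, x_j, 1) = −f_l(x_i, x_j, 0) from earlier):

  w_ij = Σ_l λ_l f_l(x_i, x_j, 1)

Putting x_i and x_j in the same cluster contributes +w_ij to the objective and separating them contributes −w_ij, so maximizing the objective is exactly finding the partition that maximizes total within-cluster weight minus total cross-cluster weight.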
21
Correlation Clustering
Graph G = (V, E); edges are labeled + or −.
Partition V into clusters such that + edges are within clusters and − edges are across clusters.
No bound on the number of clusters.
[Figure: example graph on nodes x1…x5 with +/− labeled edges]
22
Agreements and Disagreements
Agreements: + edges inside clusters AND − edges outside clusters.
[Figure: the example graph with a candidate partitioning highlighted]
23
Agreements and Disagreements
Agreements: + edges inside clusters AND − edges outside clusters.
Disagreements (mistakes): + edges outside clusters AND − edges inside clusters.
We can either maximize agreements OR minimize disagreements.
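Counting both quantities for a candidate clustering is mechanical; a minimal sketch, where `edges` and `cluster` are hypothetical names for the edge labels and the node-to-cluster map:

```python
def agreements_disagreements(edges, cluster):
    """edges: {(u, v): '+' or '-'}; cluster: {node: cluster_id}.
    A '+' edge agrees when it is inside a cluster; a '-' edge
    agrees when it crosses clusters."""
    agree = disagree = 0
    for (u, v), sign in edges.items():
        inside = cluster[u] == cluster[v]
        if (sign == '+') == inside:
            agree += 1
        else:
            disagree += 1
    return agree, disagree
```

Every edge is either an agreement or a disagreement, so both objectives pick out the same optimal partition; they differ only in how well they can be approximated.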
24
Choosing a different partitioning is possible, but it could be worse, since more disagreements are introduced.
Partitions must be cliques (does this introduce a bias towards small partitions?).
[Figure: an alternative, worse partitioning of the example graph]
25
A few observations
The number of objects is determined automatically.
– It corresponds to the number of cliques.
Metrics are defined pairwise, but the decision to join a clique involves all the references.
If we force two cliques, the problem is equivalent to a single simulated annealing pass.
27
Cross-citation disambiguation
28
Complexity
We reduced a probabilistic inference problem to a clustering problem that cannot be constant-factor-approximated in polynomial time (a 3-SAT instance can be reduced to it).
Yet correlation clustering algorithms exist that guarantee relative error less than (1 − ε) in polynomial time for complete graphs (such as ours).
So we are solving probabilistic inference with a faster algorithm. How?
– We have a simpler subclass of probability distributions: they are log-linear.
– It probably boils down to an integer-programming problem, since not all assignments of y are allowed.
29
Proper Noun co-reference
30
Proper Nouns performance
Tested on:
– 30 newswire articles
– 117 stories from the broadcast-news portion of DARPA's ACE set
– Hand-annotated nouns (non-proper nouns ignored)
Identical feature functions were used on all three sets!
5-fold cross-validation.
Accuracy drops to only 60% if non-proper nouns are included.