Bo Lin, Kevin Dela Rosa, Rushin Shah
As part of our research, we are working on a cross-document co-reference resolution system.
Co-reference Resolution: Extract all noun phrases (NPs) from a document (names, descriptions, pronouns), and cluster them according to the real-world entity they describe. Each such cluster is called a chain.
Within-doc (WDC): Cluster NPs from a single document
Cross-doc (CDC): Cluster NPs from different documents
Run a WDC system on a document and extract chains corresponding to real-world entities.
For each chain, track all sentences from which its mentions are obtained.
Features over pairs of such chains:
◦ SoftTFIDF similarity between names
◦ All words TFIDF cosine similarity (over sentences)
◦ Named Entity (NE) TFIDF cosine similarity (over sentences)
◦ NE SoftTFIDF cosine similarity (over sentences)
◦ Semantic similarity between the NPs of each chain
Train an SVM that classifies pairs of chains as co-referent or not.
Use the SVM to cluster all chains from all documents in the corpus.
Store this clustering in a database. Each entity is a list of chains.
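The "all words TFIDF cosine similarity (over sentences)" feature can be illustrated with a minimal sketch. It assumes each chain is represented by the tokens of the sentences its mentions came from; the chain names, example sentences, and the smoothed IDF formula are illustrative assumptions, not the system's actual code.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: term -> weight) per token list."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption): terms seen everywhere keep weight 1.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical chains: the sentence context of three "John Smith" chains.
chains = {
    "doc1:john_smith": "john smith coached the football team".split(),
    "doc2:john_smith": "coach john smith led the football team".split(),
    "doc3:john_smith": "senator john smith voted on the bill".split(),
}
names = list(chains)
vecs = dict(zip(names, tfidf_vectors([chains[n] for n in names])))
sim_same = cosine(vecs["doc1:john_smith"], vecs["doc2:john_smith"])
sim_diff = cosine(vecs["doc1:john_smith"], vecs["doc3:john_smith"])
```

In this toy data the two football-related chains score higher than the football/senator pair, which is the signal the pairwise SVM would consume as one feature among several.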
Augment our CDC system with the following:
Attribute Extraction
◦ For each chain, extract attributes such as gender, occupation, nationality, and birthdate, if they exist
◦ Use attributes to enable the SVM to do better co-reference
Relationship Extraction
◦ For each pair of chains, extract a relationship (e.g. part of, role, etc.), if it exists
◦ Use relationships for better visualization of clusters
Patterns: Take seed examples of (entity, attribute) and learn extraction patterns
Position: Use typical document positions of attributes
Transitive: Use attributes of neighboring entities
Latent: Use document-level topic models to infer attributes
References: Ravichandran & Hovy ‘02, Garera & Yarowsky ‘09
Idea: Attribute values should appear with linguistic clues around them, i.e. extraction can be defined as a probabilistic language model describing the chance of a word being an attribute given its context. This idea is essentially the same as in KnowItAll or Brin ‘98.
Example:
◦ Marked Text: “ (born ) is a former American, active …”
◦ Generated Context: “(born X)” for X=, “a former American X” for X= …
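A minimal sketch of the context-generation step, under stated assumptions: the seed sentence, the names, and the window sizes are hypothetical, and the probabilistic scoring of patterns is left out. It replaces a seed attribute value with the slot X, keeps a few surrounding tokens as a pattern, and then applies that pattern to new text.

```python
import re

def context_patterns(sentence, value, left=2, right=1):
    """Replace each occurrence of `value` with the slot X and keep
    up to `left`/`right` neighboring tokens as the pattern."""
    tokens = sentence.split()
    value_tokens = value.split()
    n = len(value_tokens)
    patterns = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == value_tokens:
            before = tokens[max(0, i - left):i]
            after = tokens[i + n:i + n + right]
            patterns.append(" ".join(before + ["X"] + after))
    return patterns

def pattern_to_regex(pattern):
    """Compile a pattern such as 'former American X and' into a regex
    that captures a single word in the X slot."""
    parts = [r"(\w+)" if tok == "X" else re.escape(tok)
             for tok in pattern.split()]
    return re.compile(r"\s+".join(parts))

# Hypothetical seed pair (entity, occupation) and a sentence mentioning it.
seed_sentence = "Jane Doe is a former American astronaut and engineer"
patterns = context_patterns(seed_sentence, "astronaut")

# Apply the learned pattern to an unseen, equally hypothetical sentence.
new_sentence = "John Roe is a former American senator and author"
match = pattern_to_regex(patterns[0]).search(new_sentence)
extracted = match.group(1) if match else None
```

A fuller version would count how often each pattern fires on known seed pairs and use those counts to estimate how reliable the pattern is.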
Idea: Certain biographical attributes tend to appear in characteristic positions, often near the top of the article. The relative position/rank between attributes can be helpful information as well.
Example: “Birthdate” is often the first date in a biography text, or at least near the beginning of the article, and “Deathdate” almost always occurs somewhere after it.
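The position heuristic can be sketched as follows. This is an illustration only: it treats any four-digit year as a "date", takes the first one as the birth candidate, and returns every later year as an ordered death candidate. The regular expression and the example biography are assumptions.

```python
import re

# Assumption: a "date" is approximated by a four-digit year (1000-2099).
YEAR = re.compile(r"\b(?:1\d{3}|20\d{2})\b")

def birth_and_death_candidates(text):
    """Return (birth_year, later_years): the first year in the text and
    the years that follow it, in document order."""
    years = [m.group(0) for m in YEAR.finditer(text)]
    if not years:
        return None, []
    return years[0], years[1:]

# Hypothetical biography text.
bio = ("Jane Doe (born 1900) was a painter. She moved to Paris in 1925 "
       "and died in 1980.")
birth, death_candidates = birth_and_death_candidates(bio)
```

Relative position alone only narrows the death date down to the years after the birth candidate; picking among them would need another cue, such as a context pattern.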
Idea: A named entity is more likely to be mentioned together with other entities that have similar attribute values (the most applicable attributes seem to be “occupation” and “religion”).
Example: “Michael Jordan” (the player) is mostly mentioned together with “Wilt Chamberlain”, “Dennis Rodman” and fellow players.
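One simple way to realize the transitive idea is a weighted vote over co-mentioned entities whose attribute is already known. The sketch below assumes that voting scheme; the co-occurrence counts and the extra entity are made up for illustration.

```python
from collections import Counter

def infer_by_neighbors(entity, cooccurrences, known):
    """Pick the attribute value most common among the entity's neighbors,
    weighting each neighbor by how often it is co-mentioned."""
    votes = Counter()
    for neighbor, count in cooccurrences.get(entity, {}).items():
        value = known.get(neighbor)
        if value is not None:
            votes[value] += count
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical co-mention counts and known occupations.
cooccurrences = {
    "Michael Jordan": {
        "Wilt Chamberlain": 5,
        "Dennis Rodman": 7,
        "Some Professor": 1,
    }
}
known = {
    "Wilt Chamberlain": "basketball player",
    "Dennis Rodman": "basketball player",
    "Some Professor": "professor",
}
occupation = infer_by_neighbors("Michael Jordan", cooccurrences, known)
```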
Idea: Use “latent wide-document-context” models to detect attributes that may not have been mentioned directly in the article.
Example: Words such as “songs, album, recorded” can collectively indicate an occupation of singer or musician.
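The example above can be mimicked with plain cue-word counting. This is a deliberately simplified stand-in and not a topic model: the cue-word lists are hand-written assumptions, whereas a latent model would learn word/attribute associations from data.

```python
from collections import Counter

# Hypothetical cue words per occupation; a real model would learn these.
CUE_WORDS = {
    "musician": {"songs", "album", "recorded", "band", "tour"},
    "athlete": {"season", "scored", "team", "league", "coach"},
}

def infer_occupation(tokens, cue_words=CUE_WORDS, min_hits=2):
    """Return the occupation whose cue words occur most often in the
    document, or None if the evidence is below `min_hits`."""
    counts = Counter(t.lower() for t in tokens)
    scores = {occ: sum(counts[w] for w in cues)
              for occ, cues in cue_words.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None

# Hypothetical document that never states the occupation directly.
doc = "She recorded her first album in 1990 and wrote most of the songs".split()
guess = infer_occupation(doc)
```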
Three options:
◦ Extract a variety of features and train YFCL
◦ Define Kernels to measure similarity between instances, plug into SVM
◦ Semi-supervised approach. Start with seed examples, learn patterns, iterate. Grow KB (e.g. KnowItAll, TextRunner)
Ideally we would prefer semi-supervised, especially since it allows open-domain IE, but it is very time- and labor-intensive.
The Kernel approach is more elegant and works better than YFCL.
Therefore, we proposed to use Kernels.
Attributes: NNDB seed pairs, Wikipedia pages
Relations: ACE Phase 2 top-level relations
For our CDC system:
◦ John Smith corpus
◦ WePS (person name disambiguation) corpora
Data Preparation
◦ Collected attribute extraction data set
  Currently our seed pairs have occupation, date-of-birth, and date-of-death values; we plan on collecting birthplace, gender, and nationality
◦ Implemented a program to parse simple Wikipedia pages
Framework
◦ Implemented a pipeline process which consists of:
  Loading the list of names and pages
  Parsing Wikipedia pages
  Sentence segmentation
  POS tagging / NE tagging (if needed)
  Modeling and extraction
◦ Provides the capability to support different models for different attributes
Implementations
◦ Implemented context/pattern-based model, currently focusing on occupations
◦ Implemented position-based models (absolute & relative) for occupation and date-of-birth extraction
◦ Implementation in progress: transitivity-based model for occupation extraction
◦ Implementation in progress: latent wide-document-context models for extracting implicit attributes
Issues
Actual implementation raises many problems not addressed in the paper, such as whether POS tagging should be included in the model. The pipeline framework is intended to solve this issue.
Plan
◦ Finish the implementations and compare with two baseline models for each attribute
◦ Extend code to work on chains and consolidate the attributes
◦ Use the extracted attributes as features in the CDC system’s SVM classifier to help co-reference resolution
Evaluated different kinds of Kernels (subsequence, dependency trees, etc.).
Chose subsequence kernels, as these perform well and are robust to ill-formed documents.
As a template, decided to follow the Bunescu and Mooney paper discussed in class.
Observation: Although the idea in the paper is easy to understand, implementation is quite complex.
For the SVM, decided to use LibSVM.
Bunescu has a modified version that can take custom Kernels, but it is based on an older release.
Made updates to LibSVM to reconcile the different versions.
◦ Now accepts custom/pre-computed Kernels
◦ Allows richer representation of instances than numerical values (e.g. sentence fragments, named entities, etc.)
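For reference, stock LibSVM reads pre-computed Kernels (option `-t 4`) from lines of the form `<label> 0:<serial> 1:K(x_i,x_1) ... L:K(x_i,x_L)`, with serial numbers starting at 1. The sketch below writes a Gram matrix in that format; it shows the stock format only and makes no claim about the input format of the modified version described above. The function name and the toy matrix are assumptions.

```python
def to_libsvm_precomputed(labels, gram):
    """Format a Gram matrix as LibSVM pre-computed kernel lines:
    '<label> 0:<serial> 1:K(i,1) ... L:K(i,L)', serials starting at 1."""
    lines = []
    for serial, (label, row) in enumerate(zip(labels, gram), start=1):
        cells = [f"0:{serial}"]
        cells += [f"{j}:{value:g}" for j, value in enumerate(row, start=1)]
        lines.append(f"{label} " + " ".join(cells))
    return lines

# Toy example: two training instances and their 2x2 kernel matrix.
labels = [1, -1]
gram = [[1.0, 0.25],
        [0.25, 1.0]]
lines = to_libsvm_precomputed(labels, gram)
```

Because the classifier only ever sees kernel values, the instances themselves can be arbitrary structures such as sentence fragments, as long as a kernel function can compare two of them.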
As we know, the hardest part of coding Machine Learning applications isn’t the classifier (plenty of libraries), but feature extraction.
Wrote feature extraction code that extracts sentences from XML and produces instances for the SVM, where each instance consists of:
◦ The 2 Named Entities of interest
◦ Sentence fragments before, between and after the NEs
If a sentence contains N > 2 NEs, make C(N, 2) copies, one per pair of NEs.
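The instance layout described above can be sketched as follows, assuming the sentence is already tokenized and each NE is given as a (start, end) token span with an exclusive end. The `Instance` type, the function name, and the example sentence are hypothetical; XML parsing is omitted.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple

class Instance(NamedTuple):
    e1: str              # first named entity (left-most of the pair)
    e2: str              # second named entity
    before: List[str]    # tokens before e1
    between: List[str]   # tokens between e1 and e2
    after: List[str]     # tokens after e2

def pair_instances(tokens: List[str],
                   entity_spans: List[Tuple[int, int]]) -> List[Instance]:
    """Create one instance per unordered pair of NEs, i.e. C(N, 2)
    instances for a sentence with N non-overlapping NE spans."""
    instances = []
    for (s1, e1), (s2, e2) in combinations(sorted(entity_spans), 2):
        instances.append(Instance(
            e1=" ".join(tokens[s1:e1]),
            e2=" ".join(tokens[s2:e2]),
            before=tokens[:s1],
            between=tokens[e1:s2],
            after=tokens[e2:],
        ))
    return instances

# Hypothetical sentence with four NEs, so C(4, 2) = 6 instances.
tokens = "Alice Smith of Acme met Bob in Paris".split()
spans = [(0, 2), (3, 4), (5, 6), (7, 8)]
instances = pair_instances(tokens, spans)
```

Note that for a non-adjacent pair the "between" fragment still contains the other entities of the sentence as ordinary tokens.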
Implemented the general subsequence Kernel algorithm.
Currently working on adapting this for the specific case of relationship Kernels (as explained in Bunescu & Mooney).
Once done, will extend this code (which works on sentences) to chains, so it can be plugged into the CDC system.
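As an illustration of what a general subsequence Kernel computes, here is a minimal sketch of the standard gap-weighted subsequence kernel (in the style of Lodhi et al.), using the usual dynamic program. It is an assumption that this matches the implemented algorithm; the relationship Kernel of Bunescu & Mooney builds further structure on top of it, which is not shown. The function name and default decay value are illustrative.

```python
def subsequence_kernel(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel K_n(s, t): sums, over all common
    subsequences of length n, lam ** (span in s + span in t).
    `s` and `t` may be strings or lists of tokens."""
    ls, lt = len(s), len(t)
    # kp[i][a][b] holds the auxiliary value K'_i(s[:a], t[:b]); K'_0 = 1.
    kp = [[[1.0 if i == 0 else 0.0 for _ in range(lt + 1)]
           for _ in range(ls + 1)] for i in range(n)]
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running K''_i(s[:a], t[:b]) as b grows
            for b in range(i, lt + 1):
                kpp *= lam
                if s[a - 1] == t[b - 1]:
                    kpp += lam * lam * kp[i - 1][a - 1][b - 1]
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    total = 0.0
    for a in range(1, ls + 1):
        for b in range(1, lt + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[n - 1][a - 1][b - 1]
    return total

# Small checks on strings; "cat" and "car" share only "ca" at length 2.
k2 = subsequence_kernel("cat", "car", n=2, lam=0.5)
k1 = subsequence_kernel("cat", "car", n=1, lam=0.5)
# The same function works on word sequences, e.g. sentence fragments.
k_words = subsequence_kernel("a former American".split(),
                             "a famous American".split(), n=2)
```

In practice the kernel value is usually normalized by the square root of K_n(s, s) · K_n(t, t) so that longer fragments do not dominate.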