Bo Lin, Kevin Dela Rosa, Rushin Shah
As part of our research, we are working on a cross-document co-reference resolution system.
Co-reference resolution: extract all noun phrases (NPs) from a document (names, descriptions, pronouns) and cluster them according to the real-world entity they describe. Each such cluster is called a chain.
◦ Within-doc: cluster NPs from a single document
◦ Cross-doc: cluster NPs from different documents
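To make the notion of a chain concrete, here is a minimal illustrative sketch in Python; the field names are our own invention, not the system's actual data model.

```python
# Hypothetical representation of co-reference chains (illustration only).
# A chain groups the mentions of one real-world entity within one document.
doc1_chain = {
    "doc_id": "doc1",
    "mentions": ["Barack Obama", "the president", "he"],  # NPs from doc1
}
doc2_chain = {
    "doc_id": "doc2",
    "mentions": ["Obama", "the former senator"],          # NPs from doc2
}
# Within-doc resolution builds each chain from a single document;
# cross-doc resolution decides that doc1_chain and doc2_chain describe
# the same real-world entity and places them in one cluster.
```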
Run a WDC (within-document co-reference) system on a document and extract chains corresponding to real-world entities. For each chain, track all sentences from which its mentions are obtained.
Features over pairs of such chains:
◦ SoftTFIDF similarity between names
◦ All-words TF-IDF cosine similarity (over sentences)
◦ Named Entity (NE) TF-IDF cosine similarity (over sentences)
◦ NE SoftTFIDF cosine similarity (over sentences)
◦ Semantic similarity between the NPs of each chain
Train an SVM that classifies pairs of chains as co-referent or not, and use it to cluster all chains from all documents in the corpus. Store this clustering in a database, where each entity is represented as a list of chains.
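As a concrete illustration of one of these features, here is a hedged sketch of the all-words TF-IDF cosine similarity between two chains' sentence sets, using scikit-learn; the slides do not specify the actual tokenization or weighting, so those details are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_cosine(chain_a_sentences, chain_b_sentences):
    """All-words TF-IDF cosine similarity between the sentence sets
    tracked for two chains (one of the pairwise features above)."""
    # Pool each chain's sentences into a single pseudo-document.
    # (In the real system the IDF statistics would presumably be
    # estimated over the whole corpus, not just these two documents.)
    docs = [" ".join(chain_a_sentences), " ".join(chain_b_sentences)]
    tfidf = TfidfVectorizer().fit_transform(docs)
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]
```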
Augment our CDC system with the following:
Attribute Extraction
◦ For each chain, extract attributes such as gender, occupation, nationality, and birthdate, if they exist
◦ Use attributes to enable the SVM to do better co-reference
Relationship Extraction
◦ For each pair of chains, extract a relationship (e.g., part-of, role), if it exists
◦ Use relationships for better visualization of clusters
Patterns: take seed examples of (entity, attribute) pairs and learn extraction patterns
Position: use typical document positions of attributes
Transitive: use attributes of neighboring entities
Latent: use document-level topic models to infer attributes
References: Ravichandran & Hovy '02, Garera & Yarowsky '09
Idea: attribute values should appear with linguistic clues around them; i.e., extraction can be framed as a probabilistic language model describing the chance of a word being an attribute value given its context. This idea is essentially the same as in KnowItAll or Brin '98.
Example:
◦ Marked text: "<name> (born <birthdate>) is a former American <occupation>, active …"
◦ Generated contexts: "(born X)" for X = birthdate, "a former American X" for X = occupation, …
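A minimal sketch of this pattern idea, assuming hypothetical seed pairs and whitespace tokenization; real pattern induction (as in Ravichandran & Hovy '02) scores and filters patterns rather than using them all.

```python
import re

# Hypothetical seed (entity, attribute-value) pairs; the real seeds come
# from NNDB and Wikipedia, not from this toy list.
SEEDS = [("John Coltrane", "saxophonist"), ("Miles Davis", "trumpeter")]

def learn_patterns(sentences, seeds, window=3):
    """Collect surface contexts around known attribute values: for every
    sentence mentioning a seed entity and its value, keep up to `window`
    words on each side of the value as a (left, right) context pattern."""
    patterns = set()
    for sentence in sentences:
        for entity, value in seeds:
            if entity in sentence and value in sentence:
                tokens = sentence.split()
                for i, tok in enumerate(tokens):
                    if tok.strip(".,") == value:
                        left = " ".join(tokens[max(0, i - window):i])
                        right = " ".join(tokens[i + 1:i + 1 + window])
                        patterns.add((left, right))
    return patterns

def apply_patterns(sentence, patterns):
    """Match each learned context against a new sentence; whatever fills
    the slot between left and right context is a candidate value."""
    candidates = []
    for left, right in patterns:
        regex = re.escape(left) + r"\s*(\w+)\s*" + re.escape(right)
        candidates += re.findall(regex, sentence)
    return candidates
```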
Idea: certain biographical attributes tend to appear in characteristic positions, often near the top of an article. The relative position/rank between attributes can be helpful information as well.
Example: "Birthdate" is often the first date in a biography text, or at least near the beginning of the article, and relatively speaking, "Deathdate" almost always occurs somewhere after it.
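A hedged sketch of the position heuristic, assuming dates have already been tagged (e.g., by an NE tagger); the actual absolute and relative position models are presumably trained rather than hard-coded like this.

```python
def extract_birth_death_dates(tagged_dates):
    """tagged_dates: list of (char_offset, date_string) pairs found in the
    document. Position heuristic from the slide: guess birthdate = first
    date in document order, deathdate = a later date (a simplification)."""
    ordered = sorted(tagged_dates)
    birth = ordered[0][1] if ordered else None
    death = ordered[1][1] if len(ordered) > 1 else None
    return birth, death
```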
Idea: the intuition is that a named entity is more likely to be mentioned together with other entities that have similar attribute values (the most applicable attributes seem to be "occupation" and "religion").
Example: "Michael Jordan" (the player) is mostly mentioned together with "Wilt Chamberlain", "Dennis Rodman", and fellow players.
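A minimal sketch of the transitivity idea, assuming co-occurring entities and their (possibly unknown) attribute values are available; the majority-vote scheme is our illustration, not necessarily the model being implemented.

```python
from collections import Counter

def infer_occupation_from_neighbors(neighbor_occupations):
    """neighbor_occupations: occupations of entities mentioned near the
    target entity (None when unknown). Vote: the target likely shares
    the most common occupation among its neighbors."""
    votes = Counter(occ for occ in neighbor_occupations if occ is not None)
    return votes.most_common(1)[0][0] if votes else None

# e.g. infer_occupation_from_neighbors(
#     ["basketball player", "basketball player", None, "coach"])
# -> "basketball player"
```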
Idea: use "latent wide-document-context" models to detect attributes that may not be mentioned directly in the article.
Example: words such as "songs", "album", and "recorded" can all collectively indicate an occupation of singer or musician.
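A sketch of the latent idea using simple indicator-word scoring as a stand-in for a full document-level topic model; the word lists are hypothetical, and a real system would learn them as topics rather than list them by hand.

```python
# Hypothetical indicator words per occupation (illustration only).
INDICATORS = {
    "musician": {"songs", "album", "recorded", "band", "tour"},
    "politician": {"elected", "senate", "campaign", "party", "vote"},
}

def infer_implicit_occupation(document_tokens):
    """Score each occupation by how many of its indicator words appear
    anywhere in the document, even if the occupation is never stated."""
    tokens = {t.lower() for t in document_tokens}
    scores = {occ: len(words & tokens) for occ, words in INDICATORS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```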
Three options:
◦ Extract a variety of features and train YFCL ("your favorite classifier")
◦ Define kernels to measure similarity between instances and plug them into an SVM
◦ Semi-supervised approach: start with seed examples, learn patterns, and iterate, growing a knowledge base (e.g., KnowItAll, TextRunner)
We would ideally prefer the semi-supervised approach, especially since it allows open-domain IE, but it is very time- and labor-intensive. The kernel approach is more elegant and works better than YFCL, so we proposed to use kernels.
Attributes: NNDB seed pairs, Wikipedia pages
Relations: ACE Phase 2 top-level relations
For our CDC system:
◦ John Smith corpus
◦ WePS (person name disambiguation) corpora
Data Preparation
◦ Collected an attribute extraction data set: our seed pairs currently have occupation, date-of-birth, and date-of-death values; we plan on collecting birthplace, gender, and nationality
◦ Implemented a program to parse simple Wikipedia pages
Framework
◦ Implemented a pipeline process (sketched below) consisting of: loading the list of names and pages, parsing Wikipedia pages, sentence segmentation, and POS tagging / NE tagging (if needed), followed by modeling and extraction
◦ Provides the capability to support different models for different attributes
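A hedged sketch of such a pipeline using NLTK; the slides do not name the actual libraries or interfaces, so the tokenizer/tagger choices, the toy markup stripper, and the `attribute_models` interface are assumptions.

```python
import re
import nltk  # assumes NLTK's 'punkt' and POS-tagger models are installed

def parse_wikipedia(raw_page):
    """Toy stand-in for the Wikipedia parser: strip [[target|text]] links."""
    return re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", raw_page)

def run_pipeline(name_to_page, attribute_models):
    """The slide's stages in order: load names/pages, parse the page,
    segment sentences, POS-tag, then hand off to one model per attribute.
    `attribute_models` is an assumed interface: objects with .extract()."""
    for name, raw_page in name_to_page.items():
        text = parse_wikipedia(raw_page)
        sentences = nltk.sent_tokenize(text)
        tagged = [nltk.pos_tag(nltk.word_tokenize(s)) for s in sentences]
        for model in attribute_models:
            yield name, model.extract(name, sentences, tagged)
```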
Implementations
◦ Implemented the context/pattern-based model, currently focusing on occupations
◦ Implemented position-based models (absolute & relative) for occupation and date-of-birth attribute extraction
◦ Currently implementing the transitivity-based model for occupation extraction
◦ Currently implementing latent wide-document-context models for extracting implicit attributes
Issues
◦ The actual implementation raises many problems missed in the paper, such as whether POS tagging should be included in the model; the pipeline framework is intended to solve this issue.
Plan
◦ Finish the implementations and compare them against two baseline models for each attribute
◦ Extend the code to work on chains and consolidate the attributes
◦ Use the extracted attributes as features in the CDC system's SVM classifier to help co-reference resolution
Evaluated different kinds of kernels (subsequence, dependency tree, etc.) and chose subsequence kernels, as they perform well and are robust to ill-formed documents. As a template, we decided to follow the Bunescu and Mooney paper discussed in class.
Observation: although the idea in the paper is easy to understand, the implementation is quite complex.
For the SVM, we decided to use LibSVM. Bunescu has a modified version that can take custom kernels, but it is based on an older release, so we made updates to LibSVM to reconcile the different versions:
◦ It now accepts custom/pre-computed kernels
◦ It allows a richer representation of instances than numerical values (e.g., sentence fragments, named entities)
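For reference, stock LibSVM supports pre-computed kernels via kernel type 4 (`-t 4`); the sketch below writes a kernel matrix in that documented input format. It illustrates the general mechanism, not the modified LibSVM described above.

```python
def write_precomputed_kernel(path, labels, K):
    """Write an n x n kernel matrix K in LibSVM's precomputed format:
    each line is '<label> 0:<serial> 1:K(i,1) ... n:K(i,n)', with 1-based
    serial numbers in the mandatory 0: column. Train with 'svm-train -t 4'."""
    n = len(labels)
    with open(path, "w") as f:
        for i in range(n):
            row = " ".join(f"{j + 1}:{K[i][j]}" for j in range(n))
            f.write(f"{labels[i]} 0:{i + 1} {row}\n")
```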
As we know, the hardest part of coding machine learning applications isn't the classifier (there are plenty of libraries), but feature extraction. We wrote feature extraction code that extracts sentences from XML and produces instances for the SVM, where each instance consists of:
◦ The two named entities (NEs) of interest
◦ The sentence fragments before, between, and after the NEs
If a sentence contains more than 2 NEs, we make C(N, 2) copies, one per NE pair (see the sketch below).
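A minimal sketch of this instance generation, assuming NEs are given as non-overlapping (start, end) token spans in document order; the instance fields mirror the bullets above, and the dict representation is our own illustration.

```python
from itertools import combinations

def make_instances(tokens, ne_spans):
    """tokens: the sentence as a token list; ne_spans: (start, end) token
    spans of its named entities. Emits one instance per NE pair, i.e.
    C(N, 2) instances for a sentence with N entities."""
    instances = []
    for (s1, e1), (s2, e2) in combinations(sorted(ne_spans), 2):
        instances.append({
            "ne1": tokens[s1:e1],
            "ne2": tokens[s2:e2],
            "before": tokens[:s1],     # fragment before the first NE
            "between": tokens[e1:s2],  # fragment between the two NEs
            "after": tokens[e2:],      # fragment after the second NE
        })
    return instances
```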
Implemented the general subsequence kernel algorithm (sketched below). We are currently adapting it for the specific case of relation kernels (as explained in Bunescu & Mooney). Once that is done, we will extend this code (which works on sentences) to chains, so it can be plugged into the CDC system.
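A memoized sketch of the gap-weighted subsequence kernel of Lodhi et al., which the general algorithm here presumably resembles; the decay parameter `lam` and the direct recursion are our simplifications, and Bunescu & Mooney's relation kernel builds on this with word classes and fragment-specific variants.

```python
from functools import lru_cache

def subsequence_kernel(s, t, n, lam=0.5):
    """Counts common subsequences of length n in token sequences s and t,
    penalizing gaps by a factor of lam per spanned position. Direct
    recursion with memoization; O(n * |s| * |t|) states."""
    s, t = tuple(s), tuple(t)

    @lru_cache(maxsize=None)
    def k_prime(i, ls, lt):
        # K'_i over prefixes s[:ls], t[:lt] (suffix-anchored helper).
        if i == 0:
            return 1.0
        if ls < i or lt < i:
            return 0.0
        x = s[ls - 1]
        total = lam * k_prime(i, ls - 1, lt)
        for j in range(1, lt + 1):
            if t[j - 1] == x:
                total += k_prime(i - 1, ls - 1, j - 1) * lam ** (lt - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k(ls, lt):
        # K_n over prefixes s[:ls], t[:lt].
        if ls < n or lt < n:
            return 0.0
        x = s[ls - 1]
        total = k(ls - 1, lt)
        for j in range(1, lt + 1):
            if t[j - 1] == x:
                total += k_prime(n - 1, ls - 1, j - 1) * lam ** 2
        return total

    return k(len(s), len(t))

# e.g. subsequence_kernel("ab", "ab", n=2, lam=1.0) -> 1.0
# (the single common subsequence "ab", unpenalized when lam=1)
```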