Link Detection. David Eichmann, School of Library and Information Science, The University of Iowa.


Link Detection
David Eichmann
School of Library and Information Science
The University of Iowa

Why?
 We focused on link detection this year to vet a new similarity scheme
 In building our extraction framework for question answering and bioinformatics, we were able to derive:
 A reasonably clean scheme for mapping relationships between entities; and
 A way of decorating those entities with extracted attributes/properties (e.g., person age, relative geographical position, etc.)

Our Working Hypothesis
 Assessing inter-document linkage using a concept graph derived from the extraction framework could prove more robust than term-vector methods

Technique (in the ideal)
 Sentence-boundary detect the corpus
 Part-of-speech tag sentence terms
 Extract named entities and residual noun phrases
 Generate a parse for each sentence
 Use the resulting dependencies to generate graph fragments
 Merge the graph fragments into a single graph for a story
 Use a graph similarity scheme to assess story linkage
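The steps above can be sketched as a skeleton pipeline. This is only an illustration: the entity extractor and dependency parser are passed in as callables, since the slides don't commit to specific components, and the sentence splitter here is a naive punctuation-based stand-in.

```python
import re

def story_graph(story_text, extract_entities, parse_dependencies):
    """Skeleton of the ideal pipeline.  `extract_entities` and
    `parse_dependencies` are hypothetical interfaces standing in for a
    NER component and a dependency parser (not named in the talk)."""
    # Sentence boundary detection (naive punctuation-based split)
    sentences = re.split(r"(?<=[.!?])\s+", story_text.strip())
    graph = set()
    for sent in sentences:
        entities = extract_entities(sent)
        # Each dependency yields a (head, relation, tail) graph fragment;
        # merging fragments for a story is a union of edge sets.
        graph.update(parse_dependencies(sent, entities))
    return graph
```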

The graph similarity measure
 Generate the Cook-Holder edit distance between two graphs
 Graph_sim(g1, g2) = 1 - norm(CHed(g1, g2) / max(|g1|, |g2|))
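A minimal illustration of the normalization, representing each graph as a set of labeled edges and using the symmetric edge-set difference as a crude stand-in for the Cook-Holder edit distance (the real measure handles node relabeling, insertions, and deletions more carefully):

```python
def graph_sim(g1, g2):
    """Similarity in [0, 1] between two graphs, each a set of
    (head, relation, tail) edge tuples.  The symmetric edge-set
    difference only approximates the Cook-Holder edit distance;
    normalizing by the combined edge count keeps the score in [0, 1]."""
    if not g1 and not g2:
        return 1.0
    edit_distance = len(g1 ^ g2)       # edges present in only one graph
    return 1.0 - edit_distance / len(g1 | g2)
```

Identical graphs score 1.0; graphs sharing no edges score 0.0.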

Reality sets in
 MT text doesn’t parse worth a …
 ASR text rarely has clean sentence boundaries
 Off-the-shelf parsers aren’t trained for speech grammars
 Hence ASR text doesn’t parse worth a …

Regrouping
 Sentence-boundary detect newswire sources
 Approximate sentence boundaries with speech pauses longer than a certain threshold
 Skip the parse
 Generate graph fragments using a window of neighboring NPs
 Submitted run uses the current NP and the two downstream NPs
 This clearly misses syntactically close but lexically distant NP connections…
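The NP-window hack is simple to sketch: given the document's noun phrases in order, link each NP to the next few. The function below is an illustration of the windowing idea, not the submitted system's code.

```python
def np_window_fragments(nps, window=2):
    """Link each noun phrase to the next `window` NPs in document order
    (the submitted run used the current NP plus the two downstream NPs).
    Returns undirected co-occurrence edges as sorted tuples."""
    edges = set()
    for i, np in enumerate(nps):
        for neighbor in nps[i + 1 : i + 1 + window]:
            edges.add(tuple(sorted((np, neighbor))))
    return edges
```

With window = 2, an NP four positions downstream is never linked, which is exactly the "syntactically close but lexically distant" gap the slide notes.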

Contrastive Runs
 Cosine vector similarity of document term vectors
 Cosine vector similarity of document phrase vectors
 A strawman edit distance
 Construct a single string for a document comprised of the concatenation of alphabetized NPs for the document
 If the graph scheme doesn’t outperform this, it’s probably not worth pursuing…
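Sketches of the two baseline ingredients follow. These use raw term frequencies with no idf weighting, which the actual runs may have applied differently.

```python
from collections import Counter
from math import sqrt

def cosine_sim(terms1, terms2):
    """Cosine similarity of raw term-frequency vectors."""
    v1, v2 = Counter(terms1), Counter(terms2)
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = (sqrt(sum(c * c for c in v1.values()))
            * sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

def strawman_string(nps):
    """One string per document: its noun phrases, alphabetized and
    concatenated, ready for a plain string edit distance."""
    return " ".join(sorted(nps))
```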

Official Results

Run     Scheme   P(Miss)   P(FA)   Norm C_link
UIowa1  Graph
UIowa2  Edit
UIowa3  Phrase
UIowa4  Word

Word Performance

Phrase Performance

Edit Distance Performance

Graph Similarity Performance

Word/Phrase Costs

Word/Edit Costs

Word/Graph Costs

Graph/Edit Costs

Conclusions  Definitely signal present in the graph similarity scheme  More tuning needed  Official Run Clink:  Actual Minimum Clink:  Official Run P(Miss):  Actual Minimum Clink P(Miss):  Definitely signal present in the graph similarity scheme  More tuning needed  Official Run Clink:  Actual Minimum Clink:  Official Run P(Miss):  Actual Minimum Clink P(Miss):

Conclusions, cont’d.
 Revisit the graph formation hack
 Hybrid scheme
 Use the ideal scheme for newswires
 Use the hack for broadcasts
 Alternatively
 Aggressively segment ASR, resulting in smaller fragments
 Parse everything
 Note that we don’t need full sentence structure, only good clausal structure