School of Library and Information Science

Similar presentations
A Human-Centered Computing Framework to Enable Personalized News Video Recommendation (Oh Jun-hyuk)
CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)
Probabilistic Detection of Context-Sensitive Spelling Errors Johnny Bigert Royal Institute of Technology, Sweden
LingPipe Does a variety of tasks  Tokenization  Part of Speech Tagging  Named Entity Detection  Clustering  Identifies.
Link Detection David Eichmann School of Library and Information Science The University of Iowa David Eichmann School of Library and Information Science.
Concepts, Semantics and Syntax in E-Discovery David Eichmann Institute for Clinical and Translational Science The University of Iowa David Eichmann Institute.
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News Gina-Anne Levow University of Chicago SIGHAN July 25, 2004.
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
SI485i : NLP Set 9 Advanced PCFGs Some slides from Chris Manning.
Disambiguation of References to Individuals Levon Lloyd (State University of New York) Varun Bhagwan, Daniel Gruhl (IBM Research Center) Varun Bhagwan,
AQUAINT Kickoff Meeting – December 2001 Integrating Robust Semantics, Event Detection, Information Fusion, and Summarization for Multimedia Question Answering.
ICS611 Introduction to Compilers Set 1. What is a Compiler? A compiler is software (a program) that translates a high-level programming language to machine.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Tree Kernels for Parsing: (Collins & Duffy, 2001) Advanced Statistical Methods in NLP Ling 572 February 28, 2012.
Probabilistic Parsing Reading: Chap 14, Jurafsky & Martin This slide set was adapted from J. Martin, U. Colorado Instructor: Paul Tarau, based on Rada.
Some Thoughts on HPC in Natural Language Engineering Steven Bird University of Melbourne & University of Pennsylvania.
GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)
Creating Metabolic Network Models using Text Mining and Expert Knowledge J.A. Dickerson, D. Berleant, Z. Cox, W. Qi, and E. Wurtele Iowa State University.
Methods for the Automatic Construction of Topic Maps Eric Freese, Senior Consultant ISOGEN International.
Triplet Extraction from Sentences Lorand Dali Blaž “Jožef Stefan” Institute, Ljubljana 17th of October 2008.
Triplet Extraction from Sentences Technical University of Cluj-Napoca Conf. Dr. Ing. Tudor Mureşan “Jožef Stefan” Institute, Ljubljana, Slovenia Assist.
Methods for Automatic Evaluation of Sentence Extract Summaries * G.Ravindra +, N.Balakrishnan +, K.R.Ramakrishnan * Supercomputer Education & Research.
For Monday Read chapter 24, sections 1-3 Homework: –Chapter 23, exercise 8.
For Friday Finish chapter 24 No written homework.
For Monday Read chapter 26 Last Homework –Chapter 23, exercise 7.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
MedKAT Medical Knowledge Analysis Tool December 2009.
For Friday Finish chapter 23 Homework –Chapter 23, exercise 15.
Supertagging CMSC Natural Language Processing January 31, 2006.
4. Relationship Extraction Part 4 of Information Extraction Sunita Sarawagi 9/7/2012 CS 652, Peter Lindes.
CS 4705 Lecture 17 Semantic Analysis: Robust Semantics.
Chunk Parsing II Chunking as Tagging. Chunk Parsing “Shallow parsing has become an interesting alternative to full parsing. The main goal of a shallow.
SATs Reading Paper.
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
Semantic Interoperability in GIS N. L. Sarda Suman Somavarapu.
Question Answering Passage Retrieval Using Dependency Relations (SIGIR 2005) (National University of Singapore) Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan,
Minimum Bayes-risk Methods in Automatic Speech Recognition Vaibhava Goel and William Byrne IBM; Johns Hopkins University 2005/4/26.
Natural Language Processing Vasile Rus
Automatic Writing Evaluation
Criterial features If you have examples of language use by learners (differentiated by L1 etc.) at different levels, you can use that to find the criterial.
A German Corpus for Similarity Detection
Getting the Most from Writing
Entity- & Topic-Based Information Ordering
Research Enablement Metrics
School of Library and Information Science
Robust Semantics, Information Extraction, and Information Retrieval
Vector Space Model Seminar Social Media Mining University UC3M
What is a Synthesis Essay?
Introduction to Textual Analysis
Compact Query Term Selection Using Topically Related Text
Web IR: Recent Trends; Future of Web Search
CSCI 5832 Natural Language Processing
Probabilistic and Lexicalized Parsing
Recognizing Structure: Sentence, Speaker, and Topic Segmentation
NETWORK-BASED MODEL OF LEARNING
Extracting Semantic Concept Relations
Writing a Research Proposal
Exploiting Topic Pragmatics for New Event Detection in TDT-2004
Introduction Task: extracting relational facts from text
Automatic Detection of Causal Relations for Question Answering
Haresfield C of E Primary School
Statistical Machine Translation Papers from COLING 2004
Effective Entity Recognition and Typing by Relation Phrase-Based Clustering
Introduction to Text Analysis
Information Retrieval
Statistical NLP: Lecture 10
Presentation transcript:

Link Detection
David Eichmann
School of Library and Information Science, The University of Iowa

Why?
We focused on link detection this year to vet a new similarity scheme. In building our extraction framework for question answering and bioinformatics, we were able to derive:
- A reasonably clean scheme for mapping relationships between entities; and
- A means of decorating those entities with extracted attributes/properties (e.g., person age, relative geographical position, etc.)

Our Working Hypothesis
Assessing inter-document linkage using a concept graph derived from the extraction framework could prove more robust than term-vector methods.

Technique (in the ideal)
- Sentence-boundary detect the corpus
- Part-of-speech tag sentence terms
- Extract named entities and residual noun phrases
- Generate a parse for each sentence
- Use the resulting dependencies to generate graph fragments
- Merge the graph fragments into a single graph for a story
- Use a graph similarity scheme to assess story linkage
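
As a rough illustration, the ideal pipeline might be sketched as below, with naive stand-ins for the trained components (a regex sentence splitter, a toy capitalized-run NP spotter in place of tagging and NE extraction, and co-occurrence edges in place of dependency-derived fragments); all function names here are hypothetical:

```python
import re
from itertools import combinations

def split_sentences(text):
    # Naive sentence-boundary detection (stand-in for a trained detector):
    # split after ., !, or ? followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def noun_phrases(sentence):
    # Toy NP spotter: maximal runs of capitalized tokens. A real pipeline
    # would PoS-tag, chunk, and run named-entity extraction instead.
    tokens = re.findall(r"[A-Za-z']+", sentence)
    nps, run = [], []
    for tok in tokens:
        if tok[0].isupper():
            run.append(tok)
        else:
            if run:
                nps.append(' '.join(run))
            run = []
    if run:
        nps.append(' '.join(run))
    return nps

def story_graph(text):
    # Merge per-sentence fragments into one graph per story. Here a
    # fragment just links co-occurring NPs; the ideal pipeline would
    # derive edges from parser dependencies instead.
    nodes, edges = set(), set()
    for sent in split_sentences(text):
        nps = noun_phrases(sent)
        nodes.update(nps)
        for a, b in combinations(nps, 2):
            if a != b:
                edges.add(frozenset((a, b)))
    return nodes, edges
```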

The graph similarity measure
Generate the Cook-Holder edit distance between the two graphs, then normalize:
graph_sim(g1, g2) = 1 - norm(CHed(g1, g2) / max(|g1|, |g2|))
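
A minimal sketch of the normalized similarity, assuming a crude set-difference stand-in for the Cook-Holder edit distance (the real measure also charges for relabelings); graphs are represented as (nodes, edges) pairs and all names are hypothetical:

```python
def graph_size(g):
    # |g| = number of nodes plus number of edges.
    nodes, edges = g
    return len(nodes) + len(edges)

def edit_distance(g1, g2):
    # Crude stand-in for the Cook-Holder edit distance: count node and
    # edge insertions/deletions via symmetric set difference.
    (n1, e1), (n2, e2) = g1, g2
    return len(n1 ^ n2) + len(e1 ^ e2)

def graph_sim(g1, g2):
    # 1 - CHed / max(|g1|, |g2|), clamped to [0, 1] since this crude
    # distance can exceed the larger graph's size.
    denom = max(graph_size(g1), graph_size(g2))
    if denom == 0:
        return 1.0
    return max(0.0, 1.0 - edit_distance(g1, g2) / denom)
```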

Reality sets in
- MT text doesn’t parse worth a …
- ASR text rarely has clean sentence boundaries
- Off-the-shelf parsers aren’t trained for speech grammars
- Hence ASR text doesn’t parse worth a …

Regrouping
- Sentence-boundary detect newswire sources
- Approximate sentence boundaries in speech with pauses longer than a certain threshold
- Skip the parse
- Generate graph fragments using a window of neighboring NPs (the submitted run uses the current NP and the two downstream NPs)
- This clearly misses syntactically close but lexically distant NP connections…
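
The windowing hack might look like the following sketch, where width=2 corresponds to the submitted run's current NP plus two downstream NPs (the function name is hypothetical):

```python
def window_fragments(nps, width=2):
    # Link each NP to the next `width` NPs in document order, replacing
    # the dependency-derived fragments the ideal pipeline would produce.
    edges = set()
    for i, np in enumerate(nps):
        for nxt in nps[i + 1 : i + 1 + width]:
            if nxt != np:
                edges.add(frozenset((np, nxt)))
    return edges
```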

Contrastive Runs
- Cosine similarity of document term vectors
- Cosine similarity of document phrase vectors
- A strawman edit distance: construct a single string for each document, comprised of the concatenation of its alphabetized NPs
- If the graph scheme doesn’t outperform this, it’s probably not worth pursuing…
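
The baselines can be sketched roughly as follows: unweighted cosine over raw term frequencies (the actual runs may have applied tf-idf weighting), and a stock string matcher as a stand-in for the strawman's true string edit distance; all names are hypothetical:

```python
import math
from collections import Counter
from difflib import SequenceMatcher

def cosine_sim(terms1, terms2):
    # Cosine similarity over raw term-frequency vectors.
    v1, v2 = Counter(terms1), Counter(terms2)
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def strawman_sim(nps1, nps2):
    # Strawman: one string per document, the concatenation of its
    # alphabetized NPs, compared with difflib's ratio (a stand-in for
    # a proper string edit distance).
    s1, s2 = ' '.join(sorted(nps1)), ' '.join(sorted(nps2))
    return SequenceMatcher(None, s1, s2).ratio()
```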

Official Results

Run     Scheme  P(Miss)  P(FA)   Norm Clink
UIowa1  Graph   0.7234   0.0018  0.7320
UIowa2  Edit    0.7308   0.0668  1.0582
UIowa3  Phrase  0.6971   0.0014  0.6984
UIowa4  Word    0.6851   0.0004  0.6871

Word Performance

Phrase Performance

Edit Distance Performance

Graph Similarity Performance

Word/Phrase Costs

Word/Edit Costs

Word/Graph Costs

Graph/Edit Costs

Conclusions
- Definitely signal present in the graph similarity scheme
- More tuning needed:
  Official run Clink: 0.0146 / Actual minimum Clink: 0.0118
  Official run P(Miss): 0.7234 / Actual minimum Clink P(Miss): 0.4951

Conclusions, cont’d.
- Revisit the graph-formation hack
- Hybrid scheme:
  Use the ideal scheme for newswires
  Use the hack for broadcasts
- Alternatively:
  Aggressively segment ASR, resulting in smaller fragments
  Parse everything
  Note that we don’t need full sentence structure, only good clausal structure