WHIRL – Reasoning with IE output

Similar presentations
Arnd Christian König Venkatesh Ganti Rares Vernica Microsoft Research Entity Categorization Over Large Document Collections.

Chapter 5: Introduction to Information Retrieval
Large-Scale Entity-Based Online Social Network Profile Linkage.
A Comparison of String Matching Distance Metrics for Name-Matching Tasks William Cohen, Pradeep RaviKumar, Stephen Fienberg.
Maurice Hermans.  Ontologies  Ontology Mapping  Research Question  String Similarities  Winkler Extension  Proposed Extension  Evaluation  Results.
Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and Data Integration Methods William W. Cohen, Sunita Sarawagi.
Data Quality Class 10. Agenda Review of Last week Cleansing Applications Guest Speaker.
1 I256: Applied Natural Language Processing Marti Hearst Nov 15, 2006.
Ang Sun Ralph Grishman Wei Xu Bonan Min November 15, 2011 TAC 2011 Workshop Gaithersburg, Maryland USA.
Information Retrieval Review
Evaluation.  Allan, Ballesteros, Croft, and/or Turtle Types of Evaluation Might evaluate several aspects Evaluation generally comparative –System A vs.
Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity: William W. Cohen Machine Learning Dept. and Language.
MANISHA VERMA, VASUDEVA VARMA PATENT SEARCH USING IPC CLASSIFICATION VECTORS.
Information Extraction from the World Wide Web CSE 454 Based on Slides by William W. Cohen Carnegie Mellon University Andrew McCallum University of Massachusetts.
Evaluating the Performance of IR Sytems
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Introduction to Text Mining
Chapter 5: Information Retrieval and Web Search
Cornell note taking stimulates critical thinking skills. Note taking helps YOU remember what is said in class. A good set of notes can help you work on.
School of Engineering and Computer Science Victoria University of Wellington COMP423 Intelligent agents.
In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest.
Distance functions and IE -2 William W. Cohen CALD.
Similarity Joins for Strings and Sets William Cohen.
Information Extraction Yunyao Li EECS /SI /29/2006.
MINING RELATED QUERIES FROM SEARCH ENGINE QUERY LOGS Xiaodong Shi and Christopher C. Yang Definitions: Query Record: A query record represents the submission.
Evaluation Experiments and Experience from the Perspective of Interactive Information Retrieval Ross Wilkinson Mingfang Wu ICT Centre CSIRO, Australia.
IR Evaluation Evaluate what? –user satisfaction on specific task –speed –presentation (interface) issue –etc. My focus today: –comparative performance.
December 2005CSA3180: Information Extraction I1 CSA2050: Natural Language Processing Information Extraction Named Entities IE Systems MUC Finite State.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.
Types of Extraction. Wrappers 2 IE from Text 3 AttributeWalmart ProductVendor Product Product NameCHAMP Bluetooth Survival Solar Multi- Function Skybox.
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
WHIRL – summary of results. WHIRL project ( ) WHIRL initiated when at AT&T Bell Labs AT&T Research AT&T Labs - Research AT&T.
Some Work on Information Extraction at IRL Ganesh Ramakrishnan IBM India Research Lab.
Presenter: Shanshan Lu 03/04/2010
Blocking. Basic idea: – heuristically find candidate pairs that are likely to be similar – only compare candidates, not all pairs Variant 1: – pick some.
Distance functions and IE – 5 William W. Cohen CALD.
India Research Lab © Copyright IBM Corporation 2006 Entity Annotation using operations on the Inverted Index Ganesh Ramakrishnan, with Sreeram Balakrishnan.
Distance functions and IE – 4? William W. Cohen CALD.
Language Model in Turkish IR Melih Kandemir F. Melih Özbekoğlu Can Şardan Ömer S. Uğurlu.
Classifying Semantic Relations in Bioscience Texts Barbara Rosario Marti Hearst SIMS, UC Berkeley Supported by NSF DBI
1 Evaluating High Accuracy Retrieval Techniques Chirag Shah,W. Bruce Croft Center for Intelligent Information Retrieval Department of Computer Science.
Introduction to “Event Extraction” Jan 18, What is “Information Extraction” Filling slots in a database from sub-segments of text. As a task: October.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
More announcements Unofficial auditors: send to Sharon Woodside to make sure you get any late-breaking announcements. Project: –Already.
Record Linkage and Disclosure Limitation William W. Cohen, CALD Steve Fienberg, Statistics, CALD & C3S Pradeep Ravikumar, CALD.
Data Acquisition. Get all data necessary for the analysis task at hand Some data comes from inside the company –Need to go and talk with various data.
Distance functions and IE - 3 William W. Cohen CALD.
© NCSR, Frascati, July 18-19, 2002 CROSSMARC big picture Domain-specific Web sites Domain-specific Spidering Domain Ontology XHTML pages WEB Focused Crawling.
Why indexing? For efficient searching of a document
Queensland University of Technology
Text Based Information Retrieval
David Shepherd, Zachary P. Fry, Emily Hill, Lori Pollock, and K
Software Documentation
Issue Tracking Systems
Location Recommendation — for Out-of-Town Users in Location-Based Social Network Yina Meng.
Semi-supervised Information Extraction
Software Requirements Specification Document
WHIRL – Reasoning with IE output
Q4 Measuring Effectiveness
Towards a Personal Briefing Assistant
[jws13] Evaluation of instance matching tools: The experience of OAEI
CSE 635 Multimedia Information Retrieval
SUCCESSFUL TEXTBOOK READING AND NOTE TAKING
Introduction to Information Retrieval
CS246: Information Retrieval
Database Management Systems
Introduction to Search Engines
Presentation transcript:

WHIRL – Reasoning with IE output 11/3/10

Announcements
Next week: mid-term progress reports on project
– Talks Mon, Wed
– Written 2-page status update due Wed midnight
– Don't get stressed about format
Things to talk about:
– Problem and approach
– Related work
– Dataset characteristics, baseline performance
– Your experiences so far: what's been hard

What is "Information Extraction"? As a task: filling slots in a database from sub-segments of text.

Example article (October 14, 2002, 4:00 a.m. PT):

  For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

  Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels -- the coveted code behind the Windows operating system -- to select customers.

  "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

  Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted by IE:

  NAME               TITLE     ORGANIZATION
  Bill Gates         CEO       Microsoft
  Bill Veghte        VP        Microsoft
  Richard Stallman   founder   Free Soft..

[diagram: text → IE → database, consumed by QA for an end user]

What is "Information Extraction"? As a task: answering questions from a user using information in text. Is building a conventional DB a necessary subgoal? When can you answer questions without one? (Same example article and extracted table as the previous slide.)

Deduction via co-operation. Economic issues:
– Who pays for integration?
– Who tracks errors & inconsistencies?
– Who fixes bugs?
– Who pushes for clarity in underlying concepts and object identifiers?
Standards approach: publishers are responsible, so publishers pay.
Mediator approach: a 3rd party does the work, agnostic as to cost.
[diagram: Site1/Site2/Site3 with KB1/KB2/KB3 feeding an Integrated KB through a standard terminology; the user queries the integrated KB]

WHIRL approach. Query Q:
  SELECT R.a, S.a, S.b, T.b FROM R, S, T
  WHERE R.a~S.a AND S.b~T.b   (~ means TFIDF-similar)
Link items as needed by Q: incrementally produce a ranked list of possible links, with "best matches" first. The user (or a downstream process) decides how much of the list to generate and examine. A runnable sketch of this soft join follows.
[table: example linked rows under columns R.a, S.a, S.b, T.b, pairing near-duplicate names such as "William Cohen" / "Will Cohn", "Steve Minton" / "Steven Mitton", "William Cohen" / "David Cohn", "Anhai Doan" / "Dan Weld"]
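
A minimal runnable sketch of this idea in Python: score candidate pairs by TF-IDF cosine similarity and return them best-first. The tokenizer, tables, and names are illustrative assumptions, not WHIRL's actual implementation.

    import math
    import re
    from collections import Counter

    def tokens(s):
        return re.findall(r"[a-z0-9]+", s.lower())

    def tfidf_vectors(strings):
        # unit-normalized TF-IDF vector for each string
        docs = [Counter(tokens(s)) for s in strings]
        n = len(docs)
        df = Counter(t for d in docs for t in set(d))
        vecs = []
        for d in docs:
            v = {t: math.log(tf + 1) * math.log(n / df[t]) for t, tf in d.items()}
            norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
            vecs.append({t: w / norm for t, w in v.items()})
        return vecs

    def soft_join(left, right):
        # return (similarity, left item, right item) triples, best matches first
        vecs = tfidf_vectors(left + right)
        lv, rv = vecs[:len(left)], vecs[len(left):]
        scored = []
        for i, l in enumerate(left):
            for j, r in enumerate(right):
                sim = sum(w * rv[j].get(t, 0.0) for t, w in lv[i].items())
                scored.append((sim, l, r))
        return sorted(scored, reverse=True)

    R = ["Hitchhiker's Guide to the Galaxy", "Cinderella Man"]
    S = ["The Hitchhiker's Guide to the Galaxy, 2005",
         "Men in Black, 1997", "Space Balls, 1987"]
    for score, l, r in soft_join(R, S)[:3]:
        print(f"{score:.3f}  {l} ~ {r}")
    # best match: the two Hitchhiker titles; unrelated pairs score 0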

WHIRL queries. "Find reviews of sci-fi comedies" [movie domain]:
  SELECT * FROM review AS r
  WHERE r.text~'sci fi comedy'
(like standard ranked retrieval of "sci-fi comedy")
"Where is [that sci-fi comedy] playing?":
  SELECT * FROM review AS r, listing AS s
  WHERE r.title~s.title AND r.text~'sci fi comedy'
(best answers: titles that are similar to each other – e.g., "Hitchhiker's Guide to the Galaxy" and "The Hitchhiker's Guide to the Galaxy, 2005" – and review text similar to "sci-fi comedy")
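
The first query above is ranked retrieval in disguise: a soft join of the review table against a one-row "query" relation. Reusing the soft_join sketch from the previous slide (the review texts are illustrative assumptions):

    reviews = ["The Hitchhiker's Guide to the Galaxy (2005) is a goofy sci fi comedy ...",
               "Cinderella Man (2005) is a boxing drama ..."]
    for score, q, r in soft_join(["sci fi comedy"], reviews):
        print(f"{score:.3f}  {r[:45]}")
    # the sci-fi comedy review ranks first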

WHIRL queries. Similarity is based on TFIDF: rare words are most important. Search for high-ranking answers uses inverted indices:
– it is easy to find the (few) items that match on "important" terms
– search for strong matches can prune "unimportant" terms
[example: review titles "Star Wars Episode III", "Hitchhiker's Guide to the Galaxy", "Cinderella Man", … joined against listings "The Hitchhiker's Guide to the Galaxy, 2005", "Men in Black, 1997", "Space Balls, 1987", …; years are common in the review archive, so they have low weight]
Inverted index excerpt:
  hitchhiker → movie00137
  the → movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, …
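
A minimal sketch of the pruning idea: with an inverted index, only listings that share a high-IDF ("important") token with the query are ever scored. The corpus and the IDF threshold are illustrative assumptions.

    import math
    from collections import defaultdict

    listings = {
        "movie001": "the hitchhiker s guide to the galaxy 2005",
        "movie002": "the men in black 1997",
        "movie003": "the space balls 1987",
    }

    index = defaultdict(set)          # token -> ids of listings containing it
    for doc_id, text in listings.items():
        for tok in text.split():
            index[tok].add(doc_id)

    n = len(listings)
    idf = {tok: math.log(n / len(ids)) for tok, ids in index.items()}

    def candidates(query, min_idf=0.5):
        # return only the listings sharing an important query token
        cands = set()
        for tok in query.split():
            if idf.get(tok, 0.0) >= min_idf:   # prune common, low-weight tokens
                cands |= index.get(tok, set())
        return cands

    print(candidates("hitchhiker s guide to the galaxy"))
    # only movie001 survives; "the" appears in every listing (idf 0) and is pruned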

Information integration: Outline
– Information integration: some history
– The problem, the economics, and the economic problem
– "Soft" information integration
– Concrete uses of "soft" integration: classification, collaborative filtering, set expansion

Other string distances

Robust distance metrics for strings. Kinds of distances between s and t:
– Edit-distance based (Levenshtein, Smith-Waterman, …): distance is the cost of the cheapest sequence of edits that transforms s into t.
– Term-based (TFIDF, Jaccard, Dice, …): distance based on the sets of words in s and t, usually weighting "important" words more heavily.
Which methods work best when? (A sketch of the two families follows.)
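
A minimal sketch of the two families, assuming unit edit costs and unweighted tokens (real systems use tuned costs and TF-IDF weighting):

    def levenshtein(s, t):
        # edit-distance family: cheapest sequence of insert/delete/substitute
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            cur = [i]
            for j, ct in enumerate(t, 1):
                cur.append(min(prev[j] + 1,                 # delete
                               cur[j - 1] + 1,              # insert
                               prev[j - 1] + (cs != ct)))   # substitute
            prev = cur
        return prev[-1]

    def jaccard(s, t):
        # term-based family: overlap of the word sets of s and t
        a, b = set(s.lower().split()), set(t.lower().split())
        return len(a & b) / len(a | b) if a | b else 1.0

    print(levenshtein("Steve Minton", "Steven Mitton"))   # 2 edits
    print(jaccard("William W Cohen", "Cohen William"))    # 2 of 3 tokens shared -> 0.67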

Robust distance metrics for strings. SecondString (Cohen, Ravikumar, Fienberg, IIWeb 2003):
– Java toolkit of string-matching methods from the AI, statistics, IR, and DB communities
– tools for evaluating performance on test data
– used to experimentally compare a number of metrics

Results: Edit-distance variants. Monge-Elkan (a carefully tuned Smith-Waterman variant) is the best on average across the benchmark datasets.
[figure: 11-pt interpolated recall/precision curves averaged across 11 benchmark problems]

Results: Edit-distance variants. But Monge-Elkan is sometimes outperformed on specific datasets.
[figure: precision-recall for Monge-Elkan and one other method (Levenshtein) on a specific benchmark]

SoftTFIDF: a robust distance metric. We also compared edit-distance based and term-based methods, and evaluated a new "hybrid" method. SoftTFIDF, for token sets S and T, extends TFIDF by including pairs of words in S and T that "almost" match, i.e. that are highly similar according to a second distance metric (the Jaro-Winkler metric, an edit-distance-like metric). A sketch follows.
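
A minimal sketch of a SoftTFIDF-style score. It reuses tfidf_vectors from the soft-join sketch above, and substitutes difflib's similarity ratio for Jaro-Winkler; both choices and the threshold are simplifying assumptions, not the SecondString implementation.

    import difflib

    def sim2(w, v):
        # stand-in for Jaro-Winkler: a character-level similarity in [0, 1]
        return difflib.SequenceMatcher(None, w, v).ratio()

    def soft_tfidf(s, t, theta=0.8):
        vs, vt = tfidf_vectors([s, t])
        score = 0.0
        for w, weight_w in vs.items():
            # find the token of t "closest" to w; credit it only above theta
            v, best = max(((v, sim2(w, v)) for v in vt), key=lambda p: p[1])
            if best >= theta:
                score += weight_w * vt[v] * best
        return score

    print(round(soft_tfidf("william w cohen", "willliam cohon"), 3))
    # plain TFIDF scores this pair 0.0; SoftTFIDF credits the two near-matches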

Comparing token-based, edit-distance, and hybrid distance metrics. (SFS is a vanilla IDF weight on each token – circa 1959!)

SoftTFIDF is a Robust Distance Metric

Cohen, Kautz & McAllester paper [KDD 2000]

Definitions. S, H are sets of tuples over "references": "B. Selman"1, "William W. Cohen"34, "B Selman"2, … Ipot is a weighted set of "possible" arcs; I is a subset of Ipot. Given a reference r, follow a chain of arcs to get the "final interpretation" of r: "B. Selman"1 → "Bart Selman"22 → … → "B. Selman"27.

Goal. Given S and Ipot, find the interpretation I that minimizes a cost combining the total weight of the arcs in I and the number of tuples in the hard DB H = I(S). Idea: this is roughly finding the MAP hard database behind S. Arcs correspond to errors/abbreviations; chains of transformations correspond to errors that propagate via copying.
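
Written out, the objective has this shape (the trade-off constants below are illustrative assumptions, not the paper's exact formulation):

    \mathrm{cost}(I) \;=\; c_{\mathrm{arc}} \sum_{a \in I} w(a) \;+\; c_{\mathrm{tuple}} \, \bigl| I(S) \bigr|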

Facts about hardening. This cost corresponds to a very simple generative model for a database:
– generate tuples in H one by one
– generate arcs I in Ipot one by one
– generate tuples in S one by one (given H and I)
A greedy method makes sense: "easy" merges can lower the cost of later "hard" merges (a sketch follows). Hardening is hard: NP-hard even under severe restrictions, because the choices of what to merge where are all interconnected.
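
A minimal sketch of the greedy idea: repeatedly apply the candidate merge (arc) that most reduces the cost above. The data, constants, and chain-following representation are illustrative assumptions, not the KDD 2000 algorithm itself.

    C_ARC, C_TUPLE = 1.0, 1.0

    soft_tuples = [("B. Selman", "Cornell"), ("Bart Selman", "Cornell"),
                   ("B Selman", "Cornell")]
    # potential arcs: (from_reference, to_reference, weight)
    pot_arcs = [("B. Selman", "Bart Selman", 0.3),
                ("B Selman", "Bart Selman", 0.4),
                ("B Selman", "B. Selman", 0.9)]

    def interpret(ref, arcs):
        # follow a chain of merges to the final interpretation of ref
        seen = set()
        while ref in arcs and ref not in seen:
            seen.add(ref)
            ref = arcs[ref]
        return ref

    def cost(arcs, arc_weight):
        hard = {tuple(interpret(r, arcs) for r in t) for t in soft_tuples}
        return C_ARC * arc_weight + C_TUPLE * len(hard)

    arcs, weight = {}, 0.0
    while True:
        best = None
        for src, dst, w in pot_arcs:
            if src not in arcs:
                trial = dict(arcs)
                trial[src] = dst
                c = cost(trial, weight + w)
                if best is None or c < best[0]:
                    best = (c, src, dst, w)
        if best is None or best[0] >= cost(arcs, weight):
            break                             # no remaining merge reduces cost
        _, src, dst, w = best
        arcs[src] = dst
        weight += w

    print(arcs)   # both "B. Selman" variants merge into "Bart Selman"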

[figure: a worked hardening example. Arcs: "B.selman" → "Bart Selman"; "Critical …in …" → "Critical .. for ..". Resulting hard tuples: affil("Bert Sealmann"3, "Cornell"3), author("Bert Sealmann"3, "BLACKBOX: … problem solving"3)]