WHIRL – Reasoning with IE output


WHIRL – Reasoning with IE output 11/1/10

Announcements
- Next week: mid-term progress reports on project
  - Talks Mon, Wed
  - Written 2-page status update Wed midnight
  - Don’t get stressed about format
- Things to talk about:
  - Problem and approach
  - Related work
  - Dataset characteristics, baseline performance
  - Your experiences so far: what’s been hard

What is “Information Extraction”
As a family of techniques: Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments: Microsoft Corporation; CEO; Bill Gates; Microsoft; Gates; Bill Veghte; VP; Richard Stallman; founder; Free Software Foundation

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP
Richard Stallman  founder  Free Soft..

What is “Information Extraction”
As a task: Filling slots in a database from sub-segments of text.

Input text: the October 14, 2002 news excerpt shown above.

[Diagram: text -> IE -> database -> QA -> End User]

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

What is “Information Extraction”
As a task: Answering questions from a user using information in text
- Is building a conventional DB a necessary subgoal?
- When can you answer questions without one?

Input text: the October 14, 2002 news excerpt shown above.

[Diagram: text -> IE -> database -> QA -> End User]

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

When are two entities the same?
- Bell Labs
- Bell Telephone Labs
- AT&T Bell Labs
- A&T Labs
- AT&T Labs—Research
- AT&T Labs Research, Shannon Laboratory
- Shannon Labs
- Bell Labs Innovations
- Lucent Technologies/Bell Labs Innovations
[1925]

History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers…. [www.research.att.com]

Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925… [bell-labs.com]

In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go--once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again: The first hint Mr. Slippery had that his own True Name might be known--and, for that matter, known to the Great Enemy--came with the appearance of two black Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.… This had been, of course, Roger Pollack's great fear. They had discovered Mr. Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861. To do

Deduction via co-operation
Economic issues:
- Who pays for integration? Who tracks errors & inconsistencies? Who fixes bugs?
- Who pushes for clarity in underlying concepts and object identifiers?
- Standards approach -> publishers are responsible -> publishers pay
- Mediator approach: 3rd party does the work, agnostic as to cost
[Diagram: User; Integrated KB; Site1, Site2, Site3 with KB1, KB2, KB3; Standard Terminology]

Traditional approach
[Diagram: Linkage, Queries]
Uncertainty about what to link must be decided by the integration system, not the end user.

WHIRL approach
Query Q:
  SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a=S.a and S.b=T.b
Link items as needed by Q (columns R.a, S.a, S.b, T.b):
- Strongest links, those agreeable to most users: Anhai Doan; Dan Weld
- Weaker links, those agreeable to some users: William / Will, Cohen / Cohn; Steve / Steven, Minton / Mitton
- Even weaker links…: William / David, Cohen / Cohn

WHIRL approach
Query Q:
  SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a~S.a and S.b~T.b (~ TFIDF-similar)
Link items as needed by Q (columns R.a, S.a, S.b, T.b):
- Anhai Doan; Dan Weld
- William / Will, Cohen / Cohn; Steve / Steven, Minton / Mitton
- William / David, Cohen / Cohn
Incrementally produce a ranked list of possible links, with “best matches” first. User (or downstream process) decides how much of the list to generate and examine.
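The ranked-list idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the tokenizer, the weighting log(1+tf) * log(1+N/df), and the names `tokens`, `tfidf_vectors` and `soft_join` are made up here, and the sketch scores every pair by brute force, which is not how WHIRL itself searches.

```python
import math
import re
from collections import Counter

def tokens(s):
    # Assumed toy tokenizer: lowercase alphanumeric runs (apostrophes kept).
    return re.findall(r"[a-z0-9']+", s.lower())

def tfidf_vectors(strings):
    """Return one unit-length TFIDF vector (a dict) per input string."""
    docs = [tokens(s) for s in strings]
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    vecs = []
    for d in docs:
        tf = Counter(d)
        v = {t: math.log(1 + tf[t]) * math.log(1 + n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def soft_join(left, right):
    """Score all pairs by TFIDF cosine similarity, best matches first."""
    vecs = tfidf_vectors(left + right)  # document frequencies over both columns
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = []
    for a, x in zip(left, lv):
        for b, y in zip(right, rv):
            score = sum(w * y[t] for t, w in x.items() if t in y)
            if score > 0:
                pairs.append((score, a, b))
    return sorted(pairs, reverse=True)

reviews = ["The Hitchhiker's Guide to the Galaxy, 2005",
           "Men in Black, 1997", "Space Balls, 1987"]
listings = ["Star Wars Episode III", "Hitchhiker's Guide to the Galaxy",
            "Cinderella Man"]
ranked = soft_join(reviews, listings)
```

A consumer can read `ranked` from the top and stop whenever the scores become too low to be useful, which mirrors the "user decides how much of the list to examine" point.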

WHIRL queries
Assume two relations:
- review(movieTitle,reviewText): archive of reviews
- listing(theatre, movieTitle, showTimes, …): now showing

review:
  The Hitchhiker’s Guide to the Galaxy, 2005 | This is a faithful re-creation of the original radio series – not surprisingly, as Adams wrote the screenplay ….
  Men in Black, 1997 | Will Smith does an excellent job in this …
  Space Balls, 1987 | Only a die-hard Mel Brooks fan could claim to enjoy …
  …

listing (shown as title, theatre, show times):
  Star Wars Episode III | The Senator Theater | 1:00, 4:15, & 7:30pm.
  Cinderella Man | The Rotunda Cinema | 1:00, 4:30, & 7:30pm.
  …

WHIRL queries
- “Find reviews of sci-fi comedies” [movie domain]
    FROM review as r SELECT * WHERE r.text~’sci fi comedy’
  (like standard ranked retrieval of “sci-fi comedy”)
- “Where is [that sci-fi comedy] playing?”
    FROM review as r, listing as s SELECT * WHERE r.title~s.title and r.text~’sci fi comedy’
  (best answers: titles are similar to each other, e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005”, and the review text is similar to “sci-fi comedy”)

WHIRL queries
- Similarity is based on TFIDF: rare words are most important.
- Search for high-ranking answers uses inverted indices:
  - It is easy to find the (few) items that match on “important” terms
  - Search for strong matches can prune “unimportant terms”

Listing titles: Star Wars Episode III; Hitchhiker’s Guide to the Galaxy; Cinderella Man; …
Review titles: The Hitchhiker’s Guide to the Galaxy, 2005; Men in Black, 1997; Space Balls, 1987; …
Years are common in the review archive, so have low weight.

Inverted index:
  hitchhiker -> movie00137
  the -> movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, …
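To make the role of the inverted index concrete, here is a small sketch that reuses the two posting lists from the slide (the list for "the" is cut short). The archive size `N_ROWS` and the helper names `idf` and `candidates` are assumptions for illustration only.

```python
import math

# Posting lists copied from the slide; the list for "the" is truncated.
index = {
    "hitchhiker": ["movie00137"],
    "the": ["movie001", "movie003", "movie007", "movie008", "movie013"],
}
N_ROWS = 1000  # hypothetical number of rows in the review archive

def idf(term):
    """Rare terms (short posting lists) get a high weight."""
    return math.log(N_ROWS / len(index[term]))

def candidates(query_terms):
    """Fetch candidates through the most 'important' indexed query term."""
    known = [t for t in query_terms if t in index]
    if not known:
        return []
    best = max(known, key=idf)
    return index[best]

hits = candidates(["the", "hitchhiker", "guide"])
```

Looking up "hitchhiker" touches one row, whereas looking up "the" would touch many rows and contribute little to the score, which is why a search for strong matches can afford to ignore it.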

Inference in WHIRL
- “Best-first” search: pick state s that is “best” according to f(s)
- Suppose graph is a tree, and for all s, s’, if s’ is reachable from s then f(s)>=f(s’).
- Then A* outputs the globally best goal state s* first, and then next best, ...
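The ordering property can be seen on a toy tree. The sketch below is a generic best-first search with a priority queue, not WHIRL's code; the tree, the scores and the function name `best_first` are invented for illustration, and the scores are chosen so that f never increases from a state to its descendants.

```python
import heapq
from itertools import count

def best_first(start, children, is_goal, f):
    """Yield goal states in non-increasing order of f.

    Assumes the search graph is a tree and f(s) >= f(s') whenever s' is
    reachable from s, as stated on the slide.
    """
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(-f(start), next(tie), start)]
    while frontier:
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            yield state
            continue
        for child in children(state):
            heapq.heappush(frontier, (-f(child), next(tie), child))

# Hypothetical toy tree: leaves are goal states.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
score = {"root": 1.0, "a": 0.9, "b": 0.6, "a1": 0.8, "a2": 0.3, "b1": 0.5}

answers = list(best_first(
    "root",
    children=lambda s: tree.get(s, []),
    is_goal=lambda s: s not in tree,
    f=lambda s: score[s],
))
```

Because `best_first` is a generator, a caller can take only the first few answers and never expand the rest of the tree.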

Inference in WHIRL
- Explode p(X1,X2,X3): find all DB tuples <p,a1,a2,a3> for p and bind Xi to ai.
- Constrain X~Y: if X is bound to a and Y is unbound,
  - find DB column C to which Y should be bound
  - pick a term t in X, find proper inverted index for t in C, and bind Y to something in that index
  - Keep track of t’s used previously, and don’t allow Y to contain one.
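One way to read the "constrain" step is sketched below. It is an interpretation under assumptions, not the WHIRL implementation: the data layout (a column as a list of token lists, an index from term to row ids), the weight function, and the name `constrain` are all hypothetical.

```python
def constrain(x_terms, column, index, weight, excluded=frozenset()):
    """One 'constrain' step for X~Y with X bound and Y unbound.

    Picks an unused term t of X, looks it up in the inverted index of
    column C, and returns (t, candidate rows for Y, updated excluded set).
    Rows containing a previously used term are rejected, so the same
    binding is never generated twice.
    """
    usable = [t for t in x_terms if t not in excluded and t in index]
    if not usable:
        return None
    t = max(usable, key=weight)
    bindings = [row for row in sorted(index[t])
                if not excluded & set(column[row])]
    return t, bindings, excluded | {t}

# Toy column C: the listing titles from the slides, tokenized by hand.
column = [
    ["star", "wars", "episode", "iii"],
    ["hitchhiker's", "guide", "to", "the", "galaxy"],
    ["cinderella", "man"],
]
index = {}
for row, terms in enumerate(column):
    for term in terms:
        index.setdefault(term, set()).add(row)

# X is bound to the review title; terms are sorted for determinism.
x_terms = sorted({"the", "hitchhiker's", "guide", "to", "galaxy", "2005"})
rarity = lambda t: 1.0 / len(index[t])  # assumed stand-in for a TFIDF weight

step1 = constrain(x_terms, column, index, rarity)
step2 = constrain(x_terms, column, index, rarity, excluded=step1[2])
```

The second step finds no new rows because the only row containing its term also contains the term used in the first step, so that binding has already been produced.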

Inference in WHIRL

Information integration: Outline
- Information integration:
  - Some history
  - The problem, the economics, and the economic problem
- “Soft” information integration
- Concrete uses of “soft” integration
  - Classification
  - Collaborative filtering
  - Set expansion

Stopped about here….

Other string distances

Robust distance metrics for strings
Kinds of distances between s and t:
- Edit-distance based (Levenshtein, Smith-Waterman, …): distance is cost of cheapest sequence of edits that transform s to t.
- Term-based (TFIDF, Jaccard, DICE, …): distance based on set of words in s and t, usually weighting “important” words
Which methods work best when?
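The two families behave differently on the same strings, as a small sketch shows. It implements plain unit-cost Levenshtein distance and unweighted Jaccard similarity over whitespace tokens; the tokenization and function names are choices made for this illustration.

```python
def levenshtein(s, t):
    """Unit-cost edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]

def jaccard(s, t):
    """Unweighted Jaccard similarity of the two word sets."""
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

typo = levenshtein("Cohen", "Cohn")                   # one dropped letter
reorder_edit = levenshtein("Bell Labs AT&T", "AT&T Bell Labs")
reorder_term = jaccard("Bell Labs AT&T", "AT&T Bell Labs")
subset_term = jaccard("AT&T Bell Labs", "Bell Labs")
```

Edit distance treats a one-letter typo as a near match but penalizes word reordering heavily, while the term-based measure ignores word order entirely and sees the misspelled token as a completely different word.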

Robust distance metrics for strings
SecondString (Cohen, Ravikumar, Fienberg, IIWeb 2003):
- Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
- Tools for evaluating performance on test data
- Used to experimentally compare a number of metrics

Results: Edit-distance variants
Monge-Elkan (a carefully-tuned Smith-Waterman variant) is the best on average across the benchmark datasets…
[Figure: 11-pt interpolated recall/precision curves averaged across 11 benchmark problems]

Results: Edit-distance variants
But Monge-Elkan is sometimes outperformed on specific datasets
[Figure: Precision-recall for Monge-Elkan and one other method (Levenshtein) on a specific benchmark]

SoftTFIDF: A robust distance metric
We also compared edit-distance based and term-based methods, and evaluated a new “hybrid” method:
SoftTFIDF, for token sets S and T: extends TFIDF by including pairs of words in S and T that “almost” match, i.e., that are highly similar according to a second distance metric (the Jaro-Winkler metric, an edit-distance-like metric).
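A rough sketch of the hybrid idea follows. Several things here are assumptions rather than the published definition: `difflib.SequenceMatcher.ratio()` is swapped in for Jaro-Winkler as the secondary similarity, the threshold `theta=0.8` is arbitrary, and the weighting and the names `char_sim`, `weights` and `soft_tfidf` are illustrative.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

def char_sim(a, b):
    # Stand-in secondary similarity; the slides use Jaro-Winkler here.
    return SequenceMatcher(None, a, b).ratio()

def weights(tokens, df, n_docs):
    """Unit-length TFIDF weights; unseen tokens get document frequency 1."""
    tf = Counter(tokens)
    v = {w: math.log(1 + tf[w]) * math.log(1 + n_docs / df.get(w, 1))
         for w in tf}
    norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
    return {w: x / norm for w, x in v.items()}

def soft_tfidf(s, t, df=None, n_docs=1, theta=0.8):
    """Sum weight products over tokens of s whose closest token in t
    is at least theta-similar, scaled by that similarity."""
    df = df or {}
    vs, vt = weights(s, df, n_docs), weights(t, df, n_docs)
    total = 0.0
    for w in vs:
        if not vt:
            break
        closest = max(vt, key=lambda v: char_sim(w, v))
        d = char_sim(w, closest)
        if d >= theta:
            total += vs[w] * vt[closest] * d
    return total

soft = soft_tfidf(["william", "cohen"], ["william", "cohn"])
exact_only = soft_tfidf(["william", "cohen"], ["william", "cohn"], theta=1.0)
```

With `theta=1.0` only exact token matches count, which reduces the score to ordinary TFIDF cosine similarity; lowering the threshold lets "cohen" and "cohn" contribute as well.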

Comparing token-based, edit-distance, and hybrid distance metrics SFS is a vanilla IDF weight on each token (circa 1959!)

SoftTFIDF is a Robust Distance Metric

Cohen, Kautz & McAllester paper

Definitions
- S, H are sets of tuples over “references”: “B. Selman1”, “William W. Cohen34”, “B Selman2”, …
- Ipot is a weighted set of “possible” arcs. I is a subset of Ipot.
- Given r, follow a chain of arcs to get the “final interpretation” of r: “B. Selman1” -> “Bart Selman22” -> … -> “B. Selman27”

Goal
- Given S and Ipot, find the I that minimizes:
  - Number of arcs
  - Total weight of all arcs
  - # tuples in hard DB H=I(S)
- Idea: ~= find MAP hard database behind S
  - Arcs correspond to errors/abbreviations….
  - Chains of transformations correspond to errors that propagate via copying
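The objective can be made concrete with a small sketch. The slide lists three quantities without saying how they are combined, so the weighted sum and its coefficients below are assumptions, as are the toy references, the arc weights, and the names `final_interpretation`, `harden` and `cost`.

```python
def final_interpretation(ref, arcs):
    """Follow a chain of arcs (dict: reference -> reference) to its end."""
    seen = set()
    while ref in arcs and ref not in seen:  # 'seen' guards against cycles
        seen.add(ref)
        ref = arcs[ref]
    return ref

def harden(soft_db, arcs):
    """H = I(S): rewrite every reference to its final interpretation."""
    return {tuple(final_interpretation(r, arcs) for r in tup)
            for tup in soft_db}

def cost(soft_db, arcs, arc_weight, c_arc=0.1, c_weight=1.0, c_tuple=1.0):
    """Assumed combination: weighted sum of the three listed quantities."""
    hard_db = harden(soft_db, arcs)
    return (c_arc * len(arcs)
            + c_weight * sum(arc_weight[a] for a in arcs.items())
            + c_tuple * len(hard_db))

# Hypothetical soft database with two references to the same fact.
S = {("B. Selman_1", "Cornell_1"), ("Bart Selman_2", "Cornell_2")}
I = {"B. Selman_1": "Bart Selman_2", "Cornell_1": "Cornell_2"}
w = {("B. Selman_1", "Bart Selman_2"): 0.2, ("Cornell_1", "Cornell_2"): 0.2}

merged_cost = cost(S, I, w)
unmerged_cost = cost(S, {}, w)
```

Under these made-up numbers the two cheap arcs pay for themselves by collapsing two tuples into one, so the merged interpretation has the lower cost.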

Facts about hardening
- This simplifies a very simple generative model for a database:
  - Generate tuples in H one by one
  - Generate arcs I in Ipot one by one
  - Generate tuples in S one by one (given H and I)
- Greedy method makes sense: “Easy” merges can lower the cost of later “hard” merges
- Hardening is hard: NP-hard even under severe restrictions, because the choices of what to merge where are all interconnected.

affil(“Bert Sealmann”3, “Cornell”3)
author(“Bert Sealmann”3, “BLACKBOX: … problem solving ”3)
“B.selman”, “Bart Selman”
“Critical …in …” -> “Critical .. For ..”