Databases & Information Retrieval
Maya Ramanath

Further Reading:
– Combining Database and Information-Retrieval Techniques for Knowledge Discovery. G. Weikum, G. Kasneci, M. Ramanath and F.M. Suchanek, CACM, April 2009
– DB & IR: Both Sides Now. G. Weikum, Keynote at SIGMOD 2007

DB and IR: Different Motivations
Both deal with large amounts of information, but…

              DB                             IR
Applications  online reservation, banking    libraries
Emphasis      data consistency, efficiency   result quality, user satisfaction
Data          structured records             unstructured text
Queries       precise                        interpretations vary
Results       exact match / all results      ranked / top-k results

Why Combine Now?
The applications drive the need
– The need to manage both structured and unstructured data in an integrated manner
Healthcare example
– Find young patients in central Europe who have been reported, in the last two weeks, to have symptoms of tropical virus diseases and an indication of anomalies.
Newspaper archives, product catalogues, etc.

Integrating DB & IR
[Figure: matrix of data type against query type]
Data: Structured data (relational) / Unstructured data (text)
Queries: Structured queries / boolean match results (SQL) / Unstructured queries / ranked results (keywords/top-k)
Labels in the figure:
– DB Systems
– IR Systems
– top-k processing, keyword search on graphs
– extracting entities and relationships, ranking for entities
– query processing for text search, effective query interfaces, ranking for structured data

Modules
1. Top-k Processing
2. Query Processing and Interfaces
3. Keyword Search on Graphs
4. Entity and Relationship Extraction
5. Ranking and Structured Data

1. Top-k Processing (1/2)
Structured data, with scores in multiple dimensions
Return the top-k “objects”

Car           Color
BMW X1        0.9
Honda City    0.8
Maruti Swift  0.6
Tata Nano     0.1

Car           Mileage
Honda City    0.8
Maruti Swift  0.6
Tata Nano     0.3
BMW X1        0.1

Car           Service
Tata Nano     0.7
Maruti Swift  0.6
Honda City    0.3
BMW X1        0.1

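One way to find the top-k cars from the three ranked lists without scoring every car up front is a threshold-style algorithm. The sketch below is only an illustration and not necessarily the method covered in this module. It assumes the overall score of a car is the sum of its three per-list scores, that every car appears in every list, and that random access to a car's score in any list is available. The name `threshold_topk` is hypothetical.

```python
import heapq

# Ranked lists from the slide, each sorted by descending score.
COLOR = [("BMW X1", 0.9), ("Honda City", 0.8), ("Maruti Swift", 0.6), ("Tata Nano", 0.1)]
MILEAGE = [("Honda City", 0.8), ("Maruti Swift", 0.6), ("Tata Nano", 0.3), ("BMW X1", 0.1)]
SERVICE = [("Tata Nano", 0.7), ("Maruti Swift", 0.6), ("Honda City", 0.3), ("BMW X1", 0.1)]


def threshold_topk(lists, k):
    """Return the k objects with the highest summed score (assumed aggregation)."""
    lookup = [dict(lst) for lst in lists]  # simulates random access per list
    totals = {}                            # full scores of objects seen so far
    for depth in range(max(len(lst) for lst in lists)):
        threshold = 0.0
        for lst in lists:
            if depth >= len(lst):
                continue
            obj, score = lst[depth]        # sorted access at the current depth
            threshold += score             # best score any unseen object could reach
            if obj not in totals:
                totals[obj] = sum(d.get(obj, 0.0) for d in lookup)
        best = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        # Stop early once the k-th best seen score cannot be beaten by unseen objects.
        if len(best) == k and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


top2 = threshold_topk([COLOR, MILEAGE, SERVICE], k=2)
print([car for car, _ in top2])
```

On this data the loop stops after reading three of the four rows in each list, because by then the second-best total already exceeds the threshold.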
1. Top-k Processing (2/2)
Top-k Joins
– Example: Return the best house-school pair

Houses  Rating  Location
H1      0.9     L1
H2      0.8     L2
H3      0.6     L3
H4      0.1     L3

Schools  Rating  Location
S1       0.4     L2
S2       0.2     L2
S3       0.8     L3
S4       0.1     L3

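The house-school example can be made concrete with a naive baseline. The sketch below assumes that a pair is valid only when house and school share a location and that a pair's score is the sum of the two ratings; neither assumption is stated on the slide. It materializes all joined pairs and then picks the best k, which a dedicated top-k join operator would try to avoid. The name `top_k_pairs` is hypothetical.

```python
import heapq
from collections import defaultdict

# Tables from the slide: (name, rating, location).
HOUSES = [("H1", 0.9, "L1"), ("H2", 0.8, "L2"), ("H3", 0.6, "L3"), ("H4", 0.1, "L3")]
SCHOOLS = [("S1", 0.4, "L2"), ("S2", 0.2, "L2"), ("S3", 0.8, "L3"), ("S4", 0.1, "L3")]


def top_k_pairs(houses, schools, k):
    """Join on location and return the k pairs with the highest summed rating."""
    schools_at = defaultdict(list)         # hash index on the join attribute
    for school, rating, location in schools:
        schools_at[location].append((school, rating))

    pairs = (
        (house_rating + school_rating, house, school)
        for house, house_rating, location in houses
        for school, school_rating in schools_at.get(location, ())
    )
    return heapq.nlargest(k, pairs)        # tuples compare by score first


best_score, best_house, best_school = top_k_pairs(HOUSES, SCHOOLS, k=1)[0]
print(best_house, best_school)
```

Note that H1 has the highest house rating but no school in its location, so it never appears in a result pair.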
2. Query Processing and Interfaces (1/3)
Given: Database of text documents and a text-centric task.
– Extract information about disease outbreaks
Strategies
– Scan all documents – very expensive
– Filter promising documents – affects recall
Develop cost models and execution strategies appropriate for this setting

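The trade-off between the two strategies can be seen on a toy collection. Everything in the sketch below is hypothetical: the four documents, the disease vocabulary, the word-lookup "extractor" standing in for an expensive extraction system, and the keyword filter. It only illustrates that filtering reduces the number of documents processed at the price of missed results.

```python
# Hypothetical toy data, not taken from the slides.
DOCS = [
    "Outbreak of dengue reported in the region.",
    "Several patients show symptoms of cholera.",  # relevant, but lacks the filter keyword
    "Stock markets closed higher today.",
    "New outbreak of measles confirmed.",
]
DISEASES = {"dengue", "cholera", "measles"}  # assumed vocabulary


def extract(doc):
    """Stand-in for an expensive extractor: returns diseases mentioned in doc."""
    words = {word.strip(".,").lower() for word in doc.split()}
    return words & DISEASES


def run(docs, keep=lambda doc: True):
    """Run the extractor on documents passing `keep`; report cost and results."""
    processed, found = 0, set()
    for doc in docs:
        if keep(doc):
            processed += 1
            found |= extract(doc)
    return processed, found


scan_cost, scan_found = run(DOCS)
filter_cost, filter_found = run(DOCS, keep=lambda doc: "outbreak" in doc.lower())
recall = len(filter_found) / len(scan_found)
print(scan_cost, filter_cost, recall)
```

A cost model for this setting would weigh the extractor calls saved against the recall lost, here one missed disease out of three.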
2. Query Processing and Interfaces (2/3)
Querying with “typed” keywords
Keyword querying: Easy to use
Structured queries: Precise
Find the middle ground…
Instead of
– “german has won nobel award”
– q(x) :- GERMAN(x), hasWonPrize(x,y), NOBEL_PRIZE(y)
Typed keywords:
– “german, has won (nobel award)”

2. Query Processing and Interfaces (3/3)
Does the output have to be a boring list of ranked results?
Nope!

3. Keyword Search on Graphs (1/3)
Lots of graphs around
– Relational DB (tuples + foreign keys)
– XML data (elements/sub-elements/id/idrefs)
– RDF (graph-structured knowledge bases)
Easy to query with keywords, instead of SQL/XQuery/SPARQL
Results are the top-k interconnections between the keywords

3. Keyword Search on Graphs (2/3)
3. Keyword Search on Graphs (3/3)
Query: “Einstein”, “Bohr”
[Figure: example knowledge graph. Nodes: Einstein, Bohr, Nobel Prize, vegetarian, Tom Cruise, 1962. Edge labels: isa, bornIn, diedIn, won.]

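The query “Einstein”, “Bohr” can be sketched as a search for connections between two nodes. The edge list below is an assumed reading of the node and edge labels on this slide, since the figure itself is not recoverable. Ranking connections by path length is also an assumption; real systems typically use richer scoring. The names `simple_paths` and `top_k_connections` are hypothetical.

```python
from collections import defaultdict

# Assumed edges, reconstructed from the labels on the slide.
TRIPLES = [
    ("Einstein", "isa", "vegetarian"),
    ("Tom Cruise", "isa", "vegetarian"),
    ("Tom Cruise", "bornIn", "1962"),
    ("Bohr", "diedIn", "1962"),
    ("Einstein", "won", "Nobel Prize"),
    ("Bohr", "won", "Nobel Prize"),
]


def build_graph(triples):
    """Build an undirected adjacency map, ignoring edge labels."""
    graph = defaultdict(set)
    for subject, _, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def simple_paths(graph, source, target, path=()):
    """Yield every cycle-free path from source to target (fine for tiny graphs)."""
    path = path + (source,)
    if source == target:
        yield list(path)
        return
    for neighbor in graph[source]:
        if neighbor not in path:
            yield from simple_paths(graph, neighbor, target, path)


def top_k_connections(triples, first, second, k):
    """Return the k shortest connections between two keyword nodes."""
    graph = build_graph(triples)
    return sorted(simple_paths(graph, first, second), key=len)[:k]


for connection in top_k_connections(TRIPLES, "Einstein", "Bohr", k=2):
    print(" - ".join(connection))
```

Under these assumptions the shortest connection runs through the shared Nobel Prize node, and the longer one runs through vegetarian, Tom Cruise and 1962. Exhaustive path enumeration does not scale, so this is only a way to see what "interconnection" means.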
4. Entity and Relationship Extraction (1/2)
Information Extraction (or Knowledge Harvesting)

Bill Gates was the founder of Microsoft and later its CEO.
Apple was established on April 1, 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne.
Infosys was founded on 2 July 1981 by seven entrepreneurs: N. R. Narayana Murthy, Nandan Nilekani, …

Company    Founder
Microsoft  Bill Gates
Apple      Steve Jobs
Apple      Steve Wozniak
Infosys    N. R. Narayana Murthy

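A minimal sketch of rule-based extraction, using two hand-written patterns on the first two example sentences. The patterns and the function name `extract_founders` are hypothetical and deliberately brittle; for instance, they would not handle the Infosys sentence, whose founder list is introduced by a phrase and contains initials. Real extraction systems use far more robust techniques.

```python
import re

SENTENCES = [
    "Bill Gates was the founder of Microsoft and later its CEO.",
    "Apple was established on April 1, 1976 by Steve Jobs, Steve Wozniak, and Ronald Wayne.",
]

# Pattern 1: "<Person> was the founder of <Company>"
FOUNDER_OF = re.compile(
    r"(?P<founder>[A-Z][\w.]*(?: [A-Z][\w.]*)*) was the founder of (?P<company>[A-Z]\w*)"
)
# Pattern 2: "<Company> was established|founded on <date> by <Person>, <Person>, and <Person>."
FOUNDED_BY = re.compile(
    r"(?P<company>[A-Z]\w*) was (?:established|founded) on .+? by (?P<founders>[^.]+)\."
)
# Splits "A, B, and C" or "A and B" into individual names.
NAME_SEPARATOR = re.compile(r",\s*(?:and\s+)?|\s+and\s+")


def extract_founders(sentences):
    """Return (company, founder) pairs matched by the two patterns."""
    facts = []
    for sentence in sentences:
        match = FOUNDER_OF.search(sentence)
        if match:
            facts.append((match.group("company"), match.group("founder")))
        match = FOUNDED_BY.search(sentence)
        if match:
            for name in NAME_SEPARATOR.split(match.group("founders")):
                facts.append((match.group("company"), name.strip()))
    return facts


for company, founder in extract_founders(SENTENCES):
    print(company, "|", founder)
```

Each new phrasing needs its own rule, which is one reason constructing extraction rules is a research question rather than a solved task.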
4. Entity and Relationship Extraction (2/2)
How to build a knowledge base of facts?
– Structure Wikipedia
– Construct rules for extraction
How do I acquire all the facts in the world?
– Extract “everything”
– Don’t stop extracting

5. Ranking and Structured Data
Not the same as top-k processing
Given: Data with structure in it
– Relational tables (flat)
– XML (trees/graphs)
– Text documents consisting of entities
Task: Rank the query results
– SQL/XQuery/“typed” keywords

QUESTIONS?