Structured Querying of Web Text: A Technical Challenge. Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko. Presenter: Shahina Ferdous.

Structured Querying of Web Text: A Technical Challenge
Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko
Presenter: Shahina Ferdous, Date: 03/23/10

Querying over Unstructured Data

The Web contains a vast amount of text documents, which are:
- unstructured,
- accessed by keywords,
- of limited search quality.

Querying over Unstructured Data

Query: "Show me some people, what they invented, and the years they died."
Today's model is keyword-in, document-out: the engine returns documents, not answers.

Querying over Unstructured Data

Query: "List some scientists with their inventions and the years they died."
Again, keyword-in, document-out.

Structured Querying of Web Text

"Show me some people, what they invented, and the years they died"

Scientist  | Invention        | Year | Prob
Kepler     | log books        |      |
Heisenberg | matrix mechanics |      |
Galileo    | telescope        |      |
Newton     | calculus         |      |

In this paper, the authors propose a structured web query system called the extraction database, ExDB. ExDB uses an information extraction (IE) system to extract data. Because the extracted data can be erroneous, ExDB assigns a probability to each tuple.

ExDB Work Flow

Web text snippets such as "…In 1877, Edison invented the phonograph. Although he…" feed a three-stage pipeline:
1. Run extractors
2. Populate data model
3. Query processing & applications

The extracted data populate probabilistic tables in an RDBMS, queried through middleware, e.g. invented(Edison, ?i):

Facts:    Obj1   | Pred     | Obj2       | prob
          Edison | invented | phonograph | 0.97
          Morgan | born-in  |            |

Types:    Type      | Instance | prob
          scientist | Einstein | 0.99
          city      | Seattle  | 0.92

Synonyms: Pred1    | Pred2      | prob
          invented | did-invent | 0.85
          invented | created    | 0.72

Information Extraction

ExDB extracts several base-level concepts through a combination of existing IE techniques:
- Objects are the data values in the system. Examples: Einstein, telephone, Boston, light-bulb.
- Predicates represent binary relations between pairs of objects. Examples: discovered(Edison, phonograph), born-in(A.-Einstein, Switzerland), sells(Amazon, PlayStation).
- Semantic types represent unary relations over objects. Examples: city(Boston), city(New-York), electronics(dvd-player).
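The three base-level concepts above can be pictured with a minimal sketch. This is a hypothetical in-memory representation for illustration only, not ExDB's actual storage; the sample facts come from the slide's own examples.

```python
# Objects are plain data values; predicates are binary relations over
# object pairs; semantic types are unary relations over objects.

facts = {
    # (predicate, object1, object2)
    ("discovered", "Edison", "phonograph"),
    ("born-in", "A.-Einstein", "Switzerland"),
    ("sells", "Amazon", "PlayStation"),
}

types = {
    # (type, object)
    ("city", "Boston"),
    ("city", "New-York"),
    ("electronics", "dvd-player"),
}

# The object set is everything mentioned by any extraction.
objects = {o for (_, a, b) in facts for o in (a, b)} | {o for (_, o) in types}

print(sorted(objects))
```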

Information Extraction

ExDB should also extract further relationships to make queries even easier for the user:
- Synonyms denote equivalent objects, predicates, or types. Examples: Einstein and A.-Einstein almost certainly refer to the same object; invented and has-invented refer to the same predicate.
- Inclusion dependencies describe a subset relationship between two predicates. Example: invented(?x, ?y) ⊆ discovered(?x, ?y).
- Functional dependencies are useful for answering queries with negation, or for explaining why an object is not an answer. For example, a probabilistic FD indicating that a person can be born in only one country: born-in(?x, ?y): ?x -> ?y, p = 0.95.

Consider the query "All scientists born in Germany that taught at Princeton". If, after receiving the answers, the user asks "Why is Einstein not an answer?", the system can use the FD above to reply: born-in(Einstein, Switzerland) holds, and the FD says a person is born in only one country, so the probability of born-in(Einstein, Germany) is very low.
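One plausible way to quantify that "why not Einstein?" answer is a simple probability bound. This is an illustrative sketch, not the paper's actual calculation; both numbers other than the FD's 0.95 are assumptions.

```python
# Hedged sketch: if the FD (one birth country per person) and the extracted
# fact born-in(Einstein, Switzerland) both hold, born-in(Einstein, Germany)
# is impossible. So its probability is bounded by the chance that at least
# one of those two assumptions fails.

P_FD = 0.95           # probabilistic FD: a person is born in one country
p_switzerland = 0.90  # assumed prob of born-in(Einstein, Switzerland)

p_germany_upper = 1 - P_FD * p_switzerland
print(round(p_germany_upper, 3))  # 0.145 -> "very low", as the slide says
```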

Information Extraction ExampleDescriptionIE technique invented(Edison, phonograph) Arity-2 factTextRunner Einstein Type (hypernymy)KnowItAll has-invented = invented SynonymyDIRT invented  discovered ID (troponymy)? FD: has-capital(x, y)  has-capital(y) FD (rule)?

ExDB Work Flow, Step 2: Populate data model

Populate Data Model

Source text:

"It was big news when Edison invented the phonograph… He visited cities such as Boston and New York. We all know that Edison did-invent the light bulb. … In 1877 Edison created the phonograph. Morgan was born-in 1837 into a prosperous mercantile-banking family… Einstein is one of the best known scientists and intellectuals of all time."

Extracted tables:

Facts:    Obj1   | Pred     | Obj2       | prob
          Edison | invented | phonograph | 0.97
          Morgan | born-in  |            |

Types:    Type      | Instance | prob
          scientist | Einstein | 0.99
          city      | Boston   | 0.92

Synonyms: Pred1    | Pred2      | prob
          invented | did-invent | 0.85
          invented | created    | 0.72

IDs:      Inclusion | Includer   | prob
          invented  | discovered | 0.81
          Seattle   | Washington | 0.65

FDs:      LHS           | RHS        | prob
          capital(x, y) | capital(y) | 0.77
          born-in(x)    | country(y) | 0.95

TextRunner. For fact extraction ExDB uses the unsupervised system TextRunner, which generates a large set of extractions while running over the entire corpus. Unlike other IE systems, it does not require a set of target predicates to be specified beforehand. Instead, it starts with a heavyweight linguistic parser to generate high-quality extraction triples, then uses those triples as a training set for a lightweight extraction classifier that can run over the entire web-scale corpus.

KnowItAll. For type extraction ExDB uses the KnowItAll system, which searches the corpus for hypernym ("is-a") relationships. For example, it extracts city(Boston) from "cities such as Seattle and Boston". Each extraction is assigned a probability based on its frequency (or search-engine hit count).

DIRT. ExDB uses the DIRT algorithm to extract predicate synonyms. DIRT computes the degree to which the argument pairs of two predicates coincide. For example, invented and has-invented overlap on many argument pairs, such as Edison/light-bulb or Einstein/theory-of-relativity.
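The DIRT-style argument-pair idea can be sketched in a few lines. This uses plain Jaccard overlap for clarity; DIRT itself uses a mutual-information-based similarity, and the pair lists are illustrative.

```python
# Score two predicates by how much their (arg1, arg2) pairs coincide.
def pair_overlap(pairs_a, pairs_b):
    a, b = set(pairs_a), set(pairs_b)
    return len(a & b) / len(a | b) if a | b else 0.0

invented = [("Edison", "light-bulb"), ("Einstein", "theory-of-relativity"),
            ("Edison", "phonograph")]
has_invented = [("Edison", "light-bulb"), ("Einstein", "theory-of-relativity")]
visited = [("dukakis", "boston")]

print(pair_overlap(invented, has_invented))  # high -> likely synonyms
print(pair_overlap(invented, visited))       # 0.0 -> unrelated predicates
```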

ExDB Work Flow, Step 3: Query processing & applications

ExDB Queries

ExDB lets users query the web data model using a Datalog-like notation.

Example: q(?i) :- invented(Edison, ?i) returns all inventions by Edison.

Example with a constant constraint: q(?x) :- died-in(?x, 1955)

Example query for locally available inexpensive electronics:
q(?x, ?z) :- for-sale-in(?x, Seattle), costs(?x, ?z), (?z < 25)

Another example: q(?x, ?y, ?z) :- invented(?x, ?y), died-in(?x, ?z), (?z < 1900)

Example of a projection query: q(?s) :- invented(?s, ?i)
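The single-predicate queries above amount to pattern matching over fact tuples. A minimal sketch of that matching (illustrative only; ExDB's real query language handles joins, constraints, and probabilities):

```python
# Sample probabilistic facts: (predicate, arg1, arg2, prob).
facts = [
    ("invented", "Edison", "phonograph", 0.97),
    ("invented", "Edison", "light-bulb", 0.93),
    ("invented", "Einstein", "relativity", 0.85),
]

def query(pred, arg1, arg2):
    """Match a pattern; terms starting with '?' are variables,
    anything else is a constant that must match exactly."""
    out = []
    for p, a, b, prob in facts:
        if p != pred:
            continue
        if not arg1.startswith("?") and a != arg1:
            continue
        if not arg2.startswith("?") and b != arg2:
            continue
        out.append((a, b, prob))
    return out

# q(?i) :- invented(Edison, ?i)
print(query("invented", "Edison", "?i"))
```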

Query Processing

Non-projecting queries:
- involve a series of joins against tables in the web data model;
- the probability of a joined tuple is the product of the individual tuples' probabilities;
- the top-k tuples, ranked by probability, are returned as results.

Example: q(?x, ?y, ?z) :- invented(?x, ?y), died-in(?x, ?z)

Types:  Object     | Class
        einstein   | scientist
        boston     | city
        bohr       | scientist
        france     | country
        curie      | scientist
        Bugs Bunny | scientist

Facts:  Object1  | Predicate   | Object2     | prob
        einstein | invented    | relativity  |
        1848     | was-year-of | revolution  |
        edison   | invented    | phonograph  |
        dukakis  | visited     | boston      |
        einstein | died-in     | 1955        |
        humans   | have        | cold-fusion |

Result: Scientist | Invented   | Died-in | prob
        einstein  | relativity | 1955    |
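The join-then-rank step can be sketched directly: join on the shared variable ?x, multiply the tuple probabilities, and keep the top-k. The data below is illustrative (the slide's tables omit the probability values).

```python
# (subject, value, prob) rows for two extracted predicates.
invented = [("einstein", "relativity", 0.90), ("edison", "phonograph", 0.97)]
died_in = [("einstein", "1955", 0.80), ("edison", "1931", 0.60)]

# q(?x, ?y, ?z) :- invented(?x, ?y), died-in(?x, ?z)
# Joined tuple's probability = product of the joined tuples' probabilities.
joined = [(x, y, z, p1 * p2)
          for (x, y, p1) in invented
          for (x2, z, p2) in died_in
          if x == x2]

# Return the top-k results ranked by probability.
top_k = sorted(joined, key=lambda t: t[3], reverse=True)[:2]
print(top_k)
```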

Projecting queries
- q(?s) :- invented(?s, ?i) ranks scientists by the probability that each scientist invented something, without regard to the actual invention.
- This requires computing a disjunction of m probabilistic events.
- A scientist such as Tesla appears in the output of q if a tuple invented(Tesla, I0) is in the database.
- There can be many inventions I1, …, Im for Tesla with tuples invented(Tesla, Ii); any one of them is sufficient to return Tesla as an answer to q.
- Since m can be very large, a large number of very low-probability extractions can unexpectedly yield a quite large overall probability.
- ExDB therefore abstracts a "panel of experts", where an expert is a tuple with a score, such as invented(Tesla, fluorescent-lighting) with probability 0.95, which determines the probability of Tesla appearing in q.
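Under an independence assumption, the disjunction of m probabilistic events is the standard noisy-or, P = 1 - prod(1 - p_i). A short sketch (probabilities are illustrative) showing both the normal case and the "many weak extractions add up" problem the slide warns about:

```python
# P(at least one of m independent events) = 1 - prod(1 - p_i).
def disjunction(probs):
    p_none = 1.0
    for p in probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

# One strong "expert" tuple dominates the score:
print(round(disjunction([0.95, 0.1]), 3))        # 0.955

# But 100 junk extractions of p = 0.02 each still push the
# answer's probability surprisingly high:
print(round(disjunction([0.02] * 100), 3))       # ~0.87
```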

Result of Projecting Queries

q(?s) :- invented(?s, ?x)
[figure: scientists ranked by the probability that they invented something]

ExDB Prototype
- Web crawl: 90M pages
- Facts: 338M tuples, 102M objects
- Types: 6.6M instances
- Synonyms: 17k pairs
- No IDs or FDs yet

Applications

ExDB's extracted data are not meant to be examined directly; rather, they are used to build topic-specific tables that a human user can appreciate. For example, a synthetic table about scientists can be generated by merging answers from died-in(?x, ?y), invented(?x, ?y), published(?x, ?y), and taught(?x, ?y).

If an ExDB query could be generated automatically from keywords, a very powerful query system could be built. It is also possible to build a web data cube over ExDB's large amount of read-only structured data.

Alternative Models

The Schema Extraction Model tries to find the single best schema for the entire set of extractions, transforming web text into a traditional relational database. Three criteria for a good extracted schema:
- Simplicity: few tables.
- Completeness: all extractions appear in the output.
- Fullness: the output database has no NULLs.

Alternative Models

The Text Query Model performs no information extraction at all; instead it offers a descriptive query language that generates answers to user queries very quickly. An example user query:
- Extract city/date tuples from the band's website.
- Indicate the city where she lives.
- Compute the dates when the band's city and her own city are within 100 miles of each other.

Questions? Thank You