1
Toward Topic Search on the Web
Department of Electronic, Electrical and Computer Engineering, G201249003, 김영제, Database Lab
2
Abstract
Traditional web search engines treat queries as sequences of keywords and return web pages that contain those keywords.
Merit: effective when the user knows exactly the right words.
Demerit: useless if the user asks for a concept or topic that has a broader and sometimes ambiguous meaning; most users do not know the exact keywords for what they are looking for.
3
1. Introduction Q1. database conferences in asian cities Users do not know the names of the conferences or cities
4
1. Introduction Q2. big financial companies campaign donation Which companies, and which campaign donations?
5
1. Introduction Q3. tech companies slogan Various slogans of tech companies
6
1. Introduction Q4. winter vacation destinations except florida Users want good destinations other than Florida
7
1. Introduction Q1. database conferences in asian cities
8
1. Introduction Q2. big financial companies campaign donation
9
1. Introduction Q3. tech companies slogan
10
1. Introduction Q4. winter vacation destinations except florida
11
2. Overview
12
2. Overview Parser
Query → term sequences
Uses Probase
5 different types of terms: Concepts, Entities, Attributes, Keywords, Modifiers
13
2. Overview Interpreter
Term sequence → intent or semantics, based on a set of query patterns
Parsed queries → set of candidate queries, by substituting abstract concepts with their specific entities
14
2. Overview Processor
Ranks the candidate queries based on their likelihood, estimated by word association probabilities
Submits the top queries either to Probase or to the search engine index
Ranks the results in a standard way
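To make the three-stage flow above concrete, here is a minimal Python sketch of the pipeline; the class and method names are illustrative placeholders rather than the paper's implementation, and each stage stands in for the Probase-backed logic described on the following slides.

```python
# Minimal, illustrative pipeline; all names are hypothetical placeholders.

class Parser:
    def parse(self, query: str) -> list[list[str]]:
        """Break a raw query into candidate term sequences (top-k parses)."""
        raise NotImplementedError

class Interpreter:
    def interpret(self, parses: list[list[str]]) -> list[str]:
        """Match parses to query patterns and substitute concepts with entities."""
        raise NotImplementedError

class Processor:
    def process(self, candidates: list[str]) -> list[str]:
        """Rank candidate queries, submit the best ones, and rank the results."""
        raise NotImplementedError

def topic_search(query: str, parser: Parser, interpreter: Interpreter,
                 processor: Processor) -> list[str]:
    parses = parser.parse(query)                 # query -> term sequences
    candidates = interpreter.interpret(parses)   # term sequences -> candidate queries
    return processor.process(candidates)         # candidate queries -> ranked results
```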
15
3. The knowledge base
16
3. The knowledge base Probase taxonomy is very rich
The core taxonomy alone contains around 2.7 million concepts, learned from 1.68 billion web pages and two years' worth of search logs from a major search engine
It offers a rich set of super-concepts, sub-concepts, and similar concepts
17
3. The knowledge base Probase taxonomy is probabilistic
Every concept in Probase is associated with some probabilities: plausibility, ambiguity, and other characteristics
The probabilities are derived from evidence found in web data, search logs, and other available data
This makes it natural for Probase to integrate information from other data sources, including other ontologies
It also enables Probase to rank the information it contains
18
4.1 Query Parsing Basically there are five kinds of terms: concepts, entities, attributes, keywords, and modifiers
Keywords are nontrivial words other than concepts and entities; auxiliary modifiers modify a class of things
19
4.1 Query Parsing “president george bush fires general batiste”
[president] (george bush) fires [general] (batiste)
[president] george [bush fires] [general] (batiste)
(president) (george bush) fires [general] (batiste)
() denotes an entity, [] a concept, <> an attribute
Both “george bush” and “bush fires” are valid terms in Probase
“President” can be a concept (all presidents) or a specific entity in the political leader concept
The parser needs to return all meaningful parses of a query
20
4.1 Query Parsing For an n-word query there are 2^(n-1) possible parses, which is expensive to compute
We first introduce a greedy algorithm to solve this problem, then improve it with a dynamic programming approach
The greedy algorithm has three steps: (1) find all possible terms, (2) find all correlations among terms, (3) use a scoring function to pick one meaningful parse in a greedy manner
21
4.1 Query Parsing (1) find all terms in a query
For a sequence of n words, there are n(n+1)/2 possible subsequences; we check each of them to see if it is a concept, entity or attribute
We give a term t a score according to its type and length, where |t| is the number of words in t and w_term(t) is a weight function taking the constants w_e, w_c or w_a depending on whether t is an entity, concept or attribute, with w_e > w_c
We let α > 1 to bias toward longer terms
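A small sketch of this scoring step follows; the concrete formula |t|^α · w_term(t) and the weight values are assumptions consistent with the definitions above, not the paper's exact numbers, and term_type is a stand-in for a Probase lookup.

```python
# Hedged sketch of step (1): enumerate subsequences and score candidate terms.
# The scoring formula and the weights below are assumptions, not the paper's values.

ALPHA = 1.5                      # alpha > 1 biases toward longer terms (assumed value)
WEIGHTS = {"entity": 3.0,        # w_e > w_c, per the slide; the numbers are placeholders
           "concept": 2.0,
           "attribute": 1.0}

def term_type(words: tuple[str, ...]) -> str | None:
    """Stub standing in for a Probase lookup: return the term type or None."""
    return None

def score_term(words: tuple[str, ...], ttype: str) -> float:
    # Assumed form: |t|^alpha * w_term(t)
    return (len(words) ** ALPHA) * WEIGHTS[ttype]

def enumerate_terms(query_words: list[str]) -> list[tuple[int, int, str, float]]:
    """Return (start, end, type, score) for every subsequence that is a known term."""
    terms = []
    n = len(query_words)
    for i in range(n):
        for j in range(i + 1, n + 1):        # n(n+1)/2 subsequences in total
            words = tuple(query_words[i:j])
            ttype = term_type(words)
            if ttype is not None:
                terms.append((i, j, ttype, score_term(words, ttype)))
    return terms
```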
22
4.1 Query Parsing (2) find the correlations among terms
Three kinds of correlations: Entity-Attribute, Concept-Attribute, Concept-Entity
An R1-R2 correlation is a correlation between one R1 term and several R2 terms
Examples: “<…> of (china)” is an Entity-Attribute correlation; “[tech companies] and <slogans>” is a Concept-Attribute correlation
Terms in a correlation do not have to be adjacent to each other; keywords can be mixed with correlated terms, e.g. “[presidents] and their <…>”
23
4.1 Query Parsing Based on terms and term scores, we define blocks and block scores
A block is either a correlation or a single term
The block score of a term sequence p[i′, j′] parsed from q[i, j] boosts the sum of its term scores by a correlation weight w_{e-a}, w_{c-a} or w_{c-e} (all greater than 1), depending on the correlation type
If there is no correlation, the block score is equal to the sum of the term scores
24
4.1 Query Parsing (3) greedily select the block with the highest score
The greedy algorithm does not guarantee an optimal parse, so we propose the following dynamic programming algorithm
We define a preference score s_pref(n) to represent the quality of the best parse of a query of n words
By memorizing the sub-solutions of s_pref(n), one can produce the parse with the highest preference score
Moreover, by defining s_pref(n) as a score set of top parses, one can also obtain the top k parses
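A hedged sketch of such a dynamic program, assuming the usual recurrence s_pref(k) = max over split points j of s_pref(j) plus the best block score of words j..k; block_score is a placeholder for the block scoring described above.

```python
# Illustrative dynamic-programming parse, not the paper's exact algorithm.
# block_score(words) is assumed to return the best block score for a word span
# (0.0 if the span forms no known term or correlation).

from functools import lru_cache
from typing import Callable

def best_parse_score(query_words: list[str],
                     block_score: Callable[[tuple[str, ...]], float]) -> float:
    """s_pref(n): quality of the best parse of the first n words (assumed recurrence)."""
    n = len(query_words)

    @lru_cache(maxsize=None)
    def s_pref(k: int) -> float:
        if k == 0:
            return 0.0
        # Try every split point: best parse of the first j words, then one block over words j..k.
        return max(s_pref(j) + block_score(tuple(query_words[j:k])) for j in range(k))

    return s_pref(n)
```

Keeping the argmax split points alongside the scores recovers the actual parse, and keeping a small set of top scores per prefix yields the top-k parses mentioned above.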
25
4.1 Query Parsing
26
4.2 Query Interpretation Classify the input parsed queries into different patterns
6 basic patterns:
1. Single Entity (E)
2. Single Concept (C)
3. Single Entity + Attributes (E+A)
4. Single Concept + Attributes (C+A)
5. Single Concept + Keywords (C+K)
6. Concept + Keywords + Concept (C+K+C)
These patterns can be combined to form more complex patterns
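A rough illustration of mapping a parsed term sequence to one of these basic patterns; the type labels and the order of the checks are assumptions made for the sketch, not the paper's classifier.

```python
# Hypothetical pattern classifier over a parsed term sequence.
# Each parsed term is a (text, type) pair with type in
# {"entity", "concept", "attribute", "keyword", "modifier"}.

def classify_pattern(terms: list[tuple[str, str]]) -> str:
    types = [t for _, t in terms]
    concepts = types.count("concept")
    entities = types.count("entity")
    attributes = types.count("attribute")
    keywords = types.count("keyword")

    if entities == 1 and len(terms) == 1:
        return "E"
    if concepts == 1 and len(terms) == 1:
        return "C"
    if entities == 1 and attributes >= 1 and concepts == 0 and keywords == 0:
        return "E+A"
    if concepts == 1 and attributes >= 1 and entities == 0 and keywords == 0:
        return "C+A"
    if concepts == 2 and keywords >= 1 and entities == 0:
        return "C+K+C"
    if concepts == 1 and keywords >= 1 and entities == 0:
        return "C+K"
    return "complex"   # combinations of the basic patterns

# Example: "politicians commit crimes" parsed as concept + keyword + concept
# classify_pattern([("politicians", "concept"), ("commit", "keyword"),
#                   ("crimes", "concept")]) == "C+K+C"
```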
27
4.2 Query Interpretation Some example queries of each pattern
28
4.2 Query Interpretation Once the system determines the pattern of each parsed query, it starts interpreting them using the following strategies
The general approach: abstract concepts in a query → more specific search terms
For E and E+A queries, no further interpretation is necessary, since these queries are already specific and can be searched directly in both Probase and the search engine index
For a C or C+A query: a list of top entities associated with that concept in Probase → a list of E or E+A queries
29
4.2 Query Interpretation A special case is when the concept is implied by an auxiliary modifier such as “... except florida”
“florida” is an entity in Probase, and we treat [except florida] as if it were a concept: the most representative concept in which “florida” is an entity, but with “florida” removed from it
To find the concept most representative of an entity, we use a DF-ITF score, an inverse form of the well-known TF-IDF score
30
4.2 Query Interpretation For a C+K query, the Interpreter replaces the concept with its associated entities to form a list of Entity + Keywords queries, which require no further interpretation
A C+K+C query is treated as an extended form of a C+K query: we replace both concepts with their associated entities to form a list of Entity + Keywords + Entity queries
Finally, the output of the Interpreter module is a set of substituted queries of the following 4 patterns: E, E+A, E+K and E+K+E
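A minimal sketch of this substitution step, assuming a helper top_entities(concept, k) that returns the top-k entities Probase associates with a concept; the helper name and the cut-off k are placeholders.

```python
from itertools import product

def top_entities(concept: str, k: int = 20) -> list[str]:
    """Placeholder for a Probase lookup returning the top-k entities of a concept."""
    return []

def substitute(pattern: str, terms: list[str], k: int = 20) -> list[str]:
    """Expand a C+K or C+K+C query into E+K / E+K+E candidate queries."""
    if pattern == "C+K":
        concept, keywords = terms[0], " ".join(terms[1:])
        return [f"{e} {keywords}" for e in top_entities(concept, k)]
    if pattern == "C+K+C":
        c1, keywords, c2 = terms[0], " ".join(terms[1:-1]), terms[-1]
        return [f"{e1} {keywords} {e2}"
                for e1, e2 in product(top_entities(c1, k), top_entities(c2, k))]
    return [" ".join(terms)]          # E and E+A queries pass through unchanged

# e.g. substitute("C+K+C", ["politicians", "commit", "crimes"]) could yield
# candidates like "obama commit burglary" among the k*k combinations.
```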
31
4.3 Query Processing The Processor module takes as input a set of substituted candidate queries, submits some or all of them to Probase or a search index, and presents a final set of ranked results to the user
For E and E+A pattern queries, the Processor queries the Probase taxonomy for all the detailed information about the particular entity
This information is returned as a table which is eventually presented to the user as an info-box (e.g. Fig. 7)
32
4.3 Query Processing We focus on E+K and E+K+E queries, which are more interesting and challenging to process
One naive approach: submit all the substituted queries to a traditional search engine index, combine the results, and present ranked results to the user
However, the number of such queries can be prohibitively large, because many concepts are associated with a large number of entities
“politicians commit crimes” → millions of candidate substitutions, such as “obama commit burglary”
33
4.3 Query Processing The key technical challenge is understanding the user's intent and filtering out the substituted queries that are not relevant
Keywords act as modifiers that limit the scope of the accompanying concept in C+K queries, or impose a particular relationship between the two surrounding concepts in C+K+C queries
For example, by specifying “commit” in the previous example, the user excludes other relationships such as “politicians fight crimes” or “politicians victimized by crimes”
34
4.3 Query Processing Proposed technique to solve the word association problem:
Compute the word association value between an entity and a sequence of keywords
Multiply it by the representativeness score of this entity in the concept it belongs to
The result is a relevance score for the given query
35
4.3.1 Word Association A word association value measures the associativity of a set of words
To compute the word association value between two words (which we call two-way association), we measure how frequently the two words co-occur in a document among all documents, or in a sentence among all sentences in all documents
36
4.3.1 Word Association The concept can easily be extended to the association of more than two words (namely multi-way association)
Word association values among a fixed set of words are often precomputed and stored in a matrix for efficient runtime lookup
Computing a two-way association matrix for a set of N words generally takes O(N^2) time, while computing a multi-way matrix is even more costly
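A small sketch of precomputing a two-way association matrix from document-level co-occurrence; normalizing the co-occurrence count by the total number of documents is one simple choice, not necessarily the measure used in the paper.

```python
from collections import defaultdict
from itertools import combinations

def two_way_association(docs: list[set[str]], vocab: list[str]) -> dict[tuple[str, str], float]:
    """Precompute P(w1 and w2 co-occur in a document) for word pairs drawn from vocab."""
    counts = defaultdict(int)
    vocab_set = set(vocab)
    for doc in docs:
        present = sorted(doc & vocab_set)
        for w1, w2 in combinations(present, 2):
            counts[(w1, w2)] += 1
    n_docs = len(docs)
    # Symmetric matrix stored sparsely as a dict keyed by word pairs.
    matrix = {}
    for (w1, w2), c in counts.items():
        matrix[(w1, w2)] = matrix[(w2, w1)] = c / n_docs
    return matrix

# For a vocabulary of N words this touches up to O(N^2) pairs, matching the cost
# noted above; multi-way association grows combinatorially.
```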
37
4.3.2 E+K and E+K+E queries Group the E+K and E+K+E queries by their prefix E+K
As a result, each E+K query forms a group by itself, and E+K+E queries with the same prefix form a group
Compute a relevance score for each group G from the word association values among q_e1, q_k and q_e2 of the queries q in G, weighted by their representativeness scores rp(q), where
q_e1: the first entity of query q
q_k: the keywords
q_e2: the second entity (q_e2 may be null if q is an E+K query)
rp(q): the representativeness score of q
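A hedged sketch of this grouping and scoring step; aggregating by summing association(q_e1, q_k, q_e2) · rp(q) over the queries in a group is an assumption, and assoc stands in for lookups into the precomputed word-association matrix.

```python
from collections import defaultdict

# A candidate query is represented as (e1, keywords, e2, rp), with e2 = None for
# E+K queries and rp the representativeness score of the substituted entities.

def group_by_prefix(queries):
    """Group E+K and E+K+E candidates by their (e1, keywords) prefix."""
    groups = defaultdict(list)
    for e1, kw, e2, rp in queries:
        groups[(e1, kw)].append((e1, kw, e2, rp))
    return groups

def group_relevance(group, assoc):
    """Assumed aggregation: sum over queries of association(e1, kw, e2) * rp(q)."""
    return sum(assoc(e1, kw, e2) * rp for e1, kw, e2, rp in group)

def top_groups(queries, assoc, n=10):
    groups = group_by_prefix(queries)
    return sorted(groups.values(),
                  key=lambda g: group_relevance(g, assoc),
                  reverse=True)[:n]
```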
38
4.3.2 E+K and E+K+E queries Select the top n groups with the best relevance scores
For each of the n groups:
If a group G is an E+K group, simply send the only query in the group to the search index and gather the results as final results
If a group G is an E+K+E group, use a two-pass procedure that accesses the search engine twice
In the first pass, we query the search engine with {q_e1, q_k}; call the set of top pages thus returned R
Remove from G all queries whose second entity does not appear in any of the pages in R
In the second pass, we send the remaining queries in G to the search engine
Finally, we collect all results from the top n groups and rank them according to some common search engine metric, such as PageRank, before returning them to the user
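An illustrative rendering of the two-pass procedure for one E+K+E group; search(query, k) stands in for a call to the search index and is assumed to return page texts that can be scanned for the second entity.

```python
def process_eke_group(group, search, top_k=50):
    """Two-pass processing of one E+K+E group (sketch).

    group:  list of (e1, keywords, e2) tuples sharing the same (e1, keywords) prefix
    search: callable(query_string, k) -> list of page texts (assumed interface)
    """
    e1, keywords, _ = group[0]

    # Pass 1: query with the shared prefix {e1, keywords} and keep the top pages R.
    pages = search(f"{e1} {keywords}", top_k)

    # Prune: drop queries whose second entity appears in none of the pages in R.
    surviving = [e2 for _, _, e2 in group
                 if any(e2.lower() in page.lower() for page in pages)]

    # Pass 2: send the remaining fully substituted queries to the search index.
    results = []
    for e2 in surviving:
        results.extend(search(f"{e1} {keywords} {e2}", top_k))
    return results
```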
39
4.4 Ranking Results Even though the Processor module filters out most irrelevant candidate queries, the number of relevant queries can still be large
The current framework requires all results to be returned from the search index before ranking and presenting the final results to the user, which can be time consuming
We therefore made a few optimizations for C+K and C+K+C queries (two of the most expensive query patterns)
40
4.4 Ranking Results Set up an empty pool
Send the queries in the top n groups one by one to the search index and place their results into the pool
At regular intervals, pop the best result from the pool and return it to the user
Continue this process until sufficient results have been shown to the user
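This reads like a streaming top-result loop; below is a minimal sketch using a max-heap as the pool. The scoring callable and the "one result per query batch" pacing are assumed details standing in for the regular intervals above.

```python
import heapq

def stream_results(queries, search, score, limit=20):
    """Incrementally emit best-so-far results while queries are still being issued.

    queries: candidate queries ordered by group relevance
    search:  callable(query) -> list of results (assumed interface)
    score:   callable(result) -> float, e.g. a PageRank-like metric
    """
    pool = []                       # max-heap via negated scores
    counter = 0                     # tie-breaker so results never need comparing
    shown = 0
    for query in queries:           # send the top-group queries one by one
        for result in search(query):
            heapq.heappush(pool, (-score(result), counter, result))
            counter += 1
        if pool and shown < limit:  # at each interval, pop and show the current best
            _, _, best = heapq.heappop(pool)
            yield best
            shown += 1
    # Drain the pool if more results are still needed after all queries are sent.
    while pool and shown < limit:
        yield heapq.heappop(pool)[2]
        shown += 1
```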
41
5. Evaluation We create a set of benchmark queries that contain concepts, entities, and attributes, for example:
“politicians commit crimes” (C+K+C)
“large companies in chicago” (C+K)
“president washington quotes” (E+A)
42
5.1 Semantic Query Processing Given a query, we analyze the concepts, entities, and attributes it contains
For C+K and C+K+C queries: rewrite the query by replacing the concepts with appropriate entities, send the rewritten queries to Bing using the Microsoft Bing API, then combine their results and compare them with the results of the original query
For other types of queries (e.g. E, C, E+A and C+A): search for relevant information in Probase and return a table that contains entities, their attributes and values
In all search-quality experiments, we asked three human judges to rate each search result as “relevant” or “not relevant” and recorded the majority vote
43
Quality of C+K & C+K+C queries Results for C+K+C queries have lower relevance than C+K queries across the three systems because C+K+C queries often involve more complicated semantic relations than C+K queries
44
Quality of C+K & C+K+C queries Fig. 14 shows that at least 77% of the relevant C+K results and 85% of the relevant C+K+C results in our top 10 could not be found in Bing or Google even after scanning the first 1000 results
45
Quality of C+K & C+K+C queries Comparison with other semantic search engines including Hakia, Evri and SenseBot
46
Quality of E, C, E+A & C+A queries For E, C, E+A, and C+A queries, we return tables containing relevant entities instead of hyperlinks
The system returned some wrong answers for C and C+A queries because Probase contains some erroneous concept-entity pairs
For example, it cannot disambiguate whether “George Washington” is a book or a person
47
Time performance Table 3 shows the average running time of the benchmark queries in different categories
48
Time performance Table 4 shows the average number of bytes our prototype system sends to and receives from Bing
Since C+K+C queries require the system to send queries to the search engine twice, they generate more traffic than C+K queries
Nonetheless, the communication costs are within a reasonable range
49
Thank you for listening