Download presentation
Presentation is loading. Please wait.
Published byNeal George Modified over 8 years ago
2
Breaking through the syntax barrier: Searching with entities and relations Soumen Chakrabarti IIT Bombay www.cse.iitb.ac.in/~soumen
3
ECML/PKDDChakrabarti 20042 Wish upon a textbox, 1996 Your information need here
4
ECML/PKDDChakrabarti 20043 Wish upon a textbox, 1998 “ A rising tide of data lifts all algorithms ” Your information need here
5
ECML/PKDDChakrabarti 20044 Wish upon a textbox, post-IPO Now indexing 4,285,199,774 pages Same interface, therefore same 2-word queries Mind-reading wizard-in-black-box saves the day Your information need (still) here
6
ECML/PKDDChakrabarti 20045 If music had been invented ten years ago along with the Web, we would all be playing one-string instruments (and not making great music). Udi Manber, A9.com Plenary speech WWW 2004
7
ECML/PKDDChakrabarti 20046 Examples of “great music”… The commercial angle: they’ll call you Want to buy X, find reviews and prices Cheap tickets to Y Noun-phrase + Pagerank saves the day Find info about diabetes Find the homepage of Tom Mitchell Searching vertical portals: garbage control Searching Citeseer or IMDB through Google Someone out there had the same problem +adaptec +aha2940uw +lilo +bios
8
ECML/PKDDChakrabarti 20047 … and not-so-great music Which produces better responses? Opera fails to connect to secure IMAP tunneled through SSH opera connect imap ssh tunnel Unable to express many details of information need Opera the email client, not a kind of music The problem is with Opera, not ssh, imap, applet “Secure” is an attribute of imap, but may not juxtapose
9
ECML/PKDDChakrabarti 20048 Why telegraphic queries fail Information need relates to entities and relationships in the real world But the search engine gets only strings Risk over-/under- specified queries Never know true recall No time to deal with poor precision Query word distribution dramatically different from corpus distribution Query is inherently incomplete Fix some info, look for other
10
ECML/PKDDChakrabarti 20049 Structure in the corpus Structure in the query “Free-format text”HTMLXMLEntity+relations Bag-of-words Bool,Prox XQuery SQL Domain of IR, time to analyze queries much more deeply So far, largely for power-users, can be generated by programs Broad, deep and ad-hoc schema and data integration: Very difficult! No schema; either map to schema or support query on uninterpreted graphs Defined many real problems away; “solved” apart from performance issues Defining, labeling, extracting and ranking answers are the major issues; no universal models; need many more applications 1 3 2
11
ECML/PKDDChakrabarti 200410 Past the syntax barrier: early steps Taking the question apart Question has known parts and unknown “slots” Query-dependent information extraction Compiling relations from the Web is-instance-of (is-a), is-subclass-of is-part-of, has-attribute Graph models for textual data Searching graphs with keywords and twigs Global probabilistic graph labeling 1 2 3
12
Part-1 Working harder on the question
13
ECML/PKDDChakrabarti 200412 Atypes and ground constants Specialize given domain to a token related to ground constants in the query What animal is Winnie the Pooh? instance-of(“animal”) NEAR “Winnie the Pooh” When was television invented? instance-of(“time”) NEAR “television” NEAR synonym(“invented”) FIND x NEAR GroundConstants(question) WHERE x IS-A Atype(question) Ground constants: Winnie the Pooh, television Atypes: animal, time
14
ECML/PKDDChakrabarti 200413 Taking the question apart Atype: the type of the entity that is an answer to the question Problem: don’t want to compile a classification hierarchy of entities Laborious, can’t keep up Offline rather than question-driven Instead Set up a very large basis of features “Project” question and corpus to basis
15
ECML/PKDDChakrabarti 200414 Scoring tokens for correct Atypes FIND x “NEAR” GroundConstants(question) WHERE x IS-A Atype(question) No fixed question or answer type system Convert “x IS-A Atype(question)” to a soft match DoesAtypeMatch(x, question) QuestionAnswer tokens Passage IE-style surface feature extractors IS-A feature extractors IE-style surface feature extractors Question feature vector Snippet feature vector Learn joint distrib. …other extractors…
16
ECML/PKDDChakrabarti 200415 Features for Atype matching Question features: 1, 2, 3-token sequences starting with standard wh-words where, when, who, how_X, … Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos,… Passage IS-A features: all generalizations of all noun senses of token Use WordNet: horse equid ungulate, hoofed mammal placental mammal animal… entity These are node IDs (“synsets”) in WordNet, not strings
17
ECML/PKDDChakrabarti 200416 Supervised learning setup Get top 300 passages from IR engine “Promising but negative” instances Crude approximation to active learning For each token invoke feature extractors Question vector x q, passage vector x p How to represent combined vector x? Label = 1 if token is in answer span, 0 o/w Question and answers from logs
18
ECML/PKDDChakrabarti 200417 Joint feature-vector design Obvious “linear” juxtaposition x =(x p,x q ) Does not expose pairwise dependencies “Quadratic” form x = x q x p All pairwise product of elements Model has param for every pair Can discount for redundancy in pair info If x q (x p ) is fixed, what x p (x q ) will yield the largest Pr(Y=1|x)? how_far when what_city region#n#3 entity#n#1
19
ECML/PKDDChakrabarti 200418 Classification accuracy Pairing more accurate than linear model Are the estimated w parameters meaningful? Given question, can return most favorable answer feature weights
20
ECML/PKDDChakrabarti 200419 Parameter anecdotes Surface and WordNet features complement each other General concepts get negative params: use in predictive annotation Learning is symmetric (Q A)
21
ECML/PKDDChakrabarti 200420 Taking the question apart Atype: the type of the entity that is an answer to the question Ground constants: Which question words are likely to appear (almost) unchanged in an answer passage? Arises in Web search sessions too Opera login fails problem with login Opera email Opera login accept password Opera account authentication …
22
ECML/PKDDChakrabarti 200421 Features to identify ground constants Local and global features POS of word, POS of adjacent words, case info, proximity to wh-word Suppose word is associated with synset set S NumSense: size of S (is word very polysemous?) NumLemma: average #lemmas describing s S (are there many aliases?) Model as a sequential learning problem Each token has local context and global features Label: does token appear near answer? POS@0POS@1POS@-1
23
ECML/PKDDChakrabarti 200422 Ground constants: sample results Global features (IDF, NumSense, NumLemma) essential for accuracy Best F1 accuracy with local features alone: 71—73% With local and global features: 81% Decision trees better than logistic regression F1=81% as against LR F1=75% Intuitive decision branches
24
ECML/PKDDChakrabarti 200423 Summary of the Atype strategy “Basis” of atypes A, a A could be synset, surface pattern, feature of a parse tree Question q “projected” to vector (w a : a A) in atype space via learning conditional model If q is “when…” or “how long…” w hasDigit and w time_period#n#1 are large, w region#n#1 is small Each corpus token t has associated indicator features a (t ) for every a hasDigit (3,000)= is-a(region#n#1) (Japan)=1
25
ECML/PKDDChakrabarti 200424 Reward proximity to ground constants A token t is a candidate answer if H q (t ): Reward tokens appearing “near” ground constants matched from question Order tokens by decreasing Atype indicator features of the token Projection of question to “atype space” …the armadillo, found in Texas, is covered with strong horny plates
26
ECML/PKDDChakrabarti 200425 Evaluation: Mean reciprocal rank (MRR) n q = smallest rank among answer passages MRR = (1/|Q|) q Q (1/n q ) Dropping passage from #1 to #2 as bad as dropping it from #2 to not reporting it at all Experiment setup: 300 top IR score passages If Pr(Y=1|token) < threshold reject token If tokens rejected reject passage Points below diagonal are good
27
ECML/PKDDChakrabarti 200426 Sample results Accept all tokens IR baseline MRR Moderate acceptance threshold non-answer passages eliminated, improves answer ranks High threshold true answers eliminated Another answer with poor rank, or rank = Additional benefits from proximity filtering
28
Part-2 Compiling fragments of soft schema
29
ECML/PKDDChakrabarti 200428 Who provides is-a info? Compiled KBs: WordNet, CYC Automatic “soft” compilations Google sets KnowItAll BioText Can use as evidence in scoring answers
30
ECML/PKDDChakrabarti 200429 Extracting is-instance-of info Which researcher built the WHIRL system? WordNet may not know Cohen IS-A researcher Google has over 4.2 billion pages “william cohen” on 86100 (p 1 =86.1k/4.2B) researcher on 4.55M (p 2 =4.55M/4.2B) +researcher +"william cohen“ on 1730: 18.55x more frequent than expected if independent Pointwise mutual information PMI Can add high-precision, low-recall patterns “cities such as New York” (26600 hits) “professor Michael Jordan” (101 hits )
31
ECML/PKDDChakrabarti 200430 Bootstrapping lists of instances Hearst 1992, Brin 1997, Etzioni 2004 A “propose-validate” approach Using existing patterns, generate queries For each web page w returned Extract potential fact e and assign confidence score Add fact to database if it has high enough score Example patterns NP1 {,} {such as|and other|including} NPList2 NP1 is a NP2, NP1 is the NP2 of NP3 the NP1 of NP2 is NP3 Start with NP1 = researcher etc.
32
ECML/PKDDChakrabarti 200431 System details The importance of shallow linguistics working together with statistical tests China is a (country) NP in Asia Garth Brooks is a (country ADJ (singer) N ) NP Unary relation example NP1 such as NPList2 & head(NP1)=plural(name(Class1)) & properNoun(head(each(NPList2))) instanceOf(Class1, head(each(NPList2)) ) “Head” of phrase
33
ECML/PKDDChakrabarti 200432 Compilation performance Recall-vs-precision exposes size and difficulty of domain “US state” is easy “Country” is difficult To improve signal-to- noise (STN) ratio, stop when confidence score is lower than threshold Substantially improves recall-vs-precision
34
ECML/PKDDChakrabarti 200433 Exploiting is-a info for ranking Use PMI scores as additional features Challenge: make frugal use of expensive inverted index probes QuestionAnswer tokens Passage IE-style surface feature extractors WN IS-A feature extractors IE-style surface feature extractors Question feature vector Snippet feature vector Learn joint distrib. PMI scores from search engine probes Atype
35
ECML/PKDDChakrabarti 200434 Structure in the corpus Structure in the query “Free-format text”HTMLXMLEntity+relations Bag-of-words Bool,Prox XQuery SQL Domain of IR, time to analyze queries much more deeply So far, largely for power-users, can be generated by programs Broad, deep and ad-hoc schema and data integration: Very difficult! No schema; either map to schema or support query on uninterpreted graphs Defined many real problems away; “solved” apart from performance issues Defining, labeling, extracting and ranking answers are the major issues 1 3 2 3a: Schema-free search on labeled graphs 3b: Labeling graphs
36
ECML/PKDDChakrabarti 200435 Mining graphs: applications abound No clean schema, data changes rapidly Lots of generic “graph proximity” information Time XYZ A. Thor Abstract:… PDF file LastMod …your ECML submission titled XYZ has been accepted… Email1 EmailDate Could I get a preprint of your recent ECML paper? Email2 EmailDate XYZ A. U. Thor EmailTo ECML Canonical node Want to quickly find this file
37
Part-3a Searching graphs
38
ECML/PKDDChakrabarti 200437 Some useful graph query styles Find single node with specified type strongly activated by given nodes paper NEAR { author=“A. U. Thor” conference=“ECML” time=“now” } May not know exact types of activators Find connected subgraph that best explains how/why two or more nodes are related connect {“Faloutsos” “Roussopulous } Reward short paths, parallel paths, few distractions Query is a small graph with match clauses
39
ECML/PKDDChakrabarti 200438 Measuring activation of single nodes movie (near “richard gere” “julia roberts”) http://www.cse.iitb.ac.in/banks/ http://www.cse.iitb.ac.in/banks/
40
ECML/PKDDChakrabarti 200439 Reporting connecting subgraphs http://www.cse.iitb.ac.in/banks/
41
ECML/PKDDChakrabarti 200440 Schema-free twig queries Query is a small graph, very light on types Node type (dblpauthor, vldbjournal) Simple predicates to qualify nodes (string, number) Match to data graph Precisely defined matches for local predicates Reward for proximity
42
ECML/PKDDChakrabarti 200441 Sample twig result VLDB journal article, 2001 Gerhard Weikum’s home page Max-Planck Institute home page
43
ECML/PKDDChakrabarti 200442 Main issues What style of queries to support? No end-user will type Xquery directly But bag-of-words is too weak How to rank result nodes and subgraphs? Reward for text match, graph proximity, … “Iceberg queries”: report top few after a lot of computation? Can ranking be learnt as with question answering? How to reuse and retarget supervision?
44
Part-3b Probabilistic graph labeling
45
ECML/PKDDChakrabarti 200444 Identifying node types How do we know a node has a given type? Find person near “Pagerank” Find student near (homepage of Tom Mitchell) Find {verb, ####, “television”} close together Nodes in our graph have Visible attributes “hobbies”, “my advisor is”, “start-of-sentence” Invisible labels “student”, “verb” Neighbors “HREF” “linear juxtaposition” Goal: given some node labels, guess all others
46
ECML/PKDDChakrabarti 200445 Example: Hypertext classification Want to assign labels to Web pages Text on a single page may be too little or misleading Page is not an isolated instance by itself Problem setup Web graph G=(V,E) Node u V is a page having text u T Edges in E are hyperlinks Some nodes are labeled Make collective guess at missing labels Probabilistic model? Benefits?
47
ECML/PKDDChakrabarti 200446 For the moment we don’t need to worry about this Seek a labeling f of all unlabeled nodes so as to maximize Let V K be the nodes with known labels and f(V K ) their known label assignments Let N(v) be the neighbors of v and N K (v) N(v) be neighbors with known labels Markov assumption: f(v) is conditionally independent of rest of G given f(N(v)) Graph labeling model
48
ECML/PKDDChakrabarti 200447 Markov assumption: label of v does not depend on unknown labels outside the set N U (v) Sum over all possible labelings of unknown neighbors of v …given edges, text, parital labeling Probability of labeling specific node v… Markov graph labeling Circularity between f(v) and f(N U (v)) Some form of iterative Gibb’s sampling or MCMC
49
ECML/PKDDChakrabarti 200448 Iterative labeling in pictures c=class, t=text, N=neighbors Text-only model: Pr[c|t] Using neighbors’ labels to update my estimates: Pr[c|t, c(N)] Repeat until histograms stabilize Local optima possible ?
50
ECML/PKDDChakrabarti 200449 Take the expectation of this term over N U (v) labelings Joint distribution over neighbors approximated as product of marginals Label estimates of v in the next iteration Iterative labeling: formula Sum over all possible N U (v) labelings still too expensive to compute In practice, prune to most likely configurations By the Markov assumption, finally we need a distribution coupling f(v) and v T (the text on v) and f(N(v))
51
ECML/PKDDChakrabarti 200450 Labeling results 9600 patents from 12 classes marked by USPTO Patents have text and cite other patents Expand test patent to include neighborhood ‘Forget’ fraction of neighbors’ classes Good for other data sets as well
52
ECML/PKDDChakrabarti 200451 Undirected Markov networks Clique c C(G) a set of completely connected nodes Clique potential c (V c ) a function over all possible configurations of nodes in V c x = vector of observed features at all nodes y = vector of labels at all nodes (mostly unknown) Label variable Local feature variable Instance Label coupling
53
ECML/PKDDChakrabarti 200452 This sum is often a mess, so we need to get back to Gibb’s sampling/MCMC anyway Conditional Markov networks Decompose conditional label probabilities into product over clique potentials Parametric model for clique potentials Params of modelFeature functions
54
ECML/PKDDChakrabarti 200453 Markov network: results Improves beyond text- only Need more comparisons with iterative labeling Can add many redundant features, e.g. are two links in the same HTML “section”? Further improves accuracy
55
ECML/PKDDChakrabarti 200454 Graph labeling: Important special cases Implementation issues Look for large cliques Still need sampling to estimate Z Trees and chains have cliques of size 2 Estimation involves Iterative numerical optimization (as before) No sampling; direct estimation of Z possible Many applications Part-of-speech and named-entity tagging Disambiguation, alias analysis
56
ECML/PKDDChakrabarti 200455 Summary Extracting more structure from the query This talk: using linguistic features In general: query/click sessions, profile info, … Compiling fragments of structure from a large corpus Supervised sequence models Semi-supervised bootstrapping techniques Labeling and searching graphs Coarse grained e.g. Web graph, node = page Fine grained e.g. node = person, email, … Pieces sometimes need each other
57
ECML/PKDDChakrabarti 200456 Structure in the corpus Structure in the query “Free-format text”HTMLXMLEntity+relations Bag-of-words Bool,Prox XQuery SQL Domain of IR, time to analyze queries much more deeply So far, largely for power-users, can be generated by programs Broad, deep and ad-hoc schema and data integration: Very difficult! No schema; either map to schema or support query on uninterpreted graphs Defined many real problems away; “solved” apart from performance issues Defining, labeling, extracting and ranking answers are the major issues; no universal models; need many more applications 1 3 2
58
ECML/PKDDChakrabarti 200457 Concluding messages Work much harder on questions Break down into what’s known, what’s not Find fragments of structure when possible Exploit user profiles and sessions Perform limited pre-structuring of corpus Difficult to anticipate all needs and applications Extract graph structure where possible (e.g. is-a) Do not insist on specific schema Exploit statistical tools on graph models Models of influence along links getting clearer Estimation not yet ready for very large-scale use Much room for creative feature and algo design
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.