Breaking through the syntax barrier: Searching with entities and relations
Soumen Chakrabarti
IIT Bombay
Wish upon a textbox, 1996
- (screenshot of a bare search box) "Your information need here"

Wish upon a textbox, 1998
- "A rising tide of data lifts all algorithms"
- (same search box) "Your information need here"

Wish upon a textbox, post-IPO
- Now indexing 4,285,199,774 pages
- Same interface, therefore same 2-word queries
- A mind-reading wizard-in-a-black-box saves the day
- "Your information need (still) here"
"If music had been invented ten years ago along with the Web, we would all be playing one-string instruments (and not making great music)."
-- Udi Manber, A9.com, plenary speech, WWW 2004
Examples of "great music"
- The commercial angle (they'll call you):
  - Want to buy X: find reviews and prices
  - Cheap tickets to Y
- Noun-phrase + PageRank saves the day:
  - Find info about diabetes
  - Find the homepage of Tom Mitchell
- Searching vertical portals, with garbage control:
  - Searching CiteSeer or IMDB through Google
- Someone out there had the same problem:
  - +adaptec +aha2940uw +lilo +bios
... and not-so-great music
- Which produces better responses?
  - "Opera fails to connect to secure IMAP tunneled through SSH"
  - "opera connect imap ssh tunnel"
- Telegraphic queries cannot express many details of the information need:
  - Opera the client, not a kind of music
  - The problem is with Opera, not with ssh, imap, or the applet
  - "Secure" is an attribute of imap, but the two words may not be juxtaposed in relevant pages
Why telegraphic queries fail
- The information need is about entities and relationships in the real world, but the search engine sees only strings
- Queries risk being over- or under-specified:
  - We never know the true recall
  - Users have no time to deal with poor precision
- The query word distribution is dramatically different from the corpus distribution
- A query is inherently incomplete: it fixes some information and asks for the rest
Roadmap: structure in the corpus vs. structure in the query
- Corpus axis: "free-format text" → HTML → XML → entities + relations
- Query axis: bag-of-words → Boolean/proximity → XQuery → SQL
- Bag-of-words over free text is the domain of IR; it is time to analyze queries much more deeply (Part 1)
- XQuery and SQL have so far been largely for power users, or generated by programs
- Broad, deep, ad-hoc schema and data integration is very difficult
- With no schema, we must either map to a schema or support queries on uninterpreted graphs (Part 3)
- SQL over clean entities and relations defined many real problems away; it is "solved" apart from performance issues
- Defining, labeling, extracting, and ranking answers are the major issues; there are no universal models, and we need many more applications (Part 2)
Past the syntax barrier: early steps
1. Taking the question apart
   - A question has known parts and unknown "slots"
   - Query-dependent information extraction
2. Compiling relations from the Web
   - is-instance-of (is-a), is-subclass-of
   - is-part-of, has-attribute
3. Graph models for textual data
   - Searching graphs with keywords and twigs
   - Global probabilistic graph labeling
Part 1: Working harder on the question
Atypes and ground constants
- Specialize a given domain (the atype) to a token related to the ground constants in the query
- What animal is Winnie the Pooh?
  → instance-of("animal") NEAR "Winnie the Pooh"
- When was television invented?
  → instance-of("time") NEAR "television" NEAR synonym("invented")
- General form: FIND x NEAR GroundConstants(question) WHERE x IS-A Atype(question)
- Ground constants: Winnie the Pooh, television
- Atypes (answer types): animal, time
Taking the question apart
- Atype: the type of the entity that answers the question
- Problem: we do not want to compile a classification hierarchy of entities
  - Laborious, and it cannot keep up with the corpus
  - Offline rather than question-driven
- Instead:
  - Set up a very large basis of features
  - "Project" the question and the corpus onto this basis
Scoring tokens for correct atypes
- FIND x "NEAR" GroundConstants(question) WHERE x IS-A Atype(question)
- No fixed question or answer type system
- Convert "x IS-A Atype(question)" into a soft match DoesAtypeMatch(x, question)
- Architecture (block diagram on the slide): question tokens and answer-passage tokens are run through IE-style surface feature extractors, IS-A feature extractors, and other extractors; the resulting question feature vector and snippet feature vector are fed to a learner that models their joint distribution
Features for atype matching
- Question features: 1-, 2-, and 3-token sequences starting with standard wh-words (where, when, who, how_X, ...)
- Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos, ...
- Passage IS-A features: all generalizations of all noun senses of the token, using WordNet:
  horse → equid → ungulate, hoofed mammal → placental mammal → ... → animal → ... → entity
- These features are node IDs ("synsets") in WordNet, not strings
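The IS-A features can be read straight off WordNet's hypernym closure. A minimal sketch using NLTK's WordNet interface (the function name is mine; NLTK writes synset IDs as horse.n.01 rather than the slide's horse#n#1):

```python
from nltk.corpus import wordnet as wn

def isa_features(token):
    """Synset IDs of all generalizations of all noun senses of `token`."""
    feats = set()
    for sense in wn.synsets(token, pos=wn.NOUN):
        feats.add(sense.name())                     # the sense itself
        for hyper in sense.closure(lambda s: s.hypernyms()):
            feats.add(hyper.name())                 # every generalization
    return feats

# 'horse' fires equid, ungulate, placental mammal, animal, ..., entity
print(sorted(isa_features("horse")))
```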
Supervised learning setup
- Questions and answers taken from logs
- Get the top 300 passages from an IR engine
  - "Promising but negative" instances: a crude approximation to active learning
- For each token, invoke the feature extractors
  - Question vector x_q, passage vector x_p
  - How should the combined vector x be represented?
- Label = 1 if the token is in the answer span, 0 otherwise
Joint feature-vector design
- The obvious "linear" juxtaposition x = (x_q, x_p) does not expose pairwise dependencies
- The "quadratic" form x = x_q ⊗ x_p takes all pairwise products of elements
  - The model has a parameter for every (question feature, passage feature) pair; the slide's examples are how_far, when, what_city on the question side and region#n#3, entity#n#1 on the passage side
  - Can discount for redundancy in pair information
- If x_q is held fixed, which x_p yields the largest Pr(Y=1 | x) (and vice versa)?
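A minimal sketch of the two joint representations, assuming dense numpy vectors (real feature spaces are sparse, but the contrast is the same):

```python
import numpy as np

def linear_join(xq, xp):
    """Concatenation: no parameter ever sees a (question, passage) pair."""
    return np.concatenate([xq, xp])

def quadratic_join(xq, xp):
    """Outer product: one parameter per (question feature, passage feature)
    pair, e.g. (when, time_period#n#1) or (what_city, region#n#3)."""
    return np.outer(xq, xp).ravel()

xq = np.array([1.0, 0.0, 1.0])   # question features that fired
xp = np.array([0.0, 1.0])        # passage features that fired
print(linear_join(xq, xp).shape, quadratic_join(xq, xp).shape)  # (5,) (6,)
```

With a logistic model over the quadratic features, fixing x_q collapses the weight matrix to a single weight vector over passage features, so the most favorable x_p can be read off directly; that is the question posed on the slide.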
Classification accuracy
- The paired (quadratic) model is more accurate than the linear model
- Are the estimated parameters w meaningful?
- Given a question, we can return the answer features with the most favorable weights
Parameter anecdotes
- Surface and WordNet features complement each other
- Very general concepts get negative parameters: useful in predictive annotation
- Learning is symmetric between questions and answers (Q ↔ A)
Taking the question apart (continued)
- Atype: the type of the entity that answers the question
- Ground constants: which question words are likely to appear (almost) unchanged in an answer passage?
- The same issue arises in Web search sessions:
  "Opera login fails" → "problem with login Opera" → "Opera login accept password" → "Opera account authentication" → ...
Features to identify ground constants
- Local and global features:
  - POS of the word, POS of adjacent words, case information, proximity to a wh-word
  - Suppose the word is associated with synset set S:
    - NumSense: size of S (is the word very polysemous?)
    - NumLemma: average number of lemmas describing each s ∈ S (are there many aliases?)
- Model as a sequential learning problem:
  - Each token gets local context features and global features
  - Label: does the token appear near an answer?
- (A sketch of the WordNet-derived global features follows.)
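NumSense and NumLemma are cheap to compute from WordNet. A sketch with NLTK (the function name is mine):

```python
from nltk.corpus import wordnet as wn

def ambiguity_features(word):
    """NumSense: how polysemous is the word?  NumLemma: on average,
    how many lemmas (aliases) describe each of its senses?"""
    senses = wn.synsets(word)
    num_sense = len(senses)
    num_lemma = (sum(len(s.lemmas()) for s in senses) / num_sense
                 if senses else 0.0)
    return num_sense, num_lemma

# Highly polysemous words make poor ground constants
print(ambiguity_features("opera"), ambiguity_features("television"))
```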
Ground constants: sample results
- Global features (IDF, NumSense, NumLemma) are essential for accuracy:
  - Best F1 with local features alone: 71-73%
  - With local and global features: 81%
- Decision trees beat logistic regression: F1 = 81% vs. 75% for LR
- The decision branches are intuitive
Summary of the atype strategy
- A "basis" of atypes A; each a ∈ A can be a synset, a surface pattern, or a feature of a parse tree
- The question q is "projected" to a vector (w_a : a ∈ A) in atype space via a learned conditional model
  - If q starts with "when..." or "how long...", then w_hasDigit and w_time_period#n#1 are large, while w_region#n#1 is small
- Each corpus token t has an associated indicator feature 1_a(t) for every a ∈ A
  - Examples: 1_hasDigit(3,000) = 1, 1_is-a(region#n#1)(Japan) = 1
Reward proximity to ground constants
- A token t is a candidate answer if H_q(t) is high, where H_q applies the question's projection into atype space to t's atype indicator features
- Reward tokens appearing "near" ground constants matched from the question
- Order tokens by decreasing combined score
- Example passage: "...the armadillo, found in Texas, is covered with strong horny plates"
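Combining the two signals, a schematic scorer; the linear proximity decay and the multiplicative combination are my illustrative assumptions, not the talk's exact formula:

```python
def score_token(token_feats, atype_weights, dist_to_ground, max_dist=20):
    """token_feats: atype indicators firing on the token, e.g.
    {'hasCap', 'is-a:animal#n#1'}; atype_weights: the question's
    projection (w_a); dist_to_ground: tokens to the nearest matched
    ground constant."""
    if dist_to_ground > max_dist:
        return 0.0                                  # not "near" any constant
    atype_score = sum(atype_weights.get(a, 0.0) for a in token_feats)
    proximity = 1.0 - dist_to_ground / max_dist     # assumed linear decay
    return atype_score * proximity

w = {"is-a:animal#n#1": 2.1, "hasCap": 0.4, "is-a:region#n#1": -0.9}
# a candidate token two positions from a matched ground constant
print(score_token({"is-a:animal#n#1"}, w, dist_to_ground=2))
```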
Evaluation: mean reciprocal rank (MRR)
- n_q = smallest rank among the passages containing an answer to question q
- MRR = (1/|Q|) Σ_{q ∈ Q} 1/n_q
- Dropping a passage from rank 1 to rank 2 is as bad as dropping it from rank 2 to not reporting it at all
- Experiment setup: 300 passages with top IR scores
  - If Pr(Y=1 | token) < threshold, reject the token
  - If all of a passage's tokens are rejected, reject the passage
- In the scatter plot, points below the diagonal are good
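The measure itself is a one-liner; a sketch, with None standing for "no answer passage returned":

```python
def mean_reciprocal_rank(best_ranks):
    """best_ranks: for each question q, the smallest rank n_q of a
    passage containing an answer, or None if none was returned."""
    rr = [1.0 / n if n is not None else 0.0 for n in best_ranks]
    return sum(rr) / len(rr)

# Rank 1 -> 2 costs 0.5; rank 2 -> "not returned" also costs 0.5
print(mean_reciprocal_rank([1, 2, None]))   # (1 + 0.5 + 0) / 3 = 0.5
```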
Sample results
- Accepting all tokens reproduces the IR baseline MRR
- A moderate acceptance threshold eliminates non-answer passages and improves answer ranks
- Too high a threshold eliminates true answers: the question falls back to another answer with a poor rank, or to rank = ∞
- Proximity filtering yields additional benefits
Part 2: Compiling fragments of soft schema
Who provides is-a info?
- Compiled knowledge bases: WordNet, CYC
- Automatic "soft" compilations: Google Sets, KnowItAll, BioText
- These can be used as evidence when scoring answers
Extracting is-instance-of info
- "Which researcher built the WHIRL system?" WordNet may not know that Cohen IS-A researcher
- Google has over 4.2 billion pages:
  - "william cohen" occurs on 86.1k pages (p_1 = 86.1k/4.2B)
  - researcher occurs on 4.55M pages (p_2 = 4.55M/4.2B)
  - +researcher +"william cohen" occurs on 1730 pages: 18.55x more frequent than expected if the two were independent
- This is pointwise mutual information (PMI); a sketch follows
- Can add high-precision, low-recall patterns:
  - "cities such as New York" (26,600 hits)
  - "professor Michael Jordan" (101 hits)
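The PMI test needs only four counts. A sketch using the slide's (dated) hit counts:

```python
import math

def pmi(hits_xy, hits_x, hits_y, total_pages):
    """Log of observed co-occurrence over the co-occurrence expected
    if x and y appeared on pages independently."""
    expected = hits_x * hits_y / total_pages
    return math.log(hits_xy / expected)

# "william cohen": 86.1k pages; researcher: 4.55M; both: 1730; 4.2B total
ratio = 1730 / (86_100 * 4_550_000 / 4_200_000_000)
print(round(ratio, 2))                               # 18.55x over independence
print(pmi(1730, 86_100, 4_550_000, 4_200_000_000))   # log(18.55)
```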
Bootstrapping lists of instances
- Hearst 1992, Brin 1997, Etzioni 2004
- A "propose-validate" approach:
  - Using existing patterns, generate queries
  - For each web page w returned, extract a potential fact e and assign it a confidence score
  - Add the fact to the database if its score is high enough
- Example patterns:
  - NP1 {,} {such as | and other | including} NPList2
  - NP1 is a NP2; NP1 is the NP2 of NP3
  - the NP1 of NP2 is NP3
- Start with NP1 = researcher, etc.
- (A toy version of the first pattern's propose step follows.)
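A toy version of the propose step for the first pattern; real systems use a shallow NP chunker rather than this crude capitalized-token regex:

```python
import re

# "NP1 {,} such as NPList2": propose each NP in the list as an
# instance of the class named by NP1
PATTERN = re.compile(
    r"(\w+)\s*,?\s+such as\s+"
    r"((?:[A-Z]\w*(?:\s+[A-Z]\w*)*,?\s*(?:and\s+)?)+)")

def propose_instances(text):
    facts = []
    for cls, np_list in PATTERN.findall(text):
        for inst in re.split(r",\s*|\s+and\s+", np_list):
            if inst:
                facts.append((inst.strip(), cls))   # instanceOf(inst, cls)
    return facts

print(propose_instances("cities such as New York, Paris and Mumbai"))
# [('New York', 'cities'), ('Paris', 'cities'), ('Mumbai', 'cities')]
```

Each proposed fact would then be validated, e.g. by a PMI-style score against the class name, before entering the database.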
System details
- Shallow linguistics must work together with the statistical tests:
  - China is a (country)_NP in Asia
  - Garth Brooks is a ((country)_ADJ (singer)_N)_NP, where "country" merely modifies the head "singer"
- Unary relation extraction rule:
  NP1 such as NPList2
  & head(NP1) = plural(name(Class1))
  & properNoun(head(each(NPList2)))
  → instanceOf(Class1, head(each(NPList2)))
- "head" denotes the head of a phrase
Compilation performance
- Recall-vs-precision curves expose the size and difficulty of a domain: "US state" is easy, "country" is difficult
- To improve the signal-to-noise (STN) ratio, stop extracting when the confidence score falls below a threshold; this substantially improves recall-vs-precision
Exploiting is-a info for ranking
- Use PMI scores as additional features in the atype-matching architecture: PMI scores obtained from search-engine probes join the question and snippet feature vectors in the learned joint model
- Challenge: make frugal use of expensive inverted-index probes
Roadmap, revisited: Part 3 addresses the no-schema corner of the corpus/query grid and splits into
- 3a: schema-free search on labeled graphs
- 3b: labeling graphs
Mining graphs: applications abound
- No clean schema, and the data changes rapidly
- Lots of generic "graph proximity" information
- Example scenario (pictured on the slide): an email saying "...your ECML submission titled XYZ has been accepted...", a later email asking "Could I get a preprint of your recent ECML paper?", and a PDF file (title XYZ, author A. U. Thor, abstract, last-modified time) are all connected through a canonical node for the paper; we want to quickly find that file
Part 3a: Searching graphs
Some useful graph query styles
- Find a single node, of a specified type, strongly activated by given nodes:
  paper NEAR { author="A. U. Thor" conference="ECML" time="now" }
  - We may not know the exact types of the activator nodes
- Find the connected subgraph that best explains how or why two or more nodes are related:
  connect { "Faloutsos" "Roussopoulos" }
  - Reward short paths and parallel paths; penalize distractions
- In general, the query is a small graph with match clauses
Measuring activation of single nodes
- Example query: movie (near "richard gere" "julia roberts")
- (figure: data graph with activation spreading from the two query nodes)
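The slides do not pin down the activation measure; random walk with restart (personalized PageRank) is one standard choice that rewards short and parallel paths. A sketch with networkx, on a made-up toy graph:

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("richard gere", "Pretty Woman"), ("julia roberts", "Pretty Woman"),
    ("richard gere", "Chicago"), ("julia roberts", "Notting Hill"),
])
node_type = {"Pretty Woman": "movie", "Chicago": "movie",
             "Notting Hill": "movie"}

# Concentrate the restart mass on the activator nodes from the query
scores = nx.pagerank(G, alpha=0.85,
                     personalization={"richard gere": 0.5,
                                      "julia roberts": 0.5})

movies = {v: s for v, s in scores.items() if node_type.get(v) == "movie"}
print(max(movies, key=movies.get))   # 'Pretty Woman': reachable from both
```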
Reporting connecting subgraphs
- (figure: example connection subgraphs linking the query nodes)
Schema-free twig queries
- The query is a small graph, very light on types:
  - Node types (dblpauthor, vldbjournal)
  - Simple predicates to qualify nodes (string, number)
- Matching against the data graph:
  - Precisely defined matches for local predicates
  - Reward for graph proximity
Sample twig result
- A VLDB Journal article from 2001, connected to Gerhard Weikum's home page, connected in turn to the Max Planck Institute home page
Main issues
- What style of queries should be supported? No end user will type XQuery directly, but bag-of-words is too weak
- How should result nodes and subgraphs be ranked? Reward text match, graph proximity, ...
- "Iceberg queries": report only the top few results after a lot of computation?
- Can ranking be learnt, as in question answering? How can supervision be reused and retargeted?
Part 3b: Probabilistic graph labeling
Identifying node types
- How do we know a node has a given type?
  - Find a person near "PageRank"
  - Find a student near (the homepage of Tom Mitchell)
  - Find {verb, ####, "television"} close together
- Nodes in our graph have:
  - Visible attributes: "hobbies", "my advisor is", "start-of-sentence"
  - Invisible labels: "student", "verb"
  - Neighbors: via HREF links or linear juxtaposition in text
- Goal: given some node labels, guess all the others
Example: hypertext classification
- Want to assign labels to Web pages
- The text on a single page may be too little, or misleading; a page is not an isolated instance
- Problem setup:
  - Web graph G = (V, E); node u ∈ V is a page with text u_T; the edges in E are hyperlinks
  - Some nodes are labeled
- Make a collective guess at the missing labels. Why a probabilistic model, and what are its benefits?
Graph labeling model
- Let V_K be the nodes with known labels and f(V_K) their known label assignments
- Let N(v) be the neighbors of v, and N_K(v) ⊆ N(v) the neighbors with known labels
- Seek a labeling f of all unlabeled nodes that maximizes the probability of those labels given the graph, the node texts, and f(V_K)
- Markov assumption: f(v) is conditionally independent of the rest of G given f(N(v))
Markov graph labeling
- The probability of labeling a specific node v, given the edges, the text, and the partial labeling, is a sum over all possible labelings of v's unknown neighbors N_U(v)
- Markov assumption: the label of v does not depend on unknown labels outside the set N_U(v)
- There is a circularity between f(v) and f(N_U(v)): resolved by some form of iterative Gibbs sampling or MCMC
Iterative labeling in pictures
- Notation: c = class, t = text, N = neighbors
- Start from a text-only model: Pr[c | t]
- Use neighbors' labels to update each node's estimates: Pr[c | t, c(N)]
- Repeat until the class histograms stabilize
- Local optima are possible
Iterative labeling: formula
- The sum over all possible labelings of N_U(v) is still too expensive to compute
- So take the expectation of the coupling term over N_U(v) labelings, approximating the joint distribution over neighbors as a product of their marginals; this gives the label estimates for v in the next iteration
- In practice, prune to the most likely configurations
- By the Markov assumption, all we finally need is a distribution coupling f(v) with v_T (the text on v) and f(N(v))
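A compact sketch of the resulting iteration: keep a class histogram per node, clamp the known labels, and repeatedly fold each unknown node's neighbors' current marginals into its text posterior. The damped multiplicative coupling below is an illustrative stand-in for the learned coupling distribution:

```python
import numpy as np

def iterative_labeling(text_post, neighbors, known, n_iter=20):
    """text_post[v]: Pr[c | text of v] as a numpy vector; neighbors[v]:
    list of adjacent nodes; known[v]: one-hot vector for labeled nodes."""
    marg = {v: known.get(v, p).copy() for v, p in text_post.items()}
    for _ in range(n_iter):
        for v, p_text in text_post.items():
            if v in known:
                continue                       # known labels stay clamped
            p = p_text.copy()
            for u in neighbors[v]:
                # product-of-marginals approximation to the sum over
                # neighbor labelings (0.5 damping avoids hard zeros)
                p *= 0.5 + 0.5 * marg[u]
            marg[v] = p / p.sum()
    return marg

post = {1: np.array([0.6, 0.4]), 2: np.array([0.5, 0.5])}
nbrs = {1: [2], 2: [1]}
# The labeled neighbor pulls node 2 toward class 0: approx [0.67, 0.33]
print(iterative_labeling(post, nbrs, known={1: np.array([1.0, 0.0])})[2])
```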
Labeling results
- 9600 patents from 12 classes marked by the USPTO; patents have text and cite other patents
- Each test patent is expanded to include its neighborhood, and a fraction of the neighbors' classes is "forgotten"
- The approach works well on other data sets too
Undirected Markov networks
- A clique c ∈ C(G) is a set of completely connected nodes
- A clique potential φ_c(V_c) is a function over all possible configurations of the nodes in V_c
- x = vector of observed features at all nodes (the instances); y = vector of labels at all nodes (mostly unknown)
- The network couples each label variable to its node's local feature variable and, through cliques, to neighboring labels
Conditional Markov networks
- Decompose the conditional label probability into a product over clique potentials:
  Pr(y | x) = (1/Z(x)) Π_{c ∈ C(G)} φ_c(x_c, y_c)
- Parametric model for the clique potentials, with model parameters w_k and feature functions f_k:
  φ_c(x_c, y_c) = exp(Σ_k w_k f_k(x_c, y_c))
- The normalizing sum Z(x) is often a mess, so we need to fall back on Gibbs sampling/MCMC anyway
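In the special case of edge cliques and a single "labels agree" feature, the whole model fits in a few lines; the brute-force Z below is exactly the "mess" that forces sampling on larger graphs (the feature and weights are illustrative):

```python
import math
from itertools import product

def potential(w, feats, xc, yc):
    """Log-linear clique potential: exp(sum_k w_k * f_k(x_c, y_c))."""
    return math.exp(sum(wk * f(xc, yc) for wk, f in zip(w, feats)))

feats = [lambda xc, yc: 1.0 if yc[0] == yc[1] else 0.0]  # labels agree?
w = [1.5]

def pr_y_given_x(y, x, edges, n_labels=2):
    score = lambda yy: math.prod(
        potential(w, feats, x, (yy[u], yy[v])) for u, v in edges)
    Z = sum(score(yy)                          # brute-force partition sum
            for yy in product(range(n_labels), repeat=len(y)))
    return score(y) / Z

# Two nodes joined by one edge; agreeing labels are rewarded
print(pr_y_given_x((0, 0), x=None, edges=[(0, 1)]))   # ~0.41
```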
Markov network: results
- Improves beyond the text-only baseline, though more comparisons with iterative labeling are needed
- Many redundant features can be added, e.g. do two links appear in the same HTML "section"? This further improves accuracy
Graph labeling: important special cases
- Implementation issues on general graphs: look for large cliques; sampling is still needed to estimate Z
- Trees and chains have cliques of size 2:
  - Estimation involves iterative numerical optimization (as before)
  - No sampling: direct estimation of Z is possible
- Many applications: part-of-speech and named-entity tagging; disambiguation and alias analysis
Summary
- Extracting more structure from the query
  - This talk: using linguistic features
  - In general: query/click sessions, profile info, ...
- Compiling fragments of structure from a large corpus
  - Supervised sequence models
  - Semi-supervised bootstrapping techniques
- Labeling and searching graphs
  - Coarse-grained: e.g. the Web graph, node = page
  - Fine-grained: e.g. node = person, ...
- The pieces sometimes need each other
Concluding messages
- Work much harder on questions
  - Break them down into what is known and what is not
  - Find fragments of structure when possible
  - Exploit user profiles and sessions
- Perform limited pre-structuring of the corpus
  - It is difficult to anticipate all needs and applications
  - Extract graph structure where possible (e.g. is-a)
  - Do not insist on a specific schema
- Exploit statistical tools on graph models
  - Models of influence along links are getting clearer
  - Estimation is not yet ready for very large-scale use
  - Much room for creative feature and algorithm design