2 Breaking through the syntax barrier: Searching with entities and relations
Soumen Chakrabarti, IIT Bombay
www.cse.iitb.ac.in/~soumen
ECML/PKDD 2004

3 Wish upon a textbox, 1996
- Your information need here

4 Wish upon a textbox, 1998
- "A rising tide of data lifts all algorithms"
- Your information need here

5 Wish upon a textbox, post-IPO
- Now indexing 4,285,199,774 pages
- Same interface, therefore same 2-word queries
- Mind-reading wizard-in-a-black-box saves the day
- Your information need (still) here

6 "If music had been invented ten years ago along with the Web, we would all be playing one-string instruments (and not making great music)."
- Udi Manber, A9.com, plenary speech, WWW 2004

7 Examples of "great music"…
- The commercial angle: they'll call you
  - Want to buy X, find reviews and prices
  - Cheap tickets to Y
- Noun-phrase + PageRank saves the day
  - Find info about diabetes
  - Find the homepage of Tom Mitchell
- Searching vertical portals: garbage control
  - Searching CiteSeer or IMDB through Google
- Someone out there had the same problem
  - +adaptec +aha2940uw +lilo +bios

8 … and not-so-great music
- Which produces better responses?
  - "Opera fails to connect to secure IMAP tunneled through SSH"
  - "opera connect imap ssh tunnel"
- Unable to express many details of the information need
  - Opera the email client, not a kind of music
  - The problem is with Opera, not ssh, imap, or the applet
  - "Secure" is an attribute of imap, but may not be juxtaposed with it

9 Why telegraphic queries fail
- Information need relates to entities and relationships in the real world
- But the search engine gets only strings
- Risk of over- or under-specified queries
  - Never know true recall
  - No time to deal with poor precision
- Query word distribution dramatically different from corpus distribution
  - Query is inherently incomplete
  - Fix some info, look for the rest

10 Structure in the corpus vs. structure in the query (2-D map)
- Corpus axis: "free-format text", HTML, XML, entity+relations
- Query axis: bag-of-words, Boolean/proximity, XQuery, SQL
- Annotations on the map:
  - Domain of IR; time to analyze queries much more deeply
  - So far, largely for power users; can be generated by programs
  - Broad, deep and ad hoc schema and data integration: very difficult!
  - No schema; either map to a schema or support queries on uninterpreted graphs
  - Defined many real problems away; "solved" apart from performance issues
  - Defining, labeling, extracting and ranking answers are the major issues; no universal models; need many more applications
- The markers 1, 2, 3 locate the three parts of the talk on this map

11 Past the syntax barrier: early steps
1. Taking the question apart
   - Question has known parts and unknown "slots"
   - Query-dependent information extraction
2. Compiling relations from the Web
   - is-instance-of (is-a), is-subclass-of
   - is-part-of, has-attribute
3. Graph models for textual data
   - Searching graphs with keywords and twigs
   - Global probabilistic graph labeling

12 Part 1: Working harder on the question

13 Atypes and ground constants
- Specialize the given domain to a token related to ground constants in the query
  - What animal is Winnie the Pooh? → instance-of("animal") NEAR "Winnie the Pooh"
  - When was television invented? → instance-of("time") NEAR "television" NEAR synonym("invented")
- FIND x NEAR GroundConstants(question) WHERE x IS-A Atype(question)
- Ground constants: Winnie the Pooh, television
- Atypes: animal, time

14 Taking the question apart
- Atype: the type of the entity that is an answer to the question
- Problem: we don't want to compile a classification hierarchy of entities
  - Laborious, can't keep up
  - Offline rather than question-driven
- Instead:
  - Set up a very large basis of features
  - "Project" the question and the corpus onto this basis

15 Scoring tokens for correct Atypes
- FIND x "NEAR" GroundConstants(question) WHERE x IS-A Atype(question)
- No fixed question or answer type system
- Convert "x IS-A Atype(question)" to a soft match DoesAtypeMatch(x, question)
- [Diagram: the question and the passage (answer tokens) pass through IE-style surface feature extractors, IS-A feature extractors, and other extractors to produce a question feature vector and a snippet feature vector; a joint distribution over the two is learned]

16 Features for Atype matching
- Question features: 1-, 2-, 3-token sequences starting with standard wh-words: where, when, who, how_X, …
- Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos, …
- Passage IS-A features: all generalizations of all noun senses of the token
  - Use WordNet: horse → equid → ungulate, hoofed mammal → placental mammal → animal → … → entity
  - These are node IDs ("synsets") in WordNet, not strings
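Below is a minimal sketch, using NLTK's WordNet interface (an assumption; the talk does not name its toolkit), of how the passage IS-A features can be generated as hypernym synset IDs:

```python
# Sketch: emit every WordNet hypernym synset ID of every noun sense of a token
# as a binary IS-A feature. Requires `pip install nltk` and nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def isa_features(token: str) -> set:
    """Return the set of synset IDs that generalize any noun sense of `token`."""
    feats = set()
    for sense in wn.synsets(token, pos=wn.NOUN):            # all noun senses
        feats.add(sense.name())                              # the sense itself, e.g. 'horse.n.01'
        for hyper in sense.closure(lambda s: s.hypernyms()):
            feats.add(hyper.name())                          # e.g. 'equine.n.01', ..., 'entity.n.01'
    return feats

print(sorted(isa_features("horse")))   # includes 'equine.n.01', 'animal.n.01', 'entity.n.01'
```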

17 Supervised learning setup
- Get the top 300 passages from an IR engine
  - "Promising but negative" instances
  - Crude approximation to active learning
- For each token, invoke the feature extractors
- Question vector x_q, passage vector x_p: how to represent the combined vector x?
- Label = 1 if the token is in an answer span, 0 otherwise
  - Questions and answers from logs

18 Joint feature-vector design
- Obvious "linear" juxtaposition x = (x_p, x_q)
  - Does not expose pairwise dependencies
- "Quadratic" form x = x_q ⊗ x_p: all pairwise products of elements
  - Model has a parameter for every pair
  - Can discount for redundancy in pair info
- If x_q (resp. x_p) is fixed, what x_p (resp. x_q) will yield the largest Pr(Y=1|x)?
- [Figure: example features how_far, when, what_city on the question side and region#n#3, entity#n#1 on the passage side]
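A small illustrative sketch of the quadratic pairing, assuming toy feature bases and scikit-learn's logistic regression (neither is claimed to match the talk's actual setup); the learned weight matrix has one parameter per (question feature, passage feature) pair:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

q_feats = ["how_far", "when", "what_city"]             # question-side basis (illustrative)
p_feats = ["hasDigit", "region#n#3", "entity#n#1"]     # passage-side basis (illustrative)

def quadratic(x_q: np.ndarray, x_p: np.ndarray) -> np.ndarray:
    """All pairwise products of question and passage features, flattened."""
    return np.outer(x_q, x_p).ravel()

# Toy training triples: (question vector, passage-token vector, label)
samples = [
    ([0, 1, 0], [1, 0, 0], 1),   # "when" question + hasDigit token      -> answer
    ([0, 1, 0], [0, 1, 0], 0),   # "when" question + region token        -> not an answer
    ([0, 0, 1], [0, 1, 0], 1),   # "what_city" question + region token   -> answer
]
X = np.array([quadratic(np.array(xq), np.array(xp)) for xq, xp, _ in samples])
y = np.array([label for _, _, label in samples])

clf = LogisticRegression().fit(X, y)
W = clf.coef_.reshape(len(q_feats), len(p_feats))      # one weight per feature pair
print(W)
```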

19 Classification accuracy
- Pairing is more accurate than the linear model
- Are the estimated w parameters meaningful?
- Given a question, can return the most favorable answer-feature weights

20 Parameter anecdotes
- Surface and WordNet features complement each other
- General concepts get negative parameters: useful in predictive annotation
- Learning is symmetric (Q ↔ A)

21 Taking the question apart
- Atype: the type of the entity that is an answer to the question
- Ground constants: which question words are likely to appear (almost) unchanged in an answer passage?
- Arises in Web search sessions too:
  - "Opera login fails", "problem with login Opera email", "Opera login accept password", "Opera account authentication", …

22 Features to identify ground constants
- Local and global features
  - POS of the word, POS of adjacent words (POS@-1, POS@0, POS@1), case info, proximity to the wh-word
  - Suppose the word is associated with synset set S
    - NumSense: size of S (is the word very polysemous?)
    - NumLemma: average number of lemmas describing each s ∈ S (are there many aliases?)
- Model as a sequential learning problem
  - Each token has local context and global features
  - Label: does the token appear near the answer?
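A minimal sketch of the NumSense and NumLemma global features, again assuming NLTK's WordNet rather than whatever the original system used:

```python
from nltk.corpus import wordnet as wn

def ambiguity_features(word: str) -> dict:
    """Global ambiguity features for one word, as described on the slide."""
    senses = wn.synsets(word)                          # synset set S for the word
    num_sense = len(senses)                            # NumSense: how polysemous?
    num_lemma = (sum(len(s.lemmas()) for s in senses) / num_sense
                 if num_sense else 0.0)                # NumLemma: avg #aliases per sense
    return {"NumSense": num_sense, "NumLemma": num_lemma}

# Highly polysemous words make poor ground constants; specific terms tend
# to have few senses and few aliases.
print(ambiguity_features("opera"), ambiguity_features("television"))
```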

23 Ground constants: sample results
- Global features (IDF, NumSense, NumLemma) are essential for accuracy
  - Best F1 with local features alone: 71-73%
  - With local and global features: 81%
- Decision trees better than logistic regression
  - F1 = 81% versus LR F1 = 75%
  - Intuitive decision branches

24 Summary of the Atype strategy
- "Basis" of atypes A; each a ∈ A could be a synset, a surface pattern, or a feature of a parse tree
- Question q is "projected" to a vector (w_a : a ∈ A) in atype space by learning a conditional model
  - If q is "when…" or "how long…", w_hasDigit and w_time_period#n#1 are large, w_region#n#1 is small
- Each corpus token t has associated indicator features φ_a(t) for every a ∈ A
  - φ_hasDigit(3,000) = 1, φ_is-a(region#n#1)(Japan) = 1

25 Reward proximity to ground constants
- A token t is a candidate answer if it scores well on two signals:
  - the atype indicator features of the token, matched against the projection of the question into "atype space"
  - H_q(t): a reward for tokens appearing "near" ground constants matched from the question
- Order tokens by decreasing combined score
- Example passage: "…the armadillo, found in Texas, is covered with strong horny plates"
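As one hedged illustration of the proximity reward (a deliberate simplification, not the talk's actual H_q), nearby ground constants could contribute a geometrically decaying bonus:

```python
def proximity_score(tokens, ground_constants, decay=0.5):
    """Bonus for each token, decaying with distance to the nearest ground constant.
    `decay` in (0,1): each extra token of distance multiplies the bonus by `decay`."""
    hits = [i for i, t in enumerate(tokens) if t.lower() in ground_constants]
    scores = []
    for i, _ in enumerate(tokens):
        if not hits:
            scores.append(0.0)
            continue
        d = min(abs(i - h) for h in hits)      # distance to nearest ground constant
        scores.append(decay ** d)
    return scores

passage = "the armadillo found in Texas is covered with strong horny plates".split()
print(proximity_score(passage, {"texas"}))     # "armadillo" gets a bonus that decays with distance from "Texas"
```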

26 Evaluation: mean reciprocal rank (MRR)
- n_q = smallest rank among answer passages for question q
- MRR = (1/|Q|) Σ_{q ∈ Q} 1/n_q
  - Dropping a passage from rank #1 to #2 is as bad as dropping it from #2 to not reporting it at all
- Experiment setup:
  - 300 top-IR-score passages
  - If Pr(Y=1|token) < threshold, reject the token
  - If all of a passage's tokens are rejected, reject the passage
- [Scatter plot: points below the diagonal are good]
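The MRR definition above translates directly into code; the helper name and the None convention for unanswered questions are illustrative choices:

```python
def mean_reciprocal_rank(best_answer_ranks):
    """best_answer_ranks[i] = smallest rank of an answer passage for question i,
    or None if no answer passage was reported (contributes 0 to the average)."""
    if not best_answer_ranks:
        return 0.0
    return sum(0.0 if r is None else 1.0 / r
               for r in best_answer_ranks) / len(best_answer_ranks)

# Three questions: answers found at ranks 1 and 2, one question unanswered.
print(mean_reciprocal_rank([1, 2, None]))   # (1 + 0.5 + 0) / 3 = 0.5
```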

27 Sample results
- Accept all tokens → IR baseline MRR
- Moderate acceptance threshold → non-answer passages eliminated, which improves answer ranks
- High threshold → true answers eliminated
  - Another answer gets a poor rank, or rank = ∞
- Additional benefits from proximity filtering

28 Part 2: Compiling fragments of soft schema

29 Who provides is-a info?
- Compiled KBs: WordNet, Cyc
- Automatic "soft" compilations: Google Sets, KnowItAll, BioText
- Can be used as evidence in scoring answers

30 Extracting is-instance-of info
- Which researcher built the WHIRL system? WordNet may not know that Cohen IS-A researcher
- Google has over 4.2 billion pages
  - "william cohen" appears on 86,100 pages (p_1 = 86.1k/4.2B)
  - researcher appears on 4.55M pages (p_2 = 4.55M/4.2B)
  - +researcher +"william cohen" appears on 1,730 pages: 18.55x more frequent than expected if the two were independent
- Pointwise mutual information (PMI)
- Can add high-precision, low-recall patterns
  - "cities such as New York" (26,600 hits)
  - "professor Michael Jordan" (101 hits)
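A small sketch reproducing the slide's PMI-style calculation from the quoted hit counts (variable names are mine; the 4.2B total is the round figure the slide's arithmetic implies):

```python
import math

N = 4_200_000_000                 # total pages ("over 4.2 billion")
hits_cohen = 86_100               # "william cohen"
hits_researcher = 4_550_000       # researcher
hits_both = 1_730                 # +researcher +"william cohen"

p1 = hits_cohen / N
p2 = hits_researcher / N
p12 = hits_both / N

lift = p12 / (p1 * p2)            # ~18.55x more frequent than if independent
pmi = math.log2(lift)             # pointwise mutual information, in bits
print(f"lift={lift:.2f}, PMI={pmi:.2f} bits")
```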

31 Bootstrapping lists of instances
- Hearst 1992, Brin 1997, Etzioni 2004
- A "propose-validate" approach (see the sketch after this slide)
  - Using existing patterns, generate queries
  - For each web page w returned:
    - Extract a potential fact e and assign a confidence score
    - Add the fact to the database if its score is high enough
- Example patterns
  - NP1 {,} {such as | and other | including} NPList2
  - NP1 is a NP2; NP1 is the NP2 of NP3; the NP1 of NP2 is NP3
- Start with NP1 = researcher, etc.
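A toy propose-validate loop over an in-memory corpus, using one Hearst-style pattern; the corpus, the crude singularization, the confidence measure (raw vote count) and the threshold are all illustrative assumptions, not the cited systems' machinery:

```python
import re
from collections import Counter

# "<plural class> such as <list of proper nouns>"
PATTERN = re.compile(
    r"\b(?P<cls>[a-z]+s)\s+such\s+as\s+"
    r"(?P<list>(?:[A-Z]\w*(?:\s[A-Z]\w*)*(?:,\s*|\s+and\s+)?)+)"
)

corpus = [
    "He has visited cities such as New York, Paris and Mumbai.",
    "Search engines index cities such as New York constantly.",
    "We met researchers such as William Cohen and Tom Mitchell.",
]

def singularize(plural):
    # crude: "cities" -> "city", "researchers" -> "researcher"
    return plural[:-3] + "y" if plural.endswith("ies") else plural.rstrip("s")

votes = Counter()
for sentence in corpus:                                   # propose
    for m in PATTERN.finditer(sentence):
        cls = singularize(m.group("cls"))
        for inst in re.split(r",\s*|\s+and\s+", m.group("list")):
            if inst:
                votes[(inst.strip(), cls)] += 1           # candidate instanceOf fact

THRESHOLD = 2                                             # validate: keep well-supported facts
facts = [fact for fact, n in votes.items() if n >= THRESHOLD]
print(facts)                                              # [('New York', 'city')]
```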

32 System details
- The importance of shallow linguistics working together with statistical tests
  - China is a (country)_NP in Asia
  - Garth Brooks is a ((country)_ADJ (singer)_N)_NP
- Unary relation example
  - NP1 such as NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) → instanceOf(Class1, head(each(NPList2)))
  - "Head" of the phrase

33 Compilation performance
- Recall-vs-precision exposes the size and difficulty of the domain
  - "US state" is easy; "country" is difficult
- To improve the signal-to-noise (STN) ratio, stop when the confidence score falls below a threshold
  - Substantially improves recall-vs-precision

34 Exploiting is-a info for ranking
- Use PMI scores as additional features
- Challenge: make frugal use of expensive inverted-index probes
- [Diagram: the earlier question/passage pipeline (IE-style surface features, WordNet IS-A features, atype), now augmented with PMI scores from search-engine probes, feeding the learned joint distribution]

35 Roadmap: the structure-in-corpus vs. structure-in-query map from slide 10, revisited
- Part 3a: schema-free search on labeled graphs
- Part 3b: labeling graphs

36 Mining graphs: applications abound
- No clean schema, and data changes rapidly
- Lots of generic "graph proximity" information
- [Diagram: a personal-information graph linking a PDF file (title XYZ, author A. Thor, abstract, LastMod) to emails ("…your ECML submission titled XYZ has been accepted…", "Could I get a preprint of your recent ECML paper?"), their dates and recipients, and canonical nodes for XYZ, A. U. Thor and ECML; the goal is to quickly find the file]

37 Part 3a: Searching graphs

38 Some useful graph query styles
- Find a single node of a specified type strongly activated by given nodes (see the sketch after this slide)
  - paper NEAR { author="A. U. Thor" conference="ECML" time="now" }
  - May not know the exact types of the activators
- Find the connected subgraph that best explains how/why two or more nodes are related
  - connect { "Faloutsos" "Roussopoulos" }
  - Reward short paths, parallel paths, few distractions
- Query is a small graph with match clauses
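A toy sketch of the first query style, "activation by proximity" (my own simplification on top of networkx, not the BANKS algorithm): score each node of the requested type by how close it is to the activator nodes.

```python
import networkx as nx

G = nx.Graph()
G.add_nodes_from([("paper1", {"type": "paper"}), ("paper2", {"type": "paper"}),
                  ("A. U. Thor", {"type": "author"}), ("ECML", {"type": "conference"})])
G.add_edges_from([("paper1", "A. U. Thor"), ("paper1", "ECML"), ("paper2", "A. U. Thor")])

def activation(G, node, activators):
    """Sum 1/(1+d) over shortest-path distances d to the activator nodes."""
    score = 0.0
    for a in activators:
        try:
            d = nx.shortest_path_length(G, node, a)
            score += 1.0 / (1.0 + d)        # nearer activators contribute more
        except nx.NetworkXNoPath:
            pass                             # unreachable activator adds nothing
    return score

query_type, activators = "paper", ["A. U. Thor", "ECML"]
best = max((n for n, d in G.nodes(data=True) if d["type"] == query_type),
           key=lambda n: activation(G, n, activators))
print(best)   # paper1: adjacent to both activators
```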

39 Measuring activation of single nodes
- Example query: movie (near "richard gere" "julia roberts")
- Demo: http://www.cse.iitb.ac.in/banks/

40 Reporting connecting subgraphs
- Demo: http://www.cse.iitb.ac.in/banks/

41 Schema-free twig queries
- Query is a small graph, very light on types
  - Node type (dblpauthor, vldbjournal)
  - Simple predicates to qualify nodes (string, number)
- Match to the data graph
  - Precisely defined matches for local predicates
  - Reward for proximity

42 Sample twig result
- [Screenshot: a VLDB Journal article (2001) connected to Gerhard Weikum's home page and the Max Planck Institute home page]

43 Main issues
- What style of queries to support?
  - No end user will type XQuery directly
  - But bag-of-words is too weak
- How to rank result nodes and subgraphs?
  - Reward for text match, graph proximity, …
  - "Iceberg queries": report only the top few after a lot of computation?
- Can ranking be learned, as with question answering? How to reuse and retarget supervision?

44 Part 3b: Probabilistic graph labeling

45 Identifying node types
- How do we know a node has a given type?
  - Find person near "Pagerank"
  - Find student near (homepage of Tom Mitchell)
  - Find {verb, ####, "television"} close together
- Nodes in our graph have
  - Visible attributes: "hobbies", "my advisor is", "start-of-sentence"
  - Invisible labels: "student", "verb"
  - Neighbors: "HREF", "linear juxtaposition"
- Goal: given some node labels, guess all the others

46 Example: hypertext classification
- Want to assign labels to Web pages
- Text on a single page may be too little, or misleading
- A page is not an isolated instance by itself
- Problem setup
  - Web graph G = (V, E)
  - Node u ∈ V is a page with text u_T
  - Edges in E are hyperlinks
  - Some nodes are labeled; make a collective guess at the missing labels
- Probabilistic model? Benefits?

47 ECML/PKDDChakrabarti 200446 For the moment we don’t need to worry about this  Seek a labeling f of all unlabeled nodes so as to maximize  Let V K be the nodes with known labels and f(V K ) their known label assignments  Let N(v) be the neighbors of v and N K (v)  N(v) be neighbors with known labels  Markov assumption: f(v) is conditionally independent of rest of G given f(N(v)) Graph labeling model

48 Markov graph labeling
- Probability of labeling a specific node v, given the edges, the text, and the partial labeling: sum over all possible labelings of the unknown neighbors N_U(v) of v
- Markov assumption: the label of v does not depend on unknown labels outside the set N_U(v)
- Circularity between f(v) and f(N_U(v))
- Hence some form of iterative Gibbs sampling or MCMC
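A plausible LaTeX rendering of the per-node formula these annotations describe; this is a reconstruction consistent with the surrounding slides, not a copy of the slide's own equation:

```latex
% Probability of the label of node v given the edges, the text, and the partial
% labeling, marginalizing over the labels of its unknown neighbors N_U(v).
\Pr\bigl(f(v) \mid E, \{u_T\}, f(V_K)\bigr)
  \;=\; \sum_{\phi \,\in\, \text{labelings of } N_U(v)}
        \Pr\bigl(f(v) \mid f(N_K(v)),\, \phi,\, v_T\bigr)\,
        \Pr\bigl(\phi \mid E, \{u_T\}, f(V_K)\bigr)
```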

49 Iterative labeling in pictures
- c = class, t = text, N = neighbors
- Text-only model: Pr[c|t]
- Using neighbors' labels to update my estimates: Pr[c|t, c(N)]
- Repeat until the class histograms stabilize
  - Local optima are possible
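A compact sketch of this iterative scheme (illustrative only; the clamping of known labels, the coupling matrix and the fixed iteration count are my assumptions, not the paper's exact algorithm):

```python
import numpy as np

def iterative_label(text_probs, neighbors, known, coupling, iters=20):
    """text_probs[v]: Pr[c|t] from a text-only classifier (1-D array over classes).
    neighbors[v]: list of neighbor node ids.
    known[v]: fixed class index for labeled nodes.
    coupling[c1, c2]: how compatible class c1 is with a neighbor of class c2."""
    est = {v: p.copy() for v, p in text_probs.items()}
    for v, c in known.items():
        est[v] = np.eye(len(coupling))[c]          # clamp known labels
    for _ in range(iters):
        new = {}
        for v, p_text in text_probs.items():
            if v in known:
                new[v] = est[v]
                continue
            p = p_text.copy()
            for u in neighbors.get(v, []):
                p *= coupling @ est[u]             # expected compatibility with u's labels
            new[v] = p / p.sum()                   # renormalized Pr[c | t, c(N)]
        est = new                                  # repeat until histograms stabilize
    return est

# Toy example: two classes; node "b" is unlabeled and linked to a known class-0 node.
coupling = np.array([[0.9, 0.1], [0.1, 0.9]])      # links favor same-class neighbors
out = iterative_label({"a": np.array([0.5, 0.5]), "b": np.array([0.4, 0.6])},
                      {"b": ["a"]}, {"a": 0}, coupling)
print(out["b"])                                    # pulled toward class 0 by its neighbor
```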

50 Iterative labeling: formula
- Label estimates of v in the next iteration: take the expectation of the per-node term over the labelings of N_U(v)
- The joint distribution over neighbors is approximated as a product of marginals
- The sum over all possible N_U(v) labelings is still too expensive to compute; in practice, prune to the most likely configurations
- By the Markov assumption, we finally need a distribution coupling f(v) with v_T (the text on v) and f(N(v))

51 Labeling results
- 9,600 patents from 12 classes marked by the USPTO
- Patents have text and cite other patents
- Expand each test patent to include its neighborhood
- "Forget" a fraction of the neighbors' classes
- Works well for other data sets too

52 Undirected Markov networks
- Clique c ∈ C(G): a set of completely connected nodes
- Clique potential φ_c(V_c): a function over all possible configurations of the nodes in V_c
- x = vector of observed features at all nodes
- y = vector of labels at all nodes (mostly unknown)
- [Diagram: label variables coupled to each other and to local feature variables within each instance]

53 Conditional Markov networks
- Decompose the conditional label probability into a product over clique potentials
- Parametric model for clique potentials, with model parameters and feature functions
- The normalizing sum is often a mess, so we need to fall back on Gibbs sampling/MCMC anyway
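In standard notation (which the slide appears to follow, though its exact symbols are not recoverable from the transcript), the conditional Markov network decomposition is:

```latex
P(\mathbf{y} \mid \mathbf{x})
  \;=\; \frac{1}{Z(\mathbf{x})} \prod_{c \,\in\, C(G)} \Phi_c(\mathbf{x}_c, \mathbf{y}_c),
\qquad
\Phi_c(\mathbf{x}_c, \mathbf{y}_c)
  \;=\; \exp\!\Bigl(\textstyle\sum_{k} w_k\, \phi_k(\mathbf{x}_c, \mathbf{y}_c)\Bigr),
\qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'} \prod_{c} \Phi_c(\mathbf{x}_c, \mathbf{y}'_c)
```

Here the $w_k$ are the model parameters, the $\phi_k$ are the feature functions, and the partition function $Z(\mathbf{x})$ is the "messy" normalizing sum that pushes us back to Gibbs sampling/MCMC.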

54 Markov network: results
- Improves beyond text-only classification
- Need more comparisons with iterative labeling
- Can add many redundant features, e.g. are two links in the same HTML "section"? Further improves accuracy

55 Graph labeling: important special cases
- Implementation issues
  - Look for large cliques
  - Still need sampling to estimate Z
- Trees and chains have cliques of size ≤ 2
  - Estimation involves iterative numerical optimization (as before)
  - No sampling; direct estimation of Z is possible
- Many applications
  - Part-of-speech and named-entity tagging
  - Disambiguation, alias analysis

56 Summary
- Extracting more structure from the query
  - This talk: using linguistic features
  - In general: query/click sessions, profile info, …
- Compiling fragments of structure from a large corpus
  - Supervised sequence models
  - Semi-supervised bootstrapping techniques
- Labeling and searching graphs
  - Coarse-grained, e.g. the Web graph, node = page
  - Fine-grained, e.g. node = person, email, …
- The pieces sometimes need each other

57 Recap of the structure-in-corpus vs. structure-in-query map from slide 10

58 Concluding messages
- Work much harder on questions
  - Break them down into what's known and what's not
  - Find fragments of structure when possible
  - Exploit user profiles and sessions
- Perform limited pre-structuring of the corpus
  - Difficult to anticipate all needs and applications
  - Extract graph structure where possible (e.g. is-a)
  - Do not insist on a specific schema
- Exploit statistical tools on graph models
  - Models of influence along links are getting clearer
  - Estimation is not yet ready for very large-scale use
  - Much room for creative feature and algorithm design

