Breaking through the syntax barrier: Searching with entities and relations Soumen Chakrabarti IIT Bombay www.cse.iitb.ac.in/~soumen.


Breaking through the syntax barrier: Searching with entities and relations Soumen Chakrabarti IIT Bombay

ECML/PKDD Chakrabarti
Wish upon a textbox, 1996
Your information need here

Wish upon a textbox, 1998
“A rising tide of data lifts all algorithms”
Your information need here

Wish upon a textbox, post-IPO
 Now indexing 4,285,199,774 pages
 Same interface, therefore same two-word queries
 Mind-reading wizard-in-black-box saves the day
Your information need (still) here

If music had been invented ten years ago along with the Web, we would all be playing one-string instruments (and not making great music).
(Udi Manber, A9.com, plenary speech at WWW 2004)

Examples of “great music”…
 The commercial angle: they’ll call you
  Want to buy X, find reviews and prices
  Cheap tickets to Y
 Noun-phrase + PageRank saves the day
  Find info about diabetes
  Find the homepage of Tom Mitchell
 Searching vertical portals: garbage control
  Searching Citeseer or IMDB through Google
 Someone out there had the same problem
  +adaptec +aha2940uw +lilo +bios

… and not-so-great music
 Which produces better responses?
  Opera fails to connect to secure IMAP tunneled through SSH
  opera connect imap ssh tunnel
 Unable to express many details of information need
  Opera the client, not a kind of music
  The problem is with Opera, not ssh, imap, applet
  “Secure” is an attribute of imap, but may not juxtapose

Why telegraphic queries fail
 Information need relates to entities and relationships in the real world
 But the search engine gets only strings
 Risk of over-/under-specified queries
  Never know true recall
  No time to deal with poor precision
 Query word distribution dramatically different from corpus distribution
  Query is inherently incomplete
  Fix some info, look for the rest

Structure in the corpus vs. structure in the query
 Corpus axis: “free-format text”, HTML, XML, entity+relations
 Query axis: bag-of-words, Boolean/proximity, XQuery, SQL
 Annotations on the grid:
  Bag-of-words over free text: the domain of IR; time to analyze queries much more deeply
  XQuery/SQL: so far largely for power-users, can be generated by programs; broad, deep and ad-hoc schema and data integration is very difficult; defined many real problems away, “solved” apart from performance issues
  Entity+relations: no schema; either map to a schema or support queries on uninterpreted graphs; defining, labeling, extracting and ranking answers are the major issues; no universal models; need many more applications
 (The numbers 1-3 marked regions of the grid, taken up as parts 1-3 of the talk)

Past the syntax barrier: early steps
 1. Taking the question apart
  Question has known parts and unknown “slots”
  Query-dependent information extraction
 2. Compiling relations from the Web
  is-instance-of (is-a), is-subclass-of
  is-part-of, has-attribute
 3. Graph models for textual data
  Searching graphs with keywords and twigs
  Global probabilistic graph labeling

Part 1: Working harder on the question

Atypes and ground constants
 Specialize the given domain to a token related to ground constants in the query
  What animal is Winnie the Pooh? → instance-of(“animal”) NEAR “Winnie the Pooh”
  When was television invented? → instance-of(“time”) NEAR “television” NEAR synonym(“invented”)
 FIND x NEAR GroundConstants(question) WHERE x IS-A Atype(question)
  Ground constants: Winnie the Pooh, television
  Atypes: animal, time
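The FIND/WHERE template above can be made concrete in code. A minimal sketch; the decomposition rules, class names and heuristics below are invented for illustration, not taken from the system described in the talk.

```python
from dataclasses import dataclass

@dataclass
class AtypeQuery:
    atype: str              # e.g. "animal", "time"
    ground_constants: list  # e.g. ["Winnie", "Pooh"]

def decompose(question: str) -> AtypeQuery:
    """Toy rule-based decomposition covering the two slide examples."""
    q = question.lower()
    if q.startswith("what animal"):
        atype = "animal"
    elif q.startswith("when"):
        atype = "time"
    else:
        atype = "entity"
    # Crude ground-constant heuristic: capitalized non-question words.
    consts = [w for w in question.rstrip("?").split()
              if w[0].isupper() and w.lower() not in ("what", "when", "which")]
    return AtypeQuery(atype, consts)

print(decompose("What animal is Winnie the Pooh?"))
# AtypeQuery(atype='animal', ground_constants=['Winnie', 'Pooh'])
```

A real system learns these decisions from data rather than hand-coding them, which is exactly what the next slides set up.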

Taking the question apart
 Atype: the type of the entity that is an answer to the question
 Problem: don’t want to compile a classification hierarchy of entities
  Laborious, can’t keep up
  Offline rather than question-driven
 Instead
  Set up a very large basis of features
  “Project” the question and corpus onto this basis

Scoring tokens for correct Atypes
 FIND x “NEAR” GroundConstants(question) WHERE x IS-A Atype(question)
 No fixed question or answer type system
 Convert “x IS-A Atype(question)” to a soft match DoesAtypeMatch(x, question)
[Architecture diagram: the question and the answer-passage tokens each pass through IE-style surface feature extractors (plus IS-A and other extractors), giving a question feature vector and a snippet feature vector; a joint distribution over the two is learned.]

Features for Atype matching
 Question features: 1-, 2-, 3-token sequences starting with standard wh-words: where, when, who, how_X, …
 Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos, …
 Passage IS-A features: all generalizations of all noun senses of the token
  Use WordNet: horse → equid → ungulate, hoofed mammal → placental mammal → … → animal → entity
  These are node IDs (“synsets”) in WordNet, not strings

Supervised learning setup
 Get top 300 passages from an IR engine
  “Promising but negative” instances
  Crude approximation to active learning
 For each token, invoke the feature extractors
 Question vector x_q, passage vector x_p
  How to represent the combined vector x?
 Label = 1 if the token is in the answer span, 0 otherwise
  Questions and answers from logs

Joint feature-vector design
 Obvious “linear” juxtaposition x = (x_p, x_q)
  Does not expose pairwise dependencies
 “Quadratic” form x = x_q ⊗ x_p
  All pairwise products of elements
  Model has a parameter for every pair
  Can discount for redundancy in pair info
 If x_q (resp. x_p) is fixed, what x_p (resp. x_q) will yield the largest Pr(Y=1|x)?
(Example features: how_far, when, what_city; region#n#3, entity#n#1)
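The contrast between the two designs is easy to see concretely. A minimal sketch with made-up feature values (no learning involved):

```python
def linear_join(x_p, x_q):
    """'Linear' juxtaposition x = (x_p, x_q): plain concatenation."""
    return x_p + x_q

def quadratic_join(x_q, x_p):
    """'Quadratic' form x = x_q (outer product) x_p: all pairwise
    products, one model parameter per (question, passage) feature pair."""
    return [a * b for a in x_q for b in x_p]

x_q = [1.0, 0.0, 1.0]   # e.g. (how_far, when, what_city)
x_p = [0.5, 1.0]        # e.g. (region#n#3, entity#n#1)

print(len(linear_join(x_p, x_q)))   # |x_p| + |x_q| = 5 entries
print(quadratic_join(x_q, x_p))     # |x_q| * |x_p| = 6 entries
```

With the quadratic form, fixing x_q makes the score linear in x_p, which is what lets the model answer “which passage features does this question favor?”.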

Classification accuracy
 Pairing more accurate than the linear model
 Are the estimated w parameters meaningful?
 Given a question, can return the most favorable answer feature weights

Parameter anecdotes
 Surface and WordNet features complement each other
 General concepts get negative parameters: use in predictive annotation
 Learning is symmetric (Q ↔ A)

Taking the question apart
 Atype: the type of the entity that is an answer to the question
 Ground constants: which question words are likely to appear (almost) unchanged in an answer passage?
 Arises in Web search sessions too:
  Opera login fails
  problem with login Opera
  Opera login accept password
  Opera account authentication …

Features to identify ground constants
 Local and global features
  POS of the word, POS of adjacent words, case info, proximity to the wh-word
  Suppose the word is associated with synset set S
   NumSense: size of S (is the word very polysemous?)
   NumLemma: average number of lemmas describing each s ∈ S (are there many aliases?)
 Model as a sequential learning problem
  Each token has local context and global features
  Label: does the token appear near an answer?
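The two global features fall straight out of a sense inventory. The dictionary below is a tiny hand-made stand-in for WordNet, invented for the demo; a real system would query WordNet synsets instead.

```python
TOY_SYNSETS = {
    # word -> list of senses; each sense -> list of lemmas (aliases)
    "opera": [["opera", "opera_house"], ["opera"]],           # 2 senses
    "bank":  [["bank", "depository"], ["bank", "riverbank"], ["bank"]],
    "imap":  [["imap"]],                                      # 1 sense
}

def num_sense(word):
    """NumSense: how polysemous is the word?"""
    return len(TOY_SYNSETS.get(word, []))

def num_lemma(word):
    """NumLemma: average number of lemmas per sense (many aliases?)."""
    senses = TOY_SYNSETS.get(word, [])
    if not senses:
        return 0.0
    return sum(len(s) for s in senses) / len(senses)

print(num_sense("bank"), num_lemma("bank"))  # 3 senses, (2+2+1)/3 lemmas
```

Intuitively, a rare, monosemous token like “imap” is a safer ground constant than a polysemous one like “bank”.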

Ground constants: sample results
 Global features (IDF, NumSense, NumLemma) essential for accuracy
  Best F1 with local features alone: 71-73%
  With local and global features: 81%
 Decision trees better than logistic regression
  F1 = 81% as against LR F1 = 75%
  Intuitive decision branches

Summary of the Atype strategy
 “Basis” of atypes A; a ∈ A could be a synset, a surface pattern, or a feature of a parse tree
 Question q “projected” to a vector (w_a : a ∈ A) in atype space via a learned conditional model
  If q is “when…” or “how long…”, w_hasDigit and w_time_period#n#1 are large, w_region#n#1 is small
 Each corpus token t has associated indicator features φ_a(t) for every a ∈ A
  φ_hasDigit(3,000) = 1, φ_is-a(region#n#1)(Japan) = 1

Reward proximity to ground constants
 Score each candidate answer token t with H_q(t): reward tokens appearing “near” ground constants matched from the question
 Order tokens by decreasing H_q(t)
[Diagram: the Atype indicator features of each token are matched against the projection of the question into “atype space”; example passage: “…the armadillo, found in Texas, is covered with strong horny plates”.]

Evaluation: mean reciprocal rank (MRR)
 n_q = smallest rank among answer passages
 MRR = (1/|Q|) Σ_{q∈Q} (1/n_q)
  Dropping a passage from rank 1 to rank 2 is as bad as dropping it from rank 2 to not reporting it at all
 Experiment setup:
  300 top IR-score passages
  If Pr(Y=1|token) < threshold, reject the token
  If all tokens are rejected, reject the passage
  Points below the diagonal are good
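The MRR definition above is small enough to state directly in code:

```python
def mrr(best_answer_ranks):
    """Mean reciprocal rank. best_answer_ranks holds, per question, the
    smallest (1-based) rank of an answer passage, or None if no answer
    passage was returned at all (contributing 1/inf = 0)."""
    total = sum(0.0 if r is None else 1.0 / r for r in best_answer_ranks)
    return total / len(best_answer_ranks)

# Rank 1 -> 2 costs 0.5 reciprocal rank; rank 2 -> "never reported"
# also costs 0.5: exactly the property the slide points out.
print(mrr([1, 2, None]))  # (1 + 0.5 + 0) / 3 = 0.5
```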

Sample results
 Accept all tokens → IR baseline MRR
 Moderate acceptance threshold → non-answer passages eliminated, improves answer ranks
 High threshold → true answers eliminated
  Another answer with poor rank, or rank = ∞
 Additional benefits from proximity filtering

Part 2: Compiling fragments of soft schema

Who provides is-a info?
 Compiled KBs: WordNet, Cyc
 Automatic “soft” compilations
  Google Sets
  KnowItAll
  BioText
 Can use as evidence in scoring answers

Extracting is-instance-of info
 Which researcher built the WHIRL system? WordNet may not know Cohen IS-A researcher
 Google has over 4.2 billion pages
  “william cohen” on 86.1k pages (p_1 = 86.1k/4.2B)
  researcher on 4.55M pages (p_2 = 4.55M/4.2B)
  +researcher +"william cohen" on 1730 pages: 18.55x more frequent than expected if independent
 Pointwise mutual information (PMI)
 Can add high-precision, low-recall patterns
  “cities such as New York” (26600 hits)
  “professor Michael Jordan” (101 hits)
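With the hit counts quoted on the slide, the PMI computation looks like this (N is the 4.2B-page index size; all counts are as reported on the slide):

```python
import math

N = 4.2e9
hits_cohen = 86_100          # "william cohen"
hits_researcher = 4_550_000  # researcher
hits_both = 1_730            # +researcher +"william cohen"

p1 = hits_cohen / N
p2 = hits_researcher / N
p12 = hits_both / N

lift = p12 / (p1 * p2)  # how much more frequent than if independent
pmi = math.log2(lift)   # PMI in bits

print(round(lift, 2))   # ~18.55, matching the slide
```

A large positive PMI between an instance and a class name is the slide's evidence that “william cohen” IS-A researcher.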

Bootstrapping lists of instances
 Hearst 1992, Brin 1997, Etzioni 2004
 A “propose-validate” approach
  Using existing patterns, generate queries
  For each web page w returned
   Extract potential fact e and assign a confidence score
   Add the fact to the database if the score is high enough
 Example patterns
  NP1 {,} {such as | and other | including} NPList2
  NP1 is a NP2; NP1 is the NP2 of NP3; the NP1 of NP2 is NP3
 Start with NP1 = researcher etc.
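A minimal regex rendition of the first pattern, NP1 such as NPList2. Real systems such as KnowItAll do proper NP chunking and then validate each candidate (e.g. with PMI); this sketch skips both steps.

```python
import re

# Class word, "such as", then a comma-separated list of capitalized NPs.
PATTERN = re.compile(
    r"(?P<cls>\w+)\s+such as\s+(?P<list>[A-Z][\w ]*(?:,\s*[A-Z][\w ]*)*)"
)

def extract_instances(text):
    """Return (instance, class) candidate pairs found in text."""
    facts = []
    for m in PATTERN.finditer(text):
        cls = m.group("cls")
        for inst in re.split(r",\s*", m.group("list")):
            facts.append((inst.strip(), cls))
    return facts

print(extract_instances("We visited cities such as New York, Paris"))
# [('New York', 'cities'), ('Paris', 'cities')]
```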

System details
 The importance of shallow linguistics working together with statistical tests
  China is a (country)_NP in Asia
  Garth Brooks is a ((country)_ADJ (singer)_N)_NP
 Unary relation example
  NP1 such as NPList2 & head(NP1) = plural(name(Class1)) & properNoun(head(each(NPList2))) ⇒ instanceOf(Class1, head(each(NPList2)))
  (“head” of phrase)

Compilation performance
 Recall-vs-precision exposes the size and difficulty of the domain
  “US state” is easy
  “Country” is difficult
 To improve the signal-to-noise (STN) ratio, stop when the confidence score falls below a threshold
  Substantially improves recall-vs-precision

Exploiting is-a info for ranking
 Use PMI scores as additional features
 Challenge: make frugal use of expensive inverted-index probes
[Architecture diagram: as before, the question and answer tokens pass through IE-style surface and WordNet IS-A feature extractors into question and snippet feature vectors; PMI scores for the Atype, obtained from search-engine probes, are added before learning the joint distribution.]

Structure in the corpus vs. structure in the query (revisited)
 Corpus axis: “free-format text”, HTML, XML, entity+relations; query axis: bag-of-words, Boolean/proximity, XQuery, SQL
 Part 3a: schema-free search on labeled graphs
 Part 3b: labeling graphs

Mining graphs: applications abound
 No clean schema, data changes rapidly
 Lots of generic “graph proximity” information
[Diagram: emails (“…your ECML submission titled XYZ has been accepted…”, “Could I get a preprint of your recent ECML paper?”), a PDF file with abstract and last-modified date, and a canonical node for the paper XYZ by A. U. Thor, all linked by dates and references; the goal is to quickly find this file.]

Part 3a: Searching graphs

Some useful graph query styles
 Find a single node with a specified type strongly activated by given nodes
  paper NEAR { author=“A. U. Thor” conference=“ECML” time=“now” }
  May not know the exact types of the activators
 Find the connected subgraph that best explains how/why two or more nodes are related
  connect {“Faloutsos” “Roussopoulos”}
  Reward short paths, parallel paths, few distractions
 Query is a small graph with match clauses

Measuring activation of single nodes
 movie (near “richard gere” “julia roberts”)
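The slides do not pin down the activation measure; one standard choice for “strongly activated by given nodes” is a random walk with restart (personalized PageRank) from the activator nodes. A pure-Python sketch on a made-up toy graph (dangling nodes simply lose their walk mass here, for brevity):

```python
def personalized_pagerank(nodes, edges, restart, alpha=0.85, iters=100):
    """nodes: list of ids; edges: (src, dst) pairs; restart: activators."""
    out = {v: [] for v in nodes}
    for s, d in edges:
        out[s].append(d)
    r = {v: 0.0 for v in nodes}
    for v in restart:
        r[v] = 1.0 / len(restart)          # restart distribution
    x = dict(r)
    for _ in range(iters):
        nxt = {v: (1 - alpha) * r[v] for v in nodes}
        for v in nodes:
            for d in out[v]:               # spread current mass to successors
                nxt[d] += alpha * x[v] / len(out[v])
        x = nxt
    return x

# Toy graph: both activators point at "paper", so it wins.
nodes = ["thor", "ecml", "paper", "other"]
edges = [("thor", "paper"), ("ecml", "paper")]
scores = personalized_pagerank(nodes, edges, ["thor", "ecml"])
print(max(scores, key=scores.get))  # paper
```

Node types (“movie”, “paper”) would then filter which highly activated nodes are reported.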

Reporting connecting subgraphs

Schema-free twig queries
 Query is a small graph, very light on types
  Node type (dblpauthor, vldbjournal)
  Simple predicates to qualify nodes (string, number)
 Match to the data graph
  Precisely defined matches for local predicates
  Reward for proximity

Sample twig result
 VLDB Journal article, 2001
 Gerhard Weikum’s home page
 Max Planck Institute home page

Main issues
 What style of queries to support?
  No end-user will type XQuery directly
  But bag-of-words is too weak
 How to rank result nodes and subgraphs?
  Reward for text match, graph proximity, …
  “Iceberg queries”: report the top few after a lot of computation?
 Can ranking be learnt as with question answering? How to reuse and retarget supervision?

Part 3b: Probabilistic graph labeling

Identifying node types
 How do we know a node has a given type?
  Find person near “Pagerank”
  Find student near (homepage of Tom Mitchell)
  Find {verb, ####, “television”} close together
 Nodes in our graph have
  Visible attributes: “hobbies”, “my advisor is”, “start-of-sentence”
  Invisible labels: “student”, “verb”
  Neighbors: “HREF”, “linear juxtaposition”
 Goal: given some node labels, guess all the others

Example: hypertext classification
 Want to assign labels to Web pages
 Text on a single page may be too little or misleading
 A page is not an isolated instance by itself
 Problem setup
  Web graph G = (V, E)
  Node u ∈ V is a page having text u^T
  Edges in E are hyperlinks
  Some nodes are labeled
  Make a collective guess at the missing labels
 Probabilistic model? Benefits?

Graph labeling model
 Seek a labeling f of all unlabeled nodes so as to maximize its probability given the graph and the known labels (the objective was shown as a formula; one annotation on it read “for the moment we don’t need to worry about this”)
 Let V_K be the nodes with known labels and f(V_K) their known label assignments
 Let N(v) be the neighbors of v and N_K(v) ⊆ N(v) the neighbors with known labels
 Markov assumption: f(v) is conditionally independent of the rest of G given f(N(v))

Markov graph labeling
 Probability of labeling a specific node v, given edges, text, and the partial labeling: sum over all possible labelings of the unknown neighbors of v
 Markov assumption: the label of v does not depend on unknown labels outside the set N_U(v)
 Circularity between f(v) and f(N_U(v))
 Some form of iterative Gibbs sampling or MCMC
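The formula these annotations surround was an image in the original slide; in the slide's notation it reads roughly as follows (a reconstruction, not a verbatim copy):

```latex
\Pr\bigl(f(v)\mid G, v^{T}, f(V_K)\bigr)
 \;=\; \sum_{f(N_U(v))}
   \Pr\bigl(f(v)\mid f(N(v)), v^{T}\bigr)\,
   \Pr\bigl(f(N_U(v))\mid G, f(V_K)\bigr)
```

The inner term uses only the neighbor labels and the node's own text (the Markov assumption); the outer sum over all labelings of the unknown neighbors is what creates the circularity mentioned above.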

Iterative labeling in pictures
 c = class, t = text, N = neighbors
 Text-only model: Pr[c|t]
 Using neighbors’ labels to update my estimates: Pr[c|t, c(N)]
 Repeat until histograms stabilize
  Local optima possible
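A minimal relaxation-labeling sketch of this loop. The update rule (text prior times the average of the neighbors' current marginals, renormalized) is a simplification chosen for the demo, not the exact model from the talk:

```python
def iterative_label(text_prior, neighbors, iters=20):
    """text_prior: {node: {label: prob}}; neighbors: {node: [node, ...]}."""
    cur = {v: dict(p) for v, p in text_prior.items()}  # start text-only
    for _ in range(iters):
        nxt = {}
        for v, prior in text_prior.items():
            nbrs = neighbors.get(v, [])
            scores = {}
            for c, p in prior.items():
                # Average of neighbors' current belief in class c.
                nbr_avg = (sum(cur[u][c] for u in nbrs) / len(nbrs)
                           if nbrs else 1.0)
                scores[c] = p * nbr_avg   # couple text and neighborhood
            z = sum(scores.values())
            nxt[v] = {c: s / z for c, s in scores.items()}
        cur = nxt
    return cur

# Chain a-b-c: a and c lean toward class "1"; b's text is ambiguous,
# but its neighbors pull it toward "1".
prior = {"a": {"0": 0.1, "1": 0.9},
         "b": {"0": 0.5, "1": 0.5},
         "c": {"0": 0.2, "1": 0.8}}
nbrs = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
out = iterative_label(prior, nbrs)
print(max(out["b"], key=out["b"].get))  # 1
```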

Iterative labeling: formula
 Sum over all possible N_U(v) labelings is still too expensive to compute
  Take the expectation of this term over N_U(v) labelings
  Joint distribution over neighbors approximated as a product of marginals
  This yields the label estimates of v in the next iteration
 In practice, prune to the most likely configurations
 By the Markov assumption, we finally need a distribution coupling f(v), v^T (the text on v) and f(N(v))

Labeling results
 9600 patents from 12 classes marked by USPTO
 Patents have text and cite other patents
 Expand the test patent to include its neighborhood
 ‘Forget’ a fraction of the neighbors’ classes
 Good for other data sets as well

Undirected Markov networks
 Clique c ∈ C(G): a set of completely connected nodes
 Clique potential ψ_c(V_c): a function over all possible configurations of the nodes in V_c
 x = vector of observed features at all nodes
 y = vector of labels at all nodes (mostly unknown)
[Diagram labels: label variable, local feature variable, instance, label coupling]

Conditional Markov networks
 Decompose conditional label probabilities into a product over clique potentials
 Parametric model for clique potentials (parameters of the model; feature functions)
 The normalizing sum is often a mess, so we need to get back to Gibbs sampling/MCMC anyway
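The decomposition and the parametric form were shown as formulas on the original slide; in standard conditional-Markov-network (CRF) notation they are:

```latex
P(\mathbf{y}\mid\mathbf{x})
 \;=\; \frac{1}{Z(\mathbf{x})}\,\prod_{c\in C(G)} \phi_c(\mathbf{x}_c,\mathbf{y}_c),
\qquad
\phi_c(\mathbf{x}_c,\mathbf{y}_c)
 \;=\; \exp\Bigl(\sum_k \lambda_k\, f_k(\mathbf{x}_c,\mathbf{y}_c)\Bigr)
```

Here the λ_k are the model parameters and the f_k are the feature functions; the “messy sum” is the partition function Z(x) = Σ_{y'} Π_c φ_c(x_c, y'_c), which ranges over all joint labelings.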

Markov network: results
 Improves beyond text-only
 Need more comparisons with iterative labeling
 Can add many redundant features, e.g. are two links in the same HTML “section”?
  Further improves accuracy

Graph labeling: important special cases
 Implementation issues
  Look for large cliques
  Still need sampling to estimate Z
 Trees and chains have cliques of size ≤ 2
  Estimation involves iterative numerical optimization (as before)
  No sampling; direct estimation of Z possible
 Many applications
  Part-of-speech and named-entity tagging
  Disambiguation, alias analysis

Summary
 Extracting more structure from the query
  This talk: using linguistic features
  In general: query/click sessions, profile info, …
 Compiling fragments of structure from a large corpus
  Supervised sequence models
  Semi-supervised bootstrapping techniques
 Labeling and searching graphs
  Coarse-grained, e.g. Web graph, node = page
  Fine-grained, e.g. node = person, …
 Pieces sometimes need each other

Structure in the corpus vs. structure in the query (recap)
 Corpus axis: “free-format text”, HTML, XML, entity+relations
 Query axis: bag-of-words, Boolean/proximity, XQuery, SQL
 Bag-of-words over free text: the domain of IR; time to analyze queries much more deeply
 XQuery/SQL: largely for power-users, can be generated by programs; broad, deep and ad-hoc schema and data integration is very difficult; defined many real problems away, “solved” apart from performance issues
 Entity+relations: no schema; either map to a schema or support queries on uninterpreted graphs; defining, labeling, extracting and ranking answers are the major issues; no universal models; need many more applications

Concluding messages
 Work much harder on questions
  Break down into what’s known and what’s not
  Find fragments of structure when possible
  Exploit user profiles and sessions
 Perform limited pre-structuring of the corpus
  Difficult to anticipate all needs and applications
  Extract graph structure where possible (e.g. is-a)
  Do not insist on a specific schema
 Exploit statistical tools on graph models
  Models of influence along links are getting clearer
  Estimation not yet ready for very large-scale use
  Much room for creative feature and algorithm design