
1 A Framework for Learning to Query Heterogeneous Data
William W. Cohen
Machine Learning Department and Language Technologies Institute
School of Computer Science, Carnegie Mellon University
Joint work with: Einat Minkov, Andrew Ng, Richard Wang, Anthony Tomasic, Bob Frederking

2 Outline
Two views on data quality:
– Cleaning your data vs. living with the mess
– "A lazy/Bayesian view of data cleaning"
A framework for querying dirty data
– Data model
– Query language
– Baseline results (biotext and email)
– How to improve results with learning
Learning to re-rank query output
Conclusions

3 [figure slide, citing Science 1959]


7 A Bayesian Looks at Record Linkage
Record linkage problem: given two sets of records A={a1,…,am} and B={b1,…,bn}, determine when referent(ai) = referent(bj).
Idea: compute Pr(referent(ai) = referent(bj)) for each (ai, bj) pair.
Pick two thresholds:
– Pr(a=b) > HI ⇒ accept pairing
– Pr(a=b) < LO ⇒ reject pairing
– otherwise, "clerical review" by a human clerk
Every optimal decision boundary is defined by a threshold on the ranked list. The thresholds depend on the prior probability of a and b matching.

  A   | B   | Pr(A=B)
  A17 | B22 | 0.99
  A43 | B07 | 0.98
  ... | ... | ...
  A21 | B13 | 0.85
  A37 | B44 | 0.82
  A84 | B03 | 0.79
  A83 | B71 | 0.63
  ... | ... | ...
  A24 | B52 | 0.25
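To make the two-threshold rule concrete, here is a minimal sketch in Python; the pair probabilities and the HI/LO values are illustrative, not from the talk.

```python
# A minimal sketch of the Fellegi-Sunter two-threshold decision rule.
# The probabilities and thresholds below are toy values.

def link_decision(p_match, hi=0.90, lo=0.30):
    """Route a candidate pair by its match probability Pr(a=b)."""
    if p_match > hi:
        return "accept"        # confident match
    if p_match < lo:
        return "reject"        # confident non-match
    return "clerical review"   # a human clerk decides

pairs = [("A17", "B22", 0.99), ("A21", "B13", 0.85), ("A24", "B52", 0.25)]
for a, b, p in pairs:
    print(a, b, p, "->", link_decision(p))
```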

8 A Bayesian Looks at Record Linkage
Every optimal decision boundary is defined by a threshold on the ranked list.
[figure: the ranked list of n*m pairs; a candidate linkage labels every pair M(atch) or U(nmatch), so there are 2^(n*m) possible ways to link]
In other words:
– 2^(n*m) − n*m linkages can be discarded as impossible*
– of the remaining n*m, all but the HI–LO band can be discarded as "improbable"
But wait: why doesn't the human clerk pick a threshold between LO and HI?

9 A Bayesian Looks at Record Linkage
[figure: the ranked pair list again, annotated with several candidate M/U (match/unmatch) labelings]

10 Linking multiple relations: "database hardening"
Database S1 (extracted from paper 1's title page): [table omitted]
Database S2 (extracted from paper 2's bibliography): [table omitted]
Assumption: identical strings from the same source are co-referent (one sense per discourse?).

11 Using multiple relations: "database hardening"
So this gives some known matches, which might interact with proposed matches: e.g., here we deduce...
[figure: known matches propagating to proposed matches across the two extracted databases]

12 "Soft database" from IE vs. "hard database" suitable for Oracle, MySQL, etc.
[figure: the extracted soft database beside its hardened counterpart]

13 Using multiple relations: "database hardening"
(McAllester et al., KDD 2000) defined "hardening":
– Find the "interpretation" (a map variant → name) that produces a compact version of the "soft" database S.
– Probabilistic interpretation of hardening: the original "soft" data S is a noisy version of latent "hard" data H; hardening finds the maximum-likelihood H.
– Hardening is hard! Optimal hardening is NP-hard.
– Greedy algorithm: the naive implementation is quadratic in |S|; clever data structures make it O(n log n), where n = |S|.
Other related work:
– Pasula et al., NIPS 2002: a more explicit generative Bayesian formulation and an MCMC method, with experimental support.
– Wellner & McCallum 2004, Parag & Domingos 2004, Culotta & McCallum 2005, ...

14 A Bayesian Looks at Record Linkage
[figure: the ranked pair list with M/U labels, as before]
An alternate view of the process:
1. Fellegi-Sunter's method answers the question directly for the cases that everyone would agree on.
2. Human effort is used to answer the cases that are a little harder.

15 A Bayesian Looks at Record Linkage
[figure: the ranked pair list, now read as answers to membership queries]
An alternate view of the process:
1. Fellegi-Sunter's method answers the question directly for the cases that everyone would agree on.
2. Human effort is used to answer the cases that are a little harder.
Q: is A43 in B?  A: yes (p=0.98)
Q: is A21 in B?  A: unlikely
Q: is A83 in B?  A: not clear...

16 Passing linkage decisions along to the user
Usual goal: link records and create a single highly accurate database for users to query.
Equality is often uncertain, given the available information about an entity:
– "name: T. Kennedy, occupation: terrorist"
The interpretation of "equality" may change from user to user and application to application:
– Does "Boston Market" = "McDonalds"?
Alternate goal: wait for a query, then answer it, propagating uncertainty about the linkage decisions on that query to the end user.

17 WHIRL project (1997-2000)
WHIRL was initiated while the author was at AT&T Bell Labs, an organization whose own sequence of names illustrates the problem:
AT&T Research
AT&T Labs - Research
AT&T Labs
AT&T Research
AT&T Research – Shannon Laboratory
AT&T Shannon Labs

18 When are two entities the same?
Bell Labs
Bell Telephone Labs
AT&T Bell Labs
A&T Labs
AT&T Labs—Research
AT&T Labs Research, Shannon Laboratory
Shannon Labs
Bell Labs Innovations
Lucent Technologies/Bell Labs Innovations
"History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers..." [www.research.att.com]
"Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925..." [bell-labs.com]

19 When are two entities the same?
"Buddhism rejects the key element in folk psychology: the idea of a self (a unified personal identity that is continuous through time)... King Milinda and Nagasena (the Buddhist sage) discuss... personal identity... Milinda gradually realizes that "Nagasena" (the word) does not stand for anything he can point to: ... not... the hairs on Nagasena's head, nor the hairs of the body, nor the "nails, teeth, skin, muscles, sinews, bones, marrow, kidneys,..." etc.... Milinda concludes that "Nagasena" doesn't stand for anything... If we can't say what a person is, then how do we know a person is the same person through time?... There's really no you, and if there's no you, there are no beliefs or desires for you to have... The folk psychology picture is profoundly misleading and believing it will make you miserable." - S. LaFave

20 Linkage Queries
Traditional approach: uncertainty about what to link must be decided by the integration system, not the end user.

21 WHIRL vision: link items as needed by query Q
Query Q: SELECT R.a, S.a, S.b, T.b FROM R, S, T WHERE R.a=S.a and S.b=T.b

  R.a     | S.a    | S.b    | T.b
  Anhai   | Anhai  | Doan   | Doan    <- strongest links: those agreeable to most users
  Dan     | Dan    | Weld   | Weld
  William | Will   | Cohen  | Cohn    <- weaker links: those agreeable to some users
  Steve   | Steven | Minton | Mitton
  William | David  | Cohen  | Cohn    <- even weaker links...

22 WHIRL vision: link items as needed by query Q
Query Q: SELECT R.a, S.a, S.b, T.b FROM R, S, T WHERE R.a~S.a and S.b~T.b (~ means TFIDF-similar)

  R.a     | S.a    | S.b    | T.b
  Anhai   | Anhai  | Doan   | Doan
  Dan     | Dan    | Weld   | Weld
  William | Will   | Cohen  | Cohn
  Steve   | Steven | Minton | Mitton
  William | David  | Cohen  | Cohn

Incrementally produce a ranked list of possible links, with "best matches" first. The user (or downstream process) decides how much of the list to generate and examine.
DB1 + DB2 ≠ DB
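A minimal sketch of what such a ~-join computes, assuming plain token TFIDF with cosine scoring; the real WHIRL engine also combines multiple similarity conditions and uses inverted indices for speed, and the data here is toy.

```python
# A minimal sketch of a WHIRL-style soft join: rank pairs from two relations
# by TFIDF/cosine similarity of the join fields, best matches first.
import math, re
from collections import Counter

def tokens(s):
    return re.findall(r"[a-z0-9']+", s.lower())

def tfidf(s, df, n):
    tf = Counter(tokens(s))
    v = {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
    return {t: w / norm for t, w in v.items()}

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

reviews  = ["The Hitchhiker's Guide to the Galaxy, 2005",
            "Men in Black, 1997", "Space Balls, 1987"]
listings = ["Star Wars Episode III",
            "Hitchhiker's Guide to the Galaxy", "Cinderella Man"]

corpus = reviews + listings
df = Counter(t for s in corpus for t in set(tokens(s)))
vec = {s: tfidf(s, df, len(corpus)) for s in corpus}

# Ranked list of candidate links; the user decides how far down to read.
for score, r, l in sorted(((cosine(vec[r], vec[l]), r, l)
                           for r in reviews for l in listings), reverse=True)[:3]:
    print(f"{score:.3f}  {r}  ~  {l}")
```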

23 WHIRL queries
Assume two relations:
review(movieTitle, reviewText): archive of reviews
listing(theatre, movieTitle, showTimes, ...): now showing

  The Hitchhiker's Guide to the Galaxy, 2005 | This is a faithful re-creation of the original radio series - not surprisingly, as Adams wrote the screenplay...
  Men in Black, 1997                         | Will Smith does an excellent job in this...
  Space Balls, 1987                          | Only a die-hard Mel Brooks fan could claim to enjoy...

  Star Wars Episode III | The Senator Theater | 1:00, 4:15, & 7:30pm.
  Cinderella Man        | The Rotunda Cinema  | 1:00, 4:30, & 7:30pm.

24 WHIRL queries
"Find reviews of sci-fi comedies" [movie domain]:
SELECT * FROM review r WHERE r.text~'sci fi comedy'
(like standard ranked retrieval for the query "sci-fi comedy")

"Where is [that sci-fi comedy] playing?":
SELECT * FROM review r, listing s WHERE r.title~s.title AND r.text~'sci fi comedy'
(best answers: the titles are similar to each other - e.g., "Hitchhiker's Guide to the Galaxy" and "The Hitchhiker's Guide to the Galaxy, 2005" - and the review text is similar to "sci-fi comedy")

25 WHIRL queries
Similarity is based on TFIDF ⇒ rare words are most important. The search for high-ranking answers uses inverted indices...

  Review titles:                               Listings:
  The Hitchhiker's Guide to the Galaxy, 2005   Star Wars Episode III
  Men in Black, 1997                           Hitchhiker's Guide to the Galaxy
  Space Balls, 1987                            Cinderella Man
  ...                                          ...

26 WHIRL queries
Similarity is based on TFIDF ⇒ rare words are most important. The search for high-ranking answers uses inverted indices...
Years are common in the review archive, so they have low weight.

  hitchhiker -> movie00137
  the        -> movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, ...

– It is easy to find the (few) items that match on "important" terms.
– The search for strong matches can prune "unimportant" terms.
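A minimal sketch of the inverted-index idea, with a toy corpus; term weighting and the real system's pruning heuristics are omitted.

```python
# A minimal sketch of the inverted-index lookup behind WHIRL's search:
# map each term to the documents containing it, then gather candidates
# for a query by probing only a few of its terms. Toy data.
from collections import defaultdict

docs = {"movie001": "the hitchhiker's guide to the galaxy",
        "movie002": "men in black",
        "movie003": "the men who stare at goats"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# A common term like "the" hits many documents; a rare one prunes hard.
print(index["the"])           # contains movie001 and movie003
print(index["hitchhiker's"])  # contains only movie001

# Candidate generation: union of postings for the query's terms.
query = ["hitchhiker's", "galaxy"]
candidates = set().union(*(index[t] for t in query if t in index))
print(candidates)             # {'movie001'}
```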

27 WHIRL results
This sort of worked:
– Interactive speeds (<0.3 s/query) with a few hundred thousand tuples.
– For 2-way joins, average precision (roughly, the area under the precision-recall curve) from 85% to 100% on 13 problems in 6 domains.
– Average precision better than 90% on 5-way joins.

28 WHIRL and soft integration
WHIRL worked for a number of web-based demo applications:
– e.g., integrating data from 30-50 smallish web DBs with <1 FTE of labor.
WHIRL could link many data types reasonably well, without engineering.
WHIRL generated numerous papers (SIGMOD 98, KDD 98, Agents 99, AAAI 99, TOIS 2000, AIJ 2000, ICML 2000, JAIR 2001).
WHIRL was relational – but see ELIXIR (SIGIR 2001).
WHIRL users need to know the schema of the source DBs.
WHIRL's query-time linkage worked only for TFIDF, token-based distance metrics ⇒ text fields with few misspellings.
WHIRL was memory-based – all data must be centrally stored, no federated data ⇒ small datasets only.

29 WHIRL vision: link items as needed by query Q – very radical, everything was inter-dependent
Query Q: SELECT R.a, S.a, S.b, T.b FROM R, S, T WHERE R.a~S.a and S.b~T.b (~ means TFIDF-similar)

  [table of ranked candidate links, as on slide 22]

Incrementally produce a ranked list of possible links, with "best matches" first. The user (or downstream process) decides how much of the list to generate and examine.
To make SQL-like queries, the user must understand the schema of the underlying DB (and hence someone must understand DB1, DB2, DB3, ...?)

30 Outline
Two views on data quality:
– Cleaning your data vs. living with the mess
– A lazy/Bayesian view of data cleaning
A framework for querying dirty data
– Data model
– Query language
– Baseline results (biotext and email)
How to improve results with learning
– Learning to re-rank query output
Conclusions

31 BANKS: Basic Data Model
Database is modeled as a graph:
– Nodes = tuples
– Edges = references between tuples (foreign keys, inclusion dependencies, ...)
– Edges are directed.
[figure: the paper "MultiQuery Optimization" linked by writes/author edges to "S. Sudarshan", "Prasan Roy", and "Charuta"]
BANKS: keyword search... The user need not know the organization of the database to formulate queries.

32 BANKS: Answer to Query
Query: "sudarshan roy"
Answer: a subtree of the graph.
[figure: answer subtree connecting the paper "MultiQuery Optimization" via writes/author edges to "S. Sudarshan" and "Prasan Roy"]

33 BANKS: Basic Data Model
Database is modeled as a graph:
– Nodes = tuples
– Edges = references between tuples (foreign keys, inclusion dependencies, ...); edges are directed.

34 BANKS: Basic Data Model – not quite so basic
All information is modeled as a graph:
– Nodes = tuples or documents or strings or words
– Edges = references between nodes; edges are directed, labeled, and weighted:
  - foreign keys, inclusion dependencies, ...
  - doc/string D to each word contained by D (TFIDF-weighted, perhaps)
  - word W to each doc/string containing W (an inverted index)
  - [string S to strings "similar to" S]

35 Similarity in a BANKS-like system
Motivation – why I'm interested in:
– structured data that is partly text – similarity!
– structured data represented as graphs; all sorts of information can be poured into this model
– measuring the similarity of nodes in graphs
Coming up next:
– a simple query language for graphs;
– experiments on natural types of queries;
– techniques for learning to answer queries of a certain type better.

36 Yet another schema-free query language
Assume data is encoded in a graph with:
– a node for each object x
– a type for each object x, T(x)
– an edge for each binary relation r: x → y
Queries have the form: given type t* and node x, find y such that T(y)=t* and y~x (node similarity).
We'd like to construct a general-purpose similarity function x~y for objects in the graph. We'd also like to learn many such functions for different specific tasks (like "who should attend a meeting?").

37 Similarity of Nodes in Graphs
Given type t* and node x, find y such that T(y)=t* and y~x.
Similarity is defined by a "damped" version of PageRank. Similarity between nodes x and y:
– "Random surfer model": from a node z,
  - with probability α, stop and "output" z;
  - otherwise, pick an edge label r using Pr(r | z)... e.g., uniform;
  - pick y uniformly from { y' : z → y' with label r };
  - repeat from node y...
– Similarity x~y = Pr(the walk "outputs" y | start at x).
Intuitively, x~y is the sum of the weights of all paths from x to y, where the weight of a path decreases exponentially with its length.
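A minimal sketch of this damped walk, assuming a plain unlabeled graph; the talk's walk first picks an edge label r from Pr(r | z), while here successors are simply chosen uniformly, and the graph is toy data.

```python
# A minimal sketch of the "damped" random-surfer similarity: x~y is the
# probability of stopping at y when walking from x, stopping with
# probability alpha at each step.

def walk_similarity(graph, start, alpha=0.3, max_steps=10):
    """graph: node -> list of successors. Returns node -> Pr(output node)."""
    frontier = {start: 1.0}   # Pr(at node after k steps, not yet stopped)
    output = {}
    for _ in range(max_steps):
        next_frontier = {}
        for node, p in frontier.items():
            output[node] = output.get(node, 0.0) + alpha * p  # stop & emit
            succs = graph.get(node, [])
            for y in succs:                                   # move uniformly
                next_frontier[y] = (next_frontier.get(y, 0.0)
                                    + (1 - alpha) * p / len(succs))
        frontier = next_frontier
    return output

g = {"paper1": ["sudarshan", "roy"], "sudarshan": ["paper1"],
     "roy": ["paper1", "paper2"], "paper2": ["roy"]}
print(sorted(walk_similarity(g, "sudarshan").items(), key=lambda kv: -kv[1]))
```

As the slide says, a node reachable only by long paths receives weight that decays like (1 − α)^k, so "nearby" nodes dominate the ranking.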

38 BANKS: Basic Data Model – not quite so basic
All information is modeled as a graph:
– Nodes = tuples or documents or strings or words
– Edges = references between nodes; directed, labeled, and weighted:
  - foreign keys, inclusion dependencies, ...
  - doc/string D to each word contained by D (TFIDF-weighted, perhaps)
  - word W to each doc/string containing W (an inverted index)
  - [string S to strings "similar to" S] – optional: strings that are similar in TFIDF/cosine distance will still be "nearby" in the graph (connected by many length-2 paths)
[figure: "William W. Cohen, CMU" and "Dr. W. W. Cohen" connected through shared word nodes: william, w, cohen, dr, cmu]

39 Similarity of Nodes in Graphs
Random surfing on graphs:
– a natural extension of PageRank;
– closely related to Lafferty's heat-diffusion kernel, but generalized to directed graphs;
– somewhat amenable to learning the parameters of the walk (gradient search, with various optimization metrics): Toutanova, Manning & Ng, ICML 2004; Nie et al., WWW 2005; Xi et al., SIGIR 2005;
– can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g., Lewis & E. Cohen, SODA 1998), similar to particle filtering;
– our current implementation (GHIRL): Lucene + Sleepycat, with extensive use of memory caching (sampling approaches visit many nodes repeatedly).

40 Query: "sudarshan roy". Answer: a subtree of the graph.
[figure: the BANKS answer subtree for "sudarshan roy", as on slide 32]

41 Query: "sudarshan roy". Answer: a subtree of the graph.
[figure: the query expressed as a conjunction of similarity subqueries, e.g. find y such that paper(y) & y~"sudarshan" AND y~"roy"]

42 Evaluation on Personal Information Management Tasks
Such as:
– Person name disambiguation in email [e.g., Diehl, Getoor, Namata, 2006]
– Threading [e.g., Lewis & Knowles 97]
– Finding email-address aliases given a person's name (novel)
– Finding relevant meeting attendees (novel)
Many tasks can be expressed as simple, non-conjunctive search queries in this framework:
– What is the email address of the person named "Halevy" mentioned in this presentation?
– What files from my home machine will I need for this meeting?
– What people will attend this meeting?...
Also consider a generalization: x → Vq, where Vq is a distribution over nodes [Minkov et al., SIGIR 2006].

43 Email as a graph
[figure: graph fragment with node types file, email-address, person-name, date, and term, connected by directed, labeled edge pairs: sent_from / sent_from_inv, sent_to / sent_to_inv, alias / alias_inv, sent_date / sent_date_inv, in_file / in_file_inv, in_subj / in_subj_inv, plus a +1_day edge between adjacent dates]

44 Person Name Disambiguation
Q: "who is Andy?"
Given: a term that is known to be a personal name but is not mentioned 'as is' in any header (otherwise the problem is easy).
Output: a ranked list of person nodes.
[figure: walk from the file node and the term "andy" to the person node "Andrew Johns"]
* This task is complementary to person-name annotation in email (E. Minkov, R. Wang, W. Cohen, Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text, HLT/EMNLP 2005).

45 Corpora and Datasets
a. Corpora [table omitted]
b. Types of names [table omitted]
Example nicknames: Dave for David, Kai for Keiko, Jenny for Qing.

46 Person Name Disambiguation
1. Baseline: string matching (& common nicknames)
– Find persons whose names are similar to the name term (Jaro similarity).
– Successful in many cases; not successful for some nicknames; cannot handle (arbitrary) ambiguity.
2. Graph walk: term
– Vq = the name-term node (2 steps). Models co-occurrences; cannot handle (dominant) ambiguity.
3. Graph walk: term + file
– Vq = the name-term node + the file node (2 steps). The file node is naturally available context. Solves the ambiguity problem, but incorporates additional noise.
4. Graph walk: term + file, re-ranked using learning
– Re-rank the output of (3), using: path-describing features; 'source count' (do the paths originate from one source node or two?); string similarity.

47 Results
[results chart]

48 [results chart comparing: baseline (string match, nicknames); graph walk from name; graph walk from {name, file}; after learning-to-rank]

49 Results – Enron execs
[results chart]

50 Results
[results chart]

51 Learning
There is no single "best" measure of similarity: how can you learn to better rank graph nodes for a particular task?
Learning methods for graph walks:
– The walk parameters can be adjusted using gradient-descent methods (Diligenti et al., IJCAI 2005).
– We explored a node re-ranking approach, which can take advantage of a wider range of features (and is complementary to parameter tuning).
The features of a candidate answer y describe the set of paths from the query x to y.

52 Re-ranking overview
Boosting-based re-ranking, following Collins and Koo (Computational Linguistics, 2005). A training example includes:
– a ranked list of l_i nodes;
– each node represented through m features;
– at least one known correct node.
Scoring function (the original graph-walk score y~x plus a linear combination of features):
  F(y) = α_0 · log p(y | Vq) + Σ_{k=1..m} α_k · f_k(y)
Find the weights α that minimize the boosted exponential loss:
  ExpLoss(α) = Σ_i Σ_{j≥2} exp( F(y_{i,j}) − F(y_{i,1}) ),  where y_{i,1} is the correct node of example i.
This requires binary features, and has a closed-form formula for finding the best feature and its delta in each iteration.
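A minimal sketch of the scoring function and the exponential loss, with toy weights and features; the closed-form boosting updates of Collins & Koo are omitted.

```python
# A minimal sketch of the re-ranking score: each candidate keeps its
# original walk score log p(y|Vq) plus a weighted sum of m features, and
# ExpLoss penalizes scoring any candidate above the known-correct one.
import math

def score(cand, w0, w):
    return w0 * cand["logp"] + sum(wk * fk for wk, fk in zip(w, cand["f"]))

def exp_loss(examples, w0, w):
    # examples: list of candidate lists; candidates[0] is the correct node
    loss = 0.0
    for cands in examples:
        s_correct = score(cands[0], w0, w)
        loss += sum(math.exp(score(c, w0, w) - s_correct) for c in cands[1:])
    return loss

examples = [[{"logp": -1.0, "f": [1, 0]},   # correct answer
             {"logp": -0.5, "f": [0, 1]},   # competitor
             {"logp": -2.0, "f": [0, 0]}]]
print(exp_loss(examples, w0=1.0, w=[0.7, -0.3]))
```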

53 Path-describing Features
The set of paths to a target node in step k is recovered in full.
[figure: a walk over nodes x1..x5; e.g., the paths reaching x3 at k=2 are x2 → x1 → x3, x4 → x1 → x3, x2 → x2 → x3, and x2 → x3]
'Edge unigram' features: was edge type l used in reaching x from Vq?
'Edge bigram' features: were edge types l1 and l2 used, in that order, in reaching x from Vq?
'Top edge bigram' features: were edge types l1 and l2 used, in that order, among the top two highest-scoring paths reaching x from Vq?
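A minimal sketch of turning recovered paths into edge-unigram and edge-bigram indicator features; the path label sequences here are toy values, and the 'top edge bigram' variant is omitted.

```python
# A minimal sketch of path-describing features: given the full set of
# labeled paths that reached candidate x, emit edge-unigram and edge-bigram
# indicator features.

def path_features(paths):
    feats = set()
    for labels in paths:                     # e.g. ("sent_from", "sent_to")
        for l in labels:
            feats.add(("unigram", l))
        for l1, l2 in zip(labels, labels[1:]):
            feats.add(("bigram", l1, l2))
    return feats

paths_to_x = [("sent_from", "sent_to"), ("has_term", "has_term_inv")]
print(sorted(path_features(paths_to_x)))
```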

54 Results
[results chart]

55 Threading
Threading is an interesting problem because:
– There are often irregularities in thread structural information, so thread discourse should be captured using an intelligent approach (D.E. Lewis and K.A. Knowles, Threading email: A preliminary study, Information Processing and Management, 1997).
– Threading information can improve message categorization into topical folders (B. Klimt and Y. Yang, The Enron corpus: A new dataset for email classification research, ECML, 2004).
– Adjacent messages in a thread can be assumed to be the messages most similar to each other in the corpus; threading is therefore related to the general problem of finding similar messages in a corpus.
The task: given a message, retrieve the adjacent messages in its thread.

56-59 Some intuition
[figure, built up over four slides: a message (file x) and the unknown adjacent thread message, connected by progressively richer evidence: shared content, the social network, and the timeline]

60 Threading: experiments
1. Baseline: TFIDF similarity – treat all the available information (header & body) as text.
2. Graph walk: uniform – start from the file node, 2 steps, uniform edge weights.
3. Graph walk: random – start from the file node, 2 steps, random edge weights (best of 10).
4. Graph walk: re-ranked – re-rank the output of (3) using the path-describing features.

61 Results
[results chart]
Highly-ranked edge bigrams:
– sent-from → sent-to^-1
– date-of → date-of^-1
– has-term → has-term^-1

62 Finding email aliases given a name
Given: a person's name (a term node).
Retrieve: the full set of relevant email addresses (email-address nodes).

63 Finding Meeting Attendees
The extended graph contains 2 months of calendar data. [Minkov et al., CEAS 2006]
[figure: calendar nodes and edges added to the email graph]

64 Main Contributions
– Presented an extended similarity measure incorporating non-textual objects.
– Finite "lazy" random walks to perform typed search.
– A re-ranking paradigm to improve on graph-walk results.
– An instantiation of this framework for email.
– Defined and evaluated novel tasks for email.

65 Another task that can be formulated as a graph query: GeneId ranking
Given: a biomedical paper abstract.
Find: the geneId for every gene mentioned in the abstract.
Method: from paper x, produce a ranked list of geneIds y by x~y.
Background resources:
– a "synonym list": geneId → { name1, name2, ... }
– one or more protein NER systems
– training/evaluation data: pairs of (paper, {geneId1, ..., geneIdn})

66 Sample abstracts and synonyms
[figure: abstracts with true labels and NER-extractor output, alongside synonym-list entries]
  MGI:96273  -> Htr1a; 5-hydroxytryptamine (serotonin) receptor 1A; 5-HT1A receptor
  MGI:104886 -> Gpx5; glutathione peroxidase 5; Arep
  ...
52,000+ entries for mouse, 35,000+ for fly.

67 Graph for the task
[figure: a layered graph. Abstracts (e.g., file:doc115) connect via hasProtein edges to extracted protein strings ("HT1A", "HT1", "CA1", ...); protein and synonym strings ("5-HT1A receptor", "Htr1a", "eIF-1A", ...) connect via hasTerm edges to term nodes (term:HT, term:1, term:A, term:CA, term:hippocampus); synonym edges link synonym strings to geneIds (MGI:95298, MGI:46273, ...); inFile edges link strings back to abstracts]

68 [figure: the same task graph, extended with noisy training abstracts (file:doc214, file:doc523, file:doc6273, ...) linked to their known geneIds]

69 Experiments
Data: BioCreative Task 1B
– mouse: 10,000 training abstracts, 250 devtest (using the first 150 for now); 50,000+ geneIds; the graph has 525,000+ nodes.
NER systems:
– likelyProtein: trained on yapex.train using off-the-shelf NER tools (Minorthird).
– possibleProtein: same, but tuned (on yapex.test) to optimize F3 rather than F1 (rewards recall over precision).

70 Experiments with NER

                            Token             Span
  Dataset     System        Prec    Recall    Prec    Recall    F1
  yapex.test  likely        94.9    64.8      87.2    62.1      72.5
  yapex.test  possible      49.0    97.4      47.2    82.5      60.0
  mouse       likely        81.6    31.3      66.7    26.8      45.3
  mouse       possible      43.9    88.5      30.4    56.6      39.6
  mouse       dictionary    50.1    46.9      24.5    43.9      31.4

71 Experiments with Graph Search
Baseline method:
– extract entities of type x;
– for each string of type x, find the best-matching synonym, and then its geneId:
  - consider only synonyms sharing >= 1 token;
  - use Soft/TFIDF distance;
  - break ties randomly;
– rank geneIds by the number of times they are reached (rewards multiple mentions, even via alternate synonyms).
Evaluation: average, over 50 test documents, of:
– non-interpolated average precision (plausible for curators);
– max F1 over all cutoffs.
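A minimal sketch of this baseline, using a toy fragment of the synonym list; difflib's similarity ratio stands in for the Soft/TFIDF distance used in the talk.

```python
# A minimal sketch of the baseline: for each extracted protein mention, keep
# only synonyms sharing at least one token, pick the best string match, and
# rank geneIds by how often they are reached.
import difflib
from collections import Counter

synonyms = {"MGI:96273": ["Htr1a", "5-HT1A receptor"],
            "MGI:104886": ["Gpx5", "glutathione peroxidase 5", "Arep"]}

def best_gene(mention):
    toks = set(mention.lower().split())
    best, best_sim = None, 0.0
    for gid, names in synonyms.items():
        for name in names:
            if not toks & set(name.lower().split()):
                continue                  # must share >= 1 token
            sim = difflib.SequenceMatcher(None, mention.lower(),
                                          name.lower()).ratio()
            if sim > best_sim:
                best, best_sim = gid, sim
    return best

mentions = ["5-HT1A receptor", "HT1A receptor", "Gpx5"]
ranked = Counter(g for g in map(best_gene, mentions) if g)
print(ranked.most_common())   # geneIds scored by number of mentions reached
```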

72 Experiments with Graph Search

  mouse eval dataset             MAP    max F1
  likelyProtein + softTFIDF      45.0   58.1
  possibleProtein + softTFIDF    62.6   74.9
  graph walk                     51.3   64.3

73 Baseline vs. Graph walk
The baseline includes:
– softTFIDF distances from each NER entity to the gene synonyms;
– the knowledge that the "shortcut" path doc → entity → synonym → geneId is important.
The graph includes: IDF effects, correlations, training data, etc.
Proposed graph extension: add softTFIDF and "shortcut" edges.
Learning and re-ranking:
– start with "local" features f_i(e) of edges e = u → v;
– for each answer y, compute the expectations E( f_i(e) | start=x, end=y );
– use the expectations as feature values, with the voted perceptron (Collins, 2002) as the learning-to-rank method.
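A minimal sketch of the learning-to-rank step, assuming the feature expectations have already been computed as plain vectors; an averaged perceptron stands in here for the voted perceptron of Collins (2002), and all numbers are toy values.

```python
# A minimal sketch of perceptron re-ranking over expectation features:
# each candidate answer y is a vector of E(f_i(e) | start=x, end=y) values,
# and the perceptron learns to score the correct candidate highest.

def train(examples, epochs=10):
    # examples: list of candidate lists; index 0 is the correct candidate
    dim = len(examples[0][0])
    w, w_sum, n = [0.0] * dim, [0.0] * dim, 0
    for _ in range(epochs):
        for cands in examples:
            scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in cands]
            guess = max(range(len(cands)), key=scores.__getitem__)
            if guess != 0:  # update toward the correct candidate
                for i in range(dim):
                    w[i] += cands[0][i] - cands[guess][i]
            w_sum = [a + b for a, b in zip(w_sum, w)]
            n += 1
    return [a / n for a in w_sum]   # averaged weights

examples = [[[0.8, 0.1], [0.2, 0.9]],   # expectation vectors per candidate
            [[0.7, 0.3], [0.4, 0.6]]]
print(train(examples))
```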

74 Experiments with Graph Search

  mouse eval dataset             MAP    average max F1
  likelyProtein + softTFIDF      45.0   58.1
  possibleProtein + softTFIDF    62.6   74.9
  graph walk                     51.3   64.3
  walk + extra links             73.0   80.7
  walk + extra links + learning  79.7   83.9

75 Experiments with Graph Search
[results chart]

76 Hot off the presses
Ongoing work: learn an NER system from pairs of (document, geneIdList):
– much easier to obtain this training data than documents in which every occurrence of every gene name is highlighted (the usual NER training data);
– obtains an F1 of 71.1 on the mouse data (vs. 45.3 when training on the YAPEX data, which is from a different distribution).

77 Experiments with Graph Search

  mouse eval dataset             MAP (Yapex-trained)
  likelyProtein + softTFIDF      45.0
  possibleProtein + softTFIDF    62.6
  graph walk                     51.3
  walk + extra links             73.0
  walk + extra links + learning  79.7

78 Experiments with Graph Search

  mouse eval dataset             MAP (Yapex-trained)   MAP (MGI-trained)
  likelyProtein + softTFIDF      45.0                  72.7
  possibleProtein + softTFIDF    62.6                  65.7
  graph walk                     51.3                  54.4
  walk + extra links             73.0                  76.7
  walk + extra links + learning  79.7                  84.2

79 Experiments on the BioCreative Blind Test Set
(devtest numbers from slide 74 in parentheses)

  mouse blind test data          MAP (Yapex-trained)   max F1 (Yapex-trained)
  likelyProtein + softTFIDF      36.8  (45.0)          42.1  (58.1)
  possibleProtein + softTFIDF    61.1  (62.6)          67.2  (74.9)
  graph walk                     64.0  (51.3)          69.5  (64.3)
  walk + extra links + learning  71.1  (79.7)          75.5  (83.9)

80 Experiments with Graph Search

  mouse blind test data          MAP (Yapex-trained)   max F1 (Yapex-trained)
  likelyProtein + softTFIDF      36.8                  42.1
  possibleProtein + softTFIDF    61.1                  67.2
  graph walk                     64.0                  69.5
  walk + extra links + learning  71.1                  75.5

  mouse blind test data          MAP (MGI-trained)     max F1 (MGI-trained)
  walk + extra links + learning  80.1                  83.7

81
  mouse blind test data                          MAP    average max F1
  walk + extra links + learning (Yapex-trained)  71.1   75.5
  walk + extra links + learning (MGI-trained)    80.1   83.7

82 Outline
Two views on data quality:
– Cleaning your data vs. living with the mess
– "A lazy/Bayesian view of data cleaning"
A framework for querying dirty data
– Data model
– Query language
– Baseline results (biotext and email)
– How to improve results with learning
Learning to re-rank query output
Conclusions

83 Contributions:
– a very simple query language for graphs, based on a diffusion-kernel (damped PageRank, ...) similarity metric;
– experiments on natural types of queries:
  - finding likely meeting attendees;
  - finding related documents (email threading);
  - disambiguating person and gene/protein entity names;
– techniques for learning to answer queries:
  - re-ranking using expectations of simple, local features;
  - tuning performance to a particular notion of "similarity".

84 Conclusions
Some open problems:
– Scalability & efficiency: a K-step walk on a node-node graph with fan-out b is O(KbN); accurate sampling takes on the order of a minute for 10-step walks on graphs with ~10^6 nodes.
– Faster, better learning methods: combine re-ranking with learning the parameters of the graph walk.
– Add language modeling and topic modeling: extend the graph to include models as well as data.

85 Conclusions
Don't forget that there are two views on data quality:
– cleaning your data vs. living with the mess;
– "a lazy/Bayesian view of data cleaning";
– SQL/Oracle vs. Google vs. something in between...?

