Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Fei Wu, Gengxin Miao, Chung Wu
VLDB 2011
Presented by Xunnan Xu
Annotating tables (the recovery of semantics)
Why is annotation needed?
- The table's title could be missing
- The subject column could be missing
- Relevant context might not be close to the table at all
Goal: improve table search, e.g. for queries like:
- Bloom period (Property) of shrubs (Class) <- the case this paper focuses on
- Color (Property) of Azalea (Instance)
Two databases are extracted from the Web:
- isA database (instance, class pairs): "Berlin is a city." "CSCI572 is a course."
- relations database (binary relations): "Microsoft is headquartered in Redmond." "San Francisco is located in California."
Why is this useful? Tables are structured, so the more popular (well-known) names in a column can help identify the others:
City          State
San Francisco California
San Mateo     California
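A minimal sketch of these two databases and how a column's known instances can vouch for the rest; the data structures and the coverage heuristic below are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

isa_db = defaultdict(set)            # class -> known instances
isa_db["city"].update({"Berlin", "San Francisco"})
isa_db["course"].add("CSCI572")

relations_db = defaultdict(set)      # relation -> (subject, object) pairs
relations_db["headquartered in"].add(("Microsoft", "Redmond"))
relations_db["located in"].add(("San Francisco", "California"))

# Tables are structured: if the known instances of "city" cover most of a
# column, the remaining cells ("San Mateo") are likely cities too.
column = ["San Francisco", "San Mateo"]
overlap = sum(v in isa_db["city"] for v in column) / len(column)
print(f"fraction of column covered by class 'city': {overlap:.2f}")
```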
Building the isA database: extract (instance, class) pairs from web pages with patterns like "C such as I". Easy? Not really...
- To check the boundary of a class C: keep noun phrases whose last component is a plural-form noun and that are not contained in, and do not contain, another noun phrase. E.g. "Michigan counties such as ..." qualifies; "Among the lovely cities ..." does not.
- To check the boundary of an instance I: require that I occurs as an entire query in query logs.
A rough extraction sketch follows this list.
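The sketch below shows one "C such as I" pattern; the class and instance boundary checks are only approximated (a plural-suffix test, and a stub query log), so treat the regex and helper names as assumptions rather than the paper's extractor:

```python
import re

# Rough "(C)lass such as (I)nstance" extraction sketch.
PATTERN = re.compile(r"([A-Z][\w ]*?s)\s+such as\s+([A-Z][\w ]+)")

def looks_like_class(np: str) -> bool:
    # crude stand-in for "last component is a plural-form noun"
    return np.split()[-1].endswith("s")

def seen_as_full_query(np: str, query_log: set) -> bool:
    # stand-in for "I occurs as an entire query in query logs"
    return np in query_log

query_log = {"Allegan", "Barry"}
text = "Michigan counties such as Allegan and Barry attract visitors."
for match in PATTERN.finditer(text):
    cls, inst = match.group(1), match.group(2).split(" and ")[0]
    if looks_like_class(cls) and seen_as_full_query(inst, query_log):
        print((inst, cls))   # ('Allegan', 'Michigan counties')
```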
Mine more instances by applying attribute patterns, e.g. "headquartered in I" => I is a city.
Handle duplicate sentences: a sentence fingerprint is the hash of its first 250 characters.
Score the pairs:
Score(I, C) = |Patterns(I, C)|^2 x Freq(I, C)
- Patterns(I, C) - the set of distinct patterns that matched (I, C)
- Freq(I, C) - the number of appearances of the pair
Similar in spirit to tf/idf.
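A small sketch of this scoring with fingerprint deduplication; the md5 hash and the observe/score helpers are my own choices for illustration, the formula is the one above:

```python
import hashlib
from collections import defaultdict

seen_fingerprints = set()
patterns = defaultdict(set)   # (I, C) -> set of patterns that matched
freq = defaultdict(int)       # (I, C) -> number of appearances

def observe(sentence: str, instance: str, cls: str, pattern: str) -> None:
    # sentence fingerprint = hash of the first 250 characters
    fp = hashlib.md5(sentence[:250].encode()).hexdigest()
    if fp in seen_fingerprints:      # skip duplicate sentences
        return
    seen_fingerprints.add(fp)
    patterns[(instance, cls)].add(pattern)
    freq[(instance, cls)] += 1

def score(instance: str, cls: str) -> int:
    # Score(I, C) = |Patterns(I, C)|^2 * Freq(I, C)
    return len(patterns[(instance, cls)]) ** 2 * freq[(instance, cls)]

observe("cities such as Berlin ...", "Berlin", "city", "C such as I")
observe("Berlin and other cities ...", "Berlin", "city", "I and other C")
print(score("Berlin", "city"))   # 2 patterns ^2 * 2 appearances = 8
```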
TextRunner was used to extract the relations. TextRunner is a research project at the University of Washington. It uses a Conditional Random Field (CRF) to detect relations among noun phrases. CRFs are popular in machine learning: pre-defined feature functions are applied to phrases, and their weighted sum is normalized into a probability for the sentence's label sequence (score between 0 and 1).
Example feature: f(sentence, i, label_i, label_{i-1}) = 1 if word_i is in ... and label_{i-1} is an adjective, otherwise 0
=> "Microsoft is headquartered in beautiful Redmond."
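A toy version of one such binary feature function over the example sentence; the label set, the weight, and the feature itself are invented for illustration and are not TextRunner's actual features:

```python
# f(sentence, i, label_i, label_{i-1}) as a Python function.
def feature(words, i, label_i, label_prev):
    # fires when the current word follows an adjective-labeled word
    return 1 if i > 0 and label_prev == "ADJ" else 0

words  = ["Microsoft", "is", "headquartered", "in", "beautiful", "Redmond"]
labels = ["ENT", "O", "REL", "REL", "ADJ", "ENT"]

# A linear-chain CRF sums weighted features over positions, then
# normalizes over all label sequences to get a probability in [0, 1].
weights = {"follows_adjective": 1.5}
unnormalized = sum(
    weights["follows_adjective"] * feature(words, i, labels[i], labels[i - 1])
    for i in range(1, len(words))
)
print(unnormalized)   # 1.5: fires once, at "Redmond" after "beautiful"
```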
Assumptions:
- If many instances in a column are assigned to a class, then the remaining instances very likely also belong to it.
- The best label is the one that is most likely to produce the observed values in the column (maximum likelihood hypothesis).
Definitions:
- v_i - value i in the column
- l_i - a possible label for that column; L(A) - the best label for column A
- U(l_i, V) - the score of label l_i when assigned to the set V of values
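A sketch of maximum-likelihood column labeling under these assumptions: here U(l, V) is taken as a smoothed log-likelihood of the column values V under label l, estimated from the isA database. The smoothing constant and exact scoring are assumptions; the paper's scoring function differs in detail:

```python
import math

isa_db = {
    "Michigan counties": {"Allegan", "Barry", "Berrien"},
    "Illinois counties": {"Cook", "Lake", "Will"},
}

def U(label: str, values: list, alpha: float = 0.01) -> float:
    # log P(V | label), with small probability alpha for unseen values
    instances = isa_db[label]
    return sum(math.log(1.0 if v in instances else alpha) for v in values)

def best_label(values: list) -> str:
    # L(A): the label most likely to have produced the observed values
    return max(isa_db, key=lambda l: U(l, values))

column = ["Allegan", "Barry", "Berrien"]
print(best_label(column))   # Michigan counties
```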
Gold standard: labels are manually evaluated by annotators, rated vital > okay > incorrect.
- Allegan, Barry, Berrien -> Michigan counties (vital)
- Allegan, Barry, Berrien -> Illinois counties (incorrect)
Relation quality: 128 binary relations evaluated against the gold standard.

                             Web-extracted   YAGO (from Wikipedia)   Freebase
Labeled subject columns      1,496,550       185,013                 577,811
Instances in ontology        155,831,855     1,940,797               16,252,633

                             Web-extracted   Freebase
No. of relations vital/okay  83 (64.8%)      37 (28.9%)
Results are fetched automatically but compared manually:
- 100 queries, using the top-5 results for each -> 500 results in total
- Results were shuffled and evaluated by 3 people in a single-blind test
Scores:
- right on - has all information about a large number of instances of the class and values for the property
- relevant - has information about only some of the instances, or about properties closely related to the queried property
- irrelevant
Candidates:
- TABLE - the method in this paper
- GOOG - results from google.com
- GOOGR - top 1,000 results from Google intersected with the table corpus
- DOCUMENT - a document-based approach
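One plausible way to compute the per-query precision and recall reported in the tables below; the slide does not show the exact aggregation, so the definitions (and the total_relevant count) here are illustrative assumptions:

```python
ratings = {   # query -> list of ratings for its top-5 results
    "bloom period of shrubs": ["right on", "relevant", "irrelevant",
                               "right on", "relevant"],
}
relevant_labels = {"right on", "relevant"}   # criterion (b)

def query_precision(rated: list) -> float:
    # fraction of returned results judged relevant under the criterion
    return sum(r in relevant_labels for r in rated) / len(rated)

def query_recall(rated: list, total_relevant: int) -> float:
    # fraction of all known relevant results that were returned
    return sum(r in relevant_labels for r in rated) / total_relevant

rated = ratings["bloom period of shrubs"]
print(query_precision(rated))      # 0.8
print(query_recall(rated, 10))     # 0.4, assuming 10 relevant results exist
```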
[Results tables: "All Ratings" / "Ratings by Queries" (Total, Similar Results) and "Query Precision / Query Recall" per method (TABLE, DOC, GOOG, GOOGR), each broken down by (a) right on, (b) right on or relevant, (c) right on or relevant and in a table.]