Probase: A Knowledge Base for Text Understanding
Integrating, representing, and reasoning over human knowledge: the grand challenge of the 21st century.
Knowledge Bases

Scope:
- Cross domain / Encyclopedia (e.g., Freebase, Cyc)
  Challenge: data integration
- Vertical domain / Directory (e.g., Entity Cube)
  Challenges: completeness (e.g., a complete list of restaurants); accuracy (e.g., address, phone, menu of each restaurant); data integration
“Concepts are the glue that holds our mental world together.” “By being able to place something into its proper place in the concept hierarchy, one can learn a considerable amount about it.”
Yellow Pages or Knowledge Bases?

Existing knowledge bases are more like big Yellow Pages books. Their goals:
- a complete list of items
- complete information about each item

But humans do not need "complete" knowledge in order to understand the world:

"The statistical machine learning strategies that it uses are indeed a big advance over traditional GOFAI techniques. But they still fall far short of what human beings do. To see this, take Watson's most famous blunder. On day 2 of the 'Jeopardy!' challenge, the competitors were given a clue in the category 'U.S. Cities.' The clue was, 'Its largest airport is named for a World War II hero; its second largest for a World War II battle.' Both Ken Jennings and Brad Rutter correctly answered Chicago. Watson didn't just get the wrong city; its answer, Toronto, is not a United States city at all. This is the kind of blunder that practically no human being would make. It may be what motivated Jennings to say, 'The illusion is that this computer is doing the same thing that a very good Jeopardy! player would do. It's not. It's doing something sort of different that looks the same on the surface. And every so often you see the cracks.'

So what went wrong? David Ferrucci, the principal investigator on the IBM team, says that there were a variety of factors that led Watson astray. But one of them relates specifically to the machine learning strategies. During its training phase, Watson had learned that categories are only a weak indicator of the answer type. A clue in the category 'U.S. Cities,' for example, might read 'Rochester, New York, grew because of its location on this.' The answer, the Erie Canal, is obviously not a U.S. city. So from examples like this Watson learned, all else being equal, that it shouldn't pay too much attention to mismatches between category and answer type. The problem is that lack of attention to such a mismatch will sometimes produce a howler.

Knowing when it's relevant to pay attention to the mismatch and when it's not is trivial for a human being. But Watson doesn't understand relevance at all. It only measures statistical frequencies. Because it is relatively common to find mismatches of this sort, Watson learns to weigh them as only mild evidence against the answer. But the human just doesn't do it that way. The human being sees immediately that the mismatch is irrelevant for the Erie Canal but essential for Toronto. Past frequency is simply no guide to relevance."
What's wrong with the Yellow Pages approach?

- Rich in instances, poor in concepts; yet understanding = forming concepts.
- Do ontologies always have an elegant hierarchy? Manually crafted ontologies are for human use; our mental world does not have an elegant hierarchy.
Probase: "It is better to use the world as its own model."

- Capture concepts (in our mental world)
- Quantify uncertainty (for reasoning)

"The new, embodied paradigm in AI, deriving primarily from the work of roboticist Rodney Brooks, insists that the body is required for intelligence. Indeed, Brooks's classic 1990 paper, 'Elephants Don't Play Chess,' rejected the very symbolic computation paradigm against which Dreyfus had railed, favoring instead a range of biologically inspired robots that could solve apparently simple, but actually quite complicated, problems like locomotion, grasping, navigation through physical environments and so on. To solve these problems, Brooks discovered that it was actually a disadvantage for the system to represent the status of the environment and respond to it on the basis of pre-programmed rules about what to do, as the traditional GOFAI systems had. Instead, Brooks insisted, 'It is better to use the world as its own model.'"

Brooks believes:
- Attempting to come up with models and representations of the real world is the wrong approach.
- It is "better to use the world as its own model."

Brooks, R. A. (1990). Elephants don't play chess. In Pattie Maes (Ed.), Designing Autonomous Agents. Cambridge, MA: MIT Press.

Elephants don't play chess, but they are still intelligent!
1. Capture concepts in the human mind: more than 2.7 million concepts automatically harnessed from 1.68 billion documents.
2. Represent the concepts in a computable form and transfer them to machines, so that machines have a better understanding of the human world. Computation and reasoning are enabled by scoring:
   - Consensus: e.g., is there a company called Apple?
   - Typicality: e.g., how likely do you think of Apple when you think about companies?
   - Ambiguity: e.g., does the word Apple, without any context, represent Apple the company?
   - Similarity: e.g., how likely is an actor also a celebrity?
   - Freshness: e.g., Pluto as a dwarf planet is a fresher claim than Pluto as a planet.
   - …
3. Give machines a new CPU (Commonsense Processing Unit) powered by a distributed graph engine called Trinity.
4. A little knowledge goes a long way once machines acquire a human touch.
Probase has a big concept space

- Probase: 2.7 M concepts, automatically harnessed
- Freebase: 2 K concepts, built by community effort
- Cyc: 120 K concepts, 25 years of human labor
Probase has 2.7 million concepts (frequency distribution)
Probase vs. Freebase

Uncertainty:
- Probase: correctness is a probability.
- Freebase: knowledge is black and white.

Dirty data:
- Probase: live with dirty data; dirty data is very useful.
- Freebase: clean up everything; dirty data is unusable.
Probase Internals

[Diagram: an isA hierarchy in which "painter" is a subconcept of "artist" and "painting" a subconcept of "art", with a "created by" relation linking paintings to painters; plus entity tables such as Picasso (Movement: Cubism; Born: 1881; Died: 1973; …) and Guernica (Year: 1937; Type: Oil on Canvas; …)]
Data Sources

Patterns for single statements. Examples:
- NP such as {NP, NP, ..., (and|or)} NP
- such NP as {NP,}* {(or|and)} NP
- NP {, NP}* {,} or other NP
- NP {, NP}* {,} and other NP
- NP {,} including {NP,}* {or|and} NP
- NP {,} especially {NP,}* {or|and} NP

Examples:
- Good: "rich countries such as USA and Japan …"
- Tough: "animals other than cats such as dogs …"
- Hopeless: "At Berklee, I was playing with cats such as Jeff Berlin, Mike Stern, Bill Frisell, and Neil Stubenhaus."
Examples

- Animals other than dogs such as cats …
- Eating fruits such as apple … can be good for your health.
- Eating disorders such as … can be bad for your health.
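To make the pattern idea concrete, here is a minimal sketch of "such as" extraction. The letters-and-spaces approximation of a noun phrase and the example sentence are illustrative only; the real pipeline needs proper NP chunking plus statistical scoring over billions of sentences.

```python
import re

# Minimal sketch: extract isA pairs from the "NP such as NP, ..." pattern.
# A noun phrase is crudely approximated as a run of letters and spaces.
SUCH_AS = re.compile(r"([A-Za-z][A-Za-z ]+?)\s+such as\s+([A-Za-z][A-Za-z ,]+)")

def extract_isa(sentence):
    """Yield (superconcept, instance) pairs from one 'such as' match."""
    m = SUCH_AS.search(sentence)
    if not m:
        return
    super_concept = m.group(1).strip().lower()
    for inst in re.split(r",\s*|\s+(?:and|or)\s+", m.group(2)):
        inst = inst.strip().lower()
        if inst:
            yield super_concept, inst

print(list(extract_isa("rich countries such as USA and Japan")))
# -> [('rich countries', 'usa'), ('rich countries', 'japan')]
# The "tough" and "hopeless" sentences above defeat this sketch, which is
# why no single extraction is trusted: evidence is aggregated and scored.
```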
Properties

- Given a class, find its properties.
- Candidate seed properties come from question patterns: "What is the [property] of [instance]?"
- "Where", "When", and "Who" questions are also considered.
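A toy sketch of how such question patterns could seed candidate properties; the regex, corpus, and instance set below are hypothetical illustrations, not Probase's actual extraction code.

```python
import re
from collections import Counter

# Sketch: count what gets asked about known instances of a class.
QUESTION = re.compile(r"[Ww]hat is the ([a-z ]+?) of ([A-Za-z ]+)\?")

def mine_properties(corpus, class_instances):
    counts = Counter()
    for sentence in corpus:
        for prop, inst in QUESTION.findall(sentence):
            if inst.strip().lower() in class_instances:
                counts[prop.strip()] += 1
    return counts

corpus = ["What is the capital of France?",
          "What is the population of China?",
          "What is the population of India?"]
print(mine_properties(corpus, {"france", "china", "india"}))
# Counter({'population': 2, 'capital': 1})
```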
Similarity between two concepts: a weighted linear combination of
- similarity between their sets of instances
- similarity between their sets of attributes

Examples of similar pairs: (nation, country), (celebrity, well-known politicians)
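A minimal sketch of such a weighted combination, assuming Jaccard similarity over the two sets; the sets and the weight alpha are illustrative, not Probase's actual data or tuning.

```python
# Sketch of a weighted linear combination of set similarities.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def concept_similarity(inst1, attr1, inst2, attr2, alpha=0.5):
    """alpha * instance-set similarity + (1 - alpha) * attribute-set similarity."""
    return alpha * jaccard(inst1, inst2) + (1 - alpha) * jaccard(attr1, attr2)

nation  = ({"china", "india", "france"}, {"population", "capital", "language"})
country = ({"china", "india", "canada"}, {"population", "capital", "anthem"})
print(concept_similarity(*nation, *country))  # 0.5: large overlap on both sets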
Beyond noun phrases

Example: the verb "hit"
- (Small object, Hard surface): (bullet, concrete), (ball, wall)
- (Natural disaster, Area): (earthquake, Seattle), (Hurricane Floyd, Florida)
- (Emergency, Country): (economic crisis, Mexico), (flood, Britain)
Quantify Uncertainty

- Typicality: P(concept | instance), P(instance | concept), P(concept | property), P(property | concept)
- Similarity: sim(concept1, concept2)

These scores are the foundation of text understanding and reasoning.
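As a concrete illustration, a small sketch of how typicality scores of this form can be estimated from (concept, instance, frequency) extraction counts; the four triples are toy data, not Probase counts.

```python
from collections import defaultdict

# Toy extraction counts: (concept, instance, frequency).
triples = [("company", "apple", 9000), ("fruit", "apple", 6000),
           ("company", "microsoft", 12000), ("fruit", "pear", 3000)]

concept_total, instance_total = defaultdict(int), defaultdict(int)
pair = defaultdict(int)
for c, e, n in triples:
    concept_total[c] += n
    instance_total[e] += n
    pair[(c, e)] += n

def p_instance_given_concept(e, c):
    return pair[(c, e)] / concept_total[c]

def p_concept_given_instance(c, e):
    return pair[(c, e)] / instance_total[e]

print(p_concept_given_instance("company", "apple"))  # 9000/15000 = 0.6
print(p_instance_given_concept("apple", "fruit"))    # 6000/9000 ≈ 0.67
```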
Text Mining / IE: State of the Art

- Bag-of-words approaches (e.g., LDA): based on statistics over multiple documents; simple bag of words, no semantics.
- Supervised learning (e.g., CRF): requires labeled training data; struggles with out-of-sample features; lacks semantics.

What role can a knowledge base play?
Shopping at Bellevue during TechFest

"Five of us bought 5 Kinects and posed in front of an Apple store." An apple store that sells fruit, or an Apple store that sells iPads?
Step by Step Understanding

1. Entity abstraction
2. Attribute abstraction
3. Short text/query (1-5 words) understanding
4. Text block/document understanding
Explicit Semantic Analysis (ESA)

- Goal: latent topics => explicit topics
- Approach: map text to Wikipedia articles
  - An inverted list records each word's occurrences in Wikipedia articles.
  - Given a document, we derive a distribution over Wikipedia articles.
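A toy sketch of the ESA mapping just described: a word-to-article inverted index turns a document into a weighted distribution over Wikipedia articles. The index entries below are invented for illustration; real ESA uses TF-IDF weights computed over all of Wikipedia.

```python
from collections import defaultdict

# Invented inverted index: word -> [(Wikipedia article, weight), ...].
inverted_index = {
    "bank":    [("Bank", 0.9), ("Bank of America", 0.6)],
    "america": [("United States", 0.8), ("Bank of America", 0.5)],
}

def esa_vector(text):
    """Sum per-word article weights to score articles for a text."""
    scores = defaultdict(float)
    for word in text.lower().split():
        for article, weight in inverted_index.get(word, []):
            scores[article] += weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(esa_vector("Bank of America"))
# [('Bank of America', 1.1), ('Bank', 0.9), ('United States', 0.8)]
```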
Explicit Semantic Analysis (ESA)

Bag of words => bag of Wikipedia articles (a distribution over Wikipedia articles). But a bag of Wikipedia articles is not equivalent to a clear concept in our mental world.

Top 10 concepts for "Bank of America": Bank; Bank of America; Bank of America Plaza (Atlanta); Bank of America Plaza (Dallas); MBNA; VISA (credit card); Bank of America Tower, New York City; NASDAQ; MasterCard; Bank of America Corporate Center
Short Text

- Challenge: not enough statistics
- Applications: Twitter; query/search log; anchor text; image/video tags; document paraphrasing and annotation
Comparison of Knowledge Bases

Cat
- WordNet: Feline; Felid; Adult male; Man; Gossip; Gossiper; Gossipmonger; Rumormonger; Rumourmonger; Newsmonger; Woman; Adult female; Stimulant; Stimulant drug; Excitant; Tracked vehicle; ...
- Wikipedia: Domesticated animals; Cats; Felines; Invasive animal species; Cosmopolitan species; Sequenced genomes; Animals described in 1758; ...
- Freebase: TV episode; Creative work; Musical recording; Organism classification; Dated location; Musical release; Book; Musical album; Film character; Publication; Character species; Top level domain; Animal; Domesticated animal; ...
- Probase: Animal; Pet; Species; Mammal; Small animal; Thing; Mammalian species; Small pet; Animal species; Carnivore; Domesticated animal; Companion animal; Exotic pet; Vertebrate; ...

IBM
- WordNet: N/A
- Wikipedia: Companies listed on the New York Stock Exchange; IBM; Cloud computing providers; Companies based in Westchester County, New York; Multinational companies; Software companies of the United States; Top 100 US Federal Contractors; ...
- Freebase: Business operation; Issuer; Literature subject; Venture investor; Competitor; Software developer; Architectural structure owner; Website owner; Programming language designer; Computer manufacturer/brand; Customer; Operating system developer; Processor manufacturer; ...
- Probase: Company; Vendor; Client; Corporation; Organization; Manufacturer; Industry leader; Firm; Brand; Partner; Large company; Fortune 500 company; Technology company; Supplier; Software vendor; Global company; ...

Language
- WordNet: Communication; Auditory communication; Word; Higher cognitive process; Faculty; Mental faculty; Module; Text; Textual matter; ...
- Wikipedia: Languages; Linguistics; Human communication; Human skills; Wikipedia articles with ASCII art; ...
- Freebase: Employer; Written work; Musical recording; Musical artist; Musical album; Literature subject; Query; Periodical; Type profile; Journal; Quotation subject; Type/domain equivalent topic; Broadcast genre; Periodical subject; Video game content descriptor; ...
- Probase: Instance of: Cognitive function; Knowledge; Cultural factor; Cultural barrier; Cognitive process; Cognitive ability; Cultural difference; Ability; Characteristic. Attribute of: Film; Area; Book; Publication; Magazine; Country; Work; Program; Media; City; ...
When the machine sees the word ‘apple’
When the machine sees ‘apple’ and ‘pear’ together
When the machine sees ‘China’ and ‘Israel’ together
What China is but Israel is not?
What Israel is but China is not?
What’s the difference from a text cloud?
When a machine sees attributes: website, president, city, motto, state, type, director …
Entity Abstraction

Given a set of entities $E = \{e_1, \ldots, e_n\}$, infer the target concept by the Naïve Bayes rule:

$c^{*} = \arg\max_{c} P(c \mid e_1, \ldots, e_n) = \arg\max_{c} P(c) \prod_{i=1}^{n} P(e_i \mid c)$

where $c$ is a concept, and $P(e_i \mid c)$ is computed based on the concept-entity co-occurrence.
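A small sketch of this rule, using the $P(e \mid c)$ values from the example tables a few slides below and assuming a uniform concept prior (the full rule also weighs $P(c)$); a tiny constant smooths unseen pairs.

```python
# P(entity | concept) values copied from the example tables below.
p_e_given_c = {
    "country":         {"china": 0.04354, "india": 0.03576},
    "emerging market": {"china": 0.22377, "india": 0.19462},
    "area":            {"china": 0.00088, "india": 0.00071},
}

def best_concept(entities):
    """argmax_c prod_i P(e_i | c), i.e., Naive Bayes with uniform P(c)."""
    def score(c):
        s = 1.0
        for e in entities:
            s *= p_e_given_c[c].get(e, 1e-9)  # smooth unseen pairs
        return s
    return max(p_e_given_c, key=score)

print(best_concept(["china", "india"]))  # -> 'emerging market'
```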
How to Infer Concept from Attribute?

Given a set of attributes $A = \{a_1, \ldots, a_m\}$, the Naïve Bayes rule gives

$c^{*} = \arg\max_{c} P(c \mid a_1, \ldots, a_m) = \arg\max_{c} P(c) \prod_{j=1}^{m} P(a_j \mid c)$

where $P(a_j \mid c)$ is estimated from concept-attribute co-occurrence counts, which are aggregated from concept-entity and entity-attribute counts:

Concept-entity counts:
(university, florida state university, 75)
(university, harvard university, 388)
(university, university of california, 142)
(country, china, 97346)
(country, the united states, 91083)
(country, india, 80351)
(country, canada, 74481)

Entity-attribute counts:
(florida state university, website, 34)
(harvard university, website, 38)
(university of california, city, 12)
(china, capital, 43)
(the united states, capital, 32)
(india, population, 35)
(canada, population, 21)

Aggregated concept-attribute counts:
(university, website, 4568)
(university, city, 2343)
(country, capital, 4345)
(country, population, 3234)
……
Examples

Concepts related to the entities "china" and "india":

Concept         | Entity | Co-occurrence | Concept Number | Entity Number | P(e|c)  | P(c|e)
country         | india  | 80905         | 2262485        | 197915        | 0.03576 | 0.40879
country         | china  | 98517         | 2262485        | 269127        | 0.04354 | 0.36606
emerging market | india  | 5702          | 29298          | 197915        | 0.19462 | 0.02881
emerging market | china  | 6556          | 29298          | 269127        | 0.22377 | 0.02436
area            | india  | 1797          | 2525020        | 197915        | 0.00071 | 0.00908
area            | china  | 2231          | 2525020        | 269127        | 0.00088 | 0.00829

Concepts related to the attributes "language" and "population":

Concept         | Attribute  | P(c, a)  | P(c)      | P(a)        | P(a|c)  | P(c|a)
country         | population | 4.08183  | 173.44931 | 41736.78060 | 0.02353 | 0.00010
country         | language   | 1.48795  | 173.44931 | 58584.50905 | 0.00858 | 0.00003
emerging market | population | 16.54701 | 402.13772 | 41736.78060 | 0.04115 | 0.00040
emerging market | language   | 4.52949  | 402.13772 | 58584.50905 | 0.01126 | 0.00008
When Type of Term is Unknown

Given a set of terms $T = \{t_1, \ldots, t_n\}$ with unknown types, let $z_i = e$ if $t_i$ is an entity and $z_i = a$ if $t_i$ is an attribute.

Generative model. The Naïve Bayes rule gives:

$P(c \mid T) \propto P(c) \prod_{i=1}^{n} \big[ P(t_i, z_i = e \mid c) + P(t_i, z_i = a \mid c) \big]$

Discriminative model (Noisy-OR). Applying the Bayes rule twice gives:

$P(c \mid T) = 1 - \prod_{i=1}^{n} \big( 1 - P(c \mid t_i) \big)$, where $P(c \mid t_i) = P(c \mid t_i, z_i = e)\,P(z_i = e \mid t_i) + P(c \mid t_i, z_i = a)\,P(z_i = a \mid t_i)$.
Examples

Given "china", "india", "language", and "population", "emerging market" is ranked 1st.

Concept         | Entity | Co-occurrence | Concept Number | Entity Number | P(e|c)  | P(c|e)
country         | india  | 80905         | 2262485        | 197915        | 0.03576 | 0.40879
country         | china  | 98517         | 2262485        | 269127        | 0.04354 | 0.36606
emerging market | india  | 5702          | 29298          | 197915        | 0.19462 | 0.02881
emerging market | china  | 6556          | 29298          | 269127        | 0.22377 | 0.02436
area            | india  | 1797          | 2525020        | 197915        | 0.00071 | 0.00908
area            | china  | 2231          | 2525020        | 269127        | 0.00088 | 0.00829

Concept         | Attribute  | P(c, a)   | P(c)        | P(a)        | P(a|c)  | P(c|a)
factor          | population | 75.74704  | 71073.46656 | 41736.78060 | 0.00107 | 0.00181
factor          | language   | 113.32628 | 71073.46656 | 58584.50905 | 0.00159 | 0.00193
country         | population | 4.08183   | 173.44931   | 41736.78060 | 0.02353 | 0.00010
country         | language   | 1.48795   | 173.44931   | 58584.50905 | 0.00858 | 0.00003
emerging market | population | 16.54701  | 402.13772   | 41736.78060 | 0.04115 | 0.00040
emerging market | language   | 4.52949   | 402.13772   | 58584.50905 | 0.01126 | 0.00008
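The ranking can be checked with a toy version of the generative model: multiply the $P(e \mid c)$ and $P(a \mid c)$ values from the two tables under a uniform concept prior, smoothing unseen concept-term pairs with a tiny constant. (The Noisy-OR variant would combine $P(c \mid t_i)$ values instead.)

```python
# P(term | concept) values copied from the two tables above.
p_term_given_c = {
    "country":         {"china": 0.04354, "india": 0.03576,
                        "population": 0.02353, "language": 0.00858},
    "emerging market": {"china": 0.22377, "india": 0.19462,
                        "population": 0.04115, "language": 0.01126},
    "area":            {"china": 0.00088, "india": 0.00071},
    "factor":          {"population": 0.00107, "language": 0.00159},
}

def rank_concepts(terms):
    """Generative Naive Bayes score with uniform prior: prod_i P(t_i | c)."""
    scores = {}
    for c, likelihood in p_term_given_c.items():
        s = 1.0
        for t in terms:
            s *= likelihood.get(t, 1e-9)  # smooth unseen pairs
        scores[c] = s
    return sorted(scores, key=scores.get, reverse=True)

print(rank_concepts(["china", "india", "language", "population"]))
# 'emerging market' comes first, ahead of 'country', as claimed above.
```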
Example (Cont’d)
Clustering Twitter Messages

Problem 1 (unique concepts): use keywords to retrieve tweets in 3 categories:
1. Microsoft, Yahoo, Google, IBM, Facebook
2. cat, dog, fish, pet, bird
3. Brazil, China, Russia, India

Problem 2 (concepts with subtle differences): use keywords to retrieve tweets in 4 categories:
1. United States, American, Canada
2. Malaysia, China, Singapore, India, Thailand, Korea
3. Angola, Egypt, Sudan, Zambia, Chad, Gambia, Congo
4. Belgium, Finland, France, Germany, Greece, Spain, Switzerland
Comparison Results
Other Possible Applications

- Query expansion: information retrieval; content-based advertisement
- Short text classification/clustering: Twitter analysis/search; text block tagging; image/video surrounding-text summarization
- Document classification/clustering/summarization: news analysis; out-of-sample example classification/clustering
http://www.vitaminshoppe.com/store/en/browse/sku_detail.jsp?id=VS-1803 (Glucosamine & Chondroitin)