Probase: A Knowledge Base for Text Understanding


1 Probase: A Knowledge Base for Text Understanding

2 Integrating, representing, and reasoning over human knowledge: the grand challenge of the 21st century.

3 Knowledge Bases
Scope: cross-domain / encyclopedia (Freebase, Cyc, etc.); vertical-domain / directory (EntityCube, etc.)
Challenges: completeness (e.g., a complete list of restaurants); accuracy (e.g., address, phone, menu of each restaurant); data integration

4 “Concepts are the glue that holds our mental world together.”
“By being able to place something into its proper place in the concept hierarchy, one can learn a considerable amount about it.”

5 Yellow Pages or Knowledge Bases?
Existing knowledge bases are more like big Yellow Pages books, whose goals are a complete list of items and complete information about each item. But humans do not need "complete" knowledge in order to understand the world.
The statistical machine learning strategies that it uses are indeed a big advance over traditional GOFAI techniques. But they still fall far short of what human beings do. To see this, take Watson's most famous blunder. On day 2 of the "Jeopardy!" challenge, the competitors were given a clue in the category "U.S. Cities." The clue was, "Its largest airport is named for a World War II hero; its second largest for a World War II battle." Both Ken Jennings and Brad Rutter correctly answered Chicago. Watson didn't just get the wrong city; its answer, Toronto, is not a United States city at all. This is the kind of blunder that practically no human being would make. It may be what motivated Jennings to say, "The illusion is that this computer is doing the same thing that a very good 'Jeopardy!' player would do. It's not. It's doing something sort of different that looks the same on the surface. And every so often you see the cracks."
So what went wrong? David Ferrucci, the principal investigator on the IBM team, says that there were a variety of factors that led Watson astray. But one of them relates specifically to the machine learning strategies. During its training phase, Watson had learned that categories are only a weak indicator of the answer type. A clue in the category "U.S. Cities," for example, might read "Rochester, New York, grew because of its location on this." The answer, the Erie Canal, is obviously not a U.S. city. So from examples like this Watson learned, all else being equal, that it shouldn't pay too much attention to mismatches between category and answer type. The problem is that lack of attention to such a mismatch will sometimes produce a howler. Knowing when it's relevant to pay attention to the mismatch and when it's not is trivial for a human being. But Watson doesn't understand relevance at all. It only measures statistical frequencies. Because it is relatively common to find mismatches of this sort, Watson learns to weigh them as only mild evidence against the answer. But the human just doesn't do it that way. The human being sees immediately that the mismatch is irrelevant for the Erie Canal but essential for Toronto. Past frequency is simply no guide to relevance.

6 What’s wrong with the yellow pages approach?
Rich in instances, poor in concepts
Understanding = forming concepts
Do ontologies always have an elegant hierarchy? Manually crafted ontologies are made for human use; our mental world does not have an elegant hierarchy.

7 Probase: “It is better to use the world as its own model.”
Capture concepts (in our mental world); quantify uncertainty (for reasoning).
The new, embodied paradigm in AI, deriving primarily from the work of roboticist Rodney Brooks, insists that the body is required for intelligence. Indeed, Brooks's classic 1990 paper, "Elephants Don't Play Chess," rejected the very symbolic computation paradigm against which Dreyfus had railed, favoring instead a range of biologically inspired robots that could solve apparently simple, but actually quite complicated, problems like locomotion, grasping, navigation through physical environments and so on. To solve these problems, Brooks discovered that it was actually a disadvantage for the system to represent the status of the environment and respond to it on the basis of pre-programmed rules about what to do, as the traditional GOFAI systems had. Instead, Brooks insisted, "It is better to use the world as its own model."
Brooks believes that attempting to come up with models and representations of the real world is the wrong approach; it is "better to use the world as its own model." (Brooks, R. A. (1990). Elephants don't play chess. In Pattie Maes (Ed.), Designing Autonomous Agents. Cambridge, MA: MIT Press.) Elephants don't play chess, but they are still intelligent!

8 Capture concepts in human mind
1. Capture concepts in the human mind: more than 2.7 million concepts automatically harnessed from 1.68 billion documents.
2. Represent them in a computable form and transfer them to machines, so that machines have a better understanding of the human world. Computation/reasoning is enabled by scoring:
Consensus: e.g., is there a company called Apple?
Typicality: e.g., how likely are you to think of Apple when you think about companies?
Ambiguity: e.g., does the word Apple, without any context, represent Apple the company?
Similarity: e.g., how likely is an actor also a celebrity?
Freshness: e.g., Pluto as a dwarf planet is a fresher claim than Pluto as a planet.
3. Give machines a new CPU (Commonsense Processing Unit) powered by a distributed graph engine called Trinity.
4. A little knowledge goes a long way once machines acquire a human touch.

9 Probase has a big concept space
Probase: 2.7 M concepts, automatically harnessed
Freebase: 2 K concepts, built by community effort
Cyc: 120 K concepts, 25 years of human labor

10 Probase has 2.7 million concepts (frequency distribution)

11 Probase vs. Freebase: Uncertainty
Probase: correctness is a probability; live with dirty data; dirty data is very useful.
Freebase: knowledge is black and white; clean up everything; dirty data is unusable.

12 Probase Internals
Example subgraph: the instance Picasso (Movement: Cubism; Born: 1881; Died: 1973) belongs to the concepts artist and painter, and is linked by a "created" edge to Guernica (Year: 1937; Type: Oil on Canvas), an instance of the concepts painting and art.
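For illustration only, the example subgraph above can be serialized as (subject, relation, object) triples; the relation names below are assumptions made for this sketch, not Probase's actual schema.

```python
# Hypothetical triple serialization of the slide's example subgraph.
# Relation names are illustrative assumptions, not Probase's actual schema.
triples = [
    ("Picasso", "isA", "artist"),
    ("Picasso", "isA", "painter"),
    ("Picasso", "Movement", "Cubism"),
    ("Picasso", "Born", "1881"),
    ("Picasso", "Died", "1973"),
    ("Picasso", "created", "Guernica"),
    ("Guernica", "isA", "painting"),
    ("Guernica", "isA", "art"),
    ("Guernica", "Year", "1937"),
    ("Guernica", "Type", "Oil on Canvas"),
]

# Simple lookup: everything the subgraph says about "Guernica".
facts = [(rel, obj) for subj, rel, obj in triples if subj == "Guernica"]
print(facts)
```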

13 Data Sources
Patterns for single statements:
NP such as {NP, NP, ..., (and|or)} NP
such NP as {NP,}* {(or|and)} NP
NP {, NP}* {,} or other NP
NP {, NP}* {,} and other NP
NP {,} including {NP,}* {or|and} NP
NP {,} especially {NP,}* {or|and} NP
Examples:
Good: "rich countries such as USA and Japan …"
Tough: "animals other than cats such as dogs …"
Hopeless: "At Berklee, I was playing with cats such as Jeff Berlin, Mike Stern, Bill Frisell, and Neil Stubenhaus."
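A minimal sketch of pattern-based isA extraction for the first pattern above. The noun-phrase matching is deliberately crude (one word for the concept, a run of text up to punctuation for the instance list), which is enough to reproduce the "good" and "tough" behaviour described on this slide; it is not the extractor actually used to build Probase.

```python
import re

# Sketch of isA extraction for the pattern "NP such as NP {, NP}* {and|or} NP".
# The concept NP is approximated by the single word before "such as", and the
# instance list by the text up to the next punctuation mark.
PATTERN = re.compile(r"(\w+) such as ([^.;:()]+)")

def extract_isa(sentence):
    m = PATTERN.search(sentence)
    if not m:
        return []
    concept = m.group(1)
    instances = re.split(r",\s*|\s+and\s+|\s+or\s+", m.group(2))
    return [(inst.strip(), concept) for inst in instances if inst.strip()]

print(extract_isa("rich countries such as USA and Japan."))
# -> [('USA', 'countries'), ('Japan', 'countries')]      (good)
print(extract_isa("animals other than cats such as dogs."))
# -> [('dogs', 'cats')]                                   (tough: the wrong concept)
```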

14 Examples
Animals other than dogs such as cats …
Eating fruits such as apple … can be good to your health.
Eating disorders such as … can be bad to your health.

15 Properties
Given a class, find its properties.
Candidate seed properties come from questions of the form "What is the [property] of [instance]?"; "Where", "When", and "Who" questions are also considered.
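A minimal sketch of how candidate seed properties could be mined from a query log by matching the question pattern above; the query log, the instance list, and the single regular expression are illustrative assumptions.

```python
import re
from collections import Counter

# Sketch: mine candidate seed properties for a class by matching
# "What/Where/When/Who is the <property> of <instance>?" questions.
# The query_log and instance list below are made-up placeholders.
QUESTION = re.compile(r"(?:what|where|when|who) is the (\w+) of (.+?)\?", re.IGNORECASE)

def candidate_properties(query_log, class_instances):
    instances = {i.lower() for i in class_instances}
    counts = Counter()
    for q in query_log:
        m = QUESTION.search(q)
        if m and m.group(2).strip().lower() in instances:
            counts[m.group(1).lower()] += 1
    return counts.most_common()

query_log = [
    "What is the capital of China?",
    "What is the population of India?",
    "What is the capital of France?",
    "Who is the president of France?",
]
print(candidate_properties(query_log, ["China", "India", "France"]))
# -> [('capital', 2), ('population', 1), ('president', 1)]
```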

16 Similarity between two concepts
A weighted linear combination of:
similarity between the sets of instances
similarity between the sets of attributes
Examples: (nation, country), (celebrity, well-known politicians)
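A minimal sketch of the weighted linear combination, using Jaccard overlap for the two set similarities; the Jaccard measure and the equal weights are illustrative choices, not necessarily those used in Probase.

```python
# Sketch of sim(c1, c2) as a weighted linear combination of instance-set and
# attribute-set overlap.  Jaccard overlap and equal weights are illustrative
# choices; Probase's actual similarity function may differ.
def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0

def concept_similarity(inst1, attr1, inst2, attr2, w_inst=0.5, w_attr=0.5):
    return w_inst * jaccard(inst1, inst2) + w_attr * jaccard(attr1, attr2)

nation  = ({"china", "india", "france"}, {"capital", "population", "language"})
country = ({"china", "india", "canada"}, {"capital", "population", "area"})
print(concept_similarity(*nation, *country))   # 0.5: large overlap in both sets
```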

17 Beyond noun phrases
Example: the verb "hit"
(Small object, Hard surface): (bullet, concrete), (ball, wall)
(Natural disaster, Area): (earthquake, Seattle), (Hurricane Floyd, Florida)
(Emergency, Country): (economic crisis, Mexico), (flood, Britain)

18 Quantify Uncertainty
Typicality: P(concept | instance), P(instance | concept), P(concept | property), P(property | concept)
Similarity: sim(concept1, concept2)
These scores are the foundation of text understanding and reasoning.
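For concreteness, a common way to estimate such typicality scores from extraction counts (an illustrative formulation; the exact Probase scoring may differ) is:

```latex
P(c \mid e) \approx \frac{n(e, c)}{\sum_{c'} n(e, c')},
\qquad
P(e \mid c) \approx \frac{n(e, c)}{\sum_{e'} n(e', c)}
```

where n(e, c) is the number of times instance e is extracted under concept c; the property-based scores P(c | p) and P(p | c) can be estimated analogously from concept-property co-occurrence counts.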

19 Text Mining / IE: State of the Art
Bag-of-words approaches (e.g., LDA): rely on statistics over multiple documents; a simple bag of words carries no semantics.
Supervised learning (e.g., CRF): labeled training data is required; out-of-sample features are hard to handle; semantics is still lacking.
What role can a knowledge base play?

20 Shopping at Bellevue during TechFest
Five of us bought 5 Kinects and posed in front of an Apple store. Is that an apple store that sells fruit, or an Apple store that sells iPads?

21 Step by Step Understanding
Entity abstraction
Attribute abstraction
Short text/query (1-5 words) understanding
Text block/document understanding

22 Explicit Semantic Analysis (ESA)
Goal: latent topics => explicit topics
Approach: map text to Wikipedia articles using an inverted list that records each word's occurrences in Wikipedia articles; given a document, derive a distribution over Wikipedia articles.
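A minimal sketch of the ESA pipeline: build an inverted list from words to the articles they occur in, then map a text to a weighted vector of articles. The two toy "articles" and the raw term-count weights stand in for Wikipedia and TF-IDF weighting.

```python
from collections import defaultdict

# Sketch of ESA: an inverted list from words to the articles they occur in,
# and a text-to-article-vector mapping.  The two toy articles and the raw
# term-count weights stand in for Wikipedia and TF-IDF weighting.
articles = {
    "Bank": "a bank is a financial institution that accepts deposits",
    "River": "a river bank is the land alongside a river",
}

inverted = defaultdict(dict)                      # word -> {article title: weight}
for title, body in articles.items():
    for word in body.lower().split():
        inverted[word][title] = inverted[word].get(title, 0) + 1

def esa_vector(text):
    vec = defaultdict(float)                      # article title -> score
    for word in text.lower().split():
        for title, weight in inverted.get(word, {}).items():
            vec[title] += weight
    return dict(vec)

print(esa_vector("bank deposits"))                # {'Bank': 2.0, 'River': 1.0}
```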

23 Explicit Semantic Analysis (ESA)
Bag of words => bag of Wikipedia articles (a distribution over Wikipedia articles). A bag of Wikipedia articles is not equivalent to a clear concept in our mental world.
Top 10 concepts for "Bank of America": Bank, Bank of America, Bank of America Plaza (Atlanta), Bank of America Plaza (Dallas), MBNA, VISA (credit card), Bank of America Tower New York City, NASDAQ, MasterCard, Bank of America Corporate Center

24 Short Text
Challenge: not enough statistics
Applications: Twitter; query/search log; anchor text; image/video tags; document paraphrasing and annotation

25 Comparison of Knowledge Bases
Concepts returned for "Cat":
WordNet: Feline; Felid; Adult male; Man; Gossip; Gossiper; Gossipmonger; Rumormonger; Rumourmonger; Newsmonger; Woman; Adult female; Stimulant; Stimulant drug; Excitant; Tracked vehicle; ...
Wikipedia: Domesticated animals; Cats; Felines; Invasive animal species; Cosmopolitan species; Sequenced genomes; Animals described in 1758
Freebase: TV episode; Creative work; Musical recording; Organism classification; Dated location; Musical release; Book; Musical album; Film character; Publication; Character species; Top level domain; Animal; Domesticated animal; ...
Probase: Animal; Pet; Species; Mammal; Small animal; Thing; Mammalian species; Small pet; Animal species; Carnivore; Domesticated animal; Companion animal; Exotic pet; Vertebrate; ...
Concepts returned for "IBM":
WordNet: N/A
Wikipedia: Companies listed on the New York Stock Exchange; IBM; Cloud computing providers; Companies based in Westchester County, New York; Multinational companies; Software companies of the United States; Top 100 US Federal Contractors; ...
Freebase: Business operation; Issuer; Literature subject; Venture investor; Competitor; Software developer; Architectural structure owner; Website owner; Programming language designer; Computer manufacturer/brand; Customer; Operating system developer; Processor manufacturer; ...
Probase: Company; Vendor; Client; Corporation; Organization; Manufacturer; Industry leader; Firm; Brand; Partner; Large company; Fortune 500 company; Technology company; Supplier; Software vendor; Global company; ...
Concepts returned for "Language":
WordNet: Communication; Auditory communication; Word; Higher cognitive process; Faculty; Mental faculty; Module; Text; Textual matter
Wikipedia: Languages; Linguistics; Human communication; Human skills; Wikipedia articles with ASCII art
Freebase: Employer; Written work; Musical recording; Musical artist; Musical album; Literature subject; Query; Periodical; Type profile; Journal; Quotation subject; Type/domain equivalent topic; Broadcast genre; Periodical subject; Video game content descriptor; ...
Probase: Instance of: Cognitive function; Knowledge; Cultural factor; Cultural barrier; Cognitive process; Cognitive ability; Cultural difference; Ability; Characteristic. Attribute of: Film; Area; Book; Publication; Magazine; Country; Work; Program; Media; City; ...

26 When the machine sees the word ‘apple’

27 When the machine sees ‘apple’ and ‘pear’ together

28 When the machine sees ‘China’ and ‘Israel’ together

29 What China is but Israel is not?

30 What Israel is but China is not?

31 What’s the difference from a text cloud?

32

33 When a machine sees attributes …
website, president, city, motto, state, type, director

34 Entity Abstraction
Given a set of entities e_1, …, e_m, find the target concept via the Naïve Bayes rule:
c^* = \arg\max_c P(c \mid e_1, \dots, e_m) \propto P(c) \prod_{i=1}^{m} P(e_i \mid c),
where c is a concept and P(e_i | c) is computed from concept-entity co-occurrence counts.

35 How to Infer Concept from Attribute?
Given a set of attributes a_1, …, a_n, the Naïve Bayes rule gives
P(c \mid a_1, \dots, a_n) \propto P(c) \prod_{j=1}^{n} P(a_j \mid c),
where P(a_j | c) is estimated from co-occurrence counts such as:
Concept-entity counts: (university, florida state university, 75), (university, harvard university, 388), (university, university of california, 142), (country, china, 97346), (country, the united states, 91083), (country, india, 80351), (country, canada, 74481), ...
Entity-attribute counts: (florida state university, website, 34), (harvard university, website, 38), (university of california, city, 12), (china, capital, 43), (the united states, capital, 32), (india, population, 35), (canada, population, 21), ...
Concept-attribute counts: (university, website, 4568), (university, city, 2343), (country, capital, 4345), (country, population, 3234), ...
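A minimal sketch of the Naïve Bayes scoring on this and the previous slide, computed from count triples like those above; the tiny count tables, the concept totals, and the add-alpha smoothing are illustrative assumptions.

```python
import math

# Count triples in the (concept, entity, count) / (concept, attribute, count)
# form shown on the slide; only a few entries are kept, and the concept totals
# below are made-up numbers for the sake of the example.
concept_entity = {("university", "harvard university"): 388,
                  ("country", "china"): 97346,
                  ("country", "india"): 80351}
concept_attribute = {("university", "website"): 4568,
                     ("university", "city"): 2343,
                     ("country", "capital"): 4345,
                     ("country", "population"): 3234}
concept_total = {"university": 200000, "country": 400000}   # illustrative totals

def log_posterior(concept, entities=(), attributes=(), alpha=1e-6):
    """log P(c) + sum_i log P(e_i|c) + sum_j log P(a_j|c), with add-alpha smoothing."""
    total = concept_total[concept]
    score = math.log(total / sum(concept_total.values()))    # prior P(c)
    for e in entities:
        score += math.log((concept_entity.get((concept, e), 0) + alpha) / total)
    for a in attributes:
        score += math.log((concept_attribute.get((concept, a), 0) + alpha) / total)
    return score

for c in concept_total:
    print(c, log_posterior(c, entities=["china", "india"], attributes=["population"]))
# "country" gets a much higher score than "university"
```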

36 Examples
Concepts related to the entities "china" and "india" (with co-occurrence counts, concept/entity frequencies, P(e|c), and P(c|e)): country (india: 80905, china: 98517), emerging market (india: 6556, china: 5702), area, ...
Concepts related to the attributes "language" and "population" (with P(c, a), P(c), P(a), P(a|c), and P(c|a)): country, emerging market, ...

37 When the Type of a Term is Unknown
Given a set of terms whose types (entity vs. attribute) are unknown:
Generative model: apply the Naïve Bayes rule.
Discriminative model (Noisy-OR): apply the Bayes rule twice.
Here e denotes "entity" and a denotes "attribute".
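As an illustration only (an assumption, not necessarily the exact Probase formulation), one generative model treats each term's type as hidden and marginalizes over it:

```latex
P(c \mid t_1, \dots, t_n) \;\propto\; P(c) \prod_{i=1}^{n}
\bigl[ P(t_i, e \mid c) + P(t_i, a \mid c) \bigr]
```

where each term t_i may be an entity (e) or an attribute (a), and the two components are estimated from the concept-entity and concept-attribute co-occurrence counts, respectively.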

38 Examples
Given "china", "india", "language", and "population", the concept "emerging market" is ranked 1st.
Concepts related to the entities (with co-occurrence counts, P(e|c), and P(c|e)): country (india: 80905, china: 98517), emerging market (india: 6556, china: 5702), area, ...
Concepts related to the attributes "language" and "population" (with P(c, a), P(c), P(a), P(a|c), and P(c|a)): factor, countries, emerging market, ...

39 Example (Cont’d)

40 Clustering Twitter Messages
Problem 1 (unique concepts): use keywords to retrieve tweets in 3 categories:
1. Microsoft, Yahoo, Google, IBM, Facebook
2. cat, dog, fish, pet, bird
3. Brazil, China, Russia, India
Problem 2 (concepts with subtle differences): use keywords to retrieve tweets in 4 categories:
1. United States, American, Canada
2. Malaysia, China, Singapore, India, Thailand, Korea
3. Angola, Egypt, Sudan, Zambia, Chad, Gambia, Congo
4. Belgium, Finland, France, Germany, Greece, Spain, Switzerland
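A minimal sketch of concept-level clustering for such tweets: each term is mapped to a concept through a hypothetical concept_of lookup standing in for Probase conceptualization, and k-means is run over the resulting concept vectors; the lookup table, the example tweets, and the use of scikit-learn are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical term -> concept lookup standing in for Probase conceptualization.
concept_of = {"microsoft": "company", "google": "company", "ibm": "company",
              "cat": "pet", "dog": "pet", "fish": "pet",
              "brazil": "emergingmarket", "china": "emergingmarket",
              "india": "emergingmarket"}

def conceptualize(tweet):
    # Replace each known term by its concept so that tweets about different
    # companies (or pets, or countries) share features.
    return " ".join(concept_of.get(word, word) for word in tweet.lower().split())

tweets = ["Microsoft and Google announce earnings",
          "my cat chases the dog all day",
          "trade between Brazil and India is growing"]

X = CountVectorizer().fit_transform(conceptualize(t) for t in tweets)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)   # the company, pet, and emerging-market tweets land in separate clusters
```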

41 Comparison Results

42 Other Possible Applications
Query expansion: information retrieval; content-based advertisement
Short text classification/clustering: Twitter analysis/search; text block tagging; image/video surrounding-text summarization
Document classification/clustering/summarization: news analysis
Out-of-sample example classification/clustering
