CPT-S 580-06 Advanced Databases 11 Yinghui Wu EME 49.

Slides:



Advertisements
Similar presentations
Understanding Tables on the Web Jingjing Wang. Problem to Solve A wealth of information in the World Wide Web Not easy to access or process by machine.
Advertisements

The 20th International Conference on Software Engineering and Knowledge Engineering (SEKE2008) Department of Electrical and Computer Engineering
Date: 2014/05/06 Author: Michael Schuhmacher, Simon Paolo Ponzetto Source: WSDM’14 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Knowledge-based Graph Document.
YAGO: A Large Ontology from Wikipedia and WordNet Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum Max-Planck-Institute for Computer Science, Saarbruecken,
Probase: Understanding Data on the Web Haixun Wang Microsoft Research Asia.
COGEX at the Second RTE Marta Tatu, Brandon Iles, John Slavick, Adrian Novischi, Dan Moldovan Language Computer Corporation April 10 th, 2006.
Web Scale Taxonomy Cleansing Taesung Lee, Zhongyuan Wang, Haixun Wang, Seung-won Hwang VLDB 2011.
Gerhard Weikum Max Planck Institute for Informatics & Saarland University Semantic Search: from Names and Phrases to.
Knowledge Graph: Connecting Big Data Semantics
Leveraging Community-built Knowledge For Type Coercion In Question Answering Aditya Kalyanpur, J William Murdock, James Fan and Chris Welty Mehdi AllahyariSpring.
Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao.
The Unreasonable Effectiveness of Data Alon Halevy, Peter Norvig, and Fernando Pereira Kristine Monteith May 1, 2009 CS 652.
Data Mining with Decision Trees Lutz Hamel Dept. of Computer Science and Statistics University of Rhode Island.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
UCB CS Research Fair Search Text Mining Web Site Usability Marti Hearst SIMS.
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
Sunday May 4 – 5 PM Bradford, Hlava, McNaughton
Overview of Search Engines
Saarbrucken / Germany ¨
Text Analytics And Text Mining Best of Text and Data
CSC 9010 Spring Paula Matuszek A Brief Overview of Watson.
Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Institute for System Programming of RAS.
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
1 The BT Digital Library A case study in intelligent content management Paul Warren
Attribute Extraction and Scoring: A Probabilistic Approach Taesung Lee, Zhongyuan Wang, Haixun Wang, Seung-won Hwang Microsoft Research Asia Speaker: Bo.
Tables to Linked Data Zareen Syed, Tim Finin, Varish Mulwad and Anupam Joshi University of Maryland, Baltimore County
Author: William Tunstall-Pedoe Presenter: Bahareh Sarrafzadeh CS 886 Spring 2015.
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering Jeongwoo Ko, Luo Si, Eric Nyberg (SIGIR ’ 07) Speaker: Cho, Chin Wei Advisor:
1 Learning Sub-structures of Document Semantic Graphs for Document Summarization 1 Jure Leskovec, 1 Marko Grobelnik, 2 Natasa Milic-Frayling 1 Jozef Stefan.
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
BAA - Big Mechanism using SIRA Technology Chuck Rehberg CTO at Trigent Software and Chief Scientist at Semantic Insights™
Q2Semantic: A Lightweight Keyword Interface to Semantic Search Haofen Wang 1, Kang Zhang 1, Qiaoling Liu 1, Thanh Tran 2, and Yong Yu 1 1 Apex Lab, Shanghai.
Understanding User’s Query Intent with Wikipedia G 여 승 후.
Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
Authors: Marius Pasca and Benjamin Van Durme Presented by Bonan Min Weakly-Supervised Acquisition of Open- Domain Classes and Class Attributes from Web.
Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.
Summary Knowledge Bases from Web are Real, Big & Useful: Entities, Classes & Relations Key Asset for Intelligent Applications: Semantic Search, Question.
Of 33 lecture 1: introduction. of 33 the semantic web vision today’s web (1) web content – for human consumption (no structural information) people search.
Named Entity Disambiguation on an Ontology Enriched by Wikipedia Hien Thanh Nguyen 1, Tru Hoang Cao 2 1 Ton Duc Thang University, Vietnam 2 Ho Chi Minh.
Presented By- Shahina Ferdous, Student ID – , Spring 2010.
Finding frequent and interesting triples in text Janez Brank, Dunja Mladenić, Marko Grobelnik Jožef Stefan Institute, Ljubljana, Slovenia.
Exploiting Relevance Feedback in Knowledge Graph Search
DeepDive Model Dongfang Xu Ph.D student, School of Information, University of Arizona Dec 13, 2015.
Tutorial: Knowledge Bases for Web Content Analytics
Commonsense Reasoning in and over Natural Language Hugo Liu, Push Singh Media Laboratory of MIT The 8 th International Conference on Knowledge- Based Intelligent.
LINDEN : Linking Named Entities with Knowledge Base via Semantic Knowledge Date : 2013/03/25 Resource : WWW 2012 Advisor : Dr. Jia-Ling Koh Speaker : Wei.
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Information Extraction from Wikipedia: Moving Down the Long.
Semantic Interoperability in GIS N. L. Sarda Suman Somavarapu.
CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Link mining ( based on slides.
2014 Lexicon-Based Sentiment Analysis Using the Most-Mentioned Word Tree Oct 10 th, 2014 Bo-Hyun Kim, Sr. Software Engineer With Lina Chen, Sr. Software.
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
Sentiment Analysis Using Common- Sense and Context Information Basant Agarwal 1,2, Namita Mittal 2, Pooja Bansal 2, and Sonal Garg 2 1 Department of Computer.
Semantic Graph Mining for Biomedical Network Analysis: A Case Study in Traditional Chinese Medicine Tong Yu HCLS
NELL Knowledge Base of Verbs
Summarizing Entities: A Survey Report
Web IR: Recent Trends; Future of Web Search
Semantic Network & Knowledge Graph
Presented by: Prof. Ali Jaoua
Panagiotis G. Ipeirotis Luis Gravano
ProBase: common Sense Concept KB and Short Text Understanding
The Winograd Schema Challenge Hector J. Levesque AAAI, 2011
Information Retrieval and Web Search
Introduction to Artificial Intelligence
KnowItAll and TextRunner
Presentation transcript:

CPT-S Advanced Databases 11 Yinghui Wu EME 49

Knowledge bases and knowledge retrieval 22

Turn Web into Knowledge Base Web Knowledge knowledge acquisition intelligent interpretation more knowledge, analytics, insight

10M entities in 350K classes 120M facts for 100 relations 100 languages 95% accuracy 4M entities in 250 classes 500M facts for 6000 properties live updates 40M entities in topics 1B facts for 4000 properties core of Google Knowledge Graph Web of Data & Knowledge 600M entities in topics 20B facts > 60 Bio. subject-predicate-object triples from > 1000 sources

Knowledge Bases: a Pragmatic Definition Comprehensive and semantically organized machine-readable collection of universally relevant or domain-specific entities, classes, and SPO facts (attributes, relations) plus spatial and temporal dimensions plus commonsense properties and rules plus contexts of entities and facts (textual & visual witnesses, descriptors, statistics) plus …..

History of Digital Knowledge Bases Cyc  x: human(x)  (  y: mother(x,y)   z: father(x,z))  x,u,w: (mother(x,u)  mother(x,w)  u=w) WordNet guitarist  {player,musician}  artist algebraist  mathematician  scientist Wikipedia 4.5 Mio. English articles 20 Mio. contributors from humans for humans from algorithms for machines

Some Publicly Available Knowledge Bases YAGO: yago-knowledge.orgyago-knowledge.org Dbpedia: dbpedia.orgdbpedia.org Freebase:freebase.comfreebase.com Entitycube: entitycube.research.microsoft.comentitycube.research.microsoft.com renlifang.msra.cn NELL: rtw.ml.cmu.edurtw.ml.cmu.edu DeepDive: deepdive.stanford.edudeepdive.stanford.edu Probase: research.microsoft.com/en-us/projects/probase/ research.microsoft.com/en-us/projects/probase/ KnowItAll / ReVerb: openie.cs.washington.eduopenie.cs.washington.edu reverb.cs.washington.edu BabelNet: babelnet.orgbabelnet.org WikiNet: ConceptNet: conceptnet5.media.mit.educonceptnet5.media.mit.edu WordNet: wordnet.princeton.eduwordnet.princeton.edu Linked Open Data: linkeddata.orglinkeddata.org 7

Knowledge for Intelligence Enabling technology for:  disambiguation in written & spoken natural language  deep reasoning (e.g. QA to win quiz game)  machine reading (e.g. to summarize book or corpus)  semantic search in terms of entities&relations ( not keywords&pages )  entity-level linkage for Big Data European composers who have won film music awards? Chinese professors who founded Internet companies? Enzymes that inhibit HIV? Influenza drugs for teens with high blood pressure?... Politicians who are also scientists? Relationships between John Lennon, Billie Holiday, Heath Ledger, King Kong? 8

Use-Case: Internet Search

Google Knowledge Graph (Google Blog: „Things, not Strings“, 16 May 2012)

Use Case: Question Answering knowledge back-ends question classification & decomposition D. Ferrucci et al.: Building Watson. AI Magazine, Fall IBM Journal of R&D 56(3/4), 2012: This is Watson. This town is known as "Sin City" & its downtown is "Glitter Gulch" This American city has two airports named after a war hero and a WW II battle Q: Sin City ?  movie, graphical novel, nickname for city, … A: Vegas ? Strip ?  Vega (star), Suzanne Vega, Vincent Vega, Las Vegas, …  comic strip, striptease, Las Vegas Strip, … 11

Use Case: Text Analytics (Disease Networks) K.Goh,M.Kusick,D.Valle,B.Childs,M.Vidal,A.Barabasi: The Human Disease Network, PNAS, May 2007 diabetes mellitus, diabetis type 1, diabetes type 2, diabetes insipidus, insulin-dependent diabetes mellitus with ophthalmic complications, ICD-10 E23.2, OMIM , MeSH C , MeSH D003924, … need to understand synonyms vs. homonyms of entities & relations (Google: „things, not strings“) add genetic & pathway data, patient data, reports in social media, etc. → bottlenecks: data variety & data veracity → key asset: digital background knowledge for data cleaning, fusion, sense-making

Use Case: Big Data Analytics (Side Effects of Drug Combinations) Deeper insight from both expert data & social media: actual side effects of drugs … and drug combinations risk factors and complications of (wide-spread) diseases alternative therapies aggregation & comparison by age, gender, life style, etc. Structured Expert Data Social Media harness knowledge base(s) on diseases, symptoms, drugs, biochemistry, food, demography, geography, culture, life style, jobs, transportation, etc. etc.

Big Data + Text Analytics Identify relevant contents sources Identify entities of interest & their relationships Position in time & space Group and aggregate Find insightful patterns & predict trends General Design Pattern: 14 Who covered which other singer? Who influenced which other musicians? Entertainment: Drugs (combinations) and their side effects Health: Politicians‘ positions on controversial topics and their involvement with industry Politics: Customer opinions on small-company products, gathered from social media Business: Trends in society, cultural factors, etc. Culturomics:

Knowledge Bases are labeled graphs singer person resource location city Tupelo subclassOf type bornIn type subclassOf Classes/ Concepts/ Types Instances/ entities Relations/ Predicates A knowledge base can be seen as a directed labeled multi-graph, where the nodes are entities and the edges relations. 15

An entity can have different labels singer person “Elvis” “The King” type label The same label for two entities: ambiguity The same label for two entities: ambiguity The same entity has two labels: synonymy The same entity has two labels: synonymy type 16

Different views of a knowledge base SubjectPredicateObject Elvistypesinger ElvisbornInTupelo... type(Elvis, singer) bornIn(Elvis,Tupelo)... Logical notation: Triple notation: singer type Graph notation: Tupelo bornIn "RDFS Ontology" and "Knowledge Base (KB)" 17

Find classes and instances singer person type Which classes exist? (aka entity types, unary predicates, concepts) subclassOf Which subsumptions hold? Which entities belong to which classes? Which entities exist? 18

WordNet is a lexical knowledge base WordNet project (1985-now) singer person subclassOf living being subclassOf “person” label “individual” “soul” WordNet contains 82,000 classes WordNet contains 118,000 class labels WordNet contains thousands of subclassOf relationships 19 George Miller Christiane Fellbaum

Taxonomic Knowledge – Wikipedia centric methods 20 Graham Cormode courtesy Graham AT&T

Wikipedia's categories contain classes But: categories do not form a taxonomic hierarchy 21

Link Wikipedia categories to WordNet? American billionaires Technology company founders Apple Inc. Deaths from cancer Internet pioneers tycoon, magnate entrepreneur pioneer, innovator ? pioneer, colonist ? Wikipedia categories WordNet classes 22

YAGO = WordNet + Wikipedia WordNet Wikipedia Related project: WikiTaxonomy 105,000 subclassOf links 88% accuracy [Ponzetto & Strube: AAAI‘07] 200,000 classes 460,000 subclassOf 3 Mi. instances 96% accuracy [Suchanek: WWW‘07] American people of Syrian descent person organism subclassOf Steve Jobs type 23

Link Wikipedia & WordNet by Random Walks [Navigli 2010] construct neighborhood around source and target nodes use contextual similarity (glosses etc.) as edge weights compute personalized PR (PPR) with source as start node rank candidate targets by their PPR scores Formula One drivers {driver, device driver} computer program chauffeur race driver trucker tool causal agent Barney Oldfield {driver, operator of vehicle} Formula One champions truck drivers motor racing Michael Schumacher Wikipedia categoriesWordNet classes 24 >

Learning More Mappings [ Wu & Weld: WWW‘08 ] Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using YAGO as training data advanced ML methods (SVM‘s, MLN‘s) rich features from various sources category/class name similarity measures category instances and their infobox templates: template names, attribute names (e.g. knownFor) Wikipedia edit history: refinement of categories Hearst patterns: C such as X, X and Y and other C‘s, … other search-engine statistics: co-occurrence frequencies > 3 Mio. entities > 1 Mio. w/ infoboxes > categories 25

Probase builds a taxonomy from the Web ProBase 2.7 Mio. classes from 1.7 Bio. Web pages [Wu et al.: SIGMOD 2012] Use Hearst liberally to obtain many instance candidates: “plants such as trees and grass“ “plants include water turbines“ “western movies such as The Good, the Bad, and the Ugly“ Problem: signal vs. noise Assess candidate pairs statistically: P[X|Y]  subclassOf(Y X) = 1- p(every evidence is negative) Problem: ambiguity of labels Merge labels of same class: X such as Y 1 and Y 2  same sense of X 26

Web Scale Taxonomy Cleansing Leverage Freebase to Clean and Enrich Probase? FreebaseProbase How is it built?manualautomatic Data modeldeterministicProbabilistic Taxonomy topologyMostly treeDAG # of concepts12.7 thousand2 million # of entities (instances)13 million16 million Information about entityrichSparse adoptionWidely usedNew

Why do not bootstrap from freebase Probase try to capture concepts in our mental world –Freebase only has 12.7 thousand concepts/categories –For those concepts in Freebase we can also easily acquire –The key challenge is how can we capture tail concepts automatically with high precision Probase try to quantify uncertainty –Freebase treats facts as black or white –A large part of Freebase instances(over two million instances) are distributed in a few very popular concepts like “track” and “book” How can we get better one? –Cleansing and enriching Probase by mapping its instances to Freebase

How Can We Attack the Problem? Entity resolution on the union of Probase & Freebase Simple String Similarity? George W. Bush George H. W. Bush –Jaccard Similarity = ¾ George W. Bush Dubya –Jaccard Similarity = 0 Many other learnable string similarity measures are also not free from these kind of problems

Two Types of Evidence in Action Instance Pair Space Correct Matching Pairs [Evidence] Wikilinks [Evidence] String Similarity [Evidence] Wikipedia Redirect Page [Evidence] Wikipedia Disambiguation Page Negative Evidence

Positive Evidence To handle nickname case –Ex) George W. Bush – Dubya –Synonym pairs from known sources – [Silviu Cucerzan, EMNLP-CoNLL 2007, …] Wikipedia Redirects Wikipedia Internal Links – [[ George W. Bush | Dubya ]] Wikipedia Disambiguation Page Patterns such as ‘whose nickname is’, ‘also known as’

Negative Evidence What if we know they are different? It is possible to… –stop transitivity at the right place –safely utilize string similarity –remove false positive evidence ‘Clean data has no duplicates’ Principle –The same entity would not be mentioned in a list of instances several times in different forms –assume these are clean: ‘List of ***’, ‘Table of ***’ such as … Freebase George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush Different!

With a Pair of Concepts Building a graph of entities –Quantifying positive evidence for edge weight –Noisy-or model Using All kinds of positive evidence including string similarity George W. Bush Dubya Bush President Bush George H. W. Bush President Bill Clinton Hillary Clinton Bill Clinton

Multi-Multiway Cut Problem Multi-Multiway Cut problem on a graph of entities –A variant of s-t cut –No vertices with the same label may belong to one connected component (cluster) {George W. Bush, President Bill Clinton, George H. W. Bush}, {George W. Bush, Bill Clinton, Hillary Clinton}, … –The cost function: the sum of removed edge weights George W. Bush Dubya Bush President Bush George H. W. Bush President Bill Clinton Hillary Clinton Bill Clinton

method Monte Carlo heuristics algorithm: –Step 1: Randomly insert one edge at a time with probability proportional to the weight of edge –Step 2: Skip an edge if it violates any piece of negative evidence Repeat the random process several times, and then choose the best one that minimize the cost

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

George W. Bush Dubya Bush Bush Sr. President Bush George H. W. Bush President Bill Clinton

Use query logs to refine taxonomy [Pasca 2011] Input: type(Y, X 1 ), type(Y, X 2 ), type(Y, X 3 ), e.g, extracted from Web Goal: rank candidate classes X 1, X 2, X 3 H1: X and Y should co-occur frequently in queries  score1(X)  freq(X,Y) * #distinctPatterns(X,Y) H2: If Y is ambiguous, then users will query X Y:  score2(X)  (  i=1..N term-score(t i  X)) 1/N example query: "Michael Jordan computer scientist" H3: If Y is ambiguous, then users will query first X, then X Y:  score3(X)  (  i=1..N term-session-score(t i  X)) 1/N Combine the following scores to rank candidate classes: 43

Lessons learned Semantic classes for entities > 10 Mio. entities in 100,000‘s of classes backbone for other kinds of knowledge harvesting great mileage for semantic search e.g. politicians who are scientists, French professors who founded Internet companies, … Variety of methods noun phrase analysis, random walks, extraction from tables, … Still room for improvement higher coverage, deeper in long tail, … 44