Big Text: from Language to Knowledge Gerhard Weikum Max Planck Institute for Informatics & Saarland University Saarbrücken, Germany

Slides:



Advertisements
Similar presentations
CH-4 Ontologies, Querying and Data Integration. Introduction to RDF(S) RDF stands for Resource Description Framework. RDF is a standard for describing.
Advertisements

“How Can Research Help Me?” Please make SURE your notes are similar to what I have written in mine.
Date: 2014/05/06 Author: Michael Schuhmacher, Simon Paolo Ponzetto Source: WSDM’14 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Knowledge-based Graph Document.
YAGO: A Large Ontology from Wikipedia and WordNet Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum Max-Planck-Institute for Computer Science, Saarbruecken,
Our Digital World Second Edition
1 gStore: Answering SPARQL Queries Via Subgraph Matching Presented by Guan Wang Kent State University October 24, 2011.
Gerhard Weikum Max Planck Institute for Informatics & Saarland University Semantic Search: from Names and Phrases to.
Choosing a Topic and Developing Research Questions
YAGO-NAGA Project Presented By: Mohammad Dwaikat To: Dr. Yuliya Lierler CSCI 8986 – Fall 2012.
The Culture of The Millennial Generation
Probabilistic Latent-Factor Database Models Denis Krompaß 1, Xueyan Jiang 1,Maximilian Nickel 2 and Volker Tresp 1,3 1 Department of Computer Science.
SEARCHING THE INTERNET An Overview. What is a Search Engine electronically gathers Internet information into an index to be searched using keywords, narrowing.
Helping people find content … preparing content to be found Enabling the Semantic Web Joseph Busch.
Search Engines and Information Retrieval
Database and Information- Retrieval Methods for Knowledge Discovery Database and Information- Retrieval Methods for Knowledge Discovery Gerhard Weikum,
DART 261 Library Research Melinda Reinhart Visual Arts Librarian October 2010.
Information Retrieval in Practice
Overview of Web Data Mining and Applications Part I
SEO & Content Marketing | April 2015 bradforster.org Winning at SEO & Content Marketing.
Introduction The large amount of traffic nowadays in Internet comes from social video streams. Internet Service Providers can significantly enhance local.
SOCIAL NETWORKS AND THEIR IMPACTS ON BRANDS Edwin Dionel Molina Vásquez.
 Type of site: Social annotations, highlighting and social bookmarking  Launched: July 24 th, 2006  Institutional video:
Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Institute for System Programming of RAS.
DEANNA: Natural Language Questions for the Web of Data Mohamed Yahya † Klaus Berberich † Shady Elbassiousni * Maya Ramanath ‡ Volker Tresp # Gerhard Weikum.
TWIRL Twinning virtual World (on- line) Information with Real world (off-Line) data sources Kick-Off Meeting Cassidian 08 & 09 October 2012, Paris - France.
Selecting a Topic and Purpose
Effective Content Techniques for Online Media Cindy Royal, Ph.D Associate Professor Texas State University School of Journalism and Mass Communication.
How to Research. Research Paper Assignment Identify what the assignment requires:  topic possibilities  number of sources  type of sources (journal,
“……What has TV guide got to with news?”. “In order to have a successful report you must assemble the facts and opinions from a variety of sources, review.
Online Intelligence Solutions HOW CAN BIG DATA ANALYTICS HELP THE MEDIA BUSINESS? Mathieu Llorens – CEO – AT Internet Futur Media – June 27th Moscow.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
INTRODUCTION TO RESEARCH. Learning to become a researcher By the time you get to college, you will be expected to advance from: Information retrieval–
Péter Schönhofen – Ad Hoc Hungarian → English – CLEF Workshop 20 Sep 2007 Performing Cross-Language Retrieval with Wikipedia Participation report for Ad.
Strategies for Conducting Research on the Internet Angela Carritt User Coordinator, Oxford University Library Services Angela Carritt User Education Coordinator,
LOD for the Rest of Us Tim Finin, Anupam Joshi, Varish Mulwad and Lushan Han University of Maryland, Baltimore County 15 March 2012
Natural Language Questions for the Web of Data Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany Shady Elbassuoni.
Natural Language Questions for the Web of Data Mohamed Yahya 1, Klaus Berberich 1, Shady Elbassuoni 2 Maya Ramanath 3, Volker Tresp 4, Gerhard Weikum 1.
Discovering Computers Fundamentals, Third Edition CGS 1000 Introduction to Computers and Technology Spring 2007.
ITGS Databases.
Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova , Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg.
Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.
Natural Language Questions for the Web of Data 1 Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany 2 Shady Elbassuoni.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
Majid Sazvar Knowledge Engineering Research Group Ferdowsi University of Mashhad Semantic Web Reasoning.
Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.
Summary Knowledge Bases from Web are Real, Big & Useful: Entities, Classes & Relations Key Asset for Intelligent Applications: Semantic Search, Question.
WEB 2.0 PATTERNS Carolina Marin. Content  Introduction  The Participation-Collaboration Pattern  The Collaborative Tagging Pattern.
Natural Language Questions for the Web of Data Mohamed Yahya, Klaus Berberich, Gerhard Weikum Max Planck Institute for Informatics, Germany Shady Elbassuoni.
Mr. P’s Class Term Paper All the Steps on the Path to an “A” Term Paper in World History.
Tutorial: Knowledge Bases for Web Content Analytics
Commonsense Reasoning in and over Natural Language Hugo Liu, Push Singh Media Laboratory of MIT The 8 th International Conference on Knowledge- Based Intelligent.
Leveraging Knowledge Bases for Contextual Entity Exploration Categories Date:2015/09/17 Author:Joonseok Lee, Ariel Fuxman, Bo Zhao, Yuanhua Lv Source:KDD'15.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Advisor-Advisee Relationships from Research Publication.
Artificial Intelligence, simulation and modelling.
Sports Market Research. Know Your Customer How do businesses know their customers needs and wants?  Ask them/talking to customers  Surveys  Questionnaires.
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
Searching the Web for academic information Ruth Stubbings.
Big Text: from Names and Phrases to Entities and Relations Gerhard Weikum Max Planck Institute for Informatics Saarbrücken, Germany
Internet and world wide web Information Technology
Information Retrieval in Practice
E-Commerce Theories & Practices
Web IR: Recent Trends; Future of Web Search
ESWC’14 龚赛赛.
Semantic Network & Knowledge Graph
Online Tool Screen shots
TDM=Text Mining “automated processing of large amounts of structured digital textual content for purposes of information retrieval, extraction, interpretation.
Discovering Emerging Entities with Ambiguous Names
Course Choice - S4 Computing Science Learning Intentions
Course Summary ChengXiang “Cheng” Zhai Department of Computer Science
PolyAnalyst Web Report Training
Presentation transcript:

Big Text: from Language to Knowledge Gerhard Weikum Max Planck Institute for Informatics & Saarland University Saarbrücken, Germany

From Natural-Language Text to Knowledge Web Contents Knowledge knowledge acquisition intelligent interpretation more knowledge, analytics, insight

Cyc TextRunner/ ReVerb WikiTaxonomy/ WikiNet SUMO ConceptNet 5 BabelNet ReadTheWeb Web of Data & Knowledge (Linked Open Data) > 50 Bio. subject-predicate-object triples from > 1000 sources

10M entities in 350K classes 120M facts for 100 relations 100 languages 95% accuracy 4M entities in 250 classes 500M facts for 6000 properties live updates 40M entities in topics 1B facts for 4000 properties core of Google Knowledge Graph Web of Data & Knowledge 600M entities in topics 20B facts > 50 Bio. subject-predicate-object triples from > 1000 sources

Web of Data & Knowledge > 50 Bio. subject-predicate-object triples from > 1000 sources Bob_Dylan type songwriter Bob_Dylan type civil_rights_activist songwriter subclassOf artist Bob_Dylan composed Hurricane Hurricane isAbout Rubin_Carter Steve_Jobs marriedTo Sara_Lownds validDuring [Sep-1965, June-1977] Bob_Dylan knownAs „voice of a generation“ Steve_Jobs „was big fan of“ Bob_Dylan Bob_Dylan „briefly dated“ Joan_Baez taxonomic knowledge factual knowledge temporal knowledge terminological knowledge evidence & belief knowledge

Knowledge for Intelligent Applications Enabling technology for: disambiguation in written & spoken natural language deep reasoning (e.g. QA to win quiz game) machine reading (e.g. to summarize book or corpus) semantic search in terms of entities&relations (not keywords&pages) entity-level linkage for Big Data & Big Text analytics

Big Text Analytics: Who Covered Whom? 1000‘s of Databases 100 Mio‘s of Web Tables 100 Bio‘s of Web & Social Media Pages Hannes Wader Elvis Presley Wooden Heart Elvis Presley F. Silcher Muss i denn Tote Hosen Hannes Wader Heute hier morgen dort Musician Original Title in different language, country, key, … with more sales, awards, media buzz, ….....

Big Text Analytics: Who Covered Whom? 1000‘s of Databases 100 Mio‘s of Web Tables 100 Bio‘s of Web & Social Media Pages in different language, country, key, … with more sales, awards, media buzz, …..... Hannes Wader Wooden Heart Hannes Wader Heute Hier Tote Hosen Morgen Dort Musician PerformedTitle Elvis Wood Heart F. Silcher Muss i denn Hans E. Wader Heute Hier Musician CreatedTitle Name Place U2 Dublin Dagstuhl Wadern Name Group Bono U2 Campino Tote Hosen Wadern Big Data & Big Text: challenge Variety & Veracity

Big Data & Big Text Analytics Who covered which other singer? Who influenced which other musicians? Entertainment: Drugs (combinations) and their side effects Health: Politicians‘ positions on controversial topics and their involvement with industry Politics: Customer opinions on small-company products, gathered from social media Business: Identify relevant contents sources Identify entities of interest & their relationships Position in time & space Group and aggregate Find insightful patterns & predict trends General Design Pattern: 9 Trends in society, cultural factors, etc. Culturomics:

Outline Lovely NERD The Dark Side The New Chocolate Conclusion Introduction 

Lovely NERD

Named Entity Recognition & Disambiguation Hurricane, about Carter, is on Bob‘s Desire. It is played in the film with Washington. (NERD) contextual similarity: mention vs. entity (bag-of-words, language model) prior popularity of name-entity pairs

Named Entity Recognition & Disambiguation Hurricane, about Carter, is on Bob‘s Desire. It is played in the film with Washington. (NERD) Coherence of entity pairs: semantic relationships shared types (categories) overlap of Wikipedia links

Named Entity Recognition & Disambiguation Hurricane, about Carter, is on Bob‘s Desire. It is played in the film with Washington. racism protest song boxing champion wrong conviction Grammy Award winner protest song writer film music composer civil rights advocate Academy Award winner African-American actor Cry for Freedom film Hurricane film racism victim middleweight boxing nickname Hurricane falsely convicted Coherence: (partial) overlap of (statistically weighted) entity-specific keyphrases

Named Entity Recognition & Disambiguation Hurricane, about Carter, is on Bob‘s Desire. It is played in the film with Washington. NED algorithms compute mention-to-entity mapping over weighted graph of candidates by popularity & similarity & coherence KB provides building blocks: name-entity dictionary, relationships, types, text descriptions, keyphrases, statistics for weights

Joint Mapping Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB Compute high-likelihood mapping (ML or MAP) or dense subgraph (with high total edge weight) such that: each m is connected to exactly one e (or at most one e) m1 m2 m3 m4 e1 e2 e3 e4 e5 e6

Coherence Graph Algorithm D5 Overview May 14, Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) Approx. algorithms (greedy, randomized, …), hash sketches, … 82% precision on CoNLL‘03 benchmark Open-source software & online service AIDA m1 m2 m3 m4 e1 e2 e3 e4 e5 e6

NERD at Work

NERD at Work

NERD auf Deutsch

NERD on Tables

Entity Matching in Structured Data entity linkage: key to data integration long-standing problem, very difficult, unsolved H.L. Dunn: Record Linkage. American Journal of Public Health 36 (12), 1946 H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science 130 (3381), 1959 Hurricane 1975 Forever Young 1972 Like a Hurricane 1975 ………. Hurricane Dylan Like a Hurricane Young Hurricane Everette. Dylan Bob 1941 Thomas Dylan Swansea 1914 Young Brigham 1801 Young Neil Toronto 1945 Denny Sandy London 1947 Hurricane Katrina New Orleans 2005 Hurricane Sandy New York 2012 ………. Variety & Veracity ! ?

Linking Big Data & Big Text 23 Musician Song Year Listeners Charts... Bob Dylan Death is not … Bob Dylan Don‘t think twice Bob Dylan Make you feel … Nick Cave Death is not … Kronos Q. Don‘t think twice Adele Make you feel … H. Wader Heute hier Tote Hosen Heute hier …

Outline Lovely NERD The Dark Side The New Chocolate Conclusion Introduction  

Big Text: the New Chocolate

Semantic Search over News

Semantic Search over News

Entity Analytics over News

Entity Analytics over News

Machine Reading of Scholarly Papers

Machine Reading of Health Forums

Big Data & Text Analytics: Side Effects of Drug Combinations Deeper insight from both expert data & social media: actual side effects of drugs … and drug combinations risk factors and complications of (wide-spread) diseases alternative therapies aggregation & comparison by age, gender, life style, etc. Structured Expert Data Social Media

Machine Reading: from Names and Phrases to Entities, Classes, and Relations Rome (Italy) Lazio Roma AS Roma Maestro Card Ennio Morricone MDMA l‘Estasi dell‘Oro Leonard Bernstein Jack Ma Yo-Yo Ma cover of story about born in plays for film music goal in football plays sport plays music western movie Western Digital Ma played his version of the Ecstasy. The Maestro from Rome wrote scores for westerns.

wrote scores r: composed Disambiguation for Entities, Classes & Relations scores for westerns from Rome Maestro Combinatorial Optimization by ILP (with type constraints etc.) e: Rome (Italy) e: Lazio Roma e: MaestroCard e: Ennio Morricone c: conductor c:soundtrack r: soundtrackFor r: shootsGoalFor r: bornIn r: actedIn c: western movie e: Western Digital weighted edges (coherence, similarity, etc.) ( M. Yahya et al.: EMNLP’12, CIKM‘13) c: musician r: giveExam ILP optimizers like Gurobi solve this in seconds

Outline Lovely NERD The Dark Side The New Chocolate Conclusion Introduction   

The Dark Side of Big Data

search Internet publish & recommend Entity Linking: Privacy at Stake Levothroid shaking Addison ’ s disease ……… Nive concert Greenland singers Somalia elections Steve Biko search engine Zoe female 29y Jamame social network Nive Nielsen Cry Freedom discuss & seek help online forum female Somalia Synthroid tremble ………. Addison disorder ……….

social network Synthroid tremble ………. Addison disorder ………. search Internet Privacy Adversaries Linkability Threats:  Weak cues: profiles, friends, etc.  Semantic cues: health, taste, queries  Statistical cues: correlations discuss & seek help publish & recommend Levothroid shaking Addison ’ s disease ……… Nive concert Greenland singers Somalia elections Steve Biko Nive Nielsen female Somalia female 29y Jamame online forum search engine Cry Freedom

social network search social network search engine online forum Synthroid tremble ………. Addison disorder ………. Internet Levothroid shaking Addison’s disease ……… Nive concert Greenland singers Somalia elections female Somalia Goal: Automated Privacy Advisor discuss & seek help publish & recommend Nive Nielsen Cry Freedom female 29y Jamame Privacy Adviser (PA): Software tool that  analyses risk  alerts user  advises user explains consequences recommends policy changes Your queries may lead to linking your identies in Facebook and patient.co.uk ! …………. Would you like to use an anonymization tool for your search requests? ……….. ERC Project imPACT (Backes/Druschel/Majumdar/Weikum)

Outline Lovely NERD The Dark Side The New Chocolate Conclusion Introduction 

Big Text & Big Data Big Text & NERD: valuable content about entities lifted towards knowledge & analytic insight Machine Reading: discover and interpret names & phrases as entities, classes, relations, spatio-temporal modifiers, sentiments, beliefs, …. Big Data: interlink natural-language text, social media, structured data & knowledge bases, images, videos and help users coping with privacy risks

Take-Home Message: From Language to Knowledge Web Contents Knowledge more knowledge, analytics, insight knowledge acquisition intelligent interpretation Knowledge „Who Covered Whom?“ and More! (Entities, Classes, Relations)