INTRODUCTION TO ARTIFICIAL INTELLIGENCE Massimo Poesio LECTURE 10: Knowledge and The Social Web.

‘CYC convinced the AI community that creating a commonsense knowledge base by hand is impossible’ (Massimo, Lecture 1). That may depend on how many people you put onto it!

THE SOCIAL WEB
Increasingly, the Web is becoming not just a way to facilitate information exchange or commercial transactions, but also a tool to facilitate socialization (Facebook, LinkedIn, etc.). It is also a place where information can be created collectively.

SOCIAL CREATION OF KNOWLEDGE

WIKIPEDIA
Wikipedia is a free, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Wikipedia's articles have been written collaboratively by volunteers around the world. Almost all of its articles can be edited by anyone who can access the Wikipedia website.
“The free encyclopedia that anyone can edit”

WIKIPEDIA Wikipedia is: 1. domain independent – it has a large coverage 2. up-to-date – to process current information 3. multilingual – to process information in many languages

[Figure: annotated Wikipedia page showing title, abstract, infoboxes, geo-coordinates, categories, images, links (to other languages, to other wiki pages, to the web), redirects and disambiguation pages]

WIKIPEDIA
Wikipedia is an encyclopedia written collaboratively by many of its readers. Lots of people are constantly improving Wikipedia, making thousands of changes an hour, all of which are recorded in article histories and recent changes. Inappropriate changes are usually removed quickly. Unlike other encyclopedias, the volunteer authors of articles in Wikipedia don't have to be experts or scholars (though some certainly are).

Encyclopedic knowledge in coreference resolution [The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. ….. [The agency] said that because MCI's offer had expired AT&T couldn't continue to offer its discount plan.

Why Wikipedia may help address the encyclopedic knowledge problem
The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).

Another interesting scenario A fresh mandate for [Mr Ahmadinejad] would, say his critics, consecrate the “revolution within a revolution” he has been trying to effect since his surprise electoral triumph in Best known to outsiders for his bellicose grandstanding, [the incumbent] is more familiar to Iranians as a radical and hyperactive populist who has used the tacit backing of his fellow conservative, Mr Khamenei, greatly to expand the powers of the presidency. Source: It could make a big difference, The Economist, Mar 19th 2009

Why Wikipedia may help address the encyclopedic knowledge problem

Wikipedia as Ontology
Unlike standard ontologies such as WordNet and MeSH, Wikipedia itself is not a structured thesaurus. However, it is more…
– Comprehensive: it contains 12 million articles (2.8 million in the English Wikipedia)
– Accurate: a study by Giles (2005) found that Wikipedia can compete with Encyclopædia Britannica in accuracy*
– Up to date: current and emerging concepts are absorbed quickly
* Giles, J. (2005). Internet encyclopaedias go head to head. Nature 438: 900–901.

Wikipedia as Ontology Moreover, Wikipedia has a well-formed structure – Each article only describes a single concept. – The title of the article is a short and well-formed phrase like a term in a traditional thesaurus.

Wikipedia Article that describes the Concept Artificial intelligence

Wikipedia as Ontology Moreover, Wikipedia has a well-formed structure – Each article only describes a single concept – The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. – Equivalent concepts are grouped together by redirected links.

AI is redirected to its equivalent concept Artificial Intelligence

Wikipedia as Ontology Moreover, Wikipedia has a well-formed structure – Each article only describes a single concept – The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. – Equivalent concepts are grouped together by redirected links. – It contains a hierarchical categorization system, in which each article belongs to at least one category.

The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society

Wikipedia as Ontology Moreover, Wikipedia has a well-formed structure – Each article only describes a single concept – The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. – Equivalent concepts are grouped together by redirected links. – It contains a hierarchical categorization system, in which each article belongs to at least one category. – Polysemous concepts are disambiguated by Disambiguation Pages.

The different meanings that Artificial intelligence may refer to are listed in its disambiguation page.
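One way to picture the structural features above is as a pair of lookup tables: one for redirects and one for category membership. The following is only a minimal sketch under that assumption. The data is hand-copied from the slides, and the names `REDIRECTS`, `CATEGORIES`, `resolve` and `categories_of` are hypothetical, not part of any Wikipedia API.

```python
# Toy data hand-copied from the slides; a real system would read a Wikipedia dump.
REDIRECTS = {"AI": "Artificial intelligence"}
CATEGORIES = {
    "Artificial intelligence": [
        "Artificial intelligence",
        "Cybernetics",
        "Formal sciences",
        "Technology in society",
    ],
}


def resolve(title):
    """Follow redirect links until a title with no further redirect is reached."""
    seen = set()
    while title in REDIRECTS and title not in seen:
        seen.add(title)  # guard against redirect loops
        title = REDIRECTS[title]
    return title


def categories_of(title):
    """Return the categories of the concept that a title resolves to."""
    return CATEGORIES.get(resolve(title), [])
```

With this toy data, `categories_of("AI")` returns the four categories listed for Artificial intelligence, because the redirect is resolved first.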

SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA Taxonomic information: category structure Attributes: infobox, text

Wikipedia category network

Deriving a taxonomy from Wikipedia (AAAI 2007) Start with the category tree

Deriving a taxonomy from Wikipedia (AAAI 2007) Induce a subsumption hierarchy

INFOBOXES
Collaborative content. Semi-structured data.
{{Infobox Writer
| bgcolour = silver
| name = Edgar Allan Poe
| image = Edgar_Allan_Poe_2.jpg
| caption = This [[daguerreotype]] of Poe was taken in
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| death_date = {{death date and age|1849|10|07|1809|01|19}}
| death_place = [[Baltimore, Maryland]] [[United States|U.S.]]
| occupation = Poet, short story writer, editor, literary critic
| movement = [[Romanticism]], [[Dark romanticism]]
| genre = [[Horror fiction]], [[Crime fiction]], [[Detective fiction]]
| magnum_opus = The Raven
| spouse = [[Virginia Eliza Clemm Poe]]
...
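To make "semi-structured" concrete, here is a rough sketch of how the attribute-value pairs of an infobox like the one above could be pulled out of the raw wikitext. It is an illustration only, not the extractor used by any particular project. The function names are hypothetical, and the sample is a shortened copy of the Poe infobox. The one subtlety it handles is that `|` also appears inside nested `{{...}}` templates and `[[...]]` links, so fields are split only at the top level.

```python
def split_top_level(text, sep="|"):
    """Split on sep, ignoring separators nested inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == sep and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_infobox(wikitext):
    """Return (template_name, {field: raw_value}) for one infobox template."""
    body = wikitext.strip()
    if body.startswith("{{") and body.endswith("}}"):
        body = body[2:-2]  # drop the outer braces
    head, *fields = split_top_level(body)
    attributes = {}
    for field in fields:
        key, equals, value = field.partition("=")
        if equals:  # skip fragments that are not "key = value"
            attributes[key.strip()] = value.strip()
    return head.strip(), attributes


SAMPLE = """{{Infobox Writer
| name = Edgar Allan Poe
| birth_date = {{birth date|1809|1|19|mf=y}}
| birth_place = [[Boston, Massachusetts]] [[United States|U.S.]]
| magnum_opus = The Raven
}}"""

template_name, poe = parse_infobox(SAMPLE)
```

The values are kept as raw wikitext (for example `{{birth date|1809|1|19|mf=y}}`); turning them into typed dates or links would be a separate step.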

DBPEDIA
DBpedia.org is an effort to:
– extract structured information from Wikipedia
– make this information available on the Web under an open license
– interlink the DBpedia dataset with other datasets on the Web

The DBpedia Dataset
– 1,600,000 concepts, including:
  – 58,000 persons
  – 70,000 places
  – 35,000 music albums
  – 12,000 films
– described by 91 million triples
– using 8,141 different properties
– 557,000 links to pictures
– 1,300,000 links to external web pages
– 207,000 Wikipedia categories
– 75,000 YAGO categories

REPRESENTING EXTRACTED INFORMATION
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.

Extracting Infobox Data (RDF Representation):
dbpedia:native_name “Calgary”;
dbpedia:altitude “1048”;
dbpedia:population_city “988193”;
dbpedia:population_metro “ ”;
mayor_name dbpedia:Dave_Bronconnier ;
governing_body dbpedia:Calgary_City_Council;
...
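The RDF data model behind this listing is a set of (subject, predicate, object) triples. As a minimal sketch, the Calgary properties can be held as Python tuples and queried with a wildcard pattern. The subject identifier `dbpedia:Calgary` is an assumption (the slide shows only predicates and objects), the `population_metro` entry is left out because its value is missing on the slide, and `match` is a hypothetical helper, not a library function.

```python
CALGARY = "dbpedia:Calgary"  # assumed subject; the slide does not show it

# Each fact is one (subject, predicate, object) triple.
TRIPLES = [
    (CALGARY, "dbpedia:native_name", "Calgary"),
    (CALGARY, "dbpedia:altitude", "1048"),
    (CALGARY, "dbpedia:population_city", "988193"),
    (CALGARY, "mayor_name", "dbpedia:Dave_Bronconnier"),
    (CALGARY, "governing_body", "dbpedia:Calgary_City_Council"),
]


def match(pattern, triples=TRIPLES):
    """Yield the triples that fit a (subject, predicate, object) pattern.

    None in any position acts as a wildcard.
    """
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple
```

For example, `list(match((None, "dbpedia:altitude", None)))` returns the single altitude triple. Matching patterns with unbound positions against a triple set is the basic idea that a query language for RDF builds on.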

SPARQL
SPARQL is a query language for RDF. RDF is a directed, labeled graph data format for representing information on the Web. The SPARQL specification defines the syntax and semantics of the query language. SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.

The DBpedia SPARQL Endpoint
– hosted on an OpenLink Virtuoso server
– can answer SPARQL queries like:
  – Give me all sitcoms that are set in NYC
  – All tennis players from Moscow?
  – All films by Quentin Tarantino?
  – All German musicians that were born in Berlin in the 19th century?
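As an illustration of what one of these questions might look like in practice, the sketch below writes "all tennis players from Moscow" as a SPARQL query and builds the request URL for an endpoint. Several details are assumptions rather than facts taken from the slides: the endpoint address `https://dbpedia.org/sparql`, the class and property names `dbo:TennisPlayer` and `dbo:birthPlace`, and the `format` parameter. The code only constructs the URL and does not contact any server.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # assumed public endpoint address

# Assumed vocabulary: dbo:TennisPlayer and dbo:birthPlace may differ in a given release.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?player WHERE {
  ?player a dbo:TennisPlayer ;
          dbo:birthPlace dbr:Moscow .
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL that carries the query as a URL-encoded parameter."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(QUERY)
```

The resulting `url` could then be fetched with any HTTP client; whether the query returns results depends on the vocabulary the endpoint actually uses.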

WEB COLLABORATION FOR KNOWLEDGE ACQUISITION
Efforts such as Wikipedia indicate that many Web surfers may be willing to participate in collective resource-producing efforts
– Other initiatives: Citizen Science, Cognition and Language Laboratory, …
This has been taken advantage of in AI
– Open Mind Commonsense (Singh) (collecting facts)
– Semantic Wikis

WEB COLLABORATION PROJECTS
– Open Mind Common Sense (Singh)
– Crater mapping (results) (Kanefsky)
– Learner / Learner2 / 1001 Paraphrases (Chklovski)
– FACTory (Cycorp)
– Hot or Not (8 Days)
– ESP / Phetch / Verbosity / Peekaboom (von Ahn)
– Galaxy Zoo (Oxford University)

OPEN MIND COMMONSENSE
A project started in 2000 by Push Singh to take advantage of people’s collaboration to collect commonsense knowledge

WHAT’S IN OPEN MIND COMMONSENSE: CAR

OPEN MIND COMMONSENSE: ADDING KNOWLEDGE

OMCS ADDING KNOWLEDGE, 2

OPEN MIND COMMONSENSE: CHECKING KNOWLEDGE

FROM OPENMIND COMMONSENSE TO CONCEPT NET ConceptNet (Havasi et al, 2009) is a semantic network extracted from OpenMind Commonsense assertions using simple heuristics

CONCEPT NET

ConceptNet Example

FROM OPENMIND COMMONSENSE FACTS TO CONCEPTNET
A lime is a very sour fruit
isa(lime,fruit)
property_of(lime,very_sour)
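The lime example can be reproduced with a single hand-written pattern. The sketch below is only meant to show what a "simple heuristic" of this kind might look like; it is not the actual ConceptNet extraction procedure. The regular expression and the function name `extract_assertions` are assumptions made for illustration.

```python
import re

# One illustrative pattern: "<article> X is a/an [modifiers] Y".
PATTERN = re.compile(r"^(?:an?|the)\s+(\w+)\s+is\s+an?\s+(.+)$", re.IGNORECASE)


def extract_assertions(sentence):
    """Turn 'A lime is a very sour fruit' into isa/property_of assertions."""
    found = PATTERN.match(sentence.strip().rstrip("."))
    if not found:
        return []  # the sentence does not fit this one pattern
    concept = found.group(1).lower()
    *modifiers, head = found.group(2).lower().split()
    assertions = [f"isa({concept},{head})"]
    if modifiers:
        assertions.append(f"property_of({concept},{'_'.join(modifiers)})")
    return assertions
```

Calling `extract_assertions("A lime is a very sour fruit")` yields the two assertions shown on the slide. Sentences that do not fit the pattern simply produce nothing, which is why a real system needs many such patterns.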

GAMES WITH A PURPOSE Luis von Ahn pioneered a new approach to resource creation on the Web: GAMES WITH A PURPOSE, or GWAP, in which people, as a side effect of playing, perform tasks ‘computers are unable to perform’ (sic)

GWAP vs OPEN MIND COMMONSENSE vs MECHANICAL TURK
GWAP do not rely on altruism or financial incentives to entice people to perform certain actions.
The key property of games is that PEOPLE WANT TO PLAY THEM.

EXAMPLES OF GWAP Games at – ESP – Verbosity – TagATune Other games – Peekaboom – Phetch

ESP
The first GWAP developed by von Ahn and his group (2003 / 2004)
The problem: obtain accurate descriptions of images, to be used
– To train image search engines
– To develop machine learning approaches to vision
The goal: label the majority of the images on the Web

ESP: the game

ESP: THE GAME
Two partners are picked at random from the large number of players online.
They are not told who their partner is, and can’t communicate with them.
They are both shown the same image.
The goal: guess how their partner will describe the image, and type that description
– Hence, the ESP game
If any of the strings typed by one player matches the string typed by the other player, they score points.
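The matching rule described above can be sketched in a few lines: each player's guesses accumulate, and a round ends as soon as one player types a string the other has already typed. This is an illustrative sketch, not the game's actual implementation. The lower-casing and whitespace stripping in `normalise`, and the event format, are assumptions.

```python
def normalise(label):
    """Assumed normalisation: ignore case and surrounding whitespace."""
    return label.strip().lower()


def first_agreement(events):
    """Return the first label typed by both players, or None if they never agree.

    events is a sequence of (player, label) pairs in the order typed,
    where player is "A" or "B".
    """
    typed = {"A": set(), "B": set()}
    for player, label in events:
        label = normalise(label)
        other = "B" if player == "A" else "A"
        if label in typed[other]:
            return label  # the partner already typed this string
        typed[player].add(label)
    return None


round_events = [("A", "dog"), ("B", "puppy"), ("A", "grass"), ("B", "Dog ")]
agreed = first_agreement(round_events)
```

Note that the players do not have to type the matching string at the same time: player B's fourth guess matches player A's first.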

THE TASK

SCORING BY MATCHING

THE CHALLENGE: SCORES
One of the motivating factors is to try to score as many points as possible.
Hourly, daily, weekly, and monthly scores are shown.

SCORES

THE CHALLENGE: TIMING
Partners try to agree on as many images as they can during 2 ½ minutes.
The thermometer on the side indicates how many images they have agreed on.
If they agree on 15 images they score bonus points.

TABOO WORDS
To ensure the production of a large number of specific labels, some words are declared TABOO and not allowed.
Taboo words are obtained from the game itself: any word that has been agreed upon by players who were shown a picture earlier becomes a taboo word for that image.
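The taboo mechanism amounts to a per-image set that grows every time a pair agrees. A minimal sketch, with a hypothetical class name and the assumption that labels are compared case-insensitively:

```python
class ImageState:
    """Tracks the taboo words accumulated for a single image."""

    def __init__(self):
        self.taboo = set()

    def allowed(self, guess):
        """A guess is rejected if earlier players already agreed on it."""
        return guess.strip().lower() not in self.taboo

    def record_agreement(self, label):
        """Once a pair agrees on a label, it becomes taboo for later pairs."""
        self.taboo.add(label.strip().lower())


image = ImageState()
before = image.allowed("dog")   # nothing is taboo yet
image.record_agreement("Dog")   # an earlier pair agreed on "dog"
after = image.allowed("dog")    # later pairs must find another label
```

Because each agreement removes the most obvious remaining label, later pairs are pushed towards more specific descriptions of the same image.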

TABOO WORDS

PASSING

GOOD LABELS, COMPLETING AN IMAGE
A label is considered “good” when more than N players produce it (with N a parameter of the game).
An image is “done” when its list of taboo words is so extensive that most players pass on it.

IMPLEMENTATION
Pre-recorded game play
– Especially at the beginning, and at quiet times, there won’t always be players to pair with
– In these cases a player is paired against a recorded ‘hand’ of a previous game with the same picture
Cheating
– Players could cheat in a number of ways, including agreeing on labels / playing against themselves
– A number of mechanisms are in place against those cases
Selecting images

SOME STATISTICS
In the 4 months between August 9th 2003 and December 10th 2003
– players
– 1.2 million labels for 293,760 images
– 80% of players played more than once
By 2008:
– 200,000 players
– 50 million labels

ANALYSIS The numbers indicate that the game is fun to play Exciting factors: – Playing with a partner – Playing against time

QUALITY OF THE LABELS
For IMAGE SEARCH:
– choose 10 labels among those produced and look at which images are returned
Compare labels produced by players with labels produced by participants in an experiment
– 15 participants, 20 images among the 1000 with more than 5 labels
– 83% of game labels also produced by participants
Manual assessment of labels (‘would you use these labels to describe this image?’)
– 15 participants, 20 images
– 85% of words rated useful

GOOGLE IMAGE LABELLER

THE TASK

RESULTS

VERBOSITY … or, the game approach to collecting commonsense knowledge Motivation: slow progress both on CYC (5 million facts collected) and on Open Mind Commonsense (around 700,000 facts)

THE GAME Based on an existing game, TABOO: – Players have to guess a word – One of the players gives hints concerning the word In Verbosity, you have two players, the DESCRIBER and the GUESSER, and a SECRET WORD

THE GAME

TEMPLATES IN VERBOSITY As in Open Mind Commonsense, templates are used to ensure that the relations / properties of interest are collected The Describer produces hints by filling in a template

GUESSING ATTRIBUTES

PRODUCING A DESCRIPTION

TEMPLATES
_ is a kind of _
_ is used for _
_ is typically near/in/on _
_ is the opposite of _ / _ is related to _
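Because every hint is produced by filling in a template, each hint can be stored directly as a structured fact about the secret word. The sketch below illustrates that idea only. The relation names, the function names and the example word "milk" are hypothetical and do not come from the game itself.

```python
# Template text from the slide, mapped to hypothetical relation names.
TEMPLATES = {
    "is a kind of": "kind_of",
    "is used for": "used_for",
    "is typically near/in/on": "typically_near",
    "is the opposite of": "opposite_of",
    "is related to": "related_to",
}


def render_hint(template, filler):
    """Text shown to the Guesser: the secret word stays hidden behind a blank."""
    return f"_ {template} {filler.strip()}"


def hint_to_fact(secret_word, template, filler):
    """Fact recorded from a Describer's hint: (relation, secret word, filler)."""
    relation = TEMPLATES[template]
    return (relation, secret_word.strip().lower(), filler.strip().lower())


hint = render_hint("is a kind of", "liquid")
fact = hint_to_fact("Milk", "is a kind of", "liquid")
```

Here the Guesser would see "_ is a kind of liquid", while the system would record the fact `("kind_of", "milk", "liquid")`.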

EMULATION
As in the ESP game, pre-recorded games are used when a player cannot be paired with another player.
The asymmetry of the game causes a problem not encountered in the ESP game:
– Describer: can just repeat the behavior of a previous describer
– Guesser: not so easy

RESULTS
The only published results I’m aware of predate the actual release of the game, so I don’t know about the QUANTITY.
Quality:
– Six raters were asked whether 200 facts collected using Verbosity are ‘true’
– Around 85% success

PHRASE DETECTIVES

PHRASE DETECTIVES: THE TASKS
2 tasks:
– Find The Culprit (Annotation): the user must identify the closest antecedent of a markable if it is anaphoric
– Detectives Conference (Validation): the user must agree/disagree with a coreference relation entered by another user

NAME THE CULPRIT

READINGS
V. Nastase & M. Strube, Transforming Wikipedia into a large scale multilingual concept network, Artificial Intelligence, 2012
C. Havasi, J. Pustejovsky, R. Speer and H. Lieberman, Digital Intuition: Applying Common Sense Using Dimensionality Reduction, IEEE Intelligent Systems, 2009
L. von Ahn and L. Dabbish (2008). Designing games with a purpose. Communications of the ACM, v. 51, n. 8
Poesio, Chamberlain, Kruschwitz, Robaldo, & Ducceschi, Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems