Research Internships Advanced Research and Modeling Research Group.



ADReM – What? A research group that deals with computational aspects of data: databases, data mining, and information retrieval.

ADReM – Who?
DB/DM/IR: Floris Geerts, Bart Goethals, Martin Theobald
Bioinformatics: Kris Laukens, Tim Van den Bulcke
Plus PhD students and postdoctoral researchers

Internships – What?
2 research internships (15 credits each) and an MSc thesis (30 credits).
Goal: an internship is an initiation to research, done in collaboration with researchers in ADReM.
15 credits is a lot, so an internship is time-consuming! 1 credit = 15 hours of work…
Balance your course load and internship well.
Internships are not necessarily related to your MSc thesis (but they can be).
In an MSc thesis, your ability to do research independently plays an important role.

Internships – Who? Everyone who follows the research option in the database MSc program.

Research
In an internship you need to:
1. Understand a specific problem
2. Implement an (existing) method for solving the problem
3. Test and evaluate
4. Write a report
(MSc thesis: you also have to solve the problem yourself by designing new methods…)

Internships in a company
You may do an internship in a company, but you have to ask permission.
You also have to find the company yourself and convince us that there is research involved.
You can't receive any money from the company during your internship.

Databases, data mining, information retrieval
These are not separate research domains.
The internship topics that each of us will present next usually lie at the intersection of these areas.
Let's see some example topics…

Bart Goethals

Recommender Systems
Implement state-of-the-art recommenders
Pattern mining for better recommendations
Interactive recommendation
Explaining recommendations
Test recommenders on real data

Visual Instant Interactive Pattern Mining
Study visualizations enabling interactive pattern mining
Implement and experiment with novel instant mining methods

Pattern-based Clustering
Implement and evaluate different techniques for clustering-based pattern mining and pattern-based clustering

Data Mining for Cleaning
Study and experiment with data mining methods for data cleaning

Martin Theobald

Information Extraction (I): Wikipedia Infoboxes

Information Extraction (I): Infoboxes
bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)
YAGO/DBpedia et al.: > 120 M facts for YAGO2 (mostly from Wikipedia infoboxes)

Information Extraction (II): Wikipedia Categories

?

RDF Knowledge Bases
[Figure: example RDF graph. Entities such as Max_Planck (bornOn Apr 23, 1858; diedOn Oct 4, 1947; bornIn Kiel; hasWon Nobel Prize), Erwin_Planck, Angela Merkel, Germany, Schleswig-Holstein and Max_Planck Society are linked by relations such as fatherOf, citizenOf, locatedIn and means (for name strings such as "Max Planck", "Max Karl Ernst Ludwig Planck" and "Angela Dorothea Merkel"), and are typed via instanceOf/subclass into classes such as Entity, Person, Scientist, Physicist, Biologist, Politician, Location, City, Country, State and Organization.]
Accuracy about 95%; 3 million entities, 120 million facts; 100 relations, 200k classes

Linked Open Data
As of Sept. 2011: > 200 sources, > 30 billion RDF triples, > 400 million links

As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase

IBM Watson: Deep Question Answering
Example clues:
99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain
This town is known as "Sin City" & its downtown is "Glitter Gulch"
William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
As of 2010, this is the only former Yugoslav republic in the EU
Figure labels: question classification & decomposition; knowledge back-ends (YAGO)
D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.

A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field? Jeopardy!

Structured Knowledge Queries
A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field?
Select Distinct ?c Where {
  ?c type City . ?c locatedIn USA .
  ?a1 type Airport . ?a2 type Airport .
  ?a1 locatedIn ?c . ?a2 locatedIn ?c .
  ?a1 namedAfter ?p . ?p type WarHero .
  ?a2 namedAfter ?b . ?b type BattleField . }
Use manually created templates for mapping sentence patterns to structured queries.
Works for factoid and list questions.
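To make the query above concrete, here is a minimal sketch of how such a conjunction of triple patterns can be matched against a set of triples. The tiny fact set, the `match` and `evaluate` helpers and the entity names are assumptions made for illustration; they are not part of YAGO, Watson, or any real query engine.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical toy triples, written by hand for this sketch.
TRIPLES: Set[Triple] = {
    ("Chicago", "type", "City"), ("Chicago", "locatedIn", "USA"),
    ("Springfield", "type", "City"), ("Springfield", "locatedIn", "USA"),
    ("OHare", "type", "Airport"), ("OHare", "locatedIn", "Chicago"),
    ("Midway", "type", "Airport"), ("Midway", "locatedIn", "Chicago"),
    ("OHare", "namedAfter", "Edward_OHare"),
    ("Edward_OHare", "type", "WarHero"),
    ("Midway", "namedAfter", "Battle_of_Midway"),
    ("Battle_of_Midway", "type", "BattleField"),
}

# The triple patterns of the query; terms starting with "?" are variables.
QUERY: List[Triple] = [
    ("?c", "type", "City"), ("?c", "locatedIn", "USA"),
    ("?a1", "type", "Airport"), ("?a2", "type", "Airport"),
    ("?a1", "locatedIn", "?c"), ("?a2", "locatedIn", "?c"),
    ("?a1", "namedAfter", "?p"), ("?p", "type", "WarHero"),
    ("?a2", "namedAfter", "?b"), ("?b", "type", "BattleField"),
]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.get(term, value) != value:
                return None  # variable already bound to a different value
            extended[term] = value
        elif term != value:
            return None  # constant does not match
    return extended


def evaluate(patterns: List[Triple], triples: Set[Triple]) -> List[Binding]:
    """Naive nested-loop join over all patterns (no indexing, no ordering)."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        next_bindings: List[Binding] = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# "Select Distinct ?c": project onto ?c and remove duplicates.
answers = sorted({b["?c"] for b in evaluate(QUERY, TRIPLES)})
print(answers)
```

A real engine would use indexes and join ordering instead of scanning all triples for every pattern; the sketch only shows what the query means.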

Mining Rules from RDF Knowledge Bases
A-priori-style pre-filtering of low-support join patterns
Dynamic programming ILP algorithm
Learning with constants and type constraints
Goal: inductively learn (soft) rules, e.g. livesIn(x,y) :- bornIn(x,y)
[Figure: three overlapping sets. G: ground truth for livesIn (only partially known). KB: knowledge base for livesIn (known positive examples). R: facts produced by the rule (only partially correct).]
Problems:
  Closed World Assumption: strongly penalizes the rule
  Specificity: avoid producing overly general rules
Our solution:
  Confidence instead of accuracy: do not penalize the rule for unseen entities
  Use a combination of statistical measures
  Refine overly general rules by types
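The difference between scoring a rule under the closed world assumption and not penalizing it for unseen entities can be sketched as follows. The facts are invented, and the second measure (counting only predictions about persons for whom the knowledge base knows at least one livesIn fact) is one possible reading of the slide, not necessarily the exact measure used by the group.

```python
# Hypothetical toy knowledge base; names and places are made up.
born_in = {("Anna", "Kiel"), ("Ben", "Berlin"), ("Cara", "Bonn"), ("Dan", "Ulm")}
lives_in = {("Anna", "Kiel"), ("Ben", "Munich")}  # known positive examples only

# Facts produced by the rule livesIn(x,y) :- bornIn(x,y).
predicted = set(born_in)

# Support: predictions that the knowledge base confirms.
support = len(predicted & lives_in)

# Closed world assumption: every prediction missing from the KB counts as wrong.
cwa_confidence = support / len(predicted)

# Assumed alternative: only judge predictions about persons for whom the KB
# contains some livesIn fact, so unseen persons do not penalize the rule.
seen_persons = {person for person, _ in lives_in}
judged = {(person, place) for person, place in predicted if person in seen_persons}
restricted_confidence = support / len(judged)

print(support, cwa_confidence, restricted_confidence)
```

Here Cara and Dan have no known residence, so the closed world score treats two possibly correct predictions as errors, while the restricted score only counts the one real mismatch (Ben).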

Rule-based Reasoning
(Soft) Deduction Rules vs. (Hard) Consistency Constraints
Soft deduction rules: people may live in more than one place
  livesIn(x,y) ⇐ marriedTo(x,z) ∧ livesIn(z,y) [0.8]
  livesIn(x,y) ⇐ hasChild(x,z) ∧ livesIn(z,y) [0.5]
Hard consistency constraints:
People are not born in different places/on different dates
  bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z
People are not married to more than one person (at the same time, in most countries?)
  marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)
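As a rough illustration of the two kinds of rules, the sketch below applies one soft deduction rule and checks one hard constraint on a handful of invented facts. It reads the first soft rule as "x lives in y if x is married to z and z lives in y", which is an assumption about the slide, and it simply attaches the rule weight to each derived fact instead of doing any probabilistic inference.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]

# Hypothetical facts, invented for this sketch.
married_to: Set[Pair] = {("Ann", "Bob")}
lives_in: Set[Pair] = {("Bob", "Antwerp")}
born_in: Set[Pair] = {("Ann", "Kiel"), ("Ann", "Bonn"), ("Bob", "Ghent")}


def apply_spouse_rule(married: Set[Pair], lives: Set[Pair], weight: float) -> Dict[Pair, float]:
    """Soft rule: derive livesIn(x, y) from marriedTo(x, z) and livesIn(z, y)."""
    derived: Dict[Pair, float] = {}
    for x, z in married:
        for person, place in lives:
            if person == z and (x, place) not in lives:
                derived[(x, place)] = weight  # derived facts carry the rule weight
    return derived


def born_in_violations(born: Set[Pair]) -> Dict[str, List[str]]:
    """Hard constraint: a person must not have two different birthplaces."""
    places: Dict[str, Set[str]] = {}
    for person, place in born:
        places.setdefault(person, set()).add(place)
    return {p: sorted(v) for p, v in places.items() if len(v) > 1}


derived_facts = apply_spouse_rule(married_to, lives_in, weight=0.8)
violations = born_in_violations(born_in)
print(derived_facts)
print(violations)
```

A soft rule only proposes new, uncertain facts; a hard constraint never produces facts and instead flags sets of facts that cannot all be true.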

Probabilistic RDF Database
Base Facts
  graduatedFrom(Surajit, Princeton) [0.7]
  graduatedFrom(Surajit, Stanford) [0.6]
  graduatedFrom(David, Princeton) [0.9]
  hasAdvisor(Surajit, Jeff) [0.8]
  hasAdvisor(David, Jeff) [0.7]
  worksAt(Jeff, Stanford) [0.9]
  type(Princeton, University) [1.0]
  type(Stanford, University) [1.0]
  type(Jeff, Computer_Scientist) [1.0]
  type(Surajit, Computer_Scientist) [1.0]
  type(David, Computer_Scientist) [1.0]
Rules
  hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z) [0.4]
  graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z
Query
  graduatedFrom(Surajit, y)
Lineage, with A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9]:
  Q1 = graduatedFrom(Surajit, Princeton): A ∧ ¬(B ∨ (C ∧ D))
  Q2 = graduatedFrom(Surajit, Stanford): ¬A ∧ (B ∨ (C ∧ D))
Probabilities
  P(C ∧ D) = 0.8 × 0.9 = 0.72
  P(B ∨ (C ∧ D)) = 1 - (1 - 0.72) × (1 - 0.6) = 0.888
  P(Q1) = 0.7 × (1 - 0.888) = 0.078
  P(Q2) = (1 - 0.7) × 0.888 = 0.266
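The probabilities on this slide can be checked by brute force. The sketch below assumes that the four base facts A = graduatedFrom(Surajit, Princeton), B = graduatedFrom(Surajit, Stanford), C = hasAdvisor(Surajit, Jeff) and D = worksAt(Jeff, Stanford) are independent events, enumerates all 16 possible worlds, and sums the weights of the worlds in which each lineage formula holds. Like the numbers on the slide, it leaves the rule weight [0.4] out of the calculation. The function names are made up for illustration.

```python
from itertools import product
from typing import Callable, Dict

# Probabilities of the four base facts, as given on the slide.
P: Dict[str, float] = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}

World = Dict[str, bool]


def q1(w: World) -> bool:
    """Lineage of graduatedFrom(Surajit, Princeton): A and not (B or (C and D))."""
    return w["A"] and not (w["B"] or (w["C"] and w["D"]))


def q2(w: World) -> bool:
    """Lineage of graduatedFrom(Surajit, Stanford): not A and (B or (C and D))."""
    return (not w["A"]) and (w["B"] or (w["C"] and w["D"]))


def probability(formula: Callable[[World], bool]) -> float:
    """Sum the weights of all possible worlds in which `formula` is true."""
    names = sorted(P)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= P[name] if world[name] else 1.0 - P[name]
        if formula(world):
            total += weight
    return total


p_q1 = probability(q1)
p_q2 = probability(q2)

# The same values via the closed form used on the slide.
p_cd = P["C"] * P["D"]
p_b_or_cd = 1 - (1 - p_cd) * (1 - P["B"])
closed_q1 = P["A"] * (1 - p_b_or_cd)
closed_q2 = (1 - P["A"]) * p_b_or_cd

print(round(p_q1, 3), round(p_q2, 3))
```

Enumeration is exponential in the number of facts, so it only serves as a sanity check; the closed form works here because each fact occurs once in each formula.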

Temporal Knowledge

Probabilistic-Temporal Consistency Reasoning
Base Facts: playsFor(Beckham, Real, T1), playsFor(Ronaldo, Real, T2)
Rule: playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1, T2) → teamMates(Beckham, Ronaldo, T3)
Derived Facts (state relation): teamMates(Beckham, Ronaldo, T3)
[Figure: timelines for T1, T2 and T3 with year marks between '00 and '07]
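The temporal part of this rule can be sketched with plain interval arithmetic: two playsFor facts with overlapping validity intervals yield a teamMates fact that holds on the intersection. The concrete year ranges below are assumptions chosen for illustration, the probabilities shown in the figure are ignored, and `intersect` is a hypothetical helper.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # [start year, end year)


def intersect(t1: Interval, t2: Interval) -> Optional[Interval]:
    """Return the overlap of two intervals, or None if they do not overlap."""
    start, end = max(t1[0], t2[0]), min(t1[1], t2[1])
    return (start, end) if start < end else None


# Assumed validity intervals for the two base facts.
plays_for = {
    ("Beckham", "Real"): (2003, 2007),
    ("Ronaldo", "Real"): (2002, 2007),
}

# Derive teamMates(Beckham, Ronaldo, T3) where T3 is the overlap of T1 and T2.
t3 = intersect(plays_for[("Beckham", "Real")], plays_for[("Ronaldo", "Real")])
team_mates = {("Beckham", "Ronaldo"): t3} if t3 is not None else {}
print(team_mates)
```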

Topics for Internships & Master Theses
Research Internships:
  Preparation & Integration of Linked Data Sources for Scientific Experiments (SQL/Java/Python)
  Mining Association Rules from Linked Data (Java/C++)
  Visualization Frontend for Linked Data (ActionScript & Adobe Flash)
Master Theses:
  Implementation of a distributed rule-based query engine for RDF data (C++ & Message Passing Interface)
  Implementation of a distributed factor graph model for correlated RDF facts (C++ & Message Passing Interface)
  Faceted Search and Interactive Browsing for Linked Data

Floris Geerts

RDBMS-based recommendation systems
Top-k item selection: items (books, music, news, Web sites, research papers, …) + selection criteria + utility function → top-k items
Example: find top-3 flights from EDI to NYC with at most one stop
  Items: flights
  Selection criteria: relational queries
  Utility function: in terms of price and duration (for ranking)
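A minimal sketch of this setup, assuming a single `flight` table in an in-memory SQLite database: the selection criteria are a relational query, and the utility function is applied afterwards to rank the selected items. The table layout, the flight rows and the weighting of 20 price units per hour of travel are all invented for illustration.

```python
import heapq
import sqlite3

# Hypothetical flights: (flight number, origin, destination, stops, price, hours).
FLIGHTS = [
    ("F1", "EDI", "NYC", 0, 900.0, 8.0),
    ("F2", "EDI", "NYC", 1, 500.0, 12.0),
    ("F3", "EDI", "NYC", 1, 450.0, 16.0),
    ("F4", "EDI", "NYC", 2, 300.0, 20.0),
    ("F5", "EDI", "NYC", 1, 700.0, 11.0),
    ("F6", "EDI", "BOS", 0, 400.0, 7.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight (fno TEXT, origin TEXT, dest TEXT, "
    "stops INTEGER, price REAL, hours REAL)"
)
conn.executemany("INSERT INTO flight VALUES (?, ?, ?, ?, ?, ?)", FLIGHTS)

# Selection criteria: a relational query (EDI to NYC, at most one stop).
candidates = conn.execute(
    "SELECT fno, price, hours FROM flight "
    "WHERE origin = ? AND dest = ? AND stops <= ?",
    ("EDI", "NYC", 1),
).fetchall()


def cost(row):
    """Assumed utility: lower is better; one hour counts as 20 price units."""
    _, price, hours = row
    return price + 20.0 * hours


# Top-k item selection: the k candidates with the lowest cost.
top3 = [row[0] for row in heapq.nsmallest(3, candidates, key=cost)]
print(top3)
conn.close()
```

Ranking outside the database keeps the utility function arbitrary; pushing it into the query (for example as an ORDER BY with LIMIT) is the alternative when the function can be expressed in SQL.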

Query relaxation
Q(f#, name, type, ticket, time) = ∃ DT, AT, AD, xTo ( flight(f#, EDI, xTo, DT, 5/19/2012, AT, AD, Pr) ∧ POI(name, xTo, type, ticket, time) ∧ xTo = NYC )
Query for 5-day holiday. There is no direct flight from EDI to NYC.
E = { EDI, NYC, 4/1/2012 }, X = { xTo }
Valid query relaxation:
Q1(f#, name, type, ticket, time) = ∃ DT, AT, AD, uTo, wEdi, wNYC, wDD ( flight(f#, wEdi, xTo, DT, wDD, AT, AD, Pr) ∧ xTo = wNYC ∧ POI(name, uTo, type, ticket, time) ∧ wDD = 5/19/2012 ∧ dist(wNYC, NYC) ≤ 15 ∧ dist(wEdi, EDI) ≤ 15 ∧ xTo = uTo )
Relaxation: cities within 15 miles of EDI or NYC are acceptable
Further relaxation: departure dates within 3 days of 5/19/2012 are acceptable, i.e. dist(wDD, 5/19/2012) ≤ 3

Topics
Top-k query answering algorithm on top of RDBMS
Query relaxation approaches and query completion
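The idea of relaxing constants into distance predicates can be sketched as below: an equality such as "destination = NYC" becomes "dist(destination, NYC) ≤ radius", and the departure date gets a tolerance in days. The flight rows, the airport codes other than EDI and NYC, and the distance table are invented for illustration, and the join with POI is left out to keep the sketch short.

```python
from datetime import date
from typing import Dict, List, Tuple

Flight = Tuple[str, str, str, date]  # (flight number, origin, destination, departure)

# Hypothetical distances in miles between places; unknown pairs count as far away.
DIST: Dict[Tuple[str, str], float] = {
    ("EDI", "EDI"): 0.0, ("NYC", "NYC"): 0.0,
    ("EWR", "NYC"): 10.0, ("PHL", "NYC"): 80.0,
}

FLIGHTS: List[Flight] = [
    ("F1", "EDI", "EWR", date(2012, 5, 19)),
    ("F2", "EDI", "PHL", date(2012, 5, 19)),
    ("F3", "EDI", "EWR", date(2012, 5, 21)),
    ("F4", "EDI", "EWR", date(2012, 5, 25)),
]


def dist(a: str, b: str) -> float:
    """Symmetric lookup in the toy distance table."""
    return DIST.get((a, b), DIST.get((b, a), float("inf")))


def query(origin: str, dest: str, day: date,
          city_radius: float = 0.0, day_slack: int = 0) -> List[str]:
    """Radius 0 and slack 0 give the original query; larger values relax it."""
    return [
        fno for fno, o, d, dep in FLIGHTS
        if dist(o, origin) <= city_radius
        and dist(d, dest) <= city_radius
        and abs((dep - day).days) <= day_slack
    ]


strict = query("EDI", "NYC", date(2012, 5, 19))
relaxed_cities = query("EDI", "NYC", date(2012, 5, 19), city_radius=15)
relaxed_dates = query("EDI", "NYC", date(2012, 5, 19), city_radius=15, day_slack=3)
print(strict, relaxed_cities, relaxed_dates)
```

The strict query returns nothing, the city relaxation admits the flight into the nearby airport, and the date relaxation adds the flight two days later while still excluding the one six days later.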

Data quality
Detecting and correcting inconsistencies
Finding duplicates
Finding the most up-to-date information

Semantic errors
[Figure: stock quote pages from Yahoo! Finance and Nasdaq, comparing their "Day's Range" and week-range fields]

Instance ambiguity

Out-of-Date Data 4:05 pm 3:57 pm

Unit errors 76,821, B

Topics
Fast inconsistency detection
Duplicate elimination algorithms
Automated repairing algorithms
Mining of "data quality rules"