Research Internships – Advanced Database Research and Modelling (ADReM) Research Group
ADReM – What? A research group that deals with computational aspects of data: databases, data mining, and information retrieval.
ADReM – Who? DB/DM/IR: Floris Geerts, Bart Goethals, Martin Theobald. Bioinformatics: Kris Laukens, Tim Van den Bulcke. Plus PhD students and postdoctoral researchers.
Internships – What?
– 2 research internships (15 credits each) and an MSc thesis (30 credits).
– Goal: an internship is an initiation to research, carried out in collaboration with researchers in ADReM.
– 15 credits is a lot, so an internship is time-consuming (1 credit = 15 hours of work…). Balance your course load and internship well.
– Internships are not necessarily related to your MSc thesis (but they can be).
– In an MSc thesis, your ability to do research independently plays an important role.
Internships – Who? Everyone who follows the research option in the database MSc program.
Research
In an internship you need to:
1. Understand a specific problem
2. Implement an (existing) method for solving the problem
3. Test and evaluate
4. Write a report
(MSc thesis: you also have to solve the problem by designing new methods…)
Internships in a company
– You are allowed to do an internship in a company, but you have to ask permission.
– You have to find the company yourself and convince us that there is research involved.
– You can’t receive any money from the company during your internship.
Databases, data mining, information retrieval: these are not separate research domains. The internship topics that each of us will present next are usually at the intersection of these areas. Let’s see some example topics…
Bart Goethals
Recommender Systems
– Implement state-of-the-art recommenders
– Pattern mining for better recommendations
– Interactive recommendation
– Explaining recommendations
– Test recommenders on real data
Visual Instant Interactive Pattern Mining
– Study visualizations enabling interactive pattern mining
– Implement and experiment with novel instant mining methods
Pattern-based Clustering: implement and evaluate different techniques for clustering-based pattern mining and pattern-based clustering.
Data Mining for Cleaning: study and experiment with data mining methods for data cleaning.
Martin Theobald
Information Extraction (I): Wikipedia Infoboxes
Information Extraction (I): Infoboxes
Example extracted facts: bornOn(Jeff, 09/22/42), gradFrom(Jeff, Columbia), hasAdvisor(Jeff, Arthur), hasAdvisor(Surajit, Jeff), knownFor(Jeff, Theory)
YAGO/DBpedia et al.: >120 M facts for YAGO2 (mostly from Wikipedia infoboxes)
Information Extraction (II): Wikipedia Categories
RDF Knowledge Bases [Figure: RDF graph linking entities (Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, Max_Planck Society, Nobel Prize) to classes (Entity, Person, Scientist, Physicist, Biologist, Politician, Location, City, State, Country, Organization) and to name strings and dates, via the relations instanceOf, subclass, means, bornOn, diedOn, bornIn, fatherOf, hasWon, citizenOf, locatedIn.] Accuracy 95%; 3 Mio. entities, 120 Mio. facts; 100 relations, 200k classes.
Linked Open Data. As of Sept. 2011: > 200 sources, > 30 billion RDF triples, > 400 million links.
As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase.
IBM Watson: Deep Question Answering
Example clues:
– 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain
– This town is known as "Sin City" & its downtown is "Glitter Gulch"
– William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
– As of 2010, this is the only former Yugoslav republic in the EU
Components: YAGO knowledge back-ends; question classification & decomposition.
D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.
A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field? Jeopardy!
Structured Knowledge Queries
A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field?
  Select Distinct ?c Where {
    ?c type City .
    ?c locatedIn USA .
    ?a1 type Airport .
    ?a2 type Airport .
    ?a1 locatedIn ?c .
    ?a2 locatedIn ?c .
    ?a1 namedAfter ?p .
    ?p type WarHero .
    ?a2 namedAfter ?b .
    ?b type BattleField .
  }
Use manually created templates for mapping sentence patterns to structured queries. Works for factoid and list questions.
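A minimal sketch of how such a conjunctive triple-pattern query could be evaluated, assuming a tiny in-memory set of triples. The facts, the entity names, and the helper names (`match`, `evaluate`, `answer`) are illustrative assumptions, not YAGO data or the Watson pipeline.

```python
# Toy triple store; every fact here is an illustrative assumption.
TRIPLES = {
    ("Chicago", "type", "City"), ("Chicago", "locatedIn", "USA"),
    ("OHare", "type", "Airport"), ("OHare", "locatedIn", "Chicago"),
    ("Midway", "type", "Airport"), ("Midway", "locatedIn", "Chicago"),
    ("OHare", "namedAfter", "Butch_OHare"), ("Butch_OHare", "type", "WarHero"),
    ("Midway", "namedAfter", "Battle_of_Midway"),
    ("Battle_of_Midway", "type", "BattleField"),
    ("Boston", "type", "City"), ("Boston", "locatedIn", "USA"),
    ("Logan", "type", "Airport"), ("Logan", "locatedIn", "Boston"),
}

# The query from the slide, one triple pattern per tuple.
QUERY = [
    ("?c", "type", "City"), ("?c", "locatedIn", "USA"),
    ("?a1", "type", "Airport"), ("?a2", "type", "Airport"),
    ("?a1", "locatedIn", "?c"), ("?a2", "locatedIn", "?c"),
    ("?a1", "namedAfter", "?p"), ("?p", "type", "WarHero"),
    ("?a2", "namedAfter", "?b"), ("?b", "type", "BattleField"),
]


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if new.setdefault(p, t) != t:  # variable already bound differently
                return None
        elif p != t:  # constant mismatch
            return None
    return new


def evaluate(query, triples):
    """Naive nested-loop join over all triple patterns."""
    bindings = [{}]
    for pattern in query:
        extended = []
        for b in bindings:
            for t in triples:
                b2 = match(pattern, t, b)
                if b2 is not None:
                    extended.append(b2)
        bindings = extended
    return bindings


def answer(query, triples, var):
    """Distinct values of `var` over all solutions."""
    return sorted({b[var] for b in evaluate(query, triples)})


print(answer(QUERY, TRIPLES, "?c"))
```

A real engine would reorder the patterns and use indexes instead of scanning every triple for every pattern.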
Mining Rules from RDF Knowledge Bases
Goal: inductively learn (soft) rules, e.g. livesIn(x,y) :- bornIn(x,y)
– Setting: G = ground truth for livesIn (only partially known); KB = knowledge base for livesIn (known positive examples); R = facts produced by the rule (only partially correct).
– Closed World Assumption: strongly penalizes the rule.
– Specificity: avoid producing overly general rules.
– Our solution: use a combination of statistical measures; confidence instead of accuracy (do not penalize the rule for unseen entities); refine overly general rules by types.
– Techniques: A-priori-style pre-filtering of low-support join patterns; dynamic programming ILP algorithm; learning with constants and type constraints.
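A small sketch of why the closed world assumption penalizes the rule livesIn(x,y) :- bornIn(x,y), using made-up facts. The second measure, which only judges predictions for persons that have at least one known livesIn fact, is one possible reading of "do not penalize the rule for unseen entities". It is an assumption, not necessarily the measure used in this work.

```python
# Toy knowledge base; all persons and places are invented for illustration.
BORN_IN = {("Ann", "Paris"), ("Bob", "Rome"), ("Cid", "Oslo"), ("Dee", "Kiel")}
# Known positive livesIn facts only; nothing is known about Cid and Dee.
LIVES_IN = {("Ann", "Paris"), ("Bob", "Milan")}


def rule_stats(body, head):
    """Statistics for the rule head(x,y) :- body(x,y)."""
    predictions = set(body)             # the rule predicts head for every body pair
    support = len(predictions & head)   # predictions confirmed by the KB
    # Closed world: every unconfirmed prediction counts as wrong.
    cwa_confidence = support / len(predictions)
    # Assumed alternative: only judge persons with some known head fact.
    known = {x for x, _ in head}
    judged = {(x, y) for x, y in predictions if x in known}
    partial_confidence = support / len(judged) if judged else 0.0
    return support, cwa_confidence, partial_confidence


print(rule_stats(BORN_IN, LIVES_IN))
```

With these facts the closed-world confidence is 1/4, while the restricted measure is 1/2, because the predictions for Cid and Dee are left unjudged.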
Rule-based Reasoning: (Soft) Deduction Rules vs. (Hard) Consistency Constraints
Soft deduction rules:
– People may live in more than one place:
  livesIn(x,y) ∧ marriedTo(x,z) ⇒ livesIn(z,y) [0.8]
  livesIn(x,y) ∧ hasChild(x,z) ⇒ livesIn(z,y) [0.5]
Hard consistency constraints:
– People are not born in different places/on different dates:
  bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z
– People are not married to more than one person (at the same time, in most countries?):
  marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)
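A rough sketch of the difference between the two kinds of rules, on invented facts. It reads the first soft rule as livesIn(x,y) ∧ marriedTo(x,z) ⇒ livesIn(z,y) with weight 0.8 (an assumed reading) and checks the hard bornIn constraint. The function names are hypothetical.

```python
# Toy facts as (predicate, subject, object); all names are made up.
FACTS = {
    ("livesIn", "Ann", "Rome"),
    ("marriedTo", "Ann", "Bob"),
    ("bornIn", "Cid", "Paris"),
    ("bornIn", "Cid", "Lyon"),
}


def apply_spouse_rule(facts, weight=0.8):
    """Soft rule: livesIn(x,y) and marriedTo(x,z) suggest livesIn(z,y)."""
    derived = {}
    for p1, x, y in facts:
        if p1 != "livesIn":
            continue
        for p2, x2, z in facts:
            if p2 == "marriedTo" and x2 == x:
                derived[("livesIn", z, y)] = weight  # derived fact keeps a weight
    return derived


def born_in_conflicts(facts):
    """Hard constraint: bornIn(x,y) and bornIn(x,z) imply y = z."""
    places = {}
    for p, x, y in facts:
        if p == "bornIn":
            places.setdefault(x, set()).add(y)
    # Any person with more than one birthplace violates the constraint.
    return {x: sorted(ys) for x, ys in places.items() if len(ys) > 1}


print(apply_spouse_rule(FACTS))
print(born_in_conflicts(FACTS))
```

The soft rule only adds a weighted candidate fact, whereas a violated hard constraint signals that at least one of the conflicting facts must be rejected.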
Probabilistic RDF Database
Base facts:
  graduatedFrom(Surajit, Princeton) [0.7]
  graduatedFrom(Surajit, Stanford) [0.6]
  graduatedFrom(David, Princeton) [0.9]
  hasAdvisor(Surajit, Jeff) [0.8]
  hasAdvisor(David, Jeff) [0.7]
  worksAt(Jeff, Stanford) [0.9]
  type(Princeton, University) [1.0]
  type(Stanford, University) [1.0]
  type(Jeff, Computer_Scientist) [1.0]
  type(Surajit, Computer_Scientist) [1.0]
  type(David, Computer_Scientist) [1.0]
Rules:
  hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]
  graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z
Query: graduatedFrom(Surajit, y)
Lineage, with A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9]:
  Q1 = graduatedFrom(Surajit, Princeton): A ∧ ¬(B ∨ (C ∧ D))
  Q2 = graduatedFrom(Surajit, Stanford): ¬A ∧ (B ∨ (C ∧ D))
Probabilities:
  P(C ∧ D) = 0.8 × 0.9 = 0.72
  P(B ∨ (C ∧ D)) = 1 - (1 - 0.72) × (1 - 0.6) = 0.888
  P(Q1) = 0.7 × (1 - 0.888) = 0.078
  P(Q2) = (1 - 0.7) × 0.888 = 0.266
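The arithmetic on this slide can be retraced with a short sketch that treats the base facts as independent events. The labels A to D and the helper names are assumptions made for the example: A and B are the two graduatedFrom facts for Surajit (Princeton, Stanford), C is hasAdvisor(Surajit, Jeff), and D is worksAt(Jeff, Stanford). Like the numbers on the slide, the sketch does not multiply in the rule weight [0.4].

```python
from math import prod


def p_and(*ps):
    """Probability that all independent events hold."""
    return prod(ps)


def p_or(*ps):
    """Probability that at least one independent event holds."""
    return 1 - prod(1 - p for p in ps)


def p_not(p):
    return 1 - p


# Confidences of the base facts from the slide.
A, B, C, D = 0.7, 0.6, 0.8, 0.9

# Stanford is supported by the base fact B or by the derivation C and D.
stanford = p_or(B, p_and(C, D))

# The constraint allows only one alma mater per answer.
q1 = p_and(A, p_not(stanford))   # Princeton and not Stanford
q2 = p_and(p_not(A), stanford)   # Stanford and not Princeton

print(round(stanford, 3), round(q1, 3), round(q2, 3))
```

Multiplying probabilities like this is only valid because A, B, C, and D are distinct base facts that each occur once in the lineage formula; shared facts would require a different evaluation strategy.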
Temporal Knowledge
Probabilistic-Temporal Consistency Reasoning
Base facts: playsFor(Beckham, Real, T1), playsFor(Ronaldo, Real, T2)
Rule: playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1,T2) ⇒ teamMates(Beckham, Ronaldo, T3)
Derived facts: teamMates(Beckham, Ronaldo, T3) (state relation)
[Figure: timelines of the base-fact intervals T1 and T2 and of the derived interval T3.]
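A minimal sketch of deriving teamMates facts from overlapping playsFor intervals. The year intervals, the third player, and the function names are illustrative assumptions, and the derived interval T3 is taken to be the intersection of T1 and T2.

```python
from itertools import combinations

# (player, club, (start_year, end_year)); the intervals are made up.
PLAYS_FOR = [
    ("Beckham", "Real", (2003, 2007)),
    ("Ronaldo", "Real", (2002, 2006)),
    ("PlayerX", "OtherClub", (2001, 2009)),
]


def overlap(t1, t2):
    """Intersection of two intervals, or None if they do not overlap."""
    start, end = max(t1[0], t2[0]), min(t1[1], t2[1])
    return (start, end) if start < end else None


def team_mates(facts):
    """Derive teamMates(p1, p2, T3) for players at the same club at the same time."""
    derived = []
    for (p1, c1, t1), (p2, c2, t2) in combinations(facts, 2):
        if c1 != c2:
            continue
        t3 = overlap(t1, t2)
        if t3 is not None:
            derived.append((p1, p2, t3))
    return derived


print(team_mates(PLAYS_FOR))
```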
Topics for Internships & Master Theses
Research Internships
– Preparation & Integration of Linked Data Sources for Scientific Experiments (SQL/Java/Python)
– Mining Association Rules from Linked Data (Java/C++)
– Visualization Frontend for Linked Data (ActionScript & Adobe Flash)
Master Theses
– Implementation of a distributed rule-based query engine for RDF data (C++ & Message Passing Interface)
– Implementation of a distributed factor graph model for correlated RDF facts (C++ & Message Passing Interface)
– Faceted Search and Interactive Browsing for Linked Data
Floris Geerts
RDBMS-based recommendation systems
– Items: books, music, news, Web sites, research papers, …
– Example: find top-3 flights from EDI to NYC with at most one stop
  – Items: flights
  – Selection criteria: relational queries
  – Utility function: in terms of price and duration (for ranking)
[Figure: top-k item selection: the items are filtered by the selection criteria and ranked by the utility function to obtain the top-k items.]
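A small sketch of top-k item selection for the flight example. The flight data, the weight of 50 per travel hour, and the function names are assumptions for illustration. In an RDBMS the same idea would be a selection query followed by an ORDER BY on the utility and a LIMIT.

```python
import heapq

# Hypothetical flight items.
FLIGHTS = [
    {"id": "F1", "origin": "EDI", "dest": "NYC", "stops": 1, "price": 600, "hours": 11},
    {"id": "F2", "origin": "EDI", "dest": "NYC", "stops": 1, "price": 900, "hours": 8},
    {"id": "F3", "origin": "EDI", "dest": "NYC", "stops": 2, "price": 400, "hours": 20},
    {"id": "F4", "origin": "EDI", "dest": "NYC", "stops": 1, "price": 700, "hours": 10},
    {"id": "F5", "origin": "EDI", "dest": "NYC", "stops": 1, "price": 550, "hours": 16},
    {"id": "F6", "origin": "EDI", "dest": "BOS", "stops": 0, "price": 300, "hours": 7},
]


def selected(f):
    """Selection criteria: EDI to NYC with at most one stop."""
    return f["origin"] == "EDI" and f["dest"] == "NYC" and f["stops"] <= 1


def utility(f, hour_cost=50):
    """Assumed utility: cheaper and shorter flights score higher."""
    return -(f["price"] + hour_cost * f["hours"])


def top_k(flights, k=3):
    """Filter by the selection criteria, then keep the k best by utility."""
    return heapq.nlargest(k, (f for f in flights if selected(f)), key=utility)


print([f["id"] for f in top_k(FLIGHTS)])
```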
Query relaxation
Q(f#, name, type, ticket, time) = ∃ DT, AT, AD, x_To ( flight(f#, EDI, x_To, DT, 5/19/2012, AT, AD, Pr) ∧ POI(name, x_To, type, ticket, time) ∧ x_To = NYC )
There is no direct flight from EDI to NYC.
E = { EDI, NYC, 4/1/2012 }, X = { x_To }
Relaxation: cities within 15 miles of EDI or NYC are acceptable. A valid query relaxation:
Q1(f#, name, type, ticket, time) = ∃ DT, AT, AD, u_To, w_Edi, w_NYC, w_DD ( flight(f#, w_Edi, x_To, DT, w_DD, AT, AD, Pr) ∧ x_To = w_NYC ∧ POI(name, u_To, type, ticket, time) ∧ w_DD = 5/19/2012 ∧ dist(w_NYC, NYC) ≤ 15 ∧ dist(w_Edi, EDI) ≤ 15 ∧ x_To = u_To )
Further relaxation: departure dates within 3 days of 5/19/2012 are acceptable, i.e. dist(w_DD, 5/19/2012) ≤ 3.
Query for 5-day holiday.
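The relaxation steps can be sketched on a toy flight table. The distances, airport codes, flights, and function names are illustrative assumptions, and the join with POI is omitted to keep the example short.

```python
from datetime import date

# Hypothetical distances in miles between an airport and a target city.
DIST = {("EDI", "EDI"): 0, ("NYC", "NYC"): 0, ("EWR", "NYC"): 9, ("PHL", "NYC"): 80}

# Hypothetical flights: (f#, origin, destination, departure date).
FLIGHTS = [
    ("F1", "EDI", "EWR", date(2012, 5, 19)),
    ("F2", "EDI", "EWR", date(2012, 5, 21)),
    ("F3", "EDI", "PHL", date(2012, 5, 19)),
    ("F4", "EDI", "EWR", date(2012, 5, 25)),
]


def dist(a, b):
    """Distance lookup; unknown pairs count as infinitely far apart."""
    return DIST.get((a, b), float("inf"))


def query(flights, max_miles=0, max_days=0, target=date(2012, 5, 19)):
    """Original query when both bounds are 0; larger bounds relax it."""
    return [
        f_no
        for f_no, origin, dest, dep in flights
        if dist(origin, "EDI") <= max_miles
        and dist(dest, "NYC") <= max_miles
        and abs((dep - target).days) <= max_days
    ]


print(query(FLIGHTS))                             # exact query
print(query(FLIGHTS, max_miles=15))               # relax the cities
print(query(FLIGHTS, max_miles=15, max_days=3))   # also relax the date
```

Each relaxation replaces an equality on a constant by a distance bound, so every answer of the stricter query remains an answer of the relaxed one.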
Topics
– Top-k query answering algorithms on top of an RDBMS
– Query relaxation approaches and query completion
Data quality
– Detecting and correcting inconsistencies
– Finding duplicates
– Finding the most up-to-date information
Semantic errors [Figure: stock quotes from Yahoo! Finance and Nasdaq, showing their "Day’s Range" and "wk Range" fields.]
Instance ambiguity
Out-of-Date Data [Figure: two sources with different timestamps, 4:05 pm and 3:57 pm.]
Unit errors 76,821, B
Topics
– Fast inconsistency detection
– Duplicate elimination algorithms
– Automated repairing algorithms
– Mining of “data quality rules”