By Borys Omelayenko, Ph.D. Data and vocabularies: data is isolated; vocabularies link it to the world and give some extra meaning to build new applications.

1 By Borys Omelayenko, Ph.D.

2 Data and vocabularies. Data is isolated. Vocabularies link it to the world and give some extra meaning to build new applications. Example: "Tetracycline works by stopping the growth of bacteria."

3 What it is all about

Data records (inside a company or a private cloud):
Title   Category  Location
Pot     Cooking   Paris
Axe     Tools     Lyon
Church  Building  Sofia

Vocabulary terms:
Label  Ppl.
Paris  2.1m
Sofia  1.2m

Vocabulary terms:
Label  Descr.
Tools  Hammer
Food   Meat

4 What are they, practically?

5 Vocabulary. Linguistics: “all the words known and used by a particular person” Source: Computing: a vocabulary is a database of terms known and used by your system.

6 Term ‘Human’

7 Terms ‘Paris’ and ‘Sofia’

8 Source: Chemical

9 Source: Drug

10 Vocabulary. It is an additional database, made by somebody else. It is focused and specialized: a certain aspect of the terms, a handful of relations. It may have millions of terms. You want to use it: it can quickly add to your data, drag your data out of isolation, and bring added value to your customers. Next: three use cases.

11 How to improve recall

12 Search results

13 Object page: ‘женщина париж’ (Russian for ‘woman Paris’). How did we find it?

14 Enriched records in SOLR. Place terms: Paris, Paris 04, Paris, Île-de-France, France. Subject term: Woman. Documents from Paris don’t need to mention ‘France’; they will still be found on a query for ‘France’. Multilingual labels: France / Франция / Frankrijk; Paris / Париж / Parijs; Orphelinat des postes …; Woman / Женщина / Vrouw. Broader for ‘woman’: Population structure / Структура населения.
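
The enrichment idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the project's actual SOLR setup: the `VOCAB` dictionary, the `enriched_terms` field name, the record title, and the helper functions are all hypothetical. Each record gets the labels of its linked terms plus the labels of every broader term, so a query for ‘France’ finds a record that only mentions Paris.

```python
# Hypothetical in-memory vocabulary: term id -> multilingual labels and broader term.
VOCAB = {
    "paris": {"labels": ["Paris", "Париж", "Parijs"], "broader": "ile-de-france"},
    "ile-de-france": {"labels": ["Île-de-France"], "broader": "france"},
    "france": {"labels": ["France", "Франция", "Frankrijk"], "broader": None},
    "woman": {"labels": ["Woman", "Женщина", "Vrouw"], "broader": "population-structure"},
    "population-structure": {
        "labels": ["Population structure", "Структура населения"],
        "broader": None,
    },
}


def expand(term_id, vocab=VOCAB):
    """Return the labels of a term followed by the labels of all its broader terms."""
    labels = []
    seen = set()  # guards against accidental cycles in the vocabulary
    while term_id is not None and term_id not in seen:
        seen.add(term_id)
        entry = vocab[term_id]
        labels.extend(entry["labels"])
        term_id = entry["broader"]
    return labels


def enrich(record, term_ids, vocab=VOCAB):
    """Copy the record and add an 'enriched_terms' field (hypothetical field name)."""
    enriched = dict(record)
    labels = []
    for term_id in term_ids:
        for label in expand(term_id, vocab):
            if label not in labels:
                labels.append(label)
    enriched["enriched_terms"] = labels
    return enriched


def matches(record, query):
    """Case-insensitive exact match of the query against the enriched labels."""
    wanted = query.casefold()
    return any(wanted == label.casefold() for label in record["enriched_terms"])


# A made-up record linked to the terms 'paris' and 'woman'.
record = enrich({"title": "Example photo"}, ["paris", "woman"])
```

With this sketch, `matches(record, "France")` and `matches(record, "женщина")` both succeed although neither word appears in the original record.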

15 How to improve precision

16 Museums. MultiMedian: a Dutch R&D project, completed in 2006. It links together a dozen museum databases and a dozen vocabularies. Source: (June 2015)


18 And try to navigate. Autocomplete searches these databases, groups artefacts (data documents) and terms, and combines them into an informative autocomplete. Source: (June 2015)


20 Graph scheme (diagram). Node types: Person, Artefact, Style, Place, Event. Example nodes: Derain, Portrait of Matisse, Modern, Paris. Edge labels: author, title, name (label), label, made in, made at, worked in, worked at, held at, associated with, participated at. More-or-less AAT structure.

21 Let's search: Derain. Clustered results. An interesting pattern: results without ‘Derain’ but very relevant. They would never be found with text search. Source: (June 2015)

22 How to link documents to terms?

23 Let's search for terms: Paris. Looks simple: search for ‘Paris’ on

24 Bring me to Paris! How many? Which one? Source: (June 2015)

25 Disambiguation tip: where. Population. Paris, Paris 04, Paris: all nested. Choose the most specific one. London would become Westminster.
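
A minimal sketch of the "choose the most specific one" rule, assuming each candidate place carries a pointer to its parent. The `PLACES` table, its ids, and the exact nesting order are made up for illustration and are not taken from any real gazetteer. The rule only applies when all candidates sit on one nesting chain; otherwise the sketch returns `None` and leaves the decision to another tip.

```python
# Hypothetical place table: id -> (label, parent id). The nesting order is an assumption.
PLACES = {
    1: ("France", None),
    2: ("Île-de-France", 1),
    3: ("Paris", 2),
    4: ("Paris 04", 3),
    5: ("Paris", 4),
}


def ancestors(place_id, places):
    """Return the ids of all places that contain the given place, nearest first."""
    chain = []
    parent = places[place_id][1]
    while parent is not None:
        chain.append(parent)
        parent = places[parent][1]
    return chain


def most_specific(candidate_ids, places):
    """Pick the deepest candidate if every other candidate is one of its ancestors."""
    if not candidate_ids:
        return None
    deepest = max(candidate_ids, key=lambda pid: len(ancestors(pid, places)))
    chain = set(ancestors(deepest, places))
    others = [pid for pid in candidate_ids if pid != deepest]
    if all(pid in chain for pid in others):
        return deepest
    return None  # candidates are not nested, so this rule cannot decide
```

For the three nested candidates, `most_specific([3, 4, 5], PLACES)` returns the innermost one, id 5.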

26 Disambiguation tip: where. Athens: Greece or Georgia (Georgia US, not Georgia)? Choose by the country of the data, or use time constraints (antique). Colonization: duplicates in place names. Filtering: drop what you don’t need. Administrative units for museums: skip parks, rivers, etc. Keep parks for a hiking web site.
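
The two tips on this slide, filtering by feature type and choosing by the country of the data, could look roughly like the following. The `Place` class, the `kind` values, and the candidate list are hypothetical and only illustrate the idea; a real gazetteer has its own feature codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    country: str  # ISO 3166-1 alpha-2 country code
    kind: str     # hypothetical feature type: "admin", "park", "river", ...


# Made-up candidates for the ambiguous name 'Athens'.
CANDIDATES = [
    Place("Athens", "GR", "admin"),
    Place("Athens", "US", "admin"),  # Athens in the US state of Georgia
    Place("Athens", "US", "park"),   # invented entry to show filtering
]


def disambiguate(name, candidates, data_country, keep_kinds=frozenset({"admin"})):
    """Drop unwanted feature types, then prefer places in the data's own country.

    Returns a single Place, or None when the name stays ambiguous or unknown.
    """
    hits = [p for p in candidates if p.name == name and p.kind in keep_kinds]
    same_country = [p for p in hits if p.country == data_country]
    pool = same_country or hits
    return pool[0] if len(pool) == 1 else None
```

A museum would keep the default `keep_kinds`; a hiking web site might pass `keep_kinds={"park"}` instead.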

27 Disambiguation tip: what. In culture they often use geo-cultural origin. Ancient Greece is limited in time & space. But they mix up ‘what’ & ‘where’: a village in the middle of France, 20 residents and a million artefacts. It was called ‘Roman’. Tip: use new links, put records on a map.

28 Disambiguation tip: who. There are millions of people involved in culture. Names are often ambiguous. Tips: compare the year of the painting to the birth and death years; look at the ‘death’ field.
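
The birth-death check can be sketched as a simple filter. The `Person` type, the `min_age` threshold, and the two namesakes below are invented for illustration; they are not data from the presentation.

```python
from typing import NamedTuple, Optional


class Person(NamedTuple):
    name: str
    born: int
    died: Optional[int]  # None when the 'death' field is empty


# Two invented namesakes.
PEOPLE = [
    Person("J. Smith", 1600, 1660),
    Person("J. Smith", 1850, 1920),
]


def plausible_creators(name, year, people, min_age=10):
    """Keep namesakes who were alive, and at least min_age old, in the given year."""
    result = []
    for person in people:
        if person.name != name:
            continue
        if year < person.born + min_age:
            continue  # too young, or not yet born
        if person.died is not None and year > person.died:
            continue  # already dead when the work was made
        result.append(person)
    return result
```

A painting dated 1890 leaves only the second J. Smith; a painting dated 1700 leaves nobody, which signals that the link should not be made automatically.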

29 Disambiguation tip: when. Maybe the easiest: a fairly limited area, one-dimensional. Many numerical values, like ‘early 13th century’. Just stay in the past: a museum was dating objects with future dates. Tip: use new links, put your records on a timeline.
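
As a rough illustration of the ‘when’ dimension, the sketch below turns a phrase such as ‘early 13th century’ into a year range and rejects ranges that end in the future. The split of a century into thirds for ‘early’, ‘mid’ and ‘late’ is an assumption made here, not a rule from the presentation, and the function names are hypothetical.

```python
import re
from datetime import date

# Accepts '13th century', 'early 13 th century', 'Late 19th century', ...
CENTURY = re.compile(
    r"^(?:(early|mid|late)\s+)?(\d{1,2})\s*(?:st|nd|rd|th)\s+century$",
    re.IGNORECASE,
)


def century_to_years(text):
    """Map a century phrase to an inclusive (start, end) year range, or None."""
    match = CENTURY.match(text.strip())
    if match is None:
        return None
    part = (match.group(1) or "").lower()
    number = int(match.group(2))
    start, end = (number - 1) * 100 + 1, number * 100
    # Assumption: 'early', 'mid' and 'late' each cover about a third of the century.
    if part == "early":
        end = start + 32
    elif part == "mid":
        start, end = start + 33, start + 65
    elif part == "late":
        start = start + 66
    return (start, end)


def is_in_the_past(years, current_year=None):
    """Return True when the whole range ends no later than the current year."""
    if current_year is None:
        current_year = date.today().year
    return years[1] <= current_year
```

Records whose range fails `is_in_the_past` can be flagged for review instead of being placed on the timeline.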

30 How far we can go on the cheap: fully automatic tagging. Europeana: 18.7 m records (2009). In total, 11.2 m out of 18.7 m records get at least one link.

       Vocabulary               Size     SOLR field  Links, m
Who    painters from Wikipedia  10,000   creator     0.01
What   GEMET                    10,000   subject     2.4
Where  Geonames                 140,000  coverage    2.8
When   Semium Time              2,500    date        7.9

Each link is 20+ multilingual synonyms.


32 1976, 1988, Future. Database theory gave us modern databases. It’s time for graph theory.
