Published by June Goodwin. Modified over 9 years ago.
1
By Borys Omelayenko, Ph.D.
2
Data and vocabularies
- Data is isolated
- Vocabularies link it to the world and give some extra meaning to build new applications
- Tetracycline works by stopping the growth of bacteria.
3
What it is all about

Data records:
Title   Category  Location
Pot     Cooking   Paris
Axe     Tools     Lyon
Church  Building  Sofia

Vocabulary terms:
Label  Population
Paris  2.1m
Sofia  1.2m

Vocabulary terms:
Label  Description
Tools  Hammer
Food   Meat

Inside a company or a private cloud
4
What are they, practically?
5
Vocabulary
- Linguistics: “all the words known and used by a particular person” (Source: http://dictionary.cambridge.org/dictionary/british/vocabulary)
- Computing: a vocabulary is a database of terms known and used by your system
6
Term ‘Human’
7
Terms ‘Paris’ and ‘Sofia’
8
Chemical
Source: http://www.genome.jp/dbget-bin/www_bget?cpd:C06570
9
Drug
Source: http://www.genome.jp/dbget-bin/www_bget?dr:D00201
10
Vocabulary
- Is an additional database
  - Made by somebody else
  - Focused and specialized
  - Covers a certain aspect of the terms
  - Has a handful of relations
  - May have millions of terms
- You want to use it
  - It can quickly add to your data
  - Drag your data out of isolation
  - Bring added value to your customers
Next: three use cases
11
How to improve recall
12
Search results
13
Object page: женщина париж (“woman Paris”)
How did we find it?
14
Enriched records in SOLR
- Place, with its broader terms: Paris 04, Paris, Île-de-France, France
- Subject: Woman
- Documents from Paris don’t need to mention ‘France’; they will still be found on a query for ‘France’
- Multilingual labels added to the record:
    France: Франция, Frankrijk
    Paris: Париж, Parijs
    Woman: Женщина, Vrouw
    Population structure (broader for ‘woman’): Структура населения
- Record: Orphelinat des postes …
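The enrichment idea on this slide can be sketched in a few lines of plain Python. This is only an illustration, not SOLR code: the `VOCAB` structure, the term ids, and the `expand`, `enrich` and `matches` helpers are all assumptions made for the example; the labels come from the slide, and the record title is the visible part of the truncated title shown there.

```python
# Hypothetical mini-vocabulary: term id -> multilingual labels and one broader term.
# Labels are taken from the slide; ids and layout are assumptions.
VOCAB = {
    "paris": {"labels": ["Paris", "Париж", "Parijs"], "broader": "france"},
    "france": {"labels": ["France", "Франция", "Frankrijk"], "broader": None},
    "woman": {"labels": ["Woman", "Женщина", "Vrouw"],
              "broader": "population_structure"},
    "population_structure": {
        "labels": ["Population structure", "Структура населения"],
        "broader": None,
    },
}


def expand(term_id):
    """Return the labels of a term followed by the labels of all broader terms."""
    labels, seen = [], set()
    while term_id is not None and term_id not in seen:
        seen.add(term_id)
        entry = VOCAB[term_id]
        labels.extend(entry["labels"])
        term_id = entry["broader"]
    return labels


def enrich(record, term_ids):
    """Copy the record and add an 'enriched' field holding all expanded labels."""
    enriched = dict(record)
    enriched["enriched"] = [label for t in term_ids for label in expand(t)]
    return enriched


def matches(record, query):
    """Naive stand-in for full-text search: every query word must occur
    as a word in the title or in the enriched labels (case-insensitive)."""
    text = [record.get("title", "")] + record.get("enriched", [])
    words = {w.lower() for chunk in text for w in chunk.split()}
    return all(q.lower() in words for q in query.split())


record = {"title": "Orphelinat des postes"}
doc = enrich(record, ["paris", "woman"])
print(matches(record, "France"))      # plain record
print(matches(doc, "France"))         # enriched record
print(matches(doc, "женщина париж"))  # Russian query against the enriched record
```

The plain record does not match ‘France’, while the enriched copy matches both ‘France’ and the Russian query, which is the recall gain the slide describes.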
15
How to improve precision
16
Museums
- MultiMedian: a Dutch R&D project, completed in 2006
- Links together:
  - A dozen museum databases
  - A dozen vocabularies
Source: e-culture.multimedian.nl (June 2015)
18
And try to navigate
- Autocomplete searches these databases
- Groups:
  - Artefacts (data documents)
  - Terms
- Combines them into an informative autocomplete
Source: e-culture.multimedian.nl (June 2015)
20
Graph scheme
- Node types: Person, Artefact, Style, Place, Event
- Example nodes: Derain (person), Portrait of Matisse (artefact), Modern (style), Paris (place)
- Relations: author, title, label, name (label), made at, made in, worked in, worked at, held at, associated with, participated at
- More-or-less the AAT structure
21
Let’s search: Derain
- Clustered results
- Interesting pattern: results without ‘Derain’ in them, but very relevant
- They would never be found with text search
Source: e-culture.multimedian.nl (June 2015)
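One way to picture clustered results is a breadth-first walk over the graph that groups the artefacts it reaches by the chain of relations that led to them. The sketch below is an assumption about how such a search could work, not the project's actual implementation: the edges, the artefact ‘Street scene’ and the function names are made up for illustration.

```python
from collections import deque

# Hypothetical edges (subject, relation, object), loosely following the graph scheme.
# "Street scene" and every edge here are invented for the example.
EDGES = [
    ("Portrait of Matisse", "author", "Derain"),
    ("Portrait of Matisse", "associated with", "Modern"),
    ("Derain", "worked in", "Paris"),
    ("Street scene", "made in", "Paris"),
]
ARTEFACTS = {"Portrait of Matisse", "Street scene"}


def build_graph(edges):
    """Build an undirected adjacency list: node -> [(relation, neighbour)]."""
    graph = {}
    for subj, rel, obj in edges:
        graph.setdefault(subj, []).append((rel, obj))
        graph.setdefault(obj, []).append((rel, subj))
    return graph


def clustered_search(start, graph, artefacts, max_hops=2):
    """Breadth-first search from `start`; group reached artefacts by relation path."""
    clusters = {}
    visited = {start}
    queue = deque([(start, ())])
    while queue:
        node, path = queue.popleft()
        if node in artefacts and path:
            clusters.setdefault(" / ".join(path), []).append(node)
        if len(path) == max_hops:
            continue  # do not walk further than max_hops relations
        for rel, nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + (rel,)))
    return clusters


clusters = clustered_search("Derain", build_graph(EDGES), ARTEFACTS)
for path, found in clusters.items():
    print(path, "->", found)
```

In this toy graph ‘Street scene’ never mentions Derain, yet it lands in the ‘worked in / made in’ cluster because both are linked to Paris; a text search on the artefact record alone would miss it.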
22
How to link documents to terms?
23
Let’s search for terms
- Paris: geonames.org/123
- Looks simple: search for ‘Paris’ on geonames.org
24
Bring me to Paris!
- How many?
- Which one?
Source: www.geonames.org (June 2015)
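The ‘how many?’ question can be asked programmatically through the GeoNames search web service. The sketch below only builds the request URL and summarises a decoded response; it makes no network call. The helper names are assumptions, the `username` value must be replaced by a registered GeoNames account, and the sample response is made up (placeholder ids) to show the shape of the data.

```python
from urllib.parse import urlencode

GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"


def build_search_url(name, username, max_rows=10):
    """Build a GeoNames search URL for places whose name equals `name`."""
    params = {"name_equals": name, "maxRows": max_rows, "username": username}
    return GEONAMES_SEARCH + "?" + urlencode(params)


def summarize(response):
    """Take a decoded searchJSON response and return (total, candidates)."""
    total = response.get("totalResultsCount", 0)
    candidates = [
        (g["geonameId"], g["name"], g.get("countryCode"))
        for g in response.get("geonames", [])
    ]
    return total, candidates


# Made-up response with placeholder ids, only to illustrate the ambiguity.
sample = {
    "totalResultsCount": 2,
    "geonames": [
        {"geonameId": 1, "name": "Paris", "countryCode": "FR"},
        {"geonameId": 2, "name": "Paris", "countryCode": "US"},
    ],
}

print(build_search_url("Paris", "demo"))
print(summarize(sample))
```

Even this tiny sample returns more than one ‘Paris’, so the lookup alone does not tell you which term to link to.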
25
Disambiguation tip: where
- Population
- Paris, Paris 04, Paris: all nested
- Choose the most specific one
- London would become Westminster
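‘Choose the most specific one’ can be read as: among nested candidates, pick the one deepest in the place hierarchy. A minimal sketch, assuming a hypothetical `PARENT` map with made-up ids and the nesting Paris 04, Paris, Île-de-France, France:

```python
# Hypothetical hierarchy: place id -> parent id (None for the top level).
PARENT = {
    "paris04": "paris",
    "paris": "ile-de-france",
    "ile-de-france": "france",
    "france": None,
}


def depth(place_id):
    """Number of parent steps from the place up to the top of the hierarchy."""
    steps = 0
    while PARENT[place_id] is not None:
        place_id = PARENT[place_id]
        steps += 1
    return steps


def most_specific(candidate_ids):
    """Pick the candidate that sits deepest in the hierarchy."""
    return max(candidate_ids, key=depth)


print(most_specific(["paris", "paris04", "ile-de-france"]))
```

This rule is only safe when the candidates really are nested in one another; unrelated namesakes need the other tips.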
26
Disambiguation tip: where
- Athens: Greece or Georgia (Georgia US, not Georgia the country)
- Choose by the country of the data
- Or use time constraints (antique)
- Colonization: duplicates in place names
- Filtering: drop what you don’t need
  - Administrative units for museums
  - Skip parks, rivers, etc.
  - Keep parks for a hiking web site
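The country and filtering tips can be combined into one small selection step. The sketch below is an assumption about how to order them: filter by feature class first, prefer the country of the data, then fall back to the largest population. The `Place` type and `disambiguate` function are hypothetical, the class letters follow the GeoNames convention (A administrative, P populated place, L parks and areas, H water), and the population figures are rough illustrative values.

```python
from dataclasses import dataclass


@dataclass
class Place:
    name: str
    country: str        # ISO country code
    feature_class: str  # GeoNames-style class letter
    population: int     # rough, illustrative value


def disambiguate(candidates, data_country, keep_classes=("A", "P")):
    """Drop unwanted feature classes, prefer the data's country,
    then pick the candidate with the largest population."""
    kept = [c for c in candidates if c.feature_class in keep_classes]
    same_country = [c for c in kept if c.country == data_country]
    pool = same_country or kept
    return max(pool, key=lambda c: c.population) if pool else None


candidates = [
    Place("Athens", "GR", "P", 600_000),
    Place("Athens", "US", "P", 100_000),
    Place("Athens Park", "US", "L", 0),  # hypothetical park, filtered out by default
]

print(disambiguate(candidates, "US"))
print(disambiguate(candidates, "GR"))
```

A hiking web site would pass `keep_classes=("A", "P", "L")` to keep parks, which is the per-application filtering the slide suggests.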
27
Disambiguation tip: what
- In culture they often use geo-cultural origin
- Ancient Greece is limited in time and space
- But they mix up ‘what’ and ‘where’
  - A village in the middle of France: 20 residents and a million artefacts
  - It was called ‘Roman’
- Tip: use the new links to put records on a map
28
Disambiguation tip: who
- There are millions of people involved in culture
- Names are often ambiguous
- Tips:
  - Compare the year of the painting to the birth and death years
  - Look at the ‘death’ field
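Comparing the year of a work to a person's lifespan is easy to automate. A minimal sketch, assuming candidates arrive as `(name, birth_year, death_year)` tuples; the function name, the `min_age` threshold and the two namesakes are hypothetical.

```python
def plausible_creators(candidates, work_year, min_age=10):
    """Keep candidates who were alive, and at least `min_age` years old,
    in the year the work was made. A death year of None means unknown."""
    result = []
    for name, born, died in candidates:
        if work_year < born + min_age:
            continue  # not yet born, or too young
        if died is not None and work_year > died:
            continue  # already dead
        result.append((name, born, died))
    return result


# Two hypothetical painters sharing one name.
namesakes = [
    ("A. Painter", 1600, 1660),
    ("A. Painter", 1850, 1920),
]

print(plausible_creators(namesakes, 1905))
```

Only the second namesake survives for a work dated 1905. The check cannot separate contemporaries, so it narrows the candidate list rather than settling it.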
29
Disambiguation tip: when
- Maybe the easiest
- Fairly limited area
- One-dimensional
- Many numerical values, like ‘early 13th century’
- Just stay in the past: a museum was dating objects with future dates
- Tip: use the new links to put your records on a timeline
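Values such as ‘early 13th century’ can be turned into year ranges with a small parser, and future dates can be flagged. The sketch below rests on two assumptions that are not in the slides: century N runs from year (N-1)*100+1 to N*100, and ‘early’, ‘mid’ and ‘late’ mean roughly the first, middle and last third. The function names are hypothetical.

```python
import re
from datetime import date

# Matches e.g. "13th century", "early 13 th century", "Late 2nd century".
PATTERN = re.compile(
    r"^(?:(early|mid|late)\s+)?(\d{1,2})\s*(?:st|nd|rd|th)\s+century$",
    re.IGNORECASE,
)

# Offsets from the first year of the century (assumed thirds).
THIRDS = {"early": (0, 32), "mid": (33, 66), "late": (67, 99)}


def century_range(text):
    """Return (first_year, last_year) for a century expression, or None."""
    m = PATTERN.match(text.strip())
    if not m:
        return None
    part, century = m.group(1), int(m.group(2))
    start = (century - 1) * 100 + 1
    end = century * 100
    if part:
        lo, hi = THIRDS[part.lower()]
        start, end = start + lo, start + hi
    return start, end


def is_in_past(year, today=None):
    """True if the year is not later than the current (or given) year."""
    return year <= (today or date.today()).year


print(century_range("early 13th century"))
print(is_in_past(2150))
```

Once every date is a pair of years, records can be placed on a timeline and anything failing `is_in_past` can be sent back for review.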
30
How far we can go on the cheap
Fully automatic tagging of Europeana: 18.7 m records (2009)

        Vocabulary               Size     SOLR field  Links, m
Who     painters from Wikipedia  10,000   creator     0.01
What    GEMET                    10,000   subject     2.4
Where   Geonames                 140,000  coverage    2.8
When    Semium Time              2,500    date        7.9

In total, 11.2 m out of 18.7 m records (about 60%) get at least one link.
Each link is 20+ multilingual synonyms.
Source: borys.name/blog/semantic_tagging_of_europeana_data.html
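Fully automatic tagging of this kind can be pictured as a dictionary lookup per SOLR field, followed by a count of records that received at least one link. The sketch below is an assumption about the general shape, not the pipeline used for Europeana: the lookups, term ids and records are made up, and only the field names come from the table.

```python
# Hypothetical label -> term id lookups, one per SOLR field from the table.
VOCABULARIES = {
    "creator": {"derain": "who:derain"},
    "subject": {"woman": "what:woman"},
    "coverage": {"paris": "where:paris"},
    "date": {"13th century": "when:c13"},
}


def tag(record):
    """Return (field, term id) links for every field value found in a vocabulary."""
    links = []
    for field, lookup in VOCABULARIES.items():
        for value in record.get(field, []):
            term = lookup.get(value.strip().lower())
            if term:
                links.append((field, term))
    return links


def coverage(records):
    """Return (records with at least one link, total records)."""
    linked = sum(1 for r in records if tag(r))
    return linked, len(records)


records = [
    {"creator": ["Derain"], "coverage": ["Paris "]},
    {"subject": ["unknown"]},
]

print(tag(records[0]))
print(coverage(records))
```

Exact matching on normalised labels is the cheapest option; the disambiguation tips from the earlier slides would sit between the lookup and the decision to keep a link.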
32
1976, 1988, Future
- Database theory gave us modern databases
- It’s time for graph theory