Wikidata, a target for Europeana’s semantic strategy Valentine Charles, Hugo Manguinhas, Antoine Isaac: Europeana Vladimir Alexiev: Ontotext Corp GLAM.

1 Wikidata, a target for Europeana’s semantic strategy
Valentine Charles, Hugo Manguinhas, Antoine Isaac: Europeana
Vladimir Alexiev: Ontotext Corp
GLAM Wiki 2015, Den Haag

2 Europeana.eu, Europe’s cultural heritage portal
40M objects from 2,200 galleries, museums, archives and libraries

3 Europeana has many data challenges: diversity
• Aggregates metadata from the cultural heritage sector in Europe
  - Large number of references to places, agents, concepts, time

4 Europeana has many data challenges: diversity
• Metadata in more than 30 languages
• From all EU countries

5 Europeana’s priority 1: Improve data quality
• Europeana Data Model (EDM), a framework for richer data
  - Re-uses several existing Semantic Web-based models: Dublin Core, OAI-ORE, SKOS, CIDOC-CRM…
  - EDM gives support for contextual resources (semantic layer)
• Rely on vocabularies to solve a problem of data interlinking
  - Encourage data providers to contribute their own vocabularies and benefit from data links made at data providers’ level

6 Vocabularies currently provided to Europeana

7 Europeana also manages its own vocabularies

8 Europeana performs automatic enrichment based on vocabularies
Goal: Contextualization which reaches outside the scope of a particular platform
[Diagram labels: Object; External Dataset and Vocabulary]

9 Automatic enrichment process in Europeana
Analysis
  - Selection of metadata fields in resource descriptions
  - Selection of potential rules to match
Linking
  - Matching the values of the metadata fields to values of the contextual resources
  - Adding contextual links
Augmentation
  - Selecting the values from the contextual resource
  - Augmentation of the search index with the labels from the vocabulary
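
The three stages on this slide can be sketched roughly as follows. This is only an illustration, not Europeana's actual implementation: the `Record` class, the toy `VOCABULARY` with its example.org URIs, and the exact case-insensitive label match are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical toy vocabulary: URI -> labels per language (not real Europeana data).
VOCABULARY = {
    "http://example.org/place/paris": {"en": "Paris", "fr": "Paris"},
    "http://example.org/place/vienna": {"en": "Vienna", "de": "Wien"},
}


@dataclass
class Record:
    metadata: dict                                   # field name -> list of string values
    links: list = field(default_factory=list)        # contextual links added by enrichment
    index_terms: set = field(default_factory=set)    # extra labels for the search index


def analyse(record, fields):
    """Analysis: collect the values of the selected metadata fields."""
    return [value for name in fields for value in record.metadata.get(name, [])]


def link(values, vocabulary):
    """Linking: match field values to vocabulary labels (assumed rule: exact, case-insensitive)."""
    label_to_uri = {
        label.lower(): uri
        for uri, labels in vocabulary.items()
        for label in labels.values()
    }
    return sorted({label_to_uri[v.lower()] for v in values if v.lower() in label_to_uri})


def augment(record, uris, vocabulary):
    """Augmentation: store the links and add all vocabulary labels to the index terms."""
    record.links.extend(uris)
    for uri in uris:
        record.index_terms.update(vocabulary[uri].values())
    return record


def enrich(record, fields, vocabulary=VOCABULARY):
    """Run analysis, linking and augmentation in sequence."""
    return augment(record, link(analyse(record, fields), vocabulary), vocabulary)
```

For example, a record whose dcterms:spatial value is "Wien" would gain a link to the Vienna entry and both labels, "Vienna" and "Wien", as index terms, so a search in either language could find it.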

10 Enrichment Types and Current Vocabularies
Enrichment type | Target vocabulary | Source metadata fields
Places          | GeoNames          | dcterms:spatial, dc:coverage
Concepts        | GEMET, DBpedia    | dc:subject, dc:type
Agents          | DBpedia           | dc:creator, dc:contributor
Time            | Semium Time       | dc:date, dc:coverage, dcterms:temporal, edm:year
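
One way to read the table on this slide is as a routing configuration from metadata fields to enrichment types. The sketch below encodes it as plain data; the names `ENRICHMENT_RULES` and `enrichment_types_for` are hypothetical, and only the field and vocabulary names come from the slide.

```python
# Enrichment rules transcribed from the slide; the structure itself is an assumption.
ENRICHMENT_RULES = {
    "Places": {"vocabularies": ["GeoNames"],
               "fields": ["dcterms:spatial", "dc:coverage"]},
    "Concepts": {"vocabularies": ["GEMET", "DBpedia"],
                 "fields": ["dc:subject", "dc:type"]},
    "Agents": {"vocabularies": ["DBpedia"],
               "fields": ["dc:creator", "dc:contributor"]},
    "Time": {"vocabularies": ["Semium Time"],
             "fields": ["dc:date", "dc:coverage", "dcterms:temporal", "edm:year"]},
}


def enrichment_types_for(field_name, rules=ENRICHMENT_RULES):
    """Return the enrichment types that read from the given metadata field."""
    return sorted(kind for kind, rule in rules.items() if field_name in rule["fields"])
```

Note that dc:coverage feeds both the Places and the Time enrichment, so a lookup for that field returns two types.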

11 Europeana enrichment - an example

12 How does Wikidata fit in Europeana’s semantic strategy?

13 Wikipedia's Relevance for Cultural Heritage
• Authority Lists and Thesauri have central importance in CH
• Wikipedia, being "the sum of all knowledge", has broader reach than any institutional authority list
• Only large-scale aggregations like VIAF (35 institutions) and LCSH (about 10 libraries around LoC) are comparable
• While some facts are inaccurate and disputable, Wikipedia has a great role as a source of stable URLs on all kinds of topics

14 How Big is Wikidata?
• Name data sources for semantic enrichment (Europeana Creative D2.4) gives DBpedia and Wikidata stats
• Wikidata: 3y old, 14M items, 209M edits
• 2.7M humans, 5k families, 22k literary characters
• 215k organizations
• 66k creative orgs (bands, radio/TV stations, newspapers…)
• 30k educational institutions
• 20k non-profit orgs
• 13k GLAM orgs: 0.5k galleries, 1k libraries, 0.2k archives, 9k museums
• 500k creative works
• 110k heritage sites and monuments
• 40k family names, 20k first names

15 Is this big enough?
• Wikidata: 2.7M humans, 215k organizations, 800k places, 500k works
• VIAF: 35M personal names, 5.4M orgs/conferences, 410k places, 1.7M works
• GeoNames: 9M places
• Only 1.1M persons are coreferenced, see Authority Addicts: The New Frontier of Authority Control on Wikidata
• VIAF is much bigger, but Wikidata is still very important for GLAM:
• Wikidata is active in Authority Control and Coreferencing
• (VIAF) Moving to Wikidata: will get 1M persons/orgs, and many multilingual names (see next)
• Authority Files have barely more than names & dates; Wikipedia often has a lot more info

16 Wikidata Multilingual Coverage
• Wikidata/DBpedia has huge multilingual coverage
• Each entity is represented in 2.11 Wikipedias on average (see Europeana food and drink classification scheme, EFD D2.2)
• But popular entities are present in many more (up to 180); and even in one Wikipedia there are many languages
• E.g. Lucas Cranach in Wikidata: 57 lang tags, representing 44 languages and 13 language variants
• Languages are consistently marked
• Important for semantic enrichment (Named Entity Recognition)
• Even though language labels in Europeana are not consistent
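
The split of language tags into languages and language variants can be illustrated with a small sketch. It assumes that a "variant" is any tag with a subtag after a hyphen (such as de-ch) and a "language" is a distinct primary subtag; the slide does not say how its counts were derived, and the sample tags are made up for the example.

```python
def summarize_language_tags(tags):
    """Count distinct tags, distinct primary languages, and tags carrying a subtag."""
    unique = {tag.lower() for tag in tags}
    languages = {tag.split("-", 1)[0] for tag in unique}   # primary subtag, e.g. "de" from "de-ch"
    variants = {tag for tag in unique if "-" in tag}       # assumed definition of a variant
    return {"tags": len(unique), "languages": len(languages), "variants": len(variants)}


# Illustrative sample only, not the actual tags on the Lucas Cranach item.
sample = ["en", "en-GB", "en-ca", "de", "de-ch", "fr", "zh", "zh-hans"]
summary = summarize_language_tags(sample)
```

With this sample, the eight tags reduce to four languages plus four variants, which mirrors the way the slide's 57 tags break down into 44 languages and 13 variants.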

17 Name Variants for Lucas Cranach
• Wikidata and VIAF each have 70 variants and dominate the "Wikipedia tradition" and "Library tradition" datasets respectively (see Name data sources for semantic enrichment)
• Only 5 variants are in common (see Interactive Venn diagram)
• Excellent complementarity. VIAF has more variants, Wikidata more multilingual names
• VIAF's move to sync to Wikidata will narrow the gap
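
The overlap shown in a Venn diagram of name variants boils down to set operations on normalized names. The sketch below is one possible way to compute it; the normalization (case folding, stripping diacritics, collapsing whitespace) is an assumption, and the sample names are illustrative rather than taken from either dataset.

```python
import unicodedata


def normalize(name):
    """Fold case, strip diacritics and collapse whitespace so near-identical variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.casefold().split())


def compare_variants(source_a, source_b):
    """Return the normalized variants shared by both sources and those unique to each."""
    a = {normalize(name) for name in source_a}
    b = {normalize(name) for name in source_b}
    return {"common": a & b, "only_a": a - b, "only_b": b - a}


# Illustrative variants only.
wikipedia_style = ["Lucas Cranach the Elder", "Lucas Cranach der Ältere", "Lucas Cranach l'Ancien"]
library_style = ["Cranach, Lucas", "lucas cranach der altere", "Cranach, Lucas, the Elder"]
overlap = compare_variants(wikipedia_style, library_style)
```

How aggressive the normalization is directly changes the size of the intersection, so any overlap figure depends on that choice.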

18 Wikidata is connected to other vocabularies
• Europeana prefers using pivot vocabularies that are connected to many other vocabularies
  - It is key to avoid duplication and redundancy
• Wikidata has a lot of coreferences to other vocabularies that can be used to create extra links, and extract missing data
  - https://www.wikidata.org/wiki/Wikidata:WikiProject_Authority_control
  - https://twitter.com/hashtag/coreferencing: shots and news. Please tweet!

19 VIAF-Wikidata Coreferences for Lucas Cranach
• Can be leveraged to fill the gaps, e.g. bring RKDartists into VIAF

VIAF     | id in VIAF                    | Wikidata         | id in Wikidata
viafID   | 49268177                      | VIAF             | 49268177
BAV      | ADV10197613                   |                  |
BNC      | .a10853637                    |                  |
BNE      | XX907273                      |                  |
BNF      | cb12176451h                   | BNF              | 12176451h
DNB      | 118522582                     | GND              | 118522582
ISNI     | 0000000121319721              | ISNI             | 0000 0001 2131 9721
JPG      | 500115364                     | ULAN             | 500115364
LC       | n50020861                     | LCCN             | n50020861
LNB      | LNC10-000002573               |                  |
NDL      | 00436834                      |                  |
NKC      | jn20000700335                 |                  |
NLA      | 000035031951                  |                  |
NLI      | 000035532,001445575,001448179 |                  |
NLP      | a16828161                     |                  |
NTA      | 068435312                     | NTA PPN          | 068435312
NUKAT    | vtls000190728                 |                  |
SELIBR   | 182422                        |                  |
SUDOC    | 028710010                     |                  |
WKP      | Lucas_Cranach_the_Elder       | Many Wikipedias  |
IMAGINE  | T7238,T267474                 | Cantic           | a10853637
         |                               | Commons Creator  | Lucas Cranach (I)
         |                               | Commons category | Lucas Cranach d. Ä.
         |                               | Freebase         | /m/0kqp0
         |                               | RKDartists       | 18978
         |                               | SIMBAD           | CRANACH, Lucas the Elder
         |                               | Your Paintings   | lucas-the-elder-cranach
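
Finding the gaps between the two sets of coreferences can be sketched as a dictionary comparison. The correspondence between VIAF source codes and Wikidata identifier names below is read off the slide (e.g. DNB/GND, JPG/ULAN, LC/LCCN) and covers only a subset; the function name `find_gaps` and the data layout are assumptions for the example.

```python
# VIAF source code -> Wikidata identifier name, as paired on the slide (subset).
VIAF_TO_WIKIDATA = {
    "viafID": "VIAF", "BNF": "BNF", "DNB": "GND", "ISNI": "ISNI",
    "JPG": "ULAN", "LC": "LCCN", "NTA": "NTA PPN",
}

# A few identifiers for Lucas Cranach, copied from the slide.
viaf_ids = {"viafID": "49268177", "DNB": "118522582", "JPG": "500115364", "SELIBR": "182422"}
wikidata_ids = {"VIAF": "49268177", "GND": "118522582", "ULAN": "500115364", "RKDartists": "18978"}


def find_gaps(viaf, wikidata, mapping=VIAF_TO_WIKIDATA):
    """Return (identifiers Wikidata could give VIAF, identifiers VIAF could give Wikidata)."""
    covered_by_viaf = {mapping[source] for source in viaf if source in mapping}
    missing_in_viaf = {name: value for name, value in wikidata.items()
                       if name not in covered_by_viaf}
    missing_in_wikidata = {source: value for source, value in viaf.items()
                           if mapping.get(source) not in wikidata}
    return missing_in_viaf, missing_in_wikidata


missing_in_viaf, missing_in_wikidata = find_gaps(viaf_ids, wikidata_ids)
```

On this subset the RKDartists id is a candidate to bring into VIAF, and the SELIBR id a candidate to add to Wikidata.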

20 Wikidata Coreferencing (1)
• Excellent Mix-n-Match tool by Magnus Manske. 54 catalogs loaded!!
• Decent auto-matching and excellent crowd-sourcing features

21 Wikidata Coreferencing (2)
• Excellent Authority Control navbox in Wikipedia
• E.g. matching British Museum person-institution thesaurus (currently not coreferenced to anything: high value to BM)

22 Europeana Food and Drink
• How do you define such a wide area as Food and Drink, which is so pervasive in everyday life and culture?
• Europeana food and drink classification scheme (EFD D2.2, or presentation) studies ~20 datasets for relevance to FD
• Concludes that Wikipedia is our playing ground, and we should try to use Wikipedia Categories to delineate the topic
  - AGROVOC has 32k concepts but on production/science
  - Wikipedia/DBpedia has 6.6k proper Foods (with infoboxes and ingredients)
  - But I estimate 0.6-1.2M things relevant to FD in all Wikipedias
• Background image: 2 levels of Food_and_drink cat hierarchy
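
Delineating a topic by walking a category hierarchy to a fixed depth can be sketched as a breadth-first traversal. The toy `SUBCATEGORIES` graph below is hypothetical (only the root name Food_and_drink comes from the slide), and a real traversal would read the subcategory links from Wikipedia or DBpedia data instead.

```python
from collections import deque

# Hypothetical category graph: category -> direct subcategories.
SUBCATEGORIES = {
    "Food_and_drink": ["Foods", "Drinks"],
    "Foods": ["Cheese", "Bread"],
    "Drinks": ["Wine", "Foods"],      # categories can be reachable by several paths
    "Cheese": ["Blue_cheese"],
}


def categories_within(root, max_depth, subcategories):
    """Breadth-first walk returning each reachable category with its depth, up to max_depth."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        if depth[category] == max_depth:
            continue                   # do not expand beyond the depth limit
        for child in subcategories.get(category, []):
            if child not in depth:     # also guards against cycles in the hierarchy
                depth[child] = depth[category] + 1
                queue.append(child)
    return depth


two_levels = categories_within("Food_and_drink", 2, SUBCATEGORIES)
```

The depth limit matters because category hierarchies tend to drift off topic the deeper one goes, so the cut-off is a tuning decision rather than a fixed rule.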

23 Wikidata is Easily Accessible
• It is important for Europeana to have the data
  - Technically available:
    - Data dump, preferably as Linked Data (RDF)
    - SPARQL end-point or other query mechanism (e.g. WDQ)
  - Properly documented and structured
    - Wikidata has an excellent Property Proposal process
    - Wikidata integrity constraints are excellent
    - In contrast, no Class creation process, so the classes are quite a mess (16k, of which 2/3 have fewer than 5 instances)
    - Data templates should be made more visible and be used as references
  - Open access
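
As an illustration of SPARQL access, the sketch below builds a query that looks up the Wikidata item carrying a given VIAF ID (property P214). It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql, which may not be the end-point the slide had in mind; the helper names and the User-Agent string are made up for the example.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(viaf_id, lang="en"):
    """Build a SPARQL query for items whose VIAF ID (P214) equals viaf_id."""
    if not viaf_id.isdigit():
        raise ValueError("VIAF ids are expected to be numeric strings")
    if not lang.replace("-", "").isalnum():
        raise ValueError("unexpected language tag")
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?item ?label WHERE {\n"
        f'  ?item wdt:P214 "{viaf_id}" .\n'
        f'  OPTIONAL {{ ?item rdfs:label ?label . FILTER(LANG(?label) = "{lang}") }}\n'
        "}"
    )


def build_url(query):
    """Encode the query as a GET request URL asking for JSON results."""
    return ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})


def fetch(viaf_id):
    """Run the query over the network and return the result bindings."""
    request = urllib.request.Request(
        build_url(build_query(viaf_id)),
        headers={"User-Agent": "enrichment-sketch/0.1 (example)"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `fetch("49268177")` would use the VIAF id listed for Lucas Cranach on the coreference slide; the function needs network access and is not exercised here.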

24 Wikidata Property Integrity Constraints
• E.g. ULAN id constraints help to find records to merge / split
• E.g. Communist Party of the Russian Federation has 5 LCNAF ids, what's up? Is it so popular with the Library of Congress?
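
The kind of check behind such constraint reports can be sketched as two groupings over (item, identifier) pairs for a single property. This is a simplified illustration, not Wikidata's constraint machinery; the function name and the placeholder items and ids are hypothetical.

```python
from collections import defaultdict


def check_identifier_constraints(claims):
    """claims: iterable of (item, identifier) pairs for one identifier property.

    Returns (items with several ids, ids shared by several items).
    """
    ids_by_item = defaultdict(set)
    items_by_id = defaultdict(set)
    for item, identifier in claims:
        ids_by_item[item].add(identifier)
        items_by_id[identifier].add(item)
    # One item with several ids: the item may conflate entities, or the authority file has duplicates.
    several_ids = {item: sorted(ids) for item, ids in ids_by_item.items() if len(ids) > 1}
    # One id on several items: the items may be duplicates that should be merged.
    shared_ids = {ident: sorted(items) for ident, items in items_by_id.items() if len(items) > 1}
    return several_ids, shared_ids


# Placeholder data for illustration.
claims = [
    ("item:A", "id-1"), ("item:A", "id-2"),
    ("item:B", "id-3"), ("item:C", "id-3"),
    ("item:D", "id-4"),
]
several_ids, shared_ids = check_identifier_constraints(claims)
```

An item such as the one with 5 LCNAF ids would show up in the first result, while records to merge would show up in the second.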

25 How Wikidata will be used by Europeana
• Semantic Enrichment of Europeana data with additional information
  - With a specific focus on entities such as persons and concepts
• Linking Europeana objects with Wikidata
  - Approach similar to https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings
  - But would be extended to the whole Europeana dataset
  - Links would be added in the Europeana data
• Structure (data template) for CH objects (e.g. paintings) still not very rich on Wikidata, e.g. Measurements not there
  - Improvements are made all the time, but see next

26 Wikidata Items as Linking Hubs
• Still, they're great as stable URLs
• Providing the basic info (who, when, where, what)
• And acting as coreferencing hubs
• I don't expect Wikidata CH objects to ever be described in the full richness & complexity of professional art research. E.g. see British Museum Mapping to CIDOC CRM

27 Wikidata and DBpedia
• Wikidata and DBpedia are the two structured representations of Wikipedia
• Wikidata: initially populated from Wikipedia, manually curated, will master structured data for Wikipedia. Synchronized through an assortment of bots
• Data is fairly accurate but data depth is still small
• DBpedia: automatically extracted from Wikipedia, live update, one-way extraction only.
• Data reach is deep, but there are many problems in ontology and individual mappings, especially for non-English. E.g. United Nations is extracted as "Country". See DBpedia Ontology and Mapping Problems.
Should they be together?

28 GLAMs should add to Wikipedia or Wikidata!
• EFD project. Swiecenie Koszyczek, "blessing of the baskets", a colorful Polish tradition
• There's no article in pl.wikipedia.org, so we can't relate such artifacts to anything
• Content partner's museum staff have no time to make a proper Wikipedia article
• But adding a Wikidata item is quick & easy
• Appropriate categories (Easter Traditions, Easter-related Foods) will put it in context

29 Thank you
Valentine Charles, valentine.charles@europeana.eu
Vladimir Alexiev, vladimir.alexiev@ontotext.com
Hugo Manguinhas, hugo.manguinhas@europeana.eu

