Jürgen Umbrich Invited talk at eXascale Infolab, Fribourg, June 2016

1 Jürgen Umbrich Invited talk at eXascale Infolab, Fribourg, June 2016
Multi-level semantic labelling of numerical values (ISWC 2016 submission) Jürgen Umbrich Invited talk at eXascale Infolab, Fribourg, June 2016

2 ABOUT MYSELF 2012: PhD A Hybrid Framework For Querying Linked Data Dynamically NUI Galway, Ireland (Prof. Decker, Prof. Polleres) : FUJITSU KI2NA: Exploiting the potential of Linked Data in enterprises Current: PostDoc at Vienna University of Economics and Business Linked Traversal querying Semantic Web Search Engine Dynamic Linked Data Observatory SPARQLES Endpoint monitoring Polyglot backends combined with LD principles OpenData Quality assessment Evolution monitoring Data freshness

3 INSTITUTE FOR INFORMATION BUSINESS
Vienna University of Economics and Business

4 OPEN DATA PORTAL WATCH Best Paper @OBD 2015 Scalable quality assessment & monitoring framework for 261+ Open Data Portals Weekly snapshots quality assessment 6 dimensions, 19 metrics

5 ADEQUATE (FFG PROJECT)
Quality Improvement of Open Data

6 GRAPHSENSE (FFG PROJECT) INSIGHT INTO DIGITAL CURRENCIES
Exploration and pattern detection in BitCoin

7 HDT: A LIGHTWEIGHT BINARY FORMAT FOR RDF
Highly compact serialization of RDF. Compact RDF store (without prior decompression). Includes internal indexes to solve basic queries once it is loaded in main memory. Very fast on basic queries (triple patterns): 1.5x faster than Virtuoso and RDF-3X. C++/Java libraries, standalone application, integration within Jena and Node.js.

8 DBPEDIA WAYBACK MACHINE
Best Poster @SEMANTICS 2015 Retrieve historical versions of a DBpedia resource What was the version of “Donald Trump” on dd/mm/yyyy? How was its evolution? Re-apply DBpedia mappings on the Wikipedia revision history

9 ARCHIVING LINKED OPEN DATA
Efficient/Scalable Representations of RDF Archives Query Languages and Benchmarking BEAR: Blueprint on benchmarking archives of semantic 2016 Data: Crawl from Linked Data Observatory Basic queries: Materialize, get Version… Initial evaluation on archiving policies

10 If you can’t enforce it, contract it: Enforceability in Policy-Driven (Linked) Data Markets
Simon Steyskal, Sabrina Kirrane. Data published in LDM must be accessed and used in a manner that is compliant with access restrictions, licenses, institutional and community norms, and privacy requirements. What if compliance cannot be enforced?

11 Multi-level semantic labelling
TODAY’S TALK Multi-level semantic labelling of numerical values ISWC 2016 submission

12 MOTIVATION: OPEN DATA

13 MOTIVATION: OPEN DATA

14 MOTIVATION: OPEN DATA IMPACT
Enterprises using Open Data

15 MOTIVATION: USEFUL DATA
election results; geographical data (streets, kindergartens, trees, parks, first aid stations, …); ambient assisted living; addresses; demographic statistics; health information. Many sources are manually curated.

16 EXPLOITING THE POTENTIAL transforming CSVs to 5 Star Data
CSV URLs. CSVs link to other CSVs. CSVs link to other resources (partially). Converting to RDF allows for better/improved … search & discovery, integration, processing.
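As a rough illustration of the CSV-to-RDF step, the sketch below turns CSV rows into subject/predicate/object tuples. The `BASE` namespace, the column-to-property mapping and the function name are assumptions made for this example only; they are not taken from the talk or from any CSV2RDF tool.

```python
import csv
import io

# Hypothetical namespace and column-to-property mapping (illustration only).
BASE = "http://example.org/"
COLUMN_TO_PROPERTY = {
    "capacity": BASE + "prop/capacity",
    "city": BASE + "prop/city",
}

def csv_to_triples(text, key_column):
    """Yield one (subject, predicate, object) tuple per mapped, non-empty cell."""
    for row in csv.DictReader(io.StringIO(text)):
        # Mint a subject URI from the key column (spaces become underscores).
        subject = BASE + "resource/" + row[key_column].replace(" ", "_")
        for column, prop in COLUMN_TO_PROPERTY.items():
            value = row.get(column, "")
            if value:
                yield (subject, prop, value)

SAMPLE = (
    "stadium name,capacity,city\n"
    "Ernst Happel Stadium,50865,Vienna\n"
    "Red Bull Arena,32000,Salzburg\n"
)
triples = list(csv_to_triples(SAMPLE, "stadium name"))
```

A real conversion would also need datatypes and links to existing resources; this sketch only shows the row-to-triple shape.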

17 EXAMPLE dbp:capacity dpo:City dpo:Stadium dpo:Country stadium name
Ernst Happel Stadium, 50865, Vienna, Austria; Franz Horr Stadium, 13400; Red Bull Arena, 32000, Salzburg

18 EXISTING APPROACHES
CSV2RDF approaches semantically label columns and cell values (headers to classes/properties, values to entities) and perform ontology alignment. These approaches primarily focus on Web Tables, which are well formed, with syntactical structure (<thead>) and human-readable textual descriptions for mappings.

19 BUT Open Data tables typically contain
a large portion of numerical columns, and missing headers and/or non-textual headers. BabelNet solutions that solely focus on textual “cues” are only partially applicable for mapping such data sources.

20 OUR MISSION
Identifying the most likely semantic label for a bag of numerical values: capacity <a stadium> <country Austria>. Columns: stadium name, capacity, city, country. Rows: Ernst Happel Stadium, 50865, Vienna, Austria; Franz Horr Stadium, 13400; Red Bull Arena, 32000, Salzburg

21 OUR MISSION
Identifying the most likely semantic label for a bag of numerical values: capacity <a stadium> <country Austria>. Rows: Ernst Happel Stadium, 50865, Vienna, Austria; Franz Horr Stadium, 13400; Red Bull Arena, 32000, Salzburg

22 OUR MISSION
Identifying the most likely semantic label for a bag of numerical values: capacity <a stadium> <country Austria>. Values: 50865, 13400, 32000

23 APPROACH
Hierarchical clustering over an RDF knowledge base to build a background knowledge graph: nodes representing typical numerical representatives, annotated with context information, i.e., grouped by properties and their shared domain (subject) pairs. k-nearest neighbours search. Aggregation of the results at different levels to find the most likely context of the values, e.g.: property type context

24 EXAMPLE

25 BACKGROUND KNOWLEDGE Construction of the Type hierarchy
Represents the rdfs:subClassOf relation for all available types. Construction: (1) find properties with numerical range; (2) collect entities and all of their p-o pairs; (3) materialise the OWL class hierarchy; (4) form a cluster for each type, containing all entity information.
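The last two construction steps can be sketched as follows: materialise the superclass chain of every entity's type, then collect the entity's values in a cluster for each type on that chain. The class names, entities and values are toy data, and the single-parent, cycle-free hierarchy is a simplifying assumption of this sketch, not something the slides state.

```python
from collections import defaultdict

# Toy input (assumed for illustration): a single-parent, acyclic class hierarchy.
SUBCLASS_OF = {"Stadium": "Building", "Building": "Place"}
ENTITY_TYPE = {"stadium_a": "Stadium", "tower_b": "Building"}
ENTITY_VALUES = {"stadium_a": [50865.0], "tower_b": [120.0]}

def ancestors(cls):
    """Return cls followed by all of its superclasses, nearest first."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain

def type_clusters():
    """Map every type to the values of all entities that are instances of it."""
    clusters = defaultdict(list)
    for entity, cls in ENTITY_TYPE.items():
        # An entity counts for its own type and for every materialised supertype.
        for t in ancestors(cls):
            clusters[t].extend(ENTITY_VALUES[entity])
    return dict(clusters)

clusters = type_clusters()
```

Here `clusters["Place"]` holds the values of both entities, while `clusters["Stadium"]` holds only the stadium's value.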

26 BACKGROUND KNOWLEDGE Construction of the p-o hierarchy
Divisive hierarchical clustering approach: start with a one-node cluster. Build candidates under two constraints. Property-object constraint: all subjects share the same property-object pair. Size constraint: candidate nodes are larger than 1% of the parent node and smaller than 99%. Sort candidates by their distance; select the candidate with the largest distance, then subsequently select non-overlapping candidates.
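One splitting step of this procedure might look like the sketch below. It assumes subjects are given as sets of (property, object) pairs and that a distance score per candidate is supplied by the caller; how that distance is computed is not specified here, so the scores in the example are made up.

```python
from collections import defaultdict

def candidate_clusters(subjects, min_frac=0.01, max_frac=0.99):
    """subjects maps subject -> set of (property, object) pairs.

    Returns {pair: subjects sharing it} for pairs whose subject set is larger
    than min_frac and smaller than max_frac of the parent cluster.
    """
    by_pair = defaultdict(set)
    for subject, pairs in subjects.items():
        for pair in pairs:
            by_pair[pair].add(subject)
    n = len(subjects)
    return {
        pair: members
        for pair, members in by_pair.items()
        if min_frac * n < len(members) < max_frac * n
    }

def select_children(candidates, distance):
    """Greedy pick: largest distance first, skipping overlapping candidates."""
    chosen, covered = [], set()
    for pair in sorted(candidates, key=distance, reverse=True):
        if covered.isdisjoint(candidates[pair]):
            chosen.append(pair)
            covered |= candidates[pair]
    return chosen

# Toy parent cluster with four subjects (illustrative data).
SUBJECTS = {
    "s1": {("type", "Stadium"), ("country", "AT"), ("city", "Vienna")},
    "s2": {("type", "Stadium"), ("country", "AT"), ("city", "Vienna")},
    "s3": {("type", "Stadium"), ("country", "AT"), ("city", "Salzburg")},
    "s4": {("type", "Stadium")},
}
# Made-up distance scores per candidate.
SCORES = {("country", "AT"): 0.2, ("city", "Vienna"): 0.9, ("city", "Salzburg"): 0.5}

candidates = candidate_clusters(SUBJECTS)
children = select_children(candidates, SCORES.get)
```

The `("type", "Stadium")` pair is dropped because it covers the whole parent, and `("country", "AT")` is skipped because it overlaps the two city candidates chosen before it.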

27 K-NEAREST NEIGHBOUR SEARCH GENERAL IDEA
Mapping bags of numerical values to a vector space (feature vector)

28 K-NEAREST NEIGHBOUR SEARCH GENERAL IDEA
Mapping bags of numerical values to a vector space (feature vector)
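A minimal sketch of such a mapping, using the four summary statistics named in the evaluation setup (min, max, mean, stddev). Whether the talk uses the population or the sample standard deviation is not stated; the population variant is assumed here, and the input values are toy data.

```python
import statistics

def feature_vector(values):
    """Summarise a bag of numbers as (min, max, mean, population stddev)."""
    values = [float(v) for v in values]
    return (
        min(values),
        max(values),
        statistics.mean(values),
        statistics.pstdev(values),  # assumption: population standard deviation
    )

vector = feature_vector([2, 4, 4, 4, 5, 5, 7, 9])
```

Bags of different sizes all map to vectors of the same length, which is what makes distance-based comparison between them possible.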

29 K-NEAREST NEIGHBOUR SEARCH GENERAL IDEA
Compute & rank k-nearest neighbours for input values: 1) input: [ 187, 201, 199, 198, 195, 199, 203, … ] 2) mapping 3) compute distance to neighbours 4) select K nearest. [Figure: neighbour nodes 1-6]

30 K-NEAREST NEIGHBOUR SEARCH GENERAL IDEA
Compute & rank k-nearest neighbours for input values: 1) input: [ 187, 201, 199, 198, 195, 199, 203, … ] 2) mapping 3) compute distance to neighbours 4) select K nearest 5) rank nodes with different ranking algorithms, e.g., majority vote: by property > type, p-o pair. [Figure: neighbour nodes 1-6 with Top 1, Top 2 and Top 3 counts out of 6]
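Steps 3 to 5 can be sketched as a brute-force search followed by a vote. The node vectors, the query vector and the labels below are invented for illustration, and plain Euclidean distance is assumed; the talk's actual index structure and ranking algorithms are not shown on the slides.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(query, nodes, k):
    """nodes is a list of (feature_vector, label); return the k closest nodes."""
    return sorted(nodes, key=lambda node: euclidean(query, node[0]))[:k]

def majority_vote(neighbours):
    """Rank labels by how many of the neighbours carry them, most votes first."""
    return Counter(label for _, label in neighbours).most_common()

# Invented background nodes: (min, max, mean, stddev) with a property label.
NODES = [
    ((180.0, 205.0, 195.0, 6.0), "height"),
    ((185.0, 210.0, 198.0, 5.0), "height"),
    ((10000.0, 60000.0, 30000.0, 15000.0), "capacity"),
    ((150.0, 190.0, 170.0, 10.0), "weight"),
]
QUERY = (187.0, 203.0, 197.0, 5.0)

neighbours = k_nearest(QUERY, NODES, k=3)
ranking = majority_vote(neighbours)
```

With these toy nodes the three nearest neighbours carry two "height" labels and one "weight" label, so "height" is ranked first.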

31 RESULT AGGREGATION [Figure: neighbour nodes 1-6]

32 EVALUATION SETUP
DATA: DBpedia 3.9, 50 most frequent numerical properties. DISTANCE FUNCTIONS: Euclidean distance (min, max, mean, stddev); distribution similarity (Kolmogorov-Smirnov (KS) distance).
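For the distribution-based distance, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distribution functions of the two bags. The sketch below is a plain standard-library version written for this transcript, not the code used in the evaluation.

```python
from bisect import bisect_right

def ks_distance(a, b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all observed x."""
    a, b = sorted(a), sorted(b)
    points = set(a) | set(b)
    # bisect_right counts how many values are <= x, i.e. the empirical CDF.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )

same = ks_distance([1, 2, 3], [1, 2, 3])
disjoint = ks_distance([1, 2, 3], [10, 20, 30])
shifted = ks_distance([1, 2, 3, 4], [3, 4, 5, 6])
```

Unlike the Euclidean distance over summary statistics, this compares the full shape of the two value distributions and always lies between 0 and 1.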

33 EVALUATION SETUP
AGGREGATION FUNCTION: majority vote and average distance. AGGREGATION LEVELS: property, exact type, root type, all types, p-o level. 30 GB RAM, 3 different knowledge bases.
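The two aggregation functions can be compared on a toy neighbour list as sketched below. The record layout (`distance`, `property`, `type`), the labels and the tie-breaking by label name are assumptions of this example; a smaller average distance is taken to mean a better label.

```python
from collections import defaultdict

def aggregate(neighbours, level, how="majority"):
    """Rank the labels found at `level` among the neighbours, best first.

    how="majority": more votes is better.
    how="average":  smaller mean distance is better.
    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    groups = defaultdict(list)
    for n in neighbours:
        groups[n[level]].append(n["distance"])
    if how == "majority":
        key = lambda label: (-len(groups[label]), label)
    else:
        key = lambda label: (sum(groups[label]) / len(groups[label]), label)
    return sorted(groups, key=key)

# Invented neighbours of one query bag.
NEIGHBOURS = [
    {"distance": 0.1, "property": "height", "type": "Athlete"},
    {"distance": 0.4, "property": "height", "type": "Person"},
    {"distance": 0.5, "property": "height", "type": "Person"},
    {"distance": 0.2, "property": "elevation", "type": "Mountain"},
]

by_votes = aggregate(NEIGHBOURS, "property", "majority")
by_distance = aggregate(NEIGHBOURS, "property", "average")
types_by_votes = aggregate(NEIGHBOURS, "type", "majority")
```

On this data the two functions disagree at the property level: majority vote prefers "height" (three votes), while average distance prefers "elevation" (one close neighbour).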

34 EVALUATION: TEST/TRAIN DATA
Train/test split: 80/20, with 20% of the subjects for each property as test data. Test context graph: built similarly to the background knowledge, however without constraints; leaf nodes are randomly selected.
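A per-property 80/20 split could be sketched like this. The seed, the rounding rule and the subject identifiers are assumptions; the slides only state the ratio and that the split is made per property.

```python
import random

def split_subjects(subjects_by_property, test_fraction=0.2, seed=0):
    """Hold out a fraction of the subjects of every property as test data."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = {}, {}
    for prop, subjects in subjects_by_property.items():
        shuffled = sorted(subjects)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * test_fraction)
        test[prop] = shuffled[:cut]
        train[prop] = shuffled[cut:]
    return train, test

# Hypothetical subjects per property.
DATA = {
    "capacity": ["s%d" % i for i in range(10)],
    "height": ["p%d" % i for i in range(5)],
}
train, test = split_subjects(DATA)
```

Splitting by subject rather than by individual value keeps all values of one entity on the same side of the split.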

35 EVALUATION: DISTANCE FUNCTION 1787 TEST NODES
Best: Kolmogorov-Smirnov (KS) distance. exact = correct property, type and p-o; prop = correct property; type = correct type; stype = correct super type.
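The four measures can be read as per-prediction checks, as in the sketch below. The record layout and the reading of "stype" (the predicted type equals the true type or one of its super types) are interpretations made for this example, not definitions from the talk.

```python
def score(predicted, truth, supertypes):
    """Return the four correctness flags for one labelled test node.

    predicted and truth are dicts with 'property', 'type' and 'po' keys;
    supertypes maps a type to the set of its super types.
    """
    prop = predicted["property"] == truth["property"]
    type_ok = predicted["type"] == truth["type"]
    # Assumed reading: a prediction at a super type of the true type counts.
    stype = type_ok or predicted["type"] in supertypes.get(truth["type"], set())
    exact = prop and type_ok and predicted["po"] == truth["po"]
    return {"exact": exact, "prop": prop, "type": type_ok, "stype": stype}

# Hypothetical hierarchy and labels.
SUPERTYPES = {"Stadium": {"Building", "Place"}}
TRUTH = {"property": "capacity", "type": "Stadium", "po": ("country", "Austria")}
GUESS = {"property": "capacity", "type": "Building", "po": None}

flags = score(GUESS, TRUTH, SUPERTYPES)
```

In this example the guess gets the property right and lands on a super type, so only `prop` and `stype` are true.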

36 EVALUATION: LARGE-SCALE 33657 TEST NODES
9% of test nodes are contained 1-1 in the knowledge graph! Aggregation: majority and average vote with different numbers of neighbours. Majority vote is slightly better; more neighbours are also better.

37 EVALUATION: OPEN DATA TABLES 1170 TABLES
Labelling numerical columns: manual inspection of the top 100 tables (based on distance). Findings: Dealing with timeline data: values for different time points -> not in DBpedia. Missing domain knowledge: reports about spending, election results, tourism. Aggregation of column scores: especially for type detection (majority vote over column types). Combine with complementary approaches.

38 CONCLUSION juergen.umbrich@wu.ac.at
Semantic labelling of numerical values: k-nearest neighbour search; hierarchical unsupervised background knowledge (BK). Can assign fine-grained semantic labels if there is enough evidence in BK: 99.5% correct properties, 96.3% correct parent types. Future work: find and integrate more background knowledge; inspect Open Data Tables in more detail.

39 semantic labelling of numerical values
k-nearest neighbour search; hierarchical unsupervised background knowledge (BK). We can assign fine-grained semantic labels if there is enough evidence in BK: 99.5% correct properties, 96.3% correct parent types. Future work: more background knowledge; Open Data Tables.

40 DCAT Quality Dimensions & Metrics (1/3)

41 DCAT Quality Dimensions & Metrics (2/3)

42 DCAT Quality Dimensions & Metrics (3/3)

