Jürgen Umbrich Invited talk at eXascale Infolab, Fribourg, June 2016

Multi-level semantic labelling of numerical values (ISWC 2016 submission)

ABOUT MYSELF
2012: PhD, "A Hybrid Framework For Querying Linked Data Dynamically", NUI Galway, Ireland (Prof. Decker, Prof. Polleres)
FUJITSU KI2NA: Exploiting the potential of Linked Data in enterprises
Current: PostDoc at Vienna University of Economics and Business
Link traversal querying
Semantic Web Search Engine
Dynamic Linked Data Observatory
SPARQLES endpoint monitoring
Polyglot backends combined with LD principles
Open Data: quality assessment, evolution monitoring, data freshness

INSTITUTE FOR INFORMATION BUSINESS
Vienna University of Economics and Business

OPEN DATA PORTAL WATCH
Best Paper @OBD 2015
Scalable quality assessment & monitoring framework for 261+ Open Data Portals
Weekly snapshots
Quality assessment: 6 dimensions, 19 metrics

ADEQUATE (FFG PROJECT)
Quality Improvement of Open Data

GRAPHSENSE (FFG PROJECT) INSIGHT INTO DIGITAL CURRENCIES
Exploration and pattern detection in BitCoin

HDT: A LIGHTWEIGHT BINARY FORMAT FOR RDF
Highly compact serialization of RDF
Compact RDF store (without prior decompression)
Includes internal indexes to solve basic queries once it is loaded in main memory
Very fast on basic queries (triple patterns): 1.5x faster than Virtuoso, RDF3x
C++/Java libraries, standalone application, integration within Jena and Node.js

DBPEDIA WAYBACK MACHINE
Best Poster @SEMANTICS 2015
Retrieve historical versions of a DBpedia resource
What was the version of “Donald Trump” on dd/mm/yyyy? How did it evolve?
Re-apply DBpedia mappings on the Wikipedia revision history

ARCHIVING LINKED OPEN DATA
Efficient/Scalable Representations of RDF Archives
Query Languages and Benchmarking
BEAR: Blueprint on benchmarking archives of semantic data, 2016
Crawl from Linked Data Observatory
Basic queries: Materialize, get Version…
Initial evaluation on archiving policies

If you can’t enforce it, contract it: Enforceability in Policy-Driven (Linked) Data Markets
Simon Steyskal, Sabrina Kirrane
Data published in LDM must be accessed and used in a manner that is compliant with access restrictions, licenses, institutional and community norms, and privacy requirements.
What if compliance cannot be enforced?

TODAY’S TALK
Multi-level semantic labelling of numerical values (ISWC 2016 submission)

MOTIVATION: OPEN DATA

MOTIVATION: OPEN DATA IMPACT
Enterprises using Open Data

MOTIVATION: USEFUL DATA
election results
geographical data: streets, kindergarten, trees, parks, first aid stations, …
ambient assisted living
addresses
demographic statistics
health information
many sources are manually curated

EXPLOITING THE POTENTIAL
Transforming CSVs to 5 Star Data
CSV URLs
CSVs link to other CSVs
CSVs link to other resources (partially)
Converting to RDF allows for better/improved …
  search & discovery
  integration
  processing

EXAMPLE
dpo:Stadium             dbp:capacity   dpo:City   dpo:Country
stadium name
Ernst Happel Stadium    50865          Vienna     Austria
Franz Horr Stadium      13400
Red Bull Arena          32000          Salzburg
…

EXISTING APPROACHES
CSV2RDF approaches primarily focus on semantically labelling columns and cell values (headers to classes/properties, values to entities) and on performing ontology alignment.
These approaches primarily target Web Tables, which are well formed and offer syntactic structure (<thead>) and human-readable textual descriptions for mappings.

BUT
Open Data tables typically contain
  a large portion of numerical columns
  missing headers and/or non-textual headers
Solutions that solely focus on textual “cues”, such as BabelNet-based ones, are only partially applicable for mapping such data sources.

OUR MISSION
Identifying the most likely semantic label for a bag of numerical values
capacity <a stadium> <country Austria>
stadium name            capacity   city       country
Ernst Happel Stadium    50865      Vienna     Austria
Franz Horr Stadium      13400
Red Bull Arena          32000      Salzburg
…

OUR MISSION
Identifying the most likely semantic label for a bag of numerical values
capacity <a stadium> <country Austria>
Ernst Happel Stadium    50865      Vienna     Austria
Franz Horr Stadium      13400
Red Bull Arena          32000      Salzburg
…

OUR MISSION
Identifying the most likely semantic label for a bag of numerical values
capacity <a stadium> <country Austria>
50865
13400
32000
…
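The last step of the slides above, reducing a table to a bare bag of numbers, can be sketched as follows. This is only an illustration: the function name `numeric_columns` and the 80% threshold `min_ratio` are assumptions, not part of the talk, and the sample rows are the three stadium rows from the example.

```python
import csv
import io

def numeric_columns(csv_text, min_ratio=0.8):
    """Return {column_index: [float, ...]} for columns that are mostly numeric.

    min_ratio is a hypothetical threshold: the share of non-empty cells
    in a column that must parse as numbers.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return {}
    width = max(len(row) for row in rows)
    bags = {}
    for col in range(width):
        # Ignore empty cells; headers are not assumed to exist.
        cells = [row[col].strip() for row in rows
                 if col < len(row) and row[col].strip()]
        values = []
        for cell in cells:
            try:
                values.append(float(cell))
            except ValueError:
                pass  # textual cell
        if cells and len(values) / len(cells) >= min_ratio:
            bags[col] = values
    return bags

SAMPLE = """Ernst Happel Stadium,50865,Vienna,Austria
Franz Horr Stadium,13400,,
Red Bull Arena,32000,Salzburg,
"""
BAGS = numeric_columns(SAMPLE)
```

Only the second column survives as a bag of values; the labelling approach then works on that bag without any header or textual cue.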

APPROACH
Hierarchical clustering over an RDF knowledge base to build a background knowledge graph
  nodes represent typical numerical representatives, annotated with context information, i.e., grouped by properties and their shared domain (subject) pairs
k-nearest neighbours search
Aggregation of the results at different levels to find the most likely context of the values, e.g.:
  property
  type
  context

EXAMPLE

BACKGROUND KNOWLEDGE
Construction of the type hierarchy
Represents the rdfs:subClassOf relation for all available types
Construction:
  Find properties with numerical range
  Collect entities and all of their p-o pairs
  Materialise the OWL class hierarchy
  Form a cluster for each type, containing all entity information
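A minimal sketch of these construction steps over an in-memory list of triples. The function name, the input shapes, and the `ex:` identifiers are assumptions made for illustration; the talk does not show the actual implementation, and a real knowledge base would be read from an RDF store.

```python
from collections import defaultdict

def build_type_clusters(triples, subclass_of):
    """Group numerical values by (type, property), including super types.

    triples: iterable of (subject, predicate, object) tuples.
    subclass_of: dict mapping a class to its direct parent class
                 (a simplified stand-in for rdfs:subClassOf).
    """
    types = defaultdict(set)      # subject -> asserted types
    numeric = defaultdict(list)   # subject -> [(property, value)]
    for s, p, o in triples:
        if p == "rdf:type":
            types[s].add(o)
        elif isinstance(o, (int, float)) and not isinstance(o, bool):
            numeric[s].append((p, float(o)))

    clusters = defaultdict(list)  # (type, property) -> values
    for s, pairs in numeric.items():
        all_types = set()
        for t in types[s]:
            # Materialise the hierarchy: walk up to the root type.
            while t is not None and t not in all_types:
                all_types.add(t)
                t = subclass_of.get(t)
        for t in all_types:
            for p, v in pairs:
                clusters[(t, p)].append(v)
    return dict(clusters)

TRIPLES = [
    ("ex:happel", "rdf:type", "dpo:Stadium"),
    ("ex:happel", "dbp:capacity", 50865),
    ("ex:happel", "ex:location", "ex:Vienna"),
    ("ex:horr", "rdf:type", "dpo:Stadium"),
    ("ex:horr", "dbp:capacity", 13400),
]
SUBCLASS_OF = {"dpo:Stadium": "ex:Venue", "ex:Venue": "owl:Thing"}
CLUSTERS = build_type_clusters(TRIPLES, SUBCLASS_OF)
```

Each stadium's capacity ends up in the cluster of its asserted type and of every super type, which is what allows labels to be aggregated at coarser type levels later.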

BACKGROUND KNOWLEDGE
Construction of the p-o hierarchy
Divisive hierarchical clustering approach
Start with a one-node cluster
Build candidates:
  property-object constraint: all subjects share the same property-object pair
  size constraint: candidate nodes are larger than 1% and smaller than 99% of the parent node
Sort candidates by their distance
Select the candidate with the largest distance, then select further non-overlapping candidates
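One divisive step could look roughly like the sketch below. Two things are assumptions here: that "distance" means the distance between a candidate's values and the parent node's values, and the toy `mean_gap` distance used in the example. The data layout and names are hypothetical as well.

```python
from collections import defaultdict

def split_node(parent, distance, min_frac=0.01, max_frac=0.99):
    """One divisive step: pick non-overlapping child nodes of `parent`.

    parent: dict subject -> {"po": set of (property, object), "value": float}
    distance: callable(candidate_values, parent_values) -> float (assumed).
    Returns a list of ((property, object), subjects) pairs.
    """
    by_po = defaultdict(set)
    for subject, info in parent.items():
        for po in info["po"]:
            by_po[po].add(subject)

    n = len(parent)
    parent_values = [info["value"] for info in parent.values()]
    candidates = []
    for po, subjects in by_po.items():
        # Size constraint: more than 1% and less than 99% of the parent.
        if min_frac * n < len(subjects) < max_frac * n:
            values = [parent[s]["value"] for s in subjects]
            candidates.append((distance(values, parent_values), po, subjects))

    # Largest distance first, then greedily keep non-overlapping candidates.
    candidates.sort(key=lambda c: c[0], reverse=True)
    children, used = [], set()
    for _, po, subjects in candidates:
        if used.isdisjoint(subjects):
            children.append((po, subjects))
            used |= subjects
    return children

def mean_gap(a, b):
    """Toy distance for the example: absolute difference of the means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))

PARENT = {
    "s1": {"po": {("country", "AT"), ("sport", "football")}, "value": 50000.0},
    "s2": {"po": {("country", "AT"), ("sport", "football")}, "value": 13000.0},
    "s3": {"po": {("country", "DE"), ("sport", "football")}, "value": 75000.0},
    "s4": {"po": {("country", "DE"), ("sport", "hockey")}, "value": 5000.0},
}
CHILDREN = split_node(PARENT, mean_gap)
```

In this toy input the hockey and football groups are selected first, and the two country groups are dropped because they overlap with subjects that are already assigned.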

K-NEAREST NEIGHBOUR SEARCH: GENERAL IDEA
Mapping bags of numerical values to a vector space (feature vector)
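As a rough illustration of this mapping, the sketch below uses the four summary statistics named in the evaluation setup (min, max, mean, stddev) as the feature vector and compares vectors with the Euclidean distance. Using the population standard deviation and applying no feature scaling are assumptions of this sketch.

```python
import math
import statistics

def feature_vector(values):
    """Map a bag of numbers to (min, max, mean, stddev)."""
    return (
        min(values),
        max(values),
        statistics.mean(values),
        statistics.pstdev(values),  # population stddev (an assumption)
    )

def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

FV = feature_vector([1, 2, 3, 4])
```

Two bags are then considered close when their feature vectors are close, regardless of how many values each bag contains.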

Compute & rank k-nearest neighbours for input values
1) input: [ 187, 201, 199, 198, 195, 199, 203, … ]
2) mapping
3) compute distance to neighbours
4) select K nearest
5) rank nodes
   different ranking algorithms, e.g., majority vote: by property > type, p-o pair
[Figure: six neighbour nodes, numbered 1 to 6, with top-1, top-2 and top-3 results out of 6]
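The steps above can be sketched with a brute-force search. Everything concrete in this example is an assumption for illustration: the node labels and their values are invented, the query uses only the seven values visible on the slide, and `mean_gap` is a toy distance standing in for the distance functions evaluated later.

```python
from collections import Counter

def mean_gap(a, b):
    """Toy distance (an assumption): absolute difference of the means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))

def k_nearest(query, nodes, distance, k):
    """Return the k closest (distance, label) pairs, closest first.

    nodes: list of (label, values); label is a (property, type) tuple.
    """
    scored = [(distance(query, values), label) for label, values in nodes]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]

def majority_vote(neighbours, level):
    """Count neighbour labels at one level (0 = property, 1 = type)."""
    counts = Counter(label[level] for _, label in neighbours)
    return counts.most_common()

# Invented background nodes, for illustration only.
NODES = [
    (("height", "BasketballPlayer"), [198, 201, 205]),
    (("height", "SoccerPlayer"), [180, 183, 178]),
    (("capacity", "Stadium"), [50865, 13400, 32000]),
    (("height", "BasketballPlayer"), [196, 199, 203]),
]
QUERY = [187, 201, 199, 198, 195, 199, 203]

NEAREST = k_nearest(QUERY, NODES, mean_gap, k=3)
BY_PROPERTY = majority_vote(NEAREST, level=0)
BY_TYPE = majority_vote(NEAREST, level=1)
```

With k = 3 all neighbours agree on the property, while the type vote is split two to one, which shows why a label can be more certain at the property level than at the type level.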

RESULT AGGREGATION
[Figure: neighbour nodes numbered 1 to 6]

EVALUATION SETUP
DATA
DBpedia 3.9
50 most frequent numerical properties
DISTANCE FUNCTIONS
Euclidean distance (min, max, mean, stddev)
Distribution similarity (Kolmogorov-Smirnov (KS) distance)
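For reference, the two-sample KS distance is the largest absolute gap between the empirical cumulative distribution functions of the two bags. A minimal pure-Python sketch (the function name is an assumption; a library routine such as SciPy's two-sample KS test would normally be used instead):

```python
import bisect

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(a), sorted(b)
    best = 0.0
    # The maximum gap between two step functions occurs at a data point.
    for x in a + b:
        f_a = bisect.bisect_right(a, x) / len(a)
        f_b = bisect.bisect_right(b, x) / len(b)
        best = max(best, abs(f_a - f_b))
    return best

SAME = ks_distance([1, 2, 3], [1, 2, 3])
DISJOINT = ks_distance([1, 2, 3], [10, 11, 12])
SHIFTED = ks_distance([1, 2, 3, 4], [3, 4, 5, 6])
```

Unlike the summary-statistics vector, this distance compares the full shape of the two distributions and always lies between 0 (identical) and 1 (no overlap).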

EVALUATION SETUP
AGGREGATION FUNCTION
Majority vote and average distance
AGGREGATION LEVELS
property
exact type
root type
all types
p-o level
3 different knowledge bases
30 GB RAM
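The average-distance aggregation can be sketched as below: neighbours are grouped by their label at one level, and labels are ranked by the mean distance of their group. The dictionary layout of a neighbour's context and the example distances are assumptions for illustration.

```python
from collections import defaultdict

def average_distance_rank(neighbours, level):
    """Rank labels at one aggregation level by mean neighbour distance.

    neighbours: list of (distance, context), context being a dict such as
                {"property": ..., "type": ...} (hypothetical layout).
    level: key of the context to aggregate on.
    Returns (mean_distance, label) pairs, best (smallest) first.
    """
    groups = defaultdict(list)
    for dist, context in neighbours:
        groups[context[level]].append(dist)
    return sorted((sum(d) / len(d), label) for label, d in groups.items())

# Invented neighbours, for illustration only.
NEIGHBOURS = [
    (1.0, {"property": "height", "type": "BasketballPlayer"}),
    (2.0, {"property": "height", "type": "SoccerPlayer"}),
    (3.0, {"property": "height", "type": "BasketballPlayer"}),
    (2.5, {"property": "weight", "type": "SoccerPlayer"}),
]
BY_PROPERTY = average_distance_rank(NEIGHBOURS, "property")
BY_TYPE = average_distance_rank(NEIGHBOURS, "type")
```

A majority vote only counts how often a label occurs among the neighbours, whereas this variant also takes into account how close those neighbours are.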

EVALUATION: TEST/TRAIN DATA
Train/test split: 80/20
20% of the subjects for each property as test data
Test context graph: built like the background knowledge graph, but without the constraints
Randomly select leaf nodes
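A per-property 80/20 split of subjects might be done as in this sketch. The function name, the fixed seed, and the rounding of the test size are assumptions; the talk only states the ratio and that the split is made per property.

```python
import random

def split_subjects(subjects_by_property, test_frac=0.2, seed=0):
    """Hold out a fraction of the subjects of each property as test data."""
    rng = random.Random(seed)  # fixed seed for a repeatable split
    train, test = {}, {}
    for prop, subjects in subjects_by_property.items():
        shuffled = sorted(subjects)  # stable order before shuffling
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * test_frac)
        test[prop] = shuffled[:cut]
        train[prop] = shuffled[cut:]
    return train, test

SUBJECTS = {"dbp:capacity": ["s%d" % i for i in range(10)]}
TRAIN, TEST = split_subjects(SUBJECTS)
```

Splitting per property keeps every property represented in both the background knowledge and the test data.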

EVALUATION: DISTANCE FUNCTION (1787 TEST NODES)
Best: Kolmogorov-Smirnov (KS) distance
exact = correct property, type and p-o
prop = correct property
type = correct type
stype = correct super type

EVALUATION: LARGE-SCALE (33657 TEST NODES)
9% of test nodes are contained 1-1 in the knowledge graph!
Aggregation: majority and average vote, with different numbers of neighbours
Majority vote is slightly better
More neighbours are also better

EVALUATION: OPEN DATA TABLES (1170 TABLES)
Labelling numerical columns
Manual inspection of top 100 tables (based on distance)
Findings:
  Dealing with timeline data: values for different time points are not in DBpedia
  Missing domain knowledge: reports about spending, election results, tourism
  Aggregation of column scores, especially for type detection (majority vote over column types)
  Combine with complementary approaches

CONCLUSION
Semantic labelling of numerical values
  k-nearest neighbour search
  hierarchical, unsupervised background knowledge (BK)
We can assign fine-grained semantic labels if there is enough evidence in BK
  99.5% correct properties
  96.3% correct parent types
Future work:
  find and integrate more background knowledge
  inspect Open Data Tables in more detail
Contact: juergen.umbrich@wu.ac.at


DCAT Quality Dimensions & Metrics (1/3)

DCAT Quality Dimensions & Metrics (2/3)
