Journal of Web Semantics 55 (2019) Characterising Dataset Search – an Analysis of Search Logs and Data Requests
Introduction
Background of Dataset Search: Metadata has to be generated property by property, which is a cost for data publishers. Current open data portal solutions base their metadata search on indexing free-text descriptions of datasets and applying document modelling and search techniques.
Goal: To advance the understanding of which properties of a dataset description matter most to data consumers, by analysing how people search for data on current portals.
Expected benefits: reduced time and effort, and advanced search functionalities.
Contribution: A systematic study of the patterns and specific attributes that data consumers use to search for data, and of how dataset search compares with general web search.
Related Work
Web Search: general web search; vertical search (e.g. e-mail search, people search on Facebook).
Dataset Search: a relatively unexplored area compared to document search. Many portals are based on CKAN, which uses Solr, which builds on Lucene, which ranks with TF-IDF. In structured documents, however, the main topic or the key concepts might be mentioned only once, so term frequency can be a weak relevance signal.
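The concern about TF-IDF can be illustrated with a minimal sketch. It uses a textbook TF-IDF variant (relative term frequency times a smoothed inverse document frequency), which is an assumption for illustration and not the exact scoring formula of Lucene or Solr. The two descriptions are hypothetical: the first imitates a structured dataset description that names its topic once, the second a prose document that repeats it.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Textbook variant: relative term frequency x smoothed IDF.
    This is an illustrative assumption, not Lucene's scoring formula.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return weights


# Hypothetical descriptions, invented for this example.
docs = [
    # Structured dataset description: the topic appears once.
    "unemployment csv 2015 2016 2017 region north south east west quarterly",
    # Prose document: the topic is repeated.
    "report on unemployment trends unemployment rates and unemployment policy",
]
weights = tf_idf(docs)
structured_score = weights[0]["unemployment"]
prose_score = weights[1]["unemployment"]
print(structured_score < prose_score)
```

Under these assumptions the dataset description receives a lower weight for its own main topic than the prose document does, even though the dataset may be the better answer to a query for unemployment data.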
Related Work
Metrics for general web search query log analysis: query length and distribution; query type classification; user and session statistics; query structure; topics.
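As a rough illustration of the first metric, query length and its distribution can be computed from a list of query strings as follows. This is a minimal sketch that assumes length is measured in whitespace-separated words; the function name and the sample queries are hypothetical and are not taken from the analysed logs.

```python
from collections import Counter
from statistics import mean, median


def query_length_stats(queries):
    """Summarise query length in words, ignoring empty queries."""
    lengths = [len(q.split()) for q in queries if q.strip()]
    return {
        "mean_words": mean(lengths),
        "median_words": median(lengths),
        # Maps a length in words to the number of queries of that length.
        "distribution": dict(sorted(Counter(lengths).items())),
    }


# Hypothetical queries, invented for this example.
sample_queries = [
    "population",
    "road accidents 2015",
    "school performance",
    "crime statistics london",
]
stats = query_length_stats(sample_queries)
print(stats)
```

The same function could be applied separately to internal and external queries to compare the two distributions.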
Methodology
Three types of data were used in the experiments:
Internal search logs: Queries issued directly to the internal search capability of a data portal through its search box.
External search logs: Queries issued through a general web search engine that lead to a page of the data portal.
Data requests: Representations of information needs, submitted by users of a data portal in order to obtain a specific dataset that they usually could not find.
Findings - Users
Location
Devices: The overwhelming majority of devices are desktop computers (85% on average).
Time of access: Users are mostly active on weekdays; weekend activity is approximately half or a third of weekday activity.
Channels: The majority of users reach portals through the result page of a web search engine.
Findings - Users
Browsers: The share of IE users is almost 10% higher than in general web browser usage.
New and returning users: Returning users view more pages on average and engage in longer sessions.
Search exits and refinements: Much higher than those of another government website.
Findings – Internal & External Queries Query length Internal queries External queries
Findings – Internal & External Queries Query types
Findings – Internal & External Queries Query topics
Findings – Data Requests
Data Attributes: geospatial information (77.5%); temporal information (44%); restriction (26.5%); granularity (24.5%).
Findings – Data Requests
Request Context: representation and structure; expected outcome; rationale; quality.
Conclusion
Dataset queries are generally short.
Dataset search seems to occur mostly in a work-related environment.
Dataset queries issued directly to data portals differ in topic, length and structure from dataset queries issued to web search engines.
Data requests describe the data through boundaries and restrictions on location, temporality, specific data type and/or specific granularity.
The highest-priority properties for describing datasets are temporal and geospatial coverage, at varying levels of granularity.