Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis
OC Working Group – Serge Tymaniuk
© Copyright 2008 STI INNSBRUCK
Overview
–Introduction
–Methodology
–Results
–Questions
Introduction
Paper written by Christian Bizer (1), Kai Eckert (1), Robert Meusel (1), Hannes Mühleisen (2), Michael Schuhmacher (1), and Johanna Völker (1)
–(1) Data and Web Science Group, University of Mannheim, Germany
–(2) Database Architectures Group, Centrum Wiskunde & Informatica, Netherlands
Features:
–Analysis of RDFa, Microdata, and Microformats adoption on the Web
–Based on a large public Web crawl of 3 billion HTML pages
–Aims at revealing the main topical areas of the published data and the different vocabularies used within each topical area
–Examines structural richness (which properties are used to describe popular types of entities)
Web Crawl
–Web crawl provided by the Common Crawl foundation, available as ARC files from Amazon S3
–3,005,626,093 unique HTML pages from 40.6 million pay-level domains
–Crawling conducted between January and June 2012
–Compressed size of the corpus is 48 TB
–The crawler relies on the PageRank algorithm to decide which pages to retrieve
Data Extraction Process
–Parsing framework is executed on Amazon EC2
–Relies on the Anything To Triples (Any23) parsing library from Apache: http://any23.apache.org/
–The RapidMiner data mining framework is used for vocabulary term co-occurrence analyses
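To make the extraction step more concrete, here is a minimal sketch of how a page could be checked for each of the three formats. It is not Any23 and not the authors' pipeline: it only looks for telltale attributes and class names using Python's standard `html.parser`. The attribute and class sets, `FormatDetector`, and `detect_formats` are illustrative assumptions, and the Microformats list is not exhaustive.

```python
from html.parser import HTMLParser

# Assumed marker sets, chosen for illustration only.
RDFA_ATTRS = {"property", "typeof", "about", "vocab"}
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
# Root class names of a few classic microformats (not exhaustive).
MICROFORMAT_CLASSES = {"vcard", "vevent", "hreview", "hrecipe", "hproduct"}


class FormatDetector(HTMLParser):
    """Collects the markup formats hinted at by a page's start tags."""

    def __init__(self):
        super().__init__()
        self.formats = set()

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        if names & RDFA_ATTRS:
            self.formats.add("RDFa")
        if names & MICRODATA_ATTRS:
            self.formats.add("Microdata")
        # A valueless or missing class attribute yields None, hence the "or".
        classes = set((dict(attrs).get("class") or "").split())
        if classes & MICROFORMAT_CLASSES:
            self.formats.add("Microformats")


def detect_formats(html):
    """Return the set of formats that appear to be used in an HTML string."""
    detector = FormatDetector()
    detector.feed(html)
    detector.close()
    return detector.formats


print(detect_formats('<meta property="og:title" content="Example">'))
```

A real extractor such as Any23 goes further and emits the actual triples. This heuristic only answers whether a format seems to be present on a page.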
Results: Overall picture
Structured data was discovered within 369M out of 3B pages contained in the Common Crawl corpus (12.3%), and within 2.29M out of 40.6M domains (5.64%)
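The two percentages can be checked from the figures given on the slides. The sketch below uses the exact page count from the Web Crawl slide and the rounded values 369M, 2.29M, and 40.6M. The variable names and the `share` helper are illustrative.

```python
# Figures as stated on the slides; the "M" values are rounded.
pages_total = 3_005_626_093
pages_with_data = 369_000_000
domains_total = 40_600_000
domains_with_data = 2_290_000


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100 * part / whole


print(f"pages:   {share(pages_with_data, pages_total):.1f}%")      # 12.3%
print(f"domains: {share(domains_with_data, domains_total):.2f}%")  # 5.64%
```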
Results: Deployment by FORMAT
* PLDs – Pay-Level Domains (i.e. websites)
* URLs – HTML pages
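A pay-level domain is the part of a hostname that is registered directly under a public suffix, so `www.bbc.co.uk` and `news.bbc.co.uk` both count as the website `bbc.co.uk`. The sketch below shows one way to group URLs by PLD. The tiny `SUFFIXES` set and the function name `pay_level_domain` are assumptions for illustration; a real analysis would load the full Public Suffix List instead.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny suffix set for illustration only.
SUFFIXES = {"com", "org", "de", "uk", "co.uk"}


def pay_level_domain(url):
    """Return the registrable domain of a URL, or None if no suffix matches."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Scanning from the left finds the longest matching suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in SUFFIXES:
            if i == 0:
                return None  # the host is itself a bare public suffix
            return ".".join(labels[i - 1:])
    return None


print(pay_level_domain("http://any23.apache.org/"))  # apache.org
```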
Results: Deployment by POPULARITY
* According to Alexa Internet Inc. (AL) list of the most frequently visited websites
Results: Deployment by domains
Results: Deployment on the same Website
93.5% of all websites that contain structured data use only a single format
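A statistic of this kind can be derived by collecting, for each website, the set of formats found on its pages and counting the sites where that set has exactly one element. The following is a sketch under assumptions: the input shape (pairs of PLD and format), the function name `single_format_share`, and the toy data are illustrative and not taken from the paper.

```python
from collections import defaultdict


def single_format_share(observations):
    """Percentage of sites using exactly one format.

    observations: iterable of (pld, format) pairs, one per detected usage.
    """
    formats_by_pld = defaultdict(set)
    for pld, fmt in observations:
        formats_by_pld[pld].add(fmt)
    if not formats_by_pld:
        return 0.0
    single = sum(1 for fmts in formats_by_pld.values() if len(fmts) == 1)
    return 100 * single / len(formats_by_pld)


# Toy data: three of the four sites use a single format.
toy = [
    ("a.com", "RDFa"), ("a.com", "RDFa"),
    ("b.org", "Microdata"),
    ("c.de", "Microformats"),
    ("d.com", "RDFa"), ("d.com", "Microdata"),
]
print(single_format_share(toy))  # 75.0
```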
Results: Deployment of RDFa
–Most frequently used RDFa classes: [table]
–Alexa top 100 websites that use RDFa: IMDB, Microsoft News Portal, BBC
–Most frequently used properties co-occurring with all 4 of the most frequently used OGP classes: [table]
Results: Deployment of Microdata
–Most frequently used Microdata classes: [table]
–Alexa top 100 websites that use Microdata: eBay, Microsoft Corp., Apple Inc.
Results: Deployment of Microformats
–Most frequently used Microformats classes: [table]
–Alexa top 100 websites that use Microformats: Wikipedia, Adobe, Taobao marketplace
Results: Topical Domains
Dominant domains of the published data:
–Persons and Organizations (by all 3 formats)
–Blog- and CMS-related metadata (by RDFa and Microdata)
–Navigational metadata (by RDFa and Microdata)
–Product data (by all 3 formats)
–Event data (by Microformats)
Results: Structural Richness
Only a small set of generic properties is used to describe entities:
–Instances of the OGP class “Product” are described by title, url, site_name, and description in most cases
–Instances of the Schema.org class “Product” are described largely only by name and description
Additional extraction techniques have to be employed for deeper understanding
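Structural richness can be measured by asking, for each class, what fraction of its instances carries a given property. The sketch below does this with plain Python counters. It is not the RapidMiner workflow used in the paper, and the input shape, the function name `property_usage`, and the toy entities are assumptions for illustration.

```python
from collections import Counter, defaultdict


def property_usage(entities):
    """Map each class to {property: fraction of instances using it}.

    entities: iterable of (class_name, properties) pairs.
    """
    totals = Counter()
    counts = defaultdict(Counter)
    for cls, props in entities:
        totals[cls] += 1
        # set() ensures a property repeated on one entity is counted once.
        counts[cls].update(set(props))
    return {
        cls: {prop: n / totals[cls] for prop, n in prop_counts.items()}
        for cls, prop_counts in counts.items()
    }


# Toy data: two hypothetical "Product" instances.
toy_entities = [
    ("Product", ["name", "description"]),
    ("Product", ["name", "price"]),
]
usage = property_usage(toy_entities)
print(usage["Product"]["name"])   # 1.0
print(usage["Product"]["price"])  # 0.5
```

Properties with a fraction close to 1.0 form the small generic core described on this slide, while low fractions point to rarely used, more specific properties.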
Sources
1. Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, and Johanna Völker (2013). Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. Retrieved from: http://hannes.muehleisen.org/Bizer-etal-DeploymentRDFaMicrodataMicroformats-ISWC-InUse-2013.pdf
Thank you for your attention!
Questions?