Efficient Processing of Semantic Information on the Web
Georg Lausen, Technische Fakultät, Universität Freiburg
Processing of Semantic Information on the Web
The amount of information available on the Web is still increasing rapidly.
(Semi-)automatic data extraction.
Resource Description Framework (RDF).
SPARQL is the standard query language for RDF.
Efficiency and scalability of query processing.
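To make the RDF and SPARQL points concrete, here is a minimal sketch in plain Python, assuming a toy in-memory graph. The triples, the `ex:` identifiers, and the `match` helper are hypothetical illustrations and not part of any RDF library; a real system would use an RDF store and a SPARQL engine.

```python
# Toy RDF graph: a set of (subject, predicate, object) triples.
# All identifiers below are made up for illustration.
GRAPH = {
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name", '"Alice"'),
}


def match(pattern, graph):
    """Yield variable bindings for one triple pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated inside the pattern must bind consistently.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


# Rough analogue of: SELECT ?s ?o WHERE { ?s foaf:knows ?o }
results = sorted(
    (b["?s"], b["?o"]) for b in match(("?s", "foaf:knows", "?o"), GRAPH)
)
print(results)
```

The linear scan over all triples is exactly what does not scale, which is why efficiency and scalability of query processing is the central concern.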
Efficiency and Scalability: A Variety of Approaches
Single-machine RDF stores
Parallel database approach: Vertica and others
Approaches based on Hadoop (MapReduce paradigm)
– Hadoop
– Hadoop++
– Integration of databases: HadoopDB
– Language translation
    Mapping SPARQL to Hadoop/HBase directly
    Mapping SPARQL to Pig Latin
Non-Hadoop clusters
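As an illustration of how a SPARQL join could be evaluated in the MapReduce paradigm, the following sketch simulates a reduce-side join in plain Python for the pattern `{ ?x foaf:knows ?y . ?y foaf:knows ?z }`. It is an assumption-laden toy, not the method of any system listed above: the data and the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict

# Hypothetical input triples (subject, predicate, object).
TRIPLES = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:bob", "foaf:knows", "ex:dave"),
    ("ex:carol", "foaf:name", '"Carol"'),
]


def map_phase(triple):
    """Emit (join_key, tagged_value) pairs, keyed on the shared variable ?y."""
    s, p, o = triple
    if p == "foaf:knows":
        yield o, ("left", s)   # pattern 1: ?x foaf:knows ?y
        yield s, ("right", o)  # pattern 2: ?y foaf:knows ?z


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine every left match with every right match for one ?y."""
    lefts = [v for tag, v in values if tag == "left"]
    rights = [v for tag, v in values if tag == "right"]
    for x in lefts:
        for z in rights:
            yield x, key, z


pairs = [kv for t in TRIPLES for kv in map_phase(t)]
joined = sorted(
    row for key, values in shuffle(pairs).items() for row in reduce_phase(key, values)
)
print(joined)
```

Each additional join in a query typically costs another MapReduce job in this style, which is one reason translation to a higher-level language such as Pig Latin is attractive.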
Cluster-based Parallelism vs. Parallel Database / Single Machine RDF-Store
Each technology has its own advantages and problems.
Rough characterization:

                                                  Querying   Loading
Parallel Database / Single Machine RDF-Store         +          -
Cluster-based Parallelism                            -          +

Loading in the context of Web research: Extract-Transform-Load (ETL) schema.
SPARQL provides a declarative way of specifying both the transformation and the querying.
ETL and Querying in the Context of Web Research
[Figure: Web documents → initial RDF graph → RDF store, with stages labeled E, L, T; annotations: efficient loading, efficient querying, SPARQL]
PigSPARQL: Mapping SPARQL to Pig Latin; to appear, Semantic Web Information Management (SWIM) 2011.
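To give a feel for what mapping SPARQL to Pig Latin could look like, here is a small Python sketch that turns single triple patterns into Pig Latin `FILTER` and `FOREACH` statements and joins two of them. This is a simplified assumption and not the translation scheme of PigSPARQL itself: the helper `triple_pattern_to_pig`, the relation names, and the input file `triples.tsv` are hypothetical, and patterns that repeat a variable or contain no variable are not handled.

```python
def triple_pattern_to_pig(pattern, alias, source="triples"):
    """Translate one triple pattern into Pig Latin statements (toy version).

    Bound terms become FILTER conditions; variables become projected columns.
    Assumes at least one variable and no repeated variable in the pattern.
    """
    conditions, projections = [], []
    for column, term in zip(("s", "p", "o"), pattern):
        if term.startswith("?"):
            projections.append(f"{column} AS {term[1:]}")
        else:
            conditions.append(f"{column} == '{term}'")
    if not projections:
        raise ValueError("pattern must contain at least one variable")

    lines, current = [], source
    if conditions:
        lines.append(f"{alias}_f = FILTER {current} BY {' AND '.join(conditions)};")
        current = f"{alias}_f"
    lines.append(f"{alias} = FOREACH {current} GENERATE {', '.join(projections)};")
    return lines


# Rough analogue of: SELECT * WHERE { ?x foaf:knows ?y . ?y foaf:name ?n }
script = [
    # Hypothetical tab-separated input with one triple per line.
    "triples = LOAD 'triples.tsv' AS (s:chararray, p:chararray, o:chararray);"
]
script += triple_pattern_to_pig(("?x", "foaf:knows", "?y"), "p1")
script += triple_pattern_to_pig(("?y", "foaf:name", "?n"), "p2")
script.append("result = JOIN p1 BY y, p2 BY y;")  # join on the shared variable ?y
print("\n".join(script))
```

The generated script stays declarative at the Pig Latin level; Pig then compiles it into MapReduce jobs, so the translator does not need to manage the cluster execution itself.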