Slug: A Semantic Web Crawler
Leigh Dodds, Engineering Manager, Ingenta
Jena User Conference, May 2006

Overview
● Do we need Semantic Web Crawlers?
● Current Features
● Crawler Architecture
● Crawler Configuration
● Applications and Future Extensions

Do We Need Semantic Web Crawlers?
● Increasing availability of distributed data
  – Mirroring often the only option for large sources
● Varying application needs
  – Real-time retrieval not always necessary/desirable
● Personal metadata increasingly distributed
  – Need a means to collate data
● Compiling large, varied datasets for research
  – Triple store and query engine load testing

Introducing Slug
● Open Source multi-threaded web crawler
● Supports creation of crawler “profiles”
● Highly extensible
● Caches content in the file system or a database
● Crawls new content, or “freshens” existing data
● Generates RDF metadata for crawling activity
● Hopefully(!) easy to use, and well documented

CRAWLER ARCHITECTURE

Crawler Architecture
● Basic Java framework
  – Multi-threaded retrieval of resources via HTTP
  – Could be used to support other protocols
  – Extensible via RDF configuration file
● Simple component model
  – Content processing and task filtering components
  – Implement custom components for new behaviours
● Number of built-in behaviours
  – e.g. crawl depth limiting, URL blacklisting, etc.

Component Model
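
The component model appeared as a diagram on the original slide. As an illustration only, the two extension points could be sketched as interfaces like these (names and signatures here are hypothetical, not Slug's actual API):

  // Hypothetical sketch of the crawler's two extension points;
  // Slug's real interfaces in com.ldodds.slug may differ.
  import java.net.URL;

  /** A unit of crawling work: a URL to fetch at a given depth. */
  interface Task {
      URL getURL();
      int getDepth();
  }

  /** Vetoes or accepts newly discovered tasks before crawling. */
  interface TaskFilter {
      boolean accept(Task task);
  }

  /** Processes the response body of a completed task. */
  interface Consumer {
      void consume(Task task, byte[] responseBody);
  }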

Consumers
● Responsible for processing results of tasks
  – Support for multiple consumers per profile
● RDFConsumer
  – Parses content; updates memory with triple count
  – Discovers rdfs:seeAlso links; submits new tasks
● ResponseStorer
  – Stores retrieved content in the file system
● PersistentResponseStorer
  – Stores retrieved content in a Jena persistent model
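
As a sketch of how a custom consumer could slot in (building on the hypothetical interfaces above, not Slug's real API):

  // Hypothetical consumer that just logs the size of each response.
  class LoggingConsumer implements Consumer {
      public void consume(Task task, byte[] responseBody) {
          System.out.println(task.getURL() + " : "
                  + responseBody.length + " bytes");
      }
  }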

Task Filters
● Filters are applied before new Tasks are accepted
  – Support for multiple filters per profile
  – Task must pass all filters to be accepted
● DepthFilter
  – Rejects tasks that are beyond a certain “depth”
● RegexFilter
  – Rejects URLs that match a regular expression
● SingleFetchFilter
  – Loop avoidance; removes previously encountered URLs
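
A depth-limiting filter in the spirit of DepthFilter might look like this (again a sketch against the hypothetical interfaces above):

  // Hypothetical depth filter: the initial task is at depth 0, and a
  // task is rejected when its depth >= the configured maximum.
  class MaxDepthFilter implements TaskFilter {
      private final int maxDepth;

      MaxDepthFilter(int maxDepth) {
          this.maxDepth = maxDepth;
      }

      public boolean accept(Task task) {
          return task.getDepth() < maxDepth;
      }
  }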

CRAWLER CONFIGURATION

Scutter Profile
● A combination of configuration options
● Uses a custom RDFS vocabulary
● Current options:
  – Number of threads
  – Memory location
  – Memory type (persistent, file system)
  – Specific collection of Consumers and Filters
● Custom components may have their own configuration

Example Profile
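
The profile listing on this slide did not survive the transcript. As a rough reconstruction only (the slug: namespace and every property name here are guesses, not the verified Slug configuration vocabulary):

  <!-- Hypothetical profile: names a memory, one consumer, one filter. -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:slug="http://example.org/slug-config#">
    <slug:Scutter rdf:about="#profile">
      <slug:threads>5</slug:threads>
      <slug:hasMemory rdf:resource="#memory"/>
      <slug:hasConsumer rdf:resource="#rdfConsumer"/>
      <slug:hasFilter rdf:resource="#depthFilter"/>
    </slug:Scutter>
  </rdf:RDF>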

Example Consumer
● RDFConsumer: discovers seeAlso links in RDF models and adds them to the task list
● Implementation class: com.ldodds.slug.http.RDFConsumer
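
The RDF/XML markup for this example was stripped from the transcript; a guessed reconstruction, consistent with the hypothetical profile sketch above:

  <!-- Hypothetical reconstruction; element names are assumptions. -->
  <slug:Consumer rdf:about="#rdfConsumer">
    <rdfs:label>RDFConsumer</rdfs:label>
    <rdfs:comment>Discovers seeAlso links in RDF models and adds
      them to task list</rdfs:comment>
    <slug:impl>com.ldodds.slug.http.RDFConsumer</slug:impl>
  </slug:Consumer>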

Sample Filter
● Limit depth of crawling: com.ldodds.slug.http.DepthFilter
● Depth setting: 3 (if a task's depth >= this value the URL is not included; the initial depth is 0)
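
Again the markup was stripped; a guessed reconstruction of the filter configuration (element names are assumptions):

  <!-- Hypothetical reconstruction. If a task's depth >= this value
       the URL is not included; the initial depth is 0. -->
  <slug:Filter rdf:about="#depthFilter">
    <rdfs:label>Limit Depth of Crawling</rdfs:label>
    <slug:impl>com.ldodds.slug.http.DepthFilter</slug:impl>
    <slug:depth>3</slug:depth>
  </slug:Filter>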

Sample Memory Configuration
● Persistent memory backed by a Jena database model, identified via slug:modelURI
● Connection details: JDBC URL jdbc:mysql://localhost/DB, USER, PASSWORD, database type MySQL, driver com.mysql.jdbc.Driver
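
A guessed reconstruction of the memory configuration (only slug:modelURI survives in the transcript; the remaining property names are assumptions, and USER/PASSWORD are placeholders from the original slide):

  <!-- Hypothetical reconstruction of the memory configuration. -->
  <slug:Memory rdf:about="#memory">
    <slug:modelURI rdf:resource="jdbc:mysql://localhost/DB"/>
    <slug:user>USER</slug:user>
    <slug:password>PASSWORD</slug:password>
    <slug:dbType>MySQL</slug:dbType>
    <slug:driver>com.mysql.jdbc.Driver</slug:driver>
  </slug:Memory>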

CRAWLER MEMORY

Scutter Vocabulary
● Vocabulary for crawl-related metadata
  – Where have I been?
  – What responses did I get?
  – Where did I find a reference to this document?
● Draft specification by Morten Frederiksen
● Crawler automatically generates history
● Components can store additional metadata

Scutter Vocab Overview
● Representation
  – A “shadow resource” of a source document
  – scutter:source = URI of source document
  – scutter:origin = URIs which reference source
  – Related to zero or more Fetches (scutter:fetch)
  – scutter:latestFetch = most recent Fetch
  – May be skipped because of previous error (scutter:skip)

Scutter Vocab Overview
● Fetch
  – Describes a GET of a source document
  – HTTP headers and status
  – dc:date
  – scutter:rawTripleCount, included if parsed
  – May have caused a scutter:error and a Reason
● Reason
  – Why was there an error?
  – Why is a Representation being skipped?
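
To make the shape concrete, here is a rough Turtle sketch of the history Slug might record for one document (the scutter: namespace URI and exact property spellings are assumptions based on the slides, not checked against the draft spec):

  @prefix scutter: <http://purl.org/net/scutter/> .
  @prefix dc:      <http://purl.org/dc/elements/1.1/> .

  # A "shadow" Representation of one source document, with one Fetch.
  <urn:example:rep1>
      scutter:source <http://example.org/data.rdf> ;
      scutter:origin <http://example.org/index.rdf> ;
      scutter:fetch  <urn:example:fetch1> ;
      scutter:latestFetch <urn:example:fetch1> .

  <urn:example:fetch1>
      dc:date "2006-05-10T12:00:00Z" ;
      scutter:status "200" ;
      scutter:contentType "application/rdf+xml" ;
      scutter:rawTripleCount "1234" .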

List Crawl History for a Specific Representation

(The namespace URIs and the source-document URI were stripped from this transcript; the scutter: namespace and the example source URI below are a best-effort reconstruction.)

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX scutter: <http://purl.org/net/scutter/>

SELECT ?date ?status ?contentType ?rawTripleCount
WHERE {
  ?representation scutter:fetch ?fetch ;
                  scutter:source <http://example.org/data.rdf> .
  ?fetch dc:date ?date .
  OPTIONAL { ?fetch scutter:status ?status . }
  OPTIONAL { ?fetch scutter:contentType ?contentType . }
  OPTIONAL { ?fetch scutter:rawTripleCount ?rawTripleCount . }
}
ORDER BY DESC(?date)

WORKING WITH SLUG

Working with Slug
● Traditional crawling activities
  – e.g. adding data to a local database
● Maintaining a local cache of useful data
  – e.g. crawl data using the file system cache...
  – ...and maintain it with “-freshen”
  – Code for generating a LocationMapper configuration (example below)
● Mapping the Semantic Web?
  – Crawl history contains document relationships
  – No need to keep content, just crawl...
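
For reference, a Jena LocationMapper configuration maps a remote document URL onto a local cached copy, so applications read from the crawl cache transparently. A minimal example (the file path here is illustrative):

  @prefix lm: <http://jena.hpl.hp.com/2004/08/location-mapping#> .

  # Resolve the remote document against the local crawl cache.
  [] lm:mapping
     [ lm:name    "http://example.org/data.rdf" ;
       lm:altName "file:cache/example.org/data.rdf" ] .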

Future Enhancements
● Support the Robot Exclusion Protocol
● Allow configuration of the User-Agent header
● Implement throttling on a global and per-domain basis
● Check additional HTTP status codes to “skip” more errors
● Support white-listing of URLs
● Expose and capture more statistics while a crawl is in progress

Future Enhancements
● Support content negotiation when requesting data
● Allow pre-processing of data (GRDDL)
● Follow more than just rdfs:seeAlso links
  – Allow configurable link discovery
● Integrate a “smushing” utility
  – To better manage persistent data
● Anything else?!

Questions?
http://www.ldodds.com/projects/slug

Attribution and Licence
The following images were used in these slides. Thanks to the authors!
Licence for this presentation: Creative Commons Attribution-ShareAlike 2.5