Content Storage with Apache Jackrabbit 2009-03-25, Jukka Zitting.

Introducing JCR
Content Repository for Java™ Technology API
-> Version 1.0 defined in JSR 170, final in
-> Version 2.0 defined in JSR 283, final (hopefully) in
-> Open source RI and TCK developed in Apache Jackrabbit
Flexible, hierarchically ordered content store with features like full text search, versioning, transactions, observation, etc.
Combines and extends features often found in file systems and databases; a "best of both worlds" approach.

Introducing Apache Jackrabbit
Entered Apache Incubator in 2004, graduated in 2006, now a TLP with 24 committers
Version 1.0 (and JCR RI) in 2006, currently at
Version 2.0 (and JCR 2.0 RI) planned for
Also: JCR Commons, Sling
Components:
- jackrabbit-api
- jackrabbit-core
- jackrabbit-standalone
- jackrabbit-webapp
- jackrabbit-jca
- jackrabbit-ocm
- jackrabbit-jcr-rmi
- jackrabbit-webdav
- jackrabbit-jcr-server
- etc.

Structure of a content repository
- Consists of one or more workspaces, one of which is the default workspace.
- A workspace consists of a tree of nodes, each of which can have zero or more child nodes. Each workspace has a single root node.
- Nodes have properties. Properties are typed (string, integer, date, binary, reference, etc.) and can be either single- or multivalued.

Structure, cont.
- Each node or property has a name, and can be accessed using a path. Names are namespaced.
- A referenceable node has a UUID, through which it can be accessed or referenced. Referential integrity is guaranteed.
- Each node has a primary node type and zero or more mixin types. Node types define the structure of a node. A node can also be unstructured.
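As a rough illustration of the structure described on these two slides, the following self-contained sketch models a workspace as a tree of named nodes with single- or multivalued properties and resolves nodes by path. It deliberately does not use the javax.jcr API: the class ToyNode and all of its methods are hypothetical names invented for this example, and namespaces, node types and UUIDs are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a workspace tree. This is NOT the javax.jcr API;
// all names here are made up for illustration.
final class ToyNode {
    private final String name;
    private final ToyNode parent;
    private final Map<String, ToyNode> children = new LinkedHashMap<>();
    // Each property holds one or more values (single- or multivalued).
    private final Map<String, List<Object>> properties = new LinkedHashMap<>();

    private ToyNode(String name, ToyNode parent) {
        this.name = name;
        this.parent = parent;
    }

    // Every workspace has exactly one root node, addressed as "/".
    static ToyNode root() {
        return new ToyNode("", null);
    }

    ToyNode addNode(String childName) {
        ToyNode child = new ToyNode(childName, this);
        children.put(childName, child);
        return child;
    }

    void setProperty(String propertyName, Object... values) {
        properties.put(propertyName, new ArrayList<>(Arrays.asList(values)));
    }

    List<Object> getProperty(String propertyName) {
        return properties.get(propertyName);
    }

    // Absolute path, built by walking up to the root.
    String getPath() {
        if (parent == null) {
            return "/";
        }
        String parentPath = parent.getPath();
        return parentPath.equals("/") ? "/" + name : parentPath + "/" + name;
    }

    // Resolves a relative path such as "blog/hello"; returns null if missing.
    ToyNode getNode(String relativePath) {
        ToyNode current = this;
        for (String segment : relativePath.split("/")) {
            if (segment.isEmpty()) {
                continue;
            }
            current = current.children.get(segment);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        ToyNode root = ToyNode.root();
        ToyNode post = root.addNode("blog").addNode("hello");
        post.setProperty("title", "Hello, World!");      // single value
        post.setProperty("tags", "jcr", "jackrabbit");   // multivalued
        System.out.println(root.getNode("blog/hello").getPath());
        System.out.println(post.getProperty("tags"));
    }
}
```

In a real repository the same navigation would go through a session and the javax.jcr Node and Property interfaces; the sketch only shows the shape of the data.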

Content repository features
- Read/write
- Search (XPath and SQL)
- XML import/export
- Observation
- Versioning
- Node types
- Locking
- Access control
- Atomic changes
- XA transactions

Getting started with Jackrabbit
Download and run the standalone server:
  java -jar jackrabbit-standalone jar
-> Web interface at
-> WebDAV at
-> JCR access over RMI at
-> Repository data in ./jackrabbit
-> Configuration in ./jackrabbit/repository.xml
If you have a servlet container, use the webapp.
If you have a J2EE application server, use jca.

Remote access
- JCR-RMI layer available since Jackrabbit 0.9.
- Good functional coverage, not so good performance. -> administrative tools
- Look at clustering for better performance.
- Jackrabbit 1.6: spi2dav
o.a.j.rmi.repository:
  new URLRemoteRepository("...");
  new RMIRemoteRepository("//.../repository");
Classpath:
  jcr-1.0.jar
  jackrabbit-jcr-rmi jar
  jackrabbit-api jar (also in rmiregistry!)

Jackrabbit as a shared resource
JCA adapter in an application server
-> accessed through JNDI
Jackrabbit webapp in a servlet container
-> JNDI or servlet context
-> JNDI: complex setup
-> cross-context access
JNDI configuration with jackrabbit-servlet:
  servlet name: Repository
  servlet class: org.apache.jackrabbit.servlet.JNDIRepositoryServlet
  init parameter: location
Accessing the repository:
  public class MyServlet extends HttpServlet {
      private final Repository repository = new ServletRepository(this);
  }

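Jackrabbit in embedded mode
  import org.apache.jackrabbit.core.RepositoryImpl;
  import org.apache.jackrabbit.core.config.RepositoryConfig;
  // RepositoryConfig.create takes the configuration file first, then the repository home directory.
  RepositoryConfig config = RepositoryConfig.create(
      "/path/to/repository.xml", "/path/to/repository");
  RepositoryImpl repository = RepositoryImpl.create(config);
  try {
      // Use the javax.jcr interfaces to access the repository
  } finally {
      // Always shut down, so that the repository releases its resources
      repository.shutdown();
  }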

Embedded mode, cont.
Only a single instance can be running at a time.
For concurrent access:
- multiple sessions
- RMI for remote access
- clustering
Also: TransientRepository
Maven coordinates:
  org.apache.jackrabbit jackrabbit-core
  commons-logging
  org.slf4j slf4j-log4j
  org.slf4j jcl-over-slf4j 1.5.3

Logging - what's this SLF4J thing?
Jackrabbit uses the SLF4J logging facade for logging.
Benefits: Great for embedded uses, since it can adapt to the logging framework the host application already uses.
Drawbacks: What do I put in my classpath?

SLF4J in practice
- SLF4J API is automatically included as a dependency of jackrabbit-core.
- You need to explicitly add the SLF4J implementation.
- Jackrabbit webapp, jca and standalone use slf4j-log4j, so you can use normal log4j configuration.
Classpath with log4j:
  slf4j-api jar
  slf4j-log4j jar
  log4j jar
  log4j.properties
Classpath with no logging:
  slf4j-api jar
  slf4j-nop jar

Repository configuration
- Configuration in a repository.xml file.
- Default configuration shipped with Jackrabbit.
- Structure defined in a DTD.
- Contains global settings and a workspace configuration template. The template is instantiated to a workspace.xml configuration file for each new workspace.
Main elements: clustering, workspace, versioning, security, persistence, search index, data store, file system

Persistence managers
- Use one of the bundle persistence managers. -> select one for your db
- Other persistence managers mostly for backwards compatibility.
- Default based on embedded Apache Derby.
Configuration:
  driver, url, user, password
  schema
  schemaObjectPrefix
  minBlobSize
  bundleCacheSize
  consistencyCheck/Fix
  blockOnConnectionLoss
Needs CREATE TABLE permissions!

Search index
- Per-workspace search indexes based on Apache Lucene.
- By default everything is indexed. Use index configuration to customize.
- Clustering: Each cluster node maintains local search indexes.
Configuration:
  min/maxMergeDocs
  mergeFactor
  analyzer
  textFilterClasses
  respectDocumentOrder
  resultFetchSize
Performance:
  property, type, ft -> fast
  path -> slow

Text extraction
- Full text indexing of file contents based on various parser libraries (POI, PDFBox, etc.).
- Currently only for the jcr:data property with correct jcr:mimeType.
- Jackrabbit 2.0: indexing of all binary properties.
textFilterClasses:
  PlainTextExtractor
  MsWordTextExtractor
  MsExcelTextExtractor
  MsPowerPointTextExtractor
  PdfTextExtractor
  OpenOfficeTextExtractor
  RTFTextExtractor
  HTMLTextExtractor
  XMLTextExtractor

Data store - dealing with lots of data
- Data store feature available since Jackrabbit 1.4
- Content-addressed storage of large binary properties.
- Completely transparent to client applications.
- Uses garbage collection to remove unused data.
Implementations:
  FileDataStore
  DbDataStore
  sandbox: S3DataStore
Garbage collection:
  gc = si.createDSGC();
  gc.scan();
  gc.stopScan();
  gc.deleteUnused();
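To make "content-addressed storage" and the scan-then-delete garbage collection more concrete, here is a minimal in-memory sketch. It is not Jackrabbit's data store API: ToyDataStore, addRecord, getRecord and deleteUnused are hypothetical names, and using a SHA-1 digest of the content as the record identifier is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// In-memory sketch of a content-addressed store.
// NOT Jackrabbit's DataStore API; all names are hypothetical.
final class ToyDataStore {
    private final Map<String, byte[]> records = new HashMap<>();

    // Stores the bytes under the hex SHA-1 digest of their content.
    // Identical content always maps to the same identifier, so it is kept once.
    String addRecord(byte[] data) {
        String id = sha1Hex(data);
        records.putIfAbsent(id, data.clone());
        return id;
    }

    // Returns a copy of the stored bytes, or null if the identifier is unknown.
    byte[] getRecord(String id) {
        byte[] data = records.get(id);
        return data == null ? null : data.clone();
    }

    int size() {
        return records.size();
    }

    // Sweep step of garbage collection: drops every record whose identifier
    // was not seen in use during a scan of the repository content.
    // Returns the number of records removed.
    int deleteUnused(Set<String> identifiersInUse) {
        int before = records.size();
        records.keySet().retainAll(identifiersInUse);
        return before - records.size();
    }

    private static String sha1Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 digest not available", e);
        }
    }

    public static void main(String[] args) {
        ToyDataStore store = new ToyDataStore();
        String first = store.addRecord("same bytes".getBytes(StandardCharsets.UTF_8));
        String second = store.addRecord("same bytes".getBytes(StandardCharsets.UTF_8));
        String other = store.addRecord("other bytes".getBytes(StandardCharsets.UTF_8));
        System.out.println(first.equals(second)); // duplicate content shares one record
        System.out.println(store.size());
        // Pretend the scan found only the first record still referenced.
        System.out.println(store.deleteUnused(Set.of(first)));
        System.out.println(store.getRecord(other) == null);
    }
}
```

Because the identifier depends only on the bytes, clients never need to know whether a binary was already stored, which is what makes the feature transparent to them.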

Content modeling: David's model
1. Data First. Structure Later. Maybe.
2. Drive the content hierarchy, don't let it happen.
3. Workspaces are for clone(), merge() and update()
4. Beware of Same Name Siblings
5. References considered harmful
6. Files are Files are Files
7. ID's are evil

Content modeling: Example
Relational model:
  CREATE TABLE author (
    id INTEGER,
    name VARCHAR
  );
  CREATE TABLE post (
    id INTEGER,
    author INTEGER,
    posted DATETIME,
    title VARCHAR,
    body TEXT
  );
Content repository model:
  /blog = Jukka Zitting /2009 /03 /25 @body

Example, cont.
Relational model:
  CREATE TABLE comment (
    id INTEGER,
    post INTEGER,
    title VARCHAR,
    body TEXT
  );
  CREATE TABLE media (
    id INTEGER,
    post INTEGER,
    data BLOB,
    caption VARCHAR
  );
Content repository model:
  /hello
    /comments =
    /media [nt:folder]
      /image.jpg [nt:file]
    /code [nt:folder]
      /Example.java [nt:file]
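The example places a post under a date-based path (/blog/2009/03/25) with a node named after the post (/hello). A small sketch of how an application might derive such a path from the post date and title follows. The class BlogPaths, its methods, and the rule for turning a title into a node name are assumptions made for illustration and are not taken from the slides.

```java
import java.time.LocalDate;
import java.util.Locale;

// Hypothetical helper that derives a date-based content path for a blog post.
final class BlogPaths {

    // Builds a path such as /blog/2009/03/25/hello from the post date and title.
    static String postPath(LocalDate posted, String title) {
        return String.format(Locale.ROOT, "/blog/%04d/%02d/%02d/%s",
                posted.getYear(), posted.getMonthValue(), posted.getDayOfMonth(),
                toNodeName(title));
    }

    // Lower-cases the title, replaces every run of characters other than
    // a-z and 0-9 with a single hyphen, and trims hyphens at both ends.
    static String toNodeName(String title) {
        String name = title.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("^-+|-+$", "");
        return name.isEmpty() ? "untitled" : name;
    }

    public static void main(String[] args) {
        System.out.println(postPath(LocalDate.of(2009, 3, 25), "Hello"));
        System.out.println(postPath(LocalDate.of(2009, 3, 25), "Hello, World!"));
    }
}
```

With this layout the date no longer needs a separate column as in the relational version: it is encoded in the hierarchy, and comments and media live directly beneath the post they belong to.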

Example, cont.
Versioning:
  node.checkout();
  //...
  node.save();
  node.checkin();
Locking: /hello [mix:lockable]
  node.lock();
  //...
  node.unlock();

Common issues: Content hierarchy
Jackrabbit doesn't support very flat content hierarchies. You'll start seeing problems when you put more than 10k child nodes under a single parent.
Solution: Add more depth to your hierarchy. Divide entries by date, category, author, etc. If nothing else, use the first letter(s) of a title, a content hash, or even a random number to distribute the nodes.
Note: You can still access the entries as a single flat set through search. The hierarchy is for browsing.
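One way to apply the "content hash" suggestion is sketched below: two intermediate levels are derived from the first hex digits of a hash of the entry name, so that entries spread over up to 65,536 buckets instead of sitting under one parent. The class ShardedPaths, the choice of SHA-1, and the two-level layout are assumptions made for this example, not something the slides prescribe.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper that spreads entries over intermediate nodes
// so that no single parent collects a very large number of children.
final class ShardedPaths {

    // Returns <parentPath>/<h1h2>/<h3h4>/<name>, where h1..h4 are the first
    // four hex digits of the SHA-1 digest of the name.
    static String shardedPath(String parentPath, String name) {
        String hash = sha1Hex(name);
        return parentPath + "/" + hash.substring(0, 2)
                + "/" + hash.substring(2, 4) + "/" + name;
    }

    private static String sha1Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] bytes = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 digest not available", e);
        }
    }

    public static void main(String[] args) {
        // The same name always lands in the same bucket, so the path can be
        // recomputed later without a lookup.
        System.out.println(shardedPath("/entries", "hello"));
        System.out.println(shardedPath("/entries", "goodbye"));
    }
}
```

A date- or category-based split is usually nicer for browsing; a hash is the fallback when the content has no natural grouping.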

Common issues: Concurrent edits
Three ways to handle concurrent edits:
1. Merge changes
2. Fail conflicting changes
3. Block concurrent changes
Jackrabbit does 1 by default, and falls back to 2 when the merge fails. You can explicitly opt for 3 by using the JCR locking feature.
Estimate: How often would conflicts happen? Will the benefits of locking be worth the overhead?
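The difference between strategies 1 and 2 can be shown with a toy three-way merge over a node's properties: edits to different properties are merged, while different edits to the same property fail. This is only a sketch of the general idea. ToyMerge is a hypothetical class, and Jackrabbit's actual merge rules are more involved and are not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Toy three-way merge of property maps: merge when possible, fail on conflict.
// Hypothetical example code, not Jackrabbit's implementation.
final class ToyMerge {

    // base:   the state both editors started from
    // mine:   the state after the first editor's changes
    // theirs: the state after the second editor's changes
    static Map<String, String> merge(Map<String, String> base,
                                     Map<String, String> mine,
                                     Map<String, String> theirs) {
        Set<String> names = new TreeSet<>(base.keySet());
        names.addAll(mine.keySet());
        names.addAll(theirs.keySet());

        Map<String, String> merged = new HashMap<>();
        for (String name : names) {
            String b = base.get(name);
            String m = mine.get(name);
            String t = theirs.get(name);
            String result;
            if (Objects.equals(m, t) || Objects.equals(t, b)) {
                result = m; // both sides agree, or only "mine" changed it
            } else if (Objects.equals(m, b)) {
                result = t; // only "theirs" changed it
            } else {
                // Both sides changed the same property differently.
                throw new IllegalStateException("Conflicting edits to property " + name);
            }
            if (result != null) { // null means the property was removed
                merged.put(name, result);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> base = Map.of("title", "Hello", "body", "First draft");
        Map<String, String> mine = Map.of("title", "Hello, World!", "body", "First draft");
        Map<String, String> theirs = Map.of("title", "Hello", "body", "Second draft");
        // Different properties were edited, so the changes merge cleanly.
        System.out.println(new java.util.TreeMap<>(merge(base, mine, theirs)));
    }
}
```

Strategy 3 avoids the question entirely by letting only the lock holder edit, at the cost of making everyone else wait.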

Questions?