ManifoldCF for Content Acquisition

ManifoldCF for Content Acquisition
Karl Wright, Nokia Inc.
kwright@apache.org, 11/10/2011

What this presentation is about
- An introduction to ManifoldCF
- Presenter: Karl Wright, original ManifoldCF developer
- Challenge: Getting content into a search engine, keeping it up to date, and securing it

Information about me
- My name is Karl Wright
- Principal Software Engineer at Nokia, Inc.
- Former Principal Software Engineer at MetaCarta, Inc.
- Core committer for ManifoldCF
- Author of ManifoldCF in Action

ManifoldCF…
- Pulls documents from disparate sources
- Writes documents into the target(s) of your choice
- Provides an end-user authorization mechanism
- Synchronizes, doesn’t just crawl once!
- Has bounded memory usage
- Is reasonably performant
- Is extendible to new kinds of repositories
- Shows you what it is doing
- Is resilient against restart

ManifoldCF vs. Nutch, Heritrix
Tree operations? No / Yes / Some
Web only? Http, ftp, svn / All sorts of content
UI?
Restartable? Painful / Uses Hadoop
Incremental? Not really / Basic support
Max docs: “web scale” / 100,000,000 / Technically no limit; 10,000,000 tested (using postgresql)
Docs/sec: 80+ per instance / Scales as needed / 80+ per instance (using postgresql)
Memory bounded?
Security model?

How does ManifoldCF fit?

Just how many kinds of document repositories are out there?
- File systems (CIFS too)
- Windows shares
- The Web (RSS too)
- Wikis
- Databases
- CMIS repositories
- SharePoint (Microsoft)
- FileNet (IBM)
- Documentum (EMC)
- LiveLink (OpenText)
- Meridio (Autonomy)
- Many, many more

What is a connector?
- A connector is code implementing an interface
- ManifoldCF uses three kinds of ‘connector’
  - “Authority connector” understands a specific authorization entity, e.g. AD or LiveLink
  - “Repository connector” understands a specific content repository, e.g. Windows shares or Documentum
  - “Output connector” understands a specific output destination, e.g. Apache Solr or OpenSearchServer
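The three connector roles can be pictured as three small interfaces. The sketch below is illustrative only: the class and method names (`access_tokens`, `document_ids`, `fetch`, `index`, `crawl`) are assumptions made for this example and are not the ManifoldCF Java API.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Tuple


class AuthorityConnector(ABC):
    """Understands one authorization entity: maps a user to access tokens."""

    @abstractmethod
    def access_tokens(self, user: str) -> List[str]:
        ...


class RepositoryConnector(ABC):
    """Understands one content repository: lists and fetches documents."""

    @abstractmethod
    def document_ids(self) -> Iterable[str]:
        ...

    @abstractmethod
    def fetch(self, doc_id: str) -> bytes:
        ...


class OutputConnector(ABC):
    """Understands one output destination, such as a search index."""

    @abstractmethod
    def index(self, doc_id: str, content: bytes, allow_tokens: List[str]) -> None:
        ...


class DictRepository(RepositoryConnector):
    """Hypothetical repository backed by an in-memory dict."""

    def __init__(self, docs: Dict[str, bytes]) -> None:
        self._docs = dict(docs)

    def document_ids(self) -> Iterable[str]:
        return sorted(self._docs)

    def fetch(self, doc_id: str) -> bytes:
        return self._docs[doc_id]


class MemoryOutput(OutputConnector):
    """Hypothetical output that just remembers what it was given."""

    def __init__(self) -> None:
        self.indexed: Dict[str, Tuple[bytes, List[str]]] = {}

    def index(self, doc_id: str, content: bytes, allow_tokens: List[str]) -> None:
        self.indexed[doc_id] = (content, list(allow_tokens))


def crawl(repo: RepositoryConnector, output: OutputConnector,
          allow_tokens: List[str]) -> int:
    """Copy every document from the repository to the output; return the count."""
    count = 0
    for doc_id in repo.document_ids():
        output.index(doc_id, repo.fetch(doc_id), allow_tokens)
        count += 1
    return count
```

The framework code (`crawl` here) only ever talks to the abstract interfaces, which is what lets new repository or output types be added without touching it.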

Connections and jobs
- A ‘connection’ is a configured instance of a ‘connector’ object
  - Connections are pooled
  - Max number of similar connections is configurable
- Jobs describe “what” and “when”, not “how”
  - A job has a repository connection and an output connection
  - A job is not really a task, but rather a set of documents
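Pooling with a configurable maximum can be sketched as follows. This is a minimal illustration of the idea, not ManifoldCF's implementation; `ConnectionPool`, `borrow`, and `max_connections` are names invented for the example.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Hands out at most max_connections connections at a time, reusing idle ones."""

    def __init__(self, factory, max_connections: int) -> None:
        self._factory = factory                      # creates a new connection
        self._idle = queue.LifoQueue()               # connections waiting for reuse
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.created = 0                             # how many were ever created

    @contextmanager
    def borrow(self, timeout=None):
        # Block (up to timeout seconds) until a slot is free.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free connection in pool")
        try:
            try:
                conn = self._idle.get_nowait()       # reuse an idle connection
            except queue.Empty:
                conn = self._factory()               # or create a new one
                with self._lock:
                    self.created += 1
            try:
                yield conn
            finally:
                self._idle.put(conn)                 # return it for reuse
        finally:
            self._slots.release()
```

A caller writes `with pool.borrow() as conn: ...`; the limit keeps a crawler from overwhelming a repository with simultaneous sessions.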

ManifoldCF Document Flow

ManifoldCF’s Crawling Models
- Push vs. Pull
  - Observation: ‘Push’ model may require notifications to be queued (1)
  - Observation: ‘Push’ is no longer an option if ANY notification is overlooked (2)
  - Observation: There are no real-world systems I’ve found that really support ‘push’!
- ManifoldCF uses ‘pull’ exclusively right now
(1) If you don’t queue them, you slow down the repository’s “save” transactions
(2) E.g. ACL changes don’t show up as a document change, so such repositories can’t notify you when document access changes

ManifoldCF’s Crawling Models, ctd.
- For incremental ‘pull’:
  - Need to periodically identify documents that have ‘changed’ within a given time window
  - Changes include “add”, “modify”, or “delete”
  - Only a few repositories can tell you about “delete”
- Connectors in ManifoldCF declare their ability to detect different kinds of changes
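When a repository cannot report changes itself, one common way to find them is to compare version strings between two crawls. The sketch below shows that technique under the assumption that each crawl yields a full mapping of document identifier to version string; `diff_versions` is a name made up for this example, and a deletion is inferred purely from a document's absence.

```python
from typing import Dict, List, Tuple


def diff_versions(previous: Dict[str, str],
                  current: Dict[str, str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, modified, deleted) document ids between two crawls."""
    added = sorted(d for d in current if d not in previous)
    modified = sorted(d for d in current
                      if d in previous and current[d] != previous[d])
    deleted = sorted(d for d in previous if d not in current)
    return added, modified, deleted


# Example: versions recorded on the last crawl vs. what the repository shows now.
last_crawl = {"a.doc": "v1", "b.doc": "v1", "c.doc": "v3"}
this_crawl = {"a.doc": "v1", "b.doc": "v2", "d.doc": "v1"}
added, modified, deleted = diff_versions(last_crawl, this_crawl)
```

A connector that can report adds, changes, and deletes directly makes this full comparison unnecessary, which is why declaring those abilities matters.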

Continuous vs. Periodic Crawling
- ‘Continuous’ crawling
  - Can’t delete documents from index unless they are discovered missing on refetch
  - Can refetch or expire documents on a dynamic schedule
  - Can reseed, also on a schedule
- ‘Periodic’ crawling
  - MODEL_ADD_CHANGE_DELETE, MODEL_ADD, or MODEL_ALL
  - A MODEL_ALL connector is “stupid”, a MODEL_ADD_CHANGE_DELETE one is “brilliant”
  - Two kinds of cycle: Seeding, discovery/processing/indexing, (maybe) clean up
  - Complex decision as to which kind happens, based on both connector model and job state

Crawling models, graphic

Dealing with Disparate Systems
- Connection configuration is stored as XML in the database
- A job’s document specification and output specification are also stored as XML
- Connector-defined, unlimited-length strings are used for document identifier, document version, output version, and access token
- Each connector provides the UI for editing its own configuration and specification
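Storing configuration as XML lets the framework persist settings for any connector without knowing their meaning. The sketch below shows a round trip through an XML string; the element names `configuration` and `parameter` and the example keys are assumptions for illustration, not ManifoldCF's actual schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def config_to_xml(params: Dict[str, str]) -> str:
    """Serialize connector-specific parameters to an opaque XML string."""
    root = ET.Element("configuration")
    for name, value in sorted(params.items()):
        ET.SubElement(root, "parameter", name=name).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_config(text: str) -> Dict[str, str]:
    """Parse the XML string back into a parameter dictionary."""
    root = ET.fromstring(text)
    return {p.get("name"): (p.text or "") for p in root.findall("parameter")}


# Hypothetical settings for a file-share connection.
stored = config_to_xml({"server": "fileserver01", "path": "/shared/docs"})
restored = xml_to_config(stored)
```

Only the connector interprets the keys; the database column just holds the string.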

Example: File system job

… vs. Web Job

MCF Process Architecture

ManifoldCF Authorization Requirements
- Observation: Every repository has its own notion of document authorization
- Observation: Most repositories are effectively ACL-based
- Observation: Active Directory handles 95% of enterprise authentication
- ManifoldCF idea: Enforce repository’s existing security model, rather than inventing something new

ManifoldCF Document Authorization
- Observation: A separate crawl for each end user is not going to work
- Observation: Post-filtering of search results has some nasty edge cases (1)
- Observation: Document security doesn’t change very often (2)
- Observation: User changes should take effect immediately (3)
- ManifoldCF filters by search-engine query
  - Document access tokens are passed to the target
  - User access tokens are obtained at search time, via the MCF Authority Service
(1) Imagine picking out tens of results from millions of candidates
(2) Thus, people don’t seem to expect search engine results to change without explicit update
(3) E.g. “user lockout”
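Filtering inside the query, rather than after it, can be sketched like this. The example assumes each indexed document carries a set of allow tokens and that the user's tokens have already been looked up at search time; `IndexedDoc` and `search` are hypothetical names, and a real search engine would apply the restriction as a filter clause instead of a Python loop.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    text: str
    allow_tokens: FrozenSet[str]   # access tokens stored with the document


def search(index: Iterable[IndexedDoc], term: str,
           user_tokens: Iterable[str], limit: int = 10) -> List[str]:
    """Return up to `limit` ids of documents that match AND are visible to the user."""
    tokens = frozenset(user_tokens)
    hits: List[str] = []
    for doc in index:
        # The security check is part of matching, so hidden documents
        # never consume a slot in the result list.
        if term in doc.text and doc.allow_tokens & tokens:
            hits.append(doc.doc_id)
            if len(hits) == limit:
                break
    return hits


index = [
    IndexedDoc("1", "budget plan", frozenset({"finance"})),
    IndexedDoc("2", "budget memo", frozenset({"staff"})),
    IndexedDoc("3", "picnic plan", frozenset({"staff"})),
]
```

Because the user's tokens are fetched at search time, revoking a user's access takes effect on the next query without recrawling any documents.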

MCF Security Architecture

Securing Documents from Multiple Repositories
- You can define multiple authority connections in ManifoldCF
- Each authority connection supplies its own access tokens
- Every repository connection has an MCF authority connection
- All access tokens from an authority are qualified with the authority connection name
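Qualifying tokens with the authority connection name keeps identical raw tokens from different systems apart. In the sketch below the `name:token` format, the function names, and the example authorities are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List


def qualify(authority_name: str, token: str) -> str:
    """Prefix a raw access token with the authority connection that issued it."""
    return f"{authority_name}:{token}"


def document_tokens(authority_name: str, acl: List[str]) -> List[str]:
    """Tokens stored with a document, qualified by its repository's authority."""
    return [qualify(authority_name, t) for t in acl]


def user_tokens(authorities: Dict[str, Callable[[str], List[str]]],
                user: str) -> List[str]:
    """Ask every authority connection for the user's tokens and qualify each one."""
    tokens: List[str] = []
    for name in sorted(authorities):
        tokens.extend(qualify(name, t) for t in authorities[name](user))
    return tokens


# Two hypothetical authorities that both happen to use a token called "admins".
authorities = {
    "ad": lambda user: ["admins"] if user == "karl" else [],
    "livelink": lambda user: ["admins"],
}
```

Here a LiveLink document tagged `livelink:admins` is visible to every user in this toy setup, while an `ad:admins` document is visible only to `karl`, even though the raw token text is the same.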

So how do I write a connector?
- Write a class implementing an interface
  - IOutputConnector
  - IAuthorityConnector
  - IRepositoryConnector
- Build and deploy
- Register it, or add it to connectors.xml for the Quick Start
- That’s it! You’re done!
- Read ManifoldCF in Action if you want to do it right

Who has used ManifoldCF?

What’s new in the last 12 months?
- Name change
- Three releases
- ManifoldCF in Action
- Quick Start example
- ManifoldCF API Service (REST style, uses JSON)
- Scripting language
- Solr plugin distribution
- Hsqldb, Derby support
- Wiki connector
- CMIS repository connector
- OpenSearchServer output connector

What’s coming?
- Better scalability via NoSQL (Voldemort?)
- Post-search document filtering support
- Always more connectors and performance improvements
- MySQL support

Shameless Plug for “ManifoldCF in Action”
- Available as “early access” from Manning Publishing
- Helpful for users, integrators, and connector writers
- Won’t be put into production until ManifoldCF grows, so please help us to do that!

Resources
- ManifoldCF in Action, from Manning Publishing: http://www.manning.com/wright
- ManifoldCF deployment instructions: http://incubator.apache.org/connectors/how-to-build-and-deploy.html
- ManifoldCF API documentation: http://incubator.apache.org/connectors/programmatic-operation.html
- ManifoldCF script language documentation: http://incubator.apache.org/connectors/script.html

Contact
- Karl Wright
- kwright@apache.org
- http://manifoldcfinaction.blogspot.com