#SummitNow Managing a Billion Object Repository November 13, 2013 Munwar Shariff CTO, CIGNEX Datamatics


1 #SummitNow Managing a Billion Object Repository November 13, 2013 Munwar Shariff CTO, CIGNEX Datamatics munwar@cignex.com

2 #SummitNow About the Speaker: Co-Founder & Chief Technology Officer of CIGNEX Datamatics; 23+ years of industry experience; Author of the first Alfresco book (2006); Certified Alfresco Trainer; Author of five technical books

3 #SummitNow Agenda: Use Case: Social Security e-Benefits System; The need for “Big Content”; Solutions Evaluated; Alfresco as Big Content Platform; Summary

4 #SummitNow Use Case: Social Security e-Benefits System

5 #SummitNow Program Coverage: My Benefits; Employment Services; Cash Assistance; Food Stamps; Housing; Healthcare; Childcare; Insurance

6 #SummitNow Program Objectives: Scalable Centralized Document Repository; One-time migration of existing docs (~10 yrs of archives); Secure access; Meta-data Management; High Performance Search and Retrieval; Correspondence Templates (Versioned)

7 #SummitNow Scalability Requirements: ~500 million objects, growing to a billion; ~60 million objects added per year; Estimated repository size = 60 TB; Users = 30,000; Document ingestion rate = 200,000/hour; Search (6 months date range) = 500 / 2 sec; PCL to PDF conversion = 25,000/day
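As a rough sanity check, the stated rates can be converted into per-second and per-year terms. The sketch below uses only the figures on the slide; treating the rates as sustained averages and the PCL conversion as spread over a 24-hour day are assumptions made for illustration, not statements from the talk.

```python
# Figures copied from the Scalability Requirements slide.
INGEST_PER_HOUR = 200_000
OBJECTS_PER_YEAR = 60_000_000
START_OBJECTS = 500_000_000
TARGET_OBJECTS = 1_000_000_000
PCL_PER_DAY = 25_000

# Peak ingestion expressed per second.
ingest_per_second = INGEST_PER_HOUR / 3600

# Years of growth before the repository reaches a billion objects,
# assuming the yearly growth figure stays constant.
years_to_billion = (TARGET_OBJECTS - START_OBJECTS) / OBJECTS_PER_YEAR

# Hours of ingestion at the peak rate needed to absorb one year of growth.
hours_for_yearly_growth = OBJECTS_PER_YEAR / INGEST_PER_HOUR

# PCL conversions per minute, assuming work is spread over 24 hours.
pcl_per_minute = PCL_PER_DAY / (24 * 60)

print(f"ingestion: {ingest_per_second:.1f} docs/second")
print(f"time to one billion objects: {years_to_billion:.1f} years")
print(f"one year of growth at peak rate: {hours_for_yearly_growth:.0f} hours")
print(f"PCL to PDF: {pcl_per_minute:.1f} conversions/minute")
```

Under these assumptions the ingestion target is a burst capacity rather than a steady load: a full year of growth fits into about 300 hours at the peak rate.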

8 #SummitNow The need for Big Content (Unstructured Big Data)

9 #SummitNow Types of Big Content: Social Media Postings; Audio & Video Files; Web Logs, Emails; Blogs & Comments; Records & Documents. Source: Gartner, 17 Oct 2012

10 #SummitNow Big Content Needs More Metadata: Core Enterprise Metadata Framework (Elements Applicable to All Enterprise Content); Domain Specific Metadata (Brand, Product, Department); Application Metadata. Source: Gartner, 15 May 2013

11 #SummitNow Enterprise Search is the key: Search provides a ready entry into the Big Content; Data-centric vendors acquired search companies; Alfresco => Apache Solr (“SolrCloud” in future?). Source: Gartner, 13 May 2013

12 #SummitNow Big Content Discovery & Analysis Analysis Level Discovery Level Users Fuzzy Matching Mechanism Indexing Search Engine

13 #SummitNow Solutions Evaluated

14 #SummitNow Technical Requirements: 1. Scalable Repository, High Ingestion Rate; 2. High Performance Search and Retrieval; 3. Secure access at “county” level (group); 4. Compliance on storage (physical separation); 5. Version Control, Workflow & Business Rules; 6. Web Services API for external access

15 #SummitNow 1. MongoDB + Solr + Liferay. Pros: Highly Scalable; High performance; API based access. Cons: Secure (Group) Access requires heavy customization; Not a traditional ECM install & configuration; Content services missing such as versioning, workflow, business rules

16 #SummitNow 2. Lily = Hadoop HBase + Solr. Pros: Highly Scalable; API based access; Few content services such as versioning; Separation of storage. Cons: Queuing system is not robust; Performance issues; Secure (Group) Access requires heavy customization

17 #SummitNow 3. Alfresco + SolrCloud + DPE. Pros: Highly Scalable; High performance; Secure Access; Separation of storage; Content services; API based access. Cons: Need to programmatically maintain index/repository consistency; Custom “Data Processing Engine” requires support

18 #SummitNow Alfresco as Big Content Platform go big or go home…

19 #SummitNow Architecture (diagram; labels only): Legacy System; Various Documents; Workload Scheduler; Data Processing Engine; Secure & Flexible Content Repository; Solr Search; 15,000+ Docs/Day; 200,000+ Ingestion/Hr; Ingestion Rate 25/second

20 #SummitNow Software: Operating System = Ubuntu Server 12.04.2; ECM = Alfresco EE version 4.1.4; Database = Oracle RAC 11g 11.2.0.3; File Storage = Veritas Cluster File System; Search = SolrCloud (Apache Solr version 4.3.1 and Apache ZooKeeper 3.4.5); Application Server = Node.js (event-driven, non-blocking I/O model for data-intensive real-time applications that run across distributed devices); PCL to PDF converter = PageTech; ESB = Oracle Service Bus

21 #SummitNow Data Processing Engine (DPE): Central controller/broker; Document ingestion in Alfresco, including pre-processing, splitting, meta-data extraction; Brokering index updates: receiving and queuing real-time content updates from the ECM, which are pushed at a later stage to the SolrCloud index

22 #SummitNow DPE Highlights: Asynchronous I/O, enabling high data ingestion/export throughput; Flexibility of consistency models: eventual consistency for batch operations, transactional consistency for online operations; Distributed processing model: DPE can scale horizontally by distributing processing across multiple nodes, with co-ordination handled using a message/event bus; Extensible synchronization: synchronization can be extended to multiple indexing engines that support additional operations such as statistical and analytical queries, semantic search (RDF/SPARQL), or graph traversals
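The brokering role described for the DPE (receive content updates, queue them, push them to the index later) can be illustrated with a small queue-and-flush sketch. This is not the actual DPE, which per the software slide runs on Node.js; `IndexUpdateBroker`, `receive`, `flush`, and the `push_batch` callback are hypothetical names, and a real deployment would send batches to SolrCloud instead of appending to a list.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class IndexUpdateBroker:
    """Hypothetical broker: queues content updates and pushes them in batches."""

    def __init__(self, push_batch: Callable[[List[Dict]], None], batch_size: int = 100):
        self._queue: Deque[Dict] = deque()
        self._push_batch = push_batch  # stand-in for the call that updates the index
        self._batch_size = batch_size

    def receive(self, update: Dict) -> None:
        """Accept an update from the repository without touching the index yet."""
        self._queue.append(update)

    def pending(self) -> int:
        return len(self._queue)

    def flush(self) -> int:
        """Push queued updates in batches; a failed batch is put back in order."""
        pushed = 0
        while self._queue:
            size = min(self._batch_size, len(self._queue))
            batch = [self._queue.popleft() for _ in range(size)]
            try:
                self._push_batch(batch)
            except Exception:
                # Restore the batch at the front so nothing is lost, then re-raise.
                self._queue.extendleft(reversed(batch))
                raise
            pushed += len(batch)
        return pushed


# Usage: a plain list stands in for the search index.
indexed: List[Dict] = []
broker = IndexUpdateBroker(indexed.extend, batch_size=2)
for doc_id in ("doc-1", "doc-2", "doc-3"):
    broker.receive({"id": doc_id, "op": "update"})
pending_before_flush = broker.pending()
pushed_count = broker.flush()
```

Until `flush` runs, the index lags behind the repository, which corresponds to the eventual-consistency mode listed for batch operations.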

23 #SummitNow Custom SolrCloud Integration: Highly scalable (production use cases of 3+ billion documents on such a setup); A date-range-based sharding policy can be implemented; Multiple Alfresco repositories can use the same SolrCloud instance
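A date-range sharding policy can be sketched as a function that maps a document date to a shard name, plus a helper that lists the shards a date-range query must touch. The six-month bucket size and the `docs_<year>_p<n>` naming are assumptions chosen for illustration; the talk does not specify how its shards were actually cut.

```python
from datetime import date
from typing import List


def shard_for(doc_date: date, months_per_shard: int = 6) -> str:
    """Return a hypothetical shard name for the period containing doc_date."""
    if months_per_shard < 1 or 12 % months_per_shard != 0:
        raise ValueError("months_per_shard must divide 12")
    bucket = (doc_date.month - 1) // months_per_shard + 1
    return f"docs_{doc_date.year}_p{bucket}"


def shards_for_range(start: date, end: date, months_per_shard: int = 6) -> List[str]:
    """List, in order, every shard that a query over [start, end] must search."""
    if start > end:
        raise ValueError("start must not be after end")
    names: List[str] = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        name = shard_for(date(year, month, 1), months_per_shard)
        if name not in names:
            names.append(name)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


print(shard_for(date(2013, 11, 13)))
print(shards_for_range(date(2013, 5, 1), date(2013, 11, 13)))
```

With buckets of this kind, a search limited to a six-month window touches at most two shards instead of the whole index.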

24 #SummitNow Veritas Cluster File System: Performance scaling: cluster with multi-core processors, large memories, multiple high-performance gigabit Ethernet interfaces for client access; File system scaling: supports individual file systems of up to 256 terabytes capacity and up to a billion files per file system, with no practical limit on the number of file systems hosted by a cluster

25 #SummitNow Physical Separation of Files: Physical storage (file system) to be isolated per county, as per compliance requirements; Configured “Alfresco Content Store Selector” for each county; County ID (key) is the meta-data
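The routing idea behind per-county separation can be sketched as a lookup from the county ID in a document's metadata to a dedicated storage root. This does not reproduce Alfresco's Content Store Selector configuration; the `countyId` key, the county names, and the paths are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, Mapping

# Hypothetical mapping: one storage root per county.
COUNTY_STORES: Dict[str, str] = {
    "county-a": "/stores/county-a",
    "county-b": "/stores/county-b",
}


def check_isolation(stores: Mapping[str, str]) -> None:
    """Fail if two counties point at the same storage root."""
    roots = list(stores.values())
    if len(set(roots)) != len(roots):
        raise ValueError("two counties share a storage root")


def store_root_for(metadata: Mapping[str, str],
                   stores: Mapping[str, str] = COUNTY_STORES) -> PurePosixPath:
    """Pick the storage root from the county ID carried in the metadata."""
    county_id = metadata.get("countyId")
    if county_id is None:
        raise ValueError("document has no county ID")
    if county_id not in stores:
        raise ValueError(f"no content store configured for county {county_id!r}")
    return PurePosixPath(stores[county_id])


check_isolation(COUNTY_STORES)
print(store_root_for({"countyId": "county-a", "name": "benefit-letter.pdf"}))
```

Rejecting documents with a missing or unknown county ID, instead of falling back to a shared default store, keeps a misconfigured document from landing on another county's file system.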

26 #SummitNow ECM Features: Audit Trail; High Availability (fail-over); Backup policies; Business rules on Spaces (folders); Share Sites; Folders: taxonomy-specific storage

27 #SummitNow Deployment

28 #SummitNow Hardware: Number of servers = 9 (2 Alfresco, 4 Solr (2 Solr, 2 ZooKeeper), 2 Data Processing Engine, 1 PDF Converter); 120 GB RAM; 72 CPU cores; 16 TB file system storage per annum

29 #SummitNow Solution Benefits: The solution architecture is designed for horizontal scalability, so it can grow to meet future requirements; The proposed design supports “Performance SLAs” for the expected load and number of users accessing the system; The design takes a “modular” approach, so that components can be replaced in future if needed

30 #SummitNow = Big Content Platform Conclusion

31 #SummitNow About CIGNEX Datamatics: Since 2000, delivering Open Source solutions for the enterprise through adoption and integration to: address business goals; lower the cost of doing business; gain competitive advantage. Offerings: Portal Solutions; Content Solutions; Big Data Analytics Solutions. 400+ Open Source Solutions; 450+ Open Source Experts; 200+ Open Source Integrations; 13+ Books on Open Source; 5000+ Community Contributions

32 #SummitNow Where we help our customers… User eXperience Platform (Portals): Liferay, Drupal, JBoss, ZK, HTML5, MuleSoft; Intranet, Extranet, EAI, SOA, Social Collaboration, Big Data Portal, Mobile Portal. Enterprise Content Management (Content): Alfresco, Adobe CQ, Drupal, Magento, JBoss, Moodle, Ephesoft, Liferay; WCM, DM, RM, CMS, DAM, e-Commerce, e-Learning, ERP, Imaging Solutions. Making Data Work (Big Data Analytics): Hadoop Ecosystem, MongoDB, Neo4j, Pentaho, Talend, Solr, Jaspersoft; Data Integration, Information Delivery, Data Analysis. Managed Cloud Services: Develop, Deploy, Manage. VAR/Annual Product Subscription: Liferay, Alfresco, Cloudera Hadoop, MongoDB


