Published by Hillary Hudson. Modified over 8 years ago.
1
#SummitNow Managing a Billion Object Repository November 13, 2013 Munwar Shariff CTO, CIGNEX Datamatics munwar@cignex.com
2
About the Speaker: Co-Founder & Chief Technology Officer of CIGNEX Datamatics; 23+ years of industry experience; author of the first Alfresco book (2006); certified Alfresco trainer; author of five technical books
3
Agenda: Use Case (Social Security e-Benefits System); The Need for “Big Content”; Solutions Evaluated; Alfresco as Big Content Platform; Summary
4
Use Case: Social Security e-Benefits System
5
Program Coverage: My Benefits; Employment Services; Cash Assistance; Food Stamps; Housing; Healthcare; Childcare; Insurance
6
Program Objectives: scalable centralized document repository; one-time migration of existing docs (~10 yrs of archives); secure access; metadata management; high-performance search and retrieval; versioned correspondence templates
7
Scalability Requirements: ~500 million objects, growing to a billion; ~60 million objects added per year; estimated repository size = 60 TB; users = 30,000; document ingestion rate = 200,000/hour; search (6-month date range) = 500/2 sec; PCL-to-PDF conversion = 25,000/day
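To put the scalability figures on a common footing, the short calculation below converts them to per-second rates and growth horizons. It is only a sketch: the constant names are made up for illustration, and it assumes the hourly and daily figures are spread evenly over the period (a 24-hour day, a sustained peak rate), which the slide does not state.

```python
# Figures quoted on the Scalability Requirements slide.
INGEST_PER_HOUR = 200_000
OBJECTS_NOW = 500_000_000
OBJECTS_TARGET = 1_000_000_000
GROWTH_PER_YEAR = 60_000_000
PCL_TO_PDF_PER_DAY = 25_000


def per_second(count: int, seconds: int) -> float:
    """Average rate per second, assuming the load is spread evenly."""
    return count / seconds


# Assumption: ingestion runs at the quoted rate for the whole hour.
ingest_per_second = per_second(INGEST_PER_HOUR, 3600)

# Assumption: conversions are spread over a full 24-hour day.
pdf_per_second = per_second(PCL_TO_PDF_PER_DAY, 24 * 3600)

# Years until the repository doubles at the quoted yearly growth.
years_to_target = (OBJECTS_TARGET - OBJECTS_NOW) / GROWTH_PER_YEAR

# Hours of ingestion at the quoted rate needed to load one year of growth.
hours_for_yearly_growth = GROWTH_PER_YEAR / INGEST_PER_HOUR

print(f"ingestion: {ingest_per_second:.1f} docs/sec")
print(f"PCL to PDF: {pdf_per_second:.2f} docs/sec")
print(f"years to reach a billion objects: {years_to_target:.1f}")
print(f"hours to ingest one year of growth: {hours_for_yearly_growth:.0f}")
```

Read this way, the ingestion requirement is a peak-rate target: one year of growth fits in roughly 300 hours of ingestion at that rate.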
8
The Need for Big Content (Unstructured Big Data)
9
Types of Big Content: social media postings; audio and video files; web logs and emails; blogs and comments; records and documents. Source: Gartner, 17 Oct 2012
10
Big Content Needs More Metadata (layered diagram): Core Enterprise Metadata Framework (elements applicable to all enterprise content); Domain Specific Metadata (brand, product, department); Application Metadata. Source: Gartner, 15 May 2013
11
Enterprise Search is the Key: search provides a ready entry into Big Content; data-centric vendors acquired search companies; Alfresco => Apache Solr (“SolrCloud” in future?). Source: Gartner, 13 May 2013
12
Big Content Discovery & Analysis (diagram): Analysis Level; Discovery Level; Users; Fuzzy Matching Mechanism; Indexing; Search Engine
13
Solutions Evaluated
14
Technical Requirements: 1. Scalable repository, high ingestion rate; 2. High-performance search and retrieval; 3. Secure access at “county” level (group); 4. Compliance on storage (physical separation); 5. Version control, workflow and business rules; 6. Web services API for external access
15
1. MongoDB + Solr + Liferay. Pros: highly scalable; high performance; API-based access. Cons: secure (group) access requires heavy customization; not a traditional ECM install and configuration; content services such as versioning, workflow and business rules are missing
16
2. Lily = Hadoop HBase + Solr. Pros: highly scalable; API-based access; a few content services such as versioning; separation of storage. Cons: queuing system is not robust; performance issues; secure (group) access requires heavy customization
17
3. Alfresco + SolrCloud + DPE. Pros: highly scalable; high performance; secure access; separation of storage; content services; API-based access. Cons: index/repository consistency must be maintained programmatically; the custom “Data Processing Engine” requires support
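The first drawback, keeping index and repository consistent in code, usually comes down to some form of reconciliation. The sketch below shows one minimal way to do that by comparing document IDs. It is not the DPE's actual mechanism, which the deck does not describe; the function name and the ID sets are hypothetical.

```python
from __future__ import annotations


def reconcile(repo_ids: set[str], index_ids: set[str]) -> tuple[set[str], set[str]]:
    """Compare the IDs known to the repository with the IDs in the search index.

    Returns (missing_from_index, orphaned_in_index):
    - missing_from_index: stored documents that still need to be indexed
    - orphaned_in_index: index entries whose document no longer exists
    """
    missing_from_index = repo_ids - index_ids
    orphaned_in_index = index_ids - repo_ids
    return missing_from_index, orphaned_in_index


# Hypothetical sample data for illustration only.
repo = {"doc-1", "doc-2", "doc-3"}
index = {"doc-2", "doc-3", "doc-4"}

to_index, to_delete = reconcile(repo, index)
print("re-index:", sorted(to_index))
print("delete from index:", sorted(to_delete))
```

At the volumes discussed in the deck, a real job would compare IDs in ranges or batches rather than holding full sets in memory.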
18
Alfresco as Big Content Platform: go big or go home…
19
Architecture (diagram): Legacy System; Workload Scheduler; Data Processing Engine; Solr Search; Secure & Flexible Content Repository; Various Documents; 15,000+ Docs/Day; 200,000+ Ingestion/Hr; Ingestion Rate 25/second
20
Software: Operating system = Ubuntu Server 12.04.2; ECM = Alfresco EE 4.1.4; Database = Oracle RAC 11g 11.2.0.3; File storage = Veritas Cluster File System; Search = SolrCloud (Apache Solr 4.3.1 and Apache ZooKeeper 3.4.5); Application server = Node.js (event-driven, non-blocking I/O model for data-intensive real-time applications that run across distributed devices); PCL-to-PDF converter = PageTech; ESB = Oracle Service Bus
21
Data Processing Engine (DPE): central controller/broker; handles document ingestion into Alfresco, including pre-processing, splitting and metadata extraction; brokers index updates by receiving and queuing real-time content updates from the ECM and pushing them to the SolrCloud index at a later stage
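The brokering role described above (receive updates, queue them, push them to the index later) can be sketched as a small in-memory queue with batched flushes. This is an illustration only, not the DPE's code: `IndexUpdateBroker`, `push_batch` and the update fields are hypothetical names, and a production broker would use a durable queue.

```python
from collections import deque
from typing import Callable


class IndexUpdateBroker:
    """Queues content updates and pushes them to an index in batches."""

    def __init__(self, push_batch: Callable[[list], None], batch_size: int = 100):
        self._queue = deque()
        self._push_batch = push_batch  # stand-in for the call that updates the index
        self._batch_size = batch_size

    def receive(self, update: dict) -> None:
        """Accept an update from the repository without touching the index yet."""
        self._queue.append(update)

    def flush(self) -> int:
        """Push all queued updates in batches; return how many were pushed."""
        pushed = 0
        while self._queue:
            size = min(self._batch_size, len(self._queue))
            batch = [self._queue.popleft() for _ in range(size)]
            self._push_batch(batch)
            pushed += len(batch)
        return pushed


# Usage: collect the batches in a list instead of calling a real index.
sent_batches = []
broker = IndexUpdateBroker(sent_batches.append, batch_size=2)
for doc_id in ("doc-1", "doc-2", "doc-3"):
    broker.receive({"id": doc_id, "action": "update"})
pushed_count = broker.flush()
print(pushed_count, [len(batch) for batch in sent_batches])
```

Decoupling `receive` from `flush` is what lets ingestion continue at full speed while the index catches up later.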
22
DPE Highlights
Asynchronous I/O: enables high data ingestion/export throughput
Flexible consistency models: eventual consistency for batch operations, transactional consistency for online operations
Distributed processing model: the DPE can scale horizontally by distributing processing across multiple nodes, with coordination handled through a message/event bus
Extensible synchronization: synchronization can be extended to multiple indexing engines that support additional operations such as statistical and analytical queries, semantic search (RDF/SPARQL), or graph traversals
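The asynchronous I/O point can be illustrated with a bounded-concurrency ingestion loop. The deck names Node.js as the runtime; the sketch below uses Python's `asyncio` purely as a stand-in for the same non-blocking model. `ingest_document` is a hypothetical placeholder that sleeps instead of calling a repository, and `max_in_flight` is an assumed tuning knob.

```python
import asyncio


async def ingest_document(doc_id: str) -> str:
    """Placeholder for a non-blocking upload to the repository."""
    await asyncio.sleep(0.01)  # simulates waiting on network or disk I/O
    return f"stored:{doc_id}"


async def ingest_all(doc_ids: list, max_in_flight: int = 10) -> list:
    """Ingest documents concurrently, capping the number of open requests."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(doc_id: str) -> str:
        async with semaphore:
            return await ingest_document(doc_id)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(doc_id) for doc_id in doc_ids))


results = asyncio.run(ingest_all([f"doc-{i}" for i in range(25)], max_in_flight=5))
print(len(results), results[0])
```

While one upload waits on I/O, others proceed, so throughput is limited by the cap on in-flight requests rather than by the latency of a single request.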
23
Custom SolrCloud Integration: highly scalable (production use cases of 3+ billion documents on such a setup); a date-range-based sharding policy can be implemented; multiple Alfresco repositories can use the same SolrCloud instance
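A date-range-based sharding policy could look like the sketch below, which maps a document date to a shard name and lists the shards a date-range search needs to touch. The shard naming scheme and the six-month period are assumptions chosen for illustration; the deck does not say how the shards were actually defined, and this is not Solr configuration.

```python
from datetime import date


def shard_for(doc_date: date, months_per_shard: int = 6) -> str:
    """Name of the shard that holds documents dated doc_date.

    Assumes months_per_shard divides 12 so periods never span two years.
    """
    period = (doc_date.month - 1) // months_per_shard + 1
    return f"docs_{doc_date.year}_p{period}"


def shards_for_range(start: date, end: date, months_per_shard: int = 6) -> list:
    """Shards a search over [start, end] has to query, in date order."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        name = shard_for(date(year, month, 1), months_per_shard)
        if name not in names:
            names.append(name)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


print(shard_for(date(2013, 11, 13)))
print(shards_for_range(date(2013, 5, 14), date(2013, 11, 13)))
```

With this kind of policy, a search limited to a six-month window touches at most two shards, however large the full archive grows.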
24
Veritas Cluster File System. Performance scaling: cluster with multi-core processors, large memories and multiple high-performance gigabit Ethernet interfaces for client access. File system scaling: supports individual file systems of up to 256 terabytes and up to a billion files per file system, with no practical limit on the number of file systems hosted by a cluster
25
Physical Separation of Files: physical storage (file system) must be isolated per county to meet compliance requirements; an “Alfresco Content Store Selector” is configured for each county; the County ID metadata field is the key that selects the store
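The routing idea behind per-county separation is a lookup from the County ID metadata value to a dedicated storage root. The sketch below shows only that idea: it is not Alfresco configuration or API code, and the `countyId` key, the county IDs and the paths are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical mapping: one isolated storage root per county.
COUNTY_STORES = {
    "county-001": PurePosixPath("/stores/county-001"),
    "county-002": PurePosixPath("/stores/county-002"),
}


def store_root_for(metadata: dict) -> PurePosixPath:
    """Pick the storage root for a document from its County ID metadata."""
    county_id = metadata["countyId"]
    try:
        return COUNTY_STORES[county_id]
    except KeyError:
        # Fail closed: never fall back to a shared store for an unknown county.
        raise ValueError(f"no content store configured for county {county_id!r}") from None


print(store_root_for({"countyId": "county-002", "title": "Benefit letter"}))
```

Raising an error for an unknown county, rather than defaulting to a shared location, keeps a misconfigured document from landing in another county's storage.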
26
ECM Features: audit trail; high availability (fail-over); backup policies; business rules on Spaces (folders); Share Sites; folders for taxonomy-specific storage
27
Deployment
28
Hardware: 9 servers (2 Alfresco; 4 Solr, split as 2 Solr and 2 ZooKeeper; 2 Data Processing Engine; 1 PDF converter); 120 GB RAM; 72 CPU cores; 16 TB file system storage per annum
29
Solution Benefits: the architecture is designed for horizontal scalability, so it can grow with future requirements; the proposed design supports performance SLAs for the expected load and number of users; the design is modular, so individual components can be replaced in future if needed
30
#SummitNow = Big Content Platform Conclusion
31
About CIGNEX Datamatics: since 2000, delivering Open Source solutions for the enterprise through adoption and integration to address business goals, lower the cost of doing business and gain competitive advantage. Offerings: Portal Solutions; Content Solutions; Big Data Analytics Solutions. In numbers: 400+ Open Source solutions; 450+ Open Source experts; 200+ Open Source integrations; 13+ books on Open Source; 5000+ community contributions
32
Where we help our customers… User eXperience Platform (Portals): Liferay, Drupal, JBoss, ZK, HTML5, MuleSoft; Intranet, Extranet, EAI, SOA, Social Collaboration, Big Data Portal, Mobile Portal. Enterprise Content Management (Content): Alfresco, Adobe CQ, Drupal, Magento, JBoss, Moodle, Ephesoft, Liferay; WCM, DM, RM, CMS, DAM, e-Commerce, e-Learning, ERP, Imaging Solutions. Making Data Work (Big Data Analytics): Hadoop Ecosystem, MongoDB, Neo4j, Pentaho, Talend, Solr, Jaspersoft; Data Integration, Information Delivery, Data Analysis. Managed Cloud Services: Develop, Deploy, Manage. VAR/Annual Product Subscription: Liferay, Alfresco, Cloudera Hadoop, MongoDB