Modernising CERN Document Conversion service

Slides:



Advertisements
Similar presentations
using SharePoint Automation Services;
Advertisements

Version 6.1. Old Vs New Input Format - PDF Meta-Scan & Structured Folder Output Features & Benefits Examples – Other Verticals Content.
MS Exchange and MS SharePoint Connectors Version
Altman IM Ltd | | capture | convert | route | connect | workflow ScanPath provides enhanced scanning & document processing.
SSRS 2008 Architecture Improvements Scale-out SSRS 2008 Report Engine Scalability Improvements.
WSUS Presented by: Nada Abdullah Ahmed.
Client Lunch & Learn (12:15). Association for Information & Image Management Nov Research Scanner Utilization.
ManageEngine TM Applications Manager 8 Monitoring Custom Applications.
EASY LOGISTICS CENTER - the TURNTABLE for information, documents and processes EASY LOGISTICS CENTER DOCUMENTS SHOP CONTENT COMMUNITY MODULES EASY ENTERPRISE.
IRISDocument Server IRISPowerscan IRISCapture Pro/X4D Alone or together to meet your needs Alejandro Grüssi VAR / OEM Account Manager.
SHAREPOINT CONNECTOR What’s new in SharePoint 2010 Market SharePoint is widely adopted by all types of companies. Kyocera offers a Simple but feature.
Enterprise Reporting with Reporting Services SQL Server 2005 Donald Farmer Group Program Manager Microsoft Corporation.
ENTERPRISE JOB SCHEDULER SAJEEV RAMAKRISHNAN 29 AUG 2014.
WSS 3.0 Architecture and Enhancements Ashvini Shahane Member – Synergetics Research Lab.
First Indico Workshop Conversion Server Thomas Baron May 2013 CERN.
ViciDocs for BPO Companies Creating Info repositories from documents.
DONE-10: Adminserver Survival Tips Brian Bowman Product Manager, Data Management Group.
Copyright © 2006 Pilothouse Consulting Inc. All rights reserved. Overview Scale out architecture Servers, services, and topology in Central Administration.
Microsoft SharePoint Server 2010 for the Microsoft ASP.NET Developer Yaroslav Pentsarskyy
CAS Lightning Talk Jasig-Sakai 2012 Tuesday June 12th 2012 Atlanta, GA Andrew Petro - Unicon, Inc.
CERN IT Department CH-1211 Geneva 23 Switzerland t Daniel Gomez Ruben Gaspar Ignacio Coterillo * Dawid Wojcik *CERN/CSIC funded by Spanish.
ArcGIS Server for Administrators
CASTOR evolution Presentation to HEPiX 2003, Vancouver 20/10/2003 Jean-Damien Durand, CERN-IT.
1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.
Future home directories at CERN
CERN IT Department CH-1211 Geneva 23 Switzerland t CF Computing Facilities Agile Infrastructure Monitoring CERN IT/CF.
Afresco Overview Document management and share
Satisfy Your Technical Curiosity 27, 28 & 29 March 2007 International Convention Center (ICC) Ghent, Belgium.
INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.
Reconfigurable Communication Interface Between FASTER and RTSim Dec0907.
Distributed Logging Facility Castor External Operation Workshop, CERN, November 14th 2006 Dennis Waldron CERN / IT.
CERN IT Department CH-1211 Genève 23 Switzerland t Migration from ELFMs to Agile Infrastructure CERN, IT Department.
The 2007 Microsoft Office System Servers Enterprise Content Management, Workflow and Forms Martin Parry Developer and Platform Group, Microsoft Ltd
The best of WF 4.0 and AppFabric Damir Dobric MVP-Connected System Developer Microsoft Connected System Division Advisor Visual Studio Inner Circle member.
CERN IT-Storage Strategy Outlook Alberto Pace, Luca Mascetti, Julien Leduc
The Holmes Platform and Applications
ArcGIS for Server Security: Advanced
ADVANCED HOSTING Adrian Newby, CTO.
Progress Apama Fundamentals
ICE Integrated Cloud Environment Cloud Scanning and Mobile Printing
Jean-Philippe Baud, IT-GD, CERN November 2007
Dispatcher Phoenix Is…
ONAP on Vagrant for ONAPers
Web application hosting with Openshift, and Docker images
The effort-saving, cost-cutting, low-overhead, cloud capture platform.
Web application hosting with Openshift, and Docker images
Leveraging the Business Intelligence Features in SharePoint 2010
Federation made simple
Netscape Application Server
PLM, Document and Workflow Management
CMS High Level Trigger Configuration Management
Web Hosting with OpenShift
StratusLab Final Periodic Review
StratusLab Final Periodic Review
Platform as a Service.
Integration of Network Services Interface version 2 with the JUNOS Space SDK
The Power Of Generic Infrastructure
Chapter 3: Windows7 Part 4.
Dev Test on Windows Azure Solution in a Box
Lecture 1: Multi-tier Architecture Overview
Introduction to Apache
Exploring the Power of EPDM Tasks - Working with and Developing Tasks in EPDM By: Marc Young XLM Solutions
Automated Bulk Signing Solution
Converter for IIS and SharePoint Converts s into SharePoint list items 24/7 Creates SharePoint list items from s
Office Edition Overview (Dec. 2018).
DBOS DecisionBrain Optimization Server
Licensing Overview May 2019.
MS Confidential : SharePoint 2010 Developer Workshop (Beta1)
DIBBs Brown Dog BDFiddle
Presentation transcript:

Modernising CERN Document Conversion service Ruben Gaspar, IT-CDA-IC 20/10/2017 HEPiX workshop at KEK

Agenda What this service is about Old service New service 20/10/2017 HEPiX workshop at KEK

Aim/Mandate The mandate of the document Conversion Service is to provide automatic conversion of documents, usually office documents, towards a different format usually PDF or PDF/A. This is accomplished nowadays using a REST API. Workflow is asynchronous. Evolution to different formats (input or output) is possible. 20/10/2017 HEPiX workshop at KEK

Agenda What this service is about Old service New service 20/10/2017 HEPiX workshop at KEK

Old converter In operation since 10 years Multithreading application but only works with 1 thread Programmed in Python2 Missing any coding convention Missing any structure Very reduced logging and no testing framework Based on DFS as a backend to coordinate jobs and store input and output documents Running on Microsoft IIS server on the worker nodes (no load balancing) 20/10/2017 ASDF

Statistics Indico & EDMS workload 20/10/2017 HEPiX workshop at KEK

Agenda What this service is about Old service New service 20/10/2017 HEPiX workshop at KEK

New service guidelines design Keep as much as possible same interface with user community and converters admins Re-design program architecture Object Oriented Adding new conversions should be as simple as possible Exe type: HPGL converter, Tesseract-ocr COM type: Neevia REST type Renew technology: using IT services as much as possible IT services: ES, Openstack, Openshift, EOS & DBoD 20/10/2017 HEPiX workshop at KEK

New architecture 3. https callback Clients REST API Tesseract-OCR PLT https://doconverter.web.cern.ch REST API Tesseract-OCR PLT Real-time stats fuse fuse SMB 1. Accounting 2. fuse Possible in the future! 20/10/2017 HEPiX workshop at KEK

Accessing the service 20/10/2017 HEPiX workshop at KEK

New Doconverter OSS: code at Github[1] Developed in Python3 Flask, Flask-sqlalchemy,… Multi-process application 50% faster than old system while processing same workload Scalability/redundancy by adding more worker nodes or using some features of the infrastructure e.g. Openshift autoscale, HAProxy Doc based on gitbook[2] [1] https://github.com/CERNCDAIC/doconverter [2] https://cern.ch/docdocs or https://cernbox.cern.ch/index.php/s/Sb8lLIVgMBwqSzR (PDF book) 20/10/2017 HEPiX workshop at KEK

doconverter.ini: how to customize [default] extensions_all=doc,docx,ppt,pptx,xlsx,tif,htm,txt,png,jpg,pdf,plt prefix_dir=Y:\ archival_dir=Y:\ ca_bundle=c:\doconverter\cert\COMODO_OV_SHA-256_bundle.crt servers=doconvert01,doconvert02 doconvert01=doc,docx,ppt,pptx,xlsx,tif,htm,txt,png,jpg,pdf,ps,pdfa,thumb,toimg,plt,hpgl,tesocr,modiocr doconvert02=doc,docx,ppt,pptx,xlsx,tif,htm,txt,png,jpg,pdf,ps,pdfa,thumb,toimg,plt,hpgl,tesocr [manager] converters=Neevia,Hpglview_raster,Tesseract_ocr stopper=c:\doconverter\noconverter.txt [monitor] emails=admin@cern.ch tasksalert=50 smtpserver=XXXXX.cern.ch [Neevia] extensions_allowed=doc,docx,ppt,pptx,xlsx,tif,htm,txt,png,jpg,pdf output_allowed=pdf,png,ps,pdfa,thumb,toimg,modiocr type=windows exe=dConverter.exe [Hpglview_raster] …. [database] host=XXXXX.cern.ch port=6607 db=dbconverter user=postgresql://dbconverter password=xxxxx [test] url=https://xxxxx/doconverter/api/v1.0/uploads url_response=http://XXXX/doconverter/api/v1.0/received diresponse=c:\doconverter\testing files=c:\doconverter\files\wordfile.docx,c:\doconverter\files\excelfile.xlsx,c:\doconverter\files\htmfile.htm Accepted input extensions Accepted input and output extensions by a server. Asymmetric configuration possible Converters configuration logging Functional testing 20/10/2017 HEPiX workshop at KEK

Possible conversions NEW! From documentation: https://cernbox.cern.ch/index.php/s/Sb8lLIVgMBwqSzR 22/06/2017 ASDF

Real time stats: elasticsearch ES cluster to analyse and extract some info from the document converter servers logs Puppet module[1] to configure logstash Done to scale: being used by other services: E-mail, AVC SSL enforced through out the whole network path [1] https://gitlab.cern.ch/ai/it-puppet-hostgroup-doconverter 20/10/2017 HEPiX workshop at KEK

Dashboard view 20/10/2017 HEPiX workshop at KEK

EOS as a file system backend Allows to keep state among the different players Some instabilities while accessing it via SMB Following that with EOS and Network support Using the sync client[1] from Windows worker nodes for a more stable solution → increase latency [1] https://cernbox.cern.ch/cernbox/doc/clients.html 20/10/2017 HEPiX workshop at KEK

OCR & ICR Number of programs tested (see next) A number of OCR solutions offered good results ICR is still not mature Resolution is key, at least 200x200 dpi Conversion Service offers two possible solution “modiocr” and “tesocr” ICR view 20/10/2017 HEPiX workshop at KEK

OCR & ICR 20/10/2017 HEPiX workshop at KEK Product OCR ICR (handwriting) remarks PDF Editor 6 Professional N/A It didn’t work either using Windows 7 or 2012R2. Support didn’t provide any info CVISION: PDFCompressor Professional Edition v6.5.1177 Ok Bad Accepts PDF and images. Results as PDF or Word doc. Watch folder mechanism http://www.i2ocr.com/ Admits images, Not easy to work with. No RESTFul api http://www.smartocr.com/  Admits PDF. No RESTFul API but a exe program A2iA: TextReader OK ~OK It should work with proper handwritten docs and resolution Tesseract-OCR Open source, works with images only. Microsoft MODI (Module of sharepoint designer) High integration with Neevia. It works with Image and PDF files. 20/10/2017 HEPiX workshop at KEK

Conclusions Faster and more feature-rich system that relies on IT services Simpler to evolve the service, adding more conversions E.g. posters (PDF/X) Possible to integrate with other services e.g. CERNBox 20/10/2017 HEPiX workshop at KEK

Acknowledgements IT colleagues in general! And specially: Thomas Baron (IT-CDA) for all Document Conversion support Pablo Saiz (IT-CM) for his help on setting up ES cluster(s) Alberto Rodriguez (IT-CDA) for all Openshift support EOS support guys! 20/10/2017 HEPiX workshop at KEK