Appraisal and Data Mining of Large Size Complex Documents
Rob Kooper, William McFadden and Peter Bajcsy
National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign (UIUC)

Acknowledgments
This research was partially supported by a National Archives and Records Administration supplement to NSF PACI cooperative agreement CA #SCI.

Abstract
This poster addresses the problems of comprehensive document comparison and of the computational scalability of document mining using cluster computing and the Map/Reduce programming paradigm. While the volume of contemporary documents and the number of embedded object types have been growing steadily, there is a lack of understanding of (a) how to compare documents containing heterogeneous digital objects, and (b) what hardware and software configurations are cost-efficient for handling document processing operations such as document appraisals. The novelty of our work lies in designing a methodology and a mathematical framework for comprehensive document comparisons that include the text, image and vector graphics components of documents, and in supporting decisions about using the Hadoop implementation of the Map/Reduce paradigm to perform counting operations.

Motivation
From the Strategic Plan of the National Archives and Records Administration: "Assist in improving the efficiency with which archivists manage all holdings from the time they are scheduled through accessioning, processing, storage, preservation, and public use." Our motivation is to support answering appraisal criteria related to document relationships, the chronological order of information, storage requirements and the incorporation of preservation constraints (e.g., storage cost).

Experiments
For illustration purposes we used the NASA Columbia accident report on the causes of the Feb. 1, 2003 Space Shuttle accident. The report is 10 MB (10,330,897 bytes) and contains 248 pages with 179,187 words, 236 images (average image size 209x188 pixels), and 30,924 vector graphics objects. We compared the time needed to extract occurrence statistics from the Columbia report using Hadoop with the time needed by a stand-alone application (SA).

Conclusions
The graph shows the execution times in milliseconds (y-axis) needed to extract occurrences of all PDF elements using the CCT and NCSA clusters and multiple data splits. The number of nodes used in the NCSA cluster ranged between 1 and 4. The results provide input into decision support for hardware and software investments in domains that process large volumes of complex documents.

Objectives
Design a methodology, algorithms and a framework for conducting comprehensive document appraisals by:
- enabling exploratory document analyses and integrity/authenticity verification,
- supporting automation of appraisal analyses,
- evaluating the computational and storage requirements of computer-assisted appraisal processes.

Proposed Approach
Decompose the series of appraisal criteria into a set of focused analyses:
- find groups of records with similar content,
- rank records according to their creation/last-modification time and digital volume,
- detect inconsistencies between ranking and content within a group of records,
- compare sampling strategies for the preservation of records.
A sketch of the per-component counting and weighted pair-wise comparison underlying these analyses follows below.
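The focused analyses above rest on two building blocks: occurrence histograms of extracted PDF elements and a weighted combination of per-component similarities. The following stand-alone Java sketch illustrates both; the class and method names, the histogram-intersection measure and the linear weighting are illustrative assumptions, not the authors' exact mathematical formulation.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal stand-alone (SA) sketch: build occurrence histograms for
 * pre-extracted PDF elements and compare two documents with a weighted
 * similarity over their text, image and vector-graphics components.
 * Names and the specific similarity measure are illustrative assumptions.
 */
public class DocumentComparison {

  /** Count how often each element (word, quantized color, vector element type) occurs. */
  static Map<String, Integer> occurrences(Iterable<String> elements) {
    Map<String, Integer> counts = new HashMap<>();
    for (String e : elements) {
      counts.merge(e, 1, Integer::sum);
    }
    return counts;
  }

  /** Normalized histogram intersection between two occurrence maps, in [0, 1]. */
  static double histogramSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
    long overlap = 0, total = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      overlap += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
      total += e.getValue();
    }
    for (int v : b.values()) {
      total += v;
    }
    return total == 0 ? 0.0 : 2.0 * overlap / total;
  }

  /** Weighted pair-wise document similarity over the text, image and vector components. */
  static double documentSimilarity(double textSim, double imageSim, double vectorSim,
                                   double wText, double wImage, double wVector) {
    double wSum = wText + wImage + wVector;
    return (wText * textSim + wImage * imageSim + wVector * vectorSim) / wSum;
  }
}
```

In this sketch, the per-component similarities of two documents would be computed by comparing the corresponding occurrence histograms (words, quantized colors, vector-graphics element types) and then combined with documentSimilarity using the chosen component weights.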
Methodology
The methodology starts with pair-wise comparisons of the text, image (raster) and vector graphics components, computes their weights, establishes the group relationships to permanent records, and then focuses on integrity verification and sampling.
[Figure: group relationships to permanent records. Integrity verification applies to two or more document versions within one group; sampling applies to document versions in groups 1 and 2.]

Hadoop
Hadoop Map/Reduce is a software framework for writing applications that decompose their operations into Map and Reduce phases and process vast amounts of data in parallel. Hadoop-based applications run on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Configurations of Hadoop Clusters
We used two different clusters. The first cluster, hadoop1, is located at NCSA and consists of four identical machines; it can run 20 Map tasks and 4 Reduce tasks in parallel. The second cluster, the Illinois Cloud Computing Testbed (CCT), is located in the Computer Science department at the University of Illinois and consists of 64 identical machines; it can run 384 Map tasks and 128 Reduce tasks in parallel.

Document Operations Suitable for Hadoop
Our goal is to count the occurrences of words per page, of colors in each image, and of vector graphics elements in the document. The counting operation is computationally intensive, especially for images, since each pixel is counted as if it were a word. While a page holds roughly 900 words, a relatively small image of size 209x188 is equivalent to about 44 pages of text. A minimal Hadoop sketch of this counting operation is shown below.

[Screenshots: exploratory view of color occurrences in a selected PDF file and its images (occurrence of colors, list of images, preview, loaded files, "ignore" colors); display of pair-wise document similarities; input PDF file viewed in Adobe Reader.]
[Graph: computational scalability using Hadoop, showing execution time [ms] versus data split [pages] for each cluster configuration.]
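Each counting task above amounts to emitting an (element, 1) pair and summing the pairs per element, which maps directly onto Hadoop. The sketch below follows the standard Hadoop WordCount pattern (the org.apache.hadoop.mapreduce API); it assumes the PDF elements (words, quantized pixel colors, vector-graphics element types) have already been extracted into whitespace-separated tokens in plain-text input, and the class name ElementCount is a hypothetical placeholder rather than the authors' actual code.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ElementCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text element = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Each whitespace-separated token is one pre-extracted PDF element
      // (a word, a quantized pixel color, or a vector-graphics element type).
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        element.set(itr.nextToken());
        context.write(element, ONE);   // emit (element, 1)
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();              // total occurrences of this element
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "pdf element count");
    job.setJarByClass(ElementCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this would be submitted with, for example, hadoop jar elementcount.jar ElementCount <input path> <output path> (the jar name and paths are placeholders); varying the number of pages per input split yields the different data-split configurations whose execution times are compared in the experiments.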