DocumentParser: November, 2013.



Introduction To documentParser

What is documentParser

documentParser takes documents (MS Office, PDF, Outlook email, HTML, ePub, txt, jpg) from Hadoop or Aster and uses Aster to parse and tokenize them.

Installation

Installation is simple and has no prerequisites. To install in nCluster, run \install from ACT using the zip file in the deployment directory:

beehive=> \install documentParser.zip

Install the zip file as-is. Do not unzip it before installing, and do not install the jar.

Description

documentParser is a map function that pulls a variety of document files stored in HDFS (Hadoop Distributed File System) or Aster (as of 2013-04-25) and parses them using nCluster. The parsed files can be output in one of four modes, selected with the parseMode argument:

'text' extracts the plain text portion of a document and outputs it as a single varchar field.
'tokenize' is like 'text', but outputs each word on a separate row.
'email' parses out the TO, FROM, CC, BCC, SUBJECT, and BODY fields and outputs each as a separate column. It works on plain text RFC emails as well as Outlook .msg files.
'image' parses out EXIF metadata such as focal length, exposure time, and ISO.

In addition to the columns above, every one of the four modes outputs a "filename" column containing the full path and filename of the HDFS file. Thus 'text' emits one row and two columns ("filename" and "content") for each document parsed; 'email' and 'image' emit one row and multiple columns for each document or image; 'tokenize' emits many rows and two columns ("filename" and "word") for each document parsed.

Under the covers, documentParser uses Apache Tika to parse documents. This simplifies the MR implementation, since Tika both detects the document format and extracts the plain text portion of the document. Tika also supports a wide variety of formats, including all Microsoft Office documents (Office '97-2010, including doc, docx, ppt, pptx, xls, xlsx), Outlook msg files, Outlook pst files, PDF, HTML, ePub, RDF, plain text, Apple iWork, image files (TIFF, jpg, etc.), and more.

Document files are loaded into HDFS through conventional Hadoop loading methods such as the "put" command, e.g.:
hadoop dfs -put myWordDoc.docx /user/hadoop/
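To make the row shapes of the 'text' and 'tokenize' modes concrete, here is a minimal Python sketch. It only illustrates the output layout described above and is not documentParser's implementation: the function names are hypothetical, and splitting on whitespace is an assumption about what counts as a word.

```python
from typing import List, Tuple


def text_mode_rows(filename: str, content: str) -> List[Tuple[str, str]]:
    """'text' mode shape: one ("filename", "content") row per document."""
    return [(filename, content)]


def tokenize_mode_rows(filename: str, content: str) -> List[Tuple[str, str]]:
    """'tokenize' mode shape: one ("filename", "word") row per word.

    Whitespace splitting is an assumption made for this sketch only.
    """
    return [(filename, word) for word in content.split()]


if __name__ == "__main__":
    path = "/user/hadoop/myWordDoc.docx"
    plain_text = "Quarterly sales report"  # stand-in for the extracted plain text
    print(text_mode_rows(path, plain_text))
    for row in tokenize_mode_rows(path, plain_text):
        print(row)
```

Both shapes repeat the filename on every row, which is what lets tokenized words be joined back to their source document.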

Table Driven Method

A table in nCluster (specified in the SQL/MR call, e.g. "hadoop_files") has a column containing the filenames of all documents to be parsed. This table should be hash partitioned to take advantage of parallel document parsing. Each filename should be the complete (absolute) path to the file in HDFS. Optionally, the table can carry other columns, such as a unique integer ID, to help keep track of the tokenized document. All input columns appear in the final output except the filename column, which is replaced by the tokenized word column. A filename column still appears in the output, but it is always named "filename": if your original filename column is "my_file_name", an equivalent column is in the final output under the name "filename". A typical input table to this SQL/MR function looks like this:

HDFS Directory Pattern Method

With the HDFS Directory and Pattern Method, the user specifies a "directory" argument and, optionally, a "pattern". If the pattern is omitted, the default pattern '.*' is used, which returns all files in the HDFS directory. When an HDFS directory is specified, the filenameCol argument is optional. If it is omitted, the MR table can be an empty (zero-row) hash partitioned table, or it can be populated with rows, which are ignored. If both "directory" and "filenameCol" are specified, the MR function operates in hybrid mode.
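The effect of the pattern and its '.*' default can be sketched in Python. This is an illustration under assumptions, not the function's code: it assumes the pattern is a regular expression tested against the whole file name, and select_files and the directory listing are hypothetical.

```python
import re
from typing import Iterable, List

DEFAULT_PATTERN = ".*"  # default named in the text: matches every file


def select_files(names: Iterable[str], pattern: str = DEFAULT_PATTERN) -> List[str]:
    """Return the file names that match `pattern` in full.

    Assumption: the pattern is a regular expression tested against the
    whole file name; the real function may match differently.
    """
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


if __name__ == "__main__":
    listing = ["report.docx", "notes.txt", "scan.pdf"]  # hypothetical directory listing
    print(select_files(listing))                # no pattern given: every file
    print(select_files(listing, r".*\.docx"))   # only Word documents
```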

Hybrid Method

When both "directory" and "filenameCol" are specified, the MR function operates in hybrid mode. In this mode, the HDFS directory is queried and all files matching the pattern are parsed. In addition, the documents listed in the filenameCol column of the hash partitioned table are fed into the MR function. Think of it as a UNION ALL of the Table Driven and HDFS Directory methods.
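The UNION ALL analogy can be shown with two plain lists in Python. The function name and the sample paths are hypothetical. If the analogy holds exactly, a file present in both sources is listed twice; the source does not say whether the function removes such duplicates.

```python
from typing import List


def hybrid_inputs(directory_matches: List[str], table_filenames: List[str]) -> List[str]:
    """Combine both sources the way UNION ALL would: concatenate, keep duplicates."""
    return list(directory_matches) + list(table_filenames)


if __name__ == "__main__":
    from_directory = ["/user/hadoop/a.docx", "/user/hadoop/b.pdf"]  # hypothetical pattern matches
    from_table = ["/user/hadoop/b.pdf", "/archive/c.msg"]           # hypothetical table rows
    for path in hybrid_inputs(from_directory, from_table):
        print(path)
```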

Read from Aster Method

When "documentCol" is specified, the MR function reads document contents from that column. The contents can be plain text (useful for text and email) or base64 encoded (recommended; works for all formats, including plain text, email, and all binary formats). By default, the contents are assumed to be base64 encoded in a varchar column. If they are plain text (such as an email), specify documentCol_base64 ('false') to disable the base64 decoder inside the MR code. The contents are read from the column and parsed using Tika. The output has all of the input columns except documentCol, which is replaced by one or more columns containing the parsed contents, depending on the mode.
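A document's bytes have to be base64 encoded before they are loaded into the varchar column. The Python sketch below shows one way to do that with the standard base64 module. The helper names are hypothetical, and the sketch does not cover how the encoded string is loaded into Aster.

```python
import base64


def encode_document(raw: bytes) -> str:
    """Base64-encode raw document bytes into ASCII text suitable for a varchar column."""
    return base64.b64encode(raw).decode("ascii")


def decode_document(encoded: str) -> bytes:
    """Reverse of encode_document: recover the original bytes."""
    return base64.b64decode(encoded)


if __name__ == "__main__":
    raw = b"%PDF-1.4 example bytes"  # stand-in for a real file's contents
    encoded = encode_document(raw)
    print(encoded)
    assert decode_document(encoded) == raw  # the round trip is lossless
```

Because the encoded form is plain ASCII, the same path works for binary formats and for plain text alike, which matches the recommendation to use base64 for everything.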

Examples
