
1 documentParser: November, 2013

2 Introduction to documentParser

3 What is documentParser
documentParser takes documents (MS Office, PDF, Outlook email, HTML, ePub, txt, jpg) from Hadoop or Aster and uses Aster to parse and tokenize them.

4 Installation
INSTALLATION AND PREREQUISITES
Installation is simple. Prerequisites: none. To install in nCluster, run \install from ACT using the zip file in the deployment directory. DO NOT UNZIP THE FILE PRIOR TO INSTALL. Install the zip file as-is: install the "zip", do not install the "jar". Example:
beehive=> \install documentParser.zip

5 Description
documentParser is a map function that pulls a variety of document files stored in HDFS (Hadoop Distributed File System) or Aster and parses them using nCluster. The parsed files can be output in one of four modes by specifying a parseMode of 'text', 'tokenize', 'email', or 'image'. 'text' extracts the plain text portion of a given document and outputs it as a single varchar field. 'tokenize' is like 'text', but it outputs each word on a separate line. 'email' parses out TO, FROM, CC, BCC, SUBJECT, and BODY fields and outputs each as a separate column; it works on plain text RFC emails as well as Outlook .msg files. 'image' parses out EXIF metadata such as focal length, exposure time, ISO, etc.
In addition to the columns mentioned above, all of the modes output a "filename" column containing the full path and filename of the HDFS file; it is populated in all four operating modes. Thus, 'text' emits one row and two columns ("filename" and "content") for each document being parsed; 'email' and 'image' emit one row and multiple columns for each document or image; and 'tokenize' emits many rows and two columns ("filename" and "word") for each document being parsed.
Under the covers, documentParser uses Apache Tika to parse documents. This simplifies the MR implementation since Tika both detects the document format and extracts the plain text portion from the document. Tika supports a wide variety of formats, including all Microsoft Office documents (doc, docx, ppt, pptx, xls, xlsx), Outlook msg files, Outlook pst files, PDF, HTML, ePub, RDF, plain text, Apple iWorks, image files (TIFF, jpg, etc.), and more. See the Apache Tika documentation for the complete list of supported formats.
Document files are loaded into HDFS through conventional Hadoop loading methods such as the "put" command, e.g.:
hadoop dfs -put myWordDoc.docx /user/hadoop/
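To make the four modes concrete, here is a minimal invocation sketch in Aster SQL/MR syntax. The table name hadoop_files and column name file_name are illustrative; filenameCol and parseMode are the arguments described in this deck, but check your Aster documentation for the exact clause syntax:

SELECT *
FROM documentParser(
    ON hadoop_files                -- hash partitioned table of HDFS paths
    filenameCol('file_name')       -- column holding the absolute HDFS paths
    parseMode('tokenize')          -- one of 'text', 'tokenize', 'email', 'image'
);

With parseMode('tokenize'), this returns a "filename" and a "word" column, one row per word per document.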

6 Table Driven Method
A table in nCluster (specified in the SQL/MR call, e.g. "hadoop_files") has a column containing the filenames of all the documents to be parsed. This table should be hash partitioned to take advantage of parallel document parsing. Each filename should be a complete (absolute) path to a file sitting in HDFS. Optionally, other columns such as a unique integer ID can exist in this table to help keep track of the tokenized document. All columns of the input table carry through to the output except the filename column, which is replaced by a tokenized word column. Note that a filename column will still appear in the output, always named "filename": if your original filename column is "my_file_name", an equivalent column will be in the final output but named "filename". A typical input table to this SQL/MR function looks like this:
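The original slide's sample table is not reproduced here. As a minimal sketch, an equivalent table definition in Aster SQL might look like this (names are hypothetical; DISTRIBUTE BY HASH provides the hash partitioning):

CREATE TABLE hadoop_files (
    id        INTEGER,   -- optional unique ID to keep track of each document
    file_name VARCHAR    -- complete (absolute) HDFS path to the document
) DISTRIBUTE BY HASH(id);

Each row would hold one path, e.g. (1, '/user/hadoop/myWordDoc.docx').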

7 HDFS Directory Pattern Method
Using the HDFS Directory and Pattern Method, the user specifies a "directory" argument and, optionally, a "pattern". If the pattern is omitted, the default pattern '.*' is used, returning all files in the HDFS directory. When an HDFS directory is specified, the filenameCol argument is optional. If it is omitted, the input table can be an empty (zero row) hash partitioned table, or it can be populated with rows (which will be ignored). If both "directory" and "filenameCol" are specified, the MR function operates in hybrid mode.
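A minimal sketch of this mode, assuming a zero-row hash partitioned driver table named empty_driver (a hypothetical name) and an illustrative HDFS directory:

SELECT *
FROM documentParser(
    ON empty_driver                 -- empty (zero row) hash partitioned table
    directory('/user/hadoop/docs')  -- HDFS directory to scan
    pattern('.*\.docx')             -- optional; defaults to '.*' (all files)
    parseMode('text')
);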

8 Hybrid Method When "directory" and "filenameCol" are specified, the MR function operates in hybrid mode. In this mode, the HDFS directory is queried and all files matching the pattern are parsed. In addition, documents listed in filenameCol of the hash partitioned table are fed into the MR function. Think of it as a UNION ALL of the Table Driven and HDFS Directory methods.
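A hybrid-mode sketch, reusing the illustrative names from the examples above; both sets of arguments are as described in this deck:

SELECT *
FROM documentParser(
    ON hadoop_files                 -- rows listed in filenameCol are parsed...
    filenameCol('file_name')
    directory('/user/hadoop/docs')  -- ...and so is every file matching the pattern
    parseMode('text')
);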

9 Read from Aster Method When "documentCol" is specified, the MR function will read contents from this column. The contents can be plain text (useful for text and email) or base64 encoded (recommended, and useful for all formats including plain text, email, and all binary formats). By default, the contents are assumed to be base64 encoded in a varchar column. If they are plain text (like an email), you can specify documentCol_base64('false'), which disables the base64 decoder inside the MR code. Contents are read from the column and parsed using Tika. The output will have all of the input columns minus the documentCol, which is replaced with one or more columns containing the parsed contents, depending on the mode.
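A sketch of this mode, assuming a hypothetical table aster_docs whose doc_body varchar column holds plain text emails (documentCol and documentCol_base64 are the arguments described above):

SELECT *
FROM documentParser(
    ON aster_docs
    documentCol('doc_body')          -- column holding the document contents
    documentCol_base64('false')      -- contents are plain text, skip base64 decoding
    parseMode('email')
);

The output replaces doc_body with the parsed TO, FROM, CC, BCC, SUBJECT, and BODY columns.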

10 Examples

11 Examples

