Distribution A: Approved for public release; distribution is unlimited. Case Number: 88ABW-2015-1636, 31 Mar 2015 A Tool that Uses the SAS PRX Functions.


A Tool that Uses the SAS PRX Functions to Fix Delimited Text Files
By: Paul Genovesi

Paul Genovesi
Henry Jackson Foundation for the Advancement of Military Medicine, Inc.

Abstract

Delimited text files are often plagued by appended and/or truncated records. The file_fixing_tool can fix these files so they can be imported into SAS.

Objectives

- Delimited text file structure (both normal and broken)
- Truncated-only method
- Appended method using first field and last field patterns
- Text qualifying
- Common surrounding characters (CSCs)
- Before and after examples of fixed delimited text files

Delimited Text File Structure

Four structure types (1 normal and 3 broken), where [-----] = one record and a space marks the record separator.

Normal Records:
[-----]

Truncated Records:
[---- -] [-----] [--- --] [-----]
Note: Truncating occurs within records 1 and 3.

Appended Records:
[-----][-----] [-----][-----][-----] [-----] [-----][-----]
Note: Appending occurs after records 1, 3, 4, 7.

Appended & Truncated:
[-----][---- -] [-----] [--- --][-----][-----] [--- --]
Note: Appending occurs after records 1, 4, 5. Truncating occurs within records 2, 4, 7.

Truncated-Only Method

- For use on broken delimited text files containing truncated records but no appended records
- Within your broken text file, the first field of every record and its following field delimiter must occur on the same line (i.e., a record separator can’t occur between them)
- Does not use your last field and first field patterns
- Uses a built-in pattern and delimiter counting

Appended Method

Fixing appended records (with or without truncated records) is more difficult than fixing truncated records alone.
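To see why truncated records alone are the easier case, the delimiter counting behind the truncated-only method can be sketched in a few lines. This is a minimal illustration in Python, not the file_fixing_tool's SAS code: the function name `rejoin_truncated`, the comma delimiter, and the sample data are hypothetical, the tool's built-in pattern is not reproduced, and text-qualified fields containing the delimiter are ignored. The sketch assumes, as the method requires, that each record's first field and its following delimiter sit on the same line.

```python
def rejoin_truncated(lines, n_fields, delim=","):
    """Rejoin records that were split across physical lines (no appended records).

    A complete record holds n_fields - 1 delimiters. A new record starts only
    when the buffered text is already complete AND the next line contains a
    delimiter; a line without one can only be the tail of the last field.
    """
    expected = n_fields - 1
    records = []
    buffer = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if buffer is None:
            buffer = line
        elif buffer.count(delim) == expected and delim in line:
            records.append(buffer)   # buffered record is complete
            buffer = line            # this line opens the next record
        else:
            buffer += line           # continuation of a truncated record
    if buffer is not None:
        records.append(buffer)
    return records


broken = [
    "101,red,03/31/",        # record 1 truncated inside its last field
    "2015",
    "102,white,04/01/2015",
    "103,bl",                # record 3 truncated inside its second field
    "ue,04/02/2015",
]
print(rejoin_truncated(broken, n_fields=3))
# ['101,red,03/31/2015', '102,white,04/01/2015', '103,blue,04/02/2015']
```

Counting alone stops working once records are appended, because no delimiter marks where one record's last field ends and the next record's first field begins.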
The key is developing last field and first field patterns that identify and isolate either the last field or the first field from all other fields. The other fields are easily identified and isolated by their surrounding field delimiters, but there is no field delimiter separating one record’s last field from the following record’s first field.

Last Field, First Field Pattern Examples

Ex. #1: SSN
Contents: 9-digit string
Pattern: \d{9}

Ex. #2: SSN, blank fields
Contents: 9-digit string or blank
Pattern: (?:\d{9}| *)

Ex. #3: Categorical data, case-insensitive
Contents: red, white, blue, RED, WhiTe, bLUe, or blank
Pattern: (?i:red|white|blue| *)

Ex. #4: Number between 1 and [upper bound missing]
Contents: 1 to 7 digit character string or blank
Pattern: (?:\d{1,7}| *)

Ex. #5: Date
Contents: Date string (with format MM/DD/YYYY) or blank
Pattern: (?:\d{2}\/\d{2}\/\d{4}| *)

Common Surrounding Characters

CSCs are characters that are contained within a field (i.e., cell) and occur on the field’s outer edges, but still inside any existing text qualifiers.

The following CSCs can occur within text-qualified fields:
1. The field delimiter
2. Text-qualifier character pairs (i.e., a consecutive EVEN number of them, for example "", """", …)
3. The other (i.e., not being used) text qualifier character (for example, if " is being used, then the other is ' and vice versa)
4. The space character

The following CSCs can occur within non-text-qualified fields:
1. The double quote character
2. The single quote character
3. The space character

The file_fixing_tool automatically matches a continuous string of CSCs occurring on a field’s outer edges. Why are they automatically matched?

Reason #1: If they weren’t, then your last field and first field patterns would have to account for their existence.

Reason #2: Accounting for the existence of double or single quotes in a pattern contained in a macro variable can be tricky in terms of unmatched quote errors. It’s safer to let the file_fixing_tool take care of it.
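The pattern examples and the CSC idea can be tried out together in a short sketch. It uses Python's `re` module rather than the SAS PRX functions (the pattern syntax in these five examples is accepted by both), and everything beyond the patterns themselves is an assumption made for illustration: the dictionary keys, the helper names `field_matches` and `split_appended`, the comma delimiter, and the simplification that CSCs are peeled off only for non-text-qualified fields. It is not how the file_fixing_tool is implemented.

```python
import re

# Patterns copied from the five examples; the dictionary keys are hypothetical labels.
PATTERNS = {
    "ssn": r"\d{9}",
    "ssn_or_blank": r"(?:\d{9}| *)",
    "color_or_blank": r"(?i:red|white|blue| *)",
    "number_or_blank": r"(?:\d{1,7}| *)",
    "date_or_blank": r"(?:\d{2}\/\d{2}\/\d{4}| *)",
}

# CSCs for non-text-qualified fields: double quote, single quote, space.
CSCS_UNQUALIFIED = "\"' "


def field_matches(name, field):
    """True if the field, with edge CSCs peeled off, matches the named pattern in full."""
    core = field.strip(CSCS_UNQUALIFIED)
    return re.fullmatch(PATTERNS[name], core) is not None


def split_appended(line, last_pat, first_pat, delim=","):
    """Break a line wherever a last field is directly followed by a first field.

    Simplification: assumes neither pattern can match an empty string.
    """
    d = re.escape(delim)
    boundary = re.compile(rf"(?<={d})({last_pat})(?=(?:{first_pat}){d})")
    return boundary.sub(lambda m: m.group(1) + "\n", line).split("\n")


print(field_matches("ssn", "123456789"))              # True
print(field_matches("color_or_blank", ' "WhiTe" '))   # True
print(field_matches("date_or_blank", "3/31/2015"))    # False

two_records = "123456789,red,03/31/2015987654321,blue,04/01/2015"
print(split_appended(two_records, r"\d{2}\/\d{2}\/\d{4}", r"\d{9}"))
# ['123456789,red,03/31/2015', '987654321,blue,04/01/2015']
```

The split works here only because a date followed immediately by nine digits and a delimiter cannot occur inside a single record; patterns that also accept blanks would make the boundary ambiguous, which is why the non-blank forms are passed to `split_appended`.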
There are no known side effects to letting the file_fixing_tool match CSCs automatically.

Text Qualifying

- Occurs when a cell’s contents within a delimited text file are enclosed in double quotes or single quotes.
- Double quotes and single quotes cannot both be used as text qualifiers within the same file.
- A cell’s contents MUST be text qualified when (1) they contain the field delimiter or (2) the text qualifier being used also occurs within these contents (this text qualifier character must be escaped with another text qualifier character; in other words, two mean one).

Conclusions

Use the appended method if you are able to develop last field and first field patterns that identify and isolate the last and first fields. If you’re not able to do this, then you can still use the truncated-only method as long as your broken file contains only truncated records.

ACKNOWLEDGMENTS

In memory of Jan Abshire, who gave me an assignment dealing with exactly this issue.
A thank you to SAS Institute’s Adam Pilz, who along with Jan gave me the idea for this paper.

DISCLAIMER

The views expressed are those of the author and do not necessarily reflect the official policy or position of the Air Force, the Department of Defense, or the U.S. Government.
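The two text-qualifying rules from the Text Qualifying section (qualify a cell that contains the field delimiter or the qualifier itself, and double the qualifier inside it so that "two mean one") can be observed with Python's standard `csv` module. This is only an illustration of the file format, with made-up sample rows; it says nothing about how the file_fixing_tool or SAS handles qualifiers internally.

```python
import csv
import io

# Hypothetical sample rows: one cell contains the qualifier, one contains the delimiter.
rows = [
    ["101", 'He said "hi"', "a,b"],
    ["102", "plain", "x"],
]

buf = io.StringIO()
writer = csv.writer(
    buf,
    delimiter=",",
    quotechar='"',               # double quote is the text qualifier
    quoting=csv.QUOTE_MINIMAL,   # qualify only the cells that need it
    lineterminator="\n",
)
writer.writerows(rows)
text = buf.getvalue()
print(text, end="")
# 101,"He said ""hi""","a,b"
# 102,plain,x

# Reading the text back collapses each doubled qualifier into one character.
parsed = list(csv.reader(io.StringIO(text), delimiter=",", quotechar='"'))
print(parsed == rows)            # True
```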

Before and After Example #1: A broken delimited text file containing truncated but no appended records
Upper Left: A broken delimited text file containing truncated but no appended records.
Lower Left: The same delimited text file after it was fixed using the truncated-only method.
Upper Right: The SAS dataset created by importing the fixed delimited text file into SAS EG.

Before and After Example #2: A broken delimited text file containing both truncated and appended records
Upper Left: A broken delimited text file containing both truncated and appended records.
Lower Left: The same delimited text file after it was fixed using the appended method.
Upper Right: The SAS dataset created by importing the fixed delimited text file into SAS EG.

Before and After Example #3: A delimited text file in an “already-fixed” state, meaning it contains no truncated or appended records
Upper Left: A delimited text file in an “already-fixed” state, meaning it contains no truncated or appended records.
Lower Left: The same delimited text file after using the appended method. Notice that the “already-fixed” state has been maintained except for a few double quotes that get matched to one record’s last field instead of the following record’s first field. This transfer of double quotes is unavoidable since they could logically belong to either field.
Upper Right: The SAS dataset created by importing the delimited text file (pictured in lower left) into SAS EG.
