CSV Files and ETL: The Good, Bad, and Ugly - Eric Freeman
Comma-Separated Values: Overview
CSV Overview: CSV stands for comma-separated values; plain text; delimited text file; each line is a new record; not fully standardized!
CSV Evolution: 1972, IBM Fortran compiler under OS/360; input lists delimited by commas or spaces only
CSV Evolution: 1983, Osborne Executive computer with the SuperCalc spreadsheet; added quoted field containers
CSV Evolution: 2005, RFC 4180 (standardization initiative), "Common Format and MIME Type for CSV Files"
RFC 4180 Standardization Initiative: each record is delimited by a line break (CRLF); the last record may or may not end with a line break; a header line is optional and, if present, has the same number of fields as the records; double quotes may enclose fields: "abc","def","ghi" or abc,def,ghi; a double quote inside a quoted field is escaped by doubling it: "abc","de""f","ghi"
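The quoting rules above can be sketched with Python's standard csv module. This is a minimal illustration, not part of the original deck; the helper names to_csv_line and parse_csv_text are hypothetical.

```python
import csv
import io

def to_csv_line(fields):
    """Serialize one record with every field quoted and a CRLF line ending."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
    writer.writerow(fields)
    return buf.getvalue()

def parse_csv_text(text):
    """Parse CSV text into a list of records (each a list of strings)."""
    return list(csv.reader(io.StringIO(text, newline="")))

# The embedded double quote in de"f is written doubled, as RFC 4180 describes.
line = to_csv_line(["abc", 'de"f', "ghi"])
records = parse_csv_text(line)
```

Reading the line back returns the original three fields, so the doubled quote round-trips without loss.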
CSV Overview: the basic concept is clear; line breaks, commas, quotes, escape character
PowerShell CSV Cmdlets: Export-Csv -InputObject <PSObject> [[-Path] <String>] [-LiteralPath <String>] [-Force] [-NoClobber] [-Encoding <String>] [-Append] [[-Delimiter] <Char>] [-IncludeTypeInformation] [-NoTypeInformation] [-WhatIf] [-Confirm]
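For readers without PowerShell at hand, a rough Python analogue of exporting objects to CSV might look like the sketch below. It is not the Export-Csv implementation; export_csv is a hypothetical helper, and the choice to quote every field is an assumption made here for safety with embedded delimiters.

```python
import csv
import io

def export_csv(records, delimiter=","):
    """Serialize a list of dicts; the first record's keys become the header row."""
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=list(records[0].keys()),
        delimiter=delimiter,
        quoting=csv.QUOTE_ALL,   # assumption: quote everything, including numbers
        lineterminator="\r\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

exported = export_csv([{"id": 1, "name": "Ada, Countess"}])
```

Because every field is quoted, the comma inside the name does not split the record.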
Demo
PowerShell CSV Cmdlets: Import-Csv [[-Delimiter] <Char>] [[-Path] <String[]>] [-LiteralPath <String[]>] [-Header <String[]>] [-Encoding <String>]
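The delimiter and header options can be illustrated with a small Python sketch. It mirrors the idea of -Delimiter and -Header only loosely; import_csv is a hypothetical helper, not a PowerShell API.

```python
import csv
import io

def import_csv(text, delimiter=",", header=None):
    """Return each record as a dict keyed by column name.

    If header is None, the first record supplies the column names;
    otherwise header supplies them and every line is treated as data.
    """
    reader = csv.DictReader(
        io.StringIO(text, newline=""), fieldnames=header, delimiter=delimiter
    )
    return [dict(row) for row in reader]

# Semicolon-delimited input whose first line is the header.
rows = import_csv("id;name\r\n1;Ada\r\n2;Grace\r\n", delimiter=";")

# Headerless input with column names supplied by the caller.
named = import_csv("1;Ada\r\n", delimiter=";", header=["id", "name"])
```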
Demo
The Good: simple file, comma delimiters only; load with BULK INSERT
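One way to decide whether a file belongs in the "good" category is to check that it has no quoted fields and a constant field count, in which case splitting on commas is safe. The sketch below is written under that assumption; is_simple_csv and split_simple are hypothetical names, and this is a pre-check, not a replacement for BULK INSERT.

```python
def is_simple_csv(text):
    """True when no line contains a double quote and every record has the same field count."""
    lines = [ln for ln in text.splitlines() if ln]
    if not lines or any('"' in ln for ln in lines):
        return False
    width = lines[0].count(",") + 1
    return all(ln.count(",") + 1 == width for ln in lines)

def split_simple(text):
    """Naive comma split; only safe for text accepted by is_simple_csv."""
    return [ln.split(",") for ln in text.splitlines() if ln]

good = "a,b\n1,2\n"
quoted = 'a,b\n"1,5",2\n'
ragged = "a,b\n1,2,3\n"
```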
Demo
The Bad: huge CSV file with a consistent format; load with BULK INSERT and a format file
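When a file is too large to hold in memory, one common alternative to a single load is to stream it in fixed-size batches. The sketch below shows that swapped-in technique in Python; it does not use BULK INSERT or a format file, and read_batches is a hypothetical helper.

```python
import csv
import io
from itertools import islice

def read_batches(lines, batch_size):
    """Yield lists of at most batch_size parsed records from an iterable of lines."""
    reader = csv.reader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch

# A small in-memory stand-in for a huge file opened with newline="".
source = io.StringIO("1,a\r\n2,b\r\n3,c\r\n", newline="")
batches = list(read_batches(source, 2))
```

Each batch could then be handed to a database driver's bulk API; only one batch is resident at a time.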
Demo
The Ugly: huge CSV file with a changing format; embedded quotes; may contain duplicate column names
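The three problems on this slide can each be handled defensively before loading. The following Python sketch is one possible approach, not the method shown in the demo: dedupe_headers and load_ugly are hypothetical helpers, the _2 suffix scheme and the "_extra" key are assumptions, and embedded quotes are left to the csv module.

```python
import csv
import io

def dedupe_headers(names):
    """Make column names unique by suffixing repeats: name, name_2, name_3, ..."""
    used = set()
    unique = []
    for name in names:
        candidate, n = name, 1
        while candidate in used:
            n += 1
            candidate = f"{name}_{n}"
        used.add(candidate)
        unique.append(candidate)
    return unique

def load_ugly(text):
    """Parse CSV text whose records may be shorter or longer than the header."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = dedupe_headers(next(reader, []))
    rows = []
    for record in reader:
        # Pad short records with None so every header column is present.
        padded = record + [None] * (len(header) - len(record))
        row = dict(zip(header, padded))
        # Keep overflow fields under a hypothetical "_extra" key instead of dropping them.
        if len(record) > len(header):
            row["_extra"] = record[len(header):]
        rows.append(row)
    return rows

ugly = 'id,name,id\r\n1,"Smith, ""Jo""",7\r\n2,Lee\r\n3,Kim,9,surplus\r\n'
loaded = load_ugly(ugly)
```

Keeping overflow fields rather than discarding them makes format drift visible downstream, at the cost of a less uniform row shape.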
Demo