Comma Separated Values (CSV)
Goals for these videos
- Understand the distinction between a schema and a database instance
- Understand three commonly used file formats
Comma Separated Values
- Delimited flat file
- Stores tabular data (numbers and text) in plain text
- Each line is a record
- Each record is a list of fields, separated by commas
- No actual standard, only convention
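As a minimal sketch of the record and field structure described above, the following Python snippet splits a small CSV text into records and fields with the standard library `csv` module. The sample data and variable names are made up for illustration.

```python
import csv
import io

# Hypothetical sample data: each line is a record with three fields.
text = "apple,3,0.50\nbanana,12,0.25\n"

# csv.reader yields one list of strings per record.
records = list(csv.reader(io.StringIO(text)))

for record in records:
    print(record)
```

Every field comes back as a string, including the numbers, because the file itself is plain text. Converting "3" to an integer is left to the program reading the file.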
CSV Edge Cases
- Fields can be put in double quotes:
    "josh","2016"
- Fields containing an embedded comma (,), double quote (") or newline character must be in double quotes:
    "Nahum, Josh"
- Embedded double quotes must be preceded by an additional double quote:
    "Josh said, ""Hi"" to us!"
- The first line of the file may be a header containing the column names. You need contextual information to tell whether this is the case.
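The quoting rules above can be seen in action with Python's standard `csv` module, which applies them when writing. This is a small sketch that reuses the field values from the examples above. The variable names are illustrative only.

```python
import csv
import io

# Field values taken from the examples above: one has an embedded comma,
# one has an embedded comma and embedded double quotes, one needs no quoting.
row = ["Nahum, Josh", 'Josh said, "Hi" to us!', "2016"]

buffer = io.StringIO()
# QUOTE_MINIMAL (the default) quotes only the fields that require it.
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(row)
line = buffer.getvalue()
print(line, end="")  # "Nahum, Josh","Josh said, ""Hi"" to us!",2016

# Reading the line back recovers the original field values.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed == row)  # True
```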
CSV Example
Table Contents:
    To             | Subject        | Message
    josh@msu.edu   | Sign Up        | Do it, Do it now
    tyler@msu.edu  | "Scare" Quotes | Are they allowed?
CSV Contents:
    To,Subject,Message
    josh@msu.edu,Sign Up,"Do it, Do it now"
    tyler@msu.edu,"""Scare"" Quotes"," Are they allowed?"
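To check how the CSV contents map back to the table, the sketch below reads the example with Python's `csv.DictReader`. It assumes the first line is a header, which is the contextual knowledge the reader has to supply. The variable names are illustrative.

```python
import csv
import io

# The CSV contents from the example above, copied verbatim.
text = (
    'To,Subject,Message\n'
    'josh@msu.edu,Sign Up,"Do it, Do it now"\n'
    'tyler@msu.edu,"""Scare"" Quotes"," Are they allowed?"\n'
)

# DictReader treats the first line as the header. Nothing in the file
# itself marks it as one, so this is an assumption made by the caller.
rows = list(csv.DictReader(io.StringIO(text)))

for row in rows:
    print(row["To"], "|", row["Subject"], "|", row["Message"])
```

Note that the space after the opening quote of the last field sits inside the quotes, so this reader keeps it as part of the field value.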
Well-Formed CSV
Which of these lines are well-formed (legal) lines in a CSV file?
    Josh,Nahum,48823
    Hi Class!,Friday,2016
    "\"Stop\" he said",Josh
    New York City,40°42'46"N,74°00'21"W
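One way to check an answer is to encode the quoting rules from the edge cases slide as a regular expression. The sketch below is a hand-written check, not an official CSV validator. It assumes each record fits on a single line, and the name `is_well_formed_line` is made up for this example.

```python
import re

# One field is either quoted (with embedded double quotes doubled)
# or contains no comma, double quote, or newline at all.
FIELD = r'(?:"(?:[^"]|"")*"|[^",\r\n]*)'
LINE = re.compile(rf'{FIELD}(?:,{FIELD})*')

def is_well_formed_line(line: str) -> bool:
    """Return True if `line` is one well-formed CSV record under the rules above."""
    return LINE.fullmatch(line) is not None

print(is_well_formed_line('"Nahum, Josh",2016'))  # True
print(is_well_formed_line('say "hi",2016'))       # False: quote in an unquoted field
```

Running the four lines above through `is_well_formed_line` is a quick way to test your own answers.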
CSV Schema 1.0
- CSV Schema defines a textual language that can be used to define the data structure, types and rules for a data format.
- For instance, we may want to restrict which values are legal in a given column.
- The CSV format itself is very permissive, so we need a second document to define what constitutes "valid" data.
- There is a working draft of a CSV Schema by the National Archives of the UK, found at http://digital-preservation.github.io/csv-schema/
Example CSV Schema
    version 1.0
    @totalColumns 3
    name: notEmpty
    age: range(0, 120)
    gender: is("m") or is("f")
Valid CSV Data
    name,age,gender
    james,21,m
    lauren,19,f
    simon,57,m
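To make the idea of checking data against these rules concrete, here is a rough Python sketch that mirrors the three column rules by hand. It is not the CSV Schema validator and does not parse the schema language. The names `RULES`, `age_in_range` and `validate` are hypothetical, and the sketch assumes the range bounds are inclusive and that the header must match the column names in order.

```python
import csv
import io

def age_in_range(value):
    """Stand-in for `age: range(0, 120)`, assuming inclusive bounds."""
    try:
        return 0 <= float(value) <= 120
    except ValueError:
        return False

# Hand-written stand-ins for the column rules in the example schema.
RULES = {
    "name": lambda v: v != "",            # name: notEmpty
    "age": age_in_range,                  # age: range(0, 120)
    "gender": lambda v: v in ("m", "f"),  # gender: is("m") or is("f")
}

def validate(text):
    """Return a list of error messages; an empty list means the data is valid."""
    errors = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != list(RULES):  # @totalColumns 3, with the expected names
        return [f"unexpected header: {header}"]
    for line_no, record in enumerate(reader, start=2):
        if len(record) != len(RULES):
            errors.append(f"line {line_no}: expected {len(RULES)} columns")
            continue
        for column, value in zip(header, record):
            if not RULES[column](value):
                errors.append(f"line {line_no}: bad {column} {value!r}")
    return errors

valid = "name,age,gender\njames,21,m\nlauren,19,f\nsimon,57,m\n"
print(validate(valid))                             # []
print(validate("name,age,gender\njames,130,m\n"))  # ["line 2: bad age '130'"]
```

The second call shows data that parses as CSV without trouble but breaks the age rule.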
Well-Formed versus Valid
- Well-formed means the data conforms to the file format (e.g. CSV).
- Valid means the data conforms to a schema (more restrictive than the format).
Whitespace
Do these two lines represent the same record/content?
    Josh,Nahum,48823
    Josh, Nahum, 48823
Yes / No / Depends
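How a particular reader treats the space after each comma can be checked directly. The sketch below uses Python's `csv` module, where the `skipinitialspace` option controls this. Other tools may behave differently, so treat it as one data point rather than a rule.

```python
import csv
import io

lines = "Josh,Nahum,48823\nJosh, Nahum, 48823\n"

# Default settings: whitespace after a comma is kept as part of the field.
default_rows = list(csv.reader(io.StringIO(lines)))

# skipinitialspace=True drops whitespace immediately following the delimiter.
trimmed_rows = list(csv.reader(io.StringIO(lines), skipinitialspace=True))

print(default_rows[0] == default_rows[1])  # False
print(trimmed_rows[0] == trimmed_rows[1])  # True
```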