Article pages now have a discussion option at the bottom (moderated/captcha, but no registration needed)

Using ''awk'' to deal with CSV that uses quoted/unquoted delimiters

Keywords: awk,csv
Contributor: SiegeX on #bash on Freenode

CSV files are a mess, yes.

Assume you have CSV files that use the comma as delimiter and quoted data fields that can contain the delimiter.

"first", "second", "last"
"fir,st", "second", "last"
"firtst one", "sec,ond field", "final,ly"

Simply using the comma as separator for awk won't work here, of course.

Solution: Use the field separator ", "|^"|"$ for awk.

This is an OR-ed list of 3 possible separators:

", "matches the area between the datafields
^"matches the area left of the first datafield
"$matches the area right of the last data field

You can tune these delimiters if you have other needs (for example if you don't have a space after the commas).

Test:

The awk command used for the CSV above just prints the fileds separated by ### to see what's going on:

$ awk -v FS='", "|^"|"$' '{print $2"###"$3"###"$4}' data.csv
first###second###last
fir,st###second###last
firtst one###sec,ond field###final,ly

ATTENTION If the CSV data changes its format every now and then (for example it only quotes the data fields if needed, not always), then this way will not work.

Discussion

Enter your comment
 
snipplets/awkcsv.txt · Last modified: 2009/02/24 07:18 (external edit)
GNU Free Documentation License 1.2
www.chimeric.de Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0