A Regular Expression for Parsing Comma-Separated Values
I recently solved a CSV-parsing problem and was so proud of myself that I wanted to blog about it.
The challenge: use regular expression to parse 9000+ calendar records expressed as comma-separated values.
The catch: not all values are quoted, some values may be missing, quoted values may contain double-double-quotes (“”) as “escaped” quotation marks, and some quoted values may span multiple lines. For example, here are some choice records from the file:
"April 26, 2007",7:00p,1:00,Disaster Response,0,,1 to 1 presentation - delivered by John Doe "April 27, 2007",6:00a,1:00,Men's breakfast,0,,"Contact: John Doe - ""monthly"" 1/5 begins 3 flavors of Code 4/27 chapters 4-7" "April 27, 2007",9:30a,1:00,Mother's Day Out,0,, "April 27, 2007",7:00p,1:00,Anon,0,,
The second record, with the last value spanning multiple lines, was the real tricky bit. My strategy went something like this:
- Treat the entire file as a single string representing many records.
- Assume that every record begins with a quoted date and ends with CRLF (\r\n).
- Assume that every record has one or more comma-separated values.
Here’s what I came up with, using Expresso (a .NET regex visualizer and editor) to guide me:
# Every record starts with a quoted date. (?"(?:January|February|March|April|May|June|July|August|September|October|November|December)\s\d+,\s\d+") # And is followed by # EITHER a comma and a term, zero or more times (?: , (? # EITHER a non-quoted, non-comma'd, single-line term... (?: (?!") # mustn't start with a quote [^,\r\n]+ ) | # OR a quoted, possibly multiline term with commas and escaped ""s (?: " # must start with a quote (?:""|[^"])+ # any combination of EITHER characters that aren't single quotes OR double quotes " #must end with a single quote ) | # OR nothing ) )* # OR by a CRLF, guaranteed at the end of every record. \r\n
Or, with all the learning stripped out,
Works like a charm.
Some of the key options that make this regex work (in .NET at least):
- I’m ignoring whitespace within the regex (to make it way easier to read; I love comments)
- The ‘.’ character represents any character, including newlines.