A Regular Expression for Parsing Comma-Separated Values

I recently solved a CSV-parsing problem and was so proud of myself that I wanted to blog about it.

The challenge: use regular expression to parse 9000+ calendar records expressed as comma-separated values.

The catch: not all values are quoted, some values may be missing, quoted values may contain double-double-quotes (“”) as “escaped” quotation marks, and some quoted values may span multiple lines.  For example, here are some choice records from the file:

"April 26, 2007",7:00p,1:00,Disaster Response,0,,1 to 1  presentation - delivered by John Doe
"April 27, 2007",6:00a,1:00,Men's breakfast,0,,"Contact: John Doe - ""monthly""
1/5 begins 3 flavors of Code
4/27 chapters 4-7"
"April 27, 2007",9:30a,1:00,Mother's Day Out,0,,
"April 27, 2007",7:00p,1:00,Anon,0,,

The second record, with the last value spanning multiple lines, was the real tricky bit.  My strategy went something like this:

  • Treat the entire file as a single string representing many records.
  • Assume that every record begins with a quoted date and ends with CRLF (\r\n).
  • Assume that every record has one or more comma-separated values.

Here’s what I came up with, using Expresso (a .NET regex visualizer and editor) to guide me:

# Every record starts with a quoted date.
(?"(?:January|February|March|April|May|June|July|August|September|October|November|December)\s\d+,\s\d+")

# And is followed by
# EITHER a comma and a term, zero or more times
(?:
  ,
  (?
    # EITHER a non-quoted, non-comma'd, single-line term...  
    (?:
      (?!") # mustn't start with a quote
      [^,\r\n]+
    )

    | # OR a quoted, possibly multiline term with commas and escaped ""s
    (?:
      " # must start with a quote
      (?:""|[^"])+ # any combination of EITHER characters that aren't single quotes OR double quotes
      " #must end with a single quote
    )

    | # OR nothing
  )
)*

# OR by a CRLF, guaranteed at the end of every record.
\r\n

Or, with all the learning stripped out,

(?"(?:January|February|March|April|May|June|July|August|September|October|November|December)\s\d+,\s\d+")(?:,(?(?:(?!")[^,\r\n]+)|(?:"(?:""|[^"])+")|))*\r\n

Works like a charm.

Some of the key options that make this regex work (in .NET at least):

  1. I’m ignoring whitespace within the regex (to make it way easier to read; I love comments)
  2. The ‘.’ character represents any character, including newlines.
About these ads
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: