My Byte of Code Blog. Tips and observations on creating software with Objective-C, C/C++, Python, Cocoa and Boost on Mac.

Saturday, November 20, 2010

Parse CSV File with Embedded New Lines Using Boost Tokenizer and C++

In my previous post, Parse CSV File With Boost Tokenizer in C++, I have shown how to use Boost Tokenizer to parse CSV files. The algorithm expected that the file contains one record per line.

However, CSV syntax allows for quoted fields to contain embedded line breaks (quoting wikipedia - should be enough to demonstrate this example). The example code from my previous post could not handle those embedded breaks.

As someone asked in comments if the code can handle them, here is a code that fixes that problem and handles embedded line breaks in quoted fields.

The problem with the original example was not in the way boost::tokenizer was used. The issue was that the code assume one record per line. Because of that assumption it reads one line from file at a time and processes it with the parser.

This post shows code that handles line breaks in csv fields. Example data:

Name;Address;Sport
John Smith;"100 Proper Street;Nowhere;Somestate";Soccer
Joe Smith;"101 Main Street
Springfield, Anystate";Basketball
Will Brown;;Baseball
Jeff Tall;"99 New Street;        
Sommerville;
Australia";Golf
Anthony Short;"1 Some Street;""Quoted Place""";Ping Pong

As we can see there are fields that contain line breaks as well as a field that spans 3 lines.

Updated Record Reading Algorithm

To fix the problem the reading part of the algorithm has to detect if the line, that was read from a file, contains full record or if the record continues on the next line. Here is the updated code:

string line;
string buffer;

bool inside_quotes(false);
size_t last_quote(0);

while (getline(in,buffer))
{
    last_quote = buffer.find_first_of('"');
    while (last_quote != string::npos)
    {
        inside_quotes = !inside_quotes;
        last_quote = buffer.find_first_of('"', last_quote + 1);
    }

    line.append(buffer);

    if (inside_quotes)
    {
        line.append("\n");
        continue;
    }

    // line now contains full record,
    // use boost::tokenizer to parse it
}

The way I implemented this check is to scan the line and check the number of quote characters. If we have odd number of quote characters the record is not complete and we need to read another line.

Custom Separator for Semicolons

I have updated the tokenizer to deal with semi-colon delimiter instead of the original comma, as per the format used in the original question (posted in comments to the previous post):

typedef tokenizer< escaped_list_separator<char> > Tokenizer;

escaped_list_separator<char> sep('\\', ';', '\"');

Tokenizer tok(line, sep);

The only difference is that I have to initialize separator with custom values, replacing comma with semicolon as the second parameter, and passing the new separator to Tokenizer object on creation.

Full code listing

#include <iostream>
#include <fstream>      // fstream
#include <vector>
#include <string>
#include <algorithm>    // copy
#include <iterator>     // ostream_operator

#include <boost/tokenizer.hpp>

int name_address_sport_parser()
{
    using namespace std;
    using namespace boost;

    string data("name_address_sport.csv");

    ifstream in(data.c_str());
    if (!in.is_open()) return 1;

    typedef tokenizer< escaped_list_separator<char> > Tokenizer;

    escaped_list_separator<char> sep('\\', ';', '\"');

    vector< string > vec;
    string line;
    string buffer;

    bool inside_quotes(false);
    size_t last_quote(0);

    while (getline(in,buffer))
    {
        // --- deal with line breaks in quoted strings

        last_quote = buffer.find_first_of('"');
        while (last_quote != string::npos)
        {
            inside_quotes = !inside_quotes;
            last_quote = buffer.find_first_of('"',last_quote+1);
        }

        line.append(buffer);

        if (inside_quotes)
        {
            line.append("\n");
            continue;
        }
        // ---

        Tokenizer tok(line, sep);
        vec.assign(tok.begin(),tok.end());

        line.clear(); // clear here, next check could fail

        // example checking
        // for correctly parsed 3 fields per record
        if (vec.size() < 3) continue;

        copy(vec.begin(), vec.end(),
                ostream_iterator<string>(cout, "|"));

        cout << "\n----------------------" << endl;
    }

    in.close();

    return 0;
}

int main()
{
    name_address_sport_parser();
}

Input Data and Generated Output

Example input data:

Name;Address;Sport 
John Smith;"100 Proper Street;Nowhere;Somestate";Soccer 
Joe Smith;"101 Main Street 
Springfield, Anystate";Basketball 
Will Brown;;Baseball 
Jeff Tall;"99 New Street;        
Sommerville; 
Australia";Golf 
Anthony Short;"1 Some Street;""Quoted Place""";Ping Pong

Generated output, I have inserted pipe character between fields in a record and extra line between records to show how the values were extracted:

Name|Address|Sport|
----------------------
John Smith|100 Proper Street;Nowhere;Somestate|Soccer|
----------------------
Joe Smith|101 Main Street
Springfield, Anystate|Basketball|
----------------------
Will Brown||Baseball|
----------------------
Jeff Tall|99 New Street;
Sommerville;
Australia|Golf|
----------------------
Anthony Short|1 Some Street;Quoted Place|Ping Pong|
----------------------

Feel free to point out any issues with the code. This example, and the one in the previous post, is supposed to show how to use boost::tokenizer more than how to deal with all possible csv formats. I hope this demonstrates how to fix the code to handle embedded line breaks.

3 comments:

  1. Good program, one question:
    What if the new line inside the quota is "\r\n", then by appending in this way would change the original record:
    if (inside_quotes)
    {
    line.append("\n");
    continue;
    }

    ReplyDelete
  2. Great job. One oversight though: If the quoted text has an escaped ", the algorithm will fail. To correct this, the following need to be changed
    while (last_quote != string::npos)
    {
    if (buffer[ last_quote - 1 ] != '\\')
    {
    inside_quotes = !inside_quotes;
    }

    last_quote = buffer.find_first_of('"',last_quote+1);
    }

    ReplyDelete
  3. Hi,
    Thank you for the code I was tryint to find an example of tokenizer.

    When you use getline, the function stops when arrive to the first '\n', isn't it? http://www.cplusplus.com/reference/string/getline/
    I thnik the code will be incorrect to support '\n' inside '"'.

    ReplyDelete