In my previous post, Parse CSV File With Boost Tokenizer in C++, I showed how to use Boost Tokenizer to parse CSV files.
The algorithm expected the file to contain one record per line. However, CSV syntax allows quoted fields to contain embedded line breaks (the Wikipedia description of the format should be enough to demonstrate this), and the example code from my previous post could not handle those embedded breaks. Since someone asked in the comments whether the code can handle them, here is a version that fixes that problem and handles embedded line breaks in quoted fields.
The problem with the original example was not in the way boost::tokenizer was used. The issue was that the code assumed one record per line; because of that assumption it read one line from the file at a time and handed it to the parser.
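To see why that assumption breaks, it is enough to push one such record through getline. The following small standalone snippet only borrows the Joe Smith record from the example data further down; everything else is an illustration of the failure mode, not code from the original post:

#include <iostream>
#include <sstream>
#include <string>

int main()
{
    // One CSV record whose quoted Address field contains a line break.
    std::istringstream in(
        "Joe Smith;\"101 Main Street\nSpringfield, Anystate\";Basketball\n");

    // Reading it line by line splits the single record into two pieces,
    // which is exactly why the one-record-per-line assumption fails.
    std::string line;
    while (std::getline(in, line))
        std::cout << "[" << line << "]\n";

    // prints:
    // [Joe Smith;"101 Main Street]
    // [Springfield, Anystate";Basketball]
}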
This post shows code that handles line breaks in CSV fields. Example data:
Name;Address;Sport
John Smith;"100 Proper Street;Nowhere;Somestate";Soccer
Joe Smith;"101 Main Street
Springfield, Anystate";Basketball
Will Brown;;Baseball
Jeff Tall;"99 New Street;
Sommerville;
Australia";Golf
Anthony Short;"1 Some Street;""Quoted Place""";Ping Pong
As we can see, there are fields that contain line breaks, including one field that spans three lines.
Updated Record Reading Algorithm
To fix the problem, the reading part of the algorithm has to detect whether the line that was read from the file contains a full record or whether the record continues on the next line. Here is the updated code:
string line;
string buffer;
bool inside_quotes(false);
size_t last_quote(0);

while (getline(in, buffer))
{
    last_quote = buffer.find_first_of('"');
    while (last_quote != string::npos)
    {
        inside_quotes = !inside_quotes;
        last_quote = buffer.find_first_of('"', last_quote + 1);
    }

    line.append(buffer);

    if (inside_quotes)
    {
        line.append("\n");
        continue;
    }

    // line now contains full record,
    // use boost::tokenizer to parse it
}
The way I implemented this check is to scan the line and count the quote characters. If we have seen an odd number of quote characters so far, a quoted field is still open, the record is not complete, and we need to read another line.
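For illustration only, the same parity check can be pulled out into a tiny standalone helper. This is just a sketch; the name quote_still_open is made up for this post and is not part of the code above:

#include <iostream>
#include <string>

// Flip a flag on every '"' found in the chunk; if the total number of
// quotes seen so far is odd, the flag stays set and the quoted field
// continues on the next line.
bool quote_still_open(const std::string& chunk, bool currently_open)
{
    std::size_t pos = chunk.find_first_of('"');
    while (pos != std::string::npos)
    {
        currently_open = !currently_open;
        pos = chunk.find_first_of('"', pos + 1);
    }
    return currently_open;
}

int main()
{
    std::cout << std::boolalpha
              << quote_still_open("Joe Smith;\"101 Main Street", false) << "\n"       // true
              << quote_still_open("Springfield, Anystate\";Basketball", true) << "\n"; // false
}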
Custom Separator for Semicolons
I have also updated the tokenizer to use a semicolon delimiter instead of the original comma, as per the format used in the original question (posted in the comments to the previous post):
typedef tokenizer< escaped_list_separator<char> > Tokenizer;
escaped_list_separator<char> sep('\\', ';', '\"');
Tokenizer tok(line, sep);
The only difference is that I have to initialize the separator with custom values, replacing the comma with a semicolon as the second parameter, and pass the new separator to the Tokenizer object on creation.
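As a minimal, self-contained illustration of just the separator, here is a sketch that tokenizes a single record; the sample string is invented for this example and is not part of the original data:

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

int main()
{
    using namespace boost;

    // escape character, field delimiter, quote character
    escaped_list_separator<char> sep('\\', ';', '\"');

    std::string record("Jane Doe;\"1 Example Road;Sometown\";Tennis");
    tokenizer< escaped_list_separator<char> > tok(record, sep);

    for (tokenizer< escaped_list_separator<char> >::iterator it = tok.begin();
         it != tok.end(); ++it)
    {
        std::cout << *it << "|";
    }
    std::cout << std::endl;

    // prints: Jane Doe|1 Example Road;Sometown|Tennis|
}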
Full Code Listing
#include <iostream>
#include <fstream>   // fstream
#include <vector>
#include <string>
#include <algorithm> // copy
#include <iterator>  // ostream_operator
#include <boost/tokenizer.hpp>

int name_address_sport_parser()
{
    using namespace std;
    using namespace boost;

    string data("name_address_sport.csv");

    ifstream in(data.c_str());
    if (!in.is_open()) return 1;

    typedef tokenizer< escaped_list_separator<char> > Tokenizer;
    escaped_list_separator<char> sep('\\', ';', '\"');

    vector< string > vec;
    string line;
    string buffer;
    bool inside_quotes(false);
    size_t last_quote(0);

    while (getline(in, buffer))
    {
        // --- deal with line breaks in quoted strings
        last_quote = buffer.find_first_of('"');
        while (last_quote != string::npos)
        {
            inside_quotes = !inside_quotes;
            last_quote = buffer.find_first_of('"', last_quote + 1);
        }

        line.append(buffer);

        if (inside_quotes)
        {
            line.append("\n");
            continue;
        }
        // ---

        Tokenizer tok(line, sep);
        vec.assign(tok.begin(), tok.end());

        line.clear(); // clear here, next check could fail

        // example checking
        // for correctly parsed 3 fields per record
        if (vec.size() < 3) continue;

        copy(vec.begin(), vec.end(),
             ostream_iterator<string>(cout, "|"));

        cout << "\n----------------------" << endl;
    }

    in.close();

    return 0;
}

int main()
{
    name_address_sport_parser();
}
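A note on building: boost::tokenizer is header-only, so the program only needs the Boost headers on the include path and no Boost libraries have to be linked. The code expects a file called name_address_sport.csv with the data shown below in the current directory.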
Input Data and Generated Output
Example input data:
Name;Address;Sport
John Smith;"100 Proper Street;Nowhere;Somestate";Soccer
Joe Smith;"101 Main Street
Springfield, Anystate";Basketball
Will Brown;;Baseball
Jeff Tall;"99 New Street;
Sommerville;
Australia";Golf
Anthony Short;"1 Some Street;""Quoted Place""";Ping Pong
Generated output; I have inserted a pipe character between the fields of a record and a separator line between records to show how the values were extracted:
Name|Address|Sport|
----------------------
John Smith|100 Proper Street;Nowhere;Somestate|Soccer|
----------------------
Joe Smith|101 Main Street
Springfield, Anystate|Basketball|
----------------------
Will Brown||Baseball|
----------------------
Jeff Tall|99 New Street;
Sommerville;
Australia|Golf|
----------------------
Anthony Short|1 Some Street;Quoted Place|Ping Pong|
----------------------
Feel free to point out any issues with the code. This example, like the one in the previous post, is meant to show how to use boost::tokenizer rather than how to deal with every possible CSV format.
I hope this demonstrates how to fix the code to handle embedded line breaks.
Good program, one question: what if the newline inside the quotes is "\r\n"? Then appending "\n" in this way would change the original record:
if (inside_quotes)
{
    line.append("\n");
    continue;
}
Great job. One oversight though: if the quoted text contains an escaped ", the algorithm will fail. To correct this, the following needs to change:

while (last_quote != string::npos)
{
    if (buffer[last_quote - 1] != '\\')
    {
        inside_quotes = !inside_quotes;
    }
    last_quote = buffer.find_first_of('"', last_quote + 1);
}
Hi,
Thank you for the code; I was trying to find an example of tokenizer.
When you use getline, the function stops when it reaches the first '\n', doesn't it? http://www.cplusplus.com/reference/string/getline/
I think the code will then not support '\n' inside '"' correctly.