Often the data required by an application is available in CSV-formatted files.
In C++ it is easy to read a file line by line. All that is left is to extract the fields from each line and insert them into a data structure stored in memory.
Boost Tokenizer is a package that provides an easy way to break a string or other character sequence into a sequence of tokens, and it provides a standard iterator interface for traversing those tokens.
I will show a simple way of using Boost Tokenizer to parse data from a CSV file.
Boost provides tokenizers that are easy to construct and use.
To set up a tokenizer you select one of the provided tokenizer functions. The tokenizer is instantiated with the string that is to be parsed. You can then use the standard iterator interface to access the parsed tokens; the tokenizer and tokenizer function take care of parsing the string. Optionally you can use other standard algorithms that operate on iterators; the example below initializes a std::vector from the begin() and end() iterators.
A simple example program that parses a CSV file into records:
#include <iostream>   // cout, endl
#include <fstream>    // fstream
#include <vector>
#include <string>
#include <algorithm>  // copy
#include <iterator>   // ostream_iterator
#include <boost/tokenizer.hpp>

int main()
{
    using namespace std;
    using namespace boost;

    string data("data.csv");

    ifstream in(data.c_str());
    if (!in.is_open()) return 1;

    typedef tokenizer< escaped_list_separator<char> > Tokenizer;

    vector< string > vec;
    string line;

    while (getline(in,line))
    {
        Tokenizer tok(line);
        vec.assign(tok.begin(),tok.end());

        // skip records with fewer than three fields
        if (vec.size() < 3) continue;

        copy(vec.begin(), vec.end(),
             ostream_iterator<string>(cout, "|"));

        cout << "\n----------------------" << endl;
    }
}
First, the boost::tokenizer is set up with the boost::escaped_list_separator tokenizer function. This function object specifies how the string is parsed:
typedef tokenizer< escaped_list_separator<char> > Tokenizer;
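By default, escaped_list_separator uses the backslash as the escape character, the comma as the field separator, and the double quote as the quote character. These can be overridden through its constructor; a small sketch that simply restates the defaults explicitly:

// escape '\\', field separator ',', quote '"' (these are the defaults)
escaped_list_separator<char> sep('\\', ',', '"');
Tokenizer tok(line, sep);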
Next, the tokenizer is initialized with each line read from the CSV file:
Tokenizer tok(line);
Now the tokens for one record are available via the begin() and end() iterators. A std::vector is initialized with the data from one parsed line:
vec.assign(tok.begin(),tok.end());
The vector now contains the parsed data. The example dumps the data onto standard output using the copy algorithm and an ostream_iterator that pipes the data into cout, using the string "|" to separate tokens:
copy(vec.begin(),vec.end(),ostream_iterator<string>(cout,"|"));
Often it is desirable to perform basic checking on the data, such as verifying that each line was parsed properly by checking the number of fields extracted.
This is easily done by checking the number of elements in the vector; the example skips each record that has fewer than three fields:
if (vec.size() < 3) continue;
Compiling With Boost Tokenizer
To compile you need to add -I/usr/local/include/boost-1_42/ to the compile flags so that the compiler can find the appropriate boost headers. No library is required for linking.
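For example, assuming the program above is saved as parse_csv.cpp (the file and binary names here are just placeholders):

g++ -I/usr/local/include/boost-1_42/ parse_csv.cpp -o parse_csv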
Iterating Over Tokens Using Standard Iterator Interface
You can use the standard iterator interface to access tokens as they are parsed:
vector< string > vec;

vec.clear();
Tokenizer tok(line);

for (Tokenizer::iterator it(tok.begin()), end(tok.end());
     it != end; ++it)
{
    vec.push_back(*it);
}
Trim Strings
If the CSV file includes spaces between the delimiters and the values, the extracted tokens will contain those extra spaces.
We can apply trim from the Boost String Algorithms library to remove the spaces from the front and back of each string:
#include <boost/algorithm/string/trim.hpp>

trim(vec[0]);
trim(vec[1]);
Store Data In Boost Bimap
I have shown in my previous blog post how to use boost::bimap to keep bidirectional maps between two unique sets of values.
We now have a way to extract data from a CSV file and insert it into a data structure for lookup:
// includes needed for bimap (at file scope):
#include <boost/bimap.hpp>
#include <boost/bimap/unordered_set_of.hpp>

string data("map.csv");

ifstream in(data.c_str());
if (!in.is_open()) return 1;

using namespace boost::bimaps;

typedef bimap<
    unordered_set_of< string >,
    unordered_set_of< string >
> symbol_map_type;

symbol_map_type m_symbol_map;

typedef tokenizer< escaped_list_separator<char> > Tokenizer;

vector< string > vec;
string line;

while (getline(in,line))
{
    vec.clear();
    Tokenizer tok(line);
    vec.assign(tok.begin(),tok.end());

    if (vec.size() < 2) continue;

    trim(vec[0]);
    trim(vec[1]);

    m_symbol_map.insert(
        symbol_map_type::value_type(vec[0], vec[1]) );
}
Now we can access values in both directions, key to value:
symbol_map_type::left_map& map_view = m_symbol_map.left;

for (symbol_map_type::left_map::iterator it(map_view.begin()),
     end(map_view.end()); it != end; ++it)
{
    cout << "[" << (*it).first << "] - [" << (*it).second << "]" << endl;
}
And in the reverse direction, value to key:
symbol_map_type::right_map& map_view = m_symbol_map.right;

for (symbol_map_type::right_map::iterator it(map_view.begin()),
     end(map_view.end()); it != end; ++it)
{
    cout << "[" << (*it).first << "] - [" << (*it).second << "]" << endl;
}
See my previous blog post about searching boost::bimap data structures.
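As a quick reminder, a minimal lookup sketch on one side of the map (the key "ABC" comes from the sample data below):

// look up a value by key in the left view of the bimap
symbol_map_type::left_map::const_iterator lookup = m_symbol_map.left.find("ABC");
if (lookup != m_symbol_map.left.end())
    cout << "[ABC] maps to [" << lookup->second << "]" << endl;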
Data And Output
The data.csv file used is a slightly modified version of the file from the boost::tokenizer example. Note that the second line from the bottom does not appear in the output because of the check for at least three fields per record:
Field 1,Field 2,Field 3
Field 1,"Field 2, with comma",Field 3
Field 1,Field 2 with \"embedded quote\",Field 3
Field 1, Field 2 with \n new line,Field 3
Field 1, Field 2 with embedded \\ ,Field 3
Field 1, Field 2 with missing third field so it is skipped and will not appear in the output
Field 11, ,,Field 33
Output:
Field 1|Field 2|Field 3|
----------------------
Field 1|Field 2, with comma|Field 3|
----------------------
Field 1|Field 2 with "embedded quote"|Field 3|
----------------------
Field 1| Field 2 with
 new line|Field 3|
----------------------
Field 1| Field 2 with embedded \ |Field 3|
----------------------
Field 11| ||Field 33|
----------------------
Data For Map
Here is the example data file map.csv used with boost::bimap. Note the extra space after each comma:
ABC, cba.abc.cba
EFG, gfe.efg.gfe
HIJ, jih.hij.jih
KLM, mlk.klm.mlk
NOP, pon.nop.pon
The trimmed data output by iterating the boost::bimap in both directions:
[EFG] - [gfe.efg.gfe]
[NOP] - [pon.nop.pon]
[KLM] - [mlk.klm.mlk]
[ABC] - [cba.abc.cba]
[HIJ] - [jih.hij.jih]
The other way ...
[gfe.efg.gfe] - [EFG]
[cba.abc.cba] - [ABC]
[pon.nop.pon] - [NOP]
[jih.hij.jih] - [HIJ]
[mlk.klm.mlk] - [KLM]
I have shown an easy way to parse CSV data with boost::tokenizer and how to insert the data into boost::bimap.
Enjoy parsing CSV files.
Comments
I wonder whether these tools are capable of parsing lines with fields that include unescaped newlines. For example:
Name;Address;Sport
Joe Smith;"101 Main Street
Springfield, Anystate";Basketball
Will Brown;;Baseball
The code above will not be able to parse an embedded new line in a field, as you show in the first record of your example.
This is not an issue with the boost::tokenizer itself; you can specify ';' as the delimiter.
The issue is that the code above assumes records are stored one per line, so a line at a time is read and parsed.
The reading code could be adjusted to skim through each line, check whether we have a new line inside a quoted string, and keep reading new lines from the file until the whole field with the embedded new lines is read.
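A minimal sketch of that idea (my assumption here: quote characters are never escaped inside a field, so an odd number of quotes means a quoted field is still open):

#include <algorithm>  // count
#include <fstream>
#include <string>
#include <boost/tokenizer.hpp>

// Read one logical record, appending physical lines while a quoted
// field is still open (odd number of '"' characters seen so far).
bool getline_record(std::istream& in, std::string& record)
{
    record.clear();
    std::string line;
    while (std::getline(in, line))
    {
        record += record.empty() ? line : "\n" + line;
        if (std::count(record.begin(), record.end(), '"') % 2 == 0)
            return true;   // all quoted fields are closed
    }
    return !record.empty();
}

// The record can then be tokenized with ';' as the separator:
// boost::escaped_list_separator<char> sep('\\', ';', '"');
// boost::tokenizer< boost::escaped_list_separator<char> > tok(record, sep);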
I have added another post that shows one way of dealing with records with embedded line breaks and a semicolon separator: http://mybyteofcode.blogspot.com/2010/11/parse-csv-file-with-embedded-new-lines.html
Thanks! This was very helpful. I had another method, but it didn't like zero-length fields (i.e. commas with nothing between) and was slow. This is faster and handles it all.