cgv
cgv::utils::tokenizer Class Reference

#include <tokenizer.h>

Inherits cgv::utils::token.

Public Member Functions

 tokenizer ()
 construct empty tokenizer
 
 tokenizer (const token &)
 construct from token
 
 tokenizer (const char *)
 construct from character string
 
 tokenizer (const std::string &)
 construct from string
 
tokenizer & set_ws (const std::string &ws)
 set the list of white spaces, that separate tokens and are skipped
 
tokenizer & set_skip (const std::string &open, const std::string &close)
 set several character pairs that enclose tokens that are not split
 
tokenizer & set_skip (const std::string &open, const std::string &close, const std::string &escape)
 set several character pairs that enclose tokens that are not split and one escape character for each pair
 
tokenizer & set_sep (const std::string &sep, bool merge)
 set the list of separators and specify whether succeeding separators are merged into single tokens
 
tokenizer & set_sep (const std::string &sep)
 set the list of separators
 
tokenizer & set_sep_merge (bool merge)
 specify whether succeeding separators are merged into single tokens
 
token bite ()
 bite away a single token from the front
 
token reverse_bite ()
 bite away a single token from the back
 
void reverse_skip_whitespaces ()
 skip whitespaces at the back
 
void skip_whitespaces ()
 skip whitespaces at the front
 
bool skip_ws_check_empty ()
 skip whitespaces at the front and return whether the complete text has been processed
 
bool reverse_skip_ws_check_empty ()
 skip whitespaces at the back and return whether the complete text has been processed
 
bool balanced_bite (token &result, const std::string &open_parenthesis, const std::string &close_parenthesis, bool wait_for_sep=false)
 bite one token until all potentially nested opened parentheses have been closed again
 
size_t get_length () const
 return the length of the token in number of characters
 
size_t size () const
 return the length of the token in number of characters
 
bool empty () const
 return whether the token is empty
 
void skip (const std::string &skip_chars)
 set begin by skipping all instances of the given character set
 
void reverse_skip (const std::string &skip_chars)
 set end by skipping all instances of the given character set
 
char operator[] (unsigned int i) const
 return the i-th character of the token
 
bool operator== (const char *s) const
 compare to const char*
 
bool operator== (const std::string &s) const
 compare to string
 
bool operator!= (const char *s) const
 compare to const char*
 
bool operator!= (const std::string &s) const
 compare to string
 

Public Attributes

const char * begin
 pointers that define the range of characters
 

Detailed Description

The tokenizer splits text into tokens in a convenient way. It supports splitting at white spaces and at single- or multi-character separators. Furthermore, it supports enclosing character pairs, such as parentheses or string delimiters, between which white spaces and separators do not split the text.

By default, the white spaces are space, tab, and newline. The lists of separators and of skip character pairs are empty by default.

A tokenizer can be constructed from a string, a const char* or a token. Each resulting token is stored as two pointers: one to the beginning of the token and one just past its end. No new memory is allocated, so the tokens are only valid as long as the string or const char* from which the tokenizer was constructed is valid.
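The two-pointer storage described above can be illustrated with a small stand-in. This is only a sketch under stated assumptions: `span_token` and `first_word` are hypothetical names invented for this example, not part of cgv, and the code uses only the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a non-owning token (not the cgv class):
// two pointers into a buffer that is owned by someone else.
struct span_token {
    const char* begin; // first character of the token
    const char* end;   // one past the last character of the token

    std::size_t size() const {
        assert(begin <= end); // the range must be well formed
        return static_cast<std::size_t>(end - begin);
    }
    // Copy the referenced characters into an owning string.
    std::string str() const { return std::string(begin, end); }
};

// Return the first run of non-space characters in text without copying.
// The result points into text and dangles once text is destroyed or modified.
span_token first_word(const std::string& text) {
    const char* b = text.data();
    const char* e = b + text.size();
    while (b != e && *b == ' ') ++b; // skip leading spaces
    const char* w = b;
    while (w != e && *w != ' ') ++w; // advance to the end of the word
    return span_token{b, w};
}
```

Because the token only refers to the source buffer, a caller that needs the text beyond the lifetime of that buffer has to copy it, for example with `str()` in this sketch.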

In the simplest usage, the tokenizer generates a vector of tokens through the bite_all function. Suppose you want to split the string str="Hello tokenizer." at the white spaces into two tokens <Hello> and <tokenizer.>. Notice that no token contains the white space separating the tokens. The following code performs this task:

std::vector<token> toks;
bite_all(tokenizer(str), toks);

If you want to also cut the dot into a separate token, just set the list of separators with the set_sep method:

std::vector<token> toks;
bite_all(tokenizer(str).set_sep("."), toks);

The result is three tokens: <Hello>, <tokenizer> and <.>. If you want to split a semicolon-separated list whose tokens can contain white spaces, and to drop the semicolons, you can set the semicolon character as the only white space:

std::vector<token> toks;
bite_all(tokenizer(str).set_ws(";"), toks);

The previous code would split the string "a and b;c and d" into two tokens <a and b> and <c and d>.
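The difference between white spaces (dropped) and separators (kept as their own tokens) can be reproduced with a few lines of standard C++. The function `split_sketch` below is a hypothetical stand-in written for this page, not the cgv implementation: it copies tokens into strings instead of returning pointer pairs, and it never merges consecutive separators.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in, not the cgv API: characters in ws end a token and
// are dropped; characters in sep end a token and become one-character tokens.
std::vector<std::string> split_sketch(const std::string& text,
                                      const std::string& ws = " \t\n",
                                      const std::string& sep = "") {
    std::vector<std::string> toks;
    std::string cur; // token currently being collected
    for (char c : text) {
        const bool is_ws = ws.find(c) != std::string::npos;
        const bool is_sep = sep.find(c) != std::string::npos;
        if (is_ws || is_sep) {
            if (!cur.empty()) { // close the token collected so far
                toks.push_back(cur);
                cur.clear();
            }
            if (is_sep && !is_ws) // separators are kept, white spaces are not
                toks.push_back(std::string(1, c));
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) toks.push_back(cur);
    return toks;
}
```

With the defaults, `split_sketch("Hello tokenizer.")` gives the two tokens from the first example; passing `"."` as `sep` gives three; and `split_sketch("a and b;c and d", ";")` gives the two semicolon-delimited tokens.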

If you do not want to split inside strings enclosed by <'> or inside parentheses, you can set several skip character pairs:

std::vector<token> toks;
bite_all(tokenizer(str).set_sep("[]").set_skip("'({", "')}"), toks);

The previous code example would split the string "'a b'[{c d}]" into four tokens: <'a b'>, <[>, <{c d}> and <]>. Note that you can apply several setter methods to the tokenizer in sequence, because each setter returns a reference to the tokenizer itself, similar to the stream operators.
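Both ideas from this example, setters that return a reference for chaining and skip pairs that protect enclosed text, can be sketched in standard C++. The class `mini_tokenizer` is a hypothetical stand-in, not the cgv class: it copies tokens into strings, does not handle nested pairs or escape characters, and treats an unclosed pair as running to the end of the text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for illustration only (not cgv::utils::tokenizer).
class mini_tokenizer {
    std::string text_, ws_ = " \t\n", sep_, open_, close_;
public:
    explicit mini_tokenizer(const std::string& text) : text_(text) {}

    // Each setter returns *this so that calls can be chained.
    mini_tokenizer& set_ws(const std::string& ws) { ws_ = ws; return *this; }
    mini_tokenizer& set_sep(const std::string& sep) { sep_ = sep; return *this; }
    mini_tokenizer& set_skip(const std::string& open, const std::string& close) {
        open_ = open;
        close_ = close;
        return *this;
    }

    std::vector<std::string> split() const {
        const std::size_t npos = std::string::npos;
        std::vector<std::string> toks;
        std::string cur; // token currently being collected
        for (std::size_t i = 0; i < text_.size(); ++i) {
            const char c = text_[i];
            const std::size_t pair = open_.find(c);
            if (pair != npos && pair < close_.size()) {
                // Copy everything up to and including the matching close
                // character without looking at white spaces or separators.
                std::size_t j = text_.find(close_[pair], i + 1);
                if (j == npos) j = text_.size() - 1; // unclosed: take the rest
                cur.append(text_, i, j - i + 1);
                i = j;
            } else if (ws_.find(c) != npos || sep_.find(c) != npos) {
                if (!cur.empty()) {
                    toks.push_back(cur);
                    cur.clear();
                }
                if (sep_.find(c) != npos) // separators become their own tokens
                    toks.push_back(std::string(1, c));
            } else {
                cur += c;
            }
        }
        if (!cur.empty()) toks.push_back(cur);
        return toks;
    }
};
```

A chained call such as `mini_tokenizer("'a b'[{c d}]").set_sep("[]").set_skip("'({", "')}").split()` mirrors the example above and produces the same four tokens.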

Member Function Documentation

◆ skip()

void token::skip ( const std::string &  skip_chars)
inherited

set begin by skipping all instances of the given character set


