Hapy Prefix Parsing

This page documents a part of the Hapy parser API that is useful for handling a stream of communication protocol messages, interactive user interfaces, or similar input where things to be parsed do not have known-a-priory boundaries and/or are surrounded by opaque data such as message payload.

The Challenge

When parsing communication protocol messages or handling an interactive user interface, message boundaries are often unknown and several messages (commands, etc.) may be present in the already available input. These conditions require the programmar to distinguish among the following three cases:

Input starts with a complete message possibly followed by something else. The message needs to be interpreted, freeing buffers for more input. This is a common situation when message headers are followed by message body; the body is treated as opaque data or has syntax different from the headers. It is often large and/or has header-dependent length.
Input starts with a valid message prefix and contains nothing else. If input buffers are full, terminate processing with a "message too big to handle" error. Otherwise, keep buffering (i.e., wait for more input) in hope to get a complete message. This is a common case when input data trickles in over a slow network link or from a typing human.
Input contains something other than the above. Terminate processing with a "invalid message syntax" error. This is a common situation when communicating with humans and broken or malicious agents.

The usual parser interface that yields a yes/no answer for a given input is useless in all three cases: it can only work if the input contains a single complete message and nothing else. A parser that can distinguish the first case only will be more susceptible to even trivial denial of service attacks (intentional or not) since it would have to rely on timeouts to detect invalid input.

A Hapy parser can distinguish all three cases. It does not assume that the end of currently available input is the end of the thing being parsed, and it uses a three-way logic (miss/match/need-more-data). These and other features allow Hapy parsers to detect valid message prefixes, including complete valid messages, and distinguish them from garbage.

Interactive Calculator Example

Since providing a simple but realistic and testable example for parsing protocol messages is difficult, we settle on an interactive calculator example (doc/calci.cc). For a non-interactive version of a very similar calculator (doc/calc.cc), refer to the Primer. If you are new to Hapy, you should read the Primer first.

Our interactive calculator parses and interprets arithmetic expressions as they are being entered by the user. The calculator reads standard input stream, one character at a time, updates parser state, and prints expression values when input contains a complete expression. Invalid input is also detected as soon as possible. Invalid input terminates the program.

Grammar

The grammar for dynamic interactive parsing is the same as its static version except for two little but important differences. First, we concentrate on a single expression in a possibly infinite stream of expressions instead of expecting to parse a terminated sequence of expressions. Thus, our top-most start rule is a single rExpression:

	// was: rGrammar = +(rExpression >> ';');
	rGrammar = rExpression >> ';';
	...

Second, we do not want to wait for the next expression to interpret the previous one. This implies that our grammar must not trim whitespace after the semicolon, or we would have to wait for that whitespace to terminate (with the start of the next expression or with the final end of input). Thus, instead of specifying the trim rule starting with rGrammar, we trim starting with rExpression only.

	// was: rGrammar.trim(*space_r);
	rExpression.trim(*space_r);
	...

The above trimming rule takes care of whitespace between expressions via the left-side trim of rExpression. However, note that the rule does not trim any whitespace that follows the very last semicolon. Thus, the grammar does not allow any trailing garbage. The grammar or parsing loop could be slightly modified to accommodate for the latter case if needed.

The rest of the grammar is the same for doc/calc.cc and doc/calci.cc.

Processing Loop

The static calculator was able to build the parsing tree and initiate interpretation in a just a few statements:

	// abridged static code only; see below for calci code
	if (parser.parse(input)) 
		interpret(parser.result().pree);

The interactive calculator has to use prefix parsing interface. We no longer can separate parsing and interpretation stages in time. By the very nature of the problem, they become interleaved. The if-statement in the above static code is replaced by a loop:

	do {
		parser.moveOn();
		if (parser.begin()) {
			while (parser.step())
				getData(parser);
		}
		parser.end();
	} while (interpret(parser.result()));

The code inside the loop starts by reseting the parser before each new expression is parsed. We could simply create a new parser instead, but that would mean "compiling" and optimizing the same grammar on every loop iteration. The moveOn() method allows to perform expensive initialization of the parser just once, during the very first call to the begin() method.

The inner while-loop attempts to get more input data while more data is needed after parsing available input. The step() method always attempts to continue parsing available input and is guaranteed to return true if and only if more data is needed to parse the input. The latter implies that once the final end of input is reached and signaled to the parser, the step() method returns false regardless of the input and parsing success. Thus, the loop terminates once we found the first valid or corrupted message. The getData() function is discussed below.

The getData() function attempts to read more user input, one character at a time. If the final end is reached, the parser is notified and no more attempts to read input are made.

	// supply parser with more input
	void getData(Parser &parser) {
		if (!parser.sawDataEnd()) {
			char c;
			if (cin.get(c))
				parser.pushData(string(1, c));
			else
				parser.sawDataEnd(true);
		}
	}

The do-loop ends each iteration with a cleanup call to the end() method, followed by the call to the interpret() function. The latter returns true if the parser successfully parsed a complete expression. Otherwise, it complaints of a syntax error (unless the input was empty) and returns false.

	bool interpret(const Result &result) {
		if (result.statusCode == Result::scMatch) {
			intrpTop(result.pree);
			return true;
		}

		// if empty input did not parse, that is OK; just stop
		if (result.input.size() > 0)
			cerr << result.location() << ": syntax error" << endl;
		return false;
	}

The rest of the iterative calculator code is identical to the static calculator discussed in Primer.

Sample Session

Here is a sample interactive session with calci. Note how whitespace after semicolon is printed as a part of the next expression.

	1;
	1 = 1
	2;

	2 = 2
	1+1;

	1+1 = 2
	1+
	3+
	4+0
	;

	1+
	3+
	4+0
	 = 8


	1+1;


	1+1 = 2

Terminating our interactive session requires typing the end-of-file character (Control-D on many systems) immediately after an expression, to avoid trailing whitespace. We will use the echo command instead. In the following examples, the "-n" parameter is used to avoid the default new line at the end of the echo output.

	% echo "1+1;" | ./calc   # note calc, not calci
	1+1 = 2
	% echo "1+1;" | ./calci
	1+1 = 2
	near the end of input: syntax error
	% echo -n "1+1;" | ./calci
	1+1 = 2
	% echo -n "1+1;1+2+(3);" | ./calci
	1+1 = 2
	1+2+(3) = 6

HAPY PREFIX PARSING

Table of Contents

The Challenge

Interactive Calculator Example

Grammar

Processing Loop

Sample Session

Simple Versus Prefix Parsing