2007-04-02

I wanted to add a reentrant parser to some library code and the version of flex that comes with OS X 10.4 will not produce reentrant functions. Instead of installing an incompatible version of flex, I decided to try out a different tool, the Ragel State Machine Compiler. It is interesting, but the similarity of its syntax to that of flex has caused me some headaches. The manual packs a lot of ideas into 51 pages and it is hard to see the consequences of some of those concepts until you try to write source.

One thing that stalled my progress for a couple of hours was the appearance that "[0-9]+" would not match an indefinitely long sequence of numbers. I almost had myself convinced that the program was broken, but several readings of Chapter 4 cleared up the confusion. Familiar "longest match" regular expression engines such as those in flex and Perl use a nondeterministic finite automaton (NFA) to keep track of their state. If you feed them an ambiguous pattern, they create a "Schroedinger's Cat" state, a state in which you may be in state 1 or you may be in state 2, and you will decide which state it was later. The Ragel compiler does that too, but since the program semantics are implemented as side effects of code embedded into state transitions, and the dummy states encompass multiple states, it can be a struggle to design the patterns and actions to match your intentions. For example, consider the following code:

	number = [0-9]+
		>{ numval = 0; }
		${ numval = numval*10 + fc - '0'; }
		%{ printf( "[%d]\n", numval ); }
		;

Ragel source consists of regular expression elements mixed with actions. The first line means that "number" matches one or more digit characters. The second line means that when you read the first digit, preset numval to 0. The third line means that for every digit character, including the first, recalculate numval to match the text seen so far. The fourth line prints the result. So what happens? If you feed it the string "1234", it would print out [12][3][4]! Manual page 27 identifies the problem. Ragel cannot tell if the third digit is extending the pattern or starting a new one, so it uses a dummy state which includes all possible actions, which include ending the first number and starting a new one. It is the action code side effects that give the appearance that '+' is a non-greedy operator!

So how do you write a repetitive character consumer? Use the "finish-guarded concatenation" operator from manual section 4.2.2:

	number = ([0-9] ${ numval=fc-'0'; } :>> [0-9]* ${ numval=10*numval + fc - '0'; })
		%{ printf( "[%d]\n", numval ); }
		;

The ":>>" operator tells Ragel in effect, "if you have a choice between adding the next digit to the current pattern or going back to the beginning and starting a new pattern, choose the extension." So this should print [1234]. I use that operator often.


I need a syntax coloring routine for my code samples. That sounds like a project for the Ragel compiler! Too bad I don't have time right now.