2007-04-07

I am still learning how to use the Ragel compiler. I have been trying to create a DFA for a particular language. If I can write the parser as a pure DFA, it should be ridiculously fast. I stalled on one problem for a while though and I thought I would have to fall back to a longest-match type scanner.

I have three character sets A, B, C and five patterns I need to match: ABC, ABBC, ACBC, ABCBC, and BC. Note that they all end in BC. Matching BC and ACBC are easy: they have unique prefixes. The AB* patterns are a headache. It is easy to write machines that match two patterns but miss the third. You can write a pattern that matches ABC by itself but catches only the first three items in ABCBC, allowing the suffix to be matched by the BC machine. Two obvious solutions using the '?' operator fail: "AB?C?BC" and "A((CBC)|(BC(BC)?))".

I think there are two solutions. One is to create a 'garbage' class X which contains any character not in B or C. Match A((CBC)|(B((BC)|(C((BC)|X))))) and attach a user action "fhold;" to X so that that character is kept in the input stream to be matched by the next pattern. I bet that would work, but I haven't tested it. When I compiled that machine, it caused a combinatorial explosion of states and transitions. Ragel's C output went from 150K to 8.5M and attempting to compile that monstrosity with gcc caused my machine to thrash until I "kill -9"ed the compiler. Not good.

So I studied the manual some more and noticed a couple of odd character classes, "zlen" and "empty". They mean basically the same thing: a zero-length string. That's an odd concept, isn't it, a character class consisting of the absence of a character? It is exactly what I needed here, though. Insert either one into the pattern above in place of X and it will match all four A-prefix variants and it does not need the "fhold;" action.