March 27, 2024
MarkBernstein.org
 

Refactoring Actions

For the past two weeks, I’ve been dividing a good deal of time to refactoring the classes that let Tinderbox parse actions. It’s been a long haul.

The Parser Class: 2001

The core of the Tinderbox action parser remains Parser, a utility class that dates to very early Tinderbox. Parser collects a bunch of functions that are useful for working with strings and character streams; for example, Parser::Trim() is a static function that trims whitespace from the front and back of a string. It’s used in more than a hundred places in Tinderbox. About half of the class is this bundle of static convenience functions, handy but not very consequential. The rest of the class deals with various chores involved in parsing a character stream. For example, there’s QuoteParser, which reads a quoted string up to the next quotation mark. This gets a little trickier than it sounds because there are questions like escaped quotes, quotes inside quotes, comments, and such.

The problem with Parser was that it was all built around a character stream — an istream — and every client knew it. That was OK in 2001, because there wasn’t a plausible alternative. Of course, actions will be written as characters! Remember, the original Tinderbox was written for System 9/Carbon; there was no other way to represent strings. It was reasonable, back then, to expect that attribute names would use ASCII. Chinese or Thai? No way.

Nowadays, you’d like the option to parse UTF-8 strings and streams. In particular, Tinderbox now recognizes \u2028 (Line Separator: ctrl-Return) as an alternative line break for use when [Return] is unwanted. Recognizing multibyte UTF-8 code points in a byte stream is an ugly mess, and that’s only going to get worse over time. It would be much cleaner, nowadays, to use a proper unicode string as the backing store.

Indecent Exposure

The problem was, dozens of classes — some of them complex — knew that Parser used an istream and felt free to manipulate that istream themselves. This is the problem of indecent exposure: Parser allowed its clients to know about Parser’s internals, and even to mess around with them. So that was the first part of the refactoring: replace the concrete istream with an abstract storage class ParserStorage. ParserStore needs to provide the same functionality as the istream, but it doesn't let anyone know what’s inside or how ParserStorage works. Then, class by class, I rewrote things to use a ParserStorage object.

Building Strings

Parser sometimes builds strings, such as lists of arguments. Those, too, had been built from characters. Now, though, ParserStorage offered ways to get unichars — full unicode characters. And you can’t just put a unichar into a character stream.

Here, again, we hide the implementation inside an abstract object, TbxStringBuilder, which does the things that we formerly did with the output streams. One by one, I replaced the old ostringstreams with TbxStringBuilder. Once again, hundreds of changes, but each isolated to one place. (The 2000-test unit test suite is fairly strong for the action language, both because actions are easy to test and because actions are so handy for testing the rest of Tinderbox: if something goes wrong with the action parser, a bunch of tests will fail right quick.

Keeping A Retreat Path Open

Of course, this was an opportunity to tidy and improve lots of code not immediately connected to the refactoring. I was not entirely sure the refactoring would work, but I would hate to lose all the other changes by simply reverting to the start. So, I took care that, if things went sour, I could return to the old implementation, now sheltered behind the new abstract classes. That was a huge help. It does mean that Tinderbox now carries along a few duplicate methods: several parsers have Token() to handle modern streams and LegacyToken() to handle character streams. The old methods do little harm, and we can jettison them once everything is happy.