2007-07-01

I think my plan to post updates once a month has backfired. I have worked on a variety of things in June, but without the recurring process of posting content, I don't seem to finish anything. I might need to go back to a discipline of posting something, anything, twice a week to maintain progress. Anyway... here are a few things I have tinkered with in the past month that I might complete in July.

I played with Chicken Scheme quite a bit. I also started working on two documents that would be beneficial to others. I have an outline of a "C Programmer's Guide to embedding Chicken Scheme" which explains some of the issues I had to overcome. I also started working on a quick reference to Chicken Scheme functions and macros.

I started work on a document comparing different Scheme implementations and concentrating on their usefulness for embedding. My enthusiasm for that project has dimmed a bit. I just like Chicken's FFI better and I am willing to concede some performance to use it. Scheme performance is not my burning interest anyway. I am comfortable prototyping in the high-level language, then replacing the bottlenecks with C code.

I started working with the Ragel state machine compiler again. I see now that my early problems stemmed from not handling nondeterminism correctly. I was enamored by the prospect of embedding code into state transitions after each character. My patterns, however, tended to create DFAs that pursued multiple states at the same time and triggered incompatible embedded actions. I responded by trying to use Ragel priorities to force the DFA into "single-threaded" patterns, but that led to more and more complexity, so I gave up on it. When I came back to it last week, I had a break-through. To a large degree, I now use the Ragel start-actions and internal-actions to construct a token and confine all processing to the "pending out" transitions when the token is complete. Functionally, this is equivalent to the lex-yacc or flex-bison separations. For simple grammars like the chess PGN format, you can embed an effective parser into Ragel actions.

While testing the Ragel PGN parser, I did some analysis on games in the TWIC archive for another little research project. If I take a random game in the archive and select a random position past the first ten full moves, there is a 43% chance that a king will be on g1/g8 and about a 6.5% chance each that it will be on h1/h8, g2/g7, or e1/e8. On the other side, c1/c8 and b1/b8 combined barely exceeded 5%. King-side castling dominates recent (1998-2007) games.

I played around a little more with my Scheme-to-PDF-through-Quartz software. It is still too primitive to release to the public. I need to work on text-box handling and sizing.


I revised the website layout at the beginning of June. I skimmed the referrer logs for June today and was surprised to find 41 requests for my old ramble pages through various search engines. At some level, I thought I was basically casting my thoughts into the ether, never to be heard. Maybe not. I made links from the old URLs to the new topic index.

Hey! I just did a Google search on the string "UDF Media Reader" and my program gets three hits on the first page! Number one is the freshmeat.net entry, number two is taffysoft.com itself, and number eight is a reference by dpsmac.com. I remember doing a search a couple of months after I released the program and was disappointed that it did not show up in the first three pages. I guess all those spider tracks in my referrer logs have generated some word of mouth. Word of web? (It is not that I'm not seeking fame or anything like that, but I make tools, and I am happy when people can find them and use them.)


2007-07-16

I finally got around to implementing my Move List with Frequency Coding (MLFC) method of compressing chess positions and did some tests to compare it with the simpler alternatives. My data sets were mostly derived from the This Week In Chess (TWIC) archives 210 through 659. Each set marked TWIC-n below contains all non-duplicate positions from the TWIC archives at White's nth move. The R-2002 set contains the positions from Reinfeld's two "1001" quiz books.

Data Set Positions Average # pieces PLSK bits/pc PLAC bits/pc SLHC bits/pc MLFC bits/pc
TWIC-16 542884 27.249 9.134 6.971 5.581 4.544
TWIC-26 500453 21.988 9.150 7.216 6.154 5.468
R-2002 2002 21.637 9.160 7.245 6.282 5.287
TWIC-36 353310 17.215 9.197 7.562 6.955 6.396
TWIC-46 198005 13.516 9.261 7.995 8.025 7.574
TWIC-56 100808 10.832 9.331 8.495 9.341 8.941
TWIC-66 43274 9.191 9.354 8.946 10.56 10.15
TWIC-76 17610 8.171 9.368 9.315 11.59 11.14
TWIC-86 7103 7.578 9.382 9.565 12.33 11.83
TWIC-96 3161 7.155 9.406 9.761 12.94 12.42

PLSK is a Piece List with Sentinel Kings. It should use 9 bits per piece, but expands slightly when fitted to full bytes. PLAC is a Piece List with a form of run-length encoding. It uses a 24-bit header followed by 6-bit locations, so it performs better with more pieces. The Square List with Huffman Coding (SLHC) encodes a variable-length bit string for each square. It effectively has 64 bits of overhead (1 bit designating empty versus occupied squares) so its performance degrades steadily as the number of pieces declines. The Move List with Frequency Coding (MLFC) has a 52-bit overhead, but then only encodes pieces that have moved from their original locations. Its performance suffers as the position gets farther from the starting position, but its lower overhead makes it competitive with SLHC throughout the range of test positions.

My initial implementations of SLHC and MLFC had similar total time usage. After rewriting them to inline encoding/decoding partial bytes, SLHC was substantially faster. I ran timing tests on the TWIC-16 set with 542884 dense positions. Encoding SLHC is a simple linear path, but decoding it requires descent through a Huffman code tree. MLFC, on the other hand, is harder to encode but decodes pieces in larger chunks. I assume any given position will be encoded once and then decoded many times for various analyses, so this timing bias (and the space benefit) favors the use of MLFC.

Method Encode Time Decode Time Bytes
SLHC 3174 ms 3600 ms 12.5 MB
MLFC 4952 ms 3211 ms 10.6 MB

2007-08-05

My slow-updates policy just doesn't work. Without some form of external presentation, I fall back into my historical pattern of constantly tinkering with things without ever generating any useful content. I need to get back into the habit of posting something at least weekly. Let's start with a link to some VIM tips. VIM has such a cornucopia of options that it is hard to find something you seek, much less something you never imagined was there. It never occurred to me to look for the "scrolloff" option! Also, check out the color schemes!


2007-08-14

I have a Brother MFC-5460CN which is a combination printer, scanner, copier, and fax machine. Sometimes when I plug in the USB cable, I see an error dialog labelled MTW0005 with some text about interface problems. The dialog will not go away and stays on top of all other windows. Clicking on the OK button doesn't work. It is very annoying. The dialog is triggered by a failure of the OS X software to talk to the scanner portion of the device. The sad part of this is that the Brother device was never designed to work with the default software but requires its own custom software. (It was cheap.)

So how do you make that dialog go away? Try typing this line in the Terminal: "ps ax | grep ridge". It should return a line containing the words TWAINBridge.app. The number on the far left of that line is the process number. If that number were, for example, 18547, then entering the line "kill -9 18547" would terminate the TWAIN application and the error dialog would disappear.


2007-08-20

I tried to finish a programming project this weekend, any project, but was stymied by own scatterbrained mentality. I gave up after a while and followed the muse. I found myself drifting back to an old abandoned project. I have dabbled in learning Spanish for years. I have some Spanish text saved in a series of XML files created in 2004. This text is beyond my current vocabulary, but I can muddle through with frequent use of a bilingual dictionary. I have often wished for a software assistant that could recognize a click on a word or phrase, decipher the text under the mouse cursor, and pop up a dialog box giving a definition (possibly tuned to the specific usage) and cross references. In 2004 I had played with the user interface and was unsatisfied with the XML via XSLT to DHTML results. The state of the art has advanced significantly. I spent a few hours Sunday playing with the Yahoo User Interface (YUI) toolkit and found the whole experience is much friendlier now. The 2007 YUI panel control is more fun to work with than any 2004 CSS trick.

There are a few flaws which I haven't worked around yet. It would be nice to load the user-visible text by itself first, then load the definitions dynamically when the user clicks on them. The YUI Connection Manager really simplifies using XmlHttpRequest. (One flaw in the Connection Manager is the lack of support for local testing. File retrieval does not return 200 series success codes, so Connection Manager interprets that as failure. I wrote a couple of convenience functions specific to Safari.) The more difficult problem is character sets. For handling Spanish text, I must have accented vowels. Loading raw text or JSON dictionaries via XmlHttpRequest doesn't work because Safari does not decode UTF-8 text in that data. Loading XML or HTML required me to write JavaScript code to walk DOMs of the retrieved fragments and intelligently splice them into the DOM of the original document. What a pain. It is much simpler to predetermine all links and construct definition chunks in the original document.

In 2004, I had saved the Spanish text in XML with an XSLT style sheet that generated HTML+CSS. Mozilla could handle that back then. Some other browsers still cannot handle that today. I have since defected to the anti-XSLT camp, following the inspiration of Michael Leventhat and Hakon Lie. I like keeping data in XML formats for processing, then generating HTML+CSS+JS for interactive presentation. For ad hoc work, I have been using XML parsers in Python, but for larger scale processing, I need something faster. Fortunately, OS X ships with libxml 2.2 which is pretty simple to use.

The next step is to write software to tag my current text corpus with appropriate definitions. The raw text is over 1.9 megabytes, so building tags by hand would be ridiculous. I'm still at the planning stages for this. My first thought was to implement some type of lex/yacc processing, but I suspect that would approach the same level of work. I need a smarter plan. A little reading led me to Hidden Markov Models. It took a while to wrap my head around the Viterbi algorithm (the trellis diagrams that appear in most HMM texts are less than intuitive for me) but now I can envision a fairly simple C implementation. Now I need to figure out the Forward-Backward algorithm.

Links to investigate:


2007-08-22

The Finnish paper with pretty figures referenced a 1989 tutorial paper by Lawrence Rabiner which is available in PDF format.

Looking at the awful rendering of equations in some of the pages above and the mediocre quality of the Rabiner scans made me wonder how much could be done with HTML. Most systems have Unicode fonts with some Greek characters now, so you could render the simpler inline HMM equations, such as α1(i) = πibi(O1), without too much trouble, but more complex formulas are still out of reach. See also this page on typesetting equations in HTML+CSS. Some of Korpela's comments triggered the idea of a Lisp-based math typesetting system. This has a couple of obvious benefits. Equations structured as S-expressions can be examined and manipulated easily by other software. Also, lisp trees could be used to drive a variant of the graphical system I worked on earlier with Chicken Scheme, leading to better direct-to-PDF graphics. It would be a huge effort to produce something that could compete with TeX and I doubt it would garner much popular support, but the idea is very enticing. I suspect I will dabble with it later.

You might wonder why I would even consider competing with TeX. I like the idea of TeX, but it bugs me that it requires a completely separate software ecosystem. I understand the reasons for it. I know its history. It still bugs me. I guess it is the bizarre mirror image of my preference for Vim over Emacs. Emacs and TeX are the prima donnas that insist everything else should revolve around them. Even though Emacs Lisp should make it my preference, Vim's relative cognitive size fits me better. TeX lives in its own alien world and can only communicate through translators. It is a dot matrix program in a PostScript world. Knuth's focus seems to be on page-level typesetting. Document-level structure is an afterthought hacked together after the fact and it shows. The multiple passes required to do simple things seem, well, goofy. SVG embedded in something else (ODF?) would make much more sense. Also, I just don't like the textual part of TeX's Computer Modern fonts. This article claims that Lucida uses the same math symbols. Hmm, Wikipedia's article claims that outline CM fonts are preferred now.


2007-08-24

I did some reading about LaTeX and some of my criticisms are out of date. The description of LaTeXiT on this page is very interesting.


2007-08-27

I have been using hdiutil to create data DVD images from the command line ("/usr/bin/hdiutil makehybrid -o image.iso sourceDirectory") but that command recently stopped working if the image exceeded two gigabytes. I suspect something changed in the last OS X update. If I explicitly specify filesystems ("/usr/bin/hdiutil makehybrid -hfs -udf -o image.iso sourceDirectory"), it produces an image with no problems. There must be a problem in the ISO and/or Joliet filesystem software.


2007-09-03

An update for etymology nerds. Be sure to jot that down in the margins of your OED. That reference is missing from Etymology Online.


2007-09-16

I have been in a real slump lately. I decided to put away my other projects and try something different.I have some basic Python code for tracking an investment portfolio. It generates a list of current holdings and calculates capital gains when positions are closed. That was easy to port to Scheme. There was a problem I had never solved in the Python version, though: automatically handling wash sales. I thought maybe re-implementing the rest in Scheme would give me some new insight into that part, but it is still hideous. The bidirectional nature of the rule leaves me with a dilemma. Should I attempt to maintain all information necessary to identify retroactive wash sales at the point each position is closed? Then I would have to keep a set of extraneous data: positions opened and closed within the last thirty days, possibly altered by splits or symbol changes. The other option is to implement a second phase. After all basic transactions have been processed, scan thirty days backwards and forwards looking for equivalent purchases. This is probably safer, but picking through the splits and conversions is amazingly tedious. I keep thinking I am missing something obvious, but an elegant solution continues to elude me. Well, wash sales are a government rule after all. Perhaps I am setting my sights too high.