2007-01-07

I finally spent some time with SQLite3. I have a 182M file consisting of 746044 lines of tab-separated values. I wrote some code to load those fields into a SQL table and did some timing tests. Here are some observations when programming bulk data loads:

1. Use explicit commits on a large scale. On my first attempt, I made no attempt to use BEGIN...COMMIT blocks, so SQLite3 used autocommit mode (one transation per statement). This is painfully slow. I let the program run for maybe twenty minutes, killed it with Control-C, and found it had only loaded 5572 rows. After re-writing the program to do twenty thousand inserts inside BEGIN...COMMIT pairs, the program loaded 746044 rows in 94 seconds. What a difference!

2. Use sqlite3_prepare(). Although sqlite3_mprintf() is convenient, compiling the same INSERT 746044 imposed significant overhead. The sqlite3_prepare() version cut the run time from 94 seconds to 58 seconds. The overhead to read the input file and split each line into fields was approximately 10 seconds, so the compiler step added (84-48)/48 = 75% to the run time for simple INSERT commands.

3. Use the natural page size for your architecture! When I switched from SQLite3's default 1K pages to 4K pages on my G4 Powerbook, the times dropped to 76 seconds (21%) for sqlite3_mprintf() and 41 seconds (35%) for sqlite3_prepare().

4. Load data first, then index it. I tried creating two indexes on the table. One index covered a large variable field (83.6MB) and one covered a small fixed-length field (11.2MB). If I created the indexes first and then loaded the data, it took an average of 148 seconds. If I loaded the data, then indexed it, it took 112 seconds (43 seconds load, 50 seconds big index, 19 seconds small index). I guess page locality outweighs I/O delays of later indexing.