Note to Self: Convert UTF-8 w/ BOM to ASCII (WIX + DB) using ICU uconv

This one took me a long time to work out, and it took a user of a non-Latin alphabet (Russian) to point me at the right tools. Yet again, I’m guilty of being a complacent anglophone.

I was producing a database installer project using WIX 3.5, and ran into all sorts of inexplicable problems, which I finally tracked down to the Byte Order Mark (BOM) on my SQL update files that I was importing into my MSI file. See here for more on that.

I discovered that the ‘varied’ toolset used in our dev environments (e.g. VS 2010, Cygwin, Vim, Git, SVN, NAnt, MSBuild, R#) meant that the update scripts had steadily diffused out into Unicode space. You can find out (approximately) what the encodings are for a directory of files using the file command, which guesses the type of each file from its contents. Here’s a selection of files that I was including in my installer:

$ file *
01.sql:          ASCII text, with CRLF line terminators
02.sql:          Little-endian UTF-16 Unicode text, with very long lines, with CRLF, CR line terminator
03.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
05.sql:          ASCII English text, with CRLF line terminators
06.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
11.sql:          ASCII C program text, with CRLF line terminators
12.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
23.sql:          ASCII text, with CRLF line terminators
24.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
25.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
26.sql:          ASCII text, with CRLF line terminators
27.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
28.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
29.sql:          Little-endian UTF-16 Unicode C program text, with very long lines, with CRLF, CR line
30.sql:          UTF-8 Unicode (with BOM) C program text, with very long lines, with CRLF line terminat
37.sql:          UTF-8 Unicode (with BOM) English text, with CRLF line terminators
38.sql:          Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
39.sql:          Little-endian UTF-16 Unicode text, with CRLF line terminators
44.sql:          UTF-8 Unicode (with BOM) text, with CRLF line terminators
AlwaysRun0001.sql: ASCII C program text, with CRLF line terminators
AlwaysRun0002.sql: UTF-8 Unicode (with BOM) C program text, with CRLF line terminators
TestData0001.sql:        UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators
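
Since file only gives a best guess, it can help to check for the BOM directly: a UTF-8 BOM is the three bytes EF BB BF at the very start of the file. Here is a minimal sketch of that check, assuming a POSIX shell with head, od and tr available (as they are under Cygwin). The names has_utf8_bom and list_bom_files are made up for illustration.

```shell
#!/bin/sh
# has_utf8_bom (hypothetical helper): succeeds if the file starts with
# the UTF-8 byte order mark, i.e. the bytes EF BB BF.
has_utf8_bom() {
    # Dump the first three bytes as hex and strip spaces and newlines.
    first_bytes=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$first_bytes" = "efbbbf" ]
}

# list_bom_files (hypothetical helper): print every .sql file in the
# current directory that carries a UTF-8 BOM.
list_bom_files() {
    for f in *.sql; do
        [ -f "$f" ] || continue
        if has_utf8_bom "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running list_bom_files in the scripts directory should print the same files that file reports as "UTF-8 Unicode (with BOM)". It does not detect the UTF-16 files, which start with a different signature.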

You can see that there appear to be a variety of encodings. I initially assumed that a quick run through d2u or u2d would fix them up, but those tools are about line endings, and they did nothing to change the encoding or remove the BOM. In the end I found the uconv command from IBM’s ICU project, which has the handy ‘--remove-signature’ option that was the key to the solution. Don’t confuse this with the GNU iconv app, which has no equivalent option for stripping the BOM from the front of your files.

$ uconv --remove-signature -t ASCII TestData0001.sql > TestData0001.sql2
$ rm TestData0001.sql
$ mv TestData0001.sql2 TestData0001.sql
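
With a whole directory of scripts, the three commands above can be wrapped in a loop. This is only a sketch, not what I actually ran: convert_bom_files is a made-up helper name, it assumes every BOM-prefixed file is UTF-8 whose remaining content is plain ASCII, and it assumes uconv exits with a non-zero status when a conversion fails. The input encoding is named explicitly with -f rather than relying on the platform default.

```shell
#!/bin/sh
# convert_bom_files (hypothetical helper): for each file given, strip a
# leading UTF-8 BOM and convert the content to ASCII with ICU's uconv.
# Files that do not start with EF BB BF are left untouched.
convert_bom_files() {
    for f in "$@"; do
        # Look at the first three bytes to decide whether to convert.
        sig=$(head -c 3 "$f" | od -An -tx1 | tr -d ' \n')
        [ "$sig" = "efbbbf" ] || continue

        # Write to a temporary file and replace the original only if
        # uconv reports success (assumption: non-zero exit on failure).
        if uconv -f UTF-8 -t ASCII --remove-signature "$f" > "$f.tmp"; then
            mv "$f.tmp" "$f"
        else
            rm -f "$f.tmp"
            echo "skipped $f: conversion failed" >&2
        fi
    done
}
```

Called as convert_bom_files *.sql, this leaves the plain ASCII scripts alone. It also skips the UTF-16 files, which would need their own pass with a different -f value (probably -f UTF-16, though I have not tested that here).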

After that, the WIX installer worked OK, and all was right with the world. I hope this helps you if you run into the same problem.

I can’t answer the question of why WIX/MSI fails to work with non-ASCII files (other than to say that Unicode blindness is a common problem in software written by Anglophones).
