lixo.org

Encoding Detector 1.0 Released

Ever had nasty problems coming from multiple development environments set up differently on your team, with developers accidentally creating files with bad encodings? Are your users complaining about all that mojibake showing up on your internationalised UI?

Well, I had… plenty of times. So I created a tiny little tool to help fight that. Encoding Detector is an aptly named tool (if I do say so myself) that recursively detects the encoding of files in a project directory and errors out if anything seems fishy.

I’ve only ever used it with Ant, but it should be a breeze to set up and install on any project. All you need to do is call the main script against a directory, like:

$ python encoding-detector.py src


You’ll need a reasonably recent Python installed (anything that came out in the last 5 years should be OK), and that’s about it. Multiple directories can be passed as arguments, and fixing the errors is sometimes as easy as adding a few UTF-8 characters to the offending file to boost the detector’s confidence a bit.
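For the curious, here is a rough sketch of the general idea: walk one or more directories, ask a detector for its best guess on each file, and flag anything that isn't confidently in an allowed encoding. This is not the actual source of Encoding Detector. The function names, the allowed-encoding list and the 0.8 confidence threshold are all assumptions made up for illustration. The only real API it relies on is `chardet.detect`, which takes bytes and returns a dict with `encoding` and `confidence` keys.

```python
import os


def find_suspect_files(roots, detect, allowed=("utf-8", "ascii"), min_confidence=0.8):
    """Walk each root directory and collect files whose encoding looks fishy.

    `detect` is any callable that takes bytes and returns a dict with
    'encoding' and 'confidence' keys (chardet.detect has this shape).
    `allowed` and `min_confidence` are illustrative defaults, not values
    taken from the real tool.
    """
    suspects = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as handle:
                    data = handle.read()
                if not data:
                    continue  # an empty file gives the detector nothing to work with
                guess = detect(data)
                encoding = (guess.get("encoding") or "").lower()
                confidence = guess.get("confidence") or 0.0
                if encoding not in allowed or confidence < min_confidence:
                    suspects.append((path, encoding, confidence))
    return suspects


def main(argv):
    """Hypothetical entry point: returns 1 if anything looks wrong, else 0."""
    import chardet  # third-party dependency, e.g. `pip install chardet`

    suspects = find_suspect_files(argv[1:], chardet.detect)
    for path, encoding, confidence in suspects:
        print("%s: %s (confidence %.2f)" % (path, encoding or "unknown", confidence))
    return 1 if suspects else 0
```

Wiring this into a script would be a matter of calling `sys.exit(main(sys.argv))` at the bottom, so that a build tool such as Ant sees a non-zero exit code and fails the build. Passing the detector in as an argument is just a convenience for the sketch: it keeps the directory walk separate from chardet itself.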

This hack would definitely not be possible without Mark Pilgrim’s amazing chardet library. The code, as always, is on GitHub, and you can also grab a neat little tarball here.

It takes about 5 minutes to set up on your project, and you can thank me later ;)